#  Salary Prediction 

This notebook serves as the entry point for our salary prediction pipeline.

It reads the input dataset, performs cleaning and transformation, trains a model, and evaluates its performance.
The core idea is to build a simple but modular pipeline that is flexible enough to incorporate additional features and models.

---

##  Table of Contents

1. [Imports](#imports)  
2. [EDA – Raw Data Check](#eda-raw)  
3. [EDA – Visual Exploration](#eda-visual)  
4. [Preprocessing](#preprocessing)  
5. [Feature Transformation](#features)  
6. [Model Training & Evaluation](#model)
7. [Predicting on New Input Data](#7-predicting-on-new-input-data)


###  1. Imports

This cell imports the core libraries (pandas, NumPy, Matplotlib, Seaborn) and our own modular code from `src/`.


In [None]:
import sys
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
import os


from src.preprocessing import prepare_data
from src import eda
from src.features import transform_features
from src.model import split_data, train_model, evaluate_model
from src.predict import predict_from_csv



###  2. EDA – Raw Data Check

Before running the pipeline, we quickly check for:
- Consistency between the two CSV files (people & salary)
- Null values
- General data structure

This is optional but helps ensure the data looks clean enough to proceed.
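
As a rough illustration of what these checks involve, here is a minimal sketch on toy data. It does not reproduce the helpers in `src/eda.py`, whose internals are not shown in this notebook; the DataFrames and their values below are made up, and only the `id` and `Salary` column names are taken from the cells in this notebook.

```python
import pandas as pd

# Toy stand-ins for people.csv and salary.csv (values are made up for illustration).
people = pd.DataFrame({"id": [1, 2, 3, 4], "Age": [30, None, 45, 28]})
salary = pd.DataFrame({"id": [1, 2, 3, 5], "Salary": [50000.0, 62000.0, None, 48000.0]})

# IDs that appear in one file but not in the other.
ids_people = set(people["id"])
ids_salary = set(salary["id"])
only_in_people = sorted(ids_people - ids_salary)
only_in_salary = sorted(ids_salary - ids_people)

# Null count per column in the people table.
nulls_per_column = people.isna().sum()

# Rows with at least one null after a left merge on "id".
merged = people.merge(salary, on="id", how="left")
rows_with_any_null = int(merged.isna().any(axis=1).sum())

print("Only in people:", only_in_people)
print("Only in salary:", only_in_salary)
print(nulls_per_column)
print("Rows with any null:", rows_with_any_null)
```

Note that with a left merge, an `id` missing from the salary table shows up as a null `Salary`, so it is counted among the rows with nulls.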


In [None]:
pd.set_option('display.max_rows', None)  
pd.set_option('display.max_columns', None)  
pd.set_option('display.width', None)  
pd.set_option('display.expand_frame_repr', False)  

#Load raw data (people and salary info)
df_people = pd.read_csv("data/people.csv")
df_salary = pd.read_csv("data/salary.csv")

#Check consistency in both df
eda.check_id_consistency(df_people, df_salary)

#Check how many nulls are present in people.csv
eda.check_nulls(df_people, name="people.csv")


df_merged = df_people.merge(df_salary, on="id", how="left")

# Check how many rows have a null Salary
eda.count_salary_nulls(df_merged)

# Count and display rows that have at least one null value
eda.count_rows_with_any_null(df_merged, name="merged df")

#Print shape, types and head of the merged dataset
eda.print_df_overview(df_merged, name="merged df")



###  3. EDA – Visual Exploration

Here we look at the distribution of variables like Age, Salary, and Years of Experience.

We also count how many job titles appear more often than a given threshold.  
This helps us decide which ones to group under "Other".


In [None]:
#Reload data for visualization purposes
df_people = pd.read_csv("data/people.csv")
df_salary = pd.read_csv("data/salary.csv")

df_merged_exp = df_people.merge(df_salary, on="id", how="left")


# Drop null rows and apply log transform to salary (based on earlier EDA)
df_clean_exp = df_merged_exp.dropna().copy()
df_clean_exp["Salary_log"] = np.log(df_clean_exp["Salary"])

#Plotting different data distributions.
eda.plot_distributions(df_clean_exp)
eda.count_job_titles(df_clean_exp, threshold=6)



### 4. Preprocessing

This step loads and merges both datasets, removes rows with nulls, and adds a log-transformed Salary column.

Everything here is done through `prepare_data()` inside `src/preprocessing.py`.
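
For readers without access to `src/preprocessing.py`, the three steps can be sketched roughly as follows. This is an assumption-laden stand-in, not the actual `prepare_data()`: the function name `prepare_data_sketch` is hypothetical, it takes DataFrames rather than file paths, and the `Salary_log` column name is borrowed from the visual exploration cell.

```python
import numpy as np
import pandas as pd


def prepare_data_sketch(df_people, df_salary):
    """Hypothetical stand-in for prepare_data(): merge, drop nulls, add log salary."""
    # Left merge keeps every person, even those without a salary record.
    df = df_people.merge(df_salary, on="id", how="left")
    # Drop any row with at least one null value.
    df = df.dropna().copy()
    # Natural log of salary; assumes all remaining salaries are positive.
    df["Salary_log"] = np.log(df["Salary"])
    return df.reset_index(drop=True)


# Toy data (made up): one complete row, one null salary, one null age.
toy_people = pd.DataFrame({"id": [1, 2, 3], "Age": [30.0, 40.0, None]})
toy_salary = pd.DataFrame({"id": [1, 2, 3], "Salary": [50000.0, None, 70000.0]})

toy_clean = prepare_data_sketch(toy_people, toy_salary)
print(toy_clean)
```

Only the row with `id == 1` survives here, since the other two each contain a null.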


In [None]:
# Load and clean data (null removal + log transform on Salary)

df_clean = prepare_data("data/people.csv", "data/salary.csv")
#df_clean.head()


###  5. Feature Transformation

We apply:
- One-hot encoding to `Education Level`
- Job Title grouping (threshold-based)

Handled by `transform_features()` inside `src/features.py`.
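
The two transformations can be illustrated on a toy frame as below. This is only a sketch under assumptions: `group_rare_titles` is a hypothetical helper, the data is made up, and the real `transform_features()` may differ (for example in how it encodes the grouped job titles or labels the catch-all group).

```python
import pandas as pd


def group_rare_titles(titles, threshold):
    """Replace titles that occur fewer than `threshold` times with "Other"."""
    counts = titles.value_counts()
    keep = counts[counts >= threshold].index
    # Keep values where the condition holds, replace the rest with "Other".
    return titles.where(titles.isin(keep), "Other")


# Toy data (made up for illustration).
toy = pd.DataFrame({
    "Job Title": ["Engineer", "Engineer", "Engineer", "Analyst", "Chef"],
    "Education Level": ["Bachelor's", "Master's", "PhD", "Bachelor's", "Master's"],
})

toy["Job Title"] = group_rare_titles(toy["Job Title"], threshold=3)

# One-hot encode Education Level into one indicator column per level.
encoded = pd.get_dummies(toy, columns=["Education Level"])
print(encoded)
```

With a threshold of 3, only "Engineer" is frequent enough to be kept; "Analyst" and "Chef" are both mapped to "Other".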


In [None]:

job_title_threshold = 3  #Minimum count to keep job title (else grouped as "other")
X, y = transform_features(df_clean, job_threshold=job_title_threshold)



###  6. Model Training & Evaluation

We train a Linear Regression model and evaluate its performance using:
- MAE & RMSE
- 95% confidence intervals (via bootstrap)
- Comparison with a DummyRegressor

All metrics are printed.
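
To make the bootstrap step concrete, here is a minimal NumPy-only sketch of a percentile bootstrap confidence interval for MAE and RMSE. It is not the code inside `evaluate_model()`: the function names, the number of resamples, and the synthetic targets are all assumptions, and a constant prediction equal to the training mean stands in for the DummyRegressor baseline.

```python
import numpy as np


def mae(y_true, y_pred):
    """Mean absolute error."""
    return float(np.mean(np.abs(y_true - y_pred)))


def rmse(y_true, y_pred):
    """Root mean squared error."""
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))


def bootstrap_ci(y_true, y_pred, metric, n_boot=1000, alpha=0.05, seed=0):
    """Percentile bootstrap CI: resample test rows with replacement."""
    rng = np.random.default_rng(seed)
    n = len(y_true)
    stats = np.empty(n_boot)
    for b in range(n_boot):
        idx = rng.integers(0, n, size=n)  # indices drawn with replacement
        stats[b] = metric(y_true[idx], y_pred[idx])
    lower, upper = np.percentile(stats, [100 * alpha / 2, 100 * (1 - alpha / 2)])
    return float(lower), float(upper)


# Synthetic targets (made up): stand-ins for log salaries.
rng = np.random.default_rng(42)
y_train_toy = rng.normal(loc=11.0, scale=0.5, size=800)
y_test_toy = rng.normal(loc=11.0, scale=0.5, size=200)
y_pred_toy = y_test_toy + rng.normal(scale=0.1, size=200)  # pretend model output
y_base_toy = np.full(y_test_toy.shape, y_train_toy.mean())  # mean-only baseline

mae_model = mae(y_test_toy, y_pred_toy)
mae_base = mae(y_test_toy, y_base_toy)
mae_lo, mae_hi = bootstrap_ci(y_test_toy, y_pred_toy, mae)
rmse_lo, rmse_hi = bootstrap_ci(y_test_toy, y_pred_toy, rmse)

print(f"Model MAE: {mae_model:.3f} (95% CI {mae_lo:.3f} to {mae_hi:.3f})")
print(f"Model RMSE 95% CI: {rmse_lo:.3f} to {rmse_hi:.3f}")
print(f"Baseline MAE: {mae_base:.3f}")
```

A model is only worth keeping if its error is clearly below that of the baseline, ideally with a confidence interval that does not overlap the baseline's error.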

In [None]:

#Split data into training and testing sets
X_train, X_test, y_train, y_test = split_data(X, y, test_size=0.2, random_state=42)

#Train linear regression model
model = train_model(X_train, y_train)

# Evaluate performance with MAE, RMSE and 95% CI (with bootstrap)
evaluate_model(model, X_test, y_test)


### 7. Predicting on New Input Data

If a CSV file containing new records is found (named `predict_sample.csv`), we apply the trained model to generate predicted salaries.

This lets the user estimate salaries for different values of the independent variables.

If the file is not found, the block is skipped safely.


In [None]:

prediction_file = "data/predict_sample.csv"                     

# Check if file exists
if os.path.exists(prediction_file):
    print("Prediction file found. Running predictions\n")

    new_predictions = predict_from_csv(prediction_file, model, job_threshold=job_title_threshold)

    # Show results
    print("Predicted salaries for new input:\n")
    print(new_predictions[["Age", "Education Level", "Job Title", "Years of Experience", "Predicted Salary"]])

else:
    print(f"File '{prediction_file}' not found. Skipping prediction block.")



