## Linear regression exercises
We will use the [Kaggle dataset about gender pay gap](https://www.kaggle.com/datasets/mohithsairamreddy/salary-data?resource=download).
In Week 1, we learned how to open a Kaggle dataset.
Perform the necessary EDA steps, then fit a meaningful linear regression model and interpret its results.


In [72]:
# Import necessary libraries

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import sklearn
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
import statsmodels.api as sm

In [73]:
import kagglehub

# Download latest version
path = kagglehub.dataset_download("mohithsairamreddy/salary-data")

print("Path to dataset files:", path)

Path to dataset files: /home/cgraiff/.cache/kagglehub/datasets/mohithsairamreddy/salary-data/versions/4


In [74]:
# Open your filepath as done last time and visualize it as a df
df.head()

Unnamed: 0,Age,Gender,Education Level,Job Title,Years of Experience,Salary,Male,C_Age
0,32.0,Male,Bachelor's,Software Engineer,5.0,90.0,True,-1.620859
1,28.0,Female,Master's,Data Analyst,3.0,65.0,False,-5.620859
2,45.0,Male,PhD,Senior Manager,15.0,150.0,True,11.379141
3,36.0,Female,Bachelor's,Sales Associate,7.0,60.0,False,2.379141
4,52.0,Male,Master's,Director,20.0,200.0,True,18.379141


### Preprocessing
Some hints for data cleaning (source: [this tutorial](https://medium.com/@evelyn.eve.9512/gender-pay-gap-comparisons-with-regression-analysis-45223cd3ed13))
<br> <br>
1. `pd.get_dummies()`: linear regression needs numerical variables, and this function converts categorical variables into them. It creates one column per category and assigns 1 (or `True`) where the row belongs to that category and 0 (or `False`) where it does not.
`Gender` has only two values in this dataset, so a single column is enough. We call it `Male`, with True = 1 and False = 0.

In [75]:
df['Male'] = pd.get_dummies(df['Gender'], drop_first=True)['Male']
df.head()

Unnamed: 0,Age,Gender,Education Level,Job Title,Years of Experience,Salary,Male,C_Age
0,32.0,Male,Bachelor's,Software Engineer,5.0,90.0,True,-1.620859
1,28.0,Female,Master's,Data Analyst,3.0,65.0,False,-5.620859
2,45.0,Male,PhD,Senior Manager,15.0,150.0,True,11.379141
3,36.0,Female,Bachelor's,Sales Associate,7.0,60.0,False,2.379141
4,52.0,Male,Master's,Director,20.0,200.0,True,18.379141
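
To see what the encoding does in isolation, here is a minimal sketch on a toy frame. The four rows are made up for illustration and are not taken from the Kaggle file; `toy`, `dummies`, and `Male_int` are hypothetical names.

```python
import pandas as pd

# Toy data (hypothetical, not from the Kaggle dataset)
toy = pd.DataFrame({"Gender": ["Male", "Female", "Female", "Male"]})

# Without drop_first: one indicator column per category
dummies = pd.get_dummies(toy["Gender"])
print(dummies)

# With drop_first=True the first category in sorted order ('Female')
# is dropped, so only the 'Male' indicator column remains
toy["Male"] = pd.get_dummies(toy["Gender"], drop_first=True)["Male"]

# Optional: cast to int if you prefer 0/1 over True/False
toy["Male_int"] = toy["Male"].astype(int)
print(toy)
```

Dropping one category avoids carrying two columns that always sum to 1, which would be redundant next to the intercept of a linear model.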


2. It makes more sense to express age as its difference from the mean age (centering), because age = 0 is not meaningful **in this specific analysis**. After centering, the intercept is the predicted salary at the mean age rather than at age 0.
> The previous step is necessary, because linear regression needs numerical values. This one is optional and should be evaluated depending on your needs.

In [76]:
df['C_Age'] = df['Age'] - df['Age'].mean()
df.head()

Unnamed: 0,Age,Gender,Education Level,Job Title,Years of Experience,Salary,Male,C_Age
0,32.0,Male,Bachelor's,Software Engineer,5.0,90.0,True,-1.623022
1,28.0,Female,Master's,Data Analyst,3.0,65.0,False,-5.623022
2,45.0,Male,PhD,Senior Manager,15.0,150.0,True,11.376978
3,36.0,Female,Bachelor's,Sales Associate,7.0,60.0,False,2.376978
4,52.0,Male,Master's,Director,20.0,200.0,True,18.376978
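
The effect of centering can be checked on synthetic data. The sketch below assumes made-up ages and salaries (not the Kaggle values) and uses `np.polyfit` as a quick least-squares fit. It illustrates that centering leaves the slope unchanged and only moves the intercept.

```python
import numpy as np

# Synthetic data (assumption: a linear trend plus noise, for illustration only)
rng = np.random.default_rng(0)
age = rng.uniform(22, 60, size=200)
salary = 20 + 2.5 * age + rng.normal(0, 10, size=200)

# Fit on the raw age: np.polyfit returns [slope, intercept] for degree 1
slope_raw, intercept_raw = np.polyfit(age, salary, 1)

# Fit on the centered age
c_age = age - age.mean()
slope_c, intercept_c = np.polyfit(c_age, salary, 1)

print("slope (raw, centered):", slope_raw, slope_c)
print("intercept raw (salary at age 0):", intercept_raw)
print("intercept centered (salary at mean age):", intercept_c)
```

With a centered predictor, the intercept equals the mean of the response, which is usually easier to interpret than a prediction at age 0.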


3. We can divide the salary by 1000 to make it easier to read and plot by avoiding huge numbers.
> Also not necessary! If you do it, run the cell only once: every re-run divides the salary by 1000 again.

In [77]:
df['Salary'] = df['Salary'] / 1000
df.head()

Unnamed: 0,Age,Gender,Education Level,Job Title,Years of Experience,Salary,Male,C_Age
0,32.0,Male,Bachelor's,Software Engineer,5.0,0.09,True,-1.623022
1,28.0,Female,Master's,Data Analyst,3.0,0.065,False,-5.623022
2,45.0,Male,PhD,Senior Manager,15.0,0.15,True,11.376978
3,36.0,Female,Bachelor's,Sales Associate,7.0,0.06,False,2.376978
4,52.0,Male,Master's,Director,20.0,0.2,True,18.376978


Remember to check for empty values, and **drop them** or **replace them with the mean**, depending on how many there are and how meaningful the mean is.

In [78]:
# Let's check for empty values
# (print() is needed here: a notebook cell only displays its last expression)
print(df.isna().sum())
df.head()

Unnamed: 0,Age,Gender,Education Level,Job Title,Years of Experience,Salary,Male,C_Age
0,32.0,Male,Bachelor's,Software Engineer,5.0,0.09,True,-1.623022
1,28.0,Female,Master's,Data Analyst,3.0,0.065,False,-5.623022
2,45.0,Male,PhD,Senior Manager,15.0,0.15,True,11.376978
3,36.0,Female,Bachelor's,Sales Associate,7.0,0.06,False,2.376978
4,52.0,Male,Master's,Director,20.0,0.2,True,18.376978


In [79]:
# We can drop them
df.dropna(inplace=True)
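
The two options for missing values can be compared side by side. This is a minimal sketch on a hypothetical four-row frame, not on the real data; `toy`, `dropped`, and `filled` are illustrative names.

```python
import numpy as np
import pandas as pd

# Toy frame with one missing value per column (hypothetical values)
toy = pd.DataFrame({
    "Age": [32.0, np.nan, 45.0, 36.0],
    "Salary": [90.0, 65.0, np.nan, 60.0],
})

# Count missing values per column
missing_counts = toy.isna().sum()
print(missing_counts)

# Option 1: drop every row that contains at least one missing value
dropped = toy.dropna()

# Option 2: replace missing values with the column mean
filled = toy.fillna(toy.mean(numeric_only=True))

print(len(toy), "rows originally,", len(dropped), "after dropping")
print(filled)
```

Dropping is simplest when only a few rows are affected. Filling with the mean keeps the rows but shrinks the variance of the column, so it is worth checking how many values would be imputed.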

In [80]:
X = df[["Age"]]
y = df["Salary"]

In [81]:
# Sanity check
X.isna().sum()

Age    0
dtype: int64

In [82]:
# Split the dataset into training and test sets

In [83]:
# Fit the model


In [84]:
# Display the model parameters


In [85]:
# Predict values for the test set

# Get residuals
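
If you are unsure how the steps above fit together, the following sketch walks through them on synthetic data. The ages, salaries, `test_size=0.2`, and `random_state=42` are assumptions chosen for illustration; replace `toy` with your own `df` when solving the exercise.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the salary data (not the Kaggle values)
rng = np.random.default_rng(42)
age = rng.uniform(22, 60, size=300)
toy = pd.DataFrame({"Age": age,
                    "Salary": 10 + 3.0 * age + rng.normal(0, 15, size=300)})

X = toy[["Age"]]   # 2D: scikit-learn expects a feature matrix
y = toy["Salary"]  # 1D target

# Hold out 20% of the rows for testing
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Fit the model on the training set only
model = LinearRegression()
model.fit(X_train, y_train)

# Model parameters
print("Intercept:", model.intercept_)
print("Slope for Age:", model.coef_[0])

# Predictions and residuals on the test set
y_pred = model.predict(X_test)
residuals = y_test - y_pred

# Evaluation on unseen data
r2 = r2_score(y_test, y_pred)
mse = mean_squared_error(y_test, y_pred)
print("R² on test set:", r2)
print("MSE on test set:", mse)
```

Evaluating on the held-out test set gives a more honest estimate of how the model behaves on new data than evaluating on the rows it was fitted to.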

In [86]:
# Plot the linear regression line

In [87]:
# Calculate R²
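
As a reminder of what R² measures, it can be computed by hand as 1 - SS_res / SS_tot and shown next to the fitted line. The sketch below is one possible way to do this on synthetic data (not the Kaggle values). It computes R² on the same points used for the fit, whereas in the exercise you would normally use the test set.

```python
import matplotlib.pyplot as plt
import numpy as np

# Synthetic data for illustration only
rng = np.random.default_rng(1)
x = rng.uniform(22, 60, size=100)
y = 10 + 3.0 * x + rng.normal(0, 15, size=100)

# Least-squares line: np.polyfit returns [slope, intercept] for degree 1
slope, intercept = np.polyfit(x, y, 1)
y_hat = intercept + slope * x

# R² = 1 - (residual sum of squares) / (total sum of squares)
ss_res = np.sum((y - y_hat) ** 2)
ss_tot = np.sum((y - y.mean()) ** 2)
r2 = 1 - ss_res / ss_tot

# Scatter plot of the observations with the fitted line on top
fig, ax = plt.subplots()
ax.scatter(x, y, alpha=0.5, label="observations")
x_line = np.linspace(x.min(), x.max(), 100)
ax.plot(x_line, intercept + slope * x_line, color="red", label="fitted line")
ax.set_xlabel("Age")
ax.set_ylabel("Salary")
ax.set_title(f"Simple linear regression (R² = {r2:.2f})")
ax.legend()
```

For a simple linear regression evaluated on its own training points, this R² equals the squared Pearson correlation between `x` and `y`.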


### Homework (partially taken from Chapter 3 of "An Introduction to Statistical Learning with Applications in Python")
Fit a simple linear regression model between Age and Salary, and a multiple linear regression model using at least three variables of your choice among Age, Gender, Years of Experience, and Salary (it is multiple, after all!). <br><br> Follow last week's tutorial: first perform some EDA, then fit the model, evaluate it, and plot the results. <br><br>
**For each model, answer the following questions:**
- Is there a relationship between the predictor(s) and the response?
- How strong is the relationship between the predictor(s) and the response? Report the value of each metric and (briefly) explain what it means.
- What is the predicted salary associated with the age of 30?
- What are the associated 95% confidence and prediction intervals?