In [25]:
# Import required libraries

import pandas as pd
import numpy as np
import plotly.express as px
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeRegressor, DecisionTreeClassifier
from sklearn.metrics import mean_squared_error, accuracy_score

In [26]:
df = pd.read_csv('Salary_Data.csv')

In [27]:
df.head()

    Age  Gender  Education Level          Job Title  Years of Experience    Salary
0  32.0    Male       Bachelor's  Software Engineer                  5.0   90000.0
1  28.0  Female         Master's       Data Analyst                  3.0   65000.0
2  45.0    Male              PhD     Senior Manager                 15.0  150000.0
3  36.0  Female       Bachelor's    Sales Associate                  7.0   60000.0
4  52.0    Male         Master's           Director                 20.0  200000.0


In [28]:
df.isnull().sum()

Age                    2
Gender                 2
Education Level        3
Job Title              2
Years of Experience    3
Salary                 5
dtype: int64

In [29]:
df = df.dropna()

In [30]:
df.isnull().sum()

Age                    0
Gender                 0
Education Level        0
Job Title              0
Years of Experience    0
Salary                 0
dtype: int64

In [31]:
df.describe()

             Age  Years of Experience         Salary
count     6698.0               6698.0         6698.0
mean   33.623022             8.095178  115329.253061
std     7.615784             6.060291   52789.792507
min         21.0                  0.0          350.0
25%         28.0                  3.0        70000.0
50%         32.0                  7.0       115000.0
75%         38.0                 12.0       160000.0
max         62.0                 34.0       250000.0


In [32]:
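# Preliminary feature/target split on the raw frame. It still contains string
# columns, so it is rebuilt from the one-hot-encoded frame in cell 34.
X_reg = df.drop("Salary", axis=1)
y_reg = df["Salary"]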

In [33]:
# Perform one-hot encoding for categorical variables
data_encoded = pd.get_dummies(df, columns=["Gender", "Job Title", "Education Level"])

In [34]:
# Split into features (X) and target variable (y) for regression
X_reg = data_encoded.drop("Salary", axis=1)
y_reg = data_encoded["Salary"]

In [35]:
# Split into features (X) and target variable (y) for classification
salary_threshold = data_encoded["Salary"].mean()
data_encoded["Salary_Class"] = np.where(data_encoded["Salary"] >= salary_threshold, "High", "Low")
X_cls = data_encoded.drop(["Salary", "Salary_Class"], axis=1)
y_cls = data_encoded["Salary_Class"]

In [36]:
# Train-test split for regression
X_reg_train, X_reg_test, y_reg_train, y_reg_test = train_test_split(X_reg, y_reg, test_size=0.2, random_state=42)

In [37]:
# Train-test split for classification
X_cls_train, X_cls_test, y_cls_train, y_cls_test = train_test_split(X_cls, y_cls, test_size=0.2, random_state=42)

In [38]:
# Overfitted Regression Model
reg_model_overfit = DecisionTreeRegressor(max_depth=10)
reg_model_overfit.fit(X_reg_train, y_reg_train)
y_reg_pred_overfit = reg_model_overfit.predict(X_reg_test)

In [39]:
# Underfitted Regression Model
reg_model_underfit = LinearRegression()
reg_model_underfit.fit(X_reg_train, y_reg_train)
y_reg_pred_underfit = reg_model_underfit.predict(X_reg_test)

In [40]:
# Well Balanced Regression Model
reg_model_well_balanced = DecisionTreeRegressor(max_depth=5)
reg_model_well_balanced.fit(X_reg_train, y_reg_train)
y_reg_pred_well_balanced = reg_model_well_balanced.predict(X_reg_test)

In [41]:
# Overfitted Classification Model
cls_model_overfit = DecisionTreeClassifier(max_depth=10)
cls_model_overfit.fit(X_cls_train, y_cls_train)
y_cls_pred_overfit = cls_model_overfit.predict(X_cls_test)

In [58]:
# Underfitted Classification Model
clf_underfit = GaussianNB()
clf_underfit.fit(X_cls_train, y_cls_train)
y_cls_pred_underfit = clf_underfit.predict(X_cls_test)

In [59]:
# Well Balanced Classification Model
cls_model_well_balanced = DecisionTreeClassifier(max_depth=5)
cls_model_well_balanced.fit(X_cls_train, y_cls_train)
y_cls_pred_well_balanced = cls_model_well_balanced.predict(X_cls_test)

Overfitted Model:

For the overfitted model, a decision tree with a high maximum depth (max_depth=10) is used: DecisionTreeRegressor for regression and DecisionTreeClassifier for classification. The model is chosen to showcase the consequences of overfitting. A tree that is allowed to grow too complex can memorize the training data, including its noise, and then generalize poorly to unseen data. This highlights the trade-off between model complexity and generalization performance.

Underfitted Model:

For regression, a simple linear regression model is chosen. It illustrates the scenario where the model is not complex enough to capture the underlying patterns in the data, which leads to high bias and poor performance on both the training and test data. The underfitted model is chosen intentionally to show why a model needs sufficient complexity to represent the relationships in the data.

For classification, a Gaussian naive Bayes classifier (GaussianNB) is chosen. Naive Bayes is a simple and computationally efficient algorithm that assumes the features are conditionally independent given the class. That strong assumption can lead to underfitting when features interact. The model is selected to demonstrate the limitations of an overly simple model that fails to capture complex relationships in the data.

Well-Balanced Model:

In both the regression and classification cases, a decision tree with a moderate maximum depth (max_depth=5) is used as the well-balanced model. This depth is a fixed choice rather than a tuned value; other hyperparameters could be adjusted in the same way.

The well-balanced model aims to generalize well to new, unseen data by capturing the essential patterns and relationships in the training data without being overly complex. It strikes a balance between capturing enough complexity to represent the underlying patterns and avoiding excessive complexity that may lead to overfitting.
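One way to check which of the three labels a model deserves is to compare its training error with its test error. The sketch below does this for decision trees of increasing depth. It uses a small synthetic dataset as a stand-in for the salary data, and the helper name depth_report is hypothetical, so the numbers it prints say nothing about the notebook's own models.

```python
import numpy as np
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in for the salary data (assumption): a noisy nonlinear target.
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(400, 1))
y = np.sin(X[:, 0]) * 10 + rng.normal(0, 2, size=400)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)


def depth_report(depths):
    """Return {depth: (train_mse, test_mse)} for trees of the given depths."""
    report = {}
    for depth in depths:
        model = DecisionTreeRegressor(max_depth=depth, random_state=0)
        model.fit(X_train, y_train)
        train_mse = mean_squared_error(y_train, model.predict(X_train))
        test_mse = mean_squared_error(y_test, model.predict(X_test))
        report[depth] = (train_mse, test_mse)
    return report


# max_depth=None lets the tree grow until every leaf is pure.
report = depth_report([1, 5, None])
for depth, (train_mse, test_mse) in report.items():
    print(f"max_depth={depth}: train MSE={train_mse:.2f}, test MSE={test_mse:.2f}")
```

A large gap between a low training error and a higher test error points to overfitting, while high error on both sets points to underfitting. A model is only "well balanced" relative to the alternatives it was compared against.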

In [44]:
# Regression Model Evaluation
reg_mse_overfit = mean_squared_error(y_reg_test, y_reg_pred_overfit)
reg_mse_underfit = mean_squared_error(y_reg_test, y_reg_pred_underfit)
reg_mse_well_balanced = mean_squared_error(y_reg_test, y_reg_pred_well_balanced)

print("MSE Overfit:", reg_mse_overfit)
print("MSE Underfit:", reg_mse_underfit)
print("MSE Well Balanced:", reg_mse_well_balanced)

MSE Overfit: 146235410.71747223
MSE Underfit: 3.364795322751358e+27
MSE Well Balanced: 398563161.5193589
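The linear regression MSE of about 3.4e27 is many orders of magnitude larger than the variance of the salaries themselves (the standard deviation of roughly 52,790 reported by df.describe() corresponds to a variance of about 2.8e9). A model that merely underfits would not be expected to do much worse than always predicting the mean, so a value this large hints at a numerical problem in the fit rather than plain underfitting; the cause is not investigated here. A quick sanity check is to compare each MSE against a mean-only baseline. The sketch below uses tiny hypothetical target values, not the notebook's data.

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.metrics import mean_squared_error

# Tiny hypothetical train/test targets (assumption, not the notebook's data).
y_train = np.array([50_000.0, 70_000.0, 90_000.0, 110_000.0])
y_test = np.array([60_000.0, 100_000.0])

# DummyRegressor ignores the features and always predicts the training mean,
# so placeholder feature matrices are enough.
X_train = np.zeros((len(y_train), 1))
X_test = np.zeros((len(y_test), 1))
baseline = DummyRegressor(strategy="mean").fit(X_train, y_train)

baseline_mse = mean_squared_error(y_test, baseline.predict(X_test))
print("Baseline MSE:", baseline_mse)
```

In the notebook, the same baseline could be fitted on X_reg_train and y_reg_train and scored on the test split; a model whose MSE is far above that baseline is worth debugging before it is interpreted.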


In [60]:
# Classification Model Evaluation
cls_accuracy_overfit = accuracy_score(y_cls_test, y_cls_pred_overfit)
cls_accuracy_underfit = accuracy_score(y_cls_test, y_cls_pred_underfit)
cls_accuracy_well_balanced = accuracy_score(y_cls_test, y_cls_pred_well_balanced)

In [61]:
print("Accuracy Score of Overfit:", cls_accuracy_overfit)
print("Accuracy Score of Underfit:", cls_accuracy_underfit)
print("Accuracy Score of Well Balanced:", cls_accuracy_well_balanced)

Accuracy Score of Overfit: 0.9686567164179104
Accuracy Score of Underfit: 0.7089552238805971
Accuracy Score of Well Balanced: 0.926865671641791


In [62]:
# Scatter plot comparing actual and predicted salaries for the regression models
fig_reg = px.scatter(x=y_reg_test, y=y_reg_pred_overfit, labels={'x': 'Actual Salary', 'y': 'Predicted Salary'},
                     title='Overfitted Regression Model')
fig_reg.show()

In [63]:
fig_reg = px.scatter(x=y_reg_test, y=y_reg_pred_underfit, labels={'x': 'Actual Salary', 'y': 'Predicted Salary'},
                     title='Underfitted Regression Model')
fig_reg.show()

In [64]:
fig_reg = px.scatter(x=y_reg_test, y=y_reg_pred_well_balanced, labels={'x': 'Actual Salary', 'y': 'Predicted Salary'},
                     title='Well Balanced Regression Model')
fig_reg.show()

In [65]:
# Histogram of the actual salary classes (full dataset, before the train-test split)
fig_cls = px.histogram(data_encoded, x="Salary_Class", color="Salary_Class", title="Actual Salary Classes")
fig_cls.show()

In [66]:
df_pred = X_cls_test.copy()
df_pred["Salary_Class_Pred"] = y_cls_pred_well_balanced

In [67]:
fig_cls = px.histogram(df_pred, x="Salary_Class_Pred", color="Salary_Class_Pred", title="Well Balanced Classification Model")
fig_cls.show()

In [68]:
df_pred["Salary_Class_Pred"] = y_cls_pred_overfit

In [69]:
fig_cls = px.histogram(df_pred, x="Salary_Class_Pred", color="Salary_Class_Pred", title="Overfitted Classification Model")
fig_cls.show()

In [70]:
df_pred["Salary_Class_Pred"] = y_cls_pred_underfit

In [71]:
fig_cls = px.histogram(df_pred, x="Salary_Class_Pred", color="Salary_Class_Pred", title="Underfitted Classification Model")
fig_cls.show()

Regularization is a technique used in machine learning to prevent overfitting and improve generalization. It adds a penalty term to the loss function that discourages large coefficient values, which limits the model's complexity and balances the fit to the training data against the ability to generalize to unseen data. Coefficient penalties apply to models such as linear regression; for the decision trees used above, complexity is limited through max_depth instead.

There are two commonly used types of regularization:

1. L1 Regularization (Lasso Regularization):
   L1 regularization adds a penalty term to the loss function proportional to the sum of the absolute values of the coefficients. It encourages sparsity by driving some coefficients to exactly zero. This makes it useful for feature selection, since features whose coefficients become zero drop out of the model.

2. L2 Regularization (Ridge Regularization):
   L2 regularization adds a penalty term to the loss function proportional to the sum of the squared coefficients. It shrinks all coefficients toward zero without usually making any of them exactly zero, which reduces the impact of individual features and prevents excessive reliance on any one of them.
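The difference between the two penalties can be seen in a small experiment. The sketch below fits scikit-learn's Lasso and Ridge on synthetic data in which only two of ten features influence the target. The data, the alpha values, and the variable names are assumptions chosen for illustration and are not taken from the salary dataset.

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

# Synthetic data (assumption): only the first two of ten features matter.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
true_coef = np.array([5.0, -3.0] + [0.0] * 8)
y = X @ true_coef + rng.normal(0, 0.5, size=200)

# alpha sets the strength of the penalty in both models.
lasso = Lasso(alpha=0.5).fit(X, y)  # L1 penalty on the coefficients
ridge = Ridge(alpha=0.5).fit(X, y)  # L2 penalty on the coefficients

# Count coefficients that are exactly zero in each fitted model.
n_zero_lasso = int(np.sum(lasso.coef_ == 0))
n_zero_ridge = int(np.sum(ridge.coef_ == 0))

print("Lasso coefficients:", np.round(lasso.coef_, 3))
print("Ridge coefficients:", np.round(ridge.coef_, 3))
print("Exact zeros (Lasso, Ridge):", n_zero_lasso, n_zero_ridge)
```

Lasso typically zeroes out the irrelevant features, while Ridge keeps small nonzero values for all of them. Note that alpha is not directly comparable between the two classes, because scikit-learn scales their loss functions differently.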

The bias-variance trade-off is a fundamental concept in machine learning related to model performance:

- Bias refers to the error introduced by approximating a real-world problem with a simplified model. High bias models are often too simplistic and unable to capture the underlying patterns in the data. They may underfit the data and have poor performance both on the training set and new, unseen data.

- Variance refers to the sensitivity of a model to fluctuations in the training data. High variance models are overly complex and highly sensitive to the training data. They tend to overfit the training data, memorizing noise and performing poorly on new, unseen data.

The bias-variance trade-off describes the tension between a model's ability to fit the training data (low bias) and its ability to generalize to new data (low variance). Regularization controls complexity and so moves a model along this trade-off: shrinking the coefficients reduces variance and the risk of overfitting, at the cost of some added bias.

Finding the right balance between bias and variance is crucial. Models with high bias may need more complexity, while models with high variance may require regularization to reduce overfitting. It is important to evaluate the model's performance on a separate test set or through cross-validation to ensure it has a good balance between bias and variance, leading to better generalization and predictive power.
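Cross-validation can be used to pick a complexity setting instead of fixing it by hand. The sketch below scores decision trees of several depths with 5-fold cross-validation and keeps the depth with the lowest mean MSE. The synthetic data and the candidate depths are assumptions for illustration only.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in data (assumption): a noisy nonlinear target.
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(300, 1))
y = np.sin(X[:, 0]) * 10 + rng.normal(0, 2, size=300)

cv_mse = {}
for depth in [1, 3, 5, 10]:
    model = DecisionTreeRegressor(max_depth=depth, random_state=0)
    # scikit-learn reports negated MSE so that higher scores are better.
    scores = cross_val_score(model, X, y, cv=5, scoring="neg_mean_squared_error")
    cv_mse[depth] = -scores.mean()

best_depth = min(cv_mse, key=cv_mse.get)
for depth, mse in cv_mse.items():
    print(f"max_depth={depth}: mean CV MSE={mse:.2f}")
print("Depth with lowest CV MSE:", best_depth)
```

In the notebook, the same loop could be run on X_reg_train and y_reg_train, keeping the test split aside for a final check of the selected depth.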