![iut](https://github.com/Hexanol777/STEM-Salaries-Case-Study/tree/main/Phase%201/stock_image/IUT200.png)
<hr style="margin-bottom: 40px;">

<img src="https://github.com/Hexanol777/STEM-Salaries-Case-Study/tree/main/Phase%201/stock_image/reg.png"
    style="width:400px; float: right; margin: 0 40px 40px 40px;"></img>

# STEM Jobs Salaries

## Regression

#### During the regression and model-training stage of our data analysis, we used several machine learning algorithms, including linear, polynomial, Ridge, Lasso, and ElasticNet regression, to build models that forecast the base salary of STEM professionals from a set of features. The models were built with the Scikit-Learn toolkit and trained on our preprocessed dataset. To check that the models were not overfitting the data, we also used cross-validation. Model performance was assessed with mean squared error (MSE) and the R-squared score.
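
As a quick illustration of the two metrics used throughout this notebook, the sketch below computes MSE and R-squared by hand and compares them with Scikit-Learn's functions. The salary values are made up for the example and are not taken from the dataset.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Toy salaries (hypothetical values, not taken from the dataset)
y_true = np.array([100_000.0, 120_000.0, 150_000.0, 90_000.0])
y_pred = np.array([110_000.0, 115_000.0, 140_000.0, 95_000.0])

# MSE: average of the squared residuals
mse_manual = np.mean((y_true - y_pred) ** 2)

# R-squared: 1 - (residual sum of squares / total sum of squares)
ss_res = np.sum((y_true - y_pred) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2_manual = 1 - ss_res / ss_tot

print('Manual MSE:', mse_manual, '| sklearn MSE:', mean_squared_error(y_true, y_pred))
print('Manual R2 :', r2_manual, '| sklearn R2 :', r2_score(y_true, y_pred))
```

MSE is expressed in squared units of the target, so its size depends on the scale of `BaseSalary`, while R-squared is unitless.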

[Link to the Data used in this Notebook](https://drive.google.com/file/d/1IhXv0qcq7YFfBxc0BQB1-z74wF40ZnZn/view?usp=share_link)

![purple-divider](https://user-images.githubusercontent.com/7065401/52071927-c1cd7100-2562-11e9-908a-dde91ba14e59.png)

## Importing Modules

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score

%matplotlib inline

![green-divider](https://user-images.githubusercontent.com/7065401/52071924-c003ad80-2562-11e9-8297-1c6595f8a7ff.png)

## Loading and Sampling the Data:

In [None]:
Data = pd.read_csv(
    'data/jobs_with_country_codes.csv',
    parse_dates=['Timestamp'])
Data.head()

![green-divider](https://user-images.githubusercontent.com/7065401/52071924-c003ad80-2562-11e9-8297-1c6595f8a7ff.png)

## Linear Regression:

In [None]:
# Define the features and target variable
features  = ['YearsOfExperience', 'YearsAtCompany', 'IsUS', 'IsCA', 'IsID', 'IsIN', 'IsDE', 'IsMale', 'IsFemale', 'Masters_Degree', 'Bachelors_Degree']
target = 'BaseSalary'

# Split the dataset into training and testing sets
train_data = Data.sample(frac=0.8, random_state=1)
test_data = Data.drop(train_data.index)


# Train the model
model = LinearRegression().fit(train_data[features], train_data[target])

# Predict the target variable for the testing set
predictions = model.predict(test_data[features])


# Calculate the mean squared error and R-squared score
mse = mean_squared_error(test_data[target], predictions)
r2 = r2_score(test_data[target], predictions)

print('Mean squared error: {:.2f}'.format(mse))
print('R-squared score: {:.2f}'.format(r2))
print('Model coefficients:\n', model.coef_)

![green-divider](https://user-images.githubusercontent.com/7065401/52071924-c003ad80-2562-11e9-8297-1c6595f8a7ff.png)

## Polynomial Regression



In [None]:
from sklearn.preprocessing import PolynomialFeatures

# select the features and target variable
X = Data[['YearsOfExperience', 'YearsAtCompany', 'IsUS', 'IsCA', 'IsID', 'IsIN', 'IsDE', 'IsMale', 'IsFemale', 'Masters_Degree', 'Bachelors_Degree']]
y = Data['BaseSalary']

# split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# create polynomial features
poly = PolynomialFeatures(degree=2)
X_train_poly = poly.fit_transform(X_train)
X_test_poly = poly.transform(X_test)

# train a linear regression model on the polynomial features
model = LinearRegression()
model.fit(X_train_poly, y_train)

# make predictions on the test set
y_pred = model.predict(X_test_poly)

# evaluate the model using mean squared error and r-squared score
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)

print('Mean squared error:', mse)
print('R-squared score:', r2)
print('Model coefficients:\n', model.coef_)

![green-divider](https://user-images.githubusercontent.com/7065401/52071924-c003ad80-2562-11e9-8297-1c6595f8a7ff.png)

## Ridge Regression



In [None]:
from sklearn.linear_model import Ridge

# Select features and target
X = Data[['YearsOfExperience', 'YearsAtCompany', 'IsUS', 'IsCA', 'IsID', 'IsIN', 'IsDE', 'IsMale', 'IsFemale', 'Masters_Degree', 'Bachelors_Degree']]
y = Data['BaseSalary']

# Split the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Fit the Ridge Regression model
ridge = Ridge(alpha=0.1, random_state=42)
ridge.fit(X_train, y_train)

# Predict the target variable for the test set
y_pred = ridge.predict(X_test)

# Evaluate the model using Mean Squared Error (MSE) and R-squared score
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)
print("Mean squared error: {:.2f}".format(mse))
print("R-squared score: {:.2f}".format(r2))
print('Model coefficients:\n', ridge.coef_)

![green-divider](https://user-images.githubusercontent.com/7065401/52071924-c003ad80-2562-11e9-8297-1c6595f8a7ff.png)

## Lasso Regression



In [None]:
from sklearn.linear_model import Lasso
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# Select the features and target
features = ['YearsOfExperience', 'YearsAtCompany', 'IsUS', 'IsCA', 'IsID', 'IsIN', 'IsDE', 'IsMale', 'IsFemale', 'Masters_Degree', 'Bachelors_Degree']
target = 'BaseSalary'

# Split the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(Data[features], Data[target], test_size=0.2, random_state=42)

# Feature scaling
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Polynomial features
poly = PolynomialFeatures(degree=2, include_bias=False)
X_train_poly = poly.fit_transform(X_train_scaled)
X_test_poly = poly.transform(X_test_scaled)

# Train the model
lasso = Lasso(alpha=0.1, random_state=42, max_iter=1000)
lasso.fit(X_train_poly, y_train)

# Make predictions on the test set
y_pred = lasso.predict(X_test_poly)

# Evaluate the model
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)
print('Mean squared error:', mse)
print('R-squared score:', r2)
print('Model coefficients:\n', lasso.coef_)

![green-divider](https://user-images.githubusercontent.com/7065401/52071924-c003ad80-2562-11e9-8297-1c6595f8a7ff.png)

## ElasticNet Regression



In [None]:
from sklearn.linear_model import ElasticNet
from sklearn.preprocessing import PolynomialFeatures
from sklearn.pipeline import make_pipeline
from sklearn.metrics import mean_squared_error, r2_score
import numpy as np

# define the features and target variable
features = ['YearsOfExperience', 'YearsAtCompany', 'IsUS', 'IsCA', 'IsID', 'IsIN', 'IsDE', 'IsMale', 'IsFemale', 'Masters_Degree', 'Bachelors_Degree']
target = 'BaseSalary'

# create X and y arrays
X = Data[features]
y = Data[target]

# split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# set hyperparameters alpha and l1_ratio
alpha = 0.5
l1_ratio = 0.5

# define the pipeline: degree-2 polynomial features followed by ElasticNet with the hyperparameters set above
model = make_pipeline(PolynomialFeatures(2), ElasticNet(alpha=alpha, l1_ratio=l1_ratio))

# fit the model on the training data
model.fit(X_train, y_train)

# make predictions on the test data
y_pred = model.predict(X_test)

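# build a readable model equation from the coefficients
# (x{i} is the i-th polynomial term produced by PolynomialFeatures, not the i-th input feature)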
coefs = model.named_steps['elasticnet'].coef_
intercept = model.named_steps['elasticnet'].intercept_
equation = "y = {:.2f}".format(intercept)
for i, coef in enumerate(coefs):
    if i == 0:
        equation += " + {:.2f}".format(coef)
    else:
        if coef >= 0:
            equation += " + {:.2f}x{}".format(coef, i)
        else:
            equation += " - {:.2f}x{}".format(abs(coef), i)


# calculate and print evaluation metrics
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)
print("Mean squared error: {:.2f}".format(mse))
print("R-squared score: {:.2f}".format(r2))
print("Model equation:\n", equation)

![green-divider](https://user-images.githubusercontent.com/7065401/52071924-c003ad80-2562-11e9-8297-1c6595f8a7ff.png)

## More Feature Scaling?

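In this step we standardize the monetary columns `'BaseSalary'`, `'StockGrantValue'`, and `'Bonus'` with `StandardScaler` and add `'StockGrantValue'` and `'Bonus'` to the feature set, alongside the encoded columns `'IsUS'`, `'IsCA'`, `'IsID'`, `'IsIN'`, `'IsDE'`, `'IsMale'`, and `'IsFemale'`. This decreased the mean squared error (MSE) of our model and increased the R-squared score. Standardization puts the monetary columns on a common scale with zero mean and unit variance. Because the target `'BaseSalary'` is standardized as well, the MSE in this section is expressed in standardized units and is not directly comparable with the MSE values from the earlier sections; the R-squared score is not affected by rescaling the target.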

In [None]:
from sklearn.preprocessing import StandardScaler

# Work on a copy so that the earlier cells still see the original `Data` if re-run
dataframe = Data.copy()
X_s = dataframe[['BaseSalary', 'StockGrantValue', 'Bonus']].values

# Create a StandardScaler object
scaler = StandardScaler()

# Fit and transform the data (fitted on the full dataset, before the train/test split)
X_scaled = scaler.fit_transform(X_s)

# Replace the original columns with the scaled columns in the dataframe
dataframe[['BaseSalary', 'StockGrantValue', 'Bonus']] = X_scaled

# select the features and target variable
X = dataframe[['YearsOfExperience', 'YearsAtCompany', 'StockGrantValue', 'Bonus', 'IsUS', 'IsCA'
          , 'IsID', 'IsIN', 'IsDE', 'IsMale', 'IsFemale', 'Masters_Degree', 'Bachelors_Degree']]

y = dataframe['BaseSalary']

# split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# create polynomial features
poly = PolynomialFeatures(degree=2)
X_train_poly = poly.fit_transform(X_train)
X_test_poly = poly.transform(X_test)

# train a linear regression model on the polynomial features
model = LinearRegression()
model.fit(X_train_poly, y_train)

# make predictions on the test set
y_pred = model.predict(X_test_poly)

# evaluate the model using mean squared error and r-squared score
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)

print('Mean squared error:', mse)
print('R-squared score:', r2)
print('Model coefficients:\n', model.coef_)
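
Because the cell above standardizes the target `BaseSalary` along with the features, its MSE is in standardized units and cannot be compared directly with the MSE values printed in earlier sections, whereas R-squared is unchanged by rescaling the target. A minimal sketch of this effect on synthetic data follows; the arrays and the `_demo`-style names are hypothetical and not part of the dataset, and the scaler is fitted on the training split only, which is an assumption of this sketch rather than what the notebook does.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the salary data (hypothetical values)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(500, 3))
y_demo = (100_000 + X_demo @ np.array([20_000.0, 5_000.0, -8_000.0])
          + rng.normal(scale=10_000, size=500))

Xd_train, Xd_test, yd_train, yd_test = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=42)

# Model 1: fit on the raw target
raw_pred = LinearRegression().fit(Xd_train, yd_train).predict(Xd_test)

# Model 2: fit on a standardized target (scaler fitted on the training split only)
y_scaler = StandardScaler().fit(yd_train.reshape(-1, 1))
yd_train_std = y_scaler.transform(yd_train.reshape(-1, 1)).ravel()
yd_test_std = y_scaler.transform(yd_test.reshape(-1, 1)).ravel()
std_pred = LinearRegression().fit(Xd_train, yd_train_std).predict(Xd_test)

mse_raw = mean_squared_error(yd_test, raw_pred)
mse_std = mean_squared_error(yd_test_std, std_pred)
r2_raw = r2_score(yd_test, raw_pred)
r2_std = r2_score(yd_test_std, std_pred)

print('MSE (raw target):         ', mse_raw)
print('MSE (standardized target):', mse_std)
print('R2  (raw target):         ', r2_raw)
print('R2  (standardized target):', r2_std)
```

In this sketch the two MSE values differ by a factor of the squared scale of the target, while the two R-squared values agree, so R-squared is the safer metric for comparing this section with the earlier ones.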


![green-divider](https://user-images.githubusercontent.com/7065401/52071924-c003ad80-2562-11e9-8297-1c6595f8a7ff.png)

## Even More Scaling?

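Adding the additional encoded columns (`'IsSE'`, `'IsPM'`, `'IsSEM'`, `'IsDS'`, `'IsHE'`, `'IsAmazon'`, `'IsMicro'`, `'IsGoogle'`, `'IsFacebook'`, and `'IsApple'`) increased the R-squared score by only 0.02, a gain too small to justify the added model complexity.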
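
One common way to weigh a small R-squared gain against extra features is the adjusted R-squared, which penalizes the number of predictors. The sketch below is not part of the original analysis; the helper name `adjusted_r2`, the sample size, and the scores are hypothetical and chosen only for illustration.

```python
def adjusted_r2(r2, n_samples, n_features):
    """Adjusted R-squared: 1 - (1 - R2) * (n - 1) / (n - p - 1)."""
    return 1 - (1 - r2) * (n_samples - 1) / (n_samples - n_features - 1)

# Hypothetical numbers for illustration only (not the notebook's results)
print(adjusted_r2(0.50, n_samples=1000, n_features=13))
print(adjusted_r2(0.52, n_samples=1000, n_features=23))
```

Note that with `PolynomialFeatures(degree=2)` the number of fitted terms grows much faster than the number of input columns: 13 inputs expand to 105 terms and 23 inputs expand to 300 terms (including the bias column), so the expanded count is the more honest value to pass as `n_features`.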

In [None]:
data_scaled = pd.read_csv(
    'data/jobs_with_country_codes Scaled.csv',
    parse_dates=['Timestamp'])
data_scaled.head()

In [None]:
from sklearn.preprocessing import StandardScaler

X_s = data_scaled[['BaseSalary', 'StockGrantValue', 'Bonus']].values

# Create a StandardScaler object
scaler = StandardScaler()

# Fit and transform the data
X_scaled = scaler.fit_transform(X_s)

# Replace the original columns with the scaled columns in the dataframe
data_scaled[['BaseSalary', 'StockGrantValue', 'Bonus']] = X_scaled


# select the features and target variable
X = data_scaled[['YearsOfExperience', 'YearsAtCompany', 'StockGrantValue', 'Bonus', 'IsUS', 'IsCA'
          , 'IsID', 'IsIN', 'IsDE', 'IsMale', 'IsFemale', 'Masters_Degree', 'Bachelors_Degree'
          , 'IsSE', 'IsPM', 'IsSEM', 'IsDS', 'IsHE', 'IsAmazon', 'IsMicro', 'IsGoogle', 'IsFacebook', 'IsApple']]

y = data_scaled['BaseSalary']

# split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# create polynomial features
poly = PolynomialFeatures(degree=2)
X_train_poly = poly.fit_transform(X_train)
X_test_poly = poly.transform(X_test)

# train a linear regression model on the polynomial features
model = LinearRegression()
model.fit(X_train_poly, y_train)

# make predictions on the test set
y_pred = model.predict(X_test_poly)

# evaluate the model using mean squared error and r-squared score
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)

print('Mean squared error:', mse)
print('R-squared score:', r2)
print('Model coefficients:\n', model.coef_)

![green-divider](https://user-images.githubusercontent.com/7065401/52071924-c003ad80-2562-11e9-8297-1c6595f8a7ff.png)

## Cross Validation?

Cross-validation is a statistical method used to estimate the performance of a machine learning model. The process involves partitioning a dataset into k equally sized parts or folds, where k is typically set to 5 or 10. One of the folds is held out as the validation set, while the other k-1 folds are used to train the model. This process is repeated k times, with each of the k folds used exactly once as the validation data.
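
The procedure described above can be sketched as an explicit loop over the folds. This is a minimal illustration on synthetic data, assuming shuffled folds; the arrays `X_cv` and `y_cv` and the other names are hypothetical and do not come from the dataset.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import KFold

# Synthetic data standing in for the notebook's features and target
rng = np.random.default_rng(0)
X_cv = rng.normal(size=(200, 4))
y_cv = X_cv @ np.array([1.5, -2.0, 0.5, 0.0]) + rng.normal(scale=0.3, size=200)

k = 5  # number of folds
kf = KFold(n_splits=k, shuffle=True, random_state=42)

fold_mse = []
for train_idx, val_idx in kf.split(X_cv):
    # Train on the k-1 remaining folds, validate on the held-out fold
    fold_model = LinearRegression().fit(X_cv[train_idx], y_cv[train_idx])
    fold_pred = fold_model.predict(X_cv[val_idx])
    fold_mse.append(mean_squared_error(y_cv[val_idx], fold_pred))

print('MSE per fold:', np.round(fold_mse, 4))
print('Mean MSE:', np.mean(fold_mse))
```

`cross_val_score` wraps this same loop; when `cv` is an integer and the estimator is a regressor, it splits the rows into consecutive folds without shuffling, so passing a `KFold(shuffle=True)` object is worth considering if the rows are ordered in some meaningful way.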

In [None]:
from sklearn.preprocessing import StandardScaler

X_s = data_scaled[['BaseSalary', 'StockGrantValue', 'Bonus']].values

# Create a StandardScaler object
scaler = StandardScaler()

# Fit and transform the data
X_scaled = scaler.fit_transform(X_s)

# Replace the original columns with the scaled columns in the dataframe
data_scaled[['BaseSalary', 'StockGrantValue', 'Bonus']] = X_scaled


# select the features and target variable
X = data_scaled[['YearsOfExperience', 'YearsAtCompany', 'StockGrantValue', 'Bonus', 'IsUS', 'IsCA'
          , 'IsID', 'IsIN', 'IsDE', 'IsMale', 'IsFemale', 'Masters_Degree', 'Bachelors_Degree'
          , 'IsSE', 'IsPM', 'IsSEM', 'IsDS', 'IsHE', 'IsAmazon', 'IsMicro', 'IsGoogle', 'IsFacebook', 'IsApple']]

y = data_scaled['BaseSalary']

# build a pipeline: degree-2 polynomial features followed by linear regression,
# so the polynomial expansion is actually applied inside each fold
from sklearn.pipeline import make_pipeline

model = make_pipeline(PolynomialFeatures(degree=2), LinearRegression())

# Perform k-fold cross validation
k = 5  # Number of folds
scores = cross_val_score(model, X, y, cv=k, scoring='neg_mean_squared_error')

# Calculate the mean squared error for each fold
mse_scores = -scores  # Convert negative scores to positive
mean_mse = np.mean(mse_scores)

# Print the mean squared error for each fold and the overall mean squared error
print('MSE Scores:', mse_scores)
print('Mean MSE:', mean_mse)

![purple-divider](https://user-images.githubusercontent.com/7065401/52071927-c1cd7100-2562-11e9-908a-dde91ba14e59.png)
