# Regression MSA Part 2



# Explaining the variables
Work_year: This feature refers to the number of years an individual has been in the workforce. It is a numeric variable.

Experience_level: This categorical variable indicates the individual's experience level. This gives an idea about the expertise and seniority of the individual in their field.

Employment_type: This is also a categorical variable which describes the type of employment the individual holds.

Job_title: This is another categorical feature which gives us insight into the individual's job title.

Salary: This is a continuous numeric variable that indicates the salary earned by the individual. 

Salary_currency: This is another categorical variable that represents the currency the salary of employees is paid in. The currency range from 'USD', to'GBP'.

Salary_in_usd: This numeric variable represents the individual's salary in USD. It is calculated based on the original salary and the exchange rate from the salary currency to USD.

Employee_residence: This categorical variable indicates the country where the employee lives and works from.

Remote_ratio: This is a numerical variable being either 100 or 0.

Company_location: This is a categorical variable indicating the location of the company where the individual is employed, such as the US.

Company_size: This is a categorical variable which indicates the company's size. 

These are all the features given to me to find my target, the 'salary_in_usd' feature, as this feature does not need any conversion (i.e., converting currencies).

## Importing libaries and data_salaries

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
from sklearn.preprocessing import OneHotEncoder, LabelEncoder, StandardScaler
from sklearn.ensemble import RandomForestRegressor


# --- Load data ---
df_salaries = pd.read_csv('data_salaries.csv')



## setting and spliting the data

In [None]:
# For numerical features
numerical_features = df_salaries.select_dtypes(include=[np.number])
numerical_summary = numerical_features.describe().transpose()

# For categorical features
categorical_features = df_salaries.select_dtypes(include=['object'])
categorical_summary = categorical_features.describe(include=['object']).transpose()

df_encoded_residence = pd.get_dummies(df_salaries['employee_residence'])
df_encoded_job_title = pd.get_dummies(df_salaries['job_title'])

# Drop the original 'employee_residence', 'job_title', and 'Profession' columns
df_salaries = df_salaries.drop(columns=['employee_residence', 'job_title'])

# Join the encoded data frame with the original ones
df_salaries = df_salaries.join(df_encoded_residence)
df_salaries = df_salaries.join(df_encoded_job_title)

# Convert ordinal variables to numerical
df_salaries['company_size'] = df_salaries['company_size'].map({'S': 0, 'M': 1,'L':2})
df_salaries['experience_level'] = df_salaries['experience_level'].map({'SE': 0, 'MI':1,'EN':2,'EX':3})
df_salaries['employment_type'] = df_salaries['employment_type'].map({'FT': 0, 'CT':1,'FL':2,'PT':3})
# Function to remove outliers
def remove_outliers(df):
    numeric_cols = df.select_dtypes(include=[np.number])
    mean = numeric_cols.mean()
    std = numeric_cols.std()
    is_outlier = (np.abs(numeric_cols - mean) > 3 * std).any(axis=1)
    return df[~is_outlier]
df_salaries = remove_outliers(df_salaries)

In the code above I decided it was best to do one-Hot Encoding, by doing so I can convert categorical variables into a binary format and appends them to the main data frame, while removing the original columns.
Ordinal Conversion by doing this I was able to convert them into numerical form so the model can use them.
Outlier Removal was also performed to make the model more accurate. 

The model chosen was RandomForestRegressor and it works by creating multiple decision trees from the training data provided by MSA, the trees work by using bootstrapping and random feature selection. Each tree in the forest makes a prediction for a given data, and the final prediction is the average of all trees' generated predictions. This method is excellent as it reduces overfitting while also being able to handle mixed feature types and missing data which can be a common problem in any data set, but maybe computationally intensive if using a lot of data.


In [None]:
# Chosing which features not to target for the model
X = df_salaries.drop(columns=['salary_in_usd','salary_currency','company_location','work_year'])
y = df_salaries['salary_in_usd']

# Split the dataset into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Train the model
rf = RandomForestRegressor(n_estimators=100, random_state=777)
rf.fit(X_train, y_train)

# Make predictions
y_pred = rf.predict(X_test)

# Print the Root Mean Squared Error
print("Root Mean Squared Error: ", np.sqrt(mean_squared_error(y_test, y_pred)))
min_salary = df_salaries['salary_in_usd'].min()
max_salary = df_salaries['salary_in_usd'].max()

print("Minimum salary: ", min_salary)
print("Maximum salary: ", max_salary)

# Calculate the Mean Absolute Error
mae = mean_absolute_error(y_test, y_pred)
print("Mean Absolute Error: ", mae)

# Calculate R-squared
r2 = r2_score(y_test, y_pred)
print("R-squared: ", r2)

# define a range of seeds
seeds = range(10)

mae_values = []
r2_values = []

# loop over the seeds
for seed in seeds:

    # Split the dataset into training and test sets
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=seed)

    # Train the model
    rf = RandomForestRegressor(n_estimators=100, random_state=777)
    rf.fit(X_train, y_train)


    # Make predictions
    y_pred = rf.predict(X_test)

    # Calculate the Mean Absolute Error and R2 score
    mae = mean_absolute_error(y_test, y_pred)
    r2 = r2_score(y_test, y_pred)

    # append the scores to their respective lists
    mae_values.append(mae)
    r2_values.append(r2*1000)

Data Splitting: I began by dividing the data into training and test sets for my salary prediction project.

Prediction: After training the model on the training data, I made predictions on the test data to assess how well the model generalizes to unseen data.


Seed Variation: To ensure the true accuracy I conducted a seed variation analysis meaning it runs through random seeds. By changing the random seed for data splitting, I observed and plotted the changes in the model's performance metrics. This analysis showed that the model's performance remained consistent across different train-test splits, indicating its robustness and accuracy.

In [None]:
# plot the results
plt.plot(seeds, mae_values, label='Mean Absolute Error')
plt.plot(seeds, r2_values, label='R-squared')
plt.xlabel('Seed')
plt.ylabel('Value')

for i, txt in enumerate(mae_values):
    plt.text(seeds[i], mae_values[i], f'{txt:.2f}', ha='center')

for i, txt in enumerate(r2_values):
    plt.text(seeds[i], r2_values[i], f'{txt:.2f}', ha='center')

plt.legend()
plt.title('The difference between the Root Mean Error and the R-squared over 10 seeds')
plt.show()


Understanding MAE: Mean Absolute Error is a commonly used metric in regression analysis. The model calculates MAE. The MAE value provides penalty by comparing the actual value to the predicted one. It is also a good indication of the model's predictions. The MAE It had a range from 1059-1694, which is very low as the range of earnings is 5132-324000.

Overall, the combination of model training, cleaning, and seed variation helps to has provided valuable insights into the performance of this salary prediction model, helping me make informed decisions about its suitability for real-world applications.

Evaluation: To measure the model's performance, I calculated several key metrics, including the Root Mean Squared Error (RMSE), Mean Absolute Error (MAE), and R-squared (R2). By running the code with 10 random seeds I was able to produce a pattern showing the difference over the seeds, proving that the model is robust and fits multiple seeds. I also times the R2 values by 1000 to see the difference more clearly, as it was to small to visualise the numbers without it.

- The MAE value of 1546 indicates that, on average, my model's predictions were about $1546 away from the actual salary values.

- The R-squared (R2) score of 0.98 suggests that approximately 98% of the variation in salary could be explained by the features in the model. R2 is a measure of how well the model fits the data as predicted a higher values leads to a better fit.

# Summary 

In this part of MSA, the objective was to predict salary in USD based on various features provided in the data set, using the model of our choice. First, the data was preprocessed by turning the data into numerical form. It was also required to perform one-hot encoding for categorical variables into numeric values so the model could read it. The preprocessed data was then split into a training set and a test set the split being 30/70.

A RandomForestRegressor model was trained on the training data using 100 estimators. The model was then used to make predictions based on the data provided, and its performance was later then evaluated using metrics that are used to test regression data, such as the Root Mean Squared Error (RMSE), Mean Absolute Error (MAE), and R-squared (R2). The MAE value of 1323 indicates that the model's predictions were, on average, about $1323 away from the actual values. The R2 score of 0.98 implies that 98% of predictions were correct, demonstrating outstanding model performance.

Furthermore, a seed variation analysis was conducted to test the stability of the model's performance. The random seed for data splitting was varied, and the changes in the model's performance metrics were observed, and there was a slight variation which is to be expected. This analysis showed the model's robustness to different training and test set configurations.

Additionally, feature importances from the RandomForestRegressor model were analysed to gain deeper insights into the factors influencing salary predictions. This analysis offered valuable information from job seekers to employees wanting to know how much people should earn. Moreover, exploring other machine learning algorithms or fine-tuning model parameters might present opportunities for further enhancing the predictive performance of the salary prediction model; also, by increasing the amount of data, the model would be more accurate, but it would take longer as the model will take longer to compute.

Examining the feature in the data helped shed light on which features are most influential in predicting salary, potentially informing job seekers or recruiters about the most significant factors they should look for in employees and how much they are "worth". Further tuning the model parameters or trying other machine learning algorithms could improve the model's performance.
