# Salary Prediction Analysis

## Introduction
This project aims to predict salaries based on various features such as age, education level, years of experience, country, race, and job title. The dataset used is `Salary_Data_Based_country_and_race.csv`, which contains salary information along with demographic and employment details.

### Objectives:
- Perform data cleaning and preprocessing.
- Explore the dataset to understand the distribution of features.
- Build and evaluate different regression models to predict salaries.
- Identify the best-performing model for salary prediction.

In [14]:
# Import necessary libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split, GridSearchCV, RandomizedSearchCV, KFold
from sklearn.preprocessing import MinMaxScaler, PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.neighbors import KNeighborsRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.svm import SVR
from sklearn.metrics import mean_squared_error, r2_score

In [None]:
# Load the dataset
data_raw = pd.read_csv('Salary_Data_Based_country_and_race.csv')

# Display basic information about the dataset
data_raw.info()

# Check for duplicates and missing values
print("Number of duplicates:", data_raw.duplicated().sum())
print("Missing values per column:\n", data_raw.isna().sum())

In [None]:
# Drop unnecessary columns and handle missing values
data_raw.drop(columns="Unnamed: 0", inplace=True)
data_raw.dropna(inplace=True)

# Display unique values for categorical columns
print("Unique Genders:", data_raw['Gender'].unique())
print("Unique Education Levels:", data_raw['Education Level'].unique())
print("Unique Countries:", data_raw['Country'].unique())
print("Unique Races:", data_raw['Race'].unique())

In [None]:
# Create a copy of the dataset for preprocessing
df = data_raw.copy()

# Map categorical variables to their frequency
df['Gender'] = df['Gender'].map(df['Gender'].value_counts())
df['Race'] = df['Race'].map(df['Race'].value_counts())

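# Map job titles and countries to their mean salary (target encoding).
# Caution: the means are computed on all rows, including those that later form the test set,
# so some target information leaks into the features.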
df['Job Title'] = df['Job Title'].map(df.groupby('Job Title')['Salary'].mean())
df['Country'] = df['Country'].map(df.groupby('Country')['Salary'].mean())

# Drop the original salary column
df.drop(columns='Salary', inplace=True)

# Standardize education levels
df['Education Level'] = df['Education Level'].replace({
    "Bachelor's": "Bachelor",
    "Bachelor's Degree": "Bachelor",
    "Master's": "Master",
    "Master's Degree": "Master",
    "phD": "PhD"
})

# Map education levels to numerical values
education_mapping = {
    "High School": 0,
    "Bachelor": 1,
    "Master": 2,
    "PhD": 3
}
df['Education Level'] = df['Education Level'].map(education_mapping)

# Display the preprocessed dataset
df.head()
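
The mean-salary encoding in the cell above is learned from every row, so salaries from rows that later end up in the test set influence the features. One possible alternative, sketched here under assumptions, is to split first and learn the means from the training rows only. The helper name `target_encode` and the toy data are hypothetical and not part of the original notebook.

```python
import pandas as pd

def target_encode(train, test, column, target):
    """Encode `column` by the mean of `target`, learned from `train` only (hypothetical helper)."""
    means = train.groupby(column)[target].mean()
    fallback = train[target].mean()  # used for categories never seen in training
    train_encoded = train[column].map(means)
    test_encoded = test[column].map(means).fillna(fallback)
    return train_encoded, test_encoded

# Toy data for illustration only (not taken from the salary dataset)
train = pd.DataFrame({
    "Job Title": ["Engineer", "Engineer", "Analyst"],
    "Salary": [100000.0, 120000.0, 70000.0],
})
test = pd.DataFrame({"Job Title": ["Engineer", "Manager"]})

train_enc, test_enc = target_encode(train, test, "Job Title", "Salary")
print(train_enc.tolist())
print(test_enc.tolist())
```

The unseen title "Manager" receives the overall training mean, which keeps the test features free of any test-set salary.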

In [None]:
# Visualize numerical data distribution
numerical_data = df[['Age', 'Years of Experience']]
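plt.boxplot(numerical_data.values)
# Label each box with its column name (boxes sit at positions 1 and 2 by default)
plt.xticks([1, 2], numerical_data.columns)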
plt.title('Boxplot of Age and Years of Experience')
plt.show()

# Visualize salary distribution
plt.boxplot(data_raw['Salary'])
plt.title('Boxplot of Salary')
plt.show()

In [19]:
# Split the data into training and testing sets
x = df.values
y = data_raw['Salary'].values
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=444)

# Scale the data
scaler = MinMaxScaler()
x_train_scaled = scaler.fit_transform(x_train)
x_test_scaled = scaler.transform(x_test)

In [None]:
# Linear Regression
lr = LinearRegression()
lr.fit(x_train_scaled, y_train)
y_pred_test = lr.predict(x_test_scaled)
test_rmse = np.sqrt(mean_squared_error(y_test, y_pred_test))
test_r2 = r2_score(y_test, y_pred_test)
print(f'Linear Regression - Test RMSE: {test_rmse}, Test R²: {test_r2}')
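
For reference, the two metrics reported for every model can be computed directly with NumPy. The sketch below uses made-up numbers rather than the salary data, and is only meant to show what `mean_squared_error` and `r2_score` measure.

```python
import numpy as np

# Made-up values for illustration only
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.0, 5.0, 8.0, 9.0])

residuals = y_true - y_pred

# RMSE: square root of the mean squared residual, in the units of the target
rmse = np.sqrt(np.mean(residuals ** 2))

# R²: 1 minus the ratio of residual variation to total variation around the mean
ss_res = np.sum(residuals ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2 = 1.0 - ss_res / ss_tot

print(f"RMSE: {rmse:.4f}, R²: {r2:.4f}")
```

Because the total variation is fixed for a given test set, the model with the lowest test RMSE is also the one with the highest test R².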

In [None]:
# Polynomial Regression
degrees = [1, 2, 3, 4, 5]
train_errors, test_errors, train_r2_scores, test_r2_scores = [], [], [], []

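for degree in degrees:
    poly = PolynomialFeatures(degree=degree)
    x_train_poly = poly.fit_transform(x_train_scaled)
    x_test_poly = poly.transform(x_test_scaled)
    # Use a fresh model so the plain linear model `lr` fitted earlier is not overwritten
    poly_lr = LinearRegression()
    poly_lr.fit(x_train_poly, y_train)
    y_pred_train_poly = poly_lr.predict(x_train_poly)
    y_pred_test_poly = poly_lr.predict(x_test_poly)
    # Record training and test metrics separately so overfitting is visible
    train_errors.append(np.sqrt(mean_squared_error(y_train, y_pred_train_poly)))
    test_errors.append(np.sqrt(mean_squared_error(y_test, y_pred_test_poly)))
    train_r2_scores.append(r2_score(y_train, y_pred_train_poly))
    test_r2_scores.append(r2_score(y_test, y_pred_test_poly))

# Note: the degree is selected on the test set, so its reported test score is optimistic
best_degree = degrees[np.argmin(test_errors)]
print(f'Best degree for Polynomial Regression: {best_degree}')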
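
Choosing the degree by test error means the test set is no longer an unbiased check. One common alternative, sketched here on synthetic data rather than the salary dataset, is to pick the degree by cross-validation on the training data alone. The data-generating function and the variable names `cv_rmse` and `best_cv_degree` are assumptions made for this illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Synthetic quadratic data, for illustration only
rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(200, 1))
y = 1.0 + 2.0 * X[:, 0] - 3.0 * X[:, 0] ** 2 + rng.normal(0, 0.05, size=200)

cv_rmse = {}
for degree in [1, 2, 3, 4, 5]:
    # The pipeline refits the polynomial expansion inside every fold
    model = make_pipeline(PolynomialFeatures(degree=degree), LinearRegression())
    scores = cross_val_score(model, X, y, cv=5,
                             scoring="neg_root_mean_squared_error")
    cv_rmse[degree] = -scores.mean()  # scores are negated RMSE values

best_cv_degree = min(cv_rmse, key=cv_rmse.get)
print(cv_rmse)
print("Degree with lowest cross-validated RMSE:", best_cv_degree)
```

With this approach the held-out test set is touched only once, after the degree has been fixed.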

In [None]:
# K-Nearest Neighbors Regression
knr = KNeighborsRegressor()
params = {
    'n_neighbors': range(5, 21, 2),
    'weights': ['uniform', 'distance']
}
gclf = GridSearchCV(knr, param_grid=params, cv=10, scoring='neg_mean_squared_error')
gclf.fit(x_train_scaled, y_train)
best_knr_model = gclf.best_estimator_
y_pred_test_knr = best_knr_model.predict(x_test_scaled)
test_rmse_knr = np.sqrt(mean_squared_error(y_test, y_pred_test_knr))
test_r2_knr = r2_score(y_test, y_pred_test_knr)
print(f'KNN Regression - Test RMSE: {test_rmse_knr}, Test R²: {test_r2_knr}')
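
In the grid search above, the scaler was fitted on the full training set before cross-validation, so each validation fold has already influenced the scaling. A possible refinement, sketched here on synthetic data, is to place the scaler inside a `Pipeline` so it is refitted on each fold's training portion. The step names `scale` and `knn` and the reduced grid are choices made for this example only.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

# Synthetic data, for illustration only
rng = np.random.default_rng(1)
X = rng.normal(size=(120, 3))
y = 2.0 * X[:, 0] + rng.normal(0, 0.1, size=120)

pipe = Pipeline([
    ("scale", MinMaxScaler()),       # refitted on each training fold
    ("knn", KNeighborsRegressor()),
])

# Parameters of a pipeline step are addressed as <step name>__<parameter>
grid = {
    "knn__n_neighbors": [5, 7, 9],
    "knn__weights": ["uniform", "distance"],
}

search = GridSearchCV(pipe, param_grid=grid, cv=5,
                      scoring="neg_mean_squared_error")
search.fit(X, y)

cv_rmse_knn = np.sqrt(-search.best_score_)  # best_score_ is a negated MSE
print(search.best_params_, cv_rmse_knn)
```

The fitted `search.best_estimator_` then scales and predicts in one call, so raw (unscaled) test features can be passed to it directly.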

In [None]:
# Random Forest Regression
rfr = RandomForestRegressor(random_state=444)  # fixed seed so results are reproducible
param_grid_rf = {
    'n_estimators': [100, 200],
    'max_depth': [10, 15, None],
    'max_features': ['sqrt', 'log2'],
    'criterion': ['squared_error'],
    'max_leaf_nodes': [10, None],
    'min_samples_split': [10, 15],
    'min_samples_leaf': [5, 10]
}
rfclf = GridSearchCV(rfr, param_grid=param_grid_rf, cv=5, scoring='neg_mean_squared_error')
rfclf.fit(x_train_scaled, y_train)
best_rfr_model = rfclf.best_estimator_
y_pred_test_rfr = best_rfr_model.predict(x_test_scaled)
test_rmse_rfr = np.sqrt(mean_squared_error(y_test, y_pred_test_rfr))
test_r2_rfr = r2_score(y_test, y_pred_test_rfr)
print(f'Random Forest Regression - Test RMSE: {test_rmse_rfr}, Test R²: {test_r2_rfr}')

In [None]:
# Support Vector Regression
svr = SVR()
params_svr = {
    "C": [0.001, 0.01, 0.1, 1, 10, 100],
    'kernel': ['poly', 'rbf'],
    'gamma': [0.001, 0.01, 0.1, 1],
    'epsilon': [0.01, 0.1, 0.5, 1, 5]
}
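# Only 5 of the 240 parameter combinations are sampled; fix the seed so the sample is reproducible
svgrid = RandomizedSearchCV(svr, param_distributions=params_svr, cv=5, n_iter=5, scoring='neg_mean_squared_error', random_state=444)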
svgrid.fit(x_train_scaled, y_train)
best_svr_model = svgrid.best_estimator_
y_pred_test_svr = best_svr_model.predict(x_test_scaled)
test_rmse_svr = np.sqrt(mean_squared_error(y_test, y_pred_test_svr))
test_r2_svr = r2_score(y_test, y_pred_test_svr)
print(f'SVR - Test RMSE: {test_rmse_svr}, Test R²: {test_r2_svr}')

In [None]:
# Model Comparison
results = {
    'Model': ['Linear Regression', 'Polynomial Regression', 'KNN Regression', 'Random Forest Regression', 'SVR'],
    'Test RMSE': [test_rmse, test_errors[np.argmin(test_errors)], test_rmse_knr, test_rmse_rfr, test_rmse_svr],
    'Test R²': [test_r2, test_r2_scores[np.argmax(test_r2_scores)], test_r2_knr, test_r2_rfr, test_r2_svr]
}
results_df = pd.DataFrame(results)
print(results_df)

In [None]:
# Create a summary table of results
results_summary = {
    'Model': ['Linear Regression', 
              f'Polynomial Regression (Degree {best_degree})', 
              'KNN Regression', 
              'Random Forest Regression', 
              'Support Vector Regression'],
    'Test RMSE': [test_rmse, 
                  test_errors[np.argmin(test_errors)], 
                  test_rmse_knr, 
                  test_rmse_rfr, 
                  test_rmse_svr],
    'Test R²': [test_r2, 
                test_r2_scores[np.argmax(test_r2_scores)], 
                test_r2_knr, 
                test_r2_rfr, 
                test_r2_svr]
}

# Convert to a DataFrame for better visualization
results_df = pd.DataFrame(results_summary)
print(results_df)

# Alternatively, display the table in Markdown format
from IPython.display import Markdown, display

markdown_table = """
| **Model**                  | **Test RMSE** | **Test R²** |
|----------------------------|---------------|-------------|
| Linear Regression          | {lr_rmse:.2f}     | {lr_r2:.3f}     |
| Polynomial Regression (Degree {best_degree}) | {poly_rmse:.2f} | {poly_r2:.3f} |
| KNN Regression             | {knn_rmse:.2f}    | {knn_r2:.3f}    |
| Random Forest Regression   | {rf_rmse:.2f}     | {rf_r2:.3f}     |
| Support Vector Regression  | {svr_rmse:.2f}    | {svr_r2:.3f}    |
""".format(
    lr_rmse=test_rmse,
    lr_r2=test_r2,
    best_degree=best_degree,
    poly_rmse=test_errors[np.argmin(test_errors)],
    poly_r2=test_r2_scores[np.argmax(test_r2_scores)],
    knn_rmse=test_rmse_knr,
    knn_r2=test_r2_knr,
    rf_rmse=test_rmse_rfr,
    rf_r2=test_r2_rfr,
    svr_rmse=test_rmse_svr,
    svr_r2=test_r2_svr
)

display(Markdown(markdown_table))

## Final Conclusion
Based on the test-set results, **Random Forest Regression** performs best for salary prediction, with the lowest RMSE (**12157.69**) and the highest R² (**0.946**). KNN Regression comes second, and Support Vector Regression performs worst (RMSE 30956.66, R² 0.651). Random Forest is therefore the recommended model for future predictions.

| **Model**                  | **Test RMSE** | **Test R²** |
|----------------------------|---------------|-------------|
| Linear Regression          | 22647.75      | 0.813       |
| Polynomial Regression      | 19210.03      | 0.866       |
| KNN Regression             | 15689.04      | 0.910       |
| Random Forest Regression   | 12157.69      | 0.946       |
| Support Vector Regression  | 30956.66      | 0.651       |
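
The ranking stated in the conclusion can also be read off programmatically. The sketch below copies the figures from the table above into a DataFrame and sorts by test RMSE; the column name `Test R2` is an ASCII spelling chosen for this example.

```python
import pandas as pd

# Figures copied from the results table above
results = pd.DataFrame({
    "Model": [
        "Linear Regression",
        "Polynomial Regression",
        "KNN Regression",
        "Random Forest Regression",
        "Support Vector Regression",
    ],
    "Test RMSE": [22647.75, 19210.03, 15689.04, 12157.69, 30956.66],
    "Test R2": [0.813, 0.866, 0.910, 0.946, 0.651],
})

# Lower RMSE is better, so the first row after sorting is the best model
ranked = results.sort_values("Test RMSE").reset_index(drop=True)
best_model = ranked.loc[0, "Model"]

print(ranked)
print("Best model by test RMSE:", best_model)
```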