<a href="https://colab.research.google.com/github/Pavel-Zinkevich/Employee_salary/blob/main/Employee_salary.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# 💼 Salary Prediction Project

A machine learning project for predicting employee salaries based on various features such as age, experience, job title, department, education level, and location.

## 🔍 What This Project Does

This notebook demonstrates how to:

- ✅ Prepare and preprocess employee data
- ✅ Train and evaluate **Linear Regression** and **XGBoost Regressor** models
- ✅ Compare their performance using **MSE** and **R²**
- ✅ Build **interactive widgets** to predict salary from user input
- ✅ Visualize salary distribution across locations with **Plotly graphs**

# Libraries and data import

In [6]:
#!pip install opendatasets

In [1]:
#import opendatasets as od
import os

import kagglehub
import pandas as pd
import xgboost as xgb
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
import ipywidgets as widgets
from IPython.display import display, clear_output
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

In [2]:
# Download latest version
path = kagglehub.dataset_download("gmudit/employer-data")

print("Path to dataset files:", path)

Downloading from https://www.kaggle.com/api/v1/datasets/download/gmudit/employer-data?dataset_version_number=1...


100%|██████████| 163k/163k [00:00<00:00, 44.0MB/s]

Extracting files...
Path to dataset files: /root/.cache/kagglehub/datasets/gmudit/employer-data/versions/1





In [3]:
print(os.listdir(path))
csv_path = os.path.join(path, "Employers_data.csv")

['Employers_data.csv']


In [7]:
#od.download("https://www.kaggle.com/datasets/gmudit/employer-data")

In [4]:
#df = pd.read_csv("employer-data/Employers_data.csv")
df = pd.read_csv(csv_path)

📊 Employee Salary Dataset
This synthetic dataset has been created for educational purposes and is ideal for exploring regression modeling. It includes realistic employee information with consistent relationships between features such as education level, job title, experience, and salary.

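🧾 Dataset Summary

| Property        | Description                                                               |
|-----------------|---------------------------------------------------------------------------|
| Rows            | 10,000                                                                    |
| Target Variable | Salary (or another column as needed)                                      |
| Use Case        | Regression, EDA, feature engineering, model evaluation, fairness analysis |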

📌 Features

| Column            | Type        | Description                                                                 |
|-------------------|-------------|-----------------------------------------------------------------------------|
| `Employee_ID`     | Integer     | Unique identifier for each employee                                         |
| `Name`            | String      | Full name (gender-aware generation)                                         |
| `Gender`          | Categorical | Male or Female                                                              |
| `Age`             | Integer     | Age of the employee (based on education level and job title)                |
| `Education_Level` | Categorical | One of: High School, Bachelor, Master, PhD                                  |
| `Experience_Years`| Integer     | Number of years of professional experience                                  |
| `Department`      | Categorical | Business unit (e.g., HR, Engineering, Marketing, etc.)                      |
| `Job_Title`       | Categorical | Role of the employee (e.g., Analyst, Engineer, Manager, etc.)              |
| `Location`        | Categorical | Work location (e.g., New York, San Francisco, etc.)                         |
| `Salary`          | Integer     | Annual salary in USD — target variable for regression     |

✅ Characteristics
- ✅ No missing values — all entries are complete  
- 📈 Realistic correlations, such as:
  - 🎓 Higher education → 💼 Higher job levels → 💰 Higher salaries  
  - 🧑‍🎓 Interns are younger and earn less  
  - 🧠 PhDs tend to be older and hold senior roles  
  - 🌍 Salary varies across departments and locations
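
The characteristics listed above can be spot-checked with a few pandas calls. The sketch below is illustrative only: `toy` is a hand-made frame with hypothetical values (not rows from the dataset) and `summarize` is a helper name introduced here; in the notebook, `df` would be passed instead.

```python
import pandas as pd

# Hypothetical stand-in rows; in the notebook, pass `df` instead.
toy = pd.DataFrame({
    "Education_Level": ["Bachelor", "Bachelor", "Master", "PhD", "PhD"],
    "Job_Title": ["Intern", "Analyst", "Engineer", "Manager", "Executive"],
    "Age": [22, 27, 33, 45, 52],
    "Salary": [30000, 60000, 90000, 130000, 180000],
})

def summarize(frame: pd.DataFrame) -> dict:
    """Return the total missing-value count and mean age/salary per education level."""
    return {
        "missing_total": int(frame.isna().sum().sum()),
        "by_education": frame.groupby("Education_Level")[["Age", "Salary"]].mean(),
    }

summary = summarize(toy)
print("Missing values:", summary["missing_total"])
print(summary["by_education"])
```

On the real data, a `missing_total` of 0 would confirm the "no missing values" claim, and the grouped means would show whether age and salary rise with education level.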

🎯 Applications

This dataset is well suited for:

- 🔢 Regression modeling (Linear, XGBoost, Random Forest, etc.)
- 🛠️ Feature engineering and selection
- 🧠 Categorical variable encoding (one-hot, label encoding, etc.)
- 🔍 Hyperparameter tuning
- ⚖️ Bias & fairness analysis (e.g., gender pay gap)
- 📊 Exploratory Data Analysis (EDA) and visualization
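
To make the encoding options mentioned above concrete, here is a minimal sketch contrasting one-hot encoding with an integer (label-style) encoding of `Education_Level`. The frame `toy` and the ranking in `order` are assumptions made for illustration; the ordering of education levels is a modeling choice, not something the dataset prescribes.

```python
import pandas as pd

# Hypothetical frame holding the four education levels from the feature table.
toy = pd.DataFrame({"Education_Level": ["High School", "Bachelor", "Master", "PhD"]})

# One-hot: one 0/1 column per category; drop_first removes one redundant column
# (the alphabetically first category, here "Bachelor").
one_hot = pd.get_dummies(toy["Education_Level"], prefix="Edu", drop_first=True, dtype=int)

# Integer encoding: a single column with an explicit, assumed ranking.
order = {"High School": 0, "Bachelor": 1, "Master": 2, "PhD": 3}
ordinal = toy["Education_Level"].map(order)

print(one_hot)
print(ordinal.tolist())
```

One-hot encoding makes no assumption about order, which suits linear models; an integer encoding is more compact but implies equal spacing between levels.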



In [None]:
df.head()

In [None]:
df.Location.value_counts()

# Data manipulation and data visualisation

In [None]:
age_counts = df['Age'].value_counts().sort_index()


fig = px.bar(
    x=age_counts.index,
    y=age_counts.values,
    labels={'x': 'Age', 'y': 'Count'},
    title='Age Distribution',
    color_discrete_sequence=['#00CC96']
)


fig.update_layout(
    xaxis_title='Age',
    yaxis_title='Count',
    plot_bgcolor='white',
    title_font=dict(size=20),
    xaxis=dict(showgrid=True, gridcolor='lightgrey'),
    yaxis=dict(showgrid=True, gridcolor='lightgrey')
)

fig.show()

**📊 Age Distribution Analysis**

The age distribution of employees is skewed toward younger individuals, with the highest concentration between the ages of 23 and 30. Notably:

- 📈 Ages 24–26 and 30 have the highest counts, each with over 500 employees.

- 👥 The majority of employees fall between 23 and 40 years old.

- 📉 After age 40, the distribution gradually declines, with significantly fewer employees over 50.

- 🧓 Employees aged 55 and above represent a small portion of the dataset.
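
The observations above are read off the chart; a hedged way to back them with numbers is to bin ages and compute the share in each band. In this sketch the band edges, the helper name `age_band_shares`, and the sample ages are all illustrative assumptions; in the notebook, `df['Age']` would be passed instead.

```python
import pandas as pd

def age_band_shares(ages: pd.Series) -> pd.Series:
    """Share of employees per age band; the band edges are illustrative."""
    bins = [0, 22, 30, 40, 50, 120]
    labels = ["<23", "23-30", "31-40", "41-50", "51+"]
    # pd.cut uses right-inclusive intervals by default: (22, 30] maps to "23-30".
    bands = pd.cut(ages, bins=bins, labels=labels)
    return bands.value_counts(normalize=True).sort_index()

# Hypothetical ages, not taken from the dataset.
toy_ages = pd.Series([23, 24, 25, 26, 30, 35, 41, 56])
shares = age_band_shares(toy_ages)
print(shares)
```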

In [None]:
fig = px.histogram(
    df,
    x='Salary',
    nbins=20,
    title='Salary Distribution',
    labels={'Salary': 'Salary'},
    color_discrete_sequence=['#636EFA'],
)


fig.update_layout(
    bargap=0.1,
    plot_bgcolor='white',
    title_font=dict(size=20),
    xaxis=dict(title='Salary', showgrid=True, gridcolor='lightgrey'),
    yaxis=dict(title='Count', showgrid=True, gridcolor='lightgrey'),
)

fig.show()

**💰 Salary Distribution Analysis**

The histogram shows a right-skewed salary distribution: most employees earn in the lower to middle part of the range, with a long tail of higher salaries.

Key Observations:
- 🏆 The most common salary range is $70,000–80,000, which has the highest frequency (over 1,400 records).

- 📊 A large number of employees also earn between $60,000 and 90,000.

- 💼 Salaries above $150,000 are less frequent but still present in notable numbers.

- 📉 Very high and very low salaries are rare.
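
Right skew can also be checked numerically rather than by eye: in a right-skewed distribution the mean typically sits above the median and the sample skewness is positive. The sketch below assumes nothing about the real data; `toy_salary` holds hypothetical values and `skew_summary` is a helper name introduced here. In the notebook, `df['Salary']` would be passed instead.

```python
import pandas as pd

def skew_summary(salary: pd.Series) -> dict:
    """Mean, median, and sample skewness of a salary column."""
    return {
        "mean": float(salary.mean()),
        "median": float(salary.median()),
        "skew": float(salary.skew()),
    }

# Hypothetical salaries with a long upper tail.
toy_salary = pd.Series([40000, 60000, 70000, 75000, 80000, 90000, 160000, 200000])
stats = skew_summary(toy_salary)
print(stats)
```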



In [None]:
education_counts = df.groupby(['Department', 'Education_Level']).size().reset_index(name='Count')

fig = px.bar(
    education_counts,
    x='Department',
    y='Count',
    color='Education_Level',
    title='Education Level by Department',
    labels={'Count': 'Count', 'Department': 'Department', 'Education_Level': 'Education Level'},
    color_discrete_sequence=px.colors.qualitative.Pastel,
)


fig.update_layout(
    barmode='stack',
    xaxis_title='Department',
    yaxis_title='Count',
    plot_bgcolor='white',
    title_font=dict(size=20),
    xaxis=dict(tickangle=45, showgrid=False),
    yaxis=dict(showgrid=True, gridcolor='lightgrey'),
    legend_title_text='Education Level'
)

fig.show()

**🎓 Education Level by Department**

This stacked bar chart displays the distribution of education levels (Bachelor, Master, PhD) across various departments.

Key Insights:
- 👨‍💻 Engineering has the highest number of employees with Bachelor's degrees, indicating a strong foundation in undergraduate education for technical roles.

- 🧮 Finance, HR, and Product departments have a higher proportion of employees with Master’s degrees, reflecting the need for specialized knowledge in these fields.

- 📈 PhD holders are relatively evenly distributed, but HR and Product show a slightly higher presence of doctorate-level employees.

- 📊 In all departments, the Master's degree is the most common education level, followed by Bachelor, then PhD.
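
Because departments differ in size, statements about "higher proportion" are easier to verify with row-normalized shares than with the raw stacked counts. A minimal sketch using `pd.crosstab` follows; the frame `toy` contains hypothetical rows, and in the notebook `df['Department']` and `df['Education_Level']` would be passed instead.

```python
import pandas as pd

# Hypothetical rows, not taken from the dataset.
toy = pd.DataFrame({
    "Department": ["Engineering", "Engineering", "Engineering", "HR", "HR", "Finance"],
    "Education_Level": ["Bachelor", "Bachelor", "Master", "Master", "PhD", "Master"],
})

# normalize="index" makes each department's row sum to 1.
shares = pd.crosstab(toy["Department"], toy["Education_Level"], normalize="index")
print(shares.round(2))
```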

In [None]:
job_counts = df['Job_Title'].value_counts().reset_index()
job_counts.columns = ['Job_Title', 'Count']

fig = px.bar(
    job_counts,
    x='Job_Title',
    y='Count',
    title='Number of Employees per Job Title',
    color='Job_Title',
    color_discrete_sequence=px.colors.qualitative.Dark2
)

fig.update_layout(
    xaxis_title='Job Title',
    yaxis_title='Count',
    plot_bgcolor='white',
    title_font=dict(size=20),
    xaxis=dict(tickangle=45, showgrid=False),
    yaxis=dict(showgrid=True, gridcolor='lightgrey'),
    showlegend=False
)

fig.show()

**💼 Number of Employees per Job Title**

This bar chart presents the distribution of employees across different job titles.

Key Insights:
- 👔 Managers form the largest group in the dataset, with over 3,000 employees. This suggests that many roles may have managerial responsibilities or titles.

- 📊 Analysts are the second most common, showing the importance of data and business analysis across departments.

- 🧑‍💼 Executives follow, representing a significant portion of leadership or senior decision-making roles.

- 🛠️ Engineers are fewer in number compared to managerial and analytical roles, despite being essential to technical departments.

- 🎓 Interns are the least represented, indicating either a limited internship program or a focus on full-time positions.
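
The ranking above can be expressed as a small table of counts and shares, which makes it easy to state what fraction of the workforce each title represents. This is a sketch under assumptions: `title_table` is a helper name introduced here and `toy_titles` is hypothetical; in the notebook, `df['Job_Title']` would be passed instead.

```python
import pandas as pd

def title_table(titles: pd.Series) -> pd.DataFrame:
    """Counts and shares per job title, most common first."""
    counts = titles.value_counts()
    return pd.DataFrame({"Count": counts, "Share": counts / counts.sum()})

# Hypothetical titles, not taken from the dataset.
toy_titles = pd.Series(["Manager"] * 3 + ["Analyst"] * 2 + ["Intern"])
table = title_table(toy_titles)
print(table)
```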

In [None]:
# Binary gender flag: 1 for Male, 0 otherwise
df["Gender_encoded"] = (df["Gender"] == "Male").astype(int)

In [None]:
df_dummies = pd.get_dummies(df.drop(columns = ["Name", "Employee_ID", "Gender"]), drop_first=True)

In [None]:
df_dummies.head()

# Model construction

In [None]:
def model_fit(df, model):
    """Fit `model` on an 80/20 split of `df`, print test MSE and R², and plot residuals.

    Returns (y_pred, mse, r2, fitted model).
    """
    x = df.drop(columns=['Salary'])
    y = df['Salary']
    # Fixed seed so every model is evaluated on the same held-out 20%
    x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=42)
    model.fit(x_train, y_train)
    y_pred = model.predict(x_test)
    mse = mean_squared_error(y_test, y_pred)
    r2 = r2_score(y_test, y_pred)
    print('MSE:', mse)
    print('R2:', r2)
    # Residual histogram: a shape centered on zero suggests no systematic bias
    residuals = y_test - y_pred
    plt.figure(figsize=(7, 4))
    sns.histplot(residuals, bins=40, kde=True)
    plt.axvline(0, color='red', linestyle='--')
    plt.title('Distribution of Residuals')
    plt.xlabel('Error (Actual - Predicted)')
    plt.ylabel('Frequency')
    plt.grid(True)
    plt.show()

    return y_pred, mse, r2, model

# Fit and prediction

In [None]:
# from sklearn.model_selection import GridSearchCV

# param_grid = {
#     'n_estimators': [100, 200, 300],
#     'max_depth': [3, 5, 7],
#     'learning_rate': [0.01, 0.05, 0.1],
# }

# grid = GridSearchCV(xgb.XGBRegressor(objective='reg:squarederror'), param_grid, scoring='neg_mean_absolute_error')
# grid.fit(X_train, y_train)

# print("Best params:", grid.best_params_)
# best_params = grid.best_params_

# Best params: {'learning_rate': 0.05, 'max_depth': 3, 'min_child_weight': 5, 'n_estimators': 300, 'subsample': 1}

In [None]:
%%time
best_params = {
    'learning_rate': 0.04,
    'max_depth': 3,
    'min_child_weight': 20,
    'n_estimators': 500,
}
model_fit(df_dummies, xgb.XGBRegressor(objective='reg:squarederror', **best_params))

In [None]:
%%time
_, _, _, model_trained = model_fit(df_dummies, LinearRegression())

| Model           | MSE (Mean Squared Error) | R² (Coefficient of Determination) | Training Time (approx.) |
|-----------------|--------------------------|----------------------------------|------------------------|
| XGBRegressor    | 17,974,718               | 0.9915                           | ~0.8 sec               |
| LinearRegression| 17,375,021               | 0.9918                           | ~0.27 sec              |

Model Comparison Conclusion:

The Linear Regression model slightly outperformed the XGBoost model on this dataset, with a lower Mean Squared Error (17,375,021 vs. 17,974,718) and a marginally higher R² (0.9918 vs. 0.9915). The XGBoost model, trained with hyperparameters based on a grid search (`param_grid`), performed almost as well.

Linear Regression trains faster and is simpler, making it a good choice for interpretability and quick results. XGBoost, being a more complex and flexible model, may offer better results on more complex datasets.

In summary, while Linear Regression performed better here, XGBoost remains a powerful alternative worth exploring depending on the problem complexity and computational resources.
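
To put the table's MSE values into salary units, the sketch below derives the RMSE for each model and uses the identity R² = 1 − MSE / Var(y) to back out the variance of the test target implied by each row. The numbers are copied from the comparison table above; the helper name `interpret` is introduced here for illustration.

```python
import math

# (MSE, R²) pairs copied from the comparison table.
results = {
    "XGBRegressor": (17_974_718, 0.9915),
    "LinearRegression": (17_375_021, 0.9918),
}

def interpret(mse: float, r2: float) -> dict:
    """Derive RMSE and the target variance implied by R² = 1 - MSE / Var(y)."""
    return {"rmse": math.sqrt(mse), "implied_var": mse / (1.0 - r2)}

summary = {name: interpret(mse, r2) for name, (mse, r2) in results.items()}
for name, s in summary.items():
    print(f"{name}: RMSE ~ {s['rmse']:,.0f} USD, "
          f"implied std of y ~ {math.sqrt(s['implied_var']):,.0f} USD")
```

Both rows imply nearly the same target variance, as expected for models scored on the same test split, and the RMSE gap between the two models is well under 2%.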


# New Data Entry

In [None]:
categorical_features = ['Gender', 'Department', 'Job_Title', 'Education_Level', 'Location']
numeric_features = ['Age', 'Experience_Years']

preprocessor = ColumnTransformer(
    transformers=[
        ('num', 'passthrough', numeric_features),
        ('cat', OneHotEncoder(drop='first'), categorical_features)
    ])

model = Pipeline([
    ('preprocessor', preprocessor),
    ('regressor', LinearRegression())
])


X = df.drop(columns=['Salary', 'Name', 'Employee_ID'])
y = df['Salary']
model.fit(X, y)


new_example = pd.DataFrame({
    'Age': [24],
    'Gender': ['Female'],
    'Department': ['Product'],
    'Job_Title': ['Executive'],
    'Experience_Years': [2],
    'Education_Level': ['Bachelor'],
    'Location': ['Austin']
})

predicted_salary = model.predict(new_example)
print(f"Estimated Salary : ${predicted_salary[0]:,.2f}")


In [None]:
age_slider = widgets.IntSlider(min=18, max=65, step=1, value=24, description='Age:')
gender_dropdown = widgets.Dropdown(options=['Male', 'Female'], description='Gender:')
department_dropdown = widgets.Dropdown(options=df['Department'].unique(), description='Department:')
job_title_dropdown = widgets.Dropdown(options=df['Job_Title'].unique(), description='Job Title:')
experience_slider = widgets.IntSlider(min=0, max=40, step=1, value=1, description='Experience (years):')
education_dropdown = widgets.Dropdown(options=df['Education_Level'].unique(), description='Education:')
location_dropdown = widgets.Dropdown(options=df['Location'].unique(), description='Location:')

output = widgets.Output()

def predict_salary(change=None):
    with output:
        clear_output()

        input_data = pd.DataFrame({
            'Age': [age_slider.value],
            'Gender': [gender_dropdown.value],
            'Department': [department_dropdown.value],
            'Job_Title': [job_title_dropdown.value],
            'Experience_Years': [experience_slider.value],
            'Education_Level': [education_dropdown.value],
            'Location': [location_dropdown.value]
        })

        predicted_salary = model.predict(input_data)

        print(f"Estimated Salary : ${predicted_salary[0]:,.2f}")

age_slider.observe(predict_salary, names='value')
gender_dropdown.observe(predict_salary, names='value')
department_dropdown.observe(predict_salary, names='value')
job_title_dropdown.observe(predict_salary, names='value')
experience_slider.observe(predict_salary, names='value')
education_dropdown.observe(predict_salary, names='value')
location_dropdown.observe(predict_salary, names='value')

display(age_slider, gender_dropdown, department_dropdown, job_title_dropdown,
        experience_slider, education_dropdown, location_dropdown, output)

predict_salary()