# 💼 Salary Dataset – Exploratory Data Analysis & Regression Modeling

---

## üßë‚Äçüíª Designed & Prepared By
**Muhammad Anas**

---

## 📁 Dataset Metadata

- **Dataset Name:** Salary Dataset
- **Source:** Kaggle
- **Problem Type:** Regression
- **Target Variable:** `Salary`
- **Total Rows:** 6,704 (before rows with missing values are dropped)
- **Total Columns:** 6
- **Data Format:** CSV
- **Use Case:** Predicting employee salary based on experience and demographic features

---

## 📊 Dataset Description

This dataset contains information about employees, including their age, gender, education level, job role, and years of experience.  
The objective is to predict the **salary** of an employee using these features.

The dataset is suitable for:
- Exploratory Data Analysis (EDA)
- Regression modeling
- Decision Tree Regressor training
- Feature engineering practice

---

## 🗂️ Column Descriptions

| Column Name | Description |
|------------|-------------|
| **Age** | Age of the employee (in years) |
| **Gender** | Gender of the employee |
| **Education Level** | Highest education qualification |
| **Job Title** | Employee job designation |
| **Years of Experience** | Total professional experience (in years) |
| **Salary** | Annual salary of the employee (Target Variable) |

---

## 🎯 Problem Statement

The goal of this project is to build a **Decision Tree Regressor model** that predicts an employee's salary from their age, gender, education level, job title, and years of experience.

---

## 🔍 Initial Observations

- The dataset contains both **numerical** and **categorical** features
- Categorical variables require encoding before model training
- Salary is a continuous variable, making this a **regression problem**
- Feature scaling is not required for Decision Tree models, because each split only compares a feature value against a threshold
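
The last observation can be checked empirically. The sketch below uses synthetic data (not the Salary Dataset; the feature ranges and the salary formula are made up for illustration) and fits the same tree on raw and on standardized features. Since a tree only compares feature values against thresholds, a monotonic rescaling should leave the predictions unchanged.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)

# Synthetic stand-ins for two numeric features (assumed ranges, not real data)
age = rng.uniform(21, 60, size=300)
experience = rng.uniform(0, 30, size=300)
X = np.column_stack([age, experience])
y = 30000 + 1000 * age + 4000 * experience + rng.normal(0, 5000, size=300)

# Standardize each column to zero mean and unit variance
X_scaled = (X - X.mean(axis=0)) / X.std(axis=0)

# Same hyperparameters and seed for both fits, only the feature scale differs
tree_raw = DecisionTreeRegressor(max_depth=5, random_state=42).fit(X, y)
tree_scaled = DecisionTreeRegressor(max_depth=5, random_state=42).fit(X_scaled, y)

# Compare the predictions of the two trees on their respective inputs
same_predictions = np.allclose(tree_raw.predict(X), tree_scaled.predict(X_scaled))
print(same_predictions)
```

Distance-based or gradient-based models would not generally pass this check, which is why scaling matters for them and not for trees.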

---

## ⚙️ Model Used

### 🌳 Decision Tree Regressor

A Decision Tree Regressor predicts continuous values by learning decision rules from data features.  
It splits the data recursively, choosing at each node the split that most reduces a prediction-error criterion such as **Mean Squared Error (MSE)**.
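
To make the splitting idea concrete, here is a minimal sketch of how one split on one feature could be chosen by minimizing the weighted MSE of the two child nodes. The `best_split` helper and the toy numbers are hypothetical; scikit-learn's actual implementation is more elaborate and considerably faster.

```python
import numpy as np

def best_split(x, y):
    """Return (threshold, weighted_mse) of the best single split on one feature."""
    order = np.argsort(x)
    x_sorted, y_sorted = x[order], y[order]
    best_threshold, best_score = None, np.inf
    for i in range(1, len(x_sorted)):
        if x_sorted[i] == x_sorted[i - 1]:
            continue  # identical values cannot be separated by a threshold
        left, right = y_sorted[:i], y_sorted[i:]
        # Each child predicts its own mean, so its MSE equals its variance
        score = (len(left) * left.var() + len(right) * right.var()) / len(y_sorted)
        if score < best_score:
            best_threshold = (x_sorted[i - 1] + x_sorted[i]) / 2
            best_score = score
    return best_threshold, best_score

# Toy data (made up): years of experience vs. salary
years = np.array([1.0, 2.0, 3.0, 10.0, 12.0, 15.0])
salary = np.array([40000.0, 45000.0, 50000.0, 120000.0, 130000.0, 150000.0])

threshold, score = best_split(years, salary)
print(threshold, score)
```

A full tree repeats this search over every feature at every node and then recurses into the two children until a stopping rule is met.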

---

## 📈 Evaluation Metrics

The model performance is evaluated using:
- **Mean Squared Error (MSE)**
- **Mean Absolute Error (MAE)**
- **Root Mean Squared Error (RMSE)**
- **R² Score**
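
The four metrics listed above can be computed directly from their definitions. The sketch below does so with NumPy on made-up numbers; the `regression_metrics` helper is hypothetical and only meant to show what each metric measures.

```python
import numpy as np

def regression_metrics(y_true, y_pred):
    """Compute MSE, MAE, RMSE and R^2 from their textbook definitions."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    errors = y_true - y_pred
    mse = np.mean(errors ** 2)            # average squared error
    mae = np.mean(np.abs(errors))         # average absolute error
    rmse = np.sqrt(mse)                   # same units as the target
    ss_res = np.sum(errors ** 2)          # unexplained variation
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total variation
    r2 = 1.0 - ss_res / ss_tot
    return {"mse": mse, "mae": mae, "rmse": rmse, "r2": r2}

# Toy salaries (made up) and predictions
metrics = regression_metrics(
    [50000, 60000, 70000, 80000],
    [52000, 58000, 71000, 79000],
)
print(metrics)
```

MAE and RMSE are expressed in the same units as the salary, while R² is unitless and compares the model against always predicting the mean.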

---

## ✅ Conclusion

The Salary Dataset provides a clear and practical example for applying **EDA and regression techniques**.  
Decision Tree Regressor is effective in capturing non-linear relationships between features and salary when properly tuned.

---

✨ *This notebook was created for learning and practicing regression modeling using the Salary Dataset.*


In [1]:
#loading the important libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_absolute_error,mean_squared_error,r2_score
import warnings
warnings.filterwarnings('ignore')

In [2]:
# Load dataset
df=pd.read_csv(r"F:\Anas_Data\Data_Science\Raw_Datasets\Salary_Data.csv")  # raw string so backslashes are not treated as escapes
df.head()

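|   | Age | Gender | Education Level | Job Title | Years of Experience | Salary |
|---|-----|--------|-----------------|-----------|---------------------|--------|
| 0 | 32.0 | Male | Bachelor's | Software Engineer | 5.0 | 90000.0 |
| 1 | 28.0 | Female | Master's | Data Analyst | 3.0 | 65000.0 |
| 2 | 45.0 | Male | PhD | Senior Manager | 15.0 | 150000.0 |
| 3 | 36.0 | Female | Bachelor's | Sales Associate | 7.0 | 60000.0 |
| 4 | 52.0 | Male | Master's | Director | 20.0 | 200000.0 |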


In [4]:
#checking the shape of dataset
shape=df.shape
print(f"The dataset contains {shape[0]} rows and {shape[1]} columns.")

The dataset contains 6704 rows and 6 columns.


In [6]:
#checking for the missing values
missing_values=df.isnull().sum()
missing_values.sort_values(ascending=False)

Salary                 5
Education Level        3
Years of Experience    3
Age                    2
Gender                 2
Job Title              2
dtype: int64

In [7]:
#removing the missing values
df=df.dropna()

In [9]:
#checking for missing values after removing
missing_values_after=df.isnull().sum()
missing_values_after.sort_values(ascending=False)

Age                    0
Gender                 0
Education Level        0
Job Title              0
Years of Experience    0
Salary                 0
dtype: int64

In [16]:
#selecting the categorical (object dtype) columns
categorical_cols = df.select_dtypes(include=['object']).columns
categorical_cols


Index(['Gender', 'Education Level', 'Job Title'], dtype='object')

In [17]:
#encoding the categorical columns (one encoder per column, so each mapping can be inverted later)
encoders={}
for col in categorical_cols:
    encoders[col]=LabelEncoder()
    df[col]=encoders[col].fit_transform(df[col])
    

In [None]:
#checking the first five rows of the dataset after encoding
df.head()


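|   | Age | Gender | Education Level | Job Title | Years of Experience | Salary |
|---|-----|--------|-----------------|-----------|---------------------|--------|
| 0 | 32.0 | 1 | 0 | 175 | 5.0 | 90000.0 |
| 1 | 28.0 | 0 | 3 | 18 | 3.0 | 65000.0 |
| 2 | 45.0 | 1 | 5 | 144 | 15.0 | 150000.0 |
| 3 | 36.0 | 0 | 0 | 115 | 7.0 | 60000.0 |
| 4 | 52.0 | 1 | 3 | 25 | 20.0 | 200000.0 |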
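
In the preview above, Bachelor's, Master's, and PhD are encoded as 0, 3, and 5, which suggests that `Education Level` holds more than three distinct labels; inspecting `df["Education Level"].unique()` before encoding would confirm this. `LabelEncoder` also assigns codes in sorted order, which need not match the natural order of the levels. The sketch below illustrates both points on toy labels; the `education_rank` mapping is a hypothetical alternative, not part of the original notebook.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy labels for illustration only; the real column may contain other values
education = pd.Series(["High School", "Bachelor's", "Master's", "PhD", "Bachelor's"])

# LabelEncoder assigns integer codes in sorted (alphabetical) order
encoder = LabelEncoder()
alphabetical_codes = encoder.fit_transform(education)
print(list(encoder.classes_))
print(list(alphabetical_codes))

# Hypothetical explicit ranking that follows the natural order of the levels
education_rank = {"High School": 0, "Bachelor's": 1, "Master's": 2, "PhD": 3}
ordinal_codes = education.map(education_rank)
print(list(ordinal_codes))
```

With the alphabetical codes, "High School" lands between "Bachelor's" and "Master's", so a threshold split on the encoded column does not correspond to "at least this level of education". An explicit mapping keeps that interpretation.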


In [19]:
# split the dataset into features and target variable
X = df.drop('Salary', axis=1)
y = df['Salary']

In [20]:
#split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [21]:
#train the model 
model = DecisionTreeRegressor()
model.fit(X_train, y_train)

| Parameter | Value |
|-----------|-------|
| criterion | 'squared_error' |
| splitter | 'best' |
| max_depth | None |
| min_samples_split | 2 |
| min_samples_leaf | 1 |
| min_weight_fraction_leaf | 0.0 |
| max_features | None |
| random_state | None |
| max_leaf_nodes | None |
| min_impurity_decrease | 0.0 |
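
The model above is trained with default hyperparameters, so `max_depth` is unlimited and the tree can keep splitting until its leaves are pure. One common way to tune it is a cross-validated grid search. The sketch below runs on synthetic data, and the grid values are arbitrary assumptions chosen for illustration, not recommendations for the Salary Dataset.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)

# Synthetic data (made up): one feature with a noisy linear effect on salary
X = rng.uniform(0, 25, size=(300, 1))
y = 30000 + 5000 * X[:, 0] + rng.normal(0, 8000, size=300)

# Candidate values to try; these are illustrative assumptions
param_grid = {
    "max_depth": [2, 4, 8, None],
    "min_samples_leaf": [1, 5, 20],
}

search = GridSearchCV(
    DecisionTreeRegressor(random_state=42),  # fixed seed for reproducible fits
    param_grid,
    cv=5,
    scoring="neg_root_mean_squared_error",   # higher (closer to 0) is better
)
search.fit(X, y)

print(search.best_params_)
print(-search.best_score_)  # cross-validated RMSE of the best setting
```

In the notebook itself, `search.fit(X_train, y_train)` would take the place of the synthetic data, leaving the test set untouched for the final evaluation.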


In [25]:
#predicting the test set results
y_pred = model.predict(X_test)
y_pred

array([150000.,  75656., 100000., ..., 105000.,  45000.,  60000.],
      shape=(1340,))

In [24]:
#checking the R^2 score of the model (for regressors, score() returns R^2, not classification accuracy)
r2_test = model.score(X_test, y_test)
print(f"R^2 Score (test set): {r2_test:.4f}")

R^2 Score (test set): 0.9710
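
For a regressor, scikit-learn's `score` method returns the coefficient of determination R² rather than a classification accuracy. The sketch below, on synthetic data, checks that `score` matches `r2_score` and also compares training and test R², which is a quick way to spot overfitting in an unconstrained tree.

```python
import numpy as np
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(1)

# Synthetic data (made up), not the Salary Dataset
X = rng.uniform(0, 25, size=(400, 1))
y = 30000 + 5000 * X[:, 0] + rng.normal(0, 8000, size=400)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
model = DecisionTreeRegressor(random_state=42).fit(X_train, y_train)

score_value = model.score(X_test, y_test)           # what the cell above prints
r2_value = r2_score(y_test, model.predict(X_test))  # the same quantity, computed explicitly
train_r2 = model.score(X_train, y_train)            # R^2 on the data the tree has seen

print(score_value, r2_value, train_r2)
```

A large gap between training and test R² would suggest that the tree has memorized noise and that limiting its depth or leaf size is worth trying.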


In [23]:
#checking the evaluation metrics
mae = mean_absolute_error(y_test, y_pred)
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)
rmse= np.sqrt(mse)
print(f"Mean Absolute Error: {mae}")
print(f"Mean Squared Error: {mse}")
print(f"Root Mean Squared Error: {rmse}")
print(f"R^2 Score: {r2}")

Mean Absolute Error: 2932.3448821131306
Mean Squared Error: 82697284.06234702
Root Mean Squared Error: 9093.80470773081
R^2 Score: 0.9709982584866951
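
As a quick sanity check on the numbers above, RMSE should equal the square root of MSE, and RMSE can never be smaller than MAE. The sketch below recomputes both relations from the printed values. Reading the ratio as a sign of a few large errors is an interpretation, not something the notebook verifies.

```python
import math

# Values copied from the output above
mae = 2932.3448821131306
mse = 82697284.06234702

rmse = math.sqrt(mse)
ratio = rmse / mae

print(f"RMSE recomputed from MSE: {rmse:.2f}")
print(f"RMSE / MAE ratio: {ratio:.2f}")
```

A ratio well above 1 may indicate that a small number of predictions are far off, since squaring weights large errors more heavily than the absolute value does.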
