In [1]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler


In [2]:
file_path = r"C:\Users\Student\Downloads\Salary Data.csv"
data = pd.read_csv(file_path)

print(data.head())


    Age  Gender Education Level          Job Title  Years of Experience  \
0  32.0    Male      Bachelor's  Software Engineer                  5.0   
1  28.0  Female        Master's       Data Analyst                  3.0   
2  45.0    Male             PhD     Senior Manager                 15.0   
3  36.0  Female      Bachelor's    Sales Associate                  7.0   
4  52.0    Male        Master's           Director                 20.0   

     Salary  
0   90000.0  
1   65000.0  
2  150000.0  
3   60000.0  
4  200000.0  


In [4]:
data.columns

Index(['Age', 'Gender', 'Education Level', 'Job Title', 'Years of Experience',
       'Salary'],
      dtype='object')

In [3]:
print("Initial Data:")
print(data.head())

Initial Data:
    Age  Gender Education Level          Job Title  Years of Experience  \
0  32.0    Male      Bachelor's  Software Engineer                  5.0   
1  28.0  Female        Master's       Data Analyst                  3.0   
2  45.0    Male             PhD     Senior Manager                 15.0   
3  36.0  Female      Bachelor's    Sales Associate                  7.0   
4  52.0    Male        Master's           Director                 20.0   

     Salary  
0   90000.0  
1   65000.0  
2  150000.0  
3   60000.0  
4  200000.0  


In [44]:
print("\nDataset Information:")
print(data.info())


Dataset Information:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 148654 entries, 0 to 148653
Data columns (total 13 columns):
 #   Column            Non-Null Count   Dtype  
---  ------            --------------   -----  
 0   Id                148654 non-null  int64  
 1   EmployeeName      148654 non-null  object 
 2   JobTitle          148654 non-null  object 
 3   BasePay           148045 non-null  float64
 4   OvertimePay       148650 non-null  float64
 5   OtherPay          148650 non-null  float64
 6   Benefits          112491 non-null  float64
 7   TotalPay          148654 non-null  float64
 8   TotalPayBenefits  148654 non-null  float64
 9   Year              148654 non-null  int64  
 10  Notes             0 non-null       float64
 11  Agency            148654 non-null  object 
 12  Status            0 non-null       float64
dtypes: float64(8), int64(2), object(3)
memory usage: 14.7+ MB
None


In [45]:
print("\nMissing Values:")
print(data.isnull().sum())


Missing Values:
Id                       0
EmployeeName             0
JobTitle                 0
BasePay                609
OvertimePay              4
OtherPay                 4
Benefits             36163
TotalPay                 0
TotalPayBenefits         0
Year                     0
Notes               148654
Agency                   0
Status              148654
dtype: int64


In [5]:
# Assign the result back instead of calling fillna(inplace=True) on a column selection,
# which operates on a temporary copy and will stop updating `data` in pandas 3.0.
data['Years of Experience'] = data['Years of Experience'].fillna(data['Years of Experience'].median())  # Using median for robustness
data['Salary'] = data['Salary'].fillna(data['Salary'].mean())


In [6]:
# Fill the remaining columns with their most frequent value, assigning the result back
# rather than using fillna(inplace=True) on a column selection.
data['Age'] = data['Age'].fillna(data['Age'].mode()[0])  # Most frequent value
data['Gender'] = data['Gender'].fillna(data['Gender'].mode()[0])
data['Education Level'] = data['Education Level'].fillna(data['Education Level'].mode()[0])
data['Job Title'] = data['Job Title'].fillna(data['Job Title'].mode()[0])

In [7]:
print("\nMissing Values After Imputation:")
print(data.isnull().sum())

# Display the cleaned dataset (optional)
print("\nCleaned Dataset:")
print(data.head())


Missing Values After Imputation:
Age                    0
Gender                 0
Education Level        0
Job Title              0
Years of Experience    0
Salary                 0
dtype: int64

Cleaned Dataset:
    Age  Gender Education Level          Job Title  Years of Experience  \
0  32.0    Male      Bachelor's  Software Engineer                  5.0   
1  28.0  Female        Master's       Data Analyst                  3.0   
2  45.0    Male             PhD     Senior Manager                 15.0   
3  36.0  Female      Bachelor's    Sales Associate                  7.0   
4  52.0    Male        Master's           Director                 20.0   

     Salary  
0   90000.0  
1   65000.0  
2  150000.0  
3   60000.0  
4  200000.0  


In [9]:
data['TotalCompensation'] = data['Salary'] 

In [10]:
data = pd.get_dummies(data, columns=['Job Title', 'Education Level'], drop_first=True)

In [12]:

numerical_cols = ['Years of Experience', 'Salary']  # Adjust this based on your dataset
scaler = StandardScaler()

data[numerical_cols] = scaler.fit_transform(data[numerical_cols])
print("\nData after Scaling:")
print(data.head())


Data after Scaling:
    Age  Gender  Years of Experience    Salary  TotalCompensation  \
0  32.0    Male            -0.769440 -0.220147            90000.0   
1  28.0  Female            -1.075664 -0.740475            65000.0   
2  45.0    Male             0.761682  1.028639           150000.0   
3  36.0  Female            -0.463215 -0.844540            60000.0   
4  52.0    Male             1.527243  2.069294           200000.0   

   Job Title_Accountant  Job Title_Administrative Assistant  \
0                 False                               False   
1                 False                               False   
2                 False                               False   
3                 False                               False   
4                 False                               False   

   Job Title_Business Analyst  Job Title_Business Development Manager  \
0                       False                                   False   
1                       False          

In [14]:
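# Note: X still contains the scaled 'Salary' column, and TotalCompensation is a copy of Salary,
# so the target leaks into the features. Drop 'Salary' from X as well before fitting a model.
X = data.drop(['TotalCompensation'], axis=1)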
y = data['TotalCompensation']

In [15]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)


In [16]:
print("\nShapes of the Train and Test Sets:")
print("X_train:", X_train.shape)
print("X_test:", X_test.shape)
print("y_train:", y_train.shape)
print("y_test:", y_test.shape)



Shapes of the Train and Test Sets:
X_train: (300, 179)
X_test: (75, 179)
y_train: (300,)
y_test: (75,)
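
The cells above fit `StandardScaler` on the full dataset before calling `train_test_split`, and `X` keeps the scaled `Salary` column even though `TotalCompensation` is a copy of it. The following is a minimal sketch of one way to avoid both issues, not the notebook's own method. It assumes a small stand-in frame named `toy`: the first five rows repeat the `data.head()` output, and the last five are invented purely for illustration.

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Stand-in data: rows 0-4 match data.head(); rows 5-9 are hypothetical.
toy = pd.DataFrame({
    "Age": [32.0, 28.0, 45.0, 36.0, 52.0, 41.0, 29.0, 38.0, 47.0, 33.0],
    "Years of Experience": [5.0, 3.0, 15.0, 7.0, 20.0, 12.0, 2.0, 9.0, 18.0, 6.0],
    "Salary": [90000.0, 65000.0, 150000.0, 60000.0, 200000.0,
               120000.0, 55000.0, 95000.0, 170000.0, 80000.0],
})

# Keep the target out of the feature matrix entirely.
X = toy.drop(columns=["Salary"])
y = toy["Salary"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Fit the scaler on the training rows only, then reuse it for the test rows,
# so no statistics from the test set influence the preprocessing.
scaler = StandardScaler()
X_train_scaled = pd.DataFrame(
    scaler.fit_transform(X_train), columns=X.columns, index=X_train.index
)
X_test_scaled = pd.DataFrame(
    scaler.transform(X_test), columns=X.columns, index=X_test.index
)

print("X_train:", X_train_scaled.shape)
print("X_test:", X_test_scaled.shape)
```

With this ordering the test rows are scaled using the training mean and standard deviation, so their scaled columns will generally not have a mean of exactly zero. That is expected.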


In [41]:
print("\nProcessed Data Summary:")
print(data.describe())


Processed Data Summary:
            BasePay   OvertimePay      OtherPay      Benefits      TotalPay  \
count  1.486540e+05  1.486540e+05  1.486540e+05  1.486540e+05  1.486540e+05   
mean  -2.600234e-17  9.865595e-17  3.059099e-17  9.789118e-17 -2.508461e-16   
std    1.000003e+00  1.000003e+00  1.000003e+00  1.000003e+00  1.000003e+00   
min   -1.558023e+00 -4.422897e-01 -1.329039e+00 -1.869018e+00 -1.492304e+00   
25%   -7.610324e-01 -4.422889e-01 -4.528992e-01 -1.989493e-01 -7.640884e-01   
50%   -2.889764e-02 -4.422889e-01 -3.521913e-01  5.430481e-16 -6.615046e-02   
75%    6.626619e-01 -3.556711e-02  7.287813e-02  6.315014e-01  6.150586e-01   
max    5.927097e+00  2.095878e+01  4.921953e+01  5.341156e+00  9.755700e+00   

       TotalPayBenefits           Year  Status  TotalCompensation  
count      1.486540e+05  148654.000000     0.0       1.486540e+05  
mean       1.269526e-16    2012.522643     NaN       1.269526e-16  
std        1.000003e+00       1.117538     NaN       1.0000

# Relevance of the Salary Dataset to the Two-Pot System

The Salary dataset is relevant to the Two-Pot system, which aims to balance financial security for employees with the promotion of sustainable retirement savings. The sections below describe five ways in which the dataset's columns (age, gender, education level, job title, years of experience, and salary) bear on that aim:

## 1. Understanding Financial Security

The dataset records each employee's salary, which is a significant factor in determining financial security. Analyzing the salary distribution shows whether employees earn enough to cover their basic needs and still contribute to retirement savings. This is crucial for assessing whether the Two-Pot system can effectively support employees' financial well-being.
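
As a rough illustration of that check, the sketch below computes the median salary and the share of employees earning below a cut-off. It uses only the five salaries shown in the `data.head()` output. The cut-off `BASIC_NEEDS_THRESHOLD` is a hypothetical value chosen for the example, not a figure from the dataset or from the Two-Pot system.

```python
import pandas as pd

# Salaries from the first five rows printed by data.head()
salaries = pd.Series([90000.0, 65000.0, 150000.0, 60000.0, 200000.0], name="Salary")

# Hypothetical annual income needed to cover basic needs (an assumption for this sketch).
BASIC_NEEDS_THRESHOLD = 70000.0

# Boolean mask of employees under the cut-off; its mean is the share of such employees.
below = salaries < BASIC_NEEDS_THRESHOLD
share_below = below.mean()
median_salary = salaries.median()

print(f"Median salary: {median_salary:,.0f}")
print(f"Share of employees below the threshold: {share_below:.0%}")
```

On the full dataset the same two lines would be applied to the unscaled `Salary` column, before the `StandardScaler` step.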

## 2. Insights into Salary Disparities

By comparing salary across age, gender, education level, and job title, the dataset can help identify potential salary disparities within the workforce. These insights support the Two-Pot system's objective of promoting equity in compensation. Addressing disparities leads to fairer financial conditions for all employees and improves their ability to save for retirement.
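
One simple way to look for such disparities is to compare salary summaries by group. The sketch below assumes the raw, unscaled columns and uses only the five rows shown in the `data.head()` output, so the numbers illustrate the mechanics rather than support any conclusion about the workforce.

```python
import pandas as pd

# First five rows as printed by data.head(), before encoding and scaling
sample = pd.DataFrame({
    "Gender": ["Male", "Female", "Male", "Female", "Male"],
    "Education Level": ["Bachelor's", "Master's", "PhD", "Bachelor's", "Master's"],
    "Salary": [90000.0, 65000.0, 150000.0, 60000.0, 200000.0],
})

# Group size, median, and mean salary for each category
by_gender = sample.groupby("Gender")["Salary"].agg(["count", "median", "mean"])
by_education = sample.groupby("Education Level")["Salary"].agg(["count", "median", "mean"])

print(by_gender)
print(by_education)
```

Reporting the group size next to the median helps avoid reading too much into categories with only a handful of employees. A raw gap between groups also does not control for job title or experience, so it is a starting point for investigation rather than evidence of unequal pay on its own.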

## 3. Evaluating Career Advancement and Savings Potential

The dataset includes years of experience, which can affect salary levels. Understanding how experience correlates with compensation can inform policies that encourage employees to invest in their professional development. The Two-Pot system encourages employees to prioritize their savings, and knowing how their earnings grow with experience can motivate them to contribute more to their retirement funds.
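
The relationship between experience and salary can be quantified with a correlation coefficient and a straight-line fit. The sketch below is a minimal example under the assumption that a linear trend is a reasonable first approximation. It uses only the five rows from the `data.head()` output, so the fitted slope is illustrative.

```python
import numpy as np
import pandas as pd

# Years of experience and salary from the first five rows of data.head()
sample = pd.DataFrame({
    "Years of Experience": [5.0, 3.0, 15.0, 7.0, 20.0],
    "Salary": [90000.0, 65000.0, 150000.0, 60000.0, 200000.0],
})

# Pearson correlation between experience and salary
r = sample["Years of Experience"].corr(sample["Salary"])

# Least-squares line: salary is approximated by slope * years + intercept
slope, intercept = np.polyfit(
    sample["Years of Experience"].to_numpy(),
    sample["Salary"].to_numpy(),
    deg=1,
)

print(f"Pearson r: {r:.3f}")
print(f"Estimated salary change per additional year: {slope:,.0f}")
```

The slope reads as the average salary difference associated with one more year of experience. It describes an association in the data, not a guaranteed raise for any individual.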

## 4. Informing Policy Decisions

Data-driven insights from the Salary dataset can assist policymakers in making informed decisions regarding salary structures, benefits, and incentives that align with the Two-Pot system’s goals. By understanding how various demographic factors influence earnings, policymakers can implement strategies that support better savings behaviors among employees.

## 5. Enhancing Employee Engagement and Satisfaction

When employees feel they are fairly compensated based on their qualifications and contributions, it can lead to increased job satisfaction and engagement. This positive sentiment can translate into better financial decisions, including increased contributions to retirement savings. The Two-Pot system seeks to enhance financial literacy and encourage proactive savings behaviors among employees, and fair compensation is a foundational aspect of this effort.

## Conclusion

In summary, the Salary dataset serves as a crucial resource for evaluating the effectiveness of the Two-Pot system in promoting financial security and equity among employees. By leveraging the insights drawn from this dataset, organizations can work towards creating a more equitable compensation structure that aligns with the objectives of the Two-Pot system, ultimately leading to a more financially secure workforce.
