In [None]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

In [None]:
df = pd.read_csv("Salaries.csv")

In [None]:
df.head()

In [None]:
df.info()

---
# Data Cleaning and Analysis Summary
---

## Data Type Conversion
The following columns were converted to numeric data types:
- `BasePay`
- `OvertimePay`
- `OtherPay`
- `Benefits`

Any non-numeric values were coerced into NaN (missing values).
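
The coercion step can be illustrated on a toy column. This is a minimal sketch with hypothetical values (the placeholder string `"Not Provided"` is an assumption, not taken from the dataset); it shows how to count the entries that became NaN only because of the conversion.

```python
import pandas as pd

# Toy column mixing numeric strings, a placeholder, and a true missing value
# (hypothetical values, not taken from Salaries.csv).
raw = pd.Series(["1200.50", "Not Provided", "0", None])

# errors="coerce" turns anything that cannot be parsed into NaN.
converted = pd.to_numeric(raw, errors="coerce")

# Entries that were present before conversion but are NaN afterwards.
newly_missing = converted.isna() & raw.notna()

print(converted.tolist())
print(int(newly_missing.sum()))
```

Counting `newly_missing` before and after the conversion is one way to see how many values were silently turned into NaN.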

In [None]:
df[["BasePay", "OvertimePay", "OtherPay", "Benefits"]] = df[["BasePay", "OvertimePay", "OtherPay", "Benefits"]].apply(pd.to_numeric, errors='coerce')

In [None]:
df.info()

## Summary Statistics Before Data Cleaning
```
- BasePay: Mean = 66,325.45, Min = -166.01, Max = 319,275.01
- OvertimePay: Mean = 5,066.06, Min = -0.01, Max = 245,131.88
- OtherPay: Mean = 3,648.77, Min = -7,058.59, Max = 400,184.25
- Benefits: Mean = 25,007.89, Min = -33.89, Max = 96,570.66
- TotalPay: Mean = 74,768.32, Min = -618.13, Max = 567,595.43
- TotalPayBenefits: Mean = 93,692.55, Min = -618.13, Max = 567,595.43
```
The negative minimums in every pay column may reflect data-entry errors or pay deductions.
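
A possible way to quantify the negative values is sketched below on a toy frame. The rows are hypothetical and only reuse a few of the minimums listed above for illustration.

```python
import pandas as pd

# Toy frame with a few negative entries (illustrative values only).
toy = pd.DataFrame({
    "BasePay": [50000.0, -166.01, 72000.0],
    "OvertimePay": [0.0, 1500.0, -0.01],
    "OtherPay": [300.0, -7058.59, 0.0],
})

# Number of negative values in each column.
negative_counts = (toy < 0).sum()

# Rows that contain at least one negative pay component.
rows_with_negative = toy[(toy < 0).any(axis=1)]

print(negative_counts)
print(rows_with_negative)
```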

In [None]:
df.describe()

## Missing Values Before Handling
```
- BasePay: 609 missing values
- OvertimePay: 4 missing values
- OtherPay: 4 missing values
- Benefits: 36,163 missing values
- Notes: 148,654 missing values (entirely missing)
- Status: 110,535 missing values
```
Columns `Notes` and `Status` had excessive missing values and were dropped.
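
The decision to drop sparse columns can be expressed as a share of missing values per column. The sketch below uses a toy frame, and the 0.7 cutoff is an assumed threshold chosen for illustration, not a value stated in this notebook.

```python
import numpy as np
import pandas as pd

# Toy frame: one mostly complete column and two sparse ones.
toy = pd.DataFrame({
    "BasePay": [50000.0, np.nan, 72000.0, 61000.0],
    "Notes": [np.nan, np.nan, np.nan, np.nan],
    "Status": [np.nan, np.nan, np.nan, "FT"],
})

# Fraction of missing values in each column.
missing_share = toy.isna().mean()

# Assumed cutoff: flag columns with more than 70% missing values.
threshold = 0.7
sparse_cols = missing_share[missing_share > threshold].index.tolist()

print(missing_share)
print(sparse_cols)
```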

In [None]:
df.isnull().sum()

## Data Cleaning Actions
- **BasePay**: Missing values filled with the mean (66,325.45).
- **OvertimePay, OtherPay, and Benefits**: Missing values filled with 0.
- **Dropped Columns**: `Status` and `Notes`.

After this cleaning, there were no missing values left in the dataset.
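
The two fill strategies affect the column mean differently: filling with the mean leaves it unchanged, while filling with 0 scales it by the share of observed rows. This appears consistent with the `Benefits` mean moving from 25,007.89 to 18,924.23, since roughly 76% of the 148,654 rows had a `Benefits` value. A minimal sketch on toy data:

```python
import numpy as np
import pandas as pd

# Toy Benefits column with half of the values missing.
benefits = pd.Series([20000.0, np.nan, 30000.0, np.nan])

mean_observed = benefits.mean()                  # NaN values are skipped
mean_zero_filled = benefits.fillna(0).mean()     # zeros pull the mean down
mean_mean_filled = benefits.fillna(mean_observed).mean()  # mean is preserved
observed_share = benefits.notna().mean()         # fraction of non-missing rows

print(mean_observed, mean_zero_filled, mean_mean_filled, observed_share)
```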

In [None]:
df["BasePay"] = df["BasePay"].fillna(df["BasePay"].mean())

In [None]:
df[["OvertimePay","OtherPay","Benefits"]] = df[["OvertimePay","OtherPay","Benefits"]].fillna(0)

In [None]:
df.drop(columns=["Status"],inplace=True)

In [None]:
df.drop(columns=["Notes"],inplace=True)

---

## Ensuring Correct Calculation of TotalPay and TotalPayBenefits

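To ensure that total compensation is computed consistently after cleaning, `TotalPay` is recalculated as `BasePay` + `OvertimePay` + `OtherPay`, and `TotalPayBenefits` as `TotalPay` + `Benefits`.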
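
Before overwriting the columns, it may be useful to check how many reported totals disagree with the recomputed ones. The sketch below is an optional check on toy data, not a step performed in this notebook; the second row is deliberately inconsistent.

```python
import numpy as np
import pandas as pd

# Toy frame where the second reported TotalPay omits OtherPay on purpose.
toy = pd.DataFrame({
    "BasePay": [50000.0, 60000.0],
    "OvertimePay": [1000.0, 0.0],
    "OtherPay": [500.0, 250.0],
    "TotalPay": [51500.0, 60000.0],
})

# Recompute the total from its components.
recomputed = toy["BasePay"] + toy["OvertimePay"] + toy["OtherPay"]

# np.isclose tolerates tiny floating-point differences.
mismatch = ~np.isclose(toy["TotalPay"], recomputed)
n_mismatch = int(mismatch.sum())

print(n_mismatch)
```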

In [None]:
df["TotalPay"] = df["BasePay"]+df["OvertimePay"]+df["OtherPay"]

In [None]:
df["TotalPayBenefits"] = df["TotalPay"]+df["Benefits"]

## Summary Statistics After Data Cleaning
```
- BasePay: Mean = 66,325.45, Min = -166.01, Max = 319,275.01
- OvertimePay: Mean = 5,065.92, Min = -0.01, Max = 245,131.88
- OtherPay: Mean = 3,648.67, Min = -7,058.59, Max = 400,184.25
- Benefits: Mean = 18,924.23, Min = -33.89, Max = 96,570.66
- TotalPay: Mean = 75,040.04, Min = -618.13, Max = 567,595.43
- TotalPayBenefits: Mean = 93,964.27, Min = -618.13, Max = 567,595.43
```

In [None]:
df[["BasePay","OvertimePay","OtherPay","Benefits","TotalPay","TotalPayBenefits"]].describe()

---
## Distribution Analysis
---

## Histogram of BasePay, OvertimePay, TotalPay

### BasePay Distribution
- The histogram shows a right-skewed distribution, meaning most employees have lower base pay, with fewer employees earning significantly higher salaries.
- The peak frequency is in the lower salary range, gradually decreasing as salaries increase.
- There are some negative values that may need further investigation.

In [None]:
plt.style.use("dark_background")
plt.figure(figsize=(12, 6))
sns.histplot(df['BasePay'], color='#00BFFF', kde=True, linewidth=2, edgecolor='white', alpha=0.8)
plt.title('Distribution of BasePay', fontsize=16, fontweight='bold', color='white')
plt.xlabel('BasePay', fontsize=14, fontweight='bold', color='white')
plt.ylabel('Frequency', fontsize=14, fontweight='bold', color='white')
plt.grid(True, linestyle='--', linewidth=0.5, alpha=0.6, color='gray')
plt.tight_layout()
plt.savefig('basepay_distribution.png', dpi=300, bbox_inches='tight')

plt.show()

### OvertimePay Distribution
- The majority of employees have little to no overtime pay, as indicated by the high concentration of values near zero.
- A long right tail shows that some employees receive substantial overtime pay.

In [None]:
plt.style.use("dark_background")
plt.figure(figsize=(12, 6))
sns.histplot(df['OvertimePay'], bins=20, kde=True, color='#00FF7F', linewidth=2, edgecolor='white', alpha=0.8)
plt.title('Distribution of Overtime Pay', fontsize=16, fontweight='bold', color='white')
plt.xlabel('Overtime Pay', fontsize=14, fontweight='bold', color='white')
plt.ylabel('Frequency', fontsize=14, fontweight='bold', color='white')
plt.grid(True, linestyle='--', linewidth=0.5, alpha=0.6, color='gray')
plt.tight_layout()
plt.savefig('overtimepay_distribution.png', dpi=300, bbox_inches='tight')
plt.show()

### TotalPay Distribution
- The distribution follows a similar pattern to BasePay but accounts for OvertimePay and OtherPay.
- Most employees fall within a common pay range, but a few have significantly higher total earnings, contributing to the right-skewed nature of the data.

In [None]:
plt.style.use("dark_background")
plt.figure(figsize=(12, 6))
sns.histplot(df['TotalPay'], bins=20, kde=True, color='#FF4500', linewidth=2, edgecolor='white', alpha=0.8)
plt.title('Distribution of Total Pay', fontsize=16, fontweight='bold', color='white')
plt.xlabel('Total Pay', fontsize=14, fontweight='bold', color='white')
plt.ylabel('Frequency', fontsize=14, fontweight='bold', color='white')
plt.grid(True, linestyle='--', linewidth=0.5, alpha=0.6, color='gray')
plt.tight_layout()
plt.savefig('totalpay_distribution.png', dpi=300, bbox_inches='tight')
plt.show()

- The distribution analysis highlights a highly skewed pay structure, with most employees earning lower wages and a few receiving significantly higher pay.
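
The right skew described above can also be checked numerically. The sketch below uses a toy series with one very high value (hypothetical numbers): a positive sample skewness and a mean above the median both point to a long right tail.

```python
import pandas as pd

# Toy pay values with a single very high earner.
pay = pd.Series(
    [30000, 32000, 35000, 36000, 40000, 42000, 45000, 250000], dtype=float
)

skewness = pay.skew()       # positive for a right-skewed sample
mean_pay = pay.mean()
median_pay = pay.median()

print(skewness, mean_pay, median_pay)
```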

## Salary and Job Trends Analysis

### Salary Trends Over the Years
- The average `BasePay` was computed for each year to show how base salaries changed over time.
- A line plot visualizes year-to-year fluctuations in the average.
- The plot makes rising or declining salary trends easy to identify.
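
The yearly averages can be complemented with a year-over-year percentage change. This is a sketch on toy data; the years and pay values are hypothetical.

```python
import pandas as pd

# Toy data: two records per year.
toy = pd.DataFrame({
    "Year": [2011, 2011, 2012, 2012, 2013, 2013],
    "BasePay": [60000.0, 64000.0, 63000.0, 67000.0, 66000.0, 70000.0],
})

# Average BasePay per year, indexed by Year.
avg_by_year = toy.groupby("Year")["BasePay"].mean()

# Percentage change relative to the previous year (first year is NaN).
yoy_change = avg_by_year.pct_change() * 100

print(avg_by_year)
print(yoy_change.round(2))
```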

In [None]:
average_BasePay_per_year = df.groupby("Year")["BasePay"].mean().reset_index()

In [None]:
average_BasePay_per_year

In [None]:
plt.style.use("dark_background")
plt.figure(figsize=(12, 6))
sns.lineplot(data=average_BasePay_per_year, x="Year", y="BasePay", color='#00FFFF', linewidth=2.5, marker='o', markersize=8)
plt.title("Salary Trends Over Years (Average BasePay per Year)", fontsize=16, fontweight='bold', color='white')
plt.xlabel("Year", fontsize=14, fontweight='bold', color='white')
plt.ylabel("Average BasePay", fontsize=14, fontweight='bold', color='white')
plt.grid(True, linestyle='--', linewidth=0.5, alpha=0.6, color='gray')
plt.tight_layout()
plt.savefig('salary_trends.png', dpi=300, bbox_inches='tight')
plt.show()

### Total Salary Expenditure Trends
- Total `TotalPay` expenditure per year was calculated to analyze yearly spending.
- The plotted line chart shows the overall increase or decrease in total salary costs over time.
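
A yearly sum can change because pay levels changed or because the number of records changed. The sketch below, on hypothetical toy data, separates the two by reporting the total, the record count, and the total per record for each year.

```python
import pandas as pd

# Toy data: 2012 has more records but the same average pay as 2011.
toy = pd.DataFrame({
    "Year": [2011, 2011, 2012, 2012, 2012],
    "TotalPay": [70000.0, 80000.0, 70000.0, 80000.0, 75000.0],
})

# Named aggregation: yearly total and number of records.
yearly = toy.groupby("Year")["TotalPay"].agg(total="sum", headcount="count")
yearly["per_record"] = yearly["total"] / yearly["headcount"]

print(yearly)
```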

In [None]:
TotalPay_expenditure_per_year = df.groupby("Year")["TotalPay"].sum().reset_index()

In [None]:
TotalPay_expenditure_per_year

In [None]:
plt.style.use("dark_background")
plt.figure(figsize=(12, 6))
sns.lineplot(data=TotalPay_expenditure_per_year, x="Year", y="TotalPay", color='#FFA500', linewidth=2.5, marker='o', markersize=8)
plt.title("TotalPay Expenditure Trends Over Years", fontsize=16, fontweight='bold', color='white')
plt.xlabel("Year", fontsize=14, fontweight='bold', color='white')
plt.ylabel("TotalPay Expenditure", fontsize=14, fontweight='bold', color='white')
plt.grid(True, linestyle='--', linewidth=0.5, alpha=0.6, color='gray')
plt.tight_layout()
plt.savefig('totalpay_expenditure_trends.png', dpi=300, bbox_inches='tight')
plt.show()

In [None]:
df.head()

### Top-Paid Job Titles
- The top 10 highest-paid job titles were identified based on average `TotalPay` per title.
- A bar chart visualizes the highest-earning roles.
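
An average per job title can be dominated by titles held by a single person. The sketch below shows one possible refinement on toy data: it keeps the record count next to the mean and applies a minimum count before ranking. The `min_records` cutoff is an assumption for illustration and is not used in this notebook.

```python
import pandas as pd

# Toy data: "Chief" appears once, the other titles several times.
toy = pd.DataFrame({
    "JobTitle": ["Chief", "Clerk", "Clerk", "Nurse", "Nurse", "Nurse"],
    "TotalPay": [300000.0, 50000.0, 54000.0, 90000.0, 95000.0, 100000.0],
})

# Mean pay and number of records per title.
stats = toy.groupby("JobTitle")["TotalPay"].agg(mean_pay="mean", n="count")

# Assumed cutoff: ignore titles with fewer than 2 records.
min_records = 2
top_jobs = stats[stats["n"] >= min_records].nlargest(2, "mean_pay")

print(top_jobs)
```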

In [None]:
avg_TotalPay_per_job = df.groupby("JobTitle")["TotalPay"].mean().reset_index()

In [None]:
top_10_paid_job = (avg_TotalPay_per_job.sort_values(by="TotalPay",ascending=False)).head(10)
top_10_paid_job

In [None]:
plt.style.use("dark_background")
plt.figure(figsize=(12, 6))
sns.barplot(data=top_10_paid_job, x="JobTitle", y="TotalPay", color="#335BF9")
plt.title('Top Highest-Paid Job Titles', fontsize=16, fontweight='bold', color='white')
plt.xlabel('Job Title', fontsize=14, fontweight='bold', color='white')
plt.ylabel('Average TotalPay', fontsize=14, fontweight='bold', color='white')
plt.xticks(rotation=85, fontsize=12, color='white')
plt.yticks(fontsize=12, color='white')
plt.grid(axis='y', linestyle='--', linewidth=0.5, alpha=0.6, color='gray')
plt.savefig('top_highest_paid_jobs.png', dpi=300, bbox_inches='tight')
plt.show()

### Job Roles with Highest Overtime Pay
- The top 10 job roles with the highest average `OvertimePay` were identified.
- A bar chart illustrates which jobs earn the most overtime compensation.

In [None]:
avg_OvertimePay_job = df.groupby("JobTitle")["OvertimePay"].mean().reset_index()

In [None]:
top_10_overtime_paid_job = (avg_OvertimePay_job.sort_values(by="OvertimePay",ascending=False)).head(10)
top_10_overtime_paid_job

In [None]:
plt.style.use("dark_background")
plt.figure(figsize=(12, 6), dpi=200)
sns.barplot(data=top_10_overtime_paid_job, x="JobTitle", y="OvertimePay", color="#335BF9")
plt.title('Job Roles with Highest Overtime Pay', fontsize=16, fontweight='bold', color='white')
plt.xlabel('Job Title', fontsize=14, fontweight='bold', color='white')
plt.ylabel('Average OvertimePay', fontsize=14, fontweight='bold', color='white')
plt.xticks(rotation=85, fontsize=12, color='white')
plt.yticks(fontsize=12, color='white')
plt.grid(axis='y', linestyle='--', linewidth=0.5, alpha=0.6, color='gray')
plt.savefig('highest_overtime_paid_jobs.png', dpi=300, bbox_inches='tight')
plt.show()

### Unique Job Titles Count
- The dataset contains a total of `2159` unique job titles.
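
`nunique()` counts exact string matches, so titles that differ only in casing or surrounding whitespace are counted separately. Whether that happens in this dataset was not checked here; the sketch below only illustrates the effect on hypothetical titles.

```python
import pandas as pd

# Toy titles: three spellings of the same role plus one distinct role.
titles = pd.Series([
    "Police Officer",
    "POLICE OFFICER",
    " police officer ",
    "Registered Nurse",
])

raw_unique = titles.nunique()

# Strip whitespace and unify case before counting.
normalized_unique = titles.str.strip().str.upper().nunique()

print(raw_unique, normalized_unique)
```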

In [None]:
df["JobTitle"].nunique()

### Outliers in Total Compensation
- A boxplot was generated for `TotalPayBenefits` to identify unusually high salaries.
- Outliers are data points that deviate significantly from the overall distribution of salaries.
- The presence of high outliers suggests that a small group of employees earn substantially more than the majority.
- This could be due to executive salaries, bonuses, or specific job roles with higher compensation packages.
- The boxplot helps to visualize the spread of salaries and detect any extreme values that may warrant further investigation.
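
By default, the boxplot draws points beyond 1.5 times the interquartile range (IQR) from the quartiles as outliers. The same rule can be applied directly to list the high outliers, as in this sketch on toy values (hypothetical numbers):

```python
import pandas as pd

# Toy compensation values with one extreme entry.
pay = pd.Series(
    [40000, 50000, 60000, 70000, 80000, 90000, 100000, 400000], dtype=float
)

# Quartiles and interquartile range.
q1, q3 = pay.quantile([0.25, 0.75])
iqr = q3 - q1

# Upper fence of the 1.5 * IQR rule.
upper_fence = q3 + 1.5 * iqr
high_outliers = pay[pay > upper_fence]

print(upper_fence)
print(high_outliers.tolist())
```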

In [None]:
plt.style.use("dark_background")
plt.figure(figsize=(12, 5), dpi=200)

sns.boxplot(data=df, x="TotalPayBenefits", color="#33F98F")
plt.title('Boxplot for TotalPayBenefits (Identifying Unusually High Salaries)', fontsize=16, fontweight='bold', color='white')
plt.xlabel("TotalPayBenefits", fontsize=14, fontweight='bold', color='white')
plt.xticks(fontsize=12, color='white')
plt.yticks(fontsize=12, color='white')
plt.grid(True, linestyle='--', linewidth=0.5, alpha=0.6, color='gray')
plt.tight_layout()
plt.savefig('boxplot_totalpaybenefits.png', dpi=300, bbox_inches='tight')
plt.show()

### Highest-Paid Employees
- The 10 highest-paid employees (based on `TotalPayBenefits`) were identified.
- Names, job titles, and total earnings were extracted.

In [None]:
top_10_highest_paid_employees = (df.sort_values(by="TotalPayBenefits",ascending=False)).head(10)
top_10_highest_paid_employees[["EmployeeName","JobTitle","TotalPayBenefits"]]

### Correlation Analysis
- The correlation matrix for `BasePay`, `OvertimePay`, and `TotalPayBenefits` was computed.
- A heatmap visualizes relationships between salary components.
- The heatmap helps show which pay components are most strongly related.
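
Two points may help when reading the matrix. `TotalPayBenefits` is built from `BasePay` and `OvertimePay`, so some positive correlation with them is expected by construction. Also, `DataFrame.corr()` defaults to Pearson correlation, which is sensitive to extreme values; a rank-based Spearman correlation is one alternative. A sketch on toy data with one extreme value:

```python
import pandas as pd

# Toy data: monotonic relationship with one extreme total.
toy = pd.DataFrame({
    "BasePay": [40000.0, 50000.0, 60000.0, 70000.0, 80000.0],
    "TotalPayBenefits": [55000.0, 68000.0, 80000.0, 95000.0, 400000.0],
})

pearson = toy.corr(method="pearson")    # linear association
spearman = toy.corr(method="spearman")  # rank-based association

print(pearson.loc["BasePay", "TotalPayBenefits"])
print(spearman.loc["BasePay", "TotalPayBenefits"])
```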

In [None]:
correlation_matrix = df[['BasePay', 'OvertimePay', 'TotalPayBenefits']].corr()
correlation_matrix

In [None]:
plt.style.use("dark_background")
plt.figure(figsize=(10, 8))
sns.heatmap(data=correlation_matrix, cmap='coolwarm', annot=True, linewidths=0.5, fmt=".2f", cbar_kws={'shrink': 0.8})
plt.title('Correlation Analysis: BasePay, OvertimePay, and TotalPayBenefits', fontsize=16, fontweight='bold', color='white')
plt.tight_layout()
plt.savefig('correlation_analysis.png', dpi=300, bbox_inches='tight')
plt.show()