This assignment uses the file data.csv, which contains records for 1000 employees of a company: their age, salary, joining date, department and performance score. We begin by inspecting the data.

In [None]:
import pandas as pd
import numpy as np

# Load the dataset
df = pd.read_csv("data.csv")

# Display basic information about the dataset
df.info()
df.head()

The output shows missing values in the Performance_Score column. We first drop any duplicate records and print the mean performance score, then fill the missing cells with that mean rounded to the nearest whole score.
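Before imputing, it can help to quantify how much is missing in each column. The sketch below uses a small made-up frame (only the column names follow the notebook) and the standard `isna()` method; on the real data the same two lines would be applied to `df`.

```python
import numpy as np
import pandas as pd

# Made-up sample; only the column names follow the notebook.
sample = pd.DataFrame({
    "Age": [25, 41, 38, 52],
    "Performance_Score": [3.0, np.nan, 4.0, np.nan],
})

# Number of missing cells per column.
missing_counts = sample.isna().sum()
# Share of missing cells per column, as a percentage.
missing_pct = sample.isna().mean() * 100

print(missing_counts)
print(missing_pct)
```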

In [None]:
# Check and remove duplicate records
df.drop_duplicates(inplace=True)

print(df['Performance_Score'].mean())


In [None]:
# Handle missing values in Performance_Score: impute with the mean,
# rounded so that imputed scores stay on the same whole-number scale
mean_score = round(df['Performance_Score'].mean())
df['Performance_Score'] = df['Performance_Score'].fillna(mean_score)

# Remove outliers in Age and Salary using the IQR rule
cols = ['Age', 'Salary']
q1 = df[cols].quantile(0.25)
q3 = df[cols].quantile(0.75)
# Interquartile range
IQR = q3 - q1
# Lower and upper bounds
LB = q1 - 1.5 * IQR
UB = q3 + 1.5 * IQR

# Keep only rows where both Age and Salary lie within the bounds;
# .copy() avoids a SettingWithCopyWarning on the assignment below
df = df[~((df[cols] < LB) | (df[cols] > UB)).any(axis=1)].copy()

# Standardize Department by stripping extra spaces and using a consistent case
df['Department'] = df['Department'].str.strip().str.title()


df.info()
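
The IQR rule used in the cell above can be traced on a small made-up series. The values below are illustrative only and are not taken from data.csv; the point is to show how the quartiles turn into bounds and which value falls outside them.

```python
import pandas as pd

# Made-up salaries (in thousands) with one obvious extreme value.
salary = pd.Series([40, 45, 50, 55, 60, 65, 70, 500], name="Salary")

q1 = salary.quantile(0.25)   # first quartile
q3 = salary.quantile(0.75)   # third quartile
iqr = q3 - q1                # interquartile range
lower = q1 - 1.5 * iqr       # anything below this is treated as an outlier
upper = q3 + 1.5 * iqr       # anything above this is treated as an outlier

# between() is inclusive, matching the "< LB or > UB" exclusion in the notebook.
kept = salary[salary.between(lower, upper)]
print(lower, upper, len(kept))
```

Here the quartiles are 48.75 and 66.25, so the bounds are 22.5 and 92.5 and only the value 500 is dropped.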

Now let's move on to the univariate analysis:
- The distribution of employees by age and by salary.
- The distribution of employees by performance score.
- The number of employees in each department.
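
For the department counts in the last item, a numeric table can complement the pie chart. The sketch below uses made-up labels to show the effect of the `str.strip().str.title()` standardization followed by `value_counts()`. One side effect worth knowing: `str.title()` turns an acronym such as "HR" into "Hr".

```python
import pandas as pd

# Made-up raw labels with stray spaces and mixed case.
dept = pd.Series([" marketing", "Marketing ", "SALES", "hr", "HR ", "sales"])

# Same standardization as in the cleaning cell.
clean = dept.str.strip().str.title()

# Absolute counts and relative shares per department.
counts = clean.value_counts()
share = clean.value_counts(normalize=True)

print(counts)
print(share)
```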


In [None]:
import matplotlib.pyplot as plt


In [None]:

#Summary Statistics of Age Distribution
df['Age'].describe()


In [None]:

#Summary Statistics of Salary Distribution
df['Salary'].describe()


In [None]:

#Summary Statistics of Performance Score
df['Performance_Score'].describe()

In [None]:
fig, axes = plt.subplots(2, 2, figsize=[12,12])

# First plot the Age vs count graph for the employees
df['Age'].plot(kind='hist', bins= 20, edgecolor='black',ax=axes[0,0] )
axes[0,0].set_xlabel('Age')
axes[0,0].set_ylabel('Count')
axes[0,0].set_title('Age Distribution')


# Plot the Salary vs count graph for the employees
df['Salary'].plot(kind='hist', bins= 20, edgecolor='black', color='orange',ax=axes[0,1] )
axes[0,1].set_xlabel('Salary')
axes[0,1].set_ylabel('Count')
axes[0,1].set_title('Salary Distribution')

# Plot the count of employees in each department
df['Department'].value_counts().plot(kind='pie',autopct='%0.1f%%',ax=axes[1,0] )
axes[1,0].set_xlabel('Department')
axes[1,0].set_ylabel(' ')
axes[1,0].set_title('Department Wise Distribution')

# Box plot of the performance score
df['Performance_Score'].plot(kind='box', ax=axes[1,1])
axes[1,1].set_ylabel('Performance Score')
axes[1,1].set_title('Performance Score Distribution')

This completes the univariate analysis. We can infer that:
- The largest group of employees is around 45 years old.
- The most common salary is about 100000.
- Marketing is the department with the most employees.
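
Readings like these are taken from the plots by eye, so it can be worth confirming them numerically. The sketch below shows one way to do that on made-up data: `np.histogram` gives the bin with the most employees, and `value_counts().idxmax()` gives the largest department. The numbers here say nothing about the real dataset.

```python
import numpy as np
import pandas as pd

# Made-up sample; only the column names follow the notebook.
sample = pd.DataFrame({
    "Age": [23, 31, 44, 45, 46, 47, 52, 58],
    "Department": ["Marketing", "Sales", "Marketing", "Hr",
                   "Marketing", "Sales", "Finance", "Marketing"],
})

# Same idea as the histogram: count employees per equal-width age bin.
counts, edges = np.histogram(sample["Age"], bins=4)
peak = counts.argmax()
peak_range = (edges[peak], edges[peak + 1])

# Department with the most employees.
top_department = sample["Department"].value_counts().idxmax()

print(peak_range, top_department)
```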

Next comes the bivariate analysis. We will look at:
 - The correlation matrix of age, salary and performance score.
 - How salary varies with age.
 - How performance score varies with age.
 - How salary varies with performance score.
 - How performance score varies across departments.
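
For the question of how salary varies across age groups, a grouped summary can sit alongside the scatter plot. The sketch below is one possible approach, assuming decade-wide groups (the bin edges and labels are hypothetical choices, not from the notebook) and made-up data.

```python
import pandas as pd

# Made-up sample; only the column names follow the notebook.
sample = pd.DataFrame({
    "Age": [22, 28, 34, 39, 45, 48, 53, 59],
    "Salary": [40000, 50000, 60000, 80000, 90000, 70000, 100000, 60000],
})

# Hypothetical decade-wide groups; right=False makes each bin [low, high).
bins = [20, 30, 40, 50, 60]
labels = ["20-29", "30-39", "40-49", "50-59"]
sample["Age_Group"] = pd.cut(sample["Age"], bins=bins, labels=labels, right=False)

# Head count and median salary per age group.
salary_by_group = (
    sample.groupby("Age_Group", observed=True)["Salary"].agg(["count", "median"])
)
print(salary_by_group)
```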

In [None]:
import seaborn as sns

In [None]:
# Correlation matrix for Age, Salary and Performance Score
df[['Age','Salary','Performance_Score']].corr()
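
By default `DataFrame.corr()` computes the Pearson coefficient. To make clear what each entry of the matrix means, the sketch below works one coefficient out by hand on made-up data and compares it with the pandas result.

```python
import numpy as np
import pandas as pd

# Made-up sample; only the column names follow the notebook.
sample = pd.DataFrame({
    "Age": [25, 32, 41, 47, 55],
    "Salary": [42000, 58000, 61000, 90000, 75000],
})

x = sample["Age"].to_numpy(dtype=float)
y = sample["Salary"].to_numpy(dtype=float)

# Pearson r = sum(dx * dy) / sqrt(sum(dx^2) * sum(dy^2)),
# where dx and dy are deviations from the respective means.
dx = x - x.mean()
dy = y - y.mean()
r_manual = (dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum())

# The same coefficient as computed by pandas.
r_pandas = sample["Age"].corr(sample["Salary"])

print(r_manual, r_pandas)
```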

In [None]:
fig1, axes= plt.subplots(2, 2, figsize=(12,12) )

# Scatter plot of Age against Salary
sns.scatterplot(x=df['Salary'],y=df['Age'], marker= 'x', color='coral',ax=axes[0,0])
axes[0,0].set_xlabel("Salary")
axes[0,0].set_ylabel("Age")
axes[0,0].set_title("Age v/s Salary Distribution")

# Box plot of performance score at different ages
sns.boxplot(x=df['Age'],y=df['Performance_Score'],ax=axes[0,1])
axes[0,1].set_xlabel("Age")
axes[0,1].set_ylabel("Performance Score")
axes[0,1].set_title("Age v/s Performance Score")

# Density plot of performance score at different salaries
sns.kdeplot(x=df['Salary'], y=df['Performance_Score'], cmap="Blues", fill= True,ax=axes[1,0])
axes[1,0].set_xlabel("Salary")
axes[1,0].set_ylabel("Performance Score")
axes[1,0].set_title("Salary v/s Performance Score")

# Count of people in different department with different performance score
sns.countplot(x=df['Department'], hue=df['Performance_Score'], palette='viridis', ax=axes[1,1])
axes[1,1].set_xlabel("Department")
axes[1,1].set_ylabel("Count of Employees")
axes[1,1].set_title("Department-wise Employee Count for Different Performance Scores")

This completes the bivariate analysis. We infer that:
- There is no clear relationship between age and salary.
- Employees with a performance score above 3 tend to have comparatively higher salaries.
- The count plot shows which performance score is most common within each department.
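
The salary comparison between performance groups can be made explicit with a grouped summary rather than read off the density plot. The sketch below is a minimal illustration on made-up data (so its output says nothing about the real dataset). It splits employees at a score of 3 and compares median salaries.

```python
import pandas as pd

# Made-up sample; only the column names follow the notebook.
sample = pd.DataFrame({
    "Performance_Score": [1, 2, 3, 3, 4, 4, 5, 5],
    "Salary": [50000, 60000, 55000, 65000, 70000, 90000, 80000, 100000],
})

# Boolean split: True for scores above 3.
high = sample["Performance_Score"] > 3

# Groups are ordered False, True, so the labels below match that order.
comparison = sample.groupby(high)["Salary"].agg(["count", "median"])
comparison.index = ["score <= 3", "score > 3"]
print(comparison)
```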

Now we move on to the multivariate analysis. We will look at:
- A pair plot to examine age, salary and performance score together.
- A heatmap to visualise the correlation between age, salary and performance score.
- A bar plot of mean salary by department for each performance score.
- A violin plot of the age distribution by department for each performance score.
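
The grouped bar plot in this list has a direct tabular counterpart: a pivot table of mean salary by department and performance score. The sketch below shows the idea on made-up data; combinations with no employees appear as NaN.

```python
import pandas as pd

# Made-up sample; only the column names follow the notebook.
sample = pd.DataFrame({
    "Department": ["Marketing", "Marketing", "Sales", "Sales", "Sales", "Finance"],
    "Performance_Score": [3, 4, 3, 3, 5, 4],
    "Salary": [60000, 80000, 50000, 70000, 90000, 85000],
})

# Rows: departments, columns: performance scores, cells: mean salary.
mean_salary = sample.pivot_table(
    index="Department",
    columns="Performance_Score",
    values="Salary",
    aggfunc="mean",
)
print(mean_salary)
```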

In [None]:
# pairplot returns a grid of axes, so set the title on the whole figure
g = sns.pairplot(df, hue='Performance_Score', palette='viridis')
g.figure.suptitle("Pair Plot", y=1.02)


In [None]:

sns.heatmap(df[['Age','Salary','Performance_Score']].corr(),cmap='coolwarm',annot=True)
plt.title("Correlation Heatmap")

In [None]:
# barplot shows the mean salary of each group (with error bars), not the full distribution
sns.barplot(x="Department", y="Salary", hue="Performance_Score", data=df, palette="viridis")
plt.title("Mean Salary Across Departments for Different Performance Scores")

In [None]:
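# split=True is designed for a hue variable with two levels and raises an error in
# older seaborn versions when Department has more, so it is omitted here
sns.violinplot(x=df['Performance_Score'], y=df['Age'], hue=df['Department'], palette="viridis")
plt.title("Age Distribution Across Performance Scores and Departments")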

This completes the multivariate analysis. From the various plots we infer:
- Pair Plot: A pair plot helped visualize interactions between numerical variables for different performance scores.

- Heatmap: A heatmap of the correlation matrix provided insights into relationships between numerical variables.

- Grouped Bar Chart: Analyzed salary distribution across departments and performance scores, showing variations in pay among different categories.

- Violin Plot: Revealed that the age distribution varies significantly across different performance scores.

Detailed Analysis Report

1. Overview of the Dataset

The dataset contains records of 1,000 employees across multiple departments. The key attributes include:
- Age (years)
- Salary (annual income in dollars)
- Joining Date
- Department
- Performance Score (1 to 5 scale)

2. Data Quality Check

- Missing Values: 55 entries in the Performance Score column are missing.
- Duplicates: No duplicate records found.
- Data Types: Age, Salary and Performance Score are numeric; Department is categorical, and Joining Date is a date that is read as text unless it is parsed explicitly.
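
The duplicate and data-type checks above can be reproduced with a few pandas calls. The sketch below uses made-up rows, and the column name `Joining_Date` is an assumption about how the joining date is spelled in data.csv.

```python
import pandas as pd

# Made-up sample; "Joining_Date" is an assumed column name.
sample = pd.DataFrame({
    "Age": [30, 30, 45],
    "Joining_Date": ["2019-03-01", "2019-03-01", "2021-07-15"],
    "Department": ["Sales", "Sales", "Finance"],
})

# Rows that repeat an earlier row exactly.
n_duplicates = int(sample.duplicated().sum())

# Dates read from a CSV arrive as text unless parsed explicitly.
sample["Joining_Date"] = pd.to_datetime(sample["Joining_Date"])

print(n_duplicates)
print(sample.dtypes)
```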

3. Key Statistical Insights

- Age: Employees range from 20 to 59 years; the average age is ~40 years.
- Salary: Salaries vary between $30,028 and $149,922, with a median of $91,750.
- Performance Scores: The average score is 2.84, with most employees scoring between 2 and 4.

4. Visual Analysis & Findings

Age Distribution
- The majority of employees are between 30 and 50 years old.
- There are fewer employees in the younger (20-30) and older (50-59) age groups.
- The distribution follows a normal trend, peaking around the 45-year mark.

Salary vs. Performance Score
- The density plot indicates no strong correlation between salary and performance scores.
- Some high earners (above $120,000) have low performance scores (1 or 2), suggesting salary is not solely based on performance.
- Conversely, some low earners (below $50,000) have high performance scores (4 or 5), hinting at potential disparities in salary distribution.

Performance Score Distribution
- The most common performance scores are 2, 3, and 4.
- Only a small percentage of employees have exceptionally high (5) or low (1) ratings.
- This distribution suggests a majority of employees perform at an average level, with fewer extremes.

Department-wise Salary Trends
- The bar plot reveals considerable salary variation across departments.
- Some departments, likely Marketing and Finance, have higher mean salaries.
- Other departments, such as Sales or HR, show lower mean salaries.
- Salary discrepancies could indicate differences in job roles, experience levels, or company priorities.

5. Recommendations & Next Steps

Handling Missing Performance Scores: Fill missing values with the average score (about 3), or use predictive modeling to estimate them from other attributes.
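
A simple middle ground between a single global average and a full predictive model is to impute within groups. The sketch below fills each missing score with the median score of the employee's department. This is a technique chosen here for illustration, not something the notebook does, and it runs on made-up data with a hypothetical output column `Score_Imputed`.

```python
import numpy as np
import pandas as pd

# Made-up sample; only the input column names follow the notebook.
sample = pd.DataFrame({
    "Department": ["Sales", "Sales", "Sales", "Finance", "Finance", "Finance"],
    "Performance_Score": [2.0, 4.0, np.nan, 5.0, np.nan, 3.0],
})

# Median score of each row's department (missing values are ignored).
group_median = sample.groupby("Department")["Performance_Score"].transform("median")

# Fill only the missing cells; observed scores are left untouched.
sample["Score_Imputed"] = sample["Performance_Score"].fillna(group_median)
print(sample)
```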

Further Analysis:
- Investigate department-wise performance trends to see if certain departments perform better than others.
- Analyze whether longer tenure correlates with higher salaries or better performance.
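
The tenure question could be approached by deriving tenure from the joining date and correlating it with salary. The sketch below assumes a column named `Joining_Date` and a fixed, hypothetical reference date. The data is made up, so the resulting coefficient is illustrative only.

```python
import pandas as pd

# Made-up sample; "Joining_Date" is an assumed column name.
sample = pd.DataFrame({
    "Joining_Date": ["2015-01-01", "2018-01-01", "2020-01-01", "2023-01-01"],
    "Salary": [95000, 80000, 70000, 50000],
})

# Fixed, hypothetical reference date so the result is reproducible.
as_of = pd.Timestamp("2025-01-01")
joined = pd.to_datetime(sample["Joining_Date"])

# Tenure in years, using an average year length of 365.25 days.
sample["Tenure_Years"] = (as_of - joined).dt.days / 365.25

# Pearson correlation between tenure and salary.
tenure_salary_corr = sample["Tenure_Years"].corr(sample["Salary"])
print(tenure_salary_corr)
```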

Performance Improvement Initiatives:
- Identify key drivers of low and high performance to develop targeted employee development programs.
- Assess whether salary adjustments are needed for high performers in low-paying departments.
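
One way to start on the last point is to flag high performers who sit in departments with below-median pay. The thresholds in the sketch below are hypothetical choices made for illustration: a "high performer" is a score of 4 or more, and a "low-paying" department is one whose median salary is below the company-wide median. The data is made up.

```python
import pandas as pd

# Made-up sample; only the column names follow the notebook.
sample = pd.DataFrame({
    "Department": ["Sales", "Sales", "Finance", "Finance", "Hr", "Hr"],
    "Performance_Score": [5, 2, 4, 3, 5, 4],
    "Salary": [45000, 40000, 95000, 90000, 42000, 60000],
})

# Company-wide median and each row's department median.
company_median = sample["Salary"].median()
dept_median = sample.groupby("Department")["Salary"].transform("median")

# Hypothetical thresholds, see the paragraph above.
high_performer = sample["Performance_Score"] >= 4
low_paying_dept = dept_median < company_median

flagged = sample[high_performer & low_paying_dept]
print(flagged)
```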

This analysis provides a strong foundation for workforce insights and decision-making, helping to optimize salary structures and performance evaluation methods.