**Introduction**

Employee data is a valuable asset for any company. It can be used to understand the workforce, identify trends, and make informed decisions about HR policies and practices.

To do that, we will use Exploratory Data Analysis (EDA) to explore and analyze employee data. EDA is the process of using data visualization and statistical analysis to gain insights into data. It is a crucial step in any data science project, as it helps us understand the data before we begin to build models or make predictions.

# Loading the Data

**First, we import the libraries used in this project.**

In [None]:
# Import the required libraries
import pandas as pd
import numpy as np
import datetime
import matplotlib
import matplotlib.pyplot as plt
from scipy.stats import linregress

**Reading the data from a CSV (comma-separated values) file.**

In [None]:
# importing the dataset to Pandas DataFrame: emp_df
emp_df = pd.read_csv("/kaggle/input/employee-data/employees.csv")

In [None]:
#display first five rows of the DataFrame: emp_df
emp_df.head()

In [None]:
#display last five rows of the DataFrame: emp_df
emp_df.tail()

# Exploring the Data

In [None]:
# Check the number of rows and columns in Pandas DataFrame: emp_df
emp_df.shape

In [None]:
# Check the column labels of Pandas DataFrame: emp_df
emp_df.columns

In [None]:
# Get the concise summary of DataFrame: emp_df
emp_df.info()

In [None]:
# Get the statistical summary of numeric columns of DataFrame: emp_df
emp_df.describe()

# Data Cleaning

In [None]:
# checking the null value in Dataframe : emp_df
emp_df.isna().any()

In [None]:
# counting the null value in Dataframe : emp_df
emp_df.isna().sum()

In [None]:
# Drop the rows that contain null values from the DataFrame: emp_df
emp_df.dropna(inplace = True)
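Before dropping rows, it can help to see how much data `dropna()` would discard. The sketch below uses a small made-up DataFrame (`toy_df` and its values are not taken from `employees.csv`) to show the share of missing values per column and the number of rows that would be removed.

```python
import numpy as np
import pandas as pd

# Made-up rows for illustration only; not taken from employees.csv
toy_df = pd.DataFrame({
    "First Name": ["Ann", None, "Raj", "Lee"],
    "Salary": [50000, 62000, np.nan, 71000],
    "Team": ["Finance", "Sales", "Sales", None],
})

# Share of missing values per column, as a percentage
missing_pct = toy_df.isna().mean() * 100
print(missing_pct)

# Compare row counts before and after dropping incomplete rows
rows_before = len(toy_df)
rows_after = len(toy_df.dropna())
print(f"dropna() would remove {rows_before - rows_after} of {rows_before} rows")
```

If a large share of rows would be lost, filling or flagging the missing values may be a better choice than dropping them.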

In [None]:
# Convert the start date and last login time columns to datetime objects in Dataframe:emp_df
emp_df['Start Date'] = pd.to_datetime(emp_df['Start Date'],format = "mixed")
emp_df['Last Login Time'] = pd.to_datetime(emp_df['Last Login Time'],format = "mixed")
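One thing to keep in mind for `Last Login Time`: if the column holds only a time of day (an assumption here, since the raw values are not shown above), the parsed timestamps still need a date part, and that date is a placeholder rather than real information. A minimal sketch with made-up strings and an explicit, assumed format:

```python
import pandas as pd

# Hypothetical time-of-day strings; the real column format is an assumption
login_times = pd.Series(["12:42 PM", "06:53 AM"])

# Parse with an explicit time-only format
parsed = pd.to_datetime(login_times, format="%I:%M %p")

# With a time-only format, the date part falls back to 1900-01-01
print(parsed.dt.date.unique())

# Keeping only the time of day avoids carrying a placeholder date around
print(parsed.dt.time.tolist())
```

Any later arithmetic that subtracts a date from such a column depends on that placeholder date, so the result should be read with care.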


# Checking for Outliers

Creating boxplots to check for outliers in the DataFrame: emp_df

In [None]:
# Create a boxplot for the salary variable
plt.boxplot(emp_df['Salary'])
plt.xlabel('Salary')
plt.title('Salary distribution')
plt.show()

In [None]:
# Create a boxplot for the Bonus % variable
plt.boxplot(emp_df['Bonus %'])
plt.xlabel('Bonus %')
plt.title('Bonus % distribution')
plt.show()
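The points that a boxplot draws beyond its whiskers follow the 1.5 × IQR rule (matplotlib's default `whis=1.5`), so the same check can be done numerically. The sketch below applies that rule to a made-up salary series; the values are illustrative and not taken from the dataset.

```python
import pandas as pd

# Made-up salaries (in thousands) for illustration only
salaries = pd.Series([40, 42, 45, 47, 50, 52, 55, 58, 60, 150])

# Interquartile range: the spread of the middle 50% of the values
q1 = salaries.quantile(0.25)
q3 = salaries.quantile(0.75)
iqr = q3 - q1

# Values beyond 1.5 * IQR from the quartiles are flagged as outliers
lower_bound = q1 - 1.5 * iqr
upper_bound = q3 + 1.5 * iqr
outliers = salaries[(salaries < lower_bound) | (salaries > upper_bound)]

print(f"Bounds: [{lower_bound}, {upper_bound}]")
print("Outliers:", outliers.tolist())
```

On the real data, the same lines could be run with `emp_df['Salary']` or `emp_df['Bonus %']` in place of `salaries`.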

# Data Exploration

Q What is the relationship between salary and bonus percentage?

Q What is the average salary for each team?

Q What is the gender distribution of employees in senior management?

Q What is the average time an employee has been with the company, based on their start date and last login time?

Q Which teams have the highest and lowest average salaries?

Q What is the relationship between salary and bonus percentage?

In [None]:
# Calculate the correlation between salary and bonus percentage
correlation = emp_df['Salary'].corr(emp_df['Bonus %'])

# Create a scatter plot of salary versus bonus percentage
plt.scatter(emp_df['Salary'], emp_df['Bonus %'])


# Add a title and axis labels
plt.title('Salary vs. Bonus Percentage')
plt.xlabel('Salary')
plt.ylabel('Bonus Percentage')

# Show the plot
plt.show()
print(correlation)

The output shows a negative correlation between salary and bonus percentage, meaning that employees with higher salaries tend to receive lower bonus percentages.
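To make the number returned by `.corr()` less of a black box, the sketch below computes the Pearson correlation coefficient by hand and compares it with the pandas result. The data is made up for illustration, so the value it prints says nothing about the real dataset.

```python
import numpy as np
import pandas as pd

# Made-up values for illustration only; not taken from employees.csv
toy_df = pd.DataFrame({
    "Salary": [40000, 55000, 70000, 85000, 100000],
    "Bonus %": [12.0, 9.5, 10.0, 6.0, 4.5],
})

x = toy_df["Salary"].to_numpy(dtype=float)
y = toy_df["Bonus %"].to_numpy(dtype=float)

# Pearson r: covariance of the centered values divided by the product of their spreads
x_centered = x - x.mean()
y_centered = y - y.mean()
r_manual = (x_centered * y_centered).sum() / np.sqrt(
    (x_centered ** 2).sum() * (y_centered ** 2).sum()
)

# Series.corr uses the Pearson method by default
r_pandas = toy_df["Salary"].corr(toy_df["Bonus %"])

print(r_manual, r_pandas)
```

The sign of r gives the direction of the relationship, while its magnitude (between 0 and 1) gives the strength, so a negative value close to 0 would indicate only a weak tendency.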

Q What is the average salary for each team?

In [None]:
# Calculate the average salary for each team
df_grouped = emp_df.groupby('Team')['Salary'].mean()

# Create a bar plot of the average salary for each team
plt.bar(df_grouped.index, df_grouped.values)

# Rotate the x-axis labels and increase their font size
plt.xticks(rotation=90, fontsize=12)

# Add a title and axis labels
plt.title('Average Salary by Team')
plt.xlabel('Team')
plt.ylabel('Average Salary')

# Show the plot
plt.show()

The figure above shows the average salary by team. The Engineering and Finance teams have almost the same average salary, which differs from that of the Distribution team.
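A bar chart makes close values hard to compare by eye, so a small table can help. The sketch below, on made-up data, aggregates both the mean salary and the number of employees per team; the count matters because an average based on very few employees is less reliable.

```python
import pandas as pd

# Made-up rows for illustration only; not taken from employees.csv
toy_df = pd.DataFrame({
    "Team": ["Engineering", "Engineering", "Finance", "Finance", "Distribution"],
    "Salary": [90000, 110000, 94000, 104000, 60000],
})

# Mean salary and headcount per team, sorted from highest to lowest mean
team_summary = (
    toy_df.groupby("Team")["Salary"]
    .agg(["mean", "count"])
    .sort_values("mean", ascending=False)
)

print(team_summary)
```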

Q What is the gender distribution of employees in senior management?

In [None]:
# Filter the dataframe to only include employees in senior management
df_senior_management = emp_df[emp_df['Senior Management'] == True]

# Count the number of male and female employees in senior management
gender_counts = df_senior_management['Gender'].value_counts()

# Create a pie chart of the gender distribution
plt.pie(gender_counts.values, labels=gender_counts.index, autopct='%1.1f%%')

# Add a title and set an equal aspect ratio
plt.title('Gender Distribution in Senior Management')
plt.axis('equal')  # Equal aspect ratio ensures that the pie chart is circular.

# Show the plot
plt.show()

As the chart shows, a slight majority of the employees in senior management are female, accounting for 51.4% of that group. This means that senior management has a small gender imbalance in favor of women. This is a relatively rare occurrence, but it is not unheard of.
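The percentages in the pie chart describe senior management only, not the whole workforce. The sketch below, on made-up data, computes both shares with `value_counts(normalize=True)` so the two can be compared side by side.

```python
import pandas as pd

# Made-up rows for illustration only; not taken from employees.csv
toy_df = pd.DataFrame({
    "Gender": ["Female", "Male", "Female", "Male", "Female", "Male"],
    "Senior Management": [True, True, True, False, True, True],
})

# Keep only the employees flagged as senior management
senior = toy_df[toy_df["Senior Management"]]

# Absolute counts and percentage shares within senior management
senior_counts = senior["Gender"].value_counts()
senior_shares = senior["Gender"].value_counts(normalize=True) * 100

# Percentage shares across all employees, for comparison
overall_shares = toy_df["Gender"].value_counts(normalize=True) * 100

print(senior_counts)
print(senior_shares)
print(overall_shares)
```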

Q What is the average time an employee has been with the company, based on their start date and last login time?

In [None]:
# Calculate the time each employee has been with the company (a timedelta)
emp_df['Time with Company'] = emp_df['Last Login Time'] - emp_df['Start Date']

# Convert the timedelta to years; pd.to_numeric would yield raw nanosecond counts instead
emp_df['Time with Company'] = emp_df['Time with Company'].dt.days / 365.25

# Calculate the average time each employee has been with the company, in years
average_time_with_company_in_years = emp_df['Time with Company'].mean()

# Create a histogram of the time each employee has been with the company
plt.hist(emp_df['Time with Company'], bins=10)

# Add a title and axis labels
plt.title('Time with Company Distribution')
plt.xlabel('Time with Company (years)')
plt.ylabel('Number of Employees')

# Show the plot
plt.show()

# Print the average time each employee has been with the company, in years
print('Average time with company:', average_time_with_company_in_years, 'years')


The average time with the company is approximately 1 year, which suggests that employees typically do not stay with the company for more than a year.
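Subtracting two datetime columns in pandas gives a timedelta, which needs an explicit unit conversion before it can be read as years. The sketch below shows that conversion on made-up start dates, measured against a fixed `reference_date` (a hypothetical value chosen for illustration, not something taken from the dataset).

```python
import pandas as pd

# Made-up start dates for illustration only; not taken from employees.csv
toy_df = pd.DataFrame({
    "Start Date": pd.to_datetime(["2010-01-01", "2015-07-01", "2019-01-01"]),
})

# Hypothetical fixed reference date, so the result does not depend on when the code runs
reference_date = pd.Timestamp("2020-01-01")

# The difference is a timedelta Series
tenure = reference_date - toy_df["Start Date"]

# Whole days divided by the average year length gives tenure in years
tenure_years = tenure.dt.days / 365.25

print(tenure_years.round(2).tolist())
print("Average tenure in years:", round(tenure_years.mean(), 2))
```

Using a fixed reference date also keeps the analysis reproducible from one run to the next.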

Q Which teams have the highest and lowest average salaries?

In [None]:
# Calculate the average salary for each team
df_grouped = emp_df.groupby('Team')['Salary'].mean()

# Sort the average salaries in descending order
df_grouped = df_grouped.sort_values(ascending=False)

# Get the top 5 teams with the highest average salaries
top_5_teams = df_grouped.index[:5]

# Get the bottom 5 teams with the lowest average salaries
bottom_5_teams = df_grouped.index[-5:]

# Create a bar chart of the average salary for each team
plt.bar(top_5_teams, df_grouped[top_5_teams])
plt.bar(bottom_5_teams, df_grouped[bottom_5_teams], color='red')

# Rotate the x-axis labels and increase their font size
plt.xticks(rotation=90, fontsize=12)

# Add a title and axis labels
plt.title('Average Salary by Team (Top 5 and Bottom 5)')
plt.xlabel('Team')
plt.ylabel('Average Salary')

# Show the plot
plt.show()

* The average salary of employees in the Engineering and Sales teams is higher than the average salary of employees in the Marketing and Client Services teams.

* The difference in average salary between the Engineering and Client Services teams is significant.
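To put a number on the gap between teams, the highest and lowest averages can be pulled out directly with `nlargest` and `nsmallest`. The sketch below uses made-up team averages, so the figures it prints are illustrative and not results from the dataset.

```python
import pandas as pd

# Made-up team averages for illustration only; not taken from employees.csv
avg_salary = pd.Series({
    "Engineering": 95000.0,
    "Sales": 92000.0,
    "Marketing": 88000.0,
    "Client Services": 86000.0,
})

# Teams with the highest and lowest average salaries
highest = avg_salary.nlargest(2)
lowest = avg_salary.nsmallest(2)

# Absolute and relative gap between the best- and worst-paid team
gap = avg_salary.max() - avg_salary.min()
gap_pct = gap / avg_salary.min() * 100

print(highest)
print(lowest)
print(f"Gap: {gap:.0f} ({gap_pct:.1f}% of the lowest average)")
```

Reporting the gap as a percentage makes it easier to judge whether a difference between teams is large or small in practice.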

**Summary**

We used various data visualization techniques to gain insights into the data, such as average salary by team, gender distribution of employees in senior management, and average time with the company.

We found that the Engineering team has the highest average salary, followed by the Sales and Marketing teams. We also found a slight gender imbalance in senior management, with women outnumbering men by a ratio of 51.4:48.6. Additionally, we found that the average time with the company is approximately 1 year.