Introduction

The data set I chose is an HR data set used in the HR Metrics and Analytics department at the New England College of Business. It was created so that HR professionals could learn data analysis. I chose it because I believe the data lends itself to creating visualizations, which lets me practice data visualization. I retrieved the data set from Kaggle: https://www.kaggle.com/datasets/rhuebner/human-resources-data-set


Data Exploration

In [None]:
#import pandas library
import pandas as pd
#import numpy library
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns 

#load the "hr data set"
hr_df = pd.read_csv ('https://raw.githubusercontent.com/GitHub-Vlad/Data-Science-Projects/main/Data%20Analysis%20in%20Python/HRDataset_v14.csv')

#print the data set
print(hr_df)

#print the data types of the columns in the data set
#(info() prints its report directly and returns None, so no print() is needed)
hr_df.info()

#print the count of NaN values in the data set
print(hr_df.isnull().sum().sum())

#print the summary statistics of the data set
print(hr_df.describe())
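
The single total printed above does not show where the missing values are. A minimal sketch of a per-column breakdown follows; it runs on a small made-up frame (sample_df and its values are hypothetical stand-ins for hr_df), so the numbers it prints say nothing about the real data set.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for hr_df; the real data set has many more columns.
sample_df = pd.DataFrame({
    "Employee_Name": ["Adams", "Baker", "Clark", "Diaz"],
    "Salary": [62000, 58000, np.nan, 91000],
    "ManagerID": [22.0, np.nan, np.nan, 4.0],
})

# Missing values per column, largest first, keeping only columns with gaps.
missing_per_column = sample_df.isnull().sum().sort_values(ascending=False)
missing_per_column = missing_per_column[missing_per_column > 0]
print(missing_per_column)

# The grand total is the same number that isnull().sum().sum() reports.
print(missing_per_column.sum())
```

Applied to hr_df, the same two lines would list only the columns that actually contain NaN values.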

Data Wrangling

In [None]:
#change the "MaritalDesc" column name to "Marital_Status"
hr_df.rename(columns={"MaritalDesc": "Marital_Status"},inplace = True)

#change the "Zip"column name to "Zip_Code"
hr_df.rename(columns={"Zip": "Zip_Code"},inplace = True)

#create a new column called "Satisfaction_Level": "High" if EmpSatisfaction > 3, else "Low"
hr_df['Satisfaction_Level'] = np.where(hr_df['EmpSatisfaction'] > 3, 'High', 'Low')

#drop the "DaysLateLast30" column from my data set
del hr_df["DaysLateLast30"]

#drop the 15th and 16th rows from the data set
hr_df.drop([15,16], axis=0, inplace=True)

#sorting on the "Position" column in ascending order and "State" column in descending order
#(sort_values returns a new DataFrame, so the result is assigned back)
hr_df = hr_df.sort_values(['Position', 'State'], ascending=[True, False])

#filtering for all employees whose salary is greater than 62000 and who is a US Citizen
hr_df = hr_df[(hr_df.Salary > 62000) & (hr_df.CitizenDesc == "US Citizen")]

#filtering all employees who reside in Virginia(VA) and are Area Sales Manager
hr_df = hr_df[(hr_df.State == "VA") & (hr_df.Position == "Area Sales Manager")]

#convert all values in the ManagerName column to lower case
#(str.lower() returns a new Series, so the result is assigned back)
hr_df['ManagerName'] = hr_df['ManagerName'].str.lower()

#find the mean, min and max values of Salary
print(hr_df['Salary'].agg(['mean', 'min', 'max']))
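
Several wrangling steps above depend on whether a pandas call changes the DataFrame in place or returns a new object. The sketch below illustrates that difference, plus a one-call summary of a single column. It uses a small made-up frame (demo_df and its values are hypothetical) that only borrows three column names from the HR data.

```python
import pandas as pd

# Hypothetical mini data set that reuses three column names from hr_df.
demo_df = pd.DataFrame({
    "ManagerName": ["Amy Dunn", "Ketsia Liebig", "Amy Dunn"],
    "Position": ["Production Technician I", "Data Analyst", "Accountant I"],
    "Salary": [55000, 89000, 63000],
})

# Without assignment the original frame is left untouched.
demo_df["ManagerName"].str.lower()
unchanged = demo_df["ManagerName"].iloc[0]

# Assigning the result back makes the change stick.
demo_df["ManagerName"] = demo_df["ManagerName"].str.lower()
demo_df = demo_df.sort_values("Position")

# Mean, min and max of one column in a single call.
salary_stats = demo_df["Salary"].agg(["mean", "min", "max"])
print(salary_stats)
```

Grouping by Salary, in contrast, would create one group per distinct salary value rather than summarizing the Salary column itself.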

Visualizations

Part 1: Matplotlib

Scatter Plot: 

I will create a scatter plot that shows the relationship between employee salary and employee satisfaction, with each point colored by performance score. I want to see how well an employee performs based on factors such as work environment, job type, and compensation. Thus, the question here is: Do a high salary and high job satisfaction lead to better employee performance?

In [None]:
#load the "hr data set"
hr_df = pd.read_csv ('https://raw.githubusercontent.com/GitHub-Vlad/Data-Science-Projects/main/Data%20Analysis%20in%20Python/HRDataset_v14.csv')

#plot and show scatter plot 
plt.scatter( hr_df.Salary,hr_df.EmpSatisfaction, c=hr_df.PerformanceScore.astype('category').cat.codes)
plt.title('Performance')
plt.xlabel('Salary')
plt.ylabel('Satisfaction')
plt.show()

#create a legend for the scatter plot -- place it in the upper right hand corner
scatter = plt.scatter( hr_df.Salary,hr_df.EmpSatisfaction, c=hr_df.PerformanceScore.astype('category').cat.codes)
perform_names = ["Exceeds","Fully Meets","Needs Improvement","PIP"]
plt.legend(fontsize = "7",handles=scatter.legend_elements()[0], loc="upper right",
labels = perform_names,title="Employee Performance")
plt.title('Performance')
plt.xlabel('Salary')
plt.ylabel('Satisfaction')
plt.show()

#create a legend for the scatter plot -- place it in the lower right hand corner
scatter = plt.scatter( hr_df.Salary,hr_df.EmpSatisfaction, 
c=hr_df.PerformanceScore.astype('category').cat.codes)
perform_names = ["Exceeds","Fully Meets","Needs Improvement","PIP"]
plt.legend(fontsize = "12",handles=scatter.legend_elements()[0], loc="lower right",
labels = perform_names,title="Employee Performance")
plt.title('Performance')
plt.xlabel('Salary')
plt.ylabel('Satisfaction')
plt.show()

#Place legend outside of the plot
scatter = plt.scatter( hr_df.Salary,hr_df.EmpSatisfaction, 
c=hr_df.PerformanceScore.astype('category').cat.codes)
perform_names = ["Exceeds","Fully Meets","Needs Improvement","PIP"]
plt.legend(bbox_to_anchor=(1.05, 1.0),fontsize = "12",handles=scatter.legend_elements()[0], loc="lower right",
labels = perform_names,title="Employee Performance")
plt.tight_layout()
plt.title('Performance')
plt.xlabel('Salary')
plt.ylabel('Satisfaction')
plt.show()

#changing titles and the x/y labels
scatter = plt.scatter( hr_df.Salary,hr_df.EmpSatisfaction, c=hr_df.PerformanceScore.astype('category').cat.codes)
perform_names = ["Exceeds","Fully Meets","Needs Improvement","PIP"]
plt.legend(bbox_to_anchor=(1.05, 1.0),fontsize = "12",handles=scatter.legend_elements()[0], loc="lower right",
labels = perform_names,title="Employee Performance")
plt.tight_layout()
plt.title('Employee Performance')
plt.xlabel('Employee Salary')
plt.ylabel('Employee Satisfaction')
plt.show()

#changing font size of the x and y ticks
scatter = plt.scatter( hr_df.Salary,hr_df.EmpSatisfaction, c=hr_df.PerformanceScore.astype('category').cat.codes)
perform_names = ["Exceeds","Fully Meets","Needs Improvement","PIP"]
plt.legend(bbox_to_anchor=(1.05, 1.0),fontsize = "12",handles=scatter.legend_elements()[0], loc="lower right",
labels = perform_names,title="Employee Performance")
plt.tight_layout()
plt.title('Employee Performance')
plt.xlabel('Employee Salary')
plt.ylabel('Employee Satisfaction')
plt.xticks(fontsize=12)
plt.yticks(fontsize=12)
plt.show()

#changing size of axis labels
scatter = plt.scatter( hr_df.Salary,hr_df.EmpSatisfaction, c=hr_df.PerformanceScore.astype('category').cat.codes)
perform_names = ["Exceeds","Fully Meets","Needs Improvement","PIP"]
plt.legend(bbox_to_anchor=(1.05, 1.0),fontsize = "12",handles=scatter.legend_elements()[0], loc="lower right",
labels = perform_names,title="Employee Performance")
plt.tight_layout()
plt.title('Employee Performance',fontsize=16)
plt.xlabel('Employee Salary',fontsize=14)
plt.ylabel('Employee Satisfaction',fontsize=14)
plt.xticks(fontsize=12)
plt.yticks(fontsize=12)
plt.show()

Analysis:

After creating the scatter plot, I was able to see some correlation between employee salary, employee satisfaction, and employee performance. I believe that if I had filtered for specific positions, the correlation would have been stronger for some positions than for others; however, I wanted to see an overall trend. In any case, I did see a trend. In particular, as job satisfaction and salary went up, an employee's performance moved from "Fully Meets" to "Exceeds", as we see for salaries from 150,000 to around 200,000. We also see movement from "Needs Improvement" to "Fully Meets" on several occasions, such as from a salary of about 70,000, where the employee "Needs Improvement" and has a lower satisfaction score (2.0), to a salary of over 100,000 with a satisfaction score of 3.0. In conclusion, I believe there is a positive correlation between employee satisfaction, employee performance, and employee salary.
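
A trend read off a scatter plot can be checked with a rank correlation. The following is a minimal sketch under stated assumptions: the rows in sample are made up for illustration, and the worst-to-best ordering in score_order is my own assumption about how the four performance labels rank. It computes a Spearman-style correlation as the Pearson correlation of the ranked columns.

```python
import pandas as pd

# Hypothetical sample; the values are made up for illustration only.
sample = pd.DataFrame({
    "Salary": [50000, 62000, 75000, 90000, 120000, 150000],
    "EmpSatisfaction": [2, 3, 3, 4, 5, 4],
    "PerformanceScore": ["PIP", "Needs Improvement", "Fully Meets",
                         "Fully Meets", "Exceeds", "Exceeds"],
})

# Assumed ordering of the performance labels, from worst to best.
score_order = {"PIP": 0, "Needs Improvement": 1, "Fully Meets": 2, "Exceeds": 3}
sample["PerformanceRank"] = sample["PerformanceScore"].map(score_order)

# Spearman correlation is the Pearson correlation of the ranked values.
numeric_cols = ["Salary", "EmpSatisfaction", "PerformanceRank"]
rank_corr = sample[numeric_cols].rank().corr()
print(rank_corr.round(2))
```

Running the same steps on hr_df would show whether the visual impression holds across all employees; values near 0 would suggest the pattern is weaker than it looks.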

Bar plot:

I will create a bar plot that shows the relationship between employee salary, employee gender, and employee position. To be more specific, I would like to do this for the following positions: Data Analyst, Database Administrator, BI Developer, Production Technician I, and Accountant I. Some questions I would like answered in my analysis include: On average, which gender makes more money at the Production Technician I position? On average, which position pays below 60000 for both genders?


In [None]:


#load the "hr data set"
hr_df = pd.read_csv ('https://raw.githubusercontent.com/GitHub-Vlad/Data-Science-Projects/main/Data%20Analysis%20in%20Python/HRDataset_v14.csv')

#Prepare data for plotting
hr_df_bar = hr_df[hr_df.Position.isin(['Data Analyst','Database Administrator','BI Developer',
'Production Technician I','Accountant I'])]
hr_df_bar_plot = hr_df_bar.groupby(['Position','Sex'])['Salary'].mean().reset_index()
female = hr_df_bar_plot[hr_df_bar_plot['Sex'] == 'F']
male = hr_df_bar_plot[hr_df_bar_plot['Sex'] == 'M ']

#setting the x-axis positions for plotting
x_pos = np.arange(len(male))
#groupby sorts the positions alphabetically, so the tick labels are taken from the grouped data
tick_labels = male['Position'].tolist()

#Plotting 
plt.figure(figsize=(10, 4))
plt.bar(x_pos - 0.2, female['Salary'], width=0.4, label='Female')
plt.bar(x_pos + 0.2, male['Salary'], width=0.4, label='Male')
plt.xticks(x_pos, tick_labels)
plt.title('Average Salary by Position and Gender')
plt.xlabel('Position')
plt.ylabel('Salary')
plt.legend()
plt.show()

#Place legend outside of the plot
plt.figure(figsize=(10, 4))
plt.bar(x_pos - 0.2, female['Salary'], width=0.4, label='Female')
plt.bar(x_pos + 0.2, male['Salary'], width=0.4, label='Male')
plt.xticks(x_pos, tick_labels)
plt.title('Average Salary by Position and Gender')
plt.xlabel('Position')
plt.ylabel('Salary')
plt.legend(bbox_to_anchor=(1.0, 0.5))
plt.show()


#changing titles and the x/y labels
plt.figure(figsize=(10, 4))
plt.bar(x_pos - 0.2, female['Salary'], width=0.4, label='Female')
plt.bar(x_pos + 0.2, male['Salary'], width=0.4, label='Male')
plt.xticks(x_pos, tick_labels)
plt.title('Average Salary by Position and Gender')
plt.xlabel('Employee Position')
plt.ylabel('Employee Salary')
plt.legend(bbox_to_anchor=(1.0, 0.5))
plt.show()

Analysis:

In my analysis, I created a grouped bar plot to gauge the relationship between employee salary, employee position, and employee gender. Interestingly enough, on average, males made more money in positions such as Data Analyst and Production Technician I. Females made more money in the positions Database Administrator, BI Developer, and Accountant I. It is important to note that the difference between these averages is minimal. It is also important to note that our HR data set contained 136 male employees and 177 female employees; this helped boost the average salaries of female employees slightly above those of male employees. I also noticed that for the Production Technician I position, males had a much greater salary on average than females. Finally, from our data set at least, on average, the Data Analyst position makes the least amount of money and the Production Technician I position makes the most among both genders.
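
Because the averages for some positions rest on only a few employees, it helps to look at the headcount next to each mean. The sketch below shows one way to do that, assuming made-up rows (sample and its salaries are hypothetical). It also strips the trailing space from the Sex values, which the 'M ' filter in the plotting cell relies on, and takes the tick labels from the grouped index so labels and bars cannot fall out of order.

```python
import pandas as pd

# Hypothetical rows; "M " carries a trailing space, as in the plotting cell.
sample = pd.DataFrame({
    "Position": ["Data Analyst", "Data Analyst", "Accountant I",
                 "Accountant I", "Accountant I"],
    "Sex": ["F", "M ", "F", "M ", "F"],
    "Salary": [88000, 90000, 62000, 64000, 66000],
})
sample["Sex"] = sample["Sex"].str.strip()

# Mean salary and headcount for every Position/Sex pair.
summary = sample.groupby(["Position", "Sex"])["Salary"].agg(["mean", "count"])

# One row per position, one column per sex; the index doubles as tick labels.
mean_by_sex = summary["mean"].unstack("Sex")
tick_labels = mean_by_sex.index.tolist()
print(summary)
print(mean_by_sex)
```

A position with a count of 1 or 2 for one gender is a hint that its average should be read with caution.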

Line Chart:

I will now create a line chart. I want to see if there is any relationship or trend between employee satisfaction and employee engagement for Data Analysts. In other words, I want to find out: Is an employee who is engaged in their work also satisfied with their job?

In [None]:
#load the "hr data set"
hr_df = pd.read_csv ('https://raw.githubusercontent.com/GitHub-Vlad/Data-Science-Projects/main/Data%20Analysis%20in%20Python/HRDataset_v14.csv')

#prepare data for plotting (sorted by engagement so the line runs left to right)
hr_df = hr_df[hr_df.Position == 'Data Analyst'].sort_values('EngagementSurvey')
Engagement = hr_df['EngagementSurvey']
Satisfaction = hr_df['EmpSatisfaction']

#plot
plt.plot(Engagement,Satisfaction)
plt.title('Data Analyst:Engagement vs Satisfaction')
plt.xlabel('Engagement')
plt.ylabel('Satisfaction')
plt.show()

#reverse the x and y axis
plt.plot(Satisfaction,Engagement)
plt.title('Data Analyst:Engagement vs Satisfaction')
plt.xlabel('Satisfaction')
plt.ylabel('Engagement')
plt.show()

#change the color of the line to green
plt.plot(Satisfaction,Engagement,color = 'green')
plt.title('Data Analyst:Engagement vs Satisfaction')
plt.xlabel('Satisfaction')
plt.ylabel('Engagement')
plt.show()

Analysis:

In my analysis, I wanted to see if there is any correlation between employee satisfaction and employee engagement, in particular for the Data Analyst position. Although the pattern does not hold everywhere, I did see stretches of a straight line with a positive slope between the two variables. In other words, as employee engagement increased, there was a higher level of satisfaction among employees.
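
With only a handful of Data Analysts, a line drawn through individual rows can be hard to read. A hedged alternative is to average engagement at each satisfaction level and to compute a correlation coefficient. The sketch below does both on made-up survey values (sample is hypothetical and not taken from the HR data).

```python
import pandas as pd

# Hypothetical survey values for a handful of employees.
sample = pd.DataFrame({
    "EmpSatisfaction": [3, 5, 4, 3, 5, 4],
    "EngagementSurvey": [3.0, 4.8, 4.1, 3.4, 4.6, 3.9],
})

# Average engagement at each satisfaction level, ordered by satisfaction.
mean_engagement = (
    sample.groupby("EmpSatisfaction")["EngagementSurvey"].mean().sort_index()
)
print(mean_engagement)

# Pearson correlation between the two survey measures.
corr = sample["EmpSatisfaction"].corr(sample["EngagementSurvey"])
print(round(corr, 2))
```

Plotting mean_engagement gives one point per satisfaction level, so the line can only move left to right.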

Part 2: Seaborn library

Scatterplot: 

I will create a scatter plot that shows the relationship between employee salary and employee satisfaction, with each point colored by performance score. I want to see how well an employee performs based on factors such as work environment, job type, and compensation. Thus, the question here is: Do a high salary and high job satisfaction lead to better employee performance?

In [None]:
#load the "hr data set"
hr_df = pd.read_csv ('https://raw.githubusercontent.com/GitHub-Vlad/Data-Science-Projects/main/Data%20Analysis%20in%20Python/HRDataset_v14.csv')

#plot and show scatterplot
sns.scatterplot(data=hr_df , x="Salary", y="EmpSatisfaction",
c=hr_df.PerformanceScore.astype('category').cat.codes)
plt.show()

#create a legend for the scatter plot -- place it in the upper right hand corner
#(hue colors the points and builds the legend, so no separate c= argument is needed)
scatter = sns.scatterplot(data=hr_df, x="Salary", y="EmpSatisfaction", hue="PerformanceScore")
sns.move_legend(scatter, "upper right")
plt.show()

#create a legend for the scatter plot -- place it in the lower right hand corner
scatter = sns.scatterplot(data=hr_df, x="Salary", y="EmpSatisfaction", hue="PerformanceScore")
sns.move_legend(scatter, "lower right")
plt.show()


#Place legend outside of the plot
scatter = sns.scatterplot(data=hr_df, x="Salary", y="EmpSatisfaction", hue="PerformanceScore")
sns.move_legend(scatter, "upper left", bbox_to_anchor=(1.02, 1.1))
plt.tight_layout()
plt.show()

#changing titles and the x/y labels (the axes are also swapped here)
scatter = sns.scatterplot(data=hr_df, x="EmpSatisfaction", y="Salary", hue="PerformanceScore")
sns.move_legend(scatter, "upper left", bbox_to_anchor=(1.02, 1.1))
plt.title('Employee Performance')
plt.xlabel('Employee Satisfaction')
plt.ylabel('Employee Salary')
plt.tight_layout()
plt.show()

#changing font size of the x and y ticks
scatter = sns.scatterplot(data=hr_df, x="EmpSatisfaction", y="Salary", hue="PerformanceScore")
sns.move_legend(scatter, "upper left", bbox_to_anchor=(1.02, 1.1))
plt.title('Employee Performance')
plt.xlabel('Employee Satisfaction')
plt.ylabel('Employee Salary')
plt.xticks(fontsize=12)
plt.yticks(fontsize=12)
plt.tight_layout()
plt.show()

Analysis:

After creating the scatter plot, I was able to see some correlation between employee salary, employee satisfaction, and employee performance. I believe that if I had filtered for specific positions, the correlation would have been stronger for some positions than for others; however, I wanted to see an overall trend. In any case, I did see a trend. In particular, as job satisfaction and salary went up, an employee's performance moved from "Fully Meets" to "Exceeds", as we see for salaries from 150,000 to around 200,000. We also see movement from "Needs Improvement" to "Fully Meets" on several occasions, such as from a salary of about 70,000, where the employee "Needs Improvement" and has a lower satisfaction score (2.0), to a salary of over 100,000 with a satisfaction score of 3.0. In conclusion, I believe there is a positive correlation between employee satisfaction, employee performance, and employee salary.

Bar Plot:

I will create a bar plot that shows the relationship between employee salary, employee gender, and employee position. To be more specific, I would like to do this for the following positions: Data Analyst, Database Administrator, BI Developer, Production Technician I, and Accountant I. Some questions I would like answered in my analysis include: On average, which gender makes more money at the Production Technician I position? On average, which position pays below 60000 for both genders?

In [None]:
#load the "hr data set"
hr_df = pd.read_csv ('https://raw.githubusercontent.com/GitHub-Vlad/Data-Science-Projects/main/Data%20Analysis%20in%20Python/HRDataset_v14.csv')

#Prepare data for plotting
hr_df_bar = hr_df[hr_df.Position.isin(['Data Analyst','Database Administrator','BI Developer',
'Production Technician I','Accountant I'])]
hr_df_bar_plot = hr_df_bar.groupby(['Position','Sex'])['Salary'].mean().reset_index()

#Plotting
sns.barplot(x='Position', y='Salary', hue='Sex', data=hr_df_bar_plot, palette=['blue', 'orange'])
plt.title('Average Salary by Position and Gender')
plt.xlabel('Position')
plt.ylabel('Salary')
plt.show()


#Place legend outside of the plot
sns.barplot(x='Position', y='Salary', hue='Sex', data=hr_df_bar_plot, palette=['blue', 'orange'])
plt.legend(bbox_to_anchor=(1.0, 0.5))
plt.title('Average Salary by Position and Gender')
plt.xlabel('Position')
plt.ylabel('Salary')
plt.show()

#changing titles and the x/y labels (the bars are now horizontal, so Salary is on the x-axis)
sns.barplot(x='Salary', y='Position', hue='Sex', data=hr_df_bar_plot, palette=['blue', 'orange'])
plt.legend(bbox_to_anchor=(1.0, 0.5))
plt.title('Average Salary by Position and Gender')
plt.xlabel('Employee Salary')
plt.ylabel('Employee Position')
plt.show()

#adding annotations
bar = sns.barplot(x='Position', y='Salary', hue='Sex', data=hr_df_bar_plot, palette=['blue', 'orange'])
for b in bar.patches:
    bar.annotate(format(b.get_height(), '.0f'), (b.get_x() + b.get_width() / 2.,
     b.get_height()), ha = 'center', va = 'center', 
                   size=15,
                   xytext = (0, -12), 
                   textcoords = 'offset points')
plt.legend(bbox_to_anchor=(1.0, 0.5))
plt.show()


Analysis:

In my analysis, I created a grouped bar plot to gauge the relationship between employee salary, employee position, and employee gender. Interestingly enough, on average, males made more money in positions such as Data Analyst and Production Technician I. Females made more money in the positions Database Administrator, BI Developer, and Accountant I. It is important to note that the difference between these averages is minimal. It is also important to note that our HR data set contained 136 male employees and 177 female employees; this helped boost the average salaries of female employees slightly above those of male employees. I also noticed that for the Production Technician I position, males had a much greater salary on average than females. Finally, from our data set at least, on average, the Data Analyst position makes the least amount of money and the Production Technician I position makes the most among both genders.

Line Chart:

I will now create a line chart. I want to see if there is any relationship or trend between employee satisfaction and employee engagement for Data Analysts. In other words, I want to find out: Is an employee who is engaged in their work also satisfied with their job?

In [None]:

#load the "hr data set"
hr_df = pd.read_csv ('https://raw.githubusercontent.com/GitHub-Vlad/Data-Science-Projects/main/Data%20Analysis%20in%20Python/HRDataset_v14.csv')


#prepare data for plotting
hr_df = hr_df[hr_df.Position.isin(['Data Analyst'])]

#plotting
sns.lineplot( x = "EngagementSurvey", y = "EmpSatisfaction",data = hr_df);
plt.title('Engagement vs Satisfaction')
plt.xlabel('Engagement')
plt.ylabel('Satisfaction')
plt.show()

#reverse the x and y axis (the axis labels are swapped to match)
sns.lineplot( x = "EmpSatisfaction", y = "EngagementSurvey",data = hr_df);
plt.title('Engagement vs Satisfaction')
plt.xlabel('Satisfaction')
plt.ylabel('Engagement')
plt.show()

#changing color of line
sns.lineplot( x = "EngagementSurvey", y = "EmpSatisfaction",data = hr_df,color = 'green');
plt.title('Data Analyst:Engagement vs Satisfaction')
plt.xlabel('Engagement')
plt.ylabel('Satisfaction')
plt.show()


Analysis:

In my analysis, I wanted to see if there is any correlation between employee satisfaction and employee engagement, in particular for the Data Analyst position. Although the pattern does not hold everywhere, I did see stretches of a straight line with a positive slope between the two variables. In other words, as employee engagement increased, there was a higher level of satisfaction among employees.

Part 3: 

In a comment or text box, explain the differences between creating a plot in matplotlib and seaborn, based on your above plots.

To analyze the differences between matplotlib and seaborn, it is important to understand that the seaborn library is built on top of the matplotlib library. Because seaborn wraps common plotting patterns in higher-level functions, creating a plot in matplotlib generally requires more lines of code than in seaborn. Looking at the graphs above, this certainly holds true: a single seaborn call such as sns.barplot() with hue replaces several lines of manual grouping, bar positioning, and tick labeling in matplotlib. I would also like to note that the two can be integrated. Even when plotting with seaborn, I still used matplotlib functions such as plt.title(), plt.legend(), and plt.show().
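
To make the comparison concrete, here is a minimal side-by-side sketch of the same grouped bar chart drawn both ways. The averages in avg are made up for illustration (they are not taken from the HR data set), and the names ax_mpl and ax_sns are my own.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

# Hypothetical averages, not taken from the HR data set.
avg = pd.DataFrame({
    "Position": ["Accountant I", "Accountant I", "Data Analyst", "Data Analyst"],
    "Sex": ["F", "M", "F", "M"],
    "Salary": [63000, 64000, 88000, 90000],
})

fig, (ax_mpl, ax_sns) = plt.subplots(1, 2, figsize=(10, 4))

# matplotlib: bar positions, widths, and tick labels are set by hand.
wide = avg.pivot(index="Position", columns="Sex", values="Salary")
x_pos = np.arange(len(wide))
ax_mpl.bar(x_pos - 0.2, wide["F"], width=0.4, label="F")
ax_mpl.bar(x_pos + 0.2, wide["M"], width=0.4, label="M")
ax_mpl.set_xticks(x_pos)
ax_mpl.set_xticklabels(wide.index)
ax_mpl.legend(title="Sex")
ax_mpl.set_title("matplotlib")

# seaborn: grouping, positions, tick labels, and legend come from one call.
sns.barplot(data=avg, x="Position", y="Salary", hue="Sex", ax=ax_sns)
ax_sns.set_title("seaborn")

fig.tight_layout()
```

Calling plt.show() afterwards displays both panels. The seaborn panel is still an ordinary matplotlib Axes, which is why matplotlib calls such as set_title() work on it.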

Conclusions:

This HR data set was a good data set to perform analysis on. I believe it contained a variety of useful variables, such as employee satisfaction and employee performance, which allowed me to answer a good number of questions about the data.