In [91]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
import plotly.figure_factory as ff
import scipy as sy

In [92]:
df = pd.read_csv('HR RAW DATA.csv')

First, we need to put the data in perspective. To work with gender relationships, we must understand the </br>
proportion of each category.
Note that the gender categories in this dataset are binary.

In [93]:
gender_dist = df['Gender'].value_counts()
gender_dist

Gender
Male      882
Female    588
Name: count, dtype: int64

Gender distribution percentage

In [94]:
perc_Gender = (gender_dist/len(df['Gender'])*100).map('{:.1f}%'.format)
perc_Gender

Gender
Male      60.0%
Female    40.0%
Name: count, dtype: object
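
The same percentages can be obtained without dividing by the length by hand. The sketch below is only an illustration: `gender` is a hypothetical stand-in for `df['Gender']`, rebuilt from the counts printed above, and `share` / `share_display` are names introduced here.

```python
import pandas as pd

# Hypothetical stand-in for df['Gender'], rebuilt from the printed counts.
gender = pd.Series(['Male'] * 882 + ['Female'] * 588, name='Gender')

# normalize=True returns shares directly, so no division by len() is needed.
share = gender.value_counts(normalize=True)

# Keep the numeric shares for later calculations; format only for display.
share_display = share.mul(100).map('{:.1f}%'.format)
print(share_display)
```

Keeping `share` numeric and formatting a separate copy avoids ending up with an `object` column of strings that cannot be reused in arithmetic.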

Now that we know the proportion of each gender in the whole population, we will look at the Job Level distribution by Gender, </br>
normalized as a probability density. Before plotting, let's check the numbers:

In [95]:
relative = df.groupby(['Gender', 'JobLevel']).size().groupby(level=0).apply(
    lambda x: x / x.sum()
)

relative

Gender  Gender  JobLevel
Female  Female  1           0.338435
                2           0.374150
                3           0.159864
                4           0.086735
                5           0.040816
Male    Male    1           0.390023
                2           0.356009
                3           0.140590
                4           0.062358
                5           0.051020
dtype: float64

These 'relative' numbers will be useful for verifying the results of the following plots.
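
The repeated `Gender` level in the output above appears to be a side effect of `groupby(...).apply`, which prepends the group key to the index. One possible alternative, shown here as a sketch on a hypothetical miniature frame (`toy` and `relative_wide` are names introduced for illustration), is `pd.crosstab` with `normalize='index'`:

```python
import pandas as pd

# Hypothetical miniature of the HR data; only the two columns used here.
toy = pd.DataFrame({
    'Gender':   ['Female', 'Female', 'Female', 'Male', 'Male', 'Male', 'Male'],
    'JobLevel': [1, 2, 2, 1, 1, 2, 3],
})

# normalize='index' divides each row (one gender) by its own total,
# so every row sums to 1 and the genders are directly comparable.
relative_wide = pd.crosstab(toy['Gender'], toy['JobLevel'], normalize='index')
print(relative_wide.round(3))
```

The result has one row per gender and one column per job level, which makes it easier to read the two genders side by side.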

In [96]:
job_lv_male = df.loc[df['Gender'] == 'Male', 'JobLevel']
job_lv_fem = df.loc[df['Gender'] == 'Female', 'JobLevel']

Job Level distribution by Gender

In [98]:

jb_lv = job_lv_male.values.tolist(), job_lv_fem.values.tolist()

group_labels = 'Male', 'Female'

fig = ff.create_distplot(
                        jb_lv,
                        group_labels,
                        show_hist=False,
                        show_rug=False,
                        colors=['rgb(255,0,0)','rgb(0,0,255)']
                        )

fig.show()

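Note the slight female advantage in terms of Job Level: proportionally fewer women are at level 1 and more at levels 2 to 4, although level 5 tilts slightly toward men. This is visible because the curves are normalized, which allows a proportional comparison despite the different group sizes.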
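
Since Job Level takes only five values, the reading of the curves can be cross-checked against the 'relative' shares. The sketch below copies the shares from the output printed earlier; `gap`, `mean_level_female` and `mean_level_male` are names introduced here, and treating the levels as plain numbers to average them is an assumption made only for this check.

```python
import pandas as pd

levels = [1, 2, 3, 4, 5]
# Shares copied from the `relative` output printed earlier.
female = pd.Series([0.338435, 0.374150, 0.159864, 0.086735, 0.040816], index=levels)
male = pd.Series([0.390023, 0.356009, 0.140590, 0.062358, 0.051020], index=levels)

# Positive values: women are over-represented at that level relative to men.
gap = female - male

# Share-weighted average level per gender (levels treated as plain numbers).
mean_level_female = (female.index.to_numpy() * female.to_numpy()).sum()
mean_level_male = (male.index.to_numpy() * male.to_numpy()).sum()

print(gap.round(3))
print(round(mean_level_female, 2), round(mean_level_male, 2))
```

The gap is negative at levels 1 and 5 and positive at levels 2 to 4, and the weighted average level comes out at about 2.12 for women versus 2.03 for men.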

Now we will draw the same kind of line (KDE) plot, this time for the Percent Salary Hike by Gender.

In [99]:
sal_hk_male = df.loc[df['Gender'] == 'Male', 'PercentSalaryHike']
sal_hk_fem = df.loc[df['Gender'] == 'Female', 'PercentSalaryHike']

In [100]:
sal_hike = sal_hk_male.values.tolist(), sal_hk_fem.values.tolist()

group_labels = 'Male', 'Female'

fig = ff.create_distplot(
                        sal_hike,
                        group_labels,
                        show_rug=False,
                        show_hist=False,
                        colors=['rgb(255,0,0)','rgb(0,0,255)'],
                        )

fig.show()

The two genders are still fairly balanced here. Well done, HR! </br></br>
For some reason, there is an imbalance between the number of male workers (60%) and female workers (40%). </br>
Many social and economic variables could contribute to this proportion, and we cannot investigate further with this data alone.</br>
However, looking at the data proportionally, we found that some key equality factors are well balanced at this point.
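
A visual impression of balance can be backed up with a few summary numbers per gender. This is a minimal sketch on hypothetical rows (`toy` stands in for `df[['Gender', 'PercentSalaryHike']]`, and `hike_summary` is a name introduced here), not a result from the HR data:

```python
import pandas as pd

# Hypothetical rows standing in for df[['Gender', 'PercentSalaryHike']].
toy = pd.DataFrame({
    'Gender': ['Male', 'Male', 'Male', 'Female', 'Female', 'Female'],
    'PercentSalaryHike': [11, 14, 20, 12, 14, 19],
})

# One row per gender: if the distributions are balanced,
# the values in each column should be close to each other.
hike_summary = toy.groupby('Gender')['PercentSalaryHike'].agg(['mean', 'median', 'std'])
print(hike_summary.round(2))
```

Running the same `groupby(...).agg(...)` on `df` would give the real figures to compare with the KDE curves.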

The next plot shows the relation between Monthly Income and Years at the Company. </br>
Each row represents a Job Level category, starting at the top.</br>
The colors represent the genders.

In [101]:
fig = px.bar(
            df,  
            x='YearsAtCompany', 
            y='MonthlyIncome',
            color='Gender',
            barmode='group',
            facet_row='JobLevel',
            title='Monthly Income and Years at Company by Job Level and Gender',
            height=1000,
            )

fig.for_each_annotation(lambda a: a.update(text=a.text.split("=")[-1]))
fig.show()
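
Because `px.bar` receives one row per employee here, several bars can share the same x position within a facet. One option, sketched below on hypothetical rows, is to aggregate first so that each bar has a single meaning. The choice of the median and the names `toy` and `income_median` are assumptions made for this illustration.

```python
import pandas as pd

# Hypothetical rows with the four columns the plot uses.
toy = pd.DataFrame({
    'JobLevel':       [1, 1, 1, 1, 2, 2],
    'Gender':         ['Male', 'Male', 'Female', 'Female', 'Male', 'Female'],
    'YearsAtCompany': [5, 5, 5, 10, 5, 5],
    'MonthlyIncome':  [2000, 3000, 2500, 4000, 6000, 6500],
})

# Collapse to one value per (level, gender, tenure) so each bar has a single meaning.
income_median = (
    toy.groupby(['JobLevel', 'Gender', 'YearsAtCompany'], as_index=False)['MonthlyIncome']
       .median()
)
print(income_median)
```

The aggregated frame keeps the same column names, so it could be passed to the `px.bar` call above in place of `df` with the other arguments unchanged.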

It is no surprise that the longer you work at the company, the higher your income,</br>
but, as we can see, there are some workers with 15-20 years on the job who are still </br>
stuck at levels 1 and 2.</br></br>
That is cause for concern. Let's take a closer look.

In [123]:
lw_lv_oldWorkers = df[(df['JobLevel']<=2) & (df['YearsAtCompany']>=15)]
llv_oldW_filtered = lw_lv_oldWorkers.filter(items=['JobLevel', 'YearsAtCompany']).reset_index(drop=True)
len(llv_oldW_filtered)

38

In [131]:
lowl_oldw_per = len(llv_oldW_filtered)/len(df)*100
lowlper_formatted = '{:.1f}%'.format(lowl_oldw_per)
lowlper_formatted

'2.6%'

The number of low-level workers (Job Level 1 or 2) with fifteen or more years at the company is 38,</br>
which corresponds to 2.6% of the whole population.</br>
This raises a question: are these long-serving employees being rewarded enough?
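
The 2.6% figure uses the whole workforce as its denominator. A complementary view, sketched here on hypothetical rows, is the share of long-tenured employees who are still at levels 1 or 2, optionally split by gender. The names `toy`, `long_tenure`, `share_of_long_tenure` and `share_by_gender` are introduced for illustration only.

```python
import pandas as pd

# Hypothetical rows; only the columns needed for the comparison.
toy = pd.DataFrame({
    'Gender':         ['Male', 'Female', 'Male', 'Female', 'Male', 'Female'],
    'JobLevel':       [1, 2, 4, 3, 2, 1],
    'YearsAtCompany': [16, 20, 18, 15, 3, 7],
})

# Restrict to employees with fifteen or more years at the company.
long_tenure = toy[toy['YearsAtCompany'] >= 15]
low_level = long_tenure['JobLevel'] <= 2

# Share of long-tenured employees who are still at levels 1 or 2.
share_of_long_tenure = low_level.mean()

# Same share split by gender, in line with the rest of the analysis.
share_by_gender = low_level.groupby(long_tenure['Gender']).mean()
print(f'{share_of_long_tenure:.1%}')
print(share_by_gender)
```

Applied to `df`, this would show whether the 38 employees are a small or a large part of the long-tenured group, which the 2.6% figure alone does not reveal.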