# Using Statistical Methods To Find Pay Gap Within The Organization   

By Niladri Ghosh

## 1. Identify Problem

Pay gaps have become one of the main workplace concerns in recent times. We live in the 21st century, and rights and freedoms should be equal for everyone, irrespective of gender, race, age, or any other attribute. People should be compensated according to the quality of the work they do or the revenue they bring in. With this motivation, we want to find out whether a pay gap exists within an organization on any basis and, if so, identify the reasons behind it and report them.

### 1.1 Expected Outcome

The data provided by Spark Foundation (imaginary) describes various features of employees, such as Name, Age, Gender, Country, and Ethnicity. Our final outcome is a conclusion on whether there is a pay gap or not.

### 1.2 Objective

Since several features in our data are stored as plain strings rather than in their proper types (dates, for example), we need to find appropriate techniques to convert them, and we also need to rectify and clean the data. The cleaned data is then analysed with a statistical method referred to as a hypothesis test.

> Since this is a __hypothesis test__, our __final goal is to check whether the data lets us reject the null hypothesis (no pay gap); rejecting it supports the alternative hypothesis that a pay gap exists__.
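
One common way to write such a test down, assuming we compare the mean salaries of two groups A and B (for example, two genders), is shown below. The group labels and the choice of Welch's t statistic are illustrative assumptions, not necessarily the exact method used later:

$$
H_0:\ \mu_A = \mu_B \qquad \text{vs.} \qquad H_1:\ \mu_A \neq \mu_B
$$

$$
t = \frac{\bar{x}_A - \bar{x}_B}{\sqrt{\dfrac{s_A^2}{n_A} + \dfrac{s_B^2}{n_B}}}
$$

Here $\bar{x}$, $s^2$ and $n$ are the sample mean, sample variance and size of each group. A large $|t|$ (small p-value) is evidence against $H_0$.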

### 1.3 Identify Data Sources

The dataset contains 174 rows of data and 10 columns (including salary).

* The 10th column provides the employee salary.
* Columns 1-9 provide various details about the employee.

__Getting Started : Load libraries and set options__

In [1]:
# import necessary libraries
import numpy as np
import pandas as pd
import math
from scipy.stats import sem
from scipy.stats import t

# default='warn'
pd.options.mode.chained_assignment = None  
pd.set_option('display.max_columns', None)

import warnings
warnings.filterwarnings('ignore')

__Load Dataset__

First, load the supplied CSV file using the pandas read_csv function.

In [2]:
# read data
df_primary = pd.read_csv('data/evidence.csv',delimiter=',')

In [3]:
# create copy of dataframe
df = df_primary.copy()

__Inspecting the data__

The first step is to visually inspect the new dataset. There are multiple ways to achieve this:
* The easiest way is to fetch the first 5 rows using DataFrame.head(), here df.head().
* Alternatively, we can fetch the last 5 rows using DataFrame.tail(), here df.tail().

__NOTE:__ 

For both of the above methods, we can pass a number inside the parentheses '()' to specify how many rows to display.
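
As a small illustration of the row-count parameter, here is a sketch on a hypothetical toy DataFrame (the `toy` frame and its values are made up for the example and are not part of the dataset):

```python
import pandas as pd

# Hypothetical toy frame standing in for the employee data
toy = pd.DataFrame({
    "Age": list(range(20, 32)),
    "Salary": list(range(50000, 62000, 1000)),
})

first_three = toy.head(3)  # first 3 rows instead of the default 5
last_two = toy.tail(2)     # last 2 rows instead of the default 5

print(first_three)
print(last_two)
```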

In [4]:
df.head(10)

Unnamed: 0,Surname,Name,Age,Gender,Country,Ethnicity,Start_date,Department,Position,Salary
0,Sweetwater,Alex,51,Male,United States,White,15-08-2011,Software Engineering,Software Engineering Manager,"$56,160.00"
1,Carabbio,Judith,30,Female,United States,White,11-11-2013,Software Engineering,Software Engineer,"$1,16,480.00"
2,Saada,Adell,31,Female,United States,White,05-11-2012,Software Engineering,Software Engineer,"$1,02,440.00"
3,Szabo,Andrew,34,Male,United States,White,07-07-2014,Software Engineering,Software Engineer,"$99,840.00"
4,Andreola,Colby,38,Female,United States,White,10-11-2014,Software Engineering,Software Engineer,"$99,008.00"
5,Daneault,Lynn,27,Female,United States,White,05-05-2014,Sales,Sales Manager,"$1,12,320.00"
6,Houlihan,Debra,51,Female,United States,White,05-05-2014,Sales,Director of Sales,"$1,24,800.00"
7,Onque,Jasmine,27,Female,United States,White,30-09-2013,Sales,Area Sales Manager,"$1,18,560.00"
8,Jeremy,Peter,43,Male,United States,White,12-05-2014,Sales,Area Sales Manager,"$1,16,480.00"
9,Gonzales,Ricardo,63,Male,United States,White,12-05-2014,Sales,Area Sales Manager,"$1,15,440.00"


In [5]:
# check shape of the given data
df.shape

(174, 10)

We can see that the dataset has 174 rows, each with 10 columns.

Alternatively, we can use the info() method provided by pandas to generate a concise summary of the data. It reports each column's name, non-null count and data type, along with the number of rows and the memory usage.

In [6]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 174 entries, 0 to 173
Data columns (total 10 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   Surname     174 non-null    object
 1   Name        174 non-null    object
 2   Age         174 non-null    int64 
 3   Gender      174 non-null    object
 4   Country     174 non-null    object
 5   Ethnicity   174 non-null    object
 6   Start_date  174 non-null    object
 7   Department  174 non-null    object
 8   Position    174 non-null    object
 9   Salary      174 non-null    object
dtypes: int64(1), object(9)
memory usage: 13.7+ KB


Some columns have the wrong data type:

* Start_date column needs to be set as datetime.
* Salary to be set as float.

Change the data type of Start_date to datetime and Salary to float. Since the Salary values contain "$" and ",", we need to remove those characters first.

In [7]:
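# rectify data types

# dates are stored day-first (e.g. 15-08-2011), so state the format explicitly
df['Start_date'] = pd.to_datetime(df['Start_date'], format='%d-%m-%Y')
# regex=False treats '$' as a literal character, not a regex anchor
df['Salary'] = df['Salary'].str.replace('$', '', regex=False)
df['Salary'] = df['Salary'].str.replace(',', '', regex=False)
df['Salary'] = df['Salary'].astype(float)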
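
To sanity-check this kind of conversion, the sketch below applies it to two values copied from the df.head(10) preview. Passing an explicit `format` and `regex=False` is an assumption about the safest options here: the dates in the preview are day-first, and "$" is otherwise a regular-expression anchor.

```python
import pandas as pd

# Two sample values copied from the preview of the dataset
raw = pd.DataFrame({
    "Start_date": ["15-08-2011", "05-11-2012"],
    "Salary": ["$56,160.00", "$1,02,440.00"],
})

# Explicit day-first format, so 05-11-2012 is read as 5 November 2012
raw["Start_date"] = pd.to_datetime(raw["Start_date"], format="%d-%m-%Y")

# regex=False treats "$" as a literal character, not an end-of-string anchor
raw["Salary"] = (
    raw["Salary"]
    .str.replace("$", "", regex=False)
    .str.replace(",", "", regex=False)
    .astype(float)
)

print(raw.dtypes)
print(raw)
```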

Check for Null and Duplicate values.

In [8]:
# check duplicates
df.duplicated().all()

False

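The result is False, so not every row is a duplicate. Note that all() is True only when every row is flagged; df.duplicated().any() is the stricter check for confirming that no duplicated rows are present at all.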
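
The difference between all() and any() on a duplicate mask can be seen on a hypothetical toy frame (the data below is made up for the example):

```python
import pandas as pd

# Hypothetical toy frame: rows 0 and 2 are identical
toy = pd.DataFrame({
    "Age": [30, 41, 30],
    "Gender": ["Female", "Male", "Female"],
})

dup_mask = toy.duplicated()   # flags later copies of an earlier row
all_dups = dup_mask.all()     # True only if every row is flagged
any_dups = dup_mask.any()     # True if at least one row is flagged
n_dups = int(dup_mask.sum())  # number of flagged rows

print(all_dups, any_dups, n_dups)
```

In this toy case all() reports False even though a duplicate exists, which is why any() or sum() is the safer check.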

In [9]:
# check null values (any() flags a column that has at least one null)
df.isna().any()

Surname       False
Name          False
Age           False
Gender        False
Country       False
Ethnicity     False
Start_date    False
Department    False
Position      False
Salary        False
dtype: bool

No Null values present in the dataset.

Check all the columns.

In [10]:
df.columns

Index(['Surname', 'Name', 'Age', 'Gender', 'Country', 'Ethnicity',
       'Start_date', 'Department', 'Position', 'Salary'],
      dtype='object')

Remove unnecessary columns.

In [11]:
df.drop(['Surname', 'Name','Country','Department'], axis = 1, inplace=True) 

Save the clean dataset.

In [12]:
df.to_csv('data/evidence_clean.csv', index=False)
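
A CSV file does not store column data types, so the datetime type of Start_date has to be restored when the file is read back. The sketch below shows one way to do that with the `parse_dates` argument of read_csv. It uses an in-memory buffer and a hypothetical two-row sample instead of the real file:

```python
import io
import pandas as pd

# In-memory stand-in for the saved CSV (hypothetical two-row sample)
clean = pd.DataFrame({
    "Age": [51, 30],
    "Start_date": pd.to_datetime(["2011-08-15", "2013-11-11"]),
    "Salary": [56160.0, 116480.0],
})
buffer = io.StringIO()
clean.to_csv(buffer, index=False)
buffer.seek(0)

# Without parse_dates, Start_date would come back as plain strings
reloaded = pd.read_csv(buffer, parse_dates=["Start_date"])

print(reloaded.dtypes)
```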

> The next notebook, titled NB2_Conclusion, applies statistical methods to calculate the wage gap.