In [13]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

%matplotlib inline
sns.set_style('darkgrid')

plt.rcParams['figure.dpi'] = 200
plt.style.use("fivethirtyeight")

In [14]:
df = pd.read_csv('WA_Fn-UseC_-HR-Employee-Attrition.csv')
df.head()


Unnamed: 0,Age,Attrition,BusinessTravel,DailyRate,Department,DistanceFromHome,Education,EducationField,EmployeeCount,EmployeeNumber,...,RelationshipSatisfaction,StandardHours,StockOptionLevel,TotalWorkingYears,TrainingTimesLastYear,WorkLifeBalance,YearsAtCompany,YearsInCurrentRole,YearsSinceLastPromotion,YearsWithCurrManager
0,41,Yes,Travel_Rarely,1102,Sales,1,2,Life Sciences,1,1,...,1,80,0,8,0,1,6,4,0,5
1,49,No,Travel_Frequently,279,Research & Development,8,1,Life Sciences,1,2,...,4,80,1,10,3,3,10,7,1,7
2,37,Yes,Travel_Rarely,1373,Research & Development,2,2,Other,1,4,...,2,80,0,7,3,3,0,0,0,0
3,33,No,Travel_Frequently,1392,Research & Development,3,4,Life Sciences,1,5,...,3,80,0,8,3,3,8,7,3,0
4,27,No,Travel_Rarely,591,Research & Development,2,1,Medical,1,7,...,4,80,1,6,3,3,2,2,2,2


In [15]:
df.tail()

Unnamed: 0,Age,Attrition,BusinessTravel,DailyRate,Department,DistanceFromHome,Education,EducationField,EmployeeCount,EmployeeNumber,...,RelationshipSatisfaction,StandardHours,StockOptionLevel,TotalWorkingYears,TrainingTimesLastYear,WorkLifeBalance,YearsAtCompany,YearsInCurrentRole,YearsSinceLastPromotion,YearsWithCurrManager
1465,36,No,Travel_Frequently,884,Research & Development,23,2,Medical,1,2061,...,3,80,1,17,3,3,5,2,0,3
1466,39,No,Travel_Rarely,613,Research & Development,6,1,Medical,1,2062,...,1,80,1,9,5,3,7,7,1,7
1467,27,No,Travel_Rarely,155,Research & Development,4,3,Life Sciences,1,2064,...,2,80,1,6,0,3,6,2,0,3
1468,49,No,Travel_Frequently,1023,Sales,2,3,Medical,1,2065,...,4,80,0,17,3,2,9,6,0,8
1469,34,No,Travel_Rarely,628,Research & Development,8,3,Medical,1,2068,...,1,80,0,6,3,4,4,3,1,2


In [16]:
df.drop(['EmployeeCount', 'EmployeeNumber', 'StandardHours', 'Over18'], axis="columns", inplace=True)

The axis parameter specifies whether drop() removes rows or columns. Here axis="columns" (equivalent to axis=1) removes the four listed columns. Setting inplace=True makes drop() modify df directly and return None, rather than returning a new DataFrame. In the rows shown above, EmployeeCount is always 1 and StandardHours is always 80, and EmployeeNumber is an identifier, so these columns appear to carry no predictive information.
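
To make the difference between the two modes concrete, here is a minimal sketch on a toy DataFrame. The names toy, returned and result, and the values in the frame, are made up for illustration and are not taken from the dataset.

```python
import pandas as pd

# Toy frame; column values are illustrative only.
toy = pd.DataFrame({"Age": [41, 49], "EmployeeCount": [1, 1], "StandardHours": [80, 80]})

# Default behaviour: drop() returns a new DataFrame and leaves `toy` untouched.
returned = toy.drop(["EmployeeCount"], axis="columns")

# In-place behaviour: `toy` itself is modified and the call returns None.
result = toy.drop(["StandardHours"], axis="columns", inplace=True)

print(list(returned.columns))
print(list(toy.columns))
print(result)
```

Because the in-place call returns None, writing df = df.drop(..., inplace=True) would overwrite df with None, so the two styles should not be mixed.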

In [17]:
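# Collect the text-valued columns that have fewer than 50 distinct values.
categorical_column = []
for column in df.columns:
    if df[column].dtype == 'object' and len(df[column].unique()) < 50:
        categorical_column.append(column)

# Encode the target column as integer codes.
df['Attrition'] = df['Attrition'].astype('category').cat.codes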

cat.codes is an attribute (not a method) of the pandas categorical accessor. It maps each unique value of a categorical column to an integer code, assigned in the sorted order of the categories. For "Attrition", "No" becomes 0 and "Yes" becomes 1, which matches the encoded values in the later df.head() output.

This prepares "Attrition", the target variable, for machine learning models and statistical methods that require numeric input. Converting categorical variables to numbers is a common preprocessing step for that reason.

This line converts only the single "Attrition" column. The remaining categorical feature columns, collected in categorical_column, are converted separately, so that all feature columns (both categorical and numerical) end up in a form that a machine learning model can accept.
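
The mapping can be inspected directly. The following is a small sketch on a toy Series standing in for the raw "Attrition" column; the variable names and the five values are assumptions for illustration.

```python
import pandas as pd

# Toy series standing in for the raw "Attrition" column (values are illustrative).
attrition = pd.Series(["Yes", "No", "Yes", "No", "No"])

as_category = attrition.astype("category")

# Categories are stored in sorted order, so "No" comes first and gets code 0.
mapping = dict(enumerate(as_category.cat.categories))
codes = as_category.cat.codes

print(mapping)         # {0: 'No', 1: 'Yes'}
print(codes.tolist())  # [1, 0, 1, 0, 0]
```

Keeping the mapping around makes it possible to translate model outputs back to the original labels later.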



In [18]:
categorical_column.remove('Attrition')
categorical_column

['BusinessTravel',
 'Department',
 'EducationField',
 'Gender',
 'JobRole',
 'MaritalStatus',
 'OverTime']

In [20]:
from sklearn.preprocessing import LabelEncoder
label = LabelEncoder()
for column in categorical_column:
    df[column] = label.fit_transform(df[column])

df.head()

Unnamed: 0,Age,Attrition,BusinessTravel,DailyRate,Department,DistanceFromHome,Education,EducationField,EnvironmentSatisfaction,Gender,...,PerformanceRating,RelationshipSatisfaction,StockOptionLevel,TotalWorkingYears,TrainingTimesLastYear,WorkLifeBalance,YearsAtCompany,YearsInCurrentRole,YearsSinceLastPromotion,YearsWithCurrManager
0,41,1,2,1102,2,1,2,1,2,0,...,3,1,0,8,0,1,6,4,0,5
1,49,0,1,279,1,8,1,1,3,1,...,4,4,1,10,3,3,10,7,1,7
2,37,1,2,1373,1,2,2,4,4,1,...,3,2,0,7,3,3,0,0,0,0
3,33,0,1,1392,1,3,4,1,4,0,...,3,3,0,8,3,3,8,7,3,0
4,27,0,2,591,1,2,1,3,1,1,...,3,4,1,6,3,3,2,2,2,2


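The code uses the LabelEncoder class from scikit-learn to replace each category in the categorical columns with an integer code. Because the same encoder object is refit for every column, only the mapping of the last column remains available on label afterwards. Note that scikit-learn documents LabelEncoder as intended for target labels; for feature columns it provides OrdinalEncoder and OneHotEncoder.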
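
If the mappings need to be recovered later, one option is to keep a separate fitted encoder per column. This is a sketch under that assumption, not the notebook's own approach; the toy frame and the encoders dictionary are hypothetical.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy frame; values are illustrative only.
toy = pd.DataFrame({
    "Gender": ["Female", "Male", "Male"],
    "OverTime": ["Yes", "No", "Yes"],
})

encoders = {}  # hypothetical helper: one fitted encoder per column
for column in ["Gender", "OverTime"]:
    encoder = LabelEncoder()
    toy[column] = encoder.fit_transform(toy[column])
    encoders[column] = encoder

# classes_ lists the original categories; position i corresponds to code i.
print(list(encoders["OverTime"].classes_))

# inverse_transform turns the codes back into the original labels.
decoded = encoders["Gender"].inverse_transform(toy["Gender"])
print(list(decoded))
```

With one encoder per column, each mapping stays inspectable through classes_ instead of being overwritten on the next fit.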

In [21]:
from sklearn.model_selection import train_test_split

X = df.drop('Attrition', axis = 1)
y = df.Attrition

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3, random_state = 42)


The code X = df.drop('Attrition', axis = 1) creates a new DataFrame X by dropping the 'Attrition' column from df along the column axis (axis = 1). X contains the independent variables (the input features) used to train a machine learning model, and y = df.Attrition holds the target.

train_test_split then divides X and y into training and test sets. Two parameters are set:

test_size: the proportion of the dataset to include in the test set (in this case, 30%)
random_state: a seed value for the random number generator, so that the same split is produced on every run

A test size between 20% and 30% is a common choice, but the best value depends on the problem and the size of the dataset.
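
The effect of both parameters can be checked on synthetic data. The sketch below uses made-up arrays with the same number of rows as the dataset (1470); the names X_demo, y_demo, split_a and split_b are assumptions for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-ins with 1470 rows; the values themselves are meaningless.
X_demo = np.arange(1470 * 2).reshape(1470, 2)
y_demo = np.arange(1470) % 2

# Two independent calls with the same seed.
split_a = train_test_split(X_demo, y_demo, test_size=0.3, random_state=42)
split_b = train_test_split(X_demo, y_demo, test_size=0.3, random_state=42)

X_train_a, X_test_a = split_a[0], split_a[1]
X_test_b = split_b[1]

# test_size=0.3 puts 441 of the 1470 rows in the test set and 1029 in training.
print(X_train_a.shape, X_test_a.shape)

# The same random_state yields the same rows in the same order.
same_split = np.array_equal(X_test_a, X_test_b)
print(same_split)
```

The row counts agree with the X_train.shape and X_test.shape outputs shown further down.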



In [29]:
X.head()
X.shape

(1470, 30)

In [31]:
y.head()
y.shape

(1470,)

In [32]:
df.shape

(1470, 31)

In [33]:
X_train.shape

(1029, 30)

In [34]:
X_test.shape

(441, 30)