# Feature Engineering

### Having cleaned and explored my data, in this notebook I will prepare my features for modeling.

# Importing Libraries and Data

In [656]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.metrics import precision_score, recall_score, f1_score, confusion_matrix, plot_roc_curve, classification_report, accuracy_score
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_curve, auc, precision_recall_curve
import matplotlib.pyplot as plt
import matplotlib

In [508]:
df = pd.read_csv('HR-Employee-Attrition.csv')

In [428]:
len(df)

1470

# Feature Engineering

### In the following cells I prepare my cleaned data for modeling by encoding categorical features, dividing the data into training and validation sets, and scaling numerical features.

In [509]:
df.drop('Attrition', axis = 1).select_dtypes('object')

| | BusinessTravel | Department | Education | EducationField | EnvironmentSatisfaction | Gender | JobInvolvement | JobRole | JobSatisfaction | MaritalStatus | Over18 | OverTime | PerformanceRating | RelationshipSatisfaction | WorkLifeBalance |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Travel_Rarely | Sales | College | Life Sciences | Medium | Female | High | Sales Executive | Very High | Single | Y | Yes | Excellent | Low | Bad |
| 1 | Travel_Frequently | Research & Development | Below College | Life Sciences | High | Male | Medium | Research Scientist | Medium | Married | Y | No | Outstanding | Very High | Better |
| 2 | Travel_Rarely | Research & Development | College | Other | Very High | Male | Medium | Laboratory Technician | High | Single | Y | Yes | Excellent | Medium | Better |
| 3 | Travel_Frequently | Research & Development | Master | Life Sciences | Very High | Female | High | Research Scientist | High | Married | Y | Yes | Excellent | High | Better |
| 4 | Travel_Rarely | Research & Development | Below College | Medical | Low | Male | High | Laboratory Technician | Medium | Married | Y | No | Excellent | Very High | Better |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 1465 | Travel_Frequently | Research & Development | College | Medical | High | Male | Very High | Laboratory Technician | Very High | Married | Y | No | Excellent | High | Better |
| 1466 | Travel_Rarely | Research & Development | Below College | Medical | Very High | Male | Medium | Healthcare Representative | Low | Married | Y | No | Excellent | Low | Better |
| 1467 | Travel_Rarely | Research & Development | Bacheolor | Life Sciences | Medium | Male | Very High | Manufacturing Director | Medium | Married | Y | Yes | Outstanding | Medium | Better |
| 1468 | Travel_Frequently | Sales | Bacheolor | Medical | Very High | Male | Medium | Sales Executive | Medium | Married | Y | No | Excellent | Very High | Good |


In [510]:
df['Attrition'].value_counts()

No     1233
Yes     237
Name: Attrition, dtype: int64

#### Note: The classes are imbalanced. There are 237 attrition cases against 1,233 non-attrition cases, a ratio of just under one to five.
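
To make the imbalance concrete, here is a minimal sketch that recomputes the class shares from the counts printed above. The counts are copied by hand from the `value_counts()` output rather than read from the CSV, and the variable names are illustrative only.

```python
import pandas as pd

# Class counts copied from the value_counts() output above
# (hard-coded here so the snippet runs without the CSV).
counts = pd.Series({"No": 1233, "Yes": 237}, name="Attrition")

# Share of each class among all 1470 employees.
shares = counts / counts.sum()

# Ratio of attrition cases to non-attrition cases.
ratio = counts["Yes"] / counts["No"]

print(shares.round(3).to_dict())  # {'No': 0.839, 'Yes': 0.161}
print(round(ratio, 3))            # 0.192
```

With roughly 16% positives, plain accuracy can look good for a model that never predicts attrition, which is presumably why precision, recall, and F1 are imported at the top of the notebook.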

### Encoding Categorical Features

In [511]:
# Dividing the data into the feature variables (X) and the outcome variable (y)
X = pd.get_dummies(df.drop('Attrition', axis = 1))
y = pd.get_dummies(df['Attrition'], drop_first = True)

In [512]:
y.value_counts()

Yes
0      1233
1       237
dtype: int64

I kept the Yes column as the target y. A value of 1 indicates attrition, while a value of 0 indicates no attrition.
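
As a small illustration of what the encoding step does, the sketch below applies the same two `get_dummies` calls to a four-row toy frame. The toy values are made up for the example and are not rows from the data set. The explicit `astype(int)` is an addition of mine: recent pandas versions return boolean dummies by default, so the cast keeps the 0/1 coding described above.

```python
import pandas as pd

# Hypothetical toy frame that mimics the shape of the real data.
toy = pd.DataFrame({
    "Attrition": ["Yes", "No", "No", "Yes"],
    "OverTime": ["Yes", "No", "Yes", "No"],
    "Age": [41, 49, 37, 33],
})

# One-hot encode every object column in the features;
# numeric columns such as Age pass through unchanged.
X_toy = pd.get_dummies(toy.drop("Attrition", axis=1))

# drop_first=True drops the "No" column and keeps "Yes" as the target.
y_toy = pd.get_dummies(toy["Attrition"], drop_first=True).astype(int)

print(list(X_toy.columns))
print(y_toy["Yes"].tolist())
```

Note that `X` is encoded without `drop_first`, so each categorical feature keeps all of its levels as separate columns.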

### Dividing data into Training and Validation Sets

In [513]:
X_train, X_test, y_train, y_test = train_test_split(X,y, test_size = .25, random_state = 4)
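
The split above is not stratified, so the share of attrition cases can differ slightly between the two sets. As an optional variation (not what this notebook does), the sketch below passes `stratify` to `train_test_split` on a synthetic target with the same class counts. `X_demo` and `y_demo` are stand-ins invented for the example.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in with the same class counts as the real target.
y_demo = pd.Series([0] * 1233 + [1] * 237, name="Yes")
X_demo = pd.DataFrame({"feature": np.arange(len(y_demo))})

# stratify=y_demo keeps the attrition share nearly equal in both sets.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=4, stratify=y_demo
)

print(len(X_tr), len(X_te))
print(round(y_tr.mean(), 3), round(y_te.mean(), 3))
```

Whether stratifying matters here is an empirical question. With 1470 rows the unstratified split is unlikely to be badly skewed, but stratifying removes that source of variation.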

### Generating scalers for numeric data

In [514]:
from sklearn.preprocessing import StandardScaler, MinMaxScaler
Scaler1 = StandardScaler()
Scaler2 = MinMaxScaler()

In [519]:
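# Select the numeric columns at positions 1 through 19 (listed in the output below);
# the numeric column at position 0 is skipped.
scale_variables = list(X_train.select_dtypes('number').columns[1:20])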

In [520]:
scale_variables

['Age',
 'DailyRate',
 'DistanceFromHome',
 'EmployeeCount',
 'EmployeeNumber',
 'HourlyRate',
 'JobLevel',
 'MonthlyIncome',
 'MonthlyRate',
 'NumCompaniesWorked',
 'PercentSalaryHike',
 'StandardHours',
 'StockOptionLevel',
 'TotalWorkingYears',
 'TrainingTimesLastYear',
 'YearsAtCompany',
 'YearsInCurrentRole',
 'YearsSinceLastPromotion',
 'YearsWithCurrManager']
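
A positional slice such as `columns[1:20]` depends on column order, so it can silently pick the wrong columns if the upstream cleaning changes. One hedged alternative is to select by name and skip columns that carry no information. The helper `pick_scale_columns` and the toy frame below are hypothetical and are not part of this notebook.

```python
import pandas as pd

def pick_scale_columns(frame, exclude=()):
    """Return numeric, non-constant column names that are not in `exclude`."""
    numeric = frame.select_dtypes("number").columns
    return [
        col for col in numeric
        if col not in exclude and frame[col].nunique() > 1
    ]

# Hypothetical toy frame; values are invented for illustration.
toy = pd.DataFrame({
    "row_id": [0, 1, 2, 3],              # index-like column, excluded by name
    "Age": [41, 49, 37, 33],
    "StandardHours": [80, 80, 80, 80],   # constant in this toy frame
    "OverTime_Yes": [1, 0, 1, 0],        # dummy column, excluded by name
})

cols = pick_scale_columns(toy, exclude=["row_id", "OverTime_Yes"])
print(cols)  # ['Age']
```

The same idea would flag any column in the list above that has a single unique value in the real data. Checking that is left to the reader, since the notebook does not show those value counts.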

### In addition to the unscaled X_train, I created two scaled copies to see which one would work best with each model. X_train1 has the numeric variables standardized with StandardScaler, while X_train2 is scaled with MinMaxScaler. Each scaler is fit on the training set only and then applied to the test set.

In [522]:
X_train1 = X_train.copy()
X_train2 = X_train.copy()
X_test1 = X_test.copy()
X_test2 = X_test.copy()

X_train1[scale_variables] = Scaler1.fit_transform(X_train[scale_variables])
X_test1[scale_variables] = Scaler1.transform(X_test[scale_variables])

X_train2[scale_variables] = Scaler2.fit_transform(X_train[scale_variables])
X_test2[scale_variables] = Scaler2.transform(X_test[scale_variables])
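
The cell above fits each scaler on the training rows and only transforms the test rows. The sketch below shows the effect of that choice on a single made-up column. The values are hypothetical and were picked so that the test set falls outside the training range.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler, MinMaxScaler

# Hypothetical values, not taken from the data set.
train = pd.DataFrame({"MonthlyIncome": [2000.0, 4000.0, 6000.0, 8000.0]})
test = pd.DataFrame({"MonthlyIncome": [1000.0, 10000.0]})

# Fit on the training data only, as in the cell above.
std = StandardScaler().fit(train)
mm = MinMaxScaler().fit(train)

train_std = std.transform(train)   # mean 0, standard deviation 1 on train
test_mm = mm.transform(test)       # uses the train min (2000) and max (8000)

print(train_std.mean(), train_std.std())
print(test_mm.ravel())
```

Because the test rows are scaled with the training minimum and maximum, Min Max scaled test values can land below 0 or above 1. That is expected, and it avoids leaking test-set statistics into preprocessing.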