# WMP Tutorial: Predicting Attrition with Machine Learning

Now that we have examined the data and done some exploration, let's walk through the process of actually predicting attrition. There are many approaches one can take, and in this tutorial we will walk through only a few of them. 

**Support:** If something does not work or make sense in this tutorial, please reach out to [Sam Showalter](mailto:sshowalter@wmp.com).

## 1. Python Library and Package Imports

Feel free to bring in any of your own custom packages in the labeled section below.

In [1]:
# Import data management tools
import pandas as pd

## 2. Read in Data

See if you can read in the dataframe using Pandas. Feel free to use Google.com and Stack Overflow for help. The filepath is listed below.

https://raw.githubusercontent.com/jswortz/UIC_Clustering_Code_2019/master/data/WA_Fn-UseC_-HR-Employee-Attrition.csv

**EXERCISE:** Please read the data into a variable named `df` (shorthand for DataFrame), then double-check that it imported successfully. Run the cell below your input and verify that there are 1470 rows and 35 columns.
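
As a hedged hint, `pd.read_csv` accepts a local file path, a URL string, or any file-like object. The sketch below uses a tiny in-memory CSV with made-up rows (not the real dataset) so that it runs without a network connection; the names `csv_text` and `toy_df` are hypothetical.

```python
import io

import pandas as pd

# Made-up sample rows standing in for the real CSV file
csv_text = """Age,Attrition,Department
41,Yes,Sales
49,No,Research & Development
37,Yes,Research & Development
"""

# read_csv accepts a file path, a URL string, or a file-like object
toy_df = pd.read_csv(io.StringIO(csv_text))

print("Data Shape: {} rows x {} cols".format(len(toy_df), len(toy_df.columns)))
```

For the exercise itself, the same call should work with the URL above passed as a string in place of the `io.StringIO` object.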

In [2]:
#WRITE YOUR CODE HERE TO READ IN DATA


In [4]:
# Verify the data shape
print("Data Shape: {} rows x {} cols".format(len(df), len(df.columns)))

## 3. Check Data Quality

After reading in the data and doing EDA (shown in the previous notebook), the first thing to do is check data quality. Most Machine Learning models cannot handle missing or badly formatted data, so the following exercises will help us understand the quality of the data.

**EXERCISE:** See if you can find a snappy command to view any missing values in the dataframe, by column name.
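
One possible approach, sketched on a made-up dataframe (the name `toy_df` and its values are hypothetical), is to chain `isnull()` with `sum()` to get a count of missing values per column.

```python
import numpy as np
import pandas as pd

# Made-up data with two deliberately missing cells
toy_df = pd.DataFrame({
    "Age": [41, np.nan, 37],
    "Department": ["Sales", "Sales", None],
    "OverTime": ["Yes", "No", "No"],
})

# isnull() marks each missing cell as True; sum() counts them per column
missing_per_column = toy_df.isnull().sum()
print(missing_per_column)
```

A column with a count of zero has no missing values.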

In [None]:
# WRITE YOUR CODE HERE TO CHECK FOR NULL DATA


Lucky us! As you can see, there is no missing data. Now we need to determine the data type of each column. This will help us decide which columns to drop and which need to be transformed before modeling.

**EXERCISE:** See if you can find a way to view the data type of each column, either by looking at a snapshot of the dataframe or by printing the data types.

In [None]:
#PRINT THE TOP FIVE ROWS OF THE DATAFRAME HERE


In [None]:
#PRINT THE DATA TYPES, BY COLUMN, of THE DATAFRAME HERE


Interesting. There are several fields with a data type of "object". Use your data engineering skills to further understand what they are.

**EXERCISE:** First, list out the names of all the columns with a data type of `object`. Then, show a snapshot of the dataframe for all of these columns and only these columns.
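
As a rough sketch of one way to do this, `select_dtypes` filters a dataframe by data type. The toy dataframe below is made up, and the explicit `dtype=object` is an assumption added only to keep the example predictable across pandas versions.

```python
import pandas as pd

# Made-up data: one numeric column and two object (string) columns
toy_df = pd.DataFrame({
    "Age": [41, 49, 37],
    "Department": pd.Series(["Sales", "R&D", "R&D"], dtype=object),
    "OverTime": pd.Series(["Yes", "No", "Yes"], dtype=object),
})

# Keep only the columns whose data type is `object`
object_columns = list(toy_df.select_dtypes(include=["object"]).columns)
print(object_columns)

# Use the list of names to view just those columns
print(toy_df[object_columns].head())
```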

In [None]:
#WRITE YOUR CODE HERE FOR OBJECT DATA TYPE COLUMNS


In [None]:
#PRINT A SNAPSHOT OF ALL OBJECT FIELDS IN DATAFRAME (HINT: Use your previous answer to help you)


**EXERCISE:** Finally, fill out the following forms for each data type. Consider the following definitions for each, and take your time. This is very important for doing feature engineering correctly. Think about how a data type and its representation may affect a model.

 - **Categorical**: Often a string value. Different groups, or categories, of objects (e.g. Cat, Dog, Mouse, ...)
 - **Numerically Discrete**: Countable numerical values, such as counts. Often integers.
 - **Numerically Continuous**: Numbers that can take any value within a range rather than only countable steps. Often float values.
 - **Boolean**: Yes or No values. Sometimes represented as 1 or 0 **OR** as a two-option categorical variable.
 - **Ordinal**: Categorical variables with a natural order, e.g. small, medium, large **OR** level 1, 2, 3
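
One rough aid when filling out the form is to count the distinct values in each column. The sketch below uses a made-up dataframe with hypothetical column names; the counts are only a hint, and deciding between (for example) ordinal and numerically discrete still takes judgment.

```python
import pandas as pd

# Made-up data with hypothetical column names
toy_df = pd.DataFrame({
    "Age": [41, 49, 37, 33],
    "OverTime": ["Yes", "No", "Yes", "No"],
    "Country": ["US", "US", "US", "US"],
    "Level": [2, 2, 1, 3],
})

# Number of distinct values in each column
distinct_counts = toy_df.nunique()
print(distinct_counts)
```

A count of 2 often points to a boolean, a count of 1 means the column is constant and carries no information, and a small number of integer values may be ordinal or numerically discrete.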


| Column Name | Potential Data Types |
| :----------- | ---------------------|
|**`Age`**$\;\;\;\;\;\;\;\;\;\;$   |   <input type="checkbox" align = "right"> Categorical  $\;\;\;\;$ <input type="checkbox" align = "right"> Numeric Discrete $\;\;\;\;$ <input type="checkbox" align = "right"> Numeric Continuous $\;\;\;\;$ <input type="checkbox" align = "right"> Boolean $\;\;\;\;$ <input type="checkbox" align = "right"> Ordinal 
|
|**`Attrition`**$\;\;\;\;\;\;\;\;\;\;$ |     <input type="checkbox"> Categorical  $\;\;\;\;$ <input type="checkbox"> Numeric Discrete $\;\;\;\;$ <input type="checkbox"> Numeric Continuous $\;\;\;\;$ <input type="checkbox"> Boolean $\;\;\;\;$ <input type="checkbox"> Ordinal
|
|**`BusinessTravel`**$\;\;\;\;\;\;\;\;\;\;$ |     <input type="checkbox" align = "right"> Categorical  $\;\;\;\;$ <input type="checkbox" align = "right"> Numeric Discrete $\;\;\;\;$ <input type="checkbox" align = "right"> Numeric Continuous $\;\;\;\;$ <input type="checkbox" align = "right"> Boolean $\;\;\;\;$ <input type="checkbox" align = "right"> Ordinal
|
|**`DailyRate`**$\;\;\;\;\;\;\;\;\;\;$   |   <input type="checkbox" align = "right"> Categorical  $\;\;\;\;$ <input type="checkbox" align = "right"> Numeric Discrete $\;\;\;\;$ <input type="checkbox" align = "right"> Numeric Continuous $\;\;\;\;$ <input type="checkbox" align = "right"> Boolean $\;\;\;\;$ <input type="checkbox" align = "right"> Ordinal
|
|**`Department`**$\;\;\;\;\;\;\;\;\;\;$ |     <input type="checkbox" align = "right"> Categorical  $\;\;\;\;$ <input type="checkbox" align = "right"> Numeric Discrete $\;\;\;\;$ <input type="checkbox" align = "right"> Numeric Continuous $\;\;\;\;$ <input type="checkbox" align = "right"> Boolean $\;\;\;\;$ <input type="checkbox" align = "right"> Ordinal
|
|**`DistanceFromHome`**$\;\;\;\;\;\;\;\;\;\;$   |   <input type="checkbox" align = "right"> Categorical  $\;\;\;\;$ <input type="checkbox" align = "right"> Numeric Discrete $\;\;\;\;$ <input type="checkbox" align = "right"> Numeric Continuous $\;\;\;\;$ <input type="checkbox" align = "right"> Boolean $\;\;\;\;$ <input type="checkbox" align = "right"> Ordinal
|
|**`Education`**$\;\;\;\;\;\;\;\;\;\;$    |  <input type="checkbox" align = "right"> Categorical  $\;\;\;\;$ <input type="checkbox" align = "right"> Numeric Discrete $\;\;\;\;$ <input type="checkbox" align = "right"> Numeric Continuous $\;\;\;\;$ <input type="checkbox" align = "right"> Boolean $\;\;\;\;$ <input type="checkbox" align = "right"> Ordinal
|
|**`EducationField`**$\;\;\;\;\;\;\;\;\;\;$  |    <input type="checkbox" align = "right"> Categorical  $\;\;\;\;$ <input type="checkbox" align = "right"> Numeric Discrete $\;\;\;\;$ <input type="checkbox" align = "right"> Numeric Continuous $\;\;\;\;$ <input type="checkbox" align = "right"> Boolean $\;\;\;\;$ <input type="checkbox" align = "right"> Ordinal
|
|**`EmployeeCount`**$\;\;\;\;\;\;\;\;\;\;$   |   <input type="checkbox" align = "right"> Categorical  $\;\;\;\;$ <input type="checkbox" align = "right"> Numeric Discrete $\;\;\;\;$ <input type="checkbox" align = "right"> Numeric Continuous $\;\;\;\;$ <input type="checkbox" align = "right"> Boolean $\;\;\;\;$ <input type="checkbox" align = "right"> Ordinal
|
|**`EmployeeNumber`**$\;\;\;\;\;\;\;\;\;\;$   |   <input type="checkbox" align = "right"> Categorical  $\;\;\;\;$ <input type="checkbox" align = "right"> Numeric Discrete $\;\;\;\;$ <input type="checkbox" align = "right"> Numeric Continuous $\;\;\;\;$ <input type="checkbox" align = "right"> Boolean $\;\;\;\;$ <input type="checkbox" align = "right"> Ordinal
|
|**`EnvironmentSatisfaction`**$\;\;\;\;\;\;\;\;\;\;$  |    <input type="checkbox" align = "right"> Categorical  $\;\;\;\;$ <input type="checkbox" align = "right"> Numeric Discrete $\;\;\;\;$ <input type="checkbox" align = "right"> Numeric Continuous $\;\;\;\;$ <input type="checkbox" align = "right"> Boolean $\;\;\;\;$ <input type="checkbox" align = "right"> Ordinal
|
|**`Gender`**$\;\;\;\;\;\;\;\;\;\;$   |   <input type="checkbox" align = "right"> Categorical  $\;\;\;\;$ <input type="checkbox" align = "right"> Numeric Discrete $\;\;\;\;$ <input type="checkbox" align = "right"> Numeric Continuous $\;\;\;\;$ <input type="checkbox" align = "right"> Boolean $\;\;\;\;$ <input type="checkbox" align = "right"> Ordinal
|
|**`HourlyRate`**$\;\;\;\;\;\;\;\;\;\;$  |    <input type="checkbox" align = "right"> Categorical  $\;\;\;\;$ <input type="checkbox" align = "right"> Numeric Discrete $\;\;\;\;$ <input type="checkbox" align = "right"> Numeric Continuous $\;\;\;\;$ <input type="checkbox" align = "right"> Boolean $\;\;\;\;$ <input type="checkbox" align = "right"> Ordinal
|
|**`JobInvolvement`**$\;\;\;\;\;\;\;\;\;\;$ |     <input type="checkbox" align = "right"> Categorical  $\;\;\;\;$ <input type="checkbox" align = "right"> Numeric Discrete $\;\;\;\;$ <input type="checkbox" align = "right"> Numeric Continuous $\;\;\;\;$ <input type="checkbox" align = "right"> Boolean $\;\;\;\;$ <input type="checkbox" align = "right"> Ordinal
|
|**`JobLevel`**$\;\;\;\;\;\;\;\;\;\;$   |   <input type="checkbox" align = "right"> Categorical  $\;\;\;\;$ <input type="checkbox" align = "right"> Numeric Discrete $\;\;\;\;$ <input type="checkbox" align = "right"> Numeric Continuous $\;\;\;\;$ <input type="checkbox" align = "right"> Boolean $\;\;\;\;$ <input type="checkbox" align = "right"> Ordinal
|
|**`JobRole`**$\;\;\;\;\;\;\;\;\;\;$  |    <input type="checkbox" align = "right"> Categorical  $\;\;\;\;$ <input type="checkbox" align = "right"> Numeric Discrete $\;\;\;\;$ <input type="checkbox" align = "right"> Numeric Continuous $\;\;\;\;$ <input type="checkbox" align = "right"> Boolean $\;\;\;\;$ <input type="checkbox" align = "right"> Ordinal
|
|**`JobSatisfaction`**$\;\;\;\;\;\;\;\;\;\;$  |    <input type="checkbox" align = "right"> Categorical  $\;\;\;\;$ <input type="checkbox" align = "right"> Numeric Discrete $\;\;\;\;$ <input type="checkbox" align = "right"> Numeric Continuous $\;\;\;\;$ <input type="checkbox" align = "right"> Boolean $\;\;\;\;$ <input type="checkbox" align = "right"> Ordinal
|
|**`MaritalStatus`**$\;\;\;\;\;\;\;\;\;\;$   |   <input type="checkbox" align = "right"> Categorical  $\;\;\;\;$ <input type="checkbox" align = "right"> Numeric Discrete $\;\;\;\;$ <input type="checkbox" align = "right"> Numeric Continuous $\;\;\;\;$ <input type="checkbox" align = "right"> Boolean $\;\;\;\;$ <input type="checkbox" align = "right"> Ordinal
|
|**`MonthlyIncome`**$\;\;\;\;\;\;\;\;\;\;$  |    <input type="checkbox" align = "right"> Categorical  $\;\;\;\;$ <input type="checkbox" align = "right"> Numeric Discrete $\;\;\;\;$ <input type="checkbox" align = "right"> Numeric Continuous $\;\;\;\;$ <input type="checkbox" align = "right"> Boolean $\;\;\;\;$ <input type="checkbox" align = "right"> Ordinal
|
|**`MonthlyRate`**$\;\;\;\;\;\;\;\;\;\;$  |    <input type="checkbox" align = "right"> Categorical  $\;\;\;\;$ <input type="checkbox" align = "right"> Numeric Discrete $\;\;\;\;$ <input type="checkbox" align = "right"> Numeric Continuous $\;\;\;\;$ <input type="checkbox" align = "right"> Boolean $\;\;\;\;$ <input type="checkbox" align = "right"> Ordinal
|
|**`NumCompaniesWorked`**$\;\;\;\;\;\;\;\;\;\;$  |    <input type="checkbox" align = "right"> Categorical  $\;\;\;\;$ <input type="checkbox" align = "right"> Numeric Discrete $\;\;\;\;$ <input type="checkbox" align = "right"> Numeric Continuous $\;\;\;\;$ <input type="checkbox" align = "right"> Boolean $\;\;\;\;$ <input type="checkbox" align = "right"> Ordinal
|
|**`Over18`**$\;\;\;\;\;\;\;\;\;\;$ |     <input type="checkbox" align = "right"> Categorical  $\;\;\;\;$ <input type="checkbox" align = "right"> Numeric Discrete $\;\;\;\;$ <input type="checkbox" align = "right"> Numeric Continuous $\;\;\;\;$ <input type="checkbox" align = "right"> Boolean $\;\;\;\;$ <input type="checkbox" align = "right"> Ordinal
|
|**`OverTime`**$\;\;\;\;\;\;\;\;\;\;$    |  <input type="checkbox" align = "right"> Categorical  $\;\;\;\;$ <input type="checkbox" align = "right"> Numeric Discrete $\;\;\;\;$ <input type="checkbox" align = "right"> Numeric Continuous $\;\;\;\;$ <input type="checkbox" align = "right"> Boolean $\;\;\;\;$ <input type="checkbox" align = "right"> Ordinal
|
|**`PercentSalaryHike`**$\;\;\;\;\;\;\;\;\;\;$   |   <input type="checkbox" align = "right"> Categorical  $\;\;\;\;$ <input type="checkbox" align = "right"> Numeric Discrete $\;\;\;\;$ <input type="checkbox" align = "right"> Numeric Continuous $\;\;\;\;$ <input type="checkbox" align = "right"> Boolean $\;\;\;\;$ <input type="checkbox" align = "right"> Ordinal
|
|**`PerformanceRating`**$\;\;\;\;\;\;\;\;\;\;$  |    <input type="checkbox" align = "right"> Categorical  $\;\;\;\;$ <input type="checkbox" align = "right"> Numeric Discrete $\;\;\;\;$ <input type="checkbox" align = "right"> Numeric Continuous $\;\;\;\;$ <input type="checkbox" align = "right"> Boolean $\;\;\;\;$ <input type="checkbox" align = "right"> Ordinal
|
|**`RelationshipSatisfaction`**$\;\;\;\;\;\;\;\;\;\;$  |    <input type="checkbox" align = "right"> Categorical  $\;\;\;\;$ <input type="checkbox" align = "right"> Numeric Discrete $\;\;\;\;$ <input type="checkbox" align = "right"> Numeric Continuous $\;\;\;\;$ <input type="checkbox" align = "right"> Boolean $\;\;\;\;$ <input type="checkbox" align = "right"> Ordinal
|
|**`StandardHours`**$\;\;\;\;\;\;\;\;\;\;$    |  <input type="checkbox" align = "right"> Categorical  $\;\;\;\;$ <input type="checkbox" align = "right"> Numeric Discrete $\;\;\;\;$ <input type="checkbox" align = "right"> Numeric Continuous $\;\;\;\;$ <input type="checkbox" align = "right"> Boolean $\;\;\;\;$ <input type="checkbox" align = "right"> Ordinal
|
|**`StockOptionLevel`**$\;\;\;\;\;\;\;\;\;\;$  |    <input type="checkbox" align = "right"> Categorical  $\;\;\;\;$ <input type="checkbox" align = "right"> Numeric Discrete $\;\;\;\;$ <input type="checkbox" align = "right"> Numeric Continuous $\;\;\;\;$ <input type="checkbox" align = "right"> Boolean $\;\;\;\;$ <input type="checkbox" align = "right"> Ordinal
|
|**`TotalWorkingYears`**$\;\;\;\;\;\;\;\;\;\;$  |    <input type="checkbox" align = "right"> Categorical  $\;\;\;\;$ <input type="checkbox" align = "right"> Numeric Discrete $\;\;\;\;$ <input type="checkbox" align = "right"> Numeric Continuous $\;\;\;\;$ <input type="checkbox" align = "right"> Boolean $\;\;\;\;$ <input type="checkbox" align = "right"> Ordinal
|
|**`TrainingTimesLastYear`**$\;\;\;\;\;\;\;\;\;\;$  |    <input type="checkbox" align = "right"> Categorical  $\;\;\;\;$ <input type="checkbox" align = "right"> Numeric Discrete $\;\;\;\;$ <input type="checkbox" align = "right"> Numeric Continuous $\;\;\;\;$ <input type="checkbox" align = "right"> Boolean $\;\;\;\;$ <input type="checkbox" align = "right"> Ordinal
|
|**`WorkLifeBalance`**$\;\;\;\;\;\;\;\;\;\;$   |   <input type="checkbox" align = "right"> Categorical  $\;\;\;\;$ <input type="checkbox" align = "right"> Numeric Discrete $\;\;\;\;$ <input type="checkbox" align = "right"> Numeric Continuous $\;\;\;\;$ <input type="checkbox" align = "right"> Boolean $\;\;\;\;$ <input type="checkbox" align = "right"> Ordinal
|
|**`YearsAtCompany`**$\;\;\;\;\;\;\;\;\;\;$  |    <input type="checkbox" align = "right"> Categorical  $\;\;\;\;$ <input type="checkbox" align = "right"> Numeric Discrete $\;\;\;\;$ <input type="checkbox" align = "right"> Numeric Continuous $\;\;\;\;$ <input type="checkbox" align = "right"> Boolean $\;\;\;\;$ <input type="checkbox" align = "right"> Ordinal
|
|**`YearsInCurrentRole`**$\;\;\;\;\;\;\;\;\;\;$   |   <input type="checkbox" align = "right"> Categorical  $\;\;\;\;$ <input type="checkbox" align = "right"> Numeric Discrete $\;\;\;\;$ <input type="checkbox" align = "right"> Numeric Continuous $\;\;\;\;$ <input type="checkbox" align = "right"> Boolean $\;\;\;\;$ <input type="checkbox" align = "right"> Ordinal
|
|**`YearsSinceLastPromotion`**   $\;\;\;\;\;\;\;\;\;\;$  |    <input type="checkbox" align = "right"> Categorical  $\;\;\;\;$ <input type="checkbox" align = "right"> Numeric Discrete $\;\;\;\;$ <input type="checkbox" align = "right"> Numeric Continuous $\;\;\;\;$ <input type="checkbox" align = "right"> Boolean $\;\;\;\;$ <input type="checkbox" align = "right"> Ordinal
|  
|**`YearsWithCurrManager`**  $\;\;\;\;\;\;\;\;\;\;$   |   <input type="checkbox" align = "right"> Categorical  $\;\;\;\;$ <input type="checkbox" align = "right"> Numeric Discrete $\;\;\;\;$ <input type="checkbox" align = "right"> Numeric Continuous $\;\;\;\;$ <input type="checkbox" align = "right"> Boolean $\;\;\;\;$ <input type="checkbox" align = "right"> Ordinal
|      

## 4. Feature Engineering

Now that we have a good grasp of the data and its data types, we need to do the following to prepare for modeling.

    1.) Determine what fields make sense to include for modeling. (What could reasonably impact attrition?)
    2.) How should different data types be transformed to be conducive for modeling? 
    3.) Write the code necessary for creating a modeling dataset.

### 4.a. Dropping Inapplicable Fields

We want to include everything that could even have the slightest impact on attrition, but leave out the remaining fields.

**EXERCISE**: List the fields that you should remove and write a small description outlining your logic.

LIST OF FIELDS TO REMOVE:
\
        1.) **EmployeeNumber** -- This is just an employee's identification number and has nothing to do with attrition
    \
        2.) **HourlyRate** -- Monthly rate is already provided, which documents the same thing (could switch with Monthly)
    \
        3.) **DailyRate** -- Monthly rate is already provided, which documents the same thing (could switch with Monthly)
        
**EXERCISE**: With your data to drop now defined, drop the columns while keeping the data in variable `df`. (**HINT**: There are three columns to drop)
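
A minimal sketch of dropping a column, shown on a made-up dataframe with a single dropped column (`toy_df` is a hypothetical name). Note that `drop` returns a new dataframe rather than changing the original, so the result has to be assigned back.

```python
import pandas as pd

# Made-up data for illustration
toy_df = pd.DataFrame({
    "EmployeeNumber": [1, 2, 4],
    "Age": [41, 49, 37],
    "OverTime": ["Yes", "No", "Yes"],
})

# drop returns a new dataframe, so assign the result back to keep the change
toy_df = toy_df.drop(columns=["EmployeeNumber"])
print(list(toy_df.columns))
```

Passing a longer list to `columns` drops several columns in one call.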

In [None]:
#WRITE CODE HERE TO DROP COLUMNS


### 4.b. Handling Categorical Data

Categorical data is almost always represented with string data types. Mathematical models cannot process strings directly, so as data scientists we must encode them numerically by creating dummy variables.

**EXERCISE**: The function below is intended to create dummy variables of each categorical variable in the dataset, then combine all of the information into a single dataframe called `categorical_df`. Create a list of the categorical variables that need to be converted, then write the function call to convert them. Name the output variable `categorical_df`.
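
To illustrate what dummy variables look like, here is a small sketch on a made-up `Pet` column (not part of the attrition data). The `prefix` and `dtype=int` arguments are optional choices made for this example only: the prefix keeps the new column names tied to their source column, and `dtype=int` shows 1/0 instead of True/False.

```python
import pandas as pd

# Made-up categorical column
toy_df = pd.DataFrame({"Pet": ["Cat", "Dog", "Mouse", "Dog"]})

# One new column per category; each row has exactly one column switched on
pet_dummies = pd.get_dummies(toy_df["Pet"], prefix="Pet", dtype=int)
print(pet_dummies)
```

Each original value becomes a row with a single 1 in the column for its category and 0 everywhere else.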

In [None]:
def create_categorical_df(data, column_names):
    
    categorical_df = None
    
    for categorical_column in column_names:
        
        # One dummy column per category; the prefix keeps names unique
        # when two source columns share a category label
        categorical_dummies = pd.get_dummies(data[categorical_column],
                                             prefix = categorical_column)
        
        if categorical_df is None:
            categorical_df = categorical_dummies
            
        else:
            categorical_df = pd.concat([categorical_df, categorical_dummies], axis =1)
    
    return categorical_df

In [None]:
#MAKE LIST OF CATEGORICAL COLUMN NAMES HERE


#MAKE FUNCTION CALL HERE TO MAKE CATEGORICAL DF


**EXERCISE**: Verify you were successful by printing the head of the dataset

In [None]:
#WRITE CODE HERE TO VIEW HEAD OF CATEGORICAL DF


### 4.c. Handling Ordinal Data

Similar to categorical data, ordinal data is also often a string value. However, we do not need to make dummies of this data since there is a measure of proximity. We only need to alias the data while ensuring that **the logical order of the data is maintained**.
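
As a sketch of what aliasing with order preserved can look like, the example below maps made-up size labels to integers with a dictionary. The column `Size` and the dictionary `size_order` are hypothetical and not part of the attrition data.

```python
import pandas as pd

# Made-up ordinal column
toy_df = pd.DataFrame({"Size": ["small", "large", "medium", "small"]})

# The dictionary encodes the logical order: small < medium < large
size_order = {"small": 0, "medium": 1, "large": 2}

# map replaces each label with its integer rank
toy_df["Size"] = toy_df["Size"].map(size_order)
print(toy_df["Size"].tolist())
```

The integers matter only through their order, so any increasing sequence would express the same ranking.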

**EXERCISE**: The function below converts ordinal data into a model-friendly format. Create a list of all ordinal columns to be converted, then create the function call and save the output as `ordinal_df`. (HINT: There are a total of **four** ordinal columns, but **ONLY ONE** needs to be preprocessed. Put only the one that needs to be preprocessed in the list below. We will add the rest later.)

In [None]:
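def create_ordinal_df(data, ordinal_names, ordinal_dict):
    
    # Copy so the original dataframe is left untouched
    ordinal_df = data.loc[:,ordinal_names].copy()
    
    # Replace each label with its integer rank from ordinal_dict
    for name in ordinal_names:
        ordinal_df[name] = ordinal_df[name].map(ordinal_dict)
    
    return ordinal_df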

def view_ordinality(data, ordinal_names):
    for name in ordinal_names:
        print(data[name].drop_duplicates())
        print()

In [None]:
#MAKE LIST OF (ONE) ORDINAL COLUMN NAME HERE


**EXERCISE**: View the ordinality of the appropriate column name, and make the corresponding dictionary (e.g. 0 = low, 1 = medium, 2 = high).

In [None]:
#VIEW ORDINALITY OF ORDINAL COLUMNS


In [None]:
#MAKE ORDINAL DICTIONARY


In [None]:
#MAKE FUNCTION CALL HERE TO CREATE ORDINAL DF


Finally, let's add back all of the ordinal columns that did not need any alteration. These include Education, StockOptionLevel, and JobLevel.

In [None]:
#Data is added back for you.
ordinal_df = pd.concat([ordinal_df, df.loc[:, ["Education", "StockOptionLevel", "JobLevel"]]], axis = 1)

**EXERCISE**: Verify you were successful by checking the data types of the ordinal_df. They should be integers.

In [None]:
#VIEW DATA TYPES OF ORDINAL DF HERE


### 4.d. Handling Boolean Data

Boolean data is also often in string format. This conversion is trickier because you need to determine what represents **YES** and **NO** for each column. For attrition, this is self-explanatory (1 = attrition, 0 = not). For columns like gender, pick either gender as the **YES** gender. Ultimately it will not impact the modeling.
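
One thing worth checking in this kind of conversion is that every label is covered by the mapping. The sketch below uses a made-up series (`toy_flags` is a hypothetical name) with one deliberately inconsistent label to show what happens otherwise.

```python
import pandas as pd

# Made-up Yes/No data with one inconsistent label ("yes" in lowercase)
toy_flags = pd.Series(["Yes", "No", "yes", "No"])

# Values missing from the dictionary (here the lowercase "yes") become NaN
converted = toy_flags.map({"Yes": 1, "No": 0})
print(converted.tolist())

# Counting NaN values after the conversion catches labels the mapping missed
print("Unmapped values:", converted.isnull().sum())
```

Viewing the unique values of each column before building the mapping, as the steps below do, avoids this problem.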

**EXERCISE**: Determine the variables that are boolean, as well as their YES / NO values, by following the steps below. Then convert the values using the provided function.

In [None]:
# CREATE A LIST OF BOOLEAN COLUMNS


In [None]:
# Visualize the unique values found in each column by running the function
def visualize_booleans(data, boolean_names):
    for name in boolean_names:
        print(data[name].drop_duplicates())
        print()

In [None]:
#CALL THE VISUALIZE BOOLEANS FUNCTION HERE


**EXERCISE**: It looks like we have another column that does not provide us with any unique data. Figure out which column that is, and drop it. 

In [None]:
# WRITE CODE HERE TO DROP COLUMN


In [None]:
#REMOVE Over18 FROM boolean_names LIST (done for you)
boolean_names = [name for name in boolean_names if name != "Over18"]


**EXERCISE**: Finally, create a list of lists in the format shown below, and feed it into the provided function to convert boolean values.

In [None]:
def convert_boolean_df(data, boolean_names_and_values):
    # Converts the columns of `data` in place; nothing is returned
    for name, yes_value, no_value in boolean_names_and_values:
        data[name] = data[name].map({yes_value: 1, no_value: 0})


In [None]:
# CREATE A LIST OF LISTS FOR ALL BOOLEAN COLUMNS
# Each list has the order: [COLUMN_NAME, YES_VALUE, NO_VALUE]



In [None]:
#CALL FUNCTION TO CONVERT BOOLEAN DF HERE


**EXERCISE**: Verify you were successful by running `visualize_booleans` again and comparing the results.

In [None]:
#VISUALIZE BOOLEAN OUTPUT HERE USING visualize_booleans FUNCTION


### 4.e. Normalizing Numerical Data

We are almost ready for modeling! The last thing to do is put our numerical data on a common scale. When we look at something like income for modeling, the raw value matters less than how far that value is from the other data observations.

Therefore, many data scientists standardize numerical data: they subtract each column's mean and divide by its standard deviation, storing the resulting z-score rather than the actual values.
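
To make the z-score concrete, here is a small hand computation on made-up income values (`toy_income` is a hypothetical name). It uses the population standard deviation (`ddof=0`), which is the convention scikit-learn's `StandardScaler` follows.

```python
import pandas as pd

# Made-up income values
toy_income = pd.Series([2000.0, 4000.0, 6000.0])

# z-score: distance from the mean, measured in standard deviations.
# ddof=0 selects the population standard deviation.
z_scores = (toy_income - toy_income.mean()) / toy_income.std(ddof=0)
print(z_scores.round(3).tolist())
```

After standardizing, the column has a mean of 0 and a standard deviation of 1, so columns measured in very different units become comparable.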

**EXERCISE**: Create a list of all the columns that need to be normalized (there are many!) and call it `normalized_names`. Then call the function that will normalize them and save them all to `normalized_df`.

In [None]:
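from sklearn.preprocessing import StandardScaler

def normalize_values(data, column_names):
    # Standardize each column to mean 0 and standard deviation 1 (z-scores)
    normalized_values = StandardScaler().fit_transform(data.loc[:,
                                                                column_names])
    
    # Keep the original index so the result lines up with the other dataframes
    return pd.DataFrame(normalized_values, columns = column_names, index = data.index)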


In [None]:
#WRITE NORMALIZED COLUMN NAMES LIST HERE



In [None]:
#CALL normalize_values FUNCTION HERE


**EXERCISE**: Verify you were successful by viewing the head of the normalized_df

In [None]:
#VIEW HEAD OF NORMALIZED DF HERE


### 4.f. Finalizing the Modeling Dataset

Finally, we are ready to combine everything together into a modeling dataset. The dataset df is refined for you below to trim out all values that we have changed.

**EXERCISE**: Using all of the previously created dataframes, including `boolean_df` below, combine all of the data into a single flat dataset
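
A rough sketch of combining dataframes side by side, using made-up frames with hypothetical names and columns. `pd.concat` with `axis=1` matches rows by index label, so all inputs should share the same index.

```python
import pandas as pd

# Made-up dataframes standing in for the pieces built earlier
toy_categorical = pd.DataFrame({"Pet_Cat": [1, 0], "Pet_Dog": [0, 1]})
toy_ordinal = pd.DataFrame({"Size": [0, 2]})
toy_boolean = pd.DataFrame({"Indoor": [1, 0]})

# axis=1 places the dataframes side by side, matching rows by index label
combined = pd.concat([toy_categorical, toy_ordinal, toy_boolean], axis=1)
print(combined.shape)
```

If the row count grows or missing values appear after combining, the indexes of the inputs probably did not match.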

In [None]:
# Editing df to only include up-to-date columns (done for you)
boolean_df = df.loc[:,boolean_names]

In [None]:
#COMBINE ALL DATA TOGETHER HERE (including boolean_df)


**EXERCISE**: Verify that you combined the data correctly by checking the shape of `df`. You should have 1470 rows and 48 columns

In [None]:
#VIEW SHAPE OF DF HERE


## 5. Model Training and Evaluation

Finally, we are ready to start modeling. In this section we will compare the performance of three fairly basic Machine Learning algorithms and see how well they can predict attrition.

The three models are listed below, and will be implemented using Scikit-Learn, one of the most popular ML packages in Python:

    1.) K-Nearest Neighbors (KNN)
    2.) Gaussian Naive Bayes (GNB)
    3.) Logistic Regression (LOG)
    
Here is a quick blurb on how each of these ML algorithms works.
   
- **KNN** 

Takes an unknown person (we don't know if there was attrition) and compares them to the "K" most similar people in the training data. If the majority of those "K" people were characterized by attrition, then the unknown individual is predicted to be as well.
   
- **GNB** 
    
Probabilistically examines the characteristics (the mean and variance of each feature) of the stereotypical attrition and non-attrition individual. Assuming feature independence (the "naive" part), the unknown person is assigned to the stereotype under which their characteristics are most probable.
  
- **LOG** 

A regression model that estimates the probability of attrition and classifies individuals using a linear separation boundary between those who were characterized by attrition and those who were not.

### 5.a. Importing Models

For convenience, the models have been imported below.

In [None]:
# KNN
from sklearn.neighbors import KNeighborsClassifier

# GNB
from sklearn.naive_bayes import GaussianNB

# LOG
from sklearn.linear_model import LogisticRegression

Before we can do anything with these models, we need to instantiate them.

**EXERCISE**: Instantiate all three models below, setting them to the names of their abbreviations (this is simple, do not overthink this).

In [None]:
# KNN


# GNB


# LOG


### 5.b. Split the data into train and test slices

To evaluate models in an unbiased way, we have to set some data aside that the model will not see during training. This process is called a train/test split.

**EXERCISE**: The train test split function is provided to you below. Split the input and target (**target = Attrition**) data into slices. Set the output variables to be X_train, X_test, y_train, and y_test **in that order**. Hold out 25% of the data as the test set.
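
A minimal sketch of the split on made-up arrays (the data and the `random_state` value are assumptions for illustration). `random_state` is optional and only makes the shuffle repeatable.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Eight made-up observations with two features each, plus a 0/1 target
X = np.arange(16).reshape(8, 2)
y = np.array([0, 0, 0, 0, 1, 1, 1, 1])

# test_size=0.25 holds out a quarter of the rows; random_state fixes the shuffle
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)

print(X_train.shape, X_test.shape)
```

With eight rows, six end up in the training slices and two in the test slices.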

In [None]:
from sklearn.model_selection import train_test_split

In [None]:
#WRITE CODE HERE


### 5.c. Train all of the models

Now we are ready to train the models! In order to train the models, you must call `.fit` on each model. Be sure to only give the model the training data, e.g. `model.fit(train_x_values, train_y_values)`.

**EXERCISE**: Call `.fit` for all of the instantiated models.

In [None]:
# KNN


# GNB


# LOG


### 5.d. Make predictions for all of the models

Now we can use our trained models to predict. You can do so by providing **ONLY THE TEST INPUT DATA** to the trained model and calling `.predict`, e.g. `model.predict(test_x_values)`.

**EXERCISE**: Call `.predict` for all of the instantiated models. Save all output as `<MODEL_NAME>_preds`, e.g. `KNN_preds`.
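
The fit-then-predict pattern can be sketched on made-up, clearly separated data. All variable names and values below are hypothetical, and `n_neighbors=3` is an arbitrary choice for the example.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Made-up training data: one feature, 0/1 target
train_x_values = np.array([[0.0], [1.0], [2.0], [10.0], [11.0], [12.0]])
train_y_values = np.array([0, 0, 0, 1, 1, 1])

# Unseen observations the model was not trained on
test_x_values = np.array([[0.4], [11.6]])

model = KNeighborsClassifier(n_neighbors=3)
model.fit(train_x_values, train_y_values)   # learn from the training data only
toy_preds = model.predict(test_x_values)    # predict on the held-out inputs
print(toy_preds.tolist())
```

Each test point is labeled by a majority vote among its three nearest training points. The other scikit-learn classifiers in this tutorial follow the same `.fit` and `.predict` calls.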

In [None]:
# KNN


# GNB


# LOG


### 5.e. Evaluate performance for all models

**EXERCISE**: Now that we have predictions, use the function provided below to determine which model is best! See the function's inputs for what you need to provide (**HINT**: which of the `train_test_split` slices includes the actual classes for the test data?)
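
To help read the report, here is a sketch on made-up labels (`actual` and `predicted` are hypothetical). The optional `output_dict=True` argument returns the same numbers as a nested dictionary, which is handy for pulling out a single metric.

```python
from sklearn.metrics import classification_report

# Made-up true labels and predictions (1 = attrition, 0 = no attrition)
actual = [0, 0, 0, 1, 1, 1]
predicted = [0, 0, 1, 1, 1, 0]

# output_dict=True returns the same numbers as a nested dictionary
report = classification_report(actual, predicted, output_dict=True)

print(classification_report(actual, predicted))
print("Recall for class 1:", report["1"]["recall"])
```

Precision is the share of predicted members of a class that were correct, and recall is the share of actual members of a class that were found. If one class is much rarer than the other, accuracy alone can look good while recall for the rare class is poor, so it is worth comparing the per-class rows.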

In [None]:
from sklearn.metrics import classification_report

def get_performance(model_name, actual_results, predictions):
    print(model_name + "\n" +
          classification_report(actual_results, 
                                predictions))
    print("\n================================================================================\n")

In [None]:
#KNN PERFORMANCE


#GNB PERFORMANCE


#LOG PERFORMANCE


**EXERCISE**: Which model was the best?
    
<input type="checkbox" align = "right"> KNN  $\;\;\;\;$ <input type="checkbox" align = "right"> GNB $\;\;\;\;$ <input type="checkbox" align = "right"> LOG 

**CONGRATULATIONS! YOU FINISHED THE WMP MACHINE LEARNING TUTORIAL!**

## Advanced Exercises 

If you liked this tutorial, want to get involved in Machine Learning at West Monroe, or both, please go ahead and try the following additional exercises. 

1. Unsupervised Machine Learning groups data based on metrics of similarity. Leverage a clustering algorithm like K-means and document its findings with the given data. What is the composition of each cluster and what insights about attrition can you determine? Save your results as a Jupyter notebook and send them to **Sam Showalter** or **Jordan Totten** for review.

2. In this tutorial we used the package Scikit-learn to examine different Machine Learning algorithms and their ability to predict attrition with the given data. For the simplest of these algorithms, K-Nearest Neighbors, implement the algorithm from scratch (no Sklearn!) and compare the results to the findings in your notebook.

3. Really love this stuff? Are you a math whiz? Implement a Gaussian Naive Bayes classifier from scratch to classify this data. Compare its performance to the Scikit-learn GNB. If you do so successfully, you will get an all-analytics shout out!