# Preprocessing

This stage is focused on ensuring our data is ready to be fed into a model. 

- Columns: The train, validate, and test datasets should have the same columns. Any modification made to train should also be applied to validate and test.

- Datatypes: All data fed into the model needs to be numeric (dummy variables, factor variables, or manual encoding).
- Scale numeric data: If the algorithm is sensitive to the scale of its inputs, scaling puts continuous variables on comparable units so that no variable dominates simply because of its range.
    - This will be covered in a future module.

- Tidy data: Getting your data into the shape it needs to be in for modeling and exploring. Every row should be an observation and every column should be a feature/attribute/variable. We want one observation per row and one row per observation. If you want to predict customer churn, each row should be a customer and each customer should appear on only one row (address duplicates, aggregate, melt, reshape, ...).
    - This will be covered in a future module.
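
The column and datatype checks above can be expressed in a few lines. This is a minimal sketch on made-up toy frames; the helper name `non_numeric_columns` and the sample values are assumptions for illustration and are not part of the lesson's `prepare` module.

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype

# Toy stand-ins for the real splits (made-up values).
train = pd.DataFrame({"fare": [7.25, 71.28], "sex": ["male", "female"]})
validate = pd.DataFrame({"fare": [8.05], "sex": ["male"]})


def non_numeric_columns(df):
    """Return the names of columns that still need encoding."""
    return [col for col in df.columns if not is_numeric_dtype(df[col])]


# Check 1: the splits share the same columns, in the same order.
same_columns = list(train.columns) == list(validate.columns)

# Check 2: list the columns a model could not consume as-is.
to_encode = non_numeric_columns(train)

print(same_columns)
print(to_encode)
```

Running a check like this before and after encoding is one way to confirm that nothing non-numeric is left and that the splits still line up.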

In this lesson we focus on cleaning up datatypes for modeling. This is known as encoding, where we turn all string values into numeric values.
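
As a small illustration of manual encoding, the sketch below uses `Series.map` with a dictionary. The sample rows and the new column names `is_female` and `class_num` are made up for this example.

```python
import pandas as pd

# Made-up sample; the column names mirror the Titanic data used in this lesson.
df = pd.DataFrame({
    "sex": ["male", "female", "female"],
    "class": ["Third", "First", "Second"],
})

# Binary category: map each string to 0 or 1.
df["is_female"] = df["sex"].map({"male": 0, "female": 1})

# Ordered category: map each label to its rank.
df["class_num"] = df["class"].map({"First": 1, "Second": 2, "Third": 3})

# .map turns any value missing from the dictionary into NaN,
# so a quick null count confirms every category was covered.
unmapped = int(df[["is_female", "class_num"]].isna().sum().sum())
print(unmapped)
```

Manual mapping suits columns with two values or a natural order; for unordered categories with several values, dummy variables (shown later in this lesson) avoid implying an order that does not exist.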

We will continue working through the Titanic dataset.

In [9]:
import pandas as pd
import numpy as np

from sklearn.model_selection import train_test_split

import acquire
import prepare

In [2]:
# Get Titanic data

df = acquire.get_titanic_data()
df.head()

Unnamed: 0,passenger_id,survived,pclass,sex,age,sibsp,parch,fare,embarked,class,deck,embark_town,alone
0,0,0,3,male,22.0,1,0,7.25,S,Third,,Southampton,0
1,1,1,1,female,38.0,1,0,71.2833,C,First,C,Cherbourg,0
2,2,1,3,female,26.0,0,0,7.925,S,Third,,Southampton,1
3,3,1,1,female,35.0,1,0,53.1,S,First,C,Southampton,0
4,4,0,3,male,35.0,0,0,8.05,S,Third,,Southampton,1


In [3]:
# Prepare Titanic data

train, validate, test = prepare.prep_titanic_data(df)
train.head()

Unnamed: 0,passenger_id,survived,pclass,sex,sibsp,parch,fare,embark_town,alone
583,583,0,1,male,0,0,40.125,Cherbourg,1
165,165,1,3,male,0,2,20.525,Southampton,0
50,50,0,3,male,4,1,39.6875,Southampton,0
259,259,1,2,female,0,1,26.0,Southampton,0
306,306,1,1,female,0,0,110.8833,Cherbourg,1


Get dummy variables for `sex` and `embark_town`.

- dummy_na: whether to also create a dummy variable for missing (NA) values.
- drop_first: whether to drop the first dummy variable of each column. That variable is redundant: a row with 0 in every remaining dummy column must belong to the dropped category.
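
To see what these two parameters do, here is a minimal sketch on a made-up four-row frame with one missing value. The `dtype=int` argument is added only so the result shows 0/1 regardless of pandas version.

```python
import pandas as pd

# Made-up sample with one missing value.
towns = pd.DataFrame(
    {"embark_town": ["Cherbourg", "Queenstown", "Southampton", None]}
)

# Default: one column per category; the row with the missing value is all zeros.
full = pd.get_dummies(towns, dtype=int)

# drop_first=True removes the first category in sorted order (Cherbourg here).
reduced = pd.get_dummies(towns, drop_first=True, dtype=int)

# dummy_na=True adds one more indicator column that flags missing values.
with_na = pd.get_dummies(towns, dummy_na=True, dtype=int)

print(list(full.columns))
print(list(reduced.columns))
print(with_na.shape)
```

In `reduced`, a Cherbourg passenger is the row where both remaining columns are 0, which is why the dropped column carries no extra information.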

In [4]:
# Using drop_first leaves sex_male, embark_town_Queenstown, and embark_town_Southampton.

# drop_first takes a single boolean that applies to every encoded column.
dummy_train = pd.get_dummies(train[['sex', 'embark_town']], dummy_na=False, drop_first=True)
dummy_train.head()

Unnamed: 0,sex_male,embark_town_Queenstown,embark_town_Southampton
583,1,0,0
165,1,0,1
50,1,0,1
259,0,0,1
306,0,0,0


In [5]:
# Concatenate the dummy_train dataframe above with train and verify.

train = pd.concat([train, dummy_train], axis=1)
train.head()

Unnamed: 0,passenger_id,survived,pclass,sex,sibsp,parch,fare,embark_town,alone,sex_male,embark_town_Queenstown,embark_town_Southampton
583,583,0,1,male,0,0,40.125,Cherbourg,1,1,0,0
165,165,1,3,male,0,2,20.525,Southampton,0,1,0,1
50,50,0,3,male,4,1,39.6875,Southampton,0,1,0,1
259,259,1,2,female,0,1,26.0,Southampton,0,0,0,1
306,306,1,1,female,0,0,110.8833,Cherbourg,1,0,0,0


In [6]:
# Drop string values that have been replaced with encoded values.

train = train.drop(columns=['sex', 'embark_town'])
train.head()

Unnamed: 0,passenger_id,survived,pclass,sibsp,parch,fare,alone,sex_male,embark_town_Queenstown,embark_town_Southampton
583,583,0,1,0,0,40.125,1,1,0,0
165,165,1,3,0,2,20.525,0,1,0,1
50,50,0,3,4,1,39.6875,0,1,0,1
259,259,1,2,0,1,26.0,0,0,0,1
306,306,1,1,0,0,110.8833,1,0,0,0


This dataframe is now ready for modeling. 

We need to follow the same steps for the validation and test dataframes, so they will also work with the algorithms. 

In [7]:
# Using drop_first leaves sex_male, embark_town_Queenstown, and embark_town_Southampton.

dummy_val = pd.get_dummies(validate[['sex', 'embark_town']], dummy_na=False, drop_first=True)
dummy_test = pd.get_dummies(test[['sex', 'embark_town']], dummy_na=False, drop_first=True)

# Concatenate each dummy dataframe above with its original dataframe.

validate = pd.concat([validate, dummy_val], axis=1)
test = pd.concat([test, dummy_test], axis=1)

# Drop string values that have been replaced with encoded values.

validate = validate.drop(columns=['sex', 'embark_town'])
test = test.drop(columns=['sex', 'embark_town'])
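
One caveat when each split is encoded separately: if a category happens to be absent from validate or test, `get_dummies` produces fewer columns there, and `drop_first` may drop a different category than it did on train. The sketch below shows one possible way to guard against this, assuming made-up toy splits and a hypothetical helper named `encode`: keep every dummy on the non-train split, then reindex to train's columns.

```python
import pandas as pd

# Made-up splits: the validate sample has no Queenstown passengers.
train = pd.DataFrame({
    "fare": [40.1, 20.5, 7.75],
    "embark_town": ["Cherbourg", "Southampton", "Queenstown"],
})
validate = pd.DataFrame({
    "fare": [26.0, 110.9],
    "embark_town": ["Southampton", "Cherbourg"],
})


def encode(df, columns, drop_first):
    """Replace the given string columns with dummy variables (hypothetical helper)."""
    dummies = pd.get_dummies(df[columns], drop_first=drop_first, dtype=int)
    return pd.concat([df.drop(columns=columns), dummies], axis=1)


# Train decides which dummy columns exist.
train_encoded = encode(train, ["embark_town"], drop_first=True)

# Keep every dummy on validate, then keep only train's columns.
# Columns that validate lacks are created and filled with 0.
validate_encoded = encode(validate, ["embark_town"], drop_first=False)
validate_encoded = validate_encoded.reindex(
    columns=train_encoded.columns, fill_value=0
)

print(list(validate_encoded.columns))
```

With the full Titanic splits every category is likely to appear in each split, so the simpler approach above gives matching columns; the reindex step matters mainly for small splits or rare categories.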

## Exercises

Do these exercises in a notebook called `modeling.ipynb` first, then transfer the final functions to the `model.py` file. 

This work should all be saved in your local `classification-exercises` repo. Add, commit, and push your changes.

Using the Titanic dataset

1. Use the function defined in `acquire.py` to load the Titanic data.  

1. Use the function defined in `prepare.py` to prepare the Titanic data.

1. Encode the categorical columns on the train dataset. Create dummy variables of the categorical columns and concatenate them onto the dataframe. Remove the columns they are replacing. Repeat on validate and test.

1. Create a function named `preprocess_titanic` that accepts the train, validate, and test titanic data, and returns the dataframes ready for modeling.

Using the Telco dataset

1. Use the function defined in `acquire.py` to load the Telco data.  

1. Use the function defined in `prepare.py` to prepare the Telco data. 

1. Encode the categorical columns on train. 
    1. Encode at least one column using `.replace`
    1. Encode at least one column using `.map`
    1. Encode the rest of the columns by creating dummy variables and concatenating them onto the dataframe.
    
1. Repeat the same steps on validate and test.

1. Create a function named `prep_telco` that accepts the train, validate, and test telco data, and returns the dataframes ready for modeling.