# Prepare Data

Plan - Acquire - **Prepare** - Explore - Model - Deliver

**Goal**: Prepare, tidy, and clean the data so that it is ready for exploration and analysis.

**Input:** 1 or more dataframes acquired through the "acquire" step.

**Output:** 1 dataset split into 3 samples in the form of dataframes: train, validate & test.

**Artifact:** `prepare.py`

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# import splitting and imputing functions
from sklearn.model_selection import train_test_split
from sklearn.impute import SimpleImputer

# suppress warnings (the pink boxes in Jupyter output) for the demo
import warnings
warnings.filterwarnings("ignore")

# import our own acquire module
import acquire

In [None]:
df = acquire.get_titanic_data()

# STEP 1: Summarizing

In [None]:
# rows & columns

In [None]:
# view first n rows

In [None]:
# Get information about the dataframe: column names, rows, datatypes, non-missing values.

In [None]:
# Get summary statistics for numeric columns.

In [None]:
# Check out distributions of numeric columns.

In [None]:
# Use .describe with object columns.

In [None]:
# Create bins for fare using .value_counts.
# Using sort=False keeps the results in bin order instead of sorting by frequency count.
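
A minimal sketch of the binning idea, using a handful of made-up fares in place of the real `fare` column (the values and the choice of 4 bins are assumptions for illustration):

```python
import pandas as pd

# Hypothetical fares standing in for the Titanic `fare` column.
fares = pd.Series([7.25, 71.28, 7.92, 53.10, 8.05, 8.46, 51.86, 21.08, 11.13, 30.07])

# Default behavior: bins are ordered by how many fares fall into them.
by_count = fares.value_counts(bins=4)

# sort=False: bins stay in interval order, lowest fares first.
by_bin = fares.value_counts(bins=4, sort=False)

print(by_count)
print(by_bin)
```

Reading the bins in interval order makes the shape of the distribution easier to see than reading them by frequency.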

In [None]:
# Find columns with missing values and the total of missing values.
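
One way to do this, sketched on a small made-up dataframe (the column values are assumptions, not the real Titanic data):

```python
import numpy as np
import pandas as pd

# Hypothetical sample with gaps in three of the four columns.
df = pd.DataFrame({
    "age": [22.0, np.nan, 26.0, 35.0],
    "deck": [np.nan, "C", np.nan, np.nan],
    "embark_town": ["Southampton", "Cherbourg", None, "Southampton"],
    "fare": [7.25, 71.28, 7.92, 53.10],
})

# Count missing values per column, then keep only columns that have any.
missing = df.isnull().sum()
missing = missing[missing > 0]
print(missing)
```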

# STEP 2: Cleaning the Data

#### Duplicate Data?

In [None]:
# Drop duplicates...run just in case; reassign and check the shape of my data.
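
A small sketch of the duplicate check, assuming a toy dataframe with one repeated row:

```python
import pandas as pd

# Hypothetical data: the second and third rows are identical.
df = pd.DataFrame({
    "passenger_id": [0, 1, 1, 2],
    "fare": [7.25, 71.28, 71.28, 7.92],
})
print(df.shape)  # shape before dropping duplicates

# drop_duplicates returns a new dataframe, so reassign the result.
df = df.drop_duplicates()
print(df.shape)  # shape after dropping duplicates
```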

#### Missing Data?

In [None]:
# Drop columns with too many missing values for now and reassign; check the shape of my data.

In [None]:
# Validate that the columns are dropped.
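
The sketch below shows one possible way to pick and drop sparse columns. The 50% cutoff and the sample data are assumptions made for illustration; the notebook does not state which threshold it uses.

```python
import numpy as np
import pandas as pd

# Hypothetical sample: `deck` is mostly missing, `age` only slightly.
df = pd.DataFrame({
    "age": [22.0, np.nan, 26.0, 35.0],
    "deck": [np.nan, "C", np.nan, np.nan],
    "fare": [7.25, 71.28, 7.92, 53.10],
})

threshold = 0.5  # assumed cutoff: drop columns with more than half missing
too_sparse = df.columns[df.isnull().mean() > threshold]

# Drop the sparse columns and reassign.
df = df.drop(columns=too_sparse)

# Validate: check the remaining columns and the new shape.
print(df.columns.tolist(), df.shape)
```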

We could fill `embark_town` with the most common value, 'Southampton', by hard-coding the value using the `fillna()` function, as below. Or we could use an imputer. We will demonstrate the imputer after the train-validate-test split.

In [None]:
# Run .fillna() on the entire df.

In [None]:
# Validate that missing values in embark_town have been handled.
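
A minimal sketch of the hard-coded fill, assuming a tiny stand-in for the `embark_town` column:

```python
import pandas as pd

# Hypothetical sample with one missing embark_town.
df = pd.DataFrame({
    "embark_town": ["Southampton", "Cherbourg", None, "Southampton", "Queenstown"],
})

# Confirm which value is the most common before hard-coding it.
most_common = df["embark_town"].mode()[0]
print(most_common)

# Fill missing values with the hard-coded value and reassign the column.
df["embark_town"] = df["embark_town"].fillna("Southampton")

# Validate that no missing values remain.
print(df["embark_town"].isnull().sum())
```

Because this fill uses a fixed value rather than one computed from the data at run time, it can be applied to the whole dataframe before splitting.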

#### Outliers?

#### Erroneous Values?

#### Correct Datatypes?

#### Text Normalization?

#### Tidy Data?

In [None]:
# Each column should only represent one variable
# Each row should be one observation (passenger)

#### Create New Variables?

Get dummy variables for `sex` and `embark_town` with `pd.get_dummies`.

`dummy_na`: whether to also create a dummy variable for missing (NA) values.

`drop_first`: whether to drop the first dummy variable of each column. The dropped category is implied: an observation with 0 in every remaining dummy column must belong to it.

In [None]:
# Concatenate the dummy_df dataframe above with the original df and validate.
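
A short sketch of creating and concatenating the dummy variables, using made-up rows for `sex` and `embark_town`:

```python
import pandas as pd

# Hypothetical sample covering every category of both columns.
df = pd.DataFrame({
    "sex": ["male", "female", "female", "male"],
    "embark_town": ["Southampton", "Cherbourg", "Queenstown", "Southampton"],
})

# drop_first=True drops the first category of each column
# ("female" and "Cherbourg" here, since categories are sorted).
dummy_df = pd.get_dummies(df[["sex", "embark_town"]], dummy_na=False, drop_first=True)

# Add the dummy columns alongside the original columns.
df = pd.concat([df, dummy_df], axis=1)
print(df.head())
```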

#### Rename Columns?

#### Scaling Data?

In [None]:
# You want to scale data when you're using methods based on measures
# of how far apart data points are, such as support vector machines
# or k-nearest neighbors.
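
To illustrate why scale matters for distance-based methods, here is a sketch of min-max scaling done by hand with pandas. Min-max is only one of several scaling options, chosen here for brevity, and the sample values are made up.

```python
import pandas as pd

# Hypothetical sample: fare spans a much wider range than age,
# so it would dominate any distance calculation on the raw values.
df = pd.DataFrame({
    "age": [22.0, 38.0, 26.0, 35.0],
    "fare": [7.25, 512.33, 7.92, 53.10],
})

# Min-max scaling: rescale each column to the range [0, 1].
scaled = (df - df.min()) / (df.max() - df.min())
print(scaled)
```

In a real project the minimum and maximum would be learned from the train sample only and then applied to validate and test, the same pattern the imputer follows later in this lesson.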

### Let's not do that all over again...let's make a function
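
A minimal sketch of what such a function could look like. The name `clean_data` matches the function referred to later in this lesson, but the body is an assumption: in particular, dropping `deck` and the sample data are illustrative choices, not the notebook's confirmed steps.

```python
import pandas as pd

def clean_data(df):
    """Sketch: drop duplicates and a sparse column, fill embark_town, add dummies."""
    df = df.drop_duplicates()
    df = df.drop(columns=["deck"])  # assumed: deck has too many missing values
    df = df.assign(embark_town=df["embark_town"].fillna("Southampton"))
    dummy_df = pd.get_dummies(df[["sex", "embark_town"]], drop_first=True)
    return pd.concat([df, dummy_df], axis=1)

# Hypothetical raw data: the last two rows are duplicates.
raw = pd.DataFrame({
    "sex": ["male", "female", "female", "male", "male"],
    "embark_town": ["Southampton", "Cherbourg", None, "Queenstown", "Queenstown"],
    "deck": [None, "C", None, None, None],
    "fare": [7.25, 71.28, 80.0, 8.46, 8.46],
})

clean = clean_data(raw)
print(clean.shape)
print(clean.isnull().sum())
```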

Testing that the function does what we intend for it to do:

# STEP 3: Splitting

In [None]:
# 20% test, 80% train_validate
# then of the 80% train_validate: 30% validate, 70% train. 

In [None]:
# Observe split
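
The two-stage split described above can be sketched as follows. The dataframe is a made-up stand-in with 100 rows, and `random_state=123` is an arbitrary seed added only to make the result reproducible.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical 100-row dataset standing in for the cleaned Titanic data.
df = pd.DataFrame({"survived": [0, 1] * 50, "fare": range(100)})

# First split: 20% test, 80% train_validate.
train_validate, test = train_test_split(df, test_size=0.2, random_state=123)

# Second split: of the 80%, 30% validate and 70% train.
train, validate = train_test_split(train_validate, test_size=0.3, random_state=123)

# Observe the split.
print(train.shape, validate.shape, test.shape)
```

As a share of the original data, this works out to 56% train (0.8 × 0.7), 24% validate (0.8 × 0.3), and 20% test.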

### Turn it into a function

Testing that the function is doing what we intend for it to do:

# Alternative Method: Impute

We can impute values using the mean, median, mode (most frequent), or a constant value. We will use `sklearn.impute.SimpleImputer` to do this.

1. Create the imputer object, selecting the strategy used to impute: mean, median, or mode (`strategy='most_frequent'`).
1. Fit to train. This means compute the mean, median, or most_frequent (i.e. mode) for each of the columns that will be imputed. Store that value in the imputer object.
1. Transform train: fill missing values in train dataset with the stored value
1. Transform validate: fill missing values in validate dataset with the stored value
1. Transform test: fill missing values in test dataset with the stored value

In [None]:
# Get fresh Titanic data to use with missing values in embark_town again.

In [None]:
# ONLY look at train dataset after we split our data.

Create the `SimpleImputer` object, which we will store in the variable `imputer`. In the creation of the object, we will specify the strategy to use (mean, median, most_frequent). Essentially, this is creating the instructions and assigning them to a variable, `imputer`.

`Fit` the imputer to the columns in the training df. This means that the imputer will determine the most_frequent value, or other value depending on the strategy called, for each column.

It will store that value in the imputer object to use upon calling `transform`. We will call `transform` on our train, validate, and test datasets to fill any missing values.

In [None]:
# Validate that there are no longer any Null values in embark_town.
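
The create, fit, and transform steps can be sketched like this. The three small dataframes are made-up stand-ins for the real train, validate, and test samples, and `dtype=object` is set only so the example stores the strings as plain Python objects.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical samples, each with a missing embark_town.
train = pd.DataFrame({"embark_town": ["Southampton", "Cherbourg", np.nan, "Southampton"]}, dtype=object)
validate = pd.DataFrame({"embark_town": [np.nan, "Queenstown"]}, dtype=object)
test = pd.DataFrame({"embark_town": ["Cherbourg", np.nan]}, dtype=object)

# 1. Create the imputer with the chosen strategy.
imputer = SimpleImputer(strategy="most_frequent")

# 2. Fit on train only: the most frequent value is computed and stored.
imputer.fit(train[["embark_town"]])

# 3-5. Transform each sample with the value learned from train.
train[["embark_town"]] = imputer.transform(train[["embark_town"]])
validate[["embark_town"]] = imputer.transform(validate[["embark_town"]])
test[["embark_town"]] = imputer.transform(test[["embark_town"]])

# Validate that no nulls remain in any sample.
print(train.isnull().sum(), validate.isnull().sum(), test.isnull().sum())
```

Fitting on train alone means the fill value is never influenced by validate or test, which is why this approach waits until after the split.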

### Simplify our life with a function

Note: the `clean_data()` function is already dealing with missing values. If we want to use imputation, we will need to go back and tweak our earlier function.

In [None]:
# Yay functions!

### We can create a function made of our other functions

In [None]:
# Another function? YES PLZ!

In [None]:
# Acquire fresh Titanic data to test my function.

In [None]:
# Run the final prepare function and validate that it is working properly.

# Exercise Time