# Plan - Acquire - Prepare - Explore - Model - Deliver

**Goal**: Prepare, tidy, and clean the data so that it is ready for exploration and analysis.

**Input**: 1 or more dataframes acquired through the "acquire" step.

**Output**: 1 dataset split into 3 samples in the form of dataframes: train, validate & test.

**Artifact**: prepare.py

## How?
1. Summarize our data:
    - head(), describe(), info(), isnull(), value_counts(), shape, ...
    - plt.hist(), plt.boxplot()
    - document takeaways (nulls, datatypes to change, outliers, ideas for features, etc.)

2. Clean the data:
    - missing values: drop columns with too many missing values, drop rows with too many missing values, fill with zero where it makes sense, and then make note of any columns you want to impute missing values in (you will need to do that on split data).
    - **outlier**: an observation point that is distant from other observations https://www.theanalysisfactor.com/outliers-to-drop-or-not-to-drop/
    - outliers: ignore, drop rows, snap to a selected max/min value, create bins (cut, qcut)
    - data errors: drop the rows/observations with the errors, or correct them to the intended value
    - text normalization: correct and standardize text values, e.g. deck 'C' vs. 'c'
    - tidy data: getting your data into the shape it needs to be in for modeling and exploring. Every row should be an observation and every column should be a feature/attribute/variable. You want 1 observation per row, and 1 row per observation. If you want to predict customer churn, each row should be a customer and each customer should be on only 1 row. (address duplicates, aggregate, melt, reshape, ...)
    - creating new variables out of existing variables (e.g. z = x - y)
    - rename columns
    - datatypes: need numeric data to be able to feed into model (dummy vars, factor vars, manual encoding)
    - scale numeric data: puts continuous variables on the same units so that they carry the same weight. Scale when the algorithm you will use is affected by differing magnitudes, or when the data needs to be transformed toward a gaussian/normal distribution for statistical testing. (linear scalers and non-linear scalers)

3. Split the data:
    - split our data into train, validate and test sample dataframes
    - Why? overfitting: model is not generalizable. It fits the data you've trained it on "too well". 3 points does not necessarily mean a parabola.
    - **train**: in-sample, explore, impute mean, scale numeric data (max() - min()...), fit our ml algorithms, test our models.
    - **validate, test**: represents future, unseen data
    - **validate**: confirm our top models have not overfit, test our top n models on unseen data. Using validate performance results, we pick the top 1 model.
    - **test**: out-of-sample, how we expect our top model to perform in production, on unseen data in the future. ONLY USED ON 1 MODEL.
    - Do all the prep that can be done on the full dataset before you split: work on the full dataframe for everything you can, then move to train when it's time. Going back and forth between the full dataset and the samples leads to errors and inconsistencies in the data.
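
A few of the cleaning options listed above can be sketched on a toy dataframe. The column values here are made up for illustration, and the 5th/95th percentile cutoffs are an assumption, not a recommendation.

```python
import pandas as pd

# Toy data (hypothetical values, for illustration only)
toy = pd.DataFrame({
    'fare': [7.25, 8.05, 13.0, 26.55, 512.33],
    'deck': ['C', 'c', ' B', 'b ', 'C'],
})

# Outliers, option 1: snap values to a selected min/max
# (here the 5th and 95th percentiles, an arbitrary choice)
lower, upper = toy.fare.quantile([.05, .95])
toy['fare_clipped'] = toy.fare.clip(lower=lower, upper=upper)

# Outliers, option 2: create bins so extreme values share a bucket
toy['fare_bin'] = pd.qcut(toy.fare, q=2, labels=['low', 'high'])

# Text normalization: strip whitespace and standardize case
toy['deck'] = toy.deck.str.strip().str.upper()

print(toy)
```

Clipping keeps every row but caps the extreme fare, while binning trades precision for robustness. Which one fits depends on what the outlier represents.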

## algorithm vs. model

- **algorithm**: the general method that sklearn provides, such as a decision tree, KNN, ..., or linear regression (y = mx + b)
- **model**: that algorithm fit to our specific data, e.g.
    - **regression**: the model would contain the slope value and intercept value: y = .2x + 5
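
As a minimal sketch of the distinction, the snippet below fits sklearn's `LinearRegression` to made-up points generated from y = .2x + 5. The data and variable names are assumptions for illustration only.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical data generated exactly from y = .2x + 5
x = np.arange(10).reshape(-1, 1)
y = .2 * x.ravel() + 5

# The algorithm: the general method, y = mx + b, with m and b unknown
algorithm = LinearRegression()

# The model: fit() returns the same object, now holding the
# slope and intercept learned from our data
model = algorithm.fit(x, y)

print(model.coef_[0], model.intercept_)
```

The learned slope and intercept should come out at approximately .2 and 5, which is what makes this object a model of our data rather than a generic algorithm.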

**Should I do this on the full dataset or on the train sample?**

Here, "this" is the action, method, function, or step you are about to take on your data.

1. Are you comparing 2+ variables, whether through their relationship, summary stats, or visualizations?
2. Are you using an sklearn method?
3. Are you moving into the explore stage of the pipeline?

If **ONE** or more of these is yes, then you should be doing it on your train sample. If **ALL** are no, then the entire dataset is fine.
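
As one illustration of the rule, scaling uses an sklearn method, so it is fit on train only. The sketch below assumes a made-up single-column dataframe and `MinMaxScaler`; any other scaler would follow the same pattern.

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

# Hypothetical single-column dataset
toy = pd.DataFrame({'fare': [5.0, 10.0, 15.0, 20.0, 25.0,
                             30.0, 35.0, 40.0, 45.0, 50.0]})
train, test = train_test_split(toy, test_size=.2, random_state=123)

scaler = MinMaxScaler()
scaler.fit(train[['fare']])  # learn min and max from train only

train_scaled = scaler.transform(train[['fare']])
# test reuses train's min and max, so its values may fall outside [0, 1]
test_scaled = scaler.transform(test[['fare']])
```

Fitting on train alone keeps information about the unseen samples out of the preparation step.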

## Summarize Data

In [13]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

from sklearn.model_selection import train_test_split
from sklearn.impute import SimpleImputer

import warnings
warnings.filterwarnings("ignore")

import acquire

We'll use the function we defined in the last lesson to acquire our data:

In [15]:
df = acquire.df_titanic()

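AttributeError: module 'acquire' has no attribute 'df_titanic'

This error means acquire.py does not define a function named df_titanic. Call the acquisition function under the name it was actually given in the last lesson; the cells below assume df holds the Titanic dataframe.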

In [None]:
df.shape

In [None]:
df.head(2)

In [None]:
df.info()

In [None]:
df.describe()

In [None]:
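# histogram of every numeric column (ints and floats)
num_cols = df.select_dtypes(include='number').columns
for col in num_cols:
    # drop NaNs first so plt.hist() can compute the bin range
    plt.hist(df[col].dropna())
    plt.title(col)
    plt.show()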

In [None]:
obj_cols = df.select_dtypes(include='object').columns
for col in obj_cols:
    print(df[col].value_counts())
    print(df[col].value_counts(normalize=True, dropna=False))

In [None]:
# value counts of fare, binned into 5 equal-width bins
# sort=False keeps the bins in interval order instead of sorting by frequency count
df.fare.value_counts(bins=5, sort=False)

In [None]:
# columns with missing values
missing = df.isnull().sum()
missing[missing > 0]

**Takeaways**

- embarked == embark_town, so remove embarked & keep embark_town
- class == pclass, so remove class & keep pclass (already numeric)
- drop deck...way too many missing values
- fill embark_town with most common value ('Southampton')
- drop age column
- encode or create dummy vars for sex & embark_town

## Clean the Data

In [None]:
# drop duplicates...run just in case
df.drop_duplicates(inplace=True)

In [None]:
# drop columns that are redundant (embarked, class) or have too many
# missing values to be useful right now (deck, age)
cols_to_drop = ['deck', 'embarked', 'class', 'age']
df = df.drop(columns=cols_to_drop)

We could fill embark_town with the most common value, 'Southampton', by hard-coding it in the fillna() function, as below. Or we could use an imputer. We will demonstrate the imputer after the train-validate-test split.

In [None]:
df['embark_town'] = df.embark_town.fillna(value='Southampton')
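
If you would rather not hard-code the string, the most common value can be looked up first. This is a sketch on a made-up column, not the lesson's method. Strictly speaking, a value learned from the data belongs after the split, which is what the imputer handles.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the embark_town column
toy = pd.DataFrame({'embark_town': ['Southampton', 'Cherbourg', np.nan,
                                    'Southampton', 'Queenstown']})

# mode() ignores NaN by default and returns a Series; take the first entry
most_common = toy.embark_town.mode()[0]
toy['embark_town'] = toy.embark_town.fillna(most_common)

print(most_common)
```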

Get dummy vars for sex and embark_town:

- dummy_na: whether to also create a dummy var for NaN values.
- drop_first: whether to drop the first dummy var of each column. Nothing is lost: if an observation belongs to none of the remaining vars, it must belong to the one that was dropped.

In [None]:
# drop_first takes a single boolean that applies to every encoded column
dummy_df = pd.get_dummies(df[['sex','embark_town']], dummy_na=False, drop_first=True)

# append dummy df cols to the original df. 
df = pd.concat([df, dummy_df], axis=1)
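
To see what `drop_first` does, here is a small sketch on a made-up three-row dataframe (the rows are hypothetical, not taken from the Titanic data):

```python
import pandas as pd

# Hypothetical rows covering every category once
toy = pd.DataFrame({
    'sex': ['male', 'female', 'female'],
    'embark_town': ['Southampton', 'Cherbourg', 'Queenstown'],
})

# Categories are ordered alphabetically, so 'female' and 'Cherbourg'
# are the first dummy vars and get dropped
dummies = pd.get_dummies(toy, dummy_na=False, drop_first=True)

print(list(dummies.columns))
```

A row with false (or 0) in both embark_town columns must have embarked at Cherbourg, so the dropped column carries no extra information.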

Create a function to perform these steps when we need to reproduce our dataset.

In [None]:
def clean_data(df):
    '''
    This function takes in the acquired dataframe, drops any duplicate
    observations, drops columns not needed, fills missing embark_town
    with 'Southampton', and creates dummy vars of sex and embark_town.
    It returns the cleaned dataframe and leaves the original unchanged.
    '''
    df = df.drop_duplicates()
    df = df.drop(columns=['deck', 'embarked', 'class', 'age'])
    # assign back rather than filling in place on a column
    df['embark_town'] = df.embark_town.fillna(value='Southampton')
    dummy_df = pd.get_dummies(df[['sex', 'embark_town']], drop_first=True)
    return pd.concat([df, dummy_df], axis=1)

## Train, Validate, Test Split

In [None]:
# 20% test, 80% train_validate
# then of the 80% train_validate: 30% validate, 70% train. 

train, test = train_test_split(df, test_size=.2, random_state=123, stratify=df.survived)
train, validate = train_test_split(train, test_size=.3, random_state=123, stratify=train.survived)
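
The two-step split can be wrapped in a helper and checked on toy data. The function name `split_data`, its parameters, and the 100-row dataframe are assumptions for illustration.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

def split_data(df, target, seed=123):
    '''Hypothetical helper: 20% test, then 30% of the remainder as validate.'''
    train_validate, test = train_test_split(
        df, test_size=.2, random_state=seed, stratify=df[target])
    train, validate = train_test_split(
        train_validate, test_size=.3, random_state=seed,
        stratify=train_validate[target])
    return train, validate, test

# Toy dataframe with 100 rows and 40% survivors (made-up data)
toy = pd.DataFrame({'survived': [1] * 40 + [0] * 60, 'fare': range(100)})
train, validate, test = split_data(toy, 'survived')

print(train.shape, validate.shape, test.shape)
```

With these proportions, train ends up with roughly 56% of the rows, validate with 24%, and test with 20%, and stratifying keeps the survival rate similar across the three samples.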

Option for Missing Values: Impute

We can impute values using the mean, median, mode (most frequent), or a constant value. We will use sklearn.impute.SimpleImputer to do this.

1. Create the imputer object, selecting the strategy used to impute: mean, median, or mode (strategy = 'most_frequent').
2. Fit to train. This computes the mean, median, or most_frequent value (i.e. mode) for each of the columns that will be imputed and stores that value in the imputer object.
3. Transform train: fill missing values in the train dataset with the stored value.
4. Transform validate and test: fill missing values with that same value, learned from train.
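
The steps above can be sketched end to end on made-up samples. The three small dataframes below are hypothetical stand-ins for the real train, validate, and test splits.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical samples (made-up values), stored as object dtype
train = pd.DataFrame({'embark_town': ['Southampton', 'Southampton',
                                      'Cherbourg', np.nan]}, dtype=object)
validate = pd.DataFrame({'embark_town': [np.nan, 'Queenstown']}, dtype=object)
test = pd.DataFrame({'embark_town': ['Cherbourg', np.nan]}, dtype=object)

# Create the imputer and choose the strategy
imputer = SimpleImputer(strategy='most_frequent')

# Fit on train only: the mode of train is stored in imputer.statistics_
imputer.fit(train[['embark_town']])

# Transform every sample with the value learned from train
for sample in (train, validate, test):
    sample['embark_town'] = imputer.transform(sample[['embark_town']]).ravel()

print(imputer.statistics_)
```

Because the imputer is fit once on train, validate and test receive train's most frequent value even if their own modes differ.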

Create the SimpleImputer object, which we will store in the variable imputer. In the creation of the object, we will specify the strategy to use (mean, median, most_frequent). Essentially, this is creating the instructions and assigning them to a variable we will reference.