
# Prepare Data

Plan - Acquire - **Prepare** - Explore - Model - Deliver

## What is the prep stage for? 

- Start with one or more dataframes that we have acquired.
- Summarize our data (head(), describe(), info(), isnull(), value_counts()).
- Visualize the distribution of each variable. We are NOT looking at relationships between variables yet, just individual variables and the values that make them up. (plt.hist(), sns.boxplot(), value_counts())

- Address missing values. Sometimes we have to explore a little to see how to address them, and then come back. (drop the rows, drop the columns, fill with 0, impute with mean, median, or regression)
- Address outliers. (drop the rows/observations, pin them to the max non-outlier value, bin values, or keep them)
- Address data integrity issues and data errors. (drop the rows/observations with the errors, or correct them to the intended value)
- Address text normalization issues, e.g. deck 'C' vs. 'c'. (correct and standardize the text)
- Scale numeric data so that variables carry the same weight and are on the same scale. (linear scalers and non-linear scalers)
- Convert text/object/character data into data that can be represented numerically. (encoding, manual transformations using conditionals, e.g. dummy variables)
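
A few of the steps above can be sketched with pandas as follows. This is only an illustration: the toy dataframe, the fare cap of 100, and the choice of fill strategy are all assumptions made for the example, not values taken from the Titanic data.

```python
import pandas as pd

# Hypothetical toy data, loosely modeled on Titanic-style columns.
df = pd.DataFrame({
    "age": [22.0, 38.0, None, 35.0, 54.0],
    "fare": [7.25, 71.28, 7.92, 53.10, 512.33],
    "deck": ["C", "c", None, "C", "b"],
    "embark_town": ["Southampton", "Cherbourg", None,
                    "Southampton", "Southampton"],
})

# Summarize: count missing values per column.
missing_counts = df.isnull().sum()

# Text normalization: standardize case so 'C' and 'c' are one category.
df["deck"] = df["deck"].str.upper()

# Missing values: fill a categorical column with its most frequent value.
most_common_town = df["embark_town"].mode()[0]
df["embark_town"] = df["embark_town"].fillna(most_common_town)

# Outliers: pin fares above an assumed cap to that cap.
fare_cap = 100.0  # assumption for this example only
df["fare"] = df["fare"].clip(upper=fare_cap)

# Encoding: turn a text column into dummy (0/1) columns.
dummies = pd.get_dummies(df["embark_town"], prefix="embark_town",
                         drop_first=True)
df = pd.concat([df, dummies], axis=1)

# Note: 'age' is left with its missing value on purpose. Imputing with
# the mean is something we would learn from the train sample after splitting.
```

Each step here looks at one column at a time, which is why it can be done before splitting.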

Round 1 of a project, to get to an MVP: do the minimum you have to do. Handle null values in the simplest way that makes sense, drop extreme outlier observations, and scale (depending on the algorithm).

## Train, Validate, Test, oh my!

We split our data into train, validate, and test sample dataframes. Why?  

**overfitting:** the model is not generalizable. It fits the data you've trained it on "too well". Three points do not necessarily mean a parabola.  
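
The parabola remark can be made concrete with a small NumPy sketch. The three points below are made up for illustration (assumed to come from the line y = x plus a little noise). A degree-2 polynomial passes through all three exactly, yet it predicts a new point far worse than a straight line does.

```python
import numpy as np

# Three hypothetical observations, assumed to come from y = x plus noise.
x_train = np.array([0.0, 1.0, 2.0])
y_train = np.array([0.0, 1.5, 2.0])

# A parabola (degree 2) can pass through any three such points exactly.
parabola = np.polyfit(x_train, y_train, deg=2)
# A straight line (degree 1) cannot, so it leaves some in-sample error.
line = np.polyfit(x_train, y_train, deg=1)

# In-sample error: largest absolute miss on the training points.
parabola_train_error = np.max(np.abs(np.polyval(parabola, x_train) - y_train))
line_train_error = np.max(np.abs(np.polyval(line, x_train) - y_train))

# Out-of-sample: predict at an unseen x, where the assumed truth is y = 4.
x_new = 4.0
parabola_pred = np.polyval(parabola, x_new)
line_pred = np.polyval(line, x_new)
```

On the training points the parabola looks perfect, but at x = 4 it predicts about 0 while the line predicts about 4.17, much closer to the assumed true value of 4.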

**train:** *in-sample*. We use it to explore, compute imputation values (e.g. the mean), fit scalers (max() - min()...), fit our ML algorithms, and evaluate our models in-sample. 


> **algorithm:** the method that sklearn provides, such as decision_tree, knn, ..., y = mx+b  
> **model:** that algorithm specific to our data, e.g. regression: the model would contain the slope value and intercept value. y = .2x+5
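
As a rough illustration of the algorithm vs. model distinction, the sketch below fits sklearn's LinearRegression to made-up points that lie exactly on y = .2x + 5. The data is an assumption for the example; the fitted object then holds the slope and intercept specific to that data.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical data lying exactly on the line y = 0.2x + 5.
x = np.array([[0.0], [10.0], [20.0], [30.0]])
y = 0.2 * x.ravel() + 5

# Algorithm: the general method, before it has seen any data.
algorithm = LinearRegression()

# Model: the algorithm fit to our data. fit() returns the same object,
# which now carries the learned slope and intercept.
model = algorithm.fit(x, y)

slope = model.coef_[0]
intercept = model.intercept_
```

Here `slope` and `intercept` come out very close to .2 and 5, the values that define this particular model.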

**validate, test**: represents future, unseen data

**validate**: confirm our top models have not overfit, test our top n models on unseen data. Using validate performance results, we pick the top **1** model. 

**test**: *out-of-sample*, how we expect our top model to perform in production, on unseen data in the future. **ONLY USED ON 1 MODEL.**
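
One common way to get three samples is to call sklearn's train_test_split twice. The sketch below is an example under assumptions: the toy dataframe, the 60/20/20 proportions, the random_state value, and stratifying on the target are illustrative choices, not requirements.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical dataframe with 100 rows and a binary target.
df = pd.DataFrame({
    "fare": range(100),
    "survived": [0, 1] * 50,
})

# First split: hold out 20% of all rows as the test sample.
train_validate, test = train_test_split(
    df, test_size=0.2, random_state=123, stratify=df["survived"]
)

# Second split: 25% of the remaining 80% becomes validate (20% overall),
# leaving 60% of the original rows for train.
train, validate = train_test_split(
    train_validate,
    test_size=0.25,
    random_state=123,
    stratify=train_validate["survived"],
)
```

Setting random_state makes the split reproducible, so every rerun of the notebook produces the same train, validate, and test rows.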

Should I do *this* on the full dataset or on the train sample? 
*this*: the action, method, function, step you are about to take on your data.   
1. Are you comparing variables, i.e. looking at relationships, summary stats, or visualizations that involve 2+ variables?   
2. Are you using an sklearn method?   
3. Are you moving into the explore stage of the pipeline?   

If **ONE** or more of these is yes, then you should be doing it on your train sample.   
If **ALL** are no, then the entire dataset is fine.   
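
To illustrate question 2, here is a minimal sketch of learning values from train only and then applying them to validate. The tiny dataframes and the column name `age` are assumptions for the example; MinMaxScaler is one of sklearn's linear scalers.

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Hypothetical samples that have already been split.
train = pd.DataFrame({"age": [20.0, None, 40.0, 60.0]})
validate = pd.DataFrame({"age": [30.0, None]})

# Impute: learn the mean from train only, then fill both samples with it.
train_mean_age = train["age"].mean()
train = train.fillna({"age": train_mean_age})
validate = validate.fillna({"age": train_mean_age})

# Scale: fit the scaler on train only, so min and max come from train.
scaler = MinMaxScaler()
scaler.fit(train[["age"]])

# Transform each sample with the scaler that was fit on train.
train["age_scaled"] = scaler.transform(train[["age"]]).ravel()
validate["age_scaled"] = scaler.transform(validate[["age"]]).ravel()
```

Because validate stands in for future, unseen data, nothing is learned from it. It only receives the mean, min, and max that were computed on train.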

You want to do all the prep that can be done on the full dataset before you split.   
Work on the full dataframe for everything you can, then move to train when it's time. That way you don't have to go back and forth, which leads to errors and inconsistencies in the data. 

## prepare.py

What should prepare.py contain? 
Anyth

```python
import pandas as pd
import acquire

acquire.get_titanic_data()
```

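| Unnamed: 0 | passenger_id | survived | pclass | sex | age | sibsp | parch | fare | embarked | class | deck | embark_town | alone |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 3 | male | 22.0 | 1 | 0 | 7.2500 | S | Third | | Southampton | 0 |
| 1 | 1 | 1 | 1 | female | 38.0 | 1 | 0 | 71.2833 | C | First | C | Cherbourg | 0 |
| 2 | 2 | 1 | 3 | female | 26.0 | 0 | 0 | 7.9250 | S | Third | | Southampton | 1 |
| 3 | 3 | 1 | 1 | female | 35.0 | 1 | 0 | 53.1000 | S | First | C | Southampton | 0 |
| 4 | 4 | 0 | 3 | male | 35.0 | 0 | 0 | 8.0500 | S | Third | | Southampton | 1 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 886 | 886 | 0 | 2 | male | 27.0 | 0 | 0 | 13.0000 | S | Second | | Southampton | 1 |
| 887 | 887 | 1 | 1 | female | 19.0 | 0 | 0 | 30.0000 | S | First | B | Southampton | 1 |
| 888 | 888 | 0 | 3 | female | | 1 | 2 | 23.4500 | S | Third | | Southampton | 0 |
| 889 | 889 | 1 | 1 | male | 26.0 | 0 | 0 | 30.0000 | C | First | C | Cherbourg | 1 |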
