# Data Prep

**Prerequisites**

- Pandas
- Sklearn

**Outcomes**

- Be familiar with some common data prep tools: standardizing, scaling, feature encoding
- Be able to construct a sklearn pipeline that does data preparation work

## Data cleaning

- It is often said that more than 90% of a data scientist's time is spent preparing data
- That's likely an underestimate
- In order to derive useful results from a model, you need to feed the model useful data
- As the saying goes "Garbage in, garbage out"

### Start early

- The data preparation process begins before any data is actually collected!
- Being part of the experiment design or data collection process is the first-best option
- When this is not possible, knowing as much as possible about the data source will help you:
    - Identify potential biases
    - Gain intuition on relationships between data
    - Know what data *isn't* there

### Types of data manipulation

- Once you have some data and have decided what to model, you will likely need to prepare that data for the model
- Some common transformations include:
    - Standardizing or scaling the data
    - Feature encoding: one-hot encoding, ordinal encoding, discretization
    - Handling missing data: imputation, filtering
    - Feature engineering: polynomial features, other non-linear transforms
- sklearn provides tools for all these types of pre-processing
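
As a rough illustration of the categories above, the sketch below applies one sklearn tool per category to a tiny made-up DataFrame. The column names (`age`, `region`) and values are hypothetical, and the specific tools chosen here are just one reasonable option for each category.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder, PolynomialFeatures, StandardScaler

# hypothetical toy data: a numeric column with a missing value and a categorical column
toy = pd.DataFrame({
    "age": [25.0, 32.0, np.nan, 41.0],
    "region": ["north", "south", "south", "east"],
})

# handling missing data: replace NaN with the column mean
age_filled = SimpleImputer(strategy="mean").fit_transform(toy[["age"]])

# standardizing: rescale the column to mean 0 and variance 1
age_scaled = StandardScaler().fit_transform(age_filled)

# feature encoding: one 0/1 indicator column per category
region_onehot = OneHotEncoder().fit_transform(toy[["region"]]).toarray()

# feature engineering: add a squared term (columns are 1, age, age^2)
age_poly = PolynomialFeatures(degree=2).fit_transform(age_filled)

print(age_filled.ravel())
print(region_onehot.shape)
print(age_poly.shape)
```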

## Pipelines

In sklearn there are two core parent classes

1. `Transformers`: transform from $X$ to $\hat{X}$
    - `.fit(X)`: performs necessary calculations to do transformation (stores results)
    - `.transform(X)`: does transform of $X$ to $\hat{X}$
1. `Estimators`: Given $X$ and $y$ data (or just $X$ for unsupervised) find model *parameters*
    - `.fit(X, y)`: compute parameters of model
    - `.predict(X)`: compute predicted $y$'s based on $X$'s that are passed in
    
> Note: `.fit_transform(X)` is shorthand for first fitting, then transforming. Similarly, `.fit_predict` first fits and then generates predictions; it is only available on some estimators, such as clustering models.
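
A minimal sketch of the transformer interface, using `StandardScaler` as the example transformer and a small made-up array. It shows that `.fit` stores what it learned on the object and that `.fit_transform` gives the same result as calling the two methods separately.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# made-up data: 3 observations of 2 features
X = np.array([[1.0, 10.0],
              [2.0, 20.0],
              [3.0, 30.0]])

scaler = StandardScaler()

# .fit computes the column statistics and stores them on the object
scaler.fit(X)
print(scaler.mean_)   # column means learned from X
print(scaler.scale_)  # column standard deviations learned from X

# .transform applies the stored statistics to produce the transformed data
X_hat = scaler.transform(X)

# .fit_transform does both steps in a single call
X_hat2 = StandardScaler().fit_transform(X)
print(np.allclose(X_hat, X_hat2))
```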

### `sklearn.pipeline`

- Many ML tasks require multiple steps of preprocessing before passing data to a model
- These steps are represented as transformers
- A pipeline is zero or more transformers followed by a single estimator
- Data is passed through the transformers, in the order specified, and then to the estimator
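
The structure described above can be sketched as follows. The particular steps (a scaler, polynomial features, and a linear regression) are only an assumed example of "transformers, then an estimator".

```python
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# two transformers followed by one estimator, listed in the order data flows through them
model = make_pipeline(StandardScaler(), PolynomialFeatures(degree=2), LinearRegression())

# make_pipeline names each step after its lowercased class name
step_names = [name for name, _ in model.steps]
print(step_names)
```

The individual steps remain accessible afterwards, for example through `model.named_steps["linearregression"]`.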


### Pipeline lifecycle

1. Define the pipeline: `model = sklearn.pipeline.make_pipeline(trans1, trans2, ..., transN, est)` (the steps are passed as separate arguments, not as a list)
2. Fit the model: `model.fit(X, y)`. Internally, this looks like:
```python
X1 = trans1.fit_transform(X)
X2 = trans2.fit_transform(X1)
# ... (XNm denotes the output of the second-to-last transformer)
XN = transN.fit_transform(XNm)
est.fit(XN, y)
```
3. Generate predictions: `model.predict(x)`:
```python
x1 = trans1.transform(x)
x2 = trans2.transform(x1)
# ... (xNm denotes the output of the second-to-last transformer)
xn = transN.transform(xNm)
yhat = est.predict(xn)
```

> Pipelines save you the hassle of calling `.fit` and `.transform` on every transformer each time!

### Example

In [None]:
from sklearn import preprocessing, pipeline, linear_model
import numpy as np

In [None]:
# create some dummy data
X_train = np.array([[ 1., -1.,  2.],
                    [ 2.,  0.,  0.],
                    [ 0.,  1., -1.]])

y = np.array([0.1, -2.3, 1.2])

In [None]:
# define and fit transformer
trans1 = preprocessing.PolynomialFeatures(degree=3)
trans1.fit(X_train)

In [None]:
# apply transformation to training data
X1 = trans1.transform(X_train)
print(X1.shape)
X1

In [None]:
# define and fit linear model, using transformed data
linreg = linear_model.LinearRegression()
linreg.fit(X1, y)

In [None]:
# predict on training set
linreg.predict(X1)

In [None]:
# let's try to evaluate on a test dataset
X_test = np.array([[2, 3, 1], [-1, -1, 0.2]])

In [None]:
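# easy to go wrong: this raises a ValueError, because linreg was fit on
# the 20 polynomial features, not the 3 raw columns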
linreg.predict(X_test)

In [None]:
# need to transform first...
X_test2 = trans1.transform(X_test)

In [None]:
# ... then we can predict
linreg.predict(X_test2)

In [None]:
# easier to set up in a pipeline
model = pipeline.make_pipeline(trans1, linreg)

# single call to fit
model.fit(X_train, y)

In [None]:
X_test

In [None]:
# single call to predict
model.predict(X_test)
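
As a sanity check on the example above, the following self-contained sketch rebuilds the same dummy data and compares the manual transform-then-predict route with the pipeline route. The names `poly`, `manual_reg`, and `pipe` are introduced here for illustration only.

```python
import numpy as np
from sklearn import linear_model, pipeline, preprocessing

# same dummy data as in the example
X_train = np.array([[1., -1., 2.],
                    [2., 0., 0.],
                    [0., 1., -1.]])
y = np.array([0.1, -2.3, 1.2])
X_test = np.array([[2, 3, 1], [-1, -1, 0.2]])

# manual route: fit the transformer, fit the model, then transform before predicting
poly = preprocessing.PolynomialFeatures(degree=3)
manual_reg = linear_model.LinearRegression()
manual_reg.fit(poly.fit_transform(X_train), y)
manual_pred = manual_reg.predict(poly.transform(X_test))

# pipeline route: one call to fit, one call to predict
pipe = pipeline.make_pipeline(
    preprocessing.PolynomialFeatures(degree=3),
    linear_model.LinearRegression(),
)
pipe.fit(X_train, y)
pipe_pred = pipe.predict(X_test)

# both routes should produce the same predictions
print(np.allclose(manual_pred, pipe_pred))
```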

## Scaling

- Many machine learning algorithms require data to be scaled
- Sometimes, the underlying math will even assume features `X` are distributed N(0, 1)
- `sklearn.preprocessing.StandardScaler` is a transformer that rescales each feature to have mean 0 and variance 1
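
To make the rescaling concrete, here is a small sketch that standardizes randomly generated data by hand (subtract the column mean, divide by the column standard deviation) and checks that the result matches `StandardScaler`. The data and the random seed are made up for illustration.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# made-up data: 100 observations of 2 features, far from mean 0 and variance 1
rng = np.random.default_rng(0)
X = rng.normal(loc=50.0, scale=10.0, size=(100, 2))

# standardize by hand, using the population standard deviation (ddof=0, numpy's default)
Z_manual = (X - X.mean(axis=0)) / X.std(axis=0)

# standardize with sklearn
Z_sklearn = StandardScaler().fit_transform(X)

print(np.allclose(Z_manual, Z_sklearn))
print(Z_sklearn.mean(axis=0), Z_sklearn.std(axis=0))
```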

In [None]:
import pandas as pd

In [None]:
df = pd.read_parquet("https://css-materials.s3.amazonaws.com/ML/linear_models_2/insurance_claims_data.parquet")

In [None]:
df_numbers = df.select_dtypes([float, int])
df_strings = df.select_dtypes([object])

In [None]:
df_numbers.describe().T

In [None]:
scaler = preprocessing.StandardScaler()
df_scaled = pd.DataFrame(
    scaler.fit_transform(df_numbers),
    index=df_numbers.index, columns=df_numbers.columns
)

In [None]:
df_scaled.describe().T

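Notice that the mean and std are now approximately 0 and 1 for every variable. The reported std can be slightly above 1 because pandas computes the sample standard deviation (`ddof=1`), while `StandardScaler` divides by the population standard deviation (`ddof=0`).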

Further Reference: https://scikit-learn.org/stable/auto_examples/preprocessing/plot_all_scaling.html