# ASGN Standardization 

- Separate the continuous and non-continuous variables for the assignment, drop the variables you don't want, and include the rest in your array.

- R2 should be very low; a high out-of-sample R2 is a warning sign (see the data leakage examples below).
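
The variable-splitting step above can be sketched as follows. This is a minimal sketch, not the assignment's required setup: the toy DataFrame, its column names, and the choice of `StandardScaler` plus `OneHotEncoder` are all assumptions made for illustration.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy data; the column names are illustrative only.
df = pd.DataFrame({
    "income": [40.0, 55.0, 70.0, 85.0],
    "age": [25.0, 35.0, 45.0, 55.0],
    "state": ["PA", "NJ", "PA", "NY"],
    "row_id": [101, 102, 103, 104],  # a variable we do not want in the model
})

# Drop the variables you do not want before building the array.
X = df.drop(columns=["row_id"])

# Separate the continuous and non-continuous variables by name.
continuous = ["income", "age"]
non_continuous = ["state"]

# Standardize the continuous columns and one-hot encode the others.
# sparse_threshold=0 forces a dense NumPy array as output.
preprocess = ColumnTransformer(
    [
        ("num", StandardScaler(), continuous),
        ("cat", OneHotEncoder(handle_unknown="ignore"), non_continuous),
    ],
    sparse_threshold=0,
)

X_array = preprocess.fit_transform(X)
print(X_array.shape)
```

A `ColumnTransformer` like this can be used as the first step of `make_pipeline`, so that it is re-fit inside every CV fold.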

## The Cardinal Sin of Data Leakage

**Having data in the training sample that you wouldn't have for real world predictions**

Examples:

1. y is explicitly in X (yikes)
2. y is a 2018 variable, but there is a 2019 variable in X
3. subtle: y is loan default, but X contains employee ID, and some employees are brought in to handle trouble loans (if you include it, the firm can't use the model to deploy the trouble-loan specialists)
4. if out-of-sample predicted stock movements have R2 above 10%... unlikely! (or: you'll be richer than Bezos soon)
5. doing the data prep (e.g., imputing or standardizing) on the full sample before splitting it into CV folds
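
A minimal sketch of the last kind of leakage, assuming the iris data and a `StandardScaler` (both are illustrative choices, not necessarily the code from the lecture). The scaler is fit on every row before the folds are made, so the means and standard deviations it learns already include the rows that will later be used for testing.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import StratifiedKFold
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)

# LEAKY: the scaler is fit on ALL rows, so what it learns includes
# observations that later end up in the test folds.
leaky_scaler = StandardScaler().fit(X)

# For comparison: a scaler fit on the training rows of the first fold only.
train_index, test_index = next(StratifiedKFold(n_splits=5).split(X, y))
fold_scaler = StandardScaler().fit(X[train_index])

# Any gap between the two sets of means is information from the test fold.
print(leaky_scaler.mean_ - fold_scaler.mean_)
```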

## Avoiding Data Leakage

- Preventing 1-4: Be very familiar with the data and how it was collected and built
- Preventing 5: Do your data prep _**within**_ each CV fold, fitting the transformations using only the training data in that fold

```python
from sklearn.metrics import accuracy_score
from sklearn.model_selection import StratifiedKFold

# Assumes X and y are NumPy arrays, and that `prep_methods` (a transformer,
# e.g. an imputer + scaler) and `model` (an estimator) are already defined.
accuracy = []

# loop over folds
for train_index, test_index in StratifiedKFold(n_splits=5).split(X, y):

    # .split() yields the indices in train/test sets. use those to get
    # the X/y vars for each, separated out:
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]

    ###################################################################
    # NEW: do the data prep inside this fold, only using training data
    ###################################################################

    # e.g. figure out means/std in X_train so we can impute/std
    prep_methods.fit(X_train)                  # "fit" the transform means "estimate (like in training a model) what to do"
    X_train = prep_methods.transform(X_train)  # apply those to X_train to impute and std

    # fit/estimate the model on the prepped training data
    model.fit(X_train, y_train)

    ###################################################################
    # NEW: transform the test data the same...
    ###################################################################

    X_test = prep_methods.transform(X_test)  # apply to the TEST data the FIT from the TRAIN data

    # predict OOS, evaluate and store
    y_predict = model.predict(X_test)
    accuracy.append(accuracy_score(y_test, y_predict))
```

In [2]:
from sklearn.datasets import load_iris
from sklearn.pipeline import make_pipeline
from sklearn import preprocessing
from sklearn.model_selection import cross_validate
from sklearn import svm

iris = load_iris() # data

# set up the pipeline, which will, given a set of observations 
# 1. fit and apply these steps to the training fold
# 2. in the testing fold, apply the transform and model to predict (no estimation)

classifier_pipeline = make_pipeline(
                                    preprocessing.StandardScaler(),  # clean the data
                                    svm.SVC(C=1)                     # model
                                    )

cross_validate(classifier_pipeline, iris.data, iris.target, cv=5)

{'fit_time': array([0.00743794, 0.00287795, 0.00203085, 0.00188613, 0.00221181]),
 'score_time': array([0.002666  , 0.00086117, 0.00051022, 0.00148416, 0.00118399]),
 'test_score': array([0.96666667, 0.96666667, 0.96666667, 0.93333333, 1.        ])}

In [None]:
# question 1: try this with a Nearest Neighbors Classifier (5 min)

from sklearn.neighbors import KNeighborsClassifier
knn_pipe = make_pipeline(
                        preprocessing.StandardScaler(),  # clean the data
                        KNeighborsClassifier()           # model
                        )

cross_validate(knn_pipe, iris.data, iris.target, cv=5)

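import pandas as pd

# make a copy of the data where half of one column is missing
iris2 = load_iris()
X2 = pd.DataFrame(iris2.data)
X2.columns = [1,2,3,4]
# .sample() keeps a random 50% of the rows; assigning it back aligns on the
# index, so the other 50% of column 2 become NaN
X2[2] = X2[2].sample(frac=0.5,random_state=14)
X2[2].describe()
iris2.data = X2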

# print the scores using IRIS2.data (not iris.data)
# this produces an error because of the missing values!
# cross_validate(knn_pipe, iris2.data, iris.target, cv=5)

# so add an imputation step to the pipeline! (5 min, use lecture page!)
knn_pipe2 = ......
cross_validate(knn_pipe2, iris2.data, iris.target, cv=5)

In [None]:
# so add an imputation step to the pipeline! (5 min, use lecture page!)
from sklearn.impute import SimpleImputer
knn_pipe2 = make_pipeline(
                        SimpleImputer(strategy='mean'),  # fill missing values
                        preprocessing.StandardScaler(),  # clean the data
                        KNeighborsClassifier()           # model
                        )

cross_validate(knn_pipe2, iris2.data, iris.target, cv=5)

In [None]:
# grid search will let you specify all the parameters of the model
# you want to tweak, and the values you want to try

from sklearn.model_selection import GridSearchCV

# set up parameter grid to try
# the parameter grid is a dictionary where key:value pairs are built like:
#     stepName<two underscores>paramName : [list of settings to try]
param_grid = {'kneighborsclassifier__n_neighbors':[1,5,6,7,8,9,10]}

# like a normal estimator, this has not yet been applied to any data
grid = GridSearchCV(knn_pipe2, param_grid=param_grid)
grid.fit(iris2.data, iris.target)
grid.best_params_

# now save that pipeline as a model object!
optimal_knn_model = grid.best_estimator_
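
A self-contained sketch of what the fitted grid search gives you. It rebuilds the same kind of pipeline on the complete iris data (an assumption made so the block runs on its own; with no missing values the imputer does nothing) and shows where the step names used in `param_grid` come from.

```python
from sklearn.datasets import load_iris
from sklearn.impute import SimpleImputer
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)

pipe = make_pipeline(SimpleImputer(strategy="mean"), StandardScaler(), KNeighborsClassifier())

# make_pipeline names each step after its lowercased class name;
# these names are the prefixes used in the parameter grid keys.
print(list(pipe.named_steps))

param_grid = {"kneighborsclassifier__n_neighbors": [1, 5, 6, 7, 8, 9, 10]}
grid = GridSearchCV(pipe, param_grid=param_grid, cv=5).fit(X, y)

# mean out-of-sample accuracy for each candidate setting
for params, score in zip(grid.cv_results_["params"], grid.cv_results_["mean_test_score"]):
    print(params, round(score, 3))

# best_estimator_ is the whole pipeline, refit on all of X with the best setting,
# so it can be used directly for predictions
optimal_knn_model = grid.best_estimator_
print(optimal_knn_model.predict(X[:3]))
```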

## Final Summary

- We've now seen more post-model diagnostics
- We can specify the models in `make_pipeline` alongside data cleaning/preprocessing steps that improve model performance without introducing data leakage.
- There are many imputation and scaling methods available in `sklearn`, and which one you use depends on the use case. (Read about and try several!)
- Your pipeline for the assignment will be more complicated if you want to include categorical vars
- You can optimize all of the parameters throughout your pipeline using `GridSearchCV`
    - `GridSearchCV` also allows you to specify how you create folds
    - Which leads us to...

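**LAST BIG POINT:**
- Most of your projects involve an important time series dimension. (Ex: predicting stock returns)
- In these cases, `KFold` and `StratifiedKFold` won't work, because they can put observations in the test fold that predate the training data (you can't have 1985 in the test sample when the model was trained on later years)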
- See: https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.TimeSeriesSplit.html
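
A minimal sketch of time-ordered folds, assuming simulated data whose rows are already sorted by date and a `Ridge` model (both are placeholders chosen for illustration, not part of the lecture). It also shows how a splitter is passed to `GridSearchCV` through its `cv` argument.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV, TimeSeriesSplit
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical data: 24 time-ordered observations (rows must be sorted by date).
rng = np.random.default_rng(0)
X = rng.normal(size=(24, 3))
y = rng.normal(size=24)

tscv = TimeSeriesSplit(n_splits=4)

# In every split, the test rows come strictly after the training rows.
splits = list(tscv.split(X))
for train_index, test_index in splits:
    print("train:", train_index.min(), "-", train_index.max(),
          "| test:", test_index.min(), "-", test_index.max())

# Pass the splitter to GridSearchCV so tuning respects the time ordering.
pipe = make_pipeline(StandardScaler(), Ridge())
grid = GridSearchCV(pipe, {"ridge__alpha": [0.1, 1.0, 10.0]}, cv=tscv)
grid.fit(X, y)
print(grid.best_params_)
```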