# Module 2: Cross-Validation - Practice

In this practice you will apply **20-fold cross-validation** to a **Gaussian Naive Bayes model** 
fit to the **titanic** dataset. The entire dataset is used for training with cross-validation. 

+ Look for **placeholders** in the code and fill in the appropriate code.
+ Requirements, when provided, appear in **bold** font.
+ The presentation of printouts is not strict, as long as they are readable and equivalent.


In [2]:
import os, sys
from collections import Counter
import numpy as np
import pandas as pd
import sklearn.model_selection
from sklearn.naive_bayes import GaussianNB


## Load Dataset

Load the dataset from the CSV file into a pandas DataFrame.
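The loading cell below shuffles the rows with `sample(frac=1)` and then rebuilds the index. Here is a minimal sketch of that idiom on a made-up toy frame. The column values and the `random_state` seed are assumptions for illustration only and are not part of the exercise.

```python
import pandas as pd

# Toy frame standing in for the real CSV (values are made up).
toy = pd.DataFrame({"age": [22.0, 38.0, 26.0, 35.0],
                    "survived": [0, 1, 1, 1]})

# frac=1 draws every row exactly once, i.e. a random permutation of the rows.
# random_state is optional; it only makes the permutation repeatable.
shuffled = toy.sample(frac=1, random_state=0).reset_index(drop=True)

print(shuffled.shape)        # same shape as the original frame
print(list(shuffled.index))  # index rebuilt as 0..n-1 after the shuffle
```

Without `reset_index(drop=True)` the shuffled frame would keep the original row labels in permuted order.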

In [3]:
# Dataset location
DATASET = '/dsa/data/all_datasets/titanic_ML/titanic.csv'
assert os.path.exists(DATASET)

# Load and shuffle
dataset = pd.read_csv(DATASET).sample(frac = 1).reset_index(drop=True)
dataset.describe()

| | pclass | sex | age | sibsp | parch | fare | embarked | survived |
|---|---|---|---|---|---|---|---|---|
| count | 890.0 | 890.0 | 890.0 | 890.0 | 890.0 | 890.0 | 890.0 | 890.0 |
| mean | 2.31236 | 0.642697 | 29.548697 | 0.503371 | 0.351685 | 32.865772 | 0.895506 | 0.389888 |
| std | 0.837241 | 0.479475 | 13.379025 | 1.095286 | 0.790069 | 52.639685 | 0.529535 | 0.487999 |
| min | 1.0 | 0.0 | 0.17 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 25% | 2.0 | 0.0 | 21.0 | 0.0 | 0.0 | 7.8958 | 1.0 | 0.0 |
| 50% | 3.0 | 1.0 | 28.0 | 0.0 | 0.0 | 13.775 | 1.0 | 0.0 |
| 75% | 3.0 | 1.0 | 37.0 | 1.0 | 0.0 | 29.925 | 1.0 | 1.0 |
| max | 3.0 | 1.0 | 80.0 | 8.0 | 9.0 | 512.3292 | 2.0 | 1.0 |


In [4]:
dataset.head()

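| | pclass | sex | age | sibsp | parch | fare | embarked | survived |
|---|---|---|---|---|---|---|---|---|
| 0 | 2 | 1 | 26.0 | 1 | 1 | 29.0 | 1 | 1 |
| 1 | 2 | 1 | 36.0 | 0 | 0 | 13.0 | 1 | 0 |
| 2 | 2 | 1 | 66.0 | 0 | 0 | 10.5 | 1 | 0 |
| 3 | 3 | 1 | 24.0 | 2 | 0 | 24.15 | 1 | 0 |
| 4 | 3 | 1 | 21.0 | 0 | 0 | 7.925 | 1 | 0 |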


## Part 1: Cross-validation with sklearn

Run a **20-fold** cross-validation using `cross_val_score()` provided by sklearn.
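As a rough illustration of the call shape, the sketch below runs `cross_val_score()` on synthetic data rather than the titanic set. The names `X_toy`, `y_toy`, and `toy_scores`, the random data, and the choice of 5 folds are all assumptions made for the example.

```python
import numpy as np
import sklearn.model_selection
from sklearn.naive_bayes import GaussianNB

# Synthetic two-class data (made up for illustration, not the titanic set).
rng = np.random.default_rng(0)
X_toy = np.vstack([rng.normal(0.0, 1.0, size=(50, 3)),
                   rng.normal(2.0, 1.0, size=(50, 3))])
y_toy = np.array([0] * 50 + [1] * 50)

# Arguments are estimator, features, labels; cv sets the number of folds.
# One score (accuracy, for a classifier) is returned per fold.
toy_scores = sklearn.model_selection.cross_val_score(
    GaussianNB(), X_toy, y_toy, cv=5)

print(toy_scores.shape)   # one entry per fold
print(toy_scores.mean())  # average score across folds
```

Note that with an integer `cv` and a classifier, scikit-learn builds stratified folds, so per-fold scores may differ slightly from those of a plain sequential split of the same data.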

In [None]:
model = GaussianNB()

# Add your code below this comment (Question #P01)
# ----------------------------------
X = <placeholder>    # include all the columns except the last one
y = <placeholder>   # last col (survived)


In [None]:
from sklearn.preprocessing import MinMaxScaler

# perform scaling

scaler = MinMaxScaler().fit(<placeholder>)  

X_scaled = scaler.transform(<placeholder>)


In [None]:
cv_scores = sklearn.model_selection.cross_val_score(<placeholder>, <placeholder>, <placeholder>, cv=<placeholder>)
cv_scores

In [None]:
np.mean(cv_scores)

## Part 2: Create cross-validation manually

Implement a 20-fold cross-validation **without** using the cross-validation scoring function provided by scikit-learn.

(This cell is just a copy in case you lose the original code.)

```python
# Add your code below this comment (Question #P02)
# ----------------------------------
def cross_val_score(model, X, y, cv):
    X_folds = np.array_split(<placeholder>, <placeholder>)
    y_folds = np.array_split(<placeholder>, <placeholder>)
    print('X_folds', Counter([i.shape for i in X_folds]), 'y_folds', Counter([i.shape for i in y_folds]))

# Add your code below this comment (Question #P03)
# ----------------------------------
    for i in range(cv):
        X_train = np.concatenate([X_folds[<placeholder>] for j in range(cv) if <placeholder>])
        X_test = X_folds[<placeholder>]
        y_train = np.concatenate([y_folds[<placeholder>] for j in range(cv) if <placeholder>])
        y_test = y_folds[<placeholder>]
        model.fit(<placeholder>, <placeholder>)
        yield model.score(<placeholder>, <placeholder>)

print("Cross-validation:")
for i, score in enumerate(cross_val_score(model, X, y, cv=<placeholder>)):
    print(('\tscore[%d] ='%i), score)
```
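The function above prints a `Counter` of fold shapes because `np.array_split` accepts a length that does not divide evenly by the number of folds. The sketch below shows the resulting fold sizes for 890 rows (the row count reported by `describe()`) split 20 ways. The `rows` array is a placeholder index, not the real data.

```python
from collections import Counter
import numpy as np

# 890 placeholder row indices standing in for the dataset rows.
rows = np.arange(890)
folds = np.array_split(rows, 20)

# 890 = 20 * 44 + 10, so the first 10 folds hold 45 rows
# and the remaining 10 folds hold 44 rows.
fold_sizes = Counter(len(f) for f in folds)
print(fold_sizes)
```

Unlike `np.split`, which raises an error on uneven division, `np.array_split` spreads the remainder over the leading folds, so every row lands in exactly one fold.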


In [None]:
# Add your code below this comment (Question #P02)
# ----------------------------------
def cross_val_score(model, X, y, cv):
    X_folds = np.array_split(<placeholder>, <placeholder>)
    y_folds = np.array_split(<placeholder>, <placeholder>)
    print('X_folds', Counter([i.shape for i in X_folds]), 'y_folds', Counter([i.shape for i in y_folds]))

# Add your code below this comment (Question #P03)
# ----------------------------------
    for i in range(cv):
        X_train = np.concatenate([X_folds[<placeholder>] for j in range(cv) if <placeholder>])
        X_test = X_folds[<placeholder>]
        y_train = np.concatenate([y_folds[<placeholder>] for j in range(cv) if <placeholder>])
        y_test = y_folds[<placeholder>]
        model.fit(<placeholder>, <placeholder>)
        yield model.score(<placeholder>, <placeholder>)


In [None]:
# now, test the above function
print("Cross-validation:")
cv_scores = []
for i, score in enumerate(cross_val_score(model, <placeholder>, <placeholder>, cv=<placeholder>)):
    print(('\tscore[%d] ='%i), score)
    cv_scores.append(score)

In [None]:
np.mean(cv_scores)

# Save your notebook!  Then `File > Close and Halt`