# Module 2: Cross Validation - Answers for practice

In this practice you will apply **20-fold cross validation** to a **Gaussian Naive Bayes model** fitted to the **titanic** dataset.

+ Look for **placeholders** in the code and fill in the blanks.
+ Requirements, when provided, appear in **bold** font.
+ The presentation of printouts is not strict, as long as they are readable and equivalent.


In [1]:
import os, sys
from collections import Counter
import numpy as np
import pandas as pd
import sklearn.model_selection
from sklearn.naive_bayes import GaussianNB


## Load Dataset

Load the dataset from a CSV file into a pandas DataFrame and shuffle its rows.

In [2]:
# Dataset location
DATASET = '/dsa/data/all_datasets/titanic_ML/titanic.csv'
assert os.path.exists(DATASET)

# Load and shuffle
dataset = pd.read_csv(DATASET).sample(frac = 1).reset_index(drop=True)
dataset.describe()

| | pclass | sex | age | sibsp | parch | fare | embarked | survived |
|---|---|---|---|---|---|---|---|---|
| count | 890.0 | 890.0 | 890.0 | 890.0 | 890.0 | 890.0 | 890.0 | 890.0 |
| mean | 2.31236 | 0.642697 | 29.548697 | 0.503371 | 0.351685 | 32.865772 | 0.895506 | 0.389888 |
| std | 0.837241 | 0.479475 | 13.379025 | 1.095286 | 0.790069 | 52.639685 | 0.529535 | 0.487999 |
| min | 1.0 | 0.0 | 0.17 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 25% | 2.0 | 0.0 | 21.0 | 0.0 | 0.0 | 7.8958 | 1.0 | 0.0 |
| 50% | 3.0 | 1.0 | 28.0 | 0.0 | 0.0 | 13.775 | 1.0 | 0.0 |
| 75% | 3.0 | 1.0 | 37.0 | 1.0 | 0.0 | 29.925 | 1.0 | 1.0 |
| max | 3.0 | 1.0 | 80.0 | 8.0 | 9.0 | 512.3292 | 2.0 | 1.0 |


In [3]:
dataset.head()

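| | pclass | sex | age | sibsp | parch | fare | embarked | survived |
|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 1 | 38.0 | 0 | 1 | 153.4625 | 1 | 0 |
| 1 | 3 | 1 | 24.0 | 0 | 0 | 7.25 | 2 | 0 |
| 2 | 1 | 1 | 50.0 | 1 | 0 | 55.9 | 1 | 0 |
| 3 | 1 | 1 | 53.0 | 0 | 0 | 28.5 | 0 | 1 |
| 4 | 2 | 1 | 66.0 | 0 | 0 | 10.5 | 1 | 0 |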


## Part 1: Cross validation with sklearn

Run 20-fold cross validation using `cross_val_score()` from `sklearn.model_selection`.

In [4]:
model = GaussianNB()

# Add your code below this comment (Question #P01)
# ----------------------------------
X = dataset.iloc[:,:-1]  # include all the columns except the last one
y = dataset.survived


In [5]:
from sklearn.preprocessing import MinMaxScaler

# Fit the scaler on the features, then rescale every column to the range [0, 1]
scaler = MinMaxScaler().fit(X)
X_scaled = scaler.transform(X)


In [6]:

cv_scores = sklearn.model_selection.cross_val_score(model, X_scaled, y, cv=20) # <placeholder>
cv_scores

array([0.8       , 0.86666667, 0.75555556, 0.84444444, 0.73333333,
       0.75555556, 0.71111111, 0.82222222, 0.68888889, 0.71111111,
       0.86363636, 0.77272727, 0.81818182, 0.86363636, 0.72727273,
       0.81818182, 0.68181818, 0.68181818, 0.77272727, 0.70454545])

In [7]:
np.mean(cv_scores)

0.7696717171717171
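
The cells above fit `MinMaxScaler` on all rows before cross validation, so every test fold contributes to the scaling statistics. A common alternative is to wrap the scaler and the model in a pipeline, so that the scaler is refit on the training part of each fold. The sketch below illustrates that idea under stated assumptions: `X_demo` and `y_demo` are hypothetical synthetic stand-ins generated with `make_classification`, not the Titanic data, and the resulting scores are not comparable to the ones printed above.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler

# Hypothetical stand-in data: 890 rows and 7 features, like the Titanic table
X_demo, y_demo = make_classification(n_samples=890, n_features=7, random_state=0)

# Scaler and classifier are cloned and refit together inside every fold
pipeline = make_pipeline(MinMaxScaler(), GaussianNB())
pipeline_scores = cross_val_score(pipeline, X_demo, y_demo, cv=20)

print(pipeline_scores.shape)   # one score per fold
print(np.mean(pipeline_scores))
```

With the real data, the same call would take `X` and `y` in place of the synthetic arrays.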

## Part 2: Create cross validation manually

In [8]:
# Add your code below this comment (Question #P02)
# ----------------------------------
def cross_val_score(model, X, y, cv):
    """Yield the test score of `model` for each of `cv` contiguous folds."""
    # Split into `cv` nearly equal folds; plain arrays keep the splitting uniform
    X_folds = np.array_split(np.asarray(X), cv)
    y_folds = np.array_split(np.asarray(y), cv)
    print('X_folds', Counter([i.shape for i in X_folds]), 'y_folds', Counter([i.shape for i in y_folds]))

    # Add your code below this comment (Question #P03)
    # ----------------------------------
    for i in range(cv):
        # Fold i is the test set; all remaining folds form the training set
        X_train = np.concatenate([X_folds[j] for j in range(cv) if j != i])
        X_test = X_folds[i]
        y_train = np.concatenate([y_folds[j] for j in range(cv) if j != i])
        y_test = y_folds[i]
        model.fit(X_train, y_train)
        yield model.score(X_test, y_test)



In [9]:
print("Cross validation:")
cv_scores = []
for i, score in enumerate(cross_val_score(model, X_scaled, y, cv=20)):
    print(('\tscore[%d] ='%i), score)
    cv_scores.append(score)

Cross validation:
X_folds Counter({(45, 7): 10, (44, 7): 10}) y_folds Counter({(45,): 10, (44,): 10})
	score[0] = 0.8
	score[1] = 0.8666666666666667
	score[2] = 0.8
	score[3] = 0.7777777777777778
	score[4] = 0.7555555555555555
	score[5] = 0.7333333333333333
	score[6] = 0.7555555555555555
	score[7] = 0.7777777777777778
	score[8] = 0.7111111111111111
	score[9] = 0.7333333333333333
	score[10] = 0.8409090909090909
	score[11] = 0.7727272727272727
	score[12] = 0.8181818181818182
	score[13] = 0.8863636363636364
	score[14] = 0.6590909090909091
	score[15] = 0.8409090909090909
	score[16] = 0.7045454545454546
	score[17] = 0.6590909090909091
	score[18] = 0.75
	score[19] = 0.7272727272727273
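
The `X_folds` line in the printout follows from how `np.array_split` handles a row count that is not divisible by the number of folds: 890 rows over 20 folds leaves a remainder of 10, so the first 10 folds receive 45 rows and the remaining 10 receive 44. A minimal sketch that reproduces the fold sizes, using a plain index array instead of the dataset (the names `n_rows`, `n_folds`, and `fold_sizes` are illustrative only):

```python
from collections import Counter

import numpy as np

n_rows, n_folds = 890, 20

# Split the row indices 0..889 into 20 contiguous, nearly equal chunks
folds = np.array_split(np.arange(n_rows), n_folds)

# Count how many folds have each length
fold_sizes = Counter(len(fold) for fold in folds)
print(fold_sizes)
```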


In [10]:
np.mean(cv_scores)

0.768510101010101
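
The mean here (about 0.7685) differs slightly from the mean in Part 1 (about 0.7697) even though the model and data are the same. A likely reason is the fold assignment: for a classifier, sklearn's `cross_val_score` with an integer `cv` uses `StratifiedKFold`, which balances the class proportions in each fold, while the manual version cuts the rows into contiguous blocks. The sketch below checks this on hypothetical synthetic data (`X_demo`, `y_demo`, and the helper `manual_scores` are assumptions for illustration, not part of the practice): an unshuffled `KFold` should match the manual split, and the stratified scores are printed for comparison.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import KFold, StratifiedKFold, cross_val_score
from sklearn.naive_bayes import GaussianNB

# Hypothetical stand-in data with the same shape as the Titanic features
X_demo, y_demo = make_classification(n_samples=890, n_features=7, random_state=0)


def manual_scores(model, X, y, cv):
    """Contiguous-fold cross validation, mirroring the manual approach of Part 2."""
    X_folds = np.array_split(X, cv)
    y_folds = np.array_split(y, cv)
    scores = []
    for i in range(cv):
        X_train = np.concatenate([X_folds[j] for j in range(cv) if j != i])
        y_train = np.concatenate([y_folds[j] for j in range(cv) if j != i])
        model.fit(X_train, y_train)
        scores.append(model.score(X_folds[i], y_folds[i]))
    return np.array(scores)


manual = manual_scores(GaussianNB(), X_demo, y_demo, cv=20)
# Unshuffled KFold produces the same contiguous test blocks as np.array_split
plain = cross_val_score(GaussianNB(), X_demo, y_demo, cv=KFold(n_splits=20))
# StratifiedKFold is what an integer `cv` selects for a classifier
stratified = cross_val_score(GaussianNB(), X_demo, y_demo, cv=StratifiedKFold(n_splits=20))

print(np.allclose(manual, plain))
print(manual.mean(), stratified.mean())
```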

# Save your notebook!