### Exercise 1 - Given 1000 records in a dataset, 1000 models are trained with 999 records as part of the training sample and the remaining 1 sample for testing, and the error rate is averaged out, this validation technique is called what?

Answer: LOOCV (leave-one-out cross validation)

### Exercise 2 - In k-fold cross validation technique, the value of k being small could lead to which of the following in relation to the error rate?

Answer: High bias and low variance

### Exercise 3 - In k-fold cross validation technique, the value of k being large could lead to which of the following in relation to the error rate?

Answer: Low bias and high variance

### Exercise 4 - Explain what regularization is and why it is useful.

Answer: Fitting a model with all the input variables. It shrinks the variables towards zero relative to the least squares estimates. This helps with reducing variance. The two best techiqnues for shrinking the regression coefficients towards zero are *ridge regression* and the *LASSO*

### Exercise 5 - Use framingham.csv data file. It is from an ongoing cardiovascular study on residents of the town of Framingham, Massachusetts. The classification goal is to predict whether the patient has 10-year risk of future coronary heart disease (CHD). The dataset provides the patients' information. It includes over 4,000 records and 15 attributes. Each attribute is a potential risk factor. There are both demographic, behavioral and medical risk factors.

In [7]:
### 5a - Using the pandas library, read the csv data file and create a data-frame called heart. ###

import pandas as pd
import numpy as np
from sklearn.model_selection import KFold
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import MinMaxScaler
from sklearn.metrics import f1_score

heart = pd.read_csv('framingham.csv')
heart.head()

Unnamed: 0,male,age,education,currentSmoker,cigsPerDay,BPMeds,prevalentStroke,prevalentHyp,diabetes,totChol,sysBP,diaBP,BMI,heartRate,glucose,TenYearCHD
0,1,39,4.0,0,0.0,0.0,0,0,0,195.0,106.0,70.0,26.97,80.0,77.0,0
1,0,46,2.0,0,0.0,0.0,0,0,0,250.0,121.0,81.0,28.73,95.0,76.0,0
2,1,48,1.0,1,20.0,0.0,0,0,0,245.0,127.5,80.0,25.34,75.0,70.0,0
3,0,61,3.0,1,30.0,0.0,0,1,0,225.0,150.0,95.0,28.58,65.0,103.0,1
4,0,46,3.0,1,23.0,0.0,0,0,0,285.0,130.0,84.0,23.1,85.0,85.0,0


In [8]:
### 5b - Remove observations with missing values ###

heart = heart.dropna()

In [28]:
### 5c - Perform a 5-folds cross validation with the goal of measuring the performance
### in terms of F1-score, of two competing models:

#TenYearCHD is target variable
X = heart.drop(columns = ['TenYearCHD'], axis=1)
Y = heart['TenYearCHD']

#5 KFolds cross validation
k_fold = KFold(n_splits = 5, shuffle = True)

#Empty lists to store f1 score results
score_1 = list()
score_2 = list()

for train_idx, val_idx in k_fold.split(X):
    #Split data
    X_train, X_val = X.iloc[train_idx],X.iloc[val_idx]
    Y_train, Y_val = Y.iloc[train_idx],Y.iloc[val_idx]
    
    ### Model 1 ###
    X_1 = X_train[['age', 'currentSmoker', 'totChol', 'sysBP', 'diaBP', 'BMI', 'heartRate', 'glucose']]
    X_val_1 = X_val[['age', 'currentSmoker', 'totChol', 'sysBP', 'diaBP', 'BMI', 'heartRate', 'glucose']]
    
    #Scaling input variables
    scaler = MinMaxScaler()
    X_1 = scaler.fit_transform(X_1)
    X_val_1 = scaler.fit_transform(X_val_1)
    
    md_1 = LogisticRegression().fit(X_1, Y_train)
    
    #Predicting
    pred_1 = md_1.predict_proba(X_val_1)[:,1]
    
    #25% threshold, change likelihood to labels
    pred_1_label = np.where(pred_1 < 0.25, 0, 1)
    
    #f1 score model 1
    score_1.append(f1_score(Y_val, pred_1_label))
    
    ### Model 2 - Got rid of sysBP and diaBP from input variables ###
    X_2 = X_train[['age', 'currentSmoker', 'totChol', 'BMI', 'heartRate', 'glucose']]
    X_val_2 = X_val[['age', 'currentSmoker', 'totChol', 'BMI', 'heartRate', 'glucose']]
    
    #Scaling input variables
    X_2 = scaler.fit_transform(X_2)
    X_val_2 = scaler.fit_transform(X_val_2)    
    
    md_2 = LogisticRegression().fit(X_2, Y_train)
    
    #Predicting
    pred_2 = md_2.predict_proba(X_val_2)[:,1]
    
    #25% threshold, change likelihood to labels
    pred_2_label = np.where(pred_2 < 0.25, 0, 1)
    
    #f1 score model 2
    score_2.append(f1_score(Y_val, pred_2_label))
    
print('Model 1: ', score_1, '\n')
print('Model 2: ', score_2)

Model 1:  [0.3448275862068966, 0.38267148014440433, 0.3679525222551929, 0.2857142857142857, 0.3951367781155015] 

Model 2:  [0.36090225563909767, 0.3206751054852321, 0.34532374100719426, 0.32131147540983607, 0.35179153094462545]


In [29]:
### 5d - Report the average F1-score of each of the models. 
### What model would you use to predict TenYearCHD? Explain

print('Average f1 score of model 1: ', np.mean(score_1))
print('Average f1 score of model 2: ', np.mean(score_2))

Average f1 score of model 1:  0.3552605304872562
Average f1 score of model 2:  0.3400008216971971


From the above, I would use model 1 to predict TenYearCHD because the f1 score of model 1 is higher.
f1 scales from 0 to 1, a higher f1 score means that the model better represents each observation