<h2>Exercise 1 - A data scientist is running an AdaBoost classifier on a dataset with 100 observations. Answer the
following:</h2>

1a) Initial weight of observation 72 is 1/100 since there are 100 observations and they are all weighted equally to begin with.

1b) Because observation 72 was misclassified, its new weight will be *larger* than its initial weight of 1/100. AdaBoost up-weights misclassified observations so that the next weak learner pays more attention to them.
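As a rough illustration of 1a and 1b, here is a minimal sketch of one round of the classic discrete AdaBoost weight update. The exercise does not say how many observations the first learner gets wrong, so the choice of 10 misclassified observations (including observation 72) is purely an assumption, and the variable names are made up for this example. Library implementations may scale the learner weight differently.

```python
import math

n = 100
weights = [1.0 / n] * n  # 1a: every observation starts at 1/100

# Assumption for illustration only: the first weak learner misclassifies
# 10 observations, one of which is observation 72 (index 71).
misclassified = set(range(62, 72))

# Weighted error of the weak learner and its weight (alpha) in the ensemble
eps = sum(weights[i] for i in misclassified)
alpha = 0.5 * math.log((1 - eps) / eps)

# Up-weight misclassified observations, down-weight the rest, then normalize
unnormalized = [
    w * math.exp(alpha if i in misclassified else -alpha)
    for i, w in enumerate(weights)
]
total = sum(unnormalized)
new_weights = [w / total for w in unnormalized]

print("old weight of observation 72:", weights[71])
print("new weight of observation 72:", new_weights[71])
print("new weight of a correctly classified observation:", new_weights[0])
```

Under these assumptions the weight of observation 72 grows from 0.01 to about 0.05, while correctly classified observations drop below 0.01.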

<h2>Exercise 2 - Explain why AdaBoost is an ensemble learning algorithm? Be specific.</h2>

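**Answer:** AdaBoost combines many weak learners (typically shallow decision trees) into one strong learner. The learners are trained sequentially: after each round, the observations that the current learner misclassified get larger weights, so the next learner focuses on them. The final prediction is a weighted vote of all the learners, which is usually a much better predictor than any single one of them.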
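The "weak learners become a strong learner" idea can be sketched with a weighted vote. The labels, the three sets of weak-learner predictions, and the learner weights (`alphas`) below are all made up for illustration. In a real AdaBoost run they would come from training.

```python
# Made-up labels in {-1, +1} for five observations
y_true = [1, 1, -1, -1, 1]

# Hypothetical predictions of three weak learners; each is wrong exactly once
weak_predictions = [
    [1, 1, -1, -1, -1],   # wrong on the 5th observation
    [1, -1, -1, -1, 1],   # wrong on the 2nd observation
    [-1, 1, -1, -1, 1],   # wrong on the 1st observation
]
alphas = [0.9, 0.7, 0.8]  # hypothetical learner weights


def accuracy(pred, truth):
    """Fraction of predictions that match the true labels."""
    return sum(p == t for p, t in zip(pred, truth)) / len(truth)


# Ensemble prediction: sign of the alpha-weighted sum of the weak predictions
scores = [
    sum(a * h[i] for a, h in zip(alphas, weak_predictions))
    for i in range(len(y_true))
]
ensemble = [1 if s > 0 else -1 for s in scores]

for k, h in enumerate(weak_predictions, start=1):
    print(f"weak learner {k} accuracy:", accuracy(h, y_true))
print("ensemble accuracy:", accuracy(ensemble, y_true))
```

Each weak learner is right on four of the five observations, but because they make their mistakes in different places, the weighted vote gets all five right.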

<h2>Exercise 3 - refer to picture taken of handwritten work</h2>

<h2>Exercise 4 - If your AdaBoost ensemble under-fits the training dataset, what would you do to fix
that? That is, which hyper-parameters should you tweak?</h2>

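**Answer:** Increase the number of estimators so the ensemble combines more weak learners, and increase the learning rate so each learner contributes more. These two hyper-parameters interact, so they should be tuned together. Using a less constrained base estimator (for example, a deeper tree) also reduces under-fitting.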

<h2>Exercise 5 - For binary classification, which of the following statements are TRUE of AdaBoost with decision trees as learners?</h2>

**Answer:** A, B, & C

<h2>Exercise 6 - Which of the following is/are TRUE about gradient boosting trees?</h2>

**Answer:** F (so both b and c)

<h2>Exercise 7 - In this course we have covered two boosting frameworks. What is the main difference
between AdaBoost and Gradient Boosting? Be specific.</h2>

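**Answer:** AdaBoost re-weights the observations: those misclassified by the previous learner get larger weights, so the next learner concentrates on them. Gradient Boosting leaves the observation weights alone and instead fits each new weak learner to the residuals (more generally, the negative gradient of the loss function) of the current ensemble, so each new learner directly reduces the loss.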
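To make the "fit the residuals" part concrete, here is a small sketch of gradient boosting for regression with squared-error loss, where the negative gradient is exactly the residual. The toy data, the helper names `fit_stump` and `predict_stump`, and the learning rate of 0.5 are assumptions chosen for illustration. This is not how scikit-learn implements it internally.

```python
def fit_stump(x, r):
    """Fit a one-split regression stump to the residuals r.

    Returns (threshold, left_value, right_value) with the lowest squared error.
    """
    best = None
    xs = sorted(set(x))
    for lo, hi in zip(xs, xs[1:]):
        t = (lo + hi) / 2
        left = [ri for xi, ri in zip(x, r) if xi <= t]
        right = [ri for xi, ri in zip(x, r) if xi > t]
        lv = sum(left) / len(left)
        rv = sum(right) / len(right)
        sse = sum((ri - lv) ** 2 for ri in left) + sum((ri - rv) ** 2 for ri in right)
        if best is None or sse < best[0]:
            best = (sse, t, lv, rv)
    return best[1:]


def predict_stump(stump, xi):
    t, lv, rv = stump
    return lv if xi <= t else rv


def mse(y, pred):
    return sum((yi - pi) ** 2 for yi, pi in zip(y, pred)) / len(y)


# Made-up one-dimensional regression data
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [1.2, 1.9, 3.2, 3.8, 5.1, 6.0]
learning_rate = 0.5

# Start from a constant prediction, then repeatedly fit a stump to the residuals
pred = [sum(y) / len(y)] * len(y)
initial_mse = mse(y, pred)
mse_history = []
for _ in range(20):
    residuals = [yi - pi for yi, pi in zip(y, pred)]
    stump = fit_stump(x, residuals)
    pred = [pi + learning_rate * predict_stump(stump, xi) for xi, pi in zip(x, pred)]
    mse_history.append(mse(y, pred))

print("MSE before boosting:", initial_mse)
print("MSE after 20 rounds:", mse_history[-1])
```

Note that no observation weights appear anywhere in the loop: each round only looks at what the current ensemble still gets wrong, which is the key contrast with AdaBoost.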

<h2>Exercise 8 - Use framingham datafile to answer questions below</h2>

In [1]:
#8a - read the csv data file and create a data-frame called heart. Remove the observations with missing values.

import pandas as pd

heart = pd.read_csv('framingham(4).csv')
heart = heart.dropna()
heart.head()

Unnamed: 0,male,age,education,currentSmoker,cigsPerDay,BPMeds,prevalentStroke,prevalentHyp,diabetes,totChol,sysBP,diaBP,BMI,heartRate,glucose,TenYearCHD
0,1,39,4.0,0,0.0,0.0,0,0,0,195.0,106.0,70.0,26.97,80.0,77.0,0
1,0,46,2.0,0,0.0,0.0,0,0,0,250.0,121.0,81.0,28.73,95.0,76.0,0
2,1,48,1.0,1,20.0,0.0,0,0,0,245.0,127.5,80.0,25.34,75.0,70.0,0
3,0,61,3.0,1,30.0,0.0,0,1,0,225.0,150.0,95.0,28.58,65.0,103.0,1
4,0,46,3.0,1,23.0,0.0,0,0,0,285.0,130.0,84.0,23.1,85.0,85.0,0


In [2]:
### 8b ###

import numpy as np
from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import train_test_split
from sklearn.metrics import recall_score, accuracy_score 
from sklearn.ensemble import RandomForestClassifier, ExtraTreesClassifier, AdaBoostClassifier, GradientBoostingClassifier
from sklearn.tree import DecisionTreeClassifier

#Predictor variables = age, totChol, sysBP, BMI, heartRate, and glucose
#Target variable = TenYearCHD

X = heart[['age', 'totChol' , 'sysBP', 'BMI', 'heartRate', 'glucose']]
Y = heart['TenYearCHD']

#We will store our recall and accuracy scores here
RF_recall = list()
ET_recall = list()
Ada_recall = list()
GB_recall = list()
RF_accuracy = list()
ET_accuracy = list()
Ada_accuracy = list()
GB_accuracy = list()

#Repeat 100 times
for i in range(0, 100):
    
    #Split data
    X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size = 0.2, stratify = Y)


    ### Random Forest Classifier ###
    RF_md = RandomForestClassifier(n_estimators = 500, max_depth = 5).fit(X_train, Y_train)
    
    #Predict
    RF_pred = RF_md.predict_proba(X_test)[:, 1]
    
    #Labels with 10% cutoff
    RF_labels = np.where(RF_pred < 0.1, 0, 1)
    RF_recall.append(recall_score(Y_test, RF_labels))
    RF_accuracy.append(accuracy_score(Y_test, RF_labels))


    ### Extra Trees Classifier ###
    ET_md = ExtraTreesClassifier(n_estimators = 500, max_depth = 5).fit(X_train, Y_train)
    
    #Predict
    ET_pred = ET_md.predict_proba(X_test)[:, 1]
    
    #Labels with 10% cutoff
    ET_labels = np.where(ET_pred < 0.1, 0, 1)
    ET_recall.append(recall_score(Y_test, ET_labels))
    ET_accuracy.append(accuracy_score(Y_test, ET_labels))

    
    ### AdaBoost ###
    Ada_md = AdaBoostClassifier(estimator = DecisionTreeClassifier(max_depth = 3), 
                                n_estimators = 500,
                                learning_rate = 0.01).fit(X_train, Y_train)
    
    #Predict
    Ada_pred = Ada_md.predict_proba(X_test)[:, 1]
    
    #Labels with 10% cutoff
    Ada_labels = np.where(Ada_pred < 0.1, 0, 1)
    Ada_recall.append(recall_score(Y_test, Ada_labels))
    Ada_accuracy.append(accuracy_score(Y_test, Ada_labels))

    
    ### Gradient Boosting ###
    GB_md = GradientBoostingClassifier(max_depth = 3,
                                       n_estimators = 500, 
                                       learning_rate = 0.01).fit(X_train, Y_train)
    
    #Predict
    GB_pred = GB_md.predict_proba(X_test)[:, 1]
    
    #Labels with 10% cutoff
    GB_labels = np.where(GB_pred < 0.1, 0, 1)
    GB_recall.append(recall_score(Y_test, GB_labels))
    GB_accuracy.append(accuracy_score(Y_test, GB_labels))

In [3]:
#Average accuracy and recall of each model above
print('RF recall: ', np.mean(RF_recall))
print('RF accuracy: ', np.mean(RF_accuracy))

print('ET recall: ', np.mean(ET_recall))
print('ET accuracy: ', np.mean(ET_accuracy))

print('Ada recall: ', np.mean(Ada_recall))
print('Ada accuracy: ', np.mean(Ada_accuracy))

print('GB recall: ', np.mean(GB_recall))
print('GB accuracy: ', np.mean(GB_accuracy))

RF recall:  0.8361607142857141
RF accuracy:  0.48892076502732246
ET recall:  0.9362500000000001
ET accuracy:  0.33814207650273226
Ada recall:  0.9914285714285715
Ada accuracy:  0.15834699453551915
GB recall:  0.8115178571428572
GB accuracy:  0.5068169398907104


From the above, I would use RF or GB: their recall is fairly high (about 0.84 and 0.81) while their accuracy (about 0.49 and 0.51) is much better than that of ET (about 0.34) or AdaBoost (about 0.16).

<h4>### 8c ###</h4>

All of the models have an average recall over 80%, but none of them has an average accuracy over 80%. We could tune max_depth, n_estimators, and learning_rate to try to raise accuracy to 80%.
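Besides the hyper-parameters, the 10% cutoff itself drives much of the recall/accuracy trade-off seen above. The sketch below uses made-up labels and predicted probabilities (not the Framingham results) and a hypothetical helper, `recall_and_accuracy`, to show how moving the cutoff shifts the two metrics in opposite directions. It labels an observation 1 when its probability is at least the cutoff, matching the `np.where(pred < 0.1, 0, 1)` rule used in 8b.

```python
def recall_and_accuracy(y_true, probs, cutoff):
    """Label 1 when the probability is >= cutoff, then score the labels."""
    labels = [1 if p >= cutoff else 0 for p in probs]
    tp = sum(1 for t, l in zip(y_true, labels) if t == 1 and l == 1)
    fn = sum(1 for t, l in zip(y_true, labels) if t == 1 and l == 0)
    correct = sum(1 for t, l in zip(y_true, labels) if t == l)
    return tp / (tp + fn), correct / len(y_true)


# Made-up data: 8 negatives followed by 2 positives
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
probs = [0.03, 0.05, 0.08, 0.12, 0.15, 0.18, 0.22, 0.30, 0.14, 0.45]

for cutoff in (0.10, 0.20, 0.40):
    recall, accuracy = recall_and_accuracy(y_true, probs, cutoff)
    print(f"cutoff={cutoff:.2f}  recall={recall:.2f}  accuracy={accuracy:.2f}")
```

On this toy data, the low cutoff catches every positive but mislabels many negatives, and raising the cutoff trades recall for accuracy. Whether any cutoff gives both metrics above 80% on the real data would have to be checked empirically.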