# Homework Assignment 6 - Evan Callaghan

### 1. A data scientist is running an AdaBoost classifier on a dataset with 100 observations. Answer the following:

### a) What is the initial weight of the 72nd observation in the training dataset?

#### Initially, an AdaBoost classifier weights all observations equally. Since there are 100 observations under consideration, the initial weight of the 72nd observation is 1/100 (0.01).

### b) The 72nd observation in the training dataset is misclassified by the first weak learner chosen by the data scientist. Is the new weight of the 72nd observation in the training dataset larger or smaller than the weight assigned to that observation initially?

#### AdaBoost increases the weights of misclassified observations so that the next weak learner focuses on the observations that the previous learners got wrong. Therefore, since the 72nd observation is misclassified by the first learner, its new weight will be larger than its initial weight of 0.01.
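
To make the reweighting concrete, here is a minimal sketch of one AdaBoost.M1 update. It assumes the common formulation in which the learner weight is α = η·log((1 − r)/r), with r the weighted error rate, and in which the observation weights are renormalized to sum to 1. The function name `adaboost_m1_reweight`, the choice η = 1, and the set of ten misclassified indices are hypothetical and chosen only for illustration; the question states only that the 72nd observation is misclassified.

```python
import math

def adaboost_m1_reweight(weights, misclassified, learning_rate=1.0):
    """One AdaBoost.M1 weight update (illustrative sketch).

    Assumes the weighted error rate r satisfies 0 < r < 0.5.
    """
    total = sum(weights)
    # Weighted error rate of the current weak learner
    r = sum(w for w, miss in zip(weights, misclassified) if miss) / total
    # Learner weight: larger when the learner is more accurate
    alpha = learning_rate * math.log((1.0 - r) / r)
    # Boost only the misclassified observations, then renormalize
    boosted = [w * math.exp(alpha) if miss else w
               for w, miss in zip(weights, misclassified)]
    norm = sum(boosted)
    return [w / norm for w in boosted], alpha

n = 100
w0 = [1.0 / n] * n  # equal initial weights

# Hypothetical outcome: 10 observations misclassified, including index 71
# (the 72nd observation, since Python indexing starts at 0)
missed_indices = {3, 10, 17, 24, 38, 45, 59, 71, 80, 93}
miss = [i in missed_indices for i in range(n)]

w1, alpha = adaboost_m1_reweight(w0, miss)
print(w0[71], w1[71])
```

Under these assumptions the weight of the 72nd observation rises from 0.01 to about 0.05, while each correctly classified observation drops to about 0.0056.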

### 2. Explain why AdaBoost.M1 is an ensemble learning algorithm.

#### An ensemble learning algorithm combines a set of learners, trained together or in sequence, into a committee. The final prediction is obtained by averaging the individual predictions (regression) or by a majority vote (classification). In AdaBoost.M1, a fixed number of weak learners are added sequentially, each trained on a reweighted version of the training data, and their predictions are combined by a weighted majority vote. Therefore, AdaBoost.M1 is an ensemble learning algorithm.
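
The committee idea can be sketched as a weighted vote over the weak learners' predictions. This is only an illustration, assuming labels coded as -1/+1; the function name `weighted_majority_vote`, the three sets of predictions, and the learner weights are all hypothetical.

```python
def weighted_majority_vote(predictions, alphas):
    """Combine weak-learner predictions (labels -1 or +1) by a weighted vote.

    predictions: one list of labels per weak learner
    alphas: one non-negative weight per weak learner
    """
    n_samples = len(predictions[0])
    combined = []
    for j in range(n_samples):
        # Each learner votes for its label, scaled by its weight
        score = sum(a * p[j] for a, p in zip(alphas, predictions))
        combined.append(1 if score >= 0 else -1)
    return combined

# Hypothetical predictions of three weak learners on three observations
preds = [
    [+1, -1, +1],  # learner 1
    [+1, +1, -1],  # learner 2
    [-1, +1, -1],  # learner 3
]
alphas = [1.5, 0.4, 0.6]  # hypothetical learner weights

combined = weighted_majority_vote(preds, alphas)
print(combined)
```

Here learner 1 carries enough weight to outvote the other two on the second and third observations, which is how a weighted vote differs from a simple majority.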


### 3. Suppose you are running AdaBoost.M1 (with η = 0.1) with 4 training examples. At the start of the current iteration, the four examples have the weights shown in the following table. A second column indicates whether the weak classifier classified each example correctly or incorrectly. Determine the new weights for these four examples:

#### 

### 4. If your AdaBoost ensemble under-fits the training dataset, what would you do to fix that? That is, which hyper-parameters should you tweak?

#### If an AdaBoost ensemble under-fits the training dataset, I would increase the number of estimators or reduce the regularization of the base estimator (for example, by allowing deeper decision trees). Both options increase the complexity of the model and therefore reduce under-fitting.
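
As a rough illustration of these two knobs, the sketch below compares a deliberately constrained ensemble with a more flexible one on synthetic data. The dataset from `make_classification` and the specific values (5 versus 200 estimators, depth 1 versus depth 3) are assumptions made for the example, not part of the assignment.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic data, used only for illustration
X, y = make_classification(n_samples=300, n_features=6, n_informative=4,
                           random_state=0)

# Constrained ensemble: few estimators and depth-1 stumps
constrained = AdaBoostClassifier(DecisionTreeClassifier(max_depth=1),
                                 n_estimators=5, random_state=0).fit(X, y)

# More flexible ensemble: more estimators and a less regularized base learner
flexible = AdaBoostClassifier(DecisionTreeClassifier(max_depth=3),
                              n_estimators=200, random_state=0).fit(X, y)

# Training accuracy is the relevant check for under-fitting
constrained_acc = constrained.score(X, y)
flexible_acc = flexible.score(X, y)
print(constrained_acc, flexible_acc)
```

The base learner is passed positionally so the snippet does not depend on whether the installed scikit-learn version names that argument `base_estimator` or `estimator`.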


### 5. For binary classification, which of the following statements are TRUE of AdaBoost with decision trees as learners?

#### A. It usually has lower bias than a single decision tree.
#### C. It assigns higher weights to observations that have been misclassified.

In [None]:
## 6. a) Using the pandas library to read the csv data file and create a data-frame called heart

import boto3
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC
from sklearn.preprocessing import MinMaxScaler
from sklearn.metrics import recall_score

## Defining the bucket
s3 = boto3.resource('s3')
bucket_name = 'data-445-bucket-callaghan'
bucket = s3.Bucket(bucket_name)

## Defining the csv file
file_key = 'framingham.csv'

bucket_object = bucket.Object(file_key)
file_object = bucket_object.get()
file_content_stream = file_object.get('Body')

heart = pd.read_csv(file_content_stream)

## Removing observations with missing values
heart = heart.dropna()

heart.head()

In [None]:
## b) Using age, totChol, sysBP, BMI, heartRate, and glucose as the predictor variables and TenYearCHD as the target
## variable to do the following:

## Defining the input and target variables:
X = heart[['age', 'totChol', 'sysBP', 'BMI', 'heartRate', 'glucose']]
Y = heart['TenYearCHD']

## Splitting the data
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size = 0.2, stratify = Y)

## Normalizing the input variables
scaler = MinMaxScaler(feature_range = (0,1))
X_train = scaler.fit_transform(X_train)
## Reusing the scaler fitted on the training set so that the test set is scaled consistently
X_test = scaler.transform(X_test)

## Defining empty lists to store results
md1_recall = []
md2_recall = []
md3_recall = []


## Repeating the steps 100 times
for i in range(0, 100):

    ## Model 1
    ## Building a random forest classifier (using 500 trees and maximum tree depth equal to 3)
    ## ------------------------------------------

    ## Building the model
    md1 = RandomForestClassifier(n_estimators = 500, max_depth = 3).fit(X_train, Y_train)

    ## Predicting on the test set
    md1_preds = md1.predict_proba(X_test)[:, 1]

    ## Using 10% as cutoff value and reporting the recall
    md1_preds = np.where(md1_preds < 0.1, 0, 1)
    md1_recall.append(recall_score(Y_test, md1_preds))



    ## Model 2
    ## Building an AdaBoost classifier (using 500 trees and maximum tree depth equal to 3)
    ## ------------------------------------------

    ## Building the model
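    ## Passing the base learner positionally: the keyword was renamed from
    ## base_estimator to estimator in scikit-learn 1.2
    md2 = AdaBoostClassifier(DecisionTreeClassifier(max_depth = 3), n_estimators = 500).fit(X_train, Y_train)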

    ## Predicting on the test set
    md2_preds = md2.predict_proba(X_test)[:, 1]

    ## Using 10% as cutoff value and reporting the recall
    md2_preds = np.where(md2_preds < 0.1, 0, 1)
    md2_recall.append(recall_score(Y_test, md2_preds))



    ## Model 3
    ## Building an AdaBoost classifier (using 50 learners and the support vector machine classifier 
    ## as the learner with rbf as the kernel)
    ## ------------------------------------------

    ## Building the model
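    ## Passing the base learner positionally: the keyword was renamed from
    ## base_estimator to estimator in scikit-learn 1.2
    md3 = AdaBoostClassifier(SVC(kernel = 'rbf', probability = True), n_estimators = 50).fit(X_train, Y_train)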

    ## Predicting on the test set
    md3_preds = md3.predict_proba(X_test)[:, 1]

    ## Using 10% as cutoff value and reporting the recall
    md3_preds = np.where(md3_preds < 0.1, 0, 1)
    md3_recall.append(recall_score(Y_test, md3_preds))


## Computing the average recall of each of the models across the 100 iterations
print('Average Recall Score of Model 1:', np.mean(md1_recall))
print('Average Recall Score of Model 2:', np.mean(md2_recall))
print('Average Recall Score of Model 3:', np.mean(md3_recall))

## Which model would you use to predict TenYearCHD?