## Problem 2 [40 points] - Chase Rensberger
For this problem, you will need to learn to use software libraries for the following non-linear classifier types:

- Boosted Decision Trees (i.e., boosting with decision trees as the weak learner)
- Random Forests
- Support Vector Machines with Gaussian Kernel
    
All of these are available in scikit-learn, although you may also use other external libraries (e.g., XGBoost for boosted decision trees and LibSVM for SVMs). You are welcome to implement learning algorithms for these classifiers yourself, but this is neither required nor recommended.

Use the non-linear classifiers above to classify the Adult dataset. You can download the data from [a9a](https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary.html) in the LibSVM data repository. The a9a dataset comes with two files: the training file a9a, with 32,561 samples of 123 features each, and the test file a9a.t, with 16,281 samples. Note that the a9a data is in LibSVM format, in which each line takes the form 〈label〉 〈feature-id〉:〈feature-value〉 〈feature-id〉:〈feature-value〉 ... This format is especially suitable for sparse datasets. scikit-learn includes utility functions (e.g., load_svmlight_file) for loading datasets in the LibSVM format.
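To make the line format concrete, here is a minimal pure-Python sketch of parsing a single LibSVM-format line. The function name `parse_libsvm_line` and the sample line are made up for illustration; in practice `load_svmlight_file` does this work and returns a sparse matrix instead of a dictionary.

```python
def parse_libsvm_line(line):
    """Parse '<label> <feature-id>:<feature-value> ...' into (label, features)."""
    tokens = line.split()
    label = float(tokens[0])
    features = {}
    for token in tokens[1:]:
        feature_id, value = token.split(":")
        # Only non-zero features are listed, which is why the format suits sparse data.
        features[int(feature_id)] = float(value)
    return label, features


# Hypothetical line in the style of the a9a files (not copied from the dataset).
label, features = parse_libsvm_line("-1 3:1 11:1 14:1 19:1")
print(label, features)
```

Every feature id missing from the line is implicitly zero, so a 123-feature sample with a handful of active features takes only a few tokens.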

For each learning algorithm, you will need to set various hyperparameters (e.g., the type of kernel and the regularization parameter for SVM; tree_method, max_depth, number of weak classifiers, etc. for XGBoost; number of estimators and min_impurity_decrease for Random Forests). Often there are defaults that make a good starting point, but you may need to adjust at least some of them to get good performance. Use hold-out validation or K-fold cross-validation to do this (scikit-learn has nice features for this, e.g., you may use train_test_split to split data into train and test sets and sklearn.model_selection for K-fold cross-validation). Do not make any hyperparameter choices (or any other similar choices) based on the test set! You should only compute the test error rates after you have settled on hyperparameter settings and trained your three final classifiers.
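As an illustration of what K-fold cross-validation does with the training data, here is a minimal pure-Python sketch that partitions sample indices into K folds. The name `k_fold_indices` is hypothetical; scikit-learn's `KFold` and `cross_val_score` provide the real functionality.

```python
import random


def k_fold_indices(n_samples, k, seed=0):
    """Yield (train_indices, validation_indices) for each of the k folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Deal the shuffled indices into k folds of (nearly) equal size.
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        validation = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, validation


splits = list(k_fold_indices(n_samples=10, k=5))
for train, validation in splits:
    print(sorted(validation), "held out;", len(train), "used for training")
```

Each sample is held out exactly once, so the averaged validation score uses all of the training data without ever touching the test set.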

In [1]:
from sklearn.datasets import load_svmlight_file
from sklearn import metrics
from Problem2RandomForests import determine_random_forest_hp
from Problem2SVMwGK import determine_SVM_hp
from Problem2BoostedDecisionTrees import determine_xgboost_hp
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score
import numpy as np

In [2]:
data_train = load_svmlight_file("a9a.txt")
data_test = load_svmlight_file("a9a.t")

In [3]:
# First separate our input files into features and labels; the training data is passed into each classification function
X_train = data_train[0]
y_train = data_train[1]
X_test = data_test[0]
y_test = data_test[1]

## Boosted Decision Trees

In [4]:
default_n_estimators = 100
default_max_depth = None
default_lambda=None
default_learning_rate=None
default_missing=np.nan
default_objective='binary:logistic'

In [5]:
# This call throws a warning that I could not get rid of; it can be ignored
bdt_out = determine_xgboost_hp(X_train, y_train, X_test, y_test, default_n_estimators, default_max_depth, default_lambda, default_learning_rate, default_missing, default_objective)



Fitting 5 folds for each of 10 candidates, totalling 50 fits






In [6]:
# Test score with default values
bdt_out[0]

0.848289417111971

In [7]:
# Best hyperparameters based on randomized search with 5-fold cross-validation
bdt_out[1]

{'reg_lambda': 1.0,
 'n_estimators': 50,
 'missing': 5.5,
 'max_depth': None,
 'learning_rate': 0.5}

In [8]:
# Test score with the best hyperparameters. Note: this score comes from a single run of the randomized search and may even be below the default score. My final values came from many runs of this search plus some manual tuning.
bdt_out[2]

0.8467538848965052

In [9]:
# Hyperparameter settings tried during the randomized search
bdt_out[3]['params']

[{'reg_lambda': 2.0,
  'n_estimators': 700,
  'missing': 5.5,
  'max_depth': None,
  'learning_rate': 2.0},
 {'reg_lambda': 3.5,
  'n_estimators': 50,
  'missing': 7,
  'max_depth': None,
  'learning_rate': 3.5},
 {'reg_lambda': 2.0,
  'n_estimators': 50,
  'missing': 10,
  'max_depth': None,
  'learning_rate': 4.0},
 {'reg_lambda': 3.5,
  'n_estimators': 550,
  'missing': 7.5,
  'max_depth': None,
  'learning_rate': 4.0},
 {'reg_lambda': 1.0,
  'n_estimators': 50,
  'missing': 5.5,
  'max_depth': None,
  'learning_rate': 0.5},
 {'reg_lambda': 3.0,
  'n_estimators': 950,
  'missing': 4,
  'max_depth': None,
  'learning_rate': 0.5},
 {'reg_lambda': 3.5,
  'n_estimators': 550,
  'missing': 0,
  'max_depth': None,
  'learning_rate': 4.0},
 {'reg_lambda': 2.5,
  'n_estimators': 450,
  'missing': 9,
  'max_depth': None,
  'learning_rate': 0.5},
 {'reg_lambda': 2.5,
  'n_estimators': 400,
  'missing': 9.5,
  'max_depth': None,
  'learning_rate': 3.5},
 {'reg_lambda': 1.5,
  'n_estimators': 8

In [10]:
# CV test scores ("test" here means the held-out fold supplied by cross-validation, not the actual test data), separated by split, for each hyperparameter setting tried.
[bdt_out[3]['split0_test_score'], bdt_out[3]['split1_test_score'], bdt_out[3]['split2_test_score'], bdt_out[3]['split3_test_score'], bdt_out[3]['split4_test_score']]  

[array([0.73606633, 0.73606633, 0.7013665 , 0.76769538, 0.84415784,
        0.82557961, 0.76769538, 0.82941809, 0.72792876, 0.59020421]),
 array([0.38129607, 0.75      , 0.75921376, 0.78378378, 0.84428747,
        0.82616708, 0.78378378, 0.83092752, 0.754914  , 0.6722973 ]),
 array([0.77349509, 0.7590602 , 0.76842752, 0.71268428, 0.84797297,
        0.83031327, 0.71268428, 0.8367629 , 0.65801597, 0.65694103]),
 array([0.78931204, 0.58660934, 0.74523956, 0.75844595, 0.85288698,
        0.83215602, 0.75844595, 0.83522727, 0.68796069, 0.68166462]),
 array([0.77088452, 0.76243857, 0.69394963, 0.7705774 , 0.84781941,
        0.83000614, 0.7705774 , 0.8367629 , 0.70132064, 0.79499386])]

## Random Forest

In [11]:
#Default values
default_n_estimators = 100
default_bootstrap = True
default_max_depth = None
default_min_impurity_decrease = 0.0
default_min_samples_leaf = 1

In [12]:
# Function returns a tuple with (test score with default values, best params dictionary, test score with best params, cross-validation results).
# This function will take roughly 2.5 minutes to run.

rf_out = determine_random_forest_hp(X_train, y_train, X_test, y_test, default_n_estimators, default_bootstrap, default_max_depth, default_min_impurity_decrease, default_min_samples_leaf)

Fitting 5 folds for each of 10 candidates, totalling 50 fits


In [13]:
# Test score with default values
rf_out[0]

0.8326884098028376

In [14]:
# Best hyperparameters based on randomized search with 5-fold cross-validation
rf_out[1]

{'n_estimators': 800,
 'min_samples_leaf': 13,
 'min_impurity_decrease': 0.0,
 'max_depth': 100,
 'bootstrap': True}

In [15]:
# Test score with the best hyperparameters. Note: this score comes from a single run of the randomized search and may even be below the default score. My final values came from many runs of this search plus some manual tuning.
rf_out[2]

0.8470609913395983

In [16]:
# Hyperparameter settings tried during the randomized search
rf_out[3]['params']

[{'n_estimators': 500,
  'min_samples_leaf': 10,
  'min_impurity_decrease': 4.5,
  'max_depth': 130,
  'bootstrap': True},
 {'n_estimators': 300,
  'min_samples_leaf': 1,
  'min_impurity_decrease': 9.5,
  'max_depth': 70,
  'bootstrap': False},
 {'n_estimators': 500,
  'min_samples_leaf': 10,
  'min_impurity_decrease': 1.5,
  'max_depth': 10,
  'bootstrap': True},
 {'n_estimators': 950,
  'min_samples_leaf': 1,
  'min_impurity_decrease': 0.5,
  'max_depth': 130,
  'bootstrap': True},
 {'n_estimators': 800,
  'min_samples_leaf': 13,
  'min_impurity_decrease': 0.0,
  'max_depth': 100,
  'bootstrap': True},
 {'n_estimators': 650,
  'min_samples_leaf': 13,
  'min_impurity_decrease': 1.5,
  'max_depth': 15,
  'bootstrap': True},
 {'n_estimators': 650,
  'min_samples_leaf': 11,
  'min_impurity_decrease': 1.5,
  'max_depth': 15,
  'bootstrap': True},
 {'n_estimators': 800,
  'min_samples_leaf': 5,
  'min_impurity_decrease': 3.0,
  'max_depth': 110,
  'bootstrap': False},
 {'n_estimators': 700

In [17]:
# CV test scores ("test" here means the held-out fold supplied by cross-validation, not the actual test data), separated by split, for each hyperparameter setting tried.
[rf_out[3]['split0_test_score'], rf_out[3]['split1_test_score'], rf_out[3]['split2_test_score'], rf_out[3]['split3_test_score'], rf_out[3]['split4_test_score']]  

[array([0.75909719, 0.75909719, 0.75909719, 0.75909719, 0.8377092 ,
        0.75909719, 0.75909719, 0.75909719, 0.75909719, 0.75909719]),
 array([0.75921376, 0.75921376, 0.75921376, 0.75921376, 0.84029484,
        0.75921376, 0.75921376, 0.75921376, 0.75921376, 0.75921376]),
 array([0.75921376, 0.75921376, 0.75921376, 0.75921376, 0.84428747,
        0.75921376, 0.75921376, 0.75921376, 0.75921376, 0.75921376]),
 array([0.75921376, 0.75921376, 0.75921376, 0.75921376, 0.84659091,
        0.75921376, 0.75921376, 0.75921376, 0.75921376, 0.75921376]),
 array([0.75921376, 0.75921376, 0.75921376, 0.75921376, 0.84428747,
        0.75921376, 0.75921376, 0.75921376, 0.75921376, 0.75921376])]

## Support Vector Machines with Gaussian Kernel

In [18]:
default_kernel = 'rbf'
default_gamma = 'scale'
default_c = 1.0

In [19]:
svm_out = determine_SVM_hp(X_train, y_train, X_test, y_test, default_kernel, default_gamma, default_c)

Fitting 5 folds for each of 10 candidates, totalling 50 fits
[CV] END ...................C=2.5, gamma=auto, kernel=linear; total time=  46.7s




[CV] END ...................C=2.5, gamma=auto, kernel=linear; total time=  47.0s




[CV] END ..................C=5.0, gamma=auto, kernel=sigmoid; total time=  23.7s
[CV] END ..................C=5.0, gamma=auto, kernel=sigmoid; total time=  23.1s



In [20]:
# Test score with default values
svm_out[0]

0.8505620047908605

In [21]:
# Best hyperparameters based on randomized search with 5-fold cross-validation
svm_out[1]

{'kernel': 'linear', 'gamma': 'auto', 'C': 2.5}

In [22]:
# Test score with the best hyperparameters. Note: this score comes from a single run of the randomized search and may even be below the default score. My final values came from many runs of this search plus some manual tuning.
svm_out[2]

0.8497021067501996

In [23]:
# Hyperparameter settings tried during the randomized search
svm_out[3]['params']

[{'kernel': 'linear', 'gamma': 'auto', 'C': 2.5},
 {'kernel': 'sigmoid', 'gamma': 'auto', 'C': 5.0},
 {'kernel': 'sigmoid', 'gamma': 'auto', 'C': 1.5},
 {'kernel': 'sigmoid', 'gamma': 'auto', 'C': 1.0},
 {'kernel': 'poly', 'gamma': 'auto', 'C': 8.0},
 {'kernel': 'sigmoid', 'gamma': 'auto', 'C': 5.5},
 {'kernel': 'rbf', 'gamma': 'auto', 'C': 9.5},
 {'kernel': 'poly', 'gamma': 'auto', 'C': 0.5},
 {'kernel': 'rbf', 'gamma': 'scale', 'C': 8.0},
 {'kernel': 'rbf', 'gamma': 'auto', 'C': 5.5}]

In [24]:
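# CV test scores ("test" here means the held-out fold supplied by cross-validation, not the actual test data), separated by split, for each hyperparameter setting tried.
# Note: the second array in the output recorded below was produced with rf_out[3]['split1_test_score'] by mistake, so it shows the Random Forest split-1 scores rather than the SVM ones.
[svm_out[3]['split0_test_score'], svm_out[3]['split1_test_score'], svm_out[3]['split2_test_score'], svm_out[3]['split3_test_score'], svm_out[3]['split4_test_score']]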

[array([0.84461846, 0.84277599, 0.84246891, 0.84185475, 0.81590665,
        0.84262245, 0.84231537, 0.75909719, 0.83847689, 0.84308306]),
 array([0.75921376, 0.75921376, 0.75921376, 0.75921376, 0.84029484,
        0.75921376, 0.75921376, 0.75921376, 0.75921376, 0.75921376]),
 array([0.84659091, 0.84613022, 0.84382678, 0.84490172, 0.82263514,
        0.84551597, 0.84843366, 0.75921376, 0.84536241, 0.84643735]),
 array([0.84874079, 0.84843366, 0.84797297, 0.84751229, 0.82478501,
        0.84812654, 0.85165848, 0.75921376, 0.84843366, 0.84996929]),
 array([0.84874079, 0.84904791, 0.84643735, 0.84536241, 0.82048526,
        0.84904791, 0.84689803, 0.75921376, 0.84689803, 0.84735872])]

## Questions
### 1. A brief description of each algorithm and how it works.

**Boosted Decision Trees (specifically XGBoost)**
are an ensemble method built on decision trees that can be used for both regression and classification. Boosting is a method for producing an accurate classifier by combining weak learners, each of which is only slightly better than a random guess. XGBoost uses gradient boosting, which builds the ensemble in stages: at each stage we introduce a new weak learner (a shallow decision tree; a decision stump, whose output depends on a single feature, is the simplest case) to compensate for the errors of the weak learners obtained so far. Each update step aims to reduce a loss function, which amounts to gradient descent on the ensemble's predictions.
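The stage-wise idea can be sketched in a few lines of pure Python. This is a toy illustration under simplifying assumptions (one-dimensional inputs, squared loss, decision stumps as weak learners), not how XGBoost is implemented; the names `fit_stump` and `boost` are made up for the example.

```python
def fit_stump(xs, residuals):
    """Find the single-threshold stump that best fits the residuals (least squares)."""
    best = None
    for threshold in sorted(set(xs))[:-1]:  # skip the max so the right side is never empty
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = (sum((r - left_mean) ** 2 for r in left)
               + sum((r - right_mean) ** 2 for r in right))
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    return lambda x: left_mean if x <= threshold else right_mean


def boost(xs, ys, n_rounds=20, learning_rate=0.5):
    """Add one stump per round, each fitted to the current residuals."""
    learners = []
    predictions = [0.0] * len(xs)
    for _ in range(n_rounds):
        # For squared loss, the negative gradient is simply the residual y - F(x).
        residuals = [y - p for y, p in zip(ys, predictions)]
        stump = fit_stump(xs, residuals)
        learners.append(stump)
        predictions = [p + learning_rate * stump(x) for p, x in zip(predictions, xs)]
    return lambda x: sum(learning_rate * stump(x) for stump in learners)


xs = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
ys = [-1.0, -1.0, -1.0, 1.0, 1.0, 1.0]  # labels in {-1, +1}
model = boost(xs, ys)
print(model(0.5), model(4.5))  # the sign of the output gives the predicted class
```

Each round shrinks the remaining residual by the learning rate, which is why a smaller learning rate usually needs more boosting rounds.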

**Random Forests**
are similar to bootstrap aggregating (bagging) with decision trees: we sample with replacement from our dataset, construct a decision tree for each sample, and combine their outputs in some way (majority voting for classification). Random forests improve on this by decorrelating the decision trees. In pure bagging, where every tree considers the same features, a few very important features can cause very similar trees to form, which undermines our efforts to average high-variance models. Random forests deal with this problem by restricting each split to a random subset of the features, so that the individual trees differ more from one another. This in turn makes the average of our trees less variable and more reliable.
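The three ingredients described above (bootstrap sampling, a random feature subset, and majority voting) can be sketched in pure Python. All function names here are hypothetical, and choosing 11 candidate features (roughly the square root of 123) is an assumption that mirrors a common default rather than a setting taken from this report.

```python
import random
from collections import Counter


def bootstrap_sample(n_samples, rng):
    """Row indices drawn with replacement (the bagging step)."""
    return [rng.randrange(n_samples) for _ in range(n_samples)]


def random_feature_subset(n_features, max_features, rng):
    """Candidate features for one split, drawn without replacement."""
    return sorted(rng.sample(range(n_features), max_features))


def majority_vote(votes):
    """Combine the per-tree class predictions into one prediction."""
    return Counter(votes).most_common(1)[0][0]


rng = random.Random(0)
rows = bootstrap_sample(8, rng)  # some rows repeat, some are left out
features = random_feature_subset(n_features=123, max_features=11, rng=rng)
prediction = majority_vote([1, -1, 1, 1, -1])  # votes from five hypothetical trees
print(rows, features, prediction)
```

Because each tree sees different rows and each split sees different features, the trees make less correlated errors, which is what makes the vote more stable than any single tree.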

**Support Vector Machines with Gaussian Kernel**
are similar to the perceptron in that we are looking for a linear separator for our data. The key difference is that an SVM maximizes a margin (hard or soft) around that separator. The key idea of the algorithm we are running here is the Gaussian kernel, which relies on the kernel trick: the data is implicitly mapped into a higher-dimensional space (often called the feature space), where it is more likely to be linearly separable, and the problem is solved in that space using only kernel evaluations between pairs of points, without ever computing the mapping explicitly.
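For reference, the Gaussian (RBF) kernel between two points is exp(-gamma * ||x - z||^2). A minimal pure-Python sketch follows; the function name `rbf_kernel` and the value of gamma are chosen only for illustration.

```python
import math


def rbf_kernel(x, z, gamma):
    """Gaussian kernel: similarity that decays with squared Euclidean distance."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-gamma * squared_distance)


same = rbf_kernel([1.0, 0.0], [1.0, 0.0], gamma=0.5)  # identical points
far = rbf_kernel([0.0, 0.0], [1.0, 1.0], gamma=0.5)   # squared distance is 2
print(same, far)
```

The SVM only ever needs these pairwise values, so the high-dimensional feature space is never constructed. A larger gamma makes the similarity fall off faster, giving a more flexible decision boundary.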

### 2. Description of your training methodology, with enough details so that another machine learning enthusiast can reproduce your results. You need to submit all the codes (python and Jupyter notebooks) to reproduce your code. Please use prefix Problem2*.py where you need to replace * with the name of non-linear classifier for your coding files.

The main idea of my training methodology is to give each algorithm a wide range of values for each hyperparameter and then use a randomized cross-validated search (run multiple times) to find parameters that are (ideally) significantly better than the defaults.

I define 3 separate functions in 3 separate files, one for each algorithm. They are all located in the working directory of this Jupyter notebook. The notebook and every Python file are meant to be runnable without any adjustments, so reproducing my results should be simple. I don't set a fixed random state, so reruns will not reproduce my numbers exactly, but they should be very similar.

Each function first trains the classifier on our training data with default parameters and no cross-validation, and then scores that model on the test data. This gives us a baseline we want to improve. Each function then defines a grid of possible values for each hyperparameter. Since a cross-validated grid search over every parameter combination would take many hours to complete, I opted for a random search through the grid with 5-fold cross-validation. Note that this step does not use our test data in any way.

Each function then takes the best parameters from this process, trains the classifier with them on our training data, and scores it on our test data. We can then compare this to the default score and see whether the hyperparameter changes brought any improvement.

Each function returns a tuple: (test score with default parameters, best hyperparameters found, test score of the classifier trained with those best hyperparameters, cross-validation results).

You can also increase the number of iterations of the random search to find better values.

My approach to getting the final values I landed on was a combination of running many randomized searches over different parameter grids to find which parameters improved accuracy, followed by some targeted testing of individual parameters to fine-tune the values.
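The workflow described in this section can be sketched as follows for the random forest case. This is a reconstruction from the description above, not the contents of Problem2RandomForests.py: the function name `tune_random_forest`, the small grid, and the synthetic data are all assumptions made so that the example runs quickly on its own. It also passes a `random_state`, which the actual functions do not.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV


def tune_random_forest(X_train, y_train, X_test, y_test, param_grid,
                       n_iter=10, random_state=None):
    """Hypothetical stand-in for determine_random_forest_hp."""
    # 1) Baseline: default hyperparameters, no cross-validation, scored on the test set.
    baseline = RandomForestClassifier(random_state=random_state).fit(X_train, y_train)
    default_test_score = baseline.score(X_test, y_test)

    # 2) Randomized search with 5-fold CV; only the training data is used here.
    search = RandomizedSearchCV(
        RandomForestClassifier(random_state=random_state),
        param_distributions=param_grid,
        n_iter=n_iter,
        cv=5,
        random_state=random_state,
    )
    search.fit(X_train, y_train)

    # 3) Retrain with the best parameters found and score on the test set.
    best = RandomForestClassifier(random_state=random_state, **search.best_params_)
    best.fit(X_train, y_train)
    best_test_score = best.score(X_test, y_test)

    return default_test_score, search.best_params_, best_test_score, search.cv_results_


# Small synthetic problem standing in for a9a (an assumption for illustration).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = (X[:, 0] + X[:, 1] > 0).astype(int)
grid = {"n_estimators": [10, 20], "min_samples_leaf": [1, 4, 8]}
out = tune_random_forest(X[:150], y[:150], X[150:], y[150:], grid,
                         n_iter=3, random_state=0)
print(out[0], out[1], out[2])
```

Passing a fixed `random_state` makes reruns repeatable, and raising `n_iter` lets the search sample more of the grid at the cost of run time.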

### 3. The list of hyperparameters and brief description of each hyperparameter you tuned in training, their default values, and the final hyperparameter settings you use to get the best result.

**Boosted Decision Trees (XGBoost)**
- n_estimators: Number of boosting rounds (Default: 100, Final: 100)
- max_depth: Maximum tree depth for base learners (Default: None, Final: None)
- lambda: L2 regularization term on weights, passed as reg_lambda (Default: None, Final: 3.0)
- learning_rate: Step-size shrinkage applied to each new tree's contribution (Default: None, Final: None)
- missing: Value in the data that should be treated as a missing value (Default: np.nan, Final: np.nan)
- objective: Specifies the learning objective (Default: binary:logistic, Final: binary:logistic)

**Random Forests**
- n_estimators: Number of trees in the forest (Default: 100, Final: 650)
- bootstrap: Whether bootstrap samples (new datasets created by sampling with replacement from the existing data) are used when building trees (Default: True, Final: True)
- max_depth: The maximum depth of each tree (Default: None, Final: None)
- min_impurity_decrease: A node will be split if the split induces a decrease of the impurity greater than or equal to this value (Default: 0.0, Final: 0.0)
- min_samples_leaf: The minimum number of samples required to be at a leaf node (Default: 1, Final: 4)

**Support Vector Machines with Gaussian Kernel**
- kernel: Specifies the kernel type to be used in the algorithm (Default: 'rbf', Final: 'rbf')
- gamma: Kernel coefficient for 'rbf', 'poly' and 'sigmoid' (Default: 'scale', Final: 'scale')
- C: Regularization parameter (Default: 1.0, Final: 2.0)
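The final settings listed above can be collected in one place as plain dictionaries. This is a sketch rather than the code used for the report: `FINAL_PARAMS` and `build_final_classifiers` are hypothetical names, and the XGBoost entries listed as None are simply omitted on the assumption that the library then falls back to its own defaults.

```python
# Final hyperparameters as reported in this section.
FINAL_PARAMS = {
    "xgboost": {
        "n_estimators": 100,
        "reg_lambda": 3.0,
        "objective": "binary:logistic",
        # max_depth and learning_rate were left at None, so they are not set here.
    },
    "random_forest": {
        "n_estimators": 650,
        "bootstrap": True,
        "max_depth": None,
        "min_impurity_decrease": 0.0,
        "min_samples_leaf": 4,
    },
    "svm": {"kernel": "rbf", "gamma": "scale", "C": 2.0},
}


def build_final_classifiers():
    """Instantiate the three classifiers; imports are local so the dict is usable alone."""
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.svm import SVC
    from xgboost import XGBClassifier

    return {
        "xgboost": XGBClassifier(**FINAL_PARAMS["xgboost"]),
        "random_forest": RandomForestClassifier(**FINAL_PARAMS["random_forest"]),
        "svm": SVC(**FINAL_PARAMS["svm"]),
    }


print(sorted(FINAL_PARAMS))
```

Each classifier returned by `build_final_classifiers()` would then be fitted on the training data and scored once on the test data.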

### 4. Training error rates, hold-out or cross-validation error rates, and test error rates for your final classifiers. You are also encouraged to report other settings you tried with the accuracy it achieved (please make a table with a column with each hyperparameter and accuracy of configuration of parameters).

**Boosted Decision Tree (specifically XGBoost)**
- Final Parameters: (n_estimators=100 , max_depth=None, lambda=3.0 , learning_rate=None , missing=np.nan, objective=binary:logistic)
- Cross validation error rate = 0.15269788
- Test error rate = 0.148886

**Random Forest**
- Final Parameters: (n_estimators=650, bootstrap=True, max_depth=None, min_impurity_decrease=0.0, min_samples_leaf=4)
- Cross validation error rate = 0.15444848
- Test error rate = 0.149868

**Support Vector Machines with Gaussian Kernel**
- Final Parameters: (kernel=rbf, gamma=scale, C=2.0)
- Cross validation error rate = 0.15266721
- Test error rate = 0.149438
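The error rates above are one minus the corresponding accuracy. As a worked example of the arithmetic, the sketch below averages the five per-fold accuracies of the best XGBoost candidate (index 4) from the single randomized-search run displayed earlier in the notebook. Note that this candidate is not the final configuration reported above, so its cross-validation error differs slightly from the reported one. The helper name `error_rate` is made up for illustration.

```python
# Per-fold CV accuracies for candidate index 4 in the displayed XGBoost search run.
fold_accuracies = [0.84415784, 0.84428747, 0.84797297, 0.85288698, 0.84781941]


def error_rate(accuracy):
    """Error rate is the complement of accuracy."""
    return 1.0 - accuracy


cv_error = error_rate(sum(fold_accuracies) / len(fold_accuracies))

# The reported XGBoost test accuracy of 85.1114% corresponds to the listed test error.
xgb_test_error = error_rate(0.851114)

print(round(cv_error, 6), round(xgb_test_error, 6))
```

The same conversion applies to the scores returned by each tuning function, since scikit-learn's `score` method returns accuracy for classifiers.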

**Other settings I tried and the accuracy they achieved:**

![HP%20Table.png](attachment:HP%20Table.png)

### 5. Please do your best to obtain the best achievable accuracy for each classifier on given dataset. Note: The amount of effort you put on tuning the parameters will be determined based on the discrepancy between the accuracy you get and the best achievable accuracy on a9a data for each algorithm.

I saw marginal improvement with the hyperparameters I found, usually about 0.25 to 2 percentage points above the default. The highest accuracy I got across all classifiers was with XGBoost, where I achieved an accuracy of 85.1114%.