# Starting Off

How does a random forest algorithm try to correct for the short comings of a decision tree?

Your answer should specifically mention the shortcomings of a decision tree and be explicit in how it overcomes that shortcoming. 

# Ensemble Methods Applied

Agenda:
- Review code for Voting Classifier, Bagging Classifier, and Random Forest
- Practice finding optimal hyperparameter for  Random Forest with gridsearch


## Import and Prep Titanic dataset

In [38]:
import pandas as pd
import numpy as np
from sklearn import metrics
from sklearn.ensemble import BaggingClassifier, RandomForestRegressor, ExtraTreesRegressor


In [39]:
# Read in data and split data to be used in the models
titanic = pd.read_csv('https://raw.githubusercontent.com/learn-co-students/nyc-ds-033020-lectures/master/Mod_3/decision_trees/cleaned_titanic.csv', index_col='PassengerId')



In [40]:
# Create matrix of features
X = titanic.drop('Survived', axis = 1) # grabs everything else but 'Survived'

# Create target variable
y = titanic['Survived'] # y is the column we're trying to predict

# Create a list of the features being used in the 
feature_cols = X.columns

In [41]:
# Use x and y variables to split the training data into train and test set then scale that data

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

scaler = StandardScaler()  
scaler.fit(X_train)

X_train = scaler.transform(X_train)  
X_test = scaler.transform(X_test)

## Fit a KNN model

In [42]:
from sklearn.neighbors import KNeighborsClassifier

In [43]:
knn = KNeighborsClassifier(n_neighbors=8)

In [44]:
knn.fit(X_train, y_train)

knn_preds = knn.predict(X_test)

knn_f1 = metrics.f1_score(y_test, knn_preds)


print(knn_f1)

0.7770700636942675


## Fit a Logistic Regression model 

In [8]:
from sklearn.linear_model import LogisticRegression

In [9]:
lr = LogisticRegression(class_weight='balanced')

In [10]:
lr.fit(X_train, y_train)

LogisticRegression(class_weight='balanced')

In [11]:
lr_preds = lr.predict(X_test)

lr_f1 = metrics.f1_score(y_test, lr_preds)

print(lr_f1)

0.8066298342541436


## Fit a Decision Tree Classifier

In [12]:
from sklearn.tree import DecisionTreeClassifier

In [13]:
dtc = DecisionTreeClassifier(max_depth=5, class_weight='balanced')

dtc.fit(X_train, y_train)

dtc_preds  = dtc.predict(X_test)

dtc_f1 = metrics.f1_score(y_test, dtc_preds)

print(dtc_f1)

0.8047337278106509


## Combine three models using Voting Classifier

In [14]:
from sklearn.ensemble import VotingClassifier


For the estimators, we must provide a list of tuples. The first value in the tuple is is the name given to the model/estimator in the second value. SKlearn requires this because there is additional functionality where you can access information about the specific models, so you need to name the models to access them later.  

In [15]:
voting_clf = VotingClassifier(
                estimators=[('logreg', lr), ('knneighbors', knn), ('decisiontree', dtc)], 
                voting='hard')

voting_clf.fit(X_train, y_train)

vc_preds = voting_clf.predict(X_test)

vc_f1 = metrics.f1_score(y_test, vc_preds)

print(vc_f1)

0.8255813953488372


### Use a voting classifier with multiple Logistic regression models 

In [16]:

C_param_range = [0.001,0.01,0.1,1,10]
titles = ['lr_0_001', 'lr_0_01', 'lr_0_1', 'lr_1', 'lr_10']

params = dict(zip(titles, C_param_range)) 
models = {}

table = pd.DataFrame(columns = ['C_parameter','F1'])
table['C_parameter'] = C_param_range
j = 0

for k , v  in params.items():
    
    # Create model using different value for c  
    lr = LogisticRegression(penalty = 'l2', C = v, random_state = 1, class_weight='balanced')
    
    #save the model to a dictionary to use later in our voting classifiers
    models[k]= lr
    
    #the steps below this point are unnecessary in order to create a voting classifier, 
    #but it is easy to fit the model and see how performance changes for different levels of regularization
    lr.fit(X_train, y_train)
    
    # Predict using model
    y_preds = lr.predict(X_test)

    # Saving accuracy score in table
    table.iloc[j,1] = metrics.f1_score(y_test, y_preds)
    j += 1



In [17]:
models

{'lr_0_001': LogisticRegression(C=0.001, class_weight='balanced', random_state=1),
 'lr_0_01': LogisticRegression(C=0.01, class_weight='balanced', random_state=1),
 'lr_0_1': LogisticRegression(C=0.1, class_weight='balanced', random_state=1),
 'lr_1': LogisticRegression(C=1, class_weight='balanced', random_state=1),
 'lr_10': LogisticRegression(C=10, class_weight='balanced', random_state=1)}

In [18]:
#review performance for different levels of C
table


Unnamed: 0,C_parameter,F1
0,0.001,0.735135
1,0.01,0.751381
2,0.1,0.804469
3,1.0,0.80663
4,10.0,0.80663


Now that we have programmatically created multiple logistic regression models, let's use them in an ensemble model

In [22]:
lr_voting = VotingClassifier(estimators=list(models.items()), 
                              voting='hard')

lr_voting.fit(X_train, y_train)

lrv_preds = voting_clf.predict(X_test)

lrv_f1 = metrics.f1_score(y_test, lrv_preds)

print(lrv_f1)

0.8255813953488372


## Fit a Bagging Classifier for a Logistic Regression model. 

In [28]:
bc_lr = BaggingClassifier(
            base_estimator=LogisticRegression(random_state = 1, class_weight='balanced'), 
            n_estimators= 100,
            max_samples= .7,
            max_features= 6,
            oob_score= True
                )

In [29]:
bc_lr.fit(X_train, y_train)



BaggingClassifier(base_estimator=LogisticRegression(C=1.0,
                                                    class_weight='balanced',
                                                    dual=False,
                                                    fit_intercept=True,
                                                    intercept_scaling=1,
                                                    l1_ratio=None, max_iter=100,
                                                    multi_class='auto',
                                                    n_jobs=None, penalty='l2',
                                                    random_state=1,
                                                    solver='lbfgs', tol=0.0001,
                                                    verbose=0,
                                                    warm_start=False),
                  bootstrap=True, bootstrap_features=False, max_features=6,
                  max_samples=0.7, n_estimators=100, n_jobs=None,

In [30]:
# Use the oob_score to get some idea of how the model performs on a validation set

bc_lr.oob_score_

0.7762762762762763

In [31]:
# See how the model performs on the test set

bc_lr_preds = bc_lr.predict(X_test)

bc_lr_f1 = metrics.f1_score(y_test, bc_lr_preds)

print(bc_lr_f1)

0.7826086956521741


***What is the difference in the `VotingClassifier` algorithm and the `BaggingClassifier` algorithm?***

Your answer: Bagging makes many random selections of training data where voting runs each model on all the training data

# Fitting a Random Forest Classifier

In [23]:
# Instantiate the classifier using 100 trees
from sklearn.ensemble import RandomForestClassifier
rfc = RandomForestClassifier(random_state = 1, n_estimators=100, max_depth=1, max_features=4)

In [24]:
#let's look at all the different default features
rfc

RandomForestClassifier(max_depth=1, max_features=4, random_state=1)

In [25]:
#fit the model to the training data
rfc.fit(X_train, y_train)

RandomForestClassifier(max_depth=1, max_features=4, random_state=1)

In [26]:
#use the fitted model to predict on the test data
rfc_preds = rfc.predict(X_test)

rfc_f1 = metrics.f1_score(y_test, rfc_preds)

# checking accuracy on the test data
print('Test F1 score: ', rfc_f1)

Test F1 score:  0.7074829931972789


***Increase the number of trees and see how the model performs***

### GridsearchCV with Random Forest

Let's use grid search to identify the best tuning parameters to use for a random forest model. 

In [56]:
from sklearn.model_selection import GridSearchCV

In [57]:
#create a dictionary of all the parameters you want to tune
param_grid = {
    'max_depth': [3, 4, 5],
    'max_features': [3, 4, 5, 6, 7],
    'max_leaf_nodes': [10, 100, 1000, 10000],
}

In [58]:
rnd_f = RandomForestClassifier(n_estimators=10)

In [59]:
#create a grid search object and fit it to the data

grid_tree = GridSearchCV(estimator=rnd_f, param_grid=param_grid,
                    n_jobs=-1, verbose=1, cv=3, scoring='f1')
grid_tree.fit(X_train, y_train)

Fitting 3 folds for each of 60 candidates, totalling 180 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers.
[Parallel(n_jobs=-1)]: Done  52 tasks      | elapsed:    0.6s
[Parallel(n_jobs=-1)]: Done 180 out of 180 | elapsed:    1.8s finished


GridSearchCV(cv=3, estimator=RandomForestClassifier(n_estimators=10), n_jobs=-1,
             param_grid={'max_depth': [3, 4, 5],
                         'max_features': [3, 4, 5, 6, 7],
                         'max_leaf_nodes': [10, 100, 1000, 10000]},
             scoring='f1', verbose=1)

In [60]:
grid_results = pd.DataFrame(grid_tree.cv_results_)[['mean_test_score', 'std_test_score', 'params']]
grid_results

Unnamed: 0,mean_test_score,std_test_score,params
0,0.703646,0.037216,"{'max_depth': 3, 'max_features': 3, 'max_leaf_..."
1,0.663249,0.033413,"{'max_depth': 3, 'max_features': 3, 'max_leaf_..."
2,0.687032,0.043245,"{'max_depth': 3, 'max_features': 3, 'max_leaf_..."
3,0.706554,0.018585,"{'max_depth': 3, 'max_features': 3, 'max_leaf_..."
4,0.694313,0.020183,"{'max_depth': 3, 'max_features': 4, 'max_leaf_..."
5,0.684691,0.047636,"{'max_depth': 3, 'max_features': 4, 'max_leaf_..."
6,0.709877,0.006211,"{'max_depth': 3, 'max_features': 4, 'max_leaf_..."
7,0.69822,0.003335,"{'max_depth': 3, 'max_features': 4, 'max_leaf_..."
8,0.693479,0.028483,"{'max_depth': 3, 'max_features': 5, 'max_leaf_..."
9,0.678071,0.048839,"{'max_depth': 3, 'max_features': 5, 'max_leaf_..."


In [61]:
### Identify the best params 


grid_results['mean_test_score'].sort_values()
#Identify the best score during fitting with cross-validation


1     0.663249
9     0.678071
10    0.680623
5     0.684691
2     0.687032
29    0.691001
35    0.691696
34    0.691713
16    0.691737
19    0.692688
23    0.692870
13    0.693199
8     0.693479
4     0.694313
48    0.697832
7     0.698220
14    0.699676
56    0.700116
15    0.700518
25    0.700830
12    0.702291
18    0.702979
0     0.703646
32    0.704815
17    0.704982
3     0.706554
24    0.706948
26    0.708183
36    0.708468
28    0.708499
44    0.708676
38    0.709471
6     0.709877
33    0.710567
50    0.712770
54    0.713506
39    0.713607
47    0.714288
37    0.714993
30    0.716355
57    0.716482
22    0.716619
45    0.717194
59    0.717654
31    0.717782
58    0.717839
49    0.718217
40    0.718584
11    0.719568
21    0.720899
51    0.723274
52    0.727223
27    0.728990
46    0.729040
43    0.730999
20    0.733271
55    0.734581
53    0.735333
42    0.735469
41    0.743515
Name: mean_test_score, dtype: float64

In [None]:
#predict on the test set


# checking accuracy


# checking accuracy
