In [4]:
# import the necessary packages
import warnings
warnings.filterwarnings('ignore')


import pandas as pd
import numpy as np
from plotnine import *


from sklearn.tree import DecisionTreeClassifier # Decision Tree

from sklearn import metrics
from sklearn.preprocessing import StandardScaler # z-score variables

from sklearn.model_selection import train_test_split # simple TT split cv
from sklearn.model_selection import KFold # k-fold cv
from sklearn.model_selection import LeaveOneOut # LOO cv
from sklearn.model_selection import cross_val_score # cross validation metrics
from sklearn.model_selection import cross_val_predict # cross validation metrics
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.metrics import ConfusionMatrixDisplay # plot_confusion_matrix was removed in scikit-learn 1.2

# 1. The Ensemble
A common theme in applied machine learning is the *ensemble method*. Ensemble methods combine multiple machine learning models (these can be the same type or entirely different algorithms). The idea is that ensembles improve predictive performance: even though each individual model is sometimes wrong, it's unlikely that a MAJORITY of the models in the ensemble will be wrong in the same way on the same data point. In aggregate, then, we usually get a more accurate model.
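To put a rough number on that intuition, here is a minimal sketch that computes the chance that a majority of models is correct. It assumes every model is right with the same probability and that the models make their errors *independently*, which real trees trained on overlapping data only approximate. The function name `majority_correct_prob` is made up for this illustration.

```python
from math import comb

def majority_correct_prob(n_models, p_correct):
    """Probability that more than half of n_models independent models are right."""
    needed = n_models // 2 + 1  # smallest number of correct votes that forms a majority
    return sum(
        comb(n_models, k) * p_correct**k * (1 - p_correct)**(n_models - k)
        for k in range(needed, n_models + 1)
    )

# assumption: each model alone is right 60% of the time
for n in [1, 3, 25, 101]:
    print(n, round(majority_correct_prob(n, 0.6), 3))
```

Under these assumptions the majority becomes more reliable as the number of models grows, which is the effect the ensemble is trying to exploit.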

Each model gets a "vote" about what category a data point should be in (ensemble methods also work for continuous outcomes, but here we'll focus on categorical ones). Whichever category gets the most "votes" is the category we choose for that data point. 
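The voting step can be sketched in a few lines. The `votes` array and the `majority_vote` helper below are invented for illustration (they are not part of the assignment); each row holds one model's predicted labels and each column is one data point.

```python
import numpy as np
import pandas as pd

# hypothetical predictions: one row per model, one column per data point
votes = np.array([
    ["cat", "dog", "dog", "cat"],
    ["cat", "cat", "dog", "dog"],
    ["dog", "cat", "dog", "cat"],
])

def majority_vote(votes):
    # mode(axis=0) finds the most frequent label in each column;
    # if there is a tie, the first row of the result holds the first sorted mode
    return pd.DataFrame(votes).mode(axis=0).iloc[0].tolist()

ensemble_labels = majority_vote(votes)
print(ensemble_labels)  # ['cat', 'cat', 'dog', 'cat']
```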

To combat overfitting and reduce potential *over-reliance* on a small number of features, we can use the following two techniques when creating models for our ensemble:

* **Bagging (Bootstrap Aggregating)**: Instead of using all of our training data to train each model in our ensemble, we use **bootstrapping** to choose the data points each model will see.
    * **Bootstrapping** is when you randomly sample data points *with replacement*, meaning that a data point can be included in your bootstrapped sample *more* than once, OR not at all.
* **Random Feature Selection**: Instead of using all the available features/predictors in our dataset for every model, for each model we randomly choose a different subset of features to use when training. This helps our ensemble generalize, because it doesn't become overly reliant on one feature (since that feature might not appear in every model).
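Both techniques come down to drawing random indices. The sketch below uses a made-up dataset shape and NumPy's `default_rng` generator (an alternative to the `np.random.choice` call used elsewhere in this notebook) to show that a bootstrapped sample the same size as the data typically contains only about 63% of the distinct rows, while the feature draw never repeats a column.

```python
import numpy as np

rng = np.random.default_rng(0)  # seeded only so the sketch is repeatable
n_rows, n_cols = 1000, 20       # made-up dataset shape

# bagging: draw row indices WITH replacement
row_idx = rng.choice(n_rows, size=n_rows, replace=True)

# random feature selection: draw column indices WITHOUT replacement
col_idx = rng.choice(n_cols, size=5, replace=False)

unique_fraction = len(np.unique(row_idx)) / n_rows
print(unique_fraction)   # close to 1 - 1/e (about 0.63) in expectation
print(sorted(col_idx))   # five distinct column indices
```

The rows that a given model never sees differ from model to model, which is part of why the models in the ensemble end up disagreeing in useful ways.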

While ensemble methods take a lot of computational power (you're training MANY models instead of just one), in practice they're often really useful. An incredibly popular ensemble method is the **Random Forest**, which combines many decision trees with bagging and random feature selection to generate the ensemble.
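For reference, scikit-learn ships a ready-made version of this idea. The sketch below fits `RandomForestClassifier` on synthetic data from `make_classification` (the data and parameter values are arbitrary choices for illustration). One difference worth knowing: scikit-learn draws a fresh random subset of `max_features` features at every *split*, rather than once per tree as described above.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# synthetic data, used only for this illustration
X_demo, y_demo = make_classification(n_samples=500, n_features=20,
                                     n_informative=5, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo,
                                          test_size=0.2, random_state=0)

rf = RandomForestClassifier(
    n_estimators=100,  # number of trees in the ensemble
    max_depth=5,       # depth limit for every tree
    max_features=5,    # features considered at each split
    bootstrap=True,    # bagging: each tree sees a bootstrapped sample
    random_state=0,
)
rf.fit(X_tr, y_tr)

test_acc = rf.score(X_te, y_te)  # mean accuracy on the held-out data
print(len(rf.estimators_), test_acc)
```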

## 1.1 Building a Random Forest

Let's build a tiny random forest function of our own! Write a function `Forest()` that takes in 6 arguments:

* `n_samples` (**integer**): number of data points to draw (with replacement) for the bootstrapped sample used to train each decision tree.
* `n_features` (**integer**): number of randomly selected features from your data set to use when training each tree.
* `n_trees` (**integer**): how many decision trees to create for the ensemble.
* `max_depth` (**integer**): the max_depth for all of your trees.
* `X` (**data frame**): the *already* z-scored predictor data to be used.
* `y` (**data frame**): the outcome data to be used (`X` and `y` are the same length, and the $i^{th}$ element of `X` corresponds to the $i^{th}$ element of `y`)

The function should:

1. Use a for loop to create `n_trees` models and store them in a list called `forest` (yes! You can store fitted decision trees in a list!)
2. For each model, use bootstrapping to sample `n_samples` data points to train on. Remember that bootstrapping means sampling WITH replacement (hint: try using [`np.random.choice`](https://numpy.org/doc/stable/reference/random/generated/numpy.random.choice.html) to randomly select (*with replacement*) which row numbers/indices to use).
3. For each model, randomly select `n_features` features to use to train your model (hint: try using [`np.random.choice`](https://numpy.org/doc/stable/reference/random/generated/numpy.random.choice.html) to randomly select (*withOUT replacement*) which predictor indices to use).
4. For all models, make sure you set the `max_depth`.
5. For each model, train the model (no need to use model validation, and assume X is already z-scored).
6. Return a list (`forest`) of dictionaries that look like this (where `tree` is your trained model and `features_index` is an array of indices for the features/predictors you selected):
 ```{"tree": tree, "feats": features_index}```
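Steps 2 and 3 both end in a positional subset of the data. As a small sketch on a toy data frame (`toy_X`, `toy_y` and the sizes are invented here, not the assignment data), `.iloc` accepts an array of row positions and an array of column positions at the same time, and applying the same row positions to `y` keeps the outcomes aligned with the predictors.

```python
import numpy as np
import pandas as pd

np.random.seed(42)

# toy data: 5 rows, 4 predictors
toy_X = pd.DataFrame(np.arange(20).reshape(5, 4), columns=["a", "b", "c", "d"])
toy_y = pd.Series([0, 1, 0, 1, 1])

# row positions drawn WITH replacement (can exceed the number of rows)
rows = np.random.choice(toy_X.shape[0], 8, replace=True)
# column positions drawn WITHOUT replacement
cols = np.random.choice(toy_X.shape[1], 2, replace=False)

X_sub = toy_X.iloc[rows, cols]  # selected rows and selected columns
y_sub = toy_y.iloc[rows]        # same rows, so X_sub and y_sub stay aligned

print(X_sub.shape, len(y_sub))
```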

In [6]:
# I'll Just Leave this hint here...

## simple bootstrapping example of names dataframe 
np.random.seed(1234)

names = ["Alex", "Charlie", "Addison", "James", "Blake", "Greg", "Daniel", "Susan", "Erik", "Georgia", "Kayne",
         "Lydia", "Peter", "Jane", "Jasper", "Link", "Rhett", "John", "Miranda", "Luke", "Leia", "Janet", "Jung",
         "Anthony", "Mark", "Torrence", "Bonnie", "Rudy", "Lisa", "Bart", "Tina", "Marie"]

names_df = pd.DataFrame({"name": names, "age": np.random.randint(17,27, len(names))})
names_df

names_index = np.random.choice(range(0,len(names)), 15, replace = True)
names_boot = names_df.iloc[names_index]

# notice how Lisa shows up more than once?

names_boot

| | name | age |
|---|---|---|
| 14 | Jasper | 26 |
| 19 | Luke | 19 |
| 7 | Susan | 24 |
| 28 | Lisa | 20 |
| 10 | Kayne | 25 |
| 11 | Lydia | 17 |
| 14 | Jasper | 26 |
| 28 | Lisa | 20 |
| 17 | John | 17 |
| 23 | Anthony | 17 |


In [5]:
### YOUR CODE HERE ###
def Forest(X, y, n_samples = 1000, n_features = 5, n_trees = 100, max_depth = 5):
    forest = []
    
    # create models
    for i in range(0,n_trees):
        
        # 1. randomly bootstrap datapoints by selecting from X's row indices WITH replacement

        # 2. randomly choose features by selecting from X's column indices WITHOUT replacement
     
        # 3. subset X and y to only include the rows and features that were randomly selected above
        
        # add trained tree and feature/column indices (from 2 above) to the forest list
        #forest.append({"tree": tree, "feats": features_index})
        pass # placeholder so the empty loop is valid Python; remove once you add your code
    return(forest)
### /YOUR CODE HERE ###

In [14]:
def Forest(X, y, n_samples = 1000, n_features = 5, n_trees = 100, max_depth = 5):
    forest = []
    
    # create models
    for i in range(0,n_trees):
        
        # randomly bootstrap datapoints
        samples_index = np.random.choice(range(0,X.shape[0]), n_samples, replace = True)
        
        # randomly choose features
        if n_features >= X.shape[1]: # if they ask for more features than you have, use them all
            features_index = np.arange(X.shape[1]) # array, to match the type returned by np.random.choice
        else:
            features_index = np.random.choice(range(0,X.shape[1]), n_features, replace = False)
        
        # select only the rows and features that were randomly selected above
        X_bagged = X.iloc[samples_index, features_index]
        y_bagged = y.iloc[samples_index]
        
        tree = DecisionTreeClassifier(max_depth = max_depth)
        tree.fit(X_bagged,y_bagged)
        
        # add tree to forest
        forest.append({"tree": tree, "feats": features_index})
    return(forest)       

## 1.2 Use `Forest()`
Using `X_cols_df` and `y_df` (loaded in the cell below) as your training set, call `Forest()` to build an ensemble model.

In [11]:
X_cols_df = pd.read_csv("https://raw.githubusercontent.com/cmparlettpelleriti/CPSC392ParlettPelleriti/master/Data/X_cols_df.csv")
y_df = pd.read_csv("https://raw.githubusercontent.com/cmparlettpelleriti/CPSC392ParlettPelleriti/master/Data/y_df.csv")
X_cols_df.head()

Unnamed: 0.1,Unnamed: 0,X0,X1,X2,X3,X4,X5,X6,X7,X8,...,X240,X241,X242,X243,X244,X245,X246,X247,X248,X249
0,0,-0.546514,-1.563641,1.115903,0.952985,0.192271,0.973264,0.553697,1.170051,-0.454866,...,0.934221,-1.226211,1.37616,-1.456604,-1.236548,0.193204,-1.266364,0.338426,0.867695,0.688788
1,1,0.120423,-1.40524,0.970508,1.660086,-0.606927,1.416359,-0.514392,1.881836,-0.62507,...,0.965089,-1.143669,1.615641,-0.872585,-1.053896,0.009002,-1.679799,1.281591,0.761152,0.564718
2,2,0.594233,-1.655526,1.57843,1.440725,0.097003,-0.297129,0.1546,0.785563,-0.831346,...,1.202674,-1.198594,1.027653,-1.789076,-0.627827,-0.033458,-1.243939,1.105503,0.848282,1.169202
3,3,0.387366,-1.36314,1.026211,2.032231,-0.059808,0.45647,0.751682,2.23554,-0.490625,...,1.064223,-1.743916,1.163846,-1.939953,-0.263838,0.387277,-1.787271,1.297693,0.57183,1.263347
4,4,-0.027355,-1.657837,2.38326,1.375557,-0.35858,1.46994,-0.190193,1.236663,-0.730892,...,1.352485,-1.440584,1.437159,-1.355588,-0.708502,0.050739,-1.910396,1.089436,0.49845,0.767962


In [None]:
y_df.head()

In [None]:
### YOUR CODE HERE ###

my_forest = ### call Forest and create an ensemble.

### /YOUR CODE HERE ###

## 1.3 Comparing Ensemble to an Individual Model

- Use the `ForestPredictor()` function below (which takes in the ensemble created by `Forest()` and a data frame of predictors) to generate predictions for `X_cols_df2`, our *test* set.
- Use `ForestPredictor()` again to generate predictions for `X_cols_df`, our *train* set.
- Calculate the train and test accuracy of your ensemble.
- Calculate the accuracy for ONE of your ensemble models by using `oneModel = my_forest[0]` to grab the first model of your ensemble.

### 1.3.1
In this example, does an ensemble method do *better* (in terms of train accuracy) than an individual decision tree? Explain how you figured this out.

### 1.3.2
In this example, does an ensemble method do *better* (in terms of overfitting) than an individual decision tree? Compare train accuracy with test accuracy, using `X_cols_df2` and `y_df2` as the test set.
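One common way to check for overfitting is to compare accuracy on the training data with accuracy on held-out data. The sketch below does this for a single unrestricted decision tree on synthetic data (everything here is invented for illustration and is separate from the assignment data); a large gap between the two numbers suggests the model has memorized the training set.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# synthetic data, used only for this illustration
X_demo, y_demo = make_classification(n_samples=400, n_features=10, random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo,
                                          test_size=0.25, random_state=1)

# no max_depth, so the tree keeps splitting until its leaves are pure
deep_tree = DecisionTreeClassifier(random_state=1)
deep_tree.fit(X_tr, y_tr)

train_acc = accuracy_score(y_tr, deep_tree.predict(X_tr))
test_acc = accuracy_score(y_te, deep_tree.predict(X_te))
gap = train_acc - test_acc  # bigger gap = more overfitting

print(train_acc, test_acc, gap)
```

The same comparison applies to the ensemble: compute its accuracy on the train set and on the test set, then compare the two gaps.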

In [15]:
def ForestPredictor(forest, X):
    # takes in a list of dictionaries like this but longer
    # (each "tree" is a fitted DecisionTreeClassifier, each "feats" holds the
    # column indices that tree was trained on):
    # [
    #     {"tree": DecisionTreeClassifier(max_depth=5),
    #      "feats": array([ 63, 101,  39, 133, 137])},
    #     {"tree": DecisionTreeClassifier(max_depth=5),
    #      "feats": array([ 63, 101,  39, 133, 137])}
    # ]
    # use the X that was passed in (train OR test); overwriting it with
    # X_cols_df here would always return predictions for the training set
    ps = []

    # get predictions from each model
    for model in forest:
        tree = model["tree"]
        X_sub = X.iloc[:, model["feats"]]

        p = tree.predict(X_sub)
        ps.append(p)

    ps = pd.DataFrame(ps)
    
    # get ensemble prediction for each data point
    predictions = []
    
    for column_ind in range(0, ps.shape[1]):
        ensemble_predict = ps.iloc[:,column_ind]
        predictions.append(ensemble_predict.mode()[0])

    return(predictions)

In [10]:

### YOUR CODE HERE ###
# ForestPredictor() will take your ensemble and use it to find the predicted values for X_cols_df2
ensemble_predictions =  ### Call ForestPredictor using my_forest and X_cols_df2

### /YOUR CODE HERE ###

In [None]:
### YOUR CODE HERE ###

# calculate the accuracy for the ensemble


# calculate the accuracy for the first model
oneModel = my_forest[0]


### /YOUR CODE HERE ###

## 1.4 Comparing Ensemble to Individual Models

- Put the test accuracy from your ENSEMBLE model in the code below.
- Run the cell to see a histogram of the individual tree accuracies, with the ensemble accuracy shown as a dashed line.

### 1.4.1
Write down your thoughts about this graph. What patterns do you see between individual tree accuracies and ensemble accuracies?

In [3]:
### YOUR CODE HERE ###
ensemble_acc = 0.775 # put your ensemble accuracy here!

### /YOUR CODE HERE ###

allAcc = [accuracy_score(y_df2,my_forest[mod]["tree"].predict(X_cols_df2.iloc[:,my_forest[mod]["feats"]])) for mod in range(0,len(my_forest))]

df = pd.DataFrame({"acc": allAcc, "no": range(0,len(my_forest))})
(ggplot(df, aes(x = "acc")) +
 geom_histogram(color = "black", fill = "lightblue", binwidth = 0.025) +
 xlim([0,1]) + theme_minimal() + geom_vline(xintercept = ensemble_acc, linetype = "dashed", size = 3))



### 1.4.2
How does the difference between individual tree accuracies and ensemble accuracies change when you change the number of predictors used in each tree?

In [16]:
### YOUR CODE HERE ###
n_feat = 100
### /YOUR CODE HERE ###


my_forest2 = Forest(X_cols_df, y_df, n_features = n_feat)

# score the NEW ensemble (my_forest2) on the test set
ensemble_acc2 = accuracy_score(y_df2, ForestPredictor(my_forest2, X_cols_df2))

allAcc2 = [accuracy_score(y_df2,my_forest2[mod]["tree"].predict(X_cols_df2.iloc[:,my_forest2[mod]["feats"]])) for mod in range(0,len(my_forest2))]

df = pd.DataFrame({"acc": allAcc2, "no": range(0,len(my_forest2))})
(ggplot(df, aes(x = "acc")) +
 geom_histogram(color = "black", fill = "lightblue", binwidth = 0.025) +
 xlim([0,1]) + theme_minimal() + geom_vline(xintercept = ensemble_acc2, linetype = "dashed", size = 3))



You'll be graded on 1) the correctness of your code and 2) your answers to the questions.