# **Random Forest**
<font color='grey' size='1.5'> Created by Parisa Hosseinzadeh for *Machine learning for proteins*, Spring 2022. 

Today, we will work on decision trees and random forests. The test case we will be using today is the *Pima Indians diabetes* set from [Kaggle](https://www.kaggle.com/datasets/kumargh/pimaindiansdiabetescsv?resource=download). 

Here is the description of the dataset from the website:

This dataset describes the medical records for Pima Indians
and whether or not each patient will have an onset of diabetes within five years.

Field descriptions follow:

- **preg** = Number of times pregnant
- **plas** = Plasma glucose concentration at 2 hours in an oral glucose tolerance test
- **pres** = Diastolic blood pressure (mm Hg)
- **skin** = Triceps skin fold thickness (mm)
- **test** = 2-Hour serum insulin (mu U/ml)
- **mass** = Body mass index (weight in kg/(height in m)^2)
- **pedi** = Diabetes pedigree function
- **age** = Age (years)
- **class** = Class variable (1:tested positive for diabetes, 0: tested negative for diabetes)

## Preparation

### Loading required modules

In [None]:
%matplotlib inline

import sklearn
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

### Loading and preparing the dataset

In [None]:
# defining features/column names
features = ['preg','plas','pres','skin',
            'test','mass','pedi','age',
            'class']
# loading the dataset
data = pd.read_csv('pima-indians-diabetes.csv', 
                   header=0,
                   names=features)
# viewing the top 5 rows
data.head()

In [None]:
# let's take a look at feature distribution
data.hist(bins=50, figsize=(20,15))

In [None]:
# let's see how many samples we have in each category
print ('class 0 =', len(data[data['class'] == 0]))
print ('class 1 =', len(data[data['class'] == 1]))

#### Q1. Time to exercise

1. Based on what you see in the distributions, what cleaning steps does this data need?
2. What type of train/test split will you use for this dataset, based on the number of samples in each class?


Using what you learned from [last lecture](https://colab.research.google.com/drive/1nxtav8c-I2Qav3OlISkZky8WJ1WWdD2F?usp=sharing), perform the following:

1. Create a stratified train/test split, with 10% of the data in the test set.
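
As a rough illustration of what a stratified split does, here is a minimal sketch that assumes scikit-learn's `train_test_split` with its `stratify` argument. The frame `toy` and its values are synthetic stand-ins for `data`, not the real dataset.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# synthetic stand-in for `data` (assumption: 130 negatives, 70 positives)
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    'plas': rng.normal(120, 30, size=200),
    'class': [0] * 130 + [1] * 70,
})

# stratify on the label so both splits keep the same class ratio
toy_train, toy_test = train_test_split(
    toy, test_size=0.1, stratify=toy['class'], random_state=42
)

print('train size:', len(toy_train), 'test size:', len(toy_test))
print('fraction of class 1 in train:', toy_train['class'].mean())
print('fraction of class 1 in test :', toy_test['class'].mean())
```

Both fractions should match the class ratio of the full frame, which is the point of stratifying when the classes are imbalanced.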

Note that scaling is not required for decision trees: each split compares a single feature against a threshold, so rescaling a feature only rescales the threshold and leaves the split unchanged.
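
To see this in practice, the sketch below fits scikit-learn's `DecisionTreeClassifier` on a small synthetic dataset and on the same data multiplied by 10, then compares predictions. The arrays `X_toy` and `y_toy` are made up for illustration, and the scaling factor is an arbitrary choice.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# synthetic data: three integer-valued features, label depends on two of them
rng = np.random.default_rng(0)
X_toy = rng.integers(0, 100, size=(200, 3)).astype(float)
y_toy = (X_toy[:, 0] + X_toy[:, 1] > 100).astype(int)

# same data with every feature multiplied by 10
X_scaled = X_toy * 10.0

tree_raw = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_toy, y_toy)
tree_scaled = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_scaled, y_toy)

# the two trees should label every training sample the same way
same = np.array_equal(tree_raw.predict(X_toy), tree_scaled.predict(X_scaled))
print('identical predictions:', same)
```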

In [None]:
# copy your data

# perform splitting
# call them train, test


## Decision Tree

Let's first start with building a decision tree. There is a [DecisionTreeClassifier in scikit](https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html) but today, we will write our own code to get a better sense of how things work. 

### Visualizing data and finding thresholds

To make the process easier, we will first generate two subsets of our dataset: those with diabetes and those without. We then visualize these subsets and choose our thresholds, as best we can, based on what we see.
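
If you want a numerical check on a threshold you picked by eye, one option is to scan a range of candidate values and score each one. The sketch below is only an illustration under assumptions: the feature values are synthetic, and scoring the rule "predict 1 if value > threshold" by plain accuracy is a simplification, not how scikit-learn chooses its splits.

```python
import numpy as np

rng = np.random.default_rng(0)
# hypothetical feature values for each class (not the real dataset)
values_0 = rng.normal(100, 10, size=200)
values_1 = rng.normal(140, 10, size=200)

values = np.concatenate([values_0, values_1])
toy_labels = np.concatenate([np.zeros(200, dtype=int), np.ones(200, dtype=int)])

# try evenly spaced candidate thresholds and score the rule
# "predict 1 if value > threshold" by its accuracy
candidates = np.linspace(values.min(), values.max(), 101)
accuracies = [np.mean((values > t).astype(int) == toy_labels) for t in candidates]

best_threshold = candidates[int(np.argmax(accuracies))]
print('best threshold:', round(float(best_threshold), 1))
print('accuracy at that threshold:', max(accuracies))
```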

In [None]:
# selecting sub-dataframes for each class
class_0 = train[train['class'] == 0]
class_1 = train[train['class'] == 1]
# sanity check: the two subsets should add up to the training set
print('class 0 has length ', len(class_0),
      '\n and class 1 has length', len(class_1),
      '\n and the training data is of size ', len(train))

In [None]:
# plotting distribution of features for class 0
class_0.hist(bins=50, figsize=(20,15))

In [None]:
# plotting distribution of features for class 1
class_1.hist(bins=50, figsize=(20,15))

### Writing a very simple decision tree model

Now, let's write a decision tree model. Each group has the choice of **up to 5** features to include for their decision tree. You will also need to define a threshold for each feature.

Remember, **class** is not a feature; it is the label.

#### Q2. Time to exercise

1. What features did your group pick?
2. What are the thresholds you're using?
3. Which feature will be the top of your tree?

Now let's write the code. Your code is simply a nested if statement. Let's take a quick look at an example of how this works.

Let's say you want to write code that classifies all the bold blue numbers from this list:

<font color='blue'> 1 **0** 1 **1** <font color='red'> 0 **1** 1 0

The pseudo-code looks something like this:

is it blue:
   if yes, is it bold:
      if yes, label "success"

The code then will look like this:

```
# example item: a bold blue number
color, fmt = 'blue', 'bold'

label = 0
if color == 'blue':
    if fmt == 'bold':
        label = 1
```
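
To see the nested check applied to every item, here is a small sketch that encodes the list above as `(digit, color, fmt)` tuples. The tuple layout and the names `items` and `item_labels` are illustrative choices, not part of the exercise.

```python
# each item is (digit, color, fmt), following the list above
items = [
    (1, 'blue', 'plain'), (0, 'blue', 'bold'),
    (1, 'blue', 'plain'), (1, 'blue', 'bold'),
    (0, 'red', 'plain'), (1, 'red', 'bold'),
    (1, 'red', 'plain'), (0, 'red', 'plain'),
]

item_labels = []
for digit, color, fmt in items:
    label = 0
    if color == 'blue':
        if fmt == 'bold':
            label = 1
    item_labels.append(label)

print(item_labels)  # [0, 1, 0, 1, 0, 0, 0, 0]
```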

*Question*: What would have happened if we started our classification by checking if it was bold first?

Now work with your group to write the code for a very simple decision tree. I have added some suggestions of how to start below.

In [None]:
# Create a list with the same length as
# your dataset, filled with 0

# write your nested if statements
# when you get to the innermost if,
# set the label at that location to 1
# you can iterate through a dataframe using
# for i,r in df.iterrows()
# where i is the index label and r is the row
# you can also access the row at position i
# by typing df.iloc[i] (use df.loc[i] for an index label)

# add the predictions as a new column to your dataframe

### Performance analysis

Now that you have a model, let's see how it is working. 

If you have time, try some other thresholds and see if you can improve your model's performance.

#### Q3. Time to exercise

In the cell below, write code to calculate the precision, recall, and accuracy of your model on your test set.

Report these numbers.
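
As a reminder of the definitions: precision = TP / (TP + FP), recall = TP / (TP + FN), and accuracy = (TP + TN) / (number of samples). The sketch below applies them to two made-up arrays, `y_true` and `y_pred`, which stand in for your labels and predictions; your own solution will work on the dataframe instead.

```python
import numpy as np

# made-up labels and predictions, for illustration only
y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0])
y_pred = np.array([1, 0, 0, 1, 0, 1, 1, 0])

# count each cell of the confusion matrix
tp = int(np.sum((y_pred == 1) & (y_true == 1)))
tn = int(np.sum((y_pred == 0) & (y_true == 0)))
fp = int(np.sum((y_pred == 1) & (y_true == 0)))
fn = int(np.sum((y_pred == 0) & (y_true == 1)))

precision = tp / (tp + fp)
recall = tp / (tp + fn)
accuracy = (tp + tn) / len(y_true)

print('TP, TN, FP, FN =', tp, tn, fp, fn)
print('precision =', precision, 'recall =', recall, 'accuracy =', accuracy)
```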

In [None]:
# Hint:
# TP = # of 1s in prediction that are 1 in data
# TN = # of 0s in prediction that are 0 in data
# FP = # of 1s in prediction that are 0 in data
# FN = # of 0s in prediction that are 1 in data

# you can select a subset from a dataframe with conditions using:
# new_df = df[(df['c1'] == df['c2']) & (...)]
# (use & rather than `and`, and wrap each condition in parentheses)


In [None]:
# Optional
# Check to see if you can draw the confusion matrix 
# using either matplotlib or seaborn


## Random Forest

Now that you have built some intuition for decision trees, let's work our way through testing a random forest.

### Voting

Let's see if voting helps with the performance of our decision trees using a small test set.
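
One simple way to combine several trees is a majority vote: each sample gets the label that more than half of the trees predict. The sketch below uses made-up predictions from three hypothetical trees, chosen so that each tree makes one mistake on a different sample.

```python
import numpy as np

# hypothetical predictions from three trees on five samples
tree_predictions = np.array([
    [1, 1, 1, 0, 1],   # tree A, wrong on sample 1
    [1, 0, 0, 0, 1],   # tree B, wrong on sample 2
    [0, 0, 1, 0, 1],   # tree C, wrong on sample 0
])
true_labels = np.array([1, 0, 1, 0, 1])

# a sample is labeled 1 when more than half of the trees vote 1
votes = tree_predictions.sum(axis=0)
majority = (votes > tree_predictions.shape[0] / 2).astype(int)

tree_accuracies = (tree_predictions == true_labels).mean(axis=1)
vote_accuracy = (majority == true_labels).mean()

print('per-tree accuracy:', tree_accuracies)
print('majority vote    :', majority, 'accuracy:', vote_accuracy)
```

In this toy case the vote fixes every error because the trees are wrong on different samples. If the trees made the same mistakes, voting would not help.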

In [None]:
# load the test set (same way you loaded data)

In [None]:
# predict class values using your thresholds

#### Q4. Time to exercise

1. What were your results?
2. What was the result after voting?
3. What is your conclusion?

### RandomForestClassifier

Now, let's use scikit's [RandomForestClassifier](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html) for properly classifying our data.

In [None]:
# preparing data to run
# all features without class labels
X = train[['preg','plas','pres','skin',
            'test','mass','pedi','age']]
# class labels
labels = train['class']

In [None]:
from sklearn.ensemble import RandomForestClassifier

# building the model
random_forest = RandomForestClassifier(
                      random_state= 42, # to make sure numbers are reproducible
                      bootstrap=True, # each tree trains on a bootstrap sample, which reduces correlation between trees
                      max_depth = 1, # maximum depth of each tree
                      n_estimators = 20, # number of trees
                      )

In [None]:
# training
random_forest.fit(X, labels)

In [None]:
# let's check which features were more important
feature_importances_df = pd.DataFrame(
    {"feature": list(X.columns), "importance": random_forest.feature_importances_}
).sort_values("importance", ascending=False)

# Display
feature_importances_df

#### Q5. Time to exercise

Now let's see how well your RF works on your test data. Report precision, recall, and accuracy.

In [None]:
# prepare test_X (and test_labels) similarly to how you prepared X (and labels)
test_X = 

In [None]:
# perform prediction
predictions = random_forest.predict(test_X)

In [None]:
# calculate precision, recall and accuracy

In [None]:
# FYI: you can also get accuracy using this
random_forest.score(test_X, test_labels)
# or this (compare the test labels, not the training labels, to the predictions)
from sklearn.metrics import accuracy_score
print("Accuracy:", accuracy_score(test_labels, predictions))

#### Q6. Time to exercise

Try changing `max_depth`, for example comparing a depth of 5 with a depth of 1. What do you see?

Try changing the number of estimators to 50 and 100. What do you see?

In [None]:
# place holder for new models

### Optional:

As we discussed, random forests use a process called *bagging*. For each tree, a subset of the input samples is therefore not involved in training. You can use those left-out samples as a test set for that tree and get an error called the [out-of-bag error](https://en.wikipedia.org/wiki/Out-of-bag_error). This way, you won't need to split the data into train/test.
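
To get a feel for how many samples a single tree leaves out, the sketch below draws one bootstrap sample with NumPy and counts the samples that were never picked. The sample size of 1000 is an arbitrary choice; for large datasets the left-out fraction approaches 1/e, or about 37%.

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples = 1000

# one bootstrap sample: draw n_samples indices with replacement
bootstrap_idx = rng.integers(0, n_samples, size=n_samples)

# samples that were never drawn are "out of bag" for this tree
in_bag = np.zeros(n_samples, dtype=bool)
in_bag[bootstrap_idx] = True
oob_fraction = 1 - in_bag.mean()

print('out-of-bag fraction:', round(float(oob_fraction), 3))
print('theoretical limit  :', round(float(np.exp(-1)), 3))
```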

The code below shows you how that is done.

In [None]:
# preparing data to run
# all features without class labels
X = data[['preg','plas','pres','skin',
            'test','mass','pedi','age']]
# class labels
labels = data['class']

In [None]:
# building the model
random_forest = RandomForestClassifier(
                      random_state= 42, 
                      bootstrap=True, 
                      oob_score=True,
                      )

In [None]:
random_forest.fit(X, labels)
# oob_score_ is the out-of-bag accuracy, so the error is 1 - oob_score_
oob_error = np.round(1 - random_forest.oob_score_, 2)
print('random forest with', len(random_forest.estimators_),
      'trees has an OOB error of:', oob_error)

## Sample answers to decision tree

Below you may find some of the ways you can code your decision tree.

In [None]:
# decision tree with 3 features
# f1, f2, f3 are the features
# t1, t2, t3 are the thresholds

#setting the predictions to be all 0
predictions = [0 for i in range(len(test))]

j = 0
#looping through all data to set those that are 1
for id,sample in test.iterrows():
  if sample['f1'] < t1:
    if sample['f2'] > t2:
      if sample['f3'] <= t3:
        predictions[j] = 1
  j += 1

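# adding predictions as a column to test
test['prediction'] = predictions

# to calculate accuracy:
# row positions where the prediction matches the label
df_true = np.where(
    test['class'] == test['prediction']
)
# row positions of the true positives
df_TP = np.where(
    (
        test['class'] == test['prediction']
     ) & (
        test['class'] == 1
    )
)
accuracy = len(df_true[0]) / len(test)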