# Random Forest
---

1.   **[Introduction to Random Forest](#1.-Introduction-to-Random-Forest)**
2.   **[Foundations of Random Forest](#2.-Foundations-of-Random-Forest)**
3.   **[Random Forest Tuning](#3.-Random-Forest-Tuning)**
4.   **[Model Validation](#4.-Model-Validation)**
5.   **[Exploratory Data Analysis](#5.-Exploratory-Data-Analysis)**
6.   **[Model Construction](#6.-Model-Construction)**
7.   **[Model Results & Evaluation](#7.-Model-Results-&-Evaluation)**

---
<a name="1.-Introduction-to-Random-Forest"></a>
### 1. Introduction to Random Forest

#### 1.1 Definitions

**Random Forest |** Ensemble of decision trees trained on bootstrapped data with randomly selected features

**Ensembling |** Ensemble learning refers to aggregating predictive model outputs to make a prediction rather than relying on an individual model

**Base Learner |** Each individual model that comprises an ensemble 

**Weak Learner |** A model that performs only slightly better than random guessing

**Bootstrapping |** Sampling with replacement: a random item is selected from the dataset and returned to it before the next selection, so the same item can be selected multiple times.

**Bagging |** Bootstrapping + Aggregating 
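
The bootstrapping definition above can be illustrated with a minimal sketch. The ten-item toy dataset and the fixed seed are assumptions made only for illustration. Items that are never drawn are commonly called "out-of-bag" for that sample.

```python
import numpy as np

rng = np.random.default_rng(0)   # seeded only so the sketch is repeatable
data = np.arange(10)             # hypothetical toy dataset of 10 items

# Sampling with replacement: the same item can be drawn more than once
bootstrap_sample = rng.choice(data, size=len(data), replace=True)

# Items that were never drawn are "out-of-bag" for this bootstrap sample
out_of_bag = np.setdiff1d(data, bootstrap_sample)

print("bootstrap sample:", bootstrap_sample)
print("out-of-bag items:", out_of_bag)
```

In bagging, each base learner is trained on its own bootstrap sample, and the learners' predictions are then aggregated.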



---
<a name="2.-Foundations-of-Random-Forest"></a>
### 2. Foundations of Random Forest

#### 2.1 Concept
If you build a bagging ensemble of decision trees but take it one step further by randomizing the features used to train each base learner, the result is called a random forest.

Random forest models leverage randomness to reduce the likelihood that a given base learner will make the same mistakes as other base learners. When the learners' mistakes are largely uncorrelated, aggregating their predictions cancels out many individual errors, which primarily reduces the variance of the ensemble. In bagging, this randomization occurs by training each base learner on a sample of the observations, drawn with replacement.

#### 2.2 Sampling effect on predictions
The effect of all this sampling is that each base learner sees only a fraction of the available data. Model scores can still improve with sampling, and training often takes significantly less time, since each tree is built from less data.

Random forest builds on bagging, taking randomization even further by using only a fraction of the available features to train its base learners. This randomization from sampling often leads to both better performance scores and faster execution times, making random forest a powerful and relatively simple tool in the hands of any data professional.
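
The idea can be sketched by hand: bootstrap the rows, let each tree consider only a random subset of features, and aggregate by majority vote. This is a simplified illustration, not how `RandomForestClassifier` is implemented internally. The synthetic data from `make_classification`, the 25-tree ensemble, and all variable names are assumptions for the example.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary classification data (assumption: stands in for a real dataset)
X_toy, y_toy = make_classification(n_samples=500, n_features=10,
                                   n_informative=5, random_state=0)
X_fit, X_hold, y_fit, y_hold = train_test_split(X_toy, y_toy,
                                                test_size=0.25, random_state=0)

rng = np.random.default_rng(0)
n_trees = 25  # odd, so a majority vote on two classes cannot tie
trees = []
for i in range(n_trees):
    # Bagging step: bootstrap the training rows (sample with replacement)
    rows = rng.integers(0, len(X_fit), size=len(X_fit))
    # Feature randomization: each split considers sqrt(n_features) random features
    tree = DecisionTreeClassifier(max_features="sqrt", random_state=i)
    tree.fit(X_fit[rows], y_fit[rows])
    trees.append(tree)

# Aggregating step: majority vote across the base learners
votes = np.stack([t.predict(X_hold) for t in trees])   # shape (n_trees, n_holdout)
ensemble_pred = (votes.mean(axis=0) >= 0.5).astype(int)

ensemble_acc = (ensemble_pred == y_hold).mean()
single_acc = (trees[0].predict(X_hold) == y_hold).mean()
print(f"single tree accuracy: {single_acc:.3f}")
print(f"ensemble accuracy:    {ensemble_acc:.3f}")
```

Comparing the two printed accuracies gives a rough sense of what aggregation contributes on this particular toy dataset; the result will vary with the data and the seed.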

---
<a name="3.-Random-Forest-Tuning"></a>
### 3. Random Forest Tuning

Random forest tuning works like decision tree tuning: it is the process of adjusting hyperparameters to control the structure of the trees, with the aim of improving overall performance on a given task. A random forest adds hyperparameters of its own that control the ensemble, such as `n_estimators` and `max_samples`.

**Most Important Hyperparameters for Random Forest Tuning:**

| Hyperparameter | What it does | Input type | Default Value | Considerations |
| --- | --- | --- | --- | --- |
| `n_estimators` | Specifies the number of trees the model will build in its ensemble | int | 100 | A typical range is 50–500. Consider how much data, how deep the trees are allowed to grow and how many samples are bootstrapped to grow each tree (generally need more trees if they’re shallow, and more trees if your bootstrap sample size is smaller). Also consider if use case has latency requirements. |
| `max_depth` | Specifies the maximum number of levels each tree can have. If None, trees grow until all leaves are pure or until all leaves contain fewer than min_samples_split samples. | int | None | Random forest models often use base learners that are fully grown. Restricting tree depth can reduce training time and prediction latency and help prevent overfitting. If not None, consider values 3–8. |
| `min_samples_split` | Sets the minimum number of samples a node must contain to be split; nodes with fewer samples become leaves. If float, then it represents a fraction (0–1] of max_samples. | int or float | 2 | Consider (a) how many samples are in the dataset, and (b) how much of that data each base learner is allowed to use (i.e., the value of the max_samples hyperparameter). The fewer samples available, the smaller this threshold may need to be (otherwise the tree would be very shallow). |
| `min_samples_leaf` | A split can only occur if it guarantees a minimum of this number of observations in each resulting node. If float, then it represents a percentage (0–1] of max_samples. | int or float | 1 | Consider (a) how many samples are in your dataset, and (b) how much of that data you're allowing each base learner to use (i.e., the value of the max_samples hyperparameter). The fewer samples available, the lesser the number of samples may need to be allowed  in each leaf node (otherwise the tree would be very shallow). |
| `max_features` | Specifies the number of features that each tree randomly selects during training. - If int, then consider max_features features at each split. - If float, then max_features is a fraction and round (max_features * n_features) features are considered at each split. - If “sqrt”, then max_features=sqrt(n_features). - If “log2”, then max_features=log2(n_features). - If None, then max_features=n_features. | {“sqrt”, “log2”, None}, int or float. | “sqrt” | Consider how many features the dataset has and how many trees will be grown. Fewer features sampled during each bootstrap means more base learners would be needed. Small max_features values on datasets with many features mean more unpredictive trees in the ensemble. |
| `max_samples` | Specifies the number of samples bootstrapped from the dataset to train each base model. If float, then it represents a fraction (0–1] of the dataset. If None, then draw X.shape[0] samples. | int or float | None | Consider the size of your dataset. When working with large datasets, it can be beneficial to limit the number of samples in each tree, because doing so can greatly reduce training time and yet still result in a robust model. For example, 20% of 1 billion may be enough to capture patterns in the data, but if you have 1,000 samples then you’ll probably need to use them all. |
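
As a rough illustration of how the hyperparameters in the table are passed to scikit-learn, the sketch below sets each one explicitly on a synthetic dataset. The specific values are arbitrary assumptions chosen for the example, not recommendations.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data (assumption: stands in for a real dataset)
X_demo, y_demo = make_classification(n_samples=1000, n_features=20, random_state=0)

rf_demo = RandomForestClassifier(
    n_estimators=200,       # number of trees in the ensemble
    max_depth=8,            # cap on the depth of each tree
    min_samples_split=10,   # a node needs at least 10 samples to be split
    min_samples_leaf=5,     # every leaf keeps at least 5 samples
    max_features="sqrt",    # features considered at each split
    max_samples=0.5,        # each tree is trained on a bootstrap of 50% of the rows
    random_state=0,
)
rf_demo.fit(X_demo, y_demo)

# Inspect the fitted ensemble: number of trees and the deepest tree actually grown
depths = [est.get_depth() for est in rf_demo.estimators_]
print("trees:", len(rf_demo.estimators_), "| max depth reached:", max(depths))
```

Note that `max_samples` only takes effect when `bootstrap=True`, which is the default.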

---
<a name="4.-Model-Validation"></a>
### 4. Model Validation

Model validation is the set of processes and activities intended to verify that models are performing as expected. Model validation refers to the whole process of evaluating different models, selecting one, and then continuing to analyze the performance of the selected model to better understand its strengths and limitations. 

#### 4.1 Validation Sets 
The simplest way to maintain the objectivity of the test data is to create another partition in the data—a validation set—and save the test data for after you select the final model. The validation set is then used, instead of the test set, to compare different models.

When building a model using a separate validation set, once the final model is selected, best practice is to go back and fit the selected model to all the non-test data (i.e., the training data + validation data) before scoring this final model on the test data.
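
The workflow described above can be sketched as follows: compare candidates on the validation set, then refit the champion on all non-test data before scoring it once on the test set. The synthetic dataset, the two candidate models, and the 60/20/20 split are assumptions for illustration.

```python
from sklearn.base import clone
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X_all, y_all = make_classification(n_samples=1000, n_features=10, random_state=0)

# 20% test, then 25% of the remainder as validation: 60/20/20 overall
X_rest, X_te, y_rest, y_te = train_test_split(
    X_all, y_all, test_size=0.2, stratify=y_all, random_state=0)
X_trn, X_vl, y_trn, y_vl = train_test_split(
    X_rest, y_rest, test_size=0.25, stratify=y_rest, random_state=0)

# Hypothetical candidate models to compare
candidates = {
    "decision_tree": DecisionTreeClassifier(max_depth=5, random_state=0),
    "random_forest": RandomForestClassifier(n_estimators=100, random_state=0),
}

# Compare candidates on the validation set only; the test set stays untouched
val_scores = {}
for name, model in candidates.items():
    model.fit(X_trn, y_trn)
    val_scores[name] = f1_score(y_vl, model.predict(X_vl))

# Select the champion, refit it on train + validation, then score once on test
champion_name = max(val_scores, key=val_scores.get)
champion = clone(candidates[champion_name]).fit(X_rest, y_rest)
test_f1 = f1_score(y_te, champion.predict(X_te))

print("validation F1:", val_scores)
print("champion:", champion_name, "| test F1:", round(test_f1, 3))
```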


#### 4.2 Cross Validation 
Cross Validation is a process that uses different portions of the data to test and train a model on different iterations.  

Cross-validation makes more efficient use of the training data by splitting the training data into k number of “folds” (partitions), training a model on k – 1 folds, and using the fold that was held out to get a validation score. The training process occurs k times, each time using a different fold as the validation set. At the end, the final validation score is the average of all k scores. This process is also commonly referred to as k-fold cross validation.

After a model is selected using cross-validation, that selected model is then refit to the entire training set (i.e., it’s retrained on all k folds combined).

Cross-validation reduces the likelihood of significant divergence of the distributions in the validation data from those in the full dataset. For this reason, it’s often the preferred technique when working with smaller datasets, which are more susceptible to randomness. The more folds you use, the more thorough the validation. However, adding folds increases the time needed to train, and may not be useful beyond a certain point.
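
A minimal k-fold sketch using scikit-learn's `cross_val_score` is shown below. The synthetic data, the choice of k = 5, the stratified folds, and F1 as the metric are all assumptions for the example.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

X_cv, y_cv = make_classification(n_samples=500, n_features=10, random_state=0)

k = 5
folds = StratifiedKFold(n_splits=k, shuffle=True, random_state=0)

# Train k times, each time holding out a different fold for validation
fold_scores = cross_val_score(
    RandomForestClassifier(n_estimators=50, random_state=0),
    X_cv, y_cv, cv=folds, scoring="f1",
)

# The final validation score is the average of the k fold scores
mean_score = fold_scores.mean()
print("fold scores:", fold_scores.round(3))
print("mean F1:", round(mean_score, 3))
```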

#### 4.3 Model Selection
Once candidate models have been trained and validated, a champion model is selected. Model validation scores factor heavily into this selection, but score is seldom the only criterion. Often other factors are also considered:
- How explainable is the model? 
- How complex is it? 
- How resilient is it against fluctuations in input values? 
- How well does it perform on data not found in the training data? 
- How much computational cost does it have to make predictions? 
- Does it add much latency to any production system? 

It’s not uncommon for a model with a slightly lower validation score to be selected over the highest-scoring model because it is simpler, less computationally expensive, or more stable.


---
<a name="5.-Exploratory-Data-Analysis"></a>
### 5. Exploratory Data Analysis

#### 5.1 Imports

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt  # needed for the feature importance plot

import pickle as pkl

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split, PredefinedSplit, GridSearchCV
from sklearn.metrics import f1_score, precision_score, recall_score, accuracy_score

# Load the dataset into a DataFrame and save in a variable
data = pd.read_csv("example_file.csv")

#### 5.2 Data Exploration
After loading the dataset, the next step is to prepare the data so that it is suitable for modeling. This includes: 

*   Exploring data
*   Checking for missing values
*   Encoding categorical data 
*   Dropping irrelevant columns
*   Renaming columns
*   Creating training and testing data

In [None]:
# Display the first 10 rows of the data
data.head(10)

In [None]:
# Display number of rows, number of columns
data.shape

In [None]:
# Display the data type for each column. NB scikit-learn's random forest models expect numeric data
data.dtypes

##### 5.2.1 Check for Missing Values

In [None]:
# Check for missing values.
data.isnull().sum()

In [None]:
# Drop rows with missing values.
# Save DataFrame in variable `data_subset`.
data_subset = data.dropna(axis=0).reset_index(drop = True)

In [None]:
# Confirm that no missing values remain.
data_subset.isna().sum()

In [None]:
# View first 10 rows.
data_subset.head(10)

##### 5.2.2 Drop Columns

In [None]:
# Drop the irrelevant column.
data_subset = data_subset.drop(['irrelevant_column'], axis=1)

##### 5.2.3 Encode Data

scikit-learn's tree-based models require numeric input, so all relevant columns (target and predictor variables) must be converted through one-hot encoding or label encoding.
- Columns requiring special attention should be handled separately (with particular focus on the target variable).
- Finally, the remaining categorical columns can be converted to numeric using `pandas.get_dummies()`.

In [None]:
# Represent the data in the target variable numerically
data_subset['target_variable'] = data_subset['target_variable'].map({"class_1": 1, "class_2": 0})

In [None]:
# Convert all categorical columns to numeric.
data_subset = pd.get_dummies(data_subset, drop_first = True)

##### 5.2.4 Create Training and Test Data

In [None]:
# Separate the dataset into labels (y) and features (X).
y = data_subset["target_variable"]

X = data_subset.copy()
X = X.drop("target_variable", axis = 1)

# Separate into train, validate, test sets.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.25, stratify= y, random_state = 0)
X_tr, X_val, y_tr, y_val = train_test_split(X_train, y_train, test_size = 0.25, stratify= y_train, random_state = 0)

---
<a name="6.-Model-Construction"></a>
### 6. Model Construction


#### 6.1 Tune Model

In [None]:
# Determine set of hyperparameters
cv_params = {'n_estimators' : [50,100], 
              'max_depth' : [10,50],        
              'min_samples_leaf' : [0.5,1], 
              'min_samples_split' : [0.001, 0.01],
              'max_features' : ['sqrt'], 
              'max_samples' : [.5,.9]}

**Using a separate validation set is a common practice in machine learning to evaluate the performance of a model and prevent overfitting.**

During the training process, the model is optimized to minimize the error on the training data, and it can start to memorize the training data instead of learning the underlying patterns in the data. This can lead to poor performance on new, unseen data.

To evaluate the performance of the model on new data, a separate validation set is used. The validation set is a portion of the original dataset that is not used for training, and it is used to evaluate the performance of the model during the training process.

By monitoring the performance of the model on the validation set, we can tune the hyperparameters and adjust the model to prevent overfitting. Once we have a well-tuned model, we can use a final test set, which is another portion of the original dataset that is not used for training or validation, to evaluate the performance of the model on completely unseen data.

Therefore, using a separate validation set is important to ensure that the model generalizes well to new data and is not overfitting to the training data. It allows us to optimize the model's performance while ensuring that it is not just memorizing the training data.
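
The gap between training and validation performance can be seen in a small sketch. Here a single decision tree is grown to several depths on synthetic, deliberately noisy data (both assumptions for illustration), and each tree is scored on the data it was fit to and on held-out validation data.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data with 10% label noise (flip_y) so memorization is possible
X_noisy, y_noisy = make_classification(n_samples=600, n_features=20,
                                       n_informative=4, flip_y=0.1, random_state=0)
X_fit, X_hold, y_fit, y_hold = train_test_split(
    X_noisy, y_noisy, test_size=0.25, stratify=y_noisy, random_state=0)

results = {}
for depth in [2, 5, None]:   # None lets the tree grow until every leaf is pure
    tree = DecisionTreeClassifier(max_depth=depth, random_state=0).fit(X_fit, y_fit)
    # (accuracy on the data the tree was fit to, accuracy on held-out data)
    results[depth] = (tree.score(X_fit, y_fit), tree.score(X_hold, y_hold))

for depth, (train_acc, val_acc) in results.items():
    print(f"max_depth={depth}: train={train_acc:.3f}, validation={val_acc:.3f}")
```

A fully grown tree can fit its training data almost perfectly, so only the validation score says anything about how it handles unseen data.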

GridSearchCV cross-validates by default: if the `cv` parameter is left blank, it splits the data into 5 folds. Because this notebook uses a separate validation set, the function must be told explicitly how to perform the validation, that is, which rows of `X_train` belong to the training set and which belong to the validation set. Use a list comprehension to generate a list of the same length as `X_train` in which each value is either -1 or 0. This list tells GridSearchCV that each row labeled -1 is in the training set and each row labeled 0 is in the validation set.

In [None]:
# Create list of split indices
split_index = [0 if x in X_val.index else -1 for x in X_train.index] 

# Use PredefinedSplit from sklearn.model_selection to tell GridSearchCV which rows are for training (-1) and which are for validation (0)
custom_split = PredefinedSplit(split_index)
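
To see what `PredefinedSplit` does with such a list, the toy sketch below uses a hypothetical six-row index (an assumption for illustration) and prints the single train/validation split it produces.

```python
from sklearn.model_selection import PredefinedSplit

# Hypothetical fold labels for six rows: -1 = always training, 0 = validation fold
toy_split_index = [-1, -1, 0, -1, 0, -1]
toy_split = PredefinedSplit(toy_split_index)

# Only one distinct fold label (0) exists, so there is exactly one split
n_splits = toy_split.get_n_splits()
splits = [(train_idx.tolist(), val_idx.tolist())
          for train_idx, val_idx in toy_split.split()]

print("number of splits:", n_splits)
print("train rows:", splits[0][0], "| validation rows:", splits[0][1])
```

Rows labeled -1 are never placed in a validation fold, so GridSearchCV evaluates every hyperparameter combination on the same held-out rows.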

#### 6.2 Create Model

In [None]:
# Instantiate the model
rf = RandomForestClassifier(random_state=0)

In [None]:
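# Search over specified parameters using GridSearchCV.
# `scoring` must list the metrics explicitly; otherwise refit='f1' does not select by F1 score.
rf_val = GridSearchCV(rf, cv_params, scoring=['accuracy', 'precision', 'recall', 'f1'],
                      cv=custom_split, refit='f1', n_jobs= -1, verbose= 1)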

In [None]:
%%time

# Fit the model.
rf_val.fit(X_train, y_train)

In [None]:
# Obtain optimal parameters.
rf_val.best_params_

---
<a name="7.-Model-Results-&-Evaluation"></a>
### 7. Model Results & Evaluation

#### 7.1 Optimize Parameters

In [None]:
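# Rebuild the model with the optimal parameters found by GridSearchCV
# (the values below come from one run; substitute your own rf_val.best_params_)
rf_opt = RandomForestClassifier(n_estimators = 50, max_depth = 50, 
                                min_samples_leaf = 1, min_samples_split = 0.001,
                                max_features="sqrt", max_samples = 0.9, random_state = 0)

# Fit the optimal model on all non-test data (training + validation) before predicting
rf_opt.fit(X_train, y_train)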

In [None]:
# Predict on test set.
y_pred = rf_opt.predict(X_test)

#### 7.2 Obtain Performance Scores

In [None]:
# Get precision score.
pc_test = precision_score(y_test, y_pred, pos_label = 1)  # target was encoded as 1/0 above
print("The precision score is {pc:.3f}".format(pc = pc_test))

In [None]:
# Get recall score.
rc_test = recall_score(y_test, y_pred, pos_label = 1)  # target was encoded as 1/0 above
print("The recall score is {rc:.3f}".format(rc = rc_test))

In [None]:
# Get accuracy score.
ac_test = accuracy_score(y_test, y_pred)
print("The accuracy score is {ac:.3f}".format(ac = ac_test))

In [None]:
# Get F1 score.
f1_test = f1_score(y_test, y_pred, pos_label = 1)  # target was encoded as 1/0 above
print("The F1 score is {f1:.3f}".format(f1 = f1_test))

#### 7.3 Evaluate Feature Importance

In [None]:
# Uncover which features might be most important by building a feature importance graph
importances = rf_val.best_estimator_.feature_importances_
forest_importances = pd.Series(importances, index=X.columns)

# Sort ascending so the most important feature appears at the top of the horizontal bar chart
sorted_importances = forest_importances.sort_values(ascending=True)

# Create a new figure and axis
fig, ax = plt.subplots(figsize=(8, 4))

# Plot the sorted feature importances with swapped x and y axis
sorted_importances.plot.barh(ax=ax)

# Set the axis labels and title
plt.xlabel('Importance')
plt.ylabel('Feature')
plt.title('Relative Feature Importance - Random Forest')

# Add grid
plt.grid(True)

plt.show()

Pros and cons of performing model selection using test data instead of a separate validation dataset:

Pros: <br />
*  The coding workload is reduced.
*  The scripts for data splitting are shorter.
*  It's only  necessary to evaluate test dataset performance once, instead of two evaluations (validate and test).

Cons: <br />
* If a model is evaluated using samples that were also used to build or fine-tune that model, it likely will provide a biased evaluation.
* Selecting the model that scores best on the test data risks overfitting to the test set, so the reported test scores may overstate performance on new data.

#### 7.4 Evaluation

The four basic quantities used to evaluate the performance of a classification model:

1. True positives (TP): These are correctly predicted positive values, which means the value of actual and predicted classes are positive. 

2. True negatives (TN): These are correctly predicted negative values, which means the value of the actual and predicted classes are negative.

3. False positives (FP): This occurs when the value of the actual class is negative and the value of the predicted class is positive.

4. False negatives (FN): This occurs when the value of the actual class is positive and the value of the predicted class is negative. 

**Reminder:** When fitting and tuning classification models, aim to minimize false positives and false negatives.

What the four scores demonstrate about a model:

- Accuracy: The ratio of correctly predicted observations to total observations. 
- Precision: The ratio of correctly predicted positive observations to total predicted positive observations. 
- Recall: The ratio of correctly predicted positive observations to all observations in the actual positive class.
- F1 score: The harmonic average of precision and recall, which takes into account both false positives and false negatives.
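
The four scores can be computed directly from the TP, TN, FP, and FN counts. The sketch below uses ten hypothetical labels and predictions (an assumption for illustration) and checks the hand-computed values against scikit-learn's metric functions.

```python
import numpy as np
from sklearn.metrics import (accuracy_score, confusion_matrix, f1_score,
                             precision_score, recall_score)

# Hypothetical labels and predictions (1 = positive class, 0 = negative class)
y_true_toy = np.array([1, 1, 1, 1, 0, 0, 0, 0, 0, 0])
y_pred_toy = np.array([1, 1, 1, 0, 1, 0, 0, 0, 0, 0])

# For binary 0/1 labels, ravel() returns the counts in the order TN, FP, FN, TP
tn, fp, fn, tp = confusion_matrix(y_true_toy, y_pred_toy).ravel()

accuracy = (tp + tn) / (tp + tn + fp + fn)           # correct / all observations
precision = tp / (tp + fp)                           # correct positives / predicted positives
recall = tp / (tp + fn)                              # correct positives / actual positives
f1 = 2 * precision * recall / (precision + recall)   # harmonic mean of precision and recall

print(f"TP={tp}, TN={tn}, FP={fp}, FN={fn}")
print(f"accuracy={accuracy:.2f}, precision={precision:.2f}, recall={recall:.2f}, f1={f1:.2f}")

# Cross-check against scikit-learn's implementations
sk_scores = (accuracy_score(y_true_toy, y_pred_toy),
             precision_score(y_true_toy, y_pred_toy),
             recall_score(y_true_toy, y_pred_toy),
             f1_score(y_true_toy, y_pred_toy))
```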

In [None]:
# Precision score on test data set.
print("\nThe precision score is: {pc:.3f}".format(pc = pc_test), "for the test set,", "\nwhich means that of all positive predictions,", "{pc_pct:.1f}% are true positives.".format(pc_pct = pc_test * 100))

In [None]:
# Recall score on test data set.
print("\nThe recall score is: {rc:.3f}".format(rc = rc_test), "for the test set,", "\nwhich means that of all real positive cases in the test set,", "{rc_pct:.1f}% are predicted positive.".format(rc_pct = rc_test * 100))

In [None]:
# Accuracy score on test data set.
print("\nThe accuracy score is: {ac:.3f}".format(ac = ac_test), "for the test set,", "\nwhich means that of all cases in the test set,", "{ac_pct:.1f}% are predicted correctly (true positive or true negative).".format(ac_pct = ac_test * 100))

In [None]:
# F1 score on test data set.
print("\nThe F1 score is: {f1:.3f}".format(f1 = f1_test), "for the test set,", "\nwhich means the harmonic mean of precision and recall is {f1_pct:.1f}%.".format(f1_pct = f1_test * 100))

**Question to answer:** How well does this model perform based on the four scores?

In [None]:
# Create table of results.
# (DataFrame.append was removed in pandas 2.0, so build the table from a list of rows.)
table = pd.DataFrame([
    {'Model': "Tuned Decision Tree",
     'F1': 0.945422,
     'Recall': 0.935863,
     'Precision': 0.955197,
     'Accuracy': 0.940864},
    {'Model': "Tuned Random Forest",
     'F1': f1_test,
     'Recall': rc_test,
     'Precision': pc_test,
     'Accuracy': ac_test},
])
table

**Question to answer:** How does this random forest model compare to any other models that have been constructed to perform the same task?