# Week 8 Applying Machine Learning

# Pre-module

So far, you have learned the machine learning pipeline. In the main module this week, you will be challenged to solve a problem using any of the tools you have learned in this course. The purpose of this pre-module is to introduce two more models that you may find useful in tackling this challenge: decision trees and random forests. We will not cover the internal details of how each model works. Instead, we give a high-level description of each model and enough detail about its hyperparameters for you to apply it to your own dataset.

As usual, we will use the breast cancer dataset:

In [None]:
import pandas as pd
import numpy as np

df = pd.read_csv('bc_data.csv', index_col=0)

# Data cleaning
# remove the 'Unnamed: 32' column
df = df.drop('Unnamed: 32', axis=1)

# encode target feature to binary class and split target/predictor vars
y = df["diagnosis"].map({"B" : 0, "M" : 1})
X = df.drop("diagnosis", axis = 1)

# drop all "worst" columns
cols = ['radius_worst', 
        'texture_worst', 
        'perimeter_worst', 
        'area_worst', 
        'smoothness_worst', 
        'compactness_worst', 
        'concavity_worst',
        'concave points_worst', 
        'symmetry_worst', 
        'fractal_dimension_worst']
X = X.drop(cols, axis=1)

# drop perimeter and area (keep radius)
cols = ['perimeter_mean',
        'perimeter_se', 
        'area_mean', 
        'area_se']
X = X.drop(cols, axis=1)


**Q*1: Split the data into a train and test set.**

<span style="background-color: #FFD700">**Write your code below**</span> 


In [None]:
from sklearn.model_selection import train_test_split
# Your Code Here.
# NOTE: please use X_train, X_test, y_train, and y_test as your variable names.


### Decision Trees

In Week 5, we explored basic machine learning models like Logistic Regression. This week, we will introduce a new model: **Decision Trees**.

Decision Trees are a type of supervised learning model used for both **classification** and **regression** tasks. They work by recursively splitting the data into subsets based on the most significant feature at each step, creating a tree-like structure. The model makes predictions by following the branches of the tree based on input features, ultimately arriving at a prediction at the leaves.


#### **Key Definitions**

- **Node**: A node is a point in the tree. Internal nodes are decision points, where the dataset is split based on the value of a specific feature. Two kinds of node have special names:
  - **Root Node**: The topmost node, where the data is initially split.
  - **Leaf Node**: The end points of the tree, which do not split further and represent the final prediction or output.
  
- **Splitting**: Splitting refers to dividing the dataset into smaller subsets based on the value of a feature. The goal is to separate the data into homogeneous groups, where the data points in each group are similar.
  
- **Feature**: A feature is an attribute or characteristic of the data used to make decisions at each node. For example, in a dataset of customers, features could include age, income, or location.

- **Impurity**: Impurity measures how mixed the data at a node is: a node whose samples all share the same label has zero impurity. The most common metrics used for impurity are **Gini Impurity** (used for classification tasks) and **Mean Squared Error (MSE)** (used for regression tasks). The goal is to minimize impurity as the tree grows. (Refer to the second picture for a depiction of impurity)

- **Pruning**: Pruning is the process of removing branches or nodes that do not improve the model's accuracy or performance. It helps to prevent overfitting by simplifying the tree and making it more generalizable.
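
To make the idea of impurity concrete, here is a minimal sketch that computes Gini impurity by hand, using the standard formula of one minus the sum of squared class proportions. The helper names `gini_impurity` and `split_impurity` and the toy data are made up for illustration; they are not part of scikit-learn or of the course dataset.

```python
import numpy as np

def gini_impurity(labels):
    """Gini impurity of a 1-D array of class labels: 1 - sum(p_k ** 2)."""
    labels = np.asarray(labels)
    if labels.size == 0:
        return 0.0
    _, counts = np.unique(labels, return_counts=True)
    proportions = counts / counts.sum()
    return float(1.0 - np.sum(proportions ** 2))

def split_impurity(feature, labels, threshold):
    """Weighted Gini impurity of the two groups created by `feature <= threshold`."""
    feature = np.asarray(feature)
    labels = np.asarray(labels)
    left = labels[feature <= threshold]
    right = labels[feature > threshold]
    n = labels.size
    return (left.size / n) * gini_impurity(left) + (right.size / n) * gini_impurity(right)

# Toy data (hypothetical): one feature and a binary label
toy_feature = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
toy_labels = np.array([0, 0, 0, 1, 1, 1])

print(gini_impurity(toy_labels))                     # an evenly mixed node
print(split_impurity(toy_feature, toy_labels, 3.5))  # this threshold separates the classes
print(split_impurity(toy_feature, toy_labels, 1.5))  # a less useful threshold
```

A tree-building algorithm compares candidate splits in roughly this way and prefers the one with the lowest weighted impurity.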

---

In essence, Decision Trees are intuitive models that break down complex decision-making processes into a series of yes/no questions, each based on the most relevant feature. The result is a flowchart-like structure that predicts the outcome based on how the input features align with the conditions at each node. While Decision Trees are easy to interpret, they are prone to overfitting, especially with very deep trees. However, techniques like **pruning** or ensemble methods like **Random Forests** can help mitigate this.

<img src="decision_tree.png" alt="decision tree" width="900"/>

Image retrieved from [here](https://medium.com/@glennlenormand/decision-tree-in-machine-learning-simplifying-complex-decisions-3657f9f2e48a)

<img src="impurity.png" alt="decision tree impurity" width="500"/>

Image retrieved from [here](https://medium.com/analytics-vidhya/decision-trees-on-mark-need-why-when-quick-hands-on-conclude-ce10dac51e3)

#### **Parameters to Tune**

The Decision Tree model has several key parameters that you can tune to optimize its performance:

1. **`max_depth`**: This parameter controls the maximum depth of the tree. Setting a lower value can prevent overfitting by limiting the complexity of the tree, while a higher value allows the tree to capture more intricate patterns. Be cautious with large values, as they can lead to overfitting.
   
2. **`min_samples_split`**: This parameter determines the minimum number of samples required to split an internal node. Higher values prevent the tree from creating splits that result in very small branches, which helps in reducing overfitting.

3. **`min_samples_leaf`**: This controls the minimum number of samples required to be at a leaf node. A higher value ensures that leaf nodes represent more general patterns and reduces the risk of overfitting to small details in the data.

4. **`max_features`**: This parameter sets the number of features to consider when looking for the best split. Using fewer features can make the model more robust and reduce overfitting. For example, setting this to `sqrt` or `log2` will use a random subset of features at each node.

5. **`criterion`**: This parameter defines the function used to measure the quality of a split. Common options include:
   - **`gini`** (default) for Gini impurity, commonly used for classification tasks.
   - **`entropy`** for information gain, another method used to split data based on the reduction of entropy.

6. **`max_leaf_nodes`**: This parameter limits the number of leaf nodes in the tree. Limiting the number of leaves can help prevent overfitting by simplifying the model.

7. **`splitter`**: This controls the strategy used to split at each node. Options include:
   - **`best`**: This finds the best split.
   - **`random`**: This draws a random candidate split for each feature considered and then chooses the best of those random splits.
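
As a rough illustration of how these parameters constrain a tree, the sketch below fits one unconstrained tree and one constrained tree, then compares their size. It uses synthetic data from `make_classification` rather than the course dataset, and the specific values (`max_depth=3`, `min_samples_leaf=10`) are arbitrary choices for the demonstration.

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic data, used only for illustration (not the breast cancer dataset)
X_demo, y_demo = make_classification(
    n_samples=300, n_features=8, n_informative=4, random_state=0
)

# With default settings the tree keeps splitting until its leaves are pure
full_tree = DecisionTreeClassifier(random_state=0).fit(X_demo, y_demo)

# Limiting depth and leaf size stops the tree early
small_tree = DecisionTreeClassifier(
    max_depth=3, min_samples_leaf=10, random_state=0
).fit(X_demo, y_demo)

for name, tree in [("unconstrained", full_tree), ("constrained", small_tree)]:
    print(
        f"{name}: depth={tree.get_depth()}, leaves={tree.get_n_leaves()}, "
        f"train accuracy={tree.score(X_demo, y_demo):.2f}"
    )
```

The unconstrained tree typically fits the training data perfectly, which is exactly the behaviour that can lead to overfitting; the accuracy that matters is the one on held-out data.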

---

Below we provide an example of training a Decision Tree on a dataset:

In this example, the `max_depth` and `min_samples_split` parameters are set to control the complexity of the decision tree and help prevent overfitting. For other parameters, refer to the documentation for [DecisionTreeClassifier](https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html).

In [None]:
from sklearn.tree import DecisionTreeClassifier

# Initialize and train a Decision Tree model with tuned parameters
# (random_state is fixed so that repeated runs give the same tree)
model_dt = DecisionTreeClassifier(max_depth=5, min_samples_split=10, criterion='gini', random_state=42)
model_dt.fit(X_train, y_train)

**Q*2: Make predictions on the train and test sets using Decision Trees, and provide the training and testing accuracy.**

Hint: refer to the documentation [DecisionTreeClassifier.score](https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html#sklearn.tree.DecisionTreeClassifier.score).

<span style="background-color: #FFD700">**Write your code below**</span> 


In [None]:
train_acc = 
test_acc = 
print(f"Train Accuracy: {train_acc:.2f}")
print(f"Test Accuracy: {test_acc:.2f}")

**Q*3: Loop through 8 different value combinations of `max_depth` and `min_samples_split` for the model. Compare the test accuracy. Include a legend.**

Ex: max_depth=50, min_samples_split=2

<span style="background-color: #FFD700">**Write your code below**</span> 


In [None]:
import matplotlib.pyplot as plt

max_depth_array = [50, 100, 200, 300]
min_samples_split_array = [2, 4]
accuracy = []
legend = []

# YOUR CODE HERE



# Plot the accuracy vs. the hyperparameter combinations
plt.figure(figsize=(10, 6))

# Create a line plot with markers
plt.plot(legend, accuracy, marker='o', linestyle='-', color='b')

# Add labels and title
plt.xlabel('Hyperparameter Combinations')
plt.ylabel('Accuracy')
plt.title('Accuracy vs Hyperparameter Tuning')

# Rotate x-axis labels for better readability
plt.xticks(rotation=45, ha='right')

# Show the plot
plt.tight_layout()
plt.show()

Let's visualize the decision tree. The root node is the node at the very top. 

What each node will display:

* **Feature and Condition:** The feature and the condition used to split the data at that node (e.g., a condition of the form `radius_mean <= threshold`).
* **Gini Value:** The Gini impurity at that node, where 0 means pure (all samples belong to one class) and larger values mean more mixed. With two classes, as in this dataset, the maximum is 0.5, reached when the classes are evenly split.
* **Number of Samples:** The number of samples at that node, i.e., how many data points are at this particular node.
* **Class Distribution:** The distribution of classes (how many samples from each class are present in that node), for example, [50, 50] means 50 samples from each of the two classes.

In [None]:
from sklearn.tree import plot_tree
plt.figure(figsize=(20,10))
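# Pass the feature names as a plain list; some scikit-learn versions reject a pandas Index here
plot_tree(model_dt, filled=True, feature_names=list(X.columns))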
plt.show()
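
If the plotted tree is hard to read, a text view can help. The sketch below uses scikit-learn's `export_text` on a small tree trained on synthetic data; the data and the feature names (`feature_a` to `feature_d`) are placeholders invented for this example, not columns of the course dataset.

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier, export_text

# Synthetic data with made-up feature names, for illustration only
X_demo, y_demo = make_classification(
    n_samples=200, n_features=4, n_informative=2, n_redundant=0, random_state=0
)
demo_feature_names = ["feature_a", "feature_b", "feature_c", "feature_d"]

# A shallow tree keeps the printed rules short
demo_tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X_demo, y_demo)

# Each indented line is a split condition; lines starting with "class:" are leaves
tree_text = export_text(demo_tree, feature_names=demo_feature_names)
print(tree_text)
```

Reading the output from top to bottom follows the same path a sample takes from the root node down to a leaf.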

---

### Random Forests

A model that builds on top of **Decision Trees** is **Random Forest**.

Random Forest is an **ensemble learning** method that can be used for both **classification** and **regression** tasks. It builds a collection of decision trees and merges their outputs to improve accuracy and robustness. Instead of relying on a single decision tree, which might be prone to overfitting, the Random Forest algorithm aggregates the predictions of multiple trees to make a more accurate and reliable final prediction.


#### **Key Definitions:**

- **Decision Tree**: A decision tree is a model that splits data into subsets based on feature values, creating a tree-like structure with nodes representing decisions and leaves representing predictions. Each decision point uses a feature to divide the data into smaller groups.  
- **Ensemble Learning**: Ensemble learning combines the predictions of multiple models (like decision trees) to improve overall performance. Random Forest uses this approach to reduce overfitting and increase model accuracy.
- **Bootstrapping**: This is the process of creating different datasets from the original data by sampling with replacement. Random Forest generates multiple bootstrapped datasets, and each decision tree is trained on one of these datasets.
- **Feature Bagging**: In addition to bootstrapping the data, Random Forest also selects a random subset of features to split on at each decision tree node. This helps prevent overfitting and ensures diversity among trees in the forest.
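
The two sources of randomness described above can be sketched with NumPy alone. This is only an illustration of the sampling idea, not how scikit-learn implements it internally; the sizes (10 rows, 9 features, 3 trees) are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
n_rows, n_features = 10, 9
row_ids = np.arange(n_rows)

# Bootstrapping: each tree gets n_rows row indices drawn WITH replacement,
# so some rows appear more than once and others are left out entirely
bootstrap_samples = []
for tree_id in range(3):
    drawn = rng.choice(row_ids, size=n_rows, replace=True)
    never_drawn = np.setdiff1d(row_ids, drawn)
    bootstrap_samples.append(drawn)
    print(f"tree {tree_id}: rows used = {np.sort(drawn)}, rows left out = {never_drawn}")

# Feature bagging: at a split, only a random subset of features is considered
# (here sqrt(9) = 3 features, drawn WITHOUT replacement)
n_candidates = int(np.sqrt(n_features))
candidate_features = rng.choice(n_features, size=n_candidates, replace=False)
print(f"features considered at this split: {np.sort(candidate_features)}")
```

Because every tree sees a slightly different dataset and a different set of candidate features, the trees make different mistakes, which is what makes combining them worthwhile.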

---

In essence, Random Forest models are made up of several decision trees, and each tree votes for a class or a value. The final prediction is determined by the majority vote (for classification) or average (for regression). This ensemble approach helps Random Forest to handle complex datasets with high accuracy and reduces the risk of overfitting that a single decision tree might encounter.
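
The voting step can be inspected directly on a fitted forest through its `estimators_` attribute. The sketch below uses synthetic data and arbitrary settings. One detail worth knowing: scikit-learn's `RandomForestClassifier` averages the trees' predicted class probabilities rather than counting hard votes, so its result can occasionally differ from a strict majority vote.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data, used only for illustration
X_demo, y_demo = make_classification(n_samples=200, n_features=6, random_state=0)
forest = RandomForestClassifier(n_estimators=25, max_depth=3, random_state=0)
forest.fit(X_demo, y_demo)

sample = X_demo[:5]

# Hard votes: the class each individual tree predicts for the first five rows
per_tree_votes = np.array([tree.predict(sample) for tree in forest.estimators_])
majority_vote = (per_tree_votes.mean(axis=0) > 0.5).astype(int)

# Soft votes: average the class probabilities over all trees
mean_proba = np.mean([tree.predict_proba(sample) for tree in forest.estimators_], axis=0)
soft_vote = forest.classes_[np.argmax(mean_proba, axis=1)]

print("majority vote of trees:  ", majority_vote)
print("averaged probabilities:  ", soft_vote)
print("forest.predict:          ", forest.predict(sample))
```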

![random forest](random_forest.png)

Image retrieved from [here](https://www.sciencedirect.com/topics/computer-science/ideal-protocol)

#### **Hyperparameters**

The Random Forest model has several key parameters that you can tune to optimize its performance:

1. **`n_estimators`**: This parameter controls the number of trees in the forest. Adding trees generally makes predictions more stable, with diminishing returns beyond a certain point, but also increases computation time.

2. **`max_depth`**: This parameter controls the maximum depth of each tree in the forest. Similar to Decision Trees, limiting the depth helps reduce overfitting by preventing the trees from growing too complex.

3. **`min_samples_split`**: This parameter determines the minimum number of samples required to split an internal node. Higher values help prevent the trees from creating splits that result in small branches, reducing overfitting.

4. **`min_samples_leaf`**: This controls the minimum number of samples required to be at a leaf node. Setting a higher value helps ensure that each leaf node represents a more general pattern, which can reduce overfitting to noisy data.

5. **`max_features`**: This parameter defines the maximum number of features to consider when splitting a node. Lower values can help reduce overfitting by introducing randomness in the splitting process. Common options include `sqrt` or `log2`.

6. **`criterion`**: This defines the function used to measure the quality of a split. Random Forest commonly uses the following criteria:
   - **`gini`** (default) for Gini impurity, typically used for classification tasks.
   - **`entropy`** for information gain, another criterion used to measure how well the splits separate the data.

---

Below we provide an example of training a Random Forest model on a dataset:

In this example, the `n_estimators`, `max_depth`, and `min_samples_split` parameters are set to control the complexity of the Random Forest model and help it generalize without overfitting. For other parameters, refer to the documentation for [RandomForestClassifier](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html).

In [None]:
from sklearn.ensemble import RandomForestClassifier

# Initialize and train a Random Forest model with tuned parameters
model_rf = RandomForestClassifier(n_estimators=100, max_depth=5, min_samples_split=10, criterion='gini', random_state=42)
model_rf.fit(X_train, y_train)

**Q*4: Make predictions on the train and test sets using Random Forest, and provide the training and testing accuracy.**

Hint: refer to the documentation [RandomForestClassifier.score](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html#sklearn.ensemble.RandomForestClassifier.score).

<span style="background-color: #FFD700">**Write your code below**</span> 


In [None]:
train_acc = 
test_acc = 
print(f"Train Accuracy: {train_acc:.2f}")
print(f"Test Accuracy: {test_acc:.2f}")

**Q*5: Loop through 8 different value combinations of `n_estimators` and `max_depth` for the model. Compare the test accuracy. Include a legend.**

Ex: n_estimators=50, max_depth=1

<span style="background-color: #FFD700">**Write your code below**</span> 


In [None]:
import matplotlib.pyplot as plt

n_estimators_array = [50, 100, 200, 300]
max_depth_array = [1, 4]
accuracy = []
legend = []

# YOUR CODE HERE



# Plot the accuracy vs. the hyperparameter combinations
plt.figure(figsize=(10, 6))

# Create a line plot with markers
plt.plot(legend, accuracy, marker='o', linestyle='-', color='b')

# Add labels and title
plt.xlabel('Hyperparameter Combinations')
plt.ylabel('Accuracy')
plt.title('Accuracy vs Hyperparameter Tuning')

# Rotate x-axis labels for better readability
plt.xticks(rotation=45, ha='right')

# Show the plot
plt.tight_layout()
plt.show()

## **Graded Exercise: (2 marks)** 

**GQ*1: What is the difference between Decision Trees and Random Forest? (2 marks)**

<span style="background-color: #FFD700">**Write your answer below**</span>


Answer here:

---

## Conclusion
This week, you learned about two new models, Decision Trees and Random Forest. You have gained practical experience in training, evaluating, and interpreting these models. In the next module, you'll apply these newly learned skills to solve a biological case study problem on your own.