### CS 3840 Applied Machine Learning - Lab Assignment 3

# <center>Random Forest and Neural Networks</center>

### 1. Overview
An ensemble of decision trees is called a random forest. Despite its simplicity, it is one of the most powerful shallow learning algorithms available today. Neural networks are at the very core of deep learning; they are versatile, powerful, and scalable, which makes them well suited to large and highly complex machine learning tasks. The learning objective of this lab assignment is for students to understand random forests and neural networks: how to train these models and how key parameters affect them, how to evaluate their classification performance, and how to compare the results of different classification models.

#### Lecture notes and code demonstrations. 
Detailed coverage of these topics can be found in the following:
<li>Lecture 2022-03-16-W-Ensemble Learning</li>
<li>Lecture 2022-03-21-M-Ensemble Learning and Random Forest</li>
<li>Code-Ensemble Learning and Random Forest</li>
<li>Lecture 2022-03-23-W-Neural Networks</li>
<li>Lecture 2022-03-28-M-Neural Networks-2</li>
<li>Lecture 2022-03-30-W-Neural Networks-3</li>
<li>Lecture 2022-04-04-M-Neural Networks-4</li>
<li>Lecture 2022-04-06-W-Neural Networks-5</li>
<li>Code-Neural Networks</li>

### 2. Submission
You need to submit a detailed lab report with code, running results, and answers to the questions. If you submit <font color='red'>a Jupyter notebook (“Firstname-Lastname-Lab3.ipynb”)</font>, please fill in this file directly and place the code, running results, and answers in order for each question. If you submit <font color='red'>a PDF report (“Firstname-Lastname-Lab3.pdf”) with a code file (“Firstname-Lastname-Lab3.py”)</font>, please include screenshots of the code and running results, together with your answers, for each question in the report.  

### 3. Questions (50 points)

For this lab assignment, you will use the `MNIST dataset` to complete the following tasks and answer the questions. The MNIST dataset is a set of 70,000 small images of digits handwritten by high school students and employees of the US Census Bureau. Each image is labeled with the digit it represents. You will use the pixel values of each image as features to build random forest and deep neural network models that predict the `digit` shown in an image. First, please load the data.   

#### Load and plot the data

Loading MNIST data of 70,000 images may take some time.

In [None]:
from sklearn.datasets import fetch_openml

mnist = fetch_openml('mnist_784', version=1, as_frame=False)
X, y = mnist["data"], mnist["target"]

In [None]:
print(X.shape, y.shape)

In [None]:
import numpy as np
import matplotlib as mpl
import matplotlib.pyplot as plt

# Plot the data
def plot_digits(instances, images_per_row=5, **options):
    size = 28
    images_per_row = min(len(instances), images_per_row)
    n_rows = (len(instances) - 1) // images_per_row + 1
    n_empty = n_rows * images_per_row - len(instances)
    padded_instances = np.concatenate([instances, np.zeros((n_empty, size * size))], axis=0)

    # Reshape the array so it's organized as a grid containing 28×28 images:
    image_grid = padded_instances.reshape((n_rows, images_per_row, size, size))
    big_image = image_grid.transpose(0, 2, 1, 3).reshape(n_rows * size,
                                                         images_per_row * size)
    
    # Now that we have a big image, we just need to show it:
    plt.imshow(big_image, cmap = mpl.cm.binary, **options)
    plt.axis("off")

plt.figure(figsize=(4,4))
example_images = X[:25]
plot_digits(example_images, images_per_row=5)
plt.show()
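
The reshape and transpose steps in `plot_digits` can be checked on a tiny example. The sketch below is an illustration only: it uses 2×2 toy "images" instead of 28×28 digits, and the names `tiny`, `grid`, and `big` are made up for this demo.

```python
import numpy as np

size = 2                      # toy image side length (MNIST uses 28)
n_rows, per_row = 2, 3        # arrange six images in a 2 x 3 grid

# Six flattened toy images; image k is filled with the value k.
tiny = np.repeat(np.arange(6), size * size).reshape(6, size * size)

# Same steps as plot_digits:
# (grid row, grid col, y, x) -> (grid row, y, grid col, x) -> one 2-D image.
grid = tiny.reshape(n_rows, per_row, size, size)
big = grid.transpose(0, 2, 1, 3).reshape(n_rows * size, per_row * size)

print(big)
```

Each toy image shows up as a 2×2 block of its own value, which is the same tiling that places the 28×28 digits side by side in the plot.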

In [None]:
X_train, X_test, y_train, y_test = X[:60000], X[60000:], y[:60000], y[60000:]

<font color='red'><b>About the data used in this assignment: </b></font><br>
**The MNIST dataset is actually already split into a training set (the first 60,000 images) and a test set (the last 10,000 images). All the classification models in this lab are trained on `X_train`, `y_train`, and evaluated on `X_test`, `y_test`.**


#### Question 1 (14 points):  
Please train a random forest model using <b>random patches</b> in function `answer_one( )`. Random patches samples the training instances for each decision tree, with or without replacement. It can be implemented with the bagging ensemble method `BaggingClassifier` over a number of `DecisionTreeClassifier` trees, so that each tree is built on a subset of the training instances. After the random forest is trained, evaluate its performance, including accuracy, micro F1 score, and macro F1 score.
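
The three metrics named above differ in how they average over classes. The sketch below computes them by hand on a small, made-up label set so the difference is visible; the helper `f1_scores` and the toy labels are hypothetical and are not part of the assignment. In the lab itself, `accuracy_score` and `f1_score` from scikit-learn do this work.

```python
from collections import Counter

def f1_scores(y_true, y_pred):
    """Return (accuracy, micro F1, macro F1) for single-label predictions."""
    labels = sorted(set(y_true) | set(y_pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1   # class p was predicted, but wrongly
            fn[t] += 1   # true class t was missed

    def f1(tp_, fp_, fn_):
        denom = 2 * tp_ + fp_ + fn_
        return 2 * tp_ / denom if denom else 0.0

    accuracy = sum(tp.values()) / len(y_true)
    # Micro: pool the counts over all classes, then compute a single F1.
    micro = f1(sum(tp.values()), sum(fp.values()), sum(fn.values()))
    # Macro: compute F1 per class, then take the unweighted mean.
    macro = sum(f1(tp[c], fp[c], fn[c]) for c in labels) / len(labels)
    return accuracy, micro, macro

# Toy, imbalanced labels (made up for illustration).
y_true = [0, 0, 0, 0, 0, 0, 1, 1, 2, 2]
y_pred = [0, 0, 0, 0, 0, 0, 1, 0, 0, 2]
accuracy, micro, macro = f1_scores(y_true, y_pred)
print(accuracy, micro, macro)
```

When every instance has exactly one label, micro F1 equals accuracy. Macro F1 weights each class equally, so it drops when small classes are predicted poorly, as in this toy example.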

**Set `DecisionTreeClassifier()`, `n_estimators=100`, `bootstrap=True` and `random_state=42` in `BaggingClassifier` to build the random forest from a number of decision trees**

**Adjust the option `max_samples=` in `BaggingClassifier` to set the size for each subset of training instances as `1,000`, `2,000` and `3,000`**
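
To make the sampling settings above concrete, the sketch below mimics in plain NumPy how a bagging ensemble with `bootstrap=True` could draw `max_samples` row indices for each tree. It illustrates the sampling idea only and is not scikit-learn's internal code; the helper name `draw_bootstrap_indices` and the seed are assumptions made for this demo.

```python
import numpy as np

def draw_bootstrap_indices(n_instances, max_samples, n_estimators, seed=42):
    """Draw one index array per tree, sampling rows WITH replacement."""
    rng = np.random.default_rng(seed)
    return [rng.integers(0, n_instances, size=max_samples)
            for _ in range(n_estimators)]

# 60,000 training images, 1,000 rows per tree, 100 trees.
subsets = draw_bootstrap_indices(60000, 1000, 100)

first = subsets[0]
n_unique_first = len(np.unique(first))               # duplicates are possible
covered = len(np.unique(np.concatenate(subsets)))    # rows seen by any tree

print("rows drawn for tree 0:", len(first))
print("distinct rows in tree 0:", n_unique_first)
print("distinct rows seen by the whole ensemble:", covered)
```

Each tree sees only a small slice of the training set, which keeps training fast. Raising `max_samples` gives every tree more data at the cost of more training time.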

In [None]:
import time
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score, f1_score

def answer_one():
    random_patches = 
    random_patches.
    y_pred = 
    
    #Accuracy: use accuracy_score 
    accuracy = 
    
    #Micro F1 score: use f1_score with average='micro' 
    micro_f1 = 
    
    #Macro F1 score: use f1_score with average='macro' 
    macro_f1 = 
    
    return accuracy, micro_f1, macro_f1

#Run your function in the cell to return the results
time1_1 = time.time()
accuracy_1, microf1_1, macrof1_1 = answer_one()
time2_1 = time.time()
time_1 = time2_1 - time1_1 

#Print your results here
print(accuracy_1, microf1_1, macrof1_1, time_1)

<font color='red'><b>Double click here to answer the questions in this cell: </b></font> <br>
Report the performance and time used for the three training subset sizes of 1,000, 2,000, and 3,000: <br>
<b>`max_samples=1000`</b>: Accuracy is: ( ), Micro f1 score is: ( ), Macro f1 score is: ( ), Time used: ( ) <br>
<b>`max_samples=2000`</b>: Accuracy is: ( ), Micro f1 score is: ( ), Macro f1 score is: ( ), Time used: ( ) <br>
<b>`max_samples=3000`</b>: Accuracy is: ( ), Micro f1 score is: ( ), Macro f1 score is: ( ), Time used: ( )

#### Question 2 (14 points):  
Please train a random forest model using <b>random subspaces</b> in function `answer_two( )`. Random subspaces samples the training features for each decision tree while keeping all training instances. It can be implemented with the bagging ensemble method `BaggingClassifier` over a number of `DecisionTreeClassifier` trees, so that each tree is built on a subset of the training features. After the random forest is trained, evaluate its performance, including accuracy, micro F1 score, and macro F1 score.

**Set `DecisionTreeClassifier()`, `n_estimators=100` and `random_state=42` in `BaggingClassifier` to build the random forest from a number of decision trees**

**Adjust the option `max_features=` in `BaggingClassifier` to set the size for each subset of features as `10`, `30` and `50`**
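
The feature sampling described above can be sketched in plain NumPy. The example assumes features are drawn without replacement for each tree, which matches the `BaggingClassifier` default `bootstrap_features=False`. It is an illustration only, not scikit-learn's internal code; the helper `draw_feature_subsets`, the stand-in array `X_demo`, and the seed are hypothetical.

```python
import numpy as np

def draw_feature_subsets(n_features, max_features, n_estimators, seed=42):
    """Draw one set of column indices per tree, WITHOUT replacement."""
    rng = np.random.default_rng(seed)
    return [rng.choice(n_features, size=max_features, replace=False)
            for _ in range(n_estimators)]

# 784 pixel features, 10 features per tree, 100 trees.
subsets = draw_feature_subsets(784, 10, 100)

X_demo = np.zeros((5, 784))          # stand-in for a few MNIST rows
X_tree0 = X_demo[:, subsets[0]]      # the columns that tree 0 would see

# How many of the 784 pixels does at least one tree look at?
used = len(np.unique(np.concatenate(subsets)))
expected_fraction = 1 - (1 - 10 / 784) ** 100

print("input shape for tree 0:", X_tree0.shape)
print("pixels used by at least one tree:", used, "of 784")
print("expected fraction used:", round(expected_fraction, 3))
```

With only 10 features per tree, each tree sees a tiny part of the image, and about 72% of the pixels are expected to be used by at least one of the 100 trees. Larger `max_features` values give each tree more information to split on.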

In [None]:
def answer_two():
    random_subspaces = 
    random_subspaces.
    y_pred = 
    
    accuracy = 

    micro_f1 = 

    macro_f1 = 
    
    return accuracy, micro_f1, macro_f1

#Run your function in the cell to return the results
time1_2 = time.time()
accuracy_2, microf1_2, macrof1_2 = answer_two()
time2_2 = time.time()
time_2 = time2_2 - time1_2 

#Print your results here
print(accuracy_2, microf1_2, macrof1_2, time_2)

<font color='red'><b>Double click here to answer the questions in this cell: </b></font> <br> 
Report the performance and time used for the three feature subset sizes of 10, 30, and 50: <br>
<b>`max_features=10`</b>: Accuracy is: ( ), Micro f1 score is: ( ), Macro f1 score is: ( ), Time used: ( ) <br>
<b>`max_features=30`</b>: Accuracy is: ( ), Micro f1 score is: ( ), Macro f1 score is: ( ), Time used: ( ) <br>
<b>`max_features=50`</b>: Accuracy is: ( ), Micro f1 score is: ( ), Macro f1 score is: ( ), Time used: ( )

#### Question 3 (7 points):  
<font color='red'><b>Double click here to answer the questions in this cell: </b></font><br>
In Question 1, the random forest trains each decision tree on a random subset of training instances with all features. In Question 2, the random forest trains each decision tree on a very small random subset of features while keeping all training instances. Both random forests have 100 decision trees. Please <b>compare the best results of these two models</b>: <br> 

Summarize your observations about their classification performance and time used: ( ) <br>

Also briefly explain why one model outperforms the other one: ( )

#### Install Tensorflow and Keras

You need to use Keras to build neural networks in the next question. Because Keras is integrated into the TensorFlow package, please install TensorFlow (version ≥2.0 is required) as follows if you haven't done so yet:

**`python3 -m pip install --upgrade tensorflow`**

In [None]:
# TensorFlow ≥2.0 is required
import tensorflow as tf
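# Compare the major version as a number; comparing version strings fails for e.g. "10.0"
assert int(tf.__version__.split(".")[0]) >= 2, "The version of TensorFlow needs to be ≥2.0"

print(tf.__version__)
print("The version of TensorFlow you installed is ≥2.0")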

#### Preprocess MNIST

In [None]:
X_train, X_test = X_train / 255., X_test / 255.

y_train, y_test = y_train.astype(np.uint8), y_test.astype(np.uint8)

#### Question 4 (15 points):  
Please train a deep neural network (i.e., a multi-layer perceptron) in function `answer_four( )`. After the neural network is trained, evaluate the accuracy.

**Set the dropout rate in `keras.layers.Dropout()` to `0.3`, `0.5` and `0.7` in turn and compare the resulting performance**  
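
To build intuition for the dropout rate, the sketch below applies inverted dropout to made-up activations in plain NumPy: a fraction `rate` of units is zeroed and the survivors are scaled by `1 / (1 - rate)`, which is the behavior the Keras `Dropout` layer documents for training. The helper `dropout_train` and the all-ones activations are hypothetical and only serve this illustration.

```python
import numpy as np

def dropout_train(activations, rate, rng):
    """Inverted dropout: zero a fraction `rate` of units, rescale the rest."""
    keep_prob = 1.0 - rate
    mask = rng.random(activations.shape) < keep_prob   # True = unit is kept
    return activations * mask / keep_prob

rng = np.random.default_rng(42)
a = np.ones((1000, 300))   # made-up activations of a 300-unit hidden layer

for rate in (0.3, 0.5, 0.7):
    out = dropout_train(a, rate, rng)
    zero_fraction = np.mean(out == 0)
    print(f"rate={rate}: zeroed={zero_fraction:.3f}, mean={out.mean():.3f}")
```

The mean activation stays close to 1 for every rate because of the rescaling, while a higher rate silences more units in each training step. At inference time the layer passes its inputs through unchanged.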

In [None]:
import tensorflow as tf
from tensorflow import keras

tf.random.set_seed(42)

def answer_four():
    model = keras.models.Sequential([
        keras.layers.Input(shape=(784,)),
                                                     #Dense hidden layer, 300 neurons, ReLU
                                                     #Dropout layer to address overfitting
                                                     #Dense hidden layer, 100 neurons, ReLU 
                                                     #Dropout layer to address overfitting
                                                     #Dense output layer, 10 neurons, softmax
    ])
    
    model.compile(
                                                     #Loss function 
                                                     #Optimization algorithm: adam
                                                     #Evaluation metrics: accuracy 
    )
    
    #Batch size for gradient descent: 64 (an argument of fit, not compile)
    model.fit(X_train, y_train, epochs=30, batch_size=64,
             validation_data=(X_test, y_test))
    
    loss, accuracy_4 = model.evaluate(X_test, y_test)   
    
    return accuracy_4

#Run your function in the cell to return the results
accuracy_4 = answer_four()

#Print your results here
print("\nThe test accuracy is: ", accuracy_4)

<font color='red'><b>Double click here to answer the questions in this cell: </b></font> <br>
Report the performance for the three dropout rates of 0.3, 0.5, and 0.7: <br>
<b>`keras.layers.Dropout(0.3)`</b>: Test accuracy is: ( ) <br>
<b>`keras.layers.Dropout(0.5)`</b>: Test accuracy is: ( ) <br>
<b>`keras.layers.Dropout(0.7)`</b>: Test accuracy is: ( ) 

Based on the test accuracy, and the difference between training accuracy and validation accuracy printed out in the training log, please summarize the impact of dropout on the model performance: ( )

Based on the best performance of neural networks and the best performance of random forest, which model outperforms the other one: ( )