## **HW3 Problem 2 (15 points): Artificial Neural Networks [TA: Sogol Mansouri]**


### 1) Neural Network Playground

First, go to TensorFlow's [Neural Network Playground](https://playground.tensorflow.org/). This website is an interactive, exploratory visualization of how the features, number of layers, training time, etc., influence the classification boundaries of an ANN. For now, we'll only concern ourselves with *classification* problems.

Play with the visualization, and then answer the following questions below.

#### Scenarios

1. Using the default network topology, try training the network with the different activation functions (ReLU, Tanh, Sigmoid, Linear). What effect does the activation function have on the training time? What effect does the activation function have on the shape of the classification boundaries?
2. Take a look at [this setup](https://playground.tensorflow.org/#activation=tanh&batchSize=10&dataset=xor&regDataset=reg-plane&learningRate=0.03&regularizationRate=0&noise=0&networkShape=2,2&seed=0.21855&showTestData=false&discretize=false&percTrainData=50&x=true&y=true&xTimesY=false&xSquared=false&ySquared=false&cosX=false&sinX=false&cosY=false&sinY=false&collectStats=false&problem=classification&initZero=false&hideText=false). Train until the classification boundary converges. This is one of the rare cases where the nodes in an ANN can be (semi-)interpreted. What do the nodes in the first hidden layer represent? What about the second hidden layer? How do you think the ANN uses these learned "features" to make a decision?

#### Exploration
For each of the following questions:
* Make a prediction before you begin exploring and testing.
* Include a link to your scenario.
* Explain why you think this scenario has this property.

**Questions**

3. Find a scenario where a simple model (fewer neurons) outperforms a complex model. (Hint: think about overfitting.)
4. Find a scenario where a model with no hidden layers performs well.
5. Find a scenario where a model with no hidden layers performs poorly no matter the features.
6. Find a scenario where it takes a lot of training time to get a correct solution.

1. [Answer] 

In my tests, the training time for each activation function was:
* Tanh: 35 epochs
* ReLU: 40 epochs
* Sigmoid: 500 epochs
* Linear: the boundary never converges

Training time:
* ReLU: Usually converges quickly because it does not saturate for positive inputs, so gradients propagate efficiently.
* Tanh and Sigmoid: Can converge more slowly, especially in deep networks, because their gradients saturate (vanishing gradients). In this shallow network, Tanh was about as fast as ReLU, while Sigmoid was much slower.
* Linear: Cannot model non-linear relationships, so on data that is not linearly separable the loss stays high and training never converges to a useful boundary.

Boundary shape:
* ReLU: Produces sharp, piecewise-linear boundaries.
* Tanh: Creates smoother, more curved boundaries because the activation itself is a smooth curve.
* Sigmoid: Produces boundaries similar to Tanh, but takes longer to reach them because its gradients are smaller.
* Linear: Can only produce a straight-line boundary, which is ineffective for datasets that are not linearly separable.
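
To make the saturation argument above concrete, here is a minimal NumPy sketch. It is not the Playground's implementation; the helper names `sigmoid_grad`, `tanh_grad`, and `relu_grad` are hypothetical and exist only for illustration. It prints the derivative of each activation at a few inputs and then checks that two stacked linear layers collapse into a single linear map.

```python
import numpy as np

def sigmoid_grad(x):
    """Derivative of the logistic sigmoid; its maximum is 0.25 at x = 0."""
    s = 1.0 / (1.0 + np.exp(-x))
    return s * (1.0 - s)

def tanh_grad(x):
    """Derivative of tanh; its maximum is 1.0 at x = 0."""
    return 1.0 - np.tanh(x) ** 2

def relu_grad(x):
    """Derivative of ReLU: 1 for positive inputs, 0 otherwise."""
    return np.where(np.asarray(x) > 0, 1.0, 0.0)

# Sigmoid and tanh gradients shrink as |x| grows; ReLU stays at 1 for x > 0.
for x in (0.5, 2.0, 5.0):
    print(
        f"x={x}: sigmoid'={float(sigmoid_grad(x)):.4f}, "
        f"tanh'={float(tanh_grad(x)):.4f}, relu'={float(relu_grad(x)):.1f}"
    )

# With a linear activation, two layers are equivalent to one linear layer,
# so adding depth cannot bend the decision boundary.
rng = np.random.default_rng(0)
W1 = rng.normal(size=(4, 2))   # hypothetical first-layer weights
W2 = rng.normal(size=(1, 4))   # hypothetical second-layer weights
x_in = rng.normal(size=(2,))
two_layer = W2 @ (W1 @ x_in)
one_layer = (W2 @ W1) @ x_in
print(np.allclose(two_layer, one_layer))
```

Biases are omitted for brevity; including them would still give a single affine map.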

2. [Answer]
3. [Answer]
4. [Answer]
5. [Answer]
6. [Answer]

## 2) Training and Testing a Neural Network (Group)

For this problem, you'll be looking at a subset of the [UCI ML hand-written digits datasets](https://archive.ics.uci.edu/ml/datasets/Optical+Recognition+of+Handwritten+Digits), which contains images of hand-written digits: 10 classes where each class refers to a digit.

Each data entry is an 8x8 input matrix where each element is an integer in the range 0 to 16. The matrix is flattened into a row of 64 values in the dataset.
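
As a quick illustration of the flattening, the sketch below reshapes one row back into its 8x8 image. It assumes the scikit-learn copy of the dataset (`load_digits`), and the variable names `flat_row` and `image` are chosen here for illustration only.

```python
from sklearn.datasets import load_digits

digits = load_digits()

# Each row of `data` holds one image as 64 flattened pixel intensities.
flat_row = digits.data[0]
print(flat_row.shape)

# Reshaping restores the original 8x8 layout (row-major order).
image = flat_row.reshape(8, 8)
print(image.shape)

# Pixel values stay within the documented 0 to 16 range.
print(digits.data.min(), digits.data.max())
```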


For this question, **you have enough experience to do the entire model pipeline yourself**. That means *loading the data, creating splits, scaling the data, training and tuning the model, and evaluating the model.*

In [None]:
#Import necessary libraries
import pandas as pd
import numpy as np
import warnings
warnings.filterwarnings('ignore')

random_state = 42

### Step 1: Load the data. Use `np.unique()` to check the class balance.

In [None]:
from sklearn.datasets import load_digits
df = load_digits()

In [None]:
df.data

In [None]:
# Get a distribution of the class label (target)
np.unique(df.target, return_counts=True)
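
For reference, here is a small sketch of how `np.unique(..., return_counts=True)` reports class balance. The `labels` array is a made-up toy example, not the digits data.

```python
import numpy as np

# Toy label vector (hypothetical): class 2 is over-represented.
labels = np.array([0, 1, 1, 2, 2, 2])

# With return_counts=True, np.unique returns the sorted classes
# and how many times each one occurs.
classes, counts = np.unique(labels, return_counts=True)
fractions = counts / counts.sum()

for c, n, f in zip(classes, counts, fractions):
    print(f"class {c}: {n} samples ({f:.0%})")
```

Roughly equal counts indicate a balanced dataset; a strongly skewed distribution would make plain accuracy a misleading metric.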

### Step 2: Split the data into X (features) and Y (class)

Assign the variables below to split the dataset into X (features) and Y (target).

In [None]:
X = Y = None
# Y should be target and X should be features
#TODO: add your code here
X = X[:200]
Y = Y[:200]

### Step 3: Create your train/test split. Use the provided random_state.

**Note**: You should use a `train_size` of 0.8, or 80%, and make sure to use the `random_state` to ensure test cases work.
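
The sketch below shows how `train_size` and `random_state` behave, using a made-up toy array rather than the digits data. The names `toy_X`, `toy_y`, and the seed `0` are assumptions for illustration only.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data (hypothetical): 20 samples with 2 features each.
toy_X = np.arange(40).reshape(20, 2)
toy_y = np.arange(20) % 2

# train_size=0.8 keeps 80% of the rows for training; the rest go to the test set.
X_tr, X_te, y_tr, y_te = train_test_split(
    toy_X, toy_y, train_size=0.8, random_state=0
)
print(X_tr.shape, X_te.shape)

# Reusing the same random_state reproduces exactly the same split.
X_tr_again, _, _, _ = train_test_split(
    toy_X, toy_y, train_size=0.8, random_state=0
)
print(np.array_equal(X_tr, X_tr_again))
```

A fixed seed is what makes shape and value assertions on the split reproducible.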

In [None]:
from sklearn.model_selection import train_test_split

X_train = X_test = y_train = y_test = None
#use a `train_size` of 0.8, or 80%, and make sure to use the `random_state` to ensure test cases work
#TODO: add your code here

In [None]:
assert X_train.shape == (160, 64)
assert y_train.shape == (160,)
assert X_test.shape == (40, 64)
assert y_test.shape == (40,)

### Step 4: Use a [StandardScaler](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html) to normalize the image data. 

Pixel data, like other data we've encountered, should often be scaled before classification. While in practice scaling image data can be more complex, in this exercise we'll continue to use the [StandardScaler](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html).

Fit the scaler only on the training X features, and then apply it to both the training and test X features. We do this because, in practice, we wouldn't be able to see the test data in advance, so it shouldn't influence the feature transformation. We therefore use only X_train to fit the transformation.
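
The fit-on-train, transform-both pattern can be sketched on toy data as follows. The arrays `toy_train` and `toy_test` are hypothetical and chosen so the numbers are easy to check by hand.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy data (hypothetical): 3 training rows, 1 test row, 2 features.
toy_train = np.array([[0.0, 10.0], [2.0, 20.0], [4.0, 30.0]])
toy_test = np.array([[2.0, 40.0]])

# The scaler learns the mean and standard deviation from the training rows only.
scaler = StandardScaler().fit(toy_train)
print(scaler.mean_)

# The same learned statistics are then applied to both sets.
train_scaled = scaler.transform(toy_train)
test_scaled = scaler.transform(toy_test)

print(train_scaled.mean(axis=0), train_scaled.std(axis=0))
print(test_scaled)
```

Note that the scaled test row is not guaranteed to have mean 0 or unit variance, since its statistics were never seen during fitting.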

In [None]:
from sklearn.preprocessing import StandardScaler

# Assign these variables the standardized training and test datasets
X_stand_train = X_stand_test = None
#TODO: add your code here

In [None]:
X_stand_train.shape

In [None]:
# Go through each attribute
for i in range(X_stand_train.shape[1]):
    # Calculate the mean of that attribute: it should be 0
    np.testing.assert_almost_equal(np.mean(X_stand_train[:, i]), 0)
    # Calculate the standard deviation: it should be 1
    std = np.std(X_stand_train[:, i])
    # However, if the std was already 0, standardization won't change that,
    # so skip this case
    if abs(std) < 0.01:
        continue
    np.testing.assert_almost_equal(std, 1)

### Step 5:  Train an MLP with default hyperparameters.

For the following, you'll be using sklearn's built-in Multi-layer Perceptron classifier, [MLPClassifier](https://scikit-learn.org/stable/modules/generated/sklearn.neural_network.MLPClassifier.html).

Use the default hyperparameters aside from `max_iter`. `max_iter` is the maximum number of training iterations the ANN goes through before it is forced to stop. The default `max_iter=200` takes longer than we need for our data.

**Pass `random_state` as the `random_state` argument and use `max_iter=20`**. The default parameters will use a single hidden layer.
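
To see what `max_iter` does, the sketch below trains an MLP on a synthetic dataset. The data from `make_classification`, the seed `0`, and the name `toy_clf` are assumptions for illustration and are unrelated to the digits task.

```python
import warnings

from sklearn.datasets import make_classification
from sklearn.neural_network import MLPClassifier

# Synthetic binary classification data (hypothetical stand-in for real features).
toy_X, toy_y = make_classification(n_samples=100, n_features=8, random_state=0)

# Stopping at max_iter before convergence raises a ConvergenceWarning,
# which is silenced here to keep the output readable.
with warnings.catch_warnings():
    warnings.simplefilter("ignore")
    toy_clf = MLPClassifier(max_iter=20, random_state=0).fit(toy_X, toy_y)

# n_iter_ reports how many iterations actually ran; it never exceeds max_iter.
print(toy_clf.n_iter_)

# The default architecture is a single hidden layer of 100 units.
print(toy_clf.hidden_layer_sizes)
```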



In [None]:
from sklearn.neural_network import MLPClassifier

In [None]:
clf=None
# Tip: if you pass your MLP the parameter verbose=True, you can see each iteration of its backpropagation
#TODO: add your code here

### Step 6:  Evaluate the model on the test dataset using a confusion matrix and a classification report

Like all classifiers, the MLP has a [`predict`](https://scikit-learn.org/stable/modules/generated/sklearn.neural_network.MLPClassifier.html#sklearn.neural_network.MLPClassifier.predict) function that is used to make predictions on training or test data.
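
As a reminder of how the two evaluation tools read, here is a toy sketch with hand-written labels. The lists `toy_true` and `toy_pred` are hypothetical; in the exercise the predictions would come from the classifier's `predict`.

```python
from sklearn.metrics import classification_report, confusion_matrix

# Hypothetical true labels and predictions for a 3-class problem.
toy_true = [0, 1, 2, 2, 1]
toy_pred = [0, 2, 2, 2, 1]

# Rows are true classes, columns are predicted classes.
toy_cm = confusion_matrix(toy_true, toy_pred)
print(toy_cm)

# output_dict=True returns the report as a nested dict instead of a string.
toy_report = classification_report(toy_true, toy_pred, output_dict=True)
print(toy_report["accuracy"])
```

Here one sample of class 1 is predicted as class 2, which shows up as the single off-diagonal entry in row 1, column 2.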

In [None]:
from sklearn.metrics import confusion_matrix
from sklearn.metrics import classification_report

In [None]:
# Evaluate the classifier and assign mlp_cm to the confusion matrix of the evaluation
mlp_cm = None
#TODO: add your code here
mlp_cm

In [None]:
np.testing.assert_almost_equal(mlp_cm, [[4, 0, 0, 0, 0, 0, 0, 0, 0, 0],
       [0, 4, 0, 0, 0, 0, 0, 0, 0, 0],
       [0, 0, 4, 0, 0, 0, 0, 0, 1, 0],
       [0, 0, 0, 1, 0, 0, 0, 0, 0, 1],
       [0, 0, 0, 0, 2, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 3, 0, 0, 0, 0],
       [2, 0, 2, 0, 0, 1, 1, 0, 0, 2],
       [0, 0, 0, 1, 0, 0, 0, 2, 0, 0],
       [0, 0, 0, 0, 1, 2, 0, 0, 0, 2],
       [0, 0, 1, 0, 0, 1, 0, 0, 1, 1]])

In [None]:
# Similarly generate a classification report for the test dataset
mlp_clf_report = None
#TODO: add your code here
print(mlp_clf_report)

In [None]:
assert mlp_clf_report == classification_report(y_test, clf.predict(X_stand_test))

In [None]:
# For comparison, generate a classification report for the *training* dataset
mlp_clf_report = None
#TODO: add your code here
print(mlp_clf_report)

In [None]:
assert mlp_clf_report == '              precision    recall  f1-score   support\n\n           0       0.89      0.94      0.91        17\n           1       0.93      0.87      0.90        15\n           2       0.67      0.67      0.67        15\n           3       1.00      0.95      0.97        19\n           4       0.82      0.82      0.82        17\n           5       0.89      0.94      0.91        17\n           6       0.92      0.92      0.92        13\n           7       0.94      1.00      0.97        17\n           8       0.80      0.29      0.42        14\n           9       0.67      1.00      0.80        16\n\n    accuracy                           0.85       160\n   macro avg       0.85      0.84      0.83       160\nweighted avg       0.86      0.85      0.84       160\n'


How well did the classifier do? Which digit did it do best on? Which digits did it confuse the most? Do you think the classifier is likely overfitting, underfitting, or neither?
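
One possible way to read these answers off a confusion matrix is sketched below. It uses a made-up 3-class matrix (`toy_cm`), not the matrix from this exercise, and is only one of several reasonable approaches.

```python
import numpy as np

# Hypothetical confusion matrix (rows: true class, columns: predicted class).
toy_cm = np.array([[5, 0, 0],
                   [1, 3, 2],
                   [0, 1, 4]])

# Per-class recall: correct predictions divided by the number of true samples.
recall = toy_cm.diagonal() / toy_cm.sum(axis=1)
best_class = int(recall.argmax())
print(recall, best_class)

# The largest off-diagonal entry marks the most frequent confusion.
off_diag = toy_cm.copy()
np.fill_diagonal(off_diag, 0)
true_cls, pred_cls = np.unravel_index(off_diag.argmax(), off_diag.shape)
print(f"class {true_cls} is most often mistaken for class {pred_cls}")
```

Comparing the training report with the test report is a complementary check: a large gap between the two suggests overfitting.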

## 3) Hyperparameters

**Hyperparams**:

ANNs have *a lot* of hyperparams. This can include simple things such as the number of layers and nodes, up to tuning the learning rate and the gradient descent algorithm used. 

This process can require a lot of experimentation and intuition gained through experience, but it can be automated to some extent using hyperparameter tuning. When we have multiple hyperparameters, we use an approach called grid search, where we try all combinations of the candidate hyperparameter values to find the combination that works best.

For the following, you will practice hyperparameter tuning for the [MLPClassifier](https://scikit-learn.org/stable/modules/generated/sklearn.neural_network.MLPClassifier.html) with sklearn's [GridSearchCV](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html) function. You should explore different combinations of the following parameters:

* `activation`: The activation function of the ANN. Defaults to ReLU.
* `max_iter`: The ANN trains until either the loss stops improving by more than a specified tolerance or `max_iter` iterations are reached. Warning: the higher you set this, the longer training will take! Patience is a virtue.
* `hidden_layer_sizes`: A tuple representing the structure of the hidden layers. For example, the tuple `(100,50)` means that there are two hidden layers: the first of size 100 and the second of size 50. The tuple `(100,)` would mean a single hidden layer of size 100.
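
The effect of `hidden_layer_sizes` can be inspected through the fitted weight matrices. The sketch below uses a synthetic 4-feature, 3-class dataset; the data, the seed `0`, and the name `net` are assumptions for illustration.

```python
import warnings

from sklearn.datasets import make_classification
from sklearn.neural_network import MLPClassifier

# Synthetic data (hypothetical): 4 input features, 3 classes.
toy_X, toy_y = make_classification(
    n_samples=60, n_features=4, n_informative=3, n_redundant=0,
    n_classes=3, n_clusters_per_class=1, random_state=0,
)

# Only a few iterations are needed to inspect shapes; silence the
# resulting ConvergenceWarning.
with warnings.catch_warnings():
    warnings.simplefilter("ignore")
    net = MLPClassifier(
        hidden_layer_sizes=(100, 50), max_iter=5, random_state=0
    ).fit(toy_X, toy_y)

# One weight matrix per connection between consecutive layers:
# input -> 100 -> 50 -> output.
shapes = [w.shape for w in net.coefs_]
print(shapes)  # [(4, 100), (100, 50), (50, 3)]
```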

Normally we would try many more possible combinations (and larger networks), but we've kept the list short to reduce computation time.

**Try different combinations of these hyperparameters and see how they affect the classification scores of your model.**

In [None]:
# import the library
from sklearn.model_selection import GridSearchCV

In [None]:
# The parameter list you will explore
parameters = {'activation':['logistic', 'relu'], 'max_iter':[5, 10], 'hidden_layer_sizes':[(50,),(20,)]}
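
To see how many candidates this grid produces, the combinations can be enumerated with scikit-learn's `ParameterGrid`. The dictionary `grid_copy` below simply repeats the values from the cell above so that the sketch is self-contained.

```python
from sklearn.model_selection import ParameterGrid

# Copy of the grid above: 2 activations x 2 max_iter values x 2 layer sizes.
grid_copy = {
    "activation": ["logistic", "relu"],
    "max_iter": [5, 10],
    "hidden_layer_sizes": [(50,), (20,)],
}

# ParameterGrid yields one dict per combination of values.
candidates = list(ParameterGrid(grid_copy))
print(len(candidates))  # 8

for params in candidates:
    print(params)
```

With `cv=2`, each of the 8 candidates is trained twice, so the search performs 16 fits plus one final refit of the best candidate.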

Now it's your turn. First, initialize an MLPClassifier, making sure to **use "random_state" as its `random_state`**. Then pass it, together with the parameter list defined above, to GridSearchCV, and fit the search on the training data (**use "X_stand_train"**) to find the best combination of the parameters. The search uses cross-validation within the training dataset, so you never have to peek at your test dataset. Finally, the best classifier is fit to the whole standardized training dataset (with its default `refit=True`, GridSearchCV does this automatically at the end of `fit`).

**Note**: You should use cv=2 in your grid search, to reduce the number of folds tested.
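
The general workflow can be sketched on synthetic data as follows. Everything here (`toy_X`, `toy_y`, `toy_grid`, the seed `0`, and the small layer sizes) is a hypothetical stand-in, not the configuration or solution for this exercise.

```python
import warnings

from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neural_network import MLPClassifier

# Synthetic binary classification data (hypothetical).
toy_X, toy_y = make_classification(n_samples=80, n_features=6, random_state=0)

# A deliberately tiny grid: 2 x 2 = 4 candidates.
toy_grid = {
    "activation": ["logistic", "relu"],
    "hidden_layer_sizes": [(5,), (10,)],
}

# Short training runs trigger ConvergenceWarning, silenced here.
with warnings.catch_warnings():
    warnings.simplefilter("ignore")
    search = GridSearchCV(
        MLPClassifier(max_iter=10, random_state=0), toy_grid, cv=2
    )
    search.fit(toy_X, toy_y)

# best_params_ and best_score_ describe the winning candidate;
# predict delegates to the refitted best_estimator_.
print(search.best_params_)
print(search.best_score_)
print(search.predict(toy_X[:5]))
```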

In [None]:
# Assign clf to the optimized (with grid search) MLP model
# TIP: Again, if you want to track the training progress, try passing "verbose = True" to the MLP
clf = None
#TODO: add your code here

In [None]:
# Now let's see the parameters of the winning model of our grid search
# This is the model clf actually uses when you call clf.predict
clf.best_estimator_

In [None]:
assert list(clf.cv_results_['rank_test_score']) == [3, 2, 8, 6, 4, 1, 6, 5]
np.testing.assert_almost_equal(round(clf.best_score_,4), 0.2312)
assert clf.best_params_['hidden_layer_sizes'] == (50,)
assert clf.best_index_ == 5


In [None]:
from sklearn.metrics import confusion_matrix
from sklearn.metrics import classification_report

Now you will use the estimator with the best parameters found to generate predictions (stored as "y_pred") on the test dataset. **Remember to use "X_stand_test"**.

In [None]:
y_pred = None
#TODO: add your code here

In [None]:
assert list(confusion_matrix(y_test,y_pred)[0]) == [0, 0, 0, 0, 1, 0, 1, 0, 0, 2]

In [None]:
print(classification_report(y_test,y_pred))

Note that in this toy example, we used a very limited set of hyperparameters to reduce training time, and so our tuned model will actually do worse than our original. In practice, however, a tuned model will generally generalize better to the test dataset.