# Backpropagation Lab





In [1]:
from sklearn.base import BaseEstimator, ClassifierMixin
from sklearn.neural_network import MLPClassifier
import numpy as np
import matplotlib.pyplot as plt

## 1. (40%) Correctly implement and submit your own code for the backpropagation algorithm. 

## Code requirements 
- Ability to create a network structure with at least one hidden layer and an arbitrary number of nodes.
- Random weight initialization with small random weights with a mean of 0 and a variance of 1.
- Use Stochastic/On-line training updates: Iterate and update weights after each training instance (i.e. do not attempt batch updates)
- Implement a validation set based stopping criterion.
- Shuffle training set at each epoch.
- Option to include a momentum term
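
The requirements above can be sketched as a single on-line update for a network with one hidden layer. This is only a minimal illustration, not the reference solution: the sigmoid activation, the bias stored as an extra weight row, and the names `init_weights` and `sgd_step` are all assumptions made for the example.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def init_weights(n_in, n_hidden, n_out, rng):
    # Mean 0, variance 1. Each matrix has one extra row for the bias weights.
    W1 = rng.normal(0.0, 1.0, size=(n_in + 1, n_hidden))
    W2 = rng.normal(0.0, 1.0, size=(n_hidden + 1, n_out))
    return W1, W2

def sgd_step(x, t, W1, W2, dW1_prev, dW2_prev, lr=0.1, momentum=0.0):
    """One stochastic update for a single training instance (x, t)."""
    x_b = np.append(x, 1.0)          # input plus bias
    h = sigmoid(x_b @ W1)            # hidden activations
    h_b = np.append(h, 1.0)          # hidden activations plus bias
    o = sigmoid(h_b @ W2)            # output activations

    # Error terms; the bias row of W2 is excluded when propagating back.
    delta_o = (t - o) * o * (1.0 - o)
    delta_h = (W2[:-1] @ delta_o) * h * (1.0 - h)

    # Weight changes: gradient step plus a fraction of the previous change.
    dW2 = lr * np.outer(h_b, delta_o) + momentum * dW2_prev
    dW1 = lr * np.outer(x_b, delta_h) + momentum * dW1_prev
    return W1 + dW1, W2 + dW2, dW1, dW2
```

Calling `sgd_step` once per instance, in a freshly shuffled order each epoch, gives on-line training; passing `momentum=0.0` turns the momentum term off.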

You may use your own random train/test split or use the scikit-learn version if you want.
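
If you write your own split, one possible approach is to permute the row indices and slice them. The helper name `train_test_split_np` and its defaults are illustrative assumptions; scikit-learn's `sklearn.model_selection.train_test_split` does the same job.

```python
import numpy as np

def train_test_split_np(X, y, test_size=0.25, rng=None):
    """Randomly split X and y into train and test portions (rows stay paired)."""
    rng = np.random.default_rng() if rng is None else rng
    idx = rng.permutation(len(X))
    n_test = int(round(test_size * len(X)))
    test_idx, train_idx = idx[:n_test], idx[n_test:]
    return X[train_idx], X[test_idx], y[train_idx], y[test_idx]
```

Calling the same helper again on the training portion is one way to carve out a validation set.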

Use your Backpropagation algorithm to solve the Debug data. We provide you with several parameters, and you should be able to replicate our results every time. When you are confident it is correct, run your script on the Evaluation data with the same parameters, and include your final weights in your report PDF.

In [2]:
class PerceptronClassifier(BaseEstimator,ClassifierMixin):

    def __init__(self, lr=.1, shuffle=True):
        """ 
            Initialize class with chosen hyperparameters.
        Args:
            lr (float): A learning rate / step size.
            shuffle: Whether to shuffle the training data each epoch. DO NOT 
            SHUFFLE for evaluation / debug datasets.
        """
        self.lr = lr
        self.shuffle = shuffle

    def fit(self, X, y, initial_weights=None):
        """ 
            Fit the data; run the algorithm and adjust the weights to find a 
            good solution
        Args:
            X (array-like): A 2D numpy array with the training data, excluding
            targets
            y (array-like): A 2D numpy array with the training targets
            initial_weights (array-like): allows the user to provide initial 
            weights
        Returns:
            self: this allows this to be chained, e.g. model.fit(X,y).predict(X_test)
        """
        # Compare against None explicitly: `not array` raises for multi-element numpy arrays.
        self.initial_weights = self.initialize_weights() if initial_weights is None else initial_weights

        return self

    def predict(self, X):
        """ 
            Predict all classes for a dataset X
        Args:
            X (array-like): A 2D numpy array with the training data, excluding 
            targets
        Returns:
            array, shape (n_samples,)
                Predicted target values per element in X.
        """
        pass

    def initialize_weights(self):
        """ Initialize weights for the network. Don't forget the bias!
        Returns:
        """

        return [0]

    def score(self, X, y):
        """ 
            Return accuracy of model on a given dataset. Must implement own 
            score function.
        Args:
            X (array-like): A 2D numpy array with data, excluding targets
            y (array-like): A 2D numpy array with targets
        Returns:
            score : float
                Mean accuracy of self.predict(X) wrt. y.
        """

        return 0

    def _shuffle_data(self, X, y):
        """ 
            Shuffle the data! This _ prefix suggests that this method should 
            only be called internally.
            It might be easier to concatenate X & y and shuffle a single 2D 
            array, rather than shuffling X and y exactly the same way, 
            independently.
        """
        pass

    ### Not required by sk-learn but required by us for grading. Returns the weights.
    def get_weights(self):
        pass

## 1.1 Debug 

Debug your model by running it on the [Debug Dataset](https://raw.githubusercontent.com/rmorain/CS472-1/master/datasets/perceptron/linsep2nonorigin.arff)
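
The datasets are ARFF files. The sketch below is a deliberately simplified parser that assumes dense data rows with no quoted commas. The function names are hypothetical, and `scipy.io.arff.loadarff` is a more complete alternative.

```python
import urllib.request
import numpy as np

def parse_simple_arff(text):
    """Return (attribute_names, rows) from ARFF text; rows is a 2D array of strings."""
    names, rows, in_data = [], [], False
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("%"):      # skip blanks and comments
            continue
        lower = line.lower()
        if lower.startswith("@attribute"):
            names.append(line.split()[1])
        elif lower.startswith("@data"):
            in_data = True
        elif in_data:
            rows.append([v.strip() for v in line.split(",")])
    return names, np.array(rows)

def load_arff_url(url):
    """Download an ARFF file and parse it with the simplified parser above."""
    with urllib.request.urlopen(url) as response:
        return parse_simple_arff(response.read().decode("utf-8"))
```

Assuming the class label is the last column, `rows[:, :-1].astype(float)` gives the inputs and `rows[:, -1]` gives the labels.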

Parameters:

Learning Rate = 0.1\
Momentum = 0.5\
Deterministic = 10 [This means run it for 10 epochs; the results should be the same every time you run it]\
Shuffle = False\
Validation size = 0\
Initial Weights = All zeros

---

Expected Results: The weights do not need to be in this order or shape.

Results if the number of output nodes = 1 (a single binary node):

debug_bp_0.csv

Results if the number of output nodes = 2 (using a one-hot encoding):

debug_bp_2outs.csv
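
For the one-hot case, the class labels have to be turned into one target column per class. A minimal sketch, with `one_hot` as an illustrative helper name:

```python
import numpy as np

def one_hot(labels):
    """Return (classes, targets), where targets has one column per distinct class."""
    classes, idx = np.unique(np.asarray(labels).ravel(), return_inverse=True)
    return classes, np.eye(len(classes))[idx]
```

To map network outputs back to labels, `classes[np.argmax(outputs, axis=1)]` picks the class of the largest output node.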

In [None]:
# Load debug data

# Train on debug data

# Check weights

## 1.2 Evaluation

We will evaluate your model based on its performance on the [Evaluation Dataset](https://raw.githubusercontent.com/rmorain/CS472-1/master/datasets/perceptron/data_banknote_authentication.arff)

In [27]:
# Load evaluation data

# Train on evaluation data

# Print weights

## 2. (13%) Backpropagation on the Iris Classification problem.

Load the [Iris Dataset](https://raw.githubusercontent.com/rmorain/CS472-1/master/datasets/perceptron/iris.arff)

Parameters:
- One layer of hidden nodes with the number of hidden nodes being twice the number of inputs.
- Use a 75/25 split of the data for the training/test set.
- Use a learning rate of 0.1
- Use a validation set (VS) taken from the training set for your stopping criterion.
- Create one graph with the MSE (mean squared error) on the training set, the MSE on the VS, and the classification accuracy (% classified correctly) of the VS on the y-axis, and the number of epochs on the x-axis. (Note that there are two scales on the y-axis.)

Show each measure with a different color, line type, etc. Typical backpropagation accuracies for the Iris data set are 85-95%.
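
One way to get two y-axis scales on a single graph is matplotlib's `Axes.twinx`. The function below is a sketch that assumes you have already recorded one value per epoch for each measure; the function name and line styles are arbitrary choices.

```python
import matplotlib.pyplot as plt

def plot_training_curves(train_mse, val_mse, val_acc):
    """Plot MSE curves on the left axis and VS accuracy on the right axis."""
    epochs = range(1, len(train_mse) + 1)
    fig, ax_mse = plt.subplots()
    ax_mse.plot(epochs, train_mse, "b-", label="Training MSE")
    ax_mse.plot(epochs, val_mse, "r--", label="VS MSE")
    ax_mse.set_xlabel("Epoch")
    ax_mse.set_ylabel("MSE")

    ax_acc = ax_mse.twinx()          # second y-axis sharing the same x-axis
    ax_acc.plot(epochs, val_acc, "g:", label="VS accuracy")
    ax_acc.set_ylabel("Accuracy")

    # Merge the lines from both axes into one legend.
    lines = list(ax_mse.get_lines()) + list(ax_acc.get_lines())
    ax_mse.legend(lines, [line.get_label() for line in lines], loc="center right")
    return fig, ax_mse, ax_acc
```

In a notebook, calling `plot_training_curves(...)` followed by `plt.show()` displays the figure.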

---

In [None]:
# Iris Classification

## 3. (4%) Working with the Vowel Dataset - Learning Rate

Load the [Vowel Dataset](https://raw.githubusercontent.com/cs472ta/CS472/master/datasets/vowel.arff)

- Use one layer of hidden nodes with the number of hidden nodes being twice the number of inputs.
- Use random 75/25 splits of the data for the training/test set.
- Try some different learning rates (LR). 
- For each LR find the best VS solution (in terms of VS MSE).
- Create one graph with MSE for the training set, VS, and test set, at your chosen VS stopping epoch for each tested learning rate on the x-axis.  
- Create another graph showing the number of epochs needed to get to the best VS solution on the y-axis for each tested learning rate on the x-axis. 

In general, whenever you are testing a parameter such as LR, # of hidden nodes, etc., test values until no more improvement is found. For example, if 20 hidden nodes did better than 10, you would not stop at 20, but would try 40, etc., until you no longer get improvement.
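
The "keep doubling until it stops helping" procedure described above could be sketched as follows. The `evaluate` callback is a placeholder you would supply, for example one that trains a model with the given number of hidden nodes and returns its best VS MSE. The `max_value` cap is an added safety assumption.

```python
def search_by_doubling(evaluate, start, max_value):
    """Double the parameter until the score stops improving (lower is better)."""
    best_value, best_score = start, evaluate(start)
    value = start * 2
    while value <= max_value:
        score = evaluate(value)
        if score >= best_score:      # no improvement: stop searching
            break
        best_value, best_score = value, score
        value *= 2
    return best_value, best_score
```

A single noisy run can end the search early, so averaging a few runs inside `evaluate` makes the stopping decision more reliable.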

In [20]:
# Train on each dataset

## 3.1 (8%) Working with the Vowel Dataset - Intuition
- Discuss the effect of varying learning rates. 
- Discuss why the vowel data set might be more difficult and report the baseline accuracy. 
- Consider which of the given input features you should actually use (train/test, speaker, gender, etc.) and discuss why you chose the ones you did.
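
Baseline accuracy is taken here to mean the accuracy of always predicting the most common class; that reading is an assumption. Under it, the value can be computed directly from the label column:

```python
import numpy as np

def baseline_accuracy(y):
    """Accuracy obtained by always predicting the most frequent class in y."""
    _, counts = np.unique(y, return_counts=True)
    return counts.max() / counts.sum()
```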

Typical backpropagation accuracies for the Vowel data set are above 75%.


Note that each LR will probably require a different number of epochs to learn.


Also note that the proper approach in this case would be to average the results of multiple random initial conditions (splits and initial weight settings) for each learning rate. To minimize work you may just do each learning rate once with the same initial conditions.


If you would like, you may average the results of multiple initial conditions (e.g. 3) per LR, which would give more reliable results.

*Discuss intuition here*



*Explanation goes here*

## 3.2 (10%) Working with the Vowel Dataset - Hidden Layer Nodes

Using the best LR you discovered, experiment with different numbers of hidden nodes.

- Start with 1 hidden node, then 2, and then double them for each test until you get no more improvement in accuracy. 
- For each number of hidden nodes find the best VS solution (in terms of VS MSE).  
- Create one graph with MSE for the training set, VS, and test set, on the y-axis and # of hidden nodes on the x-axis.

*Discuss Hidden Layer Nodes here*



*Explanation goes here*

## 3.3 (10%) Working with the Vowel Dataset - Momentum

Try some different momentum terms in the learning equation using the best number of hidden nodes and LR from your earlier experiments.

- Graph as in step 3.2 but with momentum on the x-axis and number of epochs until VS convergence on the y-axis.
- You are trying to see how much momentum speeds up learning. 
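
"Epochs until VS convergence" needs a concrete definition. One common choice, used here purely as an assumption, is to stop once the VS MSE has not improved for `patience` consecutive epochs and report the epoch of the best value. The sketch works on a recorded list of per-epoch VS MSE values.

```python
def best_epoch_with_patience(val_mse_per_epoch, patience=10):
    """Return (best_epoch, best_mse), stopping after `patience` epochs without improvement."""
    best_mse, best_epoch = float("inf"), 0
    for epoch, mse in enumerate(val_mse_per_epoch, start=1):
        if mse < best_mse:
            best_mse, best_epoch = mse, epoch
        elif epoch - best_epoch >= patience:
            break                    # VS MSE has stalled for `patience` epochs
    return best_epoch, best_mse
```

Inside a real training loop you would apply the same test after each epoch and keep a copy of the weights from the best epoch.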

*Discuss momentum here*



*Explanation goes here*

## 4.1 (10%) Use the scikit-learn (SK) version of the MLP classifier on the iris and vowel data sets.  

You do not need to go through all the steps above, nor graph results. Compare results (accuracy and learning speed) between your version and theirs for some selection of hyper-parameters. Try different hyper-parameters and comment on their effect.

At a minimum, try

- number of hidden nodes and layers
- different activation functions
- learning rate
- regularization and parameters
- momentum (and try nesterov)
- early stopping
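
The hyper-parameters listed above map onto `MLPClassifier` constructor arguments roughly as in this sketch. The specific values are arbitrary starting points, and scikit-learn's bundled copy of iris stands in for the ARFF file.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

clf = MLPClassifier(
    hidden_layer_sizes=(8,),      # one hidden layer with 8 nodes
    activation="logistic",        # also: "tanh", "relu", "identity"
    solver="sgd",                 # momentum settings only apply to SGD
    learning_rate_init=0.1,
    alpha=1e-4,                   # L2 regularization strength
    momentum=0.5,
    nesterovs_momentum=False,     # set True to try Nesterov momentum
    early_stopping=True,          # holds out part of the training data
    validation_fraction=0.2,
    n_iter_no_change=10,
    max_iter=1000,
    random_state=0,
)
clf.fit(X_train, y_train)
test_accuracy = clf.score(X_test, y_test)
print(f"epochs run: {clf.n_iter_}, test accuracy: {test_accuracy:.3f}")
```

`clf.n_iter_` gives a rough measure of learning speed to compare against your own implementation's epoch count.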

In [28]:
# Load sklearn MLPClassifier

# Train on the iris and vowel datasets

*Record impressions*

## 4.2 (5%) Pick a data set of your choice and learn it with the SK MLP version. 
- Use a grid and/or random search approach across a reasonable subset of hyper-parameters from the above 
- Report your best accuracy and hyper-parameters for your chosen data set. 
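
A grid search and a random search over the same parameter space might look like the sketch below. The wine data set, the scaling step, and the small parameter grid are placeholders chosen for the example, not recommendations.

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_wine(return_X_y=True)

# Scale the inputs first; parameters of the MLP step are addressed
# through the "mlpclassifier__" prefix that make_pipeline generates.
model = make_pipeline(
    StandardScaler(),
    MLPClassifier(solver="sgd", max_iter=300, random_state=0),
)
param_grid = {
    "mlpclassifier__hidden_layer_sizes": [(8,), (16,)],
    "mlpclassifier__learning_rate_init": [0.01, 0.1],
    "mlpclassifier__momentum": [0.0, 0.9],
}

grid = GridSearchCV(model, param_grid, cv=3)          # tries all 8 combinations
grid.fit(X, y)

rand = RandomizedSearchCV(model, param_grid, n_iter=4, cv=3, random_state=0)
rand.fit(X, y)                                        # samples 4 combinations

print("grid:  ", grid.best_params_, round(grid.best_score_, 3))
print("random:", rand.best_params_, round(rand.best_score_, 3))
```

`best_score_` is the mean cross-validated accuracy, so it is worth confirming the chosen settings on a separate held-out test set before reporting a final number.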

In [28]:
# Load sklearn MLPClassifier

# Train on your chosen dataset using grid and/or random search

## 5. (Optional 5% Extra credit) For the vowel data set use both the grid and random search approaches to find the hyper-parameters LR, # of hidden nodes, and momentum.  

- Compare and discuss the values found with the ones you found in section 3 (3 through 3.3).


In [None]:
# Optional grid and random search

*Discuss findings here*