# Homework 2

For this assignment, you will be developing an artificial neural network to classify data given in the __[Dry Beans Data Set](https://archive.ics.uci.edu/ml/datasets/Dry+Bean+Dataset#)__. This data set was obtained as a part of a research study by Selcuk University, Turkey, in which a computer vision system was developed to distinguish seven different registered varieties of dry beans with similar features. More details on the study can be found in the following __[research paper](https://www.sciencedirect.com/science/article/pii/S0168169919311573)__.

## About the Data Set
Seven different types of dry beans were used in a study in Selcuk University, Turkey, taking into account the features such as form, shape, type, and structure by the market situation. A computer vision system was developed to distinguish seven different registered varieties of dry beans with similar features in order to obtain uniform seed classification. For the classification model, images of 13611 grains of 7 different registered dry beans were taken with a high-resolution camera. Bean images obtained by computer vision system were subjected to segmentation and feature extraction stages, and a total of 16 features - 12 dimensions and 4 shape forms - were obtained from the grains.

Number of Instances (records in the data set): __13611__

Number of Attributes (fields within each record, including the class): __17__

### Data Set Attribute Information:

1. __Area (A)__ : The area of a bean zone and the number of pixels within its boundaries.
2. __Perimeter (P)__ : Bean circumference is defined as the length of its border.
3. __Major axis length (L)__ : The distance between the ends of the longest line that can be drawn from a bean.
4. __Minor axis length (l)__ : The longest line that can be drawn from the bean while standing perpendicular to the main axis.
5. __Aspect ratio (K)__ : Defines the relationship between L and l.
6. __Eccentricity (Ec)__ : Eccentricity of the ellipse having the same moments as the region.
7. __Convex area (C)__ : Number of pixels in the smallest convex polygon that can contain the area of a bean seed.
8. __Equivalent diameter (Ed)__ : The diameter of a circle having the same area as a bean seed area.
9. __Extent (Ex)__ : The ratio of the pixels in the bounding box to the bean area.
10. __Solidity (S)__ : Also known as convexity. The ratio of the pixels in the convex shell to those found in beans.
11. __Roundness (R)__ : Calculated with the following formula: (4piA)/(P^2)
12. __Compactness (CO)__ : Measures the roundness of an object: Ed/L
13. __ShapeFactor1 (SF1)__
14. __ShapeFactor2 (SF2)__
15. __ShapeFactor3 (SF3)__
16. __ShapeFactor4 (SF4)__

17. __Classes : *Seker, Barbunya, Bombay, Cali, Dermosan, Horoz, Sira*__

### Libraries that can be used :
- NumPy, SciPy, Pandas, Sci-Kit Learn, TensorFlow, Keras
- Any other library used during the lectures and discussion sessions.

### Other Notes
- Don't worry about not being able to achieve high accuracy, it is neither the goal nor the grading standard of this assignment.
- Discussion materials should be helpful for doing the assignments.
- The homework submission should be a .ipynb file.



## Exercise 1 : Building a Feed-Forward Neural Network(50 points)

### Exercise 1.1 : Data Preprocessing (10 points)

- As the classes are categorical, use one-hot encoding to represent the set of classes. You will find this useful when developing the output layer of the neural network.
- Normalize each field of the input data using the min-max normalization technique.



In [2]:
from google.colab import drive
drive.mount('/content/drive')

import pandas as pd
import numpy as np
import tensorflow as tf
from sklearn.model_selection import train_test_split,cross_val_score, KFold
from sklearn.preprocessing import OneHotEncoder, MinMaxScaler
from sklearn.metrics import confusion_matrix, precision_score, recall_score, accuracy_score, mean_squared_error
from sklearn.neural_network import MLPClassifier
import matplotlib.pyplot as plt
from keras.models import Sequential
from keras.layers import Dense
from keras.optimizers import SGD

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [3]:
data = pd.read_csv('/content/sample_data/Dry_Beans_Dataset.csv')

In [4]:
one_hot = pd.get_dummies(data['Class'])
data = data.drop('Class', axis=1)
data = data.join(one_hot)
scaler = MinMaxScaler()
data_normalized = pd.DataFrame(scaler.fit_transform(data), columns = data.columns)
X_train, X_test, y_train, y_test = train_test_split(data_normalized, one_hot, test_size=0.1, random_state=42)

### Exercise 1.2 : Training and Testing the Neural Network (40 points)

Design a 4-layer artificial neural network, specifically a feed-forward multi-layer perceptron (using the sigmoid activation function), to classify the type of 'Dry Bean' given the other attributes in the data set, similar to the one mentioned in the paper above. Please note that this is a multi-class classification problem so select the right number of nodes accordingly for the output layer.

For training and testing the model, split the data into training and testing set by __90:10__ and use the training set for training the model and the test set to evaluate the model performance.

Consider the following hyperparameters while developing your model :

- Number of nodes in each hidden layer should be (12, 3)
- Learning rate should be 0.3
- Number of epochs should be 500
- The sigmoid function should be used as the activation function in each layer
- Stochastic Gradient Descent should be used to minimize the error rate

__Requirements once the model has been trained :__

- A confusion matrix for all classes, specifying the true positive, true negative, false positive, and false negative cases for each category in the class
- The accuracy and mean squared error (MSE) of the model
- The precision and recall for each label in the class

__Notes :__

- Splitting of the dataset should be done __after__ the data preprocessing step.
- The mean squared error (MSE) values obtained __should be positive__.


In [5]:
def create_model():
    model = Sequential()
    model.add(Dense(12, input_dim=data_normalized.shape[1], activation='sigmoid'))
    model.add(Dense(12, activation='sigmoid'))
    model.add(Dense(12, activation='sigmoid'))
    model.add(Dense(one_hot.shape[1], activation='sigmoid'))
    sgd = SGD(learning_rate=0.3)
    model.compile(loss='categorical_crossentropy', optimizer=sgd)
    return model

In [6]:
model = create_model()

model.fit(X_train, y_train, batch_size=10000,epochs=500)

Epoch 1/500
Epoch 2/500
Epoch 3/500
Epoch 4/500
Epoch 5/500
Epoch 6/500
Epoch 7/500
Epoch 8/500
Epoch 9/500
Epoch 10/500
Epoch 11/500
Epoch 12/500
Epoch 13/500
Epoch 14/500
Epoch 15/500
Epoch 16/500
Epoch 17/500
Epoch 18/500
Epoch 19/500
Epoch 20/500
Epoch 21/500
Epoch 22/500
Epoch 23/500
Epoch 24/500
Epoch 25/500
Epoch 26/500
Epoch 27/500
Epoch 28/500
Epoch 29/500
Epoch 30/500
Epoch 31/500
Epoch 32/500
Epoch 33/500
Epoch 34/500
Epoch 35/500
Epoch 36/500
Epoch 37/500
Epoch 38/500
Epoch 39/500
Epoch 40/500
Epoch 41/500
Epoch 42/500
Epoch 43/500
Epoch 44/500
Epoch 45/500
Epoch 46/500
Epoch 47/500
Epoch 48/500
Epoch 49/500
Epoch 50/500
Epoch 51/500
Epoch 52/500
Epoch 53/500
Epoch 54/500
Epoch 55/500
Epoch 56/500
Epoch 57/500
Epoch 58/500
Epoch 59/500
Epoch 60/500
Epoch 61/500
Epoch 62/500
Epoch 63/500
Epoch 64/500
Epoch 65/500
Epoch 66/500
Epoch 67/500
Epoch 68/500
Epoch 69/500
Epoch 70/500
Epoch 71/500
Epoch 72/500
Epoch 73/500
Epoch 74/500
Epoch 75/500
Epoch 76/500
Epoch 77/500
Epoch 78

<keras.callbacks.History at 0x7fae95e87130>

In [7]:
y_pred = model.predict(X_test)

y_pred_classes = np.argmax(y_pred, axis = 1)
y_true = np.argmax(y_test.values, axis = 1)

cm = confusion_matrix(y_true, y_pred_classes)

accuracy = accuracy_score(y_true, y_pred_classes)
mse = mean_squared_error(y_true, y_pred_classes)

precision = precision_score(y_true, y_pred_classes, average=None)
recall = recall_score(y_true, y_pred_classes, average=None)

print(cm)
print("Acc:", accuracy)
print("MSE:", mse)
print("Precision:", precision)
print("Recall:", recall)

[[134   0   2   0   0   1   0]
 [ 63   0   0   0   0   0   0]
 [  1   0 173   0  21   0   0]
 [  0   0   0 342   0   0   0]
 [  0   0   0   0 181   0   0]
 [  0   0   0   0   0 200   0]
 [  0   0   0   0   0   0 244]]
Acc: 0.9353891336270191
MSE: 0.13509544787077826
Precision: [0.67676768 0.         0.98857143 1.         0.8960396  0.99502488
 1.        ]
Recall: [0.97810219 0.         0.88717949 1.         1.         1.
 1.        ]


  _warn_prf(average, modifier, msg_start, len(result))


## Exercise 2 : k-fold Cross Validation (20 points)

In order to avoid using biased models, use 10-fold cross validation to generalize the model based on the given data set.

__Requirements :__
- The accuracy and MSE values during each iteration of the cross validation
- The overall average accuracy and MSE value

__Note :__ The mean squared error (MSE) values obtained should be positive.

In [11]:
k = 10
kf = KFold(n_splits=k, shuffle=True, random_state=42)
accuracy_scores = []
mse_scores = []

for train_index, test_index in kf.split(data_normalized):

    model = create_model()

    model.fit(X_train, y_train, batch_size=10000, epochs=500)

    y_pred = model.predict(X_test)

    y_pred_classes = np.argmax(y_pred, axis=1)
    y_true = np.argmax(y_test.values, axis=1)

    accuracy = accuracy_score(y_true, y_pred_classes)
    mse = mean_squared_error(y_true, y_pred_classes)

    accuracy_scores.append(accuracy)
    mse_scores.append(mse)

print("Acc of each Iter: ", accuracy_scores)
print("MSE of each Iter: ", mse_scores)
average_accuracy = np.mean(accuracy_scores)
average_mse = np.mean(mse_scores)

print("Overall Avg Acc:", average_accuracy)
print("Overall Avg MSE:", average_mse)




[1;30;43mStreaming output truncated to the last 5000 lines.[0m
Epoch 6/500
Epoch 7/500
Epoch 8/500
Epoch 9/500
Epoch 10/500
Epoch 11/500
Epoch 12/500
Epoch 13/500
Epoch 14/500
Epoch 15/500
Epoch 16/500
Epoch 17/500
Epoch 18/500
Epoch 19/500
Epoch 20/500
Epoch 21/500
Epoch 22/500
Epoch 23/500
Epoch 24/500
Epoch 25/500
Epoch 26/500
Epoch 27/500
Epoch 28/500
Epoch 29/500
Epoch 30/500
Epoch 31/500
Epoch 32/500
Epoch 33/500
Epoch 34/500
Epoch 35/500
Epoch 36/500
Epoch 37/500
Epoch 38/500
Epoch 39/500
Epoch 40/500
Epoch 41/500
Epoch 42/500
Epoch 43/500
Epoch 44/500
Epoch 45/500
Epoch 46/500
Epoch 47/500
Epoch 48/500
Epoch 49/500
Epoch 50/500
Epoch 51/500
Epoch 52/500
Epoch 53/500
Epoch 54/500
Epoch 55/500
Epoch 56/500
Epoch 57/500
Epoch 58/500
Epoch 59/500
Epoch 60/500
Epoch 61/500
Epoch 62/500
Epoch 63/500
Epoch 64/500
Epoch 65/500
Epoch 66/500
Epoch 67/500
Epoch 68/500
Epoch 69/500
Epoch 70/500
Epoch 71/500
Epoch 72/500
Epoch 73/500
Epoch 74/500
Epoch 75/500
Epoch 76/500
Epoch 77/500
Epo

## Exercise 3 : Hyperparameter Tuning (25 points)

Use either grid search or random search methodology to find the optimal number of nodes required in each hidden layer, as well as the optimal learning rate and the number of epochs, such that the accuracy of the model is maximum for the given data set.

__Requirements :__
- The set of optimal hyperparameters
- The maximum accuracy achieved using this set of optimal hyperparameters

__Note :__ Hyperparameter tuning takes a lot of time to execute. Make sure that you choose the appropriate number of each hyperparameter (preferably 3 of each), and that you allocate enough time to execute your code.

In [12]:
X_train, X_test, y_train, y_test = train_test_split(data_normalized, one_hot, test_size=0.1, random_state=42)
hidden_layer_sizes = [10, 20, 30]
learning_rates = [0.1, 0.01, 0.001]
num_epochs = [100, 200, 300]
best_hidden_layer_size = None
best_learning_rate = None
best_num_epochs = None
max_accuracy = 0.0

for hidden_size in hidden_layer_sizes:
    for learning_rate in learning_rates:
        for epochs in num_epochs:
          model = MLPClassifier(hidden_layer_sizes=(hidden_size,), learning_rate_init=learning_rate, max_iter=epochs)

          model.fit(X_train, y_train)

          y_pred = model.predict(X_test)

          accuracy = accuracy_score(y_test, y_pred)

          if accuracy > max_accuracy:
              max_accuracy = accuracy
              best_hidden_layer_size = hidden_size
              best_learning_rate = learning_rate
              best_num_epochs = epochs
print('max Acc: ', max_accuracy)
print('best hidden size: ', best_hidden_layer_size)
print('best learning rate:', best_learning_rate)
print('best num epochs:', best_num_epochs)

max Acc:  1.0
best hidden size:  10
best learning rate: 0.1
best num epochs: 100


## Exercise 4 - Collaborative Statement (5 points)

It is mandatory to include a Statement of Collaboration in each submission, that follows the guidelines below.
Include the names of everyone involved in the discussions (especially in-person ones), and what was discussed.
All students are required to follow the academic honesty guidelines posted on the course website. For
programming assignments in particular, I encourage students to organize (perhaps using Piazza) to discuss the
task descriptions, requirements, possible bugs in the support code, and the relevant technical content before they
start working on it. However, you should not discuss the specific solutions, and as a guiding principle, you are
not allowed to take anything written or drawn away from these discussions (no photographs of the blackboard,
written notes, referring to Piazza, etc.). Especially after you have started working on the assignment, try to restrict
the discussion to Piazza as much as possible, so that there is no doubt as to the extent of your collaboration.

I did not collaborate with anyone on this homework. However, I did get some tips from piazza such as question https://piazza.com/class/lkj66eb9vpy7am/post/63 as well as https://piazza.com/class/lkj66eb9vpy7am/post/45 and then to increase batch size to quicken the epoch speed through discord https://discord.com/channels/1136433873534853143/1136433874998665335/1145228511083237427.
