# Machine Learning Assignment 4 - Neural Networks (Dry Bean)

## Install required packages

In [None]:
%pip install -r requirements.txt

In [68]:
import pandas as pd
import tensorflow as tf
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split

## Load the data

In [69]:
# Load the dataset
data = pd.read_csv('./data/Lecture_10_Dry_Bean_Dataset.csv')

## Exploratory Data Analysis (EDA)

In [70]:
print(data.dtypes)

print(data.head())

print(data.isnull().sum())

print(data.describe())

# Create a correlation matrix
correlation_matrix = data.drop('Class', axis=1).corr()
print(correlation_matrix)

# Visualize the correlation matrix
# plt.figure(figsize=(10, 10))
# sns.heatmap(correlation_matrix, annot=True, cmap='Blues')
# plt.show()

# Count outliers per numeric column, flagging values with |z-score| > 3
threshold = 3
for column in data.columns:
    if data[column].dtype != 'object':
        z_score = (data[column] - data[column].mean()) / data[column].std()
        outliers = data[z_score.abs() > threshold]
        print(f'Number of outliers in {column} : {len(outliers)}')

Area                 int64
Perimeter          float64
MajorAxisLength    float64
MinorAxisLength    float64
AspectRation       float64
Eccentricity       float64
ConvexArea           int64
EquivDiameter      float64
Extent             float64
Solidity           float64
roundness          float64
Compactness        float64
ShapeFactor1       float64
ShapeFactor2       float64
ShapeFactor3       float64
ShapeFactor4       float64
Class               object
dtype: object
    Area  Perimeter  MajorAxisLength  MinorAxisLength  AspectRation   
0  28395    610.291       208.178117       173.888747      1.197191  \
1  28734    638.018       200.524796       182.734419      1.097356   
2  29380    624.110       212.826130       175.931143      1.209713   
3  30008    645.884       210.557999       182.516516      1.153638   
4  30140    620.134       201.847882       190.279279      1.060798   

   Eccentricity  ConvexArea  EquivDiameter    Extent  Solidity  roundness   
0      0.549812       2

### Conclusion of EDA

![image.png](img/corr_heatmap.png)

- Class is a categorical variable
- All other variables are numerical
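- The numerical variables are on very different scales (e.g. `Area` in the tens of thousands, `Eccentricity` below 1), so we will standardize them to zero mean and unit variance before training
- There are some highly correlated variables (e.g. `Area` and `Perimeter`)
- There is a fair number of outliers in the data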
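A caveat on the re-scaling point in the list above: z-score standardization shifts and scales a feature, but it does not change the shape of its distribution, so a skewed feature stays skewed. The sketch below illustrates this on synthetic data. The lognormal sample and the `skewness` helper are assumptions made for illustration and are not part of the Dry Bean dataset or the notebook.

```python
import numpy as np

# Synthetic, right-skewed feature standing in for a column such as `Area`
# (illustrative data, not the Dry Bean values).
rng = np.random.default_rng(0)
feature = rng.lognormal(mean=10.0, sigma=0.5, size=5_000)


def skewness(x):
    """Sample skewness: the mean of the cubed z-scores."""
    z = (x - x.mean()) / x.std()
    return float(np.mean(z ** 3))


# Z-score standardization, the per-column transform StandardScaler applies.
standardized = (feature - feature.mean()) / feature.std()

print(f"mean after scaling: {standardized.mean():.3f}")
print(f"std after scaling:  {standardized.std():.3f}")
print(f"skewness before:    {skewness(feature):.3f}")
print(f"skewness after:     {skewness(standardized):.3f}")
```

The mean and standard deviation become 0 and 1, while the skewness is the same before and after. For a neural network this is usually sufficient, since the goal is comparable feature scales rather than normality.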

## Preprocess the data and Create the Model

In [78]:
X = data.drop('Class', axis=1).values
y = data['Class'].values

# Perform one-hot encoding on the target variable
y = pd.get_dummies(y).values

# Split the data into train and test sets (80-20 split)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Normalize the data using Z-score normalization
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

# Convert the data to TensorFlow tensors
X_train = tf.convert_to_tensor(X_train, dtype=tf.float32)
X_test = tf.convert_to_tensor(X_test, dtype=tf.float32)
y_train = tf.convert_to_tensor(y_train, dtype=tf.float32)
y_test = tf.convert_to_tensor(y_test, dtype=tf.float32)

# Define the neural network architecture
model = tf.keras.Sequential([
    tf.keras.layers.Dense(64, activation='relu', input_shape=(X_train.shape[1],)),
    tf.keras.layers.Dense(64, activation='relu'),
    tf.keras.layers.Dense(y_train.shape[1], activation='softmax')
])

# Compile the model
model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])

# Train the model
model.fit(X_train, y_train, epochs=20, verbose=1)

Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20
Epoch 7/20
Epoch 8/20
Epoch 9/20
Epoch 10/20
Epoch 11/20
Epoch 12/20
Epoch 13/20
Epoch 14/20
Epoch 15/20
Epoch 16/20
Epoch 17/20
Epoch 18/20
Epoch 19/20
Epoch 20/20


<keras.callbacks.History at 0x14d0ac62800>

## Evaluate the Model

In [None]:

# Evaluate the model on the training set
train_loss, train_accuracy = model.evaluate(X_train, y_train, verbose=0)
print('Training Loss:', train_loss)
print('Training Accuracy:', train_accuracy)

# Evaluate the model on the test set
test_loss, test_accuracy = model.evaluate(X_test, y_test, verbose=0)
print('Test Loss:', test_loss)
print('Test Accuracy:', test_accuracy)
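
Overall accuracy can hide classes that the model confuses with each other. As a possible extension (not something this notebook runs), the sketch below builds a confusion matrix and per-class recall from one-hot labels and predicted probabilities. The `confusion_matrix` helper and the toy arrays are hypothetical stand-ins for `y_test` and the model's predictions.

```python
import numpy as np


def confusion_matrix(y_true_onehot, y_prob):
    """Count predictions: rows are true classes, columns are predicted classes."""
    true_idx = np.argmax(y_true_onehot, axis=1)
    pred_idx = np.argmax(y_prob, axis=1)
    n_classes = y_true_onehot.shape[1]
    matrix = np.zeros((n_classes, n_classes), dtype=int)
    # Unbuffered add, so repeated (true, predicted) pairs are all counted.
    np.add.at(matrix, (true_idx, pred_idx), 1)
    return matrix


# Toy stand-ins for one-hot labels and softmax outputs (3 classes, 5 samples).
y_true = np.array([[1, 0, 0], [1, 0, 0], [0, 1, 0], [0, 0, 1], [0, 0, 1]])
y_prob = np.array([
    [0.8, 0.1, 0.1],
    [0.3, 0.6, 0.1],  # a class-0 sample misclassified as class 1
    [0.2, 0.7, 0.1],
    [0.1, 0.2, 0.7],
    [0.1, 0.1, 0.8],
])

cm = confusion_matrix(y_true, y_prob)
per_class_recall = cm.diagonal() / cm.sum(axis=1)
overall_accuracy = cm.trace() / cm.sum()

print(cm)
print("Recall per class:", per_class_recall)
print("Overall accuracy:", overall_accuracy)
```

With the notebook's variables, the call would presumably look like `confusion_matrix(y_test.numpy(), model.predict(X_test))`, assuming `y_test` is still the one-hot tensor created above.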

## Conclusion

### Justification of TensorFlow over PyTorch
I chose TensorFlow over PyTorch because I have more experience with it: I have used TensorFlow in a few past projects, along with Keras, its high-level API, and I am more comfortable with it. I have not used PyTorch before, although I have heard that it feels similar to Keras. I also found that TensorFlow has better documentation and more resources online.

### Justification of the number of epochs
I chose 10 epochs because the model was already performing well after 10 epochs and improved only marginally with further training. I also tried 20 epochs: the model performed slightly better, but the gain (about 0.2 percentage points of test accuracy) was too small to justify the longer training time, so I decided to stick with 10 epochs. The training cell above is shown with `epochs=20`, the longer of the two runs compared below.
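
The epoch count here was picked by comparing two manual runs. A common alternative, which I did not use in this notebook, is an early-stopping rule: stop once the validation loss has not improved for a fixed number of epochs. The sketch below shows the rule itself in plain Python. The function name, the `patience` and `min_delta` parameters, and the loss values are all made up for illustration.

```python
def best_epoch_with_patience(val_losses, patience=3, min_delta=0.0):
    """Return the 1-based epoch with the best loss seen before stopping.

    Training counts as stopped once `patience` consecutive epochs fail to
    improve on the best loss by more than `min_delta`.
    """
    best_loss = float("inf")
    best_epoch = 0
    epochs_without_improvement = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best_loss - min_delta:
            best_loss = loss
            best_epoch = epoch
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                break  # stop: no improvement for `patience` epochs in a row
    return best_epoch


# Made-up validation losses for illustration only (not measured in this notebook).
example_losses = [0.60, 0.40, 0.30, 0.25, 0.24, 0.245, 0.25, 0.26, 0.23]
print(best_epoch_with_patience(example_losses, patience=3))  # -> 5
```

Keras ships the same idea as the `tf.keras.callbacks.EarlyStopping` callback (with `monitor`, `patience`, `min_delta` and `restore_best_weights` arguments), which would need validation data passed to `model.fit`, for example via `validation_split`.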

### 10 epochs vs 20 epochs

Results for 10 epochs
- Training Loss: 0.18735215067863464
- Training Accuracy: 0.9314842224121094
- Test Loss: 0.20492908358573914
- Test Accuracy: 0.9269188642501831
    

Results for 20 epochs
- Training Loss: 0.17226286232471466
- Training Accuracy: 0.9359845519065857
- Test Loss: 0.1969957947731018
- Test Accuracy: 0.9287550449371338   
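
To make the comparison concrete, the short sketch below computes the differences between the two runs from the numbers listed above. The dictionary layout and variable names are my own choice for illustration.

```python
# Metrics copied from the two result lists above.
results = {
    10: {"train_loss": 0.18735215067863464, "train_acc": 0.9314842224121094,
         "test_loss": 0.20492908358573914, "test_acc": 0.9269188642501831},
    20: {"train_loss": 0.17226286232471466, "train_acc": 0.9359845519065857,
         "test_loss": 0.1969957947731018, "test_acc": 0.9287550449371338},
}

# Gain on the test set from doubling the number of epochs.
test_acc_gain = results[20]["test_acc"] - results[10]["test_acc"]
test_loss_drop = results[10]["test_loss"] - results[20]["test_loss"]

# Train/test accuracy gap per run, a rough indicator of overfitting.
gaps = {epochs: m["train_acc"] - m["test_acc"] for epochs, m in results.items()}

print(f"Test accuracy gain: {test_acc_gain * 100:.2f} percentage points")
print(f"Test loss drop:     {test_loss_drop:.4f}")
for epochs, gap in gaps.items():
    print(f"Train-test accuracy gap at {epochs} epochs: {gap * 100:.2f} points")
```

Doubling the epochs gains about 0.18 percentage points of test accuracy, while the train-test accuracy gap grows from about 0.46 to about 0.72 points. Each figure comes from a single run, so differences this small may partly reflect run-to-run variation.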