# Pattern Recognition Course Assignment - Part D
## Team 28
### Student Names: Μαχμουτάϊ Έλενα, Τσουκαλά Ναταλία

## Part D

In this part, our goal is to develop a classification algorithm using `datasetC.csv` as our training set. The dataset contains 5000 samples, each with 400 features and a label from 1 to 5, and our task is to implement and train a classification method on it.

Following the training phase, we apply the trained model to the unlabeled `datasetCTest.csv` test set. The output is a NumPy vector named `labels28` that contains the model's predictions.

## First Experiment

For our first experiment, we trained and evaluated a neural network on `datasetC.csv` using `TensorFlow` and `Keras`. The dataset is split evenly into training and testing sets, and the architecture is varied systematically by trying different sizes for two hidden dense layers with ReLU activations. Each network is trained for 10 epochs with the Adam optimizer and sparse categorical cross-entropy loss, and its performance is evaluated on the test set. TensorBoard logs the training process so that training and validation metrics can be compared across runs. The aim is to identify the combination of hidden layer sizes that maximizes accuracy on this dataset. The test loss and accuracy are printed for each configuration.
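
As a quick sanity check on this setup, the sketch below works out how many samples end up in each subset and how many architectures the grid visits. It assumes the 5000-sample dataset from the introduction, the `test_size=0.5` and `validation_split=0.3` values used in the code that follows, and that Keras takes the validation samples from the end of the data passed to `fit`. The helper name `split_sizes` is ours, for illustration only.

```python
from itertools import product


def split_sizes(n_samples, test_percent, val_percent):
    """Return (fit, validation, test) sample counts using integer arithmetic."""
    n_test = n_samples * test_percent // 100
    n_train = n_samples - n_test
    n_val = n_train * val_percent // 100
    return n_train - n_val, n_val, n_test


# 5000 samples, 50% held out for testing, 30% of the rest for validation
n_fit, n_val, n_test = split_sizes(5000, 50, 30)
print(n_fit, n_val, n_test)

# Every (layer1, layer2) pair from the candidate sizes is trained once
layer_sizes = [10, 64, 128, 256]
configs = list(product(layer_sizes, repeat=2))
print(len(configs), configs[0], configs[-1])
```

For these particular numbers, only 1750 samples drive the weight updates in each of the 16 runs, which may contribute to the variation in test accuracy between configurations.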

We first load the TensorBoard notebook extension.

In [6]:
# Load the TensorBoard notebook extension
%load_ext tensorboard

In [7]:
import tensorflow as tf
from keras import layers
from keras.callbacks import TensorBoard
from sklearn.model_selection import train_test_split
import time
import pandas as pd
import os

# Load the dataset from the specified path
data = pd.read_csv(r"D:\projects\Pattern-Recognition\datasetC.csv", header=None)

# Hidden layer sizes to try for each of the two dense layers
layer_sizes = [10, 64, 128, 256]

# Extract feature matrix (X) and target variable (y) from the dataset
data = data.values
X = data[:, :-1]
y = data[:, -1]

# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5, random_state=42)

# Iterate over different layer sizes for the neural network
for layer1_size in layer_sizes:
    for layer2_size in layer_sizes:
        # Create a unique name for the experiment based on layer sizes and current timestamp
        NAME = "{}-layer1-{}-layer2-{}".format(layer1_size, layer2_size, int(time.time()))
        
        logdir = os.path.join("logs", NAME)
        tensorboard_callback = tf.keras.callbacks.TensorBoard(logdir, histogram_freq=1)

        # Create a sequential model (feedforward neural network)
        model = tf.keras.models.Sequential()
            
        # Flatten layer to transform input data into a 1D array
        model.add(layers.Flatten(input_shape=(400,)))
        
        # First dense layer with ReLU activation
        model.add(layers.Dense(layer1_size, activation='relu'))
        
        # Second dense layer with ReLU activation
        model.add(layers.Dense(layer2_size, activation='relu'))
        
        # Output layer with softmax activation for multiclass classification
        # (10 units: labels 1-5 are used directly as class indices, so units 0 and 6-9 are never targets)
        model.add(layers.Dense(10, activation='softmax'))

        # Compile the model with Adam optimizer, sparse categorical crossentropy loss, and accuracy metric
        model.compile(optimizer='adam',
                      loss='sparse_categorical_crossentropy',
                      metrics=['accuracy'])
        
        # Train the model on the training data for 10 epochs, with 30% validation split and TensorBoard callback
        model.fit(X_train, y_train, epochs=10, validation_split=0.3, callbacks=[tensorboard_callback])
       
        # Evaluate the trained model on the test set
        test_loss, test_acc = model.evaluate(X_test, y_test)
        print(f'Test loss: {test_loss}')
        print(f'Test accuracy: {test_acc}')

Test loss: 0.691868782043457
Test accuracy: 0.7576000094413757
Test loss: 0.7926264405250549
Test accuracy: 0.7523999810218811
Test loss: 0.7473664283752441
Test accuracy: 0.7703999876976013
Test loss: 0.8901172876358032
Test accuracy: 0.7552000284194946
Test loss: 0.6902957558631897
Test accuracy: 0.7835999727249146
Test loss: 0.84
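
To make the reported test loss easier to interpret, here is a minimal sketch of softmax followed by sparse categorical cross-entropy, written with the standard library only rather than TensorFlow. The logits and labels are made-up illustration values, and the function names are ours.

```python
import math


def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    shift = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(v - shift) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sparse_categorical_crossentropy(probs, label):
    """Loss for one sample: negative log-probability of the true class index."""
    return -math.log(probs[label])


# Hypothetical logits for two samples over 10 output units; true labels 2 and 5
batch_logits = [
    [0.0, 0.5, 3.0, 0.2, 0.1, 0.3, 0.0, 0.0, 0.0, 0.0],
    [0.0, 0.4, 0.3, 0.2, 0.1, 2.0, 0.0, 0.0, 0.0, 0.0],
]
labels = [2, 5]

losses = [
    sparse_categorical_crossentropy(softmax(logits), label)
    for logits, label in zip(batch_logits, labels)
]
mean_loss = sum(losses) / len(losses)

# exp(-mean loss) is the geometric mean of the true-class probabilities
print(mean_loss, math.exp(-mean_loss))
```

Read this way, the first run's test loss of about 0.69 corresponds to a geometric-mean probability of roughly exp(-0.69) ≈ 0.50 on the correct class.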

Now we start TensorBoard within the notebook using [magics](https://ipython.readthedocs.io/en/stable/interactive/magics.html):

In [10]:
%tensorboard --logdir logs

Reusing TensorBoard on port 6006 (pid 17032), started 0:12:50 ago. (Use '!kill 17032' to kill it.)

Using TensorBoard, we find that the best validation accuracies in our first experiment are the following:

| Validation accuracy | Layer 1 size | Layer 2 size |
| --- | --- | --- |
| 0.8053 | 64 | 128 |
| 0.8 | 64 | 10 |

## Second Experiment

For this experiment, we wanted to determine whether adding a layer to the network from the previous experiment would drastically improve our validation accuracy. We therefore repeated the investigation with a neural network that has three hidden layers, each followed by dropout with a rate of 0.5.

In [11]:
# Clear any logs from previous runs
!rmdir /s /q .\logs

In [12]:
import tensorflow as tf
from keras import layers
from keras.callbacks import TensorBoard
from sklearn.model_selection import train_test_split
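import os
import time
import pandas as pd
import numpy as np

# Load the dataset
data = pd.read_csv(r"D:\projects\Pattern-Recognition\datasetC.csv", header=None)

# Hidden layer sizes to try for each of the three dense layers
layer_sizes = [10, 64, 128, 256]

# Split features (all but the last column) from labels (last column)
data = data.values
X = data[:, :-1]
y = data[:, -1]

# Hold out half of the samples as a test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5, random_state=42)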

for layer1_size in layer_sizes:
    for layer2_size in layer_sizes:
        for layer3_size in layer_sizes:
            NAME = "{}-layer1-{}-layer2-{}-layer3-{}".format(layer1_size, layer2_size, layer3_size, int(time.time()))
            logdir = os.path.join("logs", NAME)
            tensorboard_callback = tf.keras.callbacks.TensorBoard(logdir, histogram_freq=1)

            model = tf.keras.models.Sequential()
            
            model.add(layers.Flatten(input_shape=(400,)))
            model.add(layers.Dense(layer1_size, activation='relu'))
            model.add(layers.Dropout(0.5))
            model.add(layers.Dense(layer2_size, activation='relu'))
            model.add(layers.Dropout(0.5))
            model.add(layers.Dense(layer3_size, activation='relu'))
            model.add(layers.Dropout(0.5))
            model.add(layers.Dense(10, activation='softmax'))

            model.compile(optimizer='adam',
                          loss='sparse_categorical_crossentropy',
                          metrics=['accuracy'])
            
            model.fit(X_train, y_train, epochs=10, validation_split=0.3, callbacks=[tensorboard_callback])
           
            test_loss, test_acc = model.evaluate(X_test, y_test)
            print(f'Test loss: {test_loss}')
            print(f'Test accuracy: {test_acc}')

Test loss: 1.6339843273162842
Test accuracy: 0.319599986076355
Test loss: 1.3178486824035645
Test accuracy: 0.5260000228881836
Test loss: 1.1949197053909302
Test accuracy: 0.5619999766349792
Test loss: 1.0677372217178345
Test accuracy: 0.6967999935150146
Test loss: 1.3365092277526855
Test accuracy: 0.49720001220703125
Test loss: 0.9
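
The code for this experiment places `Dropout(0.5)` after each hidden layer. As a rough illustration of what that does during training, the sketch below implements inverted dropout in plain Python: each activation is zeroed with probability `rate`, and the survivors are scaled by `1 / (1 - rate)` so the expected sum is unchanged. The function name and the example activations are ours; the Keras layer operates on tensors and is inactive at inference time.

```python
import random


def dropout(activations, rate, training=True, rng=random):
    """Inverted dropout: zero each value with probability `rate`, rescale the rest."""
    if not training or rate == 0.0:
        return list(activations)  # inference: pass values through unchanged
    keep_scale = 1.0 / (1.0 - rate)
    return [0.0 if rng.random() < rate else a * keep_scale for a in activations]


rng = random.Random(0)  # fixed seed so the example is repeatable
hidden = [0.5, 1.0, 1.5, 2.0, 2.5, 3.0]

dropped = dropout(hidden, rate=0.5, rng=rng)
print(dropped)                                    # each entry is 0.0 or twice the input
print(dropout(hidden, rate=0.5, training=False))  # unchanged at inference
```

With a rate of 0.5 on a layer as narrow as 10 units, about half of an already small layer is silenced on every step, which is one plausible reason the smallest configurations score so low after only 10 epochs.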

Now we start TensorBoard within the notebook using [magics](https://ipython.readthedocs.io/en/stable/interactive/magics.html):

In [13]:
%tensorboard --logdir logs

Reusing TensorBoard on port 6006 (pid 17032), started 0:22:14 ago. (Use '!kill 17032' to kill it.)

Using TensorBoard, we find that the best validation accuracies in our second experiment are the following:

| Validation accuracy | Layer 1 size | Layer 2 size | Layer 3 size |
| --- | --- | --- | --- |
| 0.824 | 256 | 256 | 128 |
| 0.8074 | 256 | 256 | 256 |

The second experiment shows that adding a third hidden layer did not increase the accuracy much: the best validation accuracy rose only from 0.8053 to 0.824.
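
One way to put this modest gain in context is to compare model sizes. The sketch below counts trainable parameters for fully connected stacks, assuming 400 inputs, the 10-unit output layer from the code, and the best configuration from each of the two results tables. `dense_param_count` is a helper written for this illustration.

```python
def dense_param_count(n_inputs, hidden_sizes, n_outputs):
    """Count weights and biases of a stack of fully connected layers."""
    sizes = [n_inputs, *hidden_sizes, n_outputs]
    # Each layer has (inputs x units) weights plus one bias per unit
    return sum(a * b + b for a, b in zip(sizes, sizes[1:]))


two_layer = dense_param_count(400, [64, 128], 10)          # best of experiment 1
three_layer = dense_param_count(400, [256, 256, 128], 10)  # best of experiment 2

print(two_layer, three_layer)
print(round(three_layer / two_layer, 1))
```

Under these assumptions, the best three-layer model has roughly 5.7 times as many parameters as the best two-layer model, for a gain of just under two percentage points in validation accuracy.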

## Third Experiment

Since our previous results were nearly the same, we decided to change the layer types and try different combinations: a `Conv1D` layer with max pooling is placed before the dense layers, and each dense layer is followed by batch normalization and dropout.
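
The shapes that flow through the convolutional front end can be worked out by hand. The sketch below assumes the layer settings that appear in the code for this experiment (a `Conv1D` with 32 filters and kernel size 3, then `MaxPooling1D` with pool size 2), together with the Keras defaults of stride 1 and `'valid'` padding. The helper names are ours.

```python
def conv1d_output_length(length, kernel_size, stride=1):
    """Output length of a 1D convolution with 'valid' (no) padding."""
    return (length - kernel_size) // stride + 1


def maxpool1d_output_length(length, pool_size):
    """Output length of 1D max pooling with stride equal to the pool size."""
    return (length - pool_size) // pool_size + 1


n_features, n_filters, kernel_size, pool_size = 400, 32, 3, 2

conv_len = conv1d_output_length(n_features, kernel_size)
pool_len = maxpool1d_output_length(conv_len, pool_size)
flat_size = pool_len * n_filters

# One input channel: kernel_size weights per filter plus one bias per filter
conv_params = kernel_size * 1 * n_filters + n_filters

print(conv_len, pool_len, flat_size, conv_params)
```

The first dense layer therefore receives 6368 inputs instead of 400, so it is much larger than in the earlier experiments even though the convolution itself adds only 128 parameters.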

In [14]:
# Clear any logs from previous runs
!rmdir /s /q .\logs

In [1]:
# Import necessary libraries
import tensorflow as tf
from keras import layers
from keras.callbacks import TensorBoard
from sklearn.model_selection import train_test_split
import os
import time
import pandas as pd
import numpy as np

# Load data from CSV file
data = pd.read_csv(r"D:\projects\Pattern-Recognition\datasetC.csv", header=None)

# Preprocess data by converting it to NumPy array
data = data.values
X = data[:, :-1]
y = data[:, -1]

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5, random_state=42)

# Define hyperparameters and architecture variations
layer_sizes = [64, 128, 256]

# Experiment loop for different layer sizes
for layer1_size in layer_sizes:
    for layer2_size in layer_sizes:
        for layer3_size in layer_sizes:
            # Generate a unique name for each experiment based on timestamp
            NAME = "{}-layer1-{}-layer2-{}-layer3-{}".format(
                layer1_size, layer2_size, layer3_size, int(time.time())
            )
            
            logdir = os.path.join("logs", NAME)
            tensorboard_callback = tf.keras.callbacks.TensorBoard(logdir, histogram_freq=1)

            # Build the sequential model
            model = tf.keras.models.Sequential()
            model.add(layers.Conv1D(filters=32, kernel_size=3, activation='relu', input_shape=(400, 1)))
            model.add(layers.MaxPooling1D(pool_size=2))
            model.add(layers.Flatten())
            model.add(layers.Dense(layer1_size, activation='relu'))
            model.add(layers.BatchNormalization())
            model.add(layers.Dropout(0.5))
            model.add(layers.Dense(layer2_size, activation='relu'))
            model.add(layers.BatchNormalization())
            model.add(layers.Dropout(0.5))
            model.add(layers.Dense(layer3_size, activation='relu'))
            model.add(layers.BatchNormalization())
            model.add(layers.Dropout(0.5))
            model.add(layers.Dense(10, activation='softmax'))

            # Compile the model with Adam optimizer and sparse categorical crossentropy loss
            model.compile(optimizer='adam',
                          loss='sparse_categorical_crossentropy',
                          metrics=['accuracy'])

            # Reshape input data for Conv1D layer
            X_train_reshaped = X_train.reshape((X_train.shape[0], X_train.shape[1], 1))
            X_test_reshaped = X_test.reshape((X_test.shape[0], X_test.shape[1], 1))

            # Train the model with 10 epochs, 30% validation split, and TensorBoard callback
            model.fit(X_train_reshaped, y_train, epochs=10, validation_split=0.3, callbacks=[tensorboard_callback])

            # Evaluate the model on the test set
            test_loss, test_acc = model.evaluate(X_test_reshaped, y_test)
            
            # Print results for each experiment
            print(f'Model: {NAME}')
            print(f'Test loss: {test_loss}')
            print(f'Test accuracy: {test_acc}')

Model: 64-layer1-64-layer2-64-layer3-1704138993
Test loss: 0.6250858306884766
Test accuracy: 0.7871999740600586
Model: 64-layer1-64-layer2-128-layer3-1704139007
Test loss: 0.6600789427757263
Test accuracy: 0.7832000255584717
Model: 64-layer1-64-layer2-256-layer3-1704139023
Test loss: 0.6833888292312622
Test accuracy: 0.7716000080108643
Model: 64-layer1-128-layer2-64-layer3-1704139038
Test loss: 0.7028277516365051
Test accuracy: 0.7788000106811523

Using TensorBoard, we find that the best validation accuracies in our third experiment are the following:

| Validation accuracy | Layer 1 size | Layer 2 size | Layer 3 size |
| --- | --- | --- | --- |
| 0.8035 | 256 | 128 | 64 |
| 0.801 | 128 | 64 | 128 |

We notice that the accuracy did not improve: the best validation accuracy here (0.8035) is lower than in both previous experiments (0.8053 and 0.824).
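
For reference, the `BatchNormalization` layers added in this experiment standardize each feature over the mini-batch during training. The sketch below shows that computation for a single feature in plain Python, assuming the initial scale `gamma = 1` and shift `beta = 0` and an `epsilon` of 1e-3 (the Keras default). It omits the moving averages that the real layer keeps for inference.

```python
import math


def batch_norm(values, gamma=1.0, beta=0.0, epsilon=1e-3):
    """Normalize one feature across a mini-batch, then scale and shift."""
    mean = sum(values) / len(values)
    variance = sum((v - mean) ** 2 for v in values) / len(values)
    return [gamma * (v - mean) / math.sqrt(variance + epsilon) + beta for v in values]


# Hypothetical activations of one unit for a mini-batch of five samples
batch = [2.0, 4.0, 6.0, 8.0, 10.0]
normalized = batch_norm(batch)

print([round(v, 3) for v in normalized])
print(abs(sum(normalized) / len(normalized)) < 1e-9)  # mean is numerically zero
```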

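## Fourth Experiment

For our fourth experiment, we keep the architecture of the third experiment and add a `ReduceLROnPlateau` callback, which lowers the learning rate when the validation loss stops improving.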

In [2]:
import tensorflow as tf
from keras import layers
from keras.callbacks import TensorBoard, ReduceLROnPlateau
from sklearn.model_selection import train_test_split
import time
import pandas as pd
import numpy as np

# Load data
data = pd.read_csv(r"D:\projects\Pattern-Recognition\datasetC.csv", header=None)

# Preprocess data
data = data.values
X = data[:, :-1]
y = data[:, -1]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5, random_state=42)

# Define hyperparameters and architecture variations
layer_sizes = [64, 128, 256]

# Experiment loop
for layer1_size in layer_sizes:
    for layer2_size in layer_sizes:
        for layer3_size in layer_sizes:
            NAME = "{}-layer1-{}-layer2-{}-layer3-{}".format(
                layer1_size, layer2_size, layer3_size, int(time.time())
            )
            tensorboard = TensorBoard(log_dir='D:/old-files/Desktop/thmmu/9o/Anagnorhsh Protypwn/fourth-experiment/Logs/{}'.format(NAME))

            # Define the model
            model = tf.keras.models.Sequential()
            model.add(layers.Conv1D(filters=32, kernel_size=3, activation='relu', input_shape=(400, 1)))
            model.add(layers.MaxPooling1D(pool_size=2))
            model.add(layers.Flatten())
            model.add(layers.Dense(layer1_size, activation='relu'))
            model.add(layers.BatchNormalization())
            model.add(layers.Dropout(0.5))
            model.add(layers.Dense(layer2_size, activation='relu'))
            model.add(layers.BatchNormalization())
            model.add(layers.Dropout(0.5))
            model.add(layers.Dense(layer3_size, activation='relu'))
            model.add(layers.BatchNormalization())
            model.add(layers.Dropout(0.5))
            model.add(layers.Dense(10, activation='softmax'))

            model.compile(optimizer='adam',
                          loss='sparse_categorical_crossentropy',
                          metrics=['accuracy'])

            # Reshape input for Conv1D
            X_train_reshaped = X_train.reshape((X_train.shape[0], X_train.shape[1], 1))
            X_test_reshaped = X_test.reshape((X_test.shape[0], X_test.shape[1], 1))

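            # Define the learning rate scheduler: multiply the learning rate by 0.2 after
            # 5 epochs without val_loss improvement. Adam starts at 0.001 by default, so
            # min_lr=0.001 leaves no room for a reduction with these settings.
            reduce_lr = ReduceLROnPlateau(monitor='val_loss', factor=0.2, patience=5, min_lr=0.001)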

            # Train the model with the learning rate scheduler
            model.fit(X_train_reshaped, y_train, epochs=10, validation_split=0.3, callbacks=[tensorboard, reduce_lr])

            # Evaluate the model on the test set
            test_loss, test_acc = model.evaluate(X_test_reshaped, y_test)
            print(f'Model: {NAME}')
            print(f'Test loss: {test_loss}')
            print(f'Test accuracy: {test_acc}')


Model: 64-layer1-64-layer2-64-layer3-1704142005
Test loss: 0.6191369295120239
Test accuracy: 0.7928000092506409
Model: 64-layer1-64-layer2-128-layer3-1704142021
Test loss: 0.752289354801178
Test accuracy: 0.7540000081062317
Model: 64-layer1-64-layer2-256-layer3-1704142036
Test loss: 0.6521402597427368
Test accuracy: 0.7847999930381775
Model: 64-layer1-128-layer2-64-layer3-1704142049
Test loss: 0.7342736721038818
Test accuracy: 0.777999997138977
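
The effect of the scheduler settings in the cell above can be explored without training anything. The sketch below is a simplified re-implementation of the reduce-on-plateau rule (not the Keras source). It assumes a starting learning rate of 0.001, which is the default for the Keras Adam optimizer, and a made-up sequence of validation losses that stops improving after the second epoch.

```python
def simulate_reduce_on_plateau(val_losses, lr, factor, patience, min_lr, min_delta=1e-4):
    """Return the learning rate in effect after each epoch."""
    best = float("inf")
    wait = 0
    history = []
    for loss in val_losses:
        if loss < best - min_delta:
            best, wait = loss, 0  # improvement: reset the counter
        else:
            wait += 1
            if wait >= patience:
                lr = max(lr * factor, min_lr)  # reduce, but never below min_lr
                wait = 0
        history.append(lr)
    return history


# Hypothetical validation losses over 10 epochs: no improvement after epoch 2
val_losses = [0.90, 0.80, 0.81, 0.82, 0.80, 0.83, 0.81, 0.82, 0.80, 0.81]

as_configured = simulate_reduce_on_plateau(val_losses, 0.001, 0.2, 5, min_lr=0.001)
lower_floor = simulate_reduce_on_plateau(val_losses, 0.001, 0.2, 5, min_lr=1e-5)

print(as_configured[-1])
print(lower_floor[-1])
```

In this simulation the learning rate never moves when `min_lr` equals the starting rate, and with a patience of 5 there is room for at most one reduction in a 10-epoch run.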

## Final Experiment

Based on our investigation, the experiment that gave the best accuracy on the validation data, with an accuracy of {add the % accuracy}, is the following.

We use the trained model to predict labels for the test set and save them to a `.npy` file named `labels28.npy`.

In [None]:
# Load the unlabeled test data
test_data = pd.read_csv(r"D:\projects\Pattern-Recognition\datasetCTest.csv", header=None)

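# Apply the trained model to the unlabeled test data; the labels 1-5 were used
# directly as class indices during training, so argmax already returns the label
labels28 = np.argmax(model.predict(test_data.values), axis=1)

# Save the labels28 vector in NumPy format
np.save('labels28.npy', labels28)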
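
Because the networks above were trained with the labels 1 to 5 used directly as class indices into a 10-unit softmax layer, the position of the largest output is already the predicted label. The plain-Python sketch below mimics what `np.argmax(..., axis=1)` does on a small made-up batch of probabilities. The helper name and the numbers are illustrative only.

```python
def argmax_rows(rows):
    """Index of the largest value in each row (first one wins on ties)."""
    return [max(range(len(row)), key=row.__getitem__) for row in rows]


# Hypothetical softmax outputs for three samples over 10 units (index 0..9)
probabilities = [
    [0.01, 0.02, 0.05, 0.02, 0.10, 0.78, 0.01, 0.00, 0.01, 0.00],
    [0.00, 0.10, 0.70, 0.10, 0.05, 0.05, 0.00, 0.00, 0.00, 0.00],
    [0.02, 0.08, 0.10, 0.15, 0.60, 0.05, 0.00, 0.00, 0.00, 0.00],
]

predicted = argmax_rows(probabilities)
print(predicted)

# Units 0 and 6-9 never appear as training targets, so predictions
# are expected (though not guaranteed) to stay within 1..5
print(all(1 <= label <= 5 for label in predicted))
```

If the output layer were instead reduced to 5 units with the labels shifted to 0..4, the predictions would need `+ 1` added before saving.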

We load our labels to make sure everything works well!

In [9]:
# Load the labels28 vector
labels28 = np.load('labels28.npy')

# Print labels28 vector 
print(labels28)

[5 2 4 1 4 1 2 4 3 4 3 5 1 2 1 3 1 3 5 1 5 5 1 2 1 2 2 3 2 2 2 2 2 1 2 2 3
 3 3 3 1 2 2 5 4 2 1 5 2 3 3 2 3 2 1 3 2 3 2 4 2 5 3 2 1 3 3 1 3 4 1 4 3 1
 1 2 4 3 5 1 2 5 4 4 1 5 3 4 3 2 4 2 3 3 2 4 3 2 4 1 4 1 1 5 3 2 2 4 3 2 5
 2 1 4 2 3 5 3 3 4 2 1 1 1 4 5 3 2 5 2 1 3 5 1 4 1 2 2 4 3 4 4 2 4 3 3 2 1
 4 2 3 1 2 3 3 3 4 5 4 1 1 1 2 3 4 5 5 2 3 1 1 1 2 2 3 5 4 5 1 1 5 5 3 2 4
 5 4 3 1 3 5 3 2 1 3 2 1 2 3 1 3 1 2 1 3 5 3 2 3 5 4 5 2 4 5 1 1 5 3 3 1 1
 4 4 5 2 3 4 4 1 3 2 1 2 3 4 4 5 3 4 2 5 1 2 4 1 1 5 3 3 1 1 3 4 4 2 3 2 2
 2 1 4 1 3 1 5 4 2 3 1 4 4 1 1 3 2 1 3 2 4 2 5 1 1 2 2 2 5 4 1 4 3 1 1 2 3
 3 4 3 5 4 1 1 4 2 4 1 4 1 2 1 4 4 2 4 3 4 5 4 4 5 3 3 5 3 2 1 2 1 5 3 1 3
 1 4 2 5 1 3 3 3 4 2 1 1 1 4 2 4 1 4 4 3 2 5 1 4 2 1 1 2 1 2 4 3 3 3 2 4 2
 5 1 3 3 5 5 5 4 3 4 5 3 5 1 2 4 4 3 2 1 5 2 5 1 2 3 3 5 3 3 2 1 4 2 3 4 5
 4 5 3 1 1 3 3 3 4 3 2 4 1 1 1 3 3 3 1 3 4 4 1 3 1 4 4 5 3 5 5 1 3 3 4 1 1
 5 4 5 1 5 5 1 2 4 3 3 5 5 1 4 4 1 4 2 3 1 4 1 4 2 1 5 5 2 4 5 2 1 2 4 3 5
 2 5 2 4 5 2 4 3 4 1 2 2 