# Assignment 4. Deep Learning

*Foundations of Data Science*  
*Dr. Khalaj (Fall 2023)*  

*For questions 2-4, contact @alregamo on Telegram.*

### Description  
This homework consists of four questions, each covering one area of deep learning.  
1. Getting familiar with sentiment analysis (a subject also covered in the course project).
   
2. Multi-layer perceptron (MLP). 
   
3. Convolutional Neural Networks (CNN).
   
4. Variational Autoencoders (VAE).

### Information  
Complete the information box below.

In [None]:
full_name = 'Danial Ataie'
student_id = '99100455'

### Note
The questions are not necessarily in order of difficulty. You must answer **3 out of 4** questions. The fourth question is optional and counts as a bonus.

## 2 Multi-layer Perceptron (MLP)

**In this assignment you'll be working with the Dorothea dataset.**

DOROTHEA is a drug discovery dataset. Chemical compounds, represented by structural molecular features, must be classified as active (binding to thrombin) or inactive. To find out more about the dataset, refer to this link: https://archive.ics.uci.edu/ml/datasets/Dorothea

You should implement a neural network classifier, using PyTorch as the framework.

### 2.1 Importing libraries, modules and Dataset.

In this part, import all the libraries and modules needed to solve the problem.

In [None]:
import torch 
import torch.nn as nn
import numpy as np

Now load the train and test data from the dataset files.

In [None]:
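def read_sparse(file_path, rows, cols):
  """Build a dense 0/1 matrix from a file that lists, per line, the 1-based indices of the active features."""
  matrix = np.zeros((rows, cols), dtype=int)
  with open(file_path, 'r') as file:
    for row_idx, line in enumerate(file):
      column_indices = np.array([int(token) for token in line.split()], dtype=int)
      # Indices in the file are 1-based; shift to 0-based before indexing.
      matrix[row_idx, column_indices - 1] = 1
  # Return only after every line has been read, not after the first one.
  return matrix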


n_features = 100000
train_count, test_count = 800, 350

train_data = read_sparse('dorothea_train.data', train_count, n_features)
train_labels = np.genfromtxt('dorothea_train.labels')
test_data = read_sparse('dorothea_valid.data', test_count, n_features)
test_labels = np.genfromtxt('dorothea_valid.labels')
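
To make the file format concrete, here is a minimal sketch of the same parsing logic on a made-up in-memory sample (three compounds, six features). The sample text and sizes are illustrative assumptions, not taken from the Dorothea files.

```python
import io
import numpy as np

# Made-up sample: each line lists the 1-based indices of the features that are 1.
sample = io.StringIO("1 4\n2\n3 5 6\n")

matrix = np.zeros((3, 6), dtype=int)
for row_idx, line in enumerate(sample):
    indices = np.array([int(token) for token in line.split()], dtype=int)
    matrix[row_idx, indices - 1] = 1  # shift 1-based indices to 0-based

print(matrix)
```

Every row of the result contains a 1 exactly at the positions listed on the corresponding line and 0 everywhere else.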

### 2.2 Normalize
You can normalize your data using <code>Scikit-Learn</code> modules here.

In [None]:
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
train_data_normalized = scaler.fit_transform(train_data)
test_data_normalized = scaler.transform(test_data)
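
As a rough illustration of what `StandardScaler` computes with its default settings, the NumPy sketch below standardizes a made-up toy matrix by hand. The arrays are illustrative assumptions. The statistics come from the training data only and are then reused for the test data, which is why the cell above calls `fit_transform` on the train set but only `transform` on the test set.

```python
import numpy as np

# Made-up toy data: 4 training rows, 1 test row, 3 features.
train = np.array([[0.0, 1.0, 0.0],
                  [1.0, 1.0, 0.0],
                  [1.0, 1.0, 4.0],
                  [0.0, 1.0, 4.0]])
test = np.array([[1.0, 1.0, 2.0]])

mean = train.mean(axis=0)
std = train.std(axis=0)      # population standard deviation (ddof=0)
std[std == 0.0] = 1.0        # constant feature: avoid dividing by zero

train_scaled = (train - mean) / std
test_scaled = (test - mean) / std  # reuse the training statistics

print(train_scaled)
print(test_scaled)
```

The constant middle column maps to all zeros, which matters for Dorothea because many of its binary features are likely to be constant across the training set.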

### 2.3 Dimensionality Reduction
Each instance has 100,000 attributes, far more than the 800 training samples. Training directly on such sparse, high-dimensional data is slow and prone to overfitting, so reduce the number of dimensions first to improve accuracy.

Principal component analysis (PCA) is the process of computing the principal components and using them to perform a change of basis on the data.

Apply PCA to the Dorothea dataset using <code>Scikit-Learn</code>.

In [None]:
from sklearn.decomposition import PCA

n_components = 50
pca = PCA(n_components=n_components)
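# Fit PCA on the standardized training data from 2.2, then apply the same projection to the test data.
train_data_pca = pca.fit_transform(train_data_normalized)
test_data_pca = pca.transform(test_data_normalized)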
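
The change of basis described above can be sketched with a plain SVD. This is a minimal NumPy illustration on made-up random data, not the exact routine `Scikit-Learn` runs internally; in particular, the signs of the components may differ from what `PCA` returns.

```python
import numpy as np

rng = np.random.default_rng(0)
X_train = rng.normal(size=(20, 5))  # made-up data: 20 samples, 5 features
X_test = rng.normal(size=(4, 5))
k = 2                               # number of components to keep

# Center with the training mean, then take the top-k right singular vectors.
mean = X_train.mean(axis=0)
_, singular_values, Vt = np.linalg.svd(X_train - mean, full_matrices=False)
components = Vt[:k]                 # rows are orthonormal principal directions

# Change of basis: project the centered data onto the principal directions.
train_proj = (X_train - mean) @ components.T
test_proj = (X_test - mean) @ components.T

# Variance captured by each kept component.
explained_variance = singular_values[:k] ** 2 / (len(X_train) - 1)
print(train_proj.shape, test_proj.shape, explained_variance)
```

As with normalization, the mean and the components are estimated from the training data only and reused unchanged for the test data.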

### 2.4 Define Model



In [None]:
# Define your model in here
# You can change the code below.

class ClassifierModel(nn.Module):
    def __init__(self, input_size, hidden_size, output_size):
        super(ClassifierModel, self).__init__()
        self.fc1 = nn.Linear(input_size, hidden_size)
        self.relu = nn.ReLU()
        self.fc2 = nn.Linear(hidden_size, hidden_size)
        self.fc3 = nn.Linear(hidden_size, hidden_size)
        self.fc4 = nn.Linear(hidden_size, output_size)

    def forward(self, x):
        x = self.fc1(x)
        x = self.relu(x)
        x = self.fc2(x)
        x = self.relu(x)
        x = self.fc3(x)
        x = self.relu(x)
        # Return raw logits: nn.CrossEntropyLoss applies log-softmax internally,
        # so an extra softmax here would distort the loss. The argmax used for
        # prediction is the same with or without softmax.
        x = self.fc4(x)
        return x
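
`nn.CrossEntropyLoss` expects raw logits, because it applies a log-softmax itself before taking the negative log-likelihood. The NumPy sketch below mimics that computation on made-up logits (the helper names are hypothetical) and compares it with the loss obtained when softmax outputs are passed in instead.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)  # subtract the row max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def cross_entropy_from_logits(logits, targets):
    """Mean negative log-likelihood after a log-softmax over the class axis."""
    log_probs = np.log(softmax(logits))
    return -log_probs[np.arange(len(targets)), targets].mean()

# Made-up logits for two confidently and correctly classified samples.
logits = np.array([[4.0, -4.0], [-3.0, 3.0]])
targets = np.array([0, 1])

loss_logits = cross_entropy_from_logits(logits, targets)           # logits passed directly
loss_double = cross_entropy_from_logits(softmax(logits), targets)  # softmax applied twice

print(loss_logits, loss_double)
```

With probabilities as input, the two values fed to the internal softmax differ by at most 1, so for two classes the loss cannot drop below log(1 + e^-1), roughly 0.313, however confident the model is. This weakens the gradient signal during training.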

### 2.5 Train the model

**Initialize model, define hyperparameters, optimizer, loss function, etc.**



In [None]:
import torch.optim as optim

input_size = 50
hidden_size = 128
output_size = 2
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

model = ClassifierModel(input_size, hidden_size, output_size).to(device)

learning_rate = 0.001
epochs = 10
batch_size = 64

optimizer = optim.Adam(model.parameters(), lr=learning_rate)
criterion = nn.CrossEntropyLoss()

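# Pair the PCA features with their labels so each batch yields (inputs, labels).
# Features are cast to float32 to match the model weights; labels (-1/+1 in the
# label file) are mapped to the class indices 0/1 that nn.CrossEntropyLoss expects.
train_dataset = torch.utils.data.TensorDataset(
    torch.tensor(train_data_pca, dtype=torch.float32),
    torch.tensor(train_labels > 0, dtype=torch.long),
)
train_data_loader = torch.utils.data.DataLoader(dataset=train_dataset, batch_size=batch_size, shuffle=True)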

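train_hist = []

for epoch in range(epochs):
    model.train()
    epoch_loss = 0.0
    for inputs, labels in train_data_loader:
        inputs, labels = inputs.to(device), labels.to(device)
        outputs = model(inputs)
        loss = criterion(outputs, labels)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        # Weight each batch loss by its size so the epoch average is exact.
        epoch_loss += loss.item() * inputs.size(0)
    # Record the mean loss over the whole epoch rather than the last batch only.
    epoch_loss /= len(train_data_loader.dataset)
    train_hist.append(epoch_loss)
    print(f'Epoch {epoch+1}/{epochs}, Loss: {epoch_loss}')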

**After the training process, plot metrics such as loss function values.**

In [None]:
import matplotlib.pyplot as plt

plt.plot(train_hist, label='Training Loss')
plt.xlabel('Epoch')
plt.ylabel('Loss')
plt.title('Training Loss Over Epochs')
plt.legend()
plt.show()

### 2.6 Testing
After training, test your model on the test dataset and compute performance metrics.

In [None]:
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

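# Pair the test features with their labels (mapped from -1/+1 to 0/1) so each batch yields (inputs, labels).
test_dataset = torch.utils.data.TensorDataset(
    torch.tensor(test_data_pca, dtype=torch.float32),
    torch.tensor(test_labels > 0, dtype=torch.long),
)
test_data_loader = torch.utils.data.DataLoader(dataset=test_dataset, batch_size=batch_size, shuffle=False)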
model.eval()
true_labels = []
predicted_labels = []

with torch.no_grad():
    for batch in test_data_loader:
        inputs, labels = batch
        inputs, labels = inputs.to(device), labels.to(device)
        outputs = model(inputs)
        _, predicted = torch.max(outputs, 1)
        true_labels.extend(labels.cpu().numpy())
        predicted_labels.extend(predicted.cpu().numpy())

accuracy = accuracy_score(true_labels, predicted_labels)
precision = precision_score(true_labels, predicted_labels)
recall = recall_score(true_labels, predicted_labels)
f1 = f1_score(true_labels, predicted_labels)

print(f'Accuracy: {accuracy}')
print(f'Precision: {precision}')
print(f'Recall: {recall}')
print(f'F1 Score: {f1}')
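
For reference, the four metrics can be derived by hand from the counts of true and false positives and negatives. The sketch below uses a hypothetical helper and made-up label lists, assumes labels in {0, 1} with 1 as the positive (active) class, and returns 0.0 where a denominator would be zero.

```python
def binary_metrics(y_true, y_pred):
    """Accuracy, precision, recall and F1 for labels in {0, 1}, positive class = 1."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)

    accuracy = (tp + tn) / len(y_true)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return accuracy, precision, recall, f1

# Made-up, imbalanced example: 3 positives out of 10 samples.
y_true = [0, 0, 0, 0, 0, 0, 1, 1, 1, 0]
y_pred = [0, 0, 0, 0, 0, 1, 1, 1, 0, 0]

accuracy, precision, recall, f1 = binary_metrics(y_true, y_pred)
print(accuracy, precision, recall, f1)
```

On imbalanced data such as this, accuracy alone can look good even when many positives are missed, so precision, recall and F1 are worth reporting alongside it.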

Show the confusion matrix of your model.

In [None]:
from sklearn.metrics import confusion_matrix
import seaborn as sns

cm = confusion_matrix(true_labels, predicted_labels)
plt.figure(figsize=(8, 6))
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', xticklabels=[0, 1], yticklabels=[0, 1])
plt.xlabel('Predicted')
plt.ylabel('True')
plt.title('Confusion Matrix')
plt.show()
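
To read the heatmap correctly, it helps to know the layout `confusion_matrix` uses: rows are true classes and columns are predicted classes. The NumPy sketch below builds the same 2x2 layout by hand from made-up labels.

```python
import numpy as np

# Made-up labels: 0 = inactive, 1 = active.
y_true = np.array([0, 0, 0, 1, 1, 0])
y_pred = np.array([0, 1, 0, 1, 0, 0])

# cm[i, j] counts the samples whose true class is i and predicted class is j.
cm = np.zeros((2, 2), dtype=int)
np.add.at(cm, (y_true, y_pred), 1)

tn, fp, fn, tp = cm.ravel()
print(cm)
print(tn, fp, fn, tp)
```

The diagonal holds the correct predictions; the top-right cell counts false positives and the bottom-left cell counts false negatives.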