<a href="https://colab.research.google.com/github/Amambayeva/Amambayeva/blob/main/assigment1.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

<img src="https://github.com/Christina1281995/demo-repo/blob/main/InterimAssignmentHeader.png?raw=true">

<b>Overview</b>

This interim assignment will focus on the topics that we covered in the first two units of this course.

You will try to improve the performance of a bi-directional LSTM model that will learn to classify text into 2 sentiment categories: ```positive, negative```.

Most of the code is prepared for you, so your job is to really focus on the model training and performance!

<br>

<b>The Task</b>

To improve model performance, you will adjust the <u>hyperparameters</u> and optionally the <u>optimizer</u> and <u>loss function</u>.

Your aim is to end up with a model that fulfills these criteria:
- Its accuracy on the validation data is within a generally accepted range (80% or higher)
- It also performs well on test data (to show that the model didn't overfit on the training data)

<br>

<b>Before you get started</b>

It might be worth changing the runtime settings ("Change runtime type") in the top right-hand corner of Google Colab to "T4 GPU". Depending on which hyperparameters you set, your Google Colab session might crash occasionally (if the settings require too much memory, e.g. too large a batch size). Using a GPU can help avoid session crashes (but even the GPU will reach its memory limits if you push the training too far with the hyperparameters).


<br>

<b>What to Document, Save, and Submit</b>

Please document your process (i.e. what you tried out and how it went), your reasoning, and your final results! When you are satisfied with your results, save and download your notebook and attach it to your submission!

In your documentation, please make sure to also cover these questions:
1. What is an obvious short-coming of this training data?
2. Which hyperparameters seem to have the largest influence (i.e. caused the largest changes in the model training)?
3. Give a brief description of overfitting and how you can tell if that happened to your model during training. If you observe overfitting during your experiments, include a screenshot of the accuracies chart to show it!
4. Which hyperparameter settings did you ultimately choose for your submission (and how did you come to those final settings)?
5. After you have satisfactorily trained the model, try to come up with some custom input text (at the bottom of this notebook) where the model obviously struggles - include a few examples in your submission and try to explain why the model might struggle.


<br>

<b>Grading</b>

For this interim assignment, you can get a total of 25 points. Here's the breakdown:

<br>

<table>
  <tr>
    <th>Task</th>
    <th>Points</th>
  </tr>
  <tr>
    <td>Validation Acc. above 80%</td>
    <td>6</td>
  </tr>
  <tr>
    <td>Test Acc. above 70%</td>
    <td>7</td>
  </tr>
  <tr>
    <td>Reasoning / decisions</td>
    <td>6</td>
  </tr>
  <tr>
    <td>Question answers and documentation</td>
    <td>6</td>
  </tr>
</table>



<br>
<br>


In [1]:
# --------------------- PREPARED FOR YOU, DO NOT EDIT --------------------------

# General Data Handling and Import
import xml.etree.ElementTree as ET                                              # Parses and creates XML documents. ElementTree replaces cElementTree, which was removed in Python 3.9
import urllib.request                                                           # Module for opening URLs, mainly useful for reading data across the web
import pandas as pd                                                             # Data manipulation and analysis library, offers data structures like DataFrame
import numpy as np                                                              # Fundamental package for scientific computing with Python, supports large, multi-dimensional arrays and matrices


# PyTorch Libraries
import torch                                                                    # Main PyTorch library, used for building deep learning models
import torch.nn as nn                                                           # Provides a set of modules and loss functions to build neural networks
from torch import optim                                                         # Optimization algorithms like SGD, Adam, etc., for training models
from torch.utils.data import Dataset, DataLoader                                # Utilities for wrapping data for training, such as batching, shuffling
from torch.nn.utils.rnn import pad_sequence                                     # Utility function for padding sequences to the same length for batch processing
import torch.nn.functional as F


# Other Libraries useful for Natural Language Processing
import nltk                                                                     # Natural Language Toolkit, a set of libraries for symbolic and statistical natural language processing
nltk.download('punkt')                                                          # Downloads the Punkt tokenizer models, which is a pre-trained sentence tokenizer
from nltk.tokenize import word_tokenize                                         # Tokenizes a text into a list of words, used for processing natural language text
from gensim.models import Word2Vec                                              # Implementation of the Word2Vec word embedding model for generating word vectors
import gensim.downloader as api                                                 # API for downloading pre-trained word embedding models from Gensim's repository
from sklearn.model_selection import train_test_split                            # Utility function for splitting data arrays into training and testing subsets


# Data Plotting
import matplotlib.pyplot as plt                                                 # Library for creating static, animated, and interactive visualizations in Python


# Counter for keeping track of things!
from collections import Counter                                                 # Dict subclass for counting hashable objects, useful for creating vocabularies or counting items


[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


#### 2. Preparing the Training Data 📑

*This section is prepared for you - there's **no need to edit** this code, but you do need to **run the cells** and, ideally, take some time to **understand** what is going on!*


Here, we'll load the data from a URL. This dataset is called "SemEval", and includes text and a sentiment classification. We'll also turn the labels into numbers.

In [2]:
# --------------------- PREPARED FOR YOU, DO NOT EDIT --------------------------

# URL for the data (this is a link to a CSV hosted on Google Drive)
data_url = "https://drive.google.com/file/d/1zy9GnyHNZlhuytN8tpTtNehONy1EX5qb/view?usp=sharing"
data_url ='https://drive.google.com/uc?id=' + data_url.split('/')[-2]

# Read the data from the URL and turn it into a pandas DataFrame
df = pd.read_csv(data_url, encoding='utf-8')
print(f"The whole dataset is quite large. {len(df)} rows to be exact!\n\nIf we use all of that data for model training, Google Colab will crash because it will run out of memory!")
print("Instead, we'll use a subset of just 5000 randomly sampled rows.\n")

# Randomly sample 5000 rows from the df
df = df.sample(5000)

# Display the columns that this DataFrame has
df.head()


The whole dataset is quite large. 50000 rows to be exact!

If we use all of that data for model training, Google Colab will crash because it will run out of memory!
Instead, we'll use a subset of just 5000 randomly sampled rows.



Unnamed: 0.1,Unnamed: 0,text,sentiment
6849,1082829,really really happy with the LAKERS win!!! woo...,1
2427,99032,4 days till Philippines... Makati City we are ...,0
21180,68120,Oh thats so sad!,0
19613,499391,ummmmm how do u get the thing to send udates t...,0
40089,1578258,Good morning! How are my fellow McFly fans tod...,1


Next up, we need to prepare the text data so that it can be turned into numbers (and ultimately tensors).

Here is what we will do:
1. **Tokenize the text**: this means we split the text into a list of individual tokens, which are usually just the individual words and punctuation marks (some tokenizers also split long words into sub-words). To do this, we are using the function ```word_tokenize()```, which we can simply import from a Python package called ```nltk``` (by the way, nltk stands for Natural Language Toolkit).
2. **Build a vocab**: this is just a big dictionary where each word in the entire dataset is listed and given a unique number. Very simple!
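
As a rough illustration of these two steps, here is a minimal sketch on two made-up sentences. It assumes a plain `str.split()` as a simplified stand-in for ```word_tokenize()``` (which additionally separates punctuation), and the names `toy_texts` and `toy_vocab` are hypothetical, not part of the assignment code.

```python
from collections import Counter

# Two made-up example sentences (hypothetical, not taken from the dataset)
toy_texts = ["I love this song", "I hate this weather"]

# Step 1: tokenize (str.split() is a simplified stand-in for word_tokenize())
token_freqs = Counter()
for text in toy_texts:
    token_freqs.update(text.lower().split())

# Step 2: give every distinct token a unique number, starting at 1 (0 is reserved for padding)
toy_vocab = {token: idx + 1 for idx, token in enumerate(token_freqs)}
toy_vocab["<pad>"] = 0

# Turn a new sentence into numbers; words that are not in the vocab fall back to 0
indices = [toy_vocab.get(token, 0) for token in "i love this weather today".split()]

print(toy_vocab)
print(indices)   # [1, 2, 3, 6, 0]
```

The word "today" never appeared in the toy sentences, so it is mapped to 0, just like the padding token.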

In [None]:
# --------------------- PREPARED FOR YOU, DO NOT EDIT --------------------------

# A function that will build the vocabulary
def build_vocab(texts):
    token_freqs = Counter()
    for text in texts:
        tokens = word_tokenize(text.lower())                                    # word_tokenize is an nltk function that breaks down text into tokens
        token_freqs.update(tokens)                                              # keeping track of the frequency of each token
    vocab = {token: idx + 1 for idx, token in enumerate(token_freqs)}           # Start indexing from 1
    vocab['<pad>'] = 0                                                          # Padding token
    return vocab

# Here, we call the function and build the actual vocab
vocab = build_vocab(df['text'])

In [None]:
# --------------------- PREPARED FOR YOU, DO NOT EDIT --------------------------

# A function that will turn the SemEval text into a simple numeric format (each word is turned into a unique number using the vocab!)
def tokenize_and_convert_to_indices(text):
    tokens = word_tokenize(text.lower())
    return [vocab.get(token, 0) for token in tokens]                            # Use 0 for unknown tokens

# Create a new column in the Dataframe that contains the numeric format of the text
df['indexed_text'] = df['text'].apply(tokenize_and_convert_to_indices)

In [None]:
# --------------------- PREPARED FOR YOU, DO NOT EDIT --------------------------

# Now, let's take a quick look at the new column
df.head(2)

Now, when we get ready for training a model, we usually split the data into:
* training data (70%)
* validation data (10%)
* testing data (20%)

Below, we create those 3 subsets! Keep in mind that even after the split, they are still "only" pandas Dataframes, not yet PyTorch Datasets! That will come in the step afterwards.
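
To see why a second split of 12.5% produces 10% of the whole dataset, the arithmetic can be sketched in plain Python. This assumes exactly 5000 sampled rows; ```train_test_split``` does its own rounding, so treat the numbers as an illustration rather than a guarantee.

```python
total_rows = 5000                                   # size of the sampled DataFrame

test_rows = round(total_rows * 0.2)                 # first split: 20% of everything goes to testing
train_and_val_rows = total_rows - test_rows         # 80% of the data remains

val_rows = round(train_and_val_rows * 0.125)        # second split: 12.5% of the remaining 80%
train_rows = train_and_val_rows - val_rows

for name, rows in [("train", train_rows), ("validation", val_rows), ("test", test_rows)]:
    print(f"{name}: {rows} rows ({rows / total_rows:.0%} of the total)")
```

In other words, 0.125 x 0.8 = 0.1, which is where the 70% / 10% / 20% split comes from.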

In [None]:
# --------------------- PREPARED FOR YOU, DO NOT EDIT --------------------------

# First split: Separate out the test dataset (20% of the total data)
train_and_val_df, test_df = train_test_split(df, test_size=0.2, random_state=42)

# Second split: Split the train_and_val_df into training and validation sets
# Note: 0.125 of the remaining data (which is 80% of the total) will be 10% of the total dataset
train_df, val_df = train_test_split(train_and_val_df, test_size=0.125, random_state=42)

print(f"The training dataset is now of size: {len(train_df)}")
print(f"The validation dataset is now of size: {len(val_df)}")
print(f"The testing dataset is now of size: {len(test_df)}")

Next up: Creating a PyTorch **Dataset** object! <img src="https://cdn.icon-icons.com/icons2/2699/PNG/512/pytorch_logo_icon_169823.png" align="right" width="150px">

This should be somewhat familiar to you now! The Dataset stores the data in a way that is convenient for the entire pytorch library to work with!

In [None]:
# --------------------- PREPARED FOR YOU, DO NOT EDIT --------------------------

# Defining the pytorch Dataset
class TextDataset(Dataset):
    def __init__(self, df):
        self.df = df

    def __len__(self):
        return len(self.df)

    def __getitem__(self, idx):
        text = torch.tensor(self.df.iloc[idx]['indexed_text'], dtype=torch.long)
        sentiment = torch.tensor(self.df.iloc[idx]['sentiment'], dtype=torch.long)
        return text, sentiment


# Turn our training, validation and test data into Dataset objects
train_dataset = TextDataset(train_df)
val_dataset = TextDataset(val_df)
test_dataset = TextDataset(test_df)

Next up: we define a function that will be needed for the DataLoader later. It prepares the batches by adding padding to all elements in the batch, so that they are all the same length. It will also put the sentiment label (which is either a 0 or a 1) in a tensor format.
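
To picture what padding does, here is a plain-Python sketch that works on ordinary lists. The helper `pad_batch` is hypothetical and only imitates the idea behind ```pad_sequence(..., batch_first=True)```; the real function works on tensors.

```python
def pad_batch(sequences, padding_value=0):
    """Pad every sequence on the right so that all have the length of the longest one."""
    max_len = max(len(seq) for seq in sequences)
    return [seq + [padding_value] * (max_len - len(seq)) for seq in sequences]

# Three made-up indexed texts of different lengths
batch = [[4, 9, 2], [7], [3, 5]]
padded = pad_batch(batch)

print(padded)   # [[4, 9, 2], [7, 0, 0], [3, 5, 0]]
```

After padding, every row has the same length, so the batch can be stacked into one rectangular tensor.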

In [None]:
# --------------------- PREPARED FOR YOU, DO NOT EDIT --------------------------

# A function to prepare batches for input to the model
# It will match the length of all inputs with padding and turns the labels into tensors

def collate_fn(batch):
    # Unzip the batch to separate the text and sentiment into individual variables
    texts, sentiments = zip(*batch)

    # Pad the sequence of texts so they all have the same length
    texts = pad_sequence(texts, batch_first=True, padding_value=vocab['<pad>'])

    # Convert the list of sentiment labels into a tensor of long integers
    sentiments = torch.tensor(sentiments, dtype=torch.long)

    # Return the padded texts and sentiment tensors
    return texts, sentiments


For the embeddings, we will use a pre-trained embedding model called "Word2Vec". This model was trained to learn the semantics of words (sentence structure, word contexts, etc). This way, when the model turns text into embeddings, the numbers represent the semantic meaning! This is quite useful for any following tasks!
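
One common way to compare two embeddings is cosine similarity. The sketch below uses made-up 3-dimensional vectors purely for illustration (real Word2Vec vectors have 300 dimensions, and their similarities will not necessarily look this clean); `toy_vectors` and `cosine_similarity` are hypothetical names.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1 = same direction, -1 = opposite direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Made-up vectors, chosen by hand so that "happy" and "glad" point in a similar direction
toy_vectors = {
    "happy": [0.9, 0.1, 0.3],
    "glad":  [0.8, 0.2, 0.4],
    "sad":   [-0.7, 0.2, 0.3],
}

sim_glad = cosine_similarity(toy_vectors["happy"], toy_vectors["glad"])
sim_sad = cosine_similarity(toy_vectors["happy"], toy_vectors["sad"])

print(f"happy vs. glad: {sim_glad:.2f}")
print(f"happy vs. sad:  {sim_sad:.2f}")
```

With these hand-picked vectors, "happy" is far closer to "glad" than to "sad".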

In [None]:
# --------------------- PREPARED FOR YOU, DO NOT EDIT --------------------------

# Download pre-trained Word2Vec embeddings trained on Google News corpus
word2vec_model = api.load('word2vec-google-news-300')

# Assuming 'vocab' is a dictionary mapping your vocabulary words to unique indices
vocab_size = len(vocab)
embedding_dim = 300  # Dimensionality of Google News Word2Vec embeddings

# Initialize an embedding matrix that will be used to set the weights in your embedding layer
embedding_matrix = torch.zeros((vocab_size, embedding_dim))

for word, index in vocab.items():
    try:
        # Update the row in embedding matrix with the Word2Vec vector if the word is found
        embedding_matrix[index] = torch.tensor(word2vec_model[word])
    except KeyError:
        # If the word is not found in the Word2Vec model, initialize the row with random values
        embedding_matrix[index] = torch.tensor(np.random.normal(scale=0.6, size=(embedding_dim, )))


#### 3. The Model 🤖

*This section is prepared for you - there's **no need to edit this code, <u>UNLESS</u> you want to adjust the model architecture and feel confident in doing so!** Either way, you do need to **run the cells** and, ideally, take some time to **understand** what is going on!*

In [None]:
class Model(nn.Module):
    def __init__(self, vocab_size, embedding_dim, lstm_dim, hidden_dim1, hidden_dim2, lstm_layers, pretrained_embeddings, activation_function):
        super(Model, self).__init__()

        # Embeddings
        self.embedding = nn.Embedding(vocab_size, embedding_dim)
        self.embedding.weight = nn.Parameter(pretrained_embeddings)             # Load the pre-trained embeddings from Word2Vec
        self.embedding.weight.requires_grad = True                              # True means the embeddings will be fine-tuned as the model trains! (False would leave these as they are)

        # Bi-Directional LSTM layer
        self.lstm = nn.LSTM(embedding_dim, lstm_dim, num_layers=lstm_layers, batch_first=True, dropout=0.5 if lstm_layers > 1 else 0, bidirectional=True)

        # Linear layers
        self.linear1 = nn.Linear(2 * lstm_dim, hidden_dim1)                     # *2 for bidirectional LSTM output
        self.linear2 = nn.Linear(hidden_dim1, hidden_dim2)
        self.linear3 = nn.Linear(hidden_dim2, 2)

        # Activation function
        self.actfn = activation_function

        # Dropout for regularization
        self.dropout = nn.Dropout(p=0.5)

    def forward(self, x):
        x = self.embedding(x)                                                   # Embed the input batch
        x, (hidden, cell) = self.lstm(x)                                        # Get the output from the LSTM
        x = torch.cat((hidden[-2,:,:], hidden[-1,:,:]), dim = 1)                # Use the last hidden state (from both directions, since this is a bi-directional LSTM)
        x = self.dropout(self.actfn(self.linear1(x)))                           # Apply the linear layers and activation function
        x = self.dropout(self.actfn(self.linear2(x)))
        logits = self.linear3(x)
        return logits
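
If you are wondering why `forward` picks `hidden[-2]` and `hidden[-1]`, the following sketch prints the shapes that a bi-directional LSTM returns. All sizes and `toy_` names are hypothetical and chosen only to keep the example small; they are deliberately different from the notebook's own variable names so that running this does not overwrite your hyperparameters.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Hypothetical toy sizes: 3 sequences, 5 tokens each, 8-dimensional embeddings
toy_batch, toy_len, toy_emb, toy_lstm_dim, toy_layers = 3, 5, 8, 4, 2

toy_lstm = nn.LSTM(toy_emb, toy_lstm_dim, num_layers=toy_layers, batch_first=True, bidirectional=True)
toy_input = torch.randn(toy_batch, toy_len, toy_emb)

toy_output, (toy_hidden, toy_cell) = toy_lstm(toy_input)
print(toy_output.shape)    # torch.Size([3, 5, 8]) -> (batch, sequence length, 2 * lstm_dim)
print(toy_hidden.shape)    # torch.Size([4, 3, 4]) -> (2 * lstm_layers, batch, lstm_dim)

# hidden[-2] is the forward direction of the last layer, hidden[-1] is its backward direction
final_state = torch.cat((toy_hidden[-2, :, :], toy_hidden[-1, :, :]), dim=1)
print(final_state.shape)   # torch.Size([3, 8])    -> 2 * lstm_dim, the input size of linear1
```

This is why the first linear layer expects `2 * lstm_dim` input features.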


#### 4. Setting the Hyperparameters 🧰🛠

❗ *This is the part where you need to make some decisions! Read the instructions below carefully.* ❗

In the code-cell below, you see a list of hyperparameters. They are all set to a "dummy" value of ```1```. Your job is to change these values. You can also change the optimizer and the loss function!

<br>

After you have set the hyperparameters, you will run the training code-cell further down in this notebook. The code there is prepared and will automatically load the model with your hyperparameters and run the training. You will then be shown the statistics of your achieved loss and accuracy. The goal is to optimize these! After you've run the training code-cell and viewed the results, you can go back to the hyperparameter code-cell to adjust the hyperparameters and try the model training again. You can re-adjust the hyperparameters and then re-run the model training as often as you want!

<br>

A quick reminder: hyperparameters are the model and/or training settings that we can pre-define and they will influence either the model itself or how it is trained. Depending on the hyperparameters, the results of a model can vary a lot!
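
To get a feeling for how strongly a single hyperparameter can change the outcome, here is a toy sketch of plain gradient descent on the function f(w) = w². This is not the AdamW optimizer used in this notebook (which behaves differently), and `minimise_square` is a hypothetical helper; the sketch only shows that the same procedure can converge, stall, or blow up depending on the learning rate.

```python
def minimise_square(lr, steps=20, start=1.0):
    """Plain gradient descent on f(w) = w**2, whose gradient is 2 * w."""
    w = start
    for _ in range(steps):
        w = w - lr * 2 * w          # one update step: move against the gradient
    return w

for lr in (0.1, 1.0, 1.5):
    print(f"lr={lr}: w after 20 steps = {minimise_square(lr):.4g}")
```

The minimum is at w = 0. With a learning rate of 0.1 the value shrinks towards 0, with 1.0 it only flips its sign at every step, and with 1.5 it grows at every step.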

<br>

Your job is to experiment with the hyperparameter settings and to try to optimize the model so that it fulfills both of the criteria below:

<br>

1. **It achieves a training accuracy above 80%**
2. **It does not overfit on the training data: The accuracy on the test data should be above 70%**


<br>

<br>

Here are a few links that might be useful for your hyperparameter setting:

- Take a look at the available optimizers <a href="https://pytorch.org/docs/stable/optim.html">here</a>

- Take a look at the available loss functions <a href="https://pytorch.org/docs/stable/nn.functional.html#loss-functions">here</a>, and for some additional information <a href="https://blog.paperspace.com/pytorch-loss-functions/">here</a>

- Take a look at the available activation functions <a href="https://pytorch.org/docs/stable/nn.html#non-linear-activations-weighted-sum-nonlinearity">here</a>

In [None]:
# ----------------- ADJUST THESE PARAMETERS -----------------------

# Hyperparameters for the model architecture
lstm_layers =     1
lstm_dim =        1
activation_function = nn.ReLU()
hidden_dim1 =     1
hidden_dim2 =     1
# output_dim is already set to 2 in the model architecture (since there are only two output classes!)

# Hyperparameters for the training loop
learning_rate =   1
epochs =          1
batch_size =      1

def get_optimizer_and_loss(model):

    # Change the optimizer if you want to (replace the "AdamW" with a different optimizer from torch.optim.)
    optimizer = torch.optim.AdamW(model.parameters(), lr=learning_rate)

    # Change the loss function if you want to (check out the loss functions on the torch.nn page)
    loss_fn = torch.nn.CrossEntropyLoss()

    return optimizer, loss_fn


#### 5. The Training Loop 📈🔁

*This section is prepared for you - there's **no need to edit** this code. This is where you will monitor how well the model performs on the training data and the validation data. The statistics and charts that are created at the end will show you how the loss and accuracy developed.*
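
The accuracy reported per epoch is the mean of the per-batch accuracies, where each batch accuracy comes from `(outputs.argmax(1) == sentiments).float().mean()`. A minimal plain-Python sketch of that calculation, using made-up logits and a hypothetical helper `batch_accuracy`, could look like this:

```python
def batch_accuracy(logits, labels):
    """Fraction of rows whose largest logit sits at the index of the true label."""
    predictions = [row.index(max(row)) for row in logits]        # argmax per row
    correct = sum(int(p == y) for p, y in zip(predictions, labels))
    return correct / len(labels)

# Made-up logits for two batches (each row: [score for class 0, score for class 1])
batch_1 = ([[2.0, -1.0], [0.3, 0.8], [1.5, 0.2], [-0.4, 0.9]], [0, 1, 1, 1])
batch_2 = ([[0.1, 0.7]], [0])                                    # a smaller last batch

accuracies = [batch_accuracy(logits, labels) for logits, labels in (batch_1, batch_2)]
mean_of_batches = sum(accuracies) / len(accuracies)
overall = (3 + 0) / (4 + 1)                                      # correct predictions / all examples

print(accuracies)        # [0.75, 0.0]
print(mean_of_batches)   # 0.375
print(overall)           # 0.6
```

Averaging over batches weights a small last batch as heavily as a full one, so the reported value can differ a little from the overall share of correct predictions.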


In [None]:
# --------------------- PREPARED FOR YOU, DO NOT EDIT --------------------------


# ----------------- Prepare the Dataloader for Training -----------------------

train_loader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True, collate_fn=collate_fn)
val_loader = DataLoader(val_dataset, batch_size=batch_size, shuffle=False, collate_fn=collate_fn)
test_loader = DataLoader(test_dataset, batch_size=batch_size, shuffle=False, collate_fn=collate_fn)


# ------------------------ Instantiate the model ------------------------------

assignment_model = Model(vocab_size, embedding_dim, lstm_dim, hidden_dim1, hidden_dim2, lstm_layers, embedding_matrix, activation_function)


# ------------------------ Optimizer and Loss Function ------------------------

optimizer, loss_fn = get_optimizer_and_loss(assignment_model)


# ------------------------ Training Loop --------------------------------------

# Lists to keep track of loss and accuracy for each epoch
epoch_losses = []
epoch_accuracies = []

# Training loop (the number of epochs is set in the hyperparameters above)
for epoch in range(epochs):
    total_loss, total_acc = 0, 0
    for texts, sentiments in train_loader:                                      # get next iteration from our dataloader
        optimizer.zero_grad()                                                   # reset the gradients that have been calculated before
        outputs = assignment_model(texts)                                       # get the model's output
        loss = loss_fn(outputs, sentiments)                                     # calculate the loss between the model's output and the true labels
        loss.backward()                                                         # calculate the new gradients according to the loss
        optimizer.step()                                                        # the optimizer uses the gradients to update the model's parameters!

        # Accumulate loss and accuracy (for monitoring)
        total_loss += loss.item()
        total_acc += (outputs.argmax(1) == sentiments).float().mean().item()

    # Average loss and accuracy for the epoch
    avg_loss = total_loss / len(train_loader)
    avg_acc = total_acc / len(train_loader)

    # Append to lists
    epoch_losses.append(avg_loss)
    epoch_accuracies.append(avg_acc)

    print(f'Epoch {epoch+1}:\tLoss: {avg_loss:.2f},\tAccuracy: {avg_acc:.2f}')

if epoch_accuracies[-1] < 0.8:
  print(f"\nThe final accuracy on the training data is less than 0.80. See if you can adjust the model/ the hyperparameters to get this value to 0.80!")
else:
  print(f"\nThe final accuracy on the training data is above 0.80!! Well done! The model seems to be able to recognise positive and negative sentiments on the data!")

# ------------------- Evaluate Model on Test Data -----------------------------

assignment_model.eval()                                                         # Set the model to evaluation mode
with torch.no_grad():                                                           # This means we are NOT calculating the gradients (since this is not a step for the training)
    total_acc_test = 0
    for texts, sentiments in test_loader:
        outputs = assignment_model(texts)                                       # Get the model outputs (i.e. predictions)
        total_acc_test += (outputs.argmax(1) == sentiments).float().mean().item() # Accumulate the accuracy of each test batch

    avg_acc_test = total_acc_test / len(test_loader)                            # Get an average accuracy from all the batches that were part of the test loader
    if avg_acc_test < 0.7:                                                      # Check if the average accuracy is lower than 0.7
      print(f"\nAccuracy on the Test Data: {avg_acc_test:.2f}\nSeems the model is struggling with new (never-seen-before) data! Try to get it above 0.70!\n")
    else:
      print(f"\nAccuracy on the Test Data: {avg_acc_test:.2f}\nNice work! The model handles new data quite well!!\n")


# ----------------- Create Charts for Loss and Accuracy -----------------------

# Set up for the Loss and Accuracy Sub-plots
plt.figure(figsize=(14, 6))

# Plotting the loss
plt.subplot(1, 2, 1)
plt.plot(range(1, epochs + 1), epoch_losses, marker='o', color='darkred', label='Training Loss')
plt.title('Loss during Training')
plt.xlabel('Epoch')
plt.ylabel('Loss')
plt.ylim(0, max(epoch_losses) * 1.1)
plt.legend()
plt.grid(axis='y', linestyle='--', color='grey', alpha=0.7)

# Plotting the accuracy
plt.subplot(1, 2, 2)
plt.plot(range(1, epochs + 1), epoch_accuracies, marker='o', label='Training Accuracy', color='darkblue')
plt.bar(epochs + 1, avg_acc_test, color='lightblue', label='Test Accuracy', width=0.5)
plt.title('Accuracy on the Training and Test Data')
plt.xlabel('Epoch')
plt.xticks(list(range(1, epochs + 2)))
plt.ylabel('Accuracy')
plt.ylim(0, 1)
epoch_ticks = list(range(1, epochs + 1)) + ["Test"]
plt.xticks(list(range(1, epochs + 2)), epoch_ticks)
plt.legend()
plt.grid(axis='y', linestyle='--', color='grey', alpha=0.7)

plt.tight_layout()
plt.show()

#### 6. Test the Trained Model on your own Input Text 🕵

In [None]:
# --------------------- PREPARED FOR YOU, DO NOT EDIT --------------------------

def predict_sentiment(model, sentence, vocab):
    model.eval()  # Set the model to evaluation mode

    # Tokenize and convert to indices
    tokenized = tokenize_and_convert_to_indices(sentence)

    # Convert to tensor and add batch dimension
    indexed = torch.tensor([tokenized], dtype=torch.long)

    # Prediction
    with torch.no_grad():
        predictions = model(indexed)

    # Apply softmax to convert logits to probabilities
    probabilities = torch.softmax(predictions, dim=1)

    # Get the predicted class
    prediction = torch.argmax(probabilities, dim=1).item()

    return prediction


In [None]:
# --------------------- PREPARED FOR YOU, DO NOT EDIT --------------------------

# Test the trained model on your own input!
sentence = input("Now let's test the trained model on your own custom text!\nEnter your own input:\t")

predicted_sentiment = predict_sentiment(assignment_model, sentence, vocab)
sentiment_mapping = {0: 'Negative', 1: 'Positive'}

print(f"\nPredicted sentiment:\t{sentiment_mapping[predicted_sentiment]}")