<a href="https://colab.research.google.com/github/TurkuNLP/intro-to-nlp/blob/master/course_project_2023_template.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Introduction to HLT Project (Template)

- Student(s) Name(s):
- Date:
- Chosen Corpus:
- Contributions (if group project):

### Corpus information

- Description of the chosen corpus: Large Movie Review Dataset (IMDb). A dataset for binary sentiment classification that contains substantially more data than earlier benchmark datasets. It provides 25,000 highly polar movie reviews for training and 25,000 for testing, plus additional unlabeled data.
- Paper(s) and other published materials related to the corpus: 

State-of-the-art leaderboard: https://paperswithcode.com/sota/sentiment-analysis-on-imdb

Related sentiment analysis paper: Nehal Mohamed Ali, Marwa Mostafa Abd El Hamid and Aliaa Youssif, "Sentiment Analysis for Movies Reviews Dataset Using Deep Learning Models".

- State-of-the-art performance (best published results) on this corpus:

| Rank | Model | Accuracy (%) | Paper | Year |
|------|-------|--------------|-------|------|
| 1 | RoBERTa-large with LlamBERT | 96.68 | LlamBERT: Large-scale low-cost data annotation in NLP | 2024 |
| 2 | RoBERTa-large | 96.54 | LlamBERT: Large-scale low-cost data annotation in NLP | 2024 |
| 3 | XLNet | 96.21 | XLNet: Generalized Autoregressive Pretraining for Language Understanding | 2019 |

---

## 1. Setup

In [63]:
# Your code to install and import libraries etc. here
import datasets 
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
import torch 
import transformers
import evaluate

---

## 2. Data download and preprocessing

### 2.1. Download the corpus

In [64]:
# Your code to download the corpus here
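# Name the config explicitly: when the Hub is unreachable, the cache lookup for the
# implicit 'default' config fails, because the cached copy is stored under 'plain_text'.
dset = datasets.load_dataset('imdb', 'plain_text')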
display(dset)

Using the latest cached version of the dataset since imdb couldn't be found on the Hugging Face Hub


ValueError: Couldn't find cache for imdb for config 'default'
Available configs in the cache: ['plain_text']

### 2.2. Preprocessing

In [None]:
# Your code for any necessary preprocessing here
# Shuffle the dataset
dset = dset.shuffle(seed=42)
# Remove the unlabeled 'unsupervised' split, which is not needed for this task
del dset['unsupervised']

In [None]:
vectorizer = CountVectorizer(binary=True, max_features=25000)
text_list = [i['text'] for i in dset['train']]
vectorizer.fit(text_list)

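def vectorize_example(examples, vectorizer):
    # Binary bag-of-words row for this review (1 x vocabulary size, sparse)
    vectorized = vectorizer.transform([examples["text"]])
    # Column indices of the words that occur in the review
    non_zero = vectorized.nonzero()[1]
    # Shift by one so that index 0 stays free for padding
    non_zero += 1
    return {'input_ids': non_zero}

# Conversion vocabulary: vectorizer index (without the +1 shift) -> word
idx2word = {v: k for k, v in vectorizer.vocabulary_.items()}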

tokenized_data = dset.map(vectorize_example, num_proc=4, fn_kwargs={'vectorizer': vectorizer})

Map (num_proc=4): 100%|██████████| 25000/25000 [00:06<00:00, 4134.59 examples/s]
Map (num_proc=4): 100%|██████████| 25000/25000 [00:06<00:00, 4093.63 examples/s]
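
To make the +1 offset concrete, here is a minimal pure-Python sketch of the same idea on a toy corpus. It is a simplified stand-in for `CountVectorizer(binary=True)`, not scikit-learn's implementation: the whitespace tokenizer and the helper names `build_vocab` and `to_input_ids` are assumptions made for illustration.

```python
def build_vocab(texts):
    """Map each distinct lowercased token to an index, in alphabetical order."""
    tokens = sorted({tok for text in texts for tok in text.lower().split()})
    return {tok: idx for idx, tok in enumerate(tokens)}

def to_input_ids(text, vocab):
    """Return the sorted ids of the known tokens present in `text`, shifted by +1.

    Id 0 is left free so it can serve as the padding value later on.
    """
    present = {vocab[tok] for tok in text.lower().split() if tok in vocab}
    return [idx + 1 for idx in sorted(present)]

toy_texts = ["a great movie", "a dull movie"]
vocab = build_vocab(toy_texts)          # {'a': 0, 'dull': 1, 'great': 2, 'movie': 3}
ids = to_input_ids("great great movie", vocab)
print(ids)                              # [3, 4]: repeated words count once (binary)

# Undo the offset when mapping ids back to words.
idx2word = {idx: tok for tok, idx in vocab.items()}
print([idx2word[i - 1] for i in ids])   # ['great', 'movie']
```

Because id 0 never refers to a word, it can be used as a padding value without colliding with the vocabulary.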


In [None]:
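test_row = tokenized_data['train'][0]['input_ids']
# input_ids are shifted by +1 (0 is reserved for padding), so undo the shift before the lookup
converted_text = [idx2word[i - 1] for i in test_row]
print(converted_text)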

['above', 'action', 'actress', 'allah', 'americana', 'anders', 'area', 'argument', 'atari', 'beulah', 'bother', 'butch', 'bye', 'characterisation', 'clairvoyant', 'classical', 'compared', 'complicating', 'criminal', 'englishman', 'enjoyment', 'evaluated', 'factions', 'faraway', 'fur', 'goodbye', 'handbook', 'haven', 'howard', 'ifc', 'isaac', 'italian', 'judged', 'justice', 'languages', 'likeable', 'looming', 'maine', 'mayberry', 'moreau', 'noah', 'notable', 'onassis', 'oral', 'others', 'peoples', 'plotline', 'plotted', 'policeman', 'preferable', 'primed', 'quits', 'realm', 'relations', 'serio', 'similarity', 'simpler', 'spirited', 'spotlight', 'superficiality', 'suspected', 'thank', 'thatch', 'theater', 'thereafter', 'thick', 'things', 'thinker', 'tho', 'toad', 'took', 'violently', 'wayans', 'weak', 'weaken', 'weirdos', 'writings']


In [None]:
def collator(examples):
    batch = {"labels":torch.tensor(list(example["label"] for example in examples))}
    tensors = []
    max_len = max(len(example["input_ids"]) for example in examples)
    for example in examples:
        ids = torch.tensor(example["input_ids"])
        padded = torch.nn.functional.pad(ids, (0, max_len - ids.shape[0]))
        tensors.append(padded)
    batch["input_ids"] = torch.vstack(tensors)
    return batch
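
The padding step of the collator can be illustrated without PyTorch. The sketch below is a NumPy analogue written for illustration only (the name `pad_batch` is hypothetical): it right-pads each id list with 0 up to the length of the longest list in the batch.

```python
import numpy as np

def pad_batch(id_lists, pad_value=0):
    """Right-pad variable-length id lists into one (batch, max_len) array."""
    max_len = max(len(ids) for ids in id_lists)
    rows = [
        np.pad(np.asarray(ids, dtype=np.int64), (0, max_len - len(ids)),
               constant_values=pad_value)
        for ids in id_lists
    ]
    return np.vstack(rows)

batch = pad_batch([[3, 4], [1, 2, 5, 7], [6]])
print(batch)
# [[3 4 0 0]
#  [1 2 5 7]
#  [6 0 0 0]]
```

Padding per batch, rather than to one global maximum length, keeps short batches small.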

---

## 3. Machine learning model

### 3.1. Model training

In [None]:
# Your code to train the machine learning model on the training set and evaluate the performance on the validation set here

class MLPConfig(transformers.PretrainedConfig):
    pass
class MLP(transformers.PreTrainedModel):
    config_class=MLPConfig
    def __init__(self,config):
        super().__init__(config)
        self.vocab_size=config.vocab_size #embedding matrix row count
        self.embedding=torch.nn.Embedding(num_embeddings=self.vocab_size+1,embedding_dim=config.hidden_size,padding_idx=0)
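        torch.nn.init.uniform_(self.embedding.weight.data,-0.001,0.001)
        self.embedding.weight.data[self.embedding.padding_idx].zero_() #the re-initialisation also overwrote the padding row, so reset it to zero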
        self.output=torch.nn.Linear(in_features=config.hidden_size,out_features=config.nlabels)

    def forward(self,input_ids,labels=None):
        embedded=self.embedding(input_ids)
        embedded_summed=torch.sum(embedded,dim=1)
        projected=torch.tanh(embedded_summed) 
        logits=self.output(projected)
        if labels is not None:
            loss=torch.nn.CrossEntropyLoss()
            return (loss(logits,labels),logits)
        else:
            return (logits,)
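
The forward pass is small enough to write out by hand. The following NumPy sketch mirrors it under stated assumptions: random weights, a tiny vocabulary, and a padding row that is explicitly zero so that padded positions do not change the sum. The names and sizes are illustrative only and are not taken from the trained model.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, hidden_size, n_labels = 5, 4, 2

# Row 0 is the padding embedding and is kept at zero.
embedding = rng.uniform(-0.001, 0.001, size=(vocab_size + 1, hidden_size))
embedding[0] = 0.0
W = rng.normal(size=(hidden_size, n_labels))
b = np.zeros(n_labels)

def forward(input_ids):
    """Sum the embeddings of each row, apply tanh, then a linear layer."""
    embedded = embedding[input_ids]          # (batch, seq_len, hidden)
    summed = embedded.sum(axis=1)            # (batch, hidden)
    return np.tanh(summed) @ W + b           # (batch, n_labels)

unpadded = forward(np.array([[2, 5]]))
padded = forward(np.array([[2, 5, 0, 0]]))   # same review, padded with zeros
print(np.allclose(unpadded, padded))         # True
```

Summing the embeddings makes the representation independent of word order, which fits the bag-of-words input.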

### 3.2 Hyperparameter optimization

In [None]:
# Your code for hyperparameter optimization here
learning_rates = [1e-5, 1e-4, 1e-3, 1e-2]
batch_sizes = [16, 32, 64, 128]

def compute_accuracy(outputs_and_labels):
    outputs, labels = outputs_and_labels
    predictions = np.argmax(outputs, axis=-1) #pick the index of the "winning" label
    return accuracy.compute(predictions=predictions, references=labels)

accuracy = evaluate.load("accuracy")

mlp_config=MLPConfig(vocab_size=len(vectorizer.vocabulary_),hidden_size=20,nlabels=2)

best_learning_rate = None
best_batch_size = None
best_accuracy = 0
for lr in learning_rates:
    for batch_size in batch_sizes:
        print(f"Training with lr={lr} and batch_size={batch_size}")
        mlp=MLP(mlp_config)
        trainer_args = transformers.TrainingArguments(
            "mlp_checkpoints", #save checkpoints here
            evaluation_strategy="steps",
            logging_strategy="steps",
            eval_steps=500,
            logging_steps=500,
            learning_rate=lr, #learning rate of the gradient descent
            max_steps=20000,
            load_best_model_at_end=True,
            per_device_train_batch_size=batch_size,
        )
        early_stopping = transformers.EarlyStoppingCallback(5)
        trainer = transformers.Trainer(
            model=mlp,
            args=trainer_args,
            train_dataset=tokenized_data["train"].select(range(10000)),
            eval_dataset=tokenized_data["test"].select(range(1000)), #make a smaller subset to evaluate on
            compute_metrics=compute_accuracy,
            data_collator=collator,
            callbacks=[early_stopping]
        )
        trainer.train()
        eval_result = trainer.evaluate(tokenized_data["test"])
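        if eval_result["eval_accuracy"] > best_accuracy:
            best_accuracy = eval_result["eval_accuracy"]
            best_learning_rate = lr
            best_batch_size = batch_size

# Report the winning configuration
print(f"Best accuracy {best_accuracy} with lr={best_learning_rate} and batch_size={best_batch_size}")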

Using the latest cached version of the module from C:\Users\omarn\.cache\huggingface\modules\evaluate_modules\metrics\evaluate-metric--accuracy\f887c0aab52c2d38e1f8a215681126379eca617f96c447638f751434e8e65b14 (last modified on Sun Mar 31 18:07:34 2024) since it couldn't be found locally at evaluate-metric--accuracy, or remotely on the Hugging Face Hub.


NameError: name 'MLPConfig' is not defined
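
The search loop above is a plain grid search over two hyperparameters. Its control flow can be sketched independently of the `Trainer`. In the sketch below, `train_and_score` is a hypothetical stand-in that returns a made-up score, so the numbers it produces mean nothing and only the selection logic is of interest.

```python
from itertools import product

learning_rates = [1e-5, 1e-4, 1e-3, 1e-2]
batch_sizes = [16, 32, 64, 128]

def train_and_score(lr, batch_size):
    """Hypothetical placeholder for 'train a model, return validation accuracy'."""
    # Arbitrary toy function that peaks at lr=1e-3, batch_size=32.
    return 1.0 - abs(lr - 1e-3) - abs(batch_size - 32) / 1000

best = {"accuracy": float("-inf"), "lr": None, "batch_size": None}
for lr, batch_size in product(learning_rates, batch_sizes):
    score = train_and_score(lr, batch_size)
    if score > best["accuracy"]:
        best = {"accuracy": score, "lr": lr, "batch_size": batch_size}

print(best["lr"], best["batch_size"])
```

Keeping the best configuration in a single record makes it easy to report afterwards.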

In [None]:
#Save model 
#trainer.save_model("mlp_model")
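
The `EarlyStoppingCallback(5)` used in the search stops a run once the monitored metric has failed to improve for five consecutive evaluations. The sketch below shows that patience rule in isolation. It is a simplified illustration with a hypothetical function name and invented loss values, not the callback's actual code.

```python
def stopping_step(eval_losses, patience=5):
    """Return the index of the evaluation at which training would stop.

    Training stops after `patience` consecutive evaluations without a new
    best (lowest) loss. Returns None if that never happens.
    """
    best = float("inf")
    bad_evals = 0
    for step, loss in enumerate(eval_losses):
        if loss < best:
            best = loss
            bad_evals = 0
        else:
            bad_evals += 1
            if bad_evals >= patience:
                return step
    return None

losses = [0.70, 0.55, 0.48, 0.49, 0.50, 0.48, 0.51, 0.52, 0.47]
print(stopping_step(losses, patience=5))  # 7
```

In this invented sequence the run stops at evaluation 7 and never reaches the lower loss at evaluation 8, which is the usual trade-off of a patience-based rule.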

### 3.3. Evaluation on test set

In [None]:
# Your code to evaluate the final model on the test set here
test_results = trainer.predict(tokenized_data["test"].select(range(5000)))
test_accuracy = compute_accuracy((test_results.predictions, test_results.label_ids))
print(f"Test accuracy: {test_accuracy}")

100%|██████████| 625/625 [00:00<00:00, 768.58it/s]

Test accuracy: {'accuracy': 0.8802}





In [None]:
# Convert the first 10 predictions to labels
print("Predictions:", "\n", test_results.predictions[:10])
print("Binary predicted labels:", np.argmax(test_results.predictions[:10], axis=-1))


#Print first 10 true labels
true_labels = test_results.label_ids[:10]
print("True labels:", true_labels)


Predictions: 
 [[-1.4783843   1.2700074 ]
 [-0.16219756  0.06732692]
 [-0.01269698 -0.06483601]
 [-1.0602162   0.89951706]
 [ 0.37185746 -0.41186056]
 [-1.1537267   0.98373985]
 [ 0.09102534 -0.15580702]
 [ 1.1402507  -1.1289095 ]
 [ 0.29495418 -0.34836176]
 [-0.6786371   0.54820937]]
Binary predicted labels: [1 1 0 1 0 1 0 0 0 1]
True labels: [1 1 0 1 0 1 1 0 0 1]
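
The raw predictions are logits, one per class. As a quick check, the sketch below takes the ten logit pairs and true labels printed above, recomputes the predicted labels with `argmax`, and converts the logits to probabilities with a softmax. The softmax step is an addition for illustration and is not part of the original evaluation.

```python
import numpy as np

logits = np.array([
    [-1.4783843, 1.2700074], [-0.16219756, 0.06732692],
    [-0.01269698, -0.06483601], [-1.0602162, 0.89951706],
    [0.37185746, -0.41186056], [-1.1537267, 0.98373985],
    [0.09102534, -0.15580702], [1.1402507, -1.1289095],
    [0.29495418, -0.34836176], [-0.6786371, 0.54820937],
])
true_labels = np.array([1, 1, 0, 1, 0, 1, 1, 0, 0, 1])

predicted = logits.argmax(axis=-1)
accuracy_10 = (predicted == true_labels).mean()

# Softmax over the class axis (shifted by the row max for numerical stability).
shifted = logits - logits.max(axis=-1, keepdims=True)
probs = np.exp(shifted) / np.exp(shifted).sum(axis=-1, keepdims=True)

print(predicted)      # [1 1 0 1 0 1 0 0 0 1]
print(accuracy_10)    # 0.9
```

The single miss is the seventh example, where the two logits are close together (0.09 vs. -0.16), so the model is not confident about that prediction.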


---

## 4. Results and summary

### 4.1 Corpus insights

The "imdb" corpus consists of movie reviews from IMDb: 25,000 positive and 25,000 negative labeled reviews, split evenly into a training set and a test set. Each entry consists of the review text and its sentiment label. Only highly polar reviews are included; neutral reviews are left out. Reviewers rated each movie on a scale from 1 to 10: reviews with a score of 4 or lower are labeled negative and reviews with a score of 7 or higher are labeled positive. No more than 30 reviews per movie are included.
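
The labeling rule described above can be written as a small function. This is an illustrative sketch (the name `score_to_label` is hypothetical), and the integer encoding 0 = negative, 1 = positive is assumed to match the label ids used in the dataset.

```python
def score_to_label(score):
    """Map a 1-10 IMDb rating to a sentiment label, or None if excluded."""
    if not 1 <= score <= 10:
        raise ValueError("score must be between 1 and 10")
    if score <= 4:
        return 0      # negative
    if score >= 7:
        return 1      # positive
    return None       # scores 5 and 6 are treated as neutral and left out

print([score_to_label(s) for s in range(1, 11)])
# [0, 0, 0, 0, None, None, 1, 1, 1, 1]
```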

### 4.2 Results

The final model reached an accuracy of 88.02% on a 5,000-review subset of the test set. We performed hyperparameter tuning on a subset of the data and found the best learning rate () and batch size (), which got us to an accuracy of __.

### 4.3 Relation to state of the art

The best published result for binary classification on the "imdb" dataset is an accuracy of 96.68%, obtained with RoBERTa-large with LlamBERT. RoBERTa-large is a BERT-style pretrained Transformer language model (BERT itself was developed by Google), so it is far larger and more expensive to train than our model. Against that background, an accuracy of 88.02% from a simple bag-of-words MLP can be viewed as a success. For comparison, Ali et al. report an accuracy of 86.74% for an MLP model in their paper.

---

## 5. Bonus Task (optional)

### 5.1. Annotating out-of-domain documents

(Briefly describe the chosen out-of-domain documents)

(Briefly describe the process of annotation)

### 5.2 Conversion into dataset

In [None]:
# Your code to convert the annotations into a dataset here

### 5.3. Model evaluation on out-of-domain test set

In [None]:
# Your code to evaluate the model on the out-of-domain test set here

### 5.4 Bonus task results

(Present the results of the evaluation on the out-of-domain test set)

### 5.5. Annotated data

In [None]:
# Include your annotated out-of-domain data here