<a href="https://colab.research.google.com/github/Iispar/hlt-project/blob/main/course_project_2023_template.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Introduction to HLT 2023 Project (Template)

- Student(s) Name(s): Iiro Partanen
- Date: -
- Chosen Corpus: emotion
- Contributions (if group project):

### Corpus information

- Description of the chosen corpus: 
Emotion is a dataset of English Twitter messages with six basic emotions: anger, fear, joy, love, sadness, and surprise. For more detailed information please refer to the paper.
- Paper(s) and other published materials related to the corpus: 
  - CARER: Contextualized Affect Representations for Emotion Recognition (Saravia et al., EMNLP 2018)
  - https://paperswithcode.com/dataset/emotion
- State-of-the-art performance (best published results) on this corpus: 95% f1

---

## 1. Setup

In [71]:
!pip3 install -q transformers datasets evaluate
!pip install trankit
import datasets
import sklearn.feature_extraction
import torch
import transformers
import numpy as np
import evaluate

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


---

## 2. Data download and preprocessing

### 2.1. Download the corpus

In [72]:
dset = datasets.load_dataset("emotion");
# check it works
print(dset);



  0%|          | 0/3 [00:00<?, ?it/s]

DatasetDict({
    train: Dataset({
        features: ['text', 'label'],
        num_rows: 16000
    })
    validation: Dataset({
        features: ['text', 'label'],
        num_rows: 2000
    })
    test: Dataset({
        features: ['text', 'label'],
        num_rows: 2000
    })
})


### 2.2. Preprocessing

In [73]:
# vectorizes one item
def vectorize_item(item):
    vectorized = vectorizer.transform([item["text"]]); # vectorize. Initialized below...
    non_zero_features = vectorized.nonzero()[1]; # get the nonzeros and we take only the columns of the nonzeros because our matrix is only one row.
    non_zero_features += 1; # index zero is for padding so let's avoid it by adding 1 to all.

    return {"input_ids":non_zero_features} 

In [74]:
dset.shuffle(); #shuffle dataset for safety

# vectorization.
vectorizer = sklearn.feature_extraction.text.CountVectorizer( # get the vectorizer.
    binary = True,
    max_features = 20000, # Selected 20k to start with.
    token_pattern = r"(?u)\b\w+\b" # Token pattern to include one char words.
    )

texts=[item["text"] for item in dset["train"]]; # get all texts from train
vectorizer.fit(texts); # fitting the vectorizer

# vectorize the whole dataset.
dset_tokenized = dset.map(vectorize_item,num_proc=4);
# check it works
print(dset_tokenized["train"][0]);



{'text': 'i didnt feel humiliated', 'label': 0, 'input_ids': [3620, 4931, 6438, 6495]}


In [75]:
# padding and batching
def collator(list_of_items):
    allLabels = [item["label"] for item in list_of_items]; # list of all labels.
    batch = {"labels": torch.tensor(allLabels)}; # create a tenstor for the item (batch)
    tensors = [];
    max_len = max(len(item["input_ids"]) for item in list_of_items); # longest example in the batch. Pad to here.
    for item in list_of_items:
        ids = torch.tensor(item["input_ids"]); # input ids to tensor
        padded = torch.nn.functional.pad(ids,(0,max_len-ids.shape[0])); # actual padding. Pads ids, from + to max with 0.
        tensors.append(padded); # appended ids to tensors
    batch["input_ids"] = torch.vstack(tensors); # stacks items as they are now same len. Now these are matrixes.
    return batch;

# check it works
batch=collator([dset_tokenized["train"][2],dset_tokenized["train"][7]])
print("Shape of labels:",batch["labels"].shape)
print("Shape of input_ids:",batch["input_ids"].shape)
print(batch["labels"])
print(batch["input_ids"])
     

Shape of labels: torch.Size([2])
Shape of input_ids: torch.Size([2, 13])
tensor([3, 4])
tensor([[    1,  4931,  5739,  5800,  6495,  6563,  8456, 10157, 13643, 15061,
             0,     0,     0],
        [    1,    34,   749,  2680,  4931,  6495,  7076,  7712,  8076,  9245,
          9325, 13339, 15116]])


---

## 3. Machine learning model

### 3.1. Model training

In [76]:
# Your code to train the machine learning model on the training set and evaluate the performance on the validation set here

# needs a config, we wil just pass it.
class MLPConfig(transformers.PretrainedConfig):
    pass;

# model
class MLP(transformers.PreTrainedModel):
      config_class = MLPConfig; # sets config
      #initilazition
      def __init__(self,config):
        super().__init__(config); # call the super with out config which is now pass..
        self.vocab_size = config.vocab_size; # embedding matrix row count
        # Build embedding of vocab size +1 x hidden size. +1 again because of padding.
        self.embedding = torch.nn.Embedding(num_embeddings=self.vocab_size+1,embedding_dim=config.hidden_size,padding_idx=0);
        torch.nn.init.uniform_(self.embedding.weight.data,-0.001,0.001); # initialization of the embedding values
        self.output = torch.nn.Linear(in_features=config.hidden_size,out_features=config.nlabels); # output layer is the size of the labels x hidden size.

      # forward
      def forward(self,input_ids,labels=None):
        embedded = self.embedding(input_ids); # sum up all the embeddings
        embedded_summed = torch.sum(embedded,dim=1); # sum up across word dimension
        projected = torch.tanh(embedded_summed); # non-linearity
        logits = self.output(projected); # apply the outer layer
      

        ## calculates the loss
        if labels is not None:
            # calculates the loss.
            loss = torch.nn.CrossEntropyLoss();
            return (loss(logits,labels),logits);
        else:
            # if no labels, just return the logits
            return (logits,);
  
# config
mlp_config = MLPConfig(vocab_size=len(vectorizer.vocabulary_),hidden_size=20,nlabels=6);

In [77]:
# training

# Set training arguments
trainer_args = transformers.TrainingArguments(
    "mlp_checkpoints", #save checkpoints here
    evaluation_strategy = "steps",
    logging_strategy = "steps",
    eval_steps = 500,
    logging_steps = 500,
    learning_rate = 1e-4, #learning rate of the gradient descent
    max_steps = 20000,
    load_best_model_at_end = True,
    per_device_train_batch_size = 128
)


# evaluating
accuracy = evaluate.load("accuracy");
def compute_accuracy(outputs_and_labels):
    outputs, labels = outputs_and_labels;
    predictions = np.argmax(outputs, axis=-1); #pick the index of the "winning" label
    return accuracy.compute(predictions=predictions, references=labels);

# actual training
mlp = MLP(mlp_config); # Make a the actual model  
early_stopping = transformers.EarlyStoppingCallback(5); # stop training if the eval loss is not getting better.

# params
trainer = transformers.Trainer(
    model = mlp,
    args = trainer_args,
    train_dataset = dset_tokenized["train"],
    eval_dataset = dset_tokenized["test"],
    compute_metrics = compute_accuracy,
    data_collator = collator,
    callbacks = [early_stopping]
)

# FINALLY!
trainer.train();




Step,Training Loss,Validation Loss,Accuracy
500,1.7152,1.577946,0.536
1000,1.4954,1.446714,0.554
1500,1.3427,1.322141,0.5985
2000,1.1744,1.188143,0.657
2500,1.0019,1.056616,0.7285
3000,0.8416,0.936069,0.778
3500,0.7022,0.830737,0.818
4000,0.5859,0.741789,0.838
4500,0.4911,0.6678,0.849
5000,0.4149,0.607373,0.861


In [78]:
def vectorize_item(item):
    vectorized = vectorizer.transform([item["text"]]); # vectorize. Initialized below...
    non_zero_features = vectorized.nonzero()[1]; # get the nonzeros and we take only the columns of the nonzeros because our matrix is only one row.
    non_zero_features += 1; # index zero is for padding so let's avoid it by adding 1 to all.

    return {"input_ids":non_zero_features};

In [79]:
# creates a dataset with one example
def create_example(example):
  text = example,
  label = 0,
  data = {
      'text': text,
      'label': label,
  }
  return datasets.Dataset.from_dict(data);

In [88]:
# just having fun. testing random sentences that could be tweeted to see what they give as the result :)
text = 'The concert last night was so cool!';
labels = ['sadness', 'joy', 'love', 'anger', 'fear', 'suprise'];
example = create_example(text); #creates a dict with just the example
example_tokenized = example.map(vectorize_item,num_proc=4); # tokenize
prediction = trainer.predict(example_tokenized).predictions[0]; # predicts the label
print(prediction);
largest = max(prediction); # label with largest value
labelOfLargest = labels[list(prediction).index(largest)]; # name of label
print(f"{labelOfLargest} with confidence of: {largest}");



Map:   0%|          | 0/1 [00:00<?, ? examples/s]

[-0.08271115  3.2682822  -1.5420005  -0.85238016 -1.1974103  -1.4385706 ]
joy with confidence of: 3.268282175064087


### 3.2 Hyperparameter optimization

In [81]:
# Your code for hyperparameter optimization here

### 3.3. Evaluation on test set

In [82]:
# Your code to evaluate the final model on the test set here

---

## 4. Results and summary

### 4.1 Corpus insights

(Briefly discuss what you learned about the corpus and its annotation)

The corpus includes Twitter messages in english and they have been annotated with six basic emotions which are anger, fear, joy, love, sadness, and surprise. 

By the paper the tweets have been selected with some hashtags and then annotated. The selected hastags can be seen from the paper.


### 4.2 Results

(Briefly summarize your results)

### 4.3 Relation to state of the art

(Compare your results to the state-of-the-art performance)

---

## 5. Bonus Task (optional)

### 5.1. Annotating out-of-domain documents

(Briefly describe the chosen out-of-domain documents)

(Briefly describe the process of annotation)

### 5.2 Conversion into dataset

In [83]:
# Your code to convert the annotations into a dataset here

### 5.3. Model evaluation on out-of-domain test set

In [84]:
# Your code to evaluate the model on the out-of-domain test set here

### 5.4 Bonus task results

(Present the results of the evaluation on the out-of-domain test set)

### 5.5. Annotated data

In [85]:
# Include your annotated out-of-domain data here