## Mateusz Borowicz (2202830) - CE4145 NLP Coursework
04/11/2025

## Introduction (200 Words)

The rapid proliferation of social media platforms, such as Twitter, has generated vast amounts of unstructured text data, rich with emotional insights that can inform fields ranging from mental health monitoring to customer sentiment analysis. The emotion dataset from Hugging Face, comprising approximately 20,000 short tweets labeled with one of six emotions (sadness, anger, love, surprise, fear, joy), presents a valuable opportunity to classify these emotional expressions automatically. The problem we aim to resolve is the accurate classification of tweet emotions, enabling applications like real-time sentiment tracking, targeted marketing, or psychological trend analysis. Manual analysis of such data is infeasible due to its volume and subjectivity, making an NLP system essential for efficiently processing and interpreting these texts. By leveraging the emotion dataset’s noisy, informal tweets, we can evaluate two distinct NLP pipelines, a classical machine learning approach (TF-IDF with Logistic Regression) and a modern deep learning approach (fine-tuned DistilBERT), to determine the most effective strategy for capturing nuanced emotional cues in short-text data, balancing accuracy, computational efficiency, and interpretability. This comparative evaluation will provide insights into the strengths and trade-offs of each approach, addressing the need for robust, scalable solutions in emotion recognition tasks.

## Load Dataset
For now I am using a Tweet Eval Emotion dataset from Hugging Face. It is not a popular dataset, but it contains the data needed for a basic evaluation: the training split has 3257 records sourced from X (previously Twitter), each with a premade label.

https://huggingface.co/datasets/vimal-quilt/tweet-eval-emotion

In [36]:
import numpy as np #import numpy for array functionality
import pandas as pd #import pandas for dataframes

from sklearn.pipeline import Pipeline # pipeline functionality
from sklearn.feature_extraction.text import CountVectorizer # simple pre-processing vectorizer
from sklearn.feature_extraction.text import TfidfTransformer #representation learner
from sklearn.neighbors import KNeighborsClassifier #simple classifier model
from sklearn.svm import SVC # Import Support Vector Classifier
from sklearn.model_selection import StratifiedKFold # stratified folds keep the class proportions of the full dataset roughly the same in every fold
from sklearn.metrics import accuracy_score #import an accuracy metric to tell us how well the model is doing

splits = {'train': 'train.csv', 'test': 'test.csv'}
df = pd.read_csv("hf://datasets/vimal-quilt/tweet-eval-emotion/" + splits["train"])

# Display basic information about the dataset
display(df.shape) # rows, columns
display(df.head()) # first few records
display(df.info()) # Check types and null values

(3257, 2)

| | text | label |
|---|---|---|
| 0 | “Worry is a down payment on a problem you may ... | 2 |
| 1 | My roommate: it's okay that we can't spell bec... | 0 |
| 2 | No but that's so cute. Atsu was probably shy a... | 1 |
| 3 | Rooneys fucking untouchable isn't he? Been fuc... | 0 |
| 4 | it's pretty depressing when u hit pan on ur fa... | 3 |


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3257 entries, 0 to 3256
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   text    3257 non-null   object
 1   label   3257 non-null   int64 
dtypes: int64(1), object(1)
memory usage: 51.0+ KB


None
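Since the labels are integer class ids and the cross-validation later relies on `StratifiedKFold`, it may help to see what stratification does. The following is a minimal sketch, not scikit-learn's actual algorithm: it deals the indices of each class round-robin into folds so that every fold keeps roughly the same class mix. The function name `stratified_folds` and the toy labels are assumptions made for illustration and do not reflect the real class distribution of the dataset.

```python
from collections import Counter, defaultdict

def stratified_folds(labels, n_splits=5):
    """Assign sample indices to folds so each class is spread as evenly
    as possible across folds (round-robin within each class)."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    folds = [[] for _ in range(n_splits)]
    for label in sorted(by_class):
        for position, idx in enumerate(by_class[label]):
            folds[position % n_splits].append(idx)
    return folds

# Toy, imbalanced labels (hypothetical; not the real label distribution).
toy_labels = [0] * 10 + [1] * 5 + [2] * 5
folds = stratified_folds(toy_labels, n_splits=5)

for number, fold in enumerate(folds):
    # Count how many samples of each class landed in this fold.
    print(number, dict(Counter(toy_labels[i] for i in fold)))
```

With a plain (non-stratified) split, a rare class could end up missing from a fold entirely, which would make the per-fold accuracy less comparable.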

## Representation Learning Attempt #1

First, I will split the data into input and label variables. Then I will pre-process the tweets with Tokenization, Normalization, Stop Words Removal and Lemmatization.

In [37]:
# Use the 'text' and 'label' columns directly from the DataFrame
x = df['text'].values
y = df['label'].values

print(x.shape) # printing the shape shows we have 3257 examples, each of which is associated with a label
print(y.shape)

for i in range(0,5):
  print(x[i] +"\n") #first let's double check our input data is just a simple sentence

(3257,)
(3257,)
“Worry is a down payment on a problem you may never have'.  Joyce Meyer.  #motivation #leadership #worry 

My roommate: it's okay that we can't spell because we have autocorrect. #terrible #firstworldprobs 

No but that's so cute. Atsu was probably shy about photos before but cherry helped her out uwu 

Rooneys fucking untouchable isn't he? Been fucking dreadful again, depay has looked decent(ish)tonight 

it's pretty depressing when u hit pan on ur favourite highlighter 



+ Pre-process tweets with Tokenization, Normalization, Stop Words Removal and Lemmatization
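The pre-processing steps listed above could look roughly like the sketch below. It uses only the standard library, so several parts are assumptions made for illustration: the function name `preprocess`, the regular expressions, and the tiny `STOP_WORDS` set (a fuller list from a library would normally be used). Lemmatization is left as a pluggable `lemmatize` argument that defaults to doing nothing, because a real lemmatizer needs an external library and language data.

```python
import re

# Small illustrative stop-word list (an assumption, deliberately incomplete).
STOP_WORDS = {"a", "an", "the", "is", "it", "on", "you", "we",
              "that", "have", "my", "so", "but"}

TOKEN_RE = re.compile(r"[a-z0-9']+")

def preprocess(tweet, lemmatize=lambda token: token):
    """Return a list of cleaned tokens for one tweet."""
    text = tweet.lower()                        # normalization: case folding
    text = re.sub(r"https?://\S+", " ", text)   # normalization: drop URLs
    text = re.sub(r"@\w+", " ", text)           # normalization: drop user mentions
    text = text.replace("#", " ")               # keep hashtag words, drop the symbol
    tokens = TOKEN_RE.findall(text)             # tokenization
    tokens = [t.strip("'") for t in tokens]     # remove stray quote characters
    tokens = [t for t in tokens if t and t not in STOP_WORDS]  # stop-word removal
    return [lemmatize(t) for t in tokens]       # lemmatization hook (identity by default)

print(preprocess("My roommate: it's okay that we can't spell because we have autocorrect. #terrible #firstworldprobs"))
```

A function like this could be passed to `CountVectorizer` through its `tokenizer` or `analyzer` parameter, although the pipeline in this notebook currently relies on the vectorizer's built-in defaults.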

In [41]:
# List of classifiers to compare
classifiers = {
    'SVM': SVC(),
    'kNN': KNeighborsClassifier()
}

results = {}

# Pipeline
for name, classifier in classifiers.items():
  print(f"Training and evaluating {name}...")
  text_clf = Pipeline([ #the pipeline object allows us to organise a series of functions which will be applied to our text data as though they were a single function
    ('count', CountVectorizer()), #we will use a simple count vectorizer for our pre-processing (which cheats a little by combining numerous pre-processing steps)
    ('rep', TfidfTransformer()), #and a representation learning method using tf-idf
    ('mod', classifier), # Use the current classifier
    ])

  acc_score = [] #create a list to store the accuracy values

  kf = StratifiedKFold(n_splits=5) #we instantiate the kfold instance, and set the number of folds to 5
  for train, test in kf.split(x,y): #we use a for loop to iterate through each fold using the train and test indexes from the dataset

    x_train, x_test, y_train, y_test = x[train], x[test], y[train], y[test] #things can get a bit weird when inputting indexes to functions, so lets save them as variables
    #print(train)
    #print(test) #this will print the train and test indexes respectively, if you want to be sure they do not overlap

    text_clf.fit(x_train, y_train) # Fit data to pipeline models
    predictions = text_clf.predict(x_test) # Save predictions
    acc = accuracy_score(y_test, predictions) # Calculate accuracy (true labels first, then predictions)
    acc_score.append(acc) #we can append it to our list

  results[name] = np.mean(acc_score)
  print(f"Mean Accuracy for {name}: {results[name]:.4f}")


print("\n--- Comparison ---")
for name, acc in results.items():
  print(f"{name}: {acc:.4f}")

Training and evaluating SVM...
Mean Accuracy for SVM: 0.6343
Training and evaluating kNN...
Mean Accuracy for kNN: 0.6009

--- Comparison ---
SVM: 0.6343
kNN: 0.6009
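To make the count-then-TF-IDF representation used in the pipeline more concrete, here is a small worked sketch in plain Python. It follows the smoothed IDF, `ln((1 + n) / (1 + df)) + 1`, and L2 row normalisation that scikit-learn documents as the `TfidfTransformer` defaults. The whitespace tokenization is a simplification of what `CountVectorizer` does, and the function name `tfidf` and the toy sentences are assumptions made for illustration.

```python
import math
from collections import Counter

def tfidf(docs):
    """Return (vocabulary, rows), where each row is an L2-normalised TF-IDF vector."""
    tokenised = [doc.lower().split() for doc in docs]   # simplified tokenization
    vocab = sorted({tok for doc in tokenised for tok in doc})
    n_docs = len(docs)

    # Document frequency: in how many documents does each term appear?
    doc_freq = Counter(tok for doc in tokenised for tok in set(doc))
    # Smoothed inverse document frequency.
    idf = {tok: math.log((1 + n_docs) / (1 + doc_freq[tok])) + 1 for tok in vocab}

    rows = []
    for doc in tokenised:
        counts = Counter(doc)                           # raw term counts
        weights = [counts[tok] * idf[tok] for tok in vocab]
        norm = math.sqrt(sum(w * w for w in weights)) or 1.0
        rows.append([w / norm for w in weights])        # L2 normalisation
    return vocab, rows

# Toy documents (hypothetical, not taken from the dataset).
toy_docs = ["so happy today", "so sad today", "happy happy joy"]
vocab, rows = tfidf(toy_docs)

print(vocab)
for row in rows:
    print([round(w, 3) for w in row])
```

In the second toy document, "sad" appears in only one document and so receives a larger weight than "so" or "today", which is the effect TF-IDF is meant to have on rarer, more informative words.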


## Algorithms (500 Words)
Describe the Machine Learning models which you will compare to perform the
selected task. Consider reviewing Weeks 4 – 8 for ideas.

## Evaluation (200 Words)
Depending on the task you adopt, there are several different ways you may wish to
evaluate the proposed NLP system. You should demonstrate your understanding by suggesting an appropriate
evaluation strategy here (which suits both the task and the data). The report should then conclude with a
description of the outcome of the comparison, alongside any appropriate visualisation. Review Weeks 4 – 8
for specific ideas, or Week 9 for ideas relating to explaining the outcomes of NLP systems.