<a href="https://colab.research.google.com/github/MuchMarts/Assignment_TextClassification/blob/main/AS2_mp22042.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Assignment 2 - **Text classification**

**Author:** Mārtiņš Patjanko  
**Student ID:** mp22042

---

## Create and evaluate a text classifier

Step 1: Download the GoEmotions datasets (train/test/dev)  
Step 1.5: Download the Ekman emotion mapping, needed later

In [None]:
# Clean up directory
![ -f train.tsv ] && rm train.tsv
![ -f dev.tsv ] && rm dev.tsv
![ -f test.tsv ] && rm test.tsv
![ -f train_filtered.tsv ] && rm train_filtered.tsv
![ -f dev_filtered.tsv ] && rm dev_filtered.tsv
![ -f test_filtered.tsv ] && rm test_filtered.tsv
![ -f ekman_mapping.json ] && rm ekman_mapping.json
![ -f emotions.txt ] && rm emotions.txt

# Datasets
!wget https://raw.githubusercontent.com/google-research/google-research/master/goemotions/data/train.tsv
!wget https://raw.githubusercontent.com/google-research/google-research/master/goemotions/data/test.tsv
!wget https://raw.githubusercontent.com/google-research/google-research/master/goemotions/data/dev.tsv

# Ekman mapping
!wget https://raw.githubusercontent.com/google-research/google-research/master/goemotions/data/ekman_mapping.json

# Emotion list
!wget https://raw.githubusercontent.com/google-research/google-research/master/goemotions/data/emotions.txt

--2025-06-12 06:19:07--  https://raw.githubusercontent.com/google-research/google-research/master/goemotions/data/train.tsv
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.111.133, 185.199.108.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.111.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 3519053 (3.4M) [text/plain]
Saving to: ‘train.tsv’


2025-06-12 06:19:08 (9.98 MB/s) - ‘train.tsv’ saved [3519053/3519053]

--2025-06-12 06:19:09--  https://raw.githubusercontent.com/google-research/google-research/master/goemotions/data/test.tsv
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.109.133, 185.199.108.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.109.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 436706 (426K) [text/plain]
Saving to: ‘test.tsv’


2025-06-12 06:19:0

Step 2: Convert all emotions to the base emotions in the mapping and remove entries with multiple emotions

In [None]:
import json
import pandas as pd

# Get all emotions, to know which id corresponds to which label
with open('emotions.txt') as emotions:
  goemotions_labels = emotions.read().splitlines()

#print(goemotions_labels)

# Load the Ekman mapping and create label_to_ekman, which is effectively a reverse index
with open('ekman_mapping.json') as mapping:
  ekman_mapping = json.load(mapping)

#print(ekman_mapping)

label_to_ekman = {}
for ekman, labels in ekman_mapping.items():
    for lbl in labels:
        label_to_ekman[lbl] = ekman

# Add neutral as it is not present in ekman_mapping
label_to_ekman['neutral'] = 'neutral'


def map_ids_to_ekman(id_list):
    ekman_emotions = set()
    for idx in id_list:
        label = goemotions_labels[int(idx)]
        ekman = label_to_ekman.get(label)
        if ekman:
            ekman_emotions.add(ekman)
    return list(ekman_emotions)

datasets = ['train', 'test', 'dev']

for data_set in datasets:
  df = pd.read_csv(f'{data_set}.tsv', sep='\t', header=None)
  # Split emotion label ids into a list and store it in column 2
  df[2] = df[1].apply(lambda x: x.split(','))

  # Replace emotion label ids with corresponding labels
  df[2] = df[2].apply(map_ids_to_ekman)

  # Keep only entries with a single emotion; .copy() avoids pandas' SettingWithCopyWarning
  df = df[df[2].apply(lambda x: len(x) == 1)].copy()
  df[2] = df[2].apply(lambda x: x[0])

  df.to_csv(f'{data_set}_filtered.tsv', sep='\t', header=False, index=False)
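
The mapping and filtering in Step 2 can be illustrated on a toy example. This is a minimal sketch: `toy_labels` and `toy_ekman` are hypothetical stand-ins for `emotions.txt` and `ekman_mapping.json`, not the contents of the real files. It shows that an entry with several fine-grained labels is kept when they all collapse to the same Ekman class, and dropped otherwise.

```python
# Hypothetical stand-ins for emotions.txt and ekman_mapping.json (not the real files)
toy_labels = ["admiration", "anger", "annoyance", "neutral"]
toy_ekman = {"anger": ["anger", "annoyance"], "joy": ["admiration"]}

# Reverse index: fine-grained label -> Ekman class
toy_label_to_ekman = {lbl: ek for ek, lbls in toy_ekman.items() for lbl in lbls}
toy_label_to_ekman["neutral"] = "neutral"

def to_ekman(id_string):
    """Map a comma-separated id string to a sorted list of distinct Ekman classes."""
    labels = (toy_labels[int(i)] for i in id_string.split(","))
    return sorted({toy_label_to_ekman[lbl] for lbl in labels})

rows = ["1,2", "0,1", "3"]
mapped = [to_ekman(r) for r in rows]
kept = [m[0] for m in mapped if len(m) == 1]

print(mapped)  # [['anger'], ['anger', 'joy'], ['neutral']]
print(kept)    # ['anger', 'neutral']
```

Here `"1,2"` (anger, annoyance) survives as a single `anger` entry, while `"0,1"` maps to two Ekman classes and is removed.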




In [None]:
!pip install --quiet spacy
!python -m spacy download en_core_web_sm

Collecting en-core-web-sm==3.8.0
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.8.0/en_core_web_sm-3.8.0-py3-none-any.whl (12.8 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 12.8/12.8 MB 56.5 MB/s eta 0:00:00
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_sm')
⚠ Restart to reload dependencies
If you are in a Jupyter or Colab notebook, you may need to restart Python in
order to load all the package's dependencies. You can do this by selecting the
'Restart kernel' or 'Restart runtime' option.


In [None]:
import spacy

nlp = spacy.load("en_core_web_sm")

In [None]:
def clean_tokenize(doc):
  tokens = []

  for token in doc:
    # Skip whitespace tokens only.
    # Punctuation and stop words are kept for now; they could be dropped
    # with token.is_punct and token.is_stop.
    if token.is_space:
      continue

    # Add placeholders
    if token.like_num:
      tokens.append('<NUM>')
    elif token.like_url:
      tokens.append('<URL>')
    elif token.like_email:
      tokens.append('<EMAIL>')
    else:
      # Use the lemma; casing is left as spaCy returns it (e.g. 'I' stays 'I')
      tokens.append(token.lemma_)

  return tokens

In [None]:
from tqdm import tqdm

# All datasets are kept in memory and are not saved to file, because writing them to disk is not necessary here
# To read a dataset, use tokenized_datasets; the key is any value from the list datasets (train, test, dev)
tokenized_datasets = {}

for data_set in datasets:
  df = pd.read_csv(f'{data_set}_filtered.tsv', sep='\t', header=None)
  texts = df[0].tolist()

  # Batching to speed it up
  docs = list(tqdm(nlp.pipe(texts, batch_size=500), total=len(texts)))

  # Tokenize, lemmatize and insert placeholders; casing is left as produced by the lemmatizer
  df["tokens"] = [clean_tokenize(doc) for doc in docs]

  tokenized_datasets[data_set] = df.copy()
  #print(df)

100%|██████████| 39555/39555 [01:12<00:00, 544.06it/s]
100%|██████████| 4968/4968 [00:07<00:00, 622.63it/s]
100%|██████████| 4946/4946 [00:08<00:00, 612.14it/s]


# Analyze token frequencies and pruning

In [None]:
# Helper function for NGrams

def get_ngrams(tokens, n):
    return zip(*[tokens[i:] for i in range(n)])
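
As a quick illustration of how the helper works: it zips the token list with copies of itself shifted by 1, ..., n-1 positions, so each tuple is a window of n consecutive tokens. The sample sentence below is made up for the example.

```python
def get_ngrams(tokens, n):
    # zip stops at the shortest shifted copy, so only full windows are produced
    return zip(*[tokens[i:] for i in range(n)])

sample = ["do", "not", "like", "it"]

bigrams_sample = list(get_ngrams(sample, 2))
trigrams_sample = list(get_ngrams(sample, 3))

print(bigrams_sample)   # [('do', 'not'), ('not', 'like'), ('like', 'it')]
print(trigrams_sample)  # [('do', 'not', 'like'), ('not', 'like', 'it')]
```

A list shorter than `n` yields no n-grams at all, which is why very short comments contribute nothing to the bigram counts.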

In [None]:
from collections import Counter

for data_set in datasets:
  dataset_tokens = tokenized_datasets[data_set]

  unigrams = Counter()
  for entry in dataset_tokens["tokens"]:
    for token in entry:
      unigrams[token] += 1

  print(f"Dataset: {data_set}")
  print(f"Top 20 unigrams")
  print(unigrams.most_common(20))

  bigrams = Counter()
  for entry in dataset_tokens['tokens']:
    bigrams.update(get_ngrams(entry, 2))

  print(f"Top 20 bigrams")
  print(bigrams.most_common(20))


Dataset: train
Top 20 unigrams
[('.', 33321), ('be', 27433), ('I', 21262), ('the', 16233), (',', 12724), ('to', 11812), ('a', 11158), ('you', 10343), ('it', 9526), ('not', 9038), ('that', 8897), ('and', 8102), ('!', 8089), ('name', 7268), (']', 7191), ('[', 7179), ('do', 6948), ('<NUM>', 6767), ('of', 6429), ('have', 6125)]
Top 20 bigrams
[(('[', 'name'), 7010), (('name', ']'), 7010), (('do', 'not'), 3428), (('I', 'be'), 2900), (('it', 'be'), 2823), (('.', 'I'), 2387), (('be', 'a'), 2197), (('that', 'be'), 1735), (('be', 'not'), 1677), (('!', '!'), 1483), ((',', 'I'), 1471), (('you', 'be'), 1333), (('I', 'do'), 1331), (('this', 'be'), 1316), (('I', 'have'), 1262), (('in', 'the'), 1252), (('be', 'the'), 1191), ((',', 'but'), 1007), (('of', 'the'), 998), (('to', 'be'), 943)]
Dataset: test
Top 20 unigrams
[('.', 4140), ('be', 3348), ('I', 2680), ('the', 1934), (',', 1557), ('to', 1454), ('a', 1389), ('you', 1358), ('it', 1270), ('not', 1186), ('that', 1078), ('!', 1059), ('and', 1036), ('

In [None]:
# Remove rare tokens that appear fewer than min_freq times, to reduce noise
min_freq = 2

for data_set in datasets:
    dataset_tokens = tokenized_datasets[data_set]  # DataFrame with a 'tokens' column
    tokens_series = dataset_tokens['tokens']

    token_freq = Counter(token for entry in tokens_series for token in entry)

    dataset_tokens['pruned_tokens'] = tokens_series.apply(
        lambda tokens: [token for token in tokens if token_freq[token] >= min_freq]
    )
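
The pruning step can be sketched on a tiny, made-up corpus. The documents below are hypothetical; the logic mirrors the cell above, with frequencies counted over the whole dataset and tokens below `min_freq` removed from every entry.

```python
from collections import Counter

toy_min_freq = 2
toy_docs = [["I", "be", "happy"], ["I", "be", "sad"], ["okie"]]

# Corpus-wide token frequencies
toy_freq = Counter(token for doc in toy_docs for token in doc)

# Keep only tokens that occur at least toy_min_freq times in the corpus
toy_pruned = [[t for t in doc if toy_freq[t] >= toy_min_freq] for doc in toy_docs]

print(toy_pruned)  # [['I', 'be'], ['I', 'be'], []]
```

Note that an entry made up entirely of rare tokens becomes empty after pruning, so the classifier later has to label it from the class priors alone.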


In [None]:
for data_set in datasets:
  dataset_tokens = tokenized_datasets[data_set]

  tokens_series_normal = dataset_tokens['tokens']
  tokens_series_pruned = dataset_tokens['pruned_tokens']

  token_freq_normal = Counter(token for entry in tokens_series_normal for token in entry)
  token_freq_pruned = Counter(token for entry in tokens_series_pruned for token in entry)

  print(f"Dataset: {data_set}")
  print(f"Unique tokens before pruning: {len(token_freq_normal)}, Highest freq: {token_freq_normal.most_common(1)}, Lowest freq: {token_freq_normal.most_common()[::-1][:1]}")
  print(f"Unique tokens after pruning: {len(token_freq_pruned)}, Highest freq: {token_freq_pruned.most_common(1)}, Lowest freq: {token_freq_pruned.most_common()[::-1][:1]}")
  print(f"Tokens pruned: {len(token_freq_normal) - len(token_freq_pruned)}")
  print()

Dataset: train
Unique tokens before pruning: 24493, Highest freq: [('.', 33321)], Lowest freq: [('Huskies', 1)]
Unique tokens after pruning: 10777, Highest freq: [('.', 33321)], Lowest freq: [('okie', 2)]
Tokens pruned: 13716

Dataset: test
Unique tokens before pruning: 7207, Highest freq: [('.', 4140)], Lowest freq: [('overdose', 1)]
Unique tokens after pruning: 2913, Highest freq: [('.', 4140)], Lowest freq: [('Elmo', 2)]
Tokens pruned: 4294

Dataset: dev
Unique tokens before pruning: 7339, Highest freq: [('.', 4160)], Lowest freq: [('/tiny', 1)]
Unique tokens after pruning: 2941, Highest freq: [('.', 4160)], Lowest freq: [('shitbag', 2)]
Tokens pruned: 4398



# Training: a multi-class Naïve Bayes classifier
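
To make the model concrete, here is a minimal from-scratch sketch of multinomial Naïve Bayes with additive smoothing. It is not the scikit-learn implementation used below; the function names and the toy data are hypothetical, and `alpha=1.0` is chosen to match the default of `MultinomialNB`.

```python
import math
from collections import Counter, defaultdict

def train_nb(docs, labels, alpha=1.0):
    """Collect the counts needed for multinomial Naive Bayes."""
    class_counts = Counter(labels)          # documents per class
    token_counts = defaultdict(Counter)     # token counts per class
    for tokens, label in zip(docs, labels):
        token_counts[label].update(tokens)
    vocab = {t for counts in token_counts.values() for t in counts}
    return class_counts, token_counts, vocab, alpha

def predict_nb(model, tokens):
    """Return the class with the highest log prior + summed log likelihoods."""
    class_counts, token_counts, vocab, alpha = model
    total_docs = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label, n_docs in class_counts.items():
        score = math.log(n_docs / total_docs)
        denom = sum(token_counts[label].values()) + alpha * len(vocab)
        for t in tokens:
            if t in vocab:  # tokens never seen in training are ignored
                score += math.log((token_counts[label][t] + alpha) / denom)
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Hypothetical toy training data
toy_train = [["love", "this"], ["thank", "you"], ["hate", "this"], ["so", "angry"]]
toy_y = ["joy", "joy", "anger", "anger"]
toy_model = train_nb(toy_train, toy_y)

print(predict_nb(toy_model, ["love", "you"]))    # joy
print(predict_nb(toy_model, ["hate", "angry"]))  # anger
```

Smoothing gives every vocabulary token a non-zero probability in every class, so a single unseen class/token combination cannot zero out the whole score.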

In [None]:
!pip install scikit-learn



In [None]:
def nb_eval(x, y):

    X_train, X_test, y_train, y_test = train_test_split(
        x,
        y,
        test_size = 0.2,
        random_state = 42,
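        stratify = y) # Stratify because the emotions aren't evenly distributed; this keeps the class proportions the same in both splits

    # ngram_range=(1, 2) would add bigrams on top of unigrams (currently disabled)
    # min_df=1 keeps every term, max_df=0.9 drops terms found in more than 90% of documents,
    # max_features=15000 keeps only the 15000 most frequent terms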

    vectorizer = CountVectorizer(
        #ngram_range=(1, 2),
        min_df=1,
        max_df=0.9,
        max_features=15000
    )

    X_train_vec = vectorizer.fit_transform(X_train)
    X_test_vec = vectorizer.transform(X_test)


    nb = MultinomialNB()
    nb.fit(X_train_vec, y_train)

    y_pred = nb.predict(X_test_vec)

    print("Accuracy:", accuracy_score(y_test, y_pred))
    print("Classification report:\n", classification_report(y_test, y_pred, zero_division=0))
    print("Confusion matrix:\n", confusion_matrix(y_test, y_pred))
    return nb, vectorizer
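
One detail worth keeping in mind: since `nb_eval` does not override them, `CountVectorizer` applies its own defaults on top of the spaCy tokens. According to the scikit-learn documentation these are `lowercase=True` and `token_pattern=r"(?u)\b\w\w+\b"`. The sketch below only emulates that documented default with the standard `re` module (it does not call scikit-learn), and the sample string is made up.

```python
import re

# Default token pattern documented for CountVectorizer: 2+ word characters
DEFAULT_TOKEN_PATTERN = r"(?u)\b\w\w+\b"

def emulate_default_analyzer(text):
    """Approximate CountVectorizer's default preprocessing: lowercase, then regex tokenize."""
    return re.findall(DEFAULT_TOKEN_PATTERN, text.lower())

sample_text = "I do not like it ! <NUM> [ name ]"
features = emulate_default_analyzer(sample_text)

print(features)  # ['do', 'not', 'like', 'it', 'num', 'name']
```

Under these defaults, single-character tokens such as `I`, `.` and `!` never reach the classifier, and the `<NUM>` placeholder is reduced to `num`, even though they rank among the most frequent tokens in the counts above.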

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix


df = tokenized_datasets[datasets[0]]

print("----------------------------------------")
print(f"Dataset: {datasets[0]}")
print("----------------------------------------")

df['normal_tokens_to_str'] = df['tokens'].apply(lambda tokens: ' '.join(tokens))
df['pruned_tokens_to_str'] = df['pruned_tokens'].apply(lambda tokens: ' '.join(tokens))

print("############################################")
print("Eval with training data")
print("###########################################")
print()

nb_normal, vectorizer_normal = nb_eval(df['normal_tokens_to_str'], df[2])
print()
print("################# Pruned ##################")
print()
nb_pruned, vectorizer_pruned = nb_eval(df['pruned_tokens_to_str'], df[2])

if (nb_normal):
  print(nb_normal)

if (nb_pruned):
  print(nb_pruned)

----------------------------------------
Dataset: train
----------------------------------------
############################################
Eval with training data
###########################################

Accuracy: 0.5723675894324358
Classification report:
               precision    recall  f1-score   support

       anger       0.60      0.21      0.32       859
     disgust       0.67      0.02      0.04       100
        fear       1.00      0.01      0.02       108
         joy       0.62      0.83      0.71      3043
     neutral       0.50      0.64      0.56      2564
     sadness       0.76      0.12      0.21       465
    surprise       0.57      0.13      0.22       772

    accuracy                           0.57      7911
   macro avg       0.67      0.28      0.30      7911
weighted avg       0.59      0.57      0.53      7911

Confusion matrix:
 [[ 184    1    0  225  431    1   17]
 [  14    2    0   37   45    1    1]
 [   2    0    1   43   60    0    2]
 [  15

In [None]:
from sklearn.metrics import precision_recall_fscore_support

def helper_evaluation_stats(y_true, y_pred):
  # Confusion matrix
  print("Confusion Matrix:\n", confusion_matrix(y_true, y_pred))

  # Classification report (includes precision, recall, f1 for each class)
  # zero_division=0 reports 0 without warnings for classes that are never predicted
  print(classification_report(y_true, y_pred, digits=3, zero_division=0))

  # Micro and Macro averages
  precision, recall, f1, _ = precision_recall_fscore_support(y_true, y_pred, average='micro', zero_division=0)
  print(f"Micro Avg - Precision: {precision:.3f}, Recall: {recall:.3f}, F1: {f1:.3f}")

  precision, recall, f1, _ = precision_recall_fscore_support(y_true, y_pred, average='macro', zero_division=0)
  print(f"Macro Avg - Precision: {precision:.3f}, Recall: {recall:.3f}, F1: {f1:.3f}")

  # Overall accuracy
  accuracy = accuracy_score(y_true, y_pred)
  print(f"Accuracy: {accuracy:.3f}")
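
The difference between the averages can be checked by hand from a confusion matrix. The sketch below uses the unpruned test-set matrix from this notebook's output (rows are true classes, columns are predicted classes, in the order anger, disgust, fear, joy, neutral, sadness, surprise); the helper names are hypothetical.

```python
# Unpruned test-set confusion matrix, copied from the notebook output
confusion = [
    [111, 0, 0, 151, 302, 2, 6],
    [9, 2, 0, 34, 30, 0, 1],
    [1, 0, 0, 36, 40, 1, 2],
    [13, 0, 0, 1541, 293, 0, 16],
    [51, 0, 0, 523, 1005, 6, 21],
    [12, 0, 0, 137, 99, 30, 5],
    [8, 0, 0, 154, 252, 1, 73],
]

n = len(confusion)
diag = [confusion[i][i] for i in range(n)]                            # correct predictions per class
row_sums = [sum(row) for row in confusion]                            # true examples per class
col_sums = [sum(confusion[r][c] for r in range(n)) for c in range(n)]  # predictions per class

def safe_div(a, b):
    # Same convention as zero_division=0: a class that is never predicted gets 0
    return a / b if b else 0.0

recalls = [safe_div(d, s) for d, s in zip(diag, row_sums)]
precisions = [safe_div(d, s) for d, s in zip(diag, col_sums)]

cm_accuracy = sum(diag) / sum(row_sums)   # equals micro precision/recall for single-label data
macro_recall = sum(recalls) / n
macro_precision = sum(precisions) / n

print(f"{cm_accuracy:.3f} {macro_precision:.3f} {macro_recall:.3f}")  # 0.556 0.568 0.276
```

The micro average weights every example equally and is dominated by `joy` and `neutral`, while the macro average weights every class equally, which is why macro recall (0.276) is so much lower than accuracy (0.556).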

In [None]:
df_test = tokenized_datasets[datasets[1]]
df_test['normal_tokens_to_str'] = df_test['tokens'].apply(lambda tokens: ' '.join(tokens))
df_test['pruned_tokens_to_str'] = df_test['pruned_tokens'].apply(lambda tokens: ' '.join(tokens))

y_true = df_test[2]

X_test_normal = vectorizer_normal.transform(df_test['normal_tokens_to_str'])
y_pred_normal = nb_normal.predict(X_test_normal)

print("----------------------------- NORMAL --------------------------------")
helper_evaluation_stats(y_true, y_pred_normal)
print()

X_test_pruned = vectorizer_pruned.transform(df_test['pruned_tokens_to_str'])
y_pred_pruned = nb_pruned.predict(X_test_pruned)

print("----------------------------- PRUNED --------------------------------")
helper_evaluation_stats(y_true, y_pred_pruned)

----------------------------- NORMAL --------------------------------
Confusion Matrix:
 [[ 111    0    0  151  302    2    6]
 [   9    2    0   34   30    0    1]
 [   1    0    0   36   40    1    2]
 [  13    0    0 1541  293    0   16]
 [  51    0    0  523 1005    6   21]
 [  12    0    0  137   99   30    5]
 [   8    0    0  154  252    1   73]]
              precision    recall  f1-score   support

       anger      0.541     0.194     0.286       572
     disgust      1.000     0.026     0.051        76
        fear      0.000     0.000     0.000        80
         joy      0.598     0.827     0.694      1863
     neutral      0.497     0.626     0.554      1606
     sadness      0.750     0.106     0.186       283
    surprise      0.589     0.150     0.239       488

    accuracy                          0.556      4968
   macro avg      0.568     0.276     0.287      4968
weighted avg      0.563     0.556     0.507      4968

Micro Avg - Precision: 0.556, Recall: 0.556, F1



# RESULTS
Accuracy with unigrams only (test set):

*   Pruned: 56.7%
*   Not pruned: 55.6%

With bigrams added:

*   Pruned: 58.5%
*   Not pruned: 58.7%



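The achieved results are lower than those reported in the reference research paper. The effect of pruning is small and inconsistent: with unigrams only, pruning raises accuracy slightly (56.7% vs. 55.6%), whereas with bigrams added it lowers accuracy marginally (58.5% vs. 58.7%). The latter suggests that some useful rare tokens may be removed by pruning.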