# TRAINING & EVALUATION

Let us train some models!

For reference the list of models and training data are found below:

#### SIMPLE MODELS
1) tf-idf + Logistic Regression
2) tf-idf + Multinomial Naive Bayes
3) tf-idf + Decision Tree
4) tf-idf + Random Forest
5) tf-idf + kNN
6) DistilBERT embeddings + Logistic Regression
7) DistilBERT embeddings + Multinomial Naive Bayes
8) DistilBERT embeddings + Decision Tree
9) DistilBERT embeddings + Random Forest
10) DistilBERT embeddings + kNN

#### TRANSFORMER MODELS
1) DistilBERT

#### TRAINING DATA
1) Cleaned training data
2) Cleaned training data with topic information
3) Cleaned training data with topic information and with balanced sentiment representation via over- and undersampling.

We will evaluate using precision, recall and accuracy scores, which are the standard metrics for this task. For each model we will be saving the best performing one based on their training data composition.

Let's implement the methods for training and evaluation. Also implement a method to save the models. 

In [1]:
from time import time
from sklearn.pipeline import make_pipeline
from sklearn.metrics import classification_report
from sklearn.preprocessing import MinMaxScaler
from sklearn.decomposition import PCA 
import joblib
import os


def train_model(X_train, y_train, X_test, y_test, vectorizer, classifier, use_pca=False, n_components=100, log_reg_max_iter=1000):
    started = time()

    if classifier == "LogisticRegression":
        from sklearn.linear_model import LogisticRegression
        classifier = LogisticRegression(max_iter=log_reg_max_iter)

    if classifier == "MultinomialNB":
        from sklearn.naive_bayes import MultinomialNB
        classifier = MultinomialNB()
        if vectorizer is None:
            X_train = MinMaxScaler().fit_transform(X_train)
            X_test = MinMaxScaler().fit_transform(X_test)

    if classifier == "DecisionTreeClassifier":
        from sklearn.tree import DecisionTreeClassifier
        classifier = DecisionTreeClassifier()
    
    if classifier == "RandomForestClassifier":
        from sklearn.ensemble import RandomForestClassifier
        classifier = RandomForestClassifier()

    if classifier == "KNeighborsClassifier":
        from sklearn.neighbors import KNeighborsClassifier
        classifier = KNeighborsClassifier()
    

    steps = []
    if vectorizer is not None:
        steps.append(('vectorizer', vectorizer))
    if use_pca:
        # Add PCA to the pipeline if requested
        steps.append(('pca', PCA(n_components=n_components)))
        # steps.append(('scaler', MinMaxScaler()))
    steps.append(('classifier', classifier))

    # Construct the pipeline from the specified steps
    pipeline = make_pipeline(*[step for _, step in steps])


    pipeline.fit(X_train, y_train)
    
    finished = time()

    print(f'Training time: {finished - started:.2f}s')


    return pipeline

def evaluate_model(pipeline, X_test, y_test):

    started = time()
    y_pred = pipeline.predict(X_test)
    print(classification_report(y_test, y_pred))
    finished = time()
    print(f'Prediction time: {finished - started:.2f}s')

def save_model(pipeline, model_path):
    if not os.path.exists(os.path.dirname(model_path)):
        os.makedirs(os.path.dirname(model_path))
    joblib.dump(pipeline, model_path)



The history saving thread hit an unexpected error (DatabaseError('database disk image is malformed')).History will not be written to the database.


## SIMPLE MODELS

In [2]:
# lets load the data
import pandas as pd

training_df_cleaned = pd.read_csv('data/training_cleaned.csv')
validation_df_cleaned = pd.read_csv('data/validation_cleaned.csv')

training_df_topic_merged = pd.read_csv('data/training_topic_merged.csv')
validation_df_topic_merged = pd.read_csv('data/validation_topic_merged.csv')

training_df_balanced_us = pd.read_csv('data/training_balanced_us.csv')
# validation dfs for balanced datasets will be validation_df_topic_merged

training_df_balanced_os = pd.read_csv('data/training_balanced_os.csv')

In [3]:
# the models will only get the tweet or the tweet_topic as input, so lets make the partitions

# 1. Cleaned data
X_train_cleaned, y_train_cleaned = training_df_cleaned['tweet'], training_df_cleaned['sentiment']
X_test_cleaned, y_test_cleaned = validation_df_cleaned['tweet'], validation_df_cleaned['sentiment']

# 2. Topic merged data
X_train_topic_merged, y_train_topic_merged = training_df_topic_merged['topic_tweet'], training_df_topic_merged['sentiment']
X_test_topic_merged, y_test_topic_merged = validation_df_topic_merged['topic_tweet'], validation_df_topic_merged['sentiment']

# 3. Balanced undersampled data
X_train_balanced_us, y_train_balanced_us = training_df_balanced_us['topic_tweet'], training_df_balanced_us['sentiment']
X_test_balanced_us, y_test_balanced_us = validation_df_topic_merged['topic_tweet'], validation_df_topic_merged['sentiment']

# 4. Balanced oversampled data
X_train_balanced_os, y_train_balanced_os = training_df_balanced_os['topic_tweet'], training_df_balanced_os['sentiment']
X_test_balanced_os, y_test_balanced_os = validation_df_topic_merged['topic_tweet'], validation_df_topic_merged['sentiment']

### TF-IDF + LOGISTIC REGRESSION

In [96]:
from sklearn.feature_extraction.text import TfidfVectorizer
cleaned_vectorizer = TfidfVectorizer()
topic_merged_vectorizer = TfidfVectorizer()
balanced_us_vectorizer = TfidfVectorizer()
balanced_os_vectorizer = TfidfVectorizer()


In [91]:
cleaned_log_reg_model = train_model(X_train_cleaned, y_train_cleaned, X_test_cleaned, y_test_cleaned, cleaned_vectorizer, 'LogisticRegression')
topic_merged_log_reg_model = train_model(X_train_topic_merged, y_train_topic_merged, X_test_topic_merged, y_test_topic_merged, topic_merged_vectorizer, 'LogisticRegression')
balanced_us_log_reg_model = train_model(X_train_balanced_us, y_train_balanced_us, X_test_balanced_us, y_test_balanced_us, balanced_us_vectorizer, 'LogisticRegression')
balanced_os_log_reg_model = train_model(X_train_balanced_os, y_train_balanced_os, X_test_balanced_os, y_test_balanced_os, balanced_os_vectorizer, 'LogisticRegression')

evaluate_model(cleaned_log_reg_model, X_test_cleaned, y_test_cleaned)
evaluate_model(topic_merged_log_reg_model, X_test_topic_merged, y_test_topic_merged)
evaluate_model(balanced_us_log_reg_model, X_test_balanced_us, y_test_balanced_us)
evaluate_model(balanced_os_log_reg_model, X_test_balanced_os, y_test_balanced_os)

Training time: 103.96s
Training time: 138.90s
Training time: 39.10s
Training time: 144.90s
              precision    recall  f1-score   support

  Irrelevant       0.88      0.85      0.86       172
    Negative       0.86      0.94      0.89       266
     Neutral       0.94      0.85      0.89       285
    Positive       0.90      0.93      0.91       277

    accuracy                           0.89      1000
   macro avg       0.89      0.89      0.89      1000
weighted avg       0.90      0.89      0.89      1000

Prediction time: 0.19s
              precision    recall  f1-score   support

  Irrelevant       0.87      0.83      0.85       172
    Negative       0.85      0.93      0.89       266
     Neutral       0.95      0.87      0.90       285
    Positive       0.88      0.91      0.90       277

    accuracy                           0.89      1000
   macro avg       0.89      0.88      0.88      1000
weighted avg       0.89      0.89      0.89      1000

Prediction time:

BEST PERFORMANCE (Accuracy) = Cleaned with score 0.89

In [95]:
# lets save the best model
save_model(cleaned_log_reg_model, 'models/simple/tfidf/cleaned_log_reg_model.joblib')

### TF-IDF + MULTINOMIAL NAIVE BAYES

In [98]:
cleaned_mnb_model = train_model(X_train_cleaned, y_train_cleaned, X_test_cleaned, y_test_cleaned, cleaned_vectorizer, 'MultinomialNB')
topic_merged_mnb_model = train_model(X_train_topic_merged, y_train_topic_merged, X_test_topic_merged, y_test_topic_merged, topic_merged_vectorizer, 'MultinomialNB')
balanced_us_mnb_model = train_model(X_train_balanced_us, y_train_balanced_us, X_test_balanced_us, y_test_balanced_us, balanced_us_vectorizer, 'MultinomialNB')
balanced_os_mnb_model = train_model(X_train_balanced_os, y_train_balanced_os, X_test_balanced_os, y_test_balanced_os, balanced_os_vectorizer, 'MultinomialNB')

evaluate_model(cleaned_mnb_model, X_test_cleaned, y_test_cleaned)
evaluate_model(topic_merged_mnb_model, X_test_topic_merged, y_test_topic_merged)
evaluate_model(balanced_us_mnb_model, X_test_balanced_us, y_test_balanced_us)
evaluate_model(balanced_os_mnb_model, X_test_balanced_os, y_test_balanced_os)

Training time: 4.92s
Training time: 4.76s
Training time: 1.94s
Training time: 7.95s
              precision    recall  f1-score   support

  Irrelevant       0.95      0.53      0.68       172
    Negative       0.66      0.93      0.77       266
     Neutral       0.91      0.64      0.75       285
    Positive       0.74      0.88      0.80       277

    accuracy                           0.77      1000
   macro avg       0.82      0.75      0.75      1000
weighted avg       0.80      0.77      0.76      1000

Prediction time: 0.11s
              precision    recall  f1-score   support

  Irrelevant       0.95      0.53      0.68       172
    Negative       0.68      0.93      0.78       266
     Neutral       0.91      0.64      0.75       285
    Positive       0.73      0.89      0.80       277

    accuracy                           0.77      1000
   macro avg       0.82      0.75      0.75      1000
weighted avg       0.80      0.77      0.76      1000

Prediction time: 0.14s


BEST PERFORMANCE (Accuracy) = Balanced w/ Oversample with score 0.81


In [99]:
# lets save the best model
save_model(balanced_os_mnb_model, 'models/simple/tfidf/balanced_os_mnb_model.joblib')

### TF-IDF + DECISION TREE

In [101]:
cleaned_dt_model = train_model(X_train_cleaned, y_train_cleaned, X_test_cleaned, y_test_cleaned, cleaned_vectorizer, 'DecisionTreeClassifier')
topic_merged_dt_model = train_model(X_train_topic_merged, y_train_topic_merged, X_test_topic_merged, y_test_topic_merged, topic_merged_vectorizer, 'DecisionTreeClassifier')
balanced_us_dt_model = train_model(X_train_balanced_us, y_train_balanced_us, X_test_balanced_us, y_test_balanced_us, balanced_us_vectorizer, 'DecisionTreeClassifier')
balanced_os_dt_model = train_model(X_train_balanced_os, y_train_balanced_os, X_test_balanced_os, y_test_balanced_os, balanced_os_vectorizer, 'DecisionTreeClassifier')

evaluate_model(cleaned_dt_model, X_test_cleaned, y_test_cleaned)
evaluate_model(topic_merged_dt_model, X_test_topic_merged, y_test_topic_merged)
evaluate_model(balanced_us_dt_model, X_test_balanced_us, y_test_balanced_us)
evaluate_model(balanced_os_dt_model, X_test_balanced_os, y_test_balanced_os)

Training time: 132.02s
Training time: 105.73s
Training time: 87.60s
Training time: 297.81s
              precision    recall  f1-score   support

  Irrelevant       0.93      0.91      0.92       172
    Negative       0.93      0.96      0.94       266
     Neutral       0.93      0.91      0.92       285
    Positive       0.92      0.92      0.92       277

    accuracy                           0.93      1000
   macro avg       0.93      0.93      0.93      1000
weighted avg       0.93      0.93      0.93      1000

Prediction time: 0.32s
              precision    recall  f1-score   support

  Irrelevant       0.92      0.92      0.92       172
    Negative       0.92      0.95      0.93       266
     Neutral       0.94      0.88      0.91       285
    Positive       0.92      0.94      0.93       277

    accuracy                           0.92      1000
   macro avg       0.92      0.92      0.92      1000
weighted avg       0.92      0.92      0.92      1000

Prediction time:

BEST PERFORMANCE (Accuracy) = Cleaned with score 0.93

In [102]:
# lets save the best model
save_model(cleaned_dt_model, 'models/simple/tfidf/cleaned_dt_model.joblib')

### TF-IDF + RANDOM FOREST

In [103]:
cleaned_rf_model = train_model(X_train_cleaned, y_train_cleaned, X_test_cleaned, y_test_cleaned, cleaned_vectorizer, 'RandomForestClassifier')
topic_merged_rf_model = train_model(X_train_topic_merged, y_train_topic_merged, X_test_topic_merged, y_test_topic_merged, topic_merged_vectorizer, 'RandomForestClassifier')
balanced_us_rf_model = train_model(X_train_balanced_us, y_train_balanced_us, X_test_balanced_us, y_test_balanced_us, balanced_us_vectorizer, 'RandomForestClassifier')
balanced_os_rf_model = train_model(X_train_balanced_os, y_train_balanced_os, X_test_balanced_os, y_test_balanced_os, balanced_os_vectorizer, 'RandomForestClassifier')

evaluate_model(cleaned_rf_model, X_test_cleaned, y_test_cleaned)
evaluate_model(topic_merged_rf_model, X_test_topic_merged, y_test_topic_merged)
evaluate_model(balanced_us_rf_model, X_test_balanced_us, y_test_balanced_us)
evaluate_model(balanced_os_rf_model, X_test_balanced_os, y_test_balanced_os)

Training time: 2244.70s
Training time: 1724.63s
Training time: 401.65s
Training time: 2057.45s
              precision    recall  f1-score   support

  Irrelevant       0.99      0.97      0.98       172
    Negative       0.97      0.98      0.98       266
     Neutral       0.97      0.97      0.97       285
    Positive       0.97      0.97      0.97       277

    accuracy                           0.97      1000
   macro avg       0.97      0.97      0.97      1000
weighted avg       0.97      0.97      0.97      1000

Prediction time: 7.01s
              precision    recall  f1-score   support

  Irrelevant       0.98      0.99      0.99       172
    Negative       0.96      0.98      0.97       266
     Neutral       0.99      0.97      0.98       285
    Positive       0.98      0.98      0.98       277

    accuracy                           0.98      1000
   macro avg       0.98      0.98      0.98      1000
weighted avg       0.98      0.98      0.98      1000

Prediction t

BEST PERFORMANCE (Accuracy) = Topic Merged with score 0.98

In [104]:
# lets save the best model
save_model(topic_merged_rf_model, 'models/simple/tfidf/topic_merged_rf_model.joblib')

### TF-IDF + kNN

In [105]:
cleaned_knn_model = train_model(X_train_cleaned, y_train_cleaned, X_test_cleaned, y_test_cleaned, cleaned_vectorizer, 'KNeighborsClassifier')
topic_merged_knn_model = train_model(X_train_topic_merged, y_train_topic_merged, X_test_topic_merged, y_test_topic_merged, topic_merged_vectorizer, 'KNeighborsClassifier')
balanced_us_knn_model = train_model(X_train_balanced_us, y_train_balanced_us, X_test_balanced_us, y_test_balanced_us, balanced_us_vectorizer, 'KNeighborsClassifier')
balanced_os_knn_model = train_model(X_train_balanced_os, y_train_balanced_os, X_test_balanced_os, y_test_balanced_os, balanced_os_vectorizer, 'KNeighborsClassifier')

evaluate_model(cleaned_knn_model, X_test_cleaned, y_test_cleaned)
evaluate_model(topic_merged_knn_model, X_test_topic_merged, y_test_topic_merged)
evaluate_model(balanced_us_knn_model, X_test_balanced_us, y_test_balanced_us)
evaluate_model(balanced_os_knn_model, X_test_balanced_os, y_test_balanced_os)

Training time: 6.55s
Training time: 7.40s
Training time: 3.06s
Training time: 10.88s
              precision    recall  f1-score   support

  Irrelevant       0.93      0.98      0.95       172
    Negative       0.95      0.97      0.96       266
     Neutral       0.95      0.96      0.96       285
    Positive       0.98      0.92      0.95       277

    accuracy                           0.95      1000
   macro avg       0.95      0.96      0.95      1000
weighted avg       0.95      0.95      0.95      1000

Prediction time: 6.21s
              precision    recall  f1-score   support

  Irrelevant       0.96      0.99      0.98       172
    Negative       0.98      0.98      0.98       266
     Neutral       0.99      0.97      0.98       285
    Positive       0.99      0.98      0.98       277

    accuracy                           0.98      1000
   macro avg       0.98      0.98      0.98      1000
weighted avg       0.98      0.98      0.98      1000

Prediction time: 5.67s

BEST PERFORMANCE (Accuracy) = Topic Merged with score 0.98


In [106]:
# lets save the best model
save_model(topic_merged_knn_model, 'models/simple/tfidf/topic_merged_knn_model.joblib')

### DistilBERT Embeddings + Logistic Regression

In [4]:
# lets load the data
import torch

X_train_cleaned_bert = torch.load('data/X_train_cleaned_bert.pt')
X_test_cleaned_bert = torch.load('data/X_test_cleaned_bert.pt')

X_train_topic_merged_bert = torch.load('data/X_train_topic_merged_bert.pt')
X_test_topic_merged_bert = torch.load('data/X_test_topic_merged_bert.pt')

X_train_balanced_us_bert = torch.load('data/X_train_balanced_us_bert.pt')
X_test_balanced_us_bert = X_test_topic_merged_bert

X_train_balanced_os_bert = torch.load('data/X_train_balanced_os_bert.pt')
X_test_balanced_os_bert = X_test_topic_merged_bert

In [60]:
cleaned_lr_bertembed_model = train_model(X_train_cleaned_bert, y_train_cleaned, X_test_cleaned_bert, y_test_cleaned, None, 'LogisticRegression')
topic_merged_lr_bertembed_model = train_model(X_train_topic_merged_bert, y_train_topic_merged, X_test_topic_merged_bert, y_test_topic_merged, None, 'LogisticRegression')
balanced_us_lr_bertembed_model = train_model(X_train_balanced_us_bert, y_train_balanced_us, X_test_balanced_us_bert, y_test_balanced_us, None, 'LogisticRegression')
balanced_os_lr_bertembed_model = train_model(X_train_balanced_os_bert, y_train_balanced_os, X_test_balanced_os_bert, y_test_balanced_os, None, 'LogisticRegression')

evaluate_model(cleaned_lr_bertembed_model, X_test_cleaned_bert, y_test_cleaned)
evaluate_model(topic_merged_lr_bertembed_model, X_test_topic_merged_bert, y_test_topic_merged)
evaluate_model(balanced_us_lr_bertembed_model, X_test_balanced_us_bert, y_test_balanced_us)
evaluate_model(balanced_os_lr_bertembed_model, X_test_balanced_os_bert, y_test_balanced_os)

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


Training time: 519.73s
              precision    recall  f1-score   support

  Irrelevant       0.60      0.38      0.46       172
    Negative       0.57      0.75      0.65       266
     Neutral       0.61      0.52      0.56       285
    Positive       0.63      0.69      0.66       277

    accuracy                           0.60      1000
   macro avg       0.60      0.58      0.58      1000
weighted avg       0.61      0.60      0.59      1000

Prediction time: 0.10s


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


Training time: 533.97s
              precision    recall  f1-score   support

  Irrelevant       0.58      0.40      0.48       172
    Negative       0.61      0.75      0.67       266
     Neutral       0.65      0.57      0.61       285
    Positive       0.63      0.69      0.66       277

    accuracy                           0.62      1000
   macro avg       0.62      0.60      0.60      1000
weighted avg       0.62      0.62      0.62      1000

Prediction time: 0.10s


KeyboardInterrupt: 

This does not converge, lets raise the total number of iterations.

In [108]:
cleaned_lr_pca_bertembed_model = train_model(X_train_cleaned_bert, y_train_cleaned, X_test_cleaned_bert, y_test_cleaned, None, 'LogisticRegression', log_reg_max_iter=5000)
topic_merged_lr_pca_bertembed_model = train_model(X_train_topic_merged_bert, y_train_topic_merged, X_test_topic_merged_bert, y_test_topic_merged, None, 'LogisticRegression', log_reg_max_iter=5000)
balanced_us_lr_pca_bertembed_model = train_model(X_train_balanced_us_bert, y_train_balanced_us, X_test_balanced_us_bert, y_test_balanced_us, None, 'LogisticRegression', log_reg_max_iter=5000)
balanced_os_lr_pca_bertembed_model = train_model(X_train_balanced_os_bert, y_train_balanced_os, X_test_balanced_os_bert, y_test_balanced_os, None, 'LogisticRegression', log_reg_max_iter=5000)

evaluate_model(cleaned_lr_pca_bertembed_model, X_test_cleaned_bert, y_test_cleaned)
evaluate_model(topic_merged_lr_pca_bertembed_model, X_test_topic_merged_bert, y_test_topic_merged)
evaluate_model(balanced_us_lr_pca_bertembed_model, X_test_balanced_us_bert, y_test_balanced_us)
evaluate_model(balanced_os_lr_pca_bertembed_model, X_test_balanced_os_bert, y_test_balanced_os)

Training time: 849.36s
Training time: 867.21s
Training time: 275.43s
Training time: 2121.90s
              precision    recall  f1-score   support

  Irrelevant       0.61      0.38      0.47       172
    Negative       0.57      0.75      0.65       266
     Neutral       0.61      0.52      0.56       285
    Positive       0.63      0.69      0.66       277

    accuracy                           0.60      1000
   macro avg       0.61      0.58      0.58      1000
weighted avg       0.61      0.60      0.60      1000

Prediction time: 16.19s
              precision    recall  f1-score   support

  Irrelevant       0.57      0.40      0.47       172
    Negative       0.61      0.74      0.67       266
     Neutral       0.64      0.57      0.60       285
    Positive       0.62      0.68      0.65       277

    accuracy                           0.62      1000
   macro avg       0.61      0.60      0.60      1000
weighted avg       0.62      0.62      0.61      1000

Prediction ti

BEST PERFORMANCE (Accuracy) = Topic Merged with score 0.62

In [109]:
# lets save the best model
save_model(topic_merged_lr_pca_bertembed_model, 'models/simple/distilbert_embed/topic_merged_lr_pca_bertembed_model.joblib')

### DistilBERT Embeddings + Multinomial Naive Bayes

In [20]:
cleaned_mnb_bertembed_model = train_model(X_train_cleaned_bert, y_train_cleaned, X_test_cleaned_bert, y_test_cleaned, None, 'MultinomialNB')
evaluate_model(cleaned_mnb_bertembed_model, X_test_cleaned_bert, y_test_cleaned)

topic_merged_mnb_bertembed_model = train_model(X_train_topic_merged_bert, y_train_topic_merged, X_test_topic_merged_bert, y_test_topic_merged, None, 'MultinomialNB')
evaluate_model(topic_merged_mnb_bertembed_model, X_test_topic_merged_bert, y_test_topic_merged)

balanced_us_mnb_bertembed_model = train_model(X_train_balanced_us_bert, y_train_balanced_us, X_test_balanced_us_bert, y_test_balanced_us, None, 'MultinomialNB')
evaluate_model(balanced_us_mnb_bertembed_model, X_test_balanced_us_bert, y_test_balanced_us)

balanced_os_mnb_bertembed_model = train_model(X_train_balanced_os_bert, y_train_balanced_os, X_test_balanced_os_bert, y_test_balanced_os, None, 'MultinomialNB')
evaluate_model(balanced_os_mnb_bertembed_model, X_test_balanced_os_bert, y_test_balanced_os)

Training time: 3.43s
              precision    recall  f1-score   support

  Irrelevant       0.00      0.00      0.00       172
    Negative       0.50      0.61      0.55       266
     Neutral       0.51      0.42      0.46       285
    Positive       0.44      0.70      0.54       277

    accuracy                           0.48      1000
   macro avg       0.36      0.43      0.39      1000
weighted avg       0.40      0.48      0.43      1000

Prediction time: 0.06s


  _warn_prf(average, modifier, f"{metric.capitalize()} is", len(result))
  _warn_prf(average, modifier, f"{metric.capitalize()} is", len(result))
  _warn_prf(average, modifier, f"{metric.capitalize()} is", len(result))


Training time: 2.94s
              precision    recall  f1-score   support

  Irrelevant       0.00      0.00      0.00       172
    Negative       0.53      0.49      0.51       266
     Neutral       0.59      0.23      0.33       285
    Positive       0.37      0.86      0.52       277

    accuracy                           0.43      1000
   macro avg       0.37      0.39      0.34      1000
weighted avg       0.41      0.43      0.37      1000

Prediction time: 0.06s


  _warn_prf(average, modifier, f"{metric.capitalize()} is", len(result))
  _warn_prf(average, modifier, f"{metric.capitalize()} is", len(result))
  _warn_prf(average, modifier, f"{metric.capitalize()} is", len(result))


Training time: 1.18s
              precision    recall  f1-score   support

  Irrelevant       0.00      0.00      0.00       172
    Negative       0.41      0.89      0.56       266
     Neutral       0.53      0.26      0.35       285
    Positive       0.52      0.53      0.52       277

    accuracy                           0.46      1000
   macro avg       0.36      0.42      0.36      1000
weighted avg       0.40      0.46      0.39      1000

Prediction time: 0.11s


  _warn_prf(average, modifier, f"{metric.capitalize()} is", len(result))
  _warn_prf(average, modifier, f"{metric.capitalize()} is", len(result))
  _warn_prf(average, modifier, f"{metric.capitalize()} is", len(result))


Training time: 3.44s
              precision    recall  f1-score   support

  Irrelevant       0.00      0.00      0.00       172
    Negative       0.40      0.89      0.55       266
     Neutral       0.61      0.22      0.32       285
    Positive       0.49      0.53      0.51       277

    accuracy                           0.45      1000
   macro avg       0.37      0.41      0.35      1000
weighted avg       0.42      0.45      0.38      1000

Prediction time: 0.05s


  _warn_prf(average, modifier, f"{metric.capitalize()} is", len(result))
  _warn_prf(average, modifier, f"{metric.capitalize()} is", len(result))
  _warn_prf(average, modifier, f"{metric.capitalize()} is", len(result))


BEST PERFORMANCE (Accuracy) = Cleaned with score 0.48

In [21]:
# lets save the best model
save_model(cleaned_mnb_bertembed_model, 'models/simple/distilbert_embed/cleaned_mnb_bertembed_model.joblib')

### DistilBERT Embeddings + Decision Tree

In [22]:
cleaned_dt_bertembed_model = train_model(X_train_cleaned_bert, y_train_cleaned, X_test_cleaned_bert, y_test_cleaned, None, 'DecisionTreeClassifier')
topic_merged_dt_bertembed_model = train_model(X_train_topic_merged_bert, y_train_topic_merged, X_test_topic_merged_bert, y_test_topic_merged, None, 'DecisionTreeClassifier')
balanced_us_dt_bertembed_model = train_model(X_train_balanced_us_bert, y_train_balanced_us, X_test_balanced_us_bert, y_test_balanced_us, None, 'DecisionTreeClassifier')
balanced_os_dt_bertembed_model = train_model(X_train_balanced_os_bert, y_train_balanced_os, X_test_balanced_os_bert, y_test_balanced_os, None, 'DecisionTreeClassifier')

evaluate_model(cleaned_dt_bertembed_model, X_test_cleaned_bert, y_test_cleaned)
evaluate_model(topic_merged_dt_bertembed_model, X_test_topic_merged_bert, y_test_topic_merged)
evaluate_model(balanced_us_dt_bertembed_model, X_test_balanced_us_bert, y_test_balanced_us)
evaluate_model(balanced_os_dt_bertembed_model, X_test_balanced_os_bert, y_test_balanced_os)

Training time: 411.99s
Training time: 421.14s
Training time: 133.85s
Training time: 477.78s
              precision    recall  f1-score   support

  Irrelevant       0.70      0.70      0.70       172
    Negative       0.73      0.82      0.77       266
     Neutral       0.75      0.66      0.70       285
    Positive       0.76      0.76      0.76       277

    accuracy                           0.74      1000
   macro avg       0.74      0.74      0.73      1000
weighted avg       0.74      0.74      0.74      1000

Prediction time: 0.12s
              precision    recall  f1-score   support

  Irrelevant       0.64      0.65      0.65       172
    Negative       0.72      0.85      0.78       266
     Neutral       0.78      0.64      0.70       285
    Positive       0.73      0.73      0.73       277

    accuracy                           0.72      1000
   macro avg       0.72      0.72      0.72      1000
weighted avg       0.73      0.72      0.72      1000

Prediction time

BEST PERFORMANCE (Accuracy) = Cleaned with score 0.74

In [23]:
# lets save the best model
save_model(cleaned_dt_bertembed_model, 'models/simple/distilbert_embed/cleaned_dt_bertembed_model.joblib')

### DistilBERT Embeddings + Random Forest

In [24]:
cleaned_rf_bertembed_model = train_model(X_train_cleaned_bert, y_train_cleaned, X_test_cleaned_bert, y_test_cleaned, None, 'RandomForestClassifier')
topic_merged_rf_bertembed_model = train_model(X_train_topic_merged_bert, y_train_topic_merged, X_test_topic_merged_bert, y_test_topic_merged, None, 'RandomForestClassifier')
balanced_us_rf_bertembed_model = train_model(X_train_balanced_us_bert, y_train_balanced_us, X_test_balanced_us_bert, y_test_balanced_us, None, 'RandomForestClassifier')
balanced_os_rf_bertembed_model = train_model(X_train_balanced_os_bert, y_train_balanced_os, X_test_balanced_os_bert, y_test_balanced_os, None, 'RandomForestClassifier')

evaluate_model(cleaned_rf_bertembed_model, X_test_cleaned_bert, y_test_cleaned)
evaluate_model(topic_merged_rf_bertembed_model, X_test_topic_merged_bert, y_test_topic_merged)
evaluate_model(balanced_us_rf_bertembed_model, X_test_balanced_us_bert, y_test_balanced_us)
evaluate_model(balanced_os_rf_bertembed_model, X_test_balanced_os_bert, y_test_balanced_os)

Training time: 993.55s
Training time: 778.36s
Training time: 273.92s
Training time: 1057.97s
              precision    recall  f1-score   support

  Irrelevant       0.96      0.65      0.78       172
    Negative       0.75      0.94      0.84       266
     Neutral       0.82      0.74      0.78       285
    Positive       0.82      0.86      0.84       277

    accuracy                           0.81      1000
   macro avg       0.84      0.80      0.81      1000
weighted avg       0.82      0.81      0.81      1000

Prediction time: 0.88s
              precision    recall  f1-score   support

  Irrelevant       0.97      0.67      0.79       172
    Negative       0.78      0.94      0.85       266
     Neutral       0.85      0.78      0.81       285
    Positive       0.82      0.88      0.85       277

    accuracy                           0.83      1000
   macro avg       0.85      0.82      0.83      1000
weighted avg       0.84      0.83      0.83      1000

Prediction tim

BEST PERFORMANCE (Accuracy) = Topic Merged with score 0.83

In [25]:
# lets save the best model
save_model(topic_merged_rf_bertembed_model, 'models/simple/distilbert_embed/topic_merged_rf_bertembed_model.joblib')

### DistilBERT Embeddings + kNN

In [27]:
cleaned_knn_bertembed_model = train_model(X_train_cleaned_bert, y_train_cleaned, X_test_cleaned_bert, y_test_cleaned, None, 'KNeighborsClassifier')
topic_merged_knn_bertembed_model = train_model(X_train_topic_merged_bert, y_train_topic_merged, X_test_topic_merged_bert, y_test_topic_merged, None, 'KNeighborsClassifier')
balanced_us_knn_bertembed_model = train_model(X_train_balanced_us_bert, y_train_balanced_us, X_test_balanced_us_bert, y_test_balanced_us, None, 'KNeighborsClassifier')
balanced_os_knn_bertembed_model = train_model(X_train_balanced_os_bert, y_train_balanced_os, X_test_balanced_os_bert, y_test_balanced_os, None, 'KNeighborsClassifier')

evaluate_model(cleaned_knn_bertembed_model, X_test_cleaned_bert, y_test_cleaned)
evaluate_model(topic_merged_knn_bertembed_model, X_test_topic_merged_bert, y_test_topic_merged)
evaluate_model(balanced_us_knn_bertembed_model, X_test_balanced_us_bert, y_test_balanced_us)
evaluate_model(balanced_os_knn_bertembed_model, X_test_balanced_os_bert, y_test_balanced_os)

Training time: 0.18s
Training time: 0.19s
Training time: 0.09s
Training time: 0.28s
              precision    recall  f1-score   support

  Irrelevant       0.85      0.84      0.85       172
    Negative       0.81      0.93      0.87       266
     Neutral       0.93      0.81      0.87       285
    Positive       0.87      0.86      0.87       277

    accuracy                           0.86      1000
   macro avg       0.86      0.86      0.86      1000
weighted avg       0.87      0.86      0.86      1000

Prediction time: 4.37s
              precision    recall  f1-score   support

  Irrelevant       0.88      0.85      0.87       172
    Negative       0.85      0.93      0.89       266
     Neutral       0.93      0.86      0.90       285
    Positive       0.88      0.89      0.88       277

    accuracy                           0.89      1000
   macro avg       0.89      0.88      0.88      1000
weighted avg       0.89      0.89      0.89      1000

Prediction time: 4.17s


BEST PERFORMANCE (Accuracy) = Topic Merged with score 0.89

In [28]:
# lets save the best model
save_model(topic_merged_knn_bertembed_model, 'models/simple/distilbert_embed/topic_merged_knn_bertembed_model.joblib')

## TRANSFORMER MODELS (DistilBERT)

Since we want SOTA technologies and efficient inference times in CPU, DistilBERT is a fine candidate for both of the constraints.

Let us prepare the dataset class for efficient loading into the transformer model using torch and also implement the same metrics that we used in simple models for validation scores.

In [None]:
import torch
import numpy as np
from sklearn.metrics import precision_recall_fscore_support, accuracy_score

# Convert to torch datasets
class SentimentDataset(torch.utils.data.Dataset):
    LABEL_MAP = {'Positive': 0, 'Neutral': 1, 'Negative': 2, 'Irrelevant': 3}
    def __init__(self, encodings, labels):
        self.encodings = encodings
        self.labels = [self.LABEL_MAP[label] for label in labels]

    def __getitem__(self, idx):
        item = {key: torch.tensor(val[idx]) for key, val in self.encodings.items()}
        item['labels'] = torch.tensor(self.labels[idx])
        return item

    def __len__(self):
        return len(self.labels)

# Define metrics
def compute_metrics(p):
    preds = np.argmax(p.predictions, axis=1)
    precision, recall, f1, _ = precision_recall_fscore_support(p.label_ids, preds, average='weighted')
    acc = accuracy_score(p.label_ids, preds)
    return {
        'accuracy': acc,
        'f1': f1,
        'precision': precision,
        'recall': recall
    }

In [None]:
# we can quickly load the embeddings that we saved earlier

import pandas as pd

train_cleaned_encodings = torch.load('/gd/MyDrive/huk-cc/data/train_cleaned_encodings.pt')
test_cleaned_encodings = torch.load('/gd/MyDrive/huk-cc/data/validation_cleaned_encodings.pt')

train_topic_merged_encodings = torch.load('/gd/MyDrive/huk-cc/data/train_topic_merged_encodings.pt')
test_topic_merged_encodings = torch.load('/gd/MyDrive/huk-cc/data/validation_topic_merged_encodings.pt')

train_balanced_us_encodings = torch.load('/gd/MyDrive/huk-cc/data/train_balanced_us_encodings.pt')
test_balanced_us_encodings = torch.load('/gd/MyDrive/huk-cc/data/validation_balanced_us_encodings.pt')

train_balanced_os_encodings = torch.load('/gd/MyDrive/huk-cc/data/train_balanced_os_encodings.pt')
test_balanced_os_encodings = torch.load('/gd/MyDrive/huk-cc/data/validation_balanced_os_encodings.pt')


In [None]:
# lets create the datasets

train_cleaned_dataset = SentimentDataset(train_cleaned_encodings, train_df_cleaned['sentiment'].values)
test_cleaned_dataset = SentimentDataset(test_cleaned_encodings, test_df_cleaned['sentiment'].values)

train_topic_merged_dataset = SentimentDataset(train_topic_merged_encodings, train_df_topic_merged['sentiment'].values)
test_topic_merged_dataset = SentimentDataset(test_topic_merged_encodings, test_df_topic_merged['sentiment'].values)

train_balanced_us_dataset = SentimentDataset(train_balanced_us_encodings, train_df_balanced_us['sentiment'].values)
test_balanced_us_dataset = SentimentDataset(test_balanced_us_encodings, test_df_topic_merged['sentiment'].values)

train_balanced_os_dataset = SentimentDataset(train_balanced_os_encodings, train_df_balanced_os['sentiment'].values)
test_balanced_os_dataset = SentimentDataset(test_balanced_os_encodings, test_df_topic_merged['sentiment'].values)

experiments = [
    ("train_cleaned", train_cleaned_dataset, test_cleaned_dataset),
    ("train_topic_merged", train_topic_merged_dataset, test_topic_merged_dataset),
    ("train_balanced_us", train_balanced_us_dataset, test_balanced_us_dataset),
    ("train_balanced_os", train_balanced_os_dataset, test_balanced_os_dataset)
    ]

In [None]:
from transformers import DistilBertTokenizer, DistilBertForSequenceClassification, Trainer, TrainingArguments
from sklearn.metrics import accuracy_score, precision_recall_fscore_support


# Define custom tokens for topic merged data
SPECIAL_TOKENS = ['[TOPIC]', '[TWEET]']

# Initialize the tokenizer
tokenizer = DistilBertTokenizer.from_pretrained('distilbert-base-uncased')
tokenizer.add_tokens(SPECIAL_TOKENS)

In [None]:
for experiment_name, training_dataset, test_dataset in experiments:
    print(f"Training on {experiment_name}")
    # Initialize the model
    model = DistilBertForSequenceClassification.from_pretrained('distilbert-base-uncased', num_labels=4)

    # Resize model embeddings to accommodate new tokens
    model.resize_token_embeddings(len(tokenizer))

    # Training arguments
    training_args = TrainingArguments(
        output_dir=f'models/distilbert_ft/{experiment_name}',
        num_train_epochs=3,
        per_device_train_batch_size=64,
        per_device_eval_batch_size=64,
        warmup_steps=500,
        learning_rate=2e-5,
        weight_decay=0.01,
        logging_steps=10,
        evaluation_strategy="epoch",
    )

    # Initialize Trainer
    trainer = Trainer(
        model=model,
        args=training_args,
        train_dataset=training_dataset,
        eval_dataset=test_dataset,
        compute_metrics=compute_metrics,
    )

    # Train and evaluate the model
    trainer.train()
    trainer.evaluate()
    print("-"*50)

Training on train_cleaned


model.safetensors:   0%|          | 0.00/268M [00:00<?, ?B/s]

Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Epoch,Training Loss,Validation Loss,Accuracy,F1,Precision,Recall
1,0.8359,0.608662,0.77,0.765528,0.772475,0.77
2,0.522,0.321145,0.903,0.903014,0.905103,0.903
3,0.4047,0.244823,0.923,0.922881,0.923359,0.923


--------------------------------------------------
Training on train_topic_merged


Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Epoch,Training Loss,Validation Loss,Accuracy,F1,Precision,Recall
1,0.8104,0.59995,0.784,0.780825,0.787434,0.784
2,0.5111,0.311909,0.897,0.896793,0.899347,0.897
3,0.3616,0.231839,0.931,0.930925,0.93101,0.931


--------------------------------------------------
Training on train_balanced_us


Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Epoch,Training Loss,Validation Loss,Accuracy,F1,Precision,Recall
1,1.0557,0.96207,0.61,0.586121,0.610405,0.61
2,0.846,0.708811,0.741,0.739903,0.741737,0.741
3,0.6276,0.626912,0.784,0.783019,0.78355,0.784


--------------------------------------------------
Training on train_balanced_os


Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Epoch,Training Loss,Validation Loss


In [None]:
# final model crashed during training... lets try again

for experiment_name, training_dataset, test_dataset in experiments[3:]:
    print(f"Training on {experiment_name}")
    # Initialize the model
    model = DistilBertForSequenceClassification.from_pretrained('distilbert-base-uncased', num_labels=4)

    # Resize model embeddings to accommodate new tokens
    model.resize_token_embeddings(len(tokenizer))

    # Training arguments
    training_args = TrainingArguments(
        output_dir=f'/gd/MyDrive/models/{experiment_name}',
        num_train_epochs=3,
        per_device_train_batch_size=64,
        per_device_eval_batch_size=64,
        warmup_steps=500,
        learning_rate=2e-5,
        weight_decay=0.01,
        logging_dir=f'/gd/MyDrive/logs/{experiment_name}',
        logging_steps=10,
        evaluation_strategy="epoch",
    )

    # Initialize Trainer
    trainer = Trainer(
        model=model,
        args=training_args,
        train_dataset=training_dataset,
        eval_dataset=test_dataset,
        compute_metrics=compute_metrics,
    )

    # Train and evaluate the model
    trainer.train()
    trainer.evaluate()
    print("-"*50)

Training on train_balanced_os


model.safetensors:   0%|          | 0.00/268M [00:00<?, ?B/s]

Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Epoch,Training Loss,Validation Loss


Epoch,Training Loss,Validation Loss,Accuracy,F1,Precision,Recall
1,0.4241,0.351523,0.892,0.892111,0.89464,0.892
2,0.1691,0.213234,0.932,0.932067,0.933724,0.932
3,0.0795,0.159309,0.962,0.962002,0.962121,0.962


--------------------------------------------------


Final scores among finetuned DistilBERTs (Best Accuracy)

- Cleaned: 0.92
- Topic Merged: 0.93
- Balanced w/ Undersampling: 0.78
- Balanced w/ Oversampling: 0.96

### MODEL LEADERBOARD

For overview, here's a list of all saved models ranked by the accuracy score and then by training time.

| Model | Dataset | Embeddings | Training Time | Accuracy |
|----------|----------|----------|----------|----------|
| kNN | Topic Merged | TF-IDF | 0.98 | 7.40s |
| Random Forest | Topic Merged | TF-IDF | 0.98 | 1724.63s |
| Transformer (DistilBERT) | Balanced via Oversampling | DistilBERT Embeddings | 0.96 | 8887s |
| Decision Tree | Cleaned | TF-IDF | 0.93 | 132.02s |
| Transformer (DistilBERT) | Topic Merged | DistilBERT Embeddings | 0.93 | 1884s |
| Transformer (DistilBERT) | Cleaned | DistilBERT Embeddings | 0.92 | 1989s |
| kNN | Topic Merged | DistilBERT Embeddings | 0.89 | 0.19s |
| Logistic Regression | Cleaned | TF-IDF | 0.89 | 103.96s |
| Random Forest | Topic Merged | DistilBERT Embeddings | 0.83 | 778.36s |
| Multinomial Naive Bayes | Balanced via Oversampling | TF-IDF | 0.81 | 7.95s |
| Transformer (DistilBERT) | Balanced via Undersampling | DistilBERT Embeddings | 0.78 | 619s |
| Decision Tree | Cleaned | DistilBERT Embeddings | 0.74 | 411.99s |
| Logistic Regression | Topic Merged | DistilBERT Embeddings | 0.62 | 867.21s |
| Multinomial Naive Bayes | Cleaned | DistilBERT Embeddings | 0.48 | 3.43s |

## RESULTS

- Transformer is not the best model!
    - However one may argue that simple models overfit too much and are not able to generalize, perhaps the validation data is too similar to training data.
    - All transformer models, except undersampled, have >0.9 accuracy.
- kNN (k=5) is both fast and accurate and the best model.
- Merging topic into training data seems to boost performance.
- Balanced representation with oversampling is beneficial for transformers, but not that much for simple models.