## Text Classification

**Goal**: Assigning a label or class to input text



### Dataset

*rotten_tomatoes* dataset: https://huggingface.co/datasets/cornell-movie-review-data/rotten_tomatoes

In [48]:
import torch
import numpy as np
from tqdm import tqdm #to show progress bar, wrap it around an iterable
from transformers.pipelines.pt_utils import KeyDataset
from transformers import pipeline
from datasets import load_dataset
from sklearn.metrics import classification_report

In [34]:
data = load_dataset('rotten_tomatoes')

In [35]:
data

DatasetDict({
    train: Dataset({
        features: ['text', 'label'],
        num_rows: 8530
    })
    validation: Dataset({
        features: ['text', 'label'],
        num_rows: 1066
    })
    test: Dataset({
        features: ['text', 'label'],
        num_rows: 1066
    })
})

the data contains 3 splits: train, test, validation as a DatasetDict object which can be used the same as a dictionary

In [40]:
train = data['train']
validation = data['validation']
test = data['test']

### Text Classification with Representation Models
Encoder only models used for representing language e.g. creating embeddings and typically do not generate text

#### Using a task specific model
A base model finetuned for a specific task e.g. BERT for Text classification

In [42]:
model_path = 'cardiffnlp/twitter-roberta-base-sentiment-latest' # model to be used for classification

pipe = pipeline(
    model=model_path,
    tokenizer=model_path, #converts input to individual tokens
    return_all_scores=True #return all probability scores
)

Some weights of the model checkpoint at cardiffnlp/twitter-roberta-base-sentiment-latest were not used when initializing RobertaForSequenceClassification: ['roberta.pooler.dense.bias', 'roberta.pooler.dense.weight']
- This IS expected if you are initializing RobertaForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing RobertaForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Device set to use mps:0


In [52]:
y_pred_ts = []
for output in pipe(KeyDataset(test, 'text')):
    negative_score = output[0]['score']
    positive_score = output[2]['score']
    assignment = np.argmax([negative_score, positive_score])
    y_pred_ts.append(assignment)

In [54]:
def evaluate_performance(y_true, y_pred):
    performance = classification_report(y_true=y_true, y_pred=y_pred, target_names=['Negative Review', 'Positive Review'])
    print(performance)

In [55]:
evaluate_performance(test['label'], y_pred_ts)

                 precision    recall  f1-score   support

Negative Review       0.76      0.88      0.81       533
Positive Review       0.86      0.72      0.78       533

       accuracy                           0.80      1066
      macro avg       0.81      0.80      0.80      1066
   weighted avg       0.81      0.80      0.80      1066



#### Using an Embedding model: Supervised Classification

what if there is no model finetuned for the specific task we need? We  can use a two step process
1. Generate embeddings
2. Train a classifier

In [66]:
from sentence_transformers import SentenceTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics.pairwise import cosine_similarity

In [57]:
model = SentenceTransformer('sentence-transformers/all-mpnet-base-v2')

In [58]:
train_embeddings = model.encode(train['text'], show_progress_bar=True)
test_embeddings = model.encode(test['text'], show_progress_bar=True)

Batches: 100%|██████████| 267/267 [01:04<00:00,  4.16it/s]
Batches: 100%|██████████| 34/34 [00:07<00:00,  4.38it/s]


In [60]:
clf = LogisticRegression(random_state=42)
clf.fit(train_embeddings, train['label'])

0,1,2
,"penalty  penalty: {'l1', 'l2', 'elasticnet', None}, default='l2' Specify the norm of the penalty: - `None`: no penalty is added; - `'l2'`: add a L2 penalty term and it is the default choice; - `'l1'`: add a L1 penalty term; - `'elasticnet'`: both L1 and L2 penalty terms are added. .. warning::  Some penalties may not work with some solvers. See the parameter  `solver` below, to know the compatibility between the penalty and  solver. .. versionadded:: 0.19  l1 penalty with SAGA solver (allowing 'multinomial' + L1) .. deprecated:: 1.8  `penalty` was deprecated in version 1.8 and will be removed in 1.10.  Use `l1_ratio` instead. `l1_ratio=0` for `penalty='l2'`, `l1_ratio=1` for  `penalty='l1'` and `l1_ratio` set to any float between 0 and 1 for  `'penalty='elasticnet'`.",'deprecated'
,"C  C: float, default=1.0 Inverse of regularization strength; must be a positive float. Like in support vector machines, smaller values specify stronger regularization. `C=np.inf` results in unpenalized logistic regression. For a visual example on the effect of tuning the `C` parameter with an L1 penalty, see: :ref:`sphx_glr_auto_examples_linear_model_plot_logistic_path.py`.",1.0
,"l1_ratio  l1_ratio: float, default=0.0 The Elastic-Net mixing parameter, with `0 <= l1_ratio <= 1`. Setting `l1_ratio=1` gives a pure L1-penalty, setting `l1_ratio=0` a pure L2-penalty. Any value between 0 and 1 gives an Elastic-Net penalty of the form `l1_ratio * L1 + (1 - l1_ratio) * L2`. .. warning::  Certain values of `l1_ratio`, i.e. some penalties, may not work with some  solvers. See the parameter `solver` below, to know the compatibility between  the penalty and solver. .. versionchanged:: 1.8  Default value changed from None to 0.0. .. deprecated:: 1.8  `None` is deprecated and will be removed in version 1.10. Always use  `l1_ratio` to specify the penalty type.",0.0
,"dual  dual: bool, default=False Dual (constrained) or primal (regularized, see also :ref:`this equation `) formulation. Dual formulation is only implemented for l2 penalty with liblinear solver. Prefer `dual=False` when n_samples > n_features.",False
,"tol  tol: float, default=1e-4 Tolerance for stopping criteria.",0.0001
,"fit_intercept  fit_intercept: bool, default=True Specifies if a constant (a.k.a. bias or intercept) should be added to the decision function.",True
,"intercept_scaling  intercept_scaling: float, default=1 Useful only when the solver `liblinear` is used and `self.fit_intercept` is set to `True`. In this case, `x` becomes `[x, self.intercept_scaling]`, i.e. a ""synthetic"" feature with constant value equal to `intercept_scaling` is appended to the instance vector. The intercept becomes ``intercept_scaling * synthetic_feature_weight``. .. note::  The synthetic feature weight is subject to L1 or L2  regularization as all other features.  To lessen the effect of regularization on synthetic feature weight  (and therefore on the intercept) `intercept_scaling` has to be increased.",1
,"class_weight  class_weight: dict or 'balanced', default=None Weights associated with classes in the form ``{class_label: weight}``. If not given, all classes are supposed to have weight one. The ""balanced"" mode uses the values of y to automatically adjust weights inversely proportional to class frequencies in the input data as ``n_samples / (n_classes * np.bincount(y))``. Note that these weights will be multiplied with sample_weight (passed through the fit method) if sample_weight is specified. .. versionadded:: 0.17  *class_weight='balanced'*",
,"random_state  random_state: int, RandomState instance, default=None Used when ``solver`` == 'sag', 'saga' or 'liblinear' to shuffle the data. See :term:`Glossary ` for details.",42
,"solver  solver: {'lbfgs', 'liblinear', 'newton-cg', 'newton-cholesky', 'sag', 'saga'}, default='lbfgs' Algorithm to use in the optimization problem. Default is 'lbfgs'. To choose a solver, you might want to consider the following aspects: - 'lbfgs' is a good default solver because it works reasonably well for a wide  class of problems. - For :term:`multiclass` problems (`n_classes >= 3`), all solvers except  'liblinear' minimize the full multinomial loss, 'liblinear' will raise an  error. - 'newton-cholesky' is a good choice for  `n_samples` >> `n_features * n_classes`, especially with one-hot encoded  categorical features with rare categories. Be aware that the memory usage  of this solver has a quadratic dependency on `n_features * n_classes`  because it explicitly computes the full Hessian matrix. - For small datasets, 'liblinear' is a good choice, whereas 'sag'  and 'saga' are faster for large ones; - 'liblinear' can only handle binary classification by default. To apply a  one-versus-rest scheme for the multiclass setting one can wrap it with the  :class:`~sklearn.multiclass.OneVsRestClassifier`. .. warning::  The choice of the algorithm depends on the penalty chosen (`l1_ratio=0`  for L2-penalty, `l1_ratio=1` for L1-penalty and `0 < l1_ratio < 1` for  Elastic-Net) and on (multinomial) multiclass support:  ================= ======================== ======================  solver l1_ratio multinomial multiclass  ================= ======================== ======================  'lbfgs' l1_ratio=0 yes  'liblinear' l1_ratio=1 or l1_ratio=0 no  'newton-cg' l1_ratio=0 yes  'newton-cholesky' l1_ratio=0 yes  'sag' l1_ratio=0 yes  'saga' 0<=l1_ratio<=1 yes  ================= ======================== ====================== .. note::  'sag' and 'saga' fast convergence is only guaranteed on features  with approximately the same scale. You can preprocess the data with  a scaler from :mod:`sklearn.preprocessing`. .. seealso::  Refer to the :ref:`User Guide ` for more  information regarding :class:`LogisticRegression` and more specifically the  :ref:`Table `  summarizing solver/penalty supports. .. versionadded:: 0.17  Stochastic Average Gradient (SAG) descent solver. Multinomial support in  version 0.18. .. versionadded:: 0.19  SAGA solver. .. versionchanged:: 0.22  The default solver changed from 'liblinear' to 'lbfgs' in 0.22. .. versionadded:: 1.2  newton-cholesky solver. Multinomial support in version 1.6.",'lbfgs'


In [62]:
y_pred_emb = clf.predict(test_embeddings)

In [64]:
evaluate_performance(test['label'], y_pred_emb)

                 precision    recall  f1-score   support

Negative Review       0.85      0.86      0.85       533
Positive Review       0.86      0.85      0.85       533

       accuracy                           0.85      1066
      macro avg       0.85      0.85      0.85      1066
   weighted avg       0.85      0.85      0.85      1066



#### Zero Shot Classification

What if we don't have any labelled data? Predicting the labels of input text with no prior training

In [65]:
label_embeddings = model.encode(['A negative review', 'A positive review']) # create embeddings of the labels

In [67]:
sim_matrix = cosine_similarity(test_embeddings, label_embeddings)
sim_matrix.shape

(1066, 2)

In [68]:
y_pred_zsc = np.argmax(sim_matrix, axis=1)

In [70]:
evaluate_performance(test['label'], y_pred_zsc)

                 precision    recall  f1-score   support

Negative Review       0.78      0.77      0.78       533
Positive Review       0.77      0.79      0.78       533

       accuracy                           0.78      1066
      macro avg       0.78      0.78      0.78      1066
   weighted avg       0.78      0.78      0.78      1066



In [71]:
label_embeddings_2 = model.encode(['A very negative movie review', 'A very positive movie review'])

In [74]:
sim_matrix = cosine_similarity(test_embeddings, label_embeddings_2)
y_pred_zsc2 = np.argmax(sim_matrix, axis=1)

In [75]:
evaluate_performance(test['label'], y_pred_zsc2)

                 precision    recall  f1-score   support

Negative Review       0.86      0.73      0.79       533
Positive Review       0.76      0.88      0.82       533

       accuracy                           0.80      1066
      macro avg       0.81      0.80      0.80      1066
   weighted avg       0.81      0.80      0.80      1066



### Text Classification with Generative Models

#### Open source model

In [76]:
pipe = pipeline(
    'text2text-generation',
    model = 'google/flan-t5-small'
)

Device set to use mps:0


In [77]:
prompt = 'Is the following sentence positive or negative?'
data = data.map(lambda example: { 't5': prompt + example['text']}) # adding the prompt to the text data
data

DatasetDict({
    train: Dataset({
        features: ['text', 'label', 't5'],
        num_rows: 8530
    })
    validation: Dataset({
        features: ['text', 'label', 't5'],
        num_rows: 1066
    })
    test: Dataset({
        features: ['text', 'label', 't5'],
        num_rows: 1066
    })
})

In [79]:
y_pred_open = []
for output in tqdm(pipe(KeyDataset(data['test'], 't5')), total=len(data['test'])):
    text = output[0]['generated_text']
    y_pred_open.append(0 if text == 'negative' else 1)

100%|██████████| 1066/1066 [03:48<00:00,  4.66it/s]


In [80]:
evaluate_performance(test['label'], y_pred_open)

                 precision    recall  f1-score   support

Negative Review       0.83      0.84      0.83       533
Positive Review       0.84      0.83      0.83       533

       accuracy                           0.83      1066
      macro avg       0.83      0.83      0.83      1066
   weighted avg       0.83      0.83      0.83      1066



#### Close Source model

In [81]:
 import openai
 import os
 from dotenv import load_dotenv

In [108]:
load_dotenv()
api = os.getenv('openai')

In [83]:
client = openai.OpenAI(api_key=api)

In [84]:
def chatgpt_generation(prompt, document, model='gpt-3.5-turbo-0125'):
    messages=[
        {
            'role': 'system',
            'content': 'You are a hepful assistant'
        },
        {
            'role': 'user',
            'content': prompt.replace('DOCUMENT', document)
        }
    ]

    chat_completion = client.chat.completions.create(
        messages=messages,
        model=model,
        temperature=0
    )
    return chat_completion.choices[0].message.content

In [85]:
prompt = """ Predict whether the following document is a positive or negative movie review:

[DOCUMENT]

If it is positive return 1 and if it is negative return 0. Do not give any other answers

"""

In [86]:
document = 'unpretentious, charming, quirky, original'
chatgpt_generation(prompt, document)

'1'

In [87]:
predictions = [
    chatgpt_generation(prompt, doc) for doc in tqdm(data['test']['text'])
]

100%|██████████| 1066/1066 [11:36<00:00,  1.53it/s]


In [88]:
y_pred_closed = [int(pred) for pred in predictions]

In [89]:
evaluate_performance(test['label'], y_pred_closed)

                 precision    recall  f1-score   support

Negative Review       0.86      0.97      0.91       533
Positive Review       0.97      0.84      0.90       533

       accuracy                           0.90      1066
      macro avg       0.91      0.90      0.90      1066
   weighted avg       0.91      0.90      0.90      1066



In [96]:
import seaborn as sns
import pandas as pd
from sklearn.metrics import accuracy_score, f1_score, precision_recall_fscore_support

In [107]:
methods = ['Task Specific Model', 'Embeddings Model', ' Zero Shot Classification','OS Generative Model', 'CS Generative Model']
predictions = [y_pred_ts, y_pred_emb, y_pred_zsc, y_pred_open, y_pred_closed]

results = []

for method, y_pred in zip(methods, predictions):
    accuracy = accuracy_score(test['label'], y_pred)
    macro_f1 = f1_score(test['label'], y_pred, average='macro')

    precision, recall, f1, _ = precision_recall_fscore_support(
        test['label'], y_pred, average=None, labels=[0,1]
    ) 

    results.append({
        'Method': method,
        'Accuracy': accuracy,
        'Macro F1': macro_f1,
        'Neg Precision': precision[0],
        'Neg Recall': recall[0],
        'Neg F1': f1[0],
        'Pos Precision': precision[1],
        'Pos Recall': recall[1],
        'Pos F1': f1[1]
    })

df_results = pd.DataFrame(results)
print(df_results)

# Create leaderboard sorted by Macro F1
leaderboard = df_results.sort_values('Macro F1', ascending=False).reset_index(drop=True)
leaderboard.insert(0, 'Rank', range(1, len(leaderboard) + 1))

print("\n" + "="*80)
print("MODEL LEADERBOARD - RANKED BY MACRO F1 SCORE")
print("="*80)
print(leaderboard[['Rank', 'Method', 'Macro F1', 'Accuracy', 'Neg F1', 'Pos F1']].round(3).to_string(index=False))
print("="*80)


# print("\n" + "="*50)
# print("BEST PERFORMANCE PER METRIC:")
# print("="*50)
# for col in df_results.columns[1:]:
#     best_idx = df_results[col].idxmax()
#     best_method = df_results.loc[best_idx, 'Method']
#     best_score = df_results.loc[best_idx, col]
#     print(f"{col:15s}: {best_method:12s} ({best_score:.3f})")

                      Method  Accuracy  Macro F1  Neg Precision  Neg Recall  \
0        Task Specific Model  0.800188  0.798909       0.758900    0.879925   
1           Embeddings Model  0.851782  0.851778       0.847866    0.857411   
2   Zero Shot Classification  0.778612  0.778592       0.783939    0.769231   
3        OS Generative Model  0.833021  0.833018       0.830540    0.836773   
4        CS Generative Model  0.904315  0.903877       0.856198    0.971857   

     Neg F1  Pos Precision  Pos Recall    Pos F1  
0  0.814944       0.857143    0.720450  0.782875  
1  0.852612       0.855787    0.846154  0.850943  
2  0.776515       0.773481    0.787992  0.780669  
3  0.833645       0.835539    0.829268  0.832392  
4  0.910369       0.967462    0.836773  0.897384  

MODEL LEADERBOARD - RANKED BY MACRO F1 SCORE
 Rank                    Method  Macro F1  Accuracy  Neg F1  Pos F1
    1       CS Generative Model     0.904     0.904   0.910   0.897
    2          Embeddings Model     0