SiEBERT - English-Language Sentiment Classification

This model ("SiEBERT", prefix for "Sentiment in English") is a fine-tuned checkpoint of RoBERTa-large (Liu et al. 2019). It enables reliable binary sentiment analysis for various types of English-language text. For each instance, it predicts either positive (1) or negative (0) sentiment. The model was fine-tuned and evaluated on 15 data sets from diverse text sources to enhance generalization across different types of texts (reviews, tweets, etc.). Consequently, it outperforms models trained on only one type of text (e.g., movie reviews from the popular SST-2 benchmark) when used on new data as shown below.

https://huggingface.co/siebert/sentiment-roberta-large-english?text=I+like+you.+I+love+you

In [1]:
! pip install transformers
! pip install pandas
! pip install evaluate
# ! pip install transformers datasets

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


In [2]:
# Import required packages
import torch
import pandas as pd
import numpy as np
from transformers import AutoTokenizer, AutoModelForSequenceClassification, Trainer, TrainingArguments
import evaluate
from random import sample

# Create class for data preparation
class SimpleDataset:
    def __init__(self, tokenized_texts):
        self.tokenized_texts = tokenized_texts
    
    def __len__(self):
        return len(self.tokenized_texts["input_ids"])
    
    def __getitem__(self, idx):
        return {k: v[idx] for k, v in self.tokenized_texts.items()}


In [3]:
# Load tokenizer and model, create trainer
model_name = "siebert/sentiment-roberta-large-english"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)
trainer = Trainer(model=model)

In [5]:
# Create list of texts (can be imported from .csv, .xls etc.)
train_raw = pd.read_csv('train_df_imbalanced.csv')
test_raw1 = pd.read_csv('test_df_Bryson.csv')
test_raw2 = pd.read_csv('test_df_Gx.csv')
test_raw3 = pd.read_csv('test_df_Kelvin.csv')

In [7]:
def prepare_test_set(raw):
    """Keep non-neutral reviews, build the text column, and map polarity -1 -> 0."""
    # .copy() makes the filtered frame independent, which avoids SettingWithCopyWarning
    df = raw.loc[raw['Annotator_1'] != 0].copy()
    df['text'] = df['reviewTitle'] + ' ' + df['reviewDescription']
    df = df[['text', 'Annotator_1']].rename(columns={'Annotator_1': 'polarity'})
    df.loc[df['polarity'] == -1, 'polarity'] = 0
    # Texts passed to the tokenizer; rows with a missing title or description are dropped
    texts = df['text'].dropna().astype('str').tolist()
    return df, texts

test1, pred_test1 = prepare_test_set(test_raw1)
test2, pred_test2 = prepare_test_set(test_raw2)
test3, pred_test3 = prepare_test_set(test_raw3)


In [None]:
# pos_neg = raw_data.loc[raw_data['polarity'] != 0] # only take pos and neg reviews
# true_labels = pos_neg[['concat_review', 'polarity']].copy()
# true_labels = true_labels.rename(columns={'concat_review': 'text'})
# true_labels.loc[true_labels['polarity'] == -1, 'polarity'] = 0
# pred_texts = pos_neg['concat_review'].dropna().astype('str').tolist()
# len(true_labels) == len(pred_texts)

In [8]:
# Tokenize texts and create prediction data set
tokenized_text1 = tokenizer(pred_test1,truncation=True,padding='max_length')
tokenized_text2 = tokenizer(pred_test2,truncation=True,padding='max_length')
tokenized_text3 = tokenizer(pred_test3,truncation=True,padding='max_length')

pred_dataset1 = SimpleDataset(tokenized_text1)
pred_dataset2 = SimpleDataset(tokenized_text2)
pred_dataset3 = SimpleDataset(tokenized_text3)

In [9]:
# Run predictions
prediction1 = trainer.predict(pred_dataset1)
prediction2 = trainer.predict(pred_dataset2)
prediction3 = trainer.predict(pred_dataset3)

***** Running Prediction *****
  Num examples = 758
  Batch size = 8


***** Running Prediction *****
  Num examples = 762
  Batch size = 8


***** Running Prediction *****
  Num examples = 707
  Batch size = 8


In [10]:
# Transform predictions to labels
preds1 = prediction1.predictions.argmax(-1)
labels1 = pd.Series(preds1).map(model.config.id2label)
scores1 = (np.exp(prediction1[0])/np.exp(prediction1[0]).sum(-1,keepdims=True)).max(1)

preds2 = prediction2.predictions.argmax(-1)
labels2 = pd.Series(preds2).map(model.config.id2label)
scores2 = (np.exp(prediction2[0])/np.exp(prediction2[0]).sum(-1,keepdims=True)).max(1)

preds3 = prediction3.predictions.argmax(-1)
labels3 = pd.Series(preds3).map(model.config.id2label)
scores3 = (np.exp(prediction3[0])/np.exp(prediction3[0]).sum(-1,keepdims=True)).max(1)
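
The score lines above exponentiate the raw logits directly. A minimal sketch of a numerically stable variant, assuming only NumPy: subtracting each row's maximum before exponentiating leaves the probabilities unchanged but avoids overflow for large logits. The helper name `softmax_scores` and the example logits are hypothetical and not part of the original notebook.

```python
import numpy as np

def softmax_scores(logits):
    """Return (predicted class ids, probability of the predicted class) per row."""
    logits = np.asarray(logits, dtype=np.float64)
    # Shift each row so its largest entry is 0; softmax is invariant to this shift
    shifted = logits - logits.max(axis=-1, keepdims=True)
    probs = np.exp(shifted)
    probs /= probs.sum(axis=-1, keepdims=True)
    return probs.argmax(axis=-1), probs.max(axis=-1)

# Illustrative logits only, not model output
example_logits = np.array([[4.0, -3.5], [-2.0, 2.5], [1000.0, 0.0]])
example_preds, example_scores = softmax_scores(example_logits)
```

With the notebook's objects, a call such as `softmax_scores(prediction1.predictions)` would presumably yield the same values as `preds1` and `scores1`.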

In [11]:
# Create DataFrame with texts, predictions, labels, and scores
df_test1 = pd.DataFrame(list(zip(pred_test1,preds1,labels1,scores1)), columns=['text','pred','label','score'])
df_test1

Unnamed: 0,text,pred,label,score
0,The content is all messed up I started this bo...,0,NEGATIVE,0.999510
1,Duplicate copy.Damaged book. Pages missing.,0,NEGATIVE,0.999499
2,Awful I gave up after 38% of my Kindle. Yes we...,0,NEGATIVE,0.999505
3,Syrupy Overload The book is an example of lead...,0,NEGATIVE,0.999471
4,"Couldn’t read it; type too small! Beware, th...",0,NEGATIVE,0.999497
...,...,...,...,...
753,Actually Excited About A New Edition Been play...,1,POSITIVE,0.998796
754,"Elliot is My Newest Book Boyfriend! Sweet, end...",1,POSITIVE,0.998864
755,Beautifully Packagaed and Wonderful Series I'v...,1,POSITIVE,0.998924
756,A great beginning book that will teach my Gran...,1,POSITIVE,0.998902


In [12]:
# Attach the true labels (polarity) to the predictions by merging on the review text
df_test1 = df_test1.merge(test1, on='text')
df_test1

Unnamed: 0,text,pred,label,score,polarity
0,The content is all messed up I started this bo...,0,NEGATIVE,0.999510,0
1,Duplicate copy.Damaged book. Pages missing.,0,NEGATIVE,0.999499,0
2,Awful I gave up after 38% of my Kindle. Yes we...,0,NEGATIVE,0.999505,0
3,Syrupy Overload The book is an example of lead...,0,NEGATIVE,0.999471,0
4,"Couldn’t read it; type too small! Beware, th...",0,NEGATIVE,0.999497,0
...,...,...,...,...,...
753,Actually Excited About A New Edition Been play...,1,POSITIVE,0.998796,1
754,"Elliot is My Newest Book Boyfriend! Sweet, end...",1,POSITIVE,0.998864,1
755,Beautifully Packagaed and Wonderful Series I'v...,1,POSITIVE,0.998924,1
756,A great beginning book that will teach my Gran...,1,POSITIVE,0.998902,1
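
Merging on `text` pairs every prediction with every labeled row that has the same text, so duplicated reviews multiply rows and can shift the accuracy. The sketch below uses made-up frames (not the notebook's CSV files) to contrast the merge with positional alignment, which applies when the labels come from the same filtered rows, in the same order, as the texts passed to the tokenizer.

```python
import pandas as pd

# Made-up labeled data with one duplicated review text
labeled = pd.DataFrame({
    "text": ["great book", "pages missing", "great book"],
    "polarity": [1, 0, 1],
})
labeled = labeled.dropna(subset=["text"]).reset_index(drop=True)

# Stand-in predictions, one per labeled row and in the same order
predictions = pd.DataFrame({"text": labeled["text"], "pred": [1, 0, 1]})

# Merge on text: the duplicated text matches 2 x 2 times, plus 1 row for the other text
merged = predictions.merge(labeled, on="text")

# Positional alignment: exactly one row per prediction
aligned = predictions.assign(polarity=labeled["polarity"].to_numpy())
```

Comparing `len(merged)` with `len(aligned)` is a quick way to check whether duplicated texts are affecting a merged result.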


In [13]:
df_test2 = pd.DataFrame(list(zip(pred_test2,preds2,labels2,scores2)), columns=['text','pred','label','score'])
df_test2 = df_test2.merge(test2, on='text')

In [14]:
df_test3 = pd.DataFrame(list(zip(pred_test3,preds3,labels3,scores3)), columns=['text','pred','label','score'])
df_test3 = df_test3.merge(test3, on='text')

In [15]:
def pred_correctly(y_true, y_pred):
    """Return 1 if the prediction matches the true label, else 0."""
    return int(y_true == y_pred)

In [16]:
df_test1['correct'] = df_test1.apply(lambda row: pred_correctly(row['polarity'], row['pred']), axis=1)
df_test2['correct'] = df_test2.apply(lambda row: pred_correctly(row['polarity'], row['pred']), axis=1)
df_test3['correct'] = df_test3.apply(lambda row: pred_correctly(row['polarity'], row['pred']), axis=1)

In [17]:
print('Test 1 Accuracy: ', df_test1['correct'].sum()/len(df_test1))

Test 1 Accuracy:  0.9696569920844327


In [18]:
print('Test 2 Accuracy: ', df_test2['correct'].sum()/len(df_test2))

Test 2 Accuracy:  0.9671916010498688


In [19]:
print('Test 3 Accuracy: ', df_test3['correct'].sum()/len(df_test3))

Test 3 Accuracy:  0.983026874115983
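
Accuracy alone does not show which class the errors fall in. A minimal sketch, in plain Python with made-up label lists (not the notebook's results), of counting the four confusion-matrix cells and deriving per-class recall. The function name `confusion_counts` is hypothetical.

```python
from collections import Counter

def confusion_counts(y_true, y_pred):
    """Count (true label, predicted label) pairs for binary labels 0/1."""
    counts = Counter(zip(y_true, y_pred))
    return {
        "tn": counts[(0, 0)], "fp": counts[(0, 1)],
        "fn": counts[(1, 0)], "tp": counts[(1, 1)],
    }

# Made-up labels for illustration
y_true = [0, 0, 0, 1, 1, 1, 1, 1]
y_pred = [0, 0, 1, 1, 1, 1, 1, 0]

cells = confusion_counts(y_true, y_pred)
accuracy = (cells["tp"] + cells["tn"]) / len(y_true)
recall_negative = cells["tn"] / (cells["tn"] + cells["fp"])
recall_positive = cells["tp"] / (cells["tp"] + cells["fn"])
```

For the frames in this notebook, the inputs could be `df_test1['polarity'].tolist()` and `df_test1['pred'].tolist()`.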


Fine-tuning SiEBERT

Note: this fine-tuning attempt did not complete. The run below failed with a CUDA out-of-memory error.

In [35]:
# Training data: keep non-neutral reviews and map polarity -1 -> 0
train = train_raw.loc[train_raw.polarity != 0]
train = train[['concat_review', 'polarity']].copy()
train = train.rename(columns={'concat_review': 'text'})
train.loc[train['polarity'] == -1, 'polarity'] = 0
train_text = train['text'].dropna().astype('str').tolist()

# Validation data. NOTE: val_raw is not loaded in any cell shown in this notebook;
# it must be read with pd.read_csv before this cell runs.
val = val_raw.loc[val_raw.Annotator_1 != 0].copy()  # .copy() avoids SettingWithCopyWarning
val['text'] = val['reviewTitle'] + ' ' + val['reviewDescription']
val = val[['text', 'Annotator_1']].copy()
val = val.rename(columns={'Annotator_1': 'polarity'})
val.loc[val['polarity'] == -1, 'polarity'] = 0
val_text = val['text'].dropna().astype('str').tolist()


In [36]:
small_train = train.sample(n=2000, random_state=42)
small_train_text = small_train['text'].dropna().astype('str').tolist()

In [37]:
tokenized_train = tokenizer(small_train_text,truncation=True,padding='max_length')
tokenized_val = tokenizer(val_text,truncation=True,padding='max_length')

# train_ds = SimpleDataset(tokenized_train)
# val_ds = SimpleDataset(tokenized_val)

In [38]:
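from torch.utils.data import DataLoader
from transformers import default_data_collator

# Wrap the tokenizer output in a Dataset first: len() of the raw tokenizer output counts
# its fields (input_ids, attention_mask), not the number of examples.
train_dataloader = DataLoader(SimpleDataset(tokenized_train), shuffle=True, batch_size=8, collate_fn=default_data_collator)
eval_dataloader = DataLoader(SimpleDataset(tokenized_val), batch_size=8, collate_fn=default_data_collator)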

In [25]:
model_name = "siebert/sentiment-roberta-large-english"
model = AutoModelForSequenceClassification.from_pretrained(model_name)

loading configuration file config.json from cache at /root/.cache/huggingface/hub/models--siebert--sentiment-roberta-large-english/snapshots/6eac71655a474ee4d6d0eee7fa532300c537856d/config.json
Model config RobertaConfig {
  "_name_or_path": "siebert/sentiment-roberta-large-english",
  "architectures": [
    "RobertaForSequenceClassification"
  ],
  "attention_probs_dropout_prob": 0.1,
  "bos_token_id": 0,
  "classifier_dropout": null,
  "eos_token_id": 2,
  "gradient_checkpointing": false,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 1024,
  "id2label": {
    "0": "NEGATIVE",
    "1": "POSITIVE"
  },
  "initializer_range": 0.02,
  "intermediate_size": 4096,
  "label2id": {
    "NEGATIVE": 0,
    "POSITIVE": 1
  },
  "layer_norm_eps": 1e-05,
  "max_position_embeddings": 514,
  "model_type": "roberta",
  "num_attention_heads": 16,
  "num_hidden_layers": 24,
  "pad_token_id": 1,
  "position_embedding_type": "absolute",
  "transformers_version": "4.23.1",
  "type

In [26]:
from torch.optim import AdamW

optimizer = AdamW(model.parameters(), lr=5e-5)

In [27]:
from transformers import get_scheduler

num_epochs = 3
num_training_steps = num_epochs * len(train_dataloader)
lr_scheduler = get_scheduler(
    name="linear", optimizer=optimizer, num_warmup_steps=0, num_training_steps=num_training_steps
)
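
The scheduler's `num_training_steps` is the number of batches per epoch times the number of epochs. As a sanity check under the settings used in this notebook (2,000 sampled training examples, batch size 8, 3 epochs), the arithmetic below reproduces the 750 optimization steps that appear in the training log further down. The helper names are hypothetical, and `linear_lr` is a plain-Python restatement of a linear warmup and decay schedule, not the library's code.

```python
import math

def total_training_steps(num_examples, batch_size, num_epochs):
    """Batches per epoch (a final partial batch counts as one) times epochs."""
    return math.ceil(num_examples / batch_size) * num_epochs

def linear_lr(step, base_lr, num_training_steps, num_warmup_steps=0):
    """Learning rate after `step` optimizer steps: linear warmup, then linear decay to 0."""
    if step < num_warmup_steps:
        return base_lr * step / max(1, num_warmup_steps)
    remaining = max(0, num_training_steps - step)
    return base_lr * remaining / max(1, num_training_steps - num_warmup_steps)

steps = total_training_steps(num_examples=2000, batch_size=8, num_epochs=3)
lr_start = linear_lr(0, 5e-5, steps)
lr_middle = linear_lr(steps // 2, 5e-5, steps)
lr_end = linear_lr(steps, 5e-5, steps)
```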

In [29]:
import torch

device = torch.device("cuda") if torch.cuda.is_available() else torch.device("cpu")
model.to(device)

RuntimeError: ignored

In [40]:
acc_metric = evaluate.load('accuracy')
def compute_metrics(eval_pred):
  logits, labels = eval_pred
  predictions = np.argmax(logits, axis=-1)
  return acc_metric.compute(predictions=predictions, references=labels)

In [41]:
training_args = TrainingArguments(output_dir="test_trainer", evaluation_strategy="epoch", per_device_train_batch_size=8, per_device_eval_batch_size=8)

PyTorch: setting up devices
The default value for the training argument `--report_to` will change in v5 (from all installed integrations to none). In v5, you will need to use `--report_to all` to get the same behavior as now. You should start updating your code and make this info disappear :-).


In [42]:
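# Trainer builds its own DataLoaders, so it needs Dataset objects whose items include "labels".
class LabeledDataset(SimpleDataset):
    def __init__(self, tokenized_texts, labels):
        super().__init__(tokenized_texts)
        self.labels = list(labels)
    def __getitem__(self, idx):
        return {**super().__getitem__(idx), "labels": int(self.labels[idx])}

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=LabeledDataset(tokenized_train, small_train.dropna(subset=['text'])['polarity']),
    eval_dataset=LabeledDataset(tokenized_val, val.dropna(subset=['text'])['polarity']),
    compute_metrics=compute_metrics,
)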

RuntimeError: ignored

In [54]:
import torch
torch.cuda.empty_cache()
# torch.cuda.memory_summary(device='cuda', abbreviated=False)

In [12]:
trainer.train()

***** Running training *****
  Num examples = 2000
  Num Epochs = 3
  Instantaneous batch size per device = 8
  Total train batch size (w. parallel, distributed & accumulation) = 8
  Gradient Accumulation steps = 1
  Total optimization steps = 750


RuntimeError: ignored
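
Out-of-memory errors when fine-tuning a RoBERTa-large model on a single GPU are commonly addressed by shrinking the per-device batch and accumulating gradients, optionally combined with mixed precision and gradient checkpointing. The sketch below only builds keyword arguments. `low_memory_training_kwargs` is a hypothetical helper, the values are untested assumptions for this notebook, and whether they fit a given GPU has to be checked empirically. The dictionary keys are existing `TrainingArguments` parameters.

```python
def low_memory_training_kwargs(effective_batch_size=8, per_device_batch_size=2):
    """Keep the effective batch size while lowering per-step memory use."""
    if effective_batch_size % per_device_batch_size != 0:
        raise ValueError("effective_batch_size must be a multiple of per_device_batch_size")
    return {
        "per_device_train_batch_size": per_device_batch_size,
        "per_device_eval_batch_size": per_device_batch_size,
        # Gradients from several small batches are accumulated before each optimizer step
        "gradient_accumulation_steps": effective_batch_size // per_device_batch_size,
        # Recompute activations in the backward pass instead of storing them
        "gradient_checkpointing": True,
        # Mixed precision; requires a CUDA GPU
        "fp16": True,
    }

kwargs = low_memory_training_kwargs()
# Possible use (assumption, not run here):
# training_args = TrainingArguments(output_dir="test_trainer", evaluation_strategy="epoch", **kwargs)
```

Capping the sequence length in the tokenizer call (for example `truncation=True, max_length=256`) also reduces memory, at the cost of truncating long reviews.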