## Sentiment analysis using BERT and fine-tuning

In this notebook, we first run sentiment analysis on the media bias dataset with a small pre-trained language model (DistilBERT, a distilled version of BERT). We then fine-tune the model and repeat the analysis, comparing the accuracy of the model before and after.

In [1]:
# Imports 
import numpy as np
import pandas as pd
import sklearn
from sklearn.metrics import accuracy_score
from sklearn.preprocessing import LabelEncoder
import tensorflow as tf
import tensorflow_hub as hub
import matplotlib.pyplot as plt
import seaborn as sns
import transformers
from transformers import AutoTokenizer, AutoModelForSequenceClassification
from transformers import pipeline





**Importing and viewing data**

The dataset we are using is available as "final_labels_sg1". It contains sentences from news articles that have been manually labelled as Biased, Non-biased, or No agreement, along with any words the annotators flagged as biased.

In [2]:
url='https://docs.google.com/spreadsheets/d/1KKPAiOppopEzbnINsdl-OVR8WOg2ly1a/edit?usp=drive_link&ouid=109883226317661265367&rtpof=true&sd=true'
file_id=url.split('/')[-2]
dwn_url='https://drive.google.com/uc?id=' + file_id
df = pd.read_excel(dwn_url)
df.head(5)

Unnamed: 0,text,news_link,outlet,topic,type,label_bias,label_opinion,biased_words
0,The Republican president assumed he was helpin...,http://www.msnbc.com/rachel-maddow-show/auto-i...,msnbc,environment,left,Biased,Expresses writer’s opinion,[]
1,Though the indictment of a woman for her own p...,https://eu.usatoday.com/story/news/nation/2019...,usa-today,abortion,center,Non-biased,Somewhat factual but also opinionated,[]
2,Ingraham began the exchange by noting American...,https://www.breitbart.com/economy/2020/01/12/d...,breitbart,immigration,right,No agreement,No agreement,['flood']
3,The tragedy of America’s 18 years in Afghanist...,http://feedproxy.google.com/~r/breitbart/~3/ER...,breitbart,international-politics-and-world-news,right,Biased,Somewhat factual but also opinionated,"['tragedy', 'stubborn']"
4,The justices threw out a challenge from gun ri...,https://www.huffpost.com/entry/supreme-court-g...,msnbc,gun-control,left,Non-biased,Entirely factual,[]


We first run inference with a pre-trained DistilBERT sentiment model. Note that we assume Non-biased corresponds to positive sentiment and Biased to negative sentiment.

An off-the-shelf sentiment model is not designed for this task, so we will attempt to fine-tune a model for it later.

In [3]:
# Take random sample from dataset

df_sample = df.sample(n=100, random_state=42).reset_index(drop=True)
texts = df_sample["text"].tolist()

In [4]:
from transformers import pipeline
# Load pre-trained DistilBERT model for sentiment classification
classifier = pipeline("text-classification", model="distilbert-base-uncased-finetuned-sst-2-english")

# Predict sentiment for each text
predictions = classifier(texts)

Device set to use cpu


In [5]:
# Map the binary sentiment output onto the bias labels
def map_labels(pred):
    if pred["label"] == "POSITIVE":
        return "Non-biased"
    elif pred["label"] == "NEGATIVE":
        return "Biased"
    else:
        # Fallback only; this model returns just POSITIVE or NEGATIVE,
        # so "No agreement" is never predicted here
        return "No agreement"

# Apply mapping to predictions
df_sample["predicted_label"] = [map_labels(p) for p in predictions]

# Undo mapping from original labels for comparison
#label_mapping = {0:'Biased', 1: "No agreement", 2: "Non-biased"}
#df_sample['label_bias'] = df['label_bias'].map(label_mapping)

# Show results
df_sample[["text", "label_bias", "predicted_label"]].head(10)




Unnamed: 0,text,label_bias,predicted_label
0,If she doesn't want to thank him for his honor...,Biased,Biased
1,House Democrats’ Chinese coronavirus relief pa...,Biased,Biased
2,French sporting goods retailer Decathlon has c...,Non-biased,Biased
3,That is because much of what Newsom and the De...,No agreement,Biased
4,Apple and Google are facing criticism for offe...,Non-biased,Biased
5,Democrats really don’t want to spend billions ...,Biased,Biased
6,"That’s why white nationalists, who are enthusi...",Biased,Biased
7,Hungary is backing President Trump in his crac...,Biased,Non-biased
8,For the first time since the enactment of the ...,Non-biased,Biased
9,"Earlier this year, Freedom to Prosper and Data...",Non-biased,Biased
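
The sentiment checkpoint only emits `POSITIVE` or `NEGATIVE`, so "No agreement" never appears in the predictions. One possible way to produce it is to abstain whenever the model's score falls below a threshold. The sketch below assumes the pipeline's usual output format (a list of dicts with `label` and `score`); the function name `map_labels_with_threshold`, the 0.75 cut-off, and the sample predictions are hypothetical and not taken from the notebook.

```python
def map_labels_with_threshold(pred, threshold=0.75):
    """Map an SST-2 style prediction to a bias label, abstaining when unsure."""
    if pred["score"] < threshold:
        return "No agreement"
    return "Non-biased" if pred["label"] == "POSITIVE" else "Biased"

# Hypothetical predictions in the pipeline's output format
example_predictions = [
    {"label": "NEGATIVE", "score": 0.98},
    {"label": "POSITIVE", "score": 0.91},
    {"label": "NEGATIVE", "score": 0.55},
]

mapped = [map_labels_with_threshold(p) for p in example_predictions]
print(mapped)
```

Whether a low sentiment score is a sensible proxy for annotator disagreement is an open question; the threshold would need tuning on held-out data.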


In [6]:
# Metrics
from sklearn.metrics import classification_report

df_sample["label_bias"] = df_sample["label_bias"].astype(str)
df_sample["predicted_label"] = df_sample["predicted_label"].astype(str)

print(classification_report(df_sample["label_bias"], df_sample['predicted_label']))

              precision    recall  f1-score   support

      Biased       0.40      0.71      0.51        41
No agreement       0.00      0.00      0.00         9
  Non-biased       0.48      0.26      0.34        50

    accuracy                           0.42       100
   macro avg       0.29      0.32      0.28       100
weighted avg       0.40      0.42      0.38       100
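
The zero row for "No agreement" follows from the setup: the binary model never predicts that class, so its precision is undefined and scikit-learn reports it as 0 with a warning. A confusion table makes this visible. The sketch below builds one using only the standard library; the label lists are small made-up examples, not the notebook's results.

```python
from collections import Counter

labels = ["Biased", "No agreement", "Non-biased"]

# Hypothetical true and predicted labels, for illustration only
y_true = ["Biased", "Biased", "Non-biased", "Non-biased", "No agreement", "Non-biased"]
y_pred = ["Biased", "Non-biased", "Biased", "Non-biased", "Biased", "Biased"]

# Count each (true, predicted) pair and arrange the counts as a matrix
counts = Counter(zip(y_true, y_pred))
matrix = [[counts[(t, p)] for p in labels] for t in labels]

print("rows = true label, columns = predicted label:", labels)
for name, row in zip(labels, matrix):
    print(f"{name:>12}", row)

# A class whose column sums to zero was never predicted
never_predicted = [
    name for i, name in enumerate(labels) if sum(row[i] for row in matrix) == 0
]
print("never predicted:", never_predicted)
```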





**Fine Tuning**

In [None]:
# Map labels
label_mapping = {"Biased":0, "No agreement": 1, "Non-biased": 2}
df['label_bias'] = df['label_bias'].map(label_mapping)

In [None]:
# Convert into HF dataset
from datasets import Dataset
dataset = Dataset.from_pandas(df)

In [None]:
# Tokenisation
from transformers import AutoTokenizer

# DistilBERT tokenizer
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")

def tokenize_function(examples):
    tokens = tokenizer(examples["text"], padding="max_length", truncation=True)
    tokens["labels"] = examples["label_bias"]  # Add labels to the tokenized dataset
    return tokens

tokenized_datasets = dataset.map(tokenize_function, batched=True)

Map:   0%|          | 0/1700 [00:00<?, ? examples/s]
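
With `padding="max_length"` and no explicit `max_length`, each text is padded to the tokenizer's maximum length (512 tokens for `distilbert-base-uncased`), which is one likely reason the CPU run is slow. Padding each batch only to its own longest sequence, which is what `DataCollatorWithPadding` does, would usually be enough. The sketch below compares the two strategies on made-up token counts; the lengths and the batch size of 4 are hypothetical.

```python
MODEL_MAX_LENGTH = 512  # assumed maximum sequence length for DistilBERT

# Hypothetical token counts for eight texts
lengths = [23, 41, 35, 18, 52, 29, 44, 31]
batch_size = 4

# Strategy 1: pad every sequence to the model maximum
static_tokens = MODEL_MAX_LENGTH * len(lengths)

# Strategy 2: pad each batch to its own longest sequence
batches = [lengths[i:i + batch_size] for i in range(0, len(lengths), batch_size)]
dynamic_tokens = sum(max(batch) * len(batch) for batch in batches)

print("tokens processed with max_length padding:", static_tokens)
print("tokens processed with per-batch padding:", dynamic_tokens)
```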

In [None]:
train_test = tokenized_datasets.train_test_split(test_size=0.2)
train_dataset = train_test["train"]
val_dataset = train_test["test"]


In [None]:
from transformers import TrainingArguments
from transformers import Trainer, DataCollatorWithPadding
from transformers import AutoModelForSequenceClassification

# Load model with num_labels set to 3 (Biased, No agreement, Non-biased)
model = AutoModelForSequenceClassification.from_pretrained("distilbert-base-uncased", num_labels=3)

training_args = TrainingArguments(
    output_dir="./results",
    evaluation_strategy="epoch",  # named eval_strategy in recent transformers releases
    save_strategy="epoch",
    logging_dir="./logs",
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    num_train_epochs=3,
    weight_decay=0.01,
    save_total_limit=2,
    load_best_model_at_end=True
)

data_collator = DataCollatorWithPadding(tokenizer=tokenizer)
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=val_dataset,
    tokenizer=tokenizer,
    data_collator=data_collator
)



In [23]:
trainer.train()

Epoch,Training Loss,Validation Loss
1,No log,0.796735
2,No log,0.826373
3,No log,0.922407


TrainOutput(global_step=255, training_loss=0.5093984566482843, metrics={'train_runtime': 4627.8624, 'train_samples_per_second': 0.882, 'train_steps_per_second': 0.055, 'total_flos': 623137755856896.0, 'train_loss': 0.5093984566482843, 'epoch': 3.0})
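
The numbers in this output can be checked by hand. With 1,700 rows and a 20% validation split, 1,360 examples remain for training; at a batch size of 16 that is 85 steps per epoch and 255 steps over three epochs, matching `global_step=255`. The training loss shows "No log" because `TrainingArguments` logs every 500 steps by default, which this run never reaches. Validation loss is lowest after the first epoch and rises afterwards, which suggests the model starts to overfit; with `load_best_model_at_end=True`, the first-epoch checkpoint should be the one reloaded. The sketch below reproduces the arithmetic, assuming a single device and no gradient accumulation.

```python
import math

n_rows = 1700      # from the tokenisation progress bar
test_size = 0.2    # validation fraction used in train_test_split
batch_size = 16    # per_device_train_batch_size
epochs = 3         # num_train_epochs

n_val = round(n_rows * test_size)
n_train = n_rows - n_val
steps_per_epoch = math.ceil(n_train / batch_size)
total_steps = steps_per_epoch * epochs

# Validation loss per epoch, copied from the training table
val_loss = {1: 0.796735, 2: 0.826373, 3: 0.922407}
best_epoch = min(val_loss, key=val_loss.get)

print(n_train, steps_per_epoch, total_steps, best_epoch)
```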

In [24]:
model.save_pretrained("sentiment-distilbert")
tokenizer.save_pretrained("sentiment-distilbert")


('sentiment-distilbert\\tokenizer_config.json',
 'sentiment-distilbert\\special_tokens_map.json',
 'sentiment-distilbert\\vocab.txt',
 'sentiment-distilbert\\added_tokens.json',
 'sentiment-distilbert\\tokenizer.json')

In [25]:
from transformers import pipeline

classifier = pipeline("text-classification", model="sentiment-distilbert", tokenizer=tokenizer)

text = df['text'][0]
prediction = classifier(text)

print(prediction)  


Device set to use cpu


[{'label': 'LABEL_0', 'score': 0.463336706161499}]
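
The pipeline reports `LABEL_0` because the saved model config carries no human-readable class names. Given the mapping applied before training (`Biased` to 0, `No agreement` to 1, `Non-biased` to 2), the output can be translated back afterwards. The sketch below does this for the prediction shown above; the helper name `readable` is hypothetical. An alternative is to pass `id2label` and `label2id` to `from_pretrained` when loading the model, so that the names are stored in the config.

```python
label_mapping = {"Biased": 0, "No agreement": 1, "Non-biased": 2}
id2label = {idx: name for name, idx in label_mapping.items()}

def readable(pred):
    """Replace a generic 'LABEL_<id>' with the class name it stands for."""
    class_id = int(pred["label"].split("_")[-1])
    return {"label": id2label[class_id], "score": pred["score"]}

# Prediction copied from the pipeline output above
prediction = [{"label": "LABEL_0", "score": 0.463336706161499}]
print([readable(p) for p in prediction])
```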
