# I Sentiment Orientation Identification

## Table of Contents

1. [Loading Data and Libraries](#loading-dependencies)
2. [Loading the model](#model)
3. [Check for Token Overflow in commentBody](#Preprocess-Tokenize-Overflow)
4. [Sentiment analysis](#sentiment)
5. [Saving Results](#saving)

## Loading Data and Libraries 
<a class="anchor" id="loading-dependencies"></a>

In [1]:
import pandas as pd
import numpy as np
from tqdm import tqdm

import torch
from scipy.special import softmax
from transformers import AutoModelForSequenceClassification, AutoTokenizer, AutoConfig

df_c = pd.read_parquet('Comments.parquet')

## Loading the model 
<a class="anchor" id="model"></a>

Using the pretrained model `twitter-xlm-roberta-base-sentiment` from the Cardiff NLP group at Cardiff University.  
According to the model card, the underlying XLM-RoBERTa model was trained on roughly 198M tweets in over thirty languages and then fine-tuned for sentiment analysis.  
https://huggingface.co/cardiffnlp/twitter-xlm-roberta-base-sentiment

In [2]:
MODEL = "cardiffnlp/twitter-xlm-roberta-base-sentiment"
tokenizer = AutoTokenizer.from_pretrained(MODEL, use_fast=True)
config = AutoConfig.from_pretrained(MODEL)
model = AutoModelForSequenceClassification.from_pretrained(MODEL)
#tokenizer.save_pretrained(MODEL)
#model.save_pretrained(MODEL)

## Check for Token Overflow in commentBody
<a class="anchor" id="Preprocess-Tokenize-Overflow"></a>

1. **Preprocessing**  
Replace mentioned usernames with `@user` and links with `http`, so that specific account names and URLs, which might carry sentiment connotations, do not influence the predicted sentiment of the message.

2. **Tokenizing**  
Tokenize the comment text using the tokenizer included with the model.

3. **Checking for Token Overflow**  
The cardiffnlp/twitter-xlm-roberta-base-sentiment model accepts at most 512 tokens per input, so the token count is checked on a random sample of the dataset (n=100,000).

In [3]:
def preprocess(text):
    """
    Preprocess the comment by replacing usernames, links and newline characters.
    
    Parameters:
    text (str): Unprocessed comment body.
    
    Returns:
    str: Cleaned comment Body
    """
    text = text.replace("\n", " ")  # use a space so words on adjacent lines are not fused together
    new_text = []
    for t in text.split(" "):
        t = '@user' if t.startswith('@') and len(t) > 1 else t
        t = 'http' if t.startswith('http') else t
        new_text.append(t)
    return " ".join(new_text)
    
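def tokenize_text(text):
    # max_length=1024 exceeds the model limit of 512, so comments longer than
    # 512 tokens are still counted in full (up to 1024 tokens).
    encoded_input = tokenizer(text, return_tensors='pt', truncation=True, max_length=1024, padding=True).to('cuda')
    return encoded_input

def count_tokens(tokenized_input):
    # input_ids has shape (1, sequence_length); the second dimension is the token count.
    return tokenized_input['input_ids'].shape[1]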
    

tqdm.pandas(desc="Preprocess comment sample")
df_c = df_c[df_c['commentBody'].notna()].copy()
df_s = df_c.sample(100_000)
df_s['commentBody_preprocessed'] = df_s['commentBody'].progress_apply(preprocess)

tqdm.pandas(desc="Tokenize comment sample")
df_s['tokenized_input_1024'] = df_s['commentBody_preprocessed'].progress_apply(tokenize_text)
df_s['num_tokens'] = df_s['tokenized_input_1024'].apply(count_tokens)

print(f"Average number of tokens: {df_s['num_tokens'].mean()}")
print(f"Median number of tokens:  {df_s['num_tokens'].median()}")
print(f"Minimum number of tokens: {df_s['num_tokens'].min()}")
print(f"Maximum number of tokens: {df_s['num_tokens'].max()}")
print(f"Number of tokens > 512:   {(df_s['num_tokens'] > 512).sum()}")

Preprocess comment sample: 100%|████████████████████████████████████████████| 100000/100000 [00:02<00:00, 36361.03it/s]
Tokenize comment sample: 100%|███████████████████████████████████████████████| 100000/100000 [01:26<00:00, 1154.89it/s]


Average number of tokens: 95.42051
Median number of tokens:  70.0
Minimum number of tokens: 3
Maximum number of tokens: 786
Number of tokens > 512:   17
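
To put the overflow count in proportion, the short calculation below expresses it as a share of the sample. It uses only the two figures printed above (17 of 100,000 sampled comments exceed 512 tokens); the variable names are illustrative and not part of the notebook.

```python
# Figures copied from the cell output above.
sample_size = 100_000
n_over_limit = 17

# Share of sampled comments whose token count exceeds the 512-token limit.
share_over_limit = n_over_limit / sample_size
print(f"Share of sampled comments above 512 tokens: {share_over_limit:.3%}")
```

In this sample, only these comments would lose text when the inputs are truncated to 512 tokens during inference.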


## Sentiment analysis
<a class="anchor" id="sentiment"></a>

To reduce computation time, model inference runs on the GPU.  
How long the sentiment analysis of the full dataset takes depends strongly on the available hardware.  
Efforts to parallelize inference with the cardiffnlp/twitter-xlm-roberta-base-sentiment model were unsuccessful.  
Processing the comments sequentially already keeps my GPU (RTX 3060) at 100% capacity while achieving only 60 analyses per second.
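
One commonly used alternative to per-comment calls is batching, where several comments go through the tokenizer and the model in a single call. The sketch below shows only the batching scaffold, under stated assumptions: `iter_batches`, `score_in_batches` and `dummy_score_batch` are hypothetical names, and the dummy scorer stands in for a model wrapper that is not shown. Whether batching improves throughput on this hardware is untested here.

```python
from typing import Callable, Iterator, List, Sequence


def iter_batches(texts: Sequence[str], batch_size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of `texts` with at most `batch_size` items each."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(texts), batch_size):
        yield list(texts[start:start + batch_size])


def score_in_batches(
    texts: Sequence[str],
    score_batch: Callable[[List[str]], List[float]],
    batch_size: int = 32,
) -> List[float]:
    """Apply `score_batch` to each batch and return one score per text, in order."""
    scores: List[float] = []
    for batch in iter_batches(texts, batch_size):
        batch_scores = score_batch(batch)
        if len(batch_scores) != len(batch):
            raise ValueError("score_batch must return one score per input text")
        scores.extend(batch_scores)
    return scores


def dummy_score_batch(batch: List[str]) -> List[float]:
    # Placeholder scorer (assumption): returns the text length instead of a polarity.
    # A real wrapper would pass the whole list to the tokenizer with padding=True,
    # truncation=True and max_length=512, then read one row of logits per comment.
    return [float(len(text)) for text in batch]


comments = ["good", "bad", "fine", "great", "meh"]
print(score_in_batches(comments, dummy_score_batch, batch_size=2))
```

Sorting comments by length before batching is a common refinement, since padding then wastes less computation, but the original order has to be restored afterwards.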

In [8]:
def sentiment_analysis(row, tokenizer, model):
    """
    Perform sentiment analysis on a single comment.
    
    Parameters:
    row (str): Preprocessed comment body.
    tokenizer: The tokenizer from the pretrained sentiment model.
    model: The pre-trained sentiment model.
    
    Returns:
    float: The polarity score.
    """
    preprocessed_text = row
    encoded_input = tokenizer(preprocessed_text, return_tensors='pt', truncation=True, max_length=512, padding=True).to('cuda')
    model = model.to('cuda')  # no-op once the model is already on the GPU
    with torch.no_grad():  # inference only, so no gradients are tracked
        output = model(**encoded_input)

    # Class probabilities; label order is 0 = negative, 1 = neutral, 2 = positive
    scores = softmax(output.logits[0].cpu().numpy())
    sentiment_scores = {'positive': scores[2], 'neutral': scores[1], 'negative': scores[0]}

    # The weighted sum equals P(positive) - P(negative); tanh then squashes it
    weights = {'positive': 1, 'neutral': 0, 'negative': -1}
    weighted_sum = sum(sentiment_scores[sentiment] * weights[sentiment] for sentiment in sentiment_scores)
    polarity_score = np.tanh(weighted_sum)

    return polarity_score

tqdm.pandas(desc="Preprocess comment sample")
df_c = df_c[df_c['commentBody'].notna()].copy().sample(1000)
df_c['commentBody_preprocessed'] = df_c['commentBody'].progress_apply(preprocess)

tqdm.pandas(desc="Sentiment Analysis")
df_c['polarity_scores'] = df_c['commentBody_preprocessed'].progress_apply(sentiment_analysis, args=(tokenizer, model))

Preprocess comment sample: 100%|████████████████████████████████████████████████| 1000/1000 [00:00<00:00, 45455.87it/s]
Sentiment Analysis: 100%|██████████████████████████████████████████████████████████| 1000/1000 [00:27<00:00, 36.70it/s]
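
The polarity computation in `sentiment_analysis` can be traced by hand on a single set of logits. The sketch below reimplements the softmax, weighted sum and `tanh` steps with the standard library only; the example logits are made up for illustration, and the helper names `softmax_list` and `polarity` are hypothetical.

```python
import math


def softmax_list(logits):
    """Numerically stable softmax over a plain list of floats."""
    highest = max(logits)
    exps = [math.exp(value - highest) for value in logits]
    total = sum(exps)
    return [value / total for value in exps]


def polarity(probs):
    """Polarity for probabilities ordered as (negative, neutral, positive)."""
    negative, _neutral, positive = probs
    # Weights +1, 0 and -1 reduce the weighted sum to P(positive) - P(negative).
    return math.tanh(positive - negative)


example_logits = [-1.0, 0.5, 2.0]  # made-up values: negative, neutral, positive
probs = softmax_list(example_logits)
print("probabilities:", probs)
print("polarity:", polarity(probs))

# Extreme cases: all probability mass on one class.
print("upper bound:", polarity([0.0, 0.0, 1.0]))  # tanh(1)
print("lower bound:", polarity([1.0, 0.0, 0.0]))  # tanh(-1)
```

Because the weighted sum always lies in [-1, 1], the `tanh` step limits the polarity score to roughly ±0.76 rather than ±1, which is worth keeping in mind when interpreting or thresholding the 'polarity_scores' column.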


## Saving Results
<a class="anchor" id="saving"></a>

The fully processed and analyzed dataset is saved as a parquet file containing the original comments table plus the new column 'polarity_scores'.

In [None]:
df_c.to_parquet('sentiments.parquet')  # write the comments table including 'polarity_scores'