## Test - Generative solutions _____
### Lucas Henrique Marchiori


---



This code is defined in two parts, one using pre-trained LLM's, such as BERT, and the second part using Google Gemini's API

The first part was based on code I wrote for a Machine Learning 2 course at the Federal University of São Carlos, and on an undergraduate research project I did at the same university.

# Training the BERT model

In [None]:
# installing the necessary libraries
%%capture
!pip install -q opendatasets
!pip install ydata_profiling
!pip install sentence_transformers

In [None]:
# Importing libraries to be used
import torch
import warnings
import numpy as np
import opendatasets as od
import pandas as pd
import plotly.express as px
import plotly.graph_objects as go
import torch.nn as nn

from ydata_profiling import ProfileReport
from tqdm import tqdm

from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report, accuracy_score

from torch.utils.data import TensorDataset
from torch.utils.data import DataLoader, RandomSampler, SequentialSampler

#from transformers import BertModel, BertTokenizer, BertForSequenceClassification
#from transformers import Adafactor, AdamW, get_linear_schedule_with_warmup
warnings.filterwarnings('ignore')

# setting pytorch to use googlecolab GPU, if not available use CPU
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")


In [None]:
## Downloading the dataset available on kaggle, instructions for generating user and key to use the API: http://bit.ly/kaggle-creds
# In this case we are using posts on social networks, which are texts to be able to classify their sentiment
od.download('https://www.kaggle.com/datasets/kashishparmar02/social-media-sentiments-analysis-dataset/data') # insert your kaggle  username and key
df = pd.read_csv('/content/social-media-sentiments-analysis-dataset/sentimentdataset.csv')


Skipping, found downloaded files in "./social-media-sentiments-analysis-dataset" (use force=True to force download)


In [None]:
df.head()

Unnamed: 0.2,Unnamed: 0.1,Unnamed: 0,Text,Sentiment,Timestamp,User,Platform,Hashtags,Retweets,Likes,Country,Year,Month,Day,Hour
0,0,0,Enjoying a beautiful day at the park! ...,Positive,2023-01-15 12:30:00,User123,Twitter,#Nature #Park,15.0,30.0,USA,2023,1,15,12
1,1,1,Traffic was terrible this morning. ...,Negative,2023-01-15 08:45:00,CommuterX,Twitter,#Traffic #Morning,5.0,10.0,Canada,2023,1,15,8
2,2,2,Just finished an amazing workout! 💪 ...,Positive,2023-01-15 15:45:00,FitnessFan,Instagram,#Fitness #Workout,20.0,40.0,USA,2023,1,15,15
3,3,3,Excited about the upcoming weekend getaway! ...,Positive,2023-01-15 18:20:00,AdventureX,Facebook,#Travel #Adventure,8.0,15.0,UK,2023,1,15,18
4,4,4,Trying out a new recipe for dinner tonight. ...,Neutral,2023-01-15 19:55:00,ChefCook,Instagram,#Cooking #Food,12.0,25.0,Australia,2023,1,15,19


In [None]:
df.drop(columns=['Unnamed: 0.1', 'Unnamed: 0'], inplace=True) # excluindo colunas desnecessarias
df['Sentiment'] = df['Sentiment'].str.strip() # Como nosso target é texto, pode ser q eles tenham espaços a mais, o strip remove todo espaço antes e após a string

In [None]:
# Generates a Report of our dataframe, in order to analyze correlations, missing data, etc...
ProfileReport(df)

Summarize dataset:   0%|          | 0/5 [00:00<?, ?it/s]

Generate report structure:   0%|          | 0/1 [00:00<?, ?it/s]

Render HTML:   0%|          | 0/1 [00:00<?, ?it/s]



From the report above, you can see that this dataset is already well structured, there are no columns with null values, and the only correlations found were between likes and retweets, i.e. generally those who like a tweet also retweet it.

another piece of information we noticed is that the words that are repeated the most in the text are stop-words (the, a, off, on, etc...) and the sentiments that are repeated the most are joy and positive.

In [None]:
# As we have many classes, this can have a negative impact on our solution, ranging from high consumption of computing resources (RAM and VRAM)
# Even overfitting, so sentiments that appear less than 3 times will be classified as 'others'.

not_important = df['Sentiment'].value_counts()[df['Sentiment'].value_counts() < 3].index
df.loc[df['Sentiment'].isin(not_important), 'Sentiment'] = 'Others'


In [None]:
# Here we realize that there are 84 different feelings, and 137 texts classified as others
df['Sentiment'].value_counts()

Sentiment
Others            137
Positive           45
Joy                44
Excitement         37
Contentment        19
                 ... 
Satisfaction        3
Accomplishment      3
Harmony             3
Creativity          3
Wonder              3
Name: count, Length: 84, dtype: int64

In [None]:
# below we will make a text target -> int relationship, in order to facilitate the implementation of machine learning
class_to_int = {}
for i, sentiment in enumerate(df['Sentiment'].unique()):
    class_to_int[sentiment] = int(i)
class_to_int

{'Positive': 0,
 'Negative': 1,
 'Neutral': 2,
 'Others': 3,
 'Disgust': 4,
 'Happiness': 5,
 'Joy': 6,
 'Love': 7,
 'Amusement': 8,
 'Admiration': 9,
 'Awe': 10,
 'Surprise': 11,
 'Acceptance': 12,
 'Anticipation': 13,
 'Bitter': 14,
 'Calmness': 15,
 'Confusion': 16,
 'Excitement': 17,
 'Kind': 18,
 'Pride': 19,
 'Shame': 20,
 'Elation': 21,
 'Euphoria': 22,
 'Contentment': 23,
 'Serenity': 24,
 'Gratitude': 25,
 'Hope': 26,
 'Empowerment': 27,
 'Compassion': 28,
 'Tenderness': 29,
 'Arousal': 30,
 'Enthusiasm': 31,
 'Fulfillment': 32,
 'Reverence': 33,
 'Despair': 34,
 'Grief': 35,
 'Loneliness': 36,
 'Jealousy': 37,
 'Resentment': 38,
 'Frustration': 39,
 'Boredom': 40,
 'Regret': 41,
 'Curiosity': 42,
 'Indifference': 43,
 'Numbness': 44,
 'Melancholy': 45,
 'Nostalgia': 46,
 'Ambivalence': 47,
 'Determination': 48,
 'Hopeful': 49,
 'Proud': 50,
 'Grateful': 51,
 'Empathetic': 52,
 'Compassionate': 53,
 'Playful': 54,
 'Free-spirited': 55,
 'Inspired': 56,
 'Confident': 57,
 'Bitt

In [None]:
df['Label'] = df['Sentiment'].replace(class_to_int) # criando uma coluna chamada label com os int referente a cada sentimento


In the codes below we'll have some graphical visualizations just for the sake of curiosity of the post patterns, and the amount of sentiment in this dataset

In [None]:
fig = px.bar(x=df['Sentiment'].unique(), y=df['Sentiment'].value_counts(sort=False),
             color= df['Sentiment'].unique(),
             labels={'y':'Frequência', 'x':'Sentimento'})
fig.show()

In [None]:
text_size = df['Text'].apply(len)
fig = px.histogram(x=text_size)
fig.update_layout(bargap=0.2)
fig.show()

In [None]:
fig = px.violin(df, x='Year', y='Platform', box=True, points="all", color='Platform',
                title='Distribuição das Plataformas ao Longo dos Anos')

fig.show()


In [None]:
MODEL = f"bert-base-uncased"  # Defining our LLM model, which in this case is the uncased BERT, i.e. a BERT that doesn't distinguish between capital and small letters

# Below we are generating a tokenizer from BERT, tokenizer divides a text into smaller parts, called tokens, in this case the one from BERT
# it also adds special tokens like [SEP] indicating sentence separation [CLS] indicating the beginning of a sentence, and so on
tokenizer = BertTokenizer.from_pretrained(MODEL)

tokenizer_config.json:   0%|          | 0.00/48.0 [00:00<?, ?B/s]

vocab.txt:   0%|          | 0.00/232k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/466k [00:00<?, ?B/s]

config.json:   0%|          | 0.00/570 [00:00<?, ?B/s]

In [None]:
df = df[['Text', 'Label']] # Creating a df with only the information needed for the prediction algorithm, the text and the label
train_df, test_df = train_test_split(df, test_size=0.25,  random_state=42) # Separating between training and test, with 25% of the samples being test


In [None]:
tokenized_train = tokenizer.batch_encode_plus(train_df.Text.values.tolist(), add_special_tokens = True, return_attention_mask = True, pad_to_max_length = True,
                                               max_length = 480, return_tensors = 'pt')
tokenized_test = tokenizer.batch_encode_plus(test_df.Text.values.tolist(), add_special_tokens = True, return_attention_mask = True, pad_to_max_length = True,
                                            max_length = 480, return_tensors = 'pt')

In [None]:
tokenized_train

{'input_ids': tensor([[  101,  2985,  2019,  ...,     0,     0,     0],
        [  101, 19882,  2007,  ...,     0,     0,     0],
        [  101,  2587,  1037,  ...,     0,     0,     0],
        ...,
        [  101,  2007, 26452,  ...,     0,     0,     0],
        [  101,  6832, 15575,  ...,     0,     0,     0],
        [  101,  3110,  8618,  ...,     0,     0,     0]]), 'token_type_ids': tensor([[0, 0, 0,  ..., 0, 0, 0],
        [0, 0, 0,  ..., 0, 0, 0],
        [0, 0, 0,  ..., 0, 0, 0],
        ...,
        [0, 0, 0,  ..., 0, 0, 0],
        [0, 0, 0,  ..., 0, 0, 0],
        [0, 0, 0,  ..., 0, 0, 0]]), 'attention_mask': tensor([[1, 1, 1,  ..., 0, 0, 0],
        [1, 1, 1,  ..., 0, 0, 0],
        [1, 1, 1,  ..., 0, 0, 0],
        ...,
        [1, 1, 1,  ..., 0, 0, 0],
        [1, 1, 1,  ..., 0, 0, 0],
        [1, 1, 1,  ..., 0, 0, 0]])}

Above, tensors were generated from the tokens generated by bert, where input_ids contains the IDs of the tokens in the input sequence after tokenization. Each integer represents a specific token.

token_type_ids indicates which segment (or sequence) each token belongs to

attention_mask indicates which parts of the input sequence are important (marked as 1) and which are just empty spaces (marked as 0), this is useful during training for the algorithm to know which parts can be ignored and which are important

In [None]:
input_ids_train = tokenized_train['input_ids']
attention_masks_train = tokenized_train['attention_mask']
labels_train = torch.tensor(train_df.Label.values)

input_ids_val = tokenized_test['input_ids']
attention_masks_val = tokenized_test['attention_mask']
labels_test = torch.tensor(test_df.Label.values.tolist())

In [None]:
labels_train # Here is an example of a tensor value for labels, above we are assigning the values generated by the tokenizer to different variables

tensor([ 6,  6, 42,  3,  5, 80,  2,  3, 23, 53, 17,  8,  9, 48,  3,  0, 47, 12,
        76, 81, 40, 69, 23,  6, 11, 22,  3,  6, 44, 41, 10, 34,  3,  3,  3, 53,
        22, 12, 82,  0, 34,  3,  3,  2,  2,  3, 46, 83, 16, 26,  3, 45, 17, 49,
        13,  0, 11, 43,  3, 35,  6,  6, 48,  3,  3,  1, 25,  0,  0,  0, 18, 38,
        67, 25,  2, 36, 17,  6, 35,  3, 17, 49, 34,  0,  3, 56, 39, 60,  3,  3,
        25, 54,  3,  2, 49,  0, 46, 17,  2, 25,  6, 12, 77, 14, 10, 74, 34, 52,
         3, 17, 65, 25, 20,  3, 49, 13,  6,  4, 55, 43,  3,  6, 29, 24, 80, 54,
        44,  0, 83, 31, 25, 17, 53, 28, 44,  3, 70, 49, 34, 10, 29,  3, 35,  2,
        19, 56,  3, 65, 59,  0, 47, 40, 83, 83, 19,  8, 66, 45, 12, 34, 17, 46,
         6, 37, 62,  3,  3, 80, 21, 23,  3,  6,  3,  3, 36,  2, 83, 71,  3,  3,
        79, 27, 26,  6, 36,  1,  3,  6, 60,  6,  0, 63, 24,  0,  3, 40,  0,  3,
        42, 36, 67, 51,  3, 26, 25,  6, 42, 57, 25, 26,  3,  3, 22,  6, 82, 47,
        49,  0,  6, 59, 55,  3, 81, 35, 

In [None]:
dataset_train = TensorDataset(input_ids_train, attention_masks_train, labels_train)
dataset_test = TensorDataset(input_ids_val, attention_masks_val, labels_test)

In [None]:
dataloader_train = DataLoader(dataset_train, sampler=RandomSampler(dataset_train), batch_size=6)
dataloader_test = DataLoader(dataset_test, sampler=SequentialSampler(dataset_test), batch_size=6)

#Dataloader is merging training and test data every 6 minutes to facilitate training and iteration

In [None]:
class SentimentClassifier(nn.Module):

    """
        applies a dropout layer and a linear layer to perform sentiment classification, using a pre-trained BERT model
    """

    def __init__(self, labels):
        """
        Parameters:
            labels (int): The number of sentiment classes to be classified.
        """
        super(SentimentClassifier, self).__init__()
        self.bert = BertModel.from_pretrained(MODEL)
        self.drop = nn.Dropout(p=0.2)
        self.out = nn.Linear(self.bert.config.hidden_size, labels)

    def forward(self, input_ids, attention_mask):
        """
        Defines the data flow during the forward pass of the model.

        Parameters:
            input_ids (torch.Tensor): IDs of the input tokens.
            attention_mask (torch.Tensor): Binary attention mask.

        output:
            torch.Tensor: Outputs of the classification .
        """
        _, pooled_output = self.bert(input_ids=input_ids, attention_mask=attention_mask, return_dict=False)
        output = self.drop(pooled_output)
        return self.out(output)

model = SentimentClassifier(84)
model = model.to(device)

model.safetensors:   0%|          | 0.00/440M [00:00<?, ?B/s]

In [None]:
# documentation for changing parameters or optimizer: https://huggingface.co/docs/transformers/main_classes/optimizer_schedules
EPOCHS = 10
total_steps = len(dataloader_train) * EPOCHS
optimizer = AdamW(model.parameters(), lr = 2e-5, correct_bias=False) # AdamW optimizer with a learning rate of 2e-5
scheduler = get_linear_schedule_with_warmup(optimizer, num_warmup_steps=0, num_training_steps=total_steps)

In [1]:
def eval_model(model, dataloader, device, criterion, epoch=0):
    """
    Evaluates the model on a validation data set.

    Parameters:
        model (torch.nn.Module): model .
        dataloader (torch.utils.data.DataLoader): Validation dataLoader.
        device (torch.device): device (GPU or CPU).
        criterion (torch.nn.Module): Loss function
        epoch (int): Current epoch

    Returns:
        Tuple: A tuple containing the predictions of the model and the actual values of the labels.
    """
    model = model.train()
    correct_pred = 0
    losses = []
    predictions = []
    true_values = []

    data_loader = tqdm(dataloader, desc=f"Epoch {epoch}", leave=False, disable=False) #tqdm shows progress during loop iterations

    for d in dataloader:

        input_ids = d[0].to(device)
        attention_mask = d[1].to(device)
        targets = d[2].to(device)

        outputs = model(input_ids=input_ids,
                        attention_mask=attention_mask) # passing the data through the model to make predictions
        _, preds = torch.max(outputs, dim=1) # geting only the forecasts
        correct_pred += torch.sum(preds == targets) #  Summing only the correct ones

        loss = criterion(outputs, targets) # calculates loss function
        losses.append(loss.item())

        # Backpropagation and updating of model parameters
        loss.backward()
        nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
        optimizer.step()
        scheduler.step()
        optimizer.zero_grad()

        #updates the tqpm information to be displayed
        data_loader.set_postfix(loss=np.mean(losses))

        #Stores forecasts and true values for later analysis
        predictions.append(preds.cpu())
        true_values.append(targets.cpu())


    predictions = np.concatenate(predictions, axis=0)
    true_values = np.concatenate(true_values, axis=0)

    # Returns predictions and true values
    accuracy = correct_pred.item() / len(dataloader)

    return (predictions, true_values)

In [None]:
def train_model(model, dataloader, optimizer, device, scheduler, criterion, epoch=0):
    """
    Trains the model on a set of training data.

    Parameters:
        model (torch.nn.Module): model.
        dataloader (torch.utils.data.DataLoader): Training dataLoader.
        optimizer (torch.optim.Optimizer): optimizer for updating the model parameters.
        device (torch.device): device (GPU or CPU).
        scheduler (torch.optim.lr_scheduler._LRScheduler): learning rate scheduler.
        criterion (torch.nn.Module): Loss function.
        epoch (int): current epoch
    """

    model = model.train() # configures model for training

    losses = []

    data_loader = tqdm(dataloader, desc=f"Epoch {epoch}", leave=False, disable=False)  #tqdm shows progress during loop iterations

    for d in dataloader:
        input_ids = d[0].to(device)
        attention_mask = d[1].to(device)
        targets = d[2].to(device)

        outputs = model(input_ids=input_ids,
                        attention_mask=attention_mask) #  passing the data through the model to make predictions



        loss = criterion(outputs, targets) # Calculating the loss function
        losses.append(loss.item())

        #Backpropagation and updating of model parameters
        loss.backward()
        nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
        optimizer.step()
        scheduler.step()
        optimizer.zero_grad()

        data_loader.set_postfix({'training_loss': '{:.3f}'.format(np.mean(losses))})



In [None]:
criterion = nn.CrossEntropyLoss().to(device)

best = 0
for epoch in range(EPOCHS):
  train_model(model, dataloader_train, optimizer, device, scheduler, criterion, epoch+1) # Train the model on the training data set
  prediction, true_values = eval_model(model, dataloader_test, device, criterion, epoch +1)# Evaluates the model on the validation data set and obtains the predictions
  accuracy = accuracy_score(prediction, true_values) # Calculates the accuracy of the predictions
  if accuracy > best:
    torch.save(model.state_dict(), 'best_model_others.bin') # Saves the model with the best accuracy
    best = accuracy
print(f'A melhor accuracia foi: {best}')

                                                           

A melhor accuracia foi: 0.7595628415300546




In [None]:
model.load_state_dict(torch.load('best_model_others.bin')) # Loads the model with the best accuracy
model = model.to(device)
prediction, true_values = eval_model(model, dataloader_test, device, criterion) # Evaluates the model on the validation data set and obtains the predictions



In [None]:
print(classification_report(prediction,true_values))

              precision    recall  f1-score   support

           0       1.00      0.93      0.96        14
           2       1.00      1.00      1.00         1
           3       1.00      0.84      0.91        37
           4       0.67      1.00      0.80         2
           5       0.00      0.00      0.00         0
           6       1.00      0.85      0.92        13
           7       0.00      0.00      0.00         0
           9       0.00      0.00      0.00         0
          10       1.00      1.00      1.00         2
          11       0.00      0.00      0.00         0
          12       1.00      1.00      1.00         2
          13       0.00      0.00      0.00         0
          14       0.00      0.00      0.00         0
          15       1.00      1.00      1.00         2
          16       1.00      1.00      1.00         4
          17       1.00      0.80      0.89        10
          18       1.00      1.00      1.00         1
          20       0.00    

## Conclusions:

Firstly, we have an overall accuracy of 76%, which is considered good accuracy, it is above the 50% that would be the “random” choice, but it is noticeable that some labels had their scores zeroed and this is because their occurrence in the dataset can be very low(3) and when comparing with other data that occurs a lot, the neural network was unable to draw a pattern to identify these little repeated feelings.

Points for improvement in this code are the tuning of the pre-trained model and the choice of model, as in this case a generalist BERT was chosen, so an opportunity for improvement for future work is to see if there is a pre-trained model for the sentiment classification activity, which could improve the results. Another point for improvement is tuning, because apart from the choice of activation function, no choice of hyperparameters was made to improve the algorithm's performance, so future opportunities include finding the hyperparameters that obtain the highest score, and for this we can use Hyper-Parameter Optimization (HPO) techniques, which are Active Learning techniques.

In conclusion, despite the difficulty in training the model itself, where it was necessary to create 2 functions and 1 class that do not exist natively in Python libraries, and requires a certain amount of programming knowledge, it performs well and could perform even better if its hyperparameters are adjusted.

# Using Google Gemini API

In [None]:
%%capture
!pip install google-generativeai

recreating the dataset

In [None]:
df = pd.read_csv('/content/social-media-sentiments-analysis-dataset/sentimentdataset.csv')
df.drop(columns=['Unnamed: 0.1', 'Unnamed: 0'], inplace=True) #Deleting unnecessary columns
df['Sentiment'] = df['Sentiment'].str.strip()
not_important = df['Sentiment'].value_counts()[df['Sentiment'].value_counts() < 3].index
df.loc[df['Sentiment'].isin(not_important), 'Sentiment'] = 'Others'


In [None]:
# Creating a String with all the sentiments, excluding those that appear less than 3 times, to serve as input to the API
sentimentos =  df['Sentiment'].unique().tolist()
sentimentos = ', '.join(sentimentos)
sentimentos

'Positive, Negative, Neutral, Others, Disgust, Happiness, Joy, Love, Amusement, Admiration, Awe, Surprise, Acceptance, Anticipation, Bitter, Calmness, Confusion, Excitement, Kind, Pride, Shame, Elation, Euphoria, Contentment, Serenity, Gratitude, Hope, Empowerment, Compassion, Tenderness, Arousal, Enthusiasm, Fulfillment, Reverence, Despair, Grief, Loneliness, Jealousy, Resentment, Frustration, Boredom, Regret, Curiosity, Indifference, Numbness, Melancholy, Nostalgia, Ambivalence, Determination, Hopeful, Proud, Grateful, Empathetic, Compassionate, Playful, Free-spirited, Inspired, Confident, Bitterness, Fearful, Overwhelmed, Jealous, Devastated, Frustrated, Envious, Dismissive, Thrill, Inspiration, Satisfaction, Reflection, Accomplishment, Enchantment, Harmony, Creativity, Wonder, Adventure, Heartbreak, Betrayal, Desolation, Embarrassed, Sad, Hate, Bad, Happy'

In [None]:
import google.generativeai as genai
genai.configure(api_key=MY_KEY) # Insert your API key here
generation_config = {"temperature": 0}

model = genai.GenerativeModel(model_name="gemini-1.5-pro-latest",
                              generation_config=generation_config)


In [None]:
answers = []
convo = model.start_chat(history=[])
convo.send_message(f"You are a Sentiment Analysis system and you are restricted to talk only about the sentiment of the text. Do not talk about anything but the sentiments text appears to have, and the feelings must necessarily be only one of these:{sentimentos} choose just one from this list, always")
for text in df['Text']:
    convo.send_message(text.strip())
    answers.append(convo.last.text)


ResourceExhausted: 429 Resource has been exhausted (e.g. check quota).

In [None]:
answers

['Happiness \n', 'Frustration \n']

## Conclusão

Generating via API when you have a small dataset (in this case) can be more advantageous because it consumes fewer computing resources and is also easier to use, since in just a few lines of code you can use LLMs that have already been implemented and generate satisfactory results.
However, as a disadvantage, there is a monetary cost, as can be seen from this code, it was only possible to generate 2 responses and the free quota has already been used up, so it is necessary to pay to generate more requests to the API, and in a more serious case, when trying to use the OpenAI(GPT) API, it was not possible to generate any responses without first investing financially.
Another disadvantage is that for very specific data, the LLMs already implemented tend not to perform very well, as they are more generalist, so the option of training the model itself, using BERT or other pre-trained models, becomes a little more viable, especially when you have a lot of specific data.

Another point to highlight, and a recommendation for future work, is the use of NLP techniques to identify the structure of the post, and from this structure infer the sentiment and guide the LLM using prompt engineering with this information, thus avoiding the LLM suffering from hallucinations and straying too far from the answer. It is also recommended to study and alter hyperparameters, here in this work we have only touched on "temperature", which by setting it to 0 tells the LLM that it has to give an exact answer, without being creative, thus preventing it from inventing information that it doesn't need or that it doesn't have.