### Let's start the training

In [1]:
import pandas as pd
from sklearn.model_selection import train_test_split

In [2]:
train_df = pd.read_json('C:/Codes/Siena_AI/Synthetic_Data/Training_Dataset.json')
val_df = pd.read_json('C:/Codes/Siena_AI/Synthetic_Data/Validation_Dataset.json')

In [3]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("IDEA-CCNL/Erlangshen-Roberta-110M-Sentiment")

# Function to tokenize and encode the data
def preprocess_data(data):
    return tokenizer(
        data['Sentence'].tolist(),
        padding=True,
        truncation=True,
        max_length=384,
        return_tensors="pt"
    )

# Preprocess both train and validation data
train_encodings = preprocess_data(train_df)
val_encodings = preprocess_data(val_df)

In [4]:
from transformers import AutoModelForSequenceClassification

# Define the number of unique labels
num_labels = len(train_df['Sentiment'].unique())

model = AutoModelForSequenceClassification.from_pretrained(
    "IDEA-CCNL/Erlangshen-Roberta-110M-Sentiment",
    num_labels=num_labels,
    ignore_mismatched_sizes=True
)

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at IDEA-CCNL/Erlangshen-Roberta-110M-Sentiment and are newly initialized because the shapes did not match:
- classifier.weight: found shape torch.Size([2, 768]) in the checkpoint and torch.Size([5, 768]) in the model instantiated
- classifier.bias: found shape torch.Size([2]) in the checkpoint and torch.Size([5]) in the model instantiated
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [5]:
# As the warning above shows, the checkpoint was trained for binary classification,
# so we need to overwrite the label mappings in its configuration.
label_mappings = {
    0: "Strong Negative",
    1: "Mild Negative",
    2: "Neutral",
    3: "Mild Positive",
    4: "Strong Positive"
}

# Update the model's configuration to use custom id2label and label2id mappings
model.config.id2label = label_mappings
model.config.label2id = {v: k for k, v in label_mappings.items()}
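
The names in `id2label` are only meaningful if the integer codes fed to the model follow the same order. The sketch below contrasts codes assigned by order of first appearance (which is how pandas `factorize()` numbers values) with codes taken from a fixed mapping. The label strings and the two tiny lists are hypothetical; the actual contents of the `Sentiment` column are not shown in this notebook.

```python
# Hypothetical label strings; the real values in the 'Sentiment' column may differ.
label2id = {
    "Strong Negative": 0,
    "Mild Negative": 1,
    "Neutral": 2,
    "Mild Positive": 3,
    "Strong Positive": 4,
}

def encode_by_first_appearance(labels):
    """Mimic factorize-style encoding: codes follow the order of first appearance."""
    seen = {}
    return [seen.setdefault(label, len(seen)) for label in labels]

def encode_with_mapping(labels, mapping):
    """Encode labels with a fixed mapping, independent of row order."""
    return [mapping[label] for label in labels]

# Made-up splits whose rows start with different classes
train = ["Neutral", "Strong Positive", "Mild Negative"]
val = ["Strong Positive", "Neutral", "Mild Negative"]

train_fa = encode_by_first_appearance(train)  # "Neutral" becomes 0
val_fa = encode_by_first_appearance(val)      # "Neutral" becomes 1
train_fixed = encode_with_mapping(train, label2id)
val_fixed = encode_with_mapping(val, label2id)

print(train_fa, val_fa)        # [0, 1, 2] [0, 1, 2]
print(train_fixed, val_fixed)  # [2, 4, 1] [4, 2, 1]
```

With first-appearance codes, the same class can receive different ids in different splits, and neither need match `id2label`. A fixed mapping avoids both problems.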

In [6]:
# Let's take a look at the model configuration
model.config

BertConfig {
  "_attn_implementation_autoset": true,
  "_name_or_path": "IDEA-CCNL/Erlangshen-Roberta-110M-Sentiment",
  "architectures": [
    "BertForSequenceClassification"
  ],
  "attention_probs_dropout_prob": 0.1,
  "bos_token_id": 0,
  "classifier_dropout": null,
  "directionality": "bidi",
  "eos_token_id": 2,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "id2label": {
    "0": "Strong Negative",
    "1": "Mild Negative",
    "2": "Neutral",
    "3": "Mild Positive",
    "4": "Strong Positive"
  },
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "label2id": {
    "Mild Negative": 1,
    "Mild Positive": 3,
    "Neutral": 2,
    "Strong Negative": 0,
    "Strong Positive": 4
  },
  "layer_norm_eps": 1e-12,
  "max_position_embeddings": 512,
  "model_type": "bert",
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "output_past": true,
  "pad_token_id": 1,
  "pooler_fc_size": 768,
  "pooler_num_attention_heads": 12,
  "pooler_num_

In [7]:
import torch
from transformers import Trainer, TrainingArguments


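# Convert labels to numeric format. factorize() numbers classes in order of first
# appearance, so reuse the training uniques for validation to keep the codes aligned.
# Note: the codes must also follow the order used in label_mappings.
train_codes, label_uniques = train_df['Sentiment'].factorize()
train_labels = torch.tensor(train_codes)
val_labels = torch.tensor(label_uniques.get_indexer(val_df['Sentiment']))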

# Create a Dataset class to work with Trainer
class SentimentDataset(torch.utils.data.Dataset):
    def __init__(self, encodings, labels):
        self.encodings = encodings
        self.labels = labels

    def __getitem__(self, idx):
        item = {key: val[idx] for key, val in self.encodings.items()}
        item['labels'] = self.labels[idx]
        return item

    def __len__(self):
        return len(self.labels)

# Create datasets
train_dataset = SentimentDataset(train_encodings, train_labels)
val_dataset = SentimentDataset(val_encodings, val_labels)

# Set up training arguments
training_args = TrainingArguments(
    output_dir="./results",
    eval_strategy="epoch",
    learning_rate=2e-7,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    num_train_epochs=4,
    weight_decay=0.01
)

# Initialize Trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=val_dataset
)

# Train the model
trainer.train()


Epoch,Training Loss,Validation Loss
1,No log,1.658947
2,No log,1.637957
3,No log,1.625891
4,No log,1.622058


TrainOutput(global_step=300, training_loss=1.6486251831054688, metrics={'train_runtime': 1413.8335, 'train_samples_per_second': 3.395, 'train_steps_per_second': 0.212, 'total_flos': 241739793388800.0, 'train_loss': 1.6486251831054688, 'epoch': 4.0})

In [8]:
# Save model and tokenizer
model.save_pretrained("./fine_tuned_sentiment_model_rt1")
tokenizer.save_pretrained("./fine_tuned_sentiment_model_rt1")

# Evaluate the model on the validation dataset
trainer.evaluate()

{'eval_loss': 1.6220580339431763,
 'eval_runtime': 23.0043,
 'eval_samples_per_second': 13.041,
 'eval_steps_per_second': 0.826,
 'epoch': 4.0}
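
One way to read this loss is to compare it with the cross-entropy of a uniform guess over the five classes, which is ln(5). A quick check, using only the `eval_loss` value printed above:

```python
import math

num_classes = 5
# Cross-entropy of a model that assigns probability 1/5 to every class
chance_loss = math.log(num_classes)

# Value copied from the trainer.evaluate() output above
eval_loss = 1.6220580339431763

print(f"chance-level loss: {chance_loss:.4f}")
print(f"gap to chance:     {eval_loss - chance_loss:.4f}")
```

The validation loss sits slightly above the uniform-guess value, which suggests that after this run the model is not yet doing better than chance on the validation set.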

In [14]:
# Get predictions on the validation set
predictions = trainer.predict(val_dataset)

### Test Some Sample Text

In [9]:
# Load model
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("fine_tuned_sentiment_model_rt1")
model = AutoModelForSequenceClassification.from_pretrained("fine_tuned_sentiment_model_rt1")

In [10]:
from transformers import pipeline
classifier = pipeline("text-classification", model="fine_tuned_sentiment_model_rt1")

In [11]:
test_texts = [
    "The food was absolutely horrible. The flavors were completely unbalanced and the presentation was terrible. I wouldn\u2019t recommend this place to my worst enemy.",
    "While this restaurant had some good dishes, some of my friends ordered the same thing, which resulted in an overcrowded table. The dining area was quite small, which made us feel rushed.",
    "This place has been on my bucket list for so long! The food is always a highlight, and I love trying new dishes. I highly recommend it to anyone looking for a great dining experience.",
    "The food was okay, nothing special. It was neither bad nor great, just an average meal.",
]

In [12]:
for text in test_texts:
    result = classifier(text)
    print(f"Input: {text}")
    print(f"Predicted Label: {result[0]['label']},Score: {result[0]['score']:.4f}\n")

Input: The food was absolutely horrible. The flavors were completely unbalanced and the presentation was terrible. I wouldn’t recommend this place to my worst enemy.
Predicted Label: Mild Negative,Score: 0.2509

Input: While this restaurant had some good dishes, some of my friends ordered the same thing, which resulted in an overcrowded table. The dining area was quite small, which made us feel rushed.
Predicted Label: Mild Negative,Score: 0.2828

Input: This place has been on my bucket list for so long! The food is always a highlight, and I love trying new dishes. I highly recommend it to anyone looking for a great dining experience.
Predicted Label: Strong Negative,Score: 0.2551

Input: The food was okay, nothing special. It was neither bad nor great, just an average meal.
Predicted Label: Mild Negative,Score: 0.2697
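
The scores above are, as far as I understand the pipeline's default for a multi-class head, softmax probabilities over the five labels, so 0.2 is what a completely undecided model would output. The sketch below uses made-up, nearly flat logits (not taken from the model) to show how a top score in the 0.2 to 0.3 range arises.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical logits for the five classes; nearly flat on purpose
logits = [0.10, 0.22, 0.05, 0.12, 0.02]
probs = softmax(logits)

# Index of the most likely class and its probability
top = max(range(len(probs)), key=probs.__getitem__)
print(top, round(probs[top], 4))
```

A top score this close to 0.2 means the winning label is barely preferred over the others, which matches the unreliable predictions in the output above.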



### Micro, Macro Precision and Recall Scores

In [13]:
from sklearn.metrics import classification_report
import torch
from tqdm import tqdm

In [14]:
# Function to get predictions from the model
def get_predictions(model, dataset):
    model.eval()
    all_preds = []
    all_labels = []

    for batch in tqdm(dataset):
        inputs = {key: batch[key].unsqueeze(0).to(model.device) for key in batch if key != 'labels'}
        labels = batch['labels'].to(model.device)

        with torch.no_grad():
            outputs = model(**inputs)
            logits = outputs.logits

        predictions = torch.argmax(logits, dim=-1)

        # Convert labels and predictions to list format
        all_labels.append(labels.item())  # For single labels
        all_preds.append(predictions.item())  # For single predictions
    
    return all_labels, all_preds

In [15]:
# Get predictions and labels from the validation set
true_labels, predicted_labels = get_predictions(model, val_dataset)

# Compute classification metrics: precision, recall, f1, and support
report = classification_report(true_labels, predicted_labels, labels=[0,1,2,3,4], output_dict=True)

# Print out the classification report
print(report)

100%|████████████████████████████████████████████████████████████████████| 300/300 [00:29<00:00, 10.29it/s]

{'0': {'precision': 0.0851063829787234, 'recall': 0.06666666666666667, 'f1-score': 0.07476635514018691, 'support': 60.0}, '1': {'precision': 0.25766871165644173, 'recall': 0.7, 'f1-score': 0.37668161434977576, 'support': 60.0}, '2': {'precision': 0.25, 'recall': 0.08333333333333333, 'f1-score': 0.125, 'support': 60.0}, '3': {'precision': 0.26785714285714285, 'recall': 0.25, 'f1-score': 0.25862068965517243, 'support': 60.0}, '4': {'precision': 0.07142857142857142, 'recall': 0.016666666666666666, 'f1-score': 0.02702702702702703, 'support': 60.0}, 'accuracy': 0.22333333333333333, 'macro avg': {'precision': 0.18641216178417586, 'recall': 0.22333333333333333, 'f1-score': 0.17241913723443242, 'support': 300.0}, 'weighted avg': {'precision': 0.18641216178417586, 'recall': 0.22333333333333333, 'f1-score': 0.17241913723443245, 'support': 300.0}}
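
The report does not show how often each class was predicted, but this can be recovered from precision, recall, and support, since recall = TP / support and precision = TP / predicted. A small sketch using the numbers printed above:

```python
# (precision, recall) per class, copied from the report above; support is 60 per class
report_values = {
    0: (0.0851063829787234, 0.06666666666666667),
    1: (0.25766871165644173, 0.7),
    2: (0.25, 0.08333333333333333),
    3: (0.26785714285714285, 0.25),
    4: (0.07142857142857142, 0.016666666666666666),
}
support = 60

true_positives = {}
predicted_counts = {}
for label, (precision, recall) in report_values.items():
    tp = round(recall * support)                     # recall = TP / support
    true_positives[label] = tp
    predicted_counts[label] = round(tp / precision)  # precision = TP / predicted

accuracy = sum(true_positives.values()) / (support * len(report_values))
print(predicted_counts)
print(accuracy)
```

The recovered counts sum to 300, and more than half of all predictions fall on class 1, so the model is mostly guessing a single label at this stage.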





In [16]:
# Macro-averaged metrics are directly available in `report`
macro_precision = report['macro avg']['precision']
macro_recall = report['macro avg']['recall']

# Micro-averaged metrics pool true positives over all classes before dividing.
# For single-label multiclass data, micro precision and micro recall both equal
# accuracy, which is why classification_report lists 'accuracy' instead of 'micro avg'.
# (Averaging per-class scores weighted by support gives the 'weighted avg' row, not micro.)
micro_precision = report['accuracy']
micro_recall = report['accuracy']

# Output calculated values
print("Macro Precision:", macro_precision)
print("Macro Recall:", macro_recall)
print("Micro Precision:", micro_precision)
print("Micro Recall:", micro_recall)

Macro Precision: 0.18641216178417586
Macro Recall: 0.22333333333333333
Micro Precision: 0.22333333333333333
Micro Recall: 0.22333333333333333
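
Macro and micro averages differ only in when the division happens: macro averages the per-class scores, while micro pools the counts over all classes and divides once. The following is a minimal pure-Python sketch on made-up, deliberately imbalanced toy labels (the function name and data are illustrative, not part of the notebook).

```python
def precision_recall_averages(y_true, y_pred, labels):
    """Return (macro_p, macro_r, micro_p, micro_r) for single-label predictions."""
    tp = {c: 0 for c in labels}
    predicted = {c: 0 for c in labels}
    actual = {c: 0 for c in labels}
    for t, p in zip(y_true, y_pred):
        actual[t] += 1
        predicted[p] += 1
        if t == p:
            tp[t] += 1

    # Macro: compute each class's score first, then take the plain mean.
    per_class_p = [tp[c] / predicted[c] if predicted[c] else 0.0 for c in labels]
    per_class_r = [tp[c] / actual[c] if actual[c] else 0.0 for c in labels]
    macro_p = sum(per_class_p) / len(labels)
    macro_r = sum(per_class_r) / len(labels)

    # Micro: pool the counts over all classes, then divide once.
    micro_p = sum(tp.values()) / sum(predicted.values())
    micro_r = sum(tp.values()) / sum(actual.values())
    return macro_p, macro_r, micro_p, micro_r

# Toy data, made up for illustration: four samples of class 0, two of class 1
y_true = [0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 0, 1, 1, 0]
macro_p, macro_r, micro_p, micro_r = precision_recall_averages(y_true, y_pred, labels=[0, 1])
print(f"macro: P={macro_p:.3f} R={macro_r:.3f}")  # macro: P=0.625 R=0.625
print(f"micro: P={micro_p:.3f} R={micro_r:.3f}")  # micro: P=0.667 R=0.667
```

When every sample has exactly one true and one predicted label, the pooled counts make micro precision, micro recall, and accuracy identical. With perfectly balanced classes, as in the validation set here, macro recall equals accuracy as well.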


## Retraining the model with different settings

### Let's start the training

In [17]:
import pandas as pd
from sklearn.model_selection import train_test_split

In [18]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("IDEA-CCNL/Erlangshen-Roberta-110M-Sentiment")

# Function to tokenize and encode the data
def preprocess_data(data):
    return tokenizer(
        data['Sentence'].tolist(),
        padding=True,
        truncation=True,
        max_length=512,
        return_tensors="pt"
    )

# Preprocess both train and validation data
train_encodings = preprocess_data(train_df)
val_encodings = preprocess_data(val_df)

In [19]:
from transformers import AutoModelForSequenceClassification

# Define the number of unique labels
num_labels = len(train_df['Sentiment'].unique())

model = AutoModelForSequenceClassification.from_pretrained(
    "IDEA-CCNL/Erlangshen-Roberta-110M-Sentiment",
    num_labels=num_labels,
    ignore_mismatched_sizes=True
)

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at IDEA-CCNL/Erlangshen-Roberta-110M-Sentiment and are newly initialized because the shapes did not match:
- classifier.weight: found shape torch.Size([2, 768]) in the checkpoint and torch.Size([5, 768]) in the model instantiated
- classifier.bias: found shape torch.Size([2]) in the checkpoint and torch.Size([5]) in the model instantiated
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [20]:
label_mappings = {
    0: "Strong Negative",
    1: "Mild Negative",
    2: "Neutral",
    3: "Mild Positive",
    4: "Strong Positive"
}

# Update the model's configuration to use custom id2label and label2id mappings
model.config.id2label = label_mappings
model.config.label2id = {v: k for k, v in label_mappings.items()}

In [21]:
import torch
from transformers import Trainer, TrainingArguments


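# Convert labels to numeric format. factorize() numbers classes in order of first
# appearance, so reuse the training uniques for validation to keep the codes aligned.
# Note: the codes must also follow the order used in label_mappings.
train_codes, label_uniques = train_df['Sentiment'].factorize()
train_labels = torch.tensor(train_codes)
val_labels = torch.tensor(label_uniques.get_indexer(val_df['Sentiment']))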

# Create a Dataset class to work with Trainer
class SentimentDataset(torch.utils.data.Dataset):
    def __init__(self, encodings, labels):
        self.encodings = encodings
        self.labels = labels

    def __getitem__(self, idx):
        item = {key: val[idx] for key, val in self.encodings.items()}
        item['labels'] = self.labels[idx]
        return item

    def __len__(self):
        return len(self.labels)

# Create datasets
train_dataset = SentimentDataset(train_encodings, train_labels)
val_dataset = SentimentDataset(val_encodings, val_labels)

# Set up training arguments
training_args = TrainingArguments(
    output_dir="./results",
    eval_strategy="epoch",
    learning_rate=2e-7,
    per_device_train_batch_size=4,
    per_device_eval_batch_size=4,
    num_train_epochs=10,
    weight_decay=0.01
)

# Initialize Trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=val_dataset
)

# Train the model
trainer.train()


Epoch,Training Loss,Validation Loss
1,No log,1.619712
2,1.631500,1.56537
3,1.631500,1.51547
4,1.545600,1.469467
5,1.475200,1.428275
6,1.475200,1.394433
7,1.416200,1.368611
8,1.416200,1.349045
9,1.381500,1.337835
10,1.364100,1.334437


TrainOutput(global_step=3000, training_loss=1.4690140380859376, metrics={'train_runtime': 4668.4585, 'train_samples_per_second': 2.57, 'train_steps_per_second': 0.643, 'total_flos': 604349483472000.0, 'train_loss': 1.4690140380859376, 'epoch': 10.0})

In [25]:
# Save model and tokenizer
model.save_pretrained("./fine_tuned_sentiment_model_rt2")
tokenizer.save_pretrained("./fine_tuned_sentiment_model_rt2")

# Evaluate the model on the validation dataset
trainer.evaluate()

{'eval_loss': 1.3344367742538452,
 'eval_runtime': 23.7275,
 'eval_samples_per_second': 12.644,
 'eval_steps_per_second': 3.161,
 'epoch': 10.0}

### Micro, Macro Precision and Recall Scores

In [26]:
from sklearn.metrics import classification_report
import torch
from tqdm import tqdm

In [27]:
# Function to get predictions from the model
def get_predictions(model, dataset):
    model.eval()
    all_preds = []
    all_labels = []

    for batch in tqdm(dataset):
        inputs = {key: batch[key].unsqueeze(0).to(model.device) for key in batch if key != 'labels'}
        labels = batch['labels'].to(model.device)

        with torch.no_grad():
            outputs = model(**inputs)
            logits = outputs.logits

        predictions = torch.argmax(logits, dim=-1)

        # Convert labels and predictions to list format
        all_labels.append(labels.item())  # For single labels
        all_preds.append(predictions.item())  # For single predictions
    
    return all_labels, all_preds

In [28]:
# Get predictions and labels from the validation set
true_labels, predicted_labels = get_predictions(model, val_dataset)

# Compute classification metrics: precision, recall, f1, and support
report = classification_report(true_labels, predicted_labels, labels=[0,1,2,3,4], output_dict=True)

# Print out the classification report
print(report)

100%|████████████████████████████████████████████████████████████████████| 300/300 [00:29<00:00, 10.22it/s]

{'0': {'precision': 0.5909090909090909, 'recall': 0.65, 'f1-score': 0.6190476190476191, 'support': 60.0}, '1': {'precision': 0.43636363636363634, 'recall': 0.4, 'f1-score': 0.41739130434782606, 'support': 60.0}, '2': {'precision': 0.6, 'recall': 0.55, 'f1-score': 0.5739130434782609, 'support': 60.0}, '3': {'precision': 0.48, 'recall': 0.4, 'f1-score': 0.43636363636363634, 'support': 60.0}, '4': {'precision': 0.5135135135135135, 'recall': 0.6333333333333333, 'f1-score': 0.5671641791044776, 'support': 60.0}, 'accuracy': 0.5266666666666666, 'macro avg': {'precision': 0.5241572481572482, 'recall': 0.5266666666666666, 'f1-score': 0.5227759564683641, 'support': 300.0}, 'weighted avg': {'precision': 0.5241572481572482, 'recall': 0.5266666666666666, 'f1-score': 0.5227759564683641, 'support': 300.0}}





In [29]:
# Macro-averaged metrics are directly available in `report`
macro_precision = report['macro avg']['precision']
macro_recall = report['macro avg']['recall']

# Micro-averaged metrics pool true positives over all classes before dividing.
# For single-label multiclass data, micro precision and micro recall both equal
# accuracy, which is why classification_report lists 'accuracy' instead of 'micro avg'.
# (Averaging per-class scores weighted by support gives the 'weighted avg' row, not micro.)
micro_precision = report['accuracy']
micro_recall = report['accuracy']

# Output calculated values
print("Macro Precision:", macro_precision)
print("Macro Recall:", macro_recall)
print("Micro Precision:", micro_precision)
print("Micro Recall:", micro_recall)

Macro Precision: 0.5241572481572482
Macro Recall: 0.5266666666666666
Micro Precision: 0.5266666666666666
Micro Recall: 0.5266666666666666


### Test Some Sample Text

In [30]:
# Load model
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("fine_tuned_sentiment_model_rt2")
model = AutoModelForSequenceClassification.from_pretrained("fine_tuned_sentiment_model_rt2")

In [31]:
from transformers import pipeline
classifier = pipeline("text-classification", model="fine_tuned_sentiment_model_rt2")

In [32]:
test_texts = [
    "The food was absolutely horrible. The flavors were completely unbalanced and the presentation was terrible. I wouldn\u2019t recommend this place to my worst enemy.",
    "While this restaurant had some good dishes, some of my friends ordered the same thing, which resulted in an overcrowded table. The dining area was quite small, which made us feel rushed.",
    "This place has been on my bucket list for so long! The food is always a highlight, and I love trying new dishes. I highly recommend it to anyone looking for a great dining experience.",
    "The food was okay, nothing special. It was neither bad nor great, just an average meal.",
]

In [33]:
for text in test_texts:
    result = classifier(text)
    print(f"Input: {text}")
    print(f"Predicted Label: {result[0]['label']},Score: {result[0]['score']:.4f}\n")

Input: The food was absolutely horrible. The flavors were completely unbalanced and the presentation was terrible. I wouldn’t recommend this place to my worst enemy.
Predicted Label: Strong Negative,Score: 0.4599

Input: While this restaurant had some good dishes, some of my friends ordered the same thing, which resulted in an overcrowded table. The dining area was quite small, which made us feel rushed.
Predicted Label: Strong Negative,Score: 0.3145

Input: This place has been on my bucket list for so long! The food is always a highlight, and I love trying new dishes. I highly recommend it to anyone looking for a great dining experience.
Predicted Label: Strong Positive,Score: 0.3114

Input: The food was okay, nothing special. It was neither bad nor great, just an average meal.
Predicted Label: Mild Negative,Score: 0.3300



## Analysis

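The second run clearly does better: validation accuracy rises from about 0.22 to about 0.53, and validation loss falls from 1.62 to 1.33. Several settings changed between the runs (epochs from 4 to 10, batch size from 16 to 4, `max_length` from 384 to 512), so the gain cannot be attributed to the number of epochs alone.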

I would say, for this particular model:
<ul>
    <li>Reducing the batch size appears to help, probably because the generated samples are so similar to one another. I suspect a smaller batch matters because each update then behaves more like learning from an individual sample.</li>
    <li>Increasing the number of epochs should help the model fit better. The validation loss was still falling at epoch 10, so with more compute we could train for longer.</li>
    <li>We could also tune the weight decay and the learning rate. This model can very likely perform better than it does now.</li>
</ul>
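
The batch size and the number of epochs together determine how many weight updates the model receives. The sketch below reproduces the `global_step` values reported by the two runs, assuming 1,200 training samples (a figure inferred from those step counts, not stated in the notebook), a single device, and no gradient accumulation.

```python
import math

def optimizer_steps(num_samples, batch_size, epochs):
    """Number of weight updates: batches per epoch times epochs."""
    return math.ceil(num_samples / batch_size) * epochs

# Assumption: 1,200 training samples, inferred from the reported step counts
num_samples = 1200

run1_steps = optimizer_steps(num_samples, batch_size=16, epochs=4)
run2_steps = optimizer_steps(num_samples, batch_size=4, epochs=10)

print(run1_steps, run2_steps)   # 300 3000
print(run2_steps / run1_steps)  # 10.0
```

Under these assumptions the second run applied ten times as many updates as the first. With a learning rate as small as 2e-7, that difference alone may explain much of the improvement.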