# USING A PRE-TRAINED BERT MODEL

The task is a three-class sentiment classification problem (positive, negative, neutral) on English tweets. Given the data and the task, a strong approach is to fine-tune a pre-trained transformer such as BERT, which performs well on many NLP tasks, including sentiment analysis. The code below uses DistilBERT, a smaller distilled variant of BERT, through the Hugging Face transformers library.

Below, I'll outline the main steps of the code and provide an example implementation. The code will include the following:

1. Data Loading and Preprocessing: Read the CSV file and preprocess the text.
2. Model Selection: Use a pre-trained DistilBERT model (a distilled variant of BERT) for classification.
3. Training: Train the model on the provided training data.
4. Evaluation: Evaluate the model using the validation set and the micro-averaged F1-score.
5. Prediction: Generate predictions on the test set.
6. Submission File Creation: Create the submission file as per the required format.

In [None]:
import pandas as pd
import torch
from datasets import Dataset
from torch.utils.data import DataLoader, TensorDataset
from torch.optim import AdamW
from transformers import DistilBertTokenizer, DistilBertForSequenceClassification, Trainer, TrainingArguments, get_linear_schedule_with_warmup
from sklearn.metrics import classification_report, f1_score, precision_score, recall_score, accuracy_score

### Step 1: Data Loading and Preprocessing

In [None]:
# Define the DistilBert tokenizer
tokenizer = DistilBertTokenizer.from_pretrained('distilbert-base-uncased')

# Load and preprocess the data
def load_data(file_path):
    df = pd.read_csv(file_path)
    inputs = tokenizer(df['text'].tolist(), padding='max_length', truncation=True, max_length=128, return_tensors='pt')
    label_mapping = {'positive': 0, 'negative': 1, 'neutral': 2}
    labels = torch.tensor(df['label'].map(label_mapping).tolist())
    dataset = TensorDataset(inputs['input_ids'], inputs['attention_mask'], labels)
    return dataset

train_dataset = load_data("C:\\Users\\danij\\Documents\\UC3M\\TFG\\DATA\\train.csv")
valid_dataset = load_data("C:\\Users\\danij\\Documents\\UC3M\\TFG\\DATA\\dev.csv")
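
To make the `padding='max_length'`, `truncation=True` and `max_length` arguments concrete, here is a minimal plain-Python sketch of what they do to a list of token ids. The helper `pad_or_truncate`, the token ids and `pad_id=0` are assumptions made for illustration. The real tokenizer also inserts special tokens and keeps the closing one when it truncates, which this sketch ignores.

```python
def pad_or_truncate(token_ids, max_length, pad_id=0):
    """Return (input_ids, attention_mask), both exactly max_length long."""
    kept = token_ids[:max_length]                    # truncation: drop ids past max_length
    padding = [pad_id] * (max_length - len(kept))    # padding: fill up to max_length
    # The mask marks real tokens with 1 and padding with 0
    attention_mask = [1] * len(kept) + [0] * len(padding)
    return kept + padding, attention_mask


# Made-up token ids: one sequence shorter than max_length, one longer
short_ids, short_mask = pad_or_truncate([7, 8, 9], max_length=5)
long_ids, long_mask = pad_or_truncate([1, 2, 3, 4, 5, 6, 7], max_length=5)

print(short_ids, short_mask)  # [7, 8, 9, 0, 0] [1, 1, 1, 0, 0]
print(long_ids, long_mask)    # [1, 2, 3, 4, 5] [1, 1, 1, 1, 1]
```

Because every row ends up the same length, the rows can be stacked into one tensor, which is why `load_data` can request `return_tensors='pt'`.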

### Handle Class Imbalance

In [None]:
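# Reload the training CSV as a DataFrame so the examples per class can be counted
train_df = pd.read_csv("C:\\Users\\danij\\Documents\\UC3M\\TFG\\DATA\\train.csv")
label_mapping = {'positive': 0, 'negative': 1, 'neutral': 2}
label_counts = train_df['label'].value_counts()
# Inverse-frequency weights, ordered by class index (the dict keys are in index order 0, 1, 2)
class_weights = torch.tensor([len(train_df) / label_counts[label] for label in label_mapping], dtype=torch.float)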
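
As a quick illustration of inverse-frequency weighting, the sketch below computes the same kind of weights on a made-up label list, using only the standard library. The helper name `inverse_frequency_weights` and the toy counts are assumptions for the example, not values from the real dataset.

```python
from collections import Counter


def inverse_frequency_weights(labels, label_mapping):
    """Return one weight per class index: total examples / examples in that class."""
    counts = Counter(labels)
    total = len(labels)
    # Order class names by their index so that position i holds the weight of class i
    ordered_names = sorted(label_mapping, key=label_mapping.get)
    return [total / counts[name] for name in ordered_names]


label_mapping = {'positive': 0, 'negative': 1, 'neutral': 2}
# Hypothetical, deliberately imbalanced labels
toy_labels = ['positive'] * 5 + ['negative'] * 3 + ['neutral'] * 2

weights = inverse_frequency_weights(toy_labels, label_mapping)
print([round(w, 2) for w in weights])  # [2.0, 3.33, 5.0]
```

The rarest class gets the largest weight, so its errors contribute more to the loss. A class that never appears in the training labels would cause a division by zero here, so that case would need separate handling.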

### Step 2: Model and Training Configuration

We'll use the transformers library to load a pre-trained DistilBERT model with a three-class classification head.

In [None]:
# Use a GPU when one is available
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

# Define the DistilBert model for sequence classification
model = DistilBertForSequenceClassification.from_pretrained('distilbert-base-uncased', num_labels=3).to(device)
optimizer = AdamW(model.parameters(), lr=5e-5)
# Weighted loss to counter class imbalance
criterion = torch.nn.CrossEntropyLoss(weight=class_weights.to(device))

train_loader = DataLoader(train_dataset, batch_size=32, shuffle=True)

# Learning Rate Scheduler: stepped once per batch, so count batches, not examples
num_epochs = 3
total_steps = len(train_loader) * num_epochs
scheduler = get_linear_schedule_with_warmup(optimizer, num_warmup_steps=0, num_training_steps=total_steps)
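
A linear schedule is stepped once per batch, so `num_training_steps` should be the number of batches per epoch times the number of epochs. The plain-Python sketch below shows that arithmetic and the multiplier a linear warmup/decay schedule applies to the base learning rate. The function names and the 1000-example dataset size are assumptions for illustration.

```python
import math


def total_training_steps(num_examples, batch_size, num_epochs):
    """Number of optimizer steps when the last, partial batch of each epoch is kept."""
    steps_per_epoch = math.ceil(num_examples / batch_size)
    return steps_per_epoch * num_epochs


def linear_schedule_factor(step, num_warmup_steps, num_training_steps):
    """Multiplier for the base learning rate: linear ramp-up, then linear decay to 0."""
    if step < num_warmup_steps:
        return step / max(1, num_warmup_steps)
    remaining = num_training_steps - step
    return max(0.0, remaining / max(1, num_training_steps - num_warmup_steps))


# Hypothetical dataset size; batch size and epochs match the configuration above
total_steps = total_training_steps(num_examples=1000, batch_size=32, num_epochs=3)
print(total_steps)  # 96

base_lr = 5e-5
for step in (0, total_steps // 2, total_steps):
    print(step, base_lr * linear_schedule_factor(step, 0, total_steps))
```

With no warmup the learning rate starts at its full value, is halved at the midpoint and reaches zero on the last step. If the step count were based on the number of examples instead, the rate would barely decay during training.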

### Step 3: Training


Training tokenizes the text into a Hugging Face Dataset and fine-tunes the model with the Trainer API.

In [None]:
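# Define the tokenization function (same max_length as in load_data)
def tokenize(batch):
    return tokenizer(batch['text'], padding='max_length', truncation=True, max_length=128)

# Build a Hugging Face Dataset with integer labels, then tokenize it
# (Trainer batches and shuffles the data itself, so no DataLoader is needed here)
train_hf = train_df[['text', 'label']].copy()
train_hf['label'] = train_hf['label'].map(label_mapping)
train_dataset = Dataset.from_pandas(train_hf).map(tokenize, batched=True)

# Training arguments (Trainer creates its own AdamW optimizer and linear warmup schedule from these)
training_args = TrainingArguments(
    output_dir='./results',
    num_train_epochs=3,
    per_device_train_batch_size=32,
    per_device_eval_batch_size=64,
    warmup_steps=500,
    weight_decay=0.01,
    logging_dir='./logs',
)

# Trainer subclass that applies the class-weighted loss instead of the model's default loss
class WeightedTrainer(Trainer):
    def compute_loss(self, model, inputs, return_outputs=False, **kwargs):
        labels = inputs.pop('labels')
        outputs = model(**inputs)
        loss = criterion(outputs.logits, labels)
        return (loss, outputs) if return_outputs else loss

# Define Trainer and train
trainer = WeightedTrainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
)

trainer.train()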

### Step 4: Evaluation

Evaluate the model on the validation set.

In [None]:
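# Load and tokenize validation data, with integer labels
val_df = pd.read_csv("C:\\Users\\danij\\Documents\\UC3M\\TFG\\DATA\\dev.csv")
val_df['label'] = val_df['label'].map(label_mapping)
val_dataset = Dataset.from_pandas(val_df[['text', 'label']]).map(tokenize, batched=True)

# Get predictions: trainer.predict returns logits, so take the argmax of each row
val_output = trainer.predict(val_dataset)
predictions = val_output.predictions.argmax(axis=-1)

# Calculate micro-averaged F1 score
f1_micro = f1_score(val_df['label'], predictions, average='micro')
print("Micro-averaged F1-score:", f1_micro)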
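
To show what the micro-averaged F1-score measures, here is a small plain-Python sketch that pools true positives, false positives and false negatives over all classes before computing F1 once. The function `micro_f1` and the toy labels are assumptions for illustration. When every tweet receives exactly one predicted label, this pooled score equals plain accuracy.

```python
def micro_f1(y_true, y_pred, labels):
    """Micro-averaged F1: pool TP, FP and FN over all classes, then compute F1 once."""
    tp = fp = fn = 0
    for label in labels:
        for truth, pred in zip(y_true, y_pred):
            if pred == label and truth == label:
                tp += 1          # correctly predicted this class
            elif pred == label:
                fp += 1          # predicted this class, but the truth differs
            elif truth == label:
                fn += 1          # missed an example of this class
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Made-up class indices: 4 of the 6 predictions are correct
y_true = [0, 1, 2, 2, 1, 0]
y_pred = [0, 2, 2, 2, 1, 1]

score = micro_f1(y_true, y_pred, labels=[0, 1, 2])
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(round(score, 4), round(accuracy, 4))  # 0.6667 0.6667
```

Each wrong prediction counts once as a false positive (for the predicted class) and once as a false negative (for the true class), which is why pooled precision and recall coincide.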

### Step 5: Prediction

Predict the labels for the test set and create the submission file.

In [None]:
# Assume test_df contains the test data
test_dataset = Dataset.from_pandas(test_df[['text']]).map(tokenize, batched=True)

# Get test predictions (the argmax over the logits gives the class index)
test_predictions = trainer.predict(test_dataset).predictions.argmax(axis=-1)

# Decode labels by inverting label_mapping
id_to_label = {index: label for label, index in label_mapping.items()}
test_labels = [id_to_label[int(index)] for index in test_predictions]

# Create submission file
submission_df = pd.DataFrame({'tweet_id': test_df['tweet_id'], 'label': test_labels})
submission_df.to_csv('answer.txt', sep='\t', index=False)
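
The decoding and file-writing steps can be checked without a trained model. The sketch below, using only the standard library, maps class indices back to label names and writes a tab-separated file with a `tweet_id` and a `label` column, which is the layout the `to_csv` call above produces. The helper `write_submission` and the tweet ids are hypothetical, and the sketch writes to an in-memory buffer instead of `answer.txt`.

```python
import csv
import io

label_mapping = {'positive': 0, 'negative': 1, 'neutral': 2}
# Invert the mapping so class indices can be turned back into label names
id_to_label = {index: label for label, index in label_mapping.items()}


def write_submission(handle, tweet_ids, predicted_ids):
    """Write a header row plus one tab-separated (tweet_id, label) row per prediction."""
    writer = csv.writer(handle, delimiter='\t', lineterminator='\n')
    writer.writerow(['tweet_id', 'label'])
    for tweet_id, class_id in zip(tweet_ids, predicted_ids):
        writer.writerow([tweet_id, id_to_label[class_id]])


# Hypothetical tweet ids and predicted class indices
buffer = io.StringIO()
write_submission(buffer, ['t1', 't2', 't3'], [0, 2, 1])
print(buffer.getvalue())
```

Passing an open file handle instead of the buffer would write the same rows to disk.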