# Pipeline: Using Hugging Face Transformers

This notebook gives a step-by-step solution for multi-class sentiment classification of tweets. I use the Hugging Face `transformers` library, which provides access to a wide range of pre-trained language models, including BERT and RoBERTa. For this example, I use `DistilBertForSequenceClassification`, which puts a classification head on DistilBERT, a lightweight variant of BERT, and is well suited to sequence classification tasks such as sentiment analysis.

Here's an outline of the steps we'll follow:

1. Install the necessary Python packages.
2. Preprocess the data.
3. Load the pre-trained model and tokenizer from Hugging Face.
4. Fine-tune the model on the training data.
5. Evaluate the model on the validation data.
6. Generate predictions on the test data and save them in the required submission format.

### Step 1: Install the necessary Python packages

In [None]:
pip install pandas

In [None]:
pip install transformers

In [None]:
pip install torch

In [None]:
pip install scikit-learn

In [1]:
# Import the necessary libraries
import pandas as pd
from transformers import DistilBertTokenizer, DistilBertForSequenceClassification
import torch
from torch.optim import AdamW  # transformers' own AdamW is deprecated; use the PyTorch optimizer
from torch.utils.data import DataLoader, TensorDataset
from sklearn.metrics import classification_report

In [2]:
# Define the DistilBert tokenizer
tokenizer = DistilBertTokenizer.from_pretrained('distilbert-base-uncased')

### Step 2: Preprocess the data

In [3]:
# Load and preprocess the data
def load_data(file_path):
    # Read the data from the CSV file
    df = pd.read_csv(file_path)
    
    # Tokenize the text, pad or truncate to a fixed length, and convert to integer IDs
    inputs = tokenizer(df['text'].tolist(), padding='max_length', truncation=True, max_length=128, return_tensors='pt')
    
    # Convert the labels to integers
    label_mapping = {'positive': 0, 'negative': 1, 'neutral': 2}
    labels = torch.tensor(df['label'].map(label_mapping).tolist())
    
    # Create a TensorDataset from the tokenized inputs and labels
    dataset = TensorDataset(inputs['input_ids'], inputs['attention_mask'], labels)
    
    return dataset
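
To make the `padding='max_length'` and `truncation=True` arguments more concrete, here is a toy sketch of what padding or truncating to a fixed length produces. It is not the tokenizer's implementation: the helper `pad_or_truncate`, the token IDs, and the padding ID of 0 are all assumptions for illustration, and the real tokenizer also handles special tokens, which this sketch ignores.

```python
def pad_or_truncate(token_ids, max_length, pad_id=0):
    """Return (input_ids, attention_mask), both exactly max_length long.

    Hypothetical helper for illustration only; pad_id=0 is an assumption.
    """
    kept = token_ids[:max_length]          # truncate sequences that are too long
    n_pad = max_length - len(kept)         # number of padding positions needed
    input_ids = kept + [pad_id] * n_pad    # pad sequences that are too short
    # 1 marks a real token, 0 marks padding the model should ignore
    attention_mask = [1] * len(kept) + [0] * n_pad
    return input_ids, attention_mask


# Made-up token IDs: one sequence shorter and one longer than max_length
short_ids, short_mask = pad_or_truncate([11, 12, 13], max_length=5)
long_ids, long_mask = pad_or_truncate([21, 22, 23, 24, 25, 26, 27], max_length=5)

print(short_ids, short_mask)  # [11, 12, 13, 0, 0] [1, 1, 1, 0, 0]
print(long_ids, long_mask)    # [21, 22, 23, 24, 25] [1, 1, 1, 1, 1]
```

Because every sequence ends up the same length, the `input_ids` and `attention_mask` tensors can be stacked and wrapped in a `TensorDataset`, as `load_data` does.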

In [4]:
# Load the training and validation data
train_dataset = load_data("C:\\Users\\danij\\Documents\\UC3M\\TFG\\DATA\\train.csv")
valid_dataset = load_data("C:\\Users\\danij\\Documents\\UC3M\\TFG\\DATA\\dev.csv")

# Concatenate the train and dev data
# data = pd.concat([train_data, dev_data])
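
One thing that may be worth checking before training: `Series.map` returns `NaN` for any label that is not a key of `label_mapping` (for example a differently capitalized `'Positive'`), and that would only surface later as a confusing error. The sketch below shows one way to fail early instead. The helper name `encode_labels` and the sample labels are hypothetical, not part of the original notebook.

```python
label_mapping = {'positive': 0, 'negative': 1, 'neutral': 2}


def encode_labels(raw_labels, mapping):
    """Convert string labels to integer IDs, failing loudly on unknown labels."""
    # Collect every distinct label that the mapping does not cover
    unknown = sorted({label for label in raw_labels if label not in mapping}, key=str)
    if unknown:
        raise ValueError(f"Unmapped labels: {unknown}")
    return [mapping[label] for label in raw_labels]


# Made-up labels for illustration
encoded = encode_labels(['neutral', 'positive', 'negative'], label_mapping)
print(encoded)  # [2, 0, 1]
```

A list produced this way could be passed to `torch.tensor` in place of the mapped column.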

### Step 3: Load the pre-trained model and tokenizer

In [5]:
# Define the DistilBert model for sequence classification
model = DistilBertForSequenceClassification.from_pretrained('distilbert-base-uncased', num_labels=3)
model.train()

# Define the optimizer and loss function
optimizer = AdamW(model.parameters(), lr=5e-5)
criterion = torch.nn.CrossEntropyLoss()

Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'pre_classifier.bias', 'classifier.weight', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


### Step 4: Fine-tune the model on the training data

In [6]:
# Fine-tune the model on the training data
train_loader = DataLoader(train_dataset, batch_size=32, shuffle=True)
model.train()  # make sure dropout is active, even if this cell is re-run after evaluation
for epoch in range(3):
    for batch in train_loader:
        input_ids, attention_mask, labels = batch
        optimizer.zero_grad()
        outputs = model(input_ids, attention_mask=attention_mask)
        loss = criterion(outputs.logits, labels)
        loss.backward()
        optimizer.step()
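
The loop above runs on whatever device the model and tensors already live on, which is the CPU by default. If a GPU is available, one training step with explicit device placement might look like the sketch below. To keep it small and runnable without downloading anything, it uses a tiny linear layer as a stand-in for the DistilBERT classifier; `toy_model`, the feature size, and the fake batch are assumptions for illustration only.

```python
import torch

# Pick the GPU when one is available, otherwise stay on the CPU
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Hypothetical stand-in for the classifier: 8 features in, 3 classes out
toy_model = torch.nn.Linear(8, 3).to(device)
toy_optimizer = torch.optim.AdamW(toy_model.parameters(), lr=5e-5)
toy_criterion = torch.nn.CrossEntropyLoss()

# One fake batch of 4 examples with integer class labels
features = torch.randn(4, 8)
labels = torch.tensor([0, 1, 2, 1])

toy_model.train()
# Every tensor in the batch must be on the same device as the model
features, labels = features.to(device), labels.to(device)

toy_optimizer.zero_grad()
logits = toy_model(features)
loss = toy_criterion(logits, labels)
loss.backward()
toy_optimizer.step()

print(f"device={device}, loss={loss.item():.4f}")
```

In the notebook's loop, the same idea would mean calling `model.to(device)` once and moving `input_ids`, `attention_mask`, and `labels` to `device` inside each iteration.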

### Step 5: Evaluate the model on the validation data

In [7]:
# Evaluate the model on the validation data
valid_loader = DataLoader(valid_dataset, batch_size=32)
predictions, true_labels = [], []
model.eval()
for batch in valid_loader:
    input_ids, attention_mask, labels = batch
    with torch.no_grad():
        outputs = model(input_ids, attention_mask=attention_mask)
    preds = torch.argmax(outputs.logits, dim=1)
    predictions.extend(preds.tolist())
    true_labels.extend(labels.tolist())
print(classification_report(true_labels, predictions, target_names=['positive', 'negative', 'neutral']))

              precision    recall  f1-score   support

    positive       0.53      0.57      0.55       148
    negative       0.32      0.45      0.38        94
     neutral       0.83      0.75      0.79       511

    accuracy                           0.68       753
   macro avg       0.56      0.59      0.57       753
weighted avg       0.71      0.68      0.69       753
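
As a sanity check on how the two summary rows relate to the per-class rows, the short sketch below recomputes them from the rounded values in the report. The macro average is the plain mean over the three classes, while the weighted average weights each class by its support. Since the inputs are already rounded to two decimals, small discrepancies are possible in general, though here the results agree with the report.

```python
# Per-class values copied from the classification report above
report = {
    'positive': {'precision': 0.53, 'recall': 0.57, 'f1': 0.55, 'support': 148},
    'negative': {'precision': 0.32, 'recall': 0.45, 'f1': 0.38, 'support': 94},
    'neutral':  {'precision': 0.83, 'recall': 0.75, 'f1': 0.79, 'support': 511},
}

total_support = sum(row['support'] for row in report.values())


def macro(metric):
    """Unweighted mean over classes: every class counts equally."""
    return sum(row[metric] for row in report.values()) / len(report)


def weighted(metric):
    """Mean over classes weighted by the number of true examples per class."""
    return sum(row[metric] * row['support'] for row in report.values()) / total_support


for metric in ('precision', 'recall', 'f1'):
    print(f"{metric}: macro={macro(metric):.2f}, weighted={weighted(metric):.2f}")
# precision: macro=0.56, weighted=0.71
# recall: macro=0.59, weighted=0.68
# f1: macro=0.57, weighted=0.69
```

The gap between the macro and weighted scores reflects the class imbalance: the `neutral` class has 511 of the 753 examples and the best scores, so it pulls the weighted averages up, while the macro averages expose the weaker `negative` class.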



### Step 6: Generate predictions on the test data and save them in the required submission format

In [None]:
# Load and preprocess the test data
test_df = pd.read_csv("testing.csv")
test_inputs = tokenizer(test_df['text'].tolist(), padding='max_length', truncation=True, max_length=128, return_tensors='pt')
test_dataset = TensorDataset(test_inputs['input_ids'], test_inputs['attention_mask'])
test_loader = DataLoader(test_dataset, batch_size=32)

In [None]:
# Generate predictions on the test data
model.eval()  # disable dropout so predictions are deterministic
test_predictions = []
for batch in test_loader:
    input_ids, attention_mask = batch
    with torch.no_grad():
        outputs = model(input_ids, attention_mask=attention_mask)
    preds = torch.argmax(outputs.logits, dim=1)
    test_predictions.extend(preds.tolist())

In [None]:
# Save the predictions in the required submission format
submission_df = pd.DataFrame({'tweet_id': test_df['tweet_id'], 'label': test_predictions})
submission_df['label'] = submission_df['label'].replace({0: 'positive', 1: 'negative', 2: 'neutral'})
submission_df.to_csv("answer.txt", sep='\t', index=False)
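
The cell above hard-codes the index-to-label mapping, which has to stay in sync with `label_mapping` from `load_data`. A possible alternative is to derive the inverse mapping from the original dictionary, as sketched below. The tweet IDs and predictions are made up, and the sketch writes to an in-memory buffer rather than `answer.txt` so that it has no side effects; it assumes the same tab-separated, two-column layout as the cell above.

```python
import csv
import io

label_mapping = {'positive': 0, 'negative': 1, 'neutral': 2}
# Invert the training-time mapping instead of retyping it by hand
id_to_label = {index: name for name, index in label_mapping.items()}

# Hypothetical tweet IDs and predicted class indices
tweet_ids = [1001, 1002, 1003]
predicted = [2, 0, 1]

buffer = io.StringIO()
writer = csv.writer(buffer, delimiter='\t', lineterminator='\n')
writer.writerow(['tweet_id', 'label'])
for tweet_id, class_index in zip(tweet_ids, predicted):
    writer.writerow([tweet_id, id_to_label[class_index]])

print(buffer.getvalue())
```

With pandas, the same inverse dictionary could be passed to `Series.map` on the predicted column before calling `to_csv`.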