# Task
Train a DistilBERT model on the "cropped_english_sentiment.csv" dataset for sentiment analysis.

## Load the dataset

### Subtask:
Load the "cropped_english_sentiment.csv" file into a pandas DataFrame.


**Reasoning**:
Import pandas, load the CSV file into a DataFrame, and display its head and info to inspect the data and its structure.



In [None]:
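import pandas as pd

# Load the sentiment dataset named in the task description
df = pd.read_csv('cropped_english_sentiment.csv')

# Show the first rows, then the column summary
display(df.head())
# df.info() prints its summary and returns None, so display() also shows None
display(df.info())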

Unnamed: 0,Sentence,Sentiment
0,almost got in a giant car accident on the 101,0
1,like something wholly original,2
2,b.s. one another,0
3,"Happy Star Wars Day, may the 4th be with you ...",2
4,few new converts,0


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 2 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   Sentence   1000 non-null   object
 1   Sentiment  1000 non-null   object
dtypes: object(2)
memory usage: 15.8+ KB


None

## Preprocessing

### Subtask:
Prepare the data for training, which may include handling missing values and encoding the labels.


**Reasoning**:
Check for missing values in the DataFrame and display the unique values in the 'Sentiment' column.



In [None]:
print("Missing values in each column:")
print(df.isnull().sum())

print("\nUnique values in the 'Sentiment' column:")
print(df['Sentiment'].unique())

Missing values in each column:
Sentence     0
Sentiment    0
dtype: int64

Unique values in the 'Sentiment' column:
['0' '2' '1' 'Positive' 'Neutral' 'Negative']
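
The output above shows two encodings mixed in one column: digit strings and label names. Below is a minimal sketch, in plain Python, of normalizing both to integer class IDs and failing loudly on anything unexpected. `normalize_label` and `LABEL_MAP` are hypothetical names, and the assignment Negative=0, Neutral=1, Positive=2 is an assumption about how the names line up with the digits.

```python
# Hypothetical helper: map both label encodings onto integer class IDs.
# Assumption: Negative=0, Neutral=1, Positive=2 matches the digit encoding.
LABEL_MAP = {"negative": 0, "neutral": 1, "positive": 2}


def normalize_label(raw):
    """Return 0, 1, or 2 for a raw label, or raise ValueError if it is unknown."""
    text = str(raw).strip()
    if text in {"0", "1", "2"}:
        return int(text)
    key = text.lower()
    if key in LABEL_MAP:
        return LABEL_MAP[key]
    raise ValueError(f"Unrecognized sentiment label: {raw!r}")


# The six unique values printed by the cell above
raw_labels = ["0", "2", "1", "Positive", "Neutral", "Negative"]
normalized = [normalize_label(label) for label in raw_labels]
print(normalized)
```

Raising on an unknown value is a deliberate choice in this sketch: a silent fallback would let a misspelled label reach training as missing data.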


## Tokenization

### Subtask:
Tokenize the text data using the DistilBERT tokenizer.


**Reasoning**:
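Tokenize the sentences with the pre-trained `distilbert-base-multilingual-cased` tokenizer, truncating over-long inputs and padding every sequence to the longest one so the result can be returned as PyTorch tensors.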



In [None]:
from transformers import DistilBertTokenizerFast

tokenizer = DistilBertTokenizerFast.from_pretrained('distilbert-base-multilingual-cased')

tokenized_data = tokenizer(
    list(df['Sentence']),
    truncation=True,
    padding=True,
    return_tensors='pt'
)

print(tokenized_data.keys())
print(tokenized_data['input_ids'][0])
print(tokenized_data['attention_mask'][0])

KeysView({'input_ids': tensor([[  101, 17122, 19556,  ...,     0,     0,     0],
        [  101, 11850, 26133,  ...,     0,     0,     0],
        [  101,   170,   119,  ...,     0,     0,     0],
        ...,
        [  101, 21547, 86709,  ...,     0,     0,     0],
        [  101, 47313, 31028,  ...,     0,     0,     0],
        [  101,   865, 11714,  ...,     0,     0,     0]]), 'attention_mask': tensor([[1, 1, 1,  ..., 0, 0, 0],
        [1, 1, 1,  ..., 0, 0, 0],
        [1, 1, 1,  ..., 0, 0, 0],
        ...,
        [1, 1, 1,  ..., 0, 0, 0],
        [1, 1, 1,  ..., 0, 0, 0],
        [1, 1, 1,  ..., 0, 0, 0]])})
tensor([  101, 17122, 19556, 10106,   169, 49429, 13000, 25037, 10135, 10105,
        14123,   102,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0, 
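
The trailing zeros in the tensors above come from padding. As a rough illustration of what `padding=True` does, the sketch below right-pads token ID lists to the longest length and builds the matching attention mask in plain Python. `pad_batch` is a hypothetical helper, not part of `transformers`; the first sequence copies the IDs printed above, and the second is made up.

```python
PAD_ID = 0  # padding ID seen in the printed tensors


def pad_batch(sequences, pad_id=PAD_ID):
    """Right-pad token ID lists to the longest length and build attention masks."""
    max_len = max(len(seq) for seq in sequences)
    input_ids = [seq + [pad_id] * (max_len - len(seq)) for seq in sequences]
    # 1 marks a real token, 0 marks padding the model should ignore
    attention_mask = [[1] * len(seq) + [0] * (max_len - len(seq)) for seq in sequences]
    return input_ids, attention_mask


first = [101, 17122, 19556, 10106, 169, 49429, 13000, 25037, 10135, 10105, 14123, 102]
second = [101, 7, 8, 102]  # made-up IDs for a shorter sentence

input_ids, attention_mask = pad_batch([first, second])
print(input_ids[1])
print(attention_mask[1])
```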

## Prepare data for PyTorch

### Subtask:
Convert the tokenized data into a format suitable for PyTorch.


**Reasoning**:
Map the labels to integers, convert them into a PyTorch tensor, and wrap them together with the tokenized inputs in a DataLoader.



In [None]:
import torch
from torch.utils.data import TensorDataset, DataLoader

# Map both label encodings (digit strings and names) to integer class IDs
label_map = {'Negative': 0, 'Neutral': 1, 'Positive': 2, '0': 0, '1': 1, '2': 2}
df['Sentiment'] = df['Sentiment'].map(label_map)

# CrossEntropyLoss expects integer class indices, so store labels as torch.long
labels = torch.tensor(df['Sentiment'].values, dtype=torch.long)

# No rows have been dropped, so df.index still covers every row and this
# indexing leaves the tensors unchanged
tokenized_data['input_ids'] = tokenized_data['input_ids'][df.index]
tokenized_data['attention_mask'] = tokenized_data['attention_mask'][df.index]


dataset = TensorDataset(tokenized_data['input_ids'], tokenized_data['attention_mask'], labels)

batch_size = 16
dataloader = DataLoader(dataset, batch_size=batch_size)

print(f"Number of batches in the DataLoader: {len(dataloader)}")

Number of batches in the DataLoader: 63
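
The batch count follows from the dataset size: 1000 rows at 16 per batch do not divide evenly, and by default `DataLoader` keeps the final partial batch. The arithmetic can be checked with a short sketch:

```python
import math

num_samples = 1000  # rows reported by df.info()
batch_size = 16

# The last, smaller batch is kept, so round up
num_batches = math.ceil(num_samples / batch_size)
last_batch_size = num_samples - (num_batches - 1) * batch_size

print(num_batches, last_batch_size)  # 63 8
```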


## Define the model

### Subtask:
Load the pre-trained DistilBERT model from the `transformers` library.


**Reasoning**:
Import the necessary class and load the pre-trained model.



In [None]:
from transformers import DistilBertForSequenceClassification

# After mapping string labels to numerical labels, there are only 3 unique labels (0, 1, 2).
# So we should set num_labels to 3.
num_labels = 3
model = DistilBertForSequenceClassification.from_pretrained('distilbert-base-multilingual-cased', num_labels=num_labels)

print(f"Model loaded with {num_labels} labels.")

Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-multilingual-cased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Model loaded with 3 labels.


## Train the model

### Subtask:
Fine-tune the DistilBERT model on your sentiment dataset.


**Reasoning**:
Define the optimizer and loss function, then train for `num_epochs` epochs. Each epoch sets the model to training mode and iterates over the dataloader; each batch step zeroes the gradients, runs the forward pass, computes the loss, backpropagates, and updates the weights.



In [None]:
from torch.optim import AdamW
import torch.nn as nn

optimizer = AdamW(model.parameters(), lr=5e-5)
criterion = nn.CrossEntropyLoss()

num_epochs = 3  # You can adjust the number of epochs here

for epoch in range(num_epochs):
    model.train()
    total_loss = 0

    for batch in dataloader:
        input_ids, attention_mask, labels = batch

        optimizer.zero_grad()

        outputs = model(input_ids, attention_mask=attention_mask)
        logits = outputs.logits
        loss = criterion(logits, labels)

        loss.backward()
        optimizer.step()

        total_loss += loss.item()

    average_loss = total_loss / len(dataloader)
    print(f"Epoch {epoch+1}/{num_epochs}, Average training loss: {average_loss}")

Epoch 1/3, Average training loss: 0.7188712626932159
Epoch 2/3, Average training loss: 0.41063500257829827
Epoch 3/3, Average training loss: 0.2094960036938862
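
To help read these loss values, here is a minimal plain-Python sketch of what a cross-entropy loss computes from raw logits: the negative log of the softmax probability assigned to the true class, averaged over the batch. The function names are hypothetical, and the sketch is for illustration only, not a replacement for `nn.CrossEntropyLoss`.

```python
import math


def cross_entropy(logits, target):
    """Cross-entropy for one example: -log(softmax(logits)[target])."""
    max_logit = max(logits)  # subtract the max for numerical stability
    log_sum_exp = max_logit + math.log(sum(math.exp(x - max_logit) for x in logits))
    return log_sum_exp - logits[target]


def mean_cross_entropy(batch_logits, batch_targets):
    """Average the per-example losses over a batch."""
    losses = [cross_entropy(l, t) for l, t in zip(batch_logits, batch_targets)]
    return sum(losses) / len(losses)


# A three-class model that scores every class equally
uniform_loss = cross_entropy([0.0, 0.0, 0.0], 1)
print(uniform_loss)  # ln(3), about 1.0986

# A confident, correct prediction gives a much smaller loss
confident_loss = cross_entropy([0.0, 0.0, 5.0], 2)
print(confident_loss)
```

A three-class model that guesses uniformly scores ln(3), about 1.0986, so an average loss of 0.72 after the first epoch is already below that baseline.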


## Evaluate the model

### Subtask:
Evaluate the performance of the trained model. No test set was held out, so the evaluation reuses the training data and the result is training accuracy.


**Reasoning**:
Set the model to evaluation mode, disable gradient tracking, and iterate through the same dataloader used for training to count correct predictions.



In [None]:
model.eval()

total_correct = 0
total_samples = 0

with torch.no_grad():
    for batch in dataloader:
        input_ids, attention_mask, labels = batch

        outputs = model(input_ids, attention_mask=attention_mask)
        logits = outputs.logits

        predictions = torch.argmax(logits, dim=-1)

        total_correct += (predictions == labels).sum().item()
        total_samples += labels.size(0)

accuracy = total_correct / total_samples
print(f"Accuracy on the dataset: {accuracy}")

Accuracy on the dataset: 0.881
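
Because the loop above scores the model on the batches it was trained on, a held-out estimate needs rows set aside before training. The sketch below, in plain Python, draws reproducible train and test index lists; `split_indices`, the 20% test fraction, and the seed are all assumptions chosen for illustration.

```python
import random


def split_indices(num_samples, test_fraction=0.2, seed=42):
    """Shuffle row indices once and split them into train and test lists."""
    indices = list(range(num_samples))
    random.Random(seed).shuffle(indices)  # seeded for reproducibility
    num_test = int(num_samples * test_fraction)
    return indices[num_test:], indices[:num_test]


train_idx, test_idx = split_indices(1000)
print(len(train_idx), len(test_idx))  # 800 200
```

These index lists could then select rows for two separate dataloaders, for example through `torch.utils.data.Subset`, so that accuracy is reported only on rows the model never saw during training.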
