**Transformers:**



*   Transformers are a neural network architecture built around the attention mechanism, which lets the model learn long-range dependencies between different parts of a sequence.

*   The original Transformer has two parts: an encoder and a decoder. The encoder takes the input sequence and produces hidden states; the decoder takes those hidden states and produces the output sequence.

*   Transformers are now used for a variety of natural language processing tasks, including machine translation, text summarization, and question answering. They have also been applied to other tasks such as speech recognition and computer vision.
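
As a rough illustration of the attention mechanism mentioned above, here is a minimal pure-Python sketch of scaled dot-product attention. The token vectors are made-up numbers, and the function name `scaled_dot_product_attention` is chosen for this example only; real models use learned projections and several attention heads.

```python
import math

def softmax(row):
    # Subtract the max before exponentiating for numerical stability.
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def scaled_dot_product_attention(queries, keys, values):
    """Return (outputs, weights) for lists of equal-length vectors."""
    d_k = len(keys[0])
    outputs, weights = [], []
    for q in queries:
        # Similarity of this query with every key, scaled by sqrt(d_k).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in keys]
        w = softmax(scores)
        # Each output is a weighted average of the value vectors.
        out = [sum(w_j * v[i] for w_j, v in zip(w, values)) for i in range(len(values[0]))]
        weights.append(w)
        outputs.append(out)
    return outputs, weights

# Three hypothetical 2-dimensional token vectors (made-up numbers).
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
outputs, weights = scaled_dot_product_attention(tokens, tokens, tokens)
for w in weights:
    print([round(x, 3) for x in w])
```

Each row of `weights` sums to 1 and says how much one token attends to every token in the sequence, regardless of how far apart they are.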




<br>

#Encoder Part: Text classification

In [None]:
!pip install transformers

In [None]:
!pip install datasets

In [None]:
!pip install bertviz
!pip install umap-learn

In [None]:
import pandas as pd
from datasets import load_dataset

In [None]:
dataset=load_dataset("dair-ai/emotion")
dataset.set_format(type="pandas")  # return rows as pandas DataFrames

In [None]:
df = dataset['train'][:]  # the whole train split as a pandas DataFrame
df.head() #shows first 5 rows

In [None]:
classes = dataset['train'].features['label'].names  # class names for the integer labels
classes

In [None]:
df['label_name'] = df['label'].apply(lambda x: classes[x])  # map each integer label to its name
df.head()

<br>

#Dataset Analysis
<br>
Dataset analysis helps us understand the dataset, in particular its class distribution and the distribution of text lengths.

In [None]:
import matplotlib.pyplot as plt
label_counts = df['label_name'].value_counts(ascending=True)
label_counts.plot.barh()
plt.title('Frequency of Classes')
plt.show()

In [None]:
df['Words Per Tweet'] = df['text'].str.split().apply(len)

In [None]:
df.boxplot("Words Per Tweet", by='label_name')

<br>

#Text to Token Conversion

In [None]:
from transformers import AutoTokenizer  # loads the tokenizer that matches the checkpoint name
model_ckpt = "distilbert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_ckpt)

In [None]:
text = "Valentine's day. Crying in the hotel bar."
encoded_text = tokenizer(text)
print(encoded_text)

The IDs 101 and 102 in the list are special tokens: 101 is `[CLS]`, which marks the start of the sequence, and 102 is `[SEP]`, the separator that marks its end.

In [None]:
tokens = tokenizer.convert_ids_to_tokens(encoded_text.input_ids)
print(tokens)

The uppercase letters in the sentence have been converted to lowercase because we are using the uncased DistilBERT model.
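
To make the two observations above concrete (special tokens and lower-casing), here is a toy sketch of an encoder. It is not how the real tokenizer works internally: it splits on whitespace instead of into subwords, and `toy_vocab` and `toy_encode` are hypothetical names with made-up word IDs. Only the IDs 101 and 102 are taken from the output above.

```python
# Hypothetical mini-vocabulary; only 101 ([CLS]) and 102 ([SEP]) match
# the IDs seen in the real output, every other ID is made up.
toy_vocab = {
    "[UNK]": 100, "[CLS]": 101, "[SEP]": 102,
    "crying": 5000, "in": 5001, "the": 5002, "hotel": 5003, "bar": 5004,
}

def toy_encode(text):
    """Lower-case, split on whitespace, map words to IDs, add special tokens."""
    words = text.lower().split()
    ids = [toy_vocab.get(w, toy_vocab["[UNK]"]) for w in words]
    return [toy_vocab["[CLS]"]] + ids + [toy_vocab["[SEP]"]]

ids = toy_encode("Crying in the HOTEL bar")
print(ids)  # [101, 5000, 5001, 5002, 5003, 5004, 102]
```

Because the text is lower-cased first, "HOTEL" and "hotel" map to the same ID, which is the behaviour an uncased model expects.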

In [None]:
tokenizer.vocab_size, tokenizer.model_max_length

This displays the number of tokens in the tokenizer's vocabulary and the maximum sequence length the model accepts.

<br>

#Tokenization of the Emotion Data

In [None]:
dataset.reset_format()  # switch back from pandas to the default format before tokenizing

In [None]:
# Tokenization function to apply with map()
def tokenize(batch):
  # padding=True pads every sequence to the longest one in the batch;
  # truncation=True cuts sequences longer than the model's maximum length.
  return tokenizer(batch['text'], padding=True, truncation=True)

print(tokenize(dataset["train"][:2]))

The result contains the token IDs (`input_ids`) for each text and an attention mask (`attention_mask`) that marks which positions are real tokens (1) and which are padding (0).
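
The relationship between padding and the attention mask can be sketched in plain Python. This is an illustration only: `pad_batch` is a hypothetical helper, the token IDs are made up, and a padding ID of 0 is assumed.

```python
def pad_batch(sequences, pad_id=0):
    """Pad every sequence to the longest one and build matching attention masks."""
    max_len = max(len(seq) for seq in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        n_pad = max_len - len(seq)
        input_ids.append(seq + [pad_id] * n_pad)
        # 1 marks a real token, 0 marks padding the model should ignore.
        attention_mask.append([1] * len(seq) + [0] * n_pad)
    return {"input_ids": input_ids, "attention_mask": attention_mask}

# Hypothetical token IDs for two sentences of different lengths.
batch = pad_batch([[101, 7, 8, 9, 102], [101, 7, 102]])
print(batch["input_ids"])       # [[101, 7, 8, 9, 102], [101, 7, 102, 0, 0]]
print(batch["attention_mask"])  # [[1, 1, 1, 1, 1], [1, 1, 1, 0, 0]]
```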

<br>

Here we encode the whole dataset. With `batched=True, batch_size=None`, each split (train, validation, test) is passed to `tokenize` as a single batch, so every example in a split is padded to the same length.

In [None]:
dataset_encoded = dataset.map(tokenize, batched=True, batch_size=None)
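
To see why passing each split as a single batch matters when padding to the longest sequence, consider this small sketch. `padded_lengths` is a hypothetical helper and the token counts are made up; it only mimics how chunking affects the padded length.

```python
def padded_lengths(sequence_lengths, batch_size=None):
    """Return the padded length of each chunk when padding to its longest member."""
    if batch_size is None:
        # The whole split is treated as one chunk.
        chunks = [sequence_lengths]
    else:
        chunks = [
            sequence_lengths[i:i + batch_size]
            for i in range(0, len(sequence_lengths), batch_size)
        ]
    return [max(chunk) for chunk in chunks]

lengths = [5, 9, 4, 12, 7, 6]  # hypothetical token counts per text
print(padded_lengths(lengths, batch_size=2))     # [9, 12, 7]
print(padded_lengths(lengths, batch_size=None))  # [12]
```

With a finite batch size, different chunks end up with different padded lengths; with a single chunk, all rows of the split share one length.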

In [None]:
dataset_encoded

<br>

#Model Building

In [None]:
text

In [None]:
inputs = tokenizer(text, return_tensors='pt')
inputs

In [None]:
from transformers import AutoModel
import torch

model = AutoModel.from_pretrained(model_ckpt)

In [None]:
model

The model has several components. First comes an embedding layer that turns token IDs into embeddings. It is followed by the transformer itself: a stack of encoder blocks, each applying multi-head self-attention followed by a feed-forward network.

In [None]:
with torch.no_grad():
  outputs = model(**inputs)

last_hidden_states = outputs.last_hidden_state
last_hidden_states

In [None]:
last_hidden_states.shape

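The shape is `(1, 13, 768)`:
<br>1 - batch size (a single sentence)
<br>13 - number of tokens in the sentence, including the special tokens
<br>768 - length of the hidden-state vector DistilBERT produces for each token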

*   `AutoModelForSequenceClassification` adds a classification head on top of the pretrained model's outputs, so the model can be fine-tuned directly for our six emotion classes.


In [None]:
from transformers import AutoModelForSequenceClassification

# Use the GPU if one is available, otherwise fall back to the CPU.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
# num_labels=6: one output per emotion class.
model = AutoModelForSequenceClassification.from_pretrained(model_ckpt, num_labels=6).to(device)

In [None]:
from transformers import Trainer, TrainingArguments

batch_size = 64
model_name = "distilbert-finetuned-emotion-recog"
training_args = TrainingArguments(
    output_dir=model_name,
    num_train_epochs=2,
    learning_rate=2e-5,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    weight_decay=0.01,
    eval_strategy='epoch',
    disable_tqdm=False
)

In [None]:
from sklearn.metrics import accuracy_score, f1_score

def compute_metrics(pred):
  labels = pred.label_ids
  preds = pred.predictions.argmax(-1)
  f1 = f1_score(labels, preds, average='weighted')
  acc = accuracy_score(labels, preds)
  return {'accuracy': acc, 'f1': f1}
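
For intuition about what `average='weighted'` computes, here is a hand-rolled sketch: the F1 score of each class is weighted by how many true examples that class has. `weighted_f1` is a hypothetical helper written for this illustration, and the labels and predictions are made up.

```python
def weighted_f1(labels, preds):
    """Per-class F1 averaged with weights proportional to class support."""
    total = len(labels)
    score = 0.0
    for c in sorted(set(labels)):
        tp = sum(1 for y, p in zip(labels, preds) if y == c and p == c)
        fp = sum(1 for y, p in zip(labels, preds) if y != c and p == c)
        fn = sum(1 for y, p in zip(labels, preds) if y == c and p != c)
        denom = 2 * tp + fp + fn
        f1 = 2 * tp / denom if denom else 0.0
        support = tp + fn  # number of true examples of class c
        score += f1 * support / total
    return score

# Made-up labels and predictions for three classes.
labels = [0, 0, 1, 1, 2, 2]
preds = [0, 1, 1, 1, 2, 0]
accuracy = sum(y == p for y, p in zip(labels, preds)) / len(labels)
print(round(accuracy, 3), round(weighted_f1(labels, preds), 3))
```

Weighting by support keeps frequent classes from being drowned out by rare ones, which is useful here because the emotion classes are not equally frequent.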

In [None]:
trainer = Trainer(
    model=model,
    args=training_args,
    compute_metrics=compute_metrics,
    train_dataset=dataset_encoded["train"],
    eval_dataset=dataset_encoded["validation"],
    processing_class=tokenizer  # this argument was named `tokenizer` in older transformers releases
)

In [None]:
trainer.train()

In [None]:
preds_outputs = trainer.predict(dataset_encoded['test'])
preds_outputs.metrics

In [None]:
import numpy as np
y_preds = np.argmax(preds_outputs.predictions, axis=1)
y_true = dataset_encoded['test'][:]['label']

In [None]:
from sklearn.metrics import classification_report
# target_names labels each row of the report with its emotion name.
print(classification_report(y_true, y_preds, target_names=classes))

In [None]:
label_counts

Testing the model on a new sentence

In [None]:
text = "I feel alone even when I have a crowd around me."
# Use a separate name so the tokenized dataset in dataset_encoded is not overwritten.
inputs = tokenizer(text, return_tensors='pt').to(device)
with torch.no_grad():
    outputs = model(**inputs)
logits = outputs.logits
pred = torch.argmax(logits, dim=1).item()  # index of the highest-scoring class
pred, classes[pred]

In [None]:
outputs
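
The logits in `outputs` are unnormalised scores. A softmax turns them into probabilities, which are easier to read. The sketch below uses made-up logit values (not real model output) and a separate `class_names` list holding the emotion dataset's label order.

```python
import math

def softmax(logits):
    # Subtract the max before exponentiating for numerical stability.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Label order of the emotion dataset; the logit values are made up.
class_names = ["sadness", "joy", "love", "anger", "fear", "surprise"]
example_logits = [3.2, -1.0, -0.5, 0.3, 0.1, -2.0]

probs = softmax(example_logits)
best = max(range(len(probs)), key=probs.__getitem__)
print(class_names[best], round(probs[best], 3))
```

The class with the largest logit always has the largest probability, so `argmax` over the logits and over the probabilities gives the same prediction.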