# Let's train a Sentiment Analysis model using DistilBERT, a BERT-based model.
---
## Roadmap:
1. Load Dataset
2. Preprocess Dataset
3. Tokenizer
4. Load Model
5. Train
6. Evaluate


### Load Dataset

Emotion is a dataset of English X (Twitter) messages labeled with six basic emotions: anger, fear, joy, love, sadness, and surprise.

[dair-ai/emotion](https://huggingface.co/datasets/dair-ai/emotion)


In [None]:
!pip install -q datasets

In [None]:
from datasets import load_dataset

In [None]:
dataset = load_dataset('dair-ai/emotion')

In [None]:
dataset

In [None]:
train_ds = dataset['train']
train_ds[:5]

In [None]:
train_ds.features

Datasets to Pandas Dataframe

In [None]:
import pandas as pd

dataset.set_format(type='pandas')

df = dataset['train'][:]
df.head()

In [None]:
def label_int2str(row):
    return dataset['train'].features['label'].int2str(row)

In [None]:
df['label_name'] = df['label'].apply(label_int2str)
df.head()

Data Visualization

In [None]:
import matplotlib.pyplot as plt

In [None]:
df['label_name'].value_counts(ascending=True).plot.barh()
plt.title('Frequency of Classes')
plt.show()

In [None]:
df['Words Per Tweet'] = df['text'].str.split().apply(len)
df.boxplot('Words Per Tweet', by='label_name', grid=False, showfliers=False, color='blue')
plt.suptitle('')
plt.xlabel('')
plt.show()

Pandas Dataframe to Datasets

In [None]:
dataset.reset_format()

### Data Preprocessing

---
#### Tokenization


Character Tokenization

In [None]:
text = 'It is fun to work with NLP using HuggingFace.'

tokenized_text = list(text)
print(tokenized_text, len(tokenized_text))

In [None]:
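# sort the unique characters so the ids are the same on every run (set order is not stable)
token2idx = {ch: idx for idx, ch in enumerate(sorted(set(tokenized_text)))}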
print(token2idx)

In [None]:
input_ids = [token2idx[token] for token in tokenized_text]
print(input_ids)
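
The character-level mapping above can be checked with a round trip. The following is a minimal sketch, not part of the original notebook: the names `sample`, `char2id`, `id2char`, `encode_chars`, and `decode_chars` are hypothetical and chosen so they do not overwrite the variables used in the other cells.

```python
# Toy character-level tokenizer with a deterministic vocabulary.
sample = 'It is fun to work with NLP using HuggingFace.'

# Sorting the unique characters keeps the ids reproducible across runs.
char_vocab = sorted(set(sample))
char2id = {ch: idx for idx, ch in enumerate(char_vocab)}
id2char = {idx: ch for ch, idx in char2id.items()}

def encode_chars(s):
    """Map each character to its integer id."""
    return [char2id[ch] for ch in s]

def decode_chars(ids):
    """Map integer ids back to a string."""
    return ''.join(id2char[i] for i in ids)

sample_ids = encode_chars(sample)
print(len(sample_ids), len(char_vocab))   # sequence length, vocabulary size
print(decode_chars(sample_ids) == sample)  # the round trip recovers the text
```

Decoding only works for characters that were seen when the vocabulary was built; any new character would raise a `KeyError` in this sketch.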

One-Hot Encoding;

In [None]:
df = pd.DataFrame({'name':['can', 'efe', 'ada'],
                   'label': [0,1,2]})
df

In [None]:
pd.get_dummies(df, dtype=int)

Let's do One-Hot Encoding with the torch library;

In [None]:
import torch
import torch.nn.functional as F

input_ids = torch.tensor(input_ids)

one_hot_encoding = F.one_hot(input_ids, num_classes=len(token2idx))
one_hot_encoding.shape # (len(tokenized_text), len(token2idx)): one row per character, one column per unique token

In [None]:
print(f'Token: {tokenized_text[0]}')
print(f'Tensor Index: {input_ids[0]}')
print(f'One Hot: {one_hot_encoding[0]}') # every unique token became a column, so the row for 'I' has a single 1 at its token index
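
To make the one-hot step concrete, here is a plain-Python sketch of the same idea. It is an illustration only; `one_hot_rows` is a hypothetical helper and not how `torch.nn.functional.one_hot` is implemented.

```python
def one_hot_rows(ids, num_classes):
    """Return one row per id, with a single 1 at the id's position."""
    rows = []
    for i in ids:
        if not 0 <= i < num_classes:
            raise ValueError(f'id {i} is outside [0, {num_classes})')
        row = [0] * num_classes
        row[i] = 1
        rows.append(row)
    return rows

example_ids = [2, 0, 3]
encoded_rows = one_hot_rows(example_ids, num_classes=4)
for row in encoded_rows:
    print(row)
# [0, 0, 1, 0]
# [1, 0, 0, 0]
# [0, 0, 0, 1]
```

The result has `len(ids)` rows and `num_classes` columns, which is the same shape rule as in the torch cell above.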

The character tokenization we did here is generally not used in practice.

Word tokenization is used instead. This reduces the complexity of the training process.

Word Tokenization;

In [None]:
tokenized_text = text.split()
print(tokenized_text)
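
Word tokenization also needs a vocabulary that maps words to ids. The sketch below is a toy example under stated assumptions: the two-sentence `corpus`, the `[UNK]` placeholder, and the helper `encode_words` are all hypothetical. It shows that any word missing from the vocabulary collapses to the unknown token, which is one reason subword tokenization is used in the next step.

```python
# Build a word-level vocabulary from a tiny, made-up corpus.
corpus = ['it is fun to work with nlp', 'nlp is fun']

word_vocab = {'[UNK]': 0}  # id 0 is reserved for unknown words
for sentence in corpus:
    for word in sentence.split():
        # assign the next free id the first time a word is seen
        word_vocab.setdefault(word, len(word_vocab))

def encode_words(sentence):
    """Map each whitespace-separated word to its id, or to [UNK]."""
    return [word_vocab.get(w, word_vocab['[UNK]']) for w in sentence.split()]

print(encode_words('nlp is fun'))          # [7, 2, 3]
print(encode_words('tokenizers are fun'))  # [0, 0, 3]
```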

Subword Tokenization;

Let's load the tokenizer we will use for our dataset.

In [None]:
from transformers import AutoTokenizer

model_ckpt = 'distilbert-base-uncased'

# load the pretrained tokenizer of the distilbert-base-uncased model
tokenizer = AutoTokenizer.from_pretrained(model_ckpt)

We can also load the tokenizer manually with the model-specific class;

In [None]:
from transformers import DistilBertTokenizer

distilbert_tokenizer = DistilBertTokenizer.from_pretrained(model_ckpt)

Let's pass our text to the tokenizer;

In [None]:
encoded_text = tokenizer(text)
print(encoded_text)

In [None]:
tokens = tokenizer.convert_ids_to_tokens(encoded_text.input_ids)
print(tokens) # [CLS] marks the start and [SEP] marks the end of the sequence.

In [None]:
tokenizer.convert_tokens_to_string(tokens)
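
The `##` prefixes in the tokens above come from WordPiece, the subword scheme used by BERT-style tokenizers. The sketch below illustrates the greedy longest-match-first idea on a single lowercase word. It is a simplified illustration, not the library's implementation: `toy_vocab` is a made-up vocabulary and `wordpiece` is a hypothetical helper, so the real tokenizer may split the same words differently.

```python
def wordpiece(word, vocab, unk='[UNK]'):
    """Greedy longest-match-first split of one lowercase word."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        piece = None
        # try the longest remaining substring first, then shrink it
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = '##' + candidate  # continuation pieces carry the ## prefix
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return [unk]  # no piece matched, so the whole word is unknown
        pieces.append(piece)
        start = end
    return pieces

# Hypothetical vocabulary, for illustration only.
toy_vocab = {'hugging', '##face', 'token', '##izer', '##s', 'fun'}

print(wordpiece('huggingface', toy_vocab))  # ['hugging', '##face']
print(wordpiece('tokenizers', toy_vocab))   # ['token', '##izer', '##s']
print(wordpiece('xyz', toy_vocab))          # ['[UNK]']
```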

Vocabulary size;

In [None]:
# In the token-encoding layer of the transformer architecture, vocab_size is typically between 20k and 200k.
# Number of entries in the DistilBERT tokenizer's vocabulary
tokenizer.vocab_size

In [None]:
# maximum sequence length the model accepts; longer inputs are cut off when truncation is enabled
tokenizer.model_max_length

In [None]:
print(distilbert_tokenizer(text))
print(distilbert_tokenizer.convert_ids_to_tokens(distilbert_tokenizer(text).input_ids))

Let's define a function that tokenizes the texts;

In [None]:
def tokenize(batch):
    return tokenizer(batch['text'], truncation=True) # apply to the dataset's 'text' column and drop tokens beyond the max length

Let's pass a few texts to our function;

In [None]:
# input_ids -> numerical representations of the tokens
# attention_mask -> tells the model which tokens to attend to
print(tokenize(dataset['train'][:2]))
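
What `truncation=True` does to a single sequence can be sketched roughly as follows. This is an illustration under assumptions, not the tokenizer's actual code: `truncate_with_specials` is a hypothetical helper, the ids 101 and 102 stand for `[CLS]` and `[SEP]`, and the other ids are made up.

```python
def truncate_with_specials(token_ids, max_length, cls_id=101, sep_id=102):
    """Keep at most max_length ids in total, including [CLS] and [SEP]."""
    if max_length < 2:
        raise ValueError('max_length must leave room for [CLS] and [SEP]')
    budget = max_length - 2  # positions left for ordinary tokens
    return [cls_id] + token_ids[:budget] + [sep_id]

long_ids = [11, 12, 13, 14, 15]   # made-up token ids
short_ids = [11, 12]

print(truncate_with_specials(long_ids, max_length=5))   # [101, 11, 12, 13, 102]
print(truncate_with_specials(short_ids, max_length=5))  # [101, 11, 12, 102]
```

Tokens are dropped from the end, and short sequences pass through unchanged, so truncation alone does not make all sequences the same length.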

Let's tokenize the entire dataset;

In [None]:
dataset_encoded = dataset.map(tokenize, batched=True, batch_size=None)

#### Padding

In [None]:
from transformers import DataCollatorWithPadding

data_collator = DataCollatorWithPadding(tokenizer=tokenizer)

In [None]:
print(dataset_encoded['train'].column_names)
print(dataset_encoded['train'][0])
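
Since the encoded examples have different lengths, each batch has to be padded before it can be stacked into a tensor. The sketch below shows the idea of dynamic padding in plain Python. It is a simplified illustration, not the `DataCollatorWithPadding` implementation: `pad_batch` is a hypothetical helper, `pad_id=0` is an assumption, and the token ids in `toy_batch` are made up.

```python
def pad_batch(sequences, pad_id=0):
    """Pad every sequence to the longest one in the batch and build the masks."""
    max_len = max(len(seq) for seq in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        n_pad = max_len - len(seq)
        input_ids.append(seq + [pad_id] * n_pad)
        # 1 marks a real token, 0 marks padding the model should ignore
        attention_mask.append([1] * len(seq) + [0] * n_pad)
    return {'input_ids': input_ids, 'attention_mask': attention_mask}

toy_batch = [[101, 1045, 102], [101, 1045, 2514, 2204, 102]]
padded = pad_batch(toy_batch)
print(padded['input_ids'])       # [[101, 1045, 102, 0, 0], [101, 1045, 2514, 2204, 102]]
print(padded['attention_mask'])  # [[1, 1, 1, 0, 0], [1, 1, 1, 1, 1]]
```

Padding per batch rather than to a global maximum keeps short batches short, which saves computation during training.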

## Trainer
---

Model Loading;


The six emotions in the dataset (anger, fear, joy, love, sadness, and surprise) are the class labels.

Loading DistilBERT Model;

In [None]:
from transformers import AutoModelForSequenceClassification

num_labels = 6 # anger, fear, joy, love, sadness, and surprise

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(device)

model = AutoModelForSequenceClassification.from_pretrained(model_ckpt,
                                                           num_labels=num_labels
                                                           ).to(device)

Evaluation for model;

In [None]:
!pip install -q evaluate

In [None]:
import evaluate
import numpy as np

accuracy = evaluate.load('accuracy')

# Accuracy Calculation
def compute_metrics(eval_pred):
    predictions, labels = eval_pred
    predictions = np.argmax(predictions, axis=1)

    return accuracy.compute(predictions=predictions,
                            references=labels)
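
The metric above boils down to taking the arg-max of each row of logits and comparing it with the label. A plain-Python sketch of that computation follows; `row_argmax`, `accuracy_from_logits`, and the toy values are hypothetical and only serve as an illustration.

```python
def row_argmax(row):
    """Index of the largest value in a list of scores."""
    return max(range(len(row)), key=row.__getitem__)

def accuracy_from_logits(logits, labels):
    """Fraction of rows whose highest-scoring class equals the label."""
    preds = [row_argmax(row) for row in logits]
    correct = sum(p == y for p, y in zip(preds, labels))
    return {'accuracy': correct / len(labels)}

toy_logits = [
    [2.0, 0.1, -1.0],  # predicted class 0
    [0.2, 0.3, 1.5],   # predicted class 2
    [0.9, 1.1, 0.0],   # predicted class 1
    [3.0, 0.0, 0.0],   # predicted class 0
]
toy_labels = [0, 2, 0, 0]

print(accuracy_from_logits(toy_logits, toy_labels))  # {'accuracy': 0.75}
```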

Model Sharing;

In [None]:
from huggingface_hub import notebook_login

notebook_login()

Set Training Arguments Hyperparameters

In [None]:
!pip install -q git+https://github.com/huggingface/transformers
!pip install -q git+https://github.com/huggingface/accelerate

In [None]:
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir='distilbert-base-uncased',  # local checkpoint directory (not a URL); its name is also used for the Hub repo
    num_train_epochs=5,
    per_device_train_batch_size=64,
    per_device_eval_batch_size=64,
    weight_decay=0.005,
    evaluation_strategy='epoch',  # named eval_strategy in newer transformers releases
    save_strategy='epoch',
    load_best_model_at_end=True,
    push_to_hub=True,
    report_to='none'
)
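
To get a feeling for how long training runs with these settings, the number of optimizer steps can be estimated. This is a rough sketch under assumptions: a single device, no gradient accumulation, and 16,000 training examples (check `len(dataset['train'])` for the real value). `total_optimizer_steps` is a hypothetical helper.

```python
import math

def total_optimizer_steps(n_examples, batch_size, epochs, n_devices=1):
    """Steps per epoch (last partial batch included) times the number of epochs."""
    steps_per_epoch = math.ceil(n_examples / (batch_size * n_devices))
    return steps_per_epoch * epochs

# Assumed training-set size; batch size and epochs match the arguments above.
print(total_optimizer_steps(n_examples=16_000, batch_size=64, epochs=5))  # 1250
```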

Model Training;

In [None]:
print(f"Model:{model}\nArgs:{training_args}\nCompute Metrics:{compute_metrics}\nTrain DS:{dataset_encoded['train'][0]}\n\
Eval DS:{dataset_encoded['validation'][:1]}\nTokenizer:{tokenizer}")

In [None]:
from transformers import Trainer

trainer = Trainer(
    model=model,
    args=training_args,
    compute_metrics=compute_metrics,
    train_dataset=dataset_encoded['train'],
    eval_dataset=dataset_encoded['validation'],
    tokenizer=tokenizer,
    data_collator=data_collator,  # pad each batch dynamically to its longest sequence
)

In [None]:
trainer.train()

Confusion Matrix:

In [None]:
preds_output = trainer.predict(dataset_encoded['validation'])
print(preds_output.metrics)

In [None]:
from sklearn.metrics import ConfusionMatrixDisplay, confusion_matrix

def plot_confusion_matrix(y_preds, y_true, labels):
    cm = confusion_matrix(y_true, y_preds, normalize='true')
    fig, ax = plt.subplots(figsize=(6, 6), dpi=100)
    disp = ConfusionMatrixDisplay(confusion_matrix=cm,
                                display_labels=labels)
    disp.plot(cmap='Blues', values_format='.2f', ax=ax, colorbar=False)
    plt.title('Normalized Confusion Matrix')
    plt.show()

In [None]:
y_preds = np.argmax(preds_output.predictions, axis=1)

y_valid = np.array(dataset_encoded['validation']['label'])

labels = dataset['train'].features['label'].names

In [None]:
plot_confusion_matrix(y_preds, y_valid, labels)
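
With `normalize='true'`, each row of the confusion matrix is divided by the number of true examples of that class, so a row shows how that class's examples were distributed over the predictions. A plain-Python sketch of this normalization follows; `normalized_confusion_matrix` and the toy labels are hypothetical and not the scikit-learn implementation.

```python
def normalized_confusion_matrix(y_true, y_pred, n_classes):
    """Row-normalized confusion matrix: rows are true classes, columns are predictions."""
    counts = [[0] * n_classes for _ in range(n_classes)]
    for t, p in zip(y_true, y_pred):
        counts[t][p] += 1
    matrix = []
    for row in counts:
        total = sum(row)
        # a class with no true examples keeps an all-zero row
        matrix.append([c / total if total else 0.0 for c in row])
    return matrix

toy_true = [0, 0, 1, 1, 1, 2]
toy_pred = [0, 1, 1, 1, 0, 2]

cm_toy = normalized_confusion_matrix(toy_true, toy_pred, n_classes=3)
for row in cm_toy:
    print([round(v, 2) for v in row])
# [0.5, 0.5, 0.0]
# [0.33, 0.67, 0.0]
# [0.0, 0.0, 1.0]
```

The diagonal entries are the per-class recall values, which makes it easy to see which emotions are confused with each other.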

### Model Saving

In [None]:
trainer.push_to_hub(commit_message='Training completed.')

Model Loading and Prediction

In [None]:
from transformers import pipeline

model_id = "mehmettozlu/distilbert-base-uncased"

classifier = pipeline('text-classification', model=model_id)

In [None]:
custom_text = "I watched a movie yesterday. It was really good."

preds = classifier(custom_text, return_all_scores=True)

preds_df = pd.DataFrame(preds[0])

In [None]:
plt.bar(labels, 100*preds_df['score'])
plt.title(f'{custom_text}')
plt.ylabel('Class Probability %')
plt.show()
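
The class probabilities plotted above are obtained by turning the model's raw logits into values that sum to 1, typically with a softmax. A minimal sketch of that conversion follows, assuming single-label classification; `softmax` here is a hypothetical helper and `example_logits` are made-up values, not outputs of the trained model.

```python
import math

def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    m = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

example_logits = [0.5, 3.0, 1.0, -1.0, -0.5, 0.0]  # one made-up score per emotion class
probs = softmax(example_logits)

print([round(100 * p, 1) for p in probs])  # class probabilities in percent
print(probs.index(max(probs)))             # index of the most likely class
```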