<a href="https://colab.research.google.com/github/Mahdi-Golizadeh/Natural-Language-Processing/blob/main/transformers/text_classification.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Text Classification
Text classification is a common NLP task that assigns a label or class to text. 

Table of Contents:
- [0.Install Necessary Libraries](#0)
- [1.Import Required Libraries](#1)
- [2.Load Dataset](#2)
- [3.Preprocess the Data](#3)
- [4.Metrics for Evaluation](#4)
- [5.Train the Model](#5)
- [6.Inference](#6)


<a name="0"></a>
## 0.Install Necessary Libraries
The `transformers`, `datasets`, and `evaluate` libraries must be installed (`sentencepiece` is installed alongside them):

In [49]:
!pip install -q transformers datasets evaluate sentencepiece

<a name="1"></a>
## 1.Import Required Libraries

In [50]:
import transformers
import datasets
import evaluate
import numpy as np
import ipywidgets as widgets
import sentencepiece
# Only show critical messages from transformers
transformers.logging.set_verbosity(transformers.logging.CRITICAL)

<a name="2"></a>
## 2.Load Dataset
The IMDB movie review dataset is used in this notebook.

In [51]:
imdb = datasets.load_dataset("imdb")




In [52]:
imdb

DatasetDict({
    train: Dataset({
        features: ['text', 'label'],
        num_rows: 25000
    })
    test: Dataset({
        features: ['text', 'label'],
        num_rows: 25000
    })
    unsupervised: Dataset({
        features: ['text', 'label'],
        num_rows: 50000
    })
})

To shorten training time, we keep only the original train split, shuffle it, and hold out 40% of it as a new test set:

In [53]:
imdb = imdb["train"].shuffle().train_test_split(.4)

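The original IMDB dataset contains three splits (train, test, unsupervised); the train and test splits each contain 25,000 samples. After the re-split above, `imdb` holds 15,000 training samples and 10,000 test samples.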
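
As a rough illustration of what a shuffled 60/40 split does to the 25,000 training reviews, here is a minimal pure-Python sketch. The helper `shuffled_split` and its `seed` parameter are hypothetical names introduced for this example; this is not how `datasets` implements `train_test_split`, only the same idea applied to index lists.

```python
import random

def shuffled_split(n_samples, test_size, seed=0):
    """Shuffle sample indices and split them into train and test index lists."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)     # reproducible shuffle
    n_test = round(n_samples * test_size)    # number of held-out samples
    return indices[n_test:], indices[:n_test]

train_idx, test_idx = shuffled_split(25000, 0.4)
print(len(train_idx), len(test_idx))  # 15000 10000
```

Fixing a seed makes the split reproducible between runs. The notebook's own `shuffle()` call is unseeded, so the samples it selects may differ from run to run.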

To inspect a single sample:

In [54]:
imdb["train"][1]

{'text': "Is this film a joke? Is it a comedy? Surely it isn't a serious thriller? There is no suggestion that there is any intended humor, but on quite a few occasions the poor acting, poor directing, and appalling script had the audience laughing out loud in the cinema. The plot is acceptable - a promising young artist just reaching his peak shot dead by an assassin he walks in on by mistake. The killer sees the young artists work portfolio he is carrying and decides to attend an exhibition of his work. At the exhibition the assassin meets the dead artists sister and they end up falling in love. It is all very predictable stuff and the end will not have anyone guessing as it is so poorly scripted. The film takes place mainly in and around Vienna, Austria, and shows what a beautiful city it is. Do not waste your time on this film though, unless you are studying how NOT to act, direct or script a film!",
 'label': 0}

In [55]:
imdb["train"].column_names

['text', 'label']

As you can see, every sample has two columns: `text` (the movie review) and `label` (a binary value: 0 for a negative review and 1 for a positive one).

<a name="3"></a>
## 3.Preprocess the Data
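We must tokenize the dataset before it can be fed to a neural network.

To do so, we load the tokenizer that matches the chosen checkpoint (`distilbert-base-uncased` by default; `bert-base-uncased` is the other option).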

In [56]:
#@title Choose a model checkpoint

checkpoint = "distilbert-base-uncased"  #@param ["distilbert-base-uncased", "bert-base-uncased"]

In [57]:
tokenizer = transformers.AutoTokenizer.from_pretrained(checkpoint)

Create a preprocessing function that tokenizes the text and truncates each sequence to the chosen maximum length, which is capped at the model's maximum input length. Then use the `map` function to apply it to every sample in the dataset.

In [58]:
#@title Choose Maximum Sequence Length
from IPython.display import display

model_max_length = tokenizer.model_max_length  # avoid shadowing the built-in max()
slider = widgets.IntSlider(value=256, min=64, max=model_max_length)
display(slider)

IntSlider(value=256, max=512, min=64)

In [59]:
max_length = slider.value

def tokenize_func(sample):
    # Truncate each review to at most max_length tokens; padding is left to the data collator
    return tokenizer(sample["text"], max_length=max_length, truncation=True)

In [60]:
tokenized_imdb = imdb.map(tokenize_func, batched= True)

Now create a batch of examples using `DataCollatorWithPadding`. It is more efficient to dynamically pad the sentences to the longest length in a batch during collation than to pad the whole dataset to the maximum length.

In [61]:
data_collator = transformers.DataCollatorWithPadding(
    tokenizer= tokenizer
)
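
To make the idea of dynamic padding concrete, here is a minimal sketch that pads a batch of token-id lists to the longest sequence in that batch. The helper `pad_batch`, the example token ids, and the padding id of 0 are assumptions made for illustration; `DataCollatorWithPadding` does more than this (for example, it returns tensors and reads the padding token from the tokenizer).

```python
def pad_batch(sequences, pad_id=0):
    """Pad every sequence to the length of the longest one in this batch."""
    longest = max(len(seq) for seq in sequences)
    input_ids = [seq + [pad_id] * (longest - len(seq)) for seq in sequences]
    # 1 marks a real token, 0 marks a padding position
    attention_mask = [[1] * len(seq) + [0] * (longest - len(seq)) for seq in sequences]
    return {"input_ids": input_ids, "attention_mask": attention_mask}

# Two made-up tokenized reviews of different lengths
batch = pad_batch([[101, 2023, 102], [101, 2023, 3185, 2001, 102]])
print(batch["input_ids"])       # [[101, 2023, 102, 0, 0], [101, 2023, 3185, 2001, 102]]
print(batch["attention_mask"])  # [[1, 1, 1, 0, 0], [1, 1, 1, 1, 1]]
```

Because the padded length depends only on the current batch, a batch of short reviews costs far less compute than it would if every review were padded to the global maximum.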

<a name="4"></a>
## 4.Metrics for Evaluation
For this task, we use the accuracy metric.

In [62]:
accuracy = evaluate.load("accuracy")

Then create a function that passes the predictions and labels to `accuracy.compute` to calculate the accuracy:

In [63]:
def compute_metrics(eval_pred):
    predictions, labels = eval_pred
    predictions = np.argmax(predictions, axis=1)
    return accuracy.compute(predictions=predictions, references=labels)
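
The following sketch shows what `compute_metrics` does to the model output, using a small set of made-up logits and labels. It computes the accuracy by hand with NumPy instead of calling `evaluate`, so the numbers are purely illustrative.

```python
import numpy as np

# Made-up logits for four reviews: one row per sample, one column per class
logits = np.array([[2.1, -1.3], [-0.4, 0.9], [0.2, 0.1], [-1.5, 1.7]])
labels = np.array([0, 1, 1, 1])

predictions = np.argmax(logits, axis=1)  # index of the largest logit in each row
manual_accuracy = float((predictions == labels).mean())

print(predictions.tolist())  # [0, 1, 0, 1]
print(manual_accuracy)       # 0.75
```

The third sample is misclassified (predicted 0, labeled 1), so three of four predictions are correct.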

<a name="5"></a>
## 5.Train the Model

Before you start training your model, create a map from the expected ids to their labels with `id2label`, and the reverse map with `label2id`:

In [64]:
id2label = {0: "negative",
            1: "positive"}
# Invert the mapping: label name -> id
label2id = {label: idx for idx, label in id2label.items()}

Load the model checkpoint with `AutoModelForSequenceClassification` along with the number of expected labels, and the label mappings

In [65]:
model = transformers.AutoModelForSequenceClassification.from_pretrained(checkpoint,
                                                                        num_labels=2, 
                                                                        id2label=id2label, 
                                                                        label2id=label2id)

Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertForSequenceClassification: ['cls.predictions.bias', 'cls.seq_relationship.bias', 'cls.predictions.transform.dense.weight', 'cls.predictions.decoder.weight', 'cls.seq_relationship.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.transform.dense.bias', 'cls.predictions.transform.LayerNorm.weight']
- This IS expected if you are initializing BertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForSequenceClassification were not initialized from the model checkpoint at

Define your training hyperparameters in `TrainingArguments`

In [66]:
args = transformers.TrainingArguments(
    output_dir="my_model",  # directory for storing model checkpoints
    learning_rate=2e-5,
    per_device_train_batch_size=32,
    per_device_eval_batch_size=32,
    num_train_epochs=2,
    weight_decay=0.01,  # weight decay applied by the optimizer (AdamW by default)
    evaluation_strategy="epoch",  # the model will be evaluated at the end of each epoch
    save_strategy="epoch",  # save a checkpoint at the end of each epoch
    load_best_model_at_end=True,  # reload the best checkpoint (lowest validation loss by default)
)

Pass the training arguments to `Trainer` along with the model, dataset, tokenizer, data collator, and `compute_metrics` function.

In [67]:
trainer= transformers.Trainer(
    model= model,
    args= args,
    train_dataset= tokenized_imdb["train"],
    eval_dataset = tokenized_imdb["test"],
    tokenizer= tokenizer,
    data_collator= data_collator,
    compute_metrics= compute_metrics,
)

Call `train()` to fine-tune your model:

In [68]:
trainer.train()

You're using a BertTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


| Epoch | Training Loss | Validation Loss | Accuracy |
|-------|---------------|-----------------|----------|
| 1     | No log        | 0.224257        | 0.9117   |
| 2     | 0.282200      | 0.252887        | 0.9147   |


TrainOutput(global_step=938, training_loss=0.22280937471369436, metrics={'train_runtime': 1515.7194, 'train_samples_per_second': 19.793, 'train_steps_per_second': 0.619, 'total_flos': 3946665830400000.0, 'train_loss': 0.22280937471369436, 'epoch': 2.0})
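
The `global_step=938` in the output above can be checked with a little arithmetic. This sketch assumes training on a single device without gradient accumulation, which is consistent with the reported numbers but is not stated in the notebook.

```python
import math

train_samples = 15000  # size of the training split after the 60/40 re-split
batch_size = 32        # per_device_train_batch_size
epochs = 2             # num_train_epochs

# The last batch of each epoch is smaller, so round up
steps_per_epoch = math.ceil(train_samples / batch_size)
total_steps = steps_per_epoch * epochs

print(steps_per_epoch, total_steps)  # 469 938
```

The final step count is also what names the checkpoint directory `checkpoint-938` used for inference. With the default `logging_steps` of 500, the first training loss is logged after step 469, which would explain the `No log` entry for epoch 1.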

<a name="6"></a>
## 6.Inference

In [69]:
text = imdb["test"][3]["text"]
label = imdb["test"][3]["label"]

In [72]:
classifier = transformers.pipeline("sentiment-analysis", model="/content/my_model/checkpoint-938")

In [74]:
# Keep only the first 512 characters of the review (a character slice, not a token limit)
classifier(text[:512])

[{'label': 'positive', 'score': 0.8871289491653442}]

In [75]:
label

1
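
The predicted label `positive` matches the true label `1`. The `score` returned by the pipeline is a probability derived from the model's logits; for a single-label classifier with two classes this is typically a softmax. The sketch below reproduces that post-processing step with NumPy, using hypothetical logits that are not taken from the trained model.

```python
import numpy as np

def softmax(logits):
    """Convert a vector of logits into probabilities that sum to 1."""
    shifted = logits - np.max(logits)  # subtract the max for numerical stability
    exp = np.exp(shifted)
    return exp / exp.sum()

id2label = {0: "negative", 1: "positive"}
logits = np.array([-1.0, 1.0])  # hypothetical logits for one review

probs = softmax(logits)
best = int(np.argmax(probs))  # class with the highest probability
result = {"label": id2label[best], "score": float(probs[best])}
print(result)
```

With these logits the positive class gets a probability of roughly 0.88, and the reported label is the entry of `id2label` at that index.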