# Fine-Tuning a Sentiment Analysis Use Case with Hugging Face Transformers, by `Mr. Harshit Dawar`

In [1]:
import transformers as trs
import datasets 
import warnings

In [3]:
warnings.filterwarnings("ignore")

In [4]:
# Loading the Amazon Polarity Dataset
dataset = datasets.load_dataset("amazon_polarity")


Downloading and preparing dataset amazon_polarity/amazon_polarity to C:/Users/harsh/.cache/huggingface/datasets/amazon_polarity/amazon_polarity/3.0.0/a27b32b7e7b88eb274a8fa8ba0f654f1fe998a87c22547557317793b5d2772dc...



Dataset amazon_polarity downloaded and prepared to C:/Users/harsh/.cache/huggingface/datasets/amazon_polarity/amazon_polarity/3.0.0/a27b32b7e7b88eb274a8fa8ba0f654f1fe998a87c22547557317793b5d2772dc. Subsequent calls will reuse this data.



In [5]:
dataset

DatasetDict({
    train: Dataset({
        features: ['label', 'title', 'content'],
        num_rows: 3600000
    })
    test: Dataset({
        features: ['label', 'title', 'content'],
        num_rows: 400000
    })
})

In [8]:
dataset["train"].features

{'label': ClassLabel(names=['negative', 'positive'], id=None),
 'title': Value(dtype='string', id=None),
 'content': Value(dtype='string', id=None)}

In [9]:
dataset["train"][0]

{'label': 1,
 'title': 'Stuning even for the non-gamer',
 'content': 'This sound track was beautiful! It paints the senery in your mind so well I would recomend it even to people who hate vid. game music! I have played the game Chrono Cross but out of all of the games I have ever played it has the best music! It backs away from crude keyboarding and takes a fresher step with grate guitars and soulful orchestras. It would impress anyone who cares to listen! ^_^'}
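
The `label` field is stored as an integer index into the `ClassLabel` names printed earlier, so the `1` on this review means `positive`. A minimal sketch of that lookup in plain Python, assuming only the names list from the features output (the helper `label_to_name` is hypothetical and not part of `datasets`):

```python
# Class names in index order, copied from the ClassLabel output.
LABEL_NAMES = ["negative", "positive"]

def label_to_name(label_id: int) -> str:
    """Map an integer label to its class name (hypothetical helper)."""
    return LABEL_NAMES[label_id]

sample = {"label": 1, "title": "Stuning even for the non-gamer"}
print(label_to_name(sample["label"]))  # positive
```

With the loaded dataset, `dataset["train"].features["label"].int2str(1)` performs the same lookup.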

In [11]:
# dataset["train"].data

In [12]:
model_to_use = "distilbert-base-uncased"
tokenizer = trs.AutoTokenizer.from_pretrained(model_to_use)


In [13]:
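# Tokenize the review text, truncating anything longer than the model's maximum input length
def apply_truncation(batch):
    # With batched = True, batch["content"] is a list of review texts
    return tokenizer(batch["content"], truncation = True)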

In [14]:
tokenized_dataset = dataset.map(apply_truncation, batched = True)
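
With `batched = True`, `map` passes the function a dictionary of columns, each holding a list of values for the batch, rather than a single row. The `"content"` entry the function receives is therefore a list of review texts that the tokenizer can process in one call. A rough sketch of that shape in plain Python; `fake_tokenize`, the whitespace splitting, and the tiny `max_length` are stand-ins chosen for illustration, not the behavior of the real DistilBERT tokenizer:

```python
from typing import Dict, List

def fake_tokenize(texts: List[str], max_length: int = 5) -> Dict[str, List[List[str]]]:
    """Stand-in tokenizer: split on whitespace, then truncate each text."""
    return {"input_ids": [text.split()[:max_length] for text in texts]}

def apply_truncation_sketch(batch: Dict[str, List[str]]) -> Dict[str, List[List[str]]]:
    # batch["content"] is a list with one entry per row in the batch.
    return fake_tokenize(batch["content"])

batch = {
    "title": ["Great", "Bad"],
    "content": ["loved every single minute of this soundtrack", "broke fast"],
}
result = apply_truncation_sketch(batch)
print(result["input_ids"])
```

The real tokenizer truncates to the model's maximum input length, 512 tokens for `distilbert-base-uncased`, and returns integer token ids plus an attention mask instead of word strings.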


In [15]:
training_arguments = trs.TrainingArguments(
    "My_Distil_BERT_Trainer",  # output directory for checkpoints
    evaluation_strategy = "epoch", save_strategy = "epoch", num_train_epochs = 3,
)

In [16]:
model = trs.AutoModelForSequenceClassification.from_pretrained(model_to_use)


Some weights of the model checkpoint at distilbert-base-uncased were not used when initializing DistilBertForSequenceClassification: ['vocab_transform.bias', 'vocab_projector.bias', 'vocab_projector.weight', 'vocab_layer_norm.weight', 'vocab_layer_norm.bias', 'vocab_transform.weight']
- This IS expected if you are initializing DistilBertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing DistilBertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'pre_classifier.weight', 'classifier
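
The newly initialized weights named in the warning belong to the classification head that `AutoModelForSequenceClassification` places on top of the pretrained encoder. As a back-of-the-envelope check, assuming the standard `distilbert-base-uncased` hidden size of 768 and two output labels, the size of that head can be worked out by hand:

```python
HIDDEN_SIZE = 768  # assumed: hidden size of distilbert-base-uncased
NUM_LABELS = 2     # assumed: two labels (negative, positive)

def linear_params(in_features: int, out_features: int) -> int:
    """Weights plus biases of a dense layer."""
    return in_features * out_features + out_features

pre_classifier = linear_params(HIDDEN_SIZE, HIDDEN_SIZE)  # 590592
classifier = linear_params(HIDDEN_SIZE, NUM_LABELS)       # 1538
head_total = pre_classifier + classifier
print(head_total)  # 592130
```

These 592,130 parameters start from random values, which is why the model needs fine-tuning before its predictions are meaningful. They are part of the 66,955,010 trainable parameters reported when training starts.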

In [20]:
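# GLUE SST-2 metric (accuracy). Note: datasets.load_metric is deprecated in newer
# releases of `datasets`; evaluate.load("glue", "sst2") is the replacement.
metric = datasets.load_metric("glue", "sst2")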


In [21]:
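import numpy as np

def compute_values_for_metrics(logits_and_labels):
    # The Trainer passes the model's logits together with the reference labels
    logits, labels = logits_and_labels
    # Pick the highest-scoring class for each example
    predicted_values = np.argmax(logits, axis = -1)
    return metric.compute(predictions = predicted_values, references = labels)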
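
To make the metric function concrete: each row of `logits` holds one score per class, `np.argmax(..., axis = -1)` picks the higher-scoring class, and the GLUE SST-2 metric reports accuracy. A small sketch with made-up logits and labels, computing accuracy directly with NumPy instead of calling the loaded metric:

```python
import numpy as np

# Made-up scores for four reviews; columns are [negative, positive].
logits = np.array([
    [2.1, -1.3],
    [-0.4, 0.9],
    [0.2, 0.1],
    [-1.5, 3.0],
])
labels = np.array([0, 1, 1, 1])  # made-up reference labels

predicted = np.argmax(logits, axis=-1)          # [0, 1, 0, 1]
accuracy = float((predicted == labels).mean())  # 3 of 4 correct
print(predicted.tolist(), accuracy)
```

The third review is the only miss: its logits slightly favor `negative` while the reference label is `positive`.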

In [23]:
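trainer = trs.Trainer(
    model, training_arguments,
    train_dataset = tokenized_dataset["train"],
    eval_dataset = tokenized_dataset["test"],  # assumed split; evaluation_strategy = "epoch" needs an eval set
    tokenizer = tokenizer, compute_metrics = compute_values_for_metrics,
)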

In [None]:
trainer.train()

The following columns in the training set don't have a corresponding argument in `DistilBertForSequenceClassification.forward` and have been ignored: title, content. If title, content are not expected by `DistilBertForSequenceClassification.forward`,  you can safely ignore this message.
***** Running training *****
  Num examples = 3600000
  Num Epochs = 3
  Instantaneous batch size per device = 8
  Total train batch size (w. parallel, distributed & accumulation) = 8
  Gradient Accumulation steps = 1
  Total optimization steps = 1350000
  Number of trainable parameters = 66955010
You're using a DistilBertTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.
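
The step count in the log follows from the other values it reports: steps per epoch is the number of examples divided by the total batch size (rounded up), multiplied by the number of epochs. A quick check of that arithmetic, assuming a single device and no gradient accumulation, as the log states:

```python
import math

num_examples = 3_600_000  # "Num examples" in the log
batch_size = 8            # "Total train batch size" in the log
num_epochs = 3            # "Num Epochs" in the log

steps_per_epoch = math.ceil(num_examples / batch_size)
total_steps = steps_per_epoch * num_epochs
print(steps_per_epoch, total_steps)  # 450000 1350000
```

Raising the batch size lowers the step count proportionally, for example to 337,500 steps at a batch size of 32.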


Epoch,Training Loss,Validation Loss
