In [70]:
from transformers import AutoTokenizer, AutoModelForSequenceClassification, Trainer, TrainingArguments, pipeline
import torch
from datasets import Dataset, load_dataset
import pandas as pd

In [38]:
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

In [39]:
data = load_dataset("stanfordnlp/imdb")
data

DatasetDict({
    train: Dataset({
        features: ['text', 'label'],
        num_rows: 25000
    })
    test: Dataset({
        features: ['text', 'label'],
        num_rows: 25000
    })
    unsupervised: Dataset({
        features: ['text', 'label'],
        num_rows: 50000
    })
})

In [40]:
train_data = pd.DataFrame(data['train'])
test_data = pd.DataFrame(data['test'])

# Converting the dataset to Hugging Face format
train_data = Dataset.from_pandas(train_data)
test_data = Dataset.from_pandas(test_data)

Now, in the tokenization function (`get_tokenized_output` in the cell below), tokenize the `text` column of the examples, pad to the longest sequence in each batch, and enable truncation so that no sequence exceeds the model's maximum length. Together, these give every sequence in a batch the same length.

Then, map `train_data` and `test_data` through the tokenization function with `batched` set to True. This applies the tokenization in batches, which is more efficient.
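To make the two options concrete, here is a toy sketch in plain Python. It is not the Hugging Face implementation: `pad_and_truncate`, `PAD_ID`, and the token ids are made up for illustration. It cuts every sequence to `max_length` and then pads to the longest sequence left in the batch, which is the effect that `padding='longest'` combined with `truncation=True` aims for.

```python
PAD_ID = 0  # hypothetical padding id, for illustration only


def pad_and_truncate(batch, max_length):
    """Truncate each sequence to max_length, then pad to the longest one in the batch."""
    truncated = [seq[:max_length] for seq in batch]
    longest = max(len(seq) for seq in truncated)
    input_ids = [seq + [PAD_ID] * (longest - len(seq)) for seq in truncated]
    # 1 marks a real token, 0 marks padding the model should ignore
    attention_mask = [[1] * len(seq) + [0] * (longest - len(seq)) for seq in truncated]
    return {"input_ids": input_ids, "attention_mask": attention_mask}


toy_batch = [[101, 7, 8, 102], [101, 7, 8, 9, 10, 11, 102], [101, 102]]
encoded = pad_and_truncate(toy_batch, max_length=5)
print(encoded["input_ids"])
print(encoded["attention_mask"])
```

A real tokenizer also keeps the special end-of-sequence token when it truncates, which this toy version does not attempt.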

In [45]:
token_maker = AutoTokenizer.from_pretrained('distilbert-base-uncased')

def get_tokenized_output(examples):
    # Tokenize a batch of reviews, padding to the longest one in the batch
    return token_maker(
        examples['text'],
        padding='longest',  # or padding=True
        truncation=True,
    )

In [46]:
tokenized_train_data = train_data.map(get_tokenized_output, batched=True)
tokenized_test_data = test_data.map(get_tokenized_output, batched=True)


- `output_dir` is where the model's output is saved. Don't change this, as we need to save results in `temp_results`. As the name suggests, these are wiped frequently, so don't expect to save anything there.
- `warmup_steps` specifies the length of the warm-up phase at the start of training. Gradually increasing the learning rate over these steps helps the model avoid large, destabilizing updates early in training.
- `weight_decay` helps prevent overfitting by penalizing large weights, which shrinks their magnitude.
- `logging_dir` specifies where to save the training logs.
- `learning_rate`, as you should know already, sets the size of the step the optimizer takes at each iteration of gradient descent.
- `save_strategy` specifies how checkpoints of the model are saved during training. Don't change this value.
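The interaction between `warmup_steps` and `learning_rate` can be sketched in plain Python. This is an illustration of a linear warm-up followed by a linear decay, which is the shape of the `Trainer`'s default schedule; the function name `linear_warmup_lr` is hypothetical, and `total_steps=6252` is an assumption (3 epochs over 25,000 examples at batch size 12 on a single device).

```python
def linear_warmup_lr(step, base_lr=5e-5, warmup_steps=500, total_steps=6252):
    """Learning rate at a given optimizer step: linear ramp-up, then linear decay to zero."""
    if step < warmup_steps:
        # Warm-up phase: grow from 0 to base_lr
        return base_lr * step / warmup_steps
    # Decay phase: shrink from base_lr to 0 at total_steps
    remaining = max(0, total_steps - step)
    return base_lr * remaining / (total_steps - warmup_steps)


for step in (0, 250, 500, 3376, 6252):
    print(step, linear_warmup_lr(step))
```

The rate starts at zero, reaches the configured `learning_rate` at step 500, and falls back to zero by the last step.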

Now, you need to specify the number of training epochs and the batch size for both training and evaluation.
Fine-tuning typically requires far fewer epochs than pretraining, because the model starts from weights that are already trained.

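Set `num_train_epochs` to 3 and `per_device_train_batch_size` to 12. The cell below sets `per_device_eval_batch_size` to 10; the evaluation batch size affects evaluation speed and memory use rather than what the model learns.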

An `epoch` is one full pass of the model over the training data.
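As a quick worked example, the number of optimizer steps implied by these settings can be computed directly. This sketch assumes a single device and no gradient accumulation; the helper name `total_optimizer_steps` is hypothetical.

```python
import math


def total_optimizer_steps(num_examples, batch_size, num_epochs):
    """Optimizer steps for a run, assuming one device and no gradient accumulation."""
    steps_per_epoch = math.ceil(num_examples / batch_size)  # the last batch may be smaller
    return steps_per_epoch * num_epochs


steps_per_epoch = math.ceil(25000 / 12)
total_steps = total_optimizer_steps(25000, 12, 3)
print(steps_per_epoch, total_steps)
```

With 25,000 training reviews and a batch size of 12, one epoch is 2,084 steps, so the 500 warm-up steps cover roughly the first quarter of the first epoch.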

In [49]:
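training_arugment = TrainingArguments(
    output_dir="./IMDB_results",      # where model outputs are written
    warmup_steps=500,                 # learning-rate warm-up length
    weight_decay=0.01,
    logging_dir='./IMDB_logs',        # where training logs are written
    learning_rate=5e-5,
    save_strategy="no",               # do not save intermediate checkpoints
    num_train_epochs=3,
    per_device_eval_batch_size=10,
    per_device_train_batch_size=12,
)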

In [None]:
model = AutoModelForSequenceClassification.from_pretrained("prajjwal1/bert-tiny", num_labels=2)

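model.to(device)  # Move the model to the selected device (GPU if available, otherwise CPU).

trainer = Trainer(
    model=model,
    args=training_arugment,
    train_dataset=tokenized_train_data,
    eval_dataset=tokenized_test_data,
)

trainer.train()     # fine-tune for the configured number of epochs
trainer.evaluate()  # report the loss on the test split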

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at prajjwal1/bert-tiny and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Step,Training Loss
500,0.6658
1000,0.4668
1500,0.3771
2000,0.3546
2500,0.3304
3000,0.2897
3500,0.2918
4000,0.2956
4500,0.2681
5000,0.2723


{'eval_loss': 0.3583813011646271,
 'eval_runtime': 320.3258,
 'eval_samples_per_second': 78.046,
 'eval_steps_per_second': 7.805,
 'epoch': 3.0}
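The evaluation output above reports only the loss, because no metric function was passed to the `Trainer`. A minimal sketch of an accuracy metric follows. The name `compute_accuracy` and the toy logits are made up for illustration; the function relies only on the fact that the `Trainer` hands its `compute_metrics` callable the predicted logits together with the true labels.

```python
def compute_accuracy(eval_pred):
    """Fraction of examples whose highest-scoring class matches the true label."""
    logits, labels = eval_pred
    # Index of the largest logit in each row is the predicted class
    predictions = [max(range(len(row)), key=lambda i: row[i]) for row in logits]
    correct = sum(int(p) == int(y) for p, y in zip(predictions, labels))
    return {"accuracy": correct / len(predictions)}


toy_logits = [[2.0, -1.0], [0.1, 0.3], [-0.5, 1.5], [1.2, 0.4]]
toy_labels = [0, 1, 0, 0]
print(compute_accuracy((toy_logits, toy_labels)))
```

It could be supplied when the trainer is built, as in `Trainer(..., compute_metrics=compute_accuracy)`, after which `trainer.evaluate()` would also report an accuracy value.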

In [57]:
model.save_pretrained("./our_finetuned_model_IMDB")

## Now let's load our model and run some examples.

Import our transformers modules to get started. We've uploaded the finetuned model from the last exercise for you, which is accessed below via `"./our_finetuned_model_IMDB"`.

In [59]:
model_path = "./our_finetuned_model_IMDB"

finetuned_model = AutoModelForSequenceClassification.from_pretrained(model_path)
tokenizer = AutoTokenizer.from_pretrained('distilbert-base-uncased')

First, define a pipeline with a model and a tokenizer. Pass `'text-classification'` as the task to the pipeline, then `model=finetuned_model` and `tokenizer=tokenizer`.

Then, write a short movie review on the `single_text` line and assign to `result` the output of running the classifier pipeline on `single_text`.

# Case 1: Passing a single short sentence.

In [81]:
single_text = """The Golden Door is the story of a Sicilian family'."""

classifier = pipeline('text-classification', model=finetuned_model, tokenizer=tokenizer)

result = classifier(single_text)

print(f"Fine-tuned model result: {result}")

Device set to use cpu


Fine-tuned model result: [{'label': 'LABEL_1', 'score': 0.9485974907875061}]


# Case 2: Passing the text without truncation.

## Without truncation, the tokenizer emits the following warning, and the model then raises the error shown after the cell.

**Token indices sequence length is longer than the specified `maximum sequence length for this model` (898 > 512). Running this sequence through the model will result in indexing errors**


In [82]:
single_text = """The Golden Door is the story of a Sicilian family'. Salvatore, a middle-aged man who hopes for a more fruitful life, persuades his family to leave their homeland behind in Sicily, take the arduous journey across the raging seas, and inhabit a land whose rivers supposedly flow with milk. In short, they believe that by risking everything for the New World their dreams of prosperity will be fulfilled. The imagery of the New World is optimistic, clever and highly imaginative. Silver coins rain from heaven upon Salvatore as he anticipates how prosperous he'll be in the New World; carrots and onions twice the size of human beings are shown being harvested to suggest wealth and health, and rivers of milk are swam in and flow through the minds of those who anticipate what the New World will yield. All of this imagery is surrealistically interwoven with the characters and helps nicely compliment the gritty realism that the story unfolds to the audience. The contrast between this imagery versus the dark reality of the Sicilian people helps provide hope while they're aboard the ship to the New World.<br /><br />The voyage to the New World is shot almost in complete darkness, especially when the seas tempests roar and nearly kill the people within. The dark reality I referred to is the Old World and the journey itself to the New World. The Old World is depicted as somewhat destitute and primitive. This is shown as Salvatore scrambles together to sell what few possessions he has left (donkeys, goats and rabbits) in order to obtain the appropriate clothing he needs to enter the New World. I thought it was rather interesting that these people believed they had to conform to a certain dress code in order to be accepted in the New World; it was almost suggesting that people had to fit a particular stereotype or mold in order to be recognized as morally fit. 
The most powerful image in the film was when the ship is leaving their homeland and setting sail for the New World. This shot shows an overhead view of a crowd of people who slowly seem to separate from one another, depicting the separation between the Old and New Worlds. This shot also suggested that the people were being torn away from all that was once familiar, wanted to divorce from their previous dark living conditions and were desirous to enter a world that held more promise.<br /><br />As later contrasted to how the New World visually looks, the Old World seems dark and bleak as compared to the bright yet foggy New World. I thought it was particularly interesting that the Statue of Liberty is never shown through the fog at Ellis Island, but is remained hidden. I think this was an intentional directing choice that seemed to negate the purpose of what the Statue of Liberty stands for: "Give me your poor, your tired, your hungry" seemed like a joke in regards to what these people had to go through when arriving at the New World. Once they arrived in the Americas, they had to go through rather humiliating tests (i.e. delousing, mathematics, puzzles, etc.) in order to prove themselves as fit for the New World. These tests completely changed the perspectives of the Sicilian people. In particular, Salvatore's mother had the most difficult time subjecting herself to the rules and laws of the New World, feeling more violated than treated with respect. Where their dreams once provided hope and optimism for what the New World would provide, the reality of what the New World required was disparaging and rude. Salvatore doesn't change much other than his attitude towards what he felt the New World would be like versus what the New World actually was seemed disappointing to him. This attitude was shared by mostly everyone who voyaged with him. 
Their character arcs deal more with a cherished dream being greatly upset and a dark reality that had to be accepted.<br /><br />The film seems to make a strong commentary on preparing oneself to enter a heavenly and civilized society. Cleanliness, marriage and intelligence are prerequisites. Adhering to these rules is to prevent disease, immoral behavior and stupidity from dominating. Perhaps this is a commentary on how America has learned from the failings of other nations and so was purposefully established to secure that these plagues did not infest and destruct. Though the rules seemed rigid, they were there to protect and help the people flourish."""

classifier = pipeline('text-classification', model=finetuned_model, tokenizer=tokenizer)

result = classifier(single_text)

print(f"Fine-tuned model result: {result}")

Device set to use cpu


RuntimeError: The expanded size of the tensor (898) must match the existing size (512) at non-singleton dimension 1.  Target sizes: [1, 898].  Tensor sizes: [1, 512]
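The error message indicates that the model has room for 512 positions while the tokenized review has 898 tokens. The toy sketch below mimics that limit in plain Python. It is an analogy rather than the model's actual code: `position_table` and `lookup_positions` are made-up names, and the toy raises an `IndexError` where the real model raises a shape-mismatch `RuntimeError`.

```python
MAX_POSITIONS = 512  # limit reported in the error message above

# Stand-in for a learned position-embedding table with one row per position
position_table = [[0.0] for _ in range(MAX_POSITIONS)]


def lookup_positions(num_tokens):
    """Fetch one position row per token; fails when num_tokens exceeds the table size."""
    return [position_table[i] for i in range(num_tokens)]


try:
    lookup_positions(898)
    failed = False
except IndexError:
    failed = True

# Truncating to the limit first makes the lookup succeed
truncated_len = min(898, MAX_POSITIONS)
rows = lookup_positions(truncated_len)
print(failed, len(rows))
```

Truncation fixes the problem by discarding everything past the limit, so the model never sees the end of a long review.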

# Case 3: Passing the text with truncation.

## With truncation, we get the following result.

In [83]:
single_text = """The Golden Door is the story of a Sicilian family'. Salvatore, a middle-aged man who hopes for a more fruitful life, persuades his family to leave their homeland behind in Sicily, take the arduous journey across the raging seas, and inhabit a land whose rivers supposedly flow with milk. In short, they believe that by risking everything for the New World their dreams of prosperity will be fulfilled. The imagery of the New World is optimistic, clever and highly imaginative. Silver coins rain from heaven upon Salvatore as he anticipates how prosperous he'll be in the New World; carrots and onions twice the size of human beings are shown being harvested to suggest wealth and health, and rivers of milk are swam in and flow through the minds of those who anticipate what the New World will yield. All of this imagery is surrealistically interwoven with the characters and helps nicely compliment the gritty realism that the story unfolds to the audience. The contrast between this imagery versus the dark reality of the Sicilian people helps provide hope while they're aboard the ship to the New World.<br /><br />The voyage to the New World is shot almost in complete darkness, especially when the seas tempests roar and nearly kill the people within. The dark reality I referred to is the Old World and the journey itself to the New World. The Old World is depicted as somewhat destitute and primitive. This is shown as Salvatore scrambles together to sell what few possessions he has left (donkeys, goats and rabbits) in order to obtain the appropriate clothing he needs to enter the New World. I thought it was rather interesting that these people believed they had to conform to a certain dress code in order to be accepted in the New World; it was almost suggesting that people had to fit a particular stereotype or mold in order to be recognized as morally fit. 
The most powerful image in the film was when the ship is leaving their homeland and setting sail for the New World. This shot shows an overhead view of a crowd of people who slowly seem to separate from one another, depicting the separation between the Old and New Worlds. This shot also suggested that the people were being torn away from all that was once familiar, wanted to divorce from their previous dark living conditions and were desirous to enter a world that held more promise.<br /><br />As later contrasted to how the New World visually looks, the Old World seems dark and bleak as compared to the bright yet foggy New World. I thought it was particularly interesting that the Statue of Liberty is never shown through the fog at Ellis Island, but is remained hidden. I think this was an intentional directing choice that seemed to negate the purpose of what the Statue of Liberty stands for: "Give me your poor, your tired, your hungry" seemed like a joke in regards to what these people had to go through when arriving at the New World. Once they arrived in the Americas, they had to go through rather humiliating tests (i.e. delousing, mathematics, puzzles, etc.) in order to prove themselves as fit for the New World. These tests completely changed the perspectives of the Sicilian people. In particular, Salvatore's mother had the most difficult time subjecting herself to the rules and laws of the New World, feeling more violated than treated with respect. Where their dreams once provided hope and optimism for what the New World would provide, the reality of what the New World required was disparaging and rude. Salvatore doesn't change much other than his attitude towards what he felt the New World would be like versus what the New World actually was seemed disappointing to him. This attitude was shared by mostly everyone who voyaged with him. 
Their character arcs deal more with a cherished dream being greatly upset and a dark reality that had to be accepted.<br /><br />The film seems to make a strong commentary on preparing oneself to enter a heavenly and civilized society. Cleanliness, marriage and intelligence are prerequisites. Adhering to these rules is to prevent disease, immoral behavior and stupidity from dominating. Perhaps this is a commentary on how America has learned from the failings of other nations and so was purposefully established to secure that these plagues did not infest and destruct. Though the rules seemed rigid, they were there to protect and help the people flourish."""

classifier = pipeline('text-classification', model=finetuned_model, tokenizer=tokenizer, truncation=True)

result = classifier(single_text)

print(f"Fine-tuned model result: {result}")

Device set to use cpu


Fine-tuned model result: [{'label': 'LABEL_1', 'score': 0.9861398935317993}]


## Conclusion

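**As we can see, the confidence score differs by nearly 4 percentage points between Case 1 (0.949) and Case 3 (0.986). Instead of passing a single line, we can pass more context and let truncation cut the input to the model's 512-token limit. The extra context is the likely reason the model is more confident when the full review is passed with truncation. Note that `score` is the model's confidence in its predicted label for one input, not accuracy measured over a dataset.**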
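The `score` in each result is a probability derived from the model's output logits. A minimal sketch of that step follows, assuming the pipeline applies a softmax over the two class logits (its default behaviour for a single-label model with more than one class). The logits here are made up, not taken from the model.

```python
import math


def softmax(logits):
    """Turn raw logits into probabilities that sum to 1."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]  # subtract the max for numerical stability
    total = sum(exps)
    return [e / total for e in exps]


toy_logits = [-1.0, 2.0]  # hypothetical logits for LABEL_0 and LABEL_1
probs = softmax(toy_logits)
label_id = probs.index(max(probs))
print(f"LABEL_{label_id}", round(probs[label_id], 4))
```

A larger gap between the two logits gives a score closer to 1, so a higher score means a more confident prediction for that one input.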

## Checking the prediction for a negative example as well.

In [84]:
single_text = """While I count myself as a fan of the Babylon 5 television series, the original movie that introduced the series was a weak start. Although many of the elements that would later mature and become much more compelling in the series are there, the pace of The Gathering is slow, the makeup somewhat inadequate, and the plot confusing. Worse, the characterization in the premiere episode is poor. Although the ratings chart shows that many fans are willing to overlook these problems, I remember The Gathering almost turned me off off what soon grew into a spectacular series."""

classifier = pipeline('text-classification', model=finetuned_model, tokenizer=tokenizer, truncation=True)

result = classifier(single_text)

print(f"Fine-tuned model result: {result}")

Device set to use cpu


Fine-tuned model result: [{'label': 'LABEL_0', 'score': 0.9691716432571411}]


**This same approach works for batches. Define a list of movie reviews in `batch_texts` and run them through the classifier, same as before.**

In [87]:
## YOUR SOLUTION HERE ##
batch_texts = ["It is so gratifying to see one great piece of art converted into another without distortion or contrivance. I had no guess as to how such an extraordinary piece of literature could be recreated as a film worth seeing. If you loved Bulgakov's book you would be, understandably, afraid of seeing some misguided interpretation done more for the sake of an art-film project than for actually bringing the story's deeper meaning to the screen. There are a couple examples of this with the Master and Margarita. As complex and far-fetched as the story is, the movie leaves out nothing. It is as if the filmmaker read Bulgakov's work the same way an orchestral conductor reads a score--with not a note missed. Why can't we find such talent here in the U.S. ? So now my favorite book and movie have the same title.", 
               "I watched mask in the 80's and it's currently showing on Fox Kids in the UK (very late at night). I remember thinking that it was kinda cool back in the day and had a couple of the toys too but watching it now bores me to tears. I never realised before of how tedious and bland this cartoon show really was. It's just plain awful! It is no where near in the same league as The Transformers, He-man or Thundercats and was very quickly forgot by nearly everyone once it stopped being made. I only watch it on Fox Kids because Ulysses 31 comes on straight after it (that's if mask doesn't put me to sleep first). One of the lesser 80's cartoons that i hope to completely forget about again once it finishes airing on Fox Kids!"]

results = classifier(batch_texts)

for i, result in enumerate(results):
    print(f"Prediction for text {i+1}: {result['label']}, Score: {result['score']}")


Prediction for text 1: LABEL_0, Score: 0.6447861194610596
Prediction for text 2: LABEL_0, Score: 0.9176220297813416
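The generic names `LABEL_0` and `LABEL_1` appear because the model was created without label names. In the IMDB dataset, label 0 is a negative review and label 1 is a positive one, so the results can be made readable with a small lookup. This is a sketch: `LABEL_NAMES` and `readable` are hypothetical helpers, and `sample_results` copies the two predictions printed above.

```python
LABEL_NAMES = {"LABEL_0": "negative", "LABEL_1": "positive"}  # IMDB: 0 = negative, 1 = positive


def readable(results):
    """Map pipeline results to (sentiment, rounded score) pairs."""
    return [(LABEL_NAMES[r["label"]], round(r["score"], 3)) for r in results]


sample_results = [
    {"label": "LABEL_0", "score": 0.6447861194610596},
    {"label": "LABEL_0", "score": 0.9176220297813416},
]

for i, (sentiment, score) in enumerate(readable(sample_results), start=1):
    print(f"Prediction for text {i}: {sentiment}, Score: {score}")
```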
