# **Overview**

The IMDB Dataset contains 50,000 movie reviews collected from IMDb.
It is mainly used for sentiment analysis and Natural Language Processing (NLP) tasks.

The dataset is balanced: it contains an equal number of positive and negative reviews (25,000 each).

# **Importing data**

In [16]:
import pandas as pd
from datasets import Dataset

data = pd.read_csv('https://raw.githubusercontent.com/laxmimerit/All-CSV-ML-Data-Files-Download/master/IMDB-Dataset.csv')
data.head()

,review,sentiment
0,One of the other reviewers has mentioned that ...,positive
1,A wonderful little production. <br /><br />The...,positive
2,I thought this was a wonderful way to spend ti...,positive
3,Basically there's a family where a little boy ...,negative
4,"Petter Mattei's ""Love in the Time of Money"" is...",positive


In [17]:
data.tail()

,review,sentiment
49995,I thought this movie did a down right good job...,positive
49996,"Bad plot, bad dialogue, bad acting, idiotic di...",negative
49997,I am a Catholic taught in parochial elementary...,negative
49998,I'm going to have to disagree with the previou...,negative
49999,No one expects the Star Trek movies to be high...,negative


In [18]:
data['sentiment']

,sentiment
0,positive
1,positive
2,positive
3,negative
4,positive
...,...
49995,positive
49996,negative
49997,negative
49998,negative


In [19]:
data['sentiment'].value_counts()

sentiment,count
positive,25000
negative,25000


# **Split the dataset into training and test sets**

In [20]:
dataset = Dataset.from_pandas(data)
dataset = dataset.train_test_split(test_size=0.3)
dataset


DatasetDict({
    train: Dataset({
        features: ['review', 'sentiment'],
        num_rows: 35000
    })
    test: Dataset({
        features: ['review', 'sentiment'],
        num_rows: 15000
    })
})
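
The split above is random, so the rows that land in each split change from run to run unless a seed is fixed (`train_test_split` accepts a `seed` argument for this). The sketch below is a plain-Python illustration of a seeded 70/30 shuffle split, not the library's implementation. `shuffle_split` is a hypothetical helper introduced only for this example.

```python
import random

def shuffle_split(n_rows, test_size, seed):
    """Return (train_indices, test_indices) for a seeded random split."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)  # same seed -> same shuffled order
    n_test = round(n_rows * test_size)
    return indices[n_test:], indices[:n_test]

train_idx, test_idx = shuffle_split(50_000, test_size=0.3, seed=42)
print(len(train_idx), len(test_idx))  # 35000 15000
```

Calling the helper again with the same seed returns the same indices, which is what makes results comparable across runs.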

# **Label Encoding for Sentiment Classification**

In [21]:
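# Map each sentiment string to an integer class ID, and back.
label2id = {'negative': 0, 'positive': 1}
id2label = {0: 'negative', 1: 'positive'}

# Add an integer 'label' column for the classifier to train on.
dataset = dataset.map(lambda x: {'label': label2id[x['sentiment']]})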



In [22]:
dataset

DatasetDict({
    train: Dataset({
        features: ['review', 'sentiment', 'label'],
        num_rows: 35000
    })
    test: Dataset({
        features: ['review', 'sentiment', 'label'],
        num_rows: 15000
    })
})

In [23]:
dataset['train'][0]

{'review': "Hmmm, yeah this episode is extremely underrated.<br /><br />Even though there is a LOT of bad writing and acting at parts. I think the good over wins the bad. <br /><br />I love the origami parts and the big 'twist' at the end. I absolutely love that scene when Michelle confronts Tony. It's actually one of my favorite scenes of Season 1. <br /><br />For some reason, people have always hated the Reincarnation episodes, yet I have always liked them. They're not the best, in terms of writing. but the theme really does interest me,<br /><br />I'm gonna give it a THREE star, but if the writing were a little more consistent i'd give it FOUR.",
 'sentiment': 'positive',
 'label': 1}
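
The two dictionaries above must stay exact inverses of each other, otherwise the model would report the wrong label names. A minimal sketch of one way to guarantee that is to derive `id2label` from `label2id`. The `sentiments` list is made-up sample data.

```python
label2id = {'negative': 0, 'positive': 1}

# Build the reverse mapping from label2id instead of typing it twice.
id2label = {idx: name for name, idx in label2id.items()}

# Round trip: every label survives name -> id -> name.
round_trip_ok = all(id2label[label2id[name]] == name for name in label2id)

sentiments = ['positive', 'negative', 'positive']  # made-up sample values
labels = [label2id[s] for s in sentiments]
print(labels)  # [1, 0, 1]
```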

# **Data tokenization**

In [24]:
from transformers import AutoTokenizer
import torch

device = torch.device('cuda') if torch.cuda.is_available() else torch.device('cpu')
model_ckpt = 'huawei-noah/TinyBERT_General_4L_312D'
tokenizer = AutoTokenizer.from_pretrained(model_ckpt, use_fast=True)


In [25]:
# See the tokenized review
tokenizer(dataset['train'][0]['review'])

{'input_ids': [101, 17012, 2213, 1010, 3398, 2023, 2792, 2003, 5186, 2104, 9250, 1012, 1026, 7987, 1013, 1028, 1026, 7987, 1013, 1028, 2130, 2295, 2045, 2003, 1037, 2843, 1997, 2919, 3015, 1998, 3772, 2012, 3033, 1012, 1045, 2228, 1996, 2204, 2058, 5222, 1996, 2919, 1012, 1026, 7987, 1013, 1028, 1026, 7987, 1013, 1028, 1045, 2293, 1996, 2030, 13340, 4328, 3033, 1998, 1996, 2502, 1005, 9792, 1005, 2012, 1996, 2203, 1012, 1045, 7078, 2293, 2008, 3496, 2043, 9393, 17628, 4116, 1012, 2009, 1005, 1055, 2941, 2028, 1997, 2026, 5440, 5019, 1997, 2161, 1015, 1012, 1026, 7987, 1013, 1028, 1026, 7987, 1013, 1028, 2005, 2070, 3114, 1010, 2111, 2031, 2467, 6283, 1996, 27788, 10010, 9323, 4178, 1010, 2664, 1045, 2031, 2467, 4669, 2068, 1012, 2027, 1005, 2128, 2025, 1996, 2190, 1010, 1999, 3408, 1997, 3015, 1012, 2021, 1996, 4323, 2428, 2515, 3037, 2033, 1010, 1026, 7987, 1013, 1028, 1026, 7987, 1013, 1028, 1045, 1005, 1049, 6069, 2507, 2009, 1037, 2093, 2732, 1010, 2021, 2065, 1996, 3015, 2020, 103
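
The raw reviews contain HTML line breaks (`<br />`), and the tokenizer encodes that markup like any other text. The notebook leaves it in place. One optional preprocessing step, sketched here as an assumption and not as part of the original pipeline, is to replace the tags with spaces before tokenizing. `strip_br` is a hypothetical helper.

```python
import re

BR_TAG = re.compile(r'<br\s*/?>', flags=re.IGNORECASE)

def strip_br(text):
    """Replace HTML line-break tags with a space and collapse repeated whitespace."""
    without_tags = BR_TAG.sub(' ', text)
    return re.sub(r'\s+', ' ', without_tags).strip()

# Opening of the training review shown earlier in the notebook.
sample = ("Hmmm, yeah this episode is extremely underrated.<br /><br />"
          "Even though there is a LOT of bad writing and acting at parts.")
cleaned = strip_br(sample)
print(cleaned)
```

If this step were adopted, it could be applied with `dataset.map(lambda x: {'review': strip_br(x['review'])})` before tokenization.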

In [26]:
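# Tokenize a batch of reviews for the Transformer model.
# Reviews longer than 300 tokens are truncated; shorter ones are padded.
def tokenize(batch):
    return tokenizer(batch['review'], padding=True, truncation=True, max_length=300)

# batch_size=None passes each split to tokenize() as a single batch,
# so padding=True pads every review in a split to the same length.
dataset = dataset.map(tokenize, batched=True, batch_size=None)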


In [27]:
dataset['train']

Dataset({
    features: ['review', 'sentiment', 'label', 'input_ids', 'token_type_ids', 'attention_mask'],
    num_rows: 35000
})
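
To make the `padding`, `truncation`, and `max_length` arguments concrete, here is a toy illustration on made-up token IDs. It is not the tokenizer's implementation. `pad_and_truncate` is a hypothetical helper, and the pad ID of 0 is an assumption that matches BERT-style vocabularies.

```python
def pad_and_truncate(sequences, max_length, pad_id=0):
    """Truncate each sequence to max_length, then pad all to the longest one left."""
    # A real tokenizer also keeps the closing special token when truncating;
    # this toy version simply cuts the list.
    truncated = [seq[:max_length] for seq in sequences]
    target = max(len(seq) for seq in truncated)
    input_ids = [seq + [pad_id] * (target - len(seq)) for seq in truncated]
    # 1 marks a real token, 0 marks padding the model should ignore.
    attention_mask = [[1] * len(seq) + [0] * (target - len(seq)) for seq in truncated]
    return {'input_ids': input_ids, 'attention_mask': attention_mask}

batch = pad_and_truncate(
    [[101, 7, 8, 102], [101, 9, 102], [101, 1, 2, 3, 4, 5, 102]],
    max_length=5,
)
print(batch['input_ids'])
# [[101, 7, 8, 102, 0], [101, 9, 102, 0, 0], [101, 1, 2, 3, 4]]
print(batch['attention_mask'])
# [[1, 1, 1, 1, 0], [1, 1, 1, 0, 0], [1, 1, 1, 1, 1]]
```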

# **Building the model evaluation function**

In [28]:
!pip install evaluate



In [29]:
import evaluate
import numpy as np

accuracy = evaluate.load("accuracy")

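def compute_metrics(eval_pred):
    # eval_pred holds the model's logits and the true labels.
    predictions, labels = eval_pred
    # Pick the class with the highest logit for each example.
    predictions = np.argmax(predictions, axis=1)
    return accuracy.compute(predictions=predictions, references=labels)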
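
As a small worked example of what the function above computes, the sketch below applies the same argmax step to made-up logits and derives accuracy directly with NumPy. The values are illustrative only and do not come from the trained model.

```python
import numpy as np

# Made-up logits for four reviews: column 0 = negative, column 1 = positive.
logits = np.array([[ 2.1, -1.3],
                   [-0.4,  0.9],
                   [ 0.2,  0.1],
                   [-1.5,  2.2]])
labels = np.array([0, 1, 1, 1])

predictions = np.argmax(logits, axis=1)  # highest-scoring class per row
accuracy_value = float((predictions == labels).mean())

print(predictions)     # [0 1 0 1]
print(accuracy_value)  # 0.75
```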


In [30]:
from transformers import AutoModelForSequenceClassification, TrainingArguments, Trainer

model = AutoModelForSequenceClassification.from_pretrained(
    model_ckpt, num_labels=2, id2label=id2label, label2id=label2id
)

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at huawei-noah/TinyBERT_General_4L_312D and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [33]:
args = TrainingArguments(
    output_dir='train_dir',
    overwrite_output_dir=True,
    num_train_epochs=3,
    learning_rate=2e-5,
    per_device_train_batch_size=32,
    per_device_eval_batch_size=32,
    eval_strategy='epoch'
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=dataset['train'],
    eval_dataset=dataset['test'],
    compute_metrics=compute_metrics,
    processing_class=tokenizer  # newer name for the deprecated tokenizer= argument (transformers >= 4.46)
)
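
When the `wandb` package is installed, `Trainer` may try to log to Weights & Biases and ask for an account when training starts. One way to avoid that is the `report_to` argument of `TrainingArguments`. The sketch below only builds the keyword arguments as a plain dictionary so that it runs without `transformers`. `training_kwargs` is a hypothetical name, and the other values mirror the call above.

```python
training_kwargs = dict(
    output_dir='train_dir',
    overwrite_output_dir=True,
    num_train_epochs=3,
    learning_rate=2e-5,
    per_device_train_batch_size=32,
    per_device_eval_batch_size=32,
    eval_strategy='epoch',
    report_to='none',  # disable all logging integrations, including W&B
)

# With transformers installed, the arguments object would be built as:
# args = TrainingArguments(**training_kwargs)
print(training_kwargs['report_to'])
```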



In [34]:
trainer.train()


Epoch,Training Loss,Validation Loss,Accuracy
1,0.3519,0.30635,0.871
2,0.2944,0.287378,0.881067
3,0.259,0.283132,0.885267


TrainOutput(global_step=3282, training_loss=0.3194708001824285, metrics={'train_runtime': 345.3813, 'train_samples_per_second': 304.012, 'train_steps_per_second': 9.503, 'total_flos': 882184338000000.0, 'train_loss': 0.3194708001824285, 'epoch': 3.0})
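
The `global_step` value in the output above can be checked by hand. The sketch assumes a single device and no gradient accumulation, neither of which is stated in the notebook.

```python
import math

train_rows = 35_000
batch_size = 32
epochs = 3

# The last batch of each epoch is smaller, so round up.
steps_per_epoch = math.ceil(train_rows / batch_size)
total_steps = steps_per_epoch * epochs

print(steps_per_epoch)  # 1094
print(total_steps)      # 3282
```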

In [35]:
trainer.evaluate()

{'eval_loss': 0.28313177824020386,
 'eval_accuracy': 0.8852666666666666,
 'eval_runtime': 18.5107,
 'eval_samples_per_second': 810.341,
 'eval_steps_per_second': 25.337,
 'epoch': 3.0}
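
To read these numbers in absolute terms, the reported accuracy and runtime can be converted into counts. This is plain arithmetic on the values printed above, with the test-set size taken from the split earlier in the notebook.

```python
eval_rows = 15_000
eval_accuracy = 0.8852666666666666
eval_runtime = 18.5107  # seconds

correct = round(eval_accuracy * eval_rows)
wrong = eval_rows - correct
throughput = eval_rows / eval_runtime  # reviews per second

print(correct, wrong)        # 13279 1721
print(round(throughput, 1))  # 810.3
```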

In [36]:
trainer.save_model('tinyBert-NLP2')

# **Evaluating the saved model**

In [38]:
import torch
from transformers import pipeline

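# Three made-up reviews to sanity-check the saved model.
# Named `reviews` so the list does not shadow the `data` DataFrame loaded earlier.
reviews = ['this movie was horrible, the plot was really boring. acting was okay',
           'the movie is really sucked. there is not plot and acting was bad',
           'what a beautiful movie. great plot. acting was good. will see it again']

device = torch.device('cuda') if torch.cuda.is_available() else torch.device('cpu')

# Load the fine-tuned model from the directory written by trainer.save_model().
classifier = pipeline('text-classification', model='tinyBert-NLP2', device=device)
classifier(reviews)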

Device set to use cuda


[{'label': 'negative', 'score': 0.9890663623809814},
 {'label': 'negative', 'score': 0.989194393157959},
 {'label': 'positive', 'score': 0.9900287985801697}]
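
The pipeline returns one dictionary per input, in the same order as the inputs. A minimal sketch of pairing each review with its prediction and flagging low-confidence results follows. The `results` list is copied from the output above, and `THRESHOLD` is a hypothetical cut-off chosen only for illustration.

```python
reviews = ['this movie was horrible, the plot was really boring. acting was okay',
           'the movie is really sucked. there is not plot and acting was bad',
           'what a beautiful movie. great plot. acting was good. will see it again']

# Output copied from the classifier call above.
results = [{'label': 'negative', 'score': 0.9890663623809814},
           {'label': 'negative', 'score': 0.989194393157959},
           {'label': 'positive', 'score': 0.9900287985801697}]

THRESHOLD = 0.9  # hypothetical confidence cut-off

for text, result in zip(reviews, results):
    flag = '' if result['score'] >= THRESHOLD else ' (low confidence)'
    print(f"{result['label']:>8}  {result['score']:.3f}  {text[:40]}{flag}")

# Keep only predictions at or above the cut-off.
confident = [r for r in results if r['score'] >= THRESHOLD]
```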