Steps involved in binary classification: <br>
Data Collection --> Data Preprocessing --> Model Building --> Model Training --> Model Experimentation --> Model Evaluation

#### <b>Data Collection</b>

We will train a neural network model to perform sentiment analysis, i.e., to predict whether a given piece of text is positive or negative. We will use the [IMDb dataset](https://ai.stanford.edu/~amaas/data/sentiment/).
<br>
Our model is a three-layer network: the first layer is an embedding layer, followed by a hidden layer, and then the output layer. The model builds on an NLP concept called continuous bag-of-words (CBoW). The output layer is two-dimensional (one unit per class); the embedding and hidden layer dimensions will be decided later.
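
To make the three-layer description concrete, here is a minimal sketch of such a network. The class name `BagOfWordsClassifier`, the mean pooling over token embeddings, the ReLU activation, and all dimensions are assumptions chosen for illustration only; they are not the final model or hyperparameters of this notebook.

```python
import torch
import torch.nn as nn


class BagOfWordsClassifier(nn.Module):
    """Hypothetical sketch: embedding layer -> hidden layer -> output layer."""

    def __init__(self, vocab_size, embedding_dim, hidden_dim, output_dim, pad_index):
        super().__init__()
        # padding_idx keeps the <pad> embedding fixed at zero
        self.embedding = nn.Embedding(vocab_size, embedding_dim, padding_idx=pad_index)
        self.hidden = nn.Linear(embedding_dim, hidden_dim)
        self.output = nn.Linear(hidden_dim, output_dim)

    def forward(self, ids):
        # ids: [batch_size, seq_len]
        embedded = self.embedding(ids)          # [batch_size, seq_len, embedding_dim]
        # Assumption: average the token embeddings (word order is ignored);
        # padded positions contribute zeros to this mean
        pooled = embedded.mean(dim=1)           # [batch_size, embedding_dim]
        hidden = torch.relu(self.hidden(pooled))
        return self.output(hidden)              # [batch_size, output_dim]


# Placeholder sizes, for illustration only
model = BagOfWordsClassifier(
    vocab_size=100, embedding_dim=16, hidden_dim=8, output_dim=2, pad_index=1
)
dummy_ids = torch.randint(0, 100, (4, 10))  # 4 fake reviews of 10 token ids each
logits = model(dummy_ids)
print(logits.shape)  # torch.Size([4, 2])
```

The two output values per example are unnormalized scores (logits), one for each sentiment class.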

The steps we will perform during our sentiment analysis are:

*  Data Preparation
    1.  importing modules
    2.  loading data
    3.  tokenizing data
    4.  creating data splits
    5.  creating a vocabulary
    6.  numericalizing data
    7.  creating the data loaders

*  Building the Model
    1.  create a neural network model
    2.  define the loss & optimization
    3.  create a training loop
    4.  create a validation loop



In [1]:
!pip install datasets torchtext==0.17.0


##### **Data Preparation**

##### **importing modules**

In [2]:
## We will import required modules
import collections
import datasets
import matplotlib.pyplot as plt
import numpy as np
import torch
import torch.nn as nn
import torch.optim as optim
import torchtext
import tqdm

In [3]:
### Set random seed for torch & numpy
seed = 1
np.random.seed(seed)
torch.manual_seed(seed)
torch.cuda.manual_seed(seed)
torch.backends.cudnn.deterministic = True

##### **loading the dataset**

We will load the IMDb dataset using the `datasets` library. Here `load_dataset` is called with two arguments: the name of the data source and `split`, which selects the splits to return.

In [4]:
train_data, test_data = datasets.load_dataset("imdb", split=["train", "test"])


In [5]:
train_data, test_data

(Dataset({
     features: ['text', 'label'],
     num_rows: 25000
 }),
 Dataset({
     features: ['text', 'label'],
     num_rows: 25000
 }))

In [6]:
train_data.num_rows

25000

In [7]:
train_data[0], train_data[-1]

({'text': 'I rented I AM CURIOUS-YELLOW from my video store because of all the controversy that surrounded it when it was first released in 1967. I also heard that at first it was seized by U.S. customs if it ever tried to enter this country, therefore being a fan of films considered "controversial" I really had to see this for myself.<br /><br />The plot is centered around a young Swedish drama student named Lena who wants to learn everything she can about life. In particular she wants to focus her attentions to making some sort of documentary on what the average Swede thought about certain political issues such as the Vietnam War and race issues in the United States. In between asking politicians and ordinary denizens of Stockholm about their opinions on politics, she has sex with her drama teacher, classmates, and married men.<br /><br />What kills me about I AM CURIOUS-YELLOW is that 40 years ago, this was considered pornographic. Really, the sex and nudity scenes are few and far b

By reviewing the `text` and `label` fields of the dataset, we can see that a `label` of `0` means `negative` sentiment and a `label` of `1` means `positive` sentiment.

##### **Tokenization**

In [8]:
tokenizer = torchtext.data.utils.get_tokenizer("basic_english")

In [9]:
tokenizer("Hello World!")

['hello', 'world', '!']
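
To illustrate what this kind of tokenizer does, here is a rough pure-Python approximation. The function `simple_tokenize` and its regular expression are assumptions written for this sketch; they are not torchtext's actual implementation and may differ from it on some inputs.

```python
import re

# Assumption: a token is either a run of letters/digits (optionally joined by
# hyphens) or a single non-space punctuation character
TOKEN_PATTERN = re.compile(r"[a-z0-9]+(?:-[a-z0-9]+)*|[^\sa-z0-9]")


def simple_tokenize(text, max_length=None):
    """Lowercase the text, split it into tokens, and optionally truncate."""
    tokens = TOKEN_PATTERN.findall(text.lower())
    # Slicing with None keeps every token
    return tokens[:max_length]


print(simple_tokenize("Hello World!"))                       # ['hello', 'world', '!']
print(simple_tokenize("I AM CURIOUS-YELLOW", max_length=2))  # ['i', 'am']
```

Truncating to `max_length` tokens bounds the length of each example, which keeps later padding and batching cheap.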

In [10]:
def tokenize_example(example, tokenizer, max_length):
    tokens = tokenizer(example["text"])[:max_length]
    return {"tokens":tokens}

In [11]:
max_length = 256
train_data = train_data.map(
    tokenize_example, fn_kwargs={"tokenizer": tokenizer, "max_length": max_length}
    )
test_data = test_data.map(
    tokenize_example, fn_kwargs={"tokenizer": tokenizer, "max_length": max_length}
    )


In [12]:
train_data

Dataset({
    features: ['text', 'label', 'tokens'],
    num_rows: 25000
})

In [13]:
train_data.features

{'text': Value(dtype='string', id=None),
 'label': ClassLabel(names=['neg', 'pos'], id=None),
 'tokens': Sequence(feature=Value(dtype='string', id=None), length=-1, id=None)}

In [14]:
train_data[0]['tokens'][:10]

['i',
 'rented',
 'i',
 'am',
 'curious-yellow',
 'from',
 'my',
 'video',
 'store',
 'because']

##### **creating validation data**

In [15]:
test_size = 0.25

train_valid_data = train_data.train_test_split(test_size=test_size)
train_data = train_valid_data["train"]
valid_data = train_valid_data["test"]

In [16]:
len(train_data), len(valid_data), len(test_data)

(18750, 6250, 25000)
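
The sizes above follow from holding out 25% of the 25,000 training examples. The sketch below reproduces that arithmetic with a seeded shuffle in plain Python; `split_indices` is a hypothetical helper, not part of the `datasets` library. In `datasets`, passing a `seed` argument to `train_test_split` is one way to make the split itself reproducible.

```python
import random


def split_indices(num_rows, test_size, seed):
    """Shuffle row indices with a fixed seed and hold out a fraction of them."""
    indices = list(range(num_rows))
    random.Random(seed).shuffle(indices)
    num_valid = int(num_rows * test_size)
    # The first num_valid shuffled indices become validation, the rest training
    return indices[num_valid:], indices[:num_valid]


train_idx, valid_idx = split_indices(25000, test_size=0.25, seed=1)
print(len(train_idx), len(valid_idx))  # 18750 6250
```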

##### **creating a Vocabulary**

In [17]:
min_freq = 3
special_tokens = ["<unk>", "<pad>"]

vocab = torchtext.vocab.build_vocab_from_iterator(
    train_data["tokens"],
    min_freq=min_freq,
    specials=special_tokens,
)

In [18]:
len(vocab)

29246

In [19]:
vocab.get_itos()[:10]

['<unk>', '<pad>', 'the', '.', ',', 'a', 'and', 'of', 'to', "'"]

In [20]:
vocab["love"]

120

In [21]:
unk_index = vocab["<unk>"]
pad_index = vocab["<pad>"]

In [22]:
"some_token" in vocab

False

In [23]:
vocab.set_default_index(unk_index)

In [24]:
vocab["some_token"]

0

In [25]:
vocab.lookup_indices(["hello", "world", "some_token", "<pad>"])

[4748, 186, 0, 1]
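
The behaviour shown above (special tokens first, rare tokens dropped, unknown tokens mapped to a default index) can be sketched in plain Python. The helpers `build_vocab` and `lookup_indices` below are hypothetical stand-ins written for illustration, not torchtext's implementation, and the tie-breaking order among equally frequent tokens is an assumption.

```python
from collections import Counter


def build_vocab(token_lists, min_freq, specials):
    """Return (itos, stoi): index-to-string list and string-to-index dict."""
    counts = Counter(token for tokens in token_lists for token in tokens)
    # Keep tokens seen at least min_freq times, most frequent first
    # (assumption: ties are broken alphabetically)
    kept = sorted(
        (t for t, c in counts.items() if c >= min_freq),
        key=lambda t: (-counts[t], t),
    )
    itos = list(specials) + [t for t in kept if t not in specials]
    stoi = {token: index for index, token in enumerate(itos)}
    return itos, stoi


def lookup_indices(tokens, stoi, default_index):
    """Map tokens to ids, falling back to default_index for unknown tokens."""
    return [stoi.get(token, default_index) for token in tokens]


corpus = [
    ["the", "movie", "was", "good"],
    ["the", "movie", "was", "bad"],
    ["the", "end"],
]
itos, stoi = build_vocab(corpus, min_freq=2, specials=["<unk>", "<pad>"])
print(itos)  # ['<unk>', '<pad>', 'the', 'movie', 'was']
print(lookup_indices(["the", "good", "<pad>"], stoi, stoi["<unk>"]))  # [2, 0, 1]
```

Here `good` appears only once, so it falls below `min_freq` and is mapped to the `<unk>` index, just as `some_token` is in the cells above.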

##### **Numericalizing Data**

In [26]:
def numericalize_example(example, vocab):
    ids = vocab.lookup_indices(example["tokens"])
    return {"ids":ids}

In [27]:
train_data = train_data.map(numericalize_example, fn_kwargs={"vocab": vocab})
valid_data = valid_data.map(numericalize_example, fn_kwargs={"vocab": vocab})
test_data = test_data.map(numericalize_example, fn_kwargs={"vocab": vocab})


In [28]:
train_data[0]["tokens"][:10]

['when',
 'i',
 'began',
 'watching',
 'the',
 'muppets',
 'take',
 'manhattan',
 ',',
 'the']

In [29]:
vocab.lookup_indices(train_data[0]["tokens"][:10])

[60, 12, 1582, 144, 2, 6239, 206, 3932, 4, 2]

In [30]:
train_data = train_data.with_format(type="torch", columns=['ids', 'label'])
valid_data = valid_data.with_format(type="torch", columns=['ids', 'label'])
test_data = test_data.with_format(type="torch", columns=["ids", 'label'])

In [31]:
train_data[0]['ids'][:10]

tensor([  60,   12, 1582,  144,    2, 6239,  206, 3932,    4,    2])

In [32]:
train_data[0]['label']

tensor(0)

In [33]:
train_data[0].keys()

dict_keys(['label', 'ids'])

##### **creating data loaders**

In [35]:
def get_collate_fn(pad_index):
    def collate_fn(batch):
        batch_ids = [i["ids"] for i in batch]
        batch_ids = nn.utils.rnn.pad_sequence(
            batch_ids, padding_value=pad_index, batch_first=True
        )
        batch_label = [i["label"] for i in batch]
        batch_label = torch.stack(batch_label)
        batch = {"ids": batch_ids, "label": batch_label}
        return batch

    return collate_fn
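
The key step in `collate_fn` is padding: reviews in a batch have different lengths, so shorter ones are filled with the pad index until every row has the same length. The plain-Python sketch below shows the idea on lists of ids; `pad_batch` is a hypothetical helper that only mimics right-padding with `batch_first=True`, and it is not a replacement for `pad_sequence`.

```python
def pad_batch(sequences, pad_index):
    """Right-pad every sequence with pad_index up to the longest length."""
    max_len = max(len(seq) for seq in sequences)
    return [list(seq) + [pad_index] * (max_len - len(seq)) for seq in sequences]


batch = [[60, 12, 1582], [2, 6239], [4]]
padded = pad_batch(batch, pad_index=1)
print(padded)  # [[60, 12, 1582], [2, 6239, 1], [4, 1, 1]]
```

Each row now has the length of the longest sequence in the batch, so the batch can be stored as a single rectangular tensor.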