### About This Tokenizer

To tokenize the raw review text, we use the AutoTokenizer class provided by the transformers package. We initialize the AutoTokenizer with the settings of the pre-trained Google BERT Base Uncased model, as outlined below. We then use the tokenizer to transform each review into a sequence of token IDs with a fixed length of 256, with shorter sequences padded with a special padding token. To ensure that padding tokens do not affect classifier model inference, the tokenizer also produces an attention mask for each tokenized review, which is returned together with the token IDs in a dictionary.
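The padding and attention-mask step described above can be sketched in plain Python. This is a minimal illustration, not the code in src.tokenizer: the helper name `pad_and_mask` is hypothetical, right-side truncation of long sequences is an assumption, and the token IDs in the example are illustrative.

```python
from typing import Dict, List

PAD_TOKEN_ID = 0  # [PAD] ID for bert-base-uncased


def pad_and_mask(token_ids: List[int], max_length: int = 256,
                 pad_token_id: int = PAD_TOKEN_ID) -> Dict[str, List[int]]:
    """Pad (or truncate) a token ID sequence and build its attention mask."""
    # Assumption: sequences longer than max_length are simply cut on the right.
    # A real BERT tokenizer also keeps the closing [SEP] token when truncating.
    kept = token_ids[:max_length]
    n_pad = max_length - len(kept)
    return {
        # Real tokens first, then padding up to max_length.
        "input_ids": kept + [pad_token_id] * n_pad,
        # 1 marks a real token, 0 marks padding the model should ignore.
        "attention_mask": [1] * len(kept) + [0] * n_pad,
    }


example = pad_and_mask([101, 2053, 3471, 102], max_length=8)
print(example["input_ids"])       # [101, 2053, 3471, 102, 0, 0, 0, 0]
print(example["attention_mask"])  # [1, 1, 1, 1, 0, 0, 0, 0]
```

Only padding positions receive a 0 in the mask; every other token, including [CLS] and [SEP], is attended to.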

### Import Statements

In [1]:
from src.dataset import ReviewDataset
from src.dataloader_raw import ReviewDataLoader as ReviewDataLoaderRaw
from src.dataloader_tokenized import ReviewDataLoader as ReviewDataLoaderTokenized
from src.tokenizer import Tokenizer

### Environment Variables

In [2]:
data_path_all = "../../data/all.csv"
data_path_train = "../../data/train.csv"
data_path_test = "../../data/test.csv"
model = "google-bert/bert-base-uncased"
max_length = 256
batch_size = 8

### Load Data

In [3]:
dataset = ReviewDataset(data_path_all)
dataloader_raw = ReviewDataLoaderRaw(dataset, batch_size=batch_size, shuffle=False)

### Data Sample

In [4]:
features, labels = next(iter(dataloader_raw))
for i in range(len(features)):
    print(f"Review: {features[i][0:50]}... Sentiment: {labels[i]}")

Review: I&#039;ve tried a few antidepressants over the yea... Sentiment: 1
Review: Holy Hell is exactly how I feel. I had been taking... Sentiment: 0
Review: This is a waste of money.  Did not curb my appetit... Sentiment: 0
Review: No problems, watch what you eat.... Sentiment: 1
Review: I smoked for 50+ years.  Took it for one week and ... Sentiment: 1
Review: After just 1 dose of this ciprofloxacn, I felt 99%... Sentiment: 1
Review: If I could give it a 0, I would absolutely do so. ... Sentiment: 0
Review: After a few days and it &quot;kicked in,&quot; eve... Sentiment: 0


### Load Tokenizer

In [5]:
tokenizer = Tokenizer(model, max_length)
dataloader_tokenized = ReviewDataLoaderTokenized(dataset, tokenizer, batch_size=batch_size, shuffle=False)

### Tokenized Data Sample

In [6]:
features, labels = next(iter(dataloader_tokenized))
tokens = features['input_ids'].tolist()
attention_mask = features['attention_mask'].tolist()
labels = labels.tolist()
for i in range(len(labels)):
    print(f"{{Tokens: {tokens[i][0:5]}... Attention Mask: {attention_mask[i][0:5]}...}} Label: {labels[i]}")

{Tokens: [101, 1045, 1004, 1001, 6021]... Attention Mask: [1, 1, 1, 1, 1]...} Label: 1
{Tokens: [101, 4151, 3109, 2003, 3599]... Attention Mask: [1, 1, 1, 1, 1]...} Label: 0
{Tokens: [101, 2023, 2003, 1037, 5949]... Attention Mask: [1, 1, 1, 1, 1]...} Label: 0
{Tokens: [101, 2053, 3471, 1010, 3422]... Attention Mask: [1, 1, 1, 1, 1]...} Label: 1
{Tokens: [101, 1045, 20482, 2005, 2753]... Attention Mask: [1, 1, 1, 1, 1]...} Label: 1
{Tokens: [101, 2044, 2074, 1015, 13004]... Attention Mask: [1, 1, 1, 1, 1]...} Label: 1
{Tokens: [101, 2065, 1045, 2071, 2507]... Attention Mask: [1, 1, 1, 1, 1]...} Label: 0
{Tokens: [101, 2044, 1037, 2261, 2420]... Attention Mask: [1, 1, 1, 1, 1]...} Label: 0


### Decoded Tokenized Data Sample

In [7]:
for i in range(len(labels)):
    decoded = tokenizer.get_tokenizer().decode(tokens[i], skip_special_tokens=False)
    print(f"Decoded: {decoded[0:50]} | Label: {labels[i]}")

Decoded: [CLS] i & # 039 ; ve tried a few antidepressants o | Label: 1
Decoded: [CLS] holy hell is exactly how i feel. i had been  | Label: 0
Decoded: [CLS] this is a waste of money. did not curb my ap | Label: 0
Decoded: [CLS] no problems, watch what you eat. [SEP] [PAD] | Label: 1
Decoded: [CLS] i smoked for 50 + years. took it for one wee | Label: 1
Decoded: [CLS] after just 1 dose of this ciprofloxacn, i fe | Label: 1
Decoded: [CLS] if i could give it a 0, i would absolutely d | Label: 0
Decoded: [CLS] after a few days and it & quot ; kicked in,  | Label: 0


### Tokenizer Exploration

In [8]:
print(f"Vocab Size: {tokenizer.get_vocab_size()}")
print(f"Pad Token: {tokenizer.get_pad_token()} | Pad Token ID: {tokenizer.get_pad_token_id()}")
print(f"Unk Token: {tokenizer.get_unk_token()} | Unk Token ID: {tokenizer.get_unk_token_id()}")
print(f"Mask Token: {tokenizer.get_mask_token()} | Mask Token ID: {tokenizer.get_mask_token_id()}")

Vocab Size: 30522
Pad Token: [PAD] | Pad Token ID: 0
Unk Token: [UNK] | Unk Token ID: 100
Mask Token: [MASK] | Mask Token ID: 103
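As a rough sketch, the special token IDs printed above can be used to check how much of a tokenized sequence consists of special tokens. The helper `count_special_tokens` and the sample sequence are hypothetical; the [CLS] and [SEP] IDs (101 and 102) are the standard bert-base-uncased values rather than values printed in this notebook.

```python
from collections import Counter
from typing import Dict, List

# [PAD], [UNK], and [MASK] IDs come from the output above;
# [CLS] = 101 and [SEP] = 102 are the standard bert-base-uncased values.
SPECIAL_TOKEN_IDS = {"[PAD]": 0, "[UNK]": 100, "[CLS]": 101, "[SEP]": 102, "[MASK]": 103}


def count_special_tokens(input_ids: List[int]) -> Dict[str, int]:
    """Count how often each special token ID occurs in one tokenized sequence."""
    counts = Counter(input_ids)
    # Counter returns 0 for IDs that never occur.
    return {name: counts[token_id] for name, token_id in SPECIAL_TOKEN_IDS.items()}


# Illustrative sequence, not taken from the dataset.
sample_ids = [101, 2053, 100, 3471, 102, 0, 0, 0]
print(count_special_tokens(sample_ids))
# {'[PAD]': 3, '[UNK]': 1, '[CLS]': 1, '[SEP]': 1, '[MASK]': 0}
```

A high [PAD] count indicates a short review, while a high [UNK] count would suggest text that the vocabulary does not cover well.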


### Tokenized Data Exploration

In [9]:
print(f"Number of Reviews: {len(dataset)}")
print(f"Number of Tokenized Reviews: {len(dataloader_tokenized) * dataloader_tokenized.batch_size}")

features, labels = next(iter(dataloader_tokenized))
tokens = features['input_ids'].tolist()
attention_mask = features['attention_mask'].tolist()
labels = labels.tolist()
print(f"Tokens Length: {len(tokens[0])} | Attention Mask Length: {len(attention_mask[0])}")
    
print("The two counts differ because len(dataloader_tokenized) * batch_size assumes every batch is full, while the last batch is smaller than the batch size.")

Number of Reviews: 96923
Number of Tokenized Reviews: 96928
Tokens Length: 256 | Attention Mask Length: 256
The two counts differ because len(dataloader_tokenized) * batch_size assumes every batch is full, while the last batch is smaller than the batch size.
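The gap between the two counts can be reproduced with a little arithmetic. This sketch assumes the data loader keeps the final partial batch, which is consistent with the numbers printed above; the variable names are illustrative.

```python
import math

num_reviews = 96923  # len(dataset), from the output above
batch_size = 8

# Number of batches when the final partial batch is kept.
num_batches = math.ceil(num_reviews / batch_size)
# Reviews left over for the final batch.
last_batch_size = num_reviews - (num_batches - 1) * batch_size
# How far num_batches * batch_size overshoots the true count.
overcount = num_batches * batch_size - num_reviews

print(num_batches, last_batch_size, overcount)  # 12116 3 5
```

The last batch holds 3 reviews instead of 8, so multiplying the number of batches by the batch size overstates the true count by 5.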
