**Domain Data Preparation** \\
The first step is to download the Amazon Product Review dataset from the UCSD repository and preprocess the data. \\

We train on the product metadata rather than the reviews, because we want the model to become familiar with the different types of products and their uses. \\

We have uploaded the dataset to our Google Drive (it can also be downloaded with `wget`, but that takes some time). During preprocessing we drop the columns we do not need. We do not edit the text itself, since this stage is domain training. More detailed preprocessing, such as removing unwanted characters and stopwords, is done in the task-specific training.

In [2]:
import gzip
import json 
import pandas as pd 

#Reading the gzipped JSON-lines file (one JSON object per line)
data = []
count=0
with gzip.open('meta_All_Beauty.json.gz') as f:
  for l in f:
    data.append(json.loads(l.strip()))
    #Optional cap on the number of records read:
    # count+=1
    # if count>300000:
    #     break
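
For quick experiments it can help to read only the first few records instead of the whole file. The following is a minimal sketch of that idea, not the notebook's own loader: `read_json_lines` and its `limit` parameter are hypothetical names, and the demo file is made up so the snippet runs on its own.

```python
import gzip
import itertools
import json
import os
import tempfile


def read_json_lines(path, limit=None):
    """Read up to `limit` records from a gzipped JSON-lines file (all if None)."""
    with gzip.open(path, "rt", encoding="utf-8") as f:
        lines = itertools.islice(f, limit)
        return [json.loads(line) for line in lines if line.strip()]


# Demo on a tiny temporary file (made-up records, not the real dataset).
tmp_dir = tempfile.mkdtemp()
demo_path = os.path.join(tmp_dir, "demo_meta.json.gz")
with gzip.open(demo_path, "wt", encoding="utf-8") as f:
    for i in range(5):
        f.write(json.dumps({"title": f"item {i}", "description": []}) + "\n")

sample = read_json_lines(demo_path, limit=3)
print(len(sample), sample[0]["title"])
```

Passing `limit=None` reads every record, which matches the behaviour of the loading cell when no cap is applied.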

In [None]:
print(data[0])

In [3]:
#Loading the dataset into a Dataframe
df = pd.DataFrame.from_dict(data)

In [4]:
len(df)

32892

In [None]:
print(df.description.iloc[0], df.title.iloc[0])

In [None]:
df.description.iloc[1][0]

In [5]:
#Function to flatten the description column of the dataframe:
#keeps the first entry of the description list, or '' when the list is empty
def edit_description(desc):
  if desc==[]:
    return ''
  else:
    return desc[0]

In [6]:
df['description'] = df['description'].apply(edit_description)

In [7]:
#Filtering the dataset
# 1) Replace NA entries with empty strings
# 2) Keep only the rows that have both a non-empty title and a non-empty description
df = df.fillna('')
filtered_df = df[((df.title !='') & (df.description!=''))]

In [None]:
len(filtered_df)
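
To make the flattening and filtering steps concrete, here is a small sketch on made-up rows. The `toy` DataFrame and its contents are assumptions for illustration only; the real metadata has many more columns.

```python
import pandas as pd

# Made-up rows that mimic the shape of the metadata: `description` is a list.
toy = pd.DataFrame({
    "title": ["Rose soap", "Comb", None],
    "description": [["Gentle bar soap.", "Extra text"], [], ["No title here"]],
})

# Keep the first description entry, or '' when the list is empty.
toy["description"] = toy["description"].apply(lambda d: d[0] if d else "")
toy = toy.fillna("")

# Keep only rows where both fields are non-empty.
toy_filtered = toy[(toy.title != "") & (toy.description != "")]
print(toy_filtered)
```

Only the first row survives: the second has an empty description and the third has no title.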

In [8]:
#Combining the title and description into one column - 'custom_input'
#Working on an explicit copy avoids pandas' SettingWithCopyWarning
filtered_df = filtered_df.copy()
#The space separator keeps the last word of the title from fusing with the first word of the description
filtered_df['custom_input'] = filtered_df['title'] + ' ' + filtered_df['description']


In [9]:
train_df = filtered_df[['custom_input']]

In [None]:
train_df.head()

**Data Visualization** \\
We plot a word cloud to see which words occur most often in the training data.


In [None]:
import nltk
# nltk.download('stopwords')
from nltk.corpus import stopwords
import matplotlib.pyplot as plt
%matplotlib inline
from wordcloud import WordCloud 

#Named stop_words so that the nltk `stopwords` module is not shadowed; "br" and "href" are HTML leftovers
stop_words = stopwords.words('english') + ["br","href"]
text = " ".join(train_df.custom_input)

wordcloud = WordCloud(stopwords=stop_words).generate(text)

plt.imshow(wordcloud, interpolation='bilinear')
plt.axis('off')
plt.show()
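
A word cloud gives a visual impression; a plain frequency count gives the same information as numbers. The sketch below is an illustrative alternative, not part of the original notebook: `top_words`, the tiny stopword set, and the sample texts are all hypothetical.

```python
import re
from collections import Counter

# Tiny hypothetical stopword set; the notebook uses NLTK's English list plus "br" and "href".
toy_stopwords = {"a", "the", "for", "and", "with", "br", "href"}


def top_words(texts, stop_words, n=3):
    """Count lowercase alphabetic tokens across texts, skipping stop words."""
    counts = Counter()
    for text in texts:
        tokens = re.findall(r"[a-z]+", text.lower())
        counts.update(tok for tok in tokens if tok not in stop_words)
    return counts.most_common(n)


toy_texts = [
    "Rose soap for dry skin",
    "Gentle soap with rose oil <br>",
    "Skin cream and soap",
]
print(top_words(toy_texts, toy_stopwords))
```

On the real data, the same function could be called with `train_df.custom_input` and the full stopword list.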

**Data Splitting** \\
Preparing the training data. Note that the cells below only shuffle the rows: all rows are used for training and no validation set is held out.

In [None]:
!pip install transformers

In [10]:
import torch 
from transformers import BertTokenizer, BertForMaskedLM
from torch.utils.data import DataLoader, Dataset



In [11]:
#Shuffling the rows with a fixed seed; no validation split is held out, so all rows are used for training
train_data = train_df.sample(frac=1, random_state=42)
train_data = train_data.reset_index(drop=True)

In [12]:
#Initializing the tokenizer and model
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertForMaskedLM.from_pretrained('bert-base-uncased')

Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertForMaskedLM: ['cls.seq_relationship.bias', 'cls.seq_relationship.weight']
- This IS expected if you are initializing BertForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


**Tokenization** \\
Tokenize the text data using the BERT tokenizer. This involves converting the text into a sequence of tokens that can be fed into the BERT model.

Because the dataset is small, we use the base model (110M parameters) rather than the BERT large model (340M parameters).

Several other pretrained tokenizers are available, but we use the BERT tokenizer, which matches the model's vocabulary. A tokenizer splits text into tokens according to a set of rules; the tokens are then mapped to integer IDs and packed into tensors, which become the model inputs. To have the tokenizer return tensors directly, set the `return_tensors` parameter to `pt` for PyTorch or `tf` for TensorFlow.
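
The text-to-IDs-to-padded-input pipeline can be illustrated with a toy example. This is a deliberately simplified sketch: it splits on whitespace, whereas the real BERT tokenizer uses WordPiece with a 30,522-entry vocabulary. `toy_vocab` and `toy_encode` are hypothetical names, and the IDs are not BERT's.

```python
# Toy illustration only: real BERT tokenization uses WordPiece, not whitespace splitting.
PAD, CLS, SEP, UNK = "[PAD]", "[CLS]", "[SEP]", "[UNK]"
toy_vocab = {PAD: 0, UNK: 1, CLS: 2, SEP: 3, "rose": 4, "soap": 5, "gentle": 6}


def toy_encode(text, max_length=8):
    """Lowercase, split on whitespace, map to ids, then truncate and pad."""
    tokens = [CLS] + text.lower().split()[: max_length - 2] + [SEP]
    ids = [toy_vocab.get(tok, toy_vocab[UNK]) for tok in tokens]
    attention_mask = [1] * len(ids)
    padding = max_length - len(ids)
    return {
        "input_ids": ids + [toy_vocab[PAD]] * padding,
        "attention_mask": attention_mask + [0] * padding,
    }


encoded = toy_encode("Rose soap gentle cleanser")
print(encoded["input_ids"])       # [2, 4, 5, 6, 1, 3, 0, 0]
print(encoded["attention_mask"])  # [1, 1, 1, 1, 1, 1, 0, 0]
```

The unknown word maps to `[UNK]`, and the attention mask is 0 exactly where padding was added, which is how the model knows to ignore those positions.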

In [14]:
#DataLoader needs a Dataset object that implements two methods: __getitem__ and __len__
class AmazonReviewsDataset(Dataset):
    #This class handles the conversion of data into a Dataset object
    def __init__(self, data, tokenizer):
        self.data = data
        self.tokenizer = tokenizer
    def __getitem__(self, index):
        title_desc = self.data.loc[index, 'custom_input']
        #`encoded` rather than `input`, to avoid shadowing the Python builtin
        encoded = self.tokenizer(title_desc, return_tensors='pt', add_special_tokens=True, max_length=512, truncation=True, padding='max_length')
        input_ids = encoded.input_ids.squeeze(0)
        attention_mask = encoded.attention_mask.squeeze(0)
        #Keep an unmasked copy of the input ids as the labels
        #Note: every position keeps its label, so the loss covers all tokens, not only the masked ones
        labels = input_ids.clone()

        #Now we create the mask - each non-special token has a 15% chance of getting masked
        #rand is a tensor of floats in [0, 1) with the same shape as input_ids
        rand = torch.rand(input_ids.shape)
        #Never mask [CLS] (101), [SEP] (102) or [PAD] (0)
        mask_arr = (rand<0.15) & (input_ids!=101) & (input_ids!=102) & (input_ids!=0)
        #Replace the selected positions with [MASK] (103)
        input_ids[mask_arr] = 103
        return {'input_ids':input_ids, 'attention_mask': attention_mask, 'labels': labels}
    def __len__(self):
        return len(self.data)
train_dataset = AmazonReviewsDataset(train_data, tokenizer)

Calling the tokenizer directly, rather than its `encode` method, returns the token_type_ids and the attention mask along with the input IDs, so we do not have to build the mask ourselves. And because we asked for PyTorch tensors (`return_tensors='pt'`), there is no need to wrap the output in `torch.tensor()`.
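
The masking step inside `__getitem__` can be illustrated without PyTorch. The sketch below uses the same special-token IDs (101, 102, 103, 0) but differs from the notebook in one respect, which is an assumption made here for illustration: unmasked positions get the label -100, the value that PyTorch's cross-entropy loss and the Hugging Face models ignore, so only masked positions would be scored. `mask_tokens` is a hypothetical helper and the token IDs in the demo are made up.

```python
import random

CLS_ID, SEP_ID, MASK_ID, PAD_ID = 101, 102, 103, 0
SPECIAL_IDS = {CLS_ID, SEP_ID, PAD_ID}


def mask_tokens(input_ids, mask_prob=0.15, rng=None):
    """Return (masked_ids, labels) for one sequence of token ids."""
    rng = rng or random.Random()
    masked, labels = [], []
    for tok in input_ids:
        if tok not in SPECIAL_IDS and rng.random() < mask_prob:
            masked.append(MASK_ID)   # hide the token from the model
            labels.append(tok)       # score the prediction at this position
        else:
            masked.append(tok)
            labels.append(-100)      # ignored by the loss
    return masked, labels


# Made-up token ids, padded with zeros.
ids = [101, 2023, 2003, 1037, 3231, 6251, 102, 0, 0]
masked_ids, labels = mask_tokens(ids, rng=random.Random(0))
print(masked_ids)
print(labels)
```

With labels built this way the reported loss reflects only how well the masked tokens are recovered, so it is not directly comparable to a loss averaged over every position.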

In [None]:
train_dataset[0]

**Model Domain Training** \\
We continue training the pretrained `bert-base-uncased` model on the dataset for 10 epochs using Masked Language Modeling (MLM). BERT was originally pretrained with two objectives: 1) MLM and 2) NSP (Next Sentence Prediction). Since we use only MLM here, we use the `BertForMaskedLM` model.
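
For reference, the MLM objective can be written as follows. The notation is chosen here for illustration: $x$ is the original token sequence, $\tilde{x}$ is the same sequence with some positions replaced by [MASK], $M$ is the set of positions that are scored, and $p_\theta$ is the model's predicted distribution over the vocabulary.

```latex
\mathcal{L}_{\text{MLM}}(\theta) = -\frac{1}{|M|} \sum_{i \in M} \log p_\theta\left(x_i \mid \tilde{x}\right)
```

In the standard recipe $M$ contains only the masked positions. In this notebook the labels are a full copy of the input IDs, so $M$ covers every position, including unmasked and padding tokens.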

[SEP] - 102 \\
[MASK] - 103 \\
[CLS] - 101 \\
We are mainly concerned with the input_ids. The token_type_ids are useful only when the input contains more than one sequence, so we ignore them. The attention_mask, which tells the model which tokens to attend to, is simply passed through to the model so that padding is ignored.

In [None]:
len(train_dataset)

In [15]:
#We don't need the token_type_ids, because each input is a single sequence
#For training we need input_ids with the mask tokens and output labels (without the mask tokens)

#Now we have our tokens in the correct format and dimensionality, but we need to serve them in batches through a DataLoader object
train_loader = DataLoader(train_dataset, batch_size=8, shuffle=True)
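
The number of batches per epoch follows directly from the dataset size and the batch size. As a small worked example, the sketch below inverts that relationship using the 1823 iterations per epoch reported in the training log; `batches_per_epoch` is a hypothetical helper, and it assumes the DataLoader default `drop_last=False`.

```python
import math


def batches_per_epoch(num_examples, batch_size):
    """Number of batches a DataLoader yields per epoch when drop_last=False."""
    return math.ceil(num_examples / batch_size)


batch_size = 8
observed_batches = 1823  # taken from the progress bar in the training log

# Range of dataset sizes consistent with the observed batch count.
min_examples = (observed_batches - 1) * batch_size + 1
max_examples = observed_batches * batch_size
print(min_examples, max_examples)  # 14577 14584
```

So the training set holds between 14,577 and 14,584 rows, which is the number of metadata entries that kept both a title and a description.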

In [16]:
#Checking if we have a GPU and, if so, moving the model to it
device = torch.device('cuda') if torch.cuda.is_available() else torch.device('cpu')
model.to(device)

BertForMaskedLM(
  (bert): BertModel(
    (embeddings): BertEmbeddings(
      (word_embeddings): Embedding(30522, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (token_type_embeddings): Embedding(2, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): BertEncoder(
      (layer): ModuleList(
        (0-11): 12 x BertLayer(
          (attention): BertAttention(
            (self): BertSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): BertSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
              (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_a

In [17]:
#Enabling model's training mode
model.train()


In [18]:
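from torch.optim import AdamW
#We are using Adam with Weight Decay as the optimizer
#torch.optim.AdamW replaces the deprecated transformers.AdamW; its default weight_decay is 0.01 (the transformers default was 0.0)
optim = AdamW(model.parameters(), lr=1e-5)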



In [19]:
from tqdm import tqdm
#tqdm allows us to create a progress bar during training 

epochs = 10
#Be careful with the number of epochs when training transformer models - they tend to overfit very easily

for epoch in range(epochs):
  #Start training loop 
  loop = tqdm(train_loader, leave = True)
  #leave=True, leaves a progress bar rather than replacing it with a new one after each epoch
  
  #iterating over batches in loop
  for batch in loop:
    optim.zero_grad()
    #Clear the gradients left over from the previous step; PyTorch accumulates them by default
    
    input_ids = batch['input_ids'].to(device) 
    attention_mask = batch['attention_mask'].to(device)
    labels = batch['labels'].to(device)
    #The batch must be on the same device as the model

    outputs = model(input_ids, attention_mask=attention_mask, labels=labels)
    #Extracting the loss from the outputs

    loss = outputs.loss 
    loss.backward() #Computes the gradient of the loss with respect to every parameter in our model
    optim.step() #Updates every parameter using its gradient

    loop.set_description(f'Epoch {epoch}') 
    loop.set_postfix(loss=loss.item())

Epoch 0: 100%|█████████████████████████████████████████████████████████| 1823/1823 [07:17<00:00,  4.16it/s, loss=0.114]
Epoch 1: 100%|████████████████████████████████████████████████████████| 1823/1823 [07:17<00:00,  4.17it/s, loss=0.0455]
Epoch 2: 100%|█████████████████████████████████████████████████████████| 1823/1823 [07:17<00:00,  4.17it/s, loss=0.146]
Epoch 3: 100%|████████████████████████████████████████████████████████| 1823/1823 [07:16<00:00,  4.17it/s, loss=0.0709]
Epoch 4: 100%|████████████████████████████████████████████████████████| 1823/1823 [07:16<00:00,  4.17it/s, loss=0.0573]
Epoch 5: 100%|████████████████████████████████████████████████████████| 1823/1823 [07:16<00:00,  4.17it/s, loss=0.0651]
Epoch 6: 100%|████████████████████████████████████████████████████████| 1823/1823 [07:16<00:00,  4.17it/s, loss=0.0493]
Epoch 7: 100%|████████████████████████████████████████████████████████| 1823/1823 [07:17<00:00,  4.17it/s, loss=0.0577]
Epoch 8: 100%|██████████████████████████

In [20]:
torch.save(model.state_dict(), 'Model_Domain_weigths')