<a href="https://colab.research.google.com/github/KuldeepDileep/RandomWalk-Probability-and-Statistics-Project/blob/master/XLM_RoBERTa.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
import tensorflow as tf
import tensorflow_datasets as tfds

**Multi NLI** 
https://cims.nyu.edu/~sbowman/multinli/



Introduction
The Multi-Genre Natural Language Inference (MultiNLI) corpus is a crowd-sourced collection of 433k sentence pairs annotated with textual entailment information. The corpus is modeled on the SNLI corpus, but differs in that it covers a range of genres of spoken and written text and supports a distinctive cross-genre generalization evaluation. The corpus served as the basis for the shared task of the RepEval 2017 Workshop at EMNLP in Copenhagen.



**XNLI**
https://github.com/facebookresearch/XNLI

The Cross-lingual Natural Language Inference (XNLI) corpus is a crowd-sourced collection of 5,000 test and 2,500 dev pairs for the MultiNLI corpus. The pairs are annotated with textual entailment and translated into 14 languages: French, Spanish, German, Greek, Bulgarian, Russian, Turkish, Arabic, Vietnamese, Thai, Chinese, Hindi, Swahili and Urdu. This results in 112.5k annotated pairs. Each premise can be associated with the corresponding hypothesis in any of the 15 languages, summing up to more than 1.5M combinations.
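
The counts quoted above can be checked with a few lines of arithmetic. This is only a sanity check on the numbers in the description; the variable names are illustrative and not part of any XNLI tooling.

```python
# Figures taken from the XNLI description above.
test_pairs = 5_000
dev_pairs = 2_500
languages = 15  # English plus the 14 translation languages

pairs_per_language = test_pairs + dev_pairs
annotated_pairs = pairs_per_language * languages
# Any premise language can be paired with any hypothesis language.
cross_lingual_combinations = pairs_per_language * languages * languages

print(annotated_pairs)             # 112500
print(cross_lingual_combinations)  # 1687500
```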

In [None]:
# Take the first 50,000 samples from the MultiNLI training split
# Explore the MultiNLI dataset
mnli_dataset = tfds.load(name="multi_nli",split='train[:50000]')

Downloading and preparing dataset multi_nli/plain_text/1.0.0 (download: 216.34 MiB, generated: Unknown size, total: 216.34 MiB) to /root/tensorflow_datasets/multi_nli/plain_text/1.0.0...


Dataset multi_nli downloaded and prepared to /root/tensorflow_datasets/multi_nli/plain_text/1.0.0. Subsequent calls will reuse this data.


In [None]:
# Explore individual items in the MultiNLI dataset
import textwrap
# Wrap text at 80 characters
wrapper = textwrap.TextWrapper(width=80,initial_indent=' ', subsequent_indent=' ') 
label_names = ['entailment','neutral','contradiction']

# Retrieve a few random samples
for ex in mnli_dataset.shuffle(buffer_size=256).take(count=3):
  # Decode the premise
  premise = ex['premise'].numpy().decode('utf-8')
  # Decode the hypothesis
  hypothesis = ex['hypothesis'].numpy().decode('utf-8')
  # Integer label for this (premise, hypothesis) pair
  label = ex['label'].numpy()
  label_name = label_names[label]
  print("Premise: \n"  +str(wrapper.fill(premise))+"\n hypothesis: \n"+str( wrapper.fill(hypothesis))+"\n Label :  "+str(label))
  print("\n --------------------------\n")

Premise: 
 Top-level interest in the amount of improper payments at the organizations that
 participated in our study often resulted from program, audit, and/or media
 reports of misspent funds or fraudulent activities.
 hypothesis: 
 Corporate heads frequently ignored questionable financial activities until they
 came to light publicly.
 Label :  0

 --------------------------

Premise: 
 Otherwise there is nothing to be done."
 hypothesis: 
 There is much that can be done.
 Label :  2

 --------------------------

Premise: 
 Women's left eyelids ticked almost instinctively, and Herman kept silent until
 the end of the convention.
 hypothesis: 
 Herman was distracted by the women's left eyelid.
 Label :  1

 --------------------------



Importing XNLI


In [None]:
# Import the XNLI dataset
# Explore the XNLI test split
xnli_dataset = tfds.load(name='xnli',split='test')

Downloading and preparing dataset xnli/1.1.0 (download: 17.04 MiB, generated: 29.62 MiB, total: 46.65 MiB) to /root/tensorflow_datasets/xnli/1.1.0...


Dataset xnli downloaded and prepared to /root/tensorflow_datasets/xnli/1.1.0. Subsequent calls will reuse this data.


In [None]:
# Language code for Urdu in the XNLI dataset
language_code = 'ur'
# Position of 'ur' in each example's hypothesis translation list
language_index = 12
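
The hard-coded index 12 can be derived rather than memorized. The sketch below assumes the hypothesis translations are stored in alphabetical order of the 15 language codes; `XNLI_LANGUAGES` and `language_position` are hypothetical names introduced here for illustration, not part of the dataset API.

```python
# The 15 XNLI language codes. Assumption: the hypothesis translations are
# stored in alphabetical order of these codes.
XNLI_LANGUAGES = sorted([
    "en", "fr", "es", "de", "el", "bg", "ru", "tr",
    "ar", "vi", "th", "zh", "hi", "sw", "ur",
])

def language_position(code):
    """Return the position of `code` in the sorted language list."""
    return XNLI_LANGUAGES.index(code)

print(language_position("ur"))  # 12
```

If the ordering assumption does not hold for the loaded data, printing one example's hypothesis field and checking where the Urdu text sits is the safer check.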

In [None]:
import textwrap
# Wrap text at 80 characters
wrapper = textwrap.TextWrapper(width=80,initial_indent=' ', subsequent_indent=' ') 
label_names = ['entailment','neutral','contradiction']

# Retrieve a few random samples to check the labels for Urdu
for ex in xnli_dataset.shuffle(buffer_size=256).take(count=6):
  premise = ex['premise'][language_code].numpy().decode('utf-8')
  hypothesis = ex['hypothesis']['translation'][language_index].numpy().decode('utf-8')
  label = ex['label'].numpy()
  label_name = label_names[label]
  print("\n Premise: \n"  +str(wrapper.fill(premise))+"\n hypothesis:\n "+str( wrapper.fill(hypothesis))+"\nLabel :  "+str(label))
  print("\n --------------------------\n")


 Premise: 
 ہمارے تدریس کی ہسپتال اور تحقیق کے پروگراموں کو ریاستی معاونت حاصل نہیں ہوتی.
 hypothesis:
  ریسرچ پروگرام کو ریاست سے مالی امداد نہیں مل سکی کیونکہ وہ لوگوں پر استعمال
 کرتے ہیں.
Label :  1

 --------------------------


 Premise: 
 یہ قطار میں لگے ستونوں میں سب سے واضح ہوتا ہے جس سے ونسنٹ سلکی نے قدیم یونان کے
 ہتھیاروں سے فوجی دستے سے مماثلت دی ہے۔
 hypothesis:
  ونسنٹ سلی فون تعمیر کا ماہر ہے
Label :  1

 --------------------------


 Premise: 
 گلیسی آبجیکٹ کی شدت پسندوں نے انہیں بالکل فکر نہیں کی، اور اس کی بہترین وجہ سے
 وہ کوشش کرنے پر مجبور نہیں کرتا.
 hypothesis:
  کیونکہ وہ کوشش نہیں کرتا اس لئے وہ پریشان نہیں ہے
Label :  0

 --------------------------


 Premise: 
 تو میں یہ جاری رکھنا چاہتا ہوں کیونکہ مجھے معلوم ہے کہ اگر آپ نہ کریں تو آپ کو
 کئی مسائل کا سامنا کرنا پڑ سکتا ہے
 hypothesis:
  کوشش کرنے کا کوئی فائدہ نہیں ہے، اسلئے میں زحمت ہی نہیں کرتا.
Label :  2

 --------------------------


 Premise: 
 جی ہاں کچھ جگاہیں اچھی ہیں ان کو یو پی ایس یا دوسرے طری

Helper Function

In [None]:
# Install transformers and sentencepiece
!pip install transformers
!pip install sentencepiece

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting transformers
  Downloading transformers-4.20.0-py3-none-any.whl (4.4 MB)
Collecting huggingface-hub<1.0,>=0.1.0
  Downloading huggingface_hub-0.7.0-py3-none-any.whl (86 kB)
Collecting tokenizers!=0.11.3,<0.13,>=0.11.1
  Downloading tokenizers-0.12.1-cp37-cp37m-manylinux_2_12_x86_64.manylinux2010_x86_64.whl (6.6 MB)
Collecting pyyaml>=5.1
  Downloading PyYAML-6.0-cp37-cp37m-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_12_x86_64.manylinux2010_x86_64.whl (596 kB)
Installing collected packages: pyyaml, tokenizers, huggingface-hub, transformers
  Attempting uninstall: pyyaml
    Found existing installation: PyYAML 3.13
    Uninstalling PyYA

In [None]:
import torch
from transformers import AutoTokenizer
# Download the tokenizer for the XLM-RoBERTa `base` model.
xlmr_tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")

Working through the tokenizer

In [None]:
sentence_1="are you going to school"
sentence_2 = "How are you today"

# Encode the two sentences together
encoded = xlmr_tokenizer.encode_plus(sentence_1,sentence_2)
print(encoded)
# Print the IDs of the resulting tokens
print("input Ids",encoded['input_ids'])

# Convert the token IDs back to strings so we can inspect them
print("tokens", xlmr_tokenizer.convert_ids_to_tokens(encoded['input_ids']))

print("\n attention Mask:",encoded['attention_mask'])
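
As a rough picture of what the tokenizer returns for a sentence pair, the sketch below lays out two token lists in the RoBERTa-style pair format, `<s> A </s></s> B </s>`, and builds the matching attention mask. It is a simplification: `build_pair` is a hypothetical helper, and whitespace-split words stand in for the SentencePiece pieces the real tokenizer produces.

```python
def build_pair(tokens_a, tokens_b, bos="<s>", eos="</s>"):
    """Lay out two token lists as <s> A </s></s> B </s>."""
    tokens = [bos] + tokens_a + [eos, eos] + tokens_b + [eos]
    # Every position holds a real token here, so the mask is all ones.
    attention_mask = [1] * len(tokens)
    return tokens, attention_mask

tokens, mask = build_pair(["are", "you", "going"], ["How", "are", "you"])
print(tokens)
print(mask)
```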

**Choose the maximum sequence length**
To pick a reasonable value for *max_len*, the length to which all sequences will be padded or truncated, we first compute some statistics on the tokenized dataset.

In [None]:
import numpy as np

lengths_en = []

labels_en = []

print("Tokenizing all examples to check sequence lengths")
# Iterate through the dataset
for ex in mnli_dataset:
  premise = ex['premise'].numpy().decode('utf-8')
  hypothesis = ex['hypothesis'].numpy().decode('utf-8')
  
  # Report progress
  if((len(lengths_en)%30000)==0):
    print('Tokenized {:,} samples.'.format(len(lengths_en)))

  # tokenizer.encode tokenizes the sentence pair, maps the tokens to their IDs,
  # and adds the required special tokens

  encoded = xlmr_tokenizer.encode(
      premise,hypothesis, add_special_tokens=True,
      )

  lengths_en.append(len(encoded))

  labels_en.append(ex['label'].numpy())

print('Done.')
print(len(lengths_en))

In [None]:
# Some statistics: min, max and median
print("Min length:",min(lengths_en))
print("Max length:",max(lengths_en))
print("Median length:",int(np.median(lengths_en)))


In [None]:
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np

# histplot replaces the deprecated distplot (requires seaborn >= 0.11)
sns.histplot(lengths_en, kde=False)
plt.xlabel("Sequence length")
plt.ylabel('# of samples')


In [None]:
max_len = 128

# Count the number of sequences that are longer than max_len tokens
num_truncated = np.sum(np.greater(lengths_en, max_len))

# Compare this to the total number of training sequences
num_sentences = len(lengths_en)
prcnt = float(num_truncated)/float(num_sentences)

print("Fraction of sequences longer than max_len:",prcnt)
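
Instead of fixing `max_len` up front and then measuring the truncation rate, the same length statistics can drive the choice. The sketch below picks the smallest candidate length that fits a target share of sequences. The helper name, the candidate lengths, the 95% target and the toy data are all assumptions made for illustration.

```python
def smallest_covering_length(lengths, coverage=0.95,
                             candidates=(32, 64, 128, 256, 512)):
    """Return the smallest candidate that fits at least `coverage` of sequences."""
    for max_len in candidates:
        fitting = sum(1 for n in lengths if n <= max_len)
        if fitting / len(lengths) >= coverage:
            return max_len
    # Nothing reached the target, so fall back to the largest candidate.
    return candidates[-1]

# Toy lengths: 90 short pairs, 9 medium ones and one very long one.
toy_lengths = [20] * 90 + [100] * 9 + [300]
print(smallest_covering_length(toy_lengths))  # 128
```

In the notebook, `lengths_en` would take the place of `toy_lengths`.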


In [None]:
# class balance

sns.countplot(x=labels_en)

plt.title('Class Distribution')
plt.xlabel('Category')
plt.ylabel('# of training Samples')
plt.show()

## Tokenizing the English Dataset

We encode every premise-hypothesis pair, truncating or padding each one to *max_len* tokens.
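
To make the truncation and padding step concrete, here is a minimal pure-Python sketch of what happens to a single sequence of token IDs. `pad_or_truncate` is a hypothetical helper. The padding ID of 1 is assumed to match the XLM-R vocabulary. The real tokenizer also keeps the closing special token when it truncates, which this sketch ignores.

```python
def pad_or_truncate(ids, max_len, pad_id=1):
    """Cut `ids` to `max_len` or right-pad with `pad_id`; also build the mask."""
    ids = ids[:max_len]
    n_pad = max_len - len(ids)
    # 1 marks a real token, 0 marks padding the model should not attend to.
    attention_mask = [1] * len(ids) + [0] * n_pad
    return ids + [pad_id] * n_pad, attention_mask

print(pad_or_truncate([0, 5, 6, 2], max_len=6))
print(pad_or_truncate([0, 5, 6, 7, 8, 2], max_len=4))
```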

In [None]:
import torch

labels_en =[]
input_ids_en = []
attn_mask_en = []

print("Encoding all examples")

# Iterate through the dataset
for ex in mnli_dataset:
  # Report progress (count the examples encoded so far in this loop)
  if((len(input_ids_en)%30000)==0):
    print('Encoded {:,} samples.'.format(len(input_ids_en)))

  premise = ex['premise'].numpy().decode('utf-8')
  hypothesis = ex['hypothesis'].numpy().decode('utf-8')

  encoded_dict = xlmr_tokenizer.encode_plus(premise,hypothesis,
                                          max_length = max_len,
                                          padding = 'max_length',
                                          truncation=True,
                                          return_tensors = 'pt')
  
  # Add this example to our lists
  input_ids_en.append(encoded_dict["input_ids"])
  attn_mask_en.append(encoded_dict["attention_mask"])
  labels_en.append(ex['label'].numpy())

  
  # Note: XLM-R does not use segment IDs, so only input IDs and attention masks are kept

# Convert each Python list of tensors into a 2D tensor matrix
input_ids_en = torch.cat(input_ids_en, dim=0)
attn_mask_en = torch.cat(attn_mask_en, dim=0)

# Cast the labels into a tensor

labels_en =  torch.tensor(labels_en)

print(len(labels_en),"Samples")

# Train-Validation Split
Set aside 10% of our training examples for validation.

In [None]:
from torch.utils.data import TensorDataset, random_split

# Combine the training inputs into a TensorDataset
dataset = TensorDataset(input_ids_en,attn_mask_en, labels_en)

# Create a 90-10 train-validation split
train_size = int(0.9*(len(dataset)))
val_size = len(dataset) - train_size

train_dataset, val_dataset = random_split(dataset,[train_size, val_size])
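
For intuition, a 90-10 random split amounts to shuffling the example indices and cutting the list once. The sketch below uses only the standard library. `split_indices` and the fixed seed are assumptions for illustration and do not reproduce the exact shuffle that PyTorch performs.

```python
import random

def split_indices(n_examples, train_fraction=0.9, seed=42):
    """Shuffle the indices 0..n-1 and cut them into train and validation parts."""
    indices = list(range(n_examples))
    random.Random(seed).shuffle(indices)
    train_size = int(train_fraction * n_examples)
    return indices[:train_size], indices[train_size:]

train_idx, val_idx = split_indices(50_000)
print(len(train_idx), len(val_idx))  # 45000 5000
```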

# Set the batch size of the DataLoader
The PyTorch DataLoader takes care of randomly grouping our training data into batches.

In [None]:
from torch.utils.data import DataLoader, RandomSampler,SequentialSampler

batch_size=16

# Training batches are drawn in random order
train_dataloader = DataLoader(train_dataset,
                          sampler=RandomSampler(train_dataset),
                          batch_size = batch_size)


# Order does not matter for validation, so read the batches sequentially
validation_dataloader = DataLoader(val_dataset,
                          sampler=SequentialSampler(val_dataset),
                          batch_size = batch_size)
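
The number of batches per epoch follows directly from the dataset size and the batch size. A quick check, assuming the 45,000 training examples left after the 90-10 split of 50,000 samples and the 3 epochs used in this notebook (`num_batches` is a hypothetical helper):

```python
import math

def num_batches(n_examples, batch_size):
    """Batches per epoch when the last, smaller batch is kept."""
    return math.ceil(n_examples / batch_size)

train_batches = num_batches(45_000, 16)
steps_for_3_epochs = train_batches * 3
print(train_batches, steps_for_3_epochs)  # 2813 8439
```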

# Load Pre-Trained Model


In [None]:
from transformers import XLMRobertaForSequenceClassification
import torch
xlmr_model = XLMRobertaForSequenceClassification.from_pretrained("xlm-roberta-base",num_labels=3)


In [None]:
import tensorflow as tf
tf.test.gpu_device_name()

In [None]:
print("Loading model to GPU")
import torch
#connect to GPU
device = torch.device('cuda')

# Report which GPU was granted
print('GPU:',torch.cuda.get_device_name(0))

#copy the initial model with weights to GPU
desc = xlmr_model.to(device)
print("Done")

# Learning Rate


In [None]:
from transformers import AdamW

optimizer  = AdamW(xlmr_model.parameters(),
                   lr = 2e-6, # learning rate
                   eps=1e-8 # Adam epsilon (args.adam_epsilon)
                   )

In [None]:
from transformers import get_linear_schedule_with_warmup

# Number of training epochs
epochs = 3
# Total number of training steps is [number of batches] * [number of epochs]
total_steps = len(train_dataloader)*epochs

# Create the learning rate scheduler
scheduler  =  get_linear_schedule_with_warmup( optimizer, 
                                              num_warmup_steps = 0,
                                              num_training_steps = total_steps)
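
A linear schedule with warmup ramps the learning rate up from zero over the warmup steps and then decays it linearly to zero by the last training step. The function below is a hand-written sketch of that rule, not the library code; `linear_schedule_lr` is a hypothetical name, and the 1,000-step total is a toy value.

```python
def linear_schedule_lr(step, base_lr, total_steps, warmup_steps=0):
    """Learning rate at `step`: linear warmup, then linear decay to zero."""
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    remaining = total_steps - step
    return base_lr * max(0.0, remaining / max(1, total_steps - warmup_steps))

# With no warmup the rate starts at base_lr and reaches zero at the last step.
for step in (0, 500, 1000):
    print(step, linear_schedule_lr(step, base_lr=2e-6, total_steps=1000))
```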


# Training Loop
The usual training loop. Note that we do not pass segment IDs, even though this is a sentence-pair task, because XLM-R does not use them.

In [None]:
#from tqdm.auto import tqdm

#progress_bar = tqdm(range(total_step))
#model.train()

#for epoch in range(epochs):
#    for batch in train_dataloader:
#        batch = {k: v.to(device) for k, v in batch.items()}
#        outputs = model(**batch)
#        loss = outputs.loss
#        loss.backward()

#        optimizer.step()
#        lr_scheduler.step()
#        optimizer.zero_grad()
#        progress_bar.update(1)

In [None]:
import random
import time  # needed for the time.time() calls below
import numpy as np
from transformers import get_scheduler

from tqdm.auto import tqdm

progress_bar = tqdm(range(total_steps))

lr_scheduler = get_scheduler(
    "linear",
    optimizer=optimizer,
    num_warmup_steps=0,
    num_training_steps=total_steps,
)

# Seed every random number generator for reproducibility
seed_val = 42
random.seed(seed_val)
np.random.seed(seed_val)
torch.manual_seed(seed_val)
torch.cuda.manual_seed_all(seed_val)



training_stats = []

total_t0 =time.time()


for epoch_i in range(0,epochs):
  #training
  print("Epoch:",epoch_i+1)


  t0 = time.time()
  total_train_loss = 0

  xlmr_model.train()

  # Pick an interval at which to print progress updates
  # update_interval = good_update_interval(
  #     total_iters=len(train_dataloader),
  #     num_desired_updates=10)

  #For each batch of train_data
  for step, batch in enumerate(train_dataloader):
    #progress update 
    
    b_input_ids = batch[0].to(device)
    b_input_mask = batch[1].to(device)
    b_lables = batch[2].to(device)

    xlmr_model.zero_grad()
    
    
    #loss,logits = xlmr_model(b_input_ids,attention_mask = b_input_mask,labels=b_lables)
    
    outputs =xlmr_model(b_input_ids,attention_mask = b_input_mask,labels=b_lables)
    loss = outputs.loss
    loss.backward()

    optimizer.step()
    lr_scheduler.step()
    optimizer.zero_grad()
    progress_bar.update(1)

    #total_train_loss +=loss.item()
    #print(loss)
    #break

    #loss.backward()

    #torch.nn.utils.clip_grad_norm(xlmr_model.parameters(),1.0)

    #optimizer.step()

    #scheduler.step()
  
  #avg_train_losss  = total_train_loss/len(train_dataloader)


  #put the model in evaluation mode
  xlmr_model.eval()

  #total_eval_loss = 0

  #predictions, true_labels = [],[]

  for batch in validation_dataloader:

    b_input_ids = batch[0].to(device)
    b_input_mask = batch[1].to(device)
    b_lables = batch[2].to(device)

    with torch.no_grad():
      outputs = xlmr_model(b_input_ids,attention_mask = b_input_mask,labels=b_lables)

# Mounting Google Drive

In [None]:
from google.colab import drive
drive.mount('/content/gdrive')

In [None]:
xlmr_model.save_pretrained("/content/gdrive/MyDrive/Model")

In [None]:
#loading a model
xlmr_model = XLMRobertaForSequenceClassification.from_pretrained("/content/gdrive/MyDrive/Model", local_files_only=True)

# For CPU

In [None]:
xlmr_model.eval()
predictions=[]
true_labels=[]
for batch in validation_dataloader:

    b_input_ids = batch[0]
    b_input_mask = batch[1]
    b_lables = batch[2]

    with torch.no_grad():
      outputs = xlmr_model(b_input_ids,attention_mask = b_input_mask)
    logits = outputs.logits
    # Move logits and labels to NumPy arrays on the CPU
    logits=logits.detach().cpu().numpy()
    label_ids = b_lables.to('cpu').numpy()
    # Store the raw logits; the argmax is taken after all batches are collected
    predictions.append(logits)
    true_labels.append(label_ids)

In [None]:
print("Loading model to GPU")
import torch
#connect to GPU
device = torch.device('cuda')

# Report which GPU was granted
print('GPU:',torch.cuda.get_device_name(0))

#copy the initial model with weights to GPU
desc = xlmr_model.to(device)
print("Done")

In [None]:
xlmr_model.eval()
predictions=[]
true_labels=[]
for batch in validation_dataloader:

    b_input_ids = batch[0].to(device)
    b_input_mask = batch[1].to(device)
    b_lables = batch[2].to(device)

    with torch.no_grad():
      outputs = xlmr_model(b_input_ids,attention_mask = b_input_mask)
    logits = outputs.logits
    # Move logits and labels back to the CPU as NumPy arrays
    logits=logits.detach().cpu().numpy()
    label_ids = b_lables.to('cpu').numpy()
    # Store the raw logits; the argmax is taken after all batches are collected
    predictions.append(logits)
    true_labels.append(label_ids)


In [None]:
predictions[1]

In [None]:
#combine the results across all batches
#predictions
flat_predictions = np.concatenate(predictions, axis=0)
flat_true_labels = np.concatenate(true_labels, axis=0)

# For each sample, pick the label with the highest score
predicted_labels = np.argmax(flat_predictions, axis=1).flatten()

In [None]:
# The fraction of correct predictions gives our accuracy
accuracy = (predicted_labels==flat_true_labels).mean()
print('XLMR- prediction accuracy',accuracy)
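
The accuracy computation above reduces to taking the argmax of each row of logits and comparing it with the true label. A dependency-free sketch on toy values (the logits and labels below are made up for illustration, and `accuracy_from_logits` is a hypothetical helper):

```python
def accuracy_from_logits(logits, labels):
    """Share of rows whose highest-scoring class equals the true label."""
    predicted = [max(range(len(row)), key=row.__getitem__) for row in logits]
    correct = sum(int(p == y) for p, y in zip(predicted, labels))
    return correct / len(labels)

toy_logits = [[2.0, 0.1, -1.0], [0.2, 0.3, 1.5], [0.0, 1.0, 0.5], [1.2, 0.4, 0.1]]
toy_labels = [0, 2, 2, 0]
print(accuracy_from_logits(toy_logits, toy_labels))  # 0.75
```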

## Test on Urdu DataSet


In [None]:
#tokenize the test samples
#retrieve the test set
xnli_test_dataset=tfds.load(name='xnli',split='test')

In [None]:
import torch
labels_ur = []
input_ids_ur = []
attn_masks_ur= []

print("Encoding test examples")


for ex in xnli_test_dataset:
  premise = ex['premise'][language_code].numpy().decode('utf-8')
  hypothesis = ex['hypothesis']['translation'][language_index].numpy().decode('utf-8')
  

  # Convert the sentence pair to input IDs with an attention mask
  encoded_dict = xlmr_tokenizer.encode_plus(premise, hypothesis,
                                       max_length = max_len,
                                       padding='max_length',
                                       truncation=True,
                                       return_tensors = 'pt')
  input_ids_ur.append(encoded_dict['input_ids'])
  attn_masks_ur.append(encoded_dict['attention_mask'])
  labels_ur.append(ex['label'].numpy())

# Convert each Python list of tensors into a 2D tensor matrix
input_ids_ur = torch.cat(input_ids_ur,dim=0)
attn_masks_ur = torch.cat(attn_masks_ur,dim=0)

# Cast the labels into a tensor
labels_ur = torch.tensor(labels_ur)
print(len(labels_ur))


In [None]:
print(type(labels_ur))

In [None]:
#construct a TensordataSet from encoded examples
prediction_dataset = TensorDataset(input_ids_ur,attn_masks_ur,labels_ur)

#dataloader for handling batches
prediction_dataloader = DataLoader(prediction_dataset,batch_size=batch_size)

In [None]:
# Class balance of the Urdu test set

sns.countplot(x=labels_ur.numpy())

plt.title('Class Distribution')
plt.xlabel('Category')
plt.ylabel('# of test samples')
plt.show()

# Run Predictions

In [None]:
# Predictions on the test set
print("Predicting labels for {:,} test examples".format(len(labels_ur)))

# Put the model in evaluation mode
xlmr_model.eval()

# Tracking variables
predictions, true_labels=[],[]

# Predict
for batch in prediction_dataloader:
  # Move the batch to the GPU
  batch = tuple(t.to(device) for t in batch)

  # Unpack the inputs from our data loader
  b_input_ids, b_input_mask, b_lables = batch

  with torch.no_grad():
    outputs = xlmr_model(b_input_ids,
                        attention_mask = b_input_mask)
    # Read the logits from this batch's outputs
    logits = outputs.logits
    logits=logits.detach().cpu().numpy()
    label_ids = b_lables.to('cpu').numpy()
    predictions.append(logits)
    true_labels.append(label_ids)



In [None]:
print(label_ids)

In [None]:
#combine the results across all batches
#predictions
flat_predictions = np.concatenate(predictions, axis=0)
flat_true_labels = np.concatenate(true_labels, axis=0)

# For each sample, pick the label with the highest score
predicted_labels = np.argmax(flat_predictions, axis=1).flatten()

In [None]:
# The fraction of correct predictions gives our accuracy
print(len(flat_true_labels))
accuracy = (predicted_labels==flat_true_labels).mean()
print('XLMR- prediction accuracy',accuracy)