# Everything You Need To Know About TPUs & Bidirectional Encoder Representations from Transformers [BERT]


# What Is A TPU?

A tensor processing unit (TPU) is an AI accelerator application-specific integrated circuit (ASIC) developed by Google specifically for neural network machine learning, particularly using Google's own TensorFlow software.

The TPU was announced in May 2016 at Google I/O, when the company said that it had already been used inside its data centers for over a year. The chip was designed specifically for Google's TensorFlow framework, a symbolic math library used for machine learning applications such as neural networks. However, as of 2017 Google still used CPUs and GPUs for other types of machine learning. Other vendors are also producing AI accelerator designs, aimed at the embedded and robotics markets.





In [None]:
from IPython.display import YouTubeVideo
YouTubeVideo('zEOtG-ChmZE', width=800, height=400)

# Why Use A TPU?

TPUs are highly optimised for large batches and CNNs and offer very high training throughput, which can substantially reduce your training time.
![Image](https://wideops.com/wp-content/uploads/2019/07/Google_Cloud_TPU_v3_Pod_speed-ups_sCUcjjC.max-1000x1000.png)


# Some Great Kernels Demonstrating The Power Of TPU  🔥

- [Incredible TPUs - finetune EffNetB0-B6 at once](https://www.kaggle.com/agentauers/incredible-tpus-finetune-effnetb0-b6-at-once)

- [Super-duper fast pytorch tpu kernel... 🔥🔥🔥🔥🔥](https://www.kaggle.com/abhishek/super-duper-fast-pytorch-tpu-kernel)

- [Accelerator Power Hour (PyTorch + TPU)](https://www.kaggle.com/abhishek/accelerator-power-hour-pytorch-tpu)

- [[TPU-Inference] Super Fast XLMRoberta](https://www.kaggle.com/shonenkov/tpu-inference-super-fast-xlmroberta)

- [Triple Stratified KFold with TFRecords](https://www.kaggle.com/cdeotte/triple-stratified-kfold-with-tfrecords)

# Bidirectional Encoder Representations from Transformers [BERT]  

Bidirectional Encoder Representations from Transformers is a technique for NLP pre-training developed by Google. BERT was created and published in 2018 by Jacob Devlin and his colleagues from Google. Google is leveraging BERT to better understand user searches.

When BERT was published, it achieved state-of-the-art performance on a number of natural language understanding tasks:

* GLUE (General Language Understanding Evaluation) task set (consisting of 9 tasks)
* SQuAD (Stanford Question Answering Dataset) v1.1 and v2.0
* SWAG (Situations With Adversarial Generations)

The reasons for BERT's state-of-the-art performance on these natural language understanding tasks are not yet well understood. Current research has focused on the relationship between carefully chosen input sequences and BERT's output, on analysing internal vector representations through probing classifiers, and on the relationships represented by attention weights.

An appropriate model for the given problem: bert-base-multilingual-cased

* You should pick a “cased” model or an “uncased” model depending on whether you think letter casing will be helpful for the task you’re trying to solve.
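
To make the cased/uncased choice concrete, here is a minimal plain-Python sketch. It assumes that an "uncased" model simply lowercases its input before tokenizing (real uncased checkpoints may apply further normalisation); the helper name `normalize` and the sample sentences are hypothetical.

```python
def normalize(text, cased):
    """Return the text as a cased or uncased model would see it (lowercasing only)."""
    return text if cased else text.lower()

sentences = ["The US team won.", "Tell us the score."]

for cased in (True, False):
    words = {w.strip(".") for s in sentences for w in normalize(s, cased).split()}
    # "US" (the country) and "us" (the pronoun) stay distinct only when casing is kept.
    print("cased" if cased else "uncased", sorted(words))
```

If distinctions like this matter for your task (named entities, acronyms), a cased model is the safer pick.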

# Different Refined Versions Of BERT 

[Here's a list of different pretrained model options from 🤗](https://huggingface.co/transformers/pretrained_models.html)

# Top Contenders For The Given Problem

### 1) XLM-RoBERTa

The XLM-RoBERTa model was proposed in "Unsupervised Cross-lingual Representation Learning at Scale" by Alexis Conneau, Kartikay Khandelwal, Naman Goyal, Vishrav Chaudhary, Guillaume Wenzek, Francisco Guzmán, Edouard Grave, Myle Ott, Luke Zettlemoyer and Veselin Stoyanov. It is based on Facebook’s RoBERTa model released in 2019. It is a large multilingual language model, trained on 2.5TB of filtered CommonCrawl data.


### 2) DistilBERT

The DistilBERT model was proposed in the blog post "Smaller, faster, cheaper, lighter: Introducing DistilBERT, a distilled version of BERT" and the paper "DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter". DistilBERT is a small, fast, cheap and light Transformer model trained by distilling BERT base. It has 40% fewer parameters than bert-base-uncased and runs 60% faster while preserving over 95% of BERT’s performance as measured on the GLUE language understanding benchmark.


### 3) XLM 

The XLM model was proposed in "Cross-lingual Language Model Pretraining" by Guillaume Lample and Alexis Conneau. It’s a transformer pre-trained using one of the following objectives:

- a causal language modeling (CLM) objective (next token prediction),

- a masked language modeling (MLM) objective (BERT-like), or

- a translation language modeling (TLM) objective (an extension of BERT’s MLM to multiple language inputs)
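
The three objectives differ mainly in how the inputs and prediction targets are built. The sketch below illustrates that on whitespace-split toy sentences. It is a simplification under stated assumptions: real pre-training masks random subword positions rather than hand-picked ones, and the `[MASK]` and `</s>` symbols as well as the helper names are placeholders chosen for illustration.

```python
MASK = "[MASK]"  # placeholder symbol; the real mask token depends on the tokenizer

def clm_pairs(tokens):
    """Causal LM: predict each token from the tokens before it."""
    return [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]

def mlm_example(tokens, masked_positions):
    """Masked LM: hide chosen positions and predict them from both sides."""
    inputs = [MASK if i in masked_positions else t for i, t in enumerate(tokens)]
    targets = {i: tokens[i] for i in masked_positions}
    return inputs, targets

def tlm_example(src_tokens, tgt_tokens, masked_positions):
    """Translation LM: MLM over a concatenated pair of parallel sentences."""
    return mlm_example(src_tokens + ["</s>"] + tgt_tokens, masked_positions)

en = "the cat sleeps".split()
fr = "le chat dort".split()

print(clm_pairs(en))                # (context, next token) pairs
print(mlm_example(en, {1}))         # "cat" is hidden and must be recovered
print(tlm_example(en, fr, {1, 5}))  # masks in both languages of the pair
```

In the TLM case the model can look at the French half to recover a masked English word and vice versa, which is what encourages aligned cross-lingual representations.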

# Let's Get Started - Loading Libraries And Data


In [None]:
import os

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import tensorflow as tf
import transformers
from sklearn.model_selection import train_test_split
from tensorflow.keras.layers import Dense, Input
from tensorflow.keras.models import Model
from tensorflow.keras.optimizers import Adam
from transformers import TFAutoModel, AutoTokenizer

In [None]:
train_data_frame=pd.read_csv("../input/contradictory-my-dear-watson/train.csv")
test_data_frame =pd.read_csv("../input/contradictory-my-dear-watson/test.csv")
sample_sub=pd.read_csv("../input/contradictory-my-dear-watson/sample_submission.csv")

# Initialise TPU | Distribution strategies | TPU Configuration
In most cases, users want to run the model on multiple TPUs in a data-parallel way. A distribution strategy is an abstraction that can be used to drive models on CPUs, GPUs or TPUs. Simply swap out the distribution strategy and the model will run on the given device.
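
To illustrate what data-parallel training means, here is a minimal plain-Python sketch. It is not how a TensorFlow distribution strategy is implemented; it only shows the idea that each replica processes an equal slice of the global batch and the per-replica gradients are averaged. The toy model `y = w * x`, the data, and the helper names `mse_grad` and `data_parallel_grad` are assumptions made for illustration.

```python
def mse_grad(w, xs, ys):
    """Gradient of the mean squared error for the 1-D model y = w * x."""
    return sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)

def data_parallel_grad(w, xs, ys, num_replicas):
    """Split the global batch evenly, compute one gradient per replica, then average."""
    per_replica = len(xs) // num_replicas
    grads = []
    for r in range(num_replicas):
        lo, hi = r * per_replica, (r + 1) * per_replica
        grads.append(mse_grad(w, xs[lo:hi], ys[lo:hi]))
    return sum(grads) / num_replicas

xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]   # global batch of 8 examples
ys = [2.0, 4.1, 5.9, 8.2, 9.8, 12.1, 14.0, 15.9]

full = mse_grad(0.5, xs, ys)                                  # one device, whole batch
sharded = data_parallel_grad(0.5, xs, ys, num_replicas=4)     # 4 replicas, 2 examples each
print(full, sharded)
```

With equally sized shards the averaged gradient matches the single-device gradient (up to floating-point rounding), which is why the same model code can run unchanged on one device or many.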

In [None]:
def Utilize_TPUs():
    """
    Initialize a training strategy: TPUStrategy if a TPU is available,
    otherwise the default strategy for CPU or a single GPU.

    After the TPU is initialized, you can also use manual device placement
    to place the computation on a single TPU device.
    """
    try:
        # Locate the TPU cluster; this typically raises ValueError when no TPU is attached.
        resolver = tf.distribute.cluster_resolver.TPUClusterResolver()
        tf.config.experimental_connect_to_cluster(resolver)
        tf.tpu.experimental.initialize_tpu_system(resolver)
        # On TensorFlow 2.3+ this class is also available as tf.distribute.TPUStrategy.
        strategy = tf.distribute.experimental.TPUStrategy(resolver)
        REPLICAS = strategy.num_replicas_in_sync
        print("Connected to TPU successfully:\n TPUs initialised with replicas:", REPLICAS)

    except ValueError:
        print("Connection to TPU failed")
        print("Using default strategy for CPU and single GPU")
        strategy = tf.distribute.get_strategy()

    return strategy

strategy = Utilize_TPUs()

# AutoModels
* In many cases, the architecture you want to use can be guessed from the name or the path of the pretrained model you are supplying to the from_pretrained method.

* AutoClasses do this job for you: they automatically retrieve the relevant model given the name/path to the pretrained weights/config/vocabulary.

* Instantiating one of AutoModel, AutoConfig and AutoTokenizer will directly create a class of the relevant architecture (e.g. model = AutoModel.from_pretrained('bert-base-cased') will create an instance of BertModel).
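
The idea of guessing an architecture from a checkpoint name can be sketched in a few lines of plain Python. This is a toy illustration, not the library's actual lookup code: the `PATTERNS` table and the `guess_architecture` helper are hypothetical, and only the architecture class names on the right-hand side are real Transformers classes.

```python
# Hypothetical name patterns -> architecture names. Order matters because
# "xlm-roberta" also contains "roberta" and "distilbert" also contains "bert".
PATTERNS = [
    ("distilbert", "DistilBertModel"),
    ("xlm-roberta", "XLMRobertaModel"),
    ("roberta", "RobertaModel"),
    ("xlm", "XLMModel"),
    ("bert", "BertModel"),
]

def guess_architecture(name_or_path):
    """Return the first architecture whose pattern occurs in the checkpoint name."""
    lowered = name_or_path.lower()
    for pattern, architecture in PATTERNS:
        if pattern in lowered:
            return architecture
    raise ValueError(f"Cannot guess an architecture from {name_or_path!r}")

for name in ["bert-base-cased", "jplu/tf-xlm-roberta-base", "distilbert-base-uncased"]:
    print(name, "->", guess_architecture(name))
```

The ordering of the table is the main design point: more specific names have to be checked before the shorter names they contain.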

# Defining Parameters

In [None]:
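the_chosen_one = "jplu/tf-xlm-roberta-base"  # pretrained checkpoint to fine-tune
max_len = 80  # number of token ids per premise/hypothesis pair
batch_size = 16 * strategy.num_replicas_in_sync  # global batch: 16 examples per replica
AUTO = tf.data.experimental.AUTOTUNE  # let tf.data pick the prefetch buffer size
epochs = 20
n_steps = len(train_data_frame) // batch_size  # steps per epoch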
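
As a worked example of the batch-size arithmetic above, the following sketch plugs in concrete numbers. The replica count (8) and the dataset size (12,000 rows) are assumptions chosen for illustration, not values read from this notebook's run.

```python
num_replicas = 8          # assumed: a TPU exposing 8 cores
per_replica_batch = 16    # same per-replica batch size as above
train_rows = 12_000       # hypothetical dataset size

global_batch = per_replica_batch * num_replicas        # examples consumed per step
steps_per_epoch = train_rows // global_batch           # full batches per epoch
leftover = train_rows - steps_per_epoch * global_batch # rows that do not fill a batch
print(global_batch, steps_per_epoch, leftover)
```

Because integer division rounds down, a few rows per epoch do not fit into a full batch; with a repeating dataset they simply roll over into the next epoch.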

# Define, Build & Compile Model

[Get your model from 🤗](https://huggingface.co/models)

* Make sure you choose the right one: we are using TFAutoModel, so we need a checkpoint with TensorFlow weights, and therefore choose jplu/tf-xlm-roberta-base.


In [None]:
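def model_baseline(strategy, transformer):
    # Build and compile inside the strategy scope so variables are created on the replicas.
    with strategy.scope():
        transformer_encoder = TFAutoModel.from_pretrained(transformer)
        input_layer = Input(shape=(max_len,), dtype=tf.int32, name="input_layer")
        # First output holds the hidden states, shape (batch, max_len, hidden_size).
        sequence_output = transformer_encoder(input_layer)[0]
        # Hidden state of the first token, used as a summary of the whole pair.
        cls_token = sequence_output[:, 0, :]
        # Three output classes: entailment, neutral, contradiction.
        output_layer = Dense(3, activation='softmax')(cls_token)
        model = Model(inputs=input_layer, outputs=output_layer)
        model.compile(
            Adam(learning_rate=1e-5),  # `lr` is a deprecated alias of `learning_rate`
            loss='sparse_categorical_crossentropy',
            metrics=['accuracy']
        )
        return model

model = model_baseline(strategy, the_chosen_one)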
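
The slice `sequence_output[:, 0, :]` in the model above may be easier to follow on a tiny array. The sketch below uses NumPy instead of TensorFlow, and the sizes (2 examples, 4 tokens, 3 hidden features) are hypothetical stand-ins for the real encoder output.

```python
import numpy as np

batch, seq_len, hidden = 2, 4, 3   # tiny hypothetical sizes
sequence_output = np.arange(batch * seq_len * hidden).reshape(batch, seq_len, hidden)

# Keep every example, only token position 0, and every hidden feature.
cls_token = sequence_output[:, 0, :]
print(sequence_output.shape)  # one vector per token
print(cls_token.shape)        # one vector per example
print(cls_token)
```

The classifier head therefore sees a single fixed-size vector per premise/hypothesis pair, regardless of how many tokens the pair contains.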

In [None]:
model.summary()

# EDA And Data Preprocessing 

In [None]:
train_data_frame.head()

In [None]:
from plotly.offline import init_notebook_mode, iplot, plot
import plotly as py
import seaborn as sns

In [None]:
import plotly.express as px
fig = px.bar(train_data_frame, x=train_data_frame['language'])
iplot(fig)

Observations -
* Most data points are in English
* The other languages have roughly equal counts

In [None]:
# Pass the column by keyword; recent seaborn versions no longer accept it positionally.
sns.countplot(x=train_data_frame.label)

Observations -

* A balanced class distribution is observed

# Processing Data Before Feeding It To Transformers


 

# Tokenizing [Using AutoTokenizer from Transformers]

Tokenization is the process of breaking a chunk of text into smaller parts: a paragraph into sentences, a sentence into words, or a word into characters.
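
The three levels of splitting can be shown with a naive plain-Python sketch. The sample paragraph is hypothetical, and the regular-expression and whitespace splits are deliberate simplifications: the tokenizer used later in this notebook works on learned subword units rather than on rules like these.

```python
import re

paragraph = "TPUs are fast. BERT reads text in both directions."

# Paragraph -> sentences: split after sentence-ending punctuation.
sentences = [s for s in re.split(r"(?<=[.!?])\s+", paragraph) if s]
# Sentence -> words: drop the final period, then split on whitespace.
words = sentences[0].rstrip(".").split()
# Word -> characters.
characters = list(words[0])

print(sentences)
print(words)
print(characters)
```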

AutoTokenizer is a generic tokenizer class that will be instantiated as one of the tokenizer classes of the library when created with the AutoTokenizer.from_pretrained(pretrained_model_name_or_path) class method.

The from_pretrained() method takes care of returning the correct tokenizer class instance based on the model_type property of the config object or, when it’s missing, falling back to pattern matching on the pretrained_model_name_or_path string.


In [None]:
tokenizer = AutoTokenizer.from_pretrained(the_chosen_one)


In [None]:
train_data = train_data_frame[['premise', 'hypothesis']].values.tolist()


In [None]:
test_data = test_data_frame[['premise', 'hypothesis']].values.tolist()


# Encoding Data 

Numerically representing the text data so that it can be fed to the model.
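
As a rough picture of what encoding produces, the sketch below maps a premise/hypothesis pair to a fixed-length list of integer ids using a toy word-level vocabulary. Everything in it is an assumption for illustration: the special-token ids, the helper names `build_vocab` and `encode_pair`, and the naive truncation (a real tokenizer uses subword ids and keeps the closing special token when it truncates).

```python
PAD, BOS, EOS, UNK = 0, 1, 2, 3  # hypothetical special-token ids

def build_vocab(texts):
    """Assign an integer id to every distinct lowercase word."""
    vocab = {}
    for text in texts:
        for word in text.lower().split():
            vocab.setdefault(word, len(vocab) + 4)  # ids 0-3 are reserved
    return vocab

def encode_pair(premise, hypothesis, vocab, max_len):
    """Map a sentence pair to exactly max_len ids: truncate, then pad."""
    ids = [BOS]
    ids += [vocab.get(w, UNK) for w in premise.lower().split()] + [EOS]
    ids += [vocab.get(w, UNK) for w in hypothesis.lower().split()] + [EOS]
    ids = ids[:max_len]
    return ids + [PAD] * (max_len - len(ids))

vocab = build_vocab(["the cat sleeps", "an animal rests"])
short = encode_pair("the cat sleeps", "an animal rests", vocab, max_len=10)
long = encode_pair("the cat sleeps", "a dog barks loudly outside", vocab, max_len=8)
print(short)  # padded up to max_len
print(long)   # truncated down to max_len, unknown words mapped to UNK
```

Every encoded pair ends up with the same length, which is what lets the whole dataset be stacked into one rectangular tensor.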

In [None]:
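# padding/truncation replace the deprecated pad_to_max_length flag (transformers 3.0+),
# so every pair is padded or cut to exactly max_len token ids.
train_encoded = tokenizer.batch_encode_plus(train_data, padding='max_length', truncation=True, max_length=max_len)

In [None]:
test_encoded = tokenizer.batch_encode_plus(test_data, padding='max_length', truncation=True, max_length=max_len)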

# Validation Split 

In [None]:
x_train, x_valid, y_train, y_valid = train_test_split(train_encoded['input_ids'], train_data_frame.label.values, test_size=0.1)

x_test = test_encoded['input_ids']

# Loading Data Into tf.data.Dataset

In [None]:
train_dataset = (tf.data.Dataset.from_tensor_slices((x_train, y_train)).repeat().shuffle(2048).batch(batch_size).prefetch(AUTO))

valid_dataset = (tf.data.Dataset.from_tensor_slices((x_valid, y_valid)).batch(batch_size).cache().prefetch(AUTO))

test_dataset = (tf.data.Dataset.from_tensor_slices(x_test).batch(batch_size))

# Training Base Model

In [None]:
model.fit(train_dataset,steps_per_epoch=n_steps,validation_data=valid_dataset,epochs=epochs)

# Making Predictions & Saving

In [None]:
predictions = model.predict(test_dataset, verbose=1)
sample_sub['prediction'] = predictions.argmax(axis=1)
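
The `argmax(axis=1)` step above turns each row of class probabilities into a single class label. Here is a small NumPy sketch with made-up softmax outputs for three test rows; the numbers are hypothetical and only illustrate the operation.

```python
import numpy as np

# Hypothetical softmax outputs: one row per test example, one column per class.
predictions = np.array([
    [0.7, 0.2, 0.1],
    [0.1, 0.3, 0.6],
    [0.2, 0.5, 0.3],
])

# For each row, take the index of the largest probability.
labels = predictions.argmax(axis=1)
print(labels)
```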

In [None]:
sample_sub.to_csv("submission.csv",index= False)