**Initialization**
- I place these three lines at the top of every notebook. The first two reload imported modules automatically before each cell runs, which prevents stale-code problems when rerunning the same project. The third line renders Matplotlib plots inline in the notebook.

In [1]:
#@ INITIALIZATION: 
%reload_ext autoreload
%autoreload 2
%matplotlib inline

**Downloading Libraries and Dependencies**
- All the libraries and dependencies required for the project are imported in a single cell. Uncomment the `pip` line if `transformers` is not installed yet.

In [3]:
#@ IMPORTING MODULES: UNCOMMENT BELOW:
# !pip install transformers
import torch
import keras
from torch.utils.data import TensorDataset, DataLoader
from torch.utils.data import RandomSampler, SequentialSampler
import tensorflow as tf
from transformers import pipeline
from keras.preprocessing.sequence import pad_sequences
from sklearn.model_selection import train_test_split
from transformers import BertTokenizer, BertConfig
from transformers import AdamW, BertForSequenceClassification
from transformers import get_linear_schedule_with_warmup

from tqdm import tqdm, trange
import pandas as pd
import os
import numpy as np
import matplotlib.pyplot as plt

**Activating GPU**
- Training a multi-head attention transformer model relies on the parallel processing that GPUs provide, so we first check that a GPU is available.

In [4]:
#@ ACTIVATING THE GPU:
device_name = tf.test.gpu_device_name()    
if device_name != "/device:GPU:0":
    raise SystemError("GPU device not found")
print("Found GPU at: {}".format(device_name))       # Inspection.

Found GPU at: /device:GPU:0


In [5]:
#@ SPECIFYING CUDA AS DEVICE: 
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")   # Initialization. 
n_gpu = torch.cuda.device_count()                                       # Number of GPUs. 
torch.cuda.get_device_name(0)                                           # Inspection. 

'Tesla K80'

**Loading Dataset**
- We will load the CoLA (Corpus of Linguistic Acceptability) dataset introduced by Warstadt et al.

In [8]:
#@ LOADING THE DATASET:
path = "/content/in_domain_train.tsv"                   # Path to dataset.
df = pd.read_csv(path, delimiter="\t", header=None, 
                 names=["sentence_source", "label",
                        "label_notes", "sentence"])     # Reading dataset.
df.shape                                                # Inspecting shape.

(8551, 4)
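
The call above passes `header=None` and `names=[...]` because the TSV file has no header row, so the four column names have to be supplied by hand. As a minimal sketch of that idea, the snippet below parses two rows (taken from the sample displayed in this notebook) with the standard-library `csv` module instead of pandas. The `raw_tsv` string and the `COLUMNS` name are illustrative assumptions, not part of the original code.

```python
import csv
import io

# Column names mirror the `names=[...]` argument passed to pd.read_csv.
COLUMNS = ["sentence_source", "label", "label_notes", "sentence"]

# Two tab-separated rows with no header line (in-memory stand-in for the file).
raw_tsv = (
    "bc01\t1\t\ta bear occupies the cave .\n"
    "b_73\t1\t\tsome of them made as many errors as joan .\n"
)

reader = csv.reader(io.StringIO(raw_tsv), delimiter="\t")
rows = [dict(zip(COLUMNS, fields)) for fields in reader]

for row in rows:
    row["label"] = int(row["label"])   # 1 = acceptable, 0 = unacceptable
    print(row["sentence_source"], row["label"], repr(row["sentence"]))
```

Only the `label` and `sentence` columns are used in the steps that follow.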

In [9]:
#@ INSPECTING DATASET:
df.sample(5)

| | sentence_source | label | label_notes | sentence |
|---|---|---|---|---|
| 5813 | c_13 | 1 | | ivan got a headache on wednesday from the disg... |
| 557 | bc01 | 1 | | a bear occupies the cave . |
| 5557 | b_73 | 1 | | some of them made as many errors as joan . |
| 4400 | ks08 | 1 | | fred must have been both singing songs and dri... |
| 7120 | sgww85 | 0 | * | we talked about that he had worked at the whit... |


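**Preprocessing Dataset**
- BERT expects a `[CLS]` token at the start of each input and a `[SEP]` token at the end, so we add both to every sentence and extract the labels.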

In [10]:
#@ PREPROCESSING THE DATASET: 
sentences = df.sentence.values                                          # Initializing arrays. 
sentences = ["[CLS] " + sentence + " [SEP]" for sentence in sentences]  # Adding BERT tokens. 
labels = df.label.values                                                # Initializing arrays. 

**BERT Tokenizer**
- We will initialize a pretrained BERT tokenizer (`bert-base-uncased`) and use it to split every sentence into tokens.

In [13]:
#@ INITIALIZING BERT TOKENIZER:
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased",
                                          do_lower_case=True)           # Initializing BERT tokenizer. 
tokenized_texts = [tokenizer.tokenize(sent) for sent in sentences]      # Initializing tokenization. 
print("Tokenizing first sentence:")
print(tokenized_texts[0])                                               # Inspection. 

Tokenizing first sentence:
['[CLS]', 'our', 'friends', 'wo', 'n', "'", 't', 'buy', 'this', 'analysis', ',', 'let', 'alone', 'the', 'next', 'one', 'we', 'propose', '.', '[SEP]']
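
The BERT tokenizer splits words into sub-word pieces with the WordPiece scheme: for each word it repeatedly takes the longest prefix found in the vocabulary, and marks every non-initial piece with a `##` prefix. The snippet below is a minimal sketch of that greedy longest-match-first idea, not the library's implementation. The `wordpiece` helper and `TOY_VOCAB` are hypothetical, and the real `bert-base-uncased` vocabulary has roughly 30,000 entries.

```python
def wordpiece(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first split of one word into sub-word pieces."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        match = None
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry a ## prefix
            if candidate in vocab:
                match = candidate
                break
            end -= 1                          # try a shorter substring
        if match is None:
            return [unk]                      # no piece fits: whole word becomes unknown
        pieces.append(match)
        start = end
    return pieces

# Hypothetical toy vocabulary, chosen only to illustrate the splitting.
TOY_VOCAB = {"friend", "##s", "buy", "analysis", "propos", "##e"}

tokens = [wordpiece(w, TOY_VOCAB) for w in ["friends", "buy", "propose", "zebra"]]
print(tokens)
# [['friend', '##s'], ['buy'], ['propos', '##e'], ['[UNK]']]
```

A word that is in the vocabulary stays whole, which is why most tokens in the output above are complete words.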


In [14]:
#@ PROCESSING THE DATA:
MAX_LEN = 128                                                           # Maximum sequence length. 
input_ids = [tokenizer.convert_tokens_to_ids(x) for x in
             tokenized_texts]                                           # Converting tokens to vocabulary ids. 
input_ids = pad_sequences(input_ids, maxlen=MAX_LEN, dtype="long",
                          truncating="post", padding="post")            # Padding and truncating to MAX_LEN. 
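
With `truncating="post"` and `padding="post"`, sequences longer than `MAX_LEN` are cut at the end and shorter ones are filled with zeros at the end, so every row has the same length. The snippet below is a minimal pure-Python sketch of that behavior. The `pad_post` helper and the `toy_ids` values are illustrative assumptions, apart from 101 and 102, which are the ids of `[CLS]` and `[SEP]` in the `bert-base-uncased` vocabulary.

```python
def pad_post(sequences, maxlen, value=0):
    """Truncate each sequence at the end to `maxlen`, then right-pad with `value`."""
    padded = []
    for seq in sequences:
        seq = list(seq)[:maxlen]                            # truncating="post"
        padded.append(seq + [value] * (maxlen - len(seq)))  # padding="post"
    return padded

# Toy id sequences of different lengths (middle ids are made up).
toy_ids = [[101, 7, 8, 102], [101, 7, 8, 9, 10, 11, 102]]

print(pad_post(toy_ids, maxlen=6))
# [[101, 7, 8, 102, 0, 0], [101, 7, 8, 9, 10, 11]]
```

Note that post-truncation drops the trailing `[SEP]` id from the second sequence, so `MAX_LEN` should be large enough to hold the longest sentence. The padding value 0 is also the id of the `[PAD]` token in `bert-base-uncased`.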