
## **Install the distilled version of BERT**

In [0]:
import warnings
warnings.filterwarnings('ignore')

In [2]:
!pip install transformers



## **Import Dataset**

In [0]:
import numpy as np
import pandas as pd

In [0]:
df = pd.read_csv('https://github.com/clairett/pytorch-sentiment-classification/raw/master/data/SST2/train.tsv', delimiter='\t', header=None)

In [5]:
df.head(5)

|   | 0 | 1 |
|---|---|---|
| 0 | a stirring , funny and finally transporting re... | 1 |
| 1 | apparently reassembled from the cutting room f... | 0 |
| 2 | they presume their audience wo n't sit still f... | 0 |
| 3 | this is a visually stunning rumination on love... | 1 |
| 4 | jonathan parker 's bartleby should have been t... | 1 |


In [0]:
mini_batch = df[:2000]

In [7]:
print(mini_batch.shape)

(2000, 2)


**Number of negative (0) and positive (1) sentences in the selected subset**

In [8]:
mini_batch[1].value_counts()

1    1041
0     959
Name: 1, dtype: int64

## **Load DistilBERT for English Language**

In [0]:
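# Import only the two classes this notebook uses instead of a wildcard import
from transformers import DistilBertModel, DistilBertTokenizer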

In [0]:
tokenizer = DistilBertTokenizer.from_pretrained('distilbert-base-cased')

In [0]:
model = DistilBertModel.from_pretrained('distilbert-base-cased')

## **Dataset Preparation**

In [12]:
mini_batch[0] # fetch only sentences

0       a stirring , funny and finally transporting re...
1       apparently reassembled from the cutting room f...
2       they presume their audience wo n't sit still f...
3       this is a visually stunning rumination on love...
4       jonathan parker 's bartleby should have been t...
                              ...                        
1995    too bland and fustily tasteful to be truly pru...
1996                           it does n't work as either
1997    this one aims for the toilet and scores a dire...
1998    in the name of an allegedly inspiring and easi...
1999    the movie is undone by a filmmaking methodolog...
Name: 0, Length: 2000, dtype: object

### **Tokenize each sentence in the dataset**

In [0]:
tokens = mini_batch[0].apply(lambda x: tokenizer.encode(x, add_special_tokens=True))

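***The tokenizer's `encode` method first splits each sentence into WordPiece sub-word tokens, then adds the special [CLS] and [SEP] tokens, and finally replaces each token with its corresponding ID. The leading 101 in every row of the output below is the ID of [CLS].***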
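To make the three steps concrete, here is a minimal sketch of a WordPiece-style encoder written from scratch. It is an illustration only, not the library's implementation: `TOY_VOCAB`, `wordpiece`, and `toy_encode` are hypothetical names, and the IDs are made up rather than taken from DistilBERT's real vocabulary.

```python
# Toy vocabulary (assumption): these IDs are invented for illustration
# and do NOT match DistilBERT's real vocabulary.
TOY_VOCAB = {"[PAD]": 0, "[UNK]": 1, "[CLS]": 2, "[SEP]": 3,
             "a": 4, "stir": 5, "##ring": 6, "funny": 7, ",": 8}


def wordpiece(word, vocab):
    """Greedy longest-match-first split of one word into sub-word pieces."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        match = None
        # Shrink the candidate from the right until it is in the vocabulary.
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # marks a continuation piece
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return ["[UNK]"]  # no piece matched: the whole word is unknown
        pieces.append(match)
        start = end
    return pieces


def toy_encode(sentence, vocab):
    # Step 1: split every whitespace-separated word into sub-word pieces.
    pieces = [p for word in sentence.split() for p in wordpiece(word, vocab)]
    # Step 2: add the special tokens around the sentence.
    pieces = ["[CLS]"] + pieces + ["[SEP]"]
    # Step 3: replace each piece with its ID.
    return [vocab[p] for p in pieces]


toy_ids = toy_encode("a stirring , funny", TOY_VOCAB)
print(toy_ids)  # [2, 4, 5, 6, 8, 7, 3]
```

Here "stirring" is not in the toy vocabulary, so it is split into `stir` and `##ring`, which is the same kind of split that produces several IDs for a single rare word in the real tokenizer output.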

In [14]:
tokens.head(5)

0    [101, 170, 20329, 117, 6276, 1105, 1921, 19920...
1    [101, 4547, 1231, 11192, 5521, 11813, 1121, 11...
2    [101, 1152, 3073, 22369, 1147, 3703, 192, 1186...
3    [101, 1142, 1110, 170, 19924, 15660, 187, 1408...
4    [101, 179, 7637, 22252, 2493, 1200, 112, 188, ...
Name: 0, dtype: object

### **Padding our sentences to the same length**

In [15]:
# Length of the longest tokenized sentence in the subset
max_length = max(len(i) for i in tokens.values)

print(max_length)

66


In [0]:
padded_rows = np.array([i + [0]*(max_length-len(i)) for i in tokens.values])

In [17]:
print(padded_rows.shape)
print(padded_rows[:2]) # Now all rows have 66 tokens

(2000, 66)
[[  101   170 20329   117  6276  1105  1921 19920  1231 18632  1104  5295
   1105  1103  8839  1105  4970  5367  2441   102     0     0     0     0
      0     0     0     0     0     0     0     0     0     0     0     0
      0     0     0     0     0     0     0     0     0     0     0     0
      0     0     0     0     0     0     0     0     0     0     0     0
      0     0     0     0     0     0]
 [  101  4547  1231 11192  5521 11813  1121  1103  5910  1395  1837  1104
   1251  1549 14907  8439   102     0     0     0     0     0     0     0
      0     0     0     0     0     0     0     0     0     0     0     0
      0     0     0     0     0     0     0     0     0     0     0     0
      0     0     0     0     0     0     0     0     0     0     0     0
      0     0     0     0     0     0]]
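The same right-padding can be checked on a tiny example. This is a sketch with hand-written ID lists (`toy_tokens` is a hypothetical name); it assumes, as the notebook does, that 0 is the padding ID, which can be confirmed with `tokenizer.pad_token_id`.

```python
import numpy as np

# Two tokenized "sentences" of different lengths (IDs chosen by hand).
toy_tokens = [[101, 170, 102], [101, 4547, 1231, 11192, 102]]

# Pad every sequence on the right with zeros up to the longest length,
# so that the result is a rectangular 2-D array.
toy_max = max(len(seq) for seq in toy_tokens)
toy_padded = np.array([seq + [0] * (toy_max - len(seq)) for seq in toy_tokens])

print(toy_padded.shape)  # (2, 5)
print(toy_padded)
```

Without the padding step the lists would have different lengths and could not be stacked into a single integer matrix for the model.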


## **Masking the padding**

In [0]:
# Build an attention mask so that DistilBERT ignores the padding.
# Real tokens have a non-zero ID and get a mask value of 1;
# the zero-valued padding tokens get a mask value of 0.
attention_mask = np.where(padded_rows != 0, 1, 0)

In [19]:
print(attention_mask.shape)

(2000, 66)
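A small sketch of the mask on a hand-written padded array (`toy_padded` and `toy_mask` are hypothetical names) shows that the mask has the same shape as the input and that its row sums recover the original sentence lengths.

```python
import numpy as np

# A padded batch of two sequences; zeros on the right are padding.
toy_padded = np.array([[101, 170, 102, 0, 0],
                       [101, 4547, 1231, 11192, 102]])

# 1 where there is a real token, 0 where there is padding.
toy_mask = np.where(toy_padded != 0, 1, 0)

# Summing each row counts the real tokens in that sequence.
real_lengths = toy_mask.sum(axis=1)

print(toy_mask)
print(real_lengths)  # [3 5]
```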


## **Feed to DistilBERT**

In [0]:
import torch

In [0]:
input_ids = torch.tensor(padded_rows)
attention_mask = torch.tensor(attention_mask)
with torch.no_grad():
  last_hidden_states = model(input_ids, attention_mask=attention_mask)

In [22]:
print(last_hidden_states[0].shape)

torch.Size([2000, 66, 768])


***2000 sequences, 66 tokens in each sequence, 768 hidden units per token***

* Because we are doing text classification, we only need the hidden state of the [CLS] token, which is the first position (index 0) of each sequence

In [23]:
features = last_hidden_states[0][:, 0, :].numpy()
print(features.shape)

(2000, 768)
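The slice `[:, 0, :]` can be illustrated on a small array. This sketch uses NumPy instead of a real model output, and `toy_hidden` is a hypothetical stand-in with shape (sequences, tokens, hidden units).

```python
import numpy as np

# Stand-in for the model output: 2 sequences, 3 tokens each, 4 hidden units.
toy_hidden = np.arange(2 * 3 * 4).reshape(2, 3, 4)

# Keep all sequences, only the first token position, and all hidden units.
toy_cls = toy_hidden[:, 0, :]

print(toy_cls.shape)  # (2, 4)
print(toy_cls)
```

The token axis disappears, leaving one fixed-size vector per sequence, which is the form a classifier such as logistic regression expects.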


## ***Time to work on the 2nd model: a classifier on top of the DistilBERT features***

In [0]:
from sklearn.model_selection import train_test_split
labels = mini_batch[1]
train_features, test_features, train_labels, test_labels = train_test_split(features, labels)

## **Logistic Regression**

In [25]:
from sklearn.linear_model import LogisticRegression
lr_clf = LogisticRegression()
lr_clf.fit(train_features, train_labels)

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=100,
                   multi_class='auto', n_jobs=None, penalty='l2',
                   random_state=None, solver='lbfgs', tol=0.0001, verbose=0,
                   warm_start=False)

## **LR Accuracy after using DistilBERT**

In [0]:
from sklearn.model_selection import cross_val_score

In [27]:
lr_clf.score(test_features, test_labels)

0.826

### **Dummy Classifier as a Baseline**

In [28]:
from sklearn.dummy import DummyClassifier
clf = DummyClassifier()

scores = cross_val_score(clf, train_features, train_labels)
print(scores.mean())

0.48600000000000004
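As a rough sanity check on the baseline, the accuracy that two simple strategies would be expected to reach can be worked out from the class counts printed earlier (1041 positive, 959 negative). This is a back-of-the-envelope sketch: it uses the counts of the full 2000-row subset rather than the training split, and the strategy that `DummyClassifier()` applies by default depends on the installed scikit-learn version (older releases used `"stratified"`, newer ones use `"prior"`).

```python
# Class counts taken from the value_counts() output above.
counts = {1: 1041, 0: 959}
total = sum(counts.values())

# Always predicting the most frequent class is right whenever
# the true label is that class.
majority_acc = max(counts.values()) / total

# Guessing at random with the class frequencies as probabilities is right
# when guess and label agree: p^2 + (1 - p)^2.
p_pos = counts[1] / total
stratified_acc = p_pos ** 2 + (1 - p_pos) ** 2

print(round(majority_acc, 4))    # 0.5205
print(round(stratified_acc, 4))  # 0.5008
```

Both figures sit close to 0.5, in line with the dummy score above and far below the 0.826 obtained with logistic regression on the DistilBERT features.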
