<font size="4">The goal here is to perform sentiment analysis using DistilBERT (https://medium.com/huggingface/distilbert-8cf3380435b5).
 
We will encode sentences with DistilBERT, then feed the resulting embeddings as features to an XGBoost classifier.

We will use the Stanford Sentiment Treebank 2 (SST2) dataset: movie reviews with one sentence per review, labeled either positive (1) or negative (0), with no neutral class.</font>

In [2]:
import tensorflow as tf
import numpy as np
import pandas as pd
from tensorflow.keras.preprocessing.sequence import pad_sequences
import xgboost as xgb
import time
from scipy import stats
from scipy.stats import randint
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.model_selection import RandomizedSearchCV
from sklearn.model_selection import StratifiedKFold
import warnings
warnings.filterwarnings('ignore')

In [3]:
# Load the dataset into a pandas DataFrame; we only use the train split, not the whole set
df = pd.read_csv('https://github.com/clairett/pytorch-sentiment-classification/raw/master/data/SST2/train.tsv', delimiter='\t', header=None)
pd.set_option('display.max_colwidth', None)  # show full sentences; None replaces the deprecated -1

In [4]:
#See what the dataset looks like
df.head()

Unnamed: 0,0,1
0,"a stirring , funny and finally transporting re imagining of beauty and the beast and 1930s horror films",1
1,apparently reassembled from the cutting room floor of any given daytime soap,0
2,"they presume their audience wo n't sit still for a sociology lesson , however entertainingly presented , so they trot out the conventional science fiction elements of bug eyed monsters and futuristic women in skimpy clothes",0
3,"this is a visually stunning rumination on love , memory , history and the war between art and commerce",1
4,jonathan parker 's bartleby should have been the be all end all of the modern office anomie films,1


<font size="4"> Let's load the pretrained DistilBERT model and its tokenizer with the "transformers" library from Hugging Face. We won't train or fine-tune the model for now; we only use it for inference. </font>


In [4]:
from transformers import TFDistilBertModel, DistilBertTokenizer

In [5]:
distil_bert = TFDistilBertModel.from_pretrained('distilbert-base-uncased')  # load DistilBERT; needs a lot of GPU memory
tokenizer = DistilBertTokenizer.from_pretrained("distilbert-base-uncased")  # load the matching tokenizer


In [6]:
tokenized = df[0].apply((lambda x: tokenizer.encode(x, add_special_tokens=True)))
# Tokenize each sentence of the dataset.
# The first token is always [CLS] (for "classification"); its final hidden state is what we retrieve at the end.

padded = pad_sequences(tokenized, padding='post')
# We pad the inputs because BERT needs all inputs in a batch to have the same length.
# BERT does better with padding on the right rather than on the left, hence 'post'.

<font size="4"> BERT also needs attention masks, since the inputs are padded.</font>

In [7]:
attention_mask = np.where(padded != 0, 1, 0)  # 1 for real tokens, 0 for padding (token id 0)
attention_mask.shape  # 6920 inputs, each padded to a length of 67

(6920, 67)
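
<font size="4"> To make the padding and masking steps concrete, here is a minimal NumPy-only sketch on two made-up sentences. The token ids in `toy_tokenized` are hypothetical and chosen for illustration; the only assumptions carried over from the notebook are that padding goes on the right and that the pad id is 0. </font>

```python
import numpy as np

# Hypothetical token ids for two short sentences (illustration only).
# 101 and 102 stand for [CLS] and [SEP]; 0 is assumed to be the padding id.
toy_tokenized = [[101, 1037, 6057, 102], [101, 2023, 2003, 1037, 3185, 102]]

# Right-pad every sequence with zeros up to the length of the longest one.
max_len = max(len(ids) for ids in toy_tokenized)
toy_padded = np.array([ids + [0] * (max_len - len(ids)) for ids in toy_tokenized])

# Same rule as in the notebook: 1 marks a real token, 0 marks padding.
toy_mask = np.where(toy_padded != 0, 1, 0)

print(toy_padded)
print(toy_mask)
```

<font size="4"> The mask has the same shape as the padded input, so the model can ignore the padded positions when computing attention. </font>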

In [8]:
padded_splits = np.split(padded, 8)  # split the inputs into 8 equal batches; otherwise my GPU runs out of memory during inference
mask_splits=np.split(attention_mask,8)

In [10]:
padded_splits[0][0]  # example of a tokenized sentence: 101 is the [CLS] token, 102 is [SEP], 0 is padding

array([  101,  1037, 18385,  1010,  6057,  1998,  2633, 18276,  2128,
       16603,  1997,  5053,  1998,  1996,  6841,  1998,  5687,  5469,
        3152,   102,     0,     0,     0,     0,     0,     0,     0,
           0,     0,     0,     0,     0,     0,     0,     0,     0,
           0,     0,     0,     0,     0,     0,     0,     0,     0,
           0,     0,     0,     0,     0,     0,     0,     0,     0,
           0,     0,     0,     0,     0,     0,     0,     0,     0,
           0,     0,     0,     0], dtype=int32)

In [11]:
# Run inference on the tokenized dataset, one batch at a time
outputs = []
for ids, mask in zip(padded_splits, mask_splits):
    outputs.append(distil_bert(ids, attention_mask=mask))
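
<font size="4"> The batching pattern can be sketched without a GPU. In the snippet below, `fake_model` and `run_in_batches` are hypothetical stand-ins, not part of transformers: `fake_model` only mimics the output shape (batch, sequence length, hidden size). The sketch uses `np.array_split` rather than `np.split`, since `np.array_split` also accepts a dataset size that is not divisible by the number of batches. </font>

```python
import numpy as np

def fake_model(ids, mask):
    # Hypothetical stand-in for DistilBERT: one 4-dim vector per token,
    # set to zero wherever the attention mask is 0.
    return np.ones(ids.shape + (4,), dtype=np.float32) * mask[..., None]

def run_in_batches(model_fn, inputs, masks, n_batches):
    """Apply model_fn to n_batches chunks and stack the results along axis 0."""
    input_chunks = np.array_split(inputs, n_batches)  # chunks may differ in size
    mask_chunks = np.array_split(masks, n_batches)
    results = [model_fn(x, m) for x, m in zip(input_chunks, mask_chunks)]
    return np.concatenate(results, axis=0)

# Toy data: 10 "sentences" of length 7, the last two positions being padding.
toy_ids = np.arange(1, 71).reshape(10, 7)
toy_ids[:, 5:] = 0
toy_mask = np.where(toy_ids != 0, 1, 0)

toy_hidden = run_in_batches(fake_model, toy_ids, toy_mask, n_batches=3)
print(toy_hidden.shape)
```

<font size="4"> Here 10 rows are split into chunks of 4, 3 and 3, and the concatenated result still has one row per input sentence. </font>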

In [12]:
del distil_bert
del tokenizer

In [13]:
# Concatenate the last hidden states of all batches into one NumPy array
last_hidden = np.concatenate([out[0].numpy() for out in outputs], axis=0)

In [66]:
# Save the features to disk
np.save('/home/florian/bert/last_hidden_uni.npy',last_hidden)
# The kernel needs to be restarted to free the GPU memory

In [7]:
#load the feature from disk 
last_hidden=np.load('/home/florian/bert/last_hidden_uni.npy')

In [8]:
last_hidden.shape
# (number of sentences, sequence length, hidden size)

(6920, 67, 768)

In [9]:
features = last_hidden[:, 0, :]  # keep only the first token ([CLS]) of each sentence

targets=df[1] #labels

In [10]:
print(features.shape)
print(targets.shape)

(6920, 768)
(6920,)
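
<font size="4"> The slicing step can be checked on a tiny array. The sketch below uses a made-up tensor of shape (2 sentences, 3 tokens, 4 hidden units) instead of the real (6920, 67, 768) one; `toy_last_hidden` and `toy_features` are illustrative names. </font>

```python
import numpy as np

# Made-up hidden states: 2 sentences, 3 tokens each, 4 hidden units per token.
toy_last_hidden = np.arange(2 * 3 * 4, dtype=np.float32).reshape(2, 3, 4)

# Index 0 on the token axis selects the first token ([CLS]) of every sentence.
toy_features = toy_last_hidden[:, 0, :]

print(toy_features.shape)
print(toy_features)
```

<font size="4"> The token axis disappears, leaving one fixed-size vector per sentence, which is the kind of 2-D feature matrix a tabular classifier expects. </font>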


In [12]:
features

array([[-0.21593428, -0.14028901,  0.008311  , ..., -0.13694826,
         0.5867007 ,  0.20112711],
       [-0.17262739, -0.14476174,  0.00223407, ..., -0.17442545,
         0.21386476,  0.37197474],
       [-0.05063341,  0.07203925, -0.02959689, ..., -0.0714896 ,
         0.7185235 ,  0.26225457],
       ...,
       [-0.06550973, -0.05184762, -0.14094462, ..., -0.06450678,
         0.60223097,  0.2134794 ],
       [-0.08523114, -0.04869815, -0.08137507, ..., -0.13589332,
         0.39505604,  0.22889736],
       [-0.29436877, -0.09234713, -0.00831658, ..., -0.05159125,
         0.43497816,  0.2889163 ]], dtype=float32)

<font size="4"> Each sentence in the dataset is now embedded as a vector of 768 features, which we can feed to an XGBoost classifier.</font>

In [13]:
X = features
Y = targets

In [14]:
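# Note: early_stopping_rounds only takes effect when an eval_set is passed to fit(); none is passed below,
# and recent XGBoost versions may raise an error in that case. 'silent' is deprecated in favor of 'verbosity'.
xgb_model = xgb.XGBClassifier(objective='binary:logistic',silent=True,early_stopping_rounds=200,tree_method='gpu_hist')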
params = {
        'n_estimators': stats.randint(1500, 4000),
        'learning_rate': [0.03,0.04,0.05,0.06],
        'min_child_weight': [3,4,5,6],
        'gamma': [2,3,4,5,6],
        'subsample': [0.6, 0.8, 1.0],
        'colsample_bytree': [0.6, 0.8, 1.0],
        'max_depth': [3, 4, 5]
        }

<font size="4"> We use random search with stratified cross-validation to tune the XGBoost hyperparameters, with accuracy as the metric. </font>

In [18]:
folds = 5  # 5-fold cross-validation
param_comb = 5  # number of parameter combinations sampled; each one is evaluated on all 5 folds

skf = StratifiedKFold(n_splits=folds, shuffle = True)

random_search = RandomizedSearchCV(xgb_model, param_distributions=params, n_iter=param_comb, 
                                   scoring='accuracy', n_jobs=1, cv=skf.split(X,Y), verbose=3, random_state=42)
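
<font size="4"> As a rough illustration of what random search does, the sketch below draws 5 parameter combinations from a reduced, made-up search space using only the standard library. It is not how scikit-learn implements its sampling; `toy_space`, `candidates` and `total_fits` are hypothetical names, and the point is only to show why 5 candidates with 5 folds lead to 25 model fits. </font>

```python
import random

# Reduced, illustrative search space (a subset of the grid used in the notebook).
toy_space = {
    "learning_rate": [0.03, 0.04, 0.05, 0.06],
    "max_depth": [3, 4, 5],
    "subsample": [0.6, 0.8, 1.0],
}

n_candidates = 5  # plays the role of n_iter
n_folds = 5       # plays the role of n_splits

# Draw each hyperparameter independently for every candidate.
rng = random.Random(42)
candidates = [
    {name: rng.choice(values) for name, values in toy_space.items()}
    for _ in range(n_candidates)
]

# Every candidate is trained and scored once per fold.
total_fits = n_candidates * n_folds

for candidate in candidates:
    print(candidate)
print("total fits:", total_fits)
```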



In [19]:
start_time = time.time() 
random_search.fit(X, Y)
end_time=time.time()-start_time
print("time:" ,end_time)

Fitting 5 folds for each of 5 candidates, totalling 25 fits
[CV] colsample_bytree=1.0, gamma=5, learning_rate=0.03, max_depth=5, min_child_weight=5, n_estimators=2595, subsample=0.6 


[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.


[CV]  colsample_bytree=1.0, gamma=5, learning_rate=0.03, max_depth=5, min_child_weight=5, n_estimators=2595, subsample=0.6, score=0.832, total=  26.5s
[CV] colsample_bytree=1.0, gamma=5, learning_rate=0.03, max_depth=5, min_child_weight=5, n_estimators=2595, subsample=0.6 


[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:   26.5s remaining:    0.0s


[CV]  colsample_bytree=1.0, gamma=5, learning_rate=0.03, max_depth=5, min_child_weight=5, n_estimators=2595, subsample=0.6, score=0.843, total=  26.2s
[CV] colsample_bytree=1.0, gamma=5, learning_rate=0.03, max_depth=5, min_child_weight=5, n_estimators=2595, subsample=0.6 


[Parallel(n_jobs=1)]: Done   2 out of   2 | elapsed:   52.7s remaining:    0.0s


[CV]  colsample_bytree=1.0, gamma=5, learning_rate=0.03, max_depth=5, min_child_weight=5, n_estimators=2595, subsample=0.6, score=0.846, total=  26.4s
[CV] colsample_bytree=1.0, gamma=5, learning_rate=0.03, max_depth=5, min_child_weight=5, n_estimators=2595, subsample=0.6 
[CV]  colsample_bytree=1.0, gamma=5, learning_rate=0.03, max_depth=5, min_child_weight=5, n_estimators=2595, subsample=0.6, score=0.847, total=  26.0s
[CV] colsample_bytree=1.0, gamma=5, learning_rate=0.03, max_depth=5, min_child_weight=5, n_estimators=2595, subsample=0.6 
[CV]  colsample_bytree=1.0, gamma=5, learning_rate=0.03, max_depth=5, min_child_weight=5, n_estimators=2595, subsample=0.6, score=0.857, total=  26.1s
[CV] colsample_bytree=0.6, gamma=3, learning_rate=0.05, max_depth=5, min_child_weight=5, n_estimators=2982, subsample=0.6 
[CV]  colsample_bytree=0.6, gamma=3, learning_rate=0.05, max_depth=5, min_child_weight=5, n_estimators=2982, subsample=0.6, score=0.838, total=  24.4s
[CV] colsample_bytree=0.6, 

[Parallel(n_jobs=1)]: Done  25 out of  25 | elapsed: 11.4min finished


time: 722.1062324047089
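
<font size="4"> Each candidate's score is the mean of its per-fold accuracies, and `best_score_` is the highest such mean. As a small check, the sketch below averages the five rounded fold scores printed in the log above for the first candidate (gamma=5, learning_rate=0.03). The numbers are copied from that log; `fold_scores` and `mean_score` are illustrative names. </font>

```python
import numpy as np

# Rounded per-fold accuracies of the first candidate, as printed in the log.
fold_scores = np.array([0.832, 0.843, 0.846, 0.847, 0.857])

# The cross-validated score of a candidate is the mean over its folds.
mean_score = fold_scores.mean()

print(round(float(mean_score), 3))
```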


In [20]:
print(random_search.best_estimator_)

XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
              colsample_bynode=1, colsample_bytree=1.0,
              early_stopping_rounds=200, gamma=6, learning_rate=0.04,
              max_delta_step=0, max_depth=4, min_child_weight=4, missing=None,
              n_estimators=3933, n_jobs=1, nthread=None,
              objective='binary:logistic', random_state=0, reg_alpha=0,
              reg_lambda=1, scale_pos_weight=1, seed=None, silent=True,
              subsample=0.6, tree_method='gpu_hist', verbosity=1)


In [22]:
print("accuracy :",random_search.best_score_)
# ~0.845 mean cross-validated accuracy, without training or fine-tuning DistilBERT on this specific task

accuracy : 0.8453757225433526
