Let's use DistilBERT on the Sentiment140 Twitter sentiment analysis task: https://www.kaggle.com/kazanova/sentiment140. The task is to classify tweets as positive (4) or negative (0); I'll relabel these as 1 and 0, respectively.

Some of this is adapted from the tutorial here: https://swatimeena989.medium.com/bert-text-classification-using-keras-903671e0207d, but on a new dataset (and with some additions).

In [33]:
import numpy as np 
import pandas as pd 
import matplotlib.pyplot as plt 
import string
from sklearn.utils import shuffle
import pickle

from tensorflow.keras.callbacks import ModelCheckpoint

import re
import time
import warnings
warnings.filterwarnings("ignore")
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer
from nltk.stem import LancasterStemmer

from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import confusion_matrix, accuracy_score, classification_report

from transformers import DistilBertModel, DistilBertConfig, DistilBertTokenizer, DistilBertForSequenceClassification

import tensorflow.keras 
from tensorflow.keras.models import Sequential, Model 
from tensorflow.keras import layers
from tensorflow.keras.layers import Dense, Dropout, Input, Embedding

In [34]:
dataset_cols = ["target", "ids", "date", "flag", "user", "text"]
dataset = pd.read_csv('/home/garrett/KagglesData/training.1600000.processed.noemoticon.csv', header=None, encoding='ISO-8859-1', names=dataset_cols)

In [15]:
dataset.shape
dataset.head()

Unnamed: 0,target,ids,date,flag,user,text
0,0,1467810369,Mon Apr 06 22:19:45 PDT 2009,NO_QUERY,_TheSpecialOne_,"@switchfoot http://twitpic.com/2y1zl - Awww, t..."
1,0,1467810672,Mon Apr 06 22:19:49 PDT 2009,NO_QUERY,scotthamilton,is upset that he can't update his Facebook by ...
2,0,1467810917,Mon Apr 06 22:19:53 PDT 2009,NO_QUERY,mattycus,@Kenichan I dived many times for the ball. Man...
3,0,1467811184,Mon Apr 06 22:19:57 PDT 2009,NO_QUERY,ElleCTF,my whole body feels itchy and like its on fire
4,0,1467811193,Mon Apr 06 22:19:57 PDT 2009,NO_QUERY,Karoli,"@nationwideclass no, it's not behaving at all...."


Lowercase all text, then remove Twitter handles, text in square brackets, URLs, HTML tags, punctuation, and words containing digits. This also removes words like '2nite', but identifying and parsing 'textspeak' is beyond the scope of what I'm investigating here, which is mainly learning to implement a form of BERT correctly.

In [16]:
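def preprocess(text):
    # `text` is a pandas Series of raw tweets; returns the cleaned Series.
    # Raw strings avoid invalid-escape warnings for \[, \S, \w and \d.
    text = text.apply(lambda x: x.lower())                                 # lowercase
    text = text.apply(lambda x: re.sub(r'@\w+', '', x))                    # Twitter handles
    text = text.apply(lambda x: re.sub(r'\[.*?\]', '', x))                 # text in square brackets
    text = text.apply(lambda x: re.sub(r'https?://\S+|www\.\S+', '', x))   # URLs
    text = text.apply(lambda x: re.sub(r'<.*?>+', '', x))                  # HTML tags
    # Replace punctuation with a space so adjacent words do not merge.
    text = text.apply(lambda x: re.sub('[%s]' % re.escape(string.punctuation), ' ', x))
    text = text.apply(lambda x: re.sub(r'\n', '', x))                      # newlines
    text = text.apply(lambda x: re.sub(r'\w*\d\w*', '', x))                # words containing digits
    return text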
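
To spot-check the cleaning rules on a single tweet without building a Series, the same steps can be applied to a plain string. This is a minimal sketch: `clean_tweet` is a hypothetical helper (not part of the notebook) that mirrors `preprocess` step for step, and the sample tweet is invented.

```python
import re
import string

# Hypothetical single-string version of `preprocess`, for spot checks only.
_PUNCT = re.compile('[%s]' % re.escape(string.punctuation))

def clean_tweet(tweet):
    tweet = tweet.lower()
    tweet = re.sub(r'@\w+', '', tweet)                    # Twitter handles
    tweet = re.sub(r'\[.*?\]', '', tweet)                 # text in square brackets
    tweet = re.sub(r'https?://\S+|www\.\S+', '', tweet)   # URLs
    tweet = re.sub(r'<.*?>+', '', tweet)                  # HTML tags
    tweet = _PUNCT.sub(' ', tweet)                        # punctuation -> space
    tweet = tweet.replace('\n', '')                       # newlines
    tweet = re.sub(r'\w*\d\w*', '', tweet)                # words containing digits
    return tweet

sample = "@user check http://example.com, see u 2nite!! <b>LOL</b>"
print(clean_tweet(sample).split())   # ['check', 'see', 'u', 'lol']
```

The `.split()` call is only there to make the result easy to read; the cleaned string itself keeps runs of spaces where handles, punctuation, and words like '2nite' used to be.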

In [17]:
df = shuffle(dataset,random_state=42)
df.head()

Unnamed: 0,target,ids,date,flag,user,text
541200,0,2200003196,Tue Jun 16 18:18:12 PDT 2009,NO_QUERY,LaLaLindsey0609,@chrishasboobs AHHH I HOPE YOUR OK!!!
750,0,1467998485,Mon Apr 06 23:11:14 PDT 2009,NO_QUERY,sexygrneyes,"@misstoriblack cool , i have no tweet apps fo..."
766711,0,2300048954,Tue Jun 23 13:40:11 PDT 2009,NO_QUERY,sammydearr,@TiannaChaos i know just family drama. its la...
285055,0,1993474027,Mon Jun 01 10:26:07 PDT 2009,NO_QUERY,Lamb_Leanne,School email won't open and I have geography ...
705995,0,2256550904,Sat Jun 20 12:56:51 PDT 2009,NO_QUERY,yogicerdito,upper airways problem


In [18]:
# Keep only the label and the tweet text.
df = df.drop(columns=['ids', 'date', 'flag', 'user'])
# After shuffling, 671155 is an index label rather than a position, so use .loc.
print(df.loc[671155, 'text'])
df['text'] = preprocess(df['text'])
print(df.loc[671155, 'text'])

Pickin up @misstinayao waitin on @sadittysash 2 hurry up...I odeeee missed dem  Table talk 2nite...LOL bout to be fat...
pickin up  waitin on   hurry up   i odeeee missed dem  table talk    lol bout to be fat   


Map the target onto a ground-truth label of 0 or 1, then check that every tweet is labeled.

In [19]:
df['gt'] = df['target'].map({0:0,4:1})
sentences=df['text']
labels=df['gt']
len(sentences),len(labels)

(1600000, 1600000)

In [20]:
df=df.dropna()                 
df=df.reset_index(drop=True)
print('Available labels: ',df['gt'].unique())

Available labels:  [0 1]
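
Comparing `len(sentences)` with `len(labels)` only confirms that both columns have the same number of rows. `Series.map` leaves `NaN` for any value missing from the dictionary, so counting `NaN`s is a more direct check that every tweet is labeled. A minimal sketch on a toy frame (the toy data, including the unexpected label 2, is invented for illustration):

```python
import pandas as pd

# Toy stand-in for the tweet frame; the label 2 is deliberately unmapped.
toy = pd.DataFrame({"target": [0, 4, 4, 2], "text": ["a", "b", "c", "d"]})

toy["gt"] = toy["target"].map({0: 0, 4: 1})   # unmapped targets become NaN

n_unlabeled = int(toy["gt"].isna().sum())     # rows without a valid label
print("unlabeled rows:", n_unlabeled)         # unlabeled rows: 1

# Drop unlabeled rows and restore an integer dtype (NaN forces float).
toy = toy.dropna(subset=["gt"]).reset_index(drop=True)
toy["gt"] = toy["gt"].astype(int)
print(toy["gt"].value_counts().to_dict())
```

On the full dataset the count should be 0, which is consistent with the integer labels `[0 1]` printed above.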


Load the pretrained DistilBERT tokenizer and classification model:

In [40]:
bert_tokenizer = DistilBertTokenizer.from_pretrained("distilbert-base-uncased")
bert_model = DistilBertForSequenceClassification.from_pretrained('distilbert-base-uncased',num_labels=2)

Some weights of the model checkpoint at distilbert-base-uncased were not used when initializing DistilBertForSequenceClassification: ['vocab_layer_norm.bias', 'vocab_transform.bias', 'vocab_transform.weight', 'vocab_layer_norm.weight', 'vocab_projector.bias', 'vocab_projector.weight']
- This IS expected if you are initializing DistilBertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing DistilBertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['pre_classifier.bias', 'pre_classifier.weight', 'classi

In [22]:
input_ids=[]
attention_masks=[]

for sent in sentences:
    bert_inp=bert_tokenizer.encode_plus(sent,add_special_tokens = True,max_length =64,pad_to_max_length = True,return_attention_mask = True)
    input_ids.append(bert_inp['input_ids'])
    attention_masks.append(bert_inp['attention_mask'])

input_ids=np.asarray(input_ids)
attention_masks=np.array(attention_masks)
labels=np.array(labels)

Truncation was not explicitly activated but `max_length` is provided a specific value, please use `truncation=True` to explicitly truncate examples to max length. Defaulting to 'longest_first' truncation strategy. If you encode pairs of sequences (GLUE-style) with the tokenizer you can select this strategy more precisely by providing a specific strategy to `truncation`.
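
The truncation warning above appears because `max_length` is set without `truncation=True`. In recent versions of transformers, `pad_to_max_length` is also deprecated in favor of `padding='max_length'`. Below is a sketch of a batched alternative that passes both explicitly. `encode_batch` is a hypothetical helper, and it assumes `tokenizer` is a Hugging Face tokenizer such as `bert_tokenizer`.

```python
import numpy as np

def encode_batch(tokenizer, texts, max_length=64):
    """Tokenize a list of strings into fixed-length id and mask arrays."""
    enc = tokenizer(
        list(texts),
        add_special_tokens=True,
        max_length=max_length,
        padding="max_length",        # pad every sequence to max_length
        truncation=True,             # explicitly cut longer sequences
        return_attention_mask=True,
    )
    return np.asarray(enc["input_ids"]), np.asarray(enc["attention_mask"])
```

A call such as `input_ids, attention_masks = encode_batch(bert_tokenizer, sentences)` would replace the loop; with 1.6 million tweets it may be preferable to call it on chunks to limit memory use.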


In [23]:
len(input_ids),len(attention_masks),len(labels)

(1600000, 1600000, 1600000)

In [27]:
print('Preparing the pickle file.....')

pickle_inp_path='./bert_inp.pkl'
pickle_mask_path='./bert_mask.pkl'
pickle_label_path='./bert_label.pkl'

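# Context managers make sure each file is flushed and closed after writing.
with open(pickle_inp_path, 'wb') as f:
    pickle.dump(input_ids, f)
with open(pickle_mask_path, 'wb') as f:
    pickle.dump(attention_masks, f)
with open(pickle_label_path, 'wb') as f:
    pickle.dump(labels, f)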


print('Pickle files saved as ',pickle_inp_path,pickle_mask_path,pickle_label_path)

Preparing the pickle file.....
Pickle files saved as  ./bert_inp.pkl ./bert_mask.pkl ./bert_label.pkl


In [28]:
print('Loading the saved pickle files..')

with open(pickle_inp_path, 'rb') as f:
    input_ids = pickle.load(f)
with open(pickle_mask_path, 'rb') as f:
    attention_masks = pickle.load(f)
with open(pickle_label_path, 'rb') as f:
    labels = pickle.load(f)

print('Input shape {} Attention mask shape {} Input label shape {}'.format(input_ids.shape,attention_masks.shape,labels.shape))

Loading the saved pickle files..
Input shape (1600000, 64) Attention mask shape (1600000, 64) Input label shape (1600000,)


In [29]:
trainval_inp,test_inp,trainval_label,test_label,trainval_mask,test_mask=train_test_split(input_ids,labels,attention_masks,test_size=0.1)
train_inp,val_inp,train_label,val_label,train_mask,val_mask=train_test_split(trainval_inp,trainval_label,trainval_mask,test_size=0.1)
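
Two successive 10% splits leave roughly 81% of the rows for training, 9% for validation, and 10% for testing. The calls above are unseeded, so the partition changes on every run. Below is a minimal sketch with a fixed seed and stratified labels, run on toy arrays; `random_state=42` and `stratify` are my additions here, not settings from the original run.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy stand-ins for input_ids, attention_masks and labels.
n = 1000
ids = np.arange(n * 4).reshape(n, 4)
masks = np.ones((n, 4), dtype=int)
y = np.array([0, 1] * (n // 2))

# First split: hold out 10% as the test set.
tv_ids, te_ids, tv_y, te_y, tv_m, te_m = train_test_split(
    ids, y, masks, test_size=0.1, random_state=42, stratify=y)

# Second split: 10% of the remainder becomes the validation set.
tr_ids, va_ids, tr_y, va_y, tr_m, va_m = train_test_split(
    tv_ids, tv_y, tv_m, test_size=0.1, random_state=42, stratify=tv_y)

print(len(tr_ids), len(va_ids), len(te_ids))   # 810 90 100
```

Stratifying keeps the positive/negative ratio the same in all three subsets, which matters little for a balanced dataset of this size but costs nothing.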

In [41]:
log_dir='tensorboard_data/tb_bert'
model_save_path='./models/bert_model.h5'

callbacks = [
    tensorflow.keras.callbacks.ModelCheckpoint(
        filepath=model_save_path, save_weights_only=True,
        monitor='val_loss', mode='min', save_best_only=True),
    tensorflow.keras.callbacks.TensorBoard(log_dir=log_dir),
]

#print('\nBert Model',bert_model.summary())

loss = tensorflow.keras.losses.SparseCategoricalCrossentropy(from_logits=True)
metric = tensorflow.keras.metrics.SparseCategoricalAccuracy('accuracy')
optimizer = tensorflow.keras.optimizers.Adam(learning_rate=2e-5,epsilon=1e-08)

bert_model.compile(loss=loss,optimizer=optimizer,metrics=[metric])

AttributeError: 'DistilBertForSequenceClassification' object has no attribute 'compile'
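
The error arises because `DistilBertForSequenceClassification` is the PyTorch class: it is a `torch.nn.Module` and has no Keras `compile` or `fit` methods. The Keras-compatible counterpart in transformers is `TFDistilBertForSequenceClassification`. Below is a minimal sketch, assuming an installed transformers version that still ships the TensorFlow classes, together with a compatible TensorFlow/Keras. `build_tf_distilbert` is a hypothetical helper name, and the hyperparameters are copied from the cell above.

```python
def build_tf_distilbert(num_labels=2, learning_rate=2e-5):
    # Imported lazily so the heavy libraries are only loaded when called.
    import tensorflow as tf
    from transformers import TFDistilBertForSequenceClassification

    # The TF* classes subclass tf.keras.Model, so compile() and fit() exist.
    model = TFDistilBertForSequenceClassification.from_pretrained(
        "distilbert-base-uncased", num_labels=num_labels)

    model.compile(
        loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
        optimizer=tf.keras.optimizers.Adam(learning_rate=learning_rate,
                                           epsilon=1e-08),
        metrics=[tf.keras.metrics.SparseCategoricalAccuracy("accuracy")],
    )
    return model
```

Calling `bert_model = build_tf_distilbert()` downloads the pretrained weights. Training could then follow the pattern of the referenced tutorial, passing `[train_inp, train_mask]` as inputs and `train_label` as targets to `bert_model.fit`, though I have not verified that against this dataset here.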