# BERT Vectors for Text Classification
Using BERT vectors with PyTorch for Kaggle's disaster tweet classification challenge: https://www.kaggle.com/c/nlp-getting-started

As an initial step, tweets are tokenized and converted into BERT vectors using Hugging Face's [transformers library](https://github.com/huggingface/transformers). The final hidden state of the first token of each tweet is then used as the input features for a linear classifier. This follows the approach in this guide: http://jalammar.github.io/a-visual-guide-to-using-bert-for-the-first-time/

In [1]:
import pandas as pd
import numpy as np

from pathlib import Path

import re
import string

In [2]:
from sklearn.preprocessing import LabelEncoder, MultiLabelBinarizer
from sklearn.model_selection import (
    train_test_split,
    learning_curve,
    validation_curve,
    GridSearchCV,
    StratifiedKFold
)
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.svm import LinearSVC
from sklearn.pipeline import Pipeline
from sklearn.metrics import confusion_matrix, f1_score
from sklearn.base import TransformerMixin, BaseEstimator
from sklearn.multiclass import OneVsRestClassifier
from sklearn.ensemble import AdaBoostClassifier
from sklearn.utils import shuffle
from sklearn.linear_model import SGDClassifier, LogisticRegression
from sklearn.model_selection import cross_validate

In [3]:
import torch
import transformers as ppb

## Import data
The data files (train.csv and test.csv) are assumed to be in the working directory.

The data contains additional columns for keyword and location. We will ignore these for now.

In [4]:
df_train = pd.read_csv('train.csv', index_col='id')
df_test = pd.read_csv('test.csv', index_col='id')

In [5]:
df_train[:10]

id,keyword,location,text,target
1,,,Our Deeds are the Reason of this #earthquake M...,1
4,,,Forest fire near La Ronge Sask. Canada,1
5,,,All residents asked to 'shelter in place' are ...,1
6,,,"13,000 people receive #wildfires evacuation or...",1
7,,,Just got sent this photo from Ruby #Alaska as ...,1
8,,,#RockyFire Update => California Hwy. 20 closed...,1
10,,,#flood #disaster Heavy rain causes flash flood...,1
13,,,I'm on top of the hill and I can see a fire in...,1
14,,,There's an emergency evacuation happening now ...,1
15,,,I'm afraid that the tornado is coming to our a...,1


In [42]:
text = df_train['text'].to_list()
targets = df_train['target'].values

In [7]:
# X, y = shuffle(X, y, random_state=42)

In [8]:
# X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1, random_state=42)

## Load BERT model from transformers and apply tokenization
Let's try the standard BertTokenizer from transformers, together with the pretrained bert-base-uncased weights.

In [9]:
model_class, tokenizer_class, pretrained_weights = (
    ppb.BertModel, ppb.BertTokenizer, 'bert-base-uncased')

In [10]:
tokenizer = tokenizer_class.from_pretrained(pretrained_weights)
model = model_class.from_pretrained(pretrained_weights)

In [11]:
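# Keep the token IDs as a plain list of lists: tweets differ in length, so
# wrapping them in np.array would build a ragged array (an error in recent NumPy).
text_tokenized = [tokenizer.encode(tweet, add_special_tokens=True) for tweet in text]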
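
BertTokenizer splits each tweet into WordPiece sub-word tokens and wraps them in [CLS] and [SEP] markers before mapping them to IDs. The sketch below only illustrates the greedy longest-match-first idea on a made-up vocabulary. The names `toy_vocab`, `wordpiece` and `encode_tokens` are hypothetical and not part of transformers, and the real tokenizer also handles punctuation and text normalization, which are skipped here.

```python
def wordpiece(word, vocab, unk="[UNK]"):
    """Greedily split one lowercase word into the longest matching vocabulary pieces."""
    pieces, start = [], 0
    while start < len(word):
        end, match = len(word), None
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry a ## prefix
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return [unk]  # no piece fits: the whole word becomes unknown
        pieces.append(match)
        start = end
    return pieces


def encode_tokens(text, vocab):
    """Lowercase, split on whitespace, apply wordpiece and add the special markers."""
    tokens = ["[CLS]"]
    for word in text.lower().split():
        tokens.extend(wordpiece(word, vocab))
    tokens.append("[SEP]")
    return tokens


# Made-up vocabulary for illustration only
toy_vocab = {"[CLS]", "[SEP]", "[UNK]", "forest", "fire", "wild", "##fire", "##s", "near"}
print(encode_tokens("Wildfires near forest", toy_vocab))
# ['[CLS]', 'wild', '##fire', '##s', 'near', 'forest', '[SEP]']
```

Because a word can expand into several pieces, the encoded length of a tweet is usually larger than its word count, which is why the padded length is taken from the token lists rather than from the raw text.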

Pad the token ID lists with zeros up to the length of the longest tweet

In [12]:
# Right-pad every sequence with 0 up to the length of the longest sequence
max_len = max(len(i) for i in text_tokenized)
text_tokenized_padded = np.array([i + [0]*(max_len-len(i)) for i in text_tokenized])
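
The padding step can be tried in isolation on a few short sequences. This is a minimal sketch with made-up token IDs; `pad_token_ids` is a hypothetical helper name, and `pad_id=0` mirrors the zero padding used in the cell above.

```python
import numpy as np


def pad_token_ids(sequences, pad_id=0):
    """Right-pad each list of token IDs with pad_id to the longest length."""
    max_len = max(len(seq) for seq in sequences)
    return np.array([seq + [pad_id] * (max_len - len(seq)) for seq in sequences])


# Made-up token IDs of different lengths
toy_ids = [[101, 7, 8, 102], [101, 9, 102], [101, 4, 5, 6, 102]]
padded = pad_token_ids(toy_ids)
print(padded.shape)        # (3, 5)
print(padded[1].tolist())  # [101, 9, 102, 0, 0]
```

Once every row has the same length, the result is a regular 2-D integer array that can be converted to a tensor directly.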

Build the attention mask so that BERT ignores the padding positions

In [13]:
attention_mask = np.where(text_tokenized_padded != 0, 1, 0)
attention_mask.shape

(7613, 84)
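
To see what the mask looks like, here is a small sketch on made-up padded IDs. Comparing against 0 works on the assumption that 0 is used only for padding and never appears as a real token ID in a tweet. Building the mask from the original sequence lengths (the hypothetical `toy_lengths` below) gives the same result without relying on that.

```python
import numpy as np

# Made-up padded token IDs; trailing zeros are padding
toy_padded = np.array([[101, 7, 8, 102, 0],
                       [101, 9, 102, 0, 0]])

# Same construction as in the notebook: 1 for real tokens, 0 for padding
toy_mask = np.where(toy_padded != 0, 1, 0)
print(toy_mask.tolist())  # [[1, 1, 1, 1, 0], [1, 1, 1, 0, 0]]

# Alternative: derive the mask from the unpadded lengths
toy_lengths = np.array([4, 3])
mask_from_lengths = (np.arange(toy_padded.shape[1])[None, :] < toy_lengths[:, None]).astype(int)
print(np.array_equal(toy_mask, mask_from_lengths))  # True
```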

### Generate BERT vectors from model 

In [14]:
input_ids = torch.tensor(text_tokenized_padded) 
attention_mask = torch.tensor(attention_mask)

In [18]:
input_ids.shape

torch.Size([7613, 84])

In [16]:
input_ids_array = torch.split(input_ids, 20)

In [21]:
attention_mask_array = torch.split(attention_mask, 20)
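
torch.split with a chunk size of 20 cuts the tensors into consecutive batches of 20 rows, with a smaller final batch when the row count is not a multiple of 20. Presumably this keeps memory use manageable compared with a single forward pass over all tweets. The arithmetic can be checked without torch; `chunk_sizes` is a hypothetical helper for illustration.

```python
def chunk_sizes(n_rows, batch_size):
    """Sizes of consecutive chunks when n_rows are cut into groups of at most batch_size."""
    full, rest = divmod(n_rows, batch_size)
    return [batch_size] * full + ([rest] if rest else [])


sizes = chunk_sizes(7613, 20)
print(len(sizes), sizes[-1])  # 381 13
```

So the training tweets are processed in 381 forward passes, the last one containing 13 tweets.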

In [23]:
last_hidden_states_list = []
for inputs, masks in zip(input_ids_array, attention_mask_array):
    with torch.no_grad():
        last_hidden_states_list.append(model(inputs, attention_mask=masks))

In [30]:
hidden_states = []
for output in last_hidden_states_list:
    hidden_states.append(output[0])

Concatenate the batch outputs into a single tensor

In [36]:
hidden_states_cat = torch.cat(hidden_states, 0)

In [35]:
hidden_states_cat.shape

torch.Size([7613, 84, 768])

From the last hidden state (the first element of the model output), take the vector at position 0 of each sequence, i.e. the [CLS] token, as the feature vector for the classifier

In [37]:
features = hidden_states_cat[:,0,:].numpy()
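
The slice `[:,0,:]` keeps every tweet and every hidden dimension but only the first token position. A small NumPy sketch with toy sizes (the real tensor is 7613 × 84 × 768) shows the effect on the shape; `fake_hidden_states` is made-up data, not model output.

```python
import numpy as np

batch, seq_len, hidden = 4, 6, 8  # toy sizes for illustration
fake_hidden_states = np.arange(batch * seq_len * hidden, dtype=float).reshape(batch, seq_len, hidden)

# Keep only token position 0 for every row
cls_vectors = fake_hidden_states[:, 0, :]
print(cls_vectors.shape)  # (4, 8)
```

Each tweet is therefore reduced to a single fixed-length vector, regardless of how many tokens it contained.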

## Evaluate model
Use a vanilla linear SVM for now.

Create train/test splits

In [60]:
features, targets = shuffle(features, targets)

In [61]:
train_features, test_features, train_labels, test_labels = train_test_split(features, targets)
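
Neither shuffle nor train_test_split is given a random_state here, so the split (and the scores below) will change from run to run; by default train_test_split holds out 25% of the rows. Passing random_state to both calls would make the split repeatable. The NumPy sketch below shows roughly what a seeded shuffle-and-split does; `seeded_split` is a hypothetical helper, not scikit-learn's implementation.

```python
import numpy as np


def seeded_split(features, targets, test_fraction=0.25, seed=42):
    """Shuffle row indices with a fixed seed and hold out a fraction for testing."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(features))
    n_test = int(np.ceil(test_fraction * len(features)))
    test_idx, train_idx = order[:n_test], order[n_test:]
    return features[train_idx], features[test_idx], targets[train_idx], targets[test_idx]


# Toy data: 20 rows with 2 features each and alternating labels
toy_features = np.arange(40, dtype=float).reshape(20, 2)
toy_targets = np.arange(20) % 2
X_tr, X_te, y_tr, y_te = seeded_split(toy_features, toy_targets)
print(X_tr.shape, X_te.shape)  # (15, 2) (5, 2)
```

Calling the function again with the same seed returns exactly the same rows in each part.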

In [62]:
svm_clf = LinearSVC()

In [63]:
parameters = {'C': np.logspace(-4, 2, 20)}
grid_search = GridSearchCV(
    svm_clf,
    parameters,
    n_jobs=8,
    verbose=10,
    scoring='f1'
)
grid_search.fit(train_features, train_labels)

print('best parameters: ', grid_search.best_params_)
print('best scores: ', grid_search.best_score_)

Fitting 3 folds for each of 20 candidates, totalling 60 fits


[Parallel(n_jobs=8)]: Using backend LokyBackend with 8 concurrent workers.
[Parallel(n_jobs=8)]: Done   2 tasks      | elapsed:    1.6s
[Parallel(n_jobs=8)]: Done   9 tasks      | elapsed:    2.1s
[Parallel(n_jobs=8)]: Done  16 tasks      | elapsed:    2.6s
[Parallel(n_jobs=8)]: Done  25 tasks      | elapsed:    4.9s
[Parallel(n_jobs=8)]: Done  34 tasks      | elapsed:   15.5s
[Parallel(n_jobs=8)]: Done  45 tasks      | elapsed:   27.4s
[Parallel(n_jobs=8)]: Done  52 out of  60 | elapsed:   38.3s remaining:    5.9s
[Parallel(n_jobs=8)]: Done  60 out of  60 | elapsed:   45.4s finished


best parameters:  {'C': 0.007847599703514606}
best scores:  0.769830233171337
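
To put the selected C into context, it helps to look at where it sits in the search grid. The sketch below rebuilds the same grid with np.logspace and locates the reported value; only the grid bounds and the best C are taken from the cells above.

```python
import numpy as np

# Same grid as in the parameter search: 20 values from 1e-4 to 1e2 on a log scale
c_grid = np.logspace(-4, 2, 20)
best_c = 0.007847599703514606  # value reported by the grid search

best_index = int(np.argmin(np.abs(c_grid - best_c)))
print(best_index)  # 6
```

The best value is the seventh of twenty grid points, so it lies inside the searched range rather than on its lower edge, which suggests the grid does not need to be extended towards smaller values.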


An okay score. The best C value looks a little small! Let's try it on the held-out split.

In [64]:
svm_clf.set_params(**grid_search.best_params_)
svm_clf.fit(train_features, train_labels)

LinearSVC(C=0.007847599703514606, class_weight=None, dual=True,
          fit_intercept=True, intercept_scaling=1, loss='squared_hinge',
          max_iter=1000, multi_class='ovr', penalty='l2', random_state=None,
          tol=0.0001, verbose=0)

In [65]:
pred_labels = svm_clf.predict(test_features)

In [66]:
f1_score(test_labels, pred_labels)

0.7724317295188557
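
For reference, the F1 score is the harmonic mean of precision and recall for the positive class (here, a real disaster tweet). A minimal pure-Python sketch on made-up labels; `f1_from_labels` is a hypothetical helper, not the scikit-learn function.

```python
def f1_from_labels(y_true, y_pred, positive=1):
    """Binary F1: harmonic mean of precision and recall for the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    if tp == 0:
        return 0.0  # no true positives: precision and recall are both zero
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Made-up labels: 3 true positives, 1 false positive, 1 false negative
toy_true = [1, 1, 1, 0, 0, 1]
toy_pred = [1, 0, 1, 0, 1, 1]
print(f1_from_labels(toy_true, toy_pred))  # 0.75
```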

It's an okay score. It can probably be improved with some preprocessing. Let's output the predictions on the test data.

## Output predictions
For now, run the Kaggle test tweets through the same steps.

In [68]:
text = df_test['text'].to_list()
# Plain list of lists again: the token ID sequences have different lengths
text_tokenized = [tokenizer.encode(tweet, add_special_tokens=True) for tweet in text]
max_len = max(len(i) for i in text_tokenized)
text_tokenized_padded = np.array([i + [0]*(max_len-len(i)) for i in text_tokenized])

In [69]:
attention_mask = np.where(text_tokenized_padded != 0, 1, 0)
attention_mask.shape

(3263, 73)

In [70]:
input_ids = torch.tensor(text_tokenized_padded) 
attention_mask = torch.tensor(attention_mask)

In [71]:
input_ids_array = torch.split(input_ids, 20)
attention_mask_array = torch.split(attention_mask, 20)

In [72]:
last_hidden_states_list = []
for inputs, masks in zip(input_ids_array, attention_mask_array):
    with torch.no_grad():
        last_hidden_states_list.append(model(inputs, attention_mask=masks))

In [73]:
hidden_states = []
for output in last_hidden_states_list:
    hidden_states.append(output[0])

In [74]:
hidden_states_cat = torch.cat(hidden_states, 0)

In [75]:
features_test = hidden_states_cat[:,0,:].numpy()

Train the SVM on all available labelled data

In [77]:
svm_clf.fit(features, targets)

LinearSVC(C=0.007847599703514606, class_weight=None, dual=True,
          fit_intercept=True, intercept_scaling=1, loss='squared_hinge',
          max_iter=1000, multi_class='ovr', penalty='l2', random_state=None,
          tol=0.0001, verbose=0)

In [78]:
targets_output = svm_clf.predict(features_test)

In [79]:
df_test['target'] = targets_output
df_out = df_test[['target']]
df_out.to_csv('submission_bert.csv')
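
Because df_test is indexed by id, the file written above has an id column followed by target. Before uploading, a quick format check can catch obvious mistakes. The sketch below uses only the standard library and assumes the competition expects exactly these two columns with 0/1 targets; `check_submission` and `toy_csv` are hypothetical names for illustration.

```python
import csv
import io


def check_submission(csv_text, expected_columns=("id", "target")):
    """Return the number of rows if the header and the binary targets look right."""
    reader = csv.DictReader(io.StringIO(csv_text))
    if tuple(reader.fieldnames or ()) != tuple(expected_columns):
        raise ValueError(f"unexpected header: {reader.fieldnames}")
    n_rows = 0
    for row in reader:
        if row["target"] not in {"0", "1"}:
            raise ValueError(f"non-binary target in row {row}")
        n_rows += 1
    return n_rows


# Made-up file contents in the same layout as the submission
toy_csv = "id,target\n0,1\n2,0\n3,1\n"
print(check_submission(toy_csv))  # 3
```

For the real file, the text could be read with `open('submission_bert.csv').read()` and the returned row count compared with the 3263 test tweets.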