 Using the BERT tokenizer

In [9]:
from transformers import BertTokenizer

btokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

sentence = "He lived characteristically idle and romantic"

tokens = btokenizer.tokenize(sentence)

tokens

['he', 'lived', 'characteristic', '##ally', 'idle', 'and', 'romantic']
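The split of "characteristically" into `characteristic` and `##ally` comes from WordPiece, which matches the longest vocabulary entry first and marks continuation pieces with `##`. Below is a minimal sketch of that greedy matching. The function `wordpiece_tokenize` and the tiny `toy_vocab` are hypothetical stand-ins for illustration, not the library's implementation (the real bert-base-uncased vocabulary has roughly 30,000 entries).

```python
def wordpiece_tokenize(word, vocab, unk_token="[UNK]"):
    """Greedy longest-match-first split of one lowercase word into word pieces."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Shrink the candidate substring until it is found in the vocabulary.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry a ## prefix
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return [unk_token]  # no piece fits, so the whole word is unknown
        pieces.append(match)
        start = end
    return pieces


# Toy vocabulary (hypothetical), just large enough for the example sentence.
toy_vocab = {"he", "lived", "characteristic", "##ally", "idle", "and", "romantic"}

sentence = "He lived characteristically idle and romantic"
tokens = []
for word in sentence.lower().split():  # the uncased model lowercases its input
    tokens.extend(wordpiece_tokenize(word, toy_vocab))
print(tokens)
```

With this toy vocabulary the sketch reproduces the token list shown above; a word with no matching pieces falls back to `[UNK]`.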

In [10]:
ids = btokenizer.convert_tokens_to_ids(tokens)
ids

[2002, 2973, 8281, 3973, 18373, 1998, 6298]

In [13]:
# encode() tokenizes, maps tokens to ids, and adds the [CLS] and [SEP] ids in one call

from transformers import BertTokenizer

btokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
sentence = "He lived characteristically idle and romantic."
ids = btokenizer.encode(sentence)
ids

[101, 2002, 2973, 8281, 3973, 18373, 1998, 6298, 1012, 102]
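Compared with the `convert_tokens_to_ids` result, the `encode` output has three extra ids: 101 (`[CLS]`) at the front, 1012 for the trailing period that this sentence adds, and 102 (`[SEP]`) at the end. A small pure-Python sketch of that wrapping step, where `add_special_tokens` is a hypothetical helper and not a library function:

```python
CLS_ID = 101  # id of [CLS] in the bert-base-uncased vocabulary
SEP_ID = 102  # id of [SEP] in the bert-base-uncased vocabulary


def add_special_tokens(token_ids):
    """Wrap a single sequence of word-piece ids as [CLS] ... [SEP]."""
    return [CLS_ID] + list(token_ids) + [SEP_ID]


# Word-piece ids from the earlier cell, plus 1012 for the final "."
word_piece_ids = [2002, 2973, 8281, 3973, 18373, 1998, 6298, 1012]
ids_with_specials = add_special_tokens(word_piece_ids)
print(ids_with_specials)
```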

In [15]:
# Pad the sentence to a fixed length with encode_plus from the Hugging Face library
from transformers import BertTokenizer

btokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
sentence = "He lived characteristically idle and romantic."
encoded = btokenizer.encode_plus(text=sentence,
                                 add_special_tokens=True,  # add [CLS] and [SEP] tokens
                                 max_length=12,
                                 padding="max_length",     # pad up to max_length (replaces the deprecated pad_to_max_length=True)
                                 truncation=True,          # truncate explicitly to max_length
                                 return_tensors="tf")      # use "pt" for PyTorch tensors
token_ids = encoded["input_ids"]
print(token_ids)


tf.Tensor([[  101  2002  2973  8281  3973 18373  1998  6298  1012   102     0     0]], shape=(1, 12), dtype=int32)
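The two trailing zeros are `[PAD]` ids, which have id 0 in bert-base-uncased. Padding is normally paired with an attention mask that marks real tokens with 1 and padding with 0. A minimal sketch of both steps follows, assuming a hypothetical helper `pad_and_mask`. Its truncation is deliberately naive: it cuts from the right and, unlike the real tokenizer, does not keep `[SEP]` at the end.

```python
PAD_ID = 0  # id of [PAD] in the bert-base-uncased vocabulary


def pad_and_mask(token_ids, max_length, pad_id=PAD_ID):
    """Right-pad ids to max_length and build the matching attention mask."""
    ids = list(token_ids)[:max_length]  # naive right-truncation, for illustration only
    n_pad = max_length - len(ids)
    attention_mask = [1] * len(ids) + [0] * n_pad  # 1 = real token, 0 = padding
    return ids + [pad_id] * n_pad, attention_mask


encoded_ids = [101, 2002, 2973, 8281, 3973, 18373, 1998, 6298, 1012, 102]
padded_ids, mask = pad_and_mask(encoded_ids, max_length=12)
print(padded_ids)
print(mask)
```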


 Obtaining BERT word vectors

In [17]:
from transformers import BertTokenizer, TFBertModel

btokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
bmodel = TFBertModel.from_pretrained('bert-base-uncased')

sentence = "He was idle."
encoded = btokenizer.encode_plus(text=sentence,
                                 add_special_tokens=True,
                                 max_length=10,
                                 padding="max_length",  # replaces the deprecated pad_to_max_length=True
                                 truncation=True,       # truncate explicitly to max_length
                                 return_tensors='tf')
inputs = encoded['input_ids']
outputs = bmodel(inputs)

Some layers from the model checkpoint at bert-base-uncased were not used when initializing TFBertModel: ['mlm___cls', 'nsp___cls']
- This IS expected if you are initializing TFBertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing TFBertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
All the layers of TFBertModel were initialized from the model checkpoint at bert-base-uncased.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFBertModel for predictions without further training.

In [20]:
outputs[0].shape

TensorShape([1, 10, 768])

In [21]:
outputs[1].shape

TensorShape([1, 768])
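The two shapes correspond to the per-token hidden states, `outputs[0]` with shape (batch, sequence length, hidden size), and the pooled output, `outputs[1]` with shape (batch, hidden size). In BERT the pooled output is the hidden state of the first token (`[CLS]`) passed through a dense layer with a tanh activation. The sketch below mimics that step in pure Python with random stand-in values; the hidden size of 4 and the weights are assumptions chosen to keep the toy small (bert-base uses 768 and learned weights).

```python
import math
import random


def pool_cls(sequence_output, weight, bias):
    """Compute tanh(W @ h_cls + b) per example, where h_cls is the first token's vector."""
    pooled = []
    for example in sequence_output:
        h_cls = example[0]  # hidden state at position 0, the [CLS] token
        row = [
            math.tanh(sum(w * h for w, h in zip(w_row, h_cls)) + b)
            for w_row, b in zip(weight, bias)
        ]
        pooled.append(row)
    return pooled


random.seed(0)
batch, seq_len, hidden = 1, 10, 4  # toy sizes; hidden would be 768 for bert-base

# Random stand-ins for the model's hidden states and pooler weights.
sequence_output = [
    [[random.uniform(-1, 1) for _ in range(hidden)] for _ in range(seq_len)]
    for _ in range(batch)
]
weight = [[random.uniform(-1, 1) for _ in range(hidden)] for _ in range(hidden)]
bias = [0.0] * hidden

pooled_output = pool_cls(sequence_output, weight, bias)
print(len(sequence_output), len(sequence_output[0]), len(sequence_output[0][0]))
print(len(pooled_output), len(pooled_output[0]))
```

The sequence axis disappears in the pooled result, which is why it suits sentence-level tasks such as classification.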

 Using BERT for text classification

In [4]:
# Using the Kaggle spam dataset
# Data preprocessing
import pandas as pd

spam_df = pd.read_csv("data/spam.csv", encoding="ISO-8859-1")
spam_df.head()

,v1,v2,Unnamed: 2,Unnamed: 3,Unnamed: 4
0,ham,"Go until jurong point, crazy.. Available only ...",,,
1,ham,Ok lar... Joking wif u oni...,,,
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...,,,
3,ham,U dun say so early hor... U c already then say...,,,
4,ham,"Nah I don't think he goes to usf, he lives aro...",,,


In [5]:
spam_df.drop(columns=spam_df.columns[2:], inplace=True)  # keep only v1 (label) and v2 (text)

In [6]:
spam_df.head()

,v1,v2
0,ham,"Go until jurong point, crazy.. Available only ..."
1,ham,Ok lar... Joking wif u oni...
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...
3,ham,U dun say so early hor... U c already then say...
4,ham,"Nah I don't think he goes to usf, he lives aro..."


In [7]:
spam_df.columns = ['label', 'text'] 

In [8]:
spam_df.head()

,label,text
0,ham,"Go until jurong point, crazy.. Available only ..."
1,ham,Ok lar... Joking wif u oni...
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...
3,ham,U dun say so early hor... U c already then say...
4,ham,"Nah I don't think he goes to usf, he lives aro..."


In [9]:
data = spam_df.dropna()

In [10]:
data = data.reset_index(drop=True)

In [11]:
data['label'] = data['label'].map({"ham": 0, "spam": 1})
data.head()

,label,text
0,0,"Go until jurong point, crazy.. Available only ..."
1,0,Ok lar... Joking wif u oni...
2,1,Free entry in 2 a wkly comp to win FA Cup fina...
3,0,U dun say so early hor... U c already then say...
4,0,"Nah I don't think he goes to usf, he lives aro..."


In [12]:
sentences = data['text']
labels = data['label']
len(sentences), len(labels)

(5572, 5572)
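pandas `Series.map` yields NaN for any value missing from the mapping dictionary, so it can be worth checking the label set before training. A small pure-Python sketch of that check, using only the five rows visible in the `head()` output above (the variable names are hypothetical):

```python
from collections import Counter

label_map = {"ham": 0, "spam": 1}
raw_labels = ["ham", "ham", "spam", "ham", "ham"]  # the five rows shown by head()

# Fail early if a label is not covered by the mapping.
unknown = set(raw_labels) - set(label_map)
if unknown:
    raise ValueError(f"unmapped labels: {sorted(unknown)}")

encoded_labels = [label_map[label] for label in raw_labels]
class_counts = Counter(encoded_labels)
print(encoded_labels)
print(class_counts)
```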

In [3]:
from transformers import BertTokenizer, TFBertModel

bert_tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
bert_model = TFBertModel.from_pretrained("bert-base-uncased")




In [13]:
import numpy as np
import tensorflow 
from tensorflow.keras.layers import Dense, Input
from tensorflow.keras.models import Model

In [25]:
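input_ids = []
attention_masks = []
for sent in sentences:
    # Without return_tensors, encode_plus returns plain Python lists of length 64,
    # so the stacked arrays have shape (num_sentences, 64), matching the model input.
    bert_input = bert_tokenizer.encode_plus(text=sent,
                                            add_special_tokens=True,
                                            max_length=64,
                                            padding="max_length",  # replaces the deprecated pad_to_max_length=True
                                            truncation=True,       # cut messages longer than 64 tokens
                                            return_attention_mask=True)
    input_ids.append(bert_input['input_ids'])
    attention_masks.append(bert_input['attention_mask'])
input_ids = np.asarray(input_ids, dtype="int32")
attention_masks = np.asarray(attention_masks, dtype="int32")  # collected here, but the model below uses input_ids only
labels = np.array(labels)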

In [26]:
inputs = Input(shape=(64,), dtype="int32")
bert = bert_model(inputs)
bert = bert[1]  # pooled [CLS] representation, shape (batch_size, 768)



In [27]:
outputs = Dense(units=1, activation="sigmoid")(bert)
model = Model(inputs, outputs)
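The `Dense(units=1, activation="sigmoid")` head turns the pooled vector into a single spam probability. Here is a pure-Python sketch of that computation on a toy 4-dimensional vector. The weights, bias, and the 0.5 decision threshold are assumptions made for illustration; the real layer has 768 learned weights.

```python
import math


def sigmoid(z):
    """Logistic function, mapping any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def predict_spam(pooled_vector, weights, bias, threshold=0.5):
    """Return (probability, label) for one pooled vector; label 1 means spam."""
    z = sum(w * x for w, x in zip(weights, pooled_vector)) + bias
    probability = sigmoid(z)
    return probability, int(probability >= threshold)


# Toy values standing in for a pooled BERT vector and learned parameters.
pooled_vector = [0.2, -0.5, 0.9, 0.1]
weights = [1.0, -1.0, 2.0, 0.0]
bias = -1.0

probability, label = predict_spam(pooled_vector, weights, bias)
print(round(probability, 4), label)
```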

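# Training needs a lot of RAM; Google Colab is recommended.
# A single epoch took about 2.5 hours in Jupyter and about 1.5 hours in Colab,
# so the training call is left here as markdown rather than as an executed cell.

# Keras requires compile() before fit(); the optimizer, learning rate, and metric
# below are assumptions, not settings from a recorded run.
model.compile(optimizer=tensorflow.keras.optimizers.Adam(learning_rate=2e-5),
              loss="binary_crossentropy",
              metrics=["accuracy"])

history = model.fit(input_ids, labels, batch_size=1, epochs=1)

model.summary()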