**BERT (Bidirectional Encoder Representations from Transformers)**

1. BERT provides pre-trained models for several languages, including English and Chinese.

2. BERT is a text representation technique, like word embeddings, but its representations are contextual: the vector for a word depends on the words around it.

3. BERT is built on the Transformer encoder and uses bidirectional self-attention, so every token can attend to the tokens on both its left and its right. It does not use LSTM layers.


**Installing and Importing Required Libraries**


1. Before using BERT text representations, we need to install BERT for TensorFlow 2.0.

2. Run the following pip commands to install BERT for TensorFlow 2.0 (in a notebook cell the commands are prefixed with !, in a terminal they are run without it).



In [0]:
!pip install bert-for-tf2
!pip install sentencepiece



**TensorFlow 2.0 on Google Colab**

1. TensorFlow is a widely used deep learning framework, and Jupyter Notebook is a common working environment for data scientists.

2. I am running TensorFlow 2.0 on Google Colab.

In [0]:
try:
    %tensorflow_version 2.x
except Exception:
    pass
import tensorflow as tf

import tensorflow_hub as hub

from tensorflow.keras import layers
import bert

**Storing the Dataset in Google Drive**

In [0]:
from google.colab import drive
drive.mount('/content/drive')
main_directory = '/content/drive/My Drive/Colab Notebooks/a.csv'

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [0]:
import pandas as pd

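1. The following script imports the dataset using the read_csv() function of **Pandas**, which returns a DataFrame. The file has no header row, so the column names are passed explicitly.

2. The script also checks for missing values and prints the shape of the dataset.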

In [0]:
import numpy as np

sentiment = pd.read_csv("/content/drive/My Drive/soft computing/a.csv",encoding = 'ISO-8859-1', header=None, names =['label','id','date','query','user','tweet'])

sentiment.isnull().values.any()

sentiment.shape

(1600000, 6)

In [0]:
sentiment.head(n=10)

Unnamed: 0,label,id,date,query,user,tweet
0,0,1467810369,Mon Apr 06 22:19:45 PDT 2009,NO_QUERY,_TheSpecialOne_,"@switchfoot http://twitpic.com/2y1zl - Awww, t..."
1,0,1467810672,Mon Apr 06 22:19:49 PDT 2009,NO_QUERY,scotthamilton,is upset that he can't update his Facebook by ...
2,0,1467810917,Mon Apr 06 22:19:53 PDT 2009,NO_QUERY,mattycus,@Kenichan I dived many times for the ball. Man...
3,0,1467811184,Mon Apr 06 22:19:57 PDT 2009,NO_QUERY,ElleCTF,my whole body feels itchy and like its on fire
4,0,1467811193,Mon Apr 06 22:19:57 PDT 2009,NO_QUERY,Karoli,"@nationwideclass no, it's not behaving at all...."
5,0,1467811372,Mon Apr 06 22:20:00 PDT 2009,NO_QUERY,joy_wolf,@Kwesidei not the whole crew
6,0,1467811592,Mon Apr 06 22:20:03 PDT 2009,NO_QUERY,mybirch,Need a hug
7,0,1467811594,Mon Apr 06 22:20:03 PDT 2009,NO_QUERY,coZZ,@LOLTrish hey long time no see! Yes.. Rains a...
8,0,1467811795,Mon Apr 06 22:20:05 PDT 2009,NO_QUERY,2Hood4Hollywood,@Tatiana_K nope they didn't have it
9,0,1467812025,Mon Apr 06 22:20:09 PDT 2009,NO_QUERY,mimismo,@twittera que me muera ?


In [0]:
sentiment.drop(["id", "date", "query", "user"],
          axis=1,
          inplace=True)

The cell above removes the columns we do not need (id, date, query and user).

In [0]:
sentiment.shape

(1600000, 2)

The output shows that our dataset now has 1,600,000 rows and 2 columns, which is all we need for the next steps.

In [0]:
sentiment.head(n=10)

Unnamed: 0,label,tweet
0,0,"@switchfoot http://twitpic.com/2y1zl - Awww, t..."
1,0,is upset that he can't update his Facebook by ...
2,0,@Kenichan I dived many times for the ball. Man...
3,0,my whole body feels itchy and like its on fire
4,0,"@nationwideclass no, it's not behaving at all...."
5,0,@Kwesidei not the whole crew
6,0,Need a hug
7,0,@LOLTrish hey long time no see! Yes.. Rains a...
8,0,@Tatiana_K nope they didn't have it
9,0,@twittera que me muera ?


**Preprocessing the Dataset**

1. To remove punctuation and special characters from the data, we define a function that takes a raw tweet from the tweet column as input and returns the cleaned tweet. It relies on the re module and the remove_tags() helper, both defined in the cell after it, so run that cell before calling the function.

In [0]:
def preprocess_text(sen):
    # Removing html tags
    sentence = remove_tags(sen)

    # Remove punctuations and numbers
    sentence = re.sub('[^a-zA-Z]', ' ', sentence)

    # Single character removal
    sentence = re.sub(r"\s+[a-zA-Z]\s+", ' ', sentence)

    # Removing multiple spaces
    sentence = re.sub(r'\s+', ' ', sentence)

    return sentence

In [0]:
import re

TAG_RE = re.compile(r'<[^>]+>')

def remove_tags(text):
    return TAG_RE.sub('', text)
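To see what the four cleaning steps do, here is a minimal, self-contained sketch that applies the same regular expressions to a single made-up string. The function name clean_tweet and the sample text are assumptions introduced only for this illustration.

```python
import re

TAG_RE = re.compile(r'<[^>]+>')

def clean_tweet(text):
    """Apply the same four cleaning steps as preprocess_text, in the same order."""
    text = TAG_RE.sub('', text)                   # strip HTML tags
    text = re.sub('[^a-zA-Z]', ' ', text)         # every non-letter becomes a space
    text = re.sub(r"\s+[a-zA-Z]\s+", ' ', text)   # drop isolated single letters
    text = re.sub(r'\s+', ' ', text)              # collapse runs of whitespace
    return text

# Made-up sample, not taken from the dataset.
sample = "<b>Wow</b> I can't believe it's 2009!!"
cleaned = clean_tweet(sample)
print(repr(cleaned))
```

Note one side effect of this approach: the apostrophe in "can't" is replaced by a space, which leaves a stray "t" that the single-character step then removes. The result also keeps a trailing space, because whitespace is collapsed but not stripped.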

In [0]:


# Clean every tweet; tweeter holds the preprocessed text in the original row order.
tweeter = [preprocess_text(sen) for sen in sentiment['tweet']]

In [0]:
print(sentiment.columns.values)

['label' 'tweet']


The tweet column contains the text, while the label column contains the numbers 0 and 4, which encode the sentiment.

In [0]:
sentiment.label.unique()

array([0, 4])

The following script maps label 4 (positive) to 1 and keeps label 0 (negative) as 0, storing the result in the array y.

In [0]:
import numpy as np


y = sentiment['label']

y = np.array(list(map(lambda x: 1 if x==4 else 0, y)))

In [0]:
sentiment.label.unique()

array([0, 4])

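Note that the label column of the DataFrame is unchanged and still holds 0 and 4; the remapped labels live in y. Next, we print one of the cleaned tweets.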

In [0]:
print(tweeter[160000])      #tweeter is the list of tweets after preprocessing

 tiffanylue know was listenin to bad habit earlier and started freakin at his part 


Let's check the sentiment of this tweet by printing the label at the same index.

In [0]:
print(y[160000])

0


Its label is 0, which means the tweet is negative.

With preprocessing done, we are now ready to tokenize our text data with BERT.

**Creating a BERT Tokenizer**

In order to **train** the text classification model on our tweets, we first need to tokenize them. Tokenization splits a sentence into tokens. The BERT tokenizer uses WordPiece, which keeps common words whole and splits rarer words into subword pieces (for example, snowing becomes snow and ##ing). Look at the following script:

In [0]:
BertTokenizer = bert.bert_tokenization.FullTokenizer
bert_layer = hub.KerasLayer("https://tfhub.dev/tensorflow/bert_en_uncased_L-12_H-768_A-12/1",
                            trainable=False)
vocabulary_file = bert_layer.resolved_object.vocab_file.asset_path.numpy()
to_lower_case = bert_layer.resolved_object.do_lower_case.numpy()
tokenizer = BertTokenizer(vocabulary_file, to_lower_case)

In the script above:

1. We first create an alias, BertTokenizer, for the FullTokenizer class from the bert.bert_tokenization module.

2. Next, we load a pre-trained BERT model as a Keras layer using hub.KerasLayer.

3. The trainable parameter is set to False, which means that the BERT weights will not be updated during training.

4. In the next line, we read the path of the BERT vocabulary file from the layer. The call to numpy() converts the tensor that holds the path into a plain value.

5. We read the do_lower_case flag in the same way. It tells the tokenizer whether to lowercase the text.

6. Finally, we pass the vocabulary_file and to_lower_case variables to the BertTokenizer constructor to create the tokenizer.

Note that in this notebook we only use the BERT tokenizer. The BERT layer is loaded solely to obtain its vocabulary.

Let's tokenize a sample sentence, as shown below:

In [0]:
tokenizer.tokenize("spring break in plain city it snowing")

['spring', 'break', 'in', 'plain', 'city', 'it', 'snow', '##ing']
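The split of snowing into snow and ##ing comes from WordPiece's greedy longest-match-first rule. The sketch below is a simplified illustration of that rule, not the library's implementation: the wordpiece function, the toy vocabulary and its ids are all assumptions made up for this example.

```python
def wordpiece(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first split of one lowercase word into subword pieces."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        piece = None
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate   # continuation pieces carry a ## prefix
            if candidate in vocab:
                piece = candidate
                break
            end -= 1                           # try a shorter candidate
        if piece is None:
            return [unk]                       # nothing fits: the whole word is unknown
        pieces.append(piece)
        start = end
    return pieces

# Toy vocabulary mapping token -> id (ids are made up, not BERT's real ids).
toy_vocab = {"[UNK]": 0, "spring": 1, "snow": 2, "##ing": 3, "it": 4}

tokens = [p for w in "spring it snowing zzz".split() for p in wordpiece(w, toy_vocab)]
ids = [toy_vocab[t] for t in tokens]
print(tokens)
print(ids)
```

Here "zzz" cannot be built from any piece in the toy vocabulary, so it is replaced by [UNK], which is the fallback behaviour described for out-of-vocabulary words.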

We can get the ids of the tokens using the convert_tokens_to_ids() method of the tokenizer object. Look at the following script:

In [0]:
tokenizer.convert_tokens_to_ids(tokenizer.tokenize("spring break in plain city it snowing"))

[3500, 3338, 1999, 5810, 2103, 2009, 4586, 2075]

Each token is mapped to its id in the tokenizer vocabulary. If a word cannot be represented with pieces from the vocabulary, the tokenizer uses the special [UNK] token and its id.
The model works on numbers rather than strings, which is why we need to convert the tokens to ids.

In [0]:
def tokenize_tweet(text_tweeter):
    return tokenizer.convert_tokens_to_ids(tokenizer.tokenize(text_tweeter))

In [0]:
tokenized_tweet = [tokenize_tweet(tweet) for tweet in tweeter]   #tweeter is the list of cleaned tweets

**Preparing Data For Training**

1. The tweets in our dataset have varying lengths.

2. Some tweets are very short while others are very long.

3. To train the model, the input sentences within a batch should be of equal length.

4. To create sentences of equal length, one way is to pad every sentence with 0s up to the length of the longest sentence in the dataset.

5. The other way is to pad sentences within each batch.

6. Since we will be training the model in batches, we can pad the sentences within each training batch locally, depending upon the length of the longest sentence in that batch. To do so, we first need to find the length of each sentence.

7. The following script creates a list of lists where each sublist contains tokenized tweet, the label of the tweet and the length of the tweet:

In [0]:
tweets_with_len = [[tweet, y[i], len(tweet)]
                 for i, tweet in enumerate(tokenized_tweet)]

In our dataset, the first half of the tweets are negative while the second half are positive. Therefore, in order to have both positive and negative tweets in the training batches, we need to shuffle the tweets. The following script shuffles the data randomly:

In [0]:
import random

random.shuffle(tweets_with_len)

Once the data is shuffled, we sort it by the length of the tweets. To do so, we use the sort() method of the list with a key that selects the third item of each sublist, i.e. the length of the tweet.

In [0]:
tweets_with_len.sort(key=lambda x: x[2])
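Shuffling before sorting is not wasted effort: Python's list.sort() is stable, so tweets of equal length keep the random order the shuffle gave them. The following minimal sketch shows the pattern on a few made-up records; the records list and the fixed seed are assumptions for illustration only.

```python
import random

# Hypothetical [token_ids, label, length] records, shaped like tweets_with_len.
records = [[[7, 8, 9], 1, 3], [[5], 0, 1], [[1, 2], 1, 2], [[6], 1, 1], [[3, 4], 0, 2]]

random.seed(0)                      # fixed seed only so the sketch is repeatable
random.shuffle(records)             # mix positive and negative examples
records.sort(key=lambda r: r[2])    # stable sort: ties keep their shuffled order

lengths = [r[2] for r in records]
print(lengths)
```

After the sort the lengths are in non-decreasing order, while the labels within each length group remain mixed.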

Once the tweets are sorted by length, we can remove the length attribute from all the tweets.

In [0]:
sorted_tweets_labels = [(tweet_lab[0], tweet_lab[1]) for tweet_lab in tweets_with_len]

Once the tweets are sorted, we convert the dataset so that it can be used to train TensorFlow 2.0 models. Run the following code to convert the sorted list into a tf.data.Dataset:

In [0]:
processed_dataset = tf.data.Dataset.from_generator(lambda: sorted_tweets_labels, output_types=(tf.int32, tf.int32))

Finally, we can pad our dataset for each batch. The batch size we are going to use is 34, which means that the weights of the neural network are updated after every 34 tweets. To pad the tweets locally with respect to batches, execute the following:

In [0]:
BATCH_SIZE = 34
batched_dataset = processed_dataset.padded_batch(BATCH_SIZE, padded_shapes=((None, ), ()))


Note: on a TPU, a batch size of 128 worked well and trained quickly, whereas on a GPU it caused a resource-exhaustion (out-of-memory) error. That is why a smaller batch size is used here for the GPU.
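To make per-batch padding concrete, here is a small pure-Python sketch of the idea behind padded_batch(): each batch is padded only up to its own longest sequence. The padded_batches helper and the toy token ids are assumptions for illustration and are not part of TensorFlow.

```python
def padded_batches(sequences, batch_size, pad_value=0):
    """Yield batches in which every sequence is padded to that batch's longest one."""
    for start in range(0, len(sequences), batch_size):
        batch = sequences[start:start + batch_size]
        width = max(len(seq) for seq in batch)
        yield [seq + [pad_value] * (width - len(seq)) for seq in batch]

# Hypothetical token-id sequences, already sorted by length.
toy = [[11], [12], [13, 14], [15, 16, 17], [18, 19, 20, 21]]
batches = list(padded_batches(toy, batch_size=2))
for b in batches:
    print(b)
```

Because the input is sorted by length, neighbouring sequences have similar lengths and little padding is needed. In this sketch only one 0 is added in total.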

Let's print the first batch and see how padding has been applied to it:

In [0]:
next(iter(batched_dataset))

(<tf.Tensor: shape=(34, 1), dtype=int32, numpy=
 array([[ 5983],
        [ 6023],
        [ 2731],
        [ 2074],
        [19613],
        [ 8404],
        [ 2189],
        [ 2621],
        [ 5030],
        [ 8785],
        [ 5541],
        [ 3892],
        [27017],
        [ 2498],
        [24057],
        [ 3985],
        [ 2147],
        [14978],
        [14978],
        [ 3649],
        [ 3478],
        [22708],
        [14978],
        [19453],
        [ 2283],
        [ 6928],
        [ 4071],
        [ 6616],
        [ 5457],
        [22708],
        [14978],
        [26316],
        [22708],
        [22708]], dtype=int32)>, <tf.Tensor: shape=(34,), dtype=int32, numpy=
 array([1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0,
        0, 0, 1, 0, 1, 0, 0, 1, 0, 1, 1, 1], dtype=int32)>)

The above output shows the first batch.
1. Because the data is sorted by length, all 34 tweets in this batch consist of a single token, so the batch has shape (34, 1).
2. Since every tweet in the batch already has the same length, no 0s had to be added here.
3. The padding for later batches will differ, depending upon the length of the longest tweet in each batch.

Once we have applied padding to our dataset, the next step is to divide the dataset into test and training sets. 

We can do that with the help of following code:

In [0]:
import math

TOTAL_BATCHES = math.ceil(len(sorted_tweets_labels) / BATCH_SIZE)
TEST_BATCHES = TOTAL_BATCHES // 10
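# Note: Dataset.shuffle() returns a new dataset. The result is not assigned here,
# so this call has no effect and the split below follows the length-sorted order.
batched_dataset.shuffle(TOTAL_BATCHES)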
test_data = batched_dataset.take(TEST_BATCHES)
train_data = batched_dataset.skip(TEST_BATCHES)

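In the code above:
1. We first find the total number of batches by dividing the total number of records by the batch size (34) and rounding up.
2. Next, 10% of the batches are set aside for testing.
3. To do so, we use the take() method of the batched_dataset object to store the first 10% of the batches in the test_data variable.
4. The remaining batches are stored in the train_data object for training using the skip() method.

Because Dataset.shuffle() returns a new dataset and its result is not assigned, the batches keep their length-sorted order, so test_data holds the shortest tweets.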
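The split arithmetic can be checked without TensorFlow. The sketch below recomputes the batch counts for 1,600,000 tweets and a batch size of 34, and imitates take() and skip() with list slicing. The list of integers is only a stand-in for real batches.

```python
import math

num_examples, batch_size = 1_600_000, 34

total_batches = math.ceil(num_examples / batch_size)   # the last batch may be partial
test_batches = total_batches // 10                      # roughly 10% for testing
train_batches = total_batches - test_batches

# Stand-ins for real batches: take(n) keeps the first n, skip(n) keeps the rest.
batches = list(range(total_batches))
test_part, train_part = batches[:test_batches], batches[test_batches:]

print(total_batches, test_batches, train_batches)
```

The two parts never overlap as long as the order of the batches is the same every time the dataset is iterated.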

The dataset has been prepared and now we are ready to create our text classification model.

**Creating the Model**
Now we are all set to create our model.
1. To do so, we will create a class named TEXT_MODEL that inherits from the tf.keras.Model class.
2. Inside the class we will define our model layers.
3. Our model will consist of three convolutional layers applied in parallel to the embedded tokens.
4. You can use LSTM layers instead and can also increase or decrease the number of layers.
5. I have copied the number and types of layers from SuperDataScience's Google Colab notebook, and this architecture seems to work quite well for this dataset as well.

Let's now create our model class:

In [0]:
class TEXT_MODEL(tf.keras.Model):
    
    def __init__(self,
                 vocabulary_size,
                 embedding_dimensions=128,
                 cnn_filters=50,
                 dnn_units=512,
                 model_output_classes=2,
                 dropout_rate=0.1,
                 training=False,
                 name="text_model"):
        super(TEXT_MODEL, self).__init__(name=name)
        
        self.embedding = layers.Embedding(vocabulary_size,
                                          embedding_dimensions)
        self.cnn_layer1 = layers.Conv1D(filters=cnn_filters,
                                        kernel_size=2,
                                        padding="valid",
                                        activation="relu")
        self.cnn_layer2 = layers.Conv1D(filters=cnn_filters,
                                        kernel_size=3,
                                        padding="valid",
                                        activation="relu")
        self.cnn_layer3 = layers.Conv1D(filters=cnn_filters,
                                        kernel_size=4,
                                        padding="valid",
                                        activation="relu")
        self.pool = layers.GlobalMaxPool1D()
        
        self.dense_1 = layers.Dense(units=dnn_units, activation="relu")
        self.dropout = layers.Dropout(rate=dropout_rate)
        if model_output_classes == 2:
            self.last_dense = layers.Dense(units=1,
                                           activation="sigmoid")
        else:
            self.last_dense = layers.Dense(units=model_output_classes,
                                           activation="softmax")
    
    def call(self, inputs, training):
        l = self.embedding(inputs)
        l_1 = self.cnn_layer1(l) 
        l_1 = self.pool(l_1) 
        l_2 = self.cnn_layer2(l) 
        l_2 = self.pool(l_2)
        l_3 = self.cnn_layer3(l)
        l_3 = self.pool(l_3) 
        
        concatenated = tf.concat([l_1, l_2, l_3], axis=-1) # (batch_size, 3 * cnn_filters)
        concatenated = self.dense_1(concatenated)
        concatenated = self.dropout(concatenated, training)
        model_output = self.last_dense(concatenated)
        
        return model_output

The above script is fairly straightforward.
1. In the constructor of the class, we declare some parameters with default values.
2. These defaults are overridden by the values passed when an object of the TEXT_MODEL class is created.

3. Next, three convolutional layers are initialized with kernel sizes of 2, 3, and 4, respectively.

4. Again, we can change the kernel sizes if we want.

5. Next, inside the call() function, each convolutional layer is applied to the embedding output, and global max pooling is applied to the output of each convolutional layer.
6. Finally, the three pooled outputs are concatenated and fed to the first densely connected layer.
7. The second densely connected layer predicts the output sentiment. Since there are only 2 classes, it has a single unit with a sigmoid activation.
8. In case you have more classes in the output, you can update the model_output_classes parameter accordingly. The last layer then uses one softmax unit per class.
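The convolution and pooling steps can be traced by hand on a tiny example. The sketch below is a deliberate simplification with a single input channel and one filter per kernel size, whereas the real layers use many channels and filters. All numbers and helper names are made up for illustration.

```python
def conv1d_valid(sequence, kernel):
    """Slide kernel over sequence without padding: len(sequence) - len(kernel) + 1 outputs."""
    k = len(kernel)
    return [sum(w * x for w, x in zip(kernel, sequence[i:i + k]))
            for i in range(len(sequence) - k + 1)]

def relu(values):
    return [max(0.0, v) for v in values]

def global_max_pool(values):
    return max(values)

signal = [1.0, -2.0, 3.0, 0.5, -1.0]    # hypothetical one-channel input sequence
kernels = {2: [1.0, 1.0],               # made-up weights for kernel sizes 2, 3 and 4
           3: [1.0, 0.0, -1.0],
           4: [0.5, 0.5, 0.5, 0.5]}

features = []
for size, kernel in kernels.items():
    activations = relu(conv1d_valid(signal, kernel))
    features.append(global_max_pool(activations))   # one number per filter

print(features)                          # concatenated input to the dense layer
print(conv1d_valid([1.0, 2.0], kernels[3]))
```

The last line shows a limitation of "valid" padding: a sequence shorter than the kernel produces no output positions at all, so very short inputs cannot pass through the larger kernels.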

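Let's now define the values for the hyperparameters of our model.

BERT itself accepts sequences of at most 512 tokens. This notebook does not fix a sequence length (SEQ_LEN), because each batch is padded to the length of its longest tweet.

VOCAB_LENGTH is the number of entries in the tokenizer vocabulary; it sets the input size of the embedding layer.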

In [0]:
VOCAB_LENGTH = len(tokenizer.vocab)
EMB_DIM = 200
CNN_FILTERS = 100
DNN_UNITS = 256
OUTPUT_CLASSES = 2

DROPOUT_RATE = 0.2

NB_EPOCHS = 3
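To get a feel for what these hyperparameters imply, the sketch below estimates the number of trainable parameters per layer. It assumes a vocabulary of 30,522 entries, the size of the uncased English BERT vocabulary. The notebook itself does not print len(tokenizer.vocab), so treat that figure and the helper functions as assumptions.

```python
def conv1d_params(kernel_size, in_channels, filters):
    return kernel_size * in_channels * filters + filters   # weights + biases

def dense_params(inputs, units):
    return inputs * units + units                           # weights + biases

VOCAB = 30522                      # assumption: uncased English BERT vocabulary size
EMB_DIM, CNN_FILTERS, DNN_UNITS = 200, 100, 256

embedding = VOCAB * EMB_DIM
convs = sum(conv1d_params(k, EMB_DIM, CNN_FILTERS) for k in (2, 3, 4))
dense_1 = dense_params(3 * CNN_FILTERS, DNN_UNITS)   # input: three pooled outputs
last_dense = dense_params(DNN_UNITS, 1)              # single sigmoid unit
total = embedding + convs + dense_1 + last_dense

print(embedding, convs, dense_1, last_dense, total)
```

Under these assumptions the embedding table accounts for the large majority of the parameters, so EMB_DIM is the hyperparameter with the biggest effect on model size.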

Next, we need to create an object of the TEXT_MODEL class and pass the hyperparameter values that we defined in the last step to its constructor.

In [0]:
text_model = TEXT_MODEL(vocabulary_size=VOCAB_LENGTH,
                        embedding_dimensions=EMB_DIM,
                        cnn_filters=CNN_FILTERS,
                        dnn_units=DNN_UNITS,
                        model_output_classes=OUTPUT_CLASSES,
                        dropout_rate=DROPOUT_RATE)

In [0]:
if OUTPUT_CLASSES == 2:
    text_model.compile(loss="binary_crossentropy",
                       optimizer="adam",
                       metrics=["accuracy"])
else:
    text_model.compile(loss="sparse_categorical_crossentropy",
                       optimizer="adam",
                       metrics=["sparse_categorical_accuracy"])
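For the two-class case the loss is binary cross-entropy, which penalises confident wrong predictions much more than hesitant ones. The sketch below computes it by hand for a few made-up labels and sigmoid outputs. The clipping constant eps is an assumption chosen for this illustration.

```python
import math

def binary_crossentropy(y_true, y_pred, eps=1e-7):
    """Mean of -[y*log(p) + (1-y)*log(1-p)], with p clipped away from 0 and 1."""
    total = 0.0
    for y, p in zip(y_true, y_pred):
        p = min(max(p, eps), 1.0 - eps)
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(y_true)

labels = [1, 0, 1]             # hypothetical true labels
confident = [0.9, 0.1, 0.8]    # hypothetical sigmoid outputs, mostly right
unsure = [0.5, 0.5, 0.5]       # a model that always outputs 0.5

loss_confident = binary_crossentropy(labels, confident)
loss_unsure = binary_crossentropy(labels, unsure)
print(loss_confident, loss_unsure)
```

A model that always outputs 0.5 scores log(2), about 0.693, which is a useful baseline when reading the loss reported during training.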

Finally, to train our model, we can use the fit() method of the model class.

In [0]:
text_model.fit(train_data, epochs=NB_EPOCHS)

Epoch 1/3
Epoch 2/3
Epoch 3/3


<tensorflow.python.keras.callbacks.History at 0x7f10d870f160>

In [0]:
results = text_model.evaluate(test_data)
print(results)

InvalidArgumentError: ignored

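The evaluation step above failed with an InvalidArgumentError, so this notebook reports no test metrics. One possible cause (an assumption, since the full traceback is not shown) is that the test batches hold the shortest tweets, some of which are shorter than the largest convolution kernel size of 4.

In summary, we can use the BERT tokenizer to convert text into token ids, and a trainable embedding layer followed by convolutional layers can use those ids to perform text classification.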