
# Identify the Sentiments

In [7]:
# install modules
!pip install -q tensorflow
!pip install -q tensorflow_hub
!pip install -q bert-for-tf2
#!pip install -q sentencepiece


In [8]:
# import Modules
import pandas as pd
import numpy as np
import tensorflow as tf
import tensorflow_hub as hub
import bert
import re
import pickle
#import math

# Show tweets up to a larger column width
pd.set_option('display.max_colwidth', 200)

print("TF version: ", tf.__version__)
print("Hub version: ", hub.__version__)

TF version:  2.2.0
Hub version:  0.8.0


## Dataset Preprocessing

In [27]:
from google.colab import drive
drive.mount('/content/gdrive')

Mounted at /content/gdrive


In [104]:
# Data path
training_data_path = "/content/gdrive/My Drive/Identify_the_sentiments/train.csv"
test_data_path =  "/content/gdrive/My Drive/Identify_the_sentiments/test.csv"

save_path = "/content/gdrive/My Drive/Identify_the_sentiments/"

In [105]:
# reading data from csv
train_data = pd.read_csv(training_data_path)
test_data = pd.read_csv(test_data_path)

In [106]:
# data visualization
print(f"Number of training examples: {train_data.shape[0]}", '\n')
print(f"Number of test examples: {test_data.shape[0]}", '\n')

print(f"The fraction of positive and negative comments:")
print(train_data['label'].value_counts(normalize = True), '\n')

print("Training Dataframe:")
print(train_data.head())

Number of training examples: 7920 

Number of test examples: 1953 

The fraction of positive and negative comments:
0    0.744192
1    0.255808
Name: label, dtype: float64 

Training Dataframe:
   id  ...                                                                                                                                tweet
0   1  ...     #fingerprint #Pregnancy Test https://goo.gl/h1MfQV #android #apps #beautiful #cute #health #igers #iphoneonly #iphonesia #iphone
1   2  ...  Finally a transparant silicon case ^^ Thanks to my uncle :) #yay #Sony #Xperia #S #sonyexperias… http://instagram.com/p/YGEt5JC6JM/
2   3  ...          We love this! Would you go? #talk #makememories #unplug #relax #iphone #smartphone #wifi #connect... http://fb.me/6N3LsUpCu
3   4  ...                     I'm wired I know I'm George I was made that way ;) #iphone #cute #daventry #home http://instagr.am/p/Li_5_ujS4k/
4   5  ...         What amazing service! Apple won't even talk to me about a question 

In [107]:
# removing URLs from data
train_data['clean_tweet'] = train_data['tweet'].apply(lambda x: re.sub(r'http\S+', '', x))
test_data['clean_tweet'] = test_data['tweet'].apply(lambda x: re.sub(r'http\S+', '', x))

In [108]:
# remove twitter handles (raw string avoids an invalid-escape warning for \w)
train_data['clean_tweet'] = train_data['clean_tweet'].apply(lambda x: re.sub(r"@\w*", '', x))
test_data['clean_tweet'] = test_data['clean_tweet'].apply(lambda x: re.sub(r"@\w*", '', x))

In [109]:
# remove punctuation (build the set once instead of once per character)
punctuation = set('.,\'!"#$%&()*+-/:;<=>?@[\\]^_`{|}~')

train_data['clean_tweet'] = train_data['clean_tweet'].apply(lambda x: "".join(ch for ch in x if ch not in punctuation))
test_data['clean_tweet'] = test_data['clean_tweet'].apply(lambda x: "".join(ch for ch in x if ch not in punctuation))

In [110]:
# convert to lower case

train_data['clean_tweet'] = train_data['clean_tweet'].str.lower()
test_data['clean_tweet'] = test_data['clean_tweet'].str.lower()

In [111]:
# replace digits with spaces (regex=True is required on recent pandas versions)

train_data['clean_tweet'] = train_data['clean_tweet'].str.replace(r"[0-9]", " ", regex=True)
test_data['clean_tweet'] = test_data['clean_tweet'].str.replace(r"[0-9]", " ", regex=True)

In [112]:
# collapse repeated whitespace and strip leading/trailing spaces

train_data['clean_tweet'] = train_data['clean_tweet'].apply(lambda x: ' '.join(x.split()))
test_data['clean_tweet'] = test_data['clean_tweet'].apply(lambda x: ' '.join(x.split()))
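
The cleaning cells above can also be read as a single function applied to one string at a time. The following is a minimal sketch, assuming the same steps in the same order; the name `clean_tweet` and the sample tweet are illustrative and not part of the original notebook.

```python
import re

# Same punctuation characters as in the cell above.
PUNCTUATION = set('.,\'!"#$%&()*+-/:;<=>?@[\\]^_`{|}~')

def clean_tweet(text):
    """Apply the notebook's cleaning steps, in order, to one string."""
    text = re.sub(r'http\S+', '', text)   # remove URLs
    text = re.sub(r'@\w*', '', text)      # remove Twitter handles
    text = ''.join(ch for ch in text if ch not in PUNCTUATION)  # remove punctuation
    text = text.lower()                   # convert to lower case
    text = re.sub(r'[0-9]', ' ', text)    # replace digits with spaces
    return ' '.join(text.split())         # collapse whitespace

# Made-up sample tweet (hypothetical, not from the dataset).
print(clean_tweet("Thanks @apple for my #iPhone7 :) http://fb.me/abc"))
# thanks for my iphone
```

With such a function, each dataframe could be cleaned in one pass, for example `train_data['tweet'].apply(clean_tweet)`.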

#### BERT feature embedding extraction

To extract BERT feature embeddings, follow these steps:
 * Install the `bert-for-tf2` module and import `bert`.
 * Get the tokenizer using the `get_tokenizer` function and tokenize each sentence.
 * Add the `"[CLS]"` and `"[SEP]"` tokens at the start and end of the tokenized sequence, respectively.
 * Get `input_ids`, `input_masks` and `input_segments` using the `get_ids`, `get_masks` and `get_segments` functions, respectively.
 * Build the model using the `embedding_model` function.
 * Convert the inputs into NumPy arrays. Each input should have shape `[batch_size, max_seq_length]`.
 * Use `model.predict` to get the output: `model.predict([input_ids,input_masks,input_segments])` when the arrays already have a batch dimension, or `model.predict([[input_ids],[input_masks],[input_segments]])` for a single example without one.

In [113]:
# import bert_layer
bert_layer = hub.KerasLayer("https://tfhub.dev/tensorflow/bert_en_uncased_L-12_H-768_A-12/1", trainable=True)

**BERT tokenizer:**
* BERT was trained with WordPiece tokenization, which means that a single word can be split into more than one sub-word piece.
* Build the tokenizer from the model's original vocab file, lower-case the text (this is an uncased model), and then tokenize the sentences.

For example:
* **Input:** `'Hi we are using BERT'`
* **Output:** `['hi', 'we', 'are', 'using', 'bert']`
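
The example above contains only words that are in the vocabulary as a whole. To illustrate how a word can be split into sub-words, here is a toy sketch of the greedy longest-match-first idea behind WordPiece. The function `wordpiece` and the tiny `toy_vocab` are made up for illustration; the real `FullTokenizer` uses the full BERT vocab file and also performs basic text cleanup and punctuation splitting first.

```python
def wordpiece(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first split of one lower-cased word."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        piece = None
        # Try the longest remaining substring first, then shrink it.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry a ## prefix
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return [unk]  # nothing matched: the whole word is unknown
        pieces.append(piece)
        start = end
    return pieces

# Hypothetical mini-vocabulary, not the real BERT vocab.
toy_vocab = {"hi", "using", "bert", "token", "##ization", "##s"}

print(wordpiece("tokenization", toy_vocab))  # ['token', '##ization']
print(wordpiece("berts", toy_vocab))         # ['bert', '##s']
print(wordpiece("xyz", toy_vocab))           # ['[UNK]']
```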

In [114]:
# build tokenizer function

def get_tokenizer():
  FullTokenizer = bert.bert_tokenization.FullTokenizer
  vocab_file = bert_layer.resolved_object.vocab_file.asset_path.numpy()
  do_lower_case = bert_layer.resolved_object.do_lower_case.numpy()
  tokenizer = FullTokenizer(vocab_file, do_lower_case)
  return tokenizer

In [115]:
# Building architecture of model

def embedding_model(max_seq_length = 128):
  # the input of model should be numpy arrays
  input_word_ids = tf.keras.layers.Input(shape=(max_seq_length,), dtype=tf.int32, name="input_word_ids")
  input_mask = tf.keras.layers.Input(shape=(max_seq_length,), dtype=tf.int32, name="input_mask")
  segment_ids = tf.keras.layers.Input(shape=(max_seq_length,), dtype=tf.int32, name="segment_ids")

  pooled_output, sequence_output = bert_layer([input_word_ids, input_mask, segment_ids])

  model = tf.keras.models.Model(inputs=[input_word_ids, input_mask, segment_ids], outputs=[pooled_output, sequence_output])
  return model

**Code Explanation:**

* **max_seq_length = 128**
  * BERT limits the length of a sequence after tokenization. For the standard pretrained BERT models the maximum is 512 tokens, including `[CLS]` and `[SEP]`. Any sequence length equal to or below this value can be used; this notebook uses 128.

* **Inputs:**
  * input token ids (the tokenizer maps each token to its id in the vocab file)
  * input masks (1 for real tokens, 0 for padding)
  * segment ids (for two-sentence inputs: 0 for the first sentence, 1 for the second; all 0 for a single sentence)
* **Outputs:**
  * pooled_output of shape [batch_size, 768], with one vector representing each entire input sequence
  * sequence_output of shape [batch_size, max_seq_length, 768], with one contextual vector for each input token
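
To make these shapes concrete, the sketch below uses random NumPy arrays as stand-ins for the two model outputs (no BERT model is loaded, and the values are meaningless). It shows how the input mask can be used to pick out the vectors of real tokens from `sequence_output`. The small `max_seq_length` of 8 and the variable names are assumptions chosen for readability.

```python
import numpy as np

batch_size, max_seq_length, hidden = 2, 8, 768

# Stand-ins for the model outputs: random values, realistic shapes.
rng = np.random.default_rng(0)
pooled_output = rng.normal(size=(batch_size, hidden))
sequence_output = rng.normal(size=(batch_size, max_seq_length, hidden))

# Example masks: the first input has 5 real tokens, the second has 3.
input_mask = np.array([[1, 1, 1, 1, 1, 0, 0, 0],
                       [1, 1, 1, 0, 0, 0, 0, 0]])

# Vector at the [CLS] position (index 0) for every example.
cls_vectors = sequence_output[:, 0, :]

# Per-token vectors of the first example with padding positions dropped.
first_tokens = sequence_output[0][input_mask[0] == 1]

print(pooled_output.shape, cls_vectors.shape, first_tokens.shape)
# (2, 768) (2, 768) (5, 768)
```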

**BERT original implementation of generating segments and masks:**

In [116]:
# See BERT paper: https://arxiv.org/pdf/1810.04805.pdf
# And BERT implementation convert_single_example() at https://github.com/google-research/bert/blob/master/run_classifier.py

def get_masks(tokens, max_seq_length=128):
    """Mask for padding"""
    if len(tokens)>max_seq_length:
        raise IndexError("Token length more than max seq length!")
    return [1]*len(tokens) + [0] * (max_seq_length - len(tokens))


def get_segments(tokens, max_seq_length=128):
    """Segments: 0 for the first sequence, 1 for the second"""
    if len(tokens)>max_seq_length:
        raise IndexError("Token length more than max seq length!")
    segments = []
    current_segment_id = 0
    for token in tokens:
        segments.append(current_segment_id)
        if token == "[SEP]":
            current_segment_id = 1
    return segments + [0] * (max_seq_length - len(tokens))


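def get_ids(tokens, tokenizer, max_seq_length=128):
    """Token ids from Tokenizer vocab, padded with 0 up to max_seq_length"""
    if len(tokens)>max_seq_length:
        raise IndexError("Token length more than max seq length!")
    token_ids = tokenizer.convert_tokens_to_ids(tokens)
    input_ids = token_ids + [0] * (max_seq_length-len(token_ids))
    return input_ids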
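
The tweets in this notebook are single sentences, so their segment ids are all 0. To show what `get_segments` does for a sentence pair, here is a small worked example. The two helpers are repeated (without the length check) only so the snippet runs on its own; the token list is made up and `max_seq_length=10` is chosen to keep the output short.

```python
def get_masks(tokens, max_seq_length=128):
    """1 for each real token, 0 for each padding position."""
    return [1] * len(tokens) + [0] * (max_seq_length - len(tokens))

def get_segments(tokens, max_seq_length=128):
    """0 up to and including the first [SEP], 1 afterwards, 0 for padding."""
    segments, current_segment_id = [], 0
    for token in tokens:
        segments.append(current_segment_id)
        if token == "[SEP]":
            current_segment_id = 1
    return segments + [0] * (max_seq_length - len(tokens))

# Hypothetical sentence pair, already tokenized and with special tokens added.
pair = ["[CLS]", "nice", "phone", "[SEP]", "bad", "battery", "[SEP]"]

print(get_masks(pair, max_seq_length=10))     # [1, 1, 1, 1, 1, 1, 1, 0, 0, 0]
print(get_segments(pair, max_seq_length=10))  # [0, 0, 0, 0, 1, 1, 1, 0, 0, 0]
```

Note that the first `[SEP]` still belongs to segment 0; the segment id switches only for the tokens that follow it.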

#### A small example of embedding extraction

In [62]:
# Build the tokenizer used in this example
max_seq_length = 128
tokenizer = get_tokenizer()

# Add the special tokens described in the BERT paper:
# [CLS] starts the sequence; its output serves as the sentence-level embedding,
# so the word vectors do not have to be combined by hand.
# [SEP] marks the end of the sentence.
s = "Hi we are using BERT"

stokens = tokenizer.tokenize(s)

stokens = ["[CLS]"] + stokens + ["[SEP]"]

input_ids = get_ids(stokens, tokenizer, max_seq_length)
input_masks = get_masks(stokens, max_seq_length)
input_segments = get_segments(stokens, max_seq_length)

In [63]:
# converting list into arrays
input_ids = np.array(input_ids)
input_masks = np.array(input_masks)
input_segments = np.array(input_segments)

In [64]:
input_ids.shape

(128,)

In [65]:
# adding batch dimension
input_ids = input_ids[np.newaxis,]
input_masks = input_masks[np.newaxis,]
input_segments = input_segments[np.newaxis,]

In [68]:
input_ids.shape

(1, 128)

In [23]:
print(stokens)
print(input_ids)
print(input_masks)
print(input_segments)

['[CLS]', 'hi', 'we', 'are', 'using', 'bert', '[SEP]']
[[  101  7632  2057  2024  2478 14324   102     0     0     0     0     0
      0     0     0     0     0     0     0     0     0     0     0     0
      0     0     0     0     0     0     0     0     0     0     0     0
      0     0     0     0     0     0     0     0     0     0     0     0
      0     0     0     0     0     0     0     0     0     0     0     0
      0     0     0     0     0     0     0     0     0     0     0     0
      0     0     0     0     0     0     0     0     0     0     0     0
      0     0     0     0     0     0     0     0     0     0     0     0
      0     0     0     0     0     0     0     0     0     0     0     0
      0     0     0     0     0     0     0     0     0     0     0     0
      0     0     0     0     0     0     0     0]]
[[1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
  0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
  0 0

In [24]:
# build the model and extract the feature embeddings
model = embedding_model(max_seq_length=128)
pool_embs, all_embs = model.predict([input_ids,input_masks,input_segments])

In [26]:
pool_embs

array([[-8.37910175e-01, -2.29771748e-01,  9.79753584e-02,
         6.81168735e-01, -1.24071524e-01, -9.38553140e-02,
         8.68438244e-01,  1.70397267e-01,  2.66095459e-01,
        -9.99814689e-01,  1.50845036e-01,  3.17748368e-01,
         9.70117331e-01, -1.59248844e-01,  9.11800981e-01,
        -4.88027602e-01,  1.43237943e-02, -5.00323713e-01,
         3.12510669e-01, -6.54799461e-01,  5.54589808e-01,
         9.74681973e-01,  6.18222952e-01,  1.98275760e-01,
         2.67373919e-01,  4.54036117e-01, -5.74853301e-01,
         9.19355690e-01,  9.29784775e-01,  5.73712468e-01,
        -6.49026275e-01,  7.66068771e-02, -9.75605726e-01,
        -1.56982809e-01, -7.71182403e-02, -9.74963188e-01,
         1.49606138e-01, -7.06762195e-01,  2.16055159e-02,
         9.01458412e-02, -8.69107187e-01,  1.76298708e-01,
         9.97606575e-01, -3.85354370e-01, -2.00644415e-02,
        -2.90139973e-01, -9.99891102e-01,  1.18405864e-01,
        -8.36001635e-01, -2.53294766e-01, -8.05879757e-0

#### Extracting embeddings from tweets

In [None]:
# get tokenizer and model
max_seq_length = 128
tokenizer = get_tokenizer()
model = embedding_model(max_seq_length)

In [118]:
# function to convert clean tweets into embeddings

def convert_tweets_into_embeddings(dataframe, tokenizer, model, max_seq_length=128):
  # tokenize the tweets and add the special tokens;
  # the tokens go into their own column so 'clean_tweet' is not overwritten
  dataframe['tokens'] = dataframe['clean_tweet'].apply(lambda x: tokenizer.tokenize(x))
  dataframe['tokens'] = dataframe['tokens'].apply(lambda x: ["[CLS]"] + x + ["[SEP]"])

  # get input_ids, input_masks, input_segments
  dataframe['input_ids'] = dataframe['tokens'].apply(lambda x: get_ids(x, tokenizer, max_seq_length))
  dataframe['input_masks'] = dataframe['tokens'].apply(lambda x: get_masks(x, max_seq_length))
  dataframe['input_segments'] = dataframe['tokens'].apply(lambda x: get_segments(x, max_seq_length))

  # convert them into numpy arrays of shape [num_tweets, max_seq_length]
  input_ids = np.array(dataframe['input_ids'].values.tolist())
  input_masks = np.array(dataframe['input_masks'].values.tolist())
  input_segments = np.array(dataframe['input_segments'].values.tolist())

  # pool_embs: [num_tweets, 768], all_embs: [num_tweets, max_seq_length, 768]
  pool_embs, all_embs = model.predict([input_ids, input_masks, input_segments])
  return pool_embs, all_embs
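
`get_masks` and `get_segments` raise an `IndexError` when a token list is longer than `max_seq_length`, so one overly long tweet would stop the whole conversion. A possible safeguard, sketched below, is to truncate the tokens to `max_seq_length - 2` before adding `[CLS]` and `[SEP]`, which mirrors what `convert_single_example()` in the referenced BERT implementation does for single sentences. The helper name `add_special_tokens` is hypothetical and not part of the original notebook.

```python
def add_special_tokens(tokens, max_seq_length=128):
    """Truncate to leave room for [CLS] and [SEP], then add both tokens."""
    tokens = tokens[:max_seq_length - 2]
    return ["[CLS]"] + tokens + ["[SEP]"]

# A made-up token list that is too long for max_seq_length=128.
long_tokens = ["tok"] * 200
prepared = add_special_tokens(long_tokens, max_seq_length=128)

print(len(prepared), prepared[0], prepared[-1])
# 128 [CLS] [SEP]
```

Short inputs pass through unchanged apart from the two added tokens, so the helper could replace the `["[CLS]"] + x + ["[SEP]"]` step if truncation is acceptable for the task.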

In [119]:
# get embeddings; each result is a (pool_embs, all_embs) tuple
bert_train = convert_tweets_into_embeddings(train_data, tokenizer, model, max_seq_length=128)

bert_test = convert_tweets_into_embeddings(test_data, tokenizer, model, max_seq_length=128)

In [124]:
# save the extracted embeddings; the with-block closes each file automatically
with open(save_path + "bert_train.pickle", mode='wb') as train_file:
    pickle.dump(bert_train, train_file)

with open(save_path + "bert_test.pickle", mode='wb') as test_file:
    pickle.dump(bert_test, test_file)
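
Each pickle file holds a `(pool_embs, all_embs)` tuple, so it can be unpacked directly when loaded in a later notebook. The sketch below demonstrates the round trip with small zero-filled arrays and a temporary file. The file name `bert_demo.pickle` and the array sizes are placeholders, not the real saved data.

```python
import os
import pickle
import tempfile

import numpy as np

# Small stand-ins for (pool_embs, all_embs) with the same layout.
embeddings = (np.zeros((4, 768), dtype=np.float32),
              np.zeros((4, 128, 768), dtype=np.float32))

# Write to a temporary location instead of Google Drive.
path = os.path.join(tempfile.mkdtemp(), "bert_demo.pickle")
with open(path, "wb") as f:
    pickle.dump(embeddings, f)

# Load the tuple back and unpack it.
with open(path, "rb") as f:
    pool_embs, all_embs = pickle.load(f)

print(pool_embs.shape, all_embs.shape)
# (4, 768) (4, 128, 768)
```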