In [1]:
#Importing Pandas Library
import pandas as pd

In [2]:
#Reading the dataset
df = pd.read_csv("https://raw.githubusercontent.com/analyticsindiamagazine/MocksDatasets/main/food_review.csv")

In [3]:
#Visualizing Data
df.head()

Unnamed: 0,review,reaction
0,Service is friendly and inviting.,1
1,Awesome service and food.,1
2,Waitress was a little slow in service.,0
3,"Come hungry, leave happy and stuffed!",1
4,Horrible - don't waste your time and money.,0


# **Data Preprocessing**

In [4]:
#Visualizing the Shape of the data 
df.shape

(1000, 2)

In [5]:
df["reaction"].value_counts()

1    500
0    500
Name: reaction, dtype: int64

In [7]:
#Checking for NULL values
df.isnull().sum()

review      0
reaction    0
dtype: int64

In [8]:
#Checking for NA values (isna() is an alias of isnull(), so the result is the same)
df.isna().sum()

review      0
reaction    0
dtype: int64

In [9]:
#Checking for duplicate values
print("Total Number of duplicated:",df.duplicated().sum())
print("Shape of Data:",df.shape)

Total Number of duplicated: 4
Shape of Data: (1000, 2)


In [10]:
#Removing duplicate values 
df.drop_duplicates(inplace = True)
print("Total Number of duplicated:",df.duplicated().sum())
print("Shape of Data:",df.shape)

Total Number of duplicated: 0
Shape of Data: (996, 2)


# **BINARY ENCODER**

Import CountVectorizer from the sklearn library to vectorize the texts. All characters are converted to lowercase before tokenizing because lowercase = True, which is already the default. To give each unique word a binary label, set binary = True (the default is False). With binary = True, CountVectorizer does not count how often a word occurs; each cell is 1 if the word is present in the text sample and 0 if it is not.
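
To make the 0/1 encoding concrete, here is a minimal pure-Python sketch of the same idea. It is not scikit-learn's implementation: the two example reviews and the helper names `tokenize` and `binary_encode` are assumptions made for illustration, and the regular expression only approximates CountVectorizer's default tokenization (lowercased tokens of two or more word characters).

```python
import re

def tokenize(text):
    # Lowercase, then keep tokens made of two or more word characters
    # (an approximation of CountVectorizer's default behaviour).
    return re.findall(r"\b\w\w+\b", text.lower())

def binary_encode(docs):
    # Vocabulary: every unique token in the corpus, sorted alphabetically.
    vocab = sorted({tok for doc in docs for tok in tokenize(doc)})
    index = {tok: i for i, tok in enumerate(vocab)}
    rows = []
    for doc in docs:
        row = [0] * len(vocab)
        for tok in tokenize(doc):
            row[index[tok]] = 1  # presence flag, no matter how often it occurs
        rows.append(row)
    return vocab, rows

# Hypothetical mini-corpus, not taken from the food review dataset
docs = ["Awesome service and food.", "Food was good, service was slow."]
vocab, rows = binary_encode(docs)
print(vocab)
print(rows)
```

Note that "was" occurs twice in the second review but is still encoded as 1.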

In [11]:
#Importing CountVectorizer
from sklearn.feature_extraction.text import CountVectorizer
#Vectorization of input variables using count vectorizer
cv = CountVectorizer(binary = True,lowercase = True)
X = cv.fit_transform(df["review"].values)

After transforming the text into vectors, we use pandas to create a DataFrame whose columns are the unique words and whose rows are the reviews. The todense() method converts the sparse matrix returned by fit_transform into a dense matrix.

In [12]:
#get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
X = pd.DataFrame(X.todense(),columns= cv.get_feature_names_out())



In [13]:
#Input Variable
X

Unnamed: 0,00,10,100,11,12,15,17,1979,20,2007,...,yelpers,yet,you,your,yourself,yucky,yukon,yum,yummy,zero
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,1,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
991,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
992,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
993,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
994,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


# **Bag of Words**

CountVectorizer is a tool provided by the scikit-learn library that transforms a text into a vector based on the count of each word in it. In text analysis, CountVectorizer is used to convert each text sample into a vector. It builds a matrix in which each unique word is a column and each text sample in the corpus is a row. The value of each cell is the count of that word in that text sample.

Import CountVectorizer from the sklearn library again to vectorize the texts.

All characters are converted to lowercase before tokenizing through the default parameter lowercase = True. To count the frequency of each unique word, keep the other default parameter, binary = False.
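
The difference from the binary encoder shows up as soon as a word repeats within one text. The following is a small sketch of count-based encoding in plain Python, assuming a made-up two-review corpus and hypothetical helper names (`tokenize`, `count_encode`); it approximates, but does not reproduce, what CountVectorizer does internally.

```python
import re
from collections import Counter

def tokenize(text):
    # Lowercase, then keep tokens made of two or more word characters.
    return re.findall(r"\b\w\w+\b", text.lower())

def count_encode(docs):
    # One Counter per document holds the raw term counts.
    counts = [Counter(tokenize(doc)) for doc in docs]
    vocab = sorted(set().union(*counts))
    # A Counter returns 0 for words the document does not contain.
    rows = [[c[tok] for tok in vocab] for c in counts]
    return vocab, rows

# Hypothetical mini-corpus, not taken from the food review dataset
docs = ["Good food, good service.", "Service was slow."]
vocab, rows = count_encode(docs)
print(vocab)
print(rows)
```

Here "good" appears twice in the first review, so its cell holds 2 instead of the 1 a binary encoder would store.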



In [15]:
#Importing CountVectorizer
from sklearn.feature_extraction.text import CountVectorizer
#Vectorization of input variables using count vectorizer
cv = CountVectorizer(binary =False,lowercase = True)
X = cv.fit_transform(df["review"].values)

In [16]:
#get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
X = pd.DataFrame(X.todense(),columns= cv.get_feature_names_out())



In [17]:
#Input Variable
X

Unnamed: 0,00,10,100,11,12,15,17,1979,20,2007,...,yelpers,yet,you,your,yourself,yucky,yukon,yum,yummy,zero
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,1,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
991,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
992,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
993,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
994,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


# **Word embedding using TF-IDF**

The count vectorizer has two main drawbacks: longer documents produce larger counts and therefore carry more weight, and it cannot deal with contextual stopwords, i.e. words that are frequent throughout this particular corpus and so carry little information. The TF-IDF (Term Frequency-Inverse Document Frequency) technique is adopted to address both. Term frequency with normalization resolves the weightage issue, and inverse document frequency resolves the problem of contextual stopwords.

TF-IDF is broken down into two parts: TF (Term Frequency) and IDF (Inverse Document Frequency). Term frequency uses row normalization (L1 or L2) to overcome the weightage problem. Inverse document frequency assigns a weighting factor, or score, to each unique word. The score is low for words that appear in many documents, such as potential contextual stopwords, and high for words that appear in few, so the weight of frequent terms is reduced. After computing TF and IDF, we multiply the two values to obtain the TF-IDF value. Important (infrequent) words receive higher TF-IDF scores, and less important or less relevant words receive lower ones.
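
The calculation can be worked through by hand on a tiny corpus. The sketch below is an illustration under stated assumptions, not TfidfVectorizer's code: the three sentences and the function name `tfidf` are made up, tokenization is a plain whitespace split, and the smoothed IDF formula ln((1 + n) / (1 + df)) + 1 followed by L2 row normalization is, to my understanding, what scikit-learn applies with its default settings.

```python
import math
from collections import Counter

def tfidf(docs):
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    vocab = sorted({tok for toks in tokenized for tok in toks})
    # Document frequency: in how many documents does each term occur?
    df = Counter(tok for toks in tokenized for tok in set(toks))
    # Smoothed inverse document frequency
    idf = {tok: math.log((1 + n_docs) / (1 + df[tok])) + 1 for tok in vocab}
    rows = []
    for toks in tokenized:
        tf = Counter(toks)
        raw = [tf[tok] * idf[tok] for tok in vocab]
        # L2 row normalization so that long documents do not dominate
        norm = math.sqrt(sum(w * w for w in raw))
        rows.append([w / norm for w in raw])
    return vocab, rows

# Hypothetical mini-corpus, not taken from the food review dataset
docs = ["the food was great", "the service was slow", "the food was cold"]
vocab, rows = tfidf(docs)
first = dict(zip(vocab, rows[0]))
for word in ("the", "food", "great"):
    print(word, round(first[word], 3))
```

In the first sentence every word occurs once, yet "great" (one document) ends up with a larger weight than "food" (two documents), which in turn outweighs "the" (all three documents).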

Import TfidfVectorizer from the sklearn library to vectorize the texts.

The vectorizer is created with its default parameters. These include lowercase = True, so all characters are converted to lowercase before tokenizing, and norm = 'l2', so each row is L2-normalized.

In [18]:
#Importing Term Frequency-Inverse Document Frequency
from sklearn.feature_extraction.text import TfidfVectorizer
#Vectorization of input variables using TF-IDF
tv = TfidfVectorizer()
X = tv.fit_transform(df["review"].values)

In [19]:
#get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
X = pd.DataFrame(X.todense(),columns=tv.get_feature_names_out())



In [20]:
#Input Variable
X

Unnamed: 0,00,10,100,11,12,15,17,1979,20,2007,...,yelpers,yet,you,your,yourself,yucky,yukon,yum,yummy,zero
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.353539,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
991,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0
992,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0
993,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0
994,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0


# **BERT**

The BERT model can understand the context of a statement and generate meaningful vector representations of each word. BERT can also generate a single embedding vector for an entire sentence. A BERT Base model generates vectors of 768 dimensions. BERT is based on the transformer architecture, which is widely used in the NLP domain. There are two model versions of BERT:

BERT Base

BERT Large 

BERT Base - Comparable in size to the OpenAI Transformer in order to compare performance. The base version contains 12 encoding layers, a hidden size of 768, and 12 attention heads.

BERT Large - A much larger model made up of 24 encoding layers, a hidden size of 1024, and 16 attention heads.
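
One consequence of these numbers is easy to check with a little arithmetic: the hidden vector is split evenly across the attention heads. The snippet below is only a sketch; the `BertConfig` class is a hypothetical container written for this example and is not part of any BERT library.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class BertConfig:
    name: str
    layers: int
    hidden_size: int
    attention_heads: int

    @property
    def head_size(self):
        # Each attention head works on an equal slice of the hidden vector.
        return self.hidden_size // self.attention_heads

configs = [
    BertConfig("BERT Base", layers=12, hidden_size=768, attention_heads=12),
    BertConfig("BERT Large", layers=24, hidden_size=1024, attention_heads=16),
]
for cfg in configs:
    print(cfg.name, cfg.head_size)
```

Both versions end up with 64 dimensions per head; BERT Large is bigger because it has more layers, more heads, and a wider hidden vector.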

Model Inputs: The first input token is a special [CLS] token; CLS stands for Classification. BERT takes a sequence of words as input, which keeps flowing up the stack. Each layer applies self-attention, passes its results through a feed-forward network, and then hands them off to the next encoder.
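
The self-attention step can be illustrated with a heavily simplified sketch. This is not BERT's actual layer: a real encoder applies learned query, key, and value projections, several heads, residual connections, and a feed-forward network, whereas the hypothetical `self_attention` function below uses the input vectors themselves as queries, keys, and values. The three toy token vectors are made up.

```python
import math

def softmax(xs):
    # Subtract the maximum for numerical stability.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(vectors):
    # Simplification: queries, keys, and values are the inputs themselves.
    d = len(vectors[0])
    outputs, weights = [], []
    for q in vectors:
        # Scaled dot product between this token and every token
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in vectors]
        w = softmax(scores)
        weights.append(w)
        # Each output is a weighted mix of all token vectors.
        outputs.append([sum(wi * v[j] for wi, v in zip(w, vectors)) for j in range(d)])
    return outputs, weights

# Three hypothetical 2-dimensional token vectors
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
outputs, weights = self_attention(tokens)
print(weights[0])
```

Every output vector mixes information from all positions, which is why the resulting embeddings depend on context.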

Model Outputs: Each position outputs a vector of size hidden_size (768 in BERT Base).

 

BERT was trained by Google on 2,500 million words from Wikipedia and 800 million words from books. Google trained BERT using two approaches:

Masked Language model 

Next sentence prediction  

 

Now we perform word embedding for the food review dataset using the BERT model. 

The BERT model has two steps in the process:

BERT Preprocessing

BERT Embedding 

Let's locate the BERT model on the TensorFlow Hub website. TensorFlow Hub is a repository of trained machine learning models. We are going to use the BERT Base model, which has 12 encoders.

In [21]:
#Importing BERT Libraries
!pip3 install --quiet tensorflow-text
import tensorflow_hub as hub
import tensorflow_text as text

We can use the URL from TensorFlow Hub directly to download the model to the working directory. Each encoder model has a corresponding preprocessing model, so download the encoder using the encoder URL and the matching preprocessor using the preprocessing URL listed on TensorFlow Hub.

In [22]:
#BERT Preprocessing URL
preprocess_url = "https://tfhub.dev/tensorflow/bert_en_uncased_preprocess/3"
#BERT Encoding URL
encoder_url = "https://tfhub.dev/tensorflow/bert_en_uncased_L-12_H-768_A-12/4"

# **BERT Preprocessing**

Create a preprocessing layer to preprocess the textual data. The output of the preprocessing layer is a dictionary, so inspect its keys to understand what the layer produces.

In [23]:
#Text preprocessing layer
preprocessing_model = hub.KerasLayer(preprocess_url)
#Text Preprocessing using BERT
preprocessed_text = preprocessing_model(df["review"])
preprocessed_text.keys()



dict_keys(['input_word_ids', 'input_mask', 'input_type_ids'])

The preprocessing layer processed the textual data and produced a dictionary with three elements: 'input_mask', 'input_word_ids', and 'input_type_ids'. Now let's look at each element of this dictionary.

In [24]:
#Input_mask
preprocessed_text['input_mask']

<tf.Tensor: shape=(996, 128), dtype=int32, numpy=
array([[1, 1, 1, ..., 0, 0, 0],
       [1, 1, 1, ..., 0, 0, 0],
       [1, 1, 1, ..., 0, 0, 0],
       ...,
       [1, 1, 1, ..., 0, 0, 0],
       [1, 1, 1, ..., 0, 0, 0],
       [1, 1, 1, ..., 0, 0, 0]], dtype=int32)>

The first element of the dictionary is 'input_mask', which is 1 at positions that hold real tokens and 0 at padding positions. Its shape is (996, 128) because there are 996 reviews and every review is padded or truncated to a sequence length of 128 tokens.

In [25]:
#Input_type_ids
preprocessed_text["input_type_ids"]

<tf.Tensor: shape=(996, 128), dtype=int32, numpy=
array([[0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       ...,
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0]], dtype=int32)>

The second element of the dictionary is 'input_type_ids'. The input type ids are useful when an input consists of multiple texts or sentences, because they mark which segment each token belongs to; here every input is a single review, so all values are 0. The shape of 'input_type_ids' is (996, 128) because there are 996 reviews and the sequence length is 128.

In [26]:
#input_word_ids
preprocessed_text['input_word_ids']

<tf.Tensor: shape=(996, 128), dtype=int32, numpy=
array([[  101,  2326,  2003, ...,     0,     0,     0],
       [  101, 12476,  2326, ...,     0,     0,     0],
       [  101, 13877,  2001, ...,     0,     0,     0],
       ...,
       [  101,  1045,  3825, ...,     0,     0,     0],
       [  101,  1996,  2028, ...,     0,     0,     0],
       [  101,  1045,  2428, ...,     0,     0,     0]], dtype=int32)>

The third element of the dictionary is 'input_word_ids'. It assigns each token its id from the BERT vocabulary. The id for [CLS] is 101 and the id for [SEP] is 102, which is why every row starts with 101; the trailing zeros are padding.
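
How the three tensors relate to one another can be sketched in plain Python for a single review. This is an assumption-laden illustration rather than the TensorFlow Hub preprocessor: the function `pack` is hypothetical, real tokenization into WordPiece ids is skipped, and the two ids 2326 and 2003 are simply copied from the start of the first row of the output above.

```python
def pack(token_ids, seq_length=128, cls_id=101, sep_id=102, pad_id=0):
    # Wrap the tokens in [CLS] ... [SEP], truncating if they do not fit.
    ids = [cls_id] + token_ids[: seq_length - 2] + [sep_id]
    n_real = len(ids)
    n_pad = seq_length - n_real
    return {
        "input_word_ids": ids + [pad_id] * n_pad,
        # 1 for real tokens (including [CLS] and [SEP]), 0 for padding
        "input_mask": [1] * n_real + [0] * n_pad,
        # A single-segment input, so every position belongs to segment 0
        "input_type_ids": [0] * seq_length,
    }

# A short sequence length keeps the printed output readable.
packed = pack([2326, 2003], seq_length=8)
for key, value in packed.items():
    print(key, value)
```

With the default sequence length of 128, the same logic yields rows shaped like the ones in the tensors above.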

# **BERT Embedding**

In [28]:
import warnings
warnings.simplefilter("ignore")

After the preprocessing stage, we create an encoding layer from the encoder URL. This layer acts like a function: it takes the preprocessed text and generates the sentence and word embeddings. Since the output of this layer is also a dictionary, we can inspect its keys to understand the embedding outputs of this encoding layer.

In [29]:
#Word Embedding layer
embedding_model = hub.KerasLayer(encoder_url)
embedded_output = embedding_model(preprocessed_text)
embedded_output.keys()

dict_keys(['pooled_output', 'sequence_output', 'encoder_outputs', 'default'])

The dictionary has four keys: 'pooled_output', 'sequence_output', 'encoder_outputs', and 'default'. Now let's examine the first three.

In [30]:
#pooled_output
embedded_output['pooled_output']

<tf.Tensor: shape=(996, 768), dtype=float32, numpy=
array([[-0.92820835, -0.53307813, -0.98807347, ..., -0.94718045,
        -0.7754499 ,  0.9370414 ],
       [-0.8528169 , -0.35731667, -0.8730308 , ..., -0.7974772 ,
        -0.6556542 ,  0.9271541 ],
       [-0.8450236 , -0.4425657 , -0.84763074, ..., -0.85680586,
        -0.5416942 ,  0.93708444],
       ...,
       [-0.76845974, -0.46657893, -0.8840185 , ..., -0.79948694,
        -0.6214458 ,  0.9042822 ],
       [-0.7747236 , -0.32115898, -0.8614529 , ..., -0.72820914,
        -0.5732518 ,  0.86374104],
       [-0.8754774 , -0.41060072, -0.9194106 , ..., -0.86812526,
        -0.6718024 ,  0.8926041 ]], dtype=float32)>

First, we examine the pooled output, which is the embedding for the entire sentence. Its shape is (996, 768), where 996 is the total number of reviews and 768 is the embedding vector size. These vectors can be used for sentence-level NLP tasks such as classification.
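
One simple way to use sentence vectors like these is to compare them with cosine similarity. The sketch below uses made-up 4-dimensional vectors in place of the real 768-dimensional rows of the pooled output, and the function `cosine_similarity` is written here for illustration.

```python
import math

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical sentence vectors; real ones would be rows of pooled_output
review_a = [0.9, -0.5, 0.1, 0.3]
review_b = [0.8, -0.4, 0.2, 0.3]
review_c = [-0.7, 0.6, 0.0, -0.2]

sim_ab = cosine_similarity(review_a, review_b)
sim_ac = cosine_similarity(review_a, review_c)
print(round(sim_ab, 3), round(sim_ac, 3))
```

Vectors that point in a similar direction score close to 1, while vectors that point in opposite directions score below 0.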

In [31]:
#sequence_output
embedded_output['sequence_output']


<tf.Tensor: shape=(996, 128, 768), dtype=float32, numpy=
array([[[-0.16914892,  0.10955714,  0.06207668, ..., -0.2665729 ,
          0.3922215 ,  0.4045772 ],
        [ 0.43808356, -0.40459052,  0.49144763, ..., -0.20893475,
          0.20450285, -0.08185712],
        [-0.192302  , -0.34165952,  0.07631842, ..., -0.3218694 ,
          0.1680937 ,  0.57546115],
        ...,
        [ 0.21660924, -0.16680014,  0.4273146 , ...,  0.2703691 ,
         -0.1244054 ,  0.14376658],
        [ 0.10489924, -0.19473974,  0.403444  , ...,  0.27094305,
         -0.07457556,  0.15557198],
        [ 0.1582209 , -0.17861842,  0.41508353, ...,  0.31933105,
         -0.08395606,  0.09886038]],

       [[-0.10519296, -0.06884409, -0.29852688, ..., -0.28938407,
          0.16161281,  0.23380975],
        [ 0.23456024, -0.2770056 ,  0.15491068, ..., -0.2441566 ,
          0.2038424 , -0.09215128],
        [ 0.81120396, -0.23502271,  0.61968386, ..., -0.02933439,
          0.30619276, -0.29426754],
        ..

The second key is the sequence output, which holds the individual word embedding vectors. Its shape is (996, 128, 768) because there are 996 input reviews, each review is padded to a sequence length of 128, and every position has a vector of size 768. Since this is a contextual embedding, the padding positions also receive non-zero vectors.
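
Because padding positions also carry vectors, averaging the sequence output naively would mix padding into the result. A common remedy is to average only the positions where the input mask is 1. The following is a minimal sketch with toy 2-dimensional vectors; the function `masked_mean` is hypothetical and is not part of TensorFlow Hub.

```python
def masked_mean(token_vectors, mask):
    dim = len(token_vectors[0])
    n_real = sum(mask)
    totals = [0.0] * dim
    for vec, keep in zip(token_vectors, mask):
        if keep:  # skip padding positions
            for j in range(dim):
                totals[j] += vec[j]
    return [t / n_real for t in totals]

# Two real tokens followed by two padding positions (made-up values)
token_vectors = [[1.0, 2.0], [3.0, 4.0], [9.0, 9.0], [9.0, 9.0]]
mask = [1, 1, 0, 0]
sentence_vector = masked_mean(token_vectors, mask)
print(sentence_vector)
```

The padding vectors [9.0, 9.0] have no influence on the average, which comes out as [2.0, 3.0].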

In [32]:
#Encoder output
len(embedded_output['encoder_outputs'])

12

The last key is the encoder output, which is the output of each individual encoder in the BERT Base model. The length of the encoder output is 12 because there are 12 encoding layers in the BERT Base model. Each of the 12 elements also has shape (996, 128, 768): 996 input reviews, a sequence length of 128 including padding, and a vector of size 768 at every position.