# BERT Tutorial: Using BERTModel

#### **<ins>Version:</ins>**
@March 2023 / Quek Jing Hao


#### **<ins>Objective:</ins>**

Learn to extract sentence vectors from BERTModel, and become comfortable using BERTModel from the transformers package.


#### **<ins>Introduction:</ins>**

To recap: in NLP, we use a model to transform text into a vector of floats. When the text is a whole sentence, this technique is called sentence embedding. In this walkthrough, we use BERTModel to obtain the sentence embeddings, arrange the sentence vectors into a dataframe of feature vectors, and then feed that dataframe into a few different machine learning classifiers.

Because BERTModel is a deep learning model, it is computationally intensive, so a GPU is strongly recommended for this tutorial. We will use Google Colab's free GPU instance. This notebook is self-contained: you do not need to download the dataset separately, because we fetch it by cloning the sample datasets repository hosted on my GitHub.

In this tutorial, we walk through a simple use of BERTModel for sentiment analysis: we use its sentence embeddings to predict whether a movie review is positive or negative.

### Environment Configuration

First, we need to set up the environment in Google Colab by installing some packages that may not be available there by default. We use the `%%capture` cell magic to suppress the installation output.

In [10]:
%%capture
!pip install datasets
!pip install evaluate
!pip install fastBPE sacremoses subword_nmt sentencepiece
!pip install transformers

In [11]:
# import modules and dependencies
import numpy as np
import pandas as pd
import re
import torch

from datasets.dataset_dict import DatasetDict, Dataset

from sklearn.svm import SVC
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer

from sklearn.metrics import roc_auc_score
from sklearn.metrics import recall_score
from sklearn.metrics import precision_score
from sklearn.metrics import f1_score
from sklearn.metrics import confusion_matrix

from transformers import AutoTokenizer, DataCollatorWithPadding
from transformers import get_scheduler
from transformers import BertModel, BertTokenizer
from transformers import EarlyStoppingCallback
from transformers import TrainingArguments, Trainer
from transformers import AutoModelForSequenceClassification

pd.set_option('display.max_colwidth', 1000)

Next, we determine whether a GPU is available. If you are using Google Colab's GPU, you'll see the name of the GPU instance.

In [12]:
detected_device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
# fall back to a readable label when no GPU is available
device_name = torch.cuda.get_device_name() if torch.cuda.is_available() else 'CPU'
print(f'Detected device: {device_name}')

Detected device: Tesla T4


### Read Dataset

In this tutorial, we will use the IMDB movies dataset. The dataset consists of 50K movie reviews, and each review is tagged with a sentiment, either positive or negative.

Learn more about the dataset here: https://www.kaggle.com/datasets/lakshmi25npathi/imdb-dataset-of-50k-movie-reviews

The dataset will be downloaded from the sample datasets repository on my GitHub.

In [13]:
!git clone https://github.com/QuekJingHao/imdb-sample-dataset.git

fatal: destination path 'imdb-sample-dataset' already exists and is not an empty directory.


In [14]:
df = pd.read_csv('/content/imdb-sample-dataset/imdb_sample.csv')
df.head(2)

Unnamed: 0.1,Unnamed: 0,review,sentiment
0,29430,"I love most movies and I'm a big fan of Sean Bean so I thought that I would at least LIKE this movie. Also, I'm Canadian and this is a mostly-Canadian movie so I was prepared to cut it some serious slack. Nothing could have prepared me for the garbage that is ""Airborne"". Steve Guttenberg as an action hero? Give me a break. The acting throughout the movie was so bad I am going to have trouble sleeping tonight. I now have only two wishes in my life.<br /><br />1. I hope that you never have to sit through this movie. 2. I wish I could get those 6 hours back. Oh wait, the movie's under 2 hours - it only seemed like 6 hours...<br /><br />Don't watch this. Seriously.",negative
1,27750,"A film that tends to get buried under prejudice and preconception - It's a remake! Doris Day is in it! She sings! - Hitchcock's second crack at 'The Man Who Knew Too Much' is his most under-rated film, and arguably a fully fledged masterpiece in its own right.<br /><br />This is, in more ways than one, Doris Day's film. Not only does she give the finest performance of her career, more than holding her own against James Stewart, but the whole film is subtly structured around her character rather than his. This is, after all, a film in which music is both motif and plot device. What better casting than the most popular singer of her generation? Consider: Day's Jo McKenna has given up her career on the stage in order to settle down with her husband and raise their son. This seems to be a mutual decision, and she doesn't appear to be unhappy. But look at the way Stewart teases her in the horse-drawn carriage over her concerns about Louis Bernard, implying that she is jealous that Berna...",positive


### Data Cleaning and Sampling

Text data is rarely clean. This is especially so for comments, tweets and reviews. Unless the text you are dealing with is from an authoritative source, it typically contains lexical, grammatical and spelling errors. So, depending on the situation or the use case, you have to perform several data cleaning steps, for example removing special characters. As an example, we will define a text processing function that performs the following actions:

1. Remove HTML line-break tags (`<br />`)
2. Lowercase all words in the reviews and remove all leading and trailing white space

In [15]:
def text_processing(text):
    
    remove_breaks = r"<br />"

    text_rtn = re.sub(remove_breaks, ' ', text)

    return text_rtn.lower().strip()
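
As a quick check of the function above, the sketch below runs it on a made-up review string (`sample` is not from the dataset). It also shows an optional variant, `text_processing_collapsed`, a hypothetical helper that is not part of the original notebook and additionally collapses runs of white space, since each `<br />` is replaced by its own space.

```python
import re

def text_processing(text):
    # same logic as the tutorial's function: replace HTML line breaks, lowercase, trim
    text_rtn = re.sub(r"<br />", ' ', text)
    return text_rtn.lower().strip()

def text_processing_collapsed(text):
    # hypothetical variant: also squeeze consecutive white space into one space
    return re.sub(r"\s+", ' ', text_processing(text))

sample = "  Great film!<br /><br />Would watch AGAIN.  "

print(repr(text_processing(sample)))            # two spaces remain between the sentences
print(repr(text_processing_collapsed(sample)))  # a single space between the sentences
```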

In [16]:
df['clean review'] = df['review'].apply(text_processing)

df.head(2)

Unnamed: 0.1,Unnamed: 0,review,sentiment,clean review
0,29430,"I love most movies and I'm a big fan of Sean Bean so I thought that I would at least LIKE this movie. Also, I'm Canadian and this is a mostly-Canadian movie so I was prepared to cut it some serious slack. Nothing could have prepared me for the garbage that is ""Airborne"". Steve Guttenberg as an action hero? Give me a break. The acting throughout the movie was so bad I am going to have trouble sleeping tonight. I now have only two wishes in my life.<br /><br />1. I hope that you never have to sit through this movie. 2. I wish I could get those 6 hours back. Oh wait, the movie's under 2 hours - it only seemed like 6 hours...<br /><br />Don't watch this. Seriously.",negative,"i love most movies and i'm a big fan of sean bean so i thought that i would at least like this movie. also, i'm canadian and this is a mostly-canadian movie so i was prepared to cut it some serious slack. nothing could have prepared me for the garbage that is ""airborne"". steve guttenberg as an action hero? give me a break. the acting throughout the movie was so bad i am going to have trouble sleeping tonight. i now have only two wishes in my life. 1. i hope that you never have to sit through this movie. 2. i wish i could get those 6 hours back. oh wait, the movie's under 2 hours - it only seemed like 6 hours... don't watch this. seriously."
1,27750,"A film that tends to get buried under prejudice and preconception - It's a remake! Doris Day is in it! She sings! - Hitchcock's second crack at 'The Man Who Knew Too Much' is his most under-rated film, and arguably a fully fledged masterpiece in its own right.<br /><br />This is, in more ways than one, Doris Day's film. Not only does she give the finest performance of her career, more than holding her own against James Stewart, but the whole film is subtly structured around her character rather than his. This is, after all, a film in which music is both motif and plot device. What better casting than the most popular singer of her generation? Consider: Day's Jo McKenna has given up her career on the stage in order to settle down with her husband and raise their son. This seems to be a mutual decision, and she doesn't appear to be unhappy. But look at the way Stewart teases her in the horse-drawn carriage over her concerns about Louis Bernard, implying that she is jealous that Berna...",positive,"a film that tends to get buried under prejudice and preconception - it's a remake! doris day is in it! she sings! - hitchcock's second crack at 'the man who knew too much' is his most under-rated film, and arguably a fully fledged masterpiece in its own right. this is, in more ways than one, doris day's film. not only does she give the finest performance of her career, more than holding her own against james stewart, but the whole film is subtly structured around her character rather than his. this is, after all, a film in which music is both motif and plot device. what better casting than the most popular singer of her generation? consider: day's jo mckenna has given up her career on the stage in order to settle down with her husband and raise their son. this seems to be a mutual decision, and she doesn't appear to be unhappy. 
but look at the way stewart teases her in the horse-drawn carriage over her concerns about louis bernard, implying that she is jealous that bernard wasn't ..."


Furthermore, we encode the sentiments from text to integers as follows:
- positive : 1
- negative : 0

To save computational time, we will only pick a sample of 10000 reviews out of the 50K movie reviews, that is, 1/5 of the entire dataset.

In [17]:
df['sentiment'] = df['sentiment'].replace({'positive' : 1, 
                                           'negative' : 0})

df_sample = df.copy().sample(n = 10000, random_state = 32)
df_sample = df_sample.reset_index(drop = True)

df_sample.head(2)

Unnamed: 0.1,Unnamed: 0,review,sentiment,clean review
0,28344,"This movie was not that good at all. Here is the first clue and that it is not gonna be a strong movie, Harrison Ford's name not only appears first but it is also bigger than the title. The music was nominated for an Oscar, What the heck was that? That music was probably the most annoying thing in the movie. The acting was sub par at best, except the Amish boy he did a decent job for being so young. Then you have the story which was weak and a little over the place, and it won for adapted! The music was horrid, I know I already said something but it was really bad. The premises was real good and it should be remade. Well that's all I really have on that.<br /><br />Your Average Movie Guy,<br /><br />-Trever",0,"this movie was not that good at all. here is the first clue and that it is not gonna be a strong movie, harrison ford's name not only appears first but it is also bigger than the title. the music was nominated for an oscar, what the heck was that? that music was probably the most annoying thing in the movie. the acting was sub par at best, except the amish boy he did a decent job for being so young. then you have the story which was weak and a little over the place, and it won for adapted! the music was horrid, i know i already said something but it was really bad. the premises was real good and it should be remade. well that's all i really have on that. your average movie guy, -trever"
1,36062,"Kojak meets the mafia. Telly Savales is one of those guys from the past that seems pretty forgettable. I never thought that his show was all that great. This is his one dimensional characterization of a crime boss, with very predictable results. If you take the car chases and the general rambling out, there isn't much plot development or action. I find mafia movies to be dull because I have no respect or interest in common criminals and their actions. Hollywood, and in this case, the Italian cinema, treat these guys as heroes. I saw the film and in a few days I won't remember much about it. Lots of shooting, innocent bystanders dying, betrayal, and that sick loyalty. The film is photographed pretty well and the acting is decent. But the dubbing is so bad (due to voices that just couldn't come out of those bodies), that I almost started looking for Godzilla approaching the bay.",0,"kojak meets the mafia. telly savales is one of those guys from the past that seems pretty forgettable. i never thought that his show was all that great. this is his one dimensional characterization of a crime boss, with very predictable results. if you take the car chases and the general rambling out, there isn't much plot development or action. i find mafia movies to be dull because i have no respect or interest in common criminals and their actions. hollywood, and in this case, the italian cinema, treat these guys as heroes. i saw the film and in a few days i won't remember much about it. lots of shooting, innocent bystanders dying, betrayal, and that sick loyalty. the film is photographed pretty well and the acting is decent. but the dubbing is so bad (due to voices that just couldn't come out of those bodies), that i almost started looking for godzilla approaching the bay."


In [18]:
df_sample['sentiment'].value_counts()

0    5012
1    4988
Name: sentiment, dtype: int64

### Data Processing for BERT 

After cleaning the review text, we have to process it into a data structure that BERTModel accepts. In particular, BERTModel works with PyTorch tensors, so if you are comfortable with PyTorch, the learning curve will be less steep. Computation can be accelerated by using PyTorch's ".to" method to push BERTModel, along with the input tensors, to the GPU.

#### Load BERTModel and BERT Tokenizer

Before we process the data, we load BERTModel and its matching tokenizer.

In [19]:
%%capture
# you can comment out the magic function above to see the output when loading BERTModel. The output shows the BERTModel architecture

model_name = 'bert-base-uncased'
tokenizer = BertTokenizer.from_pretrained(model_name)

bert_model = BertModel.from_pretrained(model_name)

bert_model.to(detected_device)

Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertModel: ['cls.predictions.bias', 'cls.predictions.transform.dense.weight', 'cls.predictions.transform.dense.bias', 'cls.predictions.decoder.weight', 'cls.predictions.transform.LayerNorm.weight', 'cls.seq_relationship.bias', 'cls.predictions.transform.LayerNorm.bias', 'cls.seq_relationship.weight']
- This IS expected if you are initializing BertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


#### BERT Tokenizer

This data processing step involves tokenizing the text data, that is, splitting each sentence into smaller units called 'tokens' (words or pieces of words) and mapping each token to an integer. We use BERT's tokenizer for this purpose. Later on, we restrict each review to 250 tokens and truncate the remaining text.

Let us first look at an example by running the BERT tokenizer on two sentences and printing the output.

In [20]:
example_sent = ['Office of Data and Intelligence is the most capabale department in the entire University.', 
                'This tutorial goes over BERT Model!']

sent_tokenized = tokenizer(example_sent, padding = True)
sent_tokenized

{'input_ids': [[101, 2436, 1997, 2951, 1998, 4454, 2003, 1996, 2087, 6178, 19736, 2571, 2533, 1999, 1996, 2972, 2118, 1012, 102], [101, 2023, 14924, 4818, 3632, 2058, 14324, 2944, 999, 102, 0, 0, 0, 0, 0, 0, 0, 0, 0]], 'token_type_ids': [[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]], 'attention_mask': [[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0]]}
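
In the output above, the first sentence has 14 words and a full stop, yet there are 17 ids between the 101 and 102 markers: the misspelled word 'capabale' appears to be split into three pieces (6178, 19736, 2571). BERT's tokenizer uses WordPiece, which breaks a word it does not know into the longest known sub-pieces, scanning from left to right. The sketch below illustrates that greedy matching on a tiny made-up vocabulary. `toy_vocab` and `wordpiece` are illustrative names and not part of the transformers API, and the real vocabulary has roughly 30,000 entries.

```python
def wordpiece(word, vocab, unk='[UNK]'):
    """Greedy longest-match-first split of one lowercase word into sub-word pieces."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        piece = None
        # try the longest remaining substring first, then shrink it from the right
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = '##' + candidate  # '##' marks a piece that continues a word
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return [unk]  # no piece matched, so the whole word is unknown
        pieces.append(piece)
        start = end
    return pieces

# made-up vocabulary for illustration only
toy_vocab = {'the', 'bert', 'tutor', '##ial', 'cap', '##aba', '##le'}

for word in ['bert', 'tutorial', 'capabale', 'xyz']:
    print(word, '->', wordpiece(word, toy_vocab))
```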

The tokenizer returns a dictionary-like object with the following keys:

In [21]:
sent_tokenized.keys()

dict_keys(['input_ids', 'token_type_ids', 'attention_mask'])

For each sentence, the tokenizer produces three arrays. Of these, the most important is input_ids, which will later be converted to a tensor and passed into BERTModel. The tokenizer replaces each token with an integer id, and wraps every sentence with the special ids 101 ([CLS]) at the start and 102 ([SEP]) at the end. Because we set padding = True, shorter sentences are padded with 0 so that all rows have the same length, and the attention_mask marks real tokens with 1 and padding with 0. For the first sentence, the tokenizer has produced the following input_ids:

In [22]:
np.array(sent_tokenized['input_ids'][0])

array([  101,  2436,  1997,  2951,  1998,  4454,  2003,  1996,  2087,
        6178, 19736,  2571,  2533,  1999,  1996,  2972,  2118,  1012,
         102])
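
To make the roles of the special ids, the padding and the attention_mask concrete, here is a pure-Python sketch that rebuilds the second example's row from its raw token ids. It assumes the ids seen in the tokenizer output above (101 for [CLS], 102 for [SEP], 0 for padding). `encode_batch` is a hypothetical helper for illustration, not how the tokenizer is implemented internally.

```python
CLS_ID, SEP_ID, PAD_ID = 101, 102, 0  # special ids seen in the tokenizer output above

def encode_batch(token_id_lists, max_length=None):
    """Wrap each id list in [CLS] ... [SEP], optionally truncate, then pad to equal length."""
    rows = []
    for ids in token_id_lists:
        if max_length is not None:
            # leave room for the two special tokens
            ids = ids[:max_length - 2]
        rows.append([CLS_ID] + list(ids) + [SEP_ID])

    longest = max(len(row) for row in rows)
    input_ids, attention_mask = [], []
    for row in rows:
        n_pad = longest - len(row)
        input_ids.append(row + [PAD_ID] * n_pad)
        # 1 marks a real token, 0 marks padding
        attention_mask.append([1] * len(row) + [0] * n_pad)
    return {'input_ids': input_ids, 'attention_mask': attention_mask}

# raw ids of the two example sentences, without the special tokens
first  = [2436, 1997, 2951, 1998, 4454, 2003, 1996, 2087, 6178, 19736, 2571,
          2533, 1999, 1996, 2972, 2118, 1012]
second = [2023, 14924, 4818, 3632, 2058, 14324, 2944, 999]

encoded = encode_batch([first, second])
print(encoded['input_ids'][1])
print(encoded['attention_mask'][1])
```

Passing `max_length = 5` to this helper keeps only the first three raw ids of each sentence, which mirrors the idea behind `truncation = True` used later in the tutorial.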

Now, we tokenize the entire clean review Series. As before, we convert the Series into a list and pass it to the BERT tokenizer. This time we also specify a maximum length of 250 tokens and truncate anything beyond it.

In [23]:
%%time
clean_review = list(df_sample['clean review'])
clean_tokens = tokenizer(clean_review, padding = True, truncation = True, max_length = 250)

# access the input Ids
input_ids = clean_tokens['input_ids']

# convert the list into a numpy array and then a PyTorch tensor. Copy the tensor to GPU
input_ids_tensor = torch.tensor(np.array(input_ids)).to(detected_device)

CPU times: user 50.9 s, sys: 166 ms, total: 51.1 s
Wall time: 1min 1s


We can also assign the input_ids as a column of the sample dataframe. This shows the input_ids that the BERT tokenizer generated for each review.

In [24]:
df_sample['input_ids'] = input_ids

df_sample.head(2)

Unnamed: 0.1,Unnamed: 0,review,sentiment,clean review,input_ids
0,28344,"This movie was not that good at all. Here is the first clue and that it is not gonna be a strong movie, Harrison Ford's name not only appears first but it is also bigger than the title. The music was nominated for an Oscar, What the heck was that? That music was probably the most annoying thing in the movie. The acting was sub par at best, except the Amish boy he did a decent job for being so young. Then you have the story which was weak and a little over the place, and it won for adapted! The music was horrid, I know I already said something but it was really bad. The premises was real good and it should be remade. Well that's all I really have on that.<br /><br />Your Average Movie Guy,<br /><br />-Trever",0,"this movie was not that good at all. here is the first clue and that it is not gonna be a strong movie, harrison ford's name not only appears first but it is also bigger than the title. the music was nominated for an oscar, what the heck was that? that music was probably the most annoying thing in the movie. the acting was sub par at best, except the amish boy he did a decent job for being so young. then you have the story which was weak and a little over the place, and it won for adapted! the music was horrid, i know i already said something but it was really bad. the premises was real good and it should be remade. well that's all i really have on that. 
your average movie guy, -trever","[101, 2023, 3185, 2001, 2025, 2008, 2204, 2012, 2035, 1012, 2182, 2003, 1996, 2034, 9789, 1998, 2008, 2009, 2003, 2025, 6069, 2022, 1037, 2844, 3185, 1010, 6676, 4811, 1005, 1055, 2171, 2025, 2069, 3544, 2034, 2021, 2009, 2003, 2036, 7046, 2084, 1996, 2516, 1012, 1996, 2189, 2001, 4222, 2005, 2019, 7436, 1010, 2054, 1996, 17752, 2001, 2008, 1029, 2008, 2189, 2001, 2763, 1996, 2087, 15703, 2518, 1999, 1996, 3185, 1012, 1996, 3772, 2001, 4942, 11968, 2012, 2190, 1010, 3272, 1996, 26445, 4095, 2879, 2002, 2106, 1037, 11519, 3105, 2005, 2108, 2061, 2402, 1012, 2059, 2017, 2031, 1996, 2466, 2029, 2001, ...]"
1,36062,"Kojak meets the mafia. Telly Savales is one of those guys from the past that seems pretty forgettable. I never thought that his show was all that great. This is his one dimensional characterization of a crime boss, with very predictable results. If you take the car chases and the general rambling out, there isn't much plot development or action. I find mafia movies to be dull because I have no respect or interest in common criminals and their actions. Hollywood, and in this case, the Italian cinema, treat these guys as heroes. I saw the film and in a few days I won't remember much about it. Lots of shooting, innocent bystanders dying, betrayal, and that sick loyalty. The film is photographed pretty well and the acting is decent. But the dubbing is so bad (due to voices that just couldn't come out of those bodies), that I almost started looking for Godzilla approaching the bay.",0,"kojak meets the mafia. telly savales is one of those guys from the past that seems pretty forgettable. i never thought that his show was all that great. this is his one dimensional characterization of a crime boss, with very predictable results. if you take the car chases and the general rambling out, there isn't much plot development or action. i find mafia movies to be dull because i have no respect or interest in common criminals and their actions. hollywood, and in this case, the italian cinema, treat these guys as heroes. i saw the film and in a few days i won't remember much about it. lots of shooting, innocent bystanders dying, betrayal, and that sick loyalty. the film is photographed pretty well and the acting is decent. 
but the dubbing is so bad (due to voices that just couldn't come out of those bodies), that i almost started looking for godzilla approaching the bay.","[101, 12849, 18317, 6010, 1996, 13897, 1012, 2425, 2100, 28350, 4244, 2003, 2028, 1997, 2216, 4364, 2013, 1996, 2627, 2008, 3849, 3492, 5293, 10880, 1012, 1045, 2196, 2245, 2008, 2010, 2265, 2001, 2035, 2008, 2307, 1012, 2023, 2003, 2010, 2028, 8789, 23191, 1997, 1037, 4126, 5795, 1010, 2007, 2200, 21425, 3463, 1012, 2065, 2017, 2202, 1996, 2482, 29515, 1998, 1996, 2236, 8223, 9709, 2041, 1010, 2045, 3475, 1005, 1056, 2172, 5436, 2458, 2030, 2895, 1012, 1045, 2424, 13897, 5691, 2000, 2022, 10634, 2138, 1045, 2031, 2053, 4847, 2030, 3037, 1999, 2691, 12290, 1998, 2037, 4506, 1012, 5365, 1010, 1998, 1999, ...]"


### Accessing Last Hidden State from BERTModel

Now the fun begins!

BERTModel returns one vector per token in its _last hidden state_, the output of the final layer. A common choice of sentence embedding, and the one we use here, is the last-hidden-state vector of the first token, [CLS]. One could try the not-so-astute method of sending all 10000 rows of input_ids into BERTModel at once to obtain the sentence vectors, but doing so is likely to exhaust the GPU's memory and crash the session. So, we pass each review's input_ids into the model one at a time, and read the [CLS] vector directly from the model output.

We loop through the rows of the input_ids tensor, collecting the 768-dimensional vector for each review and vertically stacking them until we reach the last review.

In [27]:
%%time
cls_vectors = []

for row in input_ids_tensor:
    
    # add a batch dimension so the model sees a batch of one review
    input_id = torch.unsqueeze(row, 0)
    
    with torch.no_grad():
        output = bert_model(input_id)
        # output[0] is the last hidden state; position 0 along the sequence is the [CLS] token
        cls_vectors.append(output[0][:, 0, :])

# stack the per-review vectors into a single (n_reviews, 768) tensor and copy it to the CPU
features_vector = torch.vstack(cls_vectors).cpu()

print('[*]-----------------------------------------------  SUCCESS  -----------------------------------------------[*]')

[*]-----------------------------------------------  SUCCESS  -----------------------------------------------[*]
CPU times: user 2min 28s, sys: 30.8 s, total: 2min 59s
Wall time: 3min 3s
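
Processing one review at a time is safe but slow, and sending everything at once may not fit in memory. A middle ground is to work in small batches. The sketch below shows only the batching and [CLS] slicing logic, using NumPy and a stand-in encoder so that it runs without a GPU or a model download. `fake_encoder`, `extract_cls_vectors` and the batch size of 32 are assumptions for illustration. In the notebook, the encoder's role would be played by BERTModel's last hidden state, and it may be worth also passing the attention_mask alongside the input_ids so that padding positions are ignored.

```python
import numpy as np

HIDDEN_SIZE = 768  # width of the vectors in this tutorial

def fake_encoder(batch_ids):
    """Stand-in for the model: returns an array shaped like a last hidden state,
    (batch_size, seq_len, HIDDEN_SIZE). The values are random, not real embeddings."""
    rng = np.random.default_rng(0)
    batch_size, seq_len = batch_ids.shape
    return rng.normal(size=(batch_size, seq_len, HIDDEN_SIZE))

def extract_cls_vectors(input_ids, encoder, batch_size=32):
    """Run the encoder batch by batch and keep the vector at position 0 ([CLS])."""
    chunks = []
    for start in range(0, len(input_ids), batch_size):
        batch = input_ids[start:start + batch_size]
        last_hidden_state = encoder(batch)          # (batch, seq_len, hidden)
        chunks.append(last_hidden_state[:, 0, :])   # (batch, hidden)
    return np.vstack(chunks)

# 100 made-up "reviews" of 16 token ids each
dummy_ids = np.zeros((100, 16), dtype=np.int64)
features = extract_cls_vectors(dummy_ids, fake_encoder)
print(features.shape)
```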


We can also take a look at the features dataframe, although this step is not strictly necessary.

In [28]:
df_ML = pd.DataFrame(features_vector, 
                     columns = ['feature_' + str(i + 1) for i in range(768)])
df_ML = pd.concat([df_sample['sentiment'], df_ML], axis = 1)

df_ML.head(3)

Unnamed: 0,sentiment,feature_1,feature_2,feature_3,feature_4,feature_5,feature_6,feature_7,feature_8,feature_9,...,feature_759,feature_760,feature_761,feature_762,feature_763,feature_764,feature_765,feature_766,feature_767,feature_768
0,0,-0.329462,0.132363,-0.348127,0.458388,-0.369615,-0.154923,0.323158,-0.205093,0.234229,...,0.081855,-0.316305,0.155609,-0.492828,0.144588,-0.244067,-0.49878,-0.208091,0.865001,-0.431351
1,0,-0.363119,-0.168779,-0.192008,-0.073103,-0.603808,-0.004025,0.274702,-0.005901,-0.041459,...,0.314191,-0.139299,0.490739,-0.545838,0.117402,-0.003479,-0.007289,0.007684,0.918179,-0.227412
2,0,-0.579074,0.381138,-0.3395,0.126223,-0.571157,0.186502,1.052447,-0.329012,0.011446,...,-0.049435,-0.267388,0.11193,-0.355467,0.422285,-0.30817,-0.075023,-0.523573,0.750236,-0.70094


Now, we are ready to do machine learning classification!

### Classification using Various Machine Learning Models

As a rule of thumb, you should try several different kinds of machine learning classifiers and see which one performs best. This helps ensure that the model you settle on is robust enough.

As an example, we will use three different classifiers: K-Nearest Neighbour, Logistic Regression and Random Forest. We perform the classic train-test split here (by default, train_test_split holds out 25% of the rows for testing), and define a function that returns the evaluation metrics in a dataframe.

In [29]:
X_train, X_test, y_train, y_test = train_test_split(features_vector, 
                                                    df_sample['sentiment'], 
                                                    random_state = 14)

def eval_metric_df(clf, X_test, y_test, clf_name):
    
    y_predict = clf.predict(X_test)

    # calculate the evaluation metrics of the classifier
    # note: AUC is computed from hard 0/1 predictions here, so it equals the balanced accuracy
    auc_score    = roc_auc_score(y_test, y_predict)
    recall       = recall_score(y_test, y_predict)
    precision    = precision_score(y_test, y_predict)
    f1           = f1_score(y_test, y_predict)
    classifier_score = clf.score(X_test, y_test)
    confusion    = confusion_matrix(y_test, y_predict)

    print('Confusion matrix:\n', confusion, '\n')

    performance_dict = {clf_name : [auc_score, recall, precision, f1, classifier_score]}
    performance_df_clf = pd.DataFrame(data  = performance_dict, 
                                         index = ['AUC', 'Recall', 'Precision', 'F1', 'Score'])
    
    return performance_df_clf

#### K-Nearest Neighbour classifier

In [30]:
knn_clf = KNeighborsClassifier(n_neighbors = 10)
knn_clf.fit(X_train, y_train)

knn_clf_eval_metrics = eval_metric_df(knn_clf, X_test, y_test, 'K-Nearest Neighbour')
knn_clf_eval_metrics

Confusion matrix:
 [[1047  195]
 [ 641  617]] 



Unnamed: 0,K-Nearest Neighbour
AUC,0.666728
Recall,0.490461
Precision,0.759852
F1,0.596135
Score,0.6656
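
To see where these numbers come from, the sketch below recomputes them by hand from the K-Nearest Neighbour confusion matrix printed above, using only the standard definitions. scikit-learn lays the matrix out as [[TN, FP], [FN, TP]] for labels 0 and 1. Because the AUC in `eval_metric_df` is computed from hard 0/1 predictions, it coincides with the balanced accuracy, the mean of the recall on each class.

```python
# confusion matrix reported for K-Nearest Neighbour: [[TN, FP], [FN, TP]]
tn, fp = 1047, 195
fn, tp = 641, 617

recall      = tp / (tp + fn)                    # share of positive reviews that were found
precision   = tp / (tp + fp)                    # share of predicted positives that were right
f1          = 2 * precision * recall / (precision + recall)
accuracy    = (tp + tn) / (tp + tn + fp + fn)   # what clf.score reports for a classifier
specificity = tn / (tn + fp)                    # recall on the negative class
balanced_accuracy = (recall + specificity) / 2  # equals AUC for hard 0/1 predictions

for name, value in [('AUC', balanced_accuracy), ('Recall', recall),
                    ('Precision', precision), ('F1', f1), ('Score', accuracy)]:
    print(f'{name:<10}{value:.6f}')
```

The low recall next to a fairly high precision shows that this classifier misses about half of the positive reviews, but is usually right when it does predict positive.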


#### Logistic Regression

In [31]:
lr_clf = LogisticRegression(max_iter = 10000)
lr_clf.fit(X_train, y_train)

lr_clf_eval_metrics = eval_metric_df(lr_clf, X_test, y_test, 'Logistic Regression')
lr_clf_eval_metrics

Confusion matrix:
 [[ 992  250]
 [ 243 1015]] 



Unnamed: 0,Logistic Regression
AUC,0.802774
Recall,0.806836
Precision,0.802372
F1,0.804598
Score,0.8028


#### Random Forest

In [32]:
rf_clf = RandomForestClassifier()
rf_clf.fit(X_train, y_train)

rf_clf_eval_metrics = eval_metric_df(rf_clf, X_test, y_test, 'Random Forest')
rf_clf_eval_metrics

Confusion matrix:
 [[908 334]
 [292 966]] 



Unnamed: 0,Random Forest
AUC,0.749482
Recall,0.767886
Precision,0.743077
F1,0.755278
Score,0.7496


### Concluding Remarks

And there you have it! In our example, Logistic Regression is the best classifier: it scores highest on all five metrics, with an F1 of about 0.80. Keep in mind, though, that 10000 reviews is a small dataset. You can increase the sample size and see how the performance of each classifier changes!
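
To back up that comparison, the short sketch below puts the reported metrics side by side and picks the classifier with the highest value for each metric. The numbers are copied from the outputs in this notebook. The `reported` dictionary and the variable names are just for illustration.

```python
# metrics copied from the three result tables above
reported = {
    'K-Nearest Neighbour': {'AUC': 0.666728, 'Recall': 0.490461, 'Precision': 0.759852,
                            'F1': 0.596135, 'Score': 0.6656},
    'Logistic Regression': {'AUC': 0.802774, 'Recall': 0.806836, 'Precision': 0.802372,
                            'F1': 0.804598, 'Score': 0.8028},
    'Random Forest':       {'AUC': 0.749482, 'Recall': 0.767886, 'Precision': 0.743077,
                            'F1': 0.755278, 'Score': 0.7496},
}

metrics = ['AUC', 'Recall', 'Precision', 'F1', 'Score']

# for each metric, find the classifier with the highest value
best_per_metric = {m: max(reported, key=lambda name: reported[name][m]) for m in metrics}

for m in metrics:
    print(f'{m:<10}{best_per_metric[m]}')
```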