---

# Ioannou_Georgios


## Copyright © 2023 by Georgios Ioannou


---

<h1 align="center"> NLP Fine-Tuning With Hugging Face </h1>

<h2 align="center"> dem-vs-rep-tweets.csv </h2>


---

- **_Fine-tuning_** a natural language processing (NLP) model means **_continuing to train a pre-trained model_** on a smaller, task-specific dataset so that its weights **_adapt to a new dataset or task_**. For classification, a new output layer is usually added on top of the pre-trained network. The process often also involves **_tuning hyperparameters_** such as the **_learning rate, batch size, number of epochs, and weight decay_**, and sometimes **_adjusting the dataset_**. It can be time-consuming and requires a good understanding of the model and the task at hand.

- In simpler terms: **_Fine-tuning takes a model that already understands language in general and trains it further so that it performs well on one specific task and dataset._**


---

# INSTALLATIONS


In [1]:
# ! pip install transformers
# ! pip install beautifulsoup4
# ! pip install lxml
# ! pip install evaluate

---

# LIBRARIES


In [2]:
import evaluate
import numpy as np
import pandas as pd

from datasets import Dataset
from sklearn.metrics import classification_report
from sklearn import preprocessing
from sklearn.model_selection import train_test_split
from transformers import AutoTokenizer
from transformers import AutoModelForSequenceClassification
from transformers import DataCollatorWithPadding
from transformers import TrainingArguments, Trainer



---

# PRE-TRAINED MODEL


In [3]:
# FINE TUNING THIS PRETRAINED MODEL.

model_name = "distilbert-base-uncased"

---

# DATASET


In [4]:
# Load dem-vs-rep-tweets.csv (expected in the current working directory) into a dataframe.

df = pd.read_csv("dem-vs-rep-tweets.csv")

# Print/Display/Return the first 5 rows of the file dem-vs-rep-tweets.csv to make sure the file was loaded successfully.

df.head()

Unnamed: 0,Party,Handle,Tweet
0,Democrat,RepDarrenSoto,"Today, Senate Dems vote to #SaveTheInternet. P..."
1,Democrat,RepDarrenSoto,RT @WinterHavenSun: Winter Haven resident / Al...
2,Democrat,RepDarrenSoto,RT @NBCLatino: .@RepDarrenSoto noted that Hurr...
3,Democrat,RepDarrenSoto,RT @NALCABPolicy: Meeting with @RepDarrenSoto ...
4,Democrat,RepDarrenSoto,RT @Vegalteno: Hurricane season starts on June...


In [5]:
# Print the shape

print("df.shape =", df.shape)

df.shape = (86460, 3)


In [6]:
# Inspect / remove nulls and duplicates.

print("NULLS")
print(df.isnull().sum(), "\n")
print("Duplicates")
print("df.duplicated().sum() =", df.duplicated().sum(), "\n")
df.drop_duplicates(inplace=True)
print("Removing Duplicates")
print("df.duplicated().sum() =", df.duplicated().sum(), "\n")
print("df.shape =", df.shape)

NULLS
Party     0
Handle    0
Tweet     0
dtype: int64 

Duplicates
df.duplicated().sum() = 57 

Removing Duplicates
df.duplicated().sum() = 0 

df.shape = (86403, 3)


In [7]:
# Check the class balance: count the tweets for each Party (Republican and Democrat).

print("Class Balances")
df["Party"].value_counts()

Class Balances


Party
Republican    44362
Democrat      42041
Name: count, dtype: int64

In [8]:
# Group tweets by Party.

df.groupby("Party").count()

Party,Handle,Tweet
Democrat,42041,42041
Republican,44362,44362


In [9]:
# Display the first 5 rows again, this time after the duplicates were removed.

df.head()

Unnamed: 0,Party,Handle,Tweet
0,Democrat,RepDarrenSoto,"Today, Senate Dems vote to #SaveTheInternet. P..."
1,Democrat,RepDarrenSoto,RT @WinterHavenSun: Winter Haven resident / Al...
2,Democrat,RepDarrenSoto,RT @NBCLatino: .@RepDarrenSoto noted that Hurr...
3,Democrat,RepDarrenSoto,RT @NALCABPolicy: Meeting with @RepDarrenSoto ...
4,Democrat,RepDarrenSoto,RT @Vegalteno: Hurricane season starts on June...


In [10]:
# Print the shape again.

print("df.shape =", df.shape)

df.shape = (86403, 3)


In [11]:
# Inspect / remove nulls and duplicates.

print("NULLS")
print(df.isnull().sum(), "\n")
print("Duplicates")
print("df.duplicated().sum() =", df.duplicated().sum(), "\n")
df.drop_duplicates(inplace=True)
print("Removing Duplicates")
print("df.duplicated().sum() =", df.duplicated().sum(), "\n")
print("df.shape =", df.shape)

NULLS
Party     0
Handle    0
Tweet     0
dtype: int64 

Duplicates
df.duplicated().sum() = 0 

Removing Duplicates
df.duplicated().sum() = 0 

df.shape = (86403, 3)



---

# CLEAN DATASET


In [14]:
# NLTK is the Natural Language Toolkit.

import nltk
from nltk.corpus import stopwords
from nltk import word_tokenize
from nltk.stem import PorterStemmer

# Regular Expression Library.

import re

# Download packages from nltk.

nltk.download("stopwords")
nltk.download("punkt")
nltk.download("wordnet")
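# Note: this rebinds the name `stopwords` from the NLTK corpus reader to a plain list of English stopwords.
stopwords = stopwords.words("english")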

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\georg\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\georg\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\georg\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


## The 5 cells below are combined into a class called Cleaner.


In [15]:
# # 1. Make a function that makes all text lowercase.


# def make_lowercase(input_string):
#     return input_string.lower()


# test_string = "This is A SENTENCE with LOTS OF CAPS."
# make_lowercase(test_string)

In [16]:
# # 2. Make a function that removes all punctuation.


# def remove_punctuation(input_string):
#     input_string = re.sub(r"[^\w\s]", "", input_string)
#     return input_string


# test_string = "This is a sentence! 50 With lots of punctuation??? & other #things."
# remove_punctuation(test_string)

In [17]:
# # 3. Make a function that removes all stopwords.


# def remove_stopwords(input_string):
#     words = word_tokenize(input_string)
#     valid_words = []

#     for word in words:
#         if word not in stopwords:
#             valid_words.append(word)

#     input_string = " ".join(valid_words)

#     return input_string


# test_string = "This is a sentence! With some different stopwords i have added in here."
# remove_stopwords(test_string)

In [18]:
# # 4. Make a function that breaks words into their stem words.


# def stem_words(input_string):
#     porter = PorterStemmer()
#     words = word_tokenize(input_string)
#     valid_words = []

#     for word in words:
#         stemmed_word = porter.stem(word)
#         valid_words.append(stemmed_word)

#     input_string = " ".join(valid_words)

#     return input_string


# test_string = (
#     "I played and started playing with players and we all love to play with plays"
# )

# stem_words(test_string)

In [19]:
# # 5. Make a pipeline function that applies all the text processing functions you just built.


# def pipeline(input_string):
#     input_string = make_lowercase(input_string)
#     input_string = remove_punctuation(input_string)
#     input_string = remove_stopwords(input_string)
#     input_string = stem_words(input_string)
#     return input_string


# test_string = (
#     "I played and started playing with players and we all love to play with plays"
# )
# pipeline(test_string)

## **_Cleaner_** class is responsible for cleaning the tweets using the pipeline function.


In [20]:
class Cleaner:
    def __init__(self):
        pass

    # 1. Make a function that makes all text lowercase.

    def make_lowercase(self, input_string):
        return input_string.lower()

    # 2. Make a function that removes all punctuation.

    def remove_punctuation(self, input_string):
        input_string = re.sub(r"[^\w\s]", "", input_string)
        return input_string

    # 3. Make a function that removes all stopwords.

    def remove_stopwords(self, input_string):
        words = word_tokenize(input_string)
        valid_words = []

        for word in words:
            if word not in stopwords:
                valid_words.append(word)

        input_string = " ".join(valid_words)

        return input_string

    # 4. Make a function that breaks words into their stem words.

    def stem_words(self, input_string):
        porter = PorterStemmer()
        words = word_tokenize(input_string)
        valid_words = []

        for word in words:
            stemmed_word = porter.stem(word)
            valid_words.append(stemmed_word)

        input_string = " ".join(valid_words)

        return input_string

    # 5. Make a pipeline function that applies all the text processing functions you just built.

    def pipeline(self, input_string):
        input_string = self.make_lowercase(input_string)
        input_string = self.remove_punctuation(input_string)
        input_string = self.remove_stopwords(input_string)
        input_string = self.stem_words(input_string)
        return input_string

In [21]:
# Clean the tweets.

cleaner = Cleaner()

# Apply the full cleaning pipeline to every tweet and store the result in a new column.
df["Tweet_clean"] = df["Tweet"].apply(cleaner.pipeline)

print("ORIGINAL TWEET:\n\n", df["Tweet"][0])
print("\nCLEANED TWEET:\n\n", df["Tweet_clean"][0])

ORIGINAL TWEET:

 Today, Senate Dems vote to #SaveTheInternet. Proud to support similar #NetNeutrality legislation here in the House… https://t.co/n3tggDLU1L

CLEANED TWEET:

 today senat dem vote savetheinternet proud support similar netneutr legisl hous httpstcon3tggdlu1l
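
In the cleaned tweet above, the link `https://t.co/n3tggDLU1L` ends up as the single token `httpstcon3tggdlu1l`, because punctuation is stripped while the URL is still in the text. The following is a minimal sketch, assuming one wants to drop links before the other steps; `URL_PATTERN` and `strip_urls` are hypothetical names that are not part of the notebook, and the regular expression is a simple approximation rather than a full URL parser.

```python
import re

# Hypothetical helper pattern: an http(s) link up to the next whitespace.
URL_PATTERN = re.compile(r"https?://\S+")


def strip_urls(text):
    """Remove URLs, then collapse the whitespace left behind."""
    without_urls = URL_PATTERN.sub(" ", text)
    return " ".join(without_urls.split())


def remove_punctuation(text):
    """Same regular expression as Cleaner.remove_punctuation."""
    return re.sub(r"[^\w\s]", "", text)


tweet = "Proud to support #NetNeutrality here in the House… https://t.co/n3tggDLU1L"

# Without URL stripping, the link collapses into one meaningless token.
print(remove_punctuation(tweet.lower()))

# With URL stripping first, only the words of the tweet remain.
print(remove_punctuation(strip_urls(tweet).lower()))
```

If this step were adopted, it would need to run before `remove_punctuation` in the pipeline, since the pattern relies on the `://` that punctuation removal deletes.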


In [22]:
# Display the first 5 rows to check the new Tweet_clean column.

df.head()

Unnamed: 0,Party,Handle,Tweet,Tweet_clean
0,Democrat,RepDarrenSoto,"Today, Senate Dems vote to #SaveTheInternet. P...",today senat dem vote savetheinternet proud sup...
1,Democrat,RepDarrenSoto,RT @WinterHavenSun: Winter Haven resident / Al...,rt winterhavensun winter resid alta vista teac...
2,Democrat,RepDarrenSoto,RT @NBCLatino: .@RepDarrenSoto noted that Hurr...,rt nbclatino repdarrensoto note hurrican maria...
3,Democrat,RepDarrenSoto,RT @NALCABPolicy: Meeting with @RepDarrenSoto ...,rt nalcabpolici meet repdarrensoto thank take ...
4,Democrat,RepDarrenSoto,RT @Vegalteno: Hurricane season starts on June...,rt vegalteno hurrican season start june 1st pu...


---

# Label Encoder


In [23]:
# LabelEncoder assigns integers in sorted (alphabetical) order of the class names:
# Democrat -> 0, Republican -> 1.

le = preprocessing.LabelEncoder()
le.fit(df["Party"].tolist())

df["label"] = le.transform(df["Party"].tolist())
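
`LabelEncoder` stores the distinct class names in sorted order (available as `le.classes_` after fitting) and uses each name's position as its integer label. The plain-Python sketch below mimics that mapping for illustration only; `build_label_mapping` is a hypothetical helper, not scikit-learn code.

```python
def build_label_mapping(labels):
    """Map each distinct label to its position in sorted order."""
    return {name: index for index, name in enumerate(sorted(set(labels)))}


parties = ["Republican", "Democrat", "Democrat", "Republican"]

mapping = build_label_mapping(parties)
encoded = [mapping[party] for party in parties]

print(mapping)
print(encoded)
```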

In [24]:
df.head()

Unnamed: 0,Party,Handle,Tweet,Tweet_clean,label
0,Democrat,RepDarrenSoto,"Today, Senate Dems vote to #SaveTheInternet. P...",today senat dem vote savetheinternet proud sup...,0
1,Democrat,RepDarrenSoto,RT @WinterHavenSun: Winter Haven resident / Al...,rt winterhavensun winter resid alta vista teac...,0
2,Democrat,RepDarrenSoto,RT @NBCLatino: .@RepDarrenSoto noted that Hurr...,rt nbclatino repdarrensoto note hurrican maria...,0
3,Democrat,RepDarrenSoto,RT @NALCABPolicy: Meeting with @RepDarrenSoto ...,rt nalcabpolici meet repdarrensoto thank take ...,0
4,Democrat,RepDarrenSoto,RT @Vegalteno: Hurricane season starts on June...,rt vegalteno hurrican season start june 1st pu...,0


---

# Split data (train_test_split)


In [25]:
# Split the data into training and testing.

df_train, df_test = train_test_split(df, test_size=0.2)
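
The call above passes no seed, so each run produces a different split. `train_test_split` accepts `random_state` for reproducibility and `stratify` to keep the class proportions equal in both parts. The sketch below illustrates both ideas in plain Python under simple assumptions; `stratified_split` is a hypothetical function and not scikit-learn's implementation.

```python
import random
from collections import defaultdict


def stratified_split(labels, test_size=0.2, seed=42):
    """Return (train_indices, test_indices), keeping per-class proportions."""
    rng = random.Random(seed)  # a fixed seed makes the split repeatable

    # Group row indices by their class label.
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)

    train_idx, test_idx = [], []
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)
        n_test = round(len(indices) * test_size)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])

    return sorted(train_idx), sorted(test_idx)


labels = [0] * 50 + [1] * 50
train_idx, test_idx = stratified_split(labels)
print(len(train_idx), len(test_idx))
```

With scikit-learn itself, the equivalent would be to pass `random_state=<some integer>` and `stratify=df["label"]` to `train_test_split`; doing so would change which rows land in each split compared with the outputs recorded in this notebook.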

In [26]:
df_train

Unnamed: 0,Party,Handle,Tweet,Tweet_clean,label
27542,Democrat,BobbyScott,RT @MarkWarner: Despite the Trump Administrati...,rt markwarn despit trump administr attempt sla...,0
61442,Republican,RepDavidYoung,Scammers are targeting MidAmerican Energy cust...,scammer target midamerican energi custom recei...,1
39003,Democrat,RepGwenMoore,"Propped up by @SpeakerRyan, @GOP's House Intel...",prop speakerryan gop hous intel investig prote...,0
13543,Democrat,USRepRickNolan,"They marched in the streets, with signs bearin...",march street sign bear simpl power word man ia...,0
82749,Republican,RepPeteKing,RT @NYGovCuomo: Today Representatives @NitaLow...,rt nygovcuomo today repres nitalowey reppetek ...,1
...,...,...,...,...,...
81371,Republican,RepMikeCoffman,Wishing everyone a happy #Passover. May it be ...,wish everyon happi passov may full happi peac ...,1
58577,Republican,RepThomasMassie,@MaryBro77801894 Are there 51 senators you tru...,marybro77801894 51 senat trust id rather keep ...,1
78516,Republican,KenCalvert,"RT @GOPLeader: As a matter of principle, globa...",rt goplead matter principl global secur agreem...,1
60217,Republican,RepLanceNJ7,With many New Jersey taxpayers likely to be hi...,mani new jersey taxpay like hit hard tax cut b...,1


In [27]:
df_test

Unnamed: 0,Party,Handle,Tweet,Tweet_clean,label
1798,Democrat,RepKihuen,Identity theft is a serious concern. Get infor...,ident theft seriou concern get inform next tue...,0
63771,Republican,RepEdRoyce,I was honored to meet Ji Seong-ho last October...,honor meet ji seongho last octob commit freedo...,1
82016,Republican,SpeakerRyan,RT @HouseGOP: Soon we will say… goodbye and go...,rt housegop soon say goodby good riddanc old o...,1
70612,Republican,RepMarthaRoby,More tax reform news! Great Southern Wood in A...,tax reform news great southern wood abbevil an...,1
32626,Democrat,JacksonLeeTX18,Federal Budget Deficit Projected to Top $1 Tri...,feder budget deficit project top 1 trillion 20...,0
...,...,...,...,...,...
35608,Democrat,RepMaxineWaters,RT @janschakowsky: I’m standing with my collea...,rt janschakowski im stand colleagu protectmuel...,0
19855,Democrat,Call_Me_Dutch,Enjoyed time meeting w/ Major General Randy Ta...,enjoy time meet w major gener randi taylor hus...,0
56970,Republican,RepTomRice,A typical middle-income family of four in Sout...,typic middleincom famili four south carolina 7...,1
52663,Republican,RepSanfordSC,Joaquin. Matthew. Irma.\nSC has seen 3 years w...,joaquin matthew irma sc seen 3 year wconsecut ...,1


---

# Convert to Hugging Face Dataset


In [28]:
train_dataset = Dataset.from_pandas(df_train)
test_dataset = Dataset.from_pandas(df_test)

---

# Tokenizer


In [29]:
tokenizer = AutoTokenizer.from_pretrained(model_name)

In [30]:
def preprocess_function(examples):
    return tokenizer(examples["Tweet_clean"], truncation=True)

In [31]:
tokenized_train = train_dataset.map(preprocess_function, batched=True)
tokenized_test = test_dataset.map(preprocess_function, batched=True)

Map: 100%|██████████| 69122/69122 [00:05<00:00, 13504.68 examples/s]
Map: 100%|██████████| 17281/17281 [00:01<00:00, 14169.89 examples/s]


In [47]:
tokenized_test

Dataset({
    features: ['Party', 'Handle', 'Tweet', 'Tweet_clean', 'label', '__index_level_0__', 'input_ids', 'attention_mask'],
    num_rows: 17281
})

In [48]:
tokenized_test[0]

{'Party': 'Democrat',
 'Handle': 'RepKihuen',
 'Tweet': 'Identity theft is a serious concern. Get informed next Tuesday, April 3rd at @LVMPDBAC\'s "First Tuesday" community… https://t.co/pm6WzNEXoh',
 'Tweet_clean': 'ident theft seriou concern get inform next tuesday april 3rd lvmpdbac first tuesday commun httpstcopm6wznexoh',
 'label': 0,
 '__index_level_0__': 1798,
 'input_ids': [101, 8909, 4765, 11933, 14262, 3695, 2226, 5142, 2131, 12367, 2279, 9857, 2258, 3822, 1048, 2615, 8737, 18939, 6305, 2034, 9857, 4012, 23041, 16770, 13535, 7361, 2213, 2575, 2860, 2480, 2638, 2595, 11631, 102],
 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}

In [49]:
tokenized_test[0]["Party"]

'Democrat'

---

# Initialize Model


In [32]:
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)

Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'pre_classifier.bias', 'classifier.weight', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


---

# Train Model


In [33]:
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)
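
`DataCollatorWithPadding` pads each batch to the length of its longest sequence instead of padding the whole dataset to one fixed length. The sketch below shows the idea in plain Python; `pad_batch` is a hypothetical stand-in, the padding id of 0 is an assumption matching the `[PAD]` token of the uncased BERT vocabulary, and the real collator additionally returns tensors.

```python
def pad_batch(batch_input_ids, pad_token_id=0):
    """Pad every sequence to the longest sequence in this batch."""
    max_len = max(len(ids) for ids in batch_input_ids)

    input_ids, attention_mask = [], []
    for ids in batch_input_ids:
        n_pad = max_len - len(ids)
        input_ids.append(ids + [pad_token_id] * n_pad)
        # 1 marks a real token, 0 marks padding the model should ignore.
        attention_mask.append([1] * len(ids) + [0] * n_pad)

    return {"input_ids": input_ids, "attention_mask": attention_mask}


batch = pad_batch([[101, 8909, 102], [101, 8909, 4765, 11933, 102]])
print(batch["input_ids"])
print(batch["attention_mask"])
```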

In [34]:
metric = evaluate.load("accuracy")

In [35]:
def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    return metric.compute(predictions=predictions, references=labels)
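
`compute_metrics` turns each row of logits into a class index with `argmax` and then measures how often that index equals the label. The dependency-free sketch below reproduces the same calculation on made-up numbers; `argmax` and `accuracy_from_logits` are hypothetical helpers and the logits are invented for illustration.

```python
def argmax(row):
    """Index of the largest value in a row of logits."""
    return max(range(len(row)), key=lambda i: row[i])


def accuracy_from_logits(logits, labels):
    """Fraction of rows whose highest logit matches the label."""
    predictions = [argmax(row) for row in logits]
    correct = sum(int(p == y) for p, y in zip(predictions, labels))
    return {"accuracy": correct / len(labels)}


# Invented logits for four examples and two classes.
logits = [[2.0, -1.0], [0.1, 0.3], [-0.5, 0.5], [1.2, 1.1]]
labels = [0, 1, 0, 0]

result = accuracy_from_logits(logits, labels)
print(result)
```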

In [36]:
# Hyperparameters.

num_train_epochs = 3
learning_rate = 2e-4
per_device_train_batch_size = 8
per_device_eval_batch_size = 8
weight_decay = 0.01

In [37]:
evaluation_strategy = "epoch"
logging_strategy = "epoch"

training_args = TrainingArguments(
    output_dir="./results_tweets",
    learning_rate=learning_rate,
    per_device_train_batch_size=per_device_train_batch_size,
    per_device_eval_batch_size=per_device_eval_batch_size,
    num_train_epochs=num_train_epochs,
    weight_decay=weight_decay,
    evaluation_strategy=evaluation_strategy,
    logging_strategy=logging_strategy,
)
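
The number of optimizer steps and the learning rate at each epoch boundary follow from these settings. The sketch below assumes a single device, no gradient accumulation, and the `Trainer` defaults of a linear decay schedule without warmup; the 69122 training rows come from the `Map` output above, and `linear_lr` is a hypothetical helper.

```python
import math

train_rows = 69122   # size of the training split
batch_size = 8       # per_device_train_batch_size
epochs = 3           # num_train_epochs
initial_lr = 2e-4    # learning_rate

# The last batch of each epoch is smaller, hence the ceiling.
steps_per_epoch = math.ceil(train_rows / batch_size)
total_steps = steps_per_epoch * epochs


def linear_lr(step):
    """Learning rate after `step` updates under linear decay to zero."""
    return initial_lr * max(0.0, 1 - step / total_steps)


print(steps_per_epoch, total_steps)
for epoch in range(1, epochs + 1):
    print(epoch, linear_lr(epoch * steps_per_epoch))
```

Under these assumptions the total matches the `global_step` reported by the training output, and the per-epoch values match the logged `learning_rate` entries.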

In [38]:
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_train,
    eval_dataset=tokenized_test,
    tokenizer=tokenizer,
    data_collator=data_collator,
    compute_metrics=compute_metrics,
)

In [39]:
trainer.train()  # 4.3 hours training on CPU.

You're using a DistilBertTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.

{'loss': 0.6958, 'learning_rate': 0.00013333333333333334, 'epoch': 1.0}
{'eval_loss': 0.6931149959564209, 'eval_accuracy': 0.51131300271975, 'eval_runtime': 218.984, 'eval_samples_per_second': 78.914, 'eval_steps_per_second': 9.868, 'epoch': 1.0}
{'loss': 0.6931, 'learning_rate': 6.666666666666667e-05, 'epoch': 2.0}
{'eval_loss': 0.6929592490196228, 'eval_accuracy': 0.51131300271975, 'eval_runtime': 220.0874, 'eval_samples_per_second': 78.519, 'eval_steps_per_second': 9.819, 'epoch': 2.0}
{'loss': 0.6929, 'learning_rate': 0.0, 'epoch': 3.0}
{'eval_loss': 0.6928938627243042, 'eval_accuracy': 0.51131300271975, 'eval_runtime': 230.553, 'eval_samples_per_second': 74.955, 'eval_steps_per_second': 9.373, 'epoch': 3.0}
{'train_runtime': 15654.8999, 'train_samples_per_second': 13.246, 'train_steps_per_second': 1.656, 'train_loss': 0.693940347374918, 'epoch': 3.0}

100%|██████████| 25923/25923 [4:20:54<00:00,  1.66it/s]

TrainOutput(global_step=25923, training_loss=0.693940347374918, metrics={'train_runtime': 15654.8999, 'train_samples_per_second': 13.246, 'train_steps_per_second': 1.656, 'train_loss': 0.693940347374918, 'epoch': 3.0})
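
The logged values stay essentially flat: the loss sits near 0.693 and `eval_accuracy` is identical after every epoch. The short check below compares them with two reference points, the cross-entropy of a 50/50 guess on two classes (ln 2) and the accuracy of always predicting the larger class. The class counts are the test-split supports shown in the classification report for the testing data. Reading this as a sign that the model did not learn is an interpretation, and the learning rate of 2e-4 (higher than the 2e-5 to 5e-5 often used when fine-tuning BERT-style models) is only one possible cause.

```python
import math

# Cross-entropy when both classes get probability 0.5.
chance_loss = math.log(2)

# Label counts in the test split (0 = Democrat, 1 = Republican).
test_support = {0: 8445, 1: 8836}
majority_accuracy = max(test_support.values()) / sum(test_support.values())

print(round(chance_loss, 4))        # compare with the logged eval_loss
print(round(majority_accuracy, 6))  # compare with the logged eval_accuracy
```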

---

# Save Model


In [40]:
trainer.save_model("tweets_model")

---

# Evaluate Model


## Training Data


In [41]:
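# Note: trainer.predict() returns (predictions, label_ids, metrics), so index [1]
# selects label_ids (the ground-truth labels), not the model's predicted classes.
# The predicted classes would be np.argmax(<output>.predictions, axis=-1).
train_predictions = trainer.predict(tokenized_train)[1]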

100%|██████████| 8641/8641 [15:19<00:00,  9.40it/s]


In [42]:
train_list = []

for i in range(tokenized_train.num_rows):
    if tokenized_train[i]["Party"] == "Democrat":
        train_list.append(0)
    elif tokenized_train[i]["Party"] == "Republican":
        train_list.append(1)

print(np.array(train_list))

print()

np.array(train_list) == train_predictions

[0 1 0 ... 1 1 0]



array([ True,  True,  True, ...,  True,  True,  True])

In [43]:
# Evaluating on the training data.

GT = df_train["label"].tolist()
print(classification_report(GT, train_predictions))

              precision    recall  f1-score   support

           0       1.00      1.00      1.00     33596
           1       1.00      1.00      1.00     35526

    accuracy                           1.00     69122
   macro avg       1.00      1.00      1.00     69122
weighted avg       1.00      1.00      1.00     69122
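
A perfect report deserves a second look when the evaluation accuracy logged during training was about 0.51. `trainer.predict` returns a named tuple with the fields `predictions`, `label_ids`, and `metrics`, in that order, so indexing it with `[1]` yields the true labels rather than the model's output. The sketch below uses a stand-in tuple with the same field order and invented numbers to show the difference; the local `PredictionOutput` class is a hypothetical substitute for the one in `transformers`.

```python
from typing import Dict, List, NamedTuple


class PredictionOutput(NamedTuple):
    """Stand-in with the same field order as the transformers output."""
    predictions: List[List[float]]  # one row of logits per example
    label_ids: List[int]            # ground-truth labels
    metrics: Dict[str, float]


# Invented logits that always favor class 1.
output = PredictionOutput(
    predictions=[[0.1, 0.2], [0.0, 0.3], [0.2, 0.4]],
    label_ids=[0, 1, 0],
    metrics={},
)

# Index 1 is label_ids: comparing it with the labels always matches.
labels_again = output[1]

# The predicted class is the argmax of each row of logits.
predicted = [max(range(len(row)), key=lambda i: row[i]) for row in output.predictions]

print(labels_again == output.label_ids)
print(predicted)
```

With the real trainer, the predicted classes would be `np.argmax(trainer.predict(tokenized_test).predictions, axis=-1)`, and passing those to `classification_report` would give the model's actual scores.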



## Testing Data


In [44]:
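# Note: as with the training data, index [1] selects label_ids (the ground-truth labels),
# not the model's predicted classes.
test_predictions = trainer.predict(tokenized_test)[1]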

100%|██████████| 2161/2161 [03:43<00:00,  9.66it/s]


In [45]:
test_list = []

for i in range(tokenized_test.num_rows):
    if tokenized_test[i]["Party"] == "Democrat":
        test_list.append(0)
    elif tokenized_test[i]["Party"] == "Republican":
        test_list.append(1)

print(np.array(test_list))

print()

np.array(test_list) == test_predictions

[0 1 1 ... 1 1 0]



array([ True,  True,  True, ...,  True,  True,  True])

In [54]:
print(df_test["label"].tolist())
print(len(df_test["label"].tolist()))

[0, 1, 1, 1, 0, 1, 0, 1, 0, 0, 1, 0, 1, 0, 1, 1, 1, 1, 0, 1, ...

In [55]:
print(test_predictions.tolist())
print(len(test_predictions.tolist()))

[0, 1, 1, 1, 0, 1, 0, 1, 0, 0, 1, 0, 1, 0, 1, 1, 1, 1, 0, 1, ...

In [46]:
# Evaluating on the testing data.
# GT = Ground Truth.

GT = df_test["label"].tolist()
print(classification_report(GT, test_predictions))

              precision    recall  f1-score   support

           0       1.00      1.00      1.00      8445
           1       1.00      1.00      1.00      8836

    accuracy                           1.00     17281
   macro avg       1.00      1.00      1.00     17281
weighted avg       1.00      1.00      1.00     17281

