---

# Ioannou_Georgios


## Copyright © 2023 by Georgios Ioannou


---

<h1 align="center"> NLP Fine-Tuning With Hugging Face </h1>

<h2 align="center"> dem-vs-rep-tweets.csv </h2>


---

- Fine-tuning a natural language processing (NLP) model means taking a pre-trained model and continuing to train its weights on a new, usually smaller, labelled dataset so that it performs well on a specific task. The architecture is mostly reused as is, typically with a new task-specific output layer (here, a classification head) placed on top. Choices such as the learning rate, batch size, number of epochs, and weight decay still have to be tuned, and the dataset often needs cleaning first, so fine-tuning requires a good understanding of both the model and the task at hand.

- Fine-tuning is usually much cheaper than training from scratch, because the model starts from weights that already encode general language knowledge and only has to adapt to the characteristics of the task and the dataset.


---

# INSTALLATIONS


In [1]:
# ! pip install transformers
# ! pip install datasets
# ! pip install evaluate
# ! pip install scikit-learn
# ! pip install nltk
# ! pip install beautifulsoup4
# ! pip install lxml

---

# LIBRARIES


In [2]:
import evaluate
import numpy as np
import pandas as pd

from datasets import Dataset
from sklearn.metrics import classification_report
from sklearn import preprocessing
from sklearn.model_selection import train_test_split
from transformers import AutoTokenizer
from transformers import AutoModelForSequenceClassification
from transformers import DataCollatorWithPadding
from transformers import TrainingArguments, Trainer



---

# PRE-TRAINED MODEL


In [3]:
# FINE TUNING THIS PRETRAINED MODEL.

model_name = "distilbert-base-uncased"

---

# DATASET


In [4]:
# Load dem-vs-rep-tweets.csv from the current working directory into a dataframe.

df = pd.read_csv("dem-vs-rep-tweets.csv")

# Print/Display/Return the first 5 rows of the file dem-vs-rep-tweets.csv to make sure the file was loaded successfully.

df.head()

Unnamed: 0,Party,Handle,Tweet
0,Democrat,RepDarrenSoto,"Today, Senate Dems vote to #SaveTheInternet. P..."
1,Democrat,RepDarrenSoto,RT @WinterHavenSun: Winter Haven resident / Al...
2,Democrat,RepDarrenSoto,RT @NBCLatino: .@RepDarrenSoto noted that Hurr...
3,Democrat,RepDarrenSoto,RT @NALCABPolicy: Meeting with @RepDarrenSoto ...
4,Democrat,RepDarrenSoto,RT @Vegalteno: Hurricane season starts on June...


In [5]:
# Print the shape

print("df.shape =", df.shape)

df.shape = (86460, 3)


In [6]:
# Inspect / remove nulls and duplicates.

print("NULLS")
print(df.isnull().sum(), "\n")
print("Duplicates")
print("df.duplicated().sum() =", df.duplicated().sum(), "\n")
df.drop_duplicates(inplace=True)
print("Removing Duplicates")
print("df.duplicated().sum() =", df.duplicated().sum(), "\n")
print("df.shape =", df.shape)

NULLS
Party     0
Handle    0
Tweet     0
dtype: int64 

Duplicates
df.duplicated().sum() = 57 

Removing Duplicates
df.duplicated().sum() = 0 

df.shape = (86403, 3)
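
What `duplicated()` and `drop_duplicates()` do with their default settings can be sketched in plain Python: a row counts as a duplicate when an identical row appeared earlier, and only the first occurrence is kept. The rows below are made up for illustration and are not taken from the dataset.

```python
# Toy rows (Party, Handle, Tweet); illustrative values only.
rows = [
    ("Democrat", "handle_a", "tweet one"),
    ("Republican", "handle_b", "tweet two"),
    ("Democrat", "handle_a", "tweet one"),  # exact repeat of the first row
]

seen = set()
unique_rows = []
duplicate_count = 0

for row in rows:
    if row in seen:
        duplicate_count += 1  # corresponds to a True in df.duplicated()
    else:
        seen.add(row)
        unique_rows.append(row)  # the first occurrence is kept

print("duplicates =", duplicate_count)
print("rows kept  =", len(unique_rows))
```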


In [7]:
# Find the class balance: print how many tweets each Party (Republican and Democrat) has.

print("Class Balances")
df["Party"].value_counts()

Class Balances


Party
Republican    44362
Democrat      42041
Name: count, dtype: int64

In [8]:
df.groupby("Party").count()

Unnamed: 0_level_0,Handle,Tweet
Party,Unnamed: 1_level_1,Unnamed: 2_level_1
Democrat,42041,42041
Republican,44362,44362


---

# TAKING AN EXTREMELY SMALL SUBSET FOR THE LECTURE


In [9]:
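# Fix the seed so the same 1000 rows are selected on every run.
np.random.seed(42)

# Shuffle the rows, reset the index, and keep the first 1000 tweets.
shuffled_indices = np.random.permutation(df.index)
df = df.loc[shuffled_indices].reset_index(drop=True)
df = df[:1000]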

In [10]:
# Display the first 5 rows of the shuffled 1000-row subset.

df.head()

Unnamed: 0,Party,Handle,Tweet
0,Republican,farenthold,I’m hopeful that President @realDonaldTrump’s ...
1,Republican,RepChrisStewart,"RT @OutFrontCNN: .@RepChrisStewart: Kushner ""a..."
2,Democrat,RepBradAshford,Join me today @UNOmaha as I honor Dr. Lourdes ...
3,Republican,GOPpolicy,#GOPWorkingforWomen hearing LIVE now on https:...
4,Democrat,RepBonamici,"Portland Town Hall Meeting\nSaturday, March 10..."


In [11]:
# Print the shape

print("df.shape =", df.shape)

df.shape = (1000, 3)


In [12]:
# Inspect / remove nulls and duplicates.

print("NULLS")
print(df.isnull().sum(), "\n")
print("Duplicates")
print("df.duplicated().sum() =", df.duplicated().sum(), "\n")
df.drop_duplicates(inplace=True)
print("Removing Duplicates")
print("df.duplicated().sum() =", df.duplicated().sum(), "\n")
print("df.shape =", df.shape)

NULLS
Party     0
Handle    0
Tweet     0
dtype: int64 

Duplicates
df.duplicated().sum() = 0 

Removing Duplicates
df.duplicated().sum() = 0 

df.shape = (1000, 3)


In [13]:
# Find the class balance: print how many tweets each Party (Republican and Democrat) has.

print("Class Balances")
df["Party"].value_counts()

Class Balances


Party
Democrat      510
Republican    490
Name: count, dtype: int64

In [14]:
df.groupby("Party").count()

Unnamed: 0_level_0,Handle,Tweet
Party,Unnamed: 1_level_1,Unnamed: 2_level_1
Democrat,510,510
Republican,490,490


---

# CLEAN DATASET


In [15]:
# NLTK is the Natural Language Toolkit.

import nltk
from nltk.corpus import stopwords
from nltk import word_tokenize
from nltk.stem import PorterStemmer

# Regular Expression Library.

import re

# Download packages from nltk.

nltk.download("stopwords")
nltk.download("punkt")
nltk.download("wordnet")
stopwords = stopwords.words("english")

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\georg\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\georg\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\georg\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


In [16]:
# # 1. Make a function that makes all text lowercase.


# def make_lowercase(input_string):
#     return input_string.lower()


# test_string = "This is A SENTENCE with LOTS OF CAPS."
# make_lowercase(test_string)

In [17]:
# # 2. Make a function that removes all punctuation.


# def remove_punctuation(input_string):
#     input_string = re.sub(r"[^\w\s]", "", input_string)
#     return input_string


# test_string = "This is a sentence! 50 With lots of punctuation??? & other #things."
# remove_punctuation(test_string)

In [18]:
# # 3. Make a function that removes all stopwords.


# def remove_stopwords(input_string):
#     words = word_tokenize(input_string)
#     valid_words = []

#     for word in words:
#         if word not in stopwords:
#             valid_words.append(word)

#     input_string = " ".join(valid_words)

#     return input_string


# test_string = "This is a sentence! With some different stopwords i have added in here."
# remove_stopwords(test_string)

In [19]:
# # 4. Make a function that breaks words into their stem words.


# def stem_words(input_string):
#     porter = PorterStemmer()
#     words = word_tokenize(input_string)
#     valid_words = []

#     for word in words:
#         stemmed_word = porter.stem(word)
#         valid_words.append(stemmed_word)

#     input_string = " ".join(valid_words)

#     return input_string


# test_string = (
#     "I played and started playing with players and we all love to play with plays"
# )

# stem_words(test_string)

In [20]:
# # 5. Make a pipeline function that applies all the text processing functions you just built.


# def pipeline(input_string):
#     input_string = make_lowercase(input_string)
#     input_string = remove_punctuation(input_string)
#     input_string = remove_stopwords(input_string)
#     input_string = stem_words(input_string)
#     return input_string


# test_string = (
#     "I played and started playing with players and we all love to play with plays"
# )
# pipeline(test_string)

In [21]:
class Cleaner:
    """Text-cleaning steps for tweets, combined in `pipeline`."""

    def __init__(self):
        # Create the stemmer once instead of on every call.
        self.porter = PorterStemmer()

    def make_lowercase(self, input_string):
        """1. Convert all text to lowercase."""
        return input_string.lower()

    def remove_punctuation(self, input_string):
        """2. Remove every character that is not a word character or whitespace."""
        return re.sub(r"[^\w\s]", "", input_string)

    def remove_stopwords(self, input_string):
        """3. Remove English stopwords (uses the `stopwords` list defined above)."""
        words = word_tokenize(input_string)
        valid_words = [word for word in words if word not in stopwords]
        return " ".join(valid_words)

    def stem_words(self, input_string):
        """4. Reduce each word to its Porter stem."""
        words = word_tokenize(input_string)
        stemmed_words = [self.porter.stem(word) for word in words]
        return " ".join(stemmed_words)

    def pipeline(self, input_string):
        """5. Apply all the cleaning steps in order."""
        input_string = self.make_lowercase(input_string)
        input_string = self.remove_punctuation(input_string)
        input_string = self.remove_stopwords(input_string)
        input_string = self.stem_words(input_string)
        return input_string

In [22]:
cleaner = Cleaner()

# Apply the cleaning pipeline to every tweet and store the result in a new column.
df["Tweet_clean"] = df["Tweet"].apply(cleaner.pipeline)

print("ORIGINAL TWEET:\n\n", df["Tweet"][0])
print("\nCLEANED TWEET:\n\n", df["Tweet_clean"][0])

ORIGINAL TWEET:

 I’m hopeful that President @realDonaldTrump’s visit to Asia will push North Korea to come to the table.  https://t.co/f7XAPwma4l

CLEANED TWEET:

 im hope presid realdonaldtrump visit asia push north korea come tabl httpstcof7xapwma4l
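
In the cleaned tweet above, the link survives as the single token `httpstcof7xapwma4l`, because punctuation is stripped before anything treats the URL as a unit. One possible refinement, shown here only as a sketch and not part of the original pipeline, is to drop links before the other steps run. The helper name `strip_urls` and its regular expression are assumptions made for illustration.

```python
import re

# Matches http:// or https:// followed by any run of non-whitespace characters.
URL_PATTERN = re.compile(r"https?://\S+")


def strip_urls(text):
    """Remove links, then collapse any leftover runs of whitespace."""
    text = URL_PATTERN.sub(" ", text)
    return " ".join(text.split())


example = "Visit to Asia https://t.co/abc123 today"
print(strip_urls(example))
```

Whether links carry useful signal for this task is an open question, so this step would be worth comparing against the original pipeline rather than adopting outright.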


In [23]:
# Display the first 5 rows to check the new Tweet_clean column.

df.head()

Unnamed: 0,Party,Handle,Tweet,Tweet_clean
0,Republican,farenthold,I’m hopeful that President @realDonaldTrump’s ...,im hope presid realdonaldtrump visit asia push...
1,Republican,RepChrisStewart,"RT @OutFrontCNN: .@RepChrisStewart: Kushner ""a...",rt outfrontcnn repchrisstewart kushner great w...
2,Democrat,RepBradAshford,Join me today @UNOmaha as I honor Dr. Lourdes ...,join today unomaha honor dr lourd gouveia cont...
3,Republican,GOPpolicy,#GOPWorkingforWomen hearing LIVE now on https:...,gopworkingforwomen hear live httpstco9thhlxfu7k
4,Democrat,RepBonamici,"Portland Town Hall Meeting\nSaturday, March 10...",portland town hall meet saturday march 10 1030...


---

# Label Encoder


In [24]:
le = preprocessing.LabelEncoder()
le.fit(df["Party"].tolist())

df["label"] = le.transform(df["Party"].tolist())

In [25]:
df.head()

Unnamed: 0,Party,Handle,Tweet,Tweet_clean,label
0,Republican,farenthold,I’m hopeful that President @realDonaldTrump’s ...,im hope presid realdonaldtrump visit asia push...,1
1,Republican,RepChrisStewart,"RT @OutFrontCNN: .@RepChrisStewart: Kushner ""a...",rt outfrontcnn repchrisstewart kushner great w...,1
2,Democrat,RepBradAshford,Join me today @UNOmaha as I honor Dr. Lourdes ...,join today unomaha honor dr lourd gouveia cont...,0
3,Republican,GOPpolicy,#GOPWorkingforWomen hearing LIVE now on https:...,gopworkingforwomen hear live httpstco9thhlxfu7k,1
4,Democrat,RepBonamici,"Portland Town Hall Meeting\nSaturday, March 10...",portland town hall meet saturday march 10 1030...,0
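
`LabelEncoder` sorts the distinct class names and assigns each one its position in that sorted order, which is why `Democrat` becomes 0 and `Republican` becomes 1 in the table above. A plain-Python sketch of the same mapping follows; the variable names are illustrative.

```python
# Party values of the first five rows shown in the table.
parties = ["Republican", "Republican", "Democrat", "Republican", "Democrat"]

# Sorted unique classes, analogous to le.classes_.
classes = sorted(set(parties))
class_to_id = {name: index for index, name in enumerate(classes)}

# Analogous to le.transform.
labels = [class_to_id[name] for name in parties]
print(class_to_id)
print(labels)

# Reverse lookup, analogous to le.inverse_transform.
decoded = [classes[label] for label in labels]
print(decoded == parties)
```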


---

# train_test_split


In [26]:
df_train, df_test = train_test_split(df, test_size=0.2)
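
`train_test_split` is called here without a `random_state`, so every run produces a different split. The plain-Python sketch below shows the idea behind a seeded 80/20 split. It is not scikit-learn's implementation, and the function name `seeded_split` is hypothetical.

```python
import random


def seeded_split(items, test_size=0.2, seed=42):
    """Shuffle a copy of items with a fixed seed, then cut off the test part."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    n_test = int(round(len(shuffled) * test_size))
    return shuffled[n_test:], shuffled[:n_test]


rows = list(range(1000))
train_rows, test_rows = seeded_split(rows)
print(len(train_rows), len(test_rows))

# Calling it again with the same seed reproduces the same split.
print(seeded_split(rows) == (train_rows, test_rows))
```

With scikit-learn itself, passing `random_state=42` to `train_test_split` serves the same purpose.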

In [27]:
df_train

Unnamed: 0,Party,Handle,Tweet,Tweet_clean,label
563,Republican,RepRalphNorman,RT @RepRalphNorman: Participate in my Twitter ...,rt repralphnorman particip twitter qampa ask q...,1
790,Democrat,RepDonaldPayne,RT @HomelandDems: Was Tillerson fired today be...,rt homelanddem tillerson fire today stood uk p...,0
143,Republican,GOPoversight,RT @FoxNews: .@TGowdySC on missing FBI text me...,rt foxnew tgowdysc miss fbi text messag bia in...,1
646,Democrat,RepRaskin,If your DACA status expired on or after Septem...,daca statu expir septemb 5 2016 may submit dac...,0
890,Republican,RepSeanDuffy,RT @RepSeanDuffy: Hey Wisconsin high school st...,rt repseanduffi hey wisconsin high school stud...,1
...,...,...,...,...,...
468,Democrat,RepMaxineWaters,RT @amjoyshow: .@RepMaxineWaters Demands Twitt...,rt amjoyshow repmaxinewat demand twitter discl...,0
286,Republican,HouseHomeland,In this month's #TerrorThreatSnapshot: Terror ...,month terrorthreatsnapshot terror europ rise 1...,1
869,Democrat,AGBecerra,TUNE IN at 12PM: I will be joining the OC Sher...,tune 12pm join oc sheriff depart announc joint...,0
186,Republican,PeteSessions,Tax reform created pop in our economy by encou...,tax reform creat pop economi encourag busi cre...,1


In [28]:
df_test

Unnamed: 0,Party,Handle,Tweet,Tweet_clean,label
636,Republican,boblatta,"Today is #VietnamWarVeteransDay, recognizing t...",today vietnamwarveteransday recogn 8 million a...,1
908,Democrat,RepLoisCapps,Thanks @SLOTribune for running my op-ed highli...,thank slotribun run ope highlight need protect...,0
684,Republican,RepSmucker,The deal fails to prevent Iran from testing ba...,deal fail prevent iran test ballist missil add...,1
375,Republican,RepScottPerry,I’ll be on Your World with Neil Cavuto on @Fox...,ill world neil cavuto foxnew around 410 pm aft...,1
884,Republican,Jim_Jordan,"RT @FoxNews: .@Jim_Jordan on FISA memo: ""Tomor...",rt foxnew jim_jordan fisa memo tomorrow memo c...,1
...,...,...,...,...,...
816,Republican,SpeakerRyan,House Republicans are continuing to make good ...,hous republican continu make good agenda promi...,1
8,Democrat,RepDennyHeck,Great to join the unveiling of the new Eastsid...,great join unveil new eastsid commun center lo...,0
367,Democrat,RepCarbajal,As the son of a farmworker who worked in the f...,son farmwork work field make live famili centr...,0
393,Republican,RepChrisCollins,About to join @wolfblitzer on CNN to discuss t...,join wolfblitz cnn discuss today news httpstco...,1


---

# Convert to Hugging Face Dataset


In [29]:
train_dataset = Dataset.from_pandas(df_train)
test_dataset = Dataset.from_pandas(df_test)

---

# Tokenizer


In [30]:
tokenizer = AutoTokenizer.from_pretrained(model_name)

In [31]:
def preprocess_function(examples):
    return tokenizer(examples["Tweet_clean"], truncation=True)

In [32]:
tokenized_train = train_dataset.map(preprocess_function, batched=True)
tokenized_test = test_dataset.map(preprocess_function, batched=True)

Map: 100%|██████████| 800/800 [00:00<00:00, 36338.70 examples/s]
Map: 100%|██████████| 200/200 [00:00<00:00, 200157.67 examples/s]


---

# Initialize Model


In [33]:
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)

Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.weight', 'pre_classifier.weight', 'classifier.bias', 'pre_classifier.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


---

# Train Model


In [34]:
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)
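
`DataCollatorWithPadding` pads each batch to the length of its longest sequence instead of padding the whole dataset to one fixed length. A rough plain-Python sketch of that dynamic padding follows. The token ids are made up, the function `pad_batch` is hypothetical, and the pad id of 0 is an assumption chosen to match the `[PAD]` token of `distilbert-base-uncased`.

```python
def pad_batch(sequences, pad_id=0):
    """Pad every sequence to the longest one and build attention masks."""
    longest = max(len(seq) for seq in sequences)
    input_ids = []
    attention_mask = []
    for seq in sequences:
        padding = longest - len(seq)
        input_ids.append(list(seq) + [pad_id] * padding)
        # 1 marks a real token, 0 marks padding the model should ignore.
        attention_mask.append([1] * len(seq) + [0] * padding)
    return {"input_ids": input_ids, "attention_mask": attention_mask}


batch = pad_batch([[101, 2023, 102], [101, 2023, 2003, 1037, 102]])
print(batch["input_ids"])
print(batch["attention_mask"])
```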

In [35]:
metric = evaluate.load("accuracy")

In [36]:
def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    return metric.compute(predictions=predictions, references=labels)

In [37]:
# Hyperparameters.

num_train_epochs = 3
learning_rate = 2e-4
per_device_train_batch_size = 8
per_device_eval_batch_size = 8
weight_decay = 0.01

In [38]:
evaluation_strategy = "epoch"
logging_strategy = "epoch"

training_args = TrainingArguments(
    output_dir="./results_tweets",
    learning_rate=learning_rate,
    per_device_train_batch_size=per_device_train_batch_size,
    per_device_eval_batch_size=per_device_eval_batch_size,
    num_train_epochs=num_train_epochs,
    weight_decay=weight_decay,
    evaluation_strategy=evaluation_strategy,
    logging_strategy=logging_strategy,
)

In [39]:
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_train,
    eval_dataset=tokenized_test,
    tokenizer=tokenizer,
    data_collator=data_collator,
    compute_metrics=compute_metrics,
)

In [40]:
trainer.train()

You're using a DistilBertTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.

{'loss': 0.7171, 'learning_rate': 0.00013333333333333334, 'epoch': 1.0}
{'eval_loss': 0.704609215259552, 'eval_accuracy': 0.47, 'eval_runtime': 3.3705, 'eval_samples_per_second': 59.338, 'eval_steps_per_second': 7.417, 'epoch': 1.0}

{'loss': 0.7042, 'learning_rate': 6.666666666666667e-05, 'epoch': 2.0}
{'eval_loss': 0.6984726190567017, 'eval_accuracy': 0.47, 'eval_runtime': 2.9708, 'eval_samples_per_second': 67.323, 'eval_steps_per_second': 8.415, 'epoch': 2.0}

{'loss': 0.6972, 'learning_rate': 0.0, 'epoch': 3.0}
{'eval_loss': 0.7091423869132996, 'eval_accuracy': 0.47, 'eval_runtime': 3.3292, 'eval_samples_per_second': 60.074, 'eval_steps_per_second': 7.509, 'epoch': 3.0}
{'train_runtime': 190.757, 'train_samples_per_second': 12.581, 'train_steps_per_second': 1.573, 'train_loss': 0.7061505126953125, 'epoch': 3.0}

100%|██████████| 300/300 [03:10<00:00,  1.57it/s]

TrainOutput(global_step=300, training_loss=0.7061505126953125, metrics={'train_runtime': 190.757, 'train_samples_per_second': 12.581, 'train_steps_per_second': 1.573, 'train_loss': 0.7061505126953125, 'epoch': 3.0})
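
The `learning_rate` values in the log above (about 1.33e-4, 6.67e-5, and 0.0) are consistent with a linear decay from the initial 2e-4 down to zero over the 300 optimizer steps, where 300 is 100 batches of 8 from the 800 training tweets times 3 epochs. A small sketch of that schedule, assuming no warmup steps; the function name `linear_lr` is illustrative.

```python
import math

base_lr = 2e-4
train_rows = 800
batch_size = 8
epochs = 3

steps_per_epoch = math.ceil(train_rows / batch_size)
total_steps = steps_per_epoch * epochs


def linear_lr(step):
    """Learning rate after `step` optimizer steps under linear decay to zero."""
    return base_lr * max(0.0, (total_steps - step) / total_steps)


# Learning rate at the end of each epoch, where the Trainer logs it.
for epoch in range(1, epochs + 1):
    print(epoch, linear_lr(epoch * steps_per_epoch))
```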

In [41]:
trainer.save_model("tweets_model")

---

# Evaluate Model


---

# Training Data


In [42]:
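# Note: index [1] of the PredictionOutput returned by trainer.predict is `label_ids`
# (the true labels); the model's logits are at index [0] (`predictions`).
train_predictions = trainer.predict(tokenized_train)[1]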

100%|██████████| 100/100 [00:13<00:00,  7.68it/s]
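
`trainer.predict` returns a named tuple with the fields `predictions`, `label_ids`, and `metrics`, in that order. Index `[1]` therefore selects `label_ids`, the true labels, while the model's raw scores sit in `predictions`. The sketch below uses a stand-in named tuple with made-up logits to show how predicted classes would be taken from the logits. `FakePredictionOutput` is a hypothetical stand-in, not the library class.

```python
from typing import NamedTuple


class FakePredictionOutput(NamedTuple):
    predictions: list  # one row of logits per example
    label_ids: list    # true labels
    metrics: dict


# Made-up values for three examples and two classes.
output = FakePredictionOutput(
    predictions=[[0.2, -0.1], [0.3, 0.1], [-0.4, 0.5]],
    label_ids=[0, 1, 1],
    metrics={},
)


def argmax(row):
    """Index of the largest logit in one row."""
    return max(range(len(row)), key=row.__getitem__)


predicted = [argmax(row) for row in output.predictions]
accuracy = sum(p == y for p, y in zip(predicted, output.label_ids)) / len(predicted)

print(output[1] == output.label_ids)  # index 1 holds the labels
print(predicted, accuracy)
```

With the real `Trainer`, the equivalent call would be `np.argmax(trainer.predict(tokenized_train).predictions, axis=-1)`.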


In [43]:
train_list = []

for i in range(tokenized_train.num_rows):
    if tokenized_train[i]["Party"] == "Democrat":
        train_list.append(0)
    elif tokenized_train[i]["Party"] == "Republican":
        train_list.append(1)

print(np.array(train_list))

print()

np.array(train_list) == train_predictions

[1 0 1 0 1 0 1 0 1 0 0 1 1 0 1 0 1 1 1 0 0 0 1 1 1 1 1 0 1 0 1 0 1 1 0 1 0
 0 1 1 1 0 0 1 1 1 1 1 1 0 0 0 0 0 1 0 0 1 1 1 0 0 1 0 1 1 0 1 0 0 0 0 1 0
 1 1 0 0 1 1 1 0 0 1 0 0 0 1 1 0 0 1 0 0 0 1 1 0 0 0 1 1 0 0 1 0 0 1 1 0 1
 0 0 0 1 0 1 0 0 0 1 0 1 0 0 1 1 1 1 0 0 0 0 0 0 1 1 1 0 0 1 0 0 1 0 0 1 0
 1 1 0 1 0 0 0 0 1 1 1 1 1 1 0 1 0 0 0 1 0 1 1 0 1 1 0 1 1 0 1 1 1 0 1 0 0
 0 0 0 1 0 1 0 0 1 0 0 0 0 0 1 0 0 1 1 0 1 0 0 1 1 1 0 0 0 0 0 1 1 0 0 0 1
 0 0 0 0 0 1 1 0 1 1 0 1 0 0 1 0 1 1 1 1 0 0 1 1 1 0 1 1 0 0 0 1 0 0 0 0 0
 0 1 0 0 0 1 0 1 0 1 0 0 1 1 1 0 0 1 1 1 1 1 1 1 0 1 0 0 1 0 1 1 1 0 0 1 1
 1 1 0 1 1 1 0 1 0 0 1 0 1 0 1 1 0 0 1 1 1 1 1 1 0 1 1 1 0 0 1 1 1 1 0 0 0
 1 0 1 0 1 1 0 1 0 1 0 1 1 0 0 0 1 1 1 0 0 0 1 0 1 1 0 0 1 1 1 1 1 0 0 0 0
 0 1 0 0 1 0 1 0 0 1 1 1 0 1 1 0 1 0 1 0 0 0 1 1 1 0 0 1 1 1 0 0 0 0 0 1 0
 0 1 0 1 1 1 0 1 1 1 1 0 1 1 1 1 1 1 0 0 0 0 0 1 0 0 0 0 1 0 1 1 0 1 1 0 1
 1 1 0 1 1 0 0 0 1 0 1 1 0 1 0 1 0 1 1 0 0 0 0 0 1 0 0 0 0 1 1 0 1 0 1 1 1
 1 1 1 0 0 0 0 0 1 1 0 1 

array([ True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,

In [44]:
# Evaluating on the training data.

GT = df_train["label"].tolist()
print(classification_report(GT, train_predictions))

              precision    recall  f1-score   support

           0       1.00      1.00      1.00       404
           1       1.00      1.00      1.00       396

    accuracy                           1.00       800
   macro avg       1.00      1.00      1.00       800
weighted avg       1.00      1.00      1.00       800
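
The precision, recall, and F1 columns of `classification_report` can be reproduced by hand for one class. A minimal sketch on made-up labels follows; none of these values come from the model, and the function name `precision_recall_f1` is illustrative.

```python
def precision_recall_f1(y_true, y_pred, positive):
    """Per-class precision, recall, and F1 from two label lists."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return precision, recall, f1


y_true = [0, 0, 1, 1, 1]
y_pred = [0, 1, 1, 1, 0]
precision, recall, f1 = precision_recall_f1(y_true, y_pred, positive=1)
print(precision, recall, f1)
```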



---

# Testing Data


In [45]:
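# Note: index [1] of the PredictionOutput returned by trainer.predict is `label_ids`
# (the true labels); the model's logits are at index [0] (`predictions`).
test_predictions = trainer.predict(tokenized_test)[1]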

100%|██████████| 25/25 [00:03<00:00,  7.38it/s]


In [46]:
test_list = []

for i in range(tokenized_test.num_rows):
    if tokenized_test[i]["Party"] == "Democrat":
        test_list.append(0)
    elif tokenized_test[i]["Party"] == "Republican":
        test_list.append(1)

print(np.array(test_list))

print()

np.array(test_list) == test_predictions

[1 0 1 1 1 1 1 1 1 1 0 0 1 0 1 0 1 0 1 0 0 1 0 0 1 0 0 1 1 1 0 0 1 1 0 0 1
 0 1 1 1 1 0 0 1 1 0 0 0 0 1 0 1 0 0 0 1 1 1 1 1 0 0 0 1 0 1 0 0 1 1 1 1 1
 0 1 0 0 1 0 1 1 0 1 0 1 1 0 1 0 0 0 1 0 0 0 0 0 0 1 0 0 1 1 0 0 0 0 0 0 0
 0 0 0 0 0 0 1 0 0 1 1 0 1 0 1 1 0 1 1 1 1 0 1 1 0 1 0 1 0 0 0 1 1 1 0 0 0
 1 1 1 0 0 0 1 0 1 0 1 0 0 0 1 0 1 1 0 1 1 1 0 1 0 1 1 1 0 1 0 0 0 0 0 1 1
 0 1 0 0 0 0 1 0 1 0 1 0 0 1 0]



array([ True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,

In [47]:
# Evaluating on the testing data.

GT = df_test["label"].tolist()
print(classification_report(GT, test_predictions))

              precision    recall  f1-score   support

           0       1.00      1.00      1.00       106
           1       1.00      1.00      1.00        94

    accuracy                           1.00       200
   macro avg       1.00      1.00      1.00       200
weighted avg       1.00      1.00      1.00       200



---

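# RESULTS ON ONLY A SUBSET OF THE DATA (1000 SAMPLES/ROWS/TWEETS)

- The 1.00 scores in the two classification reports above should be read with caution: `trainer.predict(...)[1]` returns the true label ids rather than the model's predictions, so those reports compare the labels with themselves. The accuracy logged by the `Trainer` at the end of every epoch was 0.47 on the 200 test tweets, which is roughly chance level for two classes.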
