# Description of Data

The AI-GA (Artificial Intelligence Generated Abstracts) dataset is a collection of paper abstracts, either AI-generated or original.

The AI-generated abstracts were produced with a state-of-the-art language model (GPT-3).

The dataset is provided in CSV format, with each row representing a single sample (i.e.,  a single abstract).

*The ultimate goal of this assignment is to classify the abstracts based on the source (i.e., whether it is AI-generated or original).*

Total sample size: 14,330 (7,248 AI-generated and 7,082 original)

Each sample has a `doc_id` and three content columns: abstract, title, and label. The label indicates whether the sample is an original abstract (labeled as 0) or an AI-generated abstract (labeled as 1).

## Package installs and imports

In [None]:
!pip3 install nltk spacy
!python -m spacy download en_core_web_sm

Collecting en-core-web-sm==3.7.1
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.7.1/en_core_web_sm-3.7.1-py3-none-any.whl (12.8 MB)
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_sm')
⚠ Restart to reload dependencies
If you are in a Jupyter or Colab notebook, you may need to restart Python in
order to load all the package's dependencies. You can do this by selecting the
'Restart kernel' or 'Restart runtime' option.


In [None]:
import pandas as pd
import numpy as np
import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords

nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package wordnet to /root/nltk_data...


True

Load dataset **"ai-ga-dataset.csv"** as a csv file and save it as a dataframe named **"abstracts_df"**

In [None]:
abstracts_df = pd.read_csv("https://raw.githubusercontent.com/elhamod/BA820/main/Assignment/Assignment2/ai-ga-dataset.csv")
abstracts_df.head()

Unnamed: 0,doc_id,title,abstract,label
0,1,Exaggerated Autophagy in Stanford Type A Aorti...,\n\nThis study presents a novel transcriptome ...,1
1,2,ABO blood types and sepsis mortality,\n\nThe ABO blood types have been associated w...,1
2,3,AAV8-Mediated Angiotensin-Converting Enzyme 2 ...,\n\nTitle: AAV8-Mediated Angiotensin-Convertin...,1
3,4,MyCare study: protocol for a controlled trial ...,INTRODUCTION: People with serious mental illne...,0
4,5,Exploring collective emotion transmission in f...,Collective emotion is the synchronous converge...,0


## Inspection:

**Maximum marks: 5**

- Print the number of abstracts that are human or AI generated, respectively.
- Check if any abstracts have invalid values. Address them appropriately.
- Check if any labels have invalid values. Address them appropriately.

**Number of abstracts that are human or AI generated**

In [None]:
# Note: '0' indicates human abstracts | '1' indicates AI abstracts
# rename_axis/reset_index(name=...) yields 'label' and 'count' columns regardless of pandas version
abstracts_df['label'].value_counts().rename_axis('label').reset_index(name='count')

Unnamed: 0,label,count
0,1,7248
1,0,7082


**Answer**

The dataset contains 7248 AI-generated and 7082 human-generated abstracts, indicating a nearly balanced distribution between the two classes.

**Checking for invalid values in the `abstract` column**

In [None]:
# Checking for missing values
missing_abs = abstracts_df[abstracts_df['abstract'].isnull()]
print("Missing Values in the abstract column:", len(missing_abs))

# Checking for empty strings
empty_abs = abstracts_df[abstracts_df['abstract'] == '']
print("Empty Strings in the abstract column: ", len(empty_abs))

# Checking for unusually short abstracts (fewer than 50 characters)
unusual_length_abs = abstracts_df[abstracts_df['abstract'].str.len() < 50]
print("Abstracts with unusually short length:", len(unusual_length_abs))

Missing Values in the abstract column: 0
Empty Strings in the abstract column:  0
Abstracts with unusually short length: 8


**Answer**

The abstracts do not contain any NULLs or empty strings, so no cleaning is needed for missing values. A few abstracts (8) are unusually short, i.e., under 50 characters, but they can be retained since a short length alone does not indicate invalid data.

**Checking for invalid values in the `label` column**

In [None]:
# Checking for missing values
missing_labels = abstracts_df[abstracts_df['label'].isnull()]
print("Missing Values in the label column:", len(missing_labels))

Missing Values in the label column: 0


**Answer**

There are no NULLs in the labels, and the value counts above show only the values 0 and 1, which are the expected valid labels for this dataset.
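The label check can also be made explicit rather than read off the value counts. The following is a minimal plain-Python sketch, assuming the labels are available as a simple list; `find_invalid_labels` and `sample_labels` are hypothetical names used only for illustration and are not part of the notebook.

```python
import math

def find_invalid_labels(labels, valid=(0, 1)):
    """Return (position, value) pairs for labels that are missing or not in `valid`."""
    invalid = []
    for position, value in enumerate(labels):
        # Treat both None and NaN as missing
        is_missing = value is None or (isinstance(value, float) and math.isnan(value))
        if is_missing or value not in valid:
            invalid.append((position, value))
    return invalid

# Hypothetical label column with two bad entries
sample_labels = [0, 1, 1, 2, None, 0]
print(find_invalid_labels(sample_labels))  # [(3, 2), (4, None)]
```

On the actual DataFrame, the same idea can be expressed with `~abstracts_df['label'].isin([0, 1])`, which flags every row whose label is neither 0 nor 1.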

# Pre-processing

## Question 1.1: Text cleaning

**Maximum marks: 5**

Perform pre-processing on all abstracts by lower-casing and removing all non-alpha-numeric characters (i.e., only keep numbers, English alphabet letters, and white spaces).

In [None]:
# Defining a 'clean_text' function to perform the above cleaning steps
import re

def clean_text(text):

    # Converting text to string
    text = str(text)

    # Lower-casing the text
    text = text.lower()

    # Removing '\n' from the text
    text = re.sub('[\\n]', '', text)

    # Removing all non-alpha-numeric characters  from the text (keeping only numbers, English alphabet letters, and white spaces)
    text = re.sub(r'[^a-zA-Z0-9\s]', '', text)

    return text

# Calling the above function to clean the 'abstract' column
abstracts_df['abstract'] = abstracts_df['abstract'].apply(clean_text)
abstracts_df.head(3)

Unnamed: 0,doc_id,title,abstract,label
0,1,Exaggerated Autophagy in Stanford Type A Aorti...,this study presents a novel transcriptome pilo...,1
1,2,ABO blood types and sepsis mortality,the abo blood types have been associated with ...,1
2,3,AAV8-Mediated Angiotensin-Converting Enzyme 2 ...,title aav8mediated angiotensinconverting enzym...,1
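One detail of the cleaning step is worth illustrating: deleting `\n` outright can fuse the last word of one line with the first word of the next. The sketch below compares that behaviour with a variant that treats newlines as ordinary whitespace. Both function names are hypothetical, and the variant is only an alternative to consider, not the cleaning used for the results in this notebook.

```python
import re

def clean_text_joined(text):
    # Mirrors the cleaning above: newlines are deleted outright
    text = str(text).lower()
    text = re.sub(r'\n', '', text)
    return re.sub(r'[^a-z0-9\s]', '', text)

def clean_text_spaced(text):
    # Variant: keep newlines as whitespace, then collapse whitespace runs
    text = str(text).lower()
    text = re.sub(r'[^a-z0-9\s]', '', text)
    return re.sub(r'\s+', ' ', text).strip()

sample = "Aortic\ndissection: a Type-A study"
print(clean_text_joined(sample))  # aorticdissection a typea study
print(clean_text_spaced(sample))  # aortic dissection a typea study
```

Switching to the spaced variant would change the vocabulary, so every downstream number in this notebook would need to be recomputed.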


## Question 1.2: Stemming or Lemmatization

**Maximum Marks: 7.5**

We enhance the effectiveness of our text analysis algorithms by normalizing words and reducing them to their root/base forms.

Write a function `process_text` that



1.   removes `english` stop words.
2.   uses `PorterStemmer` and `WordNetLemmatizer` to stem AND lemmatize the tokenized abstracts.

The function would take in a document and return its tokenization as a list of tokens.

To verify its functionality, call the function with the first abstract as input, and then print the transformed abstract as a full text (i.e., as a string, not as a list of tokens).

In [None]:
from nltk.stem import PorterStemmer, WordNetLemmatizer

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()
stop_words = set(stopwords.words('english'))

In [None]:
# Function to preprocess text data

def process_text(text):

    # Tokenizing the text
    tokens = word_tokenize(text)

    # Removing stop words from the text
    tokens = [token for token in tokens if token not in stop_words]

    # Stemming the text
    tokens = [stemmer.stem(token) for token in tokens]

    # Lemmatizing the text
    tokens = [lemmatizer.lemmatize(token) for token in tokens]

    # Rejoining tokens into text
    text = " ".join(tokens)

    return text

In [None]:
# Testing the 'process_text' function
import textwrap  # **Used AI

tokens_abs_1 = process_text(abstracts_df['abstract'].iloc[0])

print("\033[1mTransformed abstract 1\033[0m \n")  # **Used stackoverflow
print(textwrap.fill(tokens_abs_1, width=150))

Transformed abstract 1

studi present novel transcriptom pilot analysi human ascend aortic tissu explor mechan behind exagger autophagi stanford type aortic dissect recent
establish excess autophagi associ increas risk progress complic destruct form thorac aortic injuri howev underli molecular pathway remain mostli
unknown investig mechan conduct rna sequenc experi ribosomaldeplet sampl ten ascend aorta dissect surgic resect seven patient stanford type aortic
dissect staad result provid insight possibl molecular marker might contribut acceler autophag activ use research exagger pathway regul stabil staad
patholog


# Vectorization

Next, we will try different vector representations and see how well each performs.

## Question 2.1: Bag of Words

**Maximum Marks: 5**

Perform Bag of Words on the abstracts and store the vector representation as a DataFrame.

You are expected to apply the `process_text` tokenization.

Print the head of the resulting DataFrame.

How many tokens does BoW yield?

In [None]:
abstracts_df.head()

Unnamed: 0,doc_id,title,abstract,label
0,1,Exaggerated Autophagy in Stanford Type A Aorti...,this study presents a novel transcriptome pilo...,1
1,2,ABO blood types and sepsis mortality,the abo blood types have been associated with ...,1
2,3,AAV8-Mediated Angiotensin-Converting Enzyme 2 ...,title aav8mediated angiotensinconverting enzym...,1
3,4,MyCare study: protocol for a controlled trial ...,introduction people with serious mental illnes...,0
4,5,Exploring collective emotion transmission in f...,collective emotion is the synchronous converge...,0


In [None]:
from sklearn.feature_extraction.text import CountVectorizer

# Tokenizing the Abstracts
abstracts_df['abstract_tokenized'] = abstracts_df['abstract'].apply(process_text)

# Vectorizing the Abstracts using Bag of Words
cv = CountVectorizer()
abstracts_bow = cv.fit_transform(abstracts_df['abstract_tokenized'])
abstracts_bow_df = pd.DataFrame(abstracts_bow.toarray(), columns=cv.get_feature_names_out())
abstracts_bow_df.head()

Unnamed: 0,00,000,0000,000001,000007,00001,00002,00003,000032,00004,...,zymogen,zymographi,zymosan,zymosaninduc,zythia,zyz,zyz803,zzn,zzz,zzzn
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [None]:
# No. of tokens
print("No. of tokens yielded by BoW:", len(cv.vocabulary_))

No. of tokens yielded by BoW: 67262


**Answer**

Bag of Words yields a vocabulary of 67,262 unique tokens, i.e., one column per token in the resulting DataFrame.
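To make the representation concrete, here is a minimal standard-library sketch of what a bag-of-words matrix contains: one column per vocabulary token and one row of counts per document. The helper `bag_of_words` and the toy documents are assumptions for illustration, not the `CountVectorizer` implementation.

```python
from collections import Counter

def bag_of_words(tokenized_docs):
    """Return (vocabulary, count_matrix) for a list of token lists."""
    # The vocabulary is the sorted set of all tokens seen in any document
    vocabulary = sorted({token for doc in tokenized_docs for token in doc})
    matrix = []
    for doc in tokenized_docs:
        counts = Counter(doc)
        # One count per vocabulary token; absent tokens count as 0
        matrix.append([counts[token] for token in vocabulary])
    return vocabulary, matrix

# Toy documents, already stemmed and split into tokens
docs = [
    "aortic dissect studi".split(),
    "blood type studi studi".split(),
]
vocab, counts = bag_of_words(docs)
print(vocab)   # ['aortic', 'blood', 'dissect', 'studi', 'type']
print(counts)  # [[1, 0, 1, 1, 0], [0, 1, 0, 2, 1]]
```

Note that `CountVectorizer` applies its own default token pattern, which keeps only tokens of two or more alphanumeric characters, so single-character tokens produced by `process_text` do not become columns.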

## Question 2.2: TF-IDF

**Maximum Marks: 5**

Using TF-IDF with `process_text` tokenization, vectorize the abstracts. Then, find the top 5 most similar abstracts to the document with doc_id=6 (shown below) in terms of content.

In [None]:
query_index = 6
print("document id.", query_index, ": ", abstracts_df["abstract"].iloc[query_index])

document id. 6 :  background advantages of multiple arterial conduits for coronary artery bypass grafting cabg have been reported previously we aimed to evaluate the midterm outcomes of multiple arterial cabg mabg among patients with mild to moderate left ventricular systolic dysfunction lvsd methods this multicenter study using propensity score matching took place from january 2013 to june 2019 in jiangsu province and shanghai china with a mean and maximum followup of 33 and 68 years respectively we included patients with mild to moderate lvsd undergoing primary isolated multivessel cabg with left internal thoracic artery the inhospital and midterm outcomes of mabg versus conventional left internal thoracic artery supplemented by saphenous vein grafts single arterial cabg were compared the primary end points were death from all causes and death from cardiovascular causes the secondary end points were stroke myocardial infarction repeat revascularization and a composite of all mentione

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer

# Tokenizing the Abstracts
abstracts_df['abstract_tokenized'] = abstracts_df['abstract'].apply(process_text)

# Vectorizing the Abstracts using TFIDF
tfidf = TfidfVectorizer(norm=None)
tfidf.fit(abstracts_df['abstract_tokenized'])
abstracts_tfidf = tfidf.transform(abstracts_df['abstract_tokenized'])
abstracts_tfidf_df = pd.DataFrame(abstracts_tfidf.toarray(), columns=tfidf.get_feature_names_out())
abstracts_tfidf_df.head()

Unnamed: 0,00,000,0000,000001,000007,00001,00002,00003,000032,00004,...,zymogen,zymographi,zymosan,zymosaninduc,zythia,zyz,zyz803,zzn,zzz,zzzn
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
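The weights in this matrix can be reproduced by hand. The sketch below assumes scikit-learn's default smoothed IDF, `idf(t) = ln((1 + n) / (1 + df(t))) + 1`, with raw term counts as TF and no row normalization (matching `norm=None`). The function name and toy documents are hypothetical.

```python
import math
from collections import Counter

def tfidf_unnormalized(tokenized_docs):
    """Return (idf, weights) using smoothed IDF and raw term counts."""
    n_docs = len(tokenized_docs)
    # Document frequency: number of documents containing each token
    df = Counter(token for doc in tokenized_docs for token in set(doc))
    idf = {t: math.log((1 + n_docs) / (1 + d)) + 1 for t, d in df.items()}
    weights = []
    for doc in tokenized_docs:
        tf = Counter(doc)
        weights.append({t: c * idf[t] for t, c in tf.items()})
    return idf, weights

docs = [["studi", "aortic"], ["studi", "blood", "blood"]]
idf, weights = tfidf_unnormalized(docs)
print(idf["studi"])         # 1.0, since the token appears in every document
print(weights[1]["blood"])  # 2 * (ln(3/2) + 1)
```

A token that occurs in every document keeps an IDF of exactly 1, while rarer tokens are weighted up, which is why distinctive terms dominate the similarity scores.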


Using Cosine Similarity to find similar abstracts

In [None]:
from sklearn.metrics.pairwise import cosine_similarity

# Cosine Similarity for Document ID 6 after TFIDF
doc_6_similarity_tf = cosine_similarity(abstracts_tfidf[query_index], abstracts_tfidf)
doc_6_similarity_tf_df = pd.DataFrame(doc_6_similarity_tf).T
doc_6_similarity_tf_df = doc_6_similarity_tf_df.sort_values(by=0, ascending=False).head(6)
doc_6_similarity_tf_df

Unnamed: 0,0
6,1.0
8270,0.279836
6632,0.260873
3076,0.213453
4288,0.213075
3244,0.161436
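For reference, the ranking above boils down to computing the cosine of the angle between the query vector and every other vector, then sorting. This is a small pure-Python sketch on toy vectors; `cosine` and `top_k_similar` are hypothetical helpers, and unlike `head(6)` above, the sketch excludes the query document itself.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

def top_k_similar(query_index, vectors, k=5):
    """Return the k (index, score) pairs most similar to the query, excluding itself."""
    scores = [
        (i, cosine(vectors[query_index], v))
        for i, v in enumerate(vectors)
        if i != query_index
    ]
    scores.sort(key=lambda pair: pair[1], reverse=True)
    return scores[:k]

vectors = [
    [1.0, 0.0, 2.0],  # query
    [2.0, 0.0, 4.0],  # same direction as the query
    [0.0, 3.0, 0.0],  # orthogonal to the query
    [1.0, 1.0, 0.0],  # partially overlapping
]
print(top_k_similar(0, vectors, k=2))
```

Because cosine similarity ignores vector length, the second vector ranks first even though its counts are twice as large as the query's.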


In [None]:
for index in doc_6_similarity_tf_df.index:
    print("Document Index:", index, "|", "Document ID:", abstracts_df.iloc[index]['doc_id'])
    print("Abstract:", abstracts_df.iloc[index]['abstract'])
    print()

Document Index: 6 | Document ID: 7
Abstract: background advantages of multiple arterial conduits for coronary artery bypass grafting cabg have been reported previously we aimed to evaluate the midterm outcomes of multiple arterial cabg mabg among patients with mild to moderate left ventricular systolic dysfunction lvsd methods this multicenter study using propensity score matching took place from january 2013 to june 2019 in jiangsu province and shanghai china with a mean and maximum followup of 33 and 68 years respectively we included patients with mild to moderate lvsd undergoing primary isolated multivessel cabg with left internal thoracic artery the inhospital and midterm outcomes of mabg versus conventional left internal thoracic artery supplemented by saphenous vein grafts single arterial cabg were compared the primary end points were death from all causes and death from cardiovascular causes the secondary end points were stroke myocardial infarction repeat revascularization and 

## Question 2.3: Word2Vec

**Maximum Marks: 7.5**

Now repeat Q 2.2 but using Word2Vec. For each token, the model should consider the two adjacent tokens on its left and the two on its right. Pass `workers=4` to speed up computation. Include **all** words that occur in the abstracts.

Use vector averaging to calculate the vector representation of the sentence based on the vectors of its constituent words.

How do the results of Word2Vec and TF-IDF compare?

In [None]:
from gensim.models import Word2Vec

# Function to get embeddings
def get_word_embedding(word, model):
    if word in model.key_to_index:
        return model[word]
    else:
        # Return a zero vector for Out-of-vocabulary
        return np.zeros(model.vector_size)

# Construct and train the Word2Vec model
abstracts_df_list = abstracts_df['abstract_tokenized'].apply(lambda x: x.split())
word2vec = Word2Vec(sentences=abstracts_df_list, vector_size=300, window=2, min_count=1, workers=4)
word2vec = word2vec.wv

# Construct the embeddings (i.e., vectorization) using Word2Vec
embeddings = []
# Iterate through the tokenized abstracts
for tokenized_abstract in abstracts_df_list:
    word_embeddings = [get_word_embedding(word, word2vec) for word in tokenized_abstract]
    # Average the word embeddings to get one vector per abstract;
    # an empty abstract falls back to a zero vector
    if word_embeddings:
        abstract_embedding = np.mean(word_embeddings, axis=0)
    else:
        abstract_embedding = np.zeros(word2vec.vector_size)
    # append() avoids rebuilding the list on every iteration
    embeddings.append(abstract_embedding)

embeddings = np.array(embeddings)
embeddings.shape

(14330, 300)
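The `window=2` setting used above determines which neighbours count as context for each token. The sketch below only lists the eligible (target, context) pairs for a short token sequence; it is an illustration of the window idea, not gensim's training loop, and `context_pairs` is a hypothetical helper.

```python
def context_pairs(tokens, window=2):
    """List (target, context) pairs for tokens within `window` positions of each target."""
    pairs = []
    for i, target in enumerate(tokens):
        start = max(0, i - window)
        stop = min(len(tokens), i + window + 1)
        for j in range(start, stop):
            if j != i:  # a token is not its own context
                pairs.append((target, tokens[j]))
    return pairs

tokens = ["studi", "present", "novel", "transcriptom", "pilot"]
pairs = context_pairs(tokens, window=2)

# Context tokens seen by the middle word
print([context for target, context in pairs if target == "novel"])
# ['studi', 'present', 'transcriptom', 'pilot']
```

Tokens near the start or end of an abstract simply have fewer neighbours, so "studi" here only sees "present" and "novel".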

In [None]:
# Cosine Similarity for Document ID 6 after Word2Vec
doc_6_wv = embeddings[query_index,:].reshape(1,-1)
doc_6_similarity_wv = cosine_similarity(doc_6_wv, embeddings)
doc_6_similarity_wv_df = pd.DataFrame(doc_6_similarity_wv).T
doc_6_similarity_wv_df = doc_6_similarity_wv_df.sort_values(by=0, ascending=False).head(6)
doc_6_similarity_wv_df

Unnamed: 0,0
6,1.0
4727,0.97753
11355,0.977297
12864,0.975788
4452,0.975229
8211,0.972778


In [None]:
for index in doc_6_similarity_wv_df.index:
    print("Document Index:", index, "|", "Document ID:", abstracts_df.iloc[index]['doc_id'])
    print("Abstract:", abstracts_df.iloc[index]['abstract'])
    print()

Document Index: 6 | Document ID: 7
Abstract: background advantages of multiple arterial conduits for coronary artery bypass grafting cabg have been reported previously we aimed to evaluate the midterm outcomes of multiple arterial cabg mabg among patients with mild to moderate left ventricular systolic dysfunction lvsd methods this multicenter study using propensity score matching took place from january 2013 to june 2019 in jiangsu province and shanghai china with a mean and maximum followup of 33 and 68 years respectively we included patients with mild to moderate lvsd undergoing primary isolated multivessel cabg with left internal thoracic artery the inhospital and midterm outcomes of mabg versus conventional left internal thoracic artery supplemented by saphenous vein grafts single arterial cabg were compared the primary end points were death from all causes and death from cardiovascular causes the secondary end points were stroke myocardial infarction repeat revascularization and 

**Answer**

Comparison between TF-IDF and Word2Vec Results:

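The two vectorization methods return completely different top-5 lists for the query abstract: TF-IDF ranks row indices 8270, 6632, 3076, 4288, and 3244 highest, while Word2Vec ranks 4727, 11355, 12864, 4452, and 8211 highest. The similarity scales also differ: the TF-IDF scores are low (about 0.16 to 0.28), whereas the averaged Word2Vec scores are all above 0.97. This is because the two methods work in different ways.

- TF-IDF weights each word by how often it appears in a document relative to how many documents contain it, so two abstracts score as similar only when they share the same distinctive words.
- Word2Vec learns a dense vector for each word from the words around it (here, two tokens on each side), so words used in similar contexts get similar vectors. Averaging these vectors over a whole abstract smooths out individual words, which is consistent with the uniformly high similarity scores above.

This comparison shows that TF-IDF and Word2Vec offer different but complementary views of text similarity. Depending on the need, TF-IDF may be better for matching on specific words, while Word2Vec may be better for capturing how words are used in context.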

# Classification

## Question 3.1: GloVe

**Maximum Marks: 7.5**

Instead of training our own Word2Vec model, we decided to use a [GloVe](https://nlp.stanford.edu/projects/glove/) model that was pre-trained by researchers at Stanford University. They used a much larger amount of text in their training (e.g., Wikipedia).

For this question, simply use `get_tokens(doc)` below for tokenization.

**Note:** *Vectorizing the entire dataset using GloVe may take 5-10 minutes. Use the guidelines we discussed in class to test and develop your code before fully applying it to the entire dataset.*

In [None]:
from gensim import downloader

# load the GloVe model
glove_model = downloader.load("glove-wiki-gigaword-50")

In [None]:
import spacy
nlp = spacy.load("en_core_web_sm")

def get_tokens(doc):
    doc_tokenized = nlp(doc)
    tokens = [token.text for token in doc_tokenized]
    return tokens

In [None]:
# Tokenization
glove_tokens = abstracts_df['abstract'].apply(get_tokens)

In [None]:
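def get_vectors(doc):
    # Average the GloVe vectors of a document's tokens;
    # out-of-vocabulary tokens contribute a zero vector
    word_vectors = []
    for token in doc:
        if token in glove_model.key_to_index:
            word_vectors.append(glove_model.get_vector(token))
        else:
            word_vectors.append(np.zeros(glove_model.vector_size))
    # Guard against empty documents, for which np.mean would return NaN
    if not word_vectors:
        return np.zeros(glove_model.vector_size)
    return np.mean(word_vectors, axis=0)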

# Vectorization
glove_vectors = glove_tokens.apply(get_vectors)
glove_vectors.shape

(14330,)
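How out-of-vocabulary tokens are handled affects the averaged vector. The sketch below contrasts the approach used above (a zero vector per unknown token) with a variant that skips unknown tokens. The toy 2-dimensional embeddings and the `average_vectors` helper are assumptions for illustration only; they are not GloVe values.

```python
def average_vectors(tokens, embeddings, dim, skip_oov=False):
    """Average token vectors; unknown tokens are zero-filled or skipped."""
    vectors = []
    for token in tokens:
        if token in embeddings:
            vectors.append(embeddings[token])
        elif not skip_oov:
            vectors.append([0.0] * dim)
    if not vectors:
        # No usable tokens: fall back to a zero vector
        return [0.0] * dim
    # Column-wise mean over all collected vectors
    return [sum(column) / len(vectors) for column in zip(*vectors)]

# Hypothetical 2-dimensional embeddings
toy_embeddings = {"blood": [1.0, 3.0], "type": [3.0, 1.0]}
tokens = ["blood", "type", "aav8mediated", "staad"]

print(average_vectors(tokens, toy_embeddings, dim=2))                 # [1.0, 1.0]
print(average_vectors(tokens, toy_embeddings, dim=2, skip_oov=True))  # [2.0, 2.0]
```

Zero-filling shrinks the average toward the origin in proportion to the share of unknown tokens, so abstracts with many fused or domain-specific tokens end up with smaller vectors. Whether that helps or hurts the classifier is an empirical question.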

In [None]:
glove_vectors_df = pd.DataFrame(glove_vectors.to_list())
glove_vectors_df.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,40,41,42,43,44,45,46,47,48,49
0,0.603671,0.120718,-0.043174,-0.108008,0.128356,0.524549,0.010537,-0.179774,0.178283,0.065511,...,-0.178671,-0.157161,-0.046142,0.353873,0.105455,-0.060076,0.234696,0.173895,-0.059936,-0.053178
1,0.512616,0.114286,0.076718,-0.215744,0.112969,0.364053,-0.132071,-0.22429,0.296624,0.058713,...,-0.285538,-0.062622,0.17888,0.283761,0.077815,0.027271,-0.076039,0.118408,0.011821,-0.005593
2,0.489056,0.021651,-0.059018,0.050998,0.000489,0.519882,0.213924,-0.210018,0.22477,0.282279,...,-0.07407,-0.059354,-0.064858,0.185008,0.180855,0.012741,0.143166,0.153217,0.011706,0.067228
3,0.382849,0.154184,-0.123321,-0.08825,0.127219,0.290153,-0.216508,-0.229867,0.144744,-0.055943,...,-0.20463,-0.082973,0.155084,0.286676,-0.022155,0.06652,-0.051594,0.215751,0.003165,0.028528
4,0.310228,0.201045,-0.010916,-0.072452,0.229855,0.294811,-0.097882,-0.197745,-0.045902,0.086589,...,-0.070973,-0.042319,-0.053452,0.228537,-0.095237,0.022551,0.016278,0.137007,-0.092861,-0.140495


## Question 3.2: Random Forest Classifier

**Maximum Marks: 7.5**

Using a [`RandomForestClassifier`](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html), compare the classification results using GloVe to those using TF-IDF. Does GloVe do better or worse? Are there any particular issues you faced? Elaborate on your findings and justify them.

Use a test set of 20% the total dataset size. Use `random_state = 42`.

Print the `classification_report` of your model.

In [None]:
abstracts_df.head()

Unnamed: 0,doc_id,title,abstract,label,abstract_tokenized
0,1,Exaggerated Autophagy in Stanford Type A Aorti...,this study presents a novel transcriptome pilo...,1,studi present novel transcriptom pilot analysi...
1,2,ABO blood types and sepsis mortality,the abo blood types have been associated with ...,1,abo blood type associ varieti health outcom re...
2,3,AAV8-Mediated Angiotensin-Converting Enzyme 2 ...,title aav8mediated angiotensinconverting enzym...,1,titl aav8medi angiotensinconvert enzym 2 gene ...
3,4,MyCare study: protocol for a controlled trial ...,introduction people with serious mental illnes...,0,introduct peopl seriou mental ill smi often fa...
4,5,Exploring collective emotion transmission in f...,collective emotion is the synchronous converge...,0,collect emot synchron converg effect respons a...


In [None]:
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report

In [None]:
abstracts_tfidf_df.shape

(14330, 67262)

In [None]:
glove_vectors_df.shape

(14330, 50)

In [None]:
# Using TFIDF Vectors
X_train_tf, X_test_tf, y_train_tf, y_test_tf = train_test_split(abstracts_tfidf, abstracts_df['label'], test_size=0.2, random_state=42)

# Initializing the model
rf_model = RandomForestClassifier(random_state=42)

# Fit the model
rf_model.fit(X_train_tf, y_train_tf)

# Predict using the model
y_pred = rf_model.predict(X_test_tf)

# Evaluate the model
print(classification_report(y_test_tf, y_pred))

              precision    recall  f1-score   support

           0       0.97      0.92      0.94      1426
           1       0.92      0.97      0.94      1440

    accuracy                           0.94      2866
   macro avg       0.94      0.94      0.94      2866
weighted avg       0.94      0.94      0.94      2866



In [None]:
abstracts_glove = glove_vectors.to_list()

# Using Glove Vectors
X_train_g, X_test_g, y_train_g, y_test_g = train_test_split(abstracts_glove, abstracts_df['label'], test_size=0.2, random_state=42)

# Initializing the model
rf_model = RandomForestClassifier(random_state=42)

# Fit the model
rf_model.fit(X_train_g, y_train_g)

# Predict using the model
y_pred = rf_model.predict(X_test_g)

# Evaluate the model
print(classification_report(y_test_g, y_pred))

              precision    recall  f1-score   support

           0       0.88      0.86      0.87      1426
           1       0.86      0.88      0.87      1440

    accuracy                           0.87      2866
   macro avg       0.87      0.87      0.87      2866
weighted avg       0.87      0.87      0.87      2866



**Answer**

Based on the classification results using RandomForestClassifier, it appears that TF-IDF outperforms GloVe in terms of accuracy, precision, recall, and F1-score.

For TF-IDF:
- The accuracy is higher at 94%, indicating that TF-IDF performs better in predicting the correct class labels.
- The precision and recall for both classes (0 and 1) are consistently high, around 0.92-0.97, showing that TF-IDF effectively identifies true positives and avoids false positives and false negatives.
- The F1-score, which balances precision and recall, is also high at 0.94, suggesting a good overall performance.

For GloVe:
- The accuracy is lower at 87%, indicating that GloVe performs worse in predicting the correct class labels compared to TF-IDF.
- Precision and recall for both classes are decent, around 0.86-0.88, but 0.06 to 0.09 lower than the corresponding TF-IDF values.
- Consequently, the F1-score is also lower at 0.87, indicating weaker overall performance than TF-IDF.
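For readers who want to see how the metrics quoted above are derived, here is a minimal sketch that computes precision, recall, and F1 for one class from true and predicted labels. The toy labels and the `binary_report` helper are hypothetical and unrelated to the model outputs in this notebook.

```python
def binary_report(y_true, y_pred, positive=1):
    """Return (precision, recall, f1) for the `positive` class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # F1 is the harmonic mean of precision and recall
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return precision, recall, f1

# Toy example: all 4 positives are found, but 2 negatives are also flagged
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 1, 1, 1, 1, 0, 0]
precision, recall, f1 = binary_report(y_true, y_pred)
print(round(precision, 3), round(recall, 3), round(f1, 3))  # 0.667 1.0 0.8
```

The example shows why F1 is reported alongside accuracy: perfect recall is offset by the two false positives, and the harmonic mean reflects that trade-off.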

TF-IDF's superior performance could be attributed to its ability to capture the importance of individual words in each document, which is crucial for text classification. In contrast, the pre-trained GloVe embeddings may miss domain-specific vocabulary and nuances in this dataset: out-of-vocabulary tokens are mapped to zero vectors, and averaging compresses each abstract into only 50 dimensions, whereas the TF-IDF vocabulary is built from the dataset itself. Two caveats apply to the TF-IDF result. First, the feature matrix has shape (14330, 67262), so there are far more features than samples, which raises the risk of overfitting; the reported scores are, however, measured on the held-out test set. Second, the vectorizer was fit on the full dataset before the train/test split, so its vocabulary and IDF weights were computed using the test abstracts as well, which may make the TF-IDF scores slightly optimistic.
