**Abhina Premachandran Bindu**

# Comparing the performance of gensim vs nltk libraries
<p>This analysis compares the NLTK and Gensim NLP libraries by the accuracy of the same classifier trained on text processed with each library. Processing the text with NLTK and vectorizing it with TF-IDF gave better accuracy than averaging word vectors from Gensim's Word2Vec model. With a gradient boosting classifier, the NLTK + TF-IDF pipeline reaches an accuracy that rounds to 1.00 (per-class F1 scores of 1.00 and 0.99), while the Gensim Word2Vec features reach only 0.96.</p>

## Loading and initial cleaning

In [30]:
# importing the necessary libraries
import gensim
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# importing the libraries for nltk
import nltk
import re
from nltk import word_tokenize
from nltk.corpus import stopwords
import string
from nltk.stem.wordnet import WordNetLemmatizer
from nltk.stem.porter import PorterStemmer

# BoW
from sklearn.feature_extraction.text import TfidfVectorizer

In [2]:
# df1 --> Fake , df2 --> Real
df1 = pd.read_csv("/Users/abhinapremachandran/Desktop/Spring '24 CCNY/Machine Learning/group_project_ML/Fake.csv")
df2 = pd.read_csv("/Users/abhinapremachandran/Desktop/Spring '24 CCNY/Machine Learning/group_project_ML/True.csv")
# adding the labels Real --> 1 and Fake --> 0
df1['target'] = 0
df2['target'] = 1
# combining the dataframes
combined_df = pd.concat([df1, df2], ignore_index=True)
# shuffling the indices
shuffled_indices = np.random.permutation(combined_df.index)

# Using .loc[] to rearrange the DataFrame rows according to the shuffled indices
data = combined_df.loc[shuffled_indices]


In [3]:
data.head()

Unnamed: 0,title,text,subject,date,target
12190,OBAMA TELLS TROOPS To Rise Up Against Trump…Pr...,43 days and counting Characterizing the milita...,politics,"Dec 8, 2016",0
3547,Obama Makes UNPRECEDENTED Move To Protect The...,President Barack Obama knows that President-el...,News,"December 6, 2016",0
12511,WOW! BLACK PROF Delivers Truth Bomb About Blac...,,politics,"Nov 3, 2016",0
12043,FEEL GOOD STORY OF THE DAY: Globalist Billiona...,Says the guy who s been funding riots across ...,politics,"Dec 29, 2016",0
813,Wife Of The Japanese PM Epically Trolled Trum...,Amateur president Donald Trump left his chair ...,News,"July 20, 2017",0


In [4]:
data.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 44898 entries, 12190 to 2885
Data columns (total 5 columns):
 #   Column   Non-Null Count  Dtype 
---  ------   --------------  ----- 
 0   title    44898 non-null  object
 1   text     44898 non-null  object
 2   subject  44898 non-null  object
 3   date     44898 non-null  object
 4   target   44898 non-null  int64 
dtypes: int64(1), object(4)
memory usage: 2.1+ MB


In [5]:
# dropping rows with NaN values (data.info() above reports none, so no rows are removed here)
data.dropna(inplace=True)

In [6]:
# checking the value counts of 'target' to check for data imbalance
data.target.value_counts()

0    23481
1    21417
Name: target, dtype: int64

Since the Fake (23,481) and Real (21,417) classes are nearly the same size, there is no significant class imbalance.
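
As a quick check of that balance, the class shares can be computed directly from the counts printed above. This is a minimal sketch that uses only those two counts; the variable names are illustrative and not part of the notebook.

```python
# Class counts copied from the value_counts() output above
counts = {0: 23481, 1: 21417}  # 0 = Fake, 1 = Real

total = sum(counts.values())
shares = {label: n / total for label, n in counts.items()}

for label, share in shares.items():
    print(f"class {label}: {share:.1%}")
```

Both classes sit within a few percentage points of an even split, so plain accuracy is a reasonable metric here.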

In [7]:
data.subject.value_counts()

politicsNews       11272
worldnews          10145
News                9050
politics            6841
left-news           4459
Government News     1570
US_News              783
Middle-east          778
Name: subject, dtype: int64

## Data Preprocessing

## Using nltk for cleaning and preparing for classification

In [8]:
# Tokenize, remove stop words, lemmatize, and stem
stop_words = set(stopwords.words('english'))
# Create the lemmatizer and stemmer once instead of on every call
lemmatizer = WordNetLemmatizer()
porter = PorterStemmer()

def clean_text(text):
    # Tokenize
    tokens = word_tokenize(text)
    # Drop stop words, then strip non-alphabetical characters and lowercase
    cleaned_tokens = [re.sub(r'[^a-zA-Z ]', '', token).lower() for token in tokens if token.lower() not in stop_words]
    # Drop tokens left empty after stripping (punctuation, numbers)
    cleaned_tokens = [token for token in cleaned_tokens if token]
    # Lemmatize the tokens
    lemmatized_tokens = [lemmatizer.lemmatize(token) for token in cleaned_tokens]
    # Stem the tokens and join them back into a string
    cleaned_text = " ".join(porter.stem(token) for token in lemmatized_tokens)
    return cleaned_text

# Apply the function across the DataFrame
data_nltk = data.copy()
data_nltk['cleaned_text'] = data_nltk['text'].apply(clean_text)

In [9]:
data_nltk.head()

Unnamed: 0,title,text,subject,date,target,cleaned_text
12190,OBAMA TELLS TROOPS To Rise Up Against Trump…Pr...,43 days and counting Characterizing the milita...,politics,"Dec 8, 2016",0,day count character militari mission fight vio...
3547,Obama Makes UNPRECEDENTED Move To Protect The...,President Barack Obama knows that President-el...,News,"December 6, 2016",0,presid barack obama know presidentelect donald...
12511,WOW! BLACK PROF Delivers Truth Bomb About Blac...,,politics,"Nov 3, 2016",0,
12043,FEEL GOOD STORY OF THE DAY: Globalist Billiona...,Says the guy who s been funding riots across ...,politics,"Dec 29, 2016",0,say guy fund riot across america dump hundr mi...
813,Wife Of The Japanese PM Epically Trolled Trum...,Amateur president Donald Trump left his chair ...,News,"July 20, 2017",0,amateur presid donald trump left chair postg s...


## Using gensim to clean and tokenize the text

In [10]:
# Lowercase and tokenize each article with gensim's simple_preprocess
data_gensim = data.copy()
data_gensim['cleaned_text'] = data_gensim['text'].apply(gensim.utils.simple_preprocess)


In [11]:
data.head()

Unnamed: 0,title,text,subject,date,target
12190,OBAMA TELLS TROOPS To Rise Up Against Trump…Pr...,43 days and counting Characterizing the milita...,politics,"Dec 8, 2016",0
3547,Obama Makes UNPRECEDENTED Move To Protect The...,President Barack Obama knows that President-el...,News,"December 6, 2016",0
12511,WOW! BLACK PROF Delivers Truth Bomb About Blac...,,politics,"Nov 3, 2016",0
12043,FEEL GOOD STORY OF THE DAY: Globalist Billiona...,Says the guy who s been funding riots across ...,politics,"Dec 29, 2016",0
813,Wife Of The Japanese PM Epically Trolled Trum...,Amateur president Donald Trump left his chair ...,News,"July 20, 2017",0


## Building, training and using the gensim Word2Vec model for getting the word vectors

In [12]:
# building the word2vec model
model = gensim.models.Word2Vec(
    window = 6,
    min_count = 1,
    workers = 4
)
model.build_vocab(data_gensim['cleaned_text'])

In [13]:
# training the model
model.train(data_gensim['cleaned_text'], total_examples=model.corpus_count, epochs=5)

# saving the model (the "word2vec" directory must already exist)
model.save("word2vec/word2vec_model")

In [14]:
model.wv.index_to_key[:5]

['the', 'to', 'of', 'and', 'in']

In [15]:
len(model.wv.index_to_key)

114493

In [16]:
# a function for finding the average of the word vectors 
def get_average_word2vec_vector(text, model, word_dim):
  vec = np.zeros((word_dim,))  
  count = 0
  for word in text:
    if word in model.wv:  
      vec += model.wv[word]
      count += 1
  if count != 0:
    vec /= count  
  return vec

# Get word dimensions from the model
word_dim = model.vector_size

# Apply the function to each cleaned_text
word_vectors = [get_average_word2vec_vector(text, model, word_dim) for text in data_gensim['cleaned_text']]
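
To make the averaging step concrete, here is a small sketch on toy data. The 3-dimensional vectors and the names `toy_vectors` and `average_vector` are made up for illustration; the real model produces 100-dimensional vectors. It also shows why a document with no known tokens, such as an article with empty text, ends up as an all-zero vector.

```python
import numpy as np

# Toy 3-dimensional "embeddings" (hypothetical values for illustration)
toy_vectors = {
    "fake": np.array([1.0, 0.0, 2.0]),
    "news": np.array([3.0, 2.0, 0.0]),
}

def average_vector(tokens, vectors, dim):
    """Mean of the vectors of known tokens; all zeros if none are known."""
    known = [vectors[t] for t in tokens if t in vectors]
    if not known:
        return np.zeros(dim)
    return np.mean(known, axis=0)

doc_vec = average_vector(["fake", "news", "unknownword"], toy_vectors, 3)
empty_vec = average_vector([], toy_vectors, 3)
print(doc_vec)    # mean of the two known vectors
print(empty_vec)  # zero vector for a document with no known tokens
```

Averaging discards word order and gives every token equal weight, which is one plausible reason these features carry less signal than TF-IDF weights on this task.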


In [17]:
# adding the word vectors to the data
data['word_vectors'] = word_vectors

In [18]:
data.head()

Unnamed: 0,title,text,subject,date,target,word_vectors
12190,OBAMA TELLS TROOPS To Rise Up Against Trump…Pr...,43 days and counting Characterizing the milita...,politics,"Dec 8, 2016",0,"[-0.9129094113346878, -0.2038301904509487, -1...."
3547,Obama Makes UNPRECEDENTED Move To Protect The...,President Barack Obama knows that President-el...,News,"December 6, 2016",0,"[-0.4982029019015461, -0.38933679708476426, -1..."
12511,WOW! BLACK PROF Delivers Truth Bomb About Blac...,,politics,"Nov 3, 2016",0,"[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ..."
12043,FEEL GOOD STORY OF THE DAY: Globalist Billiona...,Says the guy who s been funding riots across ...,politics,"Dec 29, 2016",0,"[-0.7918661317625654, -0.22423174317256225, -1..."
813,Wife Of The Japanese PM Epically Trolled Trum...,Amateur president Donald Trump left his chair ...,News,"July 20, 2017",0,"[-0.32123523207573595, -0.8752208944867031, -0..."


## Classifying the data

In [19]:
# importing the model
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import classification_report
# importing necessary libraries for model building
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
# defining the model
clf = GradientBoostingClassifier()

### Using nltk and tfidf

In [20]:
# defining the X and y arrays for training and testing
X1 = data_nltk['cleaned_text'].values
y1 = data_nltk['target'].values

In [21]:
X1.shape,y1.shape

((44898,), (44898,))

In [22]:
# splitting data to train-test split
X_train1,X_test1,y_train1,y_test1 = train_test_split(X1,y1,test_size=0.33,random_state=44)

In [23]:
# defining the tfidf vectorizer
tfidf = TfidfVectorizer(strip_accents=None,
                        lowercase=False,
                        preprocessor=None)
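
For intuition about what the vectorizer computes, the sketch below works out TF-IDF weights by hand on three made-up documents. It assumes the smoothed formula idf(t) = ln((1 + n) / (1 + df(t))) + 1 followed by L2 normalization of each row, which is my understanding of scikit-learn's default settings; the documents and helper names are hypothetical.

```python
import math
from collections import Counter

# Hypothetical, already-cleaned toy documents
docs = ["trump said news", "news report said", "fake news"]
tokenized = [d.split() for d in docs]
n_docs = len(tokenized)

# Document frequency: number of documents containing each term
df = Counter(term for tokens in tokenized for term in set(tokens))

def tfidf_row(tokens):
    """Smoothed TF-IDF weights for one document, L2-normalized."""
    tf = Counter(tokens)
    weights = {t: tf[t] * (math.log((1 + n_docs) / (1 + df[t])) + 1) for t in tf}
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {t: w / norm for t, w in weights.items()}

rows = [tfidf_row(tokens) for tokens in tokenized]
for doc, row in zip(docs, rows):
    print(doc, {t: round(w, 3) for t, w in row.items()})
```

A term that appears in every document ("news") gets the minimum idf of 1, while rarer terms ("fake") are weighted up, so distinctive vocabulary dominates each row.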

In [24]:
# defining the pipeline to fit the training data
gb_tfidf = Pipeline([
    ('vect',tfidf),
    ('gb clf',clf)
])
gb_tfidf.fit(X_train1,y_train1)

In [25]:
y_pred1 = gb_tfidf.predict(X_test1)
# printing the classification report for validation of the model
print(classification_report(y_test1, y_pred1))

              precision    recall  f1-score   support

           0       1.00      0.99      1.00      7679
           1       0.99      1.00      0.99      7138

    accuracy                           1.00     14817
   macro avg       1.00      1.00      1.00     14817
weighted avg       1.00      1.00      1.00     14817
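
The report prints metrics rounded to two decimals, so a displayed accuracy of 1.00 on 14817 test rows does not by itself imply zero errors. A small sketch of the arithmetic follows; the error counts are hypothetical, chosen only to locate the rounding threshold, and are not taken from the model.

```python
n_test = 14817  # support of the test set in the report above

def displayed_accuracy(n_errors, n_total):
    """Accuracy formatted to two decimals, as in the report."""
    return f"{(n_total - n_errors) / n_total:.2f}"

# Hypothetical error counts, chosen only to show the rounding threshold
for n_errors in (0, 74, 75):
    print(n_errors, displayed_accuracy(n_errors, n_test))
```

The per-class values of 0.99 in the report are consistent with a small number of misclassified articles.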



### Using gensim - Word2Vec

In [26]:
# defining X and y arrays for training and testing
X2 = word_vectors
y2 = data['target'].values

In [27]:
# Create training and test sets
X_train2, X_test2, y_train2, y_test2 = train_test_split(X2, y2, test_size=0.33, random_state=44)

In [28]:
# reshaping the input values for classifying
X_train_2d = np.stack(X_train2)
X_test_2d =  np.stack(X_test2)
X_train_2d.shape , X_test_2d.shape

((30081, 100), (14817, 100))
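
The row counts in these shapes follow from `test_size=0.33` applied to the 44898 articles. A minimal sketch of the arithmetic, assuming the test count is rounded up and the remainder goes to training (the variable names are illustrative):

```python
import math

n_samples = 44898   # rows in the dataset, from data.info() above
test_size = 0.33    # fraction passed to train_test_split

# Assumption: the test count is rounded up and the remainder goes to training
n_test = math.ceil(test_size * n_samples)
n_train = n_samples - n_test
print(n_train, n_test)
```

Because both splits use the same `random_state` and the same row order, the two models are evaluated on the same 14817 test articles, which keeps the comparison fair.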

In [29]:
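# fitting a fresh classifier on the train data
# (re-fitting clf would overwrite the fitted estimator inside gb_tfidf)
clf_w2v = GradientBoostingClassifier()
clf_w2v.fit(X_train_2d, y_train2)
# predicting the test values
y_pred2 = clf_w2v.predict(X_test_2d)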
# printing the classification report for validation of the model
print(classification_report(y_test2, y_pred2))

              precision    recall  f1-score   support

           0       0.96      0.96      0.96      7679
           1       0.95      0.96      0.96      7138

    accuracy                           0.96     14817
   macro avg       0.96      0.96      0.96     14817
weighted avg       0.96      0.96      0.96     14817



Comparing the two NLP libraries with the same GradientBoostingClassifier, the model that used NLTK preprocessing with TfidfVectorizer reached an accuracy that rounds to 1.00 (with per-class precision and recall between 0.99 and 1.00), while the model that used Gensim preprocessing with averaged Word2Vec embeddings reached 0.96. On this dataset, TF-IDF features therefore helped the classifier separate the two classes more accurately than averaged Word2Vec vectors.