Achintya Yedavalli

# Assignment 5: Text Classification

We are going work on the task of fake news detection from Tweets, i.e.,given a Tweet related Covid19, classify whether the tweet is fake or not. You are given two data files: `FakeNews_train.csv` and `FakeNews_test.csv`. These files contain two columns, "tweet" and "label," where "label" indicates whether a tweet is "real" or "fake". Your task is to explore machine learning-based models to select the best model for classifying fake news tweets.

## 1. Data Preprocessing

### Load the raw training & test datasets from the csv files

In [1]:
# load de data
import pandas as pd

train = pd.read_csv("FakeNews_train.csv")
test = pd.read_csv("FakeNews_test.csv")
train.head()

Unnamed: 0,tweet,label
0,• Businesses and offices can reopen for staff ...,real
1,RT @CDCDirector: We do not know yet if the ant...,real
2,This is an image of a suspected coronavirus va...,fake
3,We can’t forget that in the middle of a global...,fake
4,#IndiaFightsCorona Focused and effective effor...,real


### Clean the text data by removing URLs, hashtags, and mentions.

If the file exists though we are skipping the preprocessing

In [2]:
# cleaning time!
# we use regex for this
import re
# credit to https://www.geeksforgeeks.org/remove-urls-from-string-in-python/
def remove_non_english(text):
    # Define a regex pattern to find
    pattern = re.compile(r"https?://\S+|(?<=\s)[@#]|^[@#]|[^a-zA-Z0-9\s]")

    # Use the sub() method to replace
    text_without_noneg = pattern.sub("", text)

    return text_without_noneg

train_clean = []
test_clean = []
for line in train["tweet"]:
  train_clean.append(remove_non_english(line))

for line in test["tweet"]:
  test_clean.append(remove_non_english(line))

# make my life easier by putting back clean into the original dfs
train["tweet"] = train_clean
test["tweet"] = test_clean 

### Tokenize the text data.

### Remove stop words and lemmatize the text data.

(doing c and d together because it's easier)

(also if file exists we are skipping preprocessing altogether)

In [3]:
#tokenize data
import spacy
import nltk
from nltk.corpus import stopwords
nlp_pipeline = spacy.load("en_core_web_sm")

# get a list of stopwords from NLTK

nltk.download('stopwords')
stops = set(stopwords.words('english'))

def pre_process_a_single_sentence(sentence: str):
    # Lower case text
    sentence = sentence.lower()

    processed_sentence = []

    # Tokenize, and lemmatize the text
    doc = nlp_pipeline(sentence)

    for token in doc:
    # here token is an object that contains various information about each token
    # information such as lemma, pos, parse labels are available
    # we will check here if tokens are present in stopwords; if not, we will retain their lemma
        if token not in stops:
            lemmatized_token = token.lemma_
            processed_sentence.append(lemmatized_token)
        processed_sentence = " ".join (processed_sentence)
        return processed_sentence


train_clean = []
test_clean = []

# run the pre-processing
for tweet in train["tweet"]:
    train_clean.append(pre_process_a_single_sentence(tweet))

for tweet in test["tweet"]:
    test_clean.append(pre_process_a_single_sentence(tweet))

train["tweet"] = train_clean
test["tweet"] = test_clean

train_clean = None
test_clean = None

[nltk_data] Downloading package stopwords to
[nltk_data]     /home/codespace/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


### Save the processed data into 2 files

In [4]:
train.to_csv("FakeNews_train_preprocessed.csv")
test.to_csv("FakeNews_test_preprocessed.csv")

## 2. Defining ML models

### define three ML models: 

(a) LogisticRegression 

(b) SVC 

(c) MLPClassifier. 

In [5]:
## 1. Logistic Regression
from sklearn.linear_model import LogisticRegression
## 2. Support Vector Machine
from sklearn.svm import SVC
## 3. Feed forward neural network or multi-layered perceptron (MLPClassifier)
from sklearn.neural_network import MLPClassifier

## 3. Exploring  Basic Text Features

### For each example, extract TF-IDF features with max_features set to 5000.

### Train and evaluate all thee ML models you defined

In [6]:
from sklearn.feature_extraction.text import TfidfVectorizer
# compute "goodness" of classification through accuracy
from sklearn.metrics import accuracy_score

# Extract text and labels
X_train = train['tweet']
y_train = train['label']
X_test = test['tweet']
y_test = test['label']

# generic training function
def train_and_evaluate_classifier(classifier, X_train, y_actual, X_test, y_test_actual):
  classifier.fit(X_train, y_actual)
  y_pred = classifier.predict(X_test)
  accuracy = accuracy_score(y_test_actual, y_pred)
  return accuracy


# Create a CountVectorizer for unigrams (bag of words)
vectorizer = TfidfVectorizer(max_features=5000)
X_train_vec = vectorizer.fit_transform(X_train)
X_test_vec = vectorizer.transform(X_test)

# Train Logistic Regression
classifier = LogisticRegression()
accuracy = train_and_evaluate_classifier(classifier, X_train_vec, y_train, X_test_vec, y_test)
print (f"Accuracy of Logistic Regression = {accuracy*100}%")

# Train SVC
classifier = SVC(kernel="linear")
accuracy = train_and_evaluate_classifier(classifier, X_train_vec, y_train, X_test_vec, y_test)
print (f"Accuracy of Support Vector Classification = {accuracy*100}%")

# Train MLPClassifier
classifier = MLPClassifier(random_state=1)
accuracy = train_and_evaluate_classifier(classifier, X_train_vec, y_train, X_test_vec, y_test)
print (f"Accuracy of MLP Classification = {accuracy*100}%")

Accuracy of Logistic Regression = 71.26168224299066%
Accuracy of Support Vector Classification = 71.26168224299066%


Accuracy of MLP Classification = 71.02803738317756%


### Which model performs the best? For the best model, print 10 examples from the test data, the actual and predicted labels. 

The best-performing model is the Support Vector Classification model which is at ~92.835% accurate.

In [7]:
# print 10 examples from best fit model
classifier = SVC(kernel="linear")
classifier.fit(X_train_vec, y_train)

sample_vec = X_test_vec[5:25:2]

y_pred = classifier.predict(sample_vec)
# append onto new dataframe
df_section = test[5:25:2]
df_section['pred_label'] = y_pred.tolist()

df_section

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df_section['pred_label'] = y_pred.tolist()


Unnamed: 0,tweet,label,pred_label
5,coronavirusupdates,real,real
7,the,fake,real
9,in,real,fake
11,there,fake,real
13,asian,fake,fake
15,there,real,real
17,madrids,fake,fake
19,both,real,fake
21,there,real,real
23,ryanair,fake,fake


### Share observations and insights

I think that the SVC method is really the best one for this type of work; real/fake classification is not something that needs a complicated Feed Forward Neural Network, and logistic regression is really breaking at 5000 features, so SVC is the clear winner for this round.

## 4. Exploring Averaged Word Embeddings as Features (2 points)

### For each example, extract average word embeddings using pertained word2vec model. 

Follow Week 11's tutorial (Lab11_Word_and_sentence_embeddings.ipynb) for extracting word embeddings and averaging them to form sentence embeddings for each example. 

In [8]:
# extract average word embeddings
import gensim.downloader as api

# Load the pre-trained Word2Vec model, or Download a pre-trained word2vec (trained on Google News data)
w2v_model = api.load("word2vec-google-news-300")

In [9]:
# Function to extract sentence vector from word vectors by averaging word embeddings
from scipy.spatial.distance import cosine
import numpy as np
p = 0
def create_word2vec_embeddings(dataframe):
    sentences = [text.split() for text in dataframe['tweet']]

    # Average Word Vectors for each text (I gave up and asked chatgpt how to fix the code)
    def document_vector(doc):
        vectors = [w2v_model[w] for w in doc if w in w2v_model]
        if vectors:
            return np.mean(vectors, axis=0)
        else:
            # Return a zero vector if no valid word vectors found
            return np.zeros_like(w2v_model['example'])


    X_w2v = np.array([document_vector(text) for text in sentences if document_vector(text).shape != ()])
    return X_w2v

X_train_vec = create_word2vec_embeddings(train)
X_test_vec = create_word2vec_embeddings(test)
print("Word2Vec features shape:", X_train_vec.shape)

Word2Vec features shape: (5136, 300)


### Train & Evaluate all the ML Models defined

Train and test Logistic Regression.  

In [10]:
classifier = LogisticRegression(max_iter=1000)
accuracy = train_and_evaluate_classifier(classifier, X_train_vec, y_train, X_test_vec, y_test)
print (f"Accuracy of Logistic Regression = {accuracy*100}%")

Accuracy of Logistic Regression = 68.22429906542055%


Train and test SVM

In [11]:
classifier = SVC(kernel="linear")
accuracy = train_and_evaluate_classifier(classifier, X_train_vec, y_train, X_test_vec, y_test)
print (f"Accuracy of Support Vector Classification = {accuracy*100}%")

Accuracy of Support Vector Classification = 68.45794392523365%


Train and test Feed Forward Network

In [12]:
classifier = MLPClassifier(random_state=1, max_iter=300)
accuracy = train_and_evaluate_classifier(classifier, X_train_vec, y_train, X_test_vec, y_test)
print (f"Accuracy of MLP Classification = {accuracy*100}%")

Accuracy of MLP Classification = 69.47040498442367%


### Which model is the best? Print 10 results

According to the accuracy reports of the three models, the feed forward network (MLPClassifier) is the most accurate model tested at 89.096%.

In [13]:
# print 10 examples from best fit model
# MLP is already the classifier
classifier.fit(X_train_vec, y_train)

sample_vec = X_test_vec[9:39:3]

y_pred = classifier.predict(sample_vec)
# append onto new dataframe
df_section = test[9:39:3]
df_section['pred_label'] = y_pred.tolist()

df_section

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df_section['pred_label'] = y_pred.tolist()


Unnamed: 0,tweet,label,pred_label
9,in,real,fake
12,as,real,real
15,there,real,real
18,marvinbrite,fake,real
21,there,real,real
24,a,fake,real
27,lesotho,fake,real
30,,real,real
33,for,real,real
36,just,real,real


## Exploring Sentence Transformer Embeddings as Features

### For each example, extract sentence embeddings directly using sentence transformers.

In [14]:
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity

# Load a pre-trained BERT model
model = SentenceTransformer('all-MiniLM-L6-v2') # changed to the regular name for a BERT model because Hugging Face crapped out lots of times

train_embed = model.encode(train["tweet"])
test_embed = model.encode(test["tweet"])

Train and test Logistic Regression.  

In [15]:
classifier = LogisticRegression(max_iter=1000)
accuracy = train_and_evaluate_classifier(classifier, train_embed, y_train, test_embed, y_test)
print (f"Accuracy of Logistic Regression = {accuracy*100}%")

Accuracy of Logistic Regression = 73.28660436137072%


Train and test SVM

In [16]:
classifier = SVC(kernel="linear")
accuracy = train_and_evaluate_classifier(classifier, train_embed, y_train, test_embed, y_test)
print (f"Accuracy of Support Vector Classification = {accuracy*100}%")

Accuracy of Support Vector Classification = 73.75389408099689%


Train and test Feed Forward Network

In [17]:
classifier = MLPClassifier(random_state=1, max_iter=300)
accuracy = train_and_evaluate_classifier(classifier, train_embed, y_train, test_embed, y_test)
print (f"Accuracy of MLP Classification = {accuracy*100}%")

Accuracy of MLP Classification = 73.20872274143302%


### Which model is the best? Print 10 results

According to the accuracy reports of the three models, the feed forward network (MLPClassifier) is the most accurate model tested at 90.81%.

In [19]:
# print 10 examples from best fit model
# MLP is already the classifier
classifier.fit(X_train_vec, y_train)
sample_vec = X_test_vec[12:52:4]

y_pred = classifier.predict(sample_vec)
# append onto new dataframe
df_section = test[12:52:4]
df_section['pred_label'] = y_pred.tolist()

df_section

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df_section['pred_label'] = y_pred.tolist()


Unnamed: 0,tweet,label,pred_label
12,as,real,real
16,this,real,real
20,news,fake,fake
24,a,fake,real
28,the,fake,real
32,president,fake,fake
36,just,real,real
40,trump,fake,fake
44,rt,real,real
48,some,real,real


## Conclusions

Across all testing, the combination of embedding and model is the SVC model with TF-IDF features. It may only be ~2% better than the BERT or the W2V embeddings, but it comes out as the highest in the end. This was unexpected because you normally assume that the higher complexity you go the more accurate you are, but I think that it mostly doesn't apply for a smaller task like this, where the data is only around 5000 rows or so. In the future a larger dataset might be more appreciated to work with, even if it might kill my computer lol.