Achintya Yedavalli

# Assignment 5: Text Classification

We are going work on the task of fake news detection from Tweets, i.e.,given a Tweet related Covid19, classify whether the tweet is fake or not. You are given two data files: `FakeNews_train.csv` and `FakeNews_test.csv`. These files contain two columns, "tweet" and "label," where "label" indicates whether a tweet is "real" or "fake". Your task is to explore machine learning-based models to select the best model for classifying fake news tweets.

## 1. Data Preprocessing

### Load the raw training & test datasets from the csv files

In [1]:
# load de data
import pandas as pd

train = pd.read_csv("FakeNews_train.csv")
test = pd.read_csv("FakeNews_test.csv")
train.head()

Unnamed: 0,tweet,label
0,• Businesses and offices can reopen for staff ...,real
1,RT @CDCDirector: We do not know yet if the ant...,real
2,This is an image of a suspected coronavirus va...,fake
3,We can’t forget that in the middle of a global...,fake
4,#IndiaFightsCorona Focused and effective effor...,real


### Clean the text data by removing URLs, hashtags, and mentions.

If the file exists though we are skipping the preprocessing

In [29]:
import os.path

if not os.path.isfile("FakeNews_test_preprocessed.csv"):
  # cleaning time!
  # we use regex for this
  import re
  # credit to https://www.geeksforgeeks.org/remove-urls-from-string-in-python/
  def remove_non_english(text):
      # Define a regex pattern to find
      pattern = re.compile(r"https?://\S+|(?<=\s)[@#]|^[@#]|[^a-zA-Z0-9\s]")

      # Use the sub() method to replace
      text_without_noneg = pattern.sub("", text)

      return text_without_noneg

  train_clean = []
  test_clean = []
  for line in train["tweet"]:
    train_clean.append(remove_non_english(line))

  for line in test["tweet"]:
    test_clean.append(remove_non_english(line))

  # make my life easier by putting back clean into the original dfs
  train["tweet"] = train_clean
  test["tweet"] = test_clean 

else:
   print("skipped")

skipped


### Tokenize the text data.

### Remove stop words and lemmatize the text data.

(doing c and d together because it's easier)

(also if file exists we are skipping preprocessing altogether)

In [30]:
#tokenize data
import spacy
import nltk

# if 1 exists the other has to exists
if not os.path.isfile("FakeNews_test_preprocessed.csv"):
    from nltk.corpus import stopwords
    nlp_pipeline = spacy.load("en_core_web_sm")

    # get a list of stopwords from NLTK

    nltk.download('stopwords')
    stops = set(stopwords.words('english'))

    def pre_process_a_single_sentence(sentence: str):
        # Lower case text
        sentence = sentence.lower()

        processed_sentence = []

        # Tokenize, and lemmatize the text
        doc = nlp_pipeline(sentence)

        for token in doc:
        # here token is an object that contains various information about each token
        # information such as lemma, pos, parse labels are available
        # we will check here if tokens are present in stopwords; if not, we will retain their lemma
            if token not in stops:
                lemmatized_token = token.lemma_
                processed_sentence.append(lemmatized_token)
            processed_sentence = " ".join (processed_sentence)
            return processed_sentence


    train_clean = []
    test_clean = []

    # run the pre-processing
    for tweet in train["tweet"]:
        train_clean.append(pre_process_a_single_sentence(tweet))

    for tweet in test["tweet"]:
        test_clean.append(pre_process_a_single_sentence(tweet))

    train["tweet"] = train_clean
    test["tweet"] = test_clean

    train_clean = None
    test_clean = None
else:
    print("file exists: skipping")

file exists: skipping


### Save the processed data into 2 files

alternatively load if they exist

In [3]:
import os.path
# if it exists then load them, otherwise save them
if os.path.isfile("FakeNews_test_preprocessed.csv"):
    train = pd.read_csv("FakeNews_train_preprocessed.csv")
    test = pd.read_csv("FakeNews_test_preprocessed.csv")
else:
    train.to_csv("FakeNews_train_preprocessed.csv")
    test.to_csv("FakeNews_test_preprocessed.csv")

## 2. Defining ML models (1 point)

### define three ML models: 

(a) LogisticRegression 

(b) SVC 

(c) MLPClassifier. 

In [8]:
## 1. Logistic Regression
from sklearn.linear_model import LogisticRegression
## 2. Support Vector Machine
from sklearn.svm import SVC
## 3. Feed forward neural network or multi-layered perceptron (MLPClassifier)
from sklearn.neural_network import MLPClassifier

## 3. Exploring  Basic Text Features (2 points)

### For each example, extract TF-IDF features with max_features set to 5000.

### Train and evaluate all thee ML models you defined

In [9]:
from sklearn.feature_extraction.text import TfidfVectorizer
# compute "goodness" of classification through accuracy
from sklearn.metrics import accuracy_score

# Extract text and labels
X_train = train['tweet']
y_train = train['label']
X_test = test['tweet']
y_test = test['label']

# generic training function
def train_and_evaluate_classifier(classifier, X_train, y_actual, X_test, y_test_actual):
  classifier.fit(X_train, y_actual)
  y_pred = classifier.predict(X_test)
  accuracy = accuracy_score(y_test_actual, y_pred)
  return accuracy


# Create a CountVectorizer for unigrams (bag of words)
vectorizer = TfidfVectorizer()
X_train_vec = vectorizer.fit_transform(X_train)
X_test_vec = vectorizer.transform(X_test)

# Train Logistic Regression
classifier = LogisticRegression(max_iter=5000)
accuracy = train_and_evaluate_classifier(classifier, X_train_vec, y_train, X_test_vec, y_test)
print (f"Accuracy of Logistic Regression = {accuracy*100}%")

# Train SVC
classifier = SVC(kernel="linear", max_iter=5000)
accuracy = train_and_evaluate_classifier(classifier, X_train_vec, y_train, X_test_vec, y_test)
print (f"Accuracy of Support Vector Classification = {accuracy*100}%")

# Train MLPClassifier
classifier = MLPClassifier(random_state=1, max_iter=5000)
accuracy = train_and_evaluate_classifier(classifier, X_train_vec, y_train, X_test_vec, y_test)
print (f"Accuracy of MLP Classification = {accuracy*100}%")

Accuracy of Logistic Regression = 90.73208722741433%




Accuracy of Support Vector Classification = 92.83489096573209%
Accuracy of MLP Classification = 92.21183800623052%


### Which model performs the best? For the best model, print 10 examples from the test data, the actual and predicted labels. 

The best-performing model is the Support Vector Classification model which is at ~92.835% accurate.

In [34]:
classifier = SVC(kernel="linear", max_iter=5000)
classifier.fit(X_train_vec, y_train)
y_pred = classifier.predict(X_test_vec)



In [35]:
sample = test.head(10)
sample["predicted"] = y_pred[:10]
sample = sample.drop("Unnamed: 0", axis=1)
sample

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  sample["predicted"] = y_pred[:10]


Unnamed: 0,tweet,label,predicted
0,expert call out claim that cow dungurine yoga ...,fake,fake
1,bill gates predict and simulate the covid19 pa...,fake,fake
2,global coronavirus death exceed 800000,fake,fake
3,an ayurveda practitioner devender sharma give ...,fake,fake
4,meghan markle planning huge 200k birthday part...,fake,fake
5,coronavirusupdates indiafightscorona national ...,real,real
6,chancine coimmunity alexkx3 ballouxfrancois I ...,fake,fake
7,the video of terrible condition of a covid19 w...,fake,fake
8,coronavirusupdates indiafightscorona indias av...,real,real
9,in addition to our new reopeningsafely metric ...,real,real


### Share observations and insights

I think that the SVC method is really the best one for this type of work; real/fake classification is not something that needs a complicated Feed Forward Neural Network, and logistic regression is really breaking at 5000 features, so SVC is the clear winner for this round.

## Exploring Averaged Word Embeddings as Features (2 points)

### For each example, extract average word embeddings using pertained word2vec model. 

Follow Week 11's tutorial (Lab11_Word_and_sentence_embeddings.ipynb) for extracting word embeddings and averaging them to form sentence embeddings for each example. 

In [36]:
# extract average word embeddings
import gensim
import gensim.downloader as api

# Load the pre-trained Word2Vec model

# Download a pre-trained word2vec (trained on Google News data)
w2v_model = api.load("word2vec-google-news-300")


In [45]:
# Function to extract sentence vector from word vectors by averaging word embeddings
from scipy.spatial.distance import cosine
import numpy as np

# I'm pretty sure this is what is good for word2vec
def extract_sentence_vector(sentence):
    words = sentence.split()
    word_vectors = [w2v_model[word] for word in words if word in w2v_model]
    if not word_vectors:
        return None  # Return None if no word vectors are found
    sentence_vector = np.mean(word_vectors, axis=0)
    return sentence_vector

X_train_vec = []
X_test_vec = []
for tweet in train["tweet"]:
    vector = extract_sentence_vector(tweet)
    if vector is not None:
        X_train_vec.append(vector)

for tweet in test["tweet"]:
    vector = extract_sentence_vector(tweet)
    if vector is not None:
        X_test_vec.append(vector)

# this part to save error taken from https://stackoverflow.com/a/49569182 and https://stackoverflow.com/q/32284121
X_train_mtrx = np.vstack(X_train_vec)
X_test_mtrx = np.vstack(X_test_vec)
from scipy.sparse import csr_matrix
X_train_vec = csr_matrix(X_train_mtrx)
X_test_vec = csr_matrix(X_test_mtrx)


### Train & Evaluate all the ML Models defined

Train and test Logistic Regression.  

In [46]:
classifier = LogisticRegression(max_iter=1000)
accuracy = train_and_evaluate_classifier(classifier, X_train_vec, y_train, X_test_vec, y_test)
print (f"Accuracy of Logistic Regression = {accuracy*100}%")

ValueError: Found input variables with inconsistent numbers of samples: [5135, 5136]

Train and test SVM

In [47]:
classifier = SVC(kernel="linear")
accuracy = train_and_evaluate_classifier(classifier, X_train_vec, y_train, X_test_vec, y_test)
print (f"Accuracy of Support Vector Classificaiton = {accuracy*100}%")

ValueError: Found input variables with inconsistent numbers of samples: [5135, 5136]

Train and test Feed Forward Network

In [48]:
classifier = MLPClassifier(random_state=1, max_iter=300)
accuracy = train_and_evaluate_classifier(classifier, X_train_vec, y_train, X_test_vec, y_test)
print (f"Accuracy of Support Vector Classificaiton = {accuracy*100}%")

ValueError: Found input variables with inconsistent numbers of samples: [5135, 5136]

## These aren't working, I don't know why. I've spent way, way too long (4 hours) on these; maybe I'll get to it tomorrow if I can. For now it has to lie like this unfortunately. :(

## Exploring Sentence Transformer Embeddings as Features

### For each example, extract sentence embeddings directly using sentence transformers.

In [5]:
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity

# Load a pre-trained BERT model
model = SentenceTransformer('all-MiniLM-L6-v2')

train_embed = model.encode(train["tweet"])
test_embed = model.encode(test["tweet"])

modules.json:   0%|          | 0.00/349 [00:00<?, ?B/s]

config_sentence_transformers.json:   0%|          | 0.00/116 [00:00<?, ?B/s]

README.md:   0%|          | 0.00/10.7k [00:00<?, ?B/s]

sentence_bert_config.json:   0%|          | 0.00/53.0 [00:00<?, ?B/s]

config.json:   0%|          | 0.00/612 [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/90.9M [00:00<?, ?B/s]

tokenizer_config.json:   0%|          | 0.00/350 [00:00<?, ?B/s]

vocab.txt:   0%|          | 0.00/232k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/466k [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/112 [00:00<?, ?B/s]

1_Pooling/config.json:   0%|          | 0.00/190 [00:00<?, ?B/s]

The Problem I had with there being no bert-case-uncased is still there. Without one of those, trying to run this exercise is impossible, unfortunately.

I just can't run the sentence transformers needed for me to get the data to feed into the ML algorithm trainers to evaluate them, much less try to do something to compare them properly. Unfortunately, that's just how it is sometimes. I don't think it's a problem with my environment, as I tried Google Colab and my local machine and both failed. I'll try to look for this.

This took me 3 hours to debug before throwing in the towel. 3. HOURS.

In [11]:
# Train Logistic Regression
classifier = LogisticRegression(max_iter=5000)
accuracy = train_and_evaluate_classifier(classifier, train_embed, y_train, test_embed, y_test)
print (f"Accuracy of Logistic Regression = {accuracy*100}%")

# Train SVC
classifier = SVC(kernel="linear", max_iter=5000)
accuracy = train_and_evaluate_classifier(classifier, train_embed, y_train, test_embed, y_test)
print (f"Accuracy of Support Vector Classification = {accuracy*100}%")

# Train MLPClassifier
classifier = MLPClassifier(random_state=1, max_iter=5000)
accuracy = train_and_evaluate_classifier(classifier, train_embed, y_train, test_embed, y_test)
print (f"Accuracy of MLP Classification = {accuracy*100}%")

Accuracy of Logistic Regression = 87.77258566978193%
Accuracy of Support Vector Classification = 88.70716510903426%
Accuracy of MLP Classification = 90.80996884735202%
