In [None]:
# Exercise sheet 10 with fake_train and fake_test. Load the data set into your console.

# Task 1
In Moodle you will find the files fake_train.csv and fake_test.csv. Each contains a set of news articles that are either factual news or fake news.

In the upcoming tasks, we will try to differentiate fake news from real news by comparing their document embeddings. 

For this, we will train a document embedding model on the whole corpus. 

Then, we will use the embeddings of our train-corpus to train a logistic regression model that tries to predict the labels of the test-corpus given their embeddings.


In [2]:
import pandas as pd
from gensim.models.doc2vec import Doc2Vec, TaggedDocument  # Updated import
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

In [5]:
# Define file paths
train_file_path = '/Users/oayanwale/Downloads/NLP_Exercise_24_25/Data/fake_train.csv'
test_file_path = '/Users/oayanwale/Downloads/NLP_Exercise_24_25/Data/fake_test.csv'

# Task 1: Load data
train_df = pd.read_csv(train_file_path)
test_df = pd.read_csv(test_file_path)



# Task 2
# Preprocess the texts so that they are fit for an analysis. Argue the use of the preprocessing steps you take for the given analysis.


In [10]:
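import re
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer

# Requires the NLTK resources 'stopwords', 'wordnet' and 'punkt' ('punkt_tab' in newer NLTK releases),
# e.g. via nltk.download('stopwords') if they are not installed yet.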

# Initialize necessary components
stop_words = set(stopwords.words('english'))
lemmatizer = WordNetLemmatizer()

def preprocess_text(text):
    # Convert text to lowercase
    text = text.lower()
    
    # Remove punctuation
    text = re.sub(r'[^\w\s]', '', text)
    
    # Tokenization (split text into individual words)
    tokens = word_tokenize(text)
    
    # Remove stopwords
    tokens = [token for token in tokens if token not in stop_words]
    
    # Remove numbers (optional for news articles)
    tokens = [token for token in tokens if not token.isdigit()]
    
    # Lemmatization (reduce words to their base form)
    tokens = [lemmatizer.lemmatize(token) for token in tokens]
    
    # Join tokens back into a single string and strip excess whitespace
    preprocessed_text = ' '.join(tokens).strip()
    
    return preprocessed_text

# Apply preprocessing to the 'text' column of both datasets
train_df['preprocessed_text'] = train_df['text'].apply(preprocess_text)
test_df['preprocessed_text'] = test_df['text'].apply(preprocess_text)

# Display the preprocessed data
print("Train Data:")
print(train_df.head())
print("\nTest Data:")
print(test_df.head())


Train Data:
                                                text  label  \
0  Trump administration to review goal of world w...      1   
1  Turkish academics to be tried in April over Ku...      1   
2  Factbox: Italy's new electoral law offers a mi...      1   
3   WATCH: Trump Get His A** Handed To Him By Chr...      0   
4  Mexico president says Trump visit could have b...      1   

                                   preprocessed_text  
0  trump administration review goal world without...  
1  turkish academic tried april kurdish letter is...  
2  factbox italy new electoral law offer mix syst...  
3  watch trump get handed chris cuomo cry fake ne...  
4  mexico president say trump visit could done be...  

Test Data:
                                                text  label  \
0  As U.S. budget fight looms, Republicans flip t...      1   
1  U.S. military to accept transgender recruits o...      1   
2  Senior U.S. Republican senator: 'Let Mr. Muell...      1   
3  FBI Russia p

# Argue the use of the preprocessing steps you take for the given analysis.

1. Lowercasing
Purpose: Converting all text to lowercase ensures uniformity across the dataset.
Argument: Lowercasing removes case distinctions, so that "Trump" and "trump" are treated as the same token. This shrinks the vocabulary and gives the embedding model more occurrences per word. The trade-off is that stylistic signals such as all-caps words are lost.

2. Removing Punctuation
Purpose: Punctuation marks rarely carry meaning that a word-based embedding model can use.
Argument: Removing punctuation before tokenization reduces noise and keeps punctuation marks from entering the vocabulary as separate tokens. Note that the pattern used here also deletes apostrophes and hyphens, so "Italy's" becomes "italys" after this step.

3. Tokenization
Purpose: Splitting text into individual words (tokens) allows for detailed analysis of word frequencies and relationships.
Argument: Tokenization is a foundational step in NLP that enables further processing such as stop word removal and lemmatization. It transforms the continuous string of text into manageable units that can be analyzed individually.

4. Removing Stopwords
Purpose: Stopwords are common words (e.g., "and", "the", "is") that do not add significant meaning to sentences.
Argument: By filtering out stopwords, you focus on more meaningful content within the articles. This reduces dimensionality and helps improve the efficiency of machine learning algorithms by only considering relevant tokens.

5. Removing Numbers (Optional)
Purpose: Depending on context, numbers may not contribute meaningful information for classification tasks.
Argument: In many cases, especially when analyzing textual content rather than numerical data, removing numbers helps clean up the dataset. However, if your analysis involves financial or statistical news where numbers are significant, this step might need reconsideration.

6. Lemmatization
Purpose: Lemmatization reduces words to their dictionary form (e.g., "academics" becomes "academic").
Argument: This process groups different forms of a word under a single representation, which keeps variations of a word from inflating the vocabulary unnecessarily. Note that WordNetLemmatizer treats every token as a noun unless a part-of-speech tag is passed, so verb forms such as "tried" remain unchanged in the output above.

7. Joining Tokens Back into a Single String
Purpose: After preprocessing, it’s often useful to convert tokens back into a single string format for certain models or analyses.
Argument: Some models require input as strings rather than lists of tokens; thus joining them allows flexibility depending on how subsequent analyses will be performed.
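
The effect of the individual steps can be traced on a single sentence. The following is a minimal, dependency-free sketch and not the NLTK pipeline used above: `TOY_STOPWORDS` is a small hypothetical stand-in for NLTK's English stopword list, whitespace splitting replaces `word_tokenize`, lemmatization is left out, and the example sentence is made up.

```python
import re

# Hypothetical stand-in for NLTK's English stopword list (assumption, far from complete).
TOY_STOPWORDS = {"the", "to", "of", "a", "in", "and", "is", "be", "over"}


def preprocess_steps(text):
    """Return the intermediate result after each preprocessing step."""
    steps = {}
    steps["lowercase"] = text.lower()
    # Same pattern as in preprocess_text: drop everything that is
    # neither a word character nor whitespace.
    steps["no_punctuation"] = re.sub(r"[^\w\s]", "", steps["lowercase"])
    # Whitespace splitting stands in for word_tokenize here.
    steps["tokens"] = steps["no_punctuation"].split()
    steps["no_stopwords"] = [t for t in steps["tokens"] if t not in TOY_STOPWORDS]
    steps["no_numbers"] = [t for t in steps["no_stopwords"] if not t.isdigit()]
    return steps


# Made-up sentence for illustration only.
example = "Turkish academics to be tried in April 2018 over the letter."
for name, value in preprocess_steps(example).items():
    print(f"{name}: {value}")
```

Printing the intermediate results makes it easy to see which step removes which information, which helps when arguing for or against a step.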


# Task 3
# Train a Doc2Vec model on all documents from both the training and test corpus with a window size of four and a vector dimension of 300.

In [12]:
from gensim.models import Doc2Vec
from gensim.models.doc2vec import TaggedDocument

# Prepare the tagged data by tagging each document
combined_df = pd.concat([train_df[['preprocessed_text']], test_df[['preprocessed_text']]])

# Create TaggedDocument for each document in the combined dataset
tagged_data = [TaggedDocument(words=row.split(), tags=[str(i)]) for i, row in enumerate(combined_df['preprocessed_text'])]

# Train the Doc2Vec model
model = Doc2Vec(vector_size=300, window=4, min_count=1, workers=4, epochs=10)
model.build_vocab(tagged_data)
model.train(tagged_data, total_examples=model.corpus_count, epochs=model.epochs)

# Save the model (optional)
model.save("/Users/oayanwale/Downloads/NLP_Exercise_24_25/Data/doc2vec_model.model")


In [13]:
from gensim.models import Doc2Vec

# Load the saved model
loaded_model = Doc2Vec.load("/Users/oayanwale/Downloads/NLP_Exercise_24_25/Data/doc2vec_model.model")
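
Because the combined corpus is tagged by position, training documents receive the tags "0" to "n_train - 1" and test documents follow directly after. A small sketch of that bookkeeping, assuming the train-then-test concatenation order used above; `tag_for` is a hypothetical helper, and the toy lists stand in for the two `preprocessed_text` columns:

```python
def tag_for(split, position, n_train):
    """Return the string tag a document gets when train and test are concatenated in that order."""
    if split == "train":
        return str(position)
    if split == "test":
        # Test documents are enumerated after all training documents.
        return str(n_train + position)
    raise ValueError(f"unknown split: {split!r}")


# Toy stand-ins for train_df['preprocessed_text'] and test_df['preprocessed_text'].
toy_train = ["trump administration review goal", "turkish academic tried april"]
toy_test = ["budget fight looms"]

# Same enumeration as in the TaggedDocument list comprehension.
tags = [str(i) for i, _ in enumerate(toy_train + toy_test)]

print(tags)
print(tag_for("test", 0, n_train=len(toy_train)))
```

In gensim 4 the vector learned during training for a tag is available as `model.dv[tag]`, which is an alternative to calling `infer_vector` on documents the model has already seen.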



# Task 4
# Create a data frame of all document embeddings of the documents within the training corpus and the label of the respective document. Use this data frame to train a logistic regression that uses the embeddings to predict the label of the document.

# Use it to predict the labels of all documents in the test-corpus using their embeddings. Compare the resulting labels to the true labels and return the classification rate. How well does the model perform?


In [14]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Create embeddings for each document in the training set
train_embeddings = [model.infer_vector(row.split()) for row in train_df['preprocessed_text']]

# Add the embeddings and the labels to a new DataFrame
train_embeddings_df = pd.DataFrame(train_embeddings)
train_embeddings_df['label'] = train_df['label']

# Train a Logistic Regression model using the document embeddings
X_train = train_embeddings_df.drop('label', axis=1)
y_train = train_embeddings_df['label']

log_reg_model = LogisticRegression(max_iter=1000)
log_reg_model.fit(X_train, y_train)

# Now use the model to predict on the test set
test_embeddings = [model.infer_vector(row.split()) for row in test_df['preprocessed_text']]
test_embeddings_df = pd.DataFrame(test_embeddings)

# Make predictions
predictions = log_reg_model.predict(test_embeddings_df)

# Compare with true labels
accuracy = accuracy_score(test_df['label'], predictions)
print(f"Classification accuracy: {accuracy}")


Classification accuracy: 0.9467476351745884
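
Whether an accuracy of about 0.947 counts as good depends on the class balance, which is not shown here. One way to judge it is to compare against the majority-class baseline. The sketch below computes the classification rate by hand and the baseline from label counts; the toy labels are made up for illustration and are not the exercise data, and both function names are hypothetical.

```python
from collections import Counter


def classification_rate(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def majority_baseline(y_train, y_test):
    """Accuracy of always predicting the most frequent training label."""
    majority_label, _ = Counter(y_train).most_common(1)[0]
    return classification_rate(y_test, [majority_label] * len(y_test))


# Made-up labels for illustration only.
toy_train_labels = [1, 1, 1, 0, 0]
toy_test_labels = [1, 0, 1, 0]
toy_predictions = [1, 0, 0, 0]

print(classification_rate(toy_test_labels, toy_predictions))
print(majority_baseline(toy_train_labels, toy_test_labels))
```

With the real data, `train_df['label']` and `test_df['label']` would take the place of the toy lists; a model is only informative to the extent that it beats the baseline.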


# Task 5
# Repeat tasks 3 and 4 with one adjustment: Train your initial Doc2Vec model only on the train-corpus. This way, the test-corpus is entirely unobserved for our model.


In [17]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Initialize the logistic regression model
logreg = LogisticRegression(max_iter=1000)


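# Re-train Doc2Vec model using only the train corpus
# Note: unlike Task 3, this cell uses the raw 'text' column and vector_size=100, window=2, epochs=40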
train_docs = [TaggedDocument(words=doc.split(), tags=[str(i)]) for i, doc in enumerate(train_df['text'])]
model = Doc2Vec(vector_size=100, window=2, min_count=1, workers=4, epochs=40)
model.build_vocab(train_docs)
model.train(train_docs, total_examples=model.corpus_count, epochs=model.epochs)

# Generate embeddings for training data
train_embeddings = [model.infer_vector(doc.split()) for doc in train_df['text']]

# Create DataFrame with embeddings and corresponding labels
train_embeddings_df = pd.DataFrame(train_embeddings)
train_embeddings_df['label'] = train_df['label']

# Train logistic regression
X_train = train_embeddings_df.drop('label', axis=1)
y_train = train_embeddings_df['label']
logreg.fit(X_train, y_train)

# Generate embeddings for test data
test_embeddings = [model.infer_vector(doc.split()) for doc in test_df['text']]
X_test = pd.DataFrame(test_embeddings)
y_test = test_df['label']  # Extract actual labels

# Predict labels for test data
y_pred = logreg.predict(X_test)

# Compare predicted labels with true labels
accuracy = accuracy_score(y_test, y_pred)

print(f"Classification Accuracy (Doc2Vec - Train-only): {accuracy}")


Classification Accuracy (Doc2Vec - Train-only): 0.9822102845575927
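
The cell above changes more than the training corpus compared with Task 3. Listing the settings of both runs side by side makes this visible. The dictionaries below are transcribed from the two cells; `differing_settings` is a hypothetical helper.

```python
# Settings transcribed from the Task 3 and Task 5 cells.
task3 = {
    "training_corpus": "train + test",
    "text_column": "preprocessed_text",
    "vector_size": 300,
    "window": 4,
    "epochs": 10,
}
task5 = {
    "training_corpus": "train only",
    "text_column": "text",
    "vector_size": 100,
    "window": 2,
    "epochs": 40,
}


def differing_settings(a, b):
    """Return {setting: (value_in_a, value_in_b)} for every setting that differs."""
    return {key: (a[key], b.get(key)) for key in a if a[key] != b.get(key)}


for key, (old, new) in differing_settings(task3, task5).items():
    print(f"{key}: {old} -> {new}")
```

All five settings differ, so the accuracy gap between the two runs cannot be attributed to the training corpus alone. For a controlled comparison, only `training_corpus` would change.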


# Task 6
# Repeat tasks 3 and 4 with a different adjustment: Train a Word2Vec model and create a document embedding for each document by averaging all word vectors that the document contains.
# Which of the three models from tasks 3, 5 and 6 performs best in classifying the documents?

In [18]:
from gensim.models import Word2Vec
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
import numpy as np

# Initialize the logistic regression model
logreg = LogisticRegression(max_iter=1000)

# Train Word2Vec model on the train corpus (raw 'text' column, same settings as the Doc2Vec model in Task 5)
word2vec_model = Word2Vec(sentences=[doc.split() for doc in train_df['text']], vector_size=100, window=2, min_count=1, workers=4, epochs=40)

# Create document embeddings by averaging word embeddings
def get_word2vec_embedding(doc):
    words = doc.split()
    word_vectors = [word2vec_model.wv[word] for word in words if word in word2vec_model.wv]
    if len(word_vectors) == 0:
        return np.zeros(word2vec_model.vector_size)
    return np.mean(word_vectors, axis=0)

# Generate embeddings for training data
train_embeddings = [get_word2vec_embedding(doc) for doc in train_df['text']]

# Create DataFrame with embeddings and corresponding labels
train_embeddings_df = pd.DataFrame(train_embeddings)
train_embeddings_df['label'] = train_df['label']

# Train logistic regression
X_train = train_embeddings_df.drop('label', axis=1)
y_train = train_embeddings_df['label']
logreg.fit(X_train, y_train)

# Generate embeddings for test data
test_embeddings = [get_word2vec_embedding(doc) for doc in test_df['text']]
X_test = pd.DataFrame(test_embeddings)
y_test = test_df['label']  # Extract actual labels

# Predict labels for test data
y_pred = logreg.predict(X_test)

# Compare predicted labels with true labels
accuracy = accuracy_score(y_test, y_pred)

print(f"Classification Accuracy (Word2Vec): {accuracy}")

Classification Accuracy (Word2Vec): 0.9889446844953093
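
The averaging step itself is simple: the document vector is the element-wise mean of the vectors of all words that appear in the model's vocabulary. A dependency-free sketch with a made-up two-dimensional vocabulary (`toy_vectors` and `average_embedding` are hypothetical; the real vectors come from `word2vec_model.wv`):

```python
from statistics import fmean

# Made-up 2-dimensional word vectors for illustration only.
toy_vectors = {
    "trump": [1.0, 0.0],
    "visit": [0.0, 1.0],
    "mexico": [1.0, 1.0],
}


def average_embedding(doc, vectors, dim):
    """Element-wise mean of the vectors of all known words; zeros if none are known."""
    known = [vectors[word] for word in doc.split() if word in vectors]
    if not known:
        return [0.0] * dim
    # zip(*known) groups the i-th component of every word vector together.
    return [fmean(component) for component in zip(*known)]


print(average_embedding("mexico visit unknownword", toy_vectors, dim=2))
print(average_embedding("nothing known here", toy_vectors, dim=2))
```

Unknown words are skipped and every known word gets equal weight, so frequent but uninformative tokens influence the average as much as rare ones. This matters here because Task 6 runs on the raw, unpreprocessed `text` column.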


# Summary of model performances

Doc2Vec (Trained on both Train & Test Corpus) → 94.67%
Doc2Vec (Train-only, Unseen Test Corpus) → 98.22%
Word2Vec (Averaged Word Vectors) → 98.89%

# Observations & Insights
✅ Word2Vec performed best with 98.89% accuracy, showing that averaged word vectors separate the two classes well on this data set.
✅ Doc2Vec (Train-only) scored higher than Doc2Vec trained on the full corpus (98.22% vs. 94.67%). The two runs also differ in input text (raw vs. preprocessed), vector size (100 vs. 300), window (2 vs. 4) and epochs (40 vs. 10), so the gain cannot be attributed to the training corpus alone.
✅ Both Doc2Vec approaches performed well. Seeing the unlabeled test documents during embedding training would not be expected to lower accuracy by itself, so the lower score in Task 3 more likely reflects the different settings (for example, only 10 training epochs) than overfitting.

Conclusion:
The Word2Vec model (Task 6) achieves the highest classification accuracy, possibly because averaging word vectors provides a more robust document representation than Doc2Vec’s inferred vectors. Tasks 5 and 6 share the same settings (raw text, vector size 100, window 2, 40 epochs), so that comparison is the more reliable one; the comparison with Task 3 is confounded by the differing settings.
