## Improvement procedure - 1

`data sampling -> stopword removal -> stemming -> lemmatization -> tokenisation -> word2vec/glove -> naive bayes(multiclass)`

**Step 1: Data Balancing**
The LIAR dataset is a multiclass problem with six possible labels: `pants-fire`, `false`, `barely-true`, `half-true`, `mostly-true`, and `true`.

**Why is this step important?**
In classification problems, if one class has significantly more samples than others, the model tends to get very good at predicting the majority class but performs poorly on the minority classes. Data balancing is the process of adjusting the dataset to ensure the model learns from all classes equally.

Before we can decide how to balance the data, we first need to see the current class distribution in our training set.

Based on the dataset's README, the labels are in Column 2 of the .tsv files, and the text statements are in Column 3.

To begin, how would you load the train.tsv file into a structure (like a pandas DataFrame) and then check the count of each of the six labels? We'll focus on the train.tsv file for training.

In [1]:
import pandas as pd

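# Define the columns based on the README (0-indexed):
# 0: ID, 1: Label, 2: Statement, 3: Subject, ...
# We are interested in columns 1 (label) and 2 (statement).
# Columns 8-12 hold the speaker's credit-history counts; per the README the
# last of them is the pants-on-fire count, so it is named "PF" here.
column_names = [
    "ID", "Label", "Statement", "Subject", "Speaker",
    "Speaker_Job", "State", "Party", "BT", "F", "HT", "MT", "PF", "Context"
]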

# Load the training data. The file is tab-separated (sep='\t') and has no header.
# A forward-slash path works on Windows, macOS, and Linux.
try:
    df = pd.read_csv('liar_dataset/train.tsv', sep='\t', header=None, names=column_names)
except FileNotFoundError as err:
    # Without the file there is no DataFrame to analyse, so stop with a clear
    # message instead of continuing and hitting a NameError on `df` below.
    raise FileNotFoundError("Expected the LIAR training file at 'liar_dataset/train.tsv'.") from err

# Display the initial class distribution
class_distribution = df['Label'].value_counts()
print("\n--- Initial Class Distribution (Training Data) ---")
print(class_distribution)


--- Initial Class Distribution (Training Data) ---
Label
half-true      2114
false          1995
mostly-true    1962
true           1676
barely-true    1654
pants-fire      839
Name: count, dtype: int64


**The Problem of Imbalance**
For a multiclass model to perform well across all categories, every category should ideally contribute a similar amount of training data.
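As a quick check on how uneven the classes are, the counts printed above can be turned into proportions and an imbalance ratio. This is a minimal sketch that hard-codes those counts; the variable names are illustrative.

```python
# Class counts copied from the training-set output above.
counts = {
    "half-true": 2114,
    "false": 1995,
    "mostly-true": 1962,
    "true": 1676,
    "barely-true": 1654,
    "pants-fire": 839,
}

total = sum(counts.values())

# Share of the training set held by each class.
proportions = {label: n / total for label, n in counts.items()}

# Ratio between the largest and smallest class; 1.0 would mean perfectly balanced.
imbalance_ratio = max(counts.values()) / min(counts.values())

for label, share in proportions.items():
    print(f"{label:<12} {share:.1%}")
print(f"Imbalance ratio (largest / smallest): {imbalance_ratio:.2f}")
```

The largest class is roughly two and a half times the size of the smallest, which is the gap the balancing step tries to close.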

To solve this, we typically use:

`Oversampling (Minority Class)` - Duplicating samples from the minority classes.

`Undersampling (Majority Class)` - Removing samples from the majority classes.

`SMOTE (Synthetic Minority Oversampling Technique)` - Creating synthetic data points for the minority classes.
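The first two options can be sketched with pandas alone. The snippet below is an illustration on a tiny made-up DataFrame (the rows and variable names are hypothetical), not the code used later in this notebook.

```python
import pandas as pd

# Toy data standing in for the real DataFrame (hypothetical rows).
toy = pd.DataFrame({
    "Statement": [f"statement {i}" for i in range(9)],
    "Label": ["false"] * 5 + ["true"] * 3 + ["pants-fire"] * 1,
})

class_sizes = toy["Label"].value_counts()

# Random oversampling: draw with replacement until every class matches the largest.
oversampled = toy.groupby("Label").sample(
    n=int(class_sizes.max()), replace=True, random_state=42
)

# Random undersampling: draw without replacement down to the smallest class.
undersampled = toy.groupby("Label").sample(
    n=int(class_sizes.min()), random_state=42
)

print(oversampled["Label"].value_counts())
print(undersampled["Label"].value_counts())
```

Oversampling keeps every original row but repeats some of them; undersampling throws rows away, which is costly when the smallest class is very small.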

Given this gap (the largest class has roughly 2.5 times as many samples as the smallest), which technique would you pick to move forward: Oversampling, Undersampling, or SMOTE? Keep in mind that discarding data through undersampling tends to hurt when the training set is already small.

**SMOTE (Synthetic Minority Oversampling Technique)** is a common choice here because it creates synthetic examples of the minority classes, balancing the dataset without simply replicating existing rows (random oversampling) or discarding valuable data (undersampling). One caveat for text: SMOTE interpolates between feature vectors, so the synthetic points are blends of TF-IDF vectors rather than real sentences.

By using SMOTE on our training data (train.tsv), we aim to bring every class up to the size of the majority class, currently `half-true` with 2114 samples.
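The core idea behind SMOTE can be sketched in a few lines: pick a minority sample, pick one of its nearest minority neighbours, and create a new point somewhere on the line segment between them. The snippet below is a simplified illustration with NumPy, not the `imblearn` implementation; the function name `smote_like_samples` and the toy points are made up for this example.

```python
import numpy as np

def smote_like_samples(X_minority, n_new, k=2, seed=0):
    """Create n_new synthetic points by interpolating between minority neighbours."""
    rng = np.random.default_rng(seed)
    X_minority = np.asarray(X_minority, dtype=float)

    # Pairwise Euclidean distances between minority samples.
    diffs = X_minority[:, None, :] - X_minority[None, :, :]
    dists = np.linalg.norm(diffs, axis=2)
    np.fill_diagonal(dists, np.inf)  # a point is not its own neighbour

    # Indices of the k nearest neighbours of each sample.
    neighbours = np.argsort(dists, axis=1)[:, :k]

    synthetic = []
    for _ in range(n_new):
        i = rng.integers(len(X_minority))   # random minority sample
        j = rng.choice(neighbours[i])       # one of its nearest neighbours
        lam = rng.random()                  # interpolation factor in [0, 1)
        synthetic.append(X_minority[i] + lam * (X_minority[j] - X_minority[i]))
    return np.array(synthetic)

# Four toy minority points at the corners of the unit square.
minority = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
new_points = smote_like_samples(minority, n_new=5)
print(new_points)
```

Every synthetic point lies between two existing minority points, which is why the result adds variety compared with plain duplication.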

***Step 1:*** Data Balancing with SMOTE ⚖️
Here's a conceptual outline of how we'll proceed:

`Extract Data & Labels:` Separate the statements (X) from the labels (y) in the training set.

`Vectorize Text:` SMOTE works best on numerical features, so we'll convert our text data into numerical vectors using a simple method, like TF-IDF, before applying SMOTE.

`Apply SMOTE:` Use the SMOTE algorithm to create synthetic samples for the minority classes until all classes are roughly equal in size.

In [2]:
from sklearn.feature_extraction.text import TfidfVectorizer
from imblearn.over_sampling import SMOTE
import pandas as pd  # df was loaded in the previous cell

# 1. Separate Features (X) and Target (y)
X_text = df['Statement'] # Column 3 is actually index 2, defined as 'Statement'
y = df['Label']          # Column 2 is actually index 1, defined as 'Label'

# 2. Vectorize the Text Data
vectorizer = TfidfVectorizer() 
X_vectorized = vectorizer.fit_transform(X_text)
print("Text vectorized into numerical features.")

# 3. Apply SMOTE to Balance the Classes
sm = SMOTE(random_state=42) # Set a random state for reproducibility
X_balanced, y_balanced = sm.fit_resample(X_vectorized, y)

# Display the new distribution
print("\n--- Class Distribution After SMOTE ---")
print(pd.Series(y_balanced).value_counts())

Text vectorized into numerical features.

--- Class Distribution After SMOTE ---
Label
false          2114
half-true      2114
mostly-true    2114
true           2114
barely-true    2114
pants-fire     2114
Name: count, dtype: int64


**Step 2: Stopword Removal ❌🗣️**
Now that we have successfully addressed the data imbalance, we move to our next text preprocessing step: cleaning the language.

**Why is this important?**
Words like "a," "the," "is," and "and" (stopwords) occur frequently but carry little information about how truthful a statement is. Removing them:

`Reduces noise`: Improves the model's focus on meaningful terms.

`Reduces dimensionality`: Decreases the size of our vocabulary and the complexity of the data, speeding up training.

We need to decide on a language for the stopwords (English, for the LIAR dataset) and apply the filter. In Python, this is easily done using the Natural Language Toolkit (NLTK) or extending our sklearn vectorizer.

How would you adapt the TfidfVectorizer in the previous step to include a parameter that performs stopword removal? (Hint: What parameter is available in sklearn to handle a standard list of stopwords?)

The parameter is simply `stop_words='english'` (or a list of custom words) within the TfidfVectorizer initialization.
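A minimal sketch of that parameter in use, on two made-up sentences (the sentences and variable names are illustrative, not taken from the dataset):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Two invented example documents.
docs = [
    "The economy is growing",
    "The senator is lying about the economy",
]

# stop_words='english' drops words from scikit-learn's built-in English list.
vectorizer_sw = TfidfVectorizer(stop_words="english")
X_sw = vectorizer_sw.fit_transform(docs)

# The learned vocabulary no longer contains common function words.
print(sorted(vectorizer_sw.vocabulary_))
print(X_sw.shape)
```

Passing a Python list instead of `'english'` lets you supply your own stopword list.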

In a real-world NLP pipeline, we perform all text cleaning steps (like `stopword removal, stemming, and lemmatization`) before vectorizing the text and running the computationally intensive SMOTE algorithm. This ensures the resulting vectors and generated synthetic samples are based only on high-value features.

Let's adjust our pipeline to follow best practice, completing the linguistic preprocessing steps (2, 3, and 4) before running the final vectorization (Step 5) and SMOTE (Step 1 completion). We will work directly on the original text data (df['Statement']).

In [None]:
import nltk
from nltk.corpus import stopwords
import re

# Download the stopword list (a no-op if it is already present)
nltk.download('stopwords')
stop_words_list = set(stopwords.words('english'))

def remove_stopwords(text):
    # Convert to lowercase
    text = str(text).lower()
    # Remove all non-word characters and split into words
    words = re.findall(r'\b\w+\b', text)
    # Remove stopwords
    cleaned_words = [word for word in words if word not in stop_words_list]
    # Join back into a single string
    return " ".join(cleaned_words)

# Apply stopword removal to the entire 'Statement' column
df['Statement_Clean'] = df['Statement'].apply(remove_stopwords)

print("Stopwords successfully removed. New column 'Statement_Clean' created.")
print(df[['Statement', 'Statement_Clean']].head()) # Optional: show the contrast

Stopwords successfully removed. New column 'Statement_Clean' created.
                                           Statement  \
0  Says the Annies List political group supports ...   
1  When did the decline of coal start? It started...   
2  Hillary Clinton agrees with John McCain "by vo...   
3  Health care reform legislation is likely to ma...   
4  The economic turnaround started at the end of ...   

                                     Statement_Clean  
0  says annies list political group supports thir...  
1  decline coal start started natural gas took st...  
2  hillary clinton agrees john mccain voting give...  
3  health care reform legislation likely mandate ...  
4               economic turnaround started end term  


**Step 3 & 4: Stemming and Lemmatization 🌿**
The next logical steps are to standardize the words we did keep. Many words are inflections of a base word (e.g., "running," "ran," and "runs" are all related to "run").

`Stemming 🔪`: A crude but fast heuristic process that chops off suffixes (e.g., "consulting" → "consult").

`Lemmatization 🧠`: A more sophisticated process that uses vocabulary and morphological analysis to return the word's base or dictionary form (lemma) (e.g., "better" → "good").

Which of these two techniques (Stemming or Lemmatization) provides a more linguistically accurate result, and which is generally preferred for final production models?

`Lemmatization provides a more linguistically accurate result` and is generally preferred for advanced NLP tasks.

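`Lemmatization` → Context-sensitive, returning a canonical base form (lemma). For example, "better" → "good" when the word is tagged as an adjective (not "bett-").

`Stemming` → Crude heuristic that chops off suffixes to leave a stem, which is not always a dictionary word. For example, the Porter stemmer maps "studies" → "studi".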
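To see what a stemmer does in practice, here is a small sketch using NLTK's `PorterStemmer`, which needs no extra data downloads. The word list is illustrative.

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

# A few inflected words to reduce to their stems.
words = ["consulting", "running", "studies"]
stems = {word: stemmer.stem(word) for word in words}

for word, stem in stems.items():
    print(f"{word} -> {stem}")
```

The stems are produced by suffix rules alone, so some of them are truncated forms rather than real words.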

Since lemmatization gives us the actual base word, it helps the model group all forms of a word (e.g., "lie," "lying," "lies") into a single, meaningful feature ("lie"), provided the lemmatizer is told each word's part of speech.

Let's integrate a lemmatizer into our pipeline, working on the Statement_Clean column you created in Step 2.

In [4]:
from nltk.stem import WordNetLemmatizer
# nltk.download('wordnet') # Uncomment if running for the first time

lemmatizer = WordNetLemmatizer()

def lemmatize_text(text):
    # Split the text into individual words
    words = text.split()
    
    # Lemmatize each word
    # For simplicity, we use the default POS tag (noun), since tagging every
    # word is more expensive. Verb forms such as "started" are left unchanged.
    lemmas = [lemmatizer.lemmatize(word) for word in words]
    
    # Join the lemmas back into a single string
    return " ".join(lemmas)

# Apply lemmatization to the cleaned statements
df['Statement_Lemma'] = df['Statement_Clean'].apply(lemmatize_text)

print("\nLemmatization complete. New column 'Statement_Lemma' ready for vectorization.")


Lemmatization complete. New column 'Statement_Lemma' ready for vectorization.


**Step 5: Tokenization and Vectorization (Revisited & Finalized)**
In Step 1 we ran `TfidfVectorizer` on the raw text; we now redo that step on the cleaned text. Note that `TfidfVectorizer` performs `tokenization` (splitting text into words) internally before vectorization, and that `Stopword Removal` (Step 2) has already been handled by our own function rather than by the vectorizer.
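The vectorizer's internal tokenization can be inspected directly. The sketch below uses `build_analyzer()`, which returns the preprocessing-plus-tokenizing function the vectorizer applies to each document; the example sentence is invented.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# build_analyzer() returns the callable applied to every document:
# lowercasing followed by tokenization (and n-gram generation, if configured).
analyzer = TfidfVectorizer().build_analyzer()

tokens = analyzer("The economy is growing, says a senator.")
print(tokens)
```

With the default settings, punctuation is discarded and single-character tokens such as "a" are dropped, because the default token pattern only matches words of two or more characters.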

Let's apply the finalized feature engineering pipeline:

1. Select the cleanest text column (Statement_Lemma).

2. Vectorize this text using TfidfVectorizer.

3. Apply SMOTE (Step 1 completion).

***Why do we re-vectorize now?*** Because our original vectorization in Step 1 was on uncleaned text, and the features should now reflect the cleaned, lemmatized text.

In [5]:
# 1. Isolate X (cleaned text) and y (labels)
X_final_text = df['Statement_Lemma']
y = df['Label']

# 2. Re-initialize and vectorize the cleaned text.
# We explicitly set 'stop_words=None' because we've already done removal in Step 2/3/4 prep.
vectorizer_final = TfidfVectorizer(stop_words=None) 
X_final_vectorized = vectorizer_final.fit_transform(X_final_text)

# 3. Final Step 1: Apply SMOTE to the clean, vectorized training data
sm = SMOTE(random_state=42)
X_balanced_final, y_balanced_final = sm.fit_resample(X_final_vectorized, y)

print("\nBalanced and ready for modeling!")
print("New shape of feature matrix (X):", X_balanced_final.shape)
print("New shape of labels array (y):", y_balanced_final.shape)

print("\n--- Final Class Distribution (Balanced) ---")
print(pd.Series(y_balanced_final).value_counts())


Balanced and ready for modeling!
New shape of feature matrix (X): (12684, 10581)
New shape of labels array (y): (12684,)

--- Final Class Distribution (Balanced) ---
Label
false          2114
half-true      2114
mostly-true    2114
true           2114
barely-true    2114
pants-fire     2114
Name: count, dtype: int64


**Step 6: Word2Vec (Feature Engineering for Neural Models) 🧠**
Instead of simply training a model like Naive Bayes on the sparse vectors from TF-IDF, we use Word2Vec to create dense vector embeddings. Word2Vec learns each word's vector from the words that surround it, so words used in similar contexts end up with similar vector representations.

`For a Word2Vec implementation`, we will follow these steps:

`Tokenize for Word2Vec`: Split the cleaned text (from df['Statement_Lemma']) into a list of individual sentences, where each sentence is a list of words. This is the required input format.

`Train the Model`: Use the gensim library to train the Word2Vec model on these tokenized sentences.

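`Generate Document Vectors`: Since Naive Bayes (our final model) expects one feature vector per document (statement), we need a way to combine the individual word vectors for each statement. The simplest method is to average the vectors of all words in a statement. Note that averaged embeddings are dense and contain negative values, so a Gaussian Naive Bayes variant is the natural fit; scikit-learn's `MultinomialNB` rejects negative feature values.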
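The averaging step can be illustrated without training a model. The sketch below uses a tiny hand-written lookup table in place of a trained Word2Vec model; the vectors, the `average_vector` helper, and the tokens are all invented for illustration.

```python
import numpy as np

# Hypothetical 3-dimensional word vectors, invented for illustration.
toy_vectors = {
    "economy": np.array([1.0, 0.0, 2.0]),
    "growing": np.array([3.0, 2.0, 0.0]),
}

def average_vector(tokens, vectors, dim=3):
    """Average the vectors of known tokens; fall back to zeros if none are known."""
    known = [vectors[t] for t in tokens if t in vectors]
    if not known:
        return np.zeros(dim)
    return np.mean(known, axis=0)

# Unknown tokens are skipped; a document with no known tokens maps to zeros.
doc_vec = average_vector(["economy", "growing", "unknownword"], toy_vectors)
empty_vec = average_vector(["unknownword"], toy_vectors)
print(doc_vec)    # [2. 1. 1.]
print(empty_vec)  # [0. 0. 0.]
```

Averaging discards word order, so two statements with the same words in a different order receive the same vector.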

In [6]:
from nltk.tokenize import word_tokenize
nltk.download('punkt_tab')  # Tokenizer data for word_tokenize (a no-op if already present)

# We use the cleaned text for the best results
statements = df['Statement_Lemma'].tolist()

# Tokenize each statement into a list of words
tokenized_sentences = [word_tokenize(statement) for statement in statements]

print(f"Total statements tokenized: {len(tokenized_sentences)}")
print(f"Example statement tokens: {tokenized_sentences[0]}") 
# Expected output similar to: ['say', 'annies', 'list', 'political', 'group', 'support', 'third', 'trimester', 'abortion', 'demand']

[nltk_data] Downloading package punkt_tab to
[nltk_data]     C:\Users\kanaa/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!


Total statements tokenized: 10240
Example statement tokens: ['say', 'annies', 'list', 'political', 'group', 'support', 'third', 'trimester', 'abortion', 'demand']


In [12]:
from gensim.models import Word2Vec
import numpy as np

# 1. Train the Word2Vec model
# We'll use common parameters: 
# vector_size (embedding dimensions), window (context size), min_count (ignore rare words)
model = Word2Vec(
    sentences=tokenized_sentences, 
    vector_size=100,  # 100 dimensions for our vectors
    window=5,         # Consider 5 words before and after
    min_count=1,      # Include all words after preprocessing
    workers=4         # Use 4 worker threads
)

# 2. Function to average all word vectors in a document (statement)
def document_vector(word2vec_model, doc_tokens):
    # Filter for words present in the model's vocabulary
    vectors = [word2vec_model.wv[word] for word in doc_tokens if word in word2vec_model.wv]
    
    if not vectors:
        # Return a zero vector if no words from the document are in the model's vocab
        return np.zeros(word2vec_model.vector_size)
    
    # Average the vectors
    return np.mean(vectors, axis=0)

# 3. Generate a feature vector for every single statement
X_word2vec = np.array([document_vector(model, tokens) for tokens in tokenized_sentences])
y_w2v = df['Label'] # Our original labels

print("\nWord2Vec model trained and document vectors created.")
print(f"New feature matrix (X) shape: {X_word2vec.shape}")

ImportError: cannot import name 'triu' from 'scipy.linalg.special_matrices' (c:\Users\kanaa\OneDrive\Desktop\sem_5\DHV\project2\Fake_News_Detection_System\.venv\Lib\site-packages\scipy\linalg\special_matrices.py)
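
The cell above stops at the `gensim` import, so the model is never trained. This particular `ImportError` is commonly reported when an older gensim release is installed alongside a newer SciPy that no longer provides `triu` at the location gensim imports it from. Upgrading gensim or installing an older SciPy are the usual remedies, though the exact version bounds should be checked against the gensim release notes. As a first diagnostic step, the installed versions can be printed with the standard library; the helper name `installed_version` is made up for this sketch.

```python
from importlib.metadata import PackageNotFoundError, version

def installed_version(package):
    """Return the installed version string of a package, or None if it is missing."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None

# Print the versions of the packages involved in the failing import.
for name in ("gensim", "scipy", "numpy"):
    print(f"{name}: {installed_version(name)}")
```

After changing package versions, restart the notebook kernel before re-running the Word2Vec cell so the new versions are actually loaded.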