**<span style="color: purple; font-size: 30px;">Fake or Real News Detection</span>**


## Methodology

I applied two common text vectorization techniques for news classification:

1. **Bag-of-Words (Unigram / “Bag 1”)**  
   - Converted each news article into a vector of word counts (after cleaning, lowercasing, and removing stopwords).  
   - This captures the frequency of words but does not consider their relative importance across documents.

2. **TF-IDF (Term Frequency – Inverse Document Frequency)**  
   - Similar to Bag-of-Words, but assigns higher weight to words that are frequent in a document but rare across the whole dataset.  
   - This helps to reduce the influence of very common words (e.g., *said*, *news*) that don’t carry much meaning for classification.

Both representations were used with the **Multinomial Naive Bayes (NB)** classifier.  
The dataset was split into 80% for training and 20% for testing.  
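
The difference between the two representations can be seen on a tiny example. The sketch below is illustrative only: the three-sentence corpus is made up and is not part of the assignment data. It compares the raw count and the TF-IDF weight of a word that appears in every document (*said*) with a word that appears in only one (*aliens*).

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Hypothetical three-document corpus, for illustration only
docs = [
    "the senate said the vote passed",
    "the reporter said the story was fake",
    "aliens said hello",
]

bow = CountVectorizer()
counts = bow.fit_transform(docs)

tfidf = TfidfVectorizer()
weights = tfidf.fit_transform(docs)

# Look at the last document: "aliens said hello"
doc_counts = counts[2].toarray().ravel()
doc_weights = weights[2].toarray().ravel()

# Bag-of-Words: both words occur once, so they get the same value
print("count  said:", doc_counts[bow.vocabulary_["said"]],
      " aliens:", doc_counts[bow.vocabulary_["aliens"]])

# TF-IDF: "said" occurs in every document, so it is weighted below "aliens"
print("tf-idf said:", round(doc_weights[tfidf.vocabulary_["said"]], 3),
      " aliens:", round(doc_weights[tfidf.vocabulary_["aliens"]], 3))
```

Both words have a count of 1 in the last document, but TF-IDF gives *aliens* the larger weight because it is rare across the corpus.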

---

**<h2 style="font-size:20px;">Approach: 1</h2>**

**Combined title + text into a single content column**

**Load and Quick Inspection**

In [1]:
# set working directory (raw string so the backslashes in the Windows path are not read as escapes)
import os
os.chdir(r'I:\Data Science and Machine Learning\Python\Python with Jupiter Notebook\Class 42 Naïve Bayse Classifications')

In [2]:
import pandas as pd
df = pd.read_csv('Assignment_Data_fake_or_real_news - Assignment_Data_fake_or_real_news.csv')
print(df.columns)
print(df.shape)
print(df['label'].value_counts())

Index(['id', 'title', 'text', 'label'], dtype='object')
(6335, 4)
label
REAL    3171
FAKE    3164
Name: count, dtype: int64


**Create a single text field, handle missing values, drop unused columns**

In [7]:
#combine title + text
df['title'] = df['title'].fillna('')
df['text'] = df['text'].fillna('')
df['content'] = (df['title']+ ' ' + df['text']).str.strip()

In [8]:
# drop rows whose content is empty, then drop the id, title and text columns
df = df[df['content'].str.strip() != '']
df = df.drop(columns=['id', 'title', 'text'], errors='ignore')

In [9]:
print(df.columns)
df

Index(['label', 'content'], dtype='object')


| | label | content |
|---|---|---|
| 0 | FAKE | You Can Smell Hillary’s Fear Daniel Greenfield... |
| 1 | FAKE | Watch The Exact Moment Paul Ryan Committed Pol... |
| 2 | REAL | Kerry to go to Paris in gesture of sympathy U.... |
| 3 | FAKE | Bernie supporters on Twitter erupt in anger ag... |
| 4 | REAL | The Battle of New York: Why This Primary Matte... |
| ... | ... | ... |
| 6330 | REAL | State Department says it can't find emails fro... |
| 6331 | FAKE | The ‘P’ in PBS Should Stand for ‘Plutocratic’ ... |
| 6332 | FAKE | Anti-Trump Protesters Are Tools of the Oligarc... |
| 6333 | REAL | In Ethiopia, Obama seeks progress on peace, se... |


**Basic Cleaning/ Normalization**

In [10]:
import re
def clean_text(s):
    s = s.lower()
    s = re.sub(r'http\S+|www\.\S+', ' ', s)   # remove urls
    s = re.sub(r'<[^>]+>', ' ', s)            # html tags
    s = re.sub(r'[^a-z0-9\s]', ' ', s)        # keep alphanumeric
    s = re.sub(r'\s+', ' ', s).strip()
    return s

df['clean'] = df['content'].apply(clean_text)
df['y'] = df['label']
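
To see what the cleaning step does, here is a small standalone check. It repeats the `clean_text` function from the cell above and runs it on two made-up strings; the sample headline and URL are hypothetical and are not taken from the dataset.

```python
import re

def clean_text(s):
    s = s.lower()
    s = re.sub(r'http\S+|www\.\S+', ' ', s)   # remove urls
    s = re.sub(r'<[^>]+>', ' ', s)            # remove html tags
    s = re.sub(r'[^a-z0-9\s]', ' ', s)        # keep only a-z, 0-9 and whitespace
    s = re.sub(r'\s+', ' ', s).strip()        # collapse repeated whitespace
    return s

# Hypothetical sample strings, for illustration only
sample = "BREAKING: <b>Senate</b> votes 52-48! Read more at https://example.com/story?id=1"
print(clean_text(sample))
# breaking senate votes 52 48 read more at

print(clean_text("It’s anybody’s guess"))
# it s anybody s guess
```

Apostrophes are replaced by spaces, which is why stray single letters such as `s` show up in the cleaned text. With its default token pattern, `CountVectorizer` only keeps tokens of two or more characters, so these leftovers do not become features.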

In [11]:
from sklearn.model_selection import train_test_split
x = df['clean']
y = df['y']
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size =0.2, stratify = y, random_state = 42)
x_train, x_test, y_train, y_test

(471     america is already strong obama continues demo...
 4825    podesta relative earned six figure fees lobbyi...
 6166    bitcoin soars as china launches crackdown on w...
 4886    nanny in jail after force feeding baby to deat...
 2646    lavrov schools european diplomats in logic usi...
                               ...                        
 90      exclusive gop campaigns plot revolt against rn...
 917     who s winning indiana it s anybody s guess the...
 1101    three ways to recharge your energy using cryst...
 1845    principal institutes ban after students wear c...
 698     michael moore joe blow will vote trump as ulti...
 Name: clean, Length: 5068, dtype: object,
 1707    israel votes netanyahu s last ditch vow to his...
 1926    the source of our rage the ruling elite is pro...
 2674    should birthright citizenship be abolished roo...
 2597    study running linked to extended lifespan and ...
 4602    10 things trump could but probably won t chang...
             

In [12]:
print(x_train.shape)
print(x_test.shape)
print(y_train.shape)
print(y_test.shape)

(5068,)
(1267,)
(5068,)
(1267,)


**Vectorization — Bag-of-Words (CountVectorizer, unigram)**

In [13]:
from sklearn.feature_extraction.text import CountVectorizer

cv = CountVectorizer(stop_words = 'english', min_df = 5, max_df = 0.95, ngram_range = (1,1))
x_train_cv = cv.fit_transform(x_train)
x_test_cv  = cv.transform(x_test)

- Stop words are frequent words like "the", "is", "in", "on", "and", "of", "to".
- min_df = 5 removes very rare tokens, those appearing in fewer than 5 documents (reduces noise).
- max_df = 0.95 removes tokens that appear in more than 95% of documents (non-informative).
- ngram_range=(1,1) uses unigrams only (“bag 1”). You can later try (1,2) to add bigrams.
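
The effect of `min_df` and `max_df` is easiest to see on a toy corpus. The sketch below uses four made-up documents and smaller thresholds than the notebook (`min_df=2`, `max_df=0.9`), chosen only so that the pruning is visible on so little data.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical four-document corpus, for illustration only
docs = [
    "election fraud claim",
    "election results certified",
    "election fraud rumor",
    "election turnout record",
]

# No pruning: every token becomes a feature
full = CountVectorizer().fit(docs)

# min_df=2 drops tokens found in fewer than 2 documents;
# max_df=0.9 drops tokens found in more than 90% of documents
pruned = CountVectorizer(min_df=2, max_df=0.9).fit(docs)

print(sorted(full.vocabulary_))
print(sorted(pruned.vocabulary_))
```

Here "election" is removed by `max_df` because it occurs in all four documents, the one-off words are removed by `min_df`, and only "fraud" survives.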

**Vectorization — TF-IDF** 

In [14]:
from sklearn.feature_extraction.text import TfidfVectorizer

tf = TfidfVectorizer(stop_words = 'english', min_df = 5, max_df = 0.95, ngram_range = (1,1))
x_train_tf = tf.fit_transform(x_train)
x_test_tf = tf.transform(x_test)

**What is MultinomialNB?**

- It’s the Multinomial Naive Bayes classifier.
- Commonly used for text classification (spam detection, fake news, sentiment, etc.).
- It is designed for count features such as Bag-of-Words word counts; fractional weights such as TF-IDF are not true counts, but often work in practice.
- The model estimates the probability of a document belonging to each class based on word frequencies.

**What does alpha=1.0 mean?**

- alpha = 1.0 → Laplace smoothing (adds 1 to all counts). This prevents zero probabilities and makes the model more robust.
- alpha = 0 → no smoothing (risky, not recommended).
- alpha < 1 → lighter smoothing.
- alpha > 1 → heavier smoothing (flattens the probability distribution more).
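
As a rough illustration of what smoothing does, the sketch below computes the smoothed word probabilities for one class by hand, as (count + alpha) / (total count + alpha × vocabulary size), and checks them against what `MultinomialNB` learns. The count matrix and the helper name `smoothed_probs` are made up for this example.

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB

# Hypothetical word-count matrix: 4 documents x 3 vocabulary terms
X = np.array([
    [2, 1, 0],   # FAKE
    [3, 0, 0],   # FAKE
    [0, 1, 2],   # REAL
    [0, 2, 3],   # REAL
])
y = np.array(["FAKE", "FAKE", "REAL", "REAL"])

def smoothed_probs(term_counts, alpha):
    """P(term | class) with additive smoothing (illustrative helper)."""
    term_counts = np.asarray(term_counts, dtype=float)
    return (term_counts + alpha) / (term_counts.sum() + alpha * term_counts.size)

fake_counts = X[y == "FAKE"].sum(axis=0)   # per-term totals over the FAKE documents

for alpha in (1.0, 0.1):
    nb = MultinomialNB(alpha=alpha).fit(X, y)
    manual = smoothed_probs(fake_counts, alpha)
    learned = np.exp(nb.feature_log_prob_[0])   # row 0 is the first class in nb.classes_
    print(alpha, np.round(manual, 3), np.allclose(manual, learned))
```

The third term never occurs in a FAKE document, yet with alpha = 1.0 it still gets a probability of 1/9 rather than zero. A smaller alpha moves that value closer to zero.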

In [15]:
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score

# Bag-of-Words (unigram) baseline, the classic “bag 1” representation
nb_cv = MultinomialNB(alpha =1.0)
nb_cv.fit(x_train_cv, y_train)
y_pred_cv = nb_cv.predict(x_test_cv)
print("BoW Accuracy:", accuracy_score(y_test, y_pred_cv))
print(classification_report(y_test, y_pred_cv))
print(confusion_matrix(y_test, y_pred_cv))

BoW Accuracy: 0.8784530386740331
              precision    recall  f1-score   support

        FAKE       0.88      0.88      0.88       633
        REAL       0.88      0.88      0.88       634

    accuracy                           0.88      1267
   macro avg       0.88      0.88      0.88      1267
weighted avg       0.88      0.88      0.88      1267

[[558  75]
 [ 79 555]]


In [16]:
#TF-IDF baseline
nb_tf = MultinomialNB(alpha =1.0)
nb_tf.fit(x_train_tf, y_train)
y_pred_tf = nb_tf.predict(x_test_tf)
print("TF-IDF Accuracy:", accuracy_score(y_test, y_pred_tf))
print(classification_report(y_test, y_pred_tf))
print(confusion_matrix(y_test, y_pred_tf))

TF-IDF Accuracy: 0.877663772691397
              precision    recall  f1-score   support

        FAKE       0.92      0.83      0.87       633
        REAL       0.84      0.93      0.88       634

    accuracy                           0.88      1267
   macro avg       0.88      0.88      0.88      1267
weighted avg       0.88      0.88      0.88      1267

[[523 110]
 [ 45 589]]



## Results

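- **Bag-of-Words + NB:** Accuracy ≈ **87.85%** (1113 of 1267 test articles correct)  
- **TF-IDF + NB:** Accuracy ≈ **87.77%** (1112 of 1267 test articles correct)

Both approaches performed almost equally well: Bag-of-Words classified exactly one more test article correctly.  
Bag-of-Words was also the more balanced of the two, with **precision and recall** of 0.88 for both classes. TF-IDF traded recall for precision on FAKE (0.92 precision, 0.83 recall) and did the reverse on REAL (0.84 precision, 0.93 recall).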
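
The per-class precision and recall in the two classification reports can be recomputed directly from the confusion matrices printed above. The helper below is a small sketch (the function name `per_class_metrics` is my own); it assumes rows are true labels and columns are predicted labels, in the order FAKE, REAL, which is how `confusion_matrix` lays them out here.

```python
def per_class_metrics(cm, labels=("FAKE", "REAL")):
    """Return {label: (precision, recall)} rounded to 2 decimals.

    cm[i][j] = number of articles with true class i predicted as class j.
    """
    metrics = {}
    for i, label in enumerate(labels):
        true_pos = cm[i][i]
        predicted_as_label = sum(row[i] for row in cm)   # column total
        actually_label = sum(cm[i])                      # row total
        precision = true_pos / predicted_as_label
        recall = true_pos / actually_label
        metrics[label] = (round(precision, 2), round(recall, 2))
    return metrics

bow_cm = [[558, 75], [79, 555]]      # from the Bag-of-Words run
tfidf_cm = [[523, 110], [45, 589]]   # from the TF-IDF run

print("BoW   :", per_class_metrics(bow_cm))
print("TF-IDF:", per_class_metrics(tfidf_cm))
```

For these matrices the result is 0.88 / 0.88 for both classes under Bag-of-Words, and 0.92 / 0.83 (FAKE) and 0.84 / 0.93 (REAL) under TF-IDF, matching the reports.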

---

## Conclusion

Using **Bag-of-Words** and **TF-IDF**, I compared classification performance.  
Accuracy was similar in both methods, showing that for this dataset, either representation works effectively with Naive Bayes.  

- Bag-of-Words gave slightly higher **accuracy** and more **balanced** precision and recall across classes.  
- TF-IDF gave higher **precision on FAKE** and higher **recall on REAL**, but mislabeled more fake articles as real (110, versus 75 for Bag-of-Words).
  
---

**<h2 style="font-size:20px;">Approach: 2</h2>**


**Using only one column (e.g., text)**

In [17]:
#load the data
data = pd.read_csv("Assignment_Data_fake_or_real_news - Assignment_Data_fake_or_real_news.csv")


In [18]:
X = data['text']
Y = data['label']

In [19]:
#Handle missing values (replace NaN with empty string)
X = X.fillna("") 


In [20]:
#split
from sklearn.model_selection import train_test_split
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size =0.2, random_state = 28)

In [21]:
#Bag-of-Words
from sklearn.feature_extraction.text import CountVectorizer

cv = CountVectorizer(stop_words = 'english', min_df = 5, max_df = 0.95, ngram_range = (1, 2)) #unigram + bigrams

X_train_cv = cv.fit_transform(X_train)
X_test_cv = cv.transform(X_test)

In [22]:
# Shapes of matrices (just to check)
print("Train shape:", X_train_cv.shape)
print("Test shape:", X_test_cv.shape)

Train shape: (5068, 57373)
Test shape: (1267, 57373)


In [23]:
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score

#Bag-of-words baseline
nb_cv = MultinomialNB(alpha =1.0)
nb_cv.fit(X_train_cv, Y_train)
Y_pred_cv = nb_cv.predict(X_test_cv)
print("BoW Accuracy:", accuracy_score(Y_test, Y_pred_cv))
print(classification_report(Y_test, Y_pred_cv))
print(confusion_matrix(Y_test, Y_pred_cv))

BoW Accuracy: 0.9234411996842936
              precision    recall  f1-score   support

        FAKE       0.93      0.92      0.92       627
        REAL       0.92      0.93      0.92       640

    accuracy                           0.92      1267
   macro avg       0.92      0.92      0.92      1267
weighted avg       0.92      0.92      0.92      1267

[[575  52]
 [ 45 595]]


**Vectorization : TF-IDF**

In [24]:
from sklearn.feature_extraction.text import TfidfVectorizer

tf = TfidfVectorizer(stop_words = 'english', min_df = 5, max_df = 0.95, ngram_range = (1,2))
X_train_tf = tf.fit_transform(X_train)
X_test_tf = tf.transform(X_test)

In [25]:
#TF-IDF baseline
nb_tf = MultinomialNB(alpha =1.0)
nb_tf.fit(X_train_tf, Y_train)
Y_pred_tf = nb_tf.predict(X_test_tf)
print("TF-IDF Accuracy:", accuracy_score(Y_test, Y_pred_tf))
print(classification_report(Y_test, Y_pred_tf))
print(confusion_matrix(Y_test, Y_pred_tf))

TF-IDF Accuracy: 0.9068666140489345
              precision    recall  f1-score   support

        FAKE       0.96      0.85      0.90       627
        REAL       0.87      0.96      0.91       640

    accuracy                           0.91      1267
   macro avg       0.91      0.91      0.91      1267
weighted avg       0.91      0.91      0.91      1267

[[533  94]
 [ 24 616]]


**<h2 style="font-size:20px;">Results and Findings</h2>**

**Why Accuracy Improved with random_state=28 and ngram_range=(1,2)**

**1. Changing random_state**

- The train/test split depends on the random seed.
- Using random_state=28 instead of 42 created a different split of the data.
- Sometimes, the new split makes the test set easier (less ambiguous, better balanced), which leads to higher accuracy.
- ⚠️ This doesn’t mean the model itself is universally better, just that this split worked better.
- ⚠️ Approach 2 also differs in other ways: it uses only the text column, skips clean_text, and does not stratify the split. The gain therefore cannot be attributed to any single change.
- ✅ To get a more reliable estimate, researchers often use **cross-validation** instead of relying on one split.
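
A minimal sketch of how cross-validation could be set up is shown below. It was not run as part of this notebook: the twelve-sentence corpus is a made-up stand-in for `data['text']` and `data['label']`, and the choice of 3 folds is arbitrary. On the real data, the same pipeline could be passed `X` and `Y` with more folds.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical toy corpus, for illustration only
texts = [
    "shocking secret cure doctors hate",
    "you will not believe this shocking trick",
    "secret plot exposed by anonymous insider",
    "miracle cure exposed shocking truth",
    "anonymous insider reveals secret trick",
    "believe this miracle plot",
    "senate passes budget bill after debate",
    "court rules on election law appeal",
    "governor signs budget bill into law",
    "senate committee schedules debate on appeal",
    "election officials certify results after court ruling",
    "governor addresses senate on budget",
]
labels = ["FAKE"] * 6 + ["REAL"] * 6

# Vectorizer and classifier in one pipeline, so the vocabulary is
# rebuilt from the training part of every fold
pipeline = make_pipeline(
    CountVectorizer(stop_words="english", ngram_range=(1, 2)),
    MultinomialNB(alpha=1.0),
)

folds = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)
scores = cross_val_score(pipeline, texts, labels, cv=folds, scoring="accuracy")

print("fold accuracies:", scores)
print("mean:", scores.mean(), "std:", scores.std())
```

Reporting the mean and spread over several folds gives a less seed-dependent estimate than a single 80/20 split.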


**2. Changing ngram_range=(1,2) instead of (1,1)**

- (1,1) → uses only unigrams (single words).
- (1,2) → uses unigrams + bigrams (word pairs).
- Bigrams capture context and phrases that single words can miss: **"fake news", "breaking story", "white house"**.
- These phrases often carry stronger signals in tasks like fake/real news detection.
- ✅ Adding bigrams often improves accuracy because the model can learn from **short phrases, not just isolated words.**
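
To make the difference concrete, the sketch below lists the features that `CountVectorizer` extracts from a single made-up sentence with each setting. The sentence is hypothetical and chosen to contain two of the phrases mentioned above.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical one-sentence corpus, for illustration only
doc = ["white house denies fake news report"]

unigram_vec = CountVectorizer(ngram_range=(1, 1)).fit(doc)
uni_bi_vec = CountVectorizer(ngram_range=(1, 2)).fit(doc)

uni_terms = sorted(unigram_vec.vocabulary_)
uni_bi_terms = sorted(uni_bi_vec.vocabulary_)

print(len(uni_terms), uni_terms)
print(len(uni_bi_terms), uni_bi_terms)
```

The unigram setting yields 6 features, while (1,2) yields 11: the same 6 words plus 5 adjacent pairs such as "white house" and "fake news". When `stop_words` is set, as in the notebook, stop words are removed before the pairs are formed, so a bigram can join two words that were separated by a stop word in the original text.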



**<h2 style="font-size:20px;">Conclusion</h2>**

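- BoW with bigrams: ~92.3% accuracy
- TF-IDF with bigrams: ~90.7% accuracy
- Both scores are higher than the unigram results from Approach 1 (~87.8%), although the two approaches also differ in the split, the input column, and the cleaning applied.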


---