⚠️ The files `Fake.csv` and `True.csv` are required to run this notebook.  
Please download them from Kaggle:  
https://www.kaggle.com/datasets/clmentbisaillon/fake-and-real-news-dataset

In [8]:
import pandas as pd

# Load datasets
fake_df = pd.read_csv("Fake.csv")
true_df = pd.read_csv("True.csv")

# Label the data
fake_df['label'] = 0  # Fake
true_df['label'] = 1  # Real

# Combine and shuffle
df = pd.concat([fake_df, true_df], ignore_index=True)
df = df[['text', 'label']].sample(frac=1).reset_index(drop=True)

df.head()

| | text | label |
|---|---|---|
| 0 | SYDNEY (Reuters) - Australian Prime Minister M... | 1 |
| 1 | BEIJING (Reuters) - China s military is prepar... | 1 |
| 2 | WASHINGTON (Reuters) - The U.S. Senate Foreign... | 1 |
| 3 | Via: TMZOlivia Wilde posed for a black and whi... | 0 |
| 4 | The GOP Presidential Primary hit a new low on ... | 0 |


In [9]:
import nltk
import string
from nltk.corpus import stopwords

# Download stopwords
nltk.download('stopwords')
stop_words = set(stopwords.words('english'))

# Simple text cleaner (no stemming for speed)
def clean_text(text):
    text = text.lower()
    text = ''.join(char for char in text if char not in string.punctuation)
    words = text.split()
    words = [w for w in words if w not in stop_words]
    return ' '.join(words)

df['text_cleaned'] = df['text'].apply(clean_text)
df[['text_cleaned', 'label']].head()

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\bhara\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


| | text_cleaned | label |
|---|---|---|
| 0 | sydney reuters australian prime minister malco... | 1 |
| 1 | beijing reuters china military preparing sweep... | 1 |
| 2 | washington reuters us senate foreign relations... | 1 |
| 3 | via tmzolivia wilde posed black white selfie p... | 0 |
| 4 | gop presidential primary hit new low thursday ... | 0 |
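
As a quick sanity check of the cleaning step, the sketch below applies the same three operations to a made-up sentence. To stay self-contained it uses a tiny hand-written stopword set (`DEMO_STOP_WORDS`, an assumption standing in for NLTK's English list), and it adds a `str.translate` variant (`clean_text_fast`, a hypothetical name) that should produce the same result as the character-by-character join while usually running faster on long articles.

```python
import string

# Assumption: a tiny stand-in for NLTK's English stopword list, for illustration only.
DEMO_STOP_WORDS = {"the", "is", "a", "on", "for", "to", "of", "and"}

# Translation table that deletes every ASCII punctuation character.
_PUNCT_TABLE = str.maketrans("", "", string.punctuation)

def clean_text_demo(text, stop_words=DEMO_STOP_WORDS):
    """Same steps as clean_text: lowercase, drop punctuation, drop stopwords."""
    text = text.lower()
    text = ''.join(ch for ch in text if ch not in string.punctuation)
    return ' '.join(w for w in text.split() if w not in stop_words)

def clean_text_fast(text, stop_words=DEMO_STOP_WORDS):
    """Variant that removes punctuation in a single str.translate pass."""
    text = text.lower().translate(_PUNCT_TABLE)
    return ' '.join(w for w in text.split() if w not in stop_words)

# Made-up sample sentence, not taken from the dataset.
sample = "WASHINGTON (Reuters) - The Senate is voting on a bill, and the U.S. waits."
print(clean_text_demo(sample))
print(clean_text_fast(sample))
```

Note that stripping punctuation turns "U.S." into "us", which matches the `us senate` token visible in the cleaned output above.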


In [10]:
from sklearn.feature_extraction.text import TfidfVectorizer

tfidf = TfidfVectorizer(max_features=5000)
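# Keep the TF-IDF matrix sparse: MultinomialNB accepts sparse input,
# and a dense copy of this matrix can take well over a gigabyte of memory.
X = tfidf.fit_transform(df['text_cleaned'])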
y = df['label']
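
To make the TF-IDF step less of a black box, here is a minimal pure-Python sketch of the weighting that scikit-learn documents as the `TfidfVectorizer` default: raw term counts, a smoothed IDF of `ln((1 + n) / (1 + df)) + 1`, and L2 normalisation of each row. The three documents are made up, `tfidf_rows` is a hypothetical helper name, and whitespace splitting is a simplification of the vectorizer's real tokenizer, so treat the numbers as illustrative only.

```python
import math
from collections import Counter

# Made-up, already-cleaned documents for illustration.
docs = [
    "senate votes on budget bill",
    "senate delays budget vote",
    "celebrity selfie goes viral",
]

def tfidf_rows(docs):
    """Return (vocab, rows), where each row is an L2-normalised TF-IDF vector."""
    tokenized = [doc.split() for doc in docs]
    n_docs = len(tokenized)
    vocab = sorted({tok for doc in tokenized for tok in doc})
    # Document frequency: in how many documents each term appears.
    doc_freq = Counter(tok for doc in tokenized for tok in set(doc))
    # Smoothed IDF: terms that appear in fewer documents get a larger weight.
    idf = {tok: math.log((1 + n_docs) / (1 + doc_freq[tok])) + 1 for tok in vocab}
    rows = []
    for doc in tokenized:
        counts = Counter(doc)
        raw = [counts[tok] * idf[tok] for tok in vocab]
        norm = math.sqrt(sum(v * v for v in raw)) or 1.0  # guard against empty docs
        rows.append([v / norm for v in raw])
    return vocab, rows

vocab, rows = tfidf_rows(docs)
first = dict(zip(vocab, rows[0]))
print({tok: round(w, 3) for tok, w in first.items() if w > 0})
```

In the first document, "senate" appears in two of the three documents and therefore ends up with a lower weight than "bill", which appears in only one. The notebook's `max_features=5000` additionally restricts the vocabulary to the 5,000 most frequent terms in the corpus.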

In [11]:
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

nb_model = MultinomialNB()
nb_model.fit(X_train, y_train)

In [12]:
from sklearn.metrics import classification_report, accuracy_score, f1_score

y_pred = nb_model.predict(X_test)

print("Accuracy:", accuracy_score(y_test, y_pred))
print("F1 Score:", f1_score(y_test, y_pred))
print("\nClassification Report:\n", classification_report(y_test, y_pred))

Accuracy: 0.9478841870824053
F1 Score: 0.9453908984830806

Classification Report:
               precision    recall  f1-score   support

           0       0.96      0.94      0.95      4739
           1       0.94      0.96      0.95      4241

    accuracy                           0.95      8980
   macro avg       0.95      0.95      0.95      8980
weighted avg       0.95      0.95      0.95      8980
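
The short sketch below does some arithmetic on the numbers printed above to help interpret them. It only uses values copied from the output; the variable names are illustrative. One detail worth knowing: `f1_score` with its default settings reports F1 for the positive class only, which here is label 1 (Real).

```python
# Values copied from the evaluation output above.
accuracy = 0.9478841870824053
support = {0: 4739, 1: 4241}                 # test articles per class
precision_real, recall_real = 0.94, 0.96     # class 1 row, rounded to two decimals

n_test = sum(support.values())
n_correct = round(accuracy * n_test)         # accuracy = correct / total
n_wrong = n_test - n_correct

def f1_from(precision, recall):
    """F1 is the harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# Because the inputs are rounded, this only approximates the exact F1 printed above.
f1_real_approx = f1_from(precision_real, recall_real)

# Accuracy of always predicting the larger class, as a reference point.
majority_baseline = max(support.values()) / n_test

print(f"correct: {n_correct}, misclassified: {n_wrong}")
print(f"approximate F1 for class 1: {f1_real_approx:.3f}")
print(f"majority-class baseline accuracy: {majority_baseline:.3f}")
```

The two classes are close to balanced, so the baseline of always predicting Fake sits near 0.53, well below the model's accuracy.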



In [13]:
import joblib

# Save the trained classifier (defined above as nb_model)
joblib.dump(nb_model, 'model.pkl')
joblib.dump(tfidf, 'tfidf.pkl')

['tfidf.pkl']
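
Later code will need to load both artifacts and apply them in the same order as during training. The sketch below shows that round trip on a made-up toy corpus written to a temporary directory, so it does not touch the real `model.pkl` and `tfidf.pkl`. The `toy_*` names and texts are assumptions for illustration. With the real artifacts, new text would first go through `clean_text`.

```python
import os
import tempfile

import joblib
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

# Made-up, already-cleaned toy texts; 0 = Fake, 1 = Real, as in the notebook.
toy_texts = [
    "senate committee votes budget bill reuters",
    "prime minister announces trade policy reuters",
    "celebrity selfie shocks fans viral",
    "shocking viral video celebrity scandal",
]
toy_labels = [1, 1, 0, 0]

toy_tfidf = TfidfVectorizer()
toy_model = MultinomialNB().fit(toy_tfidf.fit_transform(toy_texts), toy_labels)

with tempfile.TemporaryDirectory() as tmp:
    model_path = os.path.join(tmp, "model.pkl")
    tfidf_path = os.path.join(tmp, "tfidf.pkl")
    joblib.dump(toy_model, model_path)
    joblib.dump(toy_tfidf, tfidf_path)

    # Load both objects back, as a separate script or app would.
    loaded_model = joblib.load(model_path)
    loaded_tfidf = joblib.load(tfidf_path)

new_text = ["senate committee votes budget bill"]
features = loaded_tfidf.transform(new_text)   # transform only; never re-fit on new data
prediction = loaded_model.predict(features)
print("Predicted label:", int(prediction[0]))
```

The key point is that the loaded vectorizer is only used with `transform`, so new text is mapped onto the vocabulary learned during training.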

## ✅ Week 1 Progress Summary

- Loaded and labeled Fake and Real news datasets from Kaggle
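- Preprocessed text with NLTK: lowercased, stripped punctuation, and removed English stopwords (no stemming)
- Converted the cleaned text to TF-IDF features, capped at 5,000 terms
- Trained a Multinomial Naive Bayes model on an 80/20 train/test split
- Evaluated on the 8,980 held-out articles: accuracy ≈ 0.948, F1 ≈ 0.945
- Saved the model and TF-IDF vectorizer with joblib (`model.pkl`, `tfidf.pkl`) for future use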

✅ Ready to build the UI with Streamlit in Week 2.
