## Problem Statement
Develop a machine learning model to classify whether a news article is real or fake based on its content using NLP techniques.
##### **Objectives:**
- Clean and preprocess text data
- Convert it into numerical format using TF-IDF
- Train a Logistic Regression classifier/ Random Forest
- Evaluate using Accuracy, Precision, Recall, F1-score

### Import Libraries

In [None]:
!pip install nltk

In [None]:
import pandas as pd
import numpy as np
import re
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

from nltk.stem.porter import PorterStemmer


### Load and Combine Data

In [None]:
# Load datasets
true_df = pd.read_csv("../data/True.csv")
fake_df = pd.read_csv("../data/Fake.csv")

In [None]:
# Add labels
true_df['label'] = 1  # Real
fake_df['label'] = 0  # Fake


In [None]:
# Combine Title and Text
data = pd.concat([true_df, fake_df], axis=0).reset_index(drop=True)
data['content'] = data['title'] + " " + data['text']

data.head()
#Both title and text are informative. Merging them into a single content column gives the model a better understanding of the article’s context.

### Preprocess the Text

In [None]:
stemmer = PorterStemmer()
stopwords_set = {
    'the', 'a', 'is', 'in', 'and', 'to', 'of', 'that', 'it', 'on',
    'was', 'he', 'she', 'for', 'with', 'as', 'by', 'at', 'from'
}

def clean_text(text):
    text = re.sub('[^a-zA-Z]', ' ', text)
    text = text.lower()
    words = text.split()
    filtered = [stemmer.stem(word) for word in words if word not in stopwords_set]
    return ' '.join(filtered)

data['cleaned_content'] = data['content'].apply(clean_text)

#This prepares the text for vectorization:
#it removes noise (punctuation, numbers),converts words to root form (stemming) and removes common, meaningless words (stopwords)

In [None]:
data.head()

### Vectorize Text (TF-IDF)

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer

# Convert cleaned text into sparse matrix (memory efficient)
vectorizer = TfidfVectorizer(max_features=5000)
X = vectorizer.fit_transform(data['cleaned_content'])  
y = data['label'].values

# TF-IDF captures word importance across the corpus.
#Many fake articles may use overused terms (e.g., "shocking", "breaking", "alert") or poor grammar.
#Real articles have formal structure and named sources.
#So, TF-IDF helps to classify those while ignoring common words so that model learns strong patterns.

### Train-Test Split

In [None]:
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)


### Model Training
Using Logistic Regression and Random Forest to test which works best

### Train Logistic Regression

In [None]:
lr_model = LogisticRegression()
lr_model.fit(X_train, y_train)
lr_pred = lr_model.predict(X_test)

lr_accuracy = accuracy_score(y_test, lr_pred)
lr_report = classification_report(y_test, lr_pred, target_names=["Fake", "Real"], output_dict=True)

print(" Logistic Regression Evaluation")
print(f"Accuracy: {lr_accuracy * 100:.2f}%\n")
print(classification_report(y_test, lr_pred, target_names=["Fake", "Real"]))

# Get confusion matrix
cm = confusion_matrix(y_test, lr_pred)

# Plot using seaborn heatmap
plt.figure(figsize=(6, 4))
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', xticklabels=["Fake", "Real"], yticklabels=["Fake", "Real"])
plt.xlabel('Predicted Label')
plt.ylabel('True Label')
plt.title(' Logistic Regression - Confusion Matrix')
plt.show()



**Insights:**
- Accuracy: 98.88%
- Fake News: 4585 correct, 65 missed
- Real News: 4294 correct, 36 wrongly marked as fake
- Total mistakes: Only 101 out of 8980
- Precision: 99% – Predictions are mostly correct
- Recall: 99% – Very few fake/real articles missed
- F1-Score: 99% – Excellent balance and consistency
##### **Conclusion:**
Logistic Regression Model is highly accurate and reliable, with very few errors in both fake and real news detection.

### Train Random Forest


In [None]:
rf_model = RandomForestClassifier(n_estimators=100, random_state=42)
rf_model.fit(X_train, y_train)
rf_pred = rf_model.predict(X_test)

rf_accuracy = accuracy_score(y_test, rf_pred)
rf_report = classification_report(y_test, rf_pred, target_names=["Fake", "Real"], output_dict=True)

print(" Random Forest Evaluation")
print(f"Accuracy: {rf_accuracy * 100:.2f}%\n")
print(classification_report(y_test, rf_pred, target_names=["Fake", "Real"]))


cm = confusion_matrix(y_test, rf_pred)

# Plot the heatmap
plt.figure(figsize=(6, 4))
sns.heatmap(cm, annot=True, fmt='d', cmap='Greens', xticklabels=["Fake", "Real"], yticklabels=["Fake", "Real"])
plt.xlabel('Predicted Label')
plt.ylabel('True Label')
plt.title('Random Forest - Confusion Matrix')
plt.show()


**Insights:**
- Accuracy: 99.65%
- Fake News: 4635 correct, 15 missed
- Real News: 4314 correct, 16 wrongly marked as fake
- Total mistakes: Only 31 out of 8980
- Precision: 100% – Predictions are almost always correct
- Recall: 100% – Very few fake/real articles missed
- F1-Score: 100% – Perfect balance and consistency
##### **Conclusion:**
The Random Forest model performs exceptionally well, with near-perfect accuracy and no significant errors in classifying fake and real news.

### Final Conclusion
Both models performed exceptionally well:

**Logistic Regression** achieved **98.88%** accuracy, is faster, and easier to interpret. It's ideal for real-time applications and works well with clean, structured text.

**Random Forest** achieved a slightly higher **99.65%** accuracy, with better handling of complex patterns and more robustness to noisy or nonlinear data, though it takes longer to train.

While the performance difference is not very large, we chose to proceed with the Random Forest model due to its higher accuracy and better generalization on more complex or potentially noisy datasets. This makes it a more reliable choice for ensuring consistent performance in varied real-world scenarios.

### Usage Example – Predict New Article

In [None]:
def predict_news(text, model, vectorizer):
    """
    Predict whether a given news article is real or fake.
    Args:
        text (str): The news article (title + content).
        model: Trained classification model.
        vectorizer: Trained TF-IDF vectorizer.
    Returns:
        str: "Real" or "Fake"
    """
    cleaned = clean_text(text)  # Apply same preprocessing
    vectorized = vectorizer.transform([cleaned]).toarray()  # Vectorize
    prediction = model.predict(vectorized)[0]  # Predict
    return "Real" if prediction == 1 else "Fake"


### Try with Example Article

In [None]:
# Example input (you can change this!)
sample_news = """
President announces new economic plan to support small businesses
and reduce interest rates amidst inflation concerns. The plan includes
tax breaks and grants over the next fiscal year.
"""

# Predict with both models

rf_result = predict_news(sample_news, rf_model, vectorizer)

print(f"Random Forest says: {rf_result}")
