<div style="color:red; font-size:16px; font-weight:bold;">
⚠️ IMPORTANT INSTRUCTIONS:<br><br>
1. To run this code at a reasonable speed, enable <b>GPU</b> (the code falls back to CPU if no GPU is available, but emotion scoring will be slower).<br>
   Go to → <b>Settings → Accelerator → GPU (T4 x2)</b> before starting the notebook.<br><br>
2. Make sure the dataset is attached:<br>
   Go to → <b>Input → Add Input → Search for "fake-real"</b> and attach it to this notebook.
</div>


# NLP Pipeline for Fake Text Detection (82% Accuracy)

## Steps:
1. **Sentiment Result**  
2. **Keywords Matching and Counting**  
3. **Character Length**  
4. **Prepare the Train File as Result of Feature Engineering**  
5. **Comparison of Various ML Models**  
6. **Picking the Best Model with High Accuracy**  
7. **Predicting the Real or Fake Text in Test Folder**  

---

## Project Overview
This pipeline demonstrates how **NLP-based feature engineering + ML classification** can effectively detect fake vs. real news by combining:  
- **Semantic signals** (sentiment/emotion)  
- **Lexical signals** (keywords)  
- **Structural signals** (text length)  

These features enable interpretable machine learning models to separate real news from fake news with competitive accuracy.  


# Step 1-4: Feature Engineering (Train File)

In this step, we transform the raw text files (`file_1.txt` and `file_2.txt`) into structured numerical features.  
The following 3 types of features are extracted for each text:

## 1. Sentiment Result  
We apply a pretrained **emotion classification model**:  
`j-hartmann/emotion-english-distilroberta-base`  

- Each text (`file_1.txt`, `file_2.txt`) is passed through this model.  
- The model outputs a **probability distribution across seven emotions** (anger, disgust, fear, joy, neutral, sadness, surprise).  
- Texts longer than 512 tokens are split into overlapping windows, and the per-window probabilities are **averaged** to get one score per emotion, such as:  
  - `T1_joy`, `T1_fear`, `T1_sadness`, etc.  
  - `T2_joy`, `T2_fear`, `T2_sadness`, etc.  
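
The averaging step can be sketched without the model itself. The snippet below is a minimal illustration, assuming each window has already produced one logit per emotion; the logit values are made up, the helper names `softmax` and `average_window_probs` are hypothetical, and the label order follows the column order shown in the notebook output.

```python
import math

# Label order as it appears in the feature columns of the notebook output.
LABELS = ["anger", "disgust", "fear", "joy", "neutral", "sadness", "surprise"]

def softmax(logits):
    """Convert one window's logits into probabilities that sum to 1."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def average_window_probs(window_logits, labels):
    """Average the per-window probability distributions label by label."""
    probs = [softmax(row) for row in window_logits]
    n = len(probs)
    return {label: sum(p[i] for p in probs) / n for i, label in enumerate(labels)}

# Made-up logits for a text that was split into two windows.
window_logits = [
    [0.1, -1.0, -0.5, 0.3, 2.5, -0.8, 0.0],
    [0.0, -0.7, -0.2, 1.9, 1.0, -1.1, 0.4],
]
scores = average_window_probs(window_logits, LABELS)
features = {f"T1_{label}": round(value, 4) for label, value in scores.items()}
print(features)
```

Because every window contributes a distribution that sums to 1, the averaged scores also sum to 1, so the seven emotion features of one text are not independent of each other.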

---

## 2. Keywords Matching and Counting  
A curated **keyword list** (e.g., *china, dinosaur, secrets, sentiment*) is used.  
- The number of matches for each text is counted.  
- Features:  
  - `T1_KCount`, `T2_KCount`  

This provides **lexical cues**, as fake and real news may differ in word choice.  
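
As a small illustration of the counting rule, the sketch below uses a four-word subset of the keyword list and a made-up sentence. It assumes whole-word, case-insensitive matching with `\b` boundaries, as in the extraction script; `keyword_count` is a hypothetical helper name.

```python
import re

# Illustrative subset of the keyword list.
KEYWORDS = ["china", "dinosaur", "secrets", "sentiment"]

# One alternation wrapped in word boundaries, so only whole words match.
PATTERN = re.compile(
    r"\b(" + "|".join(re.escape(k) for k in KEYWORDS) + r")\b",
    re.IGNORECASE,
)

def keyword_count(text):
    """Count whole-word, case-insensitive keyword occurrences in text."""
    return len(PATTERN.findall(text))

sample = "China's dinosaur fossils hold secrets; the dinosaurs themselves do not."
count = keyword_count(sample)
print(count)
```

With word boundaries, "dinosaurs" does not match the entry `dinosaur`, which is why the full list contains both the singular and the plural form.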

---

## 3. Character Length  
We compute the **character length** of each text.  
- Features:  
  - `T1_Length`, `T2_Length`  

This captures **structural differences** (fake articles may be shorter/longer or padded with irrelevant content).  

---

## 4. Preparing the Train File  
All engineered features are combined into a single structured dataset:  
- Each row = one article pair.  
- Columns = emotion scores, keyword counts, text lengths.  
- Label = `real_text_id` (1 or 2, indicating which of the two texts is the real one).  

The dataset is saved as:  
**`train_features_combined.csv`**  
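
The row layout can be sketched as follows. This is a simplified illustration rather than the notebook's exact code: `build_row` is a hypothetical helper, the feature values are made up, and only two of the seven emotion columns are shown per text.

```python
def build_row(pair_id, t1_features, t2_features, real_text_id=None):
    """Flatten the features of one text pair into a single row dict."""
    row = {"id": pair_id}
    for name, value in t1_features.items():
        row[f"T1_{name}"] = value
    for name, value in t2_features.items():
        row[f"T2_{name}"] = value
    if real_text_id is not None:  # test rows have no label
        row["real_text_id"] = real_text_id
    return row

# Made-up feature values for one pair.
t1 = {"joy": 0.018, "neutral": 0.93, "KCount": 0, "Length": 2196}
t2 = {"joy": 0.183, "neutral": 0.63, "KCount": 2, "Length": 2010}

row = build_row(0, t1, t2, real_text_id=1)
print(list(row))
```

Keeping the label as the last key mirrors the column order of the saved file, where `real_text_id` is moved to the end.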

In [1]:
# =====================================
# Combined Feature Extraction Script
# =====================================

import os
import re
import torch
import numpy as np
import pandas as pd
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# ======================
# 1. Setup
# ======================
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Using device: {device}")
torch.set_grad_enabled(False)

# Paths
train_csv_path = "/kaggle/input/fake-or-real-the-impostor-hunt/data/train.csv"
train_folder = "/kaggle/input/fake-or-real-the-impostor-hunt/data/train"

# Load train.csv with labels
df_train_csv = pd.read_csv(train_csv_path)  # must contain ["id","real_text_id"]
train_id_map = dict(zip(df_train_csv["id"], df_train_csv["real_text_id"]))

# ======================
# 2. Setup Sentiment Model
# ======================
model_name = "j-hartmann/emotion-english-distilroberta-base"
tokenizer = AutoTokenizer.from_pretrained(model_name, use_fast=True)
model = AutoModelForSequenceClassification.from_pretrained(model_name).to(device).eval()

# Label mapping
id2label = {int(k): v for k, v in model.config.id2label.items()}
labels = [id2label[i] for i in sorted(id2label.keys())]

def emotion_scores(text: str, max_length=512, stride=64, batch_size=16) -> dict:
    """Return average emotion distribution for text"""
    if not text or text.strip() == "":
        return {label: 0.0 for label in labels}

    enc = tokenizer(
        text,
        return_tensors="pt",
        truncation=True,
        max_length=max_length,
        stride=stride,
        return_overflowing_tokens=True,
        padding="max_length"
    )

    input_ids = enc["input_ids"]
    attention_mask = enc["attention_mask"]

    probs_list = []
    for start in range(0, input_ids.size(0), batch_size):
        end = start + batch_size
        batch_ids = input_ids[start:end].to(device)
        batch_mask = attention_mask[start:end].to(device)

        with torch.no_grad():
            outputs = model(input_ids=batch_ids, attention_mask=batch_mask)
            probs = torch.softmax(outputs.logits, dim=-1).cpu().numpy()

        probs_list.append(probs)

    probs_all = np.concatenate(probs_list, axis=0)
    avg_probs = np.mean(probs_all, axis=0)

    return {labels[i]: float(avg_probs[i]) for i in range(len(labels))}

# ======================
# 3. Keyword setup
# ======================
keywords = [
    "china", "dinosaur", "chinese", "dinosaurs", "agency", "true", "ton", "secrets",
    "think", "product", "removing", "fine", "sentiment", "unsettling", "empty",
    "null", "emergence", "ordin", "inspector", "landscape", "recommend", "mult"
]
pattern_combined = re.compile(r'\b(' + '|'.join(keywords) + r')\b', re.IGNORECASE)

def keyword_and_length(text: str):
    """Return keyword count and character length"""
    return len(pattern_combined.findall(text)), len(text)

# ======================
# 4. Process Training Data
# ======================
rows = []

folders = [d for d in sorted(os.listdir(train_folder)) if os.path.isdir(os.path.join(train_folder, d))]
for folder in folders:
    fpath = os.path.join(train_folder, folder)
    f1 = os.path.join(fpath, "file_1.txt")
    f2 = os.path.join(fpath, "file_2.txt")
    if not (os.path.exists(f1) and os.path.exists(f2)):
        continue

    try:
        with open(f1, "r", encoding="utf-8", errors="ignore") as fh:
            t1 = fh.read()
        with open(f2, "r", encoding="utf-8", errors="ignore") as fh:
            t2 = fh.read()
    except Exception:
        continue

    # Sentiment/emotion scores
    t1_scores = emotion_scores(t1)
    t2_scores = emotion_scores(t2)

    # Keyword count + length
    t1_kcount, t1_len = keyword_and_length(t1)
    t2_kcount, t2_len = keyword_and_length(t2)

    # Article id
    try:
        art_id = int(folder.split("_")[1])
    except Exception:
        art_id = folder

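    # NOTE: ids missing from train.csv fall back to label 0, which is not a valid class (labels are 1 or 2)
    row = {"id": art_id, "real_text_id": train_id_map.get(art_id, 0)}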

    # Add T1 features
    for label in labels:
        row[f"T1_{label}"] = round(t1_scores[label], 4)
    row["T1_KCount"] = t1_kcount
    row["T1_Length"] = t1_len

    # Add T2 features
    for label in labels:
        row[f"T2_{label}"] = round(t2_scores[label], 4)
    row["T2_KCount"] = t2_kcount
    row["T2_Length"] = t2_len

    rows.append(row)

# ======================
# 5. Final DataFrame
# ======================
df_results = pd.DataFrame(rows).sort_values("id").reset_index(drop=True)

# Move real_text_id column to the end
cols = [c for c in df_results.columns if c != "real_text_id"] + ["real_text_id"]
df_results = df_results[cols]

pd.set_option("display.float_format", lambda x: f"{x:.4f}")
print(df_results.head())

# Save
df_results.to_csv("train_features_combined.csv", index=False)
print("\nSaved combined features to train_features_combined.csv")
print(f"Total train articles processed: {len(df_results)}")

Using device: cuda



   id  T1_anger  T1_disgust  T1_fear  T1_joy  T1_neutral  T1_sadness  \
0   0    0.0062      0.0030   0.0013  0.0180      0.9301      0.0029   
1   1    0.0175      0.0099   0.0031  0.0072      0.9460      0.0024   
2   2    0.0146      0.0206   0.0077  0.0171      0.9025      0.0047   
3   3    0.0185      0.0090   0.0173  0.0154      0.8974      0.0054   
4   4    0.0149      0.0039   0.0025  0.0106      0.0171      0.0075   

   T1_surprise  T1_KCount  T1_Length  T2_anger  T2_disgust  T2_fear  T2_joy  \
0       0.0385          0       2196    0.0228      0.0031   0.0049  0.1834   
1       0.0140          7       3124    0.0096      0.0114   0.0040  0.0114   
2       0.0328          0       1139    0.0220      0.0165   0.0274  0.0103   
3       0.0370         10       1774    0.0083      0.0084   0.0056  0.0122   
4       0.9435          3        195    0.0133      0.0146   0.0073  0.0024   

   T2_neutral  T2_sadness  T2_surprise  T2_KCount  T2_Length  real_text_id  
0      0.6317  

# Step 1-4: Feature Engineering (Test File)
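We apply the same feature extraction to the text pairs in the test folder. The test set has no labels, so the output has no `real_text_id` column. The features are saved as `test_features_combined.csv`.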
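
Since the model is later applied to the test features by column name, it can be worth checking that both files expose the same feature columns in the same order. The sketch below shows one way to do that on plain lists of column names; `feature_columns` and `check_same_features` are hypothetical helpers, and the column lists are shortened for illustration.

```python
def feature_columns(columns):
    """Keep only the engineered feature columns (T1_* and T2_*)."""
    return [c for c in columns if c.startswith(("T1_", "T2_"))]

def check_same_features(train_columns, test_columns):
    """Raise if train and test do not share identical, identically ordered features."""
    train_feats = feature_columns(train_columns)
    test_feats = feature_columns(test_columns)
    if train_feats != test_feats:
        missing = sorted(set(train_feats) - set(test_feats))
        extra = sorted(set(test_feats) - set(train_feats))
        raise ValueError(f"feature mismatch: missing={missing}, extra={extra}")
    return train_feats

# Shortened column lists for illustration.
train_cols = ["id", "T1_joy", "T1_KCount", "T1_Length", "T2_joy", "T2_KCount", "T2_Length", "real_text_id"]
test_cols = ["id", "T1_joy", "T1_KCount", "T1_Length", "T2_joy", "T2_KCount", "T2_Length"]

shared = check_same_features(train_cols, test_cols)
print(shared)
```

In the notebook, the two lists would come from the `columns` attribute of the train and test DataFrames.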

In [2]:
# =====================================
# Combined Feature Extraction Script (Test)
# =====================================

import os
import re
import torch
import numpy as np
import pandas as pd
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# ======================
# 1. Setup
# ======================
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Using device: {device}")
torch.set_grad_enabled(False)

# Paths
test_folder = "/kaggle/input/fake-or-real-the-impostor-hunt/data/test"

# ======================
# 2. Setup Sentiment Model
# ======================
model_name = "j-hartmann/emotion-english-distilroberta-base"
tokenizer = AutoTokenizer.from_pretrained(model_name, use_fast=True)
model = AutoModelForSequenceClassification.from_pretrained(model_name).to(device).eval()

# Label mapping
id2label = {int(k): v for k, v in model.config.id2label.items()}
labels = [id2label[i] for i in sorted(id2label.keys())]

def emotion_scores(text: str, max_length=512, stride=64, batch_size=16) -> dict:
    """Return average emotion distribution for text"""
    if not text or text.strip() == "":
        return {label: 0.0 for label in labels}

    enc = tokenizer(
        text,
        return_tensors="pt",
        truncation=True,
        max_length=max_length,
        stride=stride,
        return_overflowing_tokens=True,
        padding="max_length"
    )

    input_ids = enc["input_ids"]
    attention_mask = enc["attention_mask"]

    probs_list = []
    for start in range(0, input_ids.size(0), batch_size):
        end = start + batch_size
        batch_ids = input_ids[start:end].to(device)
        batch_mask = attention_mask[start:end].to(device)

        with torch.no_grad():
            outputs = model(input_ids=batch_ids, attention_mask=batch_mask)
            probs = torch.softmax(outputs.logits, dim=-1).cpu().numpy()

        probs_list.append(probs)

    probs_all = np.concatenate(probs_list, axis=0)
    avg_probs = np.mean(probs_all, axis=0)

    return {labels[i]: float(avg_probs[i]) for i in range(len(labels))}

# ======================
# 3. Keyword setup
# ======================
keywords = [
    "china", "dinosaur", "chinese", "dinosaurs", "agency", "true", "ton", "secrets",
    "think", "product", "removing", "fine", "sentiment", "unsettling", "empty",
    "null", "emergence", "ordin", "inspector", "landscape", "recommend", "mult"
]
pattern_combined = re.compile(r'\b(' + '|'.join(keywords) + r')\b', re.IGNORECASE)

def keyword_and_length(text: str):
    """Return keyword count and character length"""
    return len(pattern_combined.findall(text)), len(text)

# ======================
# 4. Process Test Data
# ======================
rows = []

folders = [d for d in sorted(os.listdir(test_folder)) if os.path.isdir(os.path.join(test_folder, d))]
for folder in folders:
    fpath = os.path.join(test_folder, folder)
    f1 = os.path.join(fpath, "file_1.txt")
    f2 = os.path.join(fpath, "file_2.txt")
    if not (os.path.exists(f1) and os.path.exists(f2)):
        continue

    try:
        with open(f1, "r", encoding="utf-8", errors="ignore") as fh:
            t1 = fh.read()
        with open(f2, "r", encoding="utf-8", errors="ignore") as fh:
            t2 = fh.read()
    except Exception:
        continue

    # Sentiment/emotion scores
    t1_scores = emotion_scores(t1)
    t2_scores = emotion_scores(t2)

    # Keyword count + length
    t1_kcount, t1_len = keyword_and_length(t1)
    t2_kcount, t2_len = keyword_and_length(t2)

    # Article id
    try:
        art_id = int(folder.split("_")[1])
    except Exception:
        art_id = folder

    row = {"id": art_id}

    # Add T1 features
    for label in labels:
        row[f"T1_{label}"] = round(t1_scores[label], 4)
    row["T1_KCount"] = t1_kcount
    row["T1_Length"] = t1_len

    # Add T2 features
    for label in labels:
        row[f"T2_{label}"] = round(t2_scores[label], 4)
    row["T2_KCount"] = t2_kcount
    row["T2_Length"] = t2_len

    rows.append(row)

# ======================
# 5. Final DataFrame
# ======================
df_results = pd.DataFrame(rows).sort_values("id").reset_index(drop=True)

pd.set_option("display.float_format", lambda x: f"{x:.4f}")
print(df_results.head())

# Save
df_results.to_csv("test_features_combined.csv", index=False)
print("\nSaved combined features to test_features_combined.csv")
print(f"Total test articles processed: {len(df_results)}")


Using device: cuda
   id  T1_anger  T1_disgust  T1_fear  T1_joy  T1_neutral  T1_sadness  \
0   0    0.0136      0.0029   0.0034  0.1750      0.5721      0.0044   
1   1    0.0531      0.0994   0.0331  0.0225      0.7456      0.0089   
2   2    0.0148      0.0302   0.0099  0.0080      0.8783      0.0051   
3   3    0.0668      0.0130   0.0321  0.0090      0.8370      0.0043   
4   4    0.0135      0.0092   0.0021  0.0084      0.9349      0.0018   

   T1_surprise  T1_KCount  T1_Length  T2_anger  T2_disgust  T2_fear  T2_joy  \
0       0.2285          1       1710    0.0235      0.0082   0.0100  0.3108   
1       0.0375          0       1168    0.0168      0.0154   0.0100  0.0069   
2       0.0538          0        752    0.0140      0.0412   0.0042  0.0030   
3       0.0380          0       1223    0.0101      0.0042   0.0052  0.0337   
4       0.0302          8       1271    0.0070      0.0054   0.0019  0.0136   

   T2_neutral  T2_sadness  T2_surprise  T2_KCount  T2_Length  
0      0.5

# Step 5: Comparison of Various ML Models

We train and evaluate multiple ML models:  

- Logistic Regression  
- Random Forest  
- Gradient Boosting  
- Support Vector Machines (SVM)  
- K-Nearest Neighbors (KNN)  

Evaluation uses **5-fold stratified cross-validation accuracy** on the training split (80% of the data) and **validation accuracy** on the held-out split (20%).  
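
To make the reported columns concrete, the sketch below shows how a mean and standard deviation over fold accuracies are computed. The fold accuracies are made up, and `accuracy` and `summarize_folds` are hypothetical helpers. It uses the population standard deviation, which is what NumPy's `std()` returns by default.

```python
from statistics import mean, pstdev

def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    hits = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return hits / len(y_true)

def summarize_folds(fold_accuracies):
    """Return (mean, population std) of per-fold accuracies, rounded to 4 places."""
    return round(mean(fold_accuracies), 4), round(pstdev(fold_accuracies), 4)

# Made-up accuracies for five folds.
fold_accuracies = [0.80, 0.90, 0.85, 0.75, 0.95]
cv_mean, cv_std = summarize_folds(fold_accuracies)
print(cv_mean, cv_std)

# Accuracy on a tiny made-up fold: three of four labels are correct.
print(accuracy([1, 2, 2, 1], [1, 2, 1, 1]))
```

A large standard deviation relative to the gap between two models' means is a sign that the ranking could change with a different split.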

In [3]:
# =====================================
# Train & Evaluate Multiple Models
# =====================================

import pandas as pd
from sklearn.model_selection import train_test_split, cross_val_score, StratifiedKFold
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
import warnings
warnings.filterwarnings("ignore")

# ======================
# Load Train Features
# ======================
df = pd.read_csv("/kaggle/working/train_features_combined.csv")

# Features (all T1_ and T2_ columns) and target
features = [col for col in df.columns if col.startswith("T1_") or col.startswith("T2_")]
X = df[features].values
y = df["real_text_id"].values

# ======================
# Train-validation split (stratified)
# ======================
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# ======================
# Feature scaling
# ======================
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_val_scaled = scaler.transform(X_val)

# ======================
# Define models
# ======================
models = {
    "LogisticRegression": LogisticRegression(max_iter=500, random_state=42),
    "RandomForest": RandomForestClassifier(n_estimators=200, random_state=42),
    "GradientBoosting": GradientBoostingClassifier(n_estimators=200, random_state=42),
    "SVM": SVC(kernel="rbf", probability=True, random_state=42),
    "KNN": KNeighborsClassifier(n_neighbors=5)
}

# ======================
# Cross-validation and evaluation
# ======================
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
results = []

for name, model in models.items():
    # Cross-validation accuracy
    # (the scaler was fit on the full training split beforehand, so folds share scaling statistics)
    cv_scores = cross_val_score(model, X_train_scaled, y_train, cv=cv, scoring="accuracy")
    
    # Train on training set
    model.fit(X_train_scaled, y_train)
    y_pred = model.predict(X_val_scaled)
    val_acc = accuracy_score(y_val, y_pred)
    
    results.append({
        "Model": name,
        "CV_Accuracy_Mean": round(cv_scores.mean(), 4),
        "CV_Accuracy_STD": round(cv_scores.std(), 4),
        "Validation_Accuracy": round(val_acc, 4)
    })

# ======================
# Display results
# ======================
df_results = pd.DataFrame(results).sort_values("Validation_Accuracy", ascending=False)
print(df_results.to_string(index=False))

             Model  CV_Accuracy_Mean  CV_Accuracy_STD  Validation_Accuracy
LogisticRegression            0.8425           0.0517               1.0000
      RandomForest            0.8950           0.0526               0.8947
               SVM            0.7625           0.0696               0.8947
  GradientBoosting            0.8158           0.0650               0.7895
               KNN            0.8033           0.0921               0.7368


### 📌 Interpretation:
- **RandomForest** achieved the **highest mean cross-validation accuracy (89.5%)** and a matching validation accuracy (89.47%), so its two estimates agree.  
- **Logistic Regression** reached **100% validation accuracy**, but its cross-validation accuracy was lower (84.25%). The reported validation accuracies are all multiples of 1/19, which is consistent with a held-out set of only 19 pairs, so a perfect score there can easily come from a favorable split and should not be read as evidence of better generalization.  
- **KNN** and **GradientBoosting** had the lowest validation accuracy (73.68% and 78.95%), and **SVM** had the lowest cross-validation mean (76.25%).  
- The fold-to-fold standard deviation is 5 to 9 points for every model, so the gaps between models are modest relative to the noise.  

### ✅ Best Model Choice:
- We select **RandomForest** as the final model because it has the highest cross-validation mean and its validation accuracy is consistent with it.  
- As a tree ensemble, it can also model non-linear interactions between the sentiment, keyword, and length features, which may help in competition submissions.  

👉 Therefore, we proceed with **RandomForest** for training and predicting the real vs fake texts.
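
To get a feel for how much a validation accuracy on a small held-out set can move, the sketch below computes an approximate 95% Wilson score interval. It assumes the held-out set contains 19 pairs, which is an inference from the reported accuracies being multiples of 1/19 and is not stated in the notebook; `wilson_interval` is a hypothetical helper.

```python
import math

def wilson_interval(correct, total, z=1.96):
    """Approximate 95% Wilson score interval for a proportion correct/total."""
    p = correct / total
    denom = 1 + z * z / total
    center = (p + z * z / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z * z / (4 * total * total)) / denom
    return max(0.0, center - half), min(1.0, center + half)

N_VAL = 19  # assumption: size of the held-out validation set

# Correct counts implied by the reported accuracies (19/19, 17/19, 14/19).
for name, correct in [("LogisticRegression", 19), ("RandomForest", 17), ("KNN", 14)]:
    low, high = wilson_interval(correct, N_VAL)
    print(f"{name}: {correct}/{N_VAL} correct, interval {low:.2f} to {high:.2f}")
```

The intervals for the top models overlap heavily at this sample size, which is one reason to weigh the cross-validation results more than a single validation score.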


# Step 6: Training and Tuning the RandomForest Model

In [4]:
import pandas as pd
from sklearn.model_selection import train_test_split, RandomizedSearchCV
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
import joblib
from scipy.stats import randint

# ======================
# Load dataset
# ======================
file_path = "/kaggle/working/train_features_combined.csv"
df = pd.read_csv(file_path)
df = df.loc[:, ~df.columns.str.contains('^Unnamed')]

# ======================
# Features and target
# ======================
feature_cols = [col for col in df.columns if col.startswith("T1_") or col.startswith("T2_")]
X = df[feature_cols]
y = df["real_text_id"]

# ======================
# Train-validation split
# ======================
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# ======================
# Pipeline: imputer + scaler + Random Forest
# ======================
pipeline = Pipeline([
    ("imputer", SimpleImputer(strategy="median")),
    ("scaler", StandardScaler()),
    ("clf", RandomForestClassifier(random_state=42, n_jobs=-1))
])

# ======================
# Hyperparameter tuning using Randomized Search
# ======================
param_distributions = {
    'clf__n_estimators': randint(100, 500),
    'clf__max_depth': randint(10, 50),
    'clf__min_samples_split': randint(2, 11),
    'clf__min_samples_leaf': randint(1, 5)
}

random_search = RandomizedSearchCV(
    pipeline,
    param_distributions=param_distributions,
    n_iter=50,  # Number of random combinations to try
    cv=5,       # 5-fold cross-validation
    scoring='accuracy',
    random_state=42,
    n_jobs=-1,
    verbose=1
)

# Perform the search
print("Starting hyperparameter tuning...")
random_search.fit(X_train, y_train)

# ======================
# Get the best model
# ======================
best_pipeline = random_search.best_estimator_

# ======================
# Validation accuracy of the best model
# ======================
val_acc = best_pipeline.score(X_val, y_val)
print("\nHyperparameter tuning complete.")
print(f"Best parameters: {random_search.best_params_}")
print(f"Best cross-validation score: {random_search.best_score_:.4f}")
print(f"Validation Accuracy with Best Model: {val_acc:.4f}")

# ======================
# Save the best model
# ======================
model_filename = "random_forest_emotion_model_tuned.joblib"
joblib.dump(best_pipeline, model_filename)
print(f"\nModel saved as {model_filename}")

Starting hyperparameter tuning...
Fitting 5 folds for each of 50 candidates, totalling 250 fits

Hyperparameter tuning complete.
Best parameters: {'clf__max_depth': 48, 'clf__min_samples_leaf': 4, 'clf__min_samples_split': 9, 'clf__n_estimators': 288}
Best cross-validation score: 0.8683
Validation Accuracy with Best Model: 0.8947

Model saved as random_forest_emotion_model_tuned.joblib


# Step 7: Prediction on Test Data  
The test features produced earlier by the same **feature engineering pipeline** (`test_features_combined.csv`, built from `/data/test`) are loaded here.  

The selected model predicts which text (`file_1.txt` or `file_2.txt`) is the **real article**.  
The results are then formatted into a Kaggle submission file.
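
Before writing the file, it can help to validate the predictions. The sketch below checks that every label is 1 or 2 and writes a two-column CSV to any file-like object. `write_submission` is a hypothetical helper, and the default column name simply mirrors the one used in this notebook; it should be adjusted to whatever the competition's sample submission specifies.

```python
import csv
import io

def write_submission(ids, predictions, out, label_column="predicted_real_text_id"):
    """Validate predictions and write id/label rows as CSV to a file-like object."""
    if len(ids) != len(predictions):
        raise ValueError("ids and predictions must have the same length")
    bad = [p for p in predictions if p not in (1, 2)]
    if bad:
        raise ValueError(f"unexpected labels: {sorted(set(bad))}")
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["id", label_column])
    for pair_id, pred in zip(ids, predictions):
        writer.writerow([pair_id, pred])

# Made-up ids and predictions, written to an in-memory buffer.
buffer = io.StringIO()
write_submission([0, 1, 2], [2, 2, 1], buffer)
print(buffer.getvalue())
```

Passing an open file handle instead of the `StringIO` buffer writes the same content to disk.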

In [5]:
import pandas as pd
import joblib

# ======================
# Load trained Random Forest model
# ======================
pipeline = joblib.load("random_forest_emotion_model_tuned.joblib")

# ======================
# Load test dataset
# ======================
test_file = "/kaggle/working/test_features_combined.csv"  # Adjusted path
df_test = pd.read_csv(test_file)

# Drop any unnamed columns (if exist)
df_test = df_test.loc[:, ~df_test.columns.str.contains('^Unnamed')]

# ======================
# Define feature columns (same as training)
# ======================
feature_cols = [col for col in df_test.columns if col.startswith("T1_") or col.startswith("T2_")]

# Extract test features
X_test = df_test[feature_cols]

# Keep IDs for output
test_ids = df_test["id"]

# ======================
# Predict real_text_id (1 or 2)
# ======================
predictions = pipeline.predict(X_test)

# ======================
# Prepare output DataFrame
# ======================
df_output = pd.DataFrame({
    "id": test_ids,
    "predicted_real_text_id": predictions
})

print(df_output.head())

# ======================
# Save predictions
# ======================
df_output.to_csv("Submission.csv", index=False)
print("Predictions saved to Submission.csv")


   id  predicted_real_text_id
0   0                       2
1   1                       2
2   2                       1
3   3                       1
4   4                       2
Predictions saved to Submission.csv


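# Done

1) Reload the output: Output > /kaggle/working
2) Download the `Submission.csv` file
3) Publish your prediction result in the competition (first check that the column names in `Submission.csv` match the competition's sample submission)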