# Fake News Detection - Complete KDD Pipeline

A complete implementation of the KDD methodology as described in the research document:
1. Data Selection
2. Data Preprocessing (5 stages)
3. Feature Extraction (TF-IDF)
4. Model Training (Logistic Regression)
5. Pattern Evaluation (Confusion Matrix + Metrics)

In [None]:
# Import required libraries
import sys
import os
import pandas as pd
import numpy as np
import warnings
warnings.filterwarnings('ignore')

# Add src to path
sys.path.append(os.path.join(os.getcwd(), '..', 'src'))

print("Importing custom modules...")
from preprocessing import TextPreprocessor
from tfidf_calculator import CustomTFIDFCalculator
from model import FakeNewsLogisticRegression
from evaluation import ModelEvaluator

print("All modules imported successfully!")

## 1. Data Selection Phase

In [None]:
# Load dataset
print("=== DATA SELECTION ===")
train_df = pd.read_csv('../data/raw/train.csv', sep=';', on_bad_lines='skip', encoding='utf-8')

print(f"Total records: {len(train_df)}")
print(f"Columns: {list(train_df.columns)}")

# Label distribution
label_dist = train_df['label'].value_counts()
print(f"\nLabel distribution:")
print(f"  REAL news (0): {label_dist.get(0, 0)} ({label_dist.get(0, 0)/len(train_df)*100:.1f}%)")
print(f"  FAKE news (1): {label_dist.get(1, 0)} ({label_dist.get(1, 0)/len(train_df)*100:.1f}%)")

# Use a sample for the demo (capped at the dataset size so sample() cannot fail)
SAMPLE_SIZE = min(2000, len(train_df))
train_sample = train_df.sample(n=SAMPLE_SIZE, random_state=42)
print(f"\nUsing sample of {SAMPLE_SIZE} records for demonstration")
train_sample.head()

## 2. Data Preprocessing Phase

The five preprocessing stages described in the research document:
1. Case Folding
2. Punctuation Removal
3. Tokenization
4. Stopword Removal  
5. Stemming
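
The five stages above can be sketched in plain Python as follows. This is only an illustration, not the `TextPreprocessor` implementation used in this notebook: the stopword list, the suffix list, and all function names here are hypothetical, and the suffix stripping is a naive stand-in for a real Indonesian stemmer (it ignores prefixes such as `di-`).

```python
import string

# Hypothetical, tiny stopword list; a real pipeline would use a full Indonesian list.
STOPWORDS = {"yang", "dan", "di", "ke", "dari", "untuk", "karena"}

# Hypothetical suffixes for a naive stemmer (not a real stemming algorithm).
SUFFIXES = ("nya", "kan")


def case_folding(text):
    """Stage 1: lowercase the text."""
    return text.lower()


def remove_punctuation(text):
    """Stage 2: drop ASCII punctuation characters."""
    return text.translate(str.maketrans("", "", string.punctuation))


def tokenize(text):
    """Stage 3: split on whitespace."""
    return text.split()


def remove_stopwords(tokens):
    """Stage 4: discard tokens found in the stopword list."""
    return [t for t in tokens if t not in STOPWORDS]


def naive_stem(tokens):
    """Stage 5: strip at most one known suffix, keeping at least 3 characters."""
    stemmed = []
    for token in tokens:
        for suffix in SUFFIXES:
            if token.endswith(suffix) and len(token) - len(suffix) >= 3:
                token = token[: -len(suffix)]
                break
        stemmed.append(token)
    return stemmed


def preprocess(text):
    """Run the five stages in order and return the final token list."""
    return naive_stem(
        remove_stopwords(tokenize(remove_punctuation(case_folding(text))))
    )


print(preprocess("Bukunya dibacakan, dan hasilnya BAIK!"))
```

Each stage takes the output of the previous one, which is why the demo cell below passes `step1` into `step2`, and so on.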

In [None]:
# Initialize preprocessor
preprocessor = TextPreprocessor()

# Demo preprocessing with a single sample
sample_text = train_sample.iloc[0]['text']
print("=== SAMPLE PREPROCESSING DEMO ===")
print(f"Original: {sample_text[:200]}...")
print()

# Step by step preprocessing
step1 = preprocessor.case_folding(sample_text)
print(f"1. Case Folding: {step1[:200]}...")

step2 = preprocessor.remove_punctuation(step1)
print(f"2. Punctuation Removal: {step2[:200]}...")

step3 = preprocessor.tokenize(step2)
print(f"3. Tokenization: {step3[:20]}...")

step4 = preprocessor.remove_stopwords(step3)
print(f"4. Stopword Removal: {step4[:20]}...")

step5 = preprocessor.stemming(step4)
print(f"5. Stemming: {step5[:20]}...")

In [None]:
# Preprocess entire sample dataset
print("\n=== PREPROCESSING ENTIRE SAMPLE ===")
processed_sample = preprocessor.preprocess_dataset(train_sample, text_column='text')

# Show results
print(f"\nProcessed dataset shape: {processed_sample.shape}")
print(f"Processed dataset columns: {list(processed_sample.columns)}")

# Sample processed data
processed_sample[['text', 'processed_text', 'label']].head()

## 3. Feature Extraction Phase (TF-IDF)

Uses the TF-IDF formulas from the research document:
- TF(t,d) = 1 + 10*log(f(t,d))
- IDF(t) = 10*log(n/df(t))
- Normalization: w(t,d) / sqrt(sum(w(t,d)^2))
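
As a rough sketch, the three formulas above could be computed as shown below. Several points are assumptions rather than facts from the research document: `log` is read as the base-10 logarithm, the weight is taken to be w(t,d) = TF(t,d) * IDF(t), and terms absent from a document get weight 0. The function name `tfidf_vectors` is hypothetical and this is not the `CustomTFIDFCalculator` code.

```python
import math
from collections import Counter


def tfidf_vectors(documents):
    """Return (vocabulary, vectors) for a list of token lists.

    Assumptions: log is base 10, w(t,d) = TF(t,d) * IDF(t),
    and a term that does not occur in a document has weight 0.
    """
    n = len(documents)
    vocabulary = sorted({term for doc in documents for term in doc})

    # df(t): number of documents that contain term t at least once
    df = Counter(term for doc in documents for term in set(doc))

    # IDF(t) = 10*log(n/df(t))
    idf = {term: 10 * math.log10(n / df[term]) for term in vocabulary}

    vectors = []
    for doc in documents:
        counts = Counter(doc)
        weights = []
        for term in vocabulary:
            f = counts[term]
            if f == 0:
                weights.append(0.0)
            else:
                tf = 1 + 10 * math.log10(f)  # TF(t,d) = 1 + 10*log(f(t,d))
                weights.append(tf * idf[term])
        # Normalization: w(t,d) / sqrt(sum(w(t,d)^2))
        norm = math.sqrt(sum(w * w for w in weights))
        vectors.append([w / norm if norm else 0.0 for w in weights])
    return vocabulary, vectors


docs = [["hoaks", "vaksin", "vaksin"], ["vaksin", "resmi"]]
vocabulary, vectors = tfidf_vectors(docs)
print(vocabulary)
print(vectors)
```

Note that a term occurring in every document (here `vaksin`) gets IDF = 0 under this formula, so it contributes nothing to any vector.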

In [None]:
# Initialize TF-IDF calculator
tfidf_calc = CustomTFIDFCalculator()

# Prepare documents for TF-IDF
documents = processed_sample['processed_tokens'].tolist()
labels = processed_sample['label'].values

print(f"Number of documents: {len(documents)}")
print(f"Sample document: {documents[0][:10]}...")

# Calculate TF-IDF matrix
print("\n=== CALCULATING TF-IDF MATRIX ===")
tfidf_matrix = tfidf_calc.calculate_tfidf_matrix(documents)

print(f"\nTF-IDF matrix shape: {tfidf_matrix.shape}")
print(f"Number of features: {len(tfidf_calc.get_feature_names())}")
print(f"Vocabulary sample: {tfidf_calc.get_feature_names()[:20]}")

## 4. Model Training Phase (Logistic Regression)

Uses the sigmoid function: sigmoid(z) = 1 / (1 + e^(-z))
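
To make the role of the sigmoid concrete, here is a minimal sketch of how a logistic regression turns a feature vector into a FAKE/REAL decision. The function names, the example weights, and the 0.5 threshold are illustrative assumptions; this is not the `FakeNewsLogisticRegression` implementation.

```python
import math


def sigmoid(z):
    """sigmoid(z) = 1 / (1 + e^(-z)), written to avoid overflow for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def predict_proba_fake(features, weights, intercept):
    """Probability of the positive class (FAKE = 1) for one feature vector."""
    z = intercept + sum(w * x for w, x in zip(weights, features))
    return sigmoid(z)


def classify(features, weights, intercept, threshold=0.5):
    """Return 1 (FAKE) if the probability exceeds the threshold, else 0 (REAL)."""
    return 1 if predict_proba_fake(features, weights, intercept) > threshold else 0


# Hypothetical weights for a two-feature model
weights = [2.0, -1.0]
intercept = -0.5
print(classify([1.0, 0.0], weights, intercept))  # z = 1.5
print(classify([0.0, 1.0], weights, intercept))  # z = -1.5
```

Because sigmoid(0) = 0.5, the 0.5 threshold is equivalent to checking the sign of z.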

In [None]:
from sklearn.model_selection import train_test_split

# Split data
X_train, X_test, y_train, y_test = train_test_split(
    tfidf_matrix, labels, 
    test_size=0.2, 
    random_state=42,
    stratify=labels
)

print(f"Training set: {X_train.shape}")
print(f"Test set: {X_test.shape}")

# Initialize and train model
model = FakeNewsLogisticRegression()
training_result = model.train(X_train, y_train, X_test, y_test)

print("\nTraining completed!")
print(f"Training accuracy: {training_result['train_accuracy']:.4f}")
print(f"Test accuracy: {training_result['test_accuracy']:.4f}")

## 5. Pattern Evaluation Phase

Evaluation uses a confusion matrix and the metrics defined by the formulas in the research document:
- Accuracy = (TP + TN) / (TP + TN + FP + FN)
- Precision = TP / (TP + FP)
- Recall = TP / (TP + FN) 
- F1-Score = 2 * (Precision * Recall) / (Precision + Recall)
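
The four formulas above can be checked by hand with a small sketch like the one below. It treats label 1 (FAKE) as the positive class, matching the label encoding used earlier in this notebook; the function names and the toy labels are hypothetical, and returning 0.0 for an undefined ratio is a convention chosen here, not something the research document specifies.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Count TP, TN, FP, FN for a binary problem."""
    tp = tn = fp = fn = 0
    for actual, predicted in zip(y_true, y_pred):
        if predicted == positive:
            if actual == positive:
                tp += 1
            else:
                fp += 1
        else:
            if actual == positive:
                fn += 1
            else:
                tn += 1
    return tp, tn, fp, fn


def compute_metrics(tp, tn, fp, fn):
    """Apply the accuracy, precision, recall, and F1 formulas."""
    total = tp + tn + fp + fn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denominator = precision + recall
    f1 = 2 * (precision * recall) / denominator if denominator else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Toy labels for illustration only
y_true_demo = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred_demo = [1, 1, 0, 0, 0, 1, 0, 1]
counts = confusion_counts(y_true_demo, y_pred_demo)
print(counts)
print(compute_metrics(*counts))
```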

In [None]:
# Make predictions
y_pred = model.predict(X_test)

# Initialize evaluator
evaluator = ModelEvaluator()

# Generate comprehensive evaluation report
metrics = evaluator.print_evaluation_report(y_test, y_pred, model_name="Logistic Regression")

# Compare with previous research
evaluator.compare_with_research(metrics)

## 6. Demo Prediction

Test the model with new example texts.

In [None]:
# Sample texts for the demo
demo_texts = [
    "Vaksin COVID-19 mengandung chip 5G yang bisa mengontrol pikiran manusia! Jangan divaksin karena berbahaya!",
    "Pemerintah mengumumkan program vaksinasi COVID-19 gratis untuk seluruh warga negara sesuai protokol WHO.",
    "BREAKING: Ilmuwan menemukan obat ajaib yang bisa menyembuhkan semua penyakit dalam 1 hari!",
    "Menteri Kesehatan melaporkan penurunan kasus COVID-19 sebesar 15% dalam minggu ini berdasarkan data resmi."
]

print("=== DEMO PREDICTIONS ===")

for i, text in enumerate(demo_texts, 1):
    print(f"\n--- Sample {i} ---")
    print(f"Text: {text[:100]}...")
    
    # Preprocess text
    processed_tokens = preprocessor.preprocess_text(text)
    
    # Transform to TF-IDF
    tfidf_vector = tfidf_calc.transform_new_document(processed_tokens)
    
    # Predict
    prediction = model.predict(tfidf_vector)[0]
    probability = model.predict_proba(tfidf_vector)[0]
    
    result = "FAKE" if prediction == 1 else "REAL"
    confidence = probability[prediction] * 100
    
    print(f"Prediction: {result}")
    print(f"Confidence: {confidence:.2f}%")
    print(f"Probabilities: REAL={probability[0]*100:.2f}%, FAKE={probability[1]*100:.2f}%")

## 7. Model Analysis

Analysis of the model coefficients and the most important features.

In [None]:
# Get model coefficients
coeffs = model.get_model_coefficients()

print("=== MODEL ANALYSIS ===")
print(f"Intercept (β₀): {coeffs['intercept']:.4f}")
print(f"Number of features: {len(coeffs['coefficients'])}")

# Rank features by absolute coefficient (positive indicates FAKE, negative indicates REAL)
feature_names = tfidf_calc.get_feature_names()
coeff_importance = list(zip(feature_names, coeffs['coefficients']))
coeff_importance.sort(key=lambda x: abs(x[1]), reverse=True)

print("\nTop 10 Most Important Features:")
for i, (feature, coeff) in enumerate(coeff_importance[:10]):
    direction = "FAKE" if coeff > 0 else "REAL"
    print(f"{i+1:2d}. {feature:<15} : {coeff:8.4f} ({direction})")

# Demonstration of manual sigmoid calculation
print("\n=== MANUAL SIGMOID DEMONSTRATION ===")
sample_z_values = [-2.0, -1.0, 0.0, 1.0, 2.0]
for z in sample_z_values:
    sigmoid_val = model.sigmoid(z)
    classification = "FAKE" if sigmoid_val > 0.5 else "REAL"
    print(f"Z = {z:4.1f} → Sigmoid = {sigmoid_val:.4f} → {classification}")

## Summary

The complete KDD pipeline has been implemented following the methodology in the research document:

1. ✅ **Data Selection**: Dataset loaded and explored
2. ✅ **Data Preprocessing**: Five preprocessing stages as described in the document
3. ✅ **Feature Extraction**: TF-IDF using the formulas from the document
4. ✅ **Model Training**: Logistic Regression with the sigmoid function
5. ✅ **Pattern Evaluation**: Confusion matrix and evaluation metrics

The model is ready to be used in a fake news detection application!