# Fake News Classification with RNN vs LSTM (PyTorch)

This notebook builds two **recurrent neural network** baselines to classify news articles as **Fake** or **Real**:
- a **vanilla RNN**
- an **LSTM**

The focus is on the full workflow (text cleaning → vectorization → model training → evaluation) and on comparing how
a plain RNN behaves versus an LSTM on the same data.

## Data preparation

Before training any deep learning model, we need the data in a clean, consistent format:
- remove noise and normalize text
- convert text into numeric representations (tensors)
- build train/validation splits

## Dataset

We use a dataset of news articles with a binary label:
- **Real**
- **Fake**

Each row includes fields such as `title`, `text`, `subject`, `date`, and `authenticity`.

Goal: train sequence models that learn patterns in the text and predict whether the article is fake or real.

Now that we know a bit more about the dataset, a reasonable first step is to remove punctuation and normalize letter case. This can discard some semantic information, but with limited computing power we often have to accept this trade-off between accuracy and training time.

## Text preprocessing (NLTK)

We apply standard NLP preprocessing to reduce noise and improve signal:
- tokenization
- stopword removal
- lemmatization

These steps help the models focus on meaningful words rather than punctuation and filler terms.

In [1]:

import pandas as pd
import numpy as np

import torch
from torch import nn

from nltk.corpus import stopwords
from nltk.tokenize import RegexpTokenizer, sent_tokenize
from nltk.stem import WordNetLemmatizer
import nltk

# Download NLTK resources
nltk.download('punkt')
nltk.download('punkt_tab')  # sentence-tokenizer data used by newer NLTK releases
nltk.download('stopwords')
nltk.download('wordnet')

# Load your dataset
df = pd.read_csv('news.csv') 

In [None]:

# Basic text cleaning: remove punctuation and digits such as dates, numeric Twitter handles, etc.

stop_words = set(stopwords.words('english'))
tokenizer = RegexpTokenizer('[\'a-zA-Z]+')  # Regex: keep only runs of letters (a-z, A-Z) and apostrophes
lemmatizer = WordNetLemmatizer()  # Reduces each word to its lemma (dictionary base form)


primera_noticia = df.iloc[0]
def preprocess_text(text):
    words = []
    for sentence in sent_tokenize(text):
        tokens = [word for word in tokenizer.tokenize(sentence)]
        tokens = [token.lower() for token in tokens]
        tokens = [token for token in tokens if token not in stop_words]
        tokens = [lemmatizer.lemmatize(token) for token in tokens]
        words += tokens
    return ' '.join(words)
# preprocess_text returns the cleaned article as a single space-separated string

tokens_primera_noticia = preprocess_text(primera_noticia['text'])  # quick check on the first article
# Join title and body with a space so the last title word and the first body word stay separate tokens
df['texto_titulo'] = df['title'] + ' ' + df['text']

df['preprocessado'] = df['texto_titulo'].apply(preprocess_text)
df['preprocessado'].head()
# For a word-by-word list, call .split(' ') on each string

0    donald trump sends embarrassing new year eve m...
1    drunk bragging trump staffer started russian c...
2    sheriff david clarke becomes internet joke thr...
3    trump obsessed even obama name coded website i...
4    pope francis called donald trump christmas spe...
Name: preprocessado, dtype: object
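
To see what the tokenizer pattern keeps and drops, here is a minimal sketch using only the standard library. It assumes `RegexpTokenizer` with this pattern behaves like `re.findall`; the helper `simple_tokenize` and the sample sentence are made up for illustration.

```python
import re

# Same character class as the notebook's RegexpTokenizer pattern
TOKEN_PATTERN = re.compile(r"['a-zA-Z]+")

def simple_tokenize(text):
    """Return lowercase tokens made only of ASCII letters and apostrophes."""
    return [token.lower() for token in TOKEN_PATTERN.findall(text)]

# Made-up sentence containing digits, punctuation, a handle, and quotes
sample = "Trump's 2017 tweet: @realDonaldTrump said 'no'!"
tokens = simple_tokenize(sample)
print(tokens)
# prints ["trump's", 'tweet', 'realdonaldtrump', 'said', "'no'"]
```

Digits and punctuation disappear, but apostrophes survive. A quoted word such as `'no'` therefore keeps its quotes and will not match the plain stopword `no`.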

## Text vectorization (embeddings)

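Neural networks require numeric inputs. Here we map each token to its pre-trained GloVe vector (`glove.6B.100d.txt`, 100 dimensions per word). Articles are truncated or zero-padded to `max_len` tokens, and words missing from GloVe become zero vectors.
This keeps the notebook lightweight and avoids training embeddings from scratch (which can be expensive on limited hardware).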

The final output is a 3D tensor shaped like:
`(num_samples, sequence_length, embedding_dim)`

In [None]:

# Load GloVe embeddings
def load_glove_embeddings(path):
    embeddings_dict = {}
    with open(path, 'r', encoding='utf-8') as f:
        for line in f:
            values = line.split()
            word = values[0]
            vector = np.asarray(values[1:], "float32")
            embeddings_dict[word] = vector
    return embeddings_dict

glove_embeddings = load_glove_embeddings("glove.6B.100d.txt")

# Function to convert articles to sequences of embeddings
def article_to_embedding(article, embeddings_dict, max_len):
    embedding_dim = len(next(iter(embeddings_dict.values())))
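    # float32 matches the GloVe vectors and uses half the memory of NumPy's float64 default
    embedded_article = np.zeros((max_len, embedding_dim), dtype=np.float32)

    # Truncate to max_len; shorter articles stay zero-padded at the end
    words = article.split()[:max_len]
    for i, word in enumerate(words):
        # Out-of-vocabulary words keep their all-zero row
        if word in embeddings_dict:
            embedded_article[i] = embeddings_dict[word]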

    return embedded_article


max_len = 120  # Choose based on dataset analysis
embedded_articles = np.array([article_to_embedding(article, glove_embeddings, max_len) for article in df['preprocessado']])
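
The value `max_len = 120` is a fixed choice. One possible way to ground it in the data, sketched here under the assumption that a nearest-rank percentile of token counts is a reasonable criterion, is shown below. The helper `length_percentile` and the sample texts are hypothetical and not part of the original notebook.

```python
import math

def length_percentile(texts, pct):
    """Smallest token count that covers `pct` percent of the texts (nearest-rank method)."""
    lengths = sorted(len(text.split()) for text in texts)
    rank = max(1, math.ceil(pct * len(lengths) / 100))
    return lengths[rank - 1]

# Toy stand-in for df['preprocessado']; token counts are 3, 2, 6, 1, 4
sample_texts = ["a b c", "a b", "a b c d e f", "a", "a b c d"]

p50 = length_percentile(sample_texts, 50)
p80 = length_percentile(sample_texts, 80)
print(p50, p80)  # prints 3 4
```

On the real data, a call such as `length_percentile(df['preprocessado'], 95)` would give a length that truncates only the longest 5% of articles.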


In [None]:
text_as_vectors = torch.as_tensor(embedded_articles, dtype=torch.float)

embedding_dim = text_as_vectors.size(2)

text_as_vectors.size()

torch.Size([10000, 120, 100])
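
The shape above implies 120 million values, so the dtype has a visible effect on memory. The following back-of-the-envelope sketch (the helper name `tensor_megabytes` is made up) compares 4-byte `float32`, which is what `torch.float` means, with 8-byte `float64`, which is NumPy's default for `np.zeros`.

```python
def tensor_megabytes(shape, bytes_per_value):
    """Memory footprint in MB (10**6 bytes) of a dense array with the given shape."""
    count = 1
    for dim in shape:
        count *= dim
    return count * bytes_per_value / 1e6

shape = (10_000, 120, 100)  # (num_samples, sequence_length, embedding_dim)

float32_mb = tensor_megabytes(shape, 4)
float64_mb = tensor_megabytes(shape, 8)
print(float32_mb, float64_mb)  # prints 480.0 960.0
```

On limited hardware, an intermediate `float64` array of this shape costs roughly as much again as the final tensor and the float64 copy combined would suggest, so keeping everything in `float32` is worth considering.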

In [None]:
from torch.utils.data import DataLoader, TensorDataset

df['authenticity_as_num'] = df['authenticity'].apply(lambda x: 0 if x == 'Fake' else 1)
labels = torch.as_tensor(df['authenticity_as_num'].values, dtype=torch.long)

dataset = TensorDataset(text_as_vectors, labels)

# Splitting dataset into training and validation sets
train_size = int(0.8 * len(dataset))
val_size = len(dataset) - train_size
train_dataset, val_dataset = torch.utils.data.random_split(dataset, [train_size, val_size])

train_loader = DataLoader(train_dataset, batch_size=32, shuffle=True)
val_loader = DataLoader(val_dataset, batch_size=32, shuffle=False)
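
Without a seed, `random_split` produces a different split on every run, which makes the RNN and LSTM numbers harder to compare across sessions. A minimal sketch of a reproducible split follows; the helper `seeded_split`, the seed value, and the toy dataset are illustrative assumptions.

```python
import torch
from torch.utils.data import TensorDataset, random_split

def seeded_split(dataset, train_fraction=0.8, seed=42):
    """Split a dataset into train/validation subsets with a fixed random seed."""
    train_size = int(train_fraction * len(dataset))
    val_size = len(dataset) - train_size
    generator = torch.Generator().manual_seed(seed)
    return random_split(dataset, [train_size, val_size], generator=generator)

# Toy dataset: 20 one-dimensional samples with alternating labels
features = torch.arange(20, dtype=torch.float).unsqueeze(1)
targets = torch.arange(20) % 2
toy_dataset = TensorDataset(features, targets)

train_a, val_a = seeded_split(toy_dataset)
train_b, val_b = seeded_split(toy_dataset)

# The same seed yields the same sample indices in both runs
print(list(train_a.indices) == list(train_b.indices))  # prints True
```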


## Models: RNN and LSTM

We train and compare two architectures:

- **Vanilla RNN**: simple and fast, but can struggle with long-range dependencies.
- **LSTM**: designed to handle longer dependencies and usually trains more stably on sequences.

Both models output logits for binary classification.
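
Logits are unnormalized scores, one per class. `CrossEntropyLoss` applies log-softmax internally during training, so probabilities only need to be computed when you want to inspect a prediction. A small worked example in plain Python, with made-up logit values and the label order used in this notebook (0 = Fake, 1 = Real):

```python
import math

def softmax(logits):
    """Convert a list of logits into probabilities that sum to 1."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(value - peak) for value in logits]
    total = sum(exps)
    return [value / total for value in exps]

LABELS = ["Fake", "Real"]  # index 0 = Fake, index 1 = Real

logits = [1.2, -0.3]  # hypothetical model output for one article
probs = softmax(logits)
predicted = LABELS[probs.index(max(probs))]
print(predicted)  # prints Fake
```

The predicted class is simply the index of the largest logit, which is why the validation loop can skip softmax entirely.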

In [None]:
import torch
import torch.nn as nn

class SimpleRNN(nn.Module):
    def __init__(self, input_dim, hidden_dim, output_dim):
        super(SimpleRNN, self).__init__()
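        # batch_first=True so inputs are read as (batch, seq_len, input_dim), matching the DataLoader batches
        self.rnn = nn.RNN(input_dim, hidden_dim, batch_first=True)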
        self.fc = nn.Linear(hidden_dim, output_dim)
    
    def forward(self, x):
        output, _ = self.rnn(x)
        output = self.fc(output[:, -1, :])
        return output

# Example usage: model = SimpleRNN(input_dim=embedding_dim, hidden_dim=128, output_dim=2)
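
`nn.RNN` and `nn.LSTM` default to `batch_first=False`, which expects input shaped `(seq_len, batch, input_dim)`. The sketch below, with toy sizes and zero inputs, shows that passing a batch-first tensor to a default layer raises no error: `output[:, -1, :]` has the same shape either way, so a layout mismatch can go unnoticed. Only the final hidden state reveals which axis was treated as the batch.

```python
import torch
from torch import nn

batch, seq_len, input_dim, hidden_dim = 4, 120, 100, 128
x = torch.zeros(batch, seq_len, input_dim)  # batch-first layout

rnn_default = nn.RNN(input_dim, hidden_dim)  # reads x as (seq_len=4, batch=120, ...)
rnn_batch_first = nn.RNN(input_dim, hidden_dim, batch_first=True)

out_default, h_default = rnn_default(x)
out_bf, h_bf = rnn_batch_first(x)

# Both slices have shape (4, 128), so downstream code runs in both cases
print(out_default[:, -1, :].shape, out_bf[:, -1, :].shape)

# The final hidden state has shape (num_layers, batch, hidden_dim)
print(h_default.shape)  # batch axis is 120: the layer iterated over only 4 "time steps"
print(h_bf.shape)       # batch axis is 4: the layer iterated over all 120 time steps
```

With `batch_first=True`, a single-layer unidirectional RNN's `output[:, -1, :]` equals its final hidden state, which is the summary the classifier head is meant to see.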


In [None]:
class LSTMModel(nn.Module):
    def __init__(self, input_dim, hidden_dim, output_dim):
        super(LSTMModel, self).__init__()
        self.lstm = nn.LSTM(input_dim, hidden_dim, batch_first=True) 
        self.fc = nn.Linear(hidden_dim, output_dim)
    
    def forward(self, x):
        # x shape: (batch, seq_len, input_size)
        output, (hidden, cell) = self.lstm(x)
        # output shape: (batch, seq_len, hidden_dim)

        # Take the output of the last time step for classification

        output = self.fc(output[:, -1, :])  # shape: (batch, output_dim)
        
        return output
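
Because articles are zero-padded at the end, `output[:, -1, :]` for a short article is the state after a run of padding vectors rather than after its last real word. One possible alternative, not used in this notebook, is to read the output at each article's last real token. The sketch assumes you also keep the true token count per article (the `lengths` tensor here is hypothetical) and uses toy sizes.

```python
import torch
from torch import nn

torch.manual_seed(0)

lstm = nn.LSTM(input_size=8, hidden_size=16, batch_first=True)

x = torch.randn(3, 5, 8)            # (batch, seq_len, input_dim)
lengths = torch.tensor([5, 3, 1])   # hypothetical true token counts per article
for row, n in enumerate(lengths.tolist()):
    x[row, n:] = 0.0                # zero-pad after the last real token

output, _ = lstm(x)                 # (batch, seq_len, hidden_dim)

# Build an index of shape (batch, 1, hidden_dim) pointing at step lengths - 1
index = (lengths - 1).view(-1, 1, 1).expand(-1, 1, output.size(2))
last_valid = output.gather(1, index).squeeze(1)  # (batch, hidden_dim)

print(last_valid.shape)
```

The result can be fed to the linear layer in place of `output[:, -1, :]`. Whether it helps on this dataset is an empirical question.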


## Notes
You can improve results by tuning:
- max sequence length
- embedding choice
- hidden size / number of layers
- dropout and learning rate
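
A simple way to organize such tuning is to enumerate a small grid of configurations and train one model per entry. The sketch below only builds the grid; the candidate values and the helper `iter_configs` are illustrative assumptions, not settings tested in this notebook.

```python
from itertools import product

# Hypothetical candidate values for three of the knobs listed above
search_space = {
    "max_len": [120, 200],
    "hidden_dim": [128, 256],
    "learning_rate": [1e-3, 3e-4],
}

def iter_configs(space):
    """Yield one dict per combination of values in the search space."""
    keys = list(space)
    for values in product(*(space[key] for key in keys)):
        yield dict(zip(keys, values))

configs = list(iter_configs(search_space))
print(len(configs))  # prints 8
print(configs[0])    # prints {'max_len': 120, 'hidden_dim': 128, 'learning_rate': 0.001}
```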

In [None]:
import torch

def train_model(model, train_loader, val_loader, epochs, learning_rate):
    criterion = torch.nn.CrossEntropyLoss()
    optimizer = torch.optim.Adam(model.parameters(), lr=learning_rate)


    for epoch in range(epochs):
        model.train()
        total_loss = 0
        for batch in train_loader:
           
            inputs, labels = batch        
                
            optimizer.zero_grad()
            outputs = model(inputs)
            loss = criterion(outputs, labels)
            loss.backward()
            optimizer.step()

            total_loss += loss.item()

        print(f"Epoch {epoch+1}/{epochs}, Loss: {total_loss/len(train_loader)}")

        # Validation
        model.eval()
        total = 0
        correct = 0
        with torch.no_grad():
            for batch in val_loader:
                inputs, labels = batch
                outputs = model(inputs)
                predicted = outputs.argmax(dim=1)  # index of the largest logit per sample
                total += labels.size(0)
                correct += (predicted == labels).sum().item()

        accuracy = 100 * correct / total
        print(f'Validation Accuracy: {accuracy}%')


model = SimpleRNN(input_dim=100, hidden_dim=128, output_dim=2)

train_model(model, train_loader, val_loader, epochs=40, learning_rate=0.001)
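
Accuracy alone can hide how a classifier treats each class. As a complement, here is a minimal sketch of precision, recall, and F1 for the Fake class (label 0), written in plain Python. The helper `binary_metrics` and the example labels are made up; in practice the inputs would be the labels and predictions collected in the validation loop.

```python
def binary_metrics(y_true, y_pred, positive=0):
    """Precision, recall and F1 for the class treated as positive."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Made-up labels: 0 = Fake, 1 = Real
y_true = [0, 0, 1, 1, 0, 1]
y_pred = [0, 1, 1, 1, 0, 0]

precision, recall, f1 = binary_metrics(y_true, y_pred, positive=0)
print(round(precision, 3), round(recall, 3), round(f1, 3))  # prints 0.667 0.667 0.667
```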



In [None]:

model2 = LSTMModel(input_dim=100, hidden_dim=256, output_dim=2)
train_model(model2, train_loader, val_loader, epochs=10, learning_rate=0.001)

KeyboardInterrupt: 
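
The two configurations above differ considerably in size. The arithmetic below counts trainable parameters using the layout PyTorch uses for single-layer recurrent modules (an input-to-hidden matrix, a hidden-to-hidden matrix, and two bias vectors per gate; an LSTM has four gates, a vanilla RNN one). The helper names are made up for this sketch.

```python
def recurrent_param_count(input_dim, hidden_dim, gates):
    """Parameters of a single-layer, unidirectional recurrent module."""
    per_gate = hidden_dim * input_dim + hidden_dim * hidden_dim + 2 * hidden_dim
    return gates * per_gate

def linear_param_count(in_features, out_features):
    """Weights plus biases of a fully connected layer."""
    return in_features * out_features + out_features

# SimpleRNN(input_dim=100, hidden_dim=128, output_dim=2)
rnn_total = recurrent_param_count(100, 128, gates=1) + linear_param_count(128, 2)

# LSTMModel(input_dim=100, hidden_dim=256, output_dim=2)
lstm_total = recurrent_param_count(100, 256, gates=4) + linear_param_count(256, 2)

print(rnn_total, lstm_total)  # prints 29698 367106
```

The LSTM here has roughly 12 times as many parameters as the RNN, which is one plausible reason its epochs take noticeably longer on limited hardware.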