# Data Preparation and Cleaning

First, we import the libraries used throughout the notebook: pandas and NumPy for data handling, scikit-learn for splitting the data and encoding labels, PyTorch for the model and data loaders, the Hugging Face transformers library for the BERT tokenizer, and re for regular-expression text cleaning.

In [21]:
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import Dataset, DataLoader
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from transformers import BertTokenizer
import re


We load our dataset from a CSV file named 'dataset-merged.csv' and inspect the column names to understand the structure of our data. This helps us identify any unnecessary columns that need to be removed.



In [22]:
df = pd.read_csv('dataset-merged.csv')

In [23]:
df.columns

Index(['sr', 'text', 'label', 'wcount'], dtype='object')

The column 'sr' is redundant, so we remove it from the dataset. We then drop any rows that contain missing values and confirm that none remain, since missing values could skew our analysis and model performance.

In [24]:
df = df.drop('sr', axis=1)
df.columns

Index(['text', 'label', 'wcount'], dtype='object')

In [25]:
df = df.dropna()
df.isnull().sum()

text      0
label     0
wcount    0
dtype: int64
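Because `dropna()` runs before the check, the zeros above only confirm that no missing values remain; they do not show how many rows were dropped. A minimal sketch of counting them first, using a small made-up DataFrame (the rows below are illustrative and are not taken from 'dataset-merged.csv'):

```python
import numpy as np
import pandas as pd

# Toy stand-in for the real dataset; values are illustrative only.
toy_df = pd.DataFrame({
    'text': ['first article', None, 'third article', 'fourth article'],
    'label': [0, 1, np.nan, 1],
    'wcount': [2, 3, 2, 2],
})

missing_per_column = toy_df.isnull().sum()   # missing values per column
rows_before = len(toy_df)
clean_df = toy_df.dropna()                   # drop rows with any missing value
rows_dropped = rows_before - len(clean_df)

print(missing_per_column)
print(f'Dropped {rows_dropped} of {rows_before} rows')
```

Running the count before `dropna()` on the real data would show whether the cleaning step removes a meaningful share of the rows.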

To clean the text data, we define a preprocessing function to remove digits and extra spaces. This helps in normalizing the text data before tokenization.


In [29]:
# Preprocess the text data
def preprocess_text(text):
    text = re.sub(r'\d+', '', text)  # Remove runs of digits
    text = re.sub(r'\s+', ' ', text)  # Collapse each run of whitespace into a single space
    return text

In [30]:
df['text'] = df['text'].apply(preprocess_text)
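To see what the cleaning step does, the function can be applied to a single string. The sample sentence below is made up, and the final `.strip()` is an optional addition that is not part of the notebook's function:

```python
import re

def preprocess_text(text):
    text = re.sub(r'\d+', '', text)   # remove runs of digits
    text = re.sub(r'\s+', ' ', text)  # collapse whitespace runs to one space
    return text

sample = ' Price rose 100  points\nin 2020 '
cleaned = preprocess_text(sample)
print(repr(cleaned))          # leading/trailing single spaces remain
print(repr(cleaned.strip()))  # optional: trim the ends as well
```

Note that the function collapses whitespace but does not trim it, so a leading or trailing space survives unless it is stripped separately.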

# Data Splitting

We split the dataset into training and test sets using an 80-20 split. This ensures that our models can be trained and tested on different subsets of the data.

In [31]:
# Split the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(df['text'], df['label'], test_size=0.2, random_state=42)
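The training loop later in this notebook reports its per-epoch "validation" metrics on this same test split. If a separate validation set is wanted, one option is to split twice. The sketch below uses made-up data; the 60/20/20 proportions, the use of `stratify`, and the variable names are assumptions rather than the notebook's settings:

```python
from sklearn.model_selection import train_test_split

# Toy data: 100 texts with balanced binary labels (illustrative only).
texts = [f'article {i}' for i in range(100)]
labels = [i % 2 for i in range(100)]

# First split off 20% as the final test set.
rest_texts, test_texts, rest_labels, test_labels = train_test_split(
    texts, labels, test_size=0.2, random_state=42, stratify=labels)

# Then take 25% of the remainder (20% of the total) for validation.
train_texts, val_texts, train_labels, val_labels = train_test_split(
    rest_texts, rest_labels, test_size=0.25, random_state=42,
    stratify=rest_labels)

print(len(train_texts), len(val_texts), len(test_texts))
```

With this layout the validation split drives decisions during training, and the test split is touched only once at the end.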

# Tokenization with BERT

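We use the tokenizer of the pre-trained bert-base-multilingual-cased model to convert our text into token IDs. Only the tokenizer and its vocabulary are used here; the BERT model itself is not loaded. Sequences are truncated to a maximum length of 100 tokens and padded to the longest sequence in each tokenizer call.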

In [32]:
# Using the BERT tokenizer
tokenizer = BertTokenizer.from_pretrained('bert-base-multilingual-cased')


In [33]:
# Tokenize the text
max_len = 100  # Maximum length of the sequences
X_train_tokens = tokenizer(X_train.tolist(), padding=True, truncation=True, max_length=max_len, return_tensors='pt')
X_test_tokens = tokenizer(X_test.tolist(), padding=True, truncation=True, max_length=max_len, return_tensors='pt')
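With `padding=True` the tokenizer pads to the longest sequence in the call, and `truncation=True` with `max_length` caps every sequence at that length, so the training and test tensors can end up with different widths (each at most 100). The pure-Python sketch below mimics that behaviour on made-up token IDs. `pad_and_truncate` is a hypothetical helper, not part of transformers, the pad ID of 0 is an assumption, and special tokens such as [CLS] and [SEP] are ignored:

```python
def pad_and_truncate(sequences, max_length, pad_id=0):
    """Hypothetical helper mimicking padding=True, truncation=True."""
    # Truncate every sequence to at most max_length tokens.
    truncated = [seq[:max_length] for seq in sequences]
    # Pad to the longest remaining sequence, not to max_length.
    width = max(len(seq) for seq in truncated)
    input_ids = [seq + [pad_id] * (width - len(seq)) for seq in truncated]
    # 1 marks a real token, 0 marks padding.
    attention_mask = [[1] * len(seq) + [0] * (width - len(seq))
                      for seq in truncated]
    return input_ids, attention_mask

batch = [[5, 6, 7], [8, 9], [1, 2, 3, 4, 5, 6]]
ids, mask = pad_and_truncate(batch, max_length=4)
print(ids)
print(mask)
```

The attention mask is returned by the real tokenizer as well, although the LSTM defined later in this notebook only consumes `input_ids`.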

Next, we convert the labels to PyTorch tensors.

In [34]:
y_train = torch.tensor(y_train.values)
y_test = torch.tensor(y_test.values)
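The conversion above assumes the 'label' column is already numeric. `LabelEncoder` is imported at the top of the notebook but not used; if the labels were strings, a sketch of encoding them first might look like this (the label names 'fake' and 'real' are made up for illustration):

```python
import torch
from sklearn.preprocessing import LabelEncoder

raw_labels = ['fake', 'real', 'real', 'fake']  # hypothetical string labels

encoder = LabelEncoder()
encoded = encoder.fit_transform(raw_labels)    # classes are sorted alphabetically

# CrossEntropyLoss expects class indices as 64-bit integers.
labels_tensor = torch.tensor(encoded, dtype=torch.long)
print(list(encoder.classes_))
print(labels_tensor)
```

Keeping the fitted encoder around makes it possible to map predictions back to label names with `encoder.inverse_transform`.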

# Dataset and DataLoader

To handle batching of the tokenized data, we create a custom dataset class that extends torch.utils.data.Dataset. This class returns the tokenized inputs and corresponding labels. We then create DataLoaders for training and test datasets to enable easy batching and shuffling.

In [35]:
# Create the dataset and dataloaders to handle batching

class NewsDataset(Dataset):
    def __init__(self, encodings, labels):
        self.encodings = encodings
        self.labels = labels

    def __len__(self):
        return len(self.labels)

    def __getitem__(self, idx):
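        # The encodings and labels are already tensors, so index them directly;
        # re-wrapping them with torch.tensor() triggers a copy-construct UserWarning.
        item = {key: val[idx] for key, val in self.encodings.items()}
        item['labels'] = self.labels[idx]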
        return item

train_dataset = NewsDataset(X_train_tokens, y_train)
test_dataset = NewsDataset(X_test_tokens, y_test)

train_loader = DataLoader(train_dataset, batch_size=32, shuffle=True)
test_loader = DataLoader(test_dataset, batch_size=32, shuffle=False)
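A quick way to check the batching is to pull one batch and print the shape of each field. The sketch below uses a stand-in dataset with the same structure as `NewsDataset`, filled with random made-up tensors so it runs without the tokenizer:

```python
import torch
from torch.utils.data import Dataset, DataLoader

class ToyDataset(Dataset):
    """Same structure as NewsDataset, filled with made-up tensors."""
    def __init__(self, encodings, labels):
        self.encodings = encodings
        self.labels = labels

    def __len__(self):
        return len(self.labels)

    def __getitem__(self, idx):
        item = {key: val[idx] for key, val in self.encodings.items()}
        item['labels'] = self.labels[idx]
        return item

# 10 fake sequences of 6 token IDs each (illustrative values only).
encodings = {
    'input_ids': torch.randint(0, 100, (10, 6)),
    'attention_mask': torch.ones(10, 6, dtype=torch.long),
}
labels = torch.randint(0, 2, (10,))

loader = DataLoader(ToyDataset(encodings, labels), batch_size=4, shuffle=False)
first_batch = next(iter(loader))
for key, value in first_batch.items():
    print(key, tuple(value.shape))
```

The default collate function stacks the per-item dictionaries into one dictionary of batched tensors, which is why the training loop can index a batch by key.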


# LSTM Model Definition

We define an LSTM classifier for our text classification task. LSTM (Long Short-Term Memory) networks are a type of recurrent neural network (RNN) capable of learning long-term dependencies, making them suitable for text data.

In [36]:
# LSTM Model

class LSTMClassifier(nn.Module):
    def __init__(self, embedding_dim, hidden_dim, vocab_size, output_dim, n_layers, bidirectional, dropout):
        super(LSTMClassifier, self).__init__()
        self.embedding = nn.Embedding(vocab_size, embedding_dim)
        self.lstm = nn.LSTM(embedding_dim, hidden_dim, num_layers=n_layers, bidirectional=bidirectional, dropout=dropout, batch_first=True)
        self.fc = nn.Linear(hidden_dim * 2 if bidirectional else hidden_dim, output_dim)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x):
        embedded = self.embedding(x)
        lstm_out, (hidden, cell) = self.lstm(embedded)
        if self.lstm.bidirectional:
            hidden = self.dropout(torch.cat((hidden[-2,:,:], hidden[-1,:,:]), dim=1))
        else:
            hidden = self.dropout(hidden[-1,:,:])
        out = self.fc(hidden)
        return out

Explanation of the model layers:
- Embedding Layer: Converts input token IDs into dense vectors of a fixed size (embedding_dim).
- LSTM Layer: The core of the model, which processes the sequence of embeddings. Its parameters are hidden_dim (size of the hidden state), n_layers (number of stacked LSTM layers), bidirectional (if True, the sequence is read in both directions), and dropout (dropout probability applied between stacked layers).
- Fully Connected Layer: Maps the final hidden state of the LSTM to the desired output dimension (number of classes). With a bidirectional LSTM its input is the concatenation of the final forward and backward hidden states, hence hidden_dim * 2.
- Dropout Layer: Regularization technique that reduces overfitting by randomly setting a fraction of its inputs to 0 during training. Here it is applied to the final hidden state before the fully connected layer.
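The indexing `hidden[-2]` and `hidden[-1]` in the forward pass is easier to follow with the tensor shapes in view. A small sketch with made-up dimensions (much smaller than the notebook's) that prints them:

```python
import torch
import torch.nn as nn

batch_size, seq_len = 3, 5
embedding_dim, hidden_dim, n_layers = 8, 16, 2

lstm = nn.LSTM(embedding_dim, hidden_dim, num_layers=n_layers,
               bidirectional=True, batch_first=True)

embedded = torch.randn(batch_size, seq_len, embedding_dim)
lstm_out, (hidden, cell) = lstm(embedded)

# hidden stacks (layer, direction) pairs along its first dimension,
# regardless of batch_first.
print(tuple(lstm_out.shape))  # (batch, seq_len, 2 * hidden_dim)
print(tuple(hidden.shape))    # (n_layers * 2, batch, hidden_dim)

# Last layer: the forward state is hidden[-2], the backward state is hidden[-1].
final = torch.cat((hidden[-2], hidden[-1]), dim=1)
print(tuple(final.shape))     # (batch, 2 * hidden_dim)
```

This is why the fully connected layer takes `hidden_dim * 2` inputs when `bidirectional` is True.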

# Hyperparameters and Model Initialization

Hyperparameters define the model's structure and learning process, so choosing them well is a crucial step in building an effective model. In this project, we set the hyperparameters of our LSTM model to balance model complexity and performance:
- embedding_dim = 128: the size of the vector space in which tokens are embedded.
- hidden_dim = 256: the size of the hidden states in the LSTM layers, which allows the model to capture more complex patterns in the data.
- n_layers = 8: the number of stacked LSTM layers, enabling the model to learn hierarchical representations of the text.
- bidirectional = True: the LSTM reads each sequence in both forward and backward directions, which improves its use of context.
- dropout = 0.3: the dropout rate, which regularizes the model and helps prevent overfitting.
- output_dim = 2: one output per class.
- vocab_size: taken from the BERT tokenizer so that every token ID the tokenizer produces has a row in the embedding table. The embeddings themselves are learned from scratch; no pre-trained BERT weights are loaded.

These values were chosen based on prior experiments and domain knowledge. Further tuning can be performed using grid search or random search.

In [47]:
# Hyperparameters
embedding_dim = 128
hidden_dim = 256
output_dim = 2
n_layers = 8
bidirectional = True
dropout = 0.3
vocab_size = tokenizer.vocab_size

# Initialize the model
model = LSTMClassifier(embedding_dim, hidden_dim, vocab_size, output_dim, n_layers, bidirectional, dropout)

# Move the model to the GPU if CUDA is available, otherwise keep it on the CPU
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model = model.to(device)
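With eight stacked bidirectional layers and an embedding table sized to the tokenizer vocabulary, it can be useful to check how many trainable parameters the model has. A minimal sketch, where `count_trainable_parameters` is a hypothetical helper and the tiny model is a stand-in chosen so the count can be verified by hand:

```python
import torch.nn as nn

def count_trainable_parameters(model):
    """Hypothetical helper: number of parameters that receive gradients."""
    return sum(p.numel() for p in model.parameters() if p.requires_grad)

# Tiny stand-in model so the count is easy to verify by hand.
tiny = nn.Sequential(
    nn.Embedding(10, 4),   # 10 * 4 = 40 weights
    nn.Linear(4, 2),       # 4 * 2 weights + 2 biases = 10
)
print(count_trainable_parameters(tiny))
```

The same helper can be called on the `model` created above to see how the parameters are distributed between the embedding table and the LSTM stack.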

# Loss Function and Optimizer

The choice of the loss function and optimizer significantly impacts the model's training process and final performance. For this text classification task, we use the Cross-Entropy Loss (nn.CrossEntropyLoss()), the standard choice for multi-class classification, which also covers our two-class case (output_dim = 2). It takes the model's raw scores (logits), applies a softmax internally, and returns the negative log-probability assigned to the true class, so confident but incorrect predictions are penalized most severely. By default all classes are weighted equally; for imbalanced datasets, where some classes are more frequent than others, per-class weights can be supplied through its weight argument.
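As a worked example of the quantity the loss computes for a single sample, the sketch below evaluates the negative log-softmax by hand for one made-up pair of logits. It is an illustration of the formula, not PyTorch's implementation:

```python
import math

def cross_entropy(logits, target):
    """Cross-entropy for one example: -log(softmax(logits)[target])."""
    shifted = [z - max(logits) for z in logits]          # numerical stability
    log_sum_exp = math.log(sum(math.exp(z) for z in shifted))
    return -(shifted[target] - log_sum_exp)

logits = [2.0, 0.5]   # made-up scores for classes 0 and 1
loss_if_class_0 = cross_entropy(logits, 0)
loss_if_class_1 = cross_entropy(logits, 1)
print(round(loss_if_class_0, 4), round(loss_if_class_1, 4))
```

The loss is small when the true class has the larger logit and grows by exactly the logit gap (1.5 here) when the true class is the other one.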

To optimize the model parameters, we employ the Adam optimizer (optim.Adam). Adam (short for Adaptive Moment Estimation) is an advanced optimization algorithm that combines the advantages of two other popular methods: AdaGrad and RMSProp. It adapts the learning rate for each parameter individually, using estimates of first and second moments of the gradients. This makes Adam particularly well-suited for training deep neural networks, as it can handle sparse gradients and noisy data more efficiently. We set the learning rate to a small value of 0.0001, which helps in achieving a more stable and gradual convergence, reducing the risk of overshooting the optimal parameter values.
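The update rule described above can be written out for a single scalar parameter. This is a sketch of the published Adam update with the default beta and epsilon values used by optim.Adam, not the library's code; the gradients are made up:

```python
import math

def adam_step(param, grad, m, v, t, lr=0.0001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for a single scalar parameter."""
    m = beta1 * m + (1 - beta1) * grad          # first-moment estimate
    v = beta2 * v + (1 - beta2) * grad ** 2     # second-moment estimate
    m_hat = m / (1 - beta1 ** t)                # bias correction
    v_hat = v / (1 - beta2 ** t)
    param = param - lr * m_hat / (math.sqrt(v_hat) + eps)
    return param, m, v

param, m, v = 1.0, 0.0, 0.0
for t, grad in enumerate([0.5, 0.4, 0.6], start=1):
    param, m, v = adam_step(param, grad, m, v, t)
    print(t, param)
```

On the first step the bias-corrected ratio is close to the sign of the gradient, so the parameter moves by roughly the learning rate regardless of the gradient's magnitude.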



In [48]:
# Set up the loss function and the Adam optimizer

criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=0.0001)

# Training and Evaluation Functions

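We define functions to train and evaluate the model. The train_model function performs one epoch of training, and the evaluate_model function returns the average loss per batch and the accuracy on whichever data loader it is given. In this notebook it is always called with test_loader, so the test split also serves as the validation set.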

In [49]:
def train_model(model, train_loader, criterion, optimizer, device):
    model.train()
    for batch in train_loader:
        inputs = batch['input_ids'].to(device)
        labels = batch['labels'].to(device)
        optimizer.zero_grad()
        outputs = model(inputs)
        loss = criterion(outputs, labels)
        loss.backward()
        optimizer.step()

def evaluate_model(model, test_loader, criterion, device):
    model.eval()
    total_loss = 0
    correct = 0
    with torch.no_grad():
        for batch in test_loader:
            inputs = batch['input_ids'].to(device)
            labels = batch['labels'].to(device)
            outputs = model(inputs)
            loss = criterion(outputs, labels)
            total_loss += loss.item()
            _, preds = torch.max(outputs, 1)
            correct += torch.sum(preds == labels).item()
    return total_loss / len(test_loader), correct / len(test_loader.dataset)
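The accuracy calculation inside evaluate_model can be checked on a handful of made-up logits. The sketch below repeats the same two lines outside the loop:

```python
import torch

# Made-up logits for 4 examples and 2 classes.
outputs = torch.tensor([[2.0, 0.5],
                        [0.1, 1.2],
                        [0.3, 0.2],
                        [-1.0, 0.0]])
labels = torch.tensor([0, 1, 1, 1])

_, preds = torch.max(outputs, 1)            # index of the largest logit per row
correct = torch.sum(preds == labels).item()
accuracy = correct / len(labels)
print(preds.tolist(), correct, accuracy)
```

Note that evaluate_model divides the accumulated loss by the number of batches but the number of correct predictions by the number of samples, so the loss is a mean of per-batch means.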

# Model Training

We train the model for 10 epochs and print the loss and accuracy at the end of each epoch. These are reported as validation metrics but are computed on test_loader, since no separate validation split is used.

In [50]:
n_epochs = 10
for epoch in range(n_epochs):
    train_model(model, train_loader, criterion, optimizer, device)
    val_loss, val_acc = evaluate_model(model, test_loader, criterion, device)
    print(f'Epoch {epoch+1}, Val Loss: {val_loss}, Val Acc: {val_acc}')


Epoch 1, Val Loss: 0.5538217243221071, Val Acc: 0.7191240875912409
Epoch 2, Val Loss: 0.4908814444034188, Val Acc: 0.7693430656934307
Epoch 3, Val Loss: 0.5684072830610805, Val Acc: 0.7170802919708029
Epoch 4, Val Loss: 0.419455757709565, Val Acc: 0.8186861313868613
Epoch 5, Val Loss: 0.4856743061984027, Val Acc: 0.7944525547445256
Epoch 6, Val Loss: 0.41693154184354675, Val Acc: 0.8294890510948905
Epoch 7, Val Loss: 0.3838138275400356, Val Acc: 0.8426277372262774
Epoch 8, Val Loss: 0.4031199230640023, Val Acc: 0.8198540145985401
Epoch 9, Val Loss: 0.3804402712870527, Val Acc: 0.8446715328467154
Epoch 10, Val Loss: 0.38689802380071747, Val Acc: 0.8458394160583942


# Test Evaluation

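Finally, we evaluate the trained model on the test set to obtain the test loss and accuracy. Because this is the same split used for the per-epoch evaluation, the figures match those of epoch 10.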

In [51]:
test_loss, test_acc = evaluate_model(model, test_loader, criterion, device)
print(f'Test Loss: {test_loss}, Test Accuracy: {test_acc}')


Test Loss: 0.38689802380071747, Test Accuracy: 0.8458394160583942


# Results
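The model achieves a test accuracy of approximately 84.58% after 10 epochs, up from about 71.9% after the first epoch. Validation loss does not decrease monotonically (it rises at epochs 3, 5, 8, and 10), and the lowest loss, about 0.380, is reached at epoch 9. The training and evaluation functions make it possible to monitor this behavior throughout the training process.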

We can incorporate a learning rate scheduler and early stopping into our model training. See the results at https://github.com/Spinal-Tap369/fake-news/blob/main/hindi_fake_news_lstm_lrs_es.ipynb
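To illustrate what early stopping would do with the run logged above, the sketch below replays the reported validation losses through a simple patience rule. `early_stopping_epoch` is a hypothetical helper written for this illustration and is not taken from the linked notebook:

```python
# Validation losses from the training log above, rounded to four decimals.
val_losses = [0.5538, 0.4909, 0.5684, 0.4195, 0.4857,
              0.4169, 0.3838, 0.4031, 0.3804, 0.3869]

def early_stopping_epoch(losses, patience):
    """Hypothetical helper: return (stop_epoch, best_epoch), 1-indexed."""
    best_loss = float('inf')
    best_epoch = 0
    epochs_without_improvement = 0
    for epoch, loss in enumerate(losses, start=1):
        if loss < best_loss:
            best_loss, best_epoch = loss, epoch
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch, best_epoch
    return len(losses), best_epoch

for patience in (1, 2):
    print(patience, early_stopping_epoch(val_losses, patience))
```

With a patience of 1 this run would have stopped at epoch 3, well before its best epoch, while a patience of 2 lets it run all 10 epochs and identifies epoch 9 as the best checkpoint. This shows how sensitive the rule is to the patience setting on a noisy loss curve.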