# **Building a Simple Recurrent Neural Network for Question-Answering**
---

## **Overview**
>This project implements a basic Recurrent Neural Network (RNN) for question-answering using PyTorch. The dataset consists of 100 unique question-answer pairs, which are preprocessed to build a vocabulary and convert text into numerical representations. A custom PyTorch Dataset class is created to handle tokenized inputs, and a DataLoader is used for batching and shuffling the data. The model consists of an embedding layer, an RNN layer, and a fully connected output layer, trained using the cross-entropy loss function and Adam optimizer. After training for 20 epochs, the model is tested by predicting answers to new questions. If the confidence level is below a set threshold, the model responds with "I don't know." This project demonstrates a foundational approach to natural language processing (NLP) using RNNs, highlighting key steps such as text tokenization, vocabulary building, dataset preparation, model training, and inference.

---
## **Import Dependencies**
- `pandas`: For loading and processing the dataset.
- `torch`: Core PyTorch library for deep learning.
- `Dataset`, `DataLoader`: Tools to create and manage training data.
- `nn`: PyTorch’s module for building neural networks.

In [43]:
import pandas as pd

import torch
from torch.utils.data import Dataset, DataLoader
import torch.nn as nn

---
## **Load and Inspect the Dataset**
- Loads a CSV file containing 100 QA pairs.
- Displays the first few rows.


In [44]:
df = pd.read_csv("/content/100_Unique_QA_Dataset.csv")

In [45]:
df.head()

| | question | answer |
|---|---|---|
| 0 | What is the capital of France? | Paris |
| 1 | What is the capital of Germany? | Berlin |
| 2 | Who wrote 'To Kill a Mockingbird'? | Harper-Lee |
| 3 | What is the largest planet in our solar system? | Jupiter |
| 4 | What is the boiling point of water in Celsius? | 100 |


---
## **Text Preprocessing**

### **Tokenization**
- Converts text to lowercase.
- Removes question marks and apostrophes.
- Splits text into words (tokens).

In [46]:
def tokenize(text):
  text = text.lower()
  text = text.replace("?", "")
  text = text.replace("'", "")
  return text.split()
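
To make the effect of these three steps concrete, here is a small sketch that reuses the same `tokenize` logic on two strings from the dataset preview. The third input, `"Paris, France"`, is a hypothetical string added only to show that punctuation other than `?` and `'` is left attached to the token.

```python
def tokenize(text):
    # Same steps as above: lowercase, drop "?" and "'", split on whitespace.
    text = text.lower()
    text = text.replace("?", "")
    text = text.replace("'", "")
    return text.split()

question_tokens = tokenize("Who wrote 'To Kill a Mockingbird'?")
answer_tokens = tokenize("Harper-Lee")
other_tokens = tokenize("Paris, France")  # hypothetical input: the comma is not removed

print(question_tokens)  # ['who', 'wrote', 'to', 'kill', 'a', 'mockingbird']
print(answer_tokens)    # ['harper-lee']
print(other_tokens)     # ['paris,', 'france']
```

Because hyphens are kept, `Harper-Lee` stays a single token, which matters later when the model predicts exactly one token per answer.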

### **Vocabulary Building**
- Initializes a vocabulary dictionary with an unknown token (`<UNK>`).
- Iterates through all words in questions and answers to build the vocabulary.

In [47]:
vocab = {"<UNK>": 0}

def build_vocab(row):
  tokenized_question = tokenize(row["question"])
  tokenized_answer = tokenize(row["answer"])

  merged_tokens = tokenized_question + tokenized_answer

  for token in merged_tokens:
    if token not in vocab:
      vocab[token] = len(vocab)

In [48]:
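# build_vocab fills `vocab` in place and returns None for every row,
# so the result of apply is assigned to a throwaway name.
_ = df.apply(build_vocab, axis=1)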
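
As an illustration of the index assignment, the following sketch runs the same logic over two of the pairs from the dataset preview. It uses a plain Python list instead of the DataFrame, and the name `toy_vocab` is hypothetical, chosen to avoid confusion with the notebook's `vocab`.

```python
def tokenize(text):
    # Compact version of the notebook's tokenizer.
    return text.lower().replace("?", "").replace("'", "").split()

# Two question-answer pairs taken from the dataset preview.
pairs = [
    ("What is the capital of France?", "Paris"),
    ("What is the capital of Germany?", "Berlin"),
]

toy_vocab = {"<UNK>": 0}
for question, answer in pairs:
    for token in tokenize(question) + tokenize(answer):
        if token not in toy_vocab:
            # The next free index is the current size of the vocabulary.
            toy_vocab[token] = len(toy_vocab)

print(toy_vocab)
# {'<UNK>': 0, 'what': 1, 'is': 2, 'the': 3, 'capital': 4, 'of': 5,
#  'france': 6, 'paris': 7, 'germany': 8, 'berlin': 9}
```

Words shared by both questions keep the index they received the first time they were seen, so only `germany` and `berlin` are added by the second pair.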

### **Convert Text to Indices**
- Converts a given text into numerical indices based on the vocabulary.
- If a word is not in the vocabulary, it gets mapped to `<UNK>`.

In [49]:
def text_to_indices(text, vocab):
  indexed_text = []

  for token in tokenize(text):
    if token in vocab:
      indexed_text.append(vocab[token])
    else:
      indexed_text.append(vocab["<UNK>"])

  return indexed_text
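
The `<UNK>` fallback can be checked with a short sketch. It assumes a small hand-written vocabulary (`toy_vocab`, hypothetical) and uses `dict.get` as a compact equivalent of the `if`/`else` above.

```python
def tokenize(text):
    # Compact version of the notebook's tokenizer.
    return text.lower().replace("?", "").replace("'", "").split()

def text_to_indices(text, vocab):
    # dict.get returns the <UNK> index for any token missing from the vocabulary.
    unk = vocab["<UNK>"]
    return [vocab.get(token, unk) for token in tokenize(text)]

# Hypothetical vocabulary for illustration only.
toy_vocab = {"<UNK>": 0, "what": 1, "is": 2, "the": 3, "capital": 4, "of": 5, "france": 6}

known = text_to_indices("What is the capital of France?", toy_vocab)
unseen = text_to_indices("What is the capital of Pakistan?", toy_vocab)

print(known)   # [1, 2, 3, 4, 5, 6]
print(unseen)  # [1, 2, 3, 4, 5, 0]
```

Only the out-of-vocabulary word `pakistan` is replaced by index 0; the remaining tokens keep their usual indices.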

---
## **Create a Custom Dataset**
- Defines a custom `Dataset` subclass that:
  - Stores the DataFrame and vocabulary.
  - Returns each question-answer pair as two tensors of vocabulary indices.

In [50]:
class QADataset(Dataset):
  def __init__(self, df, vocab):
    self.df = df
    self.vocab = vocab

  def __len__(self):
    return len(self.df)

  def __getitem__(self, index):
    # Keep the row in a local variable; storing it on self would add needless shared state.
    row = self.df.iloc[index]

    num_question = text_to_indices(row["question"], self.vocab)
    num_answer = text_to_indices(row["answer"], self.vocab)

    return torch.tensor(num_question), torch.tensor(num_answer)

---
## **Prepare the DataLoader**
- Wraps the dataset in a DataLoader to iterate efficiently during training.

In [51]:
dataset = QADataset(df, vocab)

In [52]:
dataloader = DataLoader(dataset, batch_size=1, shuffle=True)
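
The sketch below shows what one batch looks like with `batch_size=1`. It assumes a small stand-in dataset (`TinyPairs`, hypothetical) that returns pre-indexed pairs, so it runs without the CSV file; the index values are made up.

```python
import torch
from torch.utils.data import Dataset, DataLoader

class TinyPairs(Dataset):
    """Hypothetical stand-in for QADataset that holds already-indexed pairs."""
    def __init__(self, pairs):
        self.pairs = pairs

    def __len__(self):
        return len(self.pairs)

    def __getitem__(self, index):
        question, answer = self.pairs[index]
        return torch.tensor(question), torch.tensor(answer)

# Questions of different lengths, each with a single-token answer.
pairs = [([1, 2, 3, 4, 5, 6], [7]), ([1, 2, 3, 4, 5, 8], [9]), ([10, 11, 12], [13])]
loader = DataLoader(TinyPairs(pairs), batch_size=1, shuffle=True)

# The default collation adds a leading batch dimension of size 1.
shapes = [(tuple(q.shape), tuple(a.shape)) for q, a in loader]
for question_shape, answer_shape in shapes:
    print(question_shape, answer_shape)
```

Each question arrives with shape `(1, seq_len)` and each answer with shape `(1, 1)`. Since questions differ in length, a batch size of 1 avoids the need for padding; larger batches would likely require a padding token and a custom collate function, which this notebook does not use.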

---
## **Define the RNN Model**
- **Embedding layer**: Converts word indices into dense vector representations (50-dimensional).
- **RNN layer**: Processes the embeddings using 64 hidden units.
- **Fully connected layer**: Maps the final hidden state to logits (unnormalized scores), one per word in the vocabulary.


In [53]:
class RecurrentNeuralNetwork(nn.Module):
  def __init__(self, vocab_size):
    super().__init__()

    self.embedding = nn.Embedding(vocab_size, embedding_dim=50)
    self.rnn = nn.RNN(50, 64, batch_first=True)
    self.fc = nn.Linear(64, vocab_size)

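  def forward(self, x):
    embedded = self.embedding(x)        # (batch, seq_len, 50)
    # nn.RNN returns the outputs at every time step and the last hidden state.
    hidden, final = self.rnn(embedded)  # hidden: (batch, seq_len, 64), final: (1, batch, 64)
    output = self.fc(final.squeeze(0))  # (batch, vocab_size)

    return output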
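
To trace the tensor shapes through the three layers, the following sketch applies them one at a time to a single six-token question. The vocabulary size of 20 and the token indices are hypothetical; the layer sizes match the model above.

```python
import torch
import torch.nn as nn

vocab_size = 20  # hypothetical; the notebook uses len(vocab)

embedding = nn.Embedding(vocab_size, embedding_dim=50)
rnn = nn.RNN(50, 64, batch_first=True)
fc = nn.Linear(64, vocab_size)

x = torch.tensor([[1, 2, 3, 4, 5, 6]])  # one question of six tokens: (1, 6)

embedded = embedding(x)            # (1, 6, 50): one 50-dim vector per token
all_steps, final = rnn(embedded)   # (1, 6, 64) and (1, 1, 64)
logits = fc(final.squeeze(0))      # (1, 20): one score per vocabulary entry

print(embedded.shape, all_steps.shape, final.shape, logits.shape)

# For a single-layer, unidirectional RNN the final hidden state
# equals the output at the last time step.
same = torch.allclose(all_steps[:, -1, :], final[0])
print(same)
```

Only the final hidden state is passed to the linear layer, so the whole question is summarized in one 64-dimensional vector before the answer token is scored.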

---
## **Train the Model**
- **Hyperparameters**:
  - **Learning rate** = 0.001
  - **Number of epochs** = 20
- **Loss function**: `CrossEntropyLoss`, the standard loss for multi-class classification; here every vocabulary entry is a class.
- **Optimizer**: Adam.
- **Training loop**:
  - Zeroes gradients.
  - Performs forward propagation.
  - Computes the loss between the predicted logits and the answer's token index. Every answer is a single token (for example, `harper-lee`), so `answer[0]` holds exactly one target index.
  - Performs backpropagation and updates weights.
  - Tracks the average loss per epoch.

In [54]:
learning_rate = 0.001
epochs = 20

In [55]:
model = RecurrentNeuralNetwork(len(vocab))
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=learning_rate)

In [56]:
for epoch in range(epochs):
  total_loss = 0

  for question, answer in dataloader:
    optimizer.zero_grad()
    output = model(question)
    loss = criterion(output, answer[0])
    loss.backward()
    optimizer.step()
    total_loss += loss.item()

  print(f"Epoch {epoch+1}/{epochs}, Loss: {total_loss/len(dataloader):.6f}")

Epoch 1/20, Loss: 5.853919
Epoch 2/20, Loss: 5.030995
Epoch 3/20, Loss: 4.145952
Epoch 4/20, Loss: 3.493195
Epoch 5/20, Loss: 2.926808
Epoch 6/20, Loss: 2.388752
Epoch 7/20, Loss: 1.904426
Epoch 8/20, Loss: 1.469875
Epoch 9/20, Loss: 1.120488
Epoch 10/20, Loss: 0.854185
Epoch 11/20, Loss: 0.655832
Epoch 12/20, Loss: 0.513684
Epoch 13/20, Loss: 0.407700
Epoch 14/20, Loss: 0.328897
Epoch 15/20, Loss: 0.272507
Epoch 16/20, Loss: 0.224546
Epoch 17/20, Loss: 0.191174
Epoch 18/20, Loss: 0.164953
Epoch 19/20, Loss: 0.144087
Epoch 20/20, Loss: 0.124229
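
To clarify what the reported loss measures, this sketch compares `nn.CrossEntropyLoss` with a manual computation: the negative log of the softmax probability assigned to the target token. The logits, the vocabulary size of 5, and the target index are hypothetical.

```python
import math
import torch
import torch.nn as nn

torch.manual_seed(0)

logits = torch.randn(1, 5)   # one example, hypothetical vocabulary of 5 tokens
target = torch.tensor([3])   # index of the correct answer token

criterion = nn.CrossEntropyLoss()
loss = criterion(logits, target)

# Manual version: negative log-probability of the target class.
manual = -torch.log_softmax(logits, dim=1)[0, target.item()]
print(torch.allclose(loss, manual))  # True

# With all-zero logits every token is equally likely, so the loss is ln(5).
uniform = criterion(torch.zeros(1, 5), target)
print(abs(uniform.item() - math.log(5)) < 1e-6)  # True
```

A loss near zero therefore means the model assigns almost all probability to the correct answer token for the training questions; it says nothing about questions outside the training set.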


---
## **Predict Answer to a New Question**
- Converts a question into numerical indices.
- Passes the tensor through the trained model.
- Applies softmax activation to get probabilities.
- Retrieves the word with the highest probability.
- If the confidence is below the threshold (0.5), it prints `"I don't know."`

In [61]:
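def predict(model, question, threshold=0.5):
  num_question = text_to_indices(question, vocab)
  tensor_question = torch.tensor(num_question).unsqueeze(0)  # add a batch dimension

  model.eval()
  with torch.no_grad():  # gradients are not needed at inference time
    output = model(tensor_question)

  probabilities = torch.nn.functional.softmax(output, dim=1)
  value, index = torch.max(probabilities, dim=1)

  if value.item() < threshold:
    print("I don't know.")
  else:
    # vocab keeps insertion order, so a key's position equals its index.
    print(list(vocab.keys())[index.item()])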
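
The thresholding step can be examined in isolation with hand-picked logits. The sketch below assumes a three-word vocabulary (`toy_vocab`) and a helper named `decide`, both hypothetical; it also builds a reverse index-to-token dictionary as one possible alternative to indexing into `list(vocab.keys())`.

```python
import torch

def decide(logits, id_to_token, threshold=0.5):
    # Convert logits to probabilities and keep the most likely token.
    probabilities = torch.softmax(logits, dim=1)
    value, index = torch.max(probabilities, dim=1)
    if value.item() < threshold:
        return "I don't know."
    return id_to_token[index.item()]

# Hypothetical vocabulary and its reverse mapping.
toy_vocab = {"<UNK>": 0, "paris": 1, "berlin": 2}
id_to_token = {index: token for token, index in toy_vocab.items()}

# One logit clearly dominates: the top probability is about 0.73.
confident = decide(torch.tensor([[0.1, 2.0, 0.5]]), id_to_token)
# Nearly flat logits: the top probability is about 0.37, below the threshold.
unsure = decide(torch.tensor([[0.1, 0.0, 0.2]]), id_to_token)

print(confident)  # paris
print(unsure)     # I don't know.
```

Returning the string instead of printing it makes the decision easier to test, while the threshold logic is the same as in `predict`.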

---
### **Example Predictions**
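- The first question appears in the training data and is answered correctly. The second does not produce a confident prediction, so the model falls back to `I don't know.`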

In [62]:
predict(model, "Who wrote 'To Kill a Mockingbird'?")

harper-lee


In [63]:
predict(model, "Where is Islamabad situated?")

I don't know.
