<a href="https://colab.research.google.com/github/AKookani/NLP/blob/main/HW_RAG.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# TA: Mohammad Erfan Zare
# Question topics: LLM & RAG
# FALL 2024

Install the required libraries: `transformers`, `datasets`, and `faiss` (published on PyPI as `faiss-cpu`).


In [1]:
!pip install transformers datasets faiss-cpu

Collecting datasets
  Downloading datasets-3.2.0-py3-none-any.whl.metadata (20 kB)
Collecting faiss-cpu
  Downloading faiss_cpu-1.9.0.post1-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (4.4 kB)
Collecting dill<0.3.9,>=0.3.0 (from datasets)
  Downloading dill-0.3.8-py3-none-any.whl.metadata (10 kB)
Collecting xxhash (from datasets)
  Downloading xxhash-3.5.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (12 kB)
Collecting multiprocess<0.70.17 (from datasets)
  Downloading multiprocess-0.70.16-py310-none-any.whl.metadata (7.2 kB)
Collecting fsspec<=2024.9.0,>=2023.1.0 (from fsspec[http]<=2024.9.0,>=2023.1.0->datasets)
  Downloading fsspec-2024.9.0-py3-none-any.whl.metadata (11 kB)
Downloading datasets-3.2.0-py3-none-any.whl (480 kB)
Downloading faiss_cpu-1.9.0.post1-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64

Import modules

In [1]:
# Only the classes used below are imported; the unused `pipeline` and
# `SentenceTransformer` imports were dropped (the latter is not installed by the pip cell above).
from transformers import AutoTokenizer, AutoModel, AutoModelForSeq2SeqLM
from datasets import load_dataset
import faiss
import torch
import numpy as np

1. Use the [**`wikipedia`** dataset](https://huggingface.co/datasets/wikipedia) from the Hugging Face library.
2. Extract a subset of articles and prepare it as a knowledge base for retrieval.
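
A minimal sketch of one way to "prepare" articles as a knowledge base: split each article into overlapping word windows so that every passage fits comfortably in the encoder's input. This is an optional step the notebook itself does not perform (it indexes whole articles truncated to 512 tokens). The helper names `chunk_article` and `build_knowledge_base`, the window sizes, and the toy rows are all assumptions for illustration; the only thing taken from the dataset is that each row carries `title` and `text` fields.

```python
def chunk_article(text, max_words=200, overlap=50):
    """Split one article into overlapping word windows (hypothetical helper)."""
    if overlap >= max_words:
        raise ValueError("overlap must be smaller than max_words")
    words = text.split()
    step = max_words - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + max_words]))
        # Stop once the current window already reaches the end of the article.
        if start + max_words >= len(words):
            break
    return chunks


def build_knowledge_base(articles, max_words=200, overlap=50):
    """Flatten articles into passage records that remember their source title."""
    passages = []
    for article in articles:
        chunks = chunk_article(article["text"], max_words, overlap)
        for chunk_id, chunk in enumerate(chunks):
            passages.append(
                {"title": article["title"], "chunk_id": chunk_id, "text": chunk}
            )
    return passages


# Toy stand-in for dataset rows (assumed fields: "title" and "text").
toy_articles = [
    {"title": "April", "text": "April is the fourth month of the year. " * 30},
    {"title": "Art", "text": "Art is a creative activity."},
]
kb = build_knowledge_base(toy_articles, max_words=100, overlap=20)
print(len(kb), kb[0]["title"], kb[-1]["title"])  # 4 April Art
```

With real data, the same function could be fed the rows of `data`; the retrieval step would then return passages rather than whole articles.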

In [2]:
data = load_dataset("wikipedia", "20220301.simple", split="train[:1%]")

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


# Task1: Preprocess the dataset

In [3]:
# Initialize the tokenizer for the chosen model
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# Function to preprocess and tokenize the articles
def preprocess_article(batch):
    # Access the 'text' feature directly
    texts = [text.strip() for text in batch['text']]
    tokens = tokenizer(texts, truncation=True, padding='max_length', max_length=512, return_tensors='pt')
    # Convert tensor outputs to lists for compatibility with datasets
    return {"input_ids": tokens['input_ids'].tolist(), "attention_mask": tokens['attention_mask'].tolist()}

# Apply the preprocessing function to the dataset
tokenized_data = data.map(preprocess_article, batched=True, batch_size=16)

# Now tokenized_data contains tokenized representations of the articles

# Task2: Index Creation for Retrieval
1. Use **FAISS** to create an index of the articles for efficient similarity search.
2. Extract embeddings using a pretrained transformer model.
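
To make the "similarity search" in step 1 concrete, here is a dependency-free sketch of what a flat (exhaustive) L2 index computes: the squared Euclidean distance from the query to every stored vector, followed by a top-k selection. The function names and toy vectors are illustrative assumptions, not part of FAISS; FAISS performs the same comparison far more efficiently in optimized native code.

```python
import heapq


def squared_l2(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def flat_l2_search(vectors, query, k=3):
    """Score every stored vector and return the k closest as (distance, row_id)."""
    scored = ((squared_l2(vec, query), row_id) for row_id, vec in enumerate(vectors))
    return heapq.nsmallest(k, scored)


# Toy 2-D "embeddings"; real BERT embeddings would have 768 dimensions.
toy_vectors = [[0.0, 0.0], [1.0, 0.0], [0.0, 2.0], [3.0, 3.0]]
hits = flat_l2_search(toy_vectors, [0.9, 0.1], k=3)
print([row_id for _, row_id in hits])  # [1, 0, 2]
```

The row ids returned here play the same role as the indices returned by the FAISS search: they point back to positions in the list of articles.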

In [5]:
# Load the model and move it to GPU if available
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = AutoModel.from_pretrained("bert-base-uncased").to(device)
model.eval()  # Set the model to evaluation mode

# Function to generate embeddings for each article
def get_embeddings(batch):
    with torch.no_grad():
        input_ids = torch.tensor(batch['input_ids']).to(device)
        attention_mask = torch.tensor(batch['attention_mask']).to(device)
        outputs = model(input_ids=input_ids, attention_mask=attention_mask)
        # Use the [CLS] token's output as the embedding
        embeddings = outputs.last_hidden_state[:, 0, :].cpu().numpy()
    return {"embeddings": [emb.tolist() for emb in embeddings]}

# Apply the function to generate embeddings
embedded_data = tokenized_data.map(get_embeddings, batched=True, batch_size=16)

Map:   0%|          | 0/2053 [00:00<?, ? examples/s]
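
The cell above takes the `[CLS]` vector as the article embedding. A commonly used alternative is masked mean pooling, which averages the vectors of the real tokens and ignores padding. The sketch below shows only the arithmetic on plain Python lists; `masked_mean_pool` is a hypothetical helper, and whether this pooling retrieves better on this dataset is an assumption that would need to be tested.

```python
def masked_mean_pool(token_vectors, attention_mask):
    """Average the vectors of real tokens (mask == 1), skipping padding (mask == 0)."""
    dim = len(token_vectors[0])
    total = [0.0] * dim
    count = 0
    for vec, keep in zip(token_vectors, attention_mask):
        if keep:
            count += 1
            for j, value in enumerate(vec):
                total[j] += value
    if count == 0:
        # No real tokens: return the zero vector instead of dividing by zero.
        return total
    return [t / count for t in total]


# Three token vectors; the last one is padding and must not affect the result.
tokens = [[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]
mask = [1, 1, 0]
pooled = masked_mean_pool(tokens, mask)
print(pooled)  # [2.0, 3.0]
```

Because the articles are padded to `max_length=512`, leaving the mask out of the average would let padding positions dominate short articles.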

### Task3: Retrieval-Augmented Question Answering
1. Implement a simple RAG system where user queries are matched against the knowledge base.
2. Retrieve the top 3 most relevant articles and use a generative LLM to answer the query.
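
For step 2, the generator can only answer the query if the query is part of its input. The sketch below builds a prompt that combines the question with the retrieved passages. `build_prompt`, the prompt wording, and the per-passage word limit are assumptions for illustration, and the approach presumes an instruction-following generator (for example an instruction-tuned seq2seq model) rather than a pure summarization model.

```python
def build_prompt(query, passages, max_words_per_passage=120):
    """Combine the user query and the retrieved passages into one generator input."""
    context_lines = []
    for n, passage in enumerate(passages, start=1):
        # Trim each passage so that all of them fit in the model's input window.
        words = passage.split()[:max_words_per_passage]
        context_lines.append(f"[{n}] " + " ".join(words))
    context = "\n".join(context_lines)
    return (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {query}\nAnswer:"
    )


prompt = build_prompt(
    "What is artificial intelligence?",
    ["Artificial intelligence is the ability of a computer program to think and learn."],
)
print(prompt)
```

Trimming every passage separately keeps all retrieved passages visible to the model, whereas truncating the concatenated text tends to drop the later ones entirely.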


In [9]:
# Create a FAISS index and add embeddings
# Get the dimension of the embeddings
dimension = len(embedded_data['embeddings'][0])

# Create a FAISS index with L2 (Euclidean) distance
index = faiss.IndexFlatL2(dimension)

# Stack the embeddings into a 2-D float32 array (the dtype FAISS works with) and add it to the index
embeddings = np.array(embedded_data['embeddings'], dtype=np.float32)
index.add(embeddings)

# Load a generative model
gen_tokenizer = AutoTokenizer.from_pretrained("facebook/bart-large-cnn")
gen_model = AutoModelForSeq2SeqLM.from_pretrained("facebook/bart-large-cnn").to(device)

# Function to retrieve and generate answers
def retrieve_and_generate(query):
    # Embed the query with the same BERT model and [CLS] pooling used for the articles
    query_token = tokenizer(query, return_tensors='pt', truncation=True, max_length=512).to(device)
    with torch.no_grad():
        query_embedding = model(**query_token).last_hidden_state[:, 0, :].cpu().numpy()

    # Retrieve the 3 nearest articles; row i of the index corresponds to data[i]
    _, indices = index.search(query_embedding, 3)
    retrieved_articles = [data[int(i)]['text'] for i in indices[0]]

    # Concatenate the articles and summarize them.
    # NOTE: the query itself is never passed to the generator, so the output is a
    # summary of the retrieved text rather than a direct answer.
    # NOTE: "summarize: " is a T5-style task prefix; BART reads it as ordinary input
    # text, so it can resurface in the generated output.
    # NOTE: truncation=True is given without max_length, which triggers the
    # tokenizer warning shown in the output below.
    input_text = " ".join(retrieved_articles)
    inputs = gen_tokenizer.encode("summarize: " + input_text, return_tensors='pt', truncation=True).to(device)
    summary_ids = gen_model.generate(inputs, max_length=150, min_length=40, length_penalty=2.0, num_beams=4, early_stopping=True)
    answer = gen_tokenizer.decode(summary_ids[0], skip_special_tokens=True)

    return answer


# Test with an example query
query = "What is artificial intelligence?"
print("Query:", query)
print("Answer:", retrieve_and_generate(query))

Asking to truncate to max_length but no maximum length is provided and the model has no predefined maximum length. Default to no truncation.


Query: What is artificial intelligence?
Answer: summarize: The following is a list of mental disorders. This is aList of famous walls. This list of color topics is also called a List of Color Topics. The list of colors is also known as the List of Colour Topics.


### Task4: Evaluation
1. Evaluate the performance of the system using 5 queries of your choice.
2. Analyze the quality of the answers and suggest improvements.
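
One simple, automatic way to support the analysis in step 2 is to check whether each answer mentions keywords that a correct answer would be expected to contain. The sketch below is only a rough proxy for answer quality; `keyword_coverage`, `evaluate`, the keyword lists, and the canned `fake_answer` function are hypothetical and stand in for the real system so that the example runs on its own.

```python
def keyword_coverage(answer, expected_keywords):
    """Fraction of expected keywords found in the answer (case-insensitive)."""
    if not expected_keywords:
        return 0.0
    text = answer.lower()
    hits = sum(1 for kw in expected_keywords if kw.lower() in text)
    return hits / len(expected_keywords)


def evaluate(answer_fn, labelled_queries):
    """Run answer_fn on every query and report per-query and mean coverage."""
    rows = []
    for query, keywords in labelled_queries:
        answer = answer_fn(query)
        rows.append({"query": query, "coverage": keyword_coverage(answer, keywords)})
    mean = sum(r["coverage"] for r in rows) / len(rows) if rows else 0.0
    return rows, mean


def fake_answer(query):
    """Canned stand-in for the real RAG function (assumption for this demo)."""
    canned = {"What is the capital of France?": "Paris is the capital of France."}
    return canned.get(query, "The following is a list of mental disorders.")


# Hand-written expected keywords (assumptions, not ground truth from the dataset).
labelled = [
    ("What is the capital of France?", ["Paris"]),
    ("What is climate change?", ["climate", "temperature"]),
]
rows, mean_coverage = evaluate(fake_answer, labelled)
print(rows)
print("mean coverage:", mean_coverage)  # mean coverage: 0.5
```

In the notebook, the function defined in Task 3 could be passed in place of `fake_answer`, with a keyword list written by hand for each of the five queries.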


In [10]:
queries = [
    "What is machine learning?",
    "Explain natural language processing.",
    "What is the capital of France?",
    "Describe the theory of relativity.",
    "What is climate change?"
]

# Generate and print answers for each query
for query in queries:
    print("Query:", query)
    print("Answer:", retrieve_and_generate(query))
    print("\n")

Query: What is machine learning?
Answer: summarize: The following is a list of mental disorders. This is aList of famous walls. This list of color topics is also called a List of Color Topics. The list of colors is also known as the List of Colour Topics.


Query: Explain natural language processing.
Answer: Cognitive science studies how people make their ideas and what makes thoughts logical. It is often seen as the result of several different scientific fields working together. It does not refer to the sum of all these disciplines. It refers to their intersection on specific problems.


Query: What is the capital of France?
Answer: The following is a list of mental disorders. The following is an list of famous walls. The list includes famous people, places, and events. For more information on mental disorders, see the Mental Health Atlas.


Query: Describe the theory of relativity.
Answer: This is a list of elements by atomic number with symbol.summarize: This is a lists of physicist