# Name of the project and introduction 

- Using Pytorch to build machine learning models as it supports both CPU and GPU computations
- Metal Performance Shaders is a framework provided by Apple for accelerating machine learning tasks on macOS and iOS devices using the GPU. This   checks if the MPS backend is available on your system.
- If this returns True, it means your macOS system supports running PyTorch computations on its GPU via Metal. 
- If this returns False, it means either your system doesn’t support MPS or the backend is not properly configured.

- This check is useful to determine whether your system can offload computations to the GPU, which is much faster than using the CPU, especially for large-scale machine learning tasks.

In [1]:
import torch
print(torch.backends.mps.is_available())  # True if Metal backend is active

True


- The transformers library provides access to pre-trained models for a variety of NLP tasks.
- In this step, we're preparing to load a pre-trained model for causal language modeling and its corresponding tokenizer.
- AutoModelForCausalLM: Loads a causal language model for tasks like text generation, Automatically identifies the correct model architecture based on the model name or path you provide, GPT-2, GPT-3.
- AutoTokenizer: Converts text into token IDs that the model can process, Converts token IDs back into human-readable text after processing., Automatically matches the tokenizer with the model you load

- It prepares the tools needed to load a pre-trained model and process text inputs. These components are essential for interacting with NLP models for tasks like text generation, summarization, or question-answering.


In [2]:
from transformers import AutoModelForCausalLM, AutoTokenizer

  from .autonotebook import tqdm as notebook_tqdm


- A fine-tuned instruction-following model optimized for tasks like:

    - Question answering
    - Summarization
    - Dialogue and other conversational tasks.
    - 0.5B: The model has 0.5 billion parameters, which makes it lightweight and fast for most tasks.

In [3]:
model_name = "Qwen/Qwen2.5-0.5B-Instruct"

### 1. Loading the Model
AutoModelForCausalLM.from_pretrained:

- This method fetches the pre-trained causal language model specified by model_name (in this case, "Qwen/Qwen2.5-0.5B-Instruct").
- Arguments:

    - model_name: The name of the model to load (stored in the previous step).
    - torch_dtype="auto": Automatically selects the most suitable precision for the model, such as float16 or float32. Lower precision (e.g., float16) helps save memory and speeds up inference.
    - device_map="auto": Automatically maps the model to the best available device (like GPU, MPS, or CPU).
    - cache_dir='llms/': Specifies the directory to cache the downloaded model for reuse.

In [4]:
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype="auto",
    device_map="auto",
    cache_dir='llms/'
)


### 2. Loading the tokenizer

- Loads the tokenizer corresponding to the model_name model.
- Ensures that the text input and output align with the model's requirements.

In [5]:
tokenizer = AutoTokenizer.from_pretrained(model_name)

### 3. Defining the Prompt and Messages

- Prompt:
    - A simple text string asking the model for a short introduction to large language models.
- Messages:
    - A list simulating a conversation:
        - role: system: Provides system instructions (e.g., define the assistant's identity or behavior).
        - role: user: The user’s input or question (the prompt).

In [6]:
prompt = "Give me a short introduction to large language model."
messages = [
    {"role": "system", "content": "You are Qwen, created by Alibaba Cloud. You are a helpful assistant."},
    {"role": "user", "content": prompt}
]

### Preparing the input for the model

- Converts the messages into a format the model can understand.
- apply_chat_template: 
    - A method that combines system instructions and user input into a cohesive template.
    - add_generation_prompt=True: Appends any additional information the model might need for generation.

In [7]:
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True
)


- Tokenizing:
    - Converts the formatted text into token IDs that the model can process.
    - return_tensors="pt": Returns the data in PyTorch tensor format.
    - .to(model.device): Moves the tokenized input to the device (e.g., GPU or CPU) where the model is loaded.

In [8]:
model_inputs = tokenizer([text], return_tensors="pt").to(model.device)

### 5. Generating a response

- Generates a response based on the tokenized input.
- max_new_tokens=512: Specifies the maximum number of tokens to generate in the response.

In [9]:
generated_ids = model.generate(
    **model_inputs,
    max_new_tokens=512
)

- Extracts the newly generated tokens (ignoring the input tokens). This ensures that only the model's response is retained.

In [10]:
generated_ids = [
    output_ids[len(input_ids):] for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids)
]

### 6. Decoding the Generated tokens

- Converts the generated token IDs back into human-readable text.
- skip_special_tokens=True: Removes special tokens (like <|endoftext|>) from the output.

In [11]:
response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]

In [12]:
response

'Large Language Models (LLMs) are artificial intelligence systems that can generate human-like text, often used in various applications such as language translation, summarization, and chatbots. These models are trained on vast amounts of data, allowing them to understand and produce natural-sounding language with remarkable accuracy. LLMs have become increasingly popular due to their ability to handle complex tasks and generate creative content. They have the potential to revolutionize industries such as healthcare, finance, education, and more.'

In [13]:
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from sentence_transformers import SentenceTransformer
import wikipedia
from bs4 import BeautifulSoup
import re
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np


- This step defines a reusable class to encapsulate all interactions with the Large Language Model (LLM). The goal is to:

    - Simplify the code for generating responses by wrapping LLM-related logic in a single class.
    - Improve efficiency by loading the model and tokenizer only once during initialization.

- Advantages of This Wrapper

    - Modularity: Encapsulates all LLM-related logic in one place, making the code reusable and easy to debug.
    - Efficiency: Loads the model only once during initialization, avoiding repeated loads for every query.
    - Ease of Use: Simplifies generating responses by providing a single method (get_response).

In [14]:
# Step 1: Reusable LLM Wrapper Class
class LLMWrapper:
    """
    A reusable wrapper for interacting with an LLM, ensuring the model is loaded once.
    """
    def __init__(self, model_name="Qwen/Qwen2.5-0.5B-Instruct", cache_dir="llms/"):
        """
        Initialize the LLM model and tokenizer.

        Args:
            model_name (str): The Hugging Face model name.
            cache_dir (str): Directory to cache the model and tokenizer.
        """
        print("Initializing the LLM model and tokenizer...\n")
        self.model = AutoModelForCausalLM.from_pretrained(
            model_name,
            torch_dtype="auto",
            device_map="auto",
            cache_dir=cache_dir
        )
        self.tokenizer = AutoTokenizer.from_pretrained(model_name, cache_dir=cache_dir)

    def get_response(self, prompt, max_new_tokens=512):
        """
        Generate a response from the LLM.

        Args:
            prompt (str): The input prompt for the LLM.
            max_new_tokens (int): Maximum tokens to generate.

        Returns:
            str: The generated response from the LLM.
        """
        # Format the messages
        messages = [
            {"role": "system", "content": "You are Qwen, created by Alibaba Cloud. You are a helpful assistant."},
            {"role": "user", "content": prompt}
        ]

        # Prepare the input for the model
        text = self.tokenizer.apply_chat_template(
            messages,
            tokenize=False,
            add_generation_prompt=True
        )
        model_inputs = self.tokenizer([text], return_tensors="pt").to(self.model.device)

        # Generate the response
        generated_ids = self.model.generate(
            **model_inputs,
            max_new_tokens=max_new_tokens
        )
        generated_ids = [
            output_ids[len(input_ids):] for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids)
        ]

        # Decode and return the response
        return self.tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]


- This step involves fetching content from Wikipedia based on a query and splitting the content into manageable chunks for processing. It consists of two primary functions:

    - fetch_wikipedia_articles: 
        - Fetches articles from Wikipedia based on a query.
        - Handles common Wikipedia errors like disambiguation and missing pages.
    - chunk_text:
        - Splits long texts into smaller chunks for easier processing (e.g., for feeding into an LLM or analyzing text).

In [15]:
# Step 2: Wikipedia Fetching and Processing
def fetch_wikipedia_articles(query, top_n=10, lang="en", user_agent="MyWikipediaApp/1.0 (myemail@example.com)"):
    """
    Fetch Wikipedia articles for a given query.

    Args:
        query (str): Search query.
        top_n (int): Number of search results to fetch.
        lang (str): Wikipedia language edition (default: English).
        user_agent (str): User agent string for API compliance.

    Returns:
        dict: A dictionary with article titles as keys and content as values.
    """
    # Set language and user agent
    wikipedia.set_lang(lang)
    wikipedia.set_user_agent(user_agent)

    # Search Wikipedia for the query
    search_results = wikipedia.search(query, results=top_n)
    print(search_results)

    # Fetch content for each search result
    articles = {}
    for title in search_results:
        try:
            content = wikipedia.page(title).content
            articles[title] = content
        except wikipedia.exceptions.DisambiguationError as e:
            print(f"DisambiguationError: {title} has multiple meanings. Skipping.")
        except wikipedia.exceptions.PageError as e:
            print(f"PageError: Could not fetch page for {title}. Skipping.")
    
    return articles


def chunk_text(text, chunk_size=300, overlap=50):
    sentences = re.split(r'(?<=[.!?]) +', text)
    chunks = []
    current_chunk = []
    current_length = 0

    for sentence in sentences:
        current_length += len(sentence.split())
        current_chunk.append(sentence)
        if current_length >= chunk_size:
            chunks.append(" ".join(current_chunk))
            current_chunk = current_chunk[-overlap:]  # Maintain overlap
            current_length = len(" ".join(current_chunk).split())
    
    if current_chunk:
        chunks.append(" ".join(current_chunk))
    
    return chunks

- To identify and retrieve the most relevant text chunks for a given question by leveraging cosine similarity and sentence embeddings.

In [16]:
# Step 3: Retrieve Relevant Chunks
def retrieve_relevant_chunks(question, chunks, model_name="BAAI/bge-base-en-v1.5"):
    embedder = SentenceTransformer(model_name, trust_remote_code=True, cache_folder="llms/")
    chunk_embeddings = embedder.encode(chunks)
    question_embedding = embedder.encode([question])

    # Calculate cosine similarity
    similarities = cosine_similarity(question_embedding, chunk_embeddings)[0]
    sorted_indices = np.argsort(similarities)[::-1]
    top_chunks = [chunks[i] for i in sorted_indices[:3]]  # Top 5 relevant chunks

    return top_chunks

- This function uses the Qwen LLM to transform a user’s natural language question into a search query that can be used to fetch relevant Wikipedia articles. 

- The transformation is essential for narrowing down the search results to match the intent behind the user's query.

In [17]:
# Step 4: Question to Query Transformation Using Qwen
def transform_question_to_query_with_llm(question, llm_wrapper):
    prompt = f"Transform the following question into a search query suitable for Wikipedia:\nQuestion: {question}\nSearch Query:"
    return llm_wrapper.get_response(prompt, max_new_tokens=50)


- This function is responsible for generating a detailed answer to the user's question based on the relevant chunks of Wikipedia articles that were retrieved earlier. 

- It uses the Qwen LLM to synthesize the provided context and answer the question effectively.

In [18]:
# Step 5: Generate Answer with Qwen
def generate_answer_with_qwen(question, chunks, llm_wrapper):
    # Combine chunks into a context for the prompt
    context = "\n".join(chunks)
    print(f"Context being used:\n{context}")
    prompt = f"Here is some context:\n{context}\n\nAnswer the question: {question}"
    return llm_wrapper.get_response(prompt)

- The function orchestrates the entire process of answering a question using the Qwen LLM. 

- It covers all the steps from transforming the question into a search query, fetching relevant Wikipedia articles, processing the content, retrieving the most relevant chunks, and finally generating an answer. 

- This forms a complete pipeline for answering a question using external knowledge from Wikipedia.

In [19]:
# Step 6: Full Pipeline
def answer_question_pipeline(question, llm_wrapper):
    # Step 1: Transform question into a query
    query = transform_question_to_query_with_llm(question, llm_wrapper)
    print(f"Search Query: {query}")

    # Step 2: Fetch Wikipedia articles
    articles = fetch_wikipedia_articles(query)
    print(f"\n Fetched {len(articles)} articles for query: {query}\n")

    # Step 3: Extract and chunk content
    chunks = []
    for title, text in articles.items():
        chunks.extend(chunk_text(text))
    print(f"Generated {len(chunks)} chunks from the articles.\n")

    # Step 4: Retrieve the most relevant chunks
    relevant_chunks = retrieve_relevant_chunks(question, chunks)
    print("Retrieved relevant chunks.\n")

    # Step 5: Generate answer using Qwen
    answer = generate_answer_with_qwen(question, relevant_chunks, llm_wrapper)
    return answer

- This final step initializes the LLMWrapper and uses the previously defined pipeline to answer a specific question. 

- It demonstrates how to interact with the answer_question_pipeline function and obtain an answer based on a question about the Prime Minister of India in 2023.

In [20]:
# Initialize the LLM wrapper
llm = LLMWrapper()

# Example question
question = "Who is the Prime Minister of India in 2023?"

# Get the final answer
answer = answer_question_pipeline(question, llm)
print("\nFinal Answer:")
print(answer)

Initializing the LLM model and tokenizer...

Search Query: Who was the Prime Minister of India in 2023?
['List of prime ministers of India', 'Deputy Prime Minister of India', 'Spouse of the prime minister of India', "Prime Minister's Office (India)", 'List of prime ministers of Canada', 'Minister of Railways (India)', 'Prime Minister of India', 'The Accidental Prime Minister', 'Acting prime minister', 'List of prime ministers of Pakistan']

 Fetched 10 articles for query: Who was the Prime Minister of India in 2023?

Generated 256 chunks from the articles.



huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)


Retrieved relevant chunks.

Context being used:
The prime minister of India (ISO: Bhārata kē Pradhānamaṁtrī) is the head of Union Council of Ministersof the Republic of India. Executive authority is vested in the prime minister and his chosen Council of Ministers, despite the president of India being the nominal head of the executive. The prime minister has to be a (nominated) member of one of the houses of bicameral Parliament of India, alongside heading the respective house. The prime minister and his cabinet are at all times responsible to the Lok Sabha.
The prime minister is appointed by the president of India; however, the prime minister has to enjoy the confidence of the majority of Lok Sabha members, who are directly elected every five years, lest the prime minister shall resign. The prime minister can be a member of the Lok Sabha or the Rajya Sabha, the upper house of the parliament. The prime minister controls the selection and dismissal of members of the Union Council of Mini

## Observations

- initially when the top_n value was 5, the model had trouble selecting the right articles to go through (no releveant articles). So the model kept returning that it had no information regarding the PM of India in 2023.

- After updating the top_n value to 10, the model finally found a few references where PM of India, 2023, and Narendra Modi were there together and realted them to give this answer.

In [21]:
# Initialize the LLM wrapper
llm = LLMWrapper()

# Example question
question = "Is Narendra Modi Prime minister of India in 2025?"

# Get the final answer
answer = answer_question_pipeline(question, llm)
print("\nFinal Answer:")
print(answer)

Initializing the LLM model and tokenizer...

Search Query: Is Narendra Modi the current Prime Minister of India as of 2025?
['Foreign policy of the Narendra Modi government', 'Union Council of Ministers', 'Bangladesh–India relations', '2025 Delhi Legislative Assembly election', 'List of schemes of the government of India', 'India–United States relations', 'Manmohan Singh', 'Make in India', '2024 Indian general election', 'NITI Aayog']
PageError: Could not fetch page for Make in India. Skipping.

 Fetched 9 articles for query: Is Narendra Modi the current Prime Minister of India as of 2025?

Generated 1214 chunks from the articles.

Retrieved relevant chunks.

Context being used:
Modi met with President Vladimir Putin in July on the sidelines of 6th BRICS summit in Brazil.
French Foreign Minister Laurent Fabius made an official visit to India from 29 June–2 July and held high-level talks with both the External Affairs Minister and Modi. Strategic and defense cooperation was at the top o