# LLM RAG NLP Project: Document-Based Question Answering System

This notebook combines Large Language Models (LLM), Retrieval-Augmented Generation (RAG), and Natural Language Processing (NLP) techniques to create a document-based question answering system.

## Setup and Installation

In [None]:
!pip install torch transformers sentence-transformers faiss-cpu python-dotenv ipywidgets

In [None]:
import os
import re
import numpy as np
from typing import List, Dict, Tuple
from transformers import pipeline
from sentence_transformers import SentenceTransformer
import faiss
import torch  # used by QASystem to check whether CUDA is available
import ipywidgets as widgets
from IPython.display import display, Markdown

## Document Processor Class

In [None]:
class DocumentProcessor:
    def __init__(self):
        self.embedding_model = SentenceTransformer('all-MiniLM-L6-v2')
        self.vector_db = None
        self.doc_chunks = []
        self.chunk_embeddings = []
        
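    def chunk_document(self, text: str, chunk_size: int = 500, chunk_overlap: int = 100) -> List[str]:
        """Split a document into overlapping chunks of whitespace-separated words"""
        if not 0 <= chunk_overlap < chunk_size:
            # An overlap >= chunk_size would keep `start` from advancing (infinite loop)
            raise ValueError("chunk_overlap must be non-negative and smaller than chunk_size")

        # str.split() with no arguments ignores leading/trailing whitespace,
        # so no empty "words" end up in the chunks
        words = text.split()
        chunks = []
        start = 0

        while start < len(words):
            end = min(start + chunk_size, len(words))
            chunks.append(' '.join(words[start:end]))

            if end == len(words):
                break

            # Step back by the overlap so consecutive chunks share words
            start = end - chunk_overlap

        return chunks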
    
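    def create_embeddings(self, chunks: List[str]) -> None:
        """Generate embeddings for document chunks and index them with FAISS"""
        self.doc_chunks = chunks
        embeddings = self.embedding_model.encode(chunks, show_progress_bar=True)
        # FAISS expects a contiguous float32 matrix of shape (n_chunks, dimension)
        self.chunk_embeddings = np.ascontiguousarray(embeddings, dtype=np.float32)

        # Create a flat (exhaustive) FAISS index; it reports squared L2 distances
        dimension = self.chunk_embeddings.shape[1]
        self.vector_db = faiss.IndexFlatL2(dimension)
        self.vector_db.add(self.chunk_embeddings)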
    
    def retrieve_relevant_chunks(self, query: str, k: int = 3) -> List[Tuple[str, float]]:
        """Retrieve the top-k chunks for a query as (chunk, squared L2 distance) pairs; lower is closer"""
        query_embedding = np.asarray(self.embedding_model.encode([query]), dtype=np.float32)
        distances, indices = self.vector_db.search(query_embedding, k)
        
        results = []
        for idx, distance in zip(indices[0], distances[0]):
            if idx >= 0:  # -1 indicates no result
                results.append((self.doc_chunks[idx], float(distance)))
                
        return results
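
The two core steps of `DocumentProcessor` can be illustrated without any model or FAISS dependency. The sketch below is a simplified stand-in, not the class itself: `chunk_words` and `top_k_squared_l2` are hypothetical helper names, and the 2-D vectors are hand-made toy "embeddings". It shows that the chunk window advances by `chunk_size - chunk_overlap` words, and that a flat L2 index amounts to an exhaustive search ranked by squared Euclidean distance.

```python
from typing import List, Sequence, Tuple


def chunk_words(words: Sequence[str], chunk_size: int, chunk_overlap: int) -> List[List[str]]:
    """Slide a window of chunk_size words forward by (chunk_size - chunk_overlap)."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")
    stride = chunk_size - chunk_overlap
    chunks = []
    start = 0
    while start < len(words):
        end = min(start + chunk_size, len(words))
        chunks.append(list(words[start:end]))
        if end == len(words):
            break
        start += stride
    return chunks


def top_k_squared_l2(
    query: Sequence[float], vectors: Sequence[Sequence[float]], k: int
) -> List[Tuple[int, float]]:
    """Exhaustive nearest-neighbour search by squared L2 distance (smaller is closer)."""
    scored = [
        (idx, sum((q - v) ** 2 for q, v in zip(query, vec)))
        for idx, vec in enumerate(vectors)
    ]
    scored.sort(key=lambda pair: pair[1])
    return scored[:k]


# Toy data (assumed for illustration): ten one-letter "words" and three 2-D vectors.
toy_chunks = chunk_words(list("abcdefghij"), chunk_size=4, chunk_overlap=1)
toy_hits = top_k_squared_l2([1.0, 0.0], [[0.0, 1.0], [1.0, 0.5], [1.0, 0.0]], k=2)

print(toy_chunks)  # each chunk starts with the last word of the previous one
print(toy_hits)    # (index, squared distance) pairs, closest first
```

With an overlap of 1, each chunk repeats the final word of its predecessor, which is the same effect the 100-word default overlap has at full scale.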

## QA System Class

In [None]:
class QASystem:
    def __init__(self):
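        # "gpt2" is a small demo model; any causal LM ID from the Hugging Face Hub can be
        # substituted. OpenAI-hosted models such as "gpt-3.5-turbo" cannot be loaded here,
        # because they are only reachable through the OpenAI API, not through `pipeline`.
        self.llm = pipeline(
            "text-generation",
            model="gpt2",
            device="cuda" if torch.cuda.is_available() else "cpu"
        )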
        self.doc_processor = DocumentProcessor()
    
    def process_document(self, document_text: str) -> None:
        """Process and store document for future queries"""
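        # Use chunks well below the 500-word default so that three retrieved chunks
        # plus the question fit in the 1024-token window of the default "gpt2" model
        chunks = self.doc_processor.chunk_document(document_text, chunk_size=150, chunk_overlap=30)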
        self.doc_processor.create_embeddings(chunks)
    
    def answer_question(self, question: str) -> str:
        """Answer question based on processed documents"""
        if not self.doc_processor.doc_chunks:
            return "Please upload and process a document first."
            
        # Retrieve relevant context
        relevant_chunks = self.doc_processor.retrieve_relevant_chunks(question)
        context = "\n\n".join([chunk for chunk, _ in relevant_chunks])
        
        # Generate answer using LLM
        prompt = f"""Based on the following context, answer the question. If the answer isn't in the context, say you don't know.

Context:
{context}

Question: {question}
Answer:"""
        
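        response = self.llm(
            prompt,
            max_new_tokens=100,      # Budget for the answer only; max_length would also count the prompt
            return_full_text=False,  # Return only the generated continuation, not the prompt
            num_return_sequences=1,
            temperature=0.7,
            do_sample=True
        )

        return response[0]['generated_text'].strip()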
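
Because the default `gpt2` model accepts at most 1024 tokens, the retrieved context has to stay within a budget. One way to enforce this, sketched below under stated assumptions, is to add chunks in order of increasing distance until a word limit is reached. `build_prompt` and `max_context_words` are hypothetical names that are not part of the notebook, and a word count is only a rough proxy for the model's token count.

```python
from typing import List, Tuple


def build_prompt(
    question: str,
    scored_chunks: List[Tuple[str, float]],
    max_context_words: int = 300,  # assumed budget; tune for the model in use
) -> str:
    """Pack the closest chunks into the prompt until a word budget is exhausted."""
    selected = []
    used = 0
    # Smaller distance means a closer match, so visit chunks in ascending order.
    for chunk, _distance in sorted(scored_chunks, key=lambda pair: pair[1]):
        n_words = len(chunk.split())
        if used + n_words > max_context_words:
            break
        selected.append(chunk)
        used += n_words
    context = "\n\n".join(selected)
    return (
        "Based on the following context, answer the question. "
        "If the answer isn't in the context, say you don't know.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"
    )


# Toy input: one long, distant chunk and one short, close chunk.
demo_prompt = build_prompt(
    "What is RAG?",
    [("far chunk " * 200, 1.4), ("RAG retrieves passages before generating.", 0.2)],
    max_context_words=50,
)
print(demo_prompt)
```

Here the close chunk is kept and the 400-word distant chunk is dropped because it would exceed the 50-word budget. A stricter variant could count tokens with the model's own tokenizer instead of splitting on whitespace.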

## Interactive UI with IPython Widgets

In [None]:
# Initialize QA System
qa_system = QASystem()

# Create widgets
document_upload = widgets.Textarea(
    value='',
    placeholder='Paste your document text here or upload a file below',
    description='Document:',
    layout={'width': '90%', 'height': '200px'}
)

file_upload = widgets.FileUpload(
    description='Upload file:',
    multiple=False
)

process_btn = widgets.Button(description="Process Document")

question_input = widgets.Text(
    value='',
    placeholder='Enter your question here',
    description='Question:',
    layout={'width': '90%'}
)

ask_btn = widgets.Button(description="Ask Question")

output_area = widgets.Output()

# Event handlers
def on_process_clicked(b):
    with output_area:
        output_area.clear_output()
        document_text = document_upload.value
        
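        if file_upload.value:
            # If a file was uploaded, use it instead of the pasted text.
            # ipywidgets 7 exposes uploads as a dict keyed by filename;
            # ipywidgets 8 exposes a tuple of dicts with a memoryview payload.
            uploads = file_upload.value
            uploaded_file = list(uploads.values())[0] if isinstance(uploads, dict) else uploads[0]
            document_text = bytes(uploaded_file['content']).decode('utf-8')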
            
        if not document_text.strip():
            print("Please provide document text or upload a file.")
            return
            
        print("Processing document...")
        qa_system.process_document(document_text)
        print("Document processed successfully!")

def on_ask_clicked(b):
    with output_area:
        output_area.clear_output()
        question = question_input.value
        if not question.strip():
            print("Please enter a question.")
            return
            
        print("Generating answer...")
        answer = qa_system.answer_question(question)
        display(Markdown(f"**Question:** {question}"))
        display(Markdown(f"**Answer:** {answer}"))

# Attach event handlers
process_btn.on_click(on_process_clicked)
ask_btn.on_click(on_ask_clicked)

# Display the UI
display(widgets.VBox([
    widgets.HTML("<h2>Document-Based QA System with RAG</h2>"),
    widgets.HTML("<p>Upload a document and ask questions about its content</p>"),
    document_upload,
    file_upload,
    process_btn,
    widgets.HTML("<hr>"),
    question_input,
    ask_btn,
    output_area
]))

## Example Usage

1. Paste your document text into the text area, or upload a UTF-8 text file (an uploaded file takes precedence over pasted text)
2. Click "Process Document"
3. Enter your question in the question field
4. Click "Ask Question"

Example questions you can try (after processing a document):
- "What are the main points?"
- "Can you summarize this document?"
- "What methodology was used?"
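
The same questions can also be asked from code instead of through the widgets. The sketch below assumes only that the object passed in has an `answer_question(question)` method, as `QASystem` does. `ask_all` and `EchoQA` are hypothetical names introduced for illustration, and `EchoQA` is a stub that loads no model.

```python
from typing import Dict, Iterable, Protocol


class AnswersQuestions(Protocol):
    """Anything exposing answer_question(question) -> str."""

    def answer_question(self, question: str) -> str: ...


def ask_all(qa: AnswersQuestions, questions: Iterable[str]) -> Dict[str, str]:
    """Ask each question in turn and collect the answers keyed by question."""
    return {question: qa.answer_question(question) for question in questions}


class EchoQA:
    """Stand-in used only for illustration; it does not load any model."""

    def answer_question(self, question: str) -> str:
        return f"(stub answer to: {question})"


answers = ask_all(EchoQA(), ["What are the main points?", "What methodology was used?"])
for question, answer in answers.items():
    print(f"Q: {question}\nA: {answer}\n")
```

In the notebook, passing `qa_system` in place of `EchoQA()` after a document has been processed should produce real answers in the same shape.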

## Sample Document for Testing

In [None]:
sample_document = """
Large Language Models (LLMs) are a type of artificial intelligence that has revolutionized natural language processing. 
These models are trained on vast amounts of text data and can generate human-like text, answer questions, and perform 
various language tasks. The most advanced LLMs today have hundreds of billions of parameters.

Retrieval-Augmented Generation (RAG) is a technique that combines the power of LLMs with external knowledge retrieval. 
When answering a question, the system first retrieves relevant documents or passages, then uses the LLM to generate 
an answer based on this context. This approach reduces hallucinations and improves answer accuracy.

Key benefits of RAG include:
1. Access to up-to-date information (since the knowledge base can be updated separately from the model)
2. Better traceability (you can see which documents were used to generate the answer)
3. Reduced training costs (you don't need to retrain the model when knowledge updates)

Current challenges with RAG systems include:
- Retrieval accuracy (finding the most relevant passages)
- Integration of multiple knowledge sources
- Handling contradictory information in the knowledge base
"""

# Preload the sample document into the text area; click "Process Document" to index it
document_upload.value = sample_document