<a href="https://colab.research.google.com/github/Rohit-Munda/GenAIWorkshop/blob/main/Workshop-1/Day-3/DocumentRetrievalQA.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## 🛠 Step 1. Install and Load Required Python Libraries

In [None]:
!pip install -q faiss-cpu pdfplumber sentence-transformers transformers


In [None]:
# Import necessary packages
import pdfplumber
import os
import textwrap
import faiss
import numpy as np
from transformers import pipeline
from google.colab import files
from sentence_transformers import SentenceTransformer

## 📂 Step 2. Upload PDF or TXT Files

In [None]:
uploaded_files = files.upload()  # Upload one or more .pdf or .txt files

Saving generative_ai_extended.pdf to generative_ai_extended.pdf


## 📄 Step 3. Extract Text from PDFs or Text Files

In [None]:
def extract_text(file):
    if file.endswith('.pdf'):
        text = ''
        with pdfplumber.open(file) as pdf:
            for page in pdf.pages:
                # extract_text() can return None for a page with no extractable text
                text += (page.extract_text() or '') + '\n'
        return text
    elif file.endswith('.txt'):
        with open(file, 'r', encoding='utf-8') as f:
            return f.read()
    else:
        raise ValueError("Unsupported file type")

In [None]:
# Collect all document texts
corpus = []

for fname in uploaded_files:
    with open(fname, 'wb') as f:
        f.write(uploaded_files[fname])
    text = extract_text(fname)
    corpus.append(text)



## ✂️ Step 4. Split Text into Chunks

In [None]:
def split_into_chunks(text, max_length=300):
    return textwrap.wrap(text, width=max_length, break_long_words=False)

In [None]:
# Flatten all chunks from all documents
all_chunks = []
for doc_text in corpus:
    all_chunks.extend(split_into_chunks(doc_text, max_length=300))

In [None]:
print(f"✅ Total Chunks: {len(all_chunks)}")

✅ Total Chunks: 12
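
`textwrap.wrap` cuts the text at a fixed character width, so a sentence that straddles a boundary ends up split across two chunks with no shared context. One common alternative is a sliding window in which consecutive chunks share a few words. The sketch below is a hypothetical helper, not part of the workshop code: the name `split_with_overlap` and the defaults `max_words=50` and `overlap=10` are assumptions chosen only for illustration.

```python
def split_with_overlap(text, max_words=50, overlap=10):
    """Split text into word-based chunks; consecutive chunks share `overlap` words.

    Hypothetical helper for illustration; the sizes are assumptions, not tuned values.
    """
    if not 0 <= overlap < max_words:
        raise ValueError("overlap must satisfy 0 <= overlap < max_words")
    words = text.split()
    step = max_words - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + max_words]))
        if start + max_words >= len(words):
            break  # this window already reached the end of the text
    return chunks


if __name__ == "__main__":
    sample = " ".join(str(i) for i in range(10))
    for chunk in split_with_overlap(sample, max_words=4, overlap=1):
        print(chunk)
```

It could be swapped in for `split_into_chunks` when building `all_chunks`. Whether the overlap improves retrieval depends on the documents, so it is worth comparing both on your own files.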


## 🔡 Step 5. Generate Embeddings

In [None]:
embedder = SentenceTransformer('all-MiniLM-L6-v2')
chunk_embeddings = embedder.encode(all_chunks)

## 📦 Step 6. Store in FAISS Index

In [None]:
embedding_dim = chunk_embeddings.shape[1]
index = faiss.IndexFlatL2(embedding_dim)
# FAISS expects a float32 array of shape (n_vectors, embedding_dim)
index.add(np.asarray(chunk_embeddings, dtype='float32'))
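
`IndexFlatL2` does no clustering or compression: it compares the query against every stored vector and reports squared Euclidean distances. To make that concrete, here is a minimal pure-Python sketch of the same exhaustive search. The function name `flat_l2_search` is hypothetical and the code is meant only to illustrate the idea, not to mirror how FAISS is implemented internally.

```python
def flat_l2_search(vectors, query, top_k=3):
    """Return (squared_distances, indices) of the top_k vectors nearest to query.

    Illustrative brute-force search; assumes all vectors have the same length as query.
    """
    scored = []
    for i, vec in enumerate(vectors):
        # Squared L2 distance: sum of squared coordinate differences
        dist = sum((a - b) ** 2 for a, b in zip(vec, query))
        scored.append((dist, i))
    scored.sort()  # smallest distance first
    best = scored[:top_k]
    return [d for d, _ in best], [i for _, i in best]


if __name__ == "__main__":
    toy_vectors = [[0.0, 0.0], [1.0, 0.0], [3.0, 4.0]]
    distances, indices = flat_l2_search(toy_vectors, [3.0, 3.0], top_k=2)
    print(distances, indices)
```

An exhaustive scan like this is exact, and for a dozen chunks it is more than fast enough. Approximate index types only become worth considering for much larger collections.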

## 🧠 Step 7. RAG Function: Retrieve + Answer

In [None]:
qa_pipeline = pipeline("question-answering", model="distilbert-base-uncased-distilled-squad")

In [None]:
def rag_answer(question, top_k=3):
    question_embedding = embedder.encode([question])
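    # Never request more neighbours than the index holds; FAISS pads missing results with -1,
    # which Python would silently treat as "the last chunk"
    top_k = min(top_k, index.ntotal)
    distances, indices = index.search(np.asarray(question_embedding, dtype='float32'), top_k)
    retrieved_chunks = [all_chunks[i] for i in indices[0]]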

    context = " ".join(retrieved_chunks)

    result = qa_pipeline(question=question, context=context)

    print(f"📌 Question: {question}")
    print(f"\n📚 Retrieved Context:\n{context[:1000]}...")  # Truncated for display
    print(f"\n🧠 Answer: {result['answer']}")

In [None]:
rag_answer("What is the main topic of the document?")

📌 Question: What is the main topic of the document?

📚 Retrieved Context:
where two networks competeone generates data, and the other evaluates it. Each technique has its own strengths and application areas. 3. Applications of Generative AI Text Generation: Language models like GPT-3 and ChatGPT can generate essays, news articles, conversations, and summaries. Image and inspire creativity. It can democratize access to tools and information by enabling non-experts to generate high-quality content with minimal input. 5. Risks and Challenges Despite its promise, Generative AI also poses risks: - Misinformation: AI-generated fake news and deepfakes can mislead the artificial intelligence, with transformative potential across nearly every industry. As with any powerful technology, its development must be approached with both excitement and responsibility....

🧠 Answer: Generative AI Text Generation


In [None]:
rag_answer("As per the document, what are the models used for image generation?")

📌 Question: As per the document, what are the models used for image generation?

📚 Retrieved Context:
Generation: Models like DALLE and Stable Diffusion can create original images from text prompts, offering applications in design, art, and media. Music and Audio: AI can compose music in various genres, synthesize voices, and even create new sounds. Code Generation: Tools like GitHub Copilot use various techniques such as: - Transformer Architectures: Models like GPT use attention mechanisms to process and generate sequences of data. - Autoencoders: These compress and reconstruct input data and can be used to generate new data points. - GANs (Generative Adversarial Networks): A framework original data distribution. One of the most well-known classes of generative models are transformer-based language models like OpenAI's GPT series. These models are pre-trained on large text corpora and then fine-tuned for specific tasks. 2. How Do Generative Models Work? Generative models use...

🧠 An

## ✅ What You Learned

- How to build a simple RAG system:
  - Load documents (PDFs/TXT)
  - Break them into chunks
  - Embed and store in a vector database (FAISS)
  - Retrieve relevant chunks for a user query
  - Use an extractive QA model to pull an answer span out of the retrieved content