<a href="https://colab.research.google.com/github/IRONMAN-AIcoder/genai/blob/main/Gen8.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [4]:
!pip install -U langchain langchain-community langchain-core
!pip install faiss-cpu sentence-transformers openai pdfplumber

Collecting langchain
  Downloading langchain-0.3.24-py3-none-any.whl.metadata (7.8 kB)
Collecting langchain-community
  Downloading langchain_community-0.3.22-py3-none-any.whl.metadata (2.4 kB)
Collecting langchain-core
  Downloading langchain_core-0.3.55-py3-none-any.whl.metadata (5.9 kB)
Collecting dataclasses-json<0.7,>=0.5.7 (from langchain-community)
  Downloading dataclasses_json-0.6.7-py3-none-any.whl.metadata (25 kB)
Collecting pydantic-settings<3.0.0,>=2.4.0 (from langchain-community)
  Downloading pydantic_settings-2.9.1-py3-none-any.whl.metadata (3.8 kB)
Collecting httpx-sse<1.0.0,>=0.4.0 (from langchain-community)
  Downloading httpx_sse-0.4.0-py3-none-any.whl.metadata (9.0 kB)
Collecting marshmallow<4.0.0,>=3.18.0 (from dataclasses-json<0.7,>=0.5.7->langchain-community)
  Downloading marshmallow-3.26.1-py3-none-any.whl.metadata (7.3 kB)
Collecting typing-inspect<1,>=0.4.0 (from dataclasses-json<0.7,>=0.5.7->langchain-community)
  Downloading typing_inspect-0.9.0-py3-none-a
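
The install log above is cut off, so it does not show every version that ended up in the environment. A minimal sketch for recording them, using only the standard library: the helper name `installed_versions` is hypothetical and not part of this notebook, and the package list simply mirrors the two `pip install` lines.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_versions(packages):
    """Map each distribution name to its installed version, or None if it is missing."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            # The package is not installed in this environment
            found[name] = None
    return found


# Package names taken from the pip install commands in the first cell
wanted = [
    "langchain", "langchain-community", "langchain-core",
    "faiss-cpu", "sentence-transformers", "openai", "pdfplumber",
]
for name, ver in installed_versions(wanted).items():
    print(f"{name}: {ver or 'not installed'}")
```

Running this right after the install cell makes it easier to tell which LangChain release a later deprecation warning comes from.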

In [2]:
# Step 1: Upload PDF or ZIP
from google.colab import files
import zipfile
import os
import shutil

uploaded = files.upload()  # dict: saved filename -> file contents

extract_folder = "pdfs"
os.makedirs(extract_folder, exist_ok=True)

# Handle every uploaded file, not only the first one
for uploaded_file in uploaded:
    name = uploaded_file.lower()  # accept .PDF / .ZIP as well
    if name.endswith(".zip"):
        with zipfile.ZipFile(uploaded_file, 'r') as zip_ref:
            zip_ref.extractall(extract_folder)
    elif name.endswith(".pdf"):
        shutil.move(uploaded_file, os.path.join(extract_folder, uploaded_file))
    else:
        raise ValueError("Please upload a .pdf file or a .zip containing PDFs.")

# Step 2: Extract text from PDFs
import pdfplumber
from langchain_core.documents import Document

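def load_pdfs_from_folder(folder_path):
    """Return the extracted text of every PDF under folder_path, one string per file."""
    texts = []
    # os.walk also finds PDFs in subfolders, e.g. ones created by an extracted ZIP
    for root, _, filenames in os.walk(folder_path):
        for filename in sorted(filenames):
            if filename.lower().endswith(".pdf"):
                with pdfplumber.open(os.path.join(root, filename)) as pdf:
                    # extract_text() can return None for pages without a text layer
                    text = "\n".join(page.extract_text() or '' for page in pdf.pages)
                texts.append(text)
    return texts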

# Step 3: Split text into chunks
from langchain.text_splitter import RecursiveCharacterTextSplitter

def split_text(texts, chunk_size=500, chunk_overlap=50):
    splitter = RecursiveCharacterTextSplitter(chunk_size=chunk_size, chunk_overlap=chunk_overlap)
    docs = []
    for text in texts:
        chunks = splitter.split_text(text)
        docs.extend([Document(page_content=chunk) for chunk in chunks])
    return docs

# Step 4: Embedding + FAISS vector store
from langchain_community.vectorstores import FAISS
from langchain_community.embeddings import HuggingFaceEmbeddings

def create_vector_store(docs):
    embeddings = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")
    vectorstore = FAISS.from_documents(docs, embeddings)
    return vectorstore

# Step 5: Load Falcon RW-1B model (open + no auth)
from transformers import AutoTokenizer, AutoModelForCausalLM, pipeline

model_id = "tiiuae/falcon-rw-1b"  # ✅ No login required
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto", torch_dtype="auto")

llm_pipeline = pipeline("text-generation", model=model, tokenizer=tokenizer)

# Step 6: QA over retrieved documents
def answer_question(vectorstore, question, top_k=5):
    retriever = vectorstore.as_retriever(search_kwargs={"k": top_k})
    docs = retriever.invoke(question)  # replaces the deprecated get_relevant_documents()
    context = "\n\n".join([doc.page_content for doc in docs])
    prompt = f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"

    response = llm_pipeline(
        prompt,
        max_new_tokens=256,
        do_sample=True,
        temperature=0.7,
        return_full_text=False,  # return only the completion, not the echoed prompt
    )
    return response[0]['generated_text'].strip()

# Step 7: Chat interface
def chat_with_bot(vectorstore):
    print("🤖 Ask me anything about the PDFs! Type 'exit' to quit.")
    while True:
        question = input("\nYou: ")
        if question.lower() in ["exit", "quit"]:
            print("👋 Bye!")
            break
        answer = answer_question(vectorstore, question)
        print(f"Bot: {answer}")

# Step 8: Run the pipeline
texts = load_pdfs_from_folder(extract_folder)
docs = split_text(texts)
vectorstore = create_vector_store(docs)
chat_with_bot(vectorstore)


Saving lekl101.pdf to lekl101.pdf
Saving ask.pdf to ask (5).pdf


Device set to use cpu


🤖 Ask me anything about the PDFs! Type 'exit' to quit.

You: explain the content


Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.


Bot: Q: The author has been inspired by the fact that the
wave had washed a car into the side of the hotel.
What does this suggest?
A: The implication is that the author believes that the
woman had chosen a path that would bring her to the hotel
and thus she died.
Q: What is the significance of the woman wearing a gold
serpent ring?
A: The author believes that the woman was a serpent.
The serpent was a symbol of the serpent-like ways of the
woman’s life. The serpent was a symbol of
the woman’s evil, but it was also a symbol of
her good.
Q: Why does the author compare Neruda to a Renaissance pope?
A: The comparison is intended to point out the fact that
modern people are not as wise as the Renaissance people
were. In modern times people are not capable of
thinking for themselves. They are not capable of
seeing things from a different angle. They do not
understand that if they keep going on the same path
they are on, they will never achieve a better
result.
Analysis of the Text
1. What d

Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.


Bot: Gabriel García Márquez (1927-2014)
2024-25
6/KALEIDOSCOPE
11111
IIIII SSSSSeeeeellllllllllll mmmmmyyyyy DDDDDrrrrreeeeeeeaaaaammmmmssssss
Gabriel Garcia Marquez (1928-2014)
One Hundred Years in Solitude (1967, tr. 1970)
Love in the Time of Cholera (1985, tr. 1988)
Gabriel García Márquez (1927-2014)
One Hundred Years in Solitude (1967, tr. 1970)
Love in the Time of Cholera (1985, tr. 1988)
10/KALEIDOSCOPE
11111
IIIII SSSSSeeeeellllllllll mmmmmyyyyy DDDDDrrrrreeeeeeeaaaaammmmmssssss
The following stories are from The Book of
One Hundred Dreams of the Young Lady in Cienfuegos:
1. ‘The dream of the young lady in Cienfuegos’ (pp. 41-42).
What do we learn about the young

You: exit
👋 Bye!
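
Step 6 joins all retrieved chunks into the prompt as-is. One possible refinement, sketched below under stated assumptions, is to drop duplicate chunks and cap the context length before the prompt reaches the model. The helper name `build_prompt` and the `max_context_chars` budget are hypothetical and not part of the notebook above; the prompt layout is the same `Context / Question / Answer` template used in `answer_question`.

```python
def build_prompt(chunks, question, max_context_chars=3000):
    """Build the QA prompt from unique chunks, stopping before the character budget is exceeded.

    `max_context_chars` is an illustrative budget, not a value taken from the model.
    """
    seen = set()
    kept = []
    used = 0
    for chunk in chunks:
        text = chunk.strip()
        if not text or text in seen:
            continue  # skip empty and repeated chunks
        # Count the "\n\n" separator for every chunk after the first
        cost = len(text) + (2 if kept else 0)
        if used + cost > max_context_chars:
            break  # keep the highest-ranked chunks and drop the rest
        seen.add(text)
        kept.append(text)
        used += cost
    context = "\n\n".join(kept)
    return f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"


# Example with plain strings standing in for doc.page_content
print(build_prompt(["alpha", "alpha", "beta"], "What is this about?"))
```

Inside `answer_question`, the list passed in would be `[doc.page_content for doc in docs]`, so the retriever's ranking decides which chunks survive the cut.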
