## Task 1: Chat with PDF

### Step 1: Extract Data from PDFs


In [73]:
pip install pdfplumber



In [74]:
# Using pdfplumber to extract text.
import pdfplumber

def extract_text_from_pdf(pdf_path):
    text_data = []
    with pdfplumber.open(pdf_path) as pdf:
        for page in pdf.pages:
            text_data.append(page.extract_text())
    return text_data

# Example usage
pdf_path = "example.pdf"
pdf_text = extract_text_from_pdf(pdf_path)


In [75]:
pdf_text

['Technological Advances and Impact\n1. Introduction\nTechnology has played a transformative role in human history, revolutionizing the way societies function,\nindustries operate, and individuals interact. From the Industrial Revolution to the Digital Age,\ntechnological advancements have driven progress in every field imaginable. This document explores the\nkey categories of technological advances and their impacts, supported by semi-structured data.\n2. Categories of Technological Advances\n2.1 Artificial Intelligence (AI)\nAI has redefined problem-solving capabilities through machine learning, natural language processing, and\nrobotics. AI applications span diverse fields, from healthcare diagnostics to autonomous vehicles.\n2.2 Renewable Energy Technologies\nRenewable energy sources, such as solar and wind, are rapidly replacing fossil fuels, driven by the need\nfor sustainable development.\n2.3 Biotechnology\nBiotechnology leverages biological systems and organisms to create inno

## Step 2: Chunk the Extracted Text
Split the extracted text into smaller, meaningful chunks for better retrieval.



Use libraries like nltk for splitting text into sentences or paragraphs:

In [76]:
def chunk_text(text, max_length=500):
    words = text.split()
    return [' '.join(words[i:i+max_length]) for i in range(0, len(words), max_length)]

chunks = chunk_text(' '.join(pdf_text))


In [5]:
chunks

['Technological Advances and Impact 1. Introduction Technology has played a transformative role in human history, revolutionizing the way societies function, industries operate, and individuals interact. From the Industrial Revolution to the Digital Age, technological advancements have driven progress in every field imaginable. This document explores the key categories of technological advances and their impacts, supported by semi-structured data. 2. Categories of Technological Advances 2.1 Artificial Intelligence (AI) AI has redefined problem-solving capabilities through machine learning, natural language processing, and robotics. AI applications span diverse fields, from healthcare diagnostics to autonomous vehicles. 2.2 Renewable Energy Technologies Renewable energy sources, such as solar and wind, are rapidly replacing fossil fuels, driven by the need for sustainable development. 2.3 Biotechnology Biotechnology leverages biological systems and organisms to create innovative solutio

## Step 3: Convert Chunks to Embeddings
Use a free Hugging Face embedding model like sentence-transformers/all-MiniLM-L6-v2.


Install required library: pip install sentence-transformers

In [94]:
from sentence_transformers import SentenceTransformer

model = SentenceTransformer('all-MiniLM-L6-v2')

def generate_embeddings(chunks):
    return model.encode(chunks)

embeddings = generate_embeddings(chunks)


In [7]:
embeddings

array([[-2.11438537e-03,  8.22735019e-03,  9.22150761e-02,
        -2.09102947e-02,  1.13635762e-02, -4.26488966e-02,
         7.69801019e-03, -2.45063994e-02, -2.98184566e-02,
         6.51459992e-02, -6.27262369e-02, -3.52699012e-02,
        -3.24409604e-02,  7.16379192e-03, -1.61520224e-02,
         4.31093797e-02, -3.95570956e-02, -5.10982648e-02,
        -7.62242004e-02, -3.68719883e-02,  9.99040082e-02,
         9.09594372e-02, -1.38494733e-03, -1.77873706e-03,
         3.82404141e-02,  8.73489976e-02,  5.56522980e-03,
        -9.40532163e-02, -6.40310422e-02,  7.04503357e-02,
         4.03540954e-02,  4.29736152e-02,  2.15611663e-02,
         4.09165323e-02,  2.70049069e-02,  1.19267687e-01,
        -1.03831748e-02, -2.82174852e-02,  1.14130592e-02,
        -1.03148138e-02, -3.75012495e-02, -1.50371879e-01,
         2.17372607e-02,  3.62067334e-02,  3.24423611e-02,
         4.62682992e-02, -6.17828704e-02, -8.25282335e-02,
         5.70844300e-02,  2.95319594e-03, -1.28737420e-0

## Step 4: Store Embeddings in a Vector Database
Use FAISS for similarity search.

In [78]:
!pip install faiss-cpu




In [84]:
import faiss
import numpy as np

dimension = embeddings.shape[1]
index = faiss.IndexFlatL2(dimension)
index.add(embeddings)


## Step 5: Handle Queries
Search for the most relevant chunks.

In [88]:


# Step 4: Search for relevant chunks based on a query
def search_query(query, index, chunks):
    # Generate the embedding for the query
    query_embedding = np.array(model.encode([query]))  # Generate query embedding
    distances, indices = index.search(query_embedding, k=3)  # Retrieve top 3 relevant chunks
    relevant_chunks = [chunks[i] for i in indices[0]]  # Extract the relevant chunks
    return relevant_chunks

# Test the search functionality
query = "what is biotechnology?"
results = search_query(query, index, chunks)

# # Step 5: Print the retrieved chunks
# print("\nTop relevant chunks for the query:")
# for result in results:
#     print(f"Relevant Chunk: {result[:]}...")  # Preview first 200 characters of each result


In [89]:
prompt = f"Based on the following information, answer the query: {' '.join(relevant_chunks)}\n\nQuery: {query}"
# prompt

## Step 6: Generate Responses with LLM
Use OpenAI GPT to generate responses.

In [82]:
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

# Switch to a more capable model
tokenizer = AutoTokenizer.from_pretrained("google/flan-t5-large")
model = AutoModelForSeq2SeqLM.from_pretrained("google/flan-t5-large")

def generate_response(query, retrieved_chunks):
    if not retrieved_chunks:
        return "No relevant information found in the provided context."
    context = " ".join(retrieved_chunks)
    prompt = (
        f"Based on the following information, answer the query:"
        f"Question: {query}\n\n"
        f"Answer:"
    )
    inputs = tokenizer(prompt, return_tensors="pt", max_length=512, truncation=True)
    outputs = model.generate(**inputs, max_length=512)
    return tokenizer.decode(outputs[0], skip_special_tokens=True)

response = generate_response(query, relevant_chunks)
print("Response:", response)


Response: Biotechnology is the science and technology of modifying living organisms to increase their biochemical properties and to increase their biophysical properties.


## comparing responses

In [90]:


# Step 4: Search for relevant chunks based on a query
def search_query(query, index, chunks):
    # Generate the embedding for the query
    query_embedding = np.array(model.encode([query]))  # Generate query embedding
    distances, indices = index.search(query_embedding, k=3)  # Retrieve top 3 relevant chunks
    relevant_chunks = [chunks[i] for i in indices[0]]  # Extract the relevant chunks
    return relevant_chunks

# Test the search functionality
query = "what are Renewable Energy Technologies?"
results = search_query(query, index, chunks)

# # Step 5: Print the retrieved chunks
# print("\nTop relevant chunks for the query:")
# for result in results:
#     print(f"Relevant Chunk: {result[:]}...")  # Preview first 200 characters of each result


In [91]:


response = generate_response(query, relevant_chunks)
print("Response:", response)


Response: Renewable energy technologies (RETs) are technologies that use renewable energy sources (renewable energy sources) to generate electricity or heat.


In [95]:


# Step 4: Search for relevant chunks based on a query
def search_query(query, index, chunks):
    # Generate the embedding for the query
    query_embedding = np.array(model.encode([query]))  # Generate query embedding
    distances, indices = index.search(query_embedding, k=3)  # Retrieve top 3 relevant chunks
    relevant_chunks = [chunks[i] for i in indices[0]]  # Extract the relevant chunks
    return relevant_chunks

# Test the search functionality
query = "what are the key breakthroughs?"
results = search_query(query, index, chunks)

# # Step 5: Print the retrieved chunks
# print("\nTop relevant chunks for the query:")
# for result in results:
#     print(f"Relevant Chunk: {result[:]}...")  # Preview first 200 characters of each result


In [97]:
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

# Switch to a more capable model
tokenizer = AutoTokenizer.from_pretrained("google/flan-t5-large")
model = AutoModelForSeq2SeqLM.from_pretrained("google/flan-t5-large")

def generate_response(query, retrieved_chunks):
    if not retrieved_chunks:
        return "No relevant information found in the provided context."
    context = " ".join(retrieved_chunks)
    prompt = (
        f"Based on the following information, answer the query:"
        f"Question: {query}\n\n"
        f"Answer:"
    )
    inputs = tokenizer(prompt, return_tensors="pt", max_length=512, truncation=True)
    outputs = model.generate(**inputs, max_length=512)
    return tokenizer.decode(outputs[0], skip_special_tokens=True)

response = generate_response(query, relevant_chunks)
print("Response:", response)


Response: The first major breakthrough in the development of the atomic bomb was the development of the atomic bomb , which was a device that could be used to destroy a nuclear weapon .
