<a href="https://colab.research.google.com/github/Johnverse11/RFP_AGENT/blob/main/Agent_rfp.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
# 🔧 Clean environment
!pip uninstall -y langchain transformers sentence-transformers -q

# ✅ Install core components
!pip install -q \
  langchain==0.1.16 \
  langchain-huggingface==0.0.3 \
  transformers \
  sentence-transformers==2.6.1 \
  chromadb \
  unstructured \
  pdfminer.six \
  pymupdf \
  tqdm

In [2]:
from langchain_community.document_loaders import TextLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.vectorstores import Chroma
from langchain_huggingface import HuggingFaceEmbeddings
import os

# Setup folder & move file
os.makedirs("product_docs", exist_ok=True)
!mv delta_v.txt product_docs/ 2>/dev/null || echo "✔️ Text already in place"

# Load
loader = TextLoader("/content/product_docs/delta_v.txt")
docs = loader.load()

# Split
splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=50)
chunks = splitter.split_documents(docs)

# Embedding
embedder = HuggingFaceEmbeddings(model_name="all-MiniLM-L6-v2")
vector_db = Chroma.from_documents(chunks, embedding=embedder, persist_directory="vector_db")

print("✅ Vector DB created successfully.")

✔️ Text already in place


The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


✅ Vector DB created successfully.
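
The cell above sets `chunk_size=500` and `chunk_overlap=50`. As a rough illustration of how those two numbers interact, here is a minimal sketch using a fixed-width sliding window. This is a simplified stand-in, not what `RecursiveCharacterTextSplitter` does internally (it prefers to break on separators such as paragraphs, lines, and spaces), and `sliding_chunks` is a hypothetical helper name introduced only for this example.

```python
def sliding_chunks(text: str, chunk_size: int = 500, chunk_overlap: int = 50) -> list[str]:
    """Simplified illustration: fixed-width windows that share `chunk_overlap` characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    if not text:
        return []
    # Each window starts (chunk_size - chunk_overlap) characters after the previous one.
    step = chunk_size - chunk_overlap
    # Stop early so the last window is never fully contained in the one before it.
    last_start = max(len(text) - chunk_overlap, 1)
    return [text[i:i + chunk_size] for i in range(0, last_start, step)]


sample = "".join(chr(97 + i % 26) for i in range(1200))
pieces = sliding_chunks(sample)
print([len(p) for p in pieces])  # [500, 500, 300]
```

Consecutive pieces share their boundary characters, so a sentence that straddles a chunk boundary still appears whole in at least one chunk, provided it is shorter than the overlap.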


In [3]:
# Install system dependencies
!sudo apt-get install -y poppler-utils tesseract-ocr

# Install Python dependencies for PDF parsing and OCR
!pip install -q \
  pi_heif \
  unstructured_inference \
  pdf2image \
  pytesseract \
  unstructured \
  unstructured-pytesseract


Reading package lists... Done
Building dependency tree... Done
Reading state information... Done
tesseract-ocr is already the newest version (4.1.1-2.1build1).
poppler-utils is already the newest version (22.02.0-2ubuntu0.8).
0 upgraded, 0 newly installed, 0 to remove and 35 not upgraded.


In [5]:
from langchain_community.document_loaders import UnstructuredPDFLoader

loader = UnstructuredPDFLoader("/content/sample_rfp.pdf")
rfp_docs = loader.load()
rfp_text = "\n".join([doc.page_content for doc in rfp_docs])

print("✅ RFP Loaded:", rfp_text[:500])

✅ RFP Loaded: No. 31026/99/2020-Policy dated 25th January, 2021

Selection of Public Financial Institution (Government Company) for providing services of Project Management Agency (PMA) for the implementation of Production Linked Incentive scheme for Pharmaceuticals

Request for Proposal (RFP)

Department of Pharmaceuticals Ministry of Chemicals & Fertilizers Shastri Bhawan, Dr Rajendra Prasad Road, New Delhi- 110001

Page 1 of 12

INDEX

Table of Contents 1 3

Request for Proposal from Public Financial Insti


In [None]:
# Install Transformers and Accelerate
!pip install -q transformers accelerate

import torch
from transformers import pipeline

# Load an open-access instruct model (no token required).
# Half precision roughly halves the memory needed for the 7B weights.
llm = pipeline(
    "text-generation",
    model="tiiuae/falcon-7b-instruct",
    torch_dtype=torch.float16,
    device=0,
)

# Truncate the RFP so the prompt stays short (adjust the limit as needed).
# Slicing outside the f-string keeps stray comment text out of the prompt.
rfp_excerpt = rfp_text[:2000]

prompt = f"""
You are an expert AI assistant specializing in analyzing Request for Proposal (RFP) documents
for a large industrial automation company. Your task is to carefully read the following RFP text
and extract all specific requirements.

Return a valid JSON array of objects with:
- "id": e.g. "REQ-001"
- "requirement_text": exact text
- "category": e.g. "Technical Spec", "Safety"
- "details": brief summary

Here is the RFP:
---
{rfp_excerpt}
---
Return only valid JSON:
"""

# Generate the response; return_full_text=False drops the echoed prompt
# so that only the model's completion is parsed later.
response = llm(
    prompt,
    max_new_tokens=1024,
    do_sample=False,
    return_full_text=False,
)[0]["generated_text"]

# Output
print("✅ Extracted Requirements:\n")
print(response)


Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

In [None]:
import json

# Step 1: Extract the JSON array from the LLM response (first "[" through last "]")
start_idx = response.find("[")
end_idx = response.rfind("]")
json_str = response[start_idx:end_idx + 1] if 0 <= start_idx < end_idx else ""

# Step 2: Try parsing the response into a Python list of requirement dicts
try:
    requirements = json.loads(json_str)
    if not isinstance(requirements, list):
        raise ValueError("expected a JSON array")
    print(f"✅ Parsed {len(requirements)} requirements")
except ValueError as e:  # json.JSONDecodeError is a subclass of ValueError
    print("❌ Failed to parse JSON:", e)
    print("💬 Raw response:\n", response)
    requirements = []

# Step 3: For each requirement, find the top-k relevant chunks from the vector DB
for req in requirements:
    req_text = req.get("requirement_text")
    print(f"\n📌 {req.get('id', 'REQ-???')}: {req_text or 'Missing text'}")

    if not req_text:
        continue  # nothing to search for

    try:
        matches = vector_db.similarity_search(req_text, k=2)
        for i, match in enumerate(matches, start=1):
            print(f"🔗 Match #{i}:", match.page_content[:200], "...")
    except Exception as e:
        print("❌ Error during vector search:", e)
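
Instruct models often wrap their JSON in extra chatter, or emit a stray bracket before the real array. One possible way to make the extraction step more tolerant is to try decoding from each `[` in turn with the standard library's `json.JSONDecoder.raw_decode`, which parses one value and ignores whatever follows it. This is only a sketch: `extract_json_array` is a hypothetical helper name, and the sample string is made up for illustration rather than taken from a real model response.

```python
import json


def extract_json_array(text: str) -> list:
    """Return the first JSON array that decodes cleanly from `text`, or [] if none does."""
    decoder = json.JSONDecoder()
    pos = text.find("[")
    while pos != -1:
        try:
            # raw_decode parses a single JSON value starting at `pos`
            # and tolerates trailing text after it.
            value, _end = decoder.raw_decode(text, pos)
        except json.JSONDecodeError:
            value = None
        if isinstance(value, list):
            return value
        # Not a valid array here; try the next opening bracket.
        pos = text.find("[", pos + 1)
    return []


# Made-up example of a chatty response (assumption, not real model output)
sample_response = 'Sure [see below]: [{"id": "REQ-001", "requirement_text": "example"}] Done.'
parsed = extract_json_array(sample_response)
print(parsed)
```

The result could be assigned to `requirements` in place of the `find`-based slicing; an empty list means no parsable array was found.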
