<a href="https://colab.research.google.com/github/Ali-mohammadi-design/RFP_TASKS_EXTRACTOR/blob/main/RFP_EXTRACTOR_TASKS.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
# STEP 1: Install required libraries
!pip install PyMuPDF openai

Collecting PyMuPDF
  Downloading pymupdf-1.25.5-cp39-abi3-manylinux2014_x86_64.manylinux_2_17_x86_64.whl.metadata (3.4 kB)
Downloading pymupdf-1.25.5-cp39-abi3-manylinux2014_x86_64.manylinux_2_17_x86_64.whl (20.0 MB)
Installing collected packages: PyMuPDF
Successfully installed PyMuPDF-1.25.5


In [3]:
import fitz  # PyMuPDF
import openai
from typing import List
from google.colab import files
import os

In [4]:
uploaded = files.upload()

Saving PART 3 - Requirements for Deliverables.pdf to PART 3 - Requirements for Deliverables.pdf


In [11]:
from openai import OpenAI

client = OpenAI(api_key="YOUR API KEY")


In [6]:
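def extract_text_from_pdf(pdf_path: str) -> str:
    """Return the concatenated plain text of every page in the PDF."""
    # The context manager closes the document even if a page fails to parse
    with fitz.open(pdf_path) as doc:
        return "".join(page.get_text() for page in doc)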

In [7]:
def chunk_text(text: str, chunk_size: int = 500) -> List[str]:
    return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
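
Fixed-size slicing can cut a sentence, or even a word, at a chunk boundary, so a task described across that boundary may be missed or returned in fragments. One common mitigation is to let consecutive windows overlap. The sketch below is only an illustration: `chunk_text_overlap` and its `overlap` parameter are hypothetical names and are not part of this notebook.

```python
from typing import List


def chunk_text_overlap(text: str, chunk_size: int = 500, overlap: int = 50) -> List[str]:
    """Split text into fixed-size windows that share `overlap` characters."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - overlap  # how far the window start advances each time
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]


sample = "abcdefghij"
no_overlap = chunk_text_overlap(sample, chunk_size=4, overlap=0)
with_overlap = chunk_text_overlap(sample, chunk_size=4, overlap=2)
print(no_overlap)    # same result as plain fixed-size slicing
print(with_overlap)  # each window repeats the last 2 characters of the previous one
```

Overlap produces more chunks, so it also means more API calls and more duplicate tasks to remove afterwards.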

In [12]:
def extract_tasks_from_text_chunk(chunk: str) -> List[str]:
    prompt = (
        "You are analyzing a section of an engineering RFP document. "
        "From the following text, extract any clearly defined tasks or activities that "
        "should be included in a proposal. Return the tasks as a bullet point list:\n\n"
        f"{chunk}\n\nTasks:"
    )
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
        temperature=0.3,
    )

    tasks_text = response.choices[0].message.content
    tasks = [line.strip("-• \n") for line in tasks_text.split("\n") if line.strip()]
    return tasks
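
The parsing step can be checked without calling the API. The sketch below applies the same expression to a hand-written reply (hypothetical, not real model output); `parse_bullets` is an illustrative name only.

```python
from typing import List


def parse_bullets(tasks_text: str) -> List[str]:
    """Same parsing expression as in extract_tasks_from_text_chunk."""
    return [line.strip("-• \n") for line in tasks_text.split("\n") if line.strip()]


# Hypothetical reply written by hand for illustration
reply = "- Prepare a site survey report\n\n• Submit shop drawings\n1. Attend kickoff meeting"
parsed = parse_bullets(reply)
print(parsed)
```

Dash and dot bullets are stripped and blank lines are skipped, but a numbered prefix such as `1.` is kept, and any non-task line in the reply (for example an introductory sentence) would be returned as if it were a task.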

In [9]:
def analyze_uploaded_pdfs(uploaded_files) -> List[str]:
    all_tasks = set()
    for filename in uploaded_files.keys():
        print(f"\nProcessing: {filename}")
        text = extract_text_from_pdf(filename)
        chunks = chunk_text(text)
        for chunk in chunks:
            try:
                tasks = extract_tasks_from_text_chunk(chunk)
                all_tasks.update(tasks)
            except Exception as e:
                print(f"Error processing chunk: {e}")
    return sorted(all_tasks)


In [13]:
tasks = analyze_uploaded_pdfs(uploaded)


Processing: PART 3 - Requirements for Deliverables.pdf


In [14]:
print("\n--- ✅ Extracted Tasks ---")
for task in tasks:
    print(f"- {task}")


--- ✅ Extracted Tasks ---
- Account for a two-year post-construction warranty period
- Account for any differences between the conditions and quantities of the existing systems and equipment listed in Appendix D and the baseline work scope specified in the RFP.
- Account for potential shadowing effects from objects within the space
- Acknowledge the demolition of the HHW Building.
- Add a brief description of the intended information that still needs to be provided to complete the unfinished or missing section or element
- Add a new Section A.2.4 titled "Confined Space List".
- Add new subsections to Section A.4.2.1 regarding the Toronto Accessibility Design Guidelines (TADG), Toronto Green Standard (TGS), and City of Toronto’s Corporate Real Estate Management’s Net Zero Carbon Plan
- Address Environmental – Noise, Ontario Regulation NPC 205
- Address Health and Safety Requirements
- Address all emergency lighting deficiencies related to emergency lighting levels, equipment hardware, 

In [97]:
len(tasks)

757

**As you can see, 757 tasks were extracted.**

**Here we consider only the first 20 tasks.**

**You can change the slice size later to consider all the tasks.**

In [71]:
tasks_r = tasks[:20]  # keep only the first 20 tasks

In [75]:
def chunk_tasks_for_filtering(tasks: List[str], max_chunk_chars: int = 1000) -> List[List[str]]:
    """Split tasks into smaller groups based on character limit."""
    chunks = []
    current_chunk = []
    current_length = 0

    for task in tasks:
        task_str = f"- {task}"
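        # Only close a chunk that already holds something, so an oversized first task cannot produce an empty chunk
        if current_chunk and current_length + len(task_str) > max_chunk_chars: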
            chunks.append(current_chunk)
            current_chunk = [task_str]
            current_length = len(task_str)
        else:
            current_chunk.append(task_str)
            current_length += len(task_str)

    if current_chunk:
        chunks.append(current_chunk)

    return chunks
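
To see how the character budget drives the grouping, here is a compact restatement of the same greedy idea run on toy data. `group_by_char_budget` is a hypothetical name used only for this illustration, and the tiny `max_chars=12` budget is chosen so the splits are easy to follow.

```python
from typing import List


def group_by_char_budget(items: List[str], max_chars: int) -> List[List[str]]:
    """Greedy grouping: start a new group when the next item would exceed the budget."""
    groups: List[List[str]] = []
    current: List[str] = []
    used = 0
    for item in items:
        line = f"- {item}"
        if current and used + len(line) > max_chars:
            groups.append(current)
            current, used = [], 0
        current.append(line)
        used += len(line)
    if current:
        groups.append(current)
    return groups


demo = ["aaaa", "bbbb", "cccc", "d" * 30]
groups = group_by_char_budget(demo, max_chars=12)
for group in groups:
    print(group)
```

The first two items fit the budget exactly and share a group. The third item starts a new group. The last item is longer than the budget on its own, so it gets a group to itself. The budget counts characters of the task lines only, not tokens and not the surrounding prompt text.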


In [103]:
def filter_proposal_tasks_chunked(tasks: List[str]) -> List[str]:
    """Ask the model to keep only proposal-relevant tasks, one chunk at a time."""
    # Chunk the tasks passed in as the argument, not the global tasks_r
    task_chunks = chunk_tasks_for_filtering(tasks)
    all_filtered_tasks = []
    seen = set()

    for i, chunk in enumerate(task_chunks):
        print(f"\n🔍 Processing chunk {i+1}/{len(task_chunks)}...")
        prompt = (
            "You are a proposal engineer reviewing a list of tasks extracted from an RFP. "
            "Return only the tasks that should be included in the proposal while we are preparing the TTM. "
            "The returned tasks must be either actions that we should perform or specific reports or documents "
            "that we should prepare for the project owner. "
            "Do not return tasks that could be done in less than 1 hour. "
            "Ignore vague or non-actionable items.\n\n"
            "Tasks:\n" + "\n".join(chunk) + "\n\nReturn only valid tasks as a bullet list."
        )

        try:
            response = client.chat.completions.create(
                model="gpt-4",
                messages=[{"role": "user", "content": prompt}],
                temperature=0.2,
            )

            filtered = response.choices[0].message.content.strip()
            filtered_list = [line.strip("-• \n") for line in filtered.split("\n") if line.strip()]

            # Print each new task as it is found, skipping duplicates across chunks
            for task in filtered_list:
                if task not in seen:
                    seen.add(task)
                    all_filtered_tasks.append(task)
                    print(f"✅ {task}")

        except Exception as e:
            print(f"❌ Error in chunk {i+1}: {e}")

    return all_filtered_tasks


In [104]:
# Filter only the reduced subset selected above; pass `tasks` to process everything
filtered_tasks = filter_proposal_tasks_chunked(tasks_r)

print("\n--- 📌 Final Filtered Tasks for Proposal ---")
for task in filtered_tasks:
    print(f"- {task}")



🔍 Processing chunk 1/2...
✅ Account for any differences between the conditions and quantities of the existing systems and equipment listed in Appendix D and the baseline work scope specified in the RFP.
✅ Add a brief description of the intended information that still needs to be provided to complete the unfinished or missing section or element
✅ Add a new Section A.2.4 titled "Confined Space List".
✅ Add new subsections to Section A.4.2.1 regarding the Toronto Accessibility Design Guidelines (TADG), Toronto Green Standard (TGS), and City of Toronto’s Corporate Real Estate Management’s Net Zero Carbon Plan
✅ Address Environmental – Noise, Ontario Regulation NPC 205
✅ Address Health and Safety Requirements

🔍 Processing chunk 2/2...
✅ Address all emergency lighting deficiencies related to emergency lighting levels, equipment hardware, power sources, conduits, and roofing, etc., as necessary to maintain a state of good repair and aesthetics.
✅ Address all lighting deficiencies related to

In [20]:
!pip install -q sentence-transformers

In [93]:
from sentence_transformers import SentenceTransformer, util

# Load sentence embedding model
embedder = SentenceTransformer('all-MiniLM-L6-v2')  # Small + efficient

# Split PDF text into chunks
def chunk_text_for_rag(text, chunk_size=500):
    return [text[i:i+chunk_size] for i in range(0, len(text), chunk_size)]

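# full_pdf_text is not defined earlier in the notebook; here it is assumed to be
# the combined text of every uploaded PDF
full_pdf_text = "\n".join(extract_text_from_pdf(name) for name in uploaded)

# Chunk and embed the PDF text
pdf_chunks = chunk_text_for_rag(full_pdf_text)
chunk_embeddings = embedder.encode(pdf_chunks, convert_to_tensor=True)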
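
The chunk embeddings exist so that each task can be matched to the most similar passages of the RFP. As far as I know, `util.semantic_search` ranks by cosine similarity by default. The sketch below reproduces that idea in plain Python on toy 2-D vectors. `cosine` and `top_k` are hypothetical helper names, and real embeddings from `all-MiniLM-L6-v2` have far more dimensions.

```python
import math
from typing import List, Tuple


def cosine(u: List[float], v: List[float]) -> float:
    """Cosine similarity of two vectors; 0.0 if either has zero length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def top_k(query: List[float], corpus: List[List[float]], k: int = 3) -> List[Tuple[int, float]]:
    """Return (corpus index, score) pairs for the k most similar vectors."""
    scored = [(idx, cosine(query, vec)) for idx, vec in enumerate(corpus)]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]


# Toy vectors standing in for real sentence embeddings
corpus = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
query = [2.0, 0.1]
hits = top_k(query, corpus, k=2)
print(hits)
```

The query points almost along the first axis, so corpus vector 0 ranks first and the diagonal vector 2 ranks second. Vector length does not matter, only direction.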


In [95]:
def analyze_task_with_rag(task: str, chunks: List[str], chunk_embeddings, top_k=3) -> dict:
    # Embed the task
    task_embedding = embedder.encode(task, convert_to_tensor=True)

    # Find most relevant chunks
    hits = util.semantic_search(task_embedding, chunk_embeddings, top_k=top_k)
    relevant_chunks = "\n\n".join([chunks[hit['corpus_id']] for hit in hits[0]])

    # GPT system & user prompt
    system_prompt = (
        "You are an expert project planner analyzing engineering RFPs. "
        "Given a task and related RFP context, determine:\n"
        "1. If it's a deliverable or an activity\n"
        "2. Who should do it (role and expertise)\n"
        "3. The expertise level (junior/mid/senior)\n"
        "4. In a short sentence, why this task has been defined in the project\n"
        "5. In a short sentence, how it should be done\n"
        "6. Your time estimate in hours: give only a number, with no further explanation"
    )

    user_prompt = f"Task:\n{task}\n\nRFP Context:\n{relevant_chunks}\n\nAnalyze the task based on the context."

    try:
        response = client.chat.completions.create(
            model="gpt-4",
            messages=[
                {"role": "system", "content": system_prompt},
                {"role": "user", "content": user_prompt},
            ],
            temperature=0.4,
        )

        analysis = response.choices[0].message.content.strip()
        return {"task": task, "analysis": analysis}

    except Exception as e:
        return {"task": task, "analysis": f"Error: {e}"}
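
The function returns the analysis as free text, which is hard to aggregate across tasks. If the model follows the numbered format requested in the system prompt, the reply could be split into fields as sketched below. This is an assumption-laden illustration: the field names in `FIELDS` and the helper `parse_analysis` are hypothetical, and the sample reply is written by hand.

```python
import re
from typing import Dict

# Hypothetical field names for the six numbered answers requested in the system prompt
FIELDS = ["kind", "owner", "level", "rationale", "method", "estimate"]


def parse_analysis(analysis: str) -> Dict[str, str]:
    """Map lines of the form '<n>. <text>' (n = 1..6) to named fields."""
    parsed: Dict[str, str] = {}
    for line in analysis.splitlines():
        match = re.match(r"\s*([1-6])\.\s*(.+)", line)
        if match:
            parsed[FIELDS[int(match.group(1)) - 1]] = match.group(2).strip()
    return parsed


# Hand-written sample in the shape the prompt asks for
sample = "1. Activity\n2. Electrical engineer\n3. Senior\n4. Confirms scope\n5. Review Appendix D\n6. 10"
result = parse_analysis(sample)
print(result["level"], result["estimate"])
```

Lines that do not start with a number from 1 to 6 are ignored, so an answer that wraps onto a second line would lose its continuation. Model replies are not guaranteed to follow the format, so missing keys should be expected.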


In [96]:
rag_based_analyses = []

for task in filtered_tasks:
    result = analyze_task_with_rag(task, pdf_chunks, chunk_embeddings)
    rag_based_analyses.append(result)
    print(f"\n📝 Task: {result['task']}")
    print(f"🔍 Analysis:\n{result['analysis']}")



📝 Task: Account for any differences between the conditions and quantities of the existing systems and equipment listed in Appendix D and the baseline work scope specified in the RFP.
🔍 Analysis:
1. This is an activity.
2. This should be done by an Electrical Engineer or a professional with expertise in assessing lighting systems and electrical equipment.
3. The expertise level required is senior.
4. This task has been defined in the project to ensure the accuracy of the information presented in Appendix D and to reconcile any differences with the baseline work scope specified in the RFP.
5. This should be done by thoroughly reviewing and validating all the information in Appendix D, and comparing it with the baseline work scope specified in the RFP.
6. My time estimation for this task is 10 hours.

📝 Task: Add a brief description of the intended information that still needs to be provided to complete the unfinished or missing section or element
🔍 Analysis:
1. Deliverable
2. Project Ma