# Phase 1: Taxonomy Discovery & EDA

**Objective:** Analyze the raw proposal data to derive a compact, evidence-based taxonomy.

This notebook represents the "Research Phase." We use Llama 3.3-70B to read the entire dataset and propose a set of categories that aims to be Mutually Exclusive and Collectively Exhaustive (MECE).

**Output:** A `taxonomy.json` file saved to the `data/` directory, which will be used by the production agent.

In [5]:
# ==========================================
# 1. SETUP & CONFIGURATION
# ==========================================
import os
import json
import pandas as pd
from groq import Groq
from dotenv import load_dotenv

# 1. Load Environment Variables (from root .env)
# We assume this notebook is in 'notebooks/', so .env is one level up
load_dotenv(os.path.join('..', '.env'))

api_key = os.environ.get("GROQ_API_KEY")
if not api_key:
    print("⚠️  Warning: GROQ_API_KEY not found. Please check your .env file.")
else:
    print("✅ API Key Loaded.")

# 2. Configure Paths
DATA_DIR = os.path.join('..', 'data')
INPUT_FILE = os.path.join(DATA_DIR, 'proposals.csv')
OUTPUT_TAXONOMY = os.path.join(DATA_DIR, 'taxonomy.json')

# 3. Initialize Client
client = Groq(api_key=api_key)

✅ API Key Loaded.


In [6]:
# --- DATA INGESTION & VALIDATION ---

# 1. Fail-fast check: stop immediately if the input file is missing
if not os.path.exists(INPUT_FILE):
    raise FileNotFoundError(f"❌ Input file not found at {INPUT_FILE}")

# 2. Load Data
df = pd.read_csv(INPUT_FILE)
print(f"📂 Loaded {len(df)} proposals.")
print("\nSample Data:")
print(df.head(3))

📂 Loaded 66 proposals.

Sample Data:
  proposalId                                        description
0       AL-1  for constructing the Bridge Replacement (Gradi...
1       AK-1  This federally funded contract includes all ne...
2       AZ-1  The proposed project is located in Coconino Co...


### Step 1: Generate Taxonomy
We feed all descriptions to Llama 3.3-70B in a single prompt to find natural clusters. The prompt (sent as a user message) spells out a strict JSON output schema, and the request uses JSON mode (`response_format={"type": "json_object"}`) so the results are machine-readable.

In [7]:
def generate_taxonomy(dataframe):
    """
    Phase 1 Core Logic: Taxonomy Discovery.
    Uses an LLM to read the entire dataset context and derive a schema.
    """
    print("🚀 Analyzing dataset to derive taxonomy...")

    all_descriptions = "\n".join([
        f"- {row['description']}"
        for _, row in dataframe.iterrows()
    ])

    # --- PROMPT ENGINEERING ---
    # 1. Persona: "Senior Data Architect" primes the model for structure.
    # 2. Constraint: "Mutually Exclusive" prevents overlapping tags.
    # 3. Output Control: "JSON only" ensures we can parse it programmatically.
    prompt = f"""
    You are a Senior Data Architect.
    Analyze the following project proposals and create a strict classification taxonomy.

    DATA:
    {all_descriptions}

    REQUIREMENTS:
    1. Create 4-8 categories that cover 90% of the data.
    2. Categories must be Mutually Exclusive.
    3. Output JSON only.

    OUTPUT SCHEMA:
    {{
      "taxonomy": {{
        "Category Name": {{
          "definition": "Strict 1-sentence definition.",
          "keywords": ["key", "words"],
          "text_example": "A generic example phrase based on the data (DO NOT use specific IDs)"
        }}
      }}
    }}
    """

    # --- LLM INFERENCE ---
    # JSON mode constrains the reply to valid JSON; temperature 0.0 makes the output
    # as repeatable as possible (identical results across runs are not guaranteed).
    completion = client.chat.completions.create(
        model="llama-3.3-70b-versatile",
        messages=[{"role": "user", "content": prompt}],
        temperature=0.0,
        response_format={"type": "json_object"}
    )

    try:
        return json.loads(completion.choices[0].message.content)
    except (json.JSONDecodeError, TypeError) as e:
        # TypeError covers the case where the message content is None.
        print(f"❌ JSON Parsing Error: {e}")
        return None

# --- EXECUTION ---
# Run the generation and persist the artifact for Phase 2 (the Agent)
taxonomy_data = generate_taxonomy(df)

if taxonomy_data:
    # Save to disk so the 'main.py' pipeline can load it later.
    with open(OUTPUT_TAXONOMY, 'w') as f:
        json.dump(taxonomy_data, f, indent=2)
    print(f"\n✅ Taxonomy saved to {OUTPUT_TAXONOMY}")
    print(f"   Categories: {list(taxonomy_data['taxonomy'].keys())}")

🚀 Analyzing dataset to derive taxonomy...

✅ Taxonomy saved to ..\data\taxonomy.json
   Categories: ['Bridge Construction and Repair', 'Roadway Rehabilitation and Resurfacing', 'Traffic and Pedestrian Infrastructure', 'Maintenance and Upkeep', 'Drainage and Water Management', 'Intelligent Transportation Systems (ITS) and Technology', 'Landscaping and Environmental', 'Utility and Facility Upgrades']
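
Before handing the file to Phase 2, it may be worth confirming that the model actually followed the schema requested in the prompt. The following is a minimal sketch, not part of the original notebook: `check_taxonomy_schema` and `REQUIRED_FIELDS` are hypothetical names, and the 4-8 range simply mirrors the requirement stated in the prompt.

```python
# Hypothetical helper: structural check of the taxonomy returned by the LLM.
# The field names mirror the OUTPUT SCHEMA section of the prompt.
REQUIRED_FIELDS = {"definition", "keywords", "text_example"}


def check_taxonomy_schema(taxonomy_data, min_categories=4, max_categories=8):
    """Return a list of problems; an empty list means the structure looks fine."""
    categories = taxonomy_data.get("taxonomy") if isinstance(taxonomy_data, dict) else None
    if not isinstance(categories, dict):
        return ["Top-level 'taxonomy' object is missing or not a dict."]

    problems = []
    if not (min_categories <= len(categories) <= max_categories):
        problems.append(
            f"Expected {min_categories}-{max_categories} categories, got {len(categories)}."
        )
    for name, spec in categories.items():
        if not isinstance(spec, dict):
            problems.append(f"'{name}': entry is not an object.")
            continue
        missing = REQUIRED_FIELDS - set(spec)
        if missing:
            problems.append(f"'{name}': missing fields {sorted(missing)}.")
        if not isinstance(spec.get("keywords", []), list):
            problems.append(f"'{name}': 'keywords' is not a list.")
    return problems
```

Calling `check_taxonomy_schema(taxonomy_data)` before `json.dump` and refusing to save when the list is non-empty would keep a malformed taxonomy out of `data/`.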


### Step 2: Validate Coverage (Spot Check)
We test the new taxonomy against a random sample to ensure the definitions are clear and the model can actually use them.

In [8]:
def validate_sample(dataframe, taxonomy):
    print("\n🔍 Spot Checking 5 random proposals...")

    # --- 1. RANDOM SAMPLING ---
    # Cap the sample size so small datasets (fewer than 5 rows) do not raise.
    sample = dataframe.sample(min(5, len(dataframe)))
    sample_text = "\n".join([f"ID {row['proposalId']}: {row['description']}" for _, row in sample.iterrows()])
    
    # --- 2. VALIDATION PROMPT ---
    prompt = f"""
    Taxonomy:
    {json.dumps(taxonomy['taxonomy'], indent=2)}

    Task: Classify these proposals.
    
    Proposals:
    {sample_text}

    Return JSON: {{ "results": [ {{ "id": "...", "category": "..." }} ] }}
    """

    # --- 3. TEST INFERENCE ---
    completion = client.chat.completions.create(
        model="llama-3.3-70b-versatile",
        messages=[{"role": "user", "content": prompt}],
        temperature=0.0,
        response_format={"type": "json_object"}
    )
    
    # --- 4. VISUAL INSPECTION ---
    results = json.loads(completion.choices[0].message.content)
    for res in results.get('results', []):
        print(f"   {res['id']}: {res['category']}")

# Execute Validation
if taxonomy_data:
    validate_sample(df, taxonomy_data)


🔍 Spot Checking 5 random proposals...
   AZ-1: Roadway Rehabilitation and Resurfacing
   ND-1: Roadway Rehabilitation and Resurfacing
   LA-2: Unknown
   FL-2: Landscaping and Environmental
   NS-1: Maintenance and Upkeep
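
The spot check above includes a label (`Unknown`) that is not one of the taxonomy's categories. The following is a minimal sketch of how such cases could be flagged automatically instead of by eye. It assumes the parsed `results` list is available (for example, if `validate_sample` were changed to return it); `find_unmapped` is a hypothetical helper, and the sample values are copied from the outputs shown above.

```python
def find_unmapped(results, taxonomy):
    """Return spot-check results whose category is not a key of the taxonomy."""
    valid = set(taxonomy["taxonomy"].keys())
    return [r for r in results if r.get("category") not in valid]


# Illustrative data copied from the outputs above (category details omitted).
taxonomy = {"taxonomy": {name: {} for name in [
    "Bridge Construction and Repair",
    "Roadway Rehabilitation and Resurfacing",
    "Traffic and Pedestrian Infrastructure",
    "Maintenance and Upkeep",
    "Drainage and Water Management",
    "Intelligent Transportation Systems (ITS) and Technology",
    "Landscaping and Environmental",
    "Utility and Facility Upgrades",
]}}
results = [
    {"id": "AZ-1", "category": "Roadway Rehabilitation and Resurfacing"},
    {"id": "LA-2", "category": "Unknown"},
    {"id": "FL-2", "category": "Landscaping and Environmental"},
]

print(find_unmapped(results, taxonomy))
# → [{'id': 'LA-2', 'category': 'Unknown'}]
```

A non-empty result suggests either that the taxonomy has a coverage gap or that the validation prompt should state explicitly what to return when no category fits.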
