# Lab 5: Leveraging Open Data from Wikipedia for LLM Prompt Engineering

## Overview
This lab demonstrates how to extract structured data from Wikipedia pages and use it to create effective prompts for Large Language Models (LLMs). You'll learn to work with real-world financial data, process it programmatically, and engineer prompts for various AI tasks.

## Learning Objectives
- ✓ Extract financial index components from Wikipedia
- ✓ Retrieve company infobox data programmatically
- ✓ Build structured datasets from semi-structured web data
- ✓ Design effective LLM prompts for different tasks
- ✓ Process and clean text data for AI consumption
- ✓ Create reusable prompt templates and utilities

## Part 1: Data Extraction from Wikipedia

### What is a Financial Index?
A financial index is a composite measure of a subset of companies in a specific market or sector. Examples include:
- **S&P 500**: 500 largest US companies
- **EURO STOXX 50**: 50 largest Eurozone companies
- **DAX**: 40 largest German companies

### Your Task
1. **Identify components**: Extract the list of companies in each index from Wikipedia
2. **Gather company data**: Retrieve detailed information (infoboxes) from each company's Wikipedia page
3. **Build a dataset**: Combine all data into structured format suitable for LLM processing
4. **Engineer prompts**: Create effective prompts that leverage this data for AI tasks
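
To make step 4 concrete, here is a minimal sketch of how one company record could be turned into a prompt. The template wording, the `build_prompt` helper, and the choice of the `name`, `industry`, and `type` fields are illustrative assumptions, not a format prescribed by this lab.

```python
from string import Template

# Hypothetical prompt template; the wording is an assumption for illustration.
PROMPT_TEMPLATE = Template(
    "You are a financial analyst.\n"
    "Using only the facts below, write a two-sentence company profile.\n\n"
    "Company: $name\n"
    "Industry: $industry\n"
    "Type: $type\n"
)


def build_prompt(record: dict) -> str:
    """Fill the template from an infobox-like dict, marking missing fields."""
    fields = {
        key: record.get(key) or "unknown"
        for key in ("name", "industry", "type")
    }
    return PROMPT_TEMPLATE.substitute(fields)


# Example record with already-cleaned values
print(build_prompt({"name": "3M Company", "industry": "Conglomerate", "type": "Public"}))
```

Marking missing fields as `unknown` keeps the prompt well-formed even when an infobox lacks a field, which happens often in practice.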

### Data Sources
- **Index components**: Wikipedia articles listing index members
- **Company data**: Wikipedia infoboxes (structured data boxes on company pages)
- **Dump file**: Optional - for advanced analysis of full Wikipedia articles

### Optional: Full Wikipedia Dump
For advanced analysis, you can download the complete Wikipedia dump from:
- **Link**: https://dumps.wikimedia.org/enwiki/
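- **File**: `enwiki-latest-pages-articles-multistream.xml.bz2` (the article dump), together with its companion index `enwiki-latest-pages-articles-multistream-index.txt.bz2`
- **Use case**: Full-text search or offline bulk processing of complete articles (this dump holds current revisions only; revision history is published in separate `pages-meta-history` dumps)
- **Note**: Very large files (tens of GB compressed, 100+ GB once decompressed) - requires significant storage and processing power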
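
If you do try the dump, the multistream index file lists one article per line in the form `offset:page_id:title`. The sketch below streams that index and looks up a few titles; `IndexEntry`, `parse_index_line`, and `find_titles` are hypothetical helper names introduced here for illustration.

```python
import bz2
from typing import Iterator, NamedTuple


class IndexEntry(NamedTuple):
    offset: int   # byte offset of the compressed stream that holds the page
    page_id: int  # numeric page ID
    title: str    # article title (may itself contain colons)


def parse_index_line(line: str) -> IndexEntry:
    """Parse one 'offset:page_id:title' line from the multistream index."""
    # Split on the first two colons only, so titles containing ':' survive.
    offset, page_id, title = line.rstrip("\n").split(":", 2)
    return IndexEntry(int(offset), int(page_id), title)


def find_titles(index_path: str, wanted: set[str]) -> Iterator[IndexEntry]:
    """Stream the compressed index and yield entries whose title is wanted."""
    with bz2.open(index_path, mode="rt", encoding="utf-8") as handle:
        for line in handle:
            entry = parse_index_line(line)
            if entry.title in wanted:
                yield entry
```

Streaming line by line avoids loading the whole index into memory.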

For this lab, we'll fetch only the pages we need (index pages over HTTP, company infoboxes through the Wikipedia API), which is far more efficient than processing a full dump.

In [4]:
# pathlib and typing are part of the Python standard library and need no install.
# lxml is listed explicitly because pd.read_html() uses it as its default parser.
!uv pip install pandas lxml tqdm wptools loguru numpy

In [10]:
# ============================================================================
# IMPORTS & SETUP
# ============================================================================
# These libraries enable us to work with Wikipedia data

import pandas as pd              # Data manipulation and analysis
import urllib.request           # HTTP requests to Wikipedia
from pathlib import Path        # Cross-platform file path handling
from typing import Union, Dict  # Type hints for better code clarity
from tqdm import tqdm          # Progress bars for long operations
import wptools               # Wikipedia parsing (infobox extraction)
from loguru import logger       # Enhanced logging
import json                     # Working with JSON data
import numpy as np             # Numerical operations
import re                       # Regular expressions for text cleaning

## Step 1: Extract Index Components from Wikipedia

### Task: Extract Company Lists
We'll extract the list of companies that make up each financial index directly from Wikipedia.

### Indices We're Covering:
1. **S&P 500** (USA) - 500 largest US companies
2. **EURO STOXX 50** (Europe) - 50 largest Eurozone companies  
3. **CAC 40** (France) - 40 largest French companies
4. **DAX** (Germany) - 40 largest German companies
5. **CSI 300** (China) - 300 largest Chinese companies
6. **S&P Latin America 40** (Latin America) - 40 major Latin American companies
7. **BSE SENSEX** (India) - 30 largest Indian companies
8. **NASDAQ-100** (USA Tech) - 100 largest non-financial NASDAQ companies

### How It Works:
- Each index has a Wikipedia article with a table listing its components
- We'll use `pd.read_html()` to extract all tables from these pages
- Tables are saved as CSV files for later processing
- This approach is fast, requires no authentication, and respects Wikipedia's terms
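
Each page yields several tables, and only one of them lists the index components. As a minimal sketch, the saved CSV files can be scanned for a known header instead of hard-coding a table number. `find_table_with_column` is a hypothetical helper, and it assumes the `table_<n>.csv` naming used in this lab.

```python
import csv
from pathlib import Path
from typing import Optional


def find_table_with_column(save_dir: Path, column: str) -> Optional[Path]:
    """Return the first table_<n>.csv in save_dir whose header has `column`."""
    # Sort numerically so table_2.csv is checked before table_10.csv.
    candidates = sorted(
        save_dir.glob("table_*.csv"),
        key=lambda p: int(p.stem.split("_")[1]),
    )
    for csv_path in candidates:
        with open(csv_path, newline="", encoding="utf-8") as handle:
            header = next(csv.reader(handle), [])  # empty file -> empty header
        if column in header:
            return csv_path
    return None


# Usage (assumes the S&P 500 tables were already saved to this folder):
# find_table_with_column(Path("./data/indices/sp500"), "Security")
```

Wikipedia pages change over time, so looking tables up by header tends to be more robust than relying on their position on the page.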

In [2]:
# ============================================================================
# FUNCTION 1: Extract Tables from Wikipedia
# ============================================================================
# This function downloads tables from Wikipedia articles and saves them locally

from io import StringIO  # Lets pd.read_html() read the HTML as a file-like object


def get_index_components(wiki_url: str, save_dir: Union[str, Path],
                         opener: urllib.request.OpenerDirector) -> None:
    """
    Extract all HTML tables from a Wikipedia page and save as CSV files.
    
    Parameters:
    -----------
    wiki_url : str
        The Wikipedia page URL to scrape (e.g., list of index components)
    save_dir : Union[str, Path]
        Directory where CSV files will be saved
    opener : urllib.request.OpenerDirector
        Custom URL opener with proper User-Agent headers
        
    Output:
    -------
    Creates CSV files named table_0.csv, table_1.csv, etc. in save_dir
    Each file contains one table from the Wikipedia page
    
    Example:
    --------
    >>> get_index_components(
    ...     "https://en.wikipedia.org/wiki/List_of_S%26P_500_companies",
    ...     "./data/indices/sp500",
    ...     opener
    ... )
    """
    save_path = Path(save_dir)
    save_path.mkdir(parents=True, exist_ok=True)
    
    # Fetch the Wikipedia page using the opener
    with opener.open(wiki_url) as response:
        html_content = response.read().decode('utf-8')
    
    # Extract all tables from the HTML using pandas.
    # Passing a literal HTML string is deprecated in recent pandas versions,
    # so the markup is wrapped in StringIO.
    tables = pd.read_html(StringIO(html_content))
    
    # Save each table as a CSV file
    for i, table in enumerate(tables):
        csv_path = save_path / f"table_{i}.csv"
        table.to_csv(csv_path, index=False)
    
    logger.info(f"Extracted {len(tables)} tables from {wiki_url} -> {save_dir}")

In [3]:
# ============================================================================
# SETUP: Configure Wikipedia Index URLs and HTTP Headers
# ============================================================================

# Dictionary mapping index names to their Wikipedia article URLs
# These URLs contain tables with the company components of each index
indices = {
    "sp500": "https://en.wikipedia.org/wiki/List_of_S%26P_500_companies",
    "eurostoxx50": "https://en.wikipedia.org/wiki/EURO_STOXX_50",
    "cac40": "https://en.wikipedia.org/wiki/CAC_40",
    "dax": "https://en.wikipedia.org/wiki/DAX",
    "csi300": "https://en.wikipedia.org/wiki/CSI_300_Index",
    "spla40": "https://en.wikipedia.org/wiki/S%26P_Latin_America_40",
    "bsesensex": "https://en.wikipedia.org/wiki/BSE_SENSEX",
    "nasdaq100": "https://en.wikipedia.org/wiki/Nasdaq-100",
}

# IMPORTANT: Configure HTTP headers to identify our bot to Wikipedia
# This is REQUIRED for ethical web scraping - identify yourself!
# Wikipedia may block requests without proper User-Agent headers

opener = urllib.request.build_opener()
opener.addheaders = [
    ("User-Agent", "MyResearchBot/1.0 (contact@example.com)")  # Identify your bot
]
urllib.request.install_opener(opener)

In [4]:
# ============================================================================
# EXECUTION: Download Index Components
# ============================================================================
# Loop through each index and extract its company components from Wikipedia
# This may take a few minutes depending on internet speed

for index_name, wiki_url in tqdm(indices.items(), desc="Downloading indices"):
    save_dir = Path(f"./data/indices/{index_name}")
    get_index_components(wiki_url, save_dir, opener)

Downloading indices:   0%|          | 0/8 [00:00<?, ?it/s]

2025-11-28 10:56:47.378 | INFO     | __main__:get_index_components:48 - Extracted 3 tables from https://en.wikipedia.org/wiki/List_of_S%26P_500_companies -> data/indices/sp500
2025-11-28 10:56:47.628 | INFO     | __main__:get_index_components:48 - Extracted 10 tables from https://en.wikipedia.org/wiki/EURO_STOXX_50 -> data/indices/eurostoxx50
2025-11-28 10:56:47.883 | INFO     | __main__:get_index_components:48 - Extracted 20 tables from https://en.wikipedia.org/wiki/CAC_40 -> data/indices/cac40
2025-11-28 10:56:48.070 | INFO     | __main__:get_index_components:48 - Extracted 10 tables from https://en.wikipedia.org/wiki/DAX -> data/indi

## Step 2: Extract Company Infoboxes from Wikipedia

### What are Infoboxes?
Wikipedia infoboxes are structured data boxes that appear on the right side of articles. They contain:
- Company name and alternative names
- Industry classification
- Founded date and location
- Key executives
- Headquarters location
- Number of employees
- Revenue and financial metrics
- Official website URLs
- Stock exchange listings
- And much more...

### Why Infoboxes?
- **Semi-structured data**: Unlike free-form article text, infoboxes store information as key-value fields
- **Consistency**: Fields largely follow a shared template across similar articles
- **Ease of extraction**: Libraries such as `wptools` can parse infoboxes directly from a page's wikitext
- **Rich context**: Compact, factual fields (industry, headquarters, revenue) make good grounding material for LLM prompts

### Process
1. Use the `wptools` library to fetch each company's Wikipedia page
2. Extract the infobox (structured data) from the page parse
3. Save as JSON for flexibility and later processing
4. Handle errors gracefully (some companies have no Wikipedia page, or their listed name does not match an article title)
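
Step 3 writes one JSON file per company, so the company name ends up in a filename. The sketch below maps names to safe paths and checks whether a cached file already exists, so that re-running the notebook does not fetch the same page twice. `infobox_path` and `needs_fetch` are hypothetical helpers, and the set of replaced characters is an assumption.

```python
import re
from pathlib import Path


def infobox_path(base_dir: Path, company_name: str) -> Path:
    """Map a company name to a JSON path that is safe on common filesystems."""
    # Replace path separators and other characters that are unsafe in filenames.
    safe_name = re.sub(r'[\\/:*?"<>|]+', "_", company_name).strip()
    return base_dir / f"{safe_name}.json"


def needs_fetch(base_dir: Path, company_name: str) -> bool:
    """Return True when no cached infobox file exists yet for this company."""
    return not infobox_path(base_dir, company_name).exists()
```

Skipping companies that are already cached also reduces the number of requests sent to Wikipedia.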

In [12]:
# ============================================================================
# EXAMPLE: Extract a Single Company Infobox
# ============================================================================
# This example shows the process for one company (3M) from S&P 500
# In production, we'd loop this for all companies

# Create directory for storing infobox data
infobox_dir = Path("./data/infoboxes/sp500")
infobox_dir.mkdir(parents=True, exist_ok=True)

# Example: Extract infobox for 3M company
company_name = "3M"

try:
    # Create a Wikipedia page object and fetch parsed data
    page = wptools.page(company_name, silent=True)
    page.get_parse()
    
    # Extract the infobox (structured data from the page)
    infobox = page.data.get('infobox', {})
    
    # Save infobox to JSON file
    json_path = infobox_dir / f"{company_name}.json"
    with open(json_path, 'w', encoding='utf-8') as f:
        json.dump(infobox, f, indent=2, ensure_ascii=False)
    
    logger.info(f"Successfully extracted infobox for {company_name}")
    
except Exception as e:
    logger.error(f"Failed to extract infobox for {company_name}: {e}")

2025-11-28 09:43:11.011 | INFO     | __main__:<module>:27 - Successfully extracted infobox for 3M


In [13]:
# ============================================================================
# DISPLAY: View the Extracted Infobox
# ============================================================================
# This shows what data we extracted from Wikipedia

# Load and display the saved infobox
json_path = Path("./data/infoboxes/sp500/3M.json")

if json_path.exists():
    with open(json_path, 'r', encoding='utf-8') as f:
        infobox_data = json.load(f)
    
    print(f"Infobox for 3M\n")
    print(f"Total fields extracted: {len(infobox_data)}\n")
    
    # Display each field in the infobox
    for key, value in infobox_data.items():
        print(f"{key}: {value}")
else:
    print("Infobox file not found. Run the extraction cell first.")

Infobox for 3M

Total fields extracted: 24

name: 3M Company
logo: 3M wordmark.svg
logo_size: 175px
image: 3-M Building Maplewood MN1.jpg
image_size: 250px
image_caption: 3M headquarters in [[Maplewood, Minnesota]]
former_name: Minnesota Mining and Manufacturing Company (1902–2002)
type: [[Public company|Public]]
traded_as: {{Unbulleted list|New York Stock Exchange|MMM|[[Dow Jones Industrial Average|DJIA]] component|[[S&P 100]] component|[[S&P 500]] component}} {{New York Stock Exchange|MMM}}
ISIN: {{ISIN|sl|=|n|pl|=|y|US88579Y1010}}
industry: [[Conglomerate (company)|Conglomerate]]
foundation: {{Start date and age|1902|6|13}} in [[Two Harbors, Minnesota]], U.S.
founders: {{Unbulleted list|J. Danley Budd|Henry S. Bryan|William A. McGonagle|John Dwan|Hermon W. Cable | Charles Simmons|ref|{{cite web |url=https://www.3m.com.au/3M/en_AU/company-au/news-releases/full-story/?storyid=51f5cfac-3ea9-4a98-a406-e2b955c3fd40 |title=It all started with a rock |date=June 11, 2021 |work=3M Australia 
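
The values above are raw wikitext: links appear as `[[target|label]]` and dates are wrapped in templates. Below is a minimal sketch of two cleaning helpers, assuming simple, non-nested markup. `strip_wikilinks` and `first_year` are hypothetical names, and the regular expressions are a rough heuristic, not a full wikitext parser.

```python
import re
from typing import Optional

# [[target|label]] -> label, [[target]] -> target
WIKILINK = re.compile(r"\[\[(?:[^\]|]*\|)?([^\]]*)\]\]")
# First standalone four-digit number, used as a rough "year" heuristic
YEAR = re.compile(r"\b(\d{4})\b")


def strip_wikilinks(value: str) -> str:
    """Replace wiki links with their visible text."""
    return WIKILINK.sub(r"\1", value)


def first_year(value: str) -> Optional[int]:
    """Return the first four-digit number in the value, if any."""
    match = YEAR.search(value)
    return int(match.group(1)) if match else None


print(strip_wikilinks("[[Public company|Public]]"))        # Public
print(first_year("{{Start date and age|1902|6|13}}"))      # 1902
```

Templates such as `{{Unbulleted list|...}}` need more careful handling than this and are left untouched here.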

## Step 3: Aggregate Infoboxes into Databases

### What We're Building
We're converting individual JSON files (one per company) into consolidated CSV databases (one per index).

### Why?
- **Easier analysis**: CSV format works with pandas, Excel, and most analysis tools
- **Efficiency**: One file per index instead of hundreds of individual JSON files
- **Standardization**: Creates a uniform dataset structure for LLM processing

### Process
1. Read all JSON infobox files for an index from disk
2. Convert each JSON to a DataFrame row
3. Concatenate all rows into a single DataFrame
4. Save as CSV with proper encoding
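
Because each infobox uses its own subset of fields, the combined table has one column per field seen in any record, and many cells are empty. A minimal sketch for measuring how often each field is filled, using only the standard library; `field_coverage` is a hypothetical helper introduced for illustration.

```python
from collections import Counter


def field_coverage(records: list[dict]) -> dict[str, float]:
    """Fraction of records in which each field has a non-empty value."""
    counts = Counter(
        key
        for record in records
        for key, value in record.items()
        if value  # skip empty strings and None
    )
    total = len(records)
    if total == 0:
        return {}
    return {key: counts[key] / total for key in sorted(counts)}


# Toy records for illustration only
sample = [
    {"name": "A", "industry": "X"},
    {"name": "B", "ISIN": ""},
]
print(field_coverage(sample))  # {'industry': 0.5, 'name': 1.0}
```

Fields with low coverage are candidates to drop before building prompts, since they add columns without adding much information.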

### Notes for Future Enhancement
- The infoboxes contain many fields beyond what we use now (URLs, images, etc.)
- Future work could extract and leverage additional information
- This foundation allows flexible data extraction later

In [None]:
# ============================================================================
# EXECUTION: Merge All Infoboxes into Index Databases
# ============================================================================
# Loop through each index folder and consolidate all JSON infoboxes into CSV

infoboxes_base = Path("./data/infoboxes")
databases_dir = Path("./data/databases")
databases_dir.mkdir(parents=True, exist_ok=True)

# Process each index folder
for index_dir in infoboxes_base.iterdir():
    if not index_dir.is_dir():
        continue
    
    index_name = index_dir.name
    
    # Collect all JSON files in this index folder
    json_files = list(index_dir.glob("*.json"))
    
    if not json_files:
        logger.warning(f"No JSON files found in {index_dir}")
        continue
    
    # Read each JSON file and collect into a list
    records = []
    for json_file in json_files:
        with open(json_file, 'r', encoding='utf-8') as f:
            data = json.load(f)
            # Add company name from filename
            data['_source_file'] = json_file.stem
            records.append(data)
    
    # Create DataFrame from all records
    df = pd.DataFrame(records)
    
    # Save as CSV
    csv_path = databases_dir / f"{index_name}_infoboxes.csv"
    df.to_csv(csv_path, index=False, encoding='utf-8')
    
    logger.info(f"Aggregated {len(records)} companies from {index_name} -> {csv_path}")

2025-11-28 09:43:14.602 | INFO     | __main__:<module>:41 - Aggregated 1 companies from sp500 -> data/databases/sp500_infoboxes.csv


## New Step: Complete Process for Every Index

In [5]:
# ============================================================================
# FUNCTION: Fetch All Company Infoboxes for Any Index
# ============================================================================
import time  # For rate limiting

# Configuration: which table and column contains company names for each index
INDEX_CONFIG = {
    "sp500":       {"table": "table_1.csv", "column": "Security"},
    "eurostoxx50": {"table": "table_4.csv", "column": "Name"},
    "cac40":       {"table": "table_4.csv", "column": "Company"},
    "dax":         {"table": "table_4.csv", "column": "Company"},
    "csi300":      {"table": "table_3.csv", "column": "Company"},
    "spla40":      {"table": "table_1.csv", "column": "Company name"},
    "bsesensex":   {"table": "table_2.csv", "column": "Company"},
    "nasdaq100":   {"table": "table_4.csv", "column": "Company"},
}

def fetch_index_infoboxes(index_name: str) -> pd.DataFrame:
    """
    Fetch Wikipedia infoboxes for all companies in a given index.
    
    Parameters:
    -----------
    index_name : str
        One of: "sp500", "eurostoxx50", "cac40", "dax", "csi300", 
                "spla40", "bsesensex", "nasdaq100"
    
    Returns:
    --------
    pd.DataFrame with all fetched infoboxes
    Also saves to ./data/databases/{index_name}_infoboxes.csv
    """
    if index_name not in INDEX_CONFIG:
        raise ValueError(f"Unknown index: {index_name}. Choose from: {list(INDEX_CONFIG.keys())}")
    
    config = INDEX_CONFIG[index_name]
    
    # Read company list
    table_path = Path(f"./data/indices/{index_name}") / config["table"]
    df_companies = pd.read_csv(table_path)
    companies = df_companies[config["column"]].dropna().unique().tolist()
    logger.info(f"Found {len(companies)} companies in {index_name}")
    
    # Output directory
    databases_dir = Path("./data/databases")
    databases_dir.mkdir(parents=True, exist_ok=True)
    
    # Fetch infoboxes
    records = []
    failed = []
    
    for company_name in tqdm(companies, desc=f"Fetching {index_name} infoboxes"):
        try:
            page = wptools.page(company_name, silent=True)
            page.get_parse()
            infobox = page.data.get('infobox', {})
            
            if infobox:
                infobox['_company_name'] = company_name
                records.append(infobox)
            else:
                failed.append(company_name)
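        except Exception as e:
            # Typically a missing page or a name that does not match an article title
            logger.debug(f"Could not fetch {company_name}: {e}")
            failed.append(company_name)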
        
        time.sleep(0.5)  # Rate limiting
    
    # Save results
    df_infoboxes = pd.DataFrame(records)
    csv_path = databases_dir / f"{index_name}_infoboxes.csv"
    df_infoboxes.to_csv(csv_path, index=False, encoding='utf-8')
    
    logger.info(f"Successfully fetched {len(records)} infoboxes, failed {len(failed)}")
    if failed:
        logger.warning(f"Failed: {failed[:10]}{'...' if len(failed) > 10 else ''}")
    logger.info(f"Saved to {csv_path}")
    
    return df_infoboxes
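
Many failures come from names in the index table that do not match a Wikipedia article title. One possible mitigation is to try a few title variants before giving up. The sketch below only generates the variants; `candidate_titles`, the suffix list, and the `(company)` disambiguator are assumptions for illustration and have not been checked against the actual failures.

```python
import re

# Hypothetical list of corporate suffixes to try removing
_SUFFIXES = re.compile(r",?\s+(Inc\.?|Corp\.?|Corporation|plc|S\.A\.|SE|AG|N\.V\.)$")
# Trailing parenthetical such as "(Class A)"
_PAREN = re.compile(r"\s*\([^)]*\)\s*$")


def candidate_titles(name: str) -> list[str]:
    """Return title variants to try, in order, without duplicates."""
    candidates = [name.strip()]
    no_paren = _PAREN.sub("", candidates[0])
    no_suffix = _SUFFIXES.sub("", no_paren)
    for variant in (no_paren, no_suffix, f"{no_suffix} (company)"):
        if variant and variant not in candidates:
            candidates.append(variant)
    return candidates


print(candidate_titles("Alphabet Inc. (Class A)"))
# ['Alphabet Inc. (Class A)', 'Alphabet Inc.', 'Alphabet', 'Alphabet (company)']
```

Each variant could then be passed to `wptools.page()` in turn until one returns an infobox, keeping the `time.sleep()` call between requests.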

In [None]:
fetch_index_infoboxes("nasdaq100")

2025-11-28 10:21:36.495 | INFO     | __main__:fetch_index_infoboxes:42 - Found 102 companies in nasdaq100
Fetching nasdaq100 infoboxes:   3%|▎         | 3/102 [00:04<02:44,  1.66s/it]API error: {'code': 'missingtitle', 'info': "The page you specified doesn't exist.", 'docref': 'See https://en.wikipedia.org/w/api.php for API usage. Subscribe to the mediawiki-api-announce mailing list at &lt;https://lists.wikimedia.org/postorius/lists/mediawiki-api-announce.lists.wikimedia.org/&gt; for notice of API deprecations and breaking changes.'}
Fetching nasdaq100 infoboxes:   4%|▍         | 4/102 [00:05<02:07,  1.30s/it]API error: {'code': 'missingtitle', 'info': "The page you specified doesn't exist.", 'docref': 'See https://en.wikipedia.org/w/api.php for API usage. Subscribe to the mediawiki-api-announce mailing list at &lt;https://lists.wikimedia.org/postorius/lists/mediawiki-api-announce.lists.wikimedia.org/&gt; for notice of API deprecation

Unnamed: 0,name,logo,image,image_upright,image_caption,trading_name,former_name,type,traded_as,industry,...,locations,members,members_year,production,defunct,fate,successor,native_name,logo_class,genre
0,Adobe Inc.,[[File:Adobe Corporate wordmark.svg|frameless|...,Adobe World Headquarters.jpg,1.1,"[[Adobe World Headquarters]] in [[San Jose, Ca...",Adobe,Adobe Systems Incorporated (1982–2018),[[Public company|Public]],{{Unbulleted list|NASDAQ|ADBE|[[Nasdaq-100]] c...,[[Software]],...,,,,,,,,,,
1,"Advanced Micro Devices, Inc.",[[File:AMD Logo.svg|frameless|upright=1.1|clas...,2485 Augustine Drive headquarters in Santa Cla...,1.1,"Headquarters in [[Santa Clara, California]], i...",,,[[Public company|Public]],{{Unbulleted list\n | |NASDAQ|AMD|\n | [[Nas...,{{ubl|[[Semiconductor industry|Semiconductor]]...,...,,,,,,,,,,
2,"Airbnb, Inc.",Airbnb Logo Bélo.svg,"888 Brannan, San Francisco, 2016.jpg",,Headquarters at 888 Brannan Street,,,[[Public company|Public]],{{ubl|NASDAQ|ABNB| (Class A)|[[Nasdaq-100]] co...,[[Lodging]],...,,,,,,,,,,
3,"American Electric Power Company, Inc.",AEP-Logo-Red-Gray.svg,AEP Building 1.jpg,,"[[AEP Building]], the company's headquarters i...",,,[[Public company|Public]],{{ubl|NASDAQ|AEP|[[DJUA]] component|[[Nasdaq-1...,[[Electric Utility|Electric utilities]],...,,,,,,,,,,
4,Amgen Inc.,Amgen.svg,Amgenheadquarters.jpg,,"Headquarters in Thousand Oaks, California",,,[[Public company|Public]],{{unbulleted list|NASDAQ|AMGN|[[Nasdaq-100]] c...,[[Biotechnology]],...,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
93,Vertex Pharmaceuticals Incorporated,Vertex logo.svg,,,,,,[[Public company|Public]],{{ubl|NASDAQ|VRTX|[[Nasdaq-100]] component|[[S...,{{ubl|[[Pharmaceuticals]] | [[Biotherapy|Bioth...,...,,,,,,,,,,
94,"Warner Bros. Discovery, Inc.",Warner Bros. Discovery.svg,,,WBD's headquarters in 230 [[Park Avenue South]...,,,[[Public company|Public]],{{ublist\n| |NASDAQ|WBD| (Series A)\n| [[Nasda...,{{ublist\n| [[Media conglomerate|Media]]\n| [[...,...,,,,,{{end date and age|2026|04}} (expected),,,,,
95,,Workday logo.svg,Workday Headquarters.jpg,,"Headquarters in Pleasanton, California",,,[[Public company|Public]],{{ubl| class|=|nowrap\n | |NASDAQ|WDAY| (Class...,{{ubl|[[Cloud computing]]|[[Enterprise softwar...,...,,,,,,,,,,
96,Xcel Energy Inc.,Xcel-energy.svg,ExcelEnergyDenver.jpg,,"1800 Larimer, Xcel Energy Regional Headquarter...",,,[[Public company|Public]],{{ubl|NASDAQ|XEL|[[DJUA]] component|[[Nasdaq-1...,[[Public utility|Utilities]],...,,,,{{ubl|Electric: 114.98 [[TWh]]|Natural Gas: 40...,,,,,,


In [19]:
fetch_index_infoboxes("cac40")

2025-11-28 10:25:07.008 | INFO     | __main__:fetch_index_infoboxes:42 - Found 40 companies in cac40
Fetching cac40 infoboxes: 100%|██████████| 40/40 [01:03<00:00,  1.60s/it]
2025-11-28 10:26:10.949 | INFO     | __main__:fetch_index_infoboxes:73 - Successfully fetched 37 infoboxes, failed 3
2025-11-28 10:26:10.951 | INFO     | __main__:fetch_index_infoboxes:76 - Saved to data/databases/cac40_infoboxes.csv


Unnamed: 0,name,logo,logo_alt,logo_size,foundation,founders,hq_location,hq_location_city,hq_location_country,type,...,caption,birth_date,birth_place,death_date,era,region,school_tradition,main_interests,notable_ideas,genre
0,Accor S.A.,Accor logo.svg,Accor logo,150,{{start date and age|1967|df|=|yes}} <br />[[P...,{{Unbulleted list|[[Gérard Pelisson]]|[[Paul D...,[[Tour Sequana]],[[Issy-les-Moulineaux]],France,[[Public company|Public]],...,,,,,,,,,,
1,Air Liquide S.A.,"Air Liquide - logo (France, 2017).svg",,250px,{{start date and age|1902}},,,,,[[S.A. (corporation)|Société Anonyme]],...,,,,,,,,,,
2,Airbus SE,Airbus Logo 2017.svg {{!}} class=skin-invert,,180px,,,"{{Indented plainlist|\n* [[Leiden]], Netherlan...",,,[[Public company|Public]],...,,,,,,,,,,
3,ArcelorMittal S.A.,ArcelorMittal.svg,,200px,{{start date and age|2007}},,,,,[[Public company|Public]],...,,,,,,,,,,
4,AXA S.A.,AXA Logo.svg,,165px,{{Start date and age|1921}},,,,,[[Public company|Public]],...,,,,,,,,,,
5,,BNP Paribas.svg,,50px,* {{start date and age|df|=|yes|1822|12|13}} C...,,,,,[[Public company|Public]],...,,,,,,,,,,
6,Bouygues S.A.,Bouygues.svg,,200px,{{start date and age|1952}},,,,,[[Public company|Public]],...,,,,,,,,,,
7,Bureau Veritas S.A.,Bureau Veritas.svg,,150px,{{start date and age|1828}},,,,,[[S.A. (corporation)|Société anonyme]],...,,,,,,,,,,
8,Capgemini SE,Capgemini New logo.svg,,250px,{{Start date and age|df|=|yes|1 October 1967}},,,,,[[Public company|Public]],...,,,,,,,,,,
9,Carrefour S.A.,Carrefour_Groupe.svg,,,{{start date and age|df|=|yes|1958|1|1}},{{ubl|Marcel Fournier|[[Denis Defforey]]|Jacqu...,,,,[[Public company]],...,,,,,,,,,,


In [20]:
fetch_index_infoboxes("spla40")

2025-11-28 10:27:56.008 | INFO     | __main__:fetch_index_infoboxes:42 - Found 40 companies in spla40
Fetching spla40 infoboxes:  42%|████▎     | 17/40 [00:27<00:39,  1.73s/it]API error: {'code': 'missingtitle', 'info': "The page you specified doesn't exist.", 'docref': 'See https://en.wikipedia.org/w/api.php for API usage. Subscribe to the mediawiki-api-announce mailing list at &lt;https://lists.wikimedia.org/postorius/lists/mediawiki-api-announce.lists.wikimedia.org/&gt; for notice of API deprecations and breaking changes.'}
Fetching spla40 infoboxes:  62%|██████▎   | 25/40 [00:41<00:28,  1.91s/it]API error: {'code': 'missingtitle', 'info': "The page you specified doesn't exist.", 'docref': 'See https://en.wikipedia.org/w/api.php for API usage. Subscribe to the mediawiki-api-announce mailing list at &lt;https://lists.wikimedia.org/postorius/lists/mediawiki-api-announce.lists.wikimedia.org/&gt; for notice of API deprecations and brea

Unnamed: 0,name,logo,logo_size,type,traded_as,foundation,hq_location_city,hq_location_country,area_served,key_people,...,trade_name,predecessors,brands,members,net_income_year,website,founded,hq_location,production,num_locations_year
0,Ambev S.A.,Ambev logo.svg,225px,[[Public company|Public]] [[subsidiary]],{{Unbulleted list|B3 (stock exchange)|cvm|=|23...,{{Start date and age|1999}},[[São Paulo]],Brazil,"[[Brazil]], [[Latin America]], [[Canada]]",Jean Jereissati ([[chairman]] & [[CEO]]),...,,,,,,,,,,
1,"América Móvil, S.A.B. de C.V.",Logo de América Móvil.svg,100px,[[Public company|Public]],{{BMV|AMX|6024}} <br /> {{New York Stock Excha...,{{start date and age|2000|09|25|df|=|yes}},,,,[[Carlos Slim Helú]] ([[Chairman|chairman emer...,...,,,,,,,,,,
2,Banco Bradesco S.A.,,250px,[[S.A. (corporation)|Sociedade Anônima]],{{B3 (stock exchange)|cvm|=|906|BBDC3|BBDC4}} ...,{{start date and age|1943|03|10}} in [[Marília...,[[Osasco]],[[Brazil]],Worldwide,Luiz Carlos Trabuco Cappi <small>([[Chairman]]...,...,,,,,,,,,,
3,Santander Chile Holding S.A.,Banco Santander Logotipo.svg,200px,[[S.A. (corporation)|Sociedad Anónima]],{{bcs|BSANTANDER}} <br /> {{New York Stock Exc...,1978,,,,"[[Mauricio Larraín]], <small>([[CEO]])</small>",...,,,,,,,,,,
4,Banco de Chile,Banco de Chile logo.svg,,[[S.A. (corporation)|Sociedad Anónima]],{{BCS|CHILE}} <br/> {{NYSE|BCH}} <br/> {{BMAD|...,October 1893,,,,{{unbulleted list|[[Pablo Granifo Lavín]] <sma...,...,,,,,,,,,,
5,Banco do Brasil S.A.,Banco do Brasil Logo.svg,250,[[S.A. (corporation)|Sociedade Anônima]],{{B3 (stock exchange)|cvm|=|1023|BBAS3}} <br>[...,"[[Rio de Janeiro]], [[Captaincy of Rio de Jane...",,,,[[Tarciana Medeiros]] ([[Chairperson|Chairwoma...,...,,,,,,,,,,
6,Bancolombia S.A.,Bancolombia S.A. logo.svg,250px,[[S.A. (corporation)|Sociedad Anónima]],{{BVC|BCOLOMBIA}} <br> {{nyse|CIB}},{{Start date and age|df|=|yes|1875|01|29}} (as...,,,"[[Colombia]], [[Cayman Islands]], [[El Salvado...","Juan Carlos Mora Uribe, ([[President (corporat...",...,,,,,,,,,,
7,BRF S.A.,BRF S.A. logo.svg,150px,[[Public company|Public]],{{B3 (stock exchange)|cvm|=|16292|BRFS3}} <br>...,"{{start date and age|August 18, 1934}}",,,Worldwide,Lorival Nogueira Luz Jr. (CEO) <Br> [[Pedro Pa...,...,,,,,,,,,,
8,Motiva,Motiva.svg,,[[S.A. (corporation)|Sociedade Anônima]],{{B3 (stock exchange)|cvm|=|18821|MOTV3}} <br>...,1999,,,,Miguel Setas ([[Chief executive officer|CEO]]),...,,,,,,,,,,
9,CEMEX S.A.B. de C.V.,Cemex_logo_2023.png,250px,[[S.A. (corporation)|Sociedad Anónima Bursátil...,{{BMV|CEMEX|5203}} <br /> {{NYSE|CX}} <<br/> {...,{{start date and age|1906|df|=|yes}},,,Worldwide,Rogelio Zambrano Lozano<br> {{small|(Executive...,...,,,,,,,,,,


In [6]:
fetch_index_infoboxes("dax")

2025-11-28 10:57:06.553 | INFO     | __main__:fetch_index_infoboxes:42 - Found 41 companies in dax
Fetching dax infoboxes: 100%|██████████| 41/41 [01:06<00:00,  1.62s/it]
2025-11-28 10:58:13.143 | INFO     | __main__:fetch_index_infoboxes:73 - Successfully fetched 37 infoboxes, failed 4
2025-11-28 10:58:13.144 | INFO     | __main__:fetch_index_infoboxes:76 - Saved to data/databases/dax_infoboxes.csv


Unnamed: 0,name,former_name,logo,logo_size,logo_caption,image,image_size,image_caption,type,traded_as,...,website,parent,logo_class,module,former_names,image_alt,logo_upright,trade_name,native_name,areas_served
0,Adidas AG,Gebrüder Dassler Schuhfabrik (1924–1949),Adidas 2022 logo.svg,200,Main logo since 2022,Herzogenaurach - Adidas - 2016.jpg,250,"Current factory outlet in Herzogenaurach, Germ...",[[Public company|Public]],{{FWB|ADS|isin|=|DE000A1EWWW0}} <br />[[DAX|DA...,...,,,,,,,,,,
1,Airbus SE,{{Indented plainlist|\n* '''Parent company:'''...,Airbus Logo 2017.svg {{!}} class=skin-invert,180px,,Airbus Lagardère - Aéroconstellation.jpg,250px,"Lagardère production plant in [[Blagnac]], France",[[Public company|Public]],{{Plainlist|\n* |BMAD|isin|=|NL0000235190|AIR|...,...,,,,,,,,,,
2,Allianz,,Allianz.svg,,,Wzwz_schwabing_26_allianz_building.JPG,,Headquarters in Munich,[[Public company|Public]] (''[[societas Europa...,{{ubl|FWB|ALV|[[DAX]] component}} {{FWB|ALV}},...,,,,,,,,,,
3,BASF SE,,BASF-Logo bw.svg,205px,,,,,[[Public company|Public]],{{ubl|FWB|BAS|isin|=|DE000BASF111|[[DAX]] comp...,...,,,,,,,,,,
4,Bayer AG,,Logo Bayer.svg,160px,,Leverkusen Kaiser-Wilhelm-Allee 0004.jpg,,Headquarters in Leverkusen,[[Public company|Public]],{{ubl|class|=|nowrap|FWB|BAYN|[[DAX]] componen...,...,,,,,,,,,,
5,Beiersdorf AG,,Beiersdorf Logo.svg {{!}} class=skin-invert-image,250px,Beiersdorf's logo used since January 2014,Beiersdorf Headquarters Hamburg 1.jpg,250px,"Headquarters in [[Hamburg]], Germany",[[Public company|Public]] ([[Aktiengesellschaf...,{{Unbulleted list | |FWB|BEI| | [[DAX]] compon...,...,,,,,,,,,,
6,Bayerische Motoren Werke Aktiengesellschaft,,BMW logo (white + grey background square).svg,,Official logo since 2020,"4 cilindros de BMW, Múnich, Alemania1.jpg",,"[[BMW Headquarters]] in Munich, Germany",[[Public company|Public]],{{unbulleted list| |FWB|BMW| |[[DAX]] componen...,...,,,,,,,,,,
7,Brenntag SE,,Brenntag Logo 2022.svg,250px,,,,,[[Public company|Public]] (''[[Societas Europa...,{{Unbulleted list |FWB|BNR| |FWB|BNRA| ([[Amer...,...,,,,,,,,,,
8,Commerzbank AG,,Commerzbank (2009).svg,,,Frankfurt_Commerzbank_vom_Schaumainkai.jpg,,"[[Commerzbank Tower]], the headquarters of Com...",[[Public company|Public]],{{FWB|CBK}} <br>[[DAX]],...,{{Official URL}},,,,,,,,,
9,Covestro AG,,Covestro Logo.svg,150px,,,,,''[[Aktiengesellschaft]]'',{{plainlist|\n* |FWB|1COV|}} {{FWB|1COV}},...,,,,,,,,,,


In [8]:
# ============================================================================
# STEP 4: LOAD AND ANALYZE INFOBOX DATA
# ============================================================================
# Load all available infobox CSVs and display basic statistics

databases_dir = Path("./data/databases")
csv_files = list(databases_dir.glob("*_infoboxes.csv"))

# Store all dataframes in a dictionary for later use
infobox_data = {}

for csv_path in sorted(csv_files):
    index_name = csv_path.stem.replace("_infoboxes", "")
    
    df = pd.read_csv(csv_path)
    infobox_data[index_name] = df
    
    print(f"\n{index_name.upper()}")
    print(f"\nCompanies: {len(df)}")
    print(f"Fields: {len(df.columns)}")
    print(f"\nAvailable columns:")
    print(", ".join(df.columns[:15]))
    if len(df.columns) > 15:
        print(f"... and {len(df.columns) - 15} more")
    print()

print(f"\nData loaded into 'infobox_data' dict with keys: {list(infobox_data.keys())}")


CAC40

Companies: 37
Fields: 74

Available columns:
name, logo, logo_alt, logo_size, foundation, founders, hq_location, hq_location_city, hq_location_country, type, industry, brands, area_served, traded_as, revenue
... and 59 more


DAX

Companies: 37
Fields: 68

Available columns:
name, former_name, logo, logo_size, logo_caption, image, image_size, image_caption, type, traded_as, founder, location_city, location_country, area_served, key_people
... and 53 more


NASDAQ100

Companies: 98
Fields: 70

Available columns:
name, logo, image, image_upright, image_caption, trading_name, former_name, type, traded_as, industry, area_served, key_people, products, services, revenue
... and 55 more


SP500

Companies: 474
Fields: 120

Available columns:
name, logo, logo_size, image, image_size, image_caption, former_name, type, traded_as, ISIN, industry, foundation, founders, location_city, location_country
... and 105 more


SPLA40

Companies: 32
Fields: 51

Available columns:
name, logo, logo_s
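
The column lists above differ from index to index (`founders` and `hq_location` in CAC40, `founder` and `location_city` in DAX), which is why the later preprocessing step maps several candidate column names onto one output field. A minimal sketch of how that overlap might be measured is shown here. `column_coverage` is a hypothetical helper, not part of the lab code, and the toy column lists are illustrative only.

```python
from collections import Counter
from typing import Dict, List


def column_coverage(columns_by_index: Dict[str, List[str]]) -> Dict[str, int]:
    """Count in how many datasets each (lowercased) column name appears."""
    counts: Counter = Counter()
    for columns in columns_by_index.values():
        # Use a set so a column repeated inside one dataset is counted once
        counts.update({col.lower() for col in columns})
    return dict(counts)


# Toy input (illustrative, not the real column lists)
toy = {
    "cac40": ["name", "founders", "hq_location", "revenue"],
    "dax": ["name", "founder", "location_city", "revenue"],
}
coverage = column_coverage(toy)
shared = sorted(col for col, n in coverage.items() if n == len(toy))
print(shared)  # ['name', 'revenue']
```

With the real data, the input could be built as `{k: list(df.columns) for k, df in infobox_data.items()}`.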

In [12]:
# ============================================================================
# STEP 5: DATA PREPROCESSING FOR LLM
# ============================================================================
# Functions to clean and format data for Large Language Models

import re
from typing import Dict

import pandas as pd


def clean_text(text) -> str:
    """
    Clean and normalize text for LLM input.
    Removes Wikipedia markup and normalizes whitespace.
    """
    if pd.isna(text) or text is None:
        return ""
    
    text = str(text)
    
    # Handle common Wikipedia templates by extracting useful content
    # {{US$|24.58 billion|...}} -> $24.58 billion
    text = re.sub(r'\{\{US\$\|([^}|]+)[^}]*\}\}', r'$\1', text)
    
    # {{circa|61,500}} -> ~61,500
    text = re.sub(r'\{\{circa\|([^}|]+)[^}]*\}\}', r'~\1', text)
    
    # {{increase}}, {{decrease}} -> arrows
    text = re.sub(r'\{\{increase\}\}', '↑', text)
    text = re.sub(r'\{\{decrease\}\}', '↓', text)
    
    # {{Start date and age|1902|6|13}} -> 1902-6-13 (parts are copied as-is, without zero padding)
    text = re.sub(r'\{\{Start date and age\|(\d+)\|(\d+)\|(\d+)[^}]*\}\}', r'\1-\2-\3', text)
    
    # {{URL|example.com}} -> example.com
    text = re.sub(r'\{\{URL\|([^}|]+)[^}]*\}\}', r'\1', text)
    
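    # {{plainlist|...}}, {{Unbulleted list|...}}, {{ubl|...}} - drop the wrapper, keep the items
    # (case-insensitive: the data also contains {{Plainlist|, {{unbulleted list| and {{ubl|)
    text = re.sub(r'\{\{\s*(?:plainlist|unbulleted list|ubl)\s*\|', '', text, flags=re.IGNORECASE)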
    
    # Remove remaining templates iteratively (handles nesting)
    prev_text = ""
    while prev_text != text:
        prev_text = text
        text = re.sub(r'\{\{[^{}]*\}\}', '', text)
    
    # Extract text from wiki links [[text|display]] -> display or [[text]] -> text
    text = re.sub(r'\[\[([^|\]]*\|)?([^\]]+)\]\]', r'\2', text)
    
    # Remove reference tags
    text = re.sub(r'<ref[^>]*>.*?</ref>', '', text, flags=re.DOTALL)
    text = re.sub(r'<ref[^/]*/?>', '', text)
    text = re.sub(r'</ref>', '', text)
    text = re.sub(r'<[^>]+>', '', text)
    
    # Clean up remaining brackets, braces, pipes, asterisks
    text = re.sub(r'[\[\]{}|*]', '', text)
    
    # Normalize whitespace
    text = re.sub(r'\s+', ' ', text).strip()
    
    return text


def extract_key_facts(row: pd.Series) -> Dict[str, str]:
    """
    Extract key facts from a company infobox row.
    Uses case-insensitive matching to handle Wikipedia's varying field names.
    """
    # Map of output field names to possible column names (lowercase)
    field_mappings = {
        'name': ['name', '_company_name', 'company_name'],
        'type': ['type'],
        'industry': ['industry', 'industries'],
        'founded': ['foundation', 'founded', 'established'],
        'founder': ['founder', 'founders'],
        'headquarters': ['hq_location', 'headquarters', 'location_city', 'hq_location_city'],
        'country': ['location_country', 'hq_location_country', 'country'],
        'key_people': ['key_people', 'ceo', 'chairman'],
        'employees': ['num_employees', 'employees'],
        'revenue': ['revenue'],
        'website': ['website', 'url', 'homepage'],
    }
    
    # Create lowercase column mapping
    col_lower = {col.lower(): col for col in row.index}
    
    facts = {}
    for field, possible_names in field_mappings.items():
        for name in possible_names:
            if name in col_lower:
                value = row[col_lower[name]]
                cleaned = clean_text(value)
                if cleaned:
                    facts[field] = cleaned
                    break
    
    return facts


def row_to_context(row: pd.Series) -> str:
    """
    Convert a company row into a formatted context string for LLM input.
    """
    facts = extract_key_facts(row)
    
    if not facts:
        return "No company information available."
    
    lines = ["COMPANY INFORMATION:", "-" * 40]
    
    # Order of fields to display
    field_order = ['name', 'type', 'industry', 'founded', 'founder', 
                   'headquarters', 'country', 'key_people', 'employees', 
                   'revenue', 'website']
    
    for field in field_order:
        if field in facts:
            label = field.replace('_', ' ').title()
            lines.append(f"{label}: {facts[field]}")
    
    return "\n".join(lines)


# ============================================================================
# TEST: Demonstrate preprocessing on sample company
# ============================================================================

# Test on first company from S&P 500 data
if 'infobox_data' in dir() and 'sp500' in infobox_data:
    sample_row = infobox_data['sp500'].iloc[0]
    
    print("=" * 60)
    print("SAMPLE: Raw vs Cleaned Data")
    print("=" * 60)
    
    # Show raw value examples
    for col in ['revenue', 'num_employees', 'foundation']:
        if col in sample_row.index:
            print(f"\nRaw '{col}':")
            print(f"  {sample_row[col]}")
            print(f"Cleaned:")
            print(f"  {clean_text(sample_row[col])}")
    
    print("\n" + "=" * 60)
    print("EXTRACTED KEY FACTS:")
    print("=" * 60)
    facts = extract_key_facts(sample_row)
    for k, v in facts.items():
        print(f"  {k}: {v}")
    
    print("\n" + "=" * 60)
    print("FORMATTED CONTEXT FOR LLM:")
    print("=" * 60)
    print(row_to_context(sample_row))
else:
    print("Run Step 4 first to load infobox_data")

SAMPLE: Raw vs Cleaned Data

Raw 'revenue':
  {{decrease}} {{US$|24.58 billion|link|=|yes}} (2024)
Cleaned:
  ↓ $24.58 billion (2024)

Raw 'num_employees':
  {{circa|61,500}} (2024)
Cleaned:
  ~61,500 (2024)

Raw 'foundation':
  {{Start date and age|1902|6|13}} in [[Two Harbors, Minnesota]], U.S.
Cleaned:
  1902-6-13 in Two Harbors, Minnesota, U.S.

EXTRACTED KEY FACTS:
  name: 3M Company
  type: Public
  industry: Conglomerate
  founded: 1902-6-13 in Two Harbors, Minnesota, U.S.
  founder: J. Danley BuddHenry S. BryanWilliam A. McGonagleJohn DwanHermon W. Cable Charles Simmonsref
  headquarters: Maplewood, Minnesota
  country: U.S.
  key_people: Michael F. Roman (chairman) William M. Brown (CEO)ref
  employees: ~61,500 (2024)
  revenue: ↓ $24.58 billion (2024)
  website: 3m.com

FORMATTED CONTEXT FOR LLM:
COMPANY INFORMATION:
----------------------------------------
Name: 3M Company
Type: Public
Industry: Conglomerate
Founded: 1902-6-13 in Two Harbors, Minnesota, U.S.
Founder: J. Danl
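
Two artifacts are visible in the output above: list items run together (`BuddHenry`), and the date keeps its unpadded month (`1902-6-13`). The sketch below shows one possible way to handle both before the generic cleanup runs. It assumes list items are separated by `<br>` tags or `*` bullets, which cannot be confirmed from the truncated output. `normalize_separators` and `format_start_dates` are hypothetical helper names, not part of the lab code.

```python
import re


def normalize_separators(text: str) -> str:
    """Turn <br> tags and wiki list bullets into '; ' so items stay separated."""
    text = re.sub(r'<br\s*/?>', '; ', text, flags=re.IGNORECASE)
    # A bullet is an asterisk at the start of the text or right after a newline
    text = re.sub(r'(?:^|\n)\s*\*\s*', '; ', text)
    # Collapse repeated separators, then trim separators at both ends
    text = re.sub(r'(?:;\s*)+', '; ', text)
    return text.strip('; ').strip()


def _iso_date(match: re.Match) -> str:
    """Format the three captured numbers as a zero-padded YYYY-MM-DD date."""
    year, month, day = (int(g) for g in match.groups())
    return f"{year:04d}-{month:02d}-{day:02d}"


def format_start_dates(text: str) -> str:
    """Replace {{Start date and age|Y|M|D}} templates with zero-padded dates."""
    pattern = r'\{\{Start date and age\|(\d+)\|(\d+)\|(\d+)[^}]*\}\}'
    return re.sub(pattern, _iso_date, text)


print(normalize_separators("A<br>B<br />C"))   # A; B; C
print(format_start_dates("{{Start date and age|1902|6|13}} in X"))  # 1902-06-13 in X
```

If adopted, both helpers would need to run at the top of `clean_text`, before the steps that strip HTML tags and asterisks, since those steps discard the separators.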

In [13]:
# ============================================================================
# STEP 6: CREATE PROMPT TEMPLATES FOR LLM TASKS
# ============================================================================
# Different tasks require different prompt structures. Build task-specific
# prompt templates that can be reused across your dataset.

from typing import List


class PromptBuilder:
    """
    A class for creating structured prompts for various LLM tasks.
    Each method returns a formatted prompt string ready for LLM input.
    """
    
    def qa_prompt(self, context: str, question: str) -> str:
        """
        Create a Q&A prompt about a company.
        
        Parameters:
        -----------
        context : str
            Company information from row_to_context()
        question : str
            The question to answer about the company
            
        Returns:
        --------
        str : Formatted prompt for Q&A task
        """
        return f"""You are a helpful assistant answering questions about companies.

{context}

Question: {question}

Please provide a clear, concise answer based only on the information provided above. If the information is not available, say so."""

    def classification_prompt(self, context: str, categories: List[str]) -> str:
        """
        Create a classification prompt to categorize a company.
        
        Parameters:
        -----------
        context : str
            Company information from row_to_context()
        categories : List[str]
            List of possible categories to classify into
            
        Returns:
        --------
        str : Formatted prompt for classification task
        """
        categories_str = "\n".join(f"- {cat}" for cat in categories)
        
        return f"""You are a business analyst classifying companies into categories.

{context}

Based on the company information above, classify this company into ONE of the following categories:
{categories_str}

Respond with ONLY the category name, followed by a brief one-sentence justification."""

    def summarization_prompt(self, context: str) -> str:
        """
        Create a summarization prompt for company information.
        
        Parameters:
        -----------
        context : str
            Company information from row_to_context()
            
        Returns:
        --------
        str : Formatted prompt for summarization task
        """
        return f"""You are a business analyst creating company summaries.

{context}

Please provide a concise 2-3 sentence summary of this company, highlighting:
1. What the company does (industry/business)
2. Key characteristics (size, location, notable facts)
3. Any distinguishing features

Keep the summary factual and professional."""

    def comparison_prompt(self, context1: str, context2: str) -> str:
        """
        Create a comparison prompt for two companies.
        
        Parameters:
        -----------
        context1 : str
            First company information from row_to_context()
        context2 : str
            Second company information from row_to_context()
            
        Returns:
        --------
        str : Formatted prompt for comparison task
        """
        return f"""You are a business analyst comparing companies.

=== COMPANY 1 ===
{context1}

=== COMPANY 2 ===
{context2}

Please compare these two companies across the following dimensions:
1. Industry & Business Focus
2. Size & Scale (employees, revenue if available)
3. Geographic Presence
4. Key Similarities
5. Key Differences

Provide a structured comparison with clear insights."""

    def extraction_prompt(self, context: str, fields: List[str]) -> str:
        """
        Create an extraction prompt to extract specific fields as JSON.
        
        Parameters:
        -----------
        context : str
            Company information from row_to_context()
        fields : List[str]
            List of field names to extract
            
        Returns:
        --------
        str : Formatted prompt for extraction task
        """
        fields_str = ", ".join(f'"{f}"' for f in fields)
        
        return f"""You are a data extraction specialist.

{context}

Extract the following fields from the company information above:
[{fields_str}]

Respond with a valid JSON object containing only these fields.
Use null for any field that cannot be determined from the information provided.

Example format:
{{
  "field1": "value1",
  "field2": "value2"
}}"""


# ============================================================================
# TEST: Demonstrate PromptBuilder with sample data
# ============================================================================

# Create PromptBuilder instance
prompt_builder = PromptBuilder()

# Test with sample company data
if 'infobox_data' in dir() and 'sp500' in infobox_data:
    sample_context = row_to_context(infobox_data['sp500'].iloc[0])
    
    print("=" * 70)
    print("EXAMPLE 1: Q&A PROMPT")
    print("=" * 70)
    qa = prompt_builder.qa_prompt(sample_context, "What industry is this company in?")
    print(qa)
    
    print("\n" + "=" * 70)
    print("EXAMPLE 2: CLASSIFICATION PROMPT")
    print("=" * 70)
    categories = ["Technology", "Healthcare", "Finance", "Industrial", "Consumer Goods"]
    classification = prompt_builder.classification_prompt(sample_context, categories)
    print(classification)
    
    print("\n" + "=" * 70)
    print("EXAMPLE 3: SUMMARIZATION PROMPT")
    print("=" * 70)
    summary = prompt_builder.summarization_prompt(sample_context)
    print(summary)
    
    print("\n" + "=" * 70)
    print("EXAMPLE 4: EXTRACTION PROMPT")
    print("=" * 70)
    fields = ["company_name", "industry", "employee_count", "annual_revenue"]
    extraction = prompt_builder.extraction_prompt(sample_context, fields)
    print(extraction)
    
    # Comparison requires two companies
    if len(infobox_data['sp500']) >= 2:
        print("\n" + "=" * 70)
        print("EXAMPLE 5: COMPARISON PROMPT")
        print("=" * 70)
        context2 = row_to_context(infobox_data['sp500'].iloc[1])
        comparison = prompt_builder.comparison_prompt(sample_context, context2)
        print(comparison)
else:
    print("Run Step 4 first to load infobox_data")

EXAMPLE 1: Q&A PROMPT
You are a helpful assistant answering questions about companies.

COMPANY INFORMATION:
----------------------------------------
Name: 3M Company
Type: Public
Industry: Conglomerate
Founded: 1902-6-13 in Two Harbors, Minnesota, U.S.
Founder: J. Danley BuddHenry S. BryanWilliam A. McGonagleJohn DwanHermon W. Cable Charles Simmonsref
Headquarters: Maplewood, Minnesota
Country: U.S.
Key People: Michael F. Roman (chairman) William M. Brown (CEO)ref
Employees: ~61,500 (2024)
Revenue: ↓ $24.58 billion (2024)
Website: 3m.com

Question: What industry is this company in?

Please provide a clear, concise answer based only on the information provided above. If the information is not available, say so.

EXAMPLE 2: CLASSIFICATION PROMPT
You are a business analyst classifying companies into categories.

COMPANY INFORMATION:
----------------------------------------
Name: 3M Company
Type: Public
Industry: Conglomerate
Founded: 1902-6-13 in Two Harbors, Minnesota, U.S.
Founder: J. 
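
The classification prompt above is zero-shot: it gives the model the categories but no worked examples. One way to add labeled examples (a few-shot prompt) is sketched below. `few_shot_classification_prompt` is a hypothetical function, not a method of the `PromptBuilder` defined above, and the labeled examples are placeholders to be replaced with companies you have classified yourself.

```python
from typing import List, Tuple


def few_shot_classification_prompt(
    context: str,
    categories: List[str],
    examples: List[Tuple[str, str]],
) -> str:
    """Build a classification prompt that shows labeled examples first.

    examples: (company_context, category) pairs; each category must be
    one of `categories`.
    """
    for _, label in examples:
        if label not in categories:
            raise ValueError(f"Example label {label!r} is not in categories")

    categories_str = "\n".join(f"- {cat}" for cat in categories)
    examples_str = "\n\n".join(
        f"Example {i}:\n{ex_context}\nCategory: {label}"
        for i, (ex_context, label) in enumerate(examples, start=1)
    )

    return (
        "You are a business analyst classifying companies into categories.\n\n"
        f"Categories:\n{categories_str}\n\n"
        f"{examples_str}\n\n"
        f"Now classify this company:\n{context}\n"
        "Category:"
    )


# Placeholder data for illustration only
demo = few_shot_classification_prompt(
    context="Name: Example Corp\nIndustry: Software",
    categories=["Technology", "Finance"],
    examples=[("Name: Sample Bank\nIndustry: Banking", "Finance")],
)
print(demo)
```

Ending the prompt with `Category:` nudges the model to answer in the same format as the examples.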

Create a `PromptBuilder` class with one method per prompt type:
- Q&A prompt
- Classification prompt
- Summarization prompt
- Comparison prompt
- Information extraction prompt
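
The information extraction prompt asks the model for a JSON object, but model replies are free text and may wrap the JSON in a Markdown code fence or omit a field. A minimal sketch of defensive parsing follows. `parse_extraction_response` is a hypothetical helper, and the sample reply is hand-written rather than real model output.

```python
import json
import re
from typing import Any, Dict, List


def parse_extraction_response(response: str, fields: List[str]) -> Dict[str, Any]:
    """Parse an LLM reply into a dict restricted to the requested fields.

    Missing fields become None; extra keys are dropped.
    Raises ValueError if no JSON object can be decoded.
    """
    # Take the span from the first '{' to the last '}' so that surrounding
    # prose or code fences are ignored
    match = re.search(r"\{.*\}", response, flags=re.DOTALL)
    if match is None:
        raise ValueError("No JSON object found in response")
    try:
        data = json.loads(match.group(0))
    except json.JSONDecodeError as exc:
        raise ValueError(f"Invalid JSON in response: {exc}") from exc
    return {field: data.get(field) for field in fields}


fence = "`" * 3  # code-fence marker, built programmatically for readability
reply = f'Here is the result:\n{fence}json\n{{"industry": "Conglomerate", "extra": 1}}\n{fence}'
result = parse_extraction_response(reply, ["industry", "employee_count"])
print(result)  # {'industry': 'Conglomerate', 'employee_count': None}
```

Restricting the result to the requested fields keeps downstream code independent of whatever extra keys a model decides to add.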

In [None]:
# ============================================================================
# STEP 7: SUMMARY & BEST PRACTICES GUIDE
# ============================================================================
# Comprehensive guide to using Wikipedia data with LLMs



summary = """
╔════════════════════════════════════════════════════════════════════════════╗
║                    LLM PROMPT ENGINEERING SUMMARY                         ║
╚════════════════════════════════════════════════════════════════════════════╝


🎯 BEST PRACTICES FOR LLM PROMPT ENGINEERING:
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

1. CONTEXT QUALITY
   • Keep context focused and relevant
   • Clean and normalize text thoroughly
   • Remove ambiguous or conflicting information
   • Include metadata (source, confidence, date)

2. PROMPT DESIGN
   • Use clear, specific instructions
   • Provide examples when possible (few-shot)
   • Specify output format explicitly (JSON, tables, etc.)
   • Include role/perspective for better results



📋 USE CASES FOR YOUR DATA:
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

✓ Company Classification    → Categorize by industry, size, sector
✓ Market Analysis           → Competitive landscape, positioning
✓ Risk Assessment           → Financial health, strategic risks
✓ Investment Analysis       → Potential returns, growth prospects
✓ Data Enrichment           → Fill gaps from Wikipedia data
✓ Text Generation           → Create summaries, reports, profiles
✓ Knowledge Extraction      → Key metrics, relationships, entities
✓ Sentiment Analysis        → Company reputation, public perception
✓ Trend Detection           → Emerging patterns, growth areas
✓ Comparative Analysis      → Company benchmarking, peer analysis
"""

print(summary)


End of lab 5

🚀 NEXT STEPS (in anticipation of the final lab)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

1. Export your full dataset using PromptExporter
2. Test prompts with a small sample (5-10 companies)
3. Evaluate LLM outputs for quality and accuracy
4. Iterate on prompts based on results
5. Scale up to full dataset using batch APIs
6. Monitor token usage and costs
7. Implement feedback loops for continuous improvement
8. Build evaluation metrics for output quality
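
Steps 1, 2 and 6 above can be approximated with a few lines of standard-library Python. `PromptExporter` is not defined in this section, so the sketch below uses a hypothetical stand-in function, `export_prompts_jsonl`, together with a crude estimate of about four characters per token. That ratio is only a rule of thumb for English text; real counts depend on the model's tokenizer.

```python
import json
import tempfile
from pathlib import Path
from typing import Iterable, Tuple


def estimate_tokens(text: str, chars_per_token: float = 4.0) -> int:
    """Very rough token estimate (assumption: ~4 characters per token)."""
    return max(1, round(len(text) / chars_per_token))


def export_prompts_jsonl(prompts: Iterable[Tuple[str, str]], path: Path) -> int:
    """Write (prompt_id, prompt) pairs as JSON Lines; return the estimated token total."""
    total = 0
    with path.open("w", encoding="utf-8") as f:
        for prompt_id, prompt in prompts:
            tokens = estimate_tokens(prompt)
            total += tokens
            record = {"id": prompt_id, "prompt": prompt, "approx_tokens": tokens}
            f.write(json.dumps(record, ensure_ascii=False) + "\n")
    return total


# Small sample, as suggested in step 2 (placeholder prompts)
sample = [
    ("demo-1", "Summarize this company."),
    ("demo-2", "What industry is this company in?"),
]
out_path = Path(tempfile.gettempdir()) / "lab5_prompts_sample.jsonl"
total_tokens = export_prompts_jsonl(sample, out_path)
print(f"Wrote {len(sample)} prompts, about {total_tokens} tokens")
```

One record per line keeps the file easy to stream, and many batch interfaces accept a similar line-oriented layout, though the exact record schema they expect will differ from this one.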