# 6. API Exploration: Public Health England's Fingertips

## From Theory to Practice

In the previous notebooks, we used synthetic data to explore how Probability Density Functions (PDFs) can model health prevalence. Now, we need **real data** to ground these models in reality. 

One of the richest sources of UK public health data is the **Fingertips API** from Public Health England (now the Office for Health Improvement and Disparities). This API provides access to hundreds of health indicators across dozens of thematic profiles.

## Objective

The goal of this notebook is to systematically explore the structure of the Fingertips API. We need to understand what data is available before we can extract it. We will:

1.  **List all available Profiles**: These are thematic collections of data (e.g., 'Cardiovascular Disease', 'Public Health Outcomes Framework').
2.  **Inspect the Indicators within a Profile**: See what specific metrics are available.
3.  **List all available Area Types**: Understand the different geographic levels we can get data for (e.g., ICBs, Local Authorities, GP Practices).
4.  **Fetch a sample of data**: Test our functions to pull data for a specific indicator and area type.

This process will create a 'map' of the API that will guide our data acquisition in the following notebooks.
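
The four steps above each correspond to one API endpoint. As a minimal sketch, the snippet below only builds the request URLs and makes no network calls. The endpoint paths are the ones used by the fallback helpers and error output elsewhere in this notebook; `build_url` is a hypothetical helper introduced here for illustration, and the IDs (profile 19, indicator 224, area type 153) are the examples used later in the notebook.

```python
from urllib.parse import urlencode

BASE_URL = "https://fingertips.phe.org.uk/api"


def build_url(endpoint, **params):
    """Join the base URL, an endpoint path, and optional query parameters."""
    url = BASE_URL + endpoint
    query = {key: value for key, value in params.items() if value is not None}
    return f"{url}?{urlencode(query)}" if query else url


# One URL per exploration step (illustrative IDs taken from this notebook)
exploration_urls = {
    "profiles": build_url("/profiles"),
    "indicators": build_url("/indicator_metadata/by_profile_id", profile_id=19),
    "area_types": build_url("/area_types"),
    "data": build_url(
        "/all_data/for_indicator_at_area_type",
        indicator_id=224,
        area_type_id=153,
    ),
}

for step, url in exploration_urls.items():
    print(f"{step:>10}: {url}")
```

Printing the URLs first makes it easy to paste one into a browser and inspect the raw JSON before writing any parsing code.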

In [6]:
!pip install requests_cache  

Collecting requests_cache
  Downloading requests_cache-1.2.1-py3-none-any.whl.metadata (9.9 kB)
Collecting cattrs>=22.2 (from requests_cache)
  Downloading cattrs-25.1.1-py3-none-any.whl.metadata (8.4 kB)
Collecting url-normalize>=1.4 (from requests_cache)
  Downloading url_normalize-2.2.1-py3-none-any.whl.metadata (5.6 kB)
Downloading requests_cache-1.2.1-py3-none-any.whl (61 kB)
Downloading cattrs-25.1.1-py3-none-any.whl (69 kB)
Downloading url_normalize-2.2.1-py3-none-any.whl (14 kB)
Installing collected packages: url-normalize, cattrs, requests_cache
Successfully installed cattrs-25.1.1 requests_cache-1.2.1 url-normalize-2.2.1


In [7]:
# Environment & imports ----------------------------------------------------
# Activate the correct Conda env first: conda activate health_modelling
from pathlib import Path
import json
import warnings
import pandas as pd
import requests
import requests_cache           # <— light-weight API call caching
import fingertips_py as ftp     # pip install fingertips_py

In [8]:
# Local helper (already in src/)
# This assumes the notebook is running from the 'notebooks' directory.
import sys  # needed for sys.path; without it the try block raises NameError

try:
    sys.path.append('../src')
    import phe_api_utils as phe_utils
except ImportError:
    print("Could not import phe_api_utils.py, defining fallbacks.")
    # Define functions here as a fallback if import fails
    import requests
    import time

    BASE_URL = "https://fingertips.phe.org.uk/api/"

    def fetch_json(endpoint: str, params: dict = None):
        """GET an API endpoint and return the parsed JSON, or None on failure."""
        url = BASE_URL.rstrip('/') + endpoint
        try:
            # A timeout stops a stalled connection from hanging the notebook
            resp = requests.get(url, params=params, timeout=30)
            resp.raise_for_status()
            time.sleep(0.2)  # brief pause between calls to go easy on the API
            return resp.json()
        except requests.exceptions.RequestException as e:
            print(f"Error fetching {url}: {e}")
            return None

    def get_profile_indicators(profile_id: int) -> pd.DataFrame:
        """Return indicator metadata for one profile, indexed by indicator ID."""
        data = fetch_json('/indicator_metadata/by_profile_id', params={'profile_id': profile_id})
        if not data:
            return pd.DataFrame()
        df = pd.DataFrame.from_dict(data, orient='index')
        df.index.name = 'IndicatorID'
        return df

    def get_area_types() -> pd.DataFrame:
        data = fetch_json('/area_types')
        return pd.DataFrame(data) if data else pd.DataFrame()

    def get_data_for_indicator(indicator_id: int, area_type_id: int, parent_code: str = None) -> pd.DataFrame:
        params = {'indicator_id': indicator_id, 'area_type_id': area_type_id}
        if parent_code:
            params['parent_area_code'] = parent_code
        data = fetch_json('/all_data/for_indicator_at_area_type', params=params)
        return pd.DataFrame(data) if data else pd.DataFrame()

    class PheApiUtils:
        get_profile_indicators = staticmethod(get_profile_indicators)
        get_area_types = staticmethod(get_area_types)
        get_data_for_indicator = staticmethod(get_data_for_indicator)
    
    phe_utils = PheApiUtils()

# Cache GET requests for 12 hours to avoid hammering the API
requests_cache.install_cache("fingertips_cache", expire_after=12*60*60)

pd.set_option("display.max_columns", 50)
warnings.filterwarnings("ignore")

In [9]:
# Ranked profiles you care about ------------------------------------------
RANKED_PROFILES = [
    # rank, profile_id, short_name
    (1,  19,  "PHOF"),
    (2,  20,  "GP_Profiles"),
    (3, 135,  "Cardio"),
    (4,  84,  "Dementia"),
    (5, 139,  "Diabetes"),
    (6,  92,  "Cancer"),
    (7,  29,  "Respiratory"),
]

rank_df = (
    pd.DataFrame(RANKED_PROFILES, columns=["Rank", "Id", "Key"])
      .assign(Profile=lambda d: d["Id"].map(
          lambda x: ftp.get_profile_by_id(x)["Name"]))
)
print("--- Ranked Profiles ---")
print(rank_df)

--- Ranked Profiles ---
   Rank   Id          Key                             Profile
0     1   19         PHOF    Public Health Outcomes Framework
1     2   20  GP_Profiles  National General Practice Profiles
2     3  135       Cardio              Cardiovascular Disease
3     4   84     Dementia                    Dementia Profile
4     5  139     Diabetes                            Diabetes
5     6   92       Cancer                     Cancer Services
6     7   29  Respiratory                 Respiratory disease


In [11]:
# Fast indicator-level summary for each profile ----------------------------
def profile_quick_meta(profile_id: int) -> pd.DataFrame:
    """Return indicator metadata with minimal cleaning. Handles missing columns."""
    meta = ftp.get_metadata_for_profile_as_dataframe(profile_id)
    if meta.empty:
        meta = phe_utils.get_profile_indicators(profile_id)  # fallback to custom helper
    if meta.empty:
        return pd.DataFrame()

    renamed_meta = meta.rename(columns=str.capitalize)

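    # Define desired columns and keep only those actually present
    desired_cols = ["Indicator id", "Indicator", "Age", "Sex", "Unit"]
    available_cols = [col for col in desired_cols if col in renamed_meta.columns]
    subset = renamed_meta.loc[:, available_cols]

    # Drop rows without an indicator ID, but only if that column exists;
    # otherwise dropna(subset=...) would raise a KeyError
    if "Indicator id" in subset.columns:
        subset = subset.dropna(subset=["Indicator id"])
    return subset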

profile_meta = {pid: profile_quick_meta(pid) for pid in rank_df["Id"]}
rank_df["Indicators"] = rank_df["Id"].map(lambda x: len(profile_meta.get(x, [])))
print("\n--- Profile Summary with Indicator Counts ---")
print(rank_df)


--- Profile Summary with Indicator Counts ---
   Rank   Id          Key                             Profile  Indicators
0     1   19         PHOF    Public Health Outcomes Framework         173
1     2   20  GP_Profiles  National General Practice Profiles         143
2     3  135       Cardio              Cardiovascular Disease          63
3     4   84     Dementia                    Dementia Profile          28
4     5  139     Diabetes                            Diabetes          61
5     6   92       Cancer                     Cancer Services          35
6     7   29  Respiratory                 Respiratory disease          40
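
The column handling in `profile_quick_meta` depends on how `rename(columns=str.capitalize)` rewrites each label: the first character is upper-cased and the rest lower-cased. The sketch below reproduces that step on plain strings. The raw column names are an assumption chosen for illustration, not a confirmed listing of what the library returns.

```python
# Assumed raw column names, for illustration only
raw_columns = ["Indicator ID", "Indicator", "Unit", "Definition"]

# str.capitalize upper-cases the first character and lower-cases the rest,
# which is what DataFrame.rename(columns=str.capitalize) applies per label
normalised = [name.capitalize() for name in raw_columns]

# Same selection logic as profile_quick_meta: keep what is actually present
desired = ["Indicator id", "Indicator", "Age", "Sex", "Unit"]
available = [name for name in desired if name in normalised]

print(normalised)
print(available)
```

This is why the desired column is spelled `"Indicator id"` rather than `"Indicator ID"`, and why profiles without `Age` or `Sex` columns still produce a usable table.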


In [24]:
# Explore and Export Indicator Metadata -------------------------------------
# Print the first 5 indicators of each profile and export the full list to CSV.
print("\n--- Exploring and Exporting Indicator Metadata for each Profile ---")
META_OUT = Path("../data/metadata") # Define output path for metadata
META_OUT.mkdir(parents=True, exist_ok=True)

for pid, meta_df in profile_meta.items():
    profile_name = rank_df.loc[rank_df['Id'] == pid, 'Profile'].iloc[0]
    profile_key = rank_df.loc[rank_df['Id'] == pid, 'Key'].iloc[0]
    print(f"\n\n--- {profile_key} ({profile_name}) ---")
    if not meta_df.empty:
        print(f"Found {len(meta_df)} indicators. First 5:")
        # In a real notebook, `display()` is better, but print() works in scripts.
        print(meta_df.head(5))

        # Export the full metadata to CSV
        meta_csv_path = META_OUT / f"meta_{pid}_{profile_key}.csv"
        meta_df.to_csv(meta_csv_path, index=False)
        print(f"  → Full list of {len(meta_df)} indicators exported to: {meta_csv_path.name}")
    else:
        print("No indicator metadata found.")


--- Exploring and Exporting Indicator Metadata for each Profile ---


--- PHOF (Public Health Outcomes Framework) ---
Found 173 indicators. First 5:
   Indicator id                                        Indicator   Unit
0         90362          A01a - Healthy life expectancy at birth  Years
1         90366                  A01b - Life expectancy at birth  Years
2         91102                     A01b - Life expectancy at 65  Years
3         92901    A02a - Inequality in life expectancy at birth  Years
4         93562  A01c - Disability free life expectancy at birth  Years
  → Full list of 173 indicators exported to: meta_19_PHOF.csv


--- GP_Profiles (National General Practice Profiles) ---
Found 143 indicators. First 5:
   Indicator id  \
0         93468   
1         93553   
2         93438   
3         94126   
4         94129   

                                                                   Indicator  \
0                       Proportion of GP registered populations by age 

In [18]:
# Decide geographies once, use everywhere ----------------------------------
# area_type_id reference table
area_types = phe_utils.get_area_types()

# Note: the MSOA and UTLA labels can be ambiguous across geography versions,
# so the specific IDs from the API documentation are used here for clarity.
# 101 = Counties & UAs (4/23), 102 = MSOAs (2011), 153 = ICBs (7/22), 152=Sub-ICBs (7/22)
TARGET_AREA_TYPE_IDS = {
    "MSOA": 102,
    "UTLA": 101, # Counties & UAs
    "ICB":  153,
    "GP":   7,
}
print("\n--- Target Geographies ---")
print(TARGET_AREA_TYPE_IDS)


--- Target Geographies ---
{'MSOA': 102, 'UTLA': 101, 'ICB': 153, 'GP': 7}


In [16]:
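# Quick peek across profiles ----------------------------------------------
# `preview_tables` is expected to be a list of DataFrames filled by the fetch
# step; fall back to an empty list if that step has not run in this session.
try:
    preview_tables
except NameError:
    preview_tables = []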
print("\n--- Combined Preview of Fetched Data ---")
if preview_tables:
    combined_preview = pd.concat(preview_tables, ignore_index=True)
    print(combined_preview)
else:
    print("No data was downloaded to preview.")



--- Combined Preview of Fetched Data ---
No data was downloaded to preview.


### Setup
We will use our utility functions from `src/phe_api_utils.py` to keep this notebook clean and focused on exploration.

In [3]:
import pandas as pd
import sys
sys.path.append('../src')
import phe_api_utils as phe

# Widen pandas display options
pd.set_option('display.max_rows', 200)
pd.set_option('display.max_colwidth', 100)

### Step 1: List All Available Profiles

A 'Profile' is a curated collection of indicators on a specific topic. This is our starting point. We need to identify the IDs of the profiles relevant to our project.

In [4]:
profiles_df = phe.get_profiles()

# Filter for the profiles identified as high-priority
priority_profile_ids = [19, 20, 135, 84, 139, 92, 29, 105, 130, 58, 41]

print("All Available Profiles:")
display(profiles_df[['Id', 'Name']])

print("\n\nPriority Profiles for our Project:")
display(profiles_df[profiles_df['Id'].isin(priority_profile_ids)][['Id', 'Name']])

Error fetching https://fingertips.phe.org.uk/api/profiles: ('Connection aborted.', RemoteDisconnected('Remote end closed connection without response'))
All Available Profiles:


KeyError: "None of [Index(['Id', 'Name'], dtype='object')] are in the [columns]"
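
The failure above has two parts: the connection was dropped, so the helper returned nothing, and the column selection then failed on the empty result. One possible way to handle both is sketched below, assuming a fetch function that returns `None` on error (as the fallback `fetch_json` in this notebook does). `fetch_with_retry` and `require_columns` are hypothetical helpers, and the responses are simulated so the example runs without network access.

```python
import time


def fetch_with_retry(fetch, retries=3, base_delay=0.5, sleep=time.sleep):
    """Call fetch() until it returns something other than None, or give up."""
    for attempt in range(retries):
        result = fetch()
        if result is not None:
            return result
        if attempt < retries - 1:
            # Exponential backoff: base_delay, then 2x, then 4x, ...
            sleep(base_delay * 2 ** attempt)
    return None


def require_columns(records, columns):
    """Return True only if there are rows and every row has all the columns."""
    return bool(records) and all(col in row for row in records for col in columns)


# Simulated API: the first call fails, the second succeeds
responses = iter([None, [{"Id": 19, "Name": "Public Health Outcomes Framework"}]])
profiles = fetch_with_retry(lambda: next(responses), sleep=lambda seconds: None)

if require_columns(profiles, ["Id", "Name"]):
    print(f"Fetched {len(profiles)} profile(s).")
else:
    print("No usable profile data; skipping the display step.")
```

Checking for the expected columns before selecting them turns a confusing `KeyError` into a clear message about the actual cause.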

### Step 2: Explore Indicators within a Key Profile (Public Health Outcomes Framework)

Let's dive into the most comprehensive profile, the **Public Health Outcomes Framework (PHOF)**, which has `Profile ID = 19`. We can list all the indicators it contains to see what metrics are available.

In [None]:
phof_profile_id = 19
phof_indicators = phe.get_profile_indicators(phof_profile_id)

print(f"Found {len(phof_indicators)} indicators in the PHOF Profile (ID: {phof_profile_id}).")
print("Sample of available indicators:")

# Display a sample of columns and rows
display(phof_indicators[['Name', 'Definition', 'Unit', 'YearType']].head(10))

The PHOF profile alone lists 173 indicators in the summary above, which gives us a wide range of potential metrics to model, from mortality rates to the prevalence of specific conditions.

### Step 3: List All Available Area Types

Next, we need to know the different geographical levels at which we can analyse the data. This is crucial for matching health data to administrative boundaries like Integrated Care Boards (ICBs) or local authorities.

In [None]:
area_types_df = phe.get_area_types()

print(f"Found {len(area_types_df)} different area types.")
print("Key Area Types for NHS Planning:")

# Filter for some common and useful area types
useful_area_names = [
    'Counties & UAs (from 4/23)',
    'Sub-ICB Locations (from 7/22)',
    'ICBs (from 7/22)',
    'Middle Super Output Area',
    'General Practice'
]

display(area_types_df[area_types_df['Name'].isin(useful_area_names)])

This table gives us the `Id` for each geography. For example, to get data at the ICB level, we would use `Area Type ID = 153`.
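
Rather than hard-coding an ID, the lookup can be done by name. The sketch below uses plain dictionaries in place of the DataFrame so it runs without the API. The rows pair the area type names and IDs mentioned in this notebook and are illustrative, not a live API response; `area_type_id` is a hypothetical helper.

```python
# Illustrative rows built from the names and IDs quoted in this notebook
AREA_TYPES = [
    {"Id": 7, "Name": "General Practice"},
    {"Id": 101, "Name": "Counties & UAs (from 4/23)"},
    {"Id": 102, "Name": "Middle Super Output Area"},
    {"Id": 152, "Name": "Sub-ICB Locations (from 7/22)"},
    {"Id": 153, "Name": "ICBs (from 7/22)"},
]


def area_type_id(name, area_types=AREA_TYPES):
    """Return the Id of the single area type with this exact name."""
    matches = [row["Id"] for row in area_types if row["Name"] == name]
    if len(matches) != 1:
        raise KeyError(
            f"Expected exactly one area type named {name!r}, found {len(matches)}"
        )
    return matches[0]


print(area_type_id("ICBs (from 7/22)"))
```

Failing loudly on zero or multiple matches guards against silently picking the wrong geography when names change between API releases.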

### Step 4: Test Run - Fetching Sample Data

Let's put it all together. We will fetch data for a specific indicator and area type.

- **Indicator**: 'Prevalence of diagnosed hypertension' (Indicator ID 224, from PHOF)
- **Area Type**: ICBs (Area Type ID 153)

In [None]:
indicator_id = 224 # Diagnosed hypertension
area_type_id = 153 # ICBs (from 7/22)

hypertension_data_icb = phe.get_data_for_indicator(indicator_id, area_type_id)

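# Only report success when the API actually returned rows
if hypertension_data_icb.empty:
    print("No records returned; check the indicator ID, area type ID, and API availability.")
else:
    print(f"Successfully fetched {len(hypertension_data_icb)} records.")
    print("Sample of Hypertension Prevalence Data for all ICBs:")
    display(hypertension_data_icb.head())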

## Conclusion and Next Steps

We have successfully mapped the key components of the Fingertips API. We know how to:
1.  Find the `Id` of a **Profile** we care about.
2.  Find the `Id` of an **Indicator** within that profile.
3.  Find the `Id` of an **Area Type** for our desired geography.
4.  Use these IDs to fetch the raw data.

This notebook provides the foundation for our data acquisition phase. In the following notebooks, we will systematically work through our priority profiles, extract the relevant data for the South West region, and prepare it for analysis and for testing against our PDF models.
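
As a first step towards the South West extraction, the sketch below builds the query parameters in the same way as the fallback `get_data_for_indicator` helper, adding an optional parent area. `E12000009` is the ONS code for the South West region. Whether the endpoint accepts a region code as the parent for a given area type is an assumption that should be checked against the API before relying on it; `build_data_params` is a hypothetical helper.

```python
SOUTH_WEST_REGION_CODE = "E12000009"  # ONS code for the South West region


def build_data_params(indicator_id, area_type_id, parent_code=None):
    """Build query parameters, adding the parent area only when one is given."""
    params = {"indicator_id": indicator_id, "area_type_id": area_type_id}
    if parent_code:
        params["parent_area_code"] = parent_code
    return params


# All ICBs versus ICBs restricted to one parent area (assumed to be supported)
all_icbs = build_data_params(224, 153)
south_west_icbs = build_data_params(224, 153, parent_code=SOUTH_WEST_REGION_CODE)

print(all_icbs)
print(south_west_icbs)
```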