# 04d_modality_usage_ct.ipynb  
**Diagnostic Modality Analysis – CT Usage by Demographics and Procedure Type**

---

### **Aim**
This notebook analyses **Computed Tomography (CT)** usage across key demographics (`age_band`, `gender`) and procedure classifications (e.g. **head**, **chest**, **abdomen**, **angiography**).

Following the same format as `04b` (MRI) and `04c` (Endoscopy), this notebook provides:
- Frequency counts of common CT procedures  
- Crosstab breakdowns by `age_band × anatomical_group`  
- Density and bar plots of age-modality distributions  
- Referral type and patient source matrices  

---

### **Purpose**
To understand demographic drivers of CT demand, supporting elective diagnostic service modelling by identifying high-volume anatomical groups and pathway types.

---

### **Output**
- Ranked CT procedure categories  
- Age-based usage visualisations  
- Referral-source interaction tables  
- Inputs for future LSOA-level geospatial demand modelling  

---

### **Notes**
- Cancer-specific indicators are excluded (non-oncology focus)  
- Uses anatomy classification mapping from `procedure_name → anatomical_group`  
- Consistent with the 04x-series design pattern, feeding into 04e (MRI deep-dive)  
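
The `procedure_name → anatomical_group` mapping itself is not shown in this section. A minimal sketch of how such a mapping could be built from ordered regular-expression rules follows; the rule list, the group labels and the `anatomical_group` helper are illustrative assumptions, not the notebook's actual classification.

```python
import re
import pandas as pd

# Hypothetical ordered rules: the first matching pattern wins,
# so more specific patterns should come before broader ones.
ANATOMY_RULES = [
    (r"angiogra", "angiography"),
    (r"head|brain|skull", "head"),
    (r"thorax|chest|lung", "chest"),
    (r"abdomen|pelvis|liver|kidney", "abdomen"),
]

def anatomical_group(name) -> str:
    """Return an anatomical group label for a procedure name (illustrative)."""
    if not isinstance(name, str):
        return "unknown"          # missing procedure names
    text = name.lower()
    for pattern, group in ANATOMY_RULES:
        if re.search(pattern, text):
            return group
    return "other"

sample = pd.Series([
    "Computed tomography of entire head (procedure)",
    "Computed tomography of thorax, abdomen and pelvis with contrast (procedure)",
    "Computed tomography of abdomen and pelvis with contrast (procedure)",
    None,
])
groups = sample.map(anatomical_group)
print(groups.tolist())
```

Multi-region scans such as thorax, abdomen and pelvis land in whichever rule matches first (here `chest`), so a real mapping would probably need an explicit multi-region group.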

---


Imports & Raw-Folder Ingestion

In [1]:
import pandas as pd
import numpy as np
import os
import glob
import re
from IPython.display import display

pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', 200)

# ----------- 1A. Point to your raw folder OR hard-code the list ----------
raw_dir   = "/Users/rosstaylor/Downloads/Research Project/Code Folder/nhs-diagnostics-dids-eda/nhs-dids-explorer/data/raw/2024 NHS SW ICBs"
csv_files = glob.glob(os.path.join(raw_dir, "*.csv"))

if not csv_files:
    raise FileNotFoundError("csv_files list is empty – check raw_dir or paths")

print(f"Detected {len(csv_files)} files")

# ----------- canonical 23-column schema from the SQL query ---------------
expected_cols = [
    'icb_code','icb_name','lsoa_code','nhs_region',
    'site_code','site_name','provider_code','provider_name',
    'activity_month','financial_year','financial_month','test_date',
    'age','sex','modality','sub_modality','procedure_name',
    'referral_type','patient_source','cancer_flag','subcancer_flag',
    'referring_org_code','referring_org_name'
]

dfs, meta = [], []
for fp in csv_files:
    peek = pd.read_csv(fp, nrows=5)
    if not set(expected_cols).issubset(peek.columns):
        print(f" {os.path.basename(fp)} – no header found, re-loading with names")
        df_tmp = pd.read_csv(fp, header=None, names=expected_cols, low_memory=False)
    else:
        df_tmp = pd.read_csv(fp, low_memory=False)

    df_tmp = df_tmp.dropna(axis=1, how='all')
    df_tmp.columns = df_tmp.columns.str.strip().str.lower()
    df_tmp = df_tmp[[c for c in expected_cols if c in df_tmp.columns]]
    for col in (set(expected_cols) - set(df_tmp.columns)):
        df_tmp[col] = pd.NA
    df_tmp = df_tmp[expected_cols]
    dfs.append(df_tmp)

    meta.append({
        "file": os.path.basename(fp),
        "rows": len(df_tmp),
        "cols": df_tmp.shape[1],
        "MB": round(df_tmp.memory_usage(deep=True).sum()/1e6, 2)
    })

meta_df = pd.DataFrame(meta)
display(meta_df.style.set_caption("Loaded files – rows / cols / size"))

df = pd.concat(dfs, ignore_index=True)
print(f"Combined shape: {df.shape}")  # expect ~4 M × 23

# Basic type coercion
df["age"] = pd.to_numeric(df["age"], errors="coerce")
df["test_date"] = pd.to_datetime(df["test_date"], errors="coerce")
df["activity_month"] = pd.to_datetime(
    df["activity_month"].astype(str), format="%Y%m", errors="coerce"
)


Detected 7 files
 2024_NHS_SW_Somerset_ICB_11X.csv – no header found, re-loading with names
 2024_NHS_SW_Cornwall_ICB_11N.csv – no header found, re-loading with names
 2024_NHS_SW_Gloucestershire_ICB_11M.csv – no header found, re-loading with names
 2024_NHS_SW_Dorset_ICB_11J.csv – no header found, re-loading with names
 2024_NHS_SW_Devon_ICB_15N.csv – no header found, re-loading with names
 2024_NHS_SW_BSW_ICB_92G.csv – no header found, re-loading with names
 2024_NHS_SW_BNSSG_ICB_15C.csv – no header found, re-loading with names


| | file | rows | cols | MB |
|---|---|---|---|---|
| 0 | 2024_NHS_SW_Somerset_ICB_11X.csv | 481843 | 23 | 695.27 |
| 1 | 2024_NHS_SW_Cornwall_ICB_11N.csv | 512857 | 23 | 748.35 |
| 2 | 2024_NHS_SW_Gloucestershire_ICB_11M.csv | 229186 | 23 | 335.49 |
| 3 | 2024_NHS_SW_Dorset_ICB_11J.csv | 525091 | 23 | 762.79 |
| 4 | 2024_NHS_SW_Devon_ICB_15N.csv | 676563 | 23 | 991.61 |
| 5 | 2024_NHS_SW_BSW_ICB_92G.csv | 741719 | 23 | 1106.54 |
| 6 | 2024_NHS_SW_BNSSG_ICB_15C.csv | 821993 | 23 | 1240.98 |


Combined shape: (3989252, 23)
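
The loader compares the raw header text with `expected_cols`, so a header that differs only in case or padding would be treated as missing and its header row read as data. A sketch of a more tolerant check, under the assumption that such variants can occur; the `has_header` helper and the shortened column list are hypothetical and used only for illustration.

```python
import io
import pandas as pd

def has_header(source, expected, nrows=5) -> bool:
    """True if the first row of `source` contains every expected column name,
    ignoring case and surrounding whitespace."""
    peek = pd.read_csv(source, nrows=nrows)
    found = {str(c).strip().lower() for c in peek.columns}
    return set(expected).issubset(found)

expected = ["icb_code", "age", "sex"]   # shortened schema for the demo

# In-memory stand-ins for two CSV files
file_with_header = io.StringIO("ICB_Code, Age ,sex\n11X,54,F\n")
file_without_header = io.StringIO("11X,54,F\n11N,61,M\n")

header_ok = has_header(file_with_header, expected)
header_missing = not has_header(file_without_header, expected)
print(header_ok, header_missing)
```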


Clean & Bucket Patient Source

In [None]:
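df["patient_source"] = (
    df["patient_source"]
      .astype(str).str.strip().str.lower()
      # Drop the "(this health care provider)" qualifier and the space before it
      .str.replace(r"\s*\(this health care provider\)", "", regex=True)
      .str.replace(r"\s+-\s+", " – ", regex=True)
      .str.strip()
      # astype(str) renders missing values as "nan" (np.nan) or "<na>" (pd.NA, lower-cased)
      .replace({"nan": np.nan, "<na>": np.nan, "": np.nan})
      .fillna("unknown")
)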

bucket_map = {
    r"accident|emergency|aed|a&e": "Emergency",
    r"gp direct|gp ":             "GP",
    r"inpatient":                 "Inpatient",
    r"outpatient":                "Outpatient",
    r"elective|planned":          "Elective",
}
def ps_bucket(txt: str) -> str:
    for pat, lab in bucket_map.items():
        if re.search(pat, txt):
            return lab
    return "Other/Unknown"

df["ps_bucket"] = df["patient_source"].apply(ps_bucket)
df["ps_bucket"].value_counts(dropna=False)
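
On roughly four million rows, a row-wise `apply` with `re.search` can be slow. A possible vectorised alternative, sketched on a handful of made-up values: `numpy.select` returns the choice for the first condition that is true, which keeps the first-match-wins order of `bucket_map`. The sample strings are assumptions, not values taken from the data.

```python
import numpy as np
import pandas as pd

# Same patterns as the cell above, in the same order
bucket_map = {
    r"accident|emergency|aed|a&e": "Emergency",
    r"gp direct|gp ":             "GP",
    r"inpatient":                 "Inpatient",
    r"outpatient":                "Outpatient",
    r"elective|planned":          "Elective",
}

# Hypothetical cleaned patient_source values
sample = pd.Series([
    "accident and emergency department",
    "gp direct access",
    "inpatient",
    "outpatient",
    "unknown",
])

# One boolean mask per pattern; np.select picks the first mask that is True
conditions = [sample.str.contains(pat, regex=True) for pat in bucket_map]
buckets = pd.Series(
    np.select(conditions, list(bucket_map.values()), default="Other/Unknown"),
    index=sample.index,
)
print(buckets.tolist())
```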


ONS 5-Year Age Bands

In [None]:
# Define ONS-style bands
bands = pd.DataFrame({
    "lower":[0,5,10,15,20,25,30,35,40,45,50,55,60,65,70,75,80,85],
    "upper":[4,9,14,19,24,29,34,39,44,49,54,59,64,69,74,79,84,np.inf],
    "label":["0-4","5-9","10-14","15-19","20-24","25-29","30-34","35-39",
             "40-44","45-49","50-54","55-59","60-64","65-69","70-74",
             "75-79","80-84","85+"]
})
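bins   = bands["lower"].tolist() + [np.inf]
labels = bands["label"].tolist()

# Keep only whole-number ages; .copy() avoids SettingWithCopyWarning on the assignments below
df = df[df["age"].notna() & (df["age"] % 1 == 0)].copy()
df["age"] = df["age"].astype(int)
# right=False gives left-closed bins [0, 5), [5, 10), ... so age 0 falls in "0-4" and age 5 in "5-9"
df["age_band"] = pd.cut(df["age"], bins=bins, labels=labels, right=False)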
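
Band edges are easy to get wrong with `pd.cut`, because its default `right=True` makes each bin closed on the right. A quick check on a few made-up ages shows how the two settings treat the boundaries 0, 5 and 85; the variable names here are illustrative only.

```python
import numpy as np
import pandas as pd

lower = list(range(0, 90, 5))                       # 0, 5, ..., 85
labels = [f"{lo}-{lo + 4}" for lo in lower[:-1]] + ["85+"]
bins = lower + [np.inf]                             # 19 edges for 18 bands

ages = pd.Series([0, 4, 5, 84, 85, 102])

# Left-closed bins: [0, 5), [5, 10), ..., [85, inf)
left_closed = pd.cut(ages, bins=bins, labels=labels, right=False)
# Right-closed bins: (0, 5], (5, 10], ..., (85, inf]
right_closed = pd.cut(ages, bins=bins, labels=labels, right=True)

print(pd.DataFrame({"age": ages,
                    "right=False": left_closed,
                    "right=True": right_closed}))
```

With `right=True`, age 0 falls outside every bin and becomes missing, while ages 5 and 85 are assigned to the band below the one their label suggests.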


Columns to Summarise

In [14]:
for col in ['modality', 'sub_modality', 'procedure_name']:
    print(f"\n Top 100 values for: {col}")

    vc = df[col].value_counts(dropna=False)
    total = vc.sum()

    top100 = vc.head(100).reset_index()
    top100.columns = [col, "Count"]
    top100["% of Total"] = (top100["Count"] / total * 100).round(2)

    display(
        top100.style
            .set_caption(f"Top 100: {col} (Count and %)")
            .format({"Count": "{:,}", "% of Total": "{:.2f}%"})
            .background_gradient(cmap="Blues", subset=["Count"])
    )

    print(f"Unique values in {col!r}: {df[col].nunique(dropna=False)}")


 Top 100 values for: modality


| | modality | Count | % of Total |
|---|---|---|---|
| 0 | Plain radiography (procedure) | 1561194 | 39.14% |
| 1 | Diagnostic ultrasonography (procedure) | 782294 | 19.61% |
| 2 | Computerized axial tomography (procedure) | 647463 | 16.23% |
| 3 | NaN | 510252 | 12.79% |
| 4 | Magnetic resonance imaging (procedure) | 324885 | 8.14% |
| 5 | Fluoroscopy (procedure) | 92541 | 2.32% |
| 6 | Nuclear medicine procedure (procedure) | 29358 | 0.74% |
| 7 | Positron emission tomography (procedure) | 24039 | 0.60% |
| 8 | Endoscopy (procedure) | 7174 | 0.18% |
| 9 | Single photon emission computerized tomography (procedure) | 3563 | 0.09% |


Unique values in 'modality': 12

 Top 100 values for: sub_modality


| | sub_modality | Count | % of Total |
|---|---|---|---|
| 0 | NaN | 3897393 | 97.70% |
| 1 | X-ray photon absorptiometry (procedure) | 42811 | 1.07% |
| 2 | Diagnostic Doppler ultrasonography (procedure) | 26540 | 0.67% |
| 3 | Positron emission tomography with computed tomography (procedure) | 20614 | 0.52% |
| 4 | Single photon emission computed tomography with computed tomography (procedure) | 1830 | 0.05% |


Unique values in 'sub_modality': 5

 Top 100 values for: procedure_name


| | procedure_name | Count | % of Total |
|---|---|---|---|
| 0 | Plain chest X-ray (procedure) | 386207 | 9.68% |
| 1 | Plain chest X-ray (procedure) (399208008) | 298780 | 7.49% |
| 2 | Computed tomography of entire head (procedure) | 92626 | 2.32% |
| 3 | Computed tomography of entire head (procedure) (408754009) | 69122 | 1.73% |
| 4 | Ultrasonography of abdomen (procedure) | 50641 | 1.27% |
| 5 | Ultrasound scan for fetal growth (procedure) | 45361 | 1.14% |
| 6 | Computed tomography of thorax, abdomen and pelvis with contrast (procedure) | 45150 | 1.13% |
| 7 | Transthoracic echocardiography (procedure) (433236007) | 41864 | 1.05% |
| 8 | Computed tomography of thorax, abdomen and pelvis with contrast (procedure) (433761009) | 35316 | 0.89% |
| 9 | Computed tomography of abdomen and pelvis with contrast (procedure) | 34668 | 0.87% |


Unique values in 'procedure_name': 3759
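
Several procedures in the `procedure_name` output appear twice, once with and once without a trailing numeric code in parentheses (for example `(399208008)`, which looks like a SNOMED CT concept ID), so their counts are split across two rows. A minimal sketch of one way to merge them, assuming the bracketed digits can safely be dropped for grouping; the `procedure_clean` column name and the demo rows are hypothetical.

```python
import pandas as pd

demo = pd.DataFrame({"procedure_name": [
    "Plain chest X-ray (procedure)",
    "Plain chest X-ray (procedure) (399208008)",
    "Computed tomography of entire head (procedure) (408754009)",
    None,
]})

# Remove a trailing "(digits)" group and any surrounding whitespace;
# missing names stay missing.
demo["procedure_clean"] = (
    demo["procedure_name"]
      .str.replace(r"\s*\(\d+\)\s*$", "", regex=True)
      .str.strip()
)

counts = demo["procedure_clean"].value_counts(dropna=False)
print(counts)
```

Applied to the full frame, this would likely reduce the 3,759 unique procedure names and give a single ranked count per procedure, though the exact effect depends on how consistently the suffix is formatted.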
