# Web Scraping Italian Transplant Data
This notebook uses Selenium to scrape transplant statistics by year and organ from the official Italian *Centro Nazionale Trapianti* website:  
🔗 https://trapianti.sanita.it/statistiche/trapianti_per_anno.aspx  

We will:
- Load the main table of yearly transplant statistics
- Navigate to data for a specific year and organ
- Extract the transplant subtype headers
- Get the per-center data for that organ
- Return to the main page for further navigation


## 📘 1. Setup & Imports
The imports below cover:

✅ Selenium core functionality (`webdriver`, `By`, `EC`, exceptions)

✅ Waiting strategies (`WebDriverWait`)

✅ Basic error handling

✅ Data processing (`pandas`)

✅ Sleep/delay control (`time`)

In [1]:
# Install required packages (uncomment if needed)
# !pip install selenium pandas

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import TimeoutException, StaleElementReferenceException
import pandas as pd
import time

## 🧭 2. Initialize Selenium and Navigate to the Main Table

The main page shows one button per **year**; each button leads to the organ-level transplant data for that year.


In [2]:
# Start the browser
driver = webdriver.Chrome()  # Assumes chromedriver is in PATH

# Define URL
main_table_url = "https://trapianti.sanita.it/statistiche/trapianti_per_anno.aspx"
driver.get(main_table_url)

# Define reusable XPath parts for clicking buttons by visible text
path_sx = "//button[normalize-space()='"
path_dx = "']"
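Concatenating `path_sx + label + path_dx` works as long as the label contains no apostrophe, because XPath 1.0 string literals have no escape character. The sketch below shows one way to build the same expression safely. The helper names `xpath_literal` and `button_xpath` are hypothetical and are not part of the notebook or of Selenium.

```python
def xpath_literal(text: str) -> str:
    """Quote `text` as an XPath 1.0 string literal (hypothetical helper)."""
    if "'" not in text:
        return f"'{text}'"
    if '"' not in text:
        return f'"{text}"'
    # Both quote types present: join the apostrophe-free pieces with concat().
    pieces = [f"'{part}'" for part in text.split("'")]
    return "concat(" + ", \"'\", ".join(pieces) + ")"


def button_xpath(label: str) -> str:
    """Build the same XPath as path_sx + label + path_dx, with safe quoting."""
    return f"//button[normalize-space()={xpath_literal(label)}]"


print(button_xpath("Cuore"))
print(button_xpath("Ca' Granda"))
```

For plain labels such as `'Cuore'` or `'2024'` the result is identical to the concatenated form used in this notebook.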

## 🧪 3. Define the Scraping Function

Define `scrape_year(year, organ)`

This function:
1. Clicks the year (e.g., '2024')
2. Clicks the organ (e.g., 'Cuore')
3. Clicks 'Regione' to show transplant numbers by center
4. Extracts the header labels (transplant subtypes)
5. Extracts the table rows
6. Navigates back to the main table and returns the headers and rows


In [3]:
def scrape_year(year: str, organ: str):
    wait = WebDriverWait(driver, 10)

    try:
        # Step 1: Click on the year button
        wait.until(EC.element_to_be_clickable((By.XPATH, path_sx + year + path_dx))).click()

        # Step 2: Click on the organ button
        wait.until(EC.element_to_be_clickable((By.XPATH, path_sx + organ + path_dx))).click()

        # Step 3: Click on "Regione" to get full transplant center data
        wait.until(EC.element_to_be_clickable((By.XPATH, path_sx + 'Regione' + path_dx))).click()

        # Step 4: Wait for the table rows to load
        wait.until(EC.presence_of_all_elements_located((By.XPATH, "//tr")))

        # Step 5: Safely extract headers (fresh call after wait to avoid stale refs)
        for _ in range(3):  # Retry block to prevent StaleElementReferenceException
            try:
                headers = driver.find_elements(By.XPATH, "//th")
                header_labels = [h.text.strip() for h in headers if h.text.strip()]
                break
            except StaleElementReferenceException:
                time.sleep(1)
        else:
            print(f"⚠️ Failed to load headers for {year}-{organ}")
            return [], []

        # Step 6: Safely extract data rows (fresh call)
        for _ in range(3):
            try:
                rows_elements = driver.find_elements(By.XPATH, "//tr")
                data_rows = []
                for row in rows_elements[1:]:  # skip table title/header
                    cells = row.find_elements(By.XPATH, ".//td")
                    if not cells:
                        continue
                    row_data = [cell.text.strip() for cell in cells]
                    data_rows.append(row_data)
                break
            except StaleElementReferenceException:
                time.sleep(1)
        else:
            print(f"⚠️ Failed to load rows for {year}-{organ}")
            return [], []

        # Step 7: Navigate back to the main table
        wait.until(EC.element_to_be_clickable((By.XPATH, '//*[@id="primo_link"]/a[2]'))).click()

        return header_labels, data_rows

    except TimeoutException as e:
        print(f"[Timeout] Timeout error for {year}-{organ}: {e}")
        return [], []

    except Exception as e:
        print(f"[Error] Unexpected error for {year}-{organ}: {e}")
        return [], []
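The two `for ... else` loops in `scrape_year` follow the same retry pattern: try up to three times, sleep between attempts, give up afterwards. A minimal sketch of that pattern as a reusable function is shown below. The name `retry` and its parameters are hypothetical, and a built-in exception stands in for `StaleElementReferenceException` so the example runs without a browser.

```python
import time
from typing import Callable, Tuple, Type, TypeVar

T = TypeVar("T")


def retry(action: Callable[[], T],
          exceptions: Tuple[Type[BaseException], ...],
          attempts: int = 3,
          delay: float = 1.0) -> T:
    """Call `action` until it succeeds or `attempts` is exhausted."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(attempts):
        try:
            return action()
        except exceptions:
            if attempt == attempts - 1:
                raise  # last attempt failed: let the caller handle it
            time.sleep(delay)
    raise AssertionError("unreachable")


# Stand-in for a flaky page read: fails twice, then succeeds.
calls = {"count": 0}


def flaky_read():
    calls["count"] += 1
    if calls["count"] < 3:
        raise LookupError("element went stale")  # stand-in exception
    return ["Struttura trapianto", "Cuore"]


result = retry(flaky_read, (LookupError,), attempts=3, delay=0)
print(result, calls["count"])
```

With Selenium, the tuple passed as `exceptions` would be `(StaleElementReferenceException,)` and `action` would wrap the `find_elements` calls.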

## ✅ 4. Test the Function
Run a test for year `2024` and organ `'Cuore'`.

This will return:
- The header labels (the table title followed by column names such as 'Struttura trapianto')
- The transplant counts per center, one list per table row


In [4]:
headers, rows = scrape_year("2024", "Cuore")

print("🔹 Headers:")
print(headers)

print("\n📋 Table rows:")
for row in rows[:5]:  # Show only first 5 rows
    print(row)

🔹 Headers:
["TRAPIANTI DI CUORE EFFETTUATI IN ITALIA NELL'ANNO 2024", 'Struttura trapianto', 'Cuore', 'Cuore - fegato', 'Totale Trapianti']

📋 Table rows:
['TO - AOU Città della Salute, PO OIRM', '5', '0', '5']
['TO - AOU Città della Salute, PO S.G.Battista', '26', '2', '28']
['BG - OSPEDALE PAPA GIOVANNI XXIII - BERGAMO', '22', '0', '22']
["MI - AO NIGUARDA CA' GRANDA - MILANO", '35', '0', '35']
['PV - OSPEDALE POLICLINICO S. MATTEO - PAVIA', '15', '0', '15']
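In the output above, the first header is the table title, and every data row has one value fewer than the header list. The sketch below pairs the remaining headers with each row and converts the counts to integers. It assumes the title always comes first, as in this 2024 'Cuore' result; `rows_to_records` is a hypothetical helper, not part of the notebook.

```python
def rows_to_records(headers, rows):
    """Pair column names with row values, assuming headers[0] is the table title."""
    column_names = headers[1:]
    records = []
    for row in rows:
        record = {}
        # zip stops at the shorter sequence, so extra cells are ignored.
        for name, value in zip(column_names, row):
            record[name] = int(value) if value.isdigit() else value
        records.append(record)
    return records


# Sample values copied from the test output above.
headers = [
    "TRAPIANTI DI CUORE EFFETTUATI IN ITALIA NELL'ANNO 2024",
    "Struttura trapianto", "Cuore", "Cuore - fegato", "Totale Trapianti",
]
rows = [
    ["TO - AOU Città della Salute, PO OIRM", "5", "0", "5"],
    ["TO - AOU Città della Salute, PO S.G.Battista", "26", "2", "28"],
]

records = rows_to_records(headers, rows)
print(records[0])
```

A list of dictionaries like this can be passed directly to `pd.DataFrame(records)`.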


## 🧹 5. Cleanup
Quit the driver when you are done to close the browser and release system resources.


In [5]:
driver.quit()

## 💾 Save Scraped Transplant Data (One Year, Multiple Organs)

The function `save_each_organ_table_for_year`, defined below, loops over a list of organs for a single year and:

1. Navigates to the transplant statistics site and scrapes the center-level data for each selected organ and year.
2. Extracts the table headers (excluding the title row) and aligns the data rows accordingly.
3. Saves each organ's dataset to a separate CSV file named `{year}_{organ}.csv` inside the specified output folder.
   
Each saved table keeps the wide format exactly as displayed on the website, so it can be cleaned, analyzed, or transformed later in tools such as Excel, Power BI, or Looker Studio.

> 📁 Each organ/year combination is saved as an individual CSV file, preserving its unique structure and subtype columns.


📌 Add This Utility Function to Reset to the Main Page

In [6]:
def reset_main_page(driver):
    """Load or reload the main transplant statistics page."""
    driver.get("https://trapianti.sanita.it/statistiche/trapianti_per_anno.aspx")

In [7]:
def save_each_organ_table_for_year(year: str, organs: list, output_folder: str):
    """
    Scrapes and saves one CSV per organ for a given year in wide format.
    """
    for organ in organs:
        print(f"🔄 Scraping {organ} data for {year}...")

        reset_main_page(driver)
        time.sleep(1)

        headers, rows = scrape_year(year, organ)

        if not rows or not headers:
            print(f"⚠️ Skipping {organ} due to missing data.")
            continue

        # Clean and prepare header (skip title row)
        table_headers = [h for h in headers if h.strip()][1:]

        # Pad or truncate rows to match header length
        cleaned_rows = []
        for row in rows:
            row = row[:len(table_headers)]
            while len(row) < len(table_headers):
                row.append("")
            cleaned_rows.append(row)

        # Create DataFrame and save to CSV
        df = pd.DataFrame(cleaned_rows, columns=table_headers)
        filename = f"{year}_{organ}.csv".replace(" ", "_")
        filepath = f"{output_folder}/{filename}"

        df.to_csv(filepath, index=False, encoding="utf-8-sig")
        print(f"✅ Saved: {filepath}")
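`DataFrame.to_csv` does not create missing directories, so the function above fails if `output_folder` does not exist yet. The sketch below shows one way to build the file path and create the folder first. `csv_path_for` is a hypothetical helper, and the demo writes only to a temporary directory.

```python
import tempfile
from pathlib import Path


def csv_path_for(output_folder: str, year: str, organ: str) -> Path:
    """Return the CSV path for one year/organ pair, creating the folder if needed."""
    folder = Path(output_folder)
    folder.mkdir(parents=True, exist_ok=True)
    # Same naming rule as the notebook: spaces become underscores.
    filename = f"{year}_{organ}.csv".replace(" ", "_")
    return folder / filename


# Demo inside a throwaway directory so nothing is left on disk.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = csv_path_for(f"{tmp}/data_raw/2024", "2024", "Rene doppio")
    demo_name = demo_path.name
    folder_created = demo_path.parent.is_dir()

print(demo_name, folder_created)
```

The returned `Path` can be passed to `df.to_csv` in place of the string built with an f-string.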

▶️ Example Usage: Save 2024 Data for Rene, Cuore, and Fegato.

In [8]:
# Start the WebDriver
driver = webdriver.Chrome()

# Navigate to the main page
main_table_path = "https://trapianti.sanita.it/statistiche/trapianti_per_anno.aspx"
driver.get(main_table_path)

# Scrape and save one CSV per organ (the output folder must already exist)
organs = ["Rene", "Cuore", "Fegato"]  # or any other organ names shown on the site
save_each_organ_table_for_year("2024", organs, output_folder="../data_raw/YYYY_test")

# Quit the driver
driver.quit()

🔄 Scraping Rene data for 2024...
✅ Saved: ../data_raw/YYYY_test/2024_Rene.csv
🔄 Scraping Cuore data for 2024...
✅ Saved: ../data_raw/YYYY_test/2024_Cuore.csv
🔄 Scraping Fegato data for 2024...
⚠️ Skipping Fegato due to missing data.


In [9]:
pd.read_csv('../data_raw/YYYY_test/2024_Rene.csv').head()

Unnamed: 0,Struttura trapianto,Rene,Rene doppio,Rene - fegato,Rene - pancreas,Rene doppio - fegato,Totale Trapianti
0,NO - AOU MAGGIORE DELLA CARITA' - NOVARA,37,2,0,0,0,39
1,"TO - AOU Città della Salute, PO OIRM",4,0,0,0,0,4
2,"TO - AOU Città della Salute, PO S.G.Battista",194,10,2,2,0,208
3,BG - OSPEDALE PAPA GIOVANNI XXIII - BERGAMO,44,7,5,0,0,56
4,BS - PRES. OSPEDAL. SPEDALI CIVILI BRESCIA,60,6,0,0,0,66


In [10]:
pd.read_csv('../data_raw/YYYY_test/2024_Fegato.csv').head()

Unnamed: 0,Struttura trapianto,Fegato,Fegato - pancreas - intestino,Fegato - polmone doppio,Rene - fegato,Cuore - fegato,Rene doppio - fegato,Totale Trapianti
0,"TO - AOU Città della Salute, PO S.G.Battista",174,0,0,2,2,0,178
1,BG - OSPEDALE PAPA GIOVANNI XXIII - BERGAMO,97,1,0,5,0,0,103
2,MI - AO NIGUARDA CA' GRANDA - MILANO,107,0,0,5,0,0,112
3,MI - ISTITUTO NAZ.LE PER CURA TUMORI - MILANO,48,0,0,0,0,0,48
4,MI - OSPEDALE MAGGIORE POLICLINICO - MILANO,63,0,0,0,0,0,63


In [11]:
pd.read_csv('../data_raw/YYYY_test/2024_Cuore.csv').head()

Unnamed: 0,Struttura trapianto,Cuore,Cuore - fegato,Totale Trapianti
0,"TO - AOU Città della Salute, PO OIRM",5,0,5
1,"TO - AOU Città della Salute, PO S.G.Battista",26,2,28
2,BG - OSPEDALE PAPA GIOVANNI XXIII - BERGAMO,22,0,22
3,MI - AO NIGUARDA CA' GRANDA - MILANO,35,0,35
4,PV - OSPEDALE POLICLINICO S. MATTEO - PAVIA,15,0,15


I created the `utils.py` file to reuse the functions defined in this notebook for scraping transplant data for all organs across multiple years.
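Running the scraper for several years mostly means calling `save_each_organ_table_for_year` once per year with a year-specific output folder. The sketch below only plans those folders, following the one-folder-per-year layout described in the next section. `plan_year_folders` and the default root `../data_raw` are assumptions for illustration and do not show the contents of `utils.py`.

```python
from pathlib import PurePosixPath


def plan_year_folders(years, root="../data_raw"):
    """Return (year, folder) pairs, one output folder per year."""
    return [(year, str(PurePosixPath(root) / year)) for year in years]


plan = plan_year_folders(["2022", "2023", "2024"])
for year, folder in plan:
    print(year, "->", folder)
    # With a live driver, each pair could feed the notebook's function:
    # save_each_organ_table_for_year(year, organs, output_folder=folder)
```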

I also want to organize all the scraped data according to the following structure:

### 📁 Structure: One folder per year

```text
/data_raw/
├── 2022/
│   ├── 2022_Rene.csv
│   ├── 2022_Cuore.csv
│   └── ...
├── 2023/
│   ├── 2023_Rene.csv
│   ├── 2023_Cuore.csv
│   └── ...
└── 2024/
    ├── 2024_Rene.csv
    ├── 2024_Cuore.csv
    └── ...
```
