In [1]:
# import required libraries and modules
from selenium import webdriver
from selenium.webdriver.common.by import By
import pandas as pd

In [2]:
# initialise the driver (and open up a browser window)
# driver = webdriver.Chrome(path + '/chromedriver')
driver = webdriver.Chrome()

## 🔍 Inspecting the Source Website: Sistema Informativo Trapianti

This section describes the structure and navigation logic of the **Sistema Informativo Trapianti** website and how we reach the tables from which we'll scrape transplant data.

🔗 Access the website here: [Sistema Informativo Trapianti](https://trapianti.sanita.it/statistiche/trapianti_per_anno.aspx)

The starting point is a summary table we'll call the **Year-Organ Transplant Matrix**, available at the homepage URL. This table displays **years in rows** and **transplanted organs in columns**.

**Year-Organ Transplant Matrix**

![Transplants per year](../images/Transplants_per_year_table.jpg)

Clicking on a specific **year** loads the **Region-Organ Transplant Matrix** for that year. This table shows **regions in rows** and **organs in columns**.

**Region-Organ Transplant Matrix (One Year)**

![Transplants per Organ in one year](../images/Transplants_per_Organ_in_one_year_table.jpg)

From there, selecting an **organ** brings you to the **Region-Specific Organ Transplant Matrix** for that year. The **rows are regions**, and the **columns show subtypes of transplants** (e.g., single kidney, double kidney, kidney-liver).

**Region-Specific Organ Transplant Matrix**

![Transplants per region of one organ in one year](../images/Tranplants_per_Region_of_one_organ_in_one_year.jpg)

Finally, clicking the **Regione** button opens the most detailed view: the **Center-Specific Organ Transplant Matrix** for that year and organ. Here, each row represents a transplant center (or program), and columns detail the counts for each subtype.

**Center-Specific Organ Transplant Matrix (Final Target)**

![Transplants per Center in Italy of one organ in one year](../images/Tranplants_per_Center_in_Italy_of_one_organ_in_one_year.jpg)

🎯 These final tables (one per year-organ pair) are the **actual data sources** we scrape.

## 🔍 Navigating with XPath: Dynamic Button Selection

From inspecting the HTML structure of the website, we found that the following XPath pattern is especially useful for navigating through the interface: `//button[normalize-space()='YYYY']`

This XPath targets buttons by their visible text — for example, a button labeled with a year like **2023**.
To make this dynamic and reusable in code, we define two components:

In [3]:
path_sx = "//button[normalize-space()='"
path_dx = "']"

You can then construct full XPaths like so: 

In [4]:
year_xpath = path_sx + "2023" + path_dx

🔗 This results in:

In [5]:
year_xpath

"//button[normalize-space()='2023']"

✅ This pattern allows for consistent and clean navigation through the site's buttons for years, organs, and other labeled elements (e.g., **Regione**).
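The same pattern can be wrapped in a small helper. This is a minimal sketch, not part of the original notebook: the names `xpath_literal` and `button_xpath` are hypothetical, and the quote handling only matters if a button label ever contains an apostrophe (the label `L'Aquila` below is an invented example, not one observed on the site). XPath 1.0 has no escape character inside string literals, so the helper switches quote style or falls back to `concat()`.

```python
def xpath_literal(text: str) -> str:
    """Return `text` as an XPath 1.0 string literal."""
    if "'" not in text:
        return f"'{text}'"
    if '"' not in text:
        return f'"{text}"'
    # Both quote types present: join the pieces with concat().
    parts = text.split("'")
    return "concat(" + ", \"'\", ".join(f"'{part}'" for part in parts) + ")"


def button_xpath(label: str) -> str:
    """Build the XPath of a button identified by its visible text."""
    return f"//button[normalize-space()={xpath_literal(label)}]"


year_xpath = button_xpath("2023")
quoted_xpath = button_xpath("L'Aquila")  # invented label, for illustration only
print(year_xpath)
print(quoted_xpath)
```

For plain labels such as years or organ names, the result is identical to `path_sx + label + path_dx`.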

In [6]:
main_table_path = "https://trapianti.sanita.it/statistiche/trapianti_per_anno.aspx"
# Step 1: open the main table page of Sistema Informativo Trapianti
driver.get(main_table_path)

In [7]:
# Step 2: go to the table of the year = '2024'
year = '2024'
driver.find_element(By.XPATH, path_sx + year + path_dx).click()

In [8]:
# Step 3: go to the table of the organ = 'Rene'
organ = 'Rene'
driver.find_element(By.XPATH, path_sx + organ + path_dx).click()

In [9]:
# Step 4: go to the page listing all the centers in all regions
driver.find_element(By.XPATH, "//button[normalize-space()='Regione']").click()

Now I want to extract the data from the table of interest: the number of kidney transplants performed in 2024 in each center, across all Italian regions.

In [10]:
# Step 5: collect the column headers of the table for that organ in that year
headers = driver.find_elements(By.TAG_NAME, "th")
headers = [i.text for i in headers[1:]]  # skip the first <th> element

headers

['Struttura trapianto',
 'Rene',
 'Rene doppio',
 'Rene - fegato',
 'Rene - pancreas',
 'Rene doppio - fegato',
 'Totale Trapianti']

In [11]:
# Step 6: extract the data for each center from the table
# with the transplants of that organ, in all regions, in that year
rows_elements = driver.find_elements(By.XPATH, "//tr")
data_rows = []
for row in rows_elements[1:]: # skip table title/header
    cells = row.find_elements(By.XPATH, ".//td")
    if not cells:
        continue
    row_data = [cell.text.strip() for cell in cells]
    data_rows.append(row_data)                    

In [12]:
data_rows[0:2]

[["NO - AOU MAGGIORE DELLA CARITA' - NOVARA", '37', '2', '0', '0', '0', '39'],
 ['TO - AOU Città della Salute, PO OIRM', '4', '0', '0', '0', '0', '4']]
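Each row is only meaningful when read against the header list from Step 5. The sketch below pairs the two into dictionaries and fails early if a row has a different number of cells than there are headers. The helper name `rows_to_records` is hypothetical, and the sample values are copied from the outputs above.

```python
def rows_to_records(headers, rows):
    """Pair each row with the headers, checking that the widths match."""
    records = []
    for index, row in enumerate(rows):
        if len(row) != len(headers):
            raise ValueError(
                f"row {index} has {len(row)} cells, expected {len(headers)}"
            )
        records.append(dict(zip(headers, row)))
    return records


headers = ['Struttura trapianto', 'Rene', 'Rene doppio', 'Rene - fegato',
           'Rene - pancreas', 'Rene doppio - fegato', 'Totale Trapianti']
data_rows = [
    ["NO - AOU MAGGIORE DELLA CARITA' - NOVARA", '37', '2', '0', '0', '0', '39'],
    ['TO - AOU Città della Salute, PO OIRM', '4', '0', '0', '0', '0', '4'],
]

records = rows_to_records(headers, data_rows)
print(records[0]['Rene'])  # the cell text is still a string at this point
```

A mismatch here would most likely mean that the page layout differs from what was inspected, which is easier to diagnose at this stage than later in the analysis.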

In [13]:
# Go back to the main table
driver.find_element(By.XPATH, '//*[@id="primo_link"]/a[2]').click()
# Quit the driver
driver.quit()

Now I need to arrange all the data in one DataFrame, using the headers of the source table as column names.

In [14]:
# create the dataframe
df = pd.DataFrame(data_rows, columns=headers)
df.head()

| | Struttura trapianto | Rene | Rene doppio | Rene - fegato | Rene - pancreas | Rene doppio - fegato | Totale Trapianti |
|---|---|---|---|---|---|---|---|
| 0 | NO - AOU MAGGIORE DELLA CARITA' - NOVARA | 37 | 2 | 0 | 0 | 0 | 39 |
| 1 | TO - AOU Città della Salute, PO OIRM | 4 | 0 | 0 | 0 | 0 | 4 |
| 2 | TO - AOU Città della Salute, PO S.G.Battista | 194 | 10 | 2 | 2 | 0 | 208 |
| 3 | BG - OSPEDALE PAPA GIOVANNI XXIII - BERGAMO | 44 | 7 | 5 | 0 | 0 | 56 |
| 4 | BS - PRES. OSPEDAL. SPEDALI CIVILI BRESCIA | 60 | 6 | 0 | 0 | 0 | 66 |


In [15]:
df.tail()

| | Struttura trapianto | Rene | Rene doppio | Rene - fegato | Rene - pancreas | Rene doppio - fegato | Totale Trapianti |
|---|---|---|---|---|---|---|---|
| 36 | CT - A.O. UNIVERSITARIA DI CATANIA | 26 | 1 | 0 | 0 | 0 | 27 |
| 37 | PA - Is.Me.T.T. | 80 | 0 | 5 | 2 | 0 | 87 |
| 38 | PA - P.O. CIVICO E BENFRATELLI | 39 | 0 | 0 | 0 | 0 | 39 |
| 39 | CA - AZIENDA OSPEDALIERA G. BROTZU | 36 | 0 | 0 | 0 | 0 | 36 |
| 40 | Totale | 1843 | 124 | 29 | 34 | 1 | 2031 |


I plan to improve the quality of the scraped DataFrame by removing the `Totale Trapianti` column and the final `Totale` row. Both only summarize the other values and are not needed for detailed analysis.
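A minimal sketch of these two cleaning steps is shown here on plain lists shaped like `headers` and `data_rows`, so that it runs without a browser. The function name `drop_totals` is hypothetical, the sample rows are copied from the outputs above, and the actual implementation in the next notebook may well operate on the DataFrame instead.

```python
def drop_totals(headers, rows,
                total_column="Totale Trapianti", total_label="Totale"):
    """Return (headers, rows) without the summary column and summary row."""
    keep = [i for i, name in enumerate(headers) if name != total_column]
    clean_headers = [headers[i] for i in keep]
    clean_rows = [
        [row[i] for i in keep]
        for row in rows
        if row[0] != total_label  # first cell is the center name or 'Totale'
    ]
    return clean_headers, clean_rows


headers = ['Struttura trapianto', 'Rene', 'Rene doppio', 'Rene - fegato',
           'Rene - pancreas', 'Rene doppio - fegato', 'Totale Trapianti']
data_rows = [
    ["NO - AOU MAGGIORE DELLA CARITA' - NOVARA", '37', '2', '0', '0', '0', '39'],
    ['CA - AZIENDA OSPEDALIERA G. BROTZU', '36', '0', '0', '0', '0', '36'],
    ['Totale', '1843', '124', '29', '34', '1', '2031'],
]

clean_headers, clean_rows = drop_totals(headers, data_rows)
print(clean_headers)
print(clean_rows)
```

The sketch leaves the counts as strings, exactly as scraped; converting them to integers would be a separate step.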

These cleaning steps will be implemented in the next notebook, where I will also begin designing a strategy for scraping **all years and organs**. This approach will take into account important execution factors such as page load times, potential delays, and error handling to ensure robust and scalable data collection.
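As an illustration of the kind of error handling this could involve, here is a hedged sketch of a generic retry helper with an increasing pause between attempts. It is not the strategy of the next notebook: the names `with_retries` and `flaky_step` are hypothetical, and `flaky_step` merely stands in for a page interaction that fails while the page is still loading.

```python
import time


def with_retries(action, attempts=3, delay_seconds=1.0):
    """Call action() until it succeeds or the attempts run out."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(1, attempts + 1):
        try:
            return action()
        except Exception:
            if attempt == attempts:
                raise  # out of attempts: surface the last error
            time.sleep(delay_seconds * attempt)  # wait longer each time


# Stand-in for a page interaction that fails twice before succeeding.
calls = []


def flaky_step():
    calls.append(1)
    if len(calls) < 3:
        raise RuntimeError("page not ready")
    return "table loaded"


result = with_retries(flaky_step, attempts=3, delay_seconds=0)
print(result, len(calls))
```

In a scraper, `action` could be a function that performs the clicks for one year-organ pair. Catching a narrower exception type than `Exception` would be a sensible refinement.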