## 📊 Load, Clean, and Reshape Transplant Data
In this notebook, we begin the data processing phase of the project. After scraping and storing transplant data by organ and year in individual CSV files, our next steps are to load, clean, and consolidate these datasets into a unified format suitable for analysis and visualization.

### 🔧 Goals:
1. **Load** the CSV files saved for each year and organ.

2. **Clean** the data:

    - Remove total summary rows and columns (e.g., the `Totale Trapianti` column and a trailing `Totale`/`Totali` row).

    - Strip whitespace and handle missing or inconsistent values.

3. **Reshape** each DataFrame to a long format using `pd.melt()`:

    - One row per transplant record.

    - Columns: `Struttura trapianto`, `Città`, `Organo`, `Sottotipo`, `Numero`, `Anno`.

4. **Enrich** the dataset:

    - Extract city codes from hospital names and map to full city names (optional).

    - Add the corresponding region (optional, if mapping available).

5. **Combine** all yearly-organ files into a single long-format DataFrame.

6. **Preview** the final dataset to ensure consistency before moving on to data exploration.
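
As a minimal sketch of the reshape and city-code steps listed above, the toy example below uses two made-up centers and two made-up subtype columns (the hospital names and counts are assumptions for illustration, not real data). It shows how each subtype column turns into its own `Sottotipo`/`Numero` row.

```python
import pandas as pd

# Hypothetical wide table: one row per center, one column per subtype.
wide = pd.DataFrame({
    "Struttura trapianto": ["MI - Ospedale A", "TO - Ospedale B"],
    "Rene": [10, 4],
    "Rene doppio": [1, 0],
})

# The city code is the text before the first dash.
wide["Città"] = wide["Struttura trapianto"].str.split("-").str[0].str.strip()

# Wide to long: every subtype column becomes a (Sottotipo, Numero) pair.
long = pd.melt(
    wide,
    id_vars=["Struttura trapianto", "Città"],
    var_name="Sottotipo",
    value_name="Numero",
)

# Organ and year are constant within a single source file.
long["Organo"] = "Rene"
long["Anno"] = 2010

print(long)
```

Two centers and two subtype columns give four long-format rows, and the sum of `Numero` is unchanged by the reshape.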

In [1]:
import pandas as pd
import os
#from glob import glob

In [2]:
# List all CSV files in all subfolders of ../data_raw/
#csv_files = glob("../data_raw/**/*.csv", recursive=True)

#for file in csv_files:
    #print(file)

In [3]:
# --- CONFIGURATION ---
data_path = "../data_raw/"
organs = ['Rene', 'Fegato', 'Cuore', 'Polmone', 'Pancreas', 'Intestino']
years = [str(y) for y in range(2010, 2025)]  # Update if needed
all_data = []

In [4]:
# --- HELPER FUNCTION ---
def load_and_clean_csv(file_path, organ, year):
    """Load one organ/year CSV and return it in long format."""
    df = pd.read_csv(file_path)

    # Drop summary column if present
    df = df.drop(columns='Totale Trapianti', errors='ignore')

    # Drop summary row if the last row is labelled 'totale' or 'totali'.
    # str() guards against a non-string (e.g. NaN) label in the last row.
    if not df.empty and str(df.iloc[-1, 0]).strip().lower() in ('totale', 'totali'):
        df = df.drop(index=df.index[-1])

    # Extract city short code (e.g. 'MI') before the dash
    df['Città'] = df['Struttura trapianto'].str.split('-').str[0].str.strip()

    # Reshape from wide to long
    df_long = pd.melt(
        df,
        id_vars=['Struttura trapianto', 'Città'],
        var_name='Sottotipo',
        value_name='Numero'
    )

    # Add organ and year
    df_long['Organo'] = organ
    df_long['Anno'] = int(year)

    # Reorder columns
    return df_long[['Struttura trapianto', 'Città', 'Organo', 'Sottotipo', 'Numero', 'Anno']]

In [5]:

# --- MAIN LOOP ---
for year in years:
    for organ in organs:
        file_path = os.path.join(data_path, year, f"{year}_{organ}.csv")
        if os.path.exists(file_path):
            try:
                df_clean = load_and_clean_csv(file_path, organ, year)
                all_data.append(df_clean)
            except Exception as e:
                print(f"❌ Failed to process {file_path}: {e}")
        else:
            print(f"⚠️ Missing file: {file_path}")

⚠️ Missing file: ../data_raw/2012\2012_Intestino.csv
⚠️ Missing file: ../data_raw/2013\2013_Intestino.csv
⚠️ Missing file: ../data_raw/2014\2014_Intestino.csv
⚠️ Missing file: ../data_raw/2017\2017_Intestino.csv
⚠️ Missing file: ../data_raw/2018\2018_Intestino.csv
⚠️ Missing file: ../data_raw/2020\2020_Intestino.csv
⚠️ Missing file: ../data_raw/2021\2021_Intestino.csv


These files are missing because no intestine transplants were performed in 2012, 2013, 2014, 2017, 2018, 2020, or 2021.
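
To confirm that the gaps are limited to a single organ, the missing (year, organ) pairs can be collected in a list rather than only printed. The sketch below assumes the same `<year>/<year>_<organ>.csv` layout used by the main loop. The helper name `find_missing` is hypothetical, and the demo runs on a throwaway directory rather than the real `../data_raw/` tree.

```python
import os
import tempfile

def find_missing(data_path, years, organs):
    """Return (year, organ) pairs with no CSV at data_path/<year>/<year>_<organ>.csv."""
    missing = []
    for year in years:
        for organ in organs:
            path = os.path.join(data_path, year, f"{year}_{organ}.csv")
            if not os.path.exists(path):
                missing.append((year, organ))
    return missing

# Demonstration on a temporary directory containing one file only.
with tempfile.TemporaryDirectory() as tmp:
    os.makedirs(os.path.join(tmp, "2012"))
    open(os.path.join(tmp, "2012", "2012_Rene.csv"), "w").close()
    demo_missing = find_missing(tmp, ["2012"], ["Rene", "Intestino"])

print(demo_missing)
```

With the notebook's own `data_path`, `years`, and `organs`, checking that every returned organ is `'Intestino'` would verify the explanation above.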

In [6]:
# --- CONCATENATE ALL ---
df_all = pd.concat(all_data, ignore_index=True)

In [7]:
# --- Preview ---
df_all.head()

|   | Struttura trapianto | Città | Organo | Sottotipo | Numero | Anno |
|---|---|---|---|---|---|---|
| 0 | NO - AOU MAGGIORE DELLA CARITA' - NOVARA | NO | Rene | Rene | 67 | 2010 |
| 1 | TO - AOU Città della Salute, PO OIRM | TO | Rene | Rene | 5 | 2010 |
| 2 | TO - AOU Città della Salute, PO S.G.Battista | TO | Rene | Rene | 109 | 2010 |
| 3 | BG - OSPEDALE PAPA GIOVANNI XXIII - BERGAMO | BG | Rene | Rene | 17 | 2010 |
| 4 | BS - PRES. OSPEDAL. SPEDALI CIVILI BRESCIA | BS | Rene | Rene | 47 | 2010 |


In [8]:
# --- Save ---
df_all.to_csv("../data_cleaned/transplants_italy_2010_2024_long.csv", index=False)

### 🧹 Filtering Out Inactive Transplant Centers

In this section, we identify and remove transplant centers that have not performed any transplants in the last 10 years of data. This step focuses the analysis on currently active centers and reduces noise from discontinued programs.

To assess the impact of this filtering, we compare the number of transplant centers and the total number of transplants **before and after** the cleaning step.

The filtering process includes:

- Defining a cutoff year (the latest year in the data minus 10).
- Identifying centers that performed **at least one transplant** after the cutoff year.
- Removing from the dataset all centers that have been **inactive** during this period.

In [9]:
# 📅 1. Define the cutoff year (e.g., last 10 years)
cutoff_year = df_all['Anno'].max() - 10  # You can adjust the value (e.g., 5 for last 5 years)

# 🏥 2. Count transplant centers before filtering
centers_before = df_all['Struttura trapianto'].nunique()

# 📊 3. Total transplants before filtering
transplants_before = df_all['Numero'].sum()

# 🔍 4. Identify active centers: any center with at least one row after the cutoff year
#       (rows with Numero == 0 also count here; the next cell requires Numero > 0)
active_centers = df_all[df_all['Anno'] > cutoff_year]['Struttura trapianto'].unique()

# 📁 5. Filter dataset to only include active centers
df_active = df_all[df_all['Struttura trapianto'].isin(active_centers)]

# 🏥 6. Count transplant centers after filtering
centers_after = df_active['Struttura trapianto'].nunique()

# 📊 7. Total transplants after filtering
transplants_after = df_active['Numero'].sum()

# 📉 8. Transplants removed from analysis
excluded_transplants = transplants_before - transplants_after

# ✅ 9. Summary
print(f"🏥 Transplant centers before filtering: {centers_before}")
print(f"✅ Active centers (since {cutoff_year}): {centers_after}")
print(f"📊 Transplants before filtering: {transplants_before}")
print(f"📊 Transplants after filtering:  {transplants_after}")
print(f"📉 Transplants removed from analysis: {excluded_transplants}")

🏥 Transplant centers before filtering: 52
✅ Active centers (since 2014): 46
📊 Transplants before filtering: 50879
📊 Transplants after filtering:  50807
📉 Transplants removed from analysis: 72


Although inactive centers were removed based on their transplant activity over the last 10 years, we retain **all historical data from 2010 onward** for the active centers. This ensures the dataset remains comprehensive and robust for longitudinal analysis.

### 🧼 Filtering Dataset to Active Transplant Centers

The previous cell measured the impact of the filter. This cell builds `df_filtered`, the DataFrame used in the rest of the notebook, and applies the activity rule explicitly:
- Compute the **cutoff year** as the latest year in the data minus 10.
- Select centers with **at least one row where `Numero > 0`** after that year.
- Keep every row of the full dataset, from 2010 onward, for those centers.

This step removes discontinued programs while preserving the full history of centers that are still operating.
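
The difference between "has rows in the window" and "performed at least one transplant in the window" is easy to miss. The toy example below uses hypothetical centers `A`, `B`, and `C` with made-up counts to show it. This is only a sketch of the rule, not a replay of the real data.

```python
import pandas as pd

# Hypothetical centers: A is active, B reports only zeros recently, C stopped reporting.
toy = pd.DataFrame({
    "Struttura trapianto": ["A", "A", "B", "B", "C"],
    "Anno":                [2012, 2024, 2012, 2024, 2012],
    "Numero":              [5, 7, 3, 0, 9],
})

cutoff_year = toy["Anno"].max() - 10        # latest year minus 10
recent = toy[toy["Anno"] > cutoff_year]     # strictly after the cutoff

# Rule 1: any row in the window counts as activity.
by_presence = set(recent["Struttura trapianto"])

# Rule 2: at least one row in the window with a positive count.
by_count = set(recent.loc[recent["Numero"] > 0, "Struttura trapianto"])

print(sorted(by_presence))
print(sorted(by_count))
```

Center `B` passes the first rule but not the second, because its only recent row has a count of zero. Center `C` fails both, since it has no rows after the cutoff.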

In [10]:
# 1. Define the cutoff year (e.g., last 10 years of data)
cutoff_year = df_all['Anno'].max() - 10

# 2. Filter to include only data from the last 10 years
recent_df = df_all[df_all['Anno'] > cutoff_year]

# 3. Identify active transplant centers (at least 1 transplant in last 10 years)
active_centers = recent_df[recent_df['Numero'] > 0]['Struttura trapianto'].unique()

# 4. Filter the original dataset to include only active centers
df_filtered = df_all[df_all['Struttura trapianto'].isin(active_centers)].copy()
df_filtered = df_filtered.reset_index(drop=True)

In [11]:
df_filtered.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6794 entries, 0 to 6793
Data columns (total 6 columns):
 #   Column               Non-Null Count  Dtype 
---  ------               --------------  ----- 
 0   Struttura trapianto  6794 non-null   object
 1   Città                6794 non-null   object
 2   Organo               6794 non-null   object
 3   Sottotipo            6794 non-null   object
 4   Numero               6794 non-null   int64 
 5   Anno                 6794 non-null   int64 
dtypes: int64(2), object(4)
memory usage: 318.6+ KB


### 🏙️ Enriching with Full City Names and Regions
To improve readability and support geographic analyses, we enrich the dataset by mapping the city codes (`Città` column) to their full city names and corresponding Italian regions. This adds important contextual information for each transplant center.

We add two new columns to the `df_filtered` DataFrame:

- `Città_nome`: full name of the city

- `Regione`: the Italian region where the city is located
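
A code that is absent from the mapping would silently become a missing value, so it can help to list unmapped codes right after the enrichment. The sketch below uses a hypothetical two-entry map (`demo_map`) and a made-up code `XX`; it assumes the map stores `(city name, region)` tuples.

```python
import pandas as pd

# Hypothetical two-entry map of code -> (city name, region).
demo_map = {
    "MI": ("Milano", "Lombardia"),
    "TO": ("Torino", "Piemonte"),
}
codes = pd.Series(["MI", "TO", "XX"], name="Città")

# Split the tuple map into two plain dicts so Series.map can use them directly.
names = {code: name for code, (name, _) in demo_map.items()}
regions = {code: region for code, (_, region) in demo_map.items()}

enriched = pd.DataFrame({
    "Città": codes,
    "Città_nome": codes.map(names),
    "Regione": codes.map(regions),
})

# Codes absent from the map come back as NaN, so they are easy to list.
unmapped = sorted(enriched.loc[enriched["Regione"].isna(), "Città"].unique())
print(unmapped)
```

An empty `unmapped` list would mean every city code in the data has a name and a region.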

In [12]:
df_filtered['Città'].unique()

array(['NO', 'TO', 'BG', 'BS', 'MI', 'PV', 'VA', 'PD', 'TV', 'VI', 'VR',
       'UD', 'GE', 'BO', 'MO', 'PR', 'FI', 'PI', 'SI', 'PG', 'AN', 'RM',
       'AQ', 'SA', 'BA', 'CS', 'RC', 'CT', 'PA', 'CA', 'NA'], dtype=object)

In [13]:
# Mapping of city codes to full city names and regions
city_region_map = {
    'NO': ('Novara', 'Piemonte'),
    'TO': ('Torino', 'Piemonte'),
    'BG': ('Bergamo', 'Lombardia'),
    'BS': ('Brescia', 'Lombardia'),
    'MI': ('Milano', 'Lombardia'),
    'PV': ('Pavia', 'Lombardia'),
    'VA': ('Varese', 'Lombardia'),
    'PD': ('Padova', 'Veneto'),
    'TV': ('Treviso', 'Veneto'),
    'VI': ('Vicenza', 'Veneto'),
    'VR': ('Verona', 'Veneto'),
    'UD': ('Udine', 'Friuli Venezia Giulia'),
    'GE': ('Genova', 'Liguria'),
    'BO': ('Bologna', 'Emilia-Romagna'),
    'MO': ('Modena', 'Emilia-Romagna'),
    'PR': ('Parma', 'Emilia-Romagna'),
    'FI': ('Firenze', 'Toscana'),
    'PI': ('Pisa', 'Toscana'),
    'SI': ('Siena', 'Toscana'),
    'PG': ('Perugia', 'Umbria'),
    'AN': ('Ancona', 'Marche'),
    'RM': ('Roma', 'Lazio'),
    'AQ': ("L'Aquila", 'Abruzzo'),
    'SA': ('Salerno', 'Campania'),
    'BA': ('Bari', 'Puglia'),
    'CS': ('Cosenza', 'Calabria'),
    'RC': ('Reggio Calabria', 'Calabria'),
    'CT': ('Catania', 'Sicilia'),
    'PA': ('Palermo', 'Sicilia'),
    'CA': ('Cagliari', 'Sardegna'),
    'NA': ('Napoli', 'Campania'),
}

In [14]:
# Map city codes to full names and regions
df_filtered['Città_nome'] = df_filtered['Città'].map(lambda x: city_region_map.get(x, (None, None))[0])
df_filtered['Regione'] = df_filtered['Città'].map(lambda x: city_region_map.get(x, (None, None))[1])

### ✅ Data Consistency Check
Before proceeding to data exploration and visualization, we perform a few key checks to ensure the dataset is clean and consistent:

- Ensure there are no missing or null values in critical columns (`Anno`, `Organo`, `Struttura trapianto`, `Numero`, `Città`, `Regione`)
- Verify that the `Numero` field is numeric and non-negative
- Confirm that each combination of year, organ, subtype, and transplant center appears only once

These checks help validate the integrity of the data and prepare it for reliable analysis.
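
The same checks can be wrapped in a small function that returns a list of problems, which is easier to reuse or assert on than printed messages. This is a sketch under assumptions: the function name `check_consistency` and the demo rows are hypothetical and are not part of the notebook.

```python
import pandas as pd

def check_consistency(df, key_cols, count_col="Numero"):
    """Return a list of human-readable problems; an empty list means all checks passed."""
    problems = []
    if df[key_cols + [count_col]].isnull().any().any():
        problems.append("missing values in key columns")
    if not pd.api.types.is_numeric_dtype(df[count_col]):
        problems.append(f"{count_col} is not numeric")
    elif (df[count_col] < 0).any():
        problems.append(f"negative values in {count_col}")
    if df.duplicated(subset=key_cols).any():
        problems.append("duplicate key combinations")
    return problems

# Made-up rows with one duplicated key and one negative count.
demo = pd.DataFrame({
    "Anno": [2020, 2020, 2020],
    "Organo": ["Rene", "Rene", "Rene"],
    "Struttura trapianto": ["A", "A", "B"],
    "Sottotipo": ["Rene", "Rene", "Rene"],
    "Numero": [3, 3, -1],
})
keys = ["Anno", "Organo", "Struttura trapianto", "Sottotipo"]
print(check_consistency(demo, keys))
```

On the notebook's data, the call would take `df_filtered` and the same four key columns, and an empty list would indicate a clean dataset.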

In [15]:
# ✅ Data Consistency Checks

# 1. Check for missing values in key columns
missing_summary = df_filtered[['Anno', 'Organo', 'Struttura trapianto', 'Numero', 'Città', 'Regione']].isnull().sum()
print("🔍 Missing values per column:\n", missing_summary)

# 2. Check if 'Numero' contains only non-negative values
if (df_filtered['Numero'] < 0).any():
    print("⚠️ Warning: Negative values found in 'Numero'")
else:
    print("✅ All values in 'Numero' are non-negative.")

# 3. Confirm data types
print("\n🔢 Data types:\n", df_filtered.dtypes)

# 4. Check for unexpected duplicates (optional)
duplicates = df_filtered.duplicated(subset=['Anno', 'Organo', 'Struttura trapianto', 'Sottotipo'])
print(f"\n📌 Duplicate rows found: {duplicates.sum()}")

🔍 Missing values per column:
 Anno                   0
Organo                 0
Struttura trapianto    0
Numero                 0
Città                  0
Regione                0
dtype: int64
✅ All values in 'Numero' are non-negative.

🔢 Data types:
 Struttura trapianto    object
Città                  object
Organo                 object
Sottotipo              object
Numero                  int64
Anno                    int64
Città_nome             object
Regione                object
dtype: object

📌 Duplicate rows found: 0


### 💾 Finalizing the Cleaned Dataset
At this stage, we've successfully cleaned, filtered, and enriched the dataset with relevant location details. We now have a consolidated and robust dataset covering transplant activity across all active centers in Italy from 2010 to 2024.

This cleaned dataset will serve as the foundation for the next steps, focused on data exploration and visualization.

We'll now save the `df_filtered` DataFrame for reuse in the upcoming notebook.

In [16]:
# Save the cleaned and enriched dataset
df_filtered.to_csv("../data_cleaned/Transplants_Italy_2010_2024_clean.csv", index=False)

print("✅ Cleaned dataset saved successfully.")

✅ Cleaned dataset saved successfully.
