# Data preprocessing

## Overview
This notebook prepares demographic and real-estate data for exploratory analysis. The merge and GeoJSON cells work on Barcelona files (all paths in that code point to `../data/raw/barcelona`), while the data-loading and EDA cells target Paris. Key pieces of code:

- Two helper functions that merge yearly CSV files into multi-year CSVs.
- A cell that calls those functions to produce the merged files.
- A cell that reprojects Barcelona's administrative-unit geometries to EPSG:4326 and exports them as GeoJSON.
- Imports and plotting styling for the subsequent exploratory analysis.
- A cell that loads the Paris IRIS reference table, and a placeholder cell for EDA.

---

## Plotting & styling setup (imports cell)
The notebook imports commonly used libraries and sets plotting defaults:

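- `pandas as pd`, `numpy as np`: data handling.
- `matplotlib.pyplot as plt`, `seaborn as sns`: plotting.
- `re`: regular expressions (imported, but not used in the cells shown here).
- `sns.set_style('whitegrid')` sets the seaborn style, and `plt.rcParams['figure.figsize'] = (12, 6)` sets the default figure size.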

These settings affect all subsequent plotting code in the notebook.

---

## Paris data loading and TODO cells
- Cell for loading Paris data:
    - Reads the `Emboitements_IRIS` sheet of `../data/raw/paris/reference_IRIS_geo2025.xlsx` into `iris_reference`, with `header = 5` (the sixth row of the sheet holds the column names).
    - Keeps only the rows where `DEP == "75"` and drops the `TYP_IRIS`, `UU2020`, `REG` and `DEP` columns, giving `iris_filtered`.
- Exploratory Data Analysis cell:
    - Empty placeholder for EDA steps (summary stats, visualizations, cleaning, joins between merged CSVs and Paris dataset, etc.).

---

## Notes / suggestions
- The merge functions add a `year` column so downstream analysis can group or filter by year. In the income files this duplicates the existing `Any` column.
- `pandas` is imported in both the merge cell and the plotting-setup cell. The repeat is harmless, but later cells do not need to import it again.
- The EDA steps are still to be implemented: data cleaning, descriptive stats, time-series plots, and neighborhood-level comparisons.
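
As a minimal sketch of how the `year` column can be used downstream, the snippet below takes four rows copied from the merged income output in this notebook and both filters and aggregates by year. The variable names (`sample`, `only_2022`, `mean_by_year`) are illustrative and do not appear in the notebook.

```python
import pandas as pd

# Four rows copied from the merged income output (two from 2015, two from 2022).
sample = pd.DataFrame(
    {
        "Nom_Districte": ["Ciutat Vella", "Ciutat Vella", "Sant Martí", "Sant Martí"],
        "Import_Euros": [12845, 10442, 16402, 21047],
        "year": [2015, 2015, 2022, 2022],
    }
)

# Filter on the year column ...
only_2022 = sample[sample["year"] == 2022]

# ... or aggregate over it.
mean_by_year = sample.groupby("year")["Import_Euros"].mean()
print(mean_by_year)
```

On the full merged file the same `groupby` would give one value per loaded year. Grouping by `["year", "Nom_Districte"]` instead would be one way to start the neighborhood-level comparisons.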

## Merge Barcelona yearly files into multi-year CSVs

In [1]:
import pandas as pd
from pathlib import Path

def merge_renda_disponible_llars_per_persona(input_dir, output_path=None, verbose=True):
    """
    Merge yearly files named {year}_renda_disponible_llars_per_persona.csv for years 2015-2025
    into a single CSV named 2015-2025_affordable_income_households_per_person.csv by default.

    Parameters:
    - input_dir: path to the folder containing the yearly CSVs (string or Path)
    - output_path: optional path (string or Path) for the merged CSV file
    - verbose: print progress/messages if True

    Returns:
    - pandas.DataFrame with the concatenated data
    """
    input_dir = Path(input_dir)
    years = range(2015, 2026)
    dfs = []
    for y in years:
        fp = input_dir / f"{y}_renda_disponible_llars_per_persona.csv"
        if fp.exists():
            try:
                df = pd.read_csv(fp)
                df['year'] = y
                dfs.append(df)
                if verbose:
                    print(f"Loaded: {fp.name} ({len(df)} rows)")
            except Exception as e:
                if verbose:
                    print(f"Failed to read {fp}: {e}")
        else:
            if verbose:
                print(f"Missing: {fp.name}")
    if not dfs:
        raise FileNotFoundError("No input files found for 2015-2025 in " + str(input_dir))
    out_df = pd.concat(dfs, ignore_index=True, sort=False)
    out_path = Path(output_path) if output_path else input_dir / "2015-2025_affordable_income_households_per_person.csv"
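    # Create the target folder if it does not exist yet, otherwise to_csv fails
    out_path.parent.mkdir(parents=True, exist_ok=True)
    out_df.to_csv(out_path, index=False)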
    if verbose:
        print(f"Saved merged CSV to: {out_path} ({len(out_df)} total rows)")
    return out_df

def merge_loc_hab_valors(input_dir, output_path=None, verbose=True):
    """
    Merge yearly files named {year}_loc_hab_valors.csv for years 2018-2025
    into a single CSV named 2018-2025_loc_hab_valors.csv by default.

    Parameters and return: same semantics as merge_renda_disponible_llars_per_persona
    """
    input_dir = Path(input_dir)
    years = range(2018, 2026)
    dfs = []
    for y in years:
        fp = input_dir / f"{y}_loc_hab_valors.csv"
        if fp.exists():
            try:
                df = pd.read_csv(fp)
                df['year'] = y
                dfs.append(df)
                if verbose:
                    print(f"Loaded: {fp.name} ({len(df)} rows)")
            except Exception as e:
                if verbose:
                    print(f"Failed to read {fp}: {e}")
        else:
            if verbose:
                print(f"Missing: {fp.name}")
    if not dfs:
        raise FileNotFoundError("No input files found for 2018-2025 in " + str(input_dir))
    out_df = pd.concat(dfs, ignore_index=True, sort=False)
    out_path = Path(output_path) if output_path else input_dir / "2018-2025_loc_hab_valors.csv"
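    # Create the target folder if it does not exist yet, otherwise to_csv fails
    out_path.parent.mkdir(parents=True, exist_ok=True)
    out_df.to_csv(out_path, index=False)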
    if verbose:
        print(f"Saved merged CSV to: {out_path} ({len(out_df)} total rows)")
    return out_df
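
The two functions above differ only in the file-name suffix, the year range and the default output name. The sketch below shows one possible way to fold them into a single helper. `merge_yearly_csvs` and its `suffix` and `years` parameters are hypothetical names, not part of the notebook. Unlike the originals, it lets read errors propagate instead of printing them, and it only writes a file when `output_path` is given.

```python
from pathlib import Path

import pandas as pd


def merge_yearly_csvs(input_dir, suffix, years, output_path=None, verbose=True):
    """Concatenate {year}_{suffix}.csv files and tag each row with its year.

    Hypothetical generalisation of the two merge functions in this notebook.
    """
    input_dir = Path(input_dir)
    frames = []
    for year in years:
        fp = input_dir / f"{year}_{suffix}.csv"
        if not fp.exists():
            if verbose:
                print(f"Missing: {fp.name}")
            continue
        df = pd.read_csv(fp)
        df["year"] = year
        frames.append(df)
        if verbose:
            print(f"Loaded: {fp.name} ({len(df)} rows)")
    if not frames:
        raise FileNotFoundError(f"No input files found for {suffix} in {input_dir}")
    merged = pd.concat(frames, ignore_index=True, sort=False)
    if output_path is not None:
        output_path = Path(output_path)
        # Make sure the target folder exists before writing.
        output_path.parent.mkdir(parents=True, exist_ok=True)
        merged.to_csv(output_path, index=False)
        if verbose:
            print(f"Saved merged CSV to: {output_path} ({len(merged)} total rows)")
    return merged
```

With this helper, the rental-values merge would become `merge_yearly_csvs("../data/raw/barcelona", "loc_hab_valors", range(2018, 2026), output_path=...)`.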

In [7]:
merge_loc_hab_valors(input_dir="../data/raw/barcelona", output_path="../data/preprocessed/barcelona/2018-2025_loc_hab_valors.csv", verbose=True)
merge_renda_disponible_llars_per_persona(input_dir="../data/raw/barcelona", output_path="../data/preprocessed/barcelona/2015-2025_renda_disponible_llars_per_persona.csv", verbose=True)

Loaded: 2018_loc_hab_valors.csv (6408 rows)
Loaded: 2019_loc_hab_valors.csv (6408 rows)
Loaded: 2020_loc_hab_valors.csv (6408 rows)
Loaded: 2021_loc_hab_valors.csv (6408 rows)
Loaded: 2022_loc_hab_valors.csv (6408 rows)
Loaded: 2023_loc_hab_valors.csv (6408 rows)
Loaded: 2024_loc_hab_valors.csv (6408 rows)
Loaded: 2025_loc_hab_valors.csv (6408 rows)
Saved merged CSV to: ../data/preprocessed/barcelona/2018-2025_loc_hab_valors.csv (51264 total rows)
Loaded: 2015_renda_disponible_llars_per_persona.csv (1068 rows)
Loaded: 2016_renda_disponible_llars_per_persona.csv (1068 rows)
Loaded: 2017_renda_disponible_llars_per_persona.csv (1068 rows)
Loaded: 2018_renda_disponible_llars_per_persona.csv (1068 rows)
Loaded: 2019_renda_disponible_llars_per_persona.csv (1068 rows)
Loaded: 2020_renda_disponible_llars_per_persona.csv (1068 rows)
Loaded: 2021_renda_disponible_llars_per_persona.csv (1068 rows)
Loaded: 2022_renda_disponible_llars_per_persona.csv (1068 rows)
Missing: 2023_renda_disponible_llars

|      | Any  | Codi_Districte | Nom_Districte | Codi_Barri | Nom_Barri | Seccio_Censal | Import_Euros | year |
|------|------|----------------|---------------|------------|-----------|---------------|--------------|------|
| 0    | 2015 | 1  | Ciutat Vella | 1  | el Raval | 1 | 12845 | 2015 |
| 1    | 2015 | 1  | Ciutat Vella | 1  | el Raval | 2 | 10442 | 2015 |
| 2    | 2015 | 1  | Ciutat Vella | 1  | el Raval | 3 | 10048 | 2015 |
| 3    | 2015 | 1  | Ciutat Vella | 1  | el Raval | 4 | 13121 | 2015 |
| 4    | 2015 | 1  | Ciutat Vella | 1  | el Raval | 5 | 10579 | 2015 |
| ...  | ...  | ... | ... | ... | ... | ... | ... | ... |
| 8539 | 2022 | 10 | Sant Martí | 73 | la Verneda i la Pau | 143 | 16402 | 2022 |
| 8540 | 2022 | 10 | Sant Martí | 65 | el Clot | 234 | 21047 | 2022 |
| 8541 | 2022 | 10 | Sant Martí | 69 | Diagonal Mar i el Front Marítim del Poblenou | 235 | 18576 | 2022 |
| 8542 | 2022 | 10 | Sant Martí | 69 | Diagonal Mar i el Front Marítim del Poblenou | 236 | 19369 | 2022 |
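
Because the merge functions skip missing yearly files instead of failing, the merged file can cover fewer years than its name suggests. A small check along the following lines makes the gap explicit. `missing_years` is a hypothetical helper, and the demo frame only mimics the years that the log above reports as loaded (2015 to 2022).

```python
import pandas as pd


def missing_years(df, first_year, last_year, year_col="year"):
    """Return the years in [first_year, last_year] that have no rows in df."""
    present = set(df[year_col].unique().tolist())
    return [y for y in range(first_year, last_year + 1) if y not in present]


# Demo frame: one row per year that the log reports as loaded.
demo = pd.DataFrame({"year": list(range(2015, 2023))})
print(missing_years(demo, 2015, 2025))  # [2023, 2024, 2025]
```

Calling it on the DataFrame returned by `merge_renda_disponible_llars_per_persona` would list the years for which no input file was found.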


## Export Barcelona geometric data to GeoJSON

In [None]:
import geopandas as gpd

# Read directly from the zip (no need to unzip manually)
gdf = gpd.read_file("../data/raw/barcelona/BCN_UNITATS_ADM.zip")
gdf = gdf.to_crs(epsg=4326)
# Save to GeoJSON
gdf.to_file("../data/preprocessed/barcelona/barcelona_neighborhoods.geojson", driver="GeoJSON")

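
One way to sanity-check the exported file without geopandas is to open it with the standard `json` module. The sketch below assumes the export is a regular GeoJSON `FeatureCollection`. The helper name `summarize_geojson` is hypothetical. The GeoJSON specification (RFC 7946) expects WGS 84 longitude/latitude coordinates, which is consistent with the reprojection to EPSG:4326 in the cell above.

```python
import json
from collections import Counter
from pathlib import Path


def summarize_geojson(path):
    """Return (feature count, geometry-type counts, property names) for a GeoJSON file."""
    data = json.loads(Path(path).read_text(encoding="utf-8"))
    if data.get("type") != "FeatureCollection":
        raise ValueError(f"Expected a FeatureCollection, got {data.get('type')!r}")
    features = data["features"]
    # Count geometry types, skipping features whose geometry is null.
    geom_types = Counter(
        f["geometry"]["type"] for f in features if f.get("geometry") is not None
    )
    # Collect every property name that appears on at least one feature.
    prop_names = sorted({key for f in features for key in (f.get("properties") or {})})
    return len(features), dict(geom_types), prop_names
```

It could be called as `summarize_geojson("../data/preprocessed/barcelona/barcelona_neighborhoods.geojson")` once the export cell has run. The property names it returns depend on the source shapefile and are not shown in this notebook.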


In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import re

# Set plotting style
sns.set_style('whitegrid')
plt.rcParams['figure.figsize'] = (12, 6)

## Load Data

Load the raw data for Paris from the data directory.

In [19]:
# IRIS with respective label, to be matched with neighbourhoods ("quartier")
iris_reference = pd.read_excel('../data/raw/paris/reference_IRIS_geo2025.xlsx', sheet_name = "Emboitements_IRIS", header = 5)
iris_reference.head()

iris_filtered = iris_reference[iris_reference["DEP"] == "75"].drop(columns = ["TYP_IRIS", "UU2020", "REG", "DEP"])
iris_filtered.head()

|       | CODE_IRIS | LIB_IRIS | GRD_QUART | DEPCOM | LIBCOM |
|-------|-----------|----------|-----------|--------|--------|
| 37678 | 751010101 | Saint-Germain l'Auxerrois 1 | 7510101 | 75101 | Paris 1er Arrondissement |
| 37679 | 751010102 | Saint-Germain l'Auxerrois 2 | 7510101 | 75101 | Paris 1er Arrondissement |
| 37680 | 751010103 | Saint-Germain l'Auxerrois 3 | 7510101 | 75101 | Paris 1er Arrondissement |
| 37681 | 751010104 | Saint-Germain l'Auxerrois 4 | 7510101 | 75101 | Paris 1er Arrondissement |
| 37682 | 751010105 | Tuileries | 7510101 | 75101 | Paris 1er Arrondissement |
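
The comment in the cell says the IRIS units are to be matched with neighbourhoods ("quartier"). A minimal sketch of one possible first step follows, assuming that `GRD_QUART` identifies the quarter each IRIS belongs to. That reading is based on the column name only and is not confirmed by the notebook. The five rows are copied from the output above, with the codes stored as strings, which is also an assumption about the real dtypes.

```python
import pandas as pd

# The five rows shown in the filtered output, codes kept as strings (assumed).
iris_sample = pd.DataFrame(
    {
        "CODE_IRIS": ["751010101", "751010102", "751010103", "751010104", "751010105"],
        "LIB_IRIS": [
            "Saint-Germain l'Auxerrois 1",
            "Saint-Germain l'Auxerrois 2",
            "Saint-Germain l'Auxerrois 3",
            "Saint-Germain l'Auxerrois 4",
            "Tuileries",
        ],
        "GRD_QUART": ["7510101"] * 5,
        "DEPCOM": ["75101"] * 5,
    }
)

# One row per quarter: how many IRIS it contains and which commune it is in.
iris_per_quarter = (
    iris_sample.groupby("GRD_QUART")
    .agg(n_iris=("CODE_IRIS", "nunique"), commune=("DEPCOM", "first"))
    .reset_index()
)
print(iris_per_quarter)
```

Applied to `iris_filtered`, the same aggregation would give a lookup from quarter code to its IRIS count, which could later serve as a join key for other Paris datasets.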


## Exploratory Data Analysis

Analyze the characteristics of the neighborhoods in Paris.

In [None]:
# TODO: Add exploratory analysis