# Spatial Join/Enrichment: Assigning District and Neighbourhood

This step assigns a district (*Bezirk*), a neighbourhood (*Ortsteil*), and the corresponding IDs to each library using a spatial join.

To enrich each library with its administrative context:

- **Load boundaries**: Import the official Berlin *Ortsteil* boundaries from lor_ortsteile.geojson.
- **Perform spatial join**: The spatial join (using predicate="within") matches each library's point geometry with the polygon representing the *Ortsteil* (neighbourhood) in which it lies.
- **Rename columns**: Columns BEZIRK (district) and OTEIL (neighbourhood) are renamed to district and neighbourhood for consistency; spatial_name becomes neighbourhood_id.
- **Add district IDs**: Each district is mapped to its **official statistical ID** (district_id) according to Regionalstatistik Berlin/Brandenburg.

These standardized district identifiers allow the libraries to be integrated with other city datasets (e.g., population, demographics, infrastructure).

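This enrichment step closes the remaining geographic gaps left after geocoding. Libraries whose point falls outside every neighbourhood polygon receive no district in the left join; they are flagged by the validation checks and dropped before export, so every exported record is anchored within a known Berlin district and neighbourhood.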
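To make the "within" predicate concrete, here is a minimal pure-Python sketch of a point-in-polygon test using ray casting. It is only an illustration of the idea, not how GeoPandas or Shapely implement the predicate; the rectangles and the helper names (`point_in_polygon`, `assign_neighbourhood`) are hypothetical and are not real Berlin boundaries.

```python
from typing import Dict, List, Optional, Tuple

Point = Tuple[float, float]  # (longitude, latitude)


def point_in_polygon(point: Point, polygon: List[Point]) -> bool:
    """Ray casting: count how many polygon edges a ray going right from the point crosses.

    Points lying exactly on an edge are not handled specially in this sketch.
    """
    x, y = point
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Only edges that straddle the horizontal line through the point can be crossed
        if (y1 > y) != (y2 > y):
            # x-coordinate at which the edge crosses that horizontal line
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


# Hypothetical rectangular "neighbourhoods", for illustration only
neighbourhoods: Dict[str, List[Point]] = {
    "A": [(13.0, 52.4), (13.4, 52.4), (13.4, 52.6), (13.0, 52.6)],
    "B": [(13.4, 52.4), (13.8, 52.4), (13.8, 52.6), (13.4, 52.6)],
}


def assign_neighbourhood(point: Point) -> Optional[str]:
    """Return the first neighbourhood containing the point, or None (like a left join miss)."""
    for name, polygon in neighbourhoods.items():
        if point_in_polygon(point, polygon):
            return name
    return None


print(assign_neighbourhood((13.2, 52.5)))  # A
print(assign_neighbourhood((13.6, 52.5)))  # B
print(assign_neighbourhood((14.0, 52.5)))  # None
```

The `None` result corresponds to the missing district values that a left spatial join produces for points outside every polygon.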

In [21]:
import geopandas as gpd
import pandas as pd
import numpy as np

In [11]:
# Load official Berlin districts GeoDataFrame from lor_ortsteile.geojson
berlin_districts = gpd.read_file("../sources/lor_ortsteile.geojson")


# Inspect column names to find the district name column
print(berlin_districts.columns)

Index(['gml_id', 'spatial_name', 'spatial_alias', 'spatial_type', 'OTEIL',
       'BEZIRK', 'FLAECHE_HA', 'geometry'],
      dtype='object')


In [12]:
# --- 1. Load the libraries and reproject to EPSG:4326 ---
gdf = gpd.read_file("../sources/libraries_cleaned.geojson")
gdf = gdf.to_crs(epsg=4326)

# Ensure the district data is also 4326 for a valid spatial join
berlin_districts_gdf = berlin_districts.to_crs(epsg=4326)

In [13]:
# --- 2. Perform the Spatial Join ---

# 'spatial_name' holds the neighbourhood identifier; it is renamed to neighbourhood_id below
library_df_district = gpd.sjoin(
    gdf,
    berlin_districts_gdf[["BEZIRK", "OTEIL", "spatial_name", "geometry"]],
    how="left",
    predicate="within"
)

# --- 3. Rename Columns to Match SQL Schema ---

library_df_district = library_df_district.rename(columns={
    "BEZIRK": "district",               # Bezirk name
    "OTEIL": "neighbourhood",           # Ortsteil name
    "spatial_name": "neighbourhood_id"  # Ortsteil identifier (district_id is added in a later cell)
}).drop(columns=["index_right"])  # Drop the index column added by sjoin
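A left spatial join can leave rows unmatched (point outside every polygon) or duplicate rows (point inside overlapping polygons). The following is a small sketch of a sanity check for both cases, run here on a toy DataFrame; `summarise_join` is a hypothetical helper, not part of GeoPandas.

```python
import pandas as pd


def summarise_join(joined: pd.DataFrame, key: str = "library_id",
                   matched_col: str = "district") -> dict:
    """Count unmatched rows and duplicated keys in the result of a left join."""
    return {
        "rows": len(joined),
        # Rows where the joined column is missing found no containing polygon
        "unmatched": int(joined[matched_col].isna().sum()),
        # A key appearing more than once means the point matched several polygons
        "duplicated_keys": int(joined[key].duplicated().sum()),
    }


# Toy stand-in for the join output (illustrative values only)
toy = pd.DataFrame({
    "library_id": [1, 2, 3, 3],
    "district": ["Mitte", None, "Pankow", "Pankow"],
})
print(summarise_join(toy))
# {'rows': 4, 'unmatched': 1, 'duplicated_keys': 1}
```

In the notebook, calling `summarise_join(library_df_district)` after the rename would report how many libraries still lack a district before any rows are dropped.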

In [14]:
library_df_district.head()

Unnamed: 0,name,amenity,library_id,operator_type,operator,street,housenumber,postcode,city,country,...,contact_website,final_email,final_phone,website_url,longitude,latitude,geometry,district,neighbourhood,neighbourhood_id
0,Bruno-Lösche-Bibliothek,library,29071031,,,Perleberger Straße,33,10559,Berlin,DE,...,,,+49 30901833025,http://www.berlin.de/stadtbibliothek-mitte/bib...,13.347514,52.531245,POINT (13.34751 52.53124),Mitte,Moabit,102
1,Anton-Saefkow-Bibliothek,library,60848456,,,Anton-Saefkow-Platz,14,10369,Berlin,DE,...,,,+4930902963773,http://www.berlin.de/ba-lichtenberg/auf-einen-...,13.470838,52.530777,POINT (13.47084 52.53078),Lichtenberg,Fennpfuhl,1111
2,Stadtteilbibliothek Erich Weinert,library,203557001,,,Helene-Weigel-Platz,4,12681,Berlin,DE,...,,,+49 30 5429251,https://www.berlin.de/bibliotheken-mh/biblioth...,13.538715,52.528158,POINT (13.53872 52.52816),Marzahn-Hellersdorf,Marzahn,1001
3,Stadtteilbibliothek Halemweg,library,256922190,,,Halemweg,18,13627,Berlin,DE,...,,,,,13.287186,52.537504,POINT (13.28719 52.5375),Charlottenburg-Wilmersdorf,Charlottenburg-Nord,406
4,Bezirkszentralbibliothek Spandau,library,257708789,,,Carl-Schurz-Straße,13,13597,Berlin,DE,...,https://www.berlin.de/stadtbibliothek-spandau/...,bibliothek@ba-spandau.berlin.de,+49 30 90279 5537,https://www.berlin.de/stadtbibliothek-spandau/...,13.201386,52.536133,POINT (13.20139 52.53613),Spandau,Spandau,501


In [15]:
# Generating district ids


# District mapping (official codes as strings)
district_mapping = {
    'Mitte': '11001001',
    'Friedrichshain-Kreuzberg': '11002002',
    'Pankow': '11003003',
    'Charlottenburg-Wilmersdorf': '11004004',
    'Spandau': '11005005',
    'Steglitz-Zehlendorf': '11006006',
    'Tempelhof-Schöneberg': '11007007',
    'Neukölln': '11008008',
    'Treptow-Köpenick': '11009009',
    'Marzahn-Hellersdorf': '11010010',
    'Lichtenberg': '11011011',
    'Reinickendorf': '11012012'
}

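# Apply mapping to create the district_id column. The mapping values are already
# strings, so no cast is needed; rows without a matched district keep a missing district_id.
library_df_district['district_id'] = library_df_district['district'].map(district_mapping)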

print("\nSpatial join complete. Ready for WKT conversion and database upload.")
display(library_df_district[['district_id', 'district', 'neighbourhood', 'neighbourhood_id']].head())


Spatial join complete. Ready for WKT conversion and database upload.


Unnamed: 0,district_id,district,neighbourhood,neighbourhood_id
0,11001001,Mitte,Moabit,102
1,11011011,Lichtenberg,Fennpfuhl,1111
2,11010010,Marzahn-Hellersdorf,Marzahn,1001
3,11004004,Charlottenburg-Wilmersdorf,Charlottenburg-Nord,406
4,11005005,Spandau,Spandau,501
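The name-to-ID mapping depends on exact string matches between the district names and the dictionary keys, so a spelling variant (for example a transliterated umlaut) would silently produce a missing ID. A minimal sketch of a coverage check, with a hypothetical helper `unmapped_names` and a deliberately misspelled toy value:

```python
def unmapped_names(names, mapping):
    """Return the distinct non-missing names that have no entry in `mapping`."""
    return sorted({n for n in names if n is not None and n not in mapping})


# Toy data: 'Neukoelln' is an intentional misspelling of the key 'Neukölln'
mapping = {"Mitte": "11001001", "Neukölln": "11008008"}
observed = ["Mitte", "Neukoelln", None, "Mitte"]

print(unmapped_names(observed, mapping))  # ['Neukoelln']
```

With the notebook's data, the input could be `library_df_district['district'].dropna().unique()`; an empty result means every observed district name has an ID.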


In [16]:
print("✅ Dataset after cleaning and transforming\n")

# Shape of dataframe
print(f"Number of rows: {library_df_district.shape[0]}")
print(f"Number of columns: {library_df_district.shape[1]}")

# Column list
print("\nRemaining columns:")
print(library_df_district.columns.tolist())

# Missing values check
missing = library_df_district.isnull().sum()
print("\nMissing values after cleaning and transforming :")
print(missing)

✅ Dataset after cleaning and transforming

Number of rows: 150
Number of columns: 31

Remaining columns:
['name', 'amenity', 'library_id', 'operator_type', 'operator', 'street', 'housenumber', 'postcode', 'city', 'country', 'opening_hours', 'wheelchair_accessible', 'toilets_wheelchair', 'level', 'internet_access', 'isil_code', 'email', 'phone', 'contact_phone', 'website', 'contact_website', 'final_email', 'final_phone', 'website_url', 'longitude', 'latitude', 'geometry', 'district', 'neighbourhood', 'neighbourhood_id', 'district_id']

Missing values after cleaning and transforming :
name                       0
amenity                    0
library_id                 0
operator_type            112
operator                  96
street                     1
housenumber               15
postcode                   0
city                       0
country                    0
opening_hours             28
wheelchair_accessible     47
toilets_wheelchair       114
level                    125
inte

In [18]:
# --- 1. Configuration (Define Constraints) ---

# Fields required to be NOT NULL in the SQL table
NOT_NULL_FIELDS = [
    'library_id', 'name', 'amenity', 'postcode', 'city', 'country',
    'latitude', 'longitude', 'district_id', 'district'
]

# Coordinate bounds for Berlin (Data Accuracy Check)
BERLIN_BOUNDS = {
    'lat_min': 52.33, 'lat_max': 52.68,
    'lon_min': 13.08, 'lon_max': 13.77
}

# SQL VARCHAR length limits to check against
MAX_LENGTHS = {
    'name': 255,
    'library_id': 20, # VARCHAR(20) check
    'district_id': 8  # Full 8-digit ID (e.g., '11001001')
}

# --- 2. Validation Functions ---

def run_completeness_check(df: pd.DataFrame, fields: list) -> bool:
    """Checks for missing values in mandatory NOT NULL fields."""
    print("### A. Completeness Check (NOT NULL) ###")
    
    # Calculate missing values for the mandatory fields
    missing_data = df[fields].isnull().sum()
    critical_missing = missing_data[missing_data > 0]

    if critical_missing.empty:
        print("✅ PASS: All mandatory fields are 100% complete.")
        return True
    else:
        print("❌ FAIL: The following mandatory fields have missing values:")
        print(critical_missing)
        # Display sample rows with issues for quick debugging
        null_rows = df[df[critical_missing.index].isnull().any(axis=1)]
        print(f"\nSample of {len(null_rows)} problematic rows:")
        print(null_rows[['library_id', 'name'] + critical_missing.index.tolist()].head())
        return False


def run_consistency_check(df: pd.DataFrame, lengths: dict) -> bool:
    """Checks data types and ensures field content respects SQL length constraints."""
    print("\n### B. Consistency Check (Types & Lengths) ###")
    passed = True
    
    # 1. Primary Key Type Conversion (library_id)
    # Cast to pandas nullable integer (Int64), then to string to match SQL VARCHAR.
    # Missing IDs become the string '<NA>' and are mapped back to NaN.
    # Note: this modifies df in place.
    try:
        df['library_id'] = df['library_id'].astype('Int64').astype(str).replace('<NA>', np.nan)
        print("✅ PASS: library_id successfully standardized to string type.")
    except Exception as e:
        print(f"❌ FAIL: library_id conversion error: {e}")
        passed = False
    
    # 2. Length Constraint Check (e.g., name, library_id)
    for col, max_len in lengths.items():
        # Cast to string before measuring; missing values never exceed the limit
        long_items = df[df[col].astype(str).str.len() > max_len]
        if not long_items.empty:
            print(f"❌ FAIL: {col} has {len(long_items)} item(s) exceeding VARCHAR({max_len}).")
            passed = False
        else:
            print(f"✅ PASS: {col} length checked against VARCHAR({max_len}) limit.")
            
    return passed


def run_accuracy_check(df: pd.DataFrame, bounds: dict) -> bool:
    """Validates geographic coordinates fall within the expected Berlin boundaries."""
    print("\n### C. Data Accuracy Check (Geographic Bounds) ###")

    # Check bounds simultaneously for latitude and longitude
    out_of_bounds = df[
        (df['latitude'] < bounds['lat_min']) | (df['latitude'] > bounds['lat_max']) |
        (df['longitude'] < bounds['lon_min']) | (df['longitude'] > bounds['lon_max'])
    ]

    if out_of_bounds.empty:
        print("✅ PASS: All coordinates fall within the expected Berlin bounding box.")
        return True
    else:
        print(f"❌ FAIL: {len(out_of_bounds)} record(s) are outside Berlin's bounds.")
        print(out_of_bounds[['library_id', 'latitude', 'longitude']].head())
        return False

def run_relational_check(df: pd.DataFrame) -> bool:
    """Checks district_id format and lists unique IDs for external parent table verification."""
    print("\n### D. Relational Integrity Check (District ID) ###")

    # Ensure district_id is string and has correct length
    bad_ids = df[~df['district_id'].astype(str).str.match(r"^\d{8}$")]
    if bad_ids.empty:
        print("✅ PASS: All district_id values are exactly 8 digits.")
        print(f"Unique districts in this dataset: {df['district_id'].nunique()}")
        print(df['district_id'].unique())
        return True
    else:
        print(f"❌ FAIL: {len(bad_ids)} district_id(s) are not 8-digit strings.")
        print(bad_ids[['district_id', 'district', 'library_id']].head())
        return False
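A notebook cell displays only the return value of its last expression, so when the four checks are called one after another, only the final result is shown as `True` or `False`. One possible way to keep all results is a small runner that collects them in a dictionary. This is a sketch; `run_all_checks` is a hypothetical helper and the lambdas are stand-ins for the real checks.

```python
from typing import Callable, Dict


def run_all_checks(checks: Dict[str, Callable[[], bool]]) -> Dict[str, bool]:
    """Run each named check and record its result; an exception counts as a failure."""
    results: Dict[str, bool] = {}
    for name, check in checks.items():
        try:
            results[name] = bool(check())
        except Exception as exc:
            print(f"{name}: raised {exc!r}")
            results[name] = False
    return results


def failing_check() -> bool:
    """Stand-in for a check that crashes."""
    raise ValueError("demo failure")


# Stand-in checks for illustration only
demo = run_all_checks({
    "completeness": lambda: True,
    "relational": lambda: False,
    "broken": failing_check,
})
print(demo)
print("all passed:", all(demo.values()))
```

In the notebook, the entries could wrap the functions above, for example `"completeness": lambda: run_completeness_check(library_df_district, NOT_NULL_FIELDS)`.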



In [19]:
print("✅ Dataset after cleaning and transforming\n")

# Shape of dataframe
print(f"Number of rows: {library_df_district.shape[0]}")
print(f"Number of columns: {library_df_district.shape[1]}")

# Column list
print("\nRemaining columns:")
print(library_df_district.columns.tolist())

# Missing values check
missing = library_df_district.isnull().sum()
print("\nMissing values after cleaning and transforming :")
print(missing)

# Run the validation checks after the summary
run_completeness_check(library_df_district, NOT_NULL_FIELDS)
run_consistency_check(library_df_district, MAX_LENGTHS)
run_accuracy_check(library_df_district, BERLIN_BOUNDS)
run_relational_check(library_df_district)

✅ Dataset after cleaning and transforming

Number of rows: 150
Number of columns: 31

Remaining columns:
['name', 'amenity', 'library_id', 'operator_type', 'operator', 'street', 'housenumber', 'postcode', 'city', 'country', 'opening_hours', 'wheelchair_accessible', 'toilets_wheelchair', 'level', 'internet_access', 'isil_code', 'email', 'phone', 'contact_phone', 'website', 'contact_website', 'final_email', 'final_phone', 'website_url', 'longitude', 'latitude', 'geometry', 'district', 'neighbourhood', 'neighbourhood_id', 'district_id']

Missing values after cleaning and transforming :
name                       0
amenity                    0
library_id                 0
operator_type            112
operator                  96
street                     1
housenumber               15
postcode                   0
city                       0
country                    0
opening_hours             28
wheelchair_accessible     47
toilets_wheelchair       114
level                    125
inte

False

In [20]:
library_df_district = library_df_district.dropna(subset=["district", "district_id"])
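Dropping rows silently makes it hard to trace which libraries were lost. A minimal sketch, on toy data, of separating the unassigned rows first so they can be reviewed or logged; `split_unassigned` is a hypothetical helper.

```python
import pandas as pd


def split_unassigned(df: pd.DataFrame, cols=("district", "district_id")):
    """Split df into (kept, dropped), where dropped rows miss any of `cols`."""
    mask = df[list(cols)].isna().any(axis=1)
    return df[~mask].copy(), df[mask].copy()


# Toy data (illustrative values only)
toy = pd.DataFrame({
    "library_id": ["1", "2", "3"],
    "name": ["A", "B", "C"],
    "district": ["Mitte", None, "Pankow"],
    "district_id": ["11001001", None, "11003003"],
})

kept, dropped = split_unassigned(toy)
print(f"kept {len(kept)} rows, dropped {len(dropped)} rows")
print(dropped[["library_id", "name"]])
```

The `kept` frame is equivalent to the result of `dropna(subset=[...])` on the same columns, while `dropped` preserves the removed records for inspection.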

In [22]:
print("✅ Dataset after cleaning and transforming\n")

# Shape of dataframe
print(f"Number of rows: {library_df_district.shape[0]}")
print(f"Number of columns: {library_df_district.shape[1]}")

# Column list
print("\nRemaining columns:")
print(library_df_district.columns.tolist())

# Missing values check
missing = library_df_district.isnull().sum()
print("\nMissing values after cleaning and transforming :")
print(missing)

# Run the validation checks after the summary
run_completeness_check(library_df_district, NOT_NULL_FIELDS)
run_consistency_check(library_df_district, MAX_LENGTHS)
run_accuracy_check(library_df_district, BERLIN_BOUNDS)
run_relational_check(library_df_district)

✅ Dataset after cleaning and transforming

Number of rows: 149
Number of columns: 31

Remaining columns:
['name', 'amenity', 'library_id', 'operator_type', 'operator', 'street', 'housenumber', 'postcode', 'city', 'country', 'opening_hours', 'wheelchair_accessible', 'toilets_wheelchair', 'level', 'internet_access', 'isil_code', 'email', 'phone', 'contact_phone', 'website', 'contact_website', 'final_email', 'final_phone', 'website_url', 'longitude', 'latitude', 'geometry', 'district', 'neighbourhood', 'neighbourhood_id', 'district_id']

Missing values after cleaning and transforming :
name                       0
amenity                    0
library_id                 0
operator_type            111
operator                  96
street                     1
housenumber               15
postcode                   0
city                       0
country                    0
opening_hours             28
wheelchair_accessible     47
toilets_wheelchair       114
level                    124
inte

True

In [24]:
# --- Final Export of Fixed Data ---
library_df_district.to_file("../sources/libraries_db_unified.geojson", driver="GeoJSON")

# library_df_district.to_csv("../sources/libraries_db_unified.csv", index=False)
# print("\n✅ SUCCESS: Fixed dataset exported for database to ../sources/libraries_db_unified.csv")
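After the export, it can be worth confirming that the written file contains the expected number of features and that none lacks a district ID. The sketch below reads a GeoJSON file with the standard library only; `summarise_geojson` is a hypothetical helper and the toy file stands in for the real export.

```python
import json
import os
import tempfile


def summarise_geojson(path, required_property="district_id"):
    """Return (feature_count, number_of_features_missing_the_required_property)."""
    with open(path, encoding="utf-8") as fh:
        collection = json.load(fh)
    features = collection.get("features", [])
    missing = sum(
        1 for feature in features
        # 'properties' may be null in GeoJSON, hence the 'or {}'
        if (feature.get("properties") or {}).get(required_property) in (None, "")
    )
    return len(features), missing


# Toy FeatureCollection written to a temporary file (illustrative values only)
toy = {
    "type": "FeatureCollection",
    "features": [
        {"type": "Feature",
         "properties": {"name": "A", "district_id": "11001001"},
         "geometry": {"type": "Point", "coordinates": [13.35, 52.53]}},
        {"type": "Feature",
         "properties": {"name": "B", "district_id": None},
         "geometry": {"type": "Point", "coordinates": [13.47, 52.53]}},
    ],
}

with tempfile.TemporaryDirectory() as tmp_dir:
    toy_path = os.path.join(tmp_dir, "toy.geojson")
    with open(toy_path, "w", encoding="utf-8") as fh:
        json.dump(toy, fh)
    result = summarise_geojson(toy_path)

print(result)  # (2, 1)
```

Applied to the exported file, the first number should equal the row count reported in the final summary and the second should be zero.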