# Data Transformation & Preprocessing OSM Libraries

This notebook:
- Selects relevant columns
- Normalizes raw data through standardize (naming conventions, address formats, and category labels)
- Merges contact fields
- Creates point geometry


In [60]:
# Imports & Load Data

import geopandas as gpd
import pandas as pd

from geopy.geocoders import Nominatim             # Import Nominatim geocoder from geopy to convert addresses/places to coordinates
from geopy.extra.rate_limiter import RateLimiter  # Import rate limiter to not exceed API limits(slow down geocoding to avoid blocking)
from tqdm import tqdm                             # Progress bar for tracking enrichment progress

# Load OSM Libraries Data
gdf = gpd.read_file("../sources/osm_libraries.geojson")


In [61]:
# Count total number of missing values per column
missing_count = gdf.isna().sum().sort_values(ascending=False)

# Build table with counts and % of missing values
# This is important to understand data quality - so we can drop the columns with too many missing values

missing = pd.DataFrame({
    "missing_count": missing_count,
    "missing_pct": (missing_count / len(gdf) * 100).round(1)
}).sort_values(by="missing_pct", ascending=False)

display(missing)

Unnamed: 0,missing_count,missing_pct
dog,150,99.3
layer,150,99.3
addr:floor,150,99.3
wheelchair:description:eo,150,99.3
wheelchair:description:en,150,99.3
...,...,...
name,2,1.3
id,0,0.0
element,0,0.0
amenity,0,0.0


üîë Key Attributes for a Library Layer
| Key Attribute         | Source Field(s)                                                                 | Definition                                                                                                                   |
|-----------------------|----------------------------------------------------------------------------------|-------------------------------------------------------------------------------------------------------------------------------|
| Library Name          | name, name:en, alt_name, short_name                                             | The primary, official name of the library. Use name or the most appropriate language version if available.                    |
| Library Type          | amenity, operator:type, type, operator                                          | The function or category (e.g., public library ‚Äì implied by `amenity=library`, university library, research library) and the type of managing body. |
| Address               | addr:street, addr:housenumber, addr:postcode, addr:city, addr:country           | The full physical street address for navigation and location.                                                                 |
| Geolocation           | latitude, longitude, geom_point                                                  | The geographic coordinates (latitude and longitude) of the library's location.                                                |
| Contact Info          | phone, contact:phone, email, contact:email, website, contact:website            | Key methods for contacting the library (phone number, email address, and official website URL).                               |
| Opening Hours         | opening_hours                                                                    | The regular schedule indicating when the library is open to the public.                                                       |
| Accessibility         | wheelchair, toilets:wheelchair, level, access                                   | Indicators of physical accessibility, primarily for mobility (e.g., wheelchair access status).                                 |
| Managing Organization | operator, operator:type, network                                                | The name or type of the organization, institution, or network that runs the library.                                          |
| Core Services         | internet_access, room:group_study, room:study_cabin, service:copy, service:scanner, toilets | Availability of essential on-site resources like Wi-Fi (internet_access), study rooms, and other key services.                |


In [62]:
# Columns to Keep constants
KEEP_COLUMNS = [
    "name", "amenity", "id", "operator:type", "operator",
    "addr:street", "addr:housenumber", "addr:postcode",
    "addr:city", "addr:country", 
    "opening_hours", "wheelchair", "toilets:wheelchair", "service:scanner",
    "level", "internet_access", "ref:isil", "service:copy",
    "email", "contact:email", 
    "phone", "contact:phone",
    "website", "contact:website",
    "geometry",
]


In [63]:
# Cleaning Function

def clean_libraries(gdf: gpd.GeoDataFrame) -> gpd.GeoDataFrame:
    gdf = gdf[[c for c in KEEP_COLUMNS if c in gdf.columns]].copy()

    gdf = gdf.rename(columns={
        "operator:type": "operator_type",
        "addr:street": "street",
        "addr:housenumber": "housenumber",
        "addr:postcode": "postcode",
        "addr:city": "city",
        "id": "library_id",
        "addr:country": "country",
        "wheelchair": "wheelchair_accessible",
        "toilets:wheelchair": "toilets_wheelchair",
        "service:copy": "service_copy",
        "service:scanner": "service_scanner",
        "ref:isil": "isil_code",
        "contact:email": "contact_email",
        "contact:phone": "contact_phone",
        "contact:website": "contact_website",
    })

    gdf["final_email"] = gdf["email"].fillna(gdf["contact_email"])
    gdf["final_phone"] = gdf["phone"].fillna(gdf["contact_phone"])
    gdf["website_url"] = gdf["website"].fillna(gdf["contact_website"])

    gdf["geom_point"] = gdf.geometry.centroid
    gdf["longitude"] = gdf.geom_point.x
    gdf["latitude"] = gdf.geom_point.y

    gdf["name"] = gdf["name"].fillna("unknown")
    gdf = gdf.drop_duplicates(subset=["name", "street", "housenumber"])

    return gdf


In [64]:
# Execute Transformation

cleaned_gdf = clean_libraries(gdf)
cleaned_gdf.head()



  gdf["geom_point"] = gdf.geometry.centroid


Unnamed: 0,name,amenity,library_id,operator_type,operator,street,housenumber,postcode,city,country,...,contact_phone,website,contact_website,geometry,final_email,final_phone,website_url,geom_point,longitude,latitude
0,Bruno-L√∂sche-Bibliothek,library,29071031,,,Perleberger Stra√üe,33,10559,Berlin,DE,...,,http://www.berlin.de/stadtbibliothek-mitte/bib...,,POINT (13.34751 52.53124),,+49 30901833025,http://www.berlin.de/stadtbibliothek-mitte/bib...,POINT (13.34751 52.53124),13.347514,52.531245
1,Anton-Saefkow-Bibliothek,library,60848456,,,Anton-Saefkow-Platz,14,10369,Berlin,DE,...,+4930902963773,http://www.berlin.de/ba-lichtenberg/auf-einen-...,,POINT (13.47084 52.53078),,+4930902963773,http://www.berlin.de/ba-lichtenberg/auf-einen-...,POINT (13.47084 52.53078),13.470838,52.530777
2,Stadtteilbibliothek Erich Weinert,library,203557001,,,Helene-Weigel-Platz,4,12681,Berlin,DE,...,,https://www.berlin.de/bibliotheken-mh/biblioth...,,POINT (13.53872 52.52816),,+49 30 5429251,https://www.berlin.de/bibliotheken-mh/biblioth...,POINT (13.53872 52.52816),13.538715,52.528158
3,Stadtteilbibliothek Halemweg,library,256922190,,,Halemweg,18,13627,Berlin,DE,...,,,,POINT (13.28719 52.5375),,,,POINT (13.28719 52.5375),13.287186,52.537504
4,Bezirkszentralbibliothek Spandau,library,257708789,,,Carl-Schurz-Stra√üe,13,13597,Berlin,DE,...,+49 30 90279 5537,,https://www.berlin.de/stadtbibliothek-spandau/...,POINT (13.20139 52.53613),bibliothek@ba-spandau.berlin.de,+49 30 90279 5537,https://www.berlin.de/stadtbibliothek-spandau/...,POINT (13.20139 52.53613),13.201386,52.536133


In [65]:
# Use geopy(Geopy Nominatim) with rate limiting to fill in missing address components (street, housenumber, postcode) for rows that have coordinates but lack address details.
# Crucial Step: You need to instantiate the geocoding service and apply rate limiting.

#from geopy.geocoders import Nominatim
#from geopy.extra.rate_limiter import RateLimiter


# --- Set up geocoder and rate limiter ---
geolocator = Nominatim(user_agent="libraries_in_berlin")  # Instantiate geocoder with a custom user agent
geocode = RateLimiter(geolocator.reverse, min_delay_seconds=1.5)  # Limit requests to 1 every 1.5 seconds

# --- Function to fill missing address fields for a row ---
def enrich_missing_address(row):
    # Check if any address components are missing and coordinates are present
    needed = (
        pd.isna(row.get("street")) or
        pd.isna(row.get("housenumber")) or
        pd.isna(row.get("postcode")) or
        pd.isna(row.get("city")) or
        pd.isna(row.get("country"))
    )
    if needed and not (pd.isna(row.get("latitude")) or pd.isna(row.get("longitude"))):
        try:
            # Reverse geocode using coordinates
            location = geocode((row["latitude"], row["longitude"]), exactly_one=True, language="en")
            if location and location.raw and "address" in location.raw:
                address = location.raw["address"]
                # Fill missing street name
                row["street"] = row["street"] or address.get("road")
                # Fill missing house number
                row["housenumber"] = row["housenumber"] or address.get("house_number")
                # Fill missing postcode
                row["postcode"] = row["postcode"] or address.get("postcode")
                # Fill missing city
                row["city"] = row["city"] or address.get("city")
                # Fill missing country
                row["country"] = row["country"] or address.get("country")
        except Exception:
            # Fail silently if geocoding fails
            pass
    return row

print("\nStarting Nominatim Address Enrichment (rate limited)...")
cols = ["street", "housenumber", "postcode", "city", "country"]
# Find rows with any missing address component
missing_idx = cleaned_gdf[cols].isna().any(axis=1).index

if not missing_idx.empty:
    tqdm.pandas(desc="Geocoding missing addresses")
    # Apply enrichment function only to missing rows, with progress bar
    cleaned_gdf.loc[missing_idx] = cleaned_gdf.loc[missing_idx].progress_apply(enrich_missing_address, axis=1)
else:
    print("No rows require address enrichment.")

print("Nominatim address enrichment complete.")


Starting Nominatim Address Enrichment (rate limited)...


Geocoding missing addresses: 100%|‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà| 150/150 [01:13<00:00,  2.03it/s]

Nominatim address enrichment complete.





In [74]:
# Goal: Verify lat/lon look realistic.
# Why? If values are way off, something went wrong in conversion.

print("Latitude range:", cleaned_gdf["latitude"].min(), "to", cleaned_gdf["latitude"].max())

print("Longitude range:", cleaned_gdf["longitude"].min(), "to", cleaned_gdf["longitude"].max())

Latitude range: 52.3859756 to 52.6356441
Longitude range: 13.1429895 to 13.6214013


In [75]:
print("\nTop 10 streets:")
print(cleaned_gdf["street"].value_counts().head(10))


Top 10 streets:
street
Garystra√üe             3
Stra√üe des 17. Juni    3
Potsdamer Stra√üe       2
Dorotheenstra√üe        2
Hauptstra√üe            2
Malteserstra√üe         2
Prenzlauer Allee       2
Unter den Linden       2
Greifswalder Stra√üe    2
Bonhoefferweg          2
Name: count, dtype: int64


In [76]:
# Example: most commen wheelchair_accessible
print("\nTop wheelchair_accessible:")
print(cleaned_gdf["wheelchair_accessible"].value_counts().head(10))


Top wheelchair_accessible:
wheelchair_accessible
yes        74
limited    19
no         10
Name: count, dtype: int64


In [77]:
# Count total number of missing values per column
missing_count = cleaned_gdf.isna().sum().sort_values(ascending=False)

# Build table with counts and % of missing values
# This is important to understand data quality - so we can drop the columns with too many missing values

missing = pd.DataFrame({
    "missing_count": missing_count,
    "missing_pct": (missing_count / len(cleaned_gdf) * 100).round(1)
}).sort_values(by="missing_pct", ascending=False)

display(missing)

Unnamed: 0,missing_count,missing_pct
level,125,83.3
email,119,79.3
toilets_wheelchair,114,76.0
operator_type,112,74.7
phone,109,72.7
contact_phone,103,68.7
contact_website,101,67.3
final_email,98,65.3
isil_code,96,64.0
operator,96,64.0


In [78]:
# 2.1 Remove columns with > 85% missing data (unless useful for metadata).

cols_to_drop = [
    "service_scanner", "service_copy", "contact_email"
]

cleaned_gdf = cleaned_gdf.drop(columns=cols_to_drop, errors="ignore") # Drop sparse columns

In [79]:
cleaned_gdf.head()

Unnamed: 0,name,amenity,library_id,operator_type,operator,street,housenumber,postcode,city,country,...,contact_phone,website,contact_website,geometry,final_email,final_phone,website_url,geom_point,longitude,latitude
0,Bruno-L√∂sche-Bibliothek,library,29071031,,,Perleberger Stra√üe,33,10559,Berlin,DE,...,,http://www.berlin.de/stadtbibliothek-mitte/bib...,,POINT (13.34751 52.53124),,+49 30901833025,http://www.berlin.de/stadtbibliothek-mitte/bib...,POINT (13.34751 52.53124),13.347514,52.531245
1,Anton-Saefkow-Bibliothek,library,60848456,,,Anton-Saefkow-Platz,14,10369,Berlin,DE,...,+4930902963773,http://www.berlin.de/ba-lichtenberg/auf-einen-...,,POINT (13.47084 52.53078),,+4930902963773,http://www.berlin.de/ba-lichtenberg/auf-einen-...,POINT (13.47084 52.53078),13.470838,52.530777
2,Stadtteilbibliothek Erich Weinert,library,203557001,,,Helene-Weigel-Platz,4,12681,Berlin,DE,...,,https://www.berlin.de/bibliotheken-mh/biblioth...,,POINT (13.53872 52.52816),,+49 30 5429251,https://www.berlin.de/bibliotheken-mh/biblioth...,POINT (13.53872 52.52816),13.538715,52.528158
3,Stadtteilbibliothek Halemweg,library,256922190,,,Halemweg,18,13627,Berlin,DE,...,,,,POINT (13.28719 52.5375),,,,POINT (13.28719 52.5375),13.287186,52.537504
4,Bezirkszentralbibliothek Spandau,library,257708789,,,Carl-Schurz-Stra√üe,13,13597,Berlin,DE,...,+49 30 90279 5537,,https://www.berlin.de/stadtbibliothek-spandau/...,POINT (13.20139 52.53613),bibliothek@ba-spandau.berlin.de,+49 30 90279 5537,https://www.berlin.de/stadtbibliothek-spandau/...,POINT (13.20139 52.53613),13.201386,52.536133


In [81]:
cleaned_gdf.columns

Index(['name', 'amenity', 'library_id', 'operator_type', 'operator', 'street',
       'housenumber', 'postcode', 'city', 'country', 'opening_hours',
       'wheelchair_accessible', 'toilets_wheelchair', 'level',
       'internet_access', 'isil_code', 'email', 'phone', 'contact_phone',
       'website', 'contact_website', 'geometry', 'final_email', 'final_phone',
       'website_url', 'geom_point', 'longitude', 'latitude'],
      dtype='object')

In [82]:
# Save Output
cleaned_gdf = cleaned_gdf.drop(columns=["geom_point"])  # Drop the extra geometry column before saving. 
cleaned_gdf.to_file("../sources/libraries_cleaned.geojson", driver="GeoJSON")
