# Bike Lane Data Transformations

## About this Notebook

1. Importing Libraries & Loading Data
2. Cleaning & Standardising OSM Data
3. Prepare the Ortsteile Dataset
4. Spatial Join
5. Data Validation
6. Quick EDA & Addressing Remaining Missing `street_name` Values
7. Calculating Lengths of Bikelanes
8. Data Check
9. Duplicate `bikelane_id`s
10. Final Check of `geometry` Column

**Total records**: 
- 81,343

**Columns with null values**:
- `street_name`: 307 missing (0.38% of dataset). These can be replaced with 'Unknown' or left as null values.

**Duplicate `bikelane_id`s**: 
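- 1,677 IDs appear more than once (about 2.1% of the 78,833 unique IDs), adding 2,510 repeat rows (about 3.1% of all rows).
- All duplicates correspond to bikelanes spanning across multiple districts or neighborhoods.
- No duplicated ID has all of its rows within a single district and neighborhood, so the check in section 9.3 finds no unexplained duplicates.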
- A subset of 142 bikelanes spans multiple streets, representing long continuous lanes across different areas.

**Duplicate `street_names`**:
- Many occurrences per district (e.g., A 100 in Charlottenburg-Wilmersdorf appears 60 times).
- These duplicates reflect multiple bikelane segments along the same street.

**Duplicate `geometry` and `length_m`**:
- 814 repeated geometries
- Duplicates occur because single bikelanes span multiple neighborhoods/districts.
- Corresponding lengths are also duplicated for these repeated segments.
- Example: way/894678838 has one segment each in Marzahn, Lichtenberg, and Alt-Hohenschönhausen; each segment appears three times with the same geometry and `length_m`.
- These duplicates are spatially valid and required for linking bikelanes to administrative boundaries.

**`Geometry` types**:
- LineString: 81,212
- MultiLineString: 131
- Remaining MultiLineStrings represent genuine multi-segment bikelanes spanning several areas.

**`street_name`s per district (sample & note)**:
- Charlottenburg-Wilmersdorf:
    - A 100 appears 60 times
    - A 111 appears 6 times
    - AVUS appears 2 times
    - Other streets (e.g., Aachener Straße, Abbestraße) also appear multiple times
- These counts reflect multiple bikelane segments per street, not errors.

**Conclusion**:
- The dataset is spatially consistent: duplicate `bikelane_id`s, `street_names`, `geometries`, and `lengths` can be explained by real-world lane spans across districts and neighborhoods.
- Missing `street_name`s are minimal and can be handled as 'Unknown' if necessary.
- No further deduplication is required for `bikelane_id`, `geometry`, or `length_m`.
- All anomalies (duplicate street names, MultiLineStrings) are expected in urban spatial data.

## 1. Importing Libraries and Loading Data

### 1.1. Importing Libraries

In [1]:
import pandas as pd
import geopandas as gpd
import numpy as np
from shapely.geometry import LineString, MultiLineString

### 1.2. Loading Data

#### Data Source Aliases
1. **osm** = osm_bikelanes dataset *(Used as the base layer for the final bikelane dataset)*
2. **ort** = lor_ortsteile dataset *(Provides the neighborhood and district keys that link the bikelanes to the ERD; attached to the OSM data via spatial join)*

In [2]:
osm = gpd.read_file("osm_bikelanes_raw.geojson")
ort = gpd.read_file("lor_ortsteile.geojson")

### 1.3. Confirming the CRS (`EPSG:4326`)

In [3]:
print("OSM CRS:", osm.crs)
print("Ortsteile CRS:", ort.crs)

OSM CRS: EPSG:4326
Ortsteile CRS: EPSG:4326


### 1.4. Overview of Raw Data Tables

#### OSM - *OpenStreetMap*

##### Note:
The raw OSM table has 1,064 columns (1,041 object, 22 datetime, and 1 geometry), far more than this project needs.

Only the following columns are kept:

- *`id`* (renamed to `bikelane_id`)
- *`name`* (renamed to `street_name`)
- *`geometry`*


In [4]:
print("OSM:", osm.shape)

OSM: (78865, 1064)


In [5]:
print("OSM Bike Lanes Columns & Data Types")
osm.info()

OSM Bike Lanes Columns & Data Types
<class 'geopandas.geodataframe.GeoDataFrame'>
RangeIndex: 78865 entries, 0 to 78864
Columns: 1064 entries, id to geometry
dtypes: datetime64[ms](22), geometry(1), object(1041)
memory usage: 640.2+ MB


In [6]:
osm.head(5)

Unnamed: 0,id,@id,HFCS,NHS,NJDOT_SRI,TMC:cid_58:tabcd_1:Class,TMC:cid_58:tabcd_1:Direction,TMC:cid_58:tabcd_1:LCLversion,TMC:cid_58:tabcd_1:LocationCode,TMC:cid_58:tabcd_1:NextLocationCode,...,wikipedia,wikipedia:left,wikipedia:right,winter_service,workrules,year_of_construction,zone:maxspeed,zone:parking,zone:traffic,geometry
0,way/43998936,way/43998936,,,,,,,,,...,,,,,,,,,,"POLYGON ((13.60278 52.53787, 13.60269 52.53786..."
1,way/517805554,way/517805554,,,,,,,,,...,,,,,,,,,,"POLYGON ((13.46367 52.47111, 13.46348 52.47116..."
2,way/1186003574,way/1186003574,,,,,,,,,...,,,,,,,,,,"POLYGON ((13.34615 52.58978, 13.34614 52.58978..."
3,way/1186011275,way/1186011275,,,,,,,,,...,,,,,,,,,,"POLYGON ((13.34569 52.58962, 13.3457 52.58961,..."
4,way/1187324842,way/1187324842,,,,,,,,,...,,,,,,,,,,"POLYGON ((13.42565 52.48772, 13.42563 52.48772..."


#### Ortsteile (ORT) - *Neighborhoods*

In [7]:
print("Ortsteile:", ort.shape)

Ortsteile: (96, 8)


In [8]:
print("Ortsteile Columns & Data Types")
ort.info()

Ortsteile Columns & Data Types
<class 'geopandas.geodataframe.GeoDataFrame'>
RangeIndex: 96 entries, 0 to 95
Data columns (total 8 columns):
 #   Column         Non-Null Count  Dtype   
---  ------         --------------  -----   
 0   gml_id         96 non-null     object  
 1   spatial_name   96 non-null     object  
 2   spatial_alias  96 non-null     object  
 3   spatial_type   96 non-null     object  
 4   OTEIL          96 non-null     object  
 5   BEZIRK         96 non-null     object  
 6   FLAECHE_HA     96 non-null     float64 
 7   geometry       96 non-null     geometry
dtypes: float64(1), geometry(1), object(6)
memory usage: 6.1+ KB


In [9]:
ort.head(5)

Unnamed: 0,gml_id,spatial_name,spatial_alias,spatial_type,OTEIL,BEZIRK,FLAECHE_HA,geometry
0,re_ortsteil.0101,101,Mitte,Polygon,Mitte,Mitte,1063.8748,"POLYGON ((13.41649 52.52696, 13.41635 52.52702..."
1,re_ortsteil.0102,102,Moabit,Polygon,Moabit,Mitte,768.7909,"POLYGON ((13.33884 52.51974, 13.33884 52.51974..."
2,re_ortsteil.0103,103,Hansaviertel,Polygon,Hansaviertel,Mitte,52.5337,"POLYGON ((13.34322 52.51557, 13.34323 52.51557..."
3,re_ortsteil.0104,104,Tiergarten,Polygon,Tiergarten,Mitte,516.0672,"POLYGON ((13.36879 52.49878, 13.36891 52.49877..."
4,re_ortsteil.0105,105,Wedding,Polygon,Wedding,Mitte,919.9112,"POLYGON ((13.34656 52.53879, 13.34664 52.53878..."


## 2. Cleaning & Standardising OSM

### 2.1. Column Names

In [10]:
# Rename the OSM fields to the project's naming convention
osm = osm.rename(columns={"id": "bikelane_id", "name": "street_name"})
# Keep only the columns needed for the final bikelane table
osm = osm[["bikelane_id", "street_name", "geometry"]]

osm.head()

Unnamed: 0,bikelane_id,street_name,geometry
0,way/43998936,Fritz-Lang-Platz,"POLYGON ((13.60278 52.53787, 13.60269 52.53786..."
1,way/517805554,,"POLYGON ((13.46367 52.47111, 13.46348 52.47116..."
2,way/1186003574,,"POLYGON ((13.34615 52.58978, 13.34614 52.58978..."
3,way/1186011275,,"POLYGON ((13.34569 52.58962, 13.3457 52.58961,..."
4,way/1187324842,,"POLYGON ((13.42565 52.48772, 13.42563 52.48772..."


### 2.2. Validating Geometry & CRS types

In [11]:
osm = osm[osm.geometry.notnull()].copy()

In [12]:
osm = osm.set_crs("EPSG:4326", allow_override=True)

### 2.3. Addressing Geometry Column

In [13]:
print(osm.geom_type.value_counts())

LineString    78860
Polygon           5
Name: count, dtype: int64


In [14]:
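from shapely.geometry import LineString, MultiLineString

def polygon_to_linestring(geom):
    """Convert polygon outlines to line geometries; leave other types untouched."""
    if geom.geom_type == "Polygon":
        # Use the outer ring as the line geometry
        return LineString(geom.exterior.coords)
    elif geom.geom_type == "MultiPolygon":
        # One line per polygon, so separate parts are not joined by spurious segments
        return MultiLineString([list(poly.exterior.coords) for poly in geom.geoms])
    else:
        return geom  # keep LineString/other as-is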

osm["geometry"] = osm.geometry.apply(polygon_to_linestring)

## 3. Prepare the Ortsteile Dataset

### 3.1. Column Names & CRS Normalisation

In [15]:
# harmonize the CRS
gdf_neighborhoods = ort.to_crs("EPSG:4326")

# rename columns for consistency
gdf_neighborhoods = gdf_neighborhoods.rename(columns={
    "BEZIRK": "district_name",
    "OTEIL": "neighborhood_name",
    "spatial_name": "neighborhood_id",
    "gml_id": "district_id"
})
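
Note that `gml_id` (for example `re_ortsteil.0101`) identifies an Ortsteil, so the column named `district_id` above is unique per neighborhood rather than per district. If a district-level key is needed, one option is to take the first two digits of the four-digit Ortsteil key. The sketch below assumes that these leading digits encode the district (Bezirk), which is consistent with the sample rows in this notebook (`0101` to `0105` all belong to Mitte). `district_key` is a hypothetical helper name and not part of the notebook.

```python
def district_key(neighborhood_id) -> str:
    """Return the two-digit district prefix of a four-digit Ortsteil key.

    Assumption: the first two digits encode the district (Bezirk).
    """
    key = str(neighborhood_id).zfill(4)  # restore a leading zero lost on import, e.g. "801"
    return key[:2]

# Ortsteil keys taken from the notebook's sample output
sample = {"0101": "Mitte", "0105": "Mitte", "0801": "Neukölln", "1209": "Reinickendorf"}
district_keys = {ortsteil: district_key(ortsteil) for ortsteil in sample}
print(district_keys)
```

With such a key, a check for lanes that cross a district boundary can be separated from one for lanes that only cross a neighborhood boundary.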

## 4. Spatial Join

### 4.1. Conducting Spatial Overlay

In [16]:
# Make sure the bikelanes are a GeoDataFrame with an explicit CRS
gdf_bikelanes = gpd.GeoDataFrame(osm, geometry="geometry", crs="EPSG:4326")

bikelanes_enriched = gpd.overlay(
    gdf_bikelanes,
    gdf_neighborhoods,
    how="intersection",
    keep_geom_type=True
)

# Quick check
print(bikelanes_enriched.head())
print(bikelanes_enriched.shape)

      bikelane_id       street_name       district_id neighborhood_id  \
0    way/43998936  Fritz-Lang-Platz  re_ortsteil.1005            1005   
1   way/517805554              None  re_ortsteil.0801            0801   
2  way/1186003574              None  re_ortsteil.1201            1201   
3  way/1186003574              None  re_ortsteil.1209            1209   
4  way/1186011275              None  re_ortsteil.1201            1201   

   spatial_alias spatial_type neighborhood_name        district_name  \
0    Hellersdorf      Polygon       Hellersdorf  Marzahn-Hellersdorf   
1       Neukölln      Polygon          Neukölln             Neukölln   
2  Reinickendorf      Polygon     Reinickendorf        Reinickendorf   
3       Wittenau      Polygon          Wittenau        Reinickendorf   
4  Reinickendorf      Polygon     Reinickendorf        Reinickendorf   

   FLAECHE_HA                                           geometry  
0    811.3239  LINESTRING (13.60278 52.53787, 13.60269 52.537
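
With `how="intersection"`, the overlay clips each bikelane to every polygon it crosses, so a lane that crosses a boundary produces one row per neighborhood. The sketch below illustrates the idea with plain shapely on toy geometries; the two squares and the line are hypothetical and only stand in for Ortsteile polygons and a bikelane.

```python
from shapely.geometry import LineString, Polygon

# Two adjacent toy "neighborhoods" sharing the boundary x = 1
west = Polygon([(0, 0), (1, 0), (1, 1), (0, 1)])
east = Polygon([(1, 0), (2, 0), (2, 1), (1, 1)])

# A toy "bikelane" that crosses the shared boundary
lane = LineString([(0.5, 0.5), (1.5, 0.5)])

# Intersection keeps only the part of the lane inside each polygon
pieces = {name: lane.intersection(poly) for name, poly in [("west", west), ("east", east)]}
for name, piece in pieces.items():
    print(name, piece.geom_type, piece.length)
```

Each polygon receives its own piece of the lane, which is why one `bikelane_id` can appear in several rows after the overlay.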

### 4.2. Selecting Relevant Columns

In [17]:
bikelanes_enriched = bikelanes_enriched[[
    "bikelane_id",
    "street_name",
    "district_id",
    "district_name",
    "neighborhood_id",
    "neighborhood_name",
    "geometry"
]]

## 5. Data Validation
**Note:** Missing `street_name` values are filled via reverse geocoding **before** calculating bikelane lengths and checking whether bikelanes intersect multiple neighborhoods/districts.

#### 5.1. Checking for missing `street_name` values

In [18]:
missing_streets = bikelanes_enriched[bikelanes_enriched["street_name"].isna()]
print("Missing street names:", len(missing_streets))

Missing street names: 13344


#### 5.2. Reverse Geocoding for the missing `street_name`s

##### Import Libraries for Reverse Geocoding

In [19]:
from geopy.geocoders import Nominatim
from geopy.exc import GeocoderTimedOut
from time import sleep
from tqdm.notebook import tqdm

##### Initialising the Geolocator & Building Function to Get the Missing `street_name`s

**Note:** To generate the missing `street_name` values, the function uses the centroid of each LineString/MultiLineString geometry.

In [34]:
geolocator = Nominatim(user_agent="berlin_bikelanes_locator")

def get_street_name(lat, lon, max_retries=3):
    attempt = 0
    while attempt < max_retries:
        try:
            location = geolocator.reverse((lat, lon), exactly_one=True)
            sleep(1)  # avoid being blocked by API
            if location and "address" in location.raw:
                address = location.raw["address"]
                return (
                    address.get("road")
                    or address.get("pedestrian")
                    or address.get("footway")
                    or address.get("street")
                    or None
                )
            return None
        except GeocoderTimedOut:
            attempt += 1
            print(f"⚠️ Timeout at ({lat}, {lon}), retrying {attempt}/{max_retries}...")
            sleep(2)
        except Exception as e:
            print(f"❌ Error at ({lat}, {lon}): {e}")
            return None
    print(f"❌ Failed after {max_retries} retries: ({lat}, {lon})")
    return None
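
The fallback order used above (`road`, then `pedestrian`, `footway`, `street`) can be separated from the network call so that it is testable offline. The following is a minimal sketch; `pick_street_name` and the example dictionaries are hypothetical, and real Nominatim responses may carry other keys.

```python
def pick_street_name(address: dict):
    """Return the first non-empty street-like field, or None if there is none."""
    for key in ("road", "pedestrian", "footway", "street"):
        value = address.get(key)
        if value:
            return value
    return None

# Hypothetical address dictionaries shaped like location.raw["address"]
examples = [
    {"road": "Sonnenallee", "suburb": "Neukölln"},
    {"footway": "Uferweg"},
    {"suburb": "Tegel"},
]
picked = [pick_street_name(a) for a in examples]
print(picked)
```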

##### Filtering Rows with Missing `street_name` values

In [35]:
missing_street_rows = bikelanes_enriched[bikelanes_enriched["street_name"].isna()].copy()
print(f"ℹ️ Total rows to geocode: {len(missing_street_rows)}")

ℹ️ Total rows to geocode: 13344


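#### Creating Centroid Points & Progress Bar

In [33]:
# Compute centroids in a metric CRS (EPSG:25833), then convert back to EPSG:4326;
# centroids taken directly from geographic coordinates are only approximate
missing_street_rows["rep_point"] = (
    missing_street_rows.geometry.to_crs("EPSG:25833").centroid.to_crs("EPSG:4326")
)

tqdm.pandas()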


#### Safeguard Against Losing Data

In [36]:
results = []
SAVE_INTERVAL = 50
partial_file = "street_name_progress_partial.csv"

#### Conduct the Reverse Geocoding

In [38]:
tqdm.pandas()

for i, row in tqdm(missing_street_rows.iterrows(), total=len(missing_street_rows)):
    centroid = row.geometry.centroid
    lat, lon = centroid.y, centroid.x
    street_name = get_street_name(lat, lon)
    
    results.append({
        "bikelane_id": row.bikelane_id,
        "latitude": lat,
        "longitude": lon,
        "street_name_filled": street_name
    })
    
    # Save progress every SAVE_INTERVAL rows
    if (i + 1) % SAVE_INTERVAL == 0:
        partial_df = pd.DataFrame(results)
        partial_df.to_csv(partial_file, index=False)
        print(f"💾 Saved progress after {i + 1} rows...")

  0%|          | 0/13344 [00:00<?, ?it/s]

💾 Saved progress after 3100 rows...
💾 Saved progress after 5850 rows...
💾 Saved progress after 6250 rows...
💾 Saved progress after 9250 rows...
💾 Saved progress after 12100 rows...
💾 Saved progress after 12350 rows...
💾 Saved progress after 12500 rows...
💾 Saved progress after 13000 rows...
💾 Saved progress after 13150 rows...
💾 Saved progress after 14150 rows...
💾 Saved progress after 15600 rows...
💾 Saved progress after 15750 rows...
💾 Saved progress after 16850 rows...
💾 Saved progress after 16900 rows...
💾 Saved progress after 16950 rows...
💾 Saved progress after 17500 rows...
💾 Saved progress after 17600 rows...
💾 Saved progress after 17750 rows...
💾 Saved progress after 17950 rows...
💾 Saved progress after 18050 rows...
💾 Saved progress after 18350 rows...
💾 Saved progress after 21850 rows...
💾 Saved progress after 22800 rows...
💾 Saved progress after 22900 rows...
💾 Saved progress after 23050 rows...
💾 Saved progress after 23250 rows...
💾 Saved progress after 23400 rows...
💾 Sav
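
In the loop above, `i` is the DataFrame index label of the row rather than the number of rows processed. That would explain why the progress messages arrive at irregular intervals and report values above the 13,344 rows being geocoded. A minimal sketch of counting processed rows with `enumerate` instead; `checkpoint_positions` is a hypothetical helper that only records when a save would happen.

```python
SAVE_INTERVAL = 50

def checkpoint_positions(index_labels, save_interval=SAVE_INTERVAL):
    """Return the processed-row counts at which a checkpoint would be written."""
    saves = []
    for n, _label in enumerate(index_labels, start=1):
        # n counts rows actually processed, independent of the index labels
        if n % save_interval == 0:
            saves.append(n)
    return saves

# Index labels with gaps, like the index of a filtered DataFrame
labels = range(3, 1000, 7)
saves = checkpoint_positions(labels)
print(len(labels), saves)
```

In the notebook loop this would correspond to iterating over `enumerate(missing_street_rows.iterrows(), start=1)` and testing the counter rather than the index label.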

#### Saving the Final Results & Merging into Main Table

In [39]:
final_results_df = pd.DataFrame(results)
final_results_df.to_csv("street_name_progress_final.csv", index=False)
print(f"✅ Reverse geocoding complete. Total rows processed: {len(final_results_df)}")

# Merge the geocoded names back into the main table.
# Caution: bikelane_id is not unique after the overlay, so this merge can multiply rows.
bikelanes_enriched = bikelanes_enriched.merge(
    final_results_df[["bikelane_id", "street_name_filled"]],
    on="bikelane_id",
    how="left"
)

✅ Reverse geocoding complete. Total rows processed: 13344


#### Updating the `street_name`s Column

In [40]:
bikelanes_enriched["street_name"] = bikelanes_enriched["street_name"].fillna(bikelanes_enriched["street_name_filled"])
bikelanes_enriched.drop(columns=["street_name_filled"], inplace=True)
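
Because `bikelane_id` is no longer unique after the overlay (a lane that crosses a boundary has one row per neighborhood), merging on it pairs every row of an ID with every geocoded result for that ID. The toy example below shows the effect and one possible alternative, filling by row index. It assumes the geocoding results keep the index of the rows they were computed from; the data and the `index=` argument are illustrative only.

```python
import pandas as pd

# Toy stand-in for bikelanes_enriched: one lane split across two neighborhoods
lanes = pd.DataFrame({
    "bikelane_id": ["way/1", "way/1", "way/2"],
    "neighborhood_id": ["1201", "1209", "0801"],
    "street_name": [None, None, "Sonnenallee"],
})

# Toy stand-in for the geocoding results: one row per *missing* input row
geocoded = pd.DataFrame(
    {"bikelane_id": ["way/1", "way/1"],
     "street_name_filled": ["Am Nordgraben", "Schorfheidestraße"]},
    index=[0, 1],  # same index labels as the rows that were geocoded
)

# Merging on the non-unique ID pairs every "way/1" row with every result row
merged = lanes.merge(geocoded, on="bikelane_id", how="left")
rows_after_merge = len(merged)

# Index-aligned fill keeps exactly one row per original row
filled = lanes.copy()
filled["street_name"] = filled["street_name"].fillna(geocoded["street_name_filled"])
rows_after_fill = len(filled)

print(rows_after_merge, rows_after_fill)
```

Comparing the row count before and after the merge is a quick way to see whether this multiplication has occurred in the real data.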


#### Checking the New `street_name` Column Values

In [41]:
print(bikelanes_enriched[["bikelane_id", "street_name"]].head(10))
print(f"Remaining missing street_name: {bikelanes_enriched['street_name'].isna().sum()}")


      bikelane_id        street_name
0    way/43998936   Fritz-Lang-Platz
1   way/517805554        Sonnenallee
2  way/1186003574      Am Nordgraben
3  way/1186003574  Schorfheidestraße
4  way/1186003574      Am Nordgraben
5  way/1186003574  Schorfheidestraße
6  way/1186011275      Am Nordgraben
7  way/1186011275  Schorfheidestraße
8  way/1186011275      Am Nordgraben
9  way/1186011275  Schorfheidestraße
Remaining missing street_name: 307


## 6. Quick EDA & Addressing Remaining Missing `street_name` Values

In [87]:
summary_data = []

# Total number of rows
total_rows = len(bikelanes_enriched)

# Loop through each column
for col in bikelanes_enriched.columns:
    num_nulls = bikelanes_enriched[col].isna().sum()
    num_unique = bikelanes_enriched[col].nunique()
    num_non_null_values = total_rows - num_nulls
    pct_missing = (num_nulls / total_rows) * 100
    pct_present = (num_non_null_values / total_rows) * 100
    num_duplicates = total_rows - num_unique - num_nulls  # duplicates = total - unique - nulls

    summary_data.append({
        "column": col,
        "total_rows": total_rows,
        "num_unique": num_unique,
        "num_duplicates": num_duplicates,
        "num_nulls": num_nulls,
        "num_non_null_values": num_non_null_values,
        "%_missing": round(pct_missing, 2),
        "%_present": round(pct_present, 2)
    })

# Convert to DataFrame
eda_summary = (
    pd.DataFrame(summary_data)
    .sort_values(by="%_missing", ascending=False)
    .reset_index(drop=True)
)

# Display neatly
print("📊 Column Completeness & Duplicates Summary:")
display(eda_summary)

📊 Column Completeness & Duplicates Summary:


Unnamed: 0,column,total_rows,num_unique,num_duplicates,num_nulls,num_non_null_values,%_missing,%_present
0,street_name,81343,8540,72496,307,81036,0.38,99.62
1,bikelane_id,81343,78833,2510,0,81343,0.0,100.0
2,district_id,81343,96,81247,0,81343,0.0,100.0
3,district_name,81343,12,81331,0,81343,0.0,100.0
4,neighborhood_id,81343,96,81247,0,81343,0.0,100.0
5,neighborhood_name,81343,96,81247,0,81343,0.0,100.0
6,geometry,81343,80517,826,0,81343,0.0,100.0
7,length_m,81343,80517,826,0,81343,0.0,100.0
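
The `num_duplicates` column counts repeat occurrences beyond the first among the non-null values, for example 81,343 - 8,540 - 307 = 72,496 for `street_name`. The sketch below checks on a toy column that this formula agrees with pandas' own `duplicated()`; the sample values are illustrative.

```python
import pandas as pd

# Toy column: 6 rows, 1 null, 3 distinct non-null values
s = pd.Series(["A 100", "A 100", "AVUS", None, "A 100", "A 111"])

total_rows = len(s)
num_nulls = int(s.isna().sum())
num_unique = s.nunique()  # nulls are excluded by default

# Formula used in the summary loop
by_formula = total_rows - num_unique - num_nulls

# Same quantity via pandas: repeats among the non-null values
by_pandas = int(s.dropna().duplicated().sum())

print(by_formula, by_pandas)
```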


There are still 307 missing `street_name` values, which make up only 0.38% of all rows. These could be listed as 'Unknown'.

## 7. Calculating Lengths of Bikelanes

1. Temporarily project to metric CRS (`EPSG:25833`) for accuracy.
2. Calculate the length in meters.
3. Return to `EPSG:4326`

In [50]:

bikelanes_enriched_metric = bikelanes_enriched.to_crs("EPSG:25833")


bikelanes_enriched_metric["length_m"] = bikelanes_enriched_metric.geometry.length


bikelanes_enriched = bikelanes_enriched_metric.to_crs("EPSG:4326")
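
The projection step matters because lengths measured directly in `EPSG:4326` would be in degrees, and at Berlin's latitude a degree of longitude covers far less ground than a degree of latitude. The sketch below is a rough cross-check using the haversine formula on a spherical Earth; it is not the method used in this notebook, and the coordinates are hypothetical.

```python
import math

def haversine_m(lat1, lon1, lat2, lon2, radius_m=6_371_000):
    """Great-circle distance in metres between two points (spherical approximation)."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * radius_m * math.asin(math.sqrt(a))

# Two hypothetical segments near Berlin, each 0.001 degrees long
east_west = haversine_m(52.5, 13.400, 52.5, 13.401)
north_south = haversine_m(52.500, 13.4, 52.501, 13.4)
print(round(east_west, 1), round(north_south, 1))
```

The same 0.001 degrees corresponds to roughly 68 m east-west but roughly 111 m north-south, so degree-based lengths are not comparable across directions.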

In [51]:
bikelanes_enriched.head()

Unnamed: 0,bikelane_id,street_name,district_id,district_name,neighborhood_id,neighborhood_name,geometry,centroid,latitude,longitude,length_m
0,way/43998936,Fritz-Lang-Platz,re_ortsteil.1005,Marzahn-Hellersdorf,1005,Hellersdorf,"LINESTRING (13.60278 52.53787, 13.60269 52.537...",POINT (13.60277 52.53762),52.537618,13.60277,163.576194
1,way/517805554,Sonnenallee,re_ortsteil.0801,Neukölln,801,Neukölln,"LINESTRING (13.46367 52.47111, 13.46348 52.471...",POINT (13.46352 52.47109),52.471086,13.463517,54.172358
2,way/1186003574,Am Nordgraben,re_ortsteil.1201,Reinickendorf,1201,Reinickendorf,"LINESTRING (13.34613 52.58977, 13.34616 52.589...",POINT (13.34621 52.58973),52.589728,13.34621,27.333073
3,way/1186003574,Schorfheidestraße,re_ortsteil.1201,Reinickendorf,1201,Reinickendorf,"LINESTRING (13.34613 52.58977, 13.34616 52.589...",POINT (13.34621 52.58973),52.589728,13.34621,27.333073
4,way/1186003574,Am Nordgraben,re_ortsteil.1209,Reinickendorf,1209,Wittenau,"MULTILINESTRING ((13.34615 52.58978, 13.34614 ...",POINT (13.34614 52.58978),52.589777,13.346142,2.378107


##### Dropping unnecessary `centroid`, `latitude` & `longitude` columns

In [70]:
bikelanes_enriched = bikelanes_enriched.drop(columns=["centroid", "latitude", "longitude"])
bikelanes_enriched.columns

Index(['bikelane_id', 'street_name', 'district_id', 'district_name',
       'neighborhood_id', 'neighborhood_name', 'geometry', 'length_m'],
      dtype='object')

In [71]:
bikelanes_enriched.head()

Unnamed: 0,bikelane_id,street_name,district_id,district_name,neighborhood_id,neighborhood_name,geometry,length_m
0,way/43998936,Fritz-Lang-Platz,re_ortsteil.1005,Marzahn-Hellersdorf,1005,Hellersdorf,"LINESTRING (13.60278 52.53787, 13.60269 52.537...",163.576194
1,way/517805554,Sonnenallee,re_ortsteil.0801,Neukölln,801,Neukölln,"LINESTRING (13.46367 52.47111, 13.46348 52.471...",54.172358
2,way/1186003574,Am Nordgraben,re_ortsteil.1201,Reinickendorf,1201,Reinickendorf,"LINESTRING (13.34613 52.58977, 13.34616 52.589...",27.333073
3,way/1186003574,Schorfheidestraße,re_ortsteil.1201,Reinickendorf,1201,Reinickendorf,"LINESTRING (13.34613 52.58977, 13.34616 52.589...",27.333073
4,way/1186003574,Am Nordgraben,re_ortsteil.1209,Reinickendorf,1209,Wittenau,"LINESTRING (13.34615 52.58978, 13.34615 52.589...",2.378107


## 8. Data Check

#### 8.1. Basic Overall Data Check

In [72]:
print("bikelanes_enriched dataset")
print("Shape:", bikelanes_enriched.shape)
print(bikelanes_enriched.info())

bikelanes_enriched dataset
Shape: (81343, 8)
<class 'geopandas.geodataframe.GeoDataFrame'>
RangeIndex: 81343 entries, 0 to 81342
Data columns (total 8 columns):
 #   Column             Non-Null Count  Dtype   
---  ------             --------------  -----   
 0   bikelane_id        81343 non-null  object  
 1   street_name        81036 non-null  object  
 2   district_id        81343 non-null  object  
 3   district_name      81343 non-null  object  
 4   neighborhood_id    81343 non-null  object  
 5   neighborhood_name  81343 non-null  object  
 6   geometry           81343 non-null  geometry
 7   length_m           81343 non-null  float64 
dtypes: float64(1), geometry(1), object(6)
memory usage: 5.0+ MB
None


In [73]:
bikelane_id_counts = bikelanes_enriched['bikelane_id'].value_counts()

# Inspect the most common duplicate counts
bikelane_id_counts.head(10)

bikelane_id
way/1160406678    9
way/894678838     9
way/963450770     9
way/936867362     9
way/1089640150    4
way/1323122054    4
way/1391917094    4
way/32608073      4
way/282611784     4
way/1119757603    4
Name: count, dtype: int64

In [74]:
duplicate_summary = bikelane_id_counts[bikelane_id_counts > 1]

print(f"Number of duplicated bikelane_ids: {len(duplicate_summary)}")
print(f"Total duplicate entries (all repeats combined): {duplicate_summary.sum() - len(duplicate_summary)}")

Number of duplicated bikelane_ids: 1677
Total duplicate entries (all repeats combined): 2510


In [75]:
bikelane_street_name = bikelanes_enriched['street_name'].value_counts()

# Inspect the most common duplicate counts
bikelane_street_name.head(10)

street_name
Landsberger Allee        535
Hauptstraße              435
Berliner Straße          362
Märkische Allee          303
Sonnenallee              289
Allee der Kosmonauten    277
Heerstraße               269
Blumberger Damm          241
Hohenzollerndamm         239
Seestraße                227
Name: count, dtype: int64

In [76]:
duplicate_street_names = bikelane_street_name[bikelane_street_name > 1]

print(f"Number of duplicated street_names: {len(duplicate_street_names)}")
print(f"Total duplicate entries (all repeats combined): {duplicate_street_names.sum() - len(duplicate_street_names)}")

Number of duplicated street_names: 6439
Total duplicate entries (all repeats combined): 72496


## 9. Duplicate `bikelane_id`s

#### 9.1. Checking for `bikelane_id`s that Intersect with Different neighborhoods/districts

In [77]:
# Aggregate all unique neighborhoods and districts per bikelane
bikelane_summary = bikelanes_enriched.groupby("bikelane_id").agg({
    "neighborhood_id": lambda x: list(x.unique()),
    "district_id": lambda x: list(x.unique())
}).reset_index()

# Add counts of unique neighborhoods and districts
bikelane_summary["num_unique_neighborhoods"] = bikelane_summary["neighborhood_id"].apply(len)
bikelane_summary["num_unique_districts"] = bikelane_summary["district_id"].apply(len)

# Optional: Inspect only bikelanes that span multiple neighborhoods or districts
bikelanes_multi_areas = bikelane_summary[
    (bikelane_summary["num_unique_neighborhoods"] > 1) |
    (bikelane_summary["num_unique_districts"] > 1)
]

# Preview
print(bikelanes_multi_areas.head(10))

        bikelane_id neighborhood_id                           district_id  \
6    way/1000383790    [0801, 0901]  [re_ortsteil.0801, re_ortsteil.0901]   
56   way/1002392336    [0801, 0902]  [re_ortsteil.0801, re_ortsteil.0902]   
62   way/1002431111    [0801, 0903]  [re_ortsteil.0801, re_ortsteil.0903]   
77   way/1002890683    [0501, 0509]  [re_ortsteil.0501, re_ortsteil.0509]   
92     way/10039499    [0601, 0701]  [re_ortsteil.0601, re_ortsteil.0701]   
108  way/1004875831    [0302, 1110]  [re_ortsteil.0302, re_ortsteil.1110]   
112  way/1004875837    [0302, 1110]  [re_ortsteil.0302, re_ortsteil.1110]   
139  way/1005195273    [1003, 1005]  [re_ortsteil.1003, re_ortsteil.1005]   
201  way/1007431584    [0503, 1202]  [re_ortsteil.0503, re_ortsteil.1202]   
214  way/1008241164    [0801, 0802]  [re_ortsteil.0801, re_ortsteil.0802]   

     num_unique_neighborhoods  num_unique_districts  
6                           2                     2  
56                          2               

#### 9.2. `bikelane_id`s Spanning Across Multiple `street_name`s

##### Finding `bikelane_id`s that appear more than once & filtering the DataFrame for just those values

In [78]:
duplicate_ids = bikelanes_enriched["bikelane_id"].value_counts()
duplicate_ids = duplicate_ids[duplicate_ids > 1].index

duplicates_df = bikelanes_enriched[bikelanes_enriched["bikelane_id"].isin(duplicate_ids)]

##### Grouping by `bikelane_id` column & checking the `street_name`s

In [79]:
streetname_check = (
    duplicates_df.groupby("bikelane_id")["street_name"]
    .nunique()
    .reset_index()
    .rename(columns={"street_name": "num_unique_street_names"})
)

##### Filtering for `bikelane_id`s that appear with multiple street names

In [80]:
multi_street_ids = streetname_check[streetname_check["num_unique_street_names"] > 1]
print("Number of bikelanes with multiple street names:", len(multi_street_ids))

Number of bikelanes with multiple street names: 142


##### Quick preview of `bikelane_id`s that span over multiple `street_names`

In [81]:
example_ids = multi_street_ids["bikelane_id"].head(10).tolist()
bikelanes_enriched[bikelanes_enriched["bikelane_id"].isin(example_ids)][
    ["bikelane_id", "street_name", "neighborhood_name", "district_name"]
].sort_values("bikelane_id")

Unnamed: 0,bikelane_id,street_name,neighborhood_name,district_name
38661,way/1002890683,Ziegelhof,Spandau,Spandau
38662,way/1002890683,Kerstenweg,Spandau,Spandau
38663,way/1002890683,Ziegelhof,Wilhelmstadt,Spandau
38664,way/1002890683,Kerstenweg,Wilhelmstadt,Spandau
38694,way/1004875837,Hansastraße,Weißensee,Pankow
38695,way/1004875837,Orankeweg,Weißensee,Pankow
38696,way/1004875837,Hansastraße,Alt-Hohenschönhausen,Lichtenberg
38697,way/1004875837,Orankeweg,Alt-Hohenschönhausen,Lichtenberg
38785,way/1007431584,Tegeler Brücke,Tegel,Reinickendorf
38784,way/1007431584,Straße L,Tegel,Reinickendorf


### 9.3. Drill down of results

As seen above, repeated `street_name`s often also span different neighborhoods and/or districts. The query below therefore assigns each duplicated `bikelane_id` to exactly one duplicate type using a hierarchical classification, so that no ID is counted twice.
The hierarchy runs from `district_id` to `neighborhood_id` to `street_name`.

In [82]:
# Group and summarize duplicates
dup_check = (
    bikelanes_enriched.groupby('bikelane_id')
    .agg({
        'street_name': pd.Series.nunique,
        'neighborhood_id': pd.Series.nunique,
        'district_id': pd.Series.nunique
    })
    .reset_index()
)

# Keep only IDs whose rows differ in street name, neighborhood, or district
dup_check = dup_check[(dup_check['street_name'] > 1) | 
                      (dup_check['neighborhood_id'] > 1) | 
                      (dup_check['district_id'] > 1)]

# Apply mutually exclusive classification
def classify_dup(row):
    # Prioritize based on hierarchy to avoid double counting
    if row['district_id'] > 1:
        return 'Cross-district'
    elif row['neighborhood_id'] > 1:
        return 'Cross-neighborhood'
    elif row['street_name'] > 1:
        return 'Multi-street'
    else:
        return 'Other'

dup_check['dup_type'] = dup_check.apply(classify_dup, axis=1)

# Summarize counts
dup_type_counts = dup_check['dup_type'].value_counts()
print(dup_type_counts)

dup_type
Cross-district    1677
Name: count, dtype: int64


In [83]:
# Identify duplicated IDs whose rows all fall within one district and one neighborhood
unexplained_dups = (
    bikelanes_enriched[bikelanes_enriched["bikelane_id"].isin(dup_check["bikelane_id"])]
    .groupby("bikelane_id")
    .filter(lambda x: x["district_name"].nunique() == 1 and x["neighborhood_name"].nunique() == 1)
)

print(f"Unexplained duplicates (same district & neighborhood): {unexplained_dups['bikelane_id'].nunique()}")

Unexplained duplicates (same district & neighborhood): 0
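
The filter above only flags IDs whose rows all sit in a single district and neighborhood. It does not flag an ID that crosses a boundary and also repeats inside one neighborhood; the sample for way/894678838 in section 10.2 shows three rows per neighborhood. A stricter check, sketched here on toy data with illustrative values, looks for repeated (`bikelane_id`, `neighborhood_id`) pairs.

```python
import pandas as pd

# Toy rows shaped like the notebook's table (values are illustrative)
df = pd.DataFrame({
    "bikelane_id": ["way/1", "way/1", "way/1", "way/2", "way/2"],
    "neighborhood_id": ["1001", "1001", "1101", "0801", "0802"],
})

# Flag every row whose (ID, neighborhood) pair occurs more than once
repeated = df.duplicated(subset=["bikelane_id", "neighborhood_id"], keep=False)
ids_with_repeats = sorted(df.loc[repeated, "bikelane_id"].unique())
print(ids_with_repeats)
```

Here `way/2` simply crosses a boundary, while `way/1` also has a repeated row within neighborhood `1001`.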


## 10. Final Check of `geometry` Column 

#### 10.1. Validating `geometry` type

In [84]:
bikelanes_enriched.geom_type.value_counts()

LineString         81212
MultiLineString      131
Name: count, dtype: int64

In [85]:
from shapely.geometry import MultiLineString, LineString
from shapely.ops import linemerge

def ensure_linestring(geom):
    if geom.geom_type == "MultiLineString":
        merged = linemerge(geom)
        # If merging results in a single continuous line, return it as LineString
        if isinstance(merged, LineString):
            return merged
        else:
            # If still MultiLineString, keep as-is
            return geom
    else:
        return geom

bikelanes_enriched["geometry"] = bikelanes_enriched["geometry"].apply(ensure_linestring)

In [86]:
bikelanes_enriched.geom_type.value_counts()

LineString         81212
MultiLineString      131
Name: count, dtype: int64
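
The counts are unchanged after `ensure_linestring`, which suggests that the 131 remaining MultiLineStrings consist of parts that do not connect end to end. The sketch below shows on toy geometries how `linemerge` behaves in the two cases; the coordinates are hypothetical.

```python
from shapely.geometry import MultiLineString
from shapely.ops import linemerge

# Parts that touch end to end can be merged into one LineString
touching = MultiLineString([[(0, 0), (1, 0)], [(1, 0), (2, 0)]])
# Parts separated by a gap cannot be merged
disjoint = MultiLineString([[(0, 0), (1, 0)], [(3, 0), (4, 0)]])

merged_touching = linemerge(touching)
merged_disjoint = linemerge(disjoint)
print(merged_touching.geom_type, merged_disjoint.geom_type)
```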

#### 10.2. Counting duplicate `geometry` values

In [88]:
geom_counts = bikelanes_enriched['geometry'].value_counts()
duplicate_geometries = geom_counts[geom_counts > 1]
print(f"Number of geometries that appear more than once: {len(duplicate_geometries)}")

Number of geometries that appear more than once: 814


In [89]:
top_geom = duplicate_geometries.head(5).index
bikelanes_enriched[bikelanes_enriched['geometry'].isin(top_geom)][
    ['bikelane_id','street_name','neighborhood_name','district_name','length_m']
]

Unnamed: 0,bikelane_id,street_name,neighborhood_name,district_name,length_m
35250,way/894678838,Landsberger Allee,Marzahn,Marzahn-Hellersdorf,55.487146
35251,way/894678838,Landsberger Allee,Marzahn,Marzahn-Hellersdorf,55.487146
35252,way/894678838,Landsberger Allee,Marzahn,Marzahn-Hellersdorf,55.487146
35253,way/894678838,Landsberger Allee,Lichtenberg,Lichtenberg,2.319407
35254,way/894678838,Landsberger Allee,Lichtenberg,Lichtenberg,2.319407
35255,way/894678838,Landsberger Allee,Lichtenberg,Lichtenberg,2.319407
35256,way/894678838,Landsberger Allee,Alt-Hohenschönhausen,Lichtenberg,5.857071
35257,way/894678838,Landsberger Allee,Alt-Hohenschönhausen,Lichtenberg,5.857071
35258,way/894678838,Landsberger Allee,Alt-Hohenschönhausen,Lichtenberg,5.857071
36295,way/936867362,Landsberger Allee,Lichtenberg,Lichtenberg,26.262257


In [90]:
street_dups = bikelanes_enriched.groupby(['district_name', 'street_name']).size().reset_index(name='count')
street_dups[street_dups['count'] > 1].head(10)

Unnamed: 0,district_name,street_name,count
0,Charlottenburg-Wilmersdorf,A 100,60
1,Charlottenburg-Wilmersdorf,A 111,6
2,Charlottenburg-Wilmersdorf,AVUS,2
3,Charlottenburg-Wilmersdorf,Aachener Straße,14
4,Charlottenburg-Wilmersdorf,Abbestraße,5
5,Charlottenburg-Wilmersdorf,Adam-von-Trott-Straße,6
6,Charlottenburg-Wilmersdorf,Ahornallee,13
7,Charlottenburg-Wilmersdorf,Ahrweilerstraße,6
8,Charlottenburg-Wilmersdorf,Akazienallee,5
9,Charlottenburg-Wilmersdorf,Albrecht-Achilles-Straße,5
