### 🧪 Step 1: Research & Data Modelling  
PR Branch Name: clubs-data-modelling  

This notebook documents the process for Step 1 of the "Clubs & Social Activities in Berlin" project:  

- **1.1 Data Source Discovery**  
- **1.2 Modelling & Planning**  
- **1.3 Prepare the /sources Directory**  
- **1.4 Review**  

---

### 🎯 Goal  
- Identify and document relevant data sources.  
- Select the key parameters for our use case.  
- Draft the planned table schema.  
- Plan cleaning and transformation steps before database population.  

---

## 1.1 Data Source Discovery  

**Topic:** Clubs & Social Activities in Berlin  

**Main source:**  
- **Name:** OpenStreetMap (OSM) via OSMnx / Overpass API  
- **Source and origin:** Public crowdsourced geospatial database  
- **Update frequency:** Continuous (dynamic)  
- **Data type:** Dynamic (API query using tags such as `club=*`, `leisure=*`, `sport=*`, `amenity=community_centre`)  
- **Reason for selection:**  
  - Covers a wide variety of sports clubs, cultural clubs, and social activity centers in Berlin  
  - Includes geospatial data (coordinates, polygons), names, addresses, and attributes  
  - Open, free, and queryable programmatically  

**Optional additional sources:**  
- **Name:** Berliner Turn- und Freizeitsport-Bund (BTFB) (https://btfb.de/vereinsservice/vereinssuche/#Vereine-im-Portrait)  
  - Source: Official Berlin sports association website  
  - Type: Static (manual export / scraping)  
  - Use: Provides official structured list of sports clubs in Berlin  

- **Name:** Berlin Open Data Portal (daten.berlin.de)  
  - Source: Berlin city government  
  - Type: Static or semi-static (CSV, GeoJSON)  
  - Use: Enrichment with official district boundaries or metadata  

**Enrichment potential:**  
- Use Berlin shapefiles (districts, neighborhoods) for spatial joins.  


---

## 1.2 Modelling & Planning  

**Key Parameters (planned):**  
- Identification: `name`, `club`, `category`, `subcategory`  
- Location: `address`, `district`, `geometry (lat/lon)`  
- Contact: `website`, `phone`, `email`  
- Attributes: `opening_hours`, `membership`, `fees`, `sport` / `leisure type`  
- Metadata: `source`, `last_updated`  

**Integration with existing tables:**  
- Join on `district_id` from the Berlin districts reference table.  


**Planned table schema:**  
```sql
CREATE TABLE berlin_clubs (
    club_id SERIAL PRIMARY KEY,
    name TEXT NOT NULL,
    club TEXT,
    leisure TEXT,
    sport TEXT,
    amenity TEXT,
    street TEXT,
    housenumber TEXT,
    postcode TEXT,
    district TEXT,
    city TEXT,
    country TEXT,
    district_id INT REFERENCES berlin_districts(district_id),
    latitude FLOAT NOT NULL,
    longitude FLOAT NOT NULL,
    website TEXT,
    phone TEXT,
    email TEXT,
    opening_hours TEXT,
    wheelchair TEXT
);
```

In [86]:
# Install Libraries

# !pip install osmnx geopandas

In [87]:
# Import Libraries

import osmnx as ox
import geopandas as gpd
import pandas as pd

In [88]:
ox.settings.use_cache = False

In [89]:
# Define multiple tags
# tags = {
    # "amenity": ["community_centre", "arts_centre", "youth_centre", "music_school"],
    # "leisure": ["sports_centre", "fitness_centre", "dance"],
    # "club": True  # will capture all club types
# }
tags = {
    "amenity": [
        "community_centre", "arts_centre", "social_centre", 
        "youth_centre", "social_club", "music_school","events_venue",
        "music_venue", 
        "dojo", "dancing_school","studio",
        "theatre"
    ],
    "leisure": [
       "sports_centre", "fitness_centre", "dance", 
        "hackerspace", "music_venue", "garden"
    ],
   "club": True 
}
clubs_gdf = ox.features_from_place("Berlin, Germany", tags)

print(clubs_gdf.head())
print(len(clubs_gdf), "clubs/activities found in Berlin")

                                   geometry       amenity     contact:phone  \
element id                                                                    
node    30012753  POINT (13.42919 52.49404)  events_venue  +49 30 338402320   
        60775321  POINT (13.48162 52.53862)           NaN               NaN   
        66917094  POINT (13.38888 52.52392)       theatre               NaN   
        66917098  POINT (13.38862 52.52362)       theatre   +49 30 27879030   
        66917115  POINT (13.38851 52.52067)       theatre   +49 30 203000-0   

                                       contact:website  \
element id                                               
node    30012753  http://www.umspannwerk-kreuzberg.de/   
        60775321                                   NaN   
        66917094                                   NaN   
        66917098                                   NaN   
        66917115                                   NaN   

                                     na

In [90]:
clubs_gdf = clubs_gdf.to_crs(epsg=4326)

In [91]:
# Reduce polygons and lines to a single point that lies inside the geometry; keep points as-is
clubs_gdf['geometry'] = clubs_gdf['geometry'].apply(lambda geom: geom if geom.geom_type == 'Point' else geom.representative_point())
# Extract latitude and longitude
clubs_gdf["latitude"] = clubs_gdf.geometry.y
clubs_gdf["longitude"] = clubs_gdf.geometry.x
clubs_gdf

element,id,geometry,amenity,contact:phone,contact:website,name,wheelchair,addr:housenumber,addr:street,club,addr:city,...,construction,manufacturer,monitoring:harvesting,type,not:name,length,maxdepth,communication:amateur_radio:pota,latitude,longitude
node,30012753,POINT (13.42919 52.49404),events_venue,+49 30 338402320,http://www.umspannwerk-kreuzberg.de/,Umspannwerk,yes,,,,,...,,,,,,,,,52.494042,13.429187
node,60775321,POINT (13.48162 52.53862),,,,KW76,,76,Konrad-Wolf-Straße,poker,,...,,,,,,,,,52.538623,13.481623
node,66917094,POINT (13.38888 52.52392),theatre,,,Friedrichstadt-Palast,yes,107,Friedrichstraße,,Berlin,...,,,,,,,,,52.523922,13.388879
node,66917098,POINT (13.38862 52.52362),theatre,+49 30 27879030,,Quatsch Comedy Club,limited,107,Friedrichstraße,,Berlin,...,,,,,,,,,52.523624,13.388621
node,66917115,POINT (13.38851 52.52067),theatre,+49 30 203000-0,,Kabarett-Theater Distel,yes,101,Friedrichstraße,,Berlin,...,,,,,,,,,52.520667,13.388505
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
way,1428999282,POINT (13.3565 52.56115),,,,,,,,,,...,,,,,,,,,52.561152,13.356499
way,1428999283,POINT (13.35644 52.56128),,,,,,,,,,...,,,,,,,,,52.561283,13.356439
way,1428999284,POINT (13.35655 52.56104),,,,,,,,,,...,,,,,,,,,52.561039,13.356547
way,1429788653,POINT (13.42016 52.47043),,,,Nanowald,,,,,,...,,,,,,,,,52.470430,13.420161


In [92]:
print(clubs_gdf.notnull().sum().sort_values(ascending=False).head(30))


geometry                6494
latitude                6494
longitude               6494
leisure                 4632
name                    3399
garden:type             2168
access                  2151
addr:street             1925
addr:housenumber        1909
addr:postcode           1843
addr:city               1800
amenity                 1614
addr:country            1297
addr:suburb             1283
website                 1237
wheelchair              1072
sport                    951
operator                 903
contact:website          771
building                 761
opening_hours            685
check_date               528
phone                    506
community_centre         479
contact:phone            436
club                     384
building:levels          334
contact:email            309
community_centre:for     308
wikidata                 299
dtype: int64
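The non-null counts are easier to compare as shares of the 6494 rows. A minimal sketch, using a handful of counts copied from the output above; the `completeness` helper is a hypothetical name, not part of the notebook:

```python
# Non-null counts copied from the output above (subset of columns).
TOTAL_ROWS = 6494
non_null_counts = {
    "name": 3399,
    "addr:street": 1925,
    "website": 1237,
    "opening_hours": 685,
    "club": 384,
}

def completeness(counts, total):
    """Return {column: percentage of rows with a value}, rounded to one decimal."""
    return {col: round(100 * n / total, 1) for col, n in counts.items()}

coverage = completeness(non_null_counts, TOTAL_ROWS)

# Print the columns from best to worst coverage.
for col, pct in sorted(coverage.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{col:<15} {pct:5.1f}%")
```

Only about half of the features carry a name, which is worth keeping in mind when rows without a name are dropped later.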


In [93]:
# Select important columns
important_cols = [
    "name",               
    "club",                 
    "leisure",             
    "sport",                
    "amenity",               
    "addr:street",           
    "addr:housenumber",
    "addr:suburb",      
    "addr:postcode",         
    "addr:city",
    "addr:country",            
    "website",              
    "phone",             
    "email",               
    "opening_hours",         
    "geometry" ,
    "wheelchair",
    "latitude",
    "longitude"              
]

In [94]:
clubs_df = clubs_gdf[important_cols].copy()

print(clubs_df.head(10))

                                      name   club leisure sport       amenity  \
element id                                                                      
node    30012753               Umspannwerk    NaN     NaN   NaN  events_venue   
        60775321                      KW76  poker     NaN   NaN           NaN   
        66917094     Friedrichstadt-Palast    NaN     NaN   NaN       theatre   
        66917098       Quatsch Comedy Club    NaN     NaN   NaN       theatre   
        66917115   Kabarett-Theater Distel    NaN     NaN   NaN       theatre   
        66917188            Admiralspalast    NaN     NaN   NaN       theatre   
        79808389             Die Wühlmäuse    NaN     NaN   NaN       theatre   
        173985100   HAU 2 (Hebbel am Ufer)    NaN     NaN   NaN       theatre   
        229948256              Sophiensæle    NaN     NaN   NaN       theatre   
        257709121       Kulturhaus Spandau    NaN     NaN   NaN   arts_centre   

                          a

In [95]:
# Rename map for only the columns that need renaming
rename_map = {
    "addr:street": "street",
    "addr:housenumber": "housenumber",
    "addr:postcode": "postcode",
    "addr:city": "city",
    "addr:country": "country",
    "addr:suburb": "district"
}

In [96]:
# Rename the columns
clubs_df = clubs_df.rename(columns=rename_map)

In [97]:
clubs_df.head()

element,id,name,club,leisure,sport,amenity,street,housenumber,district,postcode,city,country,website,phone,email,opening_hours,geometry,wheelchair,latitude,longitude
node,30012753,Umspannwerk,,,,events_venue,,,,,,,,,,,POINT (13.42919 52.49404),yes,52.494042,13.429187
node,60775321,KW76,poker,,,,Konrad-Wolf-Straße,76.0,,,,,,,,,POINT (13.48162 52.53862),,52.538623,13.481623
node,66917094,Friedrichstadt-Palast,,,,theatre,Friedrichstraße,107.0,Mitte,10117.0,Berlin,DE,https://www.palast.berlin/,+49 30 23262326,,,POINT (13.38888 52.52392),yes,52.523922,13.388879
node,66917098,Quatsch Comedy Club,,,,theatre,Friedrichstraße,107.0,Mitte,10117.0,Berlin,DE,https://www.quatsch-comedy-club.de/,,,,POINT (13.38862 52.52362),limited,52.523624,13.388621
node,66917115,Kabarett-Theater Distel,,,,theatre,Friedrichstraße,101.0,Mitte,10117.0,Berlin,DE,http://www.distel-berlin.de,,,,POINT (13.38851 52.52067),yes,52.520667,13.388505


In [98]:
for col in [ "name", "club", "leisure", "sport", "amenity"]:
    print(f"\n--- {col.upper()} ---")
    print(clubs_df[col].dropna().unique())


--- NAME ---
['Umspannwerk' 'KW76' 'Friedrichstadt-Palast' ... 'Gemeinschaftsbeet'
 'Hochbeet Annie Heuser Waldorfschule'
 'Begegnungszentrum im Kölner Viertel']

--- CLUB ---
['poker' 'scout' 'sport' 'social' 'yes' 'dance' 'amateur_radio'
 'automobile' 'fishing' 'Körperschaft_des_Öffentlichen_Rechts' 'culture'
 'fan' 'animals' 'elderly' 'bonsai' 'dog' 'freemasonry' 'student'
 'business' 'game' 'music' 'ethnic' 'Agrarbörse Deutschland Ost' 'linux'
 'history' 'education' 'computer' 'religion' 'art' 'politics'
 'board_games' 'youth_movement' 'archive' 'chess' 'sailing' 'science'
 'humanist' 'charity' 'nature' 'hdk_0' 'youth' 'academic' 'motorcycle'
 'allotment_club' 'allotments' 'TC Berolina Biesdorf' 'gardening']

--- LEISURE ---
['hackerspace' 'fitness_centre' 'sports_centre' 'garden' 'dance'
 'music_venue' 'pitch' 'playground' 'stadium' 'ice_rink' 'marina' 'track']

--- SPORT ---
['bowling' '10pin' 'rowing' 'fitness' 'soccer' 'yoga' 'pilates'
 'gymnastics' 'hapkido' 'karate' 'swimmin

In [99]:
print(clubs_df["amenity"].unique())

['events_venue' nan 'theatre' 'arts_centre' 'community_centre'
 'social_centre' 'studio' 'dojo' 'music_school' 'pub' 'nightclub'
 'dancing_school' 'cafe' 'restaurant' 'music_venue' 'bicycle_parking'
 'photo_booth' 'school' 'social_club']


In [100]:
# Define lists of allowed values for 'amenity' and 'leisure' categories
# Keep only rows that match these categories or have a non-empty 'club' field
# This filters out irrelevant OSM features like restaurants, pubs, etc.
allowed_amenities = [
    'arts_centre', 'community_centre', 'events_venue', 'music_venue',
    'social_centre', 'studio', 'theatre', 'dojo', 'music_school',
    'social_club', 'dancing_school'
]

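# Separate every value with a comma: adjacent string literals without commas are
# concatenated into one long string, so isin() would match no leisure value.
# The output recorded below was produced with the leisure filter matching nothing.
allowed_leisure = [
    'hackerspace', 'fitness_centre', 'sports_centre', 'garden', 'dance',
    'music_venue', 'pitch', 'playground', 'stadium', 'ice_rink', 'marina', 'track'
]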



clubs_df = clubs_df[
    (clubs_df['amenity'].isin(allowed_amenities)) |
    (clubs_df['leisure'].isin(allowed_leisure)) |
    (clubs_df['club'].notna())
]


print(clubs_df[['name', 'amenity', 'leisure', 'club']].head())
print(len(clubs_df), "clubs/activities after filtering")

                                     name       amenity leisure   club
element id                                                            
node    30012753              Umspannwerk  events_venue     NaN    NaN
        60775321                     KW76           NaN     NaN  poker
        66917094    Friedrichstadt-Palast       theatre     NaN    NaN
        66917098      Quatsch Comedy Club       theatre     NaN    NaN
        66917115  Kabarett-Theater Distel       theatre     NaN    NaN
1926 clubs/activities after filtering
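One pitfall worth checking in category lists like the ones used for this filter: Python joins adjacent string literals that are not separated by commas, so a list written without commas holds a single long string and a membership test never matches. A small plain-Python illustration (the list contents are shortened for the example):

```python
# Without commas the literals are concatenated at compile time.
broken = [
    'hackerspace' 'fitness_centre' 'sports_centre'
    'garden'
]

# With commas each category is its own list element.
fixed = ['hackerspace', 'fitness_centre', 'sports_centre', 'garden']

print(len(broken), broken)   # one element: the concatenated string
print('garden' in broken)    # False
print('garden' in fixed)     # True
```

A quick `len()` check on each allow-list before filtering is usually enough to catch this.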


In [101]:

clubs_df.info()

<class 'geopandas.geodataframe.GeoDataFrame'>
MultiIndex: 1926 entries, ('node', np.int64(30012753)) to ('way', np.int64(1423837870))
Data columns (total 19 columns):
 #   Column         Non-Null Count  Dtype   
---  ------         --------------  -----   
 0   name           1876 non-null   object  
 1   club           384 non-null    object  
 2   leisure        64 non-null     object  
 3   sport          226 non-null    object  
 4   amenity        1596 non-null   object  
 5   street         1266 non-null   object  
 6   housenumber    1256 non-null   object  
 7   district       842 non-null    object  
 8   postcode       1211 non-null   object  
 9   city           1194 non-null   object  
 10  country        853 non-null    object  
 11  website        818 non-null    object  
 12  phone          327 non-null    object  
 13  email          197 non-null    object  
 14  opening_hours  382 non-null    object  
 15  geometry       1926 non-null   geometry
 16  wheelchair     584

In [102]:
clubs_df = clubs_df.drop_duplicates()
clubs_df = clubs_df.drop_duplicates(subset=['name', 'street', 'housenumber'])
clubs_df = clubs_df.dropna(subset=['name', 'geometry'])
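When pandas compares rows for duplicates it treats missing values as equal, so two different venues that share a name and both lack a street and house number collapse into one row. If that is not wanted, a stricter rule could deduplicate only rows with a complete address. A plain-Python sketch of that idea; `dedupe_venues` and the sample records are hypothetical:

```python
def dedupe_venues(records):
    """Drop repeated (name, street, housenumber) keys, but only for rows
    that have both a street and a house number."""
    seen = set()
    kept = []
    for rec in records:
        street = rec.get("street")
        number = rec.get("housenumber")
        if street is not None and number is not None:
            key = (rec.get("name"), street, number)
            if key in seen:
                continue  # true duplicate: same name at the same address
            seen.add(key)
        kept.append(rec)  # rows without a full address are always kept
    return kept

# Hypothetical records for illustration only.
records = [
    {"name": "Jugendclub", "street": "Hauptstraße", "housenumber": "1"},
    {"name": "Jugendclub", "street": "Hauptstraße", "housenumber": "1"},
    {"name": "Jugendclub", "street": None, "housenumber": None},
    {"name": "Jugendclub", "street": None, "housenumber": None},
]
unique = dedupe_venues(records)
print(len(unique))
```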

In [103]:

clubs_df.info()

<class 'geopandas.geodataframe.GeoDataFrame'>
MultiIndex: 1847 entries, ('node', np.int64(30012753)) to ('way', np.int64(1423837870))
Data columns (total 19 columns):
 #   Column         Non-Null Count  Dtype   
---  ------         --------------  -----   
 0   name           1847 non-null   object  
 1   club           350 non-null    object  
 2   leisure        63 non-null     object  
 3   sport          215 non-null    object  
 4   amenity        1545 non-null   object  
 5   street         1247 non-null   object  
 6   housenumber    1237 non-null   object  
 7   district       834 non-null    object  
 8   postcode       1195 non-null   object  
 9   city           1179 non-null   object  
 10  country        845 non-null    object  
 11  website        804 non-null    object  
 12  phone          324 non-null    object  
 13  email          196 non-null    object  
 14  opening_hours  377 non-null    object  
 15  geometry       1847 non-null   geometry
 16  wheelchair     580

In [104]:
clubs_df.head()

element,id,name,club,leisure,sport,amenity,street,housenumber,district,postcode,city,country,website,phone,email,opening_hours,geometry,wheelchair,latitude,longitude
node,30012753,Umspannwerk,,,,events_venue,,,,,,,,,,,POINT (13.42919 52.49404),yes,52.494042,13.429187
node,60775321,KW76,poker,,,,Konrad-Wolf-Straße,76.0,,,,,,,,,POINT (13.48162 52.53862),,52.538623,13.481623
node,66917094,Friedrichstadt-Palast,,,,theatre,Friedrichstraße,107.0,Mitte,10117.0,Berlin,DE,https://www.palast.berlin/,+49 30 23262326,,,POINT (13.38888 52.52392),yes,52.523922,13.388879
node,66917098,Quatsch Comedy Club,,,,theatre,Friedrichstraße,107.0,Mitte,10117.0,Berlin,DE,https://www.quatsch-comedy-club.de/,,,,POINT (13.38862 52.52362),limited,52.523624,13.388621
node,66917115,Kabarett-Theater Distel,,,,theatre,Friedrichstraße,101.0,Mitte,10117.0,Berlin,DE,http://www.distel-berlin.de,,,,POINT (13.38851 52.52067),yes,52.520667,13.388505


# Geometry sanity checks

In [105]:
print("Missing geometries:", clubs_df.geometry.isna().sum())

Missing geometries: 0


In [106]:
# Goal: Verify lat/lon look realistic.
# Why? If values are way off, something went wrong in conversion.

print("Latitude range:", clubs_df["latitude"].min(), "to", clubs_df["latitude"].max())

print("Longitude range:", clubs_df["longitude"].min(), "to", clubs_df["longitude"].max())

Latitude range: 52.37387955 to 52.6448252
Longitude range: 13.12237797012892 to 13.7311336


## 1.3 Prepare the /sources Directory

**Raw data files:**

- **clubs_raw.geojson** (includes geometry)
- **clubs_raw.csv** (tabular only, no geometry)
- **README.md** in /sources will contain:
  - Data sources used
  - Planned transformation steps

In [107]:
# Save locally
clubs_gdf.to_file("clubs_raw.geojson", driver="GeoJSON")
clubs_gdf.drop(columns="geometry").to_csv("clubs_raw.csv", index=False)



# Step 2: Data Transformation

In [108]:
# Standardize column names

clubs_df.columns = clubs_df.columns.str.lower().str.strip().str.replace(" ", "_").str.replace("-", "_")

# Normalize yes/no values into Boolean (True/False).
# Values outside the mapping (for example "limited") become NaN here.

clubs_df["wheelchair"] = clubs_df["wheelchair"].map({"yes": True, "no": False})
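The yes/no mapping turns every other value into a missing value; the Quatsch Comedy Club row shown earlier, for instance, is tagged `limited`. If partial accessibility should survive the cleaning, one option is a small normalizer that keeps it as its own category. This is a sketch of an alternative, not the approach the notebook takes; `normalize_wheelchair` and the accepted value set are assumptions:

```python
import math

# Values kept as-is after normalization (assumed set for this sketch).
WHEELCHAIR_VALUES = {"yes", "no", "limited"}

def normalize_wheelchair(value):
    """Map a raw wheelchair tag to 'yes', 'no', 'limited' or 'unknown'."""
    if value is None or (isinstance(value, float) and math.isnan(value)):
        return "unknown"
    text = str(value).strip().lower()
    return text if text in WHEELCHAIR_VALUES else "unknown"

raw_values = ["yes", "limited", " No ", None, float("nan"), "maybe"]
print([normalize_wheelchair(v) for v in raw_values])
```

On the DataFrame this could be applied with `clubs_df["wheelchair"].apply(normalize_wheelchair)` in place of the `map` call.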

In [109]:
print(clubs_df.dtypes)

name               object
club               object
leisure            object
sport              object
amenity            object
street             object
housenumber        object
district           object
postcode           object
city               object
country            object
website            object
phone              object
email              object
opening_hours      object
geometry         geometry
wheelchair         object
latitude          float64
longitude         float64
dtype: object


## Drop irrelevant / redundant columns

In [110]:
clubs_df.drop(columns=["city","district" ,"country", "geometry"], inplace=True)

## Normalize categories

In [111]:
clubs_df["wheelchair"] = clubs_df["wheelchair"].fillna("unknown").astype(str).str.strip().str.lower()

In [112]:

clubs_df["opening_hours"] = clubs_df["opening_hours"].fillna("unknown").astype(str).str.strip().str.lower()



In [113]:
clubs_df.info()

<class 'geopandas.geodataframe.GeoDataFrame'>
MultiIndex: 1847 entries, ('node', np.int64(30012753)) to ('way', np.int64(1423837870))
Data columns (total 15 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   name           1847 non-null   object 
 1   club           350 non-null    object 
 2   leisure        63 non-null     object 
 3   sport          215 non-null    object 
 4   amenity        1545 non-null   object 
 5   street         1247 non-null   object 
 6   housenumber    1237 non-null   object 
 7   postcode       1195 non-null   object 
 8   website        804 non-null    object 
 9   phone          324 non-null    object 
 10  email          196 non-null    object 
 11  opening_hours  1847 non-null   object 
 12  wheelchair     1847 non-null   object 
 13  latitude       1847 non-null   float64
 14  longitude      1847 non-null   float64
dtypes: float64(2), object(13)
memory usage: 534.1+ KB


In [114]:
clubs_df.head()

element,id,name,club,leisure,sport,amenity,street,housenumber,postcode,website,phone,email,opening_hours,wheelchair,latitude,longitude
node,30012753,Umspannwerk,,,,events_venue,,,,,,,unknown,true,52.494042,13.429187
node,60775321,KW76,poker,,,,Konrad-Wolf-Straße,76.0,,,,,unknown,unknown,52.538623,13.481623
node,66917094,Friedrichstadt-Palast,,,,theatre,Friedrichstraße,107.0,10117.0,https://www.palast.berlin/,+49 30 23262326,,unknown,true,52.523922,13.388879
node,66917098,Quatsch Comedy Club,,,,theatre,Friedrichstraße,107.0,10117.0,https://www.quatsch-comedy-club.de/,,,unknown,unknown,52.523624,13.388621
node,66917115,Kabarett-Theater Distel,,,,theatre,Friedrichstraße,101.0,10117.0,http://www.distel-berlin.de,,,unknown,true,52.520667,13.388505


### Add district and district_id to the data frame

In [115]:
# conda install -c conda-forge geopy

In [116]:
import geopandas as gpd

gdf_clubs = gpd.GeoDataFrame(
    clubs_df,  # cleaned clubs table with latitude/longitude columns
    geometry=gpd.points_from_xy(clubs_df.longitude, clubs_df.latitude),
    crs="EPSG:4326"
)


neighborhoods = gpd.read_file(
    "/sources/lor_ortsteile.geojson"
).to_crs("EPSG:4326")


# Harmonize the column names coming from the GeoJSON
neighborhoods = neighborhoods.rename(columns={
    "BEZIRK": "district",
    "OTEIL": "neighborhood",
    "spatial_name": "neighborhood_id"
})

gdf_with_districts = gpd.sjoin(
    gdf_clubs,
    neighborhoods[["district", "neighborhood_id", "neighborhood", "geometry"]],
    how="left",
    predicate="within"
)

df_final = gdf_with_districts.drop(columns=["geometry", "index_right"])
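The `predicate="within"` join assigns each club point to the Ortsteil polygon that contains it. As a rough illustration of the underlying test, the dependency-free sketch below uses ray casting on a single polygon ring. The square polygon and the `assign_area` helper are hypothetical, and geopandas itself relies on a geometry engine rather than code like this:

```python
def point_in_ring(lon, lat, ring):
    """Ray casting: count how many ring edges a ray going east from the point crosses."""
    inside = False
    n = len(ring)
    for i in range(n):
        x1, y1 = ring[i]
        x2, y2 = ring[(i + 1) % n]
        # Only edges that straddle the point's latitude can be crossed.
        if (y1 > lat) != (y2 > lat):
            x_cross = x1 + (lat - y1) * (x2 - x1) / (y2 - y1)
            if lon < x_cross:
                inside = not inside
    return inside

def assign_area(lon, lat, areas):
    """Return the name of the first area whose ring contains the point, else None."""
    for name, ring in areas.items():
        if point_in_ring(lon, lat, ring):
            return name
    return None

# Toy polygon as (lon, lat) pairs; not a real Berlin boundary.
areas = {"toy_square": [(13.0, 52.0), (14.0, 52.0), (14.0, 53.0), (13.0, 53.0)]}
print(assign_area(13.388879, 52.523922, areas))  # point inside the square
print(assign_area(12.0, 52.5, areas))            # point west of the square
```

With `how="left"`, points that fall in no polygon stay in the result with missing district columns, which corresponds to the `None` case here.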

In [117]:
df_final = df_final.reset_index()

# Rename the "id" column to "club_id"

df_final = df_final.rename(columns={"id": "club_id"})

# Change club_id column type to string

df_final["club_id"] = df_final["club_id"].astype(str)

In [118]:
df_final.head()

Unnamed: 0,element,club_id,name,club,leisure,sport,amenity,street,housenumber,postcode,website,phone,email,opening_hours,wheelchair,latitude,longitude,district,neighborhood_id,neighborhood
0,node,30012753,Umspannwerk,,,,events_venue,,,,,,,unknown,true,52.494042,13.429187,Friedrichshain-Kreuzberg,202,Kreuzberg
1,node,60775321,KW76,poker,,,,Konrad-Wolf-Straße,76.0,,,,,unknown,unknown,52.538623,13.481623,Lichtenberg,1110,Alt-Hohenschönhausen
2,node,66917094,Friedrichstadt-Palast,,,,theatre,Friedrichstraße,107.0,10117.0,https://www.palast.berlin/,+49 30 23262326,,unknown,true,52.523922,13.388879,Mitte,101,Mitte
3,node,66917098,Quatsch Comedy Club,,,,theatre,Friedrichstraße,107.0,10117.0,https://www.quatsch-comedy-club.de/,,,unknown,unknown,52.523624,13.388621,Mitte,101,Mitte
4,node,66917115,Kabarett-Theater Distel,,,,theatre,Friedrichstraße,101.0,10117.0,http://www.distel-berlin.de,,,unknown,true,52.520667,13.388505,Mitte,101,Mitte


In [119]:
# Reverse Geolocation
import logging
import requests
import time

def get_address(lat, lon):
    """Retrieve full formatted address from Nominatim"""
    url = "https://nominatim.openstreetmap.org/reverse"
    params = {"lat": lat, "lon": lon, "format": "json", "addressdetails": 1}
    headers = {"User-Agent": "berlin-venues-scraper/1.0"}
    try:
        r = requests.get(url, params=params, headers=headers, timeout=10)
        r.raise_for_status()
        data = r.json()
        return data.get("display_name")
    except requests.exceptions.RequestException as e:
        logging.warning(f"Error fetching address for ({lat}, {lon}): {e}")
        return None

# Apply reverse geolocation with throttling (to respect Nominatim usage policy)
full_addresses = []
for i, row in df_final.iterrows():
    print(f"fetching missing data for {i}")
    lat, lon = row["latitude"], row["longitude"]
    if pd.notna(lat) and pd.notna(lon):
        
        full_addresses.append(get_address(lat, lon))
        time.sleep(1)  # polite delay between requests
    else:
        
        full_addresses.append(None)


df_final["full_address"] = full_addresses

fetching missing data for 0
fetching missing data for 1
fetching missing data for 2
fetching missing data for 3
...
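With one request per second, the loop needs roughly half an hour for about 1,850 rows, and several venues share the same coordinates or building. A cache in front of the lookup avoids repeated requests and only sleeps after real calls. The sketch below is one possible wrapper, not part of the notebook; `make_cached_geocoder` is a hypothetical helper and the rounding precision is an assumption. It takes the fetch function as a parameter so it can be tried without network access:

```python
import time

def make_cached_geocoder(fetch, delay_seconds=1.0, sleep=time.sleep):
    """Wrap `fetch(lat, lon)` so each rounded coordinate is requested only once."""
    cache = {}

    def lookup(lat, lon):
        key = (round(lat, 5), round(lon, 5))  # roughly metre-level precision (assumption)
        if key not in cache:
            cache[key] = fetch(lat, lon)  # failed lookups (None) are cached too
            sleep(delay_seconds)          # throttle only real requests
        return cache[key]

    return lookup

# Demo with a stand-in fetcher so the sketch runs offline.
calls = []

def fake_fetch(lat, lon):
    calls.append((lat, lon))
    return f"address near {lat:.3f}, {lon:.3f}"

lookup = make_cached_geocoder(fake_fetch, sleep=lambda seconds: None)
first = lookup(52.523922, 13.388879)
second = lookup(52.523922, 13.388879)  # served from the cache
print(len(calls))
```

In the notebook, the wrapper could be created once with `make_cached_geocoder(get_address)` and called inside the loop instead of `get_address` plus `time.sleep(1)`.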


In [127]:
# District mapping (official codes as strings)
district_mapping = {
    'Mitte': '11001001',
    'Friedrichshain-Kreuzberg': '11002002',
    'Pankow': '11003003',
    'Charlottenburg-Wilmersdorf': '11004004',
    'Spandau': '11005005',
    'Steglitz-Zehlendorf': '11006006',
    'Tempelhof-Schöneberg': '11007007',
    'Neukölln': '11008008',
    'Treptow-Köpenick': '11009009',
    'Marzahn-Hellersdorf': '11010010',
    'Lichtenberg': '11011011',
    'Reinickendorf': '11012012'
}

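# Apply mapping to create district_id column.
# Note: astype(str) turns a district that is missing from the mapping into the
# string "nan", which isnull() does not count as missing.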
df_final['district_id'] = (
    df_final['district']
    .map(district_mapping)
    .astype(str)
)
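Because the mapped column is cast to string, a district name that is absent from the dictionary would not show up in a later missing-values check. A small plain-Python sketch of a mapping step that reports unmapped names explicitly; `map_district_ids` is a hypothetical helper, and the two codes are copied from the dictionary above:

```python
def map_district_ids(names, mapping):
    """Return (ids, unmapped): ids is aligned with names (None where unmapped)."""
    ids = []
    unmapped = set()
    for name in names:
        district_id = mapping.get(name)
        if district_id is None:
            unmapped.add(name)
        ids.append(district_id)
    return ids, sorted(unmapped, key=str)

# Two entries copied from district_mapping; "Mite" is a deliberate typo.
sample_mapping = {"Mitte": "11001001", "Lichtenberg": "11011011"}
ids, unmapped = map_district_ids(["Mitte", "Mite", "Lichtenberg", None], sample_mapping)
print(ids)
print(unmapped)
```

An empty `unmapped` list confirms that every row received a real district code before the column is written to the database.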

In [128]:
df_final

Unnamed: 0,element,club_id,name,club,leisure,sport,amenity,street,housenumber,postcode,...,email,opening_hours,wheelchair,latitude,longitude,district,neighborhood_id,neighborhood,full_address,district_id
0,node,30012753,Umspannwerk,,,,events_venue,,,,...,,unknown,true,52.494042,13.429187,Friedrichshain-Kreuzberg,0202,Kreuzberg,"Umspannwerk, Ohlauer Straße, Luisenstadt, Kreu...",11002002
1,node,60775321,KW76,poker,,,,Konrad-Wolf-Straße,76,,...,,unknown,unknown,52.538623,13.481623,Lichtenberg,1110,Alt-Hohenschönhausen,"KW76, 76, Konrad-Wolf-Straße, Wilhelmsberg, Al...",11011011
2,node,66917094,Friedrichstadt-Palast,,,,theatre,Friedrichstraße,107,10117,...,,unknown,true,52.523922,13.388879,Mitte,0101,Mitte,"Friedrichstadt-Palast, 107, Friedrichstraße, D...",11001001
3,node,66917098,Quatsch Comedy Club,,,,theatre,Friedrichstraße,107,10117,...,,unknown,unknown,52.523624,13.388621,Mitte,0101,Mitte,"Quatsch Comedy Club, 107, Friedrichstraße, Dor...",11001001
4,node,66917115,Kabarett-Theater Distel,,,,theatre,Friedrichstraße,101,10117,...,,unknown,true,52.520667,13.388505,Mitte,0101,Mitte,"Kabarett-Theater Distel, 101, Friedrichstraße,...",11001001
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1842,way,1352576785,DAV OG Berlin Oberschöneweide e.V.,fishing,,,,Nalepastraße,56,,...,,unknown,unknown,52.477217,13.497736,Treptow-Köpenick,0909,Oberschöneweide,"DAV OG Berlin Oberschöneweide e.V., 56, Nalepa...",11009009
1843,way,1353882816,Haus Wolfgang Raeder,gardening,,,community_centre,Leonberger Ring,54,,...,,unknown,unknown,52.427209,13.428828,Neukölln,0802,Britz,"Haus Wolfgang Raeder, 54, Leonberger Ring, Alt...",11008008
1844,way,1387340919,Kulturzentrum Alte Schule,,,,community_centre,,,,...,,unknown,unknown,52.438961,13.549769,Treptow-Köpenick,0907,Adlershof,"Kulturzentrum Alte Schule, Dörpfeldstraße, Sie...",11009009
1845,way,1413964279,Vereinsheim Treue Seele,,,,community_centre,,,,...,,unknown,unknown,52.473617,13.462736,Neukölln,0801,Neukölln,"Narzissenweg, Kleingartenanlage Treue Seele, N...",11008008


In [132]:
df_final = df_final.drop(columns=["element"])
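OSM ids are unique only within an element type (node, way, relation), so once `element` is dropped a bare `club_id` could in principle collide between, say, a node and a way. If the id is meant to serve as a key, one option is to combine both parts before dropping the column. A minimal sketch; `make_osm_key` and the colliding rows are hypothetical:

```python
def make_osm_key(element, osm_id):
    """Build a key such as 'node/30012753' from the element type and numeric id."""
    return f"{element}/{osm_id}"

# Hypothetical rows: a node and a way that happen to share a numeric id.
rows = [("node", 30012753), ("way", 30012753)]

bare_ids = {str(osm_id) for _, osm_id in rows}
composite_ids = {make_osm_key(element, osm_id) for element, osm_id in rows}

print(len(bare_ids), len(composite_ids))
```

A cheaper alternative is to check `df_final["club_id"].is_unique` before dropping `element` and only build composite keys if the check fails.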

In [133]:
# (Optional) Save enriched dataset for later use
df_final.to_csv("clubs_with_districts.csv", index=False)

### Final Summary of Cleaned and Transformed Data

In [134]:
print("✅ Dataset after cleaning and transformation steps\n")

# Shape of dataframe
print(f"Number of rows: {df_final.shape[0]}")
print(f"Number of columns: {df_final.shape[1]}")

# Column list
print("\nRemaining columns:")
print(df_final.columns.tolist())

# Missing values check
missing = df_final.isnull().sum()
print("\nMissing values after cleaning and transformation:")
print(missing)

✅ Dataset after cleaning and transformation steps

Number of rows: 1847
Number of columns: 21

Remaining columns:
['club_id', 'name', 'club', 'leisure', 'sport', 'amenity', 'street', 'housenumber', 'postcode', 'website', 'phone', 'email', 'opening_hours', 'wheelchair', 'latitude', 'longitude', 'district', 'neighborhood_id', 'neighborhood', 'full_address', 'district_id']

Missing values after cleaning and transformation:
club_id               0
name                  0
club               1497
leisure            1784
sport              1632
amenity             302
street              600
housenumber         610
postcode            652
website            1043
phone              1523
email              1651
opening_hours         0
wheelchair            0
latitude              0
longitude             0
district              0
neighborhood_id       0
neighborhood          0
full_address          0
district_id           0
dtype: int64


### Step 3: Populate Database