### 🧪 Step 1: Research & Data Modelling  
PR Branch Name: clubs-data-modelling  

This notebook documents the process for Step 1 of the "Clubs & Social Activities in Berlin" project:  

- **1.1 Data Source Discovery**  
- **1.2 Modelling & Planning**  
- **1.3 Prepare the /sources Directory**  
- **1.4 Review**  

---

### 🎯 Goal  
- Identify and document relevant data sources.  
- Select the key parameters for our use case.  
- Draft the planned table schema.  
- Plan cleaning and transformation steps before database population.  

---

## 1.1 Data Source Discovery  

**Topic:** Clubs & Social Activities in Berlin  

**Main source:**  
- **Name:** OpenStreetMap (OSM) via OSMnx / Overpass API  
- **Source and origin:** Public crowdsourced geospatial database  
- **Update frequency:** Continuous (dynamic)  
- **Data type:** Dynamic (API query using tags such as `club=*`, `leisure=*`, `sport=*`, `amenity=community_centre`)  
- **Reason for selection:**  
  - Covers a wide variety of sports clubs, cultural clubs, and social activity centers in Berlin  
  - Includes geospatial data (coordinates, polygons), names, addresses, and attributes  
  - Open, free, and queryable programmatically  
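
For orientation, the tag filters can also be expressed directly as an Overpass QL query. The snippet below is a minimal sketch, not the query OSMnx generates internally: the helper name `build_overpass_query` is hypothetical, and the area filter `["ISO3166-2"="DE-BE"]` is an assumption that should be checked against the Berlin boundary relation before use.

```python
def build_overpass_query(tags, area_filter='["ISO3166-2"="DE-BE"]', timeout=120):
    """Build an Overpass QL query for nodes, ways and relations matching `tags`.

    `tags` follows the OSMnx convention: key -> True (any value),
    a single string value, or a list of accepted values.
    """
    lines = [
        f"[out:json][timeout:{timeout}];",
        f"area{area_filter}->.berlin;",  # assumed area filter, verify before use
        "(",
    ]
    for key, value in tags.items():
        if value is True:
            # Key present with any value, e.g. club=*
            lines.append(f'  nwr["{key}"](area.berlin);')
        else:
            values = [value] if isinstance(value, str) else list(value)
            pattern = "|".join(values)
            # Anchored regex so only the listed values match
            lines.append(f'  nwr["{key}"~"^({pattern})$"](area.berlin);')
    lines.append(");")
    lines.append("out center;")  # one centre point per way/relation
    return "\n".join(lines)


tags = {
    "amenity": ["community_centre", "arts_centre"],
    "club": True,
}
query = build_overpass_query(tags)
print(query)
```

The resulting string could be sent to any public Overpass endpoint; the notebook itself relies on OSMnx to handle this step.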

**Optional additional sources:**  
- **Name:** Berliner Turn- und Freizeitsport-Bund (BTFB) (https://btfb.de/vereinsservice/vereinssuche/#Vereine-im-Portrait)  
  - Source: Official Berlin sports association website  
  - Type: Static (manual export / scraping)  
  - Use: Provides official structured list of sports clubs in Berlin  

- **Name:** Berlin Open Data Portal (daten.berlin.de)  
  - Source: Berlin city government  
  - Type: Static or semi-static (CSV, GeoJSON)  
  - Use: Enrichment with official district boundaries or metadata  

**Enrichment potential:**  
- Use Berlin shapefiles (districts, neighborhoods) for spatial joins.  


---

## 1.2 Modelling & Planning  

**Key Parameters (planned):**  
- Identification: `name`, `club`, `category`, `subcategory`  
- Location: `address`, `district`, `geometry (lat/lon)`  
- Contact: `website`, `phone`, `email`  
- Attributes: `opening_hours`, `membership`, `fees`, `sport` / `leisure type`  
- Metadata: `source`, `last_updated`  
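
The `category` and `subcategory` parameters do not exist in OSM and have to be derived from the `club`, `leisure`, `sport`, and `amenity` tags. One possible rule set is sketched below. The priority order and the function name `derive_category` are assumptions for illustration, not a decided design.

```python
# Assumed priority: an explicit club tag wins over leisure, which wins over amenity.
PRIORITY = ("club", "leisure", "amenity")


def derive_category(tags):
    """Return (category, subcategory) for one OSM feature's tag dict."""

    def text(key):
        value = tags.get(key)
        # NaN (a float) and empty strings count as missing.
        if isinstance(value, str) and value.strip():
            return value.strip()
        return None

    for key in PRIORITY:
        value = text(key)
        if value is None:
            continue
        if key == "leisure":
            # Prefer the more specific sport tag when it is present.
            return "leisure", text("sport") or value
        return key, value
    return None, None


print(derive_category({"club": "poker"}))
print(derive_category({"leisure": "fitness_centre", "sport": "yoga"}))
print(derive_category({"amenity": "arts_centre"}))
```

Rows that return `(None, None)` would need manual review or a fallback category.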

**Integration with existing tables:**  
- Join on `district_id` from the Berlin districts reference table.  
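
Populating `district_id` amounts to a point-in-polygon lookup of each club against the district boundaries. With GeoPandas this would normally be done with `geopandas.sjoin`; the dependency-free sketch below only illustrates the idea with a ray-casting test. The two rectangles are made-up placeholders, not real Berlin district boundaries, and `assign_district` is a hypothetical helper.

```python
def point_in_polygon(lon, lat, polygon):
    """Ray-casting test; `polygon` is a list of (lon, lat) vertices."""
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Only edges that straddle the point's latitude can be crossed.
        if (y1 > lat) != (y2 > lat):
            x_cross = x1 + (lat - y1) * (x2 - x1) / (y2 - y1)
            if lon < x_cross:
                inside = not inside
    return inside


# Placeholder rectangles keyed by district_id (NOT real boundaries).
districts = {
    1: [(13.30, 52.50), (13.45, 52.50), (13.45, 52.56), (13.30, 52.56)],
    2: [(13.45, 52.50), (13.65, 52.50), (13.65, 52.56), (13.45, 52.56)],
}


def assign_district(lon, lat, districts):
    """Return the id of the first polygon containing the point, else None."""
    for district_id, polygon in districts.items():
        if point_in_polygon(lon, lat, polygon):
            return district_id
    return None


# Coordinates of "KW76" from the sample output in this notebook.
print(assign_district(13.48162, 52.53862, districts))
```

Points that fall in no polygon return `None`, which maps naturally to a NULL `district_id`.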


**Planned table schema:**  
```sql
CREATE TABLE berlin_clubs (
    club_id SERIAL PRIMARY KEY,
    name TEXT,
    category TEXT,
    subcategory TEXT,
    street TEXT,
    housenumber TEXT,
    postcode TEXT,
    district TEXT,
    city TEXT,
    country TEXT,
    district_id INT REFERENCES berlin_districts(district_id),
    latitude FLOAT,
    longitude FLOAT,
    website TEXT,
    phone TEXT,
    email TEXT,
    opening_hours TEXT,
    wheelchair TEXT
);
```

In [69]:
# Install Libraries

# !pip install osmnx geopandas

In [70]:
# Import Libraries

import osmnx as ox
import geopandas as gpd
import pandas as pd

In [71]:
# Disable the OSMnx response cache so every run queries fresh data
ox.settings.use_cache = False

In [72]:
# Define multiple tags
tags = {
    "amenity": ["community_centre", "arts_centre", "youth_centre", "music_school"],
    "leisure": ["sports_centre", "fitness_centre", "dance"],
    "club": True  # will capture all club types
}

clubs_gdf = ox.features_from_place("Berlin, Germany", tags)

print(clubs_gdf.head())
print(len(clubs_gdf), "clubs/activities found in Berlin")

                                    geometry addr:housenumber  \
element id                                                      
node    60775321   POINT (13.48162 52.53862)               76   
        257709121  POINT (13.20231 52.53548)              NaN   
        266630320  POINT (13.61206 52.51314)               91   
        268915262    POINT (13.34154 52.531)               22   
        268915306  POINT (13.41931 52.48888)                5   

                              addr:street   club                  name  \
element id                                                               
node    60775321       Konrad-Wolf-Straße  poker                  KW76   
        257709121                     NaN    NaN    Kulturhaus Spandau   
        266630320          Hönower Straße    NaN  Buergeramt Mahlsdorf   
        268915262  Wilhelmshavener Straße    NaN           Karame e.V.   
        268915306              Jahnstraße    NaN             Biberzahn   

                         

In [73]:
clubs_gdf.describe()

| | geometry | addr:housenumber | addr:street | club | name | amenity | contact:phone | contact:website | toilets:wheelchair | wheelchair | ... | climbing | source:geometry | name:he | name:it | nickname | construction | type | length | maxdepth | nudism |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 2691 | 1506 | 1520 | 385 | 2418 | 1045 | 322 | 596 | 118 | 767 | ... | 1 | 1 | 1 | 1 | 1 | 3 | 19 | 1 | 1 | 1 |
| unique | 2690 | 405 | 936 | 47 | 2187 | 15 | 315 | 587 | 2 | 3 | ... | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
| top | POINT (13.3852669 52.5579387) | 1 | Schönhauser Allee | sport | Sporthalle | community_centre | +49 30 22190011 | https://www.teledisko.com/ | yes | yes | ... | boulder | Geoportal Berlin / ALKIS Berlin - Flurstücke | בית תרבויות העולם | La casa delle culture del mondo | Schwangere Auster | yes | multipolygon | 50 | 12 | no |
| freq | 2 | 41 | 13 | 204 | 20 | 775 | 5 | 3 | 80 | 358 | ... | 1 | 1 | 1 | 1 | 1 | 3 | 19 | 1 | 1 | 1 |


In [74]:
clubs_gdf.columns

Index(['geometry', 'addr:housenumber', 'addr:street', 'club', 'name',
       'amenity', 'contact:phone', 'contact:website', 'toilets:wheelchair',
       'wheelchair',
       ...
       'climbing', 'source:geometry', 'name:he', 'name:it', 'nickname',
       'construction', 'type', 'length', 'maxdepth', 'nudism'],
      dtype='object', length=387)

In [75]:
clubs_gdf.info()

<class 'geopandas.geodataframe.GeoDataFrame'>
MultiIndex: 2691 entries, ('node', np.int64(60775321)) to ('way', np.int64(1427539730))
Columns: 387 entries, geometry to nudism
dtypes: geometry(1), object(386)
memory usage: 8.0+ MB


In [76]:
print(clubs_gdf.notnull().sum().sort_values(ascending=False).head(30))

geometry                    2691
name                        2418
addr:street                 1520
addr:housenumber            1506
addr:postcode               1450
addr:city                   1416
leisure                     1377
addr:country                1046
amenity                     1045
addr:suburb                 1037
website                      924
sport                        907
wheelchair                   767
building                     692
operator                     681
opening_hours                639
contact:website              596
community_centre             479
phone                        408
club                         385
check_date                   377
contact:phone                322
community_centre:for         308
building:levels              291
email                        232
check_date:opening_hours     219
contact:email                212
source                       188
description                  184
opening_hours:signed         162
dtype: int

In [77]:
print(clubs_gdf.columns.tolist())

['geometry', 'addr:housenumber', 'addr:street', 'club', 'name', 'amenity', 'contact:phone', 'contact:website', 'toilets:wheelchair', 'wheelchair', 'wheelchair:description', 'operator', 'website', 'addr:city', 'addr:country', 'addr:postcode', 'addr:suburb', 'check_date:opening_hours', 'community_centre', 'community_centre:for', 'description', 'opening_hours', 'fax', 'health_facility:type', 'lgbtq', 'phone', 'provided_for:homosexual', 'check_date', 'opening_hours:signed', 'operator:type', 'brand', 'brand:wikidata', 'brand:wikipedia', 'email', 'leisure', 'contact:email', 'contact:fax', 'sport', 'addr:floor', 'level', 'dance:teaching', 'mapillary', 'wikidata', 'wikimedia_commons', 'wikipedia', 'disused:name', 'disused:short_name', 'addr:housename', 'denomination', 'source', 'disused', 'outdoor_seating', 'smoking', 'dispensing', 'old_name', 'disused:shop', 'alt_name', 'description:en', 'fixme', 'short_name', 'access', 'contact:email_1', 'contact:name', 'fee', 'identity', 'indoor', 'interact

In [78]:
important_cols = [
    "name",               
    "club",                 
    "leisure",             
    "sport",                
    "amenity",               
    "addr:street",           
    "addr:housenumber",
    "addr:suburb",      
    "addr:postcode",         
    "addr:city",            
    "website",              
    "phone",             
    "email",               
    "opening_hours",         
    "geometry" ,
    "wheelchair"               
]

In [79]:
clubs_clean = clubs_gdf[important_cols].copy()

print(clubs_clean.head(10))

                                                          name   club leisure  \
element id                                                                      
node    60775321                                          KW76  poker     NaN   
        257709121                           Kulturhaus Spandau    NaN     NaN   
        266630320                         Buergeramt Mahlsdorf    NaN     NaN   
        268915262                                  Karame e.V.    NaN     NaN   
        268915306                                    Biberzahn    NaN     NaN   
        268915454                            K3 Kiez-Kids-Klub    NaN     NaN   
        268916576  Katholische Pfarrei St. Matthias Schöneberg    NaN     NaN   
        268917395                      Schwulenberatung Berlin    NaN     NaN   
        270742541                                Max und Marek    NaN     NaN   
        270919303                                Stamm Kimbern  scout     NaN   

                  sport    

In [80]:
clubs_clean.info()

<class 'geopandas.geodataframe.GeoDataFrame'>
MultiIndex: 2691 entries, ('node', np.int64(60775321)) to ('way', np.int64(1427539730))
Data columns (total 16 columns):
 #   Column            Non-Null Count  Dtype   
---  ------            --------------  -----   
 0   name              2418 non-null   object  
 1   club              385 non-null    object  
 2   leisure           1377 non-null   object  
 3   sport             907 non-null    object  
 4   amenity           1045 non-null   object  
 5   addr:street       1520 non-null   object  
 6   addr:housenumber  1506 non-null   object  
 7   addr:suburb       1037 non-null   object  
 8   addr:postcode     1450 non-null   object  
 9   addr:city         1416 non-null   object  
 10  website           924 non-null    object  
 11  phone             408 non-null    object  
 12  email             232 non-null    object  
 13  opening_hours     639 non-null    object  
 14  geometry          2691 non-null   geometry
 15  wheelchair  

In [81]:
# Reproject to WGS84 (EPSG:4326) so that x/y are longitude/latitude in degrees

clubs_clean = clubs_clean.to_crs(epsg=4326)

In [82]:
# Collapse non-point geometries (ways, relations) to a representative point inside the shape
clubs_clean['geometry'] = clubs_clean['geometry'].apply(lambda geom: geom if geom.geom_type == 'Point' else geom.representative_point())
# Extract latitude and longitude
clubs_clean["latitude"] = clubs_clean.geometry.y
clubs_clean["longitude"] = clubs_clean.geometry.x
clubs_clean

| element | id | name | club | leisure | sport | amenity | addr:street | addr:housenumber | addr:suburb | addr:postcode | addr:city | website | phone | email | opening_hours | geometry | wheelchair | latitude | longitude |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| node | 60775321 | KW76 | poker |  |  |  | Konrad-Wolf-Straße | 76 |  |  |  |  |  |  |  | POINT (13.48162 52.53862) |  | 52.538623 | 13.481623 |
| node | 257709121 | Kulturhaus Spandau |  |  |  | arts_centre |  |  |  |  |  |  |  |  |  | POINT (13.20231 52.53548) | yes | 52.535479 | 13.202312 |
| node | 266630320 | Buergeramt Mahlsdorf |  |  |  | community_centre | Hönower Straße | 91 | Mahlsdorf | 12623 | Berlin |  |  |  |  | POINT (13.61206 52.51314) | yes | 52.513140 | 13.612063 |
| node | 268915262 | Karame e.V. |  |  |  | community_centre | Wilhelmshavener Straße | 22 | Moabit | 10551 | Berlin |  |  |  | Mo-Fr 13:00-18:00; Sa-Su,PH off | POINT (13.34154 52.531) | no | 52.531002 | 13.341544 |
| node | 268915306 | Biberzahn |  |  |  | community_centre | Jahnstraße | 5 | Kreuzberg | 10967 | Berlin |  |  |  |  | POINT (13.41931 52.48888) | no | 52.488882 | 13.419308 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| way | 1391342949 |  |  | sports_centre |  |  |  |  |  |  |  |  |  |  |  | POINT (13.30022 52.56874) |  | 52.568739 | 13.300225 |
| way | 1413458990 | Spirit Joga |  | fitness_centre | yoga |  | Goethestraße | 2-3 | Charlottenburg | 10623 | Berlin |  |  |  |  | POINT (13.32365 52.50891) | no | 52.508913 | 13.323650 |
| way | 1413964279 | Vereinsheim Treue Seele |  |  |  | community_centre |  |  |  |  |  |  |  |  |  | POINT (13.46274 52.47362) |  | 52.473617 | 13.462736 |
| way | 1423837870 | Begegnungszentrum im Kölner Viertel |  |  |  | community_centre | Müngersdorfer Straße | 18 | Altglienicke | 12524 | Berlin | https://bik-verein.de |  |  |  | POINT (13.54737 52.4038) |  | 52.403802 | 13.547368 |


In [83]:
rename_map = {
    "addr:street": "street",
    "addr:housenumber": "housenumber",
    "addr:postcode": "postcode",
    "addr:city": "city",
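    "addr:country": "country",  # no effect yet: "addr:country" is not in important_cols
    "addr:suburb": "district"   # values such as Moabit or Kreuzberg are localities (Ortsteile), not the 12 administrative districts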
}

In [84]:
# Rename the columns
clubs_df = clubs_clean.rename(columns=rename_map)
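
The non-null counts printed earlier show contact details split across two tagging styles (`website` / `contact:website`, `phone` / `contact:phone`, `email` / `contact:email`), while the current selection keeps only the unprefixed keys. A minimal sketch of how one feature's tags could be coalesced into a `berlin_clubs` row follows. The helpers `first_present` and `to_club_record` are hypothetical, and the second example URL is a placeholder; on the DataFrame itself, pandas `Series.combine_first` would be the natural equivalent.

```python
def first_present(tags, *keys):
    """Return the first non-missing string value among `keys`, else None."""
    for key in keys:
        value = tags.get(key)
        # NaN (a float) and blank strings count as missing.
        if isinstance(value, str) and value.strip():
            return value.strip()
    return None


def to_club_record(tags, latitude, longitude):
    """Map raw OSM tags to the column names of the planned schema."""
    return {
        "name": first_present(tags, "name"),
        "street": first_present(tags, "addr:street"),
        "housenumber": first_present(tags, "addr:housenumber"),
        "postcode": first_present(tags, "addr:postcode"),
        "city": first_present(tags, "addr:city"),
        "country": first_present(tags, "addr:country"),
        "website": first_present(tags, "website", "contact:website"),
        "phone": first_present(tags, "phone", "contact:phone"),
        "email": first_present(tags, "email", "contact:email"),
        "opening_hours": first_present(tags, "opening_hours"),
        "wheelchair": first_present(tags, "wheelchair"),
        "latitude": latitude,
        "longitude": longitude,
    }


sample_tags = {
    "name": "Begegnungszentrum im Kölner Viertel",
    "addr:street": "Müngersdorfer Straße",
    "addr:housenumber": "18",
    "addr:postcode": "12524",
    "addr:city": "Berlin",
    "website": "https://bik-verein.de",
}
record = to_club_record(sample_tags, 52.403802, 13.547368)
print(record["website"])

# Placeholder feature that only uses the contact:* style.
fallback = to_club_record({"contact:website": "https://example.org"}, 0.0, 0.0)
print(fallback["website"])
```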

In [85]:
clubs_df.head()

| element | id | name | club | leisure | sport | amenity | street | housenumber | district | postcode | city | website | phone | email | opening_hours | geometry | wheelchair | latitude | longitude |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| node | 60775321 | KW76 | poker |  |  |  | Konrad-Wolf-Straße | 76.0 |  |  |  |  |  |  |  | POINT (13.48162 52.53862) |  | 52.538623 | 13.481623 |
| node | 257709121 | Kulturhaus Spandau |  |  |  | arts_centre |  |  |  |  |  |  |  |  |  | POINT (13.20231 52.53548) | yes | 52.535479 | 13.202312 |
| node | 266630320 | Buergeramt Mahlsdorf |  |  |  | community_centre | Hönower Straße | 91.0 | Mahlsdorf | 12623.0 | Berlin |  |  |  |  | POINT (13.61206 52.51314) | yes | 52.51314 | 13.612063 |
| node | 268915262 | Karame e.V. |  |  |  | community_centre | Wilhelmshavener Straße | 22.0 | Moabit | 10551.0 | Berlin |  |  |  | Mo-Fr 13:00-18:00; Sa-Su,PH off | POINT (13.34154 52.531) | no | 52.531002 | 13.341544 |
| node | 268915306 | Biberzahn |  |  |  | community_centre | Jahnstraße | 5.0 | Kreuzberg | 10967.0 | Berlin |  |  |  |  | POINT (13.41931 52.48888) | no | 52.488882 | 13.419308 |


In [86]:
for col in ["name", "club", "leisure", "sport", "amenity"]:
    print(f"\n--- {col.upper()} ---")
    print(clubs_df[col].dropna().unique())


--- NAME ---
['KW76' 'Kulturhaus Spandau' 'Buergeramt Mahlsdorf' ... 'Spirit Joga'
 'Vereinsheim Treue Seele' 'Begegnungszentrum im Kölner Viertel']

--- CLUB ---
['poker' 'scout' 'sport' 'social' 'yes' 'dance' 'amateur_radio'
 'automobile' 'fishing' 'Körperschaft_des_Öffentlichen_Rechts' 'culture'
 'fan' 'animals' 'elderly' 'bonsai' 'dog' 'freemasonry' 'student'
 'business' 'game' 'music' 'ethnic' 'Agrarbörse Deutschland Ost' 'linux'
 'history' 'education' 'computer' 'religion' 'art' 'politics'
 'board_games' 'youth_movement' 'archive' 'chess' 'sailing' 'science'
 'humanist' 'charity' 'nature' 'hdk_0' 'youth' 'academic' 'motorcycle'
 'allotment_club' 'allotments' 'TC Berolina Biesdorf' 'gardening']

--- LEISURE ---
['fitness_centre' 'sports_centre' 'dance' 'hackerspace' 'pitch'
 'playground' 'music_venue' 'garden' 'marina' 'track']

--- SPORT ---
['bowling' '10pin' 'rowing' 'fitness' 'soccer' 'yoga' 'pilates'
 'gymnastics' 'hapkido' 'swimming' 'kung_fu' 'fitness;weightlifting;yoga'
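
The unique values above point to two cleaning steps: multi-valued entries such as `fitness;weightlifting;yoga` need splitting, and free-text entries in `club` (for example `TC Berolina Biesdorf`) look like names entered in the wrong tag. A small sketch of both checks follows. The regular expression encodes an assumed convention (lowercase ASCII, digits, underscores), so it is a heuristic and lets values like `hdk_0` through.

```python
import re

# Assumed convention for well-formed OSM values: lowercase snake_case tokens.
TAG_VALUE = re.compile(r"^[a-z0-9_]+$")


def split_tag_values(raw):
    """Split a semicolon-separated OSM value into cleaned parts."""
    return [part.strip() for part in raw.split(";") if part.strip()]


def looks_like_tag_value(value):
    """True when the value matches the assumed snake_case convention."""
    return bool(TAG_VALUE.match(value))


print(split_tag_values("fitness;weightlifting;yoga"))

club_values = [
    "poker",
    "amateur_radio",
    "Körperschaft_des_Öffentlichen_Rechts",
    "Agrarbörse Deutschland Ost",
    "TC Berolina Biesdorf",
    "hdk_0",
]
suspicious = [v for v in club_values if not looks_like_tag_value(v)]
print(suspicious)
```

Flagged values are candidates for manual review rather than automatic deletion.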
 

In [87]:
print(clubs_df.geometry.geom_type.value_counts())

Point    2691
Name: count, dtype: int64


In [88]:
print("Missing geometries:", clubs_df.geometry.isna().sum())

Missing geometries: 0


In [89]:
# Goal: Verify lat/lon look realistic.
# Why? If values are way off, something went wrong in conversion.

print("Latitude range:", clubs_df["latitude"].min(), "to", clubs_df["latitude"].max())

print("Longitude range:", clubs_df["longitude"].min(), "to", clubs_df["longitude"].max())

Latitude range: 52.37387955 to 52.654758799999996
Longitude range: 13.12237797012892 to 13.7311336
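
The visual range check can be turned into an automated one by comparing each coordinate pair against a bounding box. The sketch below assumes an approximate, outward-rounded box for Berlin (the exact limits are an assumption and should be replaced by the extent of the official boundary); `outside_bbox` is a hypothetical helper.

```python
# Approximate WGS84 bounding box for Berlin, rounded outward (assumed values).
BERLIN_BBOX = {"min_lat": 52.33, "max_lat": 52.68, "min_lon": 13.08, "max_lon": 13.77}


def outside_bbox(points, bbox=BERLIN_BBOX):
    """Return the (lat, lon) pairs that fall outside `bbox`."""
    return [
        (lat, lon)
        for lat, lon in points
        if not (
            bbox["min_lat"] <= lat <= bbox["max_lat"]
            and bbox["min_lon"] <= lon <= bbox["max_lon"]
        )
    ]


# Extremes reported by the range check above, plus one deliberately swapped pair.
points = [
    (52.37387955, 13.12237797012892),
    (52.654758799999996, 13.7311336),
    (13.48162, 52.53862),  # lat/lon swapped on purpose
]
print(outside_bbox(points))
```

A swapped latitude/longitude pair is the most common conversion mistake, and this check catches it immediately.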


## 1.3 Prepare the /sources Directory

**Raw data files:**

- **clubs_raw.geojson** (includes geometry)
- **clubs_raw.csv** (tabular only, no geometry)
- **README.md** in /sources will contain:
  - Data sources used.
  - Planned transformation steps.

In [90]:
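# Save the raw query result locally
clubs_gdf.to_file("clubs_raw.geojson", driver="GeoJSON")
# reset_index() keeps the OSM element type and id as columns; index=False alone would drop them
clubs_gdf.reset_index().drop(columns="geometry").to_csv("clubs_raw.csv", index=False)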

