# 🧪 Step 1: Research & Data Modelling
**PR Branch Name:** banks-data-modelling

This notebook documents the process for Step 1 of the "Banks in Berlin" project:
- **1.1 Data Source Discovery**
- **1.2 Modelling & Planning**
- **1.3 Prepare the /sources Directory**
- **1.4 Review**

Goal:
- Identify and document relevant data sources.
- Select the 23 key parameters for our use case.
- Draft the planned table schema.
- Plan cleaning and transformation steps before database population.


## 1.1 Data Source Discovery

**Topic:** Banks in Berlin

**Main source:**
- **Name:** OpenStreetMap (OSM) via OSMnx library
- **Source and origin:** Public crowdsourced geospatial database
- **Update frequency:** Continuous (dynamic)
- **Data type:** Dynamic (API query using `amenity=bank`)
- **Reason for selection:**  
  - Covers all banks in Berlin  
  - Includes coordinates, names, addresses, and other useful attributes  
  - Open, free, and easy to query programmatically

**Optional additional sources:**
- **Name:** Berlin Open Data Portal (daten.berlin.de)
- **Source and origin:** Official Berlin city government
- **Update frequency:** Varies per dataset
- **Data type:** Static or semi-static (download as CSV/GeoJSON)
- **Possible usage:** Enrich with official administrative boundaries or extra metadata

**Enrichment potential:**
- Neighborhood/district info from Berlin shapefiles (GeoJSON)
- Linking to local amenities for spatial context


In [3]:
# Install Libraries

# %pip install osmnx geopandas pandas --quiet

In [4]:
# Import Libraries

import osmnx as ox
import geopandas as gpd
import pandas as pd

In [None]:
# Fetch banks in Berlin from OSM using the tag "amenity=bank"
# The tags dict restricts the query to features that carry this tag

tags = {"amenity": "bank"}

In [None]:
# Fetch geometries for Berlin
# banks_gdf = GeoDataFrame (a DataFrame with a geometry column)

banks_gdf = ox.features_from_place("Berlin, Germany", tags)


In [7]:
# Display basic info

print(f"Number of bank entries fetched: {len(banks_gdf)}")
banks_gdf.head(3)

Number of bank entries fetched: 323


element,id,geometry,addr:city,addr:country,addr:housenumber,addr:postcode,addr:street,addr:suburb,amenity,atm,branch,...,operator:type,start_date,building:levels,roof:levels,roof:shape,indoor,access,room,western_union,building:part
node,28968292,POINT (13.31972 52.48667),Berlin,DE,42.0,10713.0,Berliner Straße,Wilmersdorf,bank,yes,eG,...,,,,,,,,,,
node,60848455,POINT (13.47104 52.53033),Berlin,,13.0,10369.0,Anton-Saefkow-Platz,,bank,yes,,...,,,,,,,,,,
node,87040399,POINT (13.3888 52.51105),,,,,,,bank,,,...,,,,,,,,,,


In [8]:
banks_gdf.info()

<class 'geopandas.geodataframe.GeoDataFrame'>
MultiIndex: 323 entries, ('node', np.int64(28968292)) to ('way', np.int64(611744021))
Data columns (total 100 columns):
 #   Column                                 Non-Null Count  Dtype   
---  ------                                 --------------  -----   
 0   geometry                               323 non-null    geometry
 1   addr:city                              227 non-null    object  
 2   addr:country                           141 non-null    object  
 3   addr:housenumber                       224 non-null    object  
 4   addr:postcode                          228 non-null    object  
 5   addr:street                            233 non-null    object  
 6   addr:suburb                            136 non-null    object  
 7   amenity                                323 non-null    object  
 8   atm                                    251 non-null    object  
 9   branch                                 9 non-null      object  
 10  b

## 1.2 Modelling & Planning

### Selected 23 Key Columns
1. osm_id
2. name
3. brand
4. operator
5. street
6. housenumber
7. postcode
8. city
9. country
10. phone
11. email
12. website
13. opening_hours
14. atm
15. wheelchair
16. building
17. latitude
18. longitude
19. geom_type
20. geom
21. neighbourhood
22. district
23. source

---

### How this connects to existing tables:
- **Coordinates (latitude, longitude, geom):** link to neighbourhood and district polygons.
- **Neighbourhood & district fields:** join with administrative boundaries table.
- **Source field:** ensures traceability.
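
The link between bank coordinates and neighbourhood/district polygons could be made with a spatial join. The following is a minimal sketch, not the project's final implementation: the two square "districts", the bank points, and the column name `district` on the polygon layer are made-up stand-ins for a real administrative boundaries table.

```python
import geopandas as gpd
from shapely.geometry import Point, Polygon

# Hypothetical district polygons (toy squares, NOT real Berlin boundaries)
districts = gpd.GeoDataFrame(
    {"district": ["West", "East"]},
    geometry=[
        Polygon([(13.0, 52.0), (13.4, 52.0), (13.4, 53.0), (13.0, 53.0)]),
        Polygon([(13.4, 52.0), (13.8, 52.0), (13.8, 53.0), (13.4, 53.0)]),
    ],
    crs="EPSG:4326",
)

# Hypothetical bank points in the same CRS as the polygons
banks = gpd.GeoDataFrame(
    {"name": ["Bank A", "Bank B"]},
    geometry=[Point(13.2, 52.5), Point(13.6, 52.5)],
    crs="EPSG:4326",
)

# A left join keeps every bank, even one that falls outside all polygons
joined = gpd.sjoin(banks, districts, how="left", predicate="within")
joined = joined.drop(columns="index_right", errors="ignore")

print(joined[["name", "district"]])
```

Both layers need to share one CRS before joining; banks that match no polygon keep a missing `district` value and can be reviewed separately.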

---

### Planned Schema: `banks_in_berlin`
| Column Name     | Data Type | Description | Example |
|-----------------|-----------|-------------|---------|
| osm_id          | int       | OSM element ID (unique per element type: node/way/relation) | 12345678 |
| name            | text      | Bank name | Deutsche Bank |
| brand           | text      | Brand name if available | Sparkasse |
| operator        | text      | Entity operating the bank | Berliner Volksbank |
| street          | text      | Street name | Friedrichstraße |
| housenumber     | text      | House number | 45 |
| postcode        | text      | Postal code | 10117 |
| city            | text      | City name | Berlin |
| country         | text      | Country code | DE |
| phone           | text      | Contact phone | +49 30 123456 |
| email           | text      | Contact email | info@bank.de |
| website         | text      | Website URL | www.bank.de |
| opening_hours   | text      | Opening hours string | Mo-Fr 09:00-17:00 |
| atm             | text      | Presence of ATM | yes |
| wheelchair      | text      | Accessibility info | yes |
| building        | text      | Building type | yes |
| latitude        | float     | Latitude coordinate | 52.5200 |
| longitude       | float     | Longitude coordinate | 13.4050 |
| geom_type       | text      | Geometry type | Point |
| geom            | geometry  | Full geometry | (GeoJSON) |
| neighbourhood   | text      | Local neighbourhood name | Mitte |
| district        | text      | Berlin district | Mitte |
| source          | text      | Data source info | OSM |

---

### Known Data Issues
- Missing contact details for some entries.
- Inconsistent postcode and address formats.
- Neighbourhood and district not always included in raw OSM data.
- Opening hours in non-standard formats.

---

### Transformation Plan
1. Fetch data from OSM with filter `amenity=bank` for the place "Berlin, Germany".
2. Clean column names → snake_case.
3. Normalize formats (phone, postcode, website URLs).
4. Enrich with neighbourhood/district via spatial join.
5. Save cleaned dataset (GeoJSON + CSV).
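
Steps 2 and 3 of the plan could be covered by a few small helpers. This is a sketch under stated assumptions rather than the final cleaning code: the function names are hypothetical, postcodes are assumed to be five-digit German codes, and websites without a scheme are assumed to work over `https://`.

```python
import math
import re
from typing import Optional


def _is_missing(value) -> bool:
    """True for None and float NaN, the two ways pandas marks an empty cell."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def to_snake_case(column: str) -> str:
    """Turn an OSM key such as 'addr:street' into 'addr_street'."""
    cleaned = re.sub(r"[^0-9A-Za-z]+", "_", column.strip())
    return cleaned.strip("_").lower()


def normalize_postcode(value) -> Optional[str]:
    """Return a 5-digit postcode string, or None if the value does not look like one."""
    if _is_missing(value):
        return None
    text = re.sub(r"\.0$", "", str(value).strip())  # undo float formatting like '10713.0'
    return text if re.fullmatch(r"\d{5}", text) else None


def normalize_website(value) -> Optional[str]:
    """Prefix a scheme when it is missing (assumption: https is acceptable)."""
    if _is_missing(value) or not str(value).strip():
        return None
    url = str(value).strip()
    if not re.match(r"https?://", url, flags=re.IGNORECASE):
        url = "https://" + url
    return url


def normalize_phone(value) -> Optional[str]:
    """Keep only digits and '+', dropping spaces, dashes and brackets."""
    if _is_missing(value):
        return None
    digits = re.sub(r"[^\d+]", "", str(value))
    return digits or None


print(to_snake_case("addr:street"))      # addr_street
print(normalize_postcode("10713.0"))     # 10713
print(normalize_website("www.bank.de"))  # https://www.bank.de
print(normalize_phone("+49 30 123456"))  # +4930123456
```

Each helper returns `None` for values it cannot interpret, so questionable entries stay visible as missing instead of being silently "repaired".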


In [9]:
# Select 23 Columns & Add Coordinates

In [10]:
# Reproject to WGS84 (EPSG:4326) so coordinates are in degrees of longitude/latitude

banks_gdf = banks_gdf.to_crs(epsg=4326)

In [11]:
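# Extract latitude and longitude
# Note: .x/.y are only defined for Point geometries, and the fetched data also
# contains 'way' elements, so the lines below stay commented out in this step
# and latitude/longitude remain empty in the previews further down.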

# banks_gdf["latitude"] = banks_gdf.geometry.y
# banks_gdf["longitude"] = banks_gdf.geometry.x
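
Because the dataset mixes `node` and `way` elements, one option for filling `latitude`, `longitude` and `geom_type` is to take a representative point per geometry instead of reading `.x`/`.y` directly. This is a sketch of that swapped-in technique, not what the notebook currently does; the two-row GeoDataFrame below is a hypothetical stand-in for `banks_gdf`.

```python
import geopandas as gpd
from shapely.geometry import Point, Polygon

# Toy stand-in for banks_gdf: one node (Point) and one way (Polygon)
gdf = gpd.GeoDataFrame(
    {"name": ["node bank", "way bank"]},
    geometry=[
        Point(13.31972, 52.48667),
        Polygon([(13.40, 52.50), (13.41, 52.50), (13.41, 52.51), (13.40, 52.51)]),
    ],
    crs="EPSG:4326",
)

# representative_point() returns a point guaranteed to lie within each geometry;
# for a Point it is the point itself
points = gdf.geometry.representative_point()

gdf["latitude"] = points.y
gdf["longitude"] = points.x
gdf["geom_type"] = gdf.geometry.geom_type  # e.g. 'Point', 'Polygon'

print(gdf[["name", "latitude", "longitude", "geom_type"]])
```

A centroid would also work for most building footprints, but it can fall outside concave shapes, which is why the sketch uses a representative point.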

In [12]:
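# Select the target columns (22 are listed here; geom_type from the schema is not included yet).
# Columns that are missing from the data are added as None in a later cell.

selected_columns = [
    # NOTE: "osmid" is not a column in this result; the OSM id sits in the index
    # level "id", so this entry ends up as an all-None placeholder
    "osmid", "name", "brand", "operator",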
    "addr:street", "addr:housenumber", "addr:postcode", "addr:city", "addr:country",
    "phone", "email", "website", "opening_hours",
    "atm", "wheelchair", "building",
    "latitude", "longitude", "geometry",
    # placeholders for enrichment
    "neighbourhood", "district",
    # add source info
    "source"
]

In [None]:
# Rename map for only the columns that need renaming

rename_map = {
    "osmid": "osm_id",
    "addr:street": "street",
    "addr:housenumber": "housenumber",
    "addr:postcode": "postcode",
    "addr:city": "city",
    "addr:country": "country"
}

In [21]:
# Add missing columns if they don’t exist in the data
for col in selected_columns:
    if col not in banks_gdf.columns:
        banks_gdf[col] = None

In [22]:
# Select the columns in the right order
banks_df = banks_gdf[selected_columns]

In [23]:
# Rename the columns
banks_df = banks_df.rename(columns=rename_map)
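
In the outputs of this notebook the OSM identifiers appear as the index levels `element` and `id`, not as a column, which would explain why `osm_id` has no values after the rename. One way to populate it is to move the index into columns. The sketch below uses a small hypothetical DataFrame with the same index layout; `osm_type` is an illustrative column name that is not part of the planned schema, and "Example Bank" is a placeholder.

```python
import pandas as pd

# Toy frame with the same (element, id) MultiIndex layout as the OSMnx result
index = pd.MultiIndex.from_tuples(
    [("node", 28968292), ("node", 60848455), ("way", 611744021)],
    names=["element", "id"],
)
df = pd.DataFrame(
    {"name": ["Berliner Volksbank", "Sparkasse", "Example Bank"]},
    index=index,
)

# reset_index() turns both index levels into ordinary columns
flat = df.reset_index().rename(columns={"id": "osm_id", "element": "osm_type"})

print(flat)
```

When applying this to `banks_df`, the empty `osm_id` placeholder column would have to be dropped first to avoid a duplicate column name.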

In [24]:
# Preview the final DataFrame
banks_df.head()

element,id,osm_id,name,brand,operator,street,housenumber,postcode,city,country,phone,...,opening_hours,atm,wheelchair,building,latitude,longitude,geometry,neighbourhood,district,source
node,28968292,,Berliner Volksbank,Berliner Volksbank,,Berliner Straße,42,10713.0,Berlin,DE,,...,"Mo-Fr 10:00-13:00, Mo 14:00-16:00, Tu,Th 14:00...",yes,yes,,,,POINT (13.31972 52.48667),,,
node,60848455,,Sparkasse,,Berliner Sparkasse,Anton-Saefkow-Platz,13,10369.0,Berlin,,,...,"Mo,We,Fr 09:30-15:00; Tu,Th 09:30-18:00",yes,limited,,,,POINT (13.47104 52.53033),,,
node,87040399,,DKB,,,,,,,,,...,,,limited,,,,POINT (13.3888 52.51105),,,
node,89274635,,Deutsche Bank,Deutsche Bank,Deutsche Bank,Alexanderstraße,5,10178.0,Berlin,,,...,Mo-Tu 10:00-18:00; We 10:00-16:00; Th 10:00-18...,yes,yes,,,,POINT (13.41575 52.52324),,,
node,203561614,,Sparkasse,,Berliner Sparkasse,Helene-Weigel-Platz,1/2,12681.0,Berlin,,,...,"Mo,We,Fr 09:30-15:00; Tu,Th 09:30-18:00",yes,yes,,,,POINT (13.53833 52.52769),,,


In [None]:
# Info about the final DataFrame
banks_df.info()

<class 'geopandas.geodataframe.GeoDataFrame'>
MultiIndex: 323 entries, ('node', np.int64(28968292)) to ('way', np.int64(611744021))
Data columns (total 22 columns):
 #   Column         Non-Null Count  Dtype   
---  ------         --------------  -----   
 0   osm_id         0 non-null      object  
 1   name           322 non-null    object  
 2   brand          194 non-null    object  
 3   operator       174 non-null    object  
 4   street         233 non-null    object  
 5   housenumber    224 non-null    object  
 6   postcode       228 non-null    object  
 7   city           227 non-null    object  
 8   country        141 non-null    object  
 9   phone          28 non-null     object  
 10  email          2 non-null      object  
 11  website        46 non-null     object  
 12  opening_hours  278 non-null    object  
 13  atm            251 non-null    object  
 14  wheelchair     287 non-null    object  
 15  building       8 non-null      object  
 16  latitude       0 non

In [None]:
# How many rows and columns?
# banks_df.shape

print("Rows, Columns:", banks_df.shape)

Rows, Columns: (323, 22)


In [None]:
# What are the column names (in order)?
# banks_df.columns.tolist()

print("\nColumns:", banks_df.columns.tolist())


Columns: ['osm_id', 'name', 'brand', 'operator', 'street', 'housenumber', 'postcode', 'city', 'country', 'phone', 'email', 'website', 'opening_hours', 'atm', 'wheelchair', 'building', 'latitude', 'longitude', 'geometry', 'neighbourhood', 'district', 'source']
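
A quick way to check this column list against the planned `banks_in_berlin` schema is a plain list comparison. The sketch below copies both lists from this notebook (the schema table and the output above); the variable names are illustrative.

```python
# Column names from the planned schema table (23 entries)
planned = [
    "osm_id", "name", "brand", "operator", "street", "housenumber",
    "postcode", "city", "country", "phone", "email", "website",
    "opening_hours", "atm", "wheelchair", "building", "latitude",
    "longitude", "geom_type", "geom", "neighbourhood", "district", "source",
]

# Column names printed by banks_df.columns.tolist() (22 entries)
actual = [
    "osm_id", "name", "brand", "operator", "street", "housenumber",
    "postcode", "city", "country", "phone", "email", "website",
    "opening_hours", "atm", "wheelchair", "building", "latitude",
    "longitude", "geometry", "neighbourhood", "district", "source",
]

missing = [col for col in planned if col not in actual]
extra = [col for col in actual if col not in planned]

print(len(planned), len(actual))  # 23 22
print("Missing:", missing)        # Missing: ['geom_type', 'geom']
print("Extra:", extra)            # Extra: ['geometry']
```

The gap comes down to naming (`geometry` versus `geom`) plus the not yet derived `geom_type` column.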


In [None]:
# What data types does pandas/GeoPandas see?
# banks_df.dtypes

print("\nDtypes:\n", banks_df.dtypes)


Dtypes:
 osm_id             object
name               object
brand              object
operator           object
street             object
housenumber        object
postcode           object
city               object
country            object
phone              object
email              object
website            object
opening_hours      object
atm                object
wheelchair         object
building           object
latitude           object
longitude          object
geometry         geometry
neighbourhood      object
district           object
source             object
dtype: object


In [36]:
# Count the rows

row_count = len(banks_df)
print(row_count)

323


In [None]:
# Count missing values (NaN/None) in each column

missing_count = banks_df.isna().sum().sort_values(ascending=False)
print(missing_count)


## 1.3 Prepare the /sources Directory

- **Raw Data Files:**  
    - `banks_raw.geojson` (includes geometry)  
    - `banks_raw.csv` (tabular only, no geometry)  

- **README.md** in `/sources` will contain:
    - Data sources used.
    - Planned transformation steps.


In [42]:
# Save as GeoJSON (keeps geometry) and CSV

raw_geojson_path = "../sources/banks_raw.geojson"
raw_csv_path = "../sources/banks_raw.csv"


banks_gdf.to_file(raw_geojson_path, driver="GeoJSON")
# Keep the (element, id) index in the CSV; index=False would drop the OSM ids
banks_gdf.drop(columns="geometry").to_csv(raw_csv_path)

print(f"Raw data saved to: {raw_geojson_path} and {raw_csv_path}")

Raw data saved to: ../sources/banks_raw.geojson and ../sources/banks_raw.csv


## 1.4 Review

- All 23 target columns defined.
- Data sources identified and documented.
- Schema draft created.
- Data fetched and stored in `/sources`.
- Data cleaning & enrichment plan in place.

**Next Step:** Step 2 — Fetch & Transform data.
