# Goal: Create a Pharmacies Table

# Procedures

## 🧪 Step 1: Research & Data Modelling
## 🛠 Step 2: Data Transformation
## 🧩 Step 3: Populate Database

# Result:

### Selected 16 Key Columns
1. pharmacy_id (primary_key)
2. district_id (foreign_key)
3. name
4. street
5. housenumber
6. post_code
7. district
8. neighbourhood
9. phone_number
10. website
11. service_offered
12. openinghours
13. wheelchair_accessible
14. latitude
15. longitude
16. coordinate

### Planned Schema: `pharmacy_in_berlin`
| Column Name     | Data Type | Description | Example |
|-----------------|-----------|-------------|---------|
| pharmacy_id     | int       | Unique OSM ID | 12345678 |
| district_id     | int       | District ID (foreign key) | 11011011 |
| name            | text      | Pharmacy name | reichenberger apotheke |
| street          | text      | Street name | Friedrichstraße |
| housenumber     | text      | House number | 45 |
| post_code       | text      | Postal code | 10117 |
| district        | text      | District name | Mitte |
| neighbourhood   | text      | Local neighbourhood name | Mitte |
| phone_number    | text      | Contact phone | +49 30 123456 |
| website         | text      | Website URL | www.apotheke.de |
| service_offered | text      | dispensing info | yes |
| openinghours    | text      | Opening hours string | mo-fr 09:00-17:00 |
| wheelchair_accessible      | text      | Accessibility info | yes |
| latitude        | float     | Latitude coordinate | 52.5200 |
| longitude       | float     | Longitude coordinate | 13.4050 |
| coordinate      | text      | Geometry type | Point |
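
The planned schema can be tried out before Step 3. The following is a minimal sketch, not the project's actual database setup: it assumes SQLite (from the Python standard library) as a stand-in for the target database, maps `int`/`text`/`float` to SQLite's `INTEGER`/`TEXT`/`REAL`, and leaves the foreign-key constraint out because the districts table is not defined in this section. The function name `create_pharmacy_table` is hypothetical.

```python
import sqlite3

# Column names and types follow the planned schema table above.
# SQLite type names (INTEGER/TEXT/REAL) stand in for int/text/float.
DDL = """
CREATE TABLE IF NOT EXISTS pharmacy_in_berlin (
    pharmacy_id           INTEGER PRIMARY KEY,  -- OSM ID
    district_id           INTEGER,              -- intended foreign key (districts table not defined here)
    name                  TEXT,
    street                TEXT,
    housenumber           TEXT,
    post_code             TEXT,
    district              TEXT,
    neighbourhood         TEXT,
    phone_number          TEXT,
    website               TEXT,
    service_offered       TEXT,
    openinghours          TEXT,
    wheelchair_accessible TEXT,
    latitude              REAL,
    longitude             REAL,
    coordinate            TEXT
);
"""


def create_pharmacy_table(conn: sqlite3.Connection) -> list[str]:
    """Create the table and return its column names in declaration order."""
    conn.execute(DDL)
    rows = conn.execute("PRAGMA table_info(pharmacy_in_berlin)").fetchall()
    return [row[1] for row in rows]  # second field of each row is the column name


conn = sqlite3.connect(":memory:")  # throwaway in-memory database
columns = create_pharmacy_table(conn)
print(len(columns), "columns:", columns)
```

Keeping `housenumber` and `post_code` as text avoids losing values such as `12a` or postcodes with a leading zero.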


### Table Creation Logic
- Keep a unique primary key: pharmacy_id
- Remove columns with more than 85% missing values (e.g., email)
- Retain address-related information
- Retain contact information: phone_number, website
- Retain key attributes: service_offered, wheelchair_accessible, openinghours
- Add location hierarchy fields: district_id (used as a foreign key), district, neighbourhood
- Add geographic information: latitude, longitude, and coordinate







# 🧪 Step 1: Research & Data Modelling
**PR Branch Name:** pharmacies-data-modelling

This notebook documents the process for Step 1 of the "Pharmacies in Berlin" project:
- **1.1 Data Source Discovery**
- **1.2 Modelling & Planning**
- **1.3 Prepare the /sources Directory**
- **1.4 Review**

Goal:
- Identify and document relevant data sources.
- Select the key parameters for our use case.
- Draft the planned table schema.
- Plan cleaning and transformation steps before database population.


## 1.1 Data Source Discovery

**Topic:** Pharmacies in Berlin

**Main source:**
- **Name:** OpenStreetMap (OSM) via OSMnx library
- **Source and origin:** Public crowdsourced geospatial database
- **Update frequency:** Continuous (dynamic)
- **Data type:** Dynamic (API query using `amenity=pharmacy`)
- **Reason for selection:**  
  - Covers all pharmacies in Berlin  
  - Includes coordinates, names, addresses, and other useful attributes  
  - Open, free, and easy to query programmatically

**Optional additional sources:**
- **Name:** Berlin Open Data Portal (daten.berlin.de)
- **Source and origin:** Official Berlin city government
- **Update frequency:** Varies per dataset
- **Data type:** Static or semi-static (download as CSV/GeoJSON)
- **Possible usage:** Enrich with official administrative boundaries or extra metadata

**Enrichment potential:**
- Neighborhood/district info from Berlin shapefiles (GeoJSON)
- Linking to local amenities for spatial context


In [1]:
# Install Libraries

! pip install osmnx geopandas pandas --quiet

In [2]:
# Import Libraries

import osmnx as ox # to fetch data from OpenStreetMap
import geopandas as gpd # to work with geospatial data
import pandas as pd

In [3]:
# Fetch pharmacies in Berlin from OSM using the tag "amenity=pharmacy"
# The tags dict filters for only those features that carry this tag

tags = {"amenity": "pharmacy"}

In [4]:
# Fetch geometries for Berlin
# pharmacy_gdf = GeoDataFrame (DataFrame with geometry)

pharmacy_gdf = ox.features_from_place("Berlin, Germany", tags)


In [5]:
# Display basic info

print(f"Number of pharmacy entries fetched: {len(pharmacy_gdf)}")
pharmacy_gdf.head()

Number of pharmacy entries fetched: 675


element,id,geometry,addr:city,addr:country,addr:housenumber,addr:postcode,addr:street,addr:suburb,amenity,check_date:opening_hours,dispensing,...,opening_hours:covid19,opening_hours:url,indoor,building,building:colour,building:roof,roof:shape,access,room,payment:american_express
node,60775323,POINT (13.4866 52.54074),Berlin,DE,3.0,13055.0,Reichenberger Straße,Alt-Hohenschönhausen,pharmacy,2024-09-26,yes,...,,,,,,,,,,
node,60848447,POINT (13.46965 52.53184),,,,,,,pharmacy,2025-02-11,yes,...,,,,,,,,,,
node,60852928,POINT (13.46851 52.52756),Berlin,DE,11.0,10369.0,Rudolf-Seiffert-Straße,Fennpfuhl,pharmacy,2024-10-16,yes,...,,,,,,,,,,
node,68437791,POINT (13.45057 52.48939),Berlin,,46.0,12435.0,Karl-Kunger-Straße,Alt-Treptow,pharmacy,2024-06-04,yes,...,,,,,,,,,,
node,69226035,POINT (13.39586 52.51056),Berlin,DE,43.0,10117.0,Leipziger Straße,Mitte,pharmacy,,yes,...,,,,,,,,,,


In [6]:
pharmacy_gdf.info()

<class 'geopandas.geodataframe.GeoDataFrame'>
MultiIndex: 675 entries, ('node', np.int64(60775323)) to ('way', np.int64(1410373889))
Columns: 103 entries, geometry to payment:american_express
dtypes: geometry(1), object(102)
memory usage: 567.1+ KB


In [7]:
# List the columns names of the GeoDataFrame
pharmacy_gdf.columns.to_list()


['geometry',
 'addr:city',
 'addr:country',
 'addr:housenumber',
 'addr:postcode',
 'addr:street',
 'addr:suburb',
 'amenity',
 'check_date:opening_hours',
 'dispensing',
 'healthcare',
 'name',
 'opening_hours',
 'phone',
 'toilets:wheelchair',
 'website',
 'wheelchair',
 'check_date',
 'man_made',
 'payment:mastercard',
 'payment:visa',
 'surveillance',
 'contact:website',
 'level',
 'email',
 'fax',
 'contact:phone',
 'drinking_water:refill',
 'operator',
 'addr:city:fa',
 'drive_through',
 'contact:email',
 'owner',
 'contact:fax',
 'wheelchair:description',
 'start_date',
 'source',
 'health_facility:type',
 'medical_system:western',
 'payment:cash',
 'payment:credit_cards',
 'payment:debit_cards',
 'name_old',
 'building:levels',
 'old_name',
 'wheelchair:source',
 'ref:vatin',
 'access:covid19',
 'delivery:covid19',
 'brand',
 'brand:wikidata',
 'name:en',
 'addr:place',
 'dog',
 'opening_hours:signed',
 'takeaway:covid19',
 'payment:girocard',
 'addr:housename',
 'wheelchair:de

In [8]:
# Explore all columns 
pharmacy_gdf.describe(include="all").T


Unnamed: 0,count,unique,top,freq
geometry,675,675,POINT (13.4866016 52.5407412),1
addr:city,545,1,Berlin,545
addr:country,425,1,DE,425
addr:housenumber,561,240,1,18
addr:postcode,555,175,12043,9
...,...,...,...,...
building:roof,1,1,flat,1
roof:shape,1,1,flat,1
access,2,1,customers,2
room,3,2,shop,2


In [9]:
# check missing values in each column
missing_count = pharmacy_gdf.isna().sum().sort_values(ascending=False)
# List missing value counts for columns with more than 200 missing values
print(missing_count[missing_count > 200])


payment:american_express    674
wheelchair:source           674
width                       674
branch                      674
drinking_water:refill       674
                           ... 
website                     387
phone                       375
check_date:opening_hours    336
addr:country                250
addr:suburb                 231
Length: 92, dtype: int64


In [10]:
# Check unique values in the 'brand' column
pharmacy_gdf['brand'].value_counts()

brand
easyApotheke        12
Linden Apotheken     1
Name: count, dtype: int64

In [11]:
# Expand all columns to see more details
pd.set_option('display.max_columns', None)
print(pharmacy_gdf.head(3))

                                   geometry addr:city addr:country  \
element id                                                           
node    60775323   POINT (13.4866 52.54074)    Berlin           DE   
        60848447  POINT (13.46965 52.53184)       NaN          NaN   
        60852928  POINT (13.46851 52.52756)    Berlin           DE   

                 addr:housenumber addr:postcode             addr:street  \
element id                                                                
node    60775323                3         13055    Reichenberger Straße   
        60848447              NaN           NaN                     NaN   
        60852928               11         10369  Rudolf-Seiffert-Straße   

                           addr:suburb   amenity check_date:opening_hours  \
element id                                                                  
node    60775323  Alt-Hohenschönhausen  pharmacy               2024-09-26   
        60848447                   NaN  ph

## 1.2 Modelling & Planning

### Selected Key Columns

- Remove columns with more than 85% missing values (e.g., email)
- Retain address-related information
- Retain contact information: phone_number, website
- Retain key attributes: openinghours, dispensing, wheelchair_accessible, etc.



---

### How this connects to existing tables:
- **Coordinates (latitude, longitude, geom):** link to neighbourhood and district polygons.
- **Neighbourhood & district fields:** join with administrative boundaries table.
- **Source field:** ensures traceability.
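
The link between coordinates and district polygons can be sketched as a spatial join. This is only an illustration under stated assumptions: the two rectangles are hypothetical stand-ins, not real Berlin district boundaries, and the `district_id`/`district` values are made up. The two points reuse coordinates from the preview output above.

```python
import geopandas as gpd
from shapely.geometry import Point, box

# Toy district polygons (hypothetical rectangles, NOT real Berlin boundaries)
districts = gpd.GeoDataFrame(
    {"district_id": [1, 2], "district": ["West", "East"]},
    geometry=[box(13.0, 52.3, 13.4, 52.7), box(13.4, 52.3, 13.8, 52.7)],
    crs="EPSG:4326",
)

# Two pharmacies from the preview output, as Point(longitude, latitude)
pharmacies = gpd.GeoDataFrame(
    {"name": ["Leipziger Apotheke", "Reichenberger Apotheke"]},
    geometry=[Point(13.39586, 52.51056), Point(13.4866, 52.54074)],
    crs="EPSG:4326",
)

# A left join keeps every pharmacy, even one that falls outside all polygons
joined = gpd.sjoin(pharmacies, districts, how="left", predicate="within")
joined = joined.drop(columns="index_right")  # helper column added by sjoin
print(joined[["name", "district_id", "district"]])
```

Both layers must share the same CRS before joining; pharmacies that match no polygon come back with missing district values and can be reviewed separately.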

---

### Planned Schema: `pharmacy_in_berlin`


---

### Known Data Issues
- Missing contact details for some entries.
- Inconsistent phone formats.
- Neighbourhood and district not always included in raw OSM data.
- Opening hours in non-standard formats.
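
For the inconsistent phone formats, one possible normalization is sketched below. It assumes German numbers (country code 49) and a compact `+49...` target format; the helper name `normalize_phone` and the choice of target format are assumptions, not something the notebook prescribes.

```python
import re
from typing import Optional


def normalize_phone(raw: object, country_code: str = "49") -> Optional[str]:
    """Return a compact '+<country><number>' string, or None if unusable."""
    if not isinstance(raw, str) or not raw.strip():
        return None  # NaN, None and empty strings stay missing
    digits = re.sub(r"[^\d+]", "", raw)  # drop spaces, dashes, brackets, slashes
    if digits.startswith("+"):
        number = digits[1:]
    elif digits.startswith("00"):
        number = digits[2:]                  # 0049... -> 49...
    elif digits.startswith("0"):
        number = country_code + digits[1:]   # national format: 030... -> 4930...
    else:
        return None                          # no recognisable prefix
    return "+" + number if number.isdigit() else None


for sample in ["+49 30 9713807", "030 9713807", "0049 (30) 971-3807", None]:
    print(repr(sample), "->", normalize_phone(sample))
```

Returning `None` for anything unrecognised keeps bad values visible as missing instead of silently storing a malformed number.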


---

### Transformation Plan
1. Fetch data from OSM with filter `amenity=pharmacy` (Berlin place boundary via `ox.features_from_place`).
2. Clean column names → snake_case.
3. Normalize formats (phone, postcode, website URLs).
4. Enrich with neighbourhood/district via spatial join.
5. Save cleaned dataset (GeoJSON + CSV).
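
Step 2 of the plan (snake_case column names) could look like the sketch below. The helper `to_snake_case` is hypothetical and generic: it keeps the OSM prefix (`addr:street` becomes `addr_street`), whereas the notebook's own `rename_map` chooses shorter names such as `street`.

```python
import re


def to_snake_case(column: str) -> str:
    """Lowercase a column name and replace runs of non-alphanumerics with '_'."""
    cleaned = re.sub(r"[^0-9a-zA-Z]+", "_", column.strip())
    return cleaned.strip("_").lower()


raw_columns = ["addr:street", "addr:housenumber", "payment:visa",
               "opening_hours", "contact:phone"]
snake_columns = [to_snake_case(c) for c in raw_columns]
print(snake_columns)
# ['addr_street', 'addr_housenumber', 'payment_visa', 'opening_hours', 'contact_phone']
```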





In [None]:
# Selected Columns & Add Coordinates

In [12]:
# Reproject to WGS84 (EPSG:4326) so that x/y are longitude/latitude in degrees

pharmacy_gdf = pharmacy_gdf.to_crs(epsg=4326)


In [13]:
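# Ensure every geometry is a Point: replace non-Point geometries (e.g. building outlines from OSM ways) with a representative point
pharmacy_gdf['geometry'] = pharmacy_gdf['geometry'].apply(lambda geom: geom if geom.geom_type == 'Point' else geom.representative_point())
# Extract latitude and longitude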
pharmacy_gdf["latitude"] = pharmacy_gdf.geometry.y
pharmacy_gdf["longitude"] = pharmacy_gdf.geometry.x
pharmacy_gdf

element,id,geometry,addr:city,addr:country,addr:housenumber,addr:postcode,addr:street,addr:suburb,amenity,check_date:opening_hours,dispensing,healthcare,name,opening_hours,phone,toilets:wheelchair,website,wheelchair,check_date,man_made,payment:mastercard,payment:visa,surveillance,contact:website,level,email,fax,contact:phone,drinking_water:refill,operator,addr:city:fa,drive_through,contact:email,owner,contact:fax,wheelchair:description,start_date,source,health_facility:type,medical_system:western,payment:cash,payment:credit_cards,payment:debit_cards,name_old,building:levels,old_name,wheelchair:source,ref:vatin,access:covid19,delivery:covid19,brand,brand:wikidata,name:en,addr:place,dog,opening_hours:signed,takeaway:covid19,payment:girocard,addr:housename,wheelchair:description:de,wheelchair:description:en,language:en,language:es,language:tr,entrance,description,payment:contactless,operator:type,contact:whatsapp,air_conditioning,wikidata,wikipedia,fixme,alt_name,payment:maestro,internet_access,delivery,instagram,name:de,note,contact:facebook,branch,contact:instagram,addr:floor,width,addr:flats,network,layer,service:bicycle:pump,service:bicycle:retail,service:bicycle:second_hand,mapillary,addr:inclusion,disused:amenity,opening_hours:covid19,opening_hours:url,indoor,building,building:colour,building:roof,roof:shape,access,room,payment:american_express,latitude,longitude
node,60775323,POINT (13.4866 52.54074),Berlin,DE,3,13055,Reichenberger Straße,Alt-Hohenschönhausen,pharmacy,2024-09-26,yes,pharmacy,Reichenberger Apotheke,Mo-Th 08:00-19:00; Fr 08:00-18:30; Sa 09:00-13:00,+49 30 9713807,no,https://reichenbergerapotheke.de/,yes,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,52.540741,13.486602
node,60848447,POINT (13.46965 52.53184),,,,,,,pharmacy,2025-02-11,yes,pharmacy,Castello-Apotheke,Mo-Fr 08:30-19:00; Sa 08:30-14:00,,,,yes,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,52.531835,13.469654
node,60852928,POINT (13.46851 52.52756),Berlin,DE,11,10369,Rudolf-Seiffert-Straße,Fennpfuhl,pharmacy,2024-10-16,yes,pharmacy,Rosen Apotheke,Mo-Fr 08:00-19:00; Sa 08:00-12:00,+49 30 9759449,,https://www.zurrose.de/,yes,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,52.527555,13.468513
node,68437791,POINT (13.45057 52.48939),Berlin,,46,12435,Karl-Kunger-Straße,Alt-Treptow,pharmacy,2024-06-04,yes,pharmacy,Margareten-Apotheke,Mo-Fr 08:30-18:30; Sa 08:30-13:00,+49 30 5337855,,http://www.apotheke.borchert-online.de/,no,2024-09-06,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,52.489390,13.450570
node,69226035,POINT (13.39586 52.51056),Berlin,DE,43,10117,Leipziger Straße,Mitte,pharmacy,,yes,pharmacy,Leipziger Apotheke,Mo-Fr 08:00-19:00; Sa 08:00-14:00,,,https://www.leipziger-apotheke.de/,yes,2022-08-22,surveillance,yes,yes,outdoor,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,52.510556,13.395863
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
way,510479858,POINT (13.30532 52.61841),Berlin,,47,13467,Heinsestraße,,pharmacy,,,pharmacy,Hirsch-Apotheke,Mo-Fr 08:30-18:30; Sa 08:30-14:00,,,,yes,,,,,,,,,,,,Christiane von Dallwitz,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,52.618413,13.305325
way,739645093,POINT (13.44714 52.50581),,,,,,,pharmacy,,yes,pharmacy,Arena Apotheke,"Mo-Fr 08:00-20:00; Sa 09:00-20:00; Su,PH off",+49 30 4226620,,https://www.arena-apotheke.de/,yes,2023-07-31,,yes,yes,,,0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,yes,52.505810,13.447141
way,1002286243,POINT (13.60546 52.53736),Berlin,DE,24,12627,Stendaler Straße,Hellersdorf,pharmacy,,yes,pharmacy,Kastanien Apotheke,"Mo-Fr 08:30-20:00, Sa 10:00-20:00; PH off",,,,yes,,,,,,https://www.kastanien-apotheke.de/,0,,,+49 30 9939169,,Sandra Bouvain e.K.,,,info@kastanien-apotheke.de,,+49 30 9939174,,,,pharmacy,yes,,,,,,,,,,,,,Kastanien Pharmacy,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,room,,,,,,,,52.537358,13.605459
way,1208913688,POINT (13.58979 52.51307),,,,,,,pharmacy,,yes,pharmacy,Prinzen Apotheke,"Mo-Fr 08:00-18:30, Sa 09:00-12:00; PH off",+49 30 5638146,,https://prinzen-apotheke-berlin.de/,yes,,,,,,,,Prinzen-Apo@t-online.de,+49 30 5638147,,,Karin Neumann e.K.,,,,,,,,,pharmacy,yes,,,,,1,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,commercial,,,,,,,52.513065,13.589787


In [14]:
# Select the 25 columns (fill missing with None if not present)

selected_columns = [
    #"osmid",
    "name", "brand", "operator",
    "addr:street", "addr:housenumber", "addr:postcode", "addr:suburb","addr:city", "addr:country",
    "phone", "email", "website", "opening_hours",
    "payment:visa", "payment:mastercard","payment:girocard", "dispensing", "delivery","surveillance","wheelchair", "building",
    "latitude", "longitude", "geometry",
    # placeholders for enrichment
    #"neighbourhood", "district",
    # add source info
    "source"
]

In [15]:
# Rename map for only the columns that need renaming

rename_map = {
    "addr:street": "street",
    "addr:housenumber": "housenumber",
    "addr:postcode": "postcode",
    "addr:suburb": "suburb",
    "addr:city": "city",
    "addr:country": "country",
    "payment:visa": "payment_visa",
    "payment:mastercard": "payment_mastercard",
    "payment:girocard": "payment_girocard",
    "opening_hours": "openinghours",
    "wheelchair": "wheelchair_accessible",
    "building": "building_type"
}

In [15]:
# # Add missing columns if they don’t exist in the data
# for col in selected_columns:
#     if col not in pharmacy_gdf.columns:
#         pharmacy_gdf[col] = None

In [16]:
# Select the columns in the right order
pharmacy_df = pharmacy_gdf[selected_columns]

In [17]:
# Rename the columns
pharmacy_df = pharmacy_df.rename(columns=rename_map)

In [18]:
# Preview the final DataFrame
pharmacy_df.head()

element,id,name,brand,operator,street,housenumber,postcode,suburb,city,country,phone,email,website,openinghours,payment_visa,payment_mastercard,payment_girocard,dispensing,delivery,surveillance,wheelchair_accessible,building_type,latitude,longitude,geometry,source
node,60775323,Reichenberger Apotheke,,,Reichenberger Straße,3.0,13055.0,Alt-Hohenschönhausen,Berlin,DE,+49 30 9713807,,https://reichenbergerapotheke.de/,Mo-Th 08:00-19:00; Fr 08:00-18:30; Sa 09:00-13:00,,,,yes,,,yes,,52.540741,13.486602,POINT (13.4866 52.54074),
node,60848447,Castello-Apotheke,,,,,,,,,,,,Mo-Fr 08:30-19:00; Sa 08:30-14:00,,,,yes,,,yes,,52.531835,13.469654,POINT (13.46965 52.53184),
node,60852928,Rosen Apotheke,,,Rudolf-Seiffert-Straße,11.0,10369.0,Fennpfuhl,Berlin,DE,+49 30 9759449,,https://www.zurrose.de/,Mo-Fr 08:00-19:00; Sa 08:00-12:00,,,,yes,,,yes,,52.527555,13.468513,POINT (13.46851 52.52756),
node,68437791,Margareten-Apotheke,,,Karl-Kunger-Straße,46.0,12435.0,Alt-Treptow,Berlin,,+49 30 5337855,,http://www.apotheke.borchert-online.de/,Mo-Fr 08:30-18:30; Sa 08:30-13:00,,,,yes,,,no,,52.48939,13.45057,POINT (13.45057 52.48939),
node,69226035,Leipziger Apotheke,,,Leipziger Straße,43.0,10117.0,Mitte,Berlin,DE,,,https://www.leipziger-apotheke.de/,Mo-Fr 08:00-19:00; Sa 08:00-14:00,yes,yes,,yes,,outdoor,yes,,52.510556,13.395863,POINT (13.39586 52.51056),


## Step 1 Review and A–F Data Familiarization

### A) Quick overview

In [19]:
# How many rows and columns?
# pharmacy_df.shape

print("Rows, Columns:", pharmacy_df.shape)

Rows, Columns: (675, 25)


In [20]:
# What are the column names (in order)?
# pharmacy_df.columns.tolist()

print("\nColumns:", pharmacy_df.columns.tolist())


Columns: ['name', 'brand', 'operator', 'street', 'housenumber', 'postcode', 'suburb', 'city', 'country', 'phone', 'email', 'website', 'openinghours', 'payment_visa', 'payment_mastercard', 'payment_girocard', 'dispensing', 'delivery', 'surveillance', 'wheelchair_accessible', 'building_type', 'latitude', 'longitude', 'geometry', 'source']


In [21]:
# Data types and non-null counts

pharmacy_df.info()

<class 'geopandas.geodataframe.GeoDataFrame'>
MultiIndex: 675 entries, ('node', np.int64(60775323)) to ('way', np.int64(1410373889))
Data columns (total 25 columns):
 #   Column                 Non-Null Count  Dtype   
---  ------                 --------------  -----   
 0   name                   673 non-null    object  
 1   brand                  13 non-null     object  
 2   operator               141 non-null    object  
 3   street                 565 non-null    object  
 4   housenumber            561 non-null    object  
 5   postcode               555 non-null    object  
 6   suburb                 444 non-null    object  
 7   city                   545 non-null    object  
 8   country                425 non-null    object  
 9   phone                  300 non-null    object  
 10  email                  91 non-null     object  
 11  website                288 non-null    object  
 12  openinghours           633 non-null    object  
 13  payment_visa           18 non-null

### B) Missing values per column

In [22]:
# Count missing values (NaN/None) in each column
# I need this to compute percentages of missing values below

missing_count = pharmacy_df.isna().sum().sort_values(ascending=False)
print(missing_count)


delivery                 673
building_type            670
payment_girocard         669
surveillance             668
brand                    662
payment_mastercard       657
payment_visa             657
source                   651
email                    584
operator                 534
website                  387
phone                    375
country                  250
suburb                   231
city                     130
postcode                 120
housenumber              114
street                   110
dispensing                90
wheelchair_accessible     57
openinghours              42
name                       2
latitude                   0
longitude                  0
geometry                   0
dtype: int64


In [23]:
# Number of rows (observations, pharmacies)
# I need this to compute percentages of missing values below

row_count = len(pharmacy_df)
print(row_count)


675


In [24]:
# Build table with counts and % of missing values
# What does pd.DataFrame({...}) do? It converts that dictionary into a DataFrame (like an Excel table).
# The keys become column names.
# The values become column data.

missing = pd.DataFrame({
    "missing_count": missing_count,
    "missing_pct": (missing_count / row_count * 100).round(1)
}).sort_values(by="missing_pct", ascending=False)

print(missing)

                       missing_count  missing_pct
delivery                         673         99.7
building_type                    670         99.3
payment_girocard                 669         99.1
surveillance                     668         99.0
brand                            662         98.1
payment_mastercard               657         97.3
payment_visa                     657         97.3
source                           651         96.4
email                            584         86.5
operator                         534         79.1
website                          387         57.3
phone                            375         55.6
country                          250         37.0
suburb                           231         34.2
city                             130         19.3
postcode                         120         17.8
housenumber                      114         16.9
street                           110         16.3
dispensing                        90         13.3


### C) Distinct values per column

In [25]:
# Number of unique values per column
# Goal: See the “variety” of each column.


distinct = pharmacy_df.nunique().sort_values(ascending=False)
print(distinct)

# Conclusion:
# latitude, longitude and geometry are diverse  => Columns to keep but use mainly for mapping
# country, city, email, maybe source  => Columns I might drop/ignore later (in Step 2)
# brand, operator, postcode, wheelchair_accessible, maybe openinghours => Columns that will be most useful in Step 2


latitude                 675
geometry                 675
longitude                675
name                     629
street                   369
openinghours             339
phone                    300
website                  286
housenumber              240
postcode                 175
operator                 137
email                     91
suburb                    84
source                    10
dispensing                 3
surveillance               3
wheelchair_accessible      3
building_type              3
brand                      2
payment_mastercard         1
city                       1
delivery                   1
payment_visa               1
country                    1
payment_girocard           1
dtype: int64


### D) Most common values in key columns

In [26]:
# Goal: Peek at distributions, not just counts.

# Example: top 10 brands
print("\nTop 10 brands:")
print(pharmacy_df["brand"].value_counts().head(10))


Top 10 brands:
brand
easyApotheke        12
Linden Apotheken     1
Name: count, dtype: int64


In [27]:
# Example: top 10 operators
print("\nTop 10 operators:")
print(pharmacy_df["operator"].value_counts().head(10))



Top 10 operators:
operator
Lars Rieck e.K.          2
Ralf Goepfert e.K.       2
Witzleben Apotheke       2
Christian Melzer e.K.    2
Herr S. Stoof            1
Stefan Rezepa e.K        1
Ute Nitzer               1
Claudia Spieß e.K.       1
Jürgen Drescher          1
Anna Fredrich e.K.       1
Name: count, dtype: int64


In [28]:
# Example: most common street 
print("\nTop street:")
print(pharmacy_df["street"].value_counts().head(10))


Top street:
street
Karl-Marx-Straße        11
Schloßstraße             9
Müllerstraße             7
Hauptstraße              7
Badstraße                6
Mariendorfer Damm        6
Kurfürstendamm           6
Hermannstraße            6
Wilmersdorfer Straße     6
Bahnhofstraße            6
Name: count, dtype: int64


In [29]:
# Example: most common postcode 
print("\nTop postcode:")
print(pharmacy_df["postcode"].value_counts().head(10))


Top postcode:
postcode
12043    9
10719    8
10117    8
12163    8
12627    7
13353    7
10627    7
13357    7
14199    7
10365    6
Name: count, dtype: int64


In [30]:
# Example: most common opening_hours
print("\nTop opening_hours:")
print(pharmacy_df["openinghours"].value_counts().head(10))


Top opening_hours:
openinghours
Mo-Fr 08:30-18:30; Sa 08:30-13:00    35
Mo-Fr 08:30-19:00; Sa 08:30-14:00    21
Mo-Fr 08:30-18:30; Sa 08:30-13:30    17
Mo-Fr 08:30-18:30; Sa 09:00-14:00    13
Mo-Fr 08:00-19:00; Sa 09:00-14:00    12
Mo-Fr 08:00-18:30; Sa 08:00-13:00    12
Mo-Fr 08:30-18:30; Sa 09:00-13:00    12
Mo-Fr 08:00-19:00; Sa 08:00-13:00    10
Mo-Fr 08:30-19:00; Sa 09:00-14:00    10
Mo-Fr 09:00-18:00; Sa 09:00-13:00     9
Name: count, dtype: int64
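
Most of these values follow the pattern `days hours; days hours`. A minimal sketch for splitting such strings into rules is shown below. It is not a full parser for the OSM `opening_hours` syntax: it only splits on `;`, so values that separate rules with commas would need extra handling. The helper name `split_opening_hours` is hypothetical.

```python
def split_opening_hours(value: str) -> list[tuple[str, str]]:
    """Split 'Mo-Fr 08:30-18:30; Sa 08:30-13:00' into (days, hours) pairs."""
    rules = []
    for part in value.split(";"):
        part = part.strip()
        if not part:
            continue  # ignore empty fragments, e.g. after a trailing ';'
        days, _, hours = part.partition(" ")  # split at the first space only
        rules.append((days, hours.strip()))
    return rules


print(split_opening_hours("Mo-Fr 08:30-18:30; Sa 08:30-13:00"))
print(split_opening_hours("Mo-Fr 08:00-20:00; Sa 09:00-20:00; Su,PH off"))
```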


In [30]:
# Example: most common building_type
print("\nTop building_type:")
print(pharmacy_df["building_type"].value_counts().head(10))


Top building_type:
building_type
yes           2
commercial    2
apartments    1
Name: count, dtype: int64


In [31]:
# Example: most common wheelchair_accessible
print("\nTop wheelchair_accessible:")
print(pharmacy_df["wheelchair_accessible"].value_counts().head(10))


Top wheelchair_accessible:
wheelchair_accessible
yes        461
no          89
limited     68
Name: count, dtype: int64


In [32]:
# Example: Unique surveillance types
print("\nUnique surveillance:")
print(pharmacy_df["surveillance"].unique())


Unique surveillance:
[nan 'outdoor' 'indoor' 'yes']


In [33]:
# Example: Unique delivery types
print("\nUnique delivery:")
print(pharmacy_df["delivery"].unique())


Unique delivery:
[nan 'yes']


In [34]:
# Example: Unique payment methods
print("\nUnique payment_visa:")
print(pharmacy_df["payment_visa"].unique())     
# Example: Unique payment_mastercard methods
print("\nUnique payment_mastercard:")
print(pharmacy_df["payment_mastercard"].unique())     
# Example: Unique payment_girocard methods
print("\nUnique payment_girocard:")
print(pharmacy_df["payment_girocard"].unique()) 


Unique payment_visa:
[nan 'yes']

Unique payment_mastercard:
[nan 'yes']

Unique payment_girocard:
[nan 'yes']


In [35]:
# check the unique values in dispensing
print("\nUnique dispensing:")
print(pharmacy_df["dispensing"].unique())
# check the unique values in building_type
print("\nUnique building_type:")
print(pharmacy_df["building_type"].unique())


Unique dispensing:
['yes' nan 'no' 'Apotheke in Nikolassee']

Unique building_type:
[nan 'yes' 'commercial' 'apartments']
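
The stray value `'Apotheke in Nikolassee'` in `dispensing` shows that tag columns can contain free text. One way to handle this is sketched below: anything outside an allowed set becomes missing. Treating unexpected values as missing is an assumption about the desired cleaning rule, and `clean_flag` is a hypothetical helper name.

```python
from typing import Optional


def clean_flag(value: object,
               allowed: frozenset[str] = frozenset({"yes", "no"})) -> Optional[str]:
    """Return the normalised flag if it is an allowed value, else None."""
    if not isinstance(value, str):
        return None  # NaN/None stay missing
    candidate = value.strip().lower()
    return candidate if candidate in allowed else None


samples = ["yes", float("nan"), "no", "Apotheke in Nikolassee"]
print([clean_flag(v) for v in samples])
```

In the notebook this could be applied with `pharmacy_df["dispensing"].map(clean_flag)`; for `wheelchair_accessible` the allowed set would also need `"limited"`.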


### E) Geometry sanity checks

In [36]:
# Goal: Ensure spatial data makes sense.

# Unique geometry types (Point, Polygon/LineString). 
# If some are Polygon/LineString, I already handled them with .representative_point() (somewhere above in Step 1.2).

print(pharmacy_df.geometry.geom_type.value_counts())

Point    675
Name: count, dtype: int64


In [37]:
# Any missing geometries?
# Why? Missing geometry would be a problem for maps.

print("Missing geometries:", pharmacy_df.geometry.isna().sum())

Missing geometries: 0


### F) Latitude/Longitude checks

In [38]:
# Goal: Verify lat/lon look realistic.
# Why? If values are way off, something went wrong in conversion.

print("Latitude range:", pharmacy_df["latitude"].min(), "to", pharmacy_df["latitude"].max())

print("Longitude range:", pharmacy_df["longitude"].min(), "to", pharmacy_df["longitude"].max())


Latitude range: 52.3865161 to 52.6360319
Longitude range: 13.1422529 to 13.715511
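
The same check can be made explicit with a bounding box. The bounds below are approximate values for Berlin, rounded outward, and are an assumption of this sketch rather than official limits; `in_berlin_bounds` is a hypothetical helper.

```python
# Approximate bounding box around Berlin (assumed values, rounded outward)
BERLIN_BOUNDS = {"lat_min": 52.3, "lat_max": 52.7, "lon_min": 13.0, "lon_max": 13.8}


def in_berlin_bounds(lat: float, lon: float,
                     bounds: dict[str, float] = BERLIN_BOUNDS) -> bool:
    """Return True if the coordinate lies inside the bounding box."""
    return (bounds["lat_min"] <= lat <= bounds["lat_max"]
            and bounds["lon_min"] <= lon <= bounds["lon_max"])


# Extremes taken from the latitude/longitude ranges printed above
observed_corners = [(52.3865161, 13.1422529), (52.6360319, 13.715511)]
all_inside = all(in_berlin_bounds(lat, lon) for lat, lon in observed_corners)
print("All observed extremes inside the box:", all_inside)
```

A check like this also catches swapped coordinates: `in_berlin_bounds(13.4, 52.5)` is `False`.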


## 1.3 Prepare the /sources Directory

- **Raw Data Files:**  
    - `pharmacies_raw.geojson` (includes geometry)  
    - `pharmacies_raw.csv` (tabular only, no geometry)  

- **README.md** in `/sources` will contain:
    - Data sources used.
    - Planned transformation steps.


In [40]:
import os

In [39]:
# Define file paths
raw_geojson_path = "../sources/pharmacies_raw.geojson"
raw_csv_path = "../sources/pharmacies_raw.csv"


In [41]:
# Create the folder to save the data
os.makedirs(os.path.dirname(raw_geojson_path), exist_ok=True)

In [None]:
# Save as GeoJSON (keeps geometry) and CSV


pharmacy_gdf.to_file(raw_geojson_path, driver="GeoJSON")
pharmacy_gdf.drop(columns="geometry").to_csv(raw_csv_path, index=False)

print(f"Raw data saved to: {raw_geojson_path} and {raw_csv_path}")

Raw data saved to: ../sources/pharmacies_raw.geojson and ../sources/pharmacies_raw.csv


## 1.4 Review

- All 23 draft target columns defined.
- Data sources identified and documented.
- Schema draft created.
- Data fetched and stored in `/sources`.
- Data cleaning & enrichment plan in place.

**Next Step:** Step 2: Data Transformation.


# 🛠 Step 2: Data Transformation

In [42]:
#check the unique values in dispensing and delivery and wheelchair_accessible
print("\nUnique dispensing:")
print(pharmacy_df["dispensing"].unique())
print("\nUnique delivery:")
print(pharmacy_df["delivery"].unique())     
print("\nUnique wheelchair_accessible:")
print(pharmacy_df["wheelchair_accessible"].unique())




Unique dispensing:
['yes' nan 'no' 'Apotheke in Nikolassee']

Unique delivery:
[nan 'yes']

Unique wheelchair_accessible:
['yes' 'no' 'limited' nan]


### A) Standardize column names and types

In [44]:
# Standardize column names

pharmacy_df.columns = pharmacy_df.columns.str.lower().str.strip().str.replace(" ", "_").str.replace("-", "_")

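# Convert certain columns to text
# Note: astype(str) turns missing values into the literal string "nan" and
# float-coded values into strings such as "3.0" or "13055.0"

pharmacy_df["housenumber"] = pharmacy_df["housenumber"].astype(str)   # ensure text

pharmacy_df["postcode"] = pharmacy_df["postcode"].astype(str)         # store as text, not as a number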

# Normalize yes/no columns into Boolean (True/False)

pharmacy_df["payment_visa"] = pharmacy_df["payment_visa"].map({"yes": True, "no": False})
pharmacy_df["payment_mastercard"] = pharmacy_df["payment_mastercard"].map({"yes": True, "no": False})
pharmacy_df["payment_girocard"] = pharmacy_df["payment_girocard"].map({"yes": True, "no": False})


pharmacy_df["surveillance"] = pharmacy_df["surveillance"].map({"yes": True, "no": False})
pharmacy_df["delivery"] = pharmacy_df["delivery"].map({"yes": True, "no": False})


# Make text values consistent (lowercase to avoid duplicates like "Rosen Apotheke" vs "rosen apotheke")
# See "openinghours" normalization in Step 2 E)


text_cols = ["name", "street", "city", "country", "website", "operator", "brand", "phone", "email", "source", "building"]
for col in text_cols:
    if col in pharmacy_df.columns:
        pharmacy_df[col] = pharmacy_df[col].astype(str).str.strip().str.lower()
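
One side effect of `astype(str)` is that missing values become the literal string `"nan"`, so later `isna()` checks no longer count them. The following is a minimal sketch of a variant that keeps missing values missing. `clean_text` and `postcode_to_text` are hypothetical helper names, and the sketch assumes the postcode column is float-coded (for example `13055.0`), as the previews in this notebook suggest.

```python
import numpy as np
import pandas as pd

def clean_text(series):
    """Strip and lowercase strings; missing values stay missing."""
    return series.str.strip().str.lower()

def postcode_to_text(series):
    """Turn float-coded postcodes such as 13055.0 into '13055'; NaN becomes <NA>."""
    return pd.to_numeric(series, errors="coerce").astype("Int64").astype("string")

# Small hypothetical example with one missing value per column
demo = pd.DataFrame({
    "name": ["  Rosen Apotheke ", np.nan],
    "postcode": [10369.0, np.nan],
})
demo["name"] = clean_text(demo["name"])
demo["postcode"] = postcode_to_text(demo["postcode"])
print(demo.isna().sum())
```

The numeric conversion would not suit house numbers with letter suffixes (such as "3a"), so it is shown for postcodes only.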

In [45]:
# Check the  datatypes after Step 2 A)

print(pharmacy_df.dtypes)   

name                       object
brand                      object
operator                   object
street                     object
housenumber                object
postcode                   object
suburb                     object
city                       object
country                    object
phone                      object
email                      object
website                    object
openinghours               object
payment_visa               object
payment_mastercard         object
payment_girocard           object
dispensing                 object
delivery                   object
surveillance               object
wheelchair_accessible      object
building_type              object
latitude                  float64
longitude                 float64
geometry                 geometry
source                     object
dtype: object


In [46]:
# See first rows after Step 2 A)

pharmacy_df.head() 


element,id,name,brand,operator,street,housenumber,postcode,suburb,city,country,phone,email,website,openinghours,payment_visa,payment_mastercard,payment_girocard,dispensing,delivery,surveillance,wheelchair_accessible,building_type,latitude,longitude,geometry,source
node,60775323,reichenberger apotheke,,,reichenberger straße,3.0,13055.0,Alt-Hohenschönhausen,berlin,de,+49 30 9713807,,https://reichenbergerapotheke.de/,Mo-Th 08:00-19:00; Fr 08:00-18:30; Sa 09:00-13:00,,,,yes,,,yes,,52.540741,13.486602,POINT (13.4866 52.54074),
node,60848447,castello-apotheke,,,,,,,,,,,,Mo-Fr 08:30-19:00; Sa 08:30-14:00,,,,yes,,,yes,,52.531835,13.469654,POINT (13.46965 52.53184),
node,60852928,rosen apotheke,,,rudolf-seiffert-straße,11.0,10369.0,Fennpfuhl,berlin,de,+49 30 9759449,,https://www.zurrose.de/,Mo-Fr 08:00-19:00; Sa 08:00-12:00,,,,yes,,,yes,,52.527555,13.468513,POINT (13.46851 52.52756),
node,68437791,margareten-apotheke,,,karl-kunger-straße,46.0,12435.0,Alt-Treptow,berlin,,+49 30 5337855,,http://www.apotheke.borchert-online.de/,Mo-Fr 08:30-18:30; Sa 08:30-13:00,,,,yes,,,no,,52.48939,13.45057,POINT (13.45057 52.48939),
node,69226035,leipziger apotheke,,,leipziger straße,43.0,10117.0,Mitte,berlin,de,,,https://www.leipziger-apotheke.de/,Mo-Fr 08:00-19:00; Sa 08:00-14:00,,,,yes,,,yes,,52.510556,13.395863,POINT (13.39586 52.51056),


In [None]:
# Seeing more than head()

# pharmacy_df.head(20)              # first 20 rows
# pharmacy_df.tail(10)              # last 10 rows
# pharmacy_df.sample(10, random_state=0)  # 10 random rows
# pharmacy_df[["brand","operator","wheelchair_accessible","delivery"]].sample(15, random_state=1)
# pharmacy_df["brand"].value_counts(dropna=False).head(20)


In [47]:
# Seeing more than head()

pharmacy_df.sample(10, random_state=0)  # 10 random rows

element,id,name,brand,operator,street,housenumber,postcode,suburb,city,country,phone,email,website,openinghours,payment_visa,payment_mastercard,payment_girocard,dispensing,delivery,surveillance,wheelchair_accessible,building_type,latitude,longitude,geometry,source
node,4381928183,apotheke am bundesplatz,,michaela kröger,bundesplatz,3.0,10715.0,,berlin,,+49 30 85405670,,https://www.apotheke-bundesplatz.de/,Mo-Fr 08:30-18:30; Sa 08:30-13:00,,,,yes,,,yes,,52.479681,13.327688,POINT (13.32769 52.47968),
node,263699380,robinson apotheke,,,lion-feuchtwanger-straße,22.0,12619.0,Hellersdorf,berlin,de,,,,"Mo-Fr 08:00-18:30, Sa 08:00-13:00; PH off",,,,yes,,,yes,,52.517681,13.5829,POINT (13.5829 52.51768),
node,1038217551,falken-apotheke,,,siegener straße,59.0,13583.0,Falkenhagener Feld,berlin,de,,,,"Mo,Tu,Th 08:00-19:00; We,Fr 08:00-18:30; Sa 08...",,,,yes,,,yes,,52.54755,13.177198,POINT (13.1772 52.54755),
node,454167250,bartels apotheke,,,,,,,,,+49 30 47301356,,https://www.bartels-apotheke.de/,Mo-Fr 08:00-19:00; Sa 08:00-13:00,,,,yes,,,yes,,52.558703,13.413785,POINT (13.41378 52.5587),
node,1624585364,vital-apotheke,,,skalitzer straße,15.0,10999.0,Kreuzberg,berlin,de,,,,Mo-Fr 08:30-19:00; Sa 09:00-16:00,,,,yes,,,yes,,52.49888,13.420308,POINT (13.42031 52.49888),
node,738996191,kurmark apotheke,,,kurfürstenstraße,154.0,10785.0,Schöneberg,berlin,de,+4930 26 555 477,,http://www.apo-net.de/kurmark/index2.htm,Mo-Fr 08:30-19:00; Sa 09:00-13:30,,,,yes,,,no,,52.499449,13.363796,POINT (13.3638 52.49945),
node,833390882,easyapotheke,easyapotheke,mehmet geyik e.k.,schloßstraße,1.0,12163.0,Steglitz,berlin,de,+49 30 79016052,,https://forum-steglitz.easyapotheken.de/,Mo-Fr 09:00-20:00; Sa 10:00-20:00,,,,yes,,,yes,,52.464665,13.326185,POINT (13.32619 52.46467),
node,87036261,nordland apotheke,,tina töllner e.k.,invalidenstraße,114.0,10115.0,,berlin,,,info@nordlandapotheke-berlin.de,https://nordlandapotheke-berlin.de/,Mo-Fr 08:00-19:00,,,,yes,,,yes,,52.530698,13.384125,POINT (13.38412 52.5307),
node,627415668,gesundbrunnen- apotheke,,,badstraße,64.0,13357.0,Gesundbrunnen,berlin,de,,,,Mo-Fr 08:30-18:30; Sa 09:00-13:00,,,,yes,,,yes,,52.55057,13.384249,POINT (13.38425 52.55057),
node,442705525,hermann apotheke,,,hermannstraße,116.0,12051.0,Neukölln,berlin,de,+493062981014,,,"Mo,Tu,Th 09:00-18:30, Sa 09:00-13:00, We,Fr 09...",,,,yes,,,yes,,52.466503,13.431654,POINT (13.43165 52.4665),


In [48]:
# Seeing more than head()

pharmacy_df[["brand","operator","wheelchair_accessible","delivery"]].sample(15, random_state=2)

element,id,brand,operator,wheelchair_accessible,delivery
node,6006371223,,,yes,
node,1906515104,,,yes,
node,3353252595,,,yes,
node,1617419904,,,yes,
node,313933591,,,,
node,287852225,,,yes,
node,3556800140,,,no,
node,297512117,,norbert peter e.k.,limited,
node,983454566,,,no,
node,667211398,,,yes,


### B) Drop irrelevant / redundant columns

In [49]:
# Drop redundant columns
columns_to_drop_in_2B = ["city", "country", "source"]

# Keep only the ones that really exist in the dataframe
columns_to_drop_in_2B = [col for col in columns_to_drop_in_2B if col in pharmacy_df.columns]

print("Dropping in Step 2B:", columns_to_drop_in_2B)
pharmacy_df = pharmacy_df.drop(columns=columns_to_drop_in_2B)

print("\nRemaining columns after Step 2B:")
print(pharmacy_df.columns.tolist())

Dropping in Step 2B: ['city', 'country', 'source']

Remaining columns after Step 2B:
['name', 'brand', 'operator', 'street', 'housenumber', 'postcode', 'suburb', 'phone', 'email', 'website', 'openinghours', 'payment_visa', 'payment_mastercard', 'payment_girocard', 'dispensing', 'delivery', 'surveillance', 'wheelchair_accessible', 'building_type', 'latitude', 'longitude', 'geometry']


### C) Handle missing values

In [50]:
# Drop columns with too many missing values => See table with counts and % of missing values in Step 1 B)
# delivery > 90% missing
# building_type > 90% missing
# payment_girocard > 90% missing
# payment_visa > 90% missing
# payment_mastercard > 90% missing
# surveillance > 90% missing
# brand > 90% missing
# email > 80% missing


columns_to_drop_in_2C = ["delivery", "building_type", "payment_girocard", "payment_visa", "payment_mastercard", "surveillance", "brand", "email"]

columns_to_drop_in_2C = [col for col in columns_to_drop_in_2C if col in pharmacy_df.columns]

print("Dropping in Step 2C:", columns_to_drop_in_2C)
pharmacy_df = pharmacy_df.drop(columns=columns_to_drop_in_2C)

print("\nRemaining columns after Step 2C:")
print(pharmacy_df.columns.tolist())

Dropping in Step 2C: ['delivery', 'building_type', 'payment_girocard', 'payment_visa', 'payment_mastercard', 'surveillance', 'brand', 'email']

Remaining columns after Step 2C:
['name', 'operator', 'street', 'housenumber', 'postcode', 'suburb', 'phone', 'website', 'openinghours', 'dispensing', 'wheelchair_accessible', 'latitude', 'longitude', 'geometry']
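
The list of columns to drop above was written by hand from the missing-value table. As a hedged sketch, the same selection could be derived from the data with a threshold. `columns_above_missing_share` is a hypothetical helper, and the `0.85` cutoff is only an example value, not a rule taken from this cell.

```python
import numpy as np
import pandas as pd

def columns_above_missing_share(df, threshold):
    """Return names of columns whose share of missing values exceeds threshold."""
    missing_share = df.isna().mean()  # fraction of NaN per column
    return missing_share[missing_share > threshold].index.tolist()

# Hypothetical five-row example
demo = pd.DataFrame({
    "name": ["a", "b", "c", "d", "e"],
    "email": [np.nan, np.nan, np.nan, np.nan, "x@example.org"],  # 80% missing
    "brand": [np.nan] * 5,                                       # 100% missing
})
to_drop = columns_above_missing_share(demo, threshold=0.85)
print(to_drop)
```

With a cutoff of 0.85 the 80%-missing `email` column in this demo is kept, so a lower threshold would be needed to reproduce a drop of columns that are "more than 80% missing".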


In [51]:
# check the pharmacy_df after Step 2 C)
pharmacy_df.head()

element,id,name,operator,street,housenumber,postcode,suburb,phone,website,openinghours,dispensing,wheelchair_accessible,latitude,longitude,geometry
node,60775323,reichenberger apotheke,,reichenberger straße,3.0,13055.0,Alt-Hohenschönhausen,+49 30 9713807,https://reichenbergerapotheke.de/,Mo-Th 08:00-19:00; Fr 08:00-18:30; Sa 09:00-13:00,yes,yes,52.540741,13.486602,POINT (13.4866 52.54074)
node,60848447,castello-apotheke,,,,,,,,Mo-Fr 08:30-19:00; Sa 08:30-14:00,yes,yes,52.531835,13.469654,POINT (13.46965 52.53184)
node,60852928,rosen apotheke,,rudolf-seiffert-straße,11.0,10369.0,Fennpfuhl,+49 30 9759449,https://www.zurrose.de/,Mo-Fr 08:00-19:00; Sa 08:00-12:00,yes,yes,52.527555,13.468513,POINT (13.46851 52.52756)
node,68437791,margareten-apotheke,,karl-kunger-straße,46.0,12435.0,Alt-Treptow,+49 30 5337855,http://www.apotheke.borchert-online.de/,Mo-Fr 08:30-18:30; Sa 08:30-13:00,yes,no,52.48939,13.45057,POINT (13.45057 52.48939)
node,69226035,leipziger apotheke,,leipziger straße,43.0,10117.0,Mitte,,https://www.leipziger-apotheke.de/,Mo-Fr 08:00-19:00; Sa 08:00-14:00,yes,yes,52.510556,13.395863,POINT (13.39586 52.51056)


In [52]:
# check missing values after Step 2 C)
missing_count = pharmacy_df.isna().sum().sort_values(ascending=False)
print(missing_count[missing_count > 0])

suburb                   231
dispensing                90
wheelchair_accessible     57
openinghours              42
dtype: int64


### D) Normalize categories

### E) Opening hours normalization

In [53]:
#check the unique values in openinghours
print("\nUnique openinghours:")
print(pharmacy_df["openinghours"].unique()) 


Unique openinghours:
['Mo-Th 08:00-19:00; Fr 08:00-18:30; Sa 09:00-13:00'
 'Mo-Fr 08:30-19:00; Sa 08:30-14:00' 'Mo-Fr 08:00-19:00; Sa 08:00-12:00'
 'Mo-Fr 08:30-18:30; Sa 08:30-13:00' 'Mo-Fr 08:00-19:00; Sa 08:00-14:00'
 'Mo-Fr 08:00-18:30; Sa 08:00-13:00' 'Mo-Fr 08:30-18:30; Sa 09:00-13:00'
 'Mo-Fr 09:00-19:00; Sa 09:00-14:00' 'Mo-Fr 08:30-19:00; Sa 09:00-13:00'
 'Mo-Fr 09:00-19:00; Sa 10:00-19:00' 'Mo-Fr 08:30-18:00, Sa 08:30-13:00'
 'Mo-Fr 08:30-18:30, Sa 08:30-13:00' 'Mo-Fr 08:30-18:30'
 'Mo-Fr 08:00-19:00' 'Mo-Fr 08:00-19:00; Sa 09:00-13:00'
 'Mo-Fr 08:00-20:00; Sa 08:30-18:00' 'Mo-Fr 09:00-19:30; Sa 10:00-19:00'
 'Mo-Fr 08:00-19:00; Sa 08:30-14:00' nan
 'Mo-Fr 08:30-19:00;Sa 08:30-13:00;Su,PH off'
 'Mo-Fr 09:00-18:30; Sa 09:00-13:00'
 'Mo,Tu 08:30-19:00; We 08:30-18:30; Th 08:30-19:00; Fr 08:30-18:30'
 'Mo-Fr 08:00-18:30; Sa 09:00-14:00' 'Mo-Fr 09:00-18:30, Sa 10:30-16:30'
 'Mo-Fr 09:00-18:00; Sa 09:00-14:00'
 'Mo-Fr 08:00-20:00; Sa 09:00-15:00; PH off'
 'Mo-Fr 08:00-19:00, Sa 0

In [54]:
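# Normalize text format
# Note: astype(str) also converts missing values to the string "nan",
# so they are no longer counted as missing afterwards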

if "openinghours" in pharmacy_df.columns:
    pharmacy_df["openinghours"] = pharmacy_df["openinghours"].astype(str).str.strip().str.lower()

print("\nSample opening hours values:")
print(pharmacy_df["openinghours"].head(10) if "openinghours" in pharmacy_df.columns else "No column")


Sample opening hours values:
element  id      
node     60775323    mo-th 08:00-19:00; fr 08:00-18:30; sa 09:00-13:00
         60848447                    mo-fr 08:30-19:00; sa 08:30-14:00
         60852928                    mo-fr 08:00-19:00; sa 08:00-12:00
         68437791                    mo-fr 08:30-18:30; sa 08:30-13:00
         69226035                    mo-fr 08:00-19:00; sa 08:00-14:00
         76507297                    mo-fr 08:00-18:30; sa 08:00-13:00
         76519952                    mo-fr 08:30-18:30; sa 09:00-13:00
         76596388                    mo-fr 09:00-19:00; sa 09:00-14:00
         78244254                    mo-fr 08:30-19:00; sa 09:00-13:00
         79603436                    mo-fr 09:00-19:00; sa 10:00-19:00
Name: openinghours, dtype: object
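
The unique values listed earlier mix `;` with and without a following space (for example `'Mo-Fr 08:30-19:00;Sa 08:30-13:00;Su,PH off'`). The sketch below shows one possible clean-up that lowercases, trims, and unifies the spacing after semicolons while returning `None` for missing input. `normalize_opening_hours` is a hypothetical helper; commas are left untouched on the assumption that they may carry a different meaning than semicolons in the opening-hours syntax.

```python
import re

def normalize_opening_hours(value):
    """Lowercase, trim, and use '; ' between rules; return None for missing input."""
    # None and float NaN (NaN is the only float not equal to itself) count as missing
    if value is None or (isinstance(value, float) and value != value):
        return None
    text = str(value).strip().lower()
    if text in ("", "nan"):
        return None
    text = re.sub(r"\s+", " ", text)       # collapse repeated whitespace
    text = re.sub(r"\s*;\s*", "; ", text)  # exactly one space after each semicolon
    return text.rstrip("; ")               # drop a trailing separator, if any

print(normalize_opening_hours("Mo-Fr 08:30-19:00;Sa 08:30-13:00;Su,PH off"))
```

On a DataFrame this could be applied with `pharmacy_df["openinghours"].map(normalize_opening_hours)`.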


In [55]:
# Quick preview

pharmacy_df.head()


element,id,name,operator,street,housenumber,postcode,suburb,phone,website,openinghours,dispensing,wheelchair_accessible,latitude,longitude,geometry
node,60775323,reichenberger apotheke,,reichenberger straße,3.0,13055.0,Alt-Hohenschönhausen,+49 30 9713807,https://reichenbergerapotheke.de/,mo-th 08:00-19:00; fr 08:00-18:30; sa 09:00-13:00,yes,yes,52.540741,13.486602,POINT (13.4866 52.54074)
node,60848447,castello-apotheke,,,,,,,,mo-fr 08:30-19:00; sa 08:30-14:00,yes,yes,52.531835,13.469654,POINT (13.46965 52.53184)
node,60852928,rosen apotheke,,rudolf-seiffert-straße,11.0,10369.0,Fennpfuhl,+49 30 9759449,https://www.zurrose.de/,mo-fr 08:00-19:00; sa 08:00-12:00,yes,yes,52.527555,13.468513,POINT (13.46851 52.52756)
node,68437791,margareten-apotheke,,karl-kunger-straße,46.0,12435.0,Alt-Treptow,+49 30 5337855,http://www.apotheke.borchert-online.de/,mo-fr 08:30-18:30; sa 08:30-13:00,yes,no,52.48939,13.45057,POINT (13.45057 52.48939)
node,69226035,leipziger apotheke,,leipziger straße,43.0,10117.0,Mitte,,https://www.leipziger-apotheke.de/,mo-fr 08:00-19:00; sa 08:00-14:00,yes,yes,52.510556,13.395863,POINT (13.39586 52.51056)


In [56]:
#check shape
pharmacy_df.shape

(675, 14)

### F) Add district and district_id to the data frame

In [57]:
# Load official Berlin districts GeoDataFrame from lor_ortsteile.geojson
berlin_districts_gdf = gpd.read_file("../sources/lor_ortsteile.geojson")


In [58]:
print(berlin_districts_gdf.columns)


Index(['gml_id', 'spatial_name', 'spatial_alias', 'spatial_type', 'OTEIL',
       'BEZIRK', 'FLAECHE_HA', 'geometry'],
      dtype='object')


In [59]:
# Spatial join with corrected column names
pharmacy_df_district = gpd.sjoin(
    pharmacy_df,
    berlin_districts_gdf[["BEZIRK", "OTEIL","geometry"]],
    how="left",
    predicate="within"
)



In [60]:
# Rename columns to something clear
pharmacy_df_district = pharmacy_df_district.rename(columns={
    "BEZIRK": "district"
    ,"OTEIL": "neighbourhood"
}).drop(columns=["index_right"])  # drop the helper column added by sjoin


In [61]:
# Preview the joined DataFrame
pharmacy_df_district.head()


element,id,name,operator,street,housenumber,postcode,suburb,phone,website,openinghours,dispensing,wheelchair_accessible,latitude,longitude,geometry,district,neighbourhood
node,60775323,reichenberger apotheke,,reichenberger straße,3.0,13055.0,Alt-Hohenschönhausen,+49 30 9713807,https://reichenbergerapotheke.de/,mo-th 08:00-19:00; fr 08:00-18:30; sa 09:00-13:00,yes,yes,52.540741,13.486602,POINT (13.4866 52.54074),Lichtenberg,Alt-Hohenschönhausen
node,60848447,castello-apotheke,,,,,,,,mo-fr 08:30-19:00; sa 08:30-14:00,yes,yes,52.531835,13.469654,POINT (13.46965 52.53184),Lichtenberg,Fennpfuhl
node,60852928,rosen apotheke,,rudolf-seiffert-straße,11.0,10369.0,Fennpfuhl,+49 30 9759449,https://www.zurrose.de/,mo-fr 08:00-19:00; sa 08:00-12:00,yes,yes,52.527555,13.468513,POINT (13.46851 52.52756),Lichtenberg,Fennpfuhl
node,68437791,margareten-apotheke,,karl-kunger-straße,46.0,12435.0,Alt-Treptow,+49 30 5337855,http://www.apotheke.borchert-online.de/,mo-fr 08:30-18:30; sa 08:30-13:00,yes,no,52.48939,13.45057,POINT (13.45057 52.48939),Treptow-Köpenick,Alt-Treptow
node,69226035,leipziger apotheke,,leipziger straße,43.0,10117.0,Mitte,,https://www.leipziger-apotheke.de/,mo-fr 08:00-19:00; sa 08:00-14:00,yes,yes,52.510556,13.395863,POINT (13.39586 52.51056),Mitte,Mitte


In [None]:
# pharmacy_df = pharmacy_df.drop(columns="district_id")

In [62]:
# Generating district ids
# https://www.regionalstatistik.de

# District mapping (official codes as strings)
district_mapping = {
    'Mitte': '11001001',
    'Friedrichshain-Kreuzberg': '11002002',
    'Pankow': '11003003',
    'Charlottenburg-Wilmersdorf': '11004004',
    'Spandau': '11005005',
    'Steglitz-Zehlendorf': '11006006',
    'Tempelhof-Schöneberg': '11007007',
    'Neukölln': '11008008',
    'Treptow-Köpenick': '11009009',
    'Marzahn-Hellersdorf': '11010010',
    'Lichtenberg': '11011011',
    'Reinickendorf': '11012012'
}

# Apply mapping to create district_id column (string)
# Note: with astype(str), districts without a mapping end up as the string "nan"
pharmacy_df_district['district_id'] = pharmacy_df_district['district'].map(district_mapping).astype(str)

# (Optional) Check if some districts were not mapped
#unmapped = pharmacy_df_district[~pharmacy_df_district['district'].isin(district_mapping.keys())]['district'].unique()
#if len(unmapped) > 0:
    #print("⚠️ Unmapped districts found:", unmapped)
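
The optional check above can be folded into the mapping step itself. The following is a sketch under stated assumptions: `add_district_id` is a hypothetical helper, `demo_mapping` is a shortened copy of the mapping used only for the demo, and unmapped districts are left as missing values instead of being cast to the string `"nan"`.

```python
import pandas as pd

# Shortened mapping for the demo only; the notebook defines the full mapping
demo_mapping = {"Mitte": "11001001", "Pankow": "11003003"}

def add_district_id(df, mapping):
    """Map district names to ids and list district names that have no id."""
    out = df.copy()
    out["district_id"] = out["district"].map(mapping)  # unmapped names stay NaN
    no_id = out["district_id"].isna()
    unmapped = sorted(out.loc[no_id, "district"].dropna().unique())
    return out, unmapped

# Hypothetical input with one name that is not a Berlin district
demo = pd.DataFrame({"district": ["Mitte", "Pankow", "Potsdam"]})
demo, unmapped = add_district_id(demo, demo_mapping)
print(unmapped)
```

An empty `unmapped` list on the real data would indicate that every pharmacy received a district id.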

### G) Reset index, drop columns "element" and "geometry", rename "id" to "pharmacy_id"

In [None]:
# Drop the geometry column and reset the index
pharmacy_df_district = pharmacy_df_district.drop(columns=["geometry"]).reset_index()

In [64]:
#Preview the final DataFrame
pharmacy_df_district.head()

element,id,name,operator,street,housenumber,postcode,suburb,phone,website,openinghours,dispensing,wheelchair_accessible,latitude,longitude,geometry,district,neighbourhood,district_id
node,60775323,reichenberger apotheke,,reichenberger straße,3.0,13055.0,Alt-Hohenschönhausen,+49 30 9713807,https://reichenbergerapotheke.de/,mo-th 08:00-19:00; fr 08:00-18:30; sa 09:00-13:00,yes,yes,52.540741,13.486602,POINT (13.4866 52.54074),Lichtenberg,Alt-Hohenschönhausen,11011011
node,60848447,castello-apotheke,,,,,,,,mo-fr 08:30-19:00; sa 08:30-14:00,yes,yes,52.531835,13.469654,POINT (13.46965 52.53184),Lichtenberg,Fennpfuhl,11011011
node,60852928,rosen apotheke,,rudolf-seiffert-straße,11.0,10369.0,Fennpfuhl,+49 30 9759449,https://www.zurrose.de/,mo-fr 08:00-19:00; sa 08:00-12:00,yes,yes,52.527555,13.468513,POINT (13.46851 52.52756),Lichtenberg,Fennpfuhl,11011011
node,68437791,margareten-apotheke,,karl-kunger-straße,46.0,12435.0,Alt-Treptow,+49 30 5337855,http://www.apotheke.borchert-online.de/,mo-fr 08:30-18:30; sa 08:30-13:00,yes,no,52.48939,13.45057,POINT (13.45057 52.48939),Treptow-Köpenick,Alt-Treptow,11009009
node,69226035,leipziger apotheke,,leipziger straße,43.0,10117.0,Mitte,,https://www.leipziger-apotheke.de/,mo-fr 08:00-19:00; sa 08:00-14:00,yes,yes,52.510556,13.395863,POINT (13.39586 52.51056),Mitte,Mitte,11001001


In [81]:
# save the tentative final DataFrame to CSV
pharmacy_df_district.to_csv("../sources/pharmacies_with_districts.csv", index=False)



In [82]:
# Reset the index
pharmacy_df_district = pharmacy_df_district.reset_index()

# Rename the "id" column to "pharmacy_id"
pharmacy_df_district = pharmacy_df_district.rename(columns={"id": "pharmacy_id"})
# Store pharmacy_id as a string
pharmacy_df_district["pharmacy_id"] = pharmacy_df_district["pharmacy_id"].astype(str)

# Drop the redundant column "element"
pharmacy_df_district = pharmacy_df_district.drop(columns=["element"], errors="ignore")

In [83]:
pharmacy_df_district.head()

Unnamed: 0,pharmacy_id,name,operator,street,housenumber,postcode,suburb,phone,website,openinghours,dispensing,wheelchair_accessible,latitude,longitude,geometry,district,neighbourhood,district_id
0,60775323,reichenberger apotheke,,reichenberger straße,3.0,13055.0,Alt-Hohenschönhausen,+49 30 9713807,https://reichenbergerapotheke.de/,mo-th 08:00-19:00; fr 08:00-18:30; sa 09:00-13:00,yes,yes,52.540741,13.486602,POINT (13.4866 52.54074),Lichtenberg,Alt-Hohenschönhausen,11011011
1,60848447,castello-apotheke,,,,,,,,mo-fr 08:30-19:00; sa 08:30-14:00,yes,yes,52.531835,13.469654,POINT (13.46965 52.53184),Lichtenberg,Fennpfuhl,11011011
2,60852928,rosen apotheke,,rudolf-seiffert-straße,11.0,10369.0,Fennpfuhl,+49 30 9759449,https://www.zurrose.de/,mo-fr 08:00-19:00; sa 08:00-12:00,yes,yes,52.527555,13.468513,POINT (13.46851 52.52756),Lichtenberg,Fennpfuhl,11011011
3,68437791,margareten-apotheke,,karl-kunger-straße,46.0,12435.0,Alt-Treptow,+49 30 5337855,http://www.apotheke.borchert-online.de/,mo-fr 08:30-18:30; sa 08:30-13:00,yes,no,52.48939,13.45057,POINT (13.45057 52.48939),Treptow-Köpenick,Alt-Treptow,11009009
4,69226035,leipziger apotheke,,leipziger straße,43.0,10117.0,Mitte,,https://www.leipziger-apotheke.de/,mo-fr 08:00-19:00; sa 08:00-14:00,yes,yes,52.510556,13.395863,POINT (13.39586 52.51056),Mitte,Mitte,11001001


In [85]:
print(pharmacy_df_district.columns.tolist())


['pharmacy_id', 'name', 'operator', 'street', 'housenumber', 'postcode', 'suburb', 'phone', 'website', 'openinghours', 'dispensing', 'wheelchair_accessible', 'latitude', 'longitude', 'geometry', 'district', 'neighbourhood', 'district_id']


### H)  Final Summary of Cleaned and Transformed Data

In [120]:
print("✅ Dataset after Steps A - G cleaning and transforming\n")

# Shape of dataframe
print(f"Number of rows: {pharmacy_df_district.shape[0]}")
print(f"Number of columns: {pharmacy_df_district.shape[1]}")

# Column list
print("\nRemaining columns:")
print(pharmacy_df_district.columns.tolist())

# Missing values check
missing = pharmacy_df_district.isnull().sum()
print("\nMissing values after cleaning and transforming :")
print(missing)

✅ Dataset after Steps A - G cleaning and transforming

Number of rows: 675
Number of columns: 18

Remaining columns:
['pharmacy_id', 'name', 'operator', 'street', 'housenumber', 'postal_code', 'suburb', 'phone_number', 'website', 'openinghours', 'services_offered', 'wheelchair_accessible', 'latitude', 'longitude', 'geometry', 'district', 'neighborhood', 'district_id']

Missing values after cleaning and transforming :
pharmacy_id                0
name                       0
operator                   0
street                     0
housenumber                0
postal_code                0
suburb                   231
phone_number               0
website                    0
openinghours               0
services_offered          90
wheelchair_accessible     57
latitude                   0
longitude                  0
geometry                   0
district                   0
neighborhood               0
district_id                0
dtype: int64


In [141]:
# Data types and non-null counts

pharmacy_df_district.info()

<class 'geopandas.geodataframe.GeoDataFrame'>
RangeIndex: 675 entries, 0 to 674
Data columns (total 18 columns):
 #   Column                 Non-Null Count  Dtype   
---  ------                 --------------  -----   
 0   pharmacy_id            675 non-null    object  
 1   name                   675 non-null    object  
 2   operator               675 non-null    object  
 3   street                 675 non-null    object  
 4   housenumber            675 non-null    object  
 5   postal_code            675 non-null    object  
 6   suburb                 444 non-null    object  
 7   phone_number           675 non-null    object  
 8   website                675 non-null    object  
 9   openinghours           675 non-null    object  
 10  services_offered       585 non-null    object  
 11  wheelchair_accessible  618 non-null    object  
 12  latitude               675 non-null    float64 
 13  longitude              675 non-null    float64 
 14  geometry               675 non-nul

In [136]:
pharmacy_df_district.to_csv("final_pharmacies_with_districts.csv")

In [138]:
# Normalize phone numbers: add the German country code +49 if it is missing
import re
import numpy as np

In [None]:
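def normalize_phone(phone):
    """Return the number as '+<digits>'; prepend +49 when there is no leading '+'."""
    # Treat NaN, None and empty strings as missing
    if pd.isna(phone) or str(phone).strip() == "":
        return np.nan
    phone = str(phone).strip()  # str() guards against non-string input
    # Remove everything except digits and a leading "+"
    phone = re.sub(r"(?!^\+)\D", "", phone)

    # Remove entries that are just '+' or too short
    if phone in ["", "+"] or len(re.sub(r"\D", "", phone)) < 7:
        return np.nan

    # Add the German country code when no "+" prefix is present
    if not phone.startswith("+"):
        if phone.startswith("0"):
            phone = "+49" + phone[1:]  # drop the national trunk prefix "0"
        else:
            phone = "+49" + phone
    return phone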


In [151]:
# apply normalization
if "phone_number" in pharmacy_df_district.columns:
    pharmacy_df_district["phone_number"] = pharmacy_df_district["phone_number"].apply(normalize_phone)  
    print("\nSample normalized phone numbers:")
    print(pharmacy_df_district["phone_number"].head(10))
else:
    print("No phone column to normalize.")      




Sample normalized phone numbers:
0     +49309713807
1              NaN
2     +49309759449
3     +49305337855
4              NaN
5     +49308227190
6     +49308541307
7    +493042852020
8    +493025767820
9     +49304985750
Name: phone_number, dtype: object
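
After normalization, a quick shape check can flag values that still do not look like an international number. The sketch below uses an assumed pattern: a `+`, a non-zero first digit, and 8 to 15 digits in total. The upper bound follows the 15-digit limit of E.164, while the lower bound is an arbitrary choice for illustration. `looks_like_e164` is a hypothetical helper, not a full validator.

```python
import re

# '+' followed by 8 to 15 digits, the first of which is not 0 (assumed pattern)
E164_LIKE = re.compile(r"\+[1-9]\d{7,14}")

def looks_like_e164(phone):
    """Return True if phone is a string matching the rough E.164-like shape."""
    return isinstance(phone, str) and E164_LIKE.fullmatch(phone) is not None

# Two values from the sample output above, plus two values that should fail
samples = ["+49309713807", "+493042852020", "49", None]
print([looks_like_e164(s) for s in samples])  # [True, True, False, False]
```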


# Table creation logic
(1) pharmacy_id (primary key)

(2) address information: street, housenumber, postal_code

(3) contact information: phone_number, website (email is excluded because more than 80% of its values are missing)

(4) openinghours

(5) services_offered: dispensing

(6) wheelchair_accessible

(7) neighborhood

(8) district

(9) district_id (foreign key)

(10) geometry information: coordinates, latitude, longitude
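
The selection rules above can be sketched as a small helper that keeps only the planned columns and checks that `pharmacy_id` is usable as a primary key. `FINAL_COLUMNS` and `select_final_columns` are hypothetical names. The sketch assumes that `name` is kept as well, and it leaves out the coordinate column because its construction is not shown in this section.

```python
import pandas as pd

# Assumed column order for the final table (illustrative, based on the list above)
FINAL_COLUMNS = [
    "pharmacy_id", "district_id", "name", "street", "housenumber", "postal_code",
    "district", "neighborhood", "phone_number", "website", "services_offered",
    "openinghours", "wheelchair_accessible", "latitude", "longitude",
]

def select_final_columns(df, columns=FINAL_COLUMNS):
    """Keep only the planned columns and verify the primary key is unique."""
    missing = [c for c in columns if c not in df.columns]
    if missing:
        raise KeyError(f"Missing expected columns: {missing}")
    if df["pharmacy_id"].duplicated().any():
        raise ValueError("pharmacy_id must be unique to serve as primary key")
    return df[columns].copy()

# Hypothetical one-row frame with two extra columns that should be dropped
demo = pd.DataFrame([{**{c: "x" for c in FINAL_COLUMNS}, "operator": "y", "suburb": "z"}])
final = select_final_columns(demo)
print(final.columns.tolist())
```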


In [144]:
# Rename neighbourhood to neighborhood
pharmacy_df_district = pharmacy_df_district.rename(columns={"neighbourhood": "neighborhood"})
pharmacy_df_district.head()

Unnamed: 0,pharmacy_id,name,operator,street,housenumber,postal_code,suburb,phone_number,website,openinghours,services_offered,wheelchair_accessible,latitude,longitude,geometry,district,neighborhood,district_id
0,60775323,reichenberger apotheke,,reichenberger straße,3.0,13055.0,Alt-Hohenschönhausen,49309713807,https://reichenbergerapotheke.de/,mo-th 08:00-19:00; fr 08:00-18:30; sa 09:00-13:00,dispensing_yes,yes,52.540741,13.486602,POINT (13.4866 52.54074),Lichtenberg,Alt-Hohenschönhausen,11011011
1,60848447,castello-apotheke,,,,,,49,,mo-fr 08:30-19:00; sa 08:30-14:00,dispensing_yes,yes,52.531835,13.469654,POINT (13.46965 52.53184),Lichtenberg,Fennpfuhl,11011011
2,60852928,rosen apotheke,,rudolf-seiffert-straße,11.0,10369.0,Fennpfuhl,49309759449,https://www.zurrose.de/,mo-fr 08:00-19:00; sa 08:00-12:00,dispensing_yes,yes,52.527555,13.468513,POINT (13.46851 52.52756),Lichtenberg,Fennpfuhl,11011011
3,68437791,margareten-apotheke,,karl-kunger-straße,46.0,12435.0,Alt-Treptow,49305337855,http://www.apotheke.borchert-online.de/,mo-fr 08:30-18:30; sa 08:30-13:00,dispensing_yes,no,52.48939,13.45057,POINT (13.45057 52.48939),Treptow-Köpenick,Alt-Treptow,11009009
4,69226035,leipziger apotheke,,leipziger straße,43.0,10117.0,Mitte,49,https://www.leipziger-apotheke.de/,mo-fr 08:00-19:00; sa 08:00-14:00,dispensing_yes,yes,52.510556,13.395863,POINT (13.39586 52.51056),Mitte,Mitte,11001001


In [145]:
# Check pharmacy_df_district after phone normalization
pharmacy_df_district.head()

Unnamed: 0,pharmacy_id,name,operator,street,housenumber,postal_code,suburb,phone_number,website,openinghours,services_offered,wheelchair_accessible,latitude,longitude,geometry,district,neighborhood,district_id
0,60775323,reichenberger apotheke,,reichenberger straße,3.0,13055.0,Alt-Hohenschönhausen,49309713807,https://reichenbergerapotheke.de/,mo-th 08:00-19:00; fr 08:00-18:30; sa 09:00-13:00,dispensing_yes,yes,52.540741,13.486602,POINT (13.4866 52.54074),Lichtenberg,Alt-Hohenschönhausen,11011011
1,60848447,castello-apotheke,,,,,,49,,mo-fr 08:30-19:00; sa 08:30-14:00,dispensing_yes,yes,52.531835,13.469654,POINT (13.46965 52.53184),Lichtenberg,Fennpfuhl,11011011
2,60852928,rosen apotheke,,rudolf-seiffert-straße,11.0,10369.0,Fennpfuhl,49309759449,https://www.zurrose.de/,mo-fr 08:00-19:00; sa 08:00-12:00,dispensing_yes,yes,52.527555,13.468513,POINT (13.46851 52.52756),Lichtenberg,Fennpfuhl,11011011
3,68437791,margareten-apotheke,,karl-kunger-straße,46.0,12435.0,Alt-Treptow,49305337855,http://www.apotheke.borchert-online.de/,mo-fr 08:30-18:30; sa 08:30-13:00,dispensing_yes,no,52.48939,13.45057,POINT (13.45057 52.48939),Treptow-Köpenick,Alt-Treptow,11009009
4,69226035,leipziger apotheke,,leipziger straße,43.0,10117.0,Mitte,49,https://www.leipziger-apotheke.de/,mo-fr 08:00-19:00; sa 08:00-14:00,dispensing_yes,yes,52.510556,13.395863,POINT (13.39586 52.51056),Mitte,Mitte,11001001


In [146]:
# change the name of dispensing to services_offered
pharmacy_df_district = pharmacy_df_district.rename(columns={"dispensing": "services_offered"})
pharmacy_df_district.head()

Unnamed: 0,pharmacy_id,name,operator,street,housenumber,postal_code,suburb,phone_number,website,openinghours,services_offered,wheelchair_accessible,latitude,longitude,geometry,district,neighborhood,district_id
0,60775323,reichenberger apotheke,,reichenberger straße,3.0,13055.0,Alt-Hohenschönhausen,49309713807,https://reichenbergerapotheke.de/,mo-th 08:00-19:00; fr 08:00-18:30; sa 09:00-13:00,dispensing_yes,yes,52.540741,13.486602,POINT (13.4866 52.54074),Lichtenberg,Alt-Hohenschönhausen,11011011
1,60848447,castello-apotheke,,,,,,49,,mo-fr 08:30-19:00; sa 08:30-14:00,dispensing_yes,yes,52.531835,13.469654,POINT (13.46965 52.53184),Lichtenberg,Fennpfuhl,11011011
2,60852928,rosen apotheke,,rudolf-seiffert-straße,11.0,10369.0,Fennpfuhl,49309759449,https://www.zurrose.de/,mo-fr 08:00-19:00; sa 08:00-12:00,dispensing_yes,yes,52.527555,13.468513,POINT (13.46851 52.52756),Lichtenberg,Fennpfuhl,11011011
3,68437791,margareten-apotheke,,karl-kunger-straße,46.0,12435.0,Alt-Treptow,49305337855,http://www.apotheke.borchert-online.de/,mo-fr 08:30-18:30; sa 08:30-13:00,dispensing_yes,no,52.48939,13.45057,POINT (13.45057 52.48939),Treptow-Köpenick,Alt-Treptow,11009009
4,69226035,leipziger apotheke,,leipziger straße,43.0,10117.0,Mitte,49,https://www.leipziger-apotheke.de/,mo-fr 08:00-19:00; sa 08:00-14:00,dispensing_yes,yes,52.510556,13.395863,POINT (13.39586 52.51056),Mitte,Mitte,11001001


In [147]:
#show unique values in services_offered
print("\nUnique services_offered:")
print(pharmacy_df_district["services_offered"].unique())
# prefix the raw values "yes"/"no" with "dispensing_" (a no-op if the values are already prefixed)
pharmacy_df_district["services_offered"] = pharmacy_df_district["services_offered"].replace({"yes": "dispensing_yes", "no": "dispensing_no", "Apotheke in Nikolassee": "dispensing_in Apotheke in Nikolassee"})
print("\nUnique services_offered after replacement:")
print(pharmacy_df_district["services_offered"].unique())


Unique services_offered:
['dispensing_yes' nan 'dispensing_no'
 'dispensing_in Apotheke in Nikolassee']

Unique services_offered after replacement:
['dispensing_yes' nan 'dispensing_no'
 'dispensing_in Apotheke in Nikolassee']
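The two outputs above are identical because the column already holds prefixed values, so the replacement mapping matches nothing. A minimal sketch of a normalization that is safe to re-run is shown below; the helper name `prefix_dispensing` and the sample values are assumptions for illustration, not part of the original notebook.

```python
import pandas as pd

def prefix_dispensing(value):
    """Prefix a raw value with 'dispensing_' unless it is missing or already prefixed."""
    if pd.isna(value):
        return value  # keep missing values untouched
    value = str(value)
    return value if value.startswith("dispensing_") else f"dispensing_{value}"

# Made-up sample mixing raw, already-prefixed, and missing values
sample = pd.Series(["yes", "dispensing_no", None], dtype="object")
normalized = sample.map(prefix_dispensing)
print(normalized.tolist())
```

Running the same `map` a second time leaves the values unchanged, which makes the cell safe to execute repeatedly.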


In [153]:
#change the column name to meet the schema, postcode to postal_code, phone to phone_number
pharmacy_df_district = pharmacy_df_district.rename(columns={"postcode": "postal_code", "phone": "phone_number"})
#delete the column "geometry"
pharmacy_df_district = pharmacy_df_district.drop(columns=["geometry"])
pharmacy_df_district.head()


Unnamed: 0,pharmacy_id,name,operator,street,housenumber,postal_code,suburb,phone_number,website,openinghours,services_offered,wheelchair_accessible,latitude,longitude,district,neighborhood,district_id
0,60775323,reichenberger apotheke,,reichenberger straße,3.0,13055.0,Alt-Hohenschönhausen,49309713807.0,https://reichenbergerapotheke.de/,mo-th 08:00-19:00; fr 08:00-18:30; sa 09:00-13:00,dispensing_yes,yes,52.540741,13.486602,Lichtenberg,Alt-Hohenschönhausen,11011011
1,60848447,castello-apotheke,,,,,,,,mo-fr 08:30-19:00; sa 08:30-14:00,dispensing_yes,yes,52.531835,13.469654,Lichtenberg,Fennpfuhl,11011011
2,60852928,rosen apotheke,,rudolf-seiffert-straße,11.0,10369.0,Fennpfuhl,49309759449.0,https://www.zurrose.de/,mo-fr 08:00-19:00; sa 08:00-12:00,dispensing_yes,yes,52.527555,13.468513,Lichtenberg,Fennpfuhl,11011011
3,68437791,margareten-apotheke,,karl-kunger-straße,46.0,12435.0,Alt-Treptow,49305337855.0,http://www.apotheke.borchert-online.de/,mo-fr 08:30-18:30; sa 08:30-13:00,dispensing_yes,no,52.48939,13.45057,Treptow-Köpenick,Alt-Treptow,11009009
4,69226035,leipziger apotheke,,leipziger straße,43.0,10117.0,Mitte,,https://www.leipziger-apotheke.de/,mo-fr 08:00-19:00; sa 08:00-14:00,dispensing_yes,yes,52.510556,13.395863,Mitte,Mitte,11001001
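In the output above, `postal_code` and `phone_number` are stored as floats (for example `13055.0`), while the planned schema treats them as text. One possible way to convert them, sketched on made-up sample rows, goes through pandas' nullable `Int64` dtype so that missing values survive the cast. The `+` prefix on the phone number is an assumption based on the values that appear later in the database output.

```python
import numpy as np
import pandas as pd

# Made-up sample with the same float representation as the notebook output
sample = pd.DataFrame({
    "postal_code": [13055.0, np.nan, 10369.0],
    "phone_number": [49309713807.0, np.nan, 49309759449.0],
})

# Int64 is the nullable integer dtype, so NaN becomes <NA> instead of raising
sample["postal_code"] = sample["postal_code"].astype("Int64").astype("string")

# Same cast for the phone number, then prepend "+" (missing values stay <NA>)
sample["phone_number"] = "+" + sample["phone_number"].astype("Int64").astype("string")

print(sample)
```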


In [185]:
#add coordinates column
pharmacy_df_district["coordinates"] = pharmacy_df_district.apply(lambda row: f"POINT({row['longitude']} {row['latitude']})", axis=1)
pharmacy_df_district.head()



Unnamed: 0,pharmacy_id,name,operator,street,housenumber,postal_code,suburb,phone_number,website,openinghours,services_offered,wheelchair_accessible,latitude,longitude,district,neighborhood,district_id,coordinates
0,60775323,reichenberger apotheke,,reichenberger straße,3.0,13055.0,Alt-Hohenschönhausen,49309713807.0,https://reichenbergerapotheke.de/,mo-th 08:00-19:00; fr 08:00-18:30; sa 09:00-13:00,dispensing_yes,yes,52.540741,13.486602,Lichtenberg,Alt-Hohenschönhausen,11011011,POINT(13.4866016 52.5407412)
1,60848447,castello-apotheke,,,,,,,,mo-fr 08:30-19:00; sa 08:30-14:00,dispensing_yes,yes,52.531835,13.469654,Lichtenberg,Fennpfuhl,11011011,POINT(13.4696536 52.5318353)
2,60852928,rosen apotheke,,rudolf-seiffert-straße,11.0,10369.0,Fennpfuhl,49309759449.0,https://www.zurrose.de/,mo-fr 08:00-19:00; sa 08:00-12:00,dispensing_yes,yes,52.527555,13.468513,Lichtenberg,Fennpfuhl,11011011,POINT(13.4685134 52.5275552)
3,68437791,margareten-apotheke,,karl-kunger-straße,46.0,12435.0,Alt-Treptow,49305337855.0,http://www.apotheke.borchert-online.de/,mo-fr 08:30-18:30; sa 08:30-13:00,dispensing_yes,no,52.48939,13.45057,Treptow-Köpenick,Alt-Treptow,11009009,POINT(13.4505704 52.4893898)
4,69226035,leipziger apotheke,,leipziger straße,43.0,10117.0,Mitte,,https://www.leipziger-apotheke.de/,mo-fr 08:00-19:00; sa 08:00-14:00,dispensing_yes,yes,52.510556,13.395863,Mitte,Mitte,11001001,POINT(13.3958628 52.5105561)
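The `coordinates` column holds a WKT-style point string with longitude first and latitude second. The sketch below shows the same formatting as a small function that also returns `None` when a coordinate is missing; the function name `to_wkt_point` is hypothetical and not used in the notebook.

```python
import math

def to_wkt_point(longitude, latitude):
    """Return a WKT point string (longitude first), or None if a coordinate is missing."""
    if longitude is None or latitude is None:
        return None
    if math.isnan(longitude) or math.isnan(latitude):
        return None
    return f"POINT({longitude} {latitude})"

print(to_wkt_point(13.4866016, 52.5407412))
```

It could be applied row-wise in the same way as the lambda above, for example `df.apply(lambda row: to_wkt_point(row["longitude"], row["latitude"]), axis=1)`.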


In [186]:
# save the final dataframe
pharmacy_df_district.to_csv("final_pharmacies_with_districts.csv", index=False)

# 🧩 Step 3: Populate Database

In [179]:
!pip install psycopg2-binary  # for Postgres
!pip install pymysql           # for MySQL




In [187]:
import psycopg2
import pandas as pd
from sqlalchemy import create_engine, text
import warnings

warnings.filterwarnings("ignore")

In [188]:
from sqlalchemy import create_engine

In [224]:
# Create a Neon DB connection for testing
# DB connection setup using hardcoded credentials
conn = psycopg2.connect(
    dbname="neondb",
    user="neondb_owner",
    password="a9Am7Yy5r9_T7h4OF2GN",
    host="ep-falling-glitter-a5m0j5gk-pooler.us-east-2.aws.neon.tech",
    port="5432",
    sslmode="require"
)
cur = conn.cursor()
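Hardcoded credentials end up in every exported copy of the notebook. A common alternative is to read them from environment variables. The sketch below assumes the libpq-style variable names `PGUSER`, `PGPASSWORD`, `PGHOST`, `PGPORT`, and `PGDATABASE`; the helper `build_postgres_url` and the example values are hypothetical.

```python
import os
from urllib.parse import quote_plus

def build_postgres_url(env=os.environ):
    """Build a SQLAlchemy-style PostgreSQL URL from environment-like settings."""
    user = env["PGUSER"]
    password = quote_plus(env["PGPASSWORD"])  # escape characters such as "@" or ":"
    host = env["PGHOST"]
    port = env.get("PGPORT", "5432")
    database = env["PGDATABASE"]
    return f"postgresql+psycopg2://{user}:{password}@{host}:{port}/{database}?sslmode=require"

# Example with placeholder values instead of real secrets
example_env = {
    "PGUSER": "demo_user",
    "PGPASSWORD": "p@ss",
    "PGHOST": "db.example.com",
    "PGDATABASE": "demo_db",
}
example_url = build_postgres_url(example_env)
print(example_url)
```

The resulting string could then be passed to `create_engine(...)` in place of a literal URL.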

In [242]:
# check the column names
print(pharmacy_df_district.columns.tolist())

['pharmacy_id', 'name', 'operator', 'street', 'housenumber', 'postal_code', 'suburb', 'phone_number', 'website', 'openinghours', 'services_offered', 'wheelchair_accessible', 'latitude', 'longitude', 'district', 'neighborhood', 'district_id', 'coordinates']


In [243]:
# select the columns for the pharmacies_test table, in schema order
pharmacy_test = pharmacy_df_district[["pharmacy_id", "district_id", "name", "street", "housenumber", "postal_code", "district", "neighborhood", "phone_number", "website", "services_offered", "openinghours", "wheelchair_accessible", "latitude", "longitude", "coordinates"]]
pharmacy_test.head()


Unnamed: 0,pharmacy_id,district_id,name,street,housenumber,postal_code,district,neighborhood,phone_number,website,services_offered,openinghours,wheelchair_accessible,latitude,longitude,coordinates
0,60775323,11011011,reichenberger apotheke,reichenberger straße,3.0,13055.0,Lichtenberg,Alt-Hohenschönhausen,49309713807.0,https://reichenbergerapotheke.de/,dispensing_yes,mo-th 08:00-19:00; fr 08:00-18:30; sa 09:00-13:00,yes,52.540741,13.486602,POINT(13.4866016 52.5407412)
1,60848447,11011011,castello-apotheke,,,,Lichtenberg,Fennpfuhl,,,dispensing_yes,mo-fr 08:30-19:00; sa 08:30-14:00,yes,52.531835,13.469654,POINT(13.4696536 52.5318353)
2,60852928,11011011,rosen apotheke,rudolf-seiffert-straße,11.0,10369.0,Lichtenberg,Fennpfuhl,49309759449.0,https://www.zurrose.de/,dispensing_yes,mo-fr 08:00-19:00; sa 08:00-12:00,yes,52.527555,13.468513,POINT(13.4685134 52.5275552)
3,68437791,11009009,margareten-apotheke,karl-kunger-straße,46.0,12435.0,Treptow-Köpenick,Alt-Treptow,49305337855.0,http://www.apotheke.borchert-online.de/,dispensing_yes,mo-fr 08:30-18:30; sa 08:30-13:00,no,52.48939,13.45057,POINT(13.4505704 52.4893898)
4,69226035,11001001,leipziger apotheke,leipziger straße,43.0,10117.0,Mitte,Mitte,,https://www.leipziger-apotheke.de/,dispensing_yes,mo-fr 08:00-19:00; sa 08:00-14:00,yes,52.510556,13.395863,POINT(13.3958628 52.5105561)


In [244]:
#save the pharmacy_test to csv
pharmacy_test.to_csv("pharmacy_test.csv", index=False)

# A) Test neonDB connection setup using test_berlin_data

In [248]:
engine = create_engine(
    "postgresql+psycopg2://neondb_owner:a9Am7Yy5r9_T7h4OF2GN@ep-falling-glitter-a5m0j5gk-pooler.us-east-2.aws.neon.tech:5432/neondb?sslmode=require"
)

In [None]:
# Create the table with its constraints and references first,
# in the same schema that to_sql writes to (test_berlin_data)
create_table_query = """
CREATE TABLE IF NOT EXISTS test_berlin_data.pharmacies_test (
    pharmacy_id VARCHAR(20) PRIMARY KEY,
    district_id VARCHAR(20),
    name VARCHAR(200),
    street VARCHAR(200),
    housenumber VARCHAR(20),
    postal_code VARCHAR(10),
    phone_number VARCHAR(50),
    openinghours VARCHAR(100),
    website VARCHAR(200),
    coordinates VARCHAR(200),
    latitude DECIMAL(9,6),
    longitude DECIMAL(9,6),
    neighborhood VARCHAR(100),
    district VARCHAR(100),
    services_offered VARCHAR(200),
    wheelchair_accessible VARCHAR(20),
    CONSTRAINT district_id_fk FOREIGN KEY (district_id)
        REFERENCES test_berlin_data.districts(district_id)
        ON DELETE RESTRICT
        ON UPDATE CASCADE
);
"""
# Execute the create table query; engine.begin() commits automatically on exit
with engine.begin() as connection:
    connection.execute(text(create_table_query))
    print("Table 'pharmacies_test' created or already exists.")

Table 'pharmacies_test' created or already exists.


# B) Write pharmacies_test to test_berlin_data using sql

In [254]:


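# Write the DataFrame into the database with to_sql
# NOTE: if_exists='replace' drops and recreates the table, so constraints defined earlier are lost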
pharmacy_test.to_sql(
    'pharmacies_test',      
    con=engine,   
    if_exists='replace',
    schema='test_berlin_data',
    index=False
)


675
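Because `if_exists='replace'` drops the existing table before writing, a primary key or foreign key defined beforehand does not survive the load. The sketch below demonstrates the difference between `append` and `replace` on an in-memory SQLite database, used here only as a stand-in so the example runs without a PostgreSQL server; the table name `pharmacies_demo` is hypothetical.

```python
import sqlite3
import pandas as pd

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE pharmacies_demo (pharmacy_id TEXT PRIMARY KEY, name TEXT)")

demo = pd.DataFrame({"pharmacy_id": ["60775323"], "name": ["reichenberger apotheke"]})

def table_ddl(connection):
    """Return the CREATE TABLE statement SQLite currently stores for the demo table."""
    row = connection.execute(
        "SELECT sql FROM sqlite_master WHERE name = 'pharmacies_demo'"
    ).fetchone()
    return row[0]

# append keeps the hand-written table definition, including the primary key
demo.to_sql("pharmacies_demo", conn, if_exists="append", index=False)
ddl_after_append = table_ddl(conn)

# replace drops the table and lets pandas recreate it without constraints
demo.to_sql("pharmacies_demo", conn, if_exists="replace", index=False)
ddl_after_replace = table_ddl(conn)

print("PRIMARY KEY" in ddl_after_append, "PRIMARY KEY" in ddl_after_replace)
```

Using `if_exists='append'` after creating the table, as in the later layereddb step, keeps the declared constraints in place.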

# C) Validation Result from the test_insert

In [255]:
# query the pharmacies_test table

df = pd.read_sql("SELECT * FROM test_berlin_data.pharmacies_test", engine)
print(df.head())

  pharmacy_id district_id                    name                  street  \
0    60775323    11011011  reichenberger apotheke    reichenberger straße   
1    60848447    11011011       castello-apotheke                     nan   
2    60852928    11011011          rosen apotheke  rudolf-seiffert-straße   
3    68437791    11009009     margareten-apotheke      karl-kunger-straße   
4    69226035    11001001      leipziger apotheke        leipziger straße   

  housenumber postal_code          district          neighborhood  \
0           3       13055       Lichtenberg  Alt-Hohenschönhausen   
1         nan         nan       Lichtenberg             Fennpfuhl   
2          11       10369       Lichtenberg             Fennpfuhl   
3          46       12435  Treptow-Köpenick           Alt-Treptow   
4          43       10117             Mitte                 Mitte   

   phone_number                                  website services_offered  \
0  +49309713807        https://reichenbergera
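The lowercase `nan` entries in `street`, `housenumber`, and `postal_code` above suggest that missing values were stored as the literal string "nan" rather than as SQL NULL, which can happen when a column is cast to `str` before loading. Assuming that is the case, one way to turn them back into real missing values before writing is sketched below on made-up rows.

```python
import pandas as pd

# Made-up sample where missing values were stringified to "nan"
sample = pd.DataFrame({
    "street": ["reichenberger straße", "nan"],
    "postal_code": ["13055", "nan"],
})

# mask() swaps every cell equal to the string "nan" for a real missing value
cleaned = sample.mask(sample == "nan")

print(cleaned.isna().sum())
```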

In [247]:
# save pharmacy_test under its final name, pharmacies.csv
pharmacy_test.to_csv("pharmacies.csv", index=False)
# load the pharmacies table to check the data
pd.read_csv("pharmacies.csv")



Unnamed: 0,pharmacy_id,district_id,name,street,housenumber,postal_code,district,neighborhood,phone_number,website,services_offered,openinghours,wheelchair_accessible,latitude,longitude,coordinates
0,60775323,11011011,reichenberger apotheke,reichenberger straße,3,13055.0,Lichtenberg,Alt-Hohenschönhausen,4.930971e+10,https://reichenbergerapotheke.de/,dispensing_yes,mo-th 08:00-19:00; fr 08:00-18:30; sa 09:00-13:00,yes,52.540741,13.486602,POINT(13.4866016 52.5407412)
1,60848447,11011011,castello-apotheke,,,,Lichtenberg,Fennpfuhl,,,dispensing_yes,mo-fr 08:30-19:00; sa 08:30-14:00,yes,52.531835,13.469654,POINT(13.4696536 52.5318353)
2,60852928,11011011,rosen apotheke,rudolf-seiffert-straße,11,10369.0,Lichtenberg,Fennpfuhl,4.930976e+10,https://www.zurrose.de/,dispensing_yes,mo-fr 08:00-19:00; sa 08:00-12:00,yes,52.527555,13.468513,POINT(13.4685134 52.5275552)
3,68437791,11009009,margareten-apotheke,karl-kunger-straße,46,12435.0,Treptow-Köpenick,Alt-Treptow,4.930534e+10,http://www.apotheke.borchert-online.de/,dispensing_yes,mo-fr 08:30-18:30; sa 08:30-13:00,no,52.489390,13.450570,POINT(13.4505704 52.4893898)
4,69226035,11001001,leipziger apotheke,leipziger straße,43,10117.0,Mitte,Mitte,,https://www.leipziger-apotheke.de/,dispensing_yes,mo-fr 08:00-19:00; sa 08:00-14:00,yes,52.510556,13.395863,POINT(13.3958628 52.5105561)
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
670,510479858,11012012,hirsch-apotheke,heinsestraße,47,13467.0,Reinickendorf,Hermsdorf,,,,mo-fr 08:30-18:30; sa 08:30-14:00,yes,52.618413,13.305325,POINT(13.305324546749377 52.6184127)
671,739645093,11002002,arena apotheke,,,,Friedrichshain-Kreuzberg,Friedrichshain,4.930423e+10,https://www.arena-apotheke.de/,dispensing_yes,"mo-fr 08:00-20:00; sa 09:00-20:00; su,ph off",yes,52.505810,13.447141,POINT(13.447141357890773 52.5058097)
672,1002286243,11010010,kastanien apotheke,stendaler straße,24,12627.0,Marzahn-Hellersdorf,Hellersdorf,,,dispensing_yes,"mo-fr 08:30-20:00, sa 10:00-20:00; ph off",yes,52.537358,13.605459,POINT(13.605459251299257 52.53735795)
673,1208913688,11010010,prinzen apotheke,,,,Marzahn-Hellersdorf,Kaulsdorf,4.930564e+10,https://prinzen-apotheke-berlin.de/,dispensing_yes,"mo-fr 08:00-18:30, sa 09:00-12:00; ph off",yes,52.513065,13.589787,POINT(13.58978681699029 52.5130653)
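When the CSV is read back with default settings, `phone_number` shows up in scientific notation (`4.930971e+10`) and `postal_code` as a float, because pandas infers numeric types for those columns. A small sketch of reading the identifier-like columns as text instead, using an inline CSV string in place of `pharmacies.csv`:

```python
import io
import pandas as pd

# Inline stand-in for pharmacies.csv with one complete and one sparse row
csv_text = (
    "pharmacy_id,postal_code,phone_number\n"
    "60775323,13055,+49309713807\n"
    "60848447,,\n"
)

# Default inference turns the phone number into a float
default_read = pd.read_csv(io.StringIO(csv_text))

# Declaring the columns as strings keeps the values exactly as written
text_columns = {"pharmacy_id": "string", "postal_code": "string", "phone_number": "string"}
typed_read = pd.read_csv(io.StringIO(csv_text), dtype=text_columns)

print(default_read.dtypes)
print(typed_read)
```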


# D) Write the pharmacies table into layereddb

In [189]:
user_name='mei_fang_chen'
password='EX56tZ05VavUp9hC'

In [233]:
# Connection settings
host = 'localhost'
port = '5433'
database = 'layereddb'
schema='berlin_source_data'

# connect to the database once the tunnel is open
engine = create_engine(f'postgresql+psycopg2://{user_name}:{password}@{host}:{port}/{database}')

### Set up constraints and references for the table

In [234]:
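# Create the table with its constraints and references first,
# in the same schema that to_sql writes to (berlin_source_data)
create_table_query = """
CREATE TABLE IF NOT EXISTS berlin_source_data.pharmacies (
    pharmacy_id VARCHAR(20) PRIMARY KEY,
    district_id VARCHAR(20),  -- district IDs such as 11011011 have 8 characters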
    name VARCHAR(200),
    street VARCHAR(200),
    housenumber VARCHAR(20),
    postal_code VARCHAR(10),
    phone_number VARCHAR(50),
    openinghours VARCHAR(100),
    website VARCHAR(200),
    coordinates VARCHAR(200),
    latitude DECIMAL(9,6),
    longitude DECIMAL(9,6),
    neighborhood VARCHAR(100),
    district VARCHAR(100),
    services_offered VARCHAR(200),
    wheelchair_accessible VARCHAR(20),
    CONSTRAINT district_id_fk FOREIGN KEY (district_id)
        REFERENCES berlin_source_data.districts(district_id)
        ON DELETE RESTRICT
        ON UPDATE CASCADE
);
"""
# Execute the create table query
with engine.connect() as connection:
    connection.execute(text(create_table_query))
    connection.commit()
    print("Table 'pharmacies' created or already exists.")  

Table 'pharmacies' created or already exists.
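With the foreign key in place, an insert fails if any `district_id` in the pharmacies data is missing from `districts`. A quick check before loading can surface such rows early. The sketch below uses two small made-up frames; in practice the `districts` frame would be read from the database.

```python
import pandas as pd

# Made-up sample; the last pharmacy points at a district that does not exist
pharmacies = pd.DataFrame({
    "pharmacy_id": ["60775323", "69226035", "99999999"],
    "district_id": ["11011011", "11001001", "11099099"],
})
districts = pd.DataFrame({"district_id": ["11011011", "11001001", "11009009"]})

# Rows whose district_id has no match would violate the foreign key
orphans = pharmacies[~pharmacies["district_id"].isin(districts["district_id"])]

print(orphans)
```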


# E) Write the pharmacies to the layeredDB

In [240]:
#  Send the DataFrame to the database using .to_sql()
pharmacy_test.to_sql(
    'pharmacies',      
    engine,
    schema=schema,
    if_exists='append', # ✅ keeps table, just inserts data
    index=False
)

print("DataFrame sent to PostgreSQL using .to_sql() with psycopg2!")

DataFrame sent to PostgreSQL using .to_sql() with psycopg2!


# F) Query the pharmacies table

In [241]:
# Query the loaded data
query = """
SELECT * FROM berlin_source_data.pharmacies
"""

# Execute the query (a read-only SELECT needs no commit)
with engine.connect() as conn:
    df = pd.read_sql(text(query), conn)
df

Unnamed: 0,pharmacy_id,district_id,name,street,housenumber,postal_code,district,neighborhood,phone_number,website,services_offered,openinghours,wheelchair_accessible,latitude,longitude,coordinates
0,60775323,11011011,reichenberger apotheke,reichenberger straße,3,13055,Lichtenberg,Alt-Hohenschönhausen,+49309713807,https://reichenbergerapotheke.de/,dispensing_yes,mo-th 08:00-19:00; fr 08:00-18:30; sa 09:00-13:00,yes,52.540741,13.486602,POINT(13.4866016 52.5407412)
1,60848447,11011011,castello-apotheke,,,,Lichtenberg,Fennpfuhl,,,dispensing_yes,mo-fr 08:30-19:00; sa 08:30-14:00,yes,52.531835,13.469654,POINT(13.4696536 52.5318353)
2,60852928,11011011,rosen apotheke,rudolf-seiffert-straße,11,10369,Lichtenberg,Fennpfuhl,+49309759449,https://www.zurrose.de/,dispensing_yes,mo-fr 08:00-19:00; sa 08:00-12:00,yes,52.527555,13.468513,POINT(13.4685134 52.5275552)
3,68437791,11009009,margareten-apotheke,karl-kunger-straße,46,12435,Treptow-Köpenick,Alt-Treptow,+49305337855,http://www.apotheke.borchert-online.de/,dispensing_yes,mo-fr 08:30-18:30; sa 08:30-13:00,no,52.489390,13.450570,POINT(13.4505704 52.4893898)
4,69226035,11001001,leipziger apotheke,leipziger straße,43,10117,Mitte,Mitte,,https://www.leipziger-apotheke.de/,dispensing_yes,mo-fr 08:00-19:00; sa 08:00-14:00,yes,52.510556,13.395863,POINT(13.3958628 52.5105561)
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
670,510479858,11012012,hirsch-apotheke,heinsestraße,47,13467,Reinickendorf,Hermsdorf,,,,mo-fr 08:30-18:30; sa 08:30-14:00,yes,52.618413,13.305325,POINT(13.305324546749377 52.6184127)
671,739645093,11002002,arena apotheke,,,,Friedrichshain-Kreuzberg,Friedrichshain,+49304226620,https://www.arena-apotheke.de/,dispensing_yes,"mo-fr 08:00-20:00; sa 09:00-20:00; su,ph off",yes,52.505810,13.447141,POINT(13.447141357890773 52.5058097)
672,1002286243,11010010,kastanien apotheke,stendaler straße,24,12627,Marzahn-Hellersdorf,Hellersdorf,,,dispensing_yes,"mo-fr 08:30-20:00, sa 10:00-20:00; ph off",yes,52.537358,13.605459,POINT(13.605459251299257 52.53735795)
673,1208913688,11010010,prinzen apotheke,,,,Marzahn-Hellersdorf,Kaulsdorf,+49305638146,https://prinzen-apotheke-berlin.de/,dispensing_yes,"mo-fr 08:00-18:30, sa 09:00-12:00; ph off",yes,52.513065,13.589787,POINT(13.58978681699029 52.5130653)
