# Geo2025 – Instructor Baseline

This notebook is an **instructor-only reference solution** for the Geo2025 hackathon.  
It:

1. Loads and wrangles the provided geospatial datasets for Europe and South Holland  
2. Focuses on the city of Rotterdam  
3. Enriches points with:
   - Reverse-geocoded metadata from OpenStreetMap (class / type) via the Nominatim API  
   - RD (Rijksdriehoeks) coordinates from a precomputed lookup  
   - Lithology (soil type) at a chosen depth from a voxel model (for instructor ground truth only)
4. Trains simple baseline classifiers to predict lithology classes
5. Produces:
   - A **sample submission** file (`sample_submission.csv`) with `ID,prediction`
   - A **hidden ground truth** file (`ground_truth.csv`) for the leaderboard
   - A **student-safe voxel file** where the “truth” voxels used for the 18 test points are removed

The goal is to provide:

- An _instructor reference solution_
- A _baseline Macro-F1 score_
- The exact datasets that will be shipped to students (without leaking labels)

---

## Prior knowledge (covered in the bootcamp)

You have already seen:

- **Pandas / NumPy** for data wrangling (SLU01–06, BLU01–02)  
- **APIs with `requests`** (BLU03 – Data Sources)  
- **Train/test split** (`train_test_split`)  
  (SLU07–08, SLU09–10, SLU13–15)  
- **Classification models** from scikit-learn:
  - `LogisticRegression` (SLU09 – Classification with Logistic Regression)  
  - `DecisionTreeClassifier` (SLU11 – Tree-Based Models)  
  - `RandomForestClassifier` (SLU11 – Tree-Based Models)  
- **Metrics**:
  - `f1_score`, especially **macro-F1** (SLU08 – Metrics Regression, SLU10 – Metrics Classification)  
  - `classification_report` style summaries  
- **Pipelines & scaling**:
  - `Pipeline` / `make_pipeline`, `StandardScaler` (SLU15–16 – Hyperparameter Tuning & Workflow)

> **Important:**  
> This notebook contains extra instructor-only steps, such as:
> - Generating `ground_truth.csv`  
> - Pruning the voxel model for students
>
> These cells must **not** be copied into the student notebook.


## Import libraries, configure the paths to the files, and set constants

In this (instructor) notebook we:

- Import the libraries used throughout the baseline
- Point `DATA_DIR` to the local `data/` folder
- Define file paths and constants (depth, RNG seed, API URL, etc.)



In [1]:
from pathlib import Path
import math
import json
import random

import numpy as np
import pandas as pd
import requests

import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.model_selection import train_test_split
from sklearn.metrics import f1_score, classification_report
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KDTree


sns.set_style("whitegrid")

# For the instructor notebook you can point this to your local path.
# For the student version, this will be simply DATA_DIR = Path("data")
DATA_DIR = Path(r"C:\Users\KimPronk\PycharmProjects\LDSA\batch-instructors\S02 - Data Wrangling\HCKT02 - Data Wrangling\data")
# DATA_DIR = Path("data")

CSV_GEO = DATA_DIR / "geocoordinates_europe_2025-11-12.csv"
RD_LOOKUP = DATA_DIR / "mapping_geo_coordinates_to_RD_coordinates.json"
GEO_VOXELS = DATA_DIR / "geotop_south_holland.parquet"  # full voxels (instructor)

# Precomputed reverse-geocoding for Rotterdam points (instructor cache)
INTERIM_CSV = DATA_DIR / "interim_result_steps-1to5.csv"

# Hackathon test set path (will be created from rotterdam_rd later)
TEST_CSV = DATA_DIR / "test_dataset.csv"

DEPTH_CHOICE = -20.0   # meters
DEPTH_TOLERANCE = 0.5  # meters
RNG_SEED = 2025

# API (OSM Nominatim reverse geocoding)
NOMINATIM_URL = "https://nominatim.openstreetmap.org/reverse"
CONTACT_EMAIL = "silviae.zieger@gmail.com"  # in student version: "your_email_here"

# Fix random seeds for reproducibility
random.seed(RNG_SEED)
np.random.seed(RNG_SEED)


## Load datasets

- `geocoordinates_europe_2025-11-12.csv` (tab-separated)
  - A table of European locations with latitude/longitude and some basic metadata.
  - We read it with `pd.read_csv(..., sep="\t")` since it is tab-separated.

- `mapping_geo_coordinates_to_RD_coordinates.json` (JSON lines)
  - A mapping from WGS84 coordinates (latitude, longitude) to Dutch RD coordinates (`RD_east`, `RD_north`), already precomputed for us.
  - We load it with `pd.read_json(..., orient="records", lines=True)` as you saw in BLU03.

- `geotop_south_holland.parquet` (Parquet)
  - A voxel model of the subsurface (geology) for South Holland, including x, y, z and lithology information.
  - We load it with `pd.read_parquet(...)`, which is an efficient columnar format you also met in BLU03.
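
As a minimal sketch of the two text-based loaders described above, the snippet below parses tiny in-memory stand-ins for the TSV and JSON-lines files. The variable names (`tsv_text`, `jsonl_text`, `geo_demo`, `rd_demo`) and the `idx` header are assumptions made for illustration; the single row in each reuses values that appear in this notebook's outputs.

```python
import io

import pandas as pd

# Hypothetical one-row stand-ins for the real files (the "idx" header is assumed).
tsv_text = "idx\tcountryCode\tcityName\tlatitude\tlongitude\n0\tNL\tSchiedam\t51.91917\t4.38889\n"
jsonl_text = (
    '{"latitude": 51.91816, "longitude": 4.387831, '
    '"RD_east": 86226.007827, "RD_north": 436996.718946}\n'
)

# Tab-separated text needs sep="\t"; usecols=[1, 2, 3, 4] skips the leading index column.
geo_demo = pd.read_csv(io.StringIO(tsv_text), sep="\t", usecols=[1, 2, 3, 4])

# JSON lines stores one JSON object per line, hence lines=True.
rd_demo = pd.read_json(io.StringIO(jsonl_text), orient="records", lines=True)

print(geo_demo.columns.tolist())
print(rd_demo.shape)
```

Swapping the `io.StringIO(...)` objects for file paths gives the same calls that are used on the real data.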


In [2]:
# European geocoordinates (TSV-style CSV)
geo = pd.read_csv(CSV_GEO, sep="\t", usecols=[1, 2, 3, 4])

# Mapping from lat/lon to RD coordinates (JSON lines)
mapping_coordinates = pd.read_json(RD_LOOKUP, orient="records", lines=True)

# Geological voxel model (Parquet)
# NOTE: students need pyarrow or fastparquet installed to read parquet files
voxels = pd.read_parquet(GEO_VOXELS, engine="pyarrow")  # or engine="fastparquet"

print("geo:", geo.shape, "| rd:", mapping_coordinates.shape, "| voxels:", voxels.shape)
display(geo.head(), mapping_coordinates.head(), voxels.head())


geo: (8202, 4) | rd: (19, 4) | voxels: (391956, 14)


Unnamed: 0,countryCode,cityName,latitude,longitude
0,AD,les Escaldes,42.50729,1.53414
1,AD,Andorra la Vella,42.50779,1.52109
2,AL,Sarandë,39.87534,20.00477
3,AL,Pogradec,40.9025,20.6525
4,AL,Librazhd,41.17944,20.315


Unnamed: 0,latitude,longitude,RD_east,RD_north
0,51.912476,4.341611,83037.716304,436409.155066
1,51.84498,4.329021,82062.232477,428913.878985
2,52.001555,4.165362,71079.916971,446506.937484
3,51.91816,4.387831,86226.007827,436996.718946
4,51.958796,4.471991,92071.84146,441440.635968


Unnamed: 0,x,y,z,lithoklasse_id,lithoklasse,kans_1_veen,kans_2_klei,kans_3_kleiig_zand,kans_4_vervallen,kans_5_zand_fijn,kans_6_zand_matig_grof,kans_7_zand_grof,kans_8_grind,kans_9_schelpen
0,89550.0,432850.0,-7.75,0,,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,89550.0,432850.0,-8.25,3,kleiig_zand,0.06,0.07,0.27,0.0,0.33,0.16,0.11,0.0,0.0
2,89550.0,432850.0,-8.75,2,klei,0.14,0.52,0.34,0.0,0.0,0.0,0.0,0.0,0.0
3,89550.0,432850.0,-9.25,2,klei,0.12,0.57,0.31,0.0,0.0,0.0,0.0,0.0,0.0
4,89550.0,432850.0,-9.75,2,klei,0.13,0.58,0.29,0.0,0.0,0.0,0.0,0.0,0.0


## Step 1 - Select NL points and crop to the city of Rotterdam
The `geo` dataframe contains geocoordinates for many locations across Europe.  
In this step we:

1. **Filter to the Netherlands** using the `countryCode` column.  
2. **Restrict the points to a bounding box around Rotterdam** using latitude and longitude limits.  
3. Store the result in a new dataframe called `nl_rotterdam`.

We use simple boolean masks, just like in SLU02 (Subsetting Data in Pandas):

- `nl["latitude"] >= min_lat` & `<= max_lat`  
- `nl["longitude"] >= min_lon` & `<= max_lon`

We then set `cityName` as the index, just to make it easier to inspect and debug individual locations.
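
The bounding-box filter can also be wrapped in a small helper. The sketch below is one possible way to write it: `in_bbox` is a hypothetical function (not part of the notebook), and it relies on `Series.between`, which includes both bounds by default. The box limits are the ones used in this step, and the two demo rows come from the `geo` preview.

```python
import pandas as pd


def in_bbox(df, lat_range, lon_range):
    """Return a boolean mask for rows inside the box (bounds inclusive)."""
    lat_ok = df["latitude"].between(min(lat_range), max(lat_range))
    lon_ok = df["longitude"].between(min(lon_range), max(lon_range))
    return lat_ok & lon_ok


demo = pd.DataFrame({
    "cityName": ["Schiedam", "les Escaldes"],
    "latitude": [51.91917, 42.50729],
    "longitude": [4.38889, 1.53414],
})

# Same rough Rotterdam box as in this step: (min, max) per axis.
mask = in_bbox(demo, (51.829961, 52.010604), (3.962246, 4.614341))
print(demo.loc[mask, "cityName"].tolist())
```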

In [3]:
# Rough bounding box for Rotterdam
longitude_rotterdam = (3.962246, 4.614341)
latitude_rotterdam = (51.829961, 52.010604)

# Filter to Netherlands
nl = geo.query("countryCode == 'NL'").copy()

# Crop to Rotterdam bounding box
mask_lat = (nl["latitude"] >= min(latitude_rotterdam)) & (nl["latitude"] <= max(latitude_rotterdam))
mask_lon = (nl["longitude"] >= min(longitude_rotterdam)) & (nl["longitude"] <= max(longitude_rotterdam))

nl_rotterdam = nl[mask_lat & mask_lon].copy()
nl_rotterdam.set_index("cityName", inplace=True)

print("Number of points in Rotterdam bounding box:", nl_rotterdam.shape[0])
nl_rotterdam.head()


Number of points in Rotterdam bounding box: 20


cityName,countryCode,latitude,longitude
Vlaardingen,NL,51.9125,4.34167
Spijkenisse,NL,51.845,4.32917
's-Gravenzande,NL,52.00167,4.16528
Schiedam,NL,51.91917,4.38889
Schiebroek,NL,51.95838,4.47124


## Step 2 – Get human-readable info for Rotterdam points (reverse geocoding)

Right now `nl_rotterdam` only has basic columns (`countryCode`, `latitude`, `longitude`) and is indexed by `cityName`.  
To understand the type of each location (e.g., residential or commercial), we use the OpenStreetMap **Nominatim** API to do **reverse geocoding**:

- Input: latitude & longitude  
- Output (JSON): a `display_name` and extra fields like `class` and `type`  

This is the same idea as in **BLU03 – Data Sources**, where we called web APIs with `requests` and parsed the JSON response.

In this step we:

1. Define a helper function `reverse_geocode_rotterdam(...)` that:
   - Loops over the Rotterdam points  
   - Calls the Nominatim API for each point  
   - Extracts `display_name`, `class`, and `type` from the JSON  
   - Builds a new dataframe called `rotterdam`

2. In the instructor notebook, we optionally cache the results in `interim_result_steps-1to5.csv`
   so we don’t hit the API every time.  
   In the student setup, this cache will **not** be shipped, so students will call the API themselves.
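
To make the extraction step concrete without calling the API, here is a small sketch that pulls the same fields out of a response-shaped dictionary. The helper `extract_osm_fields` and the `sample_payload` values (including the display name) are made up for illustration; only the key names and the fact that `lat` / `lon` need converting with `float(...)` follow the cell below.

```python
def extract_osm_fields(payload: dict) -> dict:
    """Pick the fields used in this notebook from one reverse-geocoding response."""
    return {
        "displayName": payload.get("display_name", ""),
        # lat / lon are converted explicitly, as in the notebook's helper.
        "latitude": float(payload["lat"]),
        "longitude": float(payload["lon"]),
        # .get(...) returns None when a key is absent instead of raising KeyError.
        "class": payload.get("class"),
        "type": payload.get("type"),
    }


# Illustrative payload only; not a real API response.
sample_payload = {
    "display_name": "Example Street, Schiedam",
    "lat": "51.91816",
    "lon": "4.387831",
    "class": "man_made",
}

fields = extract_osm_fields(sample_payload)
print(fields)
```

Using `.get` for the optional keys means a point without a `type` still produces a row, which keeps the resulting dataframe aligned with the input points.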


In [4]:
def reverse_geocode_rotterdam(rotterdam: pd.DataFrame) -> pd.DataFrame:
    """
    Call OSM Nominatim to get displayName, class, type for each Rotterdam point.
    """
    rows = []
    for city, row in rotterdam.iterrows():
        print(f"Processing {city}...")
        params = {
            "lat": row.latitude,
            "lon": row.longitude,
            "format": "json",
            "addressdetails": 1,
        }
        headers = {"User-Agent": f"ldsa-geo2025/1.0 ({CONTACT_EMAIL})"}
        # A timeout prevents the loop from hanging forever on a stalled connection
        resp = requests.get(NOMINATIM_URL, params=params, headers=headers, timeout=30)
        resp.raise_for_status()
        data = resp.json()

        rows.append([
            city,
            data.get("display_name", ""),
            float(data["lat"]),
            float(data["lon"]),
            data.get("class", None),
            data.get("type", None),
        ])

    rotterdam = pd.DataFrame(
        rows,
        columns=["cityName", "displayName", "latitude", "longitude", "class", "type"],
    )
    return rotterdam

# INSTRUCTOR NOTE:
# In this notebook we use a local cache (INTERIM_CSV) if it exists.
# In the student setup, this file will *not* be present, so they will always call the API.
if INTERIM_CSV.exists():
    print("Loading precomputed Rotterdam dataframe from", INTERIM_CSV)
    rotterdam = pd.read_csv(INTERIM_CSV)
else:
    print("No precomputed CSV found – calling the API. This may take a while.")
    # Pass the Rotterdam subset (indexed by cityName), not the full NL dataframe
    rotterdam = reverse_geocode_rotterdam(nl_rotterdam)
    rotterdam.to_csv(INTERIM_CSV, index=False)

rotterdam.head()


Loading precomputed Rotterdam dataframe from C:\Users\KimPronk\PycharmProjects\LDSA\batch-instructors\S02 - Data Wrangling\HCKT02 - Data Wrangling\data\interim_result_steps-1to5.csv


Unnamed: 0,cityName,displayName,latitude,longitude,class,type
0,5254,"25, Diezerstraat, Binnenstad, Zwolle, Overijss...",52.512484,6.094462,place,house
1,5255,"65, Antoni van Leeuwenhoekstraat, Develpoort, ...",51.817471,4.633344,place,house
2,5256,"63, Polsbroek, Waterkwartier, De Hoven, Zutphe...",52.138315,6.201425,place,house
3,5257,"6, Dorpsstraat, Stadscentrum, Zoetermeer, Zuid...",52.057487,4.493211,place,house
4,5258,"36, Arnhemseweg, Lentemorgen Ⅰ, Molenwijk, Zev...",51.929841,6.070749,place,house


## Step 3 - Inspect OSM `type` and define "public" locations

The `rotterdam` dataframe now contains, for each point:

- `displayName` – a human-readable description from OpenStreetMap  
- `class` and `type` – categories assigned by OpenStreetMap  

Before moving on, we take a quick look at the different values of `type` that appear in the data.  
Based on a quick inspection, we will **treat everything as "public" except points of type `"residential"`**.

This is not used directly for the main ML task (which will predict lithology from the voxel model), but it is useful to:

- understand what kinds of locations we have, and  
- illustrate how you might define custom categories from raw OSM types.
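
As a small illustration of deriving a custom category from the raw OSM `type`, the sketch below adds a boolean flag instead of filtering rows. The column name `is_public` and the demo dataframe are assumptions for this example; the rule itself (everything except `"residential"`) is the one stated above.

```python
import pandas as pd

# Toy data using OSM type values that occur in this notebook.
demo = pd.DataFrame({"type": ["house", "residential", "parking"]})

# Flag every row whose type is not "residential" as public.
demo["is_public"] = demo["type"].ne("residential")

print(demo)
```

Keeping the flag as a column, rather than dropping rows straight away, makes it easy to count both groups with `demo["is_public"].value_counts()`.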


In [5]:
# Look at the different OSM "type" and "class" values we have
unique_types = rotterdam["type"].unique()
unique_classes = rotterdam["class"].unique()
print("Unique OSM types in Rotterdam:")
print(unique_types)

print("Unique OSM classes in Rotterdam:")
print(unique_classes)

# Define a "public" subset: everything except 'residential'
rotterdam_public = rotterdam[rotterdam["type"] != "residential"].copy()
print("\nNumber of 'public' locations (non-residential):", rotterdam_public.shape[0])
rotterdam_public.head()


Unique OSM types in Rotterdam:
['house' 'bridge' 'unclassified' 'parking' 'castle' 'garden' 'restaurant'
 'optician' 'fast_food' 'residential' 'place_of_worship' 'school'
 'construction' 'tertiary' 'tunnel' 'artwork' 'cycleway' 'hearing_aids'
 'secondary' 'bicycle_parking' 'defibrillator' 'vending_machine' 'cafe'
 'clothes' 'florist' 'pedestrian' 'picnic_table' 'pub' 'chemist'
 'camp_site' 'floorer' 'butcher' 'mall' 'footway' 'fire_station'
 'kindergarten' 'vacant' 'yes' 'university' 'newsagent' 'primary']
Unique OSM classes in Rotterdam:
['place' 'man_made' 'highway' 'amenity' 'historic' 'leisure' 'shop'
 'tourism' 'emergency' 'craft' 'building']

Number of 'public' locations (non-residential): 224


Unnamed: 0,cityName,displayName,latitude,longitude,class,type
0,5254,"25, Diezerstraat, Binnenstad, Zwolle, Overijss...",52.512484,6.094462,place,house
1,5255,"65, Antoni van Leeuwenhoekstraat, Develpoort, ...",51.817471,4.633344,place,house
2,5256,"63, Polsbroek, Waterkwartier, De Hoven, Zutphe...",52.138315,6.201425,place,house
3,5257,"6, Dorpsstraat, Stadscentrum, Zoetermeer, Zuid...",52.057487,4.493211,place,house
4,5258,"36, Arnhemseweg, Lentemorgen Ⅰ, Molenwijk, Zev...",51.929841,6.070749,place,house


## Step 4 - Attach RD coordinates and create the test dataset

To connect our surface points in Rotterdam to the subsurface voxel model, we need to work in the **Dutch RD coordinate system**:

- The file `mapping_geo_coordinates_to_RD_coordinates.json` contains a mapping from  
  `(latitude, longitude)` → `(RD_east, RD_north)`.

In this step we:

1. **Merge** the filtered `rotterdam_public` dataframe with `mapping_coordinates` on `(latitude, longitude)` to obtain RD coordinates for each point.  
2. **Check for missing RD coordinates** and see how many points (if any) do not have a match.  
3. **Drop rows with missing RD coordinates** to avoid problems later.  
4. Create a clean **hackathon test set** with:
   - an `ID` column (0..N-1)
   - only the columns we want students to see
5. Save this clean test set as `test_dataset.csv` in `DATA_DIR`.

This `test_dataset.csv` is the **hackathon test set**:
- It is the file that will be shipped to students.
- It is also the file we use for the instructor baseline and the leaderboard submissions.
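
One way to see how many points fail to find RD coordinates is to ask `merge` to report it. The sketch below uses toy frames (`points`, `lookup` are hypothetical names) with coordinates taken from this notebook's outputs, and the `indicator=True` option, which adds a `_merge` column. Note that joining on floating-point latitude/longitude only matches when the values are exactly equal in both tables.

```python
import pandas as pd

# Two points: the first has an entry in the lookup, the second does not.
points = pd.DataFrame({
    "latitude": [51.912476, 52.512484],
    "longitude": [4.341611, 6.094462],
})
lookup = pd.DataFrame({
    "latitude": [51.912476],
    "longitude": [4.341611],
    "RD_east": [83037.716304],
    "RD_north": [436409.155066],
})

merged = points.merge(lookup, on=["latitude", "longitude"], how="left", indicator=True)

# "left_only" marks points without a match; their RD columns are NaN.
n_unmatched = int((merged["_merge"] == "left_only").sum())
print("Unmatched points:", n_unmatched)
```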


In [6]:
# mapping_coordinates has: latitude, longitude, RD_east, RD_north
rotterdam_rd = rotterdam_public.merge(
    mapping_coordinates,
    on=["latitude", "longitude"],
    how="left",
)

print("Rotterdam 'public' points with RD coordinates (before cleaning):", rotterdam_rd.shape)
rotterdam_rd.head()


Rotterdam 'public' points with RD coordinates (before cleaning): (224, 8)


Unnamed: 0,cityName,displayName,latitude,longitude,class,type,RD_east,RD_north
0,5254,"25, Diezerstraat, Binnenstad, Zwolle, Overijss...",52.512484,6.094462,place,house,,
1,5255,"65, Antoni van Leeuwenhoekstraat, Develpoort, ...",51.817471,4.633344,place,house,,
2,5256,"63, Polsbroek, Waterkwartier, De Hoven, Zutphe...",52.138315,6.201425,place,house,,
3,5257,"6, Dorpsstraat, Stadscentrum, Zoetermeer, Zuid...",52.057487,4.493211,place,house,,
4,5258,"36, Arnhemseweg, Lentemorgen Ⅰ, Molenwijk, Zev...",51.929841,6.070749,place,house,,


#### Check for missing RD coordinates

In [7]:
# Summary counts
missing_east = rotterdam_rd["RD_east"].isna().sum()
missing_north = rotterdam_rd["RD_north"].isna().sum()

print("Missing RD_east:", missing_east)
print("Missing RD_north:", missing_north)

# Rows with ANY missing RD coordinate
missing_any_mask = rotterdam_rd[["RD_east", "RD_north"]].isna().any(axis=1)
missing_any = rotterdam_rd[missing_any_mask]

print("Rows with missing RD coords:", missing_any.shape[0])
missing_any.head()


Missing RD_east: 206
Missing RD_north: 206
Rows with missing RD coords: 206


Unnamed: 0,cityName,displayName,latitude,longitude,class,type,RD_east,RD_north
0,5254,"25, Diezerstraat, Binnenstad, Zwolle, Overijss...",52.512484,6.094462,place,house,,
1,5255,"65, Antoni van Leeuwenhoekstraat, Develpoort, ...",51.817471,4.633344,place,house,,
2,5256,"63, Polsbroek, Waterkwartier, De Hoven, Zutphe...",52.138315,6.201425,place,house,,
3,5257,"6, Dorpsstraat, Stadscentrum, Zoetermeer, Zuid...",52.057487,4.493211,place,house,,
4,5258,"36, Arnhemseweg, Lentemorgen Ⅰ, Molenwijk, Zev...",51.929841,6.070749,place,house,,


#### Drop rows with missing RD coords

In [8]:
# Drop rows where RD coordinates are missing
rotterdam_rd = rotterdam_rd.dropna(subset=["RD_east", "RD_north"]).copy()
rotterdam_rd.reset_index(drop=True, inplace=True)

print("\nAfter dropping rows with missing RD coords:", rotterdam_rd.shape)
rotterdam_rd.head()



After dropping rows with missing RD coords: (18, 8)


Unnamed: 0,cityName,displayName,latitude,longitude,class,type,RD_east,RD_north
0,5283,"152, Groen van Prinstererstraat, Hoogstad, Vla...",51.912476,4.341611,place,house,83037.716304,436409.155066
1,5307,"533, Andries van Bronckhorstlaan, Centrum, Spi...",51.84498,4.329021,place,house,82062.232477,428913.878985
2,5314,"van Lennepstraat, 's- Gravenzande, 's-Gravenza...",52.001555,4.165362,amenity,parking,71079.916971,446506.937484
3,5317,"Van Smaleveltstraat, Schiedam, Zuid-Holland, N...",51.91816,4.387831,man_made,tunnel,86226.007827,436996.718946
4,5318,"304, Ganzerikplein, Schiebroek, Hillegersberg-...",51.958796,4.471991,place,house,92071.84146,441440.635968


#### Save as test_dataset.csv 

In [9]:
# Create ID column 0..N-1
rotterdam_rd = rotterdam_rd.reset_index(drop=True).copy()
rotterdam_rd["ID"] = np.arange(len(rotterdam_rd))

# Columns we want to ship to students in test_dataset.csv
cols_for_students = [
    "ID",
    "cityName",
    "displayName",
    "latitude",
    "longitude",
    "class",
    "type",
    "RD_east",
    "RD_north",
]

test_dataset = rotterdam_rd[cols_for_students].copy()

test_path = DATA_DIR / "test_dataset.csv"
test_dataset.to_csv(test_path, index=False)
print(f"\nSaved student test dataset to: {test_path}")
test_dataset.head()

# Second export: same rows and IDs as test_dataset.csv, but WITHOUT the RD coordinates
cols_without_rd = [
    "ID",
    "cityName",
    "displayName",
    "latitude",
    "longitude",
    "class",
    "type",
]

# rotterdam_rd already carries the ID column created above
test_exclrd_dataset = rotterdam_rd[cols_without_rd].copy()

test_exclrd_path = DATA_DIR / "test_exclrd_dataset.csv"
test_exclrd_dataset.to_csv(test_exclrd_path, index=False)
print(f"\nSaved student test dataset to: {test_exclrd_path}")
test_exclrd_dataset.head()


Saved student test dataset to: C:\Users\KimPronk\PycharmProjects\LDSA\batch-instructors\S02 - Data Wrangling\HCKT02 - Data Wrangling\data\test_dataset.csv

Saved student test dataset to: C:\Users\KimPronk\PycharmProjects\LDSA\batch-instructors\S02 - Data Wrangling\HCKT02 - Data Wrangling\data\test_exclrd_dataset.csv


Unnamed: 0,ID,cityName,displayName,latitude,longitude,class,type
0,0,5283,"152, Groen van Prinstererstraat, Hoogstad, Vla...",51.912476,4.341611,place,house
1,1,5307,"533, Andries van Bronckhorstlaan, Centrum, Spi...",51.84498,4.329021,place,house
2,2,5314,"van Lennepstraat, 's- Gravenzande, 's-Gravenza...",52.001555,4.165362,amenity,parking
3,3,5317,"Van Smaleveltstraat, Schiedam, Zuid-Holland, N...",51.91816,4.387831,man_made,tunnel
4,4,5318,"304, Ganzerikplein, Schiebroek, Hillegersberg-...",51.958796,4.471991,place,house


## Step 5 - Build the training data from the voxel model

Now we switch to the **subsurface voxel model** stored in `geotop_south_holland.parquet`.

This dataset contains a 3D grid of points with at least the following columns:

- `x`, `y` – horizontal coordinates in the RD system  
- `z` – depth (in meters, negative below surface)  
- a lithology column (`lithoklasse` or `lithoclass`) that tells us the **soil/rock type**

For this hackathon, we train a model that predicts **lithology from coordinates** at a given depth:

- We choose a depth `DEPTH_CHOICE = -20.0` meters, with a tolerance of ± 0.5 m.  
- We select all voxels whose `z` is within that window:
  $|z - \text{DEPTH\_CHOICE}| \leq 0.5$
- We use:
  - Features: `x`, `y`  
  - Target: the lithology column (`lithoklasse` or `lithoclass`)

This will be our **training data**. Later, we will:

- Train a baseline classifier on this grid, and  
- Apply the trained model to the Rotterdam points in `test_dataset.csv`.
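
A quick worked example of the depth window, assuming voxel centres spaced 0.5 m apart as in the `voxels` preview (the `demo` frame itself is made up): with `DEPTH_CHOICE = -20.0` and a tolerance of 0.5 m, two layers fall inside the window.

```python
import pandas as pd

DEPTH_CHOICE = -20.0    # meters
DEPTH_TOLERANCE = 0.5   # meters

# Toy column of voxel centres around the chosen depth.
demo = pd.DataFrame({"z": [-19.25, -19.75, -20.25, -20.75]})

# |z - DEPTH_CHOICE| <= DEPTH_TOLERANCE
mask = demo["z"].sub(DEPTH_CHOICE).abs() <= DEPTH_TOLERANCE
selected = demo.loc[mask, "z"].tolist()

print(selected)
```

Because both -19.75 and -20.25 pass the filter, each `(x, y)` location can appear twice in the training slice, once per layer.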

> **Instructor note:**  
> We also use this `train_data` (and the full `voxels`) to:
> - Construct `ground_truth.csv` for the 18 test points  
> - Create a student-safe voxel file by removing the exact voxels used for ground truth.
>
> These steps will be marked as **INSTRUCTOR-ONLY** and are not part of the student notebook.


In [10]:
# Pick the lithology column name from voxels
if "lithoklasse" in voxels.columns:
    litho_col = "lithoklasse"
elif "lithoclass" in voxels.columns:
    litho_col = "lithoclass"
else:
    raise ValueError("Could not find a lithology column ('lithoklasse' or 'lithoclass') in voxels")

print("Using lithology column:", litho_col)

# Filter voxels at the chosen depth window DEPTH_CHOICE ± DEPTH_TOLERANCE
mask_depth = (voxels["z"].sub(DEPTH_CHOICE).abs() <= DEPTH_TOLERANCE)

train_data = voxels.loc[mask_depth, ["x", "y", "z", litho_col]].dropna(subset=[litho_col]).copy()

print("Training voxels at depth window:", train_data.shape)
train_data.head()


Using lithology column: lithoklasse
Training voxels at depth window: (8052, 4)


Unnamed: 0,x,y,z,lithoklasse
24,89550.0,432850.0,-19.75,zand_matig_grof
25,89550.0,432850.0,-20.25,zand_matig_grof
106,89550.0,432950.0,-19.75,zand_matig_grof
107,89550.0,432950.0,-20.25,zand_matig_grof
213,89550.0,433050.0,-19.75,zand_matig_grof


## Step 6 – Train a baseline classifier on the voxel grid

Now we train a simple baseline classifier to predict the lithology class from RD coordinates.

- Training data: `train_data` (voxel grid at depth `DEPTH_CHOICE ± DEPTH_TOLERANCE`)
- Features: `x`, `y`
- Target: `litho_col` (e.g. `lithoklasse`)
- Models we try:
  - Logistic Regression (with standardization)
  - Decision Tree
  - Random Forest

We will:
1. Split the voxel data into train/validation sets
2. Fit each model
3. Evaluate Macro-F1 on the validation set
4. Keep the best-performing model as our baseline
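
Since Macro-F1 drives the model choice, here is a minimal pure-Python sketch of what the metric measures: the per-class F1 scores averaged with equal weight, so rare classes count as much as common ones. The function `macro_f1` and the toy labels are illustrative; in the notebook itself the score comes from scikit-learn's `f1_score(..., average="macro")`.

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over all labels seen in y_true or y_pred."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        denom = 2 * tp + fp + fn
        # F1 = 2*TP / (2*TP + FP + FN); defined as 0.0 when the class never occurs.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


y_true = ["klei", "klei", "zand_fijn", "zand_grof"]
y_pred = ["klei", "zand_fijn", "zand_fijn", "zand_grof"]

# Per-class F1: klei = 2/3, zand_fijn = 2/3, zand_grof = 1, so the mean is 7/9.
score = macro_f1(y_true, y_pred)
print(round(score, 4))
```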


In [11]:

# Features and target from voxel grid
X = train_data[["x", "y"]].values
y = train_data[litho_col].values

# Train/validation split
X_train, X_val, y_train, y_val = train_test_split(
    X,
    y,
    test_size=0.2,
    random_state=RNG_SEED,
    stratify=y,
)

models = {
    "logreg": make_pipeline(
        StandardScaler(),
        # multi_class="auto" is the default and the argument is deprecated in recent scikit-learn
        LogisticRegression(max_iter=1000),
    ),
    "tree": DecisionTreeClassifier(random_state=RNG_SEED),
    "rf": RandomForestClassifier(
        n_estimators=200,
        random_state=RNG_SEED,
        n_jobs=-1,
    ),
}

results = {}
best_name = None
best_score = -1.0
best_model = None

for name, model in models.items():
    print(f"\nTraining model: {name}")
    model.fit(X_train, y_train)
    y_val_pred = model.predict(X_val)
    f1 = f1_score(y_val, y_val_pred, average="macro")
    results[name] = f1
    print(f"Macro-F1 ({name}): {f1:.4f}")
    
    if f1 > best_score:
        best_score = f1
        best_name = name
        best_model = model

print("\nValidation Macro-F1 scores:", results)
print(f"Best model: {best_name} with Macro-F1 = {best_score:.4f}")

# inspect detailed performance of best model
y_val_best = best_model.predict(X_val)
print("\nClassification report for best model:")
print(classification_report(y_val, y_val_best))



Training model: logreg
Macro-F1 (logreg): 0.2339

Training model: tree
Macro-F1 (tree): 0.7166

Training model: rf
Macro-F1 (rf): 0.7039

Validation Macro-F1 scores: {'logreg': np.float64(0.2339460799049931), 'tree': np.float64(0.716641575589384), 'rf': np.float64(0.7039381555243842)}
Best model: tree with Macro-F1 = 0.7166

Classification report for best model:
                 precision    recall  f1-score   support

           klei       0.69      0.58      0.63        19
    kleiig_zand       0.45      0.39      0.42        38
      zand_fijn       0.74      0.80      0.77       125
      zand_grof       0.85      0.87      0.86       522
zand_matig_grof       0.91      0.90      0.90       907

       accuracy                           0.86      1611
      macro avg       0.73      0.71      0.72      1611
   weighted avg       0.86      0.86      0.86      1611



## Step 7 – Predict lithology for the hackathon test set

We now use the best-performing baseline model (`best_model`) to predict the lithology
for the 18 Rotterdam locations in `test_dataset.csv`.

- Input features: `RD_east`, `RD_north`
- Output: `prediction` (lithology class)
- We save a `sample_submission.csv` with the required format:

`ID,prediction`
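
Before uploading, it can help to check that a file really has the `ID,prediction` layout. The sketch below is one possible check; `submission_problems` is a hypothetical helper and is not the portal's scoring logic.

```python
import pandas as pd


def submission_problems(submission: pd.DataFrame, expected_ids) -> list:
    """Return a list of human-readable problems; an empty list means the format looks fine."""
    if list(submission.columns) != ["ID", "prediction"]:
        return ["columns must be exactly ['ID', 'prediction']"]
    problems = []
    if sorted(submission["ID"].tolist()) != sorted(expected_ids):
        problems.append("IDs do not match the test set")
    if submission["prediction"].isna().any():
        problems.append("some predictions are missing")
    return problems


good = pd.DataFrame({"ID": [0, 1, 2], "prediction": ["zand_grof", "zand_matig_grof", "klei"]})
bad = pd.DataFrame({"ID": [0, 1], "label": ["zand_grof", "klei"]})

good_problems = submission_problems(good, expected_ids=[0, 1, 2])
bad_problems = submission_problems(bad, expected_ids=[0, 1, 2])
print(good_problems)
print(bad_problems)
```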


In [12]:
# Load the student-facing test dataset
test_dataset = pd.read_csv(TEST_CSV)
print("Test dataset shape:", test_dataset.shape)
display(test_dataset.head())

# Features for prediction: RD coordinates
X_test = test_dataset[["RD_east", "RD_north"]].values

# Predict lithology using the best baseline model
y_test_pred = best_model.predict(X_test)

# Build sample submission
sample_submission = pd.DataFrame({
    "ID": test_dataset["ID"],
    "prediction": y_test_pred,
})

sample_sub_path = DATA_DIR / "sample_submission.csv"
sample_submission.to_csv(sample_sub_path, index=False)

print("\nSaved sample submission to:", sample_sub_path)
sample_submission.head()


Test dataset shape: (18, 9)


Unnamed: 0,ID,cityName,displayName,latitude,longitude,class,type,RD_east,RD_north
0,0,5283,"152, Groen van Prinstererstraat, Hoogstad, Vla...",51.912476,4.341611,place,house,83037.716304,436409.155066
1,1,5307,"533, Andries van Bronckhorstlaan, Centrum, Spi...",51.84498,4.329021,place,house,82062.232477,428913.878985
2,2,5314,"van Lennepstraat, 's- Gravenzande, 's-Gravenza...",52.001555,4.165362,amenity,parking,71079.916971,446506.937484
3,3,5317,"Van Smaleveltstraat, Schiedam, Zuid-Holland, N...",51.91816,4.387831,man_made,tunnel,86226.007827,436996.718946
4,4,5318,"304, Ganzerikplein, Schiebroek, Hillegersberg-...",51.958796,4.471991,place,house,92071.84146,441440.635968



Saved sample submission to: C:\Users\KimPronk\PycharmProjects\LDSA\batch-instructors\S02 - Data Wrangling\HCKT02 - Data Wrangling\data\sample_submission.csv


Unnamed: 0,ID,prediction
0,0,zand_grof
1,1,zand_matig_grof
2,2,klei
3,3,zand_fijn
4,4,zand_grof


---

## INSTRUCTOR-ONLY: Build ground_truth.csv for the leaderboard

> This section is **instructor-only**.  
> Do **not** include this in the student notebook or in the student data package.

We define the ground-truth lithology for each test point as:

- The `lithoklasse` (or `lithoclass`) of the **nearest voxel** (in x, y)
- At depth `DEPTH_CHOICE ± DEPTH_TOLERANCE`, using the `train_data` slice

Steps:

1. Load `test_dataset.csv` (to get the IDs and RD coordinates)  
2. Build a KDTree on the `(x, y)` coordinates in `train_data`  
3. For each test point `(RD_east, RD_north)`, find the nearest voxel  
4. Assign its lithology as the **true label**  
5. Save `ground_truth.csv` with:

   - `ID`
   - `prediction` (true lithology)

This file will be used by the hackathon portal as `y_true` in `score.py`.
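
To illustrate the nearest-voxel rule on a scale that can be checked by hand, the sketch below does the lookup by brute force with NumPy instead of a KDTree. It is a swapped-in technique for illustration only: all coordinates and labels are toy values, and for a single nearest neighbour the result should agree with `KDTree.query(..., k=1)` whenever the nearest voxel is unique.

```python
import numpy as np

# Toy voxel centres (x, y) on a 100 m grid, with illustrative labels.
voxel_xy = np.array([
    [89550.0, 432850.0],
    [89550.0, 432950.0],
    [89650.0, 432850.0],
])
voxel_labels = np.array(["zand_matig_grof", "zand_fijn", "klei"])

# One toy test point in RD coordinates.
test_xy = np.array([[89560.0, 432940.0]])

# Pairwise Euclidean distances, shape (n_test, n_voxels).
diffs = test_xy[:, None, :] - voxel_xy[None, :, :]
dist = np.sqrt((diffs ** 2).sum(axis=2))

# Index and label of the closest voxel for each test point.
nearest_idx = dist.argmin(axis=1)
nearest_label = voxel_labels[nearest_idx]

print(nearest_idx, nearest_label)
```

Here the second voxel wins because it is about 14 m away, compared with roughly 91 m and 127 m for the others.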


In [13]:
# INSTRUCTOR-ONLY CELL – do not ship to students

# 1) Reload the test set to be safe
test_df = pd.read_csv(TEST_CSV)
print("Test dataset shape:", test_df.shape)
display(test_df.head())

# 2) Build KDTree on voxel coordinates at chosen depth
vox_coords = train_data[["x", "y"]].values
tree = KDTree(vox_coords)

# 3) Query nearest voxel for each test point
test_coords = test_df[["RD_east", "RD_north"]].values
distances, indices = tree.query(test_coords, k=1)  # nearest neighbor

# 4) Get lithology of nearest voxel
nearest_lithology = train_data.iloc[indices[:, 0]][litho_col].values

# 5) Build ground truth dataframe
ground_truth = pd.DataFrame({
    "ID": test_df["ID"],
    "prediction": nearest_lithology,
})

gt_path = DATA_DIR / "ground_truth.csv"
ground_truth.to_csv(gt_path, index=False)

print("\nSaved ground truth to:", gt_path)
display(ground_truth.head())

print("\nClass distribution in ground truth:")
print(ground_truth["prediction"].value_counts(normalize=True))



Test dataset shape: (18, 9)


Unnamed: 0,ID,cityName,displayName,latitude,longitude,class,type,RD_east,RD_north
0,0,5283,"152, Groen van Prinstererstraat, Hoogstad, Vla...",51.912476,4.341611,place,house,83037.716304,436409.155066
1,1,5307,"533, Andries van Bronckhorstlaan, Centrum, Spi...",51.84498,4.329021,place,house,82062.232477,428913.878985
2,2,5314,"van Lennepstraat, 's- Gravenzande, 's-Gravenza...",52.001555,4.165362,amenity,parking,71079.916971,446506.937484
3,3,5317,"Van Smaleveltstraat, Schiedam, Zuid-Holland, N...",51.91816,4.387831,man_made,tunnel,86226.007827,436996.718946
4,4,5318,"304, Ganzerikplein, Schiebroek, Hillegersberg-...",51.958796,4.471991,place,house,92071.84146,441440.635968



Saved ground truth to: C:\Users\KimPronk\PycharmProjects\LDSA\batch-instructors\S02 - Data Wrangling\HCKT02 - Data Wrangling\data\ground_truth.csv


Unnamed: 0,ID,prediction
0,0,zand_grof
1,1,zand_matig_grof
2,2,zand_fijn
3,3,zand_fijn
4,4,zand_fijn



Class distribution in ground truth:
prediction
zand_fijn          0.444444
zand_grof          0.277778
zand_matig_grof    0.277778
Name: proportion, dtype: float64


---

## INSTRUCTOR-ONLY: Create student-safe voxel file

> ⚠️ Instructor-only. Do **not** include this cell in the student notebook.

To prevent students from trivially reconstructing `ground_truth.csv` by repeating the exact nearest-voxel lookup, we:

1. Collect the voxel coordinates `(x, y, z)` that were used as nearest neighbors for the 18 test points.
2. Remove those voxels from the full `voxels` dataframe.
3. Save the pruned version as `geotop_south_holland_student.parquet`.

The student data package will contain **only** this pruned voxel file, not the full original.
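
The pruning step is an anti-join on `(x, y, z)`. As a minimal sketch with toy data (the frames `voxels_demo` and `hidden_demo` are hypothetical), the same effect can be obtained by building a set of coordinate tuples and keeping only rows whose key is not in it:

```python
import pandas as pd

voxels_demo = pd.DataFrame({
    "x": [1.0, 1.0, 2.0],
    "y": [5.0, 5.0, 5.0],
    "z": [-19.75, -20.25, -19.75],
    "lithoklasse": ["klei", "klei", "zand_fijn"],
})
hidden_demo = pd.DataFrame({"x": [1.0], "y": [5.0], "z": [-20.25]})

# Keys of the voxels that must be hidden from students.
hidden_keys = set(hidden_demo[["x", "y", "z"]].itertuples(index=False, name=None))

# Keep a row only when its (x, y, z) key is not a hidden key.
keep = [
    key not in hidden_keys
    for key in voxels_demo[["x", "y", "z"]].itertuples(index=False, name=None)
]
student_demo = voxels_demo[keep]

print(student_demo)
```

In this toy example the voxel at the same `(x, y)` but in the neighbouring layer (`z = -19.75`) is kept, because only the exact `(x, y, z)` match is removed.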


In [14]:
# INSTRUCTOR-ONLY CELL – do not ship to students

# 1) Get unique voxels that served as nearest neighbors (truth providers)
hidden_voxels = train_data.iloc[indices[:, 0]][["x", "y", "z"]].drop_duplicates()
print("Number of unique voxels used for ground truth:", hidden_voxels.shape[0])

hidden_voxels["to_drop"] = True

# 2) Mark voxels to drop in the full voxel grid
voxels_with_flag = voxels.merge(
    hidden_voxels,
    on=["x", "y", "z"],
    how="left",
)

print("Original voxels shape:", voxels_with_flag.shape)

# 3) Keep only voxels not used as ground truth
voxels_student = voxels_with_flag[voxels_with_flag["to_drop"].isna()].drop(columns=["to_drop"])

print("Student voxel model shape:", voxels_student.shape)

# 4) Save student-safe version
student_voxels_path = DATA_DIR / "geotop_south_holland_student.parquet"
voxels_student.to_parquet(student_voxels_path, engine="pyarrow")

print("\nSaved student voxel model to:", student_voxels_path)


Number of unique voxels used for ground truth: 14
Original voxels shape: (391956, 15)
Student voxel model shape: (391942, 14)

Saved student voxel model to: C:\Users\KimPronk\PycharmProjects\LDSA\batch-instructors\S02 - Data Wrangling\HCKT02 - Data Wrangling\data\geotop_south_holland_student.parquet


In [15]:
# INSTRUCTOR-ONLY CELL – REMOVE BEFORE SHARING WITH STUDENTS

# 1) Load the test dataset (Rotterdam points with RD coords)
test_df = pd.read_csv(TEST_CSV)
print("Test dataset shape:", test_df.shape)
display(test_df.head())

# Ensure we have an ID column; if not, create one and RESAVE test_dataset.csv
if "ID" not in test_df.columns:
    print("\nNo 'ID' column found – creating integer IDs 0..N-1 and updating test_dataset.csv")
    test_df["ID"] = np.arange(len(test_df))
    # overwrite the test_dataset.csv so students get the same IDs
    test_df.to_csv(TEST_CSV, index=False)
    print("Updated test_dataset.csv with new ID column.")

# Now we require ID + RD_east + RD_north
required_cols = {"ID", "RD_east", "RD_north"}
if not required_cols.issubset(test_df.columns):
    raise ValueError(
        f"test_dataset.csv must contain columns {required_cols}, "
        f"but found {list(test_df.columns)}"
    )

# 2) Coordinates of voxels at the chosen depth window
#    (we already built train_data earlier from voxels at DEPTH_CHOICE ± DEPTH_TOLERANCE)
vox_coords = train_data[["x", "y"]].values
print("\nNumber of voxels used for ground truth:", vox_coords.shape[0])

# 3) Build KDTree on voxel coordinates
tree = KDTree(vox_coords)

# 4) Coordinates of test (Rotterdam) points
test_coords = test_df[["RD_east", "RD_north"]].values

# 5) For each test point, find the nearest voxel
distances, indices = tree.query(test_coords, k=1)  # nearest neighbor

# 6) Get the lithology of the nearest voxel for each test point
nearest_lithology = train_data.iloc[indices[:, 0]][litho_col].values

# 7) Build ground truth dataframe with same label column name as submissions: 'prediction'
ground_truth = pd.DataFrame({
    "ID": test_df["ID"],
    "prediction": nearest_lithology,
})

gt_path = DATA_DIR / "ground_truth.csv"
ground_truth.to_csv(gt_path, index=False)

print("\nSaved ground truth to:", gt_path)
display(ground_truth)

print("\nClass distribution in ground truth:")
print(ground_truth["prediction"].value_counts(normalize=True))


Test dataset shape: (18, 9)


Unnamed: 0,ID,cityName,displayName,latitude,longitude,class,type,RD_east,RD_north
0,0,5283,"152, Groen van Prinstererstraat, Hoogstad, Vla...",51.912476,4.341611,place,house,83037.716304,436409.155066
1,1,5307,"533, Andries van Bronckhorstlaan, Centrum, Spi...",51.84498,4.329021,place,house,82062.232477,428913.878985
2,2,5314,"van Lennepstraat, 's- Gravenzande, 's-Gravenza...",52.001555,4.165362,amenity,parking,71079.916971,446506.937484
3,3,5317,"Van Smaleveltstraat, Schiedam, Zuid-Holland, N...",51.91816,4.387831,man_made,tunnel,86226.007827,436996.718946
4,4,5318,"304, Ganzerikplein, Schiebroek, Hillegersberg-...",51.958796,4.471991,place,house,92071.84146,441440.635968



Number of voxels used for ground truth: 8052

Saved ground truth to: C:\Users\KimPronk\PycharmProjects\LDSA\batch-instructors\S02 - Data Wrangling\HCKT02 - Data Wrangling\data\ground_truth.csv


Unnamed: 0,ID,prediction
0,0,zand_grof
1,1,zand_matig_grof
2,2,zand_fijn
3,3,zand_fijn
4,4,zand_fijn
5,5,zand_matig_grof
6,6,zand_matig_grof
7,7,zand_fijn
8,8,zand_fijn
9,9,zand_fijn



Class distribution in ground truth:
prediction
zand_fijn          0.444444
zand_grof          0.277778
zand_matig_grof    0.277778
Name: proportion, dtype: float64
