# Step 3 – Cleaning & Standardization (Silver)

Creates a standardized staging table `stg_registrations` from `raw_registrations`.

Key steps:
- Ensure numeric measures are valid (`Count`, and optionally `ZS Anzahl`)
- Standardize text fields (trim surrounding whitespace)
- Remove invalid rows (e.g., null keys, negative counts)
- Create derived date fields (year, month)
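
The notebook performs these steps in SQL further down. As a rough illustration of the same ideas on a single record, here is a minimal plain-Python sketch. The helper names (`normalize_text`, `to_int_or_none`, `standardize_row`), the dictionary layout, and the sample values are assumptions made for this example and are not part of the notebook. Collapsing internal whitespace goes slightly beyond the `TRIM` used in the SQL.

```python
import re
from datetime import date

def normalize_text(value):
    """Trim and collapse runs of whitespace; return None for missing values."""
    if value is None:
        return None
    return re.sub(r"\s+", " ", value).strip()

def to_int_or_none(value):
    """Parse an integer, returning None instead of raising on bad input."""
    try:
        return int(str(value).strip())
    except (TypeError, ValueError):
        return None

def standardize_row(row):
    """Return a cleaned copy of one raw record (hypothetical dict layout)."""
    report_date = row["Report_date"]
    return {
        "report_date": report_date,
        "report_year": report_date.year if report_date else None,
        "report_month": report_date.month if report_date else None,
        "manufacturer": normalize_text(row["Manufacturer"]),
        "state": normalize_text(row["State"]),
        "registrations_count": to_int_or_none(row["Count"]),
    }

# Invented sample record, for illustration only
example = standardize_row({
    "Report_date": date(2023, 1, 1),
    "Manufacturer": "  Example   Motors ",
    "State": " Example State ",
    "Count": "42",
})
print(example)
```

Returning `None` for unparseable counts, rather than raising, lets a later filter step decide what to do with those rows.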


2) Connect to DuckDB

In [12]:
import duckdb
from pathlib import Path

DB_PATH = Path("../data/duckdb/motorcycle.db")

# Open a single writable connection
con = duckdb.connect(str(DB_PATH), read_only=False)

# Sanity check
con.execute("SELECT COUNT(*) AS rows FROM raw_registrations").fetchdf()


     rows
0  128719
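
A writable `duckdb.connect` call creates a new, empty database file when the path does not exist, so a typo in `DB_PATH` leaves a stray file behind and only surfaces later as a missing-table error. A small guard makes that failure explicit before connecting. This is a minimal sketch; `require_existing_db` is a hypothetical helper, not part of the notebook.

```python
from pathlib import Path

def require_existing_db(path):
    """Return the path as a Path object, or raise if the file is missing."""
    db_path = Path(path)
    if not db_path.is_file():
        raise FileNotFoundError(f"DuckDB file not found: {db_path.resolve()}")
    return db_path
```

It could then wrap the path passed to the connection, for example `duckdb.connect(str(require_existing_db(DB_PATH)), read_only=False)`.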


3) Inspect ZS Anzahl

In [13]:
con.execute("""
SELECT "ZS Anzahl", COUNT(*) AS n
FROM raw_registrations
GROUP BY 1
ORDER BY n DESC
LIMIT 25
""").fetchdf()


   ZS Anzahl       n
0             128719


In [14]:
con.execute("""
SELECT
  COUNT(*) AS rows,
  SUM(CASE WHEN regexp_matches(TRIM(COALESCE("ZS Anzahl", '')), '^[0-9]+$') THEN 1 ELSE 0 END) AS numeric_like,
  SUM(CASE WHEN "ZS Anzahl" IS NULL OR TRIM("ZS Anzahl") = '' THEN 1 ELSE 0 END) AS null_or_blank
FROM raw_registrations
""").fetchdf()


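     rows  numeric_like  null_or_blank
0  128719           0.0       128719.0

Every row has a null or blank `ZS Anzahl`, so the column carries no usable values and is not carried into `stg_registrations`.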
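
The same profiling rules can be expressed in plain Python, which is handy for unit-testing the classification before running it against the full table. This is a minimal sketch; `profile_values` is a hypothetical helper and the sample list is invented for illustration.

```python
import re

# Digits only, matching the '^[0-9]+$' pattern used in the SQL
NUMERIC_RE = re.compile(r"[0-9]+")

def profile_values(values):
    """Count rows, digit-only values, and null-or-blank values."""
    stats = {"rows": 0, "numeric_like": 0, "null_or_blank": 0}
    for value in values:
        stats["rows"] += 1
        text = (value or "").strip()  # treat None like an empty string
        if NUMERIC_RE.fullmatch(text):
            stats["numeric_like"] += 1
        if text == "":
            stats["null_or_blank"] += 1
    return stats

print(profile_values(["12", " 7 ", "", None, "n/a"]))
# {'rows': 5, 'numeric_like': 2, 'null_or_blank': 2}
```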


4) Create Silver tables (staging + clean)

In [15]:
con.execute("DROP TABLE IF EXISTS stg_registrations")

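con.execute("""
CREATE TABLE stg_registrations AS
SELECT
  Report_date,
  -- Derived date parts
  EXTRACT(year FROM Report_date) AS report_year,
  EXTRACT(month FROM Report_date) AS report_month,

  -- Text fields with surrounding whitespace removed
  TRIM(Manufacturer) AS manufacturer,
  TRIM(Trade_name) AS trade_name,
  TRIM(Type_key) AS type_key,
  TRIM(State) AS state,

  -- TRY_CAST yields NULL for non-numeric values instead of aborting the query;
  -- NULL counts are filtered out when stg_registrations_clean is built
  TRY_CAST(Count AS BIGINT) AS registrations_count,
  Object_Id
FROM raw_registrations
""")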


<_duckdb.DuckDBPyConnection at 0x117b94e70>

In [16]:
con.execute("DROP TABLE IF EXISTS stg_registrations_clean")

con.execute("""
CREATE TABLE stg_registrations_clean AS
SELECT *
FROM stg_registrations
WHERE
  Report_date IS NOT NULL
  AND state IS NOT NULL AND state <> ''
  AND manufacturer IS NOT NULL AND manufacturer <> ''
  AND registrations_count IS NOT NULL
  AND registrations_count >= 0
""")


<_duckdb.DuckDBPyConnection at 0x117b94e70>
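
The `WHERE` clause amounts to a per-row validity predicate. Writing it as a small Python function makes each rule easy to test in isolation. This is a minimal sketch: the name `is_valid_row`, the dictionary layout, and the sample rows are assumptions for this example, while the column names follow `stg_registrations`.

```python
def is_valid_row(row):
    """Mirror the stg_registrations_clean filter for a single record."""
    count = row.get("registrations_count")
    return (
        row.get("Report_date") is not None
        and bool(row.get("state"))         # rejects None and ''
        and bool(row.get("manufacturer"))  # rejects None and ''
        and count is not None
        and count >= 0
    )

# Invented sample rows, for illustration only
rows = [
    {"Report_date": "2023-01-01", "state": "A", "manufacturer": "X", "registrations_count": 3},
    {"Report_date": "2023-01-01", "state": "", "manufacturer": "X", "registrations_count": 3},
    {"Report_date": "2023-01-01", "state": "A", "manufacturer": "X", "registrations_count": -1},
    {"Report_date": None, "state": "A", "manufacturer": "X", "registrations_count": 0},
]
clean = [row for row in rows if is_valid_row(row)]
print(len(clean))  # 1
```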

5) Quality checks

In [17]:
con.execute("""
SELECT
  (SELECT COUNT(*) FROM stg_registrations) AS stg_rows,
  (SELECT COUNT(*) FROM stg_registrations_clean) AS clean_rows,
  (SELECT MIN(Report_date) FROM stg_registrations_clean) AS min_date,
  (SELECT MAX(Report_date) FROM stg_registrations_clean) AS max_date
""").fetchdf()


   stg_rows  clean_rows    min_date    max_date
0    128719      128719  2023-01-01  2025-01-01
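
The staging and clean tables have the same row count here, so the filter removed nothing. For repeated runs it can help to turn that comparison into an explicit check. This is a minimal sketch: `check_row_loss` and its `max_loss_ratio` tolerance are hypothetical and not part of the notebook.

```python
def check_row_loss(stg_rows, clean_rows, max_loss_ratio=0.01):
    """Return (dropped, ratio); raise if too many rows were filtered out."""
    if clean_rows > stg_rows:
        raise ValueError("clean table has more rows than staging table")
    dropped = stg_rows - clean_rows
    ratio = dropped / stg_rows if stg_rows else 0.0
    if ratio > max_loss_ratio:
        raise ValueError(f"{dropped} rows dropped ({ratio:.2%}), above tolerance")
    return dropped, ratio

# Row counts reported by the quality check above
print(check_row_loss(128719, 128719))  # (0, 0.0)
```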


In [18]:
con.execute("""
SELECT
  COUNT(*) AS rows,
  COUNT(DISTINCT state) AS distinct_states,
  COUNT(DISTINCT manufacturer) AS distinct_manufacturers,
  COUNT(DISTINCT trade_name) AS distinct_trade_names,
  COUNT(DISTINCT type_key) AS distinct_type_keys
FROM stg_registrations_clean
""").fetchdf()


     rows  distinct_states  distinct_manufacturers  distinct_trade_names  distinct_type_keys
0  128719               17                      83                  2162                 747
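
Distinct counts can be inflated by labels that differ only in case or internal spacing, which `TRIM` alone does not reconcile. One way to surface such near-duplicates is to group labels by a normalized key. This is a minimal sketch: `find_label_variants` is a hypothetical helper and the sample labels are invented. Whether the actual data contains such variants cannot be told from the output above.

```python
from collections import defaultdict

def find_label_variants(labels):
    """Group labels that are equal after case-folding and whitespace collapse."""
    groups = defaultdict(set)
    for label in labels:
        key = " ".join(label.split()).casefold()
        groups[key].add(label)
    # Keep only keys that map to more than one spelling
    return {key: sorted(variants) for key, variants in groups.items() if len(variants) > 1}

# Invented sample labels, for illustration only
print(find_label_variants(["HONDA", "Honda", "Yamaha", "KTM", "Ktm"]))
# {'honda': ['HONDA', 'Honda'], 'ktm': ['KTM', 'Ktm']}
```

In practice the input could be the distinct values of `manufacturer` or `trade_name` fetched from `stg_registrations_clean`.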


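In [19]:
# Rows per report date (top 20). No df_raw DataFrame is defined in this
# notebook, so this queries the table directly, assuming df_raw was meant
# to hold raw_registrations. Run it before closing the connection.
con.execute("""
SELECT Report_date, COUNT(*) AS n
FROM raw_registrations
GROUP BY 1
ORDER BY n DESC
LIMIT 20
""").fetchdf()


In [20]:
con.close()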