# Step 1 – Data Audit & Profiling

This notebook performs an initial audit of the German motorcycle registration dataset.
The goal is to understand:
- Schema and data types
- Grain of the data
- Missing values and duplicates
- Time coverage

No transformations are applied in this step.


1️⃣ Install dependencies

In [12]:
%pip install pandas numpy matplotlib duckdb pyarrow


Note: you may need to restart the kernel to use updated packages.


2️⃣ Imports

In [1]:
import pandas as pd
import numpy as np
import duckdb
import matplotlib.pyplot as plt
from pathlib import Path

print("All imports OK ✅")


All imports OK ✅


3️⃣ Load the dataset

In [2]:
DATA_PATH = Path("../data/raw")
# Sort so that csv_files[0] stays deterministic if more files are added.
csv_files = sorted(DATA_PATH.glob("*.csv"))

csv_files

[PosixPath('../data/raw/SP_Hersteller_Handelsnamen_Krad_f73ec_-2904598668132282746.csv')]

In [3]:
CSV_FILE = csv_files[0]
df = pd.read_csv(CSV_FILE)

df.shape


(128719, 8)

4️⃣ Preview the data

In [4]:
df.head(5)


   Report_date     Manufacturer  Trade_name  Type_key                State  Count  ZS Anzahl  Object_Id
0   01.01.2023  AEON MOTOR (RC)         NaN       AAB   Schleswig-Holstein      7        NaN          1
1   01.01.2023  AEON MOTOR (RC)         NaN       AAB        Niedersachsen     23        NaN          2
2   01.01.2023  AEON MOTOR (RC)         NaN       AAB               Bremen      1        NaN          3
3   01.01.2023  AEON MOTOR (RC)         NaN       AAB  Nordrhein-Westfalen      6        NaN          4
4   01.01.2023  AEON MOTOR (RC)         NaN       AAB               Hessen     12        NaN          5


5️⃣ Column names & schema

In [5]:
df.columns


Index(['Report_date', 'Manufacturer', 'Trade_name', 'Type_key', 'State',
       'Count', 'ZS Anzahl', 'Object_Id'],
      dtype='object')

In [6]:
df.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 128719 entries, 0 to 128718
Data columns (total 8 columns):
 #   Column        Non-Null Count   Dtype  
---  ------        --------------   -----  
 0   Report_date   128719 non-null  object 
 1   Manufacturer  128719 non-null  object 
 2   Trade_name    128430 non-null  object 
 3   Type_key      128719 non-null  object 
 4   State         128719 non-null  object 
 5   Count         128719 non-null  int64  
 6   ZS Anzahl     0 non-null       float64
 7   Object_Id     128719 non-null  int64  
dtypes: float64(1), int64(2), object(5)
memory usage: 7.9+ MB


6️⃣ Missing values

In [7]:
missing = df.isna().sum().sort_values(ascending=False)
missing[missing > 0]


ZS Anzahl     128719
Trade_name       289
dtype: int64
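
The raw counts above are easier to judge as a share of all rows. The following is a minimal sketch on a small made-up frame standing in for `df`; only the column names mirror the notebook, and the values are illustrative rather than taken from the dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for df; values are invented for illustration.
demo = pd.DataFrame({
    "Trade_name": ["A", None, "C", "D"],
    "ZS Anzahl": [np.nan, np.nan, np.nan, np.nan],
    "Count": [7, 23, 1, 6],
})

# Share of missing values per column, in percent.
missing_pct = demo.isna().mean().mul(100).round(2)

# Columns without a single usable value are candidates for dropping
# in a later step (nothing is dropped here).
empty_cols = missing_pct[missing_pct == 100].index.tolist()

print(missing_pct[missing_pct > 0].sort_values(ascending=False))
print("Entirely empty:", empty_cols)
```

Run against the real `df`, the same two lines would report `ZS Anzahl` as entirely empty, since it has 0 non-null values out of 128719 rows.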

7️⃣ Duplicate rows

In [8]:
df.duplicated().sum()


np.int64(0)

8️⃣ Cardinality check

In [9]:
df.nunique().sort_values(ascending=False)


Object_Id       128719
Count             2210
Trade_name        2162
Type_key           747
Manufacturer        83
State               17
Report_date          3
ZS Anzahl            0
dtype: int64
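
`Object_Id` is unique per row, but it looks like a surrogate key and says little about the grain. One way to probe the grain is to test whether a combination of descriptive columns identifies each row. The sketch below assumes the candidate key is every column except `Count`, `ZS Anzahl`, and `Object_Id`; that choice is a guess to be verified, and the sample rows are invented.

```python
import pandas as pd

# Hypothetical sample; the real check would run on df instead.
demo = pd.DataFrame({
    "Report_date": ["01.01.2023", "01.01.2023", "01.01.2024"],
    "Manufacturer": ["AEON MOTOR (RC)"] * 3,
    "Trade_name": [None, None, None],
    "Type_key": ["AAB", "AAB", "AAB"],
    "State": ["Bremen", "Hessen", "Bremen"],
    "Count": [1, 12, 2],
})

# Assumed candidate key: the descriptive columns, without the measure.
key_cols = ["Report_date", "Manufacturer", "Trade_name", "Type_key", "State"]

# duplicated() treats missing values as equal, so rows with an empty
# Trade_name are still compared on the remaining key columns.
n_key_dupes = int(demo.duplicated(subset=key_cols).sum())
is_grain = n_key_dupes == 0

print(f"Rows sharing a key: {n_key_dupes}; key is unique: {is_grain}")
```

If the count is zero on the full dataset, one row per report date, manufacturer, trade name, type key, and state is a plausible description of the grain.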

9️⃣ Detect date/time column

In [10]:
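date_like = []

# Flag columns whose first 50 non-null values mostly look like dates.
# The pattern is year-first (e.g. 2023-01 or 2023.01), so day-first
# strings such as 01.01.2023 in Report_date are not matched.
for c in df.columns:
    sample = df[c].dropna().astype(str).head(50)
    if len(sample) and sample.str.contains(r"\d{4}[-./]\d{2}").mean() > 0.6:
        date_like.append(c)

date_like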


[]
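
The empty result does not mean the dataset has no date column: `Report_date` holds day-first strings such as `01.01.2023`, which a year-first pattern cannot match. A minimal sketch of how time coverage could be measured instead, assuming the `DD.MM.YYYY` layout holds for every row (the sample values below are invented, including one deliberately bad entry):

```python
import pandas as pd

# Hypothetical sample in the same DD.MM.YYYY layout as Report_date.
raw_dates = pd.Series(["01.01.2023", "01.01.2024", "01.01.2025", "not a date"])

# errors="coerce" turns unparseable strings into NaT instead of raising.
parsed = pd.to_datetime(raw_dates, format="%d.%m.%Y", errors="coerce")

parse_rate = parsed.notna().mean()        # share of values that parsed
coverage = (parsed.min(), parsed.max())   # NaT is ignored by min/max

print(f"Parsed {parse_rate:.0%} of values")
print("Coverage:", coverage[0].date(), "to", coverage[1].date())
```

Parsing into a separate variable keeps `df` untouched, in line with the no-transformation rule of this step.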

1️⃣0️⃣ Basic stats for numeric fields

In [11]:
df.describe(include="number").T


              count        mean           std  min      25%      50%      75%       max
Count      128719.0  116.260428    845.897952  1.0      7.0     22.0     69.0   56232.0
ZS Anzahl       0.0         NaN           NaN  NaN      NaN      NaN      NaN       NaN
Object_Id  128719.0     64360.0  37158.118987  1.0  32180.5  64360.0  96539.5  128719.0


1️⃣1️⃣ Set up the DuckDB path & connection

In [14]:
import pandas as pd
import duckdb
from pathlib import Path

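PROJECT_ROOT = Path("..")
RAW_PATH = PROJECT_ROOT / "data" / "raw"
DB_PATH = PROJECT_ROOT / "data" / "duckdb" / "motorcycle.db"

# duckdb.connect() does not create missing parent directories.
DB_PATH.parent.mkdir(parents=True, exist_ok=True)

# Sort so that the first match is deterministic.
csv_files = sorted(RAW_PATH.glob("*.csv"))
CSV_FILE = csv_files[0]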

CSV_FILE, DB_PATH


(PosixPath('../data/raw/SP_Hersteller_Handelsnamen_Krad_f73ec_-2904598668132282746.csv'),
 PosixPath('../data/duckdb/motorcycle.db'))

In [15]:
con = duckdb.connect(str(DB_PATH))

In [16]:
con.close()