# 03 – Data Cleaning (Library Borrowings)

**Purpose:** Create an analysis-ready dataset by applying a small set of clearly defined cleaning rules based on the sanity-check findings.

This notebook:
- loads the merged dataset,
- parses key columns (timestamps, numeric fields),
- removes records with critical inconsistencies,
- saves a cleaned dataset for EDA.

**Not included:**
- exploratory analysis or visualizations beyond basic before/after counts


In [8]:
from pathlib import Path
import numpy as np
import pandas as pd

pd.set_option("display.max_columns", 200)
pd.set_option("display.width", 140)
pd.set_option("display.float_format", lambda x: f"{x:,.4f}")

In [9]:
# --- load data ---
PROCESSED_DATA_PATH = Path('../data/processed')
DATA_FILE = PROCESSED_DATA_PATH / "borrowings_2019_2025.csv"
OUTPUT_FILE = PROCESSED_DATA_PATH / "borrowings_2019_2025_cleaned.csv"

borrowings = pd.read_csv(
    DATA_FILE,
    sep=";",
    quotechar='"',
    encoding="utf-8"
)

print("Loaded shape:", borrowings.shape)

Loaded shape: (2407610, 16)


In [10]:
# --- preprocess relevant columns ---

# column names
ISSUE_COL = "Ausleihdatum/Uhrzeit"
RETURN_COL = "Rückgabedatum/Uhrzeit"

DURATION_COL = "Leihdauer"
EXT_COL = "Anzahl_Verlängerungen"
LATE_FLAG_COL = "Verspätet"
LATE_DAYS_COL = "Tage_zu_spät"

ID_COL = "issue_id"
USER_COL = "Benutzer-Systemnummer"
BARCODE_COL = "Barcode"

# timestamps
for c in [ISSUE_COL, RETURN_COL]:
    if c in borrowings.columns:
        borrowings[c] = pd.to_datetime(borrowings[c], errors="coerce")

# numeric columns
for c in [DURATION_COL, EXT_COL, LATE_DAYS_COL]:
    if c in borrowings.columns:
        borrowings[c] = pd.to_numeric(borrowings[c], errors="coerce")

# normalize late flag to boolean (Ja/Nein -> True/False); keep unknown as <NA>
if LATE_FLAG_COL in borrowings.columns:
    v = borrowings[LATE_FLAG_COL].astype(str).str.strip().str.lower()
    # map() keeps the original index; values other than ja/nein become <NA>
    borrowings["late_bool"] = v.map({"ja": True, "nein": False}).astype("boolean")
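Because `errors="coerce"` silently turns unparseable values into `NaT`/`NaN`, it can be worth counting how many values were lost during parsing. The following is a minimal sketch on made-up data; the helper `count_coerced` and the sample values are hypothetical and not part of the original notebook.

```python
import pandas as pd


def count_coerced(raw: pd.Series, parsed: pd.Series) -> int:
    """Count values that were present in `raw` but are missing after parsing."""
    return int((raw.notna() & parsed.isna()).sum())


# Toy columns standing in for a raw timestamp and a raw numeric column (made-up values)
raw_ts = pd.Series(["2021-03-01 10:15:00", "not a date", None, "2021-03-02 09:00:00"])
raw_num = pd.Series(["14", "abc", None, "7.5"])

parsed_ts = pd.to_datetime(raw_ts, errors="coerce")
parsed_num = pd.to_numeric(raw_num, errors="coerce")

n_coerced_ts = count_coerced(raw_ts, parsed_ts)
n_coerced_num = count_coerced(raw_num, parsed_num)

print("Timestamps lost to coercion:", n_coerced_ts)
print("Numbers lost to coercion:", n_coerced_num)
```

In the notebook itself, the raw column would have to be kept (for example as a copy) before it is overwritten by the parsed version.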

## Cleaning rules

We remove records that violate fundamental consistency constraints of borrowing transactions:

1. Missing return timestamp (and, with it, missing loan duration `Leihdauer`)
2. Loan duration (`Leihdauer`) > 365 days

**ATTENTION: So far, rules have been defined only for the loan duration and the return timestamp.
Analyses of other columns will need additional rules!**

No other filtering is applied at this stage.
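The two rules can also be expressed as one small function that reports how many rows each rule removed. This is only a sketch on toy data: the function name `apply_cleaning_rules`, the constant `MAX_DURATION_DAYS`, and the sample rows are assumptions for illustration, not the notebook's actual implementation.

```python
import pandas as pd

RETURN_COL = "Rückgabedatum/Uhrzeit"
DURATION_COL = "Leihdauer"
MAX_DURATION_DAYS = 365  # threshold from rule 2 (hypothetical constant name)


def apply_cleaning_rules(df: pd.DataFrame) -> tuple[pd.DataFrame, dict[str, int]]:
    """Apply both rules in order and report how many rows each one removed."""
    removed = {}

    # Rule 1: drop loans without a return timestamp
    keep = df[RETURN_COL].notna()
    removed["missing_return"] = int((~keep).sum())
    df = df.loc[keep]

    # Rule 2: drop loans longer than the threshold.
    # A missing duration compares as False, so such rows would be dropped here too.
    keep = df[DURATION_COL] <= MAX_DURATION_DAYS
    removed["duration_gt_365"] = int((~keep).sum())
    df = df.loc[keep].copy()

    return df, removed


# Made-up rows: one unreturned loan, one loan of 400 days, two regular loans
toy = pd.DataFrame({
    RETURN_COL: pd.to_datetime(["2021-01-10", None, "2022-06-01", "2021-02-01"]),
    DURATION_COL: [9, None, 400, 365],
})

cleaned, removed = apply_cleaning_rules(toy)
print(removed)
print("Remaining rows:", len(cleaned))
```

Keeping the per-rule counts in one dictionary makes the before/after numbers easy to reuse in a summary.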

### 1. Missing return timestamp (and missing Leihdauer)

Records without a return timestamp correspond to items that had not yet been returned when the data was extracted. They are excluded from further analysis.

In [11]:
# --- check assumption: missing return timestamp ⇔ missing duration ---
mask_return_missing = borrowings[RETURN_COL].isna()
mask_duration_missing = borrowings[DURATION_COL].isna()

check_table = pd.crosstab(
    mask_return_missing,
    mask_duration_missing,
    rownames=["return_missing"],
    colnames=["duration_missing"]
)

display(check_table)

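# verify assumption (computed from the masks, so it also works if a crosstab category is absent)
if not (mask_return_missing & ~mask_duration_missing).any():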
    print("Assumption holds: whenever return timestamp is missing, duration is also missing.")
else:
    print("WARNING: There are cases with missing return timestamp but present duration.")

# remove not-yet-returned items (missing return timestamp)
before_n = len(borrowings)
borrowings = borrowings.loc[~mask_return_missing].copy()
after_n = len(borrowings)

print(f"Removed {before_n - after_n} rows with missing return timestamp.")
print("Remaining rows:", after_n)


duration_missing    False   True
return_missing
False             2358824      0
True                    0  48786


Assumption holds: whenever return timestamp is missing, duration is also missing.
Removed 48786 rows with missing return timestamp.
Remaining rows: 2358824


### 2. Leihdauer > 365 days
Borrowings with a loan duration exceeding one year were excluded, as these correspond to special administrative cases rather than regular lending behavior.

In [12]:
# --- remove special cases: loan duration > 365 days ---
before_n = len(borrowings)

borrowings = borrowings.loc[borrowings[DURATION_COL] <= 365].copy()

after_n = len(borrowings)
removed_n = before_n - after_n

print(f"Removed {removed_n} rows with Leihdauer > 365 days.")
print("Remaining rows:", after_n)

Removed 1678 rows with Leihdauer > 365 days.
Remaining rows: 2357146


## Save cleaned dataset

In [13]:
borrowings.to_csv(OUTPUT_FILE, index=False, sep=";", quotechar='"', encoding="utf-8")
print("Saved cleaned dataset to:", OUTPUT_FILE)

Saved cleaned dataset to: ../data/processed/borrowings_2019_2025_cleaned.csv


## Cleaning summary

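- Removed 48,786 records with a missing return timestamp (and missing `Leihdauer`) and 1,678 records with a loan duration of more than 365 days; 2,357,146 of the original 2,407,610 records remain.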
- The resulting dataset is intended as the input for EDA and modeling notebooks.
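Since CSV does not store dtypes, a downstream notebook would need to read the cleaned file with the same separator and re-parse the timestamp columns. The sketch below illustrates this with an in-memory buffer and made-up rows instead of the real file; the sample values are assumptions.

```python
import io

import pandas as pd

ISSUE_COL = "Ausleihdatum/Uhrzeit"
RETURN_COL = "Rückgabedatum/Uhrzeit"

# Made-up rows; in a downstream notebook the buffer would be the cleaned CSV path.
toy = pd.DataFrame({
    ISSUE_COL: pd.to_datetime(["2021-01-01 10:00:00", "2021-02-01 12:30:00"]),
    RETURN_COL: pd.to_datetime(["2021-01-10 09:00:00", "2021-02-15 16:45:00"]),
    "Leihdauer": [9, 14],
})

# Write with the same settings as the notebook's save step
buffer = io.StringIO()
toy.to_csv(buffer, index=False, sep=";", quotechar='"')
buffer.seek(0)

# Read back: same separator, and parse the timestamp columns again
reloaded = pd.read_csv(
    buffer,
    sep=";",
    quotechar='"',
    parse_dates=[ISSUE_COL, RETURN_COL],
)

print(reloaded.dtypes)
```

Without `parse_dates`, both timestamp columns would come back as plain strings.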

