## Analysis of late returns across the CCODE column
This notebook explores potential correlations between CCODE (or “Sammlungszeichen”) and late returns. CCODE is an internal library classification system and is more granular than the media type. In addition, the dataset includes user-related information.


Most of the analysis uses only CCODEs with at least 100 borrowings. To change this threshold, adjust the `MIN_BORROWINGS_PER_CCODE` variable in `03_data_cleaning.ipynb`.

Initialization
- Import all dependencies
- Read all borrowings from the cleaned borrowings CSV file

In [18]:
%config InlineBackend.figure_format = 'retina'

import pandas as pd
from pathlib import Path
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns
import plotly.express as px
import statsmodels.api as sm
from utils import setup_pandas, setup_plotting, log_pearson_spearman

input_file = Path('../dat/processed/borrowings_2019_2025_cleaned.csv')

data_frame = pd.read_csv(
    input_file,
    sep=';',
    quotechar='"',
    encoding='utf-8'
)

print(data_frame.shape)
data_frame.head()

CCODE_COL = "Sammlungszeichen/CCODE"

(1821455, 21)


## Overview
Show the number of borrowings per CCODE.

In [19]:
# Number of borrowings per CCODE, sorted in descending order
data_frame[CCODE_COL].value_counts()

Sammlungszeichen/CCODE
kisl1                174042
dki                  147045
dvdsl                126966
kisl4.1              123870
esl                   96440
esac                  83574
kisl5.1               78708
eslkrimi/thriller     72232
kislcomic             70805
kisl                  60694
dvdki                 56860
kisa                  49077
jusl                  34521
esax                  27398
dsl                   27095
jusa                  26898
kisare                25710
juslmanga             24619
efr                   23677
ecom                  23646
esach                 22095
mag                   21101
dmPop                 20903
esav                  18990
esam                  18510
tonie                 18206
sp                    17774
dvdserie              17694
esao                  17313
eslunterha            16609
dvdsa                 16317
kifr                  14692
jusare                13638
dkisa                 12409
esay                  120

## First look at CCODEs in combination with late returns
Now we examine correlations between CCODEs and borrowings that were returned late. As a first step, we compute the percentage of borrowings per CCODE that were returned late.

In [20]:
def get_late_borrowings_table(input_data_frame):
    """Return one row per CCODE with total loans, late loans and late share in percent."""
    late_borrowings_per_type = []
    for ccode, ccode_group in input_data_frame.groupby(CCODE_COL):
        amount_of_total_entries = len(ccode_group)
        # "Verspätet" holds "Ja" for borrowings that were returned late
        amount_of_late_entries = (ccode_group["Verspätet"] == "Ja").sum()
        percent_late = amount_of_late_entries / amount_of_total_entries * 100

        late_borrowings_per_type.append({
            CCODE_COL: ccode,
            "Anzahl_Ausleihen": amount_of_total_entries,
            "Anzahl_verspaetet": amount_of_late_entries,
            "Prozent_verspaetet": percent_late
        })

    # Sort so that the CCODEs with the highest late share come first
    late_borrowings_per_type_table = (
        pd.DataFrame(late_borrowings_per_type)
        .sort_values("Prozent_verspaetet", ascending=False)
        .reset_index(drop=True)
    )
    return late_borrowings_per_type_table

overview_table = get_late_borrowings_table(data_frame)

display(overview_table)

|   | Sammlungszeichen/CCODE | Anzahl_Ausleihen | Anzahl_verspaetet | Prozent_verspaetet |
|---|---|---|---|---|
| 0 | dvdsa | 16317 | 1479 | 9.064166 |
| 1 | esan | 11000 | 929 | 8.445455 |
| 2 | dvdkisa | 6261 | 508 | 8.11372 |
| 3 | esaw | 11281 | 893 | 7.915965 |
| 4 | sk | 6293 | 496 | 7.881773 |
| 5 | dvdki | 56860 | 4456 | 7.836792 |
| 6 | esav | 18990 | 1448 | 7.625066 |
| 7 | esah | 8262 | 626 | 7.576858 |
| 8 | esag | 6593 | 467 | 7.08327 |
| 9 | esao | 17313 | 1222 | 7.05828 |


The first takeaway is the large spread in late-return percentages: with a minimum of 100 borrowings per CCODE, the highest and lowest categories differ by about 10 percentage points.

After raising the minimum to 1,000 borrowings, the spread shrinks only slightly, to around 9 percentage points.

So the CCODEs may carry useful information: some appear to be associated with higher late-return rates than others.
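The threshold comparison above can be sketched on toy data. This is only an illustration: the helper `late_rate_spread`, the column names `ccode` and `late`, and the toy values are assumptions and not part of the notebook, which stores lateness as "Ja" in the `Verspätet` column.

```python
import pandas as pd

def late_rate_spread(df, min_borrowings, ccode_col="ccode", late_col="late"):
    """Spread (max - min) of the late rate in percentage points,
    restricted to CCODEs with at least `min_borrowings` loans."""
    stats = df.groupby(ccode_col)[late_col].agg(n_loans="size", late_rate="mean")
    kept = stats[stats["n_loans"] >= min_borrowings]
    return 100 * (kept["late_rate"].max() - kept["late_rate"].min())

# Toy data (made up): two larger CCODEs and one very small one.
toy = pd.DataFrame({
    "ccode": ["a"] * 10 + ["b"] * 10 + ["c"] * 2,
    "late": [True] * 1 + [False] * 9    # a: 1 of 10 late
          + [True] * 2 + [False] * 8    # b: 2 of 10 late
          + [True] * 2,                 # c: 2 of 2 late, but only 2 loans
})

spread_all = late_rate_spread(toy, min_borrowings=1)
spread_filtered = late_rate_spread(toy, min_borrowings=5)
print(spread_all, spread_filtered)
```

In this toy case the spread is driven almost entirely by the tiny CCODE `c`, which is why comparing several thresholds is a useful sanity check before interpreting the spread.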

## Identifying Indicators for Late Returns

The next step is to find good indicators for late returns by analyzing loan duration characteristics and their relationship to overdue behavior across different ccodes. The table below summarizes key statistics per ccode.

### Column Descriptions

- **mean_loan_duration**: Average loan duration (in days).
- **median_loan_duration**: Median loan duration (in days).
- **p75 / p90 / p95**: Upper quantiles of the loan duration distribution, describing long borrowing periods.
- **late_rate_percent**: Share of borrowings returned late (in percent).
- **median_late**: Median loan duration for borrowings that were returned late. Indicates how long overdue items are typically kept.
- **median_on_time**: Median loan duration for borrowings returned on time. Baseline for regular borrowing behavior.
- **delta_median**: Severity of late returns. Difference between late and on-time median loan duration. Highlights whether longer borrowing durations are associated with late returns.


In [21]:
import numpy as np
import pandas as pd

CCODE_COL = "Sammlungszeichen/CCODE"
LOAN_DUR_COL = "Leihdauer"
LATE_COL = "Verspätet"

def _q(q):
    return lambda s: pd.to_numeric(s, errors="coerce").quantile(q)

per_ccode_data = data_frame.copy()
per_ccode_data["loan_duration"] = pd.to_numeric(per_ccode_data[LOAN_DUR_COL], errors="coerce")
per_ccode_data["late_bool"] = (per_ccode_data[LATE_COL] == "Ja")

# 1) Overall loan duration stats + late rate per CCODE
base = (
    per_ccode_data.groupby(CCODE_COL)
    .agg(
        n_loans=(CCODE_COL, "size"),
        mean_loan_duration=("loan_duration", "mean"),
        median_loan_duration=("loan_duration", "median"),
        p75=("loan_duration", _q(0.75)),
        p90=("loan_duration", _q(0.90)),
        p95=("loan_duration", _q(0.95)),
        late_rate_percent=("late_bool", lambda x: 100 * x.mean()),
    )
    .reset_index()
)

# 2) Median loan duration for late borrowings
median_late = (
    per_ccode_data.loc[per_ccode_data["late_bool"]]
    .groupby(CCODE_COL)["loan_duration"]
    .median()
    .rename("median_late")
    .reset_index()
)

# 3) Median loan duration for on-time borrowings
median_on_time = (
    per_ccode_data.loc[~per_ccode_data["late_bool"]]
    .groupby(CCODE_COL)["loan_duration"]
    .median()
    .rename("median_on_time")
    .reset_index()
)

# 4) Merge + delta
ccode_table = (
    base
    .merge(median_late, on=CCODE_COL, how="left")
    .merge(median_on_time, on=CCODE_COL, how="left")
    .assign(
        delta_median=lambda d: d["median_late"] - d["median_on_time"]
    )
    .sort_values(["late_rate_percent", "n_loans"], ascending=[False, False])
)

ccode_table


|   | Sammlungszeichen/CCODE | n_loans | mean_loan_duration | median_loan_duration | p75 | p90 | p95 | late_rate_percent | median_late | median_on_time | delta_median |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 12 | dvdsa | 16317 | 43.891585 | 28.0 | 63.0 | 98.0 | 130.0 | 9.064166 | 61.0 | 26.0 | 35.0 |
| 25 | esan | 11000 | 82.329 | 58.0 | 119.0 | 188.0 | 220.0 | 8.445455 | 113.0 | 56.0 | 57.0 |
| 11 | dvdkisa | 6261 | 42.140553 | 30.0 | 58.0 | 88.0 | 111.0 | 8.11372 | 39.0 | 29.0 | 10.0 |
| 32 | esaw | 11281 | 90.146086 | 61.0 | 141.0 | 205.0 | 256.0 | 7.915965 | 158.0 | 56.0 | 102.0 |
| 56 | sk | 6293 | 79.764977 | 55.0 | 119.0 | 189.0 | 227.0 | 7.881773 | 119.0 | 51.0 | 68.0 |
| 10 | dvdki | 56860 | 38.394566 | 26.0 | 52.0 | 85.0 | 109.0 | 7.836792 | 36.0 | 25.0 | 11.0 |
| 31 | esav | 18990 | 77.172617 | 52.0 | 113.0 | 186.0 | 210.0 | 7.625066 | 116.0 | 49.0 | 67.0 |
| 22 | esah | 8262 | 83.266279 | 56.0 | 125.0 | 194.0 | 225.0 | 7.576858 | 118.5 | 53.0 | 65.5 |
| 21 | esag | 6593 | 63.328075 | 41.0 | 85.0 | 160.0 | 193.0 | 7.08327 | 87.0 | 38.0 | 49.0 |
| 26 | esao | 17313 | 78.653902 | 54.0 | 112.0 | 186.0 | 218.0 | 7.05828 | 108.5 | 51.0 | 57.5 |


## Takeaways
The takeaways are similar to those for the media type column, so we check whether the same correlations appear here.

## Identifying correlations for properties of CCODEs

To find correlations between properties of CCODEs and the late rate, we look at some scatter plots.

In [27]:
input_data_frame = ccode_table.copy()

input_data_frame["size"] = 50 + 400 * np.sqrt(
    input_data_frame["n_loans"] / input_data_frame["n_loans"].max()
)

fig = px.scatter(
    input_data_frame,
    x="mean_loan_duration",
    y="late_rate_percent",
    size="size",
    hover_name="Sammlungszeichen/CCODE",
    hover_data={
        "n_loans": True,
        "mean_loan_duration": ":.1f",
        "late_rate_percent": ":.2f",
        "size": False,
    },
    labels={
        "mean_loan_duration": "Mean loan duration (days)",
        "late_rate_percent": "Late return rate (%)",
    },
    title="Late return rate vs mean loan duration across CCODEs",
    trendline="ols",
)

fig.update_traces(marker=dict(opacity=0.75))
fig.update_layout(width=650, height=600)
fig.show()

log_pearson_spearman(input_data_frame, "mean_loan_duration", "late_rate_percent")


mean_loan_duration vs late_rate_percent (n=60)
Pearson  r   = 0.6683   p-value ("null hypothesis: no correlation") = 5.38e-09
Spearman rho = 0.6904   p-value ("null hypothesis: no correlation") = = 1.04e-09


### MIN_BORROWINGS_PER_CCODE = 1000
Pearson  r   = 0.5070   p-value ("null hypothesis: no correlation") = 3.98e-07
Spearman rho = 0.5110   p-value ("null hypothesis: no correlation") = 3.12e-07

Late return rates generally increase with the mean loan duration per CCODE, indicating a clear positive association. At the same time, there is substantial scatter: CCODEs with similar mean loan durations can still have very different late rates, which suggests additional influencing factors. The most extreme values occur more often in CCODEs with fewer loans, which may partly reflect sampling noise.
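One way to gauge how much of this scatter could be sampling noise is to attach a confidence interval to each late rate. The following is a minimal sketch using the Wilson score interval for a binomial proportion. The function name `wilson_interval` and the 9-of-100 example are assumptions for illustration; the second call uses the dvdsa counts from the overview table.

```python
import math

def wilson_interval(n_late, n_total, z=1.96):
    """Wilson score interval for a proportion; z=1.96 is roughly a 95 % level."""
    if n_total <= 0:
        raise ValueError("n_total must be positive")
    p = n_late / n_total
    denom = 1 + z**2 / n_total
    center = (p + z**2 / (2 * n_total)) / denom
    half_width = z * math.sqrt(
        p * (1 - p) / n_total + z**2 / (4 * n_total**2)
    ) / denom
    return center - half_width, center + half_width

# Hypothetical small CCODE: 9 late returns out of 100 loans
low_small, high_small = wilson_interval(9, 100)
# dvdsa from the overview table: 1479 late returns out of 16317 loans
low_large, high_large = wilson_interval(1479, 16317)

print(f"small: {100 * low_small:.1f} % to {100 * high_small:.1f} %")
print(f"large: {100 * low_large:.1f} % to {100 * high_large:.1f} %")
```

Both examples have a late rate of about 9 %, but the interval for the small CCODE is several times wider, so extreme rates in small categories should be read with caution.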

### MIN_BORROWINGS_PER_CCODE = 5000
To get a more robust picture, we apply a higher threshold. The relationship becomes clearer, but there is still substantial variance.

Pearson  r   = 0.6683   p-value ("null hypothesis: no correlation") = 5.38e-09
Spearman rho = 0.6904   p-value ("null hypothesis: no correlation") = 1.04e-09


In [28]:
input_data_frame = ccode_table.copy()

input_data_frame["size"] = 50 + 400 * np.sqrt(
    input_data_frame["n_loans"] / input_data_frame["n_loans"].max()
)

fig = px.scatter(
    input_data_frame,
    x="delta_median",
    y="late_rate_percent",
    size="size",
    hover_name="Sammlungszeichen/CCODE",
    hover_data={
        "n_loans": True,
        "delta_median": ":.1f",
        "late_rate_percent": ":.2f",
        "size": False,
    },
    labels={
        "delta_median": "Median loan duration difference, late vs on-time (days)",
        "late_rate_percent": "Late return rate (%)",
    },
    title="Late return rate vs delta_median across CCODEs",
    trendline="ols",
)

fig.update_traces(marker=dict(opacity=0.75))
fig.update_layout(width=650, height=600)
fig.show()

log_pearson_spearman(input_data_frame, "delta_median", "late_rate_percent")

delta_median vs late_rate_percent (n=60)
Pearson  r   = 0.0970   p-value ("null hypothesis: no correlation") = 0.461
Spearman rho = 0.1010   p-value ("null hypothesis: no correlation") = = 0.442


Across ccodes, there is no significant correlation between the median difference in loan duration (late vs. on-time) and the late-return rate.
Thus, the magnitude of the loan-duration difference in late cases (as a measure of “severity”) does not appear to be related to how frequently late returns occur.

In [29]:
input_data_frame = ccode_table.copy()

input_data_frame["size"] = 50 + 400 * np.sqrt(
    input_data_frame["n_loans"] / input_data_frame["n_loans"].max()
)

fig = px.scatter(
    input_data_frame,
    x="p90",
    y="late_rate_percent",
    size="size",
    hover_name="Sammlungszeichen/CCODE",
    hover_data={
        "n_loans": True,
        "p90": ":.1f",
        "late_rate_percent": ":.2f",
        "size": False,
    },
    labels={
        "p90": "P90 loan duration (days)",
        "late_rate_percent": "Late return rate (%)",
    },
    title="Late return rate vs P90 loan duration across CCODEs",
    trendline="ols",
)

fig.update_traces(marker=dict(opacity=0.75))
fig.update_layout(width=650, height=600)
fig.show()

log_pearson_spearman(input_data_frame, "p90", "late_rate_percent")

p90 vs late_rate_percent (n=60)
Pearson  r   = 0.6257   p-value ("null hypothesis: no correlation") = 9e-08
Spearman rho = 0.6511   p-value ("null hypothesis: no correlation") = = 1.78e-08


Across CCODEs, there is a significant positive association between the P90 loan duration (as a measure of long-tail loan length) and the late-return rate.
This means that CCODEs with particularly long upper-tail loan durations tend to have higher late rates.
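The helper `log_pearson_spearman` is imported from `utils` and its implementation is not shown in this notebook. As a rough sketch of the kind of computation behind such a report, assuming SciPy is available, the two coefficients can be obtained as follows. The function name `pearson_spearman` and the toy values are hypothetical.

```python
from scipy.stats import pearsonr, spearmanr

def pearson_spearman(x, y):
    """Return Pearson r and Spearman rho together with their p-values."""
    r, p_r = pearsonr(x, y)
    rho, p_rho = spearmanr(x, y)
    return {
        "pearson_r": float(r),
        "pearson_p": float(p_r),
        "spearman_rho": float(rho),
        "spearman_p": float(p_rho),
    }

# Toy values (made up): strictly increasing, but not linear
x = [1, 2, 3, 4, 5, 6]
y = [v ** 3 for v in x]

result = pearson_spearman(x, y)
print(result)
```

Pearson measures linear association, while Spearman only depends on the rank order. Reporting both, as the notebook does, shows whether an association holds beyond a strictly linear trend.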

## Summary
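CCODEs seem to behave similarly to media types, which is expected because CCODE is a more granular classification than the media type. The late-return rate is positively associated with the mean and P90 loan duration per CCODE, but shows no significant correlation with delta_median.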