# Gini


In [1]:
%load_ext autoreload
%autoreload 2
import altair as alt
import fetch_data as fd
import pandas as pd
import numpy as np
import os

In [2]:
city_info = fd.get_city_info()

In [3]:
YEARS = list(range(2019, 2025))
df_dict = fd.get_dfs(YEARS)
df = pd.concat(df_dict.values(), ignore_index=True)

## Data Cleaning


One-hot encode the severity of accidents.


In [4]:
df = pd.get_dummies(df, columns=["UKATEGORIE"], prefix="inj", dtype=int)
df.rename(
    columns={
        "inj_3": "inj_light",
        "inj_2": "inj_serious",
        "inj_1": "inj_fatal",
    },
    inplace=True,
)
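
To illustrate what the encoding step produces, here is a minimal sketch on made-up rows. It assumes the code mapping used in the rename above (1 = fatal, 2 = serious, 3 = light); the toy frame is hypothetical and not taken from the dataset.

```python
import pandas as pd

# Toy data (made up): UKATEGORIE codes follow the mapping used in the rename
# above, i.e. 1 = fatal, 2 = serious, 3 = light.
toy = pd.DataFrame({"UKATEGORIE": [3, 3, 2, 1], "UJAHR": [2024] * 4})

encoded = pd.get_dummies(toy, columns=["UKATEGORIE"], prefix="inj", dtype=int)
encoded = encoded.rename(
    columns={"inj_3": "inj_light", "inj_2": "inj_serious", "inj_1": "inj_fatal"}
)

# Each row has exactly one indicator set, so column sums are per-category counts.
severity_cols = ["inj_light", "inj_serious", "inj_fatal"]
counts = encoded[severity_cols].sum().to_dict()
print(counts)
```

Note that `pd.get_dummies` only creates columns for codes that actually occur, so a subset without any fatal accidents would lack the `inj_fatal` column.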

Group by `Community_key` (city) and `UJAHR` (year), summing the injury and participant counts and keeping the first `ULAND` (state) value of each group.


In [5]:
agg_methods = {
    "inj_light": "sum",
    "inj_serious": "sum",
    "inj_fatal": "sum",
    "IstFuss": "sum",
    "IstRad": "sum",
    "IstKrad": "sum",
    "IstGkfz": "sum",
    "ULAND": "first",
}

df_grouped = df.groupby(["Community_key", "UJAHR"]).agg(agg_methods).reset_index()
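
A small sketch of the same aggregation pattern on hypothetical rows may make the result shape clearer: one row per (city, year) pair, with counts summed and the state code carried over via `"first"`. The values are made up.

```python
import pandas as pd

# Three made-up accident rows: two in one city, one in another.
toy = pd.DataFrame(
    {
        "Community_key": [1001000, 1001000, 1002000],
        "UJAHR": [2024, 2024, 2024],
        "inj_light": [1, 0, 1],
        "IstRad": [1, 1, 0],
        "ULAND": [1, 1, 1],
    }
)

toy_grouped = (
    toy.groupby(["Community_key", "UJAHR"])
    .agg({"inj_light": "sum", "IstRad": "sum", "ULAND": "first"})
    .reset_index()  # turn the group keys back into ordinary columns
)
print(toy_grouped)
```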

Inner-join `df_grouped` with `city_info` on `"regional key"`. Rows whose key is missing from either table are dropped.


In [6]:
df_grouped.rename(columns={"Community_key": "regional key"}, inplace=True)
df_merged = df_grouped.merge(city_info, on="regional key", how="inner")
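
Because an inner join silently discards unmatched keys, it can be worth checking what gets lost. The following is a minimal sketch on made-up frames (the key `9999999` is hypothetical), using the `indicator` option of `DataFrame.merge` to list the keys an inner join would drop.

```python
import pandas as pd

# Made-up stand-ins for df_grouped and city_info.
left = pd.DataFrame(
    {"regional key": [1001000, 1002000, 9999999], "IstRad": [151, 514, 7]}
)
right = pd.DataFrame(
    {"regional key": [1001000, 1002000], "city": ["Flensburg", "Kiel"]}
)

inner = left.merge(right, on="regional key", how="inner")

# An outer join with indicator=True marks each row as "both", "left_only",
# or "right_only", which reveals the keys the inner join discards.
audit = left.merge(right, on="regional key", how="outer", indicator=True)
dropped = audit.loc[audit["_merge"] != "both", "regional key"].tolist()
print(len(inner), dropped)
```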

Calculate the total number of personal injury accidents per city and year.


In [7]:
# Calculate the total personal injury accidents
df_merged["inj_total"] = (
    df_merged["inj_light"] + df_merged["inj_serious"] + df_merged["inj_fatal"]
)

In [8]:
# Sanity check
df_merged[df_merged["UJAHR"] == 2024].head()

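| | regional key | UJAHR | inj_light | inj_serious | inj_fatal | IstFuss | IstRad | IstKrad | IstGkfz | ULAND | city | sq km | population | inj_total |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 5 | 1001000 | 2024 | 325 | 22 | 2 | 34 | 151 | 29 | 14 | 1 | Flensburg | 56.73 | 96326 | 349 |
| 11 | 1002000 | 2024 | 930 | 93 | 4 | 102 | 514 | 68 | 28 | 1 | Kiel | 118.65 | 252668 | 1027 |
| 17 | 1003000 | 2024 | 1008 | 109 | 1 | 104 | 611 | 93 | 28 | 1 | Lübeck | 214.19 | 216889 | 1118 |
| 23 | 1004000 | 2024 | 329 | 33 | 0 | 32 | 136 | 28 | 16 | 1 | Neumünster | 71.66 | 79809 | 362 |
| 29 | 1051011 | 2024 | 51 | 9 | 0 | 5 | 25 | 10 | 2 | 1 | Brunsbüttel | 65.21 | 12692 | 60 |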


## Gini

We calculate the Gini index. Normally, the Gini index is used as a measure of wealth disparity. Here, we use it to describe how a category of accidents (for example, fatalities) is distributed across cities relative to their population. In this context, a Gini index of 0 means that each city's share of fatalities matches its share of the population, while a Gini index approaching 1 means that the fatalities are concentrated in the single most populous city. On our dataset it is also possible to obtain a negative Gini index: because cities are ordered by population rather than by the value itself, the Lorenz curve can rise above the line of equality. A negative value can be interpreted as lower-population cities contributing more to the category than their higher-population counterparts. If all one cares about is the size of the disparity, regardless of its direction, one can take the absolute value of the Gini index so that it lies in the range `[0, 1]`.
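
Written as a formula, and using notation that is ours rather than taken from the referenced source, the `calc_gini` function defined below appears to evaluate the following. Let $x_i$ and $y_i$ be the cumulative shares of population and of the value for the $i$ least populous cities, with $x_0 = y_0 = 0$ and $x_n = y_n = 1$. The area $A$ under the piecewise-linear Lorenz curve is obtained with the trapezoidal rule:

$$
G = 1 - 2A, \qquad A = \sum_{i=1}^{n} \frac{(x_i - x_{i-1})\,(y_i + y_{i-1})}{2}
$$

On the line of equality $A = 1/2$ and therefore $G = 0$. A curve that runs above the line gives $A > 1/2$ and hence $G < 0$, which is the negative case described above.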


In [9]:
pop_label = "population"

In [10]:
# @lucasboettcher - GitLab - https://gitlab.com/ComputationalScience/overdose-da/-/blob/main/county_plot/county_plots.ipynb
def calc_gini(
    df: pd.DataFrame, pop_label: str = pop_label, val_label: str = "IstRad"
) -> float:
    """Calculate the Gini index for a given dataframe.

    Args:
        df (pd.DataFrame): DataFrame containing population and value columns.
        pop_label (str): Column name for population data.
        val_label (str): Column name for value data.

    Returns:
        float: Gini index.
    """
    df_sorted = df.sort_values(pop_label, ascending=True)
    pop = df_sorted[pop_label].sum()
    val = df_sorted[val_label].sum()

    gini = 1 - 2 * np.trapezoid(
        x=[df_sorted[:i][pop_label].sum() / pop for i in range(1 + len(df_sorted))],
        y=[df_sorted[:i][val_label].sum() / val for i in range(1 + len(df_sorted))],
    )

    # NOTE: The gini-index is typically [0,1] for wealth. However, it is possible to
    # have negative values here, indicating that the lower population segments
    # contribute more than their "fair share" compared to their higher-population
    # counterparts.
    return gini
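
To make the sign convention concrete, here is a small self-contained sketch on three made-up cities. `gini_sketch` is a hypothetical stand-in that follows the same population-ordered trapezoid idea as `calc_gini`, but it uses cumulative sums and a hand-written trapezoidal rule so that it does not depend on a particular NumPy version.

```python
import numpy as np
import pandas as pd


def gini_sketch(df: pd.DataFrame, pop_label: str, val_label: str) -> float:
    """Population-ordered Gini index via the trapezoidal rule (illustrative)."""
    df_sorted = df.sort_values(pop_label, ascending=True)
    pop_share = (df_sorted[pop_label].cumsum() / df_sorted[pop_label].sum()).to_numpy()
    val_share = (df_sorted[val_label].cumsum() / df_sorted[val_label].sum()).to_numpy()
    # Prepend the origin (0, 0) so the curve starts where the Lorenz curve does.
    x = np.concatenate([[0.0], pop_share])
    y = np.concatenate([[0.0], val_share])
    area = np.sum(np.diff(x) * (y[1:] + y[:-1]) / 2)
    return float(1 - 2 * area)


# Made-up cities; each value column describes a different distribution.
toy = pd.DataFrame(
    {
        "population": [100, 200, 700],
        "proportional": [1, 2, 7],  # same shares as the population
        "largest_only": [0, 0, 10],  # everything in the most populous city
        "smallest_only": [10, 0, 0],  # everything in the least populous city
    }
)

results = {c: round(gini_sketch(toy, "population", c), 3) for c in toy.columns[1:]}
print(results)
```

The proportional column gives a value of about 0, the column concentrated in the largest city gives about 0.3 (one minus that city's population share), and the column concentrated in the smallest city gives about -0.9.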

In [11]:
def plot_lorenz_curve(
    df: pd.DataFrame,
    year: int,
    pop_label: str = pop_label,
    val_label: str = "IstRad",
    val_title: None | str = None,
) -> alt.LayerChart:
    """Plot the Lorenz curve for the given DataFrame.

    Args:
        df (pd.DataFrame): DataFrame containing the data.
        year (int): Year of the data.
        pop_label (str, optional): Column name for population. Defaults to pop_label.
        val_label (str, optional): Column name for the value to plot. Defaults to "IstRad".
        val_title (str | None, optional): Title for the value. Defaults to None.

    Returns:
        alt.LayerChart: Lorenz curve plot.
    """
    df_sorted = df.sort_values(by=pop_label, ascending=True)
    pop = df_sorted[pop_label].sum()
    val = df_sorted[val_label].sum()
    val_title = val_label if val_title is None else val_title

    X = [df_sorted[:i][pop_label].sum() / pop for i in range(1 + len(df_sorted))]
    Y = [df_sorted[:i][val_label].sum() / val for i in range(1 + len(df_sorted))]

    title_font_size = 28
    axis_label_font_size = 24
    gini_font_size = 20

    lorenz = (
        alt.Chart(pd.DataFrame({"X": X, "Y": Y}))
        .mark_line(point=True)
        .encode(
            x=alt.X(
                "X",
                title="Proportion of Population",
                axis=alt.Axis(titleFontSize=axis_label_font_size),
            ),
            y=alt.Y(
                "Y",
                title=f"Proportion of {val_title}",
                axis=alt.Axis(titleFontSize=axis_label_font_size),
            ),
            tooltip=[
                alt.Tooltip("X", format=".2%", title="Percent of Population"),
                alt.Tooltip("Y", format=".2%", title=f"Percent of {val_title}"),
            ],
        )
        .properties(
            title={
                # "text": f"Lorenz Curve of {val_title} ({year})",
                "text": f"{val_title} ({year})",
                "fontSize": title_font_size,
            },
            width=500,
            height=500,
        )
        .interactive()
    )

    line = (
        alt.Chart(pd.DataFrame({"X": [0, 1], "Y": [0, 1]}))
        .mark_line(color="red", strokeDash=[5, 5])
        .encode(x="X", y="Y")
    )

    gini = calc_gini(df, pop_label=pop_label, val_label=val_label)
    gini_df = pd.DataFrame(
        {
            # Put the Gini index text near the end of the curve accounting for if the
            # curve is above or below the line of equality.
            "X": [X[-3] if X[-3] < Y[-3] else Y[-3]],
            "Y": [Y[-3]],
            "Gini": [f"Gini Index: {gini:.3f}"],
        }
    )

    gini_chart = (
        alt.Chart(gini_df)
        .mark_text(
            align="right",
            baseline="middle",
            dx=-25,
            fontSize=gini_font_size,
            fontWeight="bold",
            color="black",
        )
        .encode(
            x="X",
            y="Y",
            text="Gini:N",
        )
    )

    return lorenz + line + gini_chart

Plot the Lorenz curves for 2024, first by injury category and then by participant type.


In [12]:
# LUT for human-readable labels
vals = {
    # injury category
    "inj_total": "Total Injuries",
    "inj_light": "Light Injuries",
    "inj_serious": "Serious Injuries",
    "inj_fatal": "Fatal Injuries",
    # participant type
    "IstFuss": "Pedestrian Accidents",
    "IstRad": "Cyclist Accidents",
    "IstKrad": "Motorcycle Accidents",
    "IstGkfz": "Delivery Vehicle Accidents",
}

In [13]:
charts: list[alt.LayerChart] = []

for col, label in vals.items():
    df_yr = df_merged[df_merged["UJAHR"] == 2024]
    # Pass the column by keyword: the second positional parameter is pop_label.
    gini = calc_gini(df_yr, val_label=col)

    charts.append(
        plot_lorenz_curve(
            df_yr,
            2024,
            pop_label=pop_label,
            val_label=col,
            val_title=label,
        )
    )

chart = ((charts[0] | charts[1]) & (charts[2] | charts[3])).configure_concat(spacing=30)
chart.show()
chart.save(os.path.join("img", "gini_categories.png"))
chart = ((charts[4] | charts[5]) & (charts[6] | charts[7])).configure_concat(spacing=30)
chart.show()
chart.save(os.path.join("img", "gini_types.png"))

### Gini of Germany


In [14]:
gini_arr: list[float] = []
col = "IstRad"

for year in YEARS:
    df_yr = df_merged[df_merged["UJAHR"] == year][["city", pop_label, col]]

    gini = calc_gini(df_yr, val_label=col)
    gini_arr.append(gini)

chart = (
    alt.Chart(pd.DataFrame({"Year": YEARS, "Gini": gini_arr}))
    .mark_line(point=True)
    .properties(
        title={
            "text": f"Gini-Index of {vals[col]} per Year",
            "fontSize": 20,
        },
        width=600,
        height=400,
    )
    .encode(
        x=alt.X("Year:O", axis=alt.Axis(titleFontSize=16)),
        y=alt.Y("Gini:Q", axis=alt.Axis(titleFontSize=16)),
        tooltip=["Year", alt.Tooltip("Gini", format=".3f")],
    )
    .interactive()
)

chart.save(os.path.join("img", f"gini_index_{col.lower()}_de.png"))
chart.show()

### Gini per State


In [15]:
land_LUT: dict[int, str] = {
    1: "Schleswig-Holstein",
    2: "Hamburg",
    3: "Niedersachsen",
    4: "Bremen",
    5: "Nordrhein-Westfalen",  # data as from 2019
    6: "Hessen",
    7: "Rheinland-Pfalz",  # data as from 2017
    8: "Baden-Württemberg",
    9: "Bayern",
    10: "Saarland",  # data as from 2017
    11: "Berlin",  # data as from 2018
    12: "Brandenburg",  # data as from 2017
    13: "Mecklenburg-Vorpommern",  # data as from 2020
    14: "Sachsen",
    15: "Sachsen-Anhalt",
    16: "Thüringen",  # data as from 2019
}

In [16]:
gini_arr: list[dict[str, float | str | int]] = []

for year in YEARS:
    df_yr = df_merged[df_merged["UJAHR"] == year][[pop_label, col, "ULAND"]]

    for l_id, l_name in land_LUT.items():
        # Compare numerically so that both 1 and "01" match the state id.
        df_land = df_yr[pd.to_numeric(df_yr["ULAND"]) == l_id]

        # Skip if we have no data for this land in this year e.g. NRW 2021
        if df_land.empty:
            continue

        gini = calc_gini(df_land, val_label=col)
        gini_arr.append(
            {
                "Year": year,
                "Land": l_name,
                "Gini": gini,
            }
        )
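
As an alternative to filtering inside nested loops, the same per-year, per-state table could be built by iterating over a `groupby`. The sketch below is only an illustration on made-up rows: `toy_lut` is a hypothetical two-state lookup, and `metric` is a placeholder for `calc_gini` so that the block runs on its own.

```python
import pandas as pd

# Hypothetical state lookup and toy rows; the values are made up.
toy_lut = {1: "Schleswig-Holstein", 2: "Hamburg"}
toy = pd.DataFrame(
    {
        "UJAHR": [2023, 2023, 2024, 2024, 2024],
        "ULAND": [1, 1, 1, 1, 2],
        "population": [100, 300, 100, 300, 1000],
        "IstRad": [1, 3, 2, 2, 10],
    }
)


def metric(group: pd.DataFrame) -> float:
    """Placeholder for calc_gini: accidents per 1,000 inhabitants."""
    return 1000 * group["IstRad"].sum() / group["population"].sum()


# groupby only yields (year, state) pairs that occur in the data,
# so no explicit emptiness check is needed.
rows = [
    {"Year": year, "Land": toy_lut[land], "Value": metric(group)}
    for (year, land), group in toy.groupby(["UJAHR", "ULAND"])
]
summary = pd.DataFrame(rows)
print(summary)
```

The trade-off is that states without data for a year simply do not appear, which matches the `continue` branch above but makes the skipped combinations less visible in the code.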

In [17]:
df_gini = pd.DataFrame(gini_arr)

chart = (
    alt.Chart(df_gini)
    .mark_line(point=True)
    .encode(
        x=alt.X("Year:O", title="Year", axis=alt.Axis(titleFontSize=16)),
        y=alt.Y("Gini:Q", title="Gini-Index", axis=alt.Axis(titleFontSize=16)),
        color=alt.Color("Land:N", title="State"),
        tooltip=[
            "Land",
            alt.Tooltip("Year:O", title="Year"),
            alt.Tooltip("Gini:Q", format=".3f"),
        ],
    )
    .properties(
        title={
            "text": f"Gini-Index of {vals[col]} by State per Year",
            "fontSize": 20,
        },
        width=800,
        height=400,
    )
    .interactive()
)

chart.save(os.path.join("img", f"gini_index_{col.lower()}_land.png"))
chart.show()