# Jacob

## Inquiry theme: Episodes of Democratic Change 

This analysis examines patterns common to 'episodes' of democratic change, defined as periods of significant shifts in key indicators. The goal is to identify key features of these episodes and to understand how different dimensions of democracy interact during them.

## Analytic questions
1. Which components tend to lead or lag overall regime change? Do some freedoms consistently shift earlier or later during the process?
2. Which dimensions of democracy are most resistant to decline during episodes of autocratization, and which are the most vulnerable?
3. Are the temporal patterns of autocratization/democratization symmetric? Do gains tend to take place slowly over time while losses are abrupt, or vice versa?

## Task abstraction
**Question 1:** Which components tend to lead or lag overall regime change? Do some freedoms consistently shift earlier or later during the process?
- Filter (locate episodes of regime change, high delta magnitudes in selected indicator(s))
- Compute derived value (compute average delta for each component for each year relative to episode)
- Characterize distribution (compare trends in components)
- Sort (to locate components which move early and late) 
- Find extremum (determine components which move earliest and latest)
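
The task chain for Question 1 can be sketched on a small invented table. This is only an illustration, assuming a long-format layout with one row per episode and component and a `lead` column in which positive values mean the component's largest change came before the reference index's largest change. The name `toy_leads` and all values are hypothetical.

```python
import pandas as pd

# Toy long-format table: one row per (episode, component). Values are invented.
# `lead` > 0: the component peaked before the reference index; `lead` < 0: after.
toy_leads = pd.DataFrame({
    "episode_id": ["A", "A", "A", "B", "B", "B"],
    "component": ["v2x_suffr", "v2xel_frefair", "v2x_jucon"] * 2,
    "lead": [2, 0, -1, 1, 0, -3],
})

# Compute derived value: mean lead per component across episodes.
mean_lead = toy_leads.groupby("component")["lead"].mean()

# Sort: components that tend to move earliest come first.
mean_lead_sorted = mean_lead.sort_values(ascending=False)

# Find extremum: the earliest and latest movers on average.
earliest = mean_lead_sorted.index[0]
latest = mean_lead_sorted.index[-1]

print(mean_lead_sorted)
print(f"Earliest: {earliest}, latest: {latest}")
```

With real data, the same grouping would run over the filtered episode set rather than a hand-built table.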

**Question 2:** Which dimensions of democracy are most resistant to decline during episodes of autocratization, and which are the most vulnerable?
- Filter (locate episodes of autocratization, strongly negative deltas in selected indicator(s))
- Compute derived value (compute percentage of episodes in which each component saw decline)
- Characterize distribution (compare trends in components)
- Sort (to locate components which frequently or infrequently decline) 
- Find anomalies (determine components which are unusually resilient or weak)
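
As a rough sketch of the derived value for Question 2, the share of autocratization episodes in which each component declined can be computed with a grouped mean over a boolean flag. The table `toy_changes` and its values are invented for illustration, and "declined" is assumed here to mean a strictly negative start-to-end change.

```python
import pandas as pd

# Toy table: start-to-end change per (episode, component). Values are invented.
toy_changes = pd.DataFrame({
    "episode_id": ["A", "A", "B", "B", "C", "C", "D", "D"],
    "type": ["autocratization"] * 6 + ["democratization"] * 2,
    "component": ["v2x_jucon", "v2x_freexp_altinf"] * 4,
    "delta": [0.01, -0.20, -0.05, -0.15, 0.00, -0.30, 0.10, 0.25],
})

# Filter: keep autocratization episodes only.
aut = toy_changes[toy_changes["type"] == "autocratization"]

# Compute derived value: percentage of episodes in which the component declined.
# Assumption: a decline is any strictly negative delta.
pct_declined = (
    aut.assign(declined=aut["delta"] < 0)
    .groupby("component")["declined"]
    .mean()
    .mul(100)
    .sort_values()  # Sort: least frequently declining first
)

most_resilient = pct_declined.index[0]
most_vulnerable = pct_declined.index[-1]
print(pct_declined)
```

A stricter threshold (for example, a decline of at least some minimum size) could replace the `< 0` test if small fluctuations should not count.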

**Question 3:** Are the temporal patterns of autocratization/democratization symmetric? Do gains tend to take place slowly over time while losses are abrupt, or vice versa?
- Filter (locate episodes of regime change, high delta magnitudes in selected indicator(s))
- Compute derived value (compute aggregate change over time for each episode)
- Characterize distribution (examine trends for each episode type)
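
One way to approach the symmetry question is to compare the distribution of per-year change rates between the two episode types. The sketch below uses invented episodes in a hypothetical `toy_episodes` table and assumes duration is counted inclusively (end year minus start year plus one).

```python
import pandas as pd

# Toy episode table. All values are invented for illustration.
toy_episodes = pd.DataFrame({
    "type": ["democratization"] * 3 + ["autocratization"] * 3,
    "start_year": [1990, 1975, 2000, 1930, 2010, 1970],
    "end_year": [1999, 1984, 2004, 1932, 2012, 1971],
    "total_change": [0.30, 0.20, 0.25, -0.30, -0.15, -0.20],
})

# Compute derived value: aggregate change per year of episode.
# Assumption: duration counts both the start and end year.
toy_episodes["duration"] = toy_episodes["end_year"] - toy_episodes["start_year"] + 1
toy_episodes["slope_abs"] = toy_episodes["total_change"].abs() / toy_episodes["duration"]

# Characterize distribution: summary statistics of the rate for each episode type.
slope_summary = toy_episodes.groupby("type")["slope_abs"].agg(["median", "mean", "max"])
print(slope_summary)
```

In this toy table the autocratization episodes are shorter and steeper, which is the kind of asymmetry the question asks about. Whether the real data shows it is left to the analysis.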


## Data loading

In [None]:
import pandas as pd
import altair as alt

In [None]:
vdem_raw = pd.read_csv("../../data/raw/V-Dem-CY-Full+Others-v15.csv")
print(f"Dataset shape: {vdem_raw.shape}")

## Data transformation

- Episode data is created for this analysis.
- An episode is defined as any period in which the electoral democracy index (`v2x_polyarchy`) changes by a cumulative total of at least 0.1. To trigger the start of an episode, the index must change by at least 0.01 in a single year. An episode ends immediately before the index stagnates (changes by less than 0.01 per year for more than 5 consecutive years) or changes direction (a single-year reversal of at least 0.03 or a cumulative reversal of at least 0.1).
- Episodes are classified as `democratization` (net positive change) or `autocratization` (net negative change).
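
To make the start and cumulative thresholds concrete, here is a deliberately simplified sketch on an invented five-year series. It checks only the 0.01 trigger and the 0.1 cumulative requirement. The stagnation and reversal rules are omitted, so this is not a substitute for the full detection logic. The `toy` table and its values are hypothetical.

```python
import pandas as pd

# Hypothetical index values for one country (not taken from V-Dem).
toy = pd.DataFrame({
    "year": [2000, 2001, 2002, 2003, 2004],
    "v2x_polyarchy": [0.30, 0.30, 0.35, 0.42, 0.42],
})

start_threshold = 0.01  # minimum single-year change that can trigger an episode
cum_threshold = 0.1     # minimum cumulative change for the episode to count

# Year-over-year change; the first year has no predecessor and is NaN.
toy["delta"] = toy["v2x_polyarchy"].diff()

# Years whose single-year change is large enough to trigger an episode.
candidate_years = toy.loc[toy["delta"].abs() >= start_threshold, "year"].tolist()

# Cumulative change measured from the year before the first trigger.
first_trigger = candidate_years[0]
baseline = toy.loc[toy["year"] == first_trigger - 1, "v2x_polyarchy"].iloc[0]
cumulative = toy["v2x_polyarchy"].iloc[-1] - baseline

qualifies = abs(cumulative) >= cum_threshold
episode_type = "democratization" if cumulative > 0 else "autocratization"
print(candidate_years, round(cumulative, 2), qualifies, episode_type)
```

Here the index rises from 0.30 to 0.42, so the toy series would count as a democratization episode under the two thresholds checked.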

In [None]:
vdem_raw["democracy_score"] = (vdem_raw["v2x_polyarchy"] + vdem_raw["v2x_libdem"] + vdem_raw["v2x_partipdem"] + vdem_raw["v2x_delibdem"] + vdem_raw["v2x_egaldem"]) / 5


episodes = pd.DataFrame(columns=["country_name", "start_year", "end_year", "type"])

start_threshold = 0.01
cum_threshold = 0.1
reverse_single_threshold = 0.03
reverse_cum_threshold = 0.1
stasis_years_threshold = 5

def verify_episode(start_year, country, type):
    global episodes
    country_data = vdem_raw[vdem_raw["country_name"] == country]

    year = start_year
    polyarchy_before = country_data[country_data["year"] == year - 1]["v2x_polyarchy"].values[0]

    polyarchy_prev = polyarchy_before
    delta_cum = 0
    validated = False
    before_reversing = year
    reverse_cum = 0
    stasis_count = 0
    while True:
        polyarchy = country_data[country_data["year"] == year]["v2x_polyarchy"]

        if pd.isnull(polyarchy).all():
            break

        polyarchy = polyarchy.values

        if len(polyarchy) != 1:
            print(f"Data issue for {country} in {year}: {polyarchy}")
            break

        if pd.isnull(polyarchy[0]):
            break

        delta = polyarchy[0] - polyarchy_prev

        correct_direction = (delta > 0 and type == "democratization") or (delta < 0 and type == "autocratization")

        delta_cum += delta

        if correct_direction:
            reverse_cum = 0

            if abs(delta) < start_threshold:
                stasis_count += 1
            else:
                before_reversing = year
                stasis_count = 0

                if (delta_cum >= cum_threshold and type == "democratization") or (delta_cum <= -cum_threshold and type == "autocratization"):
                    validated = True
        else:
            reverse_cum += delta

            if abs(reverse_cum) >= reverse_cum_threshold:
                break

            stasis_count += 1

            if abs(delta) >= reverse_single_threshold:
                break

        if stasis_count > stasis_years_threshold:
            break

        polyarchy_prev = polyarchy[0]
        year += 1
    
    if validated:
        end_year = before_reversing
        episodes = pd.concat([episodes, pd.DataFrame([{
            "country_name": country,
            "start_year": start_year - 1,
            "end_year": end_year,
            "type": type,
            "sign": 1 if type == "democratization" else -1
        }])], ignore_index=True)

for country in vdem_raw["country_name"].unique():
    country_data = vdem_raw[vdem_raw["country_name"] == country]

    year_min = country_data["year"].min()
    year_max = country_data["year"].max()

    prev = None
    for year in range(year_min, year_max + 1):
        polyarchy = country_data[country_data["year"] == year]["v2x_polyarchy"]

        if pd.isnull(polyarchy).all():
            prev = None
            continue

        polyarchy = polyarchy.values

        if len(polyarchy) != 1:
            print(f"Data issue for {country} in {year}: {polyarchy}")
            continue

        if pd.isnull(polyarchy[0]):
            prev = None
            continue

        if prev is None:
            prev = polyarchy[0]
            continue

        delta = polyarchy[0] - prev

        if abs(delta) >= start_threshold:
            ep_type = "democratization" if delta > 0 else "autocratization"

            if episodes[(episodes["country_name"] == country) & (episodes["type"] == ep_type) & (episodes["end_year"] >= year)].empty:
                verify_episode(year, country, ep_type)

        prev = polyarchy[0]

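# Count episodes by type. Computing the counts first avoids nesting double quotes
# inside the f-string, which is a syntax error before Python 3.12.
n_dem = (episodes["type"] == "democratization").sum()
n_aut = (episodes["type"] == "autocratization").sum()
print(f"Identified {n_dem} democratization episodes and {n_aut} autocratization episodes.")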

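- Additional columns are added to the episode data:
    - `duration`: number of years from start to end of episode (inclusive)
    - `total_change`: net change in electoral democracy index over episode
    - `total_change_abs`: absolute value of `total_change`
    - `slope`: `total_change` divided by `duration`
    - `slope_abs`: absolute value of `slope`
    - `reference_peak_year`: year of the largest single-year change in the electoral democracy index in the episode's direction
    - For each numeric component: `delta__*`, `delta_abs__*`, `peak_yoy_abs__*`, `peak_yoy_year__*`, and `lead__*` (`reference_peak_year` minus the component's own peak year, so positive values mean the component moved first)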

- `vdem_select` is created to include only relevant columns for analysis.

In [None]:
for idx, episode in episodes.iterrows():
    type = episode["type"]
    mask = (vdem_raw["country_name"] == episode["country_name"]) & (vdem_raw["year"] >= episode["start_year"]) & (vdem_raw["year"] <= episode["end_year"])
    years_in_episode = vdem_raw[mask]["year"].values
    t_values = years_in_episode - episode["start_year"]
    vdem_raw.loc[mask, "episode_t"] = t_values
    vdem_raw.loc[mask, "episode_type"] = type

indices = ["country_name", "year", "episode_t", "episode_type"]

high_level = [    
    "v2x_polyarchy", "v2x_libdem", "v2x_partipdem", "v2x_delibdem", "v2x_egaldem",
]

high_level_naming = {
    "v2x_polyarchy": "Electoral Democracy",
    "v2x_libdem": "Liberal Democracy",
    "v2x_partipdem": "Participatory Democracy",
    "v2x_delibdem": "Deliberative Democracy",
    "v2x_egaldem": "Egalitarian Democracy"
}

low_level = [
    "v2x_api", "v2x_mpi", "v2x_freexp_altinf", "v2x_frassoc_thick", "v2x_suffr",
    "v2xel_frefair", "v2x_elecoff", "v2x_liberal", "v2xcl_rol", "v2x_jucon",
    "v2xlg_legcon", "v2x_partip", "v2x_cspart", "v2xdd_dd", "v2xel_locelec",
    "v2xel_regelec", "v2xdl_delib", "v2x_egal", "v2xeg_eqprotec", "v2xeg_eqaccess",
    "v2xeg_eqdr", "v2elgvsuflvl", "v2expathhg", "v2elrstrct", "v2ddcredal",
    "v2exfemhog", "v2elcomvot", "v2elfemrst", "v2ddlexci", "e_regionpol_7C", "e_regiongeo"
]

numeric = high_level + [
    "v2x_api", "v2x_mpi", "v2x_freexp_altinf", "v2x_frassoc_thick", "v2x_suffr",
    "v2xel_frefair", "v2x_elecoff", "v2x_liberal", "v2xcl_rol", "v2x_jucon",
    "v2xlg_legcon", "v2x_partip", "v2x_cspart", "v2xdd_dd", "v2xel_locelec",
    "v2xel_regelec", "v2xdl_delib", "v2x_egal", "v2xeg_eqprotec", "v2xeg_eqaccess",
    "v2xeg_eqdr"
]

vdem_select = vdem_raw[indices + high_level + low_level]

episodes["duration"] = episodes["end_year"] - episodes["start_year"] + 1
episodes["total_change"] = episodes.apply(lambda row: round(vdem_select[(vdem_select["country_name"] == row["country_name"]) & (vdem_select["year"] == row["end_year"])]["v2x_polyarchy"].values[0] - vdem_select[(vdem_select["country_name"] == row["country_name"]) & (vdem_select["year"] == row["start_year"])]["v2x_polyarchy"].values[0], 5), axis=1)
episodes["total_change_abs"] = episodes["total_change"].abs()
episodes["slope"] = episodes.apply(lambda row: round(row["total_change"] / row["duration"], 5), axis=1)
episodes["slope_abs"] = episodes["slope"].abs()

episodes["reference_peak_year"] = episodes.apply(
    lambda row: (
        (
            lambda subset: subset.loc[
                (subset["v2x_polyarchy"].diff() * row["sign"]).idxmax(), "year"
            ] if not subset.empty and not (subset["v2x_polyarchy"].diff().isna().all()) else None
        )(
            vdem_select[
                (vdem_select["country_name"] == row["country_name"]) &
                (vdem_select["year"] >= row["start_year"]) &
                (vdem_select["year"] <= row["end_year"])
            ]
        )
    ),
    axis=1
)

new_cols = []

for index in numeric:
    delta = episodes.apply(
        lambda row: round(
            vdem_select.loc[
                (vdem_select["country_name"] == row["country_name"]) & 
                (vdem_select["year"] == row["end_year"]), index
            ].values[0] -
            vdem_select.loc[
                (vdem_select["country_name"] == row["country_name"]) & 
                (vdem_select["year"] == row["start_year"]), index
            ].values[0], 5
        ),
        axis=1
    )
    
    delta_abs = delta.abs()
    
    peak_yoy_abs = episodes.apply(
        lambda row: round(
            (vdem_select.loc[
                (vdem_select["country_name"] == row["country_name"]) &
                (vdem_select["year"] >= row["start_year"] - 2) &
                (vdem_select["year"] <= row["end_year"] + 2), index
            ].diff() * row["sign"]).max(), 5
        ),
        axis=1
    )
    
    peak_yoy_year = episodes.apply(
        lambda row: (
            (
                lambda subset: subset.loc[
                    (subset[index].diff() * row["sign"]).idxmax(), "year"
                ] if not subset.empty and not (subset[index].diff().isna().all()) else None
            )(
                vdem_select[
                    (vdem_select["country_name"] == row["country_name"]) &
                    (vdem_select["year"] >= row["start_year"] - 2) &
                    (vdem_select["year"] <= row["end_year"] + 2)
                ]
            )
        ),
        axis=1
    )
    
    lead = episodes["reference_peak_year"] - peak_yoy_year
    
    new_cols.append(pd.DataFrame({
        f"delta__{index}": delta,
        f"delta_abs__{index}": delta_abs,
        f"peak_yoy_abs__{index}": peak_yoy_abs,
        f"peak_yoy_year__{index}": peak_yoy_year,
        f"lead__{index}": lead
    }))

episodes = pd.concat([episodes] + new_cols, axis=1)

episodes.insert(0, "episode_id", episodes.apply(lambda row: f"{row['country_name']}_{row['start_year']}_{row['end_year']}", axis=1))
episodes.insert(1, "region", episodes.apply(lambda row: vdem_raw[vdem_raw["country_name"] == row["country_name"]]["e_regiongeo"].values[0], axis=1))
episodes.to_csv("../../data/processed/episodes.csv", index=False)

In [None]:
id_vars = ["episode_id", "country_name", "region", "start_year", "end_year", "type"]

records = []

for index in numeric:
    sub = episodes[id_vars + [f"delta__{index}", f"delta_abs__{index}", f"peak_yoy_abs__{index}", f"peak_yoy_year__{index}", f"lead__{index}"]].copy()

    sub["component"] = index
    sub = sub.rename(columns={
        f"delta__{index}": "delta",
        f"delta_abs__{index}": "delta_abs",
        f"peak_yoy_abs__{index}": "peak_yoy_abs",
        f"peak_yoy_year__{index}": "peak_yoy_year",
        f"lead__{index}": "lead"
    })

    records.append(sub)

df_long = pd.concat(records, ignore_index=True).sort_values(by=["episode_id", "component"])

df_long.to_csv("../../data/processed/episodes_long.csv", index=False)

## Data summary

In [None]:
vdem_select.sample(5)

In [None]:
vdem_summary = pd.DataFrame(index=low_level + high_level)
vdem_summary["datastart"] = vdem_select[low_level + high_level].apply(lambda x: vdem_select[vdem_select[x.name].notnull()]["year"].min())
vdem_summary["percent_nonnull"] = vdem_select[low_level + high_level].apply(lambda x: vdem_select.loc[vdem_select["year"] >= vdem_summary.loc[x.name, "datastart"], x.name].notnull().mean() * 100)
vdem_summary["min"] = vdem_select[low_level + high_level].min()
vdem_summary["max"] = vdem_select[low_level + high_level].max()

vdem_summary

In [None]:
episodes.sample(5)

In [None]:
episodes_data_cols = ["duration", "total_change", "total_change_abs", "slope", "slope_abs"]

episodes_summary = pd.DataFrame(index=episodes_data_cols)
episodes_summary["mean"] = episodes[episodes_data_cols].mean()
episodes_summary["std"] = episodes[episodes_data_cols].std()
episodes_summary["min"] = episodes[episodes_data_cols].min()
episodes_summary["max"] = episodes[episodes_data_cols].max()

episodes_summary

## Episode examples

### Canadian autocratization episode (1913-1918)

In [None]:
canada_aut_episode = vdem_select[(vdem_select["country_name"] == "Canada") & (vdem_select["year"] >= 1913) & (vdem_select["year"] <= 1918)].copy()
canada_aut_episode = canada_aut_episode.melt(id_vars=["year"], value_vars=high_level, var_name="indicator", value_name="value")
canada_aut_episode["indicator"] = canada_aut_episode["indicator"].map(high_level_naming)

alt.Chart(canada_aut_episode).mark_line().encode(
    x=alt.X("year", axis=alt.Axis(format=".0f", tickCount=len(canada_aut_episode["year"].unique())), title="Year"),
    y=alt.Y("value", scale=alt.Scale(zero=False), title="Index Value"),
    color=alt.Color("indicator", title="Index")
)
    

### Canadian democratization episode (1918-1922)

In [None]:
canada_dem_episode = vdem_select[(vdem_select["country_name"] == "Canada") & (vdem_select["year"] >= 1918) & (vdem_select["year"] <= 1922)].copy()
canada_dem_episode = canada_dem_episode.melt(id_vars=["year"], value_vars=high_level, var_name="indicator", value_name="value")
canada_dem_episode["indicator"] = canada_dem_episode["indicator"].map(high_level_naming)

alt.Chart(canada_dem_episode).mark_line().encode(
    x=alt.X("year", axis=alt.Axis(format=".0f", tickCount=len(canada_dem_episode["year"].unique())), title="Year"),
    y=alt.Y("value", scale=alt.Scale(zero=False), title="Index Value"),
    color=alt.Color("indicator", title="Index")
)

## Episode summary visualizations

In [None]:
alt.Chart(episodes).mark_boxplot().encode(
    x=alt.X("duration", title="Episode Duration (years)"),
    y=alt.Y("type", title="Episode Type"),
    color=alt.Color("type", title="Episode Type")
)

In [None]:
alt.Chart(episodes).mark_boxplot().encode(
    x=alt.X("total_change_abs", title="Absolute Change in Electoral Democracy Index"),
    y=alt.Y("type", title="Episode Type"),
    color=alt.Color("type", title="Episode Type")
)

In [None]:
total_change_plot = alt.Chart(episodes).mark_circle().encode(
    alt.X("end_year", title="Episode End Year", axis=alt.Axis(labels=False, title=None, ticks=False), scale=alt.Scale(zero=False)),
    alt.Y("total_change_abs", title="Absolute Change in Electoral Democracy Index"),
    alt.Color("type", title="Episode Type")
).facet(column=alt.Column("type:N", title="Episode Type"))

duration_plot = alt.Chart(episodes).mark_circle().encode(
    alt.X("end_year", title="Episode End Year", axis=alt.Axis(format=".0f"), scale=alt.Scale(zero=False)),
    alt.Y("duration", title="Episode Duration (years)"),
    alt.Color("type", title="Episode Type")
).facet(column=alt.Column("type:N", title=None, header=None))

alt.vconcat(total_change_plot, duration_plot).resolve_scale(
    x='shared',
    color='shared'
)

In [None]:
alt.Chart(episodes).mark_circle().encode(
    x=alt.X("duration", title="Episode Duration (years)"),
    y=alt.Y("total_change_abs", title="Absolute Change in Electoral Democracy Index"),
    color=alt.Color("type", title="Episode Type"),
    column=alt.Column("type:N", title="Episode Type")
)

In [None]:
alt.Chart(episodes).transform_density(
    "slope_abs",
    as_=["slope_abs", "density"],
    groupby=["type"]
).mark_area(orient="horizontal").encode(
    alt.X("density:Q")
        .stack('center')
        .impute(None)
        .title(None)
        .axis(labels=False, values=[0], grid=False, ticks=True),
    alt.Y("slope_abs", title="Slope (absolute value)"),
    alt.Color("type", title="Episode Type"),
    alt.Column("type", title="Episode Type")
        .spacing(0)
        .header(titleOrient='bottom', labelOrient='bottom', labelPadding=0)
).configure_view(
    stroke=None
)

## Preliminary sketches

### Question 1: Which components tend to lead or lag overall regime change? Do some freedoms consistently shift earlier or later during the process?

#### Sketch 1: Line plot of average component changes relative to edges of episode

<img src="./images/episode_component_trajectories.png" width="600" />

- Supports Characterize distribution, Sort, Find extremum
- Reasonable for showing patterns over time, but may be cluttered with many components, and exact timings may be hard to interpret (discriminability is quite low).

#### Sketch 2: Bar plot of mean lead/lag time by component

<img src="./images/episode_component_lag_bar.png" width="600" />

- Supports Sort, Find extremum
- Highly discriminable for lead/lag times, but fails to express the underlying trend of how a component changes over time.

#### Sketch 3: Heatmap of component change over time relative to episode

<img src="./images/episode_component_lag_heatmap.png" width="600" />

- Supports Characterize distribution, Sort, Find extremum
- Balances trend visibility and discriminability very effectively. May be less intuitive for some audiences (scale might be confusing, maybe an alternate measure could be used?).
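
The matrix behind a heatmap of this kind could be assembled roughly as follows. This is a sketch under assumptions: a hypothetical panel `toy_panel` with one row per episode and relative year (`episode_t`), two components with invented values, and the mean year-over-year change as the cell measure.

```python
import pandas as pd

# Toy panel: component values by relative year within two episodes (invented).
toy_panel = pd.DataFrame({
    "episode_id": ["A"] * 3 + ["B"] * 3,
    "episode_t": [0, 1, 2] * 2,
    "v2x_suffr": [0.50, 0.70, 0.72, 0.40, 0.60, 0.60],
    "v2x_jucon": [0.30, 0.31, 0.45, 0.20, 0.22, 0.40],
})
components = ["v2x_suffr", "v2x_jucon"]

# Year-over-year change computed within each episode (first year is NaN).
yoy = toy_panel.groupby("episode_id")[components].diff()
yoy["episode_t"] = toy_panel["episode_t"]

# Mean change per relative year; transpose so rows are components
# and columns are years relative to the episode start.
heat = yoy.dropna().groupby("episode_t")[components].mean().T
print(heat)
```

In this toy matrix `v2x_suffr` changes most at t = 1 and `v2x_jucon` at t = 2, which is the lead/lag pattern the heatmap is meant to expose. Melting `heat` into long form would give one row per cell for plotting.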

#### Medium-fidelity prototype

<img src="./images/episode_component_lag_heatmap_mediumfidelity.png" width="600" />

- Legend title altered for clarity, color scale adjusted as the original more strongly implied a binary that did not exist. This version is more expressive of the continuous nature of the data, and less likely to mislead.

### Question 2: Which dimensions of democracy are most resistant to decline during episodes of autocratization, and which are the most vulnerable?

#### Sketch 1: Bar plot of percentage of episodes with decline by component

<img src="./images/episode_component_change_rank.png" width="600" />

- Supports Characterize distribution, Sort, Find anomalies
- Very discriminable for the relative vulnerability of each component, but does not express the underlying trends of how each component changes or the magnitude of those changes.

#### Sketch 2: Kaplan-Meier style survival plot of components during autocratization episodes

<img src="./images/episode_component_survival.png" width="600" />

- Supports Characterize distribution, Sort, Find anomalies
- Expresses the trend of component decline over time far better than the bar plot, and remains relatively easy to interpret (although the concept of a survival curve may not be perfectly suited for this data). Discriminability may be lower once more components are represented.

#### Sketch 3: Boxplots of component decline magnitudes during autocratization episodes

<img src="./images/episode_component_decline_boxplots.png" width="600" />

- Supports Characterize distribution, Sort, Find anomalies
- Expresses the distribution of decline magnitudes very well. Temporal trends are not represented, but this is less relevant for this question. If sorted along the component axis, discriminability will likely remain very high even with a large number of components.

#### Medium-fidelity prototype

<img src="./images/episode_component_decline_boxplots_mediumfidelity.png" width="600" />

- Very little adjustment needed. Color scheme was adjusted to match the other medium-fidelity prototype for consistency within the overall analysis. It seems unlikely that any major changes could improve this visualization significantly.

### Question 3: Are the temporal patterns of autocratization/democratization symmetric? Do gains tend to take place slowly over time while losses are abrupt, or vice versa?

#### Sketch 1: Violin plots of episode change rates by type

<img src="./images/episode_slope_violin.png" width="600" />

- Supports Characterize distribution
- Effectively shows the distribution of change rates for each episode type. Comparison between each type is very straightforward, as the two shapes can simply be visually compared. However, temporal trends within each episode type are not represented.

#### Sketch 2: Line plot of mean trajectory of absolute change

<img src="./images/episode_mean_trajectory.png" width="600" />

- Supports Characterize distribution
- Effectively shows both the overall change rate and within-episode temporal trends. However, it may be challenging for a reader to understand the concept of the 'mean trajectory'. The chart will remain highly discriminable even with many data points but may be so sparse as to appear uninformative.
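
Since the 'mean trajectory' may be unfamiliar, a small sketch of one possible definition may help: for each episode, take the absolute change from the episode's first year, then average that quantity across episodes of the same type at each relative year. The `toy_traj` table and its values are invented, and this definition is an assumption rather than the one used for the sketch image.

```python
import pandas as pd

# Toy panel: index value by relative year for three episodes (invented values).
toy_traj = pd.DataFrame({
    "episode_id": ["A"] * 3 + ["B"] * 3 + ["C"] * 3,
    "type": ["democratization"] * 6 + ["autocratization"] * 3,
    "episode_t": [0, 1, 2] * 3,
    "v2x_polyarchy": [0.20, 0.25, 0.35, 0.40, 0.45, 0.50, 0.60, 0.40, 0.35],
})

# Sort so that the first row of each episode is its starting year.
toy_traj = toy_traj.sort_values(["episode_id", "episode_t"])

# Absolute change relative to the value at the start of each episode.
baseline = toy_traj.groupby("episode_id")["v2x_polyarchy"].transform("first")
toy_traj["abs_change"] = (toy_traj["v2x_polyarchy"] - baseline).abs()

# Mean trajectory: average absolute change per episode type and relative year.
mean_trajectory = (
    toy_traj.groupby(["type", "episode_t"])["abs_change"].mean().unstack("episode_t")
)
print(mean_trajectory)
```

Each row of `mean_trajectory` is one line in the plot. Note that later relative years are averaged over fewer episodes when durations differ, which is one reason the chart can look sparse.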

#### Sketch 3: Scatter plot of cumulative absolute change against episode duration, by episode type

<img src="./images/episode_duration_change_scatter.png" width="600" />

- Supports Characterize distribution
- Effectively shows the relationship between episode duration and total change. This allows the reader to infer the temporal pattern shown in sketch 2, while also providing additional context on the density of different regions of the space represented.

#### Medium-fidelity prototype

<img src="./images/episode_duration_change_scatter_mediumfidelity.png" width="600" />

- Axis labels were altered for clarity. It may be useful to facet this plot by episode type rather than using color, which could improve clarity further.