In [None]:
!pip install lifelines
import pandas as pd
from lifelines import KaplanMeierFitter
import matplotlib.pyplot as plt

# Kaplan-Meier Analysis for COMPAS Recidivism Predictions

Objectives:

1. Compare survival probabilities (recidivism rates) by COMPAS risk score levels (Low, Medium, High).
2. Explore recidivism rates by gender and race.
3. Analyze survival curves to observe disparities between groups over a two-year observation period.

We use the COMPAS risk scores to evaluate recidivism predictions. Kaplan-Meier analysis lets us estimate the probability of "surviving" (not reoffending) over time and compare survival rates between subgroups.
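
To make the estimator concrete, here is a minimal hand-rolled sketch of the Kaplan-Meier (product-limit) calculation on made-up data. The function name `kaplan_meier` and the toy durations are assumptions for illustration only; in the exercises below, `lifelines` does this work for us. At each time with at least one observed event, the survival probability is multiplied by (1 - events / number still at risk), and censored individuals simply leave the risk set without counting as events.

```python
from collections import Counter

def kaplan_meier(durations, events):
    """Return [(t, S(t))] for every time t with at least one observed event."""
    event_counts = Counter(t for t, e in zip(durations, events) if e)
    at_risk = len(durations)
    survival = 1.0
    curve = []
    for t in sorted(set(durations)):
        d = event_counts.get(t, 0)
        if d:
            # Product-limit step: shrink S(t) by the fraction that had the event at t
            survival *= 1 - d / at_risk
            curve.append((t, survival))
        # Everyone whose duration equals t (event or censored) leaves the risk set
        at_risk -= sum(1 for x in durations if x == t)
    return curve

# Toy data (hypothetical): days until re-offense; event=0 means censored
durations = [100, 200, 200, 730, 730]
events = [1, 1, 0, 0, 0]
curve = kaplan_meier(durations, events)
for t, s in curve:
    print(f"day {t}: S(t) = {s:.2f}")
```

Note that the person censored at day 200 lowers the number at risk afterwards but does not lower the survival curve.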

In [None]:
import numpy as np
import seaborn as sns

from google.colab import drive

drive.mount('/content/drive')

## Exercise 1: Load in Data

Load in the `compas-KM-data.csv` file.

In [None]:
# TODO: Do some EDA on the dataset (explore the columns, shape, etc.)



## Exercise 2: Extract Relevant Data for Kaplan-Meier Analysis

To extract the data we need for the Kaplan-Meier analysis, we first have to manipulate our DataFrame.

*Note*: Create a `two_year_recid` column by checking whether `r_offense_date` falls within two years (730 days) of `screening_date`.

*   Set `start` to 0 (representing release date)
*   Set `end` based on whether recidivism occurred within two years:
    - If `two_year_recid` is 1, set `end` to the days until `r_offense_date`.
    - If `two_year_recid` is 0, set `end` to 730 days.
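
The rules above can be sketched on a tiny made-up DataFrame before applying them to the real data. The DataFrame `toy` and the helper series `days_to_reoffense` are hypothetical names used only for this illustration; the column names follow the ones described in this exercise.

```python
import pandas as pd

# Hypothetical rows: one re-offense within two years, one with no re-offense,
# and one re-offense after the two-year window
toy = pd.DataFrame({
    "screening_date": ["2013-01-01", "2013-01-01", "2013-01-01"],
    "r_offense_date": ["2013-04-11", None, "2015-06-01"],
})
for col in ["screening_date", "r_offense_date"]:
    toy[col] = pd.to_datetime(toy[col])

# Days from screening to re-offense (NaN where there is no re-offense date)
days_to_reoffense = (toy["r_offense_date"] - toy["screening_date"]).dt.days

# 1 if a re-offense date exists and falls within 730 days, otherwise 0
toy["two_year_recid"] = (
    days_to_reoffense.notna() & (days_to_reoffense <= 730)
).astype(int)

toy["start"] = 0
# Keep the observed days for recidivists; everyone else is censored at 730 days
toy["end"] = days_to_reoffense.where(toy["two_year_recid"] == 1, 730).astype(int)

print(toy[["two_year_recid", "start", "end"]])
```

This is one possible approach rather than the only correct one; any method that produces the same `two_year_recid`, `start`, and `end` values works for the later exercises.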





In [None]:

# TODO: Convert dates to datetime format (['screening_date'], ['r_offense_date'], ['in_custody'], ['out_custody'])


# TODO: Calculate two_year_recid based on whether recidivism occurred within 2 years
# We'll create a new column 'two_year_recid': 1 for recidivated within 2 years, otherwise 0
# First check if 'r_offense_date' exists for each row of the data set, then check whether the time between
# 'screening_date' and 'r_offense_date' is 730 days or less


# Set start time to 0 (create a ['start'] column)


# TODO: Define end time based on recidivism status (create an ['end'] column)


# TODO: Drop rows with missing data: subset of 'screening_date' and 'end'


# TODO: Display the first few rows of the prepared DataFrame



## Exercise 3: Kaplan-Meier Analysis for COMPAS Risk Levels

1. Use `lifelines` to fit a Kaplan-Meier model for survival analysis.
2. Plot the survival curves for different risk levels (`Low`, `Medium`, `High`) to observe the recidivism trends over time.

*Note*: Remove rows with missing or "N/A" values in the risk level column.
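
As a rough sketch of this kind of filtering, the snippet below builds a boolean mask on a made-up DataFrame. The names `toy`, `mask`, and `toy_filtered` are assumptions for illustration. Depending on how the CSV was read, "N/A" strings may already have been parsed as missing values, so the sketch checks for both cases.

```python
import pandas as pd

# Hypothetical data with a literal "N/A" string, a missing value,
# and one duration outside the 0-730 day window
toy = pd.DataFrame({
    "score_text": ["Low", "N/A", None, "High", "Medium"],
    "end": [730, 200, 150, 45, 900],
})

mask = (
    toy["score_text"].notna()          # drop missing risk levels
    & (toy["score_text"] != "N/A")     # drop the literal "N/A" label
    & toy["end"].between(0, 730)       # keep durations within the two-year window
)
toy_filtered = toy[mask]
print(toy_filtered)
```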

In [None]:
# Initialize Kaplan-Meier fitter
kmf = KaplanMeierFitter()

# TODO: Filter out rows with missing or "N/A" risk levels (score_text) and keep only rows where 'end' is between 0 and 730 days
# Store the result in df_filtered (.notna() might be helpful!)

# NOTE: for the following plots, we've initialized the structure for you
# TODO: fill in *** placeholders
plt.figure(figsize=(10, 6))

# TODO: Plot Kaplan-Meier curves for each risk level ('score_text', or 'risk_level' if you renamed it)
for level in df_filtered['***'].unique():
    subset = df_filtered[df_filtered['***'] == level]
    kmf.fit(durations=subset['***'], event_observed=subset['***'], label=f'***: {level}')
    kmf.plot()

# TODO: Customize the plot, labels are up to you!
plt.title("***")
plt.xlabel("***")
plt.ylabel("***")
plt.xlim(***)  # Set x-axis limit to 0, 730
plt.xticks(range(***))  # Show ticks from 0 to 730 in intervals of 100 days

# Display the plot
plt.show()

## Exercise 4: Kaplan-Meier Analysis by Demographic Groups (Sex)

1. Compare Kaplan-Meier curves across `sex`.
2. Compare recidivism rates for female and male individuals across different COMPAS risk levels (`Low`, `Medium`, `High`). This allows us to see how recidivism probabilities differ by both gender and risk classification.

Helpful Tips:
*   Use two side-by-side subplots to display the curves for female and male individuals separately.
*   Ensure both plots span from 0 to 730 days, with x-axis tick marks every 100 days.
*   Clearly label each plot with the corresponding sex, and include legends to show the COMPAS risk levels.

In [None]:
# TODO: Initialize Kaplan-Meier fitter


# Initialize subplots for side-by-side display
fig, axes = plt.subplots(1, 2, figsize=(14, 6))

# TODO: Define COMPAS risk levels to iterate over
# TODO: Fill in the *** placeholders
risk_levels = ['***', '***', '***']
colors = ['***', '***', '***']  # Colors for each risk level, blue, green, etc.

# Plot for Female
female_subset = df[df['***'] == '***']
for risk, color in zip(risk_levels, colors):
    level_data = female_subset[female_subset['***'] == risk] # score_text
    kmf.fit(durations=level_data['***'], event_observed=level_data['***'], label=f'***: {risk}')
    kmf.plot(ax=axes[0], color=color)

# TODO: Customize the plot, labels are up to you!
axes[0].set_title("***")
axes[0].set_xlabel("***")
axes[0].set_ylabel("***")
axes[0].set_xlim(***)  # Set x-axis limit to 0-730
axes[0].set_xticks(range(***))  # Show ticks from 0 to 730 in intervals of 100 days

# Plot for Male
male_subset = df[df['***'] == '***']
for risk, color in zip(risk_levels, colors):
    level_data = male_subset[male_subset['***'] == risk] # score_text
    kmf.fit(durations=level_data['***'], event_observed=level_data['***'], label=f'***: {risk}')
    kmf.plot(ax=axes[1], color=color)

# TODO: Customize the plot, labels are up to you!
axes[1].set_title("***")
axes[1].set_xlabel("***")
axes[1].set_ylabel("***")
axes[1].set_xlim(***)  # Set x-axis limit to 0-730
axes[1].set_xticks(range(***))  # Show ticks from 0 to 730 in intervals of 100 days

# Adjust layout and add a legend
axes[0].legend(title="***")
axes[1].legend(title="***")
plt.tight_layout()

# Display the plots
plt.show()


## Exercise 5: Kaplan-Meier Analysis by Demographic Groups (Race)

1. Compare Kaplan-Meier curves across race.
2. Compare recidivism rates for two different racial groups, each broken down by COMPAS risk levels (Low, Medium, High). This will help us examine how survival probabilities differ by both race and risk classification.

Helpful Tips:
*   Use two side-by-side subplots to display the curves for the two racial groups separately.
*   Ensure both plots span from 0 to 730 days, with x-axis tick marks every 100 days.
*   Clearly label each plot with the corresponding racial group, and include legends to show the COMPAS risk levels.

*Note*: Select any two races from the dataset (e.g., Caucasian, African-American, Hispanic, Asian, Other); race strings are case-sensitive. Feel free to try different combinations and observe any trends.

In [None]:
# Initialize Kaplan-Meier fitter


# Initialize subplots for side-by-side display
fig, axes = plt.subplots(1, 2, figsize=(14, 6))

# Define the two racial groups you want to compare (feel free to change these; CASE-SENSITIVE)
race1 = '***'
race2 = '***'

# Define COMPAS risk levels to iterate over
risk_levels = ['***', '***', '***']
colors = ['***', '***', '***']  # Colors for each risk level

# Plot for the first racial group
race1_subset = df[df['***'] == race1]
for risk, color in zip(risk_levels, colors):
    level_data = race1_subset[race1_subset['***'] == risk] # score_text
    kmf.fit(durations=level_data['***'], event_observed=level_data['***'], label=f'***: {risk}')
    kmf.plot(ax=axes[0], color=color)

# TODO: Customize the plot, labels are up to you!
axes[0].set_title(f"***")
axes[0].set_xlabel("***")
axes[0].set_ylabel("***")
axes[0].set_xlim(***)  # Set x-axis limit to 0-730
axes[0].set_xticks(range(***))  # Show ticks from 0 to 730 in intervals of 100 days

# Plot for the second racial group
race2_subset = df[df['***'] == race2]
for risk, color in zip(risk_levels, colors):
    level_data = race2_subset[race2_subset['***'] == risk] # score_text
    kmf.fit(durations=level_data['***'], event_observed=level_data['***'], label=f'***: {risk}')
    kmf.plot(ax=axes[1], color=color)

# TODO: Customize the plot, labels are up to you!
axes[1].set_title(f"***")
axes[1].set_xlabel("***")
axes[1].set_ylabel("***")
axes[1].set_xlim(***)  # Set x-axis limit to 0-730
axes[1].set_xticks(range(***))  # Show ticks from 0 to 730 in intervals of 100 days

# Adjust layout and add a legend
axes[0].legend(title="***")
axes[1].legend(title="***")
plt.tight_layout()

# Display the plots
plt.show()



## Exercise 6: Calculate Median Survival Times

**Objective**: Calculate the median survival time, which indicates the time at which 50% of the population has recidivated.

1. First, calculate the median survival time for the entire dataset. This gives a general benchmark of when half of the population has recidivated.
2. Then, calculate the median survival time for each COMPAS risk level (e.g., Low, Medium, High). This lets us see whether the timing of recidivism differs noticeably between risk groups.

*Note*: Filter out entries with 'N/A' in the `score_text` column if needed.

*Note*: If you get infinity as an answer, you did not do anything wrong! Think about what this means in the context of the objective above.
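
To see how a median survival time is read off a survival curve, here is a small sketch that uses made-up curves rather than the COMPAS data. The function `median_survival_time` and both curves are hypothetical; the sketch assumes the common convention that the median is the first time at which the survival probability is at or below 0.5.

```python
import math

def median_survival_time(curve):
    """Return the first time at which S(t) <= 0.5, or math.inf if that never happens."""
    for t, s in curve:
        if s <= 0.5:
            return t
    return math.inf

# Hypothetical survival curves given as (day, survival probability) pairs
curve_a = [(100, 0.8), (300, 0.5), (600, 0.4)]
curve_b = [(150, 0.9), (400, 0.8), (700, 0.7)]

print(median_survival_time(curve_a))
print(median_survival_time(curve_b))
```

The second curve never reaches 0.5 within the observation window, so no finite median can be reported for it.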

In [None]:

# TODO: Fill in the *** placeholders

# TODO: Filter out entries with 'N/A' in the 'score_text' column, or 'risk_level' if you renamed it
df = df[df['***'] != '***']

# TODO: Initialize the Kaplan-Meier fitter


# TODO: Calculate median survival time for the entire dataset
kmf.fit(durations=df['***'], event_observed=df['***'])
median_survival = kmf.median_survival_time_

# TODO: Print out the median survival times


# Median survival times by risk level
for level in df['***'].unique():  # Use 'score_text', or your renamed risk-level column
    subset = df[df['***'] == level]
    kmf.fit(durations=subset['***'], event_observed=subset['***'])
    median_survival = kmf.median_survival_time_

    # TODO: Print these out to compare the risk levels (use f-strings!)


## Exercise 7: Interpret the Results

**Objective**: Reflect on and interpret the Kaplan-Meier analysis results:
1. How does the recidivism rate change across different COMPAS risk levels?
    - Observe the median survival times for each risk level. Does the median survival time decrease as the risk level increases (i.e., High risk has a shorter median survival time than Low risk)?

2. Are there any noticeable differences in recidivism rates among gender or racial groups?
    - Compare the Kaplan-Meier curves and survival probabilities for different genders and racial groups from earlier exercises. Do certain groups recidivate faster than others?


`Write responses here!` (double click)

*   Or check below this cool plot for our interpretations :D

In [None]:
# Cool plot that displays every race in this dataset
# Initialize the plot
plt.figure(figsize=(10, 6))

# Plot Kaplan-Meier curves for each race
for race in df['race'].unique():
    subset = df[df['race'] == race]
    kmf.fit(durations=subset['end'], event_observed=subset['two_year_recid'], label=f'Race: {race}')
    kmf.plot()

# Customize the plot
plt.title("Kaplan-Meier Curve by Race (0 to 730 Days)")
plt.xlabel("Days since Screening")
plt.ylabel("Survival Probability (No Recidivism)")
plt.xlim(0, 730)  # Set x-axis limit to 0-730
plt.xticks(range(0, 731, 100))  # Show ticks from 0 to 730 in intervals of 100 days

# Display the plot
plt.show()


1. *a.* **High-Risk Group:** Median survival time is 272 days, which indicates that half of the individuals in this group reoffend within about 9 months after release. This suggests a high recidivism rate among high-risk individuals.

   *b.* **Medium-Risk Group:** Median survival time is 431 days, meaning this group takes longer to reach the 50% recidivism mark compared to the high-risk group, which is expected.

   *c.* **Low-Risk Group:** Median survival time is "infinity," implying that fewer than 50% of low-risk individuals reoffend within the two-year observation period. This is consistent with their lower risk classification.

2. *a.* **Gender:** The survival probabilities likely reveal that men recidivate at a higher rate and at earlier time points than women. Men in the high-risk group may show a faster decline in survival probability, suggesting they are more prone to reoffend. For women, even in the high-risk group, the survival probability may decrease more slowly than for men.

   *b.* **Race:** If we compare different racial groups (e.g., Caucasian and African-American), the curves may indicate disparities. For instance, African-American individuals may exhibit a higher rate of recidivism than Caucasian individuals in the same risk category, potentially highlighting racial disparities. This could be due to a range of socio-economic factors and historical biases that may influence both the likelihood of reoffending and the assessment itself.