# Kaplan-Meier Estimator for Analyzing Programming Language Survival on GitHub

## Introduction

The Kaplan-Meier estimator is a non-parametric statistic used to estimate the survival function from lifetime data. It is commonly used in medical research to measure the fraction of patients living for a certain amount of time after treatment. In this notebook, we will apply the Kaplan-Meier estimator to analyze the survival of programming languages using data collected from GitHub repositories.

## Objective

The primary objective of this tutorial is to understand how to use the Kaplan-Meier estimator to study the longevity and retention of GitHub repositories based on their programming language. By the end of this notebook, you will be able to:
- Load and preprocess GitHub repository data.
- Calculate survival times for repositories.
- Apply the Kaplan-Meier estimator to this data.
- Compare the survival functions of multiple programming languages.
- Interpret the results of the Kaplan-Meier survival curves.

## Data Description

We will use a dataset that contains information about various GitHub repositories. The dataset has the following columns:
- `name`: Name of the repository.
- `full_name`: Full name of the repository.
- `description`: Description of the repository.
- `search_key`: Search key used to find the repository.
- `language`: Programming language used in the repository.
- `key_words`: Keywords associated with the repository.
- `stars`: Number of stars the repository has received.
- `forks`: Number of times the repository has been forked.
- `watchers`: Number of watchers of the repository.
- `open_issues`: Number of open issues in the repository.
- `created_at`: Date and time when the repository was created.
- `updated_at`: Date and time when the repository was last updated.
- `pushed_at`: Date and time when the repository was last pushed.
- `releases`: Number of releases of the repository.

We will focus on analyzing the survival time, calculated as the duration from the repository's creation date to its last update date.

## Prerequisites

To follow this notebook, you should have a basic understanding of Python programming and familiarity with data analysis libraries such as Pandas, NumPy, and Matplotlib. Additionally, you should have the Lifelines library installed for survival analysis.

## Notebook Outline

1. **Install Required Libraries**: Ensure all necessary libraries are installed.
2. **Load and Inspect Data**: Load the dataset into a Pandas DataFrame and inspect its structure.
3. **Filter Data for Analysis**: Focus on specific programming languages or other criteria.
4. **Apply the Kaplan-Meier Estimator**: Use the Lifelines library to fit the Kaplan-Meier estimator.
5. **Compare Multiple Languages**: Repeat the fitting process for each language and plot them together.
6. **Interpret the Results**: Understand the retention and longevity of repositories for each programming language.
7. **Confidence Intervals**: Include confidence intervals to represent uncertainty.
8. **Log-rank Test**: Compare survival distributions among languages.
9. **Bonferroni Correction**: Adjust significance levels for multiple comparisons.


---



# Import libraries

In [1]:
%%capture
!pip install lifelines
!pip install plotly

In [2]:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
from datetime import datetime
import math

import plotly.express as px
import plotly.graph_objects as go
import plotly.io as pio

from lifelines import KaplanMeierFitter
from lifelines.plotting import plot_lifetimes
from lifelines.statistics import logrank_test

# Data Loading

In [3]:
# Load the data
"""
loadin data from github and handling errors with try-except blocks

"""
import os
data_path = os.getenv('DATA_PATH', '/content/processed_languages_frameworks.csv')
#data_path = os.getenv('DATA_PATH', '/content/drive/MyDrive/Survana/a_languages.csv')

try:
    data = pd.read_csv(data_path)
except FileNotFoundError:
    print("Error: File not found. Please check the file path.")
    raise
except pd.errors.EmptyDataError:
    print("Error: The file is empty. Please check the file content.")
    raise
except Exception as e:
    print(f"An error occurred: {e}")
    raise

In [4]:
data

Unnamed: 0,name,full_name,description,search_key,language,key_words,stars,forks,watchers,open_issues,created_at,updated_at,pushed_at,releases,survival_time,time_since_last_push,event
0,system-design-primer,donnemartin/system-design-primer,Learn how to design large-scale systems. Prep ...,Python,Python,"['design', 'design-patterns', 'design-system',...",264072,44767,264072,435,2017-02-26 16:15:28+00:00,2024-07-12 16:47:58+00:00,2024-06-29 00:09:30+00:00,0,2679,16,0
1,awesome-python,vinta/awesome-python,An opinionated list of awesome Python framewor...,Python,Python,"['awesome', 'collections', 'python', 'python-f...",212323,24654,212323,416,2014-06-27 21:00:06+00:00,2024-07-12 16:45:14+00:00,2024-07-08 23:26:43+00:00,0,3664,6,0
2,tensorflow,tensorflow/tensorflow,An Open Source Machine Learning Framework for ...,Python,C++,"['deep-learning', 'deep-neural-networks', 'dis...",184034,74063,184034,3727,2015-11-07 01:19:20+00:00,2024-07-12 16:44:43+00:00,2024-07-12 16:47:04+00:00,30,3170,3,0
3,Python,TheAlgorithms/Python,All Algorithms implemented in Python,Python,Python,"['algorithm', 'algorithm-competitions', 'algor...",182120,44004,182120,229,2016-07-16 09:44:01+00:00,2024-07-12 16:26:11+00:00,2024-07-12 16:25:12+00:00,0,2918,3,0
4,project-based-learning,practical-tutorials/project-based-learning,Curated list of project-based tutorials,Python,Python,"['beginner-project', 'cpp', 'golang', 'javascr...",181910,23897,181910,148,2017-04-12 05:07:46+00:00,2024-07-12 16:38:09+00:00,2024-07-10 18:02:06+00:00,0,2646,5,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
9973,bpmn-js-example-model-extension,bpmn-io/bpmn-js-example-model-extension,An example of creating a model extension for b...,Ext-JS,JavaScript,['bpmn-js'],27,14,27,0,2019-03-27 08:47:30+00:00,2024-03-12 12:57:10+00:00,2019-04-08 10:57:11+00:00,0,12,1925,1
9974,OS.js-extras,os-js/OS.js-extras,OS.js (v2) - Extra Packages,Ext-JS,JavaScript,[],26,20,26,1,2013-12-02 21:56:08+00:00,2024-01-25 18:35:17+00:00,2017-12-15 18:38:00+00:00,0,1473,2404,1
9975,jsonschema-extractor,toumorokoshi/jsonschema-extractor,Extract jsonschema from various Python objects,Ext-JS,Python,[],26,9,26,2,2017-06-20 03:32:28+00:00,2024-02-23 12:47:24+00:00,2022-10-25 05:09:49+00:00,0,1953,629,1
9976,extractor-js,rsdoiel/extractor-js,Since I originally wrote this a module called ...,Ext-JS,Ext-JS,[],26,4,26,0,2011-08-15 18:08:30+00:00,2023-01-28 13:26:17+00:00,2015-11-16 20:02:44+00:00,0,1554,3164,1


## Function: `kaplan_meier_estimate`

### Description
This function computes the Kaplan-Meier survival estimates for a specified programming language using the Kaplan-Meier Fitter from the `lifelines` library. It filters the provided dataset to focus on the repositories written in the selected language and fits the survival model.

### Parameters
- **data** (DataFrame): A Pandas DataFrame containing the repository data. This DataFrame must include the following columns:
  - `survival_time`: The duration (in days) since the repository was created until the last push.
  - `event`: A binary column indicating whether the repository is considered inactive (1) or still active (0).
  
- **language** (str): The programming language for which the Kaplan-Meier survival estimates are to be calculated.

### Returns
- **kmf** (KaplanMeierFitter): An instance of the `KaplanMeierFitter` object that contains the fitted Kaplan-Meier model for the specified language.


In [5]:
def kaplan_meier_estimate(data, language):
    """Compute Kaplan-Meier survival estimates for a given language."""
    kmf = KaplanMeierFitter()

    # Filter data for the selected language
    language_data = data[data['language'] == language]

    # Fit the model
    kmf.fit(language_data['survival_time'], event_observed=language_data['event'], label=language)

    return kmf

In [6]:
def plot_kaplan_meier(data):
    """Plot the Kaplan-Meier survival function for each language with a dropdown menu."""
    # Get unique languages
    languages = data['language'].unique()

    # Create a figure
    fig = go.Figure()

    # Add Kaplan-Meier curves for each language
    for language in languages:
        kmf = kaplan_meier_estimate(data, language)
        fig.add_trace(go.Scatter(
            x=kmf.survival_function_.index,
            y=kmf.survival_function_[language],
            mode='lines',
            name=language,
            visible=False  # Initially hidden
        ))

    # Make the first language visible by default
    fig.data[0].visible = True

    # Update layout with dropdown and fixed y-axis range
    fig.update_layout(
        title='Kaplan-Meier Survival Function by Programming Language',
        xaxis_title='Survival Time (days)',
        yaxis_title='Survival Probability',
        yaxis=dict(range=[0, 1]),  # Fix y-axis range between 0 and 1
        updatemenus=[{
            'buttons': [
                {
                    'label': language,
                    'method': 'update',
                    'args': [
                        {'visible': [lang == language for lang in languages]},
                        {'title': f'Kaplan-Meier Survival Function - {language}'}
                    ]
                } for language in languages
            ],
            'direction': 'down',
            'showactive': True,
        }]
    )

    # Get overall x-axis range
    overall_x_range = [data['survival_time'].min(), data['survival_time'].max()]

    # Add a red horizontal line at y=0.25
    fig.add_shape(
        type='line',
        x0=overall_x_range[0],
        x1=overall_x_range[1],  # Stretch across the overall x-axis range
        y0=0.25,
        y1=0.25,
        line=dict(color='Red', width=2, dash='dash'),
    )

    # Show the plot
    fig.show()

In [7]:
plot_kaplan_meier(data)

## Confidence Intervals

### Confidence Interval
A confidence interval provides a range of values around the survival estimate that likely contains the true survival probability. For example, a 95% confidence interval indicates that there is a 95% chance the true survival probability lies within that range.

### Importance of Confidence Intervals

- **Uncertainty Representation**: Including confidence intervals allows for a better understanding of the precision of the survival estimates. Wider intervals indicate more uncertainty, while narrower intervals suggest more confidence in the estimates.

- **Statistical Significance**: If confidence intervals for different groups (e.g., programming languages) do not overlap, it may indicate a significant difference in survival probabilities.

- **Decision Making**: Visualizing survival with confidence intervals helps stakeholders make informed decisions based on the reliability of the data, particularly when comparing the longevity of different programming languages or technologies.

### Visualization
When plotted, the Kaplan-Meier survival function appears as a step function, and the confidence intervals are often shown as shaded areas around the curve. This visual representation makes it easier to assess the survival trends and the degree of uncertainty in the estimates over time.


In [8]:
def plot_kaplan_meier_ci(data):
    """Plot the Kaplan-Meier survival function for each language with a dropdown menu."""
    # Get unique languages
    languages = data['language'].unique()

    # Create a figure
    fig = go.Figure()

    # Add Kaplan-Meier curves and confidence intervals for each language
    for language in languages:
        kmf = kaplan_meier_estimate(data, language)
        fig.add_trace(go.Scatter(
            x=kmf.survival_function_.index,
            y=kmf.survival_function_[language],
            mode='lines',
            name=language,
            visible=False  # Initially hidden
        ))
        fig.add_trace(go.Scatter(
            x=kmf.confidence_interval_.index,
            y=kmf.confidence_interval_[f'{language}_lower_0.95'],
            mode='lines',
            line=dict(width=0),
            showlegend=False,
            hoverinfo='skip',
            visible=False  # Initially hidden
        ))
        fig.add_trace(go.Scatter(
            x=kmf.confidence_interval_.index,
            y=kmf.confidence_interval_[f'{language}_upper_0.95'],
            mode='lines',
            line=dict(width=0),
            fill='tonexty',
            fillcolor='rgba(0,100,80,0.2)',
            showlegend=False,
            hoverinfo='skip',
            visible=False  # Initially hidden
        ))

    # Make the first language visible by default
    for i in range(3):
        fig.data[i].visible = True

    # Update layout with dropdown and fixed y-axis range
    fig.update_layout(
        title='Kaplan-Meier Survival Function by Programming Language',
        xaxis_title='Survival Time (days)',
        yaxis_title='Survival Probability',
        yaxis=dict(range=[0, 1]),  # Fix y-axis range between 0 and 1
        updatemenus=[{
            'buttons': [
                {
                    'label': language,
                    'method': 'update',
                    'args': [
                        {'visible': [i // 3 == j for i in range(len(languages) * 3)]},
                        {'title': f'Kaplan-Meier Survival Function - {language}'}
                    ]
                } for j, language in enumerate(languages)
            ],
            'direction': 'down',
            'showactive': True,
        }]
    )

    # Get overall x-axis range
    overall_x_range = [data['survival_time'].min(), data['survival_time'].max()]

    # Add a red horizontal line at y=0.25
    fig.add_shape(
        type='line',
        x0=overall_x_range[0],
        x1=overall_x_range[1],  # Stretch across the overall x-axis range
        y0=0.25,
        y1=0.25,
        line=dict(color='Red', width=2, dash='dash'),
    )

    # Show the plot
    fig.show()

In [9]:
plot_kaplan_meier_ci(data)

## Log-rank Test Implementation

The **log-rank test** is a statistical hypothesis test used to compare the survival distributions of two or more groups. It is particularly useful in survival analysis when assessing whether there are significant differences in survival times among different categories, such as programming languages in this case.

### How the Log-rank Test Works

1. **Survival Curves**: The log-rank test compares the observed number of events (e.g., repository inactivity) with the expected number of events at each time point for each group.

2. **Null Hypothesis**: The null hypothesis states that there is no difference in survival between the groups (e.g., different programming languages).

3. **Test Statistic**: The test calculates a statistic that measures the difference between the observed and expected events. A significant difference indicates that the survival functions are not the same.

### Usefulness in this Problem

1. **Compare Programming Languages**: The log-rank test allows to determine if certain languages have significantly different survival times for their repositories, highlighting which languages are more actively maintained.

2. **Statistical Rigor**: Using the log-rank test adds statistical rigor to this analysis, providing evidence to support claims about language popularity and longevity in the software community.

3. **Decision Making**: If significant differences are found, this information can guide decisions regarding which languages to use in future projects or which ones may be declining in popularity.




In [10]:
def log_rank_test(data):
    """Perform log-rank tests between all pairs of languages."""
    languages = data['language'].unique()
    results = {}

    for i in range(len(languages)):
        for j in range(i + 1, len(languages)):
            lang1 = languages[i]
            lang2 = languages[j]

            kmf1 = kaplan_meier_estimate(data, lang1)
            kmf2 = kaplan_meier_estimate(data, lang2)

            # Perform log-rank test
            test_result = logrank_test(kmf1.event_table.index, kmf2.event_table.index,
                                        event_observed_A=kmf1.event_table['observed'],
                                        event_observed_B=kmf2.event_table['observed'])
            results[f"{lang1} vs {lang2}"] = test_result.p_value

    return results


log_rank_results = log_rank_test(data)
print(log_rank_results)

{'Python vs C++': 0.42459171005385143, 'Python vs Shell': 0.5189901351061796, 'Python vs JavaScript': 2.001191904683171e-15, 'Python vs Java': 1.400635419901411e-08, 'Python vs TypeScript': 2.0945016553222943e-14, 'Python vs Scala': 1.5604706298558666e-05, 'Python vs PHP': 0.03651606038085976, 'Python vs Rust': 1.4840256257273644e-05, 'Python vs Ruby': 2.5630550356384225e-09, 'Python vs Go': 2.2373035396705085e-05, 'Python vs Kotlin': 0.2967077050422966, 'Python vs C': 0.8089668008534016, 'Python vs C#': 1.8401104733354363e-05, 'Python vs Swift': 0.800638347981992, 'Python vs Objective-C': 0.01151785231635947, 'Python vs Dart': 0.18769012382128702, 'Python vs R': 0.24948693234409378, 'Python vs Haskell': 0.09491199034972388, 'Python vs SQL': 0.6104727617157062, 'Python vs Perl': 0.0721836263120319, 'Python vs MATLAB': 0.0002659879557390626, 'Python vs Julia': 5.311333645066073e-06, 'Python vs PowerShell': 0.6568038822132942, 'Python vs Lua': 0.07350011707897555, 'Python vs Elixir': 0.0

In [11]:
# Prepare the data for Plotly
def prepare_plot_data(log_rank_results, selected_language):
    filtered_results = {key.split(' vs ')[1]: value for key, value in log_rank_results.items() if key.startswith(selected_language)}
    return filtered_results

# Generate plotly figure
def create_figure(selected_language):
    filtered_results = prepare_plot_data(log_rank_results, selected_language)

    fig = go.Figure(data=[
        go.Bar(x=list(filtered_results.keys()), y=list(filtered_results.values()))
    ])

    # Add a red horizontal line at y=0.5
    fig.add_shape(
        type='line',
        x0=-0.5, x1=len(filtered_results)-0.5,
        y0=0.05, y1=0.05,
        line=dict(color='Red', dash='dash')
    )

    fig.update_layout(
        title=f'Log-rank Test P-values for {selected_language}',
        xaxis_title='Languages',
        yaxis_title='P-value',
        yaxis=dict(range=[0, 1]),
        showlegend=False
    )

    return fig

# List of unique languages from log_rank_results
languages = sorted(set([key.split(' vs ')[0] for key in log_rank_results.keys()]))

# Create initial figure
initial_language = languages[0]
fig = create_figure(initial_language)

# Add dropdown menu
dropdown_buttons = [
    dict(
        args=[{"x": [list(prepare_plot_data(log_rank_results, lang).keys())],
               "y": [list(prepare_plot_data(log_rank_results, lang).values())],
               "title": [f'Log-rank Test P-values for {lang}']}],
        label=lang,
        method="update"
    ) for lang in languages
]

# Update figure layout with dropdown menu
fig.update_layout(
    updatemenus=[{
        "buttons": dropdown_buttons,
        "direction": "down",
        "showactive": True
    }]
)

# Show the plot
fig.show()


## Bonferroni Correction Implementation

The **Bonferroni correction** is a multiple comparison correction method used to address the problem of increased Type I errors when multiple statistical tests are conducted simultaneously. This correction adjusts the significance threshold to maintain the overall error rate at a desired level.

### How the Bonferroni Correction Works

1. **Multiple Comparisons**: When performing multiple tests, the likelihood of obtaining at least one significant result due to chance increases. The Bonferroni correction counters this by adjusting the threshold for significance.

2. **Adjusted Significance Level**: The correction divides the desired overall alpha level (e.g., 0.05) by the number of tests being performed. This adjusted alpha is then used as the new threshold for determining significance.

3. **Conservative Approach**: The Bonferroni correction is a conservative method, which means it reduces the likelihood of Type I errors (false positives) but may increase the likelihood of Type II errors (false negatives).

### Usefulness in this Problem

1. **Handling Multiple Comparisons**: When comparing the survival distributions of multiple programming languages, numerous pairwise comparisons are made. The Bonferroni correction helps control the overall Type I error rate in this context.

2. **Robust Statistical Analysis**: By applying the Bonferroni correction, the analysis becomes more robust and credible, reducing the risk of false positives that might suggest significant differences where there are none.

3. **Decision Making**: Correcting for multiple comparisons ensures that only the most statistically significant results are considered, providing a clearer and more reliable basis for decision-making regarding language popularity and longevity.

### Implementing the Bonferroni Correction

When implementing the Bonferroni correction in log-rank test analysis, the significance threshold can adjusted as follows:

- **Initial Alpha Level**: Set desired overall alpha level (e.g., 0.05).
- **Number of Tests**: Count the number of pairwise comparisons being made.
- **Adjusted Alpha Level**: Divide the initial alpha level by the number of tests.

For example, if we perform 10 comparisons and want an overall alpha level of 0.05, the adjusted alpha level for each test would be \(0.05 / 10 = 0.005\). Any p-value below this threshold indicates a significant difference in survival distributions.

By applying the Bonferroni correction, we can confidently identify significant differences between the survival times of repositories for different programming languages, ensuring the reliability of our findings.


In [12]:
def bonferroni_correction(p_values, alpha=0.05):
    """
    Apply Bonferroni correction to a list of p-values.
    """
    n_tests = len(p_values)
    corrected_alpha = alpha / n_tests
    return corrected_alpha

# Calculate corrected significance level
corrected_alpha = bonferroni_correction(list(log_rank_results.values()))

# Generate plotly figure
def create_figure(selected_language, log_rank_results, corrected_alpha):
    filtered_results = {key.split(' vs ')[1]: value for key, value in log_rank_results.items() if key.startswith(selected_language)}

    fig = go.Figure(data=[
        go.Bar(x=list(filtered_results.keys()), y=list(filtered_results.values()))
    ])

    # Add a red horizontal line at corrected alpha
    fig.add_shape(
        type='line',
        x0=-0.5, x1=len(filtered_results)-0.5,
        y0=corrected_alpha, y1=corrected_alpha,
        line=dict(color='Red', dash='dash')
    )

    fig.update_layout(
        title=f'Log-rank Test P-values for {selected_language}',
        xaxis_title='Languages',
        yaxis_title='P-value',
        yaxis=dict(range=[0, 1]),
        showlegend=False
    )

    return fig

# List of unique languages from log_rank_results
languages = sorted(set([key.split(' vs ')[0] for key in log_rank_results.keys()]))

# Create initial figure
initial_language = languages[0]
fig = create_figure(initial_language, log_rank_results, corrected_alpha)

# Add dropdown menu
dropdown_buttons = [
    dict(
        args=[{"x": [list({key.split(' vs ')[1]: value for key, value in log_rank_results.items() if key.startswith(lang)}.keys())],
               "y": [list({key.split(' vs ')[1]: value for key, value in log_rank_results.items() if key.startswith(lang)}.values())],
               "title": [f'Log-rank Test P-values for {lang}']}],
        label=lang,
        method="update"
    ) for lang in languages
]

# Update figure layout with dropdown menu
fig.update_layout(
    updatemenus=[{
        "buttons": dropdown_buttons,
        "direction": "down",
        "showactive": True
    }]
)

# Show the plot
fig.show()
