# ***Notebook to interactively generate Kaplan-Meier plots***

**Notebook made by:** *Eduardo Reyes-Alvarez (Ph.D. candidate)*

**Affiliation:** *Dr. Lois Mulligan's lab, Queen's University.*

**Contact:** *eduardo_reyes09@hotmail.com*

**Date of version:** V01 - June 27, 2023.

## Instructions

***Summary:***

This notebook allows the user to easily and interactively create Kaplan-Meier probability curves (survival, progression, recurrence) using the KaplanMeierFitter class from the lifelines library for Python.

***Note on the current version (V01):***

* This is a first working version that was built based on the METABRIC dataset available on the cBioPortal website (https://www.cbioportal.org/study/summary?id=brca_metabric).
* This version of the notebook makes it easy to generate a KM plot using the whole dataset, or after subdividing the dataset into multiple groups based on any column/variable contained in the clinical file or in the RNA Sequencing file from the same study.
* This version has not yet been tested with other datasets from cBioPortal or elsewhere, and no tests have been done with files other than RNA Seq files.
* Future releases will address the following: better outputs from the notebook (KM plot, raw probability estimates, etc.), enabling the "Using multiple variables" button so the user can make groups based on 2 or more columns/variables, and a version for Google Colab that runs online.

***How to use this notebook:***

1. **Getting the files:** Go to the cBioPortal link above and download the whole METABRIC study (click on the arrow pointing down, on the right side of the study name). This will download a *.tar.gz* compressed file, from which you can extract the files ***data_clinical_patient.txt*** and any of the ***data_mrna_illumina...*** files. You can extract them anywhere (desktop or downloads), but preferably into the same folder as the notebook (if not, don't worry, the notebook will ask you to upload the file(s)).
2. ***Rename*** the clinical file to ***clinical.txt*** and the mRNA one to ***RNA.txt***.
3. ***Open the notebook*** with JupyterLab, select "Run" in the top-left menu, then "Run all cells" and wait (~1min).
4. ***Look under*** the code block on "Start plotting here!".
5. ***Plot!***: start by selecting a time variable (the METABRIC study has only Overall Survival), the column with the event to observe, fill the 0 and 1 widgets with the event (0 = No event/Living, 1 = Event/Died). After that you can plot the whole dataset by clicking the green button at the bottom, or select "Use 1 variable" to explore different variables and make subgroups. Once subgroups are made by filling the widgets with tags or ranges (please avoid overlapping borders, for example: 0.00-30.**00**, 30.**01**-60.00), you can click the green button to observe the multiple survival probabilities.
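
The advice in step 5 about non-overlapping borders can be illustrated with a small stand-alone sketch. The helpers `ranges_overlap` and `assign_group` are hypothetical names used only for this illustration and are not part of the notebook; they assume closed ranges, where both borders belong to the group.

```python
def ranges_overlap(ranges):
    """Return True if any two closed (low, high) ranges share a value."""
    ordered = sorted(ranges)
    return any(prev[1] >= nxt[0] for prev, nxt in zip(ordered, ordered[1:]))


def assign_group(value, ranges):
    """Return the 1-based index of the first range containing value, else None."""
    for index, (low, high) in enumerate(ranges, start=1):
        if low <= value <= high:
            return index
    return None


good = [(0.00, 30.00), (30.01, 60.00)]   # borders do not touch
bad = [(0.00, 30.00), (30.00, 60.00)]    # 30.00 belongs to both ranges

print(ranges_overlap(good))       # False
print(ranges_overlap(bad))        # True
print(assign_group(30.00, good))  # 1
print(assign_group(45.5, good))   # 2
```

With the overlapping borders, a patient with a value of exactly 30.00 would qualify for both groups, which is why the instructions suggest ranges such as 0.00-30.00 and 30.01-60.00.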
   
**NOTES:** Currently, the button "Using multiple variables" does not do anything. In this version you can plot and replot any gene or clinical column in the dataset, but only one at a time. If you do not need the RNA Seq data, the notebook also works with the clinical file alone. At the moment the notebook only displays the plots; future releases will save them as an image file together with an Excel file containing the calculated probabilities, in case you want to plot them in your software of preference and do the statistics there.
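
To make the 0/1 event encoding from step 5 concrete, the following didactic sketch computes a Kaplan-Meier (product-limit) estimate by hand on toy data. It is not the notebook's code and not how lifelines implements the estimator; `kaplan_meier` is a hypothetical helper, and the labels and follow-up times are invented for illustration.

```python
from collections import Counter


def kaplan_meier(durations, events):
    """Return [(time, survival probability)] for every time with at least one event.

    durations: follow-up times; events: 1 = event observed, 0 = censored.
    """
    deaths = Counter(t for t, e in zip(durations, events) if e == 1)
    survival = 1.0
    curve = []
    for t in sorted(deaths):
        at_risk = sum(1 for d in durations if d >= t)  # still under observation at t
        survival *= 1 - deaths[t] / at_risk
        curve.append((t, survival))
    return curve


# Toy data (invented): months of follow-up and text labels for the event column
months = [5, 8, 8, 12, 15, 20]
labels = ["1:DECEASED", "1:DECEASED", "0:LIVING", "1:DECEASED", "0:LIVING", "1:DECEASED"]

# The 0 and 1 widgets play the role of this mapping: text label -> binary event
encoding = {"0:LIVING": 0, "1:DECEASED": 1}
status = [encoding[label] for label in labels]

for time, prob in kaplan_meier(months, status):
    print(f"{time:>3} months: S(t) = {prob:.3f}")
```

Censored patients (label 0) never lower the curve directly; they only leave the group at risk, which is why the event labels have to be assigned correctly before plotting.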

## **Code**

### Install and import required libraries

Some of the required libraries for this notebook are common and are pre-installed, whereas some of them need to be installed before we can import them.
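
As an optional check before running the install commands, the sketch below lists which of the third-party packages are missing from the current environment. It relies only on the standard-library `importlib.util.find_spec`; the list `required` is an assumption based on the imports used in this notebook.

```python
import importlib.util

# Import names used by this notebook (assumed to match the pip package names)
required = ["numpy", "pandas", "matplotlib", "lifelines", "ipywidgets"]

# find_spec returns None when a top-level package cannot be found
missing = [name for name in required if importlib.util.find_spec(name) is None]

if missing:
    print("Install before running the notebook:", " ".join(missing))
else:
    print("All required packages are already installed.")
```

Standard-library modules such as `os`, `logging` and `collections` are always available and never need to be installed with pip.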

In [1]:
# To do specific actions on Google Colab and Google Drive
#from google.colab import drive
#from google.colab import files
#from google.colab import output
import os

# To log relevant information during the runtime
import logging

# To edit and handle the output of each code cell
from IPython.display import display, clear_output, HTML

# General libraries for data handling
!pip install numpy
!pip install pandas
# collections is part of the Python standard library, so it needs no installation
import numpy as np
import pandas as pd
from collections import OrderedDict

# To generate plots and figures
!pip install matplotlib
import matplotlib.pyplot as plt

# To do KM survival analysis
!pip install lifelines
from lifelines import KaplanMeierFitter

# To generate interactive widgets
!pip install ipywidgets
import ipywidgets as widgets
from ipywidgets import Output
from ipywidgets import interact, interactive, fixed, interact_manual, Dropdown, HBox, VBox, Layout, Label

# Clear the output of the cell
clear_output()

The following steps just set up the logging file and handler, so we can evaluate the steps and identify issues in the data processing and the widgets.

In [2]:
# Configure the logging settings
logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(levelname)s - %(message)s')

# Create a logger
logger = logging.getLogger()

# Clear the existing log file or create a new empty one
open('MyLog.txt', 'w').close()

# Create a file handler
file_handler = logging.FileHandler('MyLog.txt')
file_handler.setLevel(logging.INFO)

# Create a formatter and add it to the file handler
formatter = logging.Formatter('%(asctime)s - %(levelname)s - %(message)s')
file_handler.setFormatter(formatter)

# Add the file handler to the logger
logger.addHandler(file_handler)

# Log an initial message
logger.info("Log file created or cleared. \n")
clear_output()
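
A slightly shorter variant of the same setup, shown only as a sketch: `logging.FileHandler` accepts a `mode` argument, so opening the file with `mode="w"` empties it without a separate `open(...).close()` call. The logger name `km_demo` and the file name `demo_log.txt` are placeholders and are not used elsewhere in the notebook.

```python
import logging
import os
import tempfile

# Placeholder location so the sketch does not touch the notebook's MyLog.txt
log_path = os.path.join(tempfile.gettempdir(), "demo_log.txt")

demo_logger = logging.getLogger("km_demo")
demo_logger.setLevel(logging.INFO)

# mode="w" truncates an existing file, so every run starts with an empty log
demo_handler = logging.FileHandler(log_path, mode="w")
demo_handler.setFormatter(logging.Formatter("%(asctime)s - %(levelname)s - %(message)s"))
demo_logger.addHandler(demo_handler)

demo_logger.info("Log file created or cleared.")

# Read the file back to confirm that the message was written
with open(log_path) as log_file:
    print(log_file.read())
```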

### Upload datasets

Request the user to upload the files with clinical data and RNA Seq data to be used. For the moment these need to be .txt files from the cBioPortal website.

In [3]:
def upload_input_files():

    # Check if the clinical file is already in the working directory
    if "clinical.txt" in os.listdir():
        # Read the clinical file and log it
        df_clinical = pd.read_csv("clinical.txt", sep="\t", comment="#")
        logger.info("Uploaded: clinical.txt \n")
        clear_output()
        
        # Check if there is also an RNA file in the directory
        if "RNA.txt" in os.listdir():
            # Read the RNA file
            df_RNA = pd.read_csv("RNA.txt", sep="\t")
            logger.info("Uploaded: RNA.txt \n")
            clear_output()
        else:
            df_RNA = None           
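    else:
        # Browser uploads rely on google.colab, which is only available on Google Colab
        try:
            from google.colab import files
        except ImportError:
            raise FileNotFoundError(
                "clinical.txt was not found in the working directory. "
                "Place it next to the notebook, or run the notebook on Google Colab to upload it."
            ) from None

        # Ask the user to select the files to upload
        print("Select the file(s) to upload (all at once):")
        uploaded_files = files.upload()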

        # Read the clinical file
        df_clinical = pd.read_csv("clinical.txt", sep="\t", comment="#")
        logger.info("Uploaded: clinical.txt \n")

        # Check if an RNA file was uploaded, and read it if so
        if "RNA.txt" in uploaded_files:
            df_RNA = pd.read_csv("RNA.txt", sep="\t")
            logger.info("Uploaded: RNA.txt \n")
        else:
            df_RNA = None
        
    clear_output()

    # Return the dataframes for the next steps
    return df_clinical, df_RNA
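
The clinical file is read with `comment="#"` because cBioPortal clinical files begin with metadata lines prefixed by `#`, followed by the actual header row. The sketch below shows the effect on an invented miniature file; the content of `toy_clinical` is made up and much smaller than a real file.

```python
import io

import pandas as pd

# Invented miniature of a clinical file: '#' lines are metadata, then the header row
toy_clinical = (
    "#Patient Identifier\tOverall Survival (Months)\tOverall Survival Status\n"
    "#STRING\tNUMBER\tSTRING\n"
    "PATIENT_ID\tOS_MONTHS\tOS_STATUS\n"
    "P-001\t140.5\t0:LIVING\n"
    "P-002\t84.6\t1:DECEASED\n"
)

# Lines starting with '#' are skipped, so the third line becomes the header
df_toy = pd.read_csv(io.StringIO(toy_clinical), sep="\t", comment="#")
print(df_toy.columns.tolist())
print(df_toy.shape)
```

Note that `comment="#"` also cuts off the rest of any data line from the first `#` onwards, so this approach assumes that no clinical value contains that character.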

### Pre-process uploaded files

***In development...***

In [4]:
def file_preprocessing(df_clinical, df_RNA):

    ################### Processing for the clinical dataframe ##################

    # Log the original dataframe
    logger.info(f"Preview of the original clinical dataset: \n {df_clinical.iloc[:15, :10].to_string()} \n")
    logger.info(f"Data types of columns in the original clinical dataset: \n {df_clinical.dtypes.to_string()} \n\n")
    clear_output()

    # Prepare the variable with the reordered column names
    clinical_columns_main = ["PATIENT_ID"]
    time_to_event = []
    event_observation = []

    # Search for a column of Vital status
    if "VITAL_STATUS" in df_clinical.columns:
        clinical_columns_main.append("VITAL_STATUS")
        event_observation.append("VITAL_STATUS")

    # Search for Overall Survival columns
    if "OS_MONTHS" in df_clinical.columns and "OS_STATUS" in df_clinical.columns:
        clinical_columns_main.append("OS_MONTHS")
        time_to_event.append("OS_MONTHS")
        clinical_columns_main.append("OS_STATUS")
        event_observation.append("OS_STATUS")

    # Search for Recurrence-Free Survival columns
    if "RFS_MONTHS" in df_clinical.columns and "RFS_STATUS" in df_clinical.columns:
        clinical_columns_main.append("RFS_MONTHS")
        time_to_event.append("RFS_MONTHS")
        clinical_columns_main.append("RFS_STATUS")
        event_observation.append("RFS_STATUS")

    # Search for Progression-Free Survival columns
    if "PFS_MONTHS" in df_clinical.columns and "PFS_STATUS" in df_clinical.columns:
        clinical_columns_main.append("PFS_MONTHS")
        time_to_event.append("PFS_MONTHS")
        clinical_columns_main.append("PFS_STATUS")
        event_observation.append("PFS_STATUS")

    # Order alphabetically the remaining columns
    clinical_columns_extra = [col for col in df_clinical.columns if col not in clinical_columns_main]
    clinical_columns_extra.sort()

    # Apply the re-ordering to the df
    clinical_columns_ordered = clinical_columns_main + clinical_columns_extra
    df_clinical = df_clinical[clinical_columns_ordered] 

    # Log the re-arranged dataframe
    logger.info(f"Preview of the pre-processed clinical dataset: \n {df_clinical.iloc[:15, :10].to_string()} \n")
    logger.info(f"Data types of columns in the pre-processed clinical dataset: \n {df_clinical.dtypes.to_string()} \n\n")
    clear_output()

    ##################### Processing for the RNA dataframe #####################

    # If an RNA file was uploaded, then there is info in the dataframe
    if df_RNA is not None:
        # Log the original dataframe
        logger.info(f"Preview of the original RNA dataset: \n {df_RNA.iloc[:15, :10].to_string()} \n")
        logger.info(f"Data types of some columns in the original RNA dataset: \n {df_RNA.iloc[:, :10].dtypes.to_string()} \n\n")
        clear_output()
        
        # Drop the "Entrez_Gene_Id" column
        df_RNA.drop("Entrez_Gene_Id", axis=1, inplace=True)

        # Rename the "Hugo_Symbol" column to "PATIENT_ID" as it appears in the clinical df
        df_RNA.rename(columns={"Hugo_Symbol": "PATIENT_ID"}, inplace=True)

        # Transpose the dataframe, making the content of the "PATIENT_ID" column the new column names
        df_RNA = df_RNA.set_index("PATIENT_ID").T

        # Sort the gene names alphabetically
        df_RNA.sort_index(axis=1, inplace=True)

        # Reset the index to a numerical index
        df_RNA = df_RNA.reset_index().rename_axis("", axis="columns")

        # Rename the "index" column to "PATIENT_ID"
        df_RNA.rename(columns={"index": "PATIENT_ID"}, inplace=True)

        # Sort Patient IDs and reset the index
        df_RNA = df_RNA.sort_values("PATIENT_ID").reset_index(drop=True)

        # Log the re-arranged dataframe
        logger.info(f"Preview of the pre-processed RNA dataset: \n {df_RNA.iloc[:15, :10].to_string()} \n")
        logger.info(f"Data types of some columns in the pre-processed RNA dataset: \n {df_RNA.iloc[:, :10].dtypes.to_string()} \n\n")
        clear_output()

    ############################################################################

    # Return the variables with survival or progression labels available in the dataset
    return df_clinical, df_RNA, time_to_event, event_observation
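
The reshaping of the RNA dataframe (genes as rows to patients as rows) may be easier to follow on a tiny example. The sketch below mirrors the steps of the function above on invented data; the gene names, IDs and values are placeholders.

```python
import pandas as pd

# Invented miniature of an expression file: one row per gene, one column per patient
df_genes = pd.DataFrame({
    "Hugo_Symbol": ["GENE_B", "GENE_A"],
    "Entrez_Gene_Id": [2, 1],
    "P-002": [6.1, 7.9],
    "P-001": [5.4, 8.2],
})

# Drop the numeric gene ID, use the gene symbol as index and transpose
df_patients = df_genes.drop(columns="Entrez_Gene_Id").set_index("Hugo_Symbol").T

# Sort the gene columns alphabetically and tidy up the axis names
df_patients = df_patients.sort_index(axis=1)
df_patients.columns.name = None
df_patients.index.name = "PATIENT_ID"

# Turn the patient IDs into a regular column and sort the rows by patient
df_patients = df_patients.reset_index().sort_values("PATIENT_ID").reset_index(drop=True)
print(df_patients)
```

After this step the table has one row per patient and a `PATIENT_ID` column, which is the shape needed to merge it with the clinical dataframe.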

### Create interactive widgets

The structure for widget creation and display is the following (in general):

1. Create the widget and the widget output (the latter can be empty).
2. Declare the function that is called when changes are detected in the widget (here you can specify what the output should display or whether more widgets need to be made and displayed).
3. Call the widget's observe method, passing the callback handler function (and the trait to watch, usually "value").
4. Display the widget and widget output.

We can nest multiple subwidgets and outputs so that they appear when certain values in the main widgets are selected, and each can have its own handler function and output widget. This is ideal for fine-tuning the interactive processing; however, the code can get very confusing when handler functions and observe/display/output calls are nested inside others.
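
The four steps above can be sketched with a single dropdown. This is a minimal illustration rather than the notebook's actual widget code: the names `demo_dropdown`, `demo_output` and `demo_handler` are hypothetical, and the widgets only render and respond when the code runs inside a Jupyter notebook.

```python
import ipywidgets as widgets
from IPython.display import display

# 1. Create the widget and its (initially empty) output widget
demo_dropdown = widgets.Dropdown(options=["Click here to select...", "OS_MONTHS"])
demo_output = widgets.Output()


# 2. Declare the handler that is called whenever the observed trait changes
def demo_handler(change):
    with demo_output:
        demo_output.clear_output()
        # Show something only when a real option (not the default) is selected
        if change["new"] != "Click here to select...":
            print(f"You selected: {change['new']}")


# 3. Register the handler for changes of the "value" trait
demo_dropdown.observe(demo_handler, names="value")

# 4. Display the widget and its output
display(demo_dropdown, demo_output)
```

Each nested subwidget in the notebook repeats this pattern, with the handler of one widget deciding which further widgets are displayed.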

In [5]:
# We need to declare outside the main widget function all output widgets and other variables
time_to_event_selection_info = Output()
event_observation_selection_info = Output()
event_observation_tagsinput_info = Output()
subgroup_options_selection_info = Output()
subgroup_maker1_info = Output()
subgroup_maker2_info = Output()
KM_plot = Output()

# Variables that need to be accessed by multiple functions
column_data = []
event_observation_tagsinput0 = None
event_observation_tagsinput1 = None

This function to display the interactive widgets, subwidgets, and outputs has to be together in a single code block/function to work properly.

In [6]:
# Main function to prepare and display the interactive widgets and subwidgets
def widget_preparation(df_clinical, df_RNA, time_to_event, event_observation):

    logger.info("----------User interaction with the widgets---------- \n")
    clear_output()
    
    #################################### First widget - Time to event ####################################

    # Create the widget 
    time_to_event_dropdown = widgets.Dropdown(options=["Click here to select..."] + time_to_event)

    # Function to display the output of time_to_event_dropdown (histogram)
    def time_to_event_selection_handler(change):
        
        # Check if the user has selected anything or if the default is selected back 
        if change["new"] != "Click here to select...":
            column_name = change["new"]
            
            # Check if the name selected is a column on the clinical dataframe
            if column_name in df_clinical.columns:
                time_column = df_clinical[column_name].dropna()
                logger.info(f"The user selected: {column_name}     Widget: time_to_event_dropdown. \n")
                logger.info(f"Original dtype of {column_name}: {df_clinical[column_name].dtype}     Dtype once removing NANs: {time_column.dtype} \n")
                
                # Make histogram of values and handle exceptions
                if time_column.dtype != "object":
                    plt.figure(figsize=(4, 2))
                    plt.hist(time_column, bins=12, color="darkgoldenrod", ec="white")
                    plt.xlabel(column_name)
                    plt.ylabel('Number of patients')
                    plt.title(f'Histogram of {column_name}')
                    logger.info(f"A histogram was successfully made and displayed for: {column_name} \n")
                else:
                    info_str = "Warning: Column type is not numeric."
                    logger.warning("User attention required: The time to event column may not be numerical. \n")
            else:
                info_str = "Warning: Column not found in the dataframe."
                logger.error("User attention required: The time to event column name was not found in the df. \n")

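            # Clear the output widget and display the additional information
            with time_to_event_selection_info:
                time_to_event_selection_info.clear_output()
                # Show the histogram only if the column exists and is numeric (time_column is undefined otherwise)
                if column_name in df_clinical.columns and df_clinical[column_name].dtype != "object":
                    plt.show()
                else:
                    print(info_str)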
        else:
            # Clear the output widget when the default option is selected back 
            with time_to_event_selection_info:
                time_to_event_selection_info.clear_output()
    
    # Observe changes in the widget to call the function above            
    time_to_event_dropdown.observe(time_to_event_selection_handler, names="value")
    
    #################################### Second widget - Event observation ####################################

    # Create the widget 
    event_observation_dropdown = widgets.Dropdown(options=["Click here to select..."] + event_observation)

    # Function to display the output of event_observation_dropdown (bar chart)
    def event_observation_selection_handler(change):

        # Check if the user has selected anything or if the default is selected back 
        if change["new"] != "Click here to select...":
            column_name = change["new"]
            
            # Check if the name selected is a column on the clinical dataframe
            if column_name in df_clinical.columns:
                event_column = df_clinical[column_name]
                logger.info(f"The user selected: {column_name}     Widget: event_observation_dropdown. \n")
                logger.info(f"Dtype of {column_name}: {df_clinical[column_name].dtype}     Unique value counts: \n\t {event_column.value_counts(dropna=False).to_string()} \n")
                
                # Make a bar chart for unique values and handle exceptions
                if event_column.dtype == "object":
                    value_counts = event_column.value_counts(dropna=False)
                    
                    if df_clinical[column_name].nunique() > 15:
                        logger.warning("User attention required: There may be something wrong with the event observation column as there are more than 15 unique values. \n")
                    
                    plt.figure(figsize=(4, 2))
                    value_counts.plot(kind='bar', color=['maroon', 'purple', 'green', 'crimson', 'navy', 'coral', 'lavender'])
                    plt.xlabel(column_name)
                    plt.xticks(rotation=45)
                    plt.ylabel('Number of patients')
                    plt.title(f'Bar Chart of {column_name}')
                    logger.info(f"A bar chart was successfully made and displayed for: {column_name} \n")
                else:
                    info_str = "Warning: Column type is not categorical."
                    logger.warning("User attention required: The event observation column may not be text-based. \n")
            else:
                info_str = "Warning: Column not found in the dataframe."
                logger.error("User attention required: The event observation column name was not found in the df. \n")

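            # Clear the output widget and display the additional information
            with event_observation_selection_info:
                event_observation_selection_info.clear_output()
                # Show the bar chart only if the column exists and is text-based (event_column is undefined otherwise)
                if column_name in df_clinical.columns and df_clinical[column_name].dtype == "object":
                    plt.show()
                else:
                    print(info_str)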
            
            # Following selection and plot, get the unique values from the current column selection
            event_values = np.ndarray.tolist(df_clinical[event_observation_dropdown.value].unique())
    
            # Make subwidgets to specify the event to be observed so we can encode it in binary for kmf.fit
            global event_observation_tagsinput0, event_observation_tagsinput1
            event_observation_tagsinput0 = widgets.TagsInput(allowed_tags=event_values)
            event_observation_tagsinput1 = widgets.TagsInput(allowed_tags=event_values)
            event_observation_tagsinput = [widgets.HTML("<br/>"), HBox([widgets.Label("0:"), event_observation_tagsinput0]),
                                           widgets.HTML("<br/>"), HBox([widgets.Label("1:"), event_observation_tagsinput1])]
            event_observation_tagsinput_box = VBox(event_observation_tagsinput)
            
            with event_observation_tagsinput_info:
                event_observation_tagsinput_info.clear_output()
                display(event_observation_tagsinput_box)
        
        else:
            # Clear the output widget when the default option is selected
            event_observation_selection_info.clear_output()
            event_observation_tagsinput_info.clear_output()

    # Observe changes in the widget to call the function above 
    event_observation_dropdown.observe(event_observation_selection_handler, names="value")

    #################### Third widget - Handle time and event columns that have other names ####################

    # Create the main widget
    event_observation_checkbox = widgets.Checkbox(description="Can't find your columns?", value=False)
    
    #################################### Fourth widget - Making subgroups ####################################

    # Create the main widget  
    subgroup_options_buttons = widgets.ToggleButtons(options=["No", "Using 1 variable", "Using multiple variables"])

    # Create the subwidgets (these are only displayed upon certain selections)
    dataset_dropdown = widgets.Dropdown(options=['Click here to select...', 'clinical'] + (['RNA'] if df_RNA is not None else []),
                                        value='Click here to select...', description='Dataset:')
    variables_dropdown = widgets.Dropdown(options=['Click here to select...'] + list(df_clinical.columns[1:]), 
                                          value='Click here to select...', description='Variables:')
    if df_RNA is not None:
        variables_combobox = widgets.Combobox(options=list(df_RNA.columns[1:]), placeholder='Type gene of interest here', description='Genes:')
    group_number_slider = widgets.IntSlider(min=1, max=10, description='Groups:', value=1)

    ####################################    
    # Function to display the output of subgroup_options_buttons (several subwidgets)
    def subgroup_options_selection_handler(change):
        
        # If the user wants to make subgroups based on 1 variable, ask to select a dataset, variable and number of groups
        if change['new'] == 'Using 1 variable':
            # Reset the value, clear the output and register the handler function (defined below)
            dataset_dropdown.value = 'Click here to select...'
            logger.info("The user selected: Using 1 variable     Widget: subgroup_options_button. \n")

            subgroup_maker1_info.clear_output()
            subgroup_maker2_info.clear_output()
            with subgroup_options_selection_info:
                subgroup_options_selection_info.clear_output()
                dataset_dropdown.observe(dataset_selection_handler, 'value')
                display(dataset_dropdown)

        # If the user wants to make subgroups based on multiple variables, ask to select something
        elif change['new'] == 'Using multiple variables':
            logger.info("The user selected: Using multiple variables     Widget: subgroup_options_button. \n")
            
            subgroup_maker1_info.clear_output()
            subgroup_maker2_info.clear_output()
            with subgroup_options_selection_info:
                subgroup_options_selection_info.clear_output()
                display(HTML('<span style="color: red;"> Feature not ready yet T_T </span>'))
        else:
            logger.info("The user selected: No     Widget: subgroup_options_button. \n")
            # Display an empty space if the user has not selected anything or if the default 'No' is selected back again
            subgroup_maker1_info.clear_output()
            subgroup_maker2_info.clear_output()
            subgroup_options_selection_info.clear_output()

    ####################################
    # Function to display the output of dataset_dropdown (two subwidgets) 
    def dataset_selection_handler(change):

        # Reset the group_number_slider
        group_number_slider.value = 1
        variables_dropdown.value = 'Click here to select...'
        if df_RNA is not None:
            variables_combobox.value = ''
            logger.info("There are RNA and clinical dataframes available to make subgroups. \n")
        
        # Display a dropdown and slider for clinical variables (excluding PATIENT_ID column)
        if change["new"] == "clinical":
            logger.info(f"The {change['new']} dataset was selected to make subgroups. \n")

            # Clear the output and show the two subwidgets next to the previous and observe changes
            subgroup_maker1_info.clear_output()
            subgroup_maker2_info.clear_output()
            
            with subgroup_options_selection_info:
                subgroup_options_selection_info.clear_output()
                variables_dropdown.observe(variables_selection_handler, 'value')
                group_number_slider.observe(group_number_selection_handler, 'value')
                display(HBox([dataset_dropdown, variables_dropdown, widgets.HTML('\u2003' * 3), group_number_slider]))
        
        # Display a combobox and slider for RNA variables (excluding PATIENT_ID column)
        elif change["new"] == "RNA":
            logger.info(f"The {change['new']} dataset was selected to make subgroups. \n")

            # Clear the output and show the two subwidgets next to the previous and observe changes
            subgroup_maker1_info.clear_output()
            subgroup_maker2_info.clear_output()
            
            with subgroup_options_selection_info:
                subgroup_options_selection_info.clear_output()
                variables_combobox.observe(variables_selection_handler, 'value')
                group_number_slider.observe(group_number_selection_handler, 'value')
                display(HBox([dataset_dropdown, variables_combobox, widgets.HTML('\u2003' * 3), group_number_slider]))
        
        # Clear the output and display the original dropdown if the default option is selected back again
        else:
            logger.info("No dataset has been selected to make subgroups. \n")
            subgroup_maker1_info.clear_output()
            subgroup_maker2_info.clear_output()
            with subgroup_options_selection_info:
                subgroup_options_selection_info.clear_output()
                display(dataset_dropdown)

    ####################################
    # Function to display the output of variables_dropdown and variables_combobox (plots)
    def variables_selection_handler(change):
        # Reminder that this function has to work for both df_clinical and df_RNA

        # Reset the group_number_slider
        group_number_slider.value = 1
        
        # Display empty space if the user has not selected anything or if the default is selected back 
        if change["new"] == 'Click here to select...' or change["new"] == '':
            logger.info("The previous variable to make subgroups was de-selected. \n")
            subgroup_maker1_info.clear_output()
            subgroup_maker2_info.clear_output()

        # If the user has selected a variable, find the column and check dtype
        else:
            # This is to avoid plotting while incomplete strings are being searched in the combobox
            if "variables_combobox" in locals() and dataset_dropdown.value == "RNA":
                if change["new"] not in variables_combobox.options:
                    subgroup_maker1_info.clear_output()
                    subgroup_maker2_info.clear_output()
                    return

            # Make these variables global so they can be accessed from other subwidgets
            global column_data, KM_data
            
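            # The 0/1 tag widgets only exist once an event column has been selected above
            if event_observation_tagsinput0 is None or event_observation_tagsinput1 is None:
                with subgroup_maker1_info:
                    subgroup_maker1_info.clear_output()
                    display(HTML('<span style="color: red;">Select the event column and its 0/1 labels first!</span>'))
                return

            # Apply the 0 and 1 event labels to display the rows of the selected column with those values
            KM_data = df_clinical.copy()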
            if event_observation_tagsinput0.value:
                for tag in event_observation_tagsinput0.value:
                    KM_data[event_observation_dropdown.value] = KM_data[event_observation_dropdown.value].replace(tag, "0")
            if event_observation_tagsinput1.value:
                for tag in event_observation_tagsinput1.value:
                    KM_data[event_observation_dropdown.value] = KM_data[event_observation_dropdown.value].replace(tag, "1")

            # Log the current status of KM_data
            logger.info(f"[Subgrouping 1st step] The user selected to label -{str(event_observation_tagsinput0.value)}- as 0, and -{str(event_observation_tagsinput1.value)}- as 1. \n")
            logger.info(f"[Subgrouping 1st step] Apply 0/1 labels to column {event_observation_dropdown.value} on KM_data: \n {KM_data.iloc[:15, :10].to_string()} \n")
            logger.info(f"[Subgrouping 1st step] Data types of KM_data columns: \n {KM_data.dtypes.to_string()} \n\n")
            
            # Look for the selected column in either df, as it is not specified within this function
            if change["new"] in df_clinical.columns:
                # Filter out the selected variable to show values only from rows with 0 or 1 events that will be used in KM analysis
                if event_observation_tagsinput0.value and event_observation_tagsinput1.value:
                    KM_data = KM_data[['PATIENT_ID', time_to_event_dropdown.value, event_observation_dropdown.value, change["new"]]]
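                    # Copy the filtered rows so the assignment below does not act on a view of the original data
                    KM_data = KM_data.loc[KM_data[event_observation_dropdown.value].isin(["0", "1"])].copy()
                    KM_data[event_observation_dropdown.value] = KM_data[event_observation_dropdown.value].astype(int)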
                    logger.info(f"[Subgrouping 2nd step] The column {change['new']} -{KM_data.dtypes[change['new']]} dtype- from df_clinical was selected to make subgroups. \n")
                
                column_data = KM_data[change["new"]].copy()

            # If the column is in df_RNA, joining is needed to combine it with the clinical columns
            elif df_RNA is not None and change["new"] in df_RNA.columns:
                if event_observation_tagsinput0.value and event_observation_tagsinput1.value:
                    KM_data = KM_data[['PATIENT_ID', time_to_event_dropdown.value, event_observation_dropdown.value]]
                    df_RNA2 = df_RNA[['PATIENT_ID', change["new"]]]
                    KM_data = KM_data.merge(df_RNA2, on='PATIENT_ID', how='inner')
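                    # Copy the filtered rows so the assignment below does not act on a view of the original data
                    KM_data = KM_data[KM_data[event_observation_dropdown.value].isin(["0", "1"])].copy()
                    KM_data[event_observation_dropdown.value] = KM_data[event_observation_dropdown.value].astype(int)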
                    logger.info(f"[Subgrouping 2nd step] The column {change['new']} -{KM_data.dtypes[change['new']]} dtype- from df_RNA was selected to make subgroups. \n")
                    column_data = KM_data[change["new"]]
                else:
                    column_data = df_RNA[change["new"]]

            # Log the current status of KM_data 
            logger.info(f"[Subgrouping 2nd step] Keep relevant columns of KM_data and only rows with 0/1 event labels: \n {KM_data.iloc[:15, :10].to_string()} \n")
            logger.info(f"[Subgrouping 2nd step] Data types of KM_data columns: \n {KM_data.dtypes.to_string()} \n\n")
            
            # Make and display a bar chart for text columns showing the counts for unique values
            if column_data.dtype == 'object':
                value_counts = column_data.value_counts(dropna=False)
                fig, ax = plt.subplots(figsize=(5, 3))
                value_counts.plot(kind='bar', color=['indigo', 'khaki', 'lightblue', 'salmon', 'sienna', 'silver', 'aquamarine', 'coral', 'teal', 'olive'])
                ax.set_xlabel(change["new"])
                ax.set_ylabel('Count')
                ax.set_title(f'Unique Value Counts for {change["new"]}')
                plt.xticks(rotation=45)

                subgroup_maker2_info.clear_output()
                with subgroup_maker1_info:
                    subgroup_maker1_info.clear_output()
                    plt.show()
                    display(HTML('<span style="color: red;">Showing rows with 0 and 1 events (specified above)</span>'))
                
            # Make and display a histogram of frequencies for numerical columns
            else:
                fig, ax = plt.subplots(figsize=(5, 3))
                ax.hist(column_data, bins='auto', color="darkblue", ec="white")
                ax.set_xlabel(change["new"])
                ax.set_ylabel('Frequency')
                ax.set_title(f'Histogram for {change["new"]}')

                subgroup_maker2_info.clear_output()
                with subgroup_maker1_info:
                    subgroup_maker1_info.clear_output()
                    plt.show()
                    display(HTML('<span style="color: red;">Showing rows with 0 and 1 events (specified above)</span>'))
            
            # After the plot is made show two subwidgets for subgroups
            group_number_slider.value = 2

    ####################################
    # Function to display the output of group_number slider (subwidgets)
    def group_number_selection_handler(change):
        # This function uses the global variable column_data created in the function above
        global subgroup_tagsinput, subgroup_floatrangeslider
        
        # If no variable is selected in the widget on the left, do not display subgrouping options
        if variables_dropdown.value == 'Click here to select...':
            if df_RNA is None or variables_combobox.value == '':
                with subgroup_maker2_info:
                    subgroup_maker2_info.clear_output()
                    display(HTML('<span style="color: red;">Choose a variable first!</span>'))
                return
        
        # Use tags to specify the desired groups for text columns (we are only making 2 or more groups)
        if change["new"]>1 and column_data.dtype == 'object':
            logger.info(f"[Subgrouping 3rd step] The user selected to make {str(change['new'])} subgroups with tags input labels. \n")

            # If the user has been selecting several variables, empty the range slider to prevent issues
            # (the name is declared global, so it never appears in locals(); reset it unconditionally)
            subgroup_floatrangeslider = None
            
            unique_values = column_data.unique().tolist()
            subgroup_tagsinput = [widgets.TagsInput(allowed_tags=unique_values, description=f'Group {i+1}:') for i in range(change["new"])]
            subgroup_tagsinput_labels = [widgets.Label(value=f'Group {i+1}:') for i in range(change["new"])]
            subgroup_tagsinput_box = VBox([HBox([label, tagsinput]) for label, tagsinput in zip(subgroup_tagsinput_labels, subgroup_tagsinput)])

            with subgroup_maker2_info:
                subgroup_maker2_info.clear_output()
                display(subgroup_tagsinput_box)
                
        # Use range sliders to specify the desired groups for numerical columns 
        elif change["new"]>1:
            logger.info(f"[Subgrouping 3rd step] The user selected to make {str(change['new'])} subgroups with float range labels. \n")

            # If the user has been selecting several variables, empty the tags input to prevent issues
            # (reset unconditionally so this also works when no tags input was created before)
            subgroup_tagsinput = None
                
            cleaned_column_data = column_data.dropna()
            min_value = cleaned_column_data.min()
            max_value = cleaned_column_data.max()
            subgroup_floatrangeslider = [widgets.FloatRangeSlider(min=min_value, max=max_value, step=0.01, description=f'Group {i + 1}') for i in range(change["new"])]
            subgroup_floatrangeslider_box = VBox(subgroup_floatrangeslider)
                
            with subgroup_maker2_info:
                subgroup_maker2_info.clear_output()
                display(subgroup_floatrangeslider_box) 
        else:
            with subgroup_maker2_info:
                subgroup_maker2_info.clear_output()
                display(HTML('<span style="color: red;">For 1 group select the No button instead!</span>'))

    ####################################        
    
    # Observe changes in the main widget to call the rest if a change is detected  
    subgroup_options_buttons.observe(subgroup_options_selection_handler, 'value') 

    #################################### Fifth widget - Calling KM Fitter ####################################

    # Create the widget 
    generate_plot_button = widgets.Button(description='Generate plot', disabled=False, 
                            button_style='success', # 'success', 'info', 'warning', 'danger' or ''
                            tooltip='Click me and wait for the plot to be displayed below!',
                            icon='chart-line') # (FontAwesome names without the `fa-` prefix)

    # Function to feed the current variable selections to the KM_analysis function (in the next subsection)
    def pass_KM_parameters(change):

        global KM_data
        
        ##################
        # If no subgrouping is required, apply the event observed tags and pass the data to KM_analysis
        if subgroup_options_buttons.value == 'No':
            
            # Apply the selected labels on the event observation column 
            KM_data = df_clinical.copy()
            if event_observation_tagsinput0.value:
                for tag in event_observation_tagsinput0.value:
                    KM_data[event_observation_dropdown.value] = KM_data[event_observation_dropdown.value].replace(tag, "0")
            if event_observation_tagsinput1.value:
                for tag in event_observation_tagsinput1.value:
                    KM_data[event_observation_dropdown.value] = KM_data[event_observation_dropdown.value].replace(tag, "1")

            # Log the current status of KM_data
            logger.info(f"[No subgrouping 1st step] The user selected to label -{str(event_observation_tagsinput0.value)}- as 0, and -{str(event_observation_tagsinput1.value)}- as 1. \n")
            logger.info(f"[No subgrouping 1st step] Apply 0/1 labels to column {event_observation_dropdown.value} on KM_data: \n {KM_data.iloc[:15, :10].to_string()} \n")
            logger.info(f"[No subgrouping 1st step] Data types of KM_data columns: \n {KM_data.dtypes.to_string()} \n\n")
                    
            # Filter out non-desired values and convert column to numbers for the KM Fitter
            KM_data = KM_data[['PATIENT_ID', time_to_event_dropdown.value, event_observation_dropdown.value]]
            # Take an explicit copy so the assignment below does not act on a view of the original frame
            KM_data = KM_data.loc[KM_data[event_observation_dropdown.value].isin(["0", "1"])].copy()
            KM_data[event_observation_dropdown.value] = KM_data[event_observation_dropdown.value].astype(int)

            # Log the current status of KM_data
            logger.info(f"[No subgrouping 2nd step] Keep relevant columns of KM_data and only rows with 0/1 event labels: \n {KM_data.head(15).to_string()} \n")
            logger.info(f"[No subgrouping 2nd step] Data types of KM_data columns: \n {KM_data.dtypes.to_string()} \n\n")
            
            # Pass the input parameters to the KM_analysis function and get back the KM plot
            KM_parameters = [subgroup_options_buttons.value]
            KM_subgroups = []           
            KM_analysis_output = KM_analysis(KM_parameters, KM_data, KM_subgroups)

            # Show the KM plot generated
            with KM_plot:
                KM_plot.clear_output()
                plt.figure(figsize=(10, 6))
                KM_analysis_output.plot(ci_show=CI_checkbox.value, iloc=slice(0, int(len(KM_analysis_output.survival_function_) * 0.95)))
                plt.xlabel("Time")
                plt.ylabel("Probability")
                plt.title("Kaplan-Meier Plot")
                plt.show()
                
        ##################
        # If subgroups are to be made, the event observed tags were already applied but not the subgrouping tags
        elif subgroup_options_buttons.value == 'Using 1 variable':

            # Remake the subgrouping changes to the original KM_data every time we press the button
            KM_data_working = KM_data.copy()
            
            # Apply tags to the variable selected and create subsets of KM_data_working for each subgroup
            if subgroup_tagsinput:
                subgroup_selections = [tagsinput.value for tagsinput in subgroup_tagsinput]
                logger.info(f"[Subgrouping 3rd step] The user selected---> {subgroup_selections} \n")
                
                # Generate labels for groups
                subgroup_labels = [f"Group {i+1}" for i in range(len(subgroup_selections))]
                label_content_pairs = []
                
                # Iterate through the subgroup_selections list
                for i, tags_list in enumerate(subgroup_selections):
                    subgroup_elements = tags_list
    
                    # Generate a mapping of unique values to group labels
                    element_to_label = {element: subgroup_labels[i] for element in subgroup_elements}

                    # Add the label and content pair to the list
                    label_content_pairs.extend([f"{element}: {subgroup_labels[i]}" for element in subgroup_elements])
                    
                    # Replace the subgroup elements with the new labels
                    variable_column_name = KM_data_working.columns[3]
                    KM_data_working[variable_column_name] = KM_data_working[variable_column_name].replace(element_to_label)
                
                # Filter out rows with new subgroup labels and log the labels selected
                KM_data_working = KM_data_working[KM_data_working[variable_column_name].isin(subgroup_labels)]
                log_string = "  -  ".join(label_content_pairs)
                logger.info(f"[Subgrouping 3rd step] Subgrouping labels applied---> {log_string} \n")
                
                # Create an empty dictionary to store the KM_data_working subsets
                KM_subgroups = {}

                # Iterate through each unique subgroup label in the variable_column_name column
                for label in KM_data_working[variable_column_name].unique():
                    # Filter the rows based on the label
                    subset = KM_data_working[KM_data_working[variable_column_name] == label].copy()
                    
                    # Add the subset to the KM_subgroups dictionary
                    KM_subgroups[label] = subset

            # Segment the values within each range on the variable selected and create subsets of KM_data_working for each subgroup
            elif subgroup_floatrangeslider:
                subgroup_labels = [slider.description for slider in subgroup_floatrangeslider]

                # Create a new column to store the subgroup labels
                KM_data_working['Subgroup'] = ''

                # Iterate through the float range sliders
                for i, slider in enumerate(subgroup_floatrangeslider):
                    # Retrieve the range selection and corresponding label
                    subgroup_range = slider.value
                    subgroup_label = subgroup_labels[i]
                
                    # Get the indices of values within the range selection
                    subgroup_rows = (KM_data_working.iloc[:, 3] >= subgroup_range[0]) & (KM_data_working.iloc[:, 3] < subgroup_range[1])
                
                    # Assign the subgroup label to the matching rows
                    KM_data_working.loc[subgroup_rows, 'Subgroup'] = subgroup_label
                
                # Remove rows where the subgroup label is not assigned
                KM_data_working = KM_data_working[KM_data_working['Subgroup'] != '']
                variable_column_name = KM_data_working.columns[3]
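                # Replace the original variable column with the subgroup labels (no in-place edits on a filtered frame)
                KM_data_working = KM_data_working.drop(columns=variable_column_name)
                KM_data_working = KM_data_working.rename(columns={'Subgroup': variable_column_name})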

                # Log the ranges corresponding to each subgroup
                log_string = " - ".join([f"Group {i+1}: {slider.value[0]:.2f} to {slider.value[1]:.2f}" for i, slider in enumerate(subgroup_floatrangeslider)])
                logger.info(f"[Subgrouping 3rd step] Subgrouping labels applied---> {log_string} \n")

                # Create an empty dictionary to store the KM_data_working subsets
                KM_subgroups = {}

                # Iterate through each unique subgroup label in the variable_column_name column
                for label in KM_data_working[variable_column_name].unique():
                    # Filter the rows based on the label
                    subset = KM_data_working[KM_data_working[variable_column_name] == label].copy()
                    
                    # Add the subset to the KM_subgroups dictionary
                    KM_subgroups[label] = subset

            # Log the current status of KM_data_working and the subgroups created
            logger.info(f"[Subgrouping 3rd step] Apply subgrouping labels to KM_data_working and subset it in KM_subgroups: \n {KM_data_working.head(15).to_string()} \n")
            logger.info(f"[Subgrouping 3rd step] Data types of KM_data_working columns: \n {KM_data_working.dtypes.to_string()} \n\n")
            logger.info("[Subgrouping 3rd step] Subgroups made from KM_data_working:")
            for label, subgroup in KM_subgroups.items():
                logger.info(f"\n{label}:\n{subgroup.head(10)}\n")
            
            ########
                    
            # Finally, pass the input parameters to the KM_analysis function and get back the KM plot
            KM_parameters = [subgroup_options_buttons.value]
            KM_analysis_output = KM_analysis(KM_parameters, KM_data_working, KM_subgroups)

            # Show the KM plot generated (95% of data points)
            with KM_plot:
                KM_plot.clear_output()
                plt.figure(figsize=(10, 6))
                for label, KM_output in KM_analysis_output.items():
                    KM_output.plot(label=label, ci_show=CI_checkbox.value, iloc=slice(0, int(len(KM_output.survival_function_) * 0.95)))
                plt.xlabel('Time')
                plt.ylabel('Survival Probability')
                plt.title('Kaplan-Meier Curves by Subgroup (1 variable)')
                plt.legend()
                plt.show()

        ##################
        # If subgroups are to be made based on multiple variables, we duplicate the steps for 1 variable as many times as needed
        else:
            # Not implemented yet in this version
            pass
            
        ##################        
            
    # Call the KM Fitter when the button is clicked
    generate_plot_button.on_click(pass_KM_parameters)

    ############################### Sixth widget - KM plot customization tools #######################

    # Create the main widget
    global CI_checkbox
    CI_checkbox = widgets.Checkbox(description="Show Confidence Intervals", value=True)
    

    
    #################################### Displaying all widgets ####################################    

    # First, second and third widgets go together in the first row
    display(widgets.HTML("<br/>"))
    display(HBox([widgets.Label("Time to Event:"), time_to_event_dropdown, widgets.HTML('\u2003' * 7), widgets.Label("Event Observation:"), event_observation_dropdown, event_observation_checkbox]))
    
    # The output of the first two widgets is displayed in the second row
    time_to_event_selection_info.layout.width = '41%'
    event_observation_selection_info.layout.width = '41%'
    event_observation_tagsinput_info.layout.width = '18%'
    display(HBox([time_to_event_selection_info, event_observation_selection_info, event_observation_tagsinput_info]))
    display(widgets.HTML("<br/>"))

    # The fourth widget is displayed in the third row
    display(HBox([widgets.HTML('\u2003' * 10), widgets.Label("Make additional subgroups?:"), subgroup_options_buttons]))
    display(widgets.HTML("<br/>"))
    
    # The outputs within the fourth widget are displayed in the fourth and fifth rows
    display(subgroup_options_selection_info)
    display(widgets.HTML("<br/>"))
    display(HBox([subgroup_maker1_info, subgroup_maker2_info]))

    # Finally, the button to start the KM Fitter and generate the plot is displayed in the sixth row with the tools to customize it
    display(widgets.HTML("<br/>"))
    display(HBox([generate_plot_button, CI_checkbox]))
    display(KM_plot)

    ####################################
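
The numerical subgrouping in the cell above gives each range slider a half-open interval: a row joins a group when `low <= value < high`, a later slider overwrites an earlier one where ranges overlap, and rows that match no range are dropped. The sketch below reproduces that rule in plain Python, without pandas or widgets, so the edge cases are easy to check. `label_by_ranges`, the `ranges` tuples and the sample values are hypothetical names and data made up for illustration; they are not part of the notebook.

```python
def label_by_ranges(values, ranges):
    """Label each value with the last half-open range [low, high) that contains it.

    `ranges` is a list of (label, low, high) tuples, a hypothetical stand-in
    for the slider descriptions and slider values. Unmatched values get None.
    """
    labels = [None] * len(values)
    for label, low, high in ranges:
        for i, value in enumerate(values):
            # NaN fails both comparisons, so missing values stay unlabelled
            if low <= value < high:
                labels[i] = label
    return labels


# Toy expression values (made up), including a missing value and the maximum
expression = [0.5, 1.2, 2.0, 3.7, float("nan"), 5.0]
ranges = [("Group 1", 0.0, 2.0), ("Group 2", 2.0, 5.0)]

labels = label_by_ranges(expression, ranges)
print(labels)
# ['Group 1', 'Group 1', 'Group 2', 'Group 2', None, None]
```

Under this rule a value equal to a range's upper bound is left unlabelled, so a row holding the column maximum is excluded when the upper bound sits exactly at that maximum.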

### Function to do the KM analysis

***In development...***

In [7]:
def KM_analysis(KM_parameters, KM_data, KM_subgroups):

    # Unpack the received KM parameters
    current_subgroup_option = KM_parameters[0]
    current_time_column = KM_data.columns[1]
    current_event_column = KM_data.columns[2]
    
    global KMF_object
    
    if current_subgroup_option == 'No':

        # Create a single KaplanMeierFitter object
        KMF_object = KaplanMeierFitter()

        # Generate the plot using the specified columns
        KMF_object.fit(durations=KM_data[current_time_column], event_observed=KM_data[current_event_column])

        # Log part of the curve to verify the data was passed correctly
        logger.info("[No Subgrouping 3rd step] The KM Fitter successfully calculated the probabilities and made the plot. \n")
        logger.info(f"[No Subgrouping 3rd step] Calculated survival function estimates: \n {KMF_object.survival_function_.head(7).to_string()} \n ... \n {KMF_object.survival_function_.tail(7).to_string()} \n\n")


    elif current_subgroup_option == 'Using 1 variable':

        # Unpack the received KM parameters
        current_variable_name = KM_data.columns[3]

        # Sort the subgroups in alphabetical order to plot them in the same order and colour
        KM_subgroups = OrderedDict(sorted(KM_subgroups.items()))
        
        # Create an empty dictionary to store the KaplanMeierFitter objects
        KMF_object = {}
        
        # Create KaplanMeierFitter objects for each subgroup in KM_subgroups
        for label, subset in KM_subgroups.items():
            kmf = KaplanMeierFitter()
            kmf.fit(durations=subset[current_time_column], event_observed=subset[current_event_column])
            KMF_object[label] = kmf

            # Log part of the curve to verify the data was passed correctly
            logger.info(f"[Subgrouping 4th step] Calculated survival function estimate of: {label}")
            logger.info(f"\n {kmf.survival_function_.head(7).to_string()} \n ... \n {kmf.survival_function_.tail(7).to_string()} \n\n")

        # Log success only after every subgroup has been fitted
        logger.info("[Subgrouping 4th step] The KM Fitter successfully calculated the probabilities and made the plot. \n")


    else:
        # Grouping by multiple variables is not implemented yet in this version
        pass
        
    return KMF_object
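
For reference, the quantity that `KaplanMeierFitter.fit` estimates in `KM_analysis` is the Kaplan-Meier (product-limit) survival function: S(t) is the product, over all event times t_i <= t, of (1 - d_i / n_i), where d_i is the number of events at t_i and n_i is the number of subjects still at risk just before t_i. The following is a minimal plain-Python sketch of that formula, not the lifelines implementation. The function name `kaplan_meier` and the toy `durations` and `events` lists are hypothetical, and the sketch leaves out confidence intervals and the tabular output that lifelines provides.

```python
from collections import Counter


def kaplan_meier(durations, events):
    """Return [(time, survival_probability)] at each distinct event time.

    durations: time to event or to censoring for each subject
    events:    1 if the event was observed, 0 if the subject was censored
    """
    deaths = Counter(t for t, e in zip(durations, events) if e == 1)
    exits = Counter(durations)  # events plus censorings leaving the risk set

    at_risk = len(durations)
    survival = 1.0
    curve = []
    for t in sorted(exits):
        d = deaths.get(t, 0)
        if d:
            # Product-limit step: multiply by the fraction surviving time t
            survival *= 1 - d / at_risk
            curve.append((t, survival))
        # Everyone whose duration is t leaves the risk set after time t
        at_risk -= exits[t]
    return curve


# Toy data (made up): six subjects, two of them censored
durations = [5, 6, 6, 2, 4, 4]
events = [1, 0, 1, 1, 1, 0]

curve = kaplan_meier(durations, events)
for t, s in curve:
    print(f"t={t}: S(t)={s:.3f}")
```

Censored subjects never lower the curve directly; they only shrink the risk set for later event times, which is why only rows labelled 0 or 1 in the event column are kept before fitting.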

### Function for flow control

***In development...***

In [8]:
def start_plotting():
    
    # First, load the input files
    df_clinical, df_RNA = upload_input_files()

    # Second, preprocess the uploaded files
    df_clinical, df_RNA, time_to_event, event_observation = file_preprocessing(df_clinical, df_RNA)

    # Third, generate and display the interactive widgets to select the variables to plot
    widget_preparation(df_clinical, df_RNA, time_to_event, event_observation)

    # Fourth, save the results...


### Function to save results -optional-

***In development...***

In [9]:
# In development...
def save_KM_results():
    # Placeholder: nothing is saved yet in this version
    pass

## **Start KM-plotting here!!!**

In [10]:
# To begin, run this code block
start_plotting()
