## End-to-End Scenario: Survival Analysis
Author: TI HA DB ML China - SAP HANA PAL Team

Date: 2020/06/18

In clinical trials or community trials, the effect of an intervention is assessed by measuring the number of subjects who have survived or are saved after that intervention over a period of time. We wish to measure the survival probability of Dukes’C colorectal cancer patients after treatment and evaluate statistically whether the patients who accept treatment can survive longer than those who are only controlled conservatively.


## 1. Set Up the Connection to SAP HANA
First, create a connection to SAP HANA. The connection parameters are read from a config file, config/e2edata.ini. A sample section of the config file, containing the HANA URL, port, user, and password, is shown below.<br>

###################<br>
[hana]<br>
url=host-url<br>
user=username<br>
passwd=userpassword<br>
port=3xx15<br>
###################<br>
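
For orientation, a file in this layout can be read with Python's standard `configparser` module. The sketch below parses the sample section from a string. It only illustrates the file format and is not the implementation of `Settings.load_config`; the variable names are assumptions.

```python
import configparser

# Sample section from above, with the placeholder values kept as-is.
SAMPLE = """
###################
[hana]
url=host-url
user=username
passwd=userpassword
port=3xx15
###################
"""

parser = configparser.ConfigParser()
parser.read_string(SAMPLE)  # lines starting with '#' are treated as comments

hana = parser["hana"]
# Values come back as strings; a real port number would need int(hana["port"]).
url, port, user, pwd = hana["url"], hana["port"], hana["user"], hana["passwd"]
print(url, port, user)
```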

In [None]:
from hana_ml.dataframe import ConnectionContext
from hana_ml.algorithms.pal.utility import Settings
url, port, user, pwd = Settings.load_config("../../../config/e2edata.ini")
connection_context = ConnectionContext(url, port, user, pwd)

Connection status:

In [None]:
print(connection_context.connection.isconnected())

## 2. Dataset
This scenario describes a clinical trial of 49 patients for the treatment of Dukes’C colorectal cancer. The following data show the survival times of the 49 patients, who were randomly assigned to either linoleic acid or control treatment.

![](patient.png)

The + sign indicates censored data. Until 6 months after treatment, there are no deaths. The effect of censoring is to remove the censored subjects from the group counted as alive. In the linoleic acid group, two subjects have been censored before 6 months, so the number alive just before 6 months is 23. There are two deaths at 6 months. Thus, the proportion surviving at 6 months is 21/23 = 0.913.

We now reduce the number alive (“at risk”) by two. The censored event at 9 months reduces the “at risk” set to 20. At 10 months there are two deaths. So the proportion surviving is 18/20 = 0.9, and the cumulative proportion surviving is 0.913 × 0.90 = 0.8217.
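
The arithmetic above follows the product-limit rule: at each death time, the running survival probability is multiplied by (at risk - deaths) / at risk. The following is a minimal Python sketch of that rule, using the linoleic acid rows of the trial data listed later in this notebook. The function name `kaplan_meier` and the tuple layout are illustrative assumptions, and subjects censored at a death time are assumed to still be at risk at that time.

```python
# Linoleic acid group: (time in months, status, occurrences).
# status 1 = death, status 0 = censored.
LINOLEIC = [
    (1, 0, 1), (5, 0, 1), (6, 1, 1), (6, 1, 1), (9, 0, 1), (10, 1, 1),
    (10, 1, 1), (10, 0, 1), (12, 1, 4), (12, 0, 1), (13, 0, 1), (15, 0, 1),
    (16, 0, 1), (20, 0, 1), (24, 1, 1), (24, 0, 1), (27, 0, 1), (32, 1, 1),
    (34, 0, 1), (36, 0, 2), (44, 0, 1),
]


def kaplan_meier(records):
    """Return (time, at_risk, deaths, cumulative_survival) per death time."""
    at_risk = sum(n for _, _, n in records)
    survival = 1.0
    steps = []
    for t in sorted({time for time, _, _ in records}):
        deaths = sum(n for time, s, n in records if time == t and s == 1)
        censored = sum(n for time, s, n in records if time == t and s == 0)
        if deaths:
            # Proportion surviving this death time, folded into the product.
            survival *= (at_risk - deaths) / at_risk
            steps.append((t, at_risk, deaths, survival))
        # Deaths and censored subjects both leave the at-risk set.
        at_risk -= deaths + censored
    return steps


km_steps = kaplan_meier(LINOLEIC)
for time, at_risk, deaths, survival in km_steps[:2]:
    print(time, at_risk, deaths, round(survival, 4))
```

The first two rows correspond to the 6-month and 10-month steps described above.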

## 3. Implementation Steps

### Option 1: Kaplan-Meier Estimate
**Technology Background**

The Kaplan-Meier estimate is one of the simplest ways to measure the fraction of subjects living for a certain amount of time after treatment. The time from a defined starting point to the occurrence of a given event, for example death, is called the survival time.

**Step 1**

The trial data can then be loaded into a table as follows:


In [None]:
import pandas as pd
from hana_ml.dataframe import create_dataframe_from_pandas

data = {'TIME':  [1,5,6,6,9,10,10,10,12,12,13,15,16,20,24,24,27,32,34,36,44,3,6,8,12,12,15,16,18,20,22,24,28,30,30,33,42],
        'STATUS': [0,0,1,1,0,1,1,0,1,0,0,0,0,0,1,0,0,1,0,0,0,0,1,1,1,0,0,0,0,1,0,1,0,0,1,0,1],
        'OCCURRENCES': [1,1,1,1,1,1,1,1,4,1,1,1,1,1,1,1,1,1,1,2,1,1,4,2,2,1,1,1,2,1,1,1,3,1,1,1,1],
        'GROUP': ["linoleic acid","linoleic acid","linoleic acid","linoleic acid","linoleic acid","linoleic acid",
                  "linoleic acid","linoleic acid","linoleic acid","linoleic acid","linoleic acid","linoleic acid",
                  "linoleic acid","linoleic acid","linoleic acid","linoleic acid","linoleic acid","linoleic acid",
                  "linoleic acid","linoleic acid","linoleic acid","control","control","control","control","control",
                  "control","control","control","control","control","control","control","control","control","control",
                  "control"] }

trial = pd.DataFrame(data, columns=['TIME', 'STATUS', 'OCCURRENCES', 'GROUP'])
trial_df = create_dataframe_from_pandas(connection_context, pandas_df=trial, 
                                        table_name='PAL_TRIAL_DATA_TBL', force=True, replace=True)
trial_df.head(5).collect()


**Step 2**

Pass the trial data to the Kaplan-Meier function to get the survival estimates and the log-rank test statistics.

To compare the survival estimates of two groups, we use the log-rank test. 
It is a hypothesis test that compares the survival distributions of two groups (some of the observations may be censored) 
and tests the null hypothesis that there is no difference between the populations (treatment group and control group)
in the probability of an event (here a death) at any time point. The test is nonparametric in 
that it makes no assumptions about the distributions of survival estimates. 
The analysis is based on the times of events (here deaths). For each such time 
we calculate the observed number of deaths in each group and the number expected 
if there were in reality no difference between the groups. It is widely used in clinical trials 
to establish the efficacy of a new treatment in comparison with a control treatment when the measurement 
is the time to event (such as the time from initial treatment to death).

Because the log-rank test is purely a test of significance, it cannot provide an estimate of the size of the difference between the groups.
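
To make the observed-versus-expected calculation concrete, here is a minimal pure-Python sketch of the standard two-group log-rank statistic (hypergeometric variance, 1 degree of freedom), applied to the trial data from Step 1. It is an illustration under those assumptions and not necessarily the exact variant PAL computes; the helper names `expand` and `log_rank` are hypothetical.

```python
import math

lino_times = [1, 5, 6, 6, 9, 10, 10, 10, 12, 12, 13, 15, 16, 20, 24, 24, 27, 32, 34, 36, 44]
lino_status = [0, 0, 1, 1, 0, 1, 1, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0]
lino_counts = [1, 1, 1, 1, 1, 1, 1, 1, 4, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 1]
ctrl_times = [3, 6, 8, 12, 12, 15, 16, 18, 20, 22, 24, 28, 30, 30, 33, 42]
ctrl_status = [0, 1, 1, 1, 0, 0, 0, 0, 1, 0, 1, 0, 0, 1, 0, 1]
ctrl_counts = [1, 4, 2, 2, 1, 1, 1, 2, 1, 1, 1, 3, 1, 1, 1, 1]


def expand(times, statuses, counts):
    """One (time, status) pair per subject; status 1 = death, 0 = censored."""
    return [(t, s) for t, s, n in zip(times, statuses, counts) for _ in range(n)]


def log_rank(group_a, group_b):
    """Return (observed_a, expected_a, chi_square, p_value)."""
    death_times = sorted({t for t, s in group_a + group_b if s == 1})
    observed_a = expected_a = variance = 0.0
    for t in death_times:
        n_a = sum(1 for u, _ in group_a if u >= t)  # at risk in group A
        n_b = sum(1 for u, _ in group_b if u >= t)  # at risk in group B
        d_a = sum(1 for u, s in group_a if u == t and s == 1)
        d_b = sum(1 for u, s in group_b if u == t and s == 1)
        n, d = n_a + n_b, d_a + d_b
        observed_a += d_a
        # Deaths expected in group A if both groups share the same hazard.
        expected_a += d * n_a / n
        if n > 1:
            variance += d * (n_a / n) * (n_b / n) * (n - d) / (n - 1)
    chi_square = (observed_a - expected_a) ** 2 / variance
    # Upper tail of a chi-square distribution with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(chi_square / 2))
    return observed_a, expected_a, chi_square, p_value


linoleic = expand(lino_times, lino_status, lino_counts)
control = expand(ctrl_times, ctrl_status, ctrl_counts)
observed, expected, chi_square, p_value = log_rank(linoleic, control)
print(observed, round(expected, 3), round(chi_square, 3), round(p_value, 3))
```

A small p-value would indicate a difference between the survival curves; as noted above, the statistic says nothing about the size of that difference.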



In [None]:
from hana_ml.algorithms.pal.stats import kaplan_meier_survival_analysis
result = kaplan_meier_survival_analysis(trial_df)
print(result[0].collect())

In [None]:
print(result[1].collect())

In [None]:
print(result[2].collect())

### Option 2: Weibull Distribution
**Technology Background**

The Weibull distribution is often used for reliability and survival analysis. It is defined by three parameters: shape, scale, and location. The scale parameter stretches or shrinks the curve. The shape parameter determines the form of the curve, as described below:

 - Shape = 1: The failure rate is constant over time, indicating random failure.
 
 - Shape < 1: The failure rate decreases over time.
 
 - Shape > 1: The failure rate increases over time.
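
The three cases above can be checked numerically with the Weibull hazard (failure rate) function, h(t) = (shape / scale) * (t / scale)^(shape - 1). The sketch below assumes a location parameter of 0; the function name and the sample scale value are illustrative assumptions.

```python
def weibull_hazard(t, shape, scale):
    """Failure rate of a Weibull distribution with location 0, for t > 0."""
    return (shape / scale) * (t / scale) ** (shape - 1)


SCALE = 10.0  # arbitrary scale chosen for illustration
for shape in (0.5, 1.0, 2.0):
    rates = [weibull_hazard(t, shape, SCALE) for t in (1.0, 5.0, 20.0)]
    print(shape, [round(r, 4) for r in rates])
```

With shape 0.5 the printed rates fall over time, with shape 1 they stay constant, and with shape 2 they rise.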

**Step 1**

Get Weibull distribution and statistics from the linoleic acid treatment data:

In [None]:
cursor = connection_context.connection.cursor()
try:
    cursor.execute("DROP TABLE PAL_LINO_ACID_TBL")
except Exception:
    # The table may not exist yet; ignore the error.
    pass

cursor.execute('CREATE COLUMN TABLE PAL_LINO_ACID_TBL (\"LEFT\" DOUBLE, \"RIGHT\" DOUBLE);')
values = [(1,None),
          (5,None),
          (6,6),
          (6,6),
          (9,None),
          (10,10),
          (10,10),
          (10,None),
          (12,12),
          (12,12),
          (12,12),
          (12,12),
          (12,None),
          (13,None),
          (15,None),
          (16,None),
          (20,None),
          (24,24),
          (24,None),
          (27,None),
          (32,32),
          (34,None),
          (36,None),
          (36,None),
          (44,None)]
try:
    cursor.executemany("INSERT INTO " +
                       "{} VALUES ({})".format('PAL_LINO_ACID_TBL',
                       ', '.join(['?']*len(values[0]))), values)
    connection_context.connection.commit()
finally:
    cursor.close()
linoleic_acid_df = connection_context.table("PAL_LINO_ACID_TBL")

print(linoleic_acid_df.head(5).collect())
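
The (LEFT, RIGHT) rows inserted above correspond to the linoleic acid rows of the trial data from Option 1: an observed death at time t appears as (t, t), and a censored observation appears as (t, NULL), repeated once per occurrence. The sketch below reproduces that mapping in plain Python. The helper name `to_interval_rows` is hypothetical, and the reading of a NULL RIGHT value as right-censoring is inferred from the data in this notebook.

```python
lino_times = [1, 5, 6, 6, 9, 10, 10, 10, 12, 12, 13, 15, 16, 20, 24, 24, 27, 32, 34, 36, 44]
lino_status = [0, 0, 1, 1, 0, 1, 1, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0]
lino_counts = [1, 1, 1, 1, 1, 1, 1, 1, 4, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 1]


def to_interval_rows(times, statuses, counts):
    """Expand (time, status, occurrences) into one (left, right) row per subject."""
    rows = []
    for t, status, n in zip(times, statuses, counts):
        right = t if status == 1 else None  # None becomes NULL on insert
        rows.extend([(t, right)] * n)
    return rows


interval_rows = to_interval_rows(lino_times, lino_status, lino_counts)
print(len(interval_rows), interval_rows[:5])
```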

Fit a Weibull distribution to the data and show the results:

In [None]:
from hana_ml.algorithms.pal.stats import distribution_fit
result = distribution_fit(linoleic_acid_df, distr_type = "weibull", optimal_method = "maximum_likelihood", censored=True)
print(result[0].collect())

In [None]:
print(result[1].collect())

**Step 2**

Get Weibull distribution and statistics from the control treatment data:

In [None]:
cursor = connection_context.connection.cursor()
try:
    cursor.execute("DROP TABLE PAL_CONTROL_TBL")
except Exception:
    # The table may not exist yet; ignore the error.
    pass
cursor.execute('CREATE COLUMN TABLE PAL_CONTROL_TBL (\"LEFT\" DOUBLE, \"RIGHT\" DOUBLE);')
values = [(3,None),
          (6,6),
          (6,6),
          (6,6),
          (6,6),
          (8,8),
          (8,8),
          (12,12),
          (12,12),
          (12,None),
          (15,None),
          (16,None),
          (18,None),
          (18,None),
          (20,20),
          (22,None),
          (24,24),
          (28,None),
          (28,None),
          (28,None),
          (30,30),
          (30,None),
          (33,None),
          (42,42)]
try:
    cursor.executemany("INSERT INTO " +
                       "{} VALUES ({})".format('PAL_CONTROL_TBL',
                       ', '.join(['?']*len(values[0]))), values)
    connection_context.connection.commit()
finally:
    cursor.close()
control_df = connection_context.table("PAL_CONTROL_TBL")

print(control_df.head(5).collect())

In [None]:
result = distribution_fit(control_df, distr_type = "weibull", optimal_method = "maximum_likelihood", censored=True)
print(result[0].collect())

In [None]:
print(result[1].collect())

**Step 3**

Get the CDF (cumulative distribution function) of Weibull distribution for the linoleic acid treatment data:

In [None]:
cursor = connection_context.connection.cursor()
try:
    cursor.execute("DROP TABLE PAL_DISTRPROB_DATA_TBL")
except Exception:
    # The table may not exist yet; ignore the error.
    pass
cursor.execute('CREATE COLUMN TABLE PAL_DISTRPROB_DATA_TBL (\"DATACOL\" DOUBLE);')
values = [(6,),(8,),(12,),(20,),(24,),(30,),(42,)]
try:
    cursor.executemany("INSERT INTO " +
                       "{} VALUES ({})".format('PAL_DISTRPROB_DATA_TBL',
                       ', '.join(['?']*len(values[0]))), values)
    connection_context.connection.commit()
finally:
    cursor.close()
distri_prob_df = connection_context.table("PAL_DISTRPROB_DATA_TBL")

print(distri_prob_df.collect())

Invoke CDF and show the result:

In [None]:
from hana_ml.algorithms.pal.stats import cdf
distr_info = {'name' : 'weibull', 'shape' : 1.40528, 'scale': 36.3069}
result = cdf(distri_prob_df, distr_info, complementary=False)
print(result.collect())
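
As a cross-check outside the database, the Weibull CDF has the closed form F(t) = 1 - exp(-(t / scale)^shape), and the survival probability is 1 - F(t). The sketch below evaluates it at the same time points with the same shape and scale values used above. It assumes a location parameter of 0, which this notebook does not set explicitly, so agreement with the PAL output is expected but not guaranteed.

```python
import math


def weibull_cdf(t, shape, scale):
    """Probability of failure by time t for a Weibull with location 0."""
    return 1.0 - math.exp(-((t / scale) ** shape))


SHAPE, SCALE = 1.40528, 36.3069   # values passed to cdf() above
time_points = [6, 8, 12, 20, 24, 30, 42]

probabilities = [weibull_cdf(t, SHAPE, SCALE) for t in time_points]
for t, p in zip(time_points, probabilities):
    # p is the estimated probability of death by month t; 1 - p is survival.
    print(t, round(p, 4), round(1.0 - p, 4))
```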

**Step 4**

Get the CDF (cumulative distribution function) of Weibull distribution for the control treatment data:

In [None]:
cursor = connection_context.connection.cursor()
try:
    cursor.execute("DROP TABLE PAL_DISTRPROB_DATA_TBL")
except Exception:
    # The table may not exist yet; ignore the error.
    pass
cursor.execute('CREATE COLUMN TABLE PAL_DISTRPROB_DATA_TBL (\"DATACOL\" DOUBLE);')
values = [(6,),(10,),(12,),(24,),(32,)]
try:
    cursor.executemany("INSERT INTO " +
                       "{} VALUES ({})".format('PAL_DISTRPROB_DATA_TBL',
                       ', '.join(['?']*len(values[0]))), values)
    connection_context.connection.commit()
finally:
    cursor.close()
distri_prob_df = connection_context.table("PAL_DISTRPROB_DATA_TBL")

print(distri_prob_df.collect())

Invoke CDF and show the result:

In [None]:
distr_info = {'name' : 'weibull', 'shape' :  1.71902, 'scale': 20.444}
result = cdf(distri_prob_df, distr_info, complementary=False)
print(result.collect())

## Drop Tables and Close HANA Connection

In [None]:
cursor = connection_context.connection.cursor()
# Drop each table independently so that one missing table does not skip the rest.
for table in ("PAL_TRIAL_DATA_TBL", "PAL_DISTRPROB_DATA_TBL",
              "PAL_LINO_ACID_TBL", "PAL_CONTROL_TBL"):
    try:
        cursor.execute("DROP TABLE {}".format(table))
    except Exception:
        pass
cursor.close()
connection_context.close()