# Data Analysis Project
*Participants: Ilia Rozanovskii, Katarzyna Rongiers*

**For a better user experience, it is recommended to run this notebook in Voila view.**

## Introduction

Data analysis is a part of data science: the process of cleaning, transforming, and processing raw data in order to extract actionable, relevant information that supports informed decisions.

The aim of this work is to perform a regression analysis of a selected dataset. Selecting the dataset was the first task. Because of our limited experience in data analysis, the dataset was chosen to meet two conditions. First, it has a narrow set of independent variables (or attributes), which makes the correlation between variables easier to see. Second, it contains many instances compared to other available datasets. A larger number of instances generally gives better model accuracy and is therefore the better choice.

The dataset analyzed in this work contains minimal health records of 110,204 admissions (primary cohort), 19,051 admissions (study cohort), and 137 admissions (validation cohort) of patients who had sepsis. During the work, a set of logistic regression models was generated, and one of them was selected as the final model. Additionally, a simple user interface for model prediction was built.

This work should be considered only as a student exercise with datasets. Because the dataset and the models relate to the medical field, and in particular to a matter of life and death, we declare that no result of this work can or should be used as grounds for any decision in health, medical, social, or other fields.

The dataset can be found here:
 [Sepsis data sets](https://archive.ics.uci.edu/ml/datasets/Sepsis+survival+minimal+clinical+records).
 




## Dataset analysis

The analysis in this work is done in the ```python``` programming language with a set of available libraries.
The first step of the dataset analysis is to compute basic descriptive statistics and to look for correlation between the dependent and independent variables.
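
As a rough illustration of this first step, the sketch below computes descriptive statistics and ranks variables by their correlation with the dependent variable. The toy table and its column names (`outcome`, `age`, `sex`) are assumptions made for the example and are not taken from the sepsis files.

```python
import pandas as pd

# Toy data; column names and values are illustrative only.
toy = pd.DataFrame({
    "outcome": [1, 1, 0, 1, 0, 1, 0, 1],
    "age":     [34, 45, 81, 50, 77, 29, 90, 61],
    "sex":     [0, 1, 0, 1, 1, 0, 0, 1],
})

# Basic descriptive statistics (count, mean, std, quartiles) for every column.
summary = toy.describe()

# Pearson correlation of each independent variable with the dependent one,
# ordered by absolute value so the strongest relationship comes first.
corr_with_outcome = toy.corr()["outcome"].drop("outcome")
ranked = corr_with_outcome.sort_values(key=abs, ascending=False)
print(ranked)
```

Sorting by absolute value matters because a strong negative correlation is as informative as a strong positive one.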

The selected dataset consists of three parts.
All three are taken into account when building the model, so each subset is loaded by the program and assigned its own name. Before the data is processed and the models are analyzed, the datasets are cleaned: every instance that contains a missing value (NaN) is dropped. Additionally, the dependent variable is moved to the first column so that all datasets have a uniform structure.
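
The cleaning step described above can be sketched on a small table as follows. The helper name `clean` and the toy columns are hypothetical; the sketch returns a new table instead of modifying the input in place.

```python
import numpy as np
import pandas as pd

def clean(df, depend_var):
    """Drop rows containing NaN and move the dependent variable to column 0."""
    cleaned = df.dropna().copy()
    target = cleaned.pop(depend_var)
    cleaned.insert(0, depend_var, target)
    return cleaned

# Toy input with one incomplete row (illustrative values only).
raw = pd.DataFrame({
    "age": [50.0, np.nan, 71.0],
    "outcome": [1, 0, 1],
})
tidy = clean(raw, "outcome")
print(tidy)
```

Keeping the dependent variable in a fixed position lets later code refer to it by position as well as by name.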

In [222]:
#%pip install ipywidgets
#%pip install voila
#%pip install voila-gridstack
### 

In [223]:
import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
#import pandas.util.testing as tm
import statsmodels.api as sm
import seaborn as sns
import statsmodels.formula.api as smf
from IPython.display import display
from IPython.display import clear_output
from IPython.display import Markdown as md
import itertools
import ipywidgets as widgets

In [224]:
### input data
## data filename
datafilename1 = 'sepsis_survival_primary.csv'
datafilename2 = 'sepsis_survival_study.csv'
datafilename3 = 'sepsis_survival_validation.csv'

## dependent variable column name
dependVar = 'hospital_outcome_1alive_0dead'
## columns to exclude from the analysis (text, urls, etc.)
excludeColumns = []
## minimum correlation coeff to assume as a key variable
minimumCorrCoef = 0.01
threshold = 0.5


In [225]:
## read CSV files, autodetect delimiters, skip spaces in names
df1 = pd.read_csv(datafilename1, sep=None, engine="python", skipinitialspace=True)
df2 = pd.read_csv(datafilename2, sep=None, engine="python", skipinitialspace=True)
df3 = pd.read_csv(datafilename3, sep=None, engine="python", skipinitialspace=True)
dfs = [df1, df2, df3]

In [226]:
for df in dfs:
    ## exclude columns
    df.drop(excludeColumns, axis ='columns', inplace = True)
    ## and drop  all NaN
    df.dropna(inplace=True)
    ## place depended var into pos 0
    poped = df.pop(dependVar)
    df.insert(0, poped.name, poped)
        
## independent variable column names (all three datasets share the same columns)
independVarList = list(df1.columns.values)
independVarList.remove(dependVar)

The dataset analysis begins with a basic introduction to the variables. There are numeric as well as categorical variables. The first group consists of ```age_years``` and ```episode_number```; the second group contains ```hospital_outcome_1alive_0dead``` and ```sex_0male_1female```. ```sex_0male_1female``` is a categorical variable: 0 represents male and 1 represents female. ```hospital_outcome_1alive_0dead``` is __the dependent variable__: 0 means the illness ended in the death of the patient and 1 means the patient survived.

In [227]:
class ShowHideableOutput():
    def __init__(self, outputGenerator, description="output"):
        self.outputGenerator = outputGenerator
        self.description = description
        layout = widgets.Layout(width='auto', height='40px') #set width and height
        self.wdg_button   = widgets.ToggleButton(
                    value=False,
                    description='Show ' + self.description,
                    disabled=False,
                    button_style='', # 'success', 'info', 'warning', 'danger' or ''
                    tooltip='Press to show/hide output information',
                    icon='check', # (FontAwesome names without the `fa-` prefix)
                    layout = layout
                )
        self.wdg_output  = widgets.Output(layout={'border': '3px solid green'})
        display(self.wdg_button, self.wdg_output)
        self.wdg_button.observe(self._onchange_wdg_button, names='value')

    def _onchange_wdg_button(self, change):
        if (change['new']) :
            with self.wdg_output:
                display('Output is shown, press button above to hide it.')
                self.outputGenerator()
            self.wdg_button.description = 'Hide ' + self.description
        else:
            self.wdg_button.description = 'Show ' + self.description
            self.wdg_output.clear_output()



Furthermore, summary tables for all three datasets are displayed to give a quick look at the basic statistics of the values. The tables are produced with the pandas ```describe()``` method; the Seaborn library is used for the plots that follow.

In [228]:
def out_show_stats():
    display(md("""Dataset ```df1``` (primary cohort)  statistic:"""))
    display(df1.describe())
    display(md("""Dataset ```df2``` (study cohort)  statistic:"""))
    display(df2.describe())
    display(md("""Dataset ```df3``` (validation cohort)  statistic:"""))
    display(df3.describe())
    
    
var = ShowHideableOutput(out_show_stats, "basic statistic")

In [229]:
class reviewRawData:
    #all plots for data set to be reviewed in
    def __init__(self, df):
        self.df = df.copy()
        self.pairPlot()
        self.HeatmapCorr()
        self.scatterPlot()
        self.histogramPlot()
        
    def scatterPlot(self):
        ## use the dataset passed to the class, not the global ``df``
        allVars = list(self.df.columns)
        nxdr2x = widgets.Dropdown(options = allVars, value = allVars[1], 
                            description = 'X value')
        ## the dependent variable sits in column 0, so it is the default Y axis
        nxdr2y = widgets.Dropdown(options = allVars, value = allVars[0], 
                            description = 'Y value')

        def plot_scatterplot(x = 'age_years', y = 'hospital_outcome_1alive_0dead'):    
            sns.scatterplot(data = self.df, x = x, y = y)

        out2 = widgets.interact(plot_scatterplot, x=nxdr2x, y = nxdr2y)
        
    def pairPlot(self):
        display(md("""Pairplot of variables of dataset"""))
        sns.pairplot(self.df)
        plt.show()
        
        
    def HeatmapCorr(self): 
        sns.set(rc = {'figure.figsize':(15,8)})
        sns.set_theme(style="white")
        corr = self.df.corr()
        # Generate a mask for the upper triangle
        mask = np.triu(np.ones_like(corr, dtype=bool))
        # Custom colormap
        cmap = sns.diverging_palette(275, 150, s=90, l=50, n=9, as_cmap=True)
        sns.heatmap(corr, mask=mask, cmap=cmap,  linewidths=0.3, cbar_kws={"shrink":0.5})
        
    def histogramPlot(self):
        ## use the dataset passed to the class, not the global ``df``
        allVars = list(self.df.columns)
        nxdr = widgets.Dropdown(options = allVars, value = allVars[1], 
                            description = 'X value')

        nBinsdr = widgets.Dropdown(options=[2,5,10,15,20,40], value=15, 
                            description = 'Bins number')

        def plot_histogram(x = 'age_years', nBins = 15, PlotKDE = False):    
            sns.histplot(data = self.df[x] , bins = nBins, kde = PlotKDE)

        out = widgets.interact(plot_histogram, nBins = nBinsdr,PlotKDE = False,x=nxdr)

The data from the sets can then be visualised for better understanding. The values in the sets are compared using different plots, which gives a clearer picture of the correlation between the variables. Any of the three datasets can be selected for this basic visual analysis. Once a dataset is chosen, the relationships between its variables are shown as a pairplot, a correlation heatmap, a scatterplot, and a histogram.

In [230]:
def out_show_statistic_graphs():
    ##dfChoice = np.array({'df1':'df 1','df2':'df 2','df3':'df 3'})
    dfChoice = np.asarray(['df1','df2','df3'])
    ndf = widgets.Dropdown(options = dfChoice, value = dfChoice[0], 
                        description = 'dfChoice', option = 'dfChoice')

    def plot_review(x = ndf):
        #plotting according to chosen data set
        if x == 'df1':
            z = reviewRawData(df1)
        elif x == 'df2':
            z = reviewRawData(df2)
        elif x == 'df3':
            z = reviewRawData(df3)

    out = widgets.interact(plot_review, x = ndf) 
    ui = widgets.HBox([ndf]) 
        
var = ShowHideableOutput(out_show_statistic_graphs, "statistical graphs")

## Logistic Regression

Many different types of models have been developed. Each of them suits a specific situation, depending on the data types used, the target of the modeling, and other factors. Choosing the model is an important stage of data analysis, because the selected modeling method limits the accuracy, validity, and relevance of the resulting model.

For the problem in this project, logistic regression was chosen. Logistic regression is usually used when the predicted value is a categorical variable. In the current case the predicted value is the dependent variable, which has two possible (binary) values: 0 and 1. In fact, the model does not output the value 0 or 1 directly, but the probability that the value is 1. This is why the models table has a ```passprob``` column: it holds the probability used as the threshold. In the end, one of the two possible values is expected as the model output, so the predicted probability is compared with that threshold. With a different threshold, the same model gives different output results.
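
The effect of the threshold can be sketched with a few numbers. The linear predictor values below are made up for illustration, and the helper names `logistic` and `classify` are not part of the notebook.

```python
import numpy as np

def logistic(z):
    """Logistic (sigmoid) function: maps any real number into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical linear predictor values (intercept + coefficients * variables).
z = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
probabilities = logistic(z)

def classify(probabilities, threshold):
    """Return 1 where the predicted probability reaches the threshold, else 0."""
    return (probabilities >= threshold).astype(int)

# The same probabilities give different class labels under different thresholds.
low = classify(probabilities, 0.3)
high = classify(probabilities, 0.7)
print(low, high)
```

A lower threshold labels more cases as 1 (alive) and a higher threshold labels fewer, which is why each model is evaluated over a range of thresholds.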

Depending on the situation and the dataset under analysis, different selection criteria can be used. In this project the model predicts the survival of people. Therefore the model should be selected so as to minimize so-called "false negative" cases. A false negative prediction means the model predicts output "0" (death) when the actual outcome was "1" (alive). Reducing false negatives increases false positives and decreases the "effectiveness" of the model. But, as mentioned above, the correct model selection depends on the meaning of the data under analysis.
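
To make the terms concrete, the following sketch counts the four outcome types on a small hand-made example and derives sensitivity, specificity, and balanced accuracy (BA) from them, treating 1 (alive) as the positive class. The function name `confusion_metrics` and the sample labels are assumptions for illustration.

```python
def confusion_metrics(actual, predicted):
    """Count TP/FN/TN/FP with 1 (alive) as the positive class."""
    pairs = list(zip(actual, predicted))
    tp = sum(1 for a, p in pairs if a == 1 and p == 1)
    fn = sum(1 for a, p in pairs if a == 1 and p == 0)  # predicted death, patient alive
    tn = sum(1 for a, p in pairs if a == 0 and p == 0)
    fp = sum(1 for a, p in pairs if a == 0 and p == 1)  # predicted alive, patient died
    sensitivity = tp / (tp + fn) if tp + fn else float("nan")
    specificity = tn / (tn + fp) if tn + fp else float("nan")
    return {"TP": tp, "FN": fn, "TN": tn, "FP": fp,
            "sensitivity": sensitivity, "specificity": specificity,
            "BA": (sensitivity + specificity) / 2}

# Illustrative labels only.
actual    = [1, 1, 1, 1, 0, 0, 1, 0]
predicted = [1, 1, 0, 1, 0, 1, 1, 0]
metrics = confusion_metrics(actual, predicted)
print(metrics)
```

Balanced accuracy averages sensitivity and specificity, so it is not dominated by the larger class when the two outcomes are unevenly represented.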

The models in this chapter are fitted using ```df1``` (primary cohort) as training data and evaluated using ```df2``` (study cohort) as test data.

In [231]:
class LogisticRegression:
    def __init__(self, dfTrain, dfTest):
        self.dfTrain = dfTrain.copy()
        self.dfTest = dfTest.copy()
        self.models = []
        self.modresults = []
        self.predictVars = []
        self.workingModel = None
        ### make models for all variables and combinations
        self.allResults = pd.DataFrame(columns=['vars', 'passprob', 'res', 'model', 'BA', 'TP', 'FN', 'TN', 'FP'])
        self.CalculateModels()
        self.TestModels()
        
    def CalculateModels(self):
        keyVars = self.GetKeyVariables()
        self.dependVar = keyVars['key'].iloc[0]
        ### All vars into list.
        self.predictVars = keyVars['key'].iloc[1:].to_list()    ## .iloc[1:NvarsCount+1].to_list()
        ### make models for all variables, sequently adding one by one
        for n,indvar in enumerate(self.predictVars):
            indvars = self.predictVars[:n+1]
            modvars = list(map(lambda orig_string: 'Q("' + orig_string + '")', indvars))
            model = smf.logit(formula = self.dependVar + ' ~ ' + ' + '.join(modvars), data = self.dfTrain)
            res = model.fit(disp=False)
            self.models.append(model)
            self.modresults.append(res)


    def GetKeyVariables(self):
        ## correlation of every column with the dependent variable,
        ## which was moved to column 0 during data preparation
        parCorr = self.dfTrain.corr()
        ## DataFrame.append was removed in pandas 2.0, so build the table directly
        keyVars = pd.DataFrame({'key': parCorr.columns, 'val': parCorr.iloc[0].values})
        ## sort key vars by absolute correlation value, strongest first
        keyVars.sort_values(by='val', key=abs, ascending=False, inplace=True)
        return keyVars

    
    def get_BA(self, crossdf):
        ## rows of crossdf are model predictions, columns are actual outcomes
        crossdf.index = crossdf.index.map(str)
        crossdf.columns = crossdf.columns.map(str)

        def cell(predicted, actual):
            ## a label is missing when that class never occurs: count it as 0
            try: return crossdf.loc[predicted, actual]
            except KeyError: return 0

        TN = cell('0','0')
        FN = cell('0','1')
        FP = cell('1','0')
        TP = cell('1','1')
        # sensitivity = (TP)/(TP+FN)
        # specificity = (TN)/(TN+FP)
        # precision = (TP)/(TP+FP)
        ## guard against empty classes explicitly instead of a bare except
        sensitivity = TP/(TP + FN) if (TP + FN) > 0 else np.nan
        specificity = TN/(TN + FP) if (TN + FP) > 0 else np.nan
        precision = TP/(TP + FP) if (TP + FP) > 0 else np.nan
        BA = (sensitivity + specificity)/2
        return dict(zip(['BA', 'sensitivity', 'specificity', 'precision', 'TN', 'FN', 'TP', 'FP'], [BA, sensitivity, specificity, precision, TN, FN, TP, FP]))
    
    
    def TestModels(self):
        ### calculate BA for all thresholds from 0 to 1 in steps of 0.05
        passProbabilities = np.linspace(0,1,int(1/0.05)+1)

        ## Generate all non-empty combinations of variables
        variableList = [list(combo) for i in range(1, len(self.predictVars) + 1)
                        for combo in itertools.combinations(self.predictVars, i)]

        rows = []
        for indvars in variableList:
            modvars = ['Q("' + name + '")' for name in indvars]
            model = smf.logit(formula = self.dependVar + ' ~ ' + ' + '.join(modvars), data = self.dfTrain )
            fitresult = model.fit(disp=False)
            ## predicted probabilities do not depend on the threshold: compute them once
            probability = fitresult.predict(self.dfTest[indvars])
            for prob in passProbabilities:
                ## 0 or 1 values based on probabilities
                prediction = (probability >= prob).astype(int).rename('Model prediction')
                confMatrix = pd.crosstab(prediction, self.dfTest[self.dependVar], dropna=False)
                ba = self.get_BA(confMatrix)
                ### collect all info for one row of the results table
                rows.append({'vars':indvars, 
                             'passprob':prob, 
                             'res':fitresult, 
                             'model':model,
                             'Rsqrd':fitresult.prsquared, **(ba)})
        ## DataFrame.append was removed in pandas 2.0: concatenate all rows at once
        self.allResults = pd.concat([self.allResults, pd.DataFrame(rows)], ignore_index=True)

        
    def ShowResultsTable(self):
        def _ui_show_result_table(sortcol1,sortcol2, sortorder1, sortorder2):
            print("Selected columns and order: ", sortcol1,sortcol2, sortorder1, sortorder2)
            allResults = self.allResults.reindex(
                            index = self.dfBA.sort_values(by=[sortcol1,sortcol2], #['TP', 'BA'],#
                                        ascending=[sortorder1,sortorder2]).index,
                            copy=True)
            pd.set_option("display.max_rows", 400)
            pd.set_option("display.max_columns", 15)
            pd.set_option("display.max_colwidth", 2000)
            display(allResults.loc[:, ~allResults.columns.isin(['res', 'model'])])
            
        sortableColumns = [item for item in list(self.allResults.columns) 
                   if item not in list(['vars', 'res', 'model'])] 
        self.dfBA = self.allResults.loc[:, sortableColumns].astype('float64').round(decimals=1)
        sort1col = widgets.SelectionSlider(
            options=sortableColumns,
            value='FN',
            description='Sort by...',
            disabled=False,
            continuous_update=False,
            orientation='horizontal',
            readout=True
        )
        sort2col = widgets.SelectionSlider(
            options=sortableColumns,
            value='Rsqrd',
            description='and by...',
            disabled=False,
            continuous_update=False,
            orientation='horizontal',
            readout=True
        )
        sort1order = widgets.ToggleButtons(
            options=[('Ascending',True), ('Descending',False)],
            value=True,
            description='Order:',
            disabled=False,
            button_style='', # 'success', 'info', 'warning', 'danger' or ''
            tooltips=['Sort column in ascending order (1..99)', 'Sort column in descending order (99..1)']
        )
        sort2order = widgets.ToggleButtons(
            options=[('Ascending',True), ('Descending',False)],
            value=False,
            description='Order:',
            disabled=False,
            button_style='', # 'success', 'info', 'warning', 'danger' or ''
            tooltips=['Sort column in ascending order (1..99)', 'Sort column in descending order (99..1)']
        )

        sort1ui = widgets.HBox([sort1col,sort1order])
        sort2ui = widgets.HBox([sort2col,sort2order])
        resultui = widgets.interactive_output(_ui_show_result_table,
                            {'sortcol1':sort1col,'sortcol2':sort2col, 
                             'sortorder1':sort1order,'sortorder2':sort2order})
        display(sort1ui,sort2ui,resultui)
        
    def setWorkingModel(self, modelNumber):
        if (modelNumber in self.allResults.index.values):
            self.workingModel = modelNumber

        
    def getModelPrediction(self, dfInput, modelNumber=None):
        if modelNumber is None:
            modelNumber = self.workingModel
        if (modelNumber in self.allResults.index.values):
            modelresult = self.allResults.iloc[modelNumber].res
            return modelresult.predict(dfInput)
        return []

    def ShowModelInfo(self, modelNumber=None):
        if modelNumber is None:
            modelNumber = self.workingModel
        if (modelNumber in self.allResults.index.values):
            modelresult = self.allResults.iloc[modelNumber].res
            ## Print results
            print("*" * 80)
            print("*" * 20, f"      Analyze results for model {modelNumber}     ","*" * 20)
            print("*" * 80)
            print(modelresult.summary() )
   
        

In [232]:
## Calculate logistic regression models
## the biggest dataset df1 is the training dataset
## df2 is the test dataset
logreg1 = LogisticRegression(df1, df2)  ## df1-big as train
logreg1.setWorkingModel(141)

In [233]:
def out_show_models_table():
    display(md("""Key variables table:"""))
    display(logreg1.GetKeyVariables())
    display(md("""Models comparison table:"""))
    logreg1.ShowResultsTable()
    
var = ShowHideableOutput(out_show_models_table, "models table")

In [234]:
def out_show_models_info():
    display(md("""Selected model parameters:"""))
    logreg1.ShowModelInfo()
    
var = ShowHideableOutput(out_show_models_info, "model info")

## Prediction

As mentioned above, different datasets require different approaches to choosing the best model. For the current dataset, the best model was chosen by first requiring zero false negative cases and then maximizing the pseudo R-squared.
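
This two-step rule can be sketched on a small results table. The numbers below are invented for illustration and do not come from the models table of this notebook; only the column names `FN` and `Rsqrd` mirror it.

```python
import pandas as pd

# Hypothetical results: one row per (model, threshold) candidate.
results = pd.DataFrame({
    "model": [0, 1, 2, 3],
    "FN":    [12, 0, 0, 3],
    "Rsqrd": [0.08, 0.03, 0.05, 0.09],
})

# Step 1: keep only candidates without false negatives.
no_false_negatives = results[results["FN"] == 0]

# Step 2: among those, take the one with the highest pseudo R-squared.
best = no_false_negatives.sort_values("Rsqrd", ascending=False).iloc[0]
print(int(best["model"]))
```

Note that candidate 3 has the highest pseudo R-squared overall but is excluded in the first step, which shows that the order of the two criteria matters.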

In [235]:
md(f"""According to these criteria, model № ```{logreg1.workingModel}``` (see the models table above) is selected as the working model.
Based on this model, the user can now enter patient data and obtain a predicted outcome of the illness. The form below accepts all independent variables and passes them to the chosen outcome prediction model. The information from the form is gathered in a table. """)

According to these criteria, model № ```141``` (see the models table above) is selected as the working model.
Based on this model, the user can now enter patient data and obtain a predicted outcome of the illness. The form below accepts all independent variables and passes them to the chosen outcome prediction model. The information from the form is gathered in a table. 

In [236]:
class OutcomePredictionUI: 
    def __init__(self, logReg, independentVars):
        self.ageMin = 20
        self.ageMax = 100+1
        self.episodesMin = 1
        self.episodesMax = 5
        
        self.independentVars = independentVars.copy()
        self.logReg = logReg
        # self.formdf = pd.DataFrame(columns= [*self.independentVars, 'Result probability'])
        self.formdf = self.logReg.dfTrain.drop(self.logReg.dfTrain.index)
        self.i,self.j = 4,4
        self.grid = widgets.GridspecLayout(self.i,self.j)
        ### self.dfval = df[self.independentVars]
        self.widgetsCreate()

    def create_expanded_button(self,description, button_style):
        return widgets.Button(description=description, button_style=button_style,
                          layout=widgets.Layout(height='auto', width='auto'))

    def widgetsCreate(self):
        #BUTTONS
        self.btn_Send = self.grid[2, 0] = self.create_expanded_button('Results'.format(0, 0), 'warning')
        self.btn_Clear = self.grid[2, 1] = self.create_expanded_button('Clear'.format(0, 0), 'warning')
        #self.btn_Results = self.grid[2, 2] = self.create_expanded_button('Get results'.format(0, 0), 'warning')
        #Fields to select
        self.btn_sex = self.grid[0,0] = widgets.Dropdown(description='Sex', options =['Female','Male'])
        self.btn_age = self.grid[0,1] = widgets.Dropdown(description='Age', options =range(self.ageMin,self.ageMax))
        self.btn_episodes = self.grid[0,2] = widgets.Dropdown(description='Episodes number', 
                                                              options =range(self.episodesMin,self.episodesMax))
        
        for j in range(2):
            self.grid[2,j].on_click(self.on_btn_click)
        #display the grid and table
        self.out_table = widgets.Output()
        display(md(f"""Working model: {self.logReg.workingModel}."""))
        display(self.grid, self.out_table)

    def addToDF(self):
        ## append the values currently selected in the form as a new row
        ## (DataFrame.append was removed in pandas 2.0, so pd.concat is used)
        newRow = pd.DataFrame([[self.btn_age.value, self.btn_sex.value, 
                                self.btn_episodes.value]], columns = self.independentVars)
        self.formdf = pd.concat([self.formdf, newRow], ignore_index=True)
        ## convertDF() also refreshes the displayed table
        self.convertDF()
       
    def convertDF(self):
        ## for user convenience the form shows gender names;
        ## convert them to the 0/1 coding used by the model before predicting
        self.formdf['sex_0male_1female'] = (self.formdf['sex_0male_1female']
                                            .replace({'Male': 0, 'Female': 1}).astype(int))
        resProbs = self.logReg.getModelPrediction(self.formdf)
        self.formdf['Result probability'] = resProbs
        ## same rule as in model testing: alive when probability >= threshold
        self.formdf['Result: alive'] = resProbs >= self.logReg.allResults.iloc[self.logReg.workingModel].passprob
        self.update_results()
             
    def clear(self):
        self.formdf = self.logReg.dfTrain.drop(self.logReg.dfTrain.index)
        self.update_results()

    def on_btn_click(self, btn):
        if btn.description == 'Results':
            self.addToDF()

        elif btn.description == 'Clear':
            self.clear()
    
    def update_results(self):
        self.out_table.clear_output(wait=True)
        formdf = self.formdf.copy()
        ## hide the (empty) dependent variable column from the displayed table
        formdf.pop(self.logReg.dependVar)
        with self.out_table:
            display(formdf)

The form below allows the user to enter values and have them evaluated. The prediction is computed by the working model selected previously.
The result for the entered values is the predicted probability that the patient survives, together with the resulting alive/dead classification.

In [237]:
var = OutcomePredictionUI(logreg1, ["age_years", "sex_0male_1female", "episode_number"] )

Working model: 141.




## Second model

In this section we attempt to generate a second model. Here the smallest dataset ```df3``` is used as training data, and the same dataset ```df2``` is used as test data. From the models table below, it is hard to apply the same selection principles as in the previous case. Additionally, we noticed that the correlation coefficients of the key variables are completely different from those of the previous model.

In [238]:
## calculate models for the other datasets
## df3, the smallest dataset, is now used as _train_
## the same df2 is used as test
## is there any difference compared to the first set?
logreg3 = LogisticRegression(df3, df2)  ## df3-small as train
logreg3.setWorkingModel(73)

In [239]:
def out_show_model_2():
    def out_show_models_table_2():
        display(md("""Key variables table:"""))
        display(logreg3.GetKeyVariables())
        display(md("""Models for ```df3``` as train data comparison table:"""))
        logreg3.ShowResultsTable()
    def out_show_models_info_2():
        display(md("""Selected model parameters:"""))
        logreg3.ShowModelInfo()
    def out_show_prediction_2():
        var = OutcomePredictionUI(logreg3, ["age_years", "sex_0male_1female", "episode_number"] )
    
    display(md("""The key variables and correlation coefficient, models table"""))
    var = ShowHideableOutput(out_show_models_table_2, "models table 2")
    display(md("""Selected model information"""))
    var = ShowHideableOutput(out_show_models_info_2, "model info 2")    
    var = OutcomePredictionUI(logreg3, ["age_years", "sex_0male_1female", "episode_number"] )
    
var = ShowHideableOutput(out_show_model_2, "second model information")

## Conclusion

The result of the dataset analysis is a logistic regression model that can be used for prediction. Choosing the model is an important and difficult step, and the results depend fundamentally on this choice. The analysis is structured around widgets that give a user-friendly view. The results form lets the user enter information and obtain a prediction straight away. Overall, the analysis was carried out as intended. Because the dataset concerns illness and the distinction between survival and death, as stated at the beginning, this evaluation cannot be taken as a valid basis for any health decision.
This work showed that data analysis and model selection depend on both the data and the target, and therefore cannot be fully automated.