In [24]:
%matplotlib notebook

### Loading modules

First, we load the modules required for the analysis into the environment.

In [25]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns 
sns.set(style="whitegrid", color_codes=True)

### Loading the data
The dataset of interest was downloaded from [OpenML](https://www.openml.org/d/55) as a **comma-separated values** (CSV) file and uploaded to the Jupyter environment. We now use **pandas** to load the data into a dataframe.

In [26]:
hepatitis_data = pd.read_csv("dataset_55_hepatitis.csv")

### Understanding and preparing the dataset

We can take a look at the dataset that we've just loaded by using the `head` command.

In [27]:
hepatitis_data.head()

Unnamed: 0,AGE,SEX,STEROID,ANTIVIRALS,FATIGUE,MALAISE,ANOREXIA,LIVER_BIG,LIVER_FIRM,SPLEEN_PALPABLE,SPIDERS,ASCITES,VARICES,BILIRUBIN,ALK_PHOSPHATE,SGOT,ALBUMIN,PROTIME,HISTOLOGY,Class
0,30,male,no,no,no,no,no,no,no,no,no,no,no,1.0,85,18,4.0,?,no,LIVE
1,50,female,no,no,yes,no,no,no,no,no,no,no,no,0.9,135,42,3.5,?,no,LIVE
2,78,female,yes,no,yes,no,no,yes,no,no,no,no,no,0.7,96,32,4.0,?,no,LIVE
3,31,female,?,yes,no,no,no,yes,no,no,no,no,no,0.7,46,52,4.0,80,no,LIVE
4,34,female,yes,no,no,no,no,yes,no,no,no,no,no,1.0,?,200,4.0,?,no,LIVE


As we can see above, missing values are marked with the '?' symbol and most of the data is not numerical. We can confirm this with the `dtypes` attribute.

In [28]:
hepatitis_data.dtypes

AGE                 int64
SEX                object
STEROID            object
ANTIVIRALS         object
FATIGUE            object
MALAISE            object
ANOREXIA           object
LIVER_BIG          object
LIVER_FIRM         object
SPLEEN_PALPABLE    object
SPIDERS            object
ASCITES            object
VARICES            object
BILIRUBIN          object
ALK_PHOSPHATE      object
SGOT               object
ALBUMIN            object
PROTIME            object
HISTOLOGY          object
Class              object
dtype: object
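Numeric columns such as BILIRUBIN and PROTIME show up as `object` because the '?' placeholders force pandas to read them as text. As a minimal sketch of an alternative, assuming the same placeholder, `read_csv` can parse '?' as missing at load time through its `na_values` parameter. The excerpt below is a hypothetical subset of the columns, modelled on the rows shown earlier, not the full file.

```python
import io
import pandas as pd

# Small excerpt modelled on the rows shown above (subset of columns only).
sample_csv = io.StringIO(
    "AGE,SEX,STEROID,BILIRUBIN,ALK_PHOSPHATE,PROTIME,Class\n"
    "30,male,no,1.0,85,?,LIVE\n"
    "50,female,no,0.9,135,?,LIVE\n"
    "31,female,?,0.7,46,80,LIVE\n"
    "34,female,yes,1.0,?,?,LIVE\n"
)

# Treat '?' as a missing value while parsing.
sample = pd.read_csv(sample_csv, na_values="?")

# Number of missing entries in each column.
missing_per_column = sample.isna().sum()

print(sample.dtypes)
print(missing_per_column)
```

With this approach the numeric columns are parsed as floats directly, while the yes/no text columns would still need to be recoded.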

Before proceeding, we can also check the `shape` of our dataframe. As we can see below, the dataset has 155 rows, one per patient included in this study, and 20 columns, corresponding to the features or characteristics collected for each patient.

In [29]:
hepatitis_data.shape

(155, 20)

Because machine learning algorithms require numerical data, we will convert categorical values such as 'no', 'yes', 'DIE' and 'LIVE' into numerical codes. For this task we will use the `replace` method.

In [30]:
# Replace yes, no, DIE, LIVE, female, male and ? with numerical values or np.nan
replacements = {'no': 0,
               'yes': 1,
               'DIE': 0,
               'LIVE': 1,
               '?': np.nan,
               'female': 0,
               'male': 1}

hepatitis_data.replace(replacements, inplace = True)

Lastly, we will convert all of the columns in the dataset to **float** type.

In [31]:
hepatitis_data = hepatitis_data.astype(float)
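To make the effect of these two steps visible, here is a minimal sketch on a toy frame. The values are hypothetical and only modelled on the columns above. It applies the same mapping and then casts to float, assigning the result instead of mutating in place.

```python
import numpy as np
import pandas as pd

# Toy frame modelled on the columns above; dtype=object mimics how
# pandas stores the text columns after read_csv.
toy = pd.DataFrame(
    {
        "SEX": ["male", "female", "female"],
        "STEROID": ["no", "?", "yes"],
        "BILIRUBIN": ["1.0", "0.7", "?"],
        "Class": ["LIVE", "LIVE", "DIE"],
    },
    dtype=object,
)

replacements = {"no": 0, "yes": 1, "DIE": 0, "LIVE": 1,
                "?": np.nan, "female": 0, "male": 1}

# Recode the categories, then cast every column to float;
# numeric strings such as "1.0" are converted by astype.
toy_numeric = toy.replace(replacements).astype(float)

print(toy_numeric)
```

The '?' entries end up as `NaN`, which is why the cast targets float: an integer column cannot hold `NaN`.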

Next, we check how the two survival classes are represented in the dataset, i.e. whether there is class imbalance.

In [32]:
total_of_patients = hepatitis_data.shape[0]
# Share of each class as a percentage of all patients
pct_live_patients = (np.sum(hepatitis_data['Class'] == 1) / total_of_patients) * 100
pct_dead_patients = (np.sum(hepatitis_data['Class'] == 0) / total_of_patients) * 100
print("Living patients:", round(pct_live_patients, 2), "%")
print("Dead patients:", round(pct_dead_patients, 2), "%")

Living patients: 79.35 %
Dead patients: 20.65 %
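The same percentages can be obtained in one step with `value_counts(normalize=True)`. The sketch below does not load the dataset. It rebuilds the `Class` column from counts implied by the output above (155 patients, 79.35 % alive, hence 123 alive and 32 dead), which is an assumption made for illustration.

```python
import pandas as pd

# Counts reconstructed from the output above: 155 patients, 123 alive, 32 dead.
survival = pd.Series([1.0] * 123 + [0.0] * 32, name="Class")

# Relative frequency of each class, expressed as a percentage.
class_share = survival.value_counts(normalize=True).mul(100).round(2)

print(class_share)
```

On the real dataframe the equivalent call would be `hepatitis_data['Class'].value_counts(normalize=True)`.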


### Exploratory Analysis

In [33]:
hepatitis_data.describe()

Unnamed: 0,AGE,SEX,STEROID,ANTIVIRALS,FATIGUE,MALAISE,ANOREXIA,LIVER_BIG,LIVER_FIRM,SPLEEN_PALPABLE,SPIDERS,ASCITES,VARICES,BILIRUBIN,ALK_PHOSPHATE,SGOT,ALBUMIN,PROTIME,HISTOLOGY,Class
count,155.0,155.0,154.0,155.0,154.0,154.0,154.0,145.0,144.0,150.0,150.0,150.0,150.0,149.0,126.0,151.0,139.0,88.0,155.0,155.0
mean,41.2,0.103226,0.506494,0.154839,0.649351,0.396104,0.207792,0.827586,0.416667,0.2,0.34,0.133333,0.12,1.427517,105.325397,85.89404,3.817266,61.852273,0.451613,0.793548
std,12.565878,0.30524,0.501589,0.362923,0.47873,0.490682,0.407051,0.379049,0.494727,0.40134,0.475296,0.341073,0.32605,1.212149,51.508109,89.65089,0.651523,22.875244,0.499266,0.40607
min,7.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.3,26.0,14.0,2.1,0.0,0.0,0.0
25%,32.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.7,74.25,31.5,3.4,46.0,0.0,1.0
50%,39.0,0.0,1.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,85.0,58.0,4.0,61.0,0.0,1.0
75%,50.0,0.0,1.0,0.0,1.0,1.0,0.0,1.0,1.0,0.0,1.0,0.0,0.0,1.5,132.25,100.5,4.2,76.25,1.0,1.0
max,78.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,8.0,295.0,648.0,6.4,100.0,1.0,1.0


### Graphical Exploratory Analysis

In [34]:
hepatitis_analysis = hepatitis_data.dropna()
interesting_values_x = ['AGE', 'BILIRUBIN', 'PROTIME', 'ALBUMIN', 'ASCITES', 'ALK_PHOSPHATE', 'SGOT']
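Note that `dropna()` without arguments discards every row that has a missing value in any column, including columns that are never plotted. A minimal sketch of the difference, on a toy frame with hypothetical values, using the `subset` parameter to restrict the check to the plotted columns:

```python
import numpy as np
import pandas as pd

# Toy frame with hypothetical values; STEROID is not among the plotted columns.
toy = pd.DataFrame({
    "AGE": [30.0, 50.0, 31.0, 34.0],
    "BILIRUBIN": [1.0, 0.9, 0.7, 1.0],
    "STEROID": [0.0, 0.0, np.nan, 1.0],
    "PROTIME": [np.nan, 61.0, 80.0, 46.0],
})
plotted = ["AGE", "BILIRUBIN", "PROTIME"]

# Drops any row with a gap in any column.
all_complete = toy.dropna()

# Drops only rows with a gap in one of the plotted columns.
plotted_complete = toy.dropna(subset=plotted)

print(len(all_complete), len(plotted_complete))
```

Restricting the check keeps more rows for plotting; whether that is preferable here depends on whether the analysis needs fully complete records.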

In [37]:
plot_relation = sns.PairGrid(hepatitis_analysis, vars = interesting_values_x, hue = 'Class', despine=True);
plot_relation.map_upper(sns.regplot, scatter_kws={"s": 30, "alpha": .5}, line_kws = {"alpha": 0.4, "lw": 1.5}); 
plot_relation.map_lower(sns.residplot); 
plot_relation.map_diag(plt.hist) 
plot_relation.add_legend();

