## Import data



In [1]:
import pandas as pd
df = pd.read_csv('hepatitis.csv')
print(df)
df.head(5)

FileNotFoundError: [Errno 2] No such file or directory: 'hepatitis.csv'

## Identify missing values



To check whether the dataset contains missing values, we can use the `isna()` method, which flags every missing cell. Chaining `sum()` then counts the flagged cells in each column.

In [None]:
df.isna().sum()

Unnamed: 0,0
age,0
sex,0
steroid,1
antivirals,0
fatigue,1
malaise,1
anorexia,1
liver_big,10
liver_firm,11
spleen_palpable,5


Now we can compute the percentage of missing values for each column by dividing the previous result by the length of the dataset (`len(df)`) and multiplying by 100.

In [None]:
df.isna().sum()/len(df)*100

Unnamed: 0,0
age,0.0
sex,0.0
steroid,0.645161
antivirals,0.0
fatigue,0.645161
malaise,0.645161
anorexia,0.645161
liver_big,6.451613
liver_firm,7.096774
spleen_palpable,3.225806
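
The two steps above can be combined into a single summary table. The following is a minimal sketch on a hypothetical toy dataframe (`toy` and its values are made up for illustration, not taken from the hepatitis file); it relies on the fact that the mean of a boolean mask equals the fraction of `True` values.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for the real dataset.
toy = pd.DataFrame({
    "age": [30, 50, 78, 31],
    "steroid": [False, np.nan, True, True],
    "liver_big": [np.nan, np.nan, True, False],
})

# isna() returns a boolean mask; sum() counts the True cells per column,
# while mean() gives the fraction of True cells per column.
summary = pd.DataFrame({
    "missing": toy.isna().sum(),
    "percent": toy.isna().mean() * 100,
})
print(summary)
```

Here `toy.isna().mean() * 100` gives the same numbers as `toy.isna().sum() / len(toy) * 100`.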


## Drop missing values
Missing values can be dropped in two ways:
* remove the rows that contain missing values
* remove the columns that contain missing values

We can use the `dropna()` method and specify the axis to consider: `axis=0` drops every row that contains at least one missing value, while `axis=1` drops every column that contains at least one.


In [None]:
df.dropna(axis=1)

Unnamed: 0,age,sex,antivirals,histology,class
0,30,male,False,False,live
1,50,female,False,False,live
2,78,female,False,False,live
3,31,female,True,False,live
4,34,female,False,False,live
...,...,...,...,...,...
150,46,female,False,True,die
151,44,female,False,True,live
152,61,female,False,True,live
153,53,male,False,True,live


In [None]:
df.dropna(axis=0)

Unnamed: 0,age,sex,steroid,antivirals,fatigue,malaise,anorexia,liver_big,liver_firm,spleen_palpable,spiders,ascites,varices,bilirubin,alk_phosphate,sgot,albumin,protime,histology,class
5,34,female,True,False,False,False,False,True,False,False,False,False,False,0.9,95.0,28.0,4.0,75.0,False,live
10,39,female,False,True,False,False,False,False,True,False,False,False,False,1.3,78.0,30.0,4.4,85.0,False,live
11,32,female,True,True,True,False,False,True,True,False,True,False,False,1.0,59.0,249.0,3.7,54.0,False,live
12,41,female,True,True,True,False,False,True,True,False,False,False,False,0.9,81.0,60.0,3.9,52.0,False,live
13,30,female,True,False,True,False,False,True,True,False,False,False,False,2.2,57.0,144.0,4.9,78.0,False,live
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
139,45,female,True,True,False,False,False,True,False,False,False,False,False,1.3,85.0,44.0,4.2,85.0,True,live
143,49,female,False,False,True,True,False,True,False,True,True,False,False,1.4,85.0,70.0,3.5,35.0,True,die
145,31,female,False,False,True,False,False,True,False,False,False,False,False,1.2,75.0,173.0,4.2,54.0,True,live
153,53,male,False,False,True,False,False,True,False,True,True,False,True,1.5,81.0,19.0,4.1,48.0,True,live
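
To make the effect of the `axis` argument easier to see, here is a small sketch on a hypothetical three-row dataframe (the values are invented for illustration). It only compares the shapes of the two results.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: two of the three columns contain a NaN.
toy = pd.DataFrame({
    "age": [30, 50, 78],
    "steroid": [False, np.nan, True],
    "protime": [np.nan, 80.0, 75.0],
})

rows_kept = toy.dropna(axis=0)  # keeps only the rows without any NaN
cols_kept = toy.dropna(axis=1)  # keeps only the columns without any NaN

print(rows_kept.shape)  # (1, 3): only the last row is complete
print(cols_kept.shape)  # (3, 1): only 'age' is complete
```

Neither call modifies `toy`; each returns a new dataframe.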


* We can pass `inplace=True` to store the changes in the original dataframe instead of returning a new one.
* We can restrict the operation to specific columns through the `subset` parameter, which lists the columns to inspect for missing values. With `axis=0`, a row is dropped only if it has a missing value in one of those columns.

In [None]:
df.dropna(subset=['liver_big'],axis=0,inplace=True)
df.isna().sum()

Unnamed: 0,0
age,0
sex,0
steroid,1
antivirals,0
fatigue,0
malaise,0
anorexia,0
liver_big,0
liver_firm,1
spleen_palpable,1
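
The same `subset` behaviour can be checked on a hypothetical toy dataframe (invented values). This sketch reassigns the result instead of using `inplace=True`, which is an alternative choice rather than the approach used above; it shows that missing values outside the subset survive.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with NaN values in both columns.
toy = pd.DataFrame({
    "liver_big": [True, np.nan, False, True],
    "liver_firm": [np.nan, True, False, np.nan],
})

# Only rows with a NaN in 'liver_big' are dropped.
cleaned = toy.dropna(subset=["liver_big"], axis=0)

print(len(toy), len(cleaned))             # 4 3
print(cleaned["liver_firm"].isna().sum())  # 2: these NaN values are untouched
```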


## Replace missing values

To replace missing values, three methods can be used: `fillna()`, `replace()` and `interpolate()`.
The `fillna()` method replaces all NaN values with the value passed as its argument.
For example, the NaN values in each numeric column could be replaced with that column's average.
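
As a quick comparison of the three methods, the sketch below applies each of them to the same hypothetical series (the values and the replacement constants `0.0` and `-1.0` are arbitrary choices for illustration).

```python
import numpy as np
import pandas as pd

# Hypothetical series with two gaps.
s = pd.Series([1.0, np.nan, 3.0, np.nan, 5.0])

filled = s.fillna(0.0)               # every NaN becomes 0.0
replaced = s.replace(np.nan, -1.0)   # every NaN becomes -1.0
interpolated = s.interpolate()       # linear interpolation between neighbours

print(filled.tolist())        # [1.0, 0.0, 3.0, 0.0, 5.0]
print(replaced.tolist())      # [1.0, -1.0, 3.0, -1.0, 5.0]
print(interpolated.tolist())  # [1.0, 2.0, 3.0, 4.0, 5.0]
```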


To list the type of each column, we can use the `dtypes` attribute as follows:

In [None]:
df.dtypes

Unnamed: 0,0
age,int64
sex,object
steroid,object
antivirals,bool
fatigue,object
malaise,object
anorexia,object
liver_big,object
liver_firm,object
spleen_palpable,object


## Numeric columns



In [None]:
import numpy as np
numeric = df.select_dtypes(include=np.number)
numeric_columns = numeric.columns
print(numeric_columns)

Index(['age', 'bilirubin', 'alk_phosphate', 'sgot', 'albumin', 'protime'], dtype='object')


We can fill the NaN values of the numeric columns with each column's average, computed by the `mean()` method.

In [None]:
df[numeric_columns] = df[numeric_columns].fillna(df[numeric_columns].mean())
df.isna().sum()/len(df)*100

Unnamed: 0,0
age,0.0
sex,0.0
steroid,0.689655
antivirals,0.0
fatigue,0.0
malaise,0.0
anorexia,0.0
liver_big,0.0
liver_firm,0.689655
spleen_palpable,0.689655
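
The mean-filling step works because `fillna()` accepts a series and matches its labels against the column names. A minimal sketch on a hypothetical toy dataframe (the column names are borrowed from the dataset, the values are invented):

```python
import numpy as np
import pandas as pd

# Hypothetical numeric data with one NaN per column.
toy = pd.DataFrame({
    "bilirubin": [1.0, np.nan, 0.7],
    "albumin": [4.0, 3.5, np.nan],
})

# mean() skips NaN by default and returns one value per column.
means = toy.mean()

# Each NaN is replaced with the mean of its own column.
filled = toy.fillna(means)
print(filled)
```

Each gap receives the mean of its own column (about 0.85 for `bilirubin` and 3.75 for `albumin` here), not a single global value.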


## Categorical columns
We note that `dtypes` reports the categorical columns as `object`, so we can select the `object` columns. We would like to consider only the boolean columns, but the `object` type also includes the column `class`, which is a string. We therefore select all the `object` columns, remove `class` from them, and convert the result to `bool`. Note that `astype('bool')` maps NaN, as well as any non-empty string such as the values in `sex`, to `True`, so missing values must be filled before this conversion if they are to be preserved.

In [None]:
boolean_columns = df.select_dtypes(include=object).columns.tolist()
boolean_columns.remove('class')
df[boolean_columns] = df[boolean_columns].astype('bool')
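
One caveat of converting with `astype('bool')` is that NaN is truthy in Python, so a missing value silently becomes `True`. The sketch below illustrates this on a hypothetical series and contrasts it with pandas' nullable `boolean` dtype, which is shown here only as a possible alternative, not as the approach used in this notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical object column mixing booleans and a missing value.
s = pd.Series([True, False, np.nan], dtype=object)

as_bool = s.astype("bool")         # plain NumPy bool: NaN is converted to True
as_nullable = s.astype("boolean")  # nullable dtype: NaN is kept as <NA>

print(as_bool.tolist())           # [True, False, True]
print(as_bool.isna().sum())       # 0: the missing value is no longer visible
print(as_nullable.isna().sum())   # 1: the missing value is preserved
```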

We can replace the missing boolean values with the most frequent value of each column. The `mode()` method returns the most frequent values as a dataframe, so we take its first row with `iloc[0]` and pass it to `fillna()`, assigning the result back to store it.

In [None]:
df[boolean_columns] = df[boolean_columns].fillna(df[boolean_columns].mode().iloc[0])
df[boolean_columns]

Unnamed: 0,sex,steroid,fatigue,malaise,anorexia,liver_big,liver_firm,spleen_palpable,spiders,ascites,varices
0,True,False,False,False,False,False,False,False,False,False,False
1,True,False,True,False,False,False,False,False,False,False,False
2,True,True,True,False,False,True,False,False,False,False,False
3,True,True,False,False,False,True,False,False,False,False,False
4,True,True,False,False,False,True,False,False,False,False,False
...,...,...,...,...,...,...,...,...,...,...,...
150,True,True,True,True,True,True,False,False,True,True,True
151,True,True,True,False,False,True,True,False,False,False,False
152,True,False,True,True,False,False,True,False,True,False,False
153,True,False,True,False,False,True,False,True,True,False,True
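
When filling with the mode, it helps to remember that `mode()` returns a dataframe and that `fillna()` aligns a dataframe argument on the row index. The following sketch, on a hypothetical toy dataframe with invented values, compares passing the whole result of `mode()` with passing only its first row.

```python
import pandas as pd

# Hypothetical object columns with one missing value each.
toy = pd.DataFrame({
    "fatigue": [True, True, None, False],
    "malaise": [False, None, False, True],
})

# mode() returns a dataframe; here it has a single row with index 0.
modes = toy.mode()

# Passing the dataframe aligns on the index, so only row 0 could be filled.
naive = toy.fillna(modes)

# Taking the first row gives one value per column, applied to every row.
filled = toy.fillna(modes.iloc[0])

print(naive.isna().sum().sum())   # 2: the gaps in rows 1 and 2 remain
print(filled.isna().sum().sum())  # 0
```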


In [None]:
df.isna().sum()/len(df)*100

## Interpolation
Another way to replace missing values is interpolation, for example linear interpolation. In this case, a missing value in a column is replaced with a value interpolated between the previous and the next valid ones. This can be achieved with the `interpolate()` method.

Since we have already fixed all the missing values, we need to reload the dataset.

In [None]:
df = pd.read_csv('hepatitis.csv')
df.isna().sum()/len(df)*100

Unnamed: 0,0
age,0.0
sex,0.0
steroid,0.645161
antivirals,0.0
fatigue,0.645161
malaise,0.645161
anorexia,0.645161
liver_big,6.451613
liver_firm,7.096774
spleen_palpable,3.225806


In [None]:
numeric = df.select_dtypes(include=np.number)
numeric_columns = numeric.columns

Now we can apply the `interpolate()` method to the numeric columns, also setting `limit_direction` to `forward`. This means that gaps are filled moving from the first row towards the last one, so NaN values that precede the first valid value of a column are left unfilled.

https://www.geeksforgeeks.org/pandas-dataframe-interpolate/

In [None]:
df[numeric_columns] = df[numeric_columns].interpolate(method='linear', limit_direction='forward')

In [None]:
df.head(10)

Unnamed: 0,age,sex,steroid,antivirals,fatigue,malaise,anorexia,liver_big,liver_firm,spleen_palpable,spiders,ascites,varices,bilirubin,alk_phosphate,sgot,albumin,protime,histology,class
0,30,True,False,False,False,False,False,False,False,False,False,False,False,1.0,85.0,18.0,4.0,,False,live
1,50,True,False,False,True,False,False,False,False,False,False,False,False,0.9,135.0,42.0,3.5,,False,live
2,78,True,True,False,True,False,False,True,False,False,False,False,False,0.7,96.0,32.0,4.0,,False,live
3,31,True,True,True,False,False,False,True,False,False,False,False,False,0.7,46.0,52.0,4.0,80.0,False,live
4,34,True,True,False,False,False,False,True,False,False,False,False,False,1.0,70.5,200.0,4.0,77.5,False,live
5,34,True,True,False,False,False,False,True,False,False,False,False,False,0.9,95.0,28.0,4.0,75.0,False,live
6,51,True,False,False,True,False,True,True,False,True,True,False,False,0.95,91.6,34.666667,4.133333,77.0,False,die
7,23,True,True,False,False,False,False,True,False,False,False,False,False,1.0,88.2,41.333333,4.266667,79.0,False,live
8,39,True,True,False,True,False,False,True,True,False,False,False,False,0.7,84.8,48.0,4.4,81.0,False,live
9,30,True,True,False,False,False,False,True,False,False,False,False,False,1.0,81.4,120.0,3.9,83.0,False,live
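
The first three rows of `protime` above are still NaN because forward interpolation has no earlier valid value to start from. A minimal sketch on a hypothetical series (invented values) shows this, together with `limit_direction='both'`, which is mentioned here only as an alternative setting and is not what the notebook uses.

```python
import numpy as np
import pandas as pd

# Hypothetical series with leading, inner and trailing gaps.
s = pd.Series([np.nan, np.nan, 80.0, np.nan, 75.0, np.nan])

forward = s.interpolate(method="linear", limit_direction="forward")
both = s.interpolate(method="linear", limit_direction="both")

print(forward.tolist())  # leading NaN values remain, the inner gap becomes 77.5
print(both.tolist())     # leading gaps take the first valid value, 80.0
```

In both cases the trailing gap is filled with the last valid value, 75.0.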


In [None]:
df.isna().sum()/len(df)*100

Unnamed: 0,0
age,0.0
sex,0.0
steroid,0.0
antivirals,0.0
fatigue,0.0
malaise,0.0
anorexia,0.0
liver_big,0.0
liver_firm,0.0
spleen_palpable,0.0
