# Data Cleaning

Before analyzing the data, we need to clean our dataset.

Import the necessary packages.

In [8]:
## Import the required libraries
import pandas as pd
import numpy as np

Import the dataset.

In [17]:
df = pd.read_csv('liver_disease.csv')

Let's see which columns our dataset contains.

In [18]:
df.head()

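| | id | age | gender | tot_bilirubin | direct_bilirubin | tot_proteins | albumin | ag_ratio | sgpt | sgot | alkphos | is_patient |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 65 | Female | 0.7 | 0.1 | 187 | 16 | 18 | 6.8 | 3.3 | 0.9 | 1 |
| 1 | 2 | 62 | Male | 10.9 | 5.5 | 699 | 64 | 100 | 7.5 | 3.2 | 0.74 | 1 |
| 2 | 3 | 62 | Male | 7.3 | 4.1 | 490 | 60 | 68 | 7.0 | 3.3 | 0.89 | 1 |
| 3 | 4 | 58 | Male | 1.0 | 0.4 | 182 | 14 | 20 | 6.8 | 3.4 | 1.0 | 1 |
| 4 | 5 | 72 | Male | 3.9 | 2.0 | 195 | 27 | 59 | 7.3 | 2.4 | 0.4 | 1 |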


Let's check whether our data contains any missing values.

In [19]:
df.isnull().sum()

id                  0
age                 0
gender              0
tot_bilirubin       0
direct_bilirubin    0
tot_proteins        0
albumin             0
ag_ratio            0
sgpt                0
sgot                0
alkphos             4
is_patient          0
dtype: int64

The output above shows four missing values in one column, `alkphos`. We need to handle these missing values before we can proceed with the analysis.

Let's look at the data type of each column in our dataset.
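To see which rows are affected, and not just how many, a boolean mask can help. The following is a minimal sketch on a small made-up frame; the name `toy` and its values are illustrative assumptions, not data from `liver_disease.csv`.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the real dataset; the values are made up.
toy = pd.DataFrame({
    "age": [65, 62, 58, 72],
    "alkphos": [0.9, np.nan, 1.0, np.nan],
})

# Count the missing values per column, as df.isnull().sum() does.
missing_per_column = toy.isnull().sum()

# Keep only the rows that have at least one missing value.
rows_with_missing = toy[toy.isnull().any(axis=1)]

print(missing_per_column)
print(rows_with_missing)
```

On the real data, the same mask applied to `df` would list the rows with a missing `alkphos` value.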

In [20]:
df.dtypes

id                    int64
age                   int64
gender               object
tot_bilirubin       float64
direct_bilirubin    float64
tot_proteins          int64
albumin               int64
ag_ratio              int64
sgpt                float64
sgot                float64
alkphos             float64
is_patient            int64
dtype: object

We can see that most columns contain data in either float or integer format, which is a good sign, but there is one column, `gender`, with datatype `object`. This can hinder our analysis, so it is advisable to convert it to numeric data.

First, let's replace the null values in the `alkphos` column with the mean of that column, rounded to two decimal places.
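Non-numeric columns can also be picked out programmatically instead of by reading the `dtypes` listing. This is a sketch on a made-up frame (`toy` and its values are assumptions for illustration):

```python
import pandas as pd

# Toy frame with the same mix of types as the dataset; values are made up.
toy = pd.DataFrame({
    "age": [65, 62, 58],
    "gender": ["Female", "Male", "Male"],
    "tot_bilirubin": [0.7, 10.9, 1.0],
})

# Names of the columns that are not numeric.
non_numeric = toy.select_dtypes(exclude="number").columns.tolist()

# Distinct values in each of those columns, to judge how to encode them.
categories = {col: sorted(toy[col].unique()) for col in non_numeric}

print(non_numeric)
print(categories)
```

A column with only two distinct values, as here, can be encoded as a single 0/1 column.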

In [21]:
df['alkphos'] = df['alkphos'].fillna(df['alkphos'].mean()).round(2)

Let's check whether it worked.

In [22]:
df.isnull().sum()

id                  0
age                 0
gender              0
tot_bilirubin       0
direct_bilirubin    0
tot_proteins        0
albumin             0
ag_ratio            0
sgpt                0
sgot                0
alkphos             0
is_patient          0
dtype: int64
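The effect of mean imputation is easier to follow on a handful of values. The sketch below uses a made-up series (the name `s` and its values are illustrative assumptions):

```python
import numpy as np
import pandas as pd

# Made-up values with one gap, standing in for the 'alkphos' column.
s = pd.Series([0.9, 0.74, np.nan, 1.0, 0.4], name="alkphos")

# Series.mean() skips NaN by default, so the mean uses the four known values.
col_mean = s.mean()

# Fill the gap with that mean and round to two decimal places.
filled = s.fillna(col_mean).round(2)

print(col_mean)
print(filled)
```

Filling with the mean leaves the column mean unchanged. When a column is strongly skewed, the median (`s.median()`) is a common alternative.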

Now, let's create dummy columns to eliminate the string data in our dataset.

In [23]:
df = pd.get_dummies(df, prefix=['gender', 'patient'], columns=['gender', 'is_patient'])
df.rename(columns={'patient_2': 'is_patient'}, inplace=True)

In [24]:
df.isnull().sum()

id                  0
age                 0
tot_bilirubin       0
direct_bilirubin    0
tot_proteins        0
albumin             0
ag_ratio            0
sgpt                0
sgot                0
alkphos             0
gender_Female       0
gender_Male         0
patient_1           0
is_patient          0
dtype: int64

Let's check whether it worked.

In [25]:
df.dtypes

id                    int64
age                   int64
tot_bilirubin       float64
direct_bilirubin    float64
tot_proteins          int64
albumin               int64
ag_ratio              int64
sgpt                float64
sgot                float64
alkphos             float64
gender_Female         uint8
gender_Male           uint8
patient_1             uint8
is_patient            uint8
dtype: object

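Remove the redundant dummy columns. `gender_Male` and `is_patient` already carry all the information, so `gender_Female` and `patient_1` can be dropped. Note that, after the rename above, `is_patient` is 1 for rows whose original `is_patient` value was 2.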

In [26]:
df.drop(['gender_Female','patient_1'],axis=1,inplace=True)
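For a two-valued column, creating dummies and then dropping one of them gives the same result as a direct comparison. The sketch below shows both routes on a made-up frame (`toy`, `via_dummies` and `via_compare` are illustrative names, not part of the notebook):

```python
import pandas as pd

# Made-up frame with a two-valued string column.
toy = pd.DataFrame({"gender": ["Female", "Male", "Male"]})

# Route used above: one-hot encode, then drop the redundant column.
dummies = pd.get_dummies(toy, prefix=["gender"], columns=["gender"])
via_dummies = dummies.drop(columns=["gender_Female"])["gender_Male"].astype(int)

# Alternative route: compare against one category and cast to 0/1.
via_compare = (toy["gender"] == "Male").astype(int).rename("gender_Male")

print(via_dummies.tolist())
print(via_compare.tolist())
```

The `astype(int)` cast is there because recent pandas versions return boolean dummy columns by default instead of the `uint8` columns shown above.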

Save the cleaned data to a new CSV file.

In [27]:
df.to_csv('liver.csv', index=False)

Conclusion: We have handled all the missing data in our dataset and converted all the string values to numeric data, which makes our data analysis easier.
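A quick round trip can confirm that the saved file reads back as expected. This sketch writes to an in-memory buffer instead of `liver.csv`, and the frame `toy` is a made-up stand-in for the cleaned data:

```python
import io

import pandas as pd

# Made-up stand-in for the cleaned frame.
toy = pd.DataFrame({"age": [65, 62], "gender_Male": [0, 1]})

# Write to an in-memory buffer with the same index=False setting.
buffer = io.StringIO()
toy.to_csv(buffer, index=False)

# Read the CSV text back into a new frame.
buffer.seek(0)
reloaded = pd.read_csv(buffer)

# With index=False, no extra 'Unnamed: 0' column appears on reload.
print(reloaded.columns.tolist())
print(reloaded.equals(toy))
```

Without `index=False`, the row index would be written as an unnamed first column and come back as `Unnamed: 0` when the file is read again.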