# Data Science Coding Challenge for Wemanity by Ani Khachatryan

## Setup

In [1]:
import sys
import os
import numpy as np
import pandas as pd
from scipy.io import arff

In [2]:
# check Python version
print(f'current Python version: {sys.version}')
print(f'working directory: {os.getcwd()}')

current Python version: 3.10.9 (main, Dec 15 2022, 10:44:50) [Clang 14.0.0 (clang-1400.0.29.202)]
working directory: /Users/anikhachatryan/Documents/Projects/Wemanity/code


## Data

**Note**: `scipy.io.arff.loadarff` could not load the data because the file contains formatting errors, as discussed in <a href="https://stackoverflow.com/questions/62653514/open-an-arff-file-with-scipy-io">this</a> StackOverflow question. Loading the file with scipy would require cleaning it by hand first, so I use the code provided in one of the answers instead.

### Load

In [3]:
# load the data to a pandas DataFrame
path_to_data = '../data/chronic_kidney_disease_full.arff'

data = []
with open(path_to_data, "r") as f:
    for i, line in enumerate(f):
        line = line.replace('\n', '')
        # my additions to the StackOverflow code:
        # remove tabs, spaces, and doubled commas
        line = line.replace('\t', '')
        line = line.replace(' ', '')
        line = line.replace(',,', ',')
        data.append(line.split(','))


names = ['age', 'bp', 'sg', 'al', 'su', 'rbc', 'pc', 'pcc', 'ba',
         'bgr', 'bu',  'sc', 'sod', 'pot', 'hemo', 'pcv', 'wbcc',
         'rbcc', 'htn', 'dm', 'cad', 'appet', 'pe', 'ane',
         'class', 'no_name']

# skip the first 145 lines of the file (the ARFF header) and keep the data rows
df = pd.DataFrame(data[145:], columns=names)
print(f'df shape: {df.shape}')

df shape: (402, 26)
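
The hard-coded offset in the cell above depends on the exact length of the header. As a minimal sketch, assuming the file marks the start of its data section with a line reading `@data` (as the ARFF format specifies), the rows could be located by that marker instead. The helper name `read_arff_rows` and the toy input are hypothetical, and the sketch does not handle the extra commas that the cell above cleans up.

```python
import io
import pandas as pd

def read_arff_rows(lines):
    """Return the comma-split rows that follow the '@data' marker."""
    rows = []
    in_data = False
    for line in lines:
        line = line.strip()
        if not in_data:
            # the data section starts after the '@data' line (case-insensitive)
            in_data = line.lower() == '@data'
            continue
        if not line or line.startswith('%'):
            continue  # skip blank lines and ARFF comment lines
        rows.append([field.strip() for field in line.split(',')])
    return rows

# toy stand-in for the real file, with made-up values
sample = io.StringIO(
    "@relation toy\n"
    "@attribute 'age' numeric\n"
    "@attribute 'class' {ckd,notckd}\n"
    "@data\n"
    "48,ckd\n"
    "7,\tckd\n"
    "\n"
)
toy_rows = read_arff_rows(sample)
toy_df = pd.DataFrame(toy_rows, columns=['age', 'class'])
print(f'toy_df shape: {toy_df.shape}')
```

Skipping blank lines at this stage would also avoid the empty trailing rows that otherwise have to be dropped afterwards.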


### Examine

In [4]:
# examine the head of the df
df.head()

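| | age | bp | sg | al | su | rbc | pc | pcc | ba | bgr | ... | wbcc | rbcc | htn | dm | cad | appet | pe | ane | class | no_name |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 48 | 80 | 1.02 | 1 | 0 | ? | normal | notpresent | notpresent | 121 | ... | 7800 | 5.2 | yes | yes | no | good | no | no | ckd | |
| 1 | 7 | 50 | 1.02 | 4 | 0 | ? | normal | notpresent | notpresent | ? | ... | 6000 | ? | no | no | no | good | no | no | ckd | |
| 2 | 62 | 80 | 1.01 | 2 | 3 | normal | normal | notpresent | notpresent | 423 | ... | 7500 | ? | no | yes | no | poor | no | yes | ckd | |
| 3 | 48 | 70 | 1.005 | 4 | 0 | normal | abnormal | present | notpresent | 117 | ... | 6700 | 3.9 | yes | no | no | poor | yes | yes | ckd | |
| 4 | 51 | 80 | 1.01 | 2 | 0 | normal | normal | notpresent | notpresent | 106 | ... | 7300 | 4.6 | no | no | no | good | no | no | ckd | |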


### Clean

As mentioned in _chronic_kidney_disease.info.txt_, the data should have 400 instances and 25 attributes, but the DataFrame has 402 rows and 26 columns after loading.

The last two rows of the data contain no information and can be dropped. The _'no_name'_ column can also be dropped as it is a result of extra commas in the _.arff_ file.

In [5]:
# the last two rows of the data
df.iloc[400:, :]

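| | age | bp | sg | al | su | rbc | pc | pcc | ba | bgr | ... | wbcc | rbcc | htn | dm | cad | appet | pe | ane | class | no_name |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 400 | | | | | | | | | | | ... | | | | | | | | | | |
| 401 | | | | | | | | | | | ... | | | | | | | | | | |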


In [6]:
# drop the last two rows
df.drop([400, 401], inplace=True)

# drop the 'no_name' column
df.drop(columns='no_name', inplace=True)

print(f'df shape: {df.shape}')

df shape: (400, 25)
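
The same cleanup can be written without hard-coding the row labels. The sketch below is only an illustration on a made-up frame: it assumes that the empty rows and the _'no_name'_ column contain nothing but empty strings or `None`, which the output above suggests but which I have not verified for every row.

```python
import numpy as np
import pandas as pd

# toy frame that mimics the problem: one empty trailing row, one empty column
toy = pd.DataFrame(
    [['48', 'ckd', ''], ['7', 'notckd', ''], ['', None, None]],
    columns=['age', 'class', 'no_name'],
)

# treat empty strings as missing, then drop rows and columns that are entirely empty
cleaned = (
    toy.replace('', np.nan)
       .dropna(axis=0, how='all')
       .dropna(axis=1, how='all')
)
print(f'cleaned shape: {cleaned.shape}')
```

If even one row had a value in _'no_name'_, the column would survive this step, which would be a useful signal that the extra commas are not the only problem.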


_chronic_kidney_disease.info.txt_ also mentions that missing values are denoted by _'?'_, so I will replace them with _NaNs_ to make it easier to deal with them later.

In [7]:
# replace '?' with NaN
df.replace('?', np.nan, inplace=True)

### Check dtypes

In [8]:
df.dtypes

age      object
bp       object
sg       object
al       object
su       object
rbc      object
pc       object
pcc      object
ba       object
bgr      object
bu       object
sc       object
sod      object
pot      object
hemo     object
pcv      object
wbcc     object
rbcc     object
htn      object
dm       object
cad      object
appet    object
pe       object
ane      object
class    object
dtype: object

It might be helpful to separate numerical and categorical columns to make imputation easier.

In [9]:
# numerical columns
colnames_num = ['age', 'bp', 'bgr', 'bu', 'sc', 'sod', 'pot', 'hemo', 'pcv', 'wbcc', 'rbcc']

# categorical columns
# no need to include 'class' since it's the target variable
colnames_cat = ['sg', 'al', 'su', 'rbc', 'pc', 'pcc', 'ba', 'htn', 'dm', 'cad', 'appet', 'pe', 'ane']

In [10]:
# convert numerical columns to numeric, column by column
# values that cannot be parsed become NaN because of errors='coerce'
df[colnames_num] = df[colnames_num].apply(pd.to_numeric, errors='coerce')

In [11]:
# make sure dtypes are correct
df.dtypes

age      float64
bp       float64
sg        object
al        object
su        object
rbc       object
pc        object
pcc       object
ba        object
bgr      float64
bu       float64
sc       float64
sod      float64
pot      float64
hemo     float64
pcv      float64
wbcc     float64
rbcc     float64
htn       object
dm        object
cad       object
appet     object
pe        object
ane       object
class     object
dtype: object
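
Because `errors='coerce'` silently turns unparseable strings into _NaN_, it can be worth checking whether the conversion created any new missing values. A minimal sketch on a made-up column (the name `toy_col` and its values are hypothetical, not taken from the dataset):

```python
import numpy as np
import pandas as pd

raw = pd.Series(['48', '7', np.nan, '5.2x'], name='toy_col')
converted = pd.to_numeric(raw, errors='coerce')

# values that were present before the conversion but are NaN afterwards
# could not be parsed as numbers
newly_missing = raw.notna() & converted.isna()
print(raw[newly_missing].tolist())
```

An empty list here would mean that every _NaN_ in the converted column was already missing in the raw data.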

### Missing values

In [12]:
# number of missing values for each column
df.isna().sum()

age        9
bp        12
sg        47
al        46
su        49
rbc      152
pc        65
pcc        4
ba         4
bgr       44
bu        19
sc        17
sod       87
pot       88
hemo      52
pcv       71
wbcc     106
rbcc     131
htn        2
dm         2
cad        2
appet      1
pe         1
ane        1
class      0
dtype: int64

Many columns have missing values. Because the dataset is already small, dropping every row that has a missing value would shrink it substantially. Instead, we can impute the missing values: categorical columns are usually imputed with the mode, while numerical columns are usually imputed with the mean or median.

**Note**: We are making the assumption that the values are missing at random. More information about the dataset and the data collection process would allow us to make more informed decisions on how to handle missing data.
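
Two quick checks can support these statements: how many rows would survive if incomplete rows were dropped, and whether the share of missing values differs between the classes. The sketch below runs on a made-up frame that only borrows the column names _'rbc'_, _'sod'_ and _'class'_; the values are invented for illustration. A large gap between the classes would suggest that the values are not missing completely at random and that the imputation strategy deserves more care.

```python
import numpy as np
import pandas as pd

# made-up data; only the column names come from the dataset
toy = pd.DataFrame({
    'rbc': [np.nan, 'normal', np.nan, 'normal', 'normal', 'normal'],
    'sod': [np.nan, 135.0, 140.0, np.nan, 141.0, 139.0],
    'class': ['ckd', 'ckd', 'ckd', 'notckd', 'notckd', 'notckd'],
})

# rows left after dropping every row with at least one missing value
rows_complete = len(toy.dropna())
print(f'complete rows: {rows_complete} of {len(toy)}')

# share of missing values per feature, computed separately for each class
missing_share = toy.drop(columns='class').isna().groupby(toy['class']).mean()
print(missing_share)
```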

In [None]:
# TODO
# more explanation about missing values and why I'm doing median/mode imputation

In [51]:
# impute missing values

# categorical - mode
for colname in colnames_cat:
    df[colname] = df[colname].fillna(df[colname].mode()[0])

# numerical - median because it is more robust to outliers
for colname in colnames_num:
    df[colname] = df[colname].fillna(df[colname].median())
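
To illustrate the robustness argument, here is a small sketch with invented numbers: a single extreme value pulls the mean far away from the bulk of the data, while the median barely moves.

```python
import numpy as np
import pandas as pd

# made-up values with one missing entry and one extreme outlier
toy = pd.Series([1.0, 1.2, 0.9, 1.1, np.nan, 48.0])

# both statistics skip NaN by default
toy_mean = toy.mean()
toy_median = toy.median()
print(f'mean: {toy_mean:.2f}, median: {toy_median:.2f}')

# the imputed value stays close to the typical observations
filled = toy.fillna(toy_median)
```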

In [52]:
df.isna().sum()

age      0
bp       0
sg       0
al       0
su       0
rbc      0
pc       0
pcc      0
ba       0
bgr      0
bu       0
sc       0
sod      0
pot      0
hemo     0
pcv      0
wbcc     0
rbcc     0
htn      0
dm       0
cad      0
appet    0
pe       0
ane      0
class    0
dtype: int64