# **Data Cleaning**

## Objectives

* Clean the dataset
* Impute zero values with the median

## Inputs

* outputs/datasets/collection/diabetes.csv

## Outputs

* Cleaned Train and Test sets, exported to the outputs/datasets/cleaned folder

## Additional Comments

* This notebook falls under the Data Preparation phase of CRISP-DM.
* The conclusion of this notebook is that we will have a Cleaned Data Pipeline ready for the model.
* The Imputation of median values for the missing values shown as zeros in the original dataset could fall under the feature engineering section. I felt it was appropriate to include at this stage as it would be setting up the dataset ready to be split between train and test data.


---

# Change working directory

* The notebooks are stored in the 'jupyter_notebooks' subfolder, so when running a notebook in the editor we need to change the working directory.

We need to change the working directory from its current folder to its parent folder
* We access the current directory with os.getcwd()

In [1]:
import os
current_dir = os.getcwd()
current_dir

'/workspace/pp5-diabetes-prediction/jupyter_notebooks'

We want to make the parent of the current directory the new current directory
* os.path.dirname() gets the parent directory
* os.chdir() defines the new current directory

In [2]:
os.chdir(os.path.dirname(current_dir))
print("You set a new current directory")

You set a new current directory


Confirm the new current directory

In [3]:
current_dir = os.getcwd()
current_dir

'/workspace/pp5-diabetes-prediction'

# Importing the Libraries

* Here we will import the dependencies used in this notebook.

In [4]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Loading Collected Dataset

* We will begin by loading the diabetes dataset.

* A Pandas dataframe is declared using the diabetes dataset using `read_csv()`

* The first fifteen rows will be displayed using `head()` to get a broader view of the data that will need cleaning

In [5]:
df = pd.read_csv("outputs/datasets/collection/diabetes.csv")
df.head(15)


Unnamed: 0,Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age,Outcome
0,6,148,72,35,0,33.6,0.627,50,1
1,1,85,66,29,0,26.6,0.351,31,0
2,8,183,64,0,0,23.3,0.672,32,1
3,1,89,66,23,94,28.1,0.167,21,0
4,0,137,40,35,168,43.1,2.288,33,1
5,5,116,74,0,0,25.6,0.201,30,0
6,3,78,50,32,88,31.0,0.248,26,1
7,10,115,0,0,0,35.3,0.134,29,0
8,2,197,70,45,543,30.5,0.158,53,1
9,8,125,96,0,0,0.0,0.232,54,1


---

# Zero Value Data

* We identified previously that some variables contain abnormal zero values. This is likely because missing values are represented as zeros in the original source data.

* These zero values need to be either removed from the dataset or replaced by imputing a median value. As the dataset is already small, we opt for imputation at this stage of cleaning the data.

* External code was taken from [towardsdatascience](https://towardsdatascience.com/pima-indian-diabetes-prediction-7573698bd5fe) for guidance with replacing zero values and was then customised to be appropriate for our dataset.
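
Before imputing, it can help to count how many zeros each variable contains. The snippet below is a minimal sketch on a small hand-built frame that reuses the first four rows displayed above. The `zero_as_missing` list is our assumption about which columns cannot plausibly be zero.

```python
import pandas as pd

# Small illustrative frame built from the first four rows shown above.
sample = pd.DataFrame({
    "Glucose": [148, 85, 183, 89],
    "BloodPressure": [72, 66, 64, 66],
    "SkinThickness": [35, 29, 0, 23],
    "Insulin": [0, 0, 0, 94],
    "BMI": [33.6, 26.6, 23.3, 28.1],
})

# Columns where a zero is treated as a missing value (assumed list).
zero_as_missing = ["Glucose", "BloodPressure", "SkinThickness", "Insulin", "BMI"]

# Comparing with 0 gives a boolean frame; summing counts the zeros per column.
zero_counts = (sample[zero_as_missing] == 0).sum()
print(zero_counts)
```

Running the same comparison on the full `df` would show how much of each variable is about to be imputed.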

In [6]:
# BloodPressure

df1 = df.loc[df['Outcome'] == 1]
df2 = df.loc[df['Outcome'] == 0]
df1 = df1.replace({'BloodPressure':0}, np.median(df1['BloodPressure']))
df2 = df2.replace({'BloodPressure':0}, np.median(df2['BloodPressure']))
dataframe = [df1, df2]
df = pd.concat(dataframe)

* Zero values in rows where Outcome is 1 are now replaced with the median of the Outcome 1 group, and zero values in rows where Outcome is 0 are replaced with the median of the Outcome 0 group.

* We need to apply the same code for each variable where there is an abnormal occurrence.

In [8]:
# BMI

df1 = df.loc[df['Outcome'] == 1]
df2 = df.loc[df['Outcome'] == 0]
df1 = df1.replace({'BMI':0}, np.median(df1['BMI']))
df2 = df2.replace({'BMI':0}, np.median(df2['BMI']))
dataframe = [df1, df2]
df = pd.concat(dataframe)

In [9]:
# Insulin

df1 = df.loc[df['Outcome'] == 1]
df2 = df.loc[df['Outcome'] == 0]
df1 = df1.replace({'Insulin':0}, np.median(df1['Insulin']))
df2 = df2.replace({'Insulin':0}, np.median(df2['Insulin']))
dataframe = [df1, df2]
df = pd.concat(dataframe)

In [10]:
# SkinThickness

df1 = df.loc[df['Outcome'] == 1]
df2 = df.loc[df['Outcome'] == 0]
df1 = df1.replace({'SkinThickness':0}, np.median(df1['SkinThickness']))
df2 = df2.replace({'SkinThickness':0}, np.median(df2['SkinThickness']))
dataframe = [df1, df2]
df = pd.concat(dataframe)

In [11]:
# Glucose
df1 = df.loc[df['Outcome'] == 1]
df2 = df.loc[df['Outcome'] == 0]
df1 = df1.replace({'Glucose':0}, np.median(df1['Glucose']))
df2 = df2.replace({'Glucose':0}, np.median(df2['Glucose']))
dataframe = [df1, df2]
df = pd.concat(dataframe)
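
The five cells above repeat the same steps for each variable. As a possible alternative, the sketch below wraps the idea in one helper. Note that it is not the same method: it computes the median from the non-zero values only, whereas the cells above include the zeros in the median. The function name `impute_zeros_with_group_median` and the demo values are hypothetical.

```python
import numpy as np
import pandas as pd

def impute_zeros_with_group_median(frame, columns, group_col="Outcome"):
    """Replace zeros in `columns` with the median of the non-zero values
    found in the same `group_col` group. Row order is preserved."""
    result = frame.copy()
    for col in columns:
        # Treat zeros as missing so they do not pull the median down.
        as_missing = result[col].replace(0, np.nan)
        # Median per Outcome group, broadcast back to every row (NaN is skipped).
        group_median = as_missing.groupby(result[group_col]).transform("median")
        result[col] = as_missing.fillna(group_median)
    return result

# Illustrative values only, not taken from the dataset.
demo = pd.DataFrame({
    "Insulin": [0, 0, 100, 200, 0, 50, 70],
    "Outcome": [1, 1, 1, 1, 0, 0, 0],
})
cleaned = impute_zeros_with_group_median(demo, ["Insulin"])
print(cleaned)
```

With this approach a column cannot be left with zeros simply because most of its values were missing, unless every value in a group is zero, in which case the result is NaN and needs separate handling.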

* We can check that the changes above have been applied by running `head()` again and inspecting the first fifteen rows for the imputed median values.

In [12]:
df.head(15)

Unnamed: 0,Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age,Outcome
0,6,148,72,35,0,33.6,0.627,50,1
2,8,183,64,27,0,23.3,0.672,32,1
4,0,137,40,35,168,43.1,2.288,33,1
6,3,78,50,32,88,31.0,0.248,26,1
8,2,197,70,45,543,30.5,0.158,53,1
9,8,125,96,27,0,34.25,0.232,54,1
11,10,168,74,27,0,38.0,0.537,34,1
13,1,189,60,23,846,30.1,0.398,59,1
14,5,166,72,19,175,25.8,0.587,51,1
15,7,100,74,27,0,30.0,0.484,32,1


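* After cleaning the data we can see that Insulin still contains zero values for Diabetics (Outcome 1). Because the median was computed with the zero values included, a median of zero means that at least half of the Insulin readings in that group are recorded as zero, so the imputation left them unchanged. This points to missing data rather than evidence that Insulin for Diabetics is lower than that of non-diabetics.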

* Next we group the dataset by Outcome and take the mean of each variable, as this summarises the differences between the two classes that the machine learning algorithm will see.

In [13]:
df.groupby('Outcome').mean()

Outcome,Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age
0,3.298,110.622,70.844,25.502,87.2,30.8451,0.429734,31.19
1,4.865672,142.302239,75.242537,31.029851,100.335821,35.398134,0.5505,37.067164


* We then want to separate the feature variables from the Outcome label by dropping the Outcome column using the `drop()` function.

In [14]:
x = df.drop(columns='Outcome')
y = df['Outcome']

In [15]:
print(x)

     Pregnancies  Glucose  BloodPressure  SkinThickness  Insulin   BMI  \
0              6      148             72             35        0  33.6   
2              8      183             64             27        0  23.3   
4              0      137             40             35      168  43.1   
6              3       78             50             32       88  31.0   
8              2      197             70             45      543  30.5   
..           ...      ...            ...            ...      ...   ...   
762            9       89             62             21       39  22.5   
763           10      101             76             48      180  32.9   
764            2      122             70             27       39  36.8   
765            5      121             72             23      112  26.2   
767            1       93             70             31       39  30.4   

     DiabetesPedigreeFunction  Age  
0                       0.627   50  
2                       0.672   32  


* As we can see we have successfully separated the Outcome column as it is no longer showing with the rest of the dataset when printing 'x'.

In [16]:
print(y)

0      1
2      1
4      1
6      1
8      1
      ..
762    0
763    0
764    0
765    0
767    0
Name: Outcome, Length: 768, dtype: int64


* Next we want to standardise the data so that all variables are on a comparable scale (zero mean and unit variance), which makes it easier for the machine learning model to predict the outcome.

## Data Standardisation

In [17]:
df_scaler = StandardScaler()

* `fit()` computes the mean and standard deviation of each variable in `x`, and `transform()` uses those values to scale the data. Note that the scaler is fitted here on the full dataset, before the train/test split.

In [18]:
df_scaler.fit(x)
stnd_data = df_scaler.transform(x)
print(stnd_data)

[[ 0.63994726  0.86462486 -0.0313235  ...  0.16958256  0.46849198
   1.4259954 ]
 [ 1.23388019  2.01426457 -0.69266918 ... -1.3283415   0.60439732
  -0.10558415]
 [-1.14185152  0.50330953 -2.67670622 ...  1.551163    5.4849091
  -0.0204964 ]
 ...
 [-0.54791859  0.01060679 -0.19665992 ...  0.63495703 -0.39828208
  -0.53102292]
 [ 0.3429808  -0.02224005 -0.0313235  ... -0.90659589 -0.68519336
  -0.27575966]
 [-0.84488505 -0.94195182 -0.19665992 ... -0.2957919  -0.47378505
  -0.87137393]]


* We will use the variables 'x' and 'y' for training our machine learning model. 'x' represents the standardised data and 'y' represents the Outcome labels.

In [19]:
x = stnd_data
y = df['Outcome']
print(x)
print(y)

[[ 0.63994726  0.86462486 -0.0313235  ...  0.16958256  0.46849198
   1.4259954 ]
 [ 1.23388019  2.01426457 -0.69266918 ... -1.3283415   0.60439732
  -0.10558415]
 [-1.14185152  0.50330953 -2.67670622 ...  1.551163    5.4849091
  -0.0204964 ]
 ...
 [-0.54791859  0.01060679 -0.19665992 ...  0.63495703 -0.39828208
  -0.53102292]
 [ 0.3429808  -0.02224005 -0.0313235  ... -0.90659589 -0.68519336
  -0.27575966]
 [-0.84488505 -0.94195182 -0.19665992 ... -0.2957919  -0.47378505
  -0.87137393]]
0      1
2      1
4      1
6      1
8      1
      ..
762    0
763    0
764    0
765    0
767    0
Name: Outcome, Length: 768, dtype: int64


---

## Train Test Split

* Next we will need to split the cleaned data into the train and test sets.
* The x variable data will be split into two arrays which are x_train and x_test.
* The test size will be 10% of the dataset and the train size will be 90% of the dataset. As the dataset is small, we want to maximise the amount of training data to improve the model.

In [20]:
x_train, x_test, y_train, y_test = train_test_split(x,y, test_size = 0.1, stratify=y, random_state=2)
print(f"Total Records: {x.shape} \nx_train set: {x_train.shape} \nx_test set: {x_test.shape}")

Total Records: (768, 8) 
x_train set: (691, 8) 
x_test set: (77, 8)


* From this, 768 is the number of records in our dataset with 691 records being used for training data leaving 77 records to be used for the test data.
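
A common variation is to split first and fit the scaler on the training split only, so that no statistics from the test rows influence the scaling. The sketch below shows that ordering on synthetic data. The feature values, the label balance and the `_demo`/`_raw` variable names are assumptions for illustration, not the notebook's data.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Synthetic stand-ins with the same shape as the notebook's data (768 rows, 8 columns).
features = rng.normal(loc=50, scale=10, size=(768, 8))
labels = np.array([1] * 268 + [0] * 500)  # arbitrary imbalanced labels

x_train_raw, x_test_raw, y_train_demo, y_test_demo = train_test_split(
    features, labels, test_size=0.1, stratify=labels, random_state=2
)

# Fit the scaler on the training split only, then reuse it for the test split.
scaler = StandardScaler().fit(x_train_raw)
x_train_scaled = scaler.transform(x_train_raw)
x_test_scaled = scaler.transform(x_test_raw)

print(x_train_scaled.shape, x_test_scaled.shape)
# stratify keeps the share of positive labels close in both splits.
print(round(y_train_demo.mean(), 3), round(y_test_demo.mean(), 3))
```

The test rows are then scaled with the training mean and standard deviation, which mirrors how unseen data would be handled at prediction time.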

* Next we need to move onto Feature Engineering and train the model.

---

# Pushing the clean files to Repository

* The cleaned Train and Test sets are exported to the outputs/datasets/cleaned folder.

In [20]:
try:
    # Creates cleaned folder in the outputs directory
    os.makedirs(name='outputs/datasets/cleaned')
except Exception as e:
    print(e)

[Errno 17] File exists: 'outputs/datasets/cleaned'
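
The message above appears because the folder already exists from an earlier run. One way to avoid it, sketched below, is the `exist_ok=True` argument of `os.makedirs()`. A temporary directory stands in for the project root here so the example can run anywhere.

```python
import os
import tempfile

# A temporary directory stands in for the project root in this sketch.
base_dir = tempfile.mkdtemp()
cleaned_dir = os.path.join(base_dir, "outputs", "datasets", "cleaned")

# exist_ok=True makes the call safe to repeat when the folder already exists.
os.makedirs(cleaned_dir, exist_ok=True)
os.makedirs(cleaned_dir, exist_ok=True)  # second call raises no error

print(os.path.isdir(cleaned_dir))
```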


### Train Dataset

In [21]:
pd.DataFrame(x_train).to_csv("outputs/datasets/cleaned/x_train_cleaned.csv", index=False)

In [21]:
pd.DataFrame(y_train).to_csv("outputs/datasets/cleaned/y_train_cleaned.csv", index=False)

### Test Dataset

In [None]:
pd.DataFrame(x_test).to_csv("outputs/datasets/cleaned/x_test_cleaned.csv", index=False)
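
Because the scaled data is a NumPy array, the CSV files written above get numeric column headers (0 to 7). The sketch below shows one way to keep the original column names, and how the test labels could be saved in the same way. The stand-in arrays, the temporary output folder and the file name `y_test_cleaned.csv` are assumptions that follow the naming pattern used above.

```python
import os
import tempfile

import numpy as np
import pandas as pd

feature_names = ["Pregnancies", "Glucose", "BloodPressure", "SkinThickness",
                 "Insulin", "BMI", "DiabetesPedigreeFunction", "Age"]

# Stand-ins for the scaled test array and its labels (illustrative values only).
x_test_demo = np.zeros((3, len(feature_names)))
y_test_demo = pd.Series([1, 0, 0], name="Outcome")

out_dir = tempfile.mkdtemp()  # stands in for outputs/datasets/cleaned

# Wrapping the array with column names keeps the CSV header readable.
pd.DataFrame(x_test_demo, columns=feature_names).to_csv(
    os.path.join(out_dir, "x_test_cleaned.csv"), index=False)

# y_test_cleaned.csv is an assumed file name, not one taken from the notebook.
y_test_demo.to_frame().to_csv(
    os.path.join(out_dir, "y_test_cleaned.csv"), index=False)

reloaded = pd.read_csv(os.path.join(out_dir, "x_test_cleaned.csv"))
print(list(reloaded.columns))
```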