<a href="https://colab.research.google.com/github/EverlynAsiko/Dashboard/blob/master/Data_cleaning_catch_up.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# <center>**DATA CLEANING CATCH UP**</center>







## **Recap**

## Cleaning the dataset
Methods of cleaning the data
1. Deleting the records
2. Measures of central tendency imputation
3. Regression imputation
4. Stochastic regression imputation 
5. Multiple imputation

## Missing data situations

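1. Missing Completely at Random (MCAR)


> The probability that a value is missing is unrelated to any of the data, observed or unobserved. In this situation, even simple methods such as deleting incomplete records do not bias the estimates, although they still reduce the sample size.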

2. Missing At Random (MAR)


> Data is missing at random conditional on the observed data. The probability that a value is missing depends on some other variable that was measured, not on the missing value itself.

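3. Missing Not At Random (MNAR)


> The probability that a value is missing depends on the unobserved value itself, or on something that was not measured, even after accounting for the observed data.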
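
The three situations can be made concrete by simulating them. The sketch below is illustrative only: the `age` and `income` columns, the roughly 30% missing rate, and the logistic missingness probabilities are all made-up assumptions, not part of the diabetes dataset. It hides `income` values under each mechanism and compares the mean of what remains with the true mean.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 10_000

# Hypothetical toy data: income increases with age
age = rng.normal(40, 10, n)
income = 1000 + 50 * age + rng.normal(0, 300, n)


def sigmoid(z):
    return 1 / (1 + np.exp(-z))


# MCAR: every income value has the same 30% chance of being missing
mcar_mask = rng.random(n) < 0.3

# MAR: the chance of a missing income depends only on the observed age
mar_mask = rng.random(n) < 0.6 * sigmoid((age - age.mean()) / age.std())

# MNAR: the chance of a missing income depends on the income value itself
mnar_mask = rng.random(n) < 0.6 * sigmoid((income - income.mean()) / income.std())

masks = {"MCAR": mcar_mask, "MAR": mar_mask, "MNAR": mnar_mask}
summary = pd.DataFrame({
    "missing_rate": {name: mask.mean() for name, mask in masks.items()},
    "observed_income_mean": {name: income[~mask].mean() for name, mask in masks.items()},
})

true_mean = income.mean()
print(f"true income mean: {true_mean:.1f}")
print(summary.round(3))
```

Under MCAR the observed mean stays close to the true mean. Under MAR and MNAR the observed values alone give a mean that is too low, because high incomes go missing more often. In the MAR case the bias can be corrected using `age`; in the MNAR case the information needed for the correction was never observed.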











### Why imputation?
The most obvious reason is that deleting every data point that contains a missing value leaves fewer samples for training a model or running an analysis, which can hurt accuracy.
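
To see how quickly deletion shrinks a dataset, here is a minimal sketch on a made-up six-row table (the values are illustrative, not the full diabetes data). Dropping every row with at least one missing value keeps only the rows that are complete in all columns.

```python
import numpy as np
import pandas as pd

# Hypothetical toy table with missing values scattered across columns
toy = pd.DataFrame({
    "glucose": [148, np.nan, 183, 89, 137, 116],
    "insulin": [np.nan, np.nan, np.nan, 94, 168, np.nan],
    "bmi": [33.6, 26.6, 23.3, 28.1, np.nan, 25.6],
})

# Share of individual cells that are missing
missing_cell_share = toy.isnull().to_numpy().mean()

# Listwise deletion: keep only rows without any missing value
complete_cases = toy.dropna()

print(f"missing cells: {missing_cell_share:.0%}")
print(f"rows kept: {len(complete_cases)} of {len(toy)}")
```

In this toy table a third of the cells are missing, yet only one of the six rows survives deletion.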

In [0]:
#importing the needed libraries
import numpy as np
import pandas as pd
from sklearn import linear_model

In [0]:
#Loading the dataset
from google.colab import files
uploaded = files.upload()

Saving diabetes.csv to diabetes.csv


In [0]:
#Previewing the data
import io
df = pd.read_csv(io.BytesIO(uploaded['diabetes.csv']))

In [0]:
#First five rows
df.head(5)

Unnamed: 0,Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age,Outcome
0,6,148,72,35,0,33.6,0.627,50,1
1,1,85,66,29,0,26.6,0.351,31,0
2,8,183,64,0,0,23.3,0.672,32,1
3,1,89,66,23,94,28.1,0.167,21,0
4,0,137,40,35,168,43.1,2.288,33,1


In [0]:
#Getting all the columns
df.columns

Index(['Pregnancies', 'Glucose', 'BloodPressure', 'SkinThickness', 'Insulin',
       'BMI', 'DiabetesPedigreeFunction', 'Age', 'Outcome'],
      dtype='object')

In [0]:
#Getting how big the dataset is
df.shape

(768, 9)

In [0]:
df.describe()

Unnamed: 0,Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age,Outcome
count,768.0,768.0,768.0,768.0,768.0,768.0,768.0,768.0,768.0
mean,3.845052,120.894531,69.105469,20.536458,79.799479,31.992578,0.471876,33.240885,0.348958
std,3.369578,31.972618,19.355807,15.952218,115.244002,7.88416,0.331329,11.760232,0.476951
min,0.0,0.0,0.0,0.0,0.0,0.0,0.078,21.0,0.0
25%,1.0,99.0,62.0,0.0,0.0,27.3,0.24375,24.0,0.0
50%,3.0,117.0,72.0,23.0,30.5,32.0,0.3725,29.0,0.0
75%,6.0,140.25,80.0,32.0,127.25,36.6,0.62625,41.0,1.0
max,17.0,199.0,122.0,99.0,846.0,67.1,2.42,81.0,1.0


In [0]:
# Zeros in these measurement columns are treated as missing values
df.loc[df["Glucose"] == 0.0, "Glucose"] = np.nan
df.loc[df["BloodPressure"] == 0.0, "BloodPressure"] = np.nan
df.loc[df["SkinThickness"] == 0.0, "SkinThickness"] = np.nan
df.loc[df["Insulin"] == 0.0, "Insulin"] = np.nan
df.loc[df["BMI"] == 0.0, "BMI"] = np.nan

df.isnull().sum()[1:6]

Glucose            5
BloodPressure     35
SkinThickness    227
Insulin          374
BMI               11
dtype: int64

In [0]:
#Mean imputation
# Create an empty dataset
mean_df = pd.DataFrame()

# Create two variables called x0 and x1. Make the first value of x1 a missing value
mean_df['x0'] = [0.3051,0.4949,0.6974,0.3769,0.2231,0.341,0.4436,0.5897,0.6308,0.5]
mean_df['x1'] = [np.nan,0.2654,0.2615,0.5846,0.4615,0.8308,0.4962,0.3269,0.5346,0.6731]

# View the dataset
mean_df

Unnamed: 0,x0,x1
0,0.3051,
1,0.4949,0.2654
2,0.6974,0.2615
3,0.3769,0.5846
4,0.2231,0.4615
5,0.341,0.8308
6,0.4436,0.4962
7,0.5897,0.3269
8,0.6308,0.5346
9,0.5,0.6731


In [0]:
# fill missing values with mean column values
mean_df.fillna(mean_df.mean(), inplace=True)

mean_df

# # For median
# median_value = df['Age'].median()
# df['Age'] = df['Age'].fillna(median_value)

# # For mode (mode() returns a Series, so take its first entry)
# mode_value = df['Age'].mode()[0]
# df['Age'] = df['Age'].fillna(mode_value)


Unnamed: 0,x0,x1
0,0.3051,0.492733
1,0.4949,0.2654
2,0.6974,0.2615
3,0.3769,0.5846
4,0.2231,0.4615
5,0.341,0.8308
6,0.4436,0.4962
7,0.5897,0.3269
8,0.6308,0.5346
9,0.5,0.6731
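
The median and the mode can be used in the same way. A minimal sketch on a made-up `age` series (hypothetical values, chosen so that the three statistics differ):

```python
import numpy as np
import pandas as pd

# Hypothetical series with two missing entries
age = pd.Series([21, 21, np.nan, 30, 50, np.nan, 33], name="age")

mean_filled = age.fillna(age.mean())      # mean of the observed values
median_filled = age.fillna(age.median())  # middle observed value
# mode() returns a Series (there can be ties), so take its first entry
mode_filled = age.fillna(age.mode()[0])

print(pd.DataFrame({"mean": mean_filled, "median": median_filled, "mode": mode_filled}))
```

The median is less sensitive to outliers than the mean, and the mode is the usual choice for categorical columns.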


In [0]:
#You can also use imputers
from sklearn.impute import SimpleImputer

# Create an imputer object that looks for NaN values, then replaces them with the mean value of each column
mean_imputer = SimpleImputer(missing_values=np.nan, strategy='mean')

# Train the imputer on the mean_df dataset
mean_imputer = mean_imputer.fit(mean_df)

# Apply the imputer to the mean_df dataset
imputed_df = mean_imputer.transform(mean_df)

# View the data
imputed_df

# SimpleImputer replaces the older sklearn.preprocessing.Imputer, which was removed in scikit-learn 0.22



array([[0.3051    , 0.49273333],
       [0.4949    , 0.2654    ],
       [0.6974    , 0.2615    ],
       [0.3769    , 0.5846    ],
       [0.2231    , 0.4615    ],
       [0.341     , 0.8308    ],
       [0.4436    , 0.4962    ],
       [0.5897    , 0.3269    ],
       [0.6308    , 0.5346    ],
       [0.5       , 0.6731    ]])

## Using regression

In [0]:
missing_columns = ["Glucose", "BloodPressure", "SkinThickness", "Insulin", "BMI"]

In [0]:
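#Fill each missing value with a random draw from the observed values of the same column,
#so that every column is complete before the regression models are fit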
def random_imputation(df, feature):

    number_missing = df[feature].isnull().sum()
    observed_values = df.loc[df[feature].notnull(), feature]
    df.loc[df[feature].isnull(), feature + '_imp'] = np.random.choice(observed_values, number_missing, replace = True)
    
    return df

for feature in missing_columns:
    df[feature + '_imp'] = df[feature]
    df = random_imputation(df, feature)

### Deterministic regression

In deterministic regression imputation, we fit a regression model for a variable with missing values, replace its missing entries with the model's predictions, and repeat this process for each such variable.
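
As a minimal sketch of the idea on made-up data with a single predictor: the model is fit on the rows where `y` is observed, and its predictions fill the gaps. The columns `x` and `y` are hypothetical, and `np.polyfit` stands in for the `LinearRegression` model used in this notebook. Unlike the notebook's code, this sketch fits the line on the observed rows only rather than on randomly pre-filled columns.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)

# Hypothetical toy data: y depends linearly on x, plus noise
x = rng.uniform(0, 10, 200)
y = 1.0 + 2.0 * x + rng.normal(0, 1.0, 200)
toy = pd.DataFrame({"x": x, "y": y})

# Remove 60 of the y values at random
missing_rows = rng.choice(200, size=60, replace=False)
toy.loc[missing_rows, "y"] = np.nan
observed = toy["y"].notna()

# Fit a straight line on the observed rows only
slope, intercept = np.polyfit(toy.loc[observed, "x"], toy.loc[observed, "y"], deg=1)

# Deterministic imputation: missing y values become the fitted line's predictions
toy["y_det"] = toy["y"].fillna(intercept + slope * toy["x"])

print(toy.loc[~observed, ["x", "y_det"]].head())
```

Every imputed point lies exactly on the fitted line, which is why this method understates the variability of the imputed variable.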

In [0]:
deter_data = pd.DataFrame(columns = ["Det" + name for name in missing_columns])

for feature in missing_columns:
        
    deter_data["Det" + feature] = df[feature + "_imp"]
    parameters = list(set(df.columns) - set(missing_columns) - {feature + '_imp'})
    
    #Create a Linear Regression model to estimate the missing data
    model = linear_model.LinearRegression()
    model.fit(X = df[parameters], y = df[feature + '_imp'])
    
    #observe that I preserve the index of the missing data from the original dataframe
    deter_data.loc[df[feature].isnull(), "Det" + feature] = model.predict(df[parameters])[df[feature].isnull()]

In [0]:
pd.concat([df[["Insulin", "SkinThickness"]], deter_data[["DetInsulin", "DetSkinThickness"]]], axis = 1).describe().T

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
Insulin,394.0,155.548223,118.775855,14.0,76.25,125.0,190.0,846.0
SkinThickness,541.0,29.15342,10.476982,7.0,22.0,29.0,36.0,99.0
DetInsulin,768.0,150.751079,88.566698,14.0,105.0,136.258071,175.852331,846.0
DetSkinThickness,768.0,29.094249,9.220655,7.0,23.0,28.996624,35.0,99.0


### Stochastic regression

To add uncertainty back to the imputed values, we can add normally distributed noise with a mean of zero and a standard deviation equal to the standard error of the regression estimates. This method is called stochastic regression imputation.
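
A minimal sketch of this on made-up data with one predictor (`x` and `y` are hypothetical columns, and `np.polyfit` stands in for `LinearRegression`): the noise scale is estimated from the residuals on the observed rows, and one random draw is added to each prediction.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(7)

# Hypothetical toy data: y depends linearly on x, plus noise
x = rng.uniform(0, 10, 200)
y = 1.0 + 2.0 * x + rng.normal(0, 1.0, 200)
toy = pd.DataFrame({"x": x, "y": y})

# Remove 60 of the y values at random
toy.loc[rng.choice(200, size=60, replace=False), "y"] = np.nan
observed = toy["y"].notna()

# Fit a straight line on the observed rows and predict for every row
slope, intercept = np.polyfit(toy.loc[observed, "x"], toy.loc[observed, "y"], deg=1)
prediction = intercept + slope * toy["x"]

# Estimate the noise scale from the residuals on the observed rows
residual_std = (toy.loc[observed, "y"] - prediction[observed]).std()

# Stochastic imputation: prediction plus a zero-mean normal draw per row
noise = rng.normal(loc=0.0, scale=residual_std, size=len(toy))
toy["y_sto"] = toy["y"].fillna(prediction + noise)

print(f"residual std: {residual_std:.3f}")
print(toy.loc[~observed, ["x", "y_sto"]].head())
```

The notebook's own code additionally keeps only the positive draws, since these measurements cannot be negative.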

In [0]:
random_data = pd.DataFrame(columns = ["Ran" + name for name in missing_columns])

for feature in missing_columns:
        
    random_data["Ran" + feature] = df[feature + '_imp']
    parameters = list(set(df.columns) - set(missing_columns) - {feature + '_imp'})
    
    model = linear_model.LinearRegression()
    model.fit(X = df[parameters], y = df[feature + '_imp'])
    
    #The standard error of the regression estimates is the std() of the prediction errors on the observed rows
    predict = model.predict(df[parameters])
    std_error = (predict[df[feature].notnull()] - df.loc[df[feature].notnull(), feature + '_imp']).std()
    
    #observe that I preserve the index of the missing data from the original dataframe
    random_predict = np.random.normal(size = df[feature].shape[0], 
                                      loc = predict, 
                                      scale = std_error)
    random_data.loc[(df[feature].isnull()) & (random_predict > 0), "Ran" + feature] = random_predict[(df[feature].isnull()) & 
                                                                            (random_predict > 0)]

In [0]:
pd.concat([df[["Insulin", "SkinThickness"]], random_data[["RanInsulin", "RanSkinThickness"]]], axis = 1).describe().T

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
Insulin,394.0,155.548223,118.775855,14.0,76.25,125.0,190.0,846.0
SkinThickness,541.0,29.15342,10.476982,7.0,22.0,29.0,36.0,99.0
RanInsulin,768.0,158.391902,109.956649,0.758031,81.75,134.314866,207.190644,846.0
RanSkinThickness,768.0,29.137821,10.269483,6.516797,22.0,29.0,36.0,99.0
