#  Data Cleaning

## Description : Code for replacing missing data using the MICE algorithm
####  Link : https://www.kaggle.com/rtatman/data-cleaning-challenge-cleaning-numeric-columns/
###### Author : Monika Dogra
###### Revision : 1
###### Date : 28 Aug 2019

## Introduction : Pandas-MICE

This project provides a scikit-learn-style interface for imputing independent variables using MICE (Multiple Imputation by Chained Equations). MICE has become an industry-standard way of dealing with null values when preprocessing data. The argument is that filling nulls with a simple value such as the mean or mode throws away information in the other variables that could indicate what the missing values might be. With that in mind, we predict the null values from the other features in the data, preserving as much information as possible. MICE assumes the data are missing at random (MAR); if that assumption does not hold, this method is inappropriate and a feature discretization method should be used instead.

## Features:
* The model preserves the user's pandas DataFrame column and row indexes.
* The model can be fit on past data and applied to new, unseen data through separate fit and transform methods, which supports machine learning applications.
* Only columns that are never null in the training set are used to fit the model, unless the user sets the "seed_nulls" argument to "True". In that case all other nulls are first filled with the standard scikit-learn imputer (SimpleImputer), and the null target variable is then predicted from this filled data.
* The model_dict_ attribute holds the fitted models for each feature in a dictionary that can be accessed through the API. It can be saved and reused later if needed.
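
The fit-on-past-data, transform-on-new-data pattern can be sketched with scikit-learn's own experimental `IterativeImputer`. This is not the MiceImputer class from this project, only an illustration of the same interface; the `train` and `new` frames are made-up stand-ins, and wrapping the result back into a DataFrame is one assumed way to keep the row and column indexes.

```python
import numpy as np
import pandas as pd
from sklearn.experimental import enable_iterative_imputer  # noqa: F401 (required to enable the import below)
from sklearn.impute import IterativeImputer

# Hypothetical "past" data used for fitting.
train = pd.DataFrame({
    "Age": [44.0, 27.0, 30.0, 38.0, 40.0, 35.0, np.nan, 48.0],
    "Salary": [72000.0, 48000.0, 54000.0, 61000.0, np.nan, 58000.0, 52000.0, 79000.0],
})

# Hypothetical unseen data with its own row labels.
new = pd.DataFrame({"Age": [50.0, np.nan], "Salary": [np.nan, 67000.0]}, index=[8, 9])

imputer = IterativeImputer(max_iter=10, random_state=0)
imputer.fit(train)                                   # learn the per-column models once

# transform() returns a NumPy array, so the labels are restored by hand.
new_filled = pd.DataFrame(imputer.transform(new), columns=new.columns, index=new.index)
print(new_filled)
```

Observed values pass through unchanged; only the missing entries are replaced by predictions from the models learned on `train`.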
 

## The chained equation process can be broken down into six general steps:

Step 1: A simple imputation, such as imputing the mean, is performed for every missing value in the dataset. These mean imputations can be thought of as “place holders.”

Step 2: The “place holder” mean imputations for one variable (“var”) are set back to missing.

Step 3: The observed values from the variable “var” in Step 2 are regressed on the other variables in the imputation model, which may or may not consist of all of the variables in the dataset. In other words, “var” is the dependent variable in a regression model and all the other variables are independent variables in the regression model. These regression models operate under the same assumptions that one would make when performing, e.g., linear, logistic, or Poisson regression outside of the context of imputing missing data.

Step 4: The missing values for “var” are then replaced with predictions (imputations) from the regression model. When “var” is subsequently used as an independent variable in the regression models for other variables, both the observed and these imputed values will be used.

Step 5: Steps 2–4 are then repeated for each variable that has missing data. The cycling through each of the variables constitutes one iteration or “cycle.” At the end of one cycle all of the missing values have been replaced with predictions from regressions that reflect the relationships observed in the data.

Step 6: Steps 2 through 4 are repeated for a number of cycles, with the imputations being updated at each cycle. The number of cycles to be performed can be specified by the researcher. At the end of these cycles the final imputations are retained, resulting in one imputed dataset.
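
The six steps above can be sketched in a few lines of Python. This is a minimal illustration, not the MiceImputer class defined later: it assumes an all-numeric DataFrame, uses ordinary least squares for every variable, and the function name `chained_equations_sketch` and its `n_cycles` parameter are hypothetical.

```python
import numpy as np
import pandas as pd


def chained_equations_sketch(df, n_cycles=5):
    """Hypothetical helper illustrating Steps 1-6 on an all-numeric DataFrame."""
    missing = df.isnull()                          # where values were originally missing
    filled = df.fillna(df.mean())                  # Step 1: mean "place holders"
    targets = [c for c in df.columns if missing[c].any()]

    for _ in range(n_cycles):                      # Step 6: repeat for several cycles
        for col in targets:                        # Step 5: each variable with missing data
            rows_missing = missing[col].to_numpy() # Step 2: placeholders count as missing again
            others = filled.drop(columns=col).to_numpy()
            design = np.column_stack([np.ones(len(filled)), others])  # intercept + predictors
            # Step 3: regress the observed values of col on the other variables
            coef, *_ = np.linalg.lstsq(
                design[~rows_missing],
                filled.loc[~rows_missing, col].to_numpy(),
                rcond=None,
            )
            # Step 4: replace the missing entries with the regression predictions
            filled.loc[rows_missing, col] = design[rows_missing] @ coef
    return filled


data = pd.DataFrame({
    "Age": [44, 27, 30, 38, 40, 35, np.nan, 48, 50, 37],
    "Salary": [72000, 48000, 54000, 61000, np.nan, 58000, 52000, 79000, 83000, 67000],
})
imputed = chained_equations_sketch(data)
print(imputed)
```

Because the imputed values of one variable feed the regression for the next, later cycles refine the initial mean placeholders. The input frame is left untouched.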

# Import Libraries:

In [2]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("/home/ritesh/Desktop/md_work/data/Missing_Data.csv")
print(df)
df_1 = df.copy()

   Country   Age   Salary Purchased
0   France  44.0  72000.0        No
1    Spain  27.0  48000.0       Yes
2  Germany  30.0  54000.0        No
3    Spain  38.0  61000.0        No
4  Germany  40.0      NaN       Yes
5   France  35.0  58000.0       Yes
6    Spain   NaN  52000.0        No
7   France  48.0  79000.0       Yes
8  Germany  50.0  83000.0        No
9   France  37.0  67000.0       Yes


In [9]:
# Locations with empty values:
df_1 = df.loc[:, ["Age","Salary"]]
print(df_1)

    Age   Salary
0  44.0  72000.0
1  27.0  48000.0
2  30.0  54000.0
3  38.0  61000.0
4  40.0      NaN
5  35.0  58000.0
6   NaN  52000.0
7  48.0  79000.0
8  50.0  83000.0
9  37.0  67000.0
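
The cell above picks the numeric columns by name. A small sketch of doing the same selection by dtype and then locating the gaps is shown here; the four-row frame is a stand-in built from rows 4 to 7 of the table printed earlier, re-indexed from 0, and the variable names are illustrative only.

```python
import numpy as np
import pandas as pd

# Stand-in for a few rows of the CSV shown above.
sample = pd.DataFrame({
    "Country": ["Germany", "France", "Spain", "France"],
    "Age": [40.0, 35.0, np.nan, 48.0],
    "Salary": [np.nan, 58000.0, 52000.0, 79000.0],
    "Purchased": ["Yes", "Yes", "No", "Yes"],
})

numeric_df = sample.select_dtypes(include="number")           # keep only numeric columns
missing_counts = numeric_df.isnull().sum()                    # number of nulls per column
rows_with_gaps = numeric_df[numeric_df.isnull().any(axis=1)]  # rows with at least one null

print(missing_counts)
print(rows_with_gaps)
```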


In [8]:
df_1 = df_1.fillna(df_1.mean())         # Fill missing values with the column means before applying the MICE algorithm.
print(df_1)

   Country        Age        Salary Purchased
0   France  44.000000  72000.000000        No
1    Spain  27.000000  48000.000000       Yes
2  Germany  30.000000  54000.000000        No
3    Spain  38.000000  61000.000000        No
4  Germany  40.000000  63777.777778       Yes
5   France  35.000000  58000.000000       Yes
6    Spain  38.777778  52000.000000        No
7   France  48.000000  79000.000000       Yes
8  Germany  50.000000  83000.000000        No
9   France  37.000000  67000.000000       Yes


In [29]:
# This implementation imputes numeric data only. If the dataset also contains categorical columns,
# pass just the numeric columns of the DataFrame. Missing categorical data needs a different algorithm.

 
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.impute import SimpleImputer 
class MiceImputer:

    def __init__(self, seed_nulls=False, seed_strategy='mean'):
        self.seed_nulls = seed_nulls
        self.seed_strategy = seed_strategy
        self.model_dict_ = {}                              # Fitted models per column; kept per instance so imputers do not share state.
    

    def transform(self, X):                                # Impute all missing values in X.
        col_order = X.columns
        new_X = []
        mutate_cols = list(self.model_dict_.keys())        # Columns that have a fitted model.

        for i in mutate_cols:
            y = X[i]
            x_null = X[y.isnull()]                         # Rows where column i is missing.
            y_null = y[y.isnull()].reset_index()['index']  # Original row labels of those missing entries.
            y_notnull = y[y.notnull()]                     # Observed values of column i.

            model = self.model_dict_.get(i)                # [regression model, optional fitted SimpleImputer]

            if self.seed_nulls:                            # seed_nulls: fill the predictors with the fitted SimpleImputer first.
                x_null = model[1].transform(x_null)
            else:
                null_check = x_null.isnull().any()         # Per column: True if it contains any null.
                x_null = x_null[null_check.index[~null_check]]  # Keep only the columns without nulls.

            # Predict the missing entries, restore their row labels, and stack them with the observed values.
            pred = pd.concat([pd.Series(model[0].predict(x_null))\
                              .to_frame()\
                              .set_index(y_null),y_notnull], axis=0)\
                              .rename(columns={0: i})

            new_X.append(pred)

        new_X.append(X[X.columns.difference(mutate_cols)])  # Columns that needed no imputation.
        final = pd.concat(new_X, axis=1)[col_order]         # Reassemble the columns in their original order.

        return final
        
        
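    def fit(self, X):                                      # Fit the imputer on X.
        x = X.fillna(value=np.nan)                         # Represent every missing value as NaN.

        null_check = x.isnull().any()                      # Per column: True if it contains any null.
        null_data = x[null_check.index[null_check]]        # Columns with nulls; each gets its own model.
        
        for i in null_data:
            y = null_data[i]
            y_notnull = y[y.notnull()]                     # Observed values of column i (the regression target).

            model_list = []
            if self.seed_nulls:
                # Fill every null in x with the seed strategy and keep the fitted imputer for transform().
                imp = SimpleImputer(strategy=self.seed_strategy)
                model_list.append(imp.fit(x))
                non_null_data = pd.DataFrame(imp.fit_transform(x))
                
            else:
                # Use only the columns that are never null as predictors.
                non_null_data = x[null_check.index[~null_check]]
                
            
            x_notnull = non_null_data[y.notnull()]         # Predictor rows where column i is observed.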
            
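            # nunique() counts the distinct observed values of column i.
            # As written, a column with fewer than two distinct values gets a LinearRegression;
            # every other column gets a LogisticRegression, which treats each distinct observed
            # value as a class, so its predictions are always one of the observed values.
            if y.nunique() < 2:
                model = LinearRegression()
            else:
                model = LogisticRegression()
            model.fit(x_notnull, y_notnull)
            model_list.insert(0, model)                         # Regression model first, imputer (if any) second.
            self.model_dict_.update({i: model_list})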
         
        return self
        

    def fit_transform(self, X):                             # Fit to data, then transform it.
        return self.fit(X).transform(X)

    
    
customer_df = pd.read_csv("/home/ritesh/Desktop/md_work/data/Missing_Data.csv")  

# Create an instance of MiceImputer; the arguments are passed to its __init__ method.

miceimputer_obj = MiceImputer(seed_nulls=True, seed_strategy="mean")

print(miceimputer_obj.fit_transform(customer_df.loc[:, ["Age","Salary"]]))


# The libraries may emit FutureWarning messages; the lines below suppress them on later runs, if you wish.
from warnings import simplefilter
# Ignore all future warnings.
simplefilter(action='ignore', category=FutureWarning)

    Age   Salary
0  44.0  72000.0
1  27.0  48000.0
2  30.0  54000.0
3  38.0  61000.0
4  40.0  83000.0
5  35.0  58000.0
6  50.0  52000.0
7  48.0  79000.0
8  50.0  83000.0
9  37.0  67000.0
