* **Imputation is the process of replacing missing data with substituted values.**

* **SimpleImputer** - Univariate imputer for completing missing values with simple strategies. It replaces missing values with a descriptive statistic (mean, median, or most frequent value) computed along each column, or with a constant value.

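* **enable_iterative_imputer** - IterativeImputer is still experimental, so **enable_iterative_imputer** has to be imported from sklearn.experimental before IterativeImputer can be imported from sklearn.impute.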

* **IterativeImputer** - Multivariate imputer that estimates each feature from all the others. It imputes missing values by modeling each feature that has missing values as a function of the other features, in a round-robin fashion.

* **KNNImputer** - Imputation for completing missing values using k-Nearest Neighbors. Each sample's missing values are imputed using the mean value from the **n_neighbors** nearest neighbors found in the training set. Two samples are close if the features that neither is missing are close.
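
The idea of closeness used by KNNImputer can be sketched with scikit-learn's `nan_euclidean_distances`, which to my understanding is the default metric behind `KNNImputer`. The two samples below are made-up values for illustration only.

```python
import numpy as np
from sklearn.metrics.pairwise import nan_euclidean_distances

# Toy samples (hypothetical values); the second one is missing its last feature
a = np.array([[3.0, 4.0, 5.0]])
b = np.array([[0.0, 0.0, np.nan]])

# Only the first two features are present in both samples, so only they are compared.
# The squared distance is scaled up by (total features / features compared) = 3 / 2.
dist = nan_euclidean_distances(a, b)[0, 0]
expected = np.sqrt(3 / 2 * (3.0**2 + 4.0**2))
print(dist, expected)
```

The scaling keeps distances comparable between pairs of samples that share different numbers of observed features.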

In [3]:
# Importing the necessary libraries
import pandas as pd
import numpy as np
from sklearn.impute import SimpleImputer
# enable_iterative_imputer must be imported before IterativeImputer
from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import IterativeImputer
from sklearn.impute import KNNImputer

In [8]:
#Loading the dataset
data = pd.read_csv("D:\\SLIIT\\3rd year 2nd sem\\Fundamentals of Data Mining\\Coding\\Cupcakes.csv")
data.head()

|   | Mese | Cupcake |
|---|---|---|
| 0 | 2004-01 | 5 |
| 1 | 2004-02 | 5 |
| 2 | 2004-03 | 4 |
| 3 | 2004-04 | 6 |
| 4 | 2004-05 | 5 |


In [9]:
data.describe()

|   | Cupcake |
|---|---|
| count | 204.0 |
| mean | 49.661765 |
| std | 28.192482 |
| min | 4.0 |
| 25% | 25.0 |
| 50% | 50.0 |
| 75% | 73.0 |
| max | 100.0 |


In [14]:
# sum() counts the missing values per column; count() would return the number of rows instead
data.isnull().sum()

Mese       0
Cupcake    0
dtype: int64
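
For reference, a minimal sketch on a made-up frame (not the Cupcakes data) of how `isnull()` behaves when combined with `count()` and with `sum()`. The column names are reused only for familiarity.

```python
import numpy as np
import pandas as pd

# Toy frame (made-up values) with one missing entry in each column
toy = pd.DataFrame({"Mese": ["2004-01", None, "2004-03"],
                    "Cupcake": [5.0, 4.0, np.nan]})

total_rows = toy.isnull().count()  # counts every cell of the boolean mask, missing or not
missing = toy.isnull().sum()       # True counts as 1, so this is the number of missing cells

print(total_rows)
print(missing)
```

Only the `sum()` variant says how many values are actually missing in each column.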

In [15]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 204 entries, 0 to 203
Data columns (total 2 columns):
 #   Column   Non-Null Count  Dtype 
---  ------   --------------  ----- 
 0   Mese     204 non-null    object
 1   Cupcake  204 non-null    int64 
dtypes: int64(1), object(1)
memory usage: 3.3+ KB


# Univariate Feature Imputation

In [4]:
preprocessor = SimpleImputer(missing_values=np.nan, strategy='mean')

In [16]:
X = np.array(data['Cupcake']).reshape(-1,1)
preprocessor.fit(X)

SimpleImputer()

In [18]:
X_prep = preprocessor.transform(X)

In [20]:
data['Cupcake_univariate'] = X_prep.ravel() #flatten the (n, 1) array back to one dimension
data.head()

|   | Mese | Cupcake | Cupcake_univariate |
|---|---|---|---|
| 0 | 2004-01 | 5 | 5.0 |
| 1 | 2004-02 | 5 | 5.0 |
| 2 | 2004-03 | 4 | 4.0 |
| 3 | 2004-04 | 6 | 6.0 |
| 4 | 2004-05 | 5 | 5.0 |
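
To see what the different `strategy` options of SimpleImputer do when a value is really missing, here is a small sketch on a made-up column (the numbers are assumptions chosen so that the three statistics differ).

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Toy column (made-up values) with one gap at position 1
X_toy = np.array([[1.0], [np.nan], [1.0], [2.0], [8.0]])

results = {}
for strategy in ("mean", "median", "most_frequent"):
    imputer = SimpleImputer(missing_values=np.nan, strategy=strategy)
    filled = imputer.fit_transform(X_toy).ravel()
    results[strategy] = filled[1]  # value that replaced the gap

# strategy="constant" fills with fill_value instead of a statistic
constant = SimpleImputer(strategy="constant", fill_value=0.0).fit_transform(X_toy).ravel()
results["constant"] = constant[1]

print(results)
```

The observed values 1, 1, 2, 8 have mean 3, median 1.5 and most frequent value 1, so each strategy fills the gap differently.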


# Multivariate Feature Imputation

In [21]:
preprocessor = IterativeImputer(max_iter=10, random_state=0)

Convert the two features (the Cupcake values and the row index) into arrays and combine them in the form [[f11,f21], [f12,f22] ...]. This can be done by applying the reshape() function to each feature and then the hstack() function as follows:

In [23]:
X1 = np.array(data['Cupcake']).reshape(-1,1)
X2 = np.array(data.index).reshape(-1,1)
X = np.hstack((X1,X2)) #Stack arrays in sequence horizontally

In [24]:
preprocessor.fit(X)

IterativeImputer(random_state=0)

In [29]:
X_prep = preprocessor.transform(X)
data['Cupcake_multivariate'] = X_prep[:, 0] #the first column holds the imputed Cupcake values

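Missing values are expected at positions 26 and 1. In the copy loaded here, data.info() reports 204 non-null entries for Cupcake, so the imputed values at these positions equal the originals: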

In [31]:
data.iloc[26]

Mese                    2006-03
Cupcake                      10
Cupcake_univariate         10.0
Cupcake_multivariate       10.0
Name: 26, dtype: object

In [32]:
data.iloc[1]

Mese                    2004-02
Cupcake                       5
Cupcake_univariate          5.0
Cupcake_multivariate        5.0
Name: 1, dtype: object
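
As a rough sketch of how IterativeImputer fills an actual gap, the toy array below (made-up values, not the Cupcakes data) mirrors the layout used above: the first column is the value to impute and the second is a position-like feature.

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401 (must come first)
from sklearn.impute import IterativeImputer

# Toy data (made-up): the first column is exactly twice the second, with one gap
X_toy = np.array([[2.0, 1.0],
                  [4.0, 2.0],
                  [6.0, 3.0],
                  [np.nan, 4.0],
                  [10.0, 5.0]])

imputer = IterativeImputer(max_iter=10, random_state=0)
X_filled = imputer.fit_transform(X_toy)

# The gap is estimated from the second column; for this linear toy data
# the estimate should land close to 8
print(X_filled[3, 0])
```

Unlike the column mean (5.5 here), the multivariate estimate follows the relationship between the two columns.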

# Nearest Neighbors Imputation

In [35]:
preprocessor = KNNImputer(n_neighbors=5,weights="distance")
preprocessor.fit(X)
X_prep = preprocessor.transform(X)
data['Cupcake_knn'] = X_prep[:, 0] #the first column holds the imputed Cupcake values
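
A small sketch of what KNNImputer does with a real gap, again on made-up values laid out like `X` above (value first, row position second). The choice of `n_neighbors=2` is only to keep the arithmetic easy to follow.

```python
import numpy as np
from sklearn.impute import KNNImputer

# Toy data (made-up): first column is the value, second is the row position
X_toy = np.array([[5.0, 0.0],
                  [np.nan, 1.0],
                  [7.0, 2.0],
                  [9.0, 3.0]])

imputer = KNNImputer(n_neighbors=2, weights="distance")
X_filled = imputer.fit_transform(X_toy)

# Rows 0 and 2 are the two nearest rows by position and equally far away,
# so the gap is filled with their average
print(X_filled[1, 0])
```

With `weights="distance"`, closer neighbors count more; here both neighbors are equally distant, so the result is the plain average of 5 and 7.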

In [36]:
data.iloc[26]

Mese                    2006-03
Cupcake                      10
Cupcake_univariate         10.0
Cupcake_multivariate       10.0
Cupcake_knn                10.0
Name: 26, dtype: object

In [37]:
data.iloc[1]

Mese                    2004-02
Cupcake                       5
Cupcake_univariate          5.0
Cupcake_multivariate        5.0
Cupcake_knn                 5.0
Name: 1, dtype: object
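
Because data.info() reports no null values in the loaded column, the three imputed columns simply repeat the original values. To compare the imputers on real gaps, one option is to hide known values and measure how well each method recovers them. The sketch below does this on a synthetic upward trend, which is an assumption standing in for the Cupcake column, not the real data.

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import SimpleImputer, IterativeImputer, KNNImputer

# Synthetic stand-in for the 'Cupcake' column (not the real data): a noiseless upward trend
values = np.linspace(5.0, 100.0, 50)
positions = np.arange(50, dtype=float)

# Hide the values at positions 26 and 1 so the imputers have something to fill
masked = values.copy()
masked[[26, 1]] = np.nan
X_demo = np.column_stack((masked, positions))

filled = {
    "univariate": SimpleImputer(strategy="mean").fit_transform(X_demo)[:, 0],
    "multivariate": IterativeImputer(max_iter=10, random_state=0).fit_transform(X_demo)[:, 0],
    "knn": KNNImputer(n_neighbors=5, weights="distance").fit_transform(X_demo)[:, 0],
}

for name, column in filled.items():
    # Absolute error against the hidden true values at the two masked positions
    print(name, np.abs(column[[26, 1]] - values[[26, 1]]))
```

On trending data like this, the mean fill ignores the position entirely, while the multivariate and nearest-neighbor fills can use it.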