# Missing values


## About missing values
`NaN` is the usual placeholder for missing values. However, because `NaN` is a floating-point value, a column that contains it is stored as float and cannot be converted to a plain integer type while the `NaN`s remain. The `missing_values` parameter of the `sklearn.impute` transformers lets us specify another placeholder instead, such as an integer, e.g., -1.
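As a minimal sketch of the `missing_values` parameter, the example below assumes a small, made-up integer array in which -1 marks a missing entry. The array and the choice of the `most_frequent` strategy are illustrative assumptions only.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical integer data in which -1 marks a missing entry.
X = np.array([[1, 10],
              [-1, 20],
              [3, -1],
              [1, 10]])

# Tell the imputer that -1 (not NaN) is the placeholder to fill.
imputer = SimpleImputer(missing_values=-1, strategy="most_frequent")
X_filled = imputer.fit_transform(X)
print(X_filled)
```

Each -1 is replaced by the most frequent observed value of its column, so the data never has to pass through `NaN`.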


## How to identify missing values

`df.isnull()`: this returns a DataFrame of booleans, with True wherever a value is missing.
    
## How to deal with missing values

1. Drop the missing values
- By dropping the entire row

- By dropping the entire column

2. Replace the missing value - imputation

- Use business understanding
- Statistical methods: imputation using mean, median or mode (most_frequent)

`df['col_name'] = df['col_name'].fillna(df['col_name'].mean())`

- Imputation algorithms (sklearn.impute)
    - SimpleImputer
    - IterativeImputer
    - KNNImputer
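The options listed above can be sketched in plain pandas. The frame `toy` below is a hypothetical example, not the notebook's data, and the median/mode choices are only one reasonable assumption.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame, not the notebook's data.
toy = pd.DataFrame({
    "score": [90.0, np.nan, 70.0, 80.0],
    "group": ["a", "b", None, "a"],
})

rows_dropped = toy.dropna(axis=0)   # keep only rows without missing values
cols_dropped = toy.dropna(axis=1)   # keep only columns without missing values

# Statistical imputation: median for the numeric column,
# mode (most frequent value) for the categorical one.
filled = toy.fillna({
    "score": toy["score"].median(),
    "group": toy["group"].mode()[0],
})
print(filled)
```

Note that dropping columns removes both columns here, because each of them has at least one missing value; this is why dropping is usually reserved for rows or columns that are mostly empty.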

In [1]:
import pandas as pd
import numpy as np
df = pd.read_csv('Data/data.csv')
df

Unnamed: 0,FirstName,LastName,Age,Sex,preTestScore,postTestScore,location
0,abc,mno,12.0,m,90.0,65.0,
1,,,,,90.0,?,
2,ghi,pqr,12.0,f,-,65.0,?
3,jkl,stu,12.0,f,90.0,62.0,
4,mno,vwx,12.0,m,89.0,63.0,


In [3]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5 entries, 0 to 4
Data columns (total 7 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   FirstName      4 non-null      object 
 1   LastName       4 non-null      object 
 2   Age            4 non-null      float64
 3   Sex            4 non-null      object 
 4   preTestScore   5 non-null      object 
 5   postTestScore  5 non-null      object 
 6   location       1 non-null      object 
dtypes: float64(1), object(6)
memory usage: 408.0+ bytes


In [4]:
#the missing values are represented in more than one form
missing_vals = ['n/a', '-', '?']
df = pd.read_csv('Data/data.csv', na_values=missing_vals) 
df

Unnamed: 0,FirstName,LastName,Age,Sex,preTestScore,postTestScore,location
0,abc,mno,12.0,m,90.0,65.0,
1,,,,,90.0,,
2,ghi,pqr,12.0,f,,65.0,
3,jkl,stu,12.0,f,90.0,62.0,
4,mno,vwx,12.0,m,89.0,63.0,


All the placeholder representations of missing values are now read as NaN. As a result, the score columns are parsed as numbers and the `location` column contains nothing but NaN.
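If the data has already been loaded without `na_values`, a similar clean-up can be done afterwards. The sketch below assumes a hypothetical frame `raw` whose placeholders are still strings; it is an alternative to re-reading the file, not what the notebook itself does.

```python
import numpy as np
import pandas as pd

# Hypothetical frame loaded without na_values: placeholders are still strings.
raw = pd.DataFrame({
    "preTestScore": ["90.0", "90.0", "-", "89.0"],
    "postTestScore": ["65.0", "?", "65.0", "63.0"],
})

# Swap every known placeholder for NaN, then parse the columns as numbers.
cleaned = raw.replace(["n/a", "-", "?"], np.nan)
cleaned = cleaned.apply(pd.to_numeric)
print(cleaned.dtypes)
```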

In [5]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5 entries, 0 to 4
Data columns (total 7 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   FirstName      4 non-null      object 
 1   LastName       4 non-null      object 
 2   Age            4 non-null      float64
 3   Sex            4 non-null      object 
 4   preTestScore   4 non-null      float64
 5   postTestScore  4 non-null      float64
 6   location       0 non-null      float64
dtypes: float64(4), object(3)
memory usage: 408.0+ bytes


## How to identify missing values

In [6]:
df.isnull()

Unnamed: 0,FirstName,LastName,Age,Sex,preTestScore,postTestScore,location
0,False,False,False,False,False,False,True
1,True,True,True,True,False,True,True
2,False,False,False,False,True,False,True
3,False,False,False,False,False,False,True
4,False,False,False,False,False,False,True


In [7]:
#to identify the number of missing values in each column
df.isnull().sum()

FirstName        1
LastName         1
Age              1
Sex              1
preTestScore     1
postTestScore    1
location         5
dtype: int64

In [8]:
#to identify columns with at least one missing value
df.isnull().any(axis=0)

FirstName        True
LastName         True
Age              True
Sex              True
preTestScore     True
postTestScore    True
location         True
dtype: bool

In [9]:
#to identify columns in which every value is missing
df.isnull().all(axis=0)

FirstName        False
LastName         False
Age              False
Sex              False
preTestScore     False
postTestScore    False
location          True
dtype: bool

In [10]:
#number of columns with all missing values
df.isnull().all(axis=0).sum()

1

In [11]:
#rows with at least one missing value
df.isnull().any(axis=1)

0    True
1    True
2    True
3    True
4    True
dtype: bool

In [12]:
#rows with all missing values
df.isnull().all(axis=1)

0    False
1    False
2    False
3    False
4    False
dtype: bool

In [13]:
#number of rows with all missing values
df.isnull().all(axis=1).sum()

0

## Missing values treatment in columns


In [15]:
#calculating the percent of null values for each column
(df.isnull().sum()/len(df.index))*100

FirstName         20.0
LastName          20.0
Age               20.0
Sex               20.0
preTestScore      20.0
postTestScore     20.0
location         100.0
dtype: float64

In [20]:
#the location has 100%, which means that the entire column is made up of null values
#removing the location column

df.dropna(axis=1, how='all', inplace=True)
df

#this removes columns where "all" the values are NaN

Unnamed: 0,FirstName,LastName,Age,Sex,preTestScore,postTestScore
0,abc,mno,12.0,m,90.0,65.0
1,,,,,90.0,
2,ghi,pqr,12.0,f,,65.0
3,jkl,stu,12.0,f,90.0,62.0
4,mno,vwx,12.0,m,89.0,63.0


## Missing values treatment in rows

In [21]:
df[df.isnull().sum(axis=1)>=4]

Unnamed: 0,FirstName,LastName,Age,Sex,preTestScore,postTestScore
1,,,,,90.0,


In [23]:
#retaining the rows having <=4 NaNs

df = df[df.isnull().sum(axis=1)<=4]
df

Unnamed: 0,FirstName,LastName,Age,Sex,preTestScore,postTestScore
0,abc,mno,12.0,m,90.0,65.0
2,ghi,pqr,12.0,f,,65.0
3,jkl,stu,12.0,f,90.0,62.0
4,mno,vwx,12.0,m,89.0,63.0
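The row filter above can also be written with `dropna(thresh=...)`, where `thresh` is the minimum number of non-missing values a row must have to be kept. The sketch below rebuilds the first three rows of the table by hand; the variable names `toy` and `max_nans` are made up for illustration.

```python
import numpy as np
import pandas as pd

# First three rows of the table, rebuilt by hand; row 1 has five NaNs.
toy = pd.DataFrame(
    [["abc", "mno", 12.0, "m", 90.0, 65.0],
     [np.nan, np.nan, np.nan, np.nan, 90.0, np.nan],
     ["ghi", "pqr", 12.0, "f", np.nan, 65.0]],
    columns=["FirstName", "LastName", "Age", "Sex",
             "preTestScore", "postTestScore"],
)

max_nans = 4
by_mask = toy[toy.isnull().sum(axis=1) <= max_nans]

# At most 4 NaNs out of 6 columns means at least 2 non-missing values.
by_thresh = toy.dropna(thresh=toy.shape[1] - max_nans)
print(by_thresh)
```

Both forms keep rows 0 and 2 and drop row 1.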


In [24]:
df['preTestScore'].describe()

count     3.000000
mean     89.666667
std       0.577350
min      89.000000
25%      89.500000
50%      90.000000
75%      90.000000
max      90.000000
Name: preTestScore, dtype: float64

## Imputation using mean

In [25]:
df = df.fillna({'preTestScore': df['preTestScore'].mean()})
df

Unnamed: 0,FirstName,LastName,Age,Sex,preTestScore,postTestScore
0,abc,mno,12.0,m,90.0,65.0
2,ghi,pqr,12.0,f,89.666667,65.0
3,jkl,stu,12.0,f,90.0,62.0
4,mno,vwx,12.0,m,89.0,63.0


## sklearn.impute
By default, the scikit-learn imputers drop fully empty features, i.e. columns containing only missing values. Their `fit_transform` method also returns a NumPy array rather than a DataFrame, which is why the results below are wrapped in `pd.DataFrame`.
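Because an empty column is dropped, the returned array can have fewer columns than the input, and the original column names no longer line up. One possible way to recover the surviving names, sketched below on a hypothetical frame `X`, relies on the fitted `statistics_` attribute of `SimpleImputer`, which is NaN for features that had no observed values.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical frame: column "b" is entirely missing.
X = pd.DataFrame({
    "a": [1.0, np.nan, 3.0],
    "b": [np.nan, np.nan, np.nan],
    "c": [4.0, 5.0, np.nan],
})

imputer = SimpleImputer(strategy="mean")
arr = imputer.fit_transform(X)          # NumPy array, column "b" dropped

# Features whose statistic is NaN had nothing to average and were discarded.
kept = X.columns[~np.isnan(imputer.statistics_)]
X_imp = pd.DataFrame(arr, columns=kept, index=X.index)
print(X_imp)
```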


Transformers for missing value imputation:
1. IterativeImputer: multivariate imputer that estimates each feature from all the others

2. KNNImputer: imputation for completing missing values using k-nearest neighbors

3. MissingIndicator: binary indicators for missing values

4. SimpleImputer: univariate imputer for completing missing values with simple strategies


In [42]:
df = pd.read_csv('Data/data.csv', na_values=missing_vals)
df_num = df.select_dtypes(include=['number'])
df_num2 = df_num.drop(['location'], axis=1)
df_num

Unnamed: 0,Age,preTestScore,postTestScore,location
0,12.0,90.0,65.0,
1,,90.0,,
2,12.0,,65.0,
3,12.0,90.0,62.0,
4,12.0,89.0,63.0,


### IterativeImputer

In [44]:
#explicitly ask for this experimental feature

from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import IterativeImputer
iterimp = IterativeImputer()
df_iterimp = pd.DataFrame(iterimp.fit_transform(df_num2),
                         columns=df_num2.columns,
                         index=df_num2.index)
df_iterimp.head()

Unnamed: 0,Age,preTestScore,postTestScore
0,12.0,90.0,65.0
1,12.0,90.0,63.750226
2,12.0,89.751133,65.0
3,12.0,90.0,62.0
4,12.0,89.0,63.0


In [45]:
#df_num still contains location; the imputer drops it itself since it contains all null values

iterimp.fit_transform(df_num)



array([[12.        , 90.        , 65.        ],
       [12.        , 90.        , 63.75022648],
       [12.        , 89.75113289, 65.        ],
       [12.        , 90.        , 62.        ],
       [12.        , 89.        , 63.        ]])

### SimpleImputer

In [46]:
from sklearn.impute import SimpleImputer

mean_imputer = SimpleImputer(strategy='mean')
df_mean_imp = pd.DataFrame(mean_imputer.fit_transform(df_num2),
                          columns=df_num2.columns,
                          index=df_num2.index)
df_mean_imp.head()

Unnamed: 0,Age,preTestScore,postTestScore
0,12.0,90.0,65.0
1,12.0,90.0,63.75
2,12.0,89.75,65.0
3,12.0,90.0,62.0
4,12.0,89.0,63.0


In [47]:
mean_imputer.fit_transform(df_num)



array([[12.  , 90.  , 65.  ],
       [12.  , 90.  , 63.75],
       [12.  , 89.75, 65.  ],
       [12.  , 90.  , 62.  ],
       [12.  , 89.  , 63.  ]])

### KNNImputer

In [35]:
from sklearn.impute import KNNImputer
knn_imputer = KNNImputer(n_neighbors=3)

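#df_num2 is used here: the imputer drops the empty location column,
#so the column names of df_num would not match the 3-column result
df_knn_imp = pd.DataFrame(knn_imputer.fit_transform(df_num2),
                         columns=df_num2.columns,
                         index=df_num2.index)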
df_knn_imp.head()

Unnamed: 0,Age,preTestScore,postTestScore
0,12.0,90.0,65.0
1,12.0,90.0,63.333333
2,12.0,89.666667,65.0
3,12.0,90.0,62.0
4,12.0,89.0,63.0


In [48]:
knn_imputer.fit_transform(df_num)

array([[12.        , 90.        , 65.        ],
       [12.        , 90.        , 63.33333333],
       [12.        , 89.66666667, 65.        ],
       [12.        , 90.        , 62.        ],
       [12.        , 89.        , 63.        ]])

## Marking the imputed values using MissingIndicator

The `MissingIndicator` transformer converts a dataset into a corresponding binary matrix that marks where values are missing. Used together with imputation, it preserves the information about which values were originally missing.

P.S. `SimpleImputer`, `IterativeImputer` and `KNNImputer` have the parameter `add_indicator` (False by default). When set to True, it provides a convenient way of stacking the output of the `MissingIndicator` transformer with the output of the imputer.
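A minimal sketch of `add_indicator`, assuming a small made-up array: the imputed features come first, followed by one indicator column per feature that had missing values at fit time.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical data with one missing value in each column.
X = np.array([[1.0, np.nan],
              [3.0, 4.0],
              [np.nan, 6.0]])

imputer = SimpleImputer(strategy="mean", add_indicator=True)
X_out = imputer.fit_transform(X)

# Columns 0-1 hold the imputed features;
# columns 2-3 flag where the original values were missing.
print(X_out)
```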

In [51]:
from sklearn.impute import MissingIndicator
x_array = np.array([[-1, -1, 3, 7],
                    [4, 6, 0, -1],
                    [-1, 1, 6, 2]])

indicator = MissingIndicator(missing_values=-1)

mask_missing_values_only = indicator.fit_transform(x_array)
mask_missing_values_only

array([[ True,  True, False],
       [False, False,  True],
       [ True, False, False]])

This only returns the columns in which a missing value was found. To return all columns regardless, we use the `features` parameter.

This parameter chooses the features for which the mask is constructed. By default, it is `'missing-only'`, which returns the imputer mask of the features containing missing values at fit time.

The `features` parameter can be set to "all" to return all features whether or not they contain missing values.

In [52]:
#to check the columns that were returned

indicator.features_

array([0, 1, 3], dtype=int64)

In [53]:
indicator = MissingIndicator(missing_values=-1, features='all')
mask_all = indicator.fit_transform(x_array)
mask_all

array([[ True,  True, False, False],
       [False, False, False,  True],
       [ True, False, False, False]])

In [54]:
indicator.features_

array([0, 1, 2, 3])

In [61]:
from sklearn.impute import SimpleImputer, MissingIndicator
from sklearn.pipeline import FeatureUnion

transformer = FeatureUnion(
                      transformer_list =[
                          ('features', SimpleImputer(strategy='mean')),
                          ('indicators', MissingIndicator())
                      ])
results = pd.DataFrame(transformer.fit_transform(df_num2), 
                       #columns=df_num2.columns,
                       index=df_num2.index)
results

Unnamed: 0,0,1,2,3,4,5
0,12.0,90.0,65.0,0.0,0.0,0.0
1,12.0,90.0,63.75,1.0,0.0,1.0
2,12.0,89.75,65.0,0.0,1.0,0.0
3,12.0,90.0,62.0,0.0,0.0,0.0
4,12.0,89.0,63.0,0.0,0.0,0.0


In [62]:
results.columns = ['Age', 'preTestScore', 'postTestScore', 'is_null_age', 'is_null_preTest', 'is_null_postTest']
results

Unnamed: 0,Age,preTestScore,postTestScore,is_null_age,is_null_preTest,is_null_postTest
0,12.0,90.0,65.0,0.0,0.0,0.0
1,12.0,90.0,63.75,1.0,0.0,1.0
2,12.0,89.75,65.0,0.0,1.0,0.0
3,12.0,90.0,62.0,0.0,0.0,0.0
4,12.0,89.0,63.0,0.0,0.0,0.0


## Estimators that can handle NaN values

NaN values can be pain points for certain ML algorithms and would therefore require you to preprocess the data by filling in the missing values (imputation) or removing the rows/columns that contain them.

However, scikit-learn has several estimators that can work directly with data containing NaN values, without any preprocessing. Depending on the estimator, this is done by:

- Ignoring the missing coordinates when computing distances, e.g. `KNNImputer`
- Estimating the missing value by using the mean/median of the existing data points in the same column
- Using statistical techniques to estimate the missing value
- Built-in handling specific to the algorithm, e.g. a decision tree can learn at each split which child node the samples with missing values should go to.

The specific way any algorithm handles NaNs is usually documented within the scikit-learn documentation for that particular estimator.

According to scikit-learn, the following algorithms classified by type (cluster, regressor, classifier, transform) fall under this category:

1. Estimators that allow NaN values for type **cluster**:

`HDBSCAN`

2. Estimators that allow NaN values for type **regressor**:

`BaggingRegressor`

`DecisionTreeRegressor`

`HistGradientBoostingRegressor`

`RandomForestRegressor`

`StackingRegressor`

`VotingRegressor`

3. Estimators that allow NaN values for type **classifier**:

`BaggingClassifier`

`DecisionTreeClassifier`

`HistGradientBoostingClassifier`

`RandomForestClassifier`

`StackingClassifier`

`VotingClassifier`

4. Estimators that allow NaN values for type **transformer**:

`IterativeImputer`

`KNNImputer`

`MaxAbsScaler`

`MinMaxScaler`

`MissingIndicator`

`OneHotEncoder`

`OrdinalEncoder`

`PowerTransformer`

`QuantileTransformer`

`RobustScaler`

`SimpleImputer`

`StackingClassifier`

`StackingRegressor`

`StandardScaler`

`TargetEncoder`

`VarianceThreshold`

`VotingClassifier`

`VotingRegressor`