## Missing Values and Imputation

### Reasons for NA:
1. **Data is lost:** a common problem with production data. The company changes servers, becomes overgrown with bureaucracy, or digitizes slowly and in stages, and data gets lost along the way.
2. **Data is not captured:** a common story. For example, the respondent did not want to answer a question, or a metric was introduced in the middle of the data collection process and it is not cost-effective to restart the collection.
3. **Wrong values present in the dataset:** values that are obvious outliers even for the intended population. For example, age 221, city of Moscow, etc. (often, malfunctioning algorithms send random data or zeros when the value is in fact NA)
   
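Another classification, by the mechanism of missingness:
1. **Missing completely at random (MCAR):** the chance that a value is missing does not depend on any data, observed or not.
2. **Missing at random (MAR):** the chance that a value is missing depends only on other, observed variables.
3. **Missing not at random (MNAR):** the chance that a value is missing depends on the missing value itself.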

*The more data is missing not at random (MNAR), the more biased the estimates are*. That is why in ML tasks it is important to deal with missing data in the dataset before *training* models. What's more, many Python packages don't handle gaps well (try the `.mean` method on a `numpy` array that contains NA and you'll be surprised).
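
To make the three mechanisms concrete, here is a minimal simulation sketch. The toy population (`age`, `income`), the missingness rates and the variable names are all illustrative assumptions, not part of the penguin dataset. It compares the naive "mean of what is left" with the true mean under each mechanism.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the sketch is reproducible
n = 10_000

# Hypothetical toy population: income grows with age, plus noise
age = rng.normal(40, 10, n)
income = 1000 + 50 * age + rng.normal(0, 200, n)

# MCAR: every value has the same 30% chance of being missing
mcar = rng.random(n) < 0.3
# MAR: the chance depends on an observed variable (age), not on income itself
mar = rng.random(n) < np.where(age > 40, 0.5, 0.1)
# MNAR: the chance depends on the unobserved value itself (high incomes hide)
mnar = rng.random(n) < np.where(income > np.median(income), 0.5, 0.1)

true_mean = income.mean()
# Bias of the naive observed mean under each mechanism
bias = {name: income[~mask].mean() - true_mean
        for name, mask in [("MCAR", mcar), ("MAR", mar), ("MNAR", mnar)]}

for name, value in bias.items():
    print(f"{name}: bias of the observed mean = {value:+.1f}")
```

In this sketch the MCAR bias stays close to zero, while the MAR and MNAR means are pulled down. The difference is that under MAR the gap can in principle be corrected by conditioning on the observed `age`, whereas under MNAR no observed variable explains it.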

### Methods for working with NA (on the example of [dataset with penguins](https://www.kaggle.com/parulpandey/penguin-dataset-the-new-iris))

In [1]:
! pip install palmerpenguins


In [17]:

from palmerpenguins import load_penguins
import pandas as pd
import numpy as np 
import math

df = load_penguins()
df.head()

Unnamed: 0,species,island,bill_length_mm,bill_depth_mm,flipper_length_mm,body_mass_g,sex,year
0,Adelie,Torgersen,39.1,18.7,181.0,3750.0,male,2007
1,Adelie,Torgersen,39.5,17.4,186.0,3800.0,female,2007
2,Adelie,Torgersen,40.3,18.0,195.0,3250.0,female,2007
3,Adelie,Torgersen,,,,,,2007
4,Adelie,Torgersen,36.7,19.3,193.0,3450.0,female,2007


In [18]:
print("The length of the original dataset:", len(df))

The length of the original dataset: 344


#### 0. **Leave everything as it is** (if the algorithm ***can and knows how*** to work with gaps)

#### 1. **Simple**: drop missing values.

We discard rows/columns containing at least one/several/all NaNs.

**+:** minimum cognitive load, maximum result.

**--:** there can be so many rows with missing values that dropping them eats up most of your dataset.

NB: don't forget that you can call `help` on a function; it may list options that you need!

In [19]:
# help(pd.DataFrame.dropna) # - help

df1 = df.dropna() # this is how we drop all rows containing at least 1 NA
df1.head()

Unnamed: 0,species,island,bill_length_mm,bill_depth_mm,flipper_length_mm,body_mass_g,sex,year
0,Adelie,Torgersen,39.1,18.7,181.0,3750.0,male,2007
1,Adelie,Torgersen,39.5,17.4,186.0,3800.0,female,2007
2,Adelie,Torgersen,40.3,18.0,195.0,3250.0,female,2007
4,Adelie,Torgersen,36.7,19.3,193.0,3450.0,female,2007
5,Adelie,Torgersen,39.3,20.6,190.0,3650.0,male,2007


In [21]:
len(df1)

333

In [22]:
df.dropna(axis = 'columns', how = "any").head(5) # and this is how we drop all *columns* that contain *at least one* NA; sometimes useful

Unnamed: 0,species,island,year
0,Adelie,Torgersen,2007
1,Adelie,Torgersen,2007
2,Adelie,Torgersen,2007
3,Adelie,Torgersen,2007
4,Adelie,Torgersen,2007
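
The "at least one/several/all" variants mentioned above map onto the `how`, `thresh` and `subset` parameters of `dropna`. A small sketch on a hypothetical four-row frame (the values are made up for illustration):

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame, not the real penguin data
toy = pd.DataFrame({
    "species": ["Adelie", "Adelie", "Gentoo", "Gentoo"],
    "bill_length_mm": [39.1, np.nan, 46.1, np.nan],
    "body_mass_g": [3750.0, np.nan, np.nan, 5000.0],
    "sex": ["male", np.nan, "female", np.nan],
})
measurements = ["bill_length_mm", "body_mass_g"]

# Drop a row if it has at least one NA anywhere (the default, how="any")
any_na = toy.dropna()
# Drop a row only if *all* of the listed columns are NA
no_measurements = toy.dropna(subset=measurements, how="all")
# Keep only rows with at least 3 non-NA values
at_least_3 = toy.dropna(thresh=3)
# Look at one column only: drop rows where sex is unknown
known_sex = toy.dropna(subset=["sex"])

print(len(any_na), len(no_measurements), len(at_least_3), len(known_sex))
```

Restricting the check with `subset` is often a gentler option than a plain `dropna()`, because rows are lost only when the columns you actually need are empty.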


#### 2. **Imputation**: mask with something similar (medium)

Quickly replace NA with median/mean/modal values with `SimpleImputer`

**+ :** quickly makes numeric data look neat.

**-- :** the mean and median work only for numeric data, and with a large number of substitutions the bias of the dataset grows.

NB: you can also use the mode for categorical variables: just substitute the most frequent value into the gaps.

In [23]:
from sklearn.impute import SimpleImputer
# help(SimpleImputer) # - help

x = np.array(df['body_mass_g'])
x = x.reshape(-1,1)

print("Mean penguin mass:", x.mean())

Mean penguin mass: nan
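
The `nan` result is ordinary NumPy behaviour: a single `NaN` propagates through the sum. If you only need the statistic and not a filled array, a NaN-aware function is enough. A short sketch on made-up values:

```python
import numpy as np
import pandas as pd

# Illustrative values with one gap
values = np.array([3750.0, 3800.0, np.nan, 3450.0])

plain_mean = values.mean()              # NaN propagates through the sum
nan_aware_mean = np.nanmean(values)     # ignores NaN entries
pandas_mean = pd.Series(values).mean()  # pandas skips NaN by default (skipna=True)

print(plain_mean, nan_aware_mean, pandas_mean)
```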


In [24]:
#strategies: mean, median, most_frequent, constant

imputer = SimpleImputer(strategy = 'mean')
imputer.fit(x)
x1 = imputer.transform(x)

print("Mean penguin mass:", x1.mean())
# df['body_mass_g'] = imputer.transform(x)

Mean penguin mass: 4201.754385964912
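
For the categorical case from the NB above, the `most_frequent` strategy can be sketched like this. The toy `sex` column is made up for illustration, and `dtype=object` is set explicitly as an assumption to keep the column a plain object column across pandas versions.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical categorical column with two gaps
toy = pd.DataFrame(
    {"sex": ["male", "female", np.nan, "male", np.nan, "female", "male"]},
    dtype=object,
)

# most_frequent fills each gap with the column's mode ("male" here)
mode_imputer = SimpleImputer(strategy="most_frequent")
filled = mode_imputer.fit_transform(toy[["sex"]]).ravel()

# The same idea in pure pandas
filled_pandas = toy["sex"].fillna(toy["sex"].mode()[0])

print(list(filled))
```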


#### 3. Imputation: k-NN method (classification classic)
   **++ :** cooler and more flexible than using a simple imputer.
   
   **-- :** computation on large datasets takes time.

In [25]:
from sklearn.impute import KNNImputer
# help(KNNImputer) # - help

k = np.array(df['body_mass_g'])
k = k.reshape(-1,1) # a 2-D column, as the imputer expects

print("Mean penguin mass:", k.mean())

Mean penguin mass: nan


In [26]:
# set n_neighbors

imputer = KNNImputer(n_neighbors = 3, weights='uniform', metric='nan_euclidean')
k1 = imputer.fit_transform(k)

print("Mean penguin mass:", k1.mean())
# df['body_mass_g'] = imputer.transform(k)

Mean penguin mass: 4201.754385964912
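
Note that this is exactly the value `SimpleImputer` produced. With a single column, a row whose only value is missing has no defined distance to any other row, and `KNNImputer` then falls back to the column mean. The neighbours start to matter only when other features are present. A minimal sketch on a made-up two-column array (flipper length and body mass):

```python
import numpy as np
from sklearn.impute import KNNImputer

# Hypothetical toy data; columns: flipper_length_mm, body_mass_g
toy = np.array([
    [180.0, 3700.0],
    [182.0, 3800.0],
    [210.0, 5000.0],
    [212.0, 5100.0],
    [181.0, np.nan],   # small flipper, unknown mass
])

# One column only: no usable distances, so the column mean is used
mass_only = KNNImputer(n_neighbors=2).fit_transform(toy[:, [1]])
# Two columns: the two rows with the closest flipper length are averaged
with_flipper = KNNImputer(n_neighbors=2).fit_transform(toy)

print("mass column alone:", mass_only[-1, 0])
print("with flipper length:", with_flipper[-1, 1])
```

In this sketch the single-column result is the overall mean of the four known masses, while the two-column result is the average of the two small-flipper rows.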


For the whole dataset:

In [27]:
# take every numeric column as a 2-D array
X = df.select_dtypes(include=np.number).values
print('old Missing: %d' % np.isnan(X).sum())

# default settings: n_neighbors=5, uniform weights
imputer = KNNImputer()
Xtrans = imputer.fit_transform(X)

print('new Missing: %d' % np.isnan(Xtrans).sum())

old Missing: 8
new Missing: 0
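
The imputer returns a bare NumPy array, so column names are lost. One possible way to write the result back into the DataFrame is sketched below on a hypothetical toy frame (the values and the name `imputed` are illustrative assumptions):

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

# Hypothetical toy frame with one non-numeric column
toy = pd.DataFrame({
    "species": ["Adelie", "Adelie", "Gentoo", "Gentoo", "Adelie"],
    "flipper_length_mm": [180.0, 182.0, np.nan, 212.0, 181.0],
    "body_mass_g": [3700.0, 3800.0, 5000.0, 5100.0, np.nan],
})

# Impute only the numeric columns and assign them back by name
numeric_cols = toy.select_dtypes(include=np.number).columns
imputed = toy.copy()
imputed[numeric_cols] = KNNImputer(n_neighbors=2).fit_transform(toy[numeric_cols])

print(imputed)
```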


By the way, ***categorical*** variables ***can also be imputed***: just turn them into dummy variables and back:

In [15]:
df['is_male'] = df['sex'].map({'male': 1,
                             'female': 0})
df.head()

Unnamed: 0,species,island,bill_length_mm,bill_depth_mm,flipper_length_mm,body_mass_g,sex,year,is_male
0,Adelie,Torgersen,39.1,18.7,181.0,3750.0,male,2007,1.0
1,Adelie,Torgersen,39.5,17.4,186.0,3800.0,female,2007,0.0
2,Adelie,Torgersen,40.3,18.0,195.0,3250.0,female,2007,0.0
3,Adelie,Torgersen,,,,,,2007,
4,Adelie,Torgersen,36.7,19.3,193.0,3450.0,female,2007,0.0


In [16]:
sex = df['is_male'].values
sex = sex.reshape(-1,1)

imputer = KNNImputer(n_neighbors = 2)
sex1 = imputer.fit_transform(sex)
sex1

array([[1.       ],
       [0.       ],
       [0.       ],
       [0.5045045],
       [0.       ],
       [1.       ],
       [0.       ],
       [1.       ],
       [0.5045045],
       [0.5045045],
       [0.5045045],
       [0.5045045],
       [0.       ],
       [1.       ],
       [1.       ],
       [0.       ],
       [0.       ],
       [1.       ],
       [0.       ],
       [1.       ],
       [0.       ],
       [1.       ],
       [0.       ],
       [1.       ],
       [1.       ],
       [0.       ],
       [1.       ],
       [0.       ],
       [0.       ],
       [1.       ],
       [0.       ],
       [1.       ],
       [0.       ],
       [1.       ],
       [0.       ],
       [1.       ],
       [1.       ],
       [0.       ],
       [0.       ],
       [1.       ],
       [0.       ],
       [1.       ],
       [0.       ],
       [1.       ],
       [0.       ],
       [1.       ],
       [1.       ],
       [0.5045045],
       [0.       ],
       [1.       ],


*Honestly, I don't know how to interpret a gender of 0.5.*
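
The value 0.5045 is simply the share of males in the column: with only `is_male` passed in, the imputer has no other feature to find neighbours by and falls back to the column mean. One possible way to get an interpretable label is to give the imputer at least one other feature and round the result back to a category. A minimal sketch on made-up data (the frame, `n_neighbors = 3` and the `sex_imputed` column are illustrative assumptions):

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

# Hypothetical toy data: lighter birds are female, heavier ones male
toy = pd.DataFrame({
    "body_mass_g": [3300.0, 3400.0, 4300.0, 4400.0, 3360.0, 4360.0],
    "sex": ["female", "female", "male", "male", np.nan, np.nan],
})
toy["is_male"] = toy["sex"].map({"male": 1, "female": 0})

# Neighbours are found by body mass; an odd k avoids exact 0.5 ties
features = toy[["body_mass_g", "is_male"]].to_numpy(dtype=float)
imputed = KNNImputer(n_neighbors=3).fit_transform(features)

# Round the averaged 0/1 votes and map them back to labels
is_male = np.rint(imputed[:, 1]).astype(int)
toy["sex_imputed"] = pd.Series(is_male, index=toy.index).map({1: "male", 0: "female"})

print(toy[["body_mass_g", "sex", "sex_imputed"]])
```

With several features on different scales, it would usually make sense to standardize them first, since the distance is dominated by the largest-valued column.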

### What else to read on the topic:
  * [Medium](https://towardsdatascience.com/the-robustness-of-machine-learning-algorithms-against-missing-or-abnormal-values-ec3222379905), where the author compares the performance of random forest, boosting and lasso with different imputation methods;
  * [yet another Medium](https://medium.com/analytics-vidhya/why-it-is-important-to-handle-missing-data-and-10-methods-to-do-it-29d32ec4e6a), where you can read descriptions of the methods;
  * [guide](https://machinelearningmastery.com/knn-imputation-for-missing-values-in-machine-learning/) to kNN imputation.