Real data is usually messier than artificial datasets: it can contain many outliers, NaN values, and so on.

In [2]:
import numpy as np
import pandas as pd

df = pd.DataFrame([[1,      np.nan, 2],
                   [2,      3,      5],
                   [np.nan, 4,      6]])

In [18]:
df.isnull().sum()

0    1
1    1
2    0
dtype: int64

**Drop NA**

If the data contains many missing values, it can be more practical to remove them. By default, `dropna()` drops every row that contains at least one NaN.

In [4]:
df.dropna()

Unnamed: 0,0,1,2
1,2.0,3.0,5


In [10]:
df.dropna(axis='columns')  # equivalent to df.dropna(axis=1)

Unnamed: 0,2
0,2
1,5
2,6


Drop a row only if all of its values are NaN. No row of `df` is entirely NaN, so here the result is unchanged.

In [22]:
df.dropna(how='all')

Unnamed: 0,0,1,2
0,1.0,,2
1,2.0,3.0,5
2,,4.0,6
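To make the effect of `how='all'` visible, here is a minimal sketch on a hypothetical frame `df_demo` (not part of the original data) that does contain an all-NaN row. It also shows the `thresh` parameter of `dropna`, which keeps only rows with at least that many non-missing values.

```python
import numpy as np
import pandas as pd

# Hypothetical frame for illustration: row 1 is entirely NaN, row 3 is complete
df_demo = pd.DataFrame([[1,      np.nan, 2],
                        [np.nan, np.nan, np.nan],
                        [np.nan, 4,      6],
                        [7,      8,      9]])

# how='all' drops only the rows in which every value is missing (row 1)
dropped_all = df_demo.dropna(how='all')

# thresh=3 keeps only the rows with at least 3 non-missing values (row 3)
dropped_thresh = df_demo.dropna(thresh=3)

print(dropped_all)
print(dropped_thresh)
```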


**Fill NA**

*With certain value*

Filling with a constant is a poor choice for linear models: the model can no longer tell a filled-in value from a real one, so the information that a gap was present is lost.

In [11]:
df[0].fillna(0)

0    1.0
1    2.0
2    0.0
Name: 0, dtype: float64

In [26]:
df[0].fillna(df[0].mean())  # or median(), max(), min(), etc.

0    1.0
1    2.0
2    1.5
Name: 0, dtype: float64
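The same idea can be applied to the whole frame at once. A minimal sketch, assuming the same `df` as above: passing the Series returned by `df.mean()` to `fillna` fills each column with its own mean.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([[1,      np.nan, 2],
                   [2,      3,      5],
                   [np.nan, 4,      6]])

# df.mean() returns one mean per column (NaN values are skipped);
# fillna matches those means to the columns by label
filled = df.fillna(df.mean())
print(filled)
```

Swapping `df.mean()` for `df.median()` gives per-column median filling in the same way.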

*Forward-fill* - propagate the previous value forward

In [29]:
# down a column (fillna(method='ffill') is deprecated in recent pandas)
df[0].ffill()

0    1.0
1    2.0
2    2.0
Name: 0, dtype: float64

In [30]:
# along each row
df.ffill(axis=1)

Unnamed: 0,0,1,2
0,1.0,1.0,2.0
1,2.0,3.0,5.0
2,,4.0,6.0


*Back-fill* - propagate the next value backward

In [24]:
df[1].bfill()

0    3.0
1    3.0
2    4.0
Name: 1, dtype: float64

**Imputer**

In [37]:
from sklearn.impute import SimpleImputer

# strategy can also be 'median' or 'most_frequent'
imputer = SimpleImputer(missing_values=np.nan, strategy='mean')
imputed_data = imputer.fit_transform(df)  # learn column means, then fill the gaps
imputed_data

array([[1. , 3.5, 2. ],
       [2. , 3. , 5. ],
       [1.5, 4. , 6. ]])

In [39]:
# fill every gap with the same constant value
imputer = SimpleImputer(missing_values=np.nan, strategy='constant', fill_value=0)
imputed_data = imputer.fit_transform(df)
imputed_data

array([[1., 0., 2.],
       [2., 3., 5.],
       [0., 4., 6.]])
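One practical reason to use an imputer object rather than `fillna` is that the statistics learned on one dataset can be reused on another. A minimal sketch, where the arrays `X_train` and `X_test` are hypothetical and only serve as illustration: the column means are learned from the training data and then applied to the test data.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical train/test data for illustration
X_train = np.array([[1.0,    np.nan],
                    [2.0,    3.0],
                    [np.nan, 4.0]])
X_test = np.array([[np.nan, 10.0],
                   [5.0,    np.nan]])

imputer = SimpleImputer(strategy='mean')
imputer.fit(X_train)                        # learn the column means from the training data only
X_test_imputed = imputer.transform(X_test)  # fill test gaps with the training means

print(imputer.statistics_)  # the learned per-column means
print(X_test_imputed)
```

Fitting on the training data only keeps information from the test set out of the preprocessing step.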

**Create indicator**

Sometimes the presence of a gap is itself informative. In that case, you can create an indicator feature that marks where values were missing.

In [9]:
np.where(df[0].isnull(), 1, 0)

array([0, 0, 1])
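Imputation and indicator creation can also be combined in one step. A minimal sketch, assuming the same `df` as above and scikit-learn's `add_indicator` option of `SimpleImputer`: the output contains the imputed columns followed by one 0/1 indicator column for each feature that had missing values during `fit`.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

df = pd.DataFrame([[1,      np.nan, 2],
                   [2,      3,      5],
                   [np.nan, 4,      6]])

# add_indicator=True appends missing-value indicator columns to the imputed data
imputer = SimpleImputer(strategy='mean', add_indicator=True)
with_flags = imputer.fit_transform(df.to_numpy())

# Columns 0-2: imputed values; columns 3-4: indicators for original columns 0 and 1
print(with_flags)
```

Column 2 has no gaps, so it gets no indicator column here.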

**Use another algorithm to predict gap values**
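
One possible option (an assumption here, the text does not name a specific algorithm) is scikit-learn's `KNNImputer`, which fills each gap from the most similar rows instead of a global column statistic. A minimal sketch on the same `df`; `n_neighbors=1` is chosen only because the frame has three rows.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

df = pd.DataFrame([[1,      np.nan, 2],
                   [2,      3,      5],
                   [np.nan, 4,      6]])

# Each missing value is taken from the nearest row that has that column observed;
# distances are computed on the columns both rows have in common.
# n_neighbors=1 is an illustrative choice for this tiny frame.
knn_imputer = KNNImputer(n_neighbors=1)
knn_imputed = knn_imputer.fit_transform(df.to_numpy())

print(knn_imputed)
```

On real data, a larger `n_neighbors` averages over several similar rows, which is usually less sensitive to noise.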