### Missing data
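Missing data occurs when values are absent for one or more features (columns) in a dataset. In pandas, missing numeric values are represented as NaN ("not a number"). Many machine learning algorithms cannot handle NaN and will raise errors or produce unreliable results unless it is dealt with first.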

**Missing data can negatively impact:**
- Data visualisation
- Arithmetic computations
- Machine learning algorithms
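
As a small illustration of the arithmetic point, the sketch below shows how a single NaN affects sums in NumPy and pandas. The array `values` and all variable names are made up for this example.

```python
import numpy as np
import pandas as pd

# Hypothetical toy values with one missing entry.
values = np.array([1.0, 2.0, np.nan, 4.0])

# NumPy propagates NaN: a plain sum over the array is NaN.
numpy_sum = values.sum()
# The NaN-aware variant ignores the missing entry and sums 1 + 2 + 4.
numpy_nansum = np.nansum(values)

# pandas reductions skip NaN by default (skipna=True).
series = pd.Series(values)
pandas_sum = series.sum()
# With skipna=False the NaN propagates, as in NumPy.
pandas_strict_sum = series.sum(skipna=False)

# NaN never compares equal, not even to itself, so use isna() to detect it.
nan_equals_itself = np.nan == np.nan
missing_mask = series.isna()
```

Because pandas silently skips NaN in most reductions, a summary statistic can look plausible even when a large share of the column is missing, which is one reason to check for nulls explicitly.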

**Common methods to deal with missing data:**
- Remove rows or columns containing missing data
- Impute with mean or median
- Impute with mode (most frequently occurring value)
- Impute with forward or backward fill
- Interpolate data between two points

*Note: Domain knowledge is often needed to decide how to fill nulls.*
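
The imputation methods listed above can be sketched with standard pandas calls. The DataFrame `toy` and its columns `temp` and `city` are hypothetical and are not the data used in the cells below. Which method is appropriate depends on the data.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, for illustration only.
toy = pd.DataFrame({
    "temp": [20.0, np.nan, 24.0, np.nan, 30.0],
    "city": ["Oslo", "Oslo", None, "Bergen", "Oslo"],
})

# Mean / median imputation for a numeric column.
mean_filled = toy["temp"].fillna(toy["temp"].mean())
median_filled = toy["temp"].fillna(toy["temp"].median())

# Mode imputation for a categorical column; mode() returns a Series,
# so take its first entry.
mode_filled = toy["city"].fillna(toy["city"].mode()[0])

# Forward fill copies the last valid value down; backward fill copies
# the next valid value up.
forward_filled = toy["temp"].ffill()
backward_filled = toy["temp"].bfill()

# Linear interpolation between neighbouring valid points (the default method).
interpolated = toy["temp"].interpolate()
```

Forward fill, backward fill and interpolation assume the row order is meaningful (for example a time series), whereas mean, median and mode imputation ignore order entirely.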

In [2]:
import pandas as pd
import numpy as np
import seaborn as sns

In [20]:
# 10x4 frame of standard-normal draws (no seed is set, so values differ on each run)
df = pd.DataFrame(np.random.randn(10, 4), columns=["A", "B", "C", "D"])

In [21]:
# Keep only positive values; every other cell becomes NaN
df = df[df > 0]

In [22]:
df

Unnamed: 0,A,B,C,D
0,0.230238,0.459924,1.185044,
1,,,,
2,,,,0.164301
3,0.096313,,,1.925863
4,1.261786,2.420444,0.007489,0.729503
5,0.095576,1.353026,0.463694,1.248931
6,,,0.017988,0.309169
7,,,1.007704,0.649609
8,0.556035,1.073602,1.967679,0.139063
9,,0.58285,,


In [23]:
# Work on a copy so that df itself stays intact
copy = df.copy()

In [26]:
# Drop column A so the examples work on three columns (B, C, D)
copy.drop(columns="A", inplace=True)

## dropna
Removes rows (default) or columns that contain missing values.

**Parameters**
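- **how** = "any" (default) drops a row if any value is missing; "all" drops it only if every value is missing
- **thresh** = minimum number of non-missing values a row (or column) must contain in order to *not* be dropped
- **subset** = labels to check for NaN: column labels when dropping rows, row labels when dropping columns
- **axis** = "index" (default) drops rows; "columns" drops columns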
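
A compact sketch of these parameters on a small, deterministic frame may help before looking at the random data. The frame `toy` and the result names are hypothetical. Note that `dropna` returns a new DataFrame, so the result has to be assigned if it should be kept.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame: row 0 has one NaN, row 1 is fully missing, row 2 is complete.
toy = pd.DataFrame({
    "x": [1.0, np.nan, 3.0],
    "y": [np.nan, np.nan, 6.0],
})

complete_rows = toy.dropna()                # how="any": only row 2 survives
not_all_missing = toy.dropna(how="all")     # drops row 1, the only fully empty row
at_least_one = toy.dropna(thresh=1)         # keeps rows with >= 1 non-missing value
x_present = toy.dropna(subset=["x"])        # only NaN in column x counts
full_columns = toy.dropna(axis="columns")   # both columns contain NaN, so none remain

# The original frame is unchanged; assign the result to keep it.
toy_clean = toy.dropna()
```

Here `thresh=1` and `how="all"` keep the same rows, since requiring at least one non-missing value is the same as rejecting rows where everything is missing.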

In [27]:
copy

Unnamed: 0,B,C,D
0,0.459924,1.185044,
1,,,
2,,,0.164301
3,,,1.925863
4,2.420444,0.007489,0.729503
5,1.353026,0.463694,1.248931
6,,0.017988,0.309169
7,,1.007704,0.649609
8,1.073602,1.967679,0.139063
9,0.58285,,


In [28]:
copy.isna()

Unnamed: 0,B,C,D
0,False,False,True
1,True,True,True
2,True,True,False
3,True,True,False
4,False,False,False
5,False,False,False
6,True,False,False
7,True,False,False
8,False,False,False
9,False,True,True
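
The boolean mask from `isna()` can be summed to count missing values, which is useful when choosing a value for `thresh`. The sketch below rebuilds the B, C, D data from the printed output above under the assumed name `frame` (the notebook itself calls it `copy`). The other variable names are also made up for this example.

```python
import numpy as np
import pandas as pd

nan = np.nan
# Values copied from the printed output above; `frame` is an assumed name.
frame = pd.DataFrame({
    "B": [0.459924, nan, nan, nan, 2.420444, 1.353026, nan, nan, 1.073602, 0.58285],
    "C": [1.185044, nan, nan, nan, 0.007489, 0.463694, 0.017988, 1.007704, 1.967679, nan],
    "D": [nan, nan, 0.164301, 1.925863, 0.729503, 1.248931, 0.309169, 0.649609, 0.139063, nan],
})

# True counts as 1, so summing the mask counts missing values per column.
missing_per_column = frame.isna().sum()
# The mean of the mask is the fraction of missing values per column.
missing_share = frame.isna().mean()
# Non-missing values per row: the count that thresh is compared against.
present_per_row = frame.notna().sum(axis=1)

# Rows with at least two non-missing values, equivalent to dropna(thresh=2).
kept = frame[present_per_row >= 2]
```

For this data the per-column missing counts are 5 for B, 4 for C and 3 for D.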


In [29]:
copy.dropna()

Unnamed: 0,B,C,D
4,2.420444,0.007489,0.729503
5,1.353026,0.463694,1.248931
8,1.073602,1.967679,0.139063


In [30]:
copy

Unnamed: 0,B,C,D
0,0.459924,1.185044,
1,,,
2,,,0.164301
3,,,1.925863
4,2.420444,0.007489,0.729503
5,1.353026,0.463694,1.248931
6,,0.017988,0.309169
7,,1.007704,0.649609
8,1.073602,1.967679,0.139063
9,0.58285,,


In [33]:
copy.dropna(how="all")

Unnamed: 0,B,C,D
0,0.459924,1.185044,
2,,,0.164301
3,,,1.925863
4,2.420444,0.007489,0.729503
5,1.353026,0.463694,1.248931
6,,0.017988,0.309169
7,,1.007704,0.649609
8,1.073602,1.967679,0.139063
9,0.58285,,


In [34]:
copy.dropna(thresh=2)

Unnamed: 0,B,C,D
0,0.459924,1.185044,
4,2.420444,0.007489,0.729503
5,1.353026,0.463694,1.248931
6,,0.017988,0.309169
7,,1.007704,0.649609
8,1.073602,1.967679,0.139063


In [36]:
copy.dropna(subset="B")

Unnamed: 0,B,C,D
0,0.459924,1.185044,
4,2.420444,0.007489,0.729503
5,1.353026,0.463694,1.248931
8,1.073602,1.967679,0.139063
9,0.58285,,


In [37]:
copy.dropna(subset=["B", "C"])

Unnamed: 0,B,C,D
0,0.459924,1.185044,
4,2.420444,0.007489,0.729503
5,1.353026,0.463694,1.248931
8,1.073602,1.967679,0.139063


In [38]:
copy.dropna(axis="columns")

0
1
2
3
4
5
6
7
8
9


In [40]:
copy.dropna(axis="columns", thresh=6)

Unnamed: 0,C,D
0,1.185044,
1,,
2,,0.164301
3,,1.925863
4,0.007489,0.729503
5,0.463694,1.248931
6,0.017988,0.309169
7,1.007704,0.649609
8,1.967679,0.139063
9,,


In [41]:
copy.dropna(axis="columns", subset=2)

Unnamed: 0,D
0,
1,
2,0.164301
3,1.925863
4,0.729503
5,1.248931
6,0.309169
7,0.649609
8,0.139063
9,
