<a href='https://www.hexnbit.com/'> <img src='hexnbit.png'/> </a>

# Missing Data
In real-world applications, data is often missing. This notebook covers two techniques for
handling missing data:
 - Drop Missing Data
 - Fill Missing Data

In [1]:
import pandas as pd
import numpy as np

## Creating Sample Data

In [2]:
np.random.seed(10)   # fix the seed so the same random values are generated on every run;
                     # changing or removing the seed changes the generated values
data=np.random.rand(25).reshape(5,5)   # creating 5 x 5 numpy array of random numbers
print(data)

[[0.77132064 0.02075195 0.63364823 0.74880388 0.49850701]
 [0.22479665 0.19806286 0.76053071 0.16911084 0.08833981]
 [0.68535982 0.95339335 0.00394827 0.51219226 0.81262096]
 [0.61252607 0.72175532 0.29187607 0.91777412 0.71457578]
 [0.54254437 0.14217005 0.37334076 0.67413362 0.44183317]]


In [3]:
# creating Data Frame from the array created in the above cell
frame=pd.DataFrame(data,index="R1 R2 R3 R4 R5".split(),columns="C1 C2 C3 C4 C5".split())

In [4]:
print(frame)

          C1        C2        C3        C4        C5
R1  0.771321  0.020752  0.633648  0.748804  0.498507
R2  0.224797  0.198063  0.760531  0.169111  0.088340
R3  0.685360  0.953393  0.003948  0.512192  0.812621
R4  0.612526  0.721755  0.291876  0.917774  0.714576
R5  0.542544  0.142170  0.373341  0.674134  0.441833


In [5]:
frame1=frame[frame>0.2]  # keep values greater than 0.2; all other values are replaced with NaN

In [6]:
print(frame1)   # printing data which contains missing values (NaN)

          C1        C2        C3        C4        C5
R1  0.771321       NaN  0.633648  0.748804  0.498507
R2  0.224797       NaN  0.760531       NaN       NaN
R3  0.685360  0.953393       NaN  0.512192  0.812621
R4  0.612526  0.721755  0.291876  0.917774  0.714576
R5  0.542544       NaN  0.373341  0.674134  0.441833
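
Indexing a DataFrame with a boolean DataFrame, as done above, keeps the shape and puts NaN wherever the condition is False. The sketch below uses a small hypothetical frame (`demo` and its values are not part of this notebook) to show that this matches `DataFrame.where`, and how the missing values can be counted per column.

```python
import numpy as np
import pandas as pd

# Small hypothetical frame, independent of the notebook's random data
demo = pd.DataFrame({"A": [0.1, 0.5, 0.9], "B": [0.3, 0.05, 0.7]})

masked = demo[demo > 0.2]        # values that fail the condition become NaN
same = demo.where(demo > 0.2)    # explicit spelling of the same operation

nan_per_column = masked.isna().sum()   # number of NaNs in each column
print(masked)
print(nan_per_column)
```

Counting NaNs this way before dropping or filling gives a quick idea of how much data each approach would affect.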


## Drop Missing Data
`dropna` returns a new DataFrame and leaves the original unchanged. To keep the result, assign it back or pass `inplace=True`.
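
A minimal sketch of that behaviour, using a hypothetical frame named `demo` rather than `frame1`:

```python
import numpy as np
import pandas as pd

# Hypothetical two-column frame with one missing value
demo = pd.DataFrame({"A": [1.0, np.nan, 3.0], "B": [4.0, 5.0, 6.0]})

cleaned = demo.dropna()          # new DataFrame without the NaN row
rows_after_call = len(demo)      # demo itself still has all 3 rows

demo = demo.dropna()             # assigning back keeps the result
# demo.dropna(inplace=True) has the same effect on demo and returns None
print(rows_after_call, len(demo))
```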

In [7]:
frame1.dropna() # Dropping all rows containing NaNs

          C1        C2        C3        C4        C5
R4  0.612526  0.721755  0.291876  0.917774  0.714576


In [8]:
frame1.dropna(axis=1)  # dropping all columns containing NaNs

          C1
R1  0.771321
R2  0.224797
R3  0.685360
R4  0.612526
R5  0.542544


In [9]:
frame1.dropna(thresh=3)   # dropping rows that have fewer than 3 non-NaN values

          C1        C2        C3        C4        C5
R1  0.771321       NaN  0.633648  0.748804  0.498507
R3  0.685360  0.953393       NaN  0.512192  0.812621
R4  0.612526  0.721755  0.291876  0.917774  0.714576
R5  0.542544       NaN  0.373341  0.674134  0.441833


In [10]:
frame1.dropna(axis=1,thresh=3)   # dropping columns that have fewer than 3 non-NaN values

          C1        C3        C4        C5
R1  0.771321  0.633648  0.748804  0.498507
R2  0.224797  0.760531       NaN       NaN
R3  0.685360       NaN  0.512192  0.812621
R4  0.612526  0.291876  0.917774  0.714576
R5  0.542544  0.373341  0.674134  0.441833
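
The `thresh` argument is the minimum number of non-NaN values a row (or column, with `axis=1`) needs in order to be kept. As a rough check, the sketch below compares `dropna(thresh=2)` with a manual filter on a hypothetical frame; `demo`, `kept`, and `manual` are illustrative names only.

```python
import numpy as np
import pandas as pd

# Hypothetical frame: r1 has 3 valid values, r2 has 2, r3 has none
demo = pd.DataFrame(
    {"A": [1.0, np.nan, np.nan], "B": [2.0, 5.0, np.nan], "C": [3.0, 6.0, np.nan]},
    index=["r1", "r2", "r3"],
)

kept = demo.dropna(thresh=2)                    # keep rows with at least 2 non-NaN values
manual = demo[demo.notna().sum(axis=1) >= 2]    # same rule written out by hand

print(kept)
```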


## Fill Missing Data
`fillna` returns a new object and leaves the original unchanged. To keep the result, assign it back or pass `inplace=True`.

In [11]:
print(frame1)

          C1        C2        C3        C4        C5
R1  0.771321       NaN  0.633648  0.748804  0.498507
R2  0.224797       NaN  0.760531       NaN       NaN
R3  0.685360  0.953393       NaN  0.512192  0.812621
R4  0.612526  0.721755  0.291876  0.917774  0.714576
R5  0.542544       NaN  0.373341  0.674134  0.441833


In [12]:
frame1.fillna(value=999)   # filling all NaNs with value 999.

          C1          C2          C3          C4          C5
R1  0.771321  999.000000    0.633648    0.748804    0.498507
R2  0.224797  999.000000    0.760531  999.000000  999.000000
R3  0.685360    0.953393  999.000000    0.512192    0.812621
R4  0.612526    0.721755    0.291876    0.917774    0.714576
R5  0.542544  999.000000    0.373341    0.674134    0.441833


In [13]:
frame1.loc["R2"].fillna(value=999)   # filling NaNs in row R2 only; returns a new Series

C1      0.224797
C2    999.000000
C3      0.760531
C4    999.000000
C5    999.000000
Name: R2, dtype: float64

In [14]:
frame1["C2"].fillna(value=999)   # filling NaNs in column C2 only; returns a new Series

R1    999.000000
R2    999.000000
R3      0.953393
R4      0.721755
R5    999.000000
Name: C2, dtype: float64
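
The row and column fills above return new Series, so the DataFrame they were taken from still contains its NaNs. One way to store the result is to assign it back, sketched here on a hypothetical frame called `demo`:

```python
import numpy as np
import pandas as pd

# Hypothetical frame with one NaN in each column
demo = pd.DataFrame({"C1": [1.0, np.nan], "C2": [np.nan, 4.0]}, index=["R1", "R2"])

filled_col = demo["C2"].fillna(999)        # new Series; demo is unchanged
still_missing = demo["C2"].isna().sum()    # the NaN in demo["C2"] is still there

demo["C2"] = demo["C2"].fillna(999)            # write the filled column back
demo.loc["R2"] = demo.loc["R2"].fillna(999)    # write the filled row back
print(demo)
```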

In [15]:
# filling missing data of C3 with the mean of the non-missing elements of C3
frame1["C3"]=[1,1,2,np.nan,2]   # replacing the values of C3 with simple numbers so that the mean is easy to verify
frame1["C3"].fillna(value=frame1["C3"].mean())   # mean() skips NaN, so the mean here is (1+1+2+2)/4 = 1.5

R1    1.0
R2    1.0
R3    2.0
R4    1.5
R5    2.0
Name: C3, dtype: float64
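
The same idea extends to a whole DataFrame: passing the Series of column means to `fillna` fills each column with its own mean. A small sketch on hypothetical data (`demo` and its values are assumptions for illustration):

```python
import numpy as np
import pandas as pd

# Hypothetical frame with one NaN per column
demo = pd.DataFrame({"X": [1.0, np.nan, 3.0], "Y": [np.nan, 10.0, 20.0]})

column_means = demo.mean()           # per-column means, NaNs are skipped
filled = demo.fillna(column_means)   # each NaN gets the mean of its own column
print(filled)
```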