Missing data is a recurring problem in real-world work. Fields such as machine learning and data mining can lose considerable prediction accuracy when data quality suffers from missing values. Handling missing values is therefore a major focus in these fields, because it makes models more accurate and valid.
<br>
# When and Why Is Data Missing?
Consider an online survey for a product. People often do not share all the information related to them. Some share their experience but not how long they have been using the product; others share how long they have been using the product and their experience, but not their contact information. In one way or another, part of the data is always missing, and this is very common in practice.
<br>
<br>
Let us now see how we can handle missing values (say NA or NaN) using Pandas.

In [6]:
# import the pandas library
import pandas as pd
import numpy as np

df = pd.DataFrame(np.random.randn(5, 3), index=['a', 'c', 'e', 'f',
'h'],columns=['one', 'two', 'three'])
print(df)


        one       two     three
a -0.069095 -0.225319  0.814670
c -0.634259  0.266647  1.827144
e -0.950299 -1.230033  1.014528
f  1.254108 -0.267129 -1.908693
h -0.153163 -0.276276  1.567372


In [8]:
df = df.reindex(['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h'])  # reindex the rows; new labels get NaN
print(df)

        one       two     three
a -0.069095 -0.225319  0.814670
b       NaN       NaN       NaN
c -0.634259  0.266647  1.827144
d       NaN       NaN       NaN
e -0.950299 -1.230033  1.014528
f  1.254108 -0.267129 -1.908693
g       NaN       NaN       NaN
h -0.153163 -0.276276  1.567372


Using reindexing, we have created a DataFrame with missing values. In the output, NaN means Not a Number.
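NaN behaves differently from ordinary values, which is why the dedicated checks in the next section exist. The following is a minimal sketch with made-up values (the Series `s` is purely illustrative): NaN is a float that never compares equal to anything, so an equality test cannot find it.

```python
import numpy as np
import pandas as pd

nan = np.nan

# NaN is a float, and it never compares equal to anything, including itself.
is_float = isinstance(nan, float)
equals_itself = (nan == nan)

# An equality test therefore finds nothing, while isnull() flags the gap.
s = pd.Series([1.0, nan, 3.0], index=['a', 'b', 'c'])  # illustrative data
found_by_equality = int((s == nan).sum())
found_by_isnull = int(s.isnull().sum())

print(is_float, equals_itself)
print(found_by_equality, found_by_isnull)
```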

## 1. Check for Missing Values

To make detecting missing values easier (and consistent across different array dtypes), Pandas provides the isnull() and notnull() functions, which are also methods on Series and DataFrame objects:

Example 1:


In [10]:
import pandas as pd
import numpy as np
 
df = pd.DataFrame(np.random.randn(5, 3), index=['a', 'c', 'e', 'f',
'h'],columns=['one', 'two', 'three'])

df = df.reindex(['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h'])

print(df['one'].isnull())

a    False
b     True
c    False
d     True
e    False
f    False
g     True
h    False
Name: one, dtype: bool
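The example above covers isnull() only. As a small sketch of the complementary notnull() method, the code below uses fixed, made-up values (not the random frame from the text) so that the result is reproducible. It also shows two common follow-ups: counting missing entries per column and keeping only the rows that have a value.

```python
import numpy as np
import pandas as pd

# Fixed values (illustrative, not from the text) keep the result reproducible.
df = pd.DataFrame(
    {'one': [1.0, 2.0, 3.0], 'two': [4.0, 5.0, 6.0]},
    index=['a', 'c', 'e'],
).reindex(['a', 'b', 'c', 'd', 'e'])

# notnull() is the element-wise inverse of isnull().
present = df['one'].notnull()

# Summing the boolean mask counts the missing entries in each column.
missing_per_column = df.isnull().sum()

# Boolean indexing keeps only the rows where 'one' has a value.
rows_with_data = df[present]

print(present.tolist())
print(missing_per_column.to_dict())
print(rows_with_data.index.tolist())
```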


## 2. Cleaning / Filling Missing Data
Pandas provides several methods for cleaning missing values. The fillna function can “fill in” NA values with non-null data in a couple of ways, illustrated in the following sections.

### 2.1 Replace NaN with a Scalar Value
The following program shows how you can replace "NaN" with "0".

In [15]:
import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.randn(3, 3), index=['a', 'c', 'e'],columns=['one',
'two', 'three'])
df = df.reindex(['a', 'b', 'c'])
print(df)
print("NaN replaced with '0':")
df.fillna(0)

        one       two     three
a -0.741098 -1.039918 -0.459304
b       NaN       NaN       NaN
c  0.700029  0.281674 -1.335094
NaN replaced with '0':


        one       two     three
a -0.741098 -1.039918 -0.459304
b  0.000000  0.000000  0.000000
c  0.700029  0.281674 -1.335094
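Two details of fillna are easy to miss, and the sketch below illustrates them on a small made-up frame (the values and variable names are assumptions for illustration). First, fillna returns a new DataFrame and leaves the original untouched, so the result must be assigned to keep it. Second, a dict can be passed instead of a scalar to give each column its own replacement, here the column mean.

```python
import numpy as np
import pandas as pd

# Illustrative data with one gap per column.
df = pd.DataFrame(
    {'one': [1.0, np.nan, 3.0], 'two': [np.nan, 5.0, 6.0]},
    index=['a', 'b', 'c'],
)

# fillna returns a new DataFrame; the original keeps its NaN values.
filled = df.fillna(0)
still_missing = int(df.isnull().sum().sum())

# A dict fills each column with its own value (here, the column mean,
# which pandas computes while skipping NaN by default).
filled_by_column = df.fillna({'one': df['one'].mean(), 'two': df['two'].mean()})

print(still_missing)
print(filled.loc['b', 'one'], filled_by_column.loc['b', 'one'])
```

Whether 0, the mean, or some other value is an appropriate replacement depends on the data; the mean is used here only as one common choice.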


### 2.2 Fill NA Forward and Backward
Using the fill concepts discussed in the ReIndexing chapter, we will fill the missing values. <br>
Methods: <br>
pad/ffill: fill values forward <br>
bfill/backfill: fill values backward

Example 1:


In [22]:
import pandas as pd
import numpy as np

df = pd.DataFrame(np.random.randn(5, 3), index=['a', 'c', 'e', 'f',
'h'],columns=['one', 'two', 'three'])
df = df.reindex(['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h'])
df

        one       two     three
a  0.483503 -0.515016 -0.387263
b       NaN       NaN       NaN
c -0.282185 -0.102253  0.820617
d       NaN       NaN       NaN
e  0.112986  1.412236  0.266292
f  0.114866  0.487706  0.812630
g       NaN       NaN       NaN
h  1.321320  0.401935 -1.973890


In [21]:
df.fillna(method='pad')  # forward fill; newer pandas versions deprecate method= in favor of df.ffill()

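        one       two     three
a  1.438613 -0.598262 -1.034130
b  1.438613 -0.598262 -1.034130
c -0.770954  0.585604 -0.488252
d -0.770954  0.585604 -0.488252
e  0.463285  1.107600  0.654969
f  0.174185 -1.696309  0.132577
g  0.174185 -1.696309  0.132577
h  3.060544 -0.161154  0.358404

These numbers come from a different random draw than the frame printed before this cell, so only the fill pattern carries over: each missing row (b, d, g) repeats the row above it.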
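The section lists a backward fill but only demonstrates the forward one. As a minimal sketch on fixed, made-up values (so the result is reproducible), the code below runs both directions with the ffill() and bfill() methods. The trailing label 'f' is an assumption added to show an edge case: a backward fill has nothing to copy from after the last row.

```python
import numpy as np
import pandas as pd

# Illustrative single-column frame: a=1, c=2, e=3, with gaps at b, d, f.
df = pd.DataFrame(
    {'one': [1.0, 2.0, 3.0]},
    index=['a', 'c', 'e'],
).reindex(['a', 'b', 'c', 'd', 'e', 'f'])

# Forward fill copies the last valid value downward.
forward = df.ffill()

# Backward fill copies the next valid value upward; the last row has
# no later value to borrow, so it stays NaN.
backward = df.bfill()

print(forward['one'].tolist())
print(backward['one'].tolist())
```

A leading gap behaves the same way under a forward fill, so a frame can still contain NaN after either operation.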


### 2.3 Drop Missing Values

If you simply want to exclude the missing values, use the ```dropna``` function along with the axis argument. By default, ```axis=0``` (along rows), which means that if any value within a row is NA, the whole row is excluded.

In [23]:
import pandas as pd
import numpy as np

df = pd.DataFrame(np.random.randn(5, 3), index=['a', 'c', 'e', 'f',
'h'],columns=['one', 'two', 'three'])

df = df.reindex(['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h'])
df.dropna()

        one       two     three
a -0.524240 -0.106547  1.280218
c  0.640365 -1.114495 -1.156275
e  2.208189 -0.365424 -0.933192
f -0.029982  0.856086 -0.483330
h -1.300845 -0.104492 -1.187474
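The default drops a row if any value in it is missing. The sketch below, on a small made-up frame, compares that default with two variations of the same method: the how='all' option, which drops a row only when every value is missing, and axis=1, which drops columns instead of rows. The data and variable names are illustrative assumptions.

```python
import numpy as np
import pandas as pd

# Row b is entirely missing; row c is missing only in column 'two'.
df = pd.DataFrame(
    {'one': [1.0, np.nan, 3.0], 'two': [4.0, np.nan, np.nan]},
    index=['a', 'b', 'c'],
)

# Default (axis=0, how='any'): drop every row that has at least one NaN.
rows_any = df.dropna()

# how='all': drop a row only if all of its values are NaN.
rows_all = df.dropna(how='all')

# axis=1: drop columns that contain any NaN. Row b makes every column
# qualify, so no columns survive.
cols_any = df.dropna(axis=1)

# Removing the empty row first leaves 'one' as the only complete column.
cols_after_rows = df.dropna(how='all').dropna(axis=1)

print(rows_any.index.tolist())
print(rows_all.index.tolist())
print(cols_any.columns.tolist())
print(cols_after_rows.columns.tolist())
```

The empty result for axis=1 shows why it helps to check how much data a drop removes before relying on it.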
