# Missing Data: np.nan and None

Missing data is a recurring problem in real-world datasets. In areas such as machine learning and data mining, missing values lower data quality and can severely reduce the accuracy of model predictions. Treating missing values is therefore a major step toward making models accurate and valid.

# When and Why Is Data Missing?
Consider an online survey for a product. People often do not share all the information that is asked of them. Some share their experience but not how long they have been using the product; others share their experience and how long they have used the product, but not their contact information. One way or another, part of the data is almost always missing, and this is very common in real-world data.

### Creating Missing Values - using np.nan

In [3]:
# import the pandas library
import pandas as pd
import numpy as np

# Sample DataFrame with missing values
data = {'First Score': [100, 90, np.nan, 95],
        'Second Score': [30, 45, 56, np.nan],
        'Third Score': [np.nan, 40, 80, 98]}

df = pd.DataFrame(data)

df


Unnamed: 0,First Score,Second Score,Third Score
0,100.0,30.0,
1,90.0,45.0,40.0
2,,56.0,80.0
3,95.0,,98.0


### Check for Missing Values
To make detecting missing values easier (and consistent across different array dtypes), pandas provides the isnull() and notnull() functions, which are also available as methods on Series and DataFrame objects:

In [5]:
df.isnull()

Unnamed: 0,First Score,Second Score,Third Score
0,False,False,True
1,False,False,False
2,True,False,False
3,False,True,False


In [7]:
df.isnull().sum()

First Score     1
Second Score    1
Third Score     1
dtype: int64

In [9]:
df.isnull().sum(axis=1)

0    1
1    0
2    1
3    1
dtype: int64

In [11]:
df.notnull()

Unnamed: 0,First Score,Second Score,Third Score
0,True,True,False
1,True,True,True
2,False,True,True
3,True,False,True
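
The title mentions both np.nan and None, so a small check of how each is detected may help. The sketch below uses two hypothetical Series (`names` and `scores`, not part of the data above); it assumes an explicit `dtype=object` for the text column so that None is stored as-is.

```python
import numpy as np
import pandas as pd

# Object-dtype column: None and np.nan are stored as they are
names = pd.Series(["Ann", None, np.nan, "Dev"], dtype=object)

# Numeric column: None is converted to NaN and the dtype becomes float64
scores = pd.Series([100, None, np.nan, 95])

# isnull() flags both None and np.nan as missing
print(names.isnull().tolist())   # [False, True, True, False]
print(scores.isnull().tolist())  # [False, True, True, False]
print(scores.dtype)              # float64

# NaN never compares equal to anything, so == cannot be used to find it
print((scores == np.nan).any())  # False
```

This is why isnull() and notnull() are preferred over an equality comparison when looking for missing values.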


# Drop Missing Values: dropna()
If you simply want to exclude missing values, use the dropna() function with the axis argument. By default axis=0, which operates on rows: if any value in a row is NA, the whole row is dropped. With axis=1, any column that contains an NA is dropped instead.

In [13]:
print(df.dropna(axis=1))

Empty DataFrame
Columns: []
Index: [0, 1, 2, 3]


In [15]:
print(df.dropna(axis=0))

   First Score  Second Score  Third Score
1         90.0          45.0         40.0


In [17]:
print(df.dropna())

   First Score  Second Score  Third Score
1         90.0          45.0         40.0


In [19]:
df

Unnamed: 0,First Score,Second Score,Third Score
0,100.0,30.0,
1,90.0,45.0,40.0
2,,56.0,80.0
3,95.0,,98.0


In [23]:
df.dropna(how='all',axis=1)

Unnamed: 0,First Score,Second Score,Third Score
0,100.0,30.0,
1,90.0,45.0,40.0
2,,56.0,80.0
3,95.0,,98.0



In [26]:
df = df.dropna(how='all')  # no row is entirely NA here, so nothing is dropped
df


Unnamed: 0,First Score,Second Score,Third Score
0,100.0,30.0,
1,90.0,45.0,40.0
2,,56.0,80.0
3,95.0,,98.0
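
In the example above, how='all' leaves the data unchanged because no row is entirely NA. The following sketch uses a hypothetical table (`scores`, chosen so that each option gives a different result) to compare the default how='any' with how='all', and with the `thresh` and `subset` parameters of dropna().

```python
import numpy as np
import pandas as pd

# Hypothetical scores table: row 1 is entirely missing
scores = pd.DataFrame({
    "First Score": [100, np.nan, 70, np.nan],
    "Second Score": [30, np.nan, np.nan, 56],
    "Third Score": [52, np.nan, np.nan, 98],
})

any_missing = scores.dropna()            # how='any' (default): drop rows with any NA
all_missing = scores.dropna(how="all")   # drop only rows where every value is NA
at_least_two = scores.dropna(thresh=2)   # keep rows with at least 2 non-NA values
need_first = scores.dropna(subset=["First Score"])  # look at one column only

print(any_missing.index.tolist())   # [0]
print(all_missing.index.tolist())   # [0, 2, 3]
print(at_least_two.index.tolist())  # [0, 3]
print(need_first.index.tolist())    # [0, 2]
```

None of these calls modifies `scores`; each returns a new DataFrame.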



In [31]:
# dictionary of lists (both None and np.nan mark missing values);
# named `data` so that the built-in `dict` is not shadowed
data = {'First Score': [100, None, np.nan, 95],
        'Second Score': [30, np.nan, 45, 56],
        'Third Score': [52, None, 80, 98],
        'Fourth Score': [np.nan, np.nan, np.nan, 65]}

# creating a dataframe from dictionary
df = pd.DataFrame(data)

df

Unnamed: 0,First Score,Second Score,Third Score,Fourth Score
0,100.0,30.0,52.0,
1,,,,
2,,45.0,80.0,
3,95.0,56.0,98.0,65.0


In [22]:
df.fillna(0)

Unnamed: 0,First Score,Second Score,Third Score,Fourth Score
0,100.0,30.0,52.0,0.0
1,0.0,0.0,0.0,0.0
2,0.0,45.0,80.0,0.0
3,95.0,56.0,98.0,65.0


# Filling Missing Values: fillna(), replace(), interpolate()

Replacing NA with a scalar value is the basic behavior of the fillna() function.

# fillna()


In [29]:
df

Unnamed: 0,First Score,Second Score,Third Score,Fourth Score
0,100.0,30.0,52.0,
1,,,,
2,,45.0,80.0,
3,95.0,56.0,98.0,65.0


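Fill NA Forward and Backward

1. pad/ffill - fill forward: the last valid value is carried forward into the missing entries that follow it

2. bfill/backfill - fill backward: the next valid value is carried backward into the missing entries that precede it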

In [35]:
df.ffill()  # same result as df.fillna(method='pad'), which is deprecated in recent pandas

Unnamed: 0,First Score,Second Score,Third Score,Fourth Score
0,100.0,30.0,52.0,
1,100.0,30.0,52.0,
2,100.0,45.0,80.0,
3,95.0,56.0,98.0,65.0


In [37]:
df.bfill()  # same result as df.fillna(method='backfill'), which is deprecated in recent pandas

Unnamed: 0,First Score,Second Score,Third Score,Fourth Score
0,100.0,30.0,52.0,65.0
1,95.0,45.0,80.0,65.0
2,95.0,45.0,80.0,65.0
3,95.0,56.0,98.0,65.0
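
Beyond a single scalar, fillna() also accepts a different fill value per column. The sketch below is illustrative only: the `scores` table and the fill values 0 and -1 are assumptions, not part of the data above. It also shows the `limit` parameter of ffill(), which caps how many consecutive NA values are filled.

```python
import numpy as np
import pandas as pd

# Hypothetical two-column table with missing values
scores = pd.DataFrame({
    "First Score": [100, np.nan, np.nan, 95],
    "Second Score": [30, np.nan, 45, 56],
})

# A dict maps each column name to its own fill value
per_column = scores.fillna({"First Score": 0, "Second Score": -1})

# A Series (here the column means) is matched to columns by name
mean_filled = scores.fillna(scores.mean())

# limit=1 forward-fills at most one consecutive NA per gap
limited = scores.ffill(limit=1)

print(per_column)
print(mean_filled)
print(limited)
```

With limit=1, the second NA in "First Score" stays missing because only the first NA after a valid value is filled.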


# replace()

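Replace Missing (or) Generic Values

Often we need to replace a generic value with a specific one, for example a placeholder that stands in for missing data. The replace() method does this. Given a dict, it maps each old value to its replacement wherever that value occurs in the DataFrame.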


In [8]:
import pandas as pd
import numpy as np

df = pd.DataFrame({'one':[10,20,30,40,50,2000], 'two':[1000,0,30,40,50,60]})

df

Unnamed: 0,one,two
0,10,1000
1,20,0
2,30,30
3,40,40
4,50,50
5,2000,60


In [9]:
print(df.replace({1000: 'cat', 2000: 'bat', 50: 'fifty'}))

     one    two
0     10    cat
1     20      0
2     30     30
3     40     40
4  fifty  fifty
5    bat     60
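
replace() can also turn placeholder values into real missing values, so that isnull(), dropna() and fillna() can then handle them. The sketch below assumes a hypothetical `readings` table in which -999 and "n/a" were used as stand-ins for missing data; the dict form restricts each placeholder to its own column.

```python
import numpy as np
import pandas as pd

# Hypothetical readings where -999 and "n/a" stand in for missing data
readings = pd.DataFrame({
    "temp": [21.5, -999, 19.0],
    "status": ["ok", "n/a", "ok"],
})

# Look for -999 only in "temp" and "n/a" only in "status",
# and replace both with NaN
cleaned = readings.replace({"temp": -999, "status": "n/a"}, np.nan)

print(cleaned.isnull().sum())
```

After the replacement, each column reports one missing value, and the original `readings` table is left unchanged.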


# interpolate()

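interpolate() fills missing values by estimating them from neighboring values, using one of several interpolation techniques, rather than hard-coding a replacement. With method='linear', a missing value is placed on the straight line between the nearest valid values before and after it.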

In [40]:
import pandas as pd 
    
# Creating the dataframe  
df = pd.DataFrame({"A":[12, 4, 5, None, 1], 
                   "B":[None, 2, 54, 3, None], 
                   "C":[20, 16, None, 3, 8], 
                   "D":[14, 3, None, None, 6]}) 
    
# Print the dataframe 
df 

Unnamed: 0,A,B,C,D
0,12.0,,20.0,14.0
1,4.0,2.0,16.0,3.0
2,5.0,54.0,,
3,,3.0,3.0,
4,1.0,,8.0,6.0


In [41]:
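# interpolate the missing values linearly; with limit_direction='forward'
# the leading NaN in column B has no earlier value and stays unfilled
df.interpolate(method='linear', limit_direction='forward')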

Unnamed: 0,A,B,C,D
0,12.0,,20.0,14.0
1,4.0,2.0,16.0,3.0
2,5.0,54.0,9.5,4.0
3,3.0,3.0,3.0,5.0
4,1.0,3.0,8.0,6.0
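
The effect of limit_direction is easier to see on a single Series. This is a minimal sketch with a hypothetical Series `s` that has missing values at the start, in the middle and at the end.

```python
import numpy as np
import pandas as pd

# Hypothetical Series with a leading, an interior and a trailing gap
s = pd.Series([np.nan, 10.0, np.nan, np.nan, 40.0, np.nan])

# Forward only: the leading NaN has nothing before it and is left as NaN
forward = s.interpolate(method="linear", limit_direction="forward")

# Both directions: the leading NaN is filled from the first valid value
both = s.interpolate(method="linear", limit_direction="both")

print(forward.tolist())  # [nan, 10.0, 20.0, 30.0, 40.0, 40.0]
print(both.tolist())     # [10.0, 10.0, 20.0, 30.0, 40.0, 40.0]
```

The interior gap is filled in equal steps (20.0 and 30.0), while the trailing gap repeats the last valid value, which matches the 3.0 that appears in the last row of column B above.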


### Dropping Entries
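`drop` - returns a new object with the given labels removed. With inplace=False (the default) the original object is left unchanged; with inplace=True the object itself is modified and nothing is returned.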

In [24]:
obj = pd.Series(np.arange(5.), index=['a','b','c','d','e'])
obj

a    0.0
b    1.0
c    2.0
d    3.0
e    4.0
dtype: float64

In [25]:
obj.drop('c',inplace=False)
obj

a    0.0
b    1.0
c    2.0
d    3.0
e    4.0
dtype: float64

In [26]:
obj.drop('c',inplace=True)
obj

a    0.0
b    1.0
d    3.0
e    4.0
dtype: float64

In [23]:
n_obj=obj.drop('d')
n_obj

a    0.0
b    1.0
e    4.0
dtype: float64
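
The same method works on a DataFrame, where it can remove rows or columns by label. The sketch below uses a hypothetical `frame`; its labels are chosen only for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical 3x4 frame with labelled rows and columns
frame = pd.DataFrame(
    np.arange(12).reshape(3, 4),
    index=["a", "b", "c"],
    columns=["w", "x", "y", "z"],
)

no_row_b = frame.drop("b")                 # drop a row by index label (axis=0 default)
no_cols = frame.drop(columns=["x", "z"])   # drop columns by name
safe = frame.drop("missing", errors="ignore")  # an unknown label is skipped, not raised

print(no_row_b.index.tolist())   # ['a', 'c']
print(no_cols.columns.tolist())  # ['w', 'y']
print(safe.equals(frame))        # True
```

As with the Series examples, `frame` itself is unchanged because inplace defaults to False.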
