As data comes in many shapes and forms, pandas aims to be flexible with regard to handling missing data. While 'NaN' is the default missing value marker for reasons of data processing speed and convenience, we need to be able to easily detect this value with data of different types: floating point, integer, boolean, and general object. 

In [2]:
import pandas as pd

In [3]:
ufo=pd.read_csv('http://bit.ly/uforeports')

In [4]:
ufo.tail() # Will take a look at last 5 rows using tail()

Unnamed: 0,City,Colors Reported,Shape Reported,State,Time
18236,Grant Park,,TRIANGLE,IL,12/31/2000 23:00
18237,Spirit Lake,,DISK,IA,12/31/2000 23:00
18238,Eagle River,,,WI,12/31/2000 23:45
18239,Eagle River,RED,LIGHT,WI,12/31/2000 23:45
18240,Ybor,,OVAL,FL,12/31/2000 23:59


In the 'Colors Reported' column we see 'NaN' (shown as empty cells in the table above), which stands for 'Not a Number' and marks a missing value.
While constructing the 'ufo' DataFrame, read_csv() detected the missing values and tagged them with the special value 'NaN'.
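
A minimal sketch of that detection, using a small inline CSV so it runs without a network connection. The three rows are made up for illustration and are not taken from the UFO file:

```python
import io
import pandas as pd

# Hypothetical three-row CSV; the empty fields stand in for missing values.
csv_text = """City,Colors Reported,State
Ithaca,,NY
,RED,LA
Holyoke,,CO
"""

sample = pd.read_csv(io.StringIO(csv_text))

# Empty fields are read as NaN, which isnull() reports as True.
print(sample)
print(sample.isnull())
```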

###### Methods to work with missing Data:

1--isnull() : Detect missing values

In [7]:
ufo.isnull().tail() 

Unnamed: 0,City,Colors Reported,Shape Reported,State,Time
18236,False,True,False,False,False
18237,False,True,False,False,False
18238,False,True,True,False,False
18239,False,False,False,False,False
18240,False,True,False,False,False


isnull() returns 'False' where a value is present and 'True' where it is missing.

For instance, 'City' is 'False' in all of the last 5 rows, so none of those values are missing.

In 'Colors Reported', most of the values are 'True', meaning they are missing. It is only because pandas uses the special value NaN that isnull() can detect them and produce 'True's and 'False's as a result.

2--notnull(): Detect existing (non-missing) values.

In [9]:
ufo.notnull().tail()

Unnamed: 0,City,Colors Reported,Shape Reported,State,Time
18236,True,False,True,True,True
18237,True,False,True,True,True
18238,True,False,False,True,True
18239,True,True,True,True,True
18240,True,False,True,True,True


Here we get the exact opposite of the isnull() result.
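
As a quick check of that relationship, here is a small sketch on a toy DataFrame (hypothetical data, not the UFO dataset) comparing notnull() with the negation of isnull():

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, not the UFO dataset.
toy = pd.DataFrame({
    "City": ["Ithaca", None, "Holyoke"],
    "Shape Reported": ["TRIANGLE", "DISK", np.nan],
})

missing = toy.isnull()
present = toy.notnull()

# notnull() is the element-wise negation of isnull().
print(present.equals(~missing))
```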

Let's see why the isnull() and notnull() methods are useful.

Below is a common pandas trick:

In [10]:
ufo.isnull().sum()

City                  25
Colors Reported    15359
Shape Reported      2644
State                  0
Time                   0
dtype: int64

The result above is the number of missing values in each column of the 'ufo' DataFrame.

Let's break down how that works. There are a couple of things to understand:

In [11]:
pd.Series([True, False, True]) # Build a Series from a list of booleans

0     True
1    False
2     True
dtype: bool

In [12]:
pd.Series([True, False, True]).sum()

2

The result of sum() above is 2 because pandas converts each 'True' to 1 and each 'False' to 0 before adding them up.

So the result is the count of 'True's, and this treatment of booleans explains why "ufo.isnull().sum()" returns the count of missing values in each column.

And why does sum() work at the column level? sum() uses axis=0 by default, and axis=0 refers to the row axis, so sum() adds values across the rows, that is, down each column.

In short, when we say "ufo.isnull().sum()": ufo.isnull() creates a DataFrame of 'True's and 'False's, then sum() treats each 'True' as 1 and each 'False' as 0 and adds them down each column, which gives the number of 'True's per column.
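
To make the axis behaviour concrete, the sketch below counts missing values per column (axis=0, the default) and per row (axis=1) on a toy DataFrame. The data is hypothetical and chosen only to keep the counts easy to verify by eye:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, not the UFO dataset.
toy = pd.DataFrame({
    "City": ["Ithaca", None, None],
    "Colors Reported": [np.nan, "RED", np.nan],
    "State": ["NY", "LA", "CO"],
})

per_column = toy.isnull().sum()        # axis=0 (default): one count per column
per_row = toy.isnull().sum(axis=1)     # axis=1: one count per row

print(per_column)
print(per_row)
```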

###### Another way to use isnull(): it is also a Series method

In [13]:
ufo.City.isnull()

0        False
1        False
2        False
3        False
4        False
         ...  
18236    False
18237    False
18238    False
18239    False
18240    False
Name: City, Length: 18241, dtype: bool

We can use the boolean Series above to filter the DataFrame, as below:

In [14]:
ufo[ufo.City.isnull()]

Unnamed: 0,City,Colors Reported,Shape Reported,State,Time
21,,,,LA,8/15/1943 0:00
22,,,LIGHT,LA,8/15/1943 0:00
204,,,DISK,CA,7/15/1952 12:30
241,,BLUE,DISK,MT,7/4/1953 14:00
613,,,DISK,NV,7/1/1960 12:00
1877,,YELLOW,CIRCLE,AZ,8/15/1969 1:00
2013,,,,NH,8/1/1970 9:30
2546,,,FIREBALL,OH,10/25/1973 23:30
3123,,RED,TRIANGLE,WV,11/25/1975 23:00
4736,,,SPHERE,CA,6/23/1982 23:00


We know that there are 25 missing values in the 'City' Series, and this filter selects exactly those 25 rows of the ufo DataFrame where 'City' is missing (the output above lists 10 of them).

That is how we examine a subset of the DataFrame: by filtering it down to the rows of interest.
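
The same filtering pattern, sketched on a toy DataFrame (hypothetical data) so the selected rows can be checked by hand:

```python
import pandas as pd

# Hypothetical toy data, not the UFO dataset.
toy = pd.DataFrame({
    "City": ["Ithaca", None, "Holyoke", None],
    "State": ["NY", "LA", "CO", "CA"],
})

mask = toy["City"].isnull()      # boolean Series, True where City is missing
missing_city = toy[mask]         # keeps only the rows where the mask is True

print(missing_city)
print(len(missing_city) == mask.sum())
```

The number of rows returned always equals the number of 'True's in the mask, which is why the filter on 'ufo' matches the count reported by isnull().sum().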

That covers the basic functionality.

What should we do about missing data?

Depending on the dataset, the analysis, or the problem we are trying to solve, pandas offers several options:

1: A common choice is to drop missing values, as below:

In [15]:
ufo.shape

(18241, 5)

In [18]:
ufo.dropna(how='any').shape 
# dropna(): remove missing values; how is 'any' (default) or 'all'; inplace=False (default) returns a new DataFrame and leaves ufo unchanged

(2486, 5)

In the above command, we are asking to drop a row if any of its values is missing. 

The result is (2486, 5): only the 2486 rows with no missing values in any of the 5 columns remain, so we lost most of the rows.

In [19]:
ufo.shape

(18241, 5)

Since we did not pass 'inplace=True' to dropna() above, ufo.shape still shows the original dimensions.
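
A small sketch of that behaviour on toy data (hypothetical values): dropna() hands back a new DataFrame, and the original keeps all of its rows.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, not the UFO dataset.
toy = pd.DataFrame({
    "City": ["Ithaca", None, "Holyoke"],
    "Colors Reported": ["RED", "BLUE", np.nan],
})

cleaned = toy.dropna(how="any")   # returns a new DataFrame

print(toy.shape)       # the original is unchanged
print(cleaned.shape)   # only rows with no missing values remain
```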

Let's modify the dropna() call:

In [20]:
ufo.dropna(how='all').shape 
# that says only drop a row if all values are missing 

(18241, 5)

This did not drop any rows because the 'State' and 'Time' columns have no missing values, so no row is missing all of its values. how='all' therefore makes no change in this case.

One more option for dropna() is subset, as below:

In [22]:
ufo.dropna(subset=['City','Shape Reported'],how='any').shape # subset takes a list of columns to consider
# Drop a row if either 'City' or 'Shape Reported' is missing.

(15576, 5)

In [23]:
ufo.dropna(subset=['City','Shape Reported'],how='all').shape
# Drop a row only if both 'City' and 'Shape Reported' are missing.

(18237, 5)

Only 4 rows are dropped from the original data (18241, 5) because there are only 4 rows where both 'City' and 'Shape Reported' are missing.
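
The difference between how='any' and how='all' with subset can be sketched on a toy DataFrame. The data is hypothetical and built so that each row covers one case:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: one row per combination of missing values.
toy = pd.DataFrame({
    "City": ["Ithaca", None, None, "Holyoke"],
    "Shape Reported": ["TRIANGLE", "DISK", np.nan, np.nan],
    "State": ["NY", "LA", "CO", "CA"],
})
cols = ["City", "Shape Reported"]

either = toy.dropna(subset=cols, how="any")   # drop if at least one of cols is missing
both = toy.dropna(subset=cols, how="all")     # drop only if every one of cols is missing

print(either.index.tolist())
print(both.index.tolist())
```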

###### Useful hint: About filling Missing Data

In [24]:
ufo['Shape Reported'].value_counts()
# value_counts() reports how many times each value occurs in the 'Shape Reported' Series.

LIGHT        2803
DISK         2122
TRIANGLE     1889
OTHER        1402
CIRCLE       1365
SPHERE       1054
FIREBALL     1039
OVAL          845
CIGAR         617
FORMATION     434
VARIOUS       333
RECTANGLE     303
CYLINDER      294
CHEVRON       248
DIAMOND       234
EGG           197
FLASH         188
TEARDROP      119
CONE           60
CROSS          36
DELTA           7
CRESCENT        2
ROUND           2
HEXAGON         1
PYRAMID         1
DOME            1
FLARE           1
Name: Shape Reported, dtype: int64

The result shows that the most common value is 'LIGHT', followed by 'DISK', and so on.

In [25]:
ufo['Shape Reported'].value_counts().sum()

15597

In [27]:
ufo.shape

(18241, 5)

We might wonder why the original row count (18241) differs from the sum of the 'Shape Reported' value_counts() (15597). The reason is that value_counts() excludes missing values (NaN) by default, and the difference, 2644, is exactly the number of missing values in 'Shape Reported'.

In [32]:
ufo['Shape Reported'].value_counts(dropna=True).sum() # dropna=True is the default: counts of NaN are not included

15597

In [34]:
ufo['Shape Reported'].value_counts(dropna=False).sum() # dropna=False: counts of NaN are included

18241

In [36]:
ufo['Shape Reported'].value_counts(dropna=False)

LIGHT        2803
NaN          2644
DISK         2122
TRIANGLE     1889
OTHER        1402
CIRCLE       1365
SPHERE       1054
FIREBALL     1039
OVAL          845
CIGAR         617
FORMATION     434
VARIOUS       333
RECTANGLE     303
CYLINDER      294
CHEVRON       248
DIAMOND       234
EGG           197
FLASH         188
TEARDROP      119
CONE           60
CROSS          36
DELTA           7
CRESCENT        2
ROUND           2
HEXAGON         1
PYRAMID         1
DOME            1
FLARE           1
Name: Shape Reported, dtype: int64

Above we can see that the 'Shape Reported' column has 2644 rows of NaN (missing values).
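
The effect of the dropna parameter can be reproduced on a short toy Series (hypothetical values), where the totals are easy to check:

```python
import numpy as np
import pandas as pd

# Hypothetical toy Series: 3 real values and 3 missing ones.
shapes = pd.Series(
    ["LIGHT", "DISK", np.nan, "LIGHT", np.nan, np.nan],
    name="Shape Reported",
)

counted = shapes.value_counts()                     # NaN excluded (default)
counted_with_nan = shapes.value_counts(dropna=False)  # NaN counted as its own entry

print(counted.sum(), counted_with_nan.sum(), len(shapes))
```

The gap between the two totals equals the number of missing values, just as 18241 minus 15597 gives 2644 for the UFO data.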

What if we wanted to fill in those 2644 rows?
Based on our understanding of the data, we could fill the 2644 NaN rows with 'OTHERS', which seems reasonable.

In [37]:
ufo['Shape Reported'] = ufo['Shape Reported'].fillna(value='OTHERS') # assign the filled Series back so the change persists in the ufo DataFrame

In [38]:
ufo['Shape Reported'].value_counts(dropna=False)

LIGHT        2803
OTHERS       2644
DISK         2122
TRIANGLE     1889
OTHER        1402
CIRCLE       1365
SPHERE       1054
FIREBALL     1039
OVAL          845
CIGAR         617
FORMATION     434
VARIOUS       333
RECTANGLE     303
CYLINDER      294
CHEVRON       248
DIAMOND       234
EGG           197
FLASH         188
TEARDROP      119
CONE           60
CROSS          36
DELTA           7
CRESCENT        2
ROUND           2
HEXAGON         1
PYRAMID         1
DOME            1
FLARE           1
Name: Shape Reported, dtype: int64

Now we can see that all of the missing values (NaN) have been replaced with 'OTHERS'.
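
A minimal sketch of the fill step on toy data (hypothetical values). It assumes the filled Series is assigned back to the column, a form that should behave the same across pandas versions; recent versions warn when fillna(inplace=True) is called on a column selected from a DataFrame.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, not the UFO dataset.
toy = pd.DataFrame({"Shape Reported": ["LIGHT", np.nan, "DISK", np.nan]})

filled = toy["Shape Reported"].fillna("OTHERS")   # new Series; toy is not modified yet
missing_before = int(toy["Shape Reported"].isnull().sum())

toy["Shape Reported"] = filled                    # assign back to keep the change
missing_after = int(toy["Shape Reported"].isnull().sum())

print(missing_before, missing_after)
print(toy["Shape Reported"].value_counts(dropna=False))
```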