# How do I handle missing values in pandas?

🐼 Tutorial on pandas by Data School - Exercise performed by Dorian.H Mekni 🥇 | Sun 06 Dec 2020

In [1]:
import pandas as pd

In [2]:
ufo = pd.read_csv('http://bit.ly/uforeports')

In [4]:
ufo.tail()

Unnamed: 0,City,Colors Reported,Shape Reported,State,Time
18236,Grant Park,,TRIANGLE,IL,12/31/2000 23:00
18237,Spirit Lake,,DISK,IA,12/31/2000 23:00
18238,Eagle River,,,WI,12/31/2000 23:45
18239,Eagle River,RED,LIGHT,WI,12/31/2000 23:45
18240,Ybor,,OVAL,FL,12/31/2000 23:59



☝🏻 The empty cells are missing values, which pandas displays as NaN.


In [5]:
ufo.isnull().tail()

Unnamed: 0,City,Colors Reported,Shape Reported,State,Time
18236,False,True,False,False,False
18237,False,True,False,False,False
18238,False,True,True,False,False
18239,False,False,False,False,False
18240,False,True,False,False,False



⭐️ The isnull() method returns False where a value is present and True where it is missing (NaN).

Now let's see the exact opposite result with its counterpart method: notnull().


In [7]:
ufo.notnull().tail()

Unnamed: 0,City,Colors Reported,Shape Reported,State,Time
18236,True,False,True,True,True
18237,True,False,True,True,True
18238,True,False,False,True,True
18239,True,True,True,True,True
18240,True,False,True,True,True
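
To make the relationship between the two methods concrete, here is a minimal sketch on a small, hypothetical dataframe (the `toy` values are made up for illustration and are not taken from the UFO dataset). It checks that `notnull()` is the element-wise negation of `isnull()`.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of the UFO data; the values are invented.
toy = pd.DataFrame({
    "City": ["Ithaca", np.nan, "Holyoke"],
    "Colors Reported": [np.nan, "RED", np.nan],
})

missing = toy.isnull()    # True where a value is missing
present = toy.notnull()   # True where a value is present

# Every cell is either missing or present, so one mask negates the other.
print(present.equals(~missing))
```

Either mask can be used on its own; which one reads better depends on whether you are looking for the gaps or for the usable values.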


In [8]:
ufo.isnull().sum()

City                  25
Colors Reported    15359
Shape Reported      2644
State                  0
Time                   0
dtype: int64


☝🏻 Chaining sum() after isnull() gives the total number of missing values per column. 
The result is one total per column because the axis parameter defaults to 0, which sums down the rows of each column. 


In [15]:
ufo.isnull().sum(axis=0)

City                  25
Colors Reported    15359
Shape Reported      2644
State                  0
Time                   0
dtype: int64


✅ It works as expected: with axis=0 the sum runs from top to bottom within each column. Before summing, pandas treats each True as 1 and each False as 0, so the total is the number of missing values. 



➕ Here's another example that summarizes how pandas converts boolean values to 1 and 0 when computing a sum.  
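
As a small illustration of the axis parameter, the sketch below counts missing values per column and per row on a hypothetical three-row dataframe (the `toy` values are invented for this example).

```python
import numpy as np
import pandas as pd

# Hypothetical data: two missing colors, one missing city.
toy = pd.DataFrame({
    "City": ["Ithaca", np.nan, "Holyoke"],
    "Colors Reported": [np.nan, np.nan, "RED"],
    "State": ["NY", "LA", "MA"],
})

per_column = toy.isnull().sum()        # axis=0 (default): one total per column
per_row = toy.isnull().sum(axis=1)     # axis=1: one total per row

print(per_column)
print(per_row)
```

Here `per_column` has one entry per column name, while `per_row` has one entry per row label.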


In [17]:
pd.Series([True, False, True]).sum()

2

In [16]:
pd.Series([True, False, True]).notnull().sum()

3


✅ None of the three values is missing, so notnull() is True for each of them, hence the result 3. 


In [35]:
pd.Series([True, False, True]).isnull().sum()

0


✅ Here we have no missing values, hence the result 0.



⭐️ Let's now look at a subset of the dataframe. Here we'll focus on the 25 rows with a missing value in the City column; the output below shows only some of them. 

In [23]:
ufo[ufo.City.isnull()]

Unnamed: 0,City,Colors Reported,Shape Reported,State,Time
21,,,,LA,8/15/1943 0:00
22,,,LIGHT,LA,8/15/1943 0:00
204,,,DISK,CA,7/15/1952 12:30
241,,BLUE,DISK,MT,7/4/1953 14:00
613,,,DISK,NV,7/1/1960 12:00
1877,,YELLOW,CIRCLE,AZ,8/15/1969 1:00
2013,,,,NH,8/1/1970 9:30
2546,,,FIREBALL,OH,10/25/1973 23:30
3123,,RED,TRIANGLE,WV,11/25/1975 23:00
4736,,,SPHERE,CA,6/23/1982 23:00
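
The filter above works in two steps: `isnull()` builds a boolean mask, and indexing with that mask keeps only the rows where it is True. Here is a minimal sketch of those two steps on a hypothetical dataframe (the `toy` values are invented; bracket notation is used instead of `ufo.City`, which is equivalent for this column).

```python
import numpy as np
import pandas as pd

# Hypothetical data with two missing cities.
toy = pd.DataFrame({
    "City": ["Ithaca", np.nan, "Holyoke", np.nan],
    "State": ["NY", "LA", "MA", "NH"],
})

mask = toy["City"].isnull()   # boolean Series, True where City is missing
missing_city = toy[mask]      # keep only the rows where the mask is True

# The number of rows kept equals the number of True values in the mask.
print(len(missing_city), mask.sum())
```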



⭐️ A common way to handle missing values is to drop the rows that contain them. 


In [24]:
ufo.shape

(18241, 5)

In [25]:
ufo.dropna(how='any').shape 

(2486, 5)


☝🏻 With how='any', a row is dropped if any of its values is missing, so the dataframe shrinks considerably: from 18241 rows to 2486. 



⭐️ Now we can call the same method with a different parameter value: how='all' tells pandas to drop a row ONLY if all of its values are missing. 


In [26]:
ufo.dropna(how='all').shape

(18241, 5)


✅ State and Time have no missing values, so no row is entirely missing and nothing is dropped: the shape is unchanged. 


In [27]:
ufo.dropna(subset=['City', 'Shape Reported'], how='any').shape

(15576, 5)


☝🏻 If either of these two columns is missing for a given row, I want pandas to drop that row. We are left with 15576 rows. 
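
To compare the three dropna() variants side by side, here is a minimal sketch on a hypothetical four-row dataframe (the `toy` values are invented so that each variant keeps a different number of rows).

```python
import numpy as np
import pandas as pd

# Hypothetical data: row 0 lacks a color, row 1 lacks a city,
# row 2 is complete, row 3 is entirely missing.
toy = pd.DataFrame({
    "City": ["Ithaca", np.nan, "Holyoke", np.nan],
    "Colors Reported": [np.nan, "RED", "RED", np.nan],
    "Shape Reported": ["TRIANGLE", "DISK", "OVAL", np.nan],
})

any_missing = toy.dropna(how="any")   # drop rows with at least one missing value
all_missing = toy.dropna(how="all")   # drop rows where every value is missing
# Only City and Shape Reported are inspected; a missing color is ignored.
subset_any = toy.dropna(subset=["City", "Shape Reported"], how="any")

print(any_missing.shape, all_missing.shape, subset_any.shape)
```

None of these calls modifies `toy`: dropna() returns a new dataframe unless inplace=True is passed.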


# 🎩 Bonus tips : iPython | Jupyter Notebook ONLY


🤠 What do we do with these missing values ?
    

In [29]:
ufo['Shape Reported'].value_counts()

LIGHT        2803
DISK         2122
TRIANGLE     1889
OTHER        1402
CIRCLE       1365
SPHERE       1054
FIREBALL     1039
OVAL          845
CIGAR         617
FORMATION     434
VARIOUS       333
RECTANGLE     303
CYLINDER      294
CHEVRON       248
DIAMOND       234
EGG           197
FLASH         188
TEARDROP      119
CONE           60
CROSS          36
DELTA           7
ROUND           2
CRESCENT        2
HEXAGON         1
FLARE           1
DOME            1
PYRAMID         1
Name: Shape Reported, dtype: int64


☝🏻 The value_counts() method counts how many times each value occurs in the Shape Reported Series.



❗️By default, value_counts() excludes missing values, so NaN does not appear in the result above. 



⭐️ Let's pass dropna=False so that the count of missing values is shown as well. 


In [30]:
ufo['Shape Reported'].value_counts(dropna=False)

LIGHT        2803
NaN          2644
DISK         2122
TRIANGLE     1889
OTHER        1402
CIRCLE       1365
SPHERE       1054
FIREBALL     1039
OVAL          845
CIGAR         617
FORMATION     434
VARIOUS       333
RECTANGLE     303
CYLINDER      294
CHEVRON       248
DIAMOND       234
EGG           197
FLASH         188
TEARDROP      119
CONE           60
CROSS          36
DELTA           7
CRESCENT        2
ROUND           2
DOME            1
PYRAMID         1
HEXAGON         1
FLARE           1
Name: Shape Reported, dtype: int64


✅ NaN (missing values) is counted 2644 times in the Shape Reported column, matching the isnull().sum() result above. 



⭐️ In this case, it would make sense to fold the missing values into the existing VARIOUS category. 
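
The effect of dropna=False can be checked on a small, hypothetical Series (the `shapes` values below are invented): the two results differ exactly by the number of missing values.

```python
import numpy as np
import pandas as pd

# Hypothetical Series with three missing entries.
shapes = pd.Series(
    ["LIGHT", "DISK", np.nan, "LIGHT", np.nan, np.nan],
    name="Shape Reported",
)

default_counts = shapes.value_counts()              # NaN excluded
with_missing = shapes.value_counts(dropna=False)    # NaN counted as its own entry

# The gap between the two totals is the number of missing values.
print(with_missing.sum() - default_counts.sum())
```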


In [36]:
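# Assign the result back: inplace=True on a selected column may not update ufo in recent pandas versions
ufo['Shape Reported'] = ufo['Shape Reported'].fillna(value='VARIOUS')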

In [37]:
ufo['Shape Reported'].value_counts(dropna=False)

VARIOUS      2977
LIGHT        2803
DISK         2122
TRIANGLE     1889
OTHER        1402
CIRCLE       1365
SPHERE       1054
FIREBALL     1039
OVAL          845
CIGAR         617
FORMATION     434
RECTANGLE     303
CYLINDER      294
CHEVRON       248
DIAMOND       234
EGG           197
FLASH         188
TEARDROP      119
CONE           60
CROSS          36
DELTA           7
CRESCENT        2
ROUND           2
FLARE           1
DOME            1
PYRAMID         1
HEXAGON         1
Name: Shape Reported, dtype: int64


✅ VARIOUS now includes the formerly missing values: 333 + 2644 = 2977.
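
As a minimal sketch of the same fill on a hypothetical Series (the `shapes` values are invented), the example below uses the non-inplace form of fillna(), which returns a new Series and leaves the original untouched. To keep the result, assign it to a name or back to the column.

```python
import numpy as np
import pandas as pd

# Hypothetical Series: one existing VARIOUS entry and two missing values.
shapes = pd.Series(["LIGHT", np.nan, "VARIOUS", np.nan], name="Shape Reported")

filled = shapes.fillna("VARIOUS")   # returns a new Series; shapes is unchanged

print(filled.value_counts())
print(shapes.isnull().sum(), filled.isnull().sum())
```

The VARIOUS count in `filled` is the original count plus the number of missing values, which mirrors the 333 + 2644 arithmetic on the full dataset.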



🙏🏻 Thank you ! 

👋🏻 See you in the next one !
