
#Missing Data

What does "missing data" mean? What is a missing value? It depends on the origin of the data and the context in which it was generated. For example, in a survey, a SALARY field with an empty value, the number 0, or an invalid value (a string, for example) can be considered "missing data". These concepts are related to the values that Python considers "falsy".

In [1]:
#import libraries

import pandas as pd
import numpy as np

In [3]:
#In Python, all of these values are considered "falsy"
falsy_values = (0, False, None, '', [], {})
print(falsy_values)

(0, False, None, '', [], {})


In [4]:
any(falsy_values)

False
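
Falsiness and missingness are not the same thing, though. As a small sketch (the variable names are only illustrative), np.nan is truthy while a legitimate 0 is falsy, so a plain truthiness test is not a reliable way to detect missing data:

```python
import numpy as np

# np.nan is a non-zero float, so Python treats it as truthy
nan_is_truthy = bool(np.nan)

# 0 may be a perfectly valid measurement, yet it is falsy
zero_is_truthy = bool(0)

print(nan_is_truthy, zero_is_truthy)
```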

In [5]:
#NumPy has a special "nullable" value for numbers, which is np.nan. It's NaN: "Not a Number"

np.nan

nan

In [6]:
#The np.nan value is kind of a virus. Everything that it touches becomes np.nan

3 + np.nan

nan

In [7]:
a = np.array([1, 2, 3, np.nan, np.nan, 4])
a

array([ 1.,  2.,  3., nan, nan,  4.])

In [9]:
a.sum() #The nan values propagate, so the result is nan

nan

In [10]:
a.mean() #nan is not ignored here either: the result is nan

nan

In [12]:
#nan is better than regular None values, which in the previous examples would have raised an exception:
3 + None

TypeError: unsupported operand type(s) for +: 'int' and 'NoneType'

In [13]:
#For a numerical array, the None value is replaced by np.nan

a = np.array([1, 2, 3, np.nan, None, 4], dtype="float")
a

array([ 1.,  2.,  3., nan, nan,  4.])
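
The dtype argument is what triggers that conversion. A minimal sketch of what happens without it (the names mixed and as_float are illustrative): NumPy falls back to an object array and keeps None as-is.

```python
import numpy as np

# Without an explicit float dtype, None is kept and the array has dtype object
mixed = np.array([1, 2, 3, None])
print(mixed.dtype)

# With dtype=float, None is converted to np.nan
as_float = np.array([1, 2, 3, None], dtype=float)
print(as_float)
```

Arithmetic on the object array would hit the same TypeError shown earlier for 3 + None, which is one reason to set a float dtype for numeric data.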

In [15]:
#As we said, np.nan is like a virus. If you have any NAN value in an array
# and you try to perform an operation on it, you'll get unexpected results:
a.mean()

nan

In [16]:
a.sum()

nan
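
For reference, NumPy also provides NaN-aware aggregations such as np.nansum and np.nanmean. The sketch below assumes NaN should simply be treated as "missing" and skipped; the names total and average are illustrative.

```python
import numpy as np

a = np.array([1, 2, 3, np.nan, np.nan, 4])

# NaN-aware aggregations skip NaN instead of propagating it
total = np.nansum(a)      # sum of 1, 2, 3 and 4
average = np.nanmean(a)   # mean of the four non-NaN values

print(total, average)
```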

In [19]:
#NumPy also supports an "infinite" value:

np.inf

inf

In [20]:
#which also behaves as a virus:

3 + np.inf

inf

In [21]:
np.inf / 3

inf

In [22]:
np.inf / np.inf

nan

In [26]:

b = np.array([1, 2, 3, np.inf, np.nan, 4], dtype=float)

In [27]:
b.sum()

nan

#Checking for NAN or INF

There are two functions, np.isnan and np.isinf, that will perform the desired checks.
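
A dedicated check is needed because NaN never compares equal to anything, including itself. A quick sketch (variable names are illustrative):

```python
import numpy as np

# Equality cannot detect NaN: under IEEE 754, NaN is not equal to itself
equal_to_itself = np.nan == np.nan

# np.isnan is the reliable check
detected = np.isnan(np.nan)

print(equal_to_itself, detected)
```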

In [29]:
#Checking for NAN
np.isnan(np.nan)

True

In [30]:
#checking for INF
np.isinf(np.inf)

True

In [31]:
#and the joint check can be performed with np.isfinite, which is True only for values that are neither NaN nor infinite

np.isfinite(np.inf), np.isfinite(np.nan)

(False, False)

In [32]:
#np.isnan and np.isinf also take arrays as inputs, and return boolean arrays as results:
np.isnan([1, 2, 3, np.nan, np.inf, 4])

array([False, False, False,  True, False, False])

In [33]:
np.isinf([1, 2, 3, np.nan, np.inf, 4])

array([False, False, False, False,  True, False])

In [34]:
np.isfinite([1, 2, 3, np.nan, np.inf, 4])

array([ True,  True,  True, False, False,  True])
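
To make the relationship between the three functions explicit, here is a small sketch (values and combined are illustrative names) checking that np.isfinite matches "neither NaN nor infinite" on the same input:

```python
import numpy as np

values = np.array([1, 2, 3, np.nan, np.inf, 4])

# A value is finite exactly when it is neither NaN nor infinite
combined = ~(np.isnan(values) | np.isinf(values))

print(np.array_equal(combined, np.isfinite(values)))
```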

#Filtering them out

Whenever you're trying to perform an operation with a NumPy array and you know there might be missing values, you'll usually want to filter them out before proceeding, to avoid NaN propagation. We'll combine np.isnan with boolean indexing for this purpose:

In [35]:
a = np.array([1, 2, 3, np.nan, np.nan, 4])
a

array([ 1.,  2.,  3., nan, nan,  4.])

In [38]:
#It returns a new array that contains only the non-NaN elements of a
a[~np.isnan(a)]

array([1., 2., 3., 4.])

In [40]:
#which, for this array (it contains no inf values), is equivalent to:
a[np.isfinite(a)]

array([1., 2., 3., 4.])

In [41]:
#and with that result, all the operations can now be performed:

a[np.isfinite(a)].sum()

10.0

In [42]:
a[np.isfinite(a)].mean()

2.5
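
The two filters only agree when the array has no infinite values. As a sketch, reusing the earlier array b that contains both np.inf and np.nan (the result names are illustrative):

```python
import numpy as np

b = np.array([1, 2, 3, np.inf, np.nan, 4], dtype=float)

# Dropping only NaN keeps inf, so the sum is still infinite
only_nan_removed = b[~np.isnan(b)]
print(only_nan_removed.sum())

# Keeping only finite values drops both NaN and inf
finite_only = b[np.isfinite(b)]
print(finite_only.sum())
```

Which filter is appropriate depends on whether infinite values are meaningful in the data at hand.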

#Handling missing data with pandas

Pandas borrows all of NumPy's selection capabilities and adds a number of convenient methods for handling missing values.

In [43]:
#import libraries
import pandas as pd
import numpy as np



#Pandas utility functions

Similarly to numpy, pandas also has a few utility functions to identify and detect null values:

In [44]:
pd.isnull(np.nan)

True

In [45]:
pd.isnull(None)

True

In [46]:
pd.isna(np.nan)

True

In [47]:
pd.isna(None)

True

In [48]:
#The opposite ones also exist:

pd.notnull(None)

False

In [49]:
pd.notnull(np.nan)

False

In [50]:
pd.notna(np.nan)

False

In [51]:
pd.notnull(1)

True

In [52]:
#These functions also work with Series and DataFrame

pd.isnull(pd.Series([1, np.nan, 7]))

0    False
1     True
2    False
dtype: bool

In [53]:
pd.notnull(pd.Series([1, np.nan, 7]))

0     True
1    False
2     True
dtype: bool

In [54]:
#dataframe

pd.isnull(pd.DataFrame({
    'Column A': [1, np.nan, 7],
    'Column B': [np.nan, 2, 3],
    'Column C': [np.nan, 2, np.nan]
}))

   Column A  Column B  Column C
0     False      True      True
1      True     False     False
2     False     False      True
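
Because the result is a boolean DataFrame, it can be summed to count missing values per column. A minimal sketch on the same data (df and missing_per_column are illustrative names):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Column A': [1, np.nan, 7],
    'Column B': [np.nan, 2, 3],
    'Column C': [np.nan, 2, np.nan]
})

# True counts as 1 when summed, giving the number of nulls in each column
missing_per_column = df.isnull().sum()
print(missing_per_column)
```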


#Pandas

Pandas manages missing values more gracefully than NumPy. NaNs no longer behave as "viruses": methods such as count, sum, and mean skip them by default.

In [56]:
pd.Series([1, 2, np.nan]).count()

2

In [57]:
pd.Series([1, 2, np.nan]).sum()

3.0

In [58]:
pd.Series([1, 2, np.nan]).mean()

1.5
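
Skipping NaN is a default rather than a fixed rule. As a sketch, passing skipna=False to sum brings back the NumPy-style propagation (variable names are illustrative):

```python
import numpy as np
import pandas as pd

s = pd.Series([1, 2, np.nan])

# Default behaviour: NaN is skipped
default_sum = s.sum()

# skipna=False makes NaN propagate into the result
strict_sum = s.sum(skipna=False)

print(default_sum, strict_sum)
```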

##Filtering missing data

As we saw with numpy, we can combine boolean selection + pd.isnull to filter out those NaN and null values:

In [59]:
s = pd.Series([1, 2, 3, np.nan, np.nan, 4])
s

0    1.0
1    2.0
2    3.0
3    NaN
4    NaN
5    4.0
dtype: float64