# Missing Data

- This section explains how to manage missing data in <b>Series and DataFrames</b>.
- Let's look at techniques for modifying DataFrames and data sets that contain cells with missing values.
- Missing data occurs frequently in real-world data science.
- For example, people may not answer every question in a survey, or a data set may be built from multiple sources that do not all cover identical time periods or use identical data types.

In [1]:
import pandas as pd
import numpy as np

starting_date='20160701' # used for creating dates index
sample_numpy_data=np.array(np.arange(24)).reshape(6,4)
dates_index=pd.date_range(starting_date,periods=6)
sample_df=pd.DataFrame(sample_numpy_data,dates_index,columns=list('ABCD'))


sample_df_2=sample_df.copy()
sample_df_2['Fruits']=['apple','orange','bananas','strawberry','blueberry','pineapple']

sample_series=pd.Series([1,2,3,4,5,6], index=pd.date_range(starting_date,periods=6))
sample_df_2['Extra Data']=sample_series*3+1
second_numpy_array=np.array(np.arange(len(sample_df_2))) *100 +7
sample_df_2['G']=second_numpy_array

sample_df_2

Unnamed: 0,A,B,C,D,Fruits,Extra Data,G
2016-07-01,0,1,2,3,apple,4,7
2016-07-02,4,5,6,7,orange,7,107
2016-07-03,8,9,10,11,bananas,10,207
2016-07-04,12,13,14,15,strawberry,13,307
2016-07-05,16,17,18,19,blueberry,16,407
2016-07-06,20,21,22,23,pineapple,19,507


### Missing data
- pandas uses np.nan to represent missing data.
- By default, missing values are skipped in reductions such as sum() and mean(); element-wise arithmetic on a missing value still yields NaN.

documentation - https://pandas.pydata.org/pandas-docs/stable/missing_data.html
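
As a minimal sketch of the default behaviour described above, the following uses a small hypothetical Series (`values` is not part of the notebook's data) to show that reductions skip NaN unless `skipna=False` is passed.

```python
import numpy as np
import pandas as pd

# Hypothetical example data with one missing entry
values = pd.Series([1.0, np.nan, 3.0])

total_skipping_nan = values.sum()          # NaN is ignored: 1.0 + 3.0
mean_skipping_nan = values.mean()          # mean of the two non-missing values
total_with_nan = values.sum(skipna=False)  # NaN is kept, so the result is NaN
non_missing_count = values.count()         # count() excludes missing values

print(total_skipping_nan, mean_skipping_nan, total_with_nan, non_missing_count)
```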

##### reindex()
- DataFrame.reindex() conforms a DataFrame to a new index with optional filling logic, placing NaN (not a number) in locations that had no value in the previous index.
- A new object is produced unless the new index is equivalent to the current one and copy=False.
- https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.reindex.html
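
The "optional filling logic" can be sketched with the `fill_value` argument of `reindex()`. The DataFrame `status_df` and the label `'Opera'` are hypothetical and only serve as an illustration.

```python
import pandas as pd

# Hypothetical two-row frame, separate from the notebook's browser_df
status_df = pd.DataFrame(
    {"http_status": [200, 404], "response_time": [0.04, 0.07]},
    index=["Firefox", "Safari"],
)

# Default behaviour: 'Opera' is not in the old index, so its row is all NaN
default_reindexed = status_df.reindex(["Safari", "Opera"])

# With fill_value, new labels get the given value instead of NaN
filled_reindexed = status_df.reindex(["Safari", "Opera"], fill_value=0)

print(default_reindexed)
print(filled_reindexed)
```

Labels that are dropped from the new index ('Firefox' here) simply disappear from the result.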

In [5]:
browser_index=['Firefox','Chrome','Safari','IE10','Konqueror']

browser_df=pd.DataFrame({
    'http_status':[200,200,404,404,301],
    'response_time':[0.04,0.02,0.07,0.08,1.0]
},
    index=browser_index)
browser_df

Unnamed: 0,http_status,response_time
Firefox,200,0.04
Chrome,200,0.02
Safari,404,0.07
IE10,404,0.08
Konqueror,301,1.0


##### 1. reindex() creates a copy (not a view)

In [6]:
new_index=['Safari','Iceweasel','Comodo Dragon','IE10','Chrome']
browser_df_2=browser_df.reindex(new_index)
browser_df_2

Unnamed: 0,http_status,response_time
Safari,404.0,0.07
Iceweasel,,
Comodo Dragon,,
IE10,404.0,0.08
Chrome,200.0,0.02


In [7]:
# 'Safari', 'IE10', and 'Chrome' exist in the original index, so their data carries over.
# 'Iceweasel' and 'Comodo Dragon' are not in the original index, so reindex() fills their rows with NaN (not a number).
# http_status is now shown as float (404.0) because NaN is a float and an integer column cannot hold it.
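
To check the "copy, not a view" claim, here is a small sketch on hypothetical frames (`original_df` and `reindexed_df` are illustrative names): it modifies the reindexed result and confirms the source is untouched.

```python
import pandas as pd

# Hypothetical source frame
original_df = pd.DataFrame({"http_status": [200, 404]}, index=["Chrome", "Safari"])

# 'Iceweasel' is new, so the result differs from the original index
reindexed_df = original_df.reindex(["Safari", "Chrome", "Iceweasel"])

# Change a value in the reindexed frame only
reindexed_df.loc["Safari", "http_status"] = 500

print(original_df.loc["Safari", "http_status"])   # the original still holds 404
print(reindexed_df.loc["Safari", "http_status"])
```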

##### 2. Drop rows that are missing data - dropna
- dropna returns a new object in which the labels on the given axis are omitted where any or all of the data are missing, depending on the how argument.

- <b>dropna(axis=0, how='any', thresh=None, subset=None, inplace=False)</b>

- doc - https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.dropna.html

In [8]:
browser_df_3=browser_df_2.dropna(how='any')
#how : {'any', 'all'}
#    * any : if any NA values are present, drop that label
#    * all : if all values are NA, drop that label
browser_df_3
# Below, every row that contains at least one NaN has been dropped

Unnamed: 0,http_status,response_time
Safari,404.0,0.07
IE10,404.0,0.08
Chrome,200.0,0.02
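
The other `dropna` parameters from the signature above can be sketched on a hypothetical frame in which one row is only partly missing (`partial_df` is an illustrative name, not part of the notebook).

```python
import numpy as np
import pandas as pd

# Hypothetical frame: one complete row, one partly missing, one fully missing
partial_df = pd.DataFrame(
    {
        "http_status": [404.0, np.nan, np.nan],
        "response_time": [0.07, 0.05, np.nan],
    },
    index=["Safari", "Iceweasel", "Comodo Dragon"],
)

# how='all': drop a row only if every value in it is missing
rows_not_fully_missing = partial_df.dropna(how="all")

# thresh=2: keep rows that have at least 2 non-missing values
rows_with_two_values = partial_df.dropna(thresh=2)

# subset: only look at the listed columns when deciding what to drop
rows_with_status = partial_df.dropna(subset=["http_status"])

print(rows_not_fully_missing)
print(rows_with_two_values)
print(rows_with_status)
```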


##### 3. Fill in missing data
- DataFrame.fillna(value=None, method=None, axis=None, inplace=False, limit=None, downcast=None, **kwargs)

- documentation - https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.fillna.html

- Fill NA/NaN values using the specified method/value

- **It returns a copy and does not modify the original unless inplace=True.**

In [9]:
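# fillna() returns a new DataFrame; the result is not assigned here, so browser_df_2 stays unchanged
browser_df_2.fillna(value=0.05555)
browser_df_2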

Unnamed: 0,http_status,response_time
Safari,404.0,0.07
Iceweasel,,
Comodo Dragon,,
IE10,404.0,0.08
Chrome,200.0,0.02
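
To actually keep the filled values, the result has to be assigned. The sketch below uses a hypothetical frame (`sparse_df`) and also shows a per-column fill via a dict, which is one way to avoid putting a response-time placeholder into a status column.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with one fully missing row
sparse_df = pd.DataFrame(
    {"http_status": [404.0, np.nan], "response_time": [0.07, np.nan]},
    index=["Safari", "Iceweasel"],
)

# Assign the returned copy to keep the filled values
filled_df = sparse_df.fillna(value=0.05555)

# A dict fills each column with its own value
per_column_df = sparse_df.fillna(value={"http_status": 0, "response_time": 0.05555})

print(filled_df)
print(per_column_df)
print(sparse_df)  # the source frame still contains NaN
```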


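##### 4. Get a boolean mask where values are NaN
- pd.isnull() (an alias of pd.isna()) returns an object of the same shape holding True wherever a value is missing.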

In [11]:
pd.isnull(browser_df_2)

Unnamed: 0,http_status,response_time
Safari,False,False
Iceweasel,True,True
Comodo Dragon,True,True
IE10,False,False
Chrome,False,False
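
A boolean mask like the one above is usually a starting point rather than an end result. As a sketch on a hypothetical frame (`checked_df`), it can be summed to count missing values per column or used to select the affected rows.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with one partly missing and one fully missing row
checked_df = pd.DataFrame(
    {
        "http_status": [404.0, 200.0, np.nan],
        "response_time": [0.07, np.nan, np.nan],
    },
    index=["Safari", "Chrome", "Iceweasel"],
)

missing_mask = pd.isnull(checked_df)

# True counts as 1, so summing the mask counts missing values per column
missing_per_column = missing_mask.sum()

# Select rows that have at least one missing value
rows_with_missing = checked_df[missing_mask.any(axis=1)]

print(missing_per_column)
print(rows_with_missing)
```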


##### 5. NaN propagates during arithmetic operations

In [12]:
browser_df_2*17

Unnamed: 0,http_status,response_time
Safari,6868.0,1.19
Iceweasel,,
Comodo Dragon,,
IE10,6868.0,1.36
Chrome,3400.0,0.34
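
Propagation also applies when two objects are combined. The sketch below, on two hypothetical Series (`left` and `right`), contrasts the `+` operator with `Series.add(..., fill_value=0)`, which substitutes a value for an entry that is missing on one side only.

```python
import numpy as np
import pandas as pd

# Hypothetical Series, each missing a different entry
left = pd.Series([1.0, np.nan, 3.0], index=["a", "b", "c"])
right = pd.Series([10.0, 20.0, np.nan], index=["a", "b", "c"])

# Plain arithmetic: any NaN operand makes the result NaN
plain_sum = left + right

# add() with fill_value treats a one-sided missing entry as 0
filled_sum = left.add(right, fill_value=0)

print(plain_sum)
print(filled_sum)
```

If an entry were missing on both sides, the result would still be NaN even with `fill_value`.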
