# Missing Data

- This section explains how to manage missing data in <b>series and data frames</b>.
- It covers techniques for modifying data frames and data sets that contain cells with missing values.
- Missing data is a frequent occurrence in real-world data science.
- For example, people may not answer every question in a survey, or a data set may be built  
from multiple sources, not all of which contain identical time  
indices or identical data types.  

Common data quality problems:
- Missing values
- Null values
- Character compatibility
- Duplicated data
- Corrupted data  

Data cleaning lets us overcome the issues above.  
Pandas provides many methods for this, such as:  
- reindex()
- dropna()
- fillna()
- isnull()
- drop_duplicates()

### drop duplicates

In [15]:
import pandas as pd
ddf=pd.DataFrame({'A':pd.Series([1,2,3,3,5,5]),'B':pd.Series([1,2,3,4,5,5])})
ddf

Unnamed: 0,A,B
0,1,1
1,2,2
2,3,3
3,3,4
4,5,5
5,5,5


Note that a duplicate is judged on the whole record, not on column A alone.  
A record is the combination of both A and B, so the row 5,5 appears twice  
and its second occurrence will be removed,  
but 3,3 and 3,4 are not duplicate records.

In [16]:
ddf.drop_duplicates()

Unnamed: 0,A,B
0,1,1
1,2,2
2,3,3
3,3,4
4,5,5
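
drop_duplicates() can also compare only some columns, or keep a different occurrence. The following is a minimal sketch that rebuilds the ddf frame from above. The subset and keep arguments are standard pandas parameters, but these particular calls are illustrative additions and not part of the original notebook:

```python
import pandas as pd

ddf = pd.DataFrame({'A': [1, 2, 3, 3, 5, 5], 'B': [1, 2, 3, 4, 5, 5]})

# Boolean mask marking rows that repeat an earlier full record (A and B together)
dup_mask = ddf.duplicated()

# Compare on column A only: the second 3 and the second 5 now count as duplicates
by_a = ddf.drop_duplicates(subset=['A'])

# Keep the last occurrence of each full record instead of the first
keep_last = ddf.drop_duplicates(keep='last')

print(dup_mask.tolist())
print(by_a)
print(keep_last)
```

Calling duplicated() first is a cheap way to inspect what would be dropped before actually dropping it.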


In [17]:
import pandas as pd
import numpy as np

starting_date='20160701' # used for creating dates index
sample_numpy_data=np.array(np.arange(24)).reshape(6,4)
dates_index=pd.date_range(starting_date,periods=6)
sample_df=pd.DataFrame(sample_numpy_data,dates_index,columns=list('ABCD'))


sample_df_2=sample_df.copy()
sample_df_2['Fruits']=['apple','orange','bananas','strawberry','blueberry','pineapple']

sample_series=pd.Series([1,2,3,4,5,6], index=pd.date_range(starting_date,periods=6))
sample_df_2['Extra Data']=sample_series*3+1
second_numpy_array=np.array(np.arange(len(sample_df_2))) *100 +7
sample_df_2['G']=second_numpy_array

sample_df_2

Unnamed: 0,A,B,C,D,Fruits,Extra Data,G
2016-07-01,0,1,2,3,apple,4,7
2016-07-02,4,5,6,7,orange,7,107
2016-07-03,8,9,10,11,bananas,10,207
2016-07-04,12,13,14,15,strawberry,13,307
2016-07-05,16,17,18,19,blueberry,16,407
2016-07-06,20,21,22,23,pineapple,19,507


### Missing data
- pandas uses np.nan to represent missing data
- By default, it is excluded from aggregations such as sum() and mean()

documentation - https://pandas.pydata.org/pandas-docs/stable/missing_data.html
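
A small sketch of what "excluded by default" means in practice. The series s is a made-up example, not data from this notebook; skipna is the pandas argument that controls the behavior:

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, 3.0])

total = s.sum()                      # NaN is skipped by default
average = s.mean()                   # mean of the two values that are present
non_missing = s.count()              # count() only counts non-missing values
strict_total = s.sum(skipna=False)   # NaN is included, so the result is NaN

print(total, average, non_missing, strict_total)
```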

##### reindex()
- A data frame's reindex() method conforms the data frame to a new index with optional filling logic, placing NaN (not a number) in locations that had no value in the previous index.
- A new object is produced unless the new index is equivalent to the current one and copy=False
- https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.reindex.html

In [18]:
browser_index=['Firefox','Chrome','Safari','IE10','Konqueror']

browser_df=pd.DataFrame({
    'http_status':[200,200,404,404,301],
    'response_time':[0.04,0.02,0.07,0.08,1.0]
},
    index=browser_index)
browser_df

Unnamed: 0,http_status,response_time
Firefox,200,0.04
Chrome,200,0.02
Safari,404,0.07
IE10,404,0.08
Konqueror,301,1.0


###### 1.reindex() creates a copy (not a view)

In [19]:
new_index=['Safari','Iceweasel','Comodo Dragon','IE10','Chrome']
browser_df_2=browser_df.reindex(new_index)
browser_df_2

Unnamed: 0,http_status,response_time
Safari,404.0,0.07
Iceweasel,,
Comodo Dragon,,
IE10,404.0,0.08
Chrome,200.0,0.02


In [20]:
# Safari, IE10 and Chrome exist in the original index, so their data is carried over.
# Iceweasel and Comodo Dragon are not in the original, so reindex() fills their rows with NaN (not a number).
# http_status also changes from int to float, because NaN is a float value.
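
The "optional filling logic" of reindex() can be sketched as follows. The browser data is rebuilt from the cells above; the short date series is a made-up example, and the choices of fill_value=0 and method='ffill' are assumptions made for illustration only:

```python
import pandas as pd

browser_df = pd.DataFrame(
    {'http_status': [200, 200, 404, 404, 301],
     'response_time': [0.04, 0.02, 0.07, 0.08, 1.0]},
    index=['Firefox', 'Chrome', 'Safari', 'IE10', 'Konqueror'],
)
new_index = ['Safari', 'Iceweasel', 'Comodo Dragon', 'IE10', 'Chrome']

# fill_value replaces the NaN that reindex() would otherwise insert for new labels
filled = browser_df.reindex(new_index, fill_value=0)

# method='ffill' carries the last known value forward; it needs a sorted index
dates = pd.date_range('20160701', periods=3)
short = pd.Series([1, 2, 3], index=dates)
longer = short.reindex(pd.date_range('20160701', periods=5), method='ffill')

print(filled)
print(longer)
```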

##### 2.Drop rows that are missing data - dropna
- dropna returns an object in which labels on the given axis are omitted where any or all of the data are missing (controlled by the how argument)

<br>

- <b>dropna(axis=0, how='any', thresh=None, subset=None, inplace=False)</b>

</br>

- doc - https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.dropna.html

In [21]:
browser_df_3=browser_df_2.dropna(how='any')
#how : {'any', 'all'}
#    * any : if any NA values are present, drop that label
#    * all : if all values are NA, drop that label
browser_df_3
# as shown in the output, every row that contains NaN/NA has been dropped

Unnamed: 0,http_status,response_time
Safari,404.0,0.07
IE10,404.0,0.08
Chrome,200.0,0.02
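
The other dropna arguments from the signature above behave as sketched here. The frame df is a small made-up example chosen so that each option gives a different result:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'a': [1, np.nan, np.nan, 4],
    'b': [1, 2, np.nan, np.nan],
    'c': [1, 2, np.nan, 4],
})

any_dropped = df.dropna(how='any')      # drop rows with at least one NaN
all_dropped = df.dropna(how='all')      # drop rows only if every value is NaN
thresh_kept = df.dropna(thresh=3)       # keep rows with at least 3 non-NaN values
subset_kept = df.dropna(subset=['a'])   # only look at column 'a' when deciding
cols_dropped = df.dropna(axis=1)        # drop columns (not rows) that contain NaN

print(any_dropped.index.tolist())
print(all_dropped.index.tolist())
print(thresh_kept.index.tolist())
print(subset_kept.index.tolist())
print(cols_dropped.columns.tolist())
```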


##### 3.Fill in missing data
- DataFrame.fillna(value=None, method=None, axis=None, inplace=False, limit=None, downcast=None, **kwargs)

- documentation - https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.fillna.html

- Fill NA/NaN values using the specified method/value

- **It returns a copy and does not modify the original** (unless inplace=True)

In [22]:
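browser_df_2.fillna(value=0.05555)  # the filled copy is not assigned, so it is discarded
browser_df_2  # the original is unchanged and still shows NaN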

Unnamed: 0,http_status,response_time
Safari,404.0,0.07
Iceweasel,,
Comodo Dragon,,
IE10,404.0,0.08
Chrome,200.0,0.02
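
To keep the filled result, assign it to a name. The sketch below rebuilds the reindexed browser frame by hand; the per-column dictionary and the ffill() call are illustrative choices, not something the original notebook does:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {'http_status': [404.0, np.nan, np.nan, 404.0, 200.0],
     'response_time': [0.07, np.nan, np.nan, 0.08, 0.02]},
    index=['Safari', 'Iceweasel', 'Comodo Dragon', 'IE10', 'Chrome'],
)

# fillna returns a new object, so the result has to be assigned
filled = df.fillna(value=0.05555)

# A dict fills each column with its own replacement value
per_column = df.fillna({
    'http_status': 0,
    'response_time': df['response_time'].mean(),
})

# Forward fill: carry the last valid row down into the gaps
carried = df.ffill()

print(filled)
print(per_column)
print(carried)
```

Forward filling only makes sense when row order is meaningful (for example a time series); for unordered labels like these it is shown purely to demonstrate the call.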


#### 4.Get boolean mask where values are nan
- isnull

In [23]:
pd.isnull(browser_df_2)

Unnamed: 0,http_status,response_time
Safari,False,False
Iceweasel,True,True
Comodo Dragon,True,True
IE10,False,False
Chrome,False,False


##### We can also get the count

In [27]:
browser_df_2['http_status'].isnull().value_counts()

False    3
True     2
Name: http_status, dtype: int64
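
Because True counts as 1, summing the boolean mask gives the number of missing values per column directly. A short sketch on a hand-built copy of the reindexed browser frame:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {'http_status': [404.0, np.nan, np.nan, 404.0, 200.0],
     'response_time': [0.07, np.nan, np.nan, 0.08, 0.02]},
    index=['Safari', 'Iceweasel', 'Comodo Dragon', 'IE10', 'Chrome'],
)

missing_per_column = df.isnull().sum()        # number of NaN values in each column
rows_with_gaps = df[df.isnull().any(axis=1)]  # rows that have at least one NaN
complete_rows = df[df.notnull().all(axis=1)]  # rows with no NaN at all

print(missing_per_column)
print(rows_with_gaps.index.tolist())
print(complete_rows.index.tolist())
```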

#### 5.Extract string data

In [32]:
sample_df_2['Fruits'].str.extract(r'(\w+)', expand=False)  # raw string for the regex; expand=False returns a Series

2016-07-01         apple
2016-07-02        orange
2016-07-03       bananas
2016-07-04    strawberry
2016-07-05     blueberry
2016-07-06     pineapple
Freq: D, Name: Fruits, dtype: object
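
str.extract becomes more useful with several capture groups, where each group turns into a column. A minimal sketch on the same fruit names; the pattern and the group names prefix and suffix are made up for this illustration:

```python
import pandas as pd

fruits = pd.Series(
    ['apple', 'orange', 'bananas', 'strawberry', 'blueberry', 'pineapple']
)

# Named groups become column names; rows that do not match are filled with NaN
parts = fruits.str.extract(r'^(?P<prefix>\w+?)(?P<suffix>berry)$')

print(parts)
```

Non-matching rows come back as NaN, so string extraction is itself a common source of missing data.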

##### 6.Replace string data

In [35]:
sample_df_2['new col']=sample_df_2['Fruits'].str.replace('^apple','pandu',regex=True)  # ^ anchors the match at the start, so 'pineapple' is untouched
sample_df_2

Unnamed: 0,A,B,C,D,Fruits,Extra Data,G,new col
2016-07-01,0,1,2,3,apple,4,7,pandu
2016-07-02,4,5,6,7,orange,7,107,orange
2016-07-03,8,9,10,11,bananas,10,207,bananas
2016-07-04,12,13,14,15,strawberry,13,307,strawberry
2016-07-05,16,17,18,19,blueberry,16,407,blueberry
2016-07-06,20,21,22,23,pineapple,19,507,pineapple
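
Whether the pattern is treated as a regular expression matters here. The sketch below contrasts the two modes on a two-item series; passing regex explicitly avoids depending on the default, which differs between pandas versions:

```python
import pandas as pd

fruits = pd.Series(['apple', 'pineapple'])

# Regex mode: ^ anchors at the start, so only the whole-word 'apple' changes
anchored = fruits.str.replace('^apple', 'pandu', regex=True)

# Literal mode: every occurrence of the substring 'apple' is replaced
literal = fruits.str.replace('apple', 'pandu', regex=False)

# Literal mode with the caret: no string contains the text '^apple', so nothing changes
no_match = fruits.str.replace('^apple', 'pandu', regex=False)

print(anchored.tolist())
print(literal.tolist())
print(no_match.tolist())
```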


##### 7.NaN propagates during arithmetic operations

In [24]:
browser_df_2*17

Unnamed: 0,http_status,response_time
Safari,6868.0,1.19
Iceweasel,,
Comodo Dragon,,
IE10,6868.0,1.36
Chrome,3400.0,0.34
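
Propagation also applies when two objects are combined. A hedged sketch with two made-up series; add() with fill_value is one way to treat a value missing on one side as a default instead of letting NaN win:

```python
import numpy as np
import pandas as pd

s1 = pd.Series([1.0, np.nan, 3.0])
s2 = pd.Series([10.0, 20.0, np.nan])

# Plain arithmetic: any NaN operand makes the result NaN
plain = s1 + s2

# add() with fill_value substitutes 0 where only one side is missing
patched = s1.add(s2, fill_value=0)

print(plain.tolist())
print(patched.tolist())
```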


##### 8.Example

In [44]:
import pandas as pd
df=pd.read_csv('/Users/syamkumarj/Documents/Workspace/python_anaconda_workspace/Pandas/Employees.csv'
              ,sep=',',parse_dates=[0])

In [45]:
df

Unnamed: 0,Date,Temp 1,Temp 2,City choice
0,2016-01-01,24.0,26.0,2
1,2016-01-02,,27.0,2
2,2016-01-03,22.0,,1
3,2016-01-04,23.0,24.0,2
4,2016-01-05,25.0,24.0,2


Here the empty cells in the CSV were automatically read as NaN.

In [47]:
df.describe()

Unnamed: 0,Temp 1,Temp 2,City choice
count,4.0,4.0,5.0
mean,23.5,25.25,1.8
std,1.290994,1.5,0.447214
min,22.0,24.0,1.0
25%,22.75,24.0,2.0
50%,23.5,25.0,2.0
75%,24.25,26.25,2.0
max,25.0,27.0,2.0


In [48]:
df.fillna(df.mean(numeric_only=True)) # takes the mean of each numeric column and fills that column's gaps with it
# that is why the missing value in Temp 1 becomes 23.5 while the one in Temp 2 becomes 25.25

Unnamed: 0,Date,Temp 1,Temp 2,City choice
0,2016-01-01,24.0,26.0,2
1,2016-01-02,23.5,27.0,2
2,2016-01-03,22.0,25.25,1
3,2016-01-04,23.0,24.0,2
4,2016-01-05,25.0,24.0,2
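
For ordered data such as daily temperatures, interpolation is a possible alternative to the column mean. This is an added illustration, not something the original notebook does; the frame is rebuilt by hand from the table above instead of reading Employees.csv:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Date': pd.date_range('2016-01-01', periods=5),
    'Temp 1': [24.0, np.nan, 22.0, 23.0, 25.0],
    'Temp 2': [26.0, 27.0, np.nan, 24.0, 24.0],
})
temps = df[['Temp 1', 'Temp 2']]

# Column mean: every gap in a column gets the same value
mean_filled = temps.fillna(temps.mean())

# Linear interpolation: each gap gets the midpoint of its neighbours
interpolated = temps.interpolate()

print(mean_filled)
print(interpolated)
```

The mean ignores where the gap sits in the sequence, while interpolation uses the surrounding days; which one is appropriate depends on the data.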


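##### 9.Example: bad CSV file
- we have two columns but one row has 3 values
- when you read such a file normally, you get a parser error
- in order to tell the parser to skip those bad lines, we need to  
pass **error_bad_lines=False** (the default, True, raises the error)
- error_bad_lines was deprecated in pandas 1.3 and removed in 2.0; newer versions use on_bad_lines='skip' instead

In [53]:
import pandas as pd
df=pd.read_csv('/Users/syamkumarj/Documents/Workspace/python_anaconda_workspace/Pandas/mybad.csv'
              , error_bad_lines=False)  # False skips malformed rows; on pandas >= 1.3 use on_bad_lines='skip'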
df

Unnamed: 0,Char,No1,No2,Unnamed: 3
0,A,24,26,
1,B,22,27,2.0
2,C,22,22,
3,D,23,24,
4,E,25,24,
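
A self-contained sketch of skipping malformed rows, assuming pandas 1.3 or newer, where on_bad_lines is available. The CSV text is a made-up stand-in for mybad.csv (whose exact contents are not shown here), with one row carrying an extra field:

```python
import io

import pandas as pd

# Hypothetical file contents: the header has 3 fields, row B has 4
csv_text = "Char,No1,No2\nA,24,26\nB,22,27,2\nC,22,22\n"

# Default behaviour: the extra field makes the parser raise an error
try:
    pd.read_csv(io.StringIO(csv_text))
    raised = False
except pd.errors.ParserError:
    raised = True

# on_bad_lines='skip' silently drops the malformed row instead
skipped = pd.read_csv(io.StringIO(csv_text), on_bad_lines='skip')

print(raised)
print(skipped)
```

Skipping is convenient but loses data without notice; on_bad_lines='warn' reports each dropped line and may be a safer first step.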
