# Handling Missing Data

## What is missing data?

Not all missing data is equal. At the heart of the matter, there exists the need to distinguish between two types of missingness:

* **Unknown but existing data**: This is data that we know exists, but whose value we do not know because of sparse or incomplete sampling. There is some value there, and it would be useful to apply some sort of missing data interpolation technique in order to discover it.

   For example, in 2013 *The New York Times* published [a survey of income mobility in the United States](http://www.nytimes.com/2013/07/22/business/in-climbing-income-ladder-location-matters.html). As often happens in datasets that drill this deep (to the county level), there were several counties for which the newspaper could not trace data. Yet it would be possible, and easy, if truly necessary, to interpolate reasonable values for these counties based on data from the surrounding ones, for instance, or on data from other counties with similar demographic profiles. This is, fundamentally speaking, data that *can* be filled by some means.

<!-- ![Map of the US by Income Mobility](location_matters.png "Map of the US by Income Mobility") -->
 
  
* **Data that doesn't exist**: data that does not exist at all, in any shape or form.

  For example, it would make no sense to ask the average household income for residents of an industrial park or other such location where no people actually live. It would not *really* make sense to use 0 as a [sentinel value](https://en.wikipedia.org/wiki/Sentinel_value) in this case, either, because the existence of such a number implies the existence of people for whom an average can be taken. Otherwise, in trying to compute an average you are making a [divide by zero error](https://en.wikipedia.org/wiki/Division_by_zero)! This is, fundamentally speaking, data that *cannot* be filled by any means.


This is an important distinction to keep in mind, and implementing it in some standard way significantly complicates the picture. It means that to ask the question "is this data entry filled?" one must actually consider three possible answers: "Yes", "No, but it can be", and "No, and it cannot be". There seem to be two dominant paradigms for handling this distinction:

* **Bitpatterns**: Embed sentinel values into the array itself. For instance, for integer data one might take `0` or `-9999` to signal unknown but existent data. This requires no overhead but can be confusing and often robs you of values that you might otherwise want to use (like `0` or `-9999`).


* **Masks**: Use a separate boolean array to "mask" the data whenever missing data needs to be represented. This requires making a second array and knowing when to apply it to the dataset, but is more robust.

[NumPy](http://www.numpy.org/) is the linear algebra and vectorized mathematical operations library that underpins the Python scientific programming stack, and its methodologies inform how everything else works. NumPy has masks: these are provided via the `numpy.ma` module. But it has no native bitpatterns! There is still no performant native bitpattern `NA` type available whatsoever.
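To make the two paradigms concrete, here is a minimal sketch that computes the same mean both ways. The readings and the choice of `-9999` as the sentinel are assumptions made up for illustration.

```python
import numpy as np

# Illustrative data; -9999 is an assumed sentinel meaning "missing".
readings = np.array([12, -9999, 7, 3])

# Bitpattern approach: every computation has to remember to filter the sentinel.
sentinel_mean = readings[readings != -9999].mean()

# Mask approach: a separate boolean array marks the missing entries,
# and numpy.ma skips them automatically.
masked = np.ma.masked_array(readings, mask=(readings == -9999))
masked_mean = masked.mean()

print(sentinel_mean, masked_mean)
```

Both approaches give the same answer here. The difference is where the bookkeeping lives: in every expression that touches the data, or in a second array that travels with it.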

The lack of a native `NA` type, such as the one available in, say, R, is a **huge** problem for libraries like [Pandas](http://pandas.pydata.org/) that should be able to efficiently handle large datasets.

Indeed, **Pandas does not use the `numpy.ma` mask**. Masks are simply not performant enough for a library that is expected to handle literally millions of entries entirely in memory, as `pandas` does. Pandas instead defines and uses its own null value sentinels, particularly `NaN` (`np.nan`) for null numbers and `NaT` (a pseudo-native sentinel handled under the hood) for null datetimes; and then allows you to apply your own `isnull()` mask to your dataset (more on that shortly).
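As a small sketch of these sentinels in action, the following builds one numeric and one datetime series and asks Pandas which entries are null. The values are made up for illustration.

```python
import numpy as np
import pandas as pd

# Illustrative values only.
numbers = pd.Series([1.5, np.nan, 3.0])                  # NaN marks the missing number
dates = pd.Series(pd.to_datetime(["2021-01-01", None]))  # None becomes NaT

print(numbers.isnull().tolist())
print(dates.isnull().tolist())
print(dates)
```

In both cases the mask is computed on demand from the sentinel values rather than stored alongside the data.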


### ``None``: Pythonic missing data

The first sentinel value used by Pandas is ``None``, a Python singleton object that is often used for missing data in Python code.
Because it is a Python object, ``None`` cannot be used in any arbitrary NumPy/Pandas array, but only in arrays with data type ``'object'`` (i.e., arrays of Python objects):

In [2]:
import numpy as np
import pandas as pd

In [2]:
val = np.array([6, None, 7, 5, 6])
val

array([6, None, 7, 5, 6], dtype=object)

This ``dtype=object`` means that the best common type representation NumPy could infer for the contents of the array is that they are Python objects.
While this kind of object array is useful for some purposes, any operations on the data will be done at the Python level, with much more overhead than the typically fast operations seen for arrays with native types:

In [3]:
%timeit np.arange(1E5, dtype='object').sum()
print()

6.86 ms ± 332 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)



In [4]:
%timeit np.arange(1E5, dtype='int').sum()
print()

136 µs ± 3.9 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)



The use of Python objects in an array also means that if you perform aggregations like ``sum()`` or ``min()`` across an array with a ``None`` value, you will generally get an error, because operations such as addition between an integer and ``None`` are undefined.
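A short sketch of that failure, reusing the object array from above and catching the exception so the cell still runs to completion:

```python
import numpy as np

val = np.array([6, None, 7, 5, 6])

error_message = None
try:
    val.sum()  # adds Python objects one by one, so int + None is attempted
except TypeError as err:
    error_message = str(err)
    print("TypeError:", error_message)
```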

### ``NaN``: Missing numerical data

The other missing data representation, ``NaN`` (acronym for *Not a Number*), is different; it is a special floating-point value recognized by all systems that use the standard IEEE floating-point representation:

In [2]:
import numpy as np
val1 = np.array([1, np.nan, 7, 1, 8]) 
val1.dtype

dtype('float64')

Notice that NumPy preferred a native floating-point type for this array: this means that unlike the object array from before, this array supports fast operations pushed into compiled code.
You should be aware that ``NaN`` is a bit like a data virus: it infects any other object it touches.
Regardless of the operation, the result of arithmetic with ``NaN`` will be another ``NaN``:

In [3]:
6 + np.nan

nan

In [4]:
7 *  np.nan

nan

NumPy does provide some special aggregations that will ignore these missing values:

In [5]:
np.nansum(val1)

17.0

Keep in mind that ``NaN`` is specifically a floating-point value; there is no equivalent NaN value for integers, strings, or other types.
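The following sketch illustrates the point for integers: assigning ``NaN`` into an integer array fails, while a float copy of the same data accepts it. The array contents are arbitrary.

```python
import numpy as np

ints = np.array([1, 2, 3])

try:
    ints[0] = np.nan  # an integer array has no bit pattern for NaN
except ValueError as err:
    print("ValueError:", err)

# Converting to a floating-point type first makes room for NaN.
floats = ints.astype(float)
floats[0] = np.nan
print(floats)
```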

### NaN and None in Pandas

``NaN`` and ``None`` both have their place, and Pandas is built to handle the two of them nearly interchangeably, converting between them where appropriate:

In [5]:
import pandas as pd
pd.Series([1, np.nan, 2, None])

0    1.0
1    NaN
2    2.0
3    NaN
dtype: float64

Notice that in addition to casting the integer array to floating point, Pandas automatically converts the ``None`` to a ``NaN`` value.
(Be aware that there is a proposal to add a native integer NA to Pandas in the future; as of this writing, it has not been included).

While this type of magic may feel a bit hackish compared to the more unified approach to NA values in domain-specific languages like R, the Pandas sentinel/casting approach works quite well in practice and in my experience only rarely causes issues.

The following table lists the upcasting conventions in Pandas when NA values are introduced:

|Typeclass     | Conversion When Storing NAs | NA Sentinel Value      |
|--------------|-----------------------------|------------------------|
| ``floating`` | No change                   | ``np.nan``             |
| ``object``   | No change                   | ``None`` or ``np.nan`` |
| ``integer``  | Cast to ``float64``         | ``np.nan``             |
| ``boolean``  | Cast to ``object``          | ``None`` or ``np.nan`` |

Keep in mind that in Pandas, string data is always stored with an ``object`` dtype.
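The upcasting rows of the table can be checked with a few throwaway series. This is only a sketch using default constructor behaviour; the values are arbitrary.

```python
import pandas as pd

# Arbitrary values; each series contains one missing entry.
int_with_na = pd.Series([1, 2, None])
bool_with_na = pd.Series([True, False, None])
float_with_na = pd.Series([1.0, None])

print(int_with_na.dtype)    # float64: integers are cast to floating point
print(bool_with_na.dtype)   # object: booleans are cast to Python objects
print(float_with_na.dtype)  # float64: no change needed
```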

## Operating on Null Values

As we have seen, Pandas treats ``None`` and ``NaN`` as essentially interchangeable for indicating missing or null values.
To facilitate this convention, there are several useful methods for detecting, removing, and replacing null values in Pandas data structures.
They are:

- ``isnull()``: Generate a boolean mask indicating missing values
- ``notnull()``: Opposite of ``isnull()``
- ``dropna()``: Return a filtered version of the data
- ``fillna()``: Return a copy of the data with missing values filled or imputed

We will conclude this section with a brief exploration and demonstration of these routines.
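Before moving to a full dataframe, here is a minimal sketch of the four routines on a single series. The series contents are made up for illustration.

```python
import numpy as np
import pandas as pd

data = pd.Series([1.0, np.nan, 3.0, None])

mask = data.isnull()          # True where a value is missing
present = data[data.notnull()]  # boolean masks can be used directly as an index
dropped = data.dropna()       # same result as the masking above
filled = data.fillna(0)       # a copy with missing values replaced by 0

print(mask.tolist())
print(dropped.tolist())
print(filled.tolist())
```

Note that none of these calls modifies `data` itself; each returns a new object.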

### Create dataframe with missing values

In [21]:
raw_data = {'first_name': ['Jason', np.nan, 'Tina', 'Jake', 'Amy'], 
        'last_name': ['Miller', np.nan, 'Ali', 'Milner', 'Cooze'], 
        'age': [42, np.nan, 36, 24, 73], 
        'sex': ['m', 'm', 'f', 'm', 'f'], 
        'preTestScore': [4, np.nan, np.nan, 2, 3],
        'postTestScore': [25, np.nan, np.nan, 62, 70]}
df = pd.DataFrame(raw_data, columns = ['first_name', 'last_name', 'age', 'sex', 'preTestScore', 'postTestScore'])
df

Unnamed: 0,first_name,last_name,age,sex,preTestScore,postTestScore
0,Jason,Miller,42.0,m,4.0,25.0
1,,,,m,,
2,Tina,Ali,36.0,f,,
3,Jake,Milner,24.0,m,2.0,62.0
4,Amy,Cooze,73.0,f,3.0,70.0


### Drop missing observations

In [17]:
df_no_missing = df.dropna()
df_no_missing

Unnamed: 0,first_name,last_name,age,sex,preTestScore,postTestScore
0,Jason,Miller,42.0,m,4.0,25.0
3,Jake,Milner,24.0,m,2.0,62.0
4,Amy,Cooze,73.0,f,3.0,70.0


### Drop rows where all cells in that row are NA

In [8]:
df_cleaned = df.dropna(how='all')
df_cleaned

Unnamed: 0,first_name,last_name,age,sex,preTestScore,postTestScore
0,Jason,Miller,42.0,m,4.0,25.0
1,,,,m,,
2,Tina,Ali,36.0,f,,
3,Jake,Milner,24.0,m,2.0,62.0
4,Amy,Cooze,73.0,f,3.0,70.0


### Create a new column full of missing values

In [9]:
df['location'] = np.nan
df

Unnamed: 0,first_name,last_name,age,sex,preTestScore,postTestScore,location
0,Jason,Miller,42.0,m,4.0,25.0,
1,,,,m,,,
2,Tina,Ali,36.0,f,,,
3,Jake,Milner,24.0,m,2.0,62.0,
4,Amy,Cooze,73.0,f,3.0,70.0,


### Drop columns if they only contain missing values

In [10]:
df.dropna(axis=1, how='all')

Unnamed: 0,first_name,last_name,age,sex,preTestScore,postTestScore
0,Jason,Miller,42.0,m,4.0,25.0
1,,,,m,,
2,Tina,Ali,36.0,f,,
3,Jake,Milner,24.0,m,2.0,62.0
4,Amy,Cooze,73.0,f,3.0,70.0


### Drop rows that contain fewer than two observations

In [12]:
df.dropna(thresh=2)

Unnamed: 0,first_name,last_name,age,sex,preTestScore,postTestScore,location
0,Jason,Miller,42.0,m,4.0,25.0,
2,Tina,Ali,36.0,f,,,
3,Jake,Milner,24.0,m,2.0,62.0,
4,Amy,Cooze,73.0,f,3.0,70.0,
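The `thresh` argument is easy to misread: it is the minimum number of *non-missing* values a row (or, with `axis=1`, a column) must have in order to be kept. A small sketch on a made-up frame shows the difference between the two axes:

```python
import numpy as np
import pandas as pd

# Made-up frame for illustration.
scores = pd.DataFrame({
    "a": [1.0, np.nan, 3.0],
    "b": [np.nan, np.nan, 6.0],
    "c": [7.0, np.nan, np.nan],
})

# Default axis=0: keep rows with at least 2 non-missing values.
rows_kept = scores.dropna(thresh=2)

# axis=1: keep columns with at least 2 non-missing values.
cols_kept = scores.dropna(thresh=2, axis=1)

print(rows_kept.index.tolist())
print(cols_kept.columns.tolist())
```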


### Fill in missing data with zeros

In [13]:
df.fillna(0)

Unnamed: 0,first_name,last_name,age,sex,preTestScore,postTestScore,location
0,Jason,Miller,42.0,m,4.0,25.0,0.0
1,0,0,0.0,m,0.0,0.0,0.0
2,Tina,Ali,36.0,f,0.0,0.0,0.0
3,Jake,Milner,24.0,m,2.0,62.0,0.0
4,Amy,Cooze,73.0,f,3.0,70.0,0.0


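### Fill in missing values in preTestScore with the mean value of preTestScore

Assigning the result back to the column saves the change in `df` right away. (Calling `fillna(..., inplace=True)` on a selected column is unreliable in recent pandas versions, because the selection may be a copy.)

In [15]:
# Replace missing preTestScore values with the column mean
df["preTestScore"] = df["preTestScore"].fillna(df["preTestScore"].mean())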
df

Unnamed: 0,first_name,last_name,age,sex,preTestScore,postTestScore,location
0,Jason,Miller,42.0,m,4.0,25.0,
1,,,,m,3.0,,
2,Tina,Ali,36.0,f,3.0,,
3,Jake,Milner,24.0,m,2.0,62.0,
4,Amy,Cooze,73.0,f,3.0,70.0,


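### Fill in missing values in postTestScore with each sex's mean value of postTestScore

In [27]:
print(df)
# transform("mean") returns the group mean aligned to each row, so it can be used as the fill value
df["postTestScore"] = df["postTestScore"].fillna(df.groupby("sex")["postTestScore"].transform("mean"))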
df

  first_name last_name   age sex  preTestScore  postTestScore
0      Jason    Miller  42.0   m           4.0           25.0
1        NaN       NaN   NaN   m           NaN           43.5
2       Tina       Ali  36.0   f           NaN           70.0
3       Jake    Milner  24.0   m           2.0           62.0
4        Amy     Cooze  73.0   f           3.0           70.0


Unnamed: 0,first_name,last_name,age,sex,preTestScore,postTestScore
0,Jason,Miller,42.0,m,4.0,25.0
1,,,,m,,43.5
2,Tina,Ali,36.0,f,,70.0
3,Jake,Milner,24.0,m,2.0,62.0
4,Amy,Cooze,73.0,f,3.0,70.0


### Select some rows but ignore the missing data points

In [27]:
df[df['age'].notnull() & df['sex'].notnull()]

Unnamed: 0,first_name,last_name,age,sex,preTestScore,postTestScore
0,Jason,Miller,42.0,m,4.0,25.0
2,Tina,Ali,36.0,f,3.0,70.0
3,Jake,Milner,24.0,m,2.0,62.0
4,Amy,Cooze,73.0,f,3.0,70.0


### Back-fill and forward-fill missing values

In [31]:
# Back-fill: replace each missing value with the next valid value below it
df.bfill()

Unnamed: 0,first_name,last_name,age,sex,preTestScore,postTestScore
0,Jason,Miller,42.0,m,4.0,25.0
1,Tina,Ali,36.0,m,2.0,43.5
2,Tina,Ali,36.0,f,2.0,70.0
3,Jake,Milner,24.0,m,2.0,62.0
4,Amy,Cooze,73.0,f,3.0,70.0


In [37]:
# Forward-fill: replace each missing value with the last valid value above it
df.ffill()

Unnamed: 0,first_name,last_name,age,sex,preTestScore,postTestScore
0,Jason,Miller,42.0,m,4.0,25.0
1,Jason,Miller,42.0,m,3.0,43.5
2,Tina,Ali,36.0,f,3.0,70.0
3,Jake,Milner,24.0,m,2.0,62.0
4,Amy,Cooze,73.0,f,3.0,70.0


### Outlier Treatment

In [32]:
import pandas as pd
import numpy as np
import os
os.chdir('D:/Education/My Courses/Acadgild/Data Science Python/Batch/Week 5/Session 9/Datasets')
os.listdir()

['bb_data.csv',
 'dob.csv',
 'LoanPredictionTesting.csv',
 'LoanPredictionTraining.csv',
 'Sample - Superstore (1).xls',
 'sample_tsv.txt',
 'superstore_np.xls',
 'winequality-white.csv',
 'x.npy']

In [33]:
df = pd.read_csv('LoanPredictionTesting.csv')
df.head()

Unnamed: 0,Loan_ID,Gender,Married,Dependents,Education,Self_Employed,ApplicantIncome,CoapplicantIncome,LoanAmount,Loan_Amount_Term,Credit_History,Property_Area
0,LP001015,Male,Yes,0,Graduate,No,5720,0,110.0,360.0,1.0,Urban
1,LP001022,Male,Yes,1,Graduate,No,3076,1500,126.0,360.0,1.0,Urban
2,LP001031,Male,Yes,2,Graduate,No,5000,1800,208.0,360.0,1.0,Urban
3,LP001035,Male,Yes,2,Graduate,No,2340,2546,100.0,360.0,,Urban
4,LP001051,Male,No,0,Not Graduate,No,3276,0,78.0,360.0,1.0,Urban


In [37]:
# Check the distribution of loan amount
# IQR rule for outlier detection
Q1 = df["LoanAmount"].quantile(0.25)
Q3 = df["LoanAmount"].quantile(0.75)
IQR = Q3 - Q1
lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR
print(upper_bound)

# Check for outliers: values outside [lower_bound, upper_bound]
IQR_check = (df["LoanAmount"] < lower_bound) | (df["LoanAmount"] > upper_bound)
print("Total outlier data points based on IQR Rule:", df.loc[IQR_check, :].shape[0])
df.loc[IQR_check, :]

# Flooring: replace low outliers with the 5th percentile
df.loc[df["LoanAmount"] < lower_bound, "LoanAmount"] = df["LoanAmount"].quantile(0.05)

# Capping: replace high outliers with the 95th percentile
print(df["LoanAmount"].quantile(0.95))
df.loc[df["LoanAmount"] > upper_bound, "LoanAmount"] = df["LoanAmount"].quantile(0.95)


244.625
Total outlier data points based on IQR Rule: 18
239.74999999999994


In [38]:
df.iloc[8]

Loan_ID              LP001059
Gender                   Male
Married                   Yes
Dependents                  2
Education            Graduate
Self_Employed             NaN
ApplicantIncome         13633
CoapplicantIncome           0
LoanAmount             239.75
Loan_Amount_Term          240
Credit_History              1
Property_Area           Urban
Name: 8, dtype: object
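Since the cells above depend on a local CSV file, here is a self-contained sketch of the same flooring and capping idea on synthetic data. The helper name `cap_outliers_iqr`, its parameters, and the sample values are all assumptions made for illustration. One deliberate difference from the cells above: both replacement percentiles are computed from the original data before anything is modified.

```python
import pandas as pd

def cap_outliers_iqr(values, k=1.5, lower_q=0.05, upper_q=0.95):
    """Hypothetical helper: floor/cap values outside the IQR fences."""
    q1 = values.quantile(0.25)
    q3 = values.quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr

    # Replacement values come from the untouched data.
    floor_value = values.quantile(lower_q)
    cap_value = values.quantile(upper_q)

    # mask() replaces entries where the condition is True.
    capped = values.mask(values < lower, floor_value)
    capped = capped.mask(values > upper, cap_value)
    return capped, (lower, upper)

# Synthetic loan amounts: 100, 110, ..., 290 plus one extreme value.
amounts = pd.Series(list(range(100, 300, 10)) + [1000], dtype=float)
capped, (lower, upper) = cap_outliers_iqr(amounts)

print("fences:", lower, upper)
print("largest value before:", amounts.max(), "after:", capped.max())
```

Whether replacing outliers with a percentile is appropriate depends on the analysis; the sketch only shows the mechanics.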

# Hierarchical Indexing

Up to this point we've been focused primarily on one-dimensional and two-dimensional data, stored in Pandas ``Series`` and ``DataFrame`` objects, respectively.
Often it is useful to go beyond this and store higher-dimensional data–that is, data indexed by more than one or two keys.
While Pandas does provide ``Panel`` and ``Panel4D`` objects that natively handle three-dimensional and four-dimensional data (both were deprecated and removed in later releases), a far more common pattern in practice is to make use of *hierarchical indexing* (also known as *multi-indexing*) to incorporate multiple index *levels* within a single index.
In this way, higher-dimensional data can be compactly represented within the familiar one-dimensional ``Series`` and two-dimensional ``DataFrame`` objects.

In this section, we'll explore the direct creation of ``MultiIndex`` objects, considerations when indexing, slicing, and computing statistics across multiply indexed data, and useful routines for converting between simple and hierarchically indexed representations of your data.
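As a first taste, a ``MultiIndex`` can be built directly from a list of tuples. The sketch below uses made-up scores and shows two common ways of selecting from the result.

```python
import pandas as pd

# Illustrative labels and values.
index = pd.MultiIndex.from_tuples(
    [("Nighthawks", "1st"), ("Nighthawks", "2nd"),
     ("Dragoons", "1st"), ("Dragoons", "2nd")],
    names=["regiment", "company"],
)
scores = pd.Series([25, 57, 70, 94], index=index, name="postTestScore")
scores = scores.sort_index()  # sorted indexes make lookups and slicing well-behaved

# Partial indexing: give only the outer level to get everything beneath it.
nighthawks = scores.loc["Nighthawks"]

# Cross-section: select on an inner level by name.
second_companies = scores.xs("2nd", level="company")

print(nighthawks)
print(second_companies)
```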

The examples below reuse the standard imports from earlier (`numpy` as `np` and `pandas` as `pd`).

### Create dataframe

In [39]:
raw_data = {'regiment': ['Nighthawks', 'Nighthawks', 'Nighthawks', 'Nighthawks', 'Dragoons', 'Dragoons', 'Dragoons', 'Dragoons', 'Scouts', 'Scouts', 'Scouts', 'Scouts'], 
        'company': ['1st', '1st', '2nd', '2nd', '1st', '1st', '2nd', '2nd','1st', '1st', '2nd', '2nd'], 
        'name': ['Miller', 'Jacobson', 'Ali', 'Milner', 'Cooze', 'Jacon', 'Ryaner', 'Sone', 'Sloan', 'Piger', 'Riani', 'Ali'], 
        'preTestScore': [4, 24, 31, 2, 3, 4, 24, 31, 2, 3, 2, 3],
        'postTestScore': [25, 94, 57, 62, 70, 25, 94, 57, 62, 70, 62, 70]}
df = pd.DataFrame(raw_data, columns = ['regiment', 'company', 'name', 'preTestScore', 'postTestScore'])
df

Unnamed: 0,regiment,company,name,preTestScore,postTestScore
0,Nighthawks,1st,Miller,4,25
1,Nighthawks,1st,Jacobson,24,94
2,Nighthawks,2nd,Ali,31,57
3,Nighthawks,2nd,Milner,2,62
4,Dragoons,1st,Cooze,3,70
5,Dragoons,1st,Jacon,4,25
6,Dragoons,2nd,Ryaner,24,94
7,Dragoons,2nd,Sone,31,57
8,Scouts,1st,Sloan,2,62
9,Scouts,1st,Piger,3,70
10,Scouts,2nd,Riani,2,62
11,Scouts,2nd,Ali,3,70


### Set the hierarchical index without modifying the dataframe in place

In [43]:
df.set_index(['regiment', 'company'], drop = True)

Unnamed: 0_level_0,Unnamed: 1_level_0,name,preTestScore,postTestScore
regiment,company,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
Nighthawks,1st,Miller,4,25
Nighthawks,1st,Jacobson,24,94
Nighthawks,2nd,Ali,31,57
Nighthawks,2nd,Milner,2,62
Dragoons,1st,Cooze,3,70
Dragoons,1st,Jacon,4,25
Dragoons,2nd,Ryaner,24,94
Dragoons,2nd,Sone,31,57
Scouts,1st,Sloan,2,62
Scouts,1st,Piger,3,70
Scouts,2nd,Riani,2,62
Scouts,2nd,Ali,3,70


### Set the hierarchical index to be by regiment, and then by company

In [44]:
df.set_index(['regiment', 'company'], inplace = True)
df

Unnamed: 0_level_0,Unnamed: 1_level_0,name,preTestScore,postTestScore
regiment,company,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
Nighthawks,1st,Miller,4,25
Nighthawks,1st,Jacobson,24,94
Nighthawks,2nd,Ali,31,57
Nighthawks,2nd,Milner,2,62
Dragoons,1st,Cooze,3,70
Dragoons,1st,Jacon,4,25
Dragoons,2nd,Ryaner,24,94
Dragoons,2nd,Sone,31,57
Scouts,1st,Sloan,2,62
Scouts,1st,Piger,3,70
Scouts,2nd,Riani,2,62
Scouts,2nd,Ali,3,70


### View the index

In [25]:
df.index

MultiIndex(levels=[['Dragoons', 'Nighthawks', 'Scouts'], ['1st', '2nd']],
           labels=[[1, 1, 1, 1, 0, 0, 0, 0, 2, 2, 2, 2], [0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 1, 1]],
           names=['regiment', 'company'])

### Swap the levels in the index

In [45]:
df.swaplevel('regiment', 'company')

Unnamed: 0_level_0,Unnamed: 1_level_0,name,preTestScore,postTestScore
company,regiment,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
1st,Nighthawks,Miller,4,25
1st,Nighthawks,Jacobson,24,94
2nd,Nighthawks,Ali,31,57
2nd,Nighthawks,Milner,2,62
1st,Dragoons,Cooze,3,70
1st,Dragoons,Jacon,4,25
2nd,Dragoons,Ryaner,24,94
2nd,Dragoons,Sone,31,57
1st,Scouts,Sloan,2,62
1st,Scouts,Piger,3,70
2nd,Scouts,Riani,2,62
2nd,Scouts,Ali,3,70


### Summarize the results by regiment

In [46]:
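# df.sum(level=...) was removed in pandas 2.0; group by the index level instead
df.groupby(level='regiment').sum(numeric_only=True)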

Unnamed: 0_level_0,preTestScore,postTestScore
regiment,Unnamed: 1_level_1,Unnamed: 2_level_1
Dragoons,62,246
Nighthawks,61,238
Scouts,10,264
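Converting between hierarchical and simple representations can be sketched with `unstack()` and `reset_index()`. The small frame below is an assumption for illustration; its totals are per-company sums taken from the regiment data above.

```python
import pandas as pd

# Per-company preTestScore totals for two regiments (illustrative subset).
scores = pd.DataFrame({
    "regiment": ["Nighthawks", "Nighthawks", "Dragoons", "Dragoons"],
    "company": ["1st", "2nd", "1st", "2nd"],
    "preTestScore": [28, 33, 7, 55],
}).set_index(["regiment", "company"])

# unstack() pivots an index level into columns: one row per regiment.
wide = scores["preTestScore"].unstack("company")

# reset_index() turns the index levels back into ordinary columns.
flat = scores.reset_index()

print(wide)
print(flat.columns.tolist())
```

The wide form is convenient for side-by-side comparison, while the flat form is what most file formats and plotting routines expect.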
