This notebook describes what **missing data** are and how to correct them, then deals with **qualitative** data. These transformations are necessary in order to build our predictive models.

As usual, let's start by importing all the libraries we need!

In [None]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

In [None]:
# First, open the data
name = "train"
houses = pd.read_csv('data/' + name + '.csv')

In [None]:
# Then, look at .describe()


Indeed, it is not easy to read the number of missing values from **count**, so let's visualize how much data is missing in this dataset and which features are affected.
You will need to call two functions:
- `isnull()`, which returns a dataset of the same shape in which each missing value becomes `True` and every other value `False`;
- then `sum()`, which computes the sum of each column (`True` counts as 1), so that you get the number of missing values for each feature.
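
As a minimal sketch of how these two calls chain together, here is a toy dataset (the values are made up for illustration and are not taken from the real file):

```python
import numpy as np
import pandas as pd

# Toy dataset (hypothetical values) with a few missing entries
toy = pd.DataFrame({
    "Alley": ["Grvl", np.nan, np.nan, "Pave"],
    "LotFrontage": [65.0, np.nan, 80.0, 60.0],
    "SalePrice": [208500, 181500, 223500, 140000],
})

# isnull() marks each missing value with True and every other value with False
mask = toy.isnull()

# sum() adds up each column; True counts as 1 and False as 0
missing_per_column = mask.sum()
print(missing_per_column)
```

The result is a Series indexed by column name, which is convenient for the filtering and sorting steps that follow.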

In [None]:
numNullCol = 
numNullCol

Then select only the non-zero counts and answer the following questions:
- How many features are affected by missing data?
- How many missing values are there in total?
- Which feature is the most affected? Use the function `.sort_values()` to sort the result.
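
The three questions map onto three small operations on a Series of counts. The sketch below uses hypothetical feature names and counts, so the numbers it prints say nothing about the real dataset:

```python
import pandas as pd

# Hypothetical missing-value counts per feature
counts = pd.Series({"A": 0, "B": 3, "C": 0, "D": 7, "E": 1})

# Keep only the features that have at least one missing value
affected = counts[counts > 0]

# Sort from the most to the least affected feature
affected = affected.sort_values(ascending=False)

print(len(affected))      # number of affected features
print(affected.sum())     # total number of missing values
print(affected.index[0])  # name of the most affected feature
```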

In [None]:
numNullCol = 
numNullCol

However, a good visualization is worth more than words: use the function `.plot.bar()` to display `numNullCol`.

# Missing values

Why do we even bother with missing data?

A computer needs a simple structure to work with, and when a value is missing it does not know how to interpret it. Does it mean that the data is not available? That the variable is not applicable in this case? Or simply that somebody forgot to type it into the file?

Dealing with missing data usually requires the expertise of a specialist. In our case it is pretty straightforward, but if you had to deal with missing data in a medical record, how would you replace a missing ECG? You would certainly need the help of a doctor!

So now, let's see how to clean this data. To do so, we have to look at the existing values of these columns to get a better idea of how to complete them. We will explore three examples together and you will have to do the same for the rest of them!

First let's look at `Alley`; the file data_description.md defines this variable as "Type of alley access".

In [None]:
houses.Alley.unique() 

This command shows all the values observed in the dataset; `nan` indicates a missing value, and the two other values indicate the type of alley. In this case, we can assume that a missing value means the house does not have an alley, so we will replace all the `nan` values with `Absent`.

In [None]:
# Replace all missing values with the given value (be careful to use the same type of data)
houses.Alley = houses.Alley.fillna('Absent')
houses.Alley.unique() # To verify there is no more 'nan'

Perfect, one is corrected! Let's look at the 18 others... That sounds quite long, doesn't it? So let's do a little refresher on loops in Python. You have certainly noticed some `for` loops in some previous functions.

In Python, a for loop allows you to iterate over a list.

    breakfast_list = ['bread', 'butter', 'jam']

    for item in breakfast_list:
        print(item)

The code above prints the following output:

    bread
    butter
    jam

So let's first look at the values present in the features `GarageType`, `GarageYrBlt`, `GarageFinish`, `GarageQual` and `GarageCond`. Complete the following loop to see the different values of each feature.

In [None]:
for feature in ['GarageType', 'GarageYrBlt', 'GarageFinish', 'GarageQual', 'GarageCond']:
    values = 
    print(feature, values)

So, it appears that GarageType, GarageFinish, GarageQual and GarageCond are qualitative variables describing the garage. It seems natural to assume that a house has no garage when we encounter a `nan` value. So, let's replace `nan` with "Absent".

Write a for loop that does this transformation.

In [None]:
for

We have not changed the `nan` values of the variable GarageYrBlt because it is a numeric feature and it is not a good idea to replace them with a string value. However, what should we do? This question is harder than it appears. Replacing the values with 0 can hurt the model if the model is too simple: a linear regression will treat the year as a numerical value and try to fit a line through the points, so a few points at year 0 can pull the line far away from the real trend. You can observe the change in the following plot:

In [None]:
plt.figure()
# x and y are passed as keywords: recent seaborn versions no longer accept them positionally
sns.regplot(x=houses.SalePrice, y=houses.GarageYrBlt, label="Nan Ignored", scatter_kws={"alpha": 0.25})
sns.regplot(x=houses.SalePrice, y=houses.GarageYrBlt.fillna(0), label="Nan replaced by zero", scatter_kws={"alpha": 0.25})
plt.legend()
plt.show()
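
The same effect can be checked numerically. The sketch below fits a straight line with `np.polyfit` on made-up years and prices (assumed values, not taken from the dataset), once ignoring the missing years and once replacing them with 0:

```python
import numpy as np

# Hypothetical construction years; np.nan marks houses without a garage
years = np.array([1960.0, 1975.0, 1990.0, 2005.0, np.nan, np.nan])
prices = np.array([120000.0, 150000.0, 180000.0, 210000.0, 90000.0, 100000.0])

# Fit 1: ignore the rows where the year is missing
known = ~np.isnan(years)
slope_ignored, _ = np.polyfit(years[known], prices[known], 1)

# Fit 2: replace the missing years with 0 and fit on all rows
years_zero = np.where(known, years, 0.0)
slope_zero, _ = np.polyfit(years_zero, prices, 1)

print(slope_ignored, slope_zero)
```

On this toy data the slope drops sharply once the two points at year 0 are included, which is the distortion a simple linear model suffers from.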

So, what should we do? List the pros and cons of each of the following options, then update the column according to your choice:
- Delete the column 
- Replace with a qualitative column
- Replace with 0 but use a model which can handle this error
- Compute the difference between YrSold and GarageYrBlt and replace the missing values with -1
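
To make the last option more concrete, here is a rough sketch on toy data. The column name `GarageAge` is an assumption introduced for illustration, and this is only one of the possible choices, not necessarily the best one:

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical values); YrSold and GarageYrBlt are the dataset's column names
toy = pd.DataFrame({
    "YrSold": [2008, 2007, 2010],
    "GarageYrBlt": [2003.0, np.nan, 1976.0],
})

# Age of the garage at the time of sale; NaN propagates through the subtraction
garage_age = toy["YrSold"] - toy["GarageYrBlt"]

# -1 cannot be a real age, so it flags "no garage" while staying close to the real values
toy["GarageAge"] = garage_age.fillna(-1)  # hypothetical new column
print(toy)
```

Compared with filling the year with 0, the sentinel -1 sits next to the range of real ages instead of roughly 2000 units away from it.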

In [None]:
houses.GarageYrBlt = 

Let's write a similar for loop for all the variables concerning the basement and correct them.

In [None]:
for 

Now, let's analyze a simpler quantitative variable: LotFrontage, which is the "Linear feet of street connected to property". It seems natural in this case to replace the missing values with 0.

In [None]:
houses.LotFrontage.unique()

In [None]:
houses.LotFrontage = houses.LotFrontage.fillna(0)

By now, only 7 features with missing values should remain. Handle them in the same way as before.

In [None]:
assert houses.isna().sum().sum() == 0, "Verify the cleaning section, it seems there are still some errors"

Perfect! You have now cleaned the data, and you certainly understand why data scientists spend so much time not only collecting but also analysing and cleaning data. You certainly don't want to redo this work next time, so let's save it.

In [None]:
houses.to_csv('data/' + name + '_cleaned.csv') # Save in a second file (it is important to always keep original)
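
One detail worth knowing when the saved file is read back: by default `to_csv` also writes the row index, which reappears as an extra column on reload. The sketch below illustrates this on a toy frame, using an in-memory buffer instead of a real file:

```python
import io
import pandas as pd

toy = pd.DataFrame({"Alley": ["Absent", "Grvl"], "LotFrontage": [0.0, 65.0]})

# Default behaviour: the row index is written as a first, unnamed column
buffer = io.StringIO()
toy.to_csv(buffer)
buffer.seek(0)
reloaded_default = pd.read_csv(buffer)
print(reloaded_default.columns.tolist())

# With index=False, only the original columns are written
buffer = io.StringIO()
toy.to_csv(buffer, index=False)
buffer.seek(0)
reloaded_clean = pd.read_csv(buffer)
print(reloaded_clean.columns.tolist())
```

Reading the file with `pd.read_csv(..., index_col=0)` is another way to avoid the extra column.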

# Qualitative data

As discussed in the first notebook, it is really important to deal with qualitative data because some models cannot handle them unless they are transformed.
First let's see which features are qualitative.

In [None]:
houses.select_dtypes(exclude='number').describe()

How many variables are qualitative? And how many unique values are there (if you sum over all the qualitative features)?
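
Both counts can also be computed directly rather than read off the table. A minimal sketch on toy data (made-up values) could look like this:

```python
import pandas as pd

# Toy dataset (hypothetical values): two qualitative columns and one numeric column
toy = pd.DataFrame({
    "Alley": ["Absent", "Grvl", "Pave"],
    "Street": ["Pave", "Pave", "Grvl"],
    "SalePrice": [208500, 181500, 223500],
})

# Keep only the non-numeric columns
qualitative = toy.select_dtypes(exclude="number")

n_features = qualitative.shape[1]       # number of qualitative features
n_values = qualitative.nunique().sum()  # unique values summed over these features
print(n_features, n_values)
```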

What can we do? Discuss the following possibilities:
- Ignore them
- Replace the different values with numbers (example: imagine a feature "ColorWall" with values "Green", "Blue", "Yellow"; you can replace all "Green" with 0, "Blue" with 1, and so on)
- Encode the feature with several columns (example: replace the feature "ColorWall" with three columns "ColorWallIsGreen", "ColorWallIsBlue" and "ColorWallIsYellow"; if the house has the value "Green" in "ColorWall", it will have 1 in "ColorWallIsGreen" and 0 in the two others)

We will use the dummy coding provided by pandas: `pd.get_dummies(dataset)`
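
To compare the last two possibilities, here is a small sketch on the "ColorWall" example. The integer coding uses pandas categories, which is one possible way to do it and an assumption on our side; note that pandas names the dummy columns `ColorWall_Green` rather than `ColorWallIsGreen`:

```python
import pandas as pd

# Toy feature taken from the example above
walls = pd.DataFrame({"ColorWall": ["Green", "Blue", "Yellow", "Green"]})

# Integer coding: one number per value (here the codes follow alphabetical order)
codes = walls["ColorWall"].astype("category").cat.codes
print(codes.tolist())

# Dummy coding: one indicator column per value
dummies = pd.get_dummies(walls)
print(dummies.columns.tolist())
```

The integer coding implies an order between the colors that does not exist, while the dummy coding avoids it at the cost of more columns.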

In [None]:
houses = 

In [None]:
houses.to_csv('data/' + name + '_cleaned_nocategorical.csv') # Save in a third file

In [None]:
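# Every column must now be numeric or boolean (recent pandas versions create boolean dummy columns)
assert len(houses.select_dtypes(include=['number', 'bool']).columns) == len(houses.columns), "There remain some categorical features"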

There is a second file, composed of the same features, that will be useful in the future.
In the cell that opens the data, near the top of the notebook, replace the value of the variable `name` with "kaggle", rerun the notebook by clicking on Kernel > Restart & Run All, and verify that every cell has run.