<img src='https://weclouddata.com/wp-content/uploads/2016/11/logo.png' width='30%'>
-------------

<h3 align='center'> Applied Machine Learning Course - Lab Week 6 </h3>
<h1 align='center'> Converting Categorical Features using DictVectorizer
 </h1>

<br>
<center align="left"> Developed by:</center>
<center align="left"> WeCloudData Academy </center>


# Background

In real-world datasets, test data often contains values of a categorical feature that never appear in the training data. `pandas.get_dummies()` cannot handle this case directly, because it builds columns from whatever values are present in the frame it is given, so the training and test matrices can end up with different columns. `sklearn`'s `DictVectorizer` is well suited to this kind of categorical feature extraction.

The idea is: 
1. As a transformer, when `fit` on training data, `DictVectorizer` memorizes the set of values seen for each categorical feature, together with a mapping from each value to a unique column number in the transformed feature matrix.
2. At test time, when it sees a value that was known at training time, the vectorizer looks up the `value -> column ID` mapping to find the column for that value; **when it sees an unknown value, it simply ignores it.**

All of this bookkeeping is handled by sklearn. All you need to do is call `DictVectorizer.fit()` on your training data and `DictVectorizer.transform()` on your test data.
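
As a minimal sketch of the difference, the toy example below encodes a training frame and a test frame that contains an unseen category. The `color` and `size` columns are made up for illustration and are not part of the Home Credit data.

```python
import pandas as pd
from sklearn.feature_extraction import DictVectorizer

# Toy frames (hypothetical data): "green" appears only in the test frame.
train = pd.DataFrame({"color": ["red", "blue"], "size": [1.0, 2.0]})
test = pd.DataFrame({"color": ["red", "green"], "size": [3.0, 4.0]})

# get_dummies encodes each frame independently, so the columns disagree.
train_cols = list(pd.get_dummies(train).columns)
test_cols = list(pd.get_dummies(test).columns)
print(train_cols)
print(test_cols)

# DictVectorizer learns the columns once, on the training data only.
vec = DictVectorizer(sparse=False)
X_tr = vec.fit_transform(train.to_dict(orient="records"))
X_te = vec.transform(test.to_dict(orient="records"))

print(vec.feature_names_)  # columns learned from the training data
print(X_te)                # the "green" row has zeros in every color column
```

Both transformed matrices share the columns learned during `fit`, so a model trained on `X_tr` can be applied to `X_te` without any manual column alignment.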

# Dataset

Read the competition overview at https://www.kaggle.com/c/home-credit-default-risk#description.
We are trying to predict whether each credit applicant is going to have payment difficulty or not.


In [8]:
import pandas as pd
import numpy as np
    

filename = 'data/application_train.csv'
train_df = pd.read_csv(filename)
train_df = train_df.sample(5000) # the original dataset is pretty huge, so we just randomly sample 5k out of it
train_df.shape

(5000, 122)

In [2]:
train_df.head()

| | SK_ID_CURR | TARGET | NAME_CONTRACT_TYPE | CODE_GENDER | FLAG_OWN_CAR | FLAG_OWN_REALTY | CNT_CHILDREN | AMT_INCOME_TOTAL | AMT_CREDIT | AMT_ANNUITY | ... | FLAG_DOCUMENT_18 | FLAG_DOCUMENT_19 | FLAG_DOCUMENT_20 | FLAG_DOCUMENT_21 | AMT_REQ_CREDIT_BUREAU_HOUR | AMT_REQ_CREDIT_BUREAU_DAY | AMT_REQ_CREDIT_BUREAU_WEEK | AMT_REQ_CREDIT_BUREAU_MON | AMT_REQ_CREDIT_BUREAU_QRT | AMT_REQ_CREDIT_BUREAU_YEAR |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 43349 | 150183 | 0 | Cash loans | M | Y | Y | 0 | 94500.0 | 199080.0 | 6381.0 | ... | 0 | 0 | 0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 2.0 |
| 230313 | 366773 | 0 | Cash loans | F | N | Y | 0 | 108000.0 | 528633.0 | 20263.5 | ... | 0 | 0 | 0 | 0 | NaN | NaN | NaN | NaN | NaN | NaN |
| 27966 | 132513 | 0 | Cash loans | F | Y | Y | 0 | 54000.0 | 385164.0 | 17095.5 | ... | 0 | 0 | 0 | 0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 3.0 |
| 67948 | 178796 | 0 | Cash loans | M | N | N | 0 | 72000.0 | 808650.0 | 26086.5 | ... | 0 | 0 | 0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 233411 | 370353 | 0 | Cash loans | F | N | N | 0 | 135000.0 | 490495.5 | 29335.5 | ... | 0 | 0 | 0 | 0 | 0.0 | 0.0 | 0.0 | 4.0 | 0.0 | 0.0 |


## Sanity check the number of object columns

In [26]:
train_df.select_dtypes(include=['category','object']).dtypes

NAME_CONTRACT_TYPE            object
CODE_GENDER                   object
FLAG_OWN_CAR                  object
FLAG_OWN_REALTY               object
NAME_TYPE_SUITE               object
NAME_INCOME_TYPE              object
NAME_EDUCATION_TYPE           object
NAME_FAMILY_STATUS            object
NAME_HOUSING_TYPE             object
OCCUPATION_TYPE               object
WEEKDAY_APPR_PROCESS_START    object
ORGANIZATION_TYPE             object
FONDKAPREMONT_MODE            object
HOUSETYPE_MODE                object
WALLSMATERIAL_MODE            object
EMERGENCYSTATE_MODE           object
dtype: object

## Split dataset into X and y

In [45]:
y = train_df['TARGET'].values
X = train_df.drop(['TARGET'],axis=1)

In [47]:
X.shape

(5000, 121)

Next, we need to prepare a list of dict-like objects, one per row, as input for the `DictVectorizer` (see the [documentation](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.DictVectorizer.html)). For example:

`[{'foo': 1, 'bar': 2}, {'foo': 3, 'baz': 1}]`

First, let's see how the `.iterrows()` method works:

In [56]:
df_test = pd.DataFrame([[1, 1.5],[3,5]], columns=['int', 'float'])
print(df_test)

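# iterrows() yields (index, row) pairs; each row is a Series with a single dtype,
# so in this all-numeric frame the int column is upcast to float
for index, row in df_test.iterrows():
    print(dict(row))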

   int  float
0    1    1.5
1    3    5.0
{'int': 1.0, 'float': 1.5}
{'int': 3.0, 'float': 5.0}


In [57]:
data = [dict(row) for _, row in X.iterrows()]  # `_` holds the row index, which we don't need
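
As an alternative, `DataFrame.to_dict(orient='records')` builds the same kind of list in a single call and is typically faster than looping with `.iterrows()`. The sketch below uses a small made-up frame (not the Home Credit data) to compare the two approaches; note that `to_dict` does not upcast the integer column.

```python
import pandas as pd

# Small all-numeric frame (hypothetical data), mirroring the iterrows() demo above.
df_demo = pd.DataFrame([[1, 1.5], [3, 5.0]], columns=["int", "float"])

# One dict per row, built in a single call.
records = df_demo.to_dict(orient="records")

# The row-by-row version used in this notebook.
via_iterrows = [dict(row) for _, row in df_demo.iterrows()]

print(records)
print(via_iterrows)

# Both lists describe the same values, but iterrows() upcasts the int column
# to float because each row is materialized as a single-dtype Series.
same_values = all(
    r[key] == it[key] for r, it in zip(records, via_iterrows) for key in r
)
print(same_values)
```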

## Split dataset into train and test


In [61]:
from sklearn.model_selection import train_test_split

data_train, data_val, y_train, y_val = train_test_split(data, y, test_size=0.2, stratify=y)

In [62]:
print(data_train[0])

{'SK_ID_CURR': 384265, 'NAME_CONTRACT_TYPE': 'Cash loans', 'CODE_GENDER': 'F', 'FLAG_OWN_CAR': 'N', 'FLAG_OWN_REALTY': 'N', 'CNT_CHILDREN': 1, 'AMT_INCOME_TOTAL': 135000.0, 'AMT_CREDIT': 675000.0, 'AMT_ANNUITY': 32602.5, 'AMT_GOODS_PRICE': 675000.0, 'NAME_TYPE_SUITE': 'Unaccompanied', 'NAME_INCOME_TYPE': 'Working', 'NAME_EDUCATION_TYPE': 'Secondary / secondary special', 'NAME_FAMILY_STATUS': 'Married', 'NAME_HOUSING_TYPE': 'House / apartment', 'REGION_POPULATION_RELATIVE': 0.009334, 'DAYS_BIRTH': -13440, 'DAYS_EMPLOYED': -6595, 'DAYS_REGISTRATION': -7535.0, 'DAYS_ID_PUBLISH': -1615, 'OWN_CAR_AGE': nan, 'FLAG_MOBIL': 1, 'FLAG_EMP_PHONE': 1, 'FLAG_WORK_PHONE': 1, 'FLAG_CONT_MOBILE': 1, 'FLAG_PHONE': 1, 'FLAG_EMAIL': 0, 'OCCUPATION_TYPE': 'High skill tech staff', 'CNT_FAM_MEMBERS': 3.0, 'REGION_RATING_CLIENT': 2, 'REGION_RATING_CLIENT_W_CITY': 2, 'WEEKDAY_APPR_PROCESS_START': 'WEDNESDAY', 'HOUR_APPR_PROCESS_START': 17, 'REG_REGION_NOT_LIVE_REGION': 0, 'REG_REGION_NOT_WORK_REGION': 0, 'LIV

## Convert categoricals to one-hot-encoding
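When your data comes as a list of dictionaries, scikit-learn's `DictVectorizer` will do the one-hot encoding for you: string values become `FEATURE=VAL` indicator columns, while numeric values are passed through unchanged. [Reference](https://jakevdp.github.io/PythonDataScienceHandbook/05.04-feature-engineering.html)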

In [63]:
from sklearn.feature_extraction import DictVectorizer
      
vectorizer = DictVectorizer()
X_train = vectorizer.fit_transform(data_train) # fit the DictVectorizer on training data

In [68]:
# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out() instead
features = vectorizer.get_feature_names_out()

In [69]:
print(f'Number of features after conversion: {len(features)}')  # any feature name of the form `FEATURE=VAL` comes from a categorical feature

Number of features after conversion: 245


We then transform the test data with the vectorizer that was fitted on the training set.

In [70]:
test_data = [{'EMERGENCYSTATE_MODE' : 'UNK', 'AMT_ANNUITY': 1000, 'APARTMENTS_AVG': 250}] # `UNK` is an unseen value for `EMERGENCYSTATE_MODE`
X_test = vectorizer.transform(test_data)


In [80]:
X_test.shape

(1, 245)

In [78]:
for (i, feature) in enumerate(features):
    if X_test[0, i]:
        print(f'{feature}: {X_test[0, i]}')

AMT_ANNUITY: 1000.0
APARTMENTS_AVG: 250.0


> **When `DictVectorizer` sees an unknown value, it simply ignores it.**
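
The same holds for feature names that were never seen during `fit`: they are silently dropped at transform time. A minimal sketch with made-up keys (`a`, `n`, and `new_key` are hypothetical):

```python
from sklearn.feature_extraction import DictVectorizer

vec = DictVectorizer(sparse=False)
vec.fit([{"a": "x", "n": 1.0}])

# "y" is an unseen value for "a"; "new_key" is an unseen feature name.
X_new = vec.transform([{"a": "y", "n": 2.0, "new_key": 5}])

print(vec.feature_names_)
print(X_new)  # only the known numeric feature "n" contributes a non-zero entry
```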

# Practice: 

Use PyCharm to finish the following:

1. Split the data into training and validation sets with `train_test_split`, as above.
2. Do proper feature extraction by fitting `DictVectorizer` on your training data, then use the fitted vectorizer to transform your validation data.
3. Fit a `BaggingClassifier` using `DecisionTreeClassifier` as the base estimator on your training data:
   - Use default values, except `oob_score=True`, which is required for the next step.
   - Reference: http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.BaggingClassifier.html#sklearn.ensemble.BaggingClassifier
4. Report the `oob_score_` of your fitted bagging model and its performance on your held-out **validation data**.
5. (Optional) Compare the validation performance of your bagging model with that of a standard `DecisionTree` model. 

Note: the original dataset is quite large. You can randomly sample a good amount of rows from it for this task.
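
As a starting point for step 3, here is a rough sketch of the API on synthetic data; it is not a solution to the exercise. The data from `make_classification` is a stand-in for your vectorized features, and `n_estimators=50` is an arbitrary choice made here so that every training row is very likely to be out-of-bag for at least one tree. The tree is passed positionally because the keyword for the base estimator was renamed between scikit-learn versions.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (an assumption); replace with your vectorized features.
X_demo, y_demo = make_classification(n_samples=500, n_features=10, random_state=0)
X_tr, X_val, y_tr, y_val = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=0
)

# oob_score=True asks the ensemble to score each training row using only
# the trees whose bootstrap sample did not contain that row.
bag = BaggingClassifier(
    DecisionTreeClassifier(),
    n_estimators=50,
    oob_score=True,
    random_state=0,
)
bag.fit(X_tr, y_tr)

val_acc = accuracy_score(y_val, bag.predict(X_val))
print(f"OOB score: {bag.oob_score_:.3f}")
print(f"Validation accuracy: {val_acc:.3f}")
```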