Data preprocessing
----------------------

Data preprocessing is one of the most important steps in preparing a model. Preparing and cleaning the data so that a predictive algorithm can process it often takes most of the project time; 90% is a commonly quoted rule of thumb.

Here we use the data from the Rossmann Store Sales competition on Kaggle: https://www.kaggle.com/c/rossmann-store-sales

It is a good example of a dataset with many different types of data.

In [53]:
import pandas as pd 

In [54]:
training_data = pd.read_csv("data/rossmann/train.csv")
store_data = pd.read_csv("data/rossmann/store.csv")

The training data contains the sales records, including `Sales` (our target).

In [61]:
print("Training data shape", training_data.shape)
training_data.head()

Training data shape (1017209, 9)


Unnamed: 0,Store,DayOfWeek,Date,Sales,Customers,Open,Promo,StateHoliday,SchoolHoliday
0,1,5,2015-07-31,5263,555,1,1,0,1
1,2,5,2015-07-31,6064,625,1,1,0,1
2,3,5,2015-07-31,8314,821,1,1,0,1
3,4,5,2015-07-31,13995,1498,1,1,0,1
4,5,5,2015-07-31,4822,559,1,1,0,1


The store data describes the stores themselves:

In [62]:
print("Store data shape", store_data.shape)
store_data.head()

Store data shape (1115, 10)


Unnamed: 0,Store,StoreType,Assortment,CompetitionDistance,CompetitionOpenSinceMonth,CompetitionOpenSinceYear,Promo2,Promo2SinceWeek,Promo2SinceYear,PromoInterval
0,1,c,a,1270,9,2008,0,,,
1,2,a,a,570,11,2007,1,13.0,2010.0,"Jan,Apr,Jul,Oct"
2,3,a,a,14130,12,2006,1,14.0,2011.0,"Jan,Apr,Jul,Oct"
3,4,c,c,620,9,2009,0,,,
4,5,a,a,29910,4,2015,0,,,


Let's join the two tables on the `Store` column:

In [63]:
combined_data = pd.merge(training_data, store_data, on="Store")
print("Combined data shape", combined_data.shape)
combined_data.head()

Combined data shape (1017209, 18)


Unnamed: 0,Store,DayOfWeek,Date,Sales,Customers,Open,Promo,StateHoliday,SchoolHoliday,StoreType,Assortment,CompetitionDistance,CompetitionOpenSinceMonth,CompetitionOpenSinceYear,Promo2,Promo2SinceWeek,Promo2SinceYear,PromoInterval
0,1,5,2015-07-31,5263,555,1,1,0,1,c,a,1270,9,2008,0,,,
1,1,4,2015-07-30,5020,546,1,1,0,1,c,a,1270,9,2008,0,,,
2,1,3,2015-07-29,4782,523,1,1,0,1,c,a,1270,9,2008,0,,,
3,1,2,2015-07-28,5011,560,1,1,0,1,c,a,1270,9,2008,0,,,
4,1,1,2015-07-27,6102,612,1,1,0,1,c,a,1270,9,2008,0,,,
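`pd.merge` performs an inner join by default, so a sales row whose `Store` id is missing from the store table would be dropped silently. Here the row count is the same before and after the join (1017209), so nothing was lost. A minimal sketch of making that check explicit, using a few rows copied from the tables above; `how` and `validate` are standard `pd.merge` parameters, while the small frames and the variable names are only for illustration:

```python
import pandas as pd

# a few rows copied from the tables above
sales = pd.DataFrame({
    "Store": [1, 2, 3],
    "Date": ["2015-07-31", "2015-07-31", "2015-07-31"],
    "Sales": [5263, 6064, 8314],
})
stores = pd.DataFrame({
    "Store": [1, 2, 3],
    "StoreType": ["c", "a", "a"],
    "Assortment": ["a", "a", "a"],
})

# how="left" keeps every sales row; validate raises MergeError
# if a store id appears more than once in the store table
merged = pd.merge(sales, stores, on="Store", how="left", validate="many_to_one")

# the join must not add or drop sales rows
assert len(merged) == len(sales)
print(merged.shape)
```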


Exercise
----------------------

1. Identify the types of data present in the dataset:
    - what would you do with each type of data?
    - are there missing values?
2. Write a transformer `PandasSelector` that selects a subset of columns from the dataset.
3. Write transformers for each type of data that convert the selected columns to numerical values.
4. Combine everything into one pipeline using the `make_pipeline` and `make_union` functions.
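For step 1, a possible starting point is to look at the column dtypes and count the missing values per column. The sketch below rebuilds the first store rows shown earlier by hand, so it runs without the CSV files; the variable names are only for illustration:

```python
import numpy as np
import pandas as pd

# first rows of the store table shown above
stores = pd.DataFrame({
    "Store": [1, 2, 3, 4, 5],
    "StoreType": ["c", "a", "a", "c", "a"],
    "CompetitionDistance": [1270, 570, 14130, 620, 29910],
    "Promo2SinceWeek": [np.nan, 13.0, 14.0, np.nan, np.nan],
    "PromoInterval": [None, "Jan,Apr,Jul,Oct", "Jan,Apr,Jul,Oct", None, None],
})

# text columns usually hold categories, numeric columns hold counts or measures
print(stores.dtypes)

# number of missing values per column, showing only columns that have any
missing = stores.isnull().sum()
print(missing[missing > 0])
```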

Hint: you will need these imports:
```python
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.feature_extraction import DictVectorizer
from sklearn.pipeline import make_pipeline, make_union
```

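Explanation:
`BaseEstimator` and `TransformerMixin` are the classes your transformer class needs to inherit from. `BaseEstimator` provides `get_params` and `set_params`, which scikit-learn uses to clone estimators and to tune parameters inside a pipeline; `TransformerMixin` provides `fit_transform`.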
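To see what the two base classes provide, here is a minimal sketch of a custom transformer. `ColumnDropper` is a hypothetical class made up for this illustration and is not part of the exercise solution:

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin


class ColumnDropper(BaseEstimator, TransformerMixin):
    """Hypothetical transformer that removes the given columns."""

    def __init__(self, columns):
        # store constructor arguments unchanged, as scikit-learn expects
        self.columns = columns

    def fit(self, x, y=None):
        # nothing to learn from the data
        return self

    def transform(self, x):
        return x.drop(columns=self.columns)


frame = pd.DataFrame({"Store": [1, 2], "Sales": [5263, 6064]})
dropper = ColumnDropper(columns=["Sales"])

# fit_transform is inherited from TransformerMixin
result = dropper.fit_transform(frame)
print(list(result.columns))

# get_params is inherited from BaseEstimator
print(dropper.get_params())
```

Only `__init__`, `fit` and `transform` are written by hand; the other two methods come from the base classes.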

`DictVectorizer` is a transformer that creates a matrix from a list of dictionaries; it is useful for encoding categorical variables.

For example, say you want to convert two columns, `StoreType` and `Assortment`, to a matrix:

In [37]:
# let's convert the first 10 rows of 2 columns to a list of dictionaries
data_as_dict = store_data[["StoreType", "Assortment"]].head(10).to_dict(orient="records")
data_as_dict

[{'Assortment': 'a', 'StoreType': 'c'},
 {'Assortment': 'a', 'StoreType': 'a'},
 {'Assortment': 'a', 'StoreType': 'a'},
 {'Assortment': 'c', 'StoreType': 'c'},
 {'Assortment': 'a', 'StoreType': 'a'},
 {'Assortment': 'a', 'StoreType': 'a'},
 {'Assortment': 'c', 'StoreType': 'a'},
 {'Assortment': 'a', 'StoreType': 'a'},
 {'Assortment': 'c', 'StoreType': 'a'},
 {'Assortment': 'a', 'StoreType': 'a'}]

In [38]:
from sklearn.feature_extraction import DictVectorizer
categorical_transformer = DictVectorizer()
categorical_transformer.fit_transform(data_as_dict).todense() # by default DictVectorizer returns sparse matrix

matrix([[ 1.,  0.,  0.,  1.],
        [ 1.,  0.,  1.,  0.],
        [ 1.,  0.,  1.,  0.],
        [ 0.,  1.,  0.,  1.],
        [ 1.,  0.,  1.,  0.],
        [ 1.,  0.,  1.,  0.],
        [ 0.,  1.,  1.,  0.],
        [ 1.,  0.,  1.,  0.],
        [ 0.,  1.,  1.,  0.],
        [ 1.,  0.,  1.,  0.]])
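The columns of this matrix are not labelled, but the fitted vectorizer keeps the mapping in its `feature_names_` attribute. A small sketch using the first four dictionaries from the example; `sparse=False` is a standard `DictVectorizer` option that returns a dense array directly, and the variable names are only for illustration:

```python
from sklearn.feature_extraction import DictVectorizer

# first four rows from the example above
rows = [
    {"StoreType": "c", "Assortment": "a"},
    {"StoreType": "a", "Assortment": "a"},
    {"StoreType": "a", "Assortment": "a"},
    {"StoreType": "c", "Assortment": "c"},
]

vectorizer = DictVectorizer(sparse=False)  # return a dense numpy array
encoded = vectorizer.fit_transform(rows)

# one column per (feature, value) pair, sorted by name
print(vectorizer.feature_names_)
print(encoded[0])
```

Each string value gets its own 0/1 column named `feature=value`, which is why two columns with two observed values each produce four output columns.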

Exercise template
-----------------

Your final pipeline should look like this:
    
```python
processing_pipeline = make_pipeline(
    # fill missing values
    MissingValuesFiller(),
    
    # combine features
    make_union(
        make_pipeline(
            # select categorical data
            # do something with categorical data
        ),
        make_pipeline(
            # select date
            # do something with dates
            # use .dt attribute of pandas column
        ),
        make_pipeline(
            # select numerical data
            # do something with numerical data
        ),
        make_pipeline(
            # do some feature engineering
        )
    )
)
```
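The template refers to a `MissingValuesFiller` step. One possible sketch is given here, under the assumption that numeric gaps are filled with a constant and text gaps with a placeholder label; the fill values and parameter names are assumptions for illustration, and other strategies (median, a separate indicator column) may suit this dataset better:

```python
import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin


class MissingValuesFiller(BaseEstimator, TransformerMixin):
    """Assumed behaviour: constant for numeric gaps, label for text gaps."""

    def __init__(self, numeric_fill=0, text_fill="none"):
        self.numeric_fill = numeric_fill
        self.text_fill = text_fill

    def fit(self, x, y=None):
        return self

    def transform(self, x):
        # work on a copy so the caller's DataFrame is left untouched
        filled = x.copy()
        for column in filled.columns:
            if pd.api.types.is_numeric_dtype(filled[column]):
                filled[column] = filled[column].fillna(self.numeric_fill)
            else:
                filled[column] = filled[column].fillna(self.text_fill)
        return filled


# rows copied from the store table above
stores = pd.DataFrame({
    "Promo2SinceWeek": [np.nan, 13.0, 14.0],
    "PromoInterval": [None, "Jan,Apr,Jul,Oct", "Jan,Apr,Jul,Oct"],
})
filled = MissingValuesFiller().fit_transform(stores)
print(filled)
```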

In [52]:
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.feature_extraction import DictVectorizer
from sklearn.pipeline import make_pipeline, make_union

import arrow

class PandasSelector(BaseEstimator, TransformerMixin):
    
    def __init__(self, columns):
        self.columns = columns
        
    def fit(self, x, y = None):
        return self
    
    def transform(self, x):
        # .ix was removed from pandas; .loc selects columns by label
        return x.loc[:, self.columns]
    
    
class PandasToDict(BaseEstimator, TransformerMixin):

    def fit(self, x, y = None):
        return self
    
    def transform(self, x):
        # one dictionary per row, returned as a list
        return x.to_dict(orient="records")

    
class ExtractDateAttributes(BaseEstimator, TransformerMixin):
    
    def __init__(self, date_format="YYYY-MM-DD",
                 attributes=("timestamp", "year", "month", "day")):
        self.date_format = date_format
        self.attributes = attributes
        
    def fit(self, x, y = None):
        return self
    
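    def transform(self, x):
        assert x.shape[1] == 1, "This transformer can handle only 1 date column"

        def read(value, attr):
            # arrow >= 1.0 exposes `timestamp` as a method, older versions as a property
            result = getattr(value, attr)
            return result() if callable(result) else result

        # parse the date strings using the configured format
        dt = x.iloc[:, 0].map(lambda t: arrow.get(t, self.date_format))

        # create an empty DataFrame
        df = pd.DataFrame()

        # add one output column per requested attribute
        for attr in self.attributes:
            df[attr] = dt.map(lambda v: read(v, attr))

        return df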
    
"""
selector = PandasSelector(["Date"])
dt = selector.fit_transform(combined_data)

date_attributes_extractor = ExtractDateAttributes()
date_attributes_extractor.fit_transform(dt.loc[:100, :])
"""
print()
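Putting the pieces together, the sketch below assembles a small version of the template with `make_pipeline` and `make_union`. It is one possible arrangement, not the reference solution. To stay self-contained it repeats compact versions of `PandasSelector` and `PandasToDict`, and it swaps `arrow` for the pandas `.dt` accessor mentioned in the template; `DatePartsExtractor` is a hypothetical name for that variant. The sample rows have no missing values, so the filling step is left out.

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.feature_extraction import DictVectorizer
from sklearn.pipeline import make_pipeline, make_union


class PandasSelector(BaseEstimator, TransformerMixin):
    def __init__(self, columns):
        self.columns = columns

    def fit(self, x, y=None):
        return self

    def transform(self, x):
        return x.loc[:, self.columns]


class PandasToDict(BaseEstimator, TransformerMixin):
    def fit(self, x, y=None):
        return self

    def transform(self, x):
        return x.to_dict(orient="records")


class DatePartsExtractor(BaseEstimator, TransformerMixin):
    """Hypothetical pandas-only variant of ExtractDateAttributes."""

    def fit(self, x, y=None):
        return self

    def transform(self, x):
        dates = pd.to_datetime(x.iloc[:, 0], format="%Y-%m-%d")
        return pd.DataFrame({
            "year": dates.dt.year,
            "month": dates.dt.month,
            "day": dates.dt.day,
        })


# rows copied from the combined table above
sample = pd.DataFrame({
    "Date": ["2015-07-31", "2015-07-30", "2015-07-29"],
    "Customers": [555, 546, 523],
    "StoreType": ["c", "c", "c"],
    "Assortment": ["a", "a", "a"],
})

processing_pipeline = make_union(
    # categorical columns: DataFrame -> list of dicts -> one-hot matrix
    make_pipeline(
        PandasSelector(["StoreType", "Assortment"]),
        PandasToDict(),
        DictVectorizer(sparse=False),
    ),
    # date column: split into year, month and day
    make_pipeline(PandasSelector(["Date"]), DatePartsExtractor()),
    # numerical column: passed through unchanged
    PandasSelector(["Customers"]),
)

features = processing_pipeline.fit_transform(sample)
print(features.shape)
```

`make_union` stacks the outputs of its branches side by side, so the result has one row per input row and one column per generated feature.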


