In [8]:
from utils import css_from_file
css_from_file('style/style.css')

Data preprocessing
----------------------

Data preprocessing is one of the most important steps in preparing a model. Cleaning and preparing the data so that a predictive algorithm can process it often takes most of the project time (90% is a commonly quoted rule of thumb).

Here we use the data from the Rossmann competition: https://www.kaggle.com/c/rossmann-store-sales.

It is a good example of a dataset with many different types of data.

In [None]:
import pandas as pd 
import numpy as np
import warnings
warnings.filterwarnings("ignore")

In [None]:
training_data = pd.read_csv("data/rossmann/train.csv")
store_data = pd.read_csv("data/rossmann/store.csv")

The training data contains information about Sales (our target).

In [None]:
print("Training data shape", training_data.shape)
training_data.head()

And about the stores themselves:

In [None]:
print("Store data shape", store_data.shape)
store_data.head()

Let's join the data

In [None]:
combined_data = pd.merge(training_data, store_data, on="Store")

# sample the data
combined_data = combined_data.sample(frac=0.1).reset_index(drop=True)

print("Combined data shape", combined_data.shape)
combined_data.head()

Exercise
----------------------

1. Identify the types of data present in the dataset:
    - what would you do with each type of data?
    - are there missing values?
2. Write a transformer `PandasSelector` that can select subsets of columns from the dataset.
3. Write transformers for each type of data that convert the selected columns to numerical values.
4. Combine it all into one pipeline using the `make_pipeline` and `make_union` functions.
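
As a starting point for step 1, here is a minimal sketch on a toy frame. The column names match the Rossmann data, but the values are made up for illustration.

```python
import numpy as np
import pandas as pd

# Toy frame with hypothetical values; the real data has many more columns.
toy = pd.DataFrame({
    "Date": ["2015-07-31", "2015-07-30", "2015-07-29"],
    "StoreType": ["a", "c", "a"],
    "CompetitionDistance": [1270.0, np.nan, 570.0],
    "Promo": [1, 0, 1],
})

# dtypes gives a first hint about the kind of data in each column
column_types = toy.dtypes

# count missing values per column
missing_counts = toy.isnull().sum()

print(column_types)
print(missing_counts)
```

The same two calls can be run on `combined_data` to see which columns are text, dates stored as text, or numbers, and which of them contain missing values.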

Hint: you will need these imports:
```python
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.feature_extraction import DictVectorizer
from sklearn.pipeline import make_pipeline, make_union
```

Explanation:
`BaseEstimator` and `TransformerMixin` are the classes your transformer class needs to inherit from. `BaseEstimator` provides `get_params` and `set_params`, which pipelines rely on for cloning and parameter search, and `TransformerMixin` provides `fit_transform`.
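
A minimal sketch of what inheriting from these two classes gives you. The `AddConstant` transformer is a hypothetical toy class, not part of the exercise solution.

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin


class AddConstant(BaseEstimator, TransformerMixin):
    """Hypothetical toy transformer that adds a constant to every value."""

    def __init__(self, constant=1):
        self.constant = constant

    def fit(self, x, y=None):
        # nothing to learn here, but fit must return self
        return self

    def transform(self, x):
        return x + self.constant


toy = pd.DataFrame({"a": [1, 2, 3]})
transformer = AddConstant(constant=10)

# fit_transform is inherited from TransformerMixin
result = transformer.fit_transform(toy)

# get_params is inherited from BaseEstimator
params = transformer.get_params()
```

Only `__init__`, `fit` and `transform` are written by hand; the rest comes from the two base classes.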

`DictVectorizer` is a transformer that creates a matrix from a list of dictionaries. It is helpful for converting categorical variables.

For example:
Let's say you have 2 columns which you want to convert to a matrix: `StoreType` and `Assortment`

In [None]:
# let's convert 2 columns to a list of dictionaries
data_as_dict = store_data[["StoreType", "Assortment"]].head(10).to_dict(orient="records")
data_as_dict

In [None]:
from sklearn.feature_extraction import DictVectorizer
categorical_transformer = DictVectorizer()
categorical_transformer.fit_transform(data_as_dict).todense() # by default DictVectorizer returns sparse matrix
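
To see which column of the resulting matrix corresponds to which category, here is a small self-contained sketch. The rows are hypothetical and only mimic the shape of `data_as_dict`.

```python
from sklearn.feature_extraction import DictVectorizer

# hypothetical rows, in the same shape as data_as_dict
rows = [
    {"StoreType": "a", "Assortment": "a"},
    {"StoreType": "c", "Assortment": "a"},
    {"StoreType": "a", "Assortment": "c"},
]

vectorizer = DictVectorizer(sparse=False)
matrix = vectorizer.fit_transform(rows)

# each string value becomes its own 0/1 column named "<key>=<value>"
print(vectorizer.feature_names_)
print(matrix)
```

Passing `sparse=False` returns a dense array directly, so no call to `.todense()` is needed.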

Exercise template
-----------------

Your final process should look like this:
    
```python
from sklearn.impute import SimpleImputer as Imputer  # Imputer was removed from sklearn.preprocessing in scikit-learn 0.22

processing_pipeline = make_pipeline(
    # fill missing values
    Imputer(),
    
    # combine features
    make_union(
        make_pipeline(
            # select categorical data
            # do something with categorical data
        ),
        make_pipeline(
            # select date
            # do something with dates
            # first convert text to date using pd.to_datetime
            # use .dt attribute of pandas column
        ),
        make_pipeline(
            # select numerical data
            # do something with numerical data
        ),
        make_pipeline(
            # make some feature engineering
        )
    )
)
```
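
The date hints in the template can be sketched as follows. The date strings are hypothetical examples in the same format as the `Date` column.

```python
import pandas as pd

# hypothetical date strings in the same format as the Date column
dates = pd.Series(["2015-07-31", "2015-01-05"])

# convert text to datetime values
parsed = pd.to_datetime(dates)

# the .dt accessor exposes the date components as numeric columns
date_features = pd.DataFrame({
    "year": parsed.dt.year,
    "month": parsed.dt.month,
    "day": parsed.dt.day,
    "weekday": parsed.dt.weekday,  # Monday is 0, Sunday is 6
})

print(date_features)
```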

In [None]:
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.feature_extraction import DictVectorizer
from sklearn.pipeline import make_pipeline, make_union
from sklearn.impute import SimpleImputer as Imputer  # Imputer was removed from sklearn.preprocessing in scikit-learn 0.22

# put your solution here

**Double click to see the solution**

<div class='spoiler'>

class PandasSelector(BaseEstimator, TransformerMixin):
    
    def __init__(self, columns):
        self.columns = columns
        
    def fit(self, x, y = None):
        return self
    
    def transform(self, x):
        return x.loc[:, self.columns]
    
    
class PandasToDict(BaseEstimator, TransformerMixin):

    def fit(self, x, y = None):
        return self
    
    def transform(self, x):
        return x.to_dict(orient="records")

    
class ExtractDateAttributes(BaseEstimator, TransformerMixin):
    
    def __init__(self, date_format=None,
                 attributes=("year", "month", "day", "weekday")):
        self.date_format = date_format
        self.attributes = attributes
        
    def fit(self, x, y = None):
        return self
    
    def transform(self, x):
        assert x.shape[1] == 1, "This transformer can handle only 1 date column"
        
        # convert data to date
        dt = pd.to_datetime(x.iloc[:, 0], format=self.date_format)
        
        # create an empty DataFrame
        df = pd.DataFrame()
        
        for attr in self.attributes:
            df[attr] = getattr(dt.dt, attr)
            
        return df
    

processing_pipeline = make_pipeline(
    # Select used variables
    PandasSelector(["Open", "Promo", "SchoolHoliday", 
                    "Date", "StoreType", "Assortment",
                    "CompetitionDistance", "CompetitionOpenSinceMonth",
                    "CompetitionOpenSinceYear", "Promo2",
                    "Promo2SinceWeek", "Promo2SinceYear"]),
    
    # combine features
    make_union(
        make_pipeline(
            PandasSelector(["Open", "Promo", "SchoolHoliday", "CompetitionDistance", 
                            "CompetitionOpenSinceMonth", "CompetitionOpenSinceYear", 
                            "Promo2", "Promo2SinceWeek", "Promo2SinceYear"]),
            Imputer(strategy='mean')
        ),
        make_pipeline(
            PandasSelector(["Date"]),
            ExtractDateAttributes()
        ),
        make_pipeline(
            PandasSelector(["StoreType", "Assortment"]),
            PandasToDict(),
            DictVectorizer(sparse=False)
        )
    )
)

</div>

To combine your data processing pipeline with a predictive algorithm, you can chain them using the `make_pipeline` function.
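
A minimal sketch of such a chain on toy data. `StandardScaler` and `LinearRegression` are stand-ins chosen for illustration, not the steps used in this notebook.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# hypothetical toy data: y is an exact linear function of x
x = np.arange(10, dtype=float).reshape(-1, 1)
y = 3.0 * x.ravel() + 2.0

# a preprocessing step followed by an estimator
model = make_pipeline(StandardScaler(), LinearRegression())

# fit runs fit_transform on the preprocessing step, then fits the estimator
model.fit(x, y)
predictions = model.predict(x)
```

The resulting object behaves like a single estimator, so it can be passed to cross-validation helpers as a whole.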

Exercise:

1. **Run and understand the code below**
   - explain each line <br/><br/>
   
2. **Why is the CV score so low?**
   - The task is to predict future sales. Is cross-validation a good way to check whether the model learns well? <br/><br/>
   
3. **Use a different splitting scheme and compare the results** 
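
One possible alternative scheme for step 3 is a split by date, where the model is trained on earlier rows and validated on later ones. This is only a sketch on hypothetical toy data, and the cutoff date is an arbitrary assumption.

```python
import pandas as pd

# hypothetical toy data, deliberately not in chronological order
toy = pd.DataFrame({
    "Date": pd.to_datetime(["2015-01-01", "2015-03-01", "2015-02-01",
                            "2015-04-01", "2015-05-01"]),
    "Sales": [100, 120, 110, 130, 140],
})

# arbitrary cutoff: everything from this date on is held out for validation
cutoff = pd.Timestamp("2015-04-01")

train = toy[toy["Date"] < cutoff]
valid = toy[toy["Date"] >= cutoff]

print(len(train), len(valid))
```

With this split the validation rows are always later than the training rows, which is closer to the real forecasting task than a random split.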

In [None]:
from cross_validation import cross_val_predict
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from metrics import rmspe

est = RandomForestRegressor(verbose=True, n_jobs=-1)

pred = cross_val_predict(make_pipeline(processing_pipeline, est), 
                         combined_data, 
                         np.log1p(combined_data.Sales),
                         cv=5)

print(rmspe(combined_data.Sales, np.expm1(pred)))
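
The `rmspe` helper comes from a local `metrics` module that is not shown here. Below is a minimal sketch of what such a function might look like, assuming it follows the competition's root mean squared percentage error and skips rows with zero sales. The name `rmspe_sketch` is hypothetical.

```python
import numpy as np


def rmspe_sketch(y_true, y_pred):
    """Hypothetical RMSPE: sqrt(mean(((y_true - y_pred) / y_true) ** 2))."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)

    # rows with zero sales are ignored to avoid division by zero
    mask = y_true != 0
    relative_error = (y_true[mask] - y_pred[mask]) / y_true[mask]

    return np.sqrt(np.mean(relative_error ** 2))


# both non-zero rows are off by 10%, the zero-sales row is skipped
score = rmspe_sketch([100, 200, 0], [110, 180, 5])
```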