# Welcome to Feature Factory for Rossmann Store Sales 

Feature Factory is an online platform for quickly prototyping and testing features for different machine learning problems.

Before you start using Feature Factory, we highly recommend that you familiarize yourself with Jupyter Notebook. Jupyter Notebook is an interactive environment that lets you run Python code in separate cells. Variables created by the code live in the notebook's Python kernel and can be accessed at any time, by any cell. More information can be found in the [Jupyter quickstart guide](https://jupyter.readthedocs.io/en/latest/content-quickstart.html).

# Rossmann Store Sales Machine Learning Competition

## Problem Statement

Rossmann operates over 3,000 drug stores in 7 European countries. Currently, Rossmann store managers are tasked with predicting their daily sales for up to six weeks in advance. Store sales are influenced by many factors, including promotions, competition, school and state holidays, seasonality, and locality. With thousands of individual managers predicting sales based on their unique circumstances, the accuracy of results can be quite varied.

In this competition, you are challenged to identify, derive, or generate the features that help the most in predicting six weeks of daily sales for the 1,115 stores located across Germany.

## Data

The dataset is in a relational format, split among multiple files. When you use `commands.get_sample_dataset()` to retrieve the dataset, the files are provided as a list of *pandas* `DataFrame` objects.

The following step-by-step example shows this in detail.

### Sales Data

The sales table contains information about the sales which each store made each day within the analyzed period along with some information about the store circumstances on that date.

Note that each (Store, Date) combination is unique within the table.

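The uniqueness of (Store, Date) can be checked directly with pandas. The sketch below uses a small made-up table in place of the real sales data; only the column names come from the field list in this section.

```python
import pandas as pd

# Toy stand-in for the sales table; the values are made up for illustration.
sales = pd.DataFrame({
    "Store": [1, 1, 2, 2],
    "Date": ["2015-07-30", "2015-07-31", "2015-07-30", "2015-07-31"],
    "Sales": [5020, 5263, 5567, 6064],
})

# True if any (Store, Date) pair appears more than once.
has_duplicates = sales.duplicated(subset=["Store", "Date"]).any()

# Because the pair is unique, it can serve as a row key.
by_store_date = sales.set_index(["Store", "Date"])
```

With a unique key, `by_store_date.loc[(2, "2015-07-31"), "Sales"]` returns a single value rather than several rows.
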
| Data Fields | Definition |
|-------------|------------|
|Store        |a unique Id for each store, corresponding to the *Store* field in the *Stores* table|
|DayOfWeek    |Integer indicating the day of the week|
|Date         |Date when the sales took place|
|Sales        |the turnover for any given day (this is what needs to be predicted)|
|Customers    |the number of customers on a given day|
|Open         |an indicator for whether the store was open: 0 = closed, 1 = open|
|Promo        |indicates whether a store is running a promo on that day: 0 = store is not running any promo, 1 = store is running a promo|
|StateHoliday |indicates a state holiday. Normally all stores, with few exceptions, are closed on state holidays. Note that all schools are closed on public holidays and weekends. a = public holiday, b = Easter holiday, c = Christmas, 0 = None|
|SchoolHoliday|indicates if the (Store, Date) was affected by the closure of public schools|

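As a rough illustration of how the `StateHoliday` codes might be turned into numeric features, the sketch below builds a binary holiday flag and one indicator column per code. The toy values are made up, and the mix of the string `"0"` and the integer `0` is an assumption about how the column could be read from file, not something stated above.

```python
import pandas as pd

# Toy stand-in for the sales table; values are made up for illustration.
sales = pd.DataFrame({
    "Store": [1, 1, 1, 1],
    "StateHoliday": ["0", "a", 0, "c"],  # mixed str/int on purpose (assumption)
    "SchoolHoliday": [0, 1, 0, 1],
})

# Normalize every code to a string so that 0 and "0" are treated alike.
codes = sales["StateHoliday"].astype(str)

# 1 for any kind of state holiday, 0 otherwise.
sales["IsStateHoliday"] = (codes != "0").astype(int)

# One indicator column per holiday code.
holiday_dummies = pd.get_dummies(codes, prefix="StateHoliday")
```
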

### Stores Data

The stores table contains general information about each store, including store type, competition information and regular promotions.

|       Data Fields       | Definition |
|-------------------------|------------|
|Store                    |a unique Id for each store|
|StoreType                |differentiates between 4 different store models: a, b, c, d|
|Assortment               |describes an assortment level: a = basic, b = extra, c = extended|
|CompetitionDistance      |distance in meters to the nearest competitor store|
|CompetitionOpenSinceMonth|gives the approximate month of the time the nearest competitor was opened|
|CompetitionOpenSinceYear |gives the approximate year of the time the nearest competitor was opened|
|Promo2                   |Promo2 is a continuing and consecutive promotion for some stores: 0 = store is not participating, 1 = store is participating|
|Promo2SinceWeek          |describes the calendar week when the store started participating in Promo2|
|Promo2SinceYear          |describes the year when the store started participating in Promo2|
|PromoInterval            |lists the months in which a new round of Promo2 starts. E.g. "Feb,May,Aug,Nov" means each round starts in February, May, August, November of any given year for that store|

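The `PromoInterval` field can be combined with the sale date to flag the months in which Promo2 starts anew. This is only a sketch: `promo2_restarts` is a hypothetical helper, the toy rows are made up, and the three-letter English month abbreviations are assumed from the "Feb,May,Aug,Nov" example, so the real data may spell some months differently.

```python
import pandas as pd

# Assumed abbreviations, based on the "Feb,May,Aug,Nov" example.
MONTH_ABBREVIATIONS = ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
                       "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"]

def promo2_restarts(promo_interval, month):
    """Return 1 if Promo2 starts anew in `month` (1-12), else 0."""
    if not isinstance(promo_interval, str):  # missing for stores without Promo2
        return 0
    return int(MONTH_ABBREVIATIONS[month - 1] in promo_interval.split(","))

# Toy stand-ins for the stores and sales tables.
stores = pd.DataFrame({
    "Store": [1, 2],
    "Promo2": [0, 1],
    "PromoInterval": [None, "Feb,May,Aug,Nov"],
})
sales = pd.DataFrame({
    "Store": [1, 2, 2],
    "Date": pd.to_datetime(["2015-05-04", "2015-05-04", "2015-06-01"]),
})

# Attach the store attributes to every sales row, keeping the sales order.
merged = sales.merge(stores, on="Store", how="left")
merged["Promo2Restart"] = [
    promo2_restarts(interval, date.month)
    for interval, date in zip(merged["PromoInterval"], merged["Date"])
]
```
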
## Begin

Execute this cell to initialize your FeatureFactory session.

In [None]:
from featurefactory.problems.rossman import commands

## Load sample data

Get a sample dataset. This allows you to test your feature locally before running it on the full data on the server. Remember that the dataset is a list of [Pandas DataFrames](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.html).

In [None]:
dataset = commands.get_sample_dataset()
# dataset[0] <- this refers to the sales data
# dataset[1] <- this refers to the stores data

In [None]:
dataset[0][:5]

In [None]:
dataset[1][:5]

## Example: write and register a feature

The name you give to the function is the name under which your feature extraction function, and the score it obtains, will be registered.

Your function should take the dataset list as its only parameter and return an N x M numpy matrix or pandas DataFrame, where N is the number of rows in the sales table (`dataset[0]`), one row per sales record, and M is the number of features which will be used for the prediction.
Bear in mind that row order matters: for your function to be scored properly, the extracted features must preserve the row order of the sales table.

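One way to keep the row order when a feature comes from the stores table is a left merge on `Store`. The sketch below assumes toy data and a hypothetical feature name, `competition_distance`; it is an illustration of the ordering requirement, not a recommended feature.

```python
import pandas as pd  # used here only to build the toy input

def competition_distance(dataset):
    """One row per sales row, in the same order as the sales table."""
    sales, stores = dataset[0], dataset[1]
    merged = sales[["Store"]].merge(
        stores[["Store", "CompetitionDistance"]],
        on="Store",
        how="left",              # keep every sales row, in its original order
        validate="many_to_one",  # fail loudly if a store appears twice in stores
    )
    feature = merged[["CompetitionDistance"]].copy()
    feature.index = sales.index  # restore the original row labels
    return feature

# Toy stand-ins for the two tables; the values are made up.
toy_sales = pd.DataFrame({"Store": [2, 1, 2, 3]}, index=[10, 11, 12, 13])
toy_stores = pd.DataFrame({
    "Store": [1, 2, 3],
    "CompetitionDistance": [1270.0, 570.0, 14130.0],
})
toy_feature = competition_distance([toy_sales, toy_stores])
```

The result has exactly one row per sales row, so it lines up with the outcome column when it is scored.
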
Also note that, even though the system allows you to do so, any feature extraction function which makes use of the outcome column will be disqualified.

**WARNING:** Your functions have to be self-contained!

This means that you can use helper functions or import external modules, but any import or variable definition must be made inside the functions that use it.

Cross validation is (intentionally) run in a separate process to make sure that this scoping rule is respected, and it will fail if the function uses anything defined elsewhere in the notebook.

You might be wondering why we require this. The reason is that your function's code might be executed and further evaluated in different environments, where the variables and modules defined in your notebook will not be available.

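A minimal sketch of a self-contained function is shown below. The feature name `log_customers` and the toy table are made up for illustration; the point is that every module the function needs is imported inside its body.

```python
def log_customers(dataset):
    """Log-transformed customer count; all imports live inside the function."""
    import numpy as np   # imported here, not at the top of the notebook
    import pandas as pd

    sales = dataset[0]
    values = np.log1p(sales["Customers"].to_numpy(dtype=float))
    return pd.DataFrame({"LogCustomers": values}, index=sales.index)

# The import below is only needed to build the toy input, not by the feature.
import pandas as pd

toy_sales = pd.DataFrame({"Store": [1, 1, 2], "Customers": [0, 555, 625]})
toy_stores = pd.DataFrame({"Store": [1, 2]})
toy_output = log_customers([toy_sales, toy_stores])
```

Because the function does not depend on notebook-level names, it keeps working if its source is copied into a fresh process.
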
In [None]:
def example_feature(dataset):
    return dataset[0][['Customers']]


Evaluate the score of your feature extraction function before submitting it.

You can use the `cross_validate` command as many times as needed to preview the score your function will obtain.

In [None]:
commands.cross_validate(example_feature)


Register your function in the system

Once you are satisfied with the results, you can call the `register_feature` command passing your function as an argument.
This will `cross_validate` the function again and store your code and your score for future analysis.

Again, remember that your function code must be self-contained and must import or define everything it needs to run successfully.

In [None]:
commands.register_feature(example_feature)


Optional: Modify and update your function code.

If you discover that your function can be improved, you can register it again under the same function name as many times as required.

However, for clarity, we recommend that you use this option only to fix problems or make small improvements within a similar approach.

If you want to start a different feature extraction strategy, we strongly recommend that you register it under a new name.

In [None]:
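def imports():
    # Helper that imports pandas and exposes it as the global name `pd`,
    # so the import happens inside a function rather than at notebook level.
    global pd
    import pandas as pd

def example_feature(dataset):
    """Return the Customers column together with each store's Promo2 flag."""
    imports()

    sales, stores = dataset[0], dataset[1]
    customers = sales[['Customers']]
    # A left merge keeps every sales row, in its original order.
    promo_2 = sales.merge(stores, on='Store', how='left')['Promo2'].fillna(0)
    promo_2.index = sales.index  # realign with the sales table before concatenating
    return pd.concat([customers, promo_2], axis=1)

commands.register_feature(example_feature)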

## Write and register your features here