# Welcome to FeatureFactory for Walmart

FeatureFactory is an online platform for quickly prototyping and testing features for different machine learning problems.

Before you start using FeatureFactory, we highly recommend that you familiarize yourself with Jupyter Notebook. Jupyter Notebook is an interactive environment, backed by a Python kernel, that lets you run code in separate cells. Variables created by the code live in the kernel and can be accessed at any time from any cell. More information can be found in the [Jupyter quickstart guide](https://jupyter.readthedocs.io/en/latest/content-quickstart.html).

# Walmart Machine Learning Competition

## Problem Statement

In this competition, you are provided with historical sales data for 45 Walmart stores located in different regions. Each store contains many departments, and participants must project the sales for each department in each store. To add to the challenge, selected holiday markdown events are included in the dataset. These markdowns are known to affect sales, but it is challenging to predict which departments are affected and the extent of the impact.

## Data

The dataset is in a relational format, split among multiple files. When you use `commands.get_sample_dataset()` to retrieve the dataset, the files are provided as a list of *pandas* `DataFrame` objects.

The step-by-step example below shows this in detail.

### Sales Data

The sales table contains the weekly sales that each department of each store made during the period studied. It also indicates whether a given week was a holiday week.

| Data Fields | Definition |
|-------------|------------|
|Store        |the store number|
|Dept         |the department number|
|Date         |the week|
|Weekly_Sales |sales for the given department in the given store. This is what needs to be predicted.|
|IsHoliday    |whether the week is a special holiday week|

### Stores Data

The stores data contains anonymized information about each store, indicating its type and its size.

| Data Fields | Definition |
|-------------|------------|
|Store        |the store number|
|Type         |the type of store|
|Size         |the size of the store|

### Features Data

This table contains additional data related to the store, department, and regional activity for the given dates.

|         Data Fields         | Definition |
|-----------------------------|------------|
|Store       |the store number|
|Date        |the week|
|Temperature |average temperature in the region|
|Fuel_Price  |cost of fuel in the region|
|MarkDown1   |anonymized data related to promotional markdowns that Walmart is running. MarkDown data is only available after Nov 2011, and is not available for all stores all the time. Any missing value is marked with an NA.|
|CPI         |the consumer price index|
|Unemployment|the unemployment rate|
|IsHoliday   |whether the week is a special holiday week|
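
The three tables share keys: the sales and features tables both carry `Store` and `Date`, and the stores table carries `Store`. The sketch below shows one way they might be combined with pandas. It uses tiny made-up frames (the values are hypothetical, not taken from the competition data), and filling missing markdowns with 0 is only one possible choice.

```python
import pandas as pd

# Tiny hypothetical stand-ins for the three tables (made-up values).
sales = pd.DataFrame({
    "Store": [1, 1, 2],
    "Dept": [1, 2, 1],
    "Date": ["2012-01-06", "2012-01-06", "2012-01-06"],
    "Weekly_Sales": [100.0, 50.0, 75.0],
    "IsHoliday": [False, False, False],
})
stores = pd.DataFrame({"Store": [1, 2], "Type": ["A", "B"], "Size": [1000, 2000]})
features = pd.DataFrame({
    "Store": [1, 2],
    "Date": ["2012-01-06", "2012-01-06"],
    "Temperature": [40.0, 55.0],
    "MarkDown1": [None, 120.0],  # None stands in for an NA markdown
    "IsHoliday": [False, False],
})

# Left merges keep every sales row, in the order of the sales table.
merged = sales.merge(stores, on="Store", how="left")
merged = merged.merge(
    features.drop(columns="IsHoliday"),  # avoid a duplicated IsHoliday column
    on=["Store", "Date"],
    how="left",
)

# Assumption: treat a missing markdown as "no markdown".
merged["MarkDown1"] = merged["MarkDown1"].fillna(0)
```

Because both merges are left joins on keys that are unique in the right-hand table, `merged` has exactly one row per sales record.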

## Begin

Execute this cell to initialize your FeatureFactory session.

In [None]:
from problems.walmart import commands

## Load sample data

Get a sample dataset. This will allow you to test your feature locally before running it on the full data on the server. Remember that the dataset is a list of [pandas DataFrames](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.html).

In [None]:
dataset = commands.get_sample_dataset()
# dataset[0] <- this refers to the sales data
# dataset[1] <- this refers to the stores data
# dataset[2] <- this refers to the features data

In [None]:
dataset[0][:5]

In [None]:
dataset[1][:5]

In [None]:
dataset[2][:5]

## Example: write and register a feature

The name you give to your function is the name under which your feature extraction function and its score will later be registered.

Your function should take the dataset list as its only parameter and return an N x M numpy matrix or pandas DataFrame, where N is the number of rows in the sales table (one output row per sales row) and M is the number of features that will be used for the prediction.
Bear in mind that ordering is important: in order for your function to be scored properly, the extracted features must preserve the row order of the sales table.
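
The shape requirement can be checked locally before evaluating a feature. The sketch below is a hypothetical helper, not part of FeatureFactory: `check_feature_shape` is an illustrative name, and it assumes that the output rows correspond to the rows of the first table (`dataset[0]`).

```python
import pandas as pd

def check_feature_shape(feature_fn, dataset):
    """Hypothetical helper: verify the output is 2-D with one row per input row."""
    result = feature_fn(dataset)
    expected_rows = len(dataset[0])
    if result.ndim != 2:
        raise ValueError("expected a 2-D matrix or DataFrame")
    if result.shape[0] != expected_rows:
        raise ValueError(
            "expected %d rows, got %d" % (expected_rows, result.shape[0])
        )
    return result.shape

# Made-up data standing in for the first table of the dataset list.
toy_dataset = [pd.DataFrame({"Dept": [1, 2, 3], "IsHoliday": [False, True, False]})]

def toy_feature(dataset):
    return dataset[0][["Dept", "IsHoliday"]]

shape = check_feature_shape(toy_feature, toy_dataset)
```

This check only covers the shape. It cannot detect rows that were reordered, so any sorting or merging inside a feature function still needs care.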

Also note that, even though the system allows you to do so, any feature extraction function that makes use of the outcome column (`Weekly_Sales`) will be disqualified.

**WARNING:** Your functions have to be self-contained!

This means that you can use helper functions or import external modules, but any import or variable definition must be made within the functions that use them.

Cross validation is (intentionally) run in a separate process to make sure that this scoping rule is respected, and it will fail if the function uses anything defined elsewhere in the notebook.

You might be wondering why we require this. The reason is that the code of your function might be executed and further evaluated in different environments where the variables and modules defined in your notebook will not be available.
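
The effect of this rule can be reproduced locally. The sketch below is an illustration only and is not how FeatureFactory itself runs your code: it executes the source of a hypothetical feature function in an empty namespace, which mimics an environment where nothing from the notebook is available.

```python
import pandas as pd

# Source of a hypothetical feature function that imports what it needs.
SELF_CONTAINED = '''
def holiday_flag(dataset):
    import pandas as pd  # imported inside the function
    return pd.DataFrame({"IsHoliday": dataset[0]["IsHoliday"].astype(int)})
'''

# Same function, but relying on a notebook-level "pd".
NOT_SELF_CONTAINED = '''
def holiday_flag(dataset):
    return pd.DataFrame({"IsHoliday": dataset[0]["IsHoliday"].astype(int)})
'''

def runs_in_clean_namespace(source, name, dataset):
    """Define the function in an empty namespace and try to call it."""
    namespace = {}
    exec(source, namespace)
    try:
        namespace[name](dataset)
        return True
    except NameError:
        # Raised when the function refers to a name it never defined or imported.
        return False

# Made-up data standing in for the dataset list.
sample = [pd.DataFrame({"IsHoliday": [True, False]})]

ok = runs_in_clean_namespace(SELF_CONTAINED, "holiday_flag", sample)
broken = runs_in_clean_namespace(NOT_SELF_CONTAINED, "holiday_flag", sample)
```

Here `ok` is `True` and `broken` is `False`: the second version fails with a `NameError` because `pd` exists only in the surrounding notebook, not inside the function.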

In [None]:
def example_feature(dataset):
    return dataset[0][['Dept', 'IsHoliday']].fillna(0)

&nbsp;
&nbsp;

Evaluate the score of your feature extraction function before submitting it.

You can use the `cross_validate` command as many times as needed to preview the score that your function will obtain.

In [None]:
commands.cross_validate(example_feature)

&nbsp;
&nbsp;

Register your function in the system

Once you are satisfied with the results, you can call the `register_feature` command, passing your function as an argument.
This will run `cross_validate` on the function again and store your code and your score for future analysis.

Again, remember that your function code must be self-contained: it has to import or define everything it needs in order to run successfully.

In [None]:
commands.register_feature(example_feature)

&nbsp;
&nbsp;

Optional: Modify and update your function code.

If you discover that your function can be improved, you can register it again under the same function name as many times as required.

However, for clarity, we recommend using this option only to fix problems or to make small improvements within a similar approach.

If you want to start a different feature extraction strategy, we strongly recommend that you register it under a new name.

In [None]:
def example_feature(dataset):
    """Return Dept and IsHoliday plus a one-hot encoding of the store type."""
    # Imports and helpers are defined inside the function to keep it self-contained.
    import pandas as pd

    def one_hot(feature):
        """Perform one-hot encoding on a feature column."""
        return pd.get_dummies(feature)

    dept_holiday = dataset[0][['Dept', 'IsHoliday']].fillna(0)
    # A left merge keeps every sales row in its original order.
    store_type = one_hot(dataset[0].merge(dataset[1], on='Store', how='left')['Type'])
    return pd.concat([dept_holiday, store_type], axis=1)


commands.register_feature(example_feature)

## Write and register your features here