
# Welcome to Feature Factory for Airbnb

Feature Factory is an online infrastructure that allows you to quickly prototype and test features for different machine learning problems.

Before you begin using Feature Factory, we highly recommend that you familiarize yourself with IPython Notebook. IPython Notebook provides an interactive Python kernel that lets you run code in separate cells. Variables created by the code live in the kernel and can be accessed at any time, from any cell. More information can be found at http://ipython.org/notebook.html

# Creating your own IPython Notebook

To get started with Feature Factory, please clone the Template notebook. To do this, click "File"->"Make a Copy". This should spawn a new tab within your browser with the copied notebook. Rename the notebook to your liking and make all edits on that notebook.



# Airbnb Machine Learning Competition

## Problem Statement

Instead of waking to overlooked "Do not disturb" signs, Airbnb travelers find themselves rising with the birds in a whimsical treehouse, having their morning coffee on the deck of a houseboat, or cooking a shared regional breakfast with their hosts.

New users on Airbnb can book a place to stay in 34,000+ cities across 190+ countries. By accurately predicting where a new user will book their first travel experience, Airbnb can share more personalized content with their community, decrease the average time to first booking, and better forecast demand.

In this recruiting competition, Airbnb challenges you to predict in which country a new user will make his or her first booking. Kagglers who impress with their answer (and an explanation of how they got there) will be considered for an interview for the opportunity to join Airbnb's Data Science and Analytics team.

In this competition, you are challenged to identify, derive, or generate the features that would help the most in predicting the country in which a new user will make his or her first booking.

## Data

In this challenge, you are given a list of USA users along with their demographics, web session records and some summary statistics.

There are 12 possible outcomes for the destination country: 'US', 'FR', 'CA', 'GB', 'ES', 'IT', 'PT', 'NL', 'DE', 'AU', 'NDF' (no destination found), and 'other'. Please note that 'NDF' is different from 'other': 'other' means there was a booking, but to a country not included in the list, while 'NDF' means there was no booking.
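
The difference between 'NDF' and 'other' can be made explicit in code. The sketch below is only an illustration for exploring the data: the names `DESTINATIONS` and `made_booking` are hypothetical and are not part of Feature Factory.

```python
# The 12 possible values of country_destination, as listed above.
DESTINATIONS = ['US', 'FR', 'CA', 'GB', 'ES', 'IT',
                'PT', 'NL', 'DE', 'AU', 'NDF', 'other']


def made_booking(label):
    """Return True if the label corresponds to an actual booking.

    'other' is a booking to a country outside the list,
    while 'NDF' means that no booking took place.
    """
    if label not in DESTINATIONS:
        raise ValueError('unknown destination label: %r' % (label,))
    return label != 'NDF'
```

Keep in mind that the outcome column may not be used inside a feature extraction function, so a helper like this is only suitable for inspecting the data.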

The dataset is in a relational format, split among multiple files. When you use **commands.get_sample_dataset()** to retrieve the dataset, the files are provided as a list of *pandas.DataFrame* objects.

The step-by-step example below shows this in detail.

### Users Data


The users table contains generic information about each user, such as some registration details, as well as the destination of the first booking, which is the variable to predict.

| Data Fields | Definition |
|-------------|------------|
|id           |user id|
|date_account_created|the date of account creation|
|timestamp_first_active|timestamp of the first activity; note that it can be earlier than date_account_created or date_first_booking because a user can search before signing up|
|date_first_booking|date of first booking|
|gender||
|age||
|signup_method||
|signup_flow|the page a user came to sign up from|
|language|international language preference|
|affiliate_channel|what kind of paid marketing|
|affiliate_provider|where the marketing is, e.g. google, craigslist, other|
|first_affiliate_tracked|the first marketing the user interacted with before signing up|
|signup_app||
|first_device_type||
|first_browser||
|country_destination|this is the target variable you are to predict|
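
Since `timestamp_first_active` can be earlier than `date_account_created`, the gap between the two may be worth exploring. The sketch below assumes that the timestamp is encoded as `YYYYMMDDHHMMSS` (this is inferred from sample values such as `20090319043255`, not stated by the platform) and uses two rows taken from the sample data. The function name `first_active_datetime` is hypothetical.

```python
import pandas as pd


def first_active_datetime(users):
    """Parse timestamp_first_active into pandas datetimes.

    Assumption: values look like 20090319043255 (YYYYMMDDHHMMSS).
    """
    as_text = users['timestamp_first_active'].astype(str)
    return pd.to_datetime(as_text, format='%Y%m%d%H%M%S')


# Two rows copied from the sample users table.
users = pd.DataFrame({
    'id': ['gxn3p5htnn', '820tgsjxq7'],
    'date_account_created': ['2010-06-28', '2011-05-25'],
    'timestamp_first_active': [20090319043255, 20090523174809],
})

first_active = first_active_datetime(users)
created = pd.to_datetime(users['date_account_created'])

# Whole days between the first activity and the account creation.
days_before_signup = (created - first_active).dt.days
```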

### Sessions Data

The sessions data contains information about what actions the user performed on the website before making the first booking.

| Data Fields | Definition |
|-------------|------------|
|user_id      |to be joined with the column 'id' in users table|
|action       ||
|action_type  ||
|action_detail||
|device_type  ||
|secs_elapsed ||
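
Because sessions are stored with one row per action, they usually need to be aggregated per user before they can be joined with the users table. The following is a minimal sketch of one way to do this with pandas; the function name `session_summary` and the output column names `n_actions` and `total_secs` are hypothetical, and the small DataFrames are stand-ins for the real tables.

```python
import pandas as pd


def session_summary(users, sessions):
    """Per-user action count and total seconds, in users-table order."""
    grouped = sessions.groupby('user_id')['secs_elapsed']
    summary = pd.DataFrame({
        'n_actions': grouped.size(),   # number of session rows per user
        'total_secs': grouped.sum(),   # total secs_elapsed per user
    })
    # Align on the users 'id' column so that the rows follow the users
    # table; users without any session record get 0 in both columns.
    aligned = summary.reindex(users['id'].values).fillna(0)
    return aligned.reset_index(drop=True)


# Small stand-in tables with the same column names as the real ones.
users = pd.DataFrame({'id': ['b', 'a', 'c']})
sessions = pd.DataFrame({
    'user_id': ['a', 'a', 'b'],
    'secs_elapsed': [319.0, 67753.0, 301.0],
})

features = session_summary(users, sessions)
```

Reindexing on the users `id` column keeps one row per user even when a user has no sessions at all, which a plain inner join would drop.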

### Countries Data

The countries table contains some information about the destination countries.

|         Data Fields         | Definition |
|-----------------------------|------------|
|country_destination          |Two letter country code|
|lat_destination              |country latitude|
|lng_destination              |country longitude|
|distance_km                  |distance from USA, in km|
|destination_km2              |country size in km2|
|destination_language         |country language|
|language_levenshtein_distance|Levenshtein distance between the country language and US English|

### Age Gender Buckets

The Age Gender Buckets table contains some demographic information broadly explaining the age distribution by genders in the destination countries.

|         Data Fields         | Definition |
|-----------------------------|------------|
|age_bucket||
|country_destination|Two letter country code, as in countries table|
|gender||
|population_in_thousands||
|year||
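
To relate a user's `age` to this table, the age first has to be converted into an `age_bucket` label. The sample rows in the step-by-step example show labels such as '100+', '95-99' and '80-84'. The sketch below assumes that the same five-year pattern continues down to '0-4', which the sample does not confirm. The function name `age_to_bucket` is hypothetical.

```python
def age_to_bucket(age):
    """Map an age in years to a label such as '80-84' or '100+'.

    Assumption: buckets are five years wide for ages 0 to 99,
    with a single '100+' bucket above that.
    Returns None for missing or negative ages.
    """
    if age is None or age != age:  # None, or NaN (NaN is not equal to itself)
        return None
    age = int(age)
    if age < 0:
        return None
    if age >= 100:
        return '100+'
    low = (age // 5) * 5
    return '%d-%d' % (low, low + 4)
```

For example, `age_to_bucket(38.0)` gives `'35-39'`.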



## Step-by-Step Example

Step 1: Import the feature factory infrastructure

In [1]:
from problems.airbnb import commands


Step 2: Create a username and password, or log in to an existing account. If you create an account successfully, you don't need to log in; you are logged in automatically.

In [2]:
commands.create_user('a_user', 'a_password')

user successfully created


In [3]:
commands.login('a_user', 'a_password')

user successfully logged in



Step 3: To ensure that this notebook is mapped to your username, it is required that you execute the command below. 

In [4]:
commands.add_notebook('a_notebook_name')

Notebook a_notebook_name successfully registered



Step 4: Get a sample dataset. This allows you to test your feature before running it on the full data on the server. Remember that the dataset is returned as a list of [Pandas DataFrames](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.html).

In [5]:
dataset = commands.get_sample_dataset()
# dataset[0] <- this refers to the users data
# dataset[1] <- this refers to the sessions data
# dataset[2] <- this refers to the country data
# dataset[3] <- this refers to the age_gender buckets data
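
Since the list has a fixed order, it may be convenient to unpack it into named variables while exploring. The sketch below uses a small stand-in list instead of calling `commands.get_sample_dataset()`, so that it can run without a server connection; the variable names are only a suggestion.

```python
import pandas as pd

# Stand-in for commands.get_sample_dataset(): four small DataFrames
# in the same order as the real list (users, sessions, countries,
# age_gender buckets), each with only a couple of columns.
dataset = [
    pd.DataFrame({'id': ['gxn3p5htnn'], 'signup_method': ['facebook']}),
    pd.DataFrame({'user_id': ['d1mm9tcy42'], 'secs_elapsed': [319.0]}),
    pd.DataFrame({'country_destination': ['AU'], 'distance_km': [15297.744]}),
    pd.DataFrame({'age_bucket': ['100+'], 'country_destination': ['AU']}),
]

# Unpack by position so the rest of the exploration reads more clearly.
users, sessions, countries, age_gender = dataset
```

Note that a feature extraction function still receives the whole list as its single argument, so any unpacking of this kind has to happen inside the function.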

In [6]:
dataset[0][:5]

| Unnamed: 0 | id | date_account_created | timestamp_first_active | date_first_booking | gender | age | signup_method | signup_flow | language | affiliate_channel | affiliate_provider | first_affiliate_tracked | signup_app | first_device_type | first_browser | country_destination |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | gxn3p5htnn | 2010-06-28 | 20090319043255 | | -unknown- | | facebook | 0 | en | direct | direct | untracked | Web | Mac Desktop | Chrome | NDF |
| 1 | 820tgsjxq7 | 2011-05-25 | 20090523174809 | | MALE | 38.0 | facebook | 0 | en | seo | google | untracked | Web | Mac Desktop | Chrome | NDF |
| 2 | 4ft3gnwmtx | 2010-09-28 | 20090609231247 | 2010-08-02 | FEMALE | 56.0 | basic | 3 | en | direct | direct | untracked | Web | Windows Desktop | IE | US |
| 3 | bjjt8pjhuk | 2011-12-05 | 20091031060129 | 2012-09-08 | FEMALE | 42.0 | facebook | 0 | en | direct | direct | untracked | Web | Mac Desktop | Firefox | other |
| 4 | 87mebub9p4 | 2010-09-14 | 20091208061105 | 2010-02-18 | -unknown- | 41.0 | basic | 0 | en | direct | direct | untracked | Web | Mac Desktop | Chrome | US |


In [7]:
dataset[1][:5]

| Unnamed: 0 | user_id | action | action_type | action_detail | device_type | secs_elapsed |
|---|---|---|---|---|---|---|
| 0 | d1mm9tcy42 | lookup | | | Windows Desktop | 319.0 |
| 1 | d1mm9tcy42 | search_results | click | view_search_results | Windows Desktop | 67753.0 |
| 2 | d1mm9tcy42 | lookup | | | Windows Desktop | 301.0 |
| 3 | d1mm9tcy42 | search_results | click | view_search_results | Windows Desktop | 22141.0 |
| 4 | d1mm9tcy42 | lookup | | | Windows Desktop | 435.0 |


In [8]:
dataset[2][:5]

| Unnamed: 0 | country_destination | lat_destination | lng_destination | distance_km | destination_km2 | destination_language | language_levenshtein_distance |
|---|---|---|---|---|---|---|---|
| 0 | AU | -26.853388 | 133.27516 | 15297.744 | 7741220.0 | eng | 0.0 |
| 1 | CA | 62.393303 | -96.818146 | 2828.1333 | 9984670.0 | eng | 0.0 |
| 2 | DE | 51.165707 | 10.452764 | 7879.568 | 357022.0 | deu | 72.61 |
| 3 | ES | 39.896027 | -2.487694 | 7730.724 | 505370.0 | spa | 92.25 |
| 4 | FR | 46.232193 | 2.209667 | 7682.945 | 643801.0 | fra | 92.06 |


In [9]:
dataset[3][:5]

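| Unnamed: 0 | age_bucket | country_destination | gender | population_in_thousands | year |
|---|---|---|---|---|---|
| 0 | 100+ | AU | male | 1.0 | 2015.0 |
| 1 | 95-99 | AU | male | 9.0 | 2015.0 |
| 2 | 90-94 | AU | male | 47.0 | 2015.0 |
| 3 | 85-89 | AU | male | 118.0 | 2015.0 |
| 4 | 80-84 | AU | male | 199.0 | 2015.0 |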



Step 5: Define your feature extraction function.

The name you give to the function is the name that will be used later on to register your feature extraction function and the score it obtains.

Your function should take the dataset list as its only parameter and return an N x M numpy matrix or pandas DataFrame, where N is the number of users (one row per user) and M is the number of features that will be used for the prediction.
Row order matters: to evaluate your function's score properly, the extracted features must preserve the order of the users table.

Also note that, even though the system allows you to do so, any feature extraction function which makes use of the outcome column will be disqualified.
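
One way to reduce the risk of touching the outcome column by accident is to drop it at the very start of the function and to check the row count before returning. The sketch below is only a suggestion, not something the platform requires; the function name `age_without_target` is hypothetical and the small users table is a stand-in for `dataset[0]`.

```python
import pandas as pd


def age_without_target(dataset):
    """Example feature that never sees the outcome column."""
    # Work on a copy of the users table without country_destination,
    # so later code in this function cannot use it by mistake.
    users = dataset[0].drop(columns=['country_destination'], errors='ignore')

    features = users[['age']].fillna(0)

    # One row per user, in the same order as the users table.
    if len(features) != len(dataset[0]):
        raise ValueError('expected one feature row per user')
    return features


# Stand-in users table with a few of the real column names.
users = pd.DataFrame({
    'id': ['gxn3p5htnn', '820tgsjxq7', '4ft3gnwmtx'],
    'age': [None, 38.0, 56.0],
    'country_destination': ['NDF', 'NDF', 'US'],
})

features = age_without_target([users])
```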

**WARNING:** Your functions have to be self contained!

This means that you can use helper functions or import external modules but that any import or variable definition needs to be made within the functions which use them.

Cross validation is (intentionally) run in a separate process to make sure that this scope pattern is preserved, and it will fail if the function uses anything defined elsewhere in the notebook.

You might be wondering why we require this. The reason is that the code of your function might be executed and further evaluated in different environments where the variables and modules defined in your notebook will not be available.

In [10]:
def example_feature(dataset):
    return dataset[0][['age']].fillna(0)
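
For a function that needs an external module or a helper, one possible way to follow the self-contained rule described above is to place both the import and the helper inside the function body. This is a minimal sketch under that assumption, not the only accepted pattern; the function name `age_and_signup` is hypothetical and the small users table is a stand-in for `dataset[0]`.

```python
def age_and_signup(dataset):
    """Age plus one-hot encoded signup_method, with all dependencies local."""
    # Imported inside the function, so it is available wherever
    # the function is executed.
    import pandas as pd

    def one_hot(column):
        # Helper defined inside the function for the same reason.
        return pd.get_dummies(column)

    users = dataset[0]
    age = users[['age']].fillna(0)
    signup = one_hot(users['signup_method'])
    return pd.concat([age, signup], axis=1)


# Local check with a stand-in users table (this import is only
# needed to build the example data, not by the function itself).
import pandas as pd

users = pd.DataFrame({
    'age': [None, 38.0, 56.0],
    'signup_method': ['facebook', 'facebook', 'basic'],
})
features = age_and_signup([users])
```

The function body does not rely on any name defined at notebook level, so it should behave the same way when it is run in a separate process.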


Step 6: Evaluate the score of your feature extraction function before submitting it.

You can use the cross_validate command as many times as needed to preview what the score of your function will be.

In [11]:
commands.cross_validate(example_feature)

Obtaining dataset
Extracting features
Cross validating


0.57681210168311758


Step 7: Register your function in the system

Once you are satisfied with the results, you can call the add_feature command passing your function as an argument.
This will cross_validate the function again and store your code and your score for future analysis.

Again, remember that your function code must be self contained and import or define everything it needs to be run successfully.

In [12]:
commands.add_feature(example_feature)

Obtaining dataset
Extracting features
Cross validating
Your feature example_feature scored 0.5768121016831176
Feature example_feature successfully registered



Step 8: (Optional) Modify and update your function code.

If you discover that your function can be improved, you can add it to the system again, as many times as required, under the same function name.

However, for clarity, we recommend using this option only to fix problems or make small improvements within a similar approach.

If you want to start a different feature extraction strategy, we strongly recommend registering it under a new name.

In [13]:
def imports():    # We need to import pandas within our functions
    global pd
    import pandas as pd

def one_hot(feature):
    """Perform one-hot-encoding to a feature column."""
    return pd.get_dummies(feature)
    
def example_feature(dataset):
    """Return a dataset containing only some of the features."""
    imports()
    
    age = dataset[0][['age']].fillna(0)
    signup = one_hot(dataset[0]['signup_method'])
    return pd.concat([age, signup], axis=1)


commands.add_feature(example_feature)

Obtaining dataset
Extracting features
Cross validating
Your feature example_feature scored 0.6277137756164709
Feature example_feature successfully registered
