# Airbnb New User Bookings

https://www.kaggle.com/c/airbnb-recruiting-new-user-bookings

Description:
"New users on Airbnb can book a place to stay in 34,000+ cities across 190+ countries. By accurately predicting where a new user will book their first travel experience, Airbnb can share more personalized content with their community, decrease the average time to first booking, and better forecast demand. In this recruiting competition, Airbnb challenges you to predict in which country a new user will make his or her first booking."

## Set Up, Initial Exploration

Imports, set options.

In [2]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# enables inline plots
%matplotlib inline

pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', 20)
pd.set_option('display.precision', 3)

**Kaggle provides the following files:**

**train_users_2.csv** - the training set of users

**test_users.csv** - the test set of users

**sessions.csv** - web sessions log for users

**countries.csv** - summary statistics of destination countries in this dataset and their locations

**age_gender_bkts.csv** - summary statistics of users' age group, gender, country of destination

**sample_submission.csv** - correct format for submitting your predictions

## train_users_2.csv

**The train_users_2.csv file contains the following columns:**

id: user id

date_account_created: the date of account creation

timestamp_first_active: timestamp of the first activity, note that it can be earlier than date_account_created or date_first_booking because a user can search before signing up

date_first_booking: date of first booking

gender

age

signup_method

signup_flow: the page a user came to sign up from

language: international language preference

affiliate_channel: what kind of paid marketing

affiliate_provider: where the marketing is e.g. google, craigslist, other

first_affiliate_tracked: the first marketing the user interacted with before signing up

signup_app

first_device_type

first_browser

country_destination: this is the target variable you are to predict
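
The note on timestamp_first_active says a user's first activity can precede account creation. Here is a minimal sketch of how one might measure that gap. It uses a few hand-typed rows in place of the real file, and the column name `days_active_before_signup` is a hypothetical name chosen for illustration.

```python
import pandas as pd

# Toy stand-in for train_users_2.csv; only the column names come from the dataset.
users = pd.DataFrame({
    "id": ["a", "b", "c"],
    "date_account_created": pd.to_datetime(["2010-06-28", "2010-01-01", "2010-01-04"]),
    "timestamp_first_active": pd.to_datetime(
        ["2009-03-19 04:32:55", "2010-01-01 21:56:19", "2010-01-04 00:42:11"]
    ),
})

# Compare calendar dates only, since date_account_created has no time component.
first_active_date = users["timestamp_first_active"].dt.normalize()

# Hypothetical feature: whole days between first activity and account creation.
users["days_active_before_signup"] = (
    users["date_account_created"] - first_active_date
).dt.days

# Boolean mask of users who were active before the day they signed up.
active_before_signup = users["days_active_before_signup"] > 0
print(users[["id", "days_active_before_signup"]])
```

A positive value means the user browsed before creating an account; zero means both happened on the same day.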

**Let's take a closer look at the columns.**

In [3]:
import zipfile
local_path = '/Users/eloiseheydenrych/Downloads'

z = zipfile.ZipFile(local_path + '/train_users_2.csv.zip')
df = pd.read_csv(z.open('train_users_2.csv'), parse_dates=[1,2])
df.info()


<class 'pandas.core.frame.DataFrame'>
Int64Index: 213451 entries, 0 to 213450
Data columns (total 16 columns):
id                         213451 non-null object
date_account_created       213451 non-null datetime64[ns]
timestamp_first_active     213451 non-null datetime64[ns]
date_first_booking         88908 non-null object
gender                     213451 non-null object
age                        125461 non-null float64
signup_method              213451 non-null object
signup_flow                213451 non-null int64
language                   213451 non-null object
affiliate_channel          213451 non-null object
affiliate_provider         213451 non-null object
first_affiliate_tracked    207386 non-null object
signup_app                 213451 non-null object
first_device_type          213451 non-null object
first_browser              213451 non-null object
country_destination        213451 non-null object
dtypes: datetime64[ns](2), float64(1), int64(1), object(12)
memory usage: 

The output above shows substantial missing data in 'date_first_booking': only 88,908 of 213,451 users (about 42%) have a value, which makes sense because only some account holders have made a booking. We're also missing 'age' for about 41% of users (87,990 rows) and 'first_affiliate_tracked' for about 3% (6,065 rows).
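
The same missing-value counts can be tabulated directly rather than read off the `info()` output. This is a minimal sketch on a toy frame with illustrative values; `missing_summary` is a hypothetical helper, not part of pandas.

```python
import numpy as np
import pandas as pd

def missing_summary(frame):
    """Return count and percentage of missing values per column, largest first."""
    n_missing = frame.isna().sum()
    summary = pd.DataFrame({
        "n_missing": n_missing,
        "pct_missing": (100 * n_missing / len(frame)).round(1),
    })
    # Keep only columns that actually have missing values.
    return summary[summary["n_missing"] > 0].sort_values("n_missing", ascending=False)

# Toy frame standing in for df; values are illustrative only.
toy = pd.DataFrame({
    "age": [38.0, np.nan, 56.0, np.nan],
    "gender": ["MALE", "-unknown-", "FEMALE", "-unknown-"],
    "first_affiliate_tracked": ["untracked", None, "omg", "linked"],
})
print(missing_summary(toy))
```

Note that placeholder strings such as '-unknown-' in gender are not counted by `isna()`, so a column can look complete here while still carrying unknown values.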

Let's take a look at the first ten rows.

In [4]:
df.head(10)

Unnamed: 0,id,date_account_created,timestamp_first_active,date_first_booking,gender,age,signup_method,signup_flow,language,affiliate_channel,affiliate_provider,first_affiliate_tracked,signup_app,first_device_type,first_browser,country_destination
0,gxn3p5htnn,2010-06-28,2009-03-19 04:32:55,,-unknown-,,facebook,0,en,direct,direct,untracked,Web,Mac Desktop,Chrome,NDF
1,820tgsjxq7,2011-05-25,2009-05-23 17:48:09,,MALE,38.0,facebook,0,en,seo,google,untracked,Web,Mac Desktop,Chrome,NDF
2,4ft3gnwmtx,2010-09-28,2009-06-09 23:12:47,2010-08-02,FEMALE,56.0,basic,3,en,direct,direct,untracked,Web,Windows Desktop,IE,US
3,bjjt8pjhuk,2011-12-05,2009-10-31 06:01:29,2012-09-08,FEMALE,42.0,facebook,0,en,direct,direct,untracked,Web,Mac Desktop,Firefox,other
4,87mebub9p4,2010-09-14,2009-12-08 06:11:05,2010-02-18,-unknown-,41.0,basic,0,en,direct,direct,untracked,Web,Mac Desktop,Chrome,US
5,osr2jwljor,2010-01-01,2010-01-01 21:56:19,2010-01-02,-unknown-,,basic,0,en,other,other,omg,Web,Mac Desktop,Chrome,US
6,lsw9q7uk0j,2010-01-02,2010-01-02 01:25:58,2010-01-05,FEMALE,46.0,basic,0,en,other,craigslist,untracked,Web,Mac Desktop,Safari,US
7,0d01nltbrs,2010-01-03,2010-01-03 19:19:05,2010-01-13,FEMALE,47.0,basic,0,en,direct,direct,omg,Web,Mac Desktop,Safari,US
8,a1vcnhxeij,2010-01-04,2010-01-04 00:42:11,2010-07-29,FEMALE,50.0,basic,0,en,other,craigslist,untracked,Web,Mac Desktop,Safari,US
9,6uh8zyj2gn,2010-01-04,2010-01-04 02:37:58,2010-01-04,-unknown-,46.0,basic,0,en,other,craigslist,omg,Web,Mac Desktop,Firefox,US


## test_users.csv

In [5]:
z = zipfile.ZipFile(local_path + '/test_users.csv.zip')
df_testusers = pd.read_csv(z.open('test_users.csv'),parse_dates=[1,2])
df_testusers.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 62096 entries, 0 to 62095
Data columns (total 15 columns):
id                         62096 non-null object
date_account_created       62096 non-null datetime64[ns]
timestamp_first_active     62096 non-null datetime64[ns]
date_first_booking         0 non-null float64
gender                     62096 non-null object
age                        33220 non-null float64
signup_method              62096 non-null object
signup_flow                62096 non-null int64
language                   62096 non-null object
affiliate_channel          62096 non-null object
affiliate_provider         62096 non-null object
first_affiliate_tracked    62076 non-null object
signup_app                 62096 non-null object
first_device_type          62096 non-null object
first_browser              62096 non-null object
dtypes: datetime64[ns](2), float64(2), int64(1), object(10)
memory usage: 7.6+ MB


In [6]:
df_testusers.head()

Unnamed: 0,id,date_account_created,timestamp_first_active,date_first_booking,gender,age,signup_method,signup_flow,language,affiliate_channel,affiliate_provider,first_affiliate_tracked,signup_app,first_device_type,first_browser
0,5uwns89zht,2014-07-01,2014-07-01 00:00:06,,FEMALE,35.0,facebook,0,en,direct,direct,untracked,Moweb,iPhone,Mobile Safari
1,jtl0dijy2j,2014-07-01,2014-07-01 00:00:51,,-unknown-,,basic,0,en,direct,direct,untracked,Moweb,iPhone,Mobile Safari
2,xx0ulgorjt,2014-07-01,2014-07-01 00:01:48,,-unknown-,,basic,0,en,direct,direct,linked,Web,Windows Desktop,Chrome
3,6c6puo6ix0,2014-07-01,2014-07-01 00:02:15,,-unknown-,,basic,0,en,direct,direct,linked,Web,Windows Desktop,IE
4,czqhjk3yfe,2014-07-01,2014-07-01 00:03:05,,-unknown-,,basic,0,en,direct,direct,untracked,Web,Mac Desktop,Safari


## sessions.csv

**The sessions.csv file contains the following columns:**

user_id: to be joined with the column 'id' in users table

action

action_type

action_detail

device_type

secs_elapsed
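
Since user_id is meant to be joined with 'id' in the users table, and sessions has many rows per user, one plausible approach is to aggregate sessions to one row per user before merging. This is a sketch under assumptions: the toy values, the action names, and the feature names `n_actions` and `total_secs` are all hypothetical.

```python
import pandas as pd

# Toy stand-ins for the users and sessions tables; values are illustrative only.
users = pd.DataFrame({"id": ["u1", "u2", "u3"], "age": [35.0, None, 28.0]})
sessions = pd.DataFrame({
    "user_id": ["u1", "u1", "u2"],
    "action": ["search", "show", "search"],
    "secs_elapsed": [120.0, 30.0, None],
})

# One row per user: number of logged actions and total seconds elapsed.
per_user = (
    sessions.groupby("user_id")
    .agg(n_actions=("action", "size"), total_secs=("secs_elapsed", "sum"))
    .reset_index()
)

# Left join keeps users with no session rows; their action count becomes 0.
merged = users.merge(per_user, how="left", left_on="id", right_on="user_id")
merged = merged.drop(columns="user_id")
merged["n_actions"] = merged["n_actions"].fillna(0).astype(int)
print(merged)
```

A left join is used so that users without any logged sessions stay in the table instead of being silently dropped.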


Let's load the sessions.csv file and take a closer look at the columns.

In [8]:
z = zipfile.ZipFile(local_path + '/sessions.csv.zip')
df_sessions = pd.read_csv(z.open('sessions.csv'))
df_sessions.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 10567737 entries, 0 to 10567736
Data columns (total 6 columns):
user_id          object
action           object
action_type      object
action_detail    object
device_type      object
secs_elapsed     float64
dtypes: float64(1), object(5)
memory usage: 564.4+ MB


In [3]:
df_sessions.head()



In [4]:
# Flag whether the user supplied an age: 1 if 'age' is present, 0 if it is NaN.
# The fact that age is missing may itself be informative, so record it before any imputation.
df['entered_age'] = df['age'].notna().astype(int)

df.head(20)

Unnamed: 0,id,date_account_created,timestamp_first_active,date_first_booking,gender,age,signup_method,signup_flow,language,affiliate_channel,affiliate_provider,first_affiliate_tracked,signup_app,first_device_type,first_browser,entered_age
0,5uwns89zht,2014-07-01,2014-07-01 00:00:06,,FEMALE,35.0,facebook,0,en,direct,direct,untracked,Moweb,iPhone,Mobile Safari,1
1,jtl0dijy2j,2014-07-01,2014-07-01 00:00:51,,-unknown-,,basic,0,en,direct,direct,untracked,Moweb,iPhone,Mobile Safari,0
2,xx0ulgorjt,2014-07-01,2014-07-01 00:01:48,,-unknown-,,basic,0,en,direct,direct,linked,Web,Windows Desktop,Chrome,0
3,6c6puo6ix0,2014-07-01,2014-07-01 00:02:15,,-unknown-,,basic,0,en,direct,direct,linked,Web,Windows Desktop,IE,0
4,czqhjk3yfe,2014-07-01,2014-07-01 00:03:05,,-unknown-,,basic,0,en,direct,direct,untracked,Web,Mac Desktop,Safari,0
5,szx28ujmhf,2014-07-01,2014-07-01 00:03:36,,FEMALE,28.0,basic,0,en,sem-brand,google,omg,Web,Windows Desktop,Chrome,1
6,guenkfjcbq,2014-07-01,2014-07-01 00:05:14,,MALE,48.0,basic,25,en,direct,direct,untracked,iOS,iPhone,-unknown-,1
7,tkpq0mlugk,2014-07-01,2014-07-01 00:06:49,,-unknown-,,basic,0,en,direct,direct,untracked,Web,Mac Desktop,Chrome,0
8,3xtgd5p9dn,2014-07-01,2014-07-01 00:08:37,,-unknown-,,basic,0,en,direct,direct,untracked,Web,Mac Desktop,Chrome,0
9,md9aj22l5a,2014-07-01,2014-07-01 00:22:45,,-unknown-,,basic,0,en,sem-non-brand,google,omg,Web,Windows Desktop,Firefox,0


**Imputation, new columns in train_users_2 data**

The date columns are stored as full timestamps, which are awkward to use directly as model features. Let's start by extracting the month from each.

In [5]:
# create month columns
df['Month_Account_Created'] = df['date_account_created'].dt.month
df['Month_First_Active'] = df['timestamp_first_active'].dt.month
# Note: other date features could be tried here, e.g. hour of day. Maybe people who are
# first active at 4am choose different destinations than those first active at 1pm?
# Drop date_first_booking: it is empty for every user in the test set, so it cannot be a feature.
del df['date_first_booking']
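
The note in the cell above suggests trying other date-derived features. This is a small sketch of what that could look like, using the same `.dt` accessor on a toy frame. The column names `Hour_First_Active`, `Weekday_First_Active`, and `Year_Account_Created` are hypothetical, and whether any of them helps the model is untested here.

```python
import pandas as pd

# Toy frame with the two parsed date columns; values are illustrative only.
toy = pd.DataFrame({
    "date_account_created": pd.to_datetime(["2010-06-28", "2010-01-01"]),
    "timestamp_first_active": pd.to_datetime(
        ["2009-03-19 04:32:55", "2010-01-01 21:56:19"]
    ),
})

# Hour of first activity (0-23) and day of week (Monday=0 ... Sunday=6).
toy["Hour_First_Active"] = toy["timestamp_first_active"].dt.hour
toy["Weekday_First_Active"] = toy["timestamp_first_active"].dt.dayofweek

# Year of account creation, to capture growth over time alongside the month.
toy["Year_Account_Created"] = toy["date_account_created"].dt.year
print(toy)
```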