## New User Recruitment - Potential User Activation Identification

In this project, we want to identify first-time users who have the potential to make a booking. This matters because we want to increase user activation by targeting the users most likely to be activated. User activation occurs when a user has started using the product; in our case, when they have made a booking on Airbnb.

Data - We will use the Airbnb recruiting-new-user-bookings dataset from the Kaggle competition for this case study.

An important design decision in this project is determining what constitutes a user with the potential to make a booking. We define these as users whose account creation and first activity occurred within 72 hours (3 days) of each other. All other users are deemed low potential.
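
The definition above can be sketched on a few toy rows. This is a minimal illustration under stated assumptions, not the notebook's final labeling code: the column name `high_potential`, the constant `WINDOW`, and the sample values are hypothetical, and the 72-hour window is assumed to apply to the absolute gap because first activity can occur before or after account creation.

```python
import pandas as pd

# Toy data (hypothetical values) with the two columns the definition relies on.
users = pd.DataFrame({
    "id": ["a", "b", "c"],
    "date_account_created": pd.to_datetime(
        ["2010-01-01", "2010-01-10", "2010-06-28"]
    ),
    "timestamp_first_active": pd.to_datetime(
        ["2010-01-01 21:56:19", "2010-01-05 08:00:00", "2009-03-19 04:32:55"]
    ),
})

# Assumption: the window is measured on the absolute gap, in either direction.
WINDOW = pd.Timedelta(hours=72)
gap = (users["timestamp_first_active"] - users["date_account_created"]).abs()

# 1 = high potential (gap within 72 hours), 0 = low potential.
users["high_potential"] = (gap <= WINDOW).astype(int)

print(users[["id", "high_potential"]])
```

Only user `a` falls inside the window here; users `b` and `c` have gaps of several days and more than a year, respectively.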

#### Data
* train_users.csv - the training set of users
* test_users.csv - the test set of users
* id: user id
* date_account_created: the date of account creation
* timestamp_first_active: timestamp of the first activity, note that it can be earlier than date_account_created or date_first_booking because a user can search before signing up
* date_first_booking: date of first booking
* gender
* age
* signup_method
* signup_flow: the page a user came to sign up from
* language: international language preference
* affiliate_channel: what kind of paid marketing
* affiliate_provider: where the marketing is e.g. google, craigslist, other
* first_affiliate_tracked: the first marketing the user interacted with before signing up
* signup_app
* first_device_type
* first_browser
* country_destination: this is the target variable you are to predict

## EDA and Preprocessing

In [1]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

In [2]:
train_df = pd.read_csv('train_users_2.csv')
test_df = pd.read_csv('test_users.csv')

In [3]:
train_df.head(10)

Unnamed: 0,id,date_account_created,timestamp_first_active,date_first_booking,gender,age,signup_method,signup_flow,language,affiliate_channel,affiliate_provider,first_affiliate_tracked,signup_app,first_device_type,first_browser,country_destination
0,gxn3p5htnn,2010-06-28,20090319043255,,-unknown-,,facebook,0,en,direct,direct,untracked,Web,Mac Desktop,Chrome,NDF
1,820tgsjxq7,2011-05-25,20090523174809,,MALE,38.0,facebook,0,en,seo,google,untracked,Web,Mac Desktop,Chrome,NDF
2,4ft3gnwmtx,2010-09-28,20090609231247,2010-08-02,FEMALE,56.0,basic,3,en,direct,direct,untracked,Web,Windows Desktop,IE,US
3,bjjt8pjhuk,2011-12-05,20091031060129,2012-09-08,FEMALE,42.0,facebook,0,en,direct,direct,untracked,Web,Mac Desktop,Firefox,other
4,87mebub9p4,2010-09-14,20091208061105,2010-02-18,-unknown-,41.0,basic,0,en,direct,direct,untracked,Web,Mac Desktop,Chrome,US
5,osr2jwljor,2010-01-01,20100101215619,2010-01-02,-unknown-,,basic,0,en,other,other,omg,Web,Mac Desktop,Chrome,US
6,lsw9q7uk0j,2010-01-02,20100102012558,2010-01-05,FEMALE,46.0,basic,0,en,other,craigslist,untracked,Web,Mac Desktop,Safari,US
7,0d01nltbrs,2010-01-03,20100103191905,2010-01-13,FEMALE,47.0,basic,0,en,direct,direct,omg,Web,Mac Desktop,Safari,US
8,a1vcnhxeij,2010-01-04,20100104004211,2010-07-29,FEMALE,50.0,basic,0,en,other,craigslist,untracked,Web,Mac Desktop,Safari,US
9,6uh8zyj2gn,2010-01-04,20100104023758,2010-01-04,-unknown-,46.0,basic,0,en,other,craigslist,omg,Web,Mac Desktop,Firefox,US


In [4]:
train_df.shape, test_df.shape

((213451, 16), (62096, 15))

In [5]:
train_df.columns

Index(['id', 'date_account_created', 'timestamp_first_active',
       'date_first_booking', 'gender', 'age', 'signup_method', 'signup_flow',
       'language', 'affiliate_channel', 'affiliate_provider',
       'first_affiliate_tracked', 'signup_app', 'first_device_type',
       'first_browser', 'country_destination'],
      dtype='object')

Remove the country_destination column from train_df and combine the train and test datasets.

In [6]:
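# Pass the column by keyword; the positional axis argument to drop() was removed in pandas 2.0
booking = pd.concat([train_df.drop(columns='country_destination'), test_df])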
booking.shape

(275547, 15)

In [7]:
booking.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 275547 entries, 0 to 62095
Data columns (total 15 columns):
id                         275547 non-null object
date_account_created       275547 non-null object
timestamp_first_active     275547 non-null int64
date_first_booking         88908 non-null object
gender                     275547 non-null object
age                        158681 non-null float64
signup_method              275547 non-null object
signup_flow                275547 non-null int64
language                   275547 non-null object
affiliate_channel          275547 non-null object
affiliate_provider         275547 non-null object
first_affiliate_tracked    269462 non-null object
signup_app                 275547 non-null object
first_device_type          275547 non-null object
first_browser              275547 non-null object
dtypes: float64(1), int64(2), object(12)
memory usage: 33.6+ MB


We need to convert the date and timestamp columns to datetime objects.

In [8]:
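# ISO-formatted date strings parse directly; missing bookings become NaT
booking['date_account_created'] = pd.to_datetime(booking['date_account_created'])
# timestamp_first_active is an integer such as 20090319043255, so state its format explicitly
booking['timestamp_first_active'] = pd.to_datetime(booking['timestamp_first_active'].astype(str), format='%Y%m%d%H%M%S')
booking['date_first_booking'] = pd.to_datetime(booking['date_first_booking'])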

In [9]:
booking.head()

Unnamed: 0,id,date_account_created,timestamp_first_active,date_first_booking,gender,age,signup_method,signup_flow,language,affiliate_channel,affiliate_provider,first_affiliate_tracked,signup_app,first_device_type,first_browser
0,gxn3p5htnn,2010-06-28,2009-03-19 04:32:55,NaT,-unknown-,,facebook,0,en,direct,direct,untracked,Web,Mac Desktop,Chrome
1,820tgsjxq7,2011-05-25,2009-05-23 17:48:09,NaT,MALE,38.0,facebook,0,en,seo,google,untracked,Web,Mac Desktop,Chrome
2,4ft3gnwmtx,2010-09-28,2009-06-09 23:12:47,2010-08-02,FEMALE,56.0,basic,3,en,direct,direct,untracked,Web,Windows Desktop,IE
3,bjjt8pjhuk,2011-12-05,2009-10-31 06:01:29,2012-09-08,FEMALE,42.0,facebook,0,en,direct,direct,untracked,Web,Mac Desktop,Firefox
4,87mebub9p4,2010-09-14,2009-12-08 06:11:05,2010-02-18,-unknown-,41.0,basic,0,en,direct,direct,untracked,Web,Mac Desktop,Chrome


In [10]:
booking_delta = booking.copy()
booking_delta['delta_account_created'] = (booking_delta.date_first_booking - booking_delta.date_account_created)
booking_delta['delta_first_active'] = (booking_delta.date_first_booking - booking_delta.timestamp_first_active)
booking_delta['delta_account_active'] = (booking_delta.timestamp_first_active - booking_delta.date_account_created)

* delta_account_created > 0 means the account was created before the first booking
* delta_first_active > 0 means the first activity was done before the first booking
* delta_account_active > 0 means the account was created before the first activity

* if delta_account_created or delta_first_active is missing (NaT/NaN), the user has no recorded first booking
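
As a quick check of these sign conventions, the sketch below recomputes the three deltas on two rows taken from the `train_df.head(10)` output above (users `lsw9q7uk0j` and `gxn3p5htnn`). The frame name `toy` is hypothetical; the column names match the notebook.

```python
import pandas as pd

# Two rows from the preview: one user who booked, one who never booked.
toy = pd.DataFrame({
    "date_account_created": pd.to_datetime(["2010-01-02", "2010-06-28"]),
    "timestamp_first_active": pd.to_datetime(
        ["2010-01-02 01:25:58", "2009-03-19 04:32:55"]
    ),
    "date_first_booking": pd.to_datetime(["2010-01-05", None]),
})

toy["delta_account_created"] = toy["date_first_booking"] - toy["date_account_created"]
toy["delta_first_active"] = toy["date_first_booking"] - toy["timestamp_first_active"]
toy["delta_account_active"] = toy["timestamp_first_active"] - toy["date_account_created"]

print(toy[["delta_account_created", "delta_first_active", "delta_account_active"]])
```

The first user has positive values in all three columns (account created, then first activity, then booking). The second user has NaT for both booking deltas and a negative delta_account_active, because the first activity predates the account by more than a year.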

We want to identify the customers who showed interest in booking a reservation.

In [11]:
# Reduce each timedelta column to whole days (NaT becomes NaN)
for col in ['delta_account_created', 'delta_first_active', 'delta_account_active']:
    booking_delta[col] = booking_delta[col].dt.days
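
One detail worth keeping in mind when reducing timedeltas to whole days: the `days` component is floored toward negative infinity, so a gap of minus one hour is reported as -1 day, not 0. The sketch below shows this on hypothetical values, along with dividing by a one-day `Timedelta` as one possible way to keep fractional days; whether that precision matters for the 72-hour rule is a judgment call.

```python
import pandas as pd

# Hypothetical gaps: 1.5 days, minus one hour, and a missing value.
deltas = pd.Series(
    [pd.Timedelta("1 days 12:00:00"), pd.Timedelta(hours=-1), pd.NaT],
    dtype="timedelta64[ns]",
)

# Whole-day component: floored, so -1 hour becomes -1 day; NaT becomes NaN.
whole_days = deltas.dt.days

# Fractional days: keeps the sub-day information.
fractional_days = deltas / pd.Timedelta(days=1)

print(pd.DataFrame({"whole_days": whole_days, "fractional_days": fractional_days}))
```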

In [46]:
booking_interest = booking_delta.copy()
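# booked = 1 if the user has a first-booking date, else 0
# (the original tested pd.isnull(pd.NaT), which is always True, so every row was 0)
booking_interest['booked'] = booking_delta['date_first_booking'].notnull().astype(int)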