## New User Recruitment - Identifying Potential Users to be Converted to Customers

In this project, we want to identify first-time users who are likely to become (quality) customers. Lead Velocity Rate is a SaaS metric that tracks how many users you are working to convert into actual customers.


This matters because these users have not yet committed to the product. By correctly identifying which new users have the potential to become actual customers, companies like AirBnB can run personalized promotions to convert them, lowering the risk of losing potential customers to competitors.

Terminology:
* User activation is when a user has started using the product; in our case, when the user makes a booking on AirBnB.
* Lead Velocity Rate is the number of users you are working to convert into actual customers, compared with the previous month.
* A User is someone who has interacted with the site.
* Potential Customers are users who are likely to be converted into customers.
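
To make the Lead Velocity Rate definition concrete, here is a minimal sketch. It assumes the common month-over-month percentage formulation, which this notebook does not spell out. The function name `lead_velocity_rate` and the lead counts are hypothetical and used only for illustration.

```python
def lead_velocity_rate(leads_this_month: int, leads_last_month: int) -> float:
    """Month-over-month percentage growth in leads (assumed formulation)."""
    if leads_last_month <= 0:
        raise ValueError("leads_last_month must be positive")
    return (leads_this_month - leads_last_month) / leads_last_month * 100


# Hypothetical counts, for illustration only (not taken from the dataset).
lvr = lead_velocity_rate(leads_this_month=120, leads_last_month=100)
print(f"LVR: {lvr:.1f}%")
```

A positive value means more potential customers are in the pipeline than in the previous month.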

Data - We will use the AirBnB recruiting-new-user-bookings [dataset](https://www.kaggle.com/c/airbnb-recruiting-new-user-bookings/data) from the Kaggle competition for this case study.

An important design decision in this project is how to define (quality) potential customers. We define them as users who created an account and had their first activity within 72 hours (3 days) of each other. All other users are deemed low potential. The classification problem is then to identify whether a user is a target.

* Design Reasoning: To book a reservation on AirBnB, you must have an account. To single out the users with the most potential, we want to identify users who have both searched and created an account on AirBnB.

#### Data
* train_users_2.csv - the training set of users
* test_users.csv - the test set of users
* id: user id
* date_account_created: the date of account creation
* timestamp_first_active: timestamp of the first activity, note that it can be earlier than date_account_created or date_first_booking because a user can search before signing up
* date_first_booking: date of first booking
* gender
* age
* signup_method
* signup_flow: the page a user came to sign up from
* language: international language preference
* affiliate_channel: what kind of paid marketing
* affiliate_provider: where the marketing is e.g. google, craigslist, other
* first_affiliate_tracked: the first marketing the user interacted with before signing up
* signup_app
* first_device_type
* first_browser
* country_destination: this is the target variable you are to predict

## EDA and Preprocessing

In [2]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

In [3]:
train_df = pd.read_csv('train_users_2.csv')
test_df = pd.read_csv('test_users.csv')

In [4]:
train_df.shape, test_df.shape

((213451, 16), (62096, 15))

In [5]:
train_df.columns

Index(['id', 'date_account_created', 'timestamp_first_active',
       'date_first_booking', 'gender', 'age', 'signup_method', 'signup_flow',
       'language', 'affiliate_channel', 'affiliate_provider',
       'first_affiliate_tracked', 'signup_app', 'first_device_type',
       'first_browser', 'country_destination'],
      dtype='object')

#### Remove the country_destination column from train_df and combine both train and test datasets

In [6]:
# drop the target column (axis must be passed by keyword in recent pandas versions)
booking = pd.concat([train_df.drop('country_destination', axis=1), test_df])
booking.shape

(275547, 15)

In [7]:
booking.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 275547 entries, 0 to 62095
Data columns (total 15 columns):
id                         275547 non-null object
date_account_created       275547 non-null object
timestamp_first_active     275547 non-null int64
date_first_booking         88908 non-null object
gender                     275547 non-null object
age                        158681 non-null float64
signup_method              275547 non-null object
signup_flow                275547 non-null int64
language                   275547 non-null object
affiliate_channel          275547 non-null object
affiliate_provider         275547 non-null object
first_affiliate_tracked    269462 non-null object
signup_app                 275547 non-null object
first_device_type          275547 non-null object
first_browser              275547 non-null object
dtypes: float64(1), int64(2), object(12)
memory usage: 33.6+ MB


#### We need to convert the datetime columns to datetime objects

In [8]:
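# the date columns are date strings; timestamp_first_active is an integer in YYYYMMDDHHMMSS form
booking['date_account_created'] = pd.to_datetime(booking['date_account_created'])
booking['timestamp_first_active'] = pd.to_datetime(booking['timestamp_first_active'].astype(str), format='%Y%m%d%H%M%S')
# users who never booked have a missing date, which becomes NaT
booking['date_first_booking'] = pd.to_datetime(booking['date_first_booking'])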

#### We want to create a column that calculates the time delta between account made and first activity
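
The cell that builds these columns is not shown in this notebook, so the following is a minimal sketch under stated assumptions. It assumes `time_delta` is `timestamp_first_active` minus `date_account_created` and `time_book_delta` is `date_first_booking` minus `date_account_created`, which is consistent with the sample rows printed later. The toy frame `sample` is a hypothetical stand-in for `booking`, using three of those rows.

```python
import pandas as pd

# Three rows copied from the sample output shown later in this notebook.
sample = pd.DataFrame({
    "date_account_created": pd.to_datetime(
        ["2010-01-01", "2010-01-02", "2010-01-03"]),
    "timestamp_first_active": pd.to_datetime(
        ["2010-01-01 21:56:19", "2010-01-02 01:25:58", "2010-01-03 19:19:05"]),
    "date_first_booking": pd.to_datetime(
        ["2010-01-02", "2010-01-05", "2010-01-13"]),
})

# Assumed definition: time between account creation and first activity.
sample["time_delta"] = (
    sample["timestamp_first_active"] - sample["date_account_created"])

# Assumed definition: time between account creation and first booking.
sample["time_book_delta"] = (
    sample["date_first_booking"] - sample["date_account_created"])

print(sample[["time_delta", "time_book_delta"]])
```

Subtracting two datetime columns yields a timedelta column, so `.days` is available on each element (or `.dt.days` on the whole column).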

#### We want to create our target column, which will be 1 if the user has potential
* We define a potential customer as someone who created an account and had their first activity within 24 hours of each other

In [123]:
booking['potential_customer'] = [1 if time.days == 0  else 0 for time in booking['time_delta']]

#### We want to filter the dataset down to customers who created an account and had their first activity within the same day

In [130]:
booking_filter = booking[booking['potential_customer'] == 1]
booking_filter.time_book_delta.isnull().sum()

186590

#### We will define customers who took more than 3 days to make a booking as at risk of being lost to competitors

In [145]:
booking_at_risk = booking_filter.copy()
booking_at_risk['at_risk'] = [1 if time.days > 3 else 0 for time in booking_at_risk['time_book_delta']]

In [148]:
booking_at_risk.head()

Unnamed: 0,id,date_account_created,timestamp_first_active,date_first_booking,gender,age,signup_method,signup_flow,language,affiliate_channel,affiliate_provider,first_affiliate_tracked,signup_app,first_device_type,first_browser,time_delta,time_book_delta,potential_customer,at_risk
5,osr2jwljor,2010-01-01,2010-01-01 21:56:19,2010-01-02,-unknown-,,basic,0,en,other,other,omg,Web,Mac Desktop,Chrome,21:56:19,1 days,1,0
6,lsw9q7uk0j,2010-01-02,2010-01-02 01:25:58,2010-01-05,FEMALE,46.0,basic,0,en,other,craigslist,untracked,Web,Mac Desktop,Safari,01:25:58,3 days,1,0
7,0d01nltbrs,2010-01-03,2010-01-03 19:19:05,2010-01-13,FEMALE,47.0,basic,0,en,direct,direct,omg,Web,Mac Desktop,Safari,19:19:05,10 days,1,1
8,a1vcnhxeij,2010-01-04,2010-01-04 00:42:11,2010-07-29,FEMALE,50.0,basic,0,en,other,craigslist,untracked,Web,Mac Desktop,Safari,00:42:11,206 days,1,1
9,6uh8zyj2gn,2010-01-04,2010-01-04 02:37:58,2010-01-04,-unknown-,46.0,basic,0,en,other,craigslist,omg,Web,Mac Desktop,Firefox,02:37:58,0 days,1,0


In [147]:
booking_at_risk.at_risk.value_counts()

0    233290
1     42090
Name: at_risk, dtype: int64

In [141]:
booking_filter.time_book_delta.iloc[3].days

206

We also want to fill in 0 for the customers who did not book a reservation

In [None]:
# booking_filter is a slice of booking, so reassign instead of filling in place
booking_filter = booking_filter.fillna(0)

In [63]:
booking.date_first_booking.notnull().sum()

88908

In [70]:
booking.potential_customer.value_counts()

1    275376
0       171
Name: potential_customer, dtype: int64

* delta_account_created > 0 means the account was created before the first booking
* delta_first_active > 0 means the first activity was done before the first booking
* delta_account_active > 0 means the account was created before the first activity

* If delta_account_created or delta_first_active is NaN, the user never booked a listing
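
The cell that creates `booking_delta` is not shown, so here is a minimal sketch of how the three delta columns could be derived from the sign conventions in the bullets above. The subtraction order is an assumption inferred from those bullets. The toy frame `toy` is hypothetical: its first two rows reuse values from the sample output above, and the third row's booking date is deliberately left missing to illustrate a user who never booked.

```python
import pandas as pd

toy = pd.DataFrame({
    "date_account_created": pd.to_datetime(
        ["2010-01-01", "2010-01-04", "2010-01-04"]),
    "timestamp_first_active": pd.to_datetime(
        ["2010-01-01 21:56:19", "2010-01-04 00:42:11", "2010-01-04 02:37:58"]),
    # None becomes NaT: this user never booked (illustrative only).
    "date_first_booking": pd.to_datetime(["2010-01-02", "2010-07-29", None]),
})

booking_delta = toy.copy()

# > 0 when the account was created before the first booking.
booking_delta["delta_account_created"] = (
    toy["date_first_booking"] - toy["date_account_created"]).dt.days

# > 0 when the first activity happened before the first booking.
booking_delta["delta_first_active"] = (
    toy["date_first_booking"] - toy["timestamp_first_active"]).dt.days

# > 0 when the account was created before the first activity.
booking_delta["delta_account_active"] = (
    toy["timestamp_first_active"] - toy["date_account_created"]).dt.days

print(booking_delta.filter(like="delta_"))
```

Because `.dt.days` counts whole days only, a gap of less than 24 hours shows up as 0 rather than as a positive value.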

We want to identify the customers who showed interest in booking a reservation.

In [11]:
# keep only the whole-day component of each timedelta column (NaT becomes NaN)
booking_delta['delta_account_created'] = booking_delta['delta_account_created'].dt.days
booking_delta['delta_first_active'] = booking_delta['delta_first_active'].dt.days
booking_delta['delta_account_active'] = booking_delta['delta_account_active'].dt.days

In [46]:
booking_interest = booking_delta.copy()
# 1 if the user has a first booking date, 0 if it is missing (NaT)
booking_interest['booked'] = [0 if pd.isnull(x) else 1 for x in booking_delta['date_first_booking']]