## Import required libraries
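We need to import the pandas, numpy and sklearn libraries. From sklearn, we import the preprocessing tools, the classifiers and the metrics used below. The imputer class fills in missing values; it was called Imputer in older scikit-learn releases and is SimpleImputer (in sklearn.impute) from version 0.20 onward.


In [2]:
# Required Python Machine learning Packages
import pandas as pd
import numpy as np
# For preprocessing the data
# (Imputer was removed from sklearn.preprocessing in scikit-learn 0.22; SimpleImputer replaces it)
from sklearn.impute import SimpleImputer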
from sklearn import preprocessing
# To split the dataset into train and test datasets
from sklearn.model_selection import train_test_split
# To model the Gaussian Naive Bayes classifier
from sklearn.naive_bayes import GaussianNB
# To model the Logistic Regression classifier
from sklearn.linear_model import LogisticRegression
# To model the Decision Tree classifier
from sklearn.tree import DecisionTreeClassifier
# To model the Random Forest classifier
from sklearn.ensemble import RandomForestClassifier
# To model the Linear SVC classifier
from sklearn.svm import LinearSVC
# To model the K Nearest Neighbors classifier
from sklearn.neighbors import KNeighborsClassifier
# To calculate the accuracy score of the model
from sklearn.metrics import accuracy_score
# To generate classification report
from sklearn.metrics import classification_report

# To display graphs
import matplotlib.pyplot as plt
import seaborn as sns
# Draw inline
%matplotlib inline

## Data Import

To import the data we use pandas' read_csv() method, a simple way to load a CSV file into a dataframe.

We pass three parameters. The first, ‘train_users_2.csv’, is the file name. The delimiter parameter tells pandas what separates the values. Here we use the regular expression ‘ *, *’, which matches a comma together with any spaces before and after it, so stray spaces around the data values are removed. This is very helpful when spaces are used inconsistently in the file. Because the delimiter is a regular expression, we also pass engine='python'; the default C parser does not support regex delimiters.

In [3]:
user_df = pd.read_csv('train_users_2.csv', delimiter=' *, *', engine='python')
user_df.head()

Unnamed: 0,id,date_account_created,timestamp_first_active,date_first_booking,gender,age,signup_method,signup_flow,language,affiliate_channel,affiliate_provider,first_affiliate_tracked,signup_app,first_device_type,first_browser,country_destination
0,gxn3p5htnn,2010-06-28,20090319043255,,-unknown-,,facebook,0,en,direct,direct,untracked,Web,Mac Desktop,Chrome,NDF
1,820tgsjxq7,2011-05-25,20090523174809,,MALE,38.0,facebook,0,en,seo,google,untracked,Web,Mac Desktop,Chrome,NDF
2,4ft3gnwmtx,2010-09-28,20090609231247,2010-08-02,FEMALE,56.0,basic,3,en,direct,direct,untracked,Web,Windows Desktop,IE,US
3,bjjt8pjhuk,2011-12-05,20091031060129,2012-09-08,FEMALE,42.0,facebook,0,en,direct,direct,untracked,Web,Mac Desktop,Firefox,other
4,87mebub9p4,2010-09-14,20091208061105,2010-02-18,-unknown-,41.0,basic,0,en,direct,direct,untracked,Web,Mac Desktop,Chrome,US
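
To see what the ' *, *' delimiter does, here is a minimal sketch on a small in-memory CSV. The sample text and the names raw_csv and demo_df are made up for illustration; only read_csv() and its delimiter and engine parameters come from the cell above.

```python
import io
import pandas as pd

# Hypothetical sample with inconsistent spacing around the commas
raw_csv = "id , gender ,age\nu1 , MALE , 38\nu2,FEMALE,56\n"

# The regex ' *, *' matches a comma plus any spaces on either side of it
demo_df = pd.read_csv(io.StringIO(raw_csv), delimiter=' *, *', engine='python')

print(demo_df.columns.tolist())
print(demo_df)
```

Both the column names and the values come back without the surrounding spaces, even though the two data rows are spaced differently.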


## Handling Missing Data

We need to check whether our dataset contains any empty values. We can do this with the isnull() method.

In [4]:
user_df.isnull().sum()

id                              0
date_account_created            0
timestamp_first_active          0
date_first_booking         124543
gender                          0
age                         87990
signup_method                   0
signup_flow                     0
language                        0
affiliate_channel               0
affiliate_provider              0
first_affiliate_tracked      6065
signup_app                      0
first_device_type               0
first_browser                   0
country_destination             0
dtype: int64

The above output shows that there are null (NaN) values in 'date_first_booking', 'age' and 'first_affiliate_tracked'.

Also, scanning through the dataset we can see the value "-unknown-" in some columns.

Sometimes "-unknown-" is used in place of a missing value, so let's test whether any categorical attribute contains it. The code snippet below counts the "-unknown-" values in each categorical column of our user_df data frame.

In [5]:
for value in ['gender', 'signup_method',
          'language', 'affiliate_channel',
          'affiliate_provider','first_affiliate_tracked', 'signup_app',
          'first_device_type', 'first_browser']:
    print(value,":", sum(user_df[value] == '-unknown-'))

gender : 95688
signup_method : 0
language : 0
affiliate_channel : 0
affiliate_provider : 0
first_affiliate_tracked : 0
signup_app : 0
first_device_type : 0
first_browser : 27266


The output of the above code snippet shows that the gender and first_browser attributes contain "-unknown-" values. We transform those values into NaN with the code below.

In [6]:
user_df['gender'] = user_df['gender'].replace('-unknown-', np.nan)
user_df['first_browser'] = user_df['first_browser'].replace('-unknown-', np.nan)

Now we can check the percentage of missing data in each column.

In [7]:
user_nan = (user_df.isnull().sum() / user_df.shape[0]) * 100
user_nan[user_nan > 0]

date_first_booking         58.347349
gender                     44.829024
age                        41.222576
first_affiliate_tracked     2.841402
first_browser              12.773892
dtype: float64
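
The two steps above (turning the placeholder into NaN, then measuring the share of missing values) can be sketched on a tiny made-up frame. The name toy_df and its values are illustrative only.

```python
import numpy as np
import pandas as pd

# Toy data (made up): '-unknown-' stands in for a missing gender
toy_df = pd.DataFrame({
    'gender': ['-unknown-', 'MALE', 'FEMALE', '-unknown-'],
    'age': [np.nan, 38.0, 56.0, 41.0],
})

# Turn the placeholder into a real NaN so that isnull() can see it
toy_df['gender'] = toy_df['gender'].replace('-unknown-', np.nan)

# The mean of a boolean mask is the fraction of True values
missing_pct = toy_df.isnull().mean() * 100
print(missing_pct)
```

Here gender is 50% missing and age is 25% missing; isnull().mean() gives the same figure as dividing isnull().sum() by the number of rows.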


## Data preprocessing

For preprocessing, we make a duplicate copy of our original dataframe: user_df is copied to the user_df_rev dataframe. Plain assignment would only create a second name for the same dataframe, so we call copy().

In [8]:
user_df_rev = user_df.copy()
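
Note that plain assignment (new = old) does not duplicate a dataframe; it only gives the same object a second name, while copy() creates an independent one. A small sketch of the difference, using made-up frames named original, alias and duplicate:

```python
import pandas as pd

original = pd.DataFrame({'id': ['a', 'b'], 'age': [38, 56]})

alias = original             # second name for the same object
duplicate = original.copy()  # independent copy of the data

# Dropping a column in place through the alias also changes `original`
alias.drop('id', axis=1, inplace=True)

print(alias is original)           # True
print(original.columns.tolist())   # ['age']
print(duplicate.columns.tolist())  # ['id', 'age']
```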

We can remove the 'id' column since we are not interested in it. We also remove 'date_first_booking', since it is empty in the test dataset.

In [9]:
# Remove unused column
user_df_rev.drop('id', axis=1, inplace=True)
user_df_rev.drop('date_first_booking', axis=1, inplace=True)

We can convert the date columns to datetime values so that they are easier to read and work with.

In [10]:
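user_df_rev['date_account_created'] = pd.to_datetime(user_df_rev['date_account_created'], format='%Y-%m-%d')
# timestamp_first_active is read as an integer such as 20090319043255, so convert it to text before parsing
user_df_rev['timestamp_first_active'] = pd.to_datetime(user_df_rev['timestamp_first_active'].astype(str), format='%Y%m%d%H%M%S')
# Fall back to the first-active timestamp where the account creation date is missing
user_df_rev['date_account_created'] = user_df_rev['date_account_created'].fillna(user_df_rev['timestamp_first_active'])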
user_df_rev.head()

Unnamed: 0,date_account_created,timestamp_first_active,gender,age,signup_method,signup_flow,language,affiliate_channel,affiliate_provider,first_affiliate_tracked,signup_app,first_device_type,first_browser,country_destination
0,2010-06-28,2009-03-19 04:32:55,,,facebook,0,en,direct,direct,untracked,Web,Mac Desktop,Chrome,NDF
1,2011-05-25,2009-05-23 17:48:09,MALE,38.0,facebook,0,en,seo,google,untracked,Web,Mac Desktop,Chrome,NDF
2,2010-09-28,2009-06-09 23:12:47,FEMALE,56.0,basic,3,en,direct,direct,untracked,Web,Windows Desktop,IE,US
3,2011-12-05,2009-10-31 06:01:29,FEMALE,42.0,facebook,0,en,direct,direct,untracked,Web,Mac Desktop,Firefox,other
4,2010-09-14,2009-12-08 06:11:05,,41.0,basic,0,en,direct,direct,untracked,Web,Mac Desktop,Chrome,US


We can use the describe() method to generate various summary statistics, excluding NaN values. We pass the “include” parameter with the value “all” to specify that we want summary statistics for all the attributes.

In [11]:
user_df_rev.describe(include= 'all')

Unnamed: 0,date_account_created,timestamp_first_active,gender,age,signup_method,signup_flow,language,affiliate_channel,affiliate_provider,first_affiliate_tracked,signup_app,first_device_type,first_browser,country_destination
count,213451,213451,117763,125461.0,213451,213451.0,213451,213451,213451,207386,213451,213451,186185,213451
unique,1634,213451,3,,3,,25,8,18,7,4,9,51,12
top,2014-05-13 00:00:00,2013-07-01 05:26:34,FEMALE,,basic,,en,direct,direct,untracked,Web,Mac Desktop,Chrome,NDF
freq,674,1,63041,,152897,,206314,137727,137426,109232,182717,89600,63845,124543
first,2010-01-01 00:00:00,2009-03-19 04:32:55,,,,,,,,,,,,
last,2014-06-30 00:00:00,2014-06-30 23:58:24,,,,,,,,,,,,
mean,,,,49.668335,,3.267387,,,,,,,,
std,,,,155.666612,,7.637707,,,,,,,,
min,,,,1.0,,0.0,,,,,,,,
25%,,,,28.0,,0.0,,,,,,,,


Among the remaining features, age has one of the highest rates of NaN. Let's look at it:

In [12]:
user_df_rev.age.describe()

count    125461.000000
mean         49.668335
std         155.666612
min           1.000000
25%          28.000000
50%          34.000000
75%          43.000000
max        2014.000000
Name: age, dtype: float64

The maximum age in the dataset is 2014! The data must contain outliers or wrongly entered ages (2014 is possibly a year entered by mistake). The longest confirmed human lifespan is 122 years, and according to the Airbnb eligibility terms a user must be at least 18 years old. Let's see what fraction of the users fall outside these limits.

In [13]:
print(sum(user_df_rev.age > 122)/len(user_df_rev))
print(sum(user_df_rev.age < 18)/len(user_df_rev))

0.0036589193772809687
0.0007402167242130512


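We will replace these ages with NaN. For the cleaning we use slightly different limits than in the check above: we assume that the maximum age for travel is 90 and the minimum is 15. The missing ages are then filled with -1 so that the classifiers receive a numeric value.

In [14]:
user_df_rev.loc[user_df_rev.age > 90, 'age'] = np.nan
user_df_rev.loc[user_df_rev.age < 15, 'age'] = np.nan
user_df_rev['age'] = user_df_rev['age'].fillna(-1)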
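
The same cleaning rule can be written as a single mask. This is a minimal sketch on a made-up Series named ages, assuming the 15 to 90 limits used above; it is an alternative formulation, not the notebook's code.

```python
import numpy as np
import pandas as pd

ages = pd.Series([5.0, 38.0, 2014.0, np.nan, 90.0, 15.0])

# Keep ages inside [15, 90]; everything else (including NaN) becomes NaN
cleaned = ages.where(ages.between(15, 90))

# Replace the missing ages with the sentinel value -1
cleaned = cleaned.fillna(-1)
print(cleaned.tolist())
```

between() is inclusive on both ends by default, so 15 and 90 are kept, matching the strict comparisons (> 90 and < 15) in the cell above.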

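## Label Encoding
Replace all categorical data (data that is in categories, not numbers) with integer codes using LabelEncoder. Note that this is label encoding rather than one-hot encoding: each category becomes a single integer, which imposes an arbitrary order on the categories.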

In [15]:
le = preprocessing.LabelEncoder()
gender_cat = le.fit_transform(user_df.gender.astype(str))
signup_method_cat = le.fit_transform(user_df.signup_method)
language_cat   = le.fit_transform(user_df.language)
affiliate_channel_cat = le.fit_transform(user_df.affiliate_channel)
affiliate_provider_cat = le.fit_transform(user_df.affiliate_provider)
first_affiliate_tracked_cat = le.fit_transform(user_df.first_affiliate_tracked.astype(str))
signup_app_cat = le.fit_transform(user_df.signup_app)
first_device_type_cat = le.fit_transform(user_df.first_device_type)
first_browser_cat = le.fit_transform(user_df.first_browser.astype(str))
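
LabelEncoder assigns one integer per category, which is label encoding; one-hot encoding would instead create one indicator column per category. A minimal sketch of the two on a made-up gender column (the names gender, label_codes and one_hot are illustrative):

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

gender = pd.Series(['MALE', 'FEMALE', 'OTHER', 'MALE'], name='gender')

# Label encoding: one integer per category, assigned in sorted order
label_codes = LabelEncoder().fit_transform(gender)
print(label_codes.tolist())

# One-hot encoding: one indicator column per category
one_hot = pd.get_dummies(gender, prefix='gender')
print(one_hot.columns.tolist())
```

Integer codes suit tree-based models reasonably well, while models that treat features as magnitudes (for example logistic regression) may read the codes as an ordering that does not exist; that is the usual argument for one-hot columns.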

In [16]:
#initialize the encoded categorical columns
user_df_rev['gender_cat'] = gender_cat
user_df_rev['signup_method_cat'] = signup_method_cat
user_df_rev['language_cat'] = language_cat
user_df_rev['affiliate_channel_cat'] = affiliate_channel_cat
user_df_rev['affiliate_provider_cat'] = affiliate_provider_cat
user_df_rev['first_affiliate_tracked_cat'] = first_affiliate_tracked_cat
user_df_rev['signup_app_cat'] = signup_app_cat
user_df_rev['first_device_type_cat'] = first_device_type_cat
user_df_rev['first_browser_cat'] = first_browser_cat

In [17]:
#drop the old categorical columns from dataframe
dummy_fields = ['gender', 'signup_method', 'language', 
                  'affiliate_channel', 'affiliate_provider', 'first_affiliate_tracked',
                  'signup_app', 'first_device_type', 'first_browser']
user_df_rev = user_df_rev.drop(dummy_fields, axis = 1)

In [18]:
user_df_rev = user_df_rev.reindex(['date_account_created', 'timestamp_first_active',
                                   'gender_cat', 'age', 'signup_method_cat', 'language_cat',
                                   'affiliate_channel_cat', 'affiliate_provider_cat', 'first_affiliate_tracked_cat',
                                   'signup_app_cat', 'first_device_type_cat', 'first_browser_cat', 
                                   'country_destination'], axis= 1)
 
user_df_rev.head()

Unnamed: 0,date_account_created,timestamp_first_active,gender_cat,age,signup_method_cat,language_cat,affiliate_channel_cat,affiliate_provider_cat,first_affiliate_tracked_cat,signup_app_cat,first_device_type_cat,first_browser_cat,country_destination
0,2010-06-28,2009-03-19 04:32:55,3,-1.0,1,5,2,4,7,2,3,7,NDF
1,2011-05-25,2009-05-23 17:48:09,1,38.0,1,5,7,8,7,2,3,7,NDF
2,2010-09-28,2009-06-09 23:12:47,0,56.0,0,5,2,4,7,2,6,20,US
3,2011-12-05,2009-10-31 06:01:29,0,42.0,1,5,2,4,7,2,3,16,other
4,2010-09-14,2009-12-08 06:11:05,3,41.0,0,5,2,4,7,2,3,7,US


In [19]:
'''
# Add new date related fields
print("Adding new fields...")
user_df_rev['day_account_created'] = user_df_rev['date_account_created'].dt.weekday
user_df_rev['month_account_created'] = user_df_rev['date_account_created'].dt.month
user_df_rev['quarter_account_created'] = user_df_rev['date_account_created'].dt.quarter
user_df_rev['year_account_created'] = user_df_rev['date_account_created'].dt.year
user_df_rev['hour_first_active'] = user_df_rev['timestamp_first_active'].dt.hour
user_df_rev['day_first_active'] = user_df_rev['timestamp_first_active'].dt.weekday
user_df_rev['month_first_active'] = user_df_rev['timestamp_first_active'].dt.month
user_df_rev['quarter_first_active'] = user_df_rev['timestamp_first_active'].dt.quarter
user_df_rev['year_first_active'] = user_df_rev['timestamp_first_active'].dt.year
user_df_rev['created_less_active'] = (user_df_rev['date_account_created'] - user_df_rev['timestamp_first_active']).dt.days

# Drop unnecessary columns
columns_to_drop = ['date_account_created', 'timestamp_first_active', 'date_first_booking']
for column in columns_to_drop:
    if column in user_df_rev.columns:
        user_df_rev.drop(column, axis=1, inplace=True)

        
        
user_df_rev = user_df_rev.reindex(['day_account_created', 'month_account_created',
                                   'quarter_account_created', 'year_account_created',
                                   'hour_first_active', 'day_first_active',
                                   'month_first_active', 'quarter_first_active',
                                   'year_first_active', 'created_less_active',
                                   'gender_cat', 'age', 'signup_method_cat', 'language_cat',
                                   'affiliate_channel_cat', 'affiliate_provider_cat', 'first_affiliate_tracked_cat',
                                   'signup_app_cat', 'first_device_type_cat', 'first_browser_cat', 
                                   'country_destination'], axis= 1)
 
user_df_rev.head()             
'''


In [20]:
'''
feature_cols = user_df_rev.columns.tolist()
# remove the target from the feature list
feature_cols.remove('country_destination')
 
scaled_features = {}
for each in feature_cols:
    mean, std = user_df_rev[each].mean(), user_df_rev[each].std()
    scaled_features[each] = [mean, std]
    user_df_rev.loc[:, each] = (user_df_rev[each] - mean)/std

user_df_rev.head()
'''

## Data Slicing

Let’s split the data into training and test sets. We can easily perform this step using sklearn’s train_test_split() method.

In [21]:
feature_cols = user_df_rev.columns.tolist()
# remove the target and the raw date columns from the feature list
feature_cols.remove('country_destination')
feature_cols.remove('date_account_created')
feature_cols.remove('timestamp_first_active')
X = user_df_rev[feature_cols]  
y = user_df_rev['country_destination']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=2)

With the above code snippet, we have divided the data into a feature set (X) and a target set (y).
X_train and y_train contain the training data, and X_test and y_test contain the test data.
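
A minimal sketch of what test_size and random_state do, on made-up arrays (X_demo, y_demo and the split names are illustrative, not the notebook's data):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up data: 10 rows, 2 features, and a label per row
X_demo = np.arange(20).reshape(10, 2)
y_demo = np.array(['NDF'] * 6 + ['US'] * 4)

# test_size=0.2 holds out 20% of the rows; random_state fixes the shuffle
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=2)
print(X_tr.shape, X_te.shape)

# The same random_state gives the same split again
X_tr2, X_te2, _, _ = train_test_split(X_demo, y_demo, test_size=0.2, random_state=2)
print(np.array_equal(X_te, X_te2))
```

With 10 rows, 8 go to the training set and 2 to the test set, and repeating the call with the same random_state reproduces the split.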


## Gaussian Naive Bayes Implementation

After completing the data preprocessing, it’s time to train a machine learning algorithm on the data. We are going to use sklearn’s GaussianNB class.

In [22]:
my_gnb = GaussianNB()
my_gnb.fit(X_train, y_train)
y_predict_gnb = my_gnb.predict(X_test)

We have built a GaussianNB classifier and trained it on the training data with the fit() method. The model is now ready to make predictions: the predict() method takes the test set features as its parameter.

## Accuracy of our Gaussian Naive Bayes model

It’s time to test the quality of our model. We have made some predictions. Let’s compare the model’s prediction with actual target values for the test set. By following this method, we are going to calculate the accuracy of our model.

In [23]:
print('accuracy of Gaussian Naive Bayes is', accuracy_score(y_test, y_predict_gnb))

accuracy of Gaussian Naive Bayes is 0.555854864023
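
An accuracy figure is easier to judge next to a trivial baseline. In this dataset NDF is the most frequent destination (124543 of 213451 users, about 58%), so a model that always predicts NDF would already be right for roughly that share of users. The sketch below shows the idea with scikit-learn's DummyClassifier on made-up labels; X_toy, y_toy and baseline are illustrative names, not part of the notebook.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score

# Made-up labels: 6 of 10 belong to the majority class
y_toy = np.array(['NDF'] * 6 + ['US'] * 3 + ['other'])
X_toy = np.zeros((len(y_toy), 1))  # features are ignored by this baseline

# Always predict the most frequent class seen during fit()
baseline = DummyClassifier(strategy='most_frequent')
baseline.fit(X_toy, y_toy)

baseline_acc = accuracy_score(y_toy, baseline.predict(X_toy))
print(baseline_acc)
```

The baseline scores 0.6 here simply because 6 of the 10 labels are NDF. A real model is only useful to the extent that it beats this kind of number.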


## Random Forest Implementation
We are going to use sklearn’s RandomForestClassifier.

In [24]:
my_rf = RandomForestClassifier(n_jobs=2, random_state=0)
my_rf.fit(X_train, y_train)
y_predict_rf = my_rf.predict(X_test)

We have built a RandomForest classifier and trained it on the training data with the fit() method. As with GaussianNB, the predict() method takes the test set features and returns the predictions.

## Accuracy of our Random Forest model

As before, we compare the model’s predictions with the actual target values of the test set to calculate the accuracy.

In [25]:
print('accuracy of Random Forest is', accuracy_score(y_test, y_predict_rf))

accuracy of Random Forest is 0.600313883488


## Other classifiers
We can try other classifiers from scikit-learn too. Below we run Logistic Regression and Decision Tree. The K Nearest Neighbors code is wrapped in a string and therefore not executed, and Linear SVC is imported but not run here.
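
One compact way to try several classifiers is to keep them in a dictionary and loop over it. This is a sketch on made-up, easily separable data, not the notebook's cell; the names models, scores, X_demo and y_demo are assumptions for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Made-up data: the label depends only on the sign of the first feature
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 3))
y_demo = np.where(X_demo[:, 0] > 0, 'US', 'NDF')
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=2)

models = {
    'Logistic Regression': LogisticRegression(),
    'Decision Tree': DecisionTreeClassifier(random_state=0),
}

# Fit each model and record its accuracy on the held-out rows
scores = {}
for name, model in models.items():
    model.fit(X_tr, y_tr)
    scores[name] = accuracy_score(y_te, model.predict(X_te))
    print('accuracy of', name, 'is', scores[name])
```

Adding another classifier then only needs one more dictionary entry.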

In [26]:
# Using logistic Regression classifier
my_logreg = LogisticRegression()
my_logreg.fit(X_train, y_train)
y_predict_logreg = my_logreg.predict(X_test)
print('accuracy of Logistic Regression is', accuracy_score(y_test, y_predict_logreg))

# Using Decision Tree classifier
my_dt = DecisionTreeClassifier()
my_dt.fit(X_train, y_train)
y_predict_dt = my_dt.predict(X_test)
print('accuracy of Decision Tree is', accuracy_score(y_test, y_predict_dt))
'''
# Using K Nearest Neighbors classifier (k = 3)
my_knn = KNeighborsClassifier(n_neighbors=3)
my_knn.fit(X_train, y_train)
y_predict_knn = my_knn.predict(X_test)
print('accuracy of K Nearest Neighbors is', accuracy_score(y_test, y_predict_knn))
'''

accuracy of Logistic Regression is 0.600407580052
accuracy of Decision Tree is 0.588531540606

## Add more data
So far the highest accuracy is about 60%, which is not good enough. We need more data to build a better model.

Let's see what the sessions data looks like.

In [27]:
sessions_df = pd.read_csv('sessions.csv', delimiter=' *, *', engine='python')
sessions_df.head()

Unnamed: 0,user_id,action,action_type,action_detail,device_type,secs_elapsed
0,d1mm9tcy42,lookup,,,Windows Desktop,319.0
1,d1mm9tcy42,search_results,click,view_search_results,Windows Desktop,67753.0
2,d1mm9tcy42,lookup,,,Windows Desktop,301.0
3,d1mm9tcy42,search_results,click,view_search_results,Windows Desktop,22141.0
4,d1mm9tcy42,lookup,,,Windows Desktop,435.0


So, the sessions data contains the activity of each user. We can check how many of our users have session records by merging the user data with the sessions data and comparing the number of distinct users after the merge with the number before.

In [28]:
user_df = pd.read_csv('train_users_2.csv', delimiter=' *, *', engine='python')
sessions_df.rename(columns = {'user_id':'id'}, inplace = True)
merge_df = pd.merge(user_df, sessions_df, on=['id'], how='inner')
len(merge_df.id.unique())/len(user_df)

0.34581707277079987

Session data is available for only about 35% of the users in the training data, which is probably too few to be useful.
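
The same coverage figure can be computed without building the merged table, which repeats every user once per session row. A minimal sketch with made-up frames users and sessions standing in for user_df and sessions_df, assuming each user id appears only once in the user table:

```python
import pandas as pd

users = pd.DataFrame({'id': ['a', 'b', 'c', 'd']})
sessions = pd.DataFrame({
    'user_id': ['a', 'a', 'c'],
    'action': ['lookup', 'search_results', 'lookup'],
})

# True for each user that has at least one session row
has_session = users['id'].isin(sessions['user_id'])

# The mean of a boolean column is the fraction of True values
coverage = has_session.mean()
print(coverage)
```

Two of the four made-up users have sessions, so the coverage is 0.5.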

Let's see what's in age_gender_bkts.csv

In [29]:
age_gender_df = pd.read_csv('age_gender_bkts.csv', delimiter=' *, *', engine='python')
age_gender_df.head()

Unnamed: 0,age_bucket,country_destination,gender,population_in_thousands,year
0,100+,AU,male,1.0,2015.0
1,95-99,AU,male,9.0,2015.0
2,90-94,AU,male,47.0,2015.0
3,85-89,AU,male,118.0,2015.0
4,80-84,AU,male,199.0,2015.0


In [30]:
age_gender_df.drop('year', axis=1, inplace=True)

So this dataset provides population statistics for each destination country, which could be helpful information about the people there.
Let's try to add this information to our dataset.
First we create a new dataframe for each country. We can write our own function for this purpose.

In [31]:
def convert_df(country):
    # Keep only the rows for this destination country
    country_df = age_gender_df[age_gender_df.country_destination == country]
    # Sum the population over genders for each age bucket (numeric column only)
    country_df = country_df.groupby('age_bucket')[['population_in_thousands']].sum()
    # Turn the counts into each bucket's share of the country's population
    country_df = country_df / country_df.population_in_thousands.sum()
    return country_df.rename(columns={'population_in_thousands': country + '_pop'})

country_list = age_gender_df.country_destination.unique()
country_df_dict = {}
for country in country_list:
    country_df_dict[country] = convert_df(country)
au_df = country_df_dict['AU']
au_df

Unnamed: 0_level_0,AU_pop
age_bucket,Unnamed: 1_level_1
0-4,0.06709
10-14,0.060611
100+,0.000209
15-19,0.06291
20-24,0.067174
25-29,0.072984
30-34,0.072984
35-39,0.066798
40-44,0.069306
45-49,0.065669
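
The normalisation inside convert_df can be followed on a small made-up table. The frame toy_pop and its numbers are invented for illustration; the column names match age_gender_df.

```python
import pandas as pd

# Made-up population rows for one country
toy_pop = pd.DataFrame({
    'age_bucket': ['0-4', '0-4', '5-9', '5-9'],
    'country_destination': ['AU', 'AU', 'AU', 'AU'],
    'gender': ['male', 'female', 'male', 'female'],
    'population_in_thousands': [10.0, 10.0, 30.0, 50.0],
})

# Sum over genders within each age bucket
per_bucket = toy_pop.groupby('age_bucket')['population_in_thousands'].sum()

# Divide by the country total so that the shares add up to 1
shares = per_bucket / per_bucket.sum()
print(shares)
```

The buckets hold 20 and 80 thousand people, so their shares are 0.2 and 0.8. Each value in the AU_pop column above is the same kind of share for the real data.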


In [32]:
age_buckets = age_gender_df.age_bucket.unique()
age_buckets

array(['100+', '95-99', '90-94', '85-89', '80-84', '75-79', '70-74',
       '65-69', '60-64', '55-59', '50-54', '45-49', '40-44', '35-39',
       '30-34', '25-29', '20-24', '15-19', '10-14', '5-9', '0-4'], dtype=object)

To merge the tables we need an extra column to join on. We can write our own function for this.

In [33]:
def get_age_range(age_value):

    for age_range in age_buckets:
        if age_range.find('+')!=-1:
            if age_value>=100:
                return age_range
        else:
            a,b = age_range.split('-')
            if age_value>=int(a) and age_value<=int(b):
                return age_range

user_df_rev['age_bucket'] = user_df_rev['age'].apply(get_age_range)
user_df_rev.head()

Unnamed: 0,date_account_created,timestamp_first_active,gender_cat,age,signup_method_cat,language_cat,affiliate_channel_cat,affiliate_provider_cat,first_affiliate_tracked_cat,signup_app_cat,first_device_type_cat,first_browser_cat,country_destination,age_bucket
0,2010-06-28,2009-03-19 04:32:55,3,-1.0,1,5,2,4,7,2,3,7,NDF,
1,2011-05-25,2009-05-23 17:48:09,1,38.0,1,5,7,8,7,2,3,7,NDF,35-39
2,2010-09-28,2009-06-09 23:12:47,0,56.0,0,5,2,4,7,2,6,20,US,55-59
3,2011-12-05,2009-10-31 06:01:29,0,42.0,1,5,2,4,7,2,3,16,other,40-44
4,2010-09-14,2009-12-08 06:11:05,3,41.0,0,5,2,4,7,2,3,7,US,40-44
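
As an alternative to the hand-written loop, pandas' cut() can assign the buckets in one vectorised call. This is a sketch under the assumption that the buckets are the 5-year ranges listed in age_buckets; the names edges, labels and buckets are illustrative.

```python
import numpy as np
import pandas as pd

# Bucket edges 0, 5, ..., 100 and an open-ended last bucket
edges = list(range(0, 105, 5)) + [np.inf]
labels = ['{}-{}'.format(lo, lo + 4) for lo in range(0, 100, 5)] + ['100+']

ages = pd.Series([38.0, 56.0, -1.0, 100.0])

# right=False makes each bin closed on the left: [35, 40) -> '35-39'
buckets = pd.cut(ages, bins=edges, right=False, labels=labels)
print(buckets.astype(object).tolist())
```

The sentinel age -1 falls outside every bin and gets no bucket, as with get_age_range. One difference: a fractional age such as 39.5 lands in '35-39' here, whereas get_age_range returns nothing for it.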


Now we can join all the tables.

In [34]:
for country, country_df in country_df_dict.items():
    user_df_rev = user_df_rev.join(country_df, on='age_bucket')
user_df_rev.head()

Unnamed: 0,date_account_created,timestamp_first_active,gender_cat,age,signup_method_cat,language_cat,affiliate_channel_cat,affiliate_provider_cat,first_affiliate_tracked_cat,signup_app_cat,...,AU_pop,CA_pop,DE_pop,ES_pop,FR_pop,GB_pop,IT_pop,NL_pop,PT_pop,US_pop
0,2010-06-28,2009-03-19 04:32:55,3,-1.0,1,5,2,4,7,2,...,,,,,,,,,,
1,2011-05-25,2009-05-23 17:48:09,1,38.0,1,5,7,8,7,2,...,0.066798,0.066377,0.058744,0.086499,0.056907,0.061764,0.068804,0.058167,0.078518,0.063608
2,2010-09-28,2009-06-09 23:12:47,0,56.0,0,5,2,4,7,2,...,0.060737,0.072259,0.072418,0.063767,0.062539,0.061341,0.066154,0.06891,0.06683,0.06752
3,2011-12-05,2009-10-31 06:01:29,0,42.0,1,5,2,4,7,2,...,0.069306,0.065457,0.062886,0.084232,0.068372,0.066541,0.07914,0.069148,0.078612,0.06295
4,2010-09-14,2009-12-08 06:11:05,3,41.0,0,5,2,4,7,2,...,0.069306,0.065457,0.062886,0.084232,0.068372,0.066541,0.07914,0.069148,0.078612,0.06295


In [35]:
user_df_rev.fillna(-1, inplace=True)
user_df_rev.head()

Unnamed: 0,date_account_created,timestamp_first_active,gender_cat,age,signup_method_cat,language_cat,affiliate_channel_cat,affiliate_provider_cat,first_affiliate_tracked_cat,signup_app_cat,...,AU_pop,CA_pop,DE_pop,ES_pop,FR_pop,GB_pop,IT_pop,NL_pop,PT_pop,US_pop
0,2010-06-28,2009-03-19 04:32:55,3,-1.0,1,5,2,4,7,2,...,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0
1,2011-05-25,2009-05-23 17:48:09,1,38.0,1,5,7,8,7,2,...,0.066798,0.066377,0.058744,0.086499,0.056907,0.061764,0.068804,0.058167,0.078518,0.063608
2,2010-09-28,2009-06-09 23:12:47,0,56.0,0,5,2,4,7,2,...,0.060737,0.072259,0.072418,0.063767,0.062539,0.061341,0.066154,0.06891,0.06683,0.06752
3,2011-12-05,2009-10-31 06:01:29,0,42.0,1,5,2,4,7,2,...,0.069306,0.065457,0.062886,0.084232,0.068372,0.066541,0.07914,0.069148,0.078612,0.06295
4,2010-09-14,2009-12-08 06:11:05,3,41.0,0,5,2,4,7,2,...,0.069306,0.065457,0.062886,0.084232,0.068372,0.066541,0.07914,0.069148,0.078612,0.06295


In [36]:
feature_cols = user_df_rev.columns.tolist()
# remove the target, the age bucket and the raw date columns from the feature list
feature_cols.remove('country_destination')
feature_cols.remove('age_bucket')
feature_cols.remove('date_account_created')
feature_cols.remove('timestamp_first_active')
X = user_df_rev[feature_cols]  
y = user_df_rev['country_destination']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=2)

In [37]:
my_gnb = GaussianNB()
my_gnb.fit(X_train, y_train)
y_predict_gnb = my_gnb.predict(X_test)
print('accuracy of Gaussian Naive Bayes is', accuracy_score(y_test, y_predict_gnb))

my_rf = RandomForestClassifier(n_jobs=2, random_state=0)
my_rf.fit(X_train, y_train)
y_predict_rf = my_rf.predict(X_test)
print('accuracy of Random Forest is', accuracy_score(y_test, y_predict_rf))

# Using Decision Tree classifier
my_dt = DecisionTreeClassifier()
my_dt.fit(X_train, y_train)
y_predict_dt = my_dt.predict(X_test)
print('accuracy of Decision Tree is', accuracy_score(y_test, y_predict_dt))
'''
# Using K Nearest Neighbors classifier (k = 3)
my_knn = KNeighborsClassifier(n_neighbors=3)
my_knn.fit(X_train, y_train)
y_predict_knn = my_knn.predict(X_test)
print('accuracy of K Nearest Neighbors is', accuracy_score(y_test, y_predict_knn))
'''

accuracy of Gaussian Naive Bayes is 0.544986062636
accuracy of Random Forest is 0.599142676442
accuracy of Decision Tree is 0.588812630297

## Generate Report
Now we can use scikit-learn’s built-in metrics, such as the classification report, to evaluate how well our models performed:

In [38]:
print('Random Forest report')
print(classification_report(y_test, y_predict_rf))
print('Gaussian Naive Bayes report')
print(classification_report(y_test, y_predict_gnb))
print('Logistic Regression report')
print(classification_report(y_test, y_predict_logreg))
print('Decision Tree report')
print(classification_report(y_test, y_predict_dt))
#print('K Nearest Neighbors report')
#print(classification_report(y_test, y_predict_knn))

Random Forest report
             precision    recall  f1-score   support

         AU       0.00      0.00      0.00       119
         CA       0.00      0.00      0.00       289
         DE       0.00      0.00      0.00       219
         ES       0.01      0.00      0.00       463
         FR       0.03      0.01      0.01       978
         GB       0.00      0.00      0.00       454
         IT       0.04      0.01      0.01       561
        NDF       0.67      0.82      0.74     24901
         NL       0.00      0.00      0.00       156
         PT       0.00      0.00      0.00        39
         US       0.46      0.41      0.43     12516
      other       0.08      0.02      0.03      1996

avg / total       0.53      0.60      0.56     42691

Gaussian Naive Bayes report
             precision    recall  f1-score   support

         AU       0.00      0.00      0.00       119
         CA       0.00      0.00      0.00       289
         DE       0.00      0.00      0.00       219
         ES       0.00      0.00      0.00       463
         FR       0.00      0.00      0.00       978
         GB       0.00      0.00      0.00       454
         IT       0.00      0.00      0.00       561
        NDF       0.76      0.56      0.64     24901
         NL       0.00      0.00      0.00       156
         PT       0.00      0.00      0.00        39
         US       0.39      0.75      0.51     12516
      other       0.04      0.00      0.01      1996

avg / total       0.56      0.54      0.52     42691
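
Many classes in these reports have precision and recall of 0.00 because the classifier almost never predicts them, so the overall accuracy comes almost entirely from NDF and US. The sketch below reproduces that situation on made-up labels. It assumes scikit-learn 0.22 or later, where classification_report accepts zero_division; y_true, y_pred and report are illustrative names.

```python
from sklearn.metrics import classification_report

# Made-up labels: the model never predicts 'US' or 'FR'
y_true = ['NDF', 'NDF', 'US', 'FR']
y_pred = ['NDF', 'NDF', 'NDF', 'NDF']

# zero_division=0 reports 0.0 (without a warning) where precision is undefined
report = classification_report(y_true, y_pred, zero_division=0, output_dict=True)

print(report['NDF'])
print(report['accuracy'])
```

Accuracy is 0.5 even though two of the three classes are never recognised, which is why the per-class rows are worth reading alongside the single accuracy number.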

Logistic Regression report
             precision    recall  f1-score   support

         AU       0.00      0.00      0.00       119
         CA       0.00      0.00      0.00       289
         DE       0.00      0.00      0.00       219
         ES    