<h1 style="font-size: 34px; margin-bottom: 2px; line-height: 0px;">AirBnB - Capstone Project 1 Data Analysis</h1>
<h3 style="line-height: 2px; font-style: italic;"> Timothy Baney</h3>

* <a href="#intro" style="color: black; text-decoration: none;">Introduction</a>
* <a href="#import" style="color: black; text-decoration: none;">Import Libraries</a>
* <a href="#data-structure" style="color: black; text-decoration: none;">Feature Engineering</a>
    * <a href="#observ-variable" style="color: black; text-decoration: none;">Create Mapping functions</a>
    * <a href="#missing-values" style="color: black; text-decoration: none;">One Hot Encoding</a>
* <a href="#import" style="color: black; text-decoration: none;">Handle Missing Values</a>
* <a href="#problem-nature" style="color: black; text-decoration: none;">Algorithms</a>
* <a href="#summary" style="color: black; text-decoration: none;">Summary</a> 

### <p id="intro" style="margin-bottom: 0px; line-height: 1px;">Introduction</p>
For this notebook, I am going to take the cleansed data that I have already explored and analyze how well different machine learning algorithms perform at fitting a prediction model. I will first engineer the data so that only key features are kept and everything is numerical. I will then split my data to get a holdout set, train each algorithm on the training set, and then score it with cross validation using 10 folds to find the best algorithm.

### <p id="import">Import Libraries</p>

In [96]:
import datetime
%matplotlib inline

import pylab
import pandas as pd
import math
import seaborn as sns
import matplotlib.pyplot as plt
import matplotlib.dates as mdates

import numpy as np
import statsmodels.api as sm
from statsmodels.formula.api import ols

from sklearn import datasets, tree, metrics, model_selection
from sklearn.preprocessing import LabelEncoder
from sklearn.neighbors import KNeighborsClassifier 
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.metrics import classification_report, confusion_matrix, roc_curve, auc
from sklearn.linear_model import LogisticRegression, LinearRegression, Perceptron
from sklearn.feature_selection import SelectKBest, chi2, VarianceThreshold, RFE
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.naive_bayes import GaussianNB

from datetime import datetime as dt

import scipy

import matplotlib.pylab as pylab

ucb_w_reindex = pd.read_csv('clean_airbnb.csv')
ucb = pd.read_csv('clean_airbnb.csv')

pylab.rcParams[ 'figure.figsize' ] = 12 , 5
plt.style.use("fivethirtyeight")



### Feature Engineering - Create Mapping Functions 
Part of the feature engineering involves writing functions that take one or more columns of a row and return a new feature. One gets the time difference in days between signing up on the AirBnB site and the user's first booking. Another takes the date of the user's first booking and returns the season in which the booking was made. The others check whether the user's language preference matches the destination's native language, and narrow the first browser used down to the five most common browsers plus one category titled 'Other' that covers all remaining browsers.

In [97]:
def getTopBrowsers(row):
    top_browsers = ['Chrome', 'Safari', 'Firefox', 'IE', 'Mobile Safari']
    
    if row['first_browser'] not in top_browsers:
        return 'Other'
    else:
        return row['first_browser']
    
def getLapsedTime(row):
    try:
        account_creation = datetime.datetime.strptime(row['date_account_created'], '%Y-%m-%d')
        first_booking = datetime.datetime.strptime(row['date_first_booking'], '%Y-%m-%d')
    except (TypeError, ValueError) as e:
        # Missing or malformed date: report the row and return NaN rather than
        # falling through to an undefined time_delta
        print(e)
        print(row.get('id'))
        return np.nan

    return (first_booking - account_creation).days

lang_dict = {'en': 'eng', 'fr': 'fra', 'it': 'ita', 'es': 'spa', 'de': 'deu', 'nl': 'nld', 'pt': 'por'}

def convertLang(row):
    if row['language'] in lang_dict:
        return lang_dict[row['language']]
    else:
        return row['language']
    
ucb['language'] = ucb.apply(lambda x: convertLang(x), axis=1)

def langPrefMatch(row):
    if row['language'] == row['destination_language ']:
        return 1
    else:
        return 0
    
def getSeason(row):
    seasons = {
        '01': 'Winter',
        '02': 'Winter',
        '03': 'Spring',
        '04': 'Spring',
        '05': 'Spring',
        '06': 'Summer',
        '07': 'Summer',
        '08': 'Summer',
        '09': 'Autumn',
        '10': 'Autumn',
        '11': 'Autumn',
        '12': 'Winter'
    }
    
    try:
        return seasons[str(row['date_first_booking']).split('-')[1]]
    except (IndexError, KeyError):
        # A missing date (e.g. NaN) has no month component to look up
        return np.nan
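
The row-wise helpers above can also be expressed with vectorized pandas operations. The following is a minimal sketch on made-up rows, not the code this notebook runs; the column names mirror the notebook, while `toy` and `month_to_season` are names introduced here for illustration.

```python
import pandas as pd

# Toy rows; the column names mirror the notebook, the values are made up.
toy = pd.DataFrame({
    'date_account_created': ['2014-01-01', '2014-03-15', '2014-06-30'],
    'date_first_booking': ['2014-01-11', '2014-03-15', None],
})

created = pd.to_datetime(toy['date_account_created'])
booked = pd.to_datetime(toy['date_first_booking'])

# Day difference; a missing booking date becomes NaN instead of raising.
toy['days_to_book'] = (booked - created).dt.days

# Month number -> season, using the same grouping as getSeason.
month_to_season = {12: 'Winter', 1: 'Winter', 2: 'Winter',
                   3: 'Spring', 4: 'Spring', 5: 'Spring',
                   6: 'Summer', 7: 'Summer', 8: 'Summer',
                   9: 'Autumn', 10: 'Autumn', 11: 'Autumn'}
toy['season_booked'] = booked.dt.month.map(month_to_season)

print(toy[['days_to_book', 'season_booked']])
```

Working on whole columns avoids the per-row `apply` call and lets pandas propagate missing dates as NaN without a `try`/`except`.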

### Feature Engineering - One Hot Encoding
Before comparing the performance of the various algorithms, we need to turn all of our categorical data into one-hot encoded values. One-hot encoding takes each category value in a column and gives it its own column holding a 1 or a 0, which says whether or not that row belongs to that category.
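
As a small illustration of the idea, here is a sketch on a toy column. The values are made up, and the `prefix` argument is an optional convenience assumed here rather than something the cells in this notebook use.

```python
import pandas as pd

# Toy column; the values are illustrative, not taken from the dataset.
toy = pd.DataFrame({'signup_method': ['basic', 'facebook', 'basic', 'google']})

# One indicator column per category. The prefix avoids name clashes when
# several encoded columns share a category name such as 'other'.
encoded = pd.get_dummies(toy['signup_method'], prefix='signup', dtype=int)
toy = toy.join(encoded).drop('signup_method', axis=1)

print(toy)
```

Each row ends up with exactly one 1 across the new `signup_*` columns.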

In [98]:
# Remove all ghost rows "Ones without IDs"
ucb = ucb[~ucb['id'].isnull()]

# Remove Country Destination One Hot Values used in Data Story
ucb = ucb.drop(['US', 'FR', 'IT', 'GB', 'ES', 'CA', 'DE', 'NL', 'AU', 'PT'], axis=1)

# Remove rows where the user never booked a trip, since this won't provide any value for 
# Country predictions
ucb = ucb.loc[ucb['country_destination'] != 'NDF']

# Create Language Matches Column
ucb['lang_match'] = ucb.apply(lambda x: langPrefMatch(x), axis=1)

# Remove attributes that are directly related to the country e.g. Latitude, Longitude
ucb = ucb.drop(['lat_destination', 'lng_destination', 'timestamp_first_active'], axis=1)
ucb = ucb.drop(['language_levenshtein_distance', 'id'], axis=1)
ucb = ucb.drop(['language', 'destination_language ', 'destination_km2', 'distance_km'], axis=1)

# Create Season Booked One Hot Columns [Autumn, Winter, Spring, Summer]
ucb['season_booked'] = ucb.apply(lambda x: getSeason(x), axis=1)

# Turn Seasons into one hot encoding values
season = pd.get_dummies(ucb['season_booked'])
ucb = ucb.join(season)
ucb = ucb.drop('season_booked', axis=1)

# Get most prevalent browsers used, labeling all other browsers as 'Other'
ucb['first_browser'] = ucb.apply(lambda x: getTopBrowsers(x), axis=1)

# Turn gender into one hot encoding values "Male, Female"
ucb = ucb[(ucb['gender'] == 'male') | (ucb['gender'] == 'female')]
gender_oh = pd.get_dummies(ucb['gender'])
ucb = ucb.join(gender_oh)
ucb = ucb.drop('gender', axis=1)

# Turn signup method into one hot encoding values
su_method = pd.get_dummies(ucb['signup_method'])
ucb = ucb.join(su_method)
ucb = ucb.drop('signup_method', axis=1)
ucb = ucb.rename(columns={'basic': 'signup_basic', 'facebook': 'signup_facebook', 'google': 'signup_google'})

# Turn signup app into one hot encoding values
su_app = pd.get_dummies(ucb['signup_app'])
ucb = ucb.join(su_app)
ucb = ucb.drop('signup_app', axis=1)

# Turn first device type into one hot encoding values
dev_type = pd.get_dummies(ucb['first_device_type'])
ucb = ucb.join(dev_type)
ucb = ucb.drop('first_device_type', axis=1)

# Turn affiliate channel into one hot encoding values
af_channel = pd.get_dummies(ucb['affiliate_channel'])
ucb = ucb.join(af_channel)
ucb = ucb.drop('affiliate_channel', axis=1)
ucb = ucb.rename(columns={'api': 'ch_api', 'content': 'ch_content', 'direct': 'ch_direct',
                          'other': 'ch_other', 'remarketing': 'ch_remarketing', 'sem-brand': 'ch_sem_brand',
                          'sem-non-brand': 'ch_sem_non_brand', 'seo': 'ch_seo'})

# Turn affiliate provider into one hot encoding values
af_provider = pd.get_dummies(ucb['affiliate_provider'])
ucb = ucb.join(af_provider)
ucb = ucb.drop('affiliate_provider', axis=1)

# Turn first affiliate tracked into one hot encoding values
fa_oh = pd.get_dummies(ucb['first_affiliate_tracked'])
ucb = ucb.join(fa_oh)
ucb = ucb.drop('first_affiliate_tracked', axis=1)

# Turn first browser into one hot encoding values
fbwsr_oh = pd.get_dummies(ucb['first_browser'])
ucb = ucb.join(fbwsr_oh)
ucb = ucb.drop('first_browser', axis=1)

# Create 'AccountCreation-BookingTime' column for difference
ucb['days_to_book'] = ucb.apply(lambda x: getLapsedTime(x), axis=1)

# Remove date_account_created, and first_booking columns
ucb = ucb.drop(['date_account_created', 'date_first_booking'], axis=1)

### Handle Missing Values
Our data still has some missing values. To fix this, we simply remove the rows with null values in the affected columns. In the run shown below, this reduces the data from 59,714 rows to 15,532.

In [99]:
print(ucb['Web'].count())
ucb = ucb.loc[~ucb['actions_total_count'].isnull()]
ucb = ucb.loc[~ucb['average_action_duration'].isnull()]
ucb = ucb.loc[~ucb['dest_age_pop'].isnull()]
print(ucb['Web'].count())

59714
15532
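
The three filters above can also be written as a single `dropna` call. This is a sketch on made-up values; only the column names come from the notebook, and `required` is a name introduced here.

```python
import numpy as np
import pandas as pd

# Toy frame with the same three columns; the values are made up.
toy = pd.DataFrame({
    'actions_total_count': [12.0, np.nan, 7.0, 3.0],
    'average_action_duration': [1.5, 2.0, np.nan, 0.4],
    'dest_age_pop': [100.0, 200.0, 300.0, 400.0],
})

required = ['actions_total_count', 'average_action_duration', 'dest_age_pop']
before = len(toy)

# Drop every row that has a null in any of the required columns.
toy = toy.dropna(subset=required)

print(before, '->', len(toy))
```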


In [100]:
X = ucb.drop('country_destination', axis=1)
y = ucb['country_destination'].values

### Create Holdout Data

To get a more accurate score for each algorithm, we will separate our data with a 70/30 split, the 30% being the 'holdout' data. The holdout data isn't used to train any models, and its target values are used only for scoring, so it simulates data the model has not seen. The split is stratified on the target so that both parts keep the same class proportions. We will also cross validate on ten folds to get as stable a measure of performance for each algorithm as we can.

In [101]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = .3, random_state=42, stratify=y)
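
One possible arrangement, sketched here on synthetic data, keeps the 10-fold cross validation inside the training portion and scores the holdout rows only once. This is an assumption about how the two steps could be combined, and it differs from the cells in this notebook, which call `cross_val_score` on the holdout split. The names `X_toy`, `y_toy`, `X_tr` and so on are introduced for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_val_score, train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for X and y; not the AirBnB data.
X_toy, y_toy = make_classification(n_samples=500, n_features=10, n_informative=5,
                                   n_classes=3, random_state=42)

# Stratified 70/30 split, as in the notebook.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.3, random_state=42, stratify=y_toy)

# 10-fold cross validation restricted to the training portion.
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=42)
model = KNeighborsClassifier()
cv_scores = cross_val_score(model, X_tr, y_tr, cv=cv)

# The holdout rows are touched once, after the model has been chosen.
holdout_score = model.fit(X_tr, y_tr).score(X_te, y_te)

print(cv_scores.mean(), holdout_score)
```

Comparing the mean fold score with the holdout score gives a quick check on whether the cross-validated estimate carries over to unseen rows.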

### Algorithm 1 - KNearest Neighbors

In [110]:
knn = KNeighborsClassifier()
knn.fit(X_train, y_train)

knn_score = knn.score(X_test, y_test)
knn_fold_score = model_selection.cross_val_score(knn, X_test, y_test, cv=10).mean()
knn_fold_score

0.9500057714942598

### Algorithm 2 - Decision Trees

In [111]:
dtree = DecisionTreeClassifier()
dtree.fit(X_train, y_train)

dt_score = dtree.score(X_test, y_test)
dt_fold_score = model_selection.cross_val_score(dtree, X_test, y_test, cv=10).mean()
dt_fold_score

0.98650441234251862

### Algorithm 3 - Random Forests

In [112]:
rf = RandomForestClassifier(n_estimators=100)
rf.fit(X_train, y_train)

rf_score = rf.score(X_test, y_test)
rf_fold_score = model_selection.cross_val_score(rf, X_test, y_test, cv=10).mean()
rf_fold_score

0.93415195676023066

### Algorithm 4 - Logistic Regression

In [113]:
logreg = LogisticRegression()
logreg.fit(X_train, y_train)

lr_score = logreg.score(X_test, y_test)
lr_fold_score = model_selection.cross_val_score(logreg, X_test, y_test, cv=10).mean()
lr_fold_score

0.8938056469778124

### Algorithm 5 - Gaussian Naive Bayes

In [114]:
gaus = GaussianNB()
gaus.fit(X_train, y_train)

gaus_score = gaus.score(X_test, y_test)
gaus_fold_score = model_selection.cross_val_score(gaus, X_test, y_test, cv=10).mean()
gaus_fold_score

0.91441572355861089

### Algorithm 6 - Gradient Boosting

In [115]:
grd = GradientBoostingClassifier()
grd.fit(X_train, y_train)

# Holdout accuracy; this is the value reported for Gradient Boosting in the results
grd_score = grd.score(X_test, y_test)
grd_score

0.99248927038626611

### Algorithm 7 - Perceptron

In [116]:
per = Perceptron()
per.fit(X_train, y_train)

per_score = per.score(X_test, y_test)
per_fold_score = model_selection.cross_val_score(per, X_test, y_test, cv=10).mean()
per_fold_score

0.8148258083366009

### Stacking Method 1 - Voting Ensemble

In [120]:
from sklearn.ensemble import VotingClassifier

voting = VotingClassifier(estimators=[('knn', knn), ('dtree', dtree), ('grd', grd)], voting='soft')
voting.fit(X_train, y_train)
voting_score = voting.score(X_test, y_test)
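
With `voting='soft'`, the ensemble averages the class probabilities predicted by its members and picks the class with the highest average. The sketch below checks that behaviour by hand on synthetic data; the two member models and the toy data are assumptions made for illustration, not the ensemble fitted above.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import VotingClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data; not the AirBnB data.
X_toy, y_toy = make_classification(n_samples=200, n_features=8, random_state=0)

ensemble = VotingClassifier(
    estimators=[('dtree', DecisionTreeClassifier(max_depth=3, random_state=0)),
                ('nb', GaussianNB())],
    voting='soft')
ensemble.fit(X_toy, y_toy)

# Soft voting: average the per-class probabilities of the fitted members...
mean_proba = np.mean([est.predict_proba(X_toy) for est in ensemble.estimators_], axis=0)
# ...and predict the class with the highest average probability.
manual_pred = ensemble.classes_[np.argmax(mean_proba, axis=1)]

print(np.array_equal(manual_pred, ensemble.predict(X_toy)))
```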

### Results

In [135]:
algorithm_scores = [
    {'name': 'KNearest Neighbors', 'score': knn_fold_score},
    {'name': 'Decision Trees', 'score': dt_fold_score},
    {'name': 'Random Forests', 'score': rf_fold_score},
    {'name': 'Logistic Regression', 'score': lr_fold_score},
    {'name': 'Gaussian Naive Bayes', 'score': gaus_fold_score},
    {'name': 'Gradient Boosting', 'score': grd_score},  # holdout accuracy, not a 10-fold mean
    {'name': 'Perceptron', 'score': per_fold_score},
    {'name': 'Voting Ensemble', 'score': voting_score}  # holdout accuracy, not a 10-fold mean
]
    

alg_scores = sorted(algorithm_scores, key=lambda x: x['score'], reverse=True)

alg_scores

[{'name': 'Voting Ensemble', 'score': 0.99399141630901289},
 {'name': 'Gradient Boosting', 'score': 0.99248927038626611},
 {'name': 'Decision Trees', 'score': 0.98650441234251862},
 {'name': 'KNearest Neighbors', 'score': 0.9500057714942598},
 {'name': 'Random Forests', 'score': 0.93415195676023066},
 {'name': 'Gaussian Naive Bayes', 'score': 0.91441572355861089},
 {'name': 'Logistic Regression', 'score': 0.8938056469778124},
 {'name': 'Perceptron', 'score': 0.8148258083366009}]

In [136]:
pd.DataFrame(alg_scores)

|   | name | score |
|---|------|-------|
| 0 | Voting Ensemble | 0.993991 |
| 1 | Gradient Boosting | 0.992489 |
| 2 | Decision Trees | 0.986504 |
| 3 | KNearest Neighbors | 0.950006 |
| 4 | Random Forests | 0.934152 |
| 5 | Gaussian Naive Bayes | 0.914416 |
| 6 | Logistic Regression | 0.893806 |
| 7 | Perceptron | 0.814826 |


### Brief Discussion