<h1 style="font-size: 34px; margin-bottom: 2px; line-height: 0px;">AirBnB - Capstone Project 1 Data Analysis</h1>
<h3 style="line-height: 2px; font-style: italic;">Timothy Baney</h3>

* <a href="#intro" style="color: black; text-decoration: none;">Introduction</a>
* <a href="#import" style="color: black; text-decoration: none;">Import Libraries</a>
* <a href="#feat-eng-map" style="color: black; text-decoration: none;">Feature Engineering</a>
    * <a href="#feat-eng-map" style="color: black; text-decoration: none;">Create Mapping functions</a>
    * <a href="#feat-eng-oh" style="color: black; text-decoration: none;">One Hot Encoding</a>
* <a href="#missing-values" style="color: black; text-decoration: none;">Handle Missing Values</a>
* <a href="#over-sample" style="color: black; text-decoration: none;">Over Sample Minority Classes</a>
* <a href="#holdout-data" style="color: black; text-decoration: none;">Create Holdout Data</a>
* <a href="#init-models" style="color: black; text-decoration: none;">Initiate Scikit Learn Algorithm Classes</a>
* <a href="#scoring-function" style="color: black; text-decoration: none;">Create Scoring Function</a>
* <a href="#baseline-normal" style="color: black; text-decoration: none;">Create Baselines for All Algorithms</a>
* <a href="#baseline-over" style="color: black; text-decoration: none;">Create Baselines for Over-sampled Data</a>
* <a href="#hyp-knn" style="color: black; text-decoration: none;">Hyper Parameter Tuning</a>
    * <a href="#hyp-knn" style="color: black; text-decoration: none;">KNN</a>
    * <a href="#hyp-dtree" style="color: black; text-decoration: none;">Decision Trees</a>
    * <a href="#hyp-rf" style="color: black; text-decoration: none;">Random Forests</a>
    * <a href="#hyp-lr" style="color: black; text-decoration: none;">Logistic Regression</a>
    * <a href="#hyp-grd" style="color: black; text-decoration: none;">Gradient Boosting</a>
* <a href="#init-tuned" style="color: black; text-decoration: none;">Initialize Tuned Algorithms</a>
* <a href="#feature-selection" style="color: black; text-decoration: none;">Perform Feature Selection</a>
* <a href="#final-results" style="color: black; text-decoration: none;">Final Results</a>
* <a href="#make-predictions" style="color: black; text-decoration: none;">Make Predictions</a>

### <p id="intro" style="margin-bottom: 0px; line-height: 1px;">Introduction</p>
In this notebook, I take the cleansed data that I explored earlier and analyze how different machine learning algorithms perform when fitting a prediction model. I first engineer the data so that only key features are kept and everything is numerical. I then split the data to get a holdout set, train several algorithms, and then score them with 10-fold cross validation to compare the algorithms by their scores.

### <p id="import">Import Libraries</p>

In [13]:
import datetime
import random
import pylab
import math
import itertools
from datetime import datetime as dt

import pandas as pd
import seaborn as sns
import scipy

import matplotlib.pyplot as plt
import matplotlib.dates as mdates
import matplotlib.pylab as pylab

%matplotlib inline

import numpy as np
import statsmodels.api as sm
from statsmodels.formula.api import ols

from sklearn import datasets, tree, metrics, model_selection
from sklearn.preprocessing import LabelEncoder
from sklearn.neighbors import KNeighborsClassifier 
from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV as GSCV
from sklearn.metrics import classification_report, confusion_matrix, roc_curve, auc
from sklearn.linear_model import LogisticRegression, LinearRegression, Perceptron
from sklearn.feature_selection import SelectKBest, chi2, VarianceThreshold, RFE, RFECV
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier, VotingClassifier
from sklearn.naive_bayes import GaussianNB

from imblearn.over_sampling import SMOTE

from spark_sklearn import GridSearchCV

import findspark
findspark.init()
import pyspark
# sc = pyspark.SparkContext()

ucb = pd.read_csv('clean_airbnb.csv')
test = pd.read_csv('airbnb_test.csv')
user_ids = test['id'].values

pylab.rcParams[ 'figure.figsize' ] = 15 , 10
plt.style.use("fivethirtyeight")
new_style = {'grid': False}
plt.rc('axes', **new_style)

import warnings
warnings.filterwarnings('ignore')

### <p id="feat-eng-map">Feature Engineering - Create Mapping Functions </p>
Part of the feature engineering involves writing functions that take one or more columns of a row and return a new feature. One computes the time difference in days between signing up on the AirBnB site and the user's first booking. Another takes the date of the user's first booking and returns the season in which the booking was made. The others check whether the user's language preference matches the destination's native language, and narrow the first browser used down to the top five browsers, with a single 'Other' category for all remaining browsers.
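
As a rough illustration of the two date-based features described above, here is a minimal standard-library sketch. The helper names `days_between` and `season_of` and the sample row are hypothetical and only mirror the idea; the notebook's own functions follow in the next cell.

```python
from datetime import datetime

# Month number -> season, matching the mapping used in this notebook
SEASONS = {12: 'Winter', 1: 'Winter', 2: 'Winter',
           3: 'Spring', 4: 'Spring', 5: 'Spring',
           6: 'Summer', 7: 'Summer', 8: 'Summer',
           9: 'Autumn', 10: 'Autumn', 11: 'Autumn'}

def days_between(created, booked, fmt='%Y-%m-%d'):
    """Days from account creation to first booking, or None if a date is missing or malformed."""
    try:
        return (datetime.strptime(booked, fmt) - datetime.strptime(created, fmt)).days
    except (TypeError, ValueError):
        return None

def season_of(date_string, fmt='%Y-%m-%d'):
    """Season of a 'YYYY-MM-DD' date, or None if it cannot be parsed."""
    try:
        return SEASONS[datetime.strptime(date_string, fmt).month]
    except (TypeError, ValueError):
        return None

# Hypothetical sample row, not taken from the dataset
row = {'date_account_created': '2014-01-10', 'date_first_booking': '2014-03-02'}
lapsed = days_between(row['date_account_created'], row['date_first_booking'])
season = season_of(row['date_first_booking'])
print(lapsed, season)  # 51 Spring
```

Returning `None` for unparseable dates keeps one bad row from stopping the whole column from being computed.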

In [14]:
def getTopBrowsers(row):
    top_browsers = ['Chrome', 'Safari', 'Firefox', 'IE', 'Mobile Safari']
    
    if row['first_browser'] not in top_browsers:
        return 'Other'
    else:
        return row['first_browser']
    
def getLapsedTime(row):
    try:
        account_creation = datetime.datetime.strptime(row['date_account_created'], '%Y-%m-%d')
        first_booking = datetime.datetime.strptime(row['date_first_booking'], '%Y-%m-%d')
        return (first_booking - account_creation).days
    except Exception as e:
        # Log the offending row and return NaN instead of failing on an undefined time_delta
        print(e)
        print(row.get('id'))
        return np.nan

lang_dict = {'en': 'eng', 'fr': 'fra', 'it': 'ita', 'es': 'spa', 'de': 'deu', 'nl': 'nld', 'pt': 'por'}

def convertLang(row):
    if row['language'] in lang_dict:
        return lang_dict[row['language']]
    else:
        return row['language']
    
ucb['language'] = ucb.apply(lambda x: convertLang(x), axis=1)

def langPrefMatch(row):
    if row['language'] == row['destination_language ']:
        return 1
    else:
        return 0
    
def getSeason(row):
    seasons = {
        '01': 'Winter',
        '02': 'Winter',
        '03': 'Spring',
        '04': 'Spring',
        '05': 'Spring',
        '06': 'Summer',
        '07': 'Summer',
        '08': 'Summer',
        '09': 'Autumn',
        '10': 'Autumn',
        '11': 'Autumn',
        '12': 'Winter'
    }
    
    try:
        return seasons[str(row['date_first_booking']).split('-')[1]]
    except:
        return np.nan
    
def fillMissingGender(row):
    genders = ['MALE', 'FEMALE']
    # Keep a known gender; otherwise pick one at random
    if row['gender'] in genders:
        return row['gender']
    return random.choice(genders)

### <p id="feat-eng-oh">Feature Engineering - One Hot Encoding</p>
Before analyzing performance with various algorithms, we need to turn all of our categorical data into one hot encoded values. One hot encoding takes every category value in a column and gives each its own column holding a boolean value, a one or a zero, that says whether or not that row belongs to that category.
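
To make the idea concrete, here is a small plain-Python sketch of one hot encoding. The `one_hot` helper and the sample values are hypothetical; in the notebook itself `pd.get_dummies` does this work and returns one column per category.

```python
def one_hot(values):
    """Return one dict per value with a 0/1 entry for every category seen."""
    categories = sorted(set(values))
    return [{cat: int(value == cat) for cat in categories} for value in values]

# Hypothetical sample of the signup_method column
signup_method = ['basic', 'facebook', 'basic', 'google']
encoded = one_hot(signup_method)
for original, row in zip(signup_method, encoded):
    print(original, row)
```

Each encoded row contains exactly one `1`, in the position of the category that row originally held.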

In [15]:
# Remove all ghost rows "Ones without IDs"
ucb = ucb[~ucb['id'].isnull()]

# Remove Country Destination One Hot Values used in Data Story
ucb = ucb.drop(['US', 'FR', 'IT', 'GB', 'ES', 'CA', 'DE', 'NL', 'AU', 'PT'], axis=1)

# Remove rows where the user never booked a trip, since this won't provide any value for 
# Country predictions
ucb = ucb.loc[ucb['country_destination'] != 'NDF']

# Create Language Matches Column
ucb['lang_match'] = ucb.apply(lambda x: langPrefMatch(x), axis=1)

# Remove attributes that are directly related to the country e.g. Latitude, Longitude
ucb = ucb.drop(['lat_destination', 'lng_destination', 'timestamp_first_active'], axis=1)
ucb = ucb.drop(['language_levenshtein_distance', 'id'], axis=1)
ucb = ucb.drop(['language', 'destination_language ', 'destination_km2', 'distance_km'], axis=1)

test = test.drop(['timestamp_first_active', 'id', 'language'], axis=1)

# Create Season Booked One Hot Columns [Autumn, Winter, Spring, Summer]
ucb['season_booked'] = ucb.apply(lambda x: getSeason(x), axis=1)

# Turn Seasons into one hot encoding values
season = pd.get_dummies(ucb['season_booked'])
ucb = ucb.join(season)
ucb = ucb.drop('season_booked', axis=1)

# Get most prevalent browsers used, labeling all other browsers as other
ucb['first_browser'] = ucb.apply(lambda x: getTopBrowsers(x), axis=1)
test['first_browser'] = test.apply(lambda x: getTopBrowsers(x), axis=1)

# Turn gender into one hot encoding values "Male, Female"
ucb = ucb[(ucb['gender'] == 'male') | (ucb['gender'] == 'female')]
gender_oh = pd.get_dummies(ucb['gender'])
ucb = ucb.join(gender_oh)
ucb = ucb.drop('gender', axis=1)

test['gender'] = test.apply(lambda x: fillMissingGender(x), axis=1)
test_gender_oh = pd.get_dummies(test['gender'])
test = test.join(test_gender_oh)
test = test.drop('gender', axis=1)
test = test.rename(columns={'MALE': 'male', 'FEMALE': 'female'})

# Turn signup method into one hot encoding values
su_method = pd.get_dummies(ucb['signup_method'])
ucb = ucb.join(su_method)
ucb = ucb.drop('signup_method', axis=1)
ucb = ucb.rename(columns={'basic': 'signup_basic', 'facebook': 'signup_facebook', 'google': 'signup_google'})

test_su_method = pd.get_dummies(test['signup_method'])
test = test.join(test_su_method)
test = test.drop('signup_method', axis=1)
test = test.rename(columns={'basic': 'signup_basic', 'facebook': 'signup_facebook', 'google': 'signup_google'})

# Turn signup app into one hot encoding values
su_app = pd.get_dummies(ucb['signup_app'])
ucb = ucb.join(su_app)
ucb = ucb.drop('signup_app', axis=1)

test_su_app = pd.get_dummies(test['signup_app'])
test = test.join(test_su_app)
test = test.drop('signup_app', axis=1)

# Turn first device type into one hot encoding values
dev_type = pd.get_dummies(ucb['first_device_type'])
ucb = ucb.join(dev_type)
ucb = ucb.drop('first_device_type', axis=1)

test_dev_type = pd.get_dummies(test['first_device_type'])
test = test.join(test_dev_type)
test = test.drop('first_device_type', axis=1)

# Turn affiliate channel into one hot encoding values
af_channel = pd.get_dummies(ucb['affiliate_channel'])
ucb = ucb.join(af_channel)
ucb = ucb.drop('affiliate_channel', axis=1)
ucb = ucb.rename(columns={'api': 'ch_api', 'content': 'ch_content', 'direct': 'ch_direct',
                          'other': 'ch_other', 'remarketing': 'ch_remarketing', 'sem-brand': 'ch_sem_brand',
                          'sem-non-brand': 'ch_sem_non_brand', 'seo': 'ch_seo'})

test_af_channel = pd.get_dummies(test['affiliate_channel'])
test = test.join(test_af_channel)
test = test.drop('affiliate_channel', axis=1)
test = test.rename(columns={'api': 'ch_api', 'content': 'ch_content', 'direct': 'ch_direct',
                          'other': 'ch_other', 'remarketing': 'ch_remarketing', 'sem-brand': 'ch_sem_brand',
                          'sem-non-brand': 'ch_sem_non_brand', 'seo': 'ch_seo'})

# Turn affiliate provider into one hot encoding values
af_provider = pd.get_dummies(ucb['affiliate_provider'])
ucb = ucb.join(af_provider)
ucb = ucb.drop('affiliate_provider', axis=1)

test_af_provider = pd.get_dummies(test['affiliate_provider'])
test = test.join(test_af_provider)
test = test.drop('affiliate_provider', axis=1)

# Turn first affiliate tracked into one hot encoding values
fa_oh = pd.get_dummies(ucb['first_affiliate_tracked'])
ucb = ucb.join(fa_oh)
ucb = ucb.drop('first_affiliate_tracked', axis=1)

test_fa_oh = pd.get_dummies(test['first_affiliate_tracked'])
test = test.join(test_fa_oh)
test = test.drop('first_affiliate_tracked', axis=1)

# Turn first browser into one hot encoding values
fbwsr_oh = pd.get_dummies(ucb['first_browser'])
ucb = ucb.join(fbwsr_oh)
ucb = ucb.drop('first_browser', axis=1)

test_fbwsr_oh = pd.get_dummies(test['first_browser'])
test = test.join(test_fbwsr_oh)
test = test.drop('first_browser', axis=1)

# Create 'AccountCreation-BookingTime' column for difference
ucb['days_to_book'] = ucb.apply(lambda x: getLapsedTime(x), axis=1)

# Remove date_account_created, and first_booking columns
ucb = ucb.drop(['date_account_created', 'date_first_booking'], axis=1)
test = test.drop('date_account_created', axis=1)

In [16]:
# Fill only the missing ages with the mean, keeping the known ages intact
test['age'] = test['age'].fillna(test['age'].mean())
test = test.drop('date_first_booking', axis=1)

### <p id="missing-values">Handle Missing Values</p>
Our data still has some missing values. To fix this, we will simply remove the rows with null values in the affected columns.

In [17]:
ucb = ucb.loc[~ucb['actions_total_count'].isnull()]
ucb = ucb.loc[~ucb['average_action_duration'].isnull()]
ucb = ucb.loc[~ucb['dest_age_pop'].isnull()]

### <p id="over-sample">Over Sample Minority Classes</p>
The target class has 10 categories, each a different country. The United States has far more occurrences than the other countries, so the data is imbalanced and will push our predictions toward mainly 'US'. We may get a misleadingly high accuracy if every single prediction is 'US'. To remedy this, we will do a few things. First, I will get not only a 10-fold cross validation score for each algorithm, but also the true positive rate for each country using a confusion matrix, so we can score each individual country. I will also create a new dataset using a strategy called over-sampling. The idea is simple: we resample from the minority classes and add new minority class occurrences to the main data to even the playing field, so to speak. You can under-sample majority classes or over-sample minority classes, but for smaller datasets (anything in the tens of thousands of rows or fewer), over-sampling is recommended. I am using SMOTE (Synthetic Minority Over-sampling Technique) to over-sample my minority classes. **This takes a considerable amount of time to run, so I have saved the resulting dataset and simply load it with pandas' `.read_csv`.**
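
The core idea behind SMOTE can be sketched in a few lines: a synthetic sample is placed at a random point on the line between a minority sample and one of its nearest minority neighbours. The `smote_like` function below is a simplified, hypothetical illustration of that interpolation step, not imbalanced-learn's implementation, and the sample points are made up.

```python
import random

def smote_like(minority, n_new, k=2, seed=0):
    """Create n_new synthetic points by interpolating between a minority
    sample and one of its k nearest minority neighbours (simplified SMOTE)."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest neighbours of base by squared Euclidean distance, excluding base itself
        others = [p for p in minority if p is not base]
        neighbours = sorted(
            others, key=lambda p: sum((a - b) ** 2 for a, b in zip(p, base)))[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # position along the segment from base to neighbour
        synthetic.append([a + gap * (b - a) for a, b in zip(base, neighbour)])
    return synthetic

# Hypothetical two-feature minority class (e.g. age, days_to_book)
minority = [[25.0, 3.0], [30.0, 5.0], [28.0, 4.0], [35.0, 10.0]]
new_points = smote_like(minority, n_new=6)
print(len(minority) + len(new_points))  # 10
```

Because every synthetic point lies between two real minority samples, the new rows stay inside the region the minority class already occupies rather than being exact duplicates.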

In [18]:
def overSampleMinority(df, minority):
    # Keep only rows that are US or the given minority country, e.g. 'FR';
    # SMOTE then synthesizes minority rows until both classes are balanced
    df = df[(df['US'] == 1) | (df[minority] == 1)]

    X = df.drop(['country_destination', 'US'], axis=1)
    y = df['US'].values

    # fit_resample replaces fit_sample in current imbalanced-learn releases
    X_res, y_res = SMOTE().fit_resample(X, y)

    # Rebuild a DataFrame with the original column layout in one step
    # instead of appending row by row (which also skipped the first row)
    resampled = pd.DataFrame(X_res, columns=X.columns).reindex(columns=df.columns)
    resampled['US'] = y_res

    return resampled[resampled[minority] == 1]

# france = overSampleMinority(ucb_ofit, 'FR')
# italy = overSampleMinority(ucb_ofit, 'IT')
# germany = overSampleMinority(ucb_ofit, 'DE')
# australia = overSampleMinority(ucb_ofit, 'AU')
# portugal = overSampleMinority(ucb_ofit, 'PT')
# spain = overSampleMinority(ucb_ofit, 'ES')
# canada = overSampleMinority(ucb_ofit, 'CA')
# great_britain = overSampleMinority(ucb_ofit, 'GB')
# netherlands = overSampleMinority(ucb_ofit, 'NL')
# usa = ucb[ucb['country_destination'] == 'US']

# overSampledUcb = france.append(italy)
# overSampledUcb = overSampledUcb.append(germany)
# overSampledUcb = overSampledUcb.append(australia)
# overSampledUcb = overSampledUcb.append(portugal)
# overSampledUcb = overSampledUcb.append(spain)
# overSampledUcb = overSampledUcb.append(canada)
# overSampledUcb = overSampledUcb.append(great_britain)
# overSampledUcb = overSampledUcb.append(netherlands)
# overSampledUcb = overSampledUcb.append(usa)

os_ucb = pd.read_csv('over_sample_data.csv')

### <p id="holdout-data">Create Holdout Data</p>
To get a more accurate scoring of the algorithms used, we will separate our data with a 70/30 split, the 30% being the 'holdout' data. The holdout data isn't used to train any models, and its target values are only used for scoring, so it simulates unseen data for the model. We will also cross validate on ten folds to get as reliable a measure of performance for each algorithm as we can.
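
Because the target is imbalanced, it helps if the holdout keeps the same class proportions as the full data, which is what the `stratify` argument of `train_test_split` asks for. The sketch below illustrates that idea in plain Python; `stratified_holdout` and the toy labels are hypothetical and this is not scikit-learn's implementation.

```python
import random
from collections import Counter, defaultdict

def stratified_holdout(labels, test_size=0.3, seed=42):
    """Return (train_idx, test_idx) so each class keeps roughly the same
    share in both splits."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)

    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_test = round(len(indices) * test_size)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)

# Hypothetical imbalanced target: 80 'US' rows and 20 'FR' rows
labels = ['US'] * 80 + ['FR'] * 20
train_idx, test_idx = stratified_holdout(labels)
test_counts = Counter(labels[i] for i in test_idx)
print(test_counts)
```

With these toy labels the holdout receives 30% of each class, so the minority class cannot be left out of the holdout by chance.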

In [19]:
country_names = ['GLOBAL', 'AU', 'CA', 'DE', 'SP', 'FR', 'GB', 'IT', 'NL', 'PT', 'US']

X = ucb.drop('country_destination', axis=1)
y = ucb['country_destination'].values

osX = os_ucb.drop('country_destination', axis=1)
osY = os_ucb['country_destination'].values

In [20]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = .3, random_state=42, stratify=y)
osX_train, osX_test, osy_train, osy_test = train_test_split(osX, osY, test_size = .3, random_state=42)

### <p id="init-models">Initiate Scikit Learn Algorithm Classes</p>

In [21]:
knn = KNeighborsClassifier()
dtree = DecisionTreeClassifier()
rf = RandomForestClassifier(n_estimators=100)
logreg = LogisticRegression()
gaus = GaussianNB()
grd = GradientBoostingClassifier()
per = Perceptron()
voting = VotingClassifier(estimators=[('knn', knn), ('dtree', dtree), ('grd', grd)], voting='soft')

### <p id="scoring-function">Create Scoring Function</p>
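The function takes an algorithm along with the training and holdout splits, fits the algorithm on the training data, scores it with 10-fold cross validation, and builds a list of true positive rates for each class from a confusion matrix. The cross validation score is placed first in the returned list, followed by one rate per class.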
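
The reason for reporting a per-class true positive rate alongside the overall score can be seen on a toy example. The sketch below uses made-up labels and a hypothetical `per_class_recall` helper; it shows a model that always predicts the majority class scoring well on accuracy while missing every minority class.

```python
from collections import Counter

def per_class_recall(y_true, y_pred):
    """True positive rate for each class: correct predictions / actual occurrences."""
    totals = Counter(y_true)
    hits = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    return {label: hits[label] / totals[label] for label in sorted(totals)}

# Hypothetical holdout labels: 8 'US', 1 'FR', 1 'IT'
y_true = ['US'] * 8 + ['FR', 'IT']
# A model that always predicts the majority class
y_pred = ['US'] * 10

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
recall = per_class_recall(y_true, y_pred)
print(accuracy)  # 0.8
print(recall)    # {'FR': 0.0, 'IT': 0.0, 'US': 1.0}
```

These rates are the diagonal of the confusion matrix divided by its row totals, which is what the scoring function computes for each country.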

In [22]:
def scoreFit(alg, xt, yt, xtst, ytst):
    # Fit on the training split
    alg.fit(xt, yt)
    # Mean accuracy over 10 cross validation folds of the split passed as xtst/ytst
    alg_fold_score = model_selection.cross_val_score(alg, xtst, ytst, cv=10).mean()
    predictions = alg.predict(xtst)
    c_matrix = confusion_matrix(ytst, predictions)

    # Per-class true positive rate: diagonal count divided by the row total
    country_tp_percent = (c_matrix.diagonal() / c_matrix.sum(axis=1)).tolist()

    # Global fold score first, then one rate per class
    return [alg_fold_score] + country_tp_percent

### <p id="baseline-normal">Create Baselines for All Algorithms </p>

In [23]:
knn_scores = pd.DataFrame({
    'KNN Score': scoreFit(knn, X_train, y_train, X_test, y_test)
})

dtree_scores = pd.DataFrame({
    'Decision Trees Score': scoreFit(dtree, X_train, y_train, X_test, y_test)
})

rf_scores = pd.DataFrame({
    'Random Forest Score': scoreFit(rf, X_train, y_train, X_test, y_test)
})

lr_scores = pd.DataFrame({
    'Logistric Regression Score': scoreFit(logreg, X_train, y_train, X_test, y_test)
})

gaus_scores = pd.DataFrame({
    'Gaussian NB Score': scoreFit(gaus, X_train, y_train, X_test, y_test)
})

grd_scores = pd.DataFrame({
    'Gradient Boosting Score': scoreFit(grd, X_train, y_train, X_test, y_test)
})

per_scores = pd.DataFrame({
    'Perceptron Score': scoreFit(per, X_train, y_train, X_test, y_test)
})

voting_scores = pd.DataFrame({
    'Voting Ensemble Score': scoreFit(voting, X_train, y_train, X_test, y_test)
})

baseline_scores = knn_scores.join(dtree_scores)
baseline_scores = baseline_scores.join(rf_scores)
baseline_scores = baseline_scores.join(lr_scores)
baseline_scores = baseline_scores.join(gaus_scores)
baseline_scores = baseline_scores.join(grd_scores)
baseline_scores = baseline_scores.join(per_scores)
baseline_scores = baseline_scores.join(voting_scores)

baseline_scores

|   | KNN Score | Decision Trees Score | Random Forest Score | Logistric Regression Score | Gaussian NB Score | Gradient Boosting Score | Perceptron Score | Voting Ensemble Score |
|---|---|---|---|---|---|---|---|---|
| 0 | 0.950006 | 0.986294 | 0.935673 | 0.895094 | 0.914416 | 0.99014 | 0.814826 | 0.989285 |
| 1 | 0.903226 | 1.0 | 0.290323 | 0.16129 | 0.967742 | 1.0 | 0.0 | 1.0 |
| 2 | 0.883117 | 0.961039 | 0.896104 | 0.311688 | 0.922078 | 0.961039 | 0.0 | 0.961039 |
| 3 | 0.745098 | 1.0 | 0.313725 | 0.0 | 0.54902 | 0.941176 | 0.0 | 0.980392 |
| 4 | 0.693548 | 0.967742 | 0.572581 | 0.064516 | 0.58871 | 0.975806 | 0.0 | 0.983871 |
| 5 | 0.914286 | 0.967347 | 0.938776 | 0.914286 | 0.755102 | 0.987755 | 0.0 | 0.983673 |
| 6 | 0.633588 | 0.946565 | 0.938931 | 0.931298 | 0.938931 | 0.954198 | 0.198473 | 0.946565 |
| 7 | 0.812903 | 0.922581 | 0.6 | 0.012903 | 0.4 | 0.903226 | 0.0 | 0.941935 |
| 8 | 0.977778 | 1.0 | 0.311111 | 0.666667 | 0.911111 | 0.977778 | 0.0 | 1.0 |
| 9 | 1.0 | 0.923077 | 0.0 | 0.153846 | 0.692308 | 0.923077 | 0.0 | 0.923077 |


### <p id="baseline-over">Create Baselines for Over-sampled Data</p>

In [24]:
knn_scores = pd.DataFrame({
    'KNN Score': scoreFit(knn, osX_train, osy_train, osX_test, osy_test)
})

dtree_scores = pd.DataFrame({
    'Decision Trees Score': scoreFit(dtree, osX_train, osy_train, osX_test, osy_test)
})

rf_scores = pd.DataFrame({
    'Random Forest Score': scoreFit(rf, osX_train, osy_train, osX_test, osy_test)
})

lr_scores = pd.DataFrame({
    'Logistric Regression Score': scoreFit(logreg, osX_train, osy_train, osX_test, osy_test)
})

gaus_scores = pd.DataFrame({
    'Gaussian NB Score': scoreFit(gaus, osX_train, osy_train, osX_test, osy_test)
})

grd_scores = pd.DataFrame({
    'Gradient Boosting Score': scoreFit(grd, osX_train, osy_train, osX_test, osy_test)
})

per_scores = pd.DataFrame({
    'Perceptron Score': scoreFit(per, osX_train, osy_train, osX_test, osy_test)
})

voting_scores = pd.DataFrame({
    'Voting Ensemble Score': scoreFit(voting, osX_train, osy_train, osX_test, osy_test)
})

oversample_scores = knn_scores.join(dtree_scores)
oversample_scores = oversample_scores.join(rf_scores)
oversample_scores = oversample_scores.join(lr_scores)
oversample_scores = oversample_scores.join(gaus_scores)
oversample_scores = oversample_scores.join(grd_scores)
oversample_scores = oversample_scores.join(per_scores)
oversample_scores = oversample_scores.join(voting_scores)

oversample_scores

|   | Country | KNN Score | Decision Trees Score | Random Forest Score | Logistric Regression Score | Gaussian NB Score | Gradient Boosting Score | Perceptron Score | Voting Ensemble Score |
|---|---|---|---|---|---|---|---|---|---|
| 0 | GLOBAL | 0.994413 | 0.99839 | 0.997995 | 0.713634 | 0.796193 | 0.99612 | 0.196799 | 0.999184 |
| 1 | AU | 0.987382 | 0.996357 | 0.993902 | 0.710264 | 0.797963 | 0.995195 | 0.193269 | 0.997651 |
| 2 | CA | 0.99922 | 1.0 | 1.0 | 0.974526 | 0.984923 | 1.0 | 0.0 | 1.0 |
| 3 | DE | 0.998913 | 1.0 | 1.0 | 0.712694 | 0.94754 | 0.999728 | 0.0 | 1.0 |
| 4 | SP | 0.998166 | 0.998166 | 0.997904 | 0.702306 | 0.736373 | 0.995807 | 0.0 | 0.998952 |
| 5 | FR | 0.985989 | 0.996886 | 0.994292 | 0.444214 | 0.725221 | 0.991178 | 1.0 | 0.997665 |
| 6 | GB | 0.991632 | 0.995554 | 0.996077 | 0.282688 | 0.434362 | 0.991109 | 0.0 | 0.997908 |
| 7 | IT | 0.983356 | 0.997047 | 0.999463 | 0.795705 | 0.955973 | 0.998658 | 0.0 | 1.0 |
| 8 | NL | 0.989002 | 0.996781 | 0.99383 | 0.264217 | 0.331277 | 0.984979 | 0.0 | 0.997586 |
| 9 | PT | 0.999739 | 0.999739 | 1.0 | 0.963485 | 0.950183 | 1.0 | 0.0 | 1.0 |


### <p id="hyp-knn">Hyper Parameter Tuning - KNN</p>

In [None]:
nbrs = [num for num in range(1, 10)]
lsize = [leaf for leaf in range(1, 100)]

parameters = {'leaf_size': lsize, 'n_neighbors': nbrs, 'weights': ['uniform', 'distance'],
              'algorithm': ['kd_tree']}

clf = GridSearchCV(sc, knn, parameters).fit(X_train, y_train)
clf.best_estimator_

# Best Parameters:
# leaf_size: 1
# n_neighbors: 1
# algorithm: 'kd_tree'
# weights: 'uniform'

### <p id="hyp-dtree">Hyper Parameter Tuning - Decision Trees</p>

In [None]:
max_feat = [x for x in range(1, 100)]
min_leaf = [x for x in range(1, 10)]

parameters = {'criterion': ['gini', 'entropy'], 'splitter': ['best', 'random'], 
              'max_features': max_feat, 'min_samples_leaf': min_leaf,
              'class_weight': ['balanced', None]}

clf = GridSearchCV(sc, dtree, parameters).fit(X_train, y_train)
best = clf.best_estimator_

# min_samples_leaf: 1
# criterion: 'entropy'
# max_features: 48
# splitter: 'best'
# class_weight: 'balanced'

### <p id="hyp-rf">Hyper Parameter Tuning - Random Forests</p>

In [None]:
max_feat = [x for x in range(1, 100)]
min_leaf = [x for x in range(1, 10)]
estimators = [x for x in range(1, 200)]
    
parameters = {'n_estimators': estimators, 'criterion': ['gini', 'entropy'], 
              'max_features': max_feat, 'min_samples_leaf': min_leaf,
              'class_weight': ['balanced', None], 'warm_start': [True, False], 'bootstrap': [True, False]}
              
clf = GridSearchCV(sc, rf, parameters).fit(X_train, y_train)
best = clf.best_estimator_

# N_estimators: 78
# min_samples_leaf: 1
# Bootstrap: False
# Max Features: 69
# Warm Start: False
# class_weight: 'balanced'
# Criterion: 'Entropy'

### <p id="hyp-lr">Hyper Parameter Tuning -  Logistic Regression</p>

In [None]:
max_i = [x for x in range(1, 100)]

parameters = {'penalty': ['l1', 'l2'], 'C': [.001, .01, .1, 1, 10, 100],  'fit_intercept': [True, False],
              'solver': ['liblinear', 'newton-cg', 'sag', 'lbfgs'], 'max_iter': max_i,
              'class_weight': ['balanced', None], 'multi_class': ['ovr', 'multinomial']}
              
clf = GridSearchCV(sc, logreg, parameters).fit(X_train, y_train)
best = clf.best_estimator_

# Max_iter: 76
# C: 1
# solver: newton-cg
# class_weight: balanced
# multi_class: multinomial
# penalty: 11
# fit_intercept: True

### <p id="hyp-grd">Hyper Parameter Tuning - Gradient Boosting</p>

In [None]:
estimators = [x for x in range(1, 50)]
lrate = [x/1000 for x in range(1, 100)]
min_leaf = [x for x in range(1, 60)]

grd = GradientBoostingClassifier()

parameters = {'learning_rate': lrate, 'n_estimators': estimators,
              'max_features': ['log2', 'sqrt'] + list(range(1, 100)), 'min_samples_leaf': min_leaf}
              
clf = GridSearchCV(sc, grd, parameters).fit(X_train, y_train)
best = clf.best_estimator_

# n_estimators: 49
# learning_Rate: .094
# min_samples_leaf: 1
# max_features: 55

### <p id="init-tuned">Initialize Tuned Algorithms</p>

In [25]:
tuned_knn = KNeighborsClassifier(leaf_size=1, n_neighbors=1, algorithm='kd_tree', weights='uniform')
tuned_dtree = DecisionTreeClassifier(min_samples_leaf=1, criterion='entropy', splitter='best')
tuned_rf = RandomForestClassifier(n_estimators=78, min_samples_leaf=1, bootstrap=False, warm_start=False, class_weight='balanced', criterion='entropy')
tuned_logreg = LogisticRegression(max_iter=76, solver='newton-cg', class_weight='balanced', multi_class='multinomial')
tuned_grd = GradientBoostingClassifier(n_estimators=49, learning_rate=.094, min_samples_leaf=1)
tuned_voting = VotingClassifier(estimators=[('knn', tuned_knn), ('dtree', tuned_dtree), ('grd', tuned_grd)], voting='soft')

### <p id="feature-selection">Perform Feature Selection </p>

The data being used has 60 features. This could lead to overfitting. To remedy this, I am going to find the most important features for every algorithm using **Recursive Feature Elimination (RFE)**. From the scikit-learn website: ***Given an external estimator that assigns weights to features (e.g., the coefficients of a linear model), the goal of recursive feature elimination (RFE) is to select features by recursively considering smaller and smaller sets of features. First, the estimator is trained on the initial set of features and weights are assigned to each one of them. Then, features whose absolute weights are the smallest are pruned from the current set features. That procedure is recursively repeated on the pruned set until the desired number of features to select is eventually reached.***
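
The elimination loop described in that quote can be sketched without any estimator at all. In the snippet below, `weigh` is a hypothetical stand-in for "fit the estimator and read its weights", and the feature weights are invented for illustration; scikit-learn's `RFE` refits a real model on every pass.

```python
def recursive_feature_elimination(features, weigh, n_features_to_select):
    """Repeatedly drop the feature with the smallest absolute weight.

    `weigh` is a hypothetical callable: it takes the current feature names
    and returns a {name: weight} dict.
    """
    remaining = list(features)
    while len(remaining) > n_features_to_select:
        weights = weigh(remaining)
        weakest = min(remaining, key=lambda name: abs(weights[name]))
        remaining.remove(weakest)
    return remaining

# Hypothetical fixed weights; a real estimator would be refit on every pass
toy_weights = {'age': 0.9, 'days_to_book': -1.4, 'lang_match': 0.05, 'Summer': 0.3}
selected = recursive_feature_elimination(
    toy_weights,
    lambda names: {n: toy_weights[n] for n in names},
    n_features_to_select=2)
print(selected)  # ['age', 'days_to_book']
```

Note that the ranking uses the absolute weight, so a strongly negative feature such as `days_to_book` in this toy example is kept.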

In [26]:
def getOptimalFeatures(alg):
    scores = []
    for feat_count in range(1, 60):
        rfe_alg = RFE(estimator=alg, n_features_to_select=feat_count, step=1)
        rfe_alg.fit(X_train, y_train)
        rfe_score = rfe_alg.score(X_test, y_test)
        print(feat_count)

        scores.append({'feat_count': feat_count, 'score': rfe_score})
        
    scores = sorted(scores, key=lambda x: x['score'], reverse=True)
    return scores

In [27]:
optimal_knn = tuned_knn
optimal_logreg = RFE(estimator=tuned_logreg, n_features_to_select=5, step=1)
optimal_dtree = RFE(estimator=tuned_dtree, n_features_to_select=3, step=1)
optimal_rf = RFE(estimator=tuned_rf, n_features_to_select=3, step=1)
optimal_grd = RFE(estimator=tuned_grd, n_features_to_select=4, step=1)
optimal_gaus = RFE(estimator=gaus, n_features_to_select=3, step=1)
optimal_per = RFE(estimator=per, n_features_to_select=7, step=1)
optimal_voting = VotingClassifier(estimators=[('optimal_knn', optimal_knn), ('optimal_dtree', optimal_dtree), ('optimal_grd', optimal_grd)], voting='soft')

### <p id="final-results">Final Results</p>

In [None]:
# scoreFit expects (alg, X_train, y_train, X_test, y_test)
opt_knn_scores = pd.DataFrame({
    'Tuned KNN Score': scoreFit(optimal_knn, X_train, y_train, X_test, y_test)
})
opt_dt_scores = pd.DataFrame({
    'Tuned Decision Tree Score': scoreFit(optimal_dtree, X_train, y_train, X_test, y_test)
})
opt_rf_scores = pd.DataFrame({
    'Tuned Random Forest Score': scoreFit(optimal_rf, X_train, y_train, X_test, y_test)
})
opt_grd_scores = pd.DataFrame({
    'Tuned Gradient Boosting Score': scoreFit(optimal_grd, X_train, y_train, X_test, y_test)
})
opt_gaus_scores = pd.DataFrame({
    'Tuned Gaus Score': scoreFit(optimal_gaus, X_train, y_train, X_test, y_test)
})
opt_per_scores = pd.DataFrame({
    'Tuned Perceptron Score': scoreFit(optimal_per, X_train, y_train, X_test, y_test)
})
opt_voting_scores = pd.DataFrame({
    'Tuned Voting Score': scoreFit(optimal_voting, X_train, y_train, X_test, y_test)
})
all_scores = pd.DataFrame({
    'Country': country_names
})

alg_scores = [knn_scores, opt_knn_scores, dtree_scores, opt_dt_scores, rf_scores, opt_rf_scores, lr_scores,
gaus_scores, grd_scores, opt_grd_scores, per_scores, opt_per_scores, voting_scores, opt_voting_scores]

for score in alg_scores:
    all_scores = all_scores.join(score)
    
all_scores

### <p id="make-predictions">Make Predictions</p>

In [None]:
# Drop columns from the test set in one call; axis=1 is required to drop columns
test = test.drop(['weibo', 'daum'], axis=1)
X = ucb.drop('country_destination', axis=1)
X = X[test.columns]
y = ucb['country_destination'].values

optimal_grd.fit(X, y)
predictions = optimal_grd.predict(test)

In [None]:
# ------------ PREPARE FOR SUBMISSION -------------- #
new_df = pd.DataFrame(columns=['id', 'country'])
new_df['id'] = user_ids
new_df['country'] = predictions
new_df.to_csv('airbnb_final.csv', index=False)
# ---------- END PREPARE FOR SUBMISSION ----------- #

### <p id="summary">Summary</p>
After submitting my predictions to the Kaggle competition using the model trained on the imbalanced dataset, I received a score of 23%. This was because one hundred percent of my predictions were 'US', so the score simply reflected getting every 'US' booking right. I then fit the same algorithm on the data with over-sampled minority classes and received an even lower score, 21%. Considering that my global 10-fold cross validation scores and per-class true positive scores were very high for each algorithm, I am highly surprised that I received such a low score.