## Week 1 - a taste of Kaggle
   1. This week's assignment uses only the phone brand and device model features.
   2. We'll validate models both with holdout data and with cross validation.
   3. Finally, we'll average the predictions from different models and evaluate the result.

In [2]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import MultinomialNB, GaussianNB
import pandas as pd
import numpy as np
from scipy import sparse
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
# sklearn.cross_validation and sklearn.grid_search now live in sklearn.model_selection
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import log_loss
import time
from sklearn import preprocessing, pipeline, metrics, model_selection


#### Define a function that can be used for both Cross Validation and Grid Search for optimal parameters

In [3]:
def search_model(train_x, train_y, est, param_grid, n_jobs, cv, refit=False):
    # Grid search for the best model; an empty param_grid gives plain cross validation
    model = model_selection.GridSearchCV(estimator  = est,
                                         param_grid = param_grid,
                                         scoring    = 'neg_log_loss',  # negated log loss: closer to 0 is better
                                         verbose    = 10,
                                         n_jobs     = n_jobs,
                                         refit      = refit,
                                         cv         = cv)
    # Fit Grid Search Model
    model.fit(train_x, train_y)
    print("Best score: %0.3f" % model.best_score_)
    print("Best parameters set:", model.best_params_)
    print("Scores:", model.cv_results_['mean_test_score'])
    return model
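
The score reported by the search is based on multiclass log loss. As a rough illustration of what that number means, the sketch below computes it by hand on a made-up 3-class example and compares it with `sklearn.metrics.log_loss`. The labels and probabilities are invented for illustration only.

```python
import numpy as np
from sklearn.metrics import log_loss

# Toy example: 3 samples, 3 classes (all values are made up).
y_true = np.array([0, 2, 1])
y_prob = np.array([
    [0.7, 0.2, 0.1],
    [0.1, 0.3, 0.6],
    [0.2, 0.5, 0.3],
])

# Multiclass log loss: mean of -log(probability assigned to the true class).
manual = -np.mean(np.log(y_prob[np.arange(len(y_true)), y_true]))
library = log_loss(y_true, y_prob, labels=[0, 1, 2])

print("manual log loss:  %.6f" % manual)
print("sklearn log loss: %.6f" % library)
```

A lower log loss is better: confident predictions for the wrong class are penalized heavily.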

#### Load Data 

Again, we will only load train, test and phone brand/model data for week one.

In [4]:
# Load Data

print("# Load Phone Brand")
phone_brand = pd.read_csv("../input/phone_brand_device_model.csv",
                  dtype={'device_id': str})  # np.str was removed from NumPy; use the builtin str
phone_brand.drop_duplicates('device_id', keep='first', inplace=True)


print("# Load Train and Test")
train_data = pd.read_csv("../input/gender_age_train.csv",
                    dtype={'device_id': str})

test_data = pd.read_csv("../input/gender_age_test.csv",
                   dtype={'device_id': str})


full_data = pd.concat((train_data, test_data), axis=0, ignore_index=True)
train_size = len(train_data)
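# A left merge keeps the row order of full_data (train rows first, then test rows);
# recent pandas versions reject combining on= with left_index=True
full_data = pd.merge(full_data, phone_brand, how='left',
                on='device_id')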

print ("Data Loaded.")
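
To see why duplicates are dropped before merging, here is a small sketch on hypothetical toy frames (the ids, groups, and brands are made up and stand in for the real CSV files). With unique `device_id` values on the right side, a left merge keeps exactly one row per left row, in the original order, which is what slicing by `train_size` later relies on.

```python
import pandas as pd

# Hypothetical toy frames standing in for the real CSV files.
devices = pd.DataFrame({"device_id": ["c", "a", "b"],
                        "group": ["g1", "g2", None]})
brands = pd.DataFrame({"device_id": ["a", "a", "b", "c"],
                       "phone_brand": ["X", "X", "Y", "Z"]})

# Drop repeated device ids first so the left merge cannot add rows.
brands = brands.drop_duplicates("device_id", keep="first")
merged = pd.merge(devices, brands, how="left", on="device_id")

print(merged)
```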


#### Group features

As a general practice, we group the columns into categorical features, numeric features, the target, and the ID column.

In [None]:
#Group columns - categorical, numerical, target and id
data_types = full_data.dtypes  

#ID
id_col = 'device_id'
#Target
target_col = 'group'

#Categorical columns:
cat_cols = list(data_types[data_types=='object'].index)
cat_cols.remove('group')
cat_cols.remove('gender')
cat_cols.remove('device_id')

#Numeric columns:
num_cols = list(data_types[data_types=='int64'].index) + list(data_types[data_types=='float64'].index)
num_cols.remove('age')


print ("ID column:", id_col)
print ("Target column:",target_col)
print ("Categorical column:",cat_cols)
print ("Numeric column:",num_cols)

#### Encode target

1. Many models require Y to be numeric, so we need to encode the target.
2. The classes_ attribute of LabelEncoder keeps the original labels, which can be used when writing output.

In [None]:
# Encode the target using only the training rows of full_data
LBL = preprocessing.LabelEncoder()
Y = LBL.fit_transform(full_data[target_col][:train_size])
target_names=LBL.classes_
print ("target group names:", target_names)
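
As a small illustration of how `LabelEncoder` behaves, the sketch below encodes a few made-up labels (not the real target groups) and maps the integers back to the original names.

```python
from sklearn.preprocessing import LabelEncoder

# Made-up labels, used only to show the mechanics.
labels = ["young", "old", "young", "middle"]

enc = LabelEncoder()
encoded = enc.fit_transform(labels)

# classes_ holds the original labels in sorted order; the integer code
# of each label is its position in classes_.
print("classes:", list(enc.classes_))
print("encoded:", list(encoded))
print("decoded:", list(enc.inverse_transform(encoded)))
```

The column order of `predict_proba` output follows this same sorted class order, which is why `classes_` is handy for labeling prediction columns.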

#### Data preprocessing - categorical features

We will preprocess categorical features with two encodings: one-hot-encoding (get_dummies) and label encoding.

In [None]:
# One-hot-encoding
full_ohe=pd.get_dummies(full_data[cat_cols],sparse=True)
full_ohe=sparse.csr_matrix(full_ohe)

# Label encoding (a fresh encoder per column, so the target encoder LBL is left untouched)
full_le =pd.DataFrame()
for col in cat_cols:
    full_le[col]=preprocessing.LabelEncoder().fit_transform(full_data[col])
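
To make the difference between the two encodings concrete, here is a minimal sketch on a made-up single-column frame. The brand values are placeholders, not real data.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Made-up column with three distinct categories.
toy = pd.DataFrame({"phone_brand": ["A", "B", "A", "C"]})

# One-hot encoding: one indicator column per category.
one_hot = pd.get_dummies(toy[["phone_brand"]])

# Label encoding: a single integer column.
label = LabelEncoder().fit_transform(toy["phone_brand"])

print(one_hot.shape)
print(list(one_hot.columns))
print(list(label))
```

One-hot encoding grows with the number of categories, which is why the notebook stores it as a sparse matrix; label encoding stays one column wide but imposes an arbitrary ordering on the categories.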


#### Cross Validation

* Here we use two models to demonstrate how cross validation works: Random Forest and Naive Bayes.
* I chose Naive Bayes because of the nature of this dataset: with only two categorical features, it is a good candidate for Bayes.
* We will also explore feeding different datasets to the same model:
    * For Naive Bayes we will try brand and model together, brand only, and model only.
    * For RF we will try label-encoded data and OHE data.
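
Before running the real experiments, the idea of 4-fold cross validation can be sketched on synthetic data. Everything below (the random features, the target rule, the small forest size) is an assumption made for illustration; it is not the notebook's dataset or its `search_model` helper.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

rng = np.random.RandomState(0)
# Two synthetic "label-encoded" features with 5 categories each.
X = rng.randint(0, 5, size=(200, 2))
# Synthetic 3-class target loosely tied to the first feature.
y = (X[:, 0] + rng.randint(0, 2, size=200)) % 3

# Each of the 4 folds is held out once while the model trains on the other 3.
cv = StratifiedKFold(n_splits=4, shuffle=True, random_state=0)
clf = RandomForestClassifier(n_estimators=20, random_state=0)
scores = cross_val_score(clf, X, y, cv=cv, scoring="neg_log_loss")

# scikit-learn reports negated log loss so that higher is better; flip the sign to read it.
print("log loss per fold:", -scores)
print("mean log loss: %.4f" % -scores.mean())
```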

**RF with LE**

In [None]:
param_grid = {
              }
model = search_model(full_le[:train_size]
                                         , Y
                                         , RandomForestClassifier(n_estimators=100)
                                         , param_grid
                                         , n_jobs=1
                                         , cv=4
                                         , refit=False)   

print ("best params:", model.best_params_)

**RF with OHE**

In [None]:
param_grid = {
              }
model = search_model(full_ohe[:train_size]
                                         , Y
                                         , RandomForestClassifier(n_estimators=100)
                                         , param_grid
                                         , n_jobs=1
                                         , cv=4
                                         , refit=False)   

print ("best params:", model.best_params_)

**NB with brand+model**

In [None]:
param_grid = {
              }
model = search_model(full_le.values[:train_size]
                                         , Y
                                         , GaussianNB()
                                         , param_grid
                                         , n_jobs=1
                                         , cv=4
                                         , refit=False)   

print ("best params:", model.best_params_)

**NB with brand only**

In [None]:
param_grid = {
              }
model = search_model(full_le[['phone_brand']][:train_size].values
                                         , Y
                                         , GaussianNB()
                                         , param_grid
                                         , n_jobs=1
                                         , cv=4
                                         , refit=False)   

print ("best params:", model.best_params_)

**NB with model only**

In [None]:
param_grid = {
              }
model = search_model(full_le[['device_model']][:train_size].values
                                         , Y
                                         , GaussianNB()
                                         , param_grid
                                         , n_jobs=1
                                         , cv=4
                                         , refit=False)   

print ("best params:", model.best_params_)

#### Holdout validation and simple ensemble

* For holdout we will use the first 60000 rows of the training data for training and the rest for validation.
     * We can also use train_test_split (in sklearn.model_selection; sklearn.cross_validation in older releases) to create the train/validation sets. Note that you need to set random_state to get reproducible results.
* For the ensemble we will do it in two ways:
    * Train two Random Forest models, one on label-encoded data and one on one-hot-encoded data, evaluate them separately, then average their predictions and evaluate the average.
    * Train two Naive Bayes models on different features, one on phone brand and one on device model, then average and evaluate their predictions.
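
As a minimal sketch of the `train_test_split` alternative mentioned above, the example below splits a tiny made-up array twice with the same `random_state` and checks that both splits are identical. The array contents and the 80/20 ratio are arbitrary choices for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Tiny made-up dataset: 10 rows, 2 features.
X = np.arange(20).reshape(10, 2)
y = np.array([0, 1] * 5)

# Same random_state -> same shuffled split on every call.
X_tr1, X_va1, y_tr1, y_va1 = train_test_split(X, y, test_size=0.2, random_state=42)
X_tr2, X_va2, y_tr2, y_va2 = train_test_split(X, y, test_size=0.2, random_state=42)

print(X_tr1.shape, X_va1.shape)
print("identical validation sets:", np.array_equal(X_va1, X_va2))
```

Unlike taking the first 60000 rows, this split shuffles the data first, which matters if the rows are stored in some meaningful order.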

**Create models**

In [None]:
clf_rf=RandomForestClassifier(n_estimators=100)
clf_gnb=GaussianNB()

**Create training and validation datasets**

In [None]:
X_train_le =full_le.values[:train_size][:60000]
X_val_le =full_le.values[:train_size][60000:]

X_train_ohe =full_ohe[:train_size][:60000]
X_val_ohe =full_ohe[:train_size][60000:]

X_train_brand =full_le[['phone_brand']][:train_size][:60000].values
X_val_brand =full_le[['phone_brand']][:train_size][60000:].values

X_train_model =full_le[['device_model']][:train_size][:60000].values
X_val_model =full_le[['device_model']][:train_size][60000:].values

y_train = Y[:60000]
y_val = Y[60000:]

**Random Forest with Label Encoding**

In [None]:
clf_rf.fit(X_train_le,y_train)
y_pred_le = clf_rf.predict_proba(X_val_le)
print ("Score for Random Forest using Label Encoding is %f" % (metrics.log_loss(y_val,y_pred_le)))

**Random Forest with One-Hot-Encoding**

In [None]:
clf_rf.fit(X_train_ohe,y_train)
y_pred_ohe = clf_rf.predict_proba(X_val_ohe)
print ("Score for Random Forest using One-Hot-Encoding is %f" % (metrics.log_loss(y_val,y_pred_ohe)))

**Simple ensemble by averaging predictions**

In [None]:
print ("Score for simple ensemble is %f" % (metrics.log_loss(y_val,(y_pred_ohe+y_pred_le)/2)))
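
To show what averaging does to the predicted probabilities, here is a small sketch with two made-up prediction matrices (not output from the models above). Each averaged row still sums to 1, and because -log is convex, the log loss of the average is never worse than the mean of the two individual log losses. It can still be worse than the better of the two models.

```python
import numpy as np
from sklearn.metrics import log_loss

# Made-up true labels and class probabilities from two hypothetical models.
y_true = np.array([0, 1, 2, 1])
pred_a = np.array([[0.6, 0.3, 0.1],
                   [0.2, 0.5, 0.3],
                   [0.1, 0.2, 0.7],
                   [0.5, 0.4, 0.1]])
pred_b = np.array([[0.5, 0.2, 0.3],
                   [0.3, 0.6, 0.1],
                   [0.3, 0.3, 0.4],
                   [0.2, 0.7, 0.1]])

# Simple ensemble: element-wise mean of the two probability matrices.
avg = (pred_a + pred_b) / 2

loss_a = log_loss(y_true, pred_a, labels=[0, 1, 2])
loss_b = log_loss(y_true, pred_b, labels=[0, 1, 2])
loss_avg = log_loss(y_true, avg, labels=[0, 1, 2])

print("model A: %.4f, model B: %.4f, average: %.4f" % (loss_a, loss_b, loss_avg))
```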

**Naive Bayes with phone brand**

In [None]:
clf_gnb.fit(X_train_brand,y_train)
y_pred_brand = clf_gnb.predict_proba(X_val_brand)
print ("Score for Naive Bayes using phone brand is %f" % (metrics.log_loss(y_val,y_pred_brand)))

**Naive Bayes with device model**

In [None]:
clf_gnb.fit(X_train_model,y_train)
y_pred_model = clf_gnb.predict_proba(X_val_model)
print ("Score for Naive Bayes using device model is %f" % (metrics.log_loss(y_val,y_pred_model)))

**Ensemble NB models**

In [None]:
print ("Score for simple ensemble is %f" % (metrics.log_loss(y_val,(y_pred_brand+y_pred_model)/2)))

**Put it all together**

In [None]:
print ("Score for simple ensemble is %f" % (metrics.log_loss(y_val,(y_pred_ohe+y_pred_le+y_pred_brand+y_pred_model)/4)))

#### Homework

* Keep exploring different models/parameters/approaches using only brand and model, and see how well you can do.
* You may also want to try manually labeling phone brand/model with additional information, e.g. release year, price etc., and see whether it helps.
* Kindly send your findings/feedback/questions to the TAs before next Thursday.