#### Week 2 - Feature Engineering

This week we will engineer features from two types of data.

First, we will use leave-one-out encoding, an approach heavily used by Owen Zhang, a former #1 Kaggler, to engineer categorical features. The idea is to compute the mean of a target (age and gender in this case) grouped by a given categorical feature, leaving out the current row's own target value, and then replace the original categorical value with that mean. Be aware that this approach can easily lead to overfitting, so we will add random noise to the encoded values of the training rows.
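
The idea can be illustrated on a toy table. This is only a minimal sketch: the `toy` frame, its values and the fixed seed are made up for illustration, and the 5% noise range mirrors the `random_rate=0.05` default used later in this notebook.

```python
import numpy as np
import pandas as pd

# Toy training data: one categorical feature and a binary target (values are made up).
toy = pd.DataFrame({
    "brand":  ["A", "A", "A", "B", "B", "C"],
    "gender": [1, 0, 1, 0, 0, 1],
})

grp = toy.groupby("brand")["gender"]
cat_sum = grp.transform("sum")      # sum of the target within each category
cat_size = grp.transform("count")   # number of rows within each category

# Leave-one-out mean: remove the row's own target before averaging.
# A category with a single row has no other rows to average, so it falls back to 0.
denom = (cat_size - 1).where(cat_size > 1)   # NaN where the category has one row
toy["loo"] = ((cat_sum - toy["gender"]) / denom).fillna(0.0)

# Multiplicative noise within +/-5% so the encoded value does not leak the target exactly.
rng = np.random.default_rng(0)
toy["loo_noisy"] = toy["loo"] * rng.uniform(0.95, 1.05, size=len(toy))

print(toy)
```

For brand `A` the target sums to 2 over 3 rows, so a row with target 1 gets (2 - 1) / 2 = 0.5 and the row with target 0 gets (2 - 0) / 2 = 1.0.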

Next, we will work with transactional data. This competition has multiple levels of transactional and hierarchical data. Essentially, we will aggregate lower-level records to a higher level by concatenating the values of each record into one big text feature, then count the frequency of each value and convert the results into a sparse matrix. Thanks to the sparse representation, we keep as much information as possible without sacrificing performance.
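
A small sketch of that aggregation on made-up records follows. The `records` frame and its ids are hypothetical stand-ins for the real event tables.

```python
import pandas as pd
from scipy import sparse
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical lower-level records: one row per (device, app) observation.
records = pd.DataFrame({
    "device_id": ["d1", "d1", "d1", "d2", "d2"],
    "app_id":    ["app10", "app11", "app10", "app11", "app12"],
})

# Aggregate to the device level: one space-separated "document" per device.
docs = records.groupby("device_id")["app_id"].apply(" ".join)

# Count how often each value occurs; the result is a SciPy sparse matrix.
vectorizer = CountVectorizer()
counts = vectorizer.fit_transform(docs)

print(sorted(vectorizer.vocabulary_))   # column order of the matrix
print(counts.toarray())                 # dense view, only sensible for tiny examples
```

Each row of `counts` is a device and each column a distinct app id, so only the non-zero counts are stored.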

In [None]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import MultinomialNB, GaussianNB
import pandas as pd
import numpy as np
from scipy import sparse
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import log_loss
import time
import random
from sklearn import preprocessing, pipeline, metrics, model_selection
import xgboost as xgb
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer


Here we define a grid-search helper and a new function for leave-one-out encoding.

In [None]:
from sklearn.model_selection import GridSearchCV  # replaces the removed sklearn.grid_search module

def search_model(train_x, train_y, est, param_grid, n_jobs, cv, refit=False):
    """Grid search for the best model, scored by negated log loss."""
    model = GridSearchCV(estimator=est,
                         param_grid=param_grid,
                         scoring='neg_log_loss',  # higher is better, so the loss is negated
                         verbose=10,
                         n_jobs=n_jobs,
                         refit=refit,
                         cv=cv)
    # Fit the grid search model
    model.fit(train_x, train_y)
    print("Best score: %0.3f" % model.best_score_)
    print("Best parameters set:", model.best_params_)
    print("Scores:", model.cv_results_['mean_test_score'])
    return model

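def loo_encode(data, cat_col, target_col, train_size, random_rate=0.05):
    print("Leave-one-out encoding %s on %s" % (cat_col, target_col))
    # Per-category statistics computed on the training rows only
    aggr = data[:train_size].groupby(cat_col)[target_col].agg(['mean', 'size', 'sum']).reset_index()
    data = pd.merge(data, aggr, how='left', on=cat_col)
    # Test rows use the plain category mean from the training set
    data['loo'] = data['mean']
    # Training rows exclude their own target value, then get multiplicative noise
    train_loo = data[:train_size].apply(
        lambda row: 0 if row['size'] <= 1
        else (row['sum'] - row[target_col]) / (row['size'] - 1)
             * random.uniform(1 - random_rate, 1 + random_rate),
        axis=1).values
    # Assign through .loc to avoid chained assignment on a copy
    data.loc[data.index[:train_size], 'loo'] = train_loo
    # Categories never seen in training have no statistics; default to 0
    return data['loo'].fillna(0).values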

In [None]:
# Load Data

print("# Load Phone Brand")
phone_brand = pd.read_csv("../input/phone_brand_device_model.csv",
                  dtype={'device_id': str})
phone_brand.drop_duplicates('device_id', keep='first', inplace=True)


print("# Load Train and Test")
train_data = pd.read_csv("../input/gender_age_train.csv",
                    dtype={'device_id': str})

test_data = pd.read_csv("../input/gender_age_test.csv",
                   dtype={'device_id': str})


full_data = pd.concat((train_data, test_data), axis=0, ignore_index=True)
train_size = len(train_data)
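# A left merge on device_id keeps the row order of full_data, so the first train_size rows are still the training rows
full_data = pd.merge(full_data, phone_brand, how='left',
                on='device_id')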

print ("Data Loaded.")

In [None]:
#Group columns - categorical, numerical, target and id
data_types = full_data.dtypes  

#ID
id_col = 'device_id'
#Target
target_col = 'group'

#Categorical columns:
cat_cols = list(data_types[data_types=='object'].index)
cat_cols.remove('group')
cat_cols.remove('gender')
cat_cols.remove('device_id')

#Numeric columns:
num_cols = list(data_types[data_types=='int64'].index) + list(data_types[data_types=='float64'].index)
num_cols.remove('age')


print ("ID column:", id_col)
print ("Target column:",target_col)
print ("Categorical column:",cat_cols)
print ("Numeric column:",num_cols)

In [None]:
#Label target
LBL = preprocessing.LabelEncoder()

Y=LBL.fit_transform(full_data[target_col][:train_size])
    
target_names=LBL.classes_
print ("target group names:", target_names)

full_data['gender']=full_data['gender'].apply(lambda x:1 if x=='F' else 0)

device_id = full_data[train_size:]["device_id"].values


**Concatenate brand and model**

In [None]:
full_data['brand_model']=full_data['phone_brand']+full_data['device_model']
cat_cols.append('brand_model')


**Leave-one-out encode categorical features**

In [None]:
loo_cols = []
for c in cat_cols:
    for t in ['age','gender']:
        loo_col=c+'_'+t+'_loo'
        full_data[loo_col]=loo_encode(full_data[[c,t]],c,t,train_size,random_rate=0.05)
        loo_cols.append(loo_col)

**Use XGBClassifier as baseline**

In [None]:
X_train, X_val, y_train, y_val = train_test_split(full_data[loo_cols].values[:train_size], Y
                                                  , train_size=.80, random_state=1234)

clf = xgb.XGBClassifier()
clf.fit(X_train,y_train)
pred_val=clf.predict_proba(X_val)
print ("mlogloss: %f" % (metrics.log_loss(y_val, pred_val)))

**Aggregate transactional data to the device level**

We will load events, app_events, app_labels and label_categories separately, then aggregate them by device.

In [None]:
start = time.time()
app_ev = pd.read_csv("../input/app_events.csv", dtype={'device_id': str})
print ("App Events loaded in %f seconds" %(time.time() - start))

start = time.time()
events = pd.read_csv("../input/events.csv", dtype={'device_id': str})
print ("Events loaded in %f seconds" %(time.time() - start))

start = time.time()
app_lab = pd.read_csv("../input/app_labels.csv", dtype={'device_id': str})
print ("App Labels loaded in %f seconds" %(time.time() - start))

start = time.time()
lab_cat = pd.read_csv("../input/label_categories.csv", dtype={'device_id': str})
print ("Label Categories loaded in %f seconds" %(time.time() - start))


**Concatenate applications, labels and label categories into one big text column for each device**

In [None]:
start = time.time()  # restart the timer so the message below reports only the aggregation time
device_app = pd.merge(events[['device_id','event_id']]
                      , app_ev[['event_id','app_id']], on='event_id')[['device_id','app_id']].drop_duplicates()
device_label = pd.merge(device_app
                        , app_lab, on='app_id')[['device_id','label_id']].drop_duplicates()
device_category= pd.merge(device_label
                          , lab_cat, on='label_id')[['device_id','category']].drop_duplicates()
print ("device apps labels and categories aggregated in %f seconds" %(time.time() - start))


**Group categories/labels/apps by device id and merge them into one big list**

In [None]:
device_category = device_category.groupby("device_id")["category"].apply(list)
device_label = device_label.groupby("device_id")["label_id"].apply(list)
device_app = device_app.groupby("device_id")["app_id"].apply(list)
del app_ev,events, lab_cat, app_lab
print(device_category.shape, device_label.shape, device_app.shape)


In [None]:
# Devices without any events map to NaN; those become empty strings.
# Spaces inside a category name are removed before joining, so each category stays a single token.
full_data["category"] = full_data["device_id"].map(device_category).apply(
    lambda x: ' '.join(str(c).replace(' ', '') for c in x) if isinstance(x, list) else '')
full_data["label"] = full_data["device_id"].map(device_label).apply(
    lambda x: ' '.join(str(c) for c in x) if isinstance(x, list) else '')
full_data["app"] = full_data["device_id"].map(device_app).apply(
    lambda x: ' '.join(str(c) for c in x) if isinstance(x, list) else '')

# Remove spaces so that each device model is counted as one token
full_data['device_model'] = full_data['device_model'].apply(lambda x: x.replace(' ', ''))

**Count the frequencies of each keyword (brand, model and app id), then convert the results to a sparse matrix**

In [None]:
counter = CountVectorizer(min_df=1)
matrix = full_data[["phone_brand", "device_model", "app"]].astype(str).apply(
    lambda x: " ".join(s for s in x), axis=1)
matrix = counter.fit_transform(matrix)

**XGB baseline - brand, model and application**

In [None]:
X_train, X_val, y_train, y_val = train_test_split(matrix[:train_size], Y, train_size=.80, random_state=1234)

clf = xgb.XGBClassifier()
clf.fit(X_train,y_train)
pred_val=clf.predict_proba(X_val)
print ("mlogloss: %f" % (metrics.log_loss(y_val, pred_val)))

Assuming everything went smoothly, you should see a significant improvement in mlogloss, from about 2.39 to about 2.33.
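
To put these numbers in context, here is a minimal NumPy sketch of the multi-class log loss. The helper name `multiclass_log_loss` is hypothetical, and the 12 classes are an assumption about the number of target groups; substitute `len(target_names)` for your data.

```python
import numpy as np

def multiclass_log_loss(y_true, proba, eps=1e-15):
    """Mean negative log of the probability assigned to the true class."""
    proba = np.clip(proba, eps, 1 - eps)              # avoid log(0)
    proba = proba / proba.sum(axis=1, keepdims=True)  # renormalise each row
    return -np.mean(np.log(proba[np.arange(len(y_true)), y_true]))

n_classes = 12  # assumption: number of target groups
labels = np.array([0, 3, 7, 11, 2])                 # made-up true classes
uniform = np.full((len(labels), n_classes), 1.0 / n_classes)

# A model that predicts every class as equally likely scores ln(n_classes).
baseline = multiclass_log_loss(labels, uniform)
print("uniform-guess mlogloss: %f" % baseline)
```

Under that assumption, a uniform guess scores ln(12), roughly 2.48, so the scores above should be read against that reference point.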

Now we will use a trick called early stopping to find out the optimal number of iterations for XGB.
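
The stopping rule itself is simple. The sketch below illustrates the general idea in plain Python; it is not XGBoost's internal implementation, and the function name and loss values are made up.

```python
def best_iteration_with_patience(val_losses, patience):
    """Scan validation losses in order and stop once the loss has not
    improved for `patience` consecutive rounds. Returns (best_index, best_loss)."""
    best_idx, best_loss = 0, float("inf")
    for i, loss in enumerate(val_losses):
        if loss < best_loss:
            best_idx, best_loss = i, loss
        elif i - best_idx >= patience:
            break  # no improvement for `patience` rounds: stop training
    return best_idx, best_loss

# Made-up validation losses, one per boosting round.
losses = [2.40, 2.36, 2.34, 2.33, 2.335, 2.34, 2.35, 2.32]
best_idx, best_loss = best_iteration_with_patience(losses, patience=3)
print(best_idx, best_loss)
```

Note that the final value 2.32 is never reached: training stops after three rounds without improvement, which is the trade-off a small patience makes.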

In [None]:
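# Recent xgboost releases expect eval_metric and early_stopping_rounds in the constructor rather than in fit()
clf = xgb.XGBClassifier(n_estimators=1000, learning_rate=0.3,
                        eval_metric='mlogloss', early_stopping_rounds=20)
clf.fit(X_train, y_train, eval_set=[(X_val, y_val)])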



Retrieve the best iteration and its score from the previous step; we will use them to re-train the model.

In [None]:
# The fitted classifier exposes these attributes without a trailing underscore
best_iteration = clf.best_iteration
best_score = clf.best_score

print (best_iteration, best_score)

Re-train on the full training set with the best number of iterations, then create the submission.

In [None]:
# best_iteration is zero-based, so the number of trees to build is best_iteration + 1
clf=xgb.XGBClassifier( n_estimators = best_iteration + 1, learning_rate=0.3)
clf.fit(matrix[:train_size], Y)
pred=clf.predict_proba(matrix[train_size:])

result = pd.DataFrame(pred, columns=target_names)
result["device_id"] = device_id
result = result.set_index("device_id")
result.to_csv('brand_model_app_xgb.csv', index=True, index_label='device_id')

#### Homework

1. During the aggregation we dropped duplicate apps, labels and categories. What if we keep the duplicates? Would that help? Are there any other aggregation strategies?

2. We only used apps in this notebook. Can you also try labels, categories and different combinations of them?

3. Can we also try combining the transactional data with the LOO features?

4. Is there any other feature engineering that we can do?
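
As a possible starting point for question 3, dense columns can be appended to a sparse count matrix with `scipy.sparse.hstack`. This is only a sketch: `counts` and `loo_feats` are made-up stand-ins for `matrix` and `full_data[loo_cols].values`.

```python
import numpy as np
from scipy import sparse

# Stand-ins for the real data: a sparse count matrix and two dense LOO columns.
counts = sparse.csr_matrix(np.array([[2, 0, 1],
                                     [0, 1, 0],
                                     [0, 0, 3]]))
loo_feats = np.array([[0.5, 31.0],
                      [1.0, 28.5],
                      [0.0, 35.2]])

# Stack column-wise; converting the dense block first keeps the result sparse.
combined = sparse.hstack([counts, sparse.csr_matrix(loo_feats)], format="csr")
print(combined.shape)
```

The rows of both blocks must be in the same order, which holds here as long as both are taken from `full_data` without reordering.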