## Toppa @ Berkeley 2019 Data Team Homework 3
The dataset used for this homework is taken from Avazu's CTR Prediction Dataset on Kaggle.  
Your task is to fill out all the places marked with #TODO or #YOUR CODE HERE.  
This codebase is taken from @susanli's solution, with small modifications.  

Author: Vincent, @susanli  
Your name:

In [None]:
import numpy as np
import random
import pandas as pd
from matplotlib import pyplot as plt

# pd.datetime was removed from recent pandas releases; pd.to_datetime does the same parsing.
def parse_date(val): return pd.to_datetime(val, format='%y%m%d%H')

train = pd.read_csv("modified_data/train_100000", parse_dates=['hour'], date_parser=parse_date)
train.head()

### Step 1: EDA
Please perform some basic exploratory data analysis.  
Specifically, plot at least **5** different graphs and __report your findings in a short markdown block__.
Some potential options include:
1. Number of clicks per hour/day
2. Banner position vs. Click rate
3. Site ids vs. Click rate
4. Device id vs. Click rate
5. etc
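
As a starting point for the "vs. Click rate" options, the click rate of a group is just the mean of the 0/1 `click` column within that group. A minimal sketch on a toy frame (the values below are invented for illustration and are not taken from the Avazu data):

```python
import pandas as pd

# Toy stand-in for `train`; the values are invented for illustration only.
toy = pd.DataFrame({
    "banner_pos": [0, 0, 0, 1, 1, 1, 1, 2],
    "click":      [0, 1, 0, 1, 1, 0, 0, 0],
})

# Click rate = mean of the 0/1 click column within each group.
ctr_by_pos = toy.groupby("banner_pos")["click"].mean()
# Number of impressions per group, to judge how reliable each rate is.
impressions = toy.groupby("banner_pos")["click"].size()

print(ctr_by_pos)
print(impressions)
```

On the real data, `ctr_by_pos.plot(kind='bar')` would turn the result into a bar chart. Checking the impression counts alongside the rates helps avoid over-reading groups with very few rows.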

In [None]:
#YOUR CODE HERE

### Step 2: Feature Engineering
Now that you have a basic understanding of the dataset, we can start doing feature engineering.  
Step 1: Please complete the function convert_obj_to_int(). For this step, you only need to write one line.  
Step 2: Drop the 'hour' and 'id' columns, as they are not important for modeling.  
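
The feature list used later in this notebook includes `hour_of_day` and `day_of_week_int`, which the cells shown here never create. One way such columns could be derived from `hour` before it is dropped is sketched below on a toy frame. The column names come from that feature list; how the original solution actually built them is an assumption.

```python
import pandas as pd

# Toy stand-in for the parsed `hour` column; the timestamps are invented.
toy = pd.DataFrame({
    "hour": pd.to_datetime(["14102100", "14102213", "14102523"], format="%y%m%d%H")
})

# Hour of the day as an integer in 0..23.
toy["hour_of_day"] = toy["hour"].dt.hour
# Day name as a string; an object column like this would later be
# picked up by convert_obj_to_int() and receive the `_int` suffix.
toy["day_of_week"] = toy["hour"].dt.day_name()

print(toy)
```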

In [None]:
def convert_obj_to_int(self):
    '''
    This function takes in a dataframe with mixed dtypes and returns a dataframe with no object columns.
    Essentially, it transforms each column with the object dtype into a column of ints.
    Your task here is to implement a lambda function that maps an object to an integer.
    HINT: How did you solve the room number puzzle again?
    '''

    object_list_columns = self.columns  # Get a list of columns of the dataframe
    object_list_dtypes = self.dtypes  # Get a list of dtypes
    new_col_suffix = '_int'  # adding a suffix to new columns

    for index in range(0, len(object_list_columns)):
        if object_list_dtypes.iloc[index] == object:  # .iloc: positional lookup into the dtypes Series
            self[object_list_columns[index] +
                 new_col_suffix] = self[object_list_columns[index]].map(#YOUR CODE HERE) 
            self.drop([object_list_columns[index]], inplace=True, axis=1) # Dropping the original object column

    return self


train = convert_obj_to_int(train)

In [None]:
# Step 2: drop unneeded columns
train.drop('hour', axis=1, inplace=True)
train.drop('id', axis=1, inplace=True)

In [None]:
train.head()

At this point, you should have a dataframe that looks like the one in data.csv.  
If you don't (or you're not confident about your implementation), feel free to read data.csv as your engineered data for the training process below.

In [None]:
# Uncomment the following lines to use my feature engineered dataset
# train = pd.read_csv("data.csv")
# train.drop('Unnamed: 0', axis=1, inplace=True)
# train.head()

In [None]:
# Selecting useful features
features = ['C1', 'banner_pos', 'device_type', 'device_conn_type', 'C14',
       'C15', 'C16', 'C17', 'C18', 'C19', 'C20', 'C21', 'hour_of_day',
       'site_id_int', 'site_domain_int', 'site_category_int', 'app_id_int',
       'app_domain_int', 'app_category_int', 'device_id_int', 'device_ip_int',
       'device_model_int', 'day_of_week_int']

### Step 3: Model Construction
Now it's time to start building your own model!
Here are some models we are trying out for this hw:
1. LGBM
2. Logistic Regression

LGBM is a gradient boosted tree ensemble model; feel free to read more about it if you have the time.  
For logistic regression, we can just use the implementation in sklearn.   
__Please install LGBM if you haven't already.__  

In [None]:
# Initializing LGB model. No need to modify this code.

import lightgbm as lgb
X_train = train.loc[:, train.columns != 'click']
y_target = train.click.values

# create lightgbm dataset
msk = np.random.rand(len(X_train)) < 0.8
lgb_train = lgb.Dataset(X_train[msk], y_target[msk])
lgb_eval = lgb.Dataset(X_train[~msk], y_target[~msk], reference=lgb_train)

#### Intro to Tuning Hyperparameters
The model implementation of LGBM is already given here.  
All you need to do is to play around with the model hyperparameters. See if you can arrive at an optimal model.  
When you're done, please report your findings (e.g. which hyperparameter "matters" the most to model accuracy?).
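
One systematic way to "play around" is a small grid search: try every combination of a few candidate values and keep the one with the lowest validation loss. The sketch below shows only the search loop. `grid_search`, `evaluate`, and `fake_evaluate` are hypothetical names introduced for illustration, and the made-up loss stands in for real model training.

```python
from itertools import product

def grid_search(base_params, grid, evaluate):
    """Try every combination in `grid` and return (best_score, best_params).

    `evaluate` is a hypothetical caller-supplied function that trains a model
    with the given params and returns a validation loss (lower is better).
    """
    names = list(grid)
    best_score, best_params = float("inf"), None
    for values in product(*(grid[name] for name in names)):
        candidate = {**base_params, **dict(zip(names, values))}
        score = evaluate(candidate)
        if score < best_score:
            best_score, best_params = score, candidate
    return best_score, best_params


# Stand-in for real training: a made-up loss that is smallest at
# learning_rate=0.05 and num_leaves=31, used only to exercise the loop.
def fake_evaluate(p):
    return abs(p["learning_rate"] - 0.05) + abs(p["num_leaves"] - 31) / 100

grid = {"learning_rate": [0.01, 0.05, 0.1], "num_leaves": [15, 31, 63]}
best_score, best_params = grid_search({"objective": "binary"}, grid, fake_evaluate)
print(best_score, best_params)
```

With LightGBM, `evaluate` could call `lgb.train` with the candidate params and read the validation loss out of `gbm.best_score`. Comparing how much the score moves along each grid axis is one way to judge which hyperparameter matters most.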

In [None]:
# specify your configurations as a dict
params = {
    'task': 'train',
    'boosting_type': 'gbdt',
    'objective': 'binary',
    'metric': { 'binary_logloss'},
    'num_leaves': 31, # default number of leaves (31) for each tree
    'learning_rate': 0.08,
    'feature_fraction': 0.7, # randomly select 70% of the features before training each tree
    'bagging_fraction': 0.3, # like feature_fraction, but randomly selects 30% of the rows
    'bagging_freq': 5, # perform bagging every 5 iterations
    'verbose': 0
}

print('Start training...')
# train
gbm = lgb.train(params,
                lgb_train,
                num_boost_round=4000,
                valid_sets=lgb_eval,
                early_stopping_rounds=500)

In [None]:
print(gbm.best_score)
print(gbm.best_iteration)

#### (Optional) Run the dataset with XGBoost

XGBoost is a similar framework; feel free to play around with it and report your findings.

In [None]:
from operator import itemgetter
from sklearn.model_selection import train_test_split
import xgboost as xgb
from sklearn.metrics import roc_auc_score

def run_default_test(train, test, features, target, random_state=0):
    eta = 0.1
    max_depth = 5
    subsample = 0.8
    colsample_bytree = 0.8
    print('XGBoost params. ETA: {}, MAX_DEPTH: {}, SUBSAMPLE: {}, COLSAMPLE_BY_TREE: {}'.format(
        eta, max_depth, subsample, colsample_bytree))
    params = {
        "objective": "binary:logistic",
        "booster": "gbtree",
        "eval_metric": "logloss",
        "eta": eta,
        "max_depth": max_depth,
        "subsample": subsample,
        "colsample_bytree": colsample_bytree,
        "silent": 1,
        "seed": random_state
    }
    num_boost_round = 260
    early_stopping_rounds = 20
    test_size = 0.2

    X_train, X_valid = train_test_split(
        train, test_size=test_size, random_state=random_state)
    y_train = X_train[target]
    y_valid = X_valid[target]
    dtrain = xgb.DMatrix(X_train[features], y_train)
    dvalid = xgb.DMatrix(X_valid[features], y_valid)
    watchlist = [(dtrain, 'train'), (dvalid, 'eval')]
    gbm = xgb.train(params, dtrain, num_boost_round, evals=watchlist,
                    early_stopping_rounds=early_stopping_rounds, verbose_eval=True)

In [None]:
# Note: the second argument (`test`) is not used inside run_default_test.
run_default_test(train, y_target, features, 'click')

#### Logistic Regression
Now it's time to do logistic regression as we've learned in lecture!
Here are the steps:
1. First you want to do a train test split. Notice that all you currently have is the train dataframe. You might want to split it into X_train and X_valid (for validation/testing).  
2. Since label data is already included in train dataframe, let's assign the 'click' column of X_train, X_valid to y_train, y_valid.   
3. Now since we've already done feature engineering, let's only use our selected features for X_train and X_valid.  
4. Now it's time to initialize a Logistic Regression model from sklearn and call the fit function!
5. Run the last function, model.score to see your model accuracy.

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn import metrics

# Train Test Split
X_train, X_valid =  # TODO

# Create Labels
y_train, y_valid =  # TODO

# Select important features
X_train, X_valid =  # TODO

# Initialize a logistic regression model
model =  # TODO
model.fit()  # TODO

# Getting predictions
y_pred = model.predict(X_valid)

print('Accuracy of logistic regression classifier on test set: {:.2f}'.format(
    model.score(X_valid, y_valid)))

In [None]:
# Confusion Matrix
from sklearn.metrics import confusion_matrix
# Store the result under a different name so the imported function is not overwritten
cm = confusion_matrix(y_valid, y_pred)
print(cm)

#### Compute precision & recall

The precision is the ratio tp / (tp + fp) where tp is the number of true positives and fp the number of false positives. The precision is intuitively the ability of the classifier not to label as positive a sample that is negative.

The recall is the ratio tp / (tp + fn) where tp is the number of true positives and fn the number of false negatives. The recall is intuitively the ability of the classifier to find all the positive samples.
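
To make the two ratios concrete, here is a small worked example computed directly from confusion-matrix counts. The counts are invented, `precision_recall` is a hypothetical helper, and returning 0.0 when a denominator is zero is an assumed convention.

```python
def precision_recall(tp, fp, fn):
    """Compute precision and recall from confusion-matrix counts."""
    # Assumed convention: return 0.0 when the denominator is zero.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall

# Invented counts: 30 true positives, 10 false positives, 90 false negatives.
precision, recall = precision_recall(tp=30, fp=10, fn=90)
print(precision, recall)  # 0.75 0.25
```

Here the classifier is right 75% of the time when it predicts a click, but it finds only 25% of the actual clicks.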

In [None]:
from sklearn.metrics import classification_report
print(classification_report(y_valid, y_pred))

#### ROC Curve

Let's plot a ROC curve to see how our model is actually performing!

In [None]:
from sklearn.metrics import roc_auc_score
from sklearn.metrics import roc_curve
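# Use the fitted `model` from above. Score with predicted probabilities so the
# reported area matches the plotted curve.
y_score = model.predict_proba(X_valid)[:, 1]
logit_roc_auc = roc_auc_score(y_valid, y_score)
fpr, tpr, thresholds = roc_curve(y_valid, y_score)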
plt.figure()
plt.plot(fpr, tpr, label='Logistic Regression (area = %0.2f)' % logit_roc_auc)
plt.plot([0, 1], [0, 1],'r--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver operating characteristic')
plt.legend(loc="lower right")
plt.savefig('Log_ROC')
plt.show()

### Congrats! Now you have what it takes to complete the data team project!