# Outline
1. [**Data Investigation**](#Data-Investigation)
    1. [Count the Missing Values](#Count-the-Missing-Values)
    1. [Datatypes wise Columns Counts](#Datatypes-wise-Columns-Counts)
    1. [Investigate the target variable distribution](#Investigate-the-target-variable-distribution)
    1. [Investigate Statistics of Data](#Investigate-Statistics-of-Data)
    1. [Correlation of dataset](#Correlation-of-dataset)
2. [**Remove Variable That Contain Same Class**](#Remove-Variable-That-Contain-Same-Class)
3. [**Categorical to Numeric Variable Conversion for Model Training(Label Encoding)**](#Categorical-to-Numeric-Variable-Conversion-for-Model-Training(Label-Encoding))
4. [**Bayesian Optimization for Best Parameter Tuning**](#Bayesian-Optimization-for-Best-Parameter-Tuning)
5. [**Parameter as per Understanding of Model**](#Parameter-as-per-Understanding-of-Model)
6. [**Model Training With KFold Cross Validation**](#Model-Training-With-KFold-Cross-Validation)
7. [**Best CV Score Return By Model**](#Best-CV-Score-Return-By-Model)
8. [**Features Importance**](#Features-Importance)
9. [**Final Submission**](#Final-Submission)

In [None]:

import json
import os
import time

import lightgbm as lgb
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
# pandas >= 1.0 exposes json_normalize at the top level;
# older releases need: from pandas.io.json import json_normalize
from pandas import json_normalize
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import KFold
from xgboost import XGBRegressor

sns.set(style="darkgrid")
plt.style.use("fivethirtyeight")




In [None]:
def load_df(csv_path='../input/train.csv', nrows=None):
    JSON_COLUMNS = ['device', 'geoNetwork', 'totals', 'trafficSource']
    
    df = pd.read_csv(csv_path, 
                     converters={column: json.loads for column in JSON_COLUMNS}, 
                     dtype={'fullVisitorId': 'str'}, # Important!!
                     nrows=nrows)
    
    for column in JSON_COLUMNS:
        column_as_df = json_normalize(df[column])
        column_as_df.columns = [f"{column}.{subcolumn}" for subcolumn in column_as_df.columns]
        df = df.drop(column, axis=1).merge(column_as_df, right_index=True, left_index=True)
    print(f"Loaded {os.path.basename(csv_path)}. Shape: {df.shape}")
    return df

print(os.listdir("../input"))
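
To make the flattening step in `load_df` easier to follow, here is a minimal sketch on a toy column. The JSON strings and the `device` prefix are made-up illustrations, not rows from the competition files, and the sketch assumes a pandas version that provides `pd.json_normalize` (1.0 or later).

```python
import json

import pandas as pd

# Hypothetical raw values of one JSON column, as they would appear in the CSV
raw = pd.Series([
    '{"browser": "Chrome", "isMobile": false}',
    '{"browser": "Safari", "isMobile": true}',
])

# Parse each string into a dict, then expand the dict keys into columns
parsed = raw.apply(json.loads)
flat = pd.json_normalize(parsed.tolist())

# Prefix the new columns with the name of the original column
flat.columns = [f"device.{c}" for c in flat.columns]
print(flat)
```

Each key of the JSON object becomes its own column, which is why the loaded frame has names such as `totals.transactionRevenue`.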

In [None]:
%%time
train_df = load_df(nrows=600000)
train_corr = train_df.copy()
test_df = load_df("../input/test.csv", nrows=1000000)
# train_df.to_csv("500000.csv", index=False)


# 1. Data Investigation

## 1.1 Count Missing Values
* Several columns have a very high number of missing values.
* Missing data are a **common occurrence and can have a significant effect** on the conclusions that can be drawn from the **data**.
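
Raw counts are hard to compare across columns, so a share of missing values per column can be a useful complement. The following is a small sketch on a toy frame; `toy_df` and its values are hypothetical stand-ins for `train_df`.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for train_df (hypothetical values)
toy_df = pd.DataFrame({
    "channel": ["Organic", None, "Direct", "Referral"],
    "revenue": [np.nan, np.nan, 12.5, np.nan],
    "visits": [1, 2, 3, 4],
})

# isnull() marks missing cells; the mean of a boolean column is the missing share
missing_share = toy_df.isnull().mean().sort_values(ascending=False)
print(missing_share)
```

The same expression applied to `train_df` would rank the real columns by how incomplete they are.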

In [None]:
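train_df.isnull().sum().plot(kind="bar", figsize=(20, 8))
# A vertical bar chart puts the column names on the x-axis and the counts on the y-axis
plt.xlabel("Column Name")
plt.ylabel("Missing Value Count")
plt.title(f"Missing Value Count By Column for {len(train_df)} Rows")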

## 1.2 Count the Datatypes Columnwise

* Most columns have the **object** datatype.
* The type of each attribute is important. Strings may need to be converted to floating point values or integers to represent categorical or ordinal values.
* We can get an idea of the attribute types by peeking at the raw data, or list the data types the DataFrame uses for each attribute with the `dtypes` property.
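
As a quick illustration of the `dtypes` property, the sketch below counts columns per datatype on a toy frame. The column names and values are hypothetical; the exact name reported for string columns depends on the pandas version.

```python
import pandas as pd

# Toy frame with two string columns, one boolean and one integer column
toy_df = pd.DataFrame({
    "browser": ["Chrome", "Safari"],
    "isMobile": [False, True],
    "visitNumber": [1, 2],
    "country": ["US", "IN"],
})

# dtypes is a Series indexed by column name; value_counts tallies each datatype
dtype_counts = toy_df.dtypes.astype(str).value_counts()
print(dtype_counts)
```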


In [None]:
plt.figure(figsize=(20,8))
ax = sns.countplot(x=train_df.dtypes.astype(str))  # one bar per datatype, height = number of columns
plt.xlabel("Data Types")
plt.ylabel("Counts")
plt.title("Column Count By Datatypes")

## 1.3 Investigate the Target Columns
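
The cell below log-transforms positive revenue and leaves zero revenue at 0. A vectorized sketch of the same idea on toy data follows; the `raw` values are hypothetical, and `pd.to_numeric` is used here only to keep the example independent of the pandas version.

```python
import numpy as np
import pandas as pd

# Hypothetical raw revenue values as they might come out of the JSON column
raw = pd.Series(["1000000", None, "25000000"])

# Missing revenue means no transaction, so treat it as 0
revenue = pd.to_numeric(raw).fillna(0.0)

# Replace non-positive values with 1 before the log, so that they map to log(1) = 0
log_revenue = np.log(revenue.where(revenue > 0, 1.0))
print(log_revenue)
```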

In [None]:
target = train_df['totals.transactionRevenue'].fillna(0).astype(float)

# Log-transform positive revenue; sessions without revenue stay at 0
target = target.apply(lambda x: np.log(x) if x > 0 else x)

del train_df['totals.transactionRevenue']
# plt.figure(figsize = (15,10))
sns.jointplot(x=target.values, y=target.index, kind="reg")
plt.show()

## 1.4 Investigate the Statistics of Data

Descriptive statistics can give you great insight into the shape of each attribute. Often you can create more summaries than you have time to review. The describe() function on the Pandas DataFrame lists 8 statistical properties of each attribute. They are:
* Count.
* Mean.
* Standard Deviation.
* Minimum Value.
* 25th Percentile.
* 50th Percentile (Median).
* 75th Percentile.
* Maximum Value.
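
A minimal sketch of `describe()` on a toy numeric column shows these eight properties as row labels. The `pageviews` values are made up for illustration.

```python
import pandas as pd

# Toy numeric column (hypothetical values)
toy_df = pd.DataFrame({"pageviews": [1, 2, 3, 4, 5]})

# describe() returns one row per statistic and one column per numeric attribute
summary = toy_df.describe()
print(summary)
```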


### Categorical Variable Columns Statistics

In [None]:
train_df.select_dtypes('object').describe()

### Numeric Statistics

In [None]:
train_df.select_dtypes(exclude=('object')).describe().boxplot(figsize=(20,8))
train_df.select_dtypes(exclude=('object')).describe()

## 1.5 Correlation of dataset

* Computed only for numeric variables.

In [None]:
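plt.figure(figsize=(15,15))
# Compute the Kendall correlation once, restricted to numeric and boolean columns
corr = train_corr.select_dtypes(include=["number", "bool"]).corr(method="kendall")
sns.heatmap(corr, annot=True)
corr.style.format("{:.2}").background_gradient(cmap=plt.get_cmap('coolwarm'), axis=1)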

# 2. Remove Variable That Contain Same Class
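
A column with a single distinct value carries no information for the model. The sketch below shows the filter used in the next cell on a toy frame; the column names and values are illustrative assumptions.

```python
import pandas as pd

# Toy frame: the middle column holds the same value in every row
toy_df = pd.DataFrame({
    "browser": ["Chrome", "Safari", "Chrome"],
    "socialEngagementType": ["Not Socially Engaged"] * 3,
    "visitNumber": [1, 2, 1],
})

# nunique() ignores missing values, so an all-NaN column counts as 0 classes
varying = [col for col in toy_df.columns if toy_df[col].nunique() > 1]
toy_df = toy_df[varying]
print(varying)
```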

In [None]:
columns = [col for col in train_df.columns if train_df[col].nunique() > 1]
train_df = train_df[columns]
test_df = test_df[columns]

# 3. Categorical to Numeric Variable Conversion for Model Training(Label Encoding)
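
Label encoding maps each distinct category to an integer code. The sketch below, on made-up browser names, shows why train and test are concatenated first: a category that occurs only in the test part still gets a code consistent with the training part.

```python
import pandas as pd

# Hypothetical categorical column split across train and test
train_col = pd.Series(["Chrome", "Safari", "Chrome"])
test_col = pd.Series(["Firefox", "Chrome"])

# Encode both parts together so that the codes are shared
merged = pd.concat([train_col, test_col], ignore_index=True)
codes, uniques = pd.factorize(merged)

# Split the codes back into their train and test parts
train_codes = codes[:len(train_col)]
test_codes = codes[len(train_col):]
print(train_codes, test_codes, list(uniques))
```

By default `pd.factorize` assigns the code -1 to missing values.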

In [None]:
trn_len = train_df.shape[0]
merged_df = pd.concat([train_df, test_df])

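for col in merged_df.columns:
    if col == 'fullVisitorId':
        continue  # keep the visitor id as a string for the submission
    if merged_df[col].dtype == object or merged_df[col].dtype == bool:
        # Integer codes shared by train and test; missing values become -1
        merged_df[col], indexer = pd.factorize(merged_df[col])

# Copy the slices so that later column assignments do not write into a view of merged_df
train_df = merged_df[:trn_len].copy()
test_df = merged_df[trn_len:].copy()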

In [None]:
train_df['totals.transactionRevenue']=target
train_df.head(5)

In [None]:
test_df.head()

In [None]:
#train_df["fullVisitorId"] = train_df.fullVisitorId.astype(float)
#test_df["fullVisitorId"] = test_df["fullVisitorId"].astype(float)

In [None]:
import xgboost as xgb
from lightgbm import LGBMRegressor
from bayes_opt import BayesianOptimization


In [None]:
train_df.dtypes

In [None]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(train_df.drop(['totals.transactionRevenue','fullVisitorId'], axis=1),
                                                    train_df['totals.transactionRevenue'], test_size=0.25)
#del(df)
dtrain = xgb.DMatrix(X_train, label=y_train)
#del(X_train)
dtest = xgb.DMatrix(X_test)
#del(X_test)

In [None]:
# def xgb_evaluate(max_depth, gamma,min_child_weight,max_delta_step,subsample,colsample_bytree):
#     params = {'eval_metric': 'rmse',
#               'max_depth': int(max_depth),
#               'subsample': subsample,
#               'eta': 0.1,
#               'gamma': gamma,
#               'colsample_bytree': colsample_bytree,   
#               'min_child_weight': min_child_weight ,
#               'max_delta_step':max_delta_step
#              }
#     # Used around 1000 boosting rounds in the full model
#     cv_result = xgb.cv(params, dtrain, num_boost_round=100, nfold=3)    
    
#     # Bayesian optimization only knows how to maximize, not minimize, so return the negative RMSE
#     return -1.0 * cv_result['test-rmse-mean'].iloc[-1]

In [None]:
# xgb_bo = BayesianOptimization(xgb_evaluate, {
#                                     'max_depth': (2, 12),
#                                      'gamma': (0.001, 10.0),
#                                      'min_child_weight': (0, 20),
#                                      'max_delta_step': (0, 10),
#                                      'subsample': (0.4, 1.0),
#                                      'colsample_bytree' :(0.4, 1.0)})
# # Use the expected improvement acquisition function to handle negative numbers
# # Optimally needs quite a few more initiation points and number of iterations
# xgb_bo.maximize(init_points=3, n_iter=5, acq='ei')

In [None]:
#params = xgb_bo.res['max']['max_params']
#print(params)
params = {'max_depth': 6.714941854933043, 'gamma': 1.3250360141843498, 'min_child_weight': 13.0958516960316, 'max_delta_step': 8.88492863796954, 'subsample': 0.9864199446951019, 'colsample_bytree': 0.8376539278239742}
#params = {'max_depth': 12.0, 'gamma': 0.001, 'min_child_weight': 8.740952582296343, 'max_delta_step': 10.0, 'subsample': 0.4, 'colsample_bytree': 1.0}
params['max_depth'] = int(params['max_depth'])

In [None]:
# Train a new model with the best parameters from the search
model2 = xgb.train(params, dtrain, num_boost_round=250)

# Predict on testing and training set
y_pred = model2.predict(dtest)
y_train_pred = model2.predict(dtrain)

# Report testing and training RMSE
print(np.sqrt(mean_squared_error(y_test, y_pred)))
print(np.sqrt(mean_squared_error(y_train, y_train_pred)))
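
The two printed numbers are root mean squared errors (RMSE) on the held-out and training splits. As a sketch of what that metric computes, here is a NumPy-only version; the helper name `rmse` is introduced for illustration and is not part of the notebook.

```python
import numpy as np


def rmse(y_true, y_pred):
    """Root mean squared error between two equally long sequences."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    # Square the errors, average them, then take the square root
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))


print(rmse([0.0, 0.0, 3.0], [0.0, 0.0, 0.0]))
```

A training RMSE far below the held-out RMSE would suggest that the model overfits.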

In [None]:
submission = test_df[['fullVisitorId']].copy()
#test = transform((test_df.drop('fullVisitorId', axis=1)))
dtest = xgb.DMatrix(test_df.drop('fullVisitorId', axis=1))
predictions = model2.predict(dtest)

# 9. Final Submission

In [None]:

submission.loc[:, 'PredictedLogRevenue'] = predictions
# Revenue cannot be negative: clip negative predictions to 0 and fill any gaps with 0
submission["PredictedLogRevenue"] = submission["PredictedLogRevenue"].clip(lower=0).fillna(0.0)
# One row per visitor: sum the session-level predictions
grouped_test = submission[['fullVisitorId', 'PredictedLogRevenue']].groupby('fullVisitorId').sum().reset_index()
grouped_test.to_csv('submit.csv', index=False)
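
The post-processing above can be traced on a toy table. The visitor ids and predictions below are made up; the ids are kept as strings, as in `load_df`, so that leading zeros survive.

```python
import pandas as pd

# Hypothetical session-level predictions, including a negative and a missing one
toy = pd.DataFrame({
    "fullVisitorId": ["001", "002", "001", "003"],
    "PredictedLogRevenue": [1.5, -0.2, 0.5, float("nan")],
})

# Clip negative predictions to 0, then replace missing predictions with 0
toy["PredictedLogRevenue"] = toy["PredictedLogRevenue"].clip(lower=0).fillna(0.0)

# Aggregate to one row per visitor by summing the session rows
per_visitor = toy.groupby("fullVisitorId", as_index=False)["PredictedLogRevenue"].sum()
print(per_visitor)
```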

In [None]:
grouped_test.head(5)