**Kernel Approach**

This kernel is a quick ML solution built on XGBoost, together with a feature importance analysis using [Permutation Importance](https://www.kaggle.com/dansbecker/permutation-importance?utm_medium=email&utm_source=mailchimp&utm_campaign=ml4insights). Further feature analysis was done in my previous kernel, [GA Challenge - DAta Analisys](https://www.kaggle.com/wesleyjr01/google-analytics-challenge-data-analisys).

**Reshaping the given dataset**

As mentioned in the challenge description, some fields in train.csv and test.csv are stored as JSON. To work with them conveniently as DataFrame columns, we first need to flatten them. For this task there is already a [pretty nice kernel](https://www.kaggle.com/julian3833/1-quick-start-read-csv-and-flatten-json-fields/notebook) built by [Julián Peller](https://www.kaggle.com/julian3833).

In [None]:
import os
import json
import numpy as np
import pandas as pd
from pandas import json_normalize  # pandas >= 1.0; older versions: from pandas.io.json import json_normalize

def load_df(csv_path='../input/train.csv', nrows=None):
    JSON_COLUMNS = ['device', 'geoNetwork', 'totals', 'trafficSource']
    
    df = pd.read_csv(csv_path, 
                     converters={column: json.loads for column in JSON_COLUMNS}, 
                     dtype={'fullVisitorId': 'str'}, # Important!!
                     nrows=nrows)
    
    for column in JSON_COLUMNS:
        column_as_df = json_normalize(df[column])
        column_as_df.columns = [f"{column}.{subcolumn}" for subcolumn in column_as_df.columns]
        df = df.drop(column, axis=1).merge(column_as_df, right_index=True, left_index=True)
    print(f"Loaded {os.path.basename(csv_path)}. Shape: {df.shape}")
    return df

df_train = load_df()
df_test = load_df("../input/test.csv")
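
To see what the flattening does without loading the full CSV, here is a minimal sketch on a made-up two-row frame. The sample values and the names `raw`, `device_df` and `flat` are only for illustration, and it assumes pandas >= 1.0 for `pd.json_normalize`.

```python
import json
import pandas as pd

# Hypothetical two-row sample standing in for train.csv; the values are made up.
raw = pd.DataFrame({
    "fullVisitorId": ["001", "002"],
    "device": ['{"browser": "Chrome", "isMobile": false}',
               '{"browser": "Safari", "isMobile": true}'],
})

# Parse the JSON strings into dicts, then expand each key into its own column.
parsed = raw["device"].apply(json.loads)
device_df = pd.json_normalize(parsed.tolist())
device_df.columns = [f"device.{c}" for c in device_df.columns]

# Both frames share the default 0..n-1 index, so an index join lines the rows up.
flat = raw.drop(columns="device").join(device_df)
print(flat)
```

Each JSON key becomes a `device.<key>` column, which is the same naming scheme `load_df` produces.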

In [None]:
# Let's have a look at the data
df_train.head(3)

**Dataset Analysis**

In [None]:
df_train.columns
print('Is there more than one row per fullVisitorId in the train dataset?',
      len(df_train['fullVisitorId'])!=df_train['fullVisitorId'].nunique())
print('Is there more than one row per fullVisitorId in the test dataset?',
      len(df_test['fullVisitorId'])!=df_test['fullVisitorId'].nunique())
# True means there are more rows than unique visitor IDs, i.e. some visitors appear in several rows.

**Dropping columns with constant values**

Some of the features produced by the JSON flattening step have only one unique value. A constant column carries no information for the model, so we simply drop them.

In [None]:
dropcols = [c for c in df_train.columns if df_train[c].nunique()==1]
dropcols_test = [c for c in df_test.columns if df_test[c].nunique()==1]
df_train = df_train.drop(dropcols,axis=1)
df_test = df_test.drop(dropcols_test,axis=1)
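
One detail worth knowing: `nunique()` ignores NaN by default, so a column holding a single value plus missing entries also counts as "constant". A small sketch on a made-up frame (the column names are hypothetical) shows the difference; whether such columns should be kept is a judgment call.

```python
import numpy as np
import pandas as pd

# Made-up frame with three kinds of columns.
toy = pd.DataFrame({
    "always_same": ["x", "x", "x"],
    "value_or_nan": ["y", np.nan, "y"],
    "varies": [1, 2, 3],
})

# Default nunique() skips NaN, so 'value_or_nan' looks constant too.
const_default = [c for c in toy.columns if toy[c].nunique() == 1]

# Counting NaN as its own value keeps columns whose missingness may carry signal.
const_strict = [c for c in toy.columns if toy[c].nunique(dropna=False) == 1]

print(const_default)
print(const_strict)
```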

## Building Date Features

In [None]:
# Date information: split the date into year/month/day so each part can be analysed separately.
df_train['year']= df_train['date'].astype(str).str[:4]
df_test['year']= df_test['date'].astype(str).str[:4]
df_train['month']= df_train['date'].astype(str).str[4:6]
df_test['month']= df_test['date'].astype(str).str[4:6]
df_train['day']= df_train['date'].astype(str).str[6:8]
df_test['day']= df_test['date'].astype(str).str[6:8]
df_train.drop('date',axis=1,inplace=True)
df_test.drop('date',axis=1,inplace=True)

In [None]:
# As found in the previous kernel's analysis, the year adds very little information, so let's drop it.
df_train.drop('year',axis=1,inplace=True)
df_test.drop('year',axis=1,inplace=True)
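
As an alternative to string slicing, the same parts can be pulled out with `pd.to_datetime`, which also gives access to fields such as the day of the week. This is only a sketch on two made-up dates; the `weekday` column is not used anywhere in this kernel.

```python
import pandas as pd

# Two made-up dates in the same YYYYMMDD integer layout as the 'date' column.
toy = pd.DataFrame({"date": [20160902, 20170115]})

parsed = pd.to_datetime(toy["date"].astype(str), format="%Y%m%d")
toy["month"] = parsed.dt.month
toy["day"] = parsed.dt.day
toy["weekday"] = parsed.dt.dayofweek  # Monday=0 ... Sunday=6

print(toy)
```

The resulting columns are already numeric, so no later cast to float is needed.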

## Verify if features in both train and test datasets are equal

In [None]:
# Compare feature names between the train and test datasets (ignoring the target)
train_features = set(df_train.columns) - {'totals.transactionRevenue'}
print('Any different features between Train and Test datasets?',
      train_features != set(df_test.columns))
# If this prints False, the two datasets differ only by the target column.
# The 'sessionId' and 'visitId' features are not going to be useful for us, so we can drop them.
df_train.drop(['sessionId','visitId'],axis=1,inplace=True)
df_test.drop(['sessionId','visitId'],axis=1,inplace=True)
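
A set difference also tells us which columns differ, not just whether any do. A minimal sketch with made-up column names (`a`, `b`, `c` and `target` are placeholders):

```python
import pandas as pd

# Placeholder column indexes; 'target' plays the role of the revenue column.
train_cols = pd.Index(["a", "b", "target"])
test_cols = pd.Index(["a", "c"])

only_in_train = sorted(set(train_cols) - set(test_cols) - {"target"})
only_in_test = sorted(set(test_cols) - set(train_cols))

print("Only in train:", only_in_train)
print("Only in test:", only_in_test)
```

Two empty lists mean the datasets share exactly the same features.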

## Store the IDs from both datasets and the target from the train dataset

In [None]:
df_train["totals.transactionRevenue"] = df_train["totals.transactionRevenue"].fillna(0)  # Impute 0 for missing target values
y_train = df_train["totals.transactionRevenue"]
#df_train.drop("totals.transactionRevenue",axis=1,inplace=True)
train_id = df_train["fullVisitorId"]
pred_id = df_test["fullVisitorId"]

## Missing data per feature type

In [None]:
df_merge = pd.concat([df_train.drop("totals.transactionRevenue",axis=1),df_test],axis=0)

df_merge['day'] = df_merge['day'].astype(float)
df_merge['month'] = df_merge['month'].astype(float)


# Fill missing values first: astype(str) would otherwise turn NaN into the string 'nan'
df_merge['trafficSource.adwordsClickInfo.page'] = df_merge['trafficSource.adwordsClickInfo.page'].fillna('None').astype(str)

df_merge['totals.pageviews'] = df_merge['totals.pageviews'].fillna('None').astype(str)

mergeId = df_merge['fullVisitorId']
df_merge.drop('fullVisitorId',axis=1,inplace=True)

# Qualitative (categorical) features: object or bool dtype
qualitative_features = df_merge.select_dtypes(include=['object', 'bool']).columns.tolist()
# Quantitative (numeric) features: everything else
quantitative_features = df_merge.select_dtypes(exclude=['object', 'bool']).columns.tolist()


def missingData(df,features):
    total = df[features].isnull().sum().sort_values(ascending=False)
    percent = (df[features].isnull().sum()/df[features].isnull().count()).sort_values(ascending=False)
    missing_data = pd.concat([total, percent], axis=1, keys=['Total', 'Percent'])
    return missing_data

In [None]:
#Missing data qualitative feature
missing_data_quali = missingData(df_merge,qualitative_features)
missing_data_quali.head(20)

In [None]:
#Missing data quantitative feature
missing_data_quanti = missingData(df_merge,quantitative_features)
missing_data_quanti.head(20)

## Imputation with 'None' and LabelEncoder

In [None]:
for i in qualitative_features:
    if df_merge[i].isnull().any():
        df_merge[i] = df_merge[i].fillna('None')
print('\nIs there any NaN value in the dataset after imputing?:',df_merge.isnull().sum().any())

# Encode each qualitative feature as integers, keeping one LabelEncoder per column in `d`
from collections import defaultdict
from sklearn.preprocessing import LabelEncoder
d = defaultdict(LabelEncoder)
# Cast object columns to str so columns mixing booleans and 'None' have a uniform type
fit = df_merge[qualitative_features].apply(
    lambda x: d[x.name].fit_transform(x.astype(str) if x.dtype == object else x))
df_merge[qualitative_features] = fit

# Restore the df_train and df_test dataframes
n_train = len(df_train)
df_train = df_merge[:n_train].copy()
df_train["totals.transactionRevenue"] = y_train.tolist()
df_test = df_merge[n_train:].copy()
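
The reason for encoding train and test together is that a label seen only in test would otherwise be unknown to the encoder. Here is a minimal sketch on made-up browser names (`toy_train`, `toy_test` and `encoders` are illustrative names), which also shows how a stored encoder maps the codes back to labels.

```python
from collections import defaultdict

import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Made-up data: 'Firefox' appears only in the test part.
toy_train = pd.DataFrame({"browser": ["Chrome", "Safari", "None"]})
toy_test = pd.DataFrame({"browser": ["Firefox", "Chrome"]})

# Fit on train and test together so labels seen only in test still get a code.
merged = pd.concat([toy_train, toy_test], axis=0, ignore_index=True)
encoders = defaultdict(LabelEncoder)
encoded = merged.apply(lambda col: encoders[col.name].fit_transform(col.astype(str)))

# The stored encoder maps integer codes back to the original labels.
decoded = encoders["browser"].inverse_transform(encoded["browser"])

print(encoded["browser"].tolist())
print(list(decoded))
```

Keep in mind that the integer codes follow the sorted label order, so they carry no real ordering information; tree models such as XGBoost can usually cope with that.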


## Split into X_train, X_val, y_train, y_val

In [None]:
from sklearn.utils import shuffle
from sklearn.model_selection import train_test_split
size_test = 0.3
df_train = shuffle(df_train) #shuffle data before division
train_target = np.log1p(df_train["totals.transactionRevenue"].astype(float).tolist()) # Model log(1 + revenue) instead of raw revenue
predictors = df_train.drop("totals.transactionRevenue", axis=1)
X_train, X_val, y_train, y_val = train_test_split(predictors, 
                                                    train_target,
                                                    train_size=1-size_test, 
                                                    test_size=size_test, 
                                                    random_state=0)
X_pred = df_test

In [None]:
train_target

## XGBoost Model

If you want to start with XGBoost, [this kernel](https://www.kaggle.com/dansbecker/xgboost) written by [DanB](https://www.kaggle.com/dansbecker) might help you.

In [None]:
from xgboost import XGBRegressor
xgb_model = XGBRegressor(n_estimators=500, learning_rate=0.01)

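# Stop adding trees once the validation score has not improved for 5 rounds.
# Note: newer XGBoost releases expect early_stopping_rounds in the XGBRegressor constructor instead of fit().
xgb_model.fit(X_train.values, y_train, early_stopping_rounds=5,
             eval_set=[(X_val.values, y_val)], verbose=False)


# Predict on the test set
pred_test = xgb_model.predict(X_pred.values)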
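
Before looking at feature importance, it helps to have a single number for how well the model fits the validation split. One simple option is the RMSE on the log1p target, compared against an all-zeros baseline (the target is 0 wherever revenue was missing). The helper `rmse` and the toy arrays below are made up for illustration.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error between two equally sized arrays."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

# Made-up raw revenues and predictions, only to exercise the function.
y_val_toy = np.log1p([0.0, 0.0, 1_000_000.0])
pred_toy = np.array([0.0, 0.5, 13.0])

baseline = rmse(y_val_toy, np.zeros_like(y_val_toy))  # always predict zero revenue
score = rmse(y_val_toy, pred_toy)

print(f"baseline RMSE: {baseline:.3f}, model RMSE: {score:.3f}")
```

With the fitted model, the same helper could be called as `rmse(y_val, xgb_model.predict(X_val.values))`.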

## Permutation Importance

There is a [great kernel](https://www.kaggle.com/dansbecker/permutation-importance?utm_medium=email&utm_source=mailchimp&utm_campaign=ml4insights) written by [DanB](https://www.kaggle.com/dansbecker) that helps us get insights about feature importance in our model, and we will use it here.

In [None]:
import eli5
from eli5.sklearn import PermutationImportance

perm = PermutationImportance(xgb_model, random_state=1).fit(X_val, y_val)
eli5.show_weights(perm, feature_names = X_val.columns.tolist())

# In this run, totals.pageviews looks like the most important feature, followed by totals.hits
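
To make the idea behind permutation importance concrete, here is a minimal NumPy sketch of it: shuffle one column at a time and measure how much the error grows. It is not how eli5 is implemented. The data is synthetic, the model is plain least squares instead of XGBoost, and for brevity the same data is used for fitting and scoring (in practice a held-out split is used, as above).

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: column 0 matters a lot, column 1 a little, column 2 not at all.
X = rng.normal(size=(200, 3))
y = 3.0 * X[:, 0] + 0.1 * X[:, 1]

# Stand-in model: ordinary least squares (the kernel itself uses XGBoost).
coef, *_ = np.linalg.lstsq(X, y, rcond=None)

def mse(features):
    """Mean squared error of the fitted linear model on the given features."""
    return float(np.mean((features @ coef - y) ** 2))

baseline = mse(X)
importances = []
for j in range(X.shape[1]):
    X_perm = X.copy()
    # Shuffling column j breaks its link to y while keeping its distribution.
    X_perm[:, j] = rng.permutation(X_perm[:, j])
    importances.append(mse(X_perm) - baseline)

print(importances)
```

A large increase in error after shuffling means the model relied heavily on that feature; an increase near zero means the feature barely mattered.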

## Build the solution DataFrame

In [None]:
sub_df = pd.DataFrame({"fullVisitorId":pred_id})
# Clip negative predictions to zero
pred_test[pred_test<0] = 0
# Convert back to raw revenue so sessions can be summed per visitor
sub_df["PredictedLogRevenue"] = np.expm1(pred_test)
#sub_df["PredictedLogRevenue"] = pred_test
sub_df = sub_df.groupby("fullVisitorId")["PredictedLogRevenue"].sum().reset_index()
sub_df.columns = ["fullVisitorId", "PredictedLogRevenue"]
# Return to the log scale for the submission
sub_df["PredictedLogRevenue"] = np.log1p(sub_df["PredictedLogRevenue"])
sub_df.to_csv("xgb_predictions.csv", index=False)
sub_df.head()
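
The round trip through `expm1` and `log1p` matters because the model predicts on the log scale per row, while the submission needs one value per visitor: log-scale values cannot simply be added. A minimal sketch on made-up predictions for two hypothetical visitors:

```python
import numpy as np
import pandas as pd

# Made-up per-session predictions on the log1p scale; visitor A has two sessions.
toy = pd.DataFrame({
    "fullVisitorId": ["A", "A", "B"],
    "pred_log": [np.log1p(100.0), np.log1p(50.0), -0.2],
})

toy["pred_log"] = toy["pred_log"].clip(lower=0)   # no negative revenue
toy["pred_revenue"] = np.expm1(toy["pred_log"])   # back to raw revenue

# Sum raw revenue per visitor, then return to the log scale.
per_visitor = toy.groupby("fullVisitorId")["pred_revenue"].sum().reset_index()
per_visitor["PredictedLogRevenue"] = np.log1p(per_visitor["pred_revenue"])

print(per_visitor)
```

For visitor A this yields log1p(150), whereas adding the two log values directly would give a much larger, incorrect number.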