**...On Training and Public Leaderboard**

...kind of a low-hanging fruit, but since I didn't find it anywhere else here we go.

Interesting that the all-zeroes benchmark gives such different results on training and public leaderboard data. I'd like to compare the rate of positive-revenue cases, and the conditional mean revenue of those cases, between training and public leaderboard data. On the public leaderboard data these have to be computed indirectly through the RMSE, like so:

- The target is nearly dichotomous: it is either 0 or close to the conditional mean revenue of clients with positive revenue.
- Say we predict a fixed value x as revenue for all clients; then the RMSE is $y=\sqrt{\frac{m}{n}a^2-2\frac{m}{n}ax+x^2}$, where a is the conditional mean revenue, m is the number of clients with positive revenue, and n is the total number of clients.
- To compute m and a we need two pairs of values (x, y). For the public leaderboard, one pair is already given: (0, 1.7804) from the all-zeroes benchmark.
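
For reference, a short derivation of the two formulas used in the cells below. It assumes the dichotomous approximation holds exactly (every positive target equals a) and writes y_0 and y_1 for the RMSE at x=0 and x=1:

```latex
\begin{aligned}
y(x)^2 &= \frac{m}{n}(a-x)^2 + \Bigl(1-\frac{m}{n}\Bigr)x^2
        = \frac{m}{n}a^2 - 2\frac{m}{n}ax + x^2,\\
y_0^2 &= \frac{m}{n}a^2, \qquad
y_1^2 = \frac{m}{n}a^2 - 2\frac{m}{n}a + 1,\\
y_0^2 - y_1^2 + 1 &= 2\frac{m}{n}a
\;\Longrightarrow\;
a = \frac{2y_0^2}{y_0^2 - y_1^2 + 1}, \qquad
\frac{m}{n} = \frac{y_0^2}{a^2}.
\end{aligned}
```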

I'll check validity of this approach on the training data, and then compute rate and mean implicitly on the public leaderboard data.

In [None]:
# thx for import: kernel by Julián Peller

import os
import json
import numpy as np
import pandas as pd
try:
    from pandas import json_normalize  # pandas >= 1.0
except ImportError:
    from pandas.io.json import json_normalize  # older pandas

def load_df(csv_path='../input/train.csv', nrows=None):
    JSON_COLUMNS = ['device', 'geoNetwork', 'totals', 'trafficSource']
    
    df = pd.read_csv(csv_path, 
                     converters={column: json.loads for column in JSON_COLUMNS}, 
                     dtype={'fullVisitorId': 'str'}, # Important!!
                     nrows=nrows)
    
    for column in JSON_COLUMNS:
        column_as_df = json_normalize(df[column])
        column_as_df.columns = [f"{column}.{subcolumn}" for subcolumn in column_as_df.columns]
        df = df.drop(column, axis=1).merge(column_as_df, right_index=True, left_index=True)
    print(f"Loaded {os.path.basename(csv_path)}. Shape: {df.shape}")
    return df

df_train = load_df()
df_test = load_df("../input/test.csv")

In [None]:
# compute rate and conditional mean revenue explicitly on training data:
df_train['totals.transactionRevenue']=df_train['totals.transactionRevenue'].astype(float)
train=pd.DataFrame()
train['target']=np.log1p(df_train.groupby('fullVisitorId')['totals.transactionRevenue'].sum())
print('Explicit')
print('Cases: '+str((train.target>0).mean())+', Conditional Mean: '+str(train[train.target>0].target.mean()))

In [None]:
# compute rate and conditional mean through rmse:
n=len(train)
y0=np.round(np.sqrt(((train.target-0)**2).mean()),4)
y1=np.round(np.sqrt(((train.target-1)**2).mean()),4)
a=2*y0**2/(y0**2-y1**2+1)
m=n*y0**2/a**2
print('Implicit')
print('Cases: '+str(m/n)+', Conditional Mean: '+str(a))

Appears to be fit for purpose... let's turn to the public leaderboard data.
We already have the all-zeroes benchmark. Let's check the rmse of another constant and compute implicit figures.

In [None]:
# get leaderboard rmse for x=1
# submission=pd.read_csv('../input/sample_submission.csv',index_col=[0])
# submission['PredictedLogRevenue']=1
# submission.to_csv('submission1.csv')
# this actually returns a leaderboard rmse of 1.9529
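
The second probe does not have to be x=1. Below is a minimal sketch of the same calculation for any two distinct constants, under the same dichotomous assumption. The helper name `implied_rate_and_mean` is hypothetical and not part of the original kernel.

```python
def implied_rate_and_mean(x0, y0, x1, y1):
    """Recover (rate, conditional mean) from two constant-prediction RMSEs.

    Hypothetical helper. (x0, y0) and (x1, y1) are pairs of
    (constant prediction, resulting RMSE). Assumes the target is 0 or
    (approximately) equal to the conditional mean a.
    """
    if x0 == x1:
        raise ValueError("the two constant predictions must differ")
    # For any target t: y^2 = mean(t^2) - 2*mean(t)*x + x^2
    mu = (y0**2 - y1**2 + x1**2 - x0**2) / (2.0 * (x1 - x0))  # mean(t)
    s = y0**2 + 2.0 * mu * x0 - x0**2                          # mean(t^2)
    a = s / mu        # conditional mean under the dichotomous model
    rate = mu**2 / s  # m/n under the dichotomous model
    return rate, a

# The two leaderboard pairs quoted in this notebook
rate, a = implied_rate_and_mean(0.0, 1.7804, 1.0, 1.9529)
print('Cases: ' + str(rate) + ', Conditional Mean: ' + str(a))
```

With x0=0 and x1=1 this reduces to the formulas used in the cells here, so it should print the same figures as the next cell.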

In [None]:
# rough size of the public split (30% assumed); n cancels in the printed m/n
n=0.3*len(df_test)
y0=1.7804
y1=1.9529
a=2*y0**2/(y0**2-y1**2+1)
m=n*y0**2/a**2
print('Implicit on Public Leaderboard')
print('Cases: '+str(m/n)+', Conditional Mean: '+str(a))

Ok, maybe not that surprising. The reason for the difference in the performance of the all-zeroes benchmark is the diminished rate of positive-revenue cases. The product of rate and conditional mean, i.e. the mean target, should be close to the best-performing constant prediction. Maybe one could compare a model's mean prediction against it and shift accordingly to improve the model... that depends on the type of public/private split of the test data.
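
To make the last point concrete: rate times conditional mean is just the mean target, and that follows from the two probes without the dichotomous assumption. Here is a rough sketch using the two leaderboard RMSEs quoted above. The helper names are hypothetical, and the shift is only an illustration of the idea, not something validated here.

```python
import numpy as np

def mean_from_probes(y0, y1):
    """Mean target implied by the RMSEs of the constants 0 (y0) and 1 (y1)."""
    # y0^2 = mean(t^2) and y1^2 = mean(t^2) - 2*mean(t) + 1
    return (y0**2 - y1**2 + 1.0) / 2.0

def best_constant_rmse(y0, y1):
    """RMSE of predicting the implied mean everywhere: sqrt(E[t^2] - E[t]^2)."""
    mu = mean_from_probes(y0, y1)
    return float(np.sqrt(y0**2 - mu**2))

def shift_to_mean(preds, target_mean):
    """Add one constant to all predictions so that their mean equals target_mean."""
    preds = np.asarray(preds, dtype=float)
    return preds + (target_mean - preds.mean())

mu = mean_from_probes(1.7804, 1.9529)
print('Implied mean target:', mu)
print('RMSE of best constant:', best_constant_rmse(1.7804, 1.9529))
```

Among all additive shifts, matching the mean is the one that minimizes MSE on the data the mean was estimated from. Whether that helps on the private leaderboard depends on how similar the two splits are.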