5	VALIDATION + metric LGB baseline, new parameters	0.978867	0.481537	0.63109	changed hyperparameters from script 67 kernels

In [None]:
from kaggle.competitions import twosigmanews
# You can only call make_env() once, so don't lose it!
env = twosigmanews.make_env()

In [None]:
(market_train_df, news_train_df) = env.get_training_data()

<h1>Basic data cleaning

In [None]:
import numpy as np
import pandas as pd

In [None]:
def clean_data(market_train_df):
    """Clean open/close outliers in place.

    Rows where close/open is >= 2 or <= 0.5 are treated as data errors:
    whichever of open/close lies further from the asset's mean is replaced by that mean.

    Args:
        market_train_df: pandas.DataFrame, modified in place
    """
    market_train_df['close_to_open'] = np.abs(market_train_df['close'] / market_train_df['open'])
    market_train_df['assetName_mean_open'] = market_train_df.groupby('assetName')['open'].transform('mean')
    market_train_df['assetName_mean_close'] = market_train_df.groupby('assetName')['close'].transform('mean')

    ratio = market_train_df['close_to_open']
    outlier = (ratio >= 2) | (ratio <= 0.5)
    open_dev = np.abs(market_train_df['assetName_mean_open'] - market_train_df['open'])
    close_dev = np.abs(market_train_df['assetName_mean_close'] - market_train_df['close'])

    # if open price is too far from mean open price for this company, replace it. Otherwise replace close price.
    # Boolean masks and column names keep this correct for any index (filtered or concatenated frames).
    fix_open = (outlier & (open_dev > close_dev)).values
    fix_close = (outlier & ~(open_dev > close_dev)).values
    market_train_df['open'] = np.where(fix_open, market_train_df['assetName_mean_open'], market_train_df['open'])
    market_train_df['close'] = np.where(fix_close, market_train_df['assetName_mean_close'], market_train_df['close'])

In [None]:
clean_data(market_train_df)

In [None]:
from datetime import datetime, date
market_train_df['time'] = market_train_df['time'].dt.date
market_train_df = market_train_df.loc[market_train_df['time']>=date(2010, 1, 1)]

In [None]:
from matplotlib import pyplot as plt
import numpy as np
bottom, top = market_train_df.returnsOpenNextMktres10.quantile(0.001), market_train_df.returnsOpenNextMktres10.quantile(0.999)
returns, binwidth = market_train_df.returnsOpenNextMktres10.clip(bottom, top), 0.005
market_train_df.returnsOpenNextMktres10 = market_train_df.returnsOpenNextMktres10.clip(bottom, top)
plt.figure(figsize=(15,10))
plt.hist(returns,  bins=np.arange(min(returns), max(returns) + binwidth, binwidth))

<h1>FEATURE ENGINEERING

In [None]:
from multiprocessing import Pool
return_features = ['returnsClosePrevMktres10','returnsClosePrevRaw10','open','close']

def create_lag(df_code,n_lag=(3,7,14),shift_size=1):
    """Add rolling mean/max/min of past values for the rows of a single asset."""
    df_code = df_code.copy()  # the input is a slice of the grouped frame
    for col in return_features:
        for window in n_lag:
            # shift first so that each window only covers rows before the current one
            rolled = df_code[col].shift(shift_size).rolling(window=window)
            df_code['%s_lag_%s_mean'%(col,window)] = rolled.mean()
            df_code['%s_lag_%s_max'%(col,window)] = rolled.max()
            df_code['%s_lag_%s_min'%(col,window)] = rolled.min()
#             df_code['%s_lag_%s_std'%(col,window)] = rolled.std()
    # rows without a full window of history get -1
    return df_code.fillna(-1)

def generate_lag_features(df,n_lag = [3,7,14]):
    # one frame per asset, keeping only the merge keys and the columns to be lagged
    df_codes = [df_code[['time','assetCode']+return_features] for _, df_code in df.groupby('assetCode')]
    print('total %s df'%len(df_codes))

    # compute the lag features of each asset in parallel, passing n_lag through to create_lag
    with Pool(4) as pool:
        all_df = pool.starmap(create_lag, [(df_code, n_lag) for df_code in df_codes])

    new_df = pd.concat(all_df)

    # drop the raw columns so that only the keys and the lag features are merged back
    new_df.drop(return_features,axis=1,inplace=True)

    return new_df

In [None]:
n_lag = [3,7,14] #leave it global
def feature_engineering(market_train_df):
    """feature engineering procedure
    Args:
        market_train_df: Pandas.DataFrame
    Return:
        market_train_df
        
    Usage:
        >>> market_train_df = feature_engineering(market_train_df)
    """
    new_df = generate_lag_features(market_train_df,n_lag)
    market_train_df = pd.merge(market_train_df,new_df,how='left',on=['time','assetCode'])
    return market_train_df

In [None]:
market_train_df = feature_engineering(market_train_df)

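We just created rolling means, minima (support levels) and maxima (resistance levels) over 3-, 7- and 14-row windows for `returnsClosePrevMktres10`, `returnsClosePrevRaw10`, `open` and `close`. Each window is shifted by one row, so it only covers values before the current day. Let's visualize them.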
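
As a minimal sketch of how these features behave, the block below applies the same `shift(1).rolling(window)` pattern as `create_lag` to a toy price series (the values are made up, not competition data). Rows that do not yet have a full window of history come out as -1 after `fillna`.

```python
import pandas as pd

# toy prices for a single asset (made-up values)
prices = pd.Series([10.0, 11.0, 9.0, 12.0, 13.0])

window = 3
# shift by one row first, so each window only sees earlier rows
rolled = prices.shift(1).rolling(window=window)

lag_features = pd.DataFrame({
    'open': prices,
    'open_lag_3_mean': rolled.mean(),
    'open_lag_3_max': rolled.max(),
    'open_lag_3_min': rolled.min(),
}).fillna(-1)  # same placeholder as create_lag

print(lag_features)
```

The first three rows are all -1 because the shifted window is incomplete there; the row at position 3 summarizes the prices 10, 11 and 9 and does not include its own price of 12.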

In [None]:
market_train_df.head()

In [None]:
plt.figure(figsize=(15,10))
plt.title("price vs lag_14_min")
plt.plot(market_train_df[market_train_df['assetCode'] == 'AAPL.O'][-300:]['open'])
plt.plot(market_train_df[market_train_df['assetCode'] == 'AAPL.O'][-300:]['open_lag_14_min'])

In [None]:
plt.figure(figsize=(15,10))
plt.title("price vs lag_14_mean")
plt.plot(market_train_df[market_train_df['assetCode'] == 'AAPL.O'][-300:]['open'])
plt.plot(market_train_df[market_train_df['assetCode'] == 'AAPL.O'][-300:]['open_lag_14_mean'])

**IDEAS:**
* binary feature: did the price cross its rolling mean?
* go beyond the single minimum: try to capture several support/resistance levels (e.g. the top 3 minima), especially over long windows
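
The first idea could be sketched as follows. This is only an illustration under assumptions: `crossed_mean` is a hypothetical helper that is not part of the notebook, it expects the rows of a single asset in time order, and it does not treat the -1 placeholder rows specially.

```python
import pandas as pd

def crossed_mean(price, lag_mean):
    """Return 1 where price moved to the other side of lag_mean since the previous row."""
    above = (price > lag_mean).astype(int)
    # a non-zero difference means the above/below state changed
    return above.diff().fillna(0).ne(0).astype(int)

# toy data (made-up values): the price rises through a flat mean, then falls back
price = pd.Series([1.0, 2.0, 3.0, 2.0, 1.0])
lag_mean = pd.Series([2.0] * 5)

crossed = crossed_mean(price, lag_mean)
print(crossed.tolist())
```

On the real data this would be applied per `assetCode`, for example to `open` and `open_lag_14_mean`.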

<h1>Train-Validation Split

In [None]:
# features: every column except the identifiers, the time and the target
X = market_train_df.drop(columns=['assetCode', 'assetName', 'time', 'returnsOpenNextMktres10'])
Y = market_train_df['returnsOpenNextMktres10']

In [None]:
split = int(len(X) * 0.8)
# number of rows skipped between the end of train and the start of validation
# (the rows are assumed to be in time order)
test_train_distance = 20000
X_train, X_val = X[:split - test_train_distance], X[split:]
Y_train, Y_val = Y[:split - test_train_distance], Y[split:]

In [None]:
print(len(X_val), len(Y_val))
universe_filter = market_train_df['universe'][split:] == 1.0
X_val = X_val[universe_filter]
Y_val = Y_val[universe_filter]
print(len(X_val), len(Y_val))

In [None]:
# time_val is used later to compute sigma_score; it gets the same split and universe filter as X_val
time_val = market_train_df['time'][split:][universe_filter]
assert len(time_val) == len(X_val)
time_train = market_train_df['time'][:split - test_train_distance]
assert len(time_train) == len(X_train)
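
The split can be illustrated on plain row positions. This is a sketch with a hypothetical helper, `split_with_gap`, which is not part of the notebook. One plausible reason for the gap (the notebook does not state it) is that the target is a 10-day forward return, so the last training rows would otherwise overlap in time with the first validation rows.

```python
def split_with_gap(n_rows, train_frac=0.8, gap=20000):
    """Return (train_slice, val_slice) for time-ordered rows, skipping `gap` rows in between."""
    split = int(n_rows * train_frac)
    train_end = max(split - gap, 0)  # never go below zero on small inputs
    return slice(0, train_end), slice(split, n_rows)

# toy example: 100 rows, 80% split, 5 rows left out before the validation part
rows = list(range(100))
train_idx, val_idx = split_with_gap(len(rows), train_frac=0.8, gap=5)

print(rows[train_idx][-1], rows[val_idx][0])  # prints 74 80
```

Rows 75 to 79 are used for neither training nor validation.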

<h1>Metric Definition

In [None]:
def sigma_score(preds, valid_data):
    """Custom LightGBM eval: mean / std of the daily sums of prediction * target."""
    df_time = valid_data.params['extra_time'] # will be injected afterwards
    labels = valid_data.get_label()

#    assert len(labels) == len(df_time)

    # The 'universe' term is left out because the validation rows are already filtered to universe == 1.
    # Wrapping in pd.Series makes `groupby` available whether get_label() returns a Series or an array.
    x_t = pd.Series(preds * np.asarray(labels))

    x_t_sum = x_t.groupby(df_time).sum()
    score = x_t_sum.mean() / x_t_sum.std()

    # (eval_name, eval_result, is_higher_better)
    return 'sigma_score', score, True
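
To make the metric concrete, here is a small standalone sketch of the same computation outside LightGBM. `sigma_score_plain` is a hypothetical name and the numbers are made up: each row contributes prediction times target, the contributions are summed per day, and the score is the mean of the daily sums divided by their sample standard deviation.

```python
import numpy as np
import pandas as pd

def sigma_score_plain(preds, labels, days):
    """Mean of the daily sums of preds * labels, divided by their standard deviation."""
    x_t = pd.Series(np.asarray(preds) * np.asarray(labels))
    daily = x_t.groupby(np.asarray(days)).sum()
    return daily.mean() / daily.std()

# toy data (made-up values): two days with two assets each
preds = [1.0, -1.0, 1.0, 1.0]
labels = [0.1, -0.2, 0.3, -0.1]
days = [0, 0, 1, 1]

score = sigma_score_plain(preds, labels, days)
print(score)
```

The daily sums here are 0.3 and 0.2, so the score is 0.25 divided by the sample standard deviation of those two values. A higher score means the predictions earn a steadier daily return.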

<h1>Fit basic LightGBM

In [None]:
import lightgbm as lgb

In [None]:
train_cols = X.columns.tolist()

lgb_train = lgb.Dataset(X_train.values, Y_train, feature_name=train_cols, free_raw_data=False)
lgb_val = lgb.Dataset(X_val.values, Y_val, feature_name=train_cols, free_raw_data=False)

lgb_train.params = {
    'extra_time' : time_train.factorize()[0]
}
lgb_val.params = {
    'extra_time' : time_val.factorize()[0]
}


In [None]:
# this gets a score of 0.629, but volume and other feats go to 0.0
lgb_params_old = dict(
    objective = 'regression_l1',
    learning_rate = 0.1,
    num_leaves = 127,
    max_depth = -1,
#     min_data_in_leaf = 1000,
#     min_sum_hessian_in_leaf = 10,
    bagging_fraction = 0.75,
    bagging_freq = 2,
    feature_fraction = 0.5,
    lambda_l1 = 0.0,
    lambda_l2 = 1.0,
    metric = 'None', # disables the built-in metric so that only the custom sigma_score (feval) is reported
    seed = 42 # Change for better luck! :)
)

x_1 = [0.19000424246380565, 2452, 212, 328, 202]
#this is from eda script 67
lgb_params = {
        'task': 'train',
        'boosting_type': 'gbdt',
        'objective': 'regression_l1',
#         'objective': 'regression',
        'learning_rate': x_1[0],
        'num_leaves': x_1[1],
        'min_data_in_leaf': x_1[2],
#         'num_iteration': x_1[3],
        'num_iteration': 239,  # alias of num_boost_round; takes precedence over the value passed to lgb.train
        'max_bin': x_1[4],
        'verbose': 1,
        'lambda_l1': 0.0,
        'lambda_l2' : 1.0,
        'metric':'None'
}

In [None]:
training_results = {}
model = lgb.train(lgb_params, lgb_train, num_boost_round=1000, valid_sets=(lgb_val,lgb_train), valid_names=('valid','train'), verbose_eval=25,
              early_stopping_rounds=10, feval=sigma_score, evals_result=training_results)

In [None]:
plt.figure(figsize=(15,10))
plt.plot(training_results['valid']['sigma_score'])
plt.plot(training_results['train']['sigma_score'])

In [None]:
x=lgb.plot_importance(model)
x.figure.set_size_inches(10, 30) 

In [None]:
x=lgb.plot_importance(model, importance_type='gain')
x.figure.set_size_inches(10, 30) 

In [None]:
def lgbm_analyze_feats(model, col_names, top=10):
    """Print the most important features of a LightGBM model.
    Args:
        model: lightgbm.basic.Booster
        col_names: list or pandas.Index of feature names, in training order
        top: int, (optional) -> e.g. print top 10 cols
    Returns:
        gain_sorted: list of (gain, feature name), sorted descending
        split_sorted: list of (split count, feature name), sorted descending
    """
    gain_importances = model.feature_importance(importance_type='gain')
    gain_sorted = sorted(zip(gain_importances, col_names), reverse=True)
    split_importances = model.feature_importance(importance_type='split')
    split_sorted = sorted(zip(split_importances, col_names), reverse=True)
    print("\ntop {} by gain\n--".format(top))
    # slicing avoids an IndexError when there are fewer than `top` features
    for importance, name in gain_sorted[:top]:
        print("{} : {}".format(name, importance))
    print("\ntop {} by split\n--".format(top))
    for importance, name in split_sorted[:top]:
        print("{} : {}".format(name, importance))
    return gain_sorted, split_sorted
_, _ = lgbm_analyze_feats(model, train_cols)

In [None]:
# You can only iterate through a result from `get_prediction_days()` once
# so be careful not to lose it once you start iterating.
days = env.get_prediction_days()

## Main Loop
Let's loop through all the days and make predictions with the trained model.  The `days` generator (returned from `get_prediction_days`) will simply stop returning values once you've reached the end.

In [None]:
def prepare_predictions(market_obs_df):
    """same procedure used for train data"""
    clean_data(market_obs_df)
    return feature_engineering(market_obs_df)

In [None]:
def dull_predictions():
    """used to skip to next prediction for debugging"""
    env.predict(predictions_template_df)
#dull_predictions()

In [None]:
from tqdm import tqdm_notebook as tqdm
import time

n_days = 0
prep_time = 0
prediction_time = 0
packaging_time = 0
total_market_obs_df = []
for (market_obs_df, news_obs_df, predictions_template_df) in days:
    #market_obs_df, news_obs_df, predictions_template_df = next(days)
    n_days +=1
    if (n_days%50==0):
        print(n_days,end=' ')
    t = time.time()

    return_features = ['returnsClosePrevMktres10','returnsClosePrevRaw10','open','close']
    total_market_obs_df.append(market_obs_df)
    
    if len(total_market_obs_df)==1:
        history_df = total_market_obs_df[0]
    else:
        history_df = pd.concat(total_market_obs_df[-(np.max(n_lag)+1):])
    # we generated history_df

    # apply prepare_predictions
    new_df = prepare_predictions(history_df).drop(['assetName', 'volume', 'close', 'open',
       'returnsClosePrevRaw1', 'returnsOpenPrevRaw1',
       'returnsClosePrevMktres1', 'returnsOpenPrevMktres1',
       'returnsClosePrevRaw10', 'returnsOpenPrevRaw10',
       'returnsClosePrevMktres10', 'returnsOpenPrevMktres10'], axis=1)
    market_obs_df = pd.merge(market_obs_df,new_df,how='left',on=['time','assetCode'])
    
    prep_time += time.time() - t

    t = time.time()
    #predictions
    predictions = market_obs_df.iloc[:, (market_obs_df.columns != 'assetCode') & (market_obs_df.columns != 'assetName') &(market_obs_df.columns != 'time') ]
    predictions.insert(loc=11, column='universe', value=1.0)
    
    if "close_to_open_x" in predictions.columns:
        # on the first day history_df is market_obs_df itself, so clean_data adds its three columns
        # to both sides of the merge and pandas suffixes them with _x / _y; keep a single copy
        predictions = predictions.drop(['close_to_open_x','assetName_mean_open_x', 'assetName_mean_close_x'],axis=1)
        predictions = predictions.rename(columns={'close_to_open_y':'close_to_open',
       'assetName_mean_open_y':'assetName_mean_open', 'assetName_mean_close_y':'assetName_mean_close'})
        
    # sanity check: the prediction columns must match the training columns, in the same order
    assert len(predictions.columns) == len(train_cols)
    for col, train_col in zip(predictions.columns, train_cols):
        if col != train_col:
            print(col, train_col)
            print(predictions.columns, train_cols)
    
    predictions_template_df.confidenceValue = model.predict(predictions.values).clip(-1, 1)
    env.predict(predictions_template_df)
    packaging_time += time.time() - t
    
    print("preparation : {}, packaging: {}".format(prep_time, packaging_time))
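
The loop above keeps every day's frame in `total_market_obs_df` and concatenates only the last `max(n_lag) + 1` of them. As a hedged alternative sketch (not what the notebook does), a bounded `collections.deque` keeps just that window in memory. The frames below are made-up stand-ins for `market_obs_df`.

```python
from collections import deque

import pandas as pd

n_lag = [3, 7, 14]
# only the frames needed for the longest lag window are retained
history = deque(maxlen=max(n_lag) + 1)

for day in range(20):
    # stand-in for one day's market_obs_df (made-up values)
    market_obs_df = pd.DataFrame({
        'time': [day, day],
        'assetCode': ['A.N', 'B.N'],
        'open': [1.0 + day, 2.0 + day],
    })
    history.append(market_obs_df)  # the oldest frame is dropped automatically
    history_df = pd.concat(list(history))

print(len(history), history_df['time'].min(), history_df['time'].max())
```

After 20 days the buffer holds 15 frames, covering days 5 to 19.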

# **`write_submission_file`** function

Writes your predictions to a CSV file (`submission.csv`) in the current working directory.

In [None]:
env.write_submission_file()

In [None]:
# We've got a submission file!
import os
print([filename for filename in os.listdir('.') if '.csv' in filename])

As indicated by the helper message, calling `write_submission_file` on its own does **not** make a submission to the competition.  It merely tells the module to write the `submission.csv` file as part of the Kernel's output.  To make a submission to the competition, you'll have to **Commit** your Kernel and find the generated `submission.csv` file in that Kernel Version's Output tab (note this is _outside_ of the Kernel Editor), then click "Submit to Competition".  When we re-run your Kernel during Stage Two, we will run the Kernel Version (generated when you hit "Commit") linked to your chosen Submission.

## Restart the Kernel to run your code again
In order to combat cheating, you are only allowed to call `make_env` or iterate through `get_prediction_days` once per Kernel run.  However, while you're iterating on your model it's reasonable to try something out, change the model a bit, and try it again.  Unfortunately, if you try to simply re-run the code, or even refresh the browser page, you'll still be running on the same Kernel execution session you had been running before, and the `twosigmanews` module will still throw errors.  To get around this, you need to explicitly restart your Kernel execution session, which you can do by pressing the Restart button in the Kernel Editor's bottom Console tab:
![Restart button](https://i.imgur.com/hudu8jF.png)