# Amateur Hour - Using Headlines to Predict Stocks
### Starter Kernel by ``Magichanics`` 
*([Gitlab](https://gitlab.com/Magichanics) - [Kaggle](https://www.kaggle.com/magichanics))*

Stocks are unpredictable, but can sometimes follow a trend. In this notebook, we will be discovering the correlation between the stocks and the news.

If there are any things that you would like me to add or remove, feel free to comment down below. I'm mainly doing this to learn and experiment with the data. I plan on rewriting a lot of code in the future to make it look nicer, since a lot of the stuff I have written may not be the most efficient way to approach specific problems.

**What's new?**

October 14th, 2018:
* Changed groupby from seconds to minutes.See previous versions for an application of this.
* Fixed ``df_assetCode`` error.
* Applied more text processing techniques.
* Fixed groupby error where it returns nulls when it shouldn't have.


![title](https://upload.wikimedia.org/wikipedia/commons/8/8d/Wall_Street_sign_banner.jpg)

Source: [Wikimedia Commons](https://commons.wikimedia.org/wiki/File:Wall_Street_sign_banner.jpg)

In [1]:
# main
import numpy as np
import pandas as pd
import os
from itertools import chain
import gc

# text processing
import re
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer
import nltk
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

# time
from pandas.tseries.holiday import USFederalHolidayCalendar
from sklearn.preprocessing import LabelEncoder
import datetime

# training
from sklearn.model_selection import train_test_split
import lightgbm as lgb

# import environment for data
from kaggle.competitions import twosigmanews
env = twosigmanews.make_env()

Loading the data... This could take a minute.
Done!


In [2]:
sampling = True

In [29]:
(market_train_df, news_train_df) = env.get_training_data()

if sampling:
    market_train_df = market_train_df.tail(40_000)
    news_train_df = news_train_df.tail(100_000)
else:
    market_train_df = market_train_df.tail(3_000_000)
    news_train_df = news_train_df.tail(6_000_000) 

In [4]:
market_train_df.head()

Unnamed: 0,time,assetCode,assetName,volume,close,open,returnsClosePrevRaw1,returnsOpenPrevRaw1,returnsClosePrevMktres1,returnsOpenPrevMktres1,returnsClosePrevRaw10,returnsOpenPrevRaw10,returnsClosePrevMktres10,returnsOpenPrevMktres10,returnsOpenNextMktres10,universe
4032956,2016-11-30 22:00:00+00:00,BOX.N,Box Inc,1598145.0,15.22,15.36,-0.004578,-0.004537,-0.001788,-0.007651,0.01738,0.028112,-0.001421,-0.009743,-0.13843,0.0
4032957,2016-11-30 22:00:00+00:00,BP.N,BP PLC,11784765.0,35.01,34.54,0.044451,0.035061,0.047718,0.032011,0.041654,0.039735,0.03105,0.017423,0.014339,1.0
4032958,2016-11-30 22:00:00+00:00,BPFH.O,Boston Private Financial Holdings Inc,596857.0,15.0,14.9,0.020408,0.010169,0.025694,0.007065,-0.025974,-0.032468,-0.056248,-0.091455,-0.100516,0.0
4032959,2016-11-30 22:00:00+00:00,BPL.N,Buckeye Partners LP,1051021.0,64.34,64.75,0.016751,0.02323,0.018486,0.021982,-0.007864,-0.004153,-0.026552,-0.04099,-0.056339,0.0
4032960,2016-11-30 22:00:00+00:00,BPMC.O,Blueprint Medicines Corp,316481.0,29.37,30.07,-0.031971,-0.152719,-0.027388,-0.156343,-0.160137,-0.138395,-0.202046,-0.21537,-0.431184,0.0


In [5]:
news_train_df.head()

Unnamed: 0,time,sourceTimestamp,firstCreated,sourceId,headline,urgency,takeSequence,provider,subjects,audiences,bodySize,companyCount,headlineTag,marketCommentary,sentenceCount,wordCount,assetCodes,assetName,firstMentionSentence,relevance,sentimentClass,sentimentNegative,sentimentNeutral,sentimentPositive,sentimentWordCount,noveltyCount12H,noveltyCount24H,noveltyCount3D,noveltyCount5D,noveltyCount7D,volumeCounts12H,volumeCounts24H,volumeCounts3D,volumeCounts5D,volumeCounts7D
9228750,2016-11-09 14:40:00+00:00,2016-11-09 14:40:00+00:00,2016-11-09 14:40:00+00:00,20373a40928d0d9b,"S&P 500 OIL, GAS & CONSUMABLE FUELS INDEX DOWN...",1,1,RTRS,"{'BLR', 'STX', 'OILG', 'HOT', 'EXPL', 'OGTR', ...","{'O', 'U', 'NAW', 'OIL', 'E'}",0,6,,True,2,29,"{'XON.DE', 'XON.F', 'XOM.N'}",Exxon Mobil Corp,1,1.0,-1,0.523708,0.300387,0.175905,15,1,1,1,1,1,2,6,13,20,34
9228751,2016-11-09 14:40:01+00:00,2016-11-09 14:40:01+00:00,2016-11-09 14:27:08+00:00,3c02c1d52199f30c,Ford vows to work with Trump after criticism o...,3,1,RTRS,"{'JOB', 'ECON', 'RTRS', 'MCE', 'EMRG', 'TRF', ...","{'PGE', 'PCO', 'G', 'PCU', 'DNP', 'PSC', 'U', ...",752,1,,False,5,146,"{'F.PA', 'F.F', 'F.DE', 'F.N'}",Ford Motor Co,1,1.0,1,0.193485,0.171482,0.635034,146,0,0,0,0,0,6,12,42,45,66
9228752,2016-11-09 14:40:22+00:00,2016-11-09 14:40:22+00:00,2016-11-09 14:35:07+00:00,f7cb3e02eebc5b06,SHARES OF GOLDMAN SACHS AND MORGAN STANLEY ALS...,1,3,RTRS,"{'BLR', 'FUND', 'STX', 'INVB', 'HOT', 'LEN', '...","{'E', 'U'}",0,4,,False,1,17,"{'BAC', 'BAC.N'}",Bank of America Corp,0,0.707107,1,0.029941,0.117471,0.852588,17,0,0,0,0,0,8,19,39,40,52
9228753,2016-11-09 14:40:22+00:00,2016-11-09 14:40:22+00:00,2016-11-09 14:35:07+00:00,f7cb3e02eebc5b06,SHARES OF GOLDMAN SACHS AND MORGAN STANLEY ALS...,1,3,RTRS,"{'BLR', 'FUND', 'STX', 'INVB', 'HOT', 'LEN', '...","{'E', 'U'}",0,4,,False,1,17,"{'MSP.A', 'MS.N'}",Morgan Stanley,1,1.0,0,0.152862,0.615835,0.231303,17,0,0,0,0,0,4,11,19,20,24
9228754,2016-11-09 14:40:22+00:00,2016-11-09 14:40:22+00:00,2016-11-09 14:35:07+00:00,f7cb3e02eebc5b06,SHARES OF GOLDMAN SACHS AND MORGAN STANLEY ALS...,1,3,RTRS,"{'BLR', 'FUND', 'STX', 'INVB', 'HOT', 'LEN', '...","{'E', 'U'}",0,4,,False,1,17,"{'GSC.P', 'GS.N'}",Goldman Sachs Group Inc,1,1.0,0,0.152862,0.615835,0.231303,17,0,0,0,0,0,4,8,19,20,33


### Information on the Training Data
* There are no Unknown ``assetName`` in ``news_train_df``, but there are 24 479 rows with Unknown as the ``assetName`` in ``market_train_df``. Merging by ``assetCode`` leaves out Unknown rows, which could be problematic.
* ``Volume`` has the highest correlation in terms of ``returnsOpenNextMktres10``.
* Merging by just ``assetCodes`` greatly increases the dataframe (with just 100k rows, it has turned into 10 million rows), although merging by ``assetCodes`` and ``time`` greatly decrease the original dataframe.

### Aggregations on News Data

It helped a lot during the Home Credit competition, and in the next block of code we will be merging the news dataframe with the market dataframe. Instead of having columns with a list of numbers, we will get aggregations for each grouping. The following block creates a dictionary that will be used when merging the data.

In [6]:
news_agg_cols = [f for f in news_train_df.columns if 'novelty' in f or
                'volume' in f or
                'sentiment' in f or
                'bodySize' in f or
                'Count' in f or
                'marketCommentary' in f or
                'relevance' in f]
news_agg_dict = {}
for col in news_agg_cols:
    news_agg_dict[col] = ['mean', 'sum', 'max', 'min']
news_agg_dict['urgency'] = ['min', 'count']
news_agg_dict['takeSequence'] = ['max']

### Joining Market & News Data

The grouping method that I'll be using is from [bguberfain](https://www.kaggle.com/bguberfain), but I'll also be adding in the headlines column, as well eliminating rows that are not partnered with either the market or news data. One way I would improve this is probably group by time periods rather than exact times given in ``time`` due to the small amount of data that share the same amount of data in terms of the ``time`` column, and possibly making it a bit more efficient. 

Notes: 
* When you run the full dataset, expect it to take a while.
* As you remove more time features from seconds to year, the resulting train data becomes larger and larger.

In [28]:
# note to self: fill int/float columns with 0
def fillnulls(X):
    
    # fill headlines with the string null
    X['headline'] = X['headline'].fillna('null')
    
def generalize_time(X):
    # convert time to string and/or get rid of Hours, Minutes, and seconds
    X['time'] = X['time'].dt.strftime('%Y-%m-%d %H:%M:%S').str.slice(0,16) #(0,10) for Y-m-d, (0,13) for Y-m-d H

# get dataframes within indecies
def get_indecies(df, indecies):
    
    # update market dataframe to only contain the specific rows with matching indecies.
    def check_index(index, indecies):
        if index in indecies:
            return True
        else:
            return False
    
    df['del_index'] = df.index.values
    df['is_in_indecies'] = df['del_index'].apply(lambda x: check_index(x, indecies))
    df = df[df.is_in_indecies == True]
    del df['del_index'], df['is_in_indecies']
    
    return df

# this function checks for potential nulls after grouping by only grouping the time and assetcode dataframe
# returns valid news indecies for the next if statement.
def partial_groupby(market_df, news_df, df_assetCodes):
    
    # get new dataframe
    temp_news_df_expanded = pd.merge(df_assetCodes, news_df[['time', 'assetCodes']], left_on='level_0', right_index=True, suffixes=(['','_old']))

    # groupby dataframes
    temp_news_df = temp_news_df_expanded.copy()[['time', 'assetCode']]
    temp_market_df = market_df.copy()[['time', 'assetCode']]

    # get indecies on both dataframes
    temp_news_df['news_index'] = temp_news_df.index.values
    temp_market_df['market_index'] = temp_market_df.index.values

    # set multiindex and join the two
    temp_news_df.set_index(['time', 'assetCode'], inplace=True)

    # join the two
    temp_market_df_2 = temp_market_df.join(temp_news_df, on=['time', 'assetCode'])
    del temp_market_df, temp_news_df

    # drop nulls in any columns
    temp_market_df_2 = temp_market_df_2.dropna()

    # get indecies
    market_valid_indecies = temp_market_df_2['market_index'].tolist()
    news_valid_indecies = temp_market_df_2['news_index'].tolist()
    del temp_market_df_2

    # get index rows
    market_df = get_indecies(market_df, market_valid_indecies)
    
    return market_df, news_valid_indecies

def join_market_news(market_df, news_df, nulls=False):
    
    # convert time to string
    generalize_time(market_df)
    generalize_time(news_df)
    
    # Fix asset codes (str -> list)
    news_df['assetCodes'] = news_df['assetCodes'].str.findall(f"'([\w\./]+)'")

    # Expand assetCodes
    assetCodes_expanded = list(chain(*news_df['assetCodes']))
    assetCodes_index = news_df.index.repeat( news_df['assetCodes'].apply(len) )
    
    assert len(assetCodes_index) == len(assetCodes_expanded)
    df_assetCodes = pd.DataFrame({'level_0': assetCodes_index, 'assetCode': assetCodes_expanded})
    
    if not nulls:
        market_df, news_valid_indecies = partial_groupby(market_df, news_df, df_assetCodes)
    
    # create dataframe based on groupby
    news_col = ['time', 'assetCodes', 'headline'] + sorted(list(news_agg_dict.keys()))
    news_df_expanded = pd.merge(df_assetCodes, news_df[news_col], left_on='level_0', right_index=True, suffixes=(['','_old']))
    
    # check if the columns are in the index
    if not nulls:
        news_df_expanded = get_indecies(news_df_expanded, news_valid_indecies)

    def news_df_feats(x):
        if x.name == 'headline':
            return list(x)
    
    # groupby time and assetcode
    news_df_expanded = news_df_expanded.reset_index()
    news_groupby = news_df_expanded.groupby(['time', 'assetCode'])
    
    # get aggregated df
    news_df_aggregated = news_groupby.agg(news_agg_dict).apply(np.float32).reset_index()
    news_df_aggregated.columns = ['_'.join(col).strip() for col in news_df_aggregated.columns.values]
    
    # get any important string dataframes
    news_df_cat = news_groupby.transform(lambda x: news_df_feats(x))['headline'].to_frame()
    new_news_df = pd.concat([news_df_aggregated, news_df_cat], axis=1)
    
    # cleanup
    del news_df_aggregated
    del news_df_cat
    del news_df
    
    # rename columns
    new_news_df.rename(columns={'time_': 'time', 'assetCode_': 'assetCode'}, inplace=True)
    new_news_df.set_index(['time', 'assetCode'], inplace=True)
    
    # Join with train
    market_df = market_df.join(new_news_df, on=['time', 'assetCode'])

    # cleanup
    fillnulls(market_df)

    return market_df


In [30]:
%%time
X_train = join_market_news(market_train_df, news_train_df, nulls=False)

CPU times: user 3.06 s, sys: 16 ms, total: 3.07 s
Wall time: 3.07 s


In [31]:
X_train.head()

Unnamed: 0,time,assetCode,assetName,volume,close,open,returnsClosePrevRaw1,returnsOpenPrevRaw1,returnsClosePrevMktres1,returnsOpenPrevMktres1,returnsClosePrevRaw10,returnsOpenPrevRaw10,returnsClosePrevMktres10,returnsOpenPrevMktres10,returnsOpenNextMktres10,universe,bodySize_mean,bodySize_sum,bodySize_max,bodySize_min,companyCount_mean,companyCount_sum,companyCount_max,companyCount_min,marketCommentary_mean,marketCommentary_sum,marketCommentary_max,marketCommentary_min,sentenceCount_mean,sentenceCount_sum,sentenceCount_max,sentenceCount_min,wordCount_mean,wordCount_sum,wordCount_max,wordCount_min,relevance_mean,relevance_sum,relevance_max,relevance_min,...,noveltyCount24H_mean,noveltyCount24H_sum,noveltyCount24H_max,noveltyCount24H_min,noveltyCount3D_mean,noveltyCount3D_sum,noveltyCount3D_max,noveltyCount3D_min,noveltyCount5D_mean,noveltyCount5D_sum,noveltyCount5D_max,noveltyCount5D_min,noveltyCount7D_mean,noveltyCount7D_sum,noveltyCount7D_max,noveltyCount7D_min,volumeCounts12H_mean,volumeCounts12H_sum,volumeCounts12H_max,volumeCounts12H_min,volumeCounts24H_mean,volumeCounts24H_sum,volumeCounts24H_max,volumeCounts24H_min,volumeCounts3D_mean,volumeCounts3D_sum,volumeCounts3D_max,volumeCounts3D_min,volumeCounts5D_mean,volumeCounts5D_sum,volumeCounts5D_max,volumeCounts5D_min,volumeCounts7D_mean,volumeCounts7D_sum,volumeCounts7D_max,volumeCounts7D_min,urgency_min,urgency_count,takeSequence_max,headline
4033312,2016-11-30 22:00,EXPE.O,Expedia Group Inc,1622074.0,124.05,125.82,-0.008393,-0.003248,-0.005846,-0.005374,0.004128,0.031142,0.002756,0.02916,-0.069039,1.0,4390.0,4390.0,4390.0,4390.0,1.0,1.0,1.0,1.0,0.0,0.0,0.0,0.0,24.0,24.0,24.0,24.0,715.0,715.0,715.0,715.0,0.766032,0.766032,0.766032,0.766032,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,2.0,2.0,2.0,2.0,2.0,2.0,2.0,2.0,2.0,2.0,2.0,2.0,2.0,2.0,2.0,2.0,2.0,2.0,2.0,2.0,3.0,1.0,1.0,[Packaging Corporation of America Completes Ac...
4033995,2016-11-30 22:00,PKG.N,Packaging Corp of America,768660.0,84.76,85.83,-0.010276,0.00846,-0.008547,0.005541,-0.013501,0.004447,-0.03147,-0.032285,-0.046095,1.0,2808.0,2808.0,2808.0,2808.0,1.0,1.0,1.0,1.0,0.0,0.0,0.0,0.0,19.0,19.0,19.0,19.0,441.0,441.0,441.0,441.0,1.0,1.0,1.0,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,3.0,1.0,1.0,[Shaw Provides Dividend Rate Notice for Cumula...
4034187,2016-11-30 22:00,SJR.N,Shaw Communications Inc,578209.0,19.57,19.8,-0.009114,0.011236,-0.007876,0.010423,0.019271,0.03828,0.019042,0.037642,0.045588,0.0,2235.0,4470.0,2236.0,2234.0,1.0,2.0,1.0,1.0,0.0,0.0,0.0,0.0,20.0,40.0,20.0,20.0,408.0,816.0,408.0,408.0,1.0,2.0,1.0,1.0,...,0.5,1.0,1.0,0.0,2.5,5.0,3.0,2.0,2.5,5.0,3.0,2.0,2.5,5.0,3.0,2.0,0.5,1.0,1.0,0.0,0.5,1.0,1.0,0.0,2.5,5.0,3.0,2.0,2.5,5.0,3.0,2.0,4.5,9.0,5.0,4.0,3.0,2.0,1.0,[Expedia.com Launches First Amazon Alexa Skill...
4034981,2016-12-01 22:00,CVE.N,Cenovus Energy Inc,2155678.0,15.64,15.97,0.011643,0.045157,0.016546,0.049285,0.068306,0.096088,0.051404,0.066691,0.000331,1.0,1638.0,4914.0,2457.0,0.0,1.0,3.0,1.0,1.0,0.0,0.0,0.0,0.0,13.0,39.0,19.0,1.0,275.666656,827.0,409.0,9.0,1.0,3.0,1.0,1.0,...,1.333333,4.0,2.0,0.0,1.333333,4.0,2.0,0.0,1.333333,4.0,2.0,0.0,1.333333,4.0,2.0,0.0,2.0,6.0,2.0,2.0,2.0,6.0,2.0,2.0,3.0,9.0,3.0,3.0,11.0,33.0,11.0,11.0,11.0,33.0,11.0,11.0,1.0,3.0,1.0,[Shaw Provides Dividend Rate Notice for Cumula...
4035214,2016-12-01 22:00,FTI.N,TechnipFMC PLC,5617350.0,34.7,34.82,0.012843,0.024419,0.017186,0.026321,-0.00316,-0.006562,-0.010754,-0.017195,-0.043592,1.0,4999.0,14997.0,14997.0,0.0,2.0,6.0,2.0,2.0,0.0,0.0,0.0,0.0,23.333334,70.0,68.0,1.0,778.666687,2336.0,2304.0,13.0,1.0,3.0,1.0,1.0,...,0.666667,2.0,1.0,0.0,0.666667,2.0,1.0,0.0,0.666667,2.0,1.0,0.0,0.666667,2.0,1.0,0.0,2.0,6.0,3.0,1.0,2.0,6.0,3.0,1.0,8.0,24.0,9.0,7.0,8.0,24.0,9.0,7.0,8.0,24.0,9.0,7.0,1.0,3.0,2.0,[Cenovus appoints Claude Mongeau to Board of D...


In [33]:
X_train.shape

(70, 104)

### Text Processing with Logistic Regression

We are going to vectorize the headlines and apply logistic regression (labels being binary as to whether the stocks go up or not). I would probably apply this same method to the universe column.

In [36]:
# reuse data
def round_scores(x):
    if x >= 0:
        return 1
    else:
        return 0
    
def clean_headlines(headline):
    
    # remove numerical and convert to lowercase
    headline =  re.sub('[^a-zA-Z]',' ',headline)
    headline = headline.lower()
    
    # drop stopwords
    headline_words = headline.split(' ')
    headline_words = [word for word in headline_words if not word in stopwords.words('english')]
    
    # use stemming to simplify words
    ps = PorterStemmer()
    headline_words = [ps.stem(word) for word in headline_words]
    
    # join sentence back again
    return ' '.join(headline_words)

# these functions should only go towards the training data only
def get_headline_df(X_train):
    
    headlines_lst = []
    target_lst = []
    
    # iter through every headline.
    for row in range(0,len(X_train.index)):
        for sentence in X_train['headline'].iloc[row]:
            headlines_lst.append(clean_headlines(sentence))
            target_lst.append(round_scores(X_train['returnsOpenNextMktres10'].iloc[row]))
            
    # return dataframe
    return pd.DataFrame({'headline':pd.Series(headlines_lst), 'returnsOpenNextMktres10':pd.Series(target_lst)})
    
def get_headline(headlines_df):
    
    # get headlines as list (use only headline_df produced by get_headline_df)
    headlines_lst = []
    for row in range(0,len(headlines_df.index)):
        headlines_lst.append(headlines_df.iloc[row])

    # split headlines to separate words
    basicvectorizer = CountVectorizer()
    headlines_vectorized = basicvectorizer.fit_transform(headlines_lst)
    
    print(headlines_vectorized.shape)
    return headlines_vectorized, basicvectorizer

def headline_mapping(target, headlines_vectored, headline_vectorizer):
    
    print(np.asarray(target).shape)
    headline_model = LogisticRegression()
    headline_model = headline_model.fit(headlines_vectored, target)
    
    # get coefficients
    basicwords = headline_vectorizer.get_feature_names()
    basiccoeffs = headline_model.coef_.tolist()[0]
    coeff_df = pd.DataFrame({'Word' : basicwords, 
                            'Coefficient' : basiccoeffs})
    
    # convert dataframe to dictionary of coefficients
    coefficient_dict = dict(zip(coeff_df.Word, coeff_df.Coefficient))

    return coefficient_dict, coeff_df['Coefficient'].mean()

# for predictions
def get_coeff_col(X, coeff_dict, coeff_default):
    
    def get_coeff(word_lst):
        
        # iter through every word
        coeff_sum = 0
        for word in word_lst:
            if word in coeff_dict:
                coeff_sum += coeff_dict[word]
            else:
                coeff_sum += coeff_default
        
        # get average coefficient
        coeff_score = coeff_sum / len(word_lst)
        return coeff_score
        
    basicvectorizer = CountVectorizer()
    
    # loop through every item
    headlines_coeff_lst = []
    for row in range(0,len(X['headline'].index)):
        coeff_score = 0
        for i in range(0,len(X['headline'].iloc[row])):
            coeff_score += get_coeff(clean_headlines(str(X['headline'].iloc[row][i])).split(' '))
        headlines_coeff_lst.append(coeff_score / len(X['headline'].iloc[row]))
        
    # merge coefficient frame with main
    coeff_mean_df = pd.DataFrame({'headline_coeff_mean': pd.Series(headlines_coeff_lst)})
    X = pd.concat([X.reset_index(), coeff_mean_df], axis=1)
    
    return X

In [37]:
headline_df = get_headline_df(X_train)
coefficient_dict, coefficient_default = headline_mapping(headline_df['returnsOpenNextMktres10'],
                                            *get_headline(headline_df['headline']))

(201, 353)
(201,)


In [38]:
# will be applied to X_test as well
X_train = get_coeff_col(X_train, coefficient_dict, coefficient_default)

### Extra Features ``return``

Here are some extra features pi

In [39]:
def extra_features(X):
    
    # Adding daily difference
    new_col = X["close"] - X["open"]
    X.insert(loc=6, column="daily_diff", value=new_col)
    X['close_to_open'] =  np.abs(X['close'] / X['open'])

In [40]:
extra_features(X_train)

### Get Time Features

This section splits the timestamp column into their own separate columns, as well as other various time features.

Possible idea: Encoding time

In [41]:
# ripped from my previous kernel, NYC Taxi Fare

# first get dates
def split_time(df):
    
    # split date_time into categories
    df['time_day'] = df['time'].str.slice(8,10)
    df['time_month'] = df['time'].str.slice(5,7)
    df['time_year'] = df['time'].str.slice(0,4)
    df['time_hour'] = df['time'].str.slice(11,13)
    df['time_minute'] = df['time'].str.slice(14,16)
    
    # source: https://www.kaggle.com/nicapotato/taxi-rides-time-analysis-and-oof-lgbm
    df['temp_time'] = df['time'].str.replace(" UTC", "")
    df['temp_time'] = pd.to_datetime(df['temp_time'], format='%Y-%m-%d %H')
    
    df['time_day_of_year'] = df.temp_time.dt.dayofyear
    df['time_week_of_year'] = df.temp_time.dt.weekofyear
    df["time_weekday"] = df.temp_time.dt.weekday
    df["time_quarter"] = df.temp_time.dt.quarter
    
    del df['temp_time']
    gc.collect()
    
    # convert to non-object columns
    time_feats = ['time_day', 'time_month', 'time_year']
    df[time_feats] = df[time_feats].apply(pd.to_numeric)
    
    # determine whether the day is set on a holiday
    cal = USFederalHolidayCalendar()
    holidays = cal.holidays(start='2007-01-01', end='2018-09-27').to_pydatetime()
    df['on_holiday'] = df['time'].str.slice(0,10).apply(lambda x: 1 if x in holidays else 0)

In [42]:
split_time(X_train)

### Cleaning Data
Removes all categorical data as well as data that does not show up in the test data.

In [43]:
def remove_cols(X):
    del_cols = [f for f in X.columns if X[f].dtype == 'object'] + ['assetName', 'index']
    for f in del_cols:
        del X[f]

In [44]:
remove_cols(X_train)

### Compile X functions into one function

This will be used when looping through different batches of X_test

In [45]:
def get_X(market_df, news_df):
    
    # these are all the functions applied to X_train except for a few
    X_test = join_market_news(market_df, news_df, nulls=True)
    X_test = get_coeff_col(X_test, coefficient_dict, coefficient_default)
    extra_features(X_test)
    split_time(X_test)
    remove_cols(X_test)
    
    return X_test

#### Resulting Dataframe and Data Correlation to Target column
We have went to roughly 50 columns to 113!

In [46]:
X_train.head(10)

Unnamed: 0,volume,close,daily_diff,open,returnsClosePrevRaw1,returnsOpenPrevRaw1,returnsClosePrevMktres1,returnsOpenPrevMktres1,returnsClosePrevRaw10,returnsOpenPrevRaw10,returnsClosePrevMktres10,returnsOpenPrevMktres10,returnsOpenNextMktres10,universe,bodySize_mean,bodySize_sum,bodySize_max,bodySize_min,companyCount_mean,companyCount_sum,companyCount_max,companyCount_min,marketCommentary_mean,marketCommentary_sum,marketCommentary_max,marketCommentary_min,sentenceCount_mean,sentenceCount_sum,sentenceCount_max,sentenceCount_min,wordCount_mean,wordCount_sum,wordCount_max,wordCount_min,relevance_mean,relevance_sum,relevance_max,relevance_min,sentimentClass_mean,sentimentClass_sum,...,noveltyCount5D_sum,noveltyCount5D_max,noveltyCount5D_min,noveltyCount7D_mean,noveltyCount7D_sum,noveltyCount7D_max,noveltyCount7D_min,volumeCounts12H_mean,volumeCounts12H_sum,volumeCounts12H_max,volumeCounts12H_min,volumeCounts24H_mean,volumeCounts24H_sum,volumeCounts24H_max,volumeCounts24H_min,volumeCounts3D_mean,volumeCounts3D_sum,volumeCounts3D_max,volumeCounts3D_min,volumeCounts5D_mean,volumeCounts5D_sum,volumeCounts5D_max,volumeCounts5D_min,volumeCounts7D_mean,volumeCounts7D_sum,volumeCounts7D_max,volumeCounts7D_min,urgency_min,urgency_count,takeSequence_max,headline_coeff_mean,close_to_open,time_day,time_month,time_year,time_day_of_year,time_week_of_year,time_weekday,time_quarter,on_holiday
0,1622074.0,124.05,-1.77,125.82,-0.008393,-0.003248,-0.005846,-0.005374,0.004128,0.031142,0.002756,0.02916,-0.069039,1.0,4390.0,4390.0,4390.0,4390.0,1.0,1.0,1.0,1.0,0.0,0.0,0.0,0.0,24.0,24.0,24.0,24.0,715.0,715.0,715.0,715.0,0.766032,0.766032,0.766032,0.766032,1.0,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,2.0,2.0,2.0,2.0,2.0,2.0,2.0,2.0,2.0,2.0,2.0,2.0,2.0,2.0,2.0,2.0,2.0,2.0,2.0,2.0,3.0,1.0,1.0,-0.083918,0.985932,30,11,2016,335,48,2,4,0
1,768660.0,84.76,-1.07,85.83,-0.010276,0.00846,-0.008547,0.005541,-0.013501,0.004447,-0.03147,-0.032285,-0.046095,1.0,2808.0,2808.0,2808.0,2808.0,1.0,1.0,1.0,1.0,0.0,0.0,0.0,0.0,19.0,19.0,19.0,19.0,441.0,441.0,441.0,441.0,1.0,1.0,1.0,1.0,-1.0,-1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,3.0,1.0,1.0,0.023086,0.987533,30,11,2016,335,48,2,4,0
2,578209.0,19.57,-0.23,19.8,-0.009114,0.011236,-0.007876,0.010423,0.019271,0.03828,0.019042,0.037642,0.045588,0.0,2235.0,4470.0,2236.0,2234.0,1.0,2.0,1.0,1.0,0.0,0.0,0.0,0.0,20.0,40.0,20.0,20.0,408.0,816.0,408.0,408.0,1.0,2.0,1.0,1.0,1.0,2.0,...,5.0,3.0,2.0,2.5,5.0,3.0,2.0,0.5,1.0,1.0,0.0,0.5,1.0,1.0,0.0,2.5,5.0,3.0,2.0,2.5,5.0,3.0,2.0,4.5,9.0,5.0,4.0,3.0,2.0,1.0,0.134187,0.988384,30,11,2016,335,48,2,4,0
3,2155678.0,15.64,-0.33,15.97,0.011643,0.045157,0.016546,0.049285,0.068306,0.096088,0.051404,0.066691,0.000331,1.0,1638.0,4914.0,2457.0,0.0,1.0,3.0,1.0,1.0,0.0,0.0,0.0,0.0,13.0,39.0,19.0,1.0,275.666656,827.0,409.0,9.0,1.0,3.0,1.0,1.0,0.666667,2.0,...,4.0,2.0,0.0,1.333333,4.0,2.0,0.0,2.0,6.0,2.0,2.0,2.0,6.0,2.0,2.0,3.0,9.0,3.0,3.0,11.0,33.0,11.0,11.0,11.0,33.0,11.0,11.0,1.0,3.0,1.0,0.023086,0.979336,1,12,2016,336,48,3,4,0
4,5617350.0,34.7,-0.12,34.82,0.012843,0.024419,0.017186,0.026321,-0.00316,-0.006562,-0.010754,-0.017195,-0.043592,1.0,4999.0,14997.0,14997.0,0.0,2.0,6.0,2.0,2.0,0.0,0.0,0.0,0.0,23.333334,70.0,68.0,1.0,778.666687,2336.0,2304.0,13.0,1.0,3.0,1.0,1.0,-0.333333,-1.0,...,2.0,1.0,0.0,0.666667,2.0,1.0,0.0,2.0,6.0,3.0,1.0,2.0,6.0,3.0,1.0,8.0,24.0,9.0,7.0,8.0,24.0,9.0,7.0,8.0,24.0,9.0,7.0,1.0,3.0,2.0,-0.025264,0.996554,1,12,2016,336,48,3,4,0
5,1865876.0,27.2,0.15,27.05,0.00369,0.028517,0.00383,0.028553,0.120046,0.120759,0.118194,0.116533,0.196101,0.0,416.0,416.0,416.0,416.0,1.0,1.0,1.0,1.0,0.0,0.0,0.0,0.0,5.0,5.0,5.0,5.0,87.0,87.0,87.0,87.0,1.0,1.0,1.0,1.0,1.0,1.0,...,2.0,2.0,2.0,2.0,2.0,2.0,2.0,2.0,2.0,2.0,2.0,2.0,2.0,2.0,2.0,2.0,2.0,2.0,2.0,2.0,2.0,2.0,2.0,2.0,2.0,2.0,2.0,3.0,1.0,1.0,-0.025264,1.005545,1,12,2016,336,48,3,4,0
6,289911.0,84.15,1.15,83.0,0.015691,0.018405,0.020777,0.021186,0.078155,0.063421,0.048542,0.017415,-0.05538,0.0,2862.333252,8587.0,8587.0,0.0,1.0,3.0,1.0,1.0,0.0,0.0,0.0,0.0,12.0,36.0,34.0,1.0,479.333344,1438.0,1411.0,9.0,1.0,3.0,1.0,1.0,1.0,3.0,...,2.0,2.0,0.0,0.666667,2.0,2.0,0.0,0.666667,2.0,2.0,0.0,0.666667,2.0,2.0,0.0,15.666667,47.0,17.0,15.0,32.666668,98.0,34.0,32.0,32.666668,98.0,34.0,32.0,1.0,3.0,2.0,-0.025264,1.013855,1,12,2016,336,48,3,4,0
7,619886.0,39.43,0.69,38.74,0.026823,0.005972,0.030085,0.009732,0.047559,0.025982,0.030893,0.000334,-0.111732,1.0,4596.5,9193.0,4654.0,4539.0,1.0,2.0,1.0,1.0,0.0,0.0,0.0,0.0,23.5,47.0,24.0,23.0,849.0,1698.0,856.0,842.0,1.0,2.0,1.0,1.0,1.0,2.0,...,1.0,1.0,0.0,0.5,1.0,1.0,0.0,1.5,3.0,2.0,1.0,1.5,3.0,2.0,1.0,3.5,7.0,4.0,3.0,4.5,9.0,5.0,4.0,4.5,9.0,5.0,4.0,3.0,2.0,1.0,0.015273,1.017811,1,12,2016,336,48,3,4,0
8,3663665.0,32.1,-1.66,33.76,-0.051699,-0.010261,-0.052121,-0.007915,0.022293,0.063307,0.021523,0.064146,0.060768,1.0,0.0,0.0,0.0,0.0,2.0,4.0,2.0,2.0,0.0,0.0,0.0,0.0,1.0,2.0,1.0,1.0,25.5,51.0,26.0,25.0,0.853554,1.707107,1.0,0.707107,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.5,1.0,1.0,0.0,0.5,1.0,1.0,0.0,3.5,7.0,4.0,3.0,4.5,9.0,5.0,4.0,4.5,9.0,5.0,4.0,1.0,2.0,2.0,0.015273,0.950829,2,12,2016,337,48,4,4,0
9,371803.0,168.0,0.08,167.92,-0.004149,-0.002732,-0.004483,0.003535,0.039797,0.039173,0.041126,0.042239,0.002453,0.0,260.0,260.0,260.0,260.0,1.0,1.0,1.0,1.0,0.0,0.0,0.0,0.0,4.0,4.0,4.0,4.0,58.0,58.0,58.0,58.0,1.0,1.0,1.0,1.0,1.0,1.0,...,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,3.0,1.0,1.0,0.099617,1.000476,2,12,2016,337,48,4,4,0


### Using LGBM for Modelling

In [47]:
def set_data(X_train):
    
    # get X and Y
    y_train = X_train['returnsOpenNextMktres10']
    del X_train['returnsOpenNextMktres10'], X_train['universe']
    
    # split data (for cross validation)
    x1, x2, y1, y2 = train_test_split(X_train, 
                                      y_train, 
                                      test_size=0.25, 
                                      random_state=99)
    
    return x1, x2, y1, y2
    
def lgbm_training(X_train):
    
    # set model and parameters
    params = {'learning_rate': 0.02, 
              'boosting': 'gbdt', 
              'objective': 'regression', 
              'seed': 2018}
    
    # get x and y values
    x1, x2, y1, y2 = set_data(X_train)
    
    # train data
    lgb_model = lgb.train(params, 
                            lgb.Dataset(x1, label=y1), 
                            5000, 
                            lgb.Dataset(x2, label=y2), 
                            verbose_eval=100, 
                            early_stopping_rounds=200)
    
    return lgb_model


In [48]:
lgb_model = lgbm_training(X_train)

Training until validation scores don't improve for 200 rounds.
[100]	valid_0's l2: 0.00351335
[200]	valid_0's l2: 0.00341711
[300]	valid_0's l2: 0.00331763
[400]	valid_0's l2: 0.00323196
[500]	valid_0's l2: 0.00317827
[600]	valid_0's l2: 0.00318116
[700]	valid_0's l2: 0.00316656
[800]	valid_0's l2: 0.00314943
[900]	valid_0's l2: 0.00314722
[1000]	valid_0's l2: 0.00315356
Early stopping, best iteration is:
[846]	valid_0's l2: 0.00313528


### Making Predictions

Now the difference between the training and test data would be these two columns,  ``['returnsOpenNextMktres10', 'universe']``. We will be trying to predict ``returnsOpenNextMktres10`` and using that as the ``confidenceValue``.

In [None]:
%%time

def make_predictions(market_obs_df, news_obs_df):
    
    # predict using given model
    X_test = get_X(market_obs_df, news_obs_df)
    prediction_values = np.clip(lgb_model.predict(X_test), -1, 1)

    return prediction_values

for (market_obs_df, news_obs_df, predictions_template_df) in env.get_prediction_days(): # Looping over days from start of 2017 to 2019-07-15
    
    # make predictions
    predictions_template_df['confidenceValue'] = make_predictions(market_obs_df, news_obs_df)
    
    # save predictions
    env.predict(predictions_template_df)


### Export Submission

In [None]:
# exports csv
env.write_submission_file()
print('finished!')

### References:
* [Getting Started - DJ Sterling](https://www.kaggle.com/dster/two-sigma-news-official-getting-started-kernel)
* [a simple model - Bruno G. do Amaral](https://www.kaggle.com/bguberfain/a-simple-model-using-the-market-data)
* [LGBM Model - the1owl](https://www.kaggle.com/the1owl/my-two-sigma-cents-only)
* [Headline Processing - Andrew Gelé](https://www.kaggle.com/ndrewgele/omg-nlp-with-the-djia-and-reddit)
* [Feature engineering - Andrew Lukyanenko](https://www.kaggle.com/artgor/eda-feature-engineering-and-everything)
* [Basic Text Processing - akatsuki06](https://www.kaggle.com/akatsuki06/basic-text-processing-cleaning-the-description)