# Amateur Hour - Stock Market News
### Starter Kernel by Magichanics
*[Gitlab](https://gitlab.com/Magichanics) - [Kaggle](https://www.kaggle.com/magichanics)*

Stocks are unpredictable, but can sometimes follow a trend. In this notebook, we will be discovering the correlation between the stocks and the news.

If there are any things that you would like me to add or remove, feel free to comment down below. I'm mainly doing this to learn and experiment with the data. 

**What's new?**
* October 18th, 2018 - Published kernel


![title](https://upload.wikimedia.org/wikipedia/commons/8/8d/Wall_Street_sign_banner.jpg)

Source: [Wikimedia Commons](https://commons.wikimedia.org/wiki/File:Wall_Street_sign_banner.jpg)

### References:
* [Getting Started - DJ Sterling](https://www.kaggle.com/dster/two-sigma-news-official-getting-started-kernel)
* [a simple model - Bruno G. do Amaral](https://www.kaggle.com/bguberfain/a-simple-model-using-the-market-data)
* [LGBM Model - the1owl](https://www.kaggle.com/the1owl/my-two-sigma-cents-only)
* [Headline Processing - Andrew Gelé](https://www.kaggle.com/ndrewgele/omg-nlp-with-the-djia-and-reddit)

In [1]:
import numpy as np
import pandas as pd
import os
from itertools import chain
from sklearn.feature_extraction.text import CountVectorizer
from pandas.tseries.holiday import USFederalHolidayCalendar
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
import datetime
import gc

# import environment for data
from kaggle.competitions import twosigmanews
env = twosigmanews.make_env()

Loading the data... This could take a minute.
Done!


In [13]:
sampling = True

In [14]:
(market_train_df, news_train_df) = env.get_training_data()

if sampling:
    market_train_df = market_train_df.tail(400_000)
    news_train_df = news_train_df.tail(1_000_000)
else:
    market_train_df = market_train_df.tail(3_000_000)
    news_train_df = news_train_df.tail(6_000_000) 

In [4]:
market_train_df.head()

Unnamed: 0,time,assetCode,assetName,volume,close,open,returnsClosePrevRaw1,returnsOpenPrevRaw1,returnsClosePrevMktres1,returnsOpenPrevMktres1,returnsClosePrevRaw10,returnsOpenPrevRaw10,returnsClosePrevMktres10,returnsOpenPrevMktres10,returnsOpenNextMktres10,universe
0,2007-02-01 22:00:00+00:00,A.N,Agilent Technologies Inc,2606900.0,32.19,32.17,0.005938,0.005312,,,-0.00186,0.000622,,,0.034672,1.0
1,2007-02-01 22:00:00+00:00,AAI.N,AirTran Holdings Inc,2051600.0,11.12,11.08,0.004517,-0.007168,,,-0.078708,-0.088066,,,0.027803,0.0
2,2007-02-01 22:00:00+00:00,AAP.N,Advance Auto Parts Inc,1164800.0,37.51,37.99,-0.011594,0.025648,,,0.014332,0.045405,,,0.024433,1.0
3,2007-02-01 22:00:00+00:00,AAPL.O,Apple Inc,23747329.0,84.74,86.23,-0.011548,0.016324,,,-0.048613,-0.037182,,,-0.007425,1.0
4,2007-02-01 22:00:00+00:00,ABB.N,ABB Ltd,1208600.0,18.02,18.01,0.011791,0.025043,,,0.012929,0.020397,,,-0.017994,1.0


In [5]:
news_train_df.head()

Unnamed: 0,time,sourceTimestamp,firstCreated,sourceId,headline,urgency,takeSequence,provider,subjects,audiences,bodySize,companyCount,headlineTag,marketCommentary,sentenceCount,wordCount,assetCodes,assetName,firstMentionSentence,relevance,sentimentClass,sentimentNegative,sentimentNeutral,sentimentPositive,sentimentWordCount,noveltyCount12H,noveltyCount24H,noveltyCount3D,noveltyCount5D,noveltyCount7D,volumeCounts12H,volumeCounts24H,volumeCounts3D,volumeCounts5D,volumeCounts7D
0,2007-01-01 04:29:32+00:00,2007-01-01 04:29:32+00:00,2007-01-01 04:29:32+00:00,e58c6279551b85cf,China's Daqing pumps 43.41 mln tonnes of oil i...,3,1,RTRS,"{'ENR', 'ASIA', 'CN', 'NGS', 'EMRG', 'RTRS', '...","{'Z', 'O', 'OIL'}",1438,1,,False,11,275,"{'0857.HK', '0857.F', '0857.DE', 'PTR.N'}",PetroChina Co Ltd,6,0.235702,-1,0.500739,0.419327,0.079934,73,0,0,0,0,0,0,0,3,6,7
1,2007-01-01 07:03:35+00:00,2007-01-01 07:03:34+00:00,2007-01-01 07:03:34+00:00,5a31c4327427f63f,"FEATURE-In kidnapping, finesse works best",3,1,RTRS,"{'FEA', 'CA', 'LATAM', 'MX', 'INS', 'ASIA', 'I...","{'PGE', 'PCO', 'G', 'ESN', 'MD', 'PCU', 'DNP',...",4413,1,FEATURE,False,55,907,{'STA.N'},Travelers Companies Inc,8,0.447214,-1,0.600082,0.345853,0.054064,62,1,1,1,1,1,1,1,3,3,3
2,2007-01-01 11:29:56+00:00,2007-01-01 11:29:56+00:00,2007-01-01 11:29:56+00:00,1cefd27a40fabdfe,PRESS DIGEST - Wall Street Journal - Jan 1,3,1,RTRS,"{'RET', 'ENR', 'ID', 'BG', 'US', 'PRESS', 'IQ'...","{'T', 'DNP', 'PSC', 'U', 'D', 'M', 'RNP', 'PTD...",2108,2,PRESS DIGEST,False,15,388,"{'WMT.DE', 'WMT.N'}",Wal-Mart Stores Inc,14,0.377964,-1,0.450049,0.295671,0.25428,67,0,0,0,0,0,0,0,5,11,17
3,2007-01-01 12:08:37+00:00,2007-01-01 12:08:37+00:00,2007-01-01 12:08:37+00:00,23768af19dc69992,PRESS DIGEST - New York Times - Jan 1,3,1,RTRS,"{'FUND', 'FIN', 'CA', 'SFWR', 'INS', 'PUB', 'B...","{'T', 'DNP', 'PSC', 'U', 'D', 'M', 'RNP', 'PTD...",1776,6,PRESS DIGEST,False,14,325,"{'GOOG.O', 'GOOG.OQ', 'GOOGa.DE'}",Google Inc,13,0.149071,-1,0.752917,0.162715,0.084368,83,0,0,0,0,0,0,0,5,13,15
4,2007-01-01 12:08:37+00:00,2007-01-01 12:08:37+00:00,2007-01-01 12:08:37+00:00,23768af19dc69992,PRESS DIGEST - New York Times - Jan 1,3,1,RTRS,"{'FUND', 'FIN', 'CA', 'SFWR', 'INS', 'PUB', 'B...","{'T', 'DNP', 'PSC', 'U', 'D', 'M', 'RNP', 'PTD...",1776,6,PRESS DIGEST,False,14,325,{'XMSR.O'},XM Satellite Radio Holdings Inc,11,0.149071,-1,0.699274,0.20936,0.091366,102,0,0,0,0,0,0,0,0,0,0


### Information on the Training Data
* There are no Unknown ``assetName`` in ``news_train_df``, but there are 24 479 rows with Unknown as the ``assetName`` in ``market_train_df``. Merging by ``assetCode`` leaves out Unknown rows, which could be problematic.
* ``Volume`` has the highest correlation in terms of ``returnsOpenNextMktres10``.
* Merging by just ``assetCodes`` greatly increases the dataframe (with just 100k rows, it has turned into 10 million rows), although merging by ``assetCodes`` and ``time`` greatly decrease the original dataframe.

### Aggregations on News Data

It helped a lot during the Home Credit competition, and in the next block of code we will be merging the news dataframe with the market dataframe. Instead of having columns with a list of numbers, we will get aggregations for each grouping. The following block creates a dictionary that will be used when merging the data.

In [6]:
news_agg_cols = [f for f in news_train_df.columns if 'novelty' in f or
                'volume' in f or
                'sentiment' in f or
                'bodySize' in f or
                'Count' in f or
                'marketCommentary' in f or
                'relevance' in f]
news_agg_dict = {}
for col in news_agg_cols:
    news_agg_dict[col] = ['mean', 'sum', 'max', 'min']
news_agg_dict['urgency'] = ['min', 'count']
news_agg_dict['takeSequence'] = ['max']

### Joining Market & News Data

The grouping method that I'll be using is from [bguberfain](https://www.kaggle.com/bguberfain), but I'll also be adding in the headlines column, as well eliminating rows that are not partnered with either the market or news data. One way I would improve this is probably group by time periods rather than exact times given in ``time`` due to the small amount of data that share the same amount of data in terms of the ``time`` column, and possibly making it a bit more efficient.

NOTE: When you run the full dataset, expect it to take a while.

In [10]:
# update market dataframe to only contain the specific rows with matching indecies.
def check_index(index, indecies):
    if index in indecies:
        return True
    else:
        return False

def join_market_news(market_df, news_df, nulls=False):

    print('market_df :' + str(market_df.shape))
    
    # Fix asset codes (str -> list)
    news_df['assetCodes'] = news_df['assetCodes'].str.findall(f"'([\w\./]+)'")

    # Expand assetCodes
    assetCodes_expanded = list(chain(*news_df['assetCodes']))
    assetCodes_index = news_df.index.repeat( news_df['assetCodes'].apply(len) )

    assert len(assetCodes_index) == len(assetCodes_expanded)
    df_assetCodes = pd.DataFrame({'level_0': assetCodes_index, 'assetCode': assetCodes_expanded})
    
    # get rid of any rows that will cause null values in one dataframe or the other.
    if not nulls:
        
        # gget new dataframe
        temp_news_df_expanded = pd.merge(df_assetCodes, news_df[['time', 'assetCodes']], left_on='level_0', right_index=True, suffixes=(['','_old']))
        
        # groupby dataframes
        temp_news_df = temp_news_df_expanded.copy()[['time', 'assetCode']]
        temp_market_df = market_df.copy()[['time', 'assetCode']]
        
        # get indecies on both dataframes
        temp_news_df['news_index'] = temp_news_df.index.values
        temp_market_df['market_index'] = temp_market_df.index.values
        
        # set multiindex and join the two
        temp_news_df.set_index(['time', 'assetCode'], inplace=True)
        
        # join the two
        temp_market_df_2 = temp_market_df.join(temp_news_df, on=['time', 'assetCode'])
        del temp_market_df, temp_news_df
        
        # drop nulls in any columns
        temp_market_df_2 = temp_market_df_2.dropna()
        print('dataframe relation: ' + str(temp_market_df_2.shape))
        
        # get indecies
        market_valid_indecies = temp_market_df_2['market_index'].tolist()
        news_valid_indecies = temp_market_df_2['news_index'].tolist()
        del temp_market_df_2
            
        # get index column
        market_df['market_index'] = market_df.index.values
        market_df['is_news'] = market_df['market_index'].apply(lambda x: check_index(x, market_valid_indecies))
        market_df = market_df[market_df.is_news == True]
        print('new market dataframe: ' + str(market_df.shape))
        del market_df['market_index'], market_df['is_news']
    
    # create dataframe based on groupby
    news_col = ['time', 'assetCodes', 'headline'] + sorted(list(news_agg_dict.keys()))
    news_df_expanded = pd.merge(df_assetCodes, news_df[news_col], left_on='level_0', right_index=True, suffixes=(['','_old']))
    
    # check if the columns are in the index
    if news_valid_indecies:
        news_df_expanded['news_index'] = news_df_expanded.index.values
        news_df_expanded['is_market'] = news_df_expanded['news_index'].apply(lambda x: check_index(x, news_valid_indecies))
        news_df_expanded = news_df_expanded[news_df_expanded.is_market == True]
        print('new news dataframe: ' + str(news_df_expanded.shape))
        del news_df_expanded['news_index'], news_df_expanded['is_market']

    print('creating grouped data...')

    def news_df_feats(x):
        if x.name == 'headline':
            return list(x)
    
    # groupby time and assetcode
    news_df_expanded = news_df_expanded.reset_index()
    news_groupby = news_df_expanded.groupby(['time', 'assetCode'])
    
    # get aggregated df
    news_df_aggregated = news_groupby.agg(news_agg_dict).apply(np.float32).reset_index()
    news_df_aggregated.columns = ['_'.join(col).strip() for col in news_df_aggregated.columns.values]
    
    # get any important string dataframes
    news_df_cat = news_groupby.transform(lambda x: news_df_feats(x))['headline'].to_frame()
    new_news_df = pd.concat([news_df_aggregated, news_df_cat], axis=1)
    
    # cleanup
    del news_df_aggregated
    del news_df_cat
    del news_df
    
    # rename columns
    new_news_df.rename(columns={'time_': 'time', 'assetCode_': 'assetCode'}, inplace=True)
    new_news_df.set_index(['time', 'assetCode'], inplace=True)
    
    print('merging data...')
    
    # Join with train
    market_df = market_df.join(new_news_df, on=['time', 'assetCode'])

    # cleanup
    gc.collect()
    
    print('X shape :' + str(market_df.shape))
    
    return market_df


In [15]:
%%time
X_train = join_market_news(market_train_df, news_train_df, nulls=False)

market_df :(400000, 16)
dataframe relation: (227, 4)
new market dataframe: (182, 18)
new news dataframe: (227, 30)
creating grouped data...
merging data...
X shape :(182, 104)
CPU times: user 22 s, sys: 724 ms, total: 22.7 s
Wall time: 22.7 s


In [16]:
X_train.head()

Unnamed: 0,time,assetCode,assetName,volume,close,open,returnsClosePrevRaw1,returnsOpenPrevRaw1,returnsClosePrevMktres1,returnsOpenPrevMktres1,returnsClosePrevRaw10,returnsOpenPrevRaw10,returnsClosePrevMktres10,returnsOpenPrevMktres10,returnsOpenNextMktres10,universe,bodySize_mean,bodySize_sum,bodySize_max,bodySize_min,companyCount_mean,companyCount_sum,companyCount_max,companyCount_min,marketCommentary_mean,marketCommentary_sum,marketCommentary_max,marketCommentary_min,sentenceCount_mean,sentenceCount_sum,sentenceCount_max,sentenceCount_min,wordCount_mean,wordCount_sum,wordCount_max,wordCount_min,relevance_mean,relevance_sum,relevance_max,relevance_min,...,noveltyCount24H_mean,noveltyCount24H_sum,noveltyCount24H_max,noveltyCount24H_min,noveltyCount3D_mean,noveltyCount3D_sum,noveltyCount3D_max,noveltyCount3D_min,noveltyCount5D_mean,noveltyCount5D_sum,noveltyCount5D_max,noveltyCount5D_min,noveltyCount7D_mean,noveltyCount7D_sum,noveltyCount7D_max,noveltyCount7D_min,volumeCounts12H_mean,volumeCounts12H_sum,volumeCounts12H_max,volumeCounts12H_min,volumeCounts24H_mean,volumeCounts24H_sum,volumeCounts24H_max,volumeCounts24H_min,volumeCounts3D_mean,volumeCounts3D_sum,volumeCounts3D_max,volumeCounts3D_min,volumeCounts5D_mean,volumeCounts5D_sum,volumeCounts5D_max,volumeCounts5D_min,volumeCounts7D_mean,volumeCounts7D_sum,volumeCounts7D_max,volumeCounts7D_min,urgency_min,urgency_count,takeSequence_max,headline
3673357,2016-02-22 22:00:00+00:00,ALV.N,Autoliv Inc,436921.0,111.96,108.99,0.029801,0.0087,0.024366,0.005473,0.126858,0.084695,,,-0.037786,0.0,8557.0,8557.0,8557.0,8557.0,9.0,9.0,9.0,9.0,0.0,0.0,0.0,0.0,52.0,52.0,52.0,52.0,1555.0,1555.0,1555.0,1555.0,0.036886,0.036886,0.036886,0.036886,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,3.0,3.0,3.0,3.0,17.0,17.0,17.0,17.0,3.0,1.0,1.0,[Ingram Micro Expands Availability of Acronis ...
3673877,2016-02-22 22:00:00+00:00,F.N,Ford Motor Co,33549732.0,12.56,12.24,0.038017,0.004102,0.021645,5e-05,0.096943,0.0625,0.047356,0.055504,0.02702,1.0,8557.0,8557.0,8557.0,8557.0,9.0,9.0,9.0,9.0,0.0,0.0,0.0,0.0,52.0,52.0,52.0,52.0,1555.0,1555.0,1555.0,1555.0,0.073771,0.073771,0.073771,0.073771,...,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,4.0,4.0,4.0,4.0,5.0,5.0,5.0,5.0,5.0,5.0,5.0,5.0,18.0,18.0,18.0,18.0,26.0,26.0,26.0,26.0,3.0,1.0,1.0,[RPT-EXCLUSIVE-Up to 90 million more Takata ai...
3674049,2016-02-22 22:00:00+00:00,HMC.N,Honda Motor Co Ltd,610950.0,26.11,26.12,0.006942,0.007716,-0.004143,0.002679,0.003845,-0.010231,-0.01781,-0.013049,0.024391,1.0,8557.0,8557.0,8557.0,8557.0,9.0,9.0,9.0,9.0,0.0,0.0,0.0,0.0,52.0,52.0,52.0,52.0,1555.0,1555.0,1555.0,1555.0,0.073771,0.073771,0.073771,0.073771,...,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,0.0,0.0,0.0,0.0,1.0,1.0,1.0,1.0,2.0,2.0,2.0,2.0,8.0,8.0,8.0,8.0,8.0,8.0,8.0,8.0,3.0,1.0,1.0,[RPT-EXCLUSIVE-Up to 90 million more Takata ai...
3674110,2016-02-22 22:00:00+00:00,IM.N,Ingram Micro Inc,7136909.0,35.92,36.17,-0.010741,0.004443,-0.021674,-0.002716,0.294881,0.280807,0.220617,0.269462,-0.058289,1.0,6256.0,6256.0,6256.0,6256.0,1.0,1.0,1.0,1.0,0.0,0.0,0.0,0.0,32.0,32.0,32.0,32.0,979.0,979.0,979.0,979.0,1.0,1.0,1.0,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,2.0,2.0,2.0,2.0,2.0,2.0,2.0,2.0,2.0,2.0,2.0,2.0,36.0,36.0,36.0,36.0,51.0,51.0,51.0,51.0,3.0,1.0,1.0,[RPT-EXCLUSIVE-Up to 90 million more Takata ai...
3674877,2016-02-22 22:00:00+00:00,TM.N,Toyota Motor Corp,297787.0,106.69,106.07,0.013104,0.00369,-0.002039,-0.001471,-0.033518,-0.054971,,,-0.036162,0.0,8557.0,8557.0,8557.0,8557.0,9.0,9.0,9.0,9.0,0.0,0.0,0.0,0.0,52.0,52.0,52.0,52.0,1555.0,1555.0,1555.0,1555.0,0.036886,0.036886,0.036886,0.036886,...,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,0.0,0.0,0.0,0.0,2.0,2.0,2.0,2.0,2.0,2.0,2.0,2.0,13.0,13.0,13.0,13.0,20.0,20.0,20.0,20.0,3.0,1.0,1.0,[RPT-EXCLUSIVE-Up to 90 million more Takata ai...


### Text Processing with MultinomialNB

In [19]:
def get_headline(headlines_df):
    
    # get headlines as list
    headlines_lst = []
    for row in range(0,len(headlines_df.index)):
        for sentence in headlines_df.iloc[row]:
            headlines_lst.append(row)

    # split headlines to separate words
    basicvectorizer = CountVectorizer()
    headlines_vectorized = basicvectorizer.fit_transform(headlines_lst)
    
    print(headlines_vectorized.shape)
    return headlines_vectorized, basicvectorizer

def headline_mapping(target, headlines_vectored, headline_vectorizer):
    
    # get model (testing with model that isn't )
    from sklearn.naive_bayes import MultinomialNB
    headline_model = MultinomialNB()
    headline_model = headline_model.fit(headlines_vectored, target)
    
    # get coefficients
    basicwords = headline_vectorizer.get_feature_names()
    basiccoeffs = headline_model.coef_.tolist()[0]
    coeff_df = pd.DataFrame({'Word' : basicwords, 
                            'Coefficient' : basiccoeffs})
    
    # convert dataframe to dictionary of coefficients
    coefficient_dict = dict(zip(coeff_df.Word, coeff_df.Coefficient))

    return coefficient_dict, coeff_df['Coefficient'].mean()

def get_coeff_col(headlines_df, coeff_dict, coeff_default):
    
    def get_coeff(word_lst):
        
        # iter through every word
        coeff_sum = 0
        for word in word_lst:
            if word in coeff_dict:
                coeff_sum += coeff_dict[word]
            else:
                coeff_sum += coeff_default
        
        # get average coefficient
        return coeff_sum / len(word_lst)
        
    basicvectorizer = CountVectorizer()
    
    # loop through every item
    headlines_coeff_lst = []
    for row in range(0,len(headlines_df.index)):
        for sentence in headlines_df.iloc[row]:
            headlines_coeff_lst.append(get_coeff(str(sentence).split(' ')))
    
    return pd.Series(headlines_coeff_lst)

In [20]:
coefficient_dict, coefficient_default = headline_mapping(X_train['returnsOpenNextMktres10'],
                                            *get_headline(X_train['headline']))

X_train['headline_coeff_mean'] = get_coeff_col(X_train['headline'], coefficient_dict, coefficient_default)

AttributeError: 'int' object has no attribute 'lower'

### Extra Features ``return``

### Get Time Features

In [None]:
# ripped from my previous kernel, NYC Taxi Fare

# first get dates
def split_time(df):
    
    # convert to string (will find a more efficient way to do this without converting to string)
    df['time'] = df['time'].dt.strftime('%Y-%m-%d %H:%M:%S')
    
    # split date_time into categories
    df['time_day'] = df['time'].str.slice(8,10)
    df['time_month'] = df['time'].str.slice(5,7)
    df['time_year'] = df['time'].str.slice(0,4)
    df['time_hour'] = df['time'].str.slice(11,13)
    
    # source: https://www.kaggle.com/nicapotato/taxi-rides-time-analysis-and-oof-lgbm
    df['temp_time'] = df['time'].str.replace(" UTC", "")
    df['temp_time'] = pd.to_datetime(df['temp_time'], format='%Y-%m-%d %H:%M:%S')
    
    df['time_day_of_year'] = df.temp_time.dt.dayofyear
    df['time_week_of_year'] = df.temp_time.dt.weekofyear
    df["time_weekday"] = df.temp_time.dt.weekday
    df["time_quarter"] = df.temp_time.dt.quarter
    
    del df['temp_time']
    gc.collect()
    
    # convert to non-object columns
    time_feats = ['time_day', 'time_month', 'time_year', 'time_hour']
    df[time_feats] = df[time_feats].apply(pd.to_numeric)
    
    # determine whether the day is set on a holiday
    cal = USFederalHolidayCalendar()
    holidays = cal.holidays(start='2007-01-01', end='2018-09-27').to_pydatetime()
    df['on_holiday'] = df['time'].str.slice(0,10).apply(lambda x: 1 if x in holidays else 0)
    
    # note to self: encode time later on
    
    return df

X_train = split_time(X_train)

In [None]:
def get_misc_features(X_df):
    
    # Adding daily difference
    new_col = X_df["close"] - X_df["open"]
    X_df.insert(loc=6, column="daily_diff", value=new_col)
    X_df['close_to_open'] =  np.abs(X_df['close'] / X_df['open'])

### Label Encoding

In [None]:
def group_delete(df, del_features):
    for f in del_features:
        del df[f]

def encoding(df, categorical_feats):
    df_encoded = pd.get_dummies(df[categorical_feats])
    df.join(df_encoded, how = 'right')
    group_delete(df, categorical_feats)
    print('new shape: ' + str(df.shape))
    return df

group_delete(X_train, ['time', 'sourceId', 'headline', 'assetCodes'])
X_train = encoding(X_train, [f for f in X_train.columns if X_train[f].dtype == 'object'])

### Cleaning Data

In [None]:
# will use a more efficient way later on
fcol = [c for c in X_train.columns if c not in ['sourceTimestamp', 'firstCreated', 'returnsOpenNextMktres10', 
                                                'assetName_x', 'universe', 'provider', 'subjects',
                                               'audiences', 'marketCommentary', 'assetName_y', 'sourceTimestamp'
                                               'firstCreated']] #<---- added


### Using LGBM for Modelling

In [None]:
# prepare x dataframes for modelling/prediction
def convert_to_X(market_obs_df, news_obs_df):
    
    # this repeats everything that was done previously
    X_test = join_market_news(market_obs_df, news_obs_df)
    X_test = aggregations(X_test)
    X_test['headline_coeff_mean'] = get_coeff_col(X_test['headline'], coefficient_dict, coefficient_default)
    X_test = split_time(X_test)
    group_delete(X_test, ['time', 'sourceId', 'headline', 'assetCodes'])
    X_test = encoding(X_test, ['assetCode', 'headlineTag'])
    X_test = X_test[[f for f in X_test.columns if 'int' in str(X_test[f].dtype) or 'float' in str(X_test[f].dtype)]]
    
    return X_test

In [None]:
y_train = X_train['returnsOpenNextMktres10']
del X_train['returnsOpenNextMktres10']

In [None]:
import lightgbm as lgb
import time

# set model and parameters
params = {'learning_rate': 0.02, 
          'boosting': 'gbdt', 
          'objective': 'regression', 
          'seed': 2018}

In [None]:
#split data (for cross validation)
x1, x2, y1, y2 = train_test_split(X_train[fcol], 
                                  y_train, 
                                  test_size=0.25, 
                                  random_state=99)

In [None]:
# train
t = time.time()
print('Fitting Up')

# cross validation
lgb_model = lgb.train(params, 
                        lgb.Dataset(x1, label=y1), 
                        5000, 
                        lgb.Dataset(x2, label=y2), 
                        verbose_eval=100, 
                        early_stopping_rounds=200)

# lgb_model = lgb.train(params, 
#                         lgb.Dataset(X_train[fcol], label=y_train),
#                         verbose_eval=100)

print(f'Done, time = {time.time() - t}')

In [None]:
def make_predictions(market_obs_df, news_obs_df):
    
    print('market_obs_df shape: ' + str(market_obs_df.shape))
    print('news_obs_df shape: ' + str(news_obs_df.shape))
    
    # predict using given model
    X_test = convert_to_X(market_obs_df, news_obs_df)
    print('Created X_test with features: ' + str(X_test[fcol].columns))
    
    # there is an error:
    # ValueError: Length of values does not match length of index
    prediction_values = np.clip(lgb_model.predict(X_test[fcol]), -1, 1)
    
    print('finished predictions')

    return prediction_values

### Making Predictions

Now the difference between the training and test data would be these two columns,  ``['returnsOpenNextMktres10', 'universe']``. We will be trying to predict ``returnsOpenNextMktres10`` and using that as the ``confidenceValue``.

In [None]:
for (market_obs_df, news_obs_df, predictions_template_df) in env.get_prediction_days(): # Looping over days from start of 2017 to 2019-07-15
    
    print('predictions_template_df shape: ' + str(predictions_template_df.shape))
    # make predictions
    predictions_template_df['confidenceValue'] = make_predictions(market_obs_df, news_obs_df)
    
    # save predictions
    env.predict(predictions_template_df)


### Export Submission

In [None]:
env.write_submission_file() # Writes your submission file
print('finished!')