# Amateur Hour - Using Headlines to Predict Stocks
### Starter Kernel by ``Magichanics`` 
*([Gitlab](https://gitlab.com/Magichanics) - [Kaggle](https://www.kaggle.com/magichanics))*

Stock prices are hard to predict, but they sometimes follow trends. In this notebook, we will explore the correlation between stock returns and news headlines.

If there is anything you would like me to add or remove, feel free to comment below. I'm mainly doing this to learn and experiment with the data. I plan to rewrite much of the code later to make it cleaner, since some of what I have written may not be the most efficient way to approach specific problems.

**To Do List:**
* Removing features with low importance.

**What's new?**
* Added ``vol_by_name``, the median volume for each company defined in ``assetName``.
* Added count and groupby columns with ``audiences`` and ``subjects``.
* Added ``assetCode`` and ``assetName`` as a feature for training.



![title](https://upload.wikimedia.org/wikipedia/commons/8/8d/Wall_Street_sign_banner.jpg)

Source: [Wikimedia Commons](https://commons.wikimedia.org/wiki/File:Wall_Street_sign_banner.jpg)

In [1]:
# main
import numpy as np
import pandas as pd
import os
from itertools import chain
import gc

# text processing
import re
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer
import nltk
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

# clustering
from sklearn.cluster import KMeans

# time
from pandas.tseries.holiday import USFederalHolidayCalendar
from sklearn.preprocessing import LabelEncoder
import datetime

# training
from sklearn.model_selection import train_test_split
import lightgbm as lgb

# import environment for data
from kaggle.competitions import twosigmanews
env = twosigmanews.make_env()

Loading the data... This could take a minute.
Done!


In [2]:
sampling = False

In [13]:
(market_train_df, news_train_df) = env.get_training_data()

if sampling:
    market_train_df = market_train_df.tail(40_000)
    news_train_df = news_train_df.tail(100_000)

In [14]:
market_train_df.head()

Unnamed: 0,time,assetCode,assetName,volume,close,open,returnsClosePrevRaw1,returnsOpenPrevRaw1,returnsClosePrevMktres1,returnsOpenPrevMktres1,returnsClosePrevRaw10,returnsOpenPrevRaw10,returnsClosePrevMktres10,returnsOpenPrevMktres10,returnsOpenNextMktres10,universe
4032956,2016-11-30 22:00:00+00:00,BOX.N,Box Inc,1598145.0,15.22,15.36,-0.004578,-0.004537,-0.001788,-0.007651,0.01738,0.028112,-0.001421,-0.009743,-0.13843,0.0
4032957,2016-11-30 22:00:00+00:00,BP.N,BP PLC,11784765.0,35.01,34.54,0.044451,0.035061,0.047718,0.032011,0.041654,0.039735,0.03105,0.017423,0.014339,1.0
4032958,2016-11-30 22:00:00+00:00,BPFH.O,Boston Private Financial Holdings Inc,596857.0,15.0,14.9,0.020408,0.010169,0.025694,0.007065,-0.025974,-0.032468,-0.056248,-0.091455,-0.100516,0.0
4032959,2016-11-30 22:00:00+00:00,BPL.N,Buckeye Partners LP,1051021.0,64.34,64.75,0.016751,0.02323,0.018486,0.021982,-0.007864,-0.004153,-0.026552,-0.04099,-0.056339,0.0
4032960,2016-11-30 22:00:00+00:00,BPMC.O,Blueprint Medicines Corp,316481.0,29.37,30.07,-0.031971,-0.152719,-0.027388,-0.156343,-0.160137,-0.138395,-0.202046,-0.21537,-0.431184,0.0


In [15]:
news_train_df.head()

Unnamed: 0,time,sourceTimestamp,firstCreated,sourceId,headline,urgency,takeSequence,provider,subjects,audiences,bodySize,companyCount,headlineTag,marketCommentary,sentenceCount,wordCount,assetCodes,assetName,firstMentionSentence,relevance,sentimentClass,sentimentNegative,sentimentNeutral,sentimentPositive,sentimentWordCount,noveltyCount12H,noveltyCount24H,noveltyCount3D,noveltyCount5D,noveltyCount7D,volumeCounts12H,volumeCounts24H,volumeCounts3D,volumeCounts5D,volumeCounts7D
9228750,2016-11-09 14:40:00+00:00,2016-11-09 14:40:00+00:00,2016-11-09 14:40:00+00:00,20373a40928d0d9b,"S&P 500 OIL, GAS & CONSUMABLE FUELS INDEX DOWN...",1,1,RTRS,"{'BLR', 'STX', 'OILG', 'HOT', 'EXPL', 'OGTR', ...","{'O', 'U', 'NAW', 'OIL', 'E'}",0,6,,True,2,29,"{'XON.DE', 'XON.F', 'XOM.N'}",Exxon Mobil Corp,1,1.0,-1,0.523708,0.300387,0.175905,15,1,1,1,1,1,2,6,13,20,34
9228751,2016-11-09 14:40:01+00:00,2016-11-09 14:40:01+00:00,2016-11-09 14:27:08+00:00,3c02c1d52199f30c,Ford vows to work with Trump after criticism o...,3,1,RTRS,"{'JOB', 'ECON', 'RTRS', 'MCE', 'EMRG', 'TRF', ...","{'PGE', 'PCO', 'G', 'PCU', 'DNP', 'PSC', 'U', ...",752,1,,False,5,146,"{'F.PA', 'F.F', 'F.DE', 'F.N'}",Ford Motor Co,1,1.0,1,0.193485,0.171482,0.635034,146,0,0,0,0,0,6,12,42,45,66
9228752,2016-11-09 14:40:22+00:00,2016-11-09 14:40:22+00:00,2016-11-09 14:35:07+00:00,f7cb3e02eebc5b06,SHARES OF GOLDMAN SACHS AND MORGAN STANLEY ALS...,1,3,RTRS,"{'BLR', 'FUND', 'STX', 'INVB', 'HOT', 'LEN', '...","{'E', 'U'}",0,4,,False,1,17,"{'BAC', 'BAC.N'}",Bank of America Corp,0,0.707107,1,0.029941,0.117471,0.852588,17,0,0,0,0,0,8,19,39,40,52
9228753,2016-11-09 14:40:22+00:00,2016-11-09 14:40:22+00:00,2016-11-09 14:35:07+00:00,f7cb3e02eebc5b06,SHARES OF GOLDMAN SACHS AND MORGAN STANLEY ALS...,1,3,RTRS,"{'BLR', 'FUND', 'STX', 'INVB', 'HOT', 'LEN', '...","{'E', 'U'}",0,4,,False,1,17,"{'MSP.A', 'MS.N'}",Morgan Stanley,1,1.0,0,0.152862,0.615835,0.231303,17,0,0,0,0,0,4,11,19,20,24
9228754,2016-11-09 14:40:22+00:00,2016-11-09 14:40:22+00:00,2016-11-09 14:35:07+00:00,f7cb3e02eebc5b06,SHARES OF GOLDMAN SACHS AND MORGAN STANLEY ALS...,1,3,RTRS,"{'BLR', 'FUND', 'STX', 'INVB', 'HOT', 'LEN', '...","{'E', 'U'}",0,4,,False,1,17,"{'GSC.P', 'GS.N'}",Goldman Sachs Group Inc,1,1.0,0,0.152862,0.615835,0.231303,17,0,0,0,0,0,4,8,19,20,33


### Information on the Training Data
* No rows in ``news_train_df`` have Unknown as the ``assetName``, but 24,479 rows in ``market_train_df`` do. Merging by ``assetCode`` leaves out Unknown rows, which could be problematic.
* ``volume`` has the highest correlation with ``returnsOpenNextMktres10``.
* Merging by just ``assetCodes`` greatly enlarges the dataframe (just 100k rows turn into 10 million rows), whereas merging by ``assetCodes`` and ``time`` greatly shrinks the original dataframe.

### Market Groupby Features
We are going to group the market data by ``assetName``, compute the median volume for each company, and record how far each row's volume is from that median.

In [16]:
def mean_volume(market_df):
    
    # groupby and return median
    vol_by_name = market_df[['volume', 'assetName']].groupby('assetName').median()['volume']
    #vol_by_name_mean = market_df[['volume', 'assetName']].groupby('assetName').mean()['volume'] # could try mean?
    market_df['vol_by_name'] = market_df['assetName'].map(vol_by_name)
    
    # get difference
    market_df['vol_by_name_diff'] = market_df['volume'] - market_df['vol_by_name']
    
    return market_df

In [17]:
market_train_df = mean_volume(market_train_df)
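
To make the feature concrete, here is a minimal sketch on made-up data. It uses `groupby(...).transform('median')`, which should give the same values as the `map` approach in `mean_volume`; the frame `toy_market` and its numbers are hypothetical.

```python
import pandas as pd

# Toy market data; the company names and volumes are made up for illustration.
toy_market = pd.DataFrame({
    'assetName': ['Alpha Inc', 'Alpha Inc', 'Alpha Inc', 'Beta Corp', 'Beta Corp'],
    'volume': [100.0, 300.0, 200.0, 50.0, 150.0],
})

# Median volume per company, broadcast back to every row of that company.
toy_market['vol_by_name'] = toy_market.groupby('assetName')['volume'].transform('median')

# Positive values mean the row traded above its company's typical volume.
toy_market['vol_by_name_diff'] = toy_market['volume'] - toy_market['vol_by_name']

print(toy_market)
```

The difference column is what lets the model see unusually heavy or light trading relative to each company's own baseline.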

### Aggregations on News Data

Aggregations helped a lot during the Home Credit competition. In the next block of code we will merge the news dataframe into the market dataframe. Instead of keeping columns that hold lists of numbers, we will compute aggregations for each group. The following block creates the dictionary that drives those aggregations when merging the data.

In [18]:
news_agg_cols = [f for f in news_train_df.columns if 'novelty' in f or
                'volume' in f or
                'sentiment' in f or
                'bodySize' in f or
                'Count' in f or
                'marketCommentary' in f or
                'relevance' in f]
news_agg_dict = {}
for col in news_agg_cols:
    news_agg_dict[col] = ['mean', 'sum', 'max', 'min']
news_agg_dict['urgency'] = ['min', 'count']
news_agg_dict['takeSequence'] = ['max']
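
As a rough illustration of what a dictionary like this produces, the sketch below runs a trimmed-down version on a hypothetical three-row news frame. The names `toy_news` and `toy_agg_dict` are made up; only the `groupby(...).agg(dict)` pattern and the column flattening mirror the notebook.

```python
import pandas as pd

# Hypothetical news rows: two stories for one (time, assetCode) pair, one for another.
toy_news = pd.DataFrame({
    'time': ['2016-12-22 22', '2016-12-22 22', '2016-12-22 22'],
    'assetCode': ['AAA.N', 'AAA.N', 'BBB.O'],
    'relevance': [1.0, 0.5, 0.25],
    'urgency': [1, 3, 3],
})

# Same shape as news_agg_dict, trimmed to two columns.
toy_agg_dict = {'relevance': ['mean', 'max'], 'urgency': ['min', 'count']}

toy_agg = toy_news.groupby(['time', 'assetCode']).agg(toy_agg_dict)

# agg with a dict of lists yields two-level column names; flatten them to 'column_stat'.
toy_agg.columns = ['_'.join(col) for col in toy_agg.columns]
toy_agg = toy_agg.reset_index()

print(toy_agg)
```

Flattening before `reset_index` keeps the key columns named `time` and `assetCode`. Flattening afterwards, as the join function further down does, yields `time_` and `assetCode_`, which is why that function renames them.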

### Joining Market & News Data

The grouping method that I'll be using is from [bguberfain](https://www.kaggle.com/bguberfain), but I'll also be adding other columns like ``headline``, as well as eliminating rows that have no partner in either the market or news data. One improvement would be to group by time periods rather than the exact values in ``time``, since few market and news rows share exactly the same timestamp. That might also make the join a bit more efficient.

Notes: 
* When you run the full dataset, expect it to take a while.
* The coarser the time key becomes (dropping seconds, then minutes, and so on up to the year), the more news rows match each market row, so the resulting train data becomes larger and larger.
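
The effect of the time granularity can be seen on a few made-up timestamps. This sketch uses `dt.floor` on real datetimes as an alternative to the string slicing in `generalize_time`; the timestamps and the helper `matching_news` are hypothetical.

```python
import pandas as pd

# Hypothetical timestamps: one market row at the close and three news items that day.
market_times = pd.Series(pd.to_datetime(['2016-12-22 22:00:00']))
news_times = pd.Series(pd.to_datetime([
    '2016-12-22 14:40:01',
    '2016-12-22 22:30:00',
    '2016-12-22 22:00:00',
]))

def matching_news(freq):
    """Count news rows whose time bucket equals the market row's bucket."""
    market_bucket = market_times.dt.floor(freq).iloc[0]
    return int((news_times.dt.floor(freq) == market_bucket).sum())

# Coarser buckets pair more news rows with the same market row.
for freq in ['min', 'h', 'D']:
    print(freq, matching_news(freq))
```

With minute buckets only the exact-time story matches, while daily buckets pull in every story from that date, which is the growth described in the note above.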

In [9]:
def generalize_time(X):
    # convert time to string and/or get rid of Hours, Minutes, and seconds
    X['time'] = X['time'].dt.strftime('%Y-%m-%d %H:%M:%S').str.slice(0,16) #(0,10) for Y-m-d, (0,13) for Y-m-d H

# keep only the rows of df whose index appears in indecies
def get_indecies(df, indecies):
    
    # vectorized membership test on the index; this avoids a per-row Python lookup
    # and no longer adds helper columns to the caller's dataframe.
    # copy so that later column assignments do not operate on a view
    return df[df.index.isin(indecies)].copy()

# this function checks for potential nulls after grouping by only grouping the time and assetcode dataframe
# returns valid news indecies for the next if statement.
def partial_groupby(market_df, news_df, df_assetCodes):
    
    # get new dataframe
    temp_news_df_expanded = pd.merge(df_assetCodes, news_df[['time', 'assetCodes']], left_on='level_0', right_index=True, suffixes=(['','_old']))

    # groupby dataframes
    temp_news_df = temp_news_df_expanded.copy()[['time', 'assetCode']]
    temp_market_df = market_df.copy()[['time', 'assetCode']]

    # get indecies on both dataframes
    temp_news_df['news_index'] = temp_news_df.index.values
    temp_market_df['market_index'] = temp_market_df.index.values

    # set multiindex and join the two
    temp_news_df.set_index(['time', 'assetCode'], inplace=True)

    # join the two
    temp_market_df_2 = temp_market_df.join(temp_news_df, on=['time', 'assetCode'])
    del temp_market_df, temp_news_df

    # drop nulls in any columns
    temp_market_df_2 = temp_market_df_2.dropna()

    # get indecies
    market_valid_indecies = temp_market_df_2['market_index'].tolist()
    news_valid_indecies = temp_market_df_2['news_index'].tolist()
    del temp_market_df_2

    # get index rows
    market_df = get_indecies(market_df, market_valid_indecies)
    
    return market_df, news_valid_indecies

def join_market_news(market_df, news_df, nulls=False):
    
    # convert time to string
    generalize_time(market_df)
    generalize_time(news_df)
    
    # Fix asset codes (str -> list)
    news_df['assetCodes'] = news_df['assetCodes'].str.findall(r"'([\w\./]+)'")

    # Expand assetCodes
    assetCodes_expanded = list(chain(*news_df['assetCodes']))
    assetCodes_index = news_df.index.repeat( news_df['assetCodes'].apply(len) )
    
    assert len(assetCodes_index) == len(assetCodes_expanded)
    df_assetCodes = pd.DataFrame({'level_0': assetCodes_index, 'assetCode': assetCodes_expanded})
    
    if not nulls:
        market_df, news_valid_indecies = partial_groupby(market_df, news_df, df_assetCodes)
    
    # create dataframe based on groupby
    news_col = ['time', 'assetCodes', 'headline', 'audiences', 'subjects'] + sorted(list(news_agg_dict.keys()))
    news_df_expanded = pd.merge(df_assetCodes, news_df[news_col], left_on='level_0', right_index=True, suffixes=(['','_old']))
    
    # check if the columns are in the index
    if not nulls:
        news_df_expanded = get_indecies(news_df_expanded, news_valid_indecies)

    def news_df_feats(x):
        if x.name == 'headline':
            return list(x)
        elif x.name == 'subjects' or x.name == 'audiences':
            output = []
            for i in x:
                # remove all special characters
                codes = i.strip('{\',}').replace('\'','').split(', ')
                for j in codes:
                    output.append(j)
            return output
                
                
    
    # groupby time and assetcode
    news_df_expanded = news_df_expanded.reset_index()
    news_groupby = news_df_expanded.groupby(['time', 'assetCode'])
    
    # get aggregated df
    news_df_aggregated = news_groupby.agg(news_agg_dict).apply(np.float32).reset_index()
    news_df_aggregated.columns = ['_'.join(col).strip() for col in news_df_aggregated.columns.values] # columns are abnormal
    
    # get any important string dataframes
    groupby_news = news_groupby.transform(lambda x: news_df_feats(x))
    news_df_cat = pd.DataFrame({'headline':groupby_news['headline'],
                               'subjects':groupby_news['subjects'],
                               'audiences':groupby_news['audiences']})
    new_news_df = pd.concat([news_df_aggregated, news_df_cat], axis=1)
    
    # cleanup
    del news_df_aggregated
    del news_df_cat
    del news_df
    
    # rename columns
    new_news_df.rename(columns={'time_': 'time', 'assetCode_': 'assetCode'}, inplace=True)
    new_news_df.set_index(['time', 'assetCode'], inplace=True)
    
    # Join with train
    market_df = market_df.join(new_news_df, on=['time', 'assetCode'])
    
    # replace with null string
    market_df[['audiences', 'subjects', 'headline']] = market_df[['audiences', 'subjects', 'headline']].fillna('null')

    return market_df


# if the join raises an error, the market and news dataframes probably share no (time, assetCode) pairs (solution: increase train dataset)
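
The "expand assetCodes" step is the least obvious part of the join, so here is a minimal standard-library sketch of the same idea. The input strings are hypothetical values in the style of the ``assetCodes`` column; `row_index` plays the role of the `level_0` column.

```python
import re
from itertools import chain

# Hypothetical raw values in the style of the assetCodes column (sets stored as strings).
raw_asset_codes = ["{'XON.DE', 'XOM.N'}", "{'F.N'}"]

# Same pattern as in join_market_news, written as a raw string.
pattern = re.compile(r"'([\w\./]+)'")
parsed = [pattern.findall(s) for s in raw_asset_codes]

# Flatten the codes and remember which news row each code came from.
codes_flat = list(chain.from_iterable(parsed))
row_index = [i for i, codes in enumerate(parsed) for _ in codes]

print(list(zip(row_index, codes_flat)))
```

Each news row is repeated once per asset code it mentions, so a story tagged with several codes can be matched against several market rows.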

In [19]:
%%time
X_train = join_market_news(market_train_df, news_train_df, nulls=False)

CPU times: user 18.9 s, sys: 104 ms, total: 19 s
Wall time: 19 s


In [11]:
X_train.head()

Unnamed: 0,time,assetCode,assetName,volume,close,open,returnsClosePrevRaw1,returnsOpenPrevRaw1,returnsClosePrevMktres1,returnsOpenPrevMktres1,returnsClosePrevRaw10,returnsOpenPrevRaw10,returnsClosePrevMktres10,returnsOpenPrevMktres10,returnsOpenNextMktres10,universe,vol_by_name,vol_by_name_diff,bodySize_mean,bodySize_sum,bodySize_max,bodySize_min,companyCount_mean,companyCount_sum,companyCount_max,companyCount_min,marketCommentary_mean,marketCommentary_sum,marketCommentary_max,marketCommentary_min,sentenceCount_mean,sentenceCount_sum,sentenceCount_max,sentenceCount_min,wordCount_mean,wordCount_sum,wordCount_max,wordCount_min,relevance_mean,relevance_sum,...,noveltyCount24H_max,noveltyCount24H_min,noveltyCount3D_mean,noveltyCount3D_sum,noveltyCount3D_max,noveltyCount3D_min,noveltyCount5D_mean,noveltyCount5D_sum,noveltyCount5D_max,noveltyCount5D_min,noveltyCount7D_mean,noveltyCount7D_sum,noveltyCount7D_max,noveltyCount7D_min,volumeCounts12H_mean,volumeCounts12H_sum,volumeCounts12H_max,volumeCounts12H_min,volumeCounts24H_mean,volumeCounts24H_sum,volumeCounts24H_max,volumeCounts24H_min,volumeCounts3D_mean,volumeCounts3D_sum,volumeCounts3D_max,volumeCounts3D_min,volumeCounts5D_mean,volumeCounts5D_sum,volumeCounts5D_max,volumeCounts5D_min,volumeCounts7D_mean,volumeCounts7D_sum,volumeCounts7D_max,volumeCounts7D_min,urgency_min,urgency_count,takeSequence_max,headline,subjects,audiences
4062964,2016-12-22 22,LMT.N,Lockheed Martin Corp,831089.0,252.8,252.01,0.001109,-0.009784,0.001196,-0.010157,-0.024879,-0.053945,-0.008481,-0.03322,0.04761,1.0,954302.5,-123213.5,217.600006,1088.0,652.0,0.0,1.6,8.0,2.0,1.0,0.2,1.0,1.0,0.0,2.2,11.0,5.0,1.0,66.400002,332.0,151.0,12.0,0.894281,4.471405,...,1.0,0.0,1.0,5.0,2.0,0.0,1.0,5.0,2.0,0.0,1.0,5.0,2.0,0.0,3.0,15.0,5.0,1.0,10.0,50.0,11.0,9.0,24.6,123.0,26.0,23.0,30.0,150.0,32.0,28.0,31.0,155.0,33.0,29.0,1.0,5.0,1.0,[PRESTIGE BRANDS ANNOUNCES AGREEMENT TO ACQUIR...,"[BLR, RDRU, FDRT, AGA, US, CMPNY, SHOPAL, RTRS...","[E, U, BSW, CNR, E, U, E, U, E, U, E, U, E, U,..."
4063015,2016-12-22 22,MCD.N,McDonald's Corp,3037126.0,123.72,123.12,0.004384,-0.000244,0.00502,0.001127,0.027148,0.025573,0.021236,0.014277,-0.029052,1.0,2072290.0,964836.0,2764.0,2764.0,2764.0,2764.0,1.0,1.0,1.0,1.0,0.0,0.0,0.0,0.0,16.0,16.0,16.0,16.0,498.0,498.0,498.0,498.0,0.353553,0.353553,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,2.0,2.0,2.0,2.0,2.0,2.0,2.0,2.0,2.0,2.0,2.0,2.0,2.0,2.0,2.0,2.0,8.0,8.0,8.0,8.0,3.0,1.0,1.0,[PRESTIGE BRANDS ANNOUNCES AGREEMENT TO ACQUIR...,"[BLR, RDRU, FDRT, AGA, US, CMPNY, SHOPAL, RTRS...","[E, U, BSW, CNR, E, U, E, U, E, U, E, U, E, U,..."
4063032,2016-12-22 22,MEOH.O,Methanex Corp,477658.0,44.4,45.05,-0.014428,-0.00442,-0.014976,-0.000892,0.000441,0.012819,0.013607,0.033811,0.027325,1.0,384433.5,93224.5,16357.0,16357.0,16357.0,16357.0,39.0,39.0,39.0,39.0,0.0,0.0,0.0,0.0,73.0,73.0,73.0,73.0,2975.0,2975.0,2975.0,2975.0,0.014778,0.014778,...,2.0,2.0,6.0,6.0,6.0,6.0,6.0,6.0,6.0,6.0,6.0,6.0,6.0,6.0,3.0,3.0,3.0,3.0,3.0,3.0,3.0,3.0,8.0,8.0,8.0,8.0,8.0,8.0,8.0,8.0,8.0,8.0,8.0,8.0,3.0,1.0,1.0,[Acceleron Announces First Patient Treated in ...,"[GEN, NEWR, HECA, PHMR, MRCH, HEA, US, CMPNY, ...","[BSW, CNR, E, U, E, U]"
4063035,2016-12-22 22,MFC.N,Manulife Financial Corp,2032250.0,18.11,18.3,-0.013617,-0.010276,-0.011132,-0.003661,-0.020022,0.017798,-0.032808,-0.006542,0.003204,1.0,1883139.5,149110.5,3740.0,3740.0,3740.0,3740.0,29.0,29.0,29.0,29.0,0.0,0.0,0.0,0.0,17.0,17.0,17.0,17.0,691.0,691.0,691.0,691.0,0.074536,0.074536,...,2.0,2.0,4.0,4.0,4.0,4.0,4.0,4.0,4.0,4.0,4.0,4.0,4.0,4.0,5.0,5.0,5.0,5.0,5.0,5.0,5.0,5.0,14.0,14.0,14.0,14.0,14.0,14.0,14.0,14.0,16.0,16.0,16.0,16.0,3.0,1.0,1.0,[Acceleron Announces First Patient Treated in ...,"[GEN, NEWR, HECA, PHMR, MRCH, HEA, US, CMPNY, ...","[BSW, CNR, E, U, E, U]"
4063058,2016-12-22 22,MNK.N,Mallinckrodt Plc,1692417.0,52.03,52.86,-0.007629,-0.005456,-0.009629,-0.005219,-0.006113,0.021844,-0.018195,-0.002696,0.030322,1.0,1674999.0,17418.0,0.0,0.0,0.0,0.0,2.0,2.0,2.0,2.0,0.0,0.0,0.0,0.0,1.0,1.0,1.0,1.0,31.0,31.0,31.0,31.0,1.0,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,4.0,4.0,4.0,4.0,4.0,4.0,4.0,4.0,4.0,4.0,4.0,4.0,4.0,4.0,4.0,4.0,4.0,4.0,4.0,4.0,1.0,1.0,1.0,[PRESTIGE BRANDS ANNOUNCES AGREEMENT TO ACQUIR...,"[BLR, RDRU, FDRT, AGA, US, CMPNY, SHOPAL, RTRS...","[E, U, BSW, CNR, E, U, E, U, E, U, E, U, E, U,..."


In [12]:
X_train.shape

(100, 108)

### Text Processing with Logistic Regression

We are going to vectorize the headlines and fit a logistic regression on them, with a binary label for whether the stock's return is non-negative. In a nutshell, we split the headlines into individual words, filter out unnecessary words to prevent abnormal results, and vectorize the result for modelling. With the target column provided, we can then build a dictionary of per-word coefficients and use it as a feature in the dataframe. Right now I am just taking the mean of the coefficients over each list of headlines.

Note: May be useful to apply it to ``universe``, and possibly get the sum or standard deviation of the word coefficients?
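
The averaging step can be sketched without any model at all. The coefficients below are hypothetical stand-ins for fitted logistic regression weights, and `headline_score` and `row_score` are illustrative helpers rather than functions from this notebook.

```python
# Hypothetical per-word coefficients, standing in for fitted logistic regression weights.
toy_coeffs = {'profit': 0.8, 'loss': -0.6, 'merger': 0.2}
toy_default = sum(toy_coeffs.values()) / len(toy_coeffs)  # fallback for unseen words

def headline_score(words, coeffs, default):
    """Average coefficient over the words of one cleaned headline."""
    if not words:
        return default
    return sum(coeffs.get(w, default) for w in words) / len(words)

def row_score(headlines, coeffs, default):
    """Average of the headline scores for one (time, assetCode) row."""
    scores = [headline_score(h.split(), coeffs, default) for h in headlines]
    return sum(scores) / len(scores)

print(row_score(['profit merger', 'loss'], toy_coeffs, toy_default))
```

Swapping the outer mean for a sum or a standard deviation, as the note above suggests, would only change `row_score`.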

In [None]:
# reuse data
def round_scores(x):
    if x >= 0:
        return 1
    else:
        return 0
    
# build these once; calling stopwords.words('english') for every word is very slow
STOP_WORDS = set(stopwords.words('english'))
STEMMER = PorterStemmer()

def clean_headlines(headline):
    
    # replace non-letters with spaces and convert to lowercase
    headline = re.sub('[^a-zA-Z]', ' ', headline)
    headline = headline.lower()
    
    # drop stopwords (split() also discards the empty tokens left by repeated spaces)
    headline_words = [word for word in headline.split() if word not in STOP_WORDS]
    
    # use stemming to simplify words
    headline_words = [STEMMER.stem(word) for word in headline_words]
    
    # join sentence back again
    return ' '.join(headline_words)

# these functions should only go towards the training data only
def get_headline_df(X_train):
    
    headlines_lst = []
    target_lst = []
    
    # iter through every headline.
    for row in range(0,len(X_train.index)):
        for sentence in X_train['headline'].iloc[row]:
            headlines_lst.append(clean_headlines(sentence))
            target_lst.append(round_scores(X_train['returnsOpenNextMktres10'].iloc[row]))
            
    # return dataframe
    return pd.DataFrame({'headline':pd.Series(headlines_lst), 'returnsOpenNextMktres10':pd.Series(target_lst)})
    
def get_headline(headlines_df):
    
    # get headlines as list (use only headline_df produced by get_headline_df)
    headlines_lst = []
    for row in range(0,len(headlines_df.index)):
        headlines_lst.append(headlines_df.iloc[row])

    # split headlines to separate words
    basicvectorizer = CountVectorizer()
    headlines_vectorized = basicvectorizer.fit_transform(headlines_lst)
    
    print(headlines_vectorized.shape)
    return headlines_vectorized, basicvectorizer

def headline_mapping(target, headlines_vectored, headline_vectorizer):
    
    print(np.asarray(target).shape)
    headline_model = LogisticRegression()
    headline_model = headline_model.fit(headlines_vectored, target)
    
    # get coefficients
    basicwords = headline_vectorizer.get_feature_names()
    basiccoeffs = headline_model.coef_.tolist()[0]
    coeff_df = pd.DataFrame({'Word' : basicwords, 
                            'Coefficient' : basiccoeffs})
    
    # convert dataframe to dictionary of coefficients
    coefficient_dict = dict(zip(coeff_df.Word, coeff_df.Coefficient))

    return coefficient_dict, coeff_df['Coefficient'].mean()

# for predictions
def get_coeff_col(X, coeff_dict, coeff_default):
    
    def get_coeff(word_lst):
        
        # iter through every word
        coeff_sum = 0
        for word in word_lst:
            if word in coeff_dict:
                coeff_sum += coeff_dict[word]
            else:
                coeff_sum += coeff_default
        
        # get average coefficient
        coeff_score = coeff_sum / len(word_lst)
        return coeff_score
        
    # loop through every item
    headlines_coeff_lst = []
    for row in range(0,len(X['headline'].index)):
        coeff_score = 0
        if X['headline'].iloc[row] == 'null':
            # no news for this row: record NaN and move on (break would stop the whole loop)
            headlines_coeff_lst.append(np.nan)
            continue
        for i in range(0,len(X['headline'].iloc[row])):
            coeff_score += get_coeff(clean_headlines(str(X['headline'].iloc[row][i])).split(' '))
        headlines_coeff_lst.append(coeff_score / len(X['headline'].iloc[row]))
        
    # merge coefficient frame with main
    coeff_mean_df = pd.DataFrame({'headline_coeff_mean': pd.Series(headlines_coeff_lst)})
    X = pd.concat([X.reset_index(), coeff_mean_df], axis=1)
    
    return X

In [None]:
headline_df = get_headline_df(X_train)
coefficient_dict, coefficient_default = headline_mapping(headline_df['returnsOpenNextMktres10'],
                                            *get_headline(headline_df['headline']))

In [None]:
# will be applied to X_test as well
X_train = get_coeff_col(X_train, coefficient_dict, coefficient_default)

### News Groupby Features
We are going to be looking specifically at the ``audiences`` and ``subjects`` column from the news dataframe. Here are some ideas:
* Get the number of times a certain subject/audience occurs, and get the sum of the list of audiences/subjects.
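
A small sketch of that idea with `collections.Counter`, on hypothetical subject lists. The string `'null'` stands for rows without news, as in the joined dataframe; the variable names are made up.

```python
from collections import Counter

# Hypothetical per-row subject lists, in the shape produced after the join.
toy_subjects = [['US', 'CMPNY', 'RTRS'], ['US', 'HEA'], 'null']

# Frequency of each code across all rows, skipping the 'null' placeholder.
code_counts = Counter(code for row in toy_subjects if row != 'null' for code in row)

# Per-row feature: sum of the global frequencies of the codes in that row.
row_totals = [
    sum(code_counts[code] for code in row) if row != 'null' else 0
    for row in toy_subjects
]

print(code_counts['US'], row_totals)
```

Rows that mention widely used codes get a high total, while rows without news get 0.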

In [None]:
# this is set up for list of strings columns
def get_feature_count(X_feat):
    
    # local import keeps this cell self-contained
    from collections import Counter
    
    # count every item across all rows in a single pass, skipping 'null' rows
    item_map = Counter()
    for items in X_feat:
        if items != 'null':
            item_map.update(items)
    
    # return a plain frequency dictionary
    return dict(item_map)

def get_feature_count_total(X_feat, item_map):
    
    # iter through every item and get total count
    counts = []
    for row in range(0,len(X_feat.index)):
        count = 0
        if X_feat.iloc[row] != 'null':
            for i in range(0, len(X_feat.iloc[row])): # this is what is causing the error.
                count += item_map[X_feat.iloc[row][i]]
        counts.append(count)
            
    return pd.Series(counts)
    
def news_grouping_features(X):
    
    # account for all possible nulls
    X[['audiences', 'subjects']] = X[['audiences', 'subjects']].fillna('null')
    
    # get map
    audience_map = get_feature_count(X['audiences'])
    subjects_map = get_feature_count(X['subjects'])
    
    # get count of each item, reusing the maps computed above
    X['audiences_count'] = get_feature_count_total(X['audiences'], audience_map)
    X['subjects_count'] = get_feature_count_total(X['subjects'], subjects_map)
    
    return X

In [None]:
X_train = news_grouping_features(X_train)

### Clustering
We are going to be clustering a few columns together (mainly to see how this will affect our results). 

In [None]:
def clustering(df):

    def cluster_modelling(features):
        df_set = df[features]
        cluster_model = KMeans(n_clusters = 8)
        cluster_model.fit(df_set)
        return cluster_model.predict(df_set)
    
    # get columns:
    vol_cols = [f for f in df.columns if f != 'volume' and 'volume' in f]
    novelty_cols = [f for f in df.columns if 'novelty' in f]
    
    # fill nulls
    cluster_cols = novelty_cols + vol_cols + ['open', 'close']
    df[cluster_cols] = df[cluster_cols].fillna(0)
    
    df['cluster_open_close'] = cluster_modelling(['open', 'close'])
    df['cluster_volume'] = cluster_modelling(vol_cols)
    df['cluster_novelty'] = cluster_modelling(novelty_cols)
    
    return df

In [None]:
X_train = clustering(X_train)

### Extra Features

Here are some basic extra features from other notebooks.

In [None]:
def extra_features(df):
    
    # Adding daily difference
    new_col = df["close"] - df["open"]
    df.insert(loc=6, column="daily_diff", value=new_col)
    df['close_to_open'] =  np.abs(df['close'] / df['open'])
    
    return df

In [None]:
X_train = extra_features(X_train)

### Get Time Features

This section splits the timestamp column into separate columns and adds various other time features.

Possible idea: Encoding time?
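
One possible reading of "encoding time" is a cyclical (sine/cosine) encoding, which places periodic values such as the hour or month on a circle. This is an assumption about what the idea means, not something the notebook implements; `cyclical_encode` is a hypothetical helper.

```python
import math

def cyclical_encode(value, period):
    """Map a periodic value (e.g. hour 0-23) onto a point of the unit circle."""
    angle = 2 * math.pi * value / period
    return math.sin(angle), math.cos(angle)

# Hour 23 and hour 0 end up close together, unlike the raw integers 23 and 0.
h23 = cyclical_encode(23, 24)
h0 = cyclical_encode(0, 24)
h12 = cyclical_encode(12, 24)

print(math.dist(h23, h0), math.dist(h12, h0))
```

Tree-based models can often cope with raw integers, so whether this helps here would need to be tested.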

In [None]:
# ripped from my previous kernel, NYC Taxi Fare

# first get dates
def split_time(df):
    
    # split date_time into categories
    df['time_day'] = df['time'].str.slice(8,10)
    df['time_month'] = df['time'].str.slice(5,7)
    df['time_year'] = df['time'].str.slice(0,4)
    df['time_hour'] = df['time'].str.slice(11,13)
    df['time_minute'] = df['time'].str.slice(14,16)
    
    # source: https://www.kaggle.com/nicapotato/taxi-rides-time-analysis-and-oof-lgbm
    df['temp_time'] = df['time'].str.replace(" UTC", "")
    # the format must match the string produced by generalize_time (slice(0,16) gives 'Y-m-d H:M')
    df['temp_time'] = pd.to_datetime(df['temp_time'], format='%Y-%m-%d %H:%M')
    
    df['time_day_of_year'] = df.temp_time.dt.dayofyear
    df['time_week_of_year'] = df.temp_time.dt.weekofyear
    df["time_weekday"] = df.temp_time.dt.weekday
    df["time_quarter"] = df.temp_time.dt.quarter
    
    del df['temp_time']
    gc.collect()
    
    # convert to non-object columns
    time_feats = ['time_day', 'time_month', 'time_year']
    df[time_feats] = df[time_feats].apply(pd.to_numeric)
    
    # determine whether the day is set on a holiday
    # (compare date strings with date strings; a string never equals a datetime object)
    cal = USFederalHolidayCalendar()
    holidays = set(cal.holidays(start='2007-01-01', end='2018-09-27').strftime('%Y-%m-%d'))
    df['on_holiday'] = df['time'].str.slice(0,10).apply(lambda x: 1 if x in holidays else 0)
    
    return df

In [None]:
X_train = split_time(X_train)

Here we remove all the excess columns and use ``pd.get_dummies`` on ``assetCode`` and ``assetName`` to process these categorical features.

In [None]:
def misc_adjustments(X):
    del_cols = ['index'] + [f for f in X.columns if X[f].dtype == 'object' and f != 'assetCode' and f != 'assetName']
    for f in del_cols:
        del X[f]
        
    # encode data
    X = pd.get_dummies(X)
    
    return X
        
#     # categorize assetCode and assetName
#     from sklearn.preprocessing import LabelEncoder
#     le = LabelEncoder()
#     X = X.assign(assetCode = le.fit_transform(X.assetCode))

In [None]:
X_train = misc_adjustments(X_train)

### Compile X functions into one function

This will be used when looping through different batches of X_test

In [None]:
def get_X(market_df, news_df):
    
    # these are all the functions applied to X_train except for a few
    market_df = mean_volume(market_df)
    X_test = join_market_news(market_df, news_df, nulls=True)
    X_test = get_coeff_col(X_test, coefficient_dict, coefficient_default)
    X_test = news_grouping_features(X_test)
    X_test = clustering(X_test)
    X_test = extra_features(X_test)
    X_test = split_time(X_test)
    X_test = misc_adjustments(X_test)
    
    return X_test
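
Because `pd.get_dummies` builds its columns from whatever values are present, a test batch can end up with a different set of columns than the training data. One way to guard against that is to reindex each batch to the training columns. The sketch below assumes the training column names are kept around; the toy frames and `train_cols` are hypothetical.

```python
import pandas as pd

# Hypothetical frames: training saw two asset codes, the test batch sees a different pair.
toy_train = pd.get_dummies(pd.DataFrame({'volume': [1.0, 2.0], 'assetCode': ['AAA.N', 'BBB.O']}))
toy_test = pd.get_dummies(pd.DataFrame({'volume': [3.0, 4.0], 'assetCode': ['BBB.O', 'CCC.N']}))

# Keep exactly the training columns, in training order: dummies missing from the
# batch are filled with 0, and dummies unseen during training are dropped.
train_cols = toy_train.columns.tolist()
toy_test_aligned = toy_test.reindex(columns=train_cols, fill_value=0)

print(toy_test_aligned.columns.tolist())
```

In this notebook the natural place for such a step would be the end of `get_X`, using the feature columns the model was trained on.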

#### Resulting Dataframe and Data Correlation to Target column
We have gone from roughly 50 columns to 135!

In [None]:
X_train.head(10)

### Using LightGBM for Modelling + Remove unnecessary features

We are going to use parameters from another notebook to model our data, and loop through the data until we reach a certain score. One mistake that was causing my score to drop was removing the ``assetCode`` column.

Notes: 
* Might possibly add Bayesian optimization if necessary?
* ValueError: could not convert string to float: 'Packaging Corp of America' <--- occurs without label encoding ``assetName``.

In [None]:
def set_data(X_train):

    # get X and Y
    y_train = X_train['returnsOpenNextMktres10']
    del X_train['returnsOpenNextMktres10'], X_train['universe']
#     X_train = pd.get_dummies(X_train)
    
    # split data (random 75/25 hold-out split used for validation and early stopping)
    x1, x2, y1, y2 = train_test_split(X_train, 
                                      y_train, 
                                      test_size=0.25, 
                                      random_state=99)
    
    # get columns
    train_cols = X_train.columns.tolist()
#     categorical_cols = ['assetCode', 'assetName']
    
#     # convert to LGBM Data Structures (with categorical features; produces errors)
#     dtrain = lgb.Dataset(x1.values, y1, feature_name=train_cols, categorical_feature=categorical_cols)
#     dvalid = lgb.Dataset(x2.values, y2, feature_name=train_cols, categorical_feature=categorical_cols)
    
    # convert to LGBM Data Structures
    dtrain = lgb.Dataset(x1.values, y1, feature_name=train_cols)
    dvalid = lgb.Dataset(x2.values, y2, feature_name=train_cols)
    
    return dtrain, dvalid
    
def lgbm_training(dtrain, dvalid):
    
    # set model and parameters
    params = {'learning_rate': 0.02,
              'boosting': 'gbdt', 
              'objective': 'regression', 
              'seed': 2018}
    
    # train data
    lgb_model = lgb.train(params, dtrain, 
                          num_boost_round=1000, 
                          valid_sets=(dvalid,), 
                          valid_names=('valid',), 
                          verbose_eval=25, 
                          early_stopping_rounds=20)
    
    return lgb_model

In [None]:
lgb_model = lgbm_training(*set_data(X_train))

### Making Predictions

The difference between the training and test data is these two columns, ``['returnsOpenNextMktres10', 'universe']``, which the test data lacks. We will predict ``returnsOpenNextMktres10`` and use that prediction as the ``confidenceValue``.
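
Since a regression output is not bounded, the predictions are clipped to [-1, 1] before being used as ``confidenceValue``. A tiny sketch with made-up numbers (`raw_predictions` is hypothetical):

```python
import numpy as np

# Hypothetical raw model outputs; a regression model is not limited to [-1, 1].
raw_predictions = np.array([-1.7, -0.2, 0.05, 2.3])

# Values outside [-1, 1] are clamped to the nearest bound; the rest are unchanged.
confidence = np.clip(raw_predictions, -1, 1)

print(confidence)
```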

In [None]:
%%time

def make_predictions(market_obs_df, news_obs_df):
    
    # predict using given model
    X_test = get_X(market_obs_df, news_obs_df)
    prediction_values = np.clip(lgb_model.predict(X_test), -1, 1)

    return prediction_values

for (market_obs_df, news_obs_df, predictions_template_df) in env.get_prediction_days(): # Looping over days from start of 2017 to 2019-07-15
    
    # make predictions
    predictions_template_df['confidenceValue'] = make_predictions(market_obs_df, news_obs_df)
    
    # save predictions
    env.predict(predictions_template_df)


### Export Submission

In [None]:
# exports csv
env.write_submission_file()
print('finished!')

### References:
* [Getting Started - DJ Sterling](https://www.kaggle.com/dster/two-sigma-news-official-getting-started-kernel)
* [a simple model - Bruno G. do Amaral](https://www.kaggle.com/bguberfain/a-simple-model-using-the-market-data)
* [LGBM Model - the1owl](https://www.kaggle.com/the1owl/my-two-sigma-cents-only)
* [Headline Processing - Andrew Gelé](https://www.kaggle.com/ndrewgele/omg-nlp-with-the-djia-and-reddit)
* [Feature engineering - Andrew Lukyanenko](https://www.kaggle.com/artgor/eda-feature-engineering-and-everything)
* [Basic Text Processing - akatsuki06](https://www.kaggle.com/akatsuki06/basic-text-processing-cleaning-the-description)
* [The fallacy of encoding assetCode - marketneutral](https://www.kaggle.com/marketneutral/the-fallacy-of-encoding-assetcode) *I know it's contradictory since I'm encoding my assetCode and assetName features for now.*