# Two Sigma Financial News Competition Official Getting Started Kernel
## Introduction
In this competition you will predict how stocks will change based on the market state and news articles.  You will loop through a long series of trading days; for each day, you'll receive an updated state of the market, and a series of news articles which were published since the last trading day, along with impacted stocks and sentiment analysis.  You'll use this information to predict whether each stock will have increased or decreased ten trading days into the future.  Once you make these predictions, you can move on to the next trading day. 

This competition is different from most Kaggle Competitions in that:
* You can only submit from Kaggle Kernels, and you may not use other data sources, GPU, or internet access.
* This is a **two-stage competition**.  In Stage One you can edit your Kernels and improve your model; Public Leaderboard scores are based on your Kernels' predictions relative to past market data.  At the beginning of Stage Two, your Kernels are locked, and we will re-run them over the next six months, scoring them based on their predictions relative to live data as those six months unfold.
* You must use our custom **`kaggle.competitions.twosigmanews`** Python module.  The purpose of this module is to control the flow of information to ensure that you are not using future data to make predictions for the current trading day.

In this Starter Kernel, we'll show how to use the **`twosigmanews`** module to get the training data, get test features and make predictions, and write the submission file.
## TL;DR: End-to-End Usage Example
```python
from kaggle.competitions import twosigmanews
env = twosigmanews.make_env()

(market_train_df, news_train_df) = env.get_training_data()
train_my_model(market_train_df, news_train_df)

for (market_obs_df, news_obs_df, predictions_template_df) in env.get_prediction_days():
  predictions_df = make_my_predictions(market_obs_df, news_obs_df, predictions_template_df)
  env.predict(predictions_df)
  
env.write_submission_file()
```
Note that `train_my_model` and `make_my_predictions` are functions you need to write for the above example to work.
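
As a minimal sketch of what those two functions might look like, the placeholder versions below skip training entirely and predict the sign of each asset's previous ten-day market-residualized return. This is a hypothetical baseline for illustration only, not a recommended model. It assumes `market_obs_df` has the `assetCode` and `returnsOpenPrevMktres10` columns shown in the observation output later in this Kernel, with one row per asset.

```python
import numpy as np
import pandas as pd

def train_my_model(market_train_df, news_train_df):
    """Placeholder: this hypothetical baseline has nothing to fit."""
    return None

def make_my_predictions(market_obs_df, news_obs_df, predictions_template_df):
    """Fill confidenceValue with the sign of the previous 10-day residual return."""
    # Assumption: one row per assetCode in market_obs_df; keep the first if not.
    signal = (market_obs_df.drop_duplicates('assetCode')
              .set_index('assetCode')['returnsOpenPrevMktres10'])
    predictions_df = predictions_template_df.copy()   # leave the template untouched
    values = predictions_df['assetCode'].map(signal).fillna(0.0)
    # np.sign gives -1.0, 0.0 or 1.0, which is already inside [-1.0, 1.0].
    predictions_df['confidenceValue'] = np.sign(values)
    return predictions_df
```

Assets with a missing return (such as `AA.N` in the sample output) keep a `confidenceValue` of 0.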

## In-depth Introduction
First let's import the module and create an environment.

In [4]:
from kaggle.competitions import twosigmanews
# make_env() can only be called once per Kernel run, so keep a reference to `env`.
env = twosigmanews.make_env()

## **`get_training_data`** function

Returns the training data DataFrames as a tuple of:
* `market_train_df`: DataFrame with market training data
* `news_train_df`: DataFrame with news training data

These DataFrames contain all market and news data from February 2007 to December 2016.  See the [competition's Data tab](https://www.kaggle.com/c/two-sigma-financial-news/data) for more information on what columns are included in each DataFrame.
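
Because the training data spans several years, you may want to hold out the most recent part for validation rather than sampling rows at random, in keeping with the rule that predictions must not use future data. The helper below is a minimal sketch of such a split. `split_by_time` and the cutoff date are hypothetical; the only thing assumed about the data is the `time` column shown in the outputs in this Kernel.

```python
import pandas as pd

def split_by_time(df, cutoff, time_col='time'):
    """Split df into (rows before cutoff, rows at or after cutoff).

    `cutoff` is a date string such as '2016-01-01', interpreted as UTC.
    """
    times = pd.to_datetime(df[time_col], utc=True)
    cutoff_ts = pd.Timestamp(cutoff, tz='UTC')
    before = times < cutoff_ts
    return df[before], df[~before]
```

For example, `split_by_time(market_train_df, '2016-01-01')` would return everything before 2016 as the first DataFrame and the remainder as the second.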

In [11]:
import pandas as pd
import warnings; warnings.simplefilter('ignore')
news_train_dir = "./new_train_df.csv"
news_train_df = pd.read_csv(news_train_dir)

In [13]:
news_train_df.head(10)

Unnamed: 0.1,Unnamed: 0,time,sourceTimestamp,firstCreated,sourceId,headline,urgency,takeSequence,provider,subjects,...,noveltyCount12H,noveltyCount24H,noveltyCount3D,noveltyCount5D,noveltyCount7D,volumeCounts12H,volumeCounts24H,volumeCounts3D,volumeCounts5D,volumeCounts7D
0,0,2007-01-01 04:29:32+00:00,2007-01-01 04:29:32+00:00,2007-01-01 04:29:32+00:00,e58c6279551b85cf,China's Daqing pumps 43.41 mln tonnes of oil i...,3,1,RTRS,"{'ENR', 'ASIA', 'CN', 'NGS', 'EMRG', 'RTRS', '...",...,0,0,0,0,0,0,0,3,6,7
1,1,2007-01-01 07:03:35+00:00,2007-01-01 07:03:34+00:00,2007-01-01 07:03:34+00:00,5a31c4327427f63f,"FEATURE-In kidnapping, finesse works best",3,1,RTRS,"{'FEA', 'CA', 'LATAM', 'MX', 'INS', 'ASIA', 'I...",...,1,1,1,1,1,1,1,3,3,3
2,2,2007-01-01 11:29:56+00:00,2007-01-01 11:29:56+00:00,2007-01-01 11:29:56+00:00,1cefd27a40fabdfe,PRESS DIGEST - Wall Street Journal - Jan 1,3,1,RTRS,"{'RET', 'ENR', 'ID', 'BG', 'US', 'PRESS', 'IQ'...",...,0,0,0,0,0,0,0,5,11,17
3,3,2007-01-01 12:08:37+00:00,2007-01-01 12:08:37+00:00,2007-01-01 12:08:37+00:00,23768af19dc69992,PRESS DIGEST - New York Times - Jan 1,3,1,RTRS,"{'FUND', 'FIN', 'CA', 'SFWR', 'INS', 'PUB', 'B...",...,0,0,0,0,0,0,0,5,13,15
4,4,2007-01-01 12:08:37+00:00,2007-01-01 12:08:37+00:00,2007-01-01 12:08:37+00:00,23768af19dc69992,PRESS DIGEST - New York Times - Jan 1,3,1,RTRS,"{'FUND', 'FIN', 'CA', 'SFWR', 'INS', 'PUB', 'B...",...,0,0,0,0,0,0,0,0,0,0
5,5,2007-01-01 12:08:37+00:00,2007-01-01 12:08:37+00:00,2007-01-01 12:08:37+00:00,23768af19dc69992,PRESS DIGEST - New York Times - Jan 1,3,1,RTRS,"{'FUND', 'FIN', 'CA', 'SFWR', 'INS', 'PUB', 'B...",...,0,0,0,0,0,0,0,0,2,3
6,6,2007-01-01 12:08:37+00:00,2007-01-01 12:08:37+00:00,2007-01-01 12:08:37+00:00,23768af19dc69992,PRESS DIGEST - New York Times - Jan 1,3,1,RTRS,"{'FUND', 'FIN', 'CA', 'SFWR', 'INS', 'PUB', 'B...",...,0,0,0,0,0,0,0,2,8,8
7,7,2007-01-01 12:08:37+00:00,2007-01-01 12:08:37+00:00,2007-01-01 12:08:37+00:00,23768af19dc69992,PRESS DIGEST - New York Times - Jan 1,3,1,RTRS,"{'FUND', 'FIN', 'CA', 'SFWR', 'INS', 'PUB', 'B...",...,0,0,0,0,0,0,0,0,4,16
8,8,2007-01-01 12:08:37+00:00,2007-01-01 12:08:37+00:00,2007-01-01 12:08:37+00:00,23768af19dc69992,PRESS DIGEST - New York Times - Jan 1,3,1,RTRS,"{'FUND', 'FIN', 'CA', 'SFWR', 'INS', 'PUB', 'B...",...,0,0,0,0,0,0,2,8,11,12
9,9,2007-01-01 13:00:02+00:00,2007-01-01 13:00:02+00:00,2007-01-01 13:00:02+00:00,9fb959be43ed4ba2,Tenet Completes Sale of Alvarado Hospital Medi...,3,1,BSW,"{'US', 'LEN', 'NEWR', 'DRU'}",...,0,0,0,0,0,0,0,0,0,1


In [2]:
(market_train_df, news_train_df) = env.get_training_data()

In [4]:
market_train_df.shape

(4072956, 16)

In [7]:
news_train_df.shape

(9328750, 35)

## **`get_prediction_days`** function

Generator which loops through each "prediction day" (trading day) and provides all market and news observations which occurred since the last data you've received.  Once you call **`predict`** to make your future predictions, you can continue on to the next prediction day.

Yields:
* While there are more prediction day(s) and `predict` was called successfully since the last yield, yields a tuple of:
    * `market_observations_df`: DataFrame with market observations for the next prediction day.
    * `news_observations_df`: DataFrame with news observations for the next prediction day.
    * `predictions_template_df`: DataFrame with `assetCode` and `confidenceValue` columns, prefilled with `confidenceValue = 0`, to be filled in and passed back to the `predict` function.
* If `predict` has not been called since the last yield, yields `None`.
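
The contract described above can be tried out with a toy stand-in. The `ToyEnv` class below is hypothetical and is not the real `twosigmanews` implementation; it only mimics the documented behaviour (a single call to `get_prediction_days`, and `None` yielded when `predict` was skipped) so that you can experiment with the control flow without the competition data.

```python
import pandas as pd

class ToyEnv:
    """Hypothetical stand-in that mimics the documented generator contract."""

    def __init__(self, days):
        self._days = days          # list of (market_df, news_df, template_df) tuples
        self._started = False
        self._predicted = True     # nothing is pending before the first day
        self.predictions = []

    def get_prediction_days(self):
        if self._started:
            raise Exception('You can only call `get_prediction_days` once.')
        self._started = True
        return self._generate()

    def _generate(self):
        index = 0
        while index < len(self._days):
            if not self._predicted:
                yield None         # predict() was skipped for the current day
                continue
            self._predicted = False
            yield self._days[index]
            index += 1

    def predict(self, predictions_df):
        # Store a copy of the predictions and unlock the next prediction day.
        self.predictions.append(predictions_df.copy())
        self._predicted = True
```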

In [15]:
# You can only iterate through a result from `get_prediction_days()` once
# so be careful not to lose it once you start iterating.
days = env.get_prediction_days()

In [16]:
(market_obs_df, news_obs_df, predictions_template_df) = next(days)

In [24]:
market_obs_df.head(10)

Unnamed: 0,time,assetCode,assetName,volume,close,open,returnsClosePrevRaw1,returnsOpenPrevRaw1,returnsClosePrevMktres1,returnsOpenPrevMktres1,returnsClosePrevRaw10,returnsOpenPrevRaw10,returnsClosePrevMktres10,returnsOpenPrevMktres10
0,2017-01-03 22:00:00+00:00,A.N,Agilent Technologies Inc,1739726.0,46.49,45.93,0.020413,0.003715,0.009812,0.003744,0.015565,-0.011736,0.015757,0.001985
1,2017-01-03 22:00:00+00:00,AA.N,Alcoa Corp,2746634.0,28.83,28.6,0.026709,-0.012772,0.015256,-0.012756,-0.019388,-0.035413,,
2,2017-01-03 22:00:00+00:00,AAL.O,American Airlines Group Inc,6737752.0,46.3,47.28,-0.008353,-0.002952,-0.026595,-0.002917,-0.027924,-0.027361,-0.027729,-0.015277
3,2017-01-03 22:00:00+00:00,AAN.N,Aaron's Inc,760498.0,31.9,32.4,-0.002813,0.01029,-0.020835,0.010329,-0.026846,-0.021739,-0.026324,0.011201
4,2017-01-03 22:00:00+00:00,AAP.N,Advance Auto Parts Inc,691526.0,170.6,170.78,0.008751,-0.003152,-0.001344,-0.003128,-0.007281,-0.014546,-0.007198,-0.010078
5,2017-01-03 22:00:00+00:00,AAPL.O,Apple Inc,28781865.0,116.15,115.8,0.002849,-0.007287,-0.003841,-0.007265,0.001552,-0.005753,0.00167,0.002888
6,2017-01-03 22:00:00+00:00,ABB.N,ABB Ltd,2009610.0,21.28,21.14,0.009967,-0.005644,0.004332,-0.005634,0.015267,0.010033,0.015339,0.014276
7,2017-01-03 22:00:00+00:00,ABBV.N,AbbVie Inc,9328198.0,62.41,62.92,-0.003354,0.003045,-0.002449,0.003055,0.003054,0.012878,0.003129,0.019393
8,2017-01-03 22:00:00+00:00,ABC.N,AmerisourceBergen Corp,4134229.0,82.61,78.51,0.056529,-0.010586,0.045927,-0.010582,0.067589,0.020406,0.067513,0.016566
9,2017-01-03 22:00:00+00:00,ABCO.O,Advisory Board Co,518959.0,34.4,33.45,0.034586,0.018265,0.028449,0.018302,0.060092,0.022936,0.059964,0.018003


In [25]:
news_obs_df.head(10)

Unnamed: 0,time,sourceTimestamp,firstCreated,sourceId,headline,urgency,takeSequence,provider,subjects,audiences,bodySize,companyCount,headlineTag,marketCommentary,sentenceCount,wordCount,assetCodes,assetName,firstMentionSentence,relevance,sentimentClass,sentimentNegative,sentimentNeutral,sentimentPositive,sentimentWordCount,noveltyCount12H,noveltyCount24H,noveltyCount3D,noveltyCount5D,noveltyCount7D,volumeCounts12H,volumeCounts24H,volumeCounts3D,volumeCounts5D,volumeCounts7D
0,2016-12-30 22:00:02+00:00,2016-12-30 22:00:02+00:00,2016-12-30 22:00:02+00:00,02ae05e4a5650826,Enstar Announces Acquisition of Dana Companies...,3,1,GNW,"{'MRG', 'MINS', 'NEWR', 'INSR', 'BACT', 'BM', ...","{'CNR', 'GNWN'}",3742,1,,False,19,575,"{'ESGR.OQ', 'ESGR.O'}",Enstar Group Ltd,1,1.0,1,0.222786,0.38274,0.394474,555,0,0,0,0,0,0,0,0,0,0
1,2016-12-30 22:00:02+00:00,2016-12-30 22:00:02+00:00,2016-12-30 22:00:02+00:00,8a59a41e3dc3f255,ENSTAR ANNOUNCES ACQUISITION OF DANA COMPANIES,1,1,RTRS,"{'BLR', 'MINS', 'INSR', 'FINS', 'US', 'CMPNY',...","{'E', 'U'}",0,1,,False,1,7,"{'ESGR.OQ', 'ESGR.O'}",Enstar Group Ltd,1,1.0,0,0.172375,0.611033,0.216592,7,0,0,0,0,0,0,0,0,0,0
2,2016-12-30 22:00:06+00:00,2016-12-30 22:00:06+00:00,2016-12-30 22:00:02+00:00,c8ee6bcb3de3a944,ENSTAR ANNOUNCES ACQUISITION OF DANA COMPANIES,1,2,RTRS,"{'BLR', 'AUTO', 'MINS', 'CYCS', 'INSR', 'FINS'...","{'E', 'U'}",0,2,,False,1,7,{'DAN.N'},Dana Inc,1,1.0,0,0.172375,0.611033,0.216592,7,0,0,0,0,0,0,0,0,0,0
3,2016-12-30 22:00:06+00:00,2016-12-30 22:00:06+00:00,2016-12-30 22:00:02+00:00,c8ee6bcb3de3a944,ENSTAR ANNOUNCES ACQUISITION OF DANA COMPANIES,1,2,RTRS,"{'BLR', 'AUTO', 'MINS', 'CYCS', 'INSR', 'FINS'...","{'E', 'U'}",0,2,,False,1,7,"{'ESGR.OQ', 'ESGR.O'}",Enstar Group Ltd,1,1.0,0,0.172375,0.611033,0.216592,7,2,2,2,2,2,2,2,2,2,2
4,2016-12-30 22:00:17+00:00,2016-12-30 22:00:17+00:00,2016-12-30 21:33:44+00:00,58ea9d8a0be61c03,BRIEF-Axovant Sciences files for mixed shelf o...,3,1,RTRS,"{'BLR', 'SISU', 'INDU', 'DBT', 'HECA', 'PHMR',...","{'PCO', 'PCU', 'DNP', 'PSC', 'U', 'RNP', 'NAW'...",239,1,BRIEF,False,4,65,{'AXON.N'},Axovant Sciences Ltd,1,1.0,0,0.307338,0.378957,0.313705,58,1,1,1,1,1,1,1,1,1,1
5,2016-12-30 22:00:30+00:00,2016-12-30 22:00:30+00:00,2016-12-30 22:00:02+00:00,091670975b204bb4,ENSTAR GROUP LTD - DEAL FOR $91.5 MLN,1,3,RTRS,"{'BLR', 'AUTO', 'MINS', 'CYCS', 'INSR', 'FINS'...","{'E', 'U'}",0,2,,False,1,10,"{'ESGR.OQ', 'ESGR.O'}",Enstar Group Ltd,1,1.0,1,0.006567,0.432727,0.560706,10,0,0,0,0,0,3,3,3,3,3
6,2016-12-30 22:00:30+00:00,2016-12-30 22:00:30+00:00,2016-12-30 22:00:02+00:00,091670975b204bb4,ENSTAR GROUP LTD - DEAL FOR $91.5 MLN,1,3,RTRS,"{'BLR', 'AUTO', 'MINS', 'CYCS', 'INSR', 'FINS'...","{'E', 'U'}",0,2,,False,1,10,{'DAN.N'},Dana Inc,0,1.0,1,0.006567,0.432727,0.560706,10,0,0,0,0,0,1,1,1,1,1
7,2016-12-30 22:00:39+00:00,2016-12-30 22:00:39+00:00,2016-12-30 22:00:39+00:00,39b80eb9f50fc245,"JPMORGAN CHINA REGION FUND, INC. BOARD TO SUBM...",1,1,RTRS,"{'BLR', 'INVT', 'LEN', 'FINS', 'US', 'CMPNY', ...","{'E', 'U'}",0,2,,False,1,19,{'JFC.N'},JPMorgan China Region Fund Inc,1,1.0,0,0.321686,0.375306,0.303008,19,1,1,1,1,1,1,1,1,1,1
8,2016-12-30 22:00:39+00:00,2016-12-30 22:00:39+00:00,2016-12-30 22:00:39+00:00,39b80eb9f50fc245,"JPMORGAN CHINA REGION FUND, INC. BOARD TO SUBM...",1,1,RTRS,"{'BLR', 'INVT', 'LEN', 'FINS', 'US', 'CMPNY', ...","{'E', 'U'}",0,2,,False,1,19,"{'JPM', 'JPM.DE', 'JPM.N'}",JPMorgan Chase & Co,1,1.0,0,0.321686,0.375306,0.303008,19,0,0,0,0,0,3,4,4,5,5
9,2016-12-30 22:01:01+00:00,2016-12-30 22:01:01+00:00,2016-12-30 22:00:39+00:00,07d091ecbccaceca,JPMORGAN CHINA REGION FUND INC - BOARD IS CURR...,1,2,RTRS,"{'BLR', 'INVT', 'BACT', 'LEN', 'REORG', 'FINS'...","{'E', 'U'}",0,2,,False,1,19,{'JFC.N'},JPMorgan China Region Fund Inc,1,1.0,1,0.053507,0.196706,0.749788,19,2,2,2,2,2,2,2,2,2,2


In [22]:
predictions_template_df.head()

Unnamed: 0,assetCode,confidenceValue
0,A.N,0.0
1,AA.N,0.0
2,AAL.O,0.0
3,AAN.N,0.0
4,AAP.N,0.0


Note that we'll get an error if we try to continue on to the next prediction day without making our predictions for the current day.

In [None]:
next(days)

## **`predict`** function
Stores your predictions for the current prediction day.  Expects the same format as you saw in `predictions_template_df` returned from `get_prediction_days`.

Args:
* `predictions_df`: DataFrame which must have the following columns:
    * `assetCode`: The market asset.
    * `confidenceValue`: Your confidence that the asset will increase or decrease over the next 10 trading days; positive values predict an increase and negative values a decrease.  All values must be in the range `[-1.0, 1.0]`.

The `predictions_df` you send **must** contain the exact set of rows which were given to you in the `predictions_template_df` returned from `get_prediction_days`.  The `predict` function does not validate this, but if you are missing any `assetCode`s or add any extraneous `assetCode`s, then your submission will fail.
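
Since `predict` does not validate this for you, it can help to check the DataFrame yourself before passing it in. The function below is a minimal sketch of such a check. `check_predictions` is a hypothetical helper, not part of the `twosigmanews` module, and it only enforces the rules stated above (required columns, matching `assetCode`s, values in `[-1.0, 1.0]`) plus the assumption that each `assetCode` appears once.

```python
import pandas as pd

def check_predictions(predictions_df, predictions_template_df):
    """Raise ValueError if predictions_df breaks the rules described above."""
    missing_cols = {'assetCode', 'confidenceValue'} - set(predictions_df.columns)
    if missing_cols:
        raise ValueError(f'missing columns: {sorted(missing_cols)}')
    # Assumption: the template lists each assetCode once, so duplicates are an error.
    if predictions_df['assetCode'].duplicated().any():
        raise ValueError('duplicate assetCode rows')
    expected = set(predictions_template_df['assetCode'])
    actual = set(predictions_df['assetCode'])
    if expected != actual:
        raise ValueError(
            f'missing: {sorted(expected - actual)}, extra: {sorted(actual - expected)}')
    values = predictions_df['confidenceValue']
    if values.isna().any() or not values.between(-1.0, 1.0).all():
        raise ValueError('confidenceValue must be in [-1.0, 1.0] with no NaN')
```

You could call `check_predictions(predictions_df, predictions_template_df)` just before `env.predict(predictions_df)`.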

Let's make random predictions for the first day:

In [None]:
import numpy as np
def make_random_predictions(predictions_df):
    predictions_df.confidenceValue = 2.0 * np.random.rand(len(predictions_df)) - 1.0

In [None]:
make_random_predictions(predictions_template_df)
env.predict(predictions_template_df)

Now we can continue on to the next prediction day and make another round of random predictions for it:

In [None]:
(market_obs_df, news_obs_df, predictions_template_df) = next(days)

In [None]:
market_obs_df.head()

In [None]:
news_obs_df.head()

In [None]:
predictions_template_df.head()

In [None]:
make_random_predictions(predictions_template_df)
env.predict(predictions_template_df)

## Main Loop
Let's loop through all the days and make our random predictions.  The `days` generator (returned from `get_prediction_days`) will simply stop returning values once you've reached the end.

In [None]:
for (market_obs_df, news_obs_df, predictions_template_df) in days:
    make_random_predictions(predictions_template_df)
    env.predict(predictions_template_df)
print('Done!')
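
The loop above ignores the observations entirely. As a hedged sketch of how `news_obs_df` might be used instead, the functions below average a simple sentiment score per asset and use it as the confidence value. `news_sentiment_by_asset` and `make_news_predictions` are hypothetical names, the score (positive minus negative sentiment) is an arbitrary choice for illustration, and the code assumes `assetCodes` holds text such as `"{'ESGR.OQ', 'ESGR.O'}"` as in the output shown earlier.

```python
import ast
import pandas as pd

def news_sentiment_by_asset(news_obs_df):
    """Return a Series of mean (positive - negative) sentiment indexed by assetCode."""
    news = news_obs_df[['assetCodes']].copy()
    news['score'] = news_obs_df['sentimentPositive'] - news_obs_df['sentimentNegative']

    def to_codes(value):
        # Assumption: assetCodes is the text form of a set; accept real sets too.
        if isinstance(value, (set, list, tuple)):
            return sorted(value)
        return sorted(ast.literal_eval(value))

    news['assetCode'] = news['assetCodes'].apply(to_codes)
    news = news.explode('assetCode')     # one row per (article, asset) pair
    return news.groupby('assetCode')['score'].mean()

def make_news_predictions(news_obs_df, predictions_template_df):
    """Fill the template with per-asset news sentiment, 0 where no news exists."""
    scores = news_sentiment_by_asset(news_obs_df)
    predictions_df = predictions_template_df.copy()
    predictions_df['confidenceValue'] = (
        predictions_df['assetCode'].map(scores).fillna(0.0).clip(-1.0, 1.0))
    return predictions_df
```

Inside the loop, `env.predict(make_news_predictions(news_obs_df, predictions_template_df))` would then replace the two lines that make random predictions.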

## **`write_submission_file`** function

Writes your predictions to a CSV file (`submission.csv`) in the current working directory.


In [28]:
pwd

'/kaggle/working'

In [27]:
ls -a

./  ../  .ipynb_checkpoints/  __notebook_source__.ipynb


In [None]:
env.write_submission_file()

In [None]:
# We've got a submission file!
import os
print([filename for filename in os.listdir('.') if '.csv' in filename])

As indicated by the helper message, calling `write_submission_file` on its own does **not** make a submission to the competition.  It merely tells the module to write the `submission.csv` file as part of the Kernel's output.  To make a submission to the competition, you'll have to **Commit** your Kernel and find the generated `submission.csv` file in that Kernel Version's Output tab (note this is _outside_ of the Kernel Editor), then click "Submit to Competition".  When we re-run your Kernel during Stage Two, we will run the Kernel Version (generated when you hit "Commit") linked to your chosen Submission.

## Restart the Kernel to run your code again
In order to combat cheating, you are only allowed to call `make_env` or iterate through `get_prediction_days` once per Kernel run.  However, while you're iterating on your model it's reasonable to try something out, change the model a bit, and try it again.  Unfortunately, if you try to simply re-run the code, or even refresh the browser page, you'll still be running on the same Kernel execution session you had been running before, and the `twosigmanews` module will still throw errors.  To get around this, you need to explicitly restart your Kernel execution session, which you can do by pressing the Restart button in the Kernel Editor's bottom Console tab:
![Restart button](https://i.imgur.com/hudu8jF.png)
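
Restarting is still required once you have iterated through the prediction days, but you can at least avoid an accidental second `make_env` call when you re-run your setup cell within the same session. The sketch below shows one possible guard. `get_or_create` is a hypothetical helper and is not part of the `twosigmanews` module.

```python
def get_or_create(namespace, name, factory):
    """Return namespace[name], calling factory() only if the name is not set yet."""
    if name not in namespace:
        namespace[name] = factory()
    return namespace[name]
```

In a Kernel cell this might be used as `env = get_or_create(globals(), 'env', twosigmanews.make_env)`, so that re-running the cell reuses the existing environment instead of creating a new one.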