## Load Data from the StockNet Dataset

The dataset can be found [here](https://github.com/yumoxu/stocknet-dataset).

In [127]:
# Dependencies
import os
import json
import numpy as np
import pandas as pd
from tqdm import tqdm

In [128]:
# Change this to the path of your local copy of the dataset
stocknet_dataset_filepath = './stocknet-dataset-master'

## Import data into maps
`company_to_price_df` maps each company name to a DataFrame containing the stock movement information per date. Schema:
```
{
    company_name: DataFrame ['date', 'open', 'high', 'low', 'close', 'adjust_close', 'volume']
    ...
}
```

`company_to_tweets` maps each company name to a dictionary that maps each date to the list of tweets (with metadata) for that date. Schema:
```
{
    company_name: 
    {
        date: [list of tweets + metadata]
        ...
    }
    ...
}
```

NOTE: GMRE is excluded because its data contains no dates before 2017.
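
As a quick illustration of the two schemas, the sketch below builds toy versions of both maps and looks up one company on one date. The ticker, the prices, and the `text` field on the tweet are made-up placeholders rather than values from the dataset, and the `YYYY-MM-DD` date format is an assumption.

```python
import pandas as pd

# Toy stand-ins for the two maps; every value here is invented for illustration.
toy_company_to_price_df = {
    'TOY': pd.DataFrame({
        'date': ['2015-01-05', '2015-01-02'],
        'open': [10.0, 9.5],
        'high': [10.4, 9.9],
        'low': [9.8, 9.4],
        'close': [10.2, 9.7],
        'adjust_close': [10.2, 9.7],
        'volume': [1200, 900],
    })
}
toy_company_to_tweets = {
    'TOY': {
        '2015-01-02': [{'text': ['placeholder', 'tweet']}],
    }
}

company, date = 'TOY', '2015-01-02'

# Price rows for one company on one date
price_df = toy_company_to_price_df[company]
price_rows = price_df[price_df['date'] == date]

# Tweets for the same company and date; an empty list if there were none
tweets = toy_company_to_tweets[company].get(date, [])

print(price_rows[['date', 'close']])
print(len(tweets))
```

Using `.get(date, [])` avoids a `KeyError` on dates that have prices but no tweets.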

In [129]:
# Data import
preprocessed_prices_filepath = stocknet_dataset_filepath + '/price/preprocessed'
preprocessed_tweets_filepath = stocknet_dataset_filepath + '/tweet/preprocessed'

company_to_price_df = {}
company_to_tweets = {}

for filename in os.listdir(preprocessed_prices_filepath):
    company_name = filename.split('.')[0]

    # Not enough data for GMRE
    if company_name == 'GMRE':
        continue

    with open(preprocessed_prices_filepath + '/' + filename) as file:
        df = pd.read_csv(file, sep='\t')
    df.columns = ['date', 'open', 'high', 'low', 'close', 'adjust_close', 'volume']
    company_to_price_df[company_name] = df

for filename in tqdm(os.listdir(preprocessed_tweets_filepath)):
    company_name = filename.split('.')[0]
    dates_to_tweets = {}
    for tweet_filename in os.listdir(preprocessed_tweets_filepath + '/' + filename):
        with open(preprocessed_tweets_filepath + '/' + filename + '/' + tweet_filename) as file:
            list_of_tweets = []
            for line in file:
                tweet_json = json.loads(line)
                list_of_tweets.append(tweet_json)
            dates_to_tweets[tweet_filename] = list_of_tweets
    company_to_tweets[company_name] = dates_to_tweets

100%|██████████| 87/87 [00:05<00:00, 15.54it/s]


## Train / Dev / Test Split for Price
The train / dev / test split uses the following date ranges (DD/MM/YYYY):
* 01/01/2014 to 31/07/2015 for training
* 01/08/2015 to 30/09/2015 for validation
* 01/10/2015 to 01/01/2016 for testing
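
One way to express these date ranges in code is a boolean mask over the `date` column instead of fixed row positions. This is a minimal sketch, not the split used in the cells of this notebook: `split_by_date` and `SPLITS` are hypothetical names, and the sketch assumes the `date` column holds strings that `pd.to_datetime` can parse (for example `YYYY-MM-DD`).

```python
import pandas as pd

# Split boundaries from the list above, inclusive on both ends.
SPLITS = {
    'train': ('2014-01-01', '2015-07-31'),
    'dev': ('2015-08-01', '2015-09-30'),
    'test': ('2015-10-01', '2016-01-01'),
}

def split_by_date(df, splits=SPLITS):
    """Return {split_name: sub-DataFrame} with rows selected by date range."""
    dates = pd.to_datetime(df['date'])
    return {
        name: df[dates.between(pd.Timestamp(start), pd.Timestamp(end))]
        for name, (start, end) in splits.items()
    }

# Toy frame with one row per calendar day (real price data only has trading days).
toy_df = pd.DataFrame({
    'date': pd.date_range('2013-12-30', '2016-01-05').strftime('%Y-%m-%d'),
})
toy_df['close'] = range(len(toy_df))

parts = split_by_date(toy_df)
print({name: len(part) for name, part in parts.items()})
```

A date mask does not depend on row order or on how many rows a company has, which may make it a convenient alternative for companies with irregular row counts.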

Six companies do not have 1256 rows in their stock data. Three of them (PTR, REX, SNP) have 1255 rows; the other three (BABA, AGFS, ABBV) each have a different number of rows.
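
A row-count check of this kind could look as follows. It is a sketch on toy data: `find_irregular_companies` is a hypothetical helper, and 1256 is taken from the text above as the expected number of rows.

```python
import pandas as pd

EXPECTED_ROWS = 1256  # row count taken from the text above

def find_irregular_companies(price_dfs, expected_rows=EXPECTED_ROWS):
    """Return {company: row_count} for every DataFrame with an unexpected length."""
    return {
        name: len(df)
        for name, df in price_dfs.items()
        if len(df) != expected_rows
    }

# Toy input: one full-length frame and one that is a row short.
toy_price_dfs = {
    'FULL': pd.DataFrame({'close': range(1256)}),
    'SHORT': pd.DataFrame({'close': range(1255)}),
}

irregular = find_irregular_companies(toy_price_dfs)
print(irregular)
```

On the real data, the same helper could be called on `company_to_price_df` to list the companies that need their own index ranges.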

In [130]:
# Train / dev / test split
COMPANIES_1255 = ['PTR', 'REX', 'SNP']
DIFF_COMPANIES = ['BABA', 'AGFS', 'ABBV']

TRAIN_IDXS = range(526, 924)
TRAIN_IDXS_1255 = range(525, 923)
TRAIN_IDXS_BABA = range(526, 743)
TRAIN_IDXS_AGFS = range(526, 699)
TRAIN_IDXS_ABBV = range(526, 924)

DEV_IDXS = range(484, 526)
TEST_IDXS = range(419, 484)

## Build price tensors per company

`company_to_price_tensors_{dataset}` maps each company name to a NumPy array of shape `len({dataset}) x 3`, where `{dataset}` is `train`, `dev`, or `test`. Each row is a feature vector structured as `[closing_price, high_price, low_price]`. Schema:

```
{
    company_name: 
        numpy.array([
            [closing_price, high_price, low_price],
            ...
        ])
    ...
}
```
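
To make the resulting shapes concrete, the sketch below builds the three arrays for a toy DataFrame using the index ranges `range(526, 924)`, `range(484, 526)`, and `range(419, 484)` defined earlier in this notebook. The prices are random numbers, and `toy_price_tensor` is a hypothetical stand-in for the notebook's helper.

```python
import numpy as np
import pandas as pd

# Synthetic prices: 1256 rows, matching the row count of a regular company.
rng = np.random.default_rng(0)
toy_df = pd.DataFrame({
    'close': rng.random(1256),
    'high': rng.random(1256),
    'low': rng.random(1256),
})

def toy_price_tensor(df, idxs):
    """Select rows by position and return them as an array of [close, high, low]."""
    return df.iloc[list(idxs)][['close', 'high', 'low']].to_numpy()

train = toy_price_tensor(toy_df, range(526, 924))
dev = toy_price_tensor(toy_df, range(484, 526))
test = toy_price_tensor(toy_df, range(419, 484))

print(train.shape, dev.shape, test.shape)  # (398, 3) (42, 3) (65, 3)
```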

In [135]:
# Build price tensor per company [closing, highest, lowest] & split data up
company_to_price_tensors_train = {}
company_to_price_tensors_dev = {}
company_to_price_tensors_test = {}

# Helper function: select the rows at positions idxs and return [close, high, low] per row
def build_price_tensor(company, idxs):
    rows = company_to_price_df[company].iloc[idxs]
    return rows[['close', 'high', 'low']].to_numpy()

# Build training tensors for prices
for company in company_to_price_df.keys():
    # Skip companies whose row counts don't match the default index ranges for now
    if company in COMPANIES_1255 or company in DIFF_COMPANIES:
        continue
    company_to_price_tensors_train[company] = build_price_tensor(company, TRAIN_IDXS)

# Build dev tensors for prices
for company in company_to_price_df.keys():
    # Skip companies whose row counts don't match the default index ranges for now
    if company in COMPANIES_1255 or company in DIFF_COMPANIES:
        continue
    company_to_price_tensors_dev[company] = build_price_tensor(company, DEV_IDXS)

# Build test tensors for prices
for company in company_to_price_df.keys():
    # Skip companies whose row counts don't match the default index ranges for now
    if company in COMPANIES_1255 or company in DIFF_COMPANIES:
        continue
    company_to_price_tensors_test[company] = build_price_tensor(company, TEST_IDXS)