### Setup

- Generate cooccurrences (value represents number of articles in which two companies occur together)

TODO:
- Other measures for cooccurrence?

In [104]:
import os
import re
import glob
from datetime import datetime
import sys
sys.path.append("..") # Adds higher directory to python modules path for importing from src dir

import pandas as pd
import numpy as np
import tqdm
import matplotlib
from matplotlib import pyplot as plt
from tqdm import tqdm_notebook as tqdm

import spacy
from spacy import displacy
from collections import Counter
import en_core_web_sm
nlp = en_core_web_sm.load()

from src.datasets import NyseStocksDataset, NyseSecuritiesDataset, NyseFundamentalsDataset
import src.econometric_utils as eco
import src.regression_utils as regr
import src.plot_utils as plot
import src.math_utils as math_utils
import src.utils as utils

%matplotlib inline
%load_ext autotime
%load_ext autoreload
%autoreload 2

The autotime extension is already loaded. To reload it, use:
  %reload_ext autotime
The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload
time: 1.41 s


In [5]:
stocks_ds = NyseStocksDataset(file_path='../data/nyse/prices-split-adjusted.csv')
securities_ds = NyseSecuritiesDataset(file_path='../data/nyse/securities.csv')
companies = securities_ds.get_all_company_names()  # List[Tuple[symbol, name]]
stocks_ds.load()

HBox(children=(IntProgress(value=0, max=470), HTML(value='')))


time: 10.7 s


### Data

In [589]:
increments = pd.Series({ 26944: 1, 55568: 2, 55624: 3, 56146: 4, 60323: 5, 68948: 6, 68978: 7, 72066: 8, 72779: 9, 74970: 10, 75163: 11, 75690: 12, 80466: 13, 81443: 14, 88575: 15, 89332: 16, 92712: 17, 94556: 18, 95271: 19, 95551: 20, 98957: 21, 99316: 22, 102077: 23, 102831: 24, 104671: 25 })

def get_increase(idx):
    diffs = idx - increments.index
    if all(diffs < 0):
        return 0
    return increments.iloc[diffs[diffs >= 0].argmin()]

time: 179 ms


In [590]:
all_correlations = pd.read_csv('../data/preprocessed/econometrics/correlations.csv', index_col=[0, 1])
all_correlations.index.levels[1].name = None

comp_occ = pd.read_csv('../data/preprocessed/occurrences/occurrences.csv', index_col=0)
# FIXME: Remove later
comp_occ = comp_occ[comp_occ.article_id <= 106493]
comp_coocc = nlp_utils.get_cooccurrences(comp_occ).stack()
comp_coocc = comp_coocc[[a < b for a, b in comp_coocc.index]]

people_occ = pd.read_csv('../data/preprocessed/occurrences/occurrences-reuters-people-enriched.csv')
people_occ = people_occ[people_occ.columns[2:]]
people_occ.article_id += people_occ.article_id.apply(get_increase)
# people_occ.article_id += 1
name2symbol = dict([(com, securities_ds.get_company(com)) for com in people_occ.company.unique()])
people_occ['stock_symbol'] = people_occ.company.apply(name2symbol.get)
people_coocc = nlp_utils.get_cooccurrences(people_occ).stack()
people_coocc = people_coocc[[a < b for a, b in people_coocc.index]]

columns = ['article_id', 'start_idx', 'end_idx', 'stock_symbol']
mixed_occ = pd.concat([comp_occ[columns], people_occ[columns]]).sort_values(['article_id', 'start_idx']).reset_index(drop=True)
mixed_occ.drop_duplicates(['article_id', 'start_idx', 'end_idx'], inplace=True)

time: 10.5 s


Check article ids in people

- 26944: 1
- 55568: 2
- 55624: 3
- 56146: 4
- 60323: 5
- 68948: 6
- 68978: 7
- 72066: 8
- 72779: 9
- 74970: 10
- 75163: 11
- 75690: 12
- 80466: 13
- 81443: 14
- 88575: 15
- 89332: 16
- 92712: 17
- 94556: 18
- 95271: 19
- 95551: 20
- 98957: 21
- 99316: 22
- 102077: 23
- 102831: 24
- 104671: 25

In [588]:
# NEWS = os.path.join("..", "data", "preprocessed", "news-v2.csv")
# articles = pd.read_csv(NEWS, index_col=0, nrows=people_occ.article_id.max()+50)

add = 0
for article_id in tqdm(articles.index):
    for i, row in people_occ[people_occ.article_id == article_id].iterrows():
        if pd.isna(articles.loc[article_id+add].content) or not articles.loc[article_id+add].content[row.start_idx:row.end_idx] == row.match_text:
            add += 1
            assert articles.loc[article_id+add].content[row.start_idx:row.end_idx] == row.match_text
            print(article_id, add)
    for i, row in comp_occ[comp_occ.article_id == article_id].iterrows():
        assert articles.loc[article_id].content[row.start_idx:row.end_idx] == row.match_text

HBox(children=(IntProgress(value=0, max=106543), HTML(value='')))

time: 6min 20s


- 40.702.147 Entities
- 9.636.490 tagged as ORG
- 463.595 linked to my stock companys (distributed over 126.583 articles)

In [101]:
comp_names = securities_ds.get_all_company_names()
assert all([com in comp_names.Name.values for com in people_occ.company.unique()]), 'Sanity Check: Valid company names'
assert all(people_occ.notna().any()), 'Sanity Check: No missing values'

time: 181 ms


TODO: Validate correctness article_ids, start_idx and end_idx

TODO: Repeat after Bloomberg was added

In [233]:
for x in tqdm(comp_occ.match_text.values):
    assert not securities_ds.get_most_similar_company(x, debug=True)[1], x

HBox(children=(IntProgress(value=0, max=208303), HTML(value='')))

time: 29min 47s


Damerau-Levenshtein distance is not working (too repelling)

### Inspect Articles For Show Cases

In [967]:
NEWS = os.path.join("..", "data", "preprocessed", "news-v3.csv")
# articles = pd.read_csv(NEWS, index_col=0, nrows=people_occ.article_id.max()+50,)
articles = pd.read_csv(NEWS, index_col=0)

# for i, row in tqdm(comp_occ.iterrows(), total=len(comp_occ)):
#     assert row.match_text == articles.loc[row.article_id].content[row.start_idx:row.end_idx], i

time: 35.5 s


In [971]:
# _, occ_per_article = nlp_utils.get_cooccurrences(comp_occ, debug=True)

def merge_ranges(index_pairs):
    ranges = []
    for x, y in index_pairs:
        x = max(x, 0)
        if len(ranges) == 0:
            ranges.append([x, y])
        else:
            _, last_y = ranges[-1]
            if x <= last_y and y > last_y:
                ranges[-1][1] = y
            elif x > last_y: 
                ranges.append([x, y])
    return ranges

def get_matching_article(symA, symB):
    x = occ_per_article[[symA, symB]]
    return x[(x[symA] != 0) & (x[symB] != 0)].index.tolist()

def print_cooccurrences(symA, symB):
    x = occ_per_article[[symA, symB]]
    for idx, row in x[(x[symA] != 0) & (x[symB] != 0)].iterrows():
        print(f'--->Article: {idx}<----', row[symA], row[symB])
        temp_occs = comp_occ[(comp_occ.article_id == idx) & (comp_occ.stock_symbol.isin([symA, symB]))]
        # temp_occs.sort_values('start_idx', ascending=True, inplace=True)
        init_indeces = [(row.start_idx - 100, row.end_idx + 100) for i, row in temp_occs.iterrows()]
        indeces = merge_ranges(init_indeces)
        if len(init_indeces) == len(indices):
            continue
        for x, y in indeces:
            print('\n>>>', articles.loc[idx].content[x:y], '>>>\n')

time: 8.87 s


In [None]:
plot.scatter_regression(np.cumprod(a+1), np.cumprod(b+1))

In [1075]:
a = np.random.normal(0, 0.01, size=1000)
b = np.random.normal(0, 0.01, size=1000)
math_utils.correlation(a, b), math_utils.correlation(a + 1, b + 1), math_utils.correlation(np.cumprod(a+1), np.cumprod(b+1))

(0.027482947052723913, 0.027482947052723562, 0.8932890089085012)

time: 201 ms


In [972]:
print_cooccurrences('IBM', 'MCD')

--->Article: 5635<---- 1 1

>>> es rose 5.1 percent to $42.17. Tech stocks fell following disappointing results from Yahoo Inc. and International Business Machines Corp., which were released after Tuesday's closing bell. Yahoo shares slid 11.8 percent to $28.31 on the  >>>


>>> iggest gainers were Merck & Co., up 11.84 percent over that period; Coca-Cola Co., up 7.91 percent; McDonald's Corp., up 6.79 percent, and Exxon Mobil Corp., up 4.47 percent. The top loser over that time was Altria G >>>

--->Article: 5691<---- 1 1

>>> es rose 5.1 percent to $42.17. Tech stocks fell following disappointing results from Yahoo Inc. and International Business Machines Corp., which were released after Tuesday's closing bell. Yahoo shares slid 11.8 percent to $28.31 on the  >>>


>>> iggest gainers were Merck & Co., up 11.84 percent over that period; Coca-Cola Co., up 7.91 percent; McDonald's Corp., up 6.79 percent, and Exxon Mobil Corp., up 4.47 percent. The top loser over that time was Altria G >

>>> n by much better-than-expected earnings from some major companies, including Intel Corp ( INTC.O ), International Business Machines Corp ( IBM.N ) and Goldman Sachs Group ( GS.N ). The rally has driven stocks back up to near their highs >>>

--->Article: 45777<---- 1 1

>>> Tuesday, the earnings period accelerates, as some 57 S&P 500 companies are set to report this week. International Business Machines Corp is scheduled to post results on Tuesday while Google Inc ( GOOG.O ) is expected on Thursday. Among  >>>


>>> dustrials are expected to have the lowest. Also set to report this week: General Electric ( GE.N ), McDonald's Corp ( MCD.N ) and American Express Co ( AXP.N ). (Editing by Kenneth Barry) >>>

--->Article: 45796<---- 1 1

>>> Tuesday, the earnings period accelerates, as some 57 S&P 500 companies are set to report this week. International Business Machines Corp is scheduled to post results on Tuesday while Google Inc ( GOOG.O ) is expected on Thursday. Among  >>>


>>> d

In [911]:
# match_articles = get_matching_article('CB', 'ALL')
match_articles = get_matching_article('IBM', 'MCD')
# 'ADP', 'XOM'
# print(articles.loc[match_articles[1]].content)

time: 187 ms


In [913]:
merged[(merged['Same Industry'] == 0) & (merged.Cooccurrences > 5) & (merged.Residuals > 0.25)]

Unnamed: 0,Unnamed: 1,Cooccurrences,Cooccurrences - People,Dist,Dist - People,Dist - Mixed,Dist2,Dist2 - People,Dist2 - Mixed,Price,Return,Normalized,Residuals,Same Industry,Cooccurrences - Add,Dist - Add,Dist2 - Add
ABT,XOM,6.0,0.0,0.261333,0.0,0.261333,0.647955,0.0,0.647955,0.429599,0.455766,0.334983,0.3043,0.0,6.0,0.261333,0.647955
ADP,JNJ,8.0,0.0,0.942625,0.0,0.942625,0.981475,0.0,0.981475,0.565312,0.54511,0.334575,0.323231,0.0,8.0,0.942625,0.981475
ADP,XOM,8.0,0.0,0.968125,0.0,0.968125,0.988392,0.0,0.988392,0.956714,0.601115,0.42138,0.390005,0.0,8.0,0.968125,0.988392
CVX,IBM,20.0,0.0,0.36195,0.0,0.391,0.719562,0.0,0.732025,0.913501,0.561221,0.320763,0.297269,0.0,20.0,0.36195,0.719562
CVX,JNJ,13.0,0.0,0.558077,0.0,0.558077,0.847271,0.0,0.847271,0.44807,0.515741,0.378115,0.349819,0.0,13.0,0.558077,0.847271
CVX,MCD,18.0,0.0,0.3625,0.0,0.3625,0.669097,0.0,0.669251,0.810063,0.452319,0.383746,0.353276,0.0,18.0,0.3625,0.669097
CVX,MMM,11.0,0.0,0.650545,0.0,0.650545,0.825137,0.0,0.825137,0.454528,0.631064,0.279499,0.259257,0.0,11.0,0.650545,0.825137
CVX,TGT,9.0,0.0,0.4687,0.0,0.4687,0.80108,0.0,0.80108,0.041237,0.353841,0.301016,0.274897,0.0,9.0,0.4687,0.80108
DOW,MCD,6.0,0.0,0.677667,0.0,0.677667,0.830417,0.0,0.830417,0.16366,0.413121,0.301259,0.280805,0.0,6.0,0.677667,0.830417
IBM,JNJ,19.0,0.0,0.412421,0.0,0.42005,0.625756,0.0,0.635956,0.475419,0.503008,0.403201,0.383983,0.0,19.0,0.412421,0.625756


time: 239 ms


### Try New Distance Measure

In [688]:
comp_distances = nlp_utils.get_distances(comp_occ)
people_distances = nlp_utils.get_distances(people_occ)
mixed_distances = nlp_utils.get_distances(mixed_occ)

HBox(children=(IntProgress(value=0, max=47165), HTML(value='')))

HBox(children=(IntProgress(value=0, max=5940), HTML(value='')))

HBox(children=(IntProgress(value=0, max=48127), HTML(value='')))

time: 10min 42s


In [689]:
comp_distances2 = nlp_utils.get_cheap_distances(comp_occ)
people_distances2 = nlp_utils.get_cheap_distances(people_occ)
mixed_distances2 = nlp_utils.get_cheap_distances(mixed_occ)

HBox(children=(IntProgress(value=0, max=47165), HTML(value='')))

HBox(children=(IntProgress(value=0, max=5940), HTML(value='')))

HBox(children=(IntProgress(value=0, max=48127), HTML(value='')))

time: 8min 16s


### Correlate Features

In [734]:
def make_mergable(x):
    if len(x.shape) == 2:
        x = x.stack()
    return x[[a < b for a, b in x.index]]

merged = pd.concat([comp_coocc.rename('Cooccurrences'),  # 429
                    people_coocc.rename('Cooccurrences - People'),  # 200
                    make_mergable(comp_distances).rename('Dist'),
                    make_mergable(people_distances).rename('Dist - People'),
                    make_mergable(mixed_distances).rename('Dist - Mixed'),  # 450
                    make_mergable(comp_distances2).rename('Dist2'),
                    make_mergable(people_distances2).rename('Dist2 - People'),
                    make_mergable(mixed_distances2).rename('Dist2 - Mixed'),
                    *[all_correlations[col].abs() for col in all_correlations.columns]  # 466
                   ], axis=1).dropna(subset=['Cooccurrences', 'Price']).fillna(0)
merged['Cooccurrences - Add'] = merged['Cooccurrences'] + merged['Cooccurrences - People']
merged['Dist - Add'] = merged['Dist'] + merged['Dist - People']
merged['Dist2 - Add'] = merged['Dist2'] + merged['Dist2 - People']

time: 4.79 s


In [735]:
merged.shape, co_merged.shape, comp_coocc.shape, people_coocc.shape

((81810, 13), (96225, 2), (92235,), (20100,))

time: 197 ms


In [758]:
merged.corr(method='kendall').round(4)

Unnamed: 0,Cooccurrences,Cooccurrences - People,Dist,Dist - People,Dist - Mixed,Dist2,Dist2 - People,Dist2 - Mixed,Price,Return,Normalized,Residuals,Same Industry,Cooccurrences - Add,Dist - Add,Dist2 - Add
Cooccurrences,1.0,0.0807,0.9617,0.0807,0.9432,0.9533,0.0807,0.935,-0.0034,0.0724,0.0547,0.0546,0.1645,0.9967,0.9587,0.9507
Cooccurrences - People,0.0807,1.0,0.078,0.9994,0.1213,0.0771,0.9992,0.1194,0.0006,0.0096,0.0114,0.0122,0.037,0.1269,0.1269,0.1287
Dist,0.9617,0.078,1.0,0.078,0.9798,0.9819,0.078,0.9626,-0.0035,0.0711,0.0536,0.0535,0.1628,0.9586,0.9962,0.9783
Dist - People,0.0807,0.9994,0.078,1.0,0.1214,0.0771,0.9997,0.1194,0.0006,0.0096,0.0114,0.0122,0.037,0.1269,0.1269,0.1287
Dist - Mixed,0.9432,0.1213,0.9798,0.1214,1.0,0.9627,0.1213,0.9811,-0.0022,0.0711,0.0527,0.0527,0.1656,0.9461,0.9826,0.9654
Dist2,0.9533,0.0771,0.9819,0.0771,0.9627,1.0,0.0771,0.979,-0.0037,0.0704,0.0528,0.0528,0.1611,0.9502,0.9784,0.9958
Dist2 - People,0.0807,0.9992,0.078,0.9997,0.1213,0.0771,1.0,0.1194,0.0006,0.0096,0.0114,0.0122,0.037,0.1268,0.1269,0.1287
Dist2 - Mixed,0.935,0.1194,0.9626,0.1194,0.9811,0.979,0.1194,1.0,-0.0024,0.0705,0.052,0.0519,0.1638,0.9378,0.9654,0.9813
Price,-0.0034,0.0006,-0.0035,0.0006,-0.0022,-0.0037,0.0006,-0.0024,1.0,0.1176,0.0198,0.0185,0.0462,-0.0034,-0.0035,-0.0038
Return,0.0724,0.0096,0.0711,0.0096,0.0711,0.0704,0.0096,0.0705,0.1176,1.0,0.0548,0.0544,0.191,0.0723,0.071,0.0703


time: 2.08 s


#### News Correlations

In [753]:
columns = ['Residuals', 'Cooccurrences', 'Dist2 - Add', 'Dist - Add', 'Same Industry']
renamed_columns = ['Stock Correlation', 'Co-occurrences', 'Minimum Distance', 'Pairwise Distance', 'Same Industry']
final = merged[columns].rename(dict(zip(columns, renamed_columns)), axis=1)
final.corr().round(4)

Unnamed: 0,Stock Correlation,Co-occurrences,Minimum Distance,Pairwise Distance,Same Industry
Stock Correlation,1.0,0.0923,0.1085,0.1149,0.2008
Co-occurrences,0.0923,1.0,0.2181,0.214,0.0821
Minimum Distance,0.1085,0.2181,1.0,0.907,0.1663
Pairwise Distance,0.1149,0.214,0.907,1.0,0.1658
Same Industry,0.2008,0.0821,0.1663,0.1658,1.0


time: 235 ms


In [763]:
final.corr(method='kendall').round(4)

Unnamed: 0,Stock Correlation,Co-occurrences,Minimum Distance,Pairwise Distance,Same Industry
Stock Correlation,1.0,0.0546,0.053,0.0536,0.121
Co-occurrences,0.0546,1.0,0.9507,0.9587,0.1645
Minimum Distance,0.053,0.9507,1.0,0.9819,0.162
Pairwise Distance,0.0536,0.9587,0.9819,1.0,0.1636
Same Industry,0.121,0.1645,0.162,0.1636,1.0


time: 331 ms


#### Stock Correlations

In [761]:
merged[['Dist - Add', *all_correlations.columns]].rename({'Dist - Add': 'Pairwise Distance'}, axis=1).corr().round(4)

Unnamed: 0,Pairwise Distance,Price,Return,Normalized,Residuals,Same Industry
Pairwise Distance,1.0,0.0031,0.107,0.1139,0.1149,0.1658
Price,0.0031,1.0,0.1858,0.0441,0.0419,0.0561
Return,0.107,0.1858,1.0,0.1266,0.1272,0.2717
Normalized,0.1139,0.0441,0.1266,1.0,0.9884,0.196
Residuals,0.1149,0.0419,0.1272,0.9884,1.0,0.2008
Same Industry,0.1658,0.0561,0.2717,0.196,0.2008,1.0


time: 229 ms


In [762]:
merged[['Dist - Add', *all_correlations.columns]].rename({'Dist - Add': 'Pairwise Distance'}, axis=1).corr(method='kendall').round(4)

Unnamed: 0,Pairwise Distance,Price,Return,Normalized,Residuals,Same Industry
Pairwise Distance,1.0,-0.0035,0.071,0.0536,0.0536,0.1636
Price,-0.0035,1.0,0.1176,0.0198,0.0185,0.0462
Return,0.071,0.1176,1.0,0.0548,0.0544,0.191
Normalized,0.0536,0.0198,0.0548,1.0,0.8741,0.1194
Residuals,0.0536,0.0185,0.0544,0.8741,1.0,0.121
Same Industry,0.1636,0.0462,0.191,0.1194,0.121,1.0


time: 534 ms


In [799]:
merged.loc['ALL', 'CB'].to_frame()

Unnamed: 0_level_0,ALL
Unnamed: 0_level_1,CB
Cooccurrences,5.0
Cooccurrences - People,0.0
Dist,0.4814
Dist - People,0.0
Dist - Mixed,0.4814
Dist2,0.788251
Dist2 - People,0.0
Dist2 - Mixed,0.788251
Price,0.011805
Return,0.641101


time: 414 ms


#### Final

In [None]:
print('\nPearson:\n')
print(utils.pandas_df_to_markdown_table(final.corr().round(4)))
print('\nSpearman:\n')
print(utils.pandas_df_to_markdown_table(final.corr(method='spearman').round(4)))
print('\nKendall:\n')
print(utils.pandas_df_to_markdown_table(final.corr(method='kendall').round(4)))

Pearson:

|        -          |   Stock Correlation |   Co-occurrences |   Minimum Distance |   Pairwise Distance |   Same Industry |
|:------------------|--------------------:|-----------------:|-------------------:|--------------------:|----------------:|
| Stock Correlation |              1      |           0.0923 |             0.1085 |              0.1149 |          0.2008 |
| Co-occurrences    |              0.0923 |           1      |             0.2181 |              0.214  |          0.0821 |
| Minimum Distance  |              0.1085 |           0.2181 |             1      |              0.907  |          0.1663 |
| Pairwise Distance |              0.1149 |           0.214  |             0.907  |              1      |          0.1658 |
| Same Industry     |              0.2008 |           0.0821 |             0.1663 |              0.1658 |          1      |

Spearman:

|        -          |   Stock Correlation |   Co-occurrences |   Minimum Distance |   Pairwise Distance |   Same Industry |
|:------------------|--------------------:|-----------------:|-------------------:|--------------------:|----------------:|
| Stock Correlation |              1      |           0.0681 |             0.0665 |              0.0672 |          0.1482 |
| Co-occurrences    |              0.0681 |           1      |             0.9934 |              0.994  |          0.1676 |
| Minimum Distance  |              0.0665 |           0.9934 |             1      |              0.9993 |          0.1661 |
| Pairwise Distance |              0.0672 |           0.994  |             0.9993 |              1      |          0.1675 |
| Same Industry     |              0.1482 |           0.1676 |             0.1661 |              0.1675 |          1      |

Kendall:

|        -          |   Stock Correlation |   Co-occurrences |   Minimum Distance |   Pairwise Distance |   Same Industry |
|:------------------|--------------------:|-----------------:|-------------------:|--------------------:|----------------:|
| Stock Correlation |              1      |           0.0546 |             0.053  |              0.0536 |          0.121  |
| Co-occurrences    |              0.0546 |           1      |             0.9507 |              0.9587 |          0.1645 |
| Minimum Distance  |              0.053  |           0.9507 |             1      |              0.9819 |          0.162  |
| Pairwise Distance |              0.0536 |           0.9587 |             0.9819 |              1      |          0.1636 |
| Same Industry     |              0.121  |           0.1645 |             0.162  |              0.1636 |          1      |

- Normalization partly removed correlation between stock return and industry
    - Return vs. Same Industry: 0.2717 --> Normalized vs. Same Industry: 0.1960
- Correlation between Stock and Cooc increased with each preprocessing step
- "Cooc" / "Add Cooc" correlates with "Residuals"
- Addition of People Dist and Comp Dist increased Correlation
- Spearman: Higher for intra-text features (Cooc, Distances, Same Industry) --> monotonic, non-linear relationship, Lower in comparison with stock correlation --> linear relationship between both feature types
- Kendall: All values are lower than spearman, same as spearman compared to pearson
- Pearson/Spearman Significance: 0.10 -> 0.0058, 0.05 -> 0.0069, 0.01 -> 0.0090
- Kendall Significance: 0.10 -> 0.0038, 0.05 -> 0.0046, 0.01 -> 0.0060

Plot/Print diagonal:
- https://seaborn.pydata.org/examples/many_pairwise_correlations.html