# 5. Financial Networks

## Introduction

- Activity in the stock markets has <u>always attracted a great deal of interest from not only investors, but also scientists</u>. $\rightarrow$ Both have wanted to discover regularities in price fluctuations.
- <u>Continuous storage of a variety of data</u> (e.g. numbers of transactions, prices, numbers of bids and asks for all traded stocks worldwide) <u>constantly produces one of the largest datasets available to researchers</u>.
- This discipline has attracted over time the interest of investors convinced that it would in principle be possible to <u>predict the future behaviour by inspecting the past history</u>.
- It is noteworthy that since <u>the market is not totally isolated</u>, if many believe that the price will go up, then the price will effectively go up (self-fulfilling prophecy). $\rightarrow$ This creates an interesting feedback between observer and system observed.


- The study of time series is just one of the ways in which we can quantitatively study economic and financial networks.
- Another approach that is particularly fruitful is to describe the various connections between financial institutions in the form of a network.
- The structure obtained is particularly complex, since an edge (or various kinds of edges) can represent lending, exposure, insurance, credit default swaps (CDS), ownership, board interlocks, etc.


- The aim of this chapter is to <mark>provide the reader with the main quantitative instruments to describe these systems</mark>.

## Data from Yahoo! Finance
- <u>Financial data are very difficult to collect</u>, essentially due to disclosure problems, but also because of the absence of specific policy regulations on certain kinds of transactions; moreover, most of the data are not available in an aggregated form.
- <u>After the financial crisis</u>, which started with sub-prime mortgages in 2008, it became clear to a variety of policy regulators and control organisations that the complexity of the financial structure and our poor knowledge of it had been one of the causes of the downturn in the economy.
- From that moment a series of international organisations and companies <u>started collecting and making available various data</u>, unfortunately not always accessible to scientists.
- The set of data we present here <u>has been downloaded from the Yahoo! Finance web service</u>, which offers daily historical data for the closing prices of stocks traded in various markets.
- We show <mark>how to interact with the service in order to get the relevant data we need to explore the correlations between stocks for companies listed on the NYSE (New York Stock Exchange)</mark>.
- The historical data from Yahoo! Finance include <u>the volume of stocks transacted, the highest, the lowest, the opening, and the closing values, as well as an adjusted closing value</u> that provides the closing price (on the requested day, week, or month for any stock) adjusted for all applicable splits and dividend distributions.

- <code>$ pip install yahoo_finance</code>

- However, the [Yahoo Finance API was discontinued on 15 May 2017](http://unintelligent-nerd.blogspot.kr/2017/06/yahoo-finance-api-discontinued.html).


- We therefore use [pandas-datareader](https://pandas-datareader.readthedocs.io/en/latest/index.html) to retrieve the data.

- <code>$ pip install pandas-datareader</code>

In [None]:
import pandas as pd
import pandas_datareader.data as web
import datetime

start = datetime.datetime(2014, 5, 19)
end = datetime.datetime(2014, 5, 20)

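# Note: the 'google' source may no longer be available in recent
# pandas-datareader releases; substitute a source your installed version supports.
d = web.DataReader("YHOO", 'google', start, end)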

print(d)

In [None]:
%matplotlib inline

In [None]:
import matplotlib.pyplot as plt

d = web.DataReader("YHOO", "google", "2014-01-01", "2014-12-31")
# print(d['Volume']*d['Close'])
plt.plot(d['Close']*d['Volume'])

- [Company List (NASDAQ, NYSE, & AMEX)](http://www.nasdaq.com/screening/company-list.aspx)
    - [Download NYSE](http://www.nasdaq.com/screening/companies-by-name.aspx?letter=0&exchange=nyse&render=download)
    
    
- Use only companies with a market cap greater than $50B.

In [None]:
# Get stock labels, sectors, and industries

import time
import re
import pandas as pd

df = pd.read_csv("./data/companylist.csv")
# df = pd.read_csv("./data/companylist-nasdaq.csv")
# print(df.head())
print("Original # of companies :", len(df))

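# Keep only companies whose market cap is quoted in billions (e.g. "$75.3B");
# .copy() avoids pandas' SettingWithCopyWarning on the assignment below
df = df[df['MarketCap'].str[-1] == 'B'].copy()
# print(df.head())
print(">B$ market cap company # :", len(df))

# Strip the leading "$" and the trailing "B" so the value can be parsed as a float
df['MarketCap'] = df.MarketCap.apply(lambda x: re.sub(r'^\$|B$','', x))
df = df[df.MarketCap.astype(float) > 50.0]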
# print(df.head())
print("Company # of >50B Market Cap :", len(df))

list_stocks = df[['Symbol', 'Name', 'Sector', 'industry']].values.tolist()
print(list_stocks[0])

In [None]:
df

In [None]:
df.info()

In [None]:
df.groupby(["Sector"]).size()

In [None]:
from collections import Counter

diz_sectors = {}
for s in list_stocks:
    diz_sectors[s[0]] = s[2]

# Rank sectors by number of companies, most common first;
# each entry is a (sector, count) tuple
list_ranking = Counter(diz_sectors.values()).most_common()
list_colors = ['red', 'green', 'blue', 'black', 'cyan', 'magenta', 'yellow', 'coral', 'aquamarine', 'gray', 'goldenrod', 'palegreen']

diz_colors = {}

for sector, count in list_ranking:
    print(sector, count)
    # Unlabelled sectors, and any sector left once the palette is exhausted,
    # are drawn in white
    if sector == 'n/a' or not list_colors:
        diz_colors[sector] = 'white'
        continue
    diz_colors[sector] = list_colors.pop(0)

In [None]:
list_ranking

In [None]:
diz_colors

## Prices time series
- The time series of a stock price is a typical quantity that investors (rightly or wrongly) use when considering their investments.

In [None]:
# Retrieving historical data

import datetime

import pandas as pd
import pandas_datareader.data as web
from pandas_datareader._utils import RemoteDataError

start = datetime.datetime(2013, 5, 1)
end = datetime.datetime(2014, 5, 31)
# start = datetime.datetime(2016, 5, 1)
# end = datetime.datetime(2017, 5, 31)
diz_comp = {}
for s in list_stocks:
#     print(s[0])
    try:
        diz_comp[s[0]] = web.DataReader(s[0], 'google', start, end)
    except RemoteDataError:
        print("[WARN] No information for '%s'" % s[0])
        continue

# Create a dictionary {date string: closing price} for each company
diz_historical = {}
for k, frame in diz_comp.items():
    # The index holds Timestamps; convert each one to a 'YYYY-MM-DD' string
    date_str_list = [str(ts.date()) for ts in frame.index]
    if not date_str_list:
        print("[WARN]", k, "has no data")
        continue
    diz_historical[k] = dict(zip(date_str_list, frame.Close.values.tolist()))

for k in diz_historical.keys():
    print(k, len(diz_historical[k]))

- While the link between past and future performance has never been demonstrated, there is nevertheless a certain consensus that "on average" this information is valuable to investors.
- In particular the <mark>return</mark> and the <mark>volatility</mark> are considered <u>the most important indicators</u>.
- Define the proportional return of the investment in the period $\Delta t$:

$$r(\Delta t)=\frac{p(t_0+\Delta t)-p(t_0)}{p(t_0)}$$
where $p(t_0)$ is the price at the beginning of the interval $\Delta t$ and $p(t_0+\Delta t)$ is the price at the end.

- We assume an investment in a certain number of shares of a single stock, so that we can use the price to determine costs and gains. In the limit $\Delta t\rightarrow 0$, the above equation can be written as $r(t)\simeq\frac{d\ln(p(t))}{dt}$.
- Passing to discrete time steps, this expression takes the following form:
$$r=\ln{p(t_0+\Delta t)}-\ln{p(t_0)}$$
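- As a small numerical check of the two definitions, the sketch below compares the proportional return with the logarithmic return. The prices are hypothetical values chosen only for illustration, and the function names are not part of the original notebook.

```python
from math import log

def simple_return(p_start, p_end):
    """Proportional return (p_end - p_start) / p_start."""
    return (p_end - p_start) / p_start

def log_return(p_start, p_end):
    """Logarithmic return ln(p_end) - ln(p_start)."""
    return log(p_end) - log(p_start)

# Hypothetical prices at the beginning and at the end of the interval
p0, p1 = 100.0, 101.0

r_simple = simple_return(p0, p1)
r_log = log_return(p0, p1)

# For small price changes the two definitions nearly coincide
print("proportional return:", r_simple)
print("logarithmic return :", r_log)
```

- For a 1% price change the two values differ by less than $10^{-4}$; the gap grows as the price change gets larger.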

In [None]:
# Return of prices

from math import log

reference_company = 'ABEV'
# reference_company = 'MSFT'
diz_returns = {}
d = list(diz_historical[reference_company].keys())
d.sort()
# print(len(d), d)

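for c in diz_historical.keys():
    # Skip companies that lack a price for any of the reference dates
    if any(day not in diz_historical[c] for day in d):
        continue
    diz_returns[c] = {}
    for i in range(1, len(d)):
        # Log return between consecutive trading days
        diz_returns[c][d[i]] = \
            log(float(diz_historical[c][d[i]])) - log(float(diz_historical[c][d[i-1]]))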
            
# print(diz_returns[reference_company])
diz_returns[reference_company]

- Among the various definitions of <mark>volatility</mark> $\sigma$, <u>the simplest is the standard deviation of the value of prices $p(t)$</u>.
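- A minimal sketch of this simplest definition, using the standard library on a hypothetical price series (the numbers are invented for illustration). The population form of the standard deviation (dividing by $N$) is used here so that it agrees with the `covariance` helper of this chapter, for which $\sigma=\sqrt{\mathrm{cov}(X,X)}$.

```python
from statistics import pstdev

# Hypothetical daily closing prices, for illustration only
prices = [100.0, 102.0, 101.0, 105.0, 103.0]

# Volatility as the population standard deviation of the prices
volatility = pstdev(prices)
print("volatility:", volatility)
```

- In practice the same function can also be applied to a list of returns rather than prices, which is another of the various definitions in use.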

In [None]:
# Basic statistics and the correlation coefficient

# mean
def mean(X):
    m = 0.0
    for i in X:
        m = m + i
    return m/len(X)

# covariance
def covariance(X, Y):
    c = 0.0
    m_X = mean(X)
    m_Y = mean(Y)
    for i in range(len(X)):
        c = c + (X[i] - m_X) * (Y[i] - m_Y)
    return c/len(X)

# Pearson correlation coefficient
def pearson(X, Y):
    return covariance(X, Y)/(covariance(X, X)**0.5 * covariance(Y, Y)**0.5)

- [Pearson correlation coefficient](https://en.wikipedia.org/wiki/Pearson_correlation_coefficient)

## Correlation of prices
- Correlations in time series (or more simply comovements) are also considered to be extremely valuable.
- The idea is that every investor has precise knowledge of the market (a highly unrealistic assumption (Greenwald, Bruce and Stiglitz, 1993)) and, being perfectly rational (another strong assumption), wants to maximise the return and at the same time minimise the risk of their investments.
- This is obtained by choosing the proportion of the investments among all the assets present in the market (considered "complete") and by essentially building a portfolio of all the different assets: the "theory of portfolio" (Markowitz, 1952).


- If two or more assets have a past history of common behaviour (i.e. they both go up or down at the same time) we can measure a correlation between their price evolution as given by these "comovements".


- The correlation $\rho_{ij}(\Delta t)$ between the price returns of stocks $i$ and $j$ over a time $\Delta t$ is computed by means of
$$\rho_{ij}(\Delta t)=\frac{\langle r_ir_j\rangle-\langle r_i\rangle\langle r_j\rangle}{\sqrt{(\langle r^2_i\rangle-\langle r_i\rangle^2)(\langle r^2_j\rangle-\langle r_j\rangle^2)}}$$
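- The formula can be transcribed term by term, with each average $\langle\cdot\rangle$ taken over the time series. The sketch below does this in plain Python; the return series are hypothetical and the helper names `avg` and `correlation` are introduced only for this illustration.

```python
def avg(values):
    """Time average <x> of a list of numbers."""
    return sum(values) / len(values)

def correlation(r_i, r_j):
    """rho_ij = (<r_i r_j> - <r_i><r_j>) / sqrt(var_i * var_j)."""
    mean_ij = avg([a * b for a, b in zip(r_i, r_j)])
    mean_i, mean_j = avg(r_i), avg(r_j)
    var_i = avg([a * a for a in r_i]) - mean_i ** 2
    var_j = avg([b * b for b in r_j]) - mean_j ** 2
    return (mean_ij - mean_i * mean_j) / (var_i * var_j) ** 0.5

# Hypothetical return series, for illustration only
r_a = [0.01, -0.02, 0.015, 0.003]
r_b = [2 * x for x in r_a]   # moves with r_a: correlation close to +1
r_c = [-x for x in r_a]      # moves against r_a: correlation close to -1

print(correlation(r_a, r_b))
print(correlation(r_a, r_c))
```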

In [None]:
# Correlation of price returns

def stocks_corr_coeff(h1, h2):
    l1 = []
    l2 = []
    intersec_dates = set(h1.keys()).intersection(set(h2.keys()))
    for d in intersec_dates:
        l1.append(float(h1[d]))
        l2.append(float(h2[d]))
    return pearson(l1, l2)

print(stocks_corr_coeff(diz_returns[reference_company], 
                        diz_returns[reference_company]))

## Minimal spanning trees
- Trees are economical graphs in the sense that they connect a fixed number of vertices through the minimal number of edges.
- Trees are also often used to investigate network structure, as in the case of breadth-first search algorithms, and/or, as in this case, to filter the information present in a complete graph.
- Trees are perfect for classifying information (e.g. in the case of botany or zoology).


- Using the correlation values defined above, we obtain a set of $n\times (n - 1)/2$ numbers characterising the similarity of each of the $n$ stocks with respect to all the other $n-1$ stocks.
- We define a metric distance between any pair of stocks as
$$d_{i,j}(\Delta t)=\sqrt{2(1-\rho_{ij}(\Delta t))}$$
- With this choice, $d_{i,j}(\Delta t)$ fulfils the three axioms of a metric distance:
    - $d_{i,j}(\Delta t)=0$ iff $i=j$;
    - $d_{i,j}(\Delta t)=d_{j,i}(\Delta t)\ \forall i,j$;
    - $d_{i,j}(\Delta t)\leq d_{i,k}(\Delta t)+d_{k,j}(\Delta t)\ \forall i,j,k$.
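- To see how the distance behaves, the sketch below evaluates it for a few correlation values. The function name `metric_distance` and the clamping at zero (a guard against rounding pushing $\rho$ marginally above 1) are assumptions made for this illustration.

```python
import math

def metric_distance(rho):
    """Distance sqrt(2 * (1 - rho)) for a correlation rho in [-1, 1]."""
    # max() guards against a tiny negative argument caused by rounding
    return math.sqrt(max(0.0, 2.0 * (1.0 - rho)))

# Perfect correlation maps to distance 0, perfect anti-correlation to distance 2
for rho in (1.0, 0.5, 0.0, -1.0):
    print(rho, metric_distance(rho))
```

- Strongly correlated stocks are therefore close to each other, and the distance grows monotonically as the correlation decreases.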

In [None]:
# Building the network with the metric distance

import math
import networkx as nx

corr_network = nx.Graph()

stocks = list(diz_returns.keys())
num_companies = len(stocks)
for i1 in range(num_companies-1):
    for i2 in range(i1+1, num_companies):
        stock1 = stocks[i1]
        stock2 = stocks[i2]
        rho = stocks_corr_coeff(diz_returns[stock1], diz_returns[stock2])
        # Clamp at zero: rounding can push rho marginally above 1
        metric_distance = math.sqrt(max(0.0, 2*(1.0 - rho)))
        corr_network.add_edge(stock1, stock2, weight=metric_distance)
        
print("number of nodes:", corr_network.number_of_nodes())
print("number of edges:", corr_network.number_of_edges())

- The method for constructing the MST linking $N$ objects is known in multivariate analysis as the "nearest neighbour single linkage cluster algorithm" (Mardia et al., 1979).
- The idea is to consider the above-defined distance between two vertices as the weight of the link connecting them. 
- At this point, we <u>keep only the strongest correlations or the shortest distances</u>. To filter among the $\simeq n^2$ links we first <u>rank all the edges</u>, then we <u>start from the vertices which are nearest</u> and we <u>keep adding new vertices by following the rank of the edges</u>, <u>discarding all the links that would form a cycle</u> (in this way, by construction, the graph is acyclic, i.e. a tree). 
- Finally, we <u>stop when all the vertices are drawn</u> (in this way the tree is spanning).
- Schematising:
    1. rank the pairs of vertices (stocks) from the nearest to the farthest 
    2. draw the first edge from this rank
    3. continue in the rank
    4. if the new edge does not close a cycle draw it
    5. go to point 3
    6. stop when all the vertices have been drawn.
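- The steps above correspond to Kruskal's algorithm. A minimal sketch in plain Python follows, assuming a connected graph given as a list of `(weight, u, v)` tuples; the function name, the union-find bookkeeping used to detect cycles, and the toy distances are illustrative choices, not part of the original notebook.

```python
def kruskal_mst(nodes, weighted_edges):
    """Return the MST of a connected graph as a list of (u, v, weight)."""
    # Each node starts as the root of its own component
    parent = {n: n for n in nodes}

    def find(n):
        # Follow parent links to the component root, shortening the path
        while parent[n] != n:
            parent[n] = parent[parent[n]]
            n = parent[n]
        return n

    tree = []
    # Steps 1-3: go through the edges from the shortest to the longest
    for weight, u, v in sorted(weighted_edges):
        root_u, root_v = find(u), find(v)
        # Step 4: draw the edge only if it does not close a cycle
        if root_u != root_v:
            parent[root_u] = root_v
            tree.append((u, v, weight))
            # Step 6: a spanning tree has exactly n - 1 edges
            if len(tree) == len(nodes) - 1:
                break
    return tree

# Toy example with hypothetical distances between four stocks
nodes = ['A', 'B', 'C', 'D']
edges = [(0.4, 'A', 'B'), (0.9, 'A', 'C'), (0.5, 'B', 'C'),
         (1.2, 'A', 'D'), (0.7, 'C', 'D'), (1.1, 'B', 'D')]
mst = kruskal_mst(nodes, edges)
print(mst)
```

- NetworkX also provides `nx.minimum_spanning_tree(G, weight='weight')`, which could be applied directly to a weighted graph such as `corr_network`.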
    
- Ref
    - http://leeyongjeon.tistory.com/entry/최소신장트리Minimum-Spanning-Trees-크루스칼Kruskal-알고리즘
    - [Wikipedia](https://en.wikipedia.org/wiki/Minimum_spanning_tree)

In [None]:
# Minimal spanning tree (Prim's algorithm)

tree_seed = reference_company
N_new = [tree_seed]   # nodes already in the tree
E_new = []            # edges of the tree
while len(N_new) < corr_network.number_of_nodes():
    min_weight = float('inf')
    # Find the lightest edge joining a tree node to a node outside the tree
    for n in N_new:
        for n_adj in corr_network.neighbors(n):
            if n_adj not in N_new:
                if corr_network[n][n_adj]['weight'] < min_weight:
                    min_weight = corr_network[n][n_adj]['weight']
                    min_weight_edge = (n, n_adj)
                    n_adj_ext = n_adj
    E_new.append(min_weight_edge)
    N_new.append(n_adj_ext)
    
tree_graph = nx.Graph()
tree_graph.add_edges_from(E_new)

# Colour each node by its sector; G.nodes is used because the older
# G.node alias was removed in later NetworkX releases
for n in tree_graph.nodes():
    tree_graph.nodes[n]['color'] = diz_colors[diz_sectors[n]]

In [None]:
# Printing the financial minimum spanning tree

import matplotlib.pyplot as plt
from matplotlib.lines import Line2D

pos = nx.nx_pydot.graphviz_layout(tree_graph, prog='neato',
                                  args='-Gmodel=subset -Gratio=fill')
plt.figure(figsize=(20,20))
nx.draw_networkx_edges(tree_graph, pos, width=2,
                       edge_color='black', alpha=0.5, style='solid')
nx.draw_networkx_labels(tree_graph, pos)

# Draw all nodes in one call, coloured by sector
node_list = list(tree_graph.nodes())
node_colors = [tree_graph.nodes[n]['color'] for n in node_list]
nx.draw_networkx_nodes(tree_graph, pos, nodelist=node_list, node_size=800,
                       alpha=0.5, node_color=node_colors)

# Build the legend entries explicitly so each sector is paired with its colour
handles = [Line2D([], [], marker='o', linestyle='', color=col,
                  markeredgecolor='black', label=sec)
           for sec, col in diz_colors.items()]
plt.legend(handles=handles, loc='lower left')
plt.axis('off')