## SigNet Package test

**General idea** : test the functions defined in the module1.py file and make a first application.

**General plan** : 
1. Import the data using the DropBox dataset

2. Repeat the clustering several times to get several clustering outcomes

3. Compute the weights of each stocks within each cluster 

4. Compute the returns of each portfolio, where one portfolio corresponds to one clustering and is composed of 5 big assets (which are the 5 clusters) **in this notebook, we first fix the number of clusters to be equal to 5**

Here are the main package we use in this notebook

In [26]:
import pandas as pd
import numpy as np
import yfinance as yf
from sklearn.pipeline import Pipeline
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler
from bs4 import BeautifulSoup
import requests 
from pypfopt.efficient_frontier import EfficientFrontier

We also import the module1.py file to use the function we defined there.

In [27]:
## we make some manipulations to correctly import module1

import os
import sys

module_1_directory = '/Users/khelifanail/Documents/GitHub/Portfolio_clustering_project/code'
sys.path.append(module_1_directory)

# Now we can import module1 properly 
import module1

We also import the adjency.py file to use the function we defined there.

In [3]:
## we make some manipulations to correctly import module1

import os
import sys

adjency_directory = '/Users/khelifanail/Documents/GitHub/Portfolio_clustering_project/code'
sys.path.append(adjency_directory)

# Maintenant, vous pouvez importer module1
from adjency import Adjency

### 1. Importation of data

We download various types of data (Open, High, Low, Close, Adj Close, Volume) for **496** assets between two periods (start, end) into a pandas dataframe denoted as df.

We then store the returns of each of these assets in the 'data'.

In [28]:
## Use your Path !!

df = pd.read_csv(r'\Users\keteb\OneDrive\Bureau\github\Portfolio_clusturing\Data\DATA_Statapp.csv')

In [29]:
df.head(5)

Unnamed: 0,ticker,open,high,low,close,volume,OPCL,pvCLCL,prevAdjClose,SPpvCLCL,sharesOut,PERMNO,SICCD,PERMCO,prevRawOpen,prevRawClose,prevAdjOpen
0,AA,"[82.0, 80.5, 82.0, 85.875, 86.0, 87.125, 82.0,...","[83.5625, 81.8125, 86.5, 86.375, 86.875, 87.25...","[80.375, 80.3125, 81.0, 84.8125, 84.5625, 84.3...","[80.9375, 81.3125, 86.0, 84.875, 84.625, 84.37...","[1551299, 2234799, 3121599, 4494699, 4534699, ...","[-0.013042, 0.010043, 0.047628, -0.011713, -0....","[-0.024849, 0.004633, 0.057648, -0.013081, -0....","[83.0, 80.94, 81.31, 86.0, 84.88, 84.62, 84.38...","[-0.009549, -0.038345, 0.001922, 0.000956, 0.0...","[366407, 366407, 366407, 366407, 366407, 36640...","[24643, 24643, 24643, 24643, 24643, 24643, 246...","[3334, 3334, '3334', 3334, 3334, '3334', '3334...","[20060, 20060, 20060, 20060, 20060, 20060, 200...","[80.8125, 82.0, 80.5, 82.0, 85.875, 86.0, 87.1...","[83.0, 80.9375, 81.3125, 86.0, 84.875, 84.625,...","[80.84, 82.01, 80.5, 82.09, 85.88, 86.01, 87.1..."
1,ABM,"[20.5, 20.125, 20.25, 20.1875, 20.1875, 20.25,...","[20.625, 20.375, 20.25, 20.375, 20.375, 20.25,...","[20.0, 20.0, 20.0, 20.0625, 20.0625, 20.0, 20....","[20.3125, 20.375, 20.125, 20.1875, 20.25, 20.2...","[120800, 62400, 27400, 63900, 60500, 113100, 3...","[-0.009188, 0.012346, -0.006192, 0.0, 0.003091...","[-0.003067, 0.003077, -0.01227, 0.003106, 0.00...","[20.37, 20.31, 20.38, 20.12, 20.19, 20.25, 20....","[-0.009549, -0.038345, 0.001922, 0.000956, 0.0...","[22341, 22341, 22341, 22341, 22341, 22341, 223...","[47730, 47730, 47730, 47730, 47730, 47730, 477...","[7349, 7349, '7349', 7349, 7349, '7349', '7349...","[20068, 20068, 20068, 20068, 20068, 20068, 200...","[20.1875, 20.5, 20.125, 20.25, 20.1875, 20.187...","[20.375, 20.3125, 20.375, 20.125, 20.1875, 20....","[20.19, 20.5, 20.13, 20.25, 20.19, 20.19, 20.2..."
2,ABT,"[35.25, 34.4375, 33.5625, 34.0, 34.5, 36.0, 34...","[36.0, 34.75, 34.3125, 35.25, 36.25, 36.0625, ...","[34.75, 33.75, 33.5625, 33.8125, 34.5, 34.875,...","[35.0, 34.0, 33.9375, 35.125, 35.5, 35.25, 34....","[4774099, 4818899, 5262299, 7846599, 7072899, ...","[-0.007117, -0.012786, 0.011111, 0.032553, 0.0...","[-0.036145, -0.028571, -0.001838, 0.034991, 0....","[36.31, 35.0, 34.0, 33.94, 35.13, 35.5, 35.25,...","[-0.009549, -0.038345, 0.001922, 0.000956, 0.0...","[1537311, 1537311, 1537311, 1537311, 1537311, ...","[20482, 20482, 20482, 20482, 20482, 20482, 204...","[2834, 2834, '2834', 2834, 2834, '2834', '2834...","[20017, 20017, 20017, 20017, 20017, 20017, 200...","[36.4375, 35.25, 34.4375, 33.5625, 34.0, 34.5,...","[36.3125, 35.0, 34.0, 33.9375, 35.125, 35.5, 3...","[36.44, 35.25, 34.44, 33.56, 34.02, 34.51, 36...."
3,ADI,"[93.5, 89.5, 85.625, 86.875, 84.0, 90.0, 93.5,...","[93.875, 91.5, 88.25, 87.625, 88.5, 94.75, 94....","[88.0, 85.5625, 83.1875, 83.25, 82.625, 89.25,...","[90.1875, 85.625, 86.875, 84.5, 86.875, 94.437...","[1827799, 1266599, 1614000, 1300500, 945300, 1...","[-0.036071, -0.044261, 0.014493, -0.027719, 0....","[-0.030242, -0.050589, 0.014599, -0.027338, 0....","[93.0, 90.19, 85.62, 86.87, 84.5, 86.88, 94.44...","[-0.009549, -0.038345, 0.001922, 0.000956, 0.0...","[174459, 174459, 174459, 174459, 174459, 17445...","[60871, 60871, 60871, 60871, 60871, 60871, 608...","[3612, 3612, '3612', 3612, 3612, '3612', '3612...","[282, 282, 282, 282, 282, 282, 282, 282, 282, ...","[91.5, 93.5, 89.5, 85.625, 86.875, 84.0, 90.0,...","[93.0, 90.1875, 85.625, 86.875, 84.5, 86.875, ...","[91.51, 93.56, 89.59, 85.63, 86.91, 84.05, 90...."
4,ADM,"[12.0, 11.8125, 11.875, 11.625, 11.875, 12.0, ...","[12.0625, 12.1875, 11.875, 11.875, 12.0, 12.18...","[11.875, 11.8125, 11.625, 11.5625, 11.8125, 11...","[12.0, 11.875, 11.6875, 11.75, 11.9375, 11.937...","[893200, 986900, 986800, 816300, 1076000, 1346...","[0.0, 0.005277, -0.015915, 0.010695, 0.005249,...","[-0.010309, -0.010417, -0.015789, 0.005348, 0....","[12.12, 12.0, 11.87, 11.69, 11.75, 11.94, 11.9...","[-0.009549, -0.038345, 0.001922, 0.000956, 0.0...","[608360, 608360, 608360, 608360, 608360, 60836...","[10516, 10516, 10516, 10516, 10516, 10516, 105...","[2045, 2045, '2045', 2045, 2045, '2045', '2045...","[20207, 20207, 20207, 20207, 20207, 20207, 202...","[12.0, 12.0, 11.8125, 11.875, 11.625, 11.875, ...","[12.125, 12.0, 11.875, 11.6875, 11.75, 11.9375...","[12.0, 12.0, 11.81, 11.88, 11.63, 11.88, 12.0,..."


In [35]:
# We set the tickers as index of the data frame

data = df[['ticker', 'open', 'close', 'volume']]
data.set_index('ticker', inplace=True)

- The aim here is to create a returns column from operations on the open and close columns. The difficulty lies in the fact that each element is a string composed of a list (this is the only way to store a list for each dataframe element)

In [38]:
number_stocks = len(data.index) # we get the total number of stocks 
data['return'] = np.nan # we creat a NaN column 

for i in range(number_stocks): # clean the data frame and compute the return 

    # a few simple steps are required to get rid of strings
    open = np.array(data.iloc[i][0].replace('[', '').replace(']', '').split(', '), dtype=float)
    close = np.array(data.iloc[i][1].replace('[', '').replace(']', '').split(', '), dtype=float)
    returns = (close - open) / open # we campute the the return 

    data.iloc[i, 3] = str(returns.tolist()) # the third column corresponds to the returns column 

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  data['return'] = np.nan # we creat a NaN column


In [48]:
def sub_data_slicing(data, m, k):
    '''
    The purpose of this function is to create a sub-dataframe of the data dataframe that only keeps m days of returns for each stock.
    
    Param : 

    - data : a data frame (data_statapp.csv)
    - m : number of days 
    - k : k-th slice of m-day range ( if we want the first m days k = 0, if we want the day m+1 to 2m, k = 1)
    '''


    sub_data_k = pd.DataFrame(index=data.index, columns=data.columns)

    number_stocks = len(data.index)

    for i in range(number_stocks):
        
        # use slicing in the first four columns to retrieve the desired interval of days transform 
        # the list into an array, then into a string for compatibility reasons 
        sub_data_k.iloc[i, 0] = str(np.array(data.iloc[i][0].replace('[', '').replace(']', '').split(', '), dtype=float)[k*m:(k+1)*m].tolist())
        sub_data_k.iloc[i, 1] = str(np.array(data.iloc[i][1].replace('[', '').replace(']', '').split(', '), dtype=float)[k*m:(k+1)*m].tolist())
        sub_data_k.iloc[i, 2] = str(np.array(data.iloc[i][2].replace('[', '').replace(']', '').split(', '), dtype=float)[k*m:(k+1)*m].tolist())
        sub_data_k.iloc[i, 3] = str(np.array(data.iloc[i][3].replace('[', '').replace(']', '').split(', '), dtype=float)[k*m:(k+1)*m].tolist())

    return sub_data_k

In [49]:
## We consider m = 250 days  

sub_data_slicing_1 = sub_data_slicing(data=data, m=250, k=0) ## the order of the first slice is 0
sub_data_slicing_2 = sub_data_slicing(data=data, m=250, k=1)

- Now that the preprocessing is complete, we need to calculate the adjacency matrices associated with each sub_data_slicing and implement the clustering in the SigNet package.

#### Idea - Similarity

A popular similarity measure in the literature is given by the **Pearson correlation coefficient** that measures linear dependence between variables and takes values in [−1, 1]. By interpreting the correlation matrix as a weighted network whose (signed) edge weights capture the pairwise correlations, we cluster the multivariate time series by clustering the underlying signed network. 

Pearson's correlation coefficient, when applied to a [[sample (statistics)|sample]], is commonly represented by $r_{xy}$ and may be referred to as the ''sample correlation coefficient'' or the ''sample Pearson correlation coefficient''. We can obtain a formula for $r_{xy}$ by substituting estimates of the covariances and variances based on a sample into the formula above. Given paired data $\left\{ (x_1,y_1),\ldots,(x_n,y_n) \right\}$ consisting of $n$ pairs, $r_{xy}$ is defined as

\begin{align}
r_{xy} =\frac{\sum ^n _{i=1}(x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum ^n _{i=1}(x_i - \bar{x})^2} \sqrt{\sum ^n _{i=1}(y_i - \bar{y})^2}}
\end{align}

where
- $n$ is sample size
- $x_i, y_i$ are the individual sample points indexed with $i$
- $\bar{x} = \frac{1}{n} \sum_{i=1}^n x_i$ (the sample mean); and analogously for $\bar{y}$.

In [53]:
def pearson_correlation(data):
    """
    the aim here is simply to calculate the correlation matrix 

    Param : 
    - data : sub_slicing_data

    """

    number_stocks = len(data.index)

    A = np.zeros((number_stocks, number_stocks)) # 

    for i in range(number_stocks):

        for j in range(i, number_stocks): ## the matrix A is symetric 

            if i == j:
                A[i, j] = 1 # because each vector is perfectly correlated with itself
            
            else:
                # beware, the code here is not very generic, as it is assumed that it is the third column 
                # of the dataframe entered as an argument, which contains data on returns
                returns_1 = np.array(data.iloc[i, 3].replace('[', '').replace(']', '').split(', '), dtype=float)
                returns_2 = np.array(data.iloc[j, 0].replace('[', '').replace(']', '').split(', '), dtype=float)
                A[i, j] = np.corrcoef(returns_1, returns_2)[0, 1]
    
    return pd.DataFrame(A + A.transpose() - np.eye(number_stocks)) # minus identity because A has ones in te diagonale and tA too



In [54]:
correlation_matrix_1 = pearson_correlation(sub_data_slicing_1)

- The correlation matrix can be seen as an adjency matrix of a signed graph

In [17]:
correlation_matrix_1

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,654,655,656,657,658,659,660,661,662,663
0,1.000000,0.124747,0.096005,-0.104243,0.024935,-0.120584,0.060576,-0.071991,-0.046851,0.081453,...,0.026084,-0.097194,0.091137,0.103709,-0.075844,-0.095874,0.008330,-0.116348,-0.010924,0.016789
1,0.124747,1.000000,0.132588,-0.044521,0.046348,-0.117968,0.126917,-0.126336,-0.064732,0.141934,...,0.050512,-0.128076,0.125788,0.122749,-0.098049,-0.093056,0.071754,-0.141819,-0.005190,-0.015522
2,0.096005,0.132588,1.000000,-0.083501,-0.123678,0.000747,-0.031094,0.042462,-0.099844,-0.030388,...,0.008553,-0.009336,-0.043670,0.004492,-0.055257,0.005406,0.056858,0.028547,-0.042840,0.074267
3,-0.104243,-0.044521,-0.083501,1.000000,0.039352,0.004479,0.041708,-0.004358,0.101048,0.048480,...,-0.060096,-0.057270,0.013832,-0.033977,-0.062804,-0.011303,0.002942,-0.016627,-0.022719,-0.081809
4,0.024935,0.046348,-0.123678,0.039352,1.000000,-0.193873,0.099445,0.056878,-0.125763,0.139407,...,0.046078,-0.190962,0.153213,0.157185,-0.133564,-0.009165,0.159740,-0.079543,0.172067,0.004246
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
659,-0.095874,-0.093056,0.005406,-0.011303,-0.009165,0.034194,0.010300,-0.029239,0.000802,-0.028263,...,-0.077848,-0.014481,-0.110007,-0.083268,-0.095057,1.000000,0.002685,-0.048460,-0.083224,0.017714
660,0.008330,0.071754,0.056858,0.002942,0.159740,-0.076665,-0.010315,-0.073714,0.046390,0.086273,...,0.049480,-0.069666,0.092894,0.026978,0.034323,0.002685,1.000000,0.043339,-0.012151,-0.081628
661,-0.116348,-0.141819,0.028547,-0.016627,-0.079543,0.086572,-0.018668,0.037909,-0.020903,-0.078098,...,-0.034196,0.050122,-0.069385,-0.002717,-0.043909,-0.048460,0.043339,1.000000,-0.015491,-0.055018
662,-0.010924,-0.005190,-0.042840,-0.022719,0.172067,-0.021062,0.066470,-0.139308,0.115735,0.060671,...,-0.045718,-0.150913,-0.051309,-0.058920,-0.072849,-0.083224,-0.012151,-0.015491,1.000000,0.045507


In [59]:
def signed_adjency(mat):
    '''

    The idea here is to use a matrix correlation to return two matrices, A_positive and A_negative, 
    corresponding to the positive and negative correlation matrices matrices

    Param :
    - data : Correlation matrix 
    
    '''

    
    A_pos = mat.applymap(lambda x: x if x >= 0 else 0)
    A_neg = mat.applymap(lambda x: abs(x) if x < 0 else 0)
    
    return A_pos, A_neg


In [60]:
A_pos, A_neg = signed_adjency(correlation_matrix_1)

#### **Mise en application avec SigNet**

On utilise désormais le package signet dont une documentation peut-être trouvée à l'adresse suivante : 

https://github.com/alan-turing-institute/SigNet/blob/master/README.md

On a copié tout le code de signet dans le fichier signet et on importe désormais tout le code dans ce fichier

In [61]:
import sys

sys.path.append(r'\Users\keteb\OneDrive\Bureau\github\Portfolio_clusturing')  

from signet.cluster import Cluster # get the function named cluster in signet package 

In [23]:
from scipy import sparse

def apply_SPONGE(correlation_matrix, k): 

    '''
    Idea :
    Given a correlation matrix obtained from a database and the pearson similarity, 
    return a vector associating to each asset the number of the cluster to which it belongs once SPONGE has been applied
       (from the bookmark package)

    Param : 

    - correlation_matrix : a square dataframe of size (number_of_stocks, number_of_stocks)
    - k : the number of clusters to identify. If a list is given, the output is a corresponding list

    Return : array of int, or list of array of int: Output assignment to clusters.

    '''

   # We respect the format imposed by signet package. 
   # To do this, we need to change the type of the A_pos and A_neg matrices, which cannot remain as data frames.
    A_pos, A_neg = signed_adjency(correlation_matrix)

    A_pos_sparse = sparse.csc_matrix(A_pos.values) 
    A_neg_sparse = sparse.csc_matrix(A_neg.values)

    data = (A_pos_sparse, A_neg_sparse)

    cluster = Cluster(data)

   #  Apply SPONGE method : clusters the graph using the Signed Positive Over Negative Generalised Eigenproblem (SPONGE) clustering.

    return cluster.SPONGE(k = 20)



In [62]:
result = apply_SPONGE(correlation_matrix_1, k=20) # apply the sponge method with 20 clusters 

In [65]:
result

array([13, 10,  9, 11, 14, 11, 14, 11,  9, 11, 19,  0, 10, 13,  0, 10, 11,
        0,  0,  0,  0,  0, 16, 13, 16, 19, 11, 17, 16, 11,  0, 14, 17, 14,
       14, 17, 11, 19,  6,  9, 17, 11, 17, 13, 12,  0, 14,  0,  0, 17, 10,
       10,  0, 16, 17, 19, 14,  0,  0,  9, 11,  9, 18, 11, 16, 16, 14,  2,
       14, 17, 14, 13, 10, 15, 11, 19, 11, 13,  9, 13, 13, 10, 14, 11,  0,
       13, 11, 19, 13,  9, 11, 14, 19, 14, 16, 13,  0, 11, 14, 13, 14, 19,
       11, 17, 10,  0,  0, 14, 13, 13, 16, 10,  9, 10, 16, 14,  0, 16, 11,
       16, 11, 11, 11, 13, 14, 11,  0, 14,  0, 13, 11, 14,  0, 14,  0, 11,
       10, 11, 10, 13, 14,  0, 13, 16, 17,  3, 10, 13, 11, 10, 13, 16, 11,
       19,  0, 11,  0, 13, 17,  9, 13,  0, 10, 11, 11,  4, 13, 15, 14, 11,
       11,  9,  0, 11, 13, 11, 13, 11, 11, 16, 10, 17, 14, 16, 16, 10,  0,
       17, 11, 13,  0, 17, 14, 13, 14,  0, 16, 11, 13,  0, 11, 16, 13,  0,
       11,  0, 13, 14, 11, 14,  0, 11, 14, 16,  9, 16, 17,  9, 17, 14, 17,
       14, 14, 17, 11,  0