## SigNet Package test

**General idea** : test the functions defined in the module1.py file and make a first application.

**General plan** : 
1. Import the data using the DropBox dataset

2. Repeat the clustering several times to get several clustering outcomes

3. Compute the weights of each stock within each cluster

4. Compute the returns of each portfolio. Each portfolio corresponds to one clustering outcome and is composed of 5 aggregate assets, one per cluster. **In this notebook, we first fix the number of clusters at 5.**

Here are the main packages we use in this notebook

In [1]:
import pandas as pd
import numpy as np
import yfinance as yf
from sklearn.pipeline import Pipeline
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler
from bs4 import BeautifulSoup
import requests 
from pypfopt.efficient_frontier import EfficientFrontier

In [2]:
pip -q install git+https://github.com/robertmartin8/PyPortfolioOpt.git

Note: you may need to restart the kernel to use updated packages.


We also import the module1.py file to use the functions we defined there.

In [3]:
## add the directory containing module1 to the Python search path so it can be imported

import os
import sys

# absolute path to our directory, obtained from the terminal (ls, pwd, cd)
module_1_directory = '/Users/khelifanail/Documents/GitHub/Portfolio_clustering_project/code'

# add the 'code' directory to the Python search path
sys.path.append(module_1_directory)

# module1 can now be imported
import module1

We also import the Adjency class defined in the adjency.py file.

In [4]:
## add the directory containing adjency.py to the Python search path so it can be imported

import os
import sys

# absolute path to our directory, obtained from the terminal (ls, pwd, cd)
adjency_directory = '/Users/khelifanail/Documents/GitHub/Portfolio_clustering_project/code'

# add the 'code' directory to the Python search path
sys.path.append(adjency_directory)

# the Adjency class can now be imported
from adjency import Adjency

### 1. Importation of data

We download various types of data (Open, High, Low, Close, Adj Close, Volume) for **496** assets between two periods (start, end) into a pandas dataframe denoted as df.

We then store the returns of each of these assets in 'data'.

In [5]:
## beware: this absolute path must be adapted if Jerome or Mohamed run the notebook

adjency = Adjency(pd.read_csv('/Users/khelifanail/Documents/GitHub/Portfolio_clustering_project/Data/DATA_Statapp.csv'))

In [6]:
adjency.clean_data()



#### Idea - Similarity

A popular similarity measure in the literature is given by the **Pearson correlation coefficient** that measures linear dependence between variables and takes values in [−1, 1]. By interpreting the correlation matrix as a weighted network whose (signed) edge weights capture the pairwise correlations, we cluster the multivariate time series by clustering the underlying signed network. 
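To make the "signed network" reading concrete, here is a minimal sketch that splits a correlation matrix into a positive part and a negative part, each holding non-negative edge weights. The helper name `signed_adjacency` and the toy matrix are illustrative assumptions, not part of module1.py or adjency.py, and whether the clustering routine used later expects exactly this input format is also an assumption.

```python
import numpy as np

def signed_adjacency(corr):
    """Split a correlation matrix into positive and negative edge weights.

    Hypothetical helper for illustration only.
    """
    w = np.asarray(corr, dtype=float).copy()
    np.fill_diagonal(w, 0.0)           # drop self-loops
    a_pos = np.where(w > 0, w, 0.0)    # weights of positively correlated pairs
    a_neg = np.where(w < 0, -w, 0.0)   # magnitudes of negatively correlated pairs
    return a_pos, a_neg

# Toy correlation matrix for three assets (made-up values)
corr = np.array([[1.0, 0.8, -0.3],
                 [0.8, 1.0, -0.5],
                 [-0.3, -0.5, 1.0]])
a_pos, a_neg = signed_adjacency(corr)
```

The signed weights can be recovered as `a_pos - a_neg`, so no information other than the diagonal is lost by the split.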

Pearson's correlation coefficient, when applied to a sample, is commonly denoted $r_{xy}$ and may be referred to as the *sample correlation coefficient* or the *sample Pearson correlation coefficient*. We obtain a formula for $r_{xy}$ by substituting sample estimates of the covariance and variances into the population definition $\rho_{X,Y} = \operatorname{cov}(X,Y) / (\sigma_X \sigma_Y)$. Given paired data $\left\{ (x_1,y_1),\ldots,(x_n,y_n) \right\}$ consisting of $n$ pairs, $r_{xy}$ is defined as

\begin{align}
r_{xy} =\frac{\sum ^n _{i=1}(x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum ^n _{i=1}(x_i - \bar{x})^2} \sqrt{\sum ^n _{i=1}(y_i - \bar{y})^2}}
\end{align}

where
- $n$ is sample size
- $x_i, y_i$ are the individual sample points indexed with $i$
- $\bar{x} = \frac{1}{n} \sum_{i=1}^n x_i$ (the sample mean); and analogously for $\bar{y}$.
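The formula above can be checked numerically. The sketch below implements it term by term and compares the result with `np.corrcoef`, which is the function used later in this notebook. The function name `pearson_r` and the sample values are illustrative assumptions.

```python
import numpy as np

def pearson_r(x, y):
    """Sample Pearson correlation, written to mirror the formula for r_xy."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    dx = x - x.mean()                  # x_i - x_bar
    dy = y - y.mean()                  # y_i - y_bar
    numerator = np.sum(dx * dy)
    denominator = np.sqrt(np.sum(dx ** 2)) * np.sqrt(np.sum(dy ** 2))
    return float(numerator / denominator)

# Made-up return series for two assets
x = [0.01, -0.02, 0.03, 0.00, -0.01]
y = [0.02, -0.01, 0.02, 0.01, -0.03]

r = pearson_r(x, y)
r_numpy = float(np.corrcoef(x, y)[0, 1])   # should agree with r up to rounding
```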

In [14]:
adjency2 = Adjency(pd.read_csv('/Users/khelifanail/Documents/GitHub/Portfolio_clustering_project/Data/DATA_Statapp.csv'))

In [19]:
data = adjency2.data[['ticker', 'open', 'close', 'volume']]

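tickers = data['ticker']  # renamed from `list`, which shadowed the Python built-in
data = data.drop(columns='ticker')
# data.set_index(tickers) returns a new DataFrame; its result was never assigned,
# so data keeps its integer index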

## we now compute the open-to-close returns

n = data.shape[0]

df2 = pd.DataFrame(index=data.index, columns=['return', 'volume'])
df2['volume'] = data['volume']  # volume
for i in range(n):
    # each cell holds a series serialized as the string '[v1, v2, ...]'
    x = np.array(data.iloc[i, 0].replace('[', '').replace(']', '').split(', '), dtype=float)  # open
    y = np.array(data.iloc[i, 1].replace('[', '').replace(']', '').split(', '), dtype=float)  # close
    z = (y - x) / x  # return: (close - open) / open
    data_z = [str(elem) for elem in z]
    df2.iloc[i, 0] = '[' + ', '.join(data_z) + ']'





In [20]:
df2

Unnamed: 0,return,volume
0,"[-0.012957317073170731, 0.010093167701863354, ...","[1551299, 2234799, 3121599, 4494699, 4534699, ..."
1,"[-0.009146341463414634, 0.012422360248447204, ...","[120800, 62400, 27400, 63900, 60500, 113100, 3..."
2,"[-0.0070921985815602835, -0.012704174228675136...","[4774099, 4818899, 5262299, 7846599, 7072899, ..."
3,"[-0.03542780748663102, -0.04329608938547486, 0...","[1827799, 1266599, 1614000, 1300500, 945300, 1..."
4,"[0.0, 0.005291005291005291, -0.015789473684210...","[893200, 986900, 986800, 816300, 1076000, 1346..."
...,...,...
659,"[-0.028225806451612902, -0.015789473684210527,...","[212900, 177200, 124600, 135800, 69400, 41900,..."
660,"[-0.01650943396226415, -0.007269789983844911, ...","[6728599, 7255399, 8742500, 9730699, 8302799, ..."
661,"[0.034759358288770054, -0.046511627906976744, ...","[8772500, 6223899, 6259099, 4348299, 6237000, ..."
662,"[-0.03241491085899514, -0.011824324324324325, ...","[545200, 595800, 833400, 709500, 1089299, 9941..."


In [13]:
x = np.array(B.iloc[0,0].replace('[', '').replace(']', '').split(', '), dtype=float)
y = np.array(B.iloc[1,0].replace('[', '').replace(']', '').split(', '), dtype=float)

ValueError: could not convert string to float: '-0.01295732  0.01009317  0.04878049 ...  0.00845594 -0.00787534\n  0.00709939'
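The error message suggests that the cell being parsed held whitespace-separated values with an embedded newline, which is what the string form of a NumPy array looks like, rather than the comma-separated form `'[a, b, c]'` that `split(', ')` expects. Below is a minimal sketch of a parser that accepts both layouts. The helper name `parse_series` and the sample strings are illustrative assumptions. Note that a string containing `...` has already lost its middle values and cannot be recovered by any parser, so the sketch rejects it explicitly.

```python
import numpy as np

def parse_series(text):
    """Parse '[a, b, c]' or NumPy-style '[a  b\n  c]' into a float array.

    Hypothetical helper for illustration only.
    """
    cleaned = text.replace('[', ' ').replace(']', ' ').replace(',', ' ')
    tokens = cleaned.split()   # split() without arguments also handles '\n'
    if '...' in tokens:
        raise ValueError("series was truncated with '...' when it was converted to a string")
    return np.array(tokens, dtype=float)

comma_form = '[-0.0129, 0.0101, 0.0488]'
numpy_form = '[-0.0129  0.0101\n  0.0488]'

from_comma = parse_series(comma_form)
from_numpy = parse_series(numpy_form)
```

Storing the series as actual numeric arrays (or one column per date) instead of strings would avoid the round trip through text altogether.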

In [12]:
x

array([1551299., 2234799., 3121599., ..., 4022735., 4507541., 3943966.])

In [18]:
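## n_stocks = B.shape[0]

n_stocks = 100  # only the first 100 rows are used here

A = np.zeros((n_stocks, n_stocks))

# fill the diagonal and the upper triangle only; the matrix is symmetrized in the next cell
for i in range(n_stocks):
    for j in range(i, n_stocks):
        if i == j:
            A[i, j] = 1
        else:
            # NOTE: column 1 is 'volume' in df2; if B has the same layout,
            # column 0 is the one to use for correlating returns
            x = np.array(B.iloc[i, 1].replace('[', '').replace(']', '').split(', '), dtype=float)
            y = np.array(B.iloc[j, 1].replace('[', '').replace(']', '').split(', '), dtype=float)
            A[i, j] = np.corrcoef(x, y)[0, 1]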


In [20]:
B = pd.DataFrame(A + A.transpose() - np.eye(n_stocks))
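The double loop followed by the symmetrization `A + A.transpose() - np.eye(n_stocks)` can be collapsed into a single call when all series have the same length, because `np.corrcoef` treats each row of a 2-D array as one variable. The sketch below checks this equivalence on synthetic data. The array `X` is a made-up stand-in for the parsed series, not the notebook's data.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(4, 250))   # synthetic stand-in: 4 series of 250 observations

# Vectorized: full symmetric correlation matrix in one call
C = np.corrcoef(X)

# Loop version, mirroring the cells above, for comparison
n = X.shape[0]
A = np.zeros((n, n))
for i in range(n):
    for j in range(i, n):
        A[i, j] = 1.0 if i == j else np.corrcoef(X[i], X[j])[0, 1]
C_loop = A + A.T - np.eye(n)    # mirror the upper triangle, keep a unit diagonal
```

Besides being shorter, the vectorized form parses and centers each series once instead of once per pair.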

In [24]:
l = B[B <= 0].index

In [25]:
l

RangeIndex(start=0, stop=100, step=1)
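The result above is the full index because indexing a DataFrame with a boolean DataFrame keeps the original shape and fills the non-matching entries with NaN, so no row is dropped. If the goal was to locate pairs of stocks with non-positive correlation (an assumption about the intent), the sketch below shows one way to extract them. The toy matrix `C` and the variable names are illustrative.

```python
import numpy as np
import pandas as pd

# Toy symmetric correlation matrix (made-up values)
C = pd.DataFrame([[1.0, 0.4, -0.2],
                  [0.4, 1.0, 0.3],
                  [-0.2, 0.3, 1.0]])

masked = C[C <= 0]              # same 3x3 shape, NaN wherever the condition is False
full_index = masked.index       # still covers every row

# Pairs (i, j) with i < j whose correlation is non-positive
is_non_positive = C.to_numpy() <= 0
rows, cols = np.where(np.triu(is_non_positive, k=1))
non_positive_pairs = list(zip(rows.tolist(), cols.tolist()))

# Rows that contain at least one non-positive correlation
rows_with_any = C.index[(C <= 0).any(axis=1)]
```

Restricting to the upper triangle with `np.triu(..., k=1)` avoids reporting each symmetric pair twice.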