## Clustering draft 1

**General idea** : test the functions defined in the module1.py file and make a first application.

**General plan** : 
1. Import the data using the yfinance module 

2. Repeat the clustering several times to get several clustering outcomes

3. Compute the weights of each stock within each cluster 

4. Compute the returns of each portfolio, where one portfolio corresponds to one clustering and is composed of 5 big assets (the 5 clusters). **In this notebook, we first fix the number of clusters at 5.**

Here are the main packages we use in this notebook:

In [1]:
import pandas as pd
import numpy as np
import yfinance as yf
from sklearn.pipeline import Pipeline
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler
from bs4 import BeautifulSoup
import requests 
from pypfopt.efficient_frontier import EfficientFrontier


We install PyPortfolioOpt from its GitHub repository, then import the module1.py file to use the functions we defined there.

In [2]:
pip install git+https://github.com/robertmartin8/PyPortfolioOpt.git

Collecting git+https://github.com/robertmartin8/PyPortfolioOpt.git
  Cloning https://github.com/robertmartin8/PyPortfolioOpt.git to /private/var/folders/q2/wg5gyfhj2r9cd97zfmckktvw0000gn/T/pip-req-build-8cgyk_kw
  Running command git clone --filter=blob:none --quiet https://github.com/robertmartin8/PyPortfolioOpt.git /private/var/folders/q2/wg5gyfhj2r9cd97zfmckktvw0000gn/T/pip-req-build-8cgyk_kw
  Resolved https://github.com/robertmartin8/PyPortfolioOpt.git to commit 30ab57147ba61eddc8301294a5b1c5ef260b23fa
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done
Note: you may need to restart the kernel to use updated packages.


In [3]:
## we make some manipulations to correctly import module1

import os
import sys

# we obtained the absolute path to our directory using
# the terminal (ls, pwd, cd)
module_1_directory = '/Users/khelifanail/Documents/GitHub/Portfolio_clustering_project/code'

# Add the path of the 'code' directory to the Python search path
sys.path.append(module_1_directory)

# Now module1 can be imported
import module1

### 1. Importation of data

We download various types of data (Open, High, Low, Close, Adj Close, Volume) for the S&P 500 tickers between two dates (start, end) into a pandas dataframe denoted `df`; **496** assets remain after cleaning.

We then store the returns of each of these assets in the dataframe `data`.

#### 1.1. Scraping the names of the assets of interest

- This is important because to import financial data for assets on Yahoo Finance, you need the tickers' names. This scraping work was performed in the file *get_S&P500_symbols.ipynb* and generated the *S&P500_symbols.csv* file.

- We rely on the csv package to read the *S&P500_symbols.csv* file and to create a list containing all the tickers

In [4]:
import csv 

with open('S&P500_symbols.csv', mode='r') as file:
    # Create a CSV reader
    csv_reader = csv.reader(file)
    
    # Transform the CSV data into a list
    stock_symbols = []
    for row in csv_reader:
        stock_symbols.append(row[0])

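# The `with` block closes the file automatically, so no explicit close() is needed

# Remove the header row (the popped value is displayed below)
stock_symbols.pop(0)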

'Symbols of S&P 500'

In [5]:
stock_symbols

['MMM',
 'AOS',
 'ABT',
 'ABBV',
 'ACN',
 'ADM',
 'ADBE',
 'ADP',
 'AES',
 'AFL',
 'A',
 'ABNB',
 'APD',
 'AKAM',
 'ALK',
 'ALB',
 'ARE',
 'ALGN',
 'ALLE',
 'LNT',
 'ALL',
 'GOOGL',
 'GOOG',
 'MO',
 'AMZN',
 'AMCR',
 'AMD',
 'AEE',
 'AAL',
 'AEP',
 'AXP',
 'AIG',
 'AMT',
 'AWK',
 'AMP',
 'AME',
 'AMGN',
 'APH',
 'ADI',
 'ANSS',
 'AON',
 'APA',
 'AAPL',
 'AMAT',
 'APTV',
 'ACGL',
 'ANET',
 'AJG',
 'AIZ',
 'T',
 'ATO',
 'ADSK',
 'AZO',
 'AVB',
 'AVY',
 'AXON',
 'BKR',
 'BALL',
 'BAC',
 'BBWI',
 'BAX',
 'BDX',
 'WRB',
 'BRK.B',
 'BBY',
 'BIO',
 'TECH',
 'BIIB',
 'BLK',
 'BX',
 'BK',
 'BA',
 'BKNG',
 'BWA',
 'BXP',
 'BSX',
 'BMY',
 'AVGO',
 'BR',
 'BRO',
 'BF.B',
 'BG',
 'CHRW',
 'CDNS',
 'CZR',
 'CPT',
 'CPB',
 'COF',
 'CAH',
 'KMX',
 'CCL',
 'CARR',
 'CTLT',
 'CAT',
 'CBOE',
 'CBRE',
 'CDW',
 'CE',
 'COR',
 'CNC',
 'CNP',
 'CDAY',
 'CF',
 'CRL',
 'SCHW',
 'CHTR',
 'CVX',
 'CMG',
 'CB',
 'CHD',
 'CI',
 'CINF',
 'CTAS',
 'CSCO',
 'C',
 'CFG',
 'CLX',
 'CME',
 'CMS',
 'KO',
 'CTSH',
 'CL',


#### 1.2. Creating the dataframe using pandas and yfinance 

In this case, we focus on the financial data of the assets during 2022 (from 2022-01-01 to 2022-12-01).

In [6]:
n_stocks = len(stock_symbols) # number of tickers = 502

start = "2022-01-01" # start date
end = "2022-12-01" # end date

df = yf.download(stock_symbols, start, end) # price data for the 502 tickers (yf.download already returns a DataFrame)
data = np.log(df["Close"]/df["Open"]).transpose() # open-to-close log returns, one row per ticker
data = data.dropna() # drop the tickers with missing values

[*********************100%%**********************]  502 of 502 completed


5 Failed downloads:
['BRK.B']: Exception('%ticker%: No timezone found, symbol may be delisted')
['VLTO', 'GEHC', 'KVUE']: Exception("%ticker%: Data doesn't exist for startDate = 1641013200, endDate = 1669870800")
['BF.B']: Exception('%ticker%: No price data found, symbol may be delisted (1d 2022-01-01 -> 2022-12-01)')
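
The two share-class tickers that failed above, `BRK.B` and `BF.B`, are written with a dot, whereas Yahoo Finance writes share classes with a hyphen (`BRK-B`, `BF-B`). A minimal sketch of a normalisation step that could be applied to `stock_symbols` before the download; `to_yahoo_symbols` is a hypothetical helper, not part of module1:

```python
def to_yahoo_symbols(symbols):
    """Replace the share-class dot with a hyphen (e.g. 'BRK.B' -> 'BRK-B')."""
    return [symbol.replace(".", "-") for symbol in symbols]


# Small illustrative list, not the full S&P 500 list
example_symbols = ["MMM", "BRK.B", "BF.B"]
yahoo_symbols = to_yahoo_symbols(example_symbols)
print(yahoo_symbols)
```

The three other failures (`VLTO`, `GEHC`, `KVUE`) report that no data exists for the requested date range, so renaming would not help for those.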





The 5 failed downloads, along with any other ticker that has missing values, are dropped by `dropna()`. Of the initial 502 tickers, 496 remain, so the shape of `data` is (496, 230).
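
The construction of `data` can be checked on a small synthetic example. The sketch below uses made-up prices and hypothetical tickers (`AAA`, `BBB`), not market data, and reproduces the two steps of the cell above: open-to-close log returns with tickers as rows, then removal of the tickers with missing values.

```python
import numpy as np
import pandas as pd

dates = pd.to_datetime(["2022-01-03", "2022-01-04", "2022-01-05"])

# Made-up prices; BBB has one missing opening price
open_prices = pd.DataFrame(
    {"AAA": [100.0, 101.0, 102.0], "BBB": [50.0, np.nan, 51.0]}, index=dates
)
close_prices = pd.DataFrame(
    {"AAA": [101.0, 100.0, 103.0], "BBB": [51.0, 50.5, 50.0]}, index=dates
)

# Open-to-close log returns, transposed so that each row is a ticker
log_returns = np.log(close_prices / open_prices).transpose()

# dropna() removes every row (ticker) that contains at least one NaN
clean_returns = log_returns.dropna()
print(clean_returns.shape)
```

Because `dropna()` works row by row after the transpose, a single missing day is enough to remove a ticker entirely.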






### 2. We repeat the clustering step several times to start preparing multiple micro-portfolios

**General idea**: The multiple_clusterings function takes as arguments:

1. *n_repeat*: an integer that corresponds to the number of times we want to train (.fit()) the model passed as an argument on the data.

2. *data*: a pandas dataframe (like the one we generated above) that corresponds to the financial data on which the model will be trained.

3. *model*: a sklearn clustering model (for now, we will test it with KMeans).

4. *model_name*: the name of the model, necessary for creating the pipeline properly.

**Outcome**: a pandas dataframe storing the cluster label of each stock for each clustering, and a second dataframe containing the centroids of the clusters for each clustering 

We first test it with the K-Means model, fixing the number of clusters at 5
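
To make the expected inputs and outputs concrete, here is a minimal sketch of what such a repeated clustering could look like. It is an assumption about the general approach, not the code of module1: `repeat_clustering` is a hypothetical name, it only supports KMeans, and it runs on random synthetic returns.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler


def repeat_clustering(n_repeat, data, n_clusters=5):
    """Hypothetical stand-in for module1.multiple_clusterings (KMeans only)."""
    cluster_names = [f"Cluster {k + 1}" for k in range(n_clusters)]
    labels = pd.DataFrame(index=data.index)
    centroids = pd.DataFrame(index=cluster_names)
    for i in range(n_repeat):
        pipe = Pipeline([
            ("scaler", StandardScaler()),
            # n_init is set explicitly; a different seed is used for each repetition
            ("kmeans", KMeans(n_clusters=n_clusters, n_init=10, random_state=i)),
        ])
        pipe.fit(data)
        kmeans = pipe.named_steps["kmeans"]
        name = f"Clustering n°{i + 1}"
        labels[name] = kmeans.labels_
        # One list of coordinates (one value per day) per cluster
        centroids[name] = pd.Series(
            [center.tolist() for center in kmeans.cluster_centers_],
            index=cluster_names,
        )
    return labels, centroids


# Synthetic returns: 30 hypothetical stocks observed over 20 days
rng = np.random.default_rng(0)
toy_data = pd.DataFrame(
    rng.normal(scale=0.01, size=(30, 20)),
    index=[f"S{i}" for i in range(30)],
)
toy_labels, toy_centroids = repeat_clustering(3, toy_data)
print(toy_labels.shape, toy_centroids.shape)
```

In this sketch the centroids live in the standardised space produced by `StandardScaler`, not in the space of raw returns. Setting `n_init` explicitly is also the usual way to avoid the scikit-learn warning about its changing default value.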

In [7]:
model = KMeans(n_clusters=5)
model_name = 'kmeans'

Y, C = module1.multiple_clusterings(10, data, model, model_name)
print(Y, C)

  super()._check_params_vs_input(X, default_n_init=10)


      Clustering n°1  Clustering n°2  Clustering n°3  Clustering n°4  \
A                  1               1               1               4   
AAL                0               4               3               2   
AAPL               1               1               1               4   
ABBV               3               0               2               0   
ABNB               0               4               3               2   
...              ...             ...             ...             ...   
YUM                1               1               1               4   
ZBH                1               3               1               4   
ZBRA               1               1               4               3   
ZION               4               3               1               4   
ZTS                1               1               1               4   

      Clustering n°5  Clustering n°6  Clustering n°7  Clustering n°8  \
A                  2               1               0           

In [8]:
Y

Unnamed: 0,Clustering n°1,Clustering n°2,Clustering n°3,Clustering n°4,Clustering n°5,Clustering n°6,Clustering n°7,Clustering n°8,Clustering n°9,Clustering n°10
A,1,1,1,4,2,1,0,2,3,1
AAL,0,4,3,2,0,0,4,4,0,2
AAPL,1,1,1,4,3,1,0,2,2,1
ABBV,3,0,2,0,1,2,2,1,4,3
ABNB,0,4,3,2,0,0,4,4,0,2
...,...,...,...,...,...,...,...,...,...,...
YUM,1,1,1,4,2,1,0,2,3,1
ZBH,1,3,1,4,2,1,0,2,3,1
ZBRA,1,1,4,3,3,3,3,2,2,4
ZION,4,3,1,4,2,1,0,3,3,1


### 3. Compute the weights of the stocks in each cluster 

#### 3.1. Get the cluster compositions each time 

**Why do this?** The purpose of this work is to compute the returns of each cluster (seen as a new "fictive" asset composed of "real" assets), and this is easier if we know each cluster's composition in terms of tickers, because the financial data are indexed by ticker.
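
As an illustration of this step, the sketch below turns a small table of labels into lists of tickers per cluster. `composition_from_labels` is a hypothetical helper written for this example; it orders the clusters by label, which may differ from the ordering used by `module1.cluster_composition`.

```python
import pandas as pd


def composition_from_labels(labels):
    """For each clustering (column), list the tickers assigned to each cluster label."""
    n_clusters = int(labels.to_numpy().max()) + 1
    cluster_names = [f"Cluster {k + 1}" for k in range(n_clusters)]
    columns = {}
    for col in labels.columns:
        tickers_per_cluster = [
            labels.index[(labels[col] == k).to_numpy()].tolist()
            for k in range(n_clusters)
        ]
        columns[col] = pd.Series(tickers_per_cluster, index=cluster_names)
    return pd.DataFrame(columns)


# Toy labels for four tickers and two clusterings (made-up values)
toy_labels = pd.DataFrame(
    {"Clustering n°1": [1, 0, 1, 2], "Clustering n°2": [0, 1, 0, 2]},
    index=["A", "AAL", "AAPL", "ABBV"],
)
toy_composition = composition_from_labels(toy_labels)
print(toy_composition)
```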

In [9]:
Y_symbol = module1.cluster_composition(Y)
Y_symbol

Unnamed: 0,Clustering n°1,Clustering n°2,Clustering n°3,Clustering n°4,Clustering n°5,Clustering n°6,Clustering n°7,Clustering n°8,Clustering n°9,Clustering n°10
Cluster 1,"[A, AAPL, ABT, ACN, ADBE, ADI, ADP, AES, AKAM,...","[A, AAPL, ABT, ACN, ADI, ADP, AES, AKAM, ALLE,...","[A, AAPL, ABT, ACN, ADP, AES, AIG, AKAM, ALLE,...","[A, AAPL, ABT, ACN, ADP, AES, AIG, AJG, AKAM, ...","[A, ABT, ACN, ADP, AES, AIG, AJG, AKAM, ALLE, ...","[A, AAPL, ABT, ACN, ADP, AES, AIG, AKAM, ALLE,...","[A, AAPL, ABT, ACN, ADP, AES, AIG, AJG, AKAM, ...","[A, AAPL, ABT, ACN, ADP, AES, AJG, AKAM, ALLE,...","[A, ABT, ACN, ADP, AES, AIG, AJG, AKAM, ALLE, ...","[A, AAPL, ABT, ACN, ADP, AES, AIG, AKAM, ALLE,..."
Cluster 2,"[AAL, ABNB, ADSK, ALB, ALGN, AMAT, AMD, AMZN, ...","[AAL, ABNB, ADBE, ADSK, ALB, ALGN, AMAT, AMD, ...","[AAL, ABNB, ALK, APTV, BA, BBWI, BKNG, CCL, CZ...","[AAL, ABNB, ALK, APTV, BA, BBWI, BBY, BKNG, BX...","[AAL, ABNB, ALK, APTV, BA, BBWI, BKNG, CCL, CZ...","[AAL, ABNB, ALK, APTV, BA, BBWI, BKNG, CCL, CZ...","[AAL, ABNB, ALK, APTV, BA, BBWI, BBY, BKNG, CC...","[AAL, ABNB, ADBE, ADI, ADSK, ALB, ALGN, AMAT, ...","[AAL, ABNB, ALK, APTV, BA, BBWI, BKNG, CCL, CZ...","[AAL, ABNB, ALK, APTV, BA, BBWI, BKNG, CCL, CZ..."
Cluster 3,"[ABBV, ACGL, ADM, AEE, AEP, AFL, AIZ, AJG, ALL...","[ABBV, ACGL, ADM, AEE, AEP, AFL, AIZ, AJG, ALL...","[ABBV, ACGL, ADM, AEE, AEP, AFL, AIZ, AJG, ALL...","[ABBV, ACGL, ADM, AEE, AEP, AFL, AIZ, ALL, AMG...","[AAPL, ADBE, ADI, ADSK, ALB, ALGN, AMAT, AMD, ...","[ABBV, ACGL, ADM, AEE, AEP, AFL, AIZ, AJG, ALL...","[ABBV, ACGL, ADM, AEE, AEP, AFL, AIZ, ALL, AMG...","[ABBV, ACGL, ADM, AEE, AEP, AFL, AIZ, ALL, AMC...","[AAPL, ADBE, ADI, ADSK, ALB, ALGN, AMAT, AMD, ...","[ABBV, ACGL, ADM, AEE, AEP, AFL, AIZ, AJG, ALL..."
Cluster 4,"[AIG, ALK, AMP, AVY, AXP, BA, BAC, BEN, BK, BW...","[AIG, ALK, AMP, AVY, AXP, BA, BAC, BALL, BBY, ...","[ADBE, ADI, ADSK, ALB, ALGN, AMAT, AMD, AMZN, ...","[ADBE, ADI, ADSK, ALB, ALGN, AMAT, AMD, AMZN, ...","[ABBV, ACGL, ADM, AEE, AEP, AFL, AIZ, ALL, AMG...","[ADBE, ADI, ADSK, ALB, ALGN, AMAT, AMD, AMZN, ...","[ADBE, ADI, ADSK, ALB, ALGN, AMAT, AMD, AMZN, ...","[AIG, ALK, AMP, AXP, BA, BAC, BEN, BK, BWA, BX...","[ABBV, ACGL, ADM, AEE, AEP, AFL, AIZ, ALL, AMG...","[ADBE, ADI, ADSK, ALB, ALGN, AMAT, AMD, AMZN, ..."
Cluster 5,"[APA, BKR, CF, COP, CTRA, CVX, DVN, EOG, EQT, ...","[APA, BKR, CF, COP, CTRA, CVX, DVN, EOG, EQT, ...","[APA, BKR, CF, COP, CTRA, CVX, DVN, EOG, EQT, ...","[APA, BKR, CF, COP, CTRA, CVX, DVN, EOG, EQT, ...","[APA, BKR, CF, COP, CTRA, CVX, DVN, EOG, EQT, ...","[APA, BKR, CF, COP, CTRA, CVX, DVN, EOG, EQT, ...","[APA, BKR, CF, COP, CTRA, CVX, DVN, EOG, EQT, ...","[APA, BKR, CF, COP, CTRA, CVX, DVN, EOG, EQT, ...","[APA, BKR, CF, COP, CTRA, CVX, DVN, EOG, EQT, ...","[APA, BKR, CF, COP, CTRA, CVX, DVN, EOG, EQT, ..."


In [10]:
Y_symbol.columns

Index(['Clustering n°1', 'Clustering n°2', 'Clustering n°3', 'Clustering n°4',
       'Clustering n°5', 'Clustering n°6', 'Clustering n°7', 'Clustering n°8',
       'Clustering n°9', 'Clustering n°10'],
      dtype='object')

#### 3.2. Compute the weights of each asset within a cluster

**Why do this?** Each stock in a cluster has a different weight, because stocks are "representative" of their cluster to different extents. So far, we have built the weight of a stock from its $\mathcal{L}^2$ distance to the cluster's centroid. We can therefore see the weights as a kind of "inertia contribution" of the stock to the cluster.

##### 3.2.1. Analysis of the centroids 

- The `multiple_clusterings()` function returns two dataframes, the second of which (`C`) contains the centroid of each cluster. Each centroid is an array containing the "average" return for each day, where each daily average is taken over the returns of all the stocks in the corresponding cluster. 

- Here is an example:


In [11]:
C

Unnamed: 0,Clustering n°1,Clustering n°2,Clustering n°3,Clustering n°4,Clustering n°5,Clustering n°6,Clustering n°7,Clustering n°8,Clustering n°9,Clustering n°10
Cluster 1,"[0.22570570203763612, -1.064410565739268, -1.1...","[-0.022359339051409872, 0.02410534906210998, 0...","[1.5002065476511288, 1.0570143644422916, -0.31...","[-0.0042706263815413936, 0.008306660270957772,...","[1.061575996971384, -0.07455866218353864, -0.7...","[1.0818821358026836, -0.10293663224269473, -0....","[-0.21801994777423442, 0.21258383574749243, 0....","[1.5002065476511288, 1.0570143644422916, -0.31...","[1.0818821358026836, -0.10293663224269473, -0....","[1.5002065476511288, 1.0570143644422916, -0.31..."
Cluster 2,"[-0.5506535450274851, -0.35881129892114266, -0...","[-0.7079255623369027, -0.42236187436659356, -0...","[-0.19970203859444624, 0.19188206203603164, 0....","[1.5002065476511288, 1.0570143644422916, -0.31...","[-0.01355299793268889, -0.016196698341003096, ...","[-0.2193473978188367, 0.18334534706217956, 0.0...","[1.5002065476511288, 1.0570143644422916, -0.31...","[-0.010699393762917278, 0.009519698604763602, ...","[1.5002065476511288, 1.0570143644422916, -0.31...","[-0.17728704794979327, 0.2383289002185109, 0.0..."
Cluster 3,"[1.5371658112463586, 1.1520938247611001, -0.38...","[1.5002065476511288, 1.0570143644422916, -0.31...","[-0.061057196885880585, -0.03151338142432732, ...","[0.7319135360221356, -0.10235771868505306, -0....","[-0.1778054770541758, 0.24307238385577018, 0.0...","[-0.024011412937610967, -0.014560644562048955,...","[-0.009068819147147501, -0.008333068457933319,...","[-0.6418836855700921, -0.2970691093525735, -0....","[-0.3329347097534501, -1.0342619646614706, -1....","[1.0818821358026836, -0.10293663224269473, -0...."
Cluster 4,"[-0.012485403892586915, 0.018799742394001607, ...","[0.3682070445455948, 0.7044640189998757, 0.214...","[1.1301311810523154, -0.11219298097335409, -0....","[-0.06799306985647646, -1.0573149661235084, -1...","[-0.32423057429420576, -1.0232387908135685, -1...","[-0.1754528900478655, -1.1365329519416718, -1....","[-0.25594876226768776, -1.1625746749393877, -1...","[0.4431207869880619, 0.767029938534176, 0.2179...","[-0.16894505940844626, 0.24942832796256945, 0....","[-0.07168793234656273, -0.049355268970075826, ..."
Cluster 5,"[0.33993009128536267, 0.824346600227613, 0.209...","[0.168969578999866, -1.09751363474114, -1.2008...","[-0.17409859643615738, -1.138527299010478, -1....","[-0.26376452179957, 0.1417764783505492, 0.0341...","[1.5002065476511288, 1.0570143644422916, -0.31...","[1.5002065476511288, 1.0570143644422916, -0.31...","[0.7165797843155612, -0.006659895620634168, -0...","[0.1741501230036918, -1.0590245171558095, -1.1...","[-0.02078067958632562, -0.024783727498455022, ...","[-0.23752576235501585, -1.158431753465747, -1...."


##### 3.2.2. Computing the weights of each cluster

- In module1, there are two functions, *cluster_weights* and *gaussian_weights*, which, given one cluster and its centroid, return the weights of the stocks in the cluster. These weights are derived from the distance of each stock to the cluster's centroid.

- Continuing the previous example, we take the first cluster (Cluster 1 of Clustering n°1) and compute the weights of the stocks in this cluster. 
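
The exact formulas live in module1 and are not reproduced here. As a sketch of two plausible distance-based schemes, assume that the "inertia" weight of stock $i$ is $d_i^2 / \sum_j d_j^2$ and that the Gaussian weight is proportional to $\exp(-d_i^2 / 2\sigma^2)$, where $d_i$ is the $\mathcal{L}^2$ distance of the stock to the centroid. The function names, the choice $\sigma = 1$ and the random data below are all assumptions made for illustration.

```python
import numpy as np


def inertia_weights(returns, centroid):
    """Share of each stock in the cluster's total squared distance to the centroid."""
    sq_dist = np.sum((returns - centroid) ** 2, axis=1)
    return sq_dist / sq_dist.sum()


def gaussian_kernel_weights(returns, centroid, sigma=1.0):
    """Weights that decay with the squared distance to the centroid, normalised to 1."""
    sq_dist = np.sum((returns - centroid) ** 2, axis=1)
    kernel = np.exp(-sq_dist / (2.0 * sigma ** 2))
    return kernel / kernel.sum()


# Synthetic cluster: 4 hypothetical stocks observed over 6 days
rng = np.random.default_rng(0)
cluster_returns = rng.normal(size=(4, 6))
cluster_centroid = cluster_returns.mean(axis=0)

w_inertia = inertia_weights(cluster_returns, cluster_centroid)
w_gauss = gaussian_kernel_weights(cluster_returns, cluster_centroid)
print(w_inertia.sum(), w_gauss.sum())
```

The two schemes pull in opposite directions: the inertia share grows with the distance to the centroid, while the Gaussian weight favours the stocks closest to it. Which behaviour is wanted depends on how "representative" is meant to be interpreted.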


In [12]:
Y_symbol.head(2)

Unnamed: 0,Clustering n°1,Clustering n°2,Clustering n°3,Clustering n°4,Clustering n°5,Clustering n°6,Clustering n°7,Clustering n°8,Clustering n°9,Clustering n°10
Cluster 1,"[A, AAPL, ABT, ACN, ADBE, ADI, ADP, AES, AKAM,...","[A, AAPL, ABT, ACN, ADI, ADP, AES, AKAM, ALLE,...","[A, AAPL, ABT, ACN, ADP, AES, AIG, AKAM, ALLE,...","[A, AAPL, ABT, ACN, ADP, AES, AIG, AJG, AKAM, ...","[A, ABT, ACN, ADP, AES, AIG, AJG, AKAM, ALLE, ...","[A, AAPL, ABT, ACN, ADP, AES, AIG, AKAM, ALLE,...","[A, AAPL, ABT, ACN, ADP, AES, AIG, AJG, AKAM, ...","[A, AAPL, ABT, ACN, ADP, AES, AJG, AKAM, ALLE,...","[A, ABT, ACN, ADP, AES, AIG, AJG, AKAM, ALLE, ...","[A, AAPL, ABT, ACN, ADP, AES, AIG, AKAM, ALLE,..."
Cluster 2,"[AAL, ABNB, ADSK, ALB, ALGN, AMAT, AMD, AMZN, ...","[AAL, ABNB, ADBE, ADSK, ALB, ALGN, AMAT, AMD, ...","[AAL, ABNB, ALK, APTV, BA, BBWI, BKNG, CCL, CZ...","[AAL, ABNB, ALK, APTV, BA, BBWI, BBY, BKNG, BX...","[AAL, ABNB, ALK, APTV, BA, BBWI, BKNG, CCL, CZ...","[AAL, ABNB, ALK, APTV, BA, BBWI, BKNG, CCL, CZ...","[AAL, ABNB, ALK, APTV, BA, BBWI, BBY, BKNG, CC...","[AAL, ABNB, ADBE, ADI, ADSK, ALB, ALGN, AMAT, ...","[AAL, ABNB, ALK, APTV, BA, BBWI, BKNG, CCL, CZ...","[AAL, ABNB, ALK, APTV, BA, BBWI, BKNG, CCL, CZ..."


#### **Example**: 
- To take a basic example, we consider the first cluster of the first clustering (if we refer to the Y_symbol cell above this corresponds to *Y_symbol.iloc[0, 0]* for the tickers and to *C.iloc[0, 0]* for the centroids).

- We then compute the $\mathcal{L}^2$ weights and the Gaussian weights for this cluster 



In [13]:
# We consider the first cluster of the first clustering (location [0, 0])

cluster = Y_symbol.iloc[0,0]
centroid = C.iloc[0,0]

# We then compute the weights (L2 and Gaussian) corresponding to
# the stocks in this cluster

weights_L2 = module1.cluster_weights(cluster, centroid, data)
weights_gaussian = module1.gaussian_weights(cluster, centroid, data)


In [14]:
cluster_composition = cluster 
micro_portfolio_return = pd.DataFrame(index=cluster_composition, columns=data.columns).transpose()
micro_portfolio_return

Date,A,AAPL,ABT,ACN,ADBE,ADI,ADP,AES,AKAM,ALLE,...,VRSN,WAT,WST,WTW,XRAY,XYL,YUM,ZBH,ZBRA,ZTS
2022-01-03,,,,,,,,,,,...,,,,,,,,,,
2022-01-04,,,,,,,,,,,...,,,,,,,,,,
2022-01-05,,,,,,,,,,,...,,,,,,,,,,
2022-01-06,,,,,,,,,,,...,,,,,,,,,,
2022-01-07,,,,,,,,,,,...,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2022-11-23,,,,,,,,,,,...,,,,,,,,,,
2022-11-25,,,,,,,,,,,...,,,,,,,,,,
2022-11-28,,,,,,,,,,,...,,,,,,,,,,
2022-11-29,,,,,,,,,,,...,,,,,,,,,,


- We make sure that the weights add up to one

In [15]:
print(weights_L2.sum(axis=1))
print(weights_gaussian.sum(axis=1))

0    1.0
dtype: float64
0    1.0
dtype: float64


In [53]:
weights_gaussian 

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,142,143,144,145,146,147,148,149,150,151
0,0.006283,0.009284,0.003134,0.006667,0.020377,0.012203,0.00341,0.00474,0.004349,0.004301,...,0.005927,0.004063,0.006111,0.003542,0.003636,0.004811,0.00334,0.00416,0.01452,0.003864


### 4. Compute the returns of each portfolio

- **Idea**: We consider that one portfolio corresponds to one clustering and is composed of 5 big assets (the 5 clusters) [*in this notebook, we first fix the number of clusters at 5*].

To get an idea of how we built the clustering_return() function, look at the shape of cluster_data (the return data associated with the tickers in the cluster) and at the shape of the weights of the stocks in this cluster.

In [16]:
cluster_data = data.loc[cluster]
print(cluster_data.shape)
print(weights_gaussian.shape)

(152, 230)
(1, 152)


#### **Remarks**:

- We see that to get the return of each cluster, we have to multiply each column of the cluster_data dataframe (one column per trading day) by the weights of the stocks, i.e. the transpose of the one-row weights dataframe, and sum over the stocks.

- The **clustering_return** function applies to a **clustering**, i.e. one column of the Y_symbol dataframe (the result of the cluster_composition() function). 
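
The shapes above, (152, 230) for the returns and (1, 152) for the weights, suggest a matrix product that yields one return per day. A minimal sketch on made-up numbers, with three hypothetical tickers and two days; it shows the weighted sum only and is not the code of `clustering_return`:

```python
import numpy as np
import pandas as pd

dates = pd.to_datetime(["2022-01-03", "2022-01-04"])

# Rows are stocks, columns are days (same layout as cluster_data)
toy_cluster_data = pd.DataFrame(
    [[0.01, -0.02], [0.03, 0.00], [-0.01, 0.04]],
    index=["AAA", "BBB", "CCC"],
    columns=dates,
)

# One row of weights, one column per stock (same layout as weights_gaussian)
toy_weights = pd.DataFrame([[0.5, 0.3, 0.2]])

# (1, n_stocks) @ (n_stocks, n_days) -> (1, n_days): the cluster's daily returns
cluster_return = toy_weights.to_numpy() @ toy_cluster_data.to_numpy()
print(cluster_return.shape)
```

Working on the underlying NumPy arrays avoids pandas trying to align the integer column labels of the weights with the ticker index of the returns.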

In [17]:
## We take as an example the first clustering of the Y_symbol dataframe
clustering_composition_1 = pd.DataFrame(Y_symbol.iloc[:, 0])
clustering_centroids_1 = pd.DataFrame(C.iloc[:, 0])

cluster_1 = clustering_composition_1.iloc[0, 0]
centroid_1 = clustering_centroids_1.iloc[0, 0]

In [18]:
module1.clustering_return(clustering_composition_1, clustering_centroids_1, data)

Unnamed: 0,Clustering n°1
Cluster 1,"[-0.012250254624485923, -0.003701860329726867,..."
Cluster 2,"[0.002552074787850512, -0.016921042346380102, ..."
Cluster 3,"[-0.0017642845952690288, 0.0033333034148049986..."
Cluster 4,"[0.005073581199086526, 0.018293685102378946, -..."
Cluster 5,"[0.02842256037628322, 0.024378112270331972, -0..."


In [23]:
Z = module1.clustering_return(Y_symbol, C, data)
Z

Unnamed: 0,Clustering n°1,Clustering n°2,Clustering n°3,Clustering n°4,Clustering n°5,Clustering n°6,Clustering n°7,Clustering n°8,Clustering n°9,Clustering n°10
Cluster 1,"[-0.012250254624485923, -0.003701860329726867,...","[-0.01531157786515177, -0.0048544302000855894,...","[-0.005404173746804517, 0.00656127101769116, -...","[-0.006659652960128889, 0.005616771696404612, ...","[-0.004966421784293264, 0.007512319089111065, ...","[-0.0057797609563924775, 0.006398309547821224,...","[-0.005837035992531932, 0.006983001717030174, ...","[-0.014023166222497326, -0.0025328563888639428...","[-0.0047938971637641525, 0.007629452941698897,...","[-0.004968285414934997, 0.0074230457502375575,..."
Cluster 2,"[0.002552074787850512, -0.016921042346380102, ...","[0.0014639522569251141, -0.017516563990495924,...","[0.020284973269521094, 0.0010583584288118585, ...","[0.012734273970879088, 0.00109175814314365, -0...","[0.01910961979670621, 0.0016979716279987385, -...","[0.0192778901044051, 0.001248932606635834, -0....","[0.012439742108075266, 0.002868794642612815, -...","[0.0018474192288009906, -0.01664385140684447, ...","[0.01954588543049361, 0.001079475659532826, -0...","[0.019298147351016693, 0.0012691564055324787, ..."
Cluster 3,"[-0.0017642845952690288, 0.0033333034148049986...","[-0.001959041300033884, 0.0034315774163352825,...","[-0.0027083258262663268, 0.002398760941834942,...","[-0.0016052054753844358, 0.003136385309621592,...","[-0.007928190987665965, -0.015815880964399887,...","[-0.0019927786060868068, 0.0027115320164120523...","[-0.0017037030615390223, 0.0028264614384716485...","[-0.0017699980588669964, 0.0031200644297932844...","[-0.007994257089869432, -0.016289584354304266,...","[-0.0029211543743727403, 0.0020617073562927365..."
Cluster 4,"[0.005073581199086526, 0.018293685102378946, -...","[0.00568128070409561, 0.016098175418814527, -0...","[-0.004902145407383425, -0.018186862766110656,...","[-0.0028582732879514055, -0.016717800421057333...","[-0.0017970276442460518, 0.002665793279090164,...","[-0.004941911984258264, -0.018179026412874232,...","[-0.006502308587832744, -0.018658616384662593,...","[0.007136691294444611, 0.01725303905978361, -0...","[-0.001926815358363909, 0.0025552523634554927,...","[-0.006178275168988836, -0.01850710066083942, ..."
Cluster 5,"[0.02842256037628322, 0.024378112270331972, -0...","[0.02768956210290857, 0.02261472922749256, -0....","[0.027689014900776236, 0.022613414430467164, -...","[0.027642178144216253, 0.022502836469670644, -...","[0.02772172266571096, 0.022657567660824593, -0...","[0.02772172266571096, 0.022657567660824593, -0...","[0.027694023768062532, 0.022618284577694613, -...","[0.027689543284871252, 0.02261449967604481, -0...","[0.02767735725875011, 0.022589998465235284, -0...","[0.027688540598528207, 0.022612977760373274, -..."


In [34]:
returns = Z.iloc[:, 0]
len(returns.values)

5

In [35]:
np_returns = np.array([returns.iloc[i] for i in range(len(returns.values))])

In [42]:
np_returns.T

array([[-1.22502546e-02,  2.55207479e-03, -1.76428460e-03,
         5.07358120e-03,  2.84225604e-02],
       [-3.70186033e-03, -1.69210423e-02,  3.33330341e-03,
         1.82936851e-02,  2.43781123e-02],
       [-2.20535444e-02, -3.94158733e-02, -6.57107767e-03,
        -1.38813365e-02, -2.47953669e-02],
       ...,
       [-9.54817091e-03, -1.02797376e-02, -5.70136415e-03,
        -1.36851160e-02,  5.30717307e-05],
       [-3.54061181e-04, -3.92398976e-03,  4.28129818e-03,
         8.26605290e-03,  4.09940307e-03],
       [ 3.08955246e-02,  4.48948956e-02,  2.04974956e-02,
         1.86668659e-02, -7.13534295e-03]])

In [47]:
expected_returns = np.mean(np_returns.T, axis=0)
cov_matrix = np.cov(np_returns.T, rowvar=False)

In [51]:
expected_returns 

array([-3.23447972e-04, -5.40822425e-04,  1.08025485e-04, -6.96866657e-05,
        1.59471396e-03])

In [50]:
module1.markowitz(expected_returns=expected_returns, cov_matrix=cov_matrix)


ValueError: at least one of the assets must have an expected return exceeding the risk-free rate
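
The error message says that no asset has an expected return above the risk-free rate. The values in `expected_returns` are daily means (about $1.6 \times 10^{-3}$ at most), so they fall below any annual risk-free rate of a few percent. The sketch below illustrates this with the values printed above; the 2% risk-free rate and the 252 trading days per year are assumptions, since the internals of `module1.markowitz` are not shown in this notebook.

```python
import numpy as np

# Daily expected returns of the five clusters, copied from the output above
daily_expected_returns = np.array(
    [-3.23447972e-04, -5.40822425e-04, 1.08025485e-04,
     -6.96866657e-05, 1.59471396e-03]
)

risk_free_rate = 0.02  # assumed annual risk-free rate
trading_days = 252     # assumed number of trading days per year

# Comparing daily means with an annual rate: no cluster passes the check
daily_ok = bool((daily_expected_returns > risk_free_rate).any())

# After a simple linear annualisation, at least one cluster passes it
annualised_returns = daily_expected_returns * trading_days
annual_ok = bool((annualised_returns > risk_free_rate).any())

print(daily_ok, annual_ok)
```

Under these assumptions, annualising the daily means (and the covariance matrix, consistently) or using a risk-free rate expressed per day would likely avoid the error.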