**STEP 0**: data preparation

In [6]:
import process 
import pandas as pd

# Nail path : '/Users/khelifanail/Documents/GitHub/Portfolio_clustering_project/Data/DATA_Statapp.csv'
# Jerome path : 'C:\Users\33640\OneDrive\Documents\GitHub\Portfolio_clustering_project\Data\DATA_Statapp.csv'
# Mohamed path : '/Users/khelifanail/Documents/GitHub/Portfolio_clustering_project/Data/DATA_Statapp.csv'
df = pd.read_csv('/Users/khelifanail/Documents/GitHub/Portfolio_clustering_project/Data/DATA_Statapp.csv')

# Apply conversion function to 'open' and 'close' columns
df['open'] = df['open'].apply(process.safe_literal_eval)
df['close'] = df['close'].apply(process.safe_literal_eval)

# Compute the open-to-close return for every entry of each row
# (loop variables are named o and c to avoid shadowing the built-in open)
df['return'] = df.apply(lambda row: [(c - o) / o for o, c in zip(row['open'], row['close'])], axis=1)

new_df = df[['ticker', 'return']]  # keep only the 'ticker' and 'return' columns

# Build the DataFrame from the lists stored in 'return'
# ('new_df' is defined above and contains the 'return' column)

# Expand each list in the 'return' column into separate columns of a new DataFrame
returns_df = pd.DataFrame(new_df['return'].tolist())

# Insert the 'ticker' column from 'new_df' at the front of 'returns_df'
returns_df.insert(0, 'ticker', new_df['ticker'])

# Rename the columns to show that they hold returns
returns_df.columns = ['ticker'] + [f'return_{i}' for i in range(len(returns_df.columns) - 1)]

df_cleaned = process.remove_rows_with_nan(returns_df)
df_cleaned.reset_index(drop=True, inplace=True)

process.check_nan_inf(df_cleaned)

df_cleaned.shape

There are no NaN values in the dataframe


(632, 5532)
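
The reshaping done in the cell above can be checked on a toy example. The sketch below uses made-up prices (not the project data) and only pandas. It computes the open-to-close return `(close - open) / open` for each day, then spreads each list into one column per day.

```python
import pandas as pd

# Toy data standing in for DATA_Statapp.csv (values are made up for illustration)
toy = pd.DataFrame({
    "ticker": ["AAA", "BBB"],
    "open": [[10.0, 11.0, 12.0], [20.0, 19.0, 21.0]],
    "close": [[11.0, 11.0, 9.0], [19.0, 19.0, 42.0]],
})

# Open-to-close return for each day: (close - open) / open
toy["return"] = [
    [(c - o) / o for o, c in zip(opens, closes)]
    for opens, closes in zip(toy["open"], toy["close"])
]

# One row per ticker, one column per day
wide = pd.DataFrame(toy["return"].tolist())
wide.columns = [f"return_{i}" for i in range(wide.shape[1])]
wide.insert(0, "ticker", toy["ticker"])

print(wide)
```

With 632 tickers and 5531 days, the same layout gives the shape `(632, 5532)` reported above: one `ticker` column plus one column per return.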

**STEP 1**: training phase

1. Compute the correlation matrix of the assets over a 30-day (one-month) lookback window

In [7]:
lookback_window = 30
correlation_matrix = process.correlation_matrix(lookback_window, df_cleaned)
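
The source does not show the body of `process.correlation_matrix`. As a rough sketch of what such a helper might do, the hypothetical function `correlation_matrix_sketch` below assumes that it takes the first `lookback_window` return columns and computes the Pearson correlation between tickers. Both the name and the choice of window are assumptions.

```python
import numpy as np
import pandas as pd

def correlation_matrix_sketch(lookback_window, df_cleaned):
    """Hypothetical stand-in for process.correlation_matrix (assumed behaviour)."""
    # Keep the first `lookback_window` return columns, indexed by ticker
    window = df_cleaned.set_index("ticker").iloc[:, :lookback_window]
    # Transpose so that each column is an asset, then correlate the columns
    return window.T.corr(method="pearson")

# Toy data: 4 assets, 40 days of random returns
rng = np.random.default_rng(0)
toy = pd.DataFrame(rng.normal(size=(4, 40)),
                   columns=[f"return_{i}" for i in range(40)])
toy.insert(0, "ticker", ["A", "B", "C", "D"])

corr = correlation_matrix_sketch(30, toy)
print(corr.round(2))
```

The result is a symmetric asset-by-asset matrix with ones on the diagonal, which is the shape a clustering step on assets would expect.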

2. Obtain the composition of each cluster and its centroid

In [8]:
## ROUNDING ISSUE

cluster_composition = process.cluster_composition_and_centroid(
    df_cleaned=df_cleaned,
    correlation_matrix=correlation_matrix,
    number_of_clusters=20,
    lookback_window=lookback_window,
)

  A_pos = mat.applymap(lambda x: x if x >= 0 else 0)
  A_neg = mat.applymap(lambda x: abs(x) if x < 0 else 0)
  super()._check_params_vs_input(X, default_n_init=10)
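
The first two lines of the output above are source lines quoted by a pandas warning: `DataFrame.applymap` is deprecated since pandas 2.1 in favour of `DataFrame.map`. The third line is quoted by scikit-learn and is most likely the `KMeans` warning about the default value of `n_init`, which can be silenced by passing `n_init` explicitly. As a sketch, the positive/negative split can be written without any element-wise lambda. The matrix `mat` below is a made-up example, not the project's correlation matrix.

```python
import numpy as np
import pandas as pd

# Made-up signed correlation matrix for three assets
mat = pd.DataFrame(
    [[1.0, -0.3, 0.5],
     [-0.3, 1.0, -0.8],
     [0.5, -0.8, 1.0]],
    index=["A", "B", "C"], columns=["A", "B", "C"],
)

# Positive part: keep entries >= 0, set negative entries to 0
A_pos = mat.clip(lower=0)

# Negative part: absolute value of the negative entries, 0 elsewhere
A_neg = (-mat).clip(lower=0)

print(A_pos)
print(A_neg)
```

Both parts are non-negative and satisfy `A_pos - A_neg == mat`, which matches what the two `applymap` lambdas compute.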


**STEP 2**: portfolio construction

1. Within each cluster, every asset receives a weight based on its distance to the cluster centroid. These weights are used later to compute the return of each cluster, which is then treated as a new synthetic asset.

In [12]:
constituent_weights = process.constituent_weights(df_cleaned=df_cleaned, cluster_composition=cluster_composition, sigma=2, lookback_window=30)
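
The weighting formula itself is not shown in the source. One common choice for a distance-based weight with a `sigma` parameter is a Gaussian kernel, `exp(-d^2 / (2 * sigma^2))`, where `d` is the Euclidean distance to the centroid. The sketch below assumes that form. The function `gaussian_weights` is hypothetical and may differ from what `process.constituent_weights` does.

```python
import numpy as np

def gaussian_weights(returns, centroid, sigma):
    """Assumed weighting: Gaussian kernel of the distance to the centroid.

    returns:  array of shape (n_assets, n_days)
    centroid: array of shape (n_days,)
    """
    distances = np.linalg.norm(returns - centroid, axis=1)
    return np.exp(-distances**2 / (2 * sigma**2))

# Made-up returns for three assets over three days
returns = np.array([[0.01, 0.02, -0.01],
                    [0.00, 0.03, -0.02],
                    [0.05, -0.04, 0.03]])
centroid = returns.mean(axis=0)

weights = gaussian_weights(returns, centroid, sigma=2)
print(weights)
```

Under this assumption, small distances combined with `sigma=2` give weights just below 1, and the asset farthest from the centroid gets the smallest weight.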

In [13]:
constituent_weights

[['cluster 1',
  [['AES', 0.9970978754845263],
   ['AIN', 0.9972482346922176],
   ['AIR', 0.9942589531926821],
   ['APH', 0.9924945550102521],
   ['ARE', 0.99935100579242],
   ['ATO', 0.9984349530434254],
   ['AVT', 0.9985750101168249],
   ['AWR', 0.9995283375636095],
   ['B', 0.999022879914758],
   ['BF', 0.9991326826193415],
   ['BRC', 0.9985203022957694],
   ['CHD', 0.9969361338323695],
   ['CLB', 0.9973385805481073],
   ['CMC', 0.9992315521989659],
   ['DE', 0.997833569494923],
   ['DHI', 0.9966809358319166],
   ['EMF', 0.9989230884500082],
   ['ETN', 0.9986889033451672],
   ['GF', 0.9993507358835538],
   ['GIS', 0.9988199299090078],
   ['GLW', 0.9903352308599289],
   ['GWW', 0.9975677918242448],
   ['KEP', 0.9983038462087516],
   ['KEX', 0.998405801339682],
   ['LH', 0.996438150733555],
   ['MAC', 0.9993278019020138],
   ['MTN', 0.9973734131632114],
   ['MTX', 0.9980301351607513],
   ['MTZ', 0.9981020247825863],
   ['NL', 0.9993406043670703],
   ['NSL', 0.9988042343494895],
   ['P

Choosing the expected returns (expected_returns) in the Markowitz model can be challenging, because it requires a forecast for every asset in the portfolio.

In [15]:
## retrieve the DataFrame containing the returns of each cluster

cluster_return = process.cluster_return(constituent_weights=constituent_weights, df_cleaned=df_cleaned, lookback_window=30)

## build the covariance matrix of the cluster returns
## (EfficientFrontier expects a covariance matrix; .corr(method='pearson') would give a correlation matrix instead)

cov_matrix = cluster_return.cov()

## build the vector of expected returns, one entry per cluster
expected_returns = cluster_return.mean(axis=0)  ## here the mean return is used as the target
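
To make the "cluster as a synthetic asset" idea concrete, the sketch below assumes that a cluster's return on a given day is the weighted average of its constituents' returns, with the weights normalised to sum to 1. The function `synthetic_return` is hypothetical. The source does not show how `process.cluster_return` combines the weights.

```python
import numpy as np

def synthetic_return(weights, returns):
    """Assumed aggregation: normalised weighted average of constituent returns.

    weights: sequence of length n_assets
    returns: array of shape (n_assets, n_days)
    """
    w = np.asarray(weights, dtype=float)
    w = w / w.sum()                  # normalise so the weights sum to 1
    return w @ np.asarray(returns)   # one return per day

# Two assets, two days (made-up values)
weights = [0.5, 1.5]
returns = [[0.02, -0.01],
           [0.04, 0.03]]

cluster_series = synthetic_return(weights, returns)
print(cluster_series)
```

Here the normalised weights are 0.25 and 0.75, so the first day's cluster return is `0.25 * 0.02 + 0.75 * 0.04`.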

In [16]:
cluster_return

Unnamed: 0,cluster 1,cluster 2,cluster 3,cluster 4,cluster 5,cluster 6,cluster 7,cluster 8,cluster 9,cluster 10,cluster 11,cluster 12,cluster 13,cluster 14,cluster 15,cluster 16,cluster 17,cluster 18,cluster 19
0,-1.031158,-0.352216,-0.48378,0.129413,-0.703082,-0.150042,-0.060518,-1.244276,-1.311404,-0.453082,-0.764747,-0.616582,-1.277044,-1.367721,-0.567812,-0.810596,0.157285,0.044745,0.160795
1,-1.061591,-0.316128,-0.596005,-0.356512,0.342245,0.362692,-0.085533,-1.290937,-1.719999,-0.208099,-1.016466,0.040645,-0.428302,-0.08559,-0.257147,0.074357,0.272219,-0.140634,0.069556
2,0.112163,-0.012761,0.528013,0.170657,1.207466,0.005092,0.059951,0.042938,-0.035207,0.186231,0.712066,0.157305,0.844239,0.118531,-0.016709,0.443778,0.490922,-0.003505,-0.055192
3,0.074652,-0.171928,0.01143,0.16226,-0.101058,-0.134401,-0.166652,1.345179,-0.81779,0.507309,1.205334,0.061629,0.955521,1.643397,0.226538,0.258025,0.51409,-0.1536,0.04503
4,0.892544,0.226668,0.303452,-0.152385,0.320757,-0.192172,0.563591,1.180801,1.742968,0.086272,2.973239,0.968011,0.024103,0.437556,0.043502,0.173556,0.160206,-0.141196,0.053245
5,0.010602,0.092401,0.126738,-0.235425,-0.169456,0.331606,-0.087189,-0.696151,1.785187,0.015773,-0.562498,-0.167009,-0.337863,-0.263303,-0.214606,0.190404,0.209753,0.225469,-0.128275
6,-0.347499,-0.219483,-0.416551,0.597725,-0.12043,0.111955,-0.54922,-0.149946,-0.898245,0.089779,-0.09862,0.056843,-0.25148,0.171738,-0.296984,-0.487177,-0.599305,-0.111118,-0.145315
7,0.225454,-0.085263,-0.022003,-0.233531,0.219223,-0.035144,-0.253262,0.150366,-0.100841,-0.088883,-0.366896,-0.420115,0.234006,-0.394242,0.030688,-0.341268,-0.018473,-0.012253,0.06664
8,0.547164,0.142805,0.536365,0.137862,0.035938,0.053906,0.51148,0.494983,-0.057981,-0.161256,0.715527,-0.391836,0.249539,0.61912,0.161172,0.541552,-0.169741,-0.009943,0.074941
9,-0.033915,0.03293,0.169417,0.019878,-0.139199,-0.200462,-0.210957,-0.072279,0.268005,-0.000149,-0.56863,0.512223,-0.44824,0.14466,-0.298853,0.303666,-0.090521,-0.029239,0.035719


In [213]:
from pypfopt.efficient_frontier import EfficientFrontier

# Assuming risk_free_rate is the appropriate value for your analysis
risk_free_rate = 0.02

ef = EfficientFrontier(expected_returns, cov_matrix)
weights = ef.max_sharpe(risk_free_rate=risk_free_rate)

ValueError: at least one of the assets must have an expected return exceeding the risk-free rate
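
This error means that no entry of `expected_returns` exceeds `risk_free_rate`. One possible cause is a unit mismatch: 0.02 reads like an annual rate, while `expected_returns` is a mean of per-period returns over the lookback window. The source does not state the units of the returns, so this is only a hypothesis. The sketch below converts an annual rate to a per-period rate and checks the condition before optimising. The helper names and the 252-day convention are assumptions.

```python
import pandas as pd

TRADING_DAYS = 252  # assumption: daily returns, 252 trading days per year

def per_period_rate(annual_rate, periods_per_year=TRADING_DAYS):
    """Convert an annual rate to the equivalent compounded per-period rate."""
    return (1 + annual_rate) ** (1 / periods_per_year) - 1

def can_max_sharpe(expected_returns, risk_free_rate):
    """True if at least one expected return exceeds the risk-free rate."""
    return bool((expected_returns > risk_free_rate).any())

# Made-up daily expected returns for two clusters
expected = pd.Series({"cluster 1": 0.0004, "cluster 2": -0.0002})

daily_rf = per_period_rate(0.02)
print(can_max_sharpe(expected, 0.02))      # annual rate against daily returns
print(can_max_sharpe(expected, daily_rf))  # rates expressed in the same unit
```

If the check still fails once the units agree, every cluster's mean return is genuinely below the risk-free rate over the window, and a different objective (for example minimum volatility) may be more appropriate than the maximum Sharpe ratio.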