## DATA PREPARATION
In this notebook we prepare the data used to compare the clustering techniques. We pick n stocks at random from the k available ones. The resulting data is clustered in different ways, and each clustering yields a different set of portfolios. Once the portfolios are created, we compute the return and the risk of each one.
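
As a rough illustration of the last step, the sketch below computes the return and the risk of a single portfolio from a small table of daily log returns. The equal weighting, the use of the mean daily return as "return" and of the standard deviation as "risk", the helper name `portfolio_return_and_risk`, and the toy tickers are all assumptions made for the example; the notebook does not specify them here.

```python
import numpy as np
import pandas as pd

def portfolio_return_and_risk(log_returns):
    """Hypothetical helper: mean and standard deviation of the daily
    simple returns of an equally weighted portfolio (assumed weighting)."""
    # Log returns do not add across assets, so convert to simple returns first
    simple = np.expm1(log_returns)
    weights = np.full(simple.shape[1], 1.0 / simple.shape[1])
    portfolio = simple.dot(weights)  # one portfolio return per day
    return portfolio.mean(), portfolio.std()

# Toy log returns for two made-up tickers
toy = pd.DataFrame({"AAA": [0.01, -0.02, 0.00], "BBB": [0.00, 0.01, -0.01]})
ret, risk = portfolio_return_and_risk(toy)
print(ret, risk)
```

Other weighting schemes only change the `weights` vector; the rest of the computation stays the same.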

In [6]:
#imports
import random
import numpy as np
import pandas as pd

In [79]:
def drop_nans(df, limit=0.5):
    """Given a dataframe and a limit, first drop every column whose fraction
    of NaNs exceeds limit, then drop every row containing at least one NaN."""
    row_num = df.shape[0]
    nan_col = df.isnull().sum(axis=0) > (row_num * limit)
    df = df.drop(columns=df.columns[nan_col])
    return df.dropna()
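
To make the two-stage cleaning concrete, here is a small self-contained run on a toy dataframe. The column names and values are made up for illustration, and the function body is repeated in compact form only so that the example runs on its own.

```python
import numpy as np
import pandas as pd

def drop_nans(df, limit=0.5):
    # Same logic as the notebook's function, repeated so the example runs alone
    nan_col = df.isnull().sum(axis=0) > (df.shape[0] * limit)
    return df.drop(columns=df.columns[nan_col]).dropna()

toy = pd.DataFrame({
    "A": [np.nan, np.nan, np.nan, 0.01],  # 3 of 4 values missing: column dropped
    "B": [np.nan, 0.02, -0.01, 0.00],     # 1 of 4 missing: column kept, row 0 dropped
    "C": [0.01, 0.00, 0.03, -0.02],       # complete
})
clean = drop_nans(toy)
print(clean)
```

Column `A` is removed in the first stage because more than half of its values are missing; row 0 is removed in the second stage because `B` is still missing there.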

In [90]:
def pick_n_from_k(df, n, seed=0, onlynames=False):
    """Given a dataframe with k columns and a number n, return a dataframe
    that contains n randomly selected columns of the input dataframe."""
    # k is the number of columns (stocks), not the number of rows
    k = df.shape[1]

    # safety check
    assert k >= n, 'K should be >= N'

    # if onlynames is active, return only the names of the columns
    # (this path uses random.sample, so for the same seed it is not
    # guaranteed to select the same columns as df.sample below)
    if onlynames:
        random.seed(seed)
        return random.sample(list(df.columns), n)

    return df.sample(n=n, random_state=seed, axis='columns')
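
The function above draws names with `random.sample` and columns with `DataFrame.sample`, which rely on different random generators, so the same seed can give different selections in the two modes. One possible way to keep them in agreement, sketched here under the hypothetical name `pick_columns` and with made-up tickers, is to derive the names from the sampled dataframe.

```python
import pandas as pd

def pick_columns(df, n, seed=0, only_names=False):
    """Hypothetical variant: both modes share DataFrame.sample, so a given
    seed selects the same columns whether names or data are requested."""
    if n > df.shape[1]:
        raise ValueError("n must not exceed the number of columns")
    picked = df.sample(n=n, random_state=seed, axis="columns")
    return list(picked.columns) if only_names else picked

# Toy dataframe with four made-up tickers
toy = pd.DataFrame({c: [0.0, 0.1] for c in ["AAA", "BBB", "CCC", "DDD"]})
names = pick_columns(toy, 2, seed=1, only_names=True)
frame = pick_columns(toy, 2, seed=1)
print(names == list(frame.columns))
```

Raising `ValueError` instead of using `assert` is a design choice of this sketch: assertions are skipped when Python runs with `-O`.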

In [87]:
DF=pd.read_parquet("us_equities_logreturns.parquet")
display(DF.head(2))
DF = drop_nans(DF)
display(DF.head(2))
print(list(DF.columns))

Unnamed: 0,^GSPC,GE,IBM,DIS,BA,CAT,AA,HPQ,DD,KO,...,NSM,CLP,CTX,CTR,DYN,AIB,KIM,SFN,TCO,S
0,,,,,,,,,,,...,,,,,,,,,,
1,0.01134,,,,,,,,,,...,,,,,,,,,,


Unnamed: 0,^GSPC,GE,IBM,DIS,BA,CAT,AA,HPQ,DD,KO,...,NBL,MAT,JCP,AVT,THC,GRA,LPX,VLO,WMB,TXI
8033,-0.02216,0.0,-0.022696,0.0,-0.030214,-0.006135,-0.013606,-0.029414,-0.026387,0.0,...,-0.024693,-0.047791,0.0,-0.010292,0.00681,0.0,0.0,-0.01227,-0.04879,-0.003431
8034,-0.007273,-0.021506,-0.004313,-0.012739,0.0,-0.003082,-0.006873,-0.015038,-0.005362,0.0,...,-0.051293,0.0,0.0,-0.003454,-0.034526,0.0,-0.00597,-0.063716,-0.025318,-0.006897


['^GSPC', 'GE', 'IBM', 'DIS', 'BA', 'CAT', 'AA', 'HPQ', 'DD', 'KO', 'XOM', 'PG', 'JNJ', 'CVX', 'MCD', 'MRK', 'UTX', 'MMM', 'MO', 'HON', 'ED', 'GT', 'AEP', 'FL', 'MRO', 'DTE', 'IP', 'CNP', 'NAV', 'WMT', 'BMY', 'BP', 'LMT', 'C', 'KR', 'AET', 'XRX', 'F', 'DOW', 'PEP', 'CL', 'UIS', 'GD', 'WY', 'AXP', 'FRM', 'ASA', 'EXC', 'UNP', 'EIX', 'LUV', 'FDX', 'PCG', 'R', 'CNW', 'MOT', 'CSX', 'SLB', 'APA', 'HUM', 'DBD', 'BC', 'OXY', 'GLW', 'PFE', 'COP', 'LLY', 'PBI', 'RSH', 'RTN', 'TXN', 'SO', 'ETR', 'HAL', 'NOC', 'AVP', 'HRS', 'CSC', 'IFF', 'SKY', 'MDT', 'ROK', 'EMR', 'DE', 'NBL', 'MAT', 'JCP', 'AVT', 'THC', 'GRA', 'LPX', 'VLO', 'WMB', 'TXI']


In [95]:
# Fixing the seed makes the selection reproducible: the same seed always gives the same result
for i in range(2):
    tst = pick_n_from_k(DF, 5, i, onlynames=True)
    display(tst)

for i in range(2):
    tst = pick_n_from_k(DF, 5, i)
    print('Here is the risK:')
    display(tst.std(axis=0))

    display(tst)

['EIX', 'R', 'CAT', 'C', 'COP']

['MMM', 'ETR', 'DD', 'LMT', 'MRK']

Here is the risK:


IBM    0.017847
BMY    0.017522
CSX    0.020357
UTX    0.017615
CVX    0.016797
dtype: float64

Unnamed: 0,IBM,BMY,CSX,UTX,CVX
8033,-0.022696,-0.010363,-0.030305,0.000000,-0.038343
8034,-0.004313,0.000000,0.015267,0.000000,-0.023065
8035,0.000000,0.000000,0.000000,0.000000,-0.010050
8036,0.000000,0.010363,-0.015267,0.008511,-0.006757
8037,0.000000,-0.020834,-0.047253,-0.034486,-0.024015
...,...,...,...,...,...
15345,0.002549,0.003423,0.002012,0.000272,0.011890
15346,0.005513,-0.001282,0.000502,-0.002453,0.001948
15347,0.001012,-0.003857,0.001506,-0.003144,0.002429
15348,0.000650,0.000858,0.002504,-0.001644,-0.003768


Here is the risK:


CL     0.017127
BP     0.017509
FRM    0.036611
JCP    0.022924
DE     0.021591
dtype: float64

Unnamed: 0,CL,BP,FRM,JCP,DE
8033,0.000000,-0.025001,-0.065018,0.000000,-0.022141
8034,-0.051293,0.000000,0.007648,0.000000,-0.015038
8035,0.000000,-0.006349,0.000000,0.023530,0.000000
8036,0.051293,-0.012821,0.000000,0.000000,0.015038
8037,-0.051293,-0.042840,-0.042808,-0.023530,-0.022642
...,...,...,...,...,...
15345,0.004566,0.003154,-0.014620,0.003190,-0.000385
15346,-0.004027,-0.003417,0.020409,0.003180,0.001540
15347,-0.001885,-0.001581,-0.013072,-0.007968,-0.000128
15348,0.000808,0.006571,0.010182,-0.002884,-0.003084
