## Machine Learning
### Clustering

This notebook shows an example of how to cluster the IBEX 35 companies
by their monthly returns, using the K-Means algorithm.

In [1]:
import numpy as np
import pandas as pd

import matplotlib.pyplot as plt

In [2]:
import pickle

Pickle file with the historical price series of the IBEX 35 constituents

In [3]:
with open('../data/stock_data.pkl', 'rb') as handle:
    stock_data = pickle.load(handle)

In [4]:
close_dict = {ticker: df.close for ticker, df in stock_data.items()} 
close_data = pd.DataFrame(close_dict) 

We will cluster the IBEX companies using as features
their monthly returns from January 2018 onward
(roughly the last year and a half of data)

In [5]:
close_df = close_data['2018-01-02':]

In [6]:
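# Last close of each month; drop tickers that have any missing month
month_close = close_df.resample('M').last().dropna(axis=1)
# Month-over-month simple returns; the first row is NaN, so skip it
month_ret = month_close.pct_change()[1:]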
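
As a rough illustration of the two steps above (month-end close, then simple return), here is a minimal sketch on made-up data. The tickers `AAA` and `BBB`, the random prices, and the `toy_*` names are assumptions for illustration only. Grouping by calendar period is used instead of `resample('M')` so the sketch does not depend on which month-end alias the installed pandas version accepts.

```python
import numpy as np
import pandas as pd

# Hypothetical daily closes for two made-up tickers (not IBEX data).
idx = pd.date_range("2018-01-02", "2018-04-30", freq="B")
rng = np.random.default_rng(0)
toy_close = pd.DataFrame(
    100 * np.exp(np.cumsum(rng.normal(0, 0.01, size=(len(idx), 2)), axis=0)),
    index=idx,
    columns=["AAA", "BBB"],
)

# Last close of each calendar month (January to April gives 4 rows).
toy_month_close = toy_close.groupby(toy_close.index.to_period("M")).last()

# Simple return r_t = P_t / P_{t-1} - 1; the first row is NaN and is dropped.
toy_month_ret = toy_month_close.pct_change().iloc[1:]
print(toy_month_ret)
```

Four month-end closes yield three monthly returns per ticker, which is why the real table starts in February 2018 although the price slice starts in January.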

In [7]:
month_ret.head()

| | ACS | ACX | AENA | AMS | ANA | BBVA | BKIA | BKT | CABK | CLNX | ... | NTGY | REE | REP | SAB | SAN | SGRE | TEF | TL5 | TRE | VIS |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2018-02-28 | -0.121823 | 0.02786 | -0.044457 | -0.03073 | -0.043872 | -0.089588 | -0.037255 | -0.020946 | -0.077065 | -0.027126 | ... | 0.011287 | -0.062683 | -0.029693 | -0.098485 | -0.052816 | 0.040971 | -0.031504 | 0.045455 | -0.024781 | -0.071048 |
| 2018-03-31 | 0.11719 | -0.068172 | -0.023859 | -0.009577 | -0.114999 | -0.066909 | -0.073574 | -0.067953 | -0.034895 | 0.025047 | ... | 0.030561 | 0.045625 | -0.019721 | -0.03738 | -0.065643 | -0.004585 | 0.004379 | -0.131905 | -0.103513 | 0.075192 |
| 2018-04-30 | 0.106793 | 0.026443 | 0.086133 | 0.012004 | 0.124757 | 0.071509 | 0.032939 | 0.038775 | 0.064523 | 0.025357 | ... | 0.078391 | 0.032875 | 0.099549 | 0.008523 | 0.026227 | 0.095202 | 0.050448 | 0.036465 | 0.115048 | -0.019608 |
| 2018-05-31 | 0.013703 | -0.010734 | -0.039743 | 0.118616 | -0.068856 | -0.132392 | -0.109953 | -0.05023 | -0.099777 | -0.023831 | ... | 0.003826 | -0.036458 | 0.028707 | -0.115893 | -0.143867 | -0.077813 | -0.105063 | -0.030928 | -0.054206 | 0.035455 |
| 2018-06-30 | 0.002244 | -0.015625 | -0.053561 | 0.005279 | 0.143356 | 0.040247 | -0.012015 | 0.019619 | 0.019252 | -0.005067 | ... | 0.080515 | 0.086968 | 0.058451 | -0.001739 | -0.001739 | -0.125428 | -0.009725 | -0.063051 | 0.090514 | 0.024583 |


In [8]:
features = month_ret.T
features.head()

| | 2018-02-28 | 2018-03-31 | 2018-04-30 | 2018-05-31 | 2018-06-30 | 2018-07-31 | 2018-08-31 | 2018-09-30 | 2018-10-31 | 2018-11-30 | 2018-12-31 | 2019-01-31 | 2019-02-28 | 2019-03-31 | 2019-04-30 | 2019-05-31 | 2019-06-30 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| ACS | -0.121823 | 0.11719 | 0.106793 | 0.013703 | 0.002244 | 0.081268 | -0.04371 | 0.022297 | -0.097328 | 0.023256 | -0.001476 | 0.08114 | 0.079778 | 0.004618 | 0.045199 | -0.098461 | 0.033604 |
| ACX | 0.02786 | -0.068172 | 0.026443 | -0.010734 | -0.015625 | 0.137037 | -0.064242 | 0.063903 | -0.199026 | -0.108634 | -0.015234 | 0.097206 | -0.020623 | -0.050924 | 0.049808 | -0.087341 | 0.049486 |
| AENA | -0.044457 | -0.023859 | 0.086133 | -0.039743 | -0.053561 | -0.000643 | -0.018662 | -0.019672 | -0.055853 | -0.00673 | -0.03174 | 0.110866 | 0.040451 | 0.022945 | 0.075672 | -0.002721 | 0.047301 |
| AMS | -0.03073 | -0.009577 | 0.012004 | 0.118616 | 0.005279 | 0.079882 | 0.094795 | 0.001251 | -0.110472 | -0.111267 | -0.038255 | 0.052301 | 0.042848 | 0.07855 | -0.006723 | -0.036097 | 0.043885 |
| ANA | -0.043872 | -0.114999 | 0.124757 | -0.068856 | 0.143356 | 0.036661 | 0.031556 | 0.029272 | -0.044581 | 0.087423 | -0.088779 | 0.124493 | 0.028159 | 0.162219 | 0.040282 | -0.07212 | 0.048513 |


In [9]:
month_ret.ACS

2018-02-28   -0.121823
2018-03-31    0.117190
2018-04-30    0.106793
2018-05-31    0.013703
2018-06-30    0.002244
2018-07-31    0.081268
2018-08-31   -0.043710
2018-09-30    0.022297
2018-10-31   -0.097328
2018-11-30    0.023256
2018-12-31   -0.001476
2019-01-31    0.081140
2019-02-28    0.079778
2019-03-31    0.004618
2019-04-30    0.045199
2019-05-31   -0.098461
2019-06-30    0.033604
Freq: M, Name: ACS, dtype: float64

In [10]:
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_samples, silhouette_score

In [11]:
kmeans = KMeans(n_clusters=5)

The instances are the companies and the features are their monthly returns

In [12]:
kmeans.fit(features)

KMeans(n_clusters=5)
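
The transposition matters because scikit-learn estimators expect an array of shape `(n_samples, n_features)`. The sketch below illustrates this on made-up data; the tickers `A` to `F`, the random returns, and the `toy_*` names are assumptions. It also passes `random_state`, which the notebook does not, so that the labels are repeatable between runs.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

rng = np.random.default_rng(42)

# Hypothetical return table laid out like month_ret: 12 months x 6 tickers.
toy_ret = pd.DataFrame(
    rng.normal(0, 0.05, size=(12, 6)),
    index=[f"month_{i}" for i in range(12)],
    columns=list("ABCDEF"),
)

# One row per company (sample), one column per month (feature).
toy_features = toy_ret.T

# random_state fixes the centroid initialisation so labels are repeatable;
# n_init is set explicitly because its default has changed across versions.
toy_kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(toy_features)
toy_labels = pd.Series(toy_kmeans.labels_, index=toy_features.index)
print(toy_labels)
```

Without a fixed `random_state`, the cluster numbers (and possibly the groupings) can differ from one run to the next, so the label values shown in this notebook should be read as one possible outcome.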

In [13]:
kmeans.labels_

array([3, 2, 0, 3, 1, 2, 2, 2, 2, 1, 1, 1, 1, 1, 3, 0, 1, 0, 0, 2, 0, 1,
       2, 1, 1, 1, 2, 2, 4, 1, 0, 0, 3], dtype=int32)
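
`silhouette_score` is imported above but not used in this notebook. One possible use, sketched here under the assumption that it was meant to guide the choice of `n_clusters`, is to compare the mean silhouette coefficient for several values of k. The random matrix `toy_X` merely stands in for `features` (33 companies by 17 months), so the resulting scores say nothing about the IBEX data.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Hypothetical stand-in for `features`: 33 "companies" x 17 monthly returns.
rng = np.random.default_rng(0)
toy_X = rng.normal(0, 0.06, size=(33, 17))

scores = {}
for k in range(2, 9):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(toy_X)
    # Mean silhouette coefficient in [-1, 1]; higher means better-separated clusters.
    scores[k] = silhouette_score(toy_X, labels)

best_k = max(scores, key=scores.get)
print(scores)
print("k with the highest silhouette:", best_k)
```

The silhouette is only one heuristic; a cluster containing a single company, like the one that appears in this notebook's results, is a hint that the chosen k may deserve a second look.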

In [14]:
grp_ibex = pd.Series(kmeans.labels_, index=features.index)

In [15]:
grp_ibex.sort_values()

AENA    0
TL5     0
MEL     0
ITX     0
IDR     0
TRE     0
IAG     0
IBE     1
REP     1
REE     1
NTGY    1
MRL     1
TEF     1
FER     1
ENG     1
ELE     1
COL     1
CLNX    1
ANA     1
CABK    2
MAP     2
BKT     2
BKIA    2
MTS     2
BBVA    2
ACX     2
SAB     2
SAN     2
ACS     3
GRF     3
AMS     3
VIS     3
SGRE    4
dtype: int32

In [16]:
grupos_dict = dict()
for icluster in grp_ibex.unique():
    grupos_dict[icluster] = list(grp_ibex[grp_ibex == icluster].index)


In [17]:
grupos_dict

{3: ['ACS', 'AMS', 'GRF', 'VIS'],
 2: ['ACX', 'BBVA', 'BKIA', 'BKT', 'CABK', 'MAP', 'MTS', 'SAB', 'SAN'],
 0: ['AENA', 'IAG', 'IDR', 'ITX', 'MEL', 'TL5', 'TRE'],
 1: ['ANA',
  'CLNX',
  'COL',
  'ELE',
  'ENG',
  'FER',
  'IBE',
  'MRL',
  'NTGY',
  'REE',
  'REP',
  'TEF'],
 4: ['SGRE']}
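
The same label-to-tickers mapping can also be built with `groupby`. This is a sketch of an alternative, not the notebook's method; `toy_grp` holds only the first five labels from the output above.

```python
import pandas as pd

# First five labels from the output above, laid out like grp_ibex.
toy_grp = pd.Series([3, 2, 0, 3, 1], index=["ACS", "ACX", "AENA", "AMS", "ANA"])

# Group the Series by its own values; each group keeps the tickers as its index.
toy_groups = {
    int(label): list(members.index)
    for label, members in toy_grp.groupby(toy_grp)
}
print(toy_groups)
# {0: ['AENA'], 1: ['ANA'], 2: ['ACX'], 3: ['ACS', 'AMS']}
```

Unlike the loop over `unique()`, `groupby` returns the keys sorted by label.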

___

#### Labeling new data with a clustering model

We now show which cluster the company Daimler is assigned to under the clustering fitted on the IBEX companies

In [18]:
daimler_df = pd.read_csv('../data/daimler.csv', index_col=0)

In [19]:
daimler_df.set_index(pd.DatetimeIndex(daimler_df.index), inplace=True)
daimler_df

| | open | high | low | close | vol |
|---|---|---|---|---|---|
| 2010-01-04 | 37.220 | 37.610 | 37.040 | 37.505 | 38864 |
| 2010-01-05 | 37.400 | 37.520 | 36.895 | 37.295 | 21535 |
| 2010-01-06 | 37.110 | 37.280 | 36.650 | 37.170 | 17436 |
| 2010-01-07 | 36.850 | 36.850 | 36.355 | 36.830 | 36455 |
| 2010-01-08 | 36.950 | 37.115 | 36.295 | 36.840 | 32034 |
| ... | ... | ... | ... | ... | ... |
| 2019-06-24 | 46.810 | 48.200 | 46.720 | 47.815 | 20471 |
| 2019-06-25 | 47.550 | 47.635 | 47.250 | 47.525 | 11895 |
| 2019-06-26 | 47.495 | 48.350 | 47.340 | 48.000 | 7370 |
| 2019-06-27 | 48.345 | 49.000 | 48.155 | 48.305 | 6848 |


We must generate the same features for the example to be labeled

In [20]:
dai_close = daimler_df.close['2018-01-02':]

In [21]:
month_dai = dai_close.resample('M').last()
month_dai_ret = month_dai.pct_change()[1:]

In [22]:
month_dai_ret.to_frame().T

| | 2018-02-28 | 2018-03-31 | 2018-04-30 | 2018-05-31 | 2018-06-30 | 2018-07-31 | 2018-08-31 | 2018-09-30 | 2018-10-31 | 2018-11-30 | 2018-12-31 | 2019-01-31 | 2019-02-28 | 2019-03-31 | 2019-04-30 | 2019-05-31 | 2019-06-30 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| close | -0.044913 | -0.021671 | -0.055451 | -0.055641 | -0.101932 | 0.073378 | -0.061963 | -0.023335 | -0.030142 | -0.053724 | -0.083108 | 0.133996 | 0.01926 | -0.013605 | 0.11954 | -0.206793 | 0.057815 |


In [23]:
x_test = month_dai_ret.values.reshape(1,-1)
x_test

array([[-0.04491342, -0.02167139, -0.05545099, -0.05564071, -0.1019315 ,
         0.07337791, -0.06196329, -0.02333513, -0.03014152, -0.05372371,
        -0.08310804,  0.13399585,  0.0192604 , -0.01360544,  0.11954023,
        -0.20679329,  0.05781469]])
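
Because `reshape(1, -1)` discards the dates, nothing checks that the new sample's months match the training columns. A minimal sketch of an explicit alignment step follows; the dates, the return values, and the names `train_months`, `new_ret` and `x_new` are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical training feature columns (month ends), in training order.
train_months = pd.to_datetime(["2018-02-28", "2018-03-31", "2018-04-30"])

# Hypothetical returns for a new company, deliberately out of order.
new_ret = pd.Series(
    [0.02, -0.01, 0.03],
    index=pd.to_datetime(["2018-03-31", "2018-02-28", "2018-04-30"]),
)

# Reorder to the training column order; a month missing from new_ret becomes NaN.
aligned = new_ret.reindex(train_months)
if aligned.isna().any():
    raise ValueError("new sample lacks some of the training months")

x_new = aligned.to_numpy().reshape(1, -1)  # shape (1, n_features)
print(x_new)
# [[-0.01  0.02  0.03]]
```

In the notebook itself the same role would be played by `features.columns`, assuming the Daimler series covers the same months as the IBEX data.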

We compute the assigned cluster with the *predict* method

In [24]:
kmeans.predict(x_test)

array([0], dtype=int32)
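
For K-Means, `predict` returns the index of the centroid closest to the sample in Euclidean distance. The sketch below checks this by hand on made-up data; `toy_X` and `toy_x_test` are random stand-ins for `features` and `x_test`, not the notebook's values.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
toy_X = rng.normal(0, 0.06, size=(33, 17))      # hypothetical training matrix
toy_x_test = rng.normal(0, 0.06, size=(1, 17))  # hypothetical new sample

toy_kmeans = KMeans(n_clusters=5, n_init=10, random_state=0).fit(toy_X)

# Euclidean distance from the new sample to each of the 5 centroids.
dists = np.linalg.norm(toy_kmeans.cluster_centers_ - toy_x_test, axis=1)
nearest = int(np.argmin(dists))

# The manual nearest-centroid lookup agrees with predict().
assert nearest == toy_kmeans.predict(toy_x_test)[0]
print(nearest, dists.round(3))
```

Inspecting `dists` also shows how close the runner-up cluster is, which `predict` alone does not reveal.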