<a href="https://colab.research.google.com/github/Psyclophe/datasets/blob/master/Semana3_1_Aps_Financieras5_Regresion_Logistica.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/MaxMitre/Aplicaciones-Financieras/blob/main/Semana3/1_Regresion_Logistica.ipynb)

# Problem description

Original data: https://challengedata.ens.fr/participants/challenges/31/

The goal is to find a classification algorithm that helps build cryptocurrency investment strategies, based on the "sentiment" extracted from news and social media.

For each trading hour, the occurrences of certain terms, such as 'adoption' and 'hack', were counted in a selected number of influential Twitter accounts and in forums such as 'Bitcointalk'.

Ten different topics were created, some positive and some negative, and the terms above were counted for each topic before normalization.

For a given topic, the counts over the last 48 hours were standardized. The result was then multiplied by the average hourly count over those 48 hours and divided by the average hourly count over the whole training period.

For a topic with count $T_{i,k}$ in time period $i$ at lag $k$ ($k\in[\![0;47]\!]$), the value $F_{i,k}$ of the feature is:

$$
F_{i,k}=\frac{T_{i,k}-\overline{T_{i}}}{\sqrt{\frac{1}{47}\sum\limits_{j=0}^{47}{(T_{i,j}-\overline{T_{i}})^{2}}}}*\frac{\overline{T_i}}{\overline{T}} 
$$
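
The formula above can be sketched in NumPy for a single 48-hour window. The function name `normalize_window`, the made-up Poisson counts and the value `overall_mean=4.0` are illustrative assumptions, not part of the original data pipeline.

```python
import numpy as np

def normalize_window(counts, overall_mean):
    """Standardize one window of hourly counts and rescale it as in the formula above."""
    counts = np.asarray(counts, dtype=float)
    window_mean = counts.mean()
    # Sample standard deviation: ddof=1 gives the 1/47 factor for 48 values
    window_std = counts.std(ddof=1)
    return (counts - window_mean) / window_std * (window_mean / overall_mean)

rng = np.random.default_rng(0)
window = rng.poisson(lam=5.0, size=48)  # made-up hourly counts for one topic
features = normalize_window(window, overall_mean=4.0)
print(features.shape)
```

After this transformation the window has zero mean, and its sample standard deviation equals the ratio between the window average and the overall average.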


Five more features were added, corresponding to the price variation over the last 1, 6, 12, 24 and 48 hours.
The goal is to predict whether the Bitcoin return over the next hour will be above 0.2%, between -0.2% and 0.2%, or below -0.2%.

The metric used is the logistic loss, defined as the negative log-likelihood of the true labels given the probabilities predicted by the classifier.

The true labels are encoded as a 3-column matrix of ones and zeros, where a one marks the column of the class each sample belongs to.

Given a matrix $P$ of probabilities $p_{i,k}=\Pr(y_{i,k}=1)$, the loss function is defined as

$$
L_{log}(Y,P)=-\log\Pr(Y|P)=-\frac{1}{N} \sum_{i=1}^{N} \sum_{k=1}^{3} y_{i,k}\log(p_{i,k})
$$

The lower this score, the better.
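
As a minimal sketch, the loss can be computed directly from a one-hot label matrix and a probability matrix. The helper `log_loss_one_hot`, the clipping constant `eps` and the toy matrices are assumptions made for illustration only.

```python
import numpy as np

def log_loss_one_hot(Y, P, eps=1e-15):
    """Mean negative log-probability assigned to the true class of each sample."""
    Y = np.asarray(Y, dtype=float)
    # Clip to avoid log(0) when a probability is exactly zero
    P = np.clip(np.asarray(P, dtype=float), eps, 1.0)
    return -np.mean(np.sum(Y * np.log(P), axis=1))

# Three samples, one per class (columns: Target -1, Target 0, Target 1)
Y = np.array([[0, 0, 1], [1, 0, 0], [0, 1, 0]])
P_good = np.array([[0.1, 0.1, 0.8], [0.7, 0.2, 0.1], [0.2, 0.6, 0.2]])
P_uniform = np.full((3, 3), 1 / 3)

loss_good = log_loss_one_hot(Y, P_good)
loss_uniform = log_loss_one_hot(Y, P_uniform)
print(loss_good, loss_uniform)
```

A classifier that always predicts 1/3 for each of the three classes scores $\ln 3 \approx 1.0986$, which is a useful reference point when reading the losses later in the notebook.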



# Dependencies

In [2]:
# !pip install -U plotly

In [3]:
import pandas as pd
import numpy as np

from sklearn.model_selection import train_test_split

import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers

import plotly.graph_objects as go

In [5]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


# Functions

In [6]:
def evaluate_model(estimator, train, val, test):
    print('train cross_entropy = ', estimator.evaluate(train[0], train[1], verbose = False))
    print('  val cross_entropy = ', estimator.evaluate(val[0], val[1], verbose = False))
    print(' test cross_entropy = ', estimator.evaluate(test[0], test[1], verbose = False))

In [7]:
# TODO: Allow selecting arbitrary features through n_features, not just the first n_features
# TODO: Review the results produced when different parameters are selected
# NOTE: Assumes the dataframe has the 5 X columns first, then 48 columns for each of the 10 I indicators, ordered from most recent to oldest
def transform_dataframe(df, len_prices = 5, n_features = 10, len_features = 48):
    if not all(isinstance(p, int) for p in (len_prices, n_features, len_features)):
        raise ValueError(f'The parameters len_prices, n_features and len_features must be of type int. Received {type(len_prices)}, {type(n_features)} and {type(len_features)}')

    assert 0 < len_prices <= 5, 'len_prices must be between 1 and 5'
    assert 0 < n_features <= 10, 'n_features must be between 1 and 10'
    assert 0 < len_features <= 48, 'len_features must be between 1 and 48'

    # Use a copy with a fresh 0-based index so the caller's dataframe is not modified
    df = df.reset_index(drop = True)

    # Column names are reversed so that the oldest observation comes first
    prices_cols = ['X5', 'X4', 'X3', 'X2', 'X1']

    prices = np.zeros((len(df), len_prices, 1))
    features = np.zeros((len(df), len_features, n_features))

    for i in range(len(df)):
        # Reshape the prices into a (len_prices, 1) column
        prices[i] = df.loc[i, prices_cols[-len_prices:]].values.reshape((len_prices, 1))
        # For each feature
        for j in range(n_features):
            # Take the first len_features lags (48 by default) and flip the array so the oldest comes first
            # This relies on the assumption that the first 5 columns of the dataframe are the prices
            features[i, :, j] = np.flip(df.iloc[i, 5+48*j:5+len_features + 48*j].values)
    return prices, features
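
The row-by-row loop can be slow on thousands of samples. The sketch below shows one possible vectorized variant under the same column-layout assumption (5 price columns followed by 10 blocks of 48 lags). The name `transform_dataframe_vectorized` and the synthetic `demo` frame are hypothetical and only serve to illustrate the reshaping.

```python
import numpy as np
import pandas as pd

def transform_dataframe_vectorized(df, len_prices=5, n_features=10, len_features=48):
    """Hypothetical vectorized variant; assumes 5 price columns, then 10 blocks of 48 lags."""
    n = len(df)
    # Longest horizon first, keeping the len_prices shortest horizons
    price_cols = ['X5', 'X4', 'X3', 'X2', 'X1'][-len_prices:]
    prices = df[price_cols].to_numpy(dtype=float).reshape(n, len_prices, 1)
    # View the 480 indicator columns as [sample, indicator, lag]
    lags = df.iloc[:, 5:5 + 48 * 10].to_numpy(dtype=float).reshape(n, 10, 48)
    lags = lags[:, :n_features, :len_features]
    # Flip the lag axis (oldest first) and move it before the indicator axis
    features = lags[:, :, ::-1].transpose(0, 2, 1)
    return prices, features

# Synthetic frame: in row 0 every cell holds its own column position
columns = [f'X{i}' for i in range(1, 6)] + [f'I{j}_lag{k}' for j in range(1, 11) for k in range(48)]
demo = pd.DataFrame(np.arange(3 * 485).reshape(3, 485), columns=columns)
prices, features = transform_dataframe_vectorized(demo)
print(prices.shape, features.shape)
```

Because each cell of the first synthetic row equals its column position, it is easy to check by eye that `features[0, -1, 0]` is the `I1_lag0` column and `features[0, 0, 0]` is `I1_lag47`.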

> The Input data contains 10 time series of 48 trading hours representing complementary features based on sentiment analysis from news extracted from twitter or forums like Bitcointalk on Bitcoin, and 5 time series based on the variation of Bitcoin price during the past 1, 6, 12, 24 and 48 hours normalised by volatility during the period. Input data, for training and testing, will be given by a .csv file, whose first line contains the header. Then each line corresponds to a sample, each column to a feature. The features are the following:

>- ***ID***: Id of the sample which is linked to the ID of the output file;
- ***I_1_lag(k)*** to ***I_10_lag(k)***: Values of Indicators *I_1* to *I_10* for each k lag ($k\in[\![0;47]\!]$) representing the normalized value of Indicators *I_1* to *I_10* each hour of the past 48 trading hours;
- ***X_1*** to ***X_5***: Values of 5 normalised indicators representing price variation of Bitcoin on the last 1, 6, 12, 24 and 48 hours.

> There will be 14 000 samples for the train set and 5 000 for the test set. For a given sample, the time series (for the 10 sentiment indicators) are given over the same 48 trading hours.

>The training outputs are given in a .csv file. Each line corresponds to a sample:

>- ***ID***: Id of the sample;
- ***Target_-1***: classification of the return of Bitcoin in the next hour. -1 signifies a down move of less than -0.2%;
- ***Target_0***: classification of the return of Bitcoin in the next hour. 0 signifies a move between -0.2% and 0.2%;
- ***Target_1***: classification of the return of Bitcoin in the next hour. 1 signifies a up move of more than 0.2%.
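
Since `transform_dataframe` selects columns by position, it can help to verify the column layout before transforming. The sketch below assumes the column names follow the `X1` to `X5` and `I1_lag0` to `I10_lag47` pattern seen in the training file loaded in this notebook; the helpers `expected_columns` and `check_layout` are hypothetical.

```python
import pandas as pd

def expected_columns(n_indicators=10, n_lags=48):
    """Column order assumed by the positional slicing: prices first, then indicator lags."""
    price_cols = [f'X{i}' for i in range(1, 6)]
    indicator_cols = [f'I{j}_lag{k}'
                      for j in range(1, n_indicators + 1)
                      for k in range(n_lags)]
    return price_cols + indicator_cols

def check_layout(df):
    """Return True when the dataframe columns match the assumed order exactly."""
    return list(df.columns) == expected_columns()

# Empty demo frame with the assumed layout, and a copy with the columns reversed
demo = pd.DataFrame(columns=expected_columns())
shuffled = demo[demo.columns[::-1]]
print(len(expected_columns()), check_layout(demo), check_layout(shuffled))
```

The expected list has 5 + 10 × 48 = 485 columns, which matches the second dimension of the shapes printed after the split.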



In [8]:
X_raw = pd.read_csv('/content/drive/MyDrive/Cruso-ApsFinancieras/semana7/input_training_IrTAw7w.csv').set_index('ID')
X_raw

ID,X1,X2,X3,X4,X5,I1_lag0,I1_lag1,I1_lag2,I1_lag3,I1_lag4,...,I10_lag38,I10_lag39,I10_lag40,I10_lag41,I10_lag42,I10_lag43,I10_lag44,I10_lag45,I10_lag46,I10_lag47
0,0.460020,0.620360,-0.972192,2.745197,4.177783,2.325865,2.060138,0.071162,2.360597,-0.611526,...,-0.342912,-0.194165,0.122331,0.028682,-0.093626,-0.559840,0.562584,-0.557868,1.424906,-0.016294
1,-0.347872,-2.199925,-0.222026,3.741888,8.608291,-4.091293,-3.502499,-1.463631,0.383153,-3.669962,...,1.261341,-0.082428,-1.035813,-0.249607,-0.971215,-0.058408,1.460632,-0.653394,-1.743487,4.065305
2,-2.152963,-0.432461,1.619057,-0.003912,3.870262,-0.598858,-0.412391,-0.765354,-0.998152,-0.938755,...,2.245204,3.002347,2.674186,2.656251,1.062974,-0.484619,-0.044594,1.579731,0.962836,1.146983
3,-1.827669,-1.881770,-4.214322,0.178225,0.992362,0.383757,2.512478,-0.383434,-0.208506,-1.104289,...,1.383203,-1.338892,0.298076,1.808275,2.837975,2.054112,0.741138,1.701911,0.110082,0.114980
4,0.748761,1.799939,1.561006,5.204120,2.161637,-1.275226,-1.544131,-1.802590,-1.128526,-0.469835,...,-0.477313,0.742923,-0.273225,1.311015,0.744330,2.914322,1.030602,0.480722,-0.492838,1.377958
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
13995,-0.074522,-0.472044,-2.860659,-1.266230,-10.229167,0.583145,-0.051301,-0.584659,1.458945,2.004759,...,-1.634290,-1.527111,-1.008016,-0.500519,1.277636,1.257714,0.502732,1.751844,0.150679,-0.533808
13996,1.730118,3.177408,0.816198,1.136877,-1.588960,1.011735,-0.185748,-0.522647,2.316802,1.219339,...,-0.409750,0.840944,-1.804313,0.357944,-1.058557,-0.196874,-2.507582,0.125756,1.532976,-1.087343
13997,2.093028,4.108092,1.056253,8.163642,8.916299,2.338713,2.554397,1.665492,3.719985,-0.278893,...,-0.531223,-1.249847,-1.288419,-0.897649,-0.199824,-0.033545,0.240647,2.188396,0.039340,0.756515
13998,1.483381,-1.602078,-2.851078,-2.639386,-4.805661,-2.252937,2.370613,4.450028,0.947600,2.364395,...,-1.451115,-4.188150,-2.397168,-1.126340,-0.841850,-4.231824,-2.640152,-4.048115,-4.629418,-3.566115


In [9]:
y_raw = pd.read_csv('/content/drive/MyDrive/Cruso-ApsFinancieras/semana7/output_training_F2dZW38.csv').set_index('ID')
y_raw

ID,Target -1,Target 0,Target 1
0,0,0,1
1,1,0,0
2,0,0,1
3,1,0,0
4,0,1,0
...,...,...,...
13995,0,0,1
13996,0,1,0
13997,0,1,0
13998,0,0,1


## Split: Training, Validation and Test

In [10]:
# Split for training the LSTM network
X_train, X_test, y_train, y_test = train_test_split(X_raw, y_raw, train_size = .8, random_state = 10, shuffle = False)
X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, train_size = .75, random_state = 10, shuffle = False)

print('     Train shape', X_train.shape)
print('Validation shape', X_val.shape)
print('      Test shape', X_test.shape)

     Train shape (8400, 485)
Validation shape (2800, 485)
      Test shape (2800, 485)
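
The two chained calls give a chronological 60/20/20 split: 80% is kept first, and 75% of that 80% becomes the training set. A minimal sketch on a toy array of 100 ordered values (an illustrative stand-in for the real data) makes the proportions and the ordering visible.

```python
import numpy as np
from sklearn.model_selection import train_test_split

data = np.arange(100)  # stand-in for 100 time-ordered samples

# shuffle=False keeps the original order, so later samples never leak into earlier sets
train_val, test = train_test_split(data, train_size=0.8, shuffle=False)
train, val = train_test_split(train_val, train_size=0.75, shuffle=False)

print(len(train), len(val), len(test))
print(train[-1], val[0], val[-1], test[0])
```

With `shuffle=False` the `random_state` argument has no effect, because no random permutation is drawn.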


## Transformations

In [11]:
len_prices = 5
n_features = 10
len_features = 48

train_prices, train_features = transform_dataframe(X_train, len_prices, n_features, len_features)
val_prices, val_features     = transform_dataframe(X_val,   len_prices, n_features, len_features)
test_prices, test_features   = transform_dataframe(X_test,  len_prices, n_features, len_features)

In [12]:
print('        Labels: (samples, sequence length, features)')
print('  Train prices:', train_prices.shape)
print('Train features:', train_features.shape)

        Labels: (samples, sequence length, features)
  Train prices: (8400, 5, 1)
Train features: (8400, 48, 10)


In [13]:
train_prices[0].reshape(1,5,1)

array([[[ 4.17778324],
        [ 2.74519725],
        [-0.97219231],
        [ 0.62035968],
        [ 0.46002033]]])

In [14]:
train_prices[0]

array([[ 4.17778324],
       [ 2.74519725],
       [-0.97219231],
       [ 0.62035968],
       [ 0.46002033]])

# Algorithms and Models

## Logistic Regression

The data owners' benchmark is a logistic regression that uses X1, $\dots$, X5 as features.

In [15]:
from sklearn.linear_model import LogisticRegression

In [16]:
# Pretty-print the numbers, without scientific notation
np.set_printoptions(suppress=True)
train_prices

array([[[ 4.17778324],
        [ 2.74519725],
        [-0.97219231],
        [ 0.62035968],
        [ 0.46002033]],

       [[ 8.60829058],
        [ 3.74188805],
        [-0.22202592],
        [-2.19992478],
        [-0.34787206]],

       [[ 3.87026224],
        [-0.00391238],
        [ 1.61905739],
        [-0.43246086],
        [-2.15296262]],

       ...,

       [[-8.01161549],
        [-3.25267826],
        [-3.23511161],
        [-4.42009744],
        [-1.64528608]],

       [[-5.7431553 ],
        [-4.55756999],
        [-4.1314012 ],
        [-4.21510083],
        [ 0.26622901]],

       [[ 3.53645002],
        [ 6.40746335],
        [ 0.75612526],
        [ 1.8952313 ],
        [ 1.59236659]]])

In [17]:
X_train_0, X_test_0, y_train_0, y_test_0 = train_test_split(X_raw, y_raw, train_size = .8, random_state = 10, shuffle = False)

In [18]:
X_train_0 = X_train_0[['X1', 'X2', 'X3', 'X4', 'X5']]
X_train_0

ID,X1,X2,X3,X4,X5
0,0.460020,0.620360,-0.972192,2.745197,4.177783
1,-0.347872,-2.199925,-0.222026,3.741888,8.608291
2,-2.152963,-0.432461,1.619057,-0.003912,3.870262
3,-1.827669,-1.881770,-4.214322,0.178225,0.992362
4,0.748761,1.799939,1.561006,5.204120,2.161637
...,...,...,...,...,...
11195,-0.197817,-0.503618,-5.999348,-7.625040,-2.796258
11196,0.511663,1.874224,0.079970,2.570751,7.507616
11197,3.405716,4.082560,5.426551,4.861720,10.263339
11198,0.476242,2.696333,3.123024,5.921784,-2.697430


In [19]:
X_test_0 = X_test_0[['X1', 'X2', 'X3', 'X4', 'X5']]
X_test_0

ID,X1,X2,X3,X4,X5
11200,-0.214679,3.356129,4.749172,5.709250,1.351358
11201,-1.192837,0.957475,-0.430136,-2.751596,6.960146
11202,0.450306,-0.061307,0.284349,2.194250,-3.470744
11203,0.169981,0.180298,5.055600,3.872075,7.453231
11204,0.223090,-0.213792,4.367535,4.579829,6.908846
...,...,...,...,...,...
13995,-0.074522,-0.472044,-2.860659,-1.266230,-10.229167
13996,1.730118,3.177408,0.816198,1.136877,-1.588960
13997,2.093028,4.108092,1.056253,8.163642,8.916299
13998,1.483381,-1.602078,-2.851078,-2.639386,-4.805661


In [20]:
y_train_0

ID,Target -1,Target 0,Target 1
0,0,0,1
1,1,0,0
2,0,0,1
3,1,0,0
4,0,1,0
...,...,...,...
11195,1,0,0
11196,1,0,0
11197,0,1,0
11198,1,0,0


In [21]:
y_train_0 = y_train_0.idxmax(axis=1)
y_train_0

ID
0         Target 1
1        Target -1
2         Target 1
3        Target -1
4         Target 0
           ...    
11195    Target -1
11196    Target -1
11197     Target 0
11198    Target -1
11199     Target 1
Length: 11200, dtype: object

In [22]:
y_test_0 = y_test_0.idxmax(axis=1)
y_test_0

ID
11200    Target 1
11201    Target 1
11202    Target 0
11203    Target 0
11204    Target 0
           ...   
13995    Target 1
13996    Target 0
13997    Target 0
13998    Target 1
13999    Target 0
Length: 2800, dtype: object

In [23]:
model_0 = LogisticRegression(multi_class='multinomial', random_state=1)

In [24]:
# What is shown when this variable is printed?
model_0

In [25]:
model_0.fit(X_train_0, y_train_0)

In [26]:
model_0.coef_

array([[ 0.05645319,  0.03479517,  0.00151598, -0.01326596, -0.00142158],
       [ 0.02201247,  0.00354643,  0.01033971,  0.00019174, -0.00648346],
       [-0.07846565, -0.0383416 , -0.01185569,  0.01307421,  0.00790504]])

In [27]:
model_0.intercept_ # bias (intercept) term

array([-0.02628687,  0.03538381, -0.00909694])
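
Each row of `coef_` and each entry of `intercept_` belongs to one class. Assuming the model uses the multinomial (softmax) formulation, the class probabilities can be rebuilt by hand from these two arrays. The sketch below checks this on synthetic data from `make_classification`, which is an illustrative stand-in for the Bitcoin features.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic 3-class problem with 5 features (not the notebook's data)
X, y = make_classification(n_samples=300, n_features=5, n_informative=3,
                           n_redundant=0, n_classes=3, n_clusters_per_class=1,
                           random_state=0)
model = LogisticRegression(max_iter=1000).fit(X, y)

# One linear score per class: X @ coef_.T + intercept_
scores = X @ model.coef_.T + model.intercept_
# Softmax, shifted by the row maximum for numerical stability
scores = scores - scores.max(axis=1, keepdims=True)
manual_probas = np.exp(scores) / np.exp(scores).sum(axis=1, keepdims=True)

print(np.allclose(manual_probas, model.predict_proba(X)))
```

The coefficients printed above are all small in magnitude, so for typical inputs the three scores stay close together and the predicted probabilities do not move far from one third each.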

In [28]:
# How can I see which attributes my object has?
dir(model_0)

['C',
 '__annotations__',
 '__class__',
 '__delattr__',
 '__dict__',
 '__dir__',
 '__doc__',
 '__eq__',
 '__format__',
 '__ge__',
 '__getattribute__',
 '__getstate__',
 '__gt__',
 '__hash__',
 '__init__',
 '__init_subclass__',
 '__le__',
 '__lt__',
 '__module__',
 '__ne__',
 '__new__',
 '__reduce__',
 '__reduce_ex__',
 '__repr__',
 '__setattr__',
 '__setstate__',
 '__sizeof__',
 '__str__',
 '__subclasshook__',
 '__weakref__',
 '_check_feature_names',
 '_check_n_features',
 '_estimator_type',
 '_get_param_names',
 '_get_tags',
 '_more_tags',
 '_parameter_constraints',
 '_predict_proba_lr',
 '_repr_html_',
 '_repr_html_inner',
 '_repr_mimebundle_',
 '_validate_data',
 '_validate_params',
 'class_weight',
 'classes_',
 'coef_',
 'decision_function',
 'densify',
 'dual',
 'feature_names_in_',
 'fit',
 'fit_intercept',
 'get_params',
 'intercept_',
 'intercept_scaling',
 'l1_ratio',
 'max_iter',
 'multi_class',
 'n_features_in_',
 'n_iter_',
 'n_jobs',
 'penalty',
 'predict',
 'predict_log_

In [29]:
model_0.classes_

array(['Target -1', 'Target 0', 'Target 1'], dtype=object)

In [30]:
y_pred_0 = model_0.predict(X_test_0)

In [31]:
y_test_0

ID
11200    Target 1
11201    Target 1
11202    Target 0
11203    Target 0
11204    Target 0
           ...   
13995    Target 1
13996    Target 0
13997    Target 0
13998    Target 1
13999    Target 0
Length: 2800, dtype: object

In [32]:
y_pred_0

array(['Target 0', 'Target 1', 'Target 0', ..., 'Target -1', 'Target 0',
       'Target 1'], dtype=object)

In [None]:
# Method that returns the class membership probabilities
y_probas = model_0.predict_proba(X_test_0)
y_probas 

In [None]:
# You can also use the minimum to explore
y_probas.max(axis=0)

In [None]:
(y_pred_0 == y_test_0)

In [None]:
(y_pred_0 == y_test_0).value_counts()

In [None]:
(y_pred_0 == y_test_0).sum() / len(y_pred_0)

In [None]:
from sklearn.metrics import classification_report
print(classification_report(y_test_0, y_pred_0))

In [None]:
from sklearn.metrics import ConfusionMatrixDisplay

In [None]:
ConfusionMatrixDisplay.from_predictions(y_test_0, y_pred_0)

In [None]:
from sklearn.metrics import log_loss

In [None]:
y_test_0

In [None]:
# Value of the loss function
log_loss(y_test_0, model_0.predict_proba(X_test_0))
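
When the true labels are strings, `log_loss` matches each probability column to a label in sorted order, which is the same order as `classes_` in the fitted model. The sketch below, with made-up labels and probabilities, compares the scikit-learn value with a manual computation that picks the probability of the true class for each sample.

```python
import numpy as np
from sklearn.metrics import log_loss

labels = ['Target -1', 'Target 0', 'Target 1']  # sorted order, as in classes_
y_true = np.array(['Target 1', 'Target -1', 'Target 0', 'Target 1'])
probas = np.array([[0.2, 0.3, 0.5],
                   [0.6, 0.3, 0.1],
                   [0.3, 0.4, 0.3],
                   [0.1, 0.2, 0.7]])  # made-up predictions, one column per label

sk_loss = log_loss(y_true, probas, labels=labels)

# Manual version: probability assigned to the true class of each sample
column_of = {label: i for i, label in enumerate(labels)}
picked = probas[np.arange(len(y_true)), [column_of[label] for label in y_true]]
manual_loss = -np.mean(np.log(picked))

print(sk_loss, manual_loss)
```

Passing `labels` explicitly is a small safeguard: it keeps the column mapping fixed even if a batch of true labels happens to miss one of the classes.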


- How good are the results, judging by the confusion matrix?

- What could we do to improve the model?



# Exercise:

Run the logistic regression algorithm on the TRAIN data we have (the model was evaluated on the TEST data but not on TRAIN) and report:

- Classification report
- Confusion matrix
- Value of the loss function

In [None]:
# Space for the exercise

Run a classification algorithm with only 2 CLASSES ("Target 1" and "Target -1").