## Let's try out Learning to Rank with the LightGBM library

We follow the code example at https://mlexplained.com/2019/05/27/learning-to-rank-explained-with-code/

To do that, download the data by running retrieve_30k.sh, which uses the trans_data.py file.

#### For Linux

If you are running Linux, you can execute the following cell.

In [25]:
! bash retrieve_30k.sh

--2020-11-21 13:17:35--  https://s3-us-west-2.amazonaws.com/xgboost-examples/MQ2008.rar
Resolving s3-us-west-2.amazonaws.com (s3-us-west-2.amazonaws.com)... 52.218.234.112
Connecting to s3-us-west-2.amazonaws.com (s3-us-west-2.amazonaws.com)|52.218.234.112|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 15448795 (15M) [application/x-rar-compressed]
Saving to: ‘MQ2008.rar.2’


2020-11-21 13:17:41 (3,13 MB/s) - ‘MQ2008.rar.2’ saved [15448795/15448795]


UNRAR 5.50 freeware      Copyright (c) 1993-2017 Alexander Roshal


Extracting from MQ2008.rar


Would you like to replace the existing file MQ2008/Fold1/test.txt
1768645 bytes, modified on 2009-03-17 17:22
with a new one
1768645 bytes, modified on 2009-03-17 17:22

[Y]es, [N]o, [A]ll, n[E]ver, [R]ename, [Q]uit ^C

User break

[Y]es, [N]o, [A]ll, n[E]ver, [R]ename, [Q]uit 

#### For Windows

On Windows, you must first have [7zip](https://www.7-zip.org/) installed. Then run the following cells.

In [None]:
!pip install patool

In [None]:
import os
import patoolib
import requests

rarfile = requests.get("https://s3-us-west-2.amazonaws.com/xgboost-examples/MQ2008.rar")
with open("./MQ2008.rar", "wb") as fh:
    fh.write(rarfile.content)

patoolib.extract_archive("./MQ2008.rar", outdir="./")
# Raw string so the backslashes in the Windows path are kept literally
os.system(r"move /-y MQ2008\Fold1\*.txt .")

In [None]:
!python trans_data.py train.txt mq2008.train mq2008.train.group

In [None]:
!python trans_data.py test.txt mq2008.test mq2008.test.group

In [None]:
!python trans_data.py vali.txt mq2008.vali mq2008.vali.group
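
The three commands above turn each LETOR-style text file into two outputs: a feature file in svmlight format and a `.group` file with one line per query giving the number of documents in that query. The sketch below illustrates that idea on in-memory strings. It is not the code of trans_data.py: the function name `split_letor_lines` is hypothetical, the sample rows are toy values, and it assumes each input line looks like `label qid:<id> 1:<v> 2:<v> ... #comment` with rows of the same query stored contiguously.

```python
def split_letor_lines(lines):
    """Split LETOR-style rows into svmlight rows and per-query group sizes."""
    rows, groups = [], []
    last_qid = None
    for line in lines:
        body = line.split("#", 1)[0].strip()  # drop the trailing comment
        if not body:
            continue
        label, qid, *features = body.split()
        if qid != last_qid:  # a new query starts: open a new group
            groups.append(0)
            last_qid = qid
        groups[-1] += 1
        # The qid token is dropped; query membership lives in the group sizes
        rows.append(" ".join([label, *features]))
    return rows, groups


# Toy rows (not taken from MQ2008): two documents for query 1, one for query 2
sample = [
    "0 qid:1 1:0.5 2:0.0 #docid = A",
    "1 qid:1 1:0.1 2:1.0 #docid = B",
    "2 qid:2 1:0.9 2:0.3 #docid = C",
]
rows, groups = split_letor_lines(sample)
print(rows)
print(groups)
```

The group sizes are what later gets passed as `group=` when fitting the ranker, so the row order of the feature file and the order of the group file must stay aligned.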

## Learning to Rank

Introduction to the dataset: https://arxiv.org/pdf/1306.2597.pdf

In [27]:
# Import the main libraries
import lightgbm as lgb
import numpy as np
import pandas as pd
from sklearn.datasets import load_svmlight_file
from scipy.stats import spearmanr

# Load the files we obtained with the trans_data.py script
x_train, y_train = load_svmlight_file("mq2008.train")
x_valid, y_valid = load_svmlight_file("mq2008.vali")
x_test, y_test = load_svmlight_file("mq2008.test")

In [28]:
data = pd.read_csv("train.txt", header=None, sep=" ")
separate_colon = lambda x: x.split(":")[-1]
for column in data.columns[1:48]:
    data[column] = data[column].apply(separate_colon)

In [29]:
data.head()

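|  | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | ... | 47 | 48 | 49 | 50 | 51 | 52 | 53 | 54 | 55 | 56 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 10002 | 0.007477 | 0.0 | 1.0 | 0.0 | 0.00747 | 0.0 | 0.0 | 0.0 | ... | 0.007042 | #docid | = | GX008-86-4444840 | inc | = | 1.0 | prob | = | 0.086622 |
| 1 | 0 | 10002 | 0.603738 | 0.0 | 1.0 | 0.0 | 0.603175 | 0.0 | 0.0 | 0.0 | ... | 1.0 | #docid | = | GX037-06-11625428 | inc | = | 0.003159 | prob | = | 0.089745 |
| 2 | 0 | 10002 | 0.214953 | 0.0 | 0.0 | 0.0 | 0.213819 | 0.0 | 0.0 | 0.0 | ... | 0.021127 | #docid | = | GX044-30-4142998 | inc | = | 0.008419 | prob | = | 0.099974 |
| 3 | 0 | 10002 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | #docid | = | GX228-42-3888699 | inc | = | 0.008419 | prob | = | 0.044481 |
| 4 | 0 | 10002 | 1.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | #docid | = | GX229-14-12863205 | inc | = | 1.0 | prob | = | 0.041016 |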


In [30]:
y_train

array([0., 0., 0., ..., 0., 0., 0.])

In [31]:
q_train = np.loadtxt('mq2008.train.group')
q_valid = np.loadtxt('mq2008.vali.group')
q_test = np.loadtxt('mq2008.test.group')

In [32]:
x_test

<2874x46 sparse matrix of type '<class 'numpy.float64'>'
	with 71241 stored elements in Compressed Sparse Row format>

In [33]:
q_test

array([  8.,  61.,   7.,   8.,   8.,   7.,   7.,  16.,  15.,   8.,  16.,
        16.,  32.,  16.,  16.,   7.,  16.,  15.,   8.,  14.,  16.,  15.,
         8.,   8.,  16.,   7.,  59.,  61.,  56.,  60.,   8.,   8.,  28.,
       117.,  16.,  15.,   8.,   8.,  16.,  15.,   8.,   8.,   8.,   8.,
         7.,  16.,   8.,   8.,  16.,   7.,   8.,  16.,   8.,   8.,  32.,
        16.,  31.,  15.,   6.,  31.,  15.,  16.,  16.,  31.,  16.,   8.,
        56.,  15.,   8.,  16.,   8.,  31.,  28.,  32.,   8.,   8.,   8.,
       115.,  57.,  12.,   8.,   8.,   8.,  15.,   7.,   8.,  15.,   8.,
         8.,   8.,  16.,   8.,  31., 119.,   8.,  15.,   7.,   8.,   8.,
        16.,   8.,  15.,   8.,  16.,   8.,   8.,  15.,   8.,  16.,   8.,
         8.,   8.,  16.,  16.,  15.,   7.,  15.,   8.,  15.,   8.,   8.,
         8.,   8.,  29.,   7.,   8.,   8.,   8.,   8.,  61.,   8.,   7.,
        32.,   8., 114.,  15.,  16.,   8.,  16.,  61.,   8.,  15.,  15.,
         8.,   8.,   8.,  16.,  31.,  16.,  32.,  1

In [35]:
y_test[8:69]

array([0., 0., 1., 1., 1., 0., 1., 1., 1., 1., 1., 0., 2., 1., 0., 0., 0.,
       0., 0., 1., 0., 1., 1., 1., 0., 1., 0., 1., 1., 0., 0., 1., 1., 0.,
       1., 1., 1., 2., 1., 2., 1., 1., 2., 1., 1., 2., 1., 1., 2., 1., 1.,
       0., 2., 1., 0., 0., 0., 1., 0., 0., 2.])
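
The group file lists how many consecutive rows belong to each query, so the slice `y_test[8:69]` above is exactly the second query of the test set: the first group has 8 documents and the second has 61. A minimal sketch of turning group sizes into row ranges, using only the first three sizes shown in `q_test`; the helper name `group_slices` is hypothetical.

```python
from itertools import accumulate


def group_slices(sizes):
    """Return (start, stop) row ranges for consecutive query groups."""
    stops = list(accumulate(int(s) for s in sizes))
    starts = [0] + stops[:-1]
    return list(zip(starts, stops))


slices = group_slices([8.0, 61.0, 7.0])  # first three entries of q_test
print(slices)  # [(0, 8), (8, 69), (69, 76)]
```

Each pair can be used directly as `y_test[start:stop]` to inspect the labels of a single query.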

In [36]:
# LGBMRanker doc: https://lightgbm.readthedocs.io/en/latest/pythonapi/lightgbm.LGBMRanker.html
gbm = lgb.LGBMRanker()

gbm.fit(
    x_train, y_train, group=q_train, eval_set=[(x_valid, y_valid)],
    eval_group=[q_valid], eval_at=[1, 3], early_stopping_rounds=20, verbose=True,
    callbacks=[lgb.reset_parameter(learning_rate=lambda x: 0.95 ** x * 0.1)]
)

[1]	valid_0's ndcg@1: 0.556263	valid_0's ndcg@3: 0.635254
Training until validation scores don't improve for 20 rounds.
[2]	valid_0's ndcg@1: 0.571125	valid_0's ndcg@3: 0.66308
[3]	valid_0's ndcg@1: 0.617834	valid_0's ndcg@3: 0.675826
[4]	valid_0's ndcg@1: 0.622081	valid_0's ndcg@3: 0.682879
[5]	valid_0's ndcg@1: 0.632696	valid_0's ndcg@3: 0.6769
[6]	valid_0's ndcg@1: 0.624204	valid_0's ndcg@3: 0.677868
[7]	valid_0's ndcg@1: 0.613588	valid_0's ndcg@3: 0.671651
[8]	valid_0's ndcg@1: 0.613588	valid_0's ndcg@3: 0.670299
[9]	valid_0's ndcg@1: 0.626327	valid_0's ndcg@3: 0.675637
[10]	valid_0's ndcg@1: 0.62845	valid_0's ndcg@3: 0.680989
[11]	valid_0's ndcg@1: 0.62845	valid_0's ndcg@3: 0.679431
[12]	valid_0's ndcg@1: 0.626327	valid_0's ndcg@3: 0.684499
[13]	valid_0's ndcg@1: 0.641189	valid_0's ndcg@3: 0.688833
[14]	valid_0's ndcg@1: 0.619958	valid_0's ndcg@3: 0.688054
[15]	valid_0's ndcg@1: 0.619958	valid_0's ndcg@3: 0.686626
[16]	valid_0's ndcg@1: 0.626327	valid_0's ndcg@3: 0.691959
[17]	val

LGBMRanker(boosting_type='gbdt', class_weight=None, colsample_bytree=1.0,
      importance_type='split', learning_rate=0.1, max_depth=-1,
      min_child_samples=20, min_child_weight=0.001, min_split_gain=0.0,
      n_estimators=100, n_jobs=-1, num_leaves=31, objective=None,
      random_state=None, reg_alpha=0.0, reg_lambda=0.0, silent=True,
      subsample=1.0, subsample_for_bin=200000, subsample_freq=0)
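
The `reset_parameter` callback in the `fit` call above replaces the learning rate at every boosting round with `0.95 ** x * 0.1`. The sketch below simply evaluates that expression for the first few rounds, assuming LightGBM passes the 0-based round index as `x`; the function name `decayed_learning_rate` is hypothetical.

```python
def decayed_learning_rate(round_index, base=0.1, decay=0.95):
    """Learning rate for a given boosting round, mirroring the lambda above."""
    return decay ** round_index * base


schedule = [decayed_learning_rate(i) for i in range(5)]
for i, lr in enumerate(schedule):
    print(f"round {i}: learning_rate = {lr:.5f}")
```

Under that assumption the first round uses the base rate of 0.1 and each later round shrinks it by 5%, so later trees make smaller corrections.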

In [37]:
# Run the predictor on the test data
preds_test = gbm.predict(x_test)
preds_test

array([ 0.21021768, -0.32371846,  0.18258013, ..., -0.19822539,
        0.02179842, -0.26031275])

In [38]:
# Use Spearman's coefficient to measure rank correlation
# https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.spearmanr.html
spearmanr(y_test, preds_test)

SpearmanrResult(correlation=0.3888310513462384, pvalue=2.2613542430263908e-104)
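
Spearman's coefficient is the Pearson correlation computed on ranks instead of raw values, with tied values sharing their average rank. The pure-Python sketch below spells out that definition on toy data; the helper names `average_ranks`, `pearson` and `spearman` are hypothetical, and `scipy.stats.spearmanr` remains the function used in this notebook.

```python
from math import sqrt


def average_ranks(values):
    """1-based ranks; tied values receive the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with the one at position i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(a, b):
    """Pearson correlation of two equal-length sequences."""
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    var_a = sum((x - ma) ** 2 for x in a)
    var_b = sum((y - mb) ** 2 for y in b)
    return cov / sqrt(var_a * var_b)


def spearman(a, b):
    """Spearman correlation as Pearson correlation of average ranks."""
    return pearson(average_ranks(a), average_ranks(b))


labels = [0, 1, 2, 1, 0]               # toy relevance labels (with ties)
scores = [-0.3, 0.2, 0.9, 0.1, -0.1]   # toy model scores
rho = spearman(labels, scores)
print(rho)
```

Because relevance labels only take a few distinct values, ties are the norm here, which is why the average-rank treatment matters.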

## Let's put the whole dataset in a single group and retrain!

In [46]:
q_train = [x_train.shape[0]]
q_valid = [x_valid.shape[0]]
q_test = [x_test.shape[0]]

gbm = lgb.LGBMRanker()
# Possible bug in the library?
gbm.fit(
    x_train, y_train, group=q_train, eval_set=[(x_valid, y_valid)],
    eval_group=[q_valid], eval_at=(3, 5), early_stopping_rounds=20, verbose=True,
    callbacks=[lgb.reset_parameter(learning_rate=lambda x: 0.95 ** x * 0.1)]
)

[1]	valid_0's ndcg@3: 0	valid_0's ndcg@5: 0
Training until validation scores don't improve for 20 rounds.
[2]	valid_0's ndcg@3: 0	valid_0's ndcg@5: 0
[3]	valid_0's ndcg@3: 0	valid_0's ndcg@5: 0.0924245
[4]	valid_0's ndcg@3: 0.0782131	valid_0's ndcg@5: 0.236421
[5]	valid_0's ndcg@3: 0.176907	valid_0's ndcg@5: 0.317659
[6]	valid_0's ndcg@3: 0.176907	valid_0's ndcg@5: 0.317659
[7]	valid_0's ndcg@3: 0.176907	valid_0's ndcg@5: 0.405129
[8]	valid_0's ndcg@3: 0.176907	valid_0's ndcg@5: 0.405129
[9]	valid_0's ndcg@3: 0.176907	valid_0's ndcg@5: 0.405129
[10]	valid_0's ndcg@3: 0.48976	valid_0's ndcg@5: 0.543766
[11]	valid_0's ndcg@3: 0.687148	valid_0's ndcg@5: 0.676514
[12]	valid_0's ndcg@3: 0.530721	valid_0's ndcg@5: 0.383566
[13]	valid_0's ndcg@3: 0.530721	valid_0's ndcg@5: 0.514771
[14]	valid_0's ndcg@3: 0.530721	valid_0's ndcg@5: 0.514771
[15]	valid_0's ndcg@3: 0.452508	valid_0's ndcg@5: 0.604313
[16]	valid_0's ndcg@3: 0.391066	valid_0's ndcg@5: 0.559907
[17]	valid_0's ndcg@3: 0.391066	valid

LGBMRanker(boosting_type='gbdt', class_weight=None, colsample_bytree=1.0,
      importance_type='split', learning_rate=0.1, max_depth=-1,
      min_child_samples=20, min_child_weight=0.001, min_split_gain=0.0,
      n_estimators=100, n_jobs=-1, num_leaves=31, objective=None,
      random_state=None, reg_alpha=0.0, reg_lambda=0.0, silent=True,
      subsample=1.0, subsample_for_bin=200000, subsample_freq=0)
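
With a single group covering the whole set, NDCG@k is computed once over all validation documents, so only the k highest-scored documents contribute to the discounted gain. If none of them is relevant the metric is 0, which is consistent with the zeros in the first rounds of the log above. The sketch below uses a common NDCG definition (gain `2**label - 1`, discount `1 / log2(position + 1)` with 1-based positions), assumed here for illustration and possibly different in detail from what LightGBM computes; `dcg_at_k` and `ndcg_at_k` are hypothetical helpers and the data is a toy example.

```python
from math import log2


def dcg_at_k(labels_in_rank_order, k):
    """Discounted cumulative gain over the first k ranked labels."""
    return sum((2 ** rel - 1) / log2(pos + 2)
               for pos, rel in enumerate(labels_in_rank_order[:k]))


def ndcg_at_k(labels, scores, k):
    """NDCG@k for one group of documents."""
    ranked = [rel for _, rel in sorted(zip(scores, labels), key=lambda t: -t[0])]
    best = dcg_at_k(sorted(labels, reverse=True), k)
    return dcg_at_k(ranked, k) / best if best > 0 else 0.0


# Toy data: two queries of three documents each
labels = [0, 1, 0, 0, 2, 0]
scores = [0.9, 0.2, 0.8, 0.7, 0.1, 0.6]

# Per-query evaluation: each relevant document is ranked third in its own query
per_query = [ndcg_at_k(labels[a:b], scores[a:b], k=3) for a, b in [(0, 3), (3, 6)]]

# Single group: the top 3 of all six documents are irrelevant
whole = ndcg_at_k(labels, scores, k=3)

print(per_query, whole)
```

In this toy case every query scores 0.5 on its own, yet the pooled evaluation gives 0, because the relevant documents fall outside the global top 3.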

In [44]:
preds_test = gbm.predict(x_test)
preds_test
spearmanr(y_test, preds_test)

SpearmanrResult(correlation=0.38635242974671874, pvalue=5.851119987299715e-103)

In [None]:
gbm.be