In [215]:
import lightgbm

import numpy as np
import pandas as pd

from sklearn.metrics import ndcg_score
from sklearn.preprocessing import MinMaxScaler

## Data Preprocessing

The features are already saved as files in the ```results``` folder. We need to load them, build the Learning to Rank inputs, and then split them into training, validation, and test sets...

### Qrels

Qrels is the file that stores how relevant a given entity (or document) is to a query. In our case, 0 means not relevant, 1 somewhat relevant, and 2 highly relevant.

In [216]:
qrels = pd.read_csv('../data/train_qrels.csv')
qrels.head()

Unnamed: 0,QueryId,EntityId,Relevance
0,9,367937,1
1,9,429992,2
2,9,513435,1
3,9,571751,2
4,9,582040,2


### Feature Files

A feature is the score that each model we implement assigns to a (query, document) pair. For example, one feature can be BM25, another TF-IDF, another Dense Retrieval, and so on. Learning to Rank (if all goes well) will learn how much weight to give each of these features, extracting a little (and hopefully the best) from each one.

Thinking about it now, this merge is going to be more complicated than I first imagined, because some (query, entity) pairs will be missing from some feature files; in those cases I think I will just put 0 in that cell. If I return the top 100 from each feature and use 5 features, a query could end up with as many as 500 samples (an exaggeration on my part, since that only happens when the lists never overlap), so I will have to look at this carefully.

We also join our features with the ground-truth relevance. In practice this gives us something nice to look at, and the same column later serves as the training label.
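To make the merge concern concrete, here is a minimal sketch on made-up data. The frames `toy_qrels`, `bm25`, and `tfidf` and all their values are hypothetical, not taken from the real files. It illustrates that an outer merge keeps every (query, entity) pair that appears in any source, and that `fillna(0)` fills both missing scores and missing labels with 0.

```python
import pandas as pd

# Hypothetical toy data: one query with three judged entities
toy_qrels = pd.DataFrame({
    "QueryId": [9, 9, 9],
    "EntityId": [1, 2, 3],
    "Relevance": [1, 2, 1],
})

# Hypothetical feature files: each one scores only the pairs it retrieved
bm25 = pd.DataFrame({"QueryId": [9, 9], "EntityId": [1, 4], "BM25": [1.3, 1.1]})
tfidf = pd.DataFrame({"QueryId": [9, 9], "EntityId": [2, 4], "TFIDF": [1.2, 1.5]})

merged = toy_qrels.copy()
for feature_df in (bm25, tfidf):
    # Outer merge: keep pairs present on either side, then fill the gaps with 0
    merged = pd.merge(merged, feature_df, on=["QueryId", "EntityId"], how="outer").fillna(0)

# Sort only to make the printed table stable across pandas versions
merged = merged.sort_values(["QueryId", "EntityId"]).reset_index(drop=True)
print(merged)
```

Entity 4 was never judged, so it enters the table with `Relevance` 0. The number of rows per query is the size of the union of the judged entities and all retrieved lists, which is at most the sum of their sizes.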

In [217]:
features_paths = [
    "../results/train_conjuntive_daat_BM25_scores.csv", 
    "../results/train_disjunctive_daat_BM25_scores.csv",
    "../results/train_conjuntive_daat_TFIDF_scores.csv", 
    "../results/train_disjunctive_daat_TFIDF_scores.csv"
]

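ltr_df = qrels.copy()
for feature in features_paths:
    df = pd.read_csv(feature)
    # Rescale the score column to [1, 2] so that 0 can mean "pair absent from this feature file"
    df[df.columns[2]] = MinMaxScaler().fit_transform(df[df.columns[2]].values.reshape(-1, 1)) + 1
    # Outer merge keeps pairs found in any source; missing scores (and missing labels) become 0
    ltr_df = pd.merge(ltr_df, df, on=['QueryId', 'EntityId'], how="outer").fillna(0)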


ltr_df.head(8)

Unnamed: 0,QueryId,EntityId,Relevance,Relevance C_BM25,Relevance D_BM25,Relevance C_TFIDF,Relevance D_TFIDF
0,9,367937,1.0,0.0,1.270463,0.0,0.0
1,9,429992,2.0,0.0,1.331486,0.0,1.035801
2,9,513435,1.0,0.0,0.0,0.0,0.0
3,9,571751,2.0,0.0,0.0,0.0,0.0
4,9,582040,2.0,0.0,0.0,0.0,0.0
5,9,595484,2.0,0.0,0.0,0.0,0.0
6,9,853906,1.0,1.296792,1.287392,1.058033,0.0
7,9,1113171,1.0,0.0,0.0,0.0,0.0


### Train, Validation, and Test Split

I'd better write this down or I will forget...
First I get the ids of all queries in the qrels by calling ```unique()``` on the id column and storing the result in ```queries_ids```. Then I count the unique ids and store that in ```amount_of_queries```. <br>

My train/validation/test proportion is 80/10/10, so the number of training queries is ```train_size = int(0.8 * amount_of_queries)``` and the number of validation queries is ```val_size = int(0.1 * amount_of_queries)```; the remaining queries go to the test set (since the number of samples per query varies, the row counts will not follow exactly the same proportion).

Now I find the queries that act as "cut" points with ```cut1 = queries_ids[train_size]``` and ```cut2 = queries_ids[train_size + val_size]```. The training df is every row with ```QueryId <= cut1```, the validation df is every row with ```cut1 < QueryId <= cut2```, and the rest is the test df. This assumes the ids in ```queries_ids``` are in ascending order.

Note: no query has samples in more than one split. I did this on purpose because I was not sure whether sharing a query across splits would count as data leakage.
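A possible alternative to comparing id values against cut points is to slice the array of unique ids by position and select rows with `isin`. The sketch below is only an illustration on a made-up frame; the helper name `split_by_query` and the toy data are assumptions of mine, not part of the original pipeline.

```python
import numpy as np
import pandas as pd

def split_by_query(df, train_frac=0.8, val_frac=0.1):
    """Split rows so that each QueryId lands in exactly one of train/val/test."""
    query_ids = df["QueryId"].unique()
    n_train = int(train_frac * len(query_ids))
    n_val = int(val_frac * len(query_ids))

    # Slice the id array by position instead of comparing id values,
    # so the split does not depend on the ids being sorted
    train_ids = query_ids[:n_train]
    val_ids = query_ids[n_train:n_train + n_val]
    test_ids = query_ids[n_train + n_val:]

    return (
        df[df["QueryId"].isin(train_ids)].copy(),
        df[df["QueryId"].isin(val_ids)].copy(),
        df[df["QueryId"].isin(test_ids)].copy(),
    )

# Hypothetical toy frame: 10 queries with 3 rows each
toy = pd.DataFrame({"QueryId": np.repeat(np.arange(10), 3), "score": np.arange(30)})
train_part, val_part, test_part = split_by_query(toy)
print(train_part["QueryId"].nunique(), val_part["QueryId"].nunique(), test_part["QueryId"].nunique())
```

With this form the training split holds exactly `n_train` queries. The cut-point form with `QueryId <= cut1` also includes the query sitting at position `train_size`, so it trains on one query more.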

In [218]:

queries_ids = qrels['QueryId'].unique()
amount_of_queries = len(queries_ids)

train_size = int(0.8 * amount_of_queries)
val_size = int(0.1 * amount_of_queries)
test_size = int(0.1 * amount_of_queries)

cut1 = queries_ids[train_size]
cut2 = queries_ids[train_size + val_size]

train_df = ltr_df[(ltr_df['QueryId'] <= cut1)].copy()
val_df = ltr_df[(ltr_df['QueryId'] > cut1) & (ltr_df['QueryId'] <= cut2)].copy()
test_df = ltr_df[(ltr_df['QueryId'] > cut2)].copy()

print(f"Train size: {len(train_df)}, Val size: {len(val_df)}, Test size: {len(test_df)}")

train_df.tail()

Train size: 40053, Val size: 4466, Test size: 4418


Unnamed: 0,QueryId,EntityId,Relevance,Relevance C_BM25,Relevance D_BM25,Relevance C_TFIDF,Relevance D_TFIDF
45867,371,2763787,0.0,0.0,0.0,0.0,1.01512
45868,371,2763776,0.0,0.0,0.0,0.0,1.01512
45869,371,2763772,0.0,0.0,0.0,0.0,1.01512
45870,371,2763769,0.0,0.0,0.0,0.0,1.01512
45871,371,2763766,0.0,0.0,0.0,0.0,1.01512


In [219]:
val_df.tail()

Unnamed: 0,QueryId,EntityId,Relevance,Relevance C_BM25,Relevance D_BM25,Relevance C_TFIDF,Relevance D_TFIDF
47276,421,4512154,0.0,0.0,0.0,0.0,1.038759
47277,421,4300687,0.0,0.0,0.0,0.0,1.038759
47278,421,2138070,0.0,0.0,0.0,0.0,1.038759
47279,421,1779588,0.0,0.0,0.0,0.0,1.038759
47280,421,4513922,0.0,0.0,0.0,0.0,1.038655


### Splitting into X, y, and Group

Here's the deal: the format LightGBM expects is confusing, but in practice ```X_train``` holds the features, ```y_train``` holds the relevance, and group says how many consecutive samples belong to each query. So if the first query has 10 samples, the second 12, the third 14, and the fourth 16, then ```group = [10, 12, 14, 16]```. In this hypothetical example, the first 10 rows of ```X_train``` refer to the first query, the next 12 to the second, and so on, and the same holds for ```y_train```!
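Because LightGBM reads `group` as the sizes of consecutive blocks of rows, the rows of each query have to be contiguous and in the same order as the group sizes. The index values printed in this notebook (validation rows that start at index 7206 and end at 47280, for instance) suggest that the merged rows may not be stored query by query, so it may be worth sorting before building `X`, `y`, and `group`. A minimal sketch on hypothetical data:

```python
import pandas as pd

# Hypothetical frame whose rows are NOT stored query by query
toy = pd.DataFrame({
    "QueryId":   [1, 2, 1, 3, 2, 1],
    "feature":   [0.5, 0.1, 0.9, 0.3, 0.7, 0.2],
    "Relevance": [2, 0, 1, 1, 2, 0],
})

# mergesort is a stable sort: every query becomes one contiguous block and
# rows keep their original relative order inside each block
toy_sorted = toy.sort_values("QueryId", kind="mergesort").reset_index(drop=True)

# Block sizes, listed in the same order as the blocks appear in the sorted rows
group = toy_sorted.groupby("QueryId", sort=False).size().to_numpy()

X = toy_sorted[["feature"]]
y = toy_sorted["Relevance"]
print(group.tolist())
```

A quick sanity check is that `group.sum()` equals the number of rows in `X`, and that walking through `X` in blocks of those sizes never mixes two query ids.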

In [220]:
qids_train = train_df.groupby("QueryId")["QueryId"].count().to_numpy()
X_train = train_df.drop(["QueryId", "EntityId", "Relevance"], axis=1)
y_train = train_df["Relevance"]

X_train.head()

Unnamed: 0,Relevance C_BM25,Relevance D_BM25,Relevance C_TFIDF,Relevance D_TFIDF
0,0.0,1.270463,0.0,0.0
1,0.0,1.331486,0.0,1.035801
2,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0


In [221]:
qids_validation = val_df.groupby("QueryId")["QueryId"].count().to_numpy()
X_validation = val_df.drop(["QueryId", "EntityId", "Relevance"], axis=1)
y_validation = val_df["Relevance"]

y_validation.head()

7206    1.0
7207    2.0
7208    1.0
7209    1.0
7210    1.0
Name: Relevance, dtype: float64

In [222]:
qids_test = test_df.groupby("QueryId")["QueryId"].count().to_numpy()
X_test = test_df.drop(["QueryId", "EntityId", "Relevance"], axis=1)
y_test = test_df["Relevance"]

qids_test

array([163, 203, 186, 185, 246, 286, 197, 247, 194, 182, 194, 189, 275,
       193, 212, 193, 166, 196, 170, 150, 206, 185], dtype=int64)

## Model Training

Now it's time to instantiate LightGBM. It has a ton of hyperparameters, and I _think_ this [link](https://neptune.ai/blog/lightgbm-parameters-guide) has an explanation that will be great for future reference...

In [223]:
model = lightgbm.LGBMRanker(
    objective="lambdarank",
    metric="ndcg",
    num_leaves=32,
    
)

In [224]:
model.fit(
    X=X_train,
    y=y_train,
    group=qids_train,
    eval_set=[(X_validation, y_validation)],
    eval_group=[qids_validation],
    eval_at=5,
    early_stopping_rounds=30,
    verbose=True,
)



[1]	valid_0's ndcg@5: 0.941598
[2]	valid_0's ndcg@5: 0.950747
[3]	valid_0's ndcg@5: 0.958668
[4]	valid_0's ndcg@5: 0.963152
[5]	valid_0's ndcg@5: 0.955546
[6]	valid_0's ndcg@5: 0.950631
[7]	valid_0's ndcg@5: 0.950631
[8]	valid_0's ndcg@5: 0.948057
[9]	valid_0's ndcg@5: 0.948057
[10]	valid_0's ndcg@5: 0.948057
[11]	valid_0's ndcg@5: 0.948231
[12]	valid_0's ndcg@5: 0.948231
[13]	valid_0's ndcg@5: 0.948231
[14]	valid_0's ndcg@5: 0.955663
[15]	valid_0's ndcg@5: 0.95695
[16]	valid_0's ndcg@5: 0.953147
[17]	valid_0's ndcg@5: 0.953147
[18]	valid_0's ndcg@5: 0.95186
[19]	valid_0's ndcg@5: 0.956094
[20]	valid_0's ndcg@5: 0.956094
[21]	valid_0's ndcg@5: 0.960327
[22]	valid_0's ndcg@5: 0.961009
[23]	valid_0's ndcg@5: 0.961009
[24]	valid_0's ndcg@5: 0.961009
[25]	valid_0's ndcg@5: 0.961615
[26]	valid_0's ndcg@5: 0.961865
[27]	valid_0's ndcg@5: 0.967893
[28]	valid_0's ndcg@5: 0.96918
[29]	valid_0's ndcg@5: 0.965377
[30]	valid_0's ndcg@5: 0.965377
[31]	valid_0's ndcg@5: 0.959856
[32]	valid_0's ndcg@

## Model Testing

Now let's check the quality on the test set.

In [225]:
test_df["predicted_ranking"] = model.predict(X_test) + 1
test_df.sort_values("predicted_ranking", ascending=False)

Unnamed: 0,QueryId,EntityId,Relevance,Relevance C_BM25,Relevance D_BM25,Relevance C_TFIDF,Relevance D_TFIDF,predicted_ranking
8125,462,1993002,1.0,2.000000,2.000000,1.284927,1.090462,3.379136
31615,449,2548623,0.0,0.000000,0.000000,1.056004,1.017525,3.320888
13461,437,2471827,0.0,1.454007,1.446709,1.077836,0.000000,3.302030
13756,458,3789297,0.0,1.459713,1.452491,1.083946,0.000000,3.278120
7816,433,1894869,2.0,1.360629,1.352082,1.089233,1.028112,3.265408
...,...,...,...,...,...,...,...,...
13314,430,1157524,0.0,1.206359,1.195750,0.000000,0.000000,-0.844961
13312,430,2523556,0.0,1.206841,1.196239,0.000000,0.000000,-0.844961
13311,430,1157565,0.0,1.207215,1.196618,0.000000,0.000000,-0.844961
7740,430,2555650,1.0,1.209598,1.199033,0.000000,0.000000,-0.844961


In [226]:
ndcg_sum_ltr = 0
ndcg_sum_best_model = 0
for query in test_df["QueryId"].unique():
    query_df = test_df[test_df["QueryId"] == query]
    
    current_ndcg_ltr = ndcg_score(query_df['Relevance'].values.reshape(1, -1), query_df['predicted_ranking'].values.reshape(1, -1), k = 100)
    current_ndcg_bm = ndcg_score(query_df['Relevance'].values.reshape(1, -1), query_df['Relevance D_BM25'].values.reshape(1, -1), k = 100)
    
    # print(f"Query {query} NDCG: {current_ndcg_ltr}")
    ndcg_sum_ltr += current_ndcg_ltr
    ndcg_sum_best_model += current_ndcg_bm

print(f"Average NDCG for LTR: {ndcg_sum_ltr/len(test_df['QueryId'].unique())}")
print(f"Average NDCG for Best Model: {ndcg_sum_best_model/len(test_df['QueryId'].unique())}")

Average NDCG for LTR: 0.4354103233254202
Average NDCG for Best Model: 0.41193606301556207


## Building the Kaggle Submission

Here comes the tricky part. As far as I understand, I have to generate all the features for the Kaggle submission queries (a separate file) and merge them (which does not look trivial), then pass the result to the model below (for each query (???)). Each row is an entity, so I will have a set of scores per entity; then it is a matter of sorting by predicted score within each query, so the ids come out sorted too, and returning those ids.

In [227]:
features_paths = [
    "../results/test_conjuntive_daat_BM25_scores.csv", 
    "../results/test_disjunctive_daat_BM25_scores.csv",
    "../results/test_conjuntive_daat_TFIDF_scores.csv", 
    "../results/test_disjunctive_daat_TFIDF_scores.csv"
]

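# Start from an empty frame that has only the key columns, then attach one feature per file
submission_df = pd.DataFrame({"QueryId": [], "EntityId": []})
for feature in features_paths:
    df = pd.read_csv(feature)
    # Same rescaling as in training: scores go to [1, 2], and 0 marks a pair the model did not retrieve
    df[df.columns[2]] = MinMaxScaler().fit_transform(df[df.columns[2]].values.reshape(-1, 1)) + 1
    submission_df = pd.merge(submission_df, df, on=['QueryId', 'EntityId'], how="outer").fillna(0)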


submission_df.head(8)

Unnamed: 0,QueryId,EntityId,Relevance C_BM25,Relevance D_BM25,Relevance C_TFIDF,Relevance D_TFIDF
0,2,366601,1.23305,1.189526,1.120158,1.05531
1,2,2508826,1.219499,1.176071,1.08473,1.039002
2,2,3572540,1.216312,1.172907,1.095603,1.044007
3,2,1850915,1.206548,1.163211,1.08192,1.037708
4,2,3957054,1.206173,1.162839,0.0,0.0
5,2,2367833,1.204583,1.16126,0.0,0.0
6,2,2508842,1.202855,1.159544,1.08192,1.037708
7,2,758546,1.199941,1.156651,1.235237,1.108281


In [228]:
features = submission_df.drop(["QueryId", "EntityId"], axis=1).copy()
test_pred = model.predict(features)
submission_df["predicted_ranking"] = test_pred
submission_df.head()

Unnamed: 0,QueryId,EntityId,Relevance C_BM25,Relevance D_BM25,Relevance C_TFIDF,Relevance D_TFIDF,predicted_ranking
0,2,366601,1.23305,1.189526,1.120158,1.05531,0.769664
1,2,2508826,1.219499,1.176071,1.08473,1.039002,-0.047755
2,2,3572540,1.216312,1.172907,1.095603,1.044007,0.029892
3,2,1850915,1.206548,1.163211,1.08192,1.037708,-0.425381
4,2,3957054,1.206173,1.162839,0.0,0.0,-1.977611


In [229]:
query_rankings_sorted_first_hundred = submission_df.sort_values(["QueryId", "predicted_ranking"], ascending=[True, False]).groupby("QueryId").head(100) 
query_rankings_sorted_first_hundred

Unnamed: 0,QueryId,EntityId,Relevance C_BM25,Relevance D_BM25,Relevance C_TFIDF,Relevance D_TFIDF,predicted_ranking
24571,2,2616207,0.000000,0.000000,1.092427,1.042545,1.907945
24572,2,4406241,0.000000,0.000000,1.084364,1.038833,1.755728
24573,2,1745425,0.000000,0.000000,1.084364,1.038833,1.755728
24574,2,2496331,0.000000,0.000000,1.081554,1.037540,1.544401
24570,2,2496299,0.000000,0.000000,1.103299,1.047550,1.223334
...,...,...,...,...,...,...,...
7682,465,1294279,1.339807,0.000000,1.118870,0.000000,-0.048984
24533,465,3056348,0.000000,1.376981,0.000000,0.000000,-0.092991
7690,465,3977490,1.331132,0.000000,1.112395,0.000000,-0.099250
7730,465,3292653,1.257951,0.000000,1.074485,0.000000,-0.119303


In [230]:
# Keep only the id columns, already in ranked order within each query
output = query_rankings_sorted_first_hundred.drop(["Relevance C_BM25", "Relevance D_BM25", "Relevance C_TFIDF", "Relevance D_TFIDF", "predicted_ranking"], axis=1)
output.to_csv("../outputs/ltr_CBM25_DBM25_CTFIDF_DTFIDF.csv", index=False)

In [255]:
test = pd.read_json("../data/corpus.jsonl", lines=True)
test.head()

Unnamed: 0,id,title,text,keywords
0,1,!!!,!!! is a dance-punk band that formed in Sacram...,"[1996 establishments in California, American i..."
1,2,!!! (album),!!! is the eponymous debut album by !!!. It wa...,"[!!! albums, 2001 debut albums, English-langua..."
2,3,!!Destroy-Oh-Boy!!,!!Destroy-Oh-Boy!! is the debut album by the A...,"[1993 debut albums, Crypt Records albums, Engl..."
3,4,!Action Pact!,"!Action Pact! were a punk rock band, formed in...","[English punk rock groups, Musical groups dise..."
4,5,!Arriba! La Pachanga,!Arriba! La Pachanga is an album by Mongo Sant...,[1961 albums]


In [256]:
import nltk, re

from nltk.stem.snowball import SnowballStemmer
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords

nltk.download('stopwords')
stop_words = set(stopwords.words('english'))
nltk.download('punkt')

def preprocesser(text: str) -> str:
    """
    Lowercases the text, removes punctuation and stopwords, stems the
    remaining tokens, and returns them joined into a single string
    """
    
    snow_stemmer = SnowballStemmer(language='english')
    
    text = re.sub(r'\n|\r', ' ', text)       # Replaces line breaks with spaces
    text = re.sub(r'[^\w\s]', ' ', text)       # Replaces punctuation with spaces
    words = word_tokenize(text.lower())       # Tokenizes the lowercased text

    filtered_sentence = []
    for w in words:
        if w not in stop_words:
            filtered_sentence.append(snow_stemmer.stem(w))

    return " ".join(filtered_sentence)

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\ritar\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\ritar\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


In [270]:
def parse_documents(documents):
    # Rename 'id' to 'docno' and preprocess the title, text, and keywords columns
    parsed_documents = documents.rename(columns = {'id':'docno'})
    parsed_documents['title'] = documents['title'].apply(preprocesser)
    parsed_documents['text'] = documents['text'].apply(preprocesser)
    parsed_documents['keywords'] = documents['keywords'].apply(lambda x: " ".join([preprocesser(el) for el in x]))

    return parsed_documents

# Parse the corpus loaded above (`test`) so that `documents` exists before it is displayed
documents = parse_documents(test)
documents.head()

Unnamed: 0,docno,title,text,keywords
0,1,,danc punk band form sacramento california 1996...,1996 establish california american indi rock g...
1,2,album,eponym debut album releas 2001 gold standard l...,album 2001 debut album english languag album
2,3,destroy oh boy,destroy oh boy debut album american garag punk...,1993 debut album crypt record album english la...
3,4,action pact,action pact punk rock band form 1981 bad samar...,english punk rock group music group disestabli...
4,5,arriba la pachanga,arriba la pachanga album mongo santamaría publ...,1961 album


In [266]:
documents = pd.read_csv("../data/parsed_corpus.csv")
documents.head()

Unnamed: 0,docno,title,text,keywords
0,1,[''],['danc punk band form sacramento california 19...,1996 establish california american indi rock g...
1,2,['album'],['eponym debut album releas 2001 gold standard...,album 2001 debut album english languag album
2,3,['destroy oh boy'],['destroy oh boy debut album american garag pu...,1993 debut album crypt record album english la...
3,4,['action pact'],['action pact punk rock band form 1981 bad sam...,english punk rock group music group disestabli...
4,5,['arriba la pachanga'],['arriba la pachanga album mongo santamaría pu...,1961 album


In [268]:
import ast

# eval() cannot take a whole Series; parse each stringified list element-wise and keep its first item
documents['text'] = documents['text'].apply(ast.literal_eval).str[0]
documents.head()

In [267]:
documents['text'] = documents['title'] + documents['text'] + documents['keywords']
documents.head()

Unnamed: 0,docno,title,text,keywords
0,1,[''],['']['danc punk band form sacramento californi...,1996 establish california american indi rock g...
1,2,['album'],['album']['eponym debut album releas 2001 gold...,album 2001 debut album english languag album
2,3,['destroy oh boy'],['destroy oh boy']['destroy oh boy debut album...,1993 debut album crypt record album english la...
3,4,['action pact'],['action pact']['action pact punk rock band fo...,english punk rock group music group disestabli...
4,5,['arriba la pachanga'],['arriba la pachanga']['arriba la pachanga alb...,1961 album


In [259]:
documents.to_csv("../data/parsed_corpus.csv", index=False)

## References

- https://tamaracucumides.medium.com/learning-to-rank-with-lightgbm-code-example-in-python-843bd7b44574
- https://towardsdatascience.com/how-to-evaluate-learning-to-rank-models-d12cadb99d47
- https://towardsdatascience.com/how-to-implement-learning-to-rank-model-using-python-569cd9c49b08
- https://stackoverflow.com/questions/64294962/how-to-implement-learning-to-rank-using-lightgbm
- https://www.kaggle.com/code/bturan19/lightgbm-ranker-introduction/notebook
- https://stackoverflow.com/questions/62555987/lightgbm-ranking-example/67621253#67621253
- https://neptune.ai/blog/lightgbm-parameters-guide
- https://lightgbm.readthedocs.io/en/latest/pythonapi/lightgbm.LGBMRanker.html#
- https://github.com/uni-assignments/research-challenge-2/blob/master/src/ltr.py