**Date :** Created on Tuesday January 12 2021

**Group 8 - Innovation**

**generate_qrels_learning_to_rank_v0** 

**@author :** Katia Belaid, Jordi Mora Fernandez, Célya Marcélo. 

**Description :** In summary, this notebook generates a qrels DataFrame and uses it to run Learning to Rank experiments.

A learning-to-rank pipeline is formulated in four phases :
> 1. indexing and generating qrels with retrieval models
> 2. identifying a candidate set of documents for each query
> 3. computing extra features on these documents and applying Learning To Rank pipelines
> 4. using a learned model to re-rank the candidate documents to obtain a more effective ranking.
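
Phases 2 to 4 above can be sketched in plain Python without PyTerrier. This is only a toy illustration: the corpus, the two features and the fixed weights (standing in for a learned model) are all hypothetical, and phase 1 (indexing and qrels generation) is left out.

```python
# Toy corpus; every name and value here is hypothetical.
DOCS = {
    "d1": "emploi territorial panorama",
    "d2": "police municipale cadre emploi",
    "d3": "guide accompagnement loi",
}

def candidates(query: str, docs: dict) -> list:
    """Phase 2: keep documents sharing at least one term with the query."""
    terms = set(query.split())
    return [docno for docno, text in docs.items() if terms & set(text.split())]

def features(query: str, text: str) -> list:
    """Phase 3: two illustrative features per (query, document) pair."""
    q_terms, d_terms = query.split(), text.split()
    overlap = sum(d_terms.count(t) for t in q_terms)  # matched term occurrences
    return [overlap, overlap / len(d_terms)]          # raw and length-normalised

def rerank(query: str, docs: dict, weights: list) -> list:
    """Phase 4: score candidates with a linear model and sort by score."""
    scored = []
    for docno in candidates(query, docs):
        feats = features(query, docs[docno])
        scored.append((docno, sum(w * f for w, f in zip(weights, feats))))
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

# The weights stand in for what a learned model would provide.
ranking = rerank("emploi police", DOCS, weights=[1.0, 0.5])
```

In the notebook itself, the candidate set comes from BM25, the features from DPH and TF_IDF, and the scoring function from a trained regressor or ranker.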



# Part 1 : Install / Download / Import Libraries

## Install library

In [1]:
#!pip install --upgrade git+https://github.com/terrier-org/pyterrier.git #egg=python-terrier

## Import library

### - Useful library :

In [3]:
import pandas as pd
import numpy as np
import pickle
from tqdm import trange
from google.colab import drive

### - System library :

In [4]:
import os

### - Text library :

In [5]:
import pyterrier as pt

### - Machine Learning Libraries :

In [6]:
import lightgbm as lgb
import xgboost as xgb
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor

# Part 2 : Data Loading

In [7]:
def Load_data(helper_path: str) -> pd.DataFrame:
    """Load the articles CSV and drop articles without content.
    
    Parameters :
        - helper_path : path to the CSV file

    Output :
        - df : cleaned and reindexed DataFrame

    """
    
    # Load the data with the pandas library
    df = pd.read_csv(helper_path)

    # Drop articles with no content (read_csv loads empty cells as NaN,
    # so test for missing values as well as empty strings)
    df = df[df['art_content'].notna() & (df['art_content'] != '')]

    # Reset the DataFrame index
    df = df.reset_index(drop = True)
    
    # Return the cleaned DataFrame
    return df

In [8]:
def Load_Pickle(helper_path: str) -> list:
    """Load a pickled object from disk.
    
    Parameters :
        - helper_path : path to the pickle file

    Output :
        - pick_file : the unpickled object (here, the list of topics)

    """

    # Open the file in binary mode
    with open(helper_path, 'rb') as f1:

        # Load the pickle file
        pick_file = pickle.load(f1)

        # Return the unpickled object
        return pick_file
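
As a minimal sketch of the round trip behind `Load_Pickle`, the example below writes a list of query strings to a temporary pickle file and reads it back. It assumes the topics file holds a plain list of strings (the notebook turns it into a one-column DataFrame); the queries and the file name are hypothetical.

```python
import os
import pickle
import tempfile

# Hypothetical query strings standing in for the real topics file.
queries = ["energie finance reforme", "medias mobilite emploi"]

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "topics.pkl")

    # Serialise the list to disk in binary mode.
    with open(path, "wb") as handle:
        pickle.dump(queries, handle)

    # Read it back the same way Load_Pickle does.
    with open(path, "rb") as handle:
        restored = pickle.load(handle)

# Number the queries from 1, as the notebook does for the qid column.
topics = [{"qid": i, "query": q} for i, q in enumerate(restored, start=1)]
```

Since unpickling can execute arbitrary code, only files from a trusted source should be loaded this way.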

- Phase 1 : Personal paths to data (CSV and Pickle) :

In [9]:
# Connect the drive folder
drive.mount('/content/drive')

# Path to the CSV file (articles)
Helper_path_D : str = '/content/drive/MyDrive/data_interpromo/Data/abstract_V0.csv'

# Path to the pickle file (topics)
Helper_path_P: str = '/content/drive/MyDrive/data_interpromo/Data/request_word_weight'

Mounted at /content/drive


- Phase 2 : Get CSV and Pickle data :

In [10]:
# My DataFrame variable
My_Data : pd.DataFrame = Load_data(Helper_path_D)

# My Pickle variable
List_topics = Load_Pickle(Helper_path_P)

# Get topics DataFrame
Indices = np.arange(1, len(List_topics) + 1)
Topics : pd.DataFrame = pd.DataFrame(List_topics, \
                                     columns = ['query'])

# Create new Column
Topics['qid'] = Indices 

# To Show My Data Dataframe
My_Data.head(10)

Unnamed: 0,art_id,art_content,art_content_html,art_extract_datetime,art_lang,art_title,art_url,src_name,src_type,src_url,src_img,art_auth,art_tag,abstract
0,1,le FNCDG et l’ andcdg avoir publier en septemb...,"<p style=""text-align: justify;"">La FNCDG et l’...",22 septembre 2020,fr,9ème édition du Panorama de l’emploi territorial,http://fncdg.com/9eme-edition-du-panorama-de-l...,FNCDG,xpath_source,http://fncdg.com/actualites/,http://fncdg.com/wp-content/uploads/2020/09/im...,,,le FNCDG et l’ andcdg avoir publier en septemb...
1,2,malgré le levée un mesure de confinement le 11...,"<p style=""text-align: justify;"">Malgré la levé...",17 mars 2020,fr,ACTUALITÉS FNCDG / COVID19,http://fncdg.com/actualites-covid19/,FNCDG,xpath_source,http://fncdg.com/actualites/,http://fncdg.com/wp-content/uploads/2020/03/co...,,,malgré le levée un mesure de confinement le 11...
2,25,quel être le objectif poursuivre par le gouver...,"<p style=""text-align: justify;""><strong>Quels ...",24 octobre 2019,fr,"Interview de M. Olivier DUSSOPT, Secretaire d’...",http://fncdg.com/interview-de-m-olivier-dussop...,FNCDG,xpath_source,http://fncdg.com/actualites/,http://fncdg.com/wp-content/uploads/2019/10/in...,,,quel être le objectif poursuivre par le gouver...
3,27,"le journée thématique , qui avoir lieu durant ...","<p style=""text-align: justify;""><strong>La jo...",31 mai 2017,fr,Journée Thématique FNCDG « Les services de san...,http://fncdg.com/journee-thematique-fncdg-les-...,FNCDG,xpath_source,http://fncdg.com/actualites/,http://fncdg.com/wp-content/uploads/2017/05/pu...,,,"le journée thématique , qui avoir lieu durant ..."
4,28,le 1ère journée thématique en région sur le th...,"<p style=""text-align: justify;"">La 1<sup>ère</...",13 mars 2017,fr,Journée Thématique FNCDG « Vers de nouveaux mo...,http://fncdg.com/journee-thematique-fncdg-vers...,FNCDG,xpath_source,http://fncdg.com/actualites/,http://fncdg.com/wp-content/uploads/2017/03/Sa...,,,le 1ère journée thématique en région sur le th...
5,30,l’ un un innovation de le loi n degré 2019 - 8...,"<p style=""text-align: justify;"">L’une des inno...",22 octobre 2020,fr,La publication d’un guide d’accompagnement à l...,http://fncdg.com/la-publication-dun-guide-dacc...,FNCDG,xpath_source,http://fncdg.com/actualites/,http://fncdg.com/wp-content/uploads/2020/10/LG...,,,l’ un un innovation de le loi n degré 2019 - 8...
6,31,"le FNCDG mener , en collaboration avec d’ autr...","<p style=""text-align: justify;"">La FNCDG mène,...",10 décembre 2020,fr,La publication d’un guide de sensibilisation a...,http://fncdg.com/la-publication-dun-guide-de-s...,FNCDG,xpath_source,http://fncdg.com/actualites/,http://fncdg.com/wp-content/uploads/2020/12/im...,,,"le FNCDG mener , en collaboration avec d’ autr..."
7,32,"créer pour et par le décideur territorial , ét...","<p style=""text-align: justify;"">Créé pour et p...",24 février 2017,fr,Lancement du réseau Étoile,http://fncdg.com/lancement-du-reseau-etoile/,FNCDG,xpath_source,http://fncdg.com/actualites/,http://fncdg.com/wp-content/uploads/2017/02/re...,,,"créer pour et par le décideur territorial , ét..."
8,34,le décret n degré 2017 - 397 et n degré 2017 -...,"<p style=""text-align: justify;"">Les décrets n°...",5 avril 2017,fr,Le cadre d’emplois des agents de police munici...,http://fncdg.com/le-cadre-demplois-des-agents-...,FNCDG,xpath_source,http://fncdg.com/actualites/,http://fncdg.com/wp-content/uploads/2017/04/po...,,,le décret n degré 2017 - 397 et n degré 2017 -...
9,35,un candidat à un examen professionnel organise...,"<p style=""text-align: justify;"">Une candidate ...",6 juillet 2017,fr,Le Conseil d’Etat confirme la souveraineté des...,http://fncdg.com/le-conseil-detat-confirme-la-...,FNCDG,xpath_source,http://fncdg.com/actualites/,http://fncdg.com/wp-content/uploads/2017/07/Co...,,,un candidat à un examen professionnel organise...


In [11]:
# To Show My Topics Dataframe
Topics.head(10)

Unnamed: 0,query,qid
0,energie finance lenvironnement reforme parleme...,1
1,medias mobilite emploi societe relations intel...,2
2,institutions innovation emploi union travaux b...,3
3,recherche emploi experiencecollaborateur insti...,4
4,iot numerique international economique institu...,5
5,collectivites finances vacances financement ea...,6
6,droits vie economie elections tech sciences pu...,7
7,societe letat securite internationales territo...,8
8,gestion societe institutions emploi lenvironne...,9
9,5g economie robot tech defense institutions cu...,10



# Part 3 : Indexing

- Phase 1 : Install / Initialization / Creation 

In [12]:
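# JAVA_HOME declaration (this is the setting the notebook process uses)
os.environ['JAVA_HOME'] = '/usr/lib/jvm/java-11-openjdk-amd64'

# `!export` runs in a separate shell, so it does not change the notebook environment
!export JAVA_HOME=/usr/lib/jvm/java-11-openjdk-amd64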

In [13]:
# JVM initialization
if not pt.started():
    pt.init()

terrier-assemblies 5.4  jar-with-dependencies not found, downloading to /root/.pyterrier...
Done
terrier-python-helper 0.0.4  jar not found, downloading to /root/.pyterrier...
Done
PyTerrier 0.3.1 has loaded Terrier 5.4 (built by craigm on 2021-01-16 14:17)


In [14]:
# Remove any previous index directory
!rm -rf ./pd_index

# Index storage location
Pd_indexer = pt.DFIndexer("./pd_index")

- Phase 2 : Docs Creation

In [15]:
# Columns to keep for the docs
Docs_columns : list = ['art_title', 'art_url', 'abstract']

# My Docs DataFrame
Docs : pd.DataFrame = My_Data[Docs_columns].copy()

# Add Column to my DataFrame
Docs['docno'] = My_Data['art_id'].astype(str)

# To show my docs DataFrame
Docs.head()

Unnamed: 0,art_title,art_url,abstract,docno
0,9ème édition du Panorama de l’emploi territorial,http://fncdg.com/9eme-edition-du-panorama-de-l...,le FNCDG et l’ andcdg avoir publier en septemb...,1
1,ACTUALITÉS FNCDG / COVID19,http://fncdg.com/actualites-covid19/,malgré le levée un mesure de confinement le 11...,2
2,"Interview de M. Olivier DUSSOPT, Secretaire d’...",http://fncdg.com/interview-de-m-olivier-dussop...,quel être le objectif poursuivre par le gouver...,25
3,Journée Thématique FNCDG « Les services de san...,http://fncdg.com/journee-thematique-fncdg-les-...,"le journée thématique , qui avoir lieu durant ...",27
4,Journée Thématique FNCDG « Vers de nouveaux mo...,http://fncdg.com/journee-thematique-fncdg-vers...,le 1ère journée thématique en région sur le th...,28


- Phase 3 : My reference index creation



In [16]:
Indexref = Pd_indexer.index(Docs["abstract"], Docs["docno"], Docs["art_url"], Docs["art_title"])

10:58:38.584 [main] WARN  o.t.structures.indexing.Indexer - Indexed 6 empty documents
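
To make the indexing step more concrete, here is a minimal sketch of an inverted index in plain Python. It is not how Terrier builds its index (the real indexer does considerably more, such as tokenisation and term processing); the function name and the sample documents are hypothetical. It also shows why an empty abstract, like the ones reported in the warning above, ends up with no postings.

```python
from collections import defaultdict

def build_inverted_index(docs: dict) -> dict:
    """Map each term to {docno: term frequency}."""
    index = defaultdict(dict)
    for docno, text in docs.items():
        for term in text.lower().split():
            index[term][docno] = index[term].get(docno, 0) + 1
    return dict(index)

docs = {
    "1": "emploi territorial emploi",
    "2": "mesure de confinement",
    "25": "",  # an empty document contributes no postings
}
index = build_inverted_index(docs)
```

A query term is then resolved by a single dictionary lookup, which returns the documents containing it together with their term frequencies.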


- Optional Phase : Index Statistics

In [17]:
# Index = pt.IndexFactory.of(Indexref)
# print(Index.getCollectionStatistics().toString())

# Part 4 : Generating qrels from retrieval models output

- **First ranker :** Builds the candidate set of documents for each query

In [18]:
BM25 = pt.BatchRetrieve(Indexref, \
                        controls = {"wmodel": "BM25"})
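
For intuition about what a BM25 weighting model computes, the sketch below scores documents with the textbook Okapi BM25 formula, using a common non-negative idf variant. It is an illustration under stated assumptions, not Terrier's implementation, whose exact formula and default parameters may differ; the function name, `k1`, `b` and the toy corpus are chosen here for the example.

```python
import math

def bm25_score(query_terms, doc_terms, corpus, k1=1.2, b=0.75):
    """Textbook Okapi BM25 score of one document for one query."""
    n_docs = len(corpus)
    avg_len = sum(len(d) for d in corpus) / n_docs
    score = 0.0
    for term in query_terms:
        tf = doc_terms.count(term)
        if tf == 0:
            continue
        df = sum(1 for d in corpus if term in d)  # document frequency
        idf = math.log(1 + (n_docs - df + 0.5) / (df + 0.5))
        # Term frequency saturates with k1; b controls length normalisation.
        norm = tf + k1 * (1 - b + b * len(doc_terms) / avg_len)
        score += idf * tf * (k1 + 1) / norm
    return score

# Hypothetical tokenised corpus.
corpus = [
    ["emploi", "territorial", "emploi"],
    ["police", "municipale"],
    ["guide", "loi", "emploi", "public"],
]
scores = [bm25_score(["emploi", "police"], doc, corpus) for doc in corpus]
```

In this toy corpus the rarer term `police` carries more weight than `emploi`, so the second document scores highest even though the first one contains `emploi` twice.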

- **Second rankers :** Used to re-rank the BM25 results

In [19]:
TF_IDF =  pt.BatchRetrieve(Indexref, \
                           controls = {"wmodel": "TF_IDF"})

DPH = pt.BatchRetrieve(Indexref, \
                       controls = {"wmodel": "DPH"})

In [20]:
def Remove_querys(topics : pd.DataFrame, model) -> pd.DataFrame :
    """Remove the queries for which the model retrieves no document.
    
    Parameters :
        - topics: DataFrame of queries in French
        - model: retrieval model (e.g. BM25)

    Output :
        - topics_: DataFrame of queries with at 
        least one corresponding document
        
    """

    # Work on a copy so the input DataFrame is left untouched
    topics_ : pd.DataFrame = topics.copy()

    # Step 1 : Check for matching documents
    for i in topics.index:
      
        # Number of documents retrieved for the query
        l = len(model.transform(topics["query"][i]))
      
        # Check the match
        if l == 0:
        
            # Remove the query if there is no match
            topics_ = topics_.drop(index = i)
    
    # Return the filtered query DataFrame
    return topics_

In [21]:
def Get_qrels(topics : pd.DataFrame, model) -> pd.DataFrame :
    """Build pseudo-qrels by sampling documents retrieved by the model.
    
    Parameters :
        - topics: DataFrame of queries in French
        - model: retrieval model (e.g. BM25)

    Output :
        - qrels_bis: new qrels DataFrame 
        
    """

    # Columns kept in the qrels
    columns : list = ["qid", "docno"]
    
    # Define the new DataFrame
    qrels_bis : pd.DataFrame = pd.DataFrame(columns = columns)

    # Step 1 : Browse every query 
    for i in topics.index:

        # Retrieve documents for the query (use the `topics` argument,
        # not the global Topics_final)
        result = model.transform(topics["query"][i])
      
        # Randomly sample up to 10 retrieved documents
        result_ = result.sample(n = min(10, len(result))).copy()
        result_["qid"] = topics["qid"][i]
      
        # Concatenate results 
        qrels_bis = pd.concat([qrels_bis, result_[columns]])
    
    # Add the relevance label
    qrels_bis["label"] = '1'

    # Return the new qrels DataFrame
    return qrels_bis

- Function application to get Topics and Qrels :

In [22]:
Topics_final = Remove_querys(Topics, BM25)
Qrels_bis = Get_qrels(Topics_final, BM25)
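
The sampling idea behind `Get_qrels` can be sketched without PyTerrier. The version below is a hypothetical variant: it takes the retrieval output as a plain dictionary, uses a fixed seed (an addition, so that reruns give the same qrels) and caps the sample size at the number of retrieved documents.

```python
import random

def sample_pseudo_qrels(results: dict, per_query: int = 10, seed: int = 42) -> list:
    """Draw up to `per_query` retrieved docnos per query and label them 1."""
    rng = random.Random(seed)  # fixed seed so reruns give the same qrels
    qrels = []
    for qid, docnos in results.items():
        k = min(per_query, len(docnos))  # avoid failing on short result lists
        for docno in rng.sample(docnos, k):
            qrels.append({"qid": str(qid), "docno": str(docno), "label": 1})
    return qrels

# Hypothetical retrieval output: qid -> retrieved docnos.
results = {1: ["1608", "467", "3803"], 2: [str(n) for n in range(100, 130)]}
qrels = sample_pseudo_qrels(results)
```

Because the labels are sampled from the ranker's own output rather than from human judgments, metrics computed against them measure agreement with that sample, not true relevance.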



In [23]:
# To show my Qrels DataFrame
Qrels_bis

Unnamed: 0,qid,docno,label
528,1,1608,1
488,1,467,1
130,1,3803,1
499,1,7575,1
136,1,7849,1
...,...,...,...
495,150,2106,1
827,150,6723,1
716,150,6157,1
816,150,7233,1


# Part 5 : Learning To Rank 

### Phase 1 : Learning after re-ranking with extra-features

- **Step 1 :** Data partitioning (`Train`, `Test`, `Validation`) for my **Topics data**

In [24]:
train_topics, valid_topics, test_topics = np.split(Topics_final, \
                                                   [int(.6*len(Topics_final)), \
                                                    int(.8*len(Topics_final))])

- **Step 2 :** Change the data type for learning in my **Qrels DataFrame**

In [25]:
Qrels_bis['qid'] = Qrels_bis['qid'].apply(str)
Qrels_bis['docno'] = Qrels_bis['docno'].apply(str)
Qrels_bis['label'] = Qrels_bis['label'].apply(str)

- **Step 3 :** Data partitioning (`Train`, `Test`, `Validation`) for my **Qrels data**

In [26]:
train_qrels, valid_qrels, test_qrels = np.split(Qrels_bis, \
                                                [int(.6*len(Qrels_bis)), \
                                                 int(.8*len(Qrels_bis))])
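
The two splits above cut topics and qrels independently by row position, which lines up only while every query contributes the same number of qrels rows in the same order. A possible alternative, sketched here with hypothetical helper names and toy data, is to split the query ids once and then filter the qrels by those ids.

```python
def split_by_query(qids: list, train_pct: int = 60, valid_pct: int = 20):
    """Cut an ordered list of query ids into train / validation / test parts."""
    first = len(qids) * train_pct // 100
    second = len(qids) * (train_pct + valid_pct) // 100
    return qids[:first], qids[first:second], qids[second:]

def filter_qrels(qrels: list, qids: list) -> list:
    """Keep only the qrels rows whose qid belongs to the given split."""
    wanted = set(qids)
    return [row for row in qrels if row["qid"] in wanted]

# Ten hypothetical queries with three judged documents each.
qids = [str(i) for i in range(1, 11)]
qrels = [{"qid": q, "docno": f"d{q}_{j}", "label": 1}
         for q in qids for j in range(3)]

train_q, valid_q, test_q = split_by_query(qids)
train_rows = filter_qrels(qrels, train_q)
```

With this approach a query's judgments can never be spread across two partitions, whatever the number of rows per query.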

### Phase 2 : Learning To Rank with sklearn regressor



- **Step 1 :** Create a Random Forest regressor

In [27]:
# My personal random forest regressor
Rf = RandomForestRegressor(n_estimators = 100)

- **Step 2 :** Generate the LTR pipeline

In [28]:
# Candidate retrieval with BM25, then DPH and TF_IDF scores as features (`**` is the feature union)
Pipeline = BM25 >> (DPH ** TF_IDF) 
Pipeline.fit(train_topics, \
             Qrels_bis)

# Using my regressor 
Rf_pipe = Pipeline >> pt.ltr.apply_learned_model(Rf)
Rf_pipe.fit(train_topics, \
            Qrels_bis)

- **Step 3 :** Execute the LTR pipeline

In [29]:
Res_LTR = pt.pipelines.Experiment([Pipeline, Rf_pipe], \
                                  test_topics, \
                                  Qrels_bis, \
                                  ["map","ndcg"], \
                                  names = ["BM25 Baseline","LTR"])
# To show my score 
Res_LTR

Unnamed: 0,name,map,ndcg
0,BM25 Baseline,0.027271,0.295848
1,LTR,0.031954,0.306723
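
The `map` and `ndcg` columns are averages of per-query scores. As a minimal sketch of what is computed for a single query with binary labels, the functions below implement average precision and NDCG by hand. The ranking and the judged documents are hypothetical, and the evaluation tool used by the experiment may handle edge cases differently.

```python
import math

def average_precision(ranking: list, relevant: set) -> float:
    """Mean of precision@k over the ranks k where a relevant doc appears."""
    hits, total = 0, 0.0
    for rank, docno in enumerate(ranking, start=1):
        if docno in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant) if relevant else 0.0

def ndcg(ranking: list, relevant: set) -> float:
    """Binary-gain NDCG: DCG of the ranking divided by the ideal DCG."""
    dcg = sum(1 / math.log2(rank + 1)
              for rank, docno in enumerate(ranking, start=1) if docno in relevant)
    ideal = sum(1 / math.log2(rank + 1) for rank in range(1, len(relevant) + 1))
    return dcg / ideal if ideal else 0.0

ranking = ["d3", "d1", "d7", "d2"]  # hypothetical system output
relevant = {"d1", "d2"}             # hypothetical judged-relevant docs

ap = average_precision(ranking, relevant)  # (1/2 + 2/4) / 2 = 0.5
score = ndcg(ranking, relevant)
```

MAP is the mean of `average_precision` over all test queries, and the reported NDCG is the corresponding mean of the per-query NDCG values.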


### Phase 3 : Learning To Rank with Gradient Boosted Trees & LambdaMART

#### Method 1 : Using my `Map` Metric

- **Step 1 :** Build an XGBoost ranker

In [30]:
# Set the XGBoost parameters for MAP metric
Lmart_x_map = xgb.sklearn.XGBRanker(objective ='rank:map', \
                                    learning_rate = 0.1, \
                                    gamma = 1.0, \
                                    min_child_weight = 0.1, \
                                    max_depth = 6, \
                                    verbose = 2, \
                                    random_state = 42)

# Generate the boosted LTR pipeline 
Lmart_x_pip = Rf_pipe >> pt.ltr.apply_learned_model(Lmart_x_map, \
                                                    form = "ltr")

# Fit the boosted LTR pipeline
Lmart_x_pip.fit(train_topics, \
                train_qrels, \
                valid_topics, \
                valid_qrels)
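
Ranking objectives such as LambdaMART compare documents within the same query, so the learner has to know which training rows belong together. Both `XGBRanker.fit` and `LGBMRanker.fit` accept a `group` argument holding the number of rows per query; the `form = "ltr"` wrapper is assumed to take care of this here. The sketch below, with a hypothetical helper and toy data, shows how such group sizes can be derived from a qid column that is already sorted by query.

```python
from itertools import groupby

def group_sizes(qids: list) -> list:
    """Count consecutive rows per query; rows must already be sorted by qid."""
    return [len(list(rows)) for _, rows in groupby(qids)]

# Hypothetical training rows: one qid per (query, document) feature vector.
row_qids = ["1", "1", "1", "2", "2", "3", "3", "3", "3"]
sizes = group_sizes(row_qids)
```

The sizes must sum to the number of training rows, and the rows of one query must be contiguous, otherwise the learner would compare documents across different queries.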

- **Step 2 :** Build a LightGBM ranker

In [31]:
# Set the LightGBM parameters
Lmart_l_map = lgb.LGBMRanker(task = "train", \
                             min_data_in_leaf = 1, \
                             min_sum_hessian_in_leaf = 100, \
                             max_bin = 255, \
                             num_leaves = 7, \
                             objective = "lambdarank", \
                             metric = "map", \
                             ndcg_eval_at = [1, 3, 5, 10], \
                             learning_rate = 0.1, \
                             importance_type = "gain", \
                             num_iterations = 10)

# Generate the boosted LTR pipeline
Lmart_l_pip = Rf_pipe >> pt.ltr.apply_learned_model(Lmart_l_map, \
                                                    form="ltr")

# Fit the boosted LTR pipeline
Lmart_l_pip.fit(train_topics, \
                train_qrels, \
                valid_topics, \
                valid_qrels)



[1]	valid_0's map@1: 0
[2]	valid_0's map@1: 0
[3]	valid_0's map@1: 0
[4]	valid_0's map@1: 0
[5]	valid_0's map@1: 0
[6]	valid_0's map@1: 0
[7]	valid_0's map@1: 0
[8]	valid_0's map@1: 0
[9]	valid_0's map@1: 0
[10]	valid_0's map@1: 0


- **Step 3 :** Experimenting with my rankers

In [33]:
# Names of the results
Result_list : list = ["BM25 Baseline", \
                      "LambdaMART (xgBoost)", \
                      "LambdaMART (LightGBM)" ]

# Execute my experiment
Res_LTR_Map = pt.Experiment([Pipeline, Lmart_x_pip, Lmart_l_pip], \
                            test_topics, \
                            Qrels_bis, \
                            ["map"], \
                            names = Result_list)

- **Step 4 :** Drop the BM25 Baseline row to compare both LambdaMART techniques

In [34]:
Res_LTR_Map = Res_LTR_Map[Res_LTR_Map.name != 'BM25 Baseline']

- **Step 5 :** Print the best LambdaMART technique and its MAP value

In [35]:
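# idxmax() returns an index label, so select the row with .loc
# (.iloc reads the label as a position, which can return a different
# row once the baseline row has been dropped)
Best_map_gra = Res_LTR_Map.loc[Res_LTR_Map['map'].idxmax()]

# To show results
Best_map_gra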

name    LambdaMART (LightGBM)
map                 0.0260612
Name: 2, dtype: object

#### Method 2 : Using my `NDCG` Metric

- **Step 1 :** Build an XGBoost ranker

In [42]:
# Set the XGBoost parameters for ndcg metric
Lmart_x_ndcg = xgb.sklearn.XGBRanker(objective = 'rank:ndcg', \
                                     learning_rate = 0.1, \
                                     gamma = 1.0, \
                                     min_child_weight = 0.1, \
                                     max_depth = 6, \
                                     verbose = 2, \
                                     random_state = 42)

# Generate the boosted LTR pipeline 
Lmart_x_pip = Rf_pipe >> pt.ltr.apply_learned_model(Lmart_x_ndcg, \
                                                    form = "ltr")

# Fit the boosted LTR pipeline
Lmart_x_pip.fit(train_topics, \
                train_qrels, \
                valid_topics, \
                valid_qrels)

- **Step 2 :** Build a LightGBM ranker

In [43]:
# Set the LightGBM parameters
Lmart_l_ndcg = lgb.LGBMRanker(task = "train", \
                              min_data_in_leaf = 1, \
                              min_sum_hessian_in_leaf = 100, \
                              max_bin = 255, \
                              num_leaves = 7, \
                              objective = "lambdarank", \
                              metric = "ndcg", \
                              ndcg_eval_at = [1, 3, 5, 10], \
                              learning_rate = 0.1, \
                              importance_type = "gain", \
                              num_iterations = 10)

# Generate the boosted LTR pipeline
Lmart_l_pip = Rf_pipe >> pt.ltr.apply_learned_model(Lmart_l_ndcg, \
                                                    form = "ltr")

# Fit the boosted LTR pipeline
Lmart_l_pip.fit(train_topics, \
                train_qrels, \
                valid_topics, \
                valid_qrels)



[1]	valid_0's ndcg@1: 0
[2]	valid_0's ndcg@1: 0
[3]	valid_0's ndcg@1: 0
[4]	valid_0's ndcg@1: 0
[5]	valid_0's ndcg@1: 0
[6]	valid_0's ndcg@1: 0
[7]	valid_0's ndcg@1: 0
[8]	valid_0's ndcg@1: 0
[9]	valid_0's ndcg@1: 0
[10]	valid_0's ndcg@1: 0


- **Step 3 :** Experimenting with my rankers

In [45]:
# Names of the results
Result_list : list = ["BM25 Baseline", \
                      "LambdaMART (xgBoost)", \
                      "LambdaMART (LightGBM)" ]

# Execute my experiment
Res_LTR_Ndcg = pt.Experiment([Pipeline, Lmart_x_pip, Lmart_l_pip], \
                             test_topics, \
                             Qrels_bis, \
                             ["ndcg"], \
                             names = Result_list )

- **Step 4 :** Drop the BM25 Baseline row to compare both LambdaMART techniques

In [46]:
Res_LTR_Ndcg = Res_LTR_Ndcg[Res_LTR_Ndcg.name != 'BM25 Baseline']

- **Step 5 :** Print the best LambdaMART technique and its NDCG value

In [47]:
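# idxmax() returns an index label, so select the row with .loc
# (.iloc reads the label as a position, which can return a different
# row once the baseline row has been dropped)
Best_Ndcg_gra = Res_LTR_Ndcg.loc[Res_LTR_Ndcg['ndcg'].idxmax()]
Best_Ndcg_gra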

name    LambdaMART (LightGBM)
ndcg                  0.29242
Name: 2, dtype: object