
This notebook uses Ruey-Cheng Chen's AdaRank library to train and test an AdaRank model:
https://github.com/rueycheng/AdaRank

The model is trained on a training svmlight file and evaluated on a test svmlight file.

In [1]:
import helper

# adarank_lib contains the AdaRank implementation from Ruey-Cheng Chen's library
from adarank_lib.adarank import AdaRank
from adarank_lib.metrics import NDCGScorer, APScorer, PScorer
from adarank_lib.utils import load_docno, print_ranking

import os

from sklearn.datasets import load_svmlight_file

# 1. Load the labeled Excel files and turn them into DataFrames

In [2]:
# Excel Files

excel_doc = "../data/Loinc/loinc_dataset_labels-v2.xlsx"
extended_excel_doc = "../data/Loinc/extended_loinc_dataset-v2.xlsx"

### loinc_dataset_labels-v2.xlsx columns:
Original columns
* __Loinc_num__: loinc identifier number.
* __long_common_name__: description
* __component__: name of the component
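* __system__: the specimen or system on which the observation is made (e.g. blood, urine)
* __property__: the kind of quantity being measured (e.g. mass concentration)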

Added columns
* __doc_numb__: document number within each query, from #1 to #67 (201 documents in total)
* __qid__: query id, one of [1, 2, 3]. For the extended file, one of [1, 2, 3, 4, 5, 6]
* __Label__: relevance label. 0: irrelevant, 1: relevant, 2: super-relevant.
  
_(*) Note: Explain what the relevance labels are based on. E.g. for the query "Glucose in Blood", we look for all entries containing the word "glucose" and assign them a 1. Then we look for the word "blood", and whenever it occurs in one of those entries we add 1. This gives relevance 1 when the word "glucose" appears and relevance 2 when both "glucose" and "blood" appear._
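
The labeling rule in the note above can be illustrated with a minimal sketch. The function name `relevance_label`, its arguments, and the three example names are assumptions made for illustration; the labels in the Excel files were not necessarily produced by code like this.

```python
def relevance_label(long_common_name, primary_term, secondary_term):
    """Hypothetical helper: 0 if the primary term is missing,
    1 if only the primary term appears, 2 if both terms appear."""
    text = long_common_name.lower()
    if primary_term.lower() not in text:
        return 0
    return 2 if secondary_term.lower() in text else 1


# Illustrative names only, not rows copied from the dataset.
examples = [
    "Glucose [Mass/volume] in Blood",
    "Glucose [Mass/volume] in Urine",
    "Bilirubin.total [Mass/volume] in Blood",
]
labels = [relevance_label(name, "glucose", "blood") for name in examples]
print(labels)  # [2, 1, 0]
```

Note that under this rule an entry containing only "blood" keeps label 0, because the second term is only counted for entries that already matched the first one.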

In [3]:
# Create Dataframes

# helper.excel_to_df: Reads an excel file and returns a dataframe with all the sheets concatenated
df = helper.excel_to_df(excel_doc)
extended_df = helper.excel_to_df(extended_excel_doc)

# 2. Create the Train and Test sets based on Features (X), Labels (y), and Queries (qid)

### Function: helper.df_to_svmlight_files()
Transforms a dataframe into svmlight files:
  * Takes the column "long_common_name" and creates another one called "cleaned_text" with the preprocessed text:
    - Remove punctuation
    - Tokenize
    - Remove small words
    - Remove stopwords
    - Stem (?)
    - Lemmatize (?)
    - Join sentences
  * Once the text is cleaned, uses _TfidfVectorizer()_ to get each document's features.
  * Creates a new column "features" holding these features for each document.
  * Splits the whole df into 70% train and 30% test:
    - First shuffles the dataset so that the qid 1 docs are not all at the beginning and the qid 3 docs all at the end.
    - Takes the first 70% of the shuffled df as train.
    - Takes the last 30% of the shuffled df as test and sorts it by qid.
  * Takes the column "features" from train and test as X_train and X_test respectively. Takes "label" as y_train and y_test. Takes "qid" as qid_train and qid_test.
  * Uses the sklearn.datasets function _dump_svmlight_file_ to dump the dataset in svmlight / libsvm file format. This is a text-based format with one sample per line. It does not store zero-valued features, which makes it suitable for sparse datasets. Finally, creates .dat files with all of this.
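
The shuffle-and-split step and the svmlight line layout described above can be sketched in plain Python. This is not the code in `helper`: the names `shuffle_split` and `to_svmlight_line`, the fixed seed, the 1-based feature indices, and the toy feature values are all assumptions for illustration.

```python
import random


def to_svmlight_line(label, qid, features):
    """Format one sample as '<label> qid:<qid> <index>:<value> ...'.
    Zero-valued features are skipped; indices are 1-based here (an assumption)."""
    pairs = " ".join(
        "{}:{}".format(i, v) for i, v in enumerate(features, start=1) if v != 0
    )
    return "{} qid:{} {}".format(label, qid, pairs).rstrip()


def shuffle_split(rows, train_fraction=0.7, seed=0):
    """Shuffle the rows, keep the first 70% as train,
    and return the remaining rows sorted by qid as test."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)
    train = shuffled[:cut]
    test = sorted(shuffled[cut:], key=lambda row: row["qid"])
    return train, test


# Toy rows with made-up feature vectors (not real TF-IDF values).
rows = [
    {"label": i % 3, "qid": 1 + i // 4, "features": [0.0, 0.5 * (i % 2), 0.25]}
    for i in range(10)
]
train, test = shuffle_split(rows)
print(len(train), len(test))  # 7 3
print(to_svmlight_line(2, 1, [0.0, 0.5, 0.25]))  # 2 qid:1 2:0.5 3:0.25
```

Sorting the test rows by qid keeps the documents of each query contiguous, which ranking tools that read svmlight files with query ids commonly expect.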


In [4]:
# Svmlight files
os.chdir('../data/Loinc/svmlight_files')
file, train_file, test_file  = helper.df_to_svmlight_files(df)
extended_file, extended_train_file, extended_test_file = helper.df_to_svmlight_files(extended_df)

[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\laura\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\laura\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


In [18]:
def get_adarank_score_and_ranking(excel_doc, sc="NDCGScorer"):
    """
    Train AdaRank, print the test score(s), and write the test ranking to a file.
    The `sc` argument selects the scorer used for training and evaluation:
    1. NDCGScorer (Normalized Discounted Cumulative Gain scorer) Default:
        A measure of ranking quality that is often used to measure effectiveness 
        of web search engine algorithms or related applications.
    2. APScorer (Average Precision scorer):
        ...
    3. PScorer (Precision scorer):
        ...
    """
    if excel_doc == extended_excel_doc:
        tr_file = extended_train_file
        tst_file = extended_test_file
        if sc == "NDCGScorer":
            ranking_file = 'extended_NDCG_ranking.txt'
        elif sc == "APScorer":
            ranking_file = 'extended_AP_ranking.txt'
        elif sc == "PScorer":
            ranking_file = 'extended_P_ranking.txt'
    else:
        tr_file = train_file
        tst_file = test_file
        if sc == "NDCGScorer":
            ranking_file = 'NDCG_ranking.txt'
        elif sc == "APScorer":
            ranking_file = 'AP_ranking.txt'
        elif sc == "PScorer":
            ranking_file = 'P_ranking.txt'
   
    X_train, y_train, qid_train = load_svmlight_file(tr_file, query_id=True)
    X_test, y_test, qid_test = load_svmlight_file(tst_file, query_id=True)
    
    
    '''
    Run AdaRank for 100 iterations optimizing for NDCG@10 / AP@10 / P@10. 
    When no improvement is made within the previous 10 iterations, 
    the algorithm will stop.
    '''
    if sc == "NDCGScorer":
        model = AdaRank(max_iter=100, stop=10, scorer=NDCGScorer(k=10)).fit(X_train, y_train, qid_train)
    elif sc == "APScorer":
        model = AdaRank(max_iter=100, stop=10, scorer=APScorer()).fit(X_train, y_train, qid_train)
    elif sc == "PScorer":
        model = AdaRank(max_iter=100, stop=10, scorer=PScorer()).fit(X_train, y_train, qid_train)
    pred = model.predict(X_test, qid_test)
    
    # nDCG scores
    if sc == "NDCGScorer":
        for k in (1, 2, 3, 4, 5, 10, 20):
            score = NDCGScorer(k=k)(y_test, pred, qid_test).mean()
            print('nDCG@{}\t{}'.format(k, score))
    # AP scores
    elif sc == "APScorer":
        score = APScorer()(y_test, pred, qid_test).mean()
        print('AP\t{}'.format(score))
    # Precision scores
    elif sc == "PScorer":
        score = PScorer()(y_test, pred, qid_test).mean()
        print('P\t{}'.format(score))
        
    
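    # Write the ranking to a file in the Rankings folder, then restore the working directory
    docno = load_docno(tst_file, letor=False)
    os.chdir('../../../Assignment3/Rankings')
    with open(ranking_file, 'w') as ranking_output:
        print_ranking(qid_test, docno, pred, output=ranking_output)
    os.chdir("../../data/Loinc/svmlight_files")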

In [19]:
# Get scores for excel documents
print('nDCG scores for original dataset:')
get_adarank_score_and_ranking(excel_doc)


print('\nnDCG scores for extended dataset:')
get_adarank_score_and_ranking(extended_excel_doc)



nDCG scores for original dataset:
nDCG@1	0.0
nDCG@2	0.1289509357448472
nDCG@3	0.1289509357448472
nDCG@4	0.2169736432690442
nDCG@5	0.2481896027351573
nDCG@10	0.34795917795457837
nDCG@20	0.34795917795457837

nDCG scores for extended dataset:
nDCG@1	0.0
nDCG@2	0.0
nDCG@3	0.05109559939712153
nDCG@4	0.16688637950478555
nDCG@5	0.16688637950478555
nDCG@10	0.19912411344099734
nDCG@20	0.2756690561705189


In [20]:
print('AP scores for original dataset:')
get_adarank_score_and_ranking(excel_doc, "APScorer")


print('\nAP scores for extended dataset:')
get_adarank_score_and_ranking(extended_excel_doc, "APScorer")

AP scores for original dataset:
AP	0.25767195767195766

AP scores for extended dataset:
AP	0.16508295561362615


In [21]:
print('Precision scores for original dataset:')
get_adarank_score_and_ranking(excel_doc, "PScorer")


print('\nPrecision scores for extended dataset:')
get_adarank_score_and_ranking(extended_excel_doc, "PScorer")

Precision scores for original dataset:
P	0.08293460925039872

Precision scores for extended dataset:
P	0.06430557275967905
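
As a point of comparison for the AP and P figures above, here is a minimal per-query sketch of both metrics. It assumes that any label greater than 0 counts as relevant and uses textbook definitions; `APScorer` and `PScorer` in adarank_lib may handle graded labels or cutoffs differently, and the names and labels below are illustrative.

```python
def precision_at_k(labels, k):
    """Fraction of the top-k documents that are relevant (label > 0)."""
    return sum(1 for rel in labels[:k] if rel > 0) / k


def average_precision(labels):
    """Mean of the precision values at each rank where a relevant document appears."""
    hits = 0
    total = 0.0
    for rank, rel in enumerate(labels, start=1):
        if rel > 0:
            hits += 1
            total += hits / rank
    return total / hits if hits else 0.0


# Made-up relevance labels of one query, listed in predicted rank order.
ranked_labels = [0, 2, 1, 0]
print(precision_at_k(ranked_labels, 2))  # 0.5
print(average_precision(ranked_labels))  # (1/2 + 2/3) / 2
```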
