# Learn to rank on top of Azure Cognitive Search

This notebook showcases how to train an L2 ranker, using a [Learn to rank](https://en.wikipedia.org/wiki/Learning_to_rank) approach, to be run on top of Azure Cognitive Search. 

Through this experiment, we are going to:
1. Use Azure Cognitive Search's new feature computation capability to extract text-based similarity features that describe query-to-document relationships
2. Do additional feature engineering to enhance our dataset further 
3. Train a model using [XGBoost](https://xgboost.readthedocs.io/en/latest/)
4. Evaluate the ranking produced by the trained model against the base Azure Cognitive Search ranking using the [NDCG metric](https://en.wikipedia.org/wiki/Discounted_cumulative_gain#Normalized_DCG).
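
For reference, NDCG@k can be sketched in a few lines. The version below uses the linear-gain form of DCG (each grade divided by log2 of its rank plus one). The function names `dcg_at_k` and `ndcg_at_k` and the sample grades are illustrative assumptions, not part of this notebook's helpers, and the evaluation step may use a different gain function.

```python
import math

def dcg_at_k(grades, k):
    """Discounted cumulative gain over the first k grades, in ranked order."""
    # enumerate starts at 0, so the 1-based rank is (position + 1)
    # and the discount is log2(rank + 1) = log2(position + 2).
    return sum(g / math.log2(position + 2) for position, g in enumerate(grades[:k]))

def ndcg_at_k(grades, k):
    """DCG of the given ordering divided by the DCG of the ideal ordering."""
    ideal = dcg_at_k(sorted(grades, reverse=True), k)
    return dcg_at_k(grades, k) / ideal if ideal > 0 else 0.0

# Grades of the documents in the order a ranker returned them (made-up values).
ranked_grades = [3, 10, 1, 7]
print(round(ndcg_at_k(ranked_grades, k=4), 3))
```

A perfectly ordered result list scores 1.0, and any ordering that places lower-graded documents first scores below that.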





In [1]:
%load_ext autoreload
%autoreload 2
# %load_ext memory_profiler
import sys
import os
import json
import warnings
warnings.filterwarnings(action='once')

from pathlib import Path
from pprint import pprint

import numpy as np
import requests

import azs_helpers.l2r_helper as azs

from azs_helpers.azure_search_client import azure_search_client as azs_client 
from azs_helpers.azs_msft_docs import azs_msft_docs as azs_docs

### Experiment setup

This experiment uses a dataset containing **7102 articles** from the **docs.microsoft.com** website. Each article contains a title, body, description, list of API names, and a URL path. Articles were augmented using the [key phrase extraction cognitive skill](https://docs.microsoft.com/en-us/azure/search/cognitive-search-skill-keyphrases) as well as with popular search terms that led to those articles. Additionally, we augmented the articles with easy-to-compute metadata that will be leveraged when training, such as the number of sections and tables in each article, as well as the normalized page view count.

You can find the full index definition in `azs_helpers\index_schema\docs-multilingual-20200217.json`:

```json
{
  "name": "docs-multilingual-20200217",
  "fields": [
    {
      "name": "id",
      "type": "Edm.String",
      "facetable": true,
        ...
    },
    ...
  ]
}
```


The experiment also relies on a labeled training set containing over 900 unique queries evaluated against various articles. We refer to this data as the "judgment" list, which will be used as the ground truth when evaluating ranking. Each query is evaluated against 1 to 10 different documents, and for each, provides a "grade" representing how relevant that specific document is to that query. A value of 10 indicates high relevance, while a value of 1 indicates lower relevance.
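
To illustrate the shape of the judgment list, the sketch below builds a tiny table by hand and groups it by query. The `query` and `url` column names match what the code later in this notebook reads. The `grade` column name and all row values are assumptions made up for the example.

```python
import pandas as pd

# Hypothetical rows; the real list is loaded from the CSV file.
judgments = pd.DataFrame(
    [
        {"query": "azure search", "url": "https://example.com/a", "grade": 10},
        {"query": "azure search", "url": "https://example.com/b", "grade": 3},
        {"query": "blob storage", "url": "https://example.com/c", "grade": 7},
    ]
)

# One group per query, each holding the judged documents for that query.
for query, group in judgments.groupby("query"):
    best = group.sort_values("grade", ascending=False).iloc[0]
    print(f"{query}: {len(group)} judged documents, top grade {best['grade']}")
```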

The judgment list can be found here: `PATH_TO_JUDGEMENT_LIST.CSV`

#### Configuration
Details and secrets about your search service should be added to a `config.json` file in this format:

```json
{
   "service_name": "YOUR_SERVICE_NAME",
   "endpoint": "https://YOUR_SERVICE_NAME.search.windows.net",
   "api_version": "2019-05-06-preview",
   "api_key": "YOUR_API_KEY",
   "index_name": "rankingindex-msft-docs"
}
```   

The `config.json` file should be placed in the `config` folder within the repository. If you prefer, you can instead uncomment and fill in the cell below, and it will generate a service config for you.
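
Before running the rest of the notebook, it can help to check that the file contains every key shown above. The following is a minimal sketch of such a check. `load_service_config` and `REQUIRED_KEYS` are hypothetical names introduced here; the notebook itself loads the file through `azs_client.from_json`, whose internals are not shown.

```python
import json
from pathlib import Path

# Keys taken from the sample config.json above.
REQUIRED_KEYS = {"service_name", "endpoint", "api_version", "api_key", "index_name"}

def load_service_config(path):
    """Read a JSON config file and fail early if a required key is missing."""
    config = json.loads(Path(path).read_text(encoding="utf-8"))
    missing = REQUIRED_KEYS - config.keys()
    if missing:
        raise KeyError(f"config.json is missing keys: {sorted(missing)}")
    return config
```

It could be called as `load_service_config(Path.cwd() / 'config' / 'config.json')`, which raises a `KeyError` naming any absent key.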

In [2]:
# Azure search service configuration. You can uncomment the code here 
# and fill in the details if you prefer.

# service_config = {
#     "service_name": "<YOUR SERVICE NAME HERE>",
#     "endpoint": "<YOUR ENDPOINT HERE>",
#     "api_version": "2019-05-06-preview",
#     "api_key": "<YOUR API KEY HERE>",
#     "index_name": "docs-multilingual-20200217"
# }

# service_config_root = Path.cwd() / 'config'
# service_config_root.mkdir(parents=True, exist_ok=True)

# service_config_path = service_config_root / 'config.json'
# with open(service_config_path, 'w') as f:
#     json.dump(service_config, f)

In [3]:
local_config = {
    "dataset_seed": 42,
    "verbose": False,
    "reindex": False,
    "min_document_count": 7102,
    # Documents path is hardcoded, please change this to your local dir.
    "local_documents_directory_path" : Path("D:/ranking/doc/extracted_ndcg_light"),
    "judgement_file_path": Path.cwd() / 'data' / 'raw' / 'msft_docs_labels.csv',
    "service_metadata_config_path": Path.cwd() / 'config' / 'config.json'
}

azs_service = azs_client.from_json(local_config['service_metadata_config_path'])
msft_docs = azs_docs(local_config['judgement_file_path'])

## Service preparation
The following section sets up an index with the data required to run the experiment.

In [4]:
if not azs_service.index_exist():
    print(f"Index {azs_service.index_name} does not exist in service. Creating.")
    index = msft_docs.create_index(azs_service, schema_file="docs-multilingual-20200217.json")
else:
    print(f"Index {azs_service.index_name} already exists. Skipping creation.")
    
doc_count = azs_service.index_documents_count()
if doc_count < local_config["min_document_count"]: 
    print(f"Index {azs_service.index_name} contains only {doc_count} out of {local_config['min_document_count']} documents. Uploading documents.")
    docs = msft_docs.get_documents_from_local_folder(local_config["local_documents_directory_path"])
    azs_service.upload_documents(docs, 100)
else:
    print(f"Index {azs_service.index_name} contains all {doc_count} documents. Skipping uploading.")



Succesfully connected to search service 'azslearntorank'
Index docs-multilingual-20200217 already exists. Skipping creation.
Index docs-multilingual-20200217 contains all 7102 documents. Skipping uploading.


### Extracting features from search service

The following functions are designed to efficiently use the Azure Search service to extract document-query features.

1. We filter each query to only consider the documents we want to judge. This is achieved by adding a **"filter"** clause to our search query, which restricts the results to the documents we have judgment values for.
2. We set **"featuresMode"** to "enabled". This tells the search service to return additional features with the results, including per-field similarity scores.
3. We use the **"select"** clause to only return the URL of each document, as well as a few non-text-based fields that could potentially be used as features. This greatly reduces the amount of data that needs to be transferred between the server and the client.
4. We use the **"searchFields"** parameter to select which text-based fields we want to include in the search process. Those fields are the only ones from which the service will extract text-based features (such as per-field similarity).
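
Step 1 can be sketched in isolation. The helper below builds the OData filter string from a list of URLs. `build_url_filter` is a hypothetical name, and doubling single quotes is an added safeguard (OData string literals escape `'` as `''`) that the notebook's own code does not apply.

```python
def build_url_filter(urls, field="url_en_us"):
    """Build an OData filter that matches any of the given URLs."""
    clauses = []
    for url in urls:
        # OData escapes a single quote inside a string literal by doubling it.
        escaped = url.replace("'", "''")
        clauses.append(f"{field} eq '{escaped}'")
    return " or ".join(clauses)

print(build_url_filter([
    "https://docs.microsoft.com/en-us/azure/search/",
    "https://docs.microsoft.com/en-us/azure/storage/",
]))
```

An empty list yields an empty string, which is why the request body only receives a `filter` entry when there is at least one URL.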

The expected response to this query will have the following format:

```json
    "value": [
     {
        "@search.score": 5.1958685,
        "@search.features": {
            "description_en_us": {
                "uniqueTokenMatches": 1.0,
                "similarityScore": 0.29541412
            },
            "body_en_us": {
                "uniqueTokenMatches": 3.0,
                "similarityScore": 0.36644348400000004
            },
            "keyPhrases_en_us": {
                "uniqueTokenMatches": 3.0,
                "similarityScore": 0.35014877
            },
            "title_en_us": {
                "uniqueTokenMatches": 3.0,
                "similarityScore": 1.75451557
            },
            "urlPath": {
                "uniqueTokenMatches": 2.0,
                "similarityScore": 1.07175103
            },
            "searchTerms": {
                "uniqueTokenMatches": 3.0,
                "similarityScore": 1.3575956200000001
            }
        },
        "normalized_pageview": null,
        "tableCount": 0,
        "sectionCount": 7,
        "url_en_us": "https://docs.microsoft.com/en-us/azure/search/"
    }]
```
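
To see how a response of this shape turns into feature columns, the sketch below runs `pd.json_normalize` on a trimmed-down result. The sample keeps only two of the fields from the response above. Nested keys are joined with a dot, so each per-field feature becomes its own column.

```python
import pandas as pd

# A trimmed-down result in the same shape as the response above.
results = [
    {
        "@search.score": 5.1958685,
        "@search.features": {
            "title_en_us": {"uniqueTokenMatches": 3.0, "similarityScore": 1.75451557},
            "body_en_us": {"uniqueTokenMatches": 3.0, "similarityScore": 0.36644348400000004},
        },
        "sectionCount": 7,
        "url_en_us": "https://docs.microsoft.com/en-us/azure/search/",
    }
]

# Each nested dictionary is flattened into dot-separated column names,
# for example "@search.features.title_en_us.similarityScore".
features = pd.json_normalize(results)
print(sorted(features.columns))
```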

In [5]:
import json
import pandas as pd

def get_search_results_from_service(service, query, urls_filter):
    search_request_body = {
        "search":azs.escape_query(query),
        "featuresMode": "enabled",
        "select": "title_en_us, url_en_us, sectionCount, tableCount, normalized_pageview", 
        "searchFields": "body_en_us,description_en_us,title_en_us,apiNames,urlPath,searchTerms, keyPhrases_en_us",
        "scoringStatistics": "global",
        "sessionId" : "my_session",
        "top" : 20
    }
    if len(urls_filter) > 0:
        search_request_body["filter"] = " or ".join(f"url_en_us eq '{url}'" for url in urls_filter)

    return service.search(search_request_body)

def get_features_from_service(service, query, group):
    urls = group['url'].values.tolist()
    
    search_results = get_search_results_from_service(service, query, urls)

    # this will flatten the search JSON response into a pandas DataFrame
    azs_features = pd.json_normalize(search_results)
    
    # we add the data extracted from azure search to our labeled data by merging them on the "url" field
    merged_results = group.join(azs_features.set_index('url_en_us'), on='url')
    
    # some of the labeled documents in our dataset did not match any documents in the Azure Search instance,
    # we will remove them from our data by dropping any row that did not produce a search score
    return merged_results.dropna(subset=['@search.score'])

### Make parallel calls to the Azure Search service to extract features

To extract all the features from our dataset, we start by grouping the judgment list by query. This provides us with a list of judged documents for each query. Each call to the Azure Cognitive Search service uses the query from the group, with filters to make sure we only return the documents from the group. In this dataset, we can expect approximately 900 queries.

To quickly execute those queries, we set up a thread pool executor which runs the queries in parallel. The level of parallelism can be changed to accommodate different search service capacities.
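
The pattern can be tried without a search service. In the sketch below, `fetch_features` is a hypothetical stand-in for a call to the service, and the queries are made up. The `with` block is a variation on the cell that follows: it shuts the pool down once all work is done.

```python
import concurrent.futures
import time

def fetch_features(query):
    """Stand-in for a call to the search service (hypothetical)."""
    time.sleep(0.01)  # simulate network latency
    return {"query": query, "length": len(query)}

queries = ["azure search", "blob storage", "cosmos db"]

# max_workers bounds how many requests are in flight at once.
with concurrent.futures.ThreadPoolExecutor(max_workers=3) as executor:
    futures = [executor.submit(fetch_features, q) for q in queries]
    # result() blocks until the task finishes and re-raises any exception
    # that was raised inside the worker thread.
    results = [future.result() for future in futures]

print(results)
```

Because results are collected in submission order, they line up with the input queries even though the calls complete in arbitrary order.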

In [6]:
import concurrent.futures
import datetime
from itertools import chain

query_groups = msft_docs.judgements.groupby('query')

print(datetime.datetime.now().strftime("%Y-%m-%d %H:%M:%S"))

executor = concurrent.futures.ThreadPoolExecutor(30)
futures = [executor.submit(get_features_from_service, azs_service, query, group) for (query, group) in query_groups]
concurrent.futures.wait(futures)

print(datetime.datetime.now().strftime("%Y-%m-%d %H:%M:%S"))

all_features = pd.concat([future.result() for future in futures if future], sort=False).fillna(0)

2020-04-20 17:25:35
Search request failed with status: 503. Sleeping 100ms. Retrying... Retry count so far 0
Search request failed with status: 503. Sleeping 100ms. Retrying... Retry count so far 0
Search request failed with status: 503. Sleeping 100ms. Retrying... Retry count so far 0
Search request failed with status: 503. Sleeping 100ms. Retrying... Retry count so far 0
Search request failed with status: 503. Sleeping 100ms. Retrying... Retry count so far 0
Search request failed with status: 503. Sleeping 100ms. Retrying... Retry count so far 0
Search request failed with status: 503. Sleeping 100ms. Retrying... Retry count so far 0
Search request failed with status: 503. Sleeping 100ms. Retrying... Retry count so far 0
Search request failed with status: 503. Sleeping 100ms. Retrying... Retry count so far 0
Search request failed with status: 503. Sleeping 100ms. Retrying... Retry count so far 0
Search request failed with status: 503. Sleeping 100ms. Retrying... Retry count so far 0
...
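
The 503 responses in the output above show the service pushing back under load, with the client sleeping and retrying. The helper's retry code is not shown in this notebook, so the following is only a sketch of one way to do it. `call_with_retry` is a hypothetical name, and it uses exponential backoff rather than the fixed 100 ms sleep seen in the log.

```python
import time

def call_with_retry(send_request, max_retries=5, base_delay=0.1):
    """Call send_request() until it returns a status other than 503.

    send_request is any zero-argument callable returning (status, body).
    """
    for attempt in range(max_retries + 1):
        status, body = send_request()
        if status != 503:
            return status, body
        if attempt < max_retries:
            # Wait longer after each failed attempt: 0.1s, 0.2s, 0.4s, ...
            time.sleep(base_delay * (2 ** attempt))
    # All attempts returned 503; hand the last response back to the caller.
    return status, body

# Simulated service that is busy twice before answering (made-up responses).
responses = iter([(503, None), (503, None), (200, {"value": []})])
print(call_with_retry(lambda: next(responses), base_delay=0.001))
```

If retries are frequent, lowering the number of worker threads is usually the simpler fix.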

### Serialize data for next step

Now that we've successfully processed our dataset, we're going to serialize it to disk for the next part of our tutorial.

In [7]:
interim_data_dir = Path.cwd() / 'data' / 'interim'
interim_data_dir.mkdir(parents=True, exist_ok=True)

all_features.to_pickle(interim_data_dir / 'features.pkl')
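
The next step can read the file back with `pd.read_pickle`. The round trip below is a small sketch that uses a temporary directory and a made-up one-row frame instead of the real `all_features` data.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Made-up stand-in for the all_features DataFrame.
frame = pd.DataFrame({"query": ["azure search"], "@search.score": [5.1958685]})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = Path(tmp_dir) / "features.pkl"
    frame.to_pickle(path)            # write the DataFrame to disk
    restored = pd.read_pickle(path)  # load it back, dtypes included

print(restored.equals(frame))
```

Pickle preserves column types and the index, which makes it convenient for intermediate data. Only unpickle files you produced yourself, since loading a pickle can execute arbitrary code.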