# Learning to Rank using TMDB data

This notebook runs through the learning to rank example supported by the code in the scripts folder. For more information about the example and the Elasticsearch Learning to Rank plugin, please see [this blog post from Open Source Connections](http://opensourceconnections.com/blog/2017/02/14/elasticsearch-learning-to-rank/).

## Prerequisites

To run the code in this notebook you will need:
0. Python, with the packages listed in `requirements.txt`. The scripts should work with either Python 2 or 3.
1. An Elasticsearch instance with the learning to rank plugin installed. You also need permission to create indexes. If you don't have an instance available, you can download Elasticsearch and set it up on your local computer. Note that, for now, the plugin works only with ES 5.1.1.
2. The _movielens_ dataset. This dataset can be retrieved by running [prepareData.sh](prepareData.sh). The script downloads the zip file and unzips it to a folder named `ml-20m`.
3. The tmdb movie data. To generate this file, you need an account and an API key from tmdb (obtain one from https://www.themoviedb.org/documentation/api). After getting the API key, define an environment variable (e.g. `export TMDB_API_KEY=<Your Key>`) and execute [tmdb.py](tmdb.py). The script generates a file named `tmdb.json`. (Warning: due to API restrictions, `tmdb.py` runs for about 6 hours to collect all entries.) `tmdb.py` needs `ml-20m/links.csv` when creating `tmdb.json`.
4. RankLibPlus-0.1.0.jar copied over to the scripts/ folder. RankLibPlus is a hardened version of RankLib that addresses some performance issues. The repository can be accessed at https://github.com/o19s/RankyMcRankFace. You can use `mvn package` to build the jar file.
5. The following line added to the config/elasticsearch.yml file of your Elasticsearch instance:
   
   `script.max_size_in_bytes: [super large integer]`
   
   The plugin stores model files as scripts, and Elasticsearch applies a default soft limit of 65535 bytes to script size.
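
As a quick sanity check before choosing a value for this setting, the size of a trained model file can be compared against whatever limit is configured. The snippet below is a minimal sketch: `exceeds_script_limit` is a hypothetical helper, not part of the plugin or the scripts folder, and the limit is passed in explicitly rather than assumed.

```python
import os


def exceeds_script_limit(model_path, limit_bytes):
    """Return True if the file at model_path is larger than limit_bytes."""
    # os.path.getsize reports the on-disk size in bytes
    return os.path.getsize(model_path) > limit_bytes
```

For example, `exceeds_script_limit('model.txt', limit_bytes)` could be called after the Model Training step below, with `limit_bytes` set to the value configured in `elasticsearch.yml`.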
   


## Indexing tmdb Data

The first step is to upload the tmdb data to Elasticsearch to create an index:

In [1]:
%%capture
import json
from indexMlTmdb import reindex
from elasticsearch import Elasticsearch

esUrl = "http://localhost:9200"  # assumes a local Elasticsearch instance

es = Elasticsearch(esUrl, timeout=30)
with open('tmdb.json') as f:  # close the file once it has been read
    movieDict = json.load(f)  # load the movie documents
reindex(es, movieDict=movieDict, index='tmdb')  # delete any existing index and recreate it
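
The internals of `reindex` are not shown in this notebook. As a rough illustration of what bulk indexing could look like, here is a minimal sketch that turns the movie dictionary into bulk actions. It assumes `movieDict` maps tmdb ids to movie documents and that the documents use the `movie` type; `bulk_actions` is a hypothetical name, not a function from `indexMlTmdb`.

```python
def bulk_actions(movie_dict, index='tmdb', doc_type='movie'):
    """Yield one bulk-index action per movie, keyed by its tmdb id.

    Assumes movie_dict maps id -> document (an assumption about tmdb.json).
    """
    for movie_id, movie in movie_dict.items():
        yield {
            '_index': index,
            '_type': doc_type,
            '_id': movie_id,
            '_source': movie,
        }
```

A generator like this can be handed to `elasticsearch.helpers.bulk(es, bulk_actions(movieDict))`, which sends the documents in batches instead of one request per document.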

To verify the index is created, let's retrieve a document: 

In [3]:
es.search(index='tmdb', body={"query": {"match": { "title": "rambo"}}}, size=1)

{'_shards': {'failed': 0, 'successful': 1, 'total': 1},
 'hits': {'hits': [{'_id': '7555',
    '_index': 'tmdb',
    '_score': 12.279254,
    '_source': {'adult': False,
     'backdrop_path': '/mgSMefETH89UnNWMffevxZPKDnO.jpg',
     'belongs_to_collection': {'backdrop_path': '/Yt2ZxbJv2HM842B6FNMr59Vhyb.jpg',
      'id': 5039,
      'name': 'Rambo Collection',
      'poster_path': '/feGOEOVrOLyjtEnVa88rQLgD3XY.jpg'},
     'budget': 50000000,
     'cast': [{'cast_id': 12,
       'character': 'John Rambo',
       'credit_id': '52fe4484c3a36847f809a9e9',
       'id': 16483,
       'name': 'Sylvester Stallone',
       'order': 0,
       'profile_path': '/gnmwOa46C2TP35N7ARSzboTdx2u.jpg'},
      {'cast_id': 13,
       'character': 'Sarah',
       'credit_id': '52fe4484c3a36847f809a9ed',
       'id': 35551,
       'name': 'Julie Benz',
       'order': 1,
       'profile_path': '/3ymRPOitD5IKXZgDLoy8FLRGQCn.jpg'},
      {'cast_id': 15,
       'character': 'School Boy',
       'credit_id': '52

## Training Data Creation
The next step is to create the training data needed to build the learning to rank model. As in any supervised learning task, the training set consists of labels (in our case, the relevance grade of each search result) and features.

### Data Labels (relevance grades)

In this example the relevance grades are loaded from [sample_judgements.txt](sample_judgements.txt) and formatted according to the RankLib file format.

In [4]:
%%capture
from judgments import judgmentsFromFile, judgmentsByQid
judgements = judgmentsByQid(judgmentsFromFile(filename='sample_judgements.txt'))

In [5]:
#print out the relevance scores and query ids
for qid, judgmentList in judgements.items():
    for judgment in judgmentList:
        print(judgment.toRanklibFormat())

3	qid:1	 # 1370	rambo
3	qid:1	 # 1369	rambo
3	qid:1	 # 1368	rambo
0	qid:1	 # 136278	rambo
0	qid:1	 # 102947	rambo
0	qid:1	 # 13969	rambo
0	qid:1	 # 61645	rambo
0	qid:1	 # 14423	rambo
0	qid:1	 # 54156	rambo
4	qid:2	 # 1366	rocky
3	qid:2	 # 1246	rocky
3	qid:2	 # 60375	rocky
3	qid:2	 # 1371	rocky
3	qid:2	 # 1375	rocky
3	qid:2	 # 1374	rocky
0	qid:2	 # 110123	rocky
0	qid:2	 # 17711	rocky
0	qid:2	 # 36685	rocky
4	qid:3	 # 17711	bullwinkle
0	qid:3	 # 1246	bullwinkle
0	qid:3	 # 60375	bullwinkle
0	qid:3	 # 1371	bullwinkle
0	qid:3	 # 1375	bullwinkle
0	qid:3	 # 1374	bullwinkle


The first column is the relevance grade and the second column is the query id. Everything after the `#` sign (here the document id and the query keywords) is a human-readable annotation that RankLib ignores.
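
To make the column layout concrete, here is a minimal sketch of a parser for one such line. `parse_ranklib_line` is a hypothetical helper written for illustration and is not part of `judgments.py`; it also accepts the optional `id:value` feature pairs that appear later in this notebook.

```python
def parse_ranklib_line(line):
    """Split one RankLib line into (grade, qid, features, comment)."""
    # Everything after the first '#' is a free-text annotation
    data, _, comment = line.partition('#')
    tokens = data.split()
    grade = int(tokens[0])
    qid = int(tokens[1].split(':', 1)[1])  # 'qid:1' -> 1
    features = {}
    for token in tokens[2:]:
        feature_id, value = token.split(':', 1)
        features[int(feature_id)] = float(value)
    return grade, qid, features, comment.strip()
```

Calling it on the first line of the output above gives a grade of 3, a query id of 1, an empty feature dictionary, and the annotation holding the document id and keywords.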

### Features 
Features of learning to rank models are often retrieved and/or calculated from the search engine itself via the query API. In the tmdb example, we use two features:

1. The tf-idf score of the movie title. This query can be expressed as the following query template (see [1.json.jinja](1.json.jinja)):

```
{
  "query": {
    "match": {
      "title": "{{keywords}}"
    }
  }
}
```

2. The tf-idf score of a multi-match over multiple fields, expressed as the following query template (see [2.json.jinja](2.json.jinja)):

```
{
  "query": {
    "multi_match": {
      "query": "{{keywords}}",
      "type": "cross_fields",
      "fields": ["overview", "genres.name", "title", "tagline",
                 "belongs_to_collection.name", "cast.name", "directors.name"],
      "tie_breaker": 1.0
    }
  }
}
```
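
To show how a template becomes a concrete query, here is a minimal sketch that fills the `{{keywords}}` placeholder. It uses plain string replacement as a stand-in for Jinja rendering so that it runs with the standard library alone; how `features.py` actually renders the templates is not shown here, and `render_query` is a hypothetical name.

```python
import json

# Single-line copy of the first template (1.json.jinja)
TITLE_TEMPLATE = '{"query": {"match": {"title": "{{keywords}}"}}}'


def render_query(template, keywords):
    """Fill the {{keywords}} placeholder and parse the result as JSON."""
    # json.dumps escapes quotes and backslashes; drop its surrounding quotes
    # because the template already supplies them
    escaped = json.dumps(keywords)[1:-1]
    return json.loads(template.replace('{{keywords}}', escaped))
```

Escaping the keywords before substitution keeps the rendered text valid JSON even when a query contains double quotes.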

The following function calls Elasticsearch to generate the features and appends them to the judgments.

In [6]:
%%capture
from features import kwDocFeatures
kwDocFeatures(es, index='tmdb', searchType='movie', judgements=judgements)

The full judgment list augmented with features (i.e. the training data) looks like this:

In [7]:
for qid, judgmentList in judgements.items():
    for judgment in judgmentList:
        print(judgment.toRanklibFormat())

3	qid:1	1:9.482015	2:25.469782 # 1370	rambo
3	qid:1	1:6.826077	2:23.13993 # 1369	rambo
3	qid:1	1:0.0	2:17.151937 # 1368	rambo
0	qid:1	1:0.0	2:0.0 # 136278	rambo
0	qid:1	1:0.0	2:0.0 # 102947	rambo
0	qid:1	1:0.0	2:0.0 # 13969	rambo
0	qid:1	1:0.0	2:0.0 # 61645	rambo
0	qid:1	1:0.0	2:0.0 # 14423	rambo
0	qid:1	1:0.0	2:0.0 # 54156	rambo
4	qid:2	1:10.646808	2:20.49501 # 1366	rocky
3	qid:2	1:8.221444	2:21.073606 # 1246	rocky
3	qid:2	1:8.221444	2:14.424812 # 60375	rocky
3	qid:2	1:8.221444	2:16.640888 # 1371	rocky
3	qid:2	1:8.221444	2:18.506395 # 1375	rocky
3	qid:2	1:8.221444	2:19.772667 # 1374	rocky
0	qid:2	1:6.7930174	2:5.5697646 # 110123	rocky
0	qid:2	1:5.9185953	2:13.270146 # 17711	rocky
0	qid:2	1:5.9185953	2:14.717502 # 36685	rocky
4	qid:3	1:7.472444	2:19.033852 # 17711	bullwinkle
0	qid:3	1:0.0	2:0.0 # 1246	bullwinkle
0	qid:3	1:0.0	2:0.0 # 60375	bullwinkle
0	qid:3	1:0.0	2:0.0 # 1371	bullwinkle
0	qid:3	1:0.0	2:0.0 # 1375	bullwinkle
0	qid:3	1:0.0	2:0.0 # 1374	bullwinkle


The file is then saved to disk for model training.

In [8]:
#save the judgement file for model training
from features import buildFeaturesJudgmentsFile
buildFeaturesJudgmentsFile(judgements, filename='sample_judgements_wfeatures.txt')
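
The body of `buildFeaturesJudgmentsFile` is not reproduced in this notebook. As an illustration of the file it produces, the sketch below formats a judgment with its features as a RankLib line and writes lines to disk. Both function names are hypothetical, and the layout is inferred from the output printed above.

```python
def to_ranklib_line(grade, qid, features, doc_id, keywords):
    """Format one judgment as a RankLib training line.

    features is a list of values; feature ids are assigned from 1 upward.
    """
    feature_part = '\t'.join(
        '%d:%s' % (i, value) for i, value in enumerate(features, start=1)
    )
    return '%d\tqid:%d\t%s # %s\t%s' % (grade, qid, feature_part, doc_id, keywords)


def write_ranklib_file(lines, filename):
    """Write one training line per row to filename."""
    with open(filename, 'w') as f:
        for line in lines:
            f.write(line + '\n')
```

Feature ids matter here: RankLib identifies features by their numeric id, so the same feature must carry the same id at training time and at scoring time.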

## Model Training

Model training is conducted by invoking RankLib. The actual command is shown below; `-ranker 6` selects the LambdaMART algorithm.

```
java -jar RankLibPlus-0.1.0.jar -ranker 6 -train [name of judgment file] -save [name of model file]
```

Below we use a Python wrapper that executes the same command.

In [11]:
from train import trainModel, saveModel

trainModel(judgmentsWithFeaturesFile='sample_judgements_wfeatures.txt', modelOutput='model.txt')

Running java -jar RankLibPlus-0.1.0.jar  -ranker 6 -train sample_judgements_wfeatures.txt -save model.txt
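
For readers curious about what such a wrapper involves, here is a minimal sketch built on the standard `subprocess` module. It is not the code in `train.py`; the names `ranklib_command` and `run_training` are hypothetical, and the jar is assumed to sit in the current working directory.

```python
import subprocess


def ranklib_command(train_file, model_file, jar='RankLibPlus-0.1.0.jar', ranker=6):
    """Build the argument list for the RankLib training command."""
    return ['java', '-jar', jar,
            '-ranker', str(ranker),
            '-train', train_file,
            '-save', model_file]


def run_training(train_file, model_file):
    """Run RankLib; raises CalledProcessError on a non-zero exit code."""
    cmd = ranklib_command(train_file, model_file)
    print('Running ' + ' '.join(cmd))
    subprocess.check_call(cmd)
```

Passing the arguments as a list rather than a single shell string avoids quoting problems when file names contain spaces.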


The contents of the model file can be viewed at [model.txt](model.txt). We can then save the model to Elasticsearch as a script:

In [12]:
saveModel(es, scriptName='test', modelFname='model.txt')

## Scoring
Once the model script is uploaded, we can generate relevance scores by passing in an `ltr` rescore query.
The features for a new query are generated with the same query templates we used to build the training data.

The query body below rescores a search for all documents that contain the word 'rambo'.

In [20]:
query_body = {
    "query": {
        "match": {
            "_all": "rambo"
        }
    },
    "rescore": {
        "window_size": 20,
        "query": {
            "rescore_query": {
                "ltr": {
                    "model": {
                        "stored": "test"
                    },
                    "features": [
                        {
                            "match": {
                                "title": "rambo"
                            }
                        },
                        {
                            "multi_match": {
                                "query": "rambo",
                                "type": "cross_fields",
                                "tie_breaker": 1.0,
                                "fields": ["overview", "genres.name", "title",
                                           "tagline", "belongs_to_collection.name",
                                           "cast.name", "directors.name"]
                            }
                        }
                    ]
                }
            }
        }
    }
}


In [35]:
results = es.search(index='tmdb', doc_type='movie', body=query_body, size=10)
for result in results['hits']['hits']:
    print({'_id': result['_id'], 'title': result['_source']['title'], '_score': result['_score']})

{'_id': '7555', 'title': 'Rambo', '_score': 113.367805}
{'_id': '1368', 'title': 'First Blood', '_score': 62.41924}
{'_id': '1369', 'title': 'Rambo: First Blood Part II', '_score': 59.755802}
{'_id': '1370', 'title': 'Rambo III', '_score': 58.007786}
{'_id': '288183', 'title': 'A Hole in the Soul', '_score': 48.01506}
{'_id': '61410', 'title': 'Spud', '_score': 46.65277}
{'_id': '13258', 'title': 'Son of Rambow', '_score': 45.773308}
{'_id': '98136', 'title': 'Hit Lady', '_score': 45.773308}
{'_id': '31362', 'title': 'In the Line of Duty: The F.B.I. Murders', '_score': 45.773308}
{'_id': '123961', 'title': 'Which Way to the Front?', '_score': 45.773308}
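
The query body above hard-codes the keyword 'rambo' in three places. A small helper can build the same structure for any keywords. This is a minimal sketch that mirrors the body shown earlier; `ltr_rescore_body` is a hypothetical function, not part of the scripts folder.

```python
def ltr_rescore_body(keywords, model_name, window_size=20):
    """Build a search body that rescores the top hits with a stored ltr model."""
    fields = ["overview", "genres.name", "title", "tagline",
              "belongs_to_collection.name", "cast.name", "directors.name"]
    return {
        "query": {"match": {"_all": keywords}},
        "rescore": {
            "window_size": window_size,
            "query": {
                "rescore_query": {
                    "ltr": {
                        "model": {"stored": model_name},
                        # Feature order must match the order used in training
                        "features": [
                            {"match": {"title": keywords}},
                            {"multi_match": {
                                "query": keywords,
                                "type": "cross_fields",
                                "tie_breaker": 1.0,
                                "fields": fields,
                            }},
                        ],
                    }
                }
            },
        },
    }
```

With this in place, `es.search(index='tmdb', doc_type='movie', body=ltr_rescore_body('rocky', 'test'), size=10)` would rescore results for a different query using the same model.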
