# Basics & Prereqs (run once)

Run this once if you have not already downloaded the dependencies or indexed the TheMovieDB data.

In [1]:
from ltr import download, index
download.run(); index.run()

GET https://dl.bintray.com/o19s/RankyMcRankFace/com/o19s/RankyMcRankFace/0.1.1/RankyMcRankFace-0.1.1.jar
GET http://es-learn-to-rank.labs.o19s.com/tmdb.json
Done.


# Our Task: Optimizing "Drama" and "Science Fiction" queries

In this example we have two user queries

- Drama
- Science Fiction

And we want to train a model to return the best movies for these queries when a user types them into our search bar.

We learn through analysis that searchers prefer newer science fiction but older drama. Like a lot of search relevance problems, the two queries need to be optimized in *different* directions.

### Synthetic Judgment List Generation

To set up this example, we'll generate a judgment list that rewards newer science fiction movies and older drama movies as more relevant.

In [1]:
from ltr import date_genre_judgments
judgments = date_genre_judgments.buildJudgments(judgmentsFile='data/genre_by_date_judgments.txt')

Generating judgments for scifi & drama movies
Done


In [2]:
# Uncomment this line to see the judgments
# 
# for judgment in judgments:
#    print(judgment.toRanklibFormat())
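As a rough illustration of what "newer is better for science fiction, older is better for drama" could look like as a grading rule, here is a minimal sketch. The function name `grade_by_year`, the 0-4 grade scale, and the 1920-2020 year range are assumptions made for this example; they are not taken from `date_genre_judgments`.

```python
def grade_by_year(release_year, prefer_newer, oldest=1920, newest=2020, max_grade=4):
    """Map a release year to an integer grade in [0, max_grade] (illustrative rule)."""
    # Clamp the year into the assumed range, then scale it linearly to [0, 1]
    year = min(max(release_year, oldest), newest)
    fraction = (year - oldest) / (newest - oldest)
    # For "older is better" queries, flip the scale
    if not prefer_newer:
        fraction = 1.0 - fraction
    return round(fraction * max_grade)

# Science fiction: a recent movie gets a high grade
scifi_grade = grade_by_year(2017, prefer_newer=True)
# Drama: an old movie gets a high grade
drama_grade = grade_by_year(1925, prefer_newer=False)
print(scifi_grade, drama_grade)
```

The same year produces opposite grades depending on the query, which is exactly the tension the model has to learn.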

### Feature selection should be *easy!*

Notice we have 4 proposed features that seem like they should work! This should be a piece of cake...

1. Release Year of a movie `release_year` - feature ID 1
2. Is the movie Science Fiction `is_sci_fi` - feature ID 2
3. Is the movie Drama `is_drama` - feature ID 3
4. Does the search term match the genre field `is_genre_match` - feature ID 4


In [3]:
config = {"featureset": {
    "features": [
        {
            "name": "release_year",
            "params": [],
            "template": {
                "function_score": {
                    "field_value_factor": {
                        "field": "release_year",
                        "missing": 2000
                    },
                    "query": {"match_all": {}}
                }
            }
        },
        {
            "name": "is_sci_fi",
            "params": [],
            "template": {
                "constant_score": {
                    "filter": {
                        "match_phrase": {"genres": "Science Fiction"}
                    },
                    "boost": 10.0
                }
            }
        },
        {
            "name": "is_drama",
            "params": [],
            "template": {
                "constant_score": {
                    "filter": {
                        "match_phrase": {"genres": "Drama"}
                    },
                    "boost": 4.0
                }
            }
        },
        {
            "name": "is_genre_match",
            "params": ["keywords"],
            "template": {
                "constant_score": {
                    "filter": {
                        "match_phrase": {"genres": "{{keywords}}"}
                    },
                    "boost": 100.0
                }
            }
        }
    ]
}}

from ltr import setup_ltr
setup_ltr.run(config=config, featureSet='genre')

Removed LTR feature store: 200
Initialize LTR: 200
Created genre feature set: 201
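Only `is_genre_match` declares a `keywords` parameter; its template contains a `{{keywords}}` placeholder that is filled in with the user's search terms at query time. The substitution happens inside the search engine, so the sketch below is only a local stand-in to show what the rendered query looks like. The `render` helper is hypothetical and is not part of `ltr` or the plugin.

```python
import json

def render(template, params):
    """Recursively substitute {{name}} placeholders in string values (illustrative only)."""
    if isinstance(template, dict):
        return {key: render(value, params) for key, value in template.items()}
    if isinstance(template, list):
        return [render(item, params) for item in template]
    if isinstance(template, str):
        for name, value in params.items():
            template = template.replace("{{" + name + "}}", value)
        return template
    # Numbers, booleans, and None pass through unchanged
    return template

genre_match_template = {
    "constant_score": {
        "filter": {"match_phrase": {"genres": "{{keywords}}"}},
        "boost": 100.0,
    }
}

rendered = render(genre_match_template, {"keywords": "Science Fiction"})
print(json.dumps(rendered, indent=2))
```

The other three features have empty `params`, so their queries are identical no matter what the user typed.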


### Log from search engine -> to training set

Each feature is a query, and its score for every document in the judgment list is logged to build the training set.

In [6]:
from ltr import collectFeatures
trainingSet = collectFeatures.trainingSetFromJudgments(judgmentInFile='data/genre_by_date_judgments.txt', 
                                                       trainingOutFile='data/genre_by_date_judgments_train.txt', 
                                                       featureSet='genre')

Recognizing 2 queries...
REBUILDING TRAINING DATA for Science Fiction (0/2)
REBUILDING TRAINING DATA for Drama (1/2)
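The training file pairs each judgment with its logged feature values. A sketch of one such row in RankLib's text format (grade, query ID, `feature:value` pairs, then a comment) might look like the following. The helper name `to_ranklib_line`, the document ID, and the feature values are made up for illustration and are not read from the actual training file.

```python
def to_ranklib_line(grade, qid, features, doc_id):
    """Format one training row: '<grade> qid:<qid> 1:<v1> 2:<v2> ... # <doc_id>'."""
    # Feature IDs are 1-based and follow the order of the feature set
    pairs = " ".join("{}:{}".format(ftr_id, value)
                     for ftr_id, value in enumerate(features, start=1))
    return "{} qid:{} {} # {}".format(grade, qid, pairs, doc_id)

# Hypothetical values for release_year, is_sci_fi, is_drama, is_genre_match
line = to_ranklib_line(4, 1, [2016.0, 10.0, 0.0, 100.0], "doc_a")
print(line)
```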


### Training - Guaranteed Perfect Search Results!

We'll train a LambdaMART model against this training data.

In [14]:
from ltr import train
trainLog = train.run(trainingInFile='data/genre_by_date_judgments_train.txt',
                     metric2t='NDCG@10',
                     featureSet='genre',
                     modelName='genre')

print()
print("Impact of each feature on the model")
for ftrId, impact in trainLog.impacts.items():
    print("{} - {}".format(ftrId, impact))
    
print("Perfect NDCG! {}".format(trainLog.rounds[-1]))

Delete model genre: 404
Created model genre: 201

Impact of each feature on the model
1 - 95669916.43199906
2 - 3167416.4620914753
3 - 9.343045139389716
4 - 0.0
Perfect NDCG! 1.0
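For reference, here is a minimal sketch of how NDCG@k can be computed from the grades of a ranked list. It uses one common formulation (gain `2^grade - 1`, discount `log2(rank + 1)`); treat the exact gain and discount as an assumption rather than a statement about what the training tool does internally. Note that the metric is computed only over documents that have a grade.

```python
import math

def dcg_at_k(grades, k):
    """Discounted cumulative gain over the top k grades, in ranked order."""
    return sum((2 ** grade - 1) / math.log2(rank + 1)
               for rank, grade in enumerate(grades[:k], start=1))

def ndcg_at_k(grades, k):
    """DCG of the given ordering divided by the DCG of the ideal ordering."""
    ideal = dcg_at_k(sorted(grades, reverse=True), k)
    return dcg_at_k(grades, k) / ideal if ideal > 0 else 0.0

# Judged documents returned in the best possible order
perfect = ndcg_at_k([4, 3, 2, 0], k=10)
# The same documents returned in the worst possible order
reversed_order = ndcg_at_k([0, 2, 3, 4], k=10)
print(perfect, reversed_order)
```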


### But this search sucks!
Try searches for "Science Fiction" and "Drama".

In [16]:
from ltr import search
search.run(keywords="Science Fiction", modelName="genre")

{"size": 5, "query": {"sltr": {"params": {"keywords": "Science Fiction"}, "model": "genre"}}}
Rogue One: A Star Wars Story 
10.175102 
2016 
['Adventure', 'Science Fiction', 'Action'] 
A rogue band of resistance fighters unite for a mission to steal the Death Star plans and bring a new hope to the galaxy. 
---------------------------------------
Guardians of the Galaxy Vol. 2 
10.175102 
2017 
['Action', 'Adventure', 'Comedy', 'Science Fiction'] 
The Guardians must fight to keep their newfound family together as they unravel the mysteries of Peter Quill's true parentage. 
---------------------------------------
Wonder Woman 
10.175102 
2017 
['Action', 'Adventure', 'Fantasy', 'Science Fiction'] 
An Amazon princess comes to the world of Man to become the greatest of the female superheroes. 
---------------------------------------
Captain America: Civil War 
10.175102 
2016 
['Adventure', 'Action', 'Science Fiction'] 
Following the events of Age of Ultron, the collective governments of t
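The first line of the output above is the request body sent to the search engine: an `sltr` query that names the model and passes the keywords as a parameter. A sketch of assembling that body in Python follows; `build_sltr_query` is a hypothetical helper for illustration and is not the `ltr.search` implementation.

```python
import json

def build_sltr_query(keywords, model_name, size=5):
    """Build a request body that scores documents with a stored LTR model."""
    return {
        "size": size,
        "query": {
            "sltr": {
                "params": {"keywords": keywords},
                "model": model_name,
            }
        },
    }

body = build_sltr_query("Science Fiction", "genre")
print(json.dumps(body))
```

Because the query has no other clauses, every document in the index is scored by the model, including the many documents that never appeared in the judgment list.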

### Why didn't it work!?!? Training data

1. Examine the training data: do we cover every example of a BAD result?
2. Examine the feature impacts: do any of the features the model uses even USE the keywords?
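The second check can be done mechanically. The sketch below copies the `params` of each feature from the feature set and the impacts printed by the first training run, then lists which keyword-dependent features the model actually relied on. The variable names are chosen for this example only.

```python
# Feature ID -> params declared in the feature set
feature_params = {1: [], 2: [], 3: [], 4: ["keywords"]}

# Feature ID -> impact reported by the first training run
impacts = {
    1: 95669916.43199906,
    2: 3167416.4620914753,
    3: 9.343045139389716,
    4: 0.0,
}

# Features whose query depends on what the user typed
keyword_features = [ftr_id for ftr_id, params in feature_params.items()
                    if "keywords" in params]

# Of those, the ones with a nonzero impact on the model
used_keyword_features = [ftr_id for ftr_id in keyword_features
                         if impacts.get(ftr_id, 0.0) > 0.0]

print("Keyword features:", keyword_features)
print("Keyword features the model uses:", used_keyword_features)
```

An empty second list means the model ranks documents the same way regardless of the search terms.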

### Ranklib only sees the data you give it, and we don't have good enough coverage

You need to have feature coverage, especially over negative examples. Most documents in the index are negative! 

One commonly used trick is to treat other queries' positive results as this query's negative results. Indeed, what we're missing here are negative examples for "Science Fiction" that are not science fiction movies. That is a glaring omission, which we'll handle now. With the `autoNegate` flag, we'll add additional negative examples to the judgment list.

In [21]:
from ltr import date_genre_judgments, collectFeatures
date_genre_judgments.buildJudgments(judgmentsFile='data/genre_by_date_judgments.txt',
                                    autoNegate=True)

collectFeatures.trainingSetFromJudgments(judgmentInFile='data/genre_by_date_judgments.txt', 
                                         trainingOutFile='data/genre_by_date_judgments_train.txt', 
                                         featureSet='genre')

from ltr import train
trainLog = train.run(trainingInFile='data/genre_by_date_judgments_train.txt',
                     metric2t='NDCG@10',
                     featureSet='genre',
                     modelName='genre')

print()
print("Impact of each feature on the model")
for ftrId, impact in trainLog.impacts.items():
    print("{} - {}".format(ftrId, impact))
    
print("Perfect NDCG! {}".format(trainLog.rounds[-1]))

Generating judgments for scifi & drama movies
Done
Recognizing 2 queries...
REBUILDING TRAINING DATA for Science Fiction (0/2)
REBUILDING TRAINING DATA for Drama (1/2)
Delete model genre: 200
Created model genre: 201

Impact of each feature on the model
1 - 56711025.15383451
4 - 34243334.40204898
2 - 59104.665908504234
3 - 26274.36408814093
Perfect NDCG! 1.0
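The negative-sampling trick behind `autoNegate` can be sketched independently of the `ltr` helper. The code below is an assumption about the general idea, not the actual implementation: for each query, every document judged positive for another query is added with grade 0 unless the query already has a judgment for it. The document IDs are placeholders.

```python
def auto_negate(judgments_by_query):
    """Add other queries' positive documents as grade-0 judgments for each query."""
    # Copy so the input is left untouched
    augmented = {query: dict(judged) for query, judged in judgments_by_query.items()}
    for query, judged in judgments_by_query.items():
        for other_query, other_judged in judgments_by_query.items():
            if other_query == query:
                continue
            for doc_id, grade in other_judged.items():
                # Keep any grade the query already has for this document
                if grade > 0 and doc_id not in judged:
                    augmented[query][doc_id] = 0
    return augmented

# Hypothetical judgments: query -> {doc_id: grade}
original = {
    "Science Fiction": {"scifi_new": 4, "scifi_old": 1},
    "Drama": {"drama_old": 4, "drama_new": 1},
}
augmented = auto_negate(original)
print(augmented["Science Fiction"])
```

With these extra rows, a feature that ignores the keywords can no longer separate good results from bad ones, which is consistent with feature 4 gaining impact in the second training run.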


### Now try those queries...

Replace `keywords` below with 'science fiction' or 'drama' and see how it works.

In [26]:
from ltr import search
search.run(keywords="space", modelName="genre")

{"size": 5, "query": {"sltr": {"params": {"keywords": "space"}, "model": "genre"}}}
Battleship Potemkin 
-2.3706646 
1925 
['Drama', 'History'] 
A dramatized account of a great Russian naval mutiny and a resulting street demonstration which brought on a police massacre. The film had an incredible impact on the development of cinema and was a masterful example of montage editing. 
---------------------------------------
The Gold Rush 
-2.3706646 
1925 
['Adventure', 'Comedy', 'Drama', 'Family'] 
A lone prospector ventures into Alaska looking for gold. He gets mixed up with some burly characters and falls in love with the beautiful Georgia. He tries to win her heart with his singular charm. 
---------------------------------------
The General 
-2.3706646 
1926 
['Western', 'Adventure', 'Drama', 'Action', 'Comedy', 'War'] 
The General is a 1927 American silent film comedy from Buster Keaton. The film flopped when first released but is now regarded as the height of silent film comedy. The 

### The next problem

- The model is overfit to these two examples
- We need many more queries, covering more use cases