# Basics & Prereqs (run once)

If you don't already have the dependencies downloaded, or you don't have TheMovieDB data indexed, run this:

In [1]:
from ltr import download, index
download.run(); index.run()

GET https://dl.bintray.com/o19s/RankyMcRankFace/com/o19s/RankyMcRankFace/0.1.1/RankyMcRankFace-0.1.1.jar
GET http://es-learn-to-rank.labs.o19s.com/tmdb.json
Done.


## Switch to Solr
By default the examples run against Elasticsearch. The following snippet switches them to Solr.

In [1]:
from ltr import useSolr
useSolr()

Switched to Solr client_mode


# Our Task: Optimizing "Drama" and "Science Fiction" queries

In this example we have two user queries

- Drama
- Science Fiction

And we want to train a model to return the best movies for these queries when a user types them into our search bar.

We learn through analysis that searchers prefer newer science fiction but older drama. Like a lot of search relevance problems, the two queries need to be optimized in *different* directions.

### Synthetic Judgment List Generation

To set up this example, we'll generate a judgment list that grades new science fiction movies and old drama movies as more relevant.

In [2]:
from ltr import date_genre_judgments
judgments = date_genre_judgments.buildJudgments(judgmentsFile='data/genre_by_date_judgments.txt')

Generating judgments for scifi & drama movies
Done


In [2]:
# Uncomment this line to see the judgments
# 
# for judgment in judgments:
#    print(judgment.toRanklibFormat())
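To make the idea concrete, here is a minimal sketch of the kind of grading rule such a generator might apply. It is not the actual `date_genre_judgments` implementation: the function name `grade_for`, the year range, and the 0-4 grade scale are all assumptions chosen for illustration.

```python
def grade_for(genre, release_year, min_year=1920, max_year=2020, max_grade=4):
    """Hypothetical rule: newer sci-fi and older drama get higher grades."""
    # Clamp the year into the assumed range, then scale it to 0..1
    year = min(max(release_year, min_year), max_year)
    recency = (year - min_year) / (max_year - min_year)

    if genre == "Science Fiction":
        score = recency          # newer is better
    elif genre == "Drama":
        score = 1.0 - recency    # older is better
    else:
        raise ValueError("No grading rule for genre: {}".format(genre))

    return round(score * max_grade)


print(grade_for("Science Fiction", 2015))  # high grade
print(grade_for("Drama", 2015))            # low grade
```

The same release year pushes the grade in opposite directions depending on the query, which is exactly the tension the model has to learn.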

### Feature selection should be *easy!*

Notice we have 4 proposed features that seem like they should work! This should be a piece of cake...

1. Release Year of a movie `release_year` - feature ID 1
2. Is the movie Science Fiction `is_sci_fi` - feature ID 2
3. Is the movie Drama `is_drama` - feature ID 3
4. Does the search term match the genre field `is_genre_match` - feature ID 4


In [3]:
config = [
            {
                "store": "genre", # Note: This overrides the _DEFAULT_ feature store location
                "name" : "release_year",
                "class" : "org.apache.solr.ltr.feature.SolrFeature",
                "params" : {
                  "q" : "{!func}def(release_year,2000)"
                }
            },
            {
                "store": "genre",
                "name" : "is_sci_fi",
                "class" : "org.apache.solr.ltr.feature.SolrFeature",
                "params" : {
                  "q" : "genres:\"Science Fiction\"^=10.0"
                }
            },
            {
                "store": "genre",
                "name" : "is_drama",
                "class" : "org.apache.solr.ltr.feature.SolrFeature",
                "params" : {
                  "q" : "genres:\"Drama\"^=4.0"
                }
            },
            {
                "store": "genre",
                "name" : "is_genre_match",
                "class" : "org.apache.solr.ltr.feature.SolrFeature",
                "params" : {
                  "q" : "genres:\"${keywords}\"^=100.0"
                }
            }
]


from ltr import setup_ltr
setup_ltr.run(config=config, featureset='genre')

Deleted classic model: 200
Deleted genre model: 200
Deleted latest model: 200
Delete _DEFAULT feature store: 200
Delete genre feature store: 200
Delete release feature store: 200
Created genre feature store under tmdb: 200


### Log from search engine -> to training set

Each feature is a query to be scored against the documents in the judgment list.

In [4]:
from ltr import collectFeatures
trainingSet = collectFeatures.trainingSetFromJudgments(judgmentInFile='data/genre_by_date_judgments.txt', 
                                                       trainingOutFile='data/genre_by_date_judgments_train.txt', 
                                                       featureSet='genre')

Recognizing 2 queries...
REBUILDING TRAINING DATA for Science Fiction (0/2)
REBUILDING TRAINING DATA for Drama (1/2)
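The training file pairs each judgment with the logged feature values, one line per query-document pair. The sketch below formats one such line in the common RankLib/SVMrank text layout (`grade qid:N 1:v1 2:v2 ... # comment`). The helper name `to_ranklib_line`, the document ID, and the feature values are made up for illustration; the exact output of `toRanklibFormat()` in this repo may differ.

```python
def to_ranklib_line(grade, qid, features, doc_id):
    """Format one training example; feature IDs are 1-based ordinals."""
    feature_str = " ".join(
        "{}:{}".format(ftr_id, value)
        for ftr_id, value in enumerate(features, start=1)
    )
    return "{} qid:{} {} # {}".format(grade, qid, feature_str, doc_id)


# Hypothetical sci-fi movie from 2014 judged for the "Science Fiction" query:
# release_year=2014, is_sci_fi=10.0, is_drama=0.0, is_genre_match=100.0
line = to_ranklib_line(grade=4, qid=1,
                       features=[2014.0, 10.0, 0.0, 100.0],
                       doc_id="doc_123")
print(line)  # 4 qid:1 1:2014.0 2:10.0 3:0.0 4:100.0 # doc_123
```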


### Training - Guaranteed Perfect Search Results!

We'll train a LambdaMART model against this training data.

In [5]:
from ltr import train
trainLog = train.run(trainingInFile='data/genre_by_date_judgments_train.txt',
                     metric2t='NDCG@10',
                     featureSet='genre',
                     modelName='genre')

print()
print("Impact of each feature on the model")
for ftrId, impact in trainLog.impacts.items():
    print("{} - {}".format(ftrId, impact))
    
print("Perfect NDCG! {}".format(trainLog.rounds[-1]))

Running java -jar data/RankyMcRankFace.jar -ranker 6 -metric2t NDCG@10 -tree 100 -train data/genre_by_date_judgments_train.txt -save data/genre_model.txt
DONE
{
  "responseHeader":{
    "status":0,
    "QTime":22}}

PUT genre model under genre: 200

Impact of each feature on the model
2 - 69637125.91395159
1 - 4343523.117296852
3 - 243.4263191311367
4 - 0.0
Perfect NDCG! 1.0
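For reference, here is a minimal sketch of how NDCG@10 can be computed over a ranked list of grades. It assumes the common exponential gain (`2^grade - 1`) and log2 position discount; RankLib's exact implementation may differ in details. Note that the metric is computed only over the documents present in the training file.

```python
import math


def dcg_at_k(grades, k):
    """Discounted cumulative gain over the top k grades, in ranked order."""
    return sum((2 ** grade - 1) / math.log2(rank + 2)
               for rank, grade in enumerate(grades[:k]))


def ndcg_at_k(grades, k=10):
    """DCG normalized by the DCG of the ideal (descending) ordering."""
    ideal = dcg_at_k(sorted(grades, reverse=True), k)
    return dcg_at_k(grades, k) / ideal if ideal > 0 else 0.0


print(ndcg_at_k([4, 3, 2, 0]))  # judged docs in ideal order
print(ndcg_at_k([0, 2, 3, 4]))  # same docs, worst order
```

An NDCG of 1.0 means the judged documents for each query came back in ideal order. It says nothing about documents that never appeared in the judgment list.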


### But this search sucks!
Try searches for "Science Fiction" and "Drama"

In [2]:
from ltr import search
search.run(keywords="Science Fiction", modelName="genre")

['The General'] 
6.6352477 
1926 
['Western', 'Adventure', 'Drama', 'Action', 'Comedy', 'War'] 
['The General is a 1927 American silent film comedy from Buster Keaton. The film flopped when first released but is now regarded as the height of silent film comedy. The film is based on events from America’s civil war.'] 
---------------------------------------
['The Kid'] 
3.1759663 
1921 
['Comedy', 'Drama'] 
["Considered one of Charlie Chaplin's best films, The Kid also made a star of little Jackie Coogan, who plays a boy cared for by The Tramp when he's abandoned by his mother, Edna. Later, Edna has a change of heart and aches to be reunited with her son. When she finds him and wrests him from The Tramp, it makes for what turns out be one of the most heart-wrenching scenes ever included in a comedy."] 
---------------------------------------
['Sherlock, Jr.'] 
2.7476196 
1924 
['Fantasy', 'Drama', 'Comedy', 'Mystery'] 
["A film projectionist longs to be a detective, and puts his meagre 

### Why didn't it work!?!? Training data

1. Examine the training data: do we cover every example of a BAD result?
2. Examine the feature impacts: do any of the features the model uses even USE the keywords?
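The second check can be done directly on the numbers printed by the first training run. In the sketch below, the impacts are copied from that output, and `keyword_features` is filled in by hand from the feature config, where only `is_genre_match` (feature 4) references `${keywords}`. The variable names are illustrative, not part of the `ltr` helpers.

```python
# Feature impacts copied from the first training run's output
impacts = {
    2: 69637125.91395159,   # is_sci_fi
    1: 4343523.117296852,   # release_year
    3: 243.4263191311367,   # is_drama
    4: 0.0,                 # is_genre_match
}

# Features whose query depends on the user's keywords (from the config)
keyword_features = {4}

# Features the model actually relies on
used = {ftr_id for ftr_id, impact in impacts.items() if impact > 0}

keyword_features_used = used & keyword_features
print(keyword_features_used)  # set()
```

The empty set means the model ignores what the user typed, so "Science Fiction" and "Drama" get ranked by the same document-only signals.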

### RankLib only sees the data you give it; we don't have good enough coverage

You need to have feature coverage, especially over negative examples. Most documents in the index are negative! 

One commonly used trick is to treat other queries' positive results as this query's negative results. Indeed, what we're missing here are negative examples for "Science Fiction" that are not science fiction movies. That's a glaring omission, which we'll handle now. With the `autoNegate` flag, we'll add additional negative examples to the judgment list.

In [2]:
from ltr import date_genre_judgments, collectFeatures
date_genre_judgments.buildJudgments(judgmentsFile='data/genre_by_date_judgments.txt',
                                    autoNegate=True)

collectFeatures.trainingSetFromJudgments(judgmentInFile='data/genre_by_date_judgments.txt', 
                                         trainingOutFile='data/genre_by_date_judgments_train.txt', 
                                         featureSet='genre')

from ltr import train
trainLog = train.run(trainingInFile='data/genre_by_date_judgments_train.txt',
                     metric2t='NDCG@10',
                     featureSet='genre',
                     modelName='genre')

print()
print("Impact of each feature on the model")
for ftrId, impact in trainLog.impacts.items():
    print("{} - {}".format(ftrId, impact))
    
print("Perfect NDCG! {}".format(trainLog.rounds[-1]))

Generating judgments for scifi & drama movies
Done
Recognizing 2 queries...
REBUILDING TRAINING DATA for Science Fiction (0/2)
REBUILDING TRAINING DATA for Drama (1/2)
Running java -jar data/RankyMcRankFace.jar -ranker 6 -metric2t NDCG@10 -tree 100 -train data/genre_by_date_judgments_train.txt -save data/genre_model.txt
DONE
Deleted genre model [200]
PUT genre model under genre: 200

Impact of each feature on the model
4 - 561061180.8468233
1 - 1707216.1348845295
2 - 820.3178113177884
3 - 44.640347891206886
Perfect NDCG! 1.0
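The auto-negation trick can be sketched as follows. This is an illustration of the idea, not the code behind the `autoNegate` flag: the `auto_negate` function, the dict-of-dicts judgment layout, and the document IDs are all hypothetical.

```python
def auto_negate(judgments):
    """Add other queries' positive docs as grade-0 negatives for each query.

    judgments: dict mapping query -> {doc_id: grade}
    Returns a new dict; existing judgments are never overwritten.
    """
    negated = {query: dict(docs) for query, docs in judgments.items()}
    for query, docs in judgments.items():
        for other_query, other_docs in judgments.items():
            if other_query == query:
                continue
            for doc_id, grade in other_docs.items():
                # Only borrow positives, and keep any grade we already have
                if grade > 0 and doc_id not in docs:
                    negated[query][doc_id] = 0
    return negated


judgments = {
    "Science Fiction": {"scifi_new": 4, "scifi_old": 0},
    "Drama": {"drama_old": 4, "drama_new": 0},
}
result = auto_negate(judgments)
print(result["Science Fiction"])
# {'scifi_new': 4, 'scifi_old': 0, 'drama_old': 0}
```

With these extra rows, the only feature that separates a good "Science Fiction" result from a good "Drama" result is one that looks at the keywords, which is consistent with feature 4 dominating the impacts after retraining.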


### Now try those queries...

Replace `keywords` below with 'science fiction' or 'drama' and see how it works.

In [5]:
from ltr import search
search.run(keywords="Science Fiction", modelName="genre")

['Guardians of the Galaxy'] 
4.020194 
2014 
['Action', 'Science Fiction', 'Adventure'] 
['Light years from Earth, 26 years after being abducted, Peter Quill finds himself the prime target of a manhunt after discovering an orb wanted by Ronan the Accuser.'] 
---------------------------------------
['World of Tomorrow'] 
3.689034 
2015 
['Animation', 'Comedy', 'Science Fiction'] 
['A little girl is contacted by a mysterious woman.'] 
---------------------------------------
['Doctor Who: The Day of the Doctor'] 
3.5284767 
2013 
['Science Fiction', 'Adventure'] 
["In 2013, something terrible is awakening in London's National Gallery; in 1562, a murderous plot is afoot in Elizabethan England; and somewhere in space an ancient battle reaches its devastating conclusion. All of reality is at stake as the Doctor's own dangerous past comes back to haunt him."] 
---------------------------------------
['Okja'] 
-3.8640637 
2017 
['Action', 'Science Fiction', 'Drama', 'Adventure'] 
['A young gir

### The next problem

- Overfit to these two examples
- We need many more queries, covering more use cases