# Basics & Prereqs (run once)

Run this cell if you haven't already downloaded the dependencies or indexed the TheMovieDB data.

In [2]:
from ltr.client.solr_client import SolrClient
client = SolrClient()

from ltr import download, index
download.download()
index.rebuild_tmdb(client)

GET http://es-learn-to-rank.labs.o19s.com/tmdb.json
GET http://es-learn-to-rank.labs.o19s.com/RankyMcRankFace.jar
GET http://es-learn-to-rank.labs.o19s.com/title_judgments.txt
Done.
RUNNING INDEXING...
DO IT!
Deleted index tmdb [Status: 200]
Created index tmdb [Status: 200]
Indexing 27846 documents
Flushing 500 movies
Done [Status: 200]
Flushing 500 movies
Done [Status: 200]
Flushing 500 movies
Done [Status: 200]
Flushing 500 movies
Done [Status: 200]
Flushing 500 movies
Done [Status: 200]
Flushing 500 movies
Done [Status: 200]
Flushing 500 movies
Done [Status: 200]
Flushing 500 movies
Done [Status: 200]
Flushing 500 movies
Done [Status: 200]
Flushing 500 movies
Done [Status: 200]
Flushing 500 movies
Done [Status: 200]
Flushing 500 movies
Done [Status: 200]
Flushing 500 movies
Done [Status: 200]
Skipping 67456
Skipping 67479
Skipping 133252
Flushing 500 movies
Done [Status: 200]
Skipping 211779
Skipping 200039
Skipping 69372
Skipping 69487
Skipping 164721
Flushing 500 movies
Done [Stat

### Use Solr Client

In [1]:
from ltr.client.solr_client import SolrClient
client = SolrClient()

# Our Task: Optimizing "Drama" and "Science Fiction" queries

In this example we have two user queries

- Drama
- Science Fiction

And we want to train a model to return the best movies for these queries when a user types them into our search bar.

We learn through analysis that searchers prefer newer science fiction, but older drama. As with a lot of search relevance problems, the two queries need to be optimized in *different* directions.

### Synthetic Judgment List Generation

To set up this example, we'll generate a judgment list that grades new science fiction movies as more relevant, and old drama movies as more relevant.

In [4]:
from ltr.date_genre_judgments import synthesize
judgments = synthesize(client, judgmentsOutFile='data/genre_by_date_judgments.txt')

Generating judgments for scifi & drama movies
Query {'q': '*:*... [Status: 200]
Done


In [2]:
# Uncomment this line to see the judgments
# 
# for judgment in judgments:
#    print(judgment.toRanklibFormat())
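
As a rough illustration of the grading idea described above, here is a minimal sketch that maps a release year to a relevance grade. The function `grade_by_year`, its year range, and the 0-4 grade scale are assumptions made for illustration; this is not the logic inside `ltr.date_genre_judgments`.

```python
def grade_by_year(genre, release_year, min_year=1900, max_year=2020, max_grade=4):
    """Hypothetical rule: newer is better for sci-fi, older is better for drama."""
    # Clamp the year into the assumed range so the score stays in [0, 1]
    year = min(max(release_year, min_year), max_year)
    newness = (year - min_year) / (max_year - min_year)  # 0.0 = oldest, 1.0 = newest
    score = newness if genre == "Science Fiction" else 1.0 - newness
    return round(score * max_grade)


for genre, year in [("Science Fiction", 2015), ("Drama", 1920), ("Drama", 2015)]:
    print(genre, year, grade_by_year(genre, year))
```

Note that the same release year pushes the grade in opposite directions depending on the genre, which is exactly what makes the two queries pull against each other.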

### Feature selection should be *easy!*

Notice we have 4 proposed features that seem like they should work! This should be a piece of cake...

1. Release Year of a movie `release_year` - feature ID 1
2. Is the movie Science Fiction `is_sci_fi` - feature ID 2
3. Is the movie Drama `is_drama` - feature ID 3
4. Does the search term match the genre field `is_genre_match` - feature ID 4


In [14]:
config = [
            {
                "store": "genre", # Note: This overrides the _DEFAULT_ feature store location
                "name" : "release_year",
                "class" : "org.apache.solr.ltr.feature.SolrFeature",
                "params" : {
                  "q" : "{!func}def(release_year,2000)"
                }
            },
            {
                "store": "genre",
                "name" : "is_sci_fi",
                "class" : "org.apache.solr.ltr.feature.SolrFeature",
                "params" : {
                  "q" : "genres:\"Science Fiction\"^=10.0"
                }
            },
            {
                "store": "genre",
                "name" : "is_drama",
                "class" : "org.apache.solr.ltr.feature.SolrFeature",
                "params" : {
                  "q" : "genres:\"Drama\"^=4.0"
                }
            },
            {
                "store": "genre",
                "name" : "is_genre_match",
                "class" : "org.apache.solr.ltr.feature.SolrFeature",
                "params" : {
                  "q" : "genres:\"${keywords}\"^=100.0"
                }
            }
]


from ltr.setup import setup
setup(client, config=config, featureset='genre')

Deleted classic model [Status: 200]
Deleted genre model [Status: 200]
Deleted latest model [Status: 200]
Deleted title model [Status: 200]
Deleted title_fuzzy model [Status: 200]
Deleted _DEFAULT Featurestore [Status: 200]
Deleted genre Featurestore [Status: 200]
Deleted release Featurestore [Status: 200]
Deleted title Featurestore [Status: 200]
Deleted title_fuzzy Featurestore [Status: 200]
Created genre feature store under tmdb: [Status: 200]
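
To make the four feature queries concrete, the sketch below approximates the value each one would produce for a single document. It assumes `def(release_year,2000)` falls back to 2000 when the field is missing and that `^=N` assigns a constant score of N on a match; genre matching is simplified to exact string membership, whereas Solr's real matching depends on how the `genres` field is analyzed. The helper `feature_values` is hypothetical.

```python
def feature_values(doc, keywords):
    """Approximate the four 'genre' feature values for one document (simplified)."""
    genres = doc.get("genres", [])
    return {
        # {!func}def(release_year,2000): field value, or 2000 if absent
        "release_year": float(doc.get("release_year", 2000)),
        # genres:"Science Fiction"^=10.0: constant score on match, else 0
        "is_sci_fi": 10.0 if "Science Fiction" in genres else 0.0,
        # genres:"Drama"^=4.0
        "is_drama": 4.0 if "Drama" in genres else 0.0,
        # genres:"${keywords}"^=100.0: the only feature that depends on the query
        "is_genre_match": 100.0 if keywords in genres else 0.0,
    }


doc = {"title": "Within Our Gates", "release_year": 1920, "genres": ["Drama", "Romance"]}
print(feature_values(doc, keywords="Drama"))
print(feature_values(doc, keywords="Science Fiction"))
```

Only `is_genre_match` changes between the two calls; the other three features are properties of the document alone.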


### Log from search engine -> to training set

Each feature is a query to be scored against every document in the judgment list.

In [6]:
from ltr.collect_features import judgments_to_training_set
trainingSet = judgments_to_training_set(client,
                                        judgmentInFile='data/genre_by_date_judgments.txt', 
                                        trainingOutFile='data/genre_by_date_judgments_train.txt', 
                                        featureSet='genre')

Recognizing 2 queries...
Searching tmdb [Status: 200]
REBUILDING TRAINING DATA for Science Fiction (0/2)
Searching tmdb [Status: 200]
Searching tmdb [Status: 200]
Searching tmdb [Status: 200]
Searching tmdb [Status: 200]
Searching tmdb [Status: 200]
Searching tmdb [Status: 200]
REBUILDING TRAINING DATA for Drama (1/2)
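
The training file written above uses the RankLib text format: a grade, a query id, then `featureId:value` pairs, with an optional comment after `#`. The sketch below builds one such line. The helper `to_ranklib_line`, the document id, and the feature values are made up for illustration and may not match what `judgments_to_training_set` writes byte for byte.

```python
def to_ranklib_line(grade, qid, features, doc_id, keywords):
    """Format one training row as: <grade> qid:<qid> 1:<v1> 2:<v2> ... # <comment>"""
    # Feature ids are 1-based and follow the order of the feature store
    feats = " ".join("{}:{}".format(i, v) for i, v in enumerate(features, start=1))
    return "{} qid:{} {} # {} {}".format(grade, qid, feats, doc_id, keywords)


# Hypothetical row: a recent sci-fi movie judged for the "Science Fiction" query
print(to_ranklib_line(4, 1, [2015.0, 10.0, 0.0, 100.0], "12345", "Science Fiction"))
```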


### Training - Guaranteed Perfect Search Results!

We'll train a LambdaMART model against this training data.

In [2]:
from ltr.train import train
trainLog = train(client, 
                 trainingInFile='data/genre_by_date_judgments_train.txt',
                 metric2t='NDCG@10',
                 featureSet='genre',
                 modelName='genre')

print()
print("Impact of each feature on the model")
for ftrId, impact in trainLog.impacts.items():
    print("{} - {}".format(ftrId, impact))
    
print("Perfect NDCG! {}".format(trainLog.rounds[-1]))

Running java -jar data/RankyMcRankFace.jar -ranker 6 -metric2t NDCG@10 -tree 100 -train data/genre_by_date_judgments_train.txt -save data/genre_model.txt
DONE
Submit Model genre Ftr Set genre [Status: 200]
Deleted Model genre [Status: 200]
Created Model genre [Status: 200]

Impact of each feature on the model
4 - 178579056.50291184
1 - 107974596.42237584
3 - 14215772.384097328
2 - 7654167.426058862
Perfect NDCG! 1.0
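
For reference, here is a minimal sketch of how NDCG@10 can be computed from a ranked list of grades. It uses the common `(2^grade - 1) / log2(rank + 1)` gain and discount; treat the exact formulation as an assumption rather than a statement of what the training jar computes internally.

```python
import math


def dcg_at_k(grades, k=10):
    """Discounted cumulative gain over the top k grades, in ranked order."""
    return sum((2 ** g - 1) / math.log2(i + 2) for i, g in enumerate(grades[:k]))


def ndcg_at_k(grades, k=10):
    """DCG normalized by the DCG of the ideal (descending) ordering."""
    ideal = dcg_at_k(sorted(grades, reverse=True), k)
    return dcg_at_k(grades, k) / ideal if ideal > 0 else 0.0


print(ndcg_at_k([4, 3, 2, 0]))  # judged docs already in ideal order
print(ndcg_at_k([0, 2, 3, 4]))  # same docs, worst order
```

The metric is computed only over the documents that appear in the training data, so a score of 1.0 says nothing about how unjudged documents will rank.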


### But this search sucks!
Try searches for "Science Fiction" and "Drama"

In [8]:
from ltr.search import search
search(client, keywords="Science Fiction", modelName="genre")

Query {'fl': '*,... [Status: 200]
['A Man There Was'] 
8.84513 
1917 
['Drama'] 
["Terje Vigen, a sailor, suffers the loss of his family through the cruelty of another man. Years later, when his enemy's family finds itself dependent on Terje's benevolence, Terje must decide whether to avenge himself."] 
---------------------------------------
['The Immigrant'] 
8.84513 
1917 
['Comedy', 'Drama'] 
['Charlie is an immigrant who endures a challenging voyage and gets into trouble as soon as he arrives in America.'] 
---------------------------------------
['The Cabinet of Dr. Caligari'] 
8.844578 
1920 
['Drama', 'Horror', 'Thriller', 'Crime'] 
['The Cabinet of Dr. Caligari is eerie and expressionistic, silent and surreal. It has become not only a classic of German Expressionist cinema, but a landmark in film history, with creative scenery and an unusual ending.'] 
---------------------------------------
['Within Our Gates'] 
8.844578 
1920 
['Drama', 'Romance'] 
['Abandoned by her fiancé,

### Why didn't it work!?!? Training data

1. Examine the training data: do we cover every example of a BAD result?
2. Examine the feature impacts: do any of the features the model uses even USE the keywords?
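
The second check can be done mechanically. This sketch restates the four feature queries and the impact numbers printed by the training cell, then works out which features reference `${keywords}` and what share of the total impact they carry. The variable names are illustrative only.

```python
# Feature id -> (name, Solr query), copied from the feature config above
feature_queries = {
    1: ("release_year", "{!func}def(release_year,2000)"),
    2: ("is_sci_fi", 'genres:"Science Fiction"^=10.0'),
    3: ("is_drama", 'genres:"Drama"^=4.0'),
    4: ("is_genre_match", 'genres:"${keywords}"^=100.0'),
}

# Feature id -> impact, copied from the training output above
impacts = {
    4: 178579056.50291184,
    1: 107974596.42237584,
    3: 14215772.384097328,
    2: 7654167.426058862,
}

# A feature can only react to the user's query if its template uses ${keywords}
keyword_features = [fid for fid, (_, q) in feature_queries.items() if "${keywords}" in q]
keyword_share = sum(impacts[f] for f in keyword_features) / sum(impacts.values())

for fid in keyword_features:
    print("Feature {} ({}) uses the keywords".format(fid, feature_queries[fid][0]))
print("Share of total impact from keyword features: {:.2f}".format(keyword_share))
```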

### RankLib only sees the data you give it, and we don't have good enough coverage

You need to have feature coverage, especially over negative examples. Most documents in the index are negative! 

One commonly used trick is to treat other queries' positive results as this query's negative results. Indeed, what we're missing here are negative examples for "Science Fiction" that are not science fiction movies. That's a glaring omission, which we'll handle now. With the `autoNegate` flag, we'll add additional negative examples to the judgment list.

In [11]:
from ltr import date_genre_judgments
date_genre_judgments.buildJudgments(client,
                                    judgmentsFile='data/genre_by_date_judgments.txt',
                                    autoNegate=True)

judgments_to_training_set(client,
                          judgmentInFile='data/genre_by_date_judgments.txt', 
                          trainingOutFile='data/genre_by_date_judgments_train.txt', 
                          featureSet='genre')

from ltr.train import train
trainLog = train(client,
                 trainingInFile='data/genre_by_date_judgments_train.txt',
                 metric2t='NDCG@10',
                 featureSet='genre',
                 modelName='genre')

print()
print("Impact of each feature on the model")
for ftrId, impact in trainLog.impacts.items():
    print("{} - {}".format(ftrId, impact))
    
print("Perfect NDCG! {}".format(trainLog.rounds[-1]))

Generating judgments for scifi & drama movies
Query {'q': '*:*... [Status: 200]
Done
Recognizing 2 queries...
Searching tmdb [Status: 200]
Searching tmdb [Status: 200]
Searching tmdb [Status: 200]
Searching tmdb [Status: 200]
Searching tmdb [Status: 200]
Searching tmdb [Status: 200]
REBUILDING TRAINING DATA for Science Fiction (0/2)
Searching tmdb [Status: 200]
Searching tmdb [Status: 200]
Searching tmdb [Status: 200]
Searching tmdb [Status: 200]
Searching tmdb [Status: 200]
Searching tmdb [Status: 200]
REBUILDING TRAINING DATA for Drama (1/2)
Running java -jar data/RankyMcRankFace.jar -ranker 6 -metric2t NDCG@10 -tree 100 -train data/genre_by_date_judgments_train.txt -save data/genre_model.txt
DONE
Submit Model genre Ftr Set genre [Status: 200]
Deleted Model genre [Status: 200]
Created Model genre [Status: 200]

Impact of each feature on the model
4 - 178579056.50291184
1 - 107974596.42237584
3 - 14215772.384097328
2 - 7654167.426058862
Perfect NDCG! 1.0
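
The auto-negation idea can be sketched in a few lines: for each query, any document judged positive for another query and not yet judged for this one is added with grade 0. The function `auto_negate` and the dictionary layout are assumptions for illustration, not the implementation behind the `autoNegate` flag.

```python
def auto_negate(judgments):
    """judgments: dict of query -> {doc_id: grade}. Returns a copy with added negatives."""
    negated = {query: dict(docs) for query, docs in judgments.items()}
    for query, docs in judgments.items():
        for other_query, other_docs in judgments.items():
            if other_query == query:
                continue
            for doc_id, grade in other_docs.items():
                # Another query's positive becomes this query's negative,
                # unless this query already has its own grade for the doc
                if grade > 0 and doc_id not in docs:
                    negated[query].setdefault(doc_id, 0)
    return negated


judgments = {
    "Science Fiction": {"a": 4, "b": 1},
    "Drama": {"c": 3, "a": 0},
}
print(auto_negate(judgments))
```

Existing grades are never overwritten, so a movie explicitly judged for both queries keeps both judgments.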


### Now try those queries...

Replace `keywords` below with "Science Fiction" or "Drama" and see how it works.

In [3]:
from ltr.search import search
search(client, keywords="Drama", modelName="genre")

Query {'fl': '*,... [Status: 200]
['A Man There Was'] 
3.9405792 
1917 
['Drama'] 
["Terje Vigen, a sailor, suffers the loss of his family through the cruelty of another man. Years later, when his enemy's family finds itself dependent on Terje's benevolence, Terje must decide whether to avenge himself."] 
---------------------------------------
['The Immigrant'] 
3.9405792 
1917 
['Comedy', 'Drama'] 
['Charlie is an immigrant who endures a challenging voyage and gets into trouble as soon as he arrives in America.'] 
---------------------------------------
['Blacksmith Scene'] 
3.939317 
1893 
['Drama'] 
['Three men hammer on an anvil and pass a bottle of beer around. Notable for being the first film in which a scene is being acted out.'] 
---------------------------------------
["Tillie's Punctured Romance"] 
3.939317 
1914 
['Comedy', 'Drama', 'Romance'] 
["Chaplin plays a womanizing city man who meets Tillie (Dressler) in the country after a fight with his girlfriend (Normand). When 

### The next problem

- Overfit to these two examples
- We need many more queries, covering more use cases