## Loading Elasticsearch

In this section, Elasticsearch is installed and started on the Colab machine. The scripts that make it run on Colab were provided by GitHub user [korakot](https://gist.github.com/korakot/15fe4f18d0e0f53d7b834ef797880500).

In [1]:
!wget https://artifacts.elastic.co/downloads/elasticsearch/elasticsearch-7.0.0-linux-x86_64.tar.gz -q # it's an older version of Elasticsearch, latest release being 7.13.4, couldn't make it run with that
!tar -xzf elasticsearch-7.0.0-linux-x86_64.tar.gz
!chown -R daemon:daemon elasticsearch-7.0.0
!pip install elasticsearch -q


In [2]:
import os
from subprocess import Popen, PIPE, STDOUT

This cell starts Elasticsearch as a background process on the hosted runtime machine.

In [3]:
es_server = Popen(['elasticsearch-7.0.0/bin/elasticsearch'], 
                  stdout=PIPE, stderr=STDOUT,
                  preexec_fn=lambda: os.setuid(1)  # as daemon
                 )


The next command uses [cURL](https://curl.se/), a command-line tool for sending HTTP requests, to talk to Elasticsearch's REST API. Search requests are normally written in Query DSL (Elasticsearch's JSON-based query language), but this request carries no query: it only checks whether Elasticsearch is running locally (on the Google-hosted machine in this case) by sending a GET request to its default port (`localhost:9200`).

In [4]:
!curl -X GET "localhost:9200/" 

curl: (7) Failed to connect to localhost port 9200: Connection refused


In [5]:
from elasticsearch import Elasticsearch

In [6]:
es = Elasticsearch()
es.ping() # another way to test whether ES is running, returns True if so

False
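The two checks above can fail simply because they run before the server has finished starting. A minimal sketch of a wait loop is shown below, using only the standard library; the helper name `wait_for_port` and its default timeout are assumptions for illustration, not part of the original notebook.

```python
import socket
import time

def wait_for_port(host="localhost", port=9200, timeout=60.0, interval=1.0):
    """Return True once a TCP connection to host:port succeeds, False on timeout."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            # A successful connection means something is listening on the port
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            # Connection refused or timed out: wait a little and try again
            time.sleep(interval)
    return False
```

Calling `wait_for_port()` right after the `Popen` cell would block until the port accepts connections. An open port does not guarantee that the cluster is fully ready, so `es.ping()` remains a useful second check.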

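Once `curl` returns the cluster information and `es.ping()` returns `True`, Elasticsearch is running. The outputs recorded above show a refused connection and `False`, most likely because the cells were executed before the server had finished starting; re-running them after a few seconds should succeed. Before creating an index, it is worth preprocessing the data.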

## Preprocessing

In [7]:
import pandas as pd
import numpy as np
import nltk
import re
from nltk.stem import SnowballStemmer
from nltk.stem import WordNetLemmatizer
from nltk.corpus import stopwords
nltk.download('wordnet')
nltk.download('stopwords')

[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Unzipping corpora/wordnet.zip.
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


True

The next cell loads the dataset, which is available on [Kaggle](https://www.kaggle.com/ashishgup/netflix-rotten-tomatoes-metacritic-imdb). It is currently uploaded to the local runtime by hand, because downloading it from code requires a personal Kaggle API token.

In [8]:
films = pd.read_csv("netflix-rotten-tomatoes-metacritic-imdb.csv")
films.head()

Unnamed: 0,Title,Genre,Tags,Languages,Series or Movie,Hidden Gem Score,Country Availability,Runtime,Director,Writer,Actors,View Rating,IMDb Score,Rotten Tomatoes Score,Metacritic Score,Awards Received,Awards Nominated For,Boxoffice,Release Date,Netflix Release Date,Production House,Netflix Link,IMDb Link,Summary,IMDb Votes,Image,Poster,TMDb Trailer,Trailer Site
0,Lets Fight Ghost,"Crime, Drama, Fantasy, Horror, Romance","Comedy Programmes,Romantic TV Comedies,Horror ...","Swedish, Spanish",Series,4.3,Thailand,< 30 minutes,Tomas Alfredson,John Ajvide Lindqvist,"Kåre Hedebrant, Per Ragnar, Lina Leandersson, ...",R,7.9,98.0,82.0,74.0,57.0,"$2,122,065",12 Dec 2008,2021-03-04,"Canal+, Sandrew Metronome",https://www.netflix.com/watch/81415947,https://www.imdb.com/title/tt1139797,A med student with a supernatural gift tries t...,205926.0,https://occ-0-4708-64.1.nflxso.net/dnm/api/v6/...,https://m.media-amazon.com/images/M/MV5BOWM4NT...,,
1,HOW TO BUILD A GIRL,Comedy,"Dramas,Comedies,Films Based on Books,British",English,Movie,7.0,Canada,1-2 hour,Coky Giedroyc,Caitlin Moran,"Paddy Considine, Cleo, Beanie Feldstein, Dónal...",R,5.8,79.0,69.0,1.0,,"$70,632",08 May 2020,2021-03-04,"Film 4, Monumental Pictures, Lionsgate",https://www.netflix.com/watch/81041267,https://www.imdb.com/title/tt4193072,"When nerdy Johanna moves to London, things get...",2838.0,https://occ-0-1081-999.1.nflxso.net/dnm/api/v6...,https://m.media-amazon.com/images/M/MV5BZGUyN2...,https://www.youtube.com/watch?v=eIbcxPy4okQ,YouTube
2,Centigrade,"Drama, Thriller",Thrillers,English,Movie,6.4,Canada,1-2 hour,Brendan Walsh,"Brendan Walsh, Daley Nixon","Genesis Rodriguez, Vincent Piazza",Unrated,4.3,,46.0,,,"$16,263",28 Aug 2020,2021-03-04,,https://www.netflix.com/watch/81305978,https://www.imdb.com/title/tt8945942,"Trapped in a frozen car during a blizzard, a p...",1720.0,https://occ-0-1081-999.1.nflxso.net/dnm/api/v6...,https://m.media-amazon.com/images/M/MV5BODM2MD...,https://www.youtube.com/watch?v=0RvV7TNUlkQ,YouTube
3,ANNE+,Drama,"TV Dramas,Romantic TV Dramas,Dutch TV Shows",Turkish,Series,7.7,"Belgium,Netherlands",< 30 minutes,,,"Vahide Perçin, Gonca Vuslateri, Cansu Dere, Be...",,6.5,,,1.0,,,01 Oct 2016,2021-03-04,,https://www.netflix.com/watch/81336456,https://www.imdb.com/title/tt6132758,"Upon moving into a new place, a 20-something r...",1147.0,https://occ-0-1489-1490.1.nflxso.net/dnm/api/v...,https://m.media-amazon.com/images/M/MV5BNWRkMz...,,
4,Moxie,"Animation, Short, Drama","Social Issue Dramas,Teen Movies,Dramas,Comedie...",English,Movie,8.1,"Lithuania,Poland,France,Iceland,Italy,Spain,Gr...",1-2 hour,Stephen Irwin,,Ragga Gudrun,,6.3,,,,4.0,,22 Sep 2011,2021-03-04,,https://www.netflix.com/watch/81078393,https://www.imdb.com/title/tt2023611,Inspired by her moms rebellious past and a con...,63.0,https://occ-0-4039-1500.1.nflxso.net/dnm/api/v...,https://m.media-amazon.com/images/M/MV5BODYyNW...,,


The next cell contains the preprocessing functions.

In [9]:
def query_processor(query):
    query = re.sub(r"(\w)([.,;:!?'\"”\)])", r"\1 \2", query)
    query = re.sub(r"([.,;:!?'\"“\(])(\w)", r"\1 \2", query)
    query = re.sub("'", "", query)
    query = re.sub(",", "", query)
    tokens = re.split(r"\s+",query)
    tokens = [t.lower() for t in tokens]
    # Stemming with the SnowballStemmer for more efficiency
    s_stemmer = SnowballStemmer("english")
    stemedList = []
    for word in tokens:
        stemedList.append(s_stemmer.stem(word))
    tokens = stemedList
    # Lemmatising the tokens
    wordnet_lemmatiser = WordNetLemmatizer()
    lemmaList = []
    for word in stemedList:
        lemmaList.append(wordnet_lemmatiser.lemmatize(word))
    tokens = lemmaList
    # Finally remove stopwords
    stops = set(stopwords.words("english"))
    tokens = [word for word in tokens if word not in stops]
    return tokens

def dataset_processor(dataset):
    dataset = dataset.drop(columns = ["Trailer Site", "TMDb Trailer", "Poster", "Image", "IMDb Votes", "Netflix Link", "Production House", "Netflix Release Date", "Boxoffice", "Awards Received", "Awards Nominated For", "IMDb Link", "Runtime", "Country Availability", "View Rating"])
    dataset["Average Score"] = dataset.apply(Average, axis = 1)
    dataset = dataset.drop(columns = ["Hidden Gem Score", "IMDb Score", "Rotten Tomatoes Score", "Metacritic Score"])
    dataset = dataset.astype({"Title":"str", "Genre":"str", "Tags":"str",
                     "Languages":"str", "Series or Movie":"category",
                     "Director":"str", "Writer":"str", "Actors":"str",
                     "Summary":"str", "Average Score":"float"})
    # Run query_processor on the summary, genre list and tag list.
    for i in ["Summary", "Genre", "Tags"]:
      dataset[i] = dataset.loc[:, i].apply(query_processor)

    # Need to turn the director, writer, languages, and actors into a list.
    for i in ["Director", "Languages", "Writer", "Actors"]:
      dataset[i] = dataset.loc[:,i].apply(Split)

    # Reduce the release date to just the year
    dataset["Release Date"] = dataset["Release Date"].apply(Year)
    return dataset

def Split(row):
  tokens = re.split(",", row)
  return tokens

def Year(row):
  tokens = re.split(" ", str(row))
  if tokens[0] == "nan":
    return "NaN"
  else:
    return tokens[2]

def Average(row):
  scores = []
  scorers = ["Hidden Gem Score", "IMDb Score", "Rotten Tomatoes Score", "Metacritic Score"]
  potential = [10,10,100,100]
  for i in range(len(scorers)):
    if not np.isnan(row.loc[scorers[i]]):
      scores.append(row.loc[scorers[i]]/potential[i])
  if len(scores) > 0:
    return sum(scores)/len(scores)
  else:
    return np.nan

In [10]:
# preprocessing the dataset
films = dataset_processor(films)
films.head()

Unnamed: 0,Title,Genre,Tags,Languages,Series or Movie,Director,Writer,Actors,Release Date,Summary,Average Score
0,Lets Fight Ghost,"[crime, drama, fantasi, horror, romanc]","[comedi, programm, romant, tv, comedi, horror,...","[Swedish, Spanish]",Series,[Tomas Alfredson],[John Ajvide Lindqvist],"[Kåre Hedebrant, Per Ragnar, Lina Leandersso...",2008,"[med, student, supernatur, gift, tri, cash, ab...",0.755
1,HOW TO BUILD A GIRL,[comedi],"[drama, comedi, film, base, book, british]",[English],Movie,[Coky Giedroyc],[Caitlin Moran],"[Paddy Considine, Cleo, Beanie Feldstein, D...",2020,"[nerdi, johanna, move, london, thing, get, han...",0.69
2,Centigrade,"[drama, thriller]",[thriller],[English],Movie,[Brendan Walsh],"[Brendan Walsh, Daley Nixon]","[Genesis Rodriguez, Vincent Piazza]",2020,"[trap, frozen, car, dure, blizzard, pregnant, ...",0.51
3,ANNE+,[drama],"[tv, drama, romant, tv, drama, dutch, tv, show]",[Turkish],Series,[nan],[nan],"[Vahide Perçin, Gonca Vuslateri, Cansu Dere,...",2016,"[upon, move, new, place, 20-someth, run, forme...",0.71
4,Moxie,"[anim, short, drama]","[social, issu, drama, teen, movi, drama, comed...",[English],Movie,[Stephen Irwin],[nan],[Ragga Gudrun],2011,"[inspir, mom, rebelli, past, confid, new, frie...",0.72
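The `Average Score` column can be checked by hand: each available score is divided by its maximum (10, 10, 100, 100) and the results are averaged, skipping missing values. The sketch below reproduces the first and third rows of the table above in plain Python; the helper name `normalised_mean` is an assumption for illustration and is not defined in the notebook.

```python
import math

def normalised_mean(scores, maxima):
    """Average the available scores after scaling each one to the 0-1 range."""
    scaled = [s / m for s, m in zip(scores, maxima) if not math.isnan(s)]
    return sum(scaled) / len(scaled) if scaled else math.nan

# Maxima for Hidden Gem, IMDb, Rotten Tomatoes and Metacritic scores
maxima = [10, 10, 100, 100]

# "Lets Fight Ghost": all four scores are present
print(round(normalised_mean([4.3, 7.9, 98.0, 82.0], maxima), 3))      # 0.755

# "Centigrade": the Rotten Tomatoes score is missing and is skipped
print(round(normalised_mean([6.4, 4.3, math.nan, 46.0], maxima), 3))  # 0.51
```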


In [11]:
films[films['Average Score'].isna()]

Unnamed: 0,Title,Genre,Tags,Languages,Series or Movie,Director,Writer,Actors,Release Date,Summary,Average Score
208,The Strongest,[drama],"[sport, movi, drama, sport, drama, classic, mo...",[None],Movie,[Raoul Walsh],"[Georges Clemenceau, Raoul Walsh]","[Beatrice Noyes, Carlo Liten, Renée Adorée, ...",1920,"[romanc, add, risk, two, hunt, expedit, arctic...",
220,Firefly Lane,"[drama, romanc]","[tv, drama, romant, tv, drama, u, tv, show, tv...",[English],Series,[nan],[Maggie Friedman],"[Roan Curtis, Sarah Chalke, Katherine Heigl,...",2021,"[best, friend, tulli, kate, support, good, tim...",
246,Le Tournoi,[documentari],"[drama, french]",[None],Movie,[Charles Belot],[nan],[nan],,"[self-indulg, chess, champion, cruis, major, t...",
248,She is King,[music],"[african, film, drama, comedi, music, music, &...",[English],Movie,[Gersh Kgamedi],"[Nicola Rauch, Gersh Kgamedi]","[Mandisa Nduna, Mike Mvelase, Aubrey Poo, K...",2017,"[khanyisil, take, talent, joburg, land, role, ...",
249,How to Eliminate My Teacher,"[mysteri, thriller]","[drama, programm, japanes, tv, programm, tv, t...",[Japanese],Series,[nan],[nan],"[Kokoro Morita, Kazusa Okuyama, Marika Matsu...",2020,"[high-achiev, student, class, 3-d, make, game,...",
...,...,...,...,...,...,...,...,...,...,...,...
15471,DreamWorks Short Stories,[nan],"[tv, comedi, kid, tv, tv, programm, tv, cartoo...",[nan],Series,[nan],[nan],[nan],,"[dreamworkss, coolest, charact, star, collect,...",
15472,DreamWorks Shrek Stories,[nan],"[tv, comedi, kid, tv, tv, programm, anim, tale...",[nan],Series,[nan],[nan],[nan],,"[shrek, celebr, busi, christma, spooki, hallow...",
15474,Nijntje and Vriendjes,[nan],"[kid, tv, tv, programm, dutch, tv, show, tv, s...",[nan],Series,[nan],[nan],[nan],,"[dick, bruna, classic, child, stori, get, new,...",
15475,K-POP Extreme Survival,[nan],"[tv, drama, tv, programm, tv, comedi, romant, ...",[nan],Series,[nan],[nan],[nan],,"[seung, yeon, decid, chase, dream, becom, k-po...",


In [12]:
films.dropna(subset=['Average Score'], inplace=True)

films[films['Average Score'].isna()]

Unnamed: 0,Title,Genre,Tags,Languages,Series or Movie,Director,Writer,Actors,Release Date,Summary,Average Score


## Indexing the documents

In [13]:
film_list = films.values.tolist()

In [14]:
films.dtypes

Title                object
Genre                object
Tags                 object
Languages            object
Series or Movie    category
Director             object
Writer               object
Actors               object
Release Date         object
Summary              object
Average Score       float64
dtype: object

Creating the index in Elasticsearch ([API](https://www.elastic.co/guide/en/elasticsearch/reference/current/indices-create-index.html)).

In [15]:
index_body = {
    'settings': {
        'number_of_shards': 1,
        'number_of_replicas': 1,
        
    },
    'mappings': {
          'properties': {
              'Title': {'type': 'text'},
              'Genre': {'type': 'keyword'},
              'Tags' : {'type': 'keyword'},
              'Languages': {'type': 'keyword'},
              'Series or Movie' : {'type' : 'keyword'},
              'Director' : {'type': 'text'},
              'Writer' : {'type': 'text'},
              'Actors' : {'type': 'text'},
              'Release Date' : {'type': 'integer'},
              'Summary' : {'type': 'text'},
              'Average Score' : {'type': 'double'}  
          }
    }
}

In [16]:
index_name = 'netflix'

# es.indices.delete(index='netflix', ignore=[400, 404]) # useful during development to reset the index
es.indices.create(index_name, body=index_body)

{'acknowledged': True, 'index': 'netflix', 'shards_acknowledged': True}

In [17]:
for title, genre, tags, languages, series_or_movie, director, writer, actors, release_date, summary, avg_score in film_list:
  list_body = {
      'Title': title,
      'Genre': genre,
      'Tags' : tags,
      'Languages': languages,
      'Series or Movie' : series_or_movie,
      'Director' : director,
      'Writer' : writer,
      'Actors' : actors,
      'Release Date' : release_date,
      'Summary' : summary,
      'Average Score' : avg_score
  }
  es.index(index_name, list_body)
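Indexing one document per request, as above, makes one HTTP call per film. A possible alternative is to prepare bulk actions and send them in batches. The sketch below only builds the actions, so it runs without a server; the generator name `generate_actions` and the shortened column list are assumptions for illustration.

```python
def generate_actions(rows, index_name, columns):
    """Yield one bulk action per row, pairing each value with its column name."""
    for row in rows:
        yield {"_index": index_name, "_source": dict(zip(columns, row))}

# Shortened, hypothetical column list and a single sample row
columns = ["Title", "Genre", "Average Score"]
rows = [["Centigrade", ["drama", "thriller"], 0.51]]

actions = list(generate_actions(rows, "netflix", columns))
print(actions[0])
```

With a running cluster, the same generator could be passed to `elasticsearch.helpers.bulk`, for example `bulk(es, generate_actions(film_list, index_name, list(films.columns)))`, assuming the column order of `films` matches the rows in `film_list`.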

In [18]:
es.cat.count(index='netflix',h=['count'])

'13394\n'

In [19]:
es.search(index="netflix", body={"query": {"match_all": {}}})
print('')




In [20]:
es.search(index="netflix", body={"query":{'term': {'Genre' : 'crime'}} })
print('')




In [21]:
es.search(index="netflix", body={"query":{'match': {'Director' : 'Tarantino'}} })
print('')




In [22]:
query_body = {
    'query' :{
        'match' :{
            'Title' : 'Lets Fight Ghost'
        }
    }
}

explain=True

results = es.search(index=index_name, body=query_body, explain=explain)['hits']['hits']
for hit in results:
  print('Title: {} - Director: {} - score: {}'.format(hit['_source']['Title'], hit['_source']['Director'], hit['_score']))

Title: Lets Fight Ghost - Director: ['Tomas Alfredson'] - score: 20.152025
Title: Lets Dance - Director: ['Tomas Alfredson'] - score: 8.814203
Title: Lets Dance - Director: ['Ladislas Chollat'] - score: 8.814203
Title: Fist Fight - Director: ['Richie Keen'] - score: 7.9699717
Title: Ghost - Director: ['Jerry Zucker'] - score: 7.805466
Title: Lets Eat 2 - Director: ['nan'] - score: 7.6067967
Title: Lets Be Cops - Director: ['Luke Greenfield'] - score: 7.6067967
Title: Ultra Fight Victory - Director: ['nan'] - score: 6.8782115
Title: S.W.A.T.: Fire Fight - Director: ['Benny Boom'] - score: 6.8782115
Title: Ghost Buddies - Director: ['Simon Sek'] - score: 6.5665264


**Compound query search:**

In [23]:
es.search(index="netflix", body={"query":{'bool':{'must':[{'match': {'Director' : 'Tarantino'}}, {'match': {'Genre': 'crime'}}] }}})

{'_shards': {'failed': 0, 'skipped': 0, 'successful': 1, 'total': 1},
 'hits': {'hits': [{'_id': 'MgAP43oB7m-swid8xTlH',
    '_index': 'netflix',
    '_score': 9.963967,
    '_source': {'Actors': ['Tim Roth',
      ' Amanda Plummer',
      ' Laura Lovelace',
      ' John Travolta'],
     'Average Score': 0.8025,
     'Director': ['Quentin Tarantino'],
     'Genre': ['crime', 'drama'],
     'Languages': ['English', ' Spanish', ' French'],
     'Release Date': '1994',
     'Series or Movie': 'Movie',
     'Summary': ['shelter',
      'young',
      'woman',
      'becom',
      'enamor',
      'struggl',
      'writer',
      'goe',
      'great',
      'length',
      'becom',
      'involv',
      'creativ',
      'process',
      '.'],
     'Tags': ['psycholog',
      'thriller',
      'independ',
      'film',
      'thriller',
      'indonesian',
      'film'],
     'Title': 'Fiction.',
     'Writer': ['Quentin Tarantino', ' Roger Avary']},
    '_type': '_doc'},
   {'_id': 'ugAQ43oB
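The compound query above combines several `match` clauses under `bool`/`must`, so a document has to satisfy all of them. A small sketch of how such a body could be built from a dictionary of field values follows; the helper name `bool_must_query` is hypothetical and not part of the notebook.

```python
def bool_must_query(criteria):
    """Build a bool query in which every field/value pair must match."""
    must = [{"match": {field: value}} for field, value in criteria.items()]
    return {"query": {"bool": {"must": must}}}

body = bool_must_query({"Director": "Tarantino", "Genre": "crime"})
print(body)
```

The resulting `body` has the same structure as the dictionary passed to `es.search` in the cell above.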

In [24]:
def free_search(query):
  query = query_processor(query)
  query_body = {
    'query' :{
        'multi_match' :{
            'query' : str(query),
            'type'  : "cross_fields",
            'fields' : ['Title^2', 'Genre^3', 'Tags^2', 'Series or Movie', 'Director', 'Writer', 'Actors', 'Summary']
          }
      }
  }
  results = es.search(index='netflix', body=query_body)['hits']['hits']
  for hit in results:
    print('Title: {} - Genre: {} - score: {}'.format(hit['_source']['Title'], hit['_source']['Genre'], hit['_score']))
  return results
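Note that `query_processor` returns a list of tokens, so `str(query)` sends the list's Python representation (brackets, quotes and commas included) as the query text. The sketch below only shows the difference at the Python level, using hypothetical tokens; joining the tokens with spaces is one possible alternative, and switching to it may change the scores reported in this notebook.

```python
# Hypothetical tokens, as query_processor might return them
tokens = ["action", "film", "end", "world"]

as_repr = str(tokens)        # what free_search currently sends
as_text = " ".join(tokens)   # a plain, space-separated alternative

print(as_repr)  # ['action', 'film', 'end', 'world']
print(as_text)  # action film end world
```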


In [25]:
free_search('An action film about the end of the world')
print('\n')


Title: The End - Genre: ['comedi', 'drama'] - score: 14.311697
Title: Journeys End - Genre: ['drama', 'war'] - score: 12.263657
Title: Parades End - Genre: ['action', 'drama', 'romanc', 'war'] - score: 12.263657
Title: Otto - Der Neue Film - Genre: ['comedi'] - score: 11.343967
Title: Action Point - Genre: ['comedi'] - score: 10.921772
Title: Action Replayy - Genre: ['comedi', 'romanc', 'sci-fi'] - score: 10.921772
Title: The Worlds End - Genre: ['comedi', 'sci-fi'] - score: 10.750923
Title: One Piece Film: Strong World - Genre: ['anim', 'action', 'adventur', 'fantasi'] - score: 10.409172
Title: Seraph of the End - Genre: ['anim', 'action', 'adventur', 'drama', 'famili', 'fantasi', 'sci-fi'] - score: 9.987624
Title: The End? - Genre: ['horror', 'thriller'] - score: 9.445669




In [26]:
free_search('A comedy series about a group of friends')
print('\n')

Title: Divine Secrets of the Ya-Ya Sisterhood - Genre: ['drama'] - score: 11.143722
Title: The Comedy Lineup - Genre: ['comedi'] - score: 10.668942
Title: Assimilate - Genre: ['horror', 'mysteri', 'sci-fi', 'thriller'] - score: 9.757513
Title: The Adventures of Puss in Boots - Genre: ['anim', 'action', 'adventur', 'comedi', 'famili', 'fantasi', 'western'] - score: 9.679244
Title: Anohana: The Flower We Saw That Day - Genre: ['anim', 'adventur', 'drama', 'fantasi', 'mysteri', 'romanc'] - score: 9.483681
Title: Friend Zone - Genre: ['comedi'] - score: 9.396196
Title: Skins - Genre: ['drama'] - score: 9.224798
Title: #blackAF - Genre: ['comedi'] - score: 8.013987
Title: I Think You Should Leave with Tim Robinson - Genre: ['comedi'] - score: 8.013987
Title: Fresh Meat - Genre: ['comedi', 'drama'] - score: 8.013987




In [27]:
free_search('A series about a group of people saving the world')
print('\n')

Title: Doomsday: The Sinking of Japan - Genre: ['adventur', 'drama', 'sci-fi', 'thriller'] - score: 11.75946
Title: Grimgar of Fantasy and Ash - Genre: ['anim', 'adventur', 'drama', 'fantasi'] - score: 11.513651
Title: Passionate People - Genre: ['comedi', 'romanc'] - score: 11.469359
Title: Dark Net - Genre: ['documentari'] - score: 11.189331
Title: Moryo no Hako - Genre: ['crime', 'mysteri', 'thriller'] - score: 11.147486
Title: Reboot: The Guardian Code - Genre: ['action', 'comedi', 'drama', 'sci-fi'] - score: 10.211877
Title: DCs Legends of Tomorrow - Genre: ['documentari', 'action', 'sci-fi'] - score: 10.194386
Title: Arthur and the Invisibles - Genre: ['anim', 'adventur', 'comedi', 'famili', 'fantasi'] - score: 9.985563
Title: Hello World - Genre: ['anim', 'comedi', 'drama', 'famili', 'romanc', 'sci-fi'] - score: 9.919128
Title: Young Justice - Genre: ['anim', 'action', 'adventur', 'crime', 'romanc', 'sci-fi'] - score: 9.654355




In [28]:
free_search('A film about a family')
print('\n')

Title: Stromberg - Der Film - Genre: ['comedi'] - score: 7.7949395
Title: Draw - Genre: ['anim', 'short'] - score: 6.8804126
Title: Battles Without Honor and Humanity Dairi Senso - Genre: ['action', 'drama'] - score: 6.8804126
Title: Otto - Der Neue Film - Genre: ['comedi'] - score: 6.8558035
Title: One Piece Film: Gold - Genre: ['anim', 'action', 'adventur', 'fantasi'] - score: 6.8558035
Title: One Piece Film Z - Genre: ['anim', 'action', 'adventur', 'fantasi'] - score: 6.8558035
Title: A Lion in the House - Genre: ['documentari'] - score: 6.687322
Title: The Wilde Wedding - Genre: ['comedi', 'romanc'] - score: 6.687322
Title: Joey - Genre: ['famili', 'adventur', 'comedi'] - score: 6.687322
Title: Rumor Has It - Genre: ['comedi', 'drama', 'romanc'] - score: 6.687322




In [29]:
free_search('Something about solving a mystery')
print('\n')

Title: The Little Man - Genre: ['comedi', 'famili', 'fantasi'] - score: 8.9453
Title: Gintama: The Movie: The Final Chapter: Be Forever Yorozuya - Genre: ['anim', 'action', 'comedi', 'sci-fi'] - score: 8.8301735
Title: Trick Shinsaku Special - Genre: ['comedi', 'mysteri'] - score: 8.575211
Title: Whats New Scooby-Doo? - Genre: ['comedi', 'fantasi', 'romanc'] - score: 8.575211
Title: Skyline - Genre: ['action', 'sci-fi', 'thriller'] - score: 8.575211
Title: Dilili in Paris - Genre: ['anim', 'adventur', 'famili', 'mysteri'] - score: 8.334558
Title: Prokurator - Genre: ['crime'] - score: 8.334558
Title: The Victims Game - Genre: ['drama', 'thriller'] - score: 8.334558
Title: Saru Lock - Genre: ['comedi', 'drama'] - score: 8.334558
Title: SŁUGI WOJNY - Genre: ['thriller'] - score: 8.334558




## Testing

In [31]:
# Load the benchmark titles and keep only the rows that have an average score
benchmark = pd.read_csv("Benchmark_dataset_features.csv")
benchmark.dropna(subset=['Average Score'], inplace=True)
# Reuse the already preprocessed rows of `films` for those titles
benchmark = films[films["Title"].isin(benchmark["Title"].tolist())]
benchmark = benchmark.drop_duplicates(subset=["Title"])
benchmark_list = benchmark.values.tolist()

In [None]:
#benchmark = pd.read_csv("Benchmark_dataset_features.csv")
#benchmark = dataset_processor(benchmark)
#benchmark.dropna(subset=['Average Score'], inplace=True)
#benchmark_list = benchmark.values.tolist()


In [32]:
index_name = 'benchmark'
es.indices.create(index_name, body=index_body)

{'acknowledged': True, 'index': 'benchmark', 'shards_acknowledged': True}

In [33]:
for title, genre, tags, languages, series_or_movie, director, writer, actors, release_date, summary, avg_score in benchmark_list:
  list_body = {
      'Title': title,
      'Genre': genre,
      'Tags' : tags,
      'Languages': languages,
      'Series or Movie' : series_or_movie,
      'Director' : director,
      'Writer' : writer,
      'Actors' : actors,
      'Release Date' : release_date,
      'Summary' : summary,
      'Average Score' : avg_score
  }
  es.index(index_name, list_body)

In [127]:
def free_search_test(query):
  query = query_processor(query)
  query_body = {
    'query' :{
        'multi_match' :{
            'query' : str(query),
            'type'  : "cross_fields",
            'fields' : ['Title^2', 'Genre', 'Tags^2', 'Series or Movie^4', 'Director', 'Writer', 'Actors', 'Summary^3']
          }
      }
  }
  results = es.search(index='benchmark', body=query_body)['hits']['hits']
  for hit in results:
    print('Title: {} - Genre: {} - score: {}'.format(hit['_source']['Title'], hit['_source']['Genre'], hit['_score']))
  return results

In [128]:
test_1 = free_search_test('An action movie about the end of the world')

Title: Billy Lynns Long Halftime Walk - Genre: ['action', 'drama', 'sport', 'thriller', 'war'] - score: 14.324874
Title: Movie 43 - Genre: ['comedi'] - score: 13.959761
Title: The Last Days of American Crime - Genre: ['action', 'crime', 'drama', 'sci-fi', 'thriller'] - score: 13.612794
Title: Molang - Genre: ['anim', 'short', 'comedi', 'famili'] - score: 9.31041
Title: Michael Boltons Big, Sexy Valentines Day Special - Genre: ['music'] - score: 9.31041
Title: Aliens - Genre: ['action', 'adventur', 'sci-fi', 'thriller'] - score: 9.053455
Title: World War Z - Genre: ['action', 'adventur', 'horror', 'sci-fi'] - score: 9.053455
Title: Blood of Zeus - Genre: ['anim', 'action', 'fantasi'] - score: 8.810305
Title: The Others - Genre: ['horror', 'mysteri', 'thriller'] - score: 8.579873
Title: Jericho - Genre: ['action', 'drama', 'mysteri', 'sci-fi'] - score: 8.579873


In [129]:
test_2 = free_search_test('A comedy series about a group of friends')

Title: 10 Years - Genre: ['comedi', 'drama', 'romanc'] - score: 19.317524
Title: Crocodile Dundee - Genre: ['action', 'adventur', 'comedi'] - score: 12.376177
Title: ADAM SANDLER 100% FRESH - Genre: ['comedi', 'music'] - score: 11.379005
Title: Interstellar - Genre: ['adventur', 'drama', 'sci-fi'] - score: 11.163675
Title: Material - Genre: ['comedi', 'drama'] - score: 11.081389
Title: Movie 43 - Genre: ['comedi'] - score: 10.798946
Title: Strip Down, Rise Up - Genre: ['documentari'] - score: 10.547474
Title: Fear City: New York vs The Mafia - Genre: ['documentari', 'crime'] - score: 9.995739
Title: Alien Resurrection - Genre: ['action', 'horror', 'sci-fi'] - score: 9.995739
Title: Monty Pythons Life of Brian - Genre: ['comedi'] - score: 9.995739


In [130]:
test_3 = free_search_test('A series about a group of people saving the world')

Title: Devil - Genre: ['horror', 'mysteri', 'thriller'] - score: 21.770401
Title: Blood of Zeus - Genre: ['anim', 'action', 'fantasi'] - score: 19.58095
Title: The Confession Tapes - Genre: ['documentari', 'crime'] - score: 19.466915
Title: Hell or High Water - Genre: ['action', 'crime', 'drama', 'thriller', 'western'] - score: 18.870735
Title: Interstellar - Genre: ['adventur', 'drama', 'sci-fi'] - score: 11.163675
Title: Dora and the Lost City of Gold - Genre: ['adventur', 'comedi', 'famili', 'mysteri'] - score: 11.067898
Title: Strip Down, Rise Up - Genre: ['documentari'] - score: 10.547474
Title: 10 Years - Genre: ['comedi', 'drama', 'romanc'] - score: 10.547474
Title: Minimalism: A Documentary About the Important Things - Genre: ['documentari'] - score: 10.388374
Title: Wonder Woman: Bloodlines - Genre: ['anim', 'action', 'fantasi'] - score: 10.221597


In [131]:
test_4 = free_search_test('A movie about a family')

Title: Movie 43 - Genre: ['comedi'] - score: 13.959761
Title: Arrested Development - Genre: ['comedi'] - score: 7.714793
Title: Baxu and the Giants - Genre: ['short', 'drama', 'famili', 'fantasi'] - score: 7.495834
Title: Lifes Speed Bump - Genre: ['horror', 'sci-fi', 'thriller'] - score: 7.495834
Title: Aliens - Genre: ['action', 'adventur', 'sci-fi', 'thriller'] - score: 7.2889595
Title: This Is Us - Genre: ['comedi', 'drama', 'romanc'] - score: 7.2889595
Title: Life in Pieces - Genre: ['comedi'] - score: 7.2889595
Title: Rumor Has It - Genre: ['comedi', 'drama', 'romanc'] - score: 7.2889595
Title: The Russian Revolution - Genre: ['documentari', 'histori'] - score: 7.2889595
Title: The Grudge 2 - Genre: ['horror', 'thriller'] - score: 7.2889595


In [132]:
test_5 = free_search_test('Something about solving a mystery')

Title: Beyond the Edge - Genre: ['action', 'adventur', 'fantasi'] - score: 11.067898
Title: The Grudge 2 - Genre: ['horror', 'thriller'] - score: 11.067898
Title: The Disappointments Room - Genre: ['drama', 'horror', 'thriller'] - score: 10.770645
Title: The Mist - Genre: ['drama', 'horror', 'sci-fi'] - score: 10.48894
Title: Let Me In - Genre: ['drama', 'fantasi', 'horror', 'mysteri', 'thriller'] - score: 9.967542


In [133]:
relevance_scores = pd.read_csv("Benchmark_dataset_relevance.csv")
relevance_scores.head()

Unnamed: 0,Title,Query_1,Query_2,Query_3,Query_4,Query_5
0,The Missing,0,0,0,0,1
1,Asian Connection,0,0,0,1,0
2,Erased,0,0,0,0,1
3,The Front Runner,0,0,0,1,0
4,Rumor Has It,0,0,0,1,1


In [134]:
# Extracting the titles returned for each query
tests = [test_1, test_2, test_3, test_4, test_5]
retrieved = [[hit['_source']['Title'] for hit in test] for test in tests]

# Then calculating precision and recall for each query
for n, titles in enumerate(retrieved, start=1):
  relevant = relevance_scores["Title"].loc[relevance_scores["Query_" + str(n)] == 1]
  true_pos = sum(1 for title in relevant if title in titles)
  print("Query " + str(n) + " precision is " + str(true_pos/len(titles)))
  print("Query " + str(n) + " recall is " + str(true_pos/len(relevant)))


Query 1 precision is 0.3
Query 1 recall is 0.11538461538461539
Query 2 precision is 0.1
Query 2 recall is 0.058823529411764705
Query 3 precision is 0.1
Query 3 recall is 0.1111111111111111
Query 4 precision is 0.1
Query 4 recall is 0.022727272727272728
Query 5 precision is 0.2
Query 5 recall is 0.019230769230769232
