## Loading Elasticsearch

In this section Elasticsearch is installed on the Colab machine and initialised. The commands that make it run on Colab were provided by GitHub user [korakot](https://gist.github.com/korakot/15fe4f18d0e0f53d7b834ef797880500).

In [2]:
!wget https://artifacts.elastic.co/downloads/elasticsearch/elasticsearch-7.0.0-linux-x86_64.tar.gz -q  # an older Elasticsearch release: the latest at the time was 7.13.4, but that one would not run here
!tar -xzf elasticsearch-7.0.0-linux-x86_64.tar.gz
!chown -R daemon:daemon elasticsearch-7.0.0
!pip install "elasticsearch>=7,<8" -q  # 7.x client, to match the 7.0.0 server installed above

In [3]:
import os
from subprocess import Popen, PIPE, STDOUT

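The next cell starts Elasticsearch as a background process on the hosted runtime machine. The server needs some time to start up, so a request sent straight away may be refused.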

In [4]:
es_server = Popen(['elasticsearch-7.0.0/bin/elasticsearch'], 
                  stdout=PIPE, stderr=STDOUT,
                  preexec_fn=lambda: os.setuid(1)  # run as daemon (uid 1); Elasticsearch refuses to start as root
                 )
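
Because the server is started in the background, it can be handy to wait until it answers before moving on. The following is only a sketch of one way to do that with the standard library: the name `wait_for_elasticsearch`, its parameters, and the default timeout are assumptions, not part of the original notebook.

```python
import time
import urllib.error
import urllib.request


def wait_for_elasticsearch(url="http://localhost:9200/", timeout=60.0,
                           interval=2.0, probe=None):
    """Poll `url` until it answers or `timeout` seconds pass.

    Returns True as soon as a probe succeeds, False if the deadline is reached.
    `probe` is a hypothetical hook (url -> bool) so the check can be swapped out.
    """
    def default_probe(target):
        try:
            with urllib.request.urlopen(target, timeout=5) as response:
                return response.status == 200
        except (urllib.error.URLError, OSError):
            # Connection refused or reset: the server is not ready yet
            return False

    probe = probe or default_probe
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if probe(url):
            return True
        time.sleep(interval)
    return False
```

Calling `wait_for_elasticsearch()` right after the `Popen` cell should return `True` once the node is reachable on port 9200.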


The next command uses [cURL](https://curl.se/), a command-line HTTP client, to talk to Elasticsearch's REST API. No query DSL (Elasticsearch's JSON-based query language) is involved yet: the request only checks whether Elasticsearch is running locally (on the Google machine in this case) by sending a GET to its default port (`localhost:9200`).

In [26]:
!curl -X GET "localhost:9200/" 

{
  "name" : "4669f306e673",
  "cluster_name" : "elasticsearch",
  "cluster_uuid" : "c4lEJg5jQYSaPpa-Tn9SGw",
  "version" : {
    "number" : "7.0.0",
    "build_flavor" : "default",
    "build_type" : "tar",
    "build_hash" : "b7e28a7",
    "build_date" : "2019-04-05T22:55:32.697037Z",
    "build_snapshot" : false,
    "lucene_version" : "8.0.0",
    "minimum_wire_compatibility_version" : "6.7.0",
    "minimum_index_compatibility_version" : "6.0.0-beta1"
  },
  "tagline" : "You Know, for Search"
}


In [27]:
from elasticsearch import Elasticsearch

In [28]:
es = Elasticsearch()
es.ping() # another way to test whether ES is running, returns True if so

True

Elasticsearch should now be running. Before creating an index, it is worth preprocessing the data.

## Preprocessing

In [29]:
import pandas as pd
import numpy as np
import nltk
import re
from nltk.stem import SnowballStemmer
from nltk.stem import WordNetLemmatizer
from nltk.corpus import stopwords
nltk.download('wordnet')
nltk.download('stopwords')

[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

The next cell loads the dataset, which is available on [Kaggle](https://www.kaggle.com/ashishgup/netflix-rotten-tomatoes-metacritic-imdb). For now it is uploaded to the local runtime by hand, because downloading it from code requires a personal Kaggle API token.

In [30]:
films = pd.read_csv("netflix-rotten-tomatoes-metacritic-imdb.csv")
films.head()

Unnamed: 0,Title,Genre,Tags,Languages,Series or Movie,Hidden Gem Score,Country Availability,Runtime,Director,Writer,Actors,View Rating,IMDb Score,Rotten Tomatoes Score,Metacritic Score,Awards Received,Awards Nominated For,Boxoffice,Release Date,Netflix Release Date,Production House,Netflix Link,IMDb Link,Summary,IMDb Votes,Image,Poster,TMDb Trailer,Trailer Site
0,Lets Fight Ghost,"Crime, Drama, Fantasy, Horror, Romance","Comedy Programmes,Romantic TV Comedies,Horror ...","Swedish, Spanish",Series,4.3,Thailand,< 30 minutes,Tomas Alfredson,John Ajvide Lindqvist,"Kåre Hedebrant, Per Ragnar, Lina Leandersson, ...",R,7.9,98.0,82.0,74.0,57.0,"$2,122,065",12 Dec 2008,2021-03-04,"Canal+, Sandrew Metronome",https://www.netflix.com/watch/81415947,https://www.imdb.com/title/tt1139797,A med student with a supernatural gift tries t...,205926.0,https://occ-0-4708-64.1.nflxso.net/dnm/api/v6/...,https://m.media-amazon.com/images/M/MV5BOWM4NT...,,
1,HOW TO BUILD A GIRL,Comedy,"Dramas,Comedies,Films Based on Books,British",English,Movie,7.0,Canada,1-2 hour,Coky Giedroyc,Caitlin Moran,"Paddy Considine, Cleo, Beanie Feldstein, Dónal...",R,5.8,79.0,69.0,1.0,,"$70,632",08 May 2020,2021-03-04,"Film 4, Monumental Pictures, Lionsgate",https://www.netflix.com/watch/81041267,https://www.imdb.com/title/tt4193072,"When nerdy Johanna moves to London, things get...",2838.0,https://occ-0-1081-999.1.nflxso.net/dnm/api/v6...,https://m.media-amazon.com/images/M/MV5BZGUyN2...,https://www.youtube.com/watch?v=eIbcxPy4okQ,YouTube
2,Centigrade,"Drama, Thriller",Thrillers,English,Movie,6.4,Canada,1-2 hour,Brendan Walsh,"Brendan Walsh, Daley Nixon","Genesis Rodriguez, Vincent Piazza",Unrated,4.3,,46.0,,,"$16,263",28 Aug 2020,2021-03-04,,https://www.netflix.com/watch/81305978,https://www.imdb.com/title/tt8945942,"Trapped in a frozen car during a blizzard, a p...",1720.0,https://occ-0-1081-999.1.nflxso.net/dnm/api/v6...,https://m.media-amazon.com/images/M/MV5BODM2MD...,https://www.youtube.com/watch?v=0RvV7TNUlkQ,YouTube
3,ANNE+,Drama,"TV Dramas,Romantic TV Dramas,Dutch TV Shows",Turkish,Series,7.7,"Belgium,Netherlands",< 30 minutes,,,"Vahide Perçin, Gonca Vuslateri, Cansu Dere, Be...",,6.5,,,1.0,,,01 Oct 2016,2021-03-04,,https://www.netflix.com/watch/81336456,https://www.imdb.com/title/tt6132758,"Upon moving into a new place, a 20-something r...",1147.0,https://occ-0-1489-1490.1.nflxso.net/dnm/api/v...,https://m.media-amazon.com/images/M/MV5BNWRkMz...,,
4,Moxie,"Animation, Short, Drama","Social Issue Dramas,Teen Movies,Dramas,Comedie...",English,Movie,8.1,"Lithuania,Poland,France,Iceland,Italy,Spain,Gr...",1-2 hour,Stephen Irwin,,Ragga Gudrun,,6.3,,,,4.0,,22 Sep 2011,2021-03-04,,https://www.netflix.com/watch/81078393,https://www.imdb.com/title/tt2023611,Inspired by her moms rebellious past and a con...,63.0,https://occ-0-4039-1500.1.nflxso.net/dnm/api/v...,https://m.media-amazon.com/images/M/MV5BODYyNW...,,


The next cell contains the preprocessing functions.

In [31]:
def query_processor(query):
    # Put a space between a word and adjacent punctuation so that punctuation becomes its own token
    query = re.sub(r"(\w)([.,;:!?'\"”\)])", r"\1 \2", query)
    query = re.sub(r"([.,;:!?'\"“\(])(\w)", r"\1 \2", query)
    # Drop apostrophes and commas entirely
    query = re.sub("'", "", query)
    query = re.sub(",", "", query)
    tokens = re.split(r"\s+", query)
    tokens = [t.lower() for t in tokens]
    # Stem with the Snowball stemmer
    s_stemmer = SnowballStemmer("english")
    tokens = [s_stemmer.stem(word) for word in tokens]
    # Lemmatise the stemmed tokens
    wordnet_lemmatiser = WordNetLemmatizer()
    tokens = [wordnet_lemmatiser.lemmatize(word) for word in tokens]
    # Finally remove stopwords
    stops = set(stopwords.words("english"))
    tokens = [word for word in tokens if word not in stops]
    return tokens

def dataset_processor(dataset):
    dataset = dataset.copy()  # work on a copy of the frame passed in instead of re-reading the CSV
    dataset = dataset.drop(columns = ["Trailer Site", "TMDb Trailer", "Poster", "Image", "IMDb Votes", "Netflix Link", "Production House", "Netflix Release Date", "Boxoffice", "Awards Received", "Awards Nominated For", "IMDb Link", "Runtime", "Country Availability", "View Rating"])
    dataset["Average Score"] = dataset.apply(Average, axis = 1)
    dataset = dataset.drop(columns = ["Hidden Gem Score", "IMDb Score", "Rotten Tomatoes Score", "Metacritic Score"])
    dataset = dataset.astype({"Title":"str", "Genre":"str", "Tags":"str",
                     "Languages":"str", "Series or Movie":"category",
                     "Director":"str", "Writer":"str", "Actors":"str",
                     "Summary":"str", "Average Score":"float"})
    # Run query_processor over the summary, genre and tag columns.
    for i in ["Summary", "Genre", "Tags"]:
      dataset[i] = dataset.loc[:, i].apply(query_processor)

    # Need to turn the director, writer, languages, and actors into a list.
    for i in ["Director", "Languages", "Writer", "Actors"]:
      dataset[i] = dataset.loc[:,i].apply(Split)

    # Reduce the release date to just the year
    dataset["Release Date"] = dataset["Release Date"].apply(Year)
    return dataset

def Split(row):
  tokens = re.split(",", row)
  return tokens

def Year(row):
  tokens = re.split(" ", str(row))
  if tokens[0] == "nan":
    return "NaN"
  else:
    return tokens[2]

def Average(row):
  # Scale every available score to the 0-1 range, then average them
  scores = []
  scorers = ["Hidden Gem Score", "IMDb Score", "Rotten Tomatoes Score", "Metacritic Score"]
  potential = [10, 10, 100, 100]  # maximum value of each score
  for scorer, maximum in zip(scorers, potential):
    if not np.isnan(row.loc[scorer]):
      scores.append(row.loc[scorer] / maximum)
  if len(scores) > 0:
    return sum(scores) / len(scores)
  else:
    return np.nan
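
To see what the tokenising part of `query_processor` does on its own, here is a small sketch that repeats only the regular-expression and lower-casing steps, leaving out stemming, lemmatising and stopword removal so that it needs no NLTK data. The helper name `split_tokens` and the sample sentence are made up for illustration.

```python
import re


def split_tokens(text):
    """Repeat only the regex and lower-casing steps of query_processor."""
    # Detach punctuation from the word before it, then from the word after it
    text = re.sub(r"(\w)([.,;:!?'\"”\)])", r"\1 \2", text)
    text = re.sub(r"([.,;:!?'\"“\(])(\w)", r"\1 \2", text)
    # Apostrophes and commas are removed; other punctuation stays as a token
    text = re.sub("'", "", text)
    text = re.sub(",", "", text)
    return [t.lower() for t in re.split(r"\s+", text)]


print(split_tokens("Trapped in a frozen car, a woman waits."))
# ['trapped', 'in', 'a', 'frozen', 'car', 'a', 'woman', 'waits', '.']
```

The comma disappears, but the full stop survives as a token of its own, which is why a `'.'` entry shows up at the end of the processed summaries.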

In [32]:
# preprocessing the dataset
films = dataset_processor(films)
films.head()

Unnamed: 0,Title,Genre,Tags,Languages,Series or Movie,Director,Writer,Actors,Release Date,Summary,Average Score
0,Lets Fight Ghost,"[crime, drama, fantasi, horror, romanc]","[comedi, programm, romant, tv, comedi, horror,...","[Swedish, Spanish]",Series,[Tomas Alfredson],[John Ajvide Lindqvist],"[Kåre Hedebrant, Per Ragnar, Lina Leandersso...",2008,"[med, student, supernatur, gift, tri, cash, ab...",0.755
1,HOW TO BUILD A GIRL,[comedi],"[drama, comedi, film, base, book, british]",[English],Movie,[Coky Giedroyc],[Caitlin Moran],"[Paddy Considine, Cleo, Beanie Feldstein, D...",2020,"[nerdi, johanna, move, london, thing, get, han...",0.69
2,Centigrade,"[drama, thriller]",[thriller],[English],Movie,[Brendan Walsh],"[Brendan Walsh, Daley Nixon]","[Genesis Rodriguez, Vincent Piazza]",2020,"[trap, frozen, car, dure, blizzard, pregnant, ...",0.51
3,ANNE+,[drama],"[tv, drama, romant, tv, drama, dutch, tv, show]",[Turkish],Series,[nan],[nan],"[Vahide Perçin, Gonca Vuslateri, Cansu Dere,...",2016,"[upon, move, new, place, 20-someth, run, forme...",0.71
4,Moxie,"[anim, short, drama]","[social, issu, drama, teen, movi, drama, comed...",[English],Movie,[Stephen Irwin],[nan],[Ragga Gudrun],2011,"[inspir, mom, rebelli, past, confid, new, frie...",0.72
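
As a sanity check on the `Average Score` column, the arithmetic can be redone by hand for two of the rows above. This is a standalone sketch that mirrors the logic of `Average` with plain Python; the name `average_score` is hypothetical, and the input values are copied from the raw table printed earlier.

```python
import math


def average_score(scores, maxima=(10, 10, 100, 100)):
    """Mean of the available scores after scaling each one to the 0-1 range."""
    scaled = [s / m for s, m in zip(scores, maxima) if not math.isnan(s)]
    return sum(scaled) / len(scaled) if scaled else float("nan")


# Hidden Gem, IMDb, Rotten Tomatoes, Metacritic for rows 0 and 2
first_row = average_score([4.3, 7.9, 98.0, 82.0])
third_row = average_score([6.4, 4.3, float("nan"), 46.0])
print(round(first_row, 3), round(third_row, 3))
```

The two results should agree with the 0.755 and 0.51 shown in the table: a missing score is simply left out of the mean rather than counted as zero.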


In [33]:
films[films['Average Score'].isna()]

Unnamed: 0,Title,Genre,Tags,Languages,Series or Movie,Director,Writer,Actors,Release Date,Summary,Average Score
208,The Strongest,[drama],"[sport, movi, drama, sport, drama, classic, mo...",[None],Movie,[Raoul Walsh],"[Georges Clemenceau, Raoul Walsh]","[Beatrice Noyes, Carlo Liten, Renée Adorée, ...",1920,"[romanc, add, risk, two, hunt, expedit, arctic...",
220,Firefly Lane,"[drama, romanc]","[tv, drama, romant, tv, drama, u, tv, show, tv...",[English],Series,[nan],[Maggie Friedman],"[Roan Curtis, Sarah Chalke, Katherine Heigl,...",2021,"[best, friend, tulli, kate, support, good, tim...",
246,Le Tournoi,[documentari],"[drama, french]",[None],Movie,[Charles Belot],[nan],[nan],,"[self-indulg, chess, champion, cruis, major, t...",
248,She is King,[music],"[african, film, drama, comedi, music, music, &...",[English],Movie,[Gersh Kgamedi],"[Nicola Rauch, Gersh Kgamedi]","[Mandisa Nduna, Mike Mvelase, Aubrey Poo, K...",2017,"[khanyisil, take, talent, joburg, land, role, ...",
249,How to Eliminate My Teacher,"[mysteri, thriller]","[drama, programm, japanes, tv, programm, tv, t...",[Japanese],Series,[nan],[nan],"[Kokoro Morita, Kazusa Okuyama, Marika Matsu...",2020,"[high-achiev, student, class, 3-d, make, game,...",
...,...,...,...,...,...,...,...,...,...,...,...
15471,DreamWorks Short Stories,[nan],"[tv, comedi, kid, tv, tv, programm, tv, cartoo...",[nan],Series,[nan],[nan],[nan],,"[dreamworkss, coolest, charact, star, collect,...",
15472,DreamWorks Shrek Stories,[nan],"[tv, comedi, kid, tv, tv, programm, anim, tale...",[nan],Series,[nan],[nan],[nan],,"[shrek, celebr, busi, christma, spooki, hallow...",
15474,Nijntje and Vriendjes,[nan],"[kid, tv, tv, programm, dutch, tv, show, tv, s...",[nan],Series,[nan],[nan],[nan],,"[dick, bruna, classic, child, stori, get, new,...",
15475,K-POP Extreme Survival,[nan],"[tv, drama, tv, programm, tv, comedi, romant, ...",[nan],Series,[nan],[nan],[nan],,"[seung, yeon, decid, chase, dream, becom, k-po...",


In [34]:
films.dropna(subset=['Average Score'], inplace=True)

films[films['Average Score'].isna()]

Unnamed: 0,Title,Genre,Tags,Languages,Series or Movie,Director,Writer,Actors,Release Date,Summary,Average Score


## Indexing the documents

In [35]:
film_list = films.values.tolist()

In [36]:
films.dtypes

Title                object
Genre                object
Tags                 object
Languages            object
Series or Movie    category
Director             object
Writer               object
Actors               object
Release Date         object
Summary              object
Average Score       float64
dtype: object

Creating the index in Elasticsearch ([API](https://www.elastic.co/guide/en/elasticsearch/reference/current/indices-create-index.html)).

In [41]:
index_body = {
    'settings': {
        'number_of_shards': 1,
        'number_of_replicas': 1,
        
    },
    'mappings': {
          'properties': {
              'Title': {'type': 'text'},
              'Genre': {'type': 'keyword'},
              'Tags' : {'type': 'keyword'},
              'Languages': {'type': 'keyword'},
              'Series or Movie' : {'type' : 'keyword'},
              'Director' : {'type': 'text'},
              'Writer' : {'type': 'text'},
              'Actors' : {'type': 'text'},
              'Release Date' : {'type': 'integer'},
              'Summary' : {'type': 'text'},
              'Average Score' : {'type': 'double'}  
          }
    }
}

In [43]:
index_name = 'netflix'

es.indices.delete(index=index_name, ignore=[400, 404])  # useful during development to reset the index
es.indices.create(index=index_name, body=index_body)

{'acknowledged': True, 'index': 'netflix', 'shards_acknowledged': True}

In [44]:
for title, genre, tags, languages, series_or_movie, director, writer, actors, release_date, summary, avg_score in film_list:
  list_body = {
      'Title': title,
      'Genre': genre,
      'Tags' : tags,
      'Languages': languages,
      'Series or Movie' : series_or_movie,
      'Director' : director,
      'Writer' : writer,
      'Actors' : actors,
      'Release Date' : release_date,
      'Summary' : summary,
      'Average Score' : avg_score
  }
  es.index(index=index_name, body=list_body)
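
The loop above sends one HTTP request per document. A possible alternative, sketched here under the assumption that the 7.x Python client is installed, is to hand all documents to `elasticsearch.helpers.bulk`. The function names `generate_actions` and `bulk_index` are hypothetical and not part of the original notebook.

```python
def generate_actions(rows, index_name, fields):
    """Yield one bulk action per row, pairing each value with its field name."""
    for row in rows:
        yield {"_index": index_name, "_source": dict(zip(fields, row))}


def bulk_index(es, rows, index_name, fields):
    """Send all rows in batches; returns (number_indexed, errors)."""
    # Imported lazily so generate_actions can be used without the client installed
    from elasticsearch import helpers
    return helpers.bulk(es, generate_actions(rows, index_name, fields))
```

In this notebook it could be called as `bulk_index(es, film_list, index_name, list(films.columns))`, which relies on the column order of `films` matching the order of values in each row of `film_list`.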

In [45]:
es.cat.count(index='netflix',h=['count'])

'13394\n'

In [46]:
es.search(index="netflix", body={"query": {"match_all": {}}})

{'_shards': {'failed': 0, 'skipped': 0, 'successful': 1, 'total': 1},
 'hits': {'hits': [{'_id': 'Behu33oBXeShiYZJz4_9',
    '_index': 'netflix',
    '_score': 1.0,
    '_source': {'Actors': ['Takumi Kitamura',
      ' Masahiro Higashide',
      ' Mackenyu',
      ' Aoi Morikawa'],
     'Average Score': 0.685,
     'Director': ['Eiichirô Hasumi'],
     'Genre': ['action'],
     'Languages': ['Japanese'],
     'Release Date': '2018',
     'Series or Movie': 'Movie',
     'Summary': ['risk-tak',
      'driver',
      'rivalri',
      'steadi',
      'mechan',
      'brother',
      'threaten',
      'team',
      'ultim',
      'goal',
      'win',
      'place',
      'world',
      'ralli',
      'championship',
      '.'],
     'Tags': ['action', '&', 'adventur', 'japanes', 'movi'],
     'Title': 'Over Drive',
     'Writer': ['nan']},
    '_type': '_doc'},
   {'_id': 'Buhu33oBXeShiYZJ0I8R',
    '_index': 'netflix',
    '_score': 1.0,
    '_source': {'Actors': ['Sophia Lillis',
      '

In [47]:
es.search(index="netflix", body={"query":{'term': {'Genre' : 'crime'}} })

{'_shards': {'failed': 0, 'skipped': 0, 'successful': 1, 'total': 1},
 'hits': {'hits': [{'_id': 'GOhu33oBXeShiYZJ0I_n',
    '_index': 'netflix',
    '_score': 2.6023657,
    '_source': {'Actors': ['Damien Bonnard',
      ' Laure Calamy',
      ' Denis Ménochet',
      ' Nadia Tereszkiewicz'],
     'Average Score': 0.71,
     'Director': ['Dominik Moll'],
     'Genre': ['crime', 'drama', 'mysteri', 'thriller'],
     'Languages': ['French'],
     'Release Date': '2019',
     'Series or Movie': 'Movie',
     'Summary': ['woman',
      'mysteri',
      'disappear',
      'dure',
      'snowstorm',
      'franc',
      'connect',
      'live',
      'unravel',
      'secret',
      'five',
      'peopl',
      'two',
      'contin',
      '.'],
     'Tags': ['drama',
      'crime',
      'film',
      'crime',
      'drama',
      'film',
      'base',
      'book',
      'mysteri',
      'french',
      'film'],
     'Title': 'Only the Animals',
     'Writer': ['Gilles Marchand', ' Domini

In [48]:
es.search(index="netflix", body={"query":{'match': {'Director' : 'Tarantino'}} })

{'_shards': {'failed': 0, 'skipped': 0, 'successful': 1, 'total': 1},
 'hits': {'hits': [{'_id': 'iOhu33oBXeShiYZJ1Y_j',
    '_index': 'netflix',
    '_score': 7.3616014,
    '_source': {'Actors': ['Tim Roth',
      ' Amanda Plummer',
      ' Laura Lovelace',
      ' John Travolta'],
     'Average Score': 0.8025,
     'Director': ['Quentin Tarantino'],
     'Genre': ['crime', 'drama'],
     'Languages': ['English', ' Spanish', ' French'],
     'Release Date': '1994',
     'Series or Movie': 'Movie',
     'Summary': ['shelter',
      'young',
      'woman',
      'becom',
      'enamor',
      'struggl',
      'writer',
      'goe',
      'great',
      'length',
      'becom',
      'involv',
      'creativ',
      'process',
      '.'],
     'Tags': ['psycholog',
      'thriller',
      'independ',
      'film',
      'thriller',
      'indonesian',
      'film'],
     'Title': 'Fiction.',
     'Writer': ['Quentin Tarantino', ' Roger Avary']},
    '_type': '_doc'},
   {'_id': 'sehv33o

In [49]:
query_body = {
    'query' :{
        'match' :{
            'Title' : 'Lets Fight Ghost'
        }
    }
}

explain=True

results = es.search(index=index_name, body=query_body, explain=explain)['hits']['hits']
for hit in results:
  print('Title: {} - Director: {} - score: {}'.format(hit['_source']['Title'], hit['_source']['Director'], hit['_score']))

Title: Lets Fight Ghost - Director: ['Tomas Alfredson'] - score: 20.152025
Title: Lets Dance - Director: ['Tomas Alfredson'] - score: 8.814203
Title: Lets Dance - Director: ['Ladislas Chollat'] - score: 8.814203
Title: Fist Fight - Director: ['Richie Keen'] - score: 7.9699717
Title: Ghost - Director: ['Jerry Zucker'] - score: 7.805466
Title: Lets Eat 2 - Director: ['nan'] - score: 7.6067967
Title: Lets Be Cops - Director: ['Luke Greenfield'] - score: 7.6067967
Title: Ultra Fight Victory - Director: ['nan'] - score: 6.8782115
Title: S.W.A.T.: Fire Fight - Director: ['Benny Boom'] - score: 6.8782115
Title: Ghost Buddies - Director: ['Simon Sek'] - score: 6.5665264
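
The search above is run with `explain=True`, but the loop only prints the final score. With that flag each hit also carries an `_explanation` tree of `value`, `description` and `details` entries. The helper below is a sketch for printing such a tree; the name `format_explanation` is made up, and the sample dictionary is a hand-written stand-in with illustrative numbers, not real output.

```python
def format_explanation(node, depth=0):
    """Flatten an explanation tree into indented 'value  description' lines."""
    lines = ["{}{:.4f}  {}".format("  " * depth, node["value"], node["description"])]
    for child in node.get("details", []):
        lines.extend(format_explanation(child, depth + 1))
    return lines


# Hand-written stand-in for hit['_explanation']; values are illustrative only
sample = {
    "value": 3.0,
    "description": "sum of:",
    "details": [
        {"value": 1.0, "description": "weight(Title:ghost)", "details": []},
        {"value": 2.0, "description": "weight(Title:fight)", "details": []},
    ],
}
print("\n".join(format_explanation(sample)))
```

Inside the loop over `results`, `print("\n".join(format_explanation(hit["_explanation"])))` would show how each query term contributes to a hit's score.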


**Compound query search:**

In [56]:
es.search(index="netflix", body={"query":{'bool':{'must':[{'match': {'Director' : 'Tarantino'}}, {'match': {'Genre': 'crime'}}] }}})

{'_shards': {'failed': 0, 'skipped': 0, 'successful': 1, 'total': 1},
 'hits': {'hits': [{'_id': 'iOhu33oBXeShiYZJ1Y_j',
    '_index': 'netflix',
    '_score': 9.963967,
    '_source': {'Actors': ['Tim Roth',
      ' Amanda Plummer',
      ' Laura Lovelace',
      ' John Travolta'],
     'Average Score': 0.8025,
     'Director': ['Quentin Tarantino'],
     'Genre': ['crime', 'drama'],
     'Languages': ['English', ' Spanish', ' French'],
     'Release Date': '1994',
     'Series or Movie': 'Movie',
     'Summary': ['shelter',
      'young',
      'woman',
      'becom',
      'enamor',
      'struggl',
      'writer',
      'goe',
      'great',
      'length',
      'becom',
      'involv',
      'creativ',
      'process',
      '.'],
     'Tags': ['psycholog',
      'thriller',
      'independ',
      'film',
      'thriller',
      'indonesian',
      'film'],
     'Title': 'Fiction.',
     'Writer': ['Quentin Tarantino', ' Roger Avary']},
    '_type': '_doc'},
   {'_id': 'EOhv33oB
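
In the compound query above both clauses sit under `must`, so the genre match also adds to the relevance score. If the genre is only meant to narrow the results, one option is to move it into the `filter` clause of the `bool` query, which does not affect scoring. The builder below is a sketch of that variant; the function name `director_in_genre` is hypothetical.

```python
def director_in_genre(director, genre):
    """Build a bool query: full-text match on Director, exact filter on Genre."""
    return {
        "query": {
            "bool": {
                # Scored: how well the director name matches
                "must": [{"match": {"Director": director}}],
                # Not scored: Genre is a keyword field, so the stemmed,
                # lower-cased form stored in the index must be passed
                "filter": [{"term": {"Genre": genre}}],
            }
        }
    }
```

It would be used as `es.search(index="netflix", body=director_in_genre("Tarantino", "crime"))`. The hits should be the same documents as before, but ranked by the director match alone.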