In [1]:
import pandas as pd
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.feature_extraction.text import CountVectorizer
from nltk.tokenize import word_tokenize

pd.set_option('display.max_colwidth', 50)
movies = pd.read_csv("data/movies_processed.csv")
movies.fillna(' ', inplace=True)

In [2]:
len(movies)

4807

In [3]:
movies.head(2)

Unnamed: 0,genres,keywords,overview,popularity,release_date,tagline,title,vote_average,vote_count,cast,director,all_keywords
0,action adventure fantasy science fiction,culture clash future space war space colony so...,in the 22nd century a paraplegic marine is dis...,150.437577,10-12-2009,enter the world of pandora,Avatar,7.2,11800,sam_worthington zoe_saldana sigourney_weaver s...,james_cameron,cultur clash futur space war space coloni soci...
1,adventure fantasy action,ocean drug abuse exotic island east india trad...,captain barbossa long believed to be dead has ...,139.082615,19-05-2007,at the end of the world the adventure begins,Pirates of the Caribbean: At World's End,6.9,4500,johnny_depp orlando_bloom keira_knightley stel...,gore_verbinski,ocean drug abus exot island east india trade c...


## Section 0: General Information.
Below we will present the functions and functionality of the recommendation system.

- For implementation simplicity and code readability we assume the inputs to the functions are correct and perform no checks if invalid function arguments are passed.
- As per instructions this code focuses on functionality and not visual appeal.
- When using the cosine similarity metric we assume that the user input has already been converted into a value that exists in the dataframe (e.g. `Pirates of the Caribbean worlds end` is mapped by external code to the orthographically correct title `Pirates of the Caribbean: At World's End`).

In [4]:
movies[movies.title == "Pirates of the Caribbean: At World's End"]

Unnamed: 0,genres,keywords,overview,popularity,release_date,tagline,title,vote_average,vote_count,cast,director,all_keywords
1,adventure fantasy action,ocean drug abuse exotic island east india trad...,captain barbossa long believed to be dead has ...,139.082615,19-05-2007,at the end of the world the adventure begins,Pirates of the Caribbean: At World's End,6.9,4500,johnny_depp orlando_bloom keira_knightley stel...,gore_verbinski,ocean drug abus exot island east india trade c...


## Section 1: Recommend by numeric column

### Recommend movies with the highest rating

The function that will be used for this section is `recommend_by_column()`:
```
Arguments:
    df: The movies dataframe
    by: Which numeric column to sort by (e.g. popularity or vote_average)
    n_values: How many of the top movies to return. If -1 then return every result.
    ascending: Default option is to rank from highest to lowest (False)
```

In [5]:
def recommend_by_column(df, by, n_values=5, ascending=False):
    df_sorted = df.sort_values(by=by, ascending=ascending)
    df_sorted = df_sorted[["title", by]]
    if n_values == -1:
        return df_sorted
    return df_sorted.head(n_values)

#### Recommend the 10 movies with the highest ratings in descending order

In [6]:
recommend_by_column(df=movies, by="vote_average", n_values=10, ascending=False)

Unnamed: 0,title,vote_average
4253,Me You and Five Bucks,10.0
3524,Stiff Upper Lips,10.0
4051,"Dancer, Texas Pop. 81",10.0
4666,Little Big Top,10.0
3998,Sardaarji,9.5
2392,One Man's Hero,9.3
1887,The Shawshank Redemption,8.5
2975,There Goes My Baby,8.5
3342,The Godfather,8.4
2802,The Prisoner of Zenda,8.4


### Get ALL movies ranked by popularity in descending order
Sometimes movies with a perfect `vote_average` score are movies with very few ratings. Therefore a better way to rank movies is by popularity. For this reason we will be using only `popularity` as a sorting metric from now on.
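
The effect described above can be reproduced on a few invented rows (toy data, not taken from the dataset). The `min_votes` threshold is a hypothetical alternative remedy shown only for comparison; this notebook simply switches to `popularity` instead.

```python
import pandas as pd

# Toy rows invented for illustration only.
toy = pd.DataFrame({
    "title": ["Obscure Film", "Blockbuster A", "Blockbuster B"],
    "vote_average": [10.0, 8.5, 8.4],
    "vote_count": [1, 8000, 6000],
})

# Sorting by vote_average alone ranks the single-vote film first.
by_rating = toy.sort_values(by="vote_average", ascending=False)

# Hypothetical remedy: require a minimum number of votes before ranking.
min_votes = 100  # assumed threshold, not part of the notebook
filtered = toy[toy["vote_count"] >= min_votes].sort_values(by="vote_average", ascending=False)

print(by_rating["title"].tolist())
print(filtered["title"].tolist())
```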

In [7]:
recommend_by_column(df=movies, by="popularity", n_values=-1, ascending=False)

Unnamed: 0,title,popularity
546,Minions,875.581305
95,Interstellar,724.247784
788,Deadpool,514.569956
94,Guardians of the Galaxy,481.098624
127,Mad Max: Fury Road,434.278564
...,...,...
4514,Love Letters,0.001586
4629,Midnight Cabaret,0.001389
4124,Hum To Mohabbat Karega,0.001186
4731,Penitentiary,0.001117


## Section 2: Recommend by people

### Part 1
The function that will be used for this section is `recommend_by_director()`:
```
Arguments:
    df: The movies dataframe
    n_values: How many movies to return. -1 to return every movie
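    director: The name of the director
    by: Sort movies by popularity or vote_average. None to keep the dataframe order.
    trim: If True return only the title (and the sort column, when given).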
```

In [8]:
def recommend_by_director(df, n_values=5, director=None, by=None, trim=True):
    movies_from_director = df[df["director"] == director]

    # Sort first (highest value on top) so that trim=False keeps every column.
    if by is not None:
        movies_from_director = movies_from_director.sort_values(by=by, ascending=False)

    if trim:
        columns = ["title"] if by is None else ["title", by]
        movies_from_director = movies_from_director[columns]

    if n_values == -1:
        return movies_from_director
    return movies_from_director.head(n_values)

#### Get the 3 most popular movies directed by `gore_verbinski`

In [9]:
recommend_by_director(df=movies, n_values=3, director="gore_verbinski", by="popularity")

Unnamed: 0,title,popularity
199,Pirates of the Caribbean: The Curse of the Bla...,271.972889
12,Pirates of the Caribbean: Dead Man's Chest,145.847379
1,Pirates of the Caribbean: At World's End,139.082615


#### Get all movies directed by `james_cameron`, without sorting

In [10]:
recommend_by_director(df=movies, n_values=-1, director="james_cameron")

Unnamed: 0,title
0,Avatar
25,Titanic
279,Terminator 2: Judgment Day
282,True Lies
587,The Abyss
2409,Aliens
3444,The Terminator


### Part 2

Another function in the "people" section is `recommend_by_cast()`:
```
Arguments:
    df: The movies dataframe.
    n_values: How many movies to return. -1 to return every movie.
    cast: A list of strings with the actor names.
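    by: Column to sort the filtered dataframe by (e.g. popularity). None to skip sorting.
    ascending: Sort in ascending order or not.
    strict: True if the recommended movies should contain ALL actors listed in cast. False by default.
    trim: If True return only the title and popularity columns.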
```

In [11]:
def recommend_by_cast(df=None, n_values=-1, cast=None, by=None, ascending=False, strict=False, trim=True):
    if strict:
        df_filtered = df[df['cast'].apply(lambda x: all(actor in x for actor in cast))]
    else:
        df_filtered = df[df['cast'].apply(lambda x: any(actor in x for actor in cast))]

    if by is not None:
        df_filtered = df_filtered.sort_values(by=by, ascending=ascending)
    
    if trim:
        df_filtered = df_filtered[["title", "popularity"]]
    if n_values == -1:
        return df_filtered
    return df_filtered.head(n_values)
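
To make the `strict` switch concrete, here is a minimal sketch on invented cast strings (the film and actor names are placeholders). It also shows that `in` on a string is a substring test, so a partial name matches as well.

```python
import pandas as pd

# Placeholder data, invented for illustration only.
toy = pd.DataFrame({
    "title": ["Film A", "Film B", "Film C"],
    "cast": ["actor_one actor_two", "actor_two actor_three", "actor_four"],
})
wanted = ["actor_one", "actor_two"]

# strict=True: every requested name must appear in the cast string.
strict_mask = toy["cast"].apply(lambda x: all(actor in x for actor in wanted))
# strict=False: at least one requested name must appear.
loose_mask = toy["cast"].apply(lambda x: any(actor in x for actor in wanted))

strict_titles = toy.loc[strict_mask, "title"].tolist()
loose_titles = toy.loc[loose_mask, "title"].tolist()

# Substring matching: the fragment "three" is enough to match "actor_three".
partial_titles = toy.loc[toy["cast"].apply(lambda x: "three" in x), "title"].tolist()

print(strict_titles, loose_titles, partial_titles)
```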

#### Get all movies with `chris_pratt` sorted by `popularity` in descending order.

We select actors like `Chris Pratt` and `Johnny Depp`, who have appeared in popular franchises, so that the output is easy to sanity-check.

In [12]:
cast=["chris_pratt"]
recommend_by_cast(df=movies, n_values=-1, cast=cast, by="popularity", ascending=False, strict=False)

Unnamed: 0,title,popularity
94,Guardians of the Galaxy,481.098624
28,Jurassic World,418.708552
512,Wanted,73.82289
744,The Lego Movie,59.547928
2003,Her,53.682367
928,Moneyball,46.180421
884,Zero Dark Thirty,38.306954
3058,Movie 43,35.350303
1582,Bride Wars,33.385429
2483,Jennifer's Body,32.257414


#### Get 5 movies with `chris_pratt` and `johnny_depp` sorted by `popularity` in descending order.

In [13]:
cast=["chris_pratt", "johnny_depp"]
recommend_by_cast(df=movies, n_values=5, cast=cast, by="popularity", ascending=False, strict=False)

Unnamed: 0,title,popularity
94,Guardians of the Galaxy,481.098624
28,Jurassic World,418.708552
199,Pirates of the Caribbean: The Curse of the Bla...,271.972889
12,Pirates of the Caribbean: Dead Man's Chest,145.847379
1,Pirates of the Caribbean: At World's End,139.082615


#### Get all movies where `chris_pratt` and `johnny_depp` have co-starred (`strict=True`). This should result in an empty dataframe since these actors have not co-starred in any movie of the dataset.

In [14]:
cast=["chris_pratt", "johnny_depp"]
recommend_by_cast(df=movies, n_values=5, cast=cast, by="popularity", ascending=False, strict=True)

Unnamed: 0,title,popularity


#### Get all movies with `jennifer_lawrence` **and** `liam_hemsworth` **and** `julianne_moore` (`strict=True`) ranked by popularity.

We will select actors who played in `The Hunger Games` just to test that the output is correct.

In [15]:
cast=["jennifer_lawrence", "liam_hemsworth", "julianne_moore"]
recommend_by_cast(df=movies, n_values=-1, cast=cast, by="popularity", ascending=False, strict=True)

Unnamed: 0,title,popularity
200,The Hunger Games: Mockingjay - Part 1,206.227151
102,The Hunger Games: Mockingjay - Part 2,127.284427


#### Get all movies with `jennifer_lawrence` **or** `liam_hemsworth` **or** `julianne_moore` (`strict=False`) ranked by popularity.

In [16]:
cast=["jennifer_lawrence", "liam_hemsworth", "julianne_moore"]
recommend_by_cast(df=movies, n_values=-1, cast=cast, by="popularity", ascending=False, strict=False)

Unnamed: 0,title,popularity
200,The Hunger Games: Mockingjay - Part 1,206.227151
64,X-Men: Apocalypse,139.272042
102,The Hunger Games: Mockingjay - Part 2,127.284427
46,X-Men: Days of Future Past,118.078691
930,Non-Stop,83.295796
183,The Hunger Games: Catching Fire,76.310119
426,The Hunger Games,68.550698
1619,Carrie,63.848541
331,Seventh Son,63.628459
2077,Silver Linings Playbook,63.599973


## Section 3: Get movie recommendations by genres.

Note that the valid genres in this dataset are:
```action, adventure, animation, comedy, crime, documentary, drama, family, fantasy, foreign, history, horror, music, mystery, romance, science fiction, thriller, tv movie, war, western```

In [17]:
def recommend_by_genre(df=None, n_values=-1, genres=None, by=None, ascending=False, strict=False, trim=True):
    if strict:
        df_filtered = df[df['genres'].apply(lambda x: all(genre in x for genre in genres))]
    else:
        df_filtered = df[df['genres'].apply(lambda x: any(genre in x for genre in genres))]

    if by is not None:
        df_filtered = df_filtered.sort_values(by=by, ascending=ascending)

    if trim:
        df_filtered = df_filtered[["title", "popularity", "genres"]]

    if n_values == -1:
        return df_filtered
    return df_filtered.head(n_values)

#### Get the first 5 movies with the genres `action` or `war` (`strict=False`, meaning a movie does not need to belong to every genre passed in `genres`), ranked by popularity. We see that the results make sense.

In [18]:
genres = ["action", "war"]
recommend_by_genre(df=movies, n_values=5, genres=genres, by="popularity", ascending=False, strict=False)

Unnamed: 0,title,popularity,genres
788,Deadpool,514.569956,action adventure comedy
94,Guardians of the Galaxy,481.098624,action science fiction adventure
127,Mad Max: Fury Road,434.278564,action adventure science fiction thriller
28,Jurassic World,418.708552,action adventure science fiction thriller
199,Pirates of the Caribbean: The Curse of the Bla...,271.972889,adventure fantasy action


#### Get the first 5 movies that belong to both categories (`strict=True`) sorted by popularity.

In [19]:
genres = ["action", "war"]
recommend_by_genre(df=movies, n_values=5, genres=genres, by="popularity", ascending=False, strict=True)

Unnamed: 0,title,popularity,genres
456,Fury,139.575085,war drama action
790,American Sniper,87.53437,war action
571,Inglourious Basterds,72.595961,drama action thriller war
253,300: Rise of an Empire,71.510596,action war
687,300,65.197968,action adventure war


## Section 3: Recommend by release date

It would be reasonable to assume that some users would like to sort movies by date. Therefore in this section we will use the function `recommend_by_date()`:

```
Arguments:
    df: The movies dataframe.
    n_values: How many movies to return. -1 to return every movie.
    start: String containing the start (or reference) date in format "dd-mm-yyyy".
    end: String containing the end date in format "dd-mm-yyyy". Only used by 'between'.
    filter_type: Acceptable values: 'before', 'on', 'after', 'between'. The 'between' option requires both 'start' and 'end'.
    by: Column to sort the filtered dataframe by (e.g. popularity).
    ascending: Sort in ascending order or not.
```

In [20]:
def recommend_by_date(df=None, n_values=-1, start=None, end=None, filter_type=None, by=None, ascending=False, trim=True):
    # Parse the dates on a new dataframe so the caller's dataframe is not modified.
    df = df.assign(release_date=pd.to_datetime(df['release_date'], format='%d-%m-%Y'))
    start = pd.to_datetime(start, format='%d-%m-%Y')

    if filter_type == 'before':
        df = df[df['release_date'] < start]
    elif filter_type == 'after':
        df = df[df['release_date'] > start]
    elif filter_type == 'on':
        df = df[df['release_date'] == start]
    elif filter_type == "between":
        if end is None:
            print("Must provide an ending date.")
            return
        end = pd.to_datetime(end, format='%d-%m-%Y')
        # Both bounds are inclusive.
        df = df[(df['release_date'] >= start) & (df['release_date'] <= end)]

    if by is not None:
        df = df.sort_values(by=by, ascending=ascending)

    if trim:
        df = df[["title", "release_date", "popularity"]]

    if n_values == -1:
        return df
    return df.head(n_values)
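
The date handling can be tried in isolation on a few placeholder rows (invented titles and dates). This is only a sketch of the parsing and comparison steps; `Series.between` is shown as one possible shorthand for the inclusive range check and is not used by the function above.

```python
import pandas as pd

# Placeholder rows, invented for illustration only.
toy = pd.DataFrame({
    "title": ["Old", "Middle", "New"],
    "release_date": ["15-10-1999", "10-12-2009", "25-04-2012"],
})

# Parse "dd-mm-yyyy" strings into timestamps.
dates = pd.to_datetime(toy["release_date"], format="%d-%m-%Y")
start = pd.to_datetime("01-01-2009", format="%d-%m-%Y")
end = pd.to_datetime("31-12-2011", format="%d-%m-%Y")

before = toy[dates < start]["title"].tolist()
between = toy[(dates >= start) & (dates <= end)]["title"].tolist()
# Series.between is inclusive on both ends by default.
between_alt = toy[dates.between(start, end)]["title"].tolist()

print(before, between, between_alt)
```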

#### Get 10 movies released before 2009 ranked by popularity in descending order

In [21]:
recommend_by_date(df=movies, n_values=10, start='01-01-2009', end=None, filter_type='before', by="popularity", ascending=False)

Unnamed: 0,title,release_date,popularity
199,Pirates of the Caribbean: The Curse of the Bla...,2003-07-09,271.972889
65,The Dark Knight,2008-07-16,187.322927
662,Fight Club,1999-10-15,146.757391
12,Pirates of the Caribbean: Dead Man's Chest,2006-06-20,145.847379
3342,The Godfather,1972-03-14,143.659698
1,Pirates of the Caribbean: At World's End,2007-05-19,139.082615
809,Forrest Gump,1994-07-06,138.133331
262,The Lord of the Rings: The Fellowship of the Ring,2001-12-18,138.049577
1887,The Shawshank Redemption,1994-09-23,136.747729
276,Harry Potter and the Chamber of Secrets,2002-11-13,132.397737


#### Get 10 movies released between 2009 and 2013 ranked by popularity in descending order

In [22]:
recommend_by_date(df=movies, n_values=10, start="01-01-2009", end="31-12-2013", filter_type='between', by="popularity", ascending=False)

Unnamed: 0,title,release_date,popularity
96,Inception,2010-07-14,167.58371
124,Frozen,2013-11-27,165.125366
0,Avatar,2009-12-10,150.437577
16,The Avengers,2012-04-25,144.448633
335,Rise of the Planet of the Apes,2011-08-03,138.433168
506,Despicable Me 2,2013-06-25,136.886704
17,Pirates of the Caribbean: On Stranger Tides,2011-05-14,135.413856
55,Brave,2012-06-21,125.114374
614,Despicable Me,2010-07-08,113.858273
3,The Dark Knight Rises,2012-07-16,112.31295


## Section 4: Cosine Similarity

It is reasonable to require more sophisticated methods to calculate the similarity between movies. Using individual columns to make recommendations is useful but only for simple input queries and not for advanced predictions.

Cosine similarity measures how similar two non-zero vectors of an inner product space are. It is calculated as the cosine of the angle between the two vectors. Given two n-dimensional attribute vectors $A$ and $B$, the cosine similarity $\cos(\theta)$ is expressed through the dot product and the magnitudes as:

$$
\cos(\theta) = \frac{A \cdot B}{\|A\| \, \|B\|}
$$
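
As a worked example of the formula, the sketch below computes the cosine similarity of small hand-picked vectors in plain Python. The helper name `cosine` and the vectors are illustrative assumptions; the notebook itself relies on sklearn's `cosine_similarity()`.

```python
import math

def cosine(a, b):
    """Cosine similarity of two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Two keyword-count vectors sharing one of their two keywords:
# dot product = 1, both magnitudes = sqrt(2), so the similarity is 1/2.
shared_one = cosine([1, 1, 0], [1, 0, 1])
# Vectors pointing in the same direction have similarity 1.
same_direction = cosine([1, 2, 0], [2, 4, 0])
# Vectors with no keyword in common have similarity 0.
no_overlap = cosine([1, 0, 0], [0, 1, 1])

print(shared_one, same_direction, no_overlap)
```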

First, convert the tokenized keywords, which contain the information from the string columns of the dataframe (see `01_preprocess.ipynb`), into token-count vectors using `CountVectorizer`.

In [23]:
cv = CountVectorizer(stop_words="english")
cv.fit(movies["keywords"])
for word in cv.vocabulary_:
    print(word)

culture
clash
future
space
war
colony
society
travel
futuristic
romance
alien
tribe
planet
cgi
marine
soldier
battle
love
affair
anti
power
relations
mind
soul
3d
ocean
drug
abuse
exotic
island
east
india
trading
company
life
traitor
shipwreck
strong
woman
ship
alliance
calypso
afterlife
fighter
pirate
swashbuckler
aftercreditsstinger
spy
based
novel
secret
agent
sequel
mi6
british
service
united
kingdom
dc
comics
crime
terrorist
identity
burglar
hostage
drama
time
bomb
gotham
city
vigilante
cover
superhero
villainess
tragic
hero
terrorism
destruction
catwoman
cat
imax
flood
criminal
underworld
batman
mars
medallion
princess
steampunk
martian
escape
edgar
rice
burroughs
race
superhuman
strength
civilization
sword
19th
century
dual
amnesia
sandstorm
forgiveness
spider
wretch
death
friend
egomania
sand
narcism
hostility
marvel
comic
revenge
magic
horse
fairy
tale
musical
animation
tower
blonde
selfishness
healing
duringcreditsstinger
gift
animal
sidekick
book
vision
team
cinematic
univer

In [24]:
for item in cv.get_feature_names_out():
    print(item)

11
15th
16th
17th
18th
1910s
1917
1920s
1930s
1940s
1950s
1960s
1970s
1980s
1990s
1992
1995
19th
2000
2001
2002
2079
20th
21st
25th
2nd
3d
51
60s
66
95
abandoned
abandonment
abduction
ability
abolitionist
aboriginal
aborigine
abortion
abraham
abroad
absorbing
absurdism
abuse
abusive
academy
acapella
accent
acceptance
accepting
accident
accidental
accountant
accusal
accusations
accused
acid
acting
action
activism
activist
activity
actor
actors
actress
ad
adaptation
addict
addicted
addiction
address
adhd
admiration
admissions
adolescence
adolescent
adolf
adopted
adoption
adoptive
adrenalin
adrenaline
adult
adultery
adultress
advancement
adventure
adventurer
adversary
advertising
advice
adviser
advisor
aerial
aerialist
aerobics
affair
affairs
affection
afghanistan
africa
african
aftercreditsstinger
afterlife
age
aged
agency
agent
ager
ages
aggression
aggressive
aging
agnostic
agreement
agriculture
aid
aids
ailul
air
aircraft
aires
airline
airplane
airport
airship
al
alabama
alamo
alan
ala

In [25]:
keywords_vectorized = cv.transform(movies["keywords"])

In [26]:
keywords_vectorized

<4807x7069 sparse matrix of type '<class 'numpy.int64'>'
	with 48916 stored elements in Compressed Sparse Row format>

Then use the `cosine_similarity()` function from sklearn to compute the similarity between every pair of movies. This produces a symmetric matrix of size 4807x4807, where cell (i, j) contains the similarity between movie i and movie j.

In [27]:
cos_similarity_distances = cosine_similarity(keywords_vectorized)
print(f"Cosine Similarity Matrix occupies {cos_similarity_distances.nbytes / 2**20:.2f} MB of memory.")
cos_similarity_distances.shape

Cosine Similarity Matrix occupies 176.29 MB of memory.


(4807, 4807)
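
The properties of this matrix can be checked on a tiny example. The sketch below uses three invented keyword strings and also reproduces the reported memory footprint, assuming the matrix holds 8-byte floats.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Invented keyword strings, for illustration only.
docs = ["space war colony", "space pirate", "ocean ship"]
counts = CountVectorizer().fit_transform(docs)
sim = cosine_similarity(counts)

# The matrix is symmetric and every document is fully similar to itself.
is_symmetric = np.allclose(sim, sim.T)
diagonal_is_one = np.allclose(np.diag(sim), 1.0)

# Size of a dense 4807 x 4807 matrix of 8-byte floats, in MB.
expected_mb = round(4807 * 4807 * 8 / 2**20, 2)

print(is_symmetric, diagonal_is_one, expected_mb)
```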

### Recommend movies similar to a given title
The function that will be used here is `recommend_similar_movies()`:
```
Arguments:
    df: The movies dataframe.
    title: The title of a movie.
    similarity_matrix: The matrix returned by cosine_similarity().
    n_values: How many movies to return. -1 returns every other movie in the dataframe sorted by descending similarity.
    by: Sort the movies with the highest similarity rating by popularity.
    ascending: Sort in ascending order or not.
```

In [28]:
def recommend_similar_movies(df=None, title=None, similarity_matrix=None, n_values=5, by=None, ascending=False, trim=True):
    # .item() raises ValueError unless exactly one row matches the title.
    # The row label is used as a position below, so df must keep its default RangeIndex.
    movie_index = df[df.title == title].index.item()
    # Rank by descending similarity; ties are broken by descending index.
    similarity = sorted(enumerate(similarity_matrix[movie_index]), key=lambda x: (x[1], x[0]), reverse=True)

    # Drop the query movie explicitly instead of assuming it is ranked first.
    indices = [i for i, _ in similarity if i != movie_index]
    if n_values != -1:
        indices = indices[:n_values]

    similar_movies = df.iloc[indices]
    if by is not None:
        # sort_values returns a new dataframe, so the result must be assigned.
        similar_movies = similar_movies.sort_values(by=by, ascending=ascending)

    if trim:
        similar_movies = similar_movies[["title"]]

    return similar_movies
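
The ranking step can be illustrated on a single made-up row of a similarity matrix. This sketch uses `np.argsort` as one possible alternative to sorting enumerated tuples; note that it breaks ties by ascending index, which may differ from the function above.

```python
import numpy as np

# Made-up similarities of movie 1 to movies 0..4 (movie 1 itself scores 1.0).
sim_row = np.array([0.2, 1.0, 0.7, 0.0, 0.7])
movie_index = 1

# Stable sort on the negated values gives descending similarity.
order = np.argsort(-sim_row, kind="stable")
# Remove the query movie explicitly rather than assuming it comes first.
order = order[order != movie_index]

top3 = order[:3].tolist()
print(top3)
```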

#### Get the 5 movies most similar to `Pirates of the Caribbean: At World's End`.
As expected, the top recommendations are movies from the same franchise. The cosine similarity metric has also captured the "pirate" theme, since it recommends `Cutthroat Island`. Since the Pirates of the Caribbean movies contain keywords related to romance, we also get romance movies.

In [29]:
title = "Pirates of the Caribbean: At World's End"
recommend_similar_movies(df=movies, title=title, similarity_matrix=cos_similarity_distances, n_values=5)

Unnamed: 0,title
12,Pirates of the Caribbean: Dead Man's Chest
199,Pirates of the Caribbean: The Curse of the Bla...
340,Cutthroat Island
3175,Down to You
3260,Half Baked


#### Get 15 recommendations on `Iron Man`.
The recommendations seem reasonable since every movie except `X-Men` and `X2` belongs to the Marvel Cinematic Universe, and those two are still Marvel superhero movies. Since `Iron Man` is a superhero movie, it is reasonable to get almost exclusively movies with this theme.

In [30]:
title = "Iron Man"
recommend_similar_movies(df=movies, title=title, similarity_matrix=cos_similarity_distances, n_values=15)

Unnamed: 0,title
79,Iron Man 2
182,Ant-Man
31,Iron Man 3
16,The Avengers
126,Thor: The Dark World
7,Avengers: Age of Ultron
85,Captain America: The Winter Soldier
26,Captain America: Civil War
511,X-Men
203,X2


## Section 5: Combine all the above techniques into a final function

The recommendation system functionality is implemented with the `recommend()` function:
```
Arguments:
    df: The movies dataframe.
    title: The title of a movie.
    director: The name of the director of the movies we want to find (e.g. james_cameron).
    cast: A list with the name(s) of the cast members.
    genres: List of genres to include.
    ascending: Sort in ascending order or not.
    n_values: Number of movies to return. -1 to get every result.
    strict: The cast and genres must all be included in the resulting movies.
```
Explanation:
- `similar_movies` is initialized to df (the `movies` dataframe)
- If a `title` is provided, find the movies most similar to the movie with that title. If the title does not match a movie exactly, `recommend_similar_movies()` raises an error. Handle that error by setting `similar_movies` to the movies whose titles contain at least one of the tokens of the given title.
- If a `director` is provided then filter the movies with that director.
- If a `cast` list is provided then filter the movies with that cast depending on the `strict` value.
- If a `genres` list is provided then filter the movies by the genres in the `genres` list depending on the `strict` value.
- Return the results. For simplicity this function sorts the results by popularity by default.
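
The title fallback described in the list above can be sketched on a few titles. Two things here are assumptions of the sketch rather than part of the notebook: `str.split()` stands in for `word_tokenize`, and `re.escape` is added so that punctuation in the query is matched literally.

```python
import re
import pandas as pd

titles = pd.Series(["Jason Bourne", "The Bourne Identity", "Alien", "Batman Begins"])
query = "jas bourne"

# Whitespace split stands in for word_tokenize in this sketch.
tokens = query.lower().split()
# A title matches if it contains at least one token as a substring.
pattern = "|".join(re.escape(token) for token in tokens)
matches = titles[titles.str.contains(pattern, case=False)].tolist()

print(matches)
```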

In [31]:
import re

def recommend(df=None, title=None, director=None, cast=None, genres=None, ascending=False, n_values=10, strict=False):
    similar_movies = df
    if title:
        try:
            similar_movies = recommend_similar_movies(df=similar_movies, title=title, similarity_matrix=cos_similarity_distances, n_values=n_values, by=None, ascending=ascending, trim=False)
        except ValueError:
            # Raised by .index.item() when the title does not match exactly one movie.
            print("Not an orthographically correct title. Selecting movies by title similarity...")
            tokens = word_tokenize(title.lower())
            # Escape the tokens so punctuation is matched literally, not as regex syntax.
            pattern = '|'.join(re.escape(token) for token in tokens)
            similar_movies = similar_movies[df['title'].str.contains(pattern, case=False)]

    # The filters receive n_values=-1 so results are truncated only after the final sort.
    if director:
        similar_movies = recommend_by_director(df=similar_movies, by=None, n_values=-1, director=director, trim=False)
    if cast:
        similar_movies = recommend_by_cast(df=similar_movies, n_values=-1, cast=cast, by=None, ascending=ascending, strict=strict, trim=False)
    if genres:
        similar_movies = recommend_by_genre(df=similar_movies, n_values=-1, genres=genres, by=None, ascending=ascending, strict=strict, trim=False)

    similar_movies = similar_movies.sort_values(by="popularity", ascending=False)
    return similar_movies[["title", "genres"]].head(n_values)

### Example 1: Trying to find the `Iron Man` movie *without remembering* the title

Assume we know some relevant information about `Iron Man` but do not remember the title itself. Let's put the recommendation system to use. Assume the user knows the name `Robert Downey`. This query would normally be preprocessed and entered in the `cast` list tokenized, with the punctuation removed and the letters converted to lowercase. Assume we also remember that `Gwyneth Paltrow` and `Samuel Jackson` are part of the cast, so let's include all these names in `cast`.

Additionally, it would be reasonable to assume that the movie belongs to the `science fiction` and `action` genres. So let's include them in `genres`.

Finally we call `recommend()` with the movies dataframe, without a title or a director, and with the `cast` and `genres` lists we initialized. We get the desired results: `The Avengers`, `Iron Man` and `Iron Man 2`. For this query `recommend()` returns the expected movies.

In [32]:
cast=["robert", "downey", "paltrow", "samuel", "jackson"]
genres = ["action", "science", "fiction"]
recommend(df=movies, title=None, director=None, cast=cast, genres=genres, ascending=False, n_values=10, strict=True)

Unnamed: 0,title,genres
16,The Avengers,science fiction action adventure
68,Iron Man,action science fiction adventure
79,Iron Man 2,adventure action science fiction


### Example 2: Trying to find movies by keyword included in title.

It is reasonable to assume that the user will rarely provide a perfectly written title with correct punctuation. Therefore `recommend()` falls back to token matching when no exact title is found. In this example, if the user provides the keyword `pirate`, the system returns the most popular movies whose titles contain that keyword. We see that the results are quite accurate. The user was most likely looking for a `Pirates of the Caribbean` movie.

In [33]:
recommend(df=movies, title="pirate", director=None, cast=None, genres=None, ascending=False, n_values=5, strict=False)

Not an orthographically correct title. Selecting movies by title similarity...


Unnamed: 0,title,genres
199,Pirates of the Caribbean: The Curse of the Bla...,adventure fantasy action
12,Pirates of the Caribbean: Dead Man's Chest,adventure fantasy action
1,Pirates of the Caribbean: At World's End,adventure fantasy action
17,Pirates of the Caribbean: On Stranger Tides,adventure action fantasy
848,The Pirates! In an Adventure with Scientists!,animation adventure family comedy


We can perform this experiment with more keywords...

In [34]:
recommend(df=movies, title="superman", director=None, cast=None, genres=None, ascending=False, n_values=5, strict=False)

Not an orthographically correct title. Selecting movies by title similarity...


Unnamed: 0,title,genres
9,Batman v Superman: Dawn of Justice,action adventure fantasy
10,Superman Returns,adventure fantasy action science fiction
813,Superman,action adventure fantasy science fiction
870,Superman II,action adventure fantasy science fiction
1299,Superman III,comedy action adventure fantasy science fiction


In [35]:
recommend(df=movies, title="batman", director=None, cast=None, genres=None, ascending=False, n_values=5, strict=False)

Not an orthographically correct title. Selecting movies by title similarity...


Unnamed: 0,title,genres
9,Batman v Superman: Dawn of Justice,action adventure fantasy
119,Batman Begins,action crime drama
428,Batman Returns,action fantasy
210,Batman & Robin,action crime fantasy
299,Batman Forever,action crime fantasy


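The system also handles incomplete title inputs, because each token is matched as a substring of the title. This is why `ali` below also matches `The Age of Adaline` and `The Equalizer`.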

In [36]:
recommend(df=movies, title="ali", director=None, cast=None, genres=None, ascending=False, n_values=5, strict=False)

Not an orthographically correct title. Selecting movies by title similarity...


Unnamed: 0,title,genres
3163,Alien,horror action thriller science fiction
1605,The Age of Adaline,fantasy drama romance
821,The Equalizer,thriller action crime
32,Alice in Wonderland,family fantasy adventure
2409,Aliens,horror action thriller science fiction


In [37]:
recommend(df=movies, title="jas bourne", director=None, cast=None, genres=None, ascending=False, n_values=5, strict=False)

Not an orthographically correct title. Selecting movies by title similarity...


Unnamed: 0,title,genres
209,The Bourne Legacy,action thriller
694,The Bourne Identity,action drama mystery thriller
218,Jason Bourne,action thriller
386,The Bourne Supremacy,action drama thriller
180,The Bourne Ultimatum,action drama mystery thriller
