# Import libraries and load data

In [1]:
import pandas as pd
import numpy as np
from search_movie_data import *

In [2]:
movies = pd.read_csv("movies_preprocessed.csv")
movies.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9556 entries, 0 to 9555
Data columns (total 30 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   movieId             9556 non-null   int64  
 1   title               9556 non-null   object 
 2   genres              9556 non-null   object 
 3   Adventure           9556 non-null   int64  
 4   Horror              9556 non-null   int64  
 5   Thriller            9556 non-null   int64  
 6   Crime               9556 non-null   int64  
 7   Musical             9556 non-null   int64  
 8   Documentary         9556 non-null   int64  
 9   Mystery             9556 non-null   int64  
 10  Children            9556 non-null   int64  
 11  Sci-Fi              9556 non-null   int64  
 12  Western             9556 non-null   int64  
 13  Fantasy             9556 non-null   int64  
 14  Film-Noir           9556 non-null   int64  
 15  Animation           9556 non-null   int64  
 16  Action

The only column with any missing values is `relevant_tag_soup`, which contains the most relevant tags according to the genome tag data. So, if I want to search by similarity between `relevant_tag_soup` values, I'll have to drop the rows where it is missing.
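As a minimal sketch of that cleanup step, the snippet below drops only the rows where `relevant_tag_soup` is missing. The toy dataframe and its values are made up for illustration; only the column names come from the data above.

```python
import pandas as pd

# Toy stand-in for the movies dataframe; the values are illustrative only.
toy = pd.DataFrame({
    "title": ["Movie A", "Movie B", "Movie C"],
    "tag_soup": ["funny", "scary", "sad"],
    "relevant_tag_soup": ["comedy funny", None, "drama"],
})

# Drop only the rows where relevant_tag_soup is missing,
# leaving rows with nulls in other columns (if any) untouched.
with_genome_tags = toy.dropna(subset=["relevant_tag_soup"])
print(with_genome_tags["title"].tolist())
```

Passing `subset` keeps the intent explicit. A bare `dropna()` gives the same rows here only because no other column has missing values.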

To look at the relevance scores that led to the relevant_tag_soup, I can look at the data saved in `tag_genome_preprocessed.csv`.

In [3]:
genome_tags = pd.read_csv('tag_genome_preprocessed.csv')
genome_tags.head()

Unnamed: 0,movieId,title,title_clean,year,007,007 (series),18th century,1920s,1930s,1950s,...,world war i,world war ii,writer's life,writers,writing,wuxia,wwii,zombie,zombies,relevant_tag_soup
0,1,Toy Story,toy story,1995,0.029,0.02375,0.05425,0.06875,0.16,0.19525,...,0.0225,0.04075,0.03175,0.1295,0.0455,0.02,0.0385,0.09125,0.02225,toys computeranimation pixaranimation animatio...
1,2,Jumanji,jumanji,1995,0.03625,0.03625,0.08275,0.08175,0.102,0.069,...,0.0205,0.0165,0.0245,0.1305,0.027,0.01825,0.01225,0.09925,0.0185,adventure children fantasy kids jungle special...
2,3,Grumpier Old Men,grumpier old men,1995,0.0415,0.0495,0.03,0.09525,0.04525,0.05925,...,0.02375,0.0355,0.02125,0.12775,0.0325,0.01625,0.02125,0.09525,0.0175,sequel goodsequel sequels comedy original
3,4,Waiting to Exhale,waiting to exhale,1995,0.0335,0.03675,0.04275,0.02625,0.0525,0.03025,...,0.03275,0.02125,0.03675,0.15925,0.05225,0.015,0.016,0.09175,0.015,women chickflick girliemovie romantic
4,5,Father of the Bride Part II,father of the bride part ii,1995,0.0405,0.05175,0.036,0.04625,0.055,0.08,...,0.02625,0.0205,0.02125,0.17725,0.0205,0.015,0.0155,0.08875,0.01575,goodsequel sequel sequels pregnancy fatherdaug...


The `relevant_tag_soup` column contains the tags that have a relevance score of at least 0.75.
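A rough sketch of how such a tag soup could be assembled from one movie's relevance scores follows. The function name `build_tag_soup` is hypothetical, and the details (ordering by score, collapsing each tag to a single token) are assumptions inferred from the outputs in this notebook, not the actual preprocessing code.

```python
import re

def build_tag_soup(relevance, threshold=0.75):
    """Join the tags scoring at or above `threshold`, most relevant first."""
    kept = [(tag, score) for tag, score in relevance.items() if score >= threshold]
    kept.sort(key=lambda pair: pair[1], reverse=True)
    # Collapse each tag to one token by keeping only letters and digits (assumed).
    return " ".join(re.sub(r"[^a-z0-9]", "", tag.lower()) for tag, _ in kept)

# The first three scores appear later in this notebook; the "zombies" score is made up.
scores = {
    "pixar animation": 0.9955,
    "fish": 0.97625,
    "zombies": 0.02,
    "oscar (best animated feature)": 0.99825,
}
print(build_tag_soup(scores))
```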

# Search for movie by title

`search_title` underpins the later functions that find movies similar to a given title.

Before searching for matches, the title is converted to lowercase and all punctuation is removed. This makes exact matching more robust before resorting to fuzzy string matching (see below).
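The normalization step might look something like the sketch below. `normalize_title` is a hypothetical helper, not the function used inside `search_title`, and any further cleanup the real code performs is not covered.

```python
import string

# Translation table that deletes every ASCII punctuation character.
_PUNCT_TABLE = str.maketrans("", "", string.punctuation)

def normalize_title(title):
    """Lowercase a title, strip punctuation, and collapse repeated spaces."""
    cleaned = title.lower().translate(_PUNCT_TABLE)
    return " ".join(cleaned.split())

print(normalize_title("Bridget Jones's Diary"))
print(normalize_title("Win a Date with Tad Hamilton!"))
```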

If there is one perfect match, return that movie's movieId.

In [4]:
search_title(movies, 'sixteen candles')

2144

If there are multiple perfect matches, ask the user to select one and then return that movie's movieId.

In [5]:
search_title(movies, 'father of the bride')

More than one movie with matching title

1: Father of the Bride (1950)
2: Father of the Bride (1991)

Please select movie number, or enter -1 to exit: 2


6944

If there are no perfect matches (e.g., typos, alternate spellings), perform fuzzy string matching by computing the Levenshtein similarity ratio between the input title and every title in the dataframe, and return the closest matches.
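To make the fuzzy-matching idea concrete, here is a small pure-Python sketch. The exact ratio definition and cutoff used by `search_title` are not shown in this notebook, so the normalization (one minus distance divided by the longer length), the `cutoff=0.8` default, and the helper names are all assumptions.

```python
def levenshtein(a, b):
    """Edit distance between two strings (insertions, deletions, substitutions)."""
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]

def similarity_ratio(a, b):
    """Assumed normalization: 1.0 for identical strings, 0.0 for no overlap."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest

def closest_titles(query, titles, cutoff=0.8):
    """Return titles whose ratio to `query` meets the cutoff, best first."""
    scored = sorted(((similarity_ratio(query, t), t) for t in titles), reverse=True)
    return [title for score, title in scored if score >= cutoff]

print(closest_titles("findingnemo", ["finding nemo", "finding dory", "toy story"]))
```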

In [6]:
search_title(movies, 'findingnemo')

Cannot find that title, looking for similar titles...

One possible match found. Is this the title you want?

Finding Nemo (2003)
Enter y or n: y


6377

In [7]:
search_title(movies, 'fast furious')

Cannot find that title, looking for similar titles...

More than one potential match found

1: Fast and the Furious, The (2001)
2: The Fate of the Furious (2017)

Please select movie number, or enter -1 to exit: 1


4369

# Get most relevant tags for each movie

`get_most_relevant_tags` returns the tags with the highest relevance score for a given movie.
- Returns a series where the index is the name of the tag and the value is the relevance score (0-1)
- Default number of tags returned is 10, but `num_tags` can be passed as an argument
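One way such a lookup could work on a single row of the genome data is sketched below. The helper name `top_tags` and the list of non-tag columns are assumptions based on the columns visible in `genome_tags.head()`; the `zombies` score is made up for illustration.

```python
import pandas as pd

# Columns assumed to hold metadata rather than tag relevance scores.
NON_TAG_COLUMNS = ["movieId", "title", "title_clean", "year", "relevant_tag_soup"]

def top_tags(row, num_tags=10):
    """Return the `num_tags` highest relevance scores from one movie's row."""
    scores = row.drop(labels=NON_TAG_COLUMNS, errors="ignore").astype(float)
    return scores.nlargest(num_tags)

row = pd.Series({
    "movieId": 6377, "title": "Finding Nemo", "title_clean": "finding nemo",
    "year": 2003, "fish": 0.97625, "pixar": 0.97025, "zombies": 0.02,
    "relevant_tag_soup": "fish pixar",
})
print(top_tags(row, num_tags=2))
```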

In [8]:
get_most_relevant_tags(genome_tags, 'finding nemo', num_tags=15)

oscar (best animated feature)    0.99825
pixar animation                  0.99550
computer animation               0.99350
short-term memory loss           0.99225
animation                        0.99050
children                         0.98075
kids and family                  0.97850
kids                             0.97800
fish                             0.97625
cartoon                          0.97175
animated                         0.97125
pixar                            0.97025
animals                          0.93500
story                            0.93150
heartwarming                     0.92850
Name: 4317, dtype: float64

`get_relevant_tag_soup` returns a string of space-separated tags with a relevance score of 0.75 or greater.

In [9]:
get_relevant_tag_soup(genome_tags, 'finding nemo')

'oscarbestanimatedfeature pixaranimation computeranimation shorttermmemoryloss animation children kidsandfamily kids fish cartoon animated pixar animals story heartwarming cute imdbtop250 talkinganimals toys family touching disney singlefather oscarwinner original friendship storytelling adventure fun good childhood cute entertaining great funny greatmovie visuallystunning memoryloss clever comedy shark disneyanimatedfeature feelgoodmovie underwater'

# Find similar movies to a given title

`get_similar_movies` returns a dataframe of movies that are most similar to a title passed as input.
`get_similar_titles` returns a series of titles associated with output of `get_similar_movies`.

Optional parameters determine how `get_similar_movies` operates. The same parameters can be passed to `get_similar_titles`, which forwards them to its internal call to `get_similar_movies`. Optional parameters are:
- `how`: method of finding similarity
    - `'tag_soup'`: find similar movies based on list of user-assigned tags or relevant genome tags (default)
    - `'tag_relevance'`: find similar movies based on relevance scores assigned to each tag
- `field`: column of the dataframe used to compute similarity between movies (default = `'tag_soup'`; only applies when `how='tag_soup'`)
- `num_movies`: number of similar movies to return (default = 10)

## Find similar movies based on list of user-assigned tags or relevant genome tags

Default behavior of `get_similar_titles` is to call `get_similar_tag_soup`.
- Tokenizes words in string of tag soup
- Computes cosine similarity between each movie's tags and the target movie's tags.
- Returns titles of movies with highest cosine similarity.
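The steps above can be sketched with the standard library alone. This version uses raw token counts; whether `get_similar_tag_soup` uses counts, TF-IDF, or another weighting is not shown here, so treat the weighting and the helper names as assumptions.

```python
import math
from collections import Counter

def cosine_similarity(soup_a, soup_b):
    """Cosine similarity between two whitespace-tokenized tag soups."""
    counts_a, counts_b = Counter(soup_a.split()), Counter(soup_b.split())
    dot = sum(counts_a[t] * counts_b[t] for t in counts_a.keys() & counts_b.keys())
    norm_a = math.sqrt(sum(c * c for c in counts_a.values()))
    norm_b = math.sqrt(sum(c * c for c in counts_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def most_similar(target_title, soups, num_movies=10):
    """Rank every other title by similarity to the target's tag soup."""
    target = soups[target_title]
    scored = [(cosine_similarity(target, soup), title)
              for title, soup in soups.items() if title != target_title]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [title for _, title in scored[:num_movies]]

# Illustrative tag soups, not taken from the dataset.
soups = {
    "Movie A": "pixar fish animation",
    "Movie B": "pixar animation toys",
    "Movie C": "zombie horror gore",
}
print(most_similar("Movie A", soups, num_movies=2))
```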

In [10]:
get_similar_titles(movies, 'finding nemo')

0                 Toy Story
1739          Bug's Life, A
2334            Toy Story 2
3533         Monsters, Inc.
3709                Ice Age
6154                   Cars
6346            Ratatouille
7291            Toy Story 3
8114    Monsters University
9130           Finding Dory
Name: title, dtype: object

Because there are some movies with null values for relevant_tag_soup, null rows need to be dropped before using relevant_tag_soup as the source of similarity between movies.

In [11]:
get_similar_titles(movies.dropna(), 'finding nemo', field='relevant_tag_soup')

0            Toy Story
1716     Bug's Life, A
2306       Toy Story 2
3496    Monsters, Inc.
3668           Ice Age
5808    Chicken Little
6133       Ratatouille
6729                Up
7000       Toy Story 3
8416      Finding Dory
Name: title, dtype: object

`relevant_tag_soup` gives similar results to `tag_soup`, so it's likely better to use `tag_soup`, which has no missing values.

It is also possible to search for similar titles within a subset of the data. However, the input title must itself be part of the filtered data (i.e., it must have the tags or genres being filtered on).
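As a rough illustration of genre-based filtering, the sketch below keeps only the rows flagged with every requested genre, using 0/1 indicator columns like those in the `movies` dataframe. `filter_by_genres` is a hypothetical stand-in, and requiring all genres (rather than any) is an assumption about how the subset is built.

```python
import pandas as pd

def filter_by_genres(df, genres):
    """Keep rows whose indicator column equals 1 for every genre in `genres`."""
    mask = df[genres].eq(1).all(axis=1)
    return df[mask]

# Toy data with made-up titles and genre flags.
toy = pd.DataFrame({
    "title": ["Movie A", "Movie B", "Movie C"],
    "Comedy": [1, 1, 0],
    "Romance": [1, 0, 1],
})
subset = filter_by_genres(toy, ["Romance", "Comedy"])

# The target title has to survive the filter before it can be searched for.
print("Movie A" in subset["title"].values)
```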

In [12]:
# First search without filtering
get_similar_movies(movies, 'my big fat greek wedding')

Unnamed: 0,movieId,title,genres,Adventure,Horror,Thriller,Crime,Musical,Documentary,Mystery,...,IMAX,Drama,Comedy,mean_rating,num_ratings,weighted_rating,tag_soup,year,title_clean,relevant_tag_soup
311,357,Four Weddings and a Funeral,Comedy|Romance,0,0,0,0,0,0,0,...,0,0,1,0.728926,30236,0.728776,british hughgrant witty british donotlikehughg...,1994,four weddings and a funeral,romanticcomedy relationships britishcomedy chi...
1628,2195,Dirty Work,Comedy,0,0,0,0,0,0,0,...,0,0,1,0.615865,1248,0.628327,normmacdonald comedy comedy christophermcdonal...,1998,dirty work,comedy funny saturdaynightlive chase funniestm...
1755,2371,Fletch,Comedy|Crime|Mystery,0,0,0,1,0,0,1,...,0,0,1,0.685466,7582,0.685996,comedy comedy comedy comedy comedy comedy mema...,1985,fletch,chase comedy investigation quotable saturdayni...
2690,3646,Big Momma's House,Comedy,0,0,0,0,0,0,0,...,0,0,1,0.473999,3473,0.486637,comedy comedy comedy comedy disguise fbiagent ...,2000,big mommas house,comedy funny undercovercop predictable veryfunny
3727,5250,Stir Crazy,Comedy,0,0,0,0,0,0,0,...,0,0,1,0.668531,715,0.676741,comedy comedy comedy genewilder richardpryor c...,1980,stir crazy,comedy prisonescape funny prison veryfunny
4470,6687,My Boss's Daughter,Comedy|Romance,0,0,0,0,0,0,0,...,0,0,1,0.479545,528,0.541783,owned comedy funny comedy funny,2003,my bosss daughter,comedy stupidity dumbbutfunny stupid teenmovie...
4598,6944,Father of the Bride,Comedy,0,0,0,0,0,0,0,...,0,0,1,0.667226,3518,0.669316,comedy stevemartin itaege wedding remake weddi...,1991,father of the bride,wedding comedy remake family girliemovie roman...
5174,8531,White Chicks,Action|Comedy|Crime,0,0,0,1,0,0,0,...,0,0,1,0.479799,2193,0.498712,crossdressing whiteface comedy comedy comedy c...,2004,white chicks,comedy dumbbutfunny dumb crossdressing sillyfu...
5904,34530,Deuce Bigalow: European Gigolo,Comedy,0,0,0,0,0,0,0,...,0,0,1,0.427753,890,0.478824,amsterdam comedy comedy comedy comedy goldenra...,2005,deuce bigalow european gigolo,stupidashell comedy funny stupid predictable p...
8818,135861,Ted 2,Comedy,0,0,0,0,0,0,0,...,0,0,1,0.60257,1362,0.615824,stupid comedy funny ted comedy juvenile unnece...,2015,ted 2,comedy weed sequel sequels crudehumor funny du...


In [13]:
# Now search just rom-coms
romcoms = filter_genre(movies, ['Romance', 'Comedy'])
get_similar_movies(romcoms, 'my big fat greek wedding')

Unnamed: 0,movieId,title,genres,Adventure,Horror,Thriller,Crime,Musical,Documentary,Mystery,...,IMAX,Drama,Comedy,mean_rating,num_ratings,weighted_rating,tag_soup,year,title_clean,relevant_tag_soup
33,357,Four Weddings and a Funeral,Comedy|Romance,0,0,0,0,0,0,0,...,0,0,1,0.728926,30236,0.728776,british hughgrant witty british donotlikehughg...,1994,four weddings and a funeral,romanticcomedy relationships britishcomedy chi...
269,4246,Bridget Jones's Diary,Comedy|Drama|Romance,0,0,0,0,0,0,0,...,0,1,1,0.680338,16667,0.680644,romanticcomedy basedonabook british comedy dra...,2001,bridget joness diary,girliemovie chickflick romanticcomedy adaptedf...
282,4447,Legally Blonde,Comedy|Romance,0,0,0,0,0,0,0,...,0,0,1,0.631435,13447,0.632529,comedy okayonce absurd feminism feminist sense...,2001,legally blonde,girliemovie chickflick comedy cute funmovie ro...
413,6687,My Boss's Daughter,Comedy|Romance,0,0,0,0,0,0,0,...,0,0,1,0.479545,528,0.541783,owned comedy funny comedy funny,2003,my bosss daughter,comedy stupidity dumbbutfunny stupid teenmovie...
455,7255,Win a Date with Tad Hamilton!,Comedy|Romance,0,0,0,0,0,0,0,...,0,0,1,0.567961,927,0.592474,onedimensionalcharacters sexist joshduhamel ka...,2004,win a date with tad hamilton,romanticcomedy girliemovie chickflick romantic...
567,34162,Wedding Crashers,Comedy|Romance,0,0,0,0,0,0,0,...,0,0,1,0.688274,11470,0.688579,christopherwalken nuditytopless owenwilson ste...,2005,wedding crashers,comedy romanticcomedy wedding funny veryfunny ...
747,90576,What's Your Number?,Comedy|Romance,0,0,0,0,0,0,0,...,0,0,1,0.643428,601,0.659073,chickflick romance chickflick romance basedonn...,2011,whats your number,chickflick romanticcomedy girliemovie comedy r...
821,118814,Playing It Cool,Comedy|Romance,0,0,0,0,0,0,0,...,0,0,1,0.634091,88,0.68409,comedy romance romanticcomedy whimsical engage...,2014,playing it cool,romanticcomedy romantic love lovestory relatio...
831,130978,Love and Pigeons,Comedy|Romance,0,0,0,0,0,0,0,...,0,0,1,0.77395,119,0.731404,comedy romance sovietrussia vladimirmenshov pi...,1985,love and pigeons,russian storytelling
849,147196,The Girls,Comedy|Romance,0,0,0,0,0,0,0,...,0,0,1,0.811429,70,0.733399,comedy romance sentimentale ussr,1961,girls,love romanticcomedy lovestory excellentscript ...


## Find similar movies based on relevance scores assigned to each tag

When parameter `how='tag_relevance'` is passed to `get_similar_titles`, it calls `get_similar_tag_relevance`.

- Computes cosine similarity between relevance scores assigned to each movie and the scores assigned to the target movie.
- Returns titles of movies with highest cosine similarity.
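With dense relevance scores, the same ranking can be computed for all movies at once. The NumPy sketch below assumes one row per movie and one column per tag, with no all-zero rows; `rank_by_cosine` is a hypothetical helper and the matrix values are illustrative.

```python
import numpy as np

def rank_by_cosine(matrix, target_index, num_movies=10):
    """Return row indices of the movies most similar to the target row."""
    # Scale each row to unit length so a dot product equals cosine similarity.
    norms = np.linalg.norm(matrix, axis=1)
    unit = matrix / norms[:, np.newaxis]
    sims = unit @ unit[target_index]
    sims[target_index] = -np.inf  # exclude the target movie itself
    return np.argsort(sims)[::-1][:num_movies]

# Four movies scored on three tags (made-up values).
relevance = np.array([
    [0.9, 0.1, 0.0],
    [0.8, 0.2, 0.1],
    [0.0, 0.1, 0.9],
    [0.1, 0.0, 0.8],
])
print(rank_by_cosine(relevance, target_index=0, num_movies=2))
```

Unlike the tag soup approach, this compares the full score vectors, so tags below any relevance cutoff still contribute to the similarity.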

In [14]:
get_similar_titles(genome_tags, 'finding dory', how='tag_relevance')

1739                          Bug's Life, A
2334                            Toy Story 2
3709                                Ice Age
6091                Ice Age 2: The Meltdown
7291                            Toy Story 3
7838                                  Brave
8114                    Monsters University
8124                        Despicable Me 2
8196    Cloudy with a Chance of Meatballs 2
8823                      The Good Dinosaur
Name: title, dtype: object

# Compute top-rated movies similar to a given movie

First, get list of similar movies.

In [15]:
similar_movies_genome = get_similar_movies(genome_tags, 'sixteen candles', how='tag_relevance', num_movies=20)

Select the rows of the `movies` dataframe with a matching `movieId`.

In [16]:
similar_movies = movies[movies['movieId'].isin(similar_movies_genome['movieId'])]

Get list of highest-rated movies in dataframe.

In [17]:
highest_rated(similar_movies, num_movies=8)

1429           Breakfast Club, The
1653               Say Anything...
942             Better Off Dead...
7351                        Easy A
4444               Sure Thing, The
4895                    Mean Girls
975         Some Kind of Wonderful
1922    10 Things I Hate About You
Name: title, dtype: object
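The last two steps can be sketched on toy data as follows. `top_rated_titles` is a hypothetical stand-in for `highest_rated`, and sorting on the `weighted_rating` column is an assumption; the notebook does not show which rating column the real function uses.

```python
import pandas as pd

def top_rated_titles(df, num_movies=10, rating_col="weighted_rating"):
    """Return the titles of the `num_movies` rows with the highest rating."""
    return df.nlargest(num_movies, rating_col)["title"]

# Toy stand-in for the movies dataframe (made-up values).
toy = pd.DataFrame({
    "movieId": [1, 2, 3, 4],
    "title": ["Movie A", "Movie B", "Movie C", "Movie D"],
    "weighted_rating": [0.61, 0.75, 0.70, 0.90],
})

# Pretend these movieIds came back from the similarity search.
similar_ids = [1, 2, 3]
candidates = toy[toy["movieId"].isin(similar_ids)]
print(top_rated_titles(candidates, num_movies=2))
```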