Earlier we trained a model to predict the ratings users would give to movies using a network with embeddings learned for each movie and user. Embeddings are powerful! But how do they actually work? 

Previously, I claimed that embeddings capture the 'meaning' of the objects they represent, and discover useful latent structure. Let's put that to the test!

# Looking up embeddings

Let's load a model we trained earlier so we can investigate the embedding weights that it learned.

In [2]:
import os

import numpy as np
import pandas as pd
from matplotlib import pyplot as plt
import tensorflow as tf
from tensorflow import keras

RUNNING_ON_KERNELS = 'KAGGLE_WORKING_DIR' in os.environ
input_dir = '../input/0-movielens-preprocessing' if RUNNING_ON_KERNELS else '../input/movielens_preprocessed'
model_dir = '../input/x3-movielens-spiffy-model' if RUNNING_ON_KERNELS else '.'
model_path = os.path.join(model_dir, 'movie_svd_model_32.h5')
#model = keras.models.load_model('movie_svd_model_8_v2.h5')
model = keras.models.load_model(model_path)
model.summary()

__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
user_id (InputLayer)            (None, 1)            0                                            
__________________________________________________________________________________________________
movie_id (InputLayer)           (None, 1)            0                                            
__________________________________________________________________________________________________
user_embedding (Embedding)      (None, 1, 32)        4431808     user_id[0][0]                    
__________________________________________________________________________________________________
movie_embedding (Embedding)     (None, 1, 32)        855808      movie_id[0][0]                   
__________________________________________________________________________________________________
movie_bias

The embedding weights are part of the model's internals, so we'll have to do a bit of digging around to access them. We'll grab the layer responsible for embedding movies, and use the `get_weights()` method to get its learned weights.

In [3]:
emb_layer = model.get_layer('movie_embedding')
(w,) = emb_layer.get_weights()
w.shape

(26744, 32)

Our weight matrix has 26,744 rows for that many movies. Each row is 32 numbers - the size of our movie embeddings.
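
To make the "one row per movie" picture concrete, here is a minimal sketch of what an embedding lookup amounts to. It uses a made-up 5 x 3 matrix (`toy_w` is a stand-in, not the trained weights) and shows that selecting a row by id gives the same result as multiplying a one-hot vector by the matrix.

```python
import numpy as np

# Toy stand-in for the learned weight matrix: 5 "movies", 3 dimensions each.
# The values are made up; the real matrix above is 26,744 x 32.
toy_w = np.arange(15, dtype=np.float32).reshape(5, 3)

movie_id = 2

# Looking up an embedding is just selecting a row by integer id...
by_index = toy_w[movie_id]

# ...which matches multiplying a one-hot vector by the matrix.
one_hot = np.zeros(5, dtype=np.float32)
one_hot[movie_id] = 1.0
by_matmul = one_hot @ toy_w

print(by_index)                             # [6. 7. 8.]
print(np.array_equal(by_index, by_matmul))  # True
```

Row indexing is the cheaper of the two, which is why `w[0]` is all we need to pull out a movie's vector.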

Let's look at an example movie vector:

In [4]:
w[0]

array([-0.33853766, -0.4146362 ,  0.1623646 ,  0.50178075,  0.16071211,
       -0.16824646, -1.2413633 ,  0.6934318 , -0.79779387, -0.02712124,
       -0.04582484, -0.700483  , -0.51075536,  0.55907726, -0.13546772,
       -0.7331561 , -0.01410303,  1.1877884 ,  0.5516961 , -0.10908557,
       -0.6856427 , -0.13831832,  1.1474288 , -0.1690759 , -0.04500233,
        0.40393695, -0.43479416, -0.36175066, -0.40691528,  0.15359668,
       -1.0017278 , -0.43102026], dtype=float32)

What movie is this the embedding of? Let's load up our dataframe of movie metadata.

In [None]:
movies_path = os.path.join(input_dir, 'movie.csv')
movies_df = pd.read_csv(movies_path, index_col=0)
movies_df.head()

Of course, it's *Toy Story*! I should have recognized that vector anywhere.

Okay, I'm being facetious. It's hard to make anything of these vectors at this point. We never told the model how to use any particular embedding dimension. We left it alone to learn whatever representation it found useful.

So how do we check whether these representations are sane and coherent?

## Vector similarity

A simple way to test this is to look at how close or distant pairs of movies are in the embedding space. Embeddings can be thought of as a smart distance metric. If our embedding matrix is any good, it should map similar movies (like *Toy Story* and *Shrek*) to similar vectors.

In [6]:
i_toy_story = 0
i_shrek = movies_df.loc[
    movies_df.title == 'Shrek',
    'movieId'
].iloc[0]

toy_story_vec = w[i_toy_story]
shrek_vec = w[i_shrek]

print(
    toy_story_vec,
    shrek_vec,
    sep='\n',
)

[-0.33853766 -0.4146362   0.1623646   0.50178075  0.16071211 -0.16824646
 -1.2413633   0.6934318  -0.79779387 -0.02712124 -0.04582484 -0.700483
 -0.51075536  0.55907726 -0.13546772 -0.7331561  -0.01410303  1.1877884
  0.5516961  -0.10908557 -0.6856427  -0.13831832  1.1474288  -0.1690759
 -0.04500233  0.40393695 -0.43479416 -0.36175066 -0.40691528  0.15359668
 -1.0017278  -0.43102026]
[-0.14501241  0.38190323  0.33974665  0.10670913  0.29868233  0.51298016
 -1.001472    0.99781567 -0.53581023 -0.08761381 -0.2953611  -0.6223411
 -0.25792935  0.47818306  0.47391957 -0.4300383  -0.38145897  0.63624847
 -0.14488064  0.3077913  -0.08880541 -0.5605316   0.02790316  0.21299955
  0.11481046  0.68308973  0.66193867 -0.40429035 -0.7823577   0.66591966
 -0.53341657 -0.27222055]


These look generally similar! If we wanted to assign a single number to their similarity, we could calculate the Euclidean distance between these two vectors. (This is our conventional 'as the crow flies' notion of distance between two points. Easy to grok in 1, 2, or 3 dimensions. Mathematically, we can also extend it to 32 dimensions, though good luck visualizing it.)
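
As a quick illustration of the formula itself, here is a small sketch on two made-up 3-d vectors (not real movie embeddings). It computes the distance by hand and then with NumPy's norm, which should agree.

```python
import numpy as np

# Two made-up 3-d vectors (not real movie embeddings).
a = np.array([1.0, 2.0, 3.0])
b = np.array([4.0, 6.0, 3.0])

# Euclidean distance: square root of the sum of squared coordinate differences.
by_hand = np.sqrt(np.sum((a - b) ** 2))

# The vector norm of the difference computes the same quantity.
by_norm = np.linalg.norm(a - b)

print(by_hand, by_norm)  # 5.0 5.0
```

The same expression works unchanged for 32-dimensional vectors; only the number of terms in the sum grows.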

In [1]:
from scipy.spatial import distance

distance.euclidean(toy_story_vec, shrek_vec)


How does this compare to a pair of movies that we would think of as very different?

In [9]:
i_exorcist = movies_df.loc[
    movies_df.title == 'The Exorcist',
    'movieId'
].iloc[0]

exorcist_vec = w[i_exorcist]

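# Show both vectors, then both distance measures
print('Toy Story:', toy_story_vec, 'The Exorcist:', exorcist_vec, sep='\n')
print('Distance = {} (euclidean), {} (cosine)'.format(
    distance.euclidean(toy_story_vec, exorcist_vec),
    distance.cosine(toy_story_vec, exorcist_vec),
))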

Toy Story:
[-0.33853766 -0.4146362   0.1623646   0.50178075  0.16071211 -0.16824646
 -1.2413633   0.6934318  -0.79779387 -0.02712124 -0.04582484 -0.700483
 -0.51075536  0.55907726 -0.13546772 -0.7331561  -0.01410303  1.1877884
  0.5516961  -0.10908557 -0.6856427  -0.13831832  1.1474288  -0.1690759
 -0.04500233  0.40393695 -0.43479416 -0.36175066 -0.40691528  0.15359668
 -1.0017278  -0.43102026]
The Exorcist:
[-0.48039156 -0.5364228   0.20817241 -0.73291785 -0.05619409  0.05894614
  0.1316456  -0.14783028  0.22751398 -0.2714083  -0.39914957 -0.3958874
  0.7817501  -0.16066262  0.03851374  0.04804222 -1.8531514  -0.19662409
  0.15619414  0.8513164   0.5270438   0.21442115 -0.2165486  -0.47107786
 -0.8162815   0.11464956 -0.59268594  0.04401499 -0.5136618  -1.4066666
 -0.43677568 -0.07259946]
Distance = 4.700692653656006 (euclidean), 1.0603957809507847 (cosine)


As expected, much further apart.

## Cosine Distance

If you check out [the docs for the `scipy.spatial.distance` module](https://docs.scipy.org/doc/scipy-0.14.0/reference/spatial.distance.html), you'll see there are actually a *lot* of different measures of distance that people use for different tasks.

When judging the similarity of embeddings, it's more common to use [cosine similarity](https://en.wikipedia.org/wiki/Cosine_similarity).

In brief, the cosine similarity of two vectors ranges from -1 to 1, and is a function of the *angle* between the vectors. If two vectors point in the same direction, their cosine similarity is 1. If they point in opposite directions, it's -1. If they're orthogonal (i.e. at right angles), their cosine similarity is 0.

Cosine distance is just defined as 1 minus the cosine similarity (and therefore ranges from 0 to 2).
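
Here is a minimal sketch of those definitions on hand-picked 2-d vectors. The helper names `cosine_similarity` and `cosine_distance` are my own, written out for illustration; in practice `scipy.spatial.distance.cosine` does the same job.

```python
import numpy as np

def cosine_similarity(u, v):
    """Cosine of the angle between u and v."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def cosine_distance(u, v):
    """1 minus the cosine similarity, so it ranges from 0 to 2."""
    return 1.0 - cosine_similarity(u, v)

x = np.array([1.0, 0.0])

print(cosine_similarity(x, np.array([3.0, 0.0])))   # 1.0  (same direction)
print(cosine_similarity(x, np.array([-2.0, 0.0])))  # -1.0 (opposite direction)
print(cosine_similarity(x, np.array([0.0, 5.0])))   # 0.0  (orthogonal)
print(cosine_distance(x, np.array([-2.0, 0.0])))    # 2.0
```

Note that the lengths of the vectors (3, 2, and 5 here) have no effect on the result. Only the direction matters.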

Let's calculate a couple of cosine distances between movie vectors:

In [1]:
print(
    distance.cosine(toy_story_vec, shrek_vec),
    distance.cosine(toy_story_vec, exorcist_vec),
    sep='\n'
)


> **Aside:** *Why* is cosine distance commonly used when working with embeddings? The short answer, as with so many deep learning techniques, is "empirically, it works well". In the exercise coming up, you'll get to do a little hands-on investigation that digs into this question more deeply.

Which movies are most similar to *Toy Story*? Which movies fall right between *Psycho* and *Scream* in the embedding space? We could write a bunch of code to work out questions like this, but it'd be pretty tedious. Fortunately, there's already a library for exactly this sort of work: **Gensim**.
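
To get a feel for what that hand-written code would involve, here is a rough NumPy sketch of a brute-force "most similar" lookup. The function name `most_similar_rows` and the toy matrix are made up for illustration, and Gensim's own implementation differs in its details.

```python
import numpy as np

def most_similar_rows(weights, query_index, topn=3):
    """Rank rows of `weights` by cosine similarity to the row at `query_index`."""
    # Scale every row to unit length so a dot product equals cosine similarity.
    unit = weights / np.linalg.norm(weights, axis=1, keepdims=True)
    sims = unit @ unit[query_index]
    # Sort from most to least similar, then drop the query row itself.
    order = np.argsort(-sims)
    order = order[order != query_index][:topn]
    return [(int(i), float(sims[i])) for i in order]

toy_w = np.array([
    [1.0, 0.0],   # row 0: the query
    [0.9, 0.1],   # row 1: nearly the same direction as row 0
    [0.0, 1.0],   # row 2: orthogonal to row 0
    [-1.0, 0.0],  # row 3: opposite to row 0
])

print([i for i, _ in most_similar_rows(toy_w, 0)])  # [1, 2, 3]
```

This only handles the simplest question. Mapping titles to rows, combining several vectors, and formatting results would all need more code.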

# Exploring embeddings with Gensim

I'll create an instance of [`WordEmbeddingsKeyedVectors`](https://radimrehurek.com/gensim/models/keyedvectors.html#gensim.models.keyedvectors.WordEmbeddingsKeyedVectors) with our model's movie embeddings and the titles of the corresponding movies.

> Aside: You may notice that Gensim's docs and many of its class and method names refer to *word* embeddings. While the library is most frequently used in the text domain, we can use it to explore embeddings of any sort.

In [10]:
from gensim.models.keyedvectors import WordEmbeddingsKeyedVectors

# Limit to movies with at least this many ratings in the dataset
threshold = 100
mainstream_movies = movies_df[movies_df.n_ratings >= threshold].reset_index(drop=True)

movie_embedding_size = w.shape[1]
kv = WordEmbeddingsKeyedVectors(movie_embedding_size)
kv.add(
    mainstream_movies['key'].values,
    w[mainstream_movies.movieId]
)

# TODO: could be kind of nice if we could default to using the title 
# as key when it's unambiguous, and the year-augmented keys when
# necessary.
def k(title):
    """Helper to resolve a base movie title to its unique key."""
    matches = movies_df[movies_df['title']==title]
    assert len(matches) == 1, len(matches)
    return matches.iloc[0]['key']

Okay, so which movies are most similar to *Toy Story*?

In [11]:
# TODO: Maybe write some nice little wrapper around this method that takes care of resolving titles
# to keys, and nicely formats the output (maybe as a dataframe? Or even a little horizontal bar chart.)
kv.most_similar(k('Toy Story'))

[('Toy Story 2 (1999)', 0.9668039083480835),
 ('Monsters, Inc. (2001)', 0.8768452405929565),
 ("Bug's Life, A (1998)", 0.8742560148239136),
 ('Finding Nemo (2003)', 0.8684772253036499),
 ('Toy Story 3 (2010)', 0.8677495718002319),
 ('Incredibles, The (2004)', 0.8582932949066162),
 ('Ratatouille (2007)', 0.7599791288375854),
 ('Up (2009)', 0.7554174661636353),
 ('Lion King, The (1994)', 0.7251849174499512),
 ('Aladdin (1992)', 0.7034016251564026)]

Wow, these are pretty great! It makes perfect sense that *Toy Story 2* is the most similar movie to *Toy Story*. And most of the rest are animated kids' movies with a similar computer-animated style.

#### DB: I'd cut it here. Users clearly learn something by seeing the comparison once. But unless they are curious about movies (which generally isn't what users are coming to our courses for), I'm skeptical they'll get much value out of seeing this for different movies. For those who are interested in exploring the movie space, it'd be fun to include a reminder to do this in the exercises page, so they can do it for movies of their choosing in edit mode.

So it's learned something about 3-d animated kids flicks, but let's try a few more to make sure that wasn't a fluke.

What about artsy erotic dramas?

In [12]:
kv.most_similar(k('Eyes Wide Shut'))

[('Mulholland Drive (2001)', 0.8159981966018677),
 ('Lost Highway (1997)', 0.731342077255249),
 ('Barry Lyndon (1975)', 0.6611681580543518),
 ('Exotica (1994)', 0.650924563407898),
 ('Twin Peaks: Fire Walk with Me (1992)', 0.6277767419815063),
 ('Solaris (2002)', 0.6273046731948853),
 ('Magnolia (1999)', 0.6238169074058533),
 ('Match Point (2005)', 0.6146119832992554),
 ('Closer (2004)', 0.6053546667098999),
 ('Inland Empire (2006)', 0.585962176322937)]

Nailed it.

Raunchy sophomoric comedy?

In [13]:
kv.most_similar(k('American Pie'))

[('American Pie 2 (2001)', 0.9293572902679443),
 ('Road Trip (2000)', 0.8612767457962036),
 ('American Wedding (American Pie 3) (2003)', 0.8007140159606934),
 ("There's Something About Mary (1998)", 0.7459080219268799),
 ("Porky's (1982)", 0.7170787453651428),
 ('Varsity Blues (1999)', 0.6828464269638062),
 ('Wedding Crashers (2005)', 0.6678208112716675),
 ('Scary Movie (2000)', 0.6587681770324707),
 ('Fast Times at Ridgemont High (1982)', 0.6415641903877258),
 ('Not Another Teen Movie (2001)', 0.6381173729896545)]

Pretty good.

Light-hearted old-school musicals?

In [14]:
kv.most_similar(k('Meet Me in St. Louis'))

[('Pollyanna (1960)', 0.7923725843429565),
 ('Gigi (1958)', 0.7843174934387207),
 ('Seven Brides for Seven Brothers (1954)', 0.7775176167488098),
 ('Little Women (1933)', 0.765568196773529),
 ('Funny Girl (1968)', 0.7556552886962891),
 ('Harvey Girls, The (1946)', 0.7473137974739075),
 ('Oklahoma! (1955)', 0.746092677116394),
 ('My Fair Lady (1964)', 0.7429697513580322),
 ('Little Princess, The (1939)', 0.7375482320785522),
 ("Singin' in the Rain (1952)", 0.7283079028129578)]

# Semantic vector math

The [`most_similar`](https://radimrehurek.com/gensim/models/keyedvectors.html#gensim.models.keyedvectors.WordEmbeddingsKeyedVectors.most_similar) method optionally takes a second argument, `negative`. If we call `kv.most_similar(a, b)`, then instead of finding the vectors closest to `a`, it will find the vectors closest to `a - b`. (Gensim scales each input vector to unit length before combining them, so only their directions matter.)

Why would you want to do that? It turns out that doing addition and subtraction of embedding vectors often gives surprisingly meaningful results. For example, how would you fill in the following equation?

    Scream = Psycho + ________

*Scream* and *Psycho* are similar in that they're violent, scary movies somewhere on the border between Horror and Thriller. The biggest difference is that *Scream* has elements of comedy. So I'd say *Scream* is what you'd get if you combined *Psycho* with a comedy.

But we can actually ask Gensim to fill in the blank for us via vector math (after some rearranging):

    ________ = Scream - Psycho

In [15]:
kv.most_similar(
    positive = [k('Scream')],
    negative = ['Psycho (1960)']
)

[("Can't Hardly Wait (1998)", 0.6875821948051453),
 ('Varsity Blues (1999)', 0.643612265586853),
 ('Toy Soldiers (1991)', 0.6359611749649048),
 ('Cruel Intentions (1999)', 0.5939096212387085),
 ('Summer School (1987)', 0.5853281021118164),
 ('10 Things I Hate About You (1999)', 0.5755065679550171),
 ('Road Trip (2000)', 0.5676481127738953),
 ('Bring It On (2000)', 0.5587225556373596),
 ('EuroTrip (2004)', 0.556922197341919),
 ('Faculty, The (1998)', 0.5412049293518066)]

#### DB: I'm concerned we're starting to rely on a good deal of cinema knowledge here.

If you are familiar with these movies, you'll see that the missing ingredient that takes us from *Psycho* to *Scream* is comedy (and also late-90's-teen-movie-ness).

*Brave* and *Cars 2* are Pixar movies released within a year of one another. The most obvious difference between them is that their content is more stereotypically appealing to girls and boys, respectively. What do we get by subtracting *Brave*'s vector from *Cars 2*'s vector?

In [16]:
kv.most_similar(
    [k('Cars 2')],
    negative = [k('Brave')]
)

[('Sea of Love (1989)', 0.647750198841095),
 ('All the Right Moves (1983)', 0.6178675889968872),
 ('Wall Street (1987)', 0.5668429732322693),
 ('48 Hrs. (1982)', 0.5596427321434021),
 ('Gauntlet, The (1977)', 0.5527459383010864),
 ('10 (1979)', 0.5510733127593994),
 ('Mad Money (2008)', 0.5362347364425659),
 ('Endless Love (1981)', 0.5280988216400146),
 ('Against All Odds (1984)', 0.5252307653427124),
 ("White Men Can't Jump (1992)", 0.5205742120742798)]

#### DB: As a non-cinephile, I know nothing about most of these. I worry most users will feel left out by commentary about movies they don't recognize.


**TODO: Well, this worked a lot better with the model I was originally using at the time I wrote this. These results from the newer model are still mostly 'guy movies', but they're more obscure.**

*The macho vector*. The components that might have represented "kids movie", "computer animation", "blockbuster", "early 2010's", etc. have basically "cancelled out", since these properties were common to both movies. 
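
The "cancelling out" idea can be sketched with a toy example. The dimension names and values below are entirely hypothetical: real embedding dimensions are learned and do not come with labels like these. The point is only that components shared by both vectors vanish in the difference.

```python
import numpy as np

# Hypothetical, hand-labelled dimensions. Real embedding dimensions are
# learned and are not individually interpretable like this.
dims = ['kids', 'computer_animated', 'cars_and_action', 'fairy_tale']
movie_a = np.array([1.0, 1.0, 0.8, 0.0])  # made-up vector
movie_b = np.array([1.0, 1.0, 0.0, 0.9])  # made-up vector

diff = movie_a - movie_b

# Shared properties come out as zero; only the differences remain.
for name, value in zip(dims, diff):
    print(name, value)
```

In a learned embedding the shared properties are spread across many dimensions, so the cancellation is approximate rather than exact.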

## Analogy solving

The SAT, a test used for admission to American colleges and universities, has posed analogy questions like:

    shower : deluge :: _____ : stare
    
(Read "shower is to deluge as ___ is to stare")

To solve this, we find the relationship between deluge and shower, and apply it to stare. A shower is a milder form of a deluge. What's a milder form of stare? A good answer here would be "glance", or "look". 

It's kind of astounding that this works, but people have found that these can often be effectively solved by simple vector math on word embeddings. Can we solve movie analogies with our embeddings? Let's try. What about:

    Brave : Cars 2 :: Pocahontas : _____
    
In terms of vector math, we can frame this as...

    Cars 2 = Brave + X
    _____  = Pocahontas + X
    
Rearranging, we get:

    ____ = Pocahontas + (Cars 2 - Brave)

We can solve this by passing in two movies (*Pocahontas* and *Cars 2*) for the positive argument to `most_similar`, with *Brave* as the negative argument:

In [18]:
kv.most_similar(
    [k('Pocahontas'), k('Cars 2')],
    negative = [k('Brave')]
)

[('Aladdin and the King of Thieves (1996)', 0.648187518119812),
 ('Hunchback of Notre Dame, The (1996)', 0.6423631906509399),
 ('Chorus Line, A (1985)', 0.6406726837158203),
 ('Hercules (1997)', 0.6011497974395752),
 ('Lady in Red, The (1979)', 0.5953354239463806),
 ('Return of Jafar, The (1994)', 0.5920454859733582),
 ('All Dogs Go to Heaven 2 (1996)', 0.5916997194290161),
 ('Anastasia (1997)', 0.581755518913269),
 ('101 Dalmatians (1996)', 0.5750991702079773),
 ('Little Mermaid, The (1989)', 0.5689462423324585)]

We get a bunch of mid-90s kids' movies like *Pocahontas*... except that unlike *Pocahontas*, these have male main characters, and on the whole are probably more popular with boys.
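
For readers curious about the arithmetic behind an analogy query, here is a rough NumPy sketch. It reflects my understanding of the general recipe (unit-normalize the inputs, average the positive vectors with the negated negative ones, rank everything else by cosine similarity). The function `solve_analogy`, the keys, and the 2-d vectors are all made up for illustration, and this is not Gensim's actual code.

```python
import numpy as np

def unit(v):
    """Scale a vector to unit length."""
    return v / np.linalg.norm(v)

def solve_analogy(vectors, positive, negative, topn=1):
    """Rank keys by cosine similarity to the combined positive/negative query."""
    signed = ([unit(vectors[key]) for key in positive]
              + [-unit(vectors[key]) for key in negative])
    query = unit(np.mean(signed, axis=0))
    # Leave the input keys out of the results.
    exclude = set(positive) | set(negative)
    scores = {
        key: float(np.dot(query, unit(vec)))
        for key, vec in vectors.items()
        if key not in exclude
    }
    return sorted(scores.items(), key=lambda item: -item[1])[:topn]

# Made-up 2-d vectors where d = c + (b - a).
toy_vectors = {
    'a': np.array([1.0, 1.0]),
    'b': np.array([1.0, -1.0]),
    'c': np.array([3.0, 1.0]),
    'd': np.array([3.0, -1.0]),
    'e': np.array([-1.0, 2.0]),  # distractor
}

# a : b :: c : ?
best_key, best_score = solve_analogy(toy_vectors, positive=['c', 'b'], negative=['a'])[0]
print(best_key)  # d
```

Excluding the input keys matters: without it, the top result of an analogy query is often just one of the inputs.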

#### DB: I again feel we need to show this just once, so we're demonstrating the technique, rather than exploring a space many users may have little interest in.

Does this work with a non-animated film?

In [19]:
kv.most_similar(
    [k('Bring It On'), k('Cars 2')],
    negative = [k('Brave')]
)

[('Blue Crush (2002)', 0.7251259088516235),
 ('Bring It On Again (2004)', 0.695522665977478),
 ('Fast and the Furious: Tokyo Drift, The (Fast and the Furious 3, The) (2006)',
  0.6860014200210571),
 ('Legally Blonde (2001)', 0.6783874034881592),
 ("Charlie's Angels (2000)", 0.6571942567825317),
 ('Barbershop 2: Back in Business (2004)', 0.6560758352279663),
 ('2 Fast 2 Furious (Fast and the Furious 2, The) (2003)', 0.6469695568084717),
 ('What a Girl Wants (2003)', 0.642094612121582),
 ("She's All That (1999)", 0.6413033604621887),
 ('Drumline (2002)', 0.6353882551193237)]

**TODO: This example has also degraded after switching models. Possible that some of these patterns were kind of cherry-picked and aren't super robust. Maybe worth showing at least one example that doesn't work as we might hope, to show that it's not totally magic.**

This seems pretty accurate. Like *Bring It On*, these movies are light-hearted, inoffensive comedies aimed at young teens - but not necessarily young teen girls. The contrast is especially obvious when we compare to the most similar movies to *Bring It On* alone:

In [20]:
kv.most_similar(
    k('Bring It On'))

[('Legally Blonde (2001)', 0.8641946315765381),
 ('Clueless (1995)', 0.8171336054801941),
 ('Blue Crush (2002)', 0.8121250867843628),
 ('Mean Girls (2004)', 0.7572932243347168),
 ("Charlie's Angels (2000)", 0.7506909370422363),
 ('Never Been Kissed (1999)', 0.7351545095443726),
 ('10 Things I Hate About You (1999)', 0.713359534740448),
 ("She's All That (1999)", 0.7108302116394043),
 ('Spice World (1997)', 0.7067806720733643),
 ('Fast and the Furious: Tokyo Drift, The (Fast and the Furious 3, The) (2006)',
  0.6800084114074707)]

# Your Turn

**TODO: link**

In [21]:
# scratch space below - please ignore

kv.most_similar(k('Star Wars: Episode I - The Phantom Menace'))

[('Star Wars: Episode II - Attack of the Clones (2002)', 0.9871715903282166),
 ('Star Wars: Episode III - Revenge of the Sith (2005)', 0.950527548789978),
 ('Star Wars: The Clone Wars (2008)', 0.8525668382644653),
 ('Indiana Jones and the Kingdom of the Crystal Skull (2008)',
  0.659541130065918),
 ('Planet of the Apes (2001)', 0.5998356342315674),
 ('Star Wars: Episode VI - Return of the Jedi (1983)', 0.5917620658874512),
 ('Star Wars: Episode IV - A New Hope (1977)', 0.5581279397010803),
 ('X-Men: The Last Stand (2006)', 0.5202416181564331),
 ('Jurassic Park III (2001)', 0.5184123516082764),
 ('Daredevil (2003)', 0.5071365833282471)]

In [22]:
kv.most_similar(k('Harry Potter and the Goblet of Fire'))

[('Harry Potter and the Prisoner of Azkaban (2004)', 0.9893936514854431),
 ('Harry Potter and the Order of the Phoenix (2007)', 0.9833325743675232),
 ('Harry Potter and the Chamber of Secrets (2002)', 0.9780935645103455),
 ('Harry Potter and the Half-Blood Prince (2009)', 0.9761638045310974),
 ("Harry Potter and the Sorcerer's Stone (a.k.a. Harry Potter and the Philosopher's Stone) (2001)",
  0.9668656587600708),
 ('Harry Potter and the Deathly Hallows: Part 2 (2011)', 0.966358482837677),
 ('Harry Potter and the Deathly Hallows: Part 1 (2010)', 0.950824499130249),
 ('Chronicles of Narnia: The Lion, the Witch and the Wardrobe, The (2005)',
  0.5834417343139648),
 ('Hunger Games: Catching Fire, The (2013)', 0.5514477491378784),
 ('Golden Compass, The (2007)', 0.545459508895874)]

In [24]:
kv.most_similar(k('Harry Potter and the Deathly Hallows: Part 2'))

[('Harry Potter and the Deathly Hallows: Part 1 (2010)', 0.9812530279159546),
 ('Harry Potter and the Half-Blood Prince (2009)', 0.9710525274276733),
 ('Harry Potter and the Order of the Phoenix (2007)', 0.9680110216140747),
 ('Harry Potter and the Goblet of Fire (2005)', 0.966358482837677),
 ('Harry Potter and the Prisoner of Azkaban (2004)', 0.9584927558898926),
 ('Harry Potter and the Chamber of Secrets (2002)', 0.9265634417533875),
 ("Harry Potter and the Sorcerer's Stone (a.k.a. Harry Potter and the Philosopher's Stone) (2001)",
  0.919805645942688),
 ('Hunger Games: Catching Fire, The (2013)', 0.6579086780548096),
 ('Chronicles of Narnia: The Lion, the Witch and the Wardrobe, The (2005)',
  0.5655975937843323),
 ('Chronicles of Narnia: Prince Caspian, The (2008)', 0.5583797097206116)]

In [25]:
kv.distance(
    k('Harry Potter and the Goblet of Fire'),
    k('Harry Potter and the Deathly Hallows: Part 2'),
)

0.03364157154521097

In [26]:
print(
    k('Harry Potter and the Goblet of Fire'),
    k('Harry Potter and the Deathly Hallows: Part 2'),
    sep='\n',
)

Harry Potter and the Goblet of Fire (2005)
Harry Potter and the Deathly Hallows: Part 2 (2011)


In [27]:
v = kv.vocab['Harry Potter and the Deathly Hallows: Part 2 (2011)']
print(v.count, v.index)

1 8091


In [28]:
df = mainstream_movies
key_counts = df['key'].value_counts()
key_counts[key_counts > 1]

Series([], Name: key, dtype: int64)

In [29]:

df[df.title.str.contains('Harry Potter')]

Unnamed: 0,movieId,title,genres,movieId_orig,key,year,n_ratings,mean_rating
4000,4800,Harry Potter and the Sorcerer's Stone (a.k.a. ...,Adventure|Children|Fantasy,4896,Harry Potter and the Sorcerer's Stone (a.k.a. ...,2001,17239,3.615418
4574,5717,Harry Potter and the Chamber of Secrets,Adventure|Fantasy,5816,Harry Potter and the Chamber of Secrets (2002),2002,14469,3.579198
5742,7769,Harry Potter and the Prisoner of Azkaban,Adventure|Fantasy|IMAX,8368,Harry Potter and the Prisoner of Azkaban (2004),2004,13157,3.752918
6595,10591,Harry Potter and the Goblet of Fire,Adventure|Fantasy|Thriller|IMAX,40815,Harry Potter and the Goblet of Fire (2005),2005,9908,3.723799
7057,11964,Harry Potter and the Order of the Phoenix,Adventure|Drama|Fantasy|IMAX,54001,Harry Potter and the Order of the Phoenix (2007),2007,5967,3.749513
7616,13920,Harry Potter and the Half-Blood Prince,Adventure|Fantasy|Mystery|Romance|IMAX,69844,Harry Potter and the Half-Blood Prince (2009),2009,4245,3.814833
7962,16165,Harry Potter and the Deathly Hallows: Part 1,Action|Adventure|Fantasy|IMAX,81834,Harry Potter and the Deathly Hallows: Part 1 (...,2010,3884,3.884176
8091,17466,Harry Potter and the Deathly Hallows: Part 2,Action|Adventure|Drama|Fantasy|Mystery|IMAX,88125,Harry Potter and the Deathly Hallows: Part 2 (...,2011,3983,3.953822
