# Brief Outline: Exploiting Movie-Movie Similarity

## Movie Information

### Vectorize Movie Information
The movie information mixes free text (title, synopsis), categorical fields (genres), and numerical fields (year).

### Aggregate

### Approximate Nearest Neighbour with Annoy Index

## Use the movies a user has seen to predict most similar movies

In [12]:
# !pip install --upgrade torch

In [21]:
# !pip install sentence-transformers -qq

In [31]:
import pickle
import random
from pathlib import Path
from typing import List

import numpy as np
import pandas as pd
import torch
from annoy import AnnoyIndex
from sentence_transformers import SentenceTransformer
from sklearn.metrics import mean_squared_error
from tqdm import tqdm
from transformers import AutoModel, AutoTokenizer, BertModel, BertTokenizer

tqdm.pandas()

In [14]:
def read(ds: str, data_dir=Path("../data/ext/od-challenge")):
    with (data_dir / f"{ds}.pickle").open("rb") as f:
        df = pickle.load(f)
    return df


aggs = read(ds="aggs")
teams = read(ds="teams")
movies = read(ds="movies")
labels = read(ds="labels")

data_dir = Path("../data/intermediate/")
train, test = pd.read_csv(data_dir / "train.csv"), pd.read_csv(data_dir / "test.csv")

In [28]:
# def write(df, ds, data_dir=Path("../data/intermediate")):
#     df.to_csv(data_dir / f"{ds}.csv", index=False)


# write(aggs, ds="aggs")
# write(teams, ds="teams")
# write(movies, ds="movies")
# write(labels, ds="labels")

In [17]:
df = movies
df

| | movie_id | title | genres | year | synopsis |
|---|---|---|---|---|---|
| 0 | 114709 | Toy Story | {Comedy, Animation, Children, Fantasy, Adventure} | 1995 | A boy called Andy Davis (voice: John Morris) u... |
| 1 | 113497 | Jumanji | {Children, Fantasy, Adventure} | 1995 | The film begins in 1869 in the town of Brantfo... |
| 2 | 113277 | Heat | {Action, Crime, Thriller} | 1995 | An inbound Los Angeles Blue Line train pulls i... |
| 3 | 114319 | Sabrina | {Comedy, Romance} | 1995 | Sabrina Fairchild (Julia Ormond), is the Larra... |
| 4 | 112302 | Tom and Huck | {Children, Adventure} | 1995 | The film opens with Injun Joe (Eric Schweig) a... |
| ... | ... | ... | ... | ... | ... |
| 4102 | 3606756 | Incredibles 2 | {Children, Animation, Action, Adventure} | 2018 | Agent Rick Dicker (Jonathan Banks) is intervie... |
| 4103 | 5463162 | Deadpool 2 | {Comedy, Action, Sci-Fi} | 2018 | After successfully working as the mercenary De... |
| 4104 | 3778644 | Solo: A Star Wars Story | {Children, Action, Sci-Fi, Adventure} | 2018 | In this second 'Star Wars' stand-alone, spin-o... |
| 4105 | 5095030 | Ant-Man and the Wasp | {Comedy, Action, Sci-Fi, Fantasy, Adventure} | 2018 | The film opens in 1987 as Hank Pym (Michael Do... |


In [18]:
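# Build one short natural-language description per movie from its structured fields
description = []
for _, info in df.iterrows():
    genres = ", ".join(info["genres"])  # genres is stored as a set of strings
    text = f"The movie **{info['title']}** was released in {info['year']}. It was mainly known for the following genres: {genres}"
    description.append(text)

# Assign positionally, so the result does not depend on the DataFrame index
df["description"] = description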

In [22]:
model = SentenceTransformer('sentence-transformers/all-mpnet-base-v2')

In [47]:
def get_description_index():
    vector_dir = Path("../models/vector_index/")
    fname = "description.ann"
    index_path = vector_dir / fname
    sz = 768  # embedding dimension of all-mpnet-base-v2
    if index_path.exists():
        u = AnnoyIndex(sz, "angular")
        u.load(str(index_path))  # super fast, will just mmap the file
        return u
    else:
        embeddings = df.description.progress_apply(lambda x: model.encode(x))
        # takes about 10 minutes to run locally on a slow CPU

        t = AnnoyIndex(sz, "angular")  # sz = length of each item vector to be indexed
        for i, vector in enumerate(embeddings):
            t.add_item(i, vector)
        t.build(len(embeddings) // 10)  # number of trees
        vector_dir.mkdir(parents=True, exist_ok=True)  # saving fails if the directory is missing
        t.save(str(index_path))
        return t


t = get_description_index()

# Demo: Movie-Movie Recommendation

In [77]:
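idx = random.randrange(len(movies))  # pick a random movie by row position


def recommend(idx):
    # Ask for 20 neighbours; the first one is normally the query movie itself, so it is dropped
    indices = t.get_nns_by_item(idx, 20)
    return movies.loc[idx, "description"], movies.loc[indices, "description"].iloc[1:]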


input_movie, recommended_movies = recommend(idx)
print(input_movie, "\n---------\n", recommended_movies)

The movie **First Blood (Rambo: First Blood)** was released in 1982. It was mainly known for the following genres: Drama, Action, Thriller, Adventure 
---------
 971     The movie **Rambo: First Blood Part II** was r...
973     The movie **Rambo III** was released in 1988. ...
1462    The movie **Blood Simple** was released in 198...
2981    The movie **Rambo (Rambo 4)** was released in ...
2494    The movie **Blood: The Last Vampire** was rele...
977     The movie **Rocky III** was released in 1982. ...
2009    The movie **Deathtrap** was released in 1982. ...
1675    The movie **Youngblood** was released in 1986....
2352    The movie **Captain Blood** was released in 19...
2169    The movie **In Cold Blood** was released in 19...
1385    The movie **Bloodsport** was released in 1988....
2320    The movie **Never Say Never Again** was releas...
1867    The movie **Vampire Hunter D: Bloodlust (Banpa...
241     The movie **Hellraiser: Bloodline** was releas...
460     The movie **Raging

Next, we use the movie-movie recommendations to recommend movies to a user, as per the original problem statement.
To do this, we find movies similar to those the user has already seen and recommend them.
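
The idea above could be sketched roughly as follows. This is a minimal illustration, not the notebook's actual implementation: the function name `recommend_for_user`, the reciprocal-rank scoring, and the `get_neighbours` callable are all assumptions introduced here. The callable is expected to take `(item_index, n)` and return a list of neighbour indices, nearest first.

```python
from collections import Counter
from typing import Callable, Iterable, List


def recommend_for_user(
    seen: Iterable[int],
    get_neighbours: Callable[[int, int], List[int]],
    k: int = 10,
    n_neighbours: int = 20,
) -> List[int]:
    """Hypothetical helper: rank unseen movies by similarity to the seen ones."""
    seen = list(seen)
    seen_set = set(seen)
    votes: Counter = Counter()
    for movie_idx in seen:
        neighbours = get_neighbours(movie_idx, n_neighbours)
        for rank, neighbour in enumerate(neighbours):
            if neighbour in seen_set:
                continue  # never recommend something the user has already seen
            # Assumed scoring: closer neighbours contribute a larger vote
            votes[neighbour] += 1.0 / (rank + 1)
    return [idx for idx, _ in votes.most_common(k)]


# Toy neighbour lists standing in for a real index (illustrative data only)
toy_neighbours = {0: [0, 1, 2], 1: [1, 2, 3]}
toy_recs = recommend_for_user(
    seen=[0, 1],
    get_neighbours=lambda i, n: toy_neighbours[i][:n],
)
print(toy_recs)
```

With the Annoy index built earlier, `t.get_nns_by_item` has the `(item, n)` shape this sketch expects, so it could be passed directly as `get_neighbours`, provided the user's seen movies are first mapped from `movie_id` to row positions in `movies`.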

**Note on Hit Ratio**:
> A hit is counted if any of the movies we recommend is present among the movies the user rates in the test set.
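
Under that definition, the hit ratio could be computed along these lines. This is a sketch under assumptions: the function name `hit_ratio` and the dict-based inputs (user id mapped to a list of movie ids) are hypothetical, and users with no test ratings are skipped, which the note above does not specify.

```python
from typing import Dict, Hashable, List


def hit_ratio(
    recommendations: Dict[Hashable, List[int]],
    test_items: Dict[Hashable, List[int]],
) -> float:
    """Fraction of test users for whom at least one recommended movie appears in test."""
    # Assumption: only users with at least one rated movie in test are evaluated
    users = [user for user, items in test_items.items() if items]
    if not users:
        return 0.0
    hits = 0
    for user in users:
        recommended = set(recommendations.get(user, []))
        if recommended & set(test_items[user]):
            hits += 1  # any overlap counts as a single hit for this user
    return hits / len(users)


# Illustrative data only: user "a" gets a hit (movie 2), user "b" does not
example_recs = {"a": [1, 2], "b": [3]}
example_test = {"a": [2, 9], "b": [4]}
example_score = hit_ratio(example_recs, example_test)
print(example_score)
```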

## Synopsis Embedding Illustration

In [49]:
# class SynopsisProc:
#     def __init__(self):
#         self.tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
#         self.model = BertModel.from_pretrained("bert-base-uncased")

#     def embed(self, paragraphs):
#         encoded_input = self.tokenizer(
#             paragraphs, padding=True, truncation=True, return_tensors="pt"
#         )
#         output = self.model(**encoded_input)
#         embedding = output.pooler_output
#         return embedding

In [78]:
# sp = SynopsisProc()

In [52]:
# synopsis_embedding = df['synopsis'].progress_apply(lambda x: sp.embed(x))