# Brief Outline: Exploiting Movie-Movie Similarity

## Movie Information

### Vectorize Movie Information
Each movie record mixes text fields (title, genres, synopsis) with a numerical field (year).

### Aggregate

### Approximate Nearest Neighbour with Annoy Index

## Use the movies a user has seen to predict most similar movies
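One simple way to do this, offered as an assumption rather than the notebook's confirmed method, is to average the embeddings of the movies a user has seen and rank the remaining movies by cosine similarity to that average. The function name `recommend_similar` and the toy data are hypothetical. This version is an exact, brute-force search; an approximate index would serve the same purpose on a larger catalogue.

```python
import numpy as np


def recommend_similar(embeddings, seen_idx, top_k=3):
    """Rank unseen movies by cosine similarity to the mean of the seen ones."""
    # L2-normalise rows so that a dot product equals cosine similarity.
    norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
    unit = embeddings / np.clip(norms, 1e-12, None)

    # Hypothetical user profile: the average embedding of the seen movies.
    profile = unit[seen_idx].mean(axis=0)
    profile = profile / max(np.linalg.norm(profile), 1e-12)

    scores = unit @ profile
    scores[seen_idx] = -np.inf  # never recommend what was already seen
    return np.argsort(-scores)[:top_k]


# Toy 2-d embeddings: rows 0-2 point roughly one way, rows 3-4 another.
toy = np.array([[1.0, 0.0], [0.9, 0.1], [0.8, 0.3], [0.0, 1.0], [0.1, 0.9]])
recommended = recommend_similar(toy, seen_idx=[0, 1], top_k=2)
```

Here the user has seen rows 0 and 1, so row 2 (pointing the same way) ranks first and row 4 second.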

In [1]:
import pickle
import random
from pathlib import Path
from typing import List

import numpy as np
import pandas as pd
import torch
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from tqdm.notebook import tqdm
from transformers import AutoModel, AutoTokenizer, BertModel, BertTokenizer

# import matplotlib.pyplot as plt
# %matplotlib inline

In [2]:
def read(ds: str, data_dir=Path("../data/ext/od-challenge")):
    with (data_dir / f"{ds}.pickle").open("rb") as f:
        df = pickle.load(f)
    return df


aggs = read(ds="aggs")
teams = read(ds="teams")
movies = read(ds="movies")
labels = read(ds="labels")

data_dir = Path("../data/intermediate/")
train, test = pd.read_csv(data_dir / "train.csv"), pd.read_csv(data_dir / "test.csv")

In [3]:
movies

| | movie_id | title | genres | year | synopsis |
|---|---|---|---|---|---|
| 0 | 114709 | Toy Story | {Fantasy, Adventure, Animation, Comedy, Children} | 1995 | A boy called Andy Davis (voice: John Morris) u... |
| 1 | 113497 | Jumanji | {Adventure, Fantasy, Children} | 1995 | The film begins in 1869 in the town of Brantfo... |
| 2 | 113277 | Heat | {Thriller, Crime, Action} | 1995 | An inbound Los Angeles Blue Line train pulls i... |
| 3 | 114319 | Sabrina | {Romance, Comedy} | 1995 | Sabrina Fairchild (Julia Ormond), is the Larra... |
| 4 | 112302 | Tom and Huck | {Adventure, Children} | 1995 | The film opens with Injun Joe (Eric Schweig) a... |
| ... | ... | ... | ... | ... | ... |
| 4102 | 3606756 | Incredibles 2 | {Adventure, Action, Animation, Children} | 2018 | Agent Rick Dicker (Jonathan Banks) is intervie... |
| 4103 | 5463162 | Deadpool 2 | {Comedy, Action, Sci-Fi} | 2018 | After successfully working as the mercenary De... |
| 4104 | 3778644 | Solo: A Star Wars Story | {Adventure, Action, Sci-Fi, Children} | 2018 | In this second 'Star Wars' stand-alone, spin-o... |
| 4105 | 5095030 | Ant-Man and the Wasp | {Fantasy, Sci-Fi, Adventure, Action, Comedy} | 2018 | The film opens in 1987 as Hank Pym (Michael Do... |


In [4]:
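df = movies
descriptions = []
# Build one short natural-language description per movie from its metadata.
for _, info in df.iterrows():
    genres = ", ".join(info["genres"])
    text = (
        f"The movie **{info['title']}** was released in {info['year']}. "
        f"It was mainly known for the following genres: {genres}"
    )
    descriptions.append(text)

# Assign the list directly; wrapping it in a pd.Series would align on the
# index and yield NaN wherever df's index is not 0..n-1.
df["description"] = descriptions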

In [5]:
## Vectorize Sentence

In [8]:
class SentenceProc:
    def __init__(self):
        self.tokenizer = AutoTokenizer.from_pretrained(
            "sentence-transformers/bert-base-nli-mean-tokens"
        )

        self.model = AutoModel.from_pretrained(
            "sentence-transformers/bert-base-nli-mean-tokens"
        )

    def mean_pooling(self, model_output, attention_mask):
        # Mean Pooling - Take attention mask into account for correct averaging
        # First element of model_output contains all token embeddings
        token_embeddings = model_output[0]
        input_mask_expanded = (
            attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
        )
        return torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(
            input_mask_expanded.sum(1), min=1e-9
        )

    def embed(self, sentences: List[str]):
        encoded_input = self.tokenizer(
            sentences, padding=True, return_tensors="pt"
        )

        with torch.no_grad():
            model_output = self.model(**encoded_input)

        # Perform pooling. In this case, mean pooling.
        sentence_embeddings = self.mean_pooling(
            model_output, encoded_input["attention_mask"]
        )
        return sentence_embeddings


sentences = ["This is an example sentence", "Each sentence is converted"]
s = SentenceProc()
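
To make the masked averaging in `mean_pooling` concrete, here is a small worked example on a hand-made tensor. The values are invented for illustration; the three operations mirror the method above.

```python
import torch

# One sequence of 3 token positions with 2-d embeddings; the last position is padding.
token_embeddings = torch.tensor([[[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]])
attention_mask = torch.tensor([[1, 1, 0]])

# Broadcast the mask over the embedding dimension, then average only real tokens.
mask = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
summed = torch.sum(token_embeddings * mask, dim=1)
counts = torch.clamp(mask.sum(dim=1), min=1e-9)  # guards against division by zero
pooled = summed / counts  # mean of [1, 2] and [3, 4]; the padding row is ignored
```

The padded position contributes nothing, so `pooled` is `[[2.0, 3.0]]` rather than an average skewed by the padding values.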

In [None]:
%%time
sentences = df.description.tolist()
description_embeddings = s.embed(sentences)

Asking to truncate to max_length but no maximum length is provided and the model has no predefined maximum length. Default to no truncation.
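
The cell above passes every description to `embed` in a single call. If that does not fit in memory, one option is to embed in batches. The sketch below is an assumption about how that could look, not part of the original notebook; `batched` and `embed_in_batches` are hypothetical helpers, and a stand-in embedder replaces the model so the example runs on its own.

```python
from typing import Callable, Iterator, List, Sequence


def batched(items: Sequence[str], batch_size: int) -> Iterator[Sequence[str]]:
    """Yield consecutive slices of at most batch_size items."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


def embed_in_batches(
    embed_fn: Callable, sentences: Sequence[str], batch_size: int = 64
) -> List:
    """Call embed_fn once per batch and collect the per-batch outputs in order."""
    return [embed_fn(list(batch)) for batch in batched(sentences, batch_size)]


def fake_embed(batch):
    # Stand-in embedder (assumption): returns each text's length instead of a vector.
    return [len(text) for text in batch]


outputs = embed_in_batches(fake_embed, ["a", "bb", "ccc", "dddd", "eeeee"], batch_size=2)
```

With the real model, `embed_fn` would be `s.embed`, and the per-batch tensors could be joined with `torch.cat(outputs, dim=0)`.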


In [None]:
del s

In [36]:
# import tensorflow_hub as hub

# embed = hub.load("https://tfhub.dev/google/universal-sentence-encoder/4")
# embeddings = embed([
#     "The quick brown fox jumps over the lazy dog.",
#     "I am a sentence for which I would like to get its embedding"])

# print(embeddings)

In [61]:
class SynopsisProc:
    def __init__(self):
        self.tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
        self.model = BertModel.from_pretrained("bert-base-uncased")

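    def embed(self, paragraphs):
        # Truncate each synopsis to the model's maximum input length (512 tokens for BERT).
        encoded_input = self.tokenizer(
            paragraphs, padding=True, truncation=True, return_tensors="pt"
        )
        # Inference only: skip gradient tracking to save memory.
        with torch.no_grad():
            output = self.model(**encoded_input)
        # pooler_output is the [CLS] hidden state passed through a dense + tanh layer.
        embedding = output.pooler_output
        return embedding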

In [62]:
sp = SynopsisProc()

Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertModel: ['cls.seq_relationship.weight', 'cls.seq_relationship.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.decoder.weight', 'cls.predictions.transform.dense.bias', 'cls.predictions.bias', 'cls.predictions.transform.dense.weight']
- This IS expected if you are initializing BertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


In [None]:
%%time
synopsis_embedding = sp.embed(df['synopsis'].tolist())