# Alternative Approach: Content Based kNN Recommendation

## Why? 

Content Based recommendation is excellent at solving the cold start problem, if there are not too many movies seen by a particular user. This often has a different recommendation compared to Collaborative Filtering approaches, and that adds to the quality of the final recommendation when we can blend the two in some meaningful way. 

# Implementation

## Part 1: Similar Movies

We find the most similar movies, given a single input movie:

1. Merge into single sentence: The movie information has a lot of text and numerical information. I combined the movie title, year of release and genre into a single sentence. 
2. Vectorize: The present state of the art, which builds on BERT, but is finetuned on over a billion sentence pairs: `sentence-transformers/all-mpnet-base-v2`
3. Annoy Index: [Annoy](https://github.com/spotify/annoy) allows me to do fast approximate nearest neighbour queries using both: item and vector.

We could also use a similar approach for Synopsis, but given the limited time - I've only illusrated that with code towards the end. 

## Part 2: Recommend Movies to User

1. For each movie which is seen by the user, we pull the k (=3) Nearest Neighbours here which are unseen. 
2. Aggregate and sort by count and most similar
3. Return the Top 10

In [None]:
# !pip install --upgrade torch

In [None]:
# !pip install sentence-transformers -qq

In [1]:
import pickle
import random
from pathlib import Path
from typing import List

import numpy as np
import pandas as pd
import torch
from annoy import AnnoyIndex
from sentence_transformers import SentenceTransformer
from sklearn.metrics import mean_squared_error
from tqdm import tqdm
from transformers import AutoModel, AutoTokenizer, BertModel, BertTokenizer

tqdm.pandas()

In [2]:
def read(ds: str, data_dir=Path("../data/ext/od-challenge")):
    with (data_dir / f"{ds}.pickle").open("rb") as f:
        df = pickle.load(f)
    return df


aggs = read(ds="aggs")
teams = read(ds="teams")
movies = read(ds="movies")
labels = read(ds="labels")

data_dir = Path("../data/intermediate/")
train, test = pd.read_csv(data_dir / "train.csv"), pd.read_csv(data_dir / "test.csv")

In [3]:
# def write(df, ds, data_dir=Path("../data/intermediate")):
#     df.to_csv(data_dir / f"{ds}.csv", index=False)


# write(aggs, ds="aggs")
# write(teams, ds="teams")
# write(movies, ds="movies")
# write(labels, ds="labels")

In [4]:
df = movies
df

Unnamed: 0,movie_id,title,genres,year,synopsis
0,114709,Toy Story,"{Fantasy, Adventure, Animation, Comedy, Children}",1995,A boy called Andy Davis (voice: John Morris) u...
1,113497,Jumanji,"{Fantasy, Adventure, Children}",1995,The film begins in 1869 in the town of Brantfo...
2,113277,Heat,"{Action, Crime, Thriller}",1995,An inbound Los Angeles Blue Line train pulls i...
3,114319,Sabrina,"{Romance, Comedy}",1995,"Sabrina Fairchild (Julia Ormond), is the Larra..."
4,112302,Tom and Huck,"{Adventure, Children}",1995,The film opens with Injun Joe (Eric Schweig) a...
...,...,...,...,...,...
4102,3606756,Incredibles 2,"{Action, Adventure, Children, Animation}",2018,Agent Rick Dicker (Jonathan Banks) is intervie...
4103,5463162,Deadpool 2,"{Action, Comedy, Sci-Fi}",2018,After successfully working as the mercenary De...
4104,3778644,Solo: A Star Wars Story,"{Action, Adventure, Children, Sci-Fi}",2018,"In this second 'Star Wars' stand-alone, spin-o..."
4105,5095030,Ant-Man and the Wasp,"{Fantasy, Action, Adventure, Comedy, Sci-Fi}",2018,The film opens in 1987 as Hank Pym (Michael Do...


In [5]:
description, genre_text = [], []
for row in df.iterrows():
    info = row[1]
    genres = list(info["genres"])
    genres = ", ".join(genres)
    text = f"Movie Title: **{info['title']}** was released in {info['year']}"
    genres = f"Genres: {genres}"
    description.append(text)

df["description"] = pd.Series(description)

In [6]:
model = SentenceTransformer('sentence-transformers/all-mpnet-base-v2')

In [7]:
def get_description_index(new=False):
    vector_dir = Path("../models/vector_index/")
    fname = "description.ann"
    index_path = vector_dir / fname
    sz = 768
    if index_path.exists() and not new:
        u = AnnoyIndex(sz, "angular")
        u.load(str(index_path))  # super fast, will just mmap the file
        return u
    else:
        embeddings = df.description.progress_apply(lambda x: model.encode(x))
        # takes about 20 minutes to run locally on an old 2015 Mac Air

        t = AnnoyIndex(sz, "angular")  # Length of item vector that will be indexed
        for i, vector in enumerate(embeddings):
            t.add_item(i, vector)
        t.build(len(embeddings) // 10)
        t.save(str(index_path))
        return embeddings, t


embeddings, t = get_description_index(new=True)

100%|██████████████████████████████████████████████████████████████████████████████| 4107/4107 [09:02<00:00,  7.57it/s]


# Demo: Movie-Movie Recommendation

In [9]:
random.seed(37)
idx = random.choice(movies.movie_id.tolist())


def similar_movies(movie_id):
    idx = movies[movies.movie_id == movie_id].index[0]
    indices, distances = t.get_nns_by_item(idx, 4, include_distances=True)
    return movies.loc[idx]["description"], movies.loc[indices][1:], distances[1:]


input_movie, recommended_movies, distances = similar_movies(idx)
print(recommended_movies.movie_id, distances)
# print(input_movie, "\n---------\n", recommended_movies)

751     99253
752    103956
837     92513
Name: movie_id, dtype: int64 [0.4094308316707611, 0.5027918815612793, 0.7442076802253723]


Next, we use the movie-movie recommendation to recommend movies to our user - as per the original statement. 
In order to do this, we find movies similar to the movies the user has already seen and recommend those. 

The assumption is that there all users have liked (rating >= 3.5) atleast 2 movies, 1 each across train and test split. 

**Note on Hit Ratio**:
> A hit is counted if any of the movies we recommend, is present in the movies the user rates in test.

In [10]:
set(test.user_id.unique()) - set(train.user_id.unique())

set()

In [11]:
def user_hits(predicted_movies: List[int], seen_movies: List[int]):
    return len(set(predicted_movies) & set(seen_movies)) > 0


def get_seen_movies(test, user_id, threshold=3.0):
    df = test[test.user_id == user_id]
    seen_movies = []
    for user_item_rating in df.iterrows():
        if user_item_rating[1]["rating"] >= threshold:
            seen_movies.append(user_item_rating[1]["movie_id"])
    return seen_movies


def recommend_movies(user_id, train, threshold=3.0, k=10):
    train = train[train.user_id == user_id]
#     train = train[train.rating >= threshold]
    recommended_movies = []
    for movie_id in train.movie_id:
        _, movie_df, distances = similar_movies(movie_id)
        movie_df["similarity"] = [1 - d for d in distances]
        recommended_movies.append(movie_df)
    try:
        recommended_movies = pd.concat(recommended_movies)
    except ValueError:
        return []

    seen_movies = train.movie_id.unique()
    recommended_movies["seen"] = recommended_movies.movie_id.apply(
        lambda x: x not in seen_movies
    )
    recommended_movies = recommended_movies[recommended_movies.seen]
    
    
    recommended_movies["counts"] = recommended_movies["movie_id"].map(
        recommended_movies["movie_id"].value_counts()
    )
    recommended_movies.sort_values(
        by=["similarity"], inplace=True, ascending=False
    )
    return recommended_movies.movie_id.unique()[:k]


def calc_hit_rate(split, train):
    hits = []
    for user_id in tqdm(split.user_id.unique()):
        recommended_movies = recommend_movies(user_id=user_id, train=train)
#         for mv in recommended_movies:
#             print(movies[movies.movie_id == mv][["title", "year", "genres"]])
        seen_movies = get_seen_movies(split, user_id)

#         print("-----------------")
#         for mv in seen_movies:
#             print(movies[movies.movie_id==mv][["title", "genres"]])
#         print("##############")
        hits.append(
            user_hits(predicted_movies=recommended_movies, seen_movies=seen_movies)
        )

    return sum(hits) / len(hits)


calc_hit_rate(test, train)

100%|████████████████████████████████████████████████████████████████████████████████| 608/608 [01:46<00:00,  5.72it/s]


0.2894736842105263

## Synopsis Embedding Illustration

In [12]:
# class SynopsisProc:
#     def __init__(self):
#         self.tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
#         self.model = BertModel.from_pretrained("bert-base-uncased")

#     def embed(self, paragraphs):
#         encoded_input = self.tokenizer(
#             paragraphs, padding=True, truncation=True, return_tensors="pt"
#         )
#         output = self.model(**encoded_input)
#         embedding = output.pooler_output
#         return embedding

In [13]:
# sp = SynopsisProc()

In [14]:
# synopsis_embedding = df['synopsis'].progress_apply(lambda x: sp.embed(x))