In [1]:
# !python --version

# OnDeck ML Challenge

This notebook is part of the [OnDeck ML Challenge](https://gist.github.com/thaumant/ea2f03b5546f36b72443357238edd2d9)

- Build the first iteration of a recommender system that could suggest top-recommended movies to users.
- Provide a GitHub repo (private is fine) with the working model and some description of how to reproduce training/testing steps.
- Provide written answers to these questions:
  - A brief justification of the chosen model and the scoring metric. What alternatives would be worth exploring?
  - What would be your next steps to improve the recommendation quality?

## Chosen Modeling Approaches

## Scoring Metric


Any recommendation system can be evaluated with several kinds of scoring metric. Search-inspired (ranking) metrics include Hit Ratio, Precision@K, Recall@K, mean reciprocal rank and nDCG. Regression-inspired metrics include RMSE and MAE.
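
To make the search-inspired metrics concrete, here is a minimal sketch of how they could be computed for one user's ranked list under binary relevance. The function names are illustrative and not part of this notebook's code. `ranked` is assumed to be the recommended movie ids in order, and `relevant` the set of movies the user actually interacted with.

```python
import math
from typing import Sequence, Set


def hit_at_k(ranked: Sequence[int], relevant: Set[int], k: int) -> float:
    """1.0 if any of the top-k recommendations is relevant, else 0.0."""
    return float(any(item in relevant for item in ranked[:k]))


def precision_at_k(ranked: Sequence[int], relevant: Set[int], k: int) -> float:
    """Share of the k recommendation slots filled with relevant items."""
    return sum(item in relevant for item in ranked[:k]) / k


def recall_at_k(ranked: Sequence[int], relevant: Set[int], k: int) -> float:
    """Share of the relevant items that show up in the top-k."""
    if not relevant:
        return 0.0
    return sum(item in relevant for item in ranked[:k]) / len(relevant)


def reciprocal_rank(ranked: Sequence[int], relevant: Set[int]) -> float:
    """1 / rank of the first relevant item, or 0.0 if there is none."""
    for rank, item in enumerate(ranked, start=1):
        if item in relevant:
            return 1.0 / rank
    return 0.0


def ndcg_at_k(ranked: Sequence[int], relevant: Set[int], k: int) -> float:
    """Binary-relevance nDCG: DCG of the list divided by the best achievable DCG."""
    dcg = sum(
        1.0 / math.log2(rank + 1)
        for rank, item in enumerate(ranked[:k], start=1)
        if item in relevant
    )
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(rank + 1) for rank in range(1, ideal_hits + 1))
    return dcg / idcg if idcg else 0.0
```

Averaging `reciprocal_rank` over users gives the mean reciprocal rank, and averaging `hit_at_k` over users gives a Hit Ratio.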

I'm picking metrics with 2 separate intents: 
1. Modeling movie-user interaction as implicit feedback
2. Modeling movie-user interaction as explicit feedback

For comparing approaches across 

Metric 1 is always Hit Ratio.

## Improving Recommendation Quality

There are 2 main approaches for improving recommendation quality:

**Modeling/Algorithmic Improvements**

1. In addition to SVD, other matrix factorisation methods are worth exploring, e.g. [ALS](https://www.elenacuoco.com/2016/12/22/alternating-least-squares-als-spark-ml/#:~:text=ALS%20recommender%20is%20a%20matrix,algorithm%20in%20a%20parallel%20fashion.), a well-known and popular baseline.

1. Parameter tuning for all the approaches: I used a **GridSearchCV**-based approach to optimise SVD, but I didn't optimise the hyperparameters for Neural Collaborative Filtering.

1. Better/domain-specific embeddings: We use embeddings which are trained on a large open-domain dataset ranging from Wikipedia to Reddit. A movie-specific embedding, built using a language model trained on movie text, might yield better similarity results.
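
As a rough illustration of the ALS idea from the first item, the plain-Python sketch below alternates ridge-regression updates for user and movie factors on explicit ratings. It is a toy under stated assumptions: the helper names (`solve`, `update_side`, `als`, `rmse`) are hypothetical, the solver is hand-written, and nothing here is the Spark implementation linked above or a tuned baseline.

```python
import random
from collections import defaultdict


def solve(a, b):
    """Solve a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (m[i][n] - sum(m[i][j] * x[j] for j in range(i + 1, n))) / m[i][i]
    return x


def update_side(ratings_by_key, other, k, reg):
    """Refit one side's factors with the other side held fixed (one ridge regression per key)."""
    out = {}
    for key, pairs in ratings_by_key.items():
        a = [[reg if i == j else 0.0 for j in range(k)] for i in range(k)]
        b = [0.0] * k
        for other_key, rating in pairs:
            q = other[other_key]
            for i in range(k):
                b[i] += rating * q[i]
                for j in range(k):
                    a[i][j] += q[i] * q[j]
        out[key] = solve(a, b)
    return out


def als(ratings, k=2, reg=0.1, n_iters=10, seed=0):
    """ratings: list of (user_id, movie_id, rating). Returns (user_factors, movie_factors)."""
    rng = random.Random(seed)
    by_user, by_movie = defaultdict(list), defaultdict(list)
    for u, m, r in ratings:
        by_user[u].append((m, r))
        by_movie[m].append((u, r))
    movie_f = {m: [rng.uniform(-0.5, 0.5) for _ in range(k)] for m in by_movie}
    user_f = {}
    for _ in range(n_iters):
        user_f = update_side(by_user, movie_f, k, reg)   # movies fixed, solve for users
        movie_f = update_side(by_movie, user_f, k, reg)  # users fixed, solve for movies
    return user_f, movie_f


def rmse(ratings, user_f, movie_f):
    """Root mean squared error of the dot-product predictions on `ratings`."""
    se = [(r - sum(p * q for p, q in zip(user_f[u], movie_f[m]))) ** 2 for u, m, r in ratings]
    return (sum(se) / len(se)) ** 0.5
```

On this dataset the input could be built with `list(train[["user_id", "movie_id", "rating"]].itertuples(index=False, name=None))`. The pure-Python loops are only meant to show the alternating structure; a real run would use a vectorised or library implementation.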

**Improve dataset usage**

1. In the present implementation, I do not use the `teams` or `aggs` information at all. Using `teams`, we can exploit actor/writer/director preferences of the user, if any.

1. Incorporate popular demographic information (from `aggs`) and year (from `movies`) as categorical variables, which could also help improve the movie-similarity recommendations.

1. Given any information about the user itself, e.g. the demographic - we can establish

[Dataset link](https://www.dropbox.com/s/vi7lktdxx0r97o4/od-challenge.tar.gz?dl=1).

## Notes

- You define the technical details of the solution: the model, feature set, test/train split, normalization, loss, scoring metric.
- You're not expected to use as many features as possible. Explore the data and take what works best for your approach.
- You're not expected to produce a model with perfect scores. It's enough to pick a sensible model, make a few tweak iterations, produce some results, and outline a path to improve it.

## Technical constraints

- Python 3.6+.
- The model, the training and testing code should be in Jupyter notebooks. Everything else may be in notebooks or python files.
- If training takes more than an hour on a laptop, provide the trained model and the code to load it.

## Dataset details

`movies.pickle` (4107 rows) — basic info about movies:

| Column     | Type       | Example                             | Notes                     |
| :--------- | :--------- | :---------------------------------- | :------------------------ |
| `movie_id` | `int`      | `109830`                            |
| `title`    | `str`      | `"Forrest Gump"`                    |
| `genres`   | `set[str]` | `{"Romance", "Comedy"}`             |
| `year`     | `int`      | `1994`                              |
| `synopsis` | `str`      | `"The film begins with feather..."` | Detailed plot description |

`aggs.pickle` (28557 rows) — aggregated ratings, total and by demographic:

| Column           | Type    | Example       | Notes                                     |
| :--------------- | :------ | :------------ | :---------------------------------------- |
| `movie_id`       | `int`   | `109830`      |
| `rating_average` | `float` | `8.8`         |
| `rating_count`   | `int`   | `304`         | Number of ratings collected for the group |
| `demographic`    | `str`   | `"age_18_29"` | Group name: total, by age, by gender      |

`teams.pickle` (190547 rows) — cast & crew:

| Column        | Type  | Example             | Notes                       |
| :------------ | :---- | :------------------ | :-------------------------- |
| `movie_id`    | `int` | `109830`            |
| `person_role` | `str` | `"actor"`           | Enum: actor/director/writer |
| `person_id`   | `int` | `37097`             |
| `person_name` | `str` | `"Giovanni Arpino"` |

`labels.pickle` (42237 rows) — user ratings to use as labels:

| Column     | Type    | Example  | Notes      |
| :--------- | :------ | :------- | :--------- |
| `movie_id` | `int`   | `109830` |
| `user_id`  | `int`   | `184`    |
| `rating`   | `float` | `8.8`    | 1-10 scale |

In [2]:
import pickle
import random
import tarfile
import urllib.request
from pathlib import Path
from typing import List

import pandas as pd
from sklearn.model_selection import train_test_split

%load_ext autoreload
%autoreload 2

Path.ls = lambda x: list(x.iterdir())

In [3]:
def get_data(
    url: str = "https://www.dropbox.com/s/vi7lktdxx0r97o4/od-challenge.tar.gz?dl=1",
    file_name: str = "../data/raw/od-challenge.tar.gz",
) -> None:
    """Download the archive from `url` and save it locally under `file_name`."""
    # Create the target directory first; urlretrieve fails if it does not exist.
    Path(file_name).parent.mkdir(parents=True, exist_ok=True)
    urllib.request.urlretrieve(url, file_name)


get_data()

In [4]:
with tarfile.open("../data/raw/od-challenge.tar.gz") as file:
    file.extractall("../data/ext")

In [5]:
data_dir = Path("../data/ext/od-challenge")
assert data_dir.exists()
files = data_dir.ls()
files

[PosixPath('../data/ext/od-challenge/aggs.pickle'),
 PosixPath('../data/ext/od-challenge/teams.pickle'),
 PosixPath('../data/ext/od-challenge/movies.pickle'),
 PosixPath('../data/ext/od-challenge/labels.pickle')]

In [6]:
def read(data_dir, ds: str):
    with (data_dir / f"{ds}.pickle").open("rb") as f:
        df = pickle.load(f)
    return df

In [7]:
aggs = read(data_dir, "aggs")
teams = read(data_dir, "teams")
movies = read(data_dir, "movies")
labels = read(data_dir, "labels")

In [8]:
print(
    f"There are {labels.movie_id.nunique()} movies with {len(labels)} ratings against them. There are also {labels.user_id.nunique()} users only"
)

There are 4107 movies with 42237 ratings against them. There are also 610 users only


The relatively small number of user-movie ratings suggests that training a deep learning model from scratch might not be a great approach. We still have the option of using deep-learning-based approaches that don't require large amounts of training data.

In [9]:
labels.head()
labels.movie_id.nunique(), labels.user_id.nunique(), labels.rating.value_counts()

(4107,
 610,
 4.0    11815
 3.0     7859
 5.0     7045
 3.5     4784
 4.5     3961
 2.0     2685
 2.5     1635
 1.0     1219
 0.5      669
 1.5      565
 Name: rating, dtype: int64)

In [10]:
movies.head()
movies.movie_id.nunique(), movies.genres.value_counts(), movies.year.value_counts()

(4107,
 {Drama}                                         328
 {Comedy}                                        307
 {Comedy, Drama}                                 141
 {Romance, Drama}                                138
 {Comedy, Romance}                               137
                                                ... 
 {Horror, Fantasy, Romance, Thriller}              1
 {Fantasy, Action}                                 1
 {Thriller, Romance, Horror, Drama}                1
 {Horror, Fantasy, Crime}                          1
 {Fantasy, Sci-Fi, Adventure, Comedy, Action}      1
 Name: genres, Length: 683, dtype: int64,
 2006    153
 2007    149
 2008    144
 2005    143
 2009    141
        ... 
 1928      1
 1922      1
 1902      1
 1924      1
 1920      1
 Name: year, Length: 98, dtype: int64)

In [11]:
teams.head()
teams.movie_id.nunique(), teams.person_role.nunique(), teams.person_id.nunique(), teams.person_id.value_counts()

(4107,
 3,
 107832,
 498278     56
 919798     55
 168        55
 230        52
 134        52
            ..
 1295573     1
 1005623     1
 1296206     1
 472790      1
 88746       1
 Name: person_id, Length: 107832, dtype: int64)

In [12]:
genres = movies.genres.tolist()

In [13]:
from collections import Counter

# Flatten the per-movie genre sets into one list, then count each genre.
flat_genres = []
for g in genres:
    flat_genres.extend(list(g))

Counter(flat_genres)

Counter({'Children': 316,
         'Animation': 259,
         'Fantasy': 419,
         'Adventure': 744,
         'Comedy': 1489,
         'Thriller': 1099,
         'Crime': 610,
         'Action': 1055,
         'Romance': 666,
         'Drama': 1821,
         'Horror': 497,
         'Sci-Fi': 580,
         'Mystery': 318,
         'War': 180,
         'IMAX': 136,
         'Western': 74,
         'Musical': 143,
         'Film-Noir': 42,
         'Documentary': 49})

In [14]:
aggs.head()

Unnamed: 0,movie_id,rating_average,rating_count,demographic
0,109830,8.8,1870499,total
1,109830,8.8,1032446,gender_m
2,109830,8.7,269593,gender_f
3,109830,8.9,1215,age_0_17
4,109830,8.9,344804,age_18_29


In [15]:
aggs.movie_id.nunique()

4107

# Possible Features for Every Movie:

1. Genre (binarized?)
2. Year of release (continuous)
3. Vector embedding of the synopsis, say 768 or 300 dimensions
4. Some combination of the top demographic with its rating count and rating average, to capture which demographic a movie is "popular in"
5. Preference for actor/writer/director? What about `person_id`? There are too many distinct values there. Maybe retain only the most popular?
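
Features 1 and 5 could be prototyped roughly as below. This is a sketch with hypothetical helper names (`genre_vocabulary`, `multi_hot`, `frequent_people`) and an arbitrary `min_credits` threshold; it only assumes that `movies.genres` holds sets of strings and `teams.person_id` holds one id per credit row, as shown in the tables above.

```python
from collections import Counter
from typing import Iterable, List, Set


def genre_vocabulary(genre_sets: Iterable[Set[str]]) -> List[str]:
    """Sorted list of every genre seen; fixes the column order of the encoding."""
    return sorted(set().union(*genre_sets))


def multi_hot(genres: Set[str], vocabulary: List[str]) -> List[int]:
    """Binarize one movie's genre set: 1 where the genre is present, else 0."""
    return [int(g in genres) for g in vocabulary]


def frequent_people(person_ids: Iterable[int], min_credits: int) -> Set[int]:
    """Keep only people who appear in at least `min_credits` credit rows."""
    counts = Counter(person_ids)
    return {pid for pid, n in counts.items() if n >= min_credits}


# Possible usage in this notebook (assumed, not run here):
#   vocab = genre_vocabulary(movies.genres)
#   genre_matrix = [multi_hot(g, vocab) for g in movies.genres]
#   keep = frequent_people(teams.person_id, min_credits=5)
```

Filtering people by credit count is one simple way to shrink the `person_id` space before encoding it. A person credited in several roles on the same movie is counted once per row here.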

# How to split data?

There is more than one way to think about recommending a movie. Here are a few: 

1. Leave-one-last-out: Hold out the last movie from each user's stream of rated movies, and predict it
2. Random Movie: For each user, split interactions into train and test data
3. Random User: Split some users into train, the rest into test
4. Random Rating: Split some ratings into train, the rest into test, without checking whether a given user or movie appears in both splits
5. Time-based: Split within each user, based on rating timestamp

Since we don't have the time of rating, we can safely ignore the time-based split. Leave-one-last-out also requires making some assumption about the data ordering, so I am choosing to skip that. From the remaining options, to keep things simple, I'm choosing `Random Rating`. 
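
For comparison, the `Random Movie` option (a split within each user) and a quick cold-start check could look roughly like this. It is a sketch on plain tuples with hypothetical function names, not the split used in this notebook; `ratings` is assumed to be a list of `(user_id, movie_id, rating)` rows.

```python
import random
from collections import defaultdict


def split_per_user(ratings, test_fraction=0.2, seed=42):
    """Hold out a share of each user's ratings, so every user with data stays in train."""
    rng = random.Random(seed)
    by_user = defaultdict(list)
    for row in ratings:
        by_user[row[0]].append(row)
    train_rows, test_rows = [], []
    for rows in by_user.values():
        rows = rows[:]  # copy before shuffling
        rng.shuffle(rows)
        n_test = int(len(rows) * test_fraction)  # users with few ratings end up fully in train
        test_rows.extend(rows[:n_test])
        train_rows.extend(rows[n_test:])
    return train_rows, test_rows


def cold_start_counts(train_rows, test_rows):
    """Number of test users and test movies that never appear in train."""
    train_users = {u for u, _, _ in train_rows}
    train_movies = {m for _, m, _ in train_rows}
    unseen_users = {u for u, _, _ in test_rows} - train_users
    unseen_movies = {m for _, m, _ in test_rows} - train_movies
    return len(unseen_users), len(unseen_movies)


# Possible usage in this notebook (assumed, not run here):
#   rows = list(labels[["user_id", "movie_id", "rating"]].itertuples(index=False, name=None))
#   train_rows, test_rows = split_per_user(rows)
#   cold_start_counts(train_rows, test_rows)
```

`cold_start_counts` works on any pair of splits, so it can also be applied to a `Random Rating` split to see how many test users or movies a collaborative model would never have seen during training.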

# Simple Baseline: Popular Movies


Before we start modeling the movie-user interaction and preference, I'd like to implement a baseline approach: 
What if we recommend the same most popular movies to every user? When measuring popularity, we're looking for movies which are rated often, independent of their rating. We limit ourselves to the **Top 10** movies. 

In [16]:
labels.describe()

Unnamed: 0,movie_id,user_id,rating
count,42237.0,42237.0,42237.0
mean,387299.5,303.1502,3.610318
std,645102.0,177.654595,1.06375
min,417.0,1.0,0.5
25%,100405.0,144.0,3.0
50%,118636.0,305.0,4.0
75%,350258.0,454.0,4.5
max,6644200.0,610.0,5.0


In [17]:
train, test = train_test_split(
    labels, test_size=0.2, shuffle=True, stratify=labels.rating, random_state=42
)

In [18]:
target_dir = Path("../data/intermediate")
target_dir.mkdir(exist_ok=True)
train.to_csv(target_dir / "train.csv", index=False)
test.to_csv(target_dir / "test.csv", index=False)

In [19]:
!ls -l {target_dir}

total 1312
-rw-r--r--  1 nirantk  staff  124049 Sep 24 19:56 test.csv
-rw-r--r--  1 nirantk  staff  496384 Sep 24 19:56 train.csv


In [20]:
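top_k = 10
# value_counts() sorts by frequency in descending order, so the first top_k
# index entries are the most frequently rated movies in the training split.
popular_movies = train.movie_id.value_counts().index[:top_k].tolist()
popular_movies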

[109830, 111161, 110912, 102926, 133093, 108052, 112573, 107290, 76759, 112384]

# How do we count a hit? 

1. We predict a list of Top K movies which the user should have rated
2. If the user has in fact seen ANY of the Top K movies, we get a hit, else a miss
3. The mean across all users in the target split is our hit rate

In [21]:
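def user_hits(predicted_movies: List[int], seen_movies: List[int]) -> bool:
    """Return True if at least one predicted movie is among the user's movies."""
    return len(set(predicted_movies) & set(seen_movies)) > 0


def calc_hit_rate(split):
    hits = []
    # Note: iterating over `split.user_id` visits one entry per rating row, not per
    # distinct user, so users with more ratings in `split` carry more weight in the mean.
    for user_id in split.user_id:
        seen_movies = split[split.user_id == user_id].movie_id.unique()
        hits.append(user_hits(popular_movies, seen_movies))

    return sum(hits) / len(hits)


calc_hit_rate(test)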

0.29853219696969696
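
Because `calc_hit_rate` loops over `split.user_id`, it produces one entry per rating row, so the figure above is weighted by how many test ratings each user has. If the intent is a mean over distinct users, one possible variant is sketched below. The function name `hit_rate_per_user` is hypothetical, and it takes plain `(user_id, movie_id)` pairs so it does not depend on the DataFrame layout.

```python
from collections import defaultdict
from typing import Iterable, List, Tuple


def hit_rate_per_user(pairs: Iterable[Tuple[int, int]], recommended: List[int]) -> float:
    """Fraction of distinct users with at least one recommended movie among their movies."""
    recommended_set = set(recommended)
    seen_by_user = defaultdict(set)
    for user_id, movie_id in pairs:
        seen_by_user[user_id].add(movie_id)
    if not seen_by_user:
        return 0.0
    hits = sum(bool(seen & recommended_set) for seen in seen_by_user.values())
    return hits / len(seen_by_user)


# Possible usage in this notebook (assumed, not run here):
#   hit_rate_per_user(zip(test.user_id, test.movie_id), popular_movies)
```

Each user counts exactly once here, regardless of how many ratings they have in the split, and the single pass over the pairs avoids re-filtering the DataFrame for every row.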