# Movie Recommendation Engine

In this notebook, we build a simple movie recommendation engine based on the [MovieLens dataset](https://grouplens.org/datasets/movielens/latest/).

We'll **only use the ratings** as the data for our Machine Learning algorithm.

<img src='assets/before_sunrise.png'>


## What we are trying to do:

<img src='assets/problem_setup.png'>

In [None]:
import os

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

%matplotlib inline

In [None]:
from surprise import Reader, Dataset
from surprise import SVD
from surprise import accuracy
from surprise.model_selection import train_test_split

from sklearn.manifold import TSNE

## 1. Prepare Data

In [None]:
DATA_DIR = 'data/ml-latest-small/'

In [None]:
os.listdir(DATA_DIR)

In [None]:
ratings_df = pd.read_csv(os.path.join(DATA_DIR, 'ratings.csv'), usecols=['userId', 'movieId', 'rating'])
ratings_df.shape

In [None]:
ratings_df.head()

In [None]:
movies_df = pd.read_csv(os.path.join(DATA_DIR, 'movies.csv'))

In [None]:
movies_df.head()

In [None]:
rmdf = pd.merge(ratings_df, movies_df, on='movieId', how='left')
rmdf.head()

In [None]:
# MovieLens ratings run from 0.5 to 5.0 in half-star steps
reader = Reader(rating_scale=(0.5, 5.0))
data = Dataset.load_from_df(rmdf[['userId', 'movieId', 'rating']], reader)

In [None]:
trainset, testset = train_test_split(data, test_size=0.1, random_state=42)

## 2. Train model

In [None]:
algo = SVD(n_factors=200, random_state=42)

In [None]:
algo.fit(trainset)
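
For intuition about what the fitted model computes: Surprise's `SVD` estimates a rating as the global mean, plus a user bias, plus an item bias, plus the dot product of the user's and the item's latent factor vectors. The sketch below reproduces that sum in plain Python. The function name `predict_rating` and all numbers are made up for illustration; they are not taken from the trained model. Surprise also clips the final estimate to the `rating_scale` given to `Reader`, which this sketch leaves out.

```python
def predict_rating(global_mean, user_bias, item_bias, user_factors, item_factors):
    """Biased matrix-factorization estimate: mu + b_u + b_i + q_i . p_u."""
    interaction = sum(p * q for p, q in zip(user_factors, item_factors))
    return global_mean + user_bias + item_bias + interaction


# Hypothetical values for one (user, movie) pair with 3 latent factors.
toy_estimate = predict_rating(
    global_mean=3.5,
    user_bias=0.2,
    item_bias=-0.1,
    user_factors=[0.5, -1.0, 0.25],
    item_factors=[1.0, 0.5, 2.0],
)
print(round(toy_estimate, 2))
```

In the trained `algo`, the corresponding quantities live in `trainset.global_mean`, `algo.bu`, `algo.bi`, `algo.pu`, and `algo.qi`, all indexed by Surprise's inner IDs rather than the raw `userId` and `movieId`.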

## 3. Test Model

In [None]:
test_pred = algo.test(testset)

accuracy.rmse(test_pred, verbose=True)
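
As a reminder of what this metric measures, RMSE is the square root of the mean squared difference between true and estimated ratings, so lower is better and the value is in rating units. Here is a minimal sketch in plain Python; the helper name `rmse` and the toy pairs are illustrative assumptions, not output from this notebook.

```python
import math


def rmse(pairs):
    """Root-mean-square error over (true_rating, estimated_rating) pairs."""
    squared_errors = [(true - est) ** 2 for true, est in pairs]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Hypothetical (true, estimated) ratings, made up for illustration only.
toy_pairs = [(4.0, 3.5), (2.0, 2.5), (5.0, 4.0), (3.0, 3.0)]
toy_rmse = rmse(toy_pairs)
print(round(toy_rmse, 4))
```

Each Surprise prediction exposes the true rating as `r_ui` and the estimate as `est`, so the same helper could be fed `[(p.r_ui, p.est) for p in test_pred]` as a sanity check against `accuracy.rmse`.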

## 4. Use Model (Make Predictions)

We can predict how much a user will like each movie, then sort the movies that user hasn't watched by predicted score to get the user's top picks.

<img src='assets/score_to_rank.png'>

<img src='assets/top_picks.png'>
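
The score-then-rank idea can be sketched independently of any library: drop the movies the user has already watched, sort the rest by predicted score, and keep the first few. The function name `top_n_unwatched` and the toy scores below are hypothetical and used only for illustration.

```python
def top_n_unwatched(predicted_scores, watched_ids, n=10):
    """Return the n highest-scoring (movie_id, score) pairs the user has not watched."""
    candidates = [
        (movie_id, score)
        for movie_id, score in predicted_scores.items()
        if movie_id not in watched_ids
    ]
    # Highest predicted score first
    return sorted(candidates, key=lambda pair: pair[1], reverse=True)[:n]


# Hypothetical predicted scores keyed by movie ID; movie 30 is already watched.
toy_scores = {10: 4.2, 20: 3.1, 30: 4.8, 40: 2.5, 50: 4.5}
toy_watched = {30}
top_picks = top_n_unwatched(toy_scores, toy_watched, n=3)
print(top_picks)
```

The cells that follow do the same thing with the trained model, using `algo.predict` to produce the scores and pandas to sort them.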

In [None]:
user_id = 1
user_movies = rmdf.query(f'userId == {user_id}')

In [None]:
user_movies.sort_values(by='rating', ascending=False).head(10)

In [None]:
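# Use a set so each membership check is O(1) instead of rescanning the user's movies
watched_ids = set(user_movies['movieId'])
unwatched_movies = [m for m in ratings_df['movieId'].unique() if m not in watched_ids]

# Predict a score for every movie the user has not watched
scores = [(m, algo.predict(user_id, m).est) for m in unwatched_movies]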

In [None]:
sdf = pd.DataFrame(scores, columns=['movieId', 'Score'])
sdf = sdf.merge(movies_df, on='movieId', how='left')
sdf.sort_values(by='Score', ascending=False).head(10)

## Visualization (Extra)

In [None]:
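# The 50 most-rated movies, most popular first
popular_50_movie_ids = rmdf.groupby(by='movieId')['rating'].count().sort_values(ascending=False).head(50).index
# Map raw movie IDs to Surprise's inner item IDs (raises ValueError for a movie absent from trainset)
popular_50_iids = [trainset.to_inner_iid(movie_id) for movie_id in popular_50_movie_ids]

# Attach titles, keeping the same order as popular_50_iids
df_50 = pd.DataFrame({'movieId': popular_50_movie_ids})
df_50 = pd.merge(df_50, movies_df, on='movieId', how='left')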

In [None]:
movie_embeddings = algo.qi[popular_50_iids]
movie_embeddings_2d = TSNE(n_components=2, random_state=42).fit_transform(movie_embeddings)

Note: Visualization adapted from [this notebook](https://hodapple.com/blag/posts/2018-04-08-recommender-systems-1.html).

In [None]:
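plt.figure(figsize=(15, 15))
# Render each title (minus the trailing year) as a mathtext marker; "\ " keeps the spaces
markers = ["$ {} $".format(r"\ ".join(m.split(" ")[:-1])) for m in df_50["title"]]
for (x, y), marker in zip(movie_embeddings_2d, markers):
    # Scale marker area with title length so long titles stay legible
    plt.scatter(x, y, marker=marker, alpha=0.75, s=50 * len(marker) ** 2)
plt.show()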

## Epilogue

To emphasize once more: the model used only the users' ratings (interaction data) and nothing directly about the movies themselves. That's the power of data!