This is my entry for the BBC Data Science Challenge. I have decided to build a movie recommendation system using the MovieLens latest-small dataset, available at https://grouplens.org/datasets/movielens/

### 1. Preparing the Data

Our first step is to import the libraries that we will be using, load in the data as required, and check it seems sensible.

In [360]:
%matplotlib inline
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats
from ast import literal_eval
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.metrics.pairwise import linear_kernel, cosine_similarity

In [361]:
links_df = pd.read_csv('./links.csv')
movies_df = pd.read_csv('./movies.csv')
ratings_df = pd.read_csv('./ratings.csv')
tags_df = pd.read_csv('./tags.csv')

ratings_df.head()

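|   | userId | movieId | rating | timestamp |
|---|--------|---------|--------|-----------|
| 0 | 1 | 1 | 4.0 | 964982703 |
| 1 | 1 | 3 | 4.0 | 964981247 |
| 2 | 1 | 6 | 4.0 | 964982224 |
| 3 | 1 | 47 | 5.0 | 964983815 |
| 4 | 1 | 50 | 5.0 | 964982931 |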


### 2. Trying out Collaborative Filtering Method

Next, we try the Surprise library, in particular its SVD (Singular Value Decomposition) algorithm, to predict how a user will rate a given movie. With those predictions we can recommend new movies to the user, which gives us a recommendation system.

In [362]:
from surprise import Reader, Dataset, SVD
from surprise.model_selection import cross_validate

In [363]:
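# MovieLens ratings run from 0.5 to 5.0, whereas Reader() defaults to a (1, 5) scale
reader = Reader(rating_scale=(0.5, 5.0))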

We can cross-validate the model as follows:

In [364]:
data = Dataset.load_from_df(ratings_df[['userId', 'movieId', 'rating']], reader)
# No manual split is needed: cross_validate below creates the 5 folds itself (cv=5)

In [365]:
algo = SVD()
cross_validate(algo, data, measures=['RMSE', 'MAE'], cv=5, verbose=True)

Evaluating RMSE, MAE of algorithm SVD on 5 split(s).

                  Fold 1  Fold 2  Fold 3  Fold 4  Fold 5  Mean    Std     
RMSE (testset)    0.8714  0.8743  0.8721  0.8781  0.8764  0.8745  0.0025  
MAE (testset)     0.6699  0.6744  0.6704  0.6723  0.6731  0.6720  0.0017  
Fit time          6.67    6.66    6.67    6.57    6.79    6.67    0.07    
Test time         0.22    0.91    0.34    0.39    0.18    0.41    0.26    


{'test_rmse': array([0.87140615, 0.87433554, 0.87208829, 0.87810223, 0.87635016]),
 'test_mae': array([0.66991945, 0.67435171, 0.67039794, 0.67228652, 0.67305869]),
 'fit_time': (6.66949987411499,
  6.664545059204102,
  6.674649953842163,
  6.568749904632568,
  6.794766902923584),
 'test_time': (0.21829771995544434,
  0.9129374027252197,
  0.34131789207458496,
  0.3943631649017334,
  0.1845228672027588)}

We can now build a training set from the full dataset and fit the algorithm to it.

In [366]:
trainset = data.build_full_trainset()
algo.fit(trainset)

<surprise.prediction_algorithms.matrix_factorization.SVD at 0x1a26003ac8>

Now we can construct a *get_recommendations* function that, given a userId, prints up to 10 movies the user rated 5.0 (to get a sense of the movies they like), followed by the 10 movies with the highest predicted rating from our algorithm.

In [388]:
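def get_top_10_based_on_ratings(userId):
    """Print up to 10 movies that the given user rated 5.0."""
    # All ratings made by this user
    specific_user_ratings = ratings_df[ratings_df['userId'] == userId]
    # Keep only the movies the user gave the maximum rating
    top_rated_movies = specific_user_ratings[specific_user_ratings['rating'] == 5.0]
    # Take the first 10 of those (in file order) and look up their titles
    top_movieIds = list(top_rated_movies.head(10).movieId)
    print(movies_df[movies_df['movieId'].isin(top_movieIds)].title.head(10))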

In [389]:
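def get_top_10_predicted_from_algorithm(userId):
    """Print the 10 movies with the highest predicted rating for the given user."""
    # Predict a rating for every movie, including ones the user has already rated
    ratings = [algo.predict(userId, i).est for i in movies_df['movieId']]
    # Attach the predictions to a copy so the global movies_df is left unchanged
    predicted = movies_df.assign(predicted_ratings=ratings)
    predicted = predicted.sort_values(by=['predicted_ratings'], ascending=False)
    print(predicted.title.head(10))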

In [411]:
def get_recommendations(userId):
    
    print("User's Favourite Films:")
    
    get_top_10_based_on_ratings(userId)
    
    print('-'*40)
    
    print('Recommended Films:')
    
    get_top_10_predicted_from_algorithm(userId)

In [412]:
get_recommendations(413)

User's Favourite Films:
15                         Casino (1995)
257                  Pulp Fiction (1994)
277     Shawshank Redemption, The (1994)
398                 Fugitive, The (1993)
461              Schindler's List (1993)
510     Silence of the Lambs, The (1991)
594                       Twister (1996)
659                Godfather, The (1972)
908         To Kill a Mockingbird (1962)
1284            Good Will Hunting (1997)
Name: title, dtype: object
----------------------------------------
Recommended Films:
1616                  Rosemary's Baby (1968)
2582     Guess Who's Coming to Dinner (1967)
8466                         Whiplash (2014)
949     Bridge on the River Kwai, The (1957)
694                        Casablanca (1942)
461                  Schindler's List (1993)
975                    Cool Hand Luke (1967)
946                     Graduate, The (1967)
686                       Rear Window (1954)
961                 Great Escape, The (1963)
Name: title, dtype: object
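
Note that Schindler's List (1993) appears in both lists above: the prediction step scores every movie, including ones the user has already rated. A minimal sketch of filtering those out is shown here. It assumes a hypothetical `predict(user_id, movie_id)` callable standing in for `algo.predict(...).est`; the helper name `recommend_unseen` and the toy scores are illustrative and not part of the notebook.

```python
def recommend_unseen(user_id, all_movie_ids, rated_movie_ids, predict, n=10):
    """Return the n unseen movie ids with the highest predicted rating.

    `predict` is a hypothetical callable (user_id, movie_id) -> float;
    in the notebook it could wrap `algo.predict(user_id, movie_id).est`.
    """
    rated = set(rated_movie_ids)
    # Score only the movies the user has not rated yet.
    scored = [(predict(user_id, m), m) for m in all_movie_ids if m not in rated]
    # Sort by predicted rating, highest first; ties are broken by movie id.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [movie_id for _, movie_id in scored[:n]]


# Toy example with made-up predicted ratings (not real model output).
toy_scores = {1: 4.8, 2: 4.9, 3: 3.1, 4: 4.5}
top = recommend_unseen(
    user_id=413,
    all_movie_ids=[1, 2, 3, 4],
    rated_movie_ids=[2],  # movie 2 is already rated, so it is skipped
    predict=lambda u, m: toy_scores[m],
    n=2,
)
print(top)  # [1, 4]
```

In the notebook, the set of rated ids could be taken from `ratings_df[ratings_df['userId'] == userId].movieId`.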


#### Why did you pick the specific technique(s) you used? 

There is a range of techniques available for building recommendation systems. I opted for a collaborative filtering approach, as matrix factorization methods of this kind performed well in the Netflix Prize competition. More specifically, I used SVD (singular value decomposition, a matrix factorization technique), as it uses a latent factor model to capture the similarities between users and items. Essentially, this turns the recommendation problem into an optimization problem: learn user and item factors that minimise the prediction error on the known ratings. SVD is among the stronger algorithms for this task in the benchmarks published by the Surprise library (see http://surpriselib.com/ for a comparison of various algorithms).
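
To make the latent factor idea concrete, here is a minimal pure-Python sketch of matrix factorization trained by stochastic gradient descent, in the spirit of the algorithm that Surprise calls SVD. It is an illustration under simplifying assumptions (no bias terms, made-up ratings, hand-picked hyperparameters) and not Surprise's implementation, which also learns user and item biases. The function names and the toy data are hypothetical.

```python
import math
import random


def fit_latent_factors(ratings, n_factors=2, n_epochs=500, lr=0.01, reg=0.02, seed=0):
    """Learn user and item factor vectors from (user, item, rating) triples."""
    rng = random.Random(seed)
    users = sorted({u for u, _, _ in ratings})
    items = sorted({i for _, i, _ in ratings})
    # Small random initial factors for every user and item.
    p = {u: [rng.gauss(0, 0.1) for _ in range(n_factors)] for u in users}
    q = {i: [rng.gauss(0, 0.1) for _ in range(n_factors)] for i in items}
    for _ in range(n_epochs):
        for u, i, r in ratings:
            err = r - sum(pf * qf for pf, qf in zip(p[u], q[i]))
            for f in range(n_factors):
                pu, qi = p[u][f], q[i][f]
                # Gradient step on the squared error with L2 regularisation.
                p[u][f] += lr * (err * qi - reg * pu)
                q[i][f] += lr * (err * pu - reg * qi)
    return p, q


def predict_rating(p, q, user, item):
    """Predicted rating is the dot product of the user and item factors."""
    return sum(pf * qf for pf, qf in zip(p[user], q[item]))


def train_rmse(p, q, ratings):
    """RMSE of the factor model on the ratings it was trained on."""
    squared = [(r - predict_rating(p, q, u, i)) ** 2 for u, i, r in ratings]
    return math.sqrt(sum(squared) / len(squared))


# Made-up ratings: users 1 and 2 prefer items "A"/"B", users 3 and 4 prefer "C"/"D".
toy_ratings = [
    (1, "A", 5.0), (1, "B", 4.5), (1, "C", 1.0),
    (2, "A", 4.5), (2, "B", 5.0), (2, "D", 1.5),
    (3, "C", 5.0), (3, "D", 4.5), (3, "A", 1.0),
    (4, "C", 4.5), (4, "D", 5.0), (4, "B", 1.5),
]

p0, q0 = fit_latent_factors(toy_ratings, n_epochs=0)  # untrained factors
p, q = fit_latent_factors(toy_ratings)                # trained factors
print("RMSE before training:", round(train_rmse(p0, q0, toy_ratings), 3))
print("RMSE after training: ", round(train_rmse(p, q, toy_ratings), 3))
```

The error on the toy ratings should drop as the factors are fitted; this is the optimization problem the paragraph above refers to.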

#### What parameters matter in this case? 

In this case, the inputs that matter are the `userId`, `movieId` and `rating` columns, because the model learns which users rate movies similarly to one another. The SVD model itself was run with Surprise's default hyperparameters (`algo = SVD()`).

#### What assumptions does the model make about the data that it is being trained on? 

The model assumes that users who had similar preferences in the past are likely to have similar preferences in the future. It is this assumption that allows us to take a user's rating history, extrapolate from it, and predict items they might enjoy.

#### How may the model adapt over time and continue to provide useful predictions?

As movie interests change, the model could be adapted to give more weight to recent ratings (the `timestamp` column in the ratings dataframe records when each rating was made). This way, as an individual's interests change over time, the system captures the shift and the recommendations are adjusted accordingly.
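
One possible way to express such a recency weight is exponential decay on the age of each rating. The sketch below is an assumption about how this could be done, not something the notebook implements: the function names and the one-year half-life are hypothetical, and it only computes the weights rather than feeding them into the SVD model.

```python
SECONDS_PER_DAY = 86_400


def recency_weight(rating_ts, now_ts, half_life_days=365.0):
    """Exponential-decay weight: 1.0 for a rating made now, 0.5 after one half-life."""
    age_days = max(0.0, (now_ts - rating_ts) / SECONDS_PER_DAY)
    return 0.5 ** (age_days / half_life_days)


def weighted_mean_rating(rows, now_ts, half_life_days=365.0):
    """Recency-weighted mean of (rating, timestamp) pairs."""
    weights = [recency_weight(ts, now_ts, half_life_days) for _, ts in rows]
    return sum(w * r for w, (r, _) in zip(weights, rows)) / sum(weights)


# Toy example: a recent 5.0 and a one-year-old 2.0 (timestamps in Unix seconds).
now = 964982703 + 365 * SECONDS_PER_DAY
rows = [(5.0, now), (2.0, 964982703)]
print(recency_weight(964982703, now))   # 0.5
print(weighted_mean_rating(rows, now))  # 4.0
```

The unweighted mean of the two toy ratings is 3.5, so the weighting pulls the estimate towards the more recent rating.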

We could also incorporate item-based collaborative filtering into our model to create a hybrid. This would further ensure that the model doesn't break down if people's preferences change.
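
For reference, the core of item-based collaborative filtering is a similarity score between pairs of items. A minimal sketch using cosine similarity over the users who rated both items is shown here; the function name and the toy ratings are hypothetical, and how the score would be blended with the SVD predictions is left open.

```python
import math


def item_cosine_similarity(ratings_a, ratings_b):
    """Cosine similarity between two items, each given as a dict of user_id -> rating.

    Only users who rated both items contribute; returns 0.0 if there are none.
    """
    common = set(ratings_a) & set(ratings_b)
    if not common:
        return 0.0
    dot = sum(ratings_a[u] * ratings_b[u] for u in common)
    norm_a = math.sqrt(sum(ratings_a[u] ** 2 for u in common))
    norm_b = math.sqrt(sum(ratings_b[u] ** 2 for u in common))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


# Made-up ratings for three items, keyed by user id.
movie_x = {1: 5.0, 2: 4.0, 3: 1.0}
movie_y = {1: 4.5, 2: 4.0, 3: 1.5}   # rated much like movie_x
movie_z = {1: 1.0, 2: 1.5, 3: 5.0}   # rated roughly the opposite way

print(item_cosine_similarity(movie_x, movie_y))
print(item_cosine_similarity(movie_x, movie_z))
```

Items with a high similarity to films the user already rated highly would then be candidates for recommendation.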

### 3. Evaluation

#### Provide an evaluation of how your model performed – did it perform well?

Yes, the model seems to have performed well. In my opinion, having looked at the results for various users, it offers sensible recommendations.
The mean RMSE on the test sets was 0.8745 and the mean MAE was 0.6720 (from the cross-validation output above), on a rating scale that runs from 0.5 to 5.0.
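
An RMSE value is easier to judge next to a simple baseline. The sketch below shows how RMSE is computed and how a baseline that always predicts the mean training rating could be scored. The function names and the toy ratings are hypothetical, and no baseline figure was computed for the MovieLens data in this notebook.

```python
import math


def rmse(actual, predicted):
    """Root mean squared error between two equal-length sequences of ratings."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    squared = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared) / len(squared))


def global_mean_baseline_rmse(train_ratings, test_ratings):
    """RMSE obtained by predicting the mean training rating for every test rating."""
    mean_rating = sum(train_ratings) / len(train_ratings)
    return rmse(test_ratings, [mean_rating] * len(test_ratings))


# Toy example with made-up ratings: the training mean is 4.0.
print(global_mean_baseline_rmse([4.0, 4.0, 5.0, 3.0], [5.0, 3.0]))  # 1.0
```

If the model's RMSE is clearly below such a baseline on the same test folds, that supports the claim that it has learned something useful about user preferences.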