# MovieLens Popularity Model

This notebook builds and evaluates a popularity-based baseline recommendation model.

Steps:

1. Data prep - download the raw dataset and prepare it for training and evaluation.
1. Model authoring - build the popularity-based model.
1. Model evaluation - benchmark the popularity model to establish a baseline.



In [4]:
import os

DATASET_FOLDER = os.getcwd()

%matplotlib inline

## Step 1: Data preparation

In [5]:
import shutil
import tempfile
import urllib.request
import zipfile
import pandas as pd

def load_movielens_dataset(dataset_name='ml-latest', workspace_path=DATASET_FOLDER + '/data/movielens'):
    """
    Downloads, extracts, and loads a MovieLens dataset.

    Args:
        dataset_name (str): name of the dataset to load, defaults to 'ml-latest'. Options are listed at https://grouplens.org/datasets/movielens/
        workspace_path (str): folder used for downloading and extracting the archive, defaults to <DATASET_FOLDER>/data/movielens

    Returns:
        tuple: the ratings DataFrame and the movies DataFrame
    """

    archive_url = f'http://files.grouplens.org/datasets/movielens/{dataset_name}.zip'

    print(f'Downloading archive from: {archive_url}, this may take a few minutes...')

    with urllib.request.urlopen(archive_url) as response:
        with tempfile.NamedTemporaryFile(delete=True) as tmp_file:
            shutil.copyfileobj(response, tmp_file)
            with zipfile.ZipFile(tmp_file) as archive:
                archive.extractall(workspace_path)

    print(f'Archive files available at workspace: {workspace_path}')

    ratings = pd.read_csv(f'{workspace_path}/{dataset_name}/ratings.csv')
    movies = pd.read_csv(f'{workspace_path}/{dataset_name}/movies.csv')

    print(f'Loaded {ratings.shape[0]:,} ratings and {movies.shape[0]:,} movies.')

    return ratings, movies

In [7]:
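# Let's load some movie ratings!
# Uncomment the next line to download and extract the archive instead of reading the local copy below.
# ratings, movies = load_movielens_dataset(dataset_name='ml-latest')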

workspace_path = 'data/movielens'
dataset_name = 'ml-latest'

ratings = pd.read_csv(f"{workspace_path}/{dataset_name}/ratings.csv")
movies = pd.read_csv(f"{workspace_path}/{dataset_name}/movies.csv")
user_count = ratings["userId"].nunique()
movie_count = movies["movieId"].nunique()
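
The cell above reads the CSV files directly from the local workspace and leaves the download commented out. A small check along the following lines could decide whether the download is needed at all. This is only a sketch: `dataset_is_extracted` is a hypothetical helper that is not part of the notebook, and the demonstration runs against a temporary directory rather than the real workspace.

```python
from pathlib import Path
import tempfile

def dataset_is_extracted(workspace_path, dataset_name):
    """Return True when both CSV files the notebook reads are present."""
    dataset_dir = Path(workspace_path) / dataset_name
    return all((dataset_dir / name).is_file() for name in ("ratings.csv", "movies.csv"))

# Demonstration against a temporary directory instead of the real workspace.
with tempfile.TemporaryDirectory() as tmp_dir:
    before = dataset_is_extracted(tmp_dir, "ml-latest")
    extracted_dir = Path(tmp_dir) / "ml-latest"
    extracted_dir.mkdir()
    for name in ("ratings.csv", "movies.csv"):
        (extracted_dir / name).touch()
    after = dataset_is_extracted(tmp_dir, "ml-latest")

print(before, after)
```

In the notebook, such a check could guard the call to `load_movielens_dataset` so the archive is only fetched when the files are missing.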

In [8]:
# Let's peek into the loaded DataFrames
print('Sneak peek into the ratings DataFrame:\n\n', ratings.head(), '\n')
print('Number of ratings: ', ratings.shape[0])
print('Sneak peek into the movies DataFrame:\n\n', movies.head(), '\n')
print('Number of movies: ', movies['movieId'].nunique(), ' - notice this is higher than the number of rated movies!')


Sneak peek into the ratings DataFrame:

    userId  movieId  rating   timestamp
0       1      307     3.5  1256677221
1       1      481     3.5  1256677456
2       1     1091     1.5  1256677471
3       1     1257     4.5  1256677460
4       1     1449     4.5  1256677264 

Number of ratings:  27753444
Sneak peek into the movies DataFrame:

    movieId                               title                                       genres
0        1                    Toy Story (1995)  Adventure|Animation|Children|Comedy|Fantasy
1        2                      Jumanji (1995)                   Adventure|Children|Fantasy
2        3             Grumpier Old Men (1995)                               Comedy|Romance
3        4            Waiting to Exhale (1995)                         Comedy|Drama|Romance
4        5  Father of the Bride Part II (1995)                                       Comedy

Number of movies:  58098  - notice this is higher than the number of rated movies!

Next we split the dataset into training and test sets, so that we can later evaluate the popularity model on the test set and establish a baseline.

In [9]:
# Keep only the columns we need: the inputs (userId, movieId) and the label (rating)
dataset = ratings[['userId','movieId','rating']]

# This model needs no training loop, but we still hold out a random subset for testing,
# so the evaluation is aligned with the trained models evaluated later on
from sklearn.model_selection import train_test_split
random_state = 42
test_size = 0.2

dataset_train, dataset_valid = train_test_split(dataset, test_size = test_size, random_state = random_state)

print(f'Training dataset sample size: {len(dataset_train):,} positive samples ({len(dataset_train)/len(dataset)*100:.0f}%)')
print(f'Test dataset sample size: {len(dataset_valid):,} positive samples ({len(dataset_valid)/len(dataset)*100:.0f}%)')
print(f'Total dataset sample size: {len(dataset):,} positive samples (100%)')

Training dataset sample size: 22,202,755 positive samples (80%)
Test dataset sample size: 5,550,689 positive samples (20%)
Total dataset sample size: 27,753,444 positive samples (100%)
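
Because the split is random over individual ratings, some movies in the held-out portion may have no ratings left in the training portion. The following is a minimal sketch of how one might measure that. The toy values and the names `toy_train` and `toy_valid` are illustrative assumptions; in the notebook the same check would apply to `dataset_train` and `dataset_valid`.

```python
import pandas as pd

# Hypothetical toy split standing in for dataset_train / dataset_valid.
toy_train = pd.DataFrame({"userId": [1, 1, 2], "movieId": [10, 20, 10], "rating": [4.0, 3.0, 5.0]})
toy_valid = pd.DataFrame({"userId": [2, 3], "movieId": [20, 30], "rating": [2.0, 4.5]})

# Flag held-out rows whose movie never appears in the training portion.
unseen_mask = ~toy_valid["movieId"].isin(toy_train["movieId"])
unseen_share = unseen_mask.mean()

print(f"Held-out rows with an unseen movie: {unseen_mask.sum()} ({unseen_share:.0%})")
```

Rows flagged this way cannot receive a per-movie average, so any model built on the training portion needs some fallback for them.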


## Step 2: Preparing the popularity model
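Our popularity model predicts a user's rating for a given movie as the average of all training-set ratings for that movie. The prediction ignores the user entirely, and movies with no training ratings fall back to a default rating of 3.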

In more concrete terms, our algorithm will:  
1. Compute average rating per movie and store the values
1. For a given prediction request, return the average rating for the movie
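
The two steps above can be sketched on a tiny example. The data below is made up, and the names `average_by_movie`, `predict`, and `FALLBACK_RATING` are assumptions of this sketch rather than the notebook's own implementation.

```python
import pandas as pd

# Hypothetical toy training ratings (not MovieLens data).
toy_train = pd.DataFrame({
    "movieId": [1, 1, 2],
    "rating": [4.0, 5.0, 3.0],
})

# Step 1: average rating per movie, stored as a plain dict.
average_by_movie = toy_train.groupby("movieId")["rating"].mean().to_dict()

# Step 2: answer a prediction request with the stored average.
# The fallback value for unseen movies is an assumption of this sketch.
FALLBACK_RATING = 3.0

def predict(movie_id):
    return average_by_movie.get(movie_id, FALLBACK_RATING)

print(predict(1))   # average of 4.0 and 5.0
print(predict(99))  # movie absent from the toy training data
```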

In [10]:
# Let's calculate the average rating per movie on the training set, and take a quick look.
# Selecting the rating column before aggregating avoids averaging userId as well.
train_average_rating = dataset_train.groupby("movieId")[["rating"]].mean()
train_average_rating

movieId    rating
1        3.888161
2        3.246433
3        3.172547
4        2.886079
5        3.077933
...           ...
193872   4.000000
193874   5.000000
193876   3.000000
193878   2.000000


In [11]:
# Create the popularity model: a simple dictionary mapping movieId to predicted rating

class PopModel:
    """Encapsulates popularity-based movie rating prediction."""

    def __init__(self, dataset_train):
        print('Popularity model initializing...')
        # Fallback prediction for movies without any rating in the training set
        self.DEFAULT_RATING = 3
        # Average training rating per movie, stored as {movieId: mean rating}
        self._popModel = dataset_train.groupby("movieId")["rating"].mean().to_dict()
        print(f'Popularity model instantiated, includes {len(self._popModel):,} mappings.')

    def predictRating(self, movieId):
        # Return the stored average, or the default for unseen movies
        return self._popModel.get(movieId, self.DEFAULT_RATING)

popModel = PopModel(dataset_train)

Popularity model initializing...
Popularity model instantiated, includes 51,559 mappings.
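
When many rows need scoring at once, the per-movie lookup can also be expressed with pandas instead of one call per row. The following is a sketch under assumptions: `movie_means` stands in for the per-movie averages, `requests` for a frame of (userId, movieId) pairs, and 3.0 mirrors the model's default rating. None of these names come from the notebook.

```python
import pandas as pd

movie_means = pd.Series({1: 4.5, 2: 3.0}, name="rating")  # hypothetical averages
requests = pd.DataFrame({"userId": [7, 8, 9], "movieId": [2, 1, 99]})

# Map each movieId to its stored average; unseen movies become NaN,
# which is then replaced by the default rating.
predicted = requests["movieId"].map(movie_means).fillna(3.0)
print(predicted.tolist())
```

The result has one prediction per request row, in the same order as `requests`.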


In [12]:
from random import randrange

# Let's try a prediction for a randomly chosen rating.
# randrange excludes its upper bound, so every row index is eligible.
ratings_row = randrange(ratings.shape[0])
test_user = int(ratings.iloc[ratings_row].userId)
test_movie = int(ratings.iloc[ratings_row].movieId)
actual_rating = ratings.iloc[ratings_row].rating

predicted_rating = popModel.predictRating(test_movie)
print(f"Dry run prediction for user-id:{test_user} and movie-id:{test_movie} is: {predicted_rating}, actual rating is: {actual_rating}")

Dry run prediction for user-id:209311 and movie-id:111362 is: 3.7518554150632215, actual rating is: 5.0


## Step 3: Evaluating the popularity model
Next we evaluate the model. The result will serve as a baseline later on, when we evaluate and explore personalized models.

In [13]:
# define an iterator we will use to go over the dataset efficiently
import numpy as np
import math

class DatasetBatchIterator:
    "Iterates over a labeled dataset in batches"
    def __init__(self, X, Y, batch_size):
        self.X = np.asarray(X)
        self.Y = np.asarray(Y)
        self.batch_size = batch_size
        # Round up so that a final, shorter batch is still returned
        self.n_batches = int(math.ceil(self.X.shape[0] / batch_size))
        self._current = 0

    def __iter__(self):
        return self

    def __next__(self):
        return self.next()

    def next(self):
        if self._current >= self.n_batches:
            raise StopIteration()
        k = self._current
        self._current += 1
        bs = self.batch_size
        return self.X[k*bs:(k + 1)*bs], self.Y[k*bs:(k + 1)*bs]
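
The same batching behavior can be written more compactly as a generator. This is an alternative sketch, not the notebook's iterator: `iter_batches` and the demo arrays are hypothetical names introduced here for illustration.

```python
import numpy as np

def iter_batches(X, Y, batch_size):
    """Yield consecutive (X, Y) slices of at most batch_size rows."""
    X = np.asarray(X)
    Y = np.asarray(Y)
    for start in range(0, len(X), batch_size):
        yield X[start:start + batch_size], Y[start:start + batch_size]

# Five rows in batches of two: the last batch is shorter.
X_demo = np.arange(10).reshape(5, 2)
Y_demo = np.arange(5)
batch_lengths = [len(x) for x, _ in iter_batches(X_demo, Y_demo, 2)]
print(batch_lengths)
```

A generator keeps no explicit position counter, at the cost of not exposing the number of batches up front.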

In [14]:
# Evaluate the popularity model by iterating over the test dataset to compute predictions
batch_size = 1024
count = 0
total = len(dataset_valid)
ground_truth = np.empty([total])
predictions = np.empty([total])

X_val = dataset_valid[['userId','movieId']]
Y_val = dataset_valid['rating']

for x_batch, y_batch in DatasetBatchIterator(X_val, Y_val, batch_size):
    actual_batch_size = len(x_batch)
    y_hat_batch = np.empty([actual_batch_size])
    # Column 1 of each batch holds the movieId; the userId in column 0 is not used by this model
    for index, x in np.ndenumerate(x_batch[:,1]):
        y_hat_batch[index] = popModel.predictRating(x)

    ground_truth[count:count+actual_batch_size] = y_batch
    predictions[count:count+actual_batch_size] = y_hat_batch
    count += actual_batch_size
    print(f'\rEvaluation progress: {count/total*100:.0f}% ({count}/{total})', end="")



Evaluation progress: 100% (5550689/5550689)

Popularity model RMSE: 0.9658


In [15]:
from sklearn.metrics import mean_squared_error
from sklearn.metrics import mean_absolute_percentage_error

# Take the square root explicitly: the `squared` argument of mean_squared_error
# is deprecated in recent scikit-learn releases
RMSE = np.sqrt(mean_squared_error(ground_truth, predictions))
MAPE = mean_absolute_percentage_error(ground_truth, predictions)

print(f'Final RMSE: {RMSE:.4f}')
print(f'Final MAPE: {MAPE:.4f}')

Final RMSE: 0.9658
Final MAPE: 0.3477
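
For reference, both metrics can be reproduced with plain NumPy. This sketch uses made-up arrays. The MAPE expression follows the definition in the scikit-learn documentation, where the denominator is guarded by a small epsilon and the result is a fraction rather than a percentage, so a value of 0.3477 corresponds to roughly 35%.

```python
import numpy as np

y_true = np.array([4.0, 2.0, 5.0])   # made-up ratings
y_pred = np.array([3.0, 2.0, 4.0])   # made-up predictions

# Root mean squared error: square root of the mean squared difference.
rmse = np.sqrt(np.mean((y_pred - y_true) ** 2))

# Mean absolute percentage error, expressed as a fraction of the true value.
eps = np.finfo(np.float64).eps
mape = np.mean(np.abs(y_pred - y_true) / np.maximum(np.abs(y_true), eps))

print(f"RMSE: {rmse:.4f}, MAPE: {mape:.4f}")
```

Because MAPE divides by the true rating, the same absolute error counts for more on low ratings than on high ones, which is worth keeping in mind when comparing models on this metric.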
