# Movie Recommendations - Weighted Average

> Recommendations using weighted average approach

- toc: true
- branch: master
- badges: true
- comments: true
- categories: [recommendations, weighted, average, movie, lens]
- hide: false

Movie recommendation is an important feature of most online streaming services today, such as Netflix, Prime, Hulu, and many more.

There are many ways to solve this problem, ranging from simple **cosine similarities** to complex **context-aware systems**.

In this notebook, I try to solve this problem using the weighted average technique: the predicted rating of a movie for a user is the average of the other users' ratings for that movie, weighted by how similar each of those users is to the target user.

The required modules are:
* NumPy (for fast numerical operations)
* Pandas (for DataFrame manipulation)
* zipfile (for extracting the dataset)
* matplotlib (for some visualization)
* scikit-learn (for cosine similarity, train/test partitioning, and the error metric)

In [1]:
# Required modules

import numpy as np
import pandas as pd

from zipfile import ZipFile
from matplotlib import pyplot as plt

from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.metrics.pairwise import cosine_similarity

In [2]:
# Extracting the zipfile

with ZipFile("./data/ml-latest-small.zip", "r") as zf:
    zf.extractall("./data/")

I have used the smaller version of the MovieLens dataset from GroupLens, as this notebook is only for understanding the concepts. Its small size also means the algorithms can run without requiring much memory.

In [3]:
# Load the data

ratings = pd.read_csv("./data/ml-latest-small/ratings.csv", usecols=["userId", "movieId", "rating"])
ratings.head()

| | userId | movieId | rating |
|---|---|---|---|
| 0 | 1 | 1 | 4.0 |
| 1 | 1 | 3 | 4.0 |
| 2 | 1 | 6 | 4.0 |
| 3 | 1 | 47 | 5.0 |
| 4 | 1 | 50 | 5.0 |


In [4]:
# Inspecting the data

ratings.info()
ratings.describe()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100836 entries, 0 to 100835
Data columns (total 3 columns):
 #   Column   Non-Null Count   Dtype  
---  ------   --------------   -----  
 0   userId   100836 non-null  int64  
 1   movieId  100836 non-null  int64  
 2   rating   100836 non-null  float64
dtypes: float64(1), int64(2)
memory usage: 2.3 MB


| | userId | movieId | rating |
|---|---|---|---|
| count | 100836.0 | 100836.0 | 100836.0 |
| mean | 326.127564 | 19435.295718 | 3.501557 |
| std | 182.618491 | 35530.987199 | 1.042529 |
| min | 1.0 | 1.0 | 0.5 |
| 25% | 177.0 | 1199.0 | 3.0 |
| 50% | 325.0 | 2991.0 | 3.5 |
| 75% | 477.0 | 8122.0 | 4.0 |
| max | 610.0 | 193609.0 | 5.0 |


Here, I split the complete dataset into two partitions, train and test. The split is stratified on `userId`, so each user's ratings are divided in roughly the same proportion (75% train, 25% test) and both partitions have the same distribution of users.

In [5]:
# Dividing the data into training and testing

X = ratings.copy()
y = ratings["userId"].copy()

X_train, X_test = train_test_split(X, test_size=0.25, stratify=y, random_state=42)
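
To make the effect of `stratify` concrete, here is a minimal sketch on made-up data. The `toy` frame and its values are assumptions for illustration only and are not part of the MovieLens dataset.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Made-up ratings: user 1 has 8 ratings, user 2 has 4.
toy = pd.DataFrame({
    "userId": [1] * 8 + [2] * 4,
    "movieId": list(range(8)) + list(range(4)),
    "rating": [4.0] * 12,
})

# Same call pattern as above, stratified on the user column.
toy_train, toy_test = train_test_split(
    toy, test_size=0.25, stratify=toy["userId"], random_state=42
)

# Count how many test ratings each user contributes.
test_counts = toy_test["userId"].value_counts().sort_index().to_dict()
print(test_counts)  # user 1 -> 2 ratings, user 2 -> 1 rating
```

Each user contributes a quarter of their ratings to the test partition, so no user ends up only in the test set.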

The next step is to pivot the dataset into a user-movie matrix, in which the rows correspond to `userId` and the columns to `movieId`.

Each cell holds the rating given by that user (`userId`) to that particular movie (`movieId`). Cells for movies a user has not rated are missing.

In [6]:
# Making the data into correct format

train_um = X_train.pivot(index="userId", columns="movieId", values="rating")
train_um.head()

| userId \ movieId | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | ... | 190219 | 190221 | 193565 | 193567 | 193571 | 193579 | 193581 | 193583 | 193587 | 193609 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 4.0 | NaN | 4.0 | NaN | NaN | 4.0 | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 2 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 3 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 4 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 5 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
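
The pivot is easier to follow on a tiny example. The sketch below uses a made-up `toy` frame (an assumption for illustration) with three users and three movies.

```python
import pandas as pd

# Made-up long-format ratings: one row per (user, movie) pair.
toy = pd.DataFrame({
    "userId": [1, 1, 2, 3],
    "movieId": [10, 20, 10, 30],
    "rating": [4.0, 5.0, 3.0, 2.0],
})

# Rows become users, columns become movies, cells hold the rating.
toy_um = toy.pivot(index="userId", columns="movieId", values="rating")
print(toy_um)

# Pairs that never occur in the long table are left as NaN.
print(toy_um.isna().sum().sum())
```

Note that `pivot` expects each (`userId`, `movieId`) pair to appear at most once, which holds when a user rates a movie only once.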


In [7]:
# Replace all missing values with 0 (0 is treated as "not rated" from here on)

train_um.fillna(0, inplace=True)

In [8]:
# Finding the Similarity(user x user)

user_similarity = cosine_similarity(train_um, train_um)
user_similarity = pd.DataFrame(user_similarity, index=train_um.index, columns=train_um.index)
user_similarity.head()

| userId | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | ... | 601 | 602 | 603 | 604 | 605 | 606 | 607 | 608 | 609 | 610 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 1.0 | 0.018107 | 0.056232 | 0.162144 | 0.059043 | 0.10156 | 0.143222 | 0.124874 | 0.035194 | 0.007833 | ... | 0.075117 | 0.111534 | 0.168895 | 0.046024 | 0.11 | 0.140322 | 0.214684 | 0.235537 | 0.053809 | 0.1348 |
| 2 | 0.018107 | 1.0 | 0.0 | 0.0 | 0.021686 | 0.034055 | 0.005591 | 0.035673 | 0.0 | 0.033564 | ... | 0.126652 | 0.021905 | 0.007319 | 0.0 | 0.0 | 0.031212 | 0.017007 | 0.052777 | 0.036892 | 0.061756 |
| 3 | 0.056232 | 0.0 | 1.0 | 0.0 | 0.006184 | 0.004995 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.003543 | 0.005997 | 0.027828 | 0.0 | 0.0 | 0.006766 | 0.00291 | 0.017128 | 0.0 | 0.02747 |
| 4 | 0.162144 | 0.0 | 0.0 | 1.0 | 0.094343 | 0.089772 | 0.113243 | 0.05716 | 0.015685 | 0.015613 | ... | 0.081071 | 0.092843 | 0.214782 | 0.052012 | 0.058018 | 0.143171 | 0.088786 | 0.128352 | 0.042905 | 0.091296 |
| 5 | 0.059043 | 0.021686 | 0.006184 | 0.094343 | 1.0 | 0.238989 | 0.063926 | 0.352497 | 0.0 | 0.019583 | ... | 0.046766 | 0.340139 | 0.090455 | 0.159388 | 0.089103 | 0.076463 | 0.138285 | 0.102171 | 0.255179 | 0.045133 |
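
To show what each entry of this matrix measures, the sketch below computes cosine similarity by hand with NumPy on three made-up rating vectors. The helper `cosine` and the vectors `a`, `b`, `c` are assumptions for illustration. Returning 0 for an all-zero vector is a choice made for this sketch.

```python
import numpy as np

def cosine(u, v):
    """Cosine similarity between two 1-D rating vectors (0 = not rated)."""
    denom = np.linalg.norm(u) * np.linalg.norm(v)
    return float(np.dot(u, v) / denom) if denom else 0.0

# Made-up ratings of three users for the same three movies.
a = np.array([4.0, 0.0, 5.0])
b = np.array([4.0, 3.0, 0.0])
c = np.array([0.0, 3.0, 0.0])

sim_ab = cosine(a, b)  # one movie in common, so the similarity is positive
sim_ac = cosine(a, c)  # no movie in common, so the similarity is 0
sim_aa = cosine(a, a)  # a user compared with themselves gives 1

print(sim_ab, sim_ac, sim_aa)
```

Because ratings are non-negative and unrated movies are stored as 0, two users get a similarity of exactly 0 when they have no rated movie in common. This is why the table above contains many 0.0 entries and a diagonal of 1.0.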


In [28]:
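# Algorithm: similarity-weighted average of the other users' ratings

movie_id = 6
user_id = 601

def get_recommendation(user_similarity, train_um, movie_id, user_id):
    # The movie must have at least one rating in the training set
    if movie_id not in train_um:
        print(f"The movie with {movie_id} is not present in our database")
        return -1

    user_sim = user_similarity[user_id]  # similarity of every user to user_id
    movie_rates = train_um[movie_id]     # every user's rating (0 = not rated)

    # Total similarity of the users who actually rated the movie
    sim_sum = user_sim[movie_rates != 0].sum()

    # Check the denominator before dividing to avoid a division by zero
    if sim_sum == 0:
        return -1

    return np.dot(movie_rates, user_sim) / sim_sum

get_recommendation(user_similarity, train_um, movie_id, user_id)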

4.068352560376466
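
The same computation can be traced by hand on a small example. This is a minimal sketch with made-up numbers. The arrays `sims` and `rates` are assumptions for illustration and stand in for one column of `user_similarity` and one column of `train_um`.

```python
import numpy as np

# Similarity of the target user to four other users (made-up values).
sims = np.array([0.9, 0.5, 0.1, 0.3])
# Their ratings for one movie; 0 means "not rated".
rates = np.array([5.0, 3.0, 0.0, 4.0])

rated = rates != 0

# Numerator: ratings weighted by similarity (unrated users contribute 0).
weighted_sum = np.dot(rates, sims)   # 0.9*5 + 0.5*3 + 0.3*4 = 7.2
# Denominator: total similarity of the users who rated the movie.
total_weight = sims[rated].sum()     # 0.9 + 0.5 + 0.3 = 1.7

predicted = weighted_sum / total_weight
print(predicted)  # about 4.24
```

The prediction sits between the lowest and highest observed rating and is pulled towards the rating of the most similar user. Dividing only by the similarities of users who rated the movie keeps unrated entries from dragging the average towards 0.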

In [37]:
# Testing on the test set

def score_test(user_similarity, train_um, test):
    predicted = list()
    actuals = list()

    for _, row in test.iterrows():
        rating = get_recommendation(user_similarity, train_um, row["movieId"], row["userId"])

        # Skip the rows for which no prediction could be made
        if rating != -1:
            actuals.append(row["rating"])
            predicted.append(rating)

    # RMSE, computed as the square root of the MSE so that it does not
    # depend on the `squared` argument of mean_squared_error
    return np.sqrt(mean_squared_error(np.array(actuals), np.array(predicted)))

In [38]:
# RMSE on the test set

score_test(user_similarity, train_um, X_test)

The movie with 89837.0 is not present in our database
The movie with 5136.0 is not present in our database
The movie with 66544.0 is not present in our database
The movie with 5614.0 is not present in our database
The movie with 6342.0 is not present in our database
The movie with 148956.0 is not present in our database
The movie with 4573.0 is not present in our database
The movie with 109282.0 is not present in our database
The movie with 7372.0 is not present in our database
The movie with 27829.0 is not present in our database
The movie with 6158.0 is not present in our database
The movie with 100527.0 is not present in our database
The movie with 3714.0 is not present in our database
The movie with 83374.0 is not present in our database
The movie with 121374.0 is not present in our database
The movie with 6002.0 is not present in our database
The movie with 7455.0 is not present in our database
The movie with 8575.0 is not present in our database
The movie with 3344.0 is not present in our database

  rate = np.dot(movie_rates, user_sim) / user_sim[movie_rates != 0].sum()


The movie with 6465.0 is not present in our database
The movie with 175705.0 is not present in our database
The movie with 133879.0 is not present in our database
The movie with 3590.0 is not present in our database
The movie with 1150.0 is not present in our database
The movie with 160565.0 is not present in our database
The movie with 145951.0 is not present in our database
The movie with 7831.0 is not present in our database
The movie with 155774.0 is not present in our database
The movie with 115216.0 is not present in our database
The movie with 146986.0 is not present in our database
The movie with 130052.0 is not present in our database
The movie with 95175.0 is not present in our database
The movie with 70703.0 is not present in our database
The movie with 2741.0 is not present in our database
The movie with 27746.0 is not present in our database
The movie with 7986.0 is not present in our database
The movie with 1856.0 is not present in our database
The movie with 4116.0 is not present in our database

1.1828717551415853
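
For reference, the score above is a root mean squared error (RMSE). The sketch below computes it by hand on made-up values. The arrays `actual` and `predicted` are assumptions for illustration and are not taken from the test set.

```python
import numpy as np

# Made-up true and predicted ratings for four (user, movie) pairs.
actual = np.array([4.0, 3.0, 5.0, 2.0])
predicted = np.array([3.5, 3.0, 4.0, 3.0])

# Square the errors, average them, then take the square root.
squared_errors = (actual - predicted) ** 2   # 0.25, 0.0, 1.0, 1.0
rmse = np.sqrt(squared_errors.mean())        # sqrt(2.25 / 4)

print(rmse)  # 0.75
```

RMSE is expressed in the same unit as the ratings, so the value of about 1.18 reported above is measured in rating points on the 0.5 to 5 scale. It only covers the test rows for which a prediction could be made, since movies absent from the training matrix are skipped.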