Model-based collaborative filtering uses machine-learning techniques to build
a model from the ratings data and then uses that model to generate recommendations.

Two common approaches:
1. A probabilistic approach, for example a Bayesian classifier.

2. Predicting the rating of an item, for example with Singular Value Decomposition (SVD).
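To make the second approach concrete, here is a minimal sketch of rating prediction by matrix factorization trained with stochastic gradient descent, the family of methods that Surprise's `SVD` belongs to. It is not the library's implementation: the toy ratings, the function names `factorize` and `predict`, and the hyperparameter values are assumptions chosen for illustration, and the user and item bias terms that Surprise's `SVD` uses by default are left out for brevity.

```python
import random

def factorize(ratings, n_factors=2, n_epochs=500, lr=0.02, reg=0.02, seed=0):
    """Learn user/item factor vectors from (user, item, rating) triples via SGD."""
    rng = random.Random(seed)
    users = sorted({u for u, _, _ in ratings})
    items = sorted({i for _, i, _ in ratings})
    # Small random starting vectors for every user and item.
    p = {u: [rng.gauss(0, 0.1) for _ in range(n_factors)] for u in users}
    q = {i: [rng.gauss(0, 0.1) for _ in range(n_factors)] for i in items}
    mu = sum(r for _, _, r in ratings) / len(ratings)  # global mean rating

    for _ in range(n_epochs):
        for u, i, r in ratings:
            err = r - (mu + sum(pf * qf for pf, qf in zip(p[u], q[i])))
            for f in range(n_factors):
                pu, qi = p[u][f], q[i][f]
                # Gradient step on the regularized squared error.
                p[u][f] += lr * (err * qi - reg * pu)
                q[i][f] += lr * (err * pu - reg * qi)
    return mu, p, q

def predict(mu, p, q, user, item):
    """Estimated rating; falls back to the global mean for unseen users/items."""
    if user not in p or item not in q:
        return mu
    return mu + sum(pf * qf for pf, qf in zip(p[user], q[item]))

def rmse_on(ratings, estimate):
    """Root mean squared error of estimate(user, item) over the given triples."""
    return (sum((r - estimate(u, i)) ** 2 for u, i, r in ratings) / len(ratings)) ** 0.5

# Hypothetical toy data: (user, item, rating) triples.
toy = [("a", "x", 5), ("a", "y", 4), ("b", "x", 4),
       ("b", "z", 1), ("c", "y", 5), ("c", "z", 2)]

mu, p, q = factorize(toy)
baseline_rmse = rmse_on(toy, lambda u, i: mu)
train_rmse = rmse_on(toy, lambda u, i: predict(mu, p, q, u, i))
print(f"baseline RMSE (global mean only): {baseline_rmse:.3f}")
print(f"training RMSE (with factors):     {train_rmse:.3f}")
print(f"estimate for user 'a', item 'z':  {predict(mu, p, q, 'a', 'z'):.2f}")
```

The last line is the point of the model: user `a` never rated item `z`, yet the learned factors still produce an estimate for that pair.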


Pros: 
1. Model-based approaches generally give more accurate predictions than memory-based ones. 
2. They also handle sparsity and scalability better than memory-based approaches. 

Cons: 
1. Building the model requires considerable resources, such as time and memory, and dimensionality reduction may discard some information.
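The information loss mentioned above can be illustrated with a truncated SVD. The sketch below uses a small, fully observed rating matrix that is invented for this example (real rating matrices are mostly empty, so plain SVD does not apply to them directly) and shows how the reconstruction error changes as more singular values are kept.

```python
import numpy as np

# Hypothetical dense 4x4 rating matrix (users x items), for illustration only.
R = np.array([
    [5.0, 4.0, 1.0, 1.0],
    [4.0, 5.0, 2.0, 1.0],
    [1.0, 2.0, 5.0, 4.0],
    [1.0, 1.0, 4.0, 5.0],
])

U, s, Vt = np.linalg.svd(R, full_matrices=False)

def reconstruct(k):
    """Rank-k approximation built from the k largest singular values."""
    return (U[:, :k] * s[:k]) @ Vt[:k, :]

# Frobenius norm of the difference between R and its rank-k approximation.
errors = {k: float(np.linalg.norm(R - reconstruct(k))) for k in range(1, 5)}
for k, err in errors.items():
    print(f"rank {k}: reconstruction error = {err:.4f}")
```

Keeping fewer singular values gives a smaller model but a larger reconstruction error; only the full-rank reconstruction recovers the matrix exactly.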

[Reference](https://ieeexplore.ieee.org/document/7872755)

In [1]:
from google.colab import drive
drive.mount('/gdrive')

Mounted at /gdrive


In [2]:
cd ../gdrive/My Drive/Colab Notebooks/_Recommendation System/Recommendation System/_movie_data

/gdrive/My Drive/Colab Notebooks/_Recommendation System/Recommendation System/_movie_data


In [3]:
import pandas as pd
import numpy as np

[Surprise: Getting Started](https://surprise.readthedocs.io/en/stable/getting_started.html)

In [4]:
!pip install surprise

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting surprise
  Downloading surprise-0.1-py2.py3-none-any.whl (1.8 kB)
Collecting scikit-surprise
  Downloading scikit-surprise-1.1.1.tar.gz (11.8 MB)
Building wheels for collected packages: scikit-surprise
  Building wheel for scikit-surprise (setup.py) ... done
  Created wheel for scikit-surprise: filename=scikit_surprise-1.1.1-cp37-cp37m-linux_x86_64.whl size=1633728 sha256=d58f760d0289b2f1ad8cb52de29e12aee072a3a68c20bb1bd000c46b6cd499a4
  Stored in directory: /root/.cache/pip/wheels/76/44/74/b498c42be47b2406bd27994e16c5188e337c657025ab400c1c
Successfully built scikit-surprise
Installing collected packages: scikit-surprise, surprise
Successfully installed scikit-surprise-1.1.1 surprise-0.1


In [11]:
#Import the required classes and methods from the surprise library
from surprise import Reader, Dataset, KNNBasic, SVD
from surprise.model_selection import cross_validate

In [6]:
#Load the u.data file into a dataframe
r_cols = ['user_id', 'movie_id', 'rating', 'timestamp']

ratings = pd.read_csv('u.data', sep='\t', names=r_cols,encoding='latin-1')

ratings.head()

   user_id  movie_id  rating  timestamp
0      196       242       3  881250949
1      186       302       3  891717742
2       22       377       1  878887116
3      244        51       2  880606923
4      166       346       1  886397596


In [21]:
ratings.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100000 entries, 0 to 99999
Data columns (total 4 columns):
 #   Column     Non-Null Count   Dtype
---  ------     --------------   -----
 0   user_id    100000 non-null  int64
 1   movie_id   100000 non-null  int64
 2   rating     100000 non-null  int64
 3   timestamp  100000 non-null  int64
dtypes: int64(4)
memory usage: 3.1 MB


In [12]:
ratings.columns

Index(['user_id', 'movie_id', 'rating', 'timestamp'], dtype='object')

In [14]:
ratings.describe().T

              count         mean        std          min          25%          50%          75%          max
user_id    100000.0     462.4848   266.6144          1.0        254.0        447.0        682.0        943.0
movie_id   100000.0     425.5301   330.7984          1.0        175.0        322.0        631.0       1682.0
rating     100000.0      3.52986   1.125674          1.0          3.0          4.0          4.0          5.0
timestamp  100000.0  883528900.0  5343856.0  874724710.0  879448709.5  882826944.0  888259984.0  893286638.0
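The summary statistics are enough for a rough estimate of how sparse the user-item matrix is. The sketch below assumes that the maximum `user_id` (943) and `movie_id` (1682) equal the number of distinct users and movies, which holds only if the ids are contiguous and start at 1; `ratings['user_id'].nunique()` and `ratings['movie_id'].nunique()` would confirm this on the actual DataFrame.

```python
n_ratings = 100_000  # rows in the ratings DataFrame
n_users = 943        # max user_id; assumed to equal the number of distinct users
n_movies = 1682      # max movie_id; assumed to equal the number of distinct movies

# Fraction of user-movie pairs that actually have a rating.
density = n_ratings / (n_users * n_movies)
sparsity = 1 - density

print(f"density:  {density:.2%}")
print(f"sparsity: {sparsity:.2%}")
```

Under these assumptions only about 6% of the possible user-movie pairs have a rating, which is the sparsity problem a recommender has to cope with.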


In [10]:
#Define a Reader object
#The Reader object helps in parsing the file or dataframe containing ratings
reader = Reader()

In [16]:
#Create the dataset to be used for building the filter
data = Dataset.load_from_df(ratings[['user_id', 'movie_id', 'rating']], reader)

In [17]:
data

<surprise.dataset.DatasetAutoFolds at 0x7f5a11138b10>

### Faster CV

In [18]:
# Define the basic kNN algorithm (a memory-based baseline)
knn = KNNBasic()

# Define the SVD algorithm (the model-based approach)
svd_algo = SVD()

In [19]:
# Run 5-fold cross-validation for SVD and print the results
cross_validate(svd_algo, data, measures=['RMSE', 'MAE'], cv=5, verbose=True)

Evaluating RMSE, MAE of algorithm SVD on 5 split(s).

                  Fold 1  Fold 2  Fold 3  Fold 4  Fold 5  Mean    Std     
RMSE (testset)    0.9411  0.9429  0.9319  0.9334  0.9296  0.9358  0.0052  
MAE (testset)     0.7422  0.7412  0.7355  0.7355  0.7329  0.7374  0.0036  
Fit time          4.10    4.28    4.63    4.06    4.48    4.31    0.22    
Test time         0.13    0.21    0.12    0.19    0.12    0.15    0.04    


{'fit_time': (4.1012513637542725,
  4.277316093444824,
  4.625521183013916,
  4.059075832366943,
  4.481512546539307),
 'test_mae': array([0.74218655, 0.74117566, 0.73546671, 0.73545562, 0.73294207]),
 'test_rmse': array([0.94106278, 0.94292974, 0.9318952 , 0.93335737, 0.9296323 ]),
 'test_time': (0.12514996528625488,
  0.20682835578918457,
  0.12059426307678223,
  0.19013381004333496,
  0.11682558059692383)}

In [20]:
# Run 5-fold cross-validation for KNNBasic and print the results
cross_validate(knn, data, measures=['RMSE', 'MAE'], cv=5, verbose=True)

Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Evaluating RMSE, MAE of algorithm KNNBasic on 5 split(s).

                  Fold 1  Fold 2  Fold 3  Fold 4  Fold 5  Mean    Std     
RMSE (testset)    0.9822  0.9774  0.9830  0.9683  0.9798  0.9781  0.0053  
MAE (testset)     0.7776  0.7717  0.7754  0.7656  0.7751  0.7731  0.0042  
Fit time          0.25    0.30    0.27    0.27    0.27    0.27    0.02    
Test time         2.79    2.75    2.62    2.66    2.62    2.69    0.07    


{'fit_time': (0.24642634391784668,
  0.3024609088897705,
  0.27239322662353516,
  0.2698028087615967,
  0.2749969959259033),
 'test_mae': array([0.77758079, 0.77167982, 0.77542899, 0.76559078, 0.77505723]),
 'test_rmse': array([0.98217308, 0.97742919, 0.98302392, 0.96826645, 0.97984494]),
 'test_time': (2.794323205947876,
  2.746000051498413,
  2.6245453357696533,
  2.663700580596924,
  2.624539852142334)}
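The two result dictionaries can be compared directly. The sketch below copies the per-fold `test_rmse` values from the outputs above and computes the mean RMSE of each algorithm and the relative difference. Note that each `cross_validate` call with `cv=5` most likely draws its own random folds, so this is an unpaired comparison rather than a fold-by-fold one.

```python
from statistics import mean

# Per-fold test RMSE copied from the cross_validate outputs above.
svd_rmse = [0.94106278, 0.94292974, 0.9318952, 0.93335737, 0.9296323]
knn_rmse = [0.98217308, 0.97742919, 0.98302392, 0.96826645, 0.97984494]

svd_mean = mean(svd_rmse)
knn_mean = mean(knn_rmse)
# Relative RMSE reduction of SVD with respect to KNNBasic.
relative_gain = (knn_mean - svd_mean) / knn_mean

print(f"SVD mean RMSE:      {svd_mean:.4f}")
print(f"KNNBasic mean RMSE: {knn_mean:.4f}")
print(f"SVD is lower by:    {relative_gain:.1%}")
```

On these runs SVD has the lower error, while the timing rows show the opposite trade-off in cost: SVD spends its time in fitting, KNNBasic in prediction.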

### Method - KNNBasic

In [23]:
from surprise import KNNBasic
from surprise import Dataset

# Load the movielens-100k dataset
data = Dataset.load_builtin('ml-100k')

# Retrieve the trainset.
trainset = data.build_full_trainset()

# Build an algorithm, and train it.
algo = KNNBasic()
algo.fit(trainset)

Dataset ml-100k could not be found. Do you want to download it? [Y/n] Y
Trying to download dataset from http://files.grouplens.org/datasets/movielens/ml-100k.zip...
Done! Dataset ml-100k has been saved to /root/.surprise_data/ml-100k
Computing the msd similarity matrix...
Done computing similarity matrix.


<surprise.prediction_algorithms.knns.KNNBasic at 0x7f5a0f949e10>
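The "msd similarity matrix" in the log refers to mean squared difference similarity, which Surprise documents as `1 / (msd + 1)`. The following is a simplified sketch of that similarity and of a user-based kNN estimate as a similarity-weighted average of neighbor ratings. The function names and toy ratings are hypothetical, and library details such as `min_support` and `min_k` are omitted.

```python
def msd_similarity(ratings_u, ratings_v):
    """1 / (MSD + 1) over the items both users rated; 0.0 if they share none."""
    common = ratings_u.keys() & ratings_v.keys()
    if not common:
        return 0.0
    msd = sum((ratings_u[i] - ratings_v[i]) ** 2 for i in common) / len(common)
    return 1.0 / (msd + 1.0)

def knn_predict(target, item, all_ratings, k=40):
    """Similarity-weighted mean of the k most similar users' ratings for item."""
    neighbors = [
        (msd_similarity(all_ratings[target], user_ratings), user_ratings[item])
        for user, user_ratings in all_ratings.items()
        if user != target and item in user_ratings
    ]
    # Keep the k most similar neighbors that have a positive similarity.
    neighbors = [(s, r) for s, r in sorted(neighbors, reverse=True)[:k] if s > 0]
    if not neighbors:
        return None  # no usable neighbor rated this item
    return sum(s * r for s, r in neighbors) / sum(s for s, _ in neighbors)

# Hypothetical toy ratings: user -> {item: rating}.
toy_ratings = {
    "alice": {"m1": 5, "m2": 3},
    "bob":   {"m1": 5, "m2": 3, "m3": 4},
    "carol": {"m1": 1, "m2": 5, "m3": 2},
}

estimate = knn_predict("alice", "m3", toy_ratings)
print(f"estimated rating of m3 for alice: {estimate:.2f}")
```

Because bob agrees with alice on every shared item, his rating dominates the weighted average, while carol's very different tastes contribute little.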

In [46]:
uid = str(196)  # raw user id (as in the ratings file). They are **strings**!
iid = str(302)  # raw item id (as in the ratings file). They are **strings**!

# get a prediction for specific users and items.
pred = algo.predict(uid, iid, r_ui=2, verbose=True)

user: 196        item: 302        r_ui = 2.00   est = 3.88   {'was_impossible': False}


In [45]:
uid = str(206)  # raw user id (as in the ratings file). They are **strings**!
iid = str(102000)  # raw item id (as in the ratings file). They are **strings**!

# get a prediction for a specific user and item.
# r_ui is the optional true rating; it is only reported, not used to compute the estimate.
pred = algo.predict(uid, iid, r_ui=100, verbose=True)

user: 206        item: 102000     r_ui = 100.00   est = 2.23   {'was_impossible': False}


### Method - SVD

In [25]:
from surprise import SVD
from surprise import Dataset
from surprise import accuracy
from surprise.model_selection import train_test_split

# Load the movielens-100k dataset (download it if needed).
data = Dataset.load_builtin('ml-100k')

# sample random trainset and testset
# test set is made of 25% of the ratings.
trainset, testset = train_test_split(data, test_size=.25)

# We'll use the famous SVD algorithm.
algo = SVD()

# Train the algorithm on the trainset, and predict ratings for the testset
algo.fit(trainset)
predictions = algo.test(testset)

# Then compute RMSE
accuracy.rmse(predictions)

RMSE: 0.9411


0.9411301570338959
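For reference, the RMSE reported above is the square root of the mean squared difference between true and estimated ratings. The sketch below computes RMSE and MAE by hand on a few invented (true, estimated) pairs. With the real results, the pairs could be built from the `r_ui` and `est` fields of each Surprise prediction, for example `[(p.r_ui, p.est) for p in predictions]`.

```python
from math import sqrt

def rmse(pairs):
    """Root mean squared error over (true_rating, estimated_rating) pairs."""
    return sqrt(sum((true - est) ** 2 for true, est in pairs) / len(pairs))

def mae(pairs):
    """Mean absolute error over (true_rating, estimated_rating) pairs."""
    return sum(abs(true - est) for true, est in pairs) / len(pairs)

# Hypothetical (true, estimated) rating pairs, for illustration only.
pairs = [(4, 3.5), (2, 3.0), (5, 4.5), (3, 3.0)]

print(f"RMSE: {rmse(pairs):.4f}")
print(f"MAE:  {mae(pairs):.4f}")
```

RMSE penalizes large errors more heavily than MAE, which is why it is never smaller than MAE on the same predictions.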