
In [None]:
!pip install surprise 

In [3]:
from surprise import SVD
from surprise import Dataset
from surprise import accuracy
from surprise.model_selection import cross_validate  # runs a full cross-validation procedure in a few lines of code

In [4]:
# Loading the movielens-100k dataset
data = Dataset.load_builtin('ml-100k')

Dataset ml-100k could not be found. Do you want to download it? [Y/n] Y
Trying to download dataset from http://files.grouplens.org/datasets/movielens/ml-100k.zip...
Done! Dataset ml-100k has been saved to /root/.surprise_data/ml-100k


#Using prediction algorithms.


In [28]:
from surprise import KNNBasic
#algo = KNNBasic()

# Some of these algorithms use baseline estimates, and some use a similarity measure.
# This notebook reviews how to configure the way baselines and similarities are computed.

#Baseline estimates configuration.

* This section only applies to algorithms (or similarity measures) that try to minimize the following regularized squared error (or equivalent):


> $\sum_{r_{ui} \in R_{train}} \left(r_{ui} - (\mu + b_u + b_i)\right)^2 +
\lambda \left(b_u^2 + b_i^2 \right).$



* Baselines can be estimated in two different ways:

1. Using Stochastic Gradient Descent (SGD).
2. Using Alternating Least Squares (ALS).

* For both procedures (ALS and SGD), the user and item biases ($b_u$ and $b_i$) are initialized to zero.
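To make the ALS procedure concrete, here is a minimal pure-Python sketch. It assumes the common formulation with separate regularization terms for items and users (the `reg_i` and `reg_u` options used later in this notebook) and is not claimed to match Surprise's internal implementation. The function name `baseline_als` and the toy ratings are hypothetical.

```python
from collections import defaultdict


def baseline_als(ratings, n_epochs=10, reg_u=15.0, reg_i=10.0):
    """Estimate the global mean and the user/item biases by alternating updates.

    ratings: list of (user, item, rating) tuples.
    Returns (mu, b_u, b_i), where b_u and b_i are dicts of biases.
    """
    mu = sum(r for _, _, r in ratings) / len(ratings)

    by_user, by_item = defaultdict(list), defaultdict(list)
    for u, i, r in ratings:
        by_user[u].append((i, r))
        by_item[i].append((u, r))

    # Both sets of biases start at zero.
    b_u = {u: 0.0 for u in by_user}
    b_i = {i: 0.0 for i in by_item}

    for _ in range(n_epochs):
        # Hold the user biases fixed and solve for each item bias.
        for i, rs in by_item.items():
            b_i[i] = sum(r - mu - b_u[u] for u, r in rs) / (reg_i + len(rs))
        # Hold the item biases fixed and solve for each user bias.
        for u, rs in by_user.items():
            b_u[u] = sum(r - mu - b_i[i] for i, r in rs) / (reg_u + len(rs))
    return mu, b_u, b_i


# Hypothetical toy data: (user, item, rating).
toy_ratings = [("u1", "i1", 5), ("u1", "i2", 3),
               ("u2", "i1", 4), ("u2", "i2", 2),
               ("u3", "i1", 5)]
mu, b_u, b_i = baseline_als(toy_ratings, n_epochs=5, reg_u=12, reg_i=5)
print(mu, b_u, b_i)
```

Larger regularization values shrink the biases toward zero, which matters most for users or items with only a few ratings.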





In [9]:
from surprise import BaselineOnly
from surprise.model_selection import train_test_split

In [10]:
# sample random trainset and testset
# test set is made of 25% of the ratings.

trainset, testset = train_test_split(data, test_size=.25)

In [13]:
#Using Alternating Least Squares (ALS).

# bsl_options: configures how baselines are computed.
# 'method': the method to use, 'als' (the default) or 'sgd'.
# 'reg_u': the regularization parameter for users, corresponding to λ3. Default is 15.
# 'reg_i': the regularization parameter for items, corresponding to λ2. Default is 10.
# 'n_epochs': the number of iterations of the ALS procedure. Default is 10.

print('Using ALS')
bsl_options = {'method': 'als',
               'n_epochs': 5,
               'reg_u': 12,
               'reg_i': 5
               }
algo = BaselineOnly(bsl_options=bsl_options)

Using ALS


In [14]:
# train and test algorithm.
algo.fit(trainset)
predictions = algo.test(testset)

# Compute and print Root Mean Squared Error
accuracy.rmse(predictions, verbose=True)

Estimating biases using als...
RMSE: 0.9387


0.9387206678493552
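The value reported above is the root mean squared error between true and estimated ratings. As a rough illustration of the metric itself (not of Surprise's `accuracy` module), it can be sketched in plain Python. The function name `rmse` and the toy pairs are hypothetical.

```python
import math


def rmse(pairs):
    """Root mean squared error over (true_rating, estimated_rating) pairs."""
    pairs = list(pairs)
    if not pairs:
        raise ValueError("cannot compute RMSE of an empty prediction list")
    squared_errors = [(true - est) ** 2 for true, est in pairs]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Hypothetical (true, estimated) ratings.
toy_pairs = [(4.0, 3.5), (2.0, 2.5), (5.0, 4.0)]
toy_rmse = rmse(toy_pairs)
print(f"RMSE: {toy_rmse:.4f}")
```

Because the errors are squared before averaging, one large miss raises the RMSE more than several small ones.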

In [25]:
#Using Stochastic Gradient Descent (SGD).
# 'reg': the regularization parameter of the cost function that is optimized. Default is 0.02.
# 'learning_rate': the learning rate of SGD. Default is 0.005.
# 'n_epochs': the number of iterations of the SGD procedure. Default is 20.

print('Using SGD')
bsl_options = {'method': 'sgd',
               'learning_rate': .00005,
               'reg': 0.04,
               'n_epochs': 40
               }
algo = BaselineOnly(bsl_options=bsl_options)

Using SGD


In [26]:
# train and test algorithm.
algo.fit(trainset)
predictions = algo.test(testset)

# Compute and print Root Mean Squared Error
accuracy.rmse(predictions, verbose=True)

Estimating biases using sgd...
RMSE: 1.0534


1.053396906817256
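The SGD result above is noticeably worse than the ALS one. One plausible reason, offered here as an assumption rather than a diagnosis, is that a learning rate of 0.00005 barely moves the biases away from their zero initialization in 40 epochs. The sketch below uses the standard SGD update for the regularized squared error on hypothetical toy data. The names `baseline_sgd` and `squared_error` are illustrative and are not part of Surprise.

```python
def baseline_sgd(ratings, n_epochs=20, reg=0.02, learning_rate=0.005):
    """Estimate biases with plain SGD on the regularized squared error.

    ratings: list of (user, item, rating) tuples.
    Returns (mu, b_u, b_i).
    """
    mu = sum(r for _, _, r in ratings) / len(ratings)
    b_u = {u: 0.0 for u, _, _ in ratings}  # biases start at zero
    b_i = {i: 0.0 for _, i, _ in ratings}
    for _ in range(n_epochs):
        for u, i, r in ratings:
            err = r - (mu + b_u[u] + b_i[i])
            b_u[u] += learning_rate * (err - reg * b_u[u])
            b_i[i] += learning_rate * (err - reg * b_i[i])
    return mu, b_u, b_i


def squared_error(ratings, mu, b_u, b_i):
    """Sum of squared errors of the baseline predictions on `ratings`."""
    return sum((r - (mu + b_u[u] + b_i[i])) ** 2 for u, i, r in ratings)


# Hypothetical toy data: (user, item, rating).
toy_ratings = [("u1", "i1", 5), ("u1", "i2", 3),
               ("u2", "i1", 4), ("u2", "i2", 2),
               ("u3", "i1", 5)]

slow = baseline_sgd(toy_ratings, n_epochs=40, reg=0.04, learning_rate=0.00005)
fast = baseline_sgd(toy_ratings, n_epochs=40, reg=0.04, learning_rate=0.005)

sse_slow = squared_error(toy_ratings, *slow)
sse_fast = squared_error(toy_ratings, *fast)
print(sse_slow, sse_fast)
```

On this toy data the larger learning rate fits the training ratings more closely. Whether the same holds on ml-100k would need to be checked by rerunning the cell with a different `learning_rate`.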

In [29]:
# Some similarity measures, such as pearson_baseline, rely on baseline estimates,
# so bsl_options can be passed together with sim_options.

bsl_options = {'method': 'als',
               'n_epochs': 5,
               'reg_u': 12,
               'reg_i': 5
               }
sim_options = {'name': 'pearson_baseline'}
algo = KNNBasic(bsl_options=bsl_options, sim_options=sim_options)

In [30]:
# train and test algorithm.
algo.fit(trainset)
predictions = algo.test(testset)

# Compute and print Root Mean Squared Error
accuracy.rmse(predictions, verbose=True)

Estimating biases using als...
Computing the pearson_baseline similarity matrix...
Done computing similarity matrix.
RMSE: 1.0001


1.000137552402814

#Similarity measure configuration.

* This works the same way as baseline configuration: we just need to pass a sim_options argument when creating an algorithm.




In [31]:
# 'name': the name of the similarity to use. Default is 'MSD'.
# 'user_based': whether similarities are computed between users or between items.
#     This has a huge impact on the performance of a prediction algorithm. Default is True.
# 'min_support': the minimum number of common items (when 'user_based' is True) or
#     common users (when 'user_based' is False) for the similarity not to be zero.
#     Simply put, if |Iuv| < min_support then sim(u, v) = 0. The same goes for items.
# 'shrinkage': the shrinkage parameter to apply (only relevant for the pearson_baseline
#     similarity). Default is 100.

sim_options = {'name': 'cosine',
               'user_based': False  # compute  similarities between items
               }
algo = KNNBasic(sim_options=sim_options)

In [32]:
# train and test algorithm.
algo.fit(trainset)
predictions = algo.test(testset)

# Compute and print Root Mean Squared Error
accuracy.rmse(predictions, verbose=True)

Computing the cosine similarity matrix...
Done computing similarity matrix.
RMSE: 1.0307


1.030707229676026

In [35]:
sim_options = {'name': 'pearson_baseline',
               'shrinkage': 0  # no shrinkage
               }
algo = KNNBasic(sim_options=sim_options)

In [36]:
# train and test algorithm.
algo.fit(trainset)
predictions = algo.test(testset)

# Compute and print Root Mean Squared Error
accuracy.rmse(predictions, verbose=True)

Estimating biases using als...
Computing the pearson_baseline similarity matrix...
Done computing similarity matrix.
RMSE: 1.0076


1.007640430461604

#similarities module
##1. cosine: Compute the cosine similarity between all pairs of users (or items).

Only common users (or items) are taken into account. The cosine similarity is defined as:

$\text{cosine_sim}(u, v) = \frac{
\sum\limits_{i \in I_{uv}} r_{ui} \cdot r_{vi}}
{\sqrt{\sum\limits_{i \in I_{uv}} r_{ui}^2} \cdot
\sqrt{\sum\limits_{i \in I_{uv}} r_{vi}^2}
}$


$\text{cosine_sim}(i, j) = \frac{
\sum\limits_{u \in U_{ij}} r_{ui} \cdot r_{uj}}
{\sqrt{\sum\limits_{u \in U_{ij}} r_{ui}^2} \cdot
\sqrt{\sum\limits_{u \in U_{ij}} r_{uj}^2}
}$
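As a minimal sketch of the user-user formula above, assuming each user's ratings are stored as a plain dict mapping item id to rating (a hypothetical layout, not Surprise's internal one), the computation restricted to co-rated items looks like this. The `min_support` argument mirrors the option described earlier.

```python
import math


def cosine_sim(ratings_u, ratings_v, min_support=1):
    """Cosine similarity over the items rated by both users."""
    common = ratings_u.keys() & ratings_v.keys()
    if len(common) < min_support:
        return 0.0
    dot = sum(ratings_u[i] * ratings_v[i] for i in common)
    norm_u = math.sqrt(sum(ratings_u[i] ** 2 for i in common))
    norm_v = math.sqrt(sum(ratings_v[i] ** 2 for i in common))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


# Hypothetical users; only i1 and i2 are rated by both.
user_a = {"i1": 5, "i2": 3, "i3": 4}
user_b = {"i1": 4, "i2": 2, "i4": 1}

sim_ab = cosine_sim(user_a, user_b)
sim_ab_strict = cosine_sim(user_a, user_b, min_support=3)
print(sim_ab, sim_ab_strict)
```

The item-item version is the same computation with the roles of users and items swapped.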

##2. msd: Compute the Mean Squared Difference similarity between all pairs of users (or items).

Only common users (or items) are taken into account. The Mean Squared Difference is defined as:

$\text{msd}(u, v) = \frac{1}{|I_{uv}|} \cdot
\sum\limits_{i \in I_{uv}} (r_{ui} - r_{vi})^2$

$\text{msd}(i, j) = \frac{1}{|U_{ij}|} \cdot
\sum\limits_{u \in U_{ij}} (r_{ui} - r_{uj})^2$

The MSD-similarity is then defined as:

$\begin{split}\text{msd_sim}(u, v) &= \frac{1}{\text{msd}(u, v) + 1}\\
\text{msd_sim}(i, j) &= \frac{1}{\text{msd}(i, j) + 1}\end{split}$
* The +1 term is just here to avoid dividing by zero.
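The two steps above (mean squared difference, then inversion) can be sketched for a pair of users as follows. The dict-based rating layout and the name `msd_sim` are assumptions made for illustration only.

```python
def msd_sim(ratings_u, ratings_v, min_support=1):
    """MSD similarity over the items rated by both users."""
    common = ratings_u.keys() & ratings_v.keys()
    if not common or len(common) < min_support:
        return 0.0
    msd = sum((ratings_u[i] - ratings_v[i]) ** 2 for i in common) / len(common)
    # The +1 keeps the result finite when the two users agree exactly.
    return 1.0 / (msd + 1.0)


# Hypothetical users; only i1 and i2 are rated by both.
user_c = {"i1": 5, "i2": 3, "i3": 4}
user_d = {"i1": 4, "i2": 2, "i4": 1}

sim_cd = msd_sim(user_c, user_d)
sim_cc = msd_sim(user_c, user_c)
print(sim_cd, sim_cc)
```

Identical rating vectors give a similarity of 1, and the value decays toward 0 as the ratings diverge.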

##3. pearson(): Compute the Pearson correlation coefficient between all pairs of users (or items).

Only common users (or items) are taken into account. The Pearson correlation coefficient can be seen as a mean-centered cosine similarity, and is defined as:

$\text{pearson_sim}(u, v) = \frac{ \sum\limits_{i \in I_{uv}}
(r_{ui} -  \mu_u) \cdot (r_{vi} - \mu_{v})} {\sqrt{\sum\limits_{i
\in I_{uv}} (r_{ui} -  \mu_u)^2} \cdot \sqrt{\sum\limits_{i \in
I_{uv}} (r_{vi} -  \mu_{v})^2} }$

$\text{pearson_sim}(i, j) = \frac{ \sum\limits_{u \in U_{ij}}
(r_{ui} -  \mu_i) \cdot (r_{uj} - \mu_{j})} {\sqrt{\sum\limits_{u
\in U_{ij}} (r_{ui} -  \mu_i)^2} \cdot \sqrt{\sum\limits_{u \in
U_{ij}} (r_{uj} -  \mu_{j})^2} }$

* Note: if there are no common users or items, similarity will be 0 (and not -1).
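A short sketch of the user-user Pearson formula follows. It assumes that the means $\mu_u$ and $\mu_v$ are taken over the co-rated items only, and it returns 0 when a user's common ratings have no variance. Both choices are assumptions of this sketch and may differ from the library's behaviour. The dict layout and the name `pearson_sim` are hypothetical.

```python
import math


def pearson_sim(ratings_u, ratings_v, min_support=1):
    """Pearson correlation over the items rated by both users."""
    common = ratings_u.keys() & ratings_v.keys()
    n = len(common)
    if n == 0 or n < min_support:
        return 0.0  # no overlap: similarity is 0, not -1
    mean_u = sum(ratings_u[i] for i in common) / n
    mean_v = sum(ratings_v[i] for i in common) / n
    cov = sum((ratings_u[i] - mean_u) * (ratings_v[i] - mean_v) for i in common)
    var_u = sum((ratings_u[i] - mean_u) ** 2 for i in common)
    var_v = sum((ratings_v[i] - mean_v) ** 2 for i in common)
    denom = math.sqrt(var_u * var_v)
    if denom == 0:
        return 0.0
    return cov / denom


# Hypothetical users: user_f follows the same trend as user_e, user_g the opposite.
user_e = {"i1": 5, "i2": 3, "i3": 1}
user_f = {"i1": 4, "i2": 3, "i3": 2}
user_g = {"i1": 1, "i2": 3, "i3": 5}

sim_ef = pearson_sim(user_e, user_f)
sim_eg = pearson_sim(user_e, user_g)
print(sim_ef, sim_eg)
```

Mean-centering is what separates this from the cosine similarity: two users with the same rating trend but different rating scales still correlate perfectly.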

##4. pearson_baseline(): 