# High performance recommendation systems
We try to improve the performance of various recommendation systems by using non-trivial algorithms and by tuning their hyper-parameters.

Import the needed Python packages.

In [None]:
!pip install scikit-surprise
import multiprocessing

import numpy as np
import pandas as pd
from scipy.stats import randint, uniform
from surprise import Dataset, Reader
from surprise.model_selection import KFold, cross_validate, RandomizedSearchCV
from surprise.prediction_algorithms.random_pred import NormalPredictor
from surprise.prediction_algorithms.baseline_only import BaselineOnly
from surprise.prediction_algorithms.knns import KNNBasic, KNNWithMeans, KNNWithZScore, KNNBaseline
from surprise.prediction_algorithms.co_clustering import CoClustering
from surprise.prediction_algorithms.slope_one import SlopeOne
from surprise.prediction_algorithms.matrix_factorization import SVD, SVDpp, NMF

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting scikit-surprise
  Downloading scikit-surprise-1.1.3.tar.gz (771 kB)
  Preparing metadata (setup.py) ... done
Building wheels for collected packages: scikit-surprise
  Building wheel for scikit-surprise (setup.py) ... done
  Created wheel for scikit-surprise: filename=scikit_surprise-1.1.3-cp39-cp39-linux_x86_64.whl size=3195827 sha256=b680d9534b41bc16fc0f354a308e25b60ad0811302cf1632b25e6b9b04249a6c
  Stored in directory: /root/.cache/pip/wheels/c6/3a/46/9b17b3512bdf283c6cb84f59929cdd5199d4e754d596d22784
Successfully built scikit-surprise
Installing collected packages: scikit-surprise
Successfully installed scikit-surprise-1.1.3


We apply **all** recommendation algorithms made available by the “Surprise” library to the provided dataset:
* **with their default configuration**
* using **ALL** CPU cores available on the remote machine, passing their number **explicitly** as an integer.

Also:
* we use __Alternating Least Squares__ as the baseline estimation method
* we use __cosine similarity__ as the similarity measure
* we use __item-item similarity__
* if a number of iterations has to be set, it will be 25

> Not all options may be applicable to all algorithms
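
To make the similarity setting concrete, here is a minimal plain-Python sketch of item-item cosine similarity. The toy `ratings` dictionary and the `item_cosine` helper are hypothetical and not part of Surprise; the sketch assumes the similarity is computed only over the users who rated both items.

```python
from math import sqrt

# Hypothetical toy ratings: {user: {item: rating}}.
ratings = {
    "u1": {"a": 4, "b": 2},
    "u2": {"a": 2, "b": 1, "c": 5},
    "u3": {"b": 3, "c": 3},
}

def item_cosine(ratings, i, j):
    """Cosine similarity between items i and j over users who rated both."""
    common = [(r[i], r[j]) for r in ratings.values() if i in r and j in r]
    if not common:
        return 0.0  # assumption: no common users means zero similarity
    dot = sum(x * y for x, y in common)
    norm_i = sqrt(sum(x * x for x, _ in common))
    norm_j = sqrt(sum(y * y for _, y in common))
    return dot / (norm_i * norm_j)

sim_ab = item_cosine(ratings, "a", "b")  # two common users, proportional ratings
sim_ac = item_cosine(ratings, "a", "c")  # a single common user
print(sim_ab, sim_ac)
```

With non-negative ratings the cosine is never negative, and a pair of items sharing a single user always gets the maximum similarity, which is worth keeping in mind on a small dataset.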

### 1.1
Prepare the dataset for the Recommendation algorithms.

> It should be a Pandas DataFrame with three fields: `Ruler`, `Knight`, `Rating`.

> Each row must contain two characters' names if they appear together in at least one chapter **text**.

> The relevant characters are only those extracted in Part 1.1.3.

> Keep in mind that some characters have alternative names, but they refer to the same character.

> The dataset must not contain repetitions.

Also:

> A `Ruler` is a character whose name starts with `King` or `Queen`.

> A `Knight` is a character whose name starts with `Knight` or `Sir`.

> The `Rating` represents the number of chapters in which two characters appear together.
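
As an illustration of the rating definition, the sketch below counts co-occurrences on a few made-up chapter texts. The `chapters` list, the `aliases` table and both helpers are hypothetical; matching is a plain substring test, which is an assumption and can over-match (for example, an alias contained in a longer word).

```python
# Hypothetical chapter texts and alias table (not the real dataset).
chapters = [
    "King Arthur rode out with Sir Kay at dawn.",
    "Arthur spoke to Sir Kay and to Queen Guenever.",
    "Sir Launcelot fought alone.",
]
aliases = {
    "King Arthur": ["King Arthur", "Arthur"],
    "Sir Kay": ["Sir Kay"],
    "Queen Guenever": ["Queen Guenever"],
}

def appears(name, text):
    """True if any alias of `name` occurs in `text` (plain substring match)."""
    return any(alias in text for alias in aliases[name])

def rating(ruler, knight):
    """Number of chapters in which both characters appear."""
    return sum(appears(ruler, t) and appears(knight, t) for t in chapters)

arthur_kay = rating("King Arthur", "Sir Kay")
guenever_kay = rating("Queen Guenever", "Sir Kay")
print(arthur_kay, guenever_kay)
```

The second chapter counts for the first pair only because the alias `Arthur` is checked in addition to the full name.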

In [None]:
couples_df = pd.read_csv("couples.csv")  # from the pagerank.ipynb file in this repo
c1, c2 = couples_df.character_1, couples_df.character_2

# Pairs listed as (ruler, knight).
rate_df = couples_df[(c1.str.startswith("King") | c1.str.startswith("Queen"))
                     & (c2.str.startswith("Sir") | c2.str.startswith("Knight"))].copy()
rate_df.columns = ["Ruler", "Knight"]

# Pairs listed as (knight, ruler); pd.concat realigns them by column name.
temp = couples_df[(c1.str.startswith("Knight") | c1.str.startswith("Sir"))
                  & (c2.str.startswith("Queen") | c2.str.startswith("King"))].copy()
temp.columns = ["Knight", "Ruler"]

rate_df = pd.concat([rate_df, temp], ignore_index=True)

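def couple_rating(ruler, knight, df: pd.DataFrame) -> int:
  """Count the chapters in `df.text` that mention both characters.

  Relies on the global `name_dict` (defined outside this cell), which maps
  each character to the list of their alternative names.
  """

  def check_nickname(name, text, nicks) -> bool:
    # True if any alternative name of `name` occurs in `text`.
    return any(nick in text for nick in nicks[name])

  both = df.text.apply(lambda x: check_nickname(ruler, x, name_dict) and check_nickname(knight, x, name_dict))
  return int(both.sum())


# `df` is the chapters DataFrame (with a `text` column) defined outside this cell.
rate_df["Rating"] = rate_df.apply(lambda x: couple_rating(x.Ruler, x.Knight, df), axis=1)
rate_df.sort_values(by="Rating", inplace=True, ascending=False)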

### 1.2
Inspect the dataset:

1. For each field, print the minimum and maximum values.

2. Print also the rows of the dataset where `Sir Accolon` appears.


In [None]:
minimum = rate_df.Rating.min()
print(f"The minimum rating is: {minimum}")

maximum = rate_df.Rating.max()
print()
print(f"The maximum rating is: {maximum}")

print()
print("The rows in which 'Sir Accolon' appears:")
display(rate_df[rate_df.Knight == 'Sir Accolon'])

The minimum rating is: 1

The maximum rating is: 238

The rows in which 'Sir Accolon' appears:

| | Ruler | Knight | Rating |
|---|---|---|---|


### 1.3
Load the dataset into the appropriate scikit-surprise structure.

In [None]:
reader = Reader(rating_scale=[minimum, maximum])

data = Dataset.load_from_df(df=rate_df,
                            reader=reader)

### 1.4
Initialize a `scikit-surprise` `KFold` object with 3 folds.

In [None]:
seed = 24

kf = KFold(n_splits=3, random_state=seed)
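
For intuition about what a 3-fold split does, the following plain-Python sketch partitions shuffled sample indices into three test folds. It is an illustration only, not Surprise's implementation, and `k_fold_indices` is a hypothetical helper.

```python
import random

def k_fold_indices(n_samples, n_splits, seed):
    """Yield (train, test) index lists; each sample is tested exactly once."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    for k in range(n_splits):
        test = indices[k::n_splits]  # every n_splits-th shuffled index
        held_out = set(test)
        train = [i for i in indices if i not in held_out]
        yield train, test

folds = list(k_fold_indices(n_samples=10, n_splits=3, seed=24))
for train, test in folds:
    print(len(train), len(test))
```

Fixing the seed makes the folds reproducible, which is what allows different algorithms to be compared on identical splits.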

### 1.5
Define **all** the algorithms you are going to use.

In [None]:
algorithms = [NormalPredictor, BaselineOnly,
              KNNBasic, KNNWithMeans, KNNWithZScore,
              KNNBaseline, CoClustering, SlopeOne, SVD, SVDpp, NMF]

### 1.6
Define the parameter configurations for each selected algorithm.

Each configuration must be a Python `dict`.

Ensure that the definition meets the requirements of Part 2, but is also as minimal as possible (the fewer parameters you define, the better).

In [None]:
n_epochs = 25

bsl_options = {"method": "als", "n_epochs": n_epochs}
sim_options = {"name": "cosine", "user_based": False}

parameters = {}
parameters['NormalPredictor'] = {}
parameters['BaselineOnly'] = {'bsl_options': bsl_options}

parameters['KNNBasic'] = {'sim_options': sim_options}
parameters['KNNWithMeans'] = {'sim_options': sim_options}
parameters['KNNWithZScore'] = {'sim_options': sim_options}
parameters['KNNBaseline'] = {'sim_options': sim_options,
                             'bsl_options': bsl_options}

parameters['CoClustering'] = {'n_epochs': n_epochs}

parameters['SlopeOne'] = {}

parameters['SVD'] = {'n_epochs': n_epochs}
parameters['SVDpp'] = {'n_epochs': n_epochs}
parameters['NMF'] = {'n_epochs': n_epochs}

### 1.7
Print the number of CPU cores belonging to the machine on which Colab is running.

In [None]:
# This number (2) is used in both sections 1 and 2
cores = multiprocessing.cpu_count()
print(f"Number of CPU cores belonging to the machine on which Colab is running: {cores}")

Number of CPU cores belonging to the machine on which Colab is running: 2


Run a **3-fold** cross-validation for all the selected algorithms, using the parameter configuration selected for each algorithm.
Make sure that the cross-validation outputs the metric values for each split, as well as the fit and test times.

In [None]:
results = {}

for algo in algorithms:
  results[algo.__name__] = cross_validate(algo(**parameters[algo.__name__]),
                                          data,
                                          measures=['MSE', 'MAE'],
                                          cv=kf,
                                          verbose=True)

Evaluating MSE, MAE of algorithm NormalPredictor on 3 split(s).

                  Fold 1    Fold 2    Fold 3    Mean      Std
MSE (testset)     579.4720  337.9442  395.7289  437.7151  102.9759
MAE (testset)     12.6612   12.3210   12.4885   12.4902   0.1389
Fit time          0.00      0.00      0.00      0.00      0.00
Test time         0.01      0.00      0.00      0.00      0.00
Estimating biases using als...
Estimating biases using als...
Estimating biases using als...
Evaluating MSE, MAE of algorithm BaselineOnly on 3 split(s).

                  Fold 1    Fold 2    Fold 3    Mean      Std
MSE (testset)     361.5615  141.3187  201.5729  234.8177  92.9359
MAE (testset)     6.6581    6.6288    6.7004    6.6624    0.0294
Fit time          0.00      0.00      0.00      0.00      0.00
Test time         0.00      0.00      0.00      0.00      0.00
Computing the cosine similarity matrix...
Done computing similarity matrix.
Computing the cosine similarity matrix...
Done computing similarity matrix.
Computing the cosine similarity matrix
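
The metrics in these reports can be reproduced by hand. The sketch below defines MSE and MAE on a hypothetical pair of predictions, then recomputes the Mean and Std columns from the per-fold MSE values printed above for `NormalPredictor`. Using the population standard deviation is an assumption, chosen because it agrees with the reported Std.

```python
from statistics import mean, pstdev

def mse(pairs):
    """Mean squared error over (true, estimated) rating pairs."""
    return mean((true - est) ** 2 for true, est in pairs)

def mae(pairs):
    """Mean absolute error over (true, estimated) rating pairs."""
    return mean(abs(true - est) for true, est in pairs)

# Hypothetical predictions for one test fold (not from the real dataset).
toy_fold = [(3, 2), (10, 14)]
toy_mse, toy_mae = mse(toy_fold), mae(toy_fold)

# Per-fold MSE of NormalPredictor, copied from the output above.
normal_mse = [579.4720, 337.9442, 395.7289]
summary = (round(mean(normal_mse), 4), round(pstdev(normal_mse), 2))
print(toy_mse, toy_mae, summary)
```

Because MSE squares each error, a single badly predicted high rating dominates a fold, which explains why the MSE varies so much more across folds than the MAE does.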

### 1.8
Rank all recommendation algorithms we tested according to the mean of the Mean Squared Error metric value: from the worst to the best algorithm.

Print out the ranking: algorithm name and MSE value.

In [None]:
mse = [np.mean(results[result]['test_mse']) for result in results]
algo_names = [algo for algo in results]

mse_ranking = pd.DataFrame({'Algorithm': algo_names, 'MSE': mse}).sort_values(by="MSE", ascending=False)

display(mse_ranking)

| | Algorithm | MSE |
|---|---|---|
| 9 | SVDpp | 35658.146714 |
| 0 | NormalPredictor | 437.715063 |
| 2 | KNNBasic | 244.338955 |
| 7 | SlopeOne | 243.558534 |
| 1 | BaselineOnly | 234.817684 |
| 5 | KNNBaseline | 228.211501 |
| 3 | KNNWithMeans | 222.454754 |
| 4 | KNNWithZScore | 208.677463 |
| 6 | CoClustering | 148.623006 |
| 8 | SVD | 123.072442 |


### 1.9
Select the algorithm with the best result in the previous test.

We test a maximum of **31** possible configurations for the selected recommendation algorithm. Each configuration must specify at least 2\* and no more than 5\* parameters. Also, disregard the configuration limitations described at the start of the homework.

We obtain the best configuration among all configurations, considering the Root Mean Squared Error metric calculated on a cross-validation of **5** folds.

1. Define the configuration dictionary that will be used for parameter optimisation.
2. Find a model configuration that offers the best possible performance within the given constraints. Print this configuration.

The resulting solution must outperform the default configuration according to the Mean Absolute Error metric.

> \* If a parameter is itself composed of several parameters (e.g. if it is a dictionary), each will be counted separately when calculating the total number of parameters to be optimised.
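
The idea behind a randomized search can be sketched in plain Python: draw a fixed number of configurations at random and keep the one with the lowest score. Everything here is hypothetical, in particular `fake_cv_rmse`, which only stands in for a 5-fold cross-validated RMSE and says nothing about real model quality. The ranges mirror the scipy distributions used in this step's code, where `randint(50, 100)` excludes 100 and `uniform(0.01, 0.10)` spans `[0.01, 0.11]`.

```python
import random

rng = random.Random(24)

# Hypothetical search space: 5 tunable parameters (within the 2 to 5 limit).
space = {
    "n_factors": lambda: rng.randint(50, 99),
    "n_epochs": lambda: rng.randint(50, 99),
    "biased": lambda: rng.choice([True, False]),
    "reg_qi": lambda: rng.uniform(0.01, 0.11),
    "lr_bi": lambda: rng.uniform(0.01, 0.11),
}

def fake_cv_rmse(config):
    """Stand-in for a cross-validated RMSE; NOT a real model score."""
    return abs(config["n_factors"] - 60) / 10 + config["reg_qi"]

# Draw 31 configurations and keep the one with the lowest score.
candidates = [{name: draw() for name, draw in space.items()} for _ in range(31)]
best_config = min(candidates, key=fake_cv_rmse)
print(best_config)
```

Unlike a grid search, the budget of 31 evaluations stays fixed no matter how many parameters or values the search space contains.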

In [None]:
best = mse_ranking[mse_ranking.MSE == mse_ranking.MSE.min()].Algorithm.values.flat[0]
print(f"Doing parameters tuning for our best model ({best})...")
print()

# scipy's randint(low, high) excludes high; uniform(loc, scale) spans [loc, loc + scale].
params = {'n_factors': randint(50, 100),
          'n_epochs': randint(50, 100),
          'biased': [True, False],
          'reg_qi': uniform(0.01, 0.10),
          'lr_bi': uniform(0.01, 0.10)}

# Look the class up by name instead of calling eval() on a string.
best_algo = {algo.__name__: algo for algo in algorithms}[best]

rs = RandomizedSearchCV(best_algo, param_distributions=params, n_iter=31,
                        measures=['mae', 'rmse'],
                        cv=KFold(n_splits=5, random_state=seed), n_jobs=cores)

print("Performing the RandomizedSearch", end="")
rs.fit(data)
print(", done.")

print()
print("The best configuration for RMSE:")
display(pd.DataFrame(rs.best_params['rmse'], index=[best]))

# Compare MAE with MAE: the reference is the mean test MAE of the earlier cross-validation.
default_mae = np.mean(results[best]['test_mae'])
conclusion = 'Yes.' if rs.cv_results['mean_test_mae'][rs.best_index['rmse']] < default_mae else 'No.'
print()
print(f"Does the resulting configuration exceed the default configuration for MAE? {conclusion}")

Doing parameters tuning for our best model (NMF)...

Performing the RandomizedSearch, done.

The best configuration for RMSE:


| | biased | lr_bi | n_epochs | n_factors | reg_qi |
|---|---|---|---|---|---|
| NMF | False | 0.032043 | 94 | 59 | 0.087785 |



Does the resulting configuration exceed the default configuration for MAE? Yes.


### 1.10
Consider this scenario:

* There are $n$ users and $m$ items.
* The items are divided into two groups $I_A$ and $I_B$.
* Users can like (rating $1$) all items in group $I_A$ and dislike (rating $0$) those in group $I_B$, or vice versa, but no intermediate case; thus users can also be divided into users in group $U_A$ and users in group $U_B$.
* Suppose we have all $n \times m$ ratings.

Now, consider this:

* A new user $u$ is added and we record their preference for an item $i$ from group $I_A$ (rating $1$).

> What will be the estimated rating of an item $a \in I_A, a \neq i$ for user $u$ if we use user-based collaborative filtering? What will be the rating of item $b \in I_B$ instead?

> If the user adds that they do not like an item $j$ belonging to group $I_B$, how would the above ratings change ($b \neq j$)?

With only one rating available for user $u$, we replace the missing values (for the similarity calculation) with each user's average rating and center the reference space by subtracting that average from each row of the user-item matrix. At the end of this adjustment, user $u$ is an all-zero vector (the mean is $1/1 = 1$) and will be more similar to the group of users with more zeros (as a result of the centering).\
To actually determine which of the two groups the user belongs to, we need the user's rating for an item in group $I_B$: under the assumption that only two groups exist, we then fall back to the trivial case, in which the remaining values of the user vector are completed automatically.
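
The adjustment described above can be checked on a toy instance. Item names, group sizes and the helper functions in this sketch are hypothetical, and cosine similarity is an assumption, since the answer does not name a similarity measure. The sketch fills missing ratings with the user's mean, centers each row, and compares $u$ with one user from each group, first with the single rating and then with the second one added.

```python
from math import sqrt

I_A, I_B = ["a1", "a2", "a3"], ["b1", "b2"]  # hypothetical item groups
items = I_A + I_B

def center(known):
    """Fill missing ratings with the user's mean, then subtract that mean."""
    mu = sum(known.values()) / len(known)
    return [known.get(item, mu) - mu for item in items]

def cosine(x, y):
    """Cosine similarity, or None when one vector is all zeros (undefined)."""
    nx, ny = sqrt(sum(v * v for v in x)), sqrt(sum(v * v for v in y))
    if nx == 0 or ny == 0:
        return None
    return sum(a * b for a, b in zip(x, y)) / (nx * ny)

user_A = center({**{i: 1 for i in I_A}, **{i: 0 for i in I_B}})  # likes I_A
user_B = center({**{i: 0 for i in I_A}, **{i: 1 for i in I_B}})  # likes I_B

u_one = center({"a1": 1})           # only item i = a1 rated
u_two = center({"a1": 1, "b1": 0})  # item j = b1 added with rating 0

print(u_one, cosine(u_one, user_A), cosine(u_one, user_B))
print(u_two, cosine(u_two, user_A), cosine(u_two, user_B))
```

With a single rating the centered vector of $u$ carries no direction, so this sketch reports the cosine as undefined. Once the rating for $j$ is added, the similarity is positive with the $U_A$ user and negative with the $U_B$ user.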

