<a href="https://colab.research.google.com/github/CSheppardCodes/Scholastic-Study-of-Data-Science/blob/main/Recommendation_System.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Recommendation Systems
We will use Surprise, a Python library for building recommender systems. Details are available at http://surpriselib.com

We will first work through an example using a built-in dataset and then use a custom one.

First, ensure that you have the library installed and then load the required packages.

In [None]:
!pip install scikit-surprise

Collecting scikit-surprise
  Downloading scikit-surprise-1.1.3.tar.gz (771 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 772.0/772.0 kB 11.3 MB/s eta 0:00:00
  Preparing metadata (setup.py) ... done
Building wheels for collected packages: scikit-surprise
  Building wheel for scikit-surprise (setup.py) ... done
  Created wheel for scikit-surprise: filename=scikit_surprise-1.1.3-cp310-cp310-linux_x86_64.whl size=3163479 sha256=3ab1237f11168e4532e0be4a48e56cd78e57fcc87b17d7c73c9d53e77cd7019b
  Stored in directory: /root/.cache/pip/wheels/a5/ca/a8/4e28def53797fdc4363ca4af740db15a9c2f1595ebc51fb445
Successfully built scikit-surprise
Installing collected packages: scikit-surprise
Successfully installed scikit-surprise-1.1.3


In [None]:
import io

import numpy as np
import pandas as pd
from surprise import Dataset, KNNBaseline, NormalPredictor, Reader
from surprise import accuracy, get_dataset_dir
from surprise.model_selection import KFold, cross_validate

A recommendation system needs a file containing at least three fields: userId, itemId, and rating. Other information is not required, but it can help when interpreting the results.

Let's load the built-in ml-100k dataset, which contains movies and ratings.

In [None]:
# Load the MovieLens 100k dataset (download it if needed)
data = Dataset.load_builtin('ml-100k')

Dataset ml-100k could not be found. Do you want to download it? [Y/n] y
Trying to download dataset from https://files.grouplens.org/datasets/movielens/ml-100k.zip...
Done! Dataset ml-100k has been saved to /root/.surprise_data/ml-100k


In [None]:
# Let's see what files come with the dataset
!ls /root/.surprise_data/ml-100k/ml-100k/

allbut.pl  u1.base  u2.test  u4.base  u5.test  ub.base	u.genre  u.occupation
mku.sh	   u1.test  u3.base  u4.test  ua.base  ub.test	u.info	 u.user
README	   u2.base  u3.test  u5.base  ua.test  u.data	u.item


In [None]:
# TODO: Show the first 10 lines of the u.data, and u.item files
!head -10 /root/.surprise_data/ml-100k/ml-100k/u.data

196	242	3	881250949
186	302	3	891717742
22	377	1	878887116
244	51	2	880606923
166	346	1	886397596
298	474	4	884182806
115	265	2	881171488
253	465	5	891628467
305	451	3	886324817
6	86	3	883603013
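Each row of u.data is tab-separated: user id, item id, rating, and a Unix timestamp. As a minimal sketch of how one might read such rows by hand (the `sample` string and the `rows` structure are illustrative names, not part of Surprise), the first three rows from the output above can be parsed like this:

```python
import csv
import io
from datetime import datetime, timezone

# First three rows of u.data, copied from the output above (tab-separated).
sample = "196\t242\t3\t881250949\n186\t302\t3\t891717742\n22\t377\t1\t878887116\n"

rows = []
for user_id, item_id, rating, timestamp in csv.reader(io.StringIO(sample), delimiter="\t"):
    rows.append({
        "user_id": user_id,      # raw ids are kept as strings
        "item_id": item_id,
        "rating": float(rating),
        "rated_at": datetime.fromtimestamp(int(timestamp), tz=timezone.utc),
    })

for row in rows:
    print(row["user_id"], row["item_id"], row["rating"], row["rated_at"].date())
```

Only the first three fields matter to the recommender; the timestamp is extra information.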


## Algorithms
Let's look at some of the algorithms available in the package.

In [None]:
?KNNBaseline

Nearest-neighbor methods work by searching for neighbors in the utility matrix (the user-by-item matrix of ratings). Neighbors can be computed between items or between users; let's start with item-item similarity.
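Before fitting the library model, the idea can be sketched on a toy utility matrix. This is an illustration only: the ratings are made up, and it uses plain cosine similarity rather than the `pearson_baseline` measure used in this notebook. The helper names `cosine_sim` and `nearest_items` are hypothetical.

```python
from math import sqrt

# Toy ratings: (user, item, rating). These values are made up for illustration.
ratings = [
    ("u1", "A", 5), ("u1", "B", 4),
    ("u2", "A", 4), ("u2", "B", 5), ("u2", "C", 1),
    ("u3", "A", 1), ("u3", "C", 5),
]

# Utility matrix stored sparsely: item -> {user: rating}.
utility = {}
for user, item, rating in ratings:
    utility.setdefault(item, {})[user] = rating

def cosine_sim(item_a, item_b):
    """Cosine similarity computed over the users who rated both items."""
    common = utility[item_a].keys() & utility[item_b].keys()
    if not common:
        return 0.0
    dot = sum(utility[item_a][u] * utility[item_b][u] for u in common)
    norm_a = sqrt(sum(utility[item_a][u] ** 2 for u in common))
    norm_b = sqrt(sum(utility[item_b][u] ** 2 for u in common))
    return dot / (norm_a * norm_b)

def nearest_items(item, k=2):
    """Return the k items most similar to `item`, best first."""
    others = [(other, cosine_sim(item, other)) for other in utility if other != item]
    return sorted(others, key=lambda pair: pair[1], reverse=True)[:k]

print(nearest_items("A"))
```

Item B comes out closest to item A because the two users who rated both gave them similar scores.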

In [None]:
data = Dataset.load_builtin('ml-100k')
trainset = data.build_full_trainset()
# we are going to use item-item similarity
sim_options = {'name': 'pearson_baseline', 'user_based': False}
algo = KNNBaseline(sim_options=sim_options)
algo.fit(trainset)

Estimating biases using als...
Computing the pearson_baseline similarity matrix...
Done computing similarity matrix.


<surprise.prediction_algorithms.knns.KNNBaseline at 0x797d8e0b6e30>

In [None]:
!head -10 /root/.surprise_data/ml-100k/ml-100k/u.item

1|Toy Story (1995)|01-Jan-1995||http://us.imdb.com/M/title-exact?Toy%20Story%20(1995)|0|0|0|1|1|1|0|0|0|0|0|0|0|0|0|0|0|0|0
2|GoldenEye (1995)|01-Jan-1995||http://us.imdb.com/M/title-exact?GoldenEye%20(1995)|0|1|1|0|0|0|0|0|0|0|0|0|0|0|0|0|1|0|0
3|Four Rooms (1995)|01-Jan-1995||http://us.imdb.com/M/title-exact?Four%20Rooms%20(1995)|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|1|0|0
4|Get Shorty (1995)|01-Jan-1995||http://us.imdb.com/M/title-exact?Get%20Shorty%20(1995)|0|1|0|0|0|1|0|0|1|0|0|0|0|0|0|0|0|0|0
5|Copycat (1995)|01-Jan-1995||http://us.imdb.com/M/title-exact?Copycat%20(1995)|0|0|0|0|0|0|1|0|1|0|0|0|0|0|0|0|1|0|0
6|Shanghai Triad (Yao a yao yao dao waipo qiao) (1995)|01-Jan-1995||http://us.imdb.com/Title?Yao+a+yao+yao+dao+waipo+qiao+(1995)|0|0|0|0|0|0|0|0|1|0|0|0|0|0|0|0|0|0|0
7|Twelve Monkeys (1995)|01-Jan-1995||http://us.imdb.com/M/title-exact?Twelve%20Monkeys%20(1995)|0|0|0|0|0|0|0|0|1|0|0|0|0|0|0|1|0|0|0
8|Babe (1995)|01-Jan-1995||http://us.imdb.com/M/title-exact?Babe%20(1995)|0|0|0|0|1

# Id to Name Lookup
Let's write a small function that maps raw ids to movie names and movie names to raw ids.

In [None]:
def read_item_names():
    """Read the u.item file from MovieLens 100-k dataset and return two
    mappings to convert raw ids into movie names and movie names into raw ids.
    """

    file_name = get_dataset_dir() + '/ml-100k/ml-100k/u.item'
    rid_to_name = {}
    name_to_rid = {}
    with io.open(file_name, 'r', encoding='ISO-8859-1') as f:
        for line in f:
            line = line.split('|')
            rid_to_name[line[0]] = line[1]
            name_to_rid[line[1]] = line[0]

    return rid_to_name, name_to_rid

In [None]:
# test this function
rid_to_name, name_to_rid = read_item_names()

In [None]:
rid_to_name["1"]

'Toy Story (1995)'

In [None]:
name_to_rid["Twelve Monkeys (1995)"]

'7'

In [None]:
# Find the 10 movies most similar to the movie with raw id 200

movie_inner_id = algo.trainset.to_inner_iid("200")
movie_name = rid_to_name["200"]

# Retrieve inner ids of the nearest neighbors of that movie.
movie_neighbors = algo.get_neighbors(movie_inner_id, k=10)

# Convert inner ids of the neighbors into names.
movie_neighbors = (algo.trainset.to_raw_iid(inner_id)
                       for inner_id in movie_neighbors)
movie_neighbors = (rid_to_name[rid]
                       for rid in movie_neighbors)

print()

print('The 10 nearest neighbors of ' + movie_name)
for movie in movie_neighbors:
    print(movie)


The 10 nearest neighbors of Shining, The (1980)
Bonnie and Clyde (1967)
Godfather: Part II, The (1974)
Alien (1979)
Godfather, The (1972)
Raging Bull (1980)
Pulp Fiction (1994)
One Flew Over the Cuckoo's Nest (1975)
Carrie (1976)
Koyaanisqatsi (1983)
His Girl Friday (1940)
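The cell above converts between raw ids (the strings found in the data file) and inner ids (the integers the trainset uses internally). A minimal sketch of such a two-way mapping follows. It assumes, purely for illustration, that inner ids are handed out in order of first appearance. The list of ids is made up.

```python
# Raw item ids in the order they appear in a (made-up) ratings file.
raw_item_ids = ["242", "302", "377", "242", "51", "302"]

raw_to_inner = {}
for raw_id in raw_item_ids:
    # Assign the next free integer the first time a raw id is seen.
    raw_to_inner.setdefault(raw_id, len(raw_to_inner))

# Reverse lookup, playing the role of to_raw_iid in the cell above.
inner_to_raw = {inner: raw for raw, inner in raw_to_inner.items()}

print(raw_to_inner)
print(inner_to_raw[3])
```

Because neighbors are returned as inner ids, they have to be translated back to raw ids before the name lookup can be used.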


Let's now apply the algorithm and measure its accuracy.

In [None]:
testset = trainset.build_testset()
predictions = algo.test(testset)
# RMSE is optimistically low here because we test on the same ratings we trained on
accuracy.rmse(predictions, verbose=True)  # about 0.48 in this run

RMSE: 0.4807


0.48071109787164656
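For reference, RMSE is the square root of the mean squared difference between true and estimated ratings. The sketch below computes it by hand on a few hypothetical (true, estimated) pairs; the `rmse` helper here is a stand-in written for illustration, not the library function.

```python
from math import sqrt

def rmse(pairs):
    """Root mean squared error over (true_rating, estimated_rating) pairs."""
    squared_errors = [(true - est) ** 2 for true, est in pairs]
    return sqrt(sum(squared_errors) / len(squared_errors))

# Hypothetical (true, estimated) ratings on a 1-5 scale.
pairs = [(4, 3.5), (2, 2.5), (5, 4.0), (3, 3.0)]
print(round(rmse(pairs), 4))
```

Because squared errors are averaged before the root is taken, a single large miss raises RMSE more than several small ones.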

Now, let's also try some baseline methods. Follow the code available here:

https://github.com/NicolasHug/Surprise/blob/fa7455880192383f01475162b4cbd310d91d29ca/examples/baselines_conf.py

For more elaborate testing and validation, follow the steps in this example:
https://github.com/NicolasHug/Surprise/blob/fa7455880192383f01475162b4cbd310d91d29ca/examples/grid_search_usage.py

# Assignment

In this part, you will use the dataset provided on the following Kaggle page:

https://www.kaggle.com/arashnic/book-recommendation-dataset


I have uploaded the files for you at

Ratings file - https://an-utd-course.s3.us-west-1.amazonaws.com/CompDS/Ratings.csv

Books file - https://an-utd-course.s3.us-west-1.amazonaws.com/CompDS/Books.csv


Follow the steps below to create a recommendation system from this data

In [None]:
# TODO: Read both the data files into Pandas dataframes

In [None]:
# TODO: Answer the following questions:
# How many ratings and how many books are there in the dataset?
# Find the top 10 books that have received the highest count of ratings. You should output the id of the book, its title, and the count of ratings received.
import pandas as pd

ratings = pd.read_csv("https://an-utd-course.s3.us-west-1.amazonaws.com/CompDS/Ratings.csv")
books = pd.read_csv("https://an-utd-course.s3.us-west-1.amazonaws.com/CompDS/Books.csv")





In [None]:
ratings.head()

Unnamed: 0,User-ID,ISBN,Book-Rating
0,276725,034545104X,0
1,276726,0155061224,5
2,276727,0446520802,0
3,276729,052165615X,3
4,276729,0521795028,6


In [None]:
ratings.count()

User-ID        1149780
ISBN           1149780
Book-Rating    1149780
dtype: int64

In [None]:
# Number of distinct users who submitted at least one rating
ratings['User-ID'].nunique()

105283

In [None]:
books.count()

ISBN                   271360
Book-Title             271360
Book-Author            271359
Year-Of-Publication    271360
Publisher              271358
Image-URL-S            271360
Image-URL-M            271360
Image-URL-L            271357
dtype: int64

In [None]:
df = ratings.merge(books, on = "ISBN")

In [None]:
df.head()


Unnamed: 0,User-ID,ISBN,Book-Rating,Book-Title,Book-Author,Year-Of-Publication,Publisher,Image-URL-S,Image-URL-M,Image-URL-L
0,276725,034545104X,0,Flesh Tones: A Novel,M. J. Rose,2002,Ballantine Books,http://images.amazon.com/images/P/034545104X.0...,http://images.amazon.com/images/P/034545104X.0...,http://images.amazon.com/images/P/034545104X.0...
1,2313,034545104X,5,Flesh Tones: A Novel,M. J. Rose,2002,Ballantine Books,http://images.amazon.com/images/P/034545104X.0...,http://images.amazon.com/images/P/034545104X.0...,http://images.amazon.com/images/P/034545104X.0...
2,6543,034545104X,0,Flesh Tones: A Novel,M. J. Rose,2002,Ballantine Books,http://images.amazon.com/images/P/034545104X.0...,http://images.amazon.com/images/P/034545104X.0...,http://images.amazon.com/images/P/034545104X.0...
3,8680,034545104X,5,Flesh Tones: A Novel,M. J. Rose,2002,Ballantine Books,http://images.amazon.com/images/P/034545104X.0...,http://images.amazon.com/images/P/034545104X.0...,http://images.amazon.com/images/P/034545104X.0...
4,10314,034545104X,9,Flesh Tones: A Novel,M. J. Rose,2002,Ballantine Books,http://images.amazon.com/images/P/034545104X.0...,http://images.amazon.com/images/P/034545104X.0...,http://images.amazon.com/images/P/034545104X.0...


In [None]:
df.groupby("Book-Title").count()["Book-Rating"].sort_values(ascending=False)

Book-Title
Wild Animus                                                                           2502
The Lovely Bones: A Novel                                                             1295
The Da Vinci Code                                                                      898
A Painted House                                                                        838
The Nanny Diaries: A Novel                                                             828
                                                                                      ... 
Real Love: The Truth About Finding Unconditional Love and Fulfilling Relationships       1
Real Love: The Drawings for Sean                                                         1
Real Love or Fake (Camfield Novel of Love, No 78)                                        1
Fabulous Food for Family and Friends: Healthy Menus for Entertaining With Style          1
Suburban backlash: The battle for the world's most liveable city               
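The grouping above counts ratings per title, while the task asks for the book id, title, and count together. A library-free sketch of that counting logic is given here on made-up rows (`ratings_rows` and `titles` are hypothetical stand-ins for the two files). It counts per ISBN, since different editions may share a title. In pandas the same count could be obtained with `groupby("ISBN").size()`.

```python
from collections import Counter

# Toy stand-ins for the two files; these rows are made up for illustration.
ratings_rows = [
    ("u1", "isbn-1", 5), ("u2", "isbn-1", 0), ("u3", "isbn-1", 8),
    ("u1", "isbn-2", 7), ("u2", "isbn-2", 9),
    ("u3", "isbn-3", 6),
]
titles = {"isbn-1": "Book One", "isbn-2": "Book Two", "isbn-3": "Book Three"}

# Count how many ratings each ISBN received.
counts = Counter(isbn for _, isbn, _ in ratings_rows)

# Attach the title to each of the (up to) 10 most-rated ISBNs.
top = [(isbn, titles.get(isbn, "<unknown title>"), n)
       for isbn, n in counts.most_common(10)]

for isbn, title, n in top:
    print(isbn, title, n)
```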

In [None]:
# TODO: Important - You may not be able to use the whole dataset for model creation, so you need to create a
# smaller sample to proceed further
# Here is what I did:
reviews_short = df[["User-ID", "Book-Title", "Book-Rating"]].sample(n = 1000, random_state = 42)
# you can try larger values of n, if the system allows you.
reviews_short.head()
# reviews_short.to_csv("review_short.csv")

Unnamed: 0,User-ID,Book-Title,Book-Rating
770118,162886,The Tears of My Soul,6
454727,11676,A Comedy of Heirs (Torie O'Shea Mysteries (Pap...,10
71725,78973,Fear Nothing,7
535451,14521,The Rhinemann Exchange,0
46502,277427,The Best of the Cheapskate Monthly: Simple Tip...,0


In [None]:
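# Note: Book-Rating in this data includes 0 as well as 1-10 (see the 0 ratings above),
# so rating_scale=(0, 10) would cover every value; (1, 10) is the scale used for the outputs in this notebook.
reader = Reader(rating_scale=(1,10))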
dataset1 = Dataset.load_from_df(reviews_short[["User-ID", "Book-Title", "Book-Rating"]], reader)
trainset = dataset1.build_full_trainset()

In [None]:
type(dataset1)

surprise.dataset.DatasetAutoFolds

In [None]:
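# get_neighbors() is called with an item inner id below, so the similarity must be item-based
sim_options = {'name': 'pearson_baseline', 'user_based': False}

algo = KNNBaseline(sim_options=sim_options)
algo.fit(trainset)
book_inner_id = algo.trainset.to_inner_iid("Wild Animus")

# Inner ids of the 10 nearest neighbors, converted back to raw ids (book titles)
book_neighbors = algo.get_neighbors(book_inner_id, k=10)
book_neighbors = (algo.trainset.to_raw_iid(inner_id)
                  for inner_id in book_neighbors)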

Estimating biases using als...
Computing the pearson_baseline similarity matrix...
Done computing similarity matrix.


In [None]:
for book in book_neighbors:
  print(book)

The Tears of My Soul
A Comedy of Heirs (Torie O'Shea Mysteries (Paperback))
Fear Nothing
The Rhinemann Exchange
The Best of the Cheapskate Monthly: Simple Tips for Living Lean
The Return of the King (The Lord of the Rings, Part 3)
Flut.
Summer Tree
The Third Twin: A Novel
Elementarteilchen.


In [None]:
# df = ratings.merge(books, on  = "review")

In [None]:
# TODO: Use the data to create a custom dataset in the surprise library
# Steps to do this are: https://surprise.readthedocs.io/en/stable/getting_started.html#use-a-custom-dataset

In [None]:
# TODO: Choose a book at random and use the KNNBasic algorithm to find out its 10 closest neighbors. Do the results make
# sense?

In [None]:
# TODO: Use grid search (GridSearchCV in surprise.model_selection) on the following algorithms and compare their accuracies. You are free to decide
# which specific parameters to use:
# 1. KNNBaseline
# 2. ALS - Baseline
# 3. SGD - Baseline
# 4. SVD
# You should use a cv value of at least 3 and compare the mean accuracy of each of the algorithms
# Comment on whether there are significant differences in the results of the algorithms

In [None]:
from surprise import BaselineOnly, KNNBasic, SVD
from surprise.model_selection import cross_validate

# Baseline estimates fitted with stochastic gradient descent
bsl_options = {'method': 'sgd',
               'learning_rate': .001}
algo_bsl = BaselineOnly(bsl_options=bsl_options)

# k-nearest neighbors with default settings (MSD similarity)
algo_knn = KNNBasic()

# Matrix factorization with 50 latent factors
algo_svd = SVD(n_factors=50)

# Only KNNBasic is cross-validated in this cell (5 folds by default)
print("KNN")
cross_validate(algo_knn, dataset1, verbose=True)


KNN
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Evaluating RMSE, MAE of algorithm KNNBasic on 5 split(s).

                  Fold 1  Fold 2  Fold 3  Fold 4  Fold 5  Mean    Std     
RMSE (testset)    3.8125  4.0159  4.0020  3.7784  3.8517  3.8921  0.0983  
MAE (testset)     3.5197  3.7113  3.7093  3.5291  3.5933  3.6125  0.0837  
Fit time          0.02    0.01    0.02    0.01    0.01    0.01    0.00    
Test time         0.01    0.00    0.00    0.00    0.00    0.00    0.00    


{'test_rmse': array([3.81246168, 4.01594634, 4.00198252, 3.77838611, 3.8517352 ]),
 'test_mae': array([3.5196875, 3.71125  , 3.709325 , 3.5291   , 3.593275 ]),
 'fit_time': (0.01822495460510254,
  0.012907981872558594,
  0.016382932662963867,
  0.013282299041748047,
  0.013257026672363281),
 'test_time': (0.006691455841064453,
  0.0018122196197509766,
  0.0017423629760742188,
  0.001861572265625,
  0.0017507076263427734)}
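The Mean and Std columns in the table above summarize the per-fold values. As a quick check, the sketch below recomputes them from the `test_rmse` array. It assumes Std is the population standard deviation, which is consistent with the figures printed above.

```python
from statistics import fmean, pstdev

# Per-fold RMSE values copied from the 'test_rmse' array above.
fold_rmse = [3.81246168, 4.01594634, 4.00198252, 3.77838611, 3.8517352]

mean_rmse = fmean(fold_rmse)
std_rmse = pstdev(fold_rmse)  # population standard deviation

print(f"mean={mean_rmse:.4f} std={std_rmse:.4f}")
```

When comparing algorithms, a difference in mean RMSE that is smaller than this fold-to-fold spread is hard to call significant.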

In [None]:
from surprise import SVD
from surprise.model_selection import GridSearchCV

# Note: "reg_all" as written tries three values (0, 4, and 0.6); [0.4, 0.6] may have been intended.
param_grid = {"n_epochs": [5, 10], "lr_all": [0.002, 0.005], "reg_all": [0, 4, 0.6]}
gs = GridSearchCV(SVD, param_grid, measures=["rmse", "mae"], cv=3)

In [None]:
gs.fit(dataset1)

# Best mean RMSE across the 3 folds
print(gs.best_score["rmse"])

# Parameter combination that produced the best RMSE
print(gs.best_params["rmse"])

3.8755391127671284
{'n_epochs': 10, 'lr_all': 0.005, 'reg_all': 0}
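A grid search tries every combination of the listed parameter values. The sketch below only enumerates those combinations for the grid used above; it does not fit any model, and the variable names are illustrative.

```python
from itertools import product

# The grid exactly as passed to GridSearchCV above.
param_grid = {"n_epochs": [5, 10], "lr_all": [0.002, 0.005], "reg_all": [0, 4, 0.6]}

# Cartesian product of the value lists, one dict per combination.
names = list(param_grid)
combinations = [dict(zip(names, values)) for values in product(*param_grid.values())]

print(len(combinations))
for combo in combinations[:3]:
    print(combo)
```

With 2 x 2 x 3 = 12 combinations and cv = 3, the search fits the model 36 times, which is worth keeping in mind when widening the grid or enlarging the sample.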
