
# Recommendation Systems
We will use the Surprise library for Python. Details are available at http://surpriselib.com

We will first work through an example using a built-in dataset and then use a custom one.

First, ensure that you have the library installed and then load the required packages.

In [None]:
!pip install scikit-surprise



In [None]:
import io

import numpy as np
import pandas as pd
from surprise import NormalPredictor
from surprise import Dataset
from surprise import Reader
from surprise.model_selection import cross_validate
from surprise import KNNBaseline
from surprise import get_dataset_dir
from surprise import accuracy
from surprise.model_selection import KFold

For a recommendation system, we require a file containing at least three fields: userId, itemId, and rating. Any other information is optional, but it can help a human interpret the results.

Let's load the built-in ml-100k dataset, which contains movies and ratings.

In [None]:
# Load the movielens-100k dataset (download it if needed).
data = Dataset.load_builtin('ml-100k')

In [None]:
# Let's see what files come with the dataset
!ls /root/.surprise_data/ml-100k/ml-100k/

allbut.pl  u1.base  u2.test  u4.base  u5.test  ub.base	u.genre  u.occupation
mku.sh	   u1.test  u3.base  u4.test  ua.base  ub.test	u.info	 u.user
README	   u2.base  u3.test  u5.base  ua.test  u.data	u.item


In [None]:
# TODO: Show the first 10 lines of the u.data, and u.item files
!head -10 /root/.surprise_data/ml-100k/ml-100k/u.data

196	242	3	881250949
186	302	3	891717742
22	377	1	878887116
244	51	2	880606923
166	346	1	886397596
298	474	4	884182806
115	265	2	881171488
253	465	5	891628467
305	451	3	886324817
6	86	3	883603013
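Each line of `u.data` holds four tab-separated fields: user id, item id, rating, and a Unix timestamp. As a minimal sketch of how such a line breaks down into the three fields a recommender needs, the snippet below parses the first few lines shown above. The `Rating` tuple and `parse_rating_line` helper are illustrative names, not part of Surprise.

```python
from typing import NamedTuple


class Rating(NamedTuple):
    user_id: str
    item_id: str
    rating: float
    timestamp: int


def parse_rating_line(line: str) -> Rating:
    """Split one tab-separated u.data line into its four fields."""
    user_id, item_id, rating, timestamp = line.rstrip("\n").split("\t")
    # Ids stay as strings, matching the raw ids used later (e.g. "100").
    return Rating(user_id, item_id, float(rating), int(timestamp))


# The first three lines of the output above, written with explicit tabs.
sample_lines = [
    "196\t242\t3\t881250949",
    "186\t302\t3\t891717742",
    "22\t377\t1\t878887116",
]
ratings = [parse_rating_line(line) for line in sample_lines]
for r in ratings:
    print(r.user_id, r.item_id, r.rating)
```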


## Algorithms
Let's look at some of the algorithms available in the package.

In [None]:
?KNNBaseline

Nearest-neighbor methods work by searching for neighbors in the utility matrix. Neighbors can be computed between items or between users; let's start with an item-based model.
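As a rough illustration of what "similar items" means in a utility matrix, the sketch below computes a plain Pearson correlation between items over the users who rated both. This is a simplification: it is not the `pearson_baseline` measure used in the notebook, and the toy ratings and the helper names `pearson` and `nearest_items` are made up for the example.

```python
import math

# Hypothetical toy utility matrix: ratings[item][user] = rating.
ratings = {
    "A": {"u1": 5, "u2": 3, "u3": 4},
    "B": {"u1": 4, "u2": 2, "u3": 3},
    "C": {"u1": 1, "u2": 5, "u3": 2},
}


def pearson(item_x: str, item_y: str) -> float:
    """Pearson correlation over the users who rated both items."""
    common = sorted(set(ratings[item_x]) & set(ratings[item_y]))
    if len(common) < 2:
        return 0.0  # not enough overlap to measure similarity
    xs = [ratings[item_x][u] for u in common]
    ys = [ratings[item_y][u] for u in common]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    norm = math.sqrt(sum((x - mean_x) ** 2 for x in xs)) * math.sqrt(
        sum((y - mean_y) ** 2 for y in ys)
    )
    return cov / norm if norm else 0.0


def nearest_items(target: str, k: int) -> list[str]:
    """Return the k items most similar to target, best first."""
    others = [item for item in ratings if item != target]
    return sorted(others, key=lambda item: pearson(target, item), reverse=True)[:k]


print(nearest_items("A", k=2))
```

Setting `user_based` to `True` in `sim_options` would apply the same idea to rows of users instead of items.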

In [None]:
data = Dataset.load_builtin('ml-100k')
trainset = data.build_full_trainset()
# we are going to use item-item similarity
sim_options = {'name': 'pearson_baseline', 'user_based': False}
algo = KNNBaseline(sim_options=sim_options)
algo.fit(trainset)

Estimating biases using als...
Computing the pearson_baseline similarity matrix...
Done computing similarity matrix.


<surprise.prediction_algorithms.knns.KNNBaseline at 0x79397cffe380>

In [None]:
!head -10 /root/.surprise_data/ml-100k/ml-100k/u.item

1|Toy Story (1995)|01-Jan-1995||http://us.imdb.com/M/title-exact?Toy%20Story%20(1995)|0|0|0|1|1|1|0|0|0|0|0|0|0|0|0|0|0|0|0
2|GoldenEye (1995)|01-Jan-1995||http://us.imdb.com/M/title-exact?GoldenEye%20(1995)|0|1|1|0|0|0|0|0|0|0|0|0|0|0|0|0|1|0|0
3|Four Rooms (1995)|01-Jan-1995||http://us.imdb.com/M/title-exact?Four%20Rooms%20(1995)|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|1|0|0
4|Get Shorty (1995)|01-Jan-1995||http://us.imdb.com/M/title-exact?Get%20Shorty%20(1995)|0|1|0|0|0|1|0|0|1|0|0|0|0|0|0|0|0|0|0
5|Copycat (1995)|01-Jan-1995||http://us.imdb.com/M/title-exact?Copycat%20(1995)|0|0|0|0|0|0|1|0|1|0|0|0|0|0|0|0|1|0|0
6|Shanghai Triad (Yao a yao yao dao waipo qiao) (1995)|01-Jan-1995||http://us.imdb.com/Title?Yao+a+yao+yao+dao+waipo+qiao+(1995)|0|0|0|0|0|0|0|0|1|0|0|0|0|0|0|0|0|0|0
7|Twelve Monkeys (1995)|01-Jan-1995||http://us.imdb.com/M/title-exact?Twelve%20Monkeys%20(1995)|0|0|0|0|0|0|0|0|1|0|0|0|0|0|0|1|0|0|0
8|Babe (1995)|01-Jan-1995||http://us.imdb.com/M/title-exact?Babe%20(1995)|0|0|0|0|1

# Id to Name Lookup
Let's write a small function that converts ids to names and names to ids.

In [None]:
def read_item_names():
    """Read the u.item file from MovieLens 100-k dataset and return two
    mappings to convert raw ids into movie names and movie names into raw ids.
    """

    file_name = get_dataset_dir() + '/ml-100k/ml-100k/u.item'
    rid_to_name = {}
    name_to_rid = {}
    with io.open(file_name, 'r', encoding='ISO-8859-1') as f:
        for line in f:
            line = line.split('|')
            rid_to_name[line[0]] = line[1]
            name_to_rid[line[1]] = line[0]

    return rid_to_name, name_to_rid

In [None]:
# test this function
rid_to_name, name_to_rid = read_item_names()

In [None]:
rid_to_name["1"]

'Toy Story (1995)'

In [None]:
name_to_rid["Twelve Monkeys (1995)"]

'7'

In [None]:
# Find top 10 movies similar to movie with id 100

movie_inner_id = algo.trainset.to_inner_iid("100")
movie_name = rid_to_name["100"]

# Retrieve inner ids of the nearest neighbors of the chosen movie.
movie_neighbors = algo.get_neighbors(movie_inner_id, k=10)

# Convert inner ids of the neighbors into names.
movie_neighbors = (algo.trainset.to_raw_iid(inner_id)
                       for inner_id in movie_neighbors)
movie_neighbors = (rid_to_name[rid]
                       for rid in movie_neighbors)

print()

print('The 10 nearest neighbors of ' + movie_name)
for movie in movie_neighbors:
    print(movie)


The 10 nearest neighbors of Fargo (1996)
To Die For (1995)
Lone Star (1996)
Bullets Over Broadway (1994)
Sling Blade (1996)
People vs. Larry Flynt, The (1996)
This Is Spinal Tap (1984)
Quiz Show (1994)
Mighty Aphrodite (1995)
2001: A Space Odyssey (1968)
Dolores Claiborne (1994)


Let's now apply the algorithm and measure its accuracy.

In [None]:
testset = trainset.build_testset()
predictions = algo.test(testset)
# RMSE will be optimistically low because we are testing on the same ratings used for training
accuracy.rmse(predictions, verbose=True)

RMSE: 0.4807


0.48071109787164656
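For reference, RMSE is the square root of the mean squared difference between true and estimated ratings. The sketch below computes it by hand on made-up numbers to show why a score measured on ratings the model has already seen tends to look better than one measured on held-out ratings. The `rmse` helper and both lists are hypothetical and are not taken from the run above.

```python
import math


def rmse(pairs):
    """Root-mean-square error over (true_rating, estimated_rating) pairs."""
    squared_errors = [(true - est) ** 2 for true, est in pairs]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Hypothetical (true, estimated) pairs, for illustration only.
seen_during_fit = [(4, 3.8), (2, 2.3), (5, 4.9)]
held_out = [(4, 3.1), (2, 3.2), (5, 3.9)]

print("RMSE on seen ratings:    ", round(rmse(seen_during_fit), 4))
print("RMSE on held-out ratings:", round(rmse(held_out), 4))
```

For a less biased estimate with the library itself, `surprise.model_selection.train_test_split` or the `cross_validate` function imported earlier can hold out part of the data.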

Now, let's also try some baseline methods. Follow the code available here:

https://github.com/NicolasHug/Surprise/blob/fa7455880192383f01475162b4cbd310d91d29ca/examples/baselines_conf.py

For more elaborate testing and validation, follow the steps shown here:
https://github.com/NicolasHug/Surprise/blob/fa7455880192383f01475162b4cbd310d91d29ca/examples/grid_search_usage.py

# Assignment

In this part, you will use the dataset that is provided along with the following Kaggle competition

https://www.kaggle.com/arashnic/book-recommendation-dataset


I have uploaded the files for you at

Ratings file - https://an-utd-course.s3.us-west-1.amazonaws.com/CompDS/Ratings.csv

Books file - https://an-utd-course.s3.us-west-1.amazonaws.com/CompDS/Books.csv


Follow the steps below to create a recommendation system from this data

In [None]:
# TODO: Read both the data files into Pandas dataframes
ratings_df = pd.read_csv("https://an-utd-course.s3.us-west-1.amazonaws.com/CompDS/Ratings.csv")
books_df = pd.read_csv("https://an-utd-course.s3.us-west-1.amazonaws.com/CompDS/Books.csv")



In [None]:
ratings_df.head()

Unnamed: 0,User-ID,ISBN,Book-Rating
0,276725,034545104X,0
1,276726,0155061224,5
2,276727,0446520802,0
3,276729,052165615X,3
4,276729,0521795028,6


In [None]:
books_df.head()

Unnamed: 0,ISBN,Book-Title,Book-Author,Year-Of-Publication,Publisher,Image-URL-S,Image-URL-M,Image-URL-L
0,195153448,Classical Mythology,Mark P. O. Morford,2002,Oxford University Press,http://images.amazon.com/images/P/0195153448.0...,http://images.amazon.com/images/P/0195153448.0...,http://images.amazon.com/images/P/0195153448.0...
1,2005018,Clara Callan,Richard Bruce Wright,2001,HarperFlamingo Canada,http://images.amazon.com/images/P/0002005018.0...,http://images.amazon.com/images/P/0002005018.0...,http://images.amazon.com/images/P/0002005018.0...
2,60973129,Decision in Normandy,Carlo D'Este,1991,HarperPerennial,http://images.amazon.com/images/P/0060973129.0...,http://images.amazon.com/images/P/0060973129.0...,http://images.amazon.com/images/P/0060973129.0...
3,374157065,Flu: The Story of the Great Influenza Pandemic...,Gina Bari Kolata,1999,Farrar Straus Giroux,http://images.amazon.com/images/P/0374157065.0...,http://images.amazon.com/images/P/0374157065.0...,http://images.amazon.com/images/P/0374157065.0...
4,393045218,The Mummies of Urumchi,E. J. W. Barber,1999,W. W. Norton &amp; Company,http://images.amazon.com/images/P/0393045218.0...,http://images.amazon.com/images/P/0393045218.0...,http://images.amazon.com/images/P/0393045218.0...


In [None]:
# TODO: Answer the following questions:

# How many ratings and how many books are there in the dataset
print("Number of ratings: ", len(ratings_df))
print("Number of books: ", len(books_df))

# Find the top 10 books that have received the highest count of ratings. You should output the id of the book, its title, and the count of ratings received.



Number of ratings:  1149780
Number of books:  271360


In [None]:
# Find the top 10 books that have received the highest count of ratings. You should output the id of the book, its title, and the count of ratings received.

# Merge the two dataframes based on the common column 'ISBN'
merged_df = pd.merge(ratings_df, books_df, on='ISBN', how='inner')

# Group by book ID and title, count the ratings, and sort in descending order
top_10_books = (
    merged_df.groupby(['ISBN', 'Book-Title'])['Book-Rating']
    .count()
    .reset_index(name='RatingCount')
    .sort_values('RatingCount', ascending=False)
    .head(10)
)

top_10_books

Unnamed: 0,ISBN,Book-Title,RatingCount
215952,0971880107,Wild Animus,2502
38570,0316666343,The Lovely Bones: A Novel,1295
70798,0385504209,The Da Vinci Code,883
7344,0060928336,Divine Secrets of the Ya-Ya Sisterhood: A Novel,732
32370,0312195516,The Red Tent (Bestselling Backlist),723
87397,044023722X,A Painted House,647
21342,0142001740,The Secret Life of Bees,615
145042,067976402X,Snow Falling on Cedars,614
133142,0671027360,Angels &amp; Demons,586
93847,0446672211,Where the Heart Is (Oprah's Book Club (Paperba...,585


In [None]:
# TODO: Important - You may not be able to use the whole dataset for model creation, so you need to create a
# smaller sample to proceed further
# Here is what I did:
# reviews_short = reviews.sample(n = 1000, random_state = 42)
# you can try larger values of n, if the system allows you.

In [None]:
reviews_short = merged_df.sample(n = 1000, random_state = 42)
reviews_short.head()

Unnamed: 0,User-ID,ISBN,Book-Rating,Book-Title,Book-Author,Year-Of-Publication,Publisher,Image-URL-S,Image-URL-M,Image-URL-L
770118,208390,142000205,5,Icy Sparks,Gwyn Hyman Rubio,2001,Penguin Books,http://images.amazon.com/images/P/0142000205.0...,http://images.amazon.com/images/P/0142000205.0...,http://images.amazon.com/images/P/0142000205.0...
454727,123625,590568809,7,"The Beast from the East (Goosebumps, No 43)",R. L. Stine,1996,Scholastic,http://images.amazon.com/images/P/0590568809.0...,http://images.amazon.com/images/P/0590568809.0...,http://images.amazon.com/images/P/0590568809.0...
71725,16943,552997544,0,Cloud Music,Karen Hayes,1997,Black Swan,http://images.amazon.com/images/P/0552997544.0...,http://images.amazon.com/images/P/0552997544.0...,http://images.amazon.com/images/P/0552997544.0...
535451,144255,385490992,4,The Street Lawyer,John Grisham,1998,Doubleday Books,http://images.amazon.com/images/P/0385490992.0...,http://images.amazon.com/images/P/0385490992.0...,http://images.amazon.com/images/P/0385490992.0...
46502,11676,671776800,10,Paradise,Judith McNaught,1992,Pocket,http://images.amazon.com/images/P/0671776800.0...,http://images.amazon.com/images/P/0671776800.0...,http://images.amazon.com/images/P/0671776800.0...


In [None]:
# TODO: Use the data to create a custom dataset in the surprise library
# Steps to do this are: https://surprise.readthedocs.io/en/stable/getting_started.html#use-a-custom-dataset
# Note: the sample includes ratings of 0 (see the head() output above), which fall
# outside the (1, 10) scale declared here.
reader = Reader(rating_scale=(1, 10))
data = Dataset.load_from_df(reviews_short[['User-ID', 'ISBN', 'Book-Rating']], reader)

In [None]:
trainset = data.build_full_trainset()
# we are going to use item-item similarity
sim_options = {'name': 'pearson_baseline', 'user_based': False}
algo = KNNBaseline(sim_options=sim_options)
algo.fit(trainset)

Estimating biases using als...
Computing the pearson_baseline similarity matrix...
Done computing similarity matrix.


<surprise.prediction_algorithms.knns.KNNBaseline at 0x79397cffddb0>

In [None]:
# TODO: Choose a book at random and use the KNNBasic algorithm to find out its 10 closest neighbors. Do the results make
# sense?

book_id = top_10_books.iloc[0]['ISBN']
print(f"Finding 10 closest neighbors for book ISBN: {book_id}")
neighbors = algo.get_neighbors(trainset.to_inner_iid(book_id), k=10)

Finding 10 closest neighbors for book ISBN: 0971880107


In [None]:
isbn_to_title = books_df.set_index('ISBN')['Book-Title'].to_dict()

In [None]:
book_inner_id = trainset.to_inner_iid(book_id)
book_title = isbn_to_title[book_id]

In [None]:
# Retrieve inner ids of the nearest neighbors of the chosen book
book_neighbors = algo.get_neighbors(book_inner_id, k=10)

# Convert inner ids of the neighbors back to ISBNs and then retrieve their titles
book_neighbors = [trainset.to_raw_iid(inner_id) for inner_id in book_neighbors]
neighbor_titles = [isbn_to_title[isbn] for isbn in book_neighbors]

# Print the results
print(f"The 10 nearest neighbors of '{book_title}':")
for title in neighbor_titles:
    print(title)

The 10 nearest neighbors of 'Wild Animus':
Icy Sparks
The Beast from the East (Goosebumps, No 43)
Cloud Music
The Street Lawyer
Paradise
Mooring Against the Tide: Writing Fiction and Poetry
Incredible Journey
Running in the Family (Vintage International)
River, Cross My Heart
In Sylvan Shadows (Forgotten Realms Novel: Cleric Quintet)
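The first five neighbors listed are the same five titles as the first five rows of `reviews_short.head()`, in the same order. That hints that the similarities may all be tied, so the list could reflect row order rather than real similarity. One plausible cause, stated here as an assumption rather than a verified fact, is that a 1000-row sample contains very few books rated by more than one user. The sketch below shows one way to check that kind of overlap; the `sample` pairs are hypothetical stand-ins for the real data.

```python
from collections import Counter

# Hypothetical (user_id, isbn) pairs standing in for a small sample.
sample = [
    ("u1", "isbn_a"),
    ("u2", "isbn_b"),
    ("u3", "isbn_c"),
    ("u4", "isbn_a"),
    ("u5", "isbn_d"),
]

# Count how many ratings each book received in the sample.
ratings_per_book = Counter(isbn for _, isbn in sample)
books_with_overlap = [isbn for isbn, n in ratings_per_book.items() if n > 1]

print("distinct books:", len(ratings_per_book))
print("books rated more than once:", books_with_overlap)
```

On the real sample, the same counts could be obtained with `reviews_short['ISBN'].value_counts()`. If almost every book appears only once, item-item similarities have nothing to work with, and a larger `n` may be needed.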


In [None]:
# TODO: Use ParameterGridSearch on the following algorithms and compare their accuracies. You are free to decide
# which specific parameters to use:
# 1. KNNBaseline
# 2. ALS - Baseline
# 3. SGD - Baseline
# 4. SVD
# You should use a cv value of at least 3 and compare the mean accuracy of each of the algorithms
# Comment on whether there is significant differences in the results of the algorithms

In [None]:
from surprise.model_selection import GridSearchCV
from surprise import SVD, KNNBaseline, BaselineOnly, Dataset, Reader

# Load the data (assuming 'reviews_short' from the previous code is available)
reader = Reader(rating_scale=(1, 10))
data = Dataset.load_from_df(reviews_short[['User-ID', 'ISBN', 'Book-Rating']], reader)

# Define parameter grids for each algorithm
param_grid_knn = {
    'k': [20, 30, 40],
    'sim_options': {'name': ['msd'], 'user_based': [False]}  # Avoid cosine to prevent ZeroDivisionError
}
param_grid_als = {
    'bsl_options': {'method': ['als'], 'n_epochs': [5, 10], 'reg_u': [10, 15], 'reg_i': [5, 10]}
}
param_grid_sgd = {
    'bsl_options': {'method': ['sgd'], 'n_epochs': [5, 10]}
}
param_grid_svd = {
    'n_factors': [50, 100, 150],
    'n_epochs': [20, 30],
    'lr_all': [0.005, 0.01],
    'reg_all': [0.02, 0.1]
}

# Create GridSearchCV objects for each algorithm
gs_knn = GridSearchCV(KNNBaseline, param_grid_knn, measures=['rmse'], cv=3)
gs_als = GridSearchCV(BaselineOnly, param_grid_als, measures=['rmse'], cv=3)
gs_sgd = GridSearchCV(BaselineOnly, param_grid_sgd, measures=['rmse'], cv=3)
gs_svd = GridSearchCV(SVD, param_grid_svd, measures=['rmse'], cv=3)

# Fit the GridSearchCV objects to the data
gs_knn.fit(data)
gs_als.fit(data)
gs_sgd.fit(data)
gs_svd.fit(data)

# Print the best RMSE score and parameters for each algorithm
print('KNN Baseline Best RMSE:', gs_knn.best_score['rmse'])
print('KNN Baseline Best Params:', gs_knn.best_params['rmse'])
print('ALS Baseline Best RMSE:', gs_als.best_score['rmse'])
print('ALS Baseline Best Params:', gs_als.best_params['rmse'])
print('SGD Baseline Best RMSE:', gs_sgd.best_score['rmse'])
print('SGD Baseline Best Params:', gs_sgd.best_params['rmse'])
print('SVD Best RMSE:', gs_svd.best_score['rmse'])
print('SVD Best Params:', gs_svd.best_params['rmse'])

# Compare the results and comment on significant differences
print("\nAnalysis:")
best_scores = {
    "KNN Baseline": gs_knn.best_score['rmse'],
    "ALS Baseline": gs_als.best_score['rmse'],
    "SGD Baseline": gs_sgd.best_score['rmse'],
    "SVD": gs_svd.best_score['rmse']
}
# Use a loop variable other than `algo` so the fitted model above is not overwritten
for name, score in best_scores.items():
    print(f"{name} - RMSE: {score}")


Estimating biases using als...
Computing the msd similarity matrix...
Done computing similarity matrix.
Estimating biases using als...
Computing the msd similarity matrix...
Done computing similarity matrix.
Estimating biases using als...
Computing the msd similarity matrix...
Done computing similarity matrix.
Estimating biases using als...
Computing the msd similarity matrix...
Done computing similarity matrix.
Estimating biases using als...
Computing the msd similarity matrix...
Done computing similarity matrix.
Estimating biases using als...
Computing the msd similarity matrix...
Done computing similarity matrix.
Estimating biases using als...
Computing the msd similarity matrix...
Done computing similarity matrix.
Estimating biases using als...
Computing the msd similarity matrix...
Done computing similarity matrix.
Estimating biases using als...
Computing the msd similarity matrix...
Done computing similarity matrix.
Estimating biases using als...
Estimating biases using als...
Es

* **SVD** achieved the lowest RMSE (3.7833), the best predictive accuracy among the models tested. This suggests that SVD may be better suited to capturing latent factors in this dataset.
* **ALS Baseline** followed closely with an RMSE of 3.8021, slightly higher than SVD but still competitive.
* **KNN Baseline** and **SGD Baseline** had similar RMSE values (3.8410 and 3.8388, respectively), both slightly higher than SVD and ALS Baseline.

**SVD** appears to be the most effective model for this dataset, although the gap between the best and worst RMSE is only about 0.06 on a 10-point scale, so the differences between the algorithms are small.
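To judge whether gaps this small are meaningful, it can help to compare the difference in mean RMSE with the spread across folds. The sketch below does that for two models using made-up per-fold values. The names `model_a` and `model_b` and all numbers are hypothetical and are not the results reported above.

```python
from statistics import mean, stdev

# Hypothetical per-fold RMSE values for two models (3-fold CV).
fold_rmse = {
    "model_a": [3.75, 3.80, 3.81],
    "model_b": [3.79, 3.86, 3.87],
}

# Mean and sample standard deviation of RMSE across folds, per model.
summary = {name: (mean(v), stdev(v)) for name, v in fold_rmse.items()}
for name, (m, s) in summary.items():
    print(f"{name}: mean RMSE = {m:.4f}, std across folds = {s:.4f}")

gap = abs(summary["model_a"][0] - summary["model_b"][0])
largest_std = max(s for _, s in summary.values())
print("gap between means:", round(gap, 4))
print("gap exceeds fold-to-fold std:", gap > largest_std)
```

With Surprise, the per-fold scores behind each `best_score` should be available through the `cv_results` attribute of a fitted `GridSearchCV` object. With only three folds this comparison is a rough guide, not a formal significance test.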