
# Recommendation Systems
We will use the Surprise library for Python. Details are available at: http://surpriselib.com

We will first work through an example using a built-in dataset and then use a custom one.

First, ensure that you have the library installed and then load the required packages.

In [2]:
!pip install scikit-surprise



In [3]:
import io

import numpy as np
import pandas as pd
from surprise import NormalPredictor
from surprise import Dataset
from surprise import Reader
from surprise.model_selection import cross_validate
from surprise import KNNBaseline
from surprise import get_dataset_dir
from surprise import accuracy
from surprise.model_selection import KFold

A recommendation system needs a ratings file with at least three fields: userId, itemId, and rating. Any other information is optional, but it can help when a person analyzes the results.
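
As a minimal sketch of why those three fields are enough, the snippet below turns a handful of (user, item, rating) triples into a sparse utility matrix stored as nested dictionaries. The ids, the ratings, and the `build_utility_matrix` helper are invented for illustration and do not come from any dataset or library used here.

```python
from collections import defaultdict

# Hypothetical (userId, itemId, rating) triples, invented for illustration.
triples = [
    ("u1", "i1", 5.0),
    ("u1", "i2", 3.0),
    ("u2", "i1", 4.0),
    ("u3", "i2", 1.0),
]


def build_utility_matrix(rows):
    """Return {user: {item: rating}}; unrated pairs are simply absent."""
    matrix = defaultdict(dict)
    for user, item, rating in rows:
        matrix[user][item] = rating
    return dict(matrix)


utility = build_utility_matrix(triples)
print(utility["u1"])          # ratings given by user u1
print("i2" in utility["u2"])  # whether u2 rated item i2
```

Storing only the observed pairs keeps the structure small, which matters because most users rate only a tiny fraction of the items.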

Let's load the built-in ml-100k dataset, which contains movies and ratings.

In [4]:
# Load the movielens-100k dataset (download it if needed).
data = Dataset.load_builtin('ml-100k')

Dataset ml-100k could not be found. Do you want to download it? [Y/n] Y
Trying to download dataset from https://files.grouplens.org/datasets/movielens/ml-100k.zip...
Done! Dataset ml-100k has been saved to /root/.surprise_data/ml-100k


In [5]:
# Let's see what files come with the dataset
!ls /root/.surprise_data/ml-100k/ml-100k/

allbut.pl  u1.base  u2.test  u4.base  u5.test  ub.base	u.genre  u.occupation
mku.sh	   u1.test  u3.base  u4.test  ua.base  ub.test	u.info	 u.user
README	   u2.base  u3.test  u5.base  ua.test  u.data	u.item


In [6]:
# TODO: Show the first 10 lines of the u.data and u.item files
!head -10 /root/.surprise_data/ml-100k/ml-100k/u.data

196	242	3	881250949
186	302	3	891717742
22	377	1	878887116
244	51	2	880606923
166	346	1	886397596
298	474	4	884182806
115	265	2	881171488
253	465	5	891628467
305	451	3	886324817
6	86	3	883603013


## Algorithms
Let's look at some of the algorithms available with the package

In [7]:
?KNNBaseline

Nearest-neighbor methods work by searching the utility matrix for similar users or items. Let's start with an item-based nearest-neighbor model.

In [8]:
data = Dataset.load_builtin('ml-100k')
trainset = data.build_full_trainset()
# we are going to use item-item similarity
sim_options = {'name': 'pearson_baseline', 'user_based': False}
algo = KNNBaseline(sim_options=sim_options)
algo.fit(trainset)

Estimating biases using als...
Computing the pearson_baseline similarity matrix...
Done computing similarity matrix.


<surprise.prediction_algorithms.knns.KNNBaseline at 0x7a5f40c5fbe0>

In [9]:
!head -10 /root/.surprise_data/ml-100k/ml-100k/u.item

1|Toy Story (1995)|01-Jan-1995||http://us.imdb.com/M/title-exact?Toy%20Story%20(1995)|0|0|0|1|1|1|0|0|0|0|0|0|0|0|0|0|0|0|0
2|GoldenEye (1995)|01-Jan-1995||http://us.imdb.com/M/title-exact?GoldenEye%20(1995)|0|1|1|0|0|0|0|0|0|0|0|0|0|0|0|0|1|0|0
3|Four Rooms (1995)|01-Jan-1995||http://us.imdb.com/M/title-exact?Four%20Rooms%20(1995)|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|1|0|0
4|Get Shorty (1995)|01-Jan-1995||http://us.imdb.com/M/title-exact?Get%20Shorty%20(1995)|0|1|0|0|0|1|0|0|1|0|0|0|0|0|0|0|0|0|0
5|Copycat (1995)|01-Jan-1995||http://us.imdb.com/M/title-exact?Copycat%20(1995)|0|0|0|0|0|0|1|0|1|0|0|0|0|0|0|0|1|0|0
6|Shanghai Triad (Yao a yao yao dao waipo qiao) (1995)|01-Jan-1995||http://us.imdb.com/Title?Yao+a+yao+yao+dao+waipo+qiao+(1995)|0|0|0|0|0|0|0|0|1|0|0|0|0|0|0|0|0|0|0
7|Twelve Monkeys (1995)|01-Jan-1995||http://us.imdb.com/M/title-exact?Twelve%20Monkeys%20(1995)|0|0|0|0|0|0|0|0|1|0|0|0|0|0|0|1|0|0|0
8|Babe (1995)|01-Jan-1995||http://us.imdb.com/M/title-exact?Babe%20(1995)|0|0|0|0|1

# Id to Name Lookup
Let's write a small function that converts ids to names and names to ids.

In [10]:
def read_item_names():
    """Read the u.item file from MovieLens 100-k dataset and return two
    mappings to convert raw ids into movie names and movie names into raw ids.
    """

    file_name = get_dataset_dir() + '/ml-100k/ml-100k/u.item'
    rid_to_name = {}
    name_to_rid = {}
    with io.open(file_name, 'r', encoding='ISO-8859-1') as f:
        for line in f:
            line = line.split('|')
            rid_to_name[line[0]] = line[1]
            name_to_rid[line[1]] = line[0]

    return rid_to_name, name_to_rid

In [11]:
# test this function
rid_to_name, name_to_rid = read_item_names()

In [12]:
rid_to_name["1"]

'Toy Story (1995)'

In [13]:
name_to_rid["Twelve Monkeys (1995)"]

'7'
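
The raw ids in these dictionaries are the strings found in the data files. A fitted trainset also works with integer inner ids, which is why the next cell calls `to_inner_iid` and `to_raw_iid`. The sketch below imitates that bookkeeping with a hypothetical `IdMap` class. It assumes inner ids are handed out in order of first appearance and is only an illustration, not Surprise's own code.

```python
class IdMap:
    """Hypothetical two-way mapping between raw ids and dense integer ids."""

    def __init__(self):
        self._raw_to_inner = {}
        self._inner_to_raw = []

    def to_inner(self, raw_id):
        # Assign the next free integer the first time a raw id is seen.
        if raw_id not in self._raw_to_inner:
            self._raw_to_inner[raw_id] = len(self._inner_to_raw)
            self._inner_to_raw.append(raw_id)
        return self._raw_to_inner[raw_id]

    def to_raw(self, inner_id):
        # Inner ids index directly into the list of raw ids.
        return self._inner_to_raw[inner_id]


items = IdMap()
inner_ids = [items.to_inner(raw) for raw in ["242", "302", "242", "377"]]
print(inner_ids)        # repeated raw ids map to the same inner id
print(items.to_raw(2))  # convert an inner id back to its raw id
```

Dense integer ids let an algorithm index rows of a similarity matrix directly, while raw ids stay readable for lookups such as `rid_to_name`.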

In [14]:
# Find the 10 movies most similar to the movie with raw id 200

movie_inner_id = algo.trainset.to_inner_iid("200")
movie_name = rid_to_name["200"]

# Retrieve inner ids of the nearest neighbors of the selected movie.
movie_neighbors = algo.get_neighbors(movie_inner_id, k=10)

# Convert inner ids of the neighbors into names.
movie_neighbors = (algo.trainset.to_raw_iid(inner_id)
                       for inner_id in movie_neighbors)
movie_neighbors = (rid_to_name[rid]
                       for rid in movie_neighbors)

print()

print('The 10 nearest neighbors of ' + movie_name)
for movie in movie_neighbors:
    print(movie)


The 10 nearest neighbors of Shining, The (1980)
Bonnie and Clyde (1967)
Godfather: Part II, The (1974)
Alien (1979)
Godfather, The (1972)
Raging Bull (1980)
Pulp Fiction (1994)
One Flew Over the Cuckoo's Nest (1975)
Carrie (1976)
Koyaanisqatsi (1983)
His Girl Friday (1940)
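
The neighbor list above comes from an item-item similarity matrix. As a rough sketch of the idea, the snippet below ranks items by plain cosine similarity over the users who rated both items. This is a simpler measure than the `pearson_baseline` option used in the notebook, and the ratings and helper names are invented for illustration.

```python
import math

# Toy item-centric utility matrix {item: {user: rating}} with invented values.
item_ratings = {
    "A": {"u1": 5, "u2": 4, "u3": 1},
    "B": {"u1": 4, "u2": 5},
    "C": {"u2": 1, "u3": 5},
}


def cosine_sim(a, b):
    """Cosine similarity over users who rated both items; 0.0 if none did."""
    common = set(a) & set(b)
    if not common:
        return 0.0
    dot = sum(a[u] * b[u] for u in common)
    norm_a = math.sqrt(sum(a[u] ** 2 for u in common))
    norm_b = math.sqrt(sum(b[u] ** 2 for u in common))
    return dot / (norm_a * norm_b)


def nearest_items(target, k=2):
    """Return the k items most similar to target as (item, similarity) pairs."""
    scores = [
        (other, cosine_sim(item_ratings[target], item_ratings[other]))
        for other in item_ratings
        if other != target
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:k]


print(nearest_items("A"))
```

Item B ranks ahead of item C here because the two users who rated both A and B gave them similar scores.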


Let's now apply the algorithm and figure out its accuracy.

In [15]:
testset = trainset.build_testset()
predictions = algo.test(testset)
# RMSE is optimistically low here because the test set is built from the training data
accuracy.rmse(predictions, verbose=True)  # ~0.48 in this run

RMSE: 0.4807


0.48071109787164656
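
For reference, RMSE is the square root of the mean squared difference between true and predicted ratings. The sketch below computes it by hand on a few invented numbers. The `rmse` helper is written for illustration and is separate from `accuracy.rmse`.

```python
import math


def rmse(actual, predicted):
    """Root mean squared error between two equal-length sequences."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("inputs must be non-empty and the same length")
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Invented ratings and predictions, for illustration only.
true_ratings = [4.0, 3.0, 5.0, 1.0]
predicted_ratings = [3.5, 3.0, 4.0, 2.0]
print(rmse(true_ratings, predicted_ratings))  # 0.75
```

Because errors are squared before averaging, a few large misses raise RMSE more than many small ones.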

Now, let's also try some baseline methods. Follow the code available here:

https://github.com/NicolasHug/Surprise/blob/fa7455880192383f01475162b4cbd310d91d29ca/examples/baselines_conf.py

For more elaborate testing and validation, follow the steps in this example:
https://github.com/NicolasHug/Surprise/blob/fa7455880192383f01475162b4cbd310d91d29ca/examples/grid_search_usage.py

# Assignment

In this part, you will use the dataset provided on the following Kaggle page:

https://www.kaggle.com/arashnic/book-recommendation-dataset


I have uploaded the files for you at

Ratings file - https://an-utd-course.s3.us-west-1.amazonaws.com/CompDS/Ratings.csv

Books file - https://an-utd-course.s3.us-west-1.amazonaws.com/CompDS/Books.csv


Follow the steps below to create a recommendation system from this data

In [16]:
import pandas as pd
import numpy as np
from surprise import Dataset, Reader, KNNBasic, KNNBaseline, SVD, BaselineOnly
from surprise.model_selection import GridSearchCV
from surprise.model_selection import cross_validate
import warnings
warnings.filterwarnings('ignore')

In [17]:
# TODO: Read both the data files into Pandas dataframes
ratings_df = pd.read_csv('https://an-utd-course.s3.us-west-1.amazonaws.com/CompDS/Ratings.csv')
books_df = pd.read_csv('https://an-utd-course.s3.us-west-1.amazonaws.com/CompDS/Books.csv')


In [18]:
# TODO: Answer the following questions:
# How many ratings and how many books are there in the dataset
print(f"Number of ratings: {len(ratings_df)}")
print(f"Number of unique books: {len(books_df)}")

# Find the top 10 books that have received the highest count of ratings. You should output the id of the book, its title, and the count of ratings received.
rating_counts = ratings_df['ISBN'].value_counts().reset_index()
rating_counts.columns = ['ISBN', 'Rating_Count']
top_10_books = pd.merge(
    rating_counts,
    books_df[['ISBN', 'Book-Title']],
    on='ISBN'
)[['ISBN', 'Book-Title', 'Rating_Count']].head(10)

# Rename the columns for display
top_10_books.columns = ['ISBN', 'Book Title', 'Rating Count']

print("Top 10 Most Rated Books:")
print(top_10_books.to_string(index=False))


Number of ratings: 1149780
Number of unique books: 271360
Top 10 Most Rated Books:
      ISBN                                         Book Title  Rating Count
0971880107                                        Wild Animus          2502
0316666343                          The Lovely Bones: A Novel          1295
0385504209                                  The Da Vinci Code           883
0060928336    Divine Secrets of the Ya-Ya Sisterhood: A Novel           732
0312195516                The Red Tent (Bestselling Backlist)           723
044023722X                                    A Painted House           647
0142001740                            The Secret Life of Bees           615
067976402X                             Snow Falling on Cedars           614
0671027360                                Angels &amp; Demons           586
0446672211 Where the Heart Is (Oprah's Book Club (Paperback))           585


In [19]:
# TODO: Important - You may not be able to use the whole dataset for model creation, so you need to create a
# smaller sample to proceed further
# Here is what I did:
# reviews_short = reviews.sample(n = 1000, random_state = 42)
# you can try larger values of n, if the system allows you.
# Keep only ratings whose ISBN also appears in the books file
valid_isbns = set(ratings_df['ISBN']).intersection(set(books_df['ISBN']))
ratings_df = ratings_df[ratings_df['ISBN'].isin(valid_isbns)]

np.random.seed(42)
sample_size = 1000
ratings_sample = ratings_df.sample(n=sample_size, random_state=42)


In [20]:
# TODO: Use the data to create a custom dataset in the surprise library
# Steps to do this are: https://surprise.readthedocs.io/en/stable/getting_started.html#use-a-custom-dataset
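# Note: the scale below assumes explicit ratings from 1 to 10. If the sample also
# contains 0 ratings, they fall outside this scale and can distort the error metrics.
reader = Reader(rating_scale=(1, 10))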
data = Dataset.load_from_df(
    ratings_sample[['User-ID', 'ISBN', 'Book-Rating']],
    reader
)
trainset = data.build_full_trainset()

In [21]:
# TODO: Choose a book at random and use the KNNBasic algorithm to find out its 10 closest neighbors. Do the results make
# sense?
knn = KNNBasic(sim_options={'user_based': False})
knn.fit(trainset)

random_isbn = ratings_sample[ratings_sample['ISBN'].isin(books_df['ISBN'])]['ISBN'].sample(1).iloc[0]
book_inner_id = trainset.to_inner_iid(random_isbn)
selected_book_title = books_df[books_df['ISBN'] == random_isbn]['Book-Title'].iloc[0]
print(f"\nSelected Book: {selected_book_title}")

Computing the msd similarity matrix...
Done computing similarity matrix.

Selected Book: Work, Sex and Rugby
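
KNNBasic reports computing the `msd` similarity, which is based on the mean squared difference between the ratings two items received from the same users. The sketch below shows that idea on invented data. It assumes the form `1 / (msd + 1)` and a similarity of 0 when no user rated both items. With a sample of only 1000 ratings, many pairs of books likely fall into that second case, which is worth keeping in mind when judging whether the neighbors make sense.

```python
def msd_similarity(ratings_a, ratings_b):
    """Similarity from the mean squared difference over common users.

    ratings_a and ratings_b map user id -> rating for two items.
    Assumes sim = 1 / (msd + 1), and 0.0 when there is no overlap.
    """
    common = set(ratings_a) & set(ratings_b)
    if not common:
        return 0.0
    msd = sum((ratings_a[u] - ratings_b[u]) ** 2 for u in common) / len(common)
    return 1.0 / (msd + 1.0)


# Invented ratings on a 1-10 scale.
book_x = {"u1": 8, "u2": 6}
book_y = {"u1": 7, "u2": 6, "u3": 9}
book_z = {"u4": 10}

print(msd_similarity(book_x, book_y))  # the books share users u1 and u2
print(msd_similarity(book_x, book_z))  # no common users
```

When most similarities are 0, the "nearest" neighbors are close to arbitrary, so a larger sample or a filter on frequently rated books may give more meaningful results.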


In [22]:
# TODO: Use ParameterGridSearch on the following algorithms and compare their accuracies. You are free to decide
# which specific parameters to use:
# 1. KNNBaseline
# 2. ALS - Baseline
# 3. SGD - Baseline
# 4. SVD
# You should use a cv value of at least 3 and compare the mean accuracy of each of the algorithms
# Comment on whether there are significant differences in the results of the algorithms

# Defining parameter grids
param_grid_knn = {
    'k': [3, 5],
    'min_k': [1],
    'sim_options': {
        'name': ['cosine'],
        'user_based': [False]
    }
}

param_grid_baseline_als = {
    'bsl_options': {
        'method': ['als'],
        'n_epochs': [5],
        'reg_u': [12],
        'reg_i': [5]
    }
}

param_grid_baseline_sgd = {
    'bsl_options': {
        'method': ['sgd'],
        'learning_rate': [0.005],
        'n_epochs': [10]
    }
}

param_grid_svd = {
    'n_factors': [50],
    'n_epochs': [20],
    'lr_all': [0.005],
    'reg_all': [0.02]
}
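
Each grid above lists candidate values per parameter, and a grid search fits one model per combination. The helper below is a hypothetical `expand_grid` function, written only to show how many combinations such a dictionary describes, including nested option dictionaries. It is a sketch of the idea rather than the library's own expansion logic.

```python
from itertools import product


def expand_grid(grid):
    """Yield one parameter dict per combination in a (possibly nested) grid."""
    keys = list(grid)
    choices = []
    for key in keys:
        value = grid[key]
        if isinstance(value, dict):
            # Nested options (for example sim_options) are expanded recursively.
            choices.append(list(expand_grid(value)))
        else:
            choices.append(list(value))
    for combo in product(*choices):
        yield dict(zip(keys, combo))


# Same shape as the KNN grid defined in the cell above.
example_grid = {
    "k": [3, 5],
    "min_k": [1],
    "sim_options": {"name": ["cosine"], "user_based": [False]},
}

combos = list(expand_grid(example_grid))
for params in combos:
    print(params)
```

Only `k` has more than one candidate here, so this grid describes two models. Adding values to other parameters multiplies the number of fits.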

In [23]:
def evaluate_algorithm(algo, data, algo_name):
    """Evaluate a single algorithm using cross-validation"""
    try:
        # Perform cross validation
        cv_results = cross_validate(algo, data, measures=['RMSE', 'MAE'],
                                  cv=3, verbose=False)

        results = {
            'Algorithm': algo_name,
            'Mean RMSE': np.mean(cv_results['test_rmse']),
            'Mean MAE': np.mean(cv_results['test_mae']),
            'Std RMSE': np.std(cv_results['test_rmse']),
            'Std MAE': np.std(cv_results['test_mae'])
        }
        return results
    except Exception as e:
        print(f"Error evaluating {algo_name}: {str(e)}")
        return None
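
`cross_validate` is called with `cv=3`, so each reported score is an average over three train/test splits. The sketch below shows one plain way to form such folds over a list of rating indices. The `k_fold_indices` helper is hypothetical and its shuffling details are an assumption, not a description of how the library splits data.

```python
import random


def k_fold_indices(n_items, n_splits=3, seed=0):
    """Yield (train_idx, test_idx) lists so each index is tested exactly once."""
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)
    for fold in range(n_splits):
        # Every n_splits-th shuffled index, starting at `fold`, is held out.
        test_idx = indices[fold::n_splits]
        held_out = set(test_idx)
        train_idx = [i for i in indices if i not in held_out]
        yield train_idx, test_idx


folds = list(k_fold_indices(7, n_splits=3))
for train_idx, test_idx in folds:
    print(len(train_idx), len(test_idx))
```

Averaging the metric over the folds, as `evaluate_algorithm` does, gives a steadier estimate than a single split, and the standard deviation shows how much the score moves between folds.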

In [24]:
algorithms = [
    (KNNBaseline(k=3, min_k=1, sim_options={'name': 'cosine', 'user_based': False}), 'KNNBaseline'),
    (BaselineOnly(bsl_options={'method': 'als', 'n_epochs': 5}), 'BaselineOnly (ALS)'),
    (BaselineOnly(bsl_options={'method': 'sgd', 'learning_rate': 0.005, 'n_epochs': 10}), 'BaselineOnly (SGD)'),
    (SVD(n_factors=50, n_epochs=20, lr_all=0.005, reg_all=0.02), 'SVD')
]
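
The two BaselineOnly entries predict a rating as a global mean plus a user bias and an item bias, and they differ only in how those biases are estimated. As a rough sketch of the SGD variant, the function below fits the biases on invented ratings. `fit_baseline_sgd`, its default values, and the data are illustrative assumptions rather than the library's implementation.

```python
from collections import defaultdict


def fit_baseline_sgd(ratings, n_epochs=10, learning_rate=0.005, reg=0.02):
    """Estimate the global mean, user biases and item biases with plain SGD."""
    mu = sum(r for _, _, r in ratings) / len(ratings)
    user_bias = defaultdict(float)
    item_bias = defaultdict(float)
    for _ in range(n_epochs):
        for user, item, rating in ratings:
            err = rating - (mu + user_bias[user] + item_bias[item])
            # Move each bias toward reducing the error, with L2 shrinkage.
            user_bias[user] += learning_rate * (err - reg * user_bias[user])
            item_bias[item] += learning_rate * (err - reg * item_bias[item])
    return mu, user_bias, item_bias


def predict(mu, user_bias, item_bias, user, item):
    # Unknown users or items fall back to a bias of 0.
    return mu + user_bias.get(user, 0.0) + item_bias.get(item, 0.0)


# Invented ratings: user "a" tends to rate high, user "b" tends to rate low.
toy = [("a", "x", 9), ("a", "y", 8), ("b", "x", 3), ("b", "y", 2)]
mu, bu, bi = fit_baseline_sgd(toy, n_epochs=200, learning_rate=0.05)
print(mu)
print(predict(mu, bu, bi, "a", "x"), predict(mu, bu, bi, "b", "x"))
```

A model this simple ignores interactions between specific users and items, which is why it serves as a reference point for the neighborhood and SVD models.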

In [27]:
results = []
for algo, name in algorithms:
    result = evaluate_algorithm(algo, data, name)
    if result is not None:
        results.append(result)
if results:
    results_df = pd.DataFrame(results)
    print("\nAlgorithm Comparison Results:")
    print(results_df.to_string(index=False))

    # Find best performing algorithm
    best_algo = results_df.loc[results_df['Mean RMSE'].idxmin()]
    print(f"\nBest performing algorithm: {best_algo['Algorithm']}")
    print(f"Best RMSE: {best_algo['Mean RMSE']:.4f} (±{best_algo['Std RMSE']:.4f})")
    print(f"Best MAE: {best_algo['Mean MAE']:.4f} (±{best_algo['Std MAE']:.4f})")

    # Calculate relative differences
    baseline_rmse = results_df['Mean RMSE'].min()
    relative_diff = (results_df['Mean RMSE'] - baseline_rmse) / baseline_rmse * 100

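    # best_algo is a pandas Series, so use its 'Algorithm' field as the name
    best_name = best_algo['Algorithm']
    print(f"\nRelative Performance Differences (compared to {best_name}):")
    for idx, row in results_df.iterrows():
        diff = relative_diff[idx]
        print(f"{row['Algorithm']}: {diff:.2f}% worse than {best_name}")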


Estimating biases using als...
Computing the cosine similarity matrix...
Error evaluating KNNBaseline: float division
Estimating biases using als...
Estimating biases using als...
Estimating biases using als...
Estimating biases using sgd...
Estimating biases using sgd...
Estimating biases using sgd...

Algorithm Comparison Results:
         Algorithm  Mean RMSE  Mean MAE  Std RMSE  Std MAE
BaselineOnly (ALS)   3.816843  3.539570  0.056160 0.027986
BaselineOnly (SGD)   3.840093  3.559164  0.077348 0.035561
               SVD   3.812149  3.521659  0.105167 0.051359

Best performing algorithm: SVD
Best RMSE: 3.8121 (±0.1052)
Best MAE: 3.5217 (±0.0514)

Relative Performance Differences (compared to Algorithm         SVD
Mean RMSE    3.812149
Mean MAE     3.521659
Std RMSE     0.105167
Std MAE      0.051359
Name: 2, dtype: object):
BaselineOnly (ALS): 0.12% worse than Algorithm         SVD
Mean RMSE    3.812149
Mean MAE     3.521659
Std RMSE     0.105167
Std MAE      0.051359
Name: 2, dtyp