For this assignment, we will practice collaborative filtering and the different model based recommendation methods. We will be giving book recommendation this time. The data set can be found [here](https://github.com/zygmuntz/goodbooks-10k).

The 4 methods we will use are as follows:
- User-Based Collaborative Filtering
- Item-Based Collaborative Filtering
- Matrix Factorization
- SVD++

In [1]:
import numpy as np
import pandas as pd

In [2]:
# Load the datasets
books = pd.read_csv('books.csv') # Book metadata
ratings = pd.read_csv('ratings.csv') # User ratings

In [3]:
# Show you what the data looks like
books.head()

Unnamed: 0,book_id,goodreads_book_id,best_book_id,work_id,books_count,isbn,isbn13,authors,original_publication_year,original_title,...,ratings_count,work_ratings_count,work_text_reviews_count,ratings_1,ratings_2,ratings_3,ratings_4,ratings_5,image_url,small_image_url
0,1,2767052,2767052,2792775,272,439023483,9780439000000.0,Suzanne Collins,2008.0,The Hunger Games,...,4780653,4942365,155254,66715,127936,560092,1481305,2706317,https://images.gr-assets.com/books/1447303603m...,https://images.gr-assets.com/books/1447303603s...
1,2,3,3,4640799,491,439554934,9780440000000.0,"J.K. Rowling, Mary GrandPré",1997.0,Harry Potter and the Philosopher's Stone,...,4602479,4800065,75867,75504,101676,455024,1156318,3011543,https://images.gr-assets.com/books/1474154022m...,https://images.gr-assets.com/books/1474154022s...
2,3,41865,41865,3212258,226,316015849,9780316000000.0,Stephenie Meyer,2005.0,Twilight,...,3866839,3916824,95009,456191,436802,793319,875073,1355439,https://images.gr-assets.com/books/1361039443m...,https://images.gr-assets.com/books/1361039443s...
3,4,2657,2657,3275794,487,61120081,9780061000000.0,Harper Lee,1960.0,To Kill a Mockingbird,...,3198671,3340896,72586,60427,117415,446835,1001952,1714267,https://images.gr-assets.com/books/1361975680m...,https://images.gr-assets.com/books/1361975680s...
4,5,4671,4671,245494,1356,743273567,9780743000000.0,F. Scott Fitzgerald,1925.0,The Great Gatsby,...,2683664,2773745,51992,86236,197621,606158,936012,947718,https://images.gr-assets.com/books/1490528560m...,https://images.gr-assets.com/books/1490528560s...


In [4]:
ratings.head()

Unnamed: 0,user_id,book_id,rating
0,1,258,5
1,2,4081,4
2,2,260,5
3,2,9296,5
4,2,2318,3


There should be a total of 53424 unique users and 10000 books in this dataset. You can verify that if you wish.

## Preprocessing

The first step is to perform some preprocessing of the data. In particular, we will format the ratings data into the nice matrix we have seen in class. I will first merge the two files, so I will eliminate any ratings that does have book metadata information (if any).

In [5]:
# Merge the two datasets
merged_data = pd.merge(books, ratings, on='book_id')[['user_id', 'book_id', 'rating', 'original_title']]

In [6]:
# Let's see what the merged data looks like
merged_data.head()

Unnamed: 0,user_id,book_id,rating,original_title
0,2886,1,5,The Hunger Games
1,6158,1,5,The Hunger Games
2,3991,1,4,The Hunger Games
3,5281,1,5,The Hunger Games
4,5721,1,5,The Hunger Games


It turns out that if we work with this data, you might run into memory issue. Hence I am going to keep only the user with ID less than or equal to 10000.

In [7]:
merged_data = merged_data[merged_data.user_id <= 10000]

#### Your tasks starts here. First create the rating matrix. Replace any missing values with 0 afterwards.

In [8]:
rating_matrix = merged_data.pivot_table(index = 'user_id', columns = 'book_id', values='rating')
rating_matrix.fillna(0, inplace = True)

## User-Based Collaborative Filtering
The first model to use will be the user-based collaborative filtering.

1. While this is not the best practice, I want you to use Euclidean distance to measure the similarity between users. Think carefully when you use this measure during the implementation.
2. Use 100 neighbors when calculating the predicted scores.
3. Give me the top 15 recommendations for user with user_id 1839. Give me the book titles and predicted ratings.
4. Also store the recommendations in a variable. We will compare this result with other models later.

In [9]:
# Use as many boxes as you need

from sklearn.metrics.pairwise import euclidean_distances

# Calculate Euclidean distances between users
user_distances = euclidean_distances(rating_matrix)
user_similarity = pd.DataFrame(1 / (1 + user_distances), index=rating_matrix.index, columns=rating_matrix.index)

In [10]:
def user_based_recommendations(user_id, num_neighbors, top_n, similarity_matrix):

    # Get the nearest neighbors
    nearest_neighbors = similarity_matrix[user_id].sort_values(ascending=False)[1:(num_neighbors+1)]

    # Obtain predicted ratings for unread books
    unread_books_index = rating_matrix.columns[rating_matrix.loc[user_id] == 0]
    predicted_ratings = []
    for book_id in unread_books_index:
        neighbors_ratings = rating_matrix.loc[nearest_neighbors.index, book_id]
        predicted_ratings.append(sum(nearest_neighbors * neighbors_ratings) / sum(nearest_neighbors))

    # Sort the predictions
    predicted_ratings = pd.Series(predicted_ratings, index=unread_books_index).sort_values(ascending=False)

    # Extract only the top n books
    top_n_recommendations = predicted_ratings.head(top_n)

    # Create DataFrame for recommendations
    recommendations_df = pd.DataFrame({
        'Book Title': [books.loc[books['book_id'] == book_id, 'original_title'].values[0] for book_id in top_n_recommendations.index],
        'Predicted Rating': top_n_recommendations.values
    }, index=np.arange(1, top_n+1))

    return recommendations_df

print("Top 15 User-Based Book Recommendations for User_id: 1839")
ub_recommendations_df = user_based_recommendations(1839, 100, 15, user_similarity)
print(ub_recommendations_df)

Top 15 User-Based Book Recommendations for User_id: 1839
                                   Book Title  Predicted Rating
1                           The Da Vinci Code          1.519016
2                                O Alquimista          1.383978
3    Harry Potter and the Philosopher's Stone          1.213373
4    Harry Potter and the Prisoner of Azkaban          1.184701
5   Harry Potter and the Order of the Phoenix          1.162617
6                            The Kite Runner           1.095512
7         Harry Potter and the Goblet of Fire          1.054455
8                             Le Petit Prince          1.014054
9      Harry Potter and the Half-Blood Prince          1.005341
10    Harry Potter and the Chamber of Secrets          0.985395
11                           Angels & Demons           0.898395
12       Harry Potter and the Deathly Hallows          0.875282
13                       Nineteen Eighty-Four          0.861661
14                 Animal Farm: A Fairy Story  

## Item-Based Collaborative Filtering
Next we will use item-based collaborative filtering.

This time I want you to use cosine similarity to measure the similarity between items.
Use 100 neighbors when calculating the predicted scores.
Give me the top 15 recommendations for user with user_id 1839. Give me the book titles and predicted ratings.
Also store the recommendations in a variable.

In [11]:
# Use as many boxes as you need

from sklearn.metrics.pairwise import cosine_similarity

# Calculate item similarity using cosine similarity
item_similarity_matrix = cosine_similarity(rating_matrix.T)
item_similarity_matrix = pd.DataFrame(item_similarity_matrix, index=rating_matrix.columns, columns=rating_matrix.columns)

In [12]:
def item_based_recommendations(user_id, num_neighbors, top_n, similarity_matrix):
    
    # Obtain unseen book indices
    unseen_books_index = rating_matrix.columns[rating_matrix.loc[user_id] == 0]
    predicted_ratings = []

    # Calculate predicted rating for each new book
    for book_id in unseen_books_index:
        nearest_neighbors = similarity_matrix[book_id].sort_values(ascending=False)[1:(num_neighbors+1)]
        neighbors_ratings = rating_matrix.loc[user_id, nearest_neighbors.index]
        predicted_rating = sum(nearest_neighbors * neighbors_ratings) / sum(nearest_neighbors)
        predicted_ratings.append(predicted_rating)

    # Sort the predictions
    predicted_ratings = pd.Series(predicted_ratings, index=unseen_books_index).sort_values(ascending=False)

    # Extract only the top n books
    top_n_recommendations = predicted_ratings.head(top_n)

    # Filter the 'books' DataFrame to retrieve book titles for recommended books
    recommended_books = books[books['book_id'].isin(top_n_recommendations.index)]

    # Create DataFrame for recommendations
    recommendations_df = pd.DataFrame({
        'Book Title': recommended_books['original_title'].values,
        'Predicted Rating': top_n_recommendations.values
    }, index=np.arange(1, top_n+1))

    return recommendations_df

print("Top 15 Item-Based Book Recommendations for User_id: 1839")
ib_recommendations_df = item_based_recommendations(1839, 100, 15, item_similarity_matrix)
print(ib_recommendations_df)

Top 15 Item-Based Book Recommendations for User_id: 1839
                                        Book Title  Predicted Rating
1   World War Z: An Oral History of the Zombie War          1.137208
2                                The Scorch Trials          1.106118
3                               The Graveyard Book          1.096361
4                  Abraham Lincoln: Vampire Hunter          1.081804
5                                      Chosen Prey          1.068413
6                                        Bad Blood          1.067748
7                                   Heat Lightning          1.012271
8                                      Secret Prey          0.922163
9                                              NaN          0.818423
10                                             NaN          0.786460
11                                      Shock Wave          0.786274
12                                       Mind Prey          0.740866
13                                     Sudden 

## Matrix Factorization
Now we will turn to model based methods. First we will look at Matrix Factorization. You can use the code I presented in class.

1. Use 3 latent factors.
2. Set the learning rate at 0.001 and beta at 0.01. Since it will take a while to run, 5 iterations will be fine.
3. Fit the model (it will take a while to run).
4. Give me the top 15 recommendations for user with user_id 1839. Return boths book names and predicted ratings.
5. Store the recommendations in a variable.

In [15]:
def matrix_factorization(R, P, Q, K, steps=5000, alpha=0.0002, beta=0.02):
    for step in range(steps):
        for i in range(R.shape[0]):
            for j in range(R.shape[1]):
                if R[i][j] > 0: # Skipping over missing ratings
                    eij = R[i][j] - np.dot(P[i,:],Q[:,j])
                    for k in range(K):
                        P[i][k] = P[i][k] + alpha * (2 * eij * Q[k][j] - beta * P[i][k])
                        Q[k][j] = Q[k][j] + alpha * (2 * eij * P[i][k] - beta * Q[k][j])
        eR = np.dot(P,Q)
        e = 0
        for i in range(R.shape[0]):
            for j in range(R.shape[1]):
                if R[i][j] > 0:
                    e = e + pow(R[i][j] - np.dot(P[i,:],Q[:,j]), 2)
                    for k in range(K):
                        e = e + (beta/2) * ( pow(P[i][k],2) + pow(Q[k][j],2) )
        if e < 0.001: # tolerance
            break
    return P, Q

In [16]:
np.random.seed(862)
# Initializations
M = rating_matrix.shape[0] # Number of users
N = rating_matrix.shape[1] # Number of items
K = 3 # Number of latent features

# Initial estimate of P and Q
P = np.random.rand(M,K)
Q = np.random.rand(K,N)
rating_np = np.array(rating_matrix)

print(P)
print(Q)
print(rating_np)

[[0.98731988 0.76830439 0.90924592]
 [0.00777058 0.37311097 0.4358948 ]
 [0.98879624 0.00823451 0.45418566]
 ...
 [0.46366433 0.02874291 0.04346027]
 [0.85169041 0.26359646 0.27713021]
 [0.57128066 0.26072676 0.93388266]]
[[0.33567769 0.80485188 0.32626062 ... 0.06815034 0.57811259 0.4372216 ]
 [0.251984   0.44512641 0.39829735 ... 0.36310099 0.4510594  0.95769061]
 [0.47778082 0.97640869 0.18778225 ... 0.53712729 0.9934128  0.9125573 ]]
[[0. 0. 0. ... 0. 0. 0.]
 [0. 5. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 ...
 [0. 5. 0. ... 0. 0. 0.]
 [5. 0. 0. ... 0. 0. 0.]
 [0. 5. 3. ... 0. 0. 0.]]


In [17]:
# Run the fitting
P, Q = matrix_factorization(rating_np, P, Q, K, alpha = 0.001, beta = 0.01, steps = 5)

In [18]:
# predicted ratings
predicted_rating = np.matmul(P[rating_matrix.index == 1839], Q)
print(predicted_rating)

[[4.26428593 4.07336001 3.45030768 ... 2.66263793 3.17453814 2.97271977]]


In [19]:
# get missing ratings
missing_ratings = predicted_rating[0][rating_matrix.loc[1839,:]==0]
print(missing_ratings)

# put missing values in a series
missing_ratings_series = pd.Series(missing_ratings, index = rating_matrix.columns[rating_matrix.loc[1839,:] == 0] )
print(missing_ratings_series)

[4.26428593 4.07336001 3.45030768 ... 2.66263793 3.17453814 2.97271977]
book_id
1        4.264286
2        4.073360
3        3.450308
5        3.695432
7        4.008542
           ...   
9996     1.896904
9997     3.037556
9998     2.662638
9999     3.174538
10000    2.972720
Length: 9869, dtype: float64


In [20]:
# Sort the values
missing_ratings_series.sort_values(ascending = False, inplace = True)

mat_rec = []
for i in range(15):

    rec_book_id = missing_ratings_series.index[i]
    mat_rec.append(books[books['book_id'] == rec_book_id]['original_title'].values[0])
    # print("my number ", i+1, " recommendation is ", books[books['book_id'] == rec_book_id]['original_title'].values[0],
    #       ", with a predicted rating of", missing_ratings_series.iloc[i])

# print(mat_rec)

# Create a DataFrame from the recommendations list
matrix_factorization_recommendations_df = pd.DataFrame({
    'Book Title': mat_rec,
    'Predicted Rating': missing_ratings_series.head(15).values
}, index=np.arange(1, 16))

print(matrix_factorization_recommendations_df)

                                           Book Title  Predicted Rating
1   Jesus the Christ: A Study of the Messiah and H...          4.884573
2   The Essential Calvin and Hobbes: A Calvin and ...          4.761569
3                                      The Brothers K          4.733280
4       Just Mercy: A Story of Justice and Redemption          4.726165
5                     Complete Harry Potter Boxed Set          4.686720
6   Being Mortal: Medicine and What Matters in the...          4.685362
7                 The Authoritative Calvin and Hobbes          4.675380
8         The Complete Anne of Green Gables Boxed Set          4.670026
9                                   Words of Radiance          4.667943
10              Maus II : And Here My Troubles Began           4.660453
11                                  The Complete Maus          4.640561
12                                  Calvin and Hobbes          4.633267
13  The Indispensable Calvin and Hobbes: A Calvin ...          4

## SVD++

While we briefly introduced the SVD++ model in class, we didn't see how to use that in Python. Here is your chance to practice this. First, you will need to install the [surprise](http://surpriselib.com/) library (if you havn't yet).

In lecture we described the factorization algorithm as SVD++. However, the surprise library called it SVD instead (and use SVD++ for a different yet similar algorithm). Your task here is to implement the [SVD](https://surprise.readthedocs.io/en/stable/matrix_factorization.html#surprise.prediction_algorithms.matrix_factorization.SVD) algorithm from the surprise library. I will walk you through as much as I can.

In order to use the surprise library, we need to first put the data into its accepted format. [Here](https://surprise.readthedocs.io/en/stable/getting_started.html#load-dom-dataframe-py) is an example on how it work. In general, you need to do the following:

1. Set up a Reader class
2. Load the dataframe
3. Build the data set using the build_full_trainset() method (see [here](https://surprise.readthedocs.io/en/stable/trainset.html) or [here](https://stackoverflow.com/questions/49263964/datasetautofolds-object-has-no-attribute-global-mean-on-python-surprise))

In [27]:
!pip install scikit-surprise



In [21]:
# Load the libraries
from surprise import Reader
from surprise import Dataset
from surprise.prediction_algorithms.matrix_factorization import SVD

In [22]:
# Step 1: Set up the reader class
reader = Reader(rating_scale=(1,5))

In [23]:
# Step 2: Load the dataframe. Use the merged data from above (not the pivoted data)
data = Dataset.load_from_df(merged_data[['user_id', 'book_id', 'rating']], reader)

In [24]:
# Step 3: Build the train set
svd_data = data.build_full_trainset()

Now we have prepared the data set, you task is then to build the model. I have already imported the SVD algorithm for you. The usage is similar to any sklearn model: you first instantiate a model and set any hyperparamters, then but the model. For this model, use 5 latent factors, a learning rate of 0.01 for all parameters, and a regularization parameter of 0.1 for all parameters. Set a random state of 862.

In [26]:
svd_model = SVD(n_factors = 5, lr_all = 0.01, reg_all = 0.1, random_state = 862)
svd_model.fit(svd_data)

<surprise.prediction_algorithms.matrix_factorization.SVD at 0x161323210>

In [None]:
unread_ids = rating.columns[rating.loc[1839,:] == 0]

my_rec = []
for iid in unread_ids:
    my_rec.append(svd_model.predict(uid=1839,iid=iid).est)


#save series with sorted values
rec_svd = pd.Series(my_rec, index = unread_ids).sort_values(ascending=False)

print(rec_svd)

In [29]:
# Use as many boxes as you need
from surprise.model_selection import train_test_split

def svd_recommendations(user_id, num_factors=5, learning_rate=0.01, regularization=0.1, random_state=862):

    # Set up the Reader class
    reader = Reader(rating_scale=(1, 5))

    # Load the dataframe
    data = Dataset.load_from_df(ratings[['user_id', 'book_id', 'rating']], reader)

    # Split the data into train and test sets
    trainset, testset = train_test_split(data, test_size=0.25, random_state=random_state)

    # Instantiate the SVD algorithm
    svd_model = SVD(n_factors=num_factors, lr_all=learning_rate, reg_all=regularization, random_state=random_state)

    # Fit the model
    svd_model.fit(trainset)

    # Predict ratings for the target user
    user_ratings = []
    for book_id in ratings['book_id'].unique():
        predicted_rating = svd_model.predict(user_id, book_id).est
        user_ratings.append((book_id, predicted_rating))

    # Sort the predictions
    user_ratings.sort(key=lambda x: x[1], reverse=True)

    # Get top 15 recommendations
    top_n_recommendations = user_ratings[:15]

    # Create DataFrame for recommendations
    recommendations_df = pd.DataFrame({
        'Book Title': [books.loc[books['book_id'] == book_id, 'original_title'].values[0] for book_id, _ in top_n_recommendations],
        'Predicted Rating': [predicted_rating for _, predicted_rating in top_n_recommendations]
    }, index=np.arange(1, 16))

    return recommendations_df

Now we have fitted the model, we can perform prediction. There are severals you can do this:

1. Calculate the individual ratings $r_{ui}$ by using the given equation in lecture or [here](https://surprise.readthedocs.io/en/stable/matrix_factorization.html#surprise.prediction_algorithms.matrix_factorization.SVD)
2. Calculate the overall rating matrix by doing some matrix multiplications and manipulations
3. Probably the easiest, is to use the predict function (see an example [here](https://surprise.readthedocs.io/en/stable/getting_started.html#predict-ratings2-py) and [here](https://predictivehacks.com/how-to-run-recommender-systems-in-python/). You may not need to use the str() function)


I will let you decide which you want to do, but the goal is the same, provide the top 15 recommendations (based on the predicted ratings) for user with user_id 1839. Show me the recommendations and the predicted values. Store the recommendations.

In [30]:
# Perform SVD and store the recommendations
svd_recommendations_df = svd_recommendations(1839, num_factors=5, learning_rate=0.01, regularization=0.1, random_state=862)
print(svd_recommendations_df)

                                           Book Title  Predicted Rating
1                                     دیوان‎‎ [Dīvān]          4.519518
2                      The Complete Calvin and Hobbes          4.512293
3   The Indispensable Calvin and Hobbes: A Calvin ...          4.439823
4                                                 NaN          4.437325
5                                   Words of Radiance          4.423735
6                 The Authoritative Calvin and Hobbes          4.418302
7   The Revenge of the Baby-Sat: A Calvin and Hobb...          4.415545
8   There's Treasure Everywhere: A Calvin and Hobb...          4.406228
9   Attack of the Deranged Mutant Killer Monster S...          4.377674
10                   The Absolute Sandman, Volume One          4.363785
11                                    The Hate U Give          4.348139
12  It's a Magical World: A Calvin and Hobbes Coll...          4.346844
13                           The Way of Kings, Part 1          4

## Comparison

We have tried to provide recommendations to user 1839 using 4 methods. You last task is to put these 4 recommendations in a dataframe, with the column names the methods you used, and print out the dataframe.

In [32]:
# Create a DataFrame to compare recommendations
comparison_df = pd.DataFrame({
    'User-Based Book Title': ub_recommendations_df['Book Title'], 
    'Item-Based Book Title': ib_recommendations_df['Book Title'],
    'Matrix Factor Book Title': matrix_factorization_recommendations_df['Book Title'],
    'SVD Book Title': svd_recommendations_df['Book Title']
})

# Print the comparison DataFrame
comparison_df

Unnamed: 0,User-Based Book Title,Item-Based Book Title,Matrix Factor Book Title,SVD Book Title
1,The Da Vinci Code,World War Z: An Oral History of the Zombie War,Jesus the Christ: A Study of the Messiah and H...,دیوان‎‎ [Dīvān]
2,O Alquimista,The Scorch Trials,The Essential Calvin and Hobbes: A Calvin and ...,The Complete Calvin and Hobbes
3,Harry Potter and the Philosopher's Stone,The Graveyard Book,The Brothers K,The Indispensable Calvin and Hobbes: A Calvin ...
4,Harry Potter and the Prisoner of Azkaban,Abraham Lincoln: Vampire Hunter,Just Mercy: A Story of Justice and Redemption,
5,Harry Potter and the Order of the Phoenix,Chosen Prey,Complete Harry Potter Boxed Set,Words of Radiance
6,The Kite Runner,Bad Blood,Being Mortal: Medicine and What Matters in the...,The Authoritative Calvin and Hobbes
7,Harry Potter and the Goblet of Fire,Heat Lightning,The Authoritative Calvin and Hobbes,The Revenge of the Baby-Sat: A Calvin and Hobb...
8,Le Petit Prince,Secret Prey,The Complete Anne of Green Gables Boxed Set,There's Treasure Everywhere: A Calvin and Hobb...
9,Harry Potter and the Half-Blood Prince,,Words of Radiance,Attack of the Deranged Mutant Killer Monster S...
10,Harry Potter and the Chamber of Secrets,,Maus II : And Here My Troubles Began,"The Absolute Sandman, Volume One"
