Project name: Book recommender system

Problem:

<u>Objective</u>:
This project recommends books to users with a hybrid of content-based and collaborative filtering. Given a dataset of users, books and users' ratings, the problem is how to recommend the most relevant books to each user so that the accuracy and relevance of the recommendations are maximized.

<u>Platform used</u>:
The system targets a mobile platform: users interact with the app, and recommendations are served when a user logs in. Given the large database of books and users, it makes logistical sense to run the core of the recommendation engine on our own server and use the mobile app only as the front end where recommendations are displayed. The server handles all recommendation computation, such as cosine-similarity matching (for content-based filtering) and matrix factorization (for collaborative filtering).

### Competitor Analysis

**Goodreads:** Goodreads is by far the largest player, with [140 million members (2022)](https://www.goodreads.com/blog/show/2302-goodreads-members-top-72-hit-books-of-the-year-so-far) and [3.5 billion books](https://help.goodreads.com/s/question/0D58V00008Rm69PSAR/how-many-books-are-listed-on-the-goodreads-site). However, a common criticism is that the site is under-developed and that its recommendation algorithm is "primitive", a claim motivated by the increasingly irrelevant books it recommends. To verify this, one of our group members made an empirical observation: he registered an account and set up the recommendations to suit his preference, which is history, particularly the WW2 era. Most of the recommended books were indeed related to WW2 history, but a not insignificant number were completely irrelevant to his preference (e.g. a Spiderman comic and a teenage novel). On further observation, these "irrelevant" books were recommended because the user had rated other books related to history (both were recommended because the user liked a comic about the concentration camps). The algorithm appears to be a naive similarity match between the books the user liked and all other books in the dataset.

**The StoryGraph:**
As a major alternative to Goodreads, The StoryGraph offers much more personalised recommendations by analysing users' reading habits and breaking down the online library by mood, pace, length, genre, rating, etc. The site is under heavy development, with short-, medium- and long-term roadmaps of experimental features (e.g. a beta feature for tropes or triggers that users want to avoid in recommended books). Feedback on the site is generally positive for its tailored recommendations. Its major limitations are that only a few thousand books are available and that its social/community aspect is not as strong as that of Goodreads.

### Our recommendation system:
**User input:** Currently, the only input from our users is a set of N book ratings (1-10) that new users are asked to provide when they first register an account. These ratings are vital for our system to provide relevant recommendations tailored to the user (explained in detail in the Methods section). The output takes the following form: the top N books are recommended to the user and shown on screen, sorted by their relevance to the user as determined by our recommendation system. These recommendations are shown when the user enters our platform and are recalculated periodically based on the user's new book ratings. Again, the only form of user feedback is the book ratings, but more forms can be considered in the future, including dislike buttons, specific triggers that users want to avoid, book reviews and feedback on our platform. As for how to implement these: if a user dislikes a book or a type of book, the system should account for this by removing such books from the recommended list or by adjusting the recommendation model so that these types of books are heavily penalised (i.e. a negative score in the matrix factorization model).

Since this is a hybrid ensemble model, we will walk through the gist of it. There are 2 recommendation models, and each deals with a separate problem.
**Content-based:**
First, content-based filtering is a ranking problem: we build a user preference vector from all the books the user has rated, and compare this profile against all other books in the dataset using TF-IDF. All books are then sorted by their similarity to the user preference vector, and the top N most similar books are picked.
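
As a minimal sketch of this ranking step, the snippet below builds a rating-weighted user profile and ranks unrated books by cosine similarity. The feature vectors, the helper name `rank_by_profile` and the rating-weighted averaging are illustrative assumptions, not the project's actual implementation.

```python
import numpy as np

# Hypothetical TF-IDF-like feature vectors for five books (rows) over four terms
book_vectors = np.array([
    [1.0, 0.0, 0.0, 0.0],   # book 0
    [0.9, 0.1, 0.0, 0.0],   # book 1
    [0.0, 1.0, 0.0, 0.0],   # book 2
    [0.0, 0.0, 1.0, 0.0],   # book 3
    [0.5, 0.5, 0.0, 0.0],   # book 4
])

def rank_by_profile(book_vectors, rated_idx, ratings, top_n=2):
    """Rank unrated books by cosine similarity to a rating-weighted user profile."""
    rated_idx = np.asarray(rated_idx)
    weights = np.asarray(ratings, dtype=float)
    # User preference vector: rating-weighted average of the rated books' vectors
    profile = weights @ book_vectors[rated_idx] / weights.sum()
    # Cosine similarity between the profile and every book
    norms = np.linalg.norm(book_vectors, axis=1) * np.linalg.norm(profile)
    sims = book_vectors @ profile / np.where(norms == 0, 1.0, norms)
    # Never recommend a book the user has already rated
    sims[rated_idx] = -np.inf
    return np.argsort(-sims)[:top_n].tolist()

# The user rated book 0 with a 9; the closest unrated books come first
top_books = rank_by_profile(book_vectors, rated_idx=[0], ratings=[9])
print(top_books)
```

Weighting by rating is only one possible way to build the profile; an unweighted mean of the rated books' vectors would work with the same ranking code.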

**Collaborative-based:**
Second, collaborative filtering is both a rating-estimation problem and a ranking problem. The goal of this model is to predict the rating each user would most likely give to books they haven't rated. It is also a ranking problem because, once the ratings are predicted, they need to be sorted for each user in order to pick the top N books to recommend.
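
The ranking half of this step can be sketched as follows, assuming a dense matrix of predicted ratings is already available. The function name `top_n_per_user` and the toy matrices are hypothetical, and a rating of 0 is assumed to mean "not rated".

```python
import numpy as np

def top_n_per_user(R_pred, R_observed, n=2):
    """For each user, return the indices of the n unrated books with the highest predicted rating."""
    scores = R_pred.astype(float).copy()
    # Mask books the user has already rated (observed rating > 0)
    scores[R_observed > 0] = -np.inf
    # Sort each row in descending order of predicted rating
    order = np.argsort(-scores, axis=1)
    # Caveat: a user with fewer than n unrated books would get masked books at the tail
    return order[:, :n]

# Toy data: 2 users x 4 books (values are made up)
R_observed = np.array([[8, 0, 0, 0],
                       [0, 0, 7, 0]])
R_pred = np.array([[7.9, 3.1, 6.5, 9.2],
                   [4.0, 8.8, 7.1, 5.5]])
recommended = top_n_per_user(R_pred, R_observed, n=2)
print(recommended)
```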

**User interface:**

N number of recommended books are displayed at a time. Users can click on the book they are interested in to see more details about the book or click on the play button to start reading directly.

<img src="UX1.png">

Here, the user can see more information about the book. Particularly, the book’s ISBN, year of publication, synopsis and publisher of the book. There is also a book’s rating aggregated from users who read the book. Users can start reading the book by clicking the play button. Users are also allowed to give ratings after reading the book.

<img src="UX2.png">



### Dataset
**Basic characteristics:**
The dataset we used is mainly from [Kaggle](https://www.kaggle.com/datasets/arashnic/book-recommendation-dataset/data?select=Ratings.csv). The data was collected by Cai-Nicolas Ziegler through a 4-week crawl of www.bookcrossing.com.

A brief overview of the dataset:
There are 3 CSV files,
-	Users.csv: 278,858 rows of users and the 3 columns are User-ID, location and age
-	Books.csv: 271,379 rows of books and the 7 columns are Book-Title, Book-Author, Year-Of-Publication, Publisher, Image-URL-S, Image-URL-M and Image-URL-L.
-	Ratings.csv: 1,149,780 rows of book ratings where the 3 columns are User-ID, ISBN, and ratings

Additionally, to supplement the content-based filtering model, we used web crawling and APIs to extract book descriptions and category/genre information from Google Books and Open Library. We then added this information to Books.csv so that each book has 4 additional columns named Description, Categories, openlibrary_Description and openlibrary_Categories.

**Exploratory data analysis on the dataset:**


**Weaknesses:**
**User ratings distribution:**
<img src="user distribution.png">
From the distribution, we can see that the vast majority of the users have very few ratings. This is an extremely right-skewed distribution with a small minority of the users having most of the ratings (notice the log-scale). The implication is that the recommendations provided may be biased in favour of these more "vocal" users. Further, observe the following statistics:

- Total number of users: 278858
- Total number of books: 83580 (as tallied in books.csv)
- Number of users with ratings: 105283
- Number of books with ratings: 340556
- Number of users without ratings: 173575 (total_users in users.csv - rated_users in ratings.csv)
- Number of books without ratings: -256976 (total_books in books.csv - rated_books in ratings.csv)

Observations:
- There are 173575 users who never rated any books (62% of all users!).
- Notice the negative number of books without ratings. It arises because ratings.csv contains more books than books.csv, a mistake made by the original owner who gathered the data. The implication is that we sometimes cannot retrieve the full details of a recommended book, because it does not exist in books.csv, where information such as book title, author and book description is located. We discovered the problem too late, so we had to find other recommended books to fill the top-N list (sacrificing relevance, since the list is sorted by relevance).
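
To illustrate why subtracting the two counts can go negative, here is a small sketch with made-up ISBNs standing in for the ISBN columns of books.csv and ratings.csv. It uses plain Python sets so that it is self-contained; the variable names are hypothetical.

```python
# Toy stand-ins for the ISBN columns of books.csv and ratings.csv (values are made up)
catalogue_isbns = {"A", "B", "C"}
rated_isbns = {"B", "C", "D", "E", "F"}

# Subtracting counts goes negative when ratings reference books missing from the catalogue
naive_unrated = len(catalogue_isbns) - len(rated_isbns)

# Set differences give the two quantities that are actually of interest
unrated_books = catalogue_isbns - rated_isbns     # in the catalogue, never rated
orphan_ratings = rated_isbns - catalogue_isbns    # rated, but missing from the catalogue

print(naive_unrated, sorted(unrated_books), sorted(orphan_ratings))
```

Filtering the ratings down to ISBNs that exist in the catalogue before training (as the later cell does with `ratings_df.ISBN.isin(books_df.ISBN)`) is one way to avoid recommending books whose details cannot be retrieved.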

Another weakness of the dataset, which we overcame, is that the original dataset has no book descriptions, which the content-based model depends on. We solved this problem by scraping book descriptions from online websites.

**Strength:**
First of all, the dataset is well suited to a collaborative-filtering model, as we have an abundance of data on user ratings that we can feed directly into our model without much modification. Also worth mentioning is the extra context about the users, such as their geographical location and age, which could be useful for context-aware recommendation. Another strength is the sheer size of the dataset. Unfortunately, due to our limited computational resources, we could not use the full dataset.


**Subset of the dataset:**
We decided to use a subset of the data because the original dataset is simply too big for our computers to handle. For example, during matrix factorization we had to use the compressed sparse row (CSR) format because the user-item matrix is too big to store in memory as a dense matrix. The matrix factorization algorithm also takes too long to run on the original dataset. Web scraping book descriptions is another major practical constraint: we need to retrieve the description of each book through an API or by scraping, and only about 40,000 books can be scraped in 24 hours. Covering both Google Books and Open Library means scraping 271,379 * 2 descriptions, so we would have needed about 14 days to obtain the entire set, which our schedule did not permit. In the end, we decided to use about 30% of the original book dataset.
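
The back-of-the-envelope arithmetic behind these constraints can be reproduced as below. The counts come from the figures quoted in this report; the storage sizes (8-byte values, 4-byte indices) are assumptions made only for a rough estimate.

```python
import math

n_books = 271_379            # rows in Books.csv
sources = 2                  # Google Books and Open Library
scraped_per_day = 40_000     # approximate throughput reported above

days_needed = n_books * sources / scraped_per_day
print(f"{days_needed:.1f} days, i.e. about {math.ceil(days_needed)} days")

# Rough dense footprint if every (user, book) cell were stored as an 8-byte float
n_rated_users = 105_283
n_rated_books = 340_556
dense_gib = n_rated_users * n_rated_books * 8 / 1024**3
print(f"dense user-item matrix: ~{dense_gib:.0f} GiB")

# CSR stores only the observed ratings: values + column indices + row pointers
# (assumes 8-byte values and 4-byte indices)
n_ratings = 1_149_780
csr_mib = (n_ratings * (8 + 4) + (n_rated_users + 1) * 4) / 1024**2
print(f"CSR user-item matrix: ~{csr_mib:.0f} MiB")
```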


## Methods:

**Collaborative-filtering models:**
For the Collaborative-filtering models, we have 2 implementations differing only in the method of matrix factorization.

**Stochastic gradient descent (SGD) approach:**
Objective: Decompose the user-book ratings matrix R (m users × n books) into 2 lower-dimensional matrices:
- P, of size m × k (user-latent factor matrix)
- Q, of size n × k (item-latent factor matrix)

The goal is to minimize the difference between the actual ratings and the predicted ratings (from the dot product of P and Q^T).
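
Written out, and assuming the usual L2-regularized squared-error objective (an assumption, chosen because it reproduces the update rule used in the `nmf` function of this notebook), the problem and its per-rating SGD updates are roughly as follows. Here p_u and q_i denote the rows of P and Q for user u and book i, λ is the regularization weight (`lambda_`) and α is the learning rate (`alpha`).

```latex
\min_{P,\,Q} \sum_{(u,i)\,:\,r_{ui} > 0}
  \left[ \left( r_{ui} - p_u^{\top} q_i \right)^2
  + \lambda \left( \lVert p_u \rVert^2 + \lVert q_i \rVert^2 \right) \right]

e_{ui} = r_{ui} - p_u^{\top} q_i

p_u \leftarrow p_u + \alpha \left( 2\, e_{ui}\, q_i - 2 \lambda\, p_u \right)

q_i \leftarrow q_i + \alpha \left( 2\, e_{ui}\, p_u - 2 \lambda\, q_i \right)
```

Each update is a step against the gradient of the bracketed term for one observed rating, which is why only non-zero entries of R are visited.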

<img src="Gradient2.jpg">
<img src="Gradient1.jpg">



In [None]:
import pandas as pd
import numpy as np
from scipy.sparse import csr_matrix
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
from sklearn.decomposition import NMF
import matplotlib.pyplot as plt

# Load the original datasets (disabled: too large for our machines)
# books = pd.read_csv('combined_books.csv')
# users = pd.read_csv('Users.csv')
# ratings = pd.read_csv('Ratings.csv')

# Load the sampled subset instead
books = pd.read_csv("sample_books.csv")
users = pd.read_csv("sample_users.csv")
ratings = pd.read_csv("sample_ratings.csv")


# Map the raw user IDs and ISBNs found in the ratings to contiguous matrix indices
user_id_mapping = {raw_id: idx for idx, raw_id in enumerate(ratings["User-ID"].unique())}
book_id_mapping = {raw_id: idx for idx, raw_id in enumerate(ratings["ISBN"].unique())}


ratings["User-ID"] = ratings["User-ID"].map(user_id_mapping)
ratings["ISBN"] = ratings["ISBN"].map(book_id_mapping)

# Create user-item rating matrix
n_users = len(user_id_mapping)
n_items = len(book_id_mapping)

row = ratings["User-ID"].values
col = ratings["ISBN"].values
data = ratings["Book-Rating"].values
R = csr_matrix((data, (row, col)), shape=(n_users, n_items))

# user-item matrix (print a summary only; R.toarray() would densify the whole matrix)
print(repr(R))


# Matrix factorization trained with stochastic gradient descent (SGD).
# Note: despite the name, no non-negativity constraint is enforced on P or Q.
def nmf(R, k, alpha=0.01, lambda_=0.1, n_iterations=2000):
    n_users, n_items = R.shape
    P = np.random.rand(n_users, k)
    Q = np.random.rand(n_items, k).T

    # Visit only the observed (positive) ratings instead of every (u, i) cell,
    # which avoids slow element-wise indexing into the sparse matrix
    R_coo = R.tocoo()
    observed = [
        (u, i, r) for u, i, r in zip(R_coo.row, R_coo.col, R_coo.data) if r > 0
    ]

    for iteration in range(n_iterations):
        for u, i, r_ui in observed:
            error = r_ui - np.dot(P[u, :], Q[:, i])
            # Keep a copy so that both updates use the pre-update user factors
            p_u = P[u, :].copy()
            P[u, :] += alpha * (2 * error * Q[:, i] - 2 * lambda_ * P[u, :])
            Q[:, i] += alpha * (2 * error * p_u - 2 * lambda_ * Q[:, i])

    return P, Q.T


k_values = 2


P, Q = nmf(R, k_values, alpha=0.01, lambda_=0.1, n_iterations=1000)

R_pred = np.dot(P, Q.T)
R_pred
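
One way to judge the quality of `R_pred` is the root-mean-square error over the observed ratings only. The sketch below uses small made-up dense matrices; the helper name `rmse_on_observed` is hypothetical, and a rating of 0 is assumed to mean "missing".

```python
import numpy as np

def rmse_on_observed(R_true, R_pred):
    """Root-mean-square error over observed (non-zero) ratings only."""
    mask = R_true > 0
    return float(np.sqrt(np.mean((R_true[mask] - R_pred[mask]) ** 2)))

# Toy matrices (hypothetical values); 0 marks a missing rating
R_true = np.array([[8.0, 0.0],
                   [0.0, 6.0]])
R_hat = np.array([[7.0, 3.0],
                  [5.0, 8.0]])
score = rmse_on_observed(R_true, R_hat)
print(score)
```

Measured on the training ratings, this number is likely to be optimistic; holding out a share of the ratings before factorization and scoring only those would give a fairer estimate.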

<img src="Pattern.jpg">

### Reflection
The implementation of matrix factorization used in collaborative filtering

In our honest opinion, the current recommender is not yet commercially viable, and more work needs to be done to improve the relevance of the recommendations. The most obvious area of improvement is the improvement of the

**Further improvement:**
Context-aware recommendation: To further improve the relevance of the collaborative filtering model, one potential improvement is the generalised factorization machine introduced in the context-aware recommendation lecture. We could add dimensions to the matrix factorization problem for context variables such as time/day, location and age. By capturing the hidden relationships between users, books and context variables, we believe we could significantly improve the relevance of the recommendations. In fact, the current dataset does contain context information such as users' location and age, but due to time constraints we could not experiment further with this idea. From our experience with traditional matrix factorization, we realised how difficult it is to decompose a matrix as sparse as the user-item matrix: it is enormous (so large that we could not fit it in RAM without converting it to CSR), yet the vast majority of its entries are empty. Adding more dimensions would yield an even larger sparse structure (more accurately, a tensor) and an even harder problem, which may be beyond our current capability to solve.


## Content-based

In [None]:
import pandas as pd
import warnings
import numpy as np
import re

warnings.filterwarnings("ignore")
books_df = pd.read_csv("combined_books.csv")
ratings_df = pd.read_csv("content_Ratings.csv")
users_df = pd.read_csv("content_Users.csv")

In [None]:
new_user_data = {"User-ID": 279858, "Location": "kensington, nsw, australia", "Age": 38}

# Append the new user to the DataFrame
users_df = pd.concat([users_df, pd.DataFrame([new_user_data])], ignore_index=True)

# List of ISBNs to be rated by the new user
isbn_list = [
    "0425176428",
    "0385418493",
    "0871137380",
    "0486265862",
    "0060973129",
    "0029087104",
    "0553256696",
    "0393038440",
    "0345308239",
    "0131337033",
]

# Generate random high ratings (6-10)
ratings_list = np.random.randint(6, 11, size=len(isbn_list))

# Create new user ratings
new_user_ratings = [
    {"User-ID": 279858, "ISBN": isbn, "Book-Rating": rating}
    for isbn, rating in zip(isbn_list, ratings_list)
]

ratings_df = pd.concat([ratings_df, pd.DataFrame(new_user_ratings)], ignore_index=True)

In [None]:
books_df.head()

In [None]:
missing_values = books_df.isnull().sum()
missing_values

In [None]:
# Fill in the Description and openlibrary_Description columns
books_df["Description"] = books_df["Description"].fillna(
    books_df["openlibrary_Description"]
)
books_df["openlibrary_Description"] = books_df["openlibrary_Description"].fillna(
    books_df["Description"]
)

# Delete lines where both Description and openlibrary_Description are empty
books_df = books_df.dropna(subset=["Description", "openlibrary_Description"], how="all")

# Fill in the Categories and openlibrary_Categories columns from each other
books_df["Categories"] = books_df["Categories"].fillna(
    books_df["openlibrary_Categories"]
)
books_df["openlibrary_Categories"] = books_df["openlibrary_Categories"].fillna(
    books_df["Categories"]
)

# Delete rows where both Categories and openlibrary_Categories are empty
books_df = books_df.dropna(subset=["Categories", "openlibrary_Categories"], how="all")

books_df

In [None]:
missing_values = books_df.isnull().sum()
missing_values

In [None]:
books_df.loc[books_df["Year-Of-Publication"] == "DK Publishing Inc", :]

In [None]:
# ISBN '0789466953'
books_df.loc[books_df["ISBN"] == "0789466953", "Year-Of-Publication"] = 2000
books_df.loc[books_df["ISBN"] == "0789466953", "Book-Author"] = "James Buckley"
books_df.loc[books_df["ISBN"] == "0789466953", "Publisher"] = "DK Publishing Inc"
books_df.loc[books_df["ISBN"] == "0789466953", "Book-Title"] = (
    "DK Readers: Creating the X-Men, How Comic Books Come to Life (Level 4: Proficient Readers)"
)

# ISBN '078946697X'
books_df.loc[books_df["ISBN"] == "078946697X", "Year-Of-Publication"] = 2000
books_df.loc[books_df["ISBN"] == "078946697X", "Book-Author"] = "Michael Teitelbaum"
books_df.loc[books_df["ISBN"] == "078946697X", "Publisher"] = "DK Publishing Inc"
books_df.loc[books_df["ISBN"] == "078946697X", "Book-Title"] = (
    "DK Readers: Creating the X-Men, How It All Began (Level 4: Proficient Readers)"
)

# rechecking
books_df.loc[(books_df["ISBN"] == "0789466953") | (books_df["ISBN"] == "078946697X"), :]

In [None]:
ratings_df.head()

In [None]:
users_df.head()

In [None]:
# Merge books with ratings on ISBN
merged_df = pd.merge(books_df, ratings_df, on="ISBN")

# Add user information to build final_df
final_df = pd.merge(merged_df, users_df, on="User-ID")

# Function to normalize book titles by removing special characters and converting to lowercase
def normalize_title(title):
    # str() guards against missing (NaN) titles
    return re.sub(r'[^\w\s]', '', str(title).lower())

# Apply normalization to the Book-Title and ISBN columns
final_df["Normalized-Title"] = final_df["Book-Title"].apply(normalize_title)
final_df["Normalized-ISBN"] = final_df["ISBN"].apply(lambda x: re.sub(r'[^\dX]', '', str(x).upper()))

# Drop duplicate entries based on the normalized columns.
# Note: final_df has one row per rating, so this keeps only the first rating of each book.
final_df = final_df.drop_duplicates(subset=["Normalized-Title", "Normalized-ISBN"])

# Drop the helper columns after deduplication
final_df = final_df.drop(columns=["Normalized-Title", "Normalized-ISBN"])

# Recheck the final DataFrame
final_df.head()

In [None]:
# Check for missing values in the final merged DataFrame
missing_values = final_df.isnull().sum()
missing_values

In [None]:
final_df.info()

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

# Set the global drawing style and font size
sns.set(style="whitegrid", palette="muted")
plt.rcParams.update({"font.size": 12})
ratings = ratings_df[ratings_df.ISBN.isin(books_df.ISBN)]
ratings_explicit = ratings[ratings["Book-Rating"] != 0]
ratings_implicit = ratings[ratings["Book-Rating"] == 0]
# Plot the distribution of book ratings
plt.figure(figsize=(10, 6))
sns.countplot(data=ratings_explicit, x="Book-Rating", palette="rocket_r")
plt.title("Distribution of Explicit Book Ratings", fontsize=16)
plt.xlabel("Book Rating", fontsize=14)
plt.ylabel("Count", fontsize=14)
plt.show()

In [None]:
# Plot the distribution of the number of user ratings
user_rating_counts = ratings_df.groupby("User-ID")["ISBN"].count()
plt.figure(figsize=(10, 6))
sns.histplot(
    user_rating_counts, kde=True, log_scale=(True, False), bins=30, color="teal"
)
plt.title("Distribution of Total Ratings per User", fontsize=16)
plt.xlabel("Number of Ratings", fontsize=14)
plt.ylabel("Number of Users", fontsize=14)
plt.show()

In [None]:
# Plot the distribution of user ages
df_plot = users_df[users_df["Age"] <= 100]
plt.figure(figsize=(10, 6))
sns.histplot(df_plot["Age"], kde=True, color="#4169E1")
plt.title("Distribution of User Ages", fontsize=16)
plt.xlabel("Age", fontsize=14)
plt.ylabel("Frequency", fontsize=14)
plt.show()

In [None]:
# Plot the distribution of the year the book was published
books_df["Year-Of-Publication"] = pd.to_numeric(
    books_df["Year-Of-Publication"], errors="coerce"
)
df_plot2 = books_df[
    (books_df["Year-Of-Publication"] > 1950) & (books_df["Year-Of-Publication"] <= 2016)
]
plt.figure(figsize=(10, 6))
sns.histplot(df_plot2["Year-Of-Publication"], kde=True, color="#FFA07A")
plt.title("Distribution of Year of Publication", fontsize=16)
plt.xlabel("Year of Publication", fontsize=14)
plt.ylabel("Frequency", fontsize=14)
plt.show()

In [None]:
# Plot the distribution of the most popular book Categories (Top 10 Categories)
top_categories = books_df["Categories"].value_counts().head(10)
plt.figure(figsize=(10, 6))
sns.barplot(x=top_categories.values, y=top_categories.index, palette="magma")
plt.title("Top 10 Book Categories", fontsize=16)
plt.xlabel("Number of Books", fontsize=14)
plt.ylabel("Categories", fontsize=14)
plt.show()

In [None]:
# Top 10 Authors by Number of Ratings
# Merge books_df and ratings_df and count the number of ratings per author
data = pd.merge(books_df, ratings_df, on="ISBN")[
    ["Book-Author", "Book-Rating", "Book-Title", "ISBN"]
]

# Group and aggregate the mean and quantity of 'Book-Rating' by 'Book-Author'
data = data.groupby("Book-Author").agg({"Book-Rating": ["mean", "count"]}).reset_index()

data.columns = ["Book-Author", "mean", "count"]
plt.figure(figsize=(10, 6))
top_rated_authors = data.sort_values("count", ascending=False).head(10)
sns.barplot(x="count", y="Book-Author", data=top_rated_authors, palette="viridis")
plt.title("Top 10 Authors by Number of Ratings", fontsize=16)
plt.xlabel("Number of Ratings", fontsize=14)
plt.ylabel("Author", fontsize=14)
plt.show()

In [None]:
# Aggregate score data for books
data = ratings_df.groupby("ISBN").agg(["mean", "count"])["Book-Rating"].reset_index()

# Merge rating data with book data to get author information
data = pd.merge(data, books_df[["ISBN", "Book-Author"]], on="ISBN")

# Calculate the weighted score
m = data["count"].quantile(0.90)  
C = data["mean"].mean()  
data = data[data["count"] > m]

# Weighted scoring formula
data["weighted rating"] = (data["count"] / (data["count"] + m)) * data["mean"] + (
    m / (m + data["count"])
) * C

# Weighted Rating vs. Number of Ratings
plt.figure(figsize=(10, 6))
sns.scatterplot(
    x="count",
    y="weighted rating",
    data=data,
    hue="mean",
    size="mean",
    palette="cool",
    sizes=(20, 200),
    legend=False,
)
plt.title("Weighted Rating vs. Number of Ratings", fontsize=16)
plt.xlabel("Number of Ratings", fontsize=14)
plt.ylabel("Weighted Rating", fontsize=14)
plt.show()
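
The weighted-rating formula used in the cell above shrinks each item's mean rating toward the global mean `C`, with the vote threshold `m` controlling how strongly. A small standalone sketch with made-up numbers shows the effect; the function name `weighted_rating` and the example values are illustrative only.

```python
def weighted_rating(mean, count, m, C):
    """Shrink an item's mean rating toward the global mean C; m is the vote threshold."""
    return (count / (count + m)) * mean + (m / (count + m)) * C

C = 7.0   # hypothetical global mean rating
m = 50    # hypothetical vote threshold

# A perfect score from 2 votes is pulled almost all the way back to C,
# while a 9.0 average from 500 votes stays close to 9.0
few_votes = weighted_rating(mean=10.0, count=2, m=m, C=C)
many_votes = weighted_rating(mean=9.0, count=500, m=m, C=C)
print(round(few_votes, 3), round(many_votes, 3))
```

This is why the ranking favours items with many consistent ratings over items with a handful of extreme ones.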

In [None]:
def normalize_author_name(name):
    # Remove punctuation and convert to lowercase (whitespace is kept);
    # str() guards against missing (NaN) author values
    name = re.sub(r'[^\w\s]', '', str(name).lower())
    return name


# Load the combined books dataset
books_df = pd.read_csv("combined_books.csv")

# Create a new column for normalized author names
books_df["Normalized-Author"] = books_df["Book-Author"].apply(normalize_author_name)

# Drop duplicate books based on 'Normalized-Author' and 'Book-Title'
books_df = books_df.drop_duplicates(["Normalized-Author", "Book-Title"])

# Merge books and ratings dataframes on 'ISBN'
ratings_df = pd.read_csv("content_Ratings.csv")
data = pd.merge(books_df, ratings_df, on="ISBN")[
    ["Book-Author", "Normalized-Author", "Book-Rating", "Book-Title", "ISBN"]
]

# Group by 'Normalized-Author' and aggregate mean and count of 'Book-Rating'
data = (
    data.groupby("Normalized-Author")
    .agg({"Book-Rating": ["mean", "count"], "Book-Author": "first"})
    .reset_index()
)

# Rename columns for clarity
data.columns = ["Normalized-Author", "mean", "count", "Book-Author"]

# Determine the threshold for number of votes
m = data["count"].quantile(0.99)
data = data[data["count"] > m]

print("m =", m)
print(data.shape)

# Calculate weighted rating
R = data["mean"]  # Average rating for the author
v = data["count"]  # Number of votes for the author
C = data["mean"].mean()  # Mean vote across all authors
data["weighted rating"] = (v / (v + m)) * R + (m / (v + m)) * C

# Sort by weighted rating and reset index
data = data.sort_values("weighted rating", ascending=False).reset_index(drop=True)

# Display top 20 highest rated authors
print(data[["Book-Author", "mean", "count", "weighted rating"]].head(20))

In [None]:
# Top 20 Highest Rated Books
data = ratings_df.groupby("ISBN").agg(["mean", "count"])["Book-Rating"].reset_index()


m = data["count"].quantile(0.99)  # Keep only books that are rated more than this value
data = data[data["count"] > m]
print("m =", m)
print(data.shape)
R = data["mean"]  # average for the book rating
v = data["count"]  # number of votes for the book = (votes)
C = data["mean"].mean()  # mean vote across all books
data["weighted rating"] = (v / (v + m)) * R + (m / (v + m)) * C
data = data.sort_values("weighted rating", ascending=False).reset_index(drop=True)


data = (
    pd.merge(data, books_df, on="ISBN")[
        [
            "Book-Title",
            "Book-Author",
            "mean",
            "count",
            "weighted rating",
            "Year-Of-Publication",
        ]
    ]
    .drop_duplicates("Book-Title")
    .iloc[:20]
)
data

### Data processing

In [None]:
missing_values = final_df.isnull().sum()
missing_values

In [None]:
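# Fill missing authors and publishers with the most frequent value.
# Assigning the result back avoids in-place fillna on a column selection,
# which newer pandas versions warn about and may not apply.
final_df["Book-Author"] = final_df["Book-Author"].fillna(final_df["Book-Author"].mode()[0])
final_df["Publisher"] = final_df["Publisher"].fillna(final_df["Publisher"].mode()[0])

# Handle missing values in 'Age' column with median
final_df["Age"] = final_df["Age"].fillna(final_df["Age"].median())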

In [None]:
missing_values = final_df.isnull().sum()
missing_values

In [None]:
# Convert the data types of three fields in the final_df DataFrame to integers
final_df["Age"] = final_df["Age"].astype(int)
final_df["Book-Rating"] = final_df["Book-Rating"].astype(int)
final_df["User-ID"] = final_df["User-ID"].astype(int)

final_df.info()

In [None]:
duplicate_count = final_df.duplicated().sum()
duplicate_count

In [None]:
# Inspect the rows whose rating is 0 (they are removed in the next cell)
zero_rating_books_df = final_df[final_df["Book-Rating"] == 0]
zero_rating_books_df

In [None]:
final_df = final_df[final_df["Book-Rating"] != 0]

In [None]:
sample_size = 100
df3 = final_df.sample(n=sample_size, random_state=42)


df3["Rating_Category"] = pd.cut(
    df3["Book-Rating"], bins=[0, 3, 7, 10], labels=["Low", "Medium", "High"]
)

# Plot a bar chart with seaborn
plt.figure(figsize=(18, 20))
sns.barplot(
    data=df3,
    x="Book-Rating",
    y="Book-Title",
    hue="Rating_Category",
    palette={"Low": "red", "Medium": "orange", "High": "green"},
)
plt.title("Average Rating by Title of Book(new)")
plt.xlabel("Average Rating")
plt.ylabel("Title of Book")
plt.legend(title="Rating Category")
plt.show()

In [None]:
# Data statistics description
final_df.describe()

First, identify outliers in the 'Year-Of-Publication' field (greater than 2022 or equal to 0) and mark them as missing values.
The mean of the 'Year-Of-Publication' field in books_df is then used to fill the missing values of that field in final_df.
Finally, the filled values are converted to integers.

In [None]:
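# Handle outliers in the 'Year-Of-Publication' column
import numpy as np

# Coerce to numeric first so that stray strings become NaN instead of breaking the comparison
final_df["Year-Of-Publication"] = pd.to_numeric(
    final_df["Year-Of-Publication"], errors="coerce"
)
final_df.loc[
    (final_df["Year-Of-Publication"] > 2022) | (final_df["Year-Of-Publication"] == 0),
    "Year-Of-Publication",
] = np.nan

# Fill missing years with the (rounded) mean year from books_df, then cast to int
mean_year = round(
    pd.to_numeric(books_df["Year-Of-Publication"], errors="coerce").mean()
)
final_df["Year-Of-Publication"] = (
    final_df["Year-Of-Publication"].fillna(mean_year).astype(np.int32)
)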

Next, identify outliers in the 'Age' field (greater than 90 or less than 5 years old) and mark them as missing values.
The mean of the 'Age' field in users_df is then used to fill the missing values of that field in final_df.
Finally, the 'Age' and 'Year-Of-Publication' fields are converted to 32-bit integers.

In [None]:
final_df.loc[(final_df.Age > 90) | (final_df.Age < 5), "Age"] = np.nan

# replacing NaNs with mean
final_df.Age = final_df.Age.fillna(users_df.Age.mean())

# setting the data type as int
final_df.Age = final_df.Age.astype(np.int32)
final_df["Year-Of-Publication"] = final_df["Year-Of-Publication"].astype(np.int32)

In [None]:
final_df.describe()

Data preparation:

First, the 'Book-Title', 'Book-Author', 'Description', 'Categories', 'openlibrary_Description' and 'openlibrary_Categories' columns in books_df are concatenated into a text corpus.
A Gensim Word2Vec model, word2vec_model_recommender, is then trained on this corpus.

In [None]:

corpus = (
    (
        books_df["Book-Title"].astype(str)
        + " "
        + books_df["Book-Author"].astype(str)
        + " "
        + books_df["Description"].astype(str)
        + " "
        + books_df["Categories"].astype(str)
        + " "
        + books_df["openlibrary_Description"].astype(str)
        + " "
        + books_df["openlibrary_Categories"].astype(str)
    )
    .apply(str.split)
    .tolist()
)

**Recommendation algorithm (Word2Vec):**
- The recommend function is the core of the recommendation algorithm.
- Its input parameters are the user ID user_id, the complete book data, and the trained Word2Vec model word2vec_model.
- First, it gets the information about the books the user rated, including title, author, category and description.
- This information is joined into one text and split into words, whose vectors are averaged into avg_vector.
- The average word vector of each book is then computed, and its cosine similarity with the user's average vector gives the similarity score.
- Based on the similarity score, the top 10 books with the highest similarity and scores higher than 5 are selected as the recommendations.
- Finally, the recommendations are returned as a DataFrame.

In [None]:
from gensim.models import Word2Vec
from sklearn.metrics.pairwise import cosine_similarity

# Train a Word2Vec model (sg=1 selects skip-gram; sg=0 selects CBOW)
word2vec_model_recommender = Word2Vec(
    sentences=corpus, vector_size=500, window=5, min_count=5, sg=1
)


def recommend(user_id, data, word2vec_model):
    # Get user preferences
    user_preferences = data[data["User-ID"] == user_id]
    if user_preferences.empty:
        return None

    # Get information about the user's favorite books
    liked_books = user_preferences["Book-Title"].tolist()
    liked_ISBN = user_preferences["ISBN"].tolist()
    liked_authors = user_preferences["Book-Author"].tolist()
    liked_genres = user_preferences["Categories"].tolist()
    liked_description = user_preferences["Description"].tolist()
    liked_openlibrary_description = user_preferences["openlibrary_Description"].tolist()
    liked_openlibrary_categories = user_preferences["openlibrary_Categories"].tolist()

    # Merge user preference information into one text;
    # str() guards against NaN (float) values in any of the columns
    text = " ".join(
        map(
            str,
            liked_ISBN
            + liked_books
            + liked_authors
            + liked_genres
            + liked_description
            + liked_openlibrary_description
            + liked_openlibrary_categories,
        )
    )
    tokens = text.split()

    # Get text vector
    vectors = [
        word2vec_model.wv[token] for token in tokens if token in word2vec_model.wv
    ]
    if len(vectors) == 0:
        return None
    avg_vector = sum(vectors) / len(vectors)

    # Calculate the similarity to each book
    similarities = []
    recommended_titles = set()
    for idx, row in data.iterrows():
        # Skip books that the user has already rated
        if row["ISBN"] in liked_ISBN:
            continue

        row_text = " ".join(
            [
                str(row[col])
                for col in [
                    "Book-Title",
                    "Book-Author",
                    "Categories",
                    "Description",
                    "openlibrary_Description",
                    "openlibrary_Categories",
                ]
            ]
        )
        row_tokens = row_text.split()
        row_vectors = [
            word2vec_model.wv[token]
            for token in row_tokens
            if token in word2vec_model.wv
        ]
        if len(row_vectors) > 0:
            row_avg_vector = sum(row_vectors) / len(row_vectors)
            similarity = cosine_similarity([avg_vector], [row_avg_vector])[0][0]
            similarities.append((row, similarity))

    # Rank by similarity and select the top 10 distinct titles
    similarities.sort(key=lambda x: x[1], reverse=True)
    recommendations = []
    for book, sim in similarities:
        # Rows are user-book pairs, so the same title can appear more than once
        if book["Book-Title"] in recommended_titles:
            continue
        recommendations.append(book.to_dict())
        recommended_titles.add(book["Book-Title"])
        if len(recommendations) >= 10:
            break

    # Convert the recommendation result to a DataFrame
    recommendations_df = pd.DataFrame(recommendations)
    return recommendations_df


# Example of using the function
user_id = 279858
recommendations = recommend(user_id, final_df, word2vec_model_recommender)
recommendations

**Data Preprocessing**

- Concatenate the 'Book-Title', 'Book-Author', 'Description' and 'Categories' columns of the book data into one text string per book to form the corpus.
- Use 'TfidfVectorizer' to extract TF-IDF features from 'corpus' and generate 'tfidf_matrix'.

**Recommendation Algorithm**:
- The 'recommend' function accepts the user ID and the entire dataset 'final_df' as input.
- First, get the list of books the user has rated, 'user_preferences'.
- Then, build the user preference vector 'user_vector' from 'user_preferences'.
- Next, traverse the entire dataset, computing the cosine similarity between each book and the user's preference vector, and store the results in the 'similarities' list.
- Sort the 'similarities' list in descending order of similarity.
- From the sorted list, select the first 10 books that the user has not already rated, that have a rating above 5, and that are not duplicates of an earlier pick.
- Finally, return the recommendations as a DataFrame.
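
To make the TF-IDF step concrete, here is a minimal stdlib sketch, assuming the plain weighting tf × log(N / df). scikit-learn's `TfidfVectorizer` uses a smoothed IDF and L2 normalization by default, so its numbers will differ. The helper names (`fit_idf`, `transform`, `sparse_cosine`) and the three toy book strings are hypothetical.

```python
import math
from collections import Counter

def fit_idf(docs):
    """Learn log(N / df) for every term that occurs in the corpus."""
    tokenized = [set(doc.lower().split()) for doc in docs]
    df = Counter(term for tokens in tokenized for term in tokens)
    return {term: math.log(len(docs) / count) for term, count in df.items()}

def transform(doc, idf):
    """Turn a text into a sparse {term: tf * idf} vector; unseen terms are ignored."""
    tf = Counter(doc.lower().split())
    return {term: count * idf[term] for term, count in tf.items() if term in idf}

def sparse_cosine(a, b):
    """Cosine similarity between two sparse {term: weight} vectors."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

# Toy corpus (assumption): one string per book
books = ["stalingrad war history", "normandy war history", "spider comic adventure"]
idf = fit_idf(books)

# The user profile is the joined text of the titles the user liked
user_vec = transform("war history", idf)
scores = [sparse_cosine(user_vec, transform(book, idf)) for book in books]
ranked = sorted(range(len(books)), key=lambda i: scores[i], reverse=True)
print(scores, ranked)
```

The IDF is fitted on the book corpus only and then reused for the user profile, which mirrors the fit-then-transform split in the cell below.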

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

corpus = (books_df['Book-Title'].astype(str) + ' ' + 
          books_df['Book-Author'].astype(str) + ' ' +
          books_df['Description'].astype(str) + ' ' +
          books_df['Categories'].astype(str)).tolist()

tfidf_vectorizer = TfidfVectorizer()
tfidf_matrix = tfidf_vectorizer.fit_transform(corpus)

def recommend(user_id, data):
    # Get the titles the user has already rated (from the `data` argument, not a global)
    user_preferences = data.loc[data['User-ID'] == user_id, 'Book-Title'].astype(str).tolist()
    if not user_preferences:
        return None
    seen_titles = set(user_preferences)

    # Create a user preference vector
    user_vector = tfidf_vectorizer.transform([' '.join(user_preferences)])

    # Build each row's text from the same four columns the vectorizer was fitted on,
    # then score every row against the user vector in a single call
    row_texts = (data['Book-Title'].astype(str) + ' ' +
                 data['Book-Author'].astype(str) + ' ' +
                 data['Description'].astype(str) + ' ' +
                 data['Categories'].astype(str))
    similarities = cosine_similarity(user_vector, tfidf_vectorizer.transform(row_texts))[0]

    # Walk the rows in descending order of similarity and keep the top 10
    recommendations = []
    recommended_titles = set()
    for pos in similarities.argsort()[::-1]:
        book = data.iloc[pos]
        title = str(book['Book-Title'])
        if title not in seen_titles and book['Book-Rating'] > 5 and title not in recommended_titles:
            recommendations.append(book.to_dict())
            recommended_titles.add(title)
        if len(recommendations) >= 10:
            break

    # Convert to a DataFrame
    recommendations_df = pd.DataFrame(recommendations)

    return recommendations_df

user_id = 279858
recommendations = recommend(user_id, final_df)
recommendations

In [None]:
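import numpy as np

# NOTE: `recommend` must be the three-argument Word2Vec version here; re-run that cell
# first if the TF-IDF cell has since redefined `recommend(user_id, data)`.
def evaluate_recommendations(data, word2vec_model, top_n=10):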
    user_ids = data['User-ID'].unique()
    
    all_recall = []
    all_precision = []
    hit_rate = []
    
    for user_id in user_ids:
        recommendations = recommend(user_id, data, word2vec_model)
        if recommendations is None or recommendations.empty:
            continue
        
        user_preferences = data[data['User-ID'] == user_id]
        liked_books = set(user_preferences['Book-Title'])
        
        recommended_books = recommendations['Book-Title'].tolist()
        # recommend() skips ISBNs the user has already rated, so a hit here only occurs when
        # another edition shares a title; a held-out split would give more meaningful numbers.
        true_positives = [1 if book in liked_books else 0 for book in recommended_books[:top_n]]
        
        # Calculate precision and recall
        precision = sum(true_positives) / top_n
        recall = sum(true_positives) / len(liked_books)
        
        all_precision.append(precision)
        all_recall.append(recall)
        
        # Calculate hit rate (whether there is at least one correct recommendation)
        if sum(true_positives) > 0:
            hit_rate.append(1)
        else:
            hit_rate.append(0)
    
    # Calculate the average recall, precision, and hit rate across all users
    average_recall = np.mean(all_recall)
    average_precision = np.mean(all_precision)
    hit_rate = np.mean(hit_rate)
    
    return average_recall, average_precision, hit_rate

# Example of using the evaluation function
average_recall, average_precision, hit_rate = evaluate_recommendations(final_df, word2vec_model_recommender)
print(f'Average Recall: {average_recall:.2f}')
print(f'Average Precision: {average_precision:.2f}')
print(f'Hit Rate: {hit_rate:.2f}')

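- TP = 7, FP = 8 (15 recommendations in total)
- TN and FN: cannot be determined in this setup
- Precision = TP / (TP + FP) = 7 / 15 ≈ 0.47
- Recall = TP / (TP + FN), not computable without FN
- F1 = 2 × Precision × Recall / (Precision + Recall)
- Accuracy = (TP + TN) / (size of the whole dataset), not computable without TN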
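
Recall becomes measurable if some of each user's ratings are held out before recommending and then treated as the ground truth. The sketch below assumes such a held-out set exists. The function name `precision_recall_at_k` and the book IDs are hypothetical, and it is not the evaluation used in the notebook.

```python
def precision_recall_at_k(recommended, held_out, k=10):
    """Precision@k, recall@k and hit flag for one user.

    recommended: ranked list of book IDs produced by the recommender
    held_out:    set of book IDs the user liked but that were hidden from the recommender
    """
    top_k = recommended[:k]
    hits = sum(1 for book in top_k if book in held_out)
    precision = hits / k
    recall = hits / len(held_out) if held_out else 0.0
    return precision, recall, hits > 0

def f1_score(precision, recall):
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    total = precision + recall
    return 2 * precision * recall / total if total else 0.0

# Toy example (made-up IDs): 2 of the 5 recommendations are in the held-out set of 3
recommended = ["b1", "b7", "b3", "b9", "b4"]
held_out = {"b3", "b4", "b8"}
precision, recall, hit = precision_recall_at_k(recommended, held_out, k=5)
print(precision, recall, hit, f1_score(precision, recall))

# The manual count above: 7 true positives out of 15 recommendations
manual_precision = 7 / (7 + 8)
print(round(manual_precision, 2))
```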

## Collaborative Filtering

Since we have users' interactions with books, namely their ratings, we can apply collaborative filtering to these interactions, based on the idea that users who have agreed in the past will agree in the future.

In our project, the Alternating Least Squares (ALS) algorithm is used to identify patterns in both users and books. ALS factorizes the large user-item interaction matrix into two lower-dimensional matrices that capture the latent factors of users and items. The goal of ALS is to minimize the difference between the actual ratings and the predictions derived from the latent factors.
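
One common way to write this objective, for explicitly observed ratings, is sketched below. The symbols are notation introduced here rather than taken from the notebook: $x_u$ and $y_i$ are the latent vectors of user $u$ and book $i$, $r_{ui}$ is the observed rating, $\mathcal{K}$ is the set of observed (user, book) pairs, and $\lambda$ is the regularization strength.

$$
\min_{X, Y} \sum_{(u,i) \in \mathcal{K}} \left( r_{ui} - x_u^\top y_i \right)^2 + \lambda \left( \sum_u \lVert x_u \rVert^2 + \sum_i \lVert y_i \rVert^2 \right)
$$

With $Y$ held fixed, the problem is a ridge regression in each $x_u$, and with $X$ held fixed it is one in each $y_i$, which is why the two sets of factors can be updated alternately. Implicit-feedback variants of ALS, such as the one in the Implicit library, weight every user-item pair by a confidence value instead of summing over observed ratings only.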


1. Data preparation 

First, user IDs and ISBNs are mapped to contiguous integer indices (0 to n-1), which is required to build the sparse rating matrix.
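
As a small illustration of this mapping, the sketch below assigns indices in order of first appearance and also builds the inverse lookup needed to turn recommended indices back into ISBNs. The sample IDs and the helper `build_index` are illustrative assumptions, not part of the notebook.

```python
def build_index(values):
    """Map each distinct value to 0..n-1 in order of first appearance."""
    index = {}
    for value in values:
        index.setdefault(value, len(index))
    return index

# Illustrative ratings columns (assumed sample values)
raw_user_ids = [276725, 276726, 276725, 276729]
raw_isbns = ["034545104X", "0155061224", "0155061224", "052165615X"]

user_index = build_index(raw_user_ids)
book_index = build_index(raw_isbns)

# Row and column coordinates for the sparse user-by-book matrix
rows = [user_index[u] for u in raw_user_ids]
cols = [book_index[b] for b in raw_isbns]

# Inverse mapping: one dictionary lookup per recommended index
index_to_isbn = {idx: isbn for isbn, idx in book_index.items()}

print(rows, cols, index_to_isbn[1])
```

Building the inverse dictionary once avoids scanning the mapping's keys and values for every recommended book.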

In [None]:
import numpy as np
from scipy.sparse.linalg import svds
import pandas as pd
import warnings
warnings.filterwarnings("ignore")

In [None]:
ratings = pd.read_csv('./content_Ratings.csv')
users = pd.read_csv('./content_Users.csv')
dtype_spec = {
    'ISBN': str,
    'Book-Title': str,
    'Book-Author': str,
    'Year-Of-Publication': str,
    'Publisher': str,
    'Image-URL-S': str,
    'Image-URL-M': str,
    'Image-URL-L': str,
    'Description': str,
    'Categories': str
}
books = pd.read_csv('./combined_books.csv', dtype=dtype_spec, low_memory=False)


In [None]:
ratings_head = ratings.head()
books_head = books.head()
users_head = users.head()

ratings_head, books_head, users_head

In [None]:
def preprocess_users(users):
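    # Assign the result instead of using inplace=True on a column selection,
    # which newer pandas versions warn about and may not apply to the DataFrame
    users['Age'] = users['Age'].fillna(users['Age'].median()).astype(int)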
    return users

def preprocess_books(books):
    books.fillna('', inplace=True)
    books['Year-Of-Publication'] = pd.to_numeric(books['Year-Of-Publication'], errors='coerce').fillna(0).astype(int)
    return books
users = preprocess_users(users)
books = preprocess_books(books)

2. Train-Test Split

The data is split into training and test sets. To ensure that all users and books are represented in the training set, any users or books that are only present in the test set are added back to the training set.
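
A minimal stdlib sketch of such a split is shown below, assuming ratings are given as `(user, item, rating)` tuples. The function name `split_with_coverage` is hypothetical. The sketch moves uncovered rows out of the test set, so that no rating appears in both sets.

```python
import random

def split_with_coverage(rows, test_fraction=0.2, seed=42):
    """Split (user, item, rating) rows so every test user and item also occurs in train."""
    rng = random.Random(seed)
    shuffled = rows[:]
    rng.shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    test, train = shuffled[:n_test], shuffled[n_test:]

    train_users = {user for user, _, _ in train}
    train_items = {item for _, item, _ in train}
    kept_test = []
    for row in test:
        user, item, _ = row
        if user in train_users and item in train_items:
            kept_test.append(row)
        else:
            # Move the row to train so its user and item get a learned representation
            train.append(row)
            train_users.add(user)
            train_items.add(item)
    return train, kept_test

# Toy data: 6 users x 4 items, plus one user and item that occur only once
rows = [(u, i, 5) for u in range(6) for i in range(4)] + [(99, 77, 8)]
train, test = split_with_coverage(rows)
print(len(train), len(test))
```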

In [None]:
from scipy.sparse import csr_matrix
from implicit.als import AlternatingLeastSquares
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

user_id_mapping = {id: idx for idx, id in enumerate(ratings['User-ID'].unique())}
book_id_mapping = {id: idx for idx, id in enumerate(ratings['ISBN'].unique())}

ratings['User-ID'] = ratings['User-ID'].map(user_id_mapping)
ratings['ISBN'] = ratings['ISBN'].map(book_id_mapping)
ratings.dropna(subset=['Book-Rating'], inplace=True)
all_user_ids = ratings['User-ID'].unique()
all_book_ids = ratings['ISBN'].unique()

train_data, test_data = train_test_split(ratings, test_size=0.2, random_state=42)

train_user_ids = set(train_data['User-ID'])
train_book_ids = set(train_data['ISBN'])
missing_users = set(all_user_ids) - train_user_ids
missing_books = set(all_book_ids) - train_book_ids

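missing_data = ratings[ratings['User-ID'].isin(missing_users) | ratings['ISBN'].isin(missing_books)]
train_data = pd.concat([train_data, missing_data]).drop_duplicates()
# Remove the moved rows from the test set so no rating is used for both training and testing
test_data = test_data.drop(index=missing_data.index, errors='ignore')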

n_users = ratings['User-ID'].nunique()
n_items = ratings['ISBN'].nunique()
train_matrix = csr_matrix((train_data['Book-Rating'], (train_data['User-ID'], train_data['ISBN'])), shape=(n_users, n_items))
test_matrix = csr_matrix((test_data['Book-Rating'], (test_data['User-ID'], test_data['ISBN'])), shape=(n_users, n_items))
print("-------matrix finished---------")

if np.any(np.isnan(train_matrix.data)):
    print("NaN values found in training matrix")
else:
    print("No NaN values in training matrix")

3. Create Sparse Matrices

Sparse matrices for the training and test data are created in the cell above. These matrices are used by ALS to learn the latent factors. We also check whether the training matrix contains any NaN values, since these would cause errors during matrix factorization.

4. Train the ALS Model

An ALS model is initialized and trained on the training data. The model learns latent factors for users and books. We use 50 latent factors, a regularization of 0.1, and 50 iterations.

In [None]:
als_model = AlternatingLeastSquares(factors=50, regularization=0.1, iterations=50, use_gpu=False, calculate_training_loss=True)

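# implicit >= 0.5 expects a user-by-item matrix here, which matches the per-user row passed
# to recommend() further down; for implicit < 0.5, pass train_matrix.T (item-by-user) instead
als_model.fit(train_matrix, show_progress=True)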
print("-----------------fitting finished-----------------")

5. Evaluate the Model

The model is evaluated using the test data by calculating the Root Mean Square Error (RMSE) between the predicted ratings and the actual ratings in the test set.

We first extract the non-zero entries of the test matrix and iterate over those user-item pairs, checking that each index is within bounds before computing the prediction as the dot product of the user and item factors. If no predictions could be made, the function returns infinity as a safeguard. Otherwise, the RMSE is computed over all collected pairs.

In our experiment, Test RMSE is: 7.844746002414569
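
The calculation can be illustrated with a stdlib-only sketch. The factor values and the helper name `rmse_from_factors` are made up for the example. It assumes predictions are plain dot products of user and item factors, as in the evaluation function below.

```python
import math

def rmse_from_factors(test_ratings, user_factors, item_factors):
    """RMSE over (user, item, rating) triples; out-of-range indices are skipped."""
    squared_errors = []
    for user, item, rating in test_ratings:
        if user >= len(user_factors) or item >= len(item_factors):
            continue
        prediction = sum(p * q for p, q in zip(user_factors[user], item_factors[item]))
        squared_errors.append((rating - prediction) ** 2)
    if not squared_errors:
        return float("inf")  # no prediction could be made
    return math.sqrt(sum(squared_errors) / len(squared_errors))

# Toy factors (assumption): 2 users, 2 items, 2 latent dimensions
user_factors = [[1.0, 2.0], [0.5, 1.0]]
item_factors = [[2.0, 1.0], [1.0, 3.0]]

# The third triple refers to an unknown user and is skipped
test_ratings = [(0, 0, 5.0), (1, 1, 3.0), (5, 0, 4.0)]
rmse = rmse_from_factors(test_ratings, user_factors, item_factors)
print(rmse)
```

An RMSE is only meaningful when the predictions and the ratings share a scale, so it helps to compare the range of a few predictions against the rating range before interpreting the number.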

In [None]:
def evaluate_model(test_matrix, als_model):
    test_user_items = test_matrix.nonzero()
    predictions = []
    ground_truth = []
    for user, item in zip(test_user_items[0], test_user_items[1]):
        if user < als_model.user_factors.shape[0] and item < als_model.item_factors.shape[0]:
            prediction = als_model.user_factors[user, :].dot(als_model.item_factors[item, :].T)
            predictions.append(prediction)
            ground_truth.append(test_matrix[user, item])
    if len(predictions) == 0:
        return float('inf')  
    return np.sqrt(mean_squared_error(ground_truth, predictions))

In [None]:
# Evaluate on the held-out test matrix rather than the training matrix
rmse = evaluate_model(test_matrix, als_model)
print(f"Test RMSE: {rmse}")

6. Generate Recommendations

Since recommendations are based on a user's past actions, we first check that the user is in our mapping, then retrieve that user's ratings from the training matrix and pass them to the model's recommend method so that books the user has already rated are filtered out. The recommendation itself is produced by the Implicit library.

In [None]:
def recommend_books_als(user_id, num_recommendations=5):
    if user_id not in user_id_mapping:
        return [] 
    user_index = user_id_mapping[user_id]
    user_ratings = train_matrix[user_index]
    recommended_books = als_model.recommend(user_index, user_ratings, N=num_recommendations, filter_already_liked_items=True)
    # recommended_book_ids = [list(book_id_mapping.keys())[list(book_id_mapping.values()).index(i)] for i, _ in recommended_books]
    # return books[books['ISBN'].isin(recommended_book_ids)]
    return recommended_books

7. Conclusion for ALS:

By using ALS, we learn latent factors for both users and books, which exposes hidden patterns and relationships in the rating data and reduces the high-dimensional user-item matrix to two lower-dimensional ones. The RMSE reported above (about 7.84) should be read with care: the Implicit library's ALS is designed for implicit feedback, so its scores are not on the same scale as the ratings, and ranking metrics are a more natural measure of its quality. The learned factors can be used to predict unknown interactions, to generate recommendations for users, and to compute similarities between users and between items.

## Recommendation based on embedding

We use learned embeddings to compute similarities between users and books, and generate recommendations from those similarities.

In [None]:
import numpy as np
from scipy.sparse.linalg import svds
import pandas as pd
import warnings
from scipy.sparse import csr_matrix
from implicit.als import AlternatingLeastSquares
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
from sklearn.preprocessing import StandardScaler
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import DataLoader, TensorDataset
warnings.filterwarnings("ignore")

Data preprocessing is similar:

1. Handling Missing Values:

Fill missing ages with the median age.
Convert 'Year-Of-Publication' to numeric, filling missing values with the median year.

2. Encoding Categorical Variables:

Convert 'Location', 'Book-Author', and 'Publisher' to categorical codes.

3. Normalizing Numerical Features:

Standardize 'Age' and 'Year-Of-Publication' using StandardScaler to ensure these features are on a similar scale.
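
The two transformations can be sketched with the standard library alone. This is an illustration under stated assumptions, not the notebook's code: `encode_categories` assigns codes in order of first appearance (pandas category codes follow sorted category order instead), and `standardize` uses the population standard deviation, as `StandardScaler` does. Both helper names and the sample values are hypothetical.

```python
from statistics import mean, pstdev

def encode_categories(values):
    """Replace each distinct value by an integer code 0..n-1 (first-appearance order)."""
    codes = {}
    encoded = [codes.setdefault(value, len(codes)) for value in values]
    return encoded, codes

def standardize(values):
    """Shift to zero mean and scale to unit (population) standard deviation."""
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:
        return [0.0 for _ in values]
    return [(value - mu) / sigma for value in values]

# Toy columns (assumed sample values)
publishers = ["Penguin", "Tor", "Penguin", "Ace"]
ages = [20.0, 30.0, 40.0]

publisher_codes, publisher_lookup = encode_categories(publishers)
scaled_ages = standardize(ages)

print(publisher_codes)        # integer codes usable as embedding indices
print(len(publisher_lookup))  # number of distinct categories = embedding table size
print(scaled_ages)
```

The number of distinct codes is what an embedding table needs as its size, and every code must stay below that size or the lookup fails.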

In [None]:
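# Assumes `ratings`, `books`, and `users` still hold the original IDs
# (reload them if an earlier cell has already remapped them)
ratings['ISBN'] = ratings['ISBN'].astype(str)
books['ISBN'] = books['ISBN'].astype(str)

# Build the mappings after the string conversion so the keys match the values being mapped
user_id_mapping = {raw_id: idx for idx, raw_id in enumerate(ratings['User-ID'].unique())}
book_id_mapping = {isbn: idx for idx, isbn in enumerate(ratings['ISBN'].unique())}

ratings['User-ID'] = ratings['User-ID'].map(user_id_mapping)
ratings['ISBN'] = ratings['ISBN'].map(book_id_mapping)
books['ISBN'] = books['ISBN'].map(book_id_mapping)
# Map users the same way so the later merges on 'User-ID' join on matching indices
users['User-ID'] = users['User-ID'].map(user_id_mapping)
ratings.dropna(subset=['Book-Rating'], inplace=True)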
# ratings = ratings[ratings['ISBN'].isin(books['ISBN']) & ratings['User-ID'].isin(users['User-ID'])]

all_user_ids = ratings['User-ID'].unique()
all_book_ids = ratings['ISBN'].unique()

train_data, test_data = train_test_split(ratings, test_size=0.2, random_state=42)

train_user_ids = set(train_data['User-ID'])
train_book_ids = set(train_data['ISBN'])
missing_users = set(all_user_ids) - train_user_ids
missing_books = set(all_book_ids) - train_book_ids

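missing_data = ratings[ratings['User-ID'].isin(missing_users) | ratings['ISBN'].isin(missing_books)]
train_data = pd.concat([train_data, missing_data]).drop_duplicates()
# Remove the moved rows from the test set so no rating is used for both training and testing
test_data = test_data.drop(index=missing_data.index, errors='ignore')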

n_users = ratings['User-ID'].nunique()
n_items = ratings['ISBN'].nunique()
train_matrix = csr_matrix((train_data['Book-Rating'], (train_data['User-ID'], train_data['ISBN'])), shape=(n_users, n_items))
test_matrix = csr_matrix((test_data['Book-Rating'], (test_data['User-ID'], test_data['ISBN'])), shape=(n_users, n_items))
print("-------matrix finished---------")

if np.any(np.isnan(train_matrix.data)):
    print("NaN values found in training matrix")
else:
    print("No NaN values in training matrix")

In [None]:
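# Use a separate scaler per feature so each keeps its own fitted mean and variance
age_scaler = StandardScaler()
year_scaler = StandardScaler()
users[['Age']] = age_scaler.fit_transform(users[['Age']])
books[['Year-Of-Publication']] = year_scaler.fit_transform(books[['Year-Of-Publication']])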


#### Embedding Model
We use an embedding-based neural network model to learn representations of users and items. 

#### Architecture:

Embedding Layers: Dense representations for users, books, locations, authors, and publishers.

Linear Layers: Transform age and year of publication into vectors of the same size.

Concatenation Layer: Combines all seven vectors into a single dense vector of size 7 × embedding_dim.

Output: The concatenated vector; the training loop sums its entries to produce the predicted rating.

In [None]:
class EmbeddingNet(nn.Module):
    def __init__(self, num_users, num_books, embedding_dim, num_locations, num_authors, num_publishers):
    # def __init__(self, num_users, num_books, embedding_dim, num_authors):
        super(EmbeddingNet, self).__init__()
        self.user_embedding = nn.Embedding(num_users, embedding_dim)
        self.book_embedding = nn.Embedding(num_books, embedding_dim)
        self.location_embedding = nn.Embedding(num_locations, embedding_dim)
        self.author_embedding = nn.Embedding(num_authors, embedding_dim)
        self.publisher_embedding = nn.Embedding(num_publishers, embedding_dim)
        self.user_age = nn.Linear(1, embedding_dim)
        self.book_year = nn.Linear(1, embedding_dim)
    
    def forward(self, user_id, book_id, location_id, age, author_id, year, publisher_id):
    # def forward(self, user_id, book_id, age, author_id, year):
        # Inputs are batched, so each lookup is already (batch, embedding_dim); squeezing
        # would drop the batch dimension whenever a batch contains a single sample
        user_embed = self.user_embedding(user_id)
        book_embed = self.book_embedding(book_id)
        location_embed = self.location_embedding(location_id)
        author_embed = self.author_embedding(author_id)
        publisher_embed = self.publisher_embedding(publisher_id)
        age_embed = self.user_age(age)
        year_embed = self.book_year(year)
        return torch.cat([user_embed, book_embed, location_embed, age_embed, author_embed, year_embed, publisher_embed], dim=-1)
        # return torch.cat([user_embed, book_embed, author_embed, age_embed, year_embed], dim=-1)

# train_data = train_data.merge(users[['User-ID', 'Age']], on='User-ID')
# train_data = train_data.merge(books[['ISBN', 'Book-Author', 'Year-Of-Publication']], on='ISBN')
train_data = train_data.merge(users[['User-ID', 'Age', 'Location']], on='User-ID')
train_data = train_data.merge(books[['ISBN', 'Book-Author', 'Year-Of-Publication', 'Publisher']], on='ISBN')

# Build tensors directly from the columns. This assumes 'Location', 'Book-Author', and
# 'Publisher' already hold integer category codes, as described in the preprocessing notes.
user_ids = torch.tensor(train_data['User-ID'].values).long()
book_ids = torch.tensor(train_data['ISBN'].values).long()
# Named rating_values so the `ratings` DataFrame is not overwritten
rating_values = torch.tensor(train_data['Book-Rating'].values).float()
ages = torch.tensor(train_data['Age'].values).float().unsqueeze(1)  # shape (N, 1) for nn.Linear(1, dim)
locations = torch.tensor(train_data['Location'].values).long()
authors = torch.tensor(train_data['Book-Author'].values).long()
years = torch.tensor(train_data['Year-Of-Publication'].values).float().unsqueeze(1)
publishers = torch.tensor(train_data['Publisher'].values).long()

dataset = TensorDataset(user_ids, book_ids, rating_values, ages, locations, authors, years, publishers)

dataloader = DataLoader(dataset, batch_size=64, shuffle=True)

#### Model Training

The model is trained to minimize the mean squared error (MSE) between the predicted and actual user ratings.
We create a DataLoader to batch and shuffle the training data, then update the model parameters with backpropagation and the Adam optimizer.

In [None]:
embedding_dim = 10
num_users = len(user_id_mapping)
num_books = len(book_id_mapping)
num_locations = users['Location'].nunique()
num_authors = books['Book-Author'].nunique()
num_publishers = books['Publisher'].nunique()

print(f"{num_users} users, {num_books} books")
print(f"dataloader: {len(dataloader)}")

model = EmbeddingNet(num_users, num_books, embedding_dim, num_locations, num_authors, num_publishers)
# model = EmbeddingNet(num_users, num_books, embedding_dim, num_authors)
optimizer = optim.Adam(model.parameters(), lr=0.001)
criterion = nn.MSELoss()


In [None]:
num_epochs = 20
for epoch in range(num_epochs):
    print(f'epoch{epoch} starts:')
    model.train()
    epoch_loss = 0
    for i, batch in enumerate(dataloader):
        user_id, book_id, rating, age, location_id, author_id, year, publisher_id = batch
        
        user_id = user_id.long()
        book_id = book_id.long()
        rating = rating.float()
        
        embedding = model(user_id, book_id, location_id, age, author_id, year, publisher_id)
        # embedding = model(user_id, book_id, age, author_id, year)

        output = torch.sum(embedding, dim=1)
        # print(f'Rating: {rating}, Output: {output}')
        
        optimizer.zero_grad()
        # output = embedding.dot(embedding)
        loss = criterion(output, rating)
        loss.backward()
        optimizer.step()
        epoch_loss += loss.item()
        # print(f"{i}/{len(dataloader)} in epoch {epoch}")
    # loss.item() is already a per-batch mean, so average over the number of batches
    print(f'Epoch {epoch+1}/{num_epochs}, Loss: {epoch_loss/len(dataloader)}')

torch.save(model.state_dict(), 'embedding_model.pth')

user_embeddings = model.user_embedding.weight.detach().cpu().numpy()
book_embeddings = model.book_embedding.weight.detach().cpu().numpy()

np.save('user_embeddings.npy', user_embeddings)
np.save('book_embeddings.npy', book_embeddings)

#### Recommendation Generation
Once the model is trained, we can generate recommendations for users. The process involves:

1. Precomputing Book Embeddings:

Compute and store embeddings for all books in the dataset to speed up similarity calculations during recommendation.

2. Generating Recommendations:

For each user, compute the cosine similarity between the user’s embedding and all book embeddings.
Sort books by similarity score and select the top N recommendations.
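
The ranking step can be sketched as follows, using only the standard library. The embedding values are invented for the example, and the helper `top_n_books` and its `already_rated` filter are hypothetical additions, assuming books the user has already rated should not be recommended again.

```python
import heapq
import math

def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def top_n_books(user_vec, book_vecs, already_rated, n=10):
    """Return the n book IDs most similar to the user, skipping already rated books."""
    scored = (
        (cosine(user_vec, vec), book_id)
        for book_id, vec in book_vecs.items()
        if book_id not in already_rated
    )
    # nlargest avoids sorting the full catalogue when only n results are needed
    return [book_id for _, book_id in heapq.nlargest(n, scored)]

# Toy embeddings (assumption): book index -> 2-dimensional vector
book_vecs = {0: [1.0, 0.1], 1: [0.0, 1.0], 2: [0.9, 0.5], 3: [1.0, 0.0]}
user_vec = [1.0, 0.0]

recommended = top_n_books(user_vec, book_vecs, already_rated={3}, n=2)
print(recommended)
```

For a large catalogue, a single matrix product between the user vector and a normalized book-embedding matrix would usually be faster than the per-book loop shown here.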

In [None]:
# Preprocess test data
test_data = test_data.merge(users[['User-ID', 'Age', 'Location']], on='User-ID')
test_data = test_data.merge(books[['ISBN', 'Book-Author', 'Year-Of-Publication', 'Publisher']], on='ISBN')

# Build tensors directly from the columns (categorical columns are assumed to hold integer codes)
user_ids_test = torch.tensor(test_data['User-ID'].values).long()
book_ids_test = torch.tensor(test_data['ISBN'].values).long()
ratings_test = torch.tensor(test_data['Book-Rating'].values).float()
ages_test = torch.tensor(test_data['Age'].values).float().unsqueeze(1)  # shape (N, 1)
locations_test = torch.tensor(test_data['Location'].values).long()
authors_test = torch.tensor(test_data['Book-Author'].values).long()
years_test = torch.tensor(test_data['Year-Of-Publication'].values).float().unsqueeze(1)
publishers_test = torch.tensor(test_data['Publisher'].values).long()

test_dataset = TensorDataset(user_ids_test, book_ids_test, ratings_test, ages_test, locations_test, authors_test, years_test, publishers_test)
test_dataloader = DataLoader(test_dataset, batch_size=64, shuffle=False)


#### Evaluation
The recommendation system is evaluated using precision and recall metrics. The evaluation process includes:

1. Defining User Preferences:
Use the embeddings of books that the user has rated to define their preferences.

2. Calculating Similarity:
Compute the cosine similarity between the embeddings of the recommended books and the books the user has rated.

3. Evaluating Recommendations:
Determine if the recommended books are liked by the user based on the similarity scores.
Identify books that the user might like but were not recommended.

4. Calculating Metrics:
Compute average precision and recall across all users in the test set.

In [None]:
def evaluate_model(model, dataloader, criterion):
    model.eval()
    total_loss = 0
    with torch.no_grad():
        for batch in dataloader:
            user_id, book_id, rating, age, location_id, author_id, year, publisher_id = batch

            user_id = user_id.long()
            book_id = book_id.long()
            rating = rating.float()

            embedding = model(user_id, book_id, location_id, age, author_id, year, publisher_id)
            output = torch.sum(embedding, dim=1)

            loss = criterion(output, rating)
            # criterion returns the batch mean, so weight it by the batch size before summing
            total_loss += loss.item() * rating.size(0)
    return np.sqrt(total_loss / len(dataloader.dataset))

In [None]:
embedding_dim = 10
num_users = len(user_id_mapping)
num_books = len(book_id_mapping)
num_locations = users['Location'].nunique()
num_authors = books['Book-Author'].nunique()
num_publishers = books['Publisher'].nunique()

model = EmbeddingNet(num_users, num_books, embedding_dim, num_locations, num_authors, num_publishers)
model.load_state_dict(torch.load('embedding_model.pth'))
model.eval()

# Only the learned book embedding is needed here; author, year, and publisher are not used.
# Note: `books.index` is the DataFrame's row index, so this check keeps book indices that
# are also valid row labels rather than testing for a matching ISBN.
book_embeddings = {}
with torch.no_grad():
    for book_id in book_id_mapping.values():
        if book_id in books.index:
            book_idx = torch.tensor([book_id]).long()
            book_embeddings[book_id] = model.book_embedding(book_idx).squeeze().numpy()

In [None]:
# Initialize the criterion
criterion = nn.MSELoss()

# Evaluate the model
rmse = evaluate_model(model, test_dataloader, criterion)
print(f'Test RMSE: {rmse}')

In [None]:
# def precision_at_k(actual, predicted, k):
#     if len(predicted) > k:
#         predicted = predicted[:k]
#     return len(set(predicted) & set(actual)) / float(k)

# def recall_at_k(actual, predicted, k):
#     if len(predicted) > k:
#         predicted = predicted[:k]
#     return len(set(predicted) & set(actual)) / float(len(actual))

def recommend_books(user_index, user_location, user_age, model, num_recommendations=15):
    scores = {}
    user_embed = model.user_embedding(user_index).squeeze().detach().numpy()
    
    for book_id, book_embed in book_embeddings.items():
        score = cosine_similarity([user_embed], [book_embed])[0][0]
        scores[book_id] = score

    recommended_books = sorted(scores, key=scores.get, reverse=True)[:num_recommendations]
    return recommended_books


def evaluate_model2(test_dataloader, model, k=15):
    precisions = []
    recalls = []
    
    for batch in test_dataloader:
        user_ids, book_ids, ratings, ages, locations, authors, years, publishers = batch
        for i, user_id in enumerate(user_ids):
            user_index = user_ids[i].unsqueeze(0)
            user_location = locations[i].unsqueeze(0)
            user_age = ages[i].unsqueeze(0)
            
            actual_books = test_data[test_data['User-ID'] == user_id.item()]['ISBN'].values
            if len(actual_books) == 0:
                continue
            recommended_books = recommend_books(user_index, user_location, user_age, model, num_recommendations=k)
            # test_data['ISBN'] already holds mapped indices, so look them up directly
            actual_book_embeddings = np.array([book_embeddings[book] for book in actual_books if book in book_embeddings])
            recommended_book_embeddings = np.array([book_embeddings[book_id] for book_id in recommended_books])
            if actual_book_embeddings.size == 0 or recommended_book_embeddings.size == 0:
                continue
            
            similarities = cosine_similarity(recommended_book_embeddings, actual_book_embeddings)
            
            liked_books = (similarities > 0.6).sum(axis=1) > 0  # Define liked books based on similarity threshold
            
            precision = liked_books.sum() / len(recommended_books)
            recall = liked_books.sum() / len(actual_books)
            
            precisions.append(precision)
            recalls.append(recall)
    
    avg_precision = np.mean(precisions)
    avg_recall = np.mean(recalls)
    
    return avg_precision, avg_recall


In [None]:
avg_precision, avg_recall = evaluate_model2(test_dataloader, model, k=15)
print(f'Average Precision@15: {avg_precision}')
print(f'Average Recall@15: {avg_recall}')