I've always wondered how Instagram keeps showing me reels that are very similar to the ones I've already watched, to keep me engaged. The same goes for YouTube, and for Flipkart with products...

So I wanted to try making a recommendation system for books.

There are a few different types of recommendation systems:
- popularity based: everyone gets the same top product recommendations
- content based: based on what you're looking at or have liked so far
- collaborative: based on what other people, who watched the same content you did, are watching
- hybrid: a mix of the above

This is a content-based recommendation system: based on what you like, you get recommended similar books.

First, let's get the data for books. I found a dataset that contains data on around 10k books.

I found it here: https://github.com/malcolmosh/goodbooks-10k-extended?tab=readme-ov-file

In [None]:
import pandas as pd

from ast import literal_eval

books_df = pd.read_csv('https://raw.githubusercontent.com/malcolmosh/goodbooks-10k/master/books_enriched.csv', index_col=[1], converters={"genres": literal_eval}).drop(columns=["Unnamed: 0"])

In [None]:
#take a look at the data
books_df.head()

In [None]:
books_df.info()

It has a lot of information about each book.

But let's keep only the columns we need.

In [None]:
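# .copy() gives an independent DataFrame, so adding columns later does not raise SettingWithCopyWarning
data_mini = books_df[['title','authors','average_rating','ratings_count','genres','description','image_url']].copy()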

In [None]:
data_mini.head()

In [None]:
data_mini.info()

In [None]:

data_mini.sort_values(by='average_rating',ascending=False).head(15)



So far we sorted by average rating. A high average alone doesn't tell us how many people rated the book.
What if, to capture popularity, we choose a mix of both average_rating and ratings_count?

To do that, we can define another metric, 'popularity' = 0.5*norm(average_rating) + 0.5*norm(ratings_count), where norm(x) = (x - min(x))/(max(x) - min(x)) rescales each column to the range 0 to 1.
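To make the metric concrete, here is a minimal sketch on a tiny made-up table. The three books and their numbers are invented for illustration, and `min_max` is a hypothetical helper, not something from the dataset or from pandas.

```python
import pandas as pd

def min_max(col: pd.Series) -> pd.Series:
    """Rescale a numeric column to the range [0, 1]."""
    span = col.max() - col.min()
    if span == 0:
        # all values are equal, so there is nothing to rank on
        return pd.Series(0.0, index=col.index)
    return (col - col.min()) / span

# made-up data: A is loved by few readers, B is liked by very many
toy = pd.DataFrame({
    "title": ["A", "B", "C"],
    "average_rating": [4.8, 4.0, 3.0],
    "ratings_count": [100, 1_000_000, 500_000],
})

# equal-weight blend of the two rescaled columns
toy["popularity"] = 0.5 * min_max(toy["average_rating"]) + 0.5 * min_max(toy["ratings_count"])
print(toy.sort_values("popularity", ascending=False))
```

In this toy table, book B ends up ahead of book A even though A has the higher rating, because B's rating count is so much larger.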



In [None]:
# range of each column, used for min-max normalization
max_ratings_count = data_mini['ratings_count'].max()
min_ratings_count = data_mini['ratings_count'].min()
max_rating = data_mini['average_rating'].max()
min_rating = data_mini['average_rating'].min()

print(max_ratings_count)
print(min_ratings_count)
print(max_rating)
print(min_rating)

# popularity: equal-weight blend of the two min-max normalized columns (each in the range 0 to 1)
norm_rating = (data_mini['average_rating']-min_rating)/(max_rating-min_rating)
norm_count = (data_mini['ratings_count']-min_ratings_count)/(max_ratings_count-min_ratings_count)
data_mini['popularity'] = 0.5*norm_rating + 0.5*norm_count

In [None]:
top_rated = data_mini.sort_values(by ='popularity',ascending = False)

In [None]:
top_rated.head(15)

In [None]:
book_recs_by_popularity = top_rated.head(15)[['title','authors']]
book_recs_by_popularity

Voila! That right there is the popularity-based recommendation.

Let's look at content-based recommendation next. You input what you like, and you get recommended what you might like based on that.

In [None]:
data_mini[data_mini['description'].isnull()]

In [None]:
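# preview the rows with missing descriptions, with NaN shown as empty strings
# (this returns a new frame and does not modify data_mini; the NaN descriptions are filled later, when vectorizing)
data_mini[data_mini['description'].isnull()].fillna("")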

In [None]:
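#look at all the genres and count them

# explode() turns each list of genres into one row per genre, which avoids concatenating ~10k lists with sum()
all_genres = set(data_mini['genres'].explode().dropna())
all_genres, len(all_genres)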

In [None]:
#look at all the authors
#need a little change here

set(data_mini['authors']),len(set(data_mini['authors']))

In [None]:
#look at the authors of a particular row

# neither a str nor a list has .tolist(), so print the value directly
print(data_mini['authors'].iloc[9997])

Now, to convert the features into vectors, we use TF-IDF for the description and multi-hot encoding (MHE) for authors and genres.
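As a quick illustration of what these two encoders produce, here is a small sketch on made-up inputs (the genre lists and the two short descriptions are invented). It also shows a pitfall worth checking on the real columns: `MultiLabelBinarizer` expects each entry to be a list of labels, and a plain string is treated as a sequence of characters.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import MultiLabelBinarizer

# made-up genre lists for three books
genres = [["fantasy", "young-adult"], ["fantasy"], ["romance", "young-adult"]]

mlb = MultiLabelBinarizer()
multi_hot = mlb.fit_transform(genres)
print(mlb.classes_)   # the label behind each column
print(multi_hot)      # one row per book, 1 where the book has that label

# pitfall: strings are iterated character by character
chars = MultiLabelBinarizer().fit(["ab", "bc"]).classes_
print(chars)

# TF-IDF: one row per description, one column per word in the vocabulary
tfidf = TfidfVectorizer()
desc = tfidf.fit_transform(["a wizard school", "a love story"])
print(desc.shape, sorted(tfidf.vocabulary_))
```

If the author or genre column holds strings rather than lists, it needs to be converted to lists before multi-hot encoding; otherwise the columns of the resulting matrix are characters, not names.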

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import MultiLabelBinarizer, OneHotEncoder, normalize
from scipy.sparse import hstack

In [None]:
#description
tfidf_desc = TfidfVectorizer(stop_words="english", max_features=5000)
desc_matrix = tfidf_desc.fit_transform(data_mini["description"].fillna(""))


In [None]:
desc_matrix

In [None]:
#multi hot for genres
mlb1 = MultiLabelBinarizer()
genre_matrix = mlb1.fit_transform(data_mini["genres"])


In [None]:
genre_matrix.shape

In [None]:
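#multi hot for authors
# note: MultiLabelBinarizer expects each entry to be a list of names;
# if the authors column holds plain strings, convert them to lists first,
# otherwise each string is split into single characters
mlb2 = MultiLabelBinarizer()
author_matrix = mlb2.fit_transform(data_mini["authors"])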


In [None]:
author_matrix.shape

In [None]:
# Sparse matrix stacking (keeps memory efficient)
combined_matrix = hstack([desc_matrix, genre_matrix, author_matrix])

# Normalize for cosine similarity
combined_matrix = normalize(combined_matrix)

In [None]:
from sklearn.metrics.pairwise import cosine_similarity

similarity = cosine_similarity(combined_matrix)


In [None]:

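# Example: books most similar to a chosen book (here, the one at row position 2)
book_idx = 2
similar_scores = list(enumerate(similarity[book_idx]))
sorted_scores = sorted(similar_scores, key=lambda x: x[1], reverse=True)

# .iloc is used because the similarity matrix is indexed by row position, not by the DataFrame's index labels
print("Recommendations for:", data_mini['title'].iloc[book_idx], "genre: ",data_mini['genres'].iloc[book_idx])
for idx, score in sorted_scores[1:50]:  # skip the first entry, which is normally the book itself
    print(f"  {data_mini['title'].iloc[idx]} (score={score:.2f})","genre: ",data_mini['genres'].iloc[idx])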


So, this works fine. We can see that books similar to the one we highlighted are shown.
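A side note on cost: the lookup above only needs one row of the similarity matrix, so the full n x n matrix is not strictly required. The sketch below shows one way to get the top neighbours of a single book directly. `top_k_similar` is a hypothetical helper and the toy feature matrix is made up; it assumes a dense NumPy array with no all-zero rows.

```python
import numpy as np

def top_k_similar(features: np.ndarray, book_idx: int, k: int = 5) -> list[tuple[int, float]]:
    """Return (row position, cosine similarity) for the k rows most similar to features[book_idx]."""
    norms = np.linalg.norm(features, axis=1, keepdims=True)
    unit = features / np.clip(norms, 1e-12, None)   # L2-normalize each row
    scores = unit @ unit[book_idx]                  # dot product of unit vectors = cosine similarity
    scores[book_idx] = -np.inf                      # never recommend the query book itself
    best = np.argsort(scores)[::-1][:k]             # positions of the k largest scores
    return [(int(i), float(scores[i])) for i in best]

# made-up feature vectors for four books
toy = np.array([
    [1.0, 0.0, 0.0],
    [0.9, 0.1, 0.0],
    [0.5, 0.5, 0.0],
    [0.0, 0.0, 1.0],
])
result = top_k_similar(toy, book_idx=0, k=2)
print(result)
```

Masking the query row explicitly is a bit more robust than skipping the first sorted entry, since a duplicate book with an identical vector could otherwise take the first position.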


There is SBERT (Sentence-BERT), a transformer-based model that can represent the description better than TF-IDF, because it captures the meaning of the text rather than just word overlap. So let's use it.

In [None]:
from sentence_transformers import SentenceTransformer

# load SBERT
model = SentenceTransformer('all-MiniLM-L6-v2')



In [None]:
desc_embeds = model.encode(data_mini["description"].fillna(""))

In [None]:
#same encoding as the previous cell, but batched and with a progress bar; prefer this version next time

desc_embeds = model.encode(
    data_mini["description"].fillna("").tolist(),
    batch_size=32,      # tune: 16, 32, 64 depending on RAM
    show_progress_bar=True
)


In [None]:
#let's save the data_mini
#don't forget to mount the drive

data_mini.to_csv("/content/drive/MyDrive/data science practice/book_rec_system/data_mini_books.csv",index=False)

In [None]:
import numpy as np

# Save (run once, right after computing the embeddings)
# np.save("/content/drive/MyDrive/data science practice/book_rec_system/desc_embeds.npy", desc_embeds)

# Load later
desc_embeds = np.load("/content/drive/MyDrive/data science practice/book_rec_system/desc_embeds.npy")

In [None]:
desc_embeds.shape, genre_matrix.shape, author_matrix.shape

# similarity using sentence transformer 1



In [None]:
#multi hot for genres
mlb1 = MultiLabelBinarizer()
genre_matrix = mlb1.fit_transform(data_mini["genres"])

mlb2 = MultiLabelBinarizer()
author_matrix = mlb2.fit_transform(data_mini["authors"])

desc_embeds = np.load("/content/drive/MyDrive/data science practice/book_rec_system/desc_embeds.npy")

from scipy.sparse import csr_matrix, hstack

desc_sparse = csr_matrix(desc_embeds)   # convert dense SBERT embeddings to sparse

# Sparse matrix stacking (keeps memory efficient)
combined_matrix2 = hstack([desc_sparse, genre_matrix, author_matrix])

# Normalize for cosine similarity
combined_matrix2 = normalize(combined_matrix2)

book_ids = [121,36,44]

#profile vector: the average of the liked books' description embeddings
avg_book_embeds = desc_embeds[book_ids].mean(axis=0).reshape(1, -1)
print(avg_book_embeds.shape)

#cosine similarity of every book to the profile, flattened to shape (n_books,)
similarity3 = cosine_similarity(desc_embeds, avg_book_embeds).ravel()
sorted_similarity = sorted(enumerate(similarity3), key=lambda x: x[1], reverse=True)

to_choose = 50
to_show = 15

#print the loved books
print("loved books")
for book_id in book_ids:
  print(data_mini.iloc[book_id][['title','authors','genres']])

#keep the top candidates, leaving out the loved books themselves
top = [(idx, score) for idx, score in sorted_similarity[:to_choose] if idx not in book_ids]
rec_df = data_mini.iloc[[idx for idx, _ in top]][['title','authors','genres','average_rating']].reset_index(drop=True)
rec_df['similarity_score'] = [score for _, score in top]

print("recommended books")

alpha = 0.6 #giving weight to the rating

#note: average_rating goes up to 5 while similarity is at most 1, so the rating dominates this score unless it is rescaled
rec_df['final_score'] = rec_df['average_rating']*alpha + rec_df['similarity_score']*(1-alpha)
rec_df.sort_values(by = 'final_score',ascending = False).head(to_show)





In [None]:
combined_matrix2.shape

In [None]:
similarity2 = cosine_similarity(combined_matrix2)


In [None]:

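# Example: books most similar to a chosen book (here, the one at row position 2)
book_idx = 2
similar_scores = list(enumerate(similarity2[book_idx]))
sorted_scores = sorted(similar_scores, key=lambda x: x[1], reverse=True)

# .iloc is used because the similarity matrix is indexed by row position, not by the DataFrame's index labels
print("Recommendations for:", data_mini['title'].iloc[book_idx], "genre: ",data_mini['genres'].iloc[book_idx])
for idx, score in sorted_scores[1:50]:  # skip the first entry, which is normally the book itself
    print(f"  {data_mini['title'].iloc[idx]} (score={score:.2f})","genre: ",data_mini['genres'].iloc[idx])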

In [None]:
book_ids = [121]

In [None]:


#profile vector: the average of the liked books' description embeddings
avg_book_embeds = desc_embeds[book_ids].mean(axis=0).reshape(1, -1)
print(avg_book_embeds.shape)

#cosine similarity of every book to the profile, flattened to shape (n_books,)
similarity3 = cosine_similarity(desc_embeds, avg_book_embeds).ravel()
sorted_similarity = sorted(enumerate(similarity3), key=lambda x: x[1], reverse=True)

to_choose = 50
to_show = 15

#print the loved books
print("loved books")
for book_id in book_ids:
  print(data_mini.iloc[book_id][['title','authors','genres']])

#keep the top candidates, leaving out the loved books themselves
top = [(idx, score) for idx, score in sorted_similarity[:to_choose] if idx not in book_ids]
rec_df = data_mini.iloc[[idx for idx, _ in top]][['title','authors','genres','average_rating']].reset_index(drop=True)
rec_df['similarity_score'] = [score for _, score in top]

print("recommended books")

alpha = 0.6 #giving weight to the rating

#note: average_rating goes up to 5 while similarity is at most 1, so the rating dominates this score unless it is rescaled
rec_df['final_score'] = rec_df['average_rating']*alpha + rec_df['similarity_score']*(1-alpha)
rec_df.sort_values(by = 'final_score',ascending = False).head(to_show)



In [None]:
rec_df.head()

In [None]:
rec_df.sort_values(by = 'final_score',ascending = False)

In [None]:

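similarity3_list = list(enumerate(similarity3.ravel()))  # flatten so that each score is a plain number

sorted_similarity = sorted(similarity3_list,key = lambda x:x[1],reverse = True)


In [None]:
print(sorted_similarity[:20])  # show only the top 20 (position, score) pairs instead of all of them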

In [None]:
data_mini.iloc[63]['title']

In [None]:

to_choose = 50

#print the loved books
print("loved books")
for book_id in book_ids:
  print(data_mini.iloc[book_id][['title','authors','genres']])


print("recommended books")
#start from the top of the ranking; the loved books are filtered out explicitly below
for idx,score in sorted_similarity[:to_choose]:
  if idx not in book_ids:
    print('score: ',score,data_mini.iloc[idx]['title'])




In [None]:
data_mini.head()

So far, it works fairly well.
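The "average the liked books, then rank everything by cosine similarity" step used in the cells above can be sketched as one small function. This is only an illustration on made-up 2-D vectors; `recommend_from_likes` is a hypothetical helper, and it assumes a dense NumPy array with no all-zero rows.

```python
import numpy as np

def recommend_from_likes(embeddings: np.ndarray, liked_ids: list[int], k: int = 3) -> list[tuple[int, float]]:
    """Rank books by cosine similarity to the mean embedding of the liked books."""
    profile = embeddings[liked_ids].mean(axis=0)              # one vector summarizing the user's taste
    scores = (embeddings @ profile) / (
        np.linalg.norm(embeddings, axis=1) * np.linalg.norm(profile)
    )                                                         # cosine similarity of every book to the profile
    order = np.argsort(scores)[::-1]                          # best match first
    liked = set(liked_ids)
    ranked = [(int(i), float(scores[i])) for i in order if int(i) not in liked]
    return ranked[:k]                                         # liked books are filtered out, nothing else is skipped

# made-up embeddings for five books
toy = np.array([
    [1.0, 0.0],
    [0.8, 0.6],
    [0.6, 0.8],
    [0.0, 1.0],
    [-1.0, 0.0],
])
recs = recommend_from_likes(toy, liked_ids=[0, 1], k=2)
print(recs)
```

Filtering the liked books by their ids, rather than dropping the first entry of the ranking, matters when several books are averaged: the best match to the average is then not necessarily one of the liked books.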

But the model we used, 'all-MiniLM-L6-v2', while small and convenient, is not the most accurate. So I wanted to use a better model that captures the meaning more accurately.

That model is 'all-mpnet-base-v2'.

Now, instead of using MHE for genres and authors and then combining the result with the description embeddings, we can build one comprehensive description that includes the title, authors, and genres. This lets us avoid MHE altogether and gives richer embeddings overall.


In [None]:
import ast
import pandas as pd

def safe_list(value):
    """Convert stringified lists or messy author/genre fields into clean list of strings."""
    if isinstance(value, list):
        return [str(v).strip(" []'\"") for v in value]
    elif isinstance(value, str):
        # Try to parse stringified list
        try:
            parsed = ast.literal_eval(value)
            if isinstance(parsed, list):
                return [str(v).strip(" []'\"") for v in parsed]
        except Exception:
            pass
        # Fallback: split by comma
        return [v.strip(" []'\"") for v in value.split(',')]
    elif pd.isna(value):
        return []
    else:
        return [str(value).strip(" []'\"")]

def make_modified_description(row):
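    # NaN is truthy, so `value or ''` would turn a missing field into the string 'nan'; check with pd.isna instead
    title = row.get('title', '')
    title = '' if pd.isna(title) else str(title).strip()
    authors = safe_list(row.get('authors', []))
    genres = safe_list(row.get('genres', []))
    description = row.get('description', '')
    description = '' if pd.isna(description) else str(description).strip()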

    authors_text = ', '.join(authors)
    genres_text = ', '.join(genres)

    # Compose the clean structured text
    combined = (
        f"Book Title: {title}. "
        f"Authors: {authors_text}. "
        f"Genres: {genres_text}. "
        f"Description: {description}"
    )
    return combined


In [None]:
data_mini = data_mini.copy()  # work on an independent copy to avoid SettingWithCopyWarning

data_mini['combined_text'] = data_mini.apply(make_modified_description, axis=1)


In [None]:
#saving the data with combined feature description

data_mini.to_csv("/content/drive/MyDrive/data science practice/book_rec_system/data_mini_books_update.csv",index=False)

In [None]:
print(data_mini['combined_text'].iloc[9997])

So far, we have the combined description, which includes the title, authors, and genres along with the book description.

Next, load the model.

In [None]:
from sentence_transformers import SentenceTransformer

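# with no device argument, SentenceTransformer uses the GPU when one is available and falls back to the CPU otherwise
model = SentenceTransformer('all-mpnet-base-v2')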


In [None]:
#now we get the embeddings using the model

import numpy as np

data_mini = pd.read_csv("/content/drive/MyDrive/data science practice/book_rec_system/data_mini_books_update.csv")

book_texts = data_mini['combined_text'].tolist()
book_embeddings = model.encode(book_texts, normalize_embeddings=True,batch_size = 64,convert_to_numpy = True,show_progress_bar = True)

# Save (run once, right after computing the embeddings)
#np.save("/content/drive/MyDrive/data science practice/book_rec_system/book_embeddings.npy", book_embeddings)

# Load later (on later runs, the encode step above can be skipped)
book_embeddings = np.load("/content/drive/MyDrive/data science practice/book_rec_system/book_embeddings.npy")


In [None]:
data_mini.head(10)

# 2nd recommendation




In [None]:

from sklearn.metrics.pairwise import cosine_similarity


book_ids = [5780,5900,6789,7899]

#profile vector: the average of the liked books' embeddings
avg_book_embeds = book_embeddings[book_ids].mean(axis=0).reshape(1, -1)
print(avg_book_embeds.shape)

#cosine similarity of every book to the profile, flattened to shape (n_books,)
similarity3 = cosine_similarity(book_embeddings, avg_book_embeds).ravel()
sorted_similarity = sorted(enumerate(similarity3), key=lambda x: x[1], reverse=True)

to_choose = 50
to_show = 15

#print the loved books
print("loved books")
for book_id in book_ids:
  print(data_mini.iloc[book_id][['title','authors','genres']])

#keep the top candidates, leaving out the loved books themselves
top = [(idx, score) for idx, score in sorted_similarity[:to_choose] if idx not in book_ids]
rec_df = data_mini.iloc[[idx for idx, _ in top]][['title','authors','genres','average_rating']].reset_index(drop=True)
rec_df['similarity_score'] = [score for _, score in top]

print("recommended books")

alpha = 0.3 #giving weight to the rating

#note: average_rating goes up to 5 while similarity is at most 1, so the rating dominates this score unless it is rescaled
rec_df['final_score'] = rec_df['average_rating']*alpha + rec_df['similarity_score']*(1-alpha)
rec_df.sort_values(by = 'final_score',ascending = False).head(to_show)




Finally, we can see the recommendations given by the model.
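One thing to keep in mind about the final score: average_rating goes up to 5 while cosine similarity is at most 1, so the two terms are not on the same scale. A possible variation, sketched below on made-up numbers, rescales the rating to the range 0 to 1 before blending. `blended_score` is a hypothetical helper, and the 1 to 5 rating range is an assumption about the dataset.

```python
import numpy as np

def blended_score(avg_rating, similarity, alpha=0.3, rating_min=1.0, rating_max=5.0):
    """Blend rating and similarity after rescaling the rating to [0, 1]."""
    # rating_min and rating_max are assumed bounds of the rating scale
    rating01 = (np.asarray(avg_rating, dtype=float) - rating_min) / (rating_max - rating_min)
    return alpha * rating01 + (1 - alpha) * np.asarray(similarity, dtype=float)

# made-up candidates: the first has the best rating but the weakest similarity
ratings = [4.6, 3.8, 4.1]
sims = [0.55, 0.80, 0.78]

raw = 0.3 * np.array(ratings) + 0.7 * np.array(sims)   # blend on the original scales
scaled = blended_score(ratings, sims, alpha=0.3)       # blend after rescaling the rating

print("raw order:   ", np.argsort(-raw).tolist())
print("scaled order:", np.argsort(-scaled).tolist())
```

With the rating rescaled, alpha behaves more like the intended weight: in this toy case the highly rated but less similar candidate drops from second to last place.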

So, this is a content-based recommendation system for books.