# Book Recommendation System 

## Project Overview
This project aims to recommend books to users based on their preferences using **Machine Learning** and **Recommendation System techniques**.  
We will explore the dataset, preprocess it, apply collaborative/content-based filtering, and build a system that suggests books effectively.

## Objective
- Build a model that can recommend books tailored to a user’s interests.  
- Implement different recommendation approaches (Collaborative Filtering, Content-Based, Hybrid).  
- Visualize insights about books, users, and ratings.  


## Importing Libraries 

### Libraries Used
- **pandas**: For loading, exploring, and manipulating datasets.  
- **numpy**: For numerical computations and array operations.  
- **matplotlib**: For creating graphs and visualizing data trends.

### Purpose in this Project
- We will use **pandas** to load the book, rating, and user datasets and inspect their structure.  
- **numpy** will help with numerical operations like calculating averages, counts, or converting data to arrays for machine learning models.



In [29]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

## Load the Dataset 

We begin by loading the raw datasets (Books, Ratings, and Users) into Pandas DataFrames for further processing.  

In [30]:
books_df=pd.read_csv('C:/Users/PCPR/Desktop/Book Recommendation/Data/Raw/books.csv')
ratings_df=pd.read_csv('C:/Users/PCPR/Desktop/Book Recommendation/Data/Raw/ratings.csv')
users_df=pd.read_csv('C:/Users/PCPR/Desktop/Book Recommendation/Data/Raw/users.csv')

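The absolute `C:/Users/...` paths above only resolve on the original machine. A minimal sketch of a more portable loader follows, assuming the three CSV files sit together in one folder. The helper name `load_raw_tables` and its `raw_dir` argument are hypothetical, and `low_memory=False` is an optional choice that makes pandas infer each column's type from the whole file rather than chunk by chunk. The demo writes throwaway files so the block runs on its own.

```python
import tempfile
from pathlib import Path

import pandas as pd


def load_raw_tables(raw_dir):
    """Read books.csv, ratings.csv and users.csv from `raw_dir` into a dict of DataFrames."""
    raw_dir = Path(raw_dir)
    names = ("books", "ratings", "users")
    return {name: pd.read_csv(raw_dir / f"{name}.csv", low_memory=False) for name in names}


# Demo on tiny placeholder files (not the real data)
with tempfile.TemporaryDirectory() as tmp:
    for name in ("books", "ratings", "users"):
        Path(tmp, f"{name}.csv").write_text("ISBN,value\nA,1\nB,2\n")
    tables = load_raw_tables(tmp)

print({name: df.shape for name, df in tables.items()})
```

With the real data, the call would point at the project's raw-data folder instead of the temporary directory.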


## Checking Dataset Shape 

Before we start cleaning or analyzing the data, it's important to know:

- How many **rows** (records) we have.
- How many **columns** (features) are in the dataset.

This helps us understand the **size of the dataset** and what preprocessing might be needed.


In [31]:
books_df.shape


(271360, 8)

In [32]:
ratings_df.shape


(1149780, 3)

In [33]:
users_df.shape

(278858, 3)

## Merge Datasets 

To make the ratings dataset more meaningful, we merge the **ratings** DataFrame with the **books** DataFrame using the common key `ISBN`.  
This allows us to see which user rated which book by its **title/author** instead of just the ISBN code.



In [34]:
merged_df = ratings_df.merge(books_df, on="ISBN")
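`merge` defaults to an inner join, so any rating whose `ISBN` has no match in the books table is silently dropped. The toy sketch below (made-up rows, not the real data) shows one way to count such unmatched ratings with the `indicator` option:

```python
import pandas as pd

toy_ratings = pd.DataFrame({
    "User-ID": [1, 1, 2],
    "ISBN": ["A", "B", "Z"],   # "Z" has no entry in toy_books
    "Book-Rating": [8, 0, 5],
})
toy_books = pd.DataFrame({
    "ISBN": ["A", "B"],
    "Book-Title": ["First Book", "Second Book"],
})

# Default inner join: only ratings with a matching ISBN survive
inner = toy_ratings.merge(toy_books, on="ISBN")

# Left join with an indicator column to see what the inner join discards
checked = toy_ratings.merge(toy_books, on="ISBN", how="left", indicator=True)
unmatched = int((checked["_merge"] == "left_only").sum())

print(len(toy_ratings), len(inner), unmatched)
```

Comparing `ratings_df.shape[0]` with `merged_df.shape[0]` gives the same information for the real tables.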

## Count Ratings per Book 

Next, we calculate how many ratings each book has received.  
This helps in identifying **popular books** and filtering out books with very few ratings.


In [35]:
Total_Rating=merged_df.groupby('Book-Title').count()['Book-Rating'].reset_index()
Total_Rating.rename(columns={'Book-Rating':'No_Of_Ratings'},inplace=True)
Total_Rating.head()

Unnamed: 0,Book-Title,No_Of_Ratings
0,A Light in the Storm: The Civil War Diary of ...,4
1,Always Have Popsicles,1
2,Apple Magic (The Collector's series),1
3,"Ask Lily (Young Women of Faith: Lily Series, ...",1
4,Beyond IBM: Leadership Marketing and Finance ...,1


## Calculate Average Rating per Book 

Along with the total number of ratings, we also calculate the **average rating** for each book.  
This helps in identifying not just the most popular books, but also the **best-rated** ones.

In [36]:
# Ensure Book-Rating is numeric
merged_df["Book-Rating"] = pd.to_numeric(merged_df["Book-Rating"], errors="coerce")

# Calculate average rating per book
Avg_Rating = merged_df.groupby("Book-Title")["Book-Rating"].mean().reset_index()

# Rename column for clarity
Avg_Rating.rename(columns={"Book-Rating": "Avg_Ratings"}, inplace=True)

# Preview
Avg_Rating.head()

Unnamed: 0,Book-Title,Avg_Ratings
0,A Light in the Storm: The Civil War Diary of ...,2.25
1,Always Have Popsicles,0.0
2,Apple Magic (The Collector's series),0.0
3,"Ask Lily (Young Women of Faith: Lily Series, ...",8.0
4,Beyond IBM: Leadership Marketing and Finance ...,0.0
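Several averages in the preview are exactly 0.0, which suggests that zeros pull the mean down. The notebook does not say what a rating of 0 means. If it marks a book the user interacted with but did not score (an assumption here), the average over scored ratings only can be sketched on toy data like this:

```python
import pandas as pd

toy = pd.DataFrame({
    "Book-Title": ["X", "X", "X", "Y", "Y"],
    "Book-Rating": [0, 8, 10, 0, 0],
})

# Mean over every row, zeros included (what the cell above computes)
all_mean = toy.groupby("Book-Title")["Book-Rating"].mean()

# Mean over non-zero ratings only; titles with no such rating drop out
explicit = toy[toy["Book-Rating"] > 0]
explicit_mean = explicit.groupby("Book-Title")["Book-Rating"].mean()

print(all_mean.to_dict())
print(explicit_mean.to_dict())
```

Which of the two averages is appropriate depends on how the zeros should be interpreted.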


## Combine Total Ratings & Average Ratings 

To analyze book popularity and quality together,  
we merge the `Total_Rating` and `Avg_Rating` DataFrames on the common column `Book-Title`.


In [37]:
# Merge total ratings and average ratings
popular_df = Total_Rating.merge(Avg_Rating, on="Book-Title")

# Preview
popular_df.head()

Unnamed: 0,Book-Title,No_Of_Ratings,Avg_Ratings
0,A Light in the Storm: The Civil War Diary of ...,4,2.25
1,Always Have Popsicles,1,0.0
2,Apple Magic (The Collector's series),1,0.0
3,"Ask Lily (Young Women of Faith: Lily Series, ...",1,8.0
4,Beyond IBM: Leadership Marketing and Finance ...,1,0.0
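As an alternative to building two tables and merging them, the count and the mean can be computed in a single pass with named aggregation. A small sketch on toy data, reusing the notebook's column names:

```python
import pandas as pd

toy = pd.DataFrame({
    "Book-Title": ["X", "X", "Y"],
    "Book-Rating": [4, 8, 6],
})

# One groupby producing both statistics, already named like popular_df's columns
stats = (
    toy.groupby("Book-Title")
    .agg(
        No_Of_Ratings=("Book-Rating", "count"),
        Avg_Ratings=("Book-Rating", "mean"),
    )
    .reset_index()
)

print(stats)
```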


## Filter Popular Books 

We filter books that have received at least **250 ratings** to ensure reliability  
(since books with very few ratings may not be a good indicator of popularity).  
Then, we sort them by their **average rating** in descending order.

In [None]:
# Filter books with at least 250 ratings and sort by average rating
popular_books = popular_df[popular_df["No_Of_Ratings"] >= 250].sort_values("Avg_Ratings", ascending=False)

# Preview
popular_books.head()


Unnamed: 0,Book-Title,No_Of_Ratings,Avg_Ratings
80434,Harry Potter and the Prisoner of Azkaban (Book 3),428,5.852804
80422,Harry Potter and the Goblet of Fire (Book 4),387,5.824289
80441,Harry Potter and the Sorcerer's Stone (Book 1),278,5.73741
80426,Harry Potter and the Order of the Phoenix (Boo...,347,5.501441
80414,Harry Potter and the Chamber of Secrets (Book 2),556,5.183453


## Select Top 50 Popular Books 

From the filtered dataset, we select the **Top 50 books** with the highest average ratings  
(among those that have at least 250 ratings).



In [85]:

# Select top 50 books with highest average ratings
popular_books = popular_df[popular_df["No_Of_Ratings"] >= 250].sort_values("Avg_Ratings", ascending=False).head(50)

# Preview
popular_books.head()


Unnamed: 0,Book-Title,No_Of_Ratings,Avg_Ratings
80434,Harry Potter and the Prisoner of Azkaban (Book 3),428,5.852804
80422,Harry Potter and the Goblet of Fire (Book 4),387,5.824289
80441,Harry Potter and the Sorcerer's Stone (Book 1),278,5.73741
80426,Harry Potter and the Order of the Phoenix (Boo...,347,5.501441
80414,Harry Potter and the Chamber of Secrets (Book 2),556,5.183453


## Add Book Details to Popular Books 

To make the recommendations more meaningful,  
we merge the `popular_books` DataFrame with the original `books_df` dataset to include:  
- **Book Title**  
- **Author**  
- **Cover Image URL**  
- **Number of Ratings**  
- **Average Rating**  


In [89]:
# # Merge with books dataset to get author and image details
# popular_books = popular_books.merge(books_df, on="Book-Title").drop_duplicates("Book-Title")[["Book-Title", "Book-Author", "Image-URL-M", "No_Of_Ratings", "Avg_Ratings"]]

# # Preview
# popular_books.head()


popular_books = (
    popular_df[popular_df["No_Of_Ratings"] >= 250]
    .sort_values("Avg_Ratings", ascending=False)
    .head(50)
    .merge(
        books_df[["Book-Title", "Book-Author", "Image-URL-M"]],
        on="Book-Title",
        how="left"
    )
    .drop_duplicates("Book-Title")
)

# Final selection
popular_books = popular_books[["Book-Title", "Book-Author", "Image-URL-M", "No_Of_Ratings", "Avg_Ratings"]]

# Preview
popular_books.head()


Unnamed: 0,Book-Title,Book-Author,Image-URL-M,No_Of_Ratings,Avg_Ratings
0,Harry Potter and the Prisoner of Azkaban (Book 3),J. K. Rowling,http://images.amazon.com/images/P/0439136350.0...,428,5.852804
3,Harry Potter and the Goblet of Fire (Book 4),J. K. Rowling,http://images.amazon.com/images/P/0439139597.0...,387,5.824289
5,Harry Potter and the Sorcerer's Stone (Book 1),J. K. Rowling,http://images.amazon.com/images/P/0590353403.0...,278,5.73741
9,Harry Potter and the Order of the Phoenix (Boo...,J. K. Rowling,http://images.amazon.com/images/P/043935806X.0...,347,5.501441
13,Harry Potter and the Chamber of Secrets (Book 2),J. K. Rowling,http://images.amazon.com/images/P/0439064872.0...,556,5.183453
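The gaps in the index of the preview (0, 3, 5, 9, 13) are consistent with one title appearing under several ISBNs in the books table, so the merge produces several rows per title before `drop_duplicates` keeps the first. A toy sketch of that effect, with made-up rows:

```python
import pandas as pd

toy_popular = pd.DataFrame({
    "Book-Title": ["X"],
    "No_Of_Ratings": [300],
    "Avg_Ratings": [5.5],
})
toy_books = pd.DataFrame({
    "ISBN": ["1", "2", "3"],            # two editions of "X"
    "Book-Title": ["X", "X", "Y"],
    "Book-Author": ["Author A", "Author A", "Author B"],
})

# Merging on the title yields one row per matching edition
joined = toy_popular.merge(
    toy_books[["Book-Title", "Book-Author"]], on="Book-Title", how="left"
)

# Keep a single row per title (the first edition encountered)
deduped = joined.drop_duplicates("Book-Title")

print(len(joined), len(deduped))
```

Because the first match wins, the author and cover image shown for a title come from whichever edition appears first in the books table.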


In [88]:
popular_books.columns


Index(['Book-Title', 'Book-Author', 'Image-URL-M', 'No_Of_Ratings',
       'Avg_Ratings'],
      dtype='object')

## Filter Active Users 

Not all users give enough ratings to be useful for recommendations.  
So, we keep only **active users**: those who have rated **more than 200 books**.  
This helps in reducing noise and improving recommendation quality.



In [45]:
# Find users who have rated more than 200 books
users = merged_df.groupby("User-ID").count()["Book-Rating"] > 200

# Extract active user IDs
active_users = users[users].index

# Preview
active_users[:10]   # first 10 active users

Index([254, 2276, 2766, 2977, 3363, 4017, 4385, 6251, 6323, 6543], dtype='int64', name='User-ID')

## Keep Ratings from Active Users 

We now filter the dataset to include only the ratings from **active users**  
(those who rated more than 200 books).  
This ensures we work with reliable data for collaborative filtering.



In [50]:
# Filter ratings for active users only
filtered_ratings = merged_df[merged_df["User-ID"].isin(active_users)]

# Preview
filtered_ratings.head()


Unnamed: 0,User-ID,ISBN,Book-Rating,Book-Title,Book-Author,Year-Of-Publication,Publisher,Image-URL-S,Image-URL-M,Image-URL-L
1150,277427,002542730X,10,Politically Correct Bedtime Stories: Modern Ta...,James Finn Garner,1994,John Wiley &amp; Sons Inc,http://images.amazon.com/images/P/002542730X.0...,http://images.amazon.com/images/P/002542730X.0...,http://images.amazon.com/images/P/002542730X.0...
1151,277427,0026217457,0,Vegetarian Times Complete Cookbook,Lucy Moll,1995,John Wiley &amp; Sons,http://images.amazon.com/images/P/0026217457.0...,http://images.amazon.com/images/P/0026217457.0...,http://images.amazon.com/images/P/0026217457.0...
1152,277427,003008685X,8,Pioneers,James Fenimore Cooper,1974,Thomson Learning,http://images.amazon.com/images/P/003008685X.0...,http://images.amazon.com/images/P/003008685X.0...,http://images.amazon.com/images/P/003008685X.0...
1153,277427,0030615321,0,"Ask for May, Settle for June (A Doonesbury book)",G. B. Trudeau,1982,Henry Holt &amp; Co,http://images.amazon.com/images/P/0030615321.0...,http://images.amazon.com/images/P/0030615321.0...,http://images.amazon.com/images/P/0030615321.0...
1154,277427,0060002050,0,On a Wicked Dawn (Cynster Novels),Stephanie Laurens,2002,Avon Books,http://images.amazon.com/images/P/0060002050.0...,http://images.amazon.com/images/P/0060002050.0...,http://images.amazon.com/images/P/0060002050.0...


## Filter Famous Books 

Just like with users, not all books are rated enough to be reliable.  
So, we keep only **famous books**: those that have received at least **50 ratings** from the active users.  



In [51]:
# Find books with at least 50 ratings
y = filtered_ratings.groupby("Book-Title").count()["Book-Rating"] >= 50

# Extract titles of famous books
famous_books = y[y].index

# Preview
famous_books[:10]   # first 10 famous books


Index(['1984', '1st to Die: A Novel', '2nd Chance', '4 Blondes',
       'A Bend in the Road', 'A Case of Need',
       'A Child Called \It\": One Child's Courage to Survive"',
       'A Civil Action', 'A Day Late and a Dollar Short', 'A Fine Balance'],
      dtype='object', name='Book-Title')

## Keep Ratings for Famous Books 

Now, we filter the `filtered_ratings` dataset to include only the books  
that are in our `famous_books` list (i.e., books with at least 50 ratings).  




In [53]:
# Keep ratings for famous books only
final_ratings = filtered_ratings[filtered_ratings["Book-Title"].isin(famous_books)]

# Preview
final_ratings.head()

Unnamed: 0,User-ID,ISBN,Book-Rating,Book-Title,Book-Author,Year-Of-Publication,Publisher,Image-URL-S,Image-URL-M,Image-URL-L
1150,277427,002542730X,10,Politically Correct Bedtime Stories: Modern Ta...,James Finn Garner,1994,John Wiley &amp; Sons Inc,http://images.amazon.com/images/P/002542730X.0...,http://images.amazon.com/images/P/002542730X.0...,http://images.amazon.com/images/P/002542730X.0...
1163,277427,0060930535,0,The Poisonwood Bible: A Novel,Barbara Kingsolver,1999,Perennial,http://images.amazon.com/images/P/0060930535.0...,http://images.amazon.com/images/P/0060930535.0...,http://images.amazon.com/images/P/0060930535.0...
1165,277427,0060934417,0,Bel Canto: A Novel,Ann Patchett,2002,Perennial,http://images.amazon.com/images/P/0060934417.0...,http://images.amazon.com/images/P/0060934417.0...,http://images.amazon.com/images/P/0060934417.0...
1168,277427,0061009059,9,One for the Money (Stephanie Plum Novels (Pape...,Janet Evanovich,1995,HarperTorch,http://images.amazon.com/images/P/0061009059.0...,http://images.amazon.com/images/P/0061009059.0...,http://images.amazon.com/images/P/0061009059.0...
1174,277427,006440188X,0,The Secret Garden,Frances Hodgson Burnett,1998,HarperTrophy,http://images.amazon.com/images/P/006440188X.0...,http://images.amazon.com/images/P/006440188X.0...,http://images.amazon.com/images/P/006440188X.0...
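The two filtering steps follow the same pattern: count how often each key occurs, then keep the rows whose key is frequent enough. A compact sketch of that pattern on toy data is below. The helper `keep_frequent` is hypothetical, and the thresholds are shrunk so the toy table is not emptied (note the notebook uses a strict `> 200` for users and `>= 50` for books).

```python
import pandas as pd


def keep_frequent(df, column, min_count):
    """Keep rows whose value in `column` occurs at least `min_count` times in `df`."""
    counts = df[column].map(df[column].value_counts())
    return df[counts >= min_count]


toy = pd.DataFrame({
    "User-ID": [1, 1, 1, 2, 2, 2, 3],
    "Book-Title": ["A", "B", "C", "A", "B", "D", "A"],
    "Book-Rating": [9, 0, 7, 8, 6, 0, 10],
})

# Step 1: users with at least 3 ratings (user 3 is dropped)
active = keep_frequent(toy, "User-ID", 3)

# Step 2: among those rows, books with at least 2 ratings (C and D are dropped)
final = keep_frequent(active, "Book-Title", 2)

print(final)
```

The order matters: the book counts in step 2 are taken after inactive users have been removed, as in the notebook.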


## Create the User-Item Rating Matrix 

We now create a **pivot table** where:  
- Rows = **Book Titles**  
- Columns = **User IDs**  
- Values = **Book Ratings**  

This matrix will be used to compute similarity between books: each row is one book's vector of ratings across users.  


In [54]:
# Create User-Item Matrix
book_user_matrix = final_ratings.pivot_table(index="Book-Title",columns="User-ID",values="Book-Rating")

# Preview
book_user_matrix.head()


Book-Title,254,2276,2766,2977,3363,4017,4385,6251,6323,6543,...,271705,273979,274004,274061,274301,274308,275970,277427,277639,278418
1984,9.0,,,,,,,,,,...,10.0,,,,,,0.0,,,
1st to Die: A Novel,,,,,,,,,,9.0,...,,,,,,,,,,
2nd Chance,,10.0,,,,,,,,0.0,...,,,,,,0.0,,,0.0,
4 Blondes,,,,,,,,0.0,,,...,,,,,,,,,,
A Bend in the Road,0.0,,7.0,,,,,,,,...,,0.0,,,,,,,,


## Handle Missing Values 

Since each user has rated only a small fraction of the books, our matrix contains many **NaN values**.  
We replace these NaN values with **0** (indicating no rating given) so that similarity can be computed on a fully numeric matrix.  



In [55]:
# Replace NaN values with 0
book_user_matrix.fillna(0, inplace=True)

# Preview
book_user_matrix.head()

Book-Title,254,2276,2766,2977,3363,4017,4385,6251,6323,6543,...,271705,273979,274004,274061,274301,274308,275970,277427,277639,278418
1984,9.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,10.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1st to Die: A Novel,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,9.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2nd Chance,0.0,10.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4 Blondes,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
A Bend in the Road,0.0,0.0,7.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
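A toy sketch of the pivot-and-fill step may help make two details visible: `pivot_table` averages by default when the same user has more than one rating for a title (which can happen when several editions share a title), and `fillna(0)` then turns every missing pair into a zero. The rows below are made up.

```python
import pandas as pd

toy = pd.DataFrame({
    "User-ID": [1, 2, 2, 1],
    "Book-Title": ["A", "A", "B", "A"],   # user 1 rated title "A" twice
    "Book-Rating": [10, 6, 8, 4],
})

# Rows = titles, columns = users; duplicate (title, user) pairs are averaged
matrix = toy.pivot_table(index="Book-Title", columns="User-ID", values="Book-Rating")

# Missing (title, user) pairs become 0
matrix = matrix.fillna(0)

print(matrix)
```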


## Compute Book Similarity 

We use **Cosine Similarity** to measure how similar two books are,  
based on the ratings they received from users.  
Because ratings are never negative, the scores here fall between 0 and 1.

- A similarity score closer to **1** → books are very similar.  
- A similarity score closer to **0** → books are dissimilar.  


In [57]:
from sklearn.metrics.pairwise import cosine_similarity

# Compute cosine similarity between books
similarity_scores = cosine_similarity(book_user_matrix)

# Preview shape
similarity_scores.shape


(706, 706)
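For intuition, cosine similarity between two rows is their dot product divided by the product of their lengths. The NumPy-only sketch below computes it by hand for a made-up 3-book, 3-user matrix; it illustrates the formula and is not a replacement for the scikit-learn call above.

```python
import numpy as np

# Rows = books, columns = users (toy values)
ratings = np.array([
    [9.0, 0.0, 8.0],
    [8.0, 0.0, 9.0],
    [0.0, 7.0, 0.0],
])

# Length (Euclidean norm) of each book's rating vector, kept as a column
norms = np.linalg.norm(ratings, axis=1, keepdims=True)

# Pairwise dot products divided by the product of the two lengths
manual = (ratings @ ratings.T) / (norms * norms.T)

print(np.round(manual, 4))
```

Books 0 and 1 are rated highly by the same users, so their score is close to 1 (144/145), while book 2 shares no raters with them and scores 0.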

In [58]:
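def recommend_books(book_name):
    # Locate the row of the requested title in the book-user matrix
    index = np.where(book_user_matrix.index == book_name)[0][0]
    # Rank all books by similarity to it; skip position 0 (the book itself) and keep the next five
    similar_items = sorted(enumerate(similarity_scores[index]), key=lambda x: x[1], reverse=True)[1:6]

    # Print the titles of the five most similar books
    for i in similar_items:
        print(book_user_matrix.index[i[0]])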

In [68]:
recommend_books('1984')

Animal Farm
The Handmaid's Tale
Brave New World
The Vampire Lestat (Vampire Chronicles, Book II)
The Hours : A Novel
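`recommend_books` raises an `IndexError` for a title that is not in the matrix, and it assumes the queried book always sorts first. A hedged variant is sketched below on toy data: the function name `top_similar` and its parameters are hypothetical, it returns a list instead of printing, returns an empty list for unknown titles, and excludes the queried book by position rather than by slicing.

```python
import numpy as np
import pandas as pd


def top_similar(title, matrix, scores, n=5):
    """Return up to `n` (title, score) pairs most similar to `title`, best first."""
    if title not in matrix.index:
        return []
    pos = matrix.index.get_loc(title)
    order = np.argsort(scores[pos])[::-1]           # highest score first
    picks = [i for i in order if i != pos][:n]      # drop the book itself
    return [(matrix.index[i], float(scores[pos][i])) for i in picks]


# Toy stand-ins for book_user_matrix and similarity_scores
toy_matrix = pd.DataFrame(np.zeros((3, 2)), index=["A", "B", "C"])
toy_scores = np.array([
    [1.0, 0.9, 0.1],
    [0.9, 1.0, 0.2],
    [0.1, 0.2, 1.0],
])

print(top_similar("A", toy_matrix, toy_scores, n=2))
print(top_similar("Not in the matrix", toy_matrix, toy_scores))
```

Returning data rather than printing makes the function easier to reuse, for example from a web app.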


## Saving Processed Data using Joblib 

To efficiently reuse processed data without recomputing, we save it using the **joblib** library.  
This ensures faster loading during model training or web app deployment.  



In [105]:
import joblib

# Save the top-50 popular books table in the current working directory
joblib.dump(popular_books, "popular_df.pkl")

# Load it back under a separate name so the earlier popular_df is not overwritten
popular_books_loaded = joblib.load("popular_df.pkl")


In [104]:
import joblib
df = joblib.load("popular_df.pkl")
print(df.columns)


Index(['Book-Title', 'Book-Author', 'Image-URL-M', 'No_Of_Ratings',
       'Avg_Ratings'],
      dtype='object')
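`joblib.dump` and `joblib.load` both accept a plain file path, which avoids leaving file handles open. A minimal round-trip sketch follows, using a temporary folder as a stand-in for a models directory; the file name `popular_books.pkl` and the toy table are placeholders.

```python
import os
import tempfile

import joblib
import pandas as pd

table = pd.DataFrame({"Book-Title": ["X"], "Avg_Ratings": [5.5]})

with tempfile.TemporaryDirectory() as models_dir:
    path = os.path.join(models_dir, "popular_books.pkl")
    joblib.dump(table, path)        # write the DataFrame to disk
    restored = joblib.load(path)    # read it back

print(restored.equals(table))
```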
