# 🛠️ Import Libraries

Before we start building our ML model or visualizations, let's load the essential Python libraries:

## 🟢 Data Processing
- **NumPy**: For numerical operations, arrays, and mathematical computations.  
- **Pandas**: For loading, exploring, and manipulating datasets efficiently.  

These libraries form the foundation for all subsequent data processing and analysis steps.

## 🔵 Visualization
- **Matplotlib (`pyplot`)**: For creating static plots and charts.  
- **Seaborn**: For enhanced statistical visualizations built on top of Matplotlib.  
- **Plotly Express**: For quick, high-level interactive visualizations.  
- **Plotly Graph Objects**: For building interactive and highly customizable plots.  
- **Plotly IO**: For controlling Plotly renderers and output settings.  

These libraries allow us to visualize data clearly and interactively, helping us explore patterns and insights before modeling.


In [1]:
# ✅ Import essential libraries

import numpy as np  # For numerical computations, arrays, and mathematical operations
import pandas as pd  # For data manipulation and analysis (DataFrames, CSV/Excel I/O)
import matplotlib.pyplot as plt  # For creating static visualizations (plots, charts)
import seaborn as sns  # For enhanced statistical visualizations on top of matplotlib
import plotly.graph_objects as go  # For building interactive and customizable plots
import plotly.express as px  # For quick, high-level interactive visualizations
import plotly.io as pio  # For controlling Plotly renderers and output settings

# 📂 Load the Dataset

We will now load the datasets required for our Machine Learning model:

- **Books.csv** → Information about books (ISBN, title, author, publisher, etc.)
- **Users.csv** → Information about users (user ID, location, age)
- **Ratings.csv** → User ratings for books

These datasets will help us build a recommendation system.


In [2]:
# ✅ Load datasets using pandas
book = pd.read_csv('Dataset/Books.csv')    # Book details
user = pd.read_csv('Dataset/Users.csv')    # User information
rating = pd.read_csv('Dataset/Ratings.csv')  # Ratings given by users



In [3]:
# Convert ISBN to string for consistency
book['ISBN'] = book['ISBN'].astype(str)
rating['ISBN'] = rating['ISBN'].astype(str)
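Converting with `astype(str)` after loading cannot restore leading zeros if pandas has already parsed an ISBN column as a number. A minimal sketch of the defensive alternative, passing `dtype` to `pd.read_csv`, is shown below. The two-row CSV is a hypothetical stand-in for `Books.csv`; the real file also contains ISBNs ending in `X`, so it is probably read as text anyway.

```python
import io
import pandas as pd

# Tiny in-memory stand-in for Books.csv (hypothetical rows for illustration)
sample_csv = io.StringIO(
    "ISBN,Book-Title\n"
    "0195153448,Classical Mythology\n"
    "0002005018,Clara Callan\n"
)

# Without a dtype hint, an all-digit column is parsed as integers
as_int = pd.read_csv(sample_csv)
sample_csv.seek(0)

# With dtype=str the ISBN is kept exactly as written in the file
as_str = pd.read_csv(sample_csv, dtype={"ISBN": str})

print(as_int["ISBN"].tolist())  # leading zeros are lost
print(as_str["ISBN"].tolist())  # leading zeros are preserved
```

Reading the key column as text at load time keeps `Books` and `Ratings` joinable on `ISBN` without a later conversion step.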

# 📊 Explore the Datasets

Before building our ML model, let's explore the datasets to understand the data we are working with:

1. **Books Dataset** – Details about each book (title, author, publisher, etc.)
2. **Users Dataset** – Information about users (age, location, etc.)
3. **Ratings Dataset** – User ratings for each book

We'll preview the first few rows and discuss the columns.


In [4]:
# 🔹 Preview the Books dataset
print("📖 Books Dataset (first 5 rows):")
display(book.head())

# Display basic info about Books dataset
# (info() prints its summary and returns None, so it is not wrapped in display())
print("\n📖 Books Dataset Info:")
book.info()

# 🔹 Preview the Users dataset
print("\n👤 Users Dataset (first 5 rows):")
display(user.head())

# Display basic info about Users dataset
print("\n👤 Users Dataset Info:")
user.info()

# 🔹 Preview the Ratings dataset
print("\n⭐ Ratings Dataset (first 5 rows):")
display(rating.head())

# Display basic info about Ratings dataset
print("\n⭐ Ratings Dataset Info:")
rating.info()

📖 Books Dataset (first 5 rows):

| | ISBN | Book-Title | Book-Author | Year-Of-Publication | Publisher | Image-URL-S | Image-URL-M | Image-URL-L |
|---|---|---|---|---|---|---|---|---|
| 0 | 0195153448 | Classical Mythology | Mark P. O. Morford | 2002 | Oxford University Press | http://images.amazon.com/images/P/0195153448.0... | http://images.amazon.com/images/P/0195153448.0... | http://images.amazon.com/images/P/0195153448.0... |
| 1 | 0002005018 | Clara Callan | Richard Bruce Wright | 2001 | HarperFlamingo Canada | http://images.amazon.com/images/P/0002005018.0... | http://images.amazon.com/images/P/0002005018.0... | http://images.amazon.com/images/P/0002005018.0... |
| 2 | 0060973129 | Decision in Normandy | Carlo D'Este | 1991 | HarperPerennial | http://images.amazon.com/images/P/0060973129.0... | http://images.amazon.com/images/P/0060973129.0... | http://images.amazon.com/images/P/0060973129.0... |
| 3 | 0374157065 | Flu: The Story of the Great Influenza Pandemic... | Gina Bari Kolata | 1999 | Farrar Straus Giroux | http://images.amazon.com/images/P/0374157065.0... | http://images.amazon.com/images/P/0374157065.0... | http://images.amazon.com/images/P/0374157065.0... |
| 4 | 0393045218 | The Mummies of Urumchi | E. J. W. Barber | 1999 | W. W. Norton &amp; Company | http://images.amazon.com/images/P/0393045218.0... | http://images.amazon.com/images/P/0393045218.0... | http://images.amazon.com/images/P/0393045218.0... |



📖 Books Dataset Info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 271360 entries, 0 to 271359
Data columns (total 8 columns):
 #   Column               Non-Null Count   Dtype 
---  ------               --------------   ----- 
 0   ISBN                 271360 non-null  object
 1   Book-Title           271360 non-null  object
 2   Book-Author          271358 non-null  object
 3   Year-Of-Publication  271360 non-null  object
 4   Publisher            271358 non-null  object
 5   Image-URL-S          271360 non-null  object
 6   Image-URL-M          271360 non-null  object
 7   Image-URL-L          271357 non-null  object
dtypes: object(8)
memory usage: 16.6+ MB


👤 Users Dataset (first 5 rows):


| | User-ID | Location | Age |
|---|---|---|---|
| 0 | 1 | nyc, new york, usa | NaN |
| 1 | 2 | stockton, california, usa | 18.0 |
| 2 | 3 | moscow, yukon territory, russia | NaN |
| 3 | 4 | porto, v.n.gaia, portugal | 17.0 |
| 4 | 5 | farnborough, hants, united kingdom | NaN |



👤 Users Dataset Info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 278858 entries, 0 to 278857
Data columns (total 3 columns):
 #   Column    Non-Null Count   Dtype  
---  ------    --------------   -----  
 0   User-ID   278858 non-null  int64  
 1   Location  278858 non-null  object 
 2   Age       168096 non-null  float64
dtypes: float64(1), int64(1), object(1)
memory usage: 6.4+ MB


⭐ Ratings Dataset (first 5 rows):


| | User-ID | ISBN | Book-Rating |
|---|---|---|---|
| 0 | 276725 | 034545104X | 0 |
| 1 | 276726 | 0155061224 | 5 |
| 2 | 276727 | 0446520802 | 0 |
| 3 | 276729 | 052165615X | 3 |
| 4 | 276729 | 0521795028 | 6 |



⭐ Ratings Dataset Info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1149780 entries, 0 to 1149779
Data columns (total 3 columns):
 #   Column       Non-Null Count    Dtype 
---  ------       --------------    ----- 
 0   User-ID      1149780 non-null  int64 
 1   ISBN         1149780 non-null  object
 2   Book-Rating  1149780 non-null  int64 
dtypes: int64(2), object(1)
memory usage: 26.3+ MB



> **Observations:**
> - **Books dataset:** Includes `Book-Title`, `Book-Author`, `Year-Of-Publication`, `Publisher`. Useful for content-based recommendations.
> - **Users dataset:** Includes `Location` and `Age`, which can help with personalized recommendations. `Age` is missing for about 40% of users (168096 of 278858 values are non-null).
> - **Ratings dataset:** Includes `User-ID`, `ISBN`, and `Book-Rating`. This is the main interaction data for collaborative filtering.
>
> 💡 **UX Tip:** Always explore your dataset first to understand missing values, data types, and potential preprocessing steps.


# 🔢 Check Dataset Dimensions

Knowing the size of each dataset helps us understand how much data we are working with.  
We'll check the number of rows and columns for `Books`, `Users`, and `Ratings`.


In [5]:
# ✅ Display the shape (rows, columns) of each dataset
print(f"📖 Books Dataset Shape: {book.shape}  → Rows: {book.shape[0]}, Columns: {book.shape[1]}")
print(f"👤 Users Dataset Shape: {user.shape}  → Rows: {user.shape[0]}, Columns: {user.shape[1]}")
print(f"⭐ Ratings Dataset Shape: {rating.shape}  → Rows: {rating.shape[0]}, Columns: {rating.shape[1]}")

📖 Books Dataset Shape: (271360, 8)  → Rows: 271360, Columns: 8
👤 Users Dataset Shape: (278858, 3)  → Rows: 278858, Columns: 3
⭐ Ratings Dataset Shape: (1149780, 3)  → Rows: 1149780, Columns: 3


> **Observations:**
> - The `Books` dataset has 271360 rows and 8 columns → contains all books in our collection.
> - The `Users` dataset has 278858 rows and 3 columns → contains all users in our system.
> - The `Ratings` dataset has 1149780 rows and 3 columns → contains all user ratings, which is the main interaction data for our recommendation model.
>
> 💡 **UX Tip:** Knowing the dataset size helps anticipate computational requirements and preprocessing steps.

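# 🧹 Check for Missing Values

Before building our model, we need to check whether any dataset has missing values.  
Missing values can affect model performance, so it's important to identify and handle them early. In the cell below we handle them by dropping the affected rows and then confirming that no nulls remain.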


In [6]:
# Note: the *_clean frames are separate copies; later cells keep using
# the original `book` and `rating` frames.

# 📌 Drop missing values in Books dataset
book_clean = book.dropna()
print("📖 Books Dataset after dropping nulls:")
display(book_clean.isnull().sum())
print(f"Shape: {book_clean.shape}\n")

# 📌 Drop missing values in Users dataset
user_clean = user.dropna()
print("👤 Users Dataset after dropping nulls:")
display(user_clean.isnull().sum())
print(f"Shape: {user_clean.shape}\n")

# 📌 Ratings dataset has no null values, so no need to drop
rating_clean = rating.copy()
print("⭐ Ratings Dataset has no null values:")
display(rating_clean.isnull().sum())
print(f"Shape: {rating_clean.shape}")

📖 Books Dataset after dropping nulls:


ISBN                   0
Book-Title             0
Book-Author            0
Year-Of-Publication    0
Publisher              0
Image-URL-S            0
Image-URL-M            0
Image-URL-L            0
dtype: int64

Shape: (271353, 8)

👤 Users Dataset after dropping nulls:


User-ID     0
Location    0
Age         0
dtype: int64

Shape: (168096, 3)

⭐ Ratings Dataset has no null values:


User-ID        0
ISBN           0
Book-Rating    0
dtype: int64

Shape: (1149780, 3)


> **Observations:**
> - After `dropna()`, every column reports `0` missing values.
> - `Books` loses 7 rows (271360 → 271353). `Users` shrinks to 168096 rows because `Age` is missing for 110762 users. `Ratings` is unchanged.
> - Dropping rows is not the only option for a column with missing values:
>     - For numeric columns → consider filling with mean/median or dropping rows.
>     - For categorical columns → consider filling with mode or a placeholder.
>
> 💡 **UX Tip:** Handling missing values early ensures that the ML model receives clean and reliable data.
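The fill-in options listed in the observations can be sketched as follows. This is only an illustration on hypothetical miniature tables (`users`, `books`), not the approach the notebook takes, which drops rows instead.

```python
import numpy as np
import pandas as pd

# Hypothetical miniatures of the Users and Books tables
users = pd.DataFrame({"User-ID": [1, 2, 3, 4], "Age": [np.nan, 18.0, np.nan, 30.0]})
books = pd.DataFrame({"ISBN": ["a", "b"], "Book-Author": ["Jane Doe", None]})

# Numeric column: fill gaps with the median of the observed values
users_filled = users.assign(Age=users["Age"].fillna(users["Age"].median()))

# Categorical column: use an explicit placeholder instead of dropping the row
books_filled = books.assign(**{"Book-Author": books["Book-Author"].fillna("Unknown")})

print(users_filled)
print(books_filled)
```

Filling keeps every user and book in the data, at the cost of introducing values that were never observed, so whether it beats dropping depends on how the column is used later.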


# 🔍 Check for Duplicate Records

Duplicate records can skew our analysis and affect model performance.  
We will check if any duplicates exist in the `Books`, `Users`, and `Ratings` datasets.


In [7]:
# ✅ Check for duplicate rows in each dataset

print("📖 Books Dataset Duplicate Rows:", book.duplicated().sum())
print("👤 Users Dataset Duplicate Rows:", user.duplicated().sum())
print("⭐ Ratings Dataset Duplicate Rows:", rating.duplicated().sum())

📖 Books Dataset Duplicate Rows: 0
👤 Users Dataset Duplicate Rows: 0
⭐ Ratings Dataset Duplicate Rows: 0
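`duplicated()` without arguments only flags rows that match in every column. A stricter check, sketched here on hypothetical ratings, looks for repeated keys such as the same user rating the same ISBN twice. Whether such repeats exist in the real data is not shown in this notebook.

```python
import pandas as pd

# Hypothetical ratings: user 1 rated ISBN "A" twice with different scores
ratings = pd.DataFrame({
    "User-ID": [1, 1, 2],
    "ISBN": ["A", "A", "B"],
    "Book-Rating": [5, 8, 7],
})

# Rows identical in every column
full_row_dupes = ratings.duplicated().sum()

# Rows that repeat the (User-ID, ISBN) key, regardless of the rating value
key_dupes = ratings.duplicated(subset=["User-ID", "ISBN"]).sum()

print(full_row_dupes, key_dupes)
```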


# 🔍 Exploratory Data Analysis (EDA)

In [8]:
author_count = book['Book-Author'].value_counts().reset_index(name='Count').rename(columns={'index':'Book-Author'}).head(15)
fig_authors = px.bar(
    author_count,
    x='Book-Author',
    y='Count',
    title='Most Published Authors (Top 15)',
    labels={'Book-Author':'Author','Count':'Number of Books'},
    color='Count',
    color_continuous_scale='Blues'
)
fig_authors.update_layout(xaxis_tickangle=45, title_font_size=20)
fig_authors.show()

In [9]:
publisher_count = book['Publisher'].value_counts().reset_index(name='Count').rename(columns={'index':'Publisher'}).head(15)
fig_publishers = px.bar(
    publisher_count,
    x='Publisher',
    y='Count',
    title='Top Publishers (Top 15)',
    labels={'Publisher':'Publisher','Count':'Number of Books'},
    color='Count',
    color_continuous_scale='Purples'
)
fig_publishers.update_layout(xaxis_tickangle=45, title_font_size=20)
fig_publishers.show()

# 🤖 Types of Recommendation Systems

There are several approaches to building recommendation systems. Here are the main types:

1. **Popularity-Based Recommendation**
   - Suggests items that are popular among all users.
   - Simple but ignores personal preferences.
   - Example: "Top 10 most-read books this month."

2. **Content-Based Recommendation**
   - Suggests items similar to what the user has liked before.
   - Uses features of items like genre, author, or description.
   - Example: Recommending books by the same author or genre.

3. **Collaborative Filtering**
   - Suggests items based on the behavior of similar users.
   - Can be **user-based** (find users with similar tastes) or **item-based** (find items that receive similar ratings from the same users).
   - Example: "Users who liked Book A also liked Book B."

4. **Hybrid Recommendation**
   - Combines two or more approaches (e.g., content + collaborative).
   - Often more accurate and flexible than a single approach.
   - Example: Netflix is commonly cited as using a hybrid of user behavior and content similarity.


# 📈 Popularity-Based Recommendation System

In a **popularity-based recommendation system**, we suggest items that are generally popular among all users.  
Here, we will display the **top 100 books** that have received at least **300 ratings**.  

# 📊 Prepare Rating Data

To create recommendation systems, we need to know:

1. **Number of Ratings per Book** → Helps filter out books with very few ratings.
2. **Average Rating per Book** → Helps rank books based on user satisfaction.

We'll merge the `Ratings` and `Books` datasets to calculate these metrics.


In [10]:
# ✅ Merge Ratings with Book Details
rating_with_name = rating.merge(book, on='ISBN')

# 🔹 Calculate Number of Ratings per Book
num_rating_df = rating_with_name.groupby('Book-Title').count()['Book-Rating'].reset_index()
num_rating_df.rename(columns={'Book-Rating':'Num_rating'}, inplace=True)

# head() shows the first rows in alphabetical title order, not a ranking
print("📖 Number of Ratings per Book (first 5 rows):")
display(num_rating_df.head())

# 🔹 Calculate Average Rating per Book
avg_rating_df = rating_with_name.groupby('Book-Title').mean(numeric_only=True)['Book-Rating'].reset_index()
avg_rating_df.rename(columns={'Book-Rating':'Avg_rating'}, inplace=True)

print("⭐ Average Rating per Book (first 5 rows):")
display(avg_rating_df.head())

📖 Number of Ratings per Book (first 5 rows):

| | Book-Title | Num_rating |
|---|---|---|
| 0 | A Light in the Storm: The Civil War Diary of ... | 4 |
| 1 | Always Have Popsicles | 1 |
| 2 | Apple Magic (The Collector's series) | 1 |
| 3 | Ask Lily (Young Women of Faith: Lily Series, ... | 1 |
| 4 | Beyond IBM: Leadership Marketing and Finance ... | 1 |

⭐ Average Rating per Book (first 5 rows):

| | Book-Title | Avg_rating |
|---|---|---|
| 0 | A Light in the Storm: The Civil War Diary of ... | 2.25 |
| 1 | Always Have Popsicles | 0.0 |
| 2 | Apple Magic (The Collector's series) | 0.0 |
| 3 | Ask Lily (Young Women of Faith: Lily Series, ... | 8.0 |
| 4 | Beyond IBM: Leadership Marketing and Finance ... | 0.0 |
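The count and the mean can also be computed in a single `groupby` pass using pandas named aggregation, which avoids the later merge of two summary tables. The sketch below uses a hypothetical five-row `ratings` frame; the column names `Num_rating` and `Avg_rating` mirror the ones used in this notebook.

```python
import pandas as pd

# Hypothetical ratings for two titles
ratings = pd.DataFrame({
    "Book-Title": ["A", "A", "B", "B", "B"],
    "Book-Rating": [0, 8, 10, 0, 5],
})

# One pass: count and mean of Book-Rating per title
summary = (
    ratings.groupby("Book-Title")["Book-Rating"]
    .agg(Num_rating="count", Avg_rating="mean")
    .reset_index()
)

print(summary)
```

Ratings of `0` count toward both numbers here, just as they do in the cell above.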


In [11]:
# ✅ Merge both summaries
final_rating = num_rating_df.merge(avg_rating_df, on='Book-Title')

# 🧹 Filter books with at least 50 ratings (optional, avoids bias)
popular_books = final_rating[final_rating['Num_rating'] >= 50]

# 🔝 Get top 10 books by average rating
top_books = popular_books.sort_values('Avg_rating', ascending=False).head(10)

# 📊 Create a bar chart
fig = px.bar(
    top_books,
    x='Book-Title',
    y='Avg_rating',
    color='Num_rating',
    title='Top 10 Highest Rated Books (with ≥ 50 Ratings)',
    labels={'Book-Title': 'Book Title', 'Avg_rating': 'Average Rating', 'Num_rating': 'Number of Ratings'},
    color_continuous_scale='Blues'
)

fig.update_layout(xaxis_tickangle=45, title_font_size=20)
fig.show()


> **Observations:**
> - `Num_rating` tells us how many ratings each book received.
> - `Avg_rating` is the mean `Book-Rating` per title. Ratings of `0` are included in this mean, which pulls the averages down.
> - Combining these two metrics helps us **filter popular books with enough ratings** and rank them for recommendations.
>
> 💡 **UX Tip:** For popularity-based recommendations, it's good to **consider both number of ratings and average rating** to avoid suggesting books with very few but high ratings.
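One common way to combine the two metrics into a single score, which this notebook does not use, is a weighted (Bayesian-style) average that shrinks each book's mean toward a global mean when it has few ratings. The sketch below assumes a prior weight `m` and a global mean `C` chosen purely for illustration, and the two book rows are hypothetical.

```python
import pandas as pd

# Hypothetical per-book summary in the same shape as final_rating
stats = pd.DataFrame({
    "Book-Title": ["Niche Gem", "Crowd Favourite"],
    "Num_rating": [2, 400],
    "Avg_rating": [10.0, 8.0],
})

m = 50    # assumed prior weight: how many ratings count as "enough"
C = 5.0   # assumed global mean rating across all books

v = stats["Num_rating"]
R = stats["Avg_rating"]

# Weighted rating: books with few ratings are pulled toward the global mean C
stats["Weighted"] = (v / (v + m)) * R + (m / (v + m)) * C

ranked = stats.sort_values("Weighted", ascending=False)
print(ranked)
```

With these numbers the book with two perfect ratings scores about 5.19, while the heavily rated book scores about 7.67, so the ranking favors the book with more evidence behind it.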


# 📈 Top 100 Popular Books

Now that we have prepared the rating data, let's create a **Popularity-Based Recommendation System**:

- Filter books with at least **300 ratings**.
- Sort by **average rating** in descending order.
- Display **top 100 books** along with important details like author, publisher, and cover image.

> 💡 **UX Tip:** This gives a quick view of the most popular books among users.


In [12]:
# ✅ Combine Number of Ratings and Average Ratings
popular_df = num_rating_df.merge(avg_rating_df, on='Book-Title')

# 🔹 Filter books with at least 300 ratings and get top 100 by average rating
pbr_df = popular_df[popular_df['Num_rating'] >= 300].sort_values('Avg_rating', ascending=False).head(100)

# 🔹 Merge with book details and remove duplicates
pbr_df = pbr_df.merge(book, on='Book-Title').drop_duplicates('Book-Title')

# Select important columns for display
pbr_df = pbr_df[['Book-Title', 'Book-Author', 'Publisher', 'Image-URL-L', 'Num_rating', 'Avg_rating']]

# Display top 100 popular books
display(pbr_df)

| | Book-Title | Book-Author | Publisher | Image-URL-L | Num_rating | Avg_rating |
|---|---|---|---|---|---|---|
| 0 | Harry Potter and the Prisoner of Azkaban (Book 3) | J. K. Rowling | Scholastic | http://images.amazon.com/images/P/0439136350.0... | 428 | 5.852804 |
| 3 | Harry Potter and the Goblet of Fire (Book 4) | J. K. Rowling | Scholastic | http://images.amazon.com/images/P/0439139597.0... | 387 | 5.824289 |
| 5 | Harry Potter and the Order of the Phoenix (Boo... | J. K. Rowling | Scholastic | http://images.amazon.com/images/P/043935806X.0... | 347 | 5.501441 |
| 9 | Harry Potter and the Chamber of Secrets (Book 2) | J. K. Rowling | Scholastic | http://images.amazon.com/images/P/0439064872.0... | 556 | 5.183453 |
| 12 | The Fellowship of the Ring (The Lord of the Ri... | J.R.R. TOLKIEN | Del Rey | http://images.amazon.com/images/P/0345339703.0... | 368 | 4.948370 |
| ... | ... | ... | ... | ... | ... | ... |
| 355 | The Cider House Rules | John Irving | Bantam Books | http://images.amazon.com/images/P/0553258001.0... | 393 | 2.969466 |
| 363 | The Alienist | Caleb Carr | Bantam Books | http://images.amazon.com/images/P/0553572997.0... | 350 | 2.965714 |
| 365 | Violets Are Blue | James Patterson | Warner Vision | http://images.amazon.com/images/P/0446611212.0... | 379 | 2.955145 |
| 369 | The Rainmaker | JOHN GRISHAM | Dell | http://images.amazon.com/images/P/044022165X.0... | 501 | 2.922156 |


In [13]:
# Select top 20 for better visualization
top_n = 20
top_books = pbr_df.head(top_n)

fig = px.bar(
    top_books,
    x='Avg_rating',
    y='Book-Title',
    color='Num_rating',
    orientation='h',
    hover_data=['Book-Author', 'Publisher'],
    color_continuous_scale='bluered',
    title=f'Top {top_n} Books by Average Rating (≥300 Ratings)'
)

fig.update_layout(yaxis={'categoryorder': 'total ascending'})
fig.show()


> **Observations:**
> - These are the **most popular books** with a significant number of ratings.
> - Columns displayed:
>     - `Book-Title`, `Book-Author`, `Publisher` → Basic info
>     - `Image-URL-L` → Cover image for UI display
>     - `Num_rating` → Number of ratings received
>     - `Avg_rating` → Average user rating
>
> 💡 **UX Tip:** This table can be used to show a "Top 100 Books" section on a website or app.


# 💾 Save Popular Books Recommendations

After generating the **top 100 popular books**, we save this dataset using `pickle` so it can be **easily loaded later** without recalculating.  

> 💡 **UX Tip:** Saving the DataFrame makes it reusable in apps, dashboards, or future analyses.


In [14]:
# ✅ Save the top 100 popular books DataFrame as a pickle file
import pickle

# Use a context manager so the file is closed even if dump() fails
with open('PopularBookRecommendation.pkl', 'wb') as f:
    pickle.dump(pbr_df, f)

print("✅ Popular books recommendation saved as 'PopularBookRecommendation.pkl'")

✅ Popular books recommendation saved as 'PopularBookRecommendation.pkl'
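Loading the file back is the mirror image of saving it. The round trip below is a minimal sketch that writes a hypothetical two-row DataFrame (`top_books_demo`) to a temporary directory, so it does not touch the real `PopularBookRecommendation.pkl`.

```python
import os
import pickle
import tempfile

import pandas as pd

# Hypothetical stand-in for pbr_df
top_books_demo = pd.DataFrame({"Book-Title": ["A", "B"], "Avg_rating": [5.8, 5.5]})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "PopularBookRecommendation.pkl")

    # Save: binary write mode
    with open(path, "wb") as f:
        pickle.dump(top_books_demo, f)

    # Load: binary read mode
    with open(path, "rb") as f:
        restored = pickle.load(f)

print(restored.equals(top_books_demo))
```

Pickle files can execute code when loaded, so an app should only unpickle files it created itself.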


# 🤝 Collaborative Filtering-Based Recommendation System

Collaborative Filtering (CF) recommends items (books) based on **user behavior**:

- **Idea:** Users with similar tastes tend to like similar books  
- **Approach:** Build a user-item matrix and predict ratings for unseen books  
- **Types:**  
  1. **User-User CF**: Recommend books liked by similar users  
  2. **Item-Item CF**: Recommend books similar to what the user already liked  

> ℹ️ Here, our criteria are:  
> - Only users who have rated **more than 250 books** are included  
> - Only books rated by **at least 50** of those users are included  

> 💡 **UX Tip:** CF does **not require book content** (like author or genre); it relies purely on **user ratings**.  
> The ratings were already merged with book titles (`rating_with_name`), which keeps the recommendations readable.


# 🤝 Collaborative Filtering: Prepare Data

Here, our criteria are:  
- Users who have rated **more than 250 books**  
- Books rated by **at least 50** of those users  

We will filter the dataset accordingly and create a **Book-User pivot table** for collaborative filtering.


In [15]:
# ✅ Step 1: Keep users who rated more than 250 books
b = rating_with_name.groupby('User-ID').count()['Book-Rating'] > 250
users_with_ratings = b[b].index
filtered_rating = rating_with_name[rating_with_name['User-ID'].isin(users_with_ratings)]

# ✅ Step 2: Keep books rated by >= 50 of those users
c = filtered_rating.groupby('Book-Title').count()['Book-Rating'] >= 50
famous_books = c[c].index
final_ratings = filtered_rating[filtered_rating['Book-Title'].isin(famous_books)]

# ✅ Step 3: Create Book-User pivot table
pt = final_ratings.pivot_table(index='Book-Title', columns='User-ID', values='Book-Rating')

# Fill missing values with 0
pt.fillna(0, inplace=True)

print("✅ Book-User Pivot Table created:")
display(pt)

✅ Book-User Pivot Table created:


User-ID,254,2276,2766,3363,4385,6251,6543,6575,7158,7346,...,271705,273979,274004,274061,274301,274308,275970,277427,277639,278418
1984,9.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,8.0,...,10.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1st to Die: A Novel,0.0,0.0,0.0,0.0,0.0,0.0,9.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2nd Chance,0.0,10.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4 Blondes,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
A Bend in the Road,0.0,0.0,7.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
Wuthering Heights,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
You Belong To Me,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
Zen and the Art of Motorcycle Maintenance: An Inquiry into Values,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
Zoya,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
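The three steps (filter users, filter books, pivot) are easier to follow on a table small enough to check by hand. The sketch below uses hypothetical ratings and toy thresholds of 2 with `>=` on both filters; the names `MIN_USER_RATINGS` and `MIN_BOOK_RATINGS` are illustrative and do not appear in the notebook.

```python
import pandas as pd

# Hypothetical ratings: user 4 rated only one book
ratings = pd.DataFrame({
    "User-ID":     [1, 1, 2, 2, 3, 3, 4],
    "Book-Title":  ["A", "B", "A", "C", "B", "C", "A"],
    "Book-Rating": [8, 6, 7, 9, 5, 4, 10],
})

MIN_USER_RATINGS = 2  # toy threshold; the notebook uses a far larger one
MIN_BOOK_RATINGS = 2  # toy threshold

# Step 1: keep users with enough ratings
user_counts = ratings.groupby("User-ID")["Book-Rating"].count()
active_users = user_counts[user_counts >= MIN_USER_RATINGS].index
filtered = ratings[ratings["User-ID"].isin(active_users)]

# Step 2: keep books with enough ratings from those users
book_counts = filtered.groupby("Book-Title")["Book-Rating"].count()
rated_books = book_counts[book_counts >= MIN_BOOK_RATINGS].index
filtered = filtered[filtered["Book-Title"].isin(rated_books)]

# Step 3: books as rows, users as columns, 0 where a user gave no rating
pivot = filtered.pivot_table(
    index="Book-Title", columns="User-ID", values="Book-Rating"
).fillna(0)

# Share of cells that are 0, a rough measure of how sparse the matrix is
sparsity = (pivot == 0).to_numpy().mean()

print(pivot)
print(f"Sparsity: {sparsity:.2f}")
```

In this toy case user 4 is filtered out, and one third of the remaining 3×3 matrix is zero. In a filled matrix a `0` can mean either "no rating" or an actual rating of 0, which is worth keeping in mind when interpreting similarities.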


In [16]:
book_stats = final_ratings.groupby('Book-Title')['Book-Rating'].agg(['mean', 'count']).reset_index()

# A 2D scatter is enough here: popularity on x, average rating on y,
# with the title shown on hover instead of being used as a third axis
fig = px.scatter(
    book_stats,
    x='count',
    y='mean',
    color='mean',
    size='count',
    hover_name='Book-Title',
    title='Book Popularity vs Average Rating',
    labels={'count': 'Number of Ratings', 'mean': 'Average Rating'}
)
fig.show()


# 📊 Compute Book Similarity

We will calculate **cosine similarity** between books using the Book-User pivot table:

- Each row represents a book and each column represents a user; each cell holds that user's rating for the book (0 if there is none).
- Cosine similarity helps find **books that are rated similarly by users**, which is the basis of **Item-Item Collaborative Filtering**.
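To make the measure concrete, here is a small hand-checkable sketch of cosine similarity written with NumPy only. The three rating vectors are hypothetical rows of a book-user matrix, and the helper `cosine` is illustrative, not part of the notebook.

```python
import numpy as np

# Hypothetical rows of a book-user matrix (one entry per user)
book_a = np.array([9.0, 0.0, 8.0])
book_b = np.array([10.0, 0.0, 7.0])
book_c = np.array([0.0, 6.0, 0.0])

def cosine(u, v):
    """Cosine of the angle between u and v: dot(u, v) / (|u| * |v|)."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Rated by the same users with similar scores: close to 1
print(round(cosine(book_a, book_b), 3))

# No user in common: exactly 0
print(cosine(book_a, book_c))
```

Because the measure depends on the angle between vectors and not their length, two books rated by the same users in similar proportions score high even if one has systematically higher ratings.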


In [17]:
from sklearn.metrics.pairwise import cosine_similarity

# Compute cosine similarity between books
similarity_scores = cosine_similarity(pt)

# Convert to DataFrame for easy lookup
similarity_df = pd.DataFrame(similarity_scores, index=pt.index, columns=pt.index)

print("✅ Book-Book Similarity Matrix created")
print(f"Shape: {similarity_scores.shape}")
display(similarity_df.head())

✅ Book-Book Similarity Matrix created
Shape: (574, 574)


Book-Title,1984,1st to Die: A Novel,2nd Chance,4 Blondes,A Bend in the Road,A Case of Need,"A Child Called \It\"": One Child's Courage to Survive""",A Civil Action,A Fine Balance,A Heartbreaking Work of Staggering Genius,...,Wild Animus,Winter Moon,Winter Solstice,Wish You Well,Without Remorse,Wuthering Heights,You Belong To Me,Zen and the Art of Motorcycle Maintenance: An Inquiry into Values,Zoya,"\O\"" Is for Outlaw"""
1984,1.0,0.126378,0.015848,0.0,0.065831,0.033855,0.110083,0.17651,0.051155,0.081486,...,0.020942,0.042781,0.067436,0.017203,0.012433,0.010667,0.020272,0.138729,0.084505,0.05211
1st to Die: A Novel,0.126378,1.0,0.288248,0.0,0.111258,0.115639,0.148294,0.226466,0.061774,0.088561,...,0.092666,0.08967,0.080145,0.109508,0.16577,0.070055,0.207731,0.080104,0.181171,0.161724
2nd Chance,0.015848,0.288248,1.0,0.0,0.083826,0.127433,0.0,0.136987,0.057509,0.0,...,0.042567,0.135653,0.195479,0.273257,0.025272,0.021682,0.187711,0.051658,0.056185,0.128939
4 Blondes,0.0,0.0,0.0,1.0,0.0,0.126977,0.0,0.0,0.0,0.163726,...,0.025247,0.0,0.0,0.0,0.0,0.135025,0.081463,0.0,0.0,0.0
A Bend in the Road,0.065831,0.111258,0.083826,0.0,1.0,0.115553,0.120556,0.047974,0.0,0.006941,...,0.092884,0.09136,0.091934,0.115863,0.026551,0.028036,0.105453,0.042847,0.121303,0.017158


In [18]:
# Sample 15 books and compare them with each other, so rows and columns
# show the same titles (random_state is fixed only for reproducibility)
sample_titles = similarity_df.sample(15, random_state=42).index
sample_books = similarity_df.loc[sample_titles, sample_titles]

fig = px.imshow(
    sample_books,
    color_continuous_scale='Viridis',
    title='Book-to-Book Similarity Heatmap (Sample)',
    labels=dict(x="Book", y="Book", color="Similarity")
)
fig.update_xaxes(showticklabels=False)
fig.update_yaxes(showticklabels=False)
fig.show()


# 📚 Item-Item Book Recommendation

- Input a book title and get **8 similar books** recommended based on user ratings.  
- Recommendations include **Book Title, Author, and Image** for better UX.  
- Works best for books that have been rated by many users.


In [19]:
def recommendation(book_name):
    """
    Recommends 8 books similar to the given book_name based on user ratings.

    Parameters:
        book_name (str): The title of the book for which recommendations are needed.

    Returns:
        data (list): A list of recommended books with [Book-Title, Book-Author, Image-URL-L]

    Raises:
        ValueError: If book_name is not in the pivot table (for example,
            because it did not pass the rating-count filters).
    """
    if book_name not in pt.index:
        raise ValueError(f"'{book_name}' is not in the pivot table")

    # Fetch the row position of the book in the pivot table
    index = pt.index.get_loc(book_name)

    # Rank all other books by similarity (highest first) and keep the top 8
    scores = [(i, s) for i, s in enumerate(similarity_scores[index]) if i != index]
    similar_items = sorted(scores, key=lambda x: x[1], reverse=True)[:8]

    # Fetch book details for recommendations
    data = []
    for i, _ in similar_items:
        # A title can map to several ISBNs, so take the first matching row
        match = book[book['Book-Title'] == pt.index[i]].iloc[0]
        data.append([match['Book-Title'], match['Book-Author'], match['Image-URL-L']])

    return data

In [20]:
# Get recommendations for the book
recommended_books = recommendation("You Belong To Me")

# Convert the list of recommendations to a DataFrame for display
rec_df = pd.DataFrame(recommended_books, columns=['Book-Title', 'Book-Author', 'Image-URL-L'])
rec_df

| | Book-Title | Book-Author | Image-URL-L |
|---|---|---|---|
| 0 | Loves Music, Loves to Dance | Mary Higgins Clark | http://images.amazon.com/images/P/0671758896.0... |
| 1 | I'll Be Seeing You | Mary Higgins Clark | http://images.amazon.com/images/P/0671888587.0... |
| 2 | Daddy's Little Girl | Mary Higgins Clark | http://images.amazon.com/images/P/0743206045.0... |
| 3 | Before I Say Good-Bye | Mary Higgins Clark | http://images.amazon.com/images/P/0671004573.0... |
| 4 | All Around the Town | Mary Higgins Clark | http://images.amazon.com/images/P/0671793489.0... |
| 5 | My Gal Sunday | Mary Higgins Clark | http://images.amazon.com/images/P/0684832291.0... |
| 6 | Moonlight Becomes You | Mary Higgins Clark | http://images.amazon.com/images/P/0671867113.0... |
| 7 | Let Me Call You Sweetheart | Mary Higgins Clark | http://images.amazon.com/images/P/0671568175.0... |
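The same lookup can be expressed directly on a labeled similarity DataFrame, which avoids tracking integer positions. The sketch below uses a hypothetical 4×4 similarity matrix; `most_similar` is an illustrative helper, not a function defined in this notebook.

```python
import pandas as pd

# Hypothetical symmetric similarity matrix for four titles
titles = ["A", "B", "C", "D"]
sim = pd.DataFrame(
    [[1.0, 0.9, 0.1, 0.4],
     [0.9, 1.0, 0.2, 0.3],
     [0.1, 0.2, 1.0, 0.8],
     [0.4, 0.3, 0.8, 1.0]],
    index=titles, columns=titles,
)

def most_similar(title, sim_df, n=2):
    """Return the n titles most similar to `title`, excluding the title itself."""
    if title not in sim_df.index:
        return []  # unknown title: nothing to recommend
    return sim_df.loc[title].drop(title).nlargest(n).index.tolist()

print(most_similar("A", sim))
print(most_similar("C", sim))
print(most_similar("Missing", sim))
```

Dropping the query title by label, instead of skipping the first sorted element, stays correct even when another book ties with a similarity of 1.0.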


In [21]:
from IPython.display import HTML

# Function to render book covers in a table
def display_books(df):
    df_html = "<table>"
    for _, row in df.iterrows():
        df_html += f"""
        <tr>
            <td><img src="{row['Image-URL-L']}" width="100"></td>
            <td><b>{row['Book-Title']}</b><br>by {row['Book-Author']}</td>
        </tr>
        """
    df_html += "</table>"
    return HTML(df_html)

# Display recommendations
display_books(rec_df)


(Rendered HTML table: one row per recommendation; cover images omitted.)

Loves Music, Loves to Dance by Mary Higgins Clark
I'll Be Seeing You by Mary Higgins Clark
Daddy's Little Girl by Mary Higgins Clark
Before I Say Good-Bye by Mary Higgins Clark
All Around the Town by Mary Higgins Clark
My Gal Sunday by Mary Higgins Clark
Moonlight Becomes You by Mary Higgins Clark
Let Me Call You Sweetheart by Mary Higgins Clark

# 💾 Save Data & Models for Collaborative Filtering

We are saving the essential objects so we can **load them later** for recommendations without recomputing:

1. **Book-User Pivot Table** (`pt`)  
2. **Books Dataset** (`book`)  
3. **Book-Book Similarity Scores** (`similarity_scores`)  

> This allows us to create a **real-time recommendation system** efficiently.


In [22]:
import pickle

# Context managers make sure each file is closed after writing

# Save Book-User pivot table
with open('pt.pkl', 'wb') as f:
    pickle.dump(pt, f)

# Save Books dataset
with open('book.pkl', 'wb') as f:
    pickle.dump(book, f)

# Save Book-Book similarity scores
with open('similarity_scores.pkl', 'wb') as f:
    pickle.dump(similarity_scores, f)

print("✅ Pickle files saved: pt.pkl, book.pkl, similarity_scores.pkl")

✅ Pickle files saved: pt.pkl, book.pkl, similarity_scores.pkl
