This cell imports the core Python libraries used throughout the notebook. NumPy supports numerical operations and Pandas handles data loading and preprocessing. scikit-learn (for similarity computation) and pickle (for saving results) are imported later, where they are first used.

In [263]:
import numpy as np
import pandas as pd

The book, user, and rating datasets are loaded from CSV files into Pandas DataFrames. These datasets form the foundation of the recommender system.

In [264]:
books = pd.read_csv('books.csv')
users = pd.read_csv('users.csv')
ratings = pd.read_csv('ratings.csv')


This cell looks up one image URL in the books dataset to confirm that the cover image links loaded correctly and are available for display in the UI.

In [265]:
books['Image-URL-M'][1]

'http://images.amazon.com/images/P/0002005018.01.MZZZZZZZ.jpg'

Displays the first few rows of the users dataset to understand its structure, columns, and sample values.

In [266]:
users.head()

Unnamed: 0,User-ID,Location,Age
0,1,"nyc, new york, usa",
1,2,"stockton, california, usa",18.0
2,3,"moscow, yukon territory, russia",
3,4,"porto, v.n.gaia, portugal",17.0
4,5,"farnborough, hants, united kingdom",


Shows the ratings dataset to inspect user–book interactions and rating values.

In [267]:
ratings.head()

Unnamed: 0,User-ID,ISBN,Book-Rating
0,276725,034545104X,0
1,276726,0155061224,5
2,276727,0446520802,0
3,276729,052165615X,3
4,276729,0521795028,6


Prints the shape (rows, columns) of each dataset to gauge its size before preprocessing.

In [268]:
print(books.shape)
print(ratings.shape)
print(users.shape)

(271360, 8)
(1149780, 3)
(278858, 3)


Check for null values in the books dataset. Only a few entries are missing: 2 authors, 2 publishers, and 3 large image URLs.


In [269]:
books.isnull().sum()

ISBN                   0
Book-Title             0
Book-Author            2
Year-Of-Publication    0
Publisher              2
Image-URL-S            0
Image-URL-M            0
Image-URL-L            3
dtype: int64

Check for null values in the users dataset. Age is missing for 110762 of the 278858 users; User-ID and Location are complete.


In [270]:
users.isnull().sum()

User-ID          0
Location         0
Age         110762
dtype: int64
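
As a rough sketch of how these counts can be put in proportion: `isnull().mean()` gives the share of missing values per column. In the output above, 110762 of the 278858 users have no Age, which is roughly 39.7%. The toy frame below is made up for illustration and is not taken from the dataset.

```python
import pandas as pd

# Hypothetical toy frame; only the column names mirror the users dataset.
toy_users = pd.DataFrame({
    "User-ID": [1, 2, 3, 4],
    "Age": [None, 18.0, None, 17.0],
})

# The mean of a boolean mask is the fraction of True values,
# i.e. the share of nulls in each column.
missing_share = toy_users.isnull().mean()
print(missing_share)
```

A share is often easier to act on than a raw count when deciding whether a column such as Age is usable at all.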

Check for null values in the ratings dataset. No column has missing values.


In [271]:
ratings.isnull().sum()

User-ID        0
ISBN           0
Book-Rating    0
dtype: int64

Check for fully duplicated rows in the books dataset.

In [272]:
books.duplicated().sum()

np.int64(0)

Check for fully duplicated rows in the ratings dataset.

In [273]:
ratings.duplicated().sum()

np.int64(0)

Check for fully duplicated rows in the users dataset.

In [274]:
users.duplicated().sum()

np.int64(0)

## Popularity Based Recommender System

This step merges the user ratings dataset with the books dataset using the ISBN as a common key. The resulting DataFrame associates each user rating with the corresponding book title and metadata, enabling meaningful analysis and recommendation generation based on book-level information.

In [275]:
ratings_with_name = ratings.merge(books,on='ISBN')
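
One detail worth illustrating: `merge` defaults to an inner join, so ratings whose ISBN does not appear in the books table are dropped without any message. The sketch below uses made-up toy tables (the names `toy_ratings` and `toy_books` are hypothetical) to show the effect and one way to count the unmatched rows.

```python
import pandas as pd

# Toy stand-ins for the real tables; all values are made up for illustration.
toy_ratings = pd.DataFrame({
    "User-ID": [1, 1, 2],
    "ISBN": ["A", "B", "Z"],   # "Z" has no entry in toy_books
    "Book-Rating": [5, 0, 8],
})
toy_books = pd.DataFrame({
    "ISBN": ["A", "B"],
    "Book-Title": ["Book A", "Book B"],
})

# Default how="inner": the rating for the unknown ISBN "Z" is dropped.
inner = toy_ratings.merge(toy_books, on="ISBN")

# how="left" with indicator=True keeps every rating and flags unmatched ones.
left = toy_ratings.merge(toy_books, on="ISBN", how="left", indicator=True)
unmatched = int((left["_merge"] == "left_only").sum())

print(len(toy_ratings), len(inner), unmatched)
```

Comparing `len(ratings)` with `len(ratings_with_name)` in the notebook would show in the same way how many ratings were lost in the join.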

This cell groups the merged ratings dataset by book title and counts the total number of ratings each book has received. The count is stored as a new column named `num_ratings`, which is later used to identify popular books and apply popularity-based filtering.

In [276]:
num_rating_df = ratings_with_name.groupby('Book-Title').count()['Book-Rating'].reset_index()
num_rating_df.rename(columns={'Book-Rating':'num_ratings'},inplace=True)
num_rating_df

Unnamed: 0,Book-Title,num_ratings
0,A Light in the Storm: The Civil War Diary of ...,4
1,Always Have Popsicles,1
2,Apple Magic (The Collector's series),1
3,"Ask Lily (Young Women of Faith: Lily Series, ...",1
4,Beyond IBM: Leadership Marketing and Finance ...,1
...,...,...
241066,Ã?Â?lpiraten.,2
241067,Ã?Â?rger mit Produkt X. Roman.,4
241068,Ã?Â?sterlich leben.,1
241069,Ã?Â?stlich der Berge.,3


This cell displays the data types of each column in the merged ratings dataset. It is used to verify that columns such as user IDs, book titles, and ratings have appropriate data types before further preprocessing and analysis.


In [277]:
ratings_with_name.dtypes


User-ID                 int64
ISBN                      str
Book-Rating             int64
Book-Title                str
Book-Author               str
Year-Of-Publication    object
Publisher                 str
Image-URL-S               str
Image-URL-M               str
Image-URL-L               str
dtype: object

This step converts the `Book-Rating` column to a numeric data type. Any invalid or non-numeric values are coerced into missing values (`NaN`). The dtype listing above already reports `Book-Rating` as `int64`, so here the conversion acts as a safeguard before the numerical operations that follow.


In [278]:
ratings_with_name['Book-Rating'] = pd.to_numeric(
    ratings_with_name['Book-Rating'],
    errors='coerce'
)
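
To make the effect of `errors='coerce'` concrete, here is a minimal sketch on a made-up Series (the values are hypothetical and not from the ratings data). Entries that cannot be parsed become `NaN` instead of raising an error, so counting the `NaN` values afterwards shows how many entries were invalid.

```python
import pandas as pd

# Hypothetical raw column mixing numeric strings, a bad token, and an integer.
raw = pd.Series(["5", "0", "n/a", 7], dtype=object)

# Unparseable entries ("n/a") become NaN; the result is a float column.
clean = pd.to_numeric(raw, errors="coerce")

# Number of entries that could not be converted.
n_bad = int(clean.isna().sum())
print(clean.tolist(), n_bad)
```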


This cell calculates the average rating for each book by grouping the data by book title and computing the mean of the corresponding ratings. The result is stored in a new column named `avg_rating`, which is later used to assess overall book quality and support ranking or filtering decisions.


In [279]:

avg_rating_df = (
    ratings_with_name
    .groupby('Book-Title')['Book-Rating']
    .mean()
    .reset_index(name='avg_rating')
)

avg_rating_df

Unnamed: 0,Book-Title,avg_rating
0,A Light in the Storm: The Civil War Diary of ...,2.250000
1,Always Have Popsicles,0.000000
2,Apple Magic (The Collector's series),0.000000
3,"Ask Lily (Young Women of Faith: Lily Series, ...",8.000000
4,Beyond IBM: Leadership Marketing and Finance ...,0.000000
...,...,...
241066,Ã?Â?lpiraten.,0.000000
241067,Ã?Â?rger mit Produkt X. Roman.,5.250000
241068,Ã?Â?sterlich leben.,7.000000
241069,Ã?Â?stlich der Berge.,2.666667


This step merges the number of ratings per book with the average rating per book into a single DataFrame. The resulting dataset combines popularity and quality indicators, which are used to identify and rank popular books.


In [280]:
popular_df = num_rating_df.merge(avg_rating_df,on='Book-Title')
popular_df

Unnamed: 0,Book-Title,num_ratings,avg_rating
0,A Light in the Storm: The Civil War Diary of ...,4,2.250000
1,Always Have Popsicles,1,0.000000
2,Apple Magic (The Collector's series),1,0.000000
3,"Ask Lily (Young Women of Faith: Lily Series, ...",1,8.000000
4,Beyond IBM: Leadership Marketing and Finance ...,1,0.000000
...,...,...,...
241066,Ã?Â?lpiraten.,2,0.000000
241067,Ã?Â?rger mit Produkt X. Roman.,4,5.250000
241068,Ã?Â?sterlich leben.,1,7.000000
241069,Ã?Â?stlich der Berge.,3,2.666667
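
The two groupby passes and the merge can also be expressed as one aggregation. The following is a sketch of that alternative on a made-up toy frame, not a replacement for the notebook's cells; the frame `toy` and its values are assumptions for illustration.

```python
import pandas as pd

# Toy ratings; only the column names mirror the notebook.
toy = pd.DataFrame({
    "Book-Title": ["A", "A", "A", "B", "B", "C"],
    "Book-Rating": [10, 8, 0, 6, 0, 9],
})

# Named aggregation: compute the count and the mean per title in a single pass.
stats = (
    toy.groupby("Book-Title")["Book-Rating"]
    .agg(num_ratings="count", avg_rating="mean")
    .reset_index()
)
print(stats)
```

The result has the same three columns as `popular_df` at this point (`Book-Title`, `num_ratings`, `avg_rating`).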


This line filters the dataset to include only books with at least 250 ratings, ensuring sufficient user interaction for reliability. The filtered books are then sorted by average rating in descending order, and the top 50 books are selected to represent the most popular and highly rated titles.


In [281]:
popular_df = popular_df[popular_df['num_ratings']>=250].sort_values('avg_rating',ascending=False).head(50)
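
The filter-then-rank step can be sketched as a small helper with the threshold and list length as parameters. The function name `top_popular` and the toy values are hypothetical; the logic mirrors the line above (minimum rating count, sort by average rating, keep the top rows).

```python
import pandas as pd

def top_popular(stats, min_ratings, n):
    """Return the n best-rated titles among those with at least min_ratings ratings."""
    eligible = stats[stats["num_ratings"] >= min_ratings]
    return eligible.sort_values("avg_rating", ascending=False).head(n)

# Made-up statistics for four titles.
toy_stats = pd.DataFrame({
    "Book-Title": ["A", "B", "C", "D"],
    "num_ratings": [300, 2, 250, 400],
    "avg_rating": [4.1, 9.5, 5.8, 3.0],
})

top = top_popular(toy_stats, min_ratings=250, n=2)
print(top["Book-Title"].tolist())  # ['C', 'A']
```

Title B has the highest average but only two ratings, so the threshold removes it before ranking. This is the reason for filtering on `num_ratings` first.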

This step enriches the popular books dataset by merging it with the books metadata to include author names and cover image URLs. Duplicate book titles are removed, and only the relevant columns are selected to prepare a clean and concise dataset for display on the home page.


In [282]:
popular_df = popular_df.merge(books,on='Book-Title').drop_duplicates('Book-Title')[['Book-Title','Book-Author','Image-URL-M','num_ratings','avg_rating']]

This cell accesses a sample image URL from the popular books dataset to verify that the book cover image links are correctly loaded and available for display in the user interface.


In [283]:
popular_df['Image-URL-M'][0]

'http://images.amazon.com/images/P/0439136350.01.MZZZZZZZ.jpg'

## Collaborative Filtering Based Recommender System

This code identifies highly active users by counting the number of ratings provided by each user and selecting those who have rated more than 200 books. The resulting user IDs are stored for further filtering to ensure reliable collaborative filtering signals.


In [284]:
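# Boolean Series indexed by User-ID: True where the user has more than 200 ratings
x = ratings_with_name.groupby('User-ID').count()['Book-Rating'] > 200
# "padhe likhe" is Hindi, roughly "well-read": the IDs of the highly active users
padhe_likhe_users = x[x].index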

This step filters the ratings dataset to include only entries from highly active users. By restricting the data to users with substantial rating history, the system reduces noise and improves the quality of similarity-based recommendations.


In [285]:
filtered_rating = ratings_with_name[ratings_with_name['User-ID'].isin(padhe_likhe_users)]

This code identifies books with sufficient user engagement by counting the number of ratings per book and selecting those that have received at least 50 ratings. These books are considered reliable for collaborative filtering and are retained for further processing.


In [286]:
y = filtered_rating.groupby('Book-Title').count()['Book-Rating']>=50
famous_books = y[y].index

This step filters the ratings dataset to retain only books with sufficient rating counts. By keeping only frequently rated books, the system reduces sparsity in the data and improves the stability and accuracy of similarity-based recommendations.


In [287]:
final_ratings = filtered_rating[filtered_rating['Book-Title'].isin(famous_books)]
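
The two filters are applied in sequence, and the order matters: book counts are computed only over ratings from active users. Below is a minimal sketch with toy data and deliberately small thresholds (both constants are hypothetical; the notebook uses `> 200` and `>= 50`). It uses `value_counts`, which counts rows; that matches the groupby count used above when the rating column has no nulls, as shown earlier for this dataset.

```python
import pandas as pd

toy = pd.DataFrame({
    "User-ID":     [1, 1, 1, 2, 2, 3],
    "Book-Title":  ["A", "B", "C", "A", "B", "A"],
    "Book-Rating": [5, 7, 0, 8, 6, 9],
})

MIN_USER_RATINGS = 2   # toy threshold, assumption for this example
MIN_BOOK_RATINGS = 2   # toy threshold, assumption for this example

# Step 1: keep only users with enough ratings.
user_counts = toy["User-ID"].value_counts()
active_users = user_counts[user_counts >= MIN_USER_RATINGS].index
by_active = toy[toy["User-ID"].isin(active_users)]

# Step 2: among those ratings, keep only books rated often enough.
book_counts = by_active["Book-Title"].value_counts()
kept_books = book_counts[book_counts >= MIN_BOOK_RATINGS].index
final = by_active[by_active["Book-Title"].isin(kept_books)]

print(final)
```

User 3 is removed in step 1, so book A is counted with two ratings rather than three in step 2.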

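This cell creates the rating matrix with a pivot table: book titles as rows, user IDs as columns, and rating values as entries. `pivot_table` averages the values when a user has more than one rating for the same title, and leaves `NaN` where a user has not rated a book. Each row is one book's rating vector, which serves as the core input for computing similarity between books.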


In [288]:
pt = final_ratings.pivot_table(index='Book-Title',columns='User-ID',values='Book-Rating')
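
A small worked example may help show what the pivot produces. The toy frame below is made up; it includes a user who rated the same title twice (for instance, two editions) to show that `pivot_table` averages such duplicates by default, and an unrated pair to show where `NaN` appears.

```python
import pandas as pd

toy = pd.DataFrame({
    "Book-Title":  ["A", "A", "A", "B"],
    "User-ID":     [1, 1, 2, 2],      # user 1 rated title "A" twice
    "Book-Rating": [10, 6, 4, 8],
})

# Rows are titles, columns are users; duplicates are averaged (default aggfunc is the mean).
pt_toy = toy.pivot_table(index="Book-Title", columns="User-ID", values="Book-Rating")

# Missing combinations are NaN until they are filled, here with 0.
filled = pt_toy.fillna(0)
print(pt_toy)
print(filled)
```

Here user 1's two ratings of A (10 and 6) collapse to 8.0, and the cell for book B and user 1 is `NaN` before filling.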

This step replaces missing values in the rating matrix with zeros, so a book a user has not rated is treated the same as a rating of 0. Filling missing ratings ensures the matrix is fully numeric and suitable for similarity computations.


In [289]:
pt.fillna(0,inplace=True)

Book-Title (rows) / User-ID (columns),254,2276,2766,2977,3363,4017,4385,6251,6323,6543,...,271705,273979,274004,274061,274301,274308,275970,277427,277639,278418
1984,9.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,10.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1st to Die: A Novel,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,9.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2nd Chance,0.0,10.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4 Blondes,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
A Bend in the Road,0.0,0.0,7.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
Year of Wonders,0.0,0.0,0.0,7.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,9.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
You Belong To Me,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
Zen and the Art of Motorcycle Maintenance: An Inquiry into Values,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
Zoya,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [290]:
pt

Book-Title (rows) / User-ID (columns),254,2276,2766,2977,3363,4017,4385,6251,6323,6543,...,271705,273979,274004,274061,274301,274308,275970,277427,277639,278418
1984,9.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,10.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1st to Die: A Novel,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,9.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2nd Chance,0.0,10.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4 Blondes,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
A Bend in the Road,0.0,0.0,7.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
Year of Wonders,0.0,0.0,0.0,7.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,9.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
You Belong To Me,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
Zen and the Art of Motorcycle Maintenance: An Inquiry into Values,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
Zoya,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


This line imports the cosine similarity function from scikit-learn, which is used to measure the similarity between books based on their user rating vectors.


In [291]:
from sklearn.metrics.pairwise import cosine_similarity

This cell computes the pairwise cosine similarity between all books using the user–book rating matrix. The resulting similarity matrix quantifies how similar each pair of books is based on shared user rating patterns.


In [292]:
similarity_scores = cosine_similarity(pt)
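
For intuition, cosine similarity between two rating vectors is their dot product divided by the product of their lengths. The sketch below computes it by hand with NumPy on a made-up 3x3 matrix; the helper `cosine_sim_matrix` is a hypothetical illustration of the formula, not scikit-learn's implementation.

```python
import numpy as np

def cosine_sim_matrix(m):
    """Pairwise cosine similarity between the rows of m."""
    m = np.asarray(m, dtype=float)
    norms = np.linalg.norm(m, axis=1, keepdims=True)
    norms[norms == 0] = 1.0        # avoid division by zero for all-zero rows
    unit = m / norms               # scale every row to unit length
    return unit @ unit.T           # dot products of unit vectors

# Rows are books, columns are users (toy values).
toy = np.array([
    [9.0, 0.0, 10.0],   # book 0
    [8.0, 0.0, 9.0],    # book 1: rated by the same users as book 0
    [0.0, 7.0, 0.0],    # book 2: rated by a different user only
])

sims = cosine_sim_matrix(toy)
print(np.round(sims, 3))
```

Books 0 and 1 share raters and score close to 1, while book 2 shares no raters with them and scores 0. The result is square and symmetric, with one row and column per book, which is why the notebook's matrix has shape `(706, 706)`.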

This line checks the dimensions of the similarity matrix to confirm that similarity scores have been computed for all books in the dataset.


In [293]:
similarity_scores.shape

(706, 706)

This cell defines the recommendation function used by the system. Given a book title as input, the function locates the corresponding row index in the rating matrix, sorts that row's similarity scores in descending order, and keeps the four most similar books, skipping the first entry (normally the input book itself). For each recommended book, it extracts the title, author, and cover image URL from the books dataset and returns the results as a list of `[title, author, image URL]` lists for display.


In [294]:
def recommend(book_name):
    # Row position of the requested title in the pivot table (IndexError if the title is absent)
    index = np.where(pt.index==book_name)[0][0]
    # Sort (position, score) pairs by score, highest first; skip the first entry
    # (normally the book itself) and keep the next four
    similar_items = sorted(enumerate(similarity_scores[index]),key=lambda x:x[1],reverse=True)[1:5]

    data = []
    for i in similar_items:
        # Metadata for the recommended title, reduced to one row per title
        temp_df = books[books['Book-Title'] == pt.index[i[0]]].drop_duplicates('Book-Title')
        item = []
        item.extend(list(temp_df['Book-Title'].values))
        item.extend(list(temp_df['Book-Author'].values))
        item.extend(list(temp_df['Image-URL-M'].values))
        data.append(item)

    return data

This line calls the recommendation function with a sample book title to test the system and verify that it returns relevant recommendations.


In [295]:
recommend('1984')

[['Animal Farm',
  'George Orwell',
  'http://images.amazon.com/images/P/0451526341.01.MZZZZZZZ.jpg'],
 ["The Handmaid's Tale",
  'Margaret Atwood',
  'http://images.amazon.com/images/P/0449212602.01.MZZZZZZZ.jpg'],
 ['Brave New World',
  'Aldous Huxley',
  'http://images.amazon.com/images/P/0060809833.01.MZZZZZZZ.jpg'],
 ['The Vampire Lestat (Vampire Chronicles, Book II)',
  'ANNE RICE',
  'http://images.amazon.com/images/P/0345313860.01.MZZZZZZZ.jpg']]
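
The lookup-and-rank logic can be sketched in a self-contained form that also handles a title missing from the matrix. The function `recommend_titles`, its parameters, and the toy data are hypothetical; it returns titles only and is meant as an illustration of the ranking step under these assumptions, not as the notebook's function.

```python
import numpy as np
import pandas as pd

def recommend_titles(book_name, pt, similarity_scores, n=4):
    """Return up to n titles most similar to book_name, or [] if the title is unknown."""
    if book_name not in pt.index:
        return []
    idx = pt.index.get_loc(book_name)
    order = np.argsort(similarity_scores[idx])[::-1]   # most similar first
    order = [i for i in order if i != idx][:n]          # drop the query book by position
    return [pt.index[i] for i in order]

# Toy rating matrix: rows are titles, columns are users.
toy_pt = pd.DataFrame(
    [[9.0, 0.0, 10.0], [8.0, 0.0, 9.0], [0.0, 7.0, 0.0], [5.0, 1.0, 0.0]],
    index=["A", "B", "C", "D"],
)

# Cosine similarity computed by hand so the example has no extra dependencies.
unit = toy_pt.to_numpy() / np.linalg.norm(toy_pt.to_numpy(), axis=1, keepdims=True)
toy_sims = unit @ unit.T

print(recommend_titles("A", toy_pt, toy_sims, n=2))        # ['B', 'D']
print(recommend_titles("Unknown", toy_pt, toy_sims, n=2))  # []
```

Excluding the query book by position rather than by slicing from index 1 keeps the result correct even if another book ties with it for the top score.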

In [296]:
pt.index[545]

"The Handmaid's Tale"

This step serializes the popular books DataFrame and saves it to `popular.pkl` using pickle. Storing the object allows it to be reused later (for example, in a deployment or web application) without recomputing it.


In [297]:
import pickle
# Use a context manager so the file is closed once the dump finishes
with open('popular.pkl','wb') as f:
    pickle.dump(popular_df,f)

This cell previews the books dataset with duplicate book titles removed, so that each title appears only once. The result is displayed but not assigned, so `books` itself is unchanged.


In [298]:
books.drop_duplicates('Book-Title')

Unnamed: 0,ISBN,Book-Title,Book-Author,Year-Of-Publication,Publisher,Image-URL-S,Image-URL-M,Image-URL-L
0,0195153448,Classical Mythology,Mark P. O. Morford,2002,Oxford University Press,http://images.amazon.com/images/P/0195153448.0...,http://images.amazon.com/images/P/0195153448.0...,http://images.amazon.com/images/P/0195153448.0...
1,0002005018,Clara Callan,Richard Bruce Wright,2001,HarperFlamingo Canada,http://images.amazon.com/images/P/0002005018.0...,http://images.amazon.com/images/P/0002005018.0...,http://images.amazon.com/images/P/0002005018.0...
2,0060973129,Decision in Normandy,Carlo D'Este,1991,HarperPerennial,http://images.amazon.com/images/P/0060973129.0...,http://images.amazon.com/images/P/0060973129.0...,http://images.amazon.com/images/P/0060973129.0...
3,0374157065,Flu: The Story of the Great Influenza Pandemic...,Gina Bari Kolata,1999,Farrar Straus Giroux,http://images.amazon.com/images/P/0374157065.0...,http://images.amazon.com/images/P/0374157065.0...,http://images.amazon.com/images/P/0374157065.0...
4,0393045218,The Mummies of Urumchi,E. J. W. Barber,1999,W. W. Norton &amp; Company,http://images.amazon.com/images/P/0393045218.0...,http://images.amazon.com/images/P/0393045218.0...,http://images.amazon.com/images/P/0393045218.0...
...,...,...,...,...,...,...,...,...
271354,0449906736,Flashpoints: Promise and Peril in a New World,Robin Wright,1993,Ballantine Books,http://images.amazon.com/images/P/0449906736.0...,http://images.amazon.com/images/P/0449906736.0...,http://images.amazon.com/images/P/0449906736.0...
271356,0525447644,From One to One Hundred,Teri Sloat,1991,Dutton Books,http://images.amazon.com/images/P/0525447644.0...,http://images.amazon.com/images/P/0525447644.0...,http://images.amazon.com/images/P/0525447644.0...
271357,006008667X,Lily Dale : The True Story of the Town that Ta...,Christine Wicker,2004,HarperSanFrancisco,http://images.amazon.com/images/P/006008667X.0...,http://images.amazon.com/images/P/006008667X.0...,http://images.amazon.com/images/P/006008667X.0...
271358,0192126040,Republic (World's Classics),Plato,1996,Oxford University Press,http://images.amazon.com/images/P/0192126040.0...,http://images.amazon.com/images/P/0192126040.0...,http://images.amazon.com/images/P/0192126040.0...


This cell saves key components of the recommender system to disk using pickle, including the user–book rating matrix, the books metadata, and the similarity matrix. Persisting these objects enables faster loading and reuse of the trained recommendation logic in external applications without rerunning the entire preprocessing pipeline.


In [299]:
# Save each object, closing every file as soon as it is written
for obj, path in [(pt,'pt.pkl'),(books,'books.pkl'),(similarity_scores,'similarity_scores.pkl')]:
    with open(path,'wb') as f:
        pickle.dump(obj,f)
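
For completeness, here is a minimal sketch of the save-and-reload round trip that an external application would rely on. It writes to a temporary directory and uses a small identity matrix as a stand-in for the real similarity matrix; both are assumptions made so the example runs on its own.

```python
import os
import pickle
import tempfile

import numpy as np

# Stand-in for the real similarity matrix.
toy_scores = np.eye(3)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "similarity_scores.pkl")

    # Save: the context manager closes the file after writing.
    with open(path, "wb") as f:
        pickle.dump(toy_scores, f)

    # Load: this is what a consuming application would do at startup.
    with open(path, "rb") as f:
        restored = pickle.load(f)

print(np.array_equal(restored, toy_scores))  # True
```

In an application, the same `pickle.load` pattern would be applied to `popular.pkl`, `pt.pkl`, `books.pkl`, and `similarity_scores.pkl`.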