# Book Recommendation System Using Cosine Similarity

This book recommender system uses **item-based collaborative filtering**: it applies **cosine similarity** to user ratings to measure how similar two books are. Collaborative filtering relies on patterns in user behavior (such as ratings) to recommend items, so it needs no content-based information about the books themselves.

Below is a detailed explanation of how cosine similarity is applied in this system.


### Step 1: Data Preprocessing


The first step is to clean and organize the data to make it useful for analysis.

- **Loading the Data**:
  - The system uses three datasets:
    - **Books**: Contains book titles, authors, publishers, and cover images.
    - **Ratings**: Tracks user ratings for books.
    - **Users**: Holds demographic information (loaded, but not used in this workflow).

- **Merging Data**:
  - Book details are combined with ratings so every book's information (like title and cover image) is linked to its ratings.

- **Focusing on Active Users and Popular Books**:
  - Only users who rated more than 50 books are kept, and then only books with more than 50 ratings from those users. This ensures the recommendations are based on books that many users interacted with.
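
The merge-then-filter idea can be sketched on a toy dataset. This is a minimal illustration, not the notebook's actual data: the ISBNs, titles, ratings, and the threshold of 2 are made up. It also shows that a left merge leaves `NaN` book details for ISBNs missing from the books table, and that filtering by title count drops those rows, which may be why the null check further down reports no missing values.

```python
import pandas as pd

# Toy stand-ins for the Books and Ratings tables (all values are made up).
books = pd.DataFrame({
    "ISBN": ["111", "222"],
    "Book-Title": ["Book A", "Book B"],
})
ratings = pd.DataFrame({
    "User-ID": [1, 2, 3, 1, 2],
    "ISBN": ["111", "111", "111", "222", "999"],  # "999" is not in books
    "Book-Rating": [8, 0, 10, 5, 7],
})

# A left merge keeps every rating; unknown ISBNs get NaN book details.
merged = ratings.merge(books, on="ISBN", how="left")

# Keep only titles with more than 2 ratings (hypothetical threshold).
title_counts = merged["Book-Title"].value_counts()  # NaN titles are not counted
popular_titles = title_counts[title_counts > 2].index
popular = merged[merged["Book-Title"].isin(popular_titles)]

print(popular[["User-ID", "Book-Title", "Book-Rating"]])
```

Only the three ratings of "Book A" survive: "Book B" has too few ratings, and the rating for the unknown ISBN has no title to match.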


In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.feature_extraction.text import CountVectorizer
from IPython.display import display, HTML 

In [2]:
# Load the datasets
books = pd.read_csv('data/BX-Books.csv', delimiter=';', encoding='latin-1', on_bad_lines='skip', dtype={3: 'string'})
ratings = pd.read_csv('data/BX-Book-Ratings.csv', delimiter=';', encoding='latin-1', on_bad_lines='skip')
users = pd.read_csv('data/BX-Users.csv', delimiter=';', encoding='latin-1', on_bad_lines='skip')

In [3]:
# Retain necessary columns
books = books[['ISBN', 'Book-Title', 'Book-Author', 'Year-Of-Publication', 'Publisher', 'Image-URL-L']]
ratings = ratings[['User-ID', 'ISBN', 'Book-Rating']]

# Merge books and ratings data on ISBN
books_ratings = ratings.merge(books, on='ISBN', how='left')

In [4]:
# Filter active users (who rated more than 50 books)
user_rating_counts = books_ratings['User-ID'].value_counts()
active_users = user_rating_counts[user_rating_counts > 50].index
books_ratings = books_ratings[books_ratings['User-ID'].isin(active_users)]

# Filter popular books (books with more than 50 ratings)
book_rating_counts = books_ratings['Book-Title'].value_counts()
popular_books = book_rating_counts[book_rating_counts > 50].index

books_ratings = books_ratings[books_ratings['Book-Title'].isin(popular_books)]

In [5]:
books_ratings.isnull().sum()

User-ID                0
ISBN                   0
Book-Rating            0
Book-Title             0
Book-Author            0
Year-Of-Publication    0
Publisher              0
Image-URL-L            0
dtype: int64

In [6]:
books_ratings.shape

(133212, 8)

In [7]:
#duplicated_titles = books_ratings[books_ratings.duplicated(subset=['Book-Title', 'User-ID'], keep=False)]
#duplicate_count = duplicated_titles['Book-Title'].nunique()
#print(f"Number of duplicated book titles: {duplicate_count}")
# Group by 'Book-Title' and count occurrences
#duplicate_groups = duplicated_titles.groupby('Book-Title').size().reset_index(name='count')

# Display the groups
#print(duplicate_groups)

In [8]:
# Convert 'Book-Title' to lowercase for case-insensitive duplicate checking
books_ratings['title_lower'] = books_ratings['Book-Title'].str.lower()


In [9]:
books_ratings[books_ratings.duplicated(subset=['title_lower', 'User-ID'], keep=False)]

,User-ID,ISBN,Book-Rating,Book-Title,Book-Author,Year-Of-Publication,Publisher,Image-URL-L,title_lower
1610,277427,0425115801,0,Lightning,Dean R. Koontz,1996,Berkley Publishing Group,http://images.amazon.com/images/P/0425115801.0...,lightning
1646,277427,0440221501,0,Lightning,DANIELLE STEEL,1996,Dell,http://images.amazon.com/images/P/0440221501.0...,lightning
1725,277427,0553571818,0,Long After Midnight,IRIS JOHANSEN,1997,Bantam,http://images.amazon.com/images/P/0553571818.0...,long after midnight
1753,277427,0671037692,8,Long After Midnight,Ray Bradbury,2004,Pocket Books,http://images.amazon.com/images/P/0671037692.0...,long after midnight
4489,278418,0140065172,0,Ordinary People,Judith Guest,1993,Penguin Books,http://images.amazon.com/images/P/0140065172.0...,ordinary people
...,...,...,...,...,...,...,...,...,...
1147167,275970,0694519405,8,I Know This Much Is True,Wally Lamb,1998,ReganBooks,http://images.amazon.com/images/P/0694519405.0...,i know this much is true
1148711,276463,0345413350,10,"The Golden Compass (His Dark Materials, Book 1)",PHILIP PULLMAN,1997,Del Rey,http://images.amazon.com/images/P/0345413350.0...,"the golden compass (his dark materials, book 1)"
1148757,276463,0679893105,0,"The Golden Compass (His Dark Materials, Book 1)",PHILIP PULLMAN,1998,Knopf Books for Young Readers,http://images.amazon.com/images/P/0679893105.0...,"the golden compass (his dark materials, book 1)"
1149521,276680,0385313543,8,The Glass Lake,Maeve Binchy,1995,Delacorte Press,http://images.amazon.com/images/P/0385313543.0...,the glass lake


In [10]:
# Remove duplicates based on the lowercased title and 'User-ID', keeping the first occurrence
books_ratings = books_ratings.drop_duplicates(subset=['title_lower', 'User-ID'], keep='first')
books_ratings.shape

(130445, 9)

In [11]:
books_ratings.drop(columns='title_lower', inplace=True)

In [12]:
books_ratings[books_ratings.duplicated(subset=['Book-Title', 'User-ID'], keep=False)]

,User-ID,ISBN,Book-Rating,Book-Title,Book-Author,Year-Of-Publication,Publisher,Image-URL-L


In [13]:
# General cleanup for special characters and unnecessary text
#books_ratings['Book-Title'] = books_ratings['Book-Title'].str.replace(r'[^\w\s]', '', regex=True)  # Remove special characters
#books_ratings['Book-Title'] = books_ratings['Book-Title'].str.replace(r'\s+', ' ', regex=True)  # Remove extra spaces
#books_ratings['Book-Title'] = books_ratings['Book-Title'].str.strip()  # Strip leading/trailing spaces

In [14]:
#x = books_ratings[books_ratings['Book-Title'].isin(['A Wrinkle In Time', 'A Wrinkle in Time'])]


In [15]:
#x[x['User-ID'].duplicated(keep=False)]


### Step 2: Creating a Rating Matrix

The data is reorganized into a format that highlights the relationships between books.

- **The Matrix**:
  - A table is created where:
    - **Rows**: Represent books.
    - **Columns**: Represent users.
    - **Cells**: Contain the ratings users gave to books.
  - If a user hasn’t rated a book, the cell is filled with `0`.

This matrix forms the foundation for finding similar books.
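
As a rough illustration of the reshaping, the sketch below pivots a tiny, made-up ratings table into a book-by-user matrix. The titles, user IDs, and ratings are hypothetical. Note that after filling, a `0` for "not rated" is indistinguishable from a stored rating of `0`, which does occur in the ratings shown earlier.

```python
import pandas as pd

# Hypothetical long-format ratings: one row per (user, book) pair.
ratings = pd.DataFrame({
    "User-ID": [1, 1, 2, 3],
    "Book-Title": ["Book A", "Book B", "Book A", "Book B"],
    "Book-Rating": [8, 5, 10, 7],
})

# Rows become books, columns become users, cells hold the rating.
matrix = ratings.pivot_table(
    index="Book-Title", columns="User-ID", values="Book-Rating"
)

# Missing (user, book) pairs are NaN after the pivot; replace them with 0.
matrix = matrix.fillna(0)

print(matrix)
```

Here user 3 never rated "Book A" and user 2 never rated "Book B", so those two cells end up as `0`.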


In [16]:
# Book-by-user rating matrix (rows: books, columns: users)
book_matrix = books_ratings.pivot_table(index='Book-Title', columns='User-ID', values='Book-Rating')

# Fill NaN values with 0 for cosine similarity
book_matrix.fillna(0, inplace=True)


In [17]:
book_matrix

Book-Title / User-ID,183,243,254,507,626,638,643,741,882,929,...,277928,277965,278026,278137,278144,278188,278418,278582,278633,278843
1984,0.0,0.0,9.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1st to Die: A Novel,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2010: Odyssey Two,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
204 Rosewood Lane,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
24 Hours,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
Year of Wonders,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
You Belong To Me,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
Zen and the Art of Motorcycle Maintenance: An Inquiry into Values,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
Zoya,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


### Step 3: Compute Cosine Similarity


Cosine similarity is used to calculate how similar two books are based on their ratings.

- **How It Works**:
  - Each book is treated as a "profile" of user ratings.
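  - Cosine similarity measures how closely two books' profiles align. Because ratings are non-negative, scores range from `0` (no user gave both books a non-zero rating) to `1` (the rating profiles point in the same direction).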

- **Similarity Matrix**:
  - A new table is created where each cell shows the similarity score between two books. This makes it easy to find books that are similar to any given book.
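
For two rating vectors $a$ and $b$, cosine similarity is $\text{sim}(a, b) = \dfrac{a \cdot b}{\lVert a \rVert \, \lVert b \rVert}$. The plain-Python sketch below works through that formula on made-up rating profiles. The helper name `cosine` and the sample vectors are illustrative only, and returning `0.0` for an all-zero vector is a convention chosen here.

```python
import math

def cosine(a, b):
    """Cosine similarity of two equal-length rating vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # convention chosen here for all-zero vectors
    return dot / (norm_a * norm_b)

# Hypothetical rating profiles over the same four users (0 = not rated).
book_a = [8, 10, 0, 0]
book_b = [4, 5, 0, 0]   # same users, proportionally lower ratings
book_c = [0, 0, 9, 7]   # rated by entirely different users

print(cosine(book_a, book_b))  # approximately 1: same direction
print(cosine(book_a, book_c))  # 0.0: no overlapping raters
```

The example shows that cosine similarity compares the direction of the profiles, not their magnitude: `book_b` has half the ratings of `book_a` yet scores as nearly identical.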


In [18]:
cosine_sim = cosine_similarity(book_matrix)

# Create a DataFrame for similarity
cosine_sim_df = pd.DataFrame(cosine_sim, index=book_matrix.index, columns=book_matrix.index)


### Step 4: Build Recommendation Function

This is where the system generates recommendations.

- **How It Picks Books**:
  - When a user selects a book, the system looks at the similarity matrix to find books with the highest similarity scores.

- **Adding Details**:
  - For each recommended book, the system provides:
    - The title.
    - A cover image.
    - The similarity score (to show how closely it matches the input book).

- **Handling Missing Books**:
  - If the chosen book isn’t in the dataset, the system notifies the user in a friendly way.
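
The lookup step can be sketched on a small, made-up similarity table. The helper `top_similar` and the scores are hypothetical. The sketch drops the input book by label before ranking, so the result does not depend on the book's own score of `1.0` sorting first.

```python
import pandas as pd

# Hypothetical symmetric book-to-book similarity table.
titles = ["Book A", "Book B", "Book C", "Book D"]
sim = pd.DataFrame(
    [[1.0, 0.6, 0.1, 0.3],
     [0.6, 1.0, 0.2, 0.0],
     [0.1, 0.2, 1.0, 0.5],
     [0.3, 0.0, 0.5, 1.0]],
    index=titles, columns=titles,
)

def top_similar(sim_df, title, top_n=2):
    """Return the top_n most similar titles with their scores."""
    if title not in sim_df.index:
        print(f"Book '{title}' not found in the dataset.")
        return pd.Series(dtype=float)
    # Exclude the book itself, then take the highest remaining scores.
    return sim_df[title].drop(title).nlargest(top_n)

print(top_similar(sim, "Book A"))
```

For "Book A" this returns "Book B" and then "Book D". An unknown title prints a message and yields an empty result.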


In [19]:
# Function to recommend books using cosine similarity
def recommend_books_with_images_and_scores(book_title, top_n=5):
    if book_title not in cosine_sim_df.index:
        print(f"Book '{book_title}' not found in the dataset.")
        return []
    
    # Get the most similar books, excluding the input book itself by label
    # (slicing off the first sorted entry can drop the wrong row when scores tie)
    recommendations = cosine_sim_df[book_title].drop(book_title).nlargest(top_n)
    
    recommended_books = []
    for similar_book_title, score in recommendations.items():
        if similar_book_title in books['Book-Title'].values:
            book_info = books[books['Book-Title'] == similar_book_title].iloc[0]
            recommended_books.append({
                "title": book_info['Book-Title'],
                "image_url": book_info['Image-URL-L'],
                "score": score
            })
        else:
            print(f"Warning: Book '{similar_book_title}' not found in the books dataset.")
    
    return recommended_books

The recommendations are rendered as an HTML table inside the notebook.

- **Table Format**:
  - Recommendations are displayed in a table with three columns:
    1. **Cover**: Displays the book’s image.
    2. **Title**: Shows the book’s name.
    3. **Similarity Score**: Indicates how closely the book matches the input.

- **User Experience**:
  - The table is neatly formatted, with everything centered and easy to read for a smooth user experience.
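
One detail worth considering when building HTML from dataset values is escaping, since titles may contain characters such as `&` or `<`. The sketch below is one possible way to build the table rows with the standard library's `html.escape`. The helper name `render_rows` and the sample record are hypothetical and not part of the notebook.

```python
from html import escape

def render_rows(recommended_books):
    """Build one <tr> per recommendation, escaping text from the dataset."""
    rows = []
    for book in recommended_books:
        title = escape(book["title"])
        url = escape(book["image_url"], quote=True)
        score = f"{book['score']:.4f}"
        rows.append(
            f'<tr><td><img src="{url}" style="width:100px;"/></td>'
            f"<td>{title}</td><td>{score}</td></tr>"
        )
    return "\n".join(rows)

# Hypothetical record in the same shape the recommendation function returns.
sample = [{
    "title": "Tom & Jerry <Deluxe>",
    "image_url": "http://example.com/cover.jpg",
    "score": 0.35,
}]
print(render_rows(sample))
```

The escaped title renders as literal text in the table instead of being interpreted as markup.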


In [20]:
# Function to display recommendations with scores
def display_recommendations_with_images_and_scores(recommended_books):
    html = """
    <h3 style='text-align: center;'>Recommended Books</h3>
    <table style='border-collapse: collapse; width: 100%;'>
        <tr>
            <th style='text-align: center;'>Cover</th>
            <th style='text-align: center;'>Title</th>
            <th style='text-align: center;'>Cosine Similarity</th>
        </tr>
    """
    for book in recommended_books:
        html += f"""
        <tr>
            <td style='padding: 10px; text-align: center;'><img src="{book['image_url']}" style="width:100px;"/></td>
            <td style='padding: 10px; text-align: center;'>{book['title']}</td>
            <td style='padding: 10px; text-align: center;'>{book['score']:.4f}</td>
        </tr>
        """
    html += "</table>"
    display(HTML(html))


### Step 5: Test the Recommendation System
Finally, the system is tested with three sample books.

- **Example**:
  - A book such as _"Harry Potter and the Chamber of Secrets (Book 2)"_ is selected.
  - The system recommends books with:
    - Cover images.
    - Titles of similar books.
    - Similarity scores that explain how closely the books relate to the input.


In [21]:
# Example usage
target_book = 'A Is for Alibi (Kinsey Millhone Mysteries (Paperback))'  # Replace with a valid book title
recommended_books = recommend_books_with_images_and_scores(target_book)

# Display recommendations with images and scores
if recommended_books:
    display_recommendations_with_images_and_scores(recommended_books)

Cover,Title,Cosine Similarity
,B Is for Burglar (Kinsey Millhone Mysteries (Paperback)),0.351
,C Is for Corpse (Kinsey Millhone Mysteries (Paperback)),0.339
,K Is for Killer (Kinsey Millhone Mysteries (Paperback)),0.3131
,E Is for Evidence: A Kinsey Millhone Mystery (Kinsey Millhone Mysteries (Paperback)),0.2901
,D Is for Deadbeat (Kinsey Millhone Mysteries (Paperback)),0.2861


In [22]:
# Example usage
target_book = 'Harry Potter and the Chamber of Secrets (Book 2)'  # Replace with a valid book title
recommended_books = recommend_books_with_images_and_scores(target_book)

# Display recommendations with images and scores
if recommended_books:
    display_recommendations_with_images_and_scores(recommended_books)

Cover,Title,Cosine Similarity
,Harry Potter and the Prisoner of Azkaban (Book 3),0.621
,Harry Potter and the Goblet of Fire (Book 4),0.6083
,Harry Potter and the Sorcerer's Stone (Book 1),0.4834
,Harry Potter and the Sorcerer's Stone (Harry Potter (Paperback)),0.3983
,Harry Potter and the Order of the Phoenix (Book 5),0.3826


In [23]:
# Example usage
target_book = 'A Walk to Remember'  # Replace with a valid book title
recommended_books = recommend_books_with_images_and_scores(target_book)

# Display recommendations with images and scores
if recommended_books:
    display_recommendations_with_images_and_scores(recommended_books)

Cover,Title,Cosine Similarity
,The Rescue,0.3906
,Nights in Rodanthe,0.2735
,A Bend in the Road,0.2619
,Message in a Bottle,0.2441
,The Notebook,0.2381


### Conclusion

Cosine similarity enables this model to recommend books that align with user preferences by analyzing shared rating patterns. In the tests above, the model surfaces related books, particularly those from the same series or by the same author. For the Harry Potter series, the recommendations are accurate and backed by relatively high similarity scores (e.g., 0.621 for *Harry Potter and the Prisoner of Azkaban*). The model likewise identifies other entries in the Kinsey Millhone Mysteries and other novels by Nicholas Sparks. However, the lower scores for some recommendations (e.g., 0.2441 for *Message in a Bottle*) suggest that the model may struggle with broader connections outside well-established patterns. Overall, the model works well for closely related books, though greater diversity and better handling of weaker relationships could make it useful to a wider audience.

### Copyright Remark

This project is created as part of a self-learning initiative to explore and understand recommendation systems. The data used in this project comes from publicly available datasets, and all code and methods implemented are for educational purposes only. 

No commercial use is intended, and all intellectual property rights for the datasets and external resources belong to their respective owners. If this work references copyrighted materials or methods, it is done solely for the purpose of learning and demonstration under the principles of fair use.
