# <center> Book Recommendations System - Advanced Modeling</center>
#### <center>**By: Mili Ketan Thakrar**</center>

<a id="TOC"></a> <br>
## Table of Contents
1. [Introduction](#intro)
2. [Importing Libraries and Custom Definitions](#import)
3. [Loading the Dataset](#loading)
4. [Data Dictionary](#dict)
5. [Getting Genre Information](#popular)
6. [Collaborative Filtering Model](#log)
   1. [Machine Learning Pipeline](#pipeline)
   2. [Model Iteration](#iterate)
   3. [Evaluating Models](#eval1)
7. [Hybrid Model (Content + Collaborative Filtering)](#cosine)
   1. [Evaluating Models](#eval2)
8. [Conclusion](#conclusion)

<a id="intro"></a>
## Introduction

A book recommendation system is an intelligent application designed to help users discover books tailored to their interests and reading preferences. By analyzing user behavior, ratings, and book attributes, the system provides personalized suggestions, enhancing the reading experience and making it easier to find engaging and relevant books. This project aims to develop an efficient recommendation engine using advanced algorithms and data analysis techniques.

#### **Advanced Models: Collaborative Filtering and Hybrid Approaches**

In this notebook, we extend our recommendation system with advanced modeling techniques that leverage both user interactions and book content to deliver highly personalized suggestions.

We will implement and evaluate the following approaches:

1. **Collaborative Filtering Model**:  
   This model predicts a user’s preferences by analyzing patterns in user ratings and interactions. By identifying similarities among users and books, collaborative filtering generates recommendations based on what similar users have enjoyed. The process includes building a machine learning pipeline, iterating on model selection, and thoroughly evaluating performance to ensure robust recommendations.

2. **Hybrid Model (Content + Collaborative Filtering)**:  
   To further enhance recommendation quality, we combine the strengths of content-based and collaborative filtering methods. The hybrid model integrates information from both user behavior and book attributes, resulting in more accurate and diverse suggestions. We also conduct comprehensive model evaluation to assess the effectiveness of this combined approach.
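
The two approaches above can be sketched on toy data. This is a minimal illustration, not the notebook's actual pipeline: the `ratings` matrix, the `content_features` matrix, the blending weight `alpha`, and the helper `hybrid_scores` are all assumptions introduced for the example. It computes item-item cosine similarity from ratings (collaborative signal) and from book attributes (content signal), then blends the two.

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# Toy user x book rating matrix (rows: users, columns: books); 0 = not rated.
ratings = np.array([
    [8, 0, 9, 0],
    [7, 6, 0, 0],
    [0, 5, 8, 2],
], dtype=float)

# Hypothetical content features per book (e.g. genre indicators or TF-IDF terms).
content_features = np.array([
    [1, 0, 1],
    [0, 1, 0],
    [1, 0, 1],
    [0, 1, 1],
], dtype=float)

# Collaborative signal: books are similar when the same users rate them alike.
collab_sim = cosine_similarity(ratings.T)
# Content signal: books are similar when their attributes overlap.
content_sim = cosine_similarity(content_features)


def hybrid_scores(user_idx, alpha=0.5):
    """Score every book for one user by blending both similarity matrices."""
    blended = alpha * collab_sim + (1 - alpha) * content_sim
    user_ratings = ratings[user_idx]
    # Each book's score is the similarity-weighted sum of the user's ratings.
    scores = blended @ user_ratings
    # Exclude books the user has already rated.
    scores[user_ratings > 0] = -np.inf
    return scores


top_book = int(np.argmax(hybrid_scores(user_idx=1)))
```

Setting `alpha=1.0` reduces this sketch to pure item-based collaborative filtering, while `alpha=0.0` gives a purely content-based ranking.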

Throughout the notebook, we evaluate models with RMSE for rating-prediction accuracy and with Precision@K, Recall@K, MAP, and NDCG for ranking quality. Comparing these metrics lets us identify the best-performing model, balancing recommendation relevance and user satisfaction.
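
As a reference for how these metrics are commonly computed, here is a small sketch using binary relevance. The function names and the toy inputs are assumptions for illustration; the notebook's own evaluation code may define relevance or normalization differently. MAP is the mean of `average_precision` over all users.

```python
import numpy as np


def rmse(y_true, y_pred):
    """Root mean squared error between actual and predicted ratings."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))


def precision_at_k(recommended, relevant, k):
    """Fraction of the top-k recommendations that are relevant."""
    return sum(item in relevant for item in recommended[:k]) / k


def recall_at_k(recommended, relevant, k):
    """Fraction of all relevant items that appear in the top-k."""
    if not relevant:
        return 0.0
    return sum(item in relevant for item in recommended[:k]) / len(relevant)


def average_precision(recommended, relevant, k):
    """Average of precision values at each rank where a relevant item appears."""
    if not relevant:
        return 0.0
    hits, total = 0, 0.0
    for rank, item in enumerate(recommended[:k], start=1):
        if item in relevant:
            hits += 1
            total += hits / rank
    return total / min(len(relevant), k)


def ndcg_at_k(recommended, relevant, k):
    """Discounted cumulative gain, normalized by the ideal ordering."""
    gains = [1.0 if item in relevant else 0.0 for item in recommended[:k]]
    dcg = sum(g / np.log2(rank + 1) for rank, g in enumerate(gains, start=1))
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / np.log2(rank + 1) for rank in range(1, ideal_hits + 1))
    return dcg / idcg if idcg > 0 else 0.0


# Toy example: a ranked list for one user and the books they actually liked.
recommended = ["a", "b", "c", "d"]
relevant = {"a", "c"}
p2 = precision_at_k(recommended, relevant, k=2)
r2 = recall_at_k(recommended, relevant, k=2)
```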

By the end of this notebook, we aim to:
- Demonstrate the value of collaborative and hybrid recommendation strategies.
- Evaluate and compare model performance using relevant metrics.
- Highlight the practical benefits and potential improvements for real-world book recommendation systems.

<a id="import"></a>
## Importing Libraries and Custom Definitions
[Back to Table of Contents](#TOC)

In [1]:
# Data Handling & Utilities
import pandas as pd
import numpy as np

import sys
sys.path.append('../Data')

from data_utils import (  # Custom utility functions
    import_csv, 
    generate_data_dictionary, 
    define_df_settings  
)

# Machine Learning Pipeline
# Preprocessing
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import (
    OneHotEncoder, 
    StandardScaler, 
    FunctionTransformer, 
    LabelEncoder
)
from sklearn.decomposition import PCA, IncrementalPCA

# Model Training & Evaluation
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import (
    accuracy_score,
    f1_score,
    confusion_matrix,
    roc_auc_score,
    classification_report,
    ConfusionMatrixDisplay,
    RocCurveDisplay
)
from sklearn.pipeline import Pipeline

# Natural Language Processing
import nltk
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import TfidfVectorizer

# Recommendation System
from gensim.models import Word2Vec, KeyedVectors
from sklearn.metrics.pairwise import cosine_similarity

# Visualization
import matplotlib.pyplot as plt


# Progress Tracking
from tqdm import tqdm
tqdm.pandas()  # Enable pandas progress tracking

# Text Processing Utilities
import re  # Regular expressions

# Warning Configuration
import warnings
warnings.filterwarnings('ignore', category=UserWarning)
warnings.filterwarnings('ignore', category=FutureWarning)

# Pandas Display Configuration
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)
pd.set_option('display.max_colwidth', None)

<a id="loading"></a>
## Loading the Dataset
[Back to Table of Contents](#TOC)

In [2]:
# Loading the clean dataset 
df = import_csv('cleaned_data.csv')

Successfully imported data from cleaned_data.csv


In [3]:
df.head()

Unnamed: 0,ISBN,Title,Author,Ratings,Total_num_of_ratings,Avg_ratings,Avg_ratings_excluding_zero,Publisher,Year_Category,Publication_year,User_id,Age_Category,City,State,Country,Image_URL
0,1558746218,A Second Chicken Soup for the Woman's Soul (Chicken Soup for the Soul Series),Jack Canfield,0,56,3.89,7.79,Health Communications,1980-1999,1998,8,26-32,timmins,ontario,canada,http://images.amazon.com/images/P/1558746218.01.LZZZZZZZ.jpg
1,2005018,Clara Callan,Richard Bruce Wright,5,14,4.93,7.67,HarperFlamingo Canada,2000-2009,2001,8,26-32,timmins,ontario,canada,http://images.amazon.com/images/P/0002005018.01.LZZZZZZZ.jpg
2,60973129,Decision in Normandy,Carlo D'Este,0,3,5.0,7.5,HarperPerennial,1980-1999,1991,8,26-32,timmins,ontario,canada,http://images.amazon.com/images/P/0060973129.01.LZZZZZZZ.jpg
3,374157065,Flu: The Story of the Great Influenza Pandemic of 1918 and the Search for the Virus That Caused It,Gina Bari Kolata,0,11,4.27,7.83,Farrar Straus Giroux,1980-1999,1999,8,26-32,timmins,ontario,canada,http://images.amazon.com/images/P/0374157065.01.LZZZZZZZ.jpg
4,1881320189,Goodbye to the Buttermilk Sky,Julia Oliver,7,3,4.67,7.0,River City Pub,1980-1999,1994,8,26-32,timmins,ontario,canada,http://images.amazon.com/images/P/1881320189.01.LZZZZZZZ.jpg


<a id="dict"></a>
## Data Dictionary
[Back to Table of Contents](#TOC)

In [4]:
data_dict = generate_data_dictionary(df)
display(data_dict)

| | Column Name | Data Type | Description | Unique Values | Missing Values | Value Range |
|---|---|---|---|---|---|---|
| 0 | ISBN | object | International Standard Book Number, unique identifier for books | 149831 | 0 | |
| 1 | Title | object | The title of the book | 135563 | 0 | |
| 2 | Author | object | The name of the book's author | 62110 | 0 | |
| 3 | Ratings | int64 | User's rating of the book, scale of 1-10 | 11 | 0 | (0, 10) |
| 4 | Total_num_of_ratings | int64 | Total number of ratings for the book | 377 | 0 | (1, 2502) |
| 5 | Avg_ratings | float64 | Average rating score for the book | 710 | 0 | (0.11, 10.0) |
| 6 | Avg_ratings_excluding_zero | float64 | Average rating score for the book excluding 0 values | 418 | 0 | (1.0, 10.0) |
| 7 | Publisher | object | The name of the book's publisher | 11573 | 0 | |
| 8 | Year_Category | object | Categorized time period of publication | 7 | 0 | |
| 9 | Publication_year | int64 | The year the book was published | 100 | 0 | (0, 2020) |


<a id="popular"></a>
## Getting Genre Information
[Back to Table of Contents](#TOC)

In [None]:
df['Title'].nunique()

In [None]:
# Keep the 5,000 most-rated unique titles
top_5000_books = (df.sort_values('Total_num_of_ratings', ascending=False)
                    .drop_duplicates('Title')
                    .head(5000))
top_5000_books

In [None]:
harry_potter_books = top_5000_books[top_5000_books['Title'].str.contains('Harry Potter', case=False)]

if not harry_potter_books.empty:
    print("Harry Potter books found in the top 5000:")
    print(harry_potter_books[['Title', 'Total_num_of_ratings']])
else:
    print("No Harry Potter books found in the top 5000.")

In [None]:
import time

import requests
from requests.exceptions import RequestException

def get_genre(isbn):
    url = f"https://openlibrary.org/api/books?bibkeys=ISBN:{isbn}&format=json&jscmd=data"
    max_retries = 3
    retry_delay = 5

    for attempt in range(max_retries):
        try:
            response = requests.get(url, timeout=10)
            response.raise_for_status()  # Raises an HTTPError for bad responses
            data = response.json()
            
            if f"ISBN:{isbn}" in data:
                book_data = data[f"ISBN:{isbn}"]
                subjects = book_data.get("subjects", [])
                if subjects:
                    # Use up to 3 subjects as a proxy for genre
                    return ", ".join(subject["name"] for subject in subjects[:3])
            return "Genre not found"
        
        except RequestException as e:
            if attempt < max_retries - 1:
                print(f"Error fetching data for ISBN {isbn}. Retrying in {retry_delay} seconds...")
                time.sleep(retry_delay)
            else:
                print(f"Failed to fetch data for ISBN {isbn} after {max_retries} attempts. Error: {str(e)}")
                return "Error fetching genre"
        
        except (KeyError, ValueError) as e:
            print(f"Error processing data for ISBN {isbn}. Error: {str(e)}")
            return "Error processing genre"

# Reuse `df` loaded earlier via import_csv rather than re-reading it
# from a machine-specific absolute path.

# Find the top 5000 unique books (.copy() avoids SettingWithCopyWarning below)
top_5000_books = (df.sort_values('Total_num_of_ratings', ascending=False)
                    .drop_duplicates('Title')
                    .head(5000)
                    .copy())

# Add a new column for Genre
top_5000_books["Genre"] = ""

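# Update Genre for each book in the top 5000
for index, row in top_5000_books.iterrows():
    # Assumption: ISBN-10 values lost their leading zeros during cleaning
    # (e.g. 2005018 vs. 0002005018 in Image_URL), so pad them back to 10 characters.
    isbn = str(row["ISBN"]).zfill(10)
    genre = get_genre(isbn)
    top_5000_books.at[index, "Genre"] = genre
    # print(f"Processed book {index + 1}/5000: {row['Title']} - Genre: {genre}")
    time.sleep(3)  # Respect rate limits, adjust as needed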

# Save the updated top_5000_books dataset
top_5000_books.to_csv("top_5000_books_with_genre.csv", index=False)
print("Processing complete. Data saved to 'top_5000_books_with_genre.csv'")


<a id="log"></a>
## Collaborative Filtering Model
[Back to Table of Contents](#TOC)

In [None]:
# ratings_df['User_id'].value_counts()

In [None]:
# # Find users with more than 200 ratings
# user_counts = ratings_df['User_id'].value_counts()
# top_user_ids = user_counts[user_counts > 200].index

# # Filter the DataFrame to include only rows with User_id in top_user_ids
# top_users_rating = ratings_df[ratings_df['User_id'].isin(top_user_ids)]

# # Display the result
# top_users_rating

In [None]:
# # Keep only rows whose title has more than 20 ratings in total
# Top_ratings_with_books = ratings_with_books[ratings_with_books['Total_num_of_ratings'] > 20]

In [None]:
#ratings_with_books = top_users_rating.merge(books_df, on='ISBN')
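
The commented-out cells above outline a density filter: keep only active users and frequently rated books before building a collaborative model. A minimal sketch of that idea on a single DataFrame follows. The helper `filter_dense_interactions`, its threshold parameters, and the toy data are assumptions for illustration; the column names match the dataset loaded earlier, and the final pivot into a user-item matrix is one possible next step rather than the notebook's confirmed approach.

```python
import pandas as pd


def filter_dense_interactions(ratings, min_user_ratings=200, min_book_ratings=20):
    """Keep rows from users with many ratings and books with many ratings."""
    user_counts = ratings["User_id"].value_counts()
    active_users = user_counts[user_counts > min_user_ratings].index
    mask = (
        ratings["User_id"].isin(active_users)
        & (ratings["Total_num_of_ratings"] > min_book_ratings)
    )
    return ratings[mask]


# Toy data with the same column names as the cleaned dataset.
toy = pd.DataFrame({
    "User_id": [1, 1, 1, 2, 3, 3],
    "ISBN": ["a", "b", "c", "a", "a", "b"],
    "Ratings": [8, 0, 7, 5, 9, 6],
    "Total_num_of_ratings": [3, 2, 1, 3, 3, 2],
})

# Small thresholds so the toy example keeps some rows.
dense = filter_dense_interactions(toy, min_user_ratings=1, min_book_ratings=1)

# Hypothetical next step: a user x book matrix for similarity computations.
user_item = dense.pivot_table(
    index="User_id", columns="ISBN", values="Ratings", fill_value=0
)
```

Filtering first keeps the user-item matrix smaller and less sparse, at the cost of dropping recommendations for infrequent users and rarely rated books.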