# Book Recommendation Engine


This project builds a book recommendation system from a dataset of books, user ratings, and user demographics. The primary objectives are:

- Cleaning and preprocessing the datasets to ensure data quality.
- Analyzing and understanding user preferences through exploratory data analysis (EDA).
- Preparing the data for building a recommendation engine using content-based or collaborative filtering methods.

This notebook walks step by step through preparing and analyzing the data, laying the foundation for the recommendation system.

## Imports

In [5]:
import pandas as pd
from scipy.sparse import coo_matrix, csr_matrix
from sklearn.neighbors import NearestNeighbors

## Loading data

In [None]:
book_df = pd.read_csv("./data/cleaned_data.csv")



## Data Overview

### Preview

In [6]:
book_df.head()

Unnamed: 0.1,Unnamed: 0,ISBN,Book-Title,Book-Author,Publisher,Book-Rating,Location,Image-URL-S,Image-URL-M,Image-URL-L,User-ID,Country,New-Rating,Weighted-Rating,Rating-Count
0,193827,0590567330,A Light in the Storm: The Civil War Diary of ...,Karen Hesse,Hyperion Books for Children,9,"albuquerque, new mexico, usa",http://images.amazon.com/images/P/0590567330.0...,http://images.amazon.com/images/P/0590567330.0...,http://images.amazon.com/images/P/0590567330.0...,96448,usa,3,3.0,1
1,383790,0310232546,"Ask Lily (Young Women of Faith: Lily Series, ...",Nancy N. Rue,Zonderkidz,8,"ypsilanti, michigan, usa",http://images.amazon.com/images/P/0310232546.0...,http://images.amazon.com/images/P/0310232546.0...,http://images.amazon.com/images/P/0310232546.0...,269557,usa,2,2.0,1
2,217596,006250746X,Earth Prayers From around the World: 365 Pray...,Elizabeth Roberts,HarperSanFrancisco,9,"woodbridge, virginia, usa",http://images.amazon.com/images/P/006250746X.0...,http://images.amazon.com/images/P/006250746X.0...,http://images.amazon.com/images/P/006250746X.0...,26544,usa,3,2.142857,7
3,206869,1566869250,Final Fantasy Anthology: Official Strategy Gu...,David Cassady,BradyGames,10,"st. louis, missouri, usa",http://images.amazon.com/images/P/1566869250.0...,http://images.amazon.com/images/P/1566869250.0...,http://images.amazon.com/images/P/1566869250.0...,30072,usa,3,3.0,1
4,295695,082177350X,Flight of Fancy: American Heiresses (Zebra Ba...,Tracy Cozzens,Kensington Publishing Corporation,8,"charleston, south carolina, usa",http://images.amazon.com/images/P/082177350X.0...,http://images.amazon.com/images/P/082177350X.0...,http://images.amazon.com/images/P/082177350X.0...,61028,usa,2,2.0,1
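
The preview shows two columns, `Unnamed: 0.1` and `Unnamed: 0`, that look like row indices written out by earlier `to_csv` calls. The following is a minimal sketch of one way to drop them, assuming they carry no information the model needs; `toy_df` is a hypothetical stand-in for `book_df`.

```python
import pandas as pd

# Toy stand-in for book_df (hypothetical subset of columns and rows)
toy_df = pd.DataFrame({
    "Unnamed: 0.1": [0, 1],
    "Unnamed: 0": [193827, 383790],
    "ISBN": ["0590567330", "0310232546"],
    "Book-Rating": [9, 8],
})

# pandas names unlabeled CSV columns "Unnamed: N"; collect and drop them
leftover = [col for col in toy_df.columns if col.startswith("Unnamed")]
clean_df = toy_df.drop(columns=leftover)

print(list(clean_df.columns))  # ['ISBN', 'Book-Rating']
```

Writing the cleaned file with `to_csv(..., index=False)` avoids creating these columns in the first place.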


### Basic Stats

In [5]:
print(book_df.describe())

book_df.info()

          Unnamed: 0   Book-Rating        User-ID    New-Rating  \
count   94580.000000  94580.000000   94580.000000  94580.000000   
mean   264735.003859      7.546997  116602.223218      2.061186   
std     88930.495760      1.897310   82547.296358      0.773104   
min         0.000000      1.000000       8.000000      1.000000   
25%    207703.750000      6.000000   41364.000000      1.000000   
50%    280949.500000      8.000000  102880.500000      2.000000   
75%    338032.250000      9.000000  184424.000000      3.000000   
max    383839.000000     10.000000  278854.000000      3.000000   

       Weighted-Rating  Rating-Count  
count     94580.000000  94580.000000  
mean          2.059837      2.633485  
std           0.691279      7.729140  
min           1.000000      1.000000  
25%           1.646008      1.000000  
50%           2.000000      1.000000  
75%           2.666667      2.000000  
max           3.000000    585.000000  
<class 'pandas.core.frame.DataFrame'>
RangeIn

In [6]:
# Extract integer codes for users and books, plus the ratings
users = book_df['User-ID'].astype("category").cat.codes
books = book_df['Book-Title'].astype("category").cat.codes
ratings = book_df['Book-Rating']

# Build a sparse user-book matrix (rows: users, columns: books)
sparse_user_item_matrix_full = coo_matrix((ratings, (users, books)))

# Check the dimensions of the sparse matrix
sparse_user_item_matrix_full.shape

(22005, 94580)
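
To make the mapping from IDs to matrix positions concrete, here is a small sketch of the same construction on hypothetical data (two users, three titles). It also passes `shape` explicitly, which is an addition to the cell above, so the dimensions always match the number of categories.

```python
import pandas as pd
from scipy.sparse import coo_matrix

# Hypothetical ratings: two users, three titles
toy = pd.DataFrame({
    "User-ID": [8, 8, 42, 42],
    "Book-Title": ["Dune", "Emma", "Dune", "Ulysses"],
    "Book-Rating": [9, 7, 10, 5],
})

user_cat = toy["User-ID"].astype("category")
book_cat = toy["Book-Title"].astype("category")

# cat.codes maps each distinct value to an integer in sorted category order
rows = user_cat.cat.codes.to_numpy()
cols = book_cat.cat.codes.to_numpy()

# One row per user, one column per title
matrix = coo_matrix(
    (toy["Book-Rating"].to_numpy(), (rows, cols)),
    shape=(len(user_cat.cat.categories), len(book_cat.cat.categories)),
)

print(matrix.shape)      # (2, 3)
print(matrix.toarray())  # [[ 9  7  0]
                         #  [10  0  5]]
```

Column `j` of the matrix corresponds to `book_cat.cat.categories[j]`, which is how codes are turned back into titles later.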

In [7]:
sparse_user_item_matrix_full

<22005x94580 sparse matrix of type '<class 'numpy.int64'>'
	with 94580 stored elements in COOrdinate format>
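
As a quick check on how sparse this matrix is, the density can be computed from the shape and the stored-element count reported above. This is plain arithmetic on those printed numbers, not a new measurement.

```python
# Values copied from the matrix summary above
n_users, n_books = 22005, 94580
stored = 94580

# Density: fraction of user-book cells that hold a rating
density = stored / (n_users * n_books)

print(f"density = {density:.2e}")  # density = 4.54e-05
```

Fewer than 5 in 100,000 cells are filled, which is why a sparse format is used rather than a dense array.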


In [9]:
# Configure the KNN model to find similar books
knn_model = NearestNeighbors(metric='cosine', algorithm='brute', n_neighbors=8)
knn_model.fit(sparse_user_item_matrix_full.T)  # Transpose so that each row is a book
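
The sketch below shows what the fitted model does, using a tiny hypothetical rating matrix. Each book is a vector of user ratings, and cosine distance compares the direction of those vectors rather than their magnitude.

```python
import numpy as np
from scipy.sparse import csr_matrix
from sklearn.neighbors import NearestNeighbors

# Hypothetical user-by-book ratings: 3 users (rows), 3 books (columns)
user_item = csr_matrix(np.array([
    [5, 4, 0],
    [3, 3, 0],
    [0, 0, 5],
]))

# Transpose so each row is one book's vector of user ratings
item_user = user_item.T.tocsr()

model = NearestNeighbors(metric="cosine", algorithm="brute", n_neighbors=2)
model.fit(item_user)

# Query with book 0: its closest match is itself, followed by book 1,
# which was rated by the same users in similar proportions
distances, indices = model.kneighbors(item_user[0])
print(indices)
print(distances)
```

Because the query book is part of the fitted data, it comes back as its own nearest neighbor. Callers usually request one extra neighbor and drop the first hit.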

In [10]:
# Convert the sparse matrix to CSR format so it can be indexed
sparse_user_item_matrix_full_csr = sparse_user_item_matrix_full.tocsr()

# Find books similar to a specific book (by index)
example_book_index = 0  # by default, the first book in the matrix
distances, indices = knn_model.kneighbors(sparse_user_item_matrix_full_csr.T[example_book_index], n_neighbors=5)

# Look up the corresponding book titles
book_titles = book_df['Book-Title'].astype("category").cat.categories
similar_books = book_titles[indices.flatten()]

# Display the results
similar_books

Index(['Karens School Surprise (Baby-Sitters Little Sister, 77)',
       'How to Learn Any Language', 'The Wonderful Wizard of Oz',
       'The Case of the Mystery Cruise (The Adventures of Mary Kate and Ashley, No 2)',
       'Good-Bye Stacey, Good-Bye (Baby-Sitters Club, 13)'],
      dtype='object')
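
Looking books up by index is awkward in practice. The following sketch wraps the lookup in a title-based helper. The function name `recommend`, its parameters, and the toy data are all hypothetical; it assumes titles are unique, as the category index above is.

```python
import numpy as np
import pandas as pd
from scipy.sparse import csr_matrix
from sklearn.neighbors import NearestNeighbors


def recommend(title, titles, item_user, model, k=2):
    """Return up to k titles closest to `title`, skipping the query itself."""
    book_idx = titles.get_loc(title)  # raises KeyError for an unknown title
    # Ask for one extra neighbor because the query book matches itself
    _, indices = model.kneighbors(item_user[book_idx], n_neighbors=k + 1)
    return [titles[i] for i in indices.flatten() if i != book_idx][:k]


# Hypothetical book-by-user matrix: one row per title
titles = pd.Index(["Dune", "Emma", "Ulysses"])
item_user = csr_matrix(np.array([
    [5, 3, 0],  # Dune
    [4, 3, 0],  # Emma
    [0, 0, 5],  # Ulysses
]))
model = NearestNeighbors(metric="cosine", algorithm="brute").fit(item_user)

print(recommend("Dune", titles, item_user, model, k=1))  # ['Emma']
```

With the notebook's objects, the equivalent call would pass `book_titles`, the transposed CSR matrix, and `knn_model`.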

In [13]:
sparse_user_item_matrix_full_csr

<22005x94580 sparse matrix of type '<class 'numpy.int64'>'
	with 94580 stored elements in Compressed Sparse Row format>

In [12]:
import pickle

In [15]:
with open('artifacts/knn_model.pkl', 'wb') as f:
    pickle.dump(knn_model, f)
with open('artifacts/book_titles.pkl', 'wb') as f:
    pickle.dump(book_titles, f)
with open('artifacts/book_df.pkl', 'wb') as f:
    pickle.dump(book_df, f)
with open('artifacts/sparse_user_item_matrix_full_csr.pkl', 'wb') as f:
    pickle.dump(sparse_user_item_matrix_full_csr, f)
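
A saved artifact is only useful if it loads back intact. The sketch below round-trips a hypothetical stand-in for `book_titles` through pickle, using a temporary directory in place of `artifacts/` so it leaves no files behind.

```python
import pickle
import tempfile
from pathlib import Path

import pandas as pd

# Hypothetical stand-in for book_titles
titles = pd.Index(["Dune", "Emma", "Ulysses"])

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "book_titles.pkl"
    # Write the object, then read it back from the same file
    with open(path, "wb") as f:
        pickle.dump(titles, f)
    with open(path, "rb") as f:
        restored = pickle.load(f)

print(restored.equals(titles))  # True
```

Pickle files should only be loaded from trusted sources, since unpickling can execute arbitrary code.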

In [17]:
book_df.columns

Index(['Unnamed: 0', 'ISBN', 'Book-Title', 'Book-Author', 'Publisher',
       'Book-Rating', 'Location', 'Image-URL-S', 'Image-URL-M', 'Image-URL-L',
       'User-ID', 'Country', 'New-Rating', 'Weighted-Rating', 'Rating-Count'],
      dtype='object')