# Books Recommender system using clustering
### Collaborative filtering
- Book-Crossing Dataset: https://www.kaggle.com/ra4u12/bookrecommendation
- The Book-Crossing Dataset is widely used in recommender-systems research. It contains book metadata (title, author, publication year, publisher, and cover-image URLs) along with user information and user ratings.

In [1]:
# Import the necessary libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

In [2]:
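# Note: error_bad_lines was deprecated in pandas 1.3 and removed in 2.0; newer versions use on_bad_lines='skip'
books = pd.read_csv('data/BX-Books.csv', sep=";", error_bad_lines=False, encoding='latin-1')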



b'Skipping line 6452: expected 8 fields, saw 9\nSkipping line 43667: expected 8 fields, saw 10\nSkipping line 51751: expected 8 fields, saw 9\n'
b'Skipping line 92038: expected 8 fields, saw 9\nSkipping line 104319: expected 8 fields, saw 9\nSkipping line 121768: expected 8 fields, saw 9\n'
b'Skipping line 144058: expected 8 fields, saw 9\nSkipping line 150789: expected 8 fields, saw 9\nSkipping line 157128: expected 8 fields, saw 9\nSkipping line 180189: expected 8 fields, saw 9\nSkipping line 185738: expected 8 fields, saw 9\n'
b'Skipping line 209388: expected 8 fields, saw 9\nSkipping line 220626: expected 8 fields, saw 9\nSkipping line 227933: expected 8 fields, saw 11\nSkipping line 228957: expected 8 fields, saw 10\nSkipping line 245933: expected 8 fields, saw 9\nSkipping line 251296: expected 8 fields, saw 9\nSkipping line 259941: expected 8 fields, saw 9\nSkipping line 261529: expected 8 fields, saw 9\n'
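The `Skipping line ...` messages come from rows whose field count does not match the header. Below is a minimal sketch of the same behaviour on an in-memory string, assuming pandas 1.3 or later, where `on_bad_lines='skip'` replaces the deprecated `error_bad_lines=False`. The sample text is made up for illustration.

```python
import io
import pandas as pd

# Hypothetical semicolon-separated sample; the second data row has one field too many
raw = "ISBN;title;year\n111;Book A;2001\n222;Book;B;2002\n333;Book C;2003\n"

# dtype=str keeps identifiers such as ISBNs as text, so leading zeros are preserved
sample = pd.read_csv(io.StringIO(raw), sep=";", on_bad_lines="skip", dtype=str)
print(sample)
```

The malformed row is dropped silently, so it is worth comparing the row count with the source file when the number of skipped lines matters.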

In [3]:
books.head()

Unnamed: 0,ISBN,Book-Title,Book-Author,Year-Of-Publication,Publisher,Image-URL-S,Image-URL-M,Image-URL-L
0,0195153448,Classical Mythology,Mark P. O. Morford,2002,Oxford University Press,http://images.amazon.com/images/P/0195153448.0...,http://images.amazon.com/images/P/0195153448.0...,http://images.amazon.com/images/P/0195153448.0...
1,0002005018,Clara Callan,Richard Bruce Wright,2001,HarperFlamingo Canada,http://images.amazon.com/images/P/0002005018.0...,http://images.amazon.com/images/P/0002005018.0...,http://images.amazon.com/images/P/0002005018.0...
2,0060973129,Decision in Normandy,Carlo D'Este,1991,HarperPerennial,http://images.amazon.com/images/P/0060973129.0...,http://images.amazon.com/images/P/0060973129.0...,http://images.amazon.com/images/P/0060973129.0...
3,0374157065,Flu: The Story of the Great Influenza Pandemic...,Gina Bari Kolata,1999,Farrar Straus Giroux,http://images.amazon.com/images/P/0374157065.0...,http://images.amazon.com/images/P/0374157065.0...,http://images.amazon.com/images/P/0374157065.0...
4,0393045218,The Mummies of Urumchi,E. J. W. Barber,1999,W. W. Norton & Company,http://images.amazon.com/images/P/0393045218.0...,http://images.amazon.com/images/P/0393045218.0...,http://images.amazon.com/images/P/0393045218.0...


In [4]:
books.shape

(271360, 8)

In [5]:
books.columns

Index(['ISBN', 'Book-Title', 'Book-Author', 'Year-Of-Publication', 'Publisher',
       'Image-URL-S', 'Image-URL-M', 'Image-URL-L'],
      dtype='object')

### Column names and meaning
1. **ISBN**:
   - Stands for *International Standard Book Number*. It is a unique identifier assigned to books for commercial use, allowing for easy identification and cataloging.

2. **Book-Title**:
   - Refers to the title of the book, which is the name given to the work by its author or publisher.

3. **Book-Author**:
   - Indicates the author or authors who wrote the book. This column helps identify the creators of the literary work.

4. **Year-Of-Publication**:
   - Represents the year when the book was published. This is useful for understanding the book's time period and possibly its relevance or popularity in that era.

5. **Publisher**:
   - The company or entity responsible for the production and distribution of the book. This column identifies the publisher's name.

6. **Image-URL-S**:
   - The URL link to a *small* image of the book cover. This typically points to a thumbnail image used for quick display on websites or apps.

7. **Image-URL-M**:
   - The URL link to a *medium* size image of the book cover, which is a more detailed version of the small image.

8. **Image-URL-L**:
   - The URL link to a *large* image of the book cover, offering the highest resolution available for the book's cover art.

In [6]:
books['Image-URL-S'].iloc[0]

'http://images.amazon.com/images/P/0195153448.01.THUMBZZZ.jpg'

In [7]:
books['Image-URL-M'].iloc[0]

'http://images.amazon.com/images/P/0195153448.01.MZZZZZZZ.jpg'

In [8]:
books['Image-URL-L'].iloc[0]

'http://images.amazon.com/images/P/0195153448.01.LZZZZZZZ.jpg'

#### Conclusion:
From what we can see, the `'Image-URL-L'` column gives us the largest image, so we keep it and drop the other two image columns.

In [9]:
books = books[['ISBN','Book-Title', 'Book-Author', 'Year-Of-Publication', 'Publisher','Image-URL-L']]
books.head()

Unnamed: 0,ISBN,Book-Title,Book-Author,Year-Of-Publication,Publisher,Image-URL-L
0,0195153448,Classical Mythology,Mark P. O. Morford,2002,Oxford University Press,http://images.amazon.com/images/P/0195153448.0...
1,0002005018,Clara Callan,Richard Bruce Wright,2001,HarperFlamingo Canada,http://images.amazon.com/images/P/0002005018.0...
2,0060973129,Decision in Normandy,Carlo D'Este,1991,HarperPerennial,http://images.amazon.com/images/P/0060973129.0...
3,0374157065,Flu: The Story of the Great Influenza Pandemic...,Gina Bari Kolata,1999,Farrar Straus Giroux,http://images.amazon.com/images/P/0374157065.0...
4,0393045218,The Mummies of Urumchi,E. J. W. Barber,1999,W. W. Norton & Company,http://images.amazon.com/images/P/0393045218.0...


### Let's rename some columns

In [10]:
books.rename(columns={"Book-Title":'title',
                      'Book-Author':'author',
                     "Year-Of-Publication":'year',
                     "Publisher":"publisher",
                     "Image-URL-L":"image_url"},inplace=True)

In [11]:
books.head()

Unnamed: 0,ISBN,title,author,year,publisher,image_url
0,0195153448,Classical Mythology,Mark P. O. Morford,2002,Oxford University Press,http://images.amazon.com/images/P/0195153448.0...
1,0002005018,Clara Callan,Richard Bruce Wright,2001,HarperFlamingo Canada,http://images.amazon.com/images/P/0002005018.0...
2,0060973129,Decision in Normandy,Carlo D'Este,1991,HarperPerennial,http://images.amazon.com/images/P/0060973129.0...
3,0374157065,Flu: The Story of the Great Influenza Pandemic...,Gina Bari Kolata,1999,Farrar Straus Giroux,http://images.amazon.com/images/P/0374157065.0...
4,0393045218,The Mummies of Urumchi,E. J. W. Barber,1999,W. W. Norton & Company,http://images.amazon.com/images/P/0393045218.0...


### Now let's load the users dataset

This dataset holds details about the users who rated the books, such as the user ID, location, and age.


In [12]:
users = pd.read_csv('data/BX-Users.csv', sep=";", error_bad_lines=False, encoding='latin-1')
users.head()





Unnamed: 0,User-ID,Location,Age
0,1,"nyc, new york, usa",
1,2,"stockton, california, usa",18.0
2,3,"moscow, yukon territory, russia",
3,4,"porto, v.n.gaia, portugal",17.0
4,5,"farnborough, hants, united kingdom",


In [13]:
users.shape

(278858, 3)

### Let's rename the columns for convenience

In [14]:
users.rename(columns={"User-ID":'user_id',
                      'Location':'location',
                     "Age":'age'},inplace=True)

In [15]:
users.head()

Unnamed: 0,user_id,location,age
0,1,"nyc, new york, usa",
1,2,"stockton, california, usa",18.0
2,3,"moscow, yukon territory, russia",
3,4,"porto, v.n.gaia, portugal",17.0
4,5,"farnborough, hants, united kingdom",


### Now let's load the ratings dataset

This contains user ratings for the books, with columns for the User-ID, ISBN, and Book-Rating.

In [16]:
ratings = pd.read_csv('data/BX-Book-Ratings.csv', sep=";", error_bad_lines=False, encoding='latin-1')
ratings.head()





Unnamed: 0,User-ID,ISBN,Book-Rating
0,276725,034545104X,0
1,276726,0155061224,5
2,276727,0446520802,0
3,276729,052165615X,3
4,276729,0521795028,6


In [17]:
ratings.shape

(1149780, 3)

### Let's rename the columns for our convenience

In [18]:
# Let's rename some awkward column names
ratings.rename(columns={"User-ID":'user_id',
                      'Book-Rating':'rating'},inplace=True)

In [19]:
ratings.head()

Unnamed: 0,user_id,ISBN,rating
0,276725,034545104X,0
1,276726,0155061224,5
2,276727,0446520802,0
3,276729,052165615X,3
4,276729,0521795028,6


### Now we have three dataframes; let's take a closer look

In [20]:
print(books.shape, users.shape, ratings.shape, sep='\n')

(271360, 6)
(278858, 3)
(1149780, 3)


#### Let's see how many ratings each user has provided

In [21]:
ratings['user_id'].value_counts()

11676     13602
198711     7550
153662     6109
98391      5891
35859      5850
          ...  
116180        1
116166        1
116154        1
116137        1
276723        1
Name: user_id, Length: 105283, dtype: int64

#### Let's see how many unique users have provided ratings.

In [22]:
ratings['user_id'].value_counts().shape

(105283,)

In [23]:
ratings['user_id'].unique().shape

(105283,)

### Let's now store the user IDs of the users who have rated more than 200 books, using the filtering condition in the code below

In [24]:
# Store the users who have rated more than 200 books

x = ratings['user_id'].value_counts() > 200
x

11676      True
198711     True
153662     True
98391      True
35859      True
          ...  
116180    False
116166    False
116154    False
116137    False
276723    False
Name: user_id, Length: 105283, dtype: bool

In [25]:
y= x[x].index
y

Int64Index([ 11676, 198711, 153662,  98391,  35859, 212898, 278418,  76352,
            110973, 235105,
            ...
            260183,  73681,  44296, 155916,   9856, 274808,  28634,  59727,
            268622, 188951],
           dtype='int64', length=899)

In [26]:
y.shape

(899,)
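The two steps above (build a boolean mask from `value_counts()`, then keep the index labels where it is `True`) can be sketched on a toy frame. The data and the threshold of 2 are made up for illustration; the notebook uses 200.

```python
import pandas as pd

# Toy ratings: user 1 rated 3 books, user 2 rated 1, user 3 rated 2
toy = pd.DataFrame({"user_id": [1, 1, 1, 2, 3, 3],
                    "rating":  [5, 0, 8, 7, 0, 9]})

counts = toy["user_id"].value_counts()   # ratings per user
mask = counts > 2                        # True for users above the threshold
active_users = mask[mask].index          # keep only the labels where mask is True

# Keep only the rows that belong to active users
filtered = toy[toy["user_id"].isin(active_users)]
print(list(active_users), filtered.shape)
```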

Restricting the data to users with many ratings is a common preprocessing step in recommendation systems; the threshold of 200 used here is a choice rather than a fixed rule. Some reasons for doing it:

### 1. **Ensuring Active Users**:
   - By filtering out users with 200 or fewer ratings, you focus on the most active users in the dataset. Active users tend to provide more diverse and consistent feedback, which helps build more accurate recommendation models. Including users with very few ratings can skew the analysis, as their preferences are underrepresented and less reliable for pattern detection.
   - **Source**: *Recommender Systems Handbook* by Francesco Ricci, Lior Rokach, and Bracha Shapira.

### 2. **Improving Model Accuracy**:
   - When building recommendation models, having a substantial number of ratings per user allows the system to better understand the preferences of each user. With too few ratings, a user's preferences are harder to discern, leading to less reliable predictions. Including only users with more than 200 ratings increases the quality of the training data.
   - **Source**: *Practical Recommender Systems* by Kim Falk.

### 3. **Reducing Noise**:
   - Users with very few ratings often provide less diverse and less consistent data, which can add noise to the model. This can lead to overfitting, where the model "learns" from these less representative preferences, hurting its ability to generalize well to new data.
   - **Source**: *Recommender Systems: From Research to Practice* by Giovanni Rizzo, Paolo Cremonesi.

### 4. **Data Imbalance**:
   - In many datasets, the distribution of ratings is skewed: a small portion of users contributes the majority of ratings. Filtering out users with 200 or fewer ratings reduces this imbalance and keeps the model from being dominated by the many users who rated only a few books, allowing more meaningful patterns to emerge.
   - **Source**: *Collaborative Filtering Recommender Systems* by Michael D. Ekstrand, John T. Riedl, and Joseph A. Konstan.

### 5. **Practical Considerations**:
   - From a computational perspective, working with a dataset containing only active users with more ratings can reduce the computational complexity of the recommendation algorithms. This can lead to faster processing and more efficient model training.
   - **Source**: *Building Data Science Applications with FastAPI* by François Voron.

By focusing on more active users, you increase the robustness and quality of your recommendations while ensuring that the model's insights are derived from users with sufficiently rich data.

### Filtering the ratings dataframe to the users in `y`

In [27]:
# New ratings values

ratings = ratings[ratings['user_id'].isin(y)]
ratings.head()

Unnamed: 0,user_id,ISBN,rating
1456,277427,002542730X,10
1457,277427,0026217457,0
1458,277427,003008685X,8
1459,277427,0030615321,0
1460,277427,0060002050,0


In [28]:
ratings.shape

(526356, 3)

### Now let's merge the ratings dataframe and the books dataframe

In [29]:
# Merge on "ISBN" column

ratings_with_books = ratings.merge(books, on='ISBN')
ratings_with_books.head()

Unnamed: 0,user_id,ISBN,rating,title,author,year,publisher,image_url
0,277427,002542730X,10,Politically Correct Bedtime Stories: Modern Ta...,James Finn Garner,1994,John Wiley & Sons Inc,http://images.amazon.com/images/P/002542730X.0...
1,3363,002542730X,0,Politically Correct Bedtime Stories: Modern Ta...,James Finn Garner,1994,John Wiley & Sons Inc,http://images.amazon.com/images/P/002542730X.0...
2,11676,002542730X,6,Politically Correct Bedtime Stories: Modern Ta...,James Finn Garner,1994,John Wiley & Sons Inc,http://images.amazon.com/images/P/002542730X.0...
3,12538,002542730X,10,Politically Correct Bedtime Stories: Modern Ta...,James Finn Garner,1994,John Wiley & Sons Inc,http://images.amazon.com/images/P/002542730X.0...
4,13552,002542730X,0,Politically Correct Bedtime Stories: Modern Ta...,James Finn Garner,1994,John Wiley & Sons Inc,http://images.amazon.com/images/P/002542730X.0...


In [30]:
ratings_with_books.shape

(487671, 8)
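The merged frame has 487671 rows while the filtered ratings had 526356, because `merge` performs an inner join by default and drops ratings whose ISBN is missing from `books`. A small sketch with made-up rows, using the `indicator=True` option to count the unmatched ratings:

```python
import pandas as pd

toy_ratings = pd.DataFrame({"user_id": [1, 2, 3],
                            "ISBN": ["A", "B", "Z"],   # "Z" has no entry in toy_books
                            "rating": [5, 0, 8]})
toy_books = pd.DataFrame({"ISBN": ["A", "B"], "title": ["Book A", "Book B"]})

inner = toy_ratings.merge(toy_books, on="ISBN")            # default how="inner"
left = toy_ratings.merge(toy_books, on="ISBN", how="left", indicator=True)
unmatched = (left["_merge"] == "left_only").sum()          # ratings without a book row
print(len(inner), unmatched)
```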

#### Let's group the ratings data by book `title` and count the number of `ratings` for each book.


In [31]:
# Let's group the ratings data by book title and count the number of ratings for each book.

number_rating = ratings_with_books.groupby('title')['rating'].count().reset_index()
number_rating.head()

Unnamed: 0,title,rating
0,A Light in the Storm: The Civil War Diary of ...,2
1,Always Have Popsicles,1
2,Apple Magic (The Collector's series),1
3,Beyond IBM: Leadership Marketing and Finance ...,1
4,Clifford Visita El Hospital (Clifford El Gran...,1


### Now we have two of our dataframes ready (`ratings_with_books` and `number_rating`).

Even though our dataframes are ready, we need to make one change. Both dataframes have a column named `rating`. In `ratings_with_books` it is the rating a `user_id` gave to a book, while in `number_rating` it is the number of ratings each `title` received. To avoid confusion, we'll rename the `rating` column in `number_rating` to `num_of_rating`.

In [32]:
number_rating.rename(columns={'rating':'num_of_rating'},inplace=True)
number_rating.head()

Unnamed: 0,title,num_of_rating
0,A Light in the Storm: The Civil War Diary of ...,2
1,Always Have Popsicles,1
2,Apple Magic (The Collector's series),1
3,Beyond IBM: Leadership Marketing and Finance ...,1
4,Clifford Visita El Hospital (Clifford El Gran...,1


### Let's now merge the dataframes

In [33]:
final_rating = ratings_with_books.merge(number_rating, on='title')
final_rating.head()

Unnamed: 0,user_id,ISBN,rating,title,author,year,publisher,image_url,num_of_rating
0,277427,002542730X,10,Politically Correct Bedtime Stories: Modern Ta...,James Finn Garner,1994,John Wiley & Sons Inc,http://images.amazon.com/images/P/002542730X.0...,82
1,3363,002542730X,0,Politically Correct Bedtime Stories: Modern Ta...,James Finn Garner,1994,John Wiley & Sons Inc,http://images.amazon.com/images/P/002542730X.0...,82
2,11676,002542730X,6,Politically Correct Bedtime Stories: Modern Ta...,James Finn Garner,1994,John Wiley & Sons Inc,http://images.amazon.com/images/P/002542730X.0...,82
3,12538,002542730X,10,Politically Correct Bedtime Stories: Modern Ta...,James Finn Garner,1994,John Wiley & Sons Inc,http://images.amazon.com/images/P/002542730X.0...,82
4,13552,002542730X,0,Politically Correct Bedtime Stories: Modern Ta...,James Finn Garner,1994,John Wiley & Sons Inc,http://images.amazon.com/images/P/002542730X.0...,82


In [34]:
# Keep only the books that received at least 50 ratings

final_rating = final_rating[final_rating['num_of_rating'] >= 50]
final_rating.head()

Unnamed: 0,user_id,ISBN,rating,title,author,year,publisher,image_url,num_of_rating
0,277427,002542730X,10,Politically Correct Bedtime Stories: Modern Ta...,James Finn Garner,1994,John Wiley & Sons Inc,http://images.amazon.com/images/P/002542730X.0...,82
1,3363,002542730X,0,Politically Correct Bedtime Stories: Modern Ta...,James Finn Garner,1994,John Wiley & Sons Inc,http://images.amazon.com/images/P/002542730X.0...,82
2,11676,002542730X,6,Politically Correct Bedtime Stories: Modern Ta...,James Finn Garner,1994,John Wiley & Sons Inc,http://images.amazon.com/images/P/002542730X.0...,82
3,12538,002542730X,10,Politically Correct Bedtime Stories: Modern Ta...,James Finn Garner,1994,John Wiley & Sons Inc,http://images.amazon.com/images/P/002542730X.0...,82
4,13552,002542730X,0,Politically Correct Bedtime Stories: Modern Ta...,James Finn Garner,1994,John Wiley & Sons Inc,http://images.amazon.com/images/P/002542730X.0...,82


In [35]:
final_rating.shape

(61853, 9)
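The count-then-merge steps can also be written with `groupby(...).transform('count')`, which attaches the per-title count to every row directly. This is an alternative sketch on made-up data, not the notebook's method; the threshold of 2 stands in for 50.

```python
import pandas as pd

toy = pd.DataFrame({"title": ["A", "A", "A", "B", "C", "C"],
                    "rating": [5, 0, 8, 7, 0, 9]})

# Number of ratings each title received, aligned with the original rows
toy["num_of_rating"] = toy.groupby("title")["rating"].transform("count")

popular = toy[toy["num_of_rating"] >= 2]   # keep titles with at least 2 ratings
print(popular["title"].unique())
```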

### Next we create a pivot table

A **pivot table** is used to summarize, aggregate, and reorganize data to better understand patterns and relationships. In this case, the pivot table will be used to transform the book rating data into a more insightful format.

### Why use a Pivot Table:
1. **Restructure the Data:**
   - The original dataset has multiple rows for each book, each row representing a rating by a different user. By creating a pivot table, we reorganize the data so that each book's ratings are aggregated across all users, making it easier to analyze.
   
2. **Simplifying Analysis:**
   - Instead of manually looking through each individual rating, the pivot table aggregates them by `title` (book) and `user_id` (user), creating a matrix where the rows represent the book titles and the columns represent the users.
   - This table shows the ratings a user has given to each book, with `NaN` representing missing ratings (i.e., the user has not rated that book).

### What is it used for in this case:
- **Finding User Preferences:**
   - This pivot table can help identify patterns in user behavior. For example, you can see which books have been rated highly by particular users or which books remain unrated by most users.
   
- **Collaborative Filtering:**
   - In recommendation systems, pivot tables like these are used to implement collaborative filtering. This involves finding users with similar preferences to recommend books they haven't rated yet.
   
- **Aggregating Ratings:**
   - You can easily compute the average rating for each book by calculating row-wise means, or you can identify the most and least rated books by examining the columns and rows.

In summary, the pivot table serves as a powerful tool to convert the user-book rating data into a more digestible format for analysis and recommendation system building.
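One caveat on the row-wise mean mentioned above: it gives the average of actual ratings only while missing entries are still `NaN`. A minimal sketch with made-up ratings:

```python
import numpy as np
import pandas as pd

# Rows are titles, columns are user ids; NaN means "not rated"
toy_pivot = pd.DataFrame([[8.0, np.nan, 10.0],
                          [np.nan, 6.0, np.nan]],
                         index=["Book A", "Book B"], columns=[1, 2, 3])

mean_rated = toy_pivot.mean(axis=1)              # NaN entries are skipped
mean_filled = toy_pivot.fillna(0).mean(axis=1)   # zeros pull the average down
print(mean_rated.tolist(), mean_filled.tolist())
```

If averages are needed, computing them before any `fillna(0)` step avoids counting unrated cells as zero ratings.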

### Below is the pivot table syntax

`pivot_table = df.pivot(index='row_col', columns='col_col', values='value_col')`
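The notebook calls `pivot_table` rather than `pivot`. One practical difference: `pivot` raises an error when an index/column pair occurs more than once, whereas `pivot_table` aggregates duplicates (mean by default). That situation can arise here if the same title exists under several ISBNs and one user rated more than one of them. A sketch with made-up rows:

```python
import pandas as pd

toy = pd.DataFrame({"title":   ["A", "A", "B"],
                    "user_id": [1,   1,   2],     # user 1 rated title "A" twice
                    "rating":  [6,   10,  7]})

try:
    toy.pivot(index="title", columns="user_id", values="rating")
    pivot_failed = False
except ValueError:
    pivot_failed = True   # duplicate (title, user_id) pair cannot be reshaped

table = toy.pivot_table(index="title", columns="user_id", values="rating")
print(pivot_failed, table.loc["A", 1])   # duplicates are averaged
```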

In [36]:
# Let's create a pivot table
book_pivot = final_rating.pivot_table(index='title', columns='user_id', values= 'rating')
book_pivot

user_id,254,2276,2766,2977,3363,3757,4017,4385,6242,6251,...,274004,274061,274301,274308,274808,275970,277427,277478,277639,278418
title
1984,9.0,,,,,,,,,,...,,,,,,0.0,,,,
1st to Die: A Novel,,,,,,,,,,,...,,,,,,,,,,
2nd Chance,,10.0,,,,,,,,,...,,,,0.0,,,,,0.0,
4 Blondes,,,,,,,,,,0.0,...,,,,,,,,,,
84 Charing Cross Road,,,,,,,,,,,...,,,,,,10.0,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
Year of Wonders,,,,7.0,,,,,7.0,,...,,,,,,0.0,,,,
You Belong To Me,,,,,,,,,,,...,,,,,,,,,,
Zen and the Art of Motorcycle Maintenance: An Inquiry into Values,,,,,0.0,,,,,0.0,...,,,,,,0.0,,,,
Zoya,,,,,,,,,,,...,,,,,,,,,,


In [37]:
book_pivot.shape

(742, 888)

### Let's take care of missing values

In [38]:
book_pivot.fillna(0, inplace=True)
book_pivot

user_id,254,2276,2766,2977,3363,3757,4017,4385,6242,6251,...,274004,274061,274301,274308,274808,275970,277427,277478,277639,278418
title
1984,9.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1st to Die: A Novel,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2nd Chance,0.0,10.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4 Blondes,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
84 Charing Cross Road,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,10.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
Year of Wonders,0.0,0.0,0.0,7.0,0.0,0.0,0.0,0.0,7.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
You Belong To Me,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
Zen and the Art of Motorcycle Maintenance: An Inquiry into Values,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
Zoya,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


# Training Model

In [39]:
from scipy.sparse import csr_matrix
from sklearn.neighbors import NearestNeighbors
import numpy as np


In [40]:
book_sparse = csr_matrix(book_pivot)
book_sparse

<742x888 sparse matrix of type '<class 'numpy.float64'>'
	with 15226 stored elements in Compressed Sparse Row format>
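The repr above reports 15226 stored values in a 742 x 888 matrix, so only about 2.3% of the cells are non-zero (15226 / 658896). A short sketch of the same calculation on a made-up matrix, using the `nnz` attribute:

```python
import numpy as np
from scipy.sparse import csr_matrix

dense = np.array([[9.0, 0.0, 0.0, 0.0],
                  [0.0, 0.0, 7.0, 0.0],
                  [0.0, 0.0, 0.0, 0.0]])
sparse = csr_matrix(dense)

# nnz counts stored entries; dividing by the cell count gives the density
density = sparse.nnz / (sparse.shape[0] * sparse.shape[1])
print(sparse.nnz, round(density, 3))
```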

In [41]:
type(book_sparse)

scipy.sparse.csr.csr_matrix

# We will be using Nearest Neighbors, which is an unsupervised algorithm

### Nearest Neighbors Algorithm  
The **Nearest Neighbors algorithm** is a foundational unsupervised learning method used to find the closest points in a dataset based on distance metrics like **Euclidean**, **Manhattan**, or **cosine similarity**. Unlike supervised models, it doesn't involve labels or target predictions; instead, it focuses on relationships and similarities within the data.

#### Key Features:  
- **Unsupervised Learning**: Operates without requiring labeled data.
- **Distance Metrics**: Determines similarity between points using customizable distance measures.
- **Efficient Search**: Utilizes algorithms like **Ball Tree**, **KD-Tree**, or **brute force** for neighbor retrieval.
- **Applications**:  
  - Clustering (e.g., identifying similar groups).  
  - Recommender systems (e.g., finding similar users or items).  
  - Anomaly detection (e.g., identifying outliers based on distance).

#### Implementation in Scikit-Learn:  
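```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

# Toy dataset: 6 points in 2-D (illustrative values)
data = np.array([[0, 0], [1, 0], [0, 1], [5, 5], [6, 5], [5, 6]])
query_point = np.array([[0.2, 0.1]])  # must be 2-D: (n_queries, n_features)

# Example: Initialize the model
nn = NearestNeighbors(n_neighbors=5, algorithm='brute')

# Fit the model on the dataset
nn.fit(data)

# Find the 5 nearest neighbors for the query point
distances, indices = nn.kneighbors(query_point)
```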

#### Use Cases:  
- **Recommendation Systems**: Suggesting similar products or items.  
- **Similarity Search**: Finding similar entries in datasets.  
- **Outlier Detection**: Identifying rare or unusual points.

#### Key Methods:
- `fit(data)`: Stores the dataset for neighbor searches.
- `kneighbors(query_point)`: Retrieves the nearest neighbors for the input.
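
`NearestNeighbors` uses the Minkowski metric with `p=2` (Euclidean distance) unless told otherwise, which is what the notebook relies on. The sketch below, on made-up vectors, shows how the choice of metric can change the nearest neighbor. Passing `metric='cosine'` is an illustrative alternative, not something the notebook does.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

# Made-up item vectors (columns are users)
items = np.array([[10.0, 10.0, 0.0, 0.0],   # same direction as the query, larger scale
                  [1.0,  0.0,  1.0, 0.0]])  # close in magnitude, different direction
query = np.array([[1.0, 1.0, 0.0, 0.0]])

euclid = NearestNeighbors(n_neighbors=1, algorithm="brute").fit(items)
cosine = NearestNeighbors(n_neighbors=1, algorithm="brute", metric="cosine").fit(items)

_, idx_euclid = euclid.kneighbors(query)
_, idx_cosine = cosine.kneighbors(query)
print(idx_euclid[0][0], idx_cosine[0][0])
```

Euclidean distance favors the second vector because its magnitude is close to the query's, while cosine distance favors the first because it points in the same direction.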


In [42]:
# Fit Nearest Neighbors (imported above), an unsupervised ML algorithm, on the sparse matrix

model = NearestNeighbors(algorithm= 'brute')
model.fit(book_sparse)

NearestNeighbors(algorithm='brute')

### Let's see what the k-nearest neighbors (k-NN) search returns for the book at row position 254 of the pivot table


```python
distance, suggestion = model.kneighbors(book_pivot.iloc[254,:].values.reshape(1,-1), n_neighbors=6 )
```

The `254` is the **positional index of the row** in the pivot table `book_pivot`. Because positions start at 0, it selects the 255th book. This index is used to retrieve the ratings associated with that book across all users.

### Explanation:
- `book_pivot.iloc[254, :]` selects the **row at position 254** of the `book_pivot` DataFrame. This row holds the ratings for a specific book (with its title as the row index).
- The result is a **Series** in which each value is the rating given by a specific user (the user IDs are the column labels).

The line `model.kneighbors(...)` uses this row of ratings as input to the model, which finds the **nearest neighbors** (other books with similar ratings by users) based on the distances between these ratings.

In summary, the `254` in this context is an **index** referring to the **book** in the pivot table. It accesses the ratings for that specific book across all users.
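
`kneighbors` expects a 2-D array of shape `(n_queries, n_features)`, which is why the selected row is reshaped with `reshape(1, -1)`. A small sketch of the shapes involved, on a made-up frame:

```python
import pandas as pd

toy_pivot = pd.DataFrame([[9.0, 0.0, 0.0],
                          [0.0, 7.0, 8.0]],
                         index=["Book A", "Book B"], columns=[1, 2, 3])

row = toy_pivot.iloc[1, :]                 # a Series: one value per user
as_1d = row.values                         # shape (3,)
as_2d = row.values.reshape(1, -1)          # shape (1, 3): one query, three features
print(row.name, as_1d.shape, as_2d.shape)
```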

In [43]:
# Distances and suggestions for row 254, using all of its columns: [254, :]

distance, suggestion = model.kneighbors(book_pivot.iloc[254,:].values.reshape(1,-1), n_neighbors=6 )

In [44]:
distance

array([[ 0.        , 36.48439545, 36.79673899, 37.18870796, 37.92756254,
        38.06573262]])

In [45]:
suggestion

array([[254,  34, 393, 372, 536, 184]])

In [46]:
book_pivot.iloc[254,:]

user_id
254       0.0
2276      0.0
2766      0.0
2977      0.0
3363      0.0
         ... 
275970    0.0
277427    0.0
277478    0.0
277639    0.0
278418    0.0
Name: High Fidelity, Length: 888, dtype: float64

### To see the non-zero ratings for the book at row 254, let's use the code below

In [47]:
# Select the row for the specific book (row 254 in this case)
book_ratings = book_pivot.iloc[254,:]
non_zero_ratings = book_ratings[book_ratings != 0]
non_zero_ratings

user_id
6242       6.0
8681      10.0
23902      9.0
78973      6.0
130571     6.0
137190     7.0
147451     6.0
149908    10.0
156150     7.0
164323     8.0
180651     8.0
184299     8.0
197364     8.0
209516     7.0
229313    10.0
229741     9.0
242083     7.0
246311     7.0
252820     8.0
260183     8.0
270713     5.0
Name: High Fidelity, dtype: float64

### Now let us carefully treat the `suggestion` array

In [48]:
book_pivot.index


Index(['1984', '1st to Die: A Novel', '2nd Chance', '4 Blondes',
       '84 Charing Cross Road', 'A Bend in the Road', 'A Case of Need',
       'A Child Called \It\": One Child's Courage to Survive"',
       'A Civil Action', 'A Cry In The Night',
       ...
       'Winter Solstice', 'Wish You Well', 'Without Remorse',
       'Wizard and Glass (The Dark Tower, Book 4)', 'Wuthering Heights',
       'Year of Wonders', 'You Belong To Me',
       'Zen and the Art of Motorcycle Maintenance: An Inquiry into Values',
       'Zoya', '\O\" Is for Outlaw"'],
      dtype='object', name='title', length=742)

In [49]:
book_pivot.index[0]

'1984'

### Let's now store the book names from `book_pivot` in a variable called `book_names`

In [51]:
# Keep the book names
book_names = book_pivot.index
book_names

Index(['1984', '1st to Die: A Novel', '2nd Chance', '4 Blondes',
       '84 Charing Cross Road', 'A Bend in the Road', 'A Case of Need',
       'A Child Called \It\": One Child's Courage to Survive"',
       'A Civil Action', 'A Cry In The Night',
       ...
       'Winter Solstice', 'Wish You Well', 'Without Remorse',
       'Wizard and Glass (The Dark Tower, Book 4)', 'Wuthering Heights',
       'Year of Wonders', 'You Belong To Me',
       'Zen and the Art of Motorcycle Maintenance: An Inquiry into Values',
       'Zoya', '\O\" Is for Outlaw"'],
      dtype='object', name='title', length=742)

In [None]:
np.where(book_pivot.index == '4 Blondes')[0][0]

3

# Finding the image URL

In [52]:
final_rating.head()

Unnamed: 0,user_id,ISBN,rating,title,author,year,publisher,image_url,num_of_rating
0,277427,002542730X,10,Politically Correct Bedtime Stories: Modern Ta...,James Finn Garner,1994,John Wiley & Sons Inc,http://images.amazon.com/images/P/002542730X.0...,82
1,3363,002542730X,0,Politically Correct Bedtime Stories: Modern Ta...,James Finn Garner,1994,John Wiley & Sons Inc,http://images.amazon.com/images/P/002542730X.0...,82
2,11676,002542730X,6,Politically Correct Bedtime Stories: Modern Ta...,James Finn Garner,1994,John Wiley & Sons Inc,http://images.amazon.com/images/P/002542730X.0...,82
3,12538,002542730X,10,Politically Correct Bedtime Stories: Modern Ta...,James Finn Garner,1994,John Wiley & Sons Inc,http://images.amazon.com/images/P/002542730X.0...,82
4,13552,002542730X,0,Politically Correct Bedtime Stories: Modern Ta...,James Finn Garner,1994,John Wiley & Sons Inc,http://images.amazon.com/images/P/002542730X.0...,82


In [53]:
final_rating['title']

0         Politically Correct Bedtime Stories: Modern Ta...
1         Politically Correct Bedtime Stories: Modern Ta...
2         Politically Correct Bedtime Stories: Modern Ta...
3         Politically Correct Bedtime Stories: Modern Ta...
4         Politically Correct Bedtime Stories: Modern Ta...
                                ...                        
236701                                     And Then You Die
236702                                     And Then You Die
236703                                     And Then You Die
236704                                     And Then You Die
236705                                     And Then You Die
Name: title, Length: 61853, dtype: object

In [54]:
final_rating['title'].iloc[600]

'Girl in Hyacinth Blue'

### Let's try another position to see which title it returns

In [55]:
final_rating['title'].iloc[50]

'Politically Correct Bedtime Stories: Modern Tales for Our Life and Times'

In [56]:
# Find all indices where the title matches "One for the Money (Stephanie Plum Novels (Paperback))" in the 'final_rating' dataset

ids = np.where(final_rating['title'] == "One for the Money (Stephanie Plum Novels (Paperback))")
ids

(array([323, 324, 325, 326, 327, 328, 329, 330, 331, 332, 333, 334, 335,
        336, 337, 338, 339, 340, 341, 342, 343, 344, 345, 346, 347, 348,
        349, 350, 351, 352, 353, 354, 355, 356, 357, 358, 359, 360, 361,
        362, 363, 364, 365, 366, 367, 368, 369, 370, 371, 372, 373, 374,
        375, 376, 377, 378, 379, 380, 381, 382, 383, 384, 385, 386, 387,
        388, 389, 390, 391, 392, 393, 394, 395, 396, 397, 398, 399, 400,
        401, 402, 403, 404, 405, 406, 407, 408, 409, 410, 411, 412, 413,
        414, 415, 416, 417, 418, 419, 420, 421, 422, 423, 424, 425, 426,
        427, 428, 429, 430]),)

### Let's extract the first index where the title `"One for the Money (Stephanie Plum Novels (Paperback))"` appears in the final_rating['title'] column of the DataFrame.



In [57]:
# Find the first index where the title matches the specified book, "One for the Money (Stephanie Plum Novels (Paperback))"

ids = np.where(final_rating['title'] == "One for the Money (Stephanie Plum Novels (Paperback))")[0][0]
ids

323

In [58]:
# Let's find the image url of the above ID

final_rating.iloc[ids]['image_url']

'http://images.amazon.com/images/P/0061009059.01.LZZZZZZZ.jpg'

### Let's retrieve book names based on book ids suggested by the model


In [59]:
book_name = []

# Loop over the suggested book indices (IDs) from the model
for book_id in suggestion:
    # Append the corresponding book name (index from book_pivot) to the book_name list
    book_name.append(book_pivot.index[book_id])

In [60]:
book_name[0]

Index(['High Fidelity', 'About a Boy', 'Pleading Guilty', 'No Safe Place',
       'The Cradle Will Fall', 'Exclusive'],
      dtype='object', name='title')
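
Because `suggestion` is a 2-D array with a single row, the loop above runs once and `book_name` ends up holding one `Index` of six titles. A sketch of an equivalent direct lookup, with made-up titles and indices, that also drops the query book itself:

```python
import numpy as np
import pandas as pd

titles = pd.Index(["Book A", "Book B", "Book C", "Book D"], name="title")
toy_suggestion = np.array([[2, 0, 3]])     # shape (1, 3), as returned by kneighbors

query_position = 2                         # position of the book we asked about
neighbor_ids = toy_suggestion[0]           # take the single row of indices
neighbor_ids = neighbor_ids[neighbor_ids != query_position]

suggested = titles[neighbor_ids].tolist()
print(suggested)
```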

### Let's get the index of each book name in the final_rating DataFrame


In [61]:
ids_index = []

# Loop through the first item of the book_name list (i.e., the first book name suggestion)
for name in book_name[0]: 
    # Find the index of the book in the final_rating DataFrame where the title matches the name
    ids = np.where(final_rating['title'] == name)[0][0]
    # Append the index to the ids_index list
    ids_index.append(ids)


### Let's print the image URL for each book from the final_rating DataFrame


In [62]:
for idx in ids_index:
    # Retrieve the image URL of the book using the index from final_rating
    url = final_rating.iloc[idx]['image_url']
    # Print the URL
    print(url)


http://images.amazon.com/images/P/1573228214.01.LZZZZZZZ.jpg
http://images.amazon.com/images/P/1573227331.01.LZZZZZZZ.jpg
http://images.amazon.com/images/P/0446365505.01.LZZZZZZZ.jpg
http://images.amazon.com/images/P/0345404777.01.LZZZZZZZ.jpg
http://images.amazon.com/images/P/0440115450.01.LZZZZZZZ.jpg
http://images.amazon.com/images/P/0446604232.01.LZZZZZZZ.jpg
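
Each `np.where` call above scans the whole `title` column. A possible alternative, sketched on made-up rows, is to build a title-to-URL lookup once with `drop_duplicates` and `set_index`. The URLs here are placeholders.

```python
import pandas as pd

toy_final = pd.DataFrame({
    "title": ["Book A", "Book A", "Book B"],
    "image_url": ["http://example.com/a.jpg",
                  "http://example.com/a.jpg",
                  "http://example.com/b.jpg"],
})

# One row per title (the first occurrence), indexed by title
url_by_title = toy_final.drop_duplicates("title").set_index("title")["image_url"]

for name in ["Book B", "Book A"]:
    print(url_by_title[name])
```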


### Let's save the model

In [63]:
import pickle

# Persist the model and the lookup objects needed at inference time
artifacts = {'model': model, 'book_names': book_names,
             'final_rating': final_rating, 'book_pivot': book_pivot}
for name, obj in artifacts.items():
    # The with-block closes each file once it has been written
    with open(f'artifacts/{name}.pkl', 'wb') as f:
        pickle.dump(obj, f)

# Testing model

In [64]:
import pickle

# Load the saved model with the same library that wrote it
with open('artifacts/model.pkl', 'rb') as f:
    model = pickle.load(f)

In [65]:
def recommend_book(book_name):
    # Find the index of the book_name in the book_pivot DataFrame
    book_id = np.where(book_pivot.index == book_name)[0][0]  

    # Get the 6 nearest neighbors for the book based on the model
    distance, suggestion = model.kneighbors(book_pivot.iloc[book_id,:].values.reshape(1,-1), n_neighbors=6)

    # Loop through the suggested books and print the recommendations
    for i in range(len(suggestion)):

        # Loop through the suggestion array to retrieve the suggested book indices
        books = book_pivot.index[suggestion[i]]
        for j in books:
            if j == book_name:
                # If the suggested book is the same as the searched book, print a special message
                print(f"You searched '{book_name}'\n")
                print("The suggestion books are: \n")
            else:
                # Print the name of each suggested book that isn't the searched one
                print(j)

In [66]:
book_name = "One for the Money (Stephanie Plum Novels (Paperback))"

recommend_book(book_name)

You searched 'One for the Money (Stephanie Plum Novels (Paperback))'

The suggestion books are: 

Exclusive
Fine Things
No Safe Place
Night Whispers
The Cradle Will Fall
