# Book Recommender System

## Technique used:
- Item-based Collaborative Filtering
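- Nearest Neighbors (brute-force search; the model fitted below keeps scikit-learn's default Euclidean distance, and cosine distance is an alternative metric)
- Collaborative filtering depends on user behaviour, not on book descriptions.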

## Main idea:
- Books rated similarly by users are considered similar.
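
As a rough illustration of this idea, the sketch below compares books through the ratings the same users gave them, using cosine similarity as one common measure. The three books, four users, and rating values are invented for the example, and `cosine_similarity` is a hypothetical helper, not part of the notebook.

```python
import numpy as np

# Rows = books, columns = users; values are invented ratings (0 = no rating).
toy_ratings = np.array([
    [9.0, 8.0, 0.0, 0.0],   # book A
    [8.0, 9.0, 0.0, 1.0],   # book B: rated much like A
    [0.0, 0.0, 7.0, 9.0],   # book C: rated by different users
])

def cosine_similarity(u, v):
    """Cosine of the angle between two rating vectors (1.0 = same direction)."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

sim_ab = cosine_similarity(toy_ratings[0], toy_ratings[1])
sim_ac = cosine_similarity(toy_ratings[0], toy_ratings[2])
print(f"A vs B: {sim_ab:.3f}, A vs C: {sim_ac:.3f}")
```

Books A and B share raters and score close to 1, while A and C share none and score 0.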

## Input:
- Book title

## Output:
- List of similar books.

# Data set

- https://www.kaggle.com/datasets/ra4u12/bookrecommendation

## BX-Books.csv : contains book metadata
- ISBN
- Book-Title
- Book-Author
- Year-Of-Publication
- Publisher
- Image-URL-S 
- Image-URL-M 
- Image-URL-L

## BX-Book-Ratings.csv
- User-ID	
- ISBN	
- Book-Rating

##  BX-Users.csv
- User-ID	
- Location	
- Age




# UML Flow

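- BX-Books.csv
- BX-Users.csv (loaded and inspected only; not merged)
- BX-Book-Ratings.csv
- ↓ Filter active users (more than 200 ratings)
- ↓ Merge ratings with books on ISBN
- ↓ Filter popular books (at least 50 ratings)
- Cleaned Dataset
- Pivot Table (Books × Users Matrix)
- ↓ NearestNeighbors Model
- Recommendations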


# Import Libraries
- pandas → data handling
- numpy → numerical operations
- sklearn → machine learning

In [1]:
import pandas as pd
import numpy as np

## Load Books Dataset and create books DataFrame

- Read the CSV using:
    - separator `;`
    - encoding `latin-1`
- Why set the encoding?
    - The dataset contains special characters, and many older datasets (especially from the 1990s–2000s), like BX-Books, were created using `Latin-1` instead of `UTF-8`.
- Why `on_bad_lines='skip'`? A bad line is a row in the CSV that breaks the expected structure, for example one that:
    - Has more fields than the header (rows with fewer fields are padded with `NaN` rather than skipped)
    - Contains unescaped delimiters (`;` in this file)
    - Is corrupted or improperly formatted
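
A minimal sketch of both settings on an in-memory file, so it runs without the Kaggle download. The three rows are invented, and passing `dtype={"ISBN": str}` is an assumption added here to keep leading zeros; the notebook itself does not set it.

```python
import io
import pandas as pd

raw = (
    "ISBN;Book-Title;Book-Author\n"
    "0001;Café Stories;A. Writer\n"
    "0002;Broken; Row;B. Writer;extra\n"   # 5 fields instead of 3 -> bad line
    "0003;Another Book;C. Writer\n"
).encode("latin-1")                         # simulate a Latin-1 encoded file

sample = pd.read_csv(
    io.BytesIO(raw),
    sep=";",
    encoding="latin-1",
    on_bad_lines="skip",     # drop rows with too many fields instead of raising
    dtype={"ISBN": str},     # assumption: keep ISBNs as text so leading zeros survive
)
print(sample)
```

Only the two well-formed rows remain, and the accented title is decoded correctly.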

In [2]:
books = pd.read_csv('BX-Books.csv', sep=";", on_bad_lines='skip', encoding='latin-1',low_memory=False)

In [3]:
books.head()

Unnamed: 0,ISBN,Book-Title,Book-Author,Year-Of-Publication,Publisher,Image-URL-S,Image-URL-M,Image-URL-L
0,195153448,Classical Mythology,Mark P. O. Morford,2002,Oxford University Press,http://images.amazon.com/images/P/0195153448.0...,http://images.amazon.com/images/P/0195153448.0...,http://images.amazon.com/images/P/0195153448.0...
1,2005018,Clara Callan,Richard Bruce Wright,2001,HarperFlamingo Canada,http://images.amazon.com/images/P/0002005018.0...,http://images.amazon.com/images/P/0002005018.0...,http://images.amazon.com/images/P/0002005018.0...
2,60973129,Decision in Normandy,Carlo D'Este,1991,HarperPerennial,http://images.amazon.com/images/P/0060973129.0...,http://images.amazon.com/images/P/0060973129.0...,http://images.amazon.com/images/P/0060973129.0...
3,374157065,Flu: The Story of the Great Influenza Pandemic...,Gina Bari Kolata,1999,Farrar Straus Giroux,http://images.amazon.com/images/P/0374157065.0...,http://images.amazon.com/images/P/0374157065.0...,http://images.amazon.com/images/P/0374157065.0...
4,393045218,The Mummies of Urumchi,E. J. W. Barber,1999,W. W. Norton &amp; Company,http://images.amazon.com/images/P/0393045218.0...,http://images.amazon.com/images/P/0393045218.0...,http://images.amazon.com/images/P/0393045218.0...


In [4]:
books.columns

Index(['ISBN', 'Book-Title', 'Book-Author', 'Year-Of-Publication', 'Publisher',
       'Image-URL-S', 'Image-URL-M', 'Image-URL-L'],
      dtype='object')

In [5]:
books.iloc[0]['Image-URL-L']

'http://images.amazon.com/images/P/0195153448.01.LZZZZZZZ.jpg'

In [6]:
books.shape

(271360, 8)

In [7]:
# remove the columns which are not required for our analysis
books.drop(['Image-URL-S', 'Image-URL-M'], axis=1, inplace=True)

In [8]:
books.head()

Unnamed: 0,ISBN,Book-Title,Book-Author,Year-Of-Publication,Publisher,Image-URL-L
0,195153448,Classical Mythology,Mark P. O. Morford,2002,Oxford University Press,http://images.amazon.com/images/P/0195153448.0...
1,2005018,Clara Callan,Richard Bruce Wright,2001,HarperFlamingo Canada,http://images.amazon.com/images/P/0002005018.0...
2,60973129,Decision in Normandy,Carlo D'Este,1991,HarperPerennial,http://images.amazon.com/images/P/0060973129.0...
3,374157065,Flu: The Story of the Great Influenza Pandemic...,Gina Bari Kolata,1999,Farrar Straus Giroux,http://images.amazon.com/images/P/0374157065.0...
4,393045218,The Mummies of Urumchi,E. J. W. Barber,1999,W. W. Norton &amp; Company,http://images.amazon.com/images/P/0393045218.0...


In [9]:
# Rename the columns to shorter, lowercase names
books.rename(columns={"Book-Title":'title',
                      'Book-Author':'author',
                     "Year-Of-Publication":'year',
                     "Publisher":"publisher",
                     "Image-URL-L":"image_url"},inplace=True)

In [10]:
books.head()

Unnamed: 0,ISBN,title,author,year,publisher,image_url
0,195153448,Classical Mythology,Mark P. O. Morford,2002,Oxford University Press,http://images.amazon.com/images/P/0195153448.0...
1,2005018,Clara Callan,Richard Bruce Wright,2001,HarperFlamingo Canada,http://images.amazon.com/images/P/0002005018.0...
2,60973129,Decision in Normandy,Carlo D'Este,1991,HarperPerennial,http://images.amazon.com/images/P/0060973129.0...
3,374157065,Flu: The Story of the Great Influenza Pandemic...,Gina Bari Kolata,1999,Farrar Straus Giroux,http://images.amazon.com/images/P/0374157065.0...
4,393045218,The Mummies of Urumchi,E. J. W. Barber,1999,W. W. Norton &amp; Company,http://images.amazon.com/images/P/0393045218.0...


## Load Users Dataset and create users DataFrame

In [11]:
# Now load the second dataframe

users = pd.read_csv('BX-Users.csv', sep=";", on_bad_lines='skip', encoding='latin-1',low_memory=False)

In [12]:
users.head()

Unnamed: 0,User-ID,Location,Age
0,1,"nyc, new york, usa",
1,2,"stockton, california, usa",18.0
2,3,"moscow, yukon territory, russia",
3,4,"porto, v.n.gaia, portugal",17.0
4,5,"farnborough, hants, united kingdom",


In [13]:
users.shape

(278858, 3)

In [14]:
# Rename the columns to shorter, lowercase names
users.rename(columns={"User-ID":'user_id',
                      'Location':'location',
                     "Age":'age'},inplace=True)

In [15]:
users.head()

Unnamed: 0,user_id,location,age
0,1,"nyc, new york, usa",
1,2,"stockton, california, usa",18.0
2,3,"moscow, yukon territory, russia",
3,4,"porto, v.n.gaia, portugal",17.0
4,5,"farnborough, hants, united kingdom",


**Note**
- The `age` column has missing values

## Load Ratings Dataset and create ratings DataFrame

In [16]:
# Now load the third dataframe

ratings = pd.read_csv('BX-Book-Ratings.csv', sep=";", on_bad_lines='skip', encoding='latin-1')

In [17]:
ratings.head()

Unnamed: 0,User-ID,ISBN,Book-Rating
0,276725,034545104X,0
1,276726,0155061224,5
2,276727,0446520802,0
3,276729,052165615X,3
4,276729,0521795028,6


In [18]:
ratings.shape

(1149780, 3)

In [19]:
# Rename the columns to shorter, lowercase names
ratings.rename(columns={"User-ID":'user_id',
                      'Book-Rating':'rating'},inplace=True)

In [20]:
ratings.head()

Unnamed: 0,user_id,ISBN,rating
0,276725,034545104X,0
1,276726,0155061224,5
2,276727,0446520802,0
3,276729,052165615X,3
4,276729,0521795028,6


## Shapes of all DataFrames

In [21]:
print(f"Books : {books.shape}, Users:{users.shape}, Ratings {ratings.shape}")

Books : (271360, 6), Users:(278858, 3), Ratings (1149780, 3)


# Users who have given the most ratings

## Example

In [22]:
import pandas as pd

ratings1 = pd.DataFrame({
    'user_id': [1,1,1,2,2,3,3,3,3,3],
    'book_id': [10,11,12,10,13,11,12,14,15,16],
    'rating':  [5,4,3,5,4,2,3,4,5,3]
})

print(ratings1)


   user_id  book_id  rating
0        1       10       5
1        1       11       4
2        1       12       3
3        2       10       5
4        2       13       4
5        3       11       2
6        3       12       3
7        3       14       4
8        3       15       5
9        3       16       3


In [26]:
# Number of ratings given by each user
ratings1['user_id'].value_counts()

user_id
3    5
1    3
2    2
Name: count, dtype: int64

In [28]:
# flag the users who have given more than 3 ratings
x1 = ratings1['user_id'].value_counts() > 3
x1

user_id
3     True
1    False
2    False
Name: count, dtype: bool

**Explanation:**
- User 3 → 5 ratings → True
- User 1 → 3 ratings → False
- User 2 → 2 ratings → False

In [30]:
x1[x1]

user_id
3    True
Name: count, dtype: bool

In [31]:
# get the user_ids of the users who have given more than 3 ratings
y1 = x1[x1].index
y1

Index([3], dtype='int64', name='user_id')

In [33]:
# keep only the ratings given by users who have more than 3 ratings
filtered1 = ratings1[ratings1['user_id'].isin(y1)]
filtered1

Unnamed: 0,user_id,book_id,rating
5,3,11,2
6,3,12,3
7,3,14,4
8,3,15,5
9,3,16,3


In [34]:
len(filtered1)

5
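
The same filter can also be written in one step with `groupby(...).transform('count')`, which attaches each user's rating count to every row. This is only an alternative sketch on the toy frame above; `ratings_per_user` and `filtered_alt` are names introduced here.

```python
import pandas as pd

ratings1 = pd.DataFrame({
    'user_id': [1, 1, 1, 2, 2, 3, 3, 3, 3, 3],
    'book_id': [10, 11, 12, 10, 13, 11, 12, 14, 15, 16],
    'rating':  [5, 4, 3, 5, 4, 2, 3, 4, 5, 3]
})

# For every row, look up how many ratings its user has given in total.
ratings_per_user = ratings1.groupby('user_id')['rating'].transform('count')

# Keep only rows whose user has more than 3 ratings.
filtered_alt = ratings1[ratings_per_user > 3]
print(filtered_alt)
```

It selects the same five rows for user 3 as the step-by-step version.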

## Actual Program

In [35]:
# Number of ratings given by each user
ratings['user_id'].value_counts()

user_id
11676     13602
198711     7550
153662     6109
98391      5891
35859      5850
          ...  
119573        1
276706        1
276697        1
276679        1
276676        1
Name: count, Length: 105283, dtype: int64

In [36]:
# Flag the users who have rated more than 200 books
x = ratings['user_id'].value_counts() > 200

In [37]:
x

user_id
11676      True
198711     True
153662     True
98391      True
35859      True
          ...  
119573    False
276706    False
276697    False
276679    False
276676    False
Name: count, Length: 105283, dtype: bool

In [38]:
x[x].shape

(899,)

**Note**
- 899 users have rated more than 200 books

In [39]:
# get the user_ids of the users who have rated more than 200 books
y= x[x].index

In [40]:
y

Index([ 11676, 198711, 153662,  98391,  35859, 212898, 278418,  76352, 110973,
       235105,
       ...
       116122,  44296,  28634,  59727,  73681, 274808, 188951,   9856, 155916,
       268622],
      dtype='int64', name='user_id', length=899)

In [41]:
# Now lets filter the ratings dataframe based on these users only
ratings = ratings[ratings['user_id'].isin(y)]

In [42]:
ratings.head()

Unnamed: 0,user_id,ISBN,rating
1456,277427,002542730X,10
1457,277427,0026217457,0
1458,277427,003008685X,8
1459,277427,0030615321,0
1460,277427,0060002050,0


# Merge Books + Ratings
- Attach the readable book title (and the other metadata) to each rating by joining on ISBN
- books ==> `ISBN, title, author, year, publisher, image_url`
- ratings ==> `user_id, ISBN, rating`

In [43]:
books.shape, ratings.shape

((271360, 6), (526356, 3))

In [None]:
# Now join ratings with books

ratings_with_books = ratings.merge(books, on='ISBN')

In [45]:
ratings_with_books.shape

(487671, 8)
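
The merged frame has fewer rows (487,671) than the filtered ratings (526,356). `merge` performs an inner join by default, so a likely explanation is that ratings whose ISBN does not appear in `books` are dropped; the notebook does not verify this. The sketch below, with made-up ISBNs and hypothetical `toy_*` frames, shows how `how='left'` with `indicator=True` would reveal such unmatched rows.

```python
import pandas as pd

toy_books = pd.DataFrame({'ISBN': ['A1', 'B2'], 'title': ['Book A', 'Book B']})
toy_ratings = pd.DataFrame({
    'user_id': [1, 1, 2],
    'ISBN': ['A1', 'Z9', 'B2'],   # 'Z9' has no entry in toy_books
    'rating': [8, 5, 10],
})

# Default inner join: the rating for 'Z9' silently disappears.
inner = toy_ratings.merge(toy_books, on='ISBN')

# Left join with an indicator column: unmatched ratings are kept and labelled.
checked = toy_ratings.merge(toy_books, on='ISBN', how='left', indicator=True)
unmatched = checked[checked['_merge'] == 'left_only']
print(len(inner), len(unmatched))
```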

In [47]:
ratings_with_books.head()

Unnamed: 0,user_id,ISBN,rating,title,author,year,publisher,image_url
0,277427,002542730X,10,Politically Correct Bedtime Stories: Modern Ta...,James Finn Garner,1994,John Wiley &amp; Sons Inc,http://images.amazon.com/images/P/002542730X.0...
1,277427,0026217457,0,Vegetarian Times Complete Cookbook,Lucy Moll,1995,John Wiley &amp; Sons,http://images.amazon.com/images/P/0026217457.0...
2,277427,003008685X,8,Pioneers,James Fenimore Cooper,1974,Thomson Learning,http://images.amazon.com/images/P/003008685X.0...
3,277427,0030615321,0,"Ask for May, Settle for June (A Doonesbury book)",G. B. Trudeau,1982,Henry Holt &amp; Co,http://images.amazon.com/images/P/0030615321.0...
4,277427,0060002050,0,On a Wicked Dawn (Cynster Novels),Stephanie Laurens,2002,Avon Books,http://images.amazon.com/images/P/0060002050.0...


# Data Cleaning

**Filter Active Users**
- Count ratings per user
- Keep users with >200 ratings

**Filter Popular Books**
- Keep books with at least 50 ratings.

In [48]:
# Count the number of ratings for each book
number_rating = ratings_with_books.groupby('title')['rating'].count().reset_index()

In [49]:
number_rating

Unnamed: 0,title,rating
0,A Light in the Storm: The Civil War Diary of ...,2
1,Always Have Popsicles,1
2,Apple Magic (The Collector's series),1
3,Beyond IBM: Leadership Marketing and Finance ...,1
4,Clifford Visita El Hospital (Clifford El Gran...,1
...,...,...
160264,Ã?Â?ber die Pflicht zum Ungehorsam gegen den S...,3
160265,Ã?Â?lpiraten.,1
160266,Ã?Â?rger mit Produkt X. Roman.,1
160267,Ã?Â?stlich der Berge.,1


In [50]:
# Rename the column name
number_rating.rename(columns={'rating':'num_of_rating'},inplace=True)

In [51]:
number_rating.head()

Unnamed: 0,title,num_of_rating
0,A Light in the Storm: The Civil War Diary of ...,2
1,Always Have Popsicles,1
2,Apple Magic (The Collector's series),1
3,Beyond IBM: Leadership Marketing and Finance ...,1
4,Clifford Visita El Hospital (Clifford El Gran...,1


In [52]:
# Now lets merge number of ratings with ratings_with_books dataframe
final_rating = ratings_with_books.merge(number_rating, on='title')

In [53]:
final_rating.head()

Unnamed: 0,user_id,ISBN,rating,title,author,year,publisher,image_url,num_of_rating
0,277427,002542730X,10,Politically Correct Bedtime Stories: Modern Ta...,James Finn Garner,1994,John Wiley &amp; Sons Inc,http://images.amazon.com/images/P/002542730X.0...,82
1,277427,0026217457,0,Vegetarian Times Complete Cookbook,Lucy Moll,1995,John Wiley &amp; Sons,http://images.amazon.com/images/P/0026217457.0...,7
2,277427,003008685X,8,Pioneers,James Fenimore Cooper,1974,Thomson Learning,http://images.amazon.com/images/P/003008685X.0...,1
3,277427,0030615321,0,"Ask for May, Settle for June (A Doonesbury book)",G. B. Trudeau,1982,Henry Holt &amp; Co,http://images.amazon.com/images/P/0030615321.0...,1
4,277427,0060002050,0,On a Wicked Dawn (Cynster Novels),Stephanie Laurens,2002,Avon Books,http://images.amazon.com/images/P/0060002050.0...,13


In [54]:
# Keep only the books that received at least 50 ratings

final_rating = final_rating[final_rating['num_of_rating'] >= 50]

In [55]:
final_rating

Unnamed: 0,user_id,ISBN,rating,title,author,year,publisher,image_url,num_of_rating
0,277427,002542730X,10,Politically Correct Bedtime Stories: Modern Ta...,James Finn Garner,1994,John Wiley &amp; Sons Inc,http://images.amazon.com/images/P/002542730X.0...,82
13,277427,0060930535,0,The Poisonwood Bible: A Novel,Barbara Kingsolver,1999,Perennial,http://images.amazon.com/images/P/0060930535.0...,133
15,277427,0060934417,0,Bel Canto: A Novel,Ann Patchett,2002,Perennial,http://images.amazon.com/images/P/0060934417.0...,108
18,277427,0061009059,9,One for the Money (Stephanie Plum Novels (Pape...,Janet Evanovich,1995,HarperTorch,http://images.amazon.com/images/P/0061009059.0...,108
24,277427,006440188X,0,The Secret Garden,Frances Hodgson Burnett,1998,HarperTrophy,http://images.amazon.com/images/P/006440188X.0...,79
...,...,...,...,...,...,...,...,...,...
487505,275970,1400031354,0,Tears of the Giraffe (No.1 Ladies Detective Ag...,Alexander McCall Smith,2002,Anchor,http://images.amazon.com/images/P/1400031354.0...,84
487506,275970,1400031362,0,Morality for Beautiful Girls (No.1 Ladies Dete...,Alexander McCall Smith,2002,Anchor,http://images.amazon.com/images/P/1400031362.0...,60
487579,275970,1573229725,0,Fingersmith,Sarah Waters,2002,Riverhead Books,http://images.amazon.com/images/P/1573229725.0...,59
487618,275970,1586210661,9,Me Talk Pretty One Day,David Sedaris,2001,Time Warner Audio Major,http://images.amazon.com/images/P/1586210661.0...,146


In [56]:
final_rating.shape

(61853, 9)

In [57]:
final_rating.columns

Index(['user_id', 'ISBN', 'rating', 'title', 'author', 'year', 'publisher',
       'image_url', 'num_of_rating'],
      dtype='object')

In [58]:
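# drop duplicate (user_id, title) pairs; reassign instead of using inplace=True,
# because final_rating is a filtered slice and in-place edits can trigger SettingWithCopyWarning
final_rating = final_rating.drop_duplicates(['user_id', 'title'])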

In [59]:
# check the missing values
final_rating.isnull().sum()

user_id          0
ISBN             0
rating           0
title            0
author           0
year             0
publisher        0
image_url        0
num_of_rating    0
dtype: int64

# Create Pivot table
- Convert long format → wide format.
- Structure:
    - Rows → `title`
    - Columns → `user_id`
    - Values → `rating`
- Each book becomes a vector in a high-dimensional space (one dimension per user)
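
A small sketch of the long-to-wide conversion on invented data; `long_df` and `wide` are names used only for this example.

```python
import pandas as pd

# Long format: one row per (user, book) rating.
long_df = pd.DataFrame({
    'user_id': [1, 1, 2, 3],
    'title':   ['Book A', 'Book B', 'Book A', 'Book C'],
    'rating':  [9, 7, 8, 10],
})

# Wide format: one row per book, one column per user.
wide = long_df.pivot_table(index='title', columns='user_id', values='rating')
print(wide)
```

Every (book, user) pair without a rating shows up as `NaN`, which is why the real pivot table is mostly empty.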

In [60]:
# Lets create a pivot table for user and books ==>collaborative filtering
book_pivot = final_rating.pivot_table(columns='user_id', index='title', values= 'rating')

In [61]:
book_pivot

user_id,254,2276,2766,2977,3363,3757,4017,4385,6242,6251,...,274004,274061,274301,274308,274808,275970,277427,277478,277639,278418
title,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
1984,9.0,,,,,,,,,,...,,,,,,0.0,,,,
1st to Die: A Novel,,,,,,,,,,,...,,,,,,,,,,
2nd Chance,,10.0,,,,,,,,,...,,,,0.0,,,,,0.0,
4 Blondes,,,,,,,,,,0.0,...,,,,,,,,,,
84 Charing Cross Road,,,,,,,,,,,...,,,,,,10.0,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
Year of Wonders,,,,7.0,,,,,7.0,,...,,,,,,0.0,,,,
You Belong To Me,,,,,,,,,,,...,,,,,,,,,,
Zen and the Art of Motorcycle Maintenance: An Inquiry into Values,,,,,0.0,,,,,0.0,...,,,,,,0.0,,,,
Zoya,,,,,,,,,,,...,,,,,,,,,,


In [62]:
book_pivot.shape

(742, 888)

**Note**
- Replace `NaN` values because we cannot pass `NaN` to the model

In [63]:
book_pivot.fillna(0, inplace=True)

In [64]:
book_pivot

user_id,254,2276,2766,2977,3363,3757,4017,4385,6242,6251,...,274004,274061,274301,274308,274808,275970,277427,277478,277639,278418
title,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
1984,9.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1st to Die: A Novel,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2nd Chance,0.0,10.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4 Blondes,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
84 Charing Cross Road,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,10.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
Year of Wonders,0.0,0.0,0.0,7.0,0.0,0.0,0.0,0.0,7.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
You Belong To Me,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
Zen and the Art of Motorcycle Maintenance: An Inquiry into Values,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
Zoya,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


**Note**
- Most of the entries are `0` ==> sparse data
- Stored densely, all those zeros consume a lot of memory
- A CSR matrix (Compressed Sparse Row matrix) is a memory-efficient way to store sparse data
- It is heavily used in machine learning, recommender systems, and NLP because real datasets (ratings, token counts) are often sparse
- SciPy provides it as `scipy.sparse.csr_matrix`

# CSR Matrix using scipy

In [65]:
from scipy.sparse import csr_matrix
import numpy as np


In [66]:
book_sparse = csr_matrix(book_pivot)

In [67]:
book_sparse

<Compressed Sparse Row sparse matrix of dtype 'float64'
	with 14961 stored elements and shape (742, 888)>

In [68]:
type(book_sparse)

scipy.sparse._csr.csr_matrix

In [69]:
book_sparse.data

array([ 9.,  8.,  9., ..., 10.,  9.,  8.], shape=(14961,))

In [70]:
book_sparse.indices

array([  0,  16,  37, ..., 745, 830, 880], shape=(14961,), dtype=int32)

In [71]:
book_sparse.indptr

array([    0,    27,    70,   104,   117,   142,   168,   183,   208,
         214,   225,   237,   243,   253,   264,   280,   314,   328,
         341,   351,   374,   428,   439,   470,   479,   491,   545,
         556,   585,   597,   620,   653,   668,   696,   702,   725,
         733,   745,   762,   781,   789,   813,   825,   853,   875,
         883,   907,   929,   937,   957,   987,   998,  1052,  1071,
        1083,  1109,  1135,  1158,  1188,  1202,  1229,  1248,  1269,
        1282,  1318,  1341,  1357,  1393,  1414,  1419,  1433,  1445,
        1481,  1502,  1519,  1534,  1557,  1586,  1601,  1610,  1626,
        1637,  1664,  1684,  1703,  1724,  1737,  1749,  1770,  1789,
        1803,  1823,  1837,  1855,  1868,  1882,  1966,  1988,  2003,
        2028,  2032,  2046,  2058,  2081,  2107,  2121,  2132,  2155,
        2167,  2189,  2206,  2229,  2252,  2278,  2287,  2306,  2325,
        2334,  2353,  2363,  2376,  2400,  2419,  2425,  2433,  2475,
        2499,  2524,
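
The three arrays printed above are easier to read on a tiny matrix. In the sketch below (invented values), `data` holds the non-zero entries row by row, `indices` holds their column positions, and `indptr` marks where each row starts and ends inside `data`. For `book_sparse`, 14,961 stored values out of 742 × 888 cells means only about 2.3% of the matrix is non-zero.

```python
import numpy as np
from scipy.sparse import csr_matrix

dense = np.array([
    [9.0, 0.0, 0.0, 8.0],
    [0.0, 0.0, 0.0, 0.0],    # a book nobody rated: empty row
    [0.0, 10.0, 0.0, 0.0],
])
sparse = csr_matrix(dense)

print(sparse.data)     # non-zero values, row by row
print(sparse.indices)  # column index of each stored value
print(sparse.indptr)   # row i owns data[indptr[i]:indptr[i + 1]]

# Fraction of cells that actually hold a value.
density = sparse.nnz / (dense.shape[0] * dense.shape[1])
print(density)
```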

# Train NearestNeighbors Model

## Import NearestNeighbors

- The purpose of NearestNeighbors is to find the closest data points to a given point based on a distance metric.
- It is a core building block for similarity search, not a predictive model.
- NearestNeighbors finds the k most similar points to a query point using distance calculations.
- `algorithm='brute'` tells scikit-learn to compute distances between the query and every data point directly.
- How brute works internally, for each query point:
    - Take the query vector
    - Compute distance to all points in the dataset
    - Sort distances
    - Return top k nearest neighbors
- The distance computation is O(n × d) per query:
    - n = number of samples
    - d = number of features
- Why use brute? (Very Important)
    - Works with any distance metric scikit-learn supports, including custom callables
    - Tree-based algorithms (`kd_tree`, `ball_tree`) support only a subset of metrics (cosine is not among them) and do not accept sparse input.
    
| Metric         | Supported by Brute |
| -------------- | ------------------ |
| Euclidean      | ✅                  |
| Manhattan      | ✅                  |
| Cosine         | ✅                  |
| Minkowski      | ✅                  |
| Custom metrics | ✅                  |

- https://scikit-learn.org/stable/modules/neighbors.html
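
The brute-force steps listed above can be sketched by hand and compared with scikit-learn. The data is random and purely illustrative, and the variable names are introduced for this example only.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(0)
X = rng.random((20, 5))      # 20 made-up items with 5 features each
query = X[3]                 # use item 3 as the query
k = 4

# Brute force by hand: Euclidean distance to every row, then keep the k smallest.
dists = np.sqrt(((X - query) ** 2).sum(axis=1))
manual_idx = np.argsort(dists)[:k]

# The same search through scikit-learn (default metric is Euclidean).
nn = NearestNeighbors(algorithm='brute').fit(X)
sk_dist, sk_idx = nn.kneighbors(query.reshape(1, -1), n_neighbors=k)

print(manual_idx, sk_idx[0])
```

Both approaches return the query item first (distance 0), followed by the same neighbours.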

In [72]:
from sklearn.neighbors import NearestNeighbors
model = NearestNeighbors(algorithm= 'brute')

**Note**
- This imports a similarity search tool, not a predictive model
- Purpose:
    - Find similar items (books)
    - Based on distance between vectors
- Creating the model
    - `model = NearestNeighbors(algorithm='brute')`
- What this means
    - You are creating a nearest-neighbor search engine
    - No `metric` is passed, so the default applies: Minkowski with `p=2`, i.e. Euclidean distance
    - `algorithm='brute'` means:
        - Compare the query book with every other book
        - No tree, no index
        - Well suited to high-dimensional sparse data
- Why brute is correct here
    - The data (`book_sparse`) is:
        - A book × user rating matrix
        - High-dimensional (888 user columns)
        - Sparse (CSR matrix)
- Tree-based methods do not work well in this scenario; scikit-learn falls back to brute force for sparse input anyway.


In [73]:
model.fit(book_sparse)

0,1,2
,n_neighbors,5
,radius,1.0
,algorithm,'brute'
,leaf_size,30
,metric,'minkowski'
,p,2
,metric_params,
,n_jobs,


**Note**
- model.fit(book_sparse)
    - This is not training in the ML sense.
- What fit() does here:
    - Stores the matrix (book_sparse) internally
    - Prepares it for fast distance computation
- No learning.
- No weights.
- No labels.
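
The parameter listing printed by `fit` shows `metric='minkowski'` with `p=2`, which is Euclidean distance. To rank by cosine distance instead, one option is to pass `metric='cosine'`. The sketch below uses three invented books to show that the two metrics can disagree; it is an illustration, not a change to the notebook's model.

```python
import numpy as np
from scipy.sparse import csr_matrix
from sklearn.neighbors import NearestNeighbors

toy = csr_matrix(np.array([
    [10.0, 10.0, 0.0],   # book 0 (query)
    [4.0, 3.0, 0.0],     # book 1: similar rating pattern, lower values
    [9.0, 9.0, 5.0],     # book 2: close in value, plus an extra rater
]))

euclid = NearestNeighbors(algorithm='brute').fit(toy)                   # default metric
cosine = NearestNeighbors(algorithm='brute', metric='cosine').fit(toy)  # direction only

_, eu_idx = euclid.kneighbors(toy[0], n_neighbors=2)
_, cos_idx = cosine.kneighbors(toy[0], n_neighbors=2)
print(eu_idx[0], cos_idx[0])
```

Euclidean distance picks book 2 as the closest neighbour because its values are numerically close, while cosine distance picks book 1 because it ignores rating magnitude and compares only the pattern across users.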

# Test the model

In [74]:
book_pivot.iloc[237,:]

user_id
254       9.0
2276      0.0
2766      0.0
2977      0.0
3363      0.0
         ... 
275970    9.0
277427    0.0
277478    0.0
277639    0.0
278418    0.0
Name: Harry Potter and the Chamber of Secrets (Book 2), Length: 888, dtype: float64

In [75]:
# pass the rating vector of the book for which we want to find similar books
distance, suggestion = model.kneighbors(book_pivot.iloc[237,:].values.reshape(1,-1), n_neighbors=6 )

**Note**
- book_pivot.iloc[237, :]
    - book_pivot → pivot table (books × users)
    - Row 237 → one specific book
    - Columns → users
- So this row is:
    - `How this book is represented numerically`
- `.values`
    - Converts the pandas Series → NumPy array
    - Needed because a Series has no `.reshape()` method
- Reshaping the input ==> `.reshape(1, -1)`
    - Scikit-learn expects input as `(n_samples, n_features)`
    - But the row currently has shape `(n_features,)`
    - So it is reshaped to ==> `(1, n_features)`
- The query is a single book ==> without the reshape, scikit-learn raises an error about expecting a 2D array.
- For the query book:
    - Compute distance to every book in book_sparse
    - Sort distances
    - Pick the 6 closest books
- Understand `distance`
    - Distance between the query book and each neighbor
    - Shape: (1, 6)
    - Lower distance = more similar
- Understand `suggestion`
    - Indices of similar books
    - Shape: (1, 6)
    - These are row indices in book_pivot.
- Why `n_neighbors=6`
    - 1st neighbor → the same book (distance 0)
    - Remaining 5 → recommendations
    - This is a standard recommender pattern.
- How This Becomes a Book Recommendation System
    - Conceptually:
        - `Users who interacted with book #237 also interacted with these books`
    - You then:
        - Convert indices → book titles
        - Exclude the first one (same book)
        - Display the rest as recommendations
- Mental Model (Very Important)
    - Think of this as:
        - `Take one book → compare it with all books → return the most similar ones.`
    - No learning.
    - No prediction.
    - Pure similarity.

- Common Follow-Up Interview Questions
```text
Q: Is this supervised or unsupervised?
👉 Unsupervised

Q: Why reshape(1, -1)?
👉 Scikit-learn expects a 2D array for queries

Q: Why is brute used?
👉 High-dimensional sparse data; brute force also works with metrics such as cosine that tree-based indexes do not support

Q: Is fit() actually training?
👉 No, it only stores the data
```

- Final Takeaway
    - NearestNeighbors = similarity engine
    - fit() = data storage, not learning
    - kneighbors() = similarity search
    - Output = distances + indices
    - Core of recommender systems and RAG pipelines


In [76]:
distance

array([[ 0.        , 67.75691847, 68.05145112, 72.277244  , 75.81556568,
        76.30203143]])

In [77]:
suggestion

array([[237, 238, 240, 241, 184, 536]])

In [78]:
# to get the names of the books
for i in range(len(suggestion)):
    print(book_pivot.index[suggestion[i]])

Index(['Harry Potter and the Chamber of Secrets (Book 2)',
       'Harry Potter and the Goblet of Fire (Book 4)',
       'Harry Potter and the Prisoner of Azkaban (Book 3)',
       'Harry Potter and the Sorcerer's Stone (Book 1)', 'Exclusive',
       'The Cradle Will Fall'],
      dtype='object', name='title')


In [79]:
# keep the book titles
book_names = book_pivot.index

In [80]:
book_names[4]

'84 Charing Cross Road'

In [81]:
book_names[238]

'Harry Potter and the Goblet of Fire (Book 4)'

In [82]:
final_rating.head()

Unnamed: 0,user_id,ISBN,rating,title,author,year,publisher,image_url,num_of_rating
0,277427,002542730X,10,Politically Correct Bedtime Stories: Modern Ta...,James Finn Garner,1994,John Wiley &amp; Sons Inc,http://images.amazon.com/images/P/002542730X.0...,82
13,277427,0060930535,0,The Poisonwood Bible: A Novel,Barbara Kingsolver,1999,Perennial,http://images.amazon.com/images/P/0060930535.0...,133
15,277427,0060934417,0,Bel Canto: A Novel,Ann Patchett,2002,Perennial,http://images.amazon.com/images/P/0060934417.0...,108
18,277427,0061009059,9,One for the Money (Stephanie Plum Novels (Pape...,Janet Evanovich,1995,HarperTorch,http://images.amazon.com/images/P/0061009059.0...,108
24,277427,006440188X,0,The Secret Garden,Frances Hodgson Burnett,1998,HarperTrophy,http://images.amazon.com/images/P/006440188X.0...,79


In [83]:
# gives the index value
np.where(book_pivot.index == 'Harry Potter and the Goblet of Fire (Book 4)')

(array([238]),)

In [84]:
np.where(book_pivot.index == 'Harry Potter and the Goblet of Fire (Book 4)')[0]

array([238])

In [85]:
np.where(book_pivot.index == 'Harry Potter and the Goblet of Fire (Book 4)')[0][0]

np.int64(238)

In [87]:
# final_rating['title'].value_counts()
ids = np.where(final_rating['title'] == "Harry Potter and the Goblet of Fire (Book 4)")[0][0]

In [88]:
ids

np.int64(321)

In [89]:
# extract image url
final_rating.iloc[ids]['image_url']

'http://images.amazon.com/images/P/0439139597.01.LZZZZZZZ.jpg'

In [90]:
# to get the names of the books in a list
book_name = []
for book_id in suggestion:
    book_name.append(book_pivot.index[book_id])

In [91]:
book_name

[Index(['Harry Potter and the Chamber of Secrets (Book 2)',
        'Harry Potter and the Goblet of Fire (Book 4)',
        'Harry Potter and the Prisoner of Azkaban (Book 3)',
        'Harry Potter and the Sorcerer's Stone (Book 1)', 'Exclusive',
        'The Cradle Will Fall'],
       dtype='object', name='title')]

In [92]:
# extract the ids of the books from final_rating dataframe
ids_index = []
for name in book_name[0]: 
    ids = np.where(final_rating['title'] == name)[0][0]
    ids_index.append(ids)

In [93]:
ids_index

[np.int64(44),
 np.int64(321),
 np.int64(45),
 np.int64(46),
 np.int64(786),
 np.int64(2297)]

In [94]:
# extract image urls
for idx in ids_index:
    url = final_rating.iloc[idx]['image_url']
    print(url)

http://images.amazon.com/images/P/0439064872.01.LZZZZZZZ.jpg
http://images.amazon.com/images/P/0439139597.01.LZZZZZZZ.jpg
http://images.amazon.com/images/P/0439136369.01.LZZZZZZZ.jpg
http://images.amazon.com/images/P/043936213X.01.LZZZZZZZ.jpg
http://images.amazon.com/images/P/0446604232.01.LZZZZZZZ.jpg
http://images.amazon.com/images/P/0440115450.01.LZZZZZZZ.jpg
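
The positional lookups above can also be replaced by a title-indexed Series, which avoids tracking row positions. This is an alternative sketch on invented data; `toy_final` and `url_by_title` are hypothetical names and the URLs are placeholders.

```python
import pandas as pd

toy_final = pd.DataFrame({
    'user_id':   [1, 2, 3],
    'title':     ['Book A', 'Book A', 'Book B'],
    'image_url': ['http://example.com/a.jpg', 'http://example.com/a.jpg',
                  'http://example.com/b.jpg'],
})

# One row per title, indexed by title, so a URL is a simple label lookup.
url_by_title = toy_final.drop_duplicates('title').set_index('title')['image_url']

suggested = ['Book B', 'Book A']
urls = url_by_title.loc[suggested].tolist()
print(urls)
```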


# Save into binary object

In [95]:
book_names

Index(['1984', '1st to Die: A Novel', '2nd Chance', '4 Blondes',
       '84 Charing Cross Road', 'A Bend in the Road', 'A Case of Need',
       'A Child Called \It\": One Child's Courage to Survive"',
       'A Civil Action', 'A Cry In The Night',
       ...
       'Winter Solstice', 'Wish You Well', 'Without Remorse',
       'Wizard and Glass (The Dark Tower, Book 4)', 'Wuthering Heights',
       'Year of Wonders', 'You Belong To Me',
       'Zen and the Art of Motorcycle Maintenance: An Inquiry into Values',
       'Zoya', '\O\" Is for Outlaw"'],
      dtype='object', name='title', length=742)

In [97]:
import os
import pickle

os.makedirs('artifacts', exist_ok=True)  # make sure the output folder exists
for name, obj in [('model', model), ('book_names', book_names),
                  ('final_rating', final_rating), ('book_pivot', book_pivot)]:
    with open(f'artifacts/{name}.pkl', 'wb') as f:  # the file is closed automatically
        pickle.dump(obj, f)
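
Loading the objects back works the same way in reverse. The sketch below round-trips a stand-in for `book_names` through a temporary directory so it can run on its own; in the notebook, the paths would be the files under `artifacts/`.

```python
import os
import pickle
import tempfile

import pandas as pd

book_titles = pd.Index(['Book A', 'Book B'], name='title')  # stand-in for book_names

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'book_names.pkl')
    with open(path, 'wb') as f:      # write the object to disk
        pickle.dump(book_titles, f)
    with open(path, 'rb') as f:      # read it back
        restored = pickle.load(f)

print(restored.equals(book_titles))
```

Pickle files should only be loaded from sources you trust, since unpickling can execute arbitrary code.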

# Testing model

In [99]:
def recommend_book(book_name):
    # Locate the row of the queried title in the pivot table
    book_id = np.where(book_pivot.index == book_name)[0][0]
    # Ask for 6 neighbours: the query itself plus 5 recommendations
    distance, suggestion = model.kneighbors(
        book_pivot.iloc[book_id, :].values.reshape(1, -1), n_neighbors=6)

    print(f"You searched '{book_name}'\n")
    print("The suggestion books are: \n")
    for title in book_pivot.index[suggestion[0]]:
        if title != book_name:  # skip the query book itself
            print(title)

In [100]:
book_name = "2nd Chance"
recommend_book(book_name)

You searched '2nd Chance'

The suggestion books are: 

The Next Accident
The Ghost
Exclusive
Last Man Standing
Unspeakable


In [101]:
book_name1 = "Harry Potter and the Chamber of Secrets (Book 2)"
recommend_book(book_name1)

You searched 'Harry Potter and the Chamber of Secrets (Book 2)'

The suggestion books are: 

Harry Potter and the Goblet of Fire (Book 4)
Harry Potter and the Prisoner of Azkaban (Book 3)
Harry Potter and the Sorcerer's Stone (Book 1)
Exclusive
The Cradle Will Fall
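
For use outside the notebook (for example in a small app), it may be more convenient to return the titles as a list and to handle unknown titles instead of raising an error. The sketch below is one possible variant on a toy matrix; `recommend_titles`, `toy_pivot`, and `toy_model` are hypothetical names and the ratings are invented.

```python
import numpy as np
import pandas as pd
from scipy.sparse import csr_matrix
from sklearn.neighbors import NearestNeighbors

# Toy book × user rating matrix (invented values).
toy_pivot = pd.DataFrame(
    [[9, 8, 0, 0], [8, 9, 0, 0], [0, 0, 7, 9], [0, 2, 8, 9]],
    index=pd.Index(['Book A', 'Book B', 'Book C', 'Book D'], name='title'),
    dtype=float,
)
toy_model = NearestNeighbors(algorithm='brute').fit(csr_matrix(toy_pivot.values))

def recommend_titles(title, pivot, nn_model, k=2):
    """Return up to k titles closest to `title`, or [] if the title is unknown."""
    if title not in pivot.index:
        return []
    query = pivot.loc[[title]].values               # shape (1, n_users)
    _, idx = nn_model.kneighbors(query, n_neighbors=k + 1)
    # Drop the query book itself and keep at most k neighbours.
    return [t for t in pivot.index[idx[0]] if t != title][:k]

print(recommend_titles('Book A', toy_pivot, toy_model))
print(recommend_titles('Unknown', toy_pivot, toy_model))
```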


```text
books   ==> books dataframe
users   ==> users dataframe
ratings ==> ratings dataframe (only active users who have rated more than 200 books)

merge books and ratings on `ISBN`
==> ratings_with_books

count the number of ratings for each book
==> number_rating

merge number_rating with ratings_with_books on `title`
==> final_rating

final_rating ==> keep only the books that have at least 50 ratings

from final_rating drop duplicate (user_id, title) pairs

create a pivot table from final_rating ==> collaborative filtering
book_pivot ==> sparse data ==> mostly zeros

store it efficiently with csr_matrix
book_sparse = csr_matrix(book_pivot)

fit NearestNeighbors on book_sparse and query it with kneighbors
```
