# **Book Recommendation: k-Nearest Neighbors (Item Based)**

# KNN Algorithm Overview



1.   Basic Concept:

*   KNN classifies a data point based on the majority label of its k nearest neighbors in the feature space.
*   In regression, the output is the average (or weighted average) of the labels of k nearest neighbors.


2.   Key Features:

*   Instance-based: It does not build an explicit model but uses the dataset as its model.
*   Distance Metric: Determines the similarity between data points using metrics like Euclidean, Manhattan, or Minkowski distance.


3.   Parameters:

*   k: Number of neighbors to consider.
*   Distance metric: Defines how "closeness" between two data points is calculated.
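
The concepts above can be sketched in a few lines of NumPy. This is only an illustration on made-up toy data, not the method used later in this notebook; the helper names `minkowski_distance` and `knn_predict` are hypothetical.

```python
from collections import Counter

import numpy as np


def minkowski_distance(a, b, p=2):
    """Minkowski distance; p=1 is Manhattan, p=2 is Euclidean."""
    diff = np.abs(np.asarray(a, dtype=float) - np.asarray(b, dtype=float))
    return float(np.sum(diff ** p) ** (1.0 / p))


def knn_predict(X_train, y_train, query, k=3, p=2, task="classify"):
    """Predict for one query point from its k nearest training points."""
    dists = [minkowski_distance(x, query, p) for x in X_train]
    nearest = np.argsort(dists)[:k]              # indices of the k closest points
    labels = [y_train[i] for i in nearest]
    if task == "classify":
        return Counter(labels).most_common(1)[0][0]  # majority label
    return float(np.mean(labels))                    # regression: plain average


# Toy data (made up for illustration): two clusters in 2-D
X = [[0, 0], [0, 1], [1, 0], [5, 5], [5, 6], [6, 5]]
y_class = ["a", "a", "a", "b", "b", "b"]
y_value = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]

pred_label = knn_predict(X, y_class, [0.5, 0.5], k=3)
pred_value = knn_predict(X, y_value, [5.5, 5.5], k=3, task="regress")
print(pred_label, pred_value)
```

Changing `k` or `p` in the calls shows how the two parameters listed above affect the prediction.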


In [None]:
!pip install datasets

Collecting datasets
  Downloading datasets-3.1.0-py3-none-any.whl.metadata (20 kB)
Collecting dill<0.3.9,>=0.3.0 (from datasets)
  Downloading dill-0.3.8-py3-none-any.whl.metadata (10 kB)
Collecting xxhash (from datasets)
  Downloading xxhash-3.5.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (12 kB)
Collecting multiprocess<0.70.17 (from datasets)
  Downloading multiprocess-0.70.16-py310-none-any.whl.metadata (7.2 kB)
Collecting fsspec<=2024.9.0,>=2023.1.0 (from fsspec[http]<=2024.9.0,>=2023.1.0->datasets)
  Downloading fsspec-2024.9.0-py3-none-any.whl.metadata (11 kB)
Downloading datasets-3.1.0-py3-none-any.whl (480 kB)
Downloading dill-0.3.8-py3-none-any.whl (116 kB)
Downloading fsspec-2024.9.0-py3-none-any.whl

In [None]:
!pip install pandas requests onedrivedownloader

Collecting onedrivedownloader
  Downloading onedrivedownloader-1.1.3-py3-none-any.whl.metadata (2.0 kB)
Downloading onedrivedownloader-1.1.3-py3-none-any.whl (5.1 kB)
Installing collected packages: onedrivedownloader
Successfully installed onedrivedownloader-1.1.3


# Import and Load Data

In [None]:
# Import the download helper function (not the OneDriveDownloader class)
from onedrivedownloader import download

# Replace with your direct OneDrive link for the CSV file
onedrive_link = "https://indianinstituteofscience-my.sharepoint.com/:x:/g/personal/rishavg_iisc_ac_in/ET-n21kcA3tIh-n2BjHvLjMBWI-sTFpE0O6zdUDLokuajQ?e=JZ4NjZ"

# Download the file; it is saved under the given filename,
# which may also be a full local path
download(onedrive_link, filename="filtered_user_rating.csv")

100%|██████████| 25.2M/25.2M [00:00<00:00, 28.2MiB/s]


'filtered_user_rating.csv'

In [None]:
import pandas as pd
all_users_rating_df = pd.read_csv("filtered_user_rating.csv", sep=',', on_bad_lines='skip')

In [None]:
all_users_rating_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 203540 entries, 0 to 203539
Data columns (total 9 columns):
 #   Column           Non-Null Count   Dtype  
---  ------           --------------   -----  
 0   customer_id      203540 non-null  int64  
 1   review_id        203540 non-null  object 
 2   product_id       203540 non-null  object 
 3   product_title    203540 non-null  object 
 4   star_rating      203540 non-null  float64
 5   helpful_votes    203540 non-null  float64
 6   total_votes      203540 non-null  float64
 7   review_headline  203540 non-null  object 
 8   review_date      203540 non-null  object 
dtypes: float64(3), int64(1), object(5)
memory usage: 14.0+ MB


In [None]:
all_users_rating_df.shape

(203540, 9)

## Data Cleaning and Preprocessing


In [None]:
# Drop Rows with Any Missing Values
all_users_rating_df.dropna(inplace=True)

all_users_rating_df.shape

(203540, 9)

In [None]:
# Number of distinct customers and distinct products
distinct_customer_ids = all_users_rating_df['customer_id'].unique()
distinct_product_ids = all_users_rating_df['product_id'].unique()

print(f"Number of distinct customers: {len(distinct_customer_ids)}")
print(f"Number of distinct products: {len(distinct_product_ids)}")


Number of distinct customers: 81797
Number of distinct products: 89507


In [None]:
# Keep only rows with star_rating >= 3 to focus on positive interactions
filter_user_rating_df = all_users_rating_df[all_users_rating_df['star_rating'] >= 3]
filter_user_rating_df.shape

(189417, 9)

## Prepare dataset for KNN

In [None]:
# Map categorical IDs to numerical indices (sparse matrix requires numerical indices)
customer_mapping = {id_: idx for idx, id_ in enumerate(filter_user_rating_df['customer_id'].unique())}
product_mapping = {id_: idx for idx, id_ in enumerate(filter_user_rating_df['product_id'].unique())}

# assign() returns a new DataFrame, which avoids the pandas SettingWithCopyWarning
# raised when columns are added to a filtered slice
filter_user_rating_df = filter_user_rating_df.assign(
    customer_idx=filter_user_rating_df['customer_id'].map(customer_mapping),
    product_idx=filter_user_rating_df['product_id'].map(product_mapping),
)

filter_user_rating_df.head()


| | customer_id | review_id | product_id | product_title | star_rating | helpful_votes | total_votes | review_headline | review_date | customer_idx | product_idx |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 51964897 | R1TNWRKIVHVYOV | 262181533 | The Psychology of Proof: Deductive Reasoning i... | 4.0 | 0.0 | 2.0 | Execellent cursor examination | 2005-10-14 | 0 | 0 |
| 1 | 24853483 | RCYSGJQVQLD3R | 373513194 | Kiss of the Blue Dragon (Silhouette Bombshell) | 4.0 | 0.0 | 0.0 | A different sort of futuristic & very interest... | 2005-10-14 | 1 | 1 |
| 2 | 50122160 | R36ACJURUNHD38 | 1410202984 | Dahcotah: Life and Legends of the Sioux | 5.0 | 0.0 | 0.0 | A groundbreaking look into Sioux (Dakota) cust... | 2005-10-14 | 2 | 2 |
| 3 | 50122160 | R3QP8VTFWA343T | 816524718 | Navajo Nation Peacemaking: Living Traditional ... | 5.0 | 0.0 | 1.0 | An anthology of essays offering insights from ... | 2005-10-14 | 2 | 3 |
| 4 | 47412112 | R229JMAAVX4SMK | 1591160529 | Inuyasha, Volume 5 | 5.0 | 0.0 | 0.0 | TONIGHT I'M A BOY | 2005-10-14 | 3 | 4 |


In [None]:
from scipy.sparse import csr_matrix
# Sparse matrix: rows = products, columns = customers, values = ratings.
# Note: csr_matrix sums duplicate (product, customer) entries, so repeated reviews add up.
sparse_matrix = csr_matrix((filter_user_rating_df['star_rating'], (filter_user_rating_df['product_idx'], filter_user_rating_df['customer_idx'])))
sparse_matrix

<83384x77372 sparse matrix of type '<class 'numpy.float64'>'
	with 180171 stored elements in Compressed Sparse Row format>
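
The matrix stores 180171 elements although the filtered DataFrame has 189417 rows, which suggests that some (product, customer) pairs occur more than once. When `csr_matrix` is built from coordinate triples, duplicate entries are summed, so a repeated review can produce a value above 5. The sketch below shows the effect on made-up toy data, together with one possible remedy (averaging duplicates first); the names `ratings`, `summed`, `deduped`, and `averaged` are illustrative only, and averaging is an assumption rather than the notebook's own choice.

```python
import pandas as pd
from scipy.sparse import csr_matrix

# Toy ratings (made up): customer 0 reviewed product 0 twice
ratings = pd.DataFrame({
    "product_idx":  [0, 0, 1],
    "customer_idx": [0, 0, 1],
    "star_rating":  [4.0, 5.0, 3.0],
})

# Built directly, the duplicate pair is summed: 4.0 + 5.0 = 9.0
summed = csr_matrix((ratings["star_rating"],
                     (ratings["product_idx"], ratings["customer_idx"])))

# One possible fix: average duplicate pairs before building the matrix
deduped = (ratings
           .groupby(["product_idx", "customer_idx"], as_index=False)["star_rating"]
           .mean())
averaged = csr_matrix((deduped["star_rating"],
                       (deduped["product_idx"], deduped["customer_idx"])))

print(summed[0, 0], averaged[0, 0])
```

Keeping only the most recent review per pair would be another reasonable option, depending on what the ratings are meant to capture.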

In [None]:
#Split into training and testing datasets
from sklearn.model_selection import train_test_split

train_data, test_data = train_test_split(filter_user_rating_df, test_size=0.3, random_state=42)
print(f"Train data shape: {train_data.shape}")
print(f"Test data shape: {test_data.shape}")

Train data shape: (132591, 11)
Test data shape: (56826, 11)


In [None]:
# Build a sparse matrix for the train dataset
train_sparse_matrix = csr_matrix((train_data['star_rating'],
                                  (train_data['product_idx'], train_data['customer_idx'])))
train_sparse_matrix

<83382x77371 sparse matrix of type '<class 'numpy.float64'>'
	with 127514 stored elements in Compressed Sparse Row format>
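
The training matrix is 83382x77371, slightly smaller than the full 83384x77372 matrix, because `csr_matrix` infers its shape from the largest index it sees. A minimal sketch of passing `shape` explicitly, so that every mapped product and customer gets a row or column, is shown below on made-up toy data (the toy mappings and DataFrame stand in for the notebook's variables).

```python
import pandas as pd
from scipy.sparse import csr_matrix

# Toy mappings and training rows (made up); product "C" never appears in training
product_mapping = {"A": 0, "B": 1, "C": 2}
customer_mapping = {101: 0, 102: 1}
train_data = pd.DataFrame({
    "product_idx":  [0, 1],
    "customer_idx": [0, 1],
    "star_rating":  [5.0, 4.0],
})

entries = (train_data["star_rating"],
           (train_data["product_idx"], train_data["customer_idx"]))

# Shape inferred from the largest index present in the training rows
inferred = csr_matrix(entries)

# Shape fixed by the mappings, so indices stay valid for every known product
explicit = csr_matrix(entries,
                      shape=(len(product_mapping), len(customer_mapping)))

print(inferred.shape, explicit.shape)
```

With an explicit shape, products that have no training ratings still get a row, but that row is all zeros, so they have no meaningful neighbors and may need separate handling.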

In [None]:
# Fit KNN Model
from sklearn.neighbors import NearestNeighbors

knn = NearestNeighbors(metric='cosine', algorithm='brute', n_neighbors=5, n_jobs=-1)
knn.fit(train_sparse_matrix)
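
To make the `metric='cosine'` setting concrete, here is a small sketch on a made-up item-by-customer matrix. Cosine distance is 1 minus cosine similarity, so items rated identically by the same customers are at distance 0 and items with no raters in common are at distance 1. The names `items` and `toy_knn` are illustrative only.

```python
import numpy as np
from scipy.sparse import csr_matrix
from sklearn.neighbors import NearestNeighbors

# Toy item-by-customer ratings (made up): rows are items, columns are customers
items = csr_matrix(np.array([
    [5.0, 4.0, 0.0],   # item 0
    [5.0, 4.0, 0.0],   # item 1: same raters and ratings as item 0
    [0.0, 0.0, 3.0],   # item 2: no raters in common with items 0 and 1
]))

toy_knn = NearestNeighbors(metric="cosine", algorithm="brute", n_neighbors=3)
toy_knn.fit(items)

# Query with item 0: items 0 and 1 tie at distance 0, item 2 is at distance 1
distances, indices = toy_knn.kneighbors(items[0])
print(indices.flatten(), distances.flatten().round(4))
```

Because items 0 and 1 tie at distance 0, the query item is not guaranteed to be returned first, which matters when excluding it from the results.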

In [None]:
# Reverse mapping for product indices
reverse_product_mapping = {idx: id_ for id_, idx in product_mapping.items()}

In [None]:
# Recommend books similar to the given product_id using KNN.
def recommend_books(product_id, n_recommendations=5):
    """
    Recommends books similar to the given product_id using KNN.

    Args:
        product_id: The ID of the product to find recommendations for.
        n_recommendations: The number of recommendations to generate.

    Returns:
        A tuple containing two items:
            - recommended_products: A list of recommended product IDs.
            - distances: An array of distances corresponding to the recommended products.

    Raises:
        KeyError: If product_id has no row in the training matrix.
    """

    # Get the index for the given product_id
    product_idx = product_mapping.get(product_id)
    if product_idx is None or product_idx >= train_sparse_matrix.shape[0]:
        # Raise rather than return a string, so callers can always unpack two values
        raise KeyError(f"Product ID {product_id} not found in the training data.")

    # Find K nearest neighbors (one extra, since the query product is usually among them)
    distances, indices = knn.kneighbors(train_sparse_matrix[product_idx], n_neighbors=n_recommendations + 1)
    distances, indices = distances.flatten(), indices.flatten()

    # Drop the query product by index; with tied distances it is not guaranteed to come first
    keep = indices != product_idx
    recommended_products = [reverse_product_mapping[idx] for idx in indices[keep][:n_recommendations]]
    distances = distances[keep][:n_recommendations]

    return recommended_products, distances  # Return both recommendations and distances

## Test recommend_books()

In [None]:
import random

# Pick a random position among the mapped products
random_index = random.randint(0, len(product_mapping) - 1)

# Look up the product_id at that position
test_product_id = list(product_mapping.keys())[random_index]

# Get recommendations and distances
recommendations, distances = recommend_books(test_product_id, n_recommendations=5)  # Get both values

# Display the recommendations and distances
print(f"Recommendations for Product ID {test_product_id}:")
for product_id, distance in zip(recommendations, distances):  # Iterate through both lists
    print(f"Product ID: {product_id}, Distance: {distance:.4f}")

Recommendations for Product ID 0816046042:
Product ID: 0071402802, Distance: 0.3434
Product ID: 1560253762, Distance: 0.5000
Product ID: 1560253371, Distance: 0.6464
Product ID: 0553801821, Distance: 0.7379
Product ID: 0806917598, Distance: 1.0000
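
Product IDs alone are hard to read, and a cosine distance of 1.0000 (as in the last row above) means the two products share no raters in the training matrix, so that neighbor carries no real signal. A possible follow-up, sketched on made-up toy data, is to attach titles and drop such neighbors. The names `title_lookup` and `describe_recommendations` and the `max_distance` cutoff are hypothetical, not part of the notebook.

```python
import pandas as pd

# Toy stand-in for filter_user_rating_df (IDs and titles are made up)
ratings = pd.DataFrame({
    "product_id":    ["B1", "B1", "B2", "B3"],
    "product_title": ["Book One", "Book One", "Book Two", "Book Three"],
})

# One title per product_id; drop_duplicates keeps the first title seen
title_lookup = (ratings.drop_duplicates("product_id")
                       .set_index("product_id")["product_title"]
                       .to_dict())


def describe_recommendations(product_ids, distances, max_distance=1.0):
    """Pair each recommended ID with its title, skipping neighbors at or beyond max_distance."""
    return [
        (pid, title_lookup.get(pid, "<unknown title>"), float(dist))
        for pid, dist in zip(product_ids, distances)
        if dist < max_distance
    ]


rows = describe_recommendations(["B2", "B3"], [0.3434, 1.0])
print(rows)
```

In the notebook, the same lookup could be built from `filter_user_rating_df` and applied to the values returned by `recommend_books`.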
