<a href="https://colab.research.google.com/github/Tiwari666/Recommender_System/blob/main/KNN_Recommendation_System.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

This project is an example of a collaborative filtering recommendation system based on the behavior of similar users (using the KNN algorithm).

# **Collaborative filtering:**


Collaborative filtering is a technique commonly used in recommender systems where the system recommends items to users based on the preferences or behavior of other users. In this approach, the system does not require any knowledge about the items themselves but rather relies on historical user-item interactions (e.g., ratings or purchases) to make recommendations.

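The K-nearest neighbors (KNN) algorithm used in this example is a type of collaborative filtering algorithm: it recommends items based on the preferences of users with similar behavior. In this notebook the neighbors are books, and two books count as similar when the same users gave them similar ratings.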
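As a rough illustration of what "similar" means here, the sketch below compares rating vectors with cosine distance, the metric the KNN model in this notebook is configured with. The toy matrix and the helper cosine_distance are hypothetical and are not part of the dataset or the original code.

```python
import numpy as np

# Hypothetical toy ratings: rows are books, columns are users, 0 means "not rated".
toy_ratings = np.array([
    [5.0, 4.0, 0.0, 1.0],   # book A
    [4.0, 5.0, 0.0, 0.0],   # book B, rated much like A by the same users
    [0.0, 0.0, 5.0, 4.0],   # book C, rated mostly by different users
])

def cosine_distance(u, v):
    """Return 1 - cosine similarity of two rating vectors."""
    return 1.0 - float(u @ v) / float(np.linalg.norm(u) * np.linalg.norm(v))

dist_ab = cosine_distance(toy_ratings[0], toy_ratings[1])
dist_ac = cosine_distance(toy_ratings[0], toy_ratings[2])

# A smaller distance means the two books were rated more alike.
print(f"A vs B: {dist_ab:.3f}, A vs C: {dist_ac:.3f}")
```

Book B ends up much closer to book A than book C does, so a reader who liked A would be pointed to B first.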

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

In [113]:
books = pd.read_csv("/content/books_knn_recomm_colab_filter.csv")
books.head(5)

Unnamed: 0,id,book_id,best_book_id,work_id,books_count,isbn,isbn13,authors,original_publication_year,original_title,title,language_code,average_rating,ratings_count,work_ratings_count,work_text_reviews_count,ratings_1,ratings_2,ratings_3,ratings_4,ratings_5,image_url,small_image_url
0,1,2767052,2767052,2792775,272,439023483,9780439000000.0,Suzanne Collins,2008.0,The Hunger Games,"The Hunger Games (The Hunger Games, #1)",eng,4.34,4780653,4942365,155254,66715,127936,560092,1481305,2706317,https://images.gr-assets.com/books/1447303603m...,https://images.gr-assets.com/books/1447303603s...
1,2,3,3,4640799,491,439554934,9780440000000.0,"J.K. Rowling, Mary GrandPré",1997.0,Harry Potter and the Philosopher's Stone,Harry Potter and the Sorcerer's Stone (Harry P...,eng,4.44,4602479,4800065,75867,75504,101676,455024,1156318,3011543,https://images.gr-assets.com/books/1474154022m...,https://images.gr-assets.com/books/1474154022s...
2,3,41865,41865,3212258,226,316015849,9780316000000.0,Stephenie Meyer,2005.0,Twilight,"Twilight (Twilight, #1)",en-US,3.57,3866839,3916824,95009,456191,436802,793319,875073,1355439,https://images.gr-assets.com/books/1361039443m...,https://images.gr-assets.com/books/1361039443s...
3,4,2657,2657,3275794,487,61120081,9780061000000.0,Harper Lee,1960.0,To Kill a Mockingbird,To Kill a Mockingbird,eng,4.25,3198671,3340896,72586,60427,117415,446835,1001952,1714267,https://images.gr-assets.com/books/1361975680m...,https://images.gr-assets.com/books/1361975680s...
4,5,4671,4671,245494,1356,743273567,9780743000000.0,F. Scott Fitzgerald,1925.0,The Great Gatsby,The Great Gatsby,eng,3.89,2683664,2773745,51992,86236,197621,606158,936012,947718,https://images.gr-assets.com/books/1490528560m...,https://images.gr-assets.com/books/1490528560s...


In [114]:
ratings = pd.read_csv("/content/ratings_knn_recomm_colab_filter.csv")
ratings.head(5)

Unnamed: 0,book_id,user_id,rating
0,1,314,5
1,1,439,3
2,1,588,5
3,1,1169,4
4,1,1185,4


In [8]:
books.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 23 columns):
 #   Column                     Non-Null Count  Dtype  
---  ------                     --------------  -----  
 0   id                         10000 non-null  int64  
 1   book_id                    10000 non-null  int64  
 2   best_book_id               10000 non-null  int64  
 3   work_id                    10000 non-null  int64  
 4   books_count                10000 non-null  int64  
 5   isbn                       9300 non-null   object 
 6   isbn13                     9415 non-null   float64
 7   authors                    10000 non-null  object 
 8   original_publication_year  9979 non-null   float64
 9   original_title             9415 non-null   object 
 10  title                      10000 non-null  object 
 11  language_code              8916 non-null   object 
 12  average_rating             10000 non-null  float64
 13  ratings_count              10000 non-null  int6

In [9]:
ratings.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 981756 entries, 0 to 981755
Data columns (total 3 columns):
 #   Column   Non-Null Count   Dtype
---  ------   --------------   -----
 0   book_id  981756 non-null  int64
 1   user_id  981756 non-null  int64
 2   rating   981756 non-null  int64
dtypes: int64(3)
memory usage: 22.5 MB


# **Cleaning the data in the Books Dataframe**

In [12]:
books.isna().sum()

id                              0
book_id                         0
best_book_id                    0
work_id                         0
books_count                     0
isbn                          700
isbn13                        585
authors                         0
original_publication_year      21
original_title                585
title                           0
language_code                1084
average_rating                  0
ratings_count                   0
work_ratings_count              0
work_text_reviews_count         0
ratings_1                       0
ratings_2                       0
ratings_3                       0
ratings_4                       0
ratings_5                       0
image_url                       0
small_image_url                 0
dtype: int64

In [13]:
books = books.dropna()

In [22]:
# Check for duplicates based on a single column
duplicate_items = books[books.duplicated(subset=['original_title'], keep=False)]
duplicate_items

Unnamed: 0,id,book_id,best_book_id,work_id,books_count,isbn,isbn13,authors,original_publication_year,original_title,...,ratings_count,work_ratings_count,work_text_reviews_count,ratings_1,ratings_2,ratings_3,ratings_4,ratings_5,image_url,small_image_url
2,3,41865,41865,3212258,226,316015849,9.780316e+12,Stephenie Meyer,2005.0,Twilight,...,3866839,3916824,95009,456191,436802,793319,875073,1355439,https://images.gr-assets.com/books/1361039443m...,https://images.gr-assets.com/books/1361039443s...
50,51,256683,256683,2267189,178,1416914285,9.781417e+12,Cassandra Clare,2007.0,City of Bones,...,1154031,1241799,51589,34122,65349,203466,356048,582814,https://images.gr-assets.com/books/1432730315m...,https://images.gr-assets.com/books/1432730315s...
61,62,119322,119322,1536771,287,679879242,9.780680e+12,Philip Pullman,1995.0,Northern Lights,...,953970,994914,14915,38382,64591,198764,313147,380030,https://images.gr-assets.com/books/1451271747m...,https://images.gr-assets.com/books/1451271747s...
81,82,1845,1845,3284484,108,385486804,9.780385e+12,Jon Krakauer,1996.0,Into the Wild,...,647684,665377,17299,19229,35567,135199,248287,227095,https://images.gr-assets.com/books/1403173986m...,https://images.gr-assets.com/books/1403173986s...
122,123,5358,5358,38270,11,582418275,9.780582e+12,John Grisham,1991.0,The Firm,...,488269,488355,3139,5075,20119,111543,190966,160652,https://images.gr-assets.com/books/1418465200m...,https://images.gr-assets.com/books/1418465200s...
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
9783,9784,26599,26599,1247207,3,871401541,9.780871e+12,"E.E. Cummings, Richard S. Kennedy",1994.0,Selected Poems,...,11174,11278,234,151,344,1740,3688,5355,https://images.gr-assets.com/books/1320891958m...,https://images.gr-assets.com/books/1320891958s...
9804,9805,420734,420734,1784677,38,671027581,9.780671e+12,Linda Howard,2001.0,Open Season,...,14653,15580,705,156,588,3505,6043,5288,https://images.gr-assets.com/books/1324869940m...,https://images.gr-assets.com/books/1324869940s...
9812,9813,97408,97408,93897,12,1571316515,9.781571e+12,Natasha Friend,2004.0,Perfect,...,11497,11648,1106,232,961,3362,3746,3347,https://images.gr-assets.com/books/1328750860m...,https://images.gr-assets.com/books/1328750860s...
9925,9926,12522507,12522507,17508816,32,61935123,9.780062e+12,Sophie Jordan,2012.0,Hidden,...,14760,15826,1468,328,1108,3919,5165,5306,https://images.gr-assets.com/books/1329749855m...,https://images.gr-assets.com/books/1329749855s...


In [24]:
# Check for duplicates based on multiple columns
duplicate_items = books[books.duplicated(subset=['original_title', 'authors'], keep=False)]
# Print the duplicate items
duplicate_items

Unnamed: 0,id,book_id,best_book_id,work_id,books_count,isbn,isbn13,authors,original_publication_year,original_title,...,ratings_count,work_ratings_count,work_text_reviews_count,ratings_1,ratings_2,ratings_3,ratings_4,ratings_5,image_url,small_image_url
122,123,5358,5358,38270,11,582418275,9780582000000.0,John Grisham,1991.0,The Firm,...,488269,488355,3139,5075,20119,111543,190966,160652,https://images.gr-assets.com/books/1418465200m...,https://images.gr-assets.com/books/1418465200s...
300,301,4900,4900,2877220,1369,1892295490,9781892000000.0,Joseph Conrad,1899.0,Heart of Darkness,...,255576,308391,9791,25591,42946,84433,88608,66813,https://images.gr-assets.com/books/1392799983m...,https://images.gr-assets.com/books/1392799983s...
361,362,11149,17383917,2920952,231,60652896,9780061000000.0,C.S. Lewis,1942.0,A Grief Observed,...,116277,238169,8285,4524,8925,34444,75279,114997,https://images.gr-assets.com/books/1347801873m...,https://images.gr-assets.com/books/1347801873s...
1592,1593,13812,13812,15983,58,553564943,9780554000000.0,Raymond E. Feist,1982.0,Magician,...,62432,64948,1231,806,2358,10465,22606,28713,https://images.gr-assets.com/books/1408317983m...,https://images.gr-assets.com/books/1408317983s...
2268,2269,43916,43916,2960376,39,586217835,9780586000000.0,Raymond E. Feist,1982.0,Magician,...,43964,47819,1118,423,1391,5933,14567,25505,https://images.gr-assets.com/books/1472201148m...,https://images.gr-assets.com/books/1472201148s...
2780,2781,49221,49221,894384,87,60652381,9780061000000.0,C.S. Lewis,1961.0,A Grief Observed,...,35913,38154,1930,403,976,5410,12917,18448,https://images.gr-assets.com/books/1412529325m...,https://images.gr-assets.com/books/1412529325s...
2820,2821,77394,77394,1133797,6,553213180,9780553000000.0,L.M. Montgomery,1917.0,Rainbow Valley,...,57062,60826,1389,872,2177,11467,20526,25784,https://s.gr-assets.com/assets/nophoto/book/11...,https://s.gr-assets.com/assets/nophoto/book/50...
3472,3473,6780439,6780439,285261,17,1847386823,9781847000000.0,L.J. Smith,1991.0,Dark Reunion,...,33023,34790,1050,545,1488,6040,9472,17245,https://images.gr-assets.com/books/1296605183m...,https://images.gr-assets.com/books/1296605183s...
3759,3760,43324,43324,2024958,72,7165161,9780007000000.0,Sidney Sheldon,1985.0,The Sands of Time,...,21443,22906,707,520,2117,7426,7837,5006,https://s.gr-assets.com/assets/nophoto/book/11...,https://s.gr-assets.com/assets/nophoto/book/50...
4352,4353,119382,119382,291432,76,6174434,9780006000000.0,Sidney Sheldon,1988.0,The Sands of Time,...,18499,19908,400,307,1548,6251,6904,4898,https://images.gr-assets.com/books/1356453253m...,https://images.gr-assets.com/books/1356453253s...


In [19]:
duplicate_items.shape

(18, 23)

Since the DataFrame duplicate_items is not empty, the combination of author and original title is not unique: the rows printed above are books that share both fields with at least one other row.

Dropping duplicates on the combination of 'authors' and 'original_title', rather than on 'original_title' alone, should lose less information, because different books that merely share a title are kept.

So, I will drop duplicate items based only on the combination of author and original title.

In [25]:
# Remove duplicate rows based on the ('original_title', 'authors') combination, keeping only the first occurrence of each.
# Assigning the result back avoids modifying a possible view of the original DataFrame in place.
books = books.drop_duplicates(subset=['original_title', 'authors'], keep='first')
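The difference between checking one column and checking a pair of columns can be seen on a small frame. This is only a sketch: the two "Magician" rows and the first "Perfect" row reuse titles, authors, and ids from the output above, while the "Another Author" row and its id are made up for illustration.

```python
import pandas as pd

toy_books = pd.DataFrame({
    "id": [1593, 2269, 9813, 1],
    "original_title": ["Magician", "Magician", "Perfect", "Perfect"],
    # "Another Author" is a hypothetical second author who reused the title "Perfect".
    "authors": ["Raymond E. Feist", "Raymond E. Feist", "Natasha Friend", "Another Author"],
})

# keep=False marks every member of a duplicate group, not just the later ones.
title_dupes = toy_books.duplicated(subset=["original_title"], keep=False)
pair_dupes = toy_books.duplicated(subset=["original_title", "authors"], keep=False)

# Dropping on the pair removes only the repeated (title, author) row.
deduped = toy_books.drop_duplicates(subset=["original_title", "authors"], keep="first")

print(title_dupes.sum(), pair_dupes.sum(), len(deduped))
```

Checking the title alone flags all four rows, whereas checking the pair flags only the two "Magician" rows, so the two different books called "Perfect" both survive.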

In [None]:
# Check for duplicates in the 'original_title' column

# duplicate_items = books[books.duplicated(subset='original_title', keep=False)]

# **Cleaning the data in the rating dataframe:**

In [36]:
ratings.head()

Unnamed: 0,book_id,user_id,rating
117889,1180,1,4
488112,4893,1,3
625717,6285,1,4
796318,8034,2,4
875008,8855,2,5


In [37]:
ratings.isna().sum()

book_id    0
user_id    0
rating     0
dtype: int64

In [38]:
ratings = ratings.sort_values("user_id")
ratings.drop_duplicates(subset=["user_id","book_id"], keep='first', inplace=True)

# **Merging 'books' and 'ratings' DataFrames based on 'id' and 'book_id' columns**

# **Finding the condition for merging two dfs:**

First, we determine whether the 'id' column from the 'books' DataFrame and the 'book_id' column from the 'ratings' DataFrame can be used to merge the two DataFrames.

Let's check if these columns contain matching values.

In [30]:
# Check unique values in the 'id' column from the 'books' DataFrame
books_unique_ids = books['id'].unique()

In [31]:
# Check unique values in the 'book_id' column from the 'ratings' DataFrame
ratings_unique_book_ids = ratings['book_id'].unique()

In [None]:
# Check if there are any common values between the two sets of unique ids
common_ids = set(books_unique_ids) & set(ratings_unique_book_ids)
common_ids

In [48]:
# Determine if the two DataFrames can be merged based on 'id' and 'book_id' columns
if len(common_ids) > 0:
    print("The 'id' column from 'books' and the 'book_id' column from 'ratings' can be used for merging.")
    # Merge 'books' and 'ratings' DataFrames based on 'id' and 'book_id' columns
    merged_df = pd.merge(books, ratings, how='left', left_on='id', right_on='book_id')
    # Print the DataFrame with selected columns
    df1 = merged_df[['id', 'book_id_y']]
    df1 = df1.rename(columns={'id': 'books_id', 'book_id_y': 'ratings_book_id'})
    print(df1)
else:
    print("The 'id' column from 'books' and the 'book_id' column from 'ratings' cannot be used for merging.")


The 'id' column from 'books' and the 'book_id' column from 'ratings' can be used for merging.
        books_id  ratings_book_id
0              1                1
1              1                1
2              1                1
3              1                1
4              1                1
...          ...              ...
772795      9999             9999
772796      9999             9999
772797      9999             9999
772798      9999             9999
772799      9999             9999

[772800 rows x 2 columns]
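A non-empty intersection only shows that some keys match. A slightly stricter check, sketched below on hypothetical toy frames, uses the indicator option of pd.merge and isin to list the keys that fail to match on either side.

```python
import pandas as pd

# Hypothetical toy data: book 3 has no ratings, and book_id 4 has no entry in toy_books.
toy_books = pd.DataFrame({"id": [1, 2, 3], "original_title": ["A", "B", "C"]})
toy_ratings = pd.DataFrame({
    "book_id": [1, 1, 2, 4],
    "user_id": [10, 11, 10, 12],
    "rating": [5, 3, 4, 2],
})

# indicator=True adds a '_merge' column saying where each row came from.
merged = pd.merge(toy_books, toy_ratings, how="left",
                  left_on="id", right_on="book_id", indicator=True)

# Books that found no rating keep NaN in the rating columns after a left merge.
unrated_books = merged.loc[merged["_merge"] == "left_only", "id"].tolist()

# Ratings whose book_id is absent from toy_books are silently dropped by a left merge.
orphan_ratings = toy_ratings.loc[
    ~toy_ratings["book_id"].isin(toy_books["id"]), "book_id"
].tolist()

print("books without ratings:", unrated_books)
print("ratings without a book:", orphan_ratings)
```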


In [49]:
# Select specific columns from the merged DataFrame
df = merged_df[['id', 'original_title', 'user_id', 'rating']]
df

Unnamed: 0,id,original_title,user_id,rating
0,1,The Hunger Games,314,5
1,1,The Hunger Games,439,3
2,1,The Hunger Games,588,5
3,1,The Hunger Games,1169,4
4,1,The Hunger Games,1185,4
...,...,...,...,...
772795,9999,Cinderella Ate My Daughter: Dispatches from th...,49753,4
772796,9999,Cinderella Ate My Daughter: Dispatches from th...,50041,4
772797,9999,Cinderella Ate My Daughter: Dispatches from th...,50077,3
772798,9999,Cinderella Ate My Daughter: Dispatches from th...,51095,4


In [50]:
# Rename the 'id' column to 'book_id'
df = df.rename(columns={'id': 'book_id'})
df

Unnamed: 0,book_id,original_title,user_id,rating
0,1,The Hunger Games,314,5
1,1,The Hunger Games,439,3
2,1,The Hunger Games,588,5
3,1,The Hunger Games,1169,4
4,1,The Hunger Games,1185,4
...,...,...,...,...
772795,9999,Cinderella Ate My Daughter: Dispatches from th...,49753,4
772796,9999,Cinderella Ate My Daughter: Dispatches from th...,50041,4
772797,9999,Cinderella Ate My Daughter: Dispatches from th...,50077,3
772798,9999,Cinderella Ate My Daughter: Dispatches from th...,51095,4


In [67]:
df.isna().sum()

book_id           0
original_title    0
user_id           0
rating            0
dtype: int64

In [53]:
df.describe()

Unnamed: 0,book_id,user_id,rating
count,772800.0,772800.0,772800.0
mean,4679.923622,25145.563861,3.852008
std,2875.733468,15228.904208,0.980948
min,1.0,1.0,1.0
25%,2143.0,11854.0,3.0
50%,4537.0,24238.0,4.0
75%,7110.0,38044.0,5.0
max,9999.0,53424.0,5.0


# **Making a pivot table:**

We generally make a pivot table where books are in rows and users are in columns.

This is a common and effective way to represent rating data for recommendation systems and collaborative filtering: each row collects the ratings that different readers gave to a single book, so two books can be compared through their rating vectors. The same matrix is also the starting point for methods that decompose the rating matrix into lower-dimensional matrices.

In [None]:
# Pivot the DataFrame to create 'ratings_df' with 'book_id' as index, 'user_id' as columns, and 'rating' as values
ratings_df = df.pivot(index='book_id', columns='user_id', values='rating').fillna(0)

# Set display option to show maximum 150 columns
pd.set_option('display.max_columns', 150)
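To see what the pivot produces, here is a minimal sketch on a hypothetical four-row frame with the same column names. Note that DataFrame.pivot raises a ValueError when an (index, column) pair occurs more than once, which is one reason the (user_id, book_id) duplicates were removed earlier.

```python
import pandas as pd

# Hypothetical long-format ratings: one row per (book, user) pair.
toy_df = pd.DataFrame({
    "book_id": [1, 1, 2, 3],
    "user_id": [10, 11, 10, 12],
    "rating": [5, 3, 4, 2],
})

# Books become rows, users become columns; missing ratings are filled with 0.
toy_pivot = toy_df.pivot(index="book_id", columns="user_id", values="rating").fillna(0)
print(toy_pivot)

# Share of cells that hold no rating.
sparsity = (toy_pivot.to_numpy() == 0).mean()
print(f"sparsity: {sparsity:.2f}")
```

Even in this tiny example more than half of the cells are zeros; with thousands of books and tens of thousands of users the share is far higher.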

In [56]:
# Display the first few rows of the pivot table
ratings_df.head()

user_id,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47,48,49,50,51,52,53,54,55,56,57,58,59,60,61,62,63,65,66,67,68,69,70,71,72,73,74,75,76,...,53348,53349,53350,53351,53352,53353,53354,53355,53356,53357,53358,53359,53360,53361,53362,53364,53365,53366,53367,53368,53369,53371,53372,53373,53374,53375,53376,53377,53378,53379,53380,53381,53382,53383,53384,53385,53386,53387,53388,53389,53390,53391,53392,53393,53394,53395,53396,53397,53398,53399,53400,53401,53402,53403,53404,53405,53406,53407,53408,53409,53410,53411,53412,53413,53414,53415,53416,53417,53418,53419,53420,53421,53422,53423,53424
book_id
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
5,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [57]:
ratings_df.shape

(7851, 50766)


To convert the DataFrame ratings_df into a sparse matrix in Compressed Sparse Row (CSR) format, we can use the csr_matrix function from the SciPy library.

The following code performs this conversion. Each non-zero entry of the DataFrame becomes a stored entry in the CSR matrix, while zero entries are not stored explicitly, which makes the representation far more memory-efficient for sparse data. The resulting ratings_matrix can then be used for further operations and computations.

CSR is particularly useful for large datasets with a significant number of zero values, such as user-item rating matrices in recommendation systems. It also matters for interoperability: libraries such as scikit-learn accept SciPy sparse matrices directly.

In [60]:
from scipy.sparse import csr_matrix

# Convert DataFrame to CSR sparse matrix
# .values converts the DataFrame into a NumPy array, which SciPy (a library for scientific computing) accepts
ratings_matrix = csr_matrix(ratings_df.values)
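An alternative worth knowing about, sketched here under the assumption that the ratings are available in long format (one row per book, user, and rating), is to build the CSR matrix straight from the triplets so the dense pivot table never has to be materialized. The toy frame and variable names are hypothetical.

```python
import pandas as pd
from scipy.sparse import csr_matrix

# Hypothetical long-format ratings; book_id has a gap (no 3 or 4) on purpose.
toy_df = pd.DataFrame({
    "book_id": [1, 1, 2, 5],
    "user_id": [10, 11, 10, 12],
    "rating": [5, 3, 4, 2],
})

# factorize maps each ID to a consecutive row/column position and returns the sorted IDs.
book_codes, book_ids = pd.factorize(toy_df["book_id"], sort=True)
user_codes, user_ids = pd.factorize(toy_df["user_id"], sort=True)

# Build the matrix from (value, (row, column)) triplets.
toy_matrix = csr_matrix(
    (toy_df["rating"].to_numpy(dtype=float), (book_codes, user_codes)),
    shape=(len(book_ids), len(user_ids)),
)

density = toy_matrix.nnz / (toy_matrix.shape[0] * toy_matrix.shape[1])
print(toy_matrix.shape, toy_matrix.nnz, f"density={density:.2f}")
```

Here book_ids plays the role of ratings_df.index: it translates a row position of the matrix back into a book ID.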

In [61]:
# Display the sparse matrix
print(ratings_matrix)

  (0, 309)	5.0
  (0, 430)	3.0
  (0, 577)	5.0
  (0, 1148)	4.0
  (0, 1163)	4.0
  (0, 2038)	4.0
  (0, 2445)	4.0
  (0, 2843)	5.0
  (0, 3583)	4.0
  (0, 3835)	5.0
  (0, 5251)	5.0
  (0, 5330)	3.0
  (0, 5747)	5.0
  (0, 6470)	5.0
  (0, 7374)	3.0
  (0, 9011)	1.0
  (0, 9871)	4.0
  (0, 9877)	5.0
  (0, 9975)	4.0
  (0, 10061)	4.0
  (0, 10329)	5.0
  (0, 10649)	5.0
  (0, 11532)	4.0
  (0, 11600)	4.0
  (0, 12121)	5.0
  :	:
  (7850, 29336)	4.0
  (7850, 29337)	2.0
  (7850, 30806)	4.0
  (7850, 30830)	3.0
  (7850, 31870)	3.0
  (7850, 31895)	2.0
  (7850, 32602)	3.0
  (7850, 33524)	3.0
  (7850, 33742)	3.0
  (7850, 34043)	2.0
  (7850, 35971)	4.0
  (7850, 36039)	3.0
  (7850, 36198)	3.0
  (7850, 36222)	4.0
  (7850, 38008)	4.0
  (7850, 39389)	5.0
  (7850, 39958)	2.0
  (7850, 40286)	4.0
  (7850, 40944)	4.0
  (7850, 43755)	5.0
  (7850, 47344)	4.0
  (7850, 47609)	4.0
  (7850, 47640)	3.0
  (7850, 48578)	4.0
  (7850, 49713)	5.0


# **Building the KNN MODEL:**

In [62]:
from sklearn.neighbors import NearestNeighbors

# Initialize a KNN model with the cosine distance metric and a brute-force search
model_knn = NearestNeighbors(metric='cosine', algorithm='brute')

In [63]:
# Fit the model to the ratings matrix
model_knn.fit(ratings_matrix)
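The same two calls can be tried on a tiny matrix to see what kneighbors returns. This is a minimal sketch with hypothetical toy ratings; toy_knn is configured like model_knn above.

```python
import numpy as np
from scipy.sparse import csr_matrix
from sklearn.neighbors import NearestNeighbors

# Hypothetical toy ratings: rows are books, columns are users.
toy_matrix = csr_matrix(np.array([
    [5.0, 4.0, 0.0, 1.0],   # row 0
    [4.0, 5.0, 0.0, 0.0],   # row 1, rated much like row 0
    [0.0, 0.0, 5.0, 4.0],   # row 2, rated mostly by other users
]))

toy_knn = NearestNeighbors(metric="cosine", algorithm="brute")
toy_knn.fit(toy_matrix)

# Query with row 0; the result holds row positions and cosine distances, nearest first.
distances, indices = toy_knn.kneighbors(toy_matrix[0], n_neighbors=3)
print(indices[0], distances[0].round(3))
```

The query row comes back as its own nearest neighbor with distance 0, which is why the recommendation functions later ask for one neighbor more than requested and skip the first result.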

# **Retrieving the book ID for a given book title from a DataFrame:**

The following code defines a function to retrieve the book ID for a given book title from a DataFrame, and then demonstrates its usage by retrieving and printing the book ID for 'The Hunger Games'.

The query() method is used to filter the DataFrame df by book title. The @ symbol is used to indicate that book_title is a variable defined outside the query string. The filtered DataFrame is then used to extract the corresponding book ID.

In [65]:
def get_book_id(book_title):
    # Use the query method to filter the DataFrame by book title
    target_df = df.query("original_title == @book_title")
    return target_df['book_id'].iloc[0]

TheHungerGames_id = get_book_id('The Hunger Games')
print(TheHungerGames_id)

1
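One thing to keep in mind is that iloc[0] raises an IndexError when no row matches the title. The sketch below shows one possible way to handle that case; find_book_id and toy_df are hypothetical names and are not part of the notebook.

```python
import pandas as pd

toy_df = pd.DataFrame({
    "book_id": [1, 1, 3],
    "original_title": ["The Hunger Games", "The Hunger Games", "Twilight"],
})

def find_book_id(frame, book_title):
    """Return the first matching book_id, or None when the title is absent."""
    # '@book_title' refers to the Python variable, not to a column named book_title.
    matches = frame.query("original_title == @book_title")
    if matches.empty:
        return None
    return int(matches["book_id"].iloc[0])

print(find_book_id(toy_df, "Twilight"))
print(find_book_id(toy_df, "A Title That Is Not In The Data"))
```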


# **Retrieving the original title for a given book ID from the DataFrame df:**

In [68]:
def get_title(book_id):
    # Use the query method to filter the DataFrame by book ID
    target_df = df.query("book_id == @book_id")
    return target_df['original_title'].iloc[0]

print(get_title(1))

The Hunger Games


# **Retrieving the recommendations for a given book title based on the k-nearest neighbors algorithm using the fitted model (model_knn):**

In [83]:
def get_recomm(book_title, num_neighbors=10):
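    # Row position of the book in ratings_df; book_id - 1 is only correct while no book IDs are missing
    query_index = ratings_df.index.get_loc(get_book_id(book_title))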


    if num_neighbors > 0:
        n_neighbors = num_neighbors + 1
    else:
        n_neighbors = 10 + 1

    distances, indices = model_knn.kneighbors(ratings_df.iloc[query_index, :].values.reshape(1, -1), n_neighbors=n_neighbors)

    book_ids = [ratings_df.index[index] for index in indices.flatten()]

    # Create a dictionary to pair indices and book IDs
    indices_book_ids_dict = {i: book_ids[i] for i in range(len(book_ids))}

    # Print the dictionary
    print(f"Dictionary of paired indices and book IDs: {indices_book_ids_dict}")

    # Create a DataFrame to store recommended book IDs
    recommendations_df = pd.DataFrame({'Recommended_Book_ID': book_ids[1:], 'Distance': distances.flatten()[1:]})

    return recommendations_df

# Example usage:
recommendations_df = get_recomm("The Hunger Games", num_neighbors=5)
recommendations_df

# Note that indexing starts from 0 in Python, but book_id starts from 1. Hence, the first pair in the dictionary is 0: 1 (the query book itself).
# The remaining pairs map each neighbor's rank to its book ID.

Dictionary of paired indices and book IDs: {0: 1, 1: 17, 2: 31, 3: 2, 4: 20, 5: 3}


Unnamed: 0,Recommended_Book_ID,Distance
0,17,0.405326
1,31,0.426746
2,2,0.444474
3,20,0.452303
4,3,0.490848


In [84]:
query_index = get_book_id('The Hunger Games') - 1
query_index

0

In [85]:
query_index = get_book_id('Dark Reunion') - 1
query_index

3472
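Subtracting 1 from the book ID gives the right row position only while no book IDs are missing from ratings_df. Since ratings_df has 7851 rows while book IDs run up to 9999, the two can drift apart for higher IDs. A safer lookup, sketched here on a hypothetical toy pivot with a gap in its index, is to ask the index for the position and to use the index again to go back from a position to a book ID.

```python
import pandas as pd

# Hypothetical pivot whose index has a gap: book IDs 3 and 4 were dropped.
toy_pivot = pd.DataFrame(
    [[5.0, 0.0], [0.0, 4.0], [3.0, 0.0]],
    index=pd.Index([1, 2, 5], name="book_id"),
    columns=pd.Index([10, 11], name="user_id"),
)

book_id = 5
naive_position = book_id - 1                      # 4, which is past the last row
row_position = toy_pivot.index.get_loc(book_id)   # position of book 5 in the pivot
recovered_id = toy_pivot.index[row_position]      # back from position to book ID

print(naive_position, row_position, recovered_id)
```

The same mapping applies to the output of kneighbors: the indices it returns are row positions, and ratings_df.index[position] turns each of them into a book ID.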

In [94]:
def get_recomm(book_title, num_neighbors=10):
    # Row position of the book in ratings_df. Looking it up in the index stays correct
    # even where book IDs are missing (book_id - 1 only works while there are no gaps).
    query_index = ratings_df.index.get_loc(get_book_id(book_title))

    # Ask for one extra neighbor, because the query book is returned as its own nearest neighbor
    n_neighbors = num_neighbors + 1 if num_neighbors > 0 else 11

    # Get the features (ratings row) of the book at the query position and reshape them to 2-D
    book_features = ratings_df.iloc[query_index, :].values.reshape(1, -1)

    # Compute the nearest neighbors
    distances, indices = model_knn.kneighbors(book_features, n_neighbors=n_neighbors)

    # Pair each neighbor's rank with its book ID (ratings_df.index maps row position -> book_id)
    indices_book_ids_dict = {i: ratings_df.index[index] for i, index in enumerate(indices.flatten())}

    # Pair each neighbor's rank with its book title
    actual_book_ids_dict = {i: get_title(book_id) for i, book_id in indices_book_ids_dict.items()}

    # Create a DataFrame to store the recommendations and their distances.
    # Note: 'Recommended_Book_ID' holds row positions in ratings_df, not book IDs.
    recommendations_df = pd.DataFrame({'Index': list(actual_book_ids_dict.keys()),
                                       'Book_Title': list(actual_book_ids_dict.values()),
                                       'Recommended_Book_ID': list(indices.flatten()),
                                       'Distance': distances.flatten()})

    # Set the index column
    recommendations_df.set_index('Index', inplace=True)

    # Print the dictionary of paired ranks and book titles
    print(f"Dictionary of paired indices and book IDs: {actual_book_ids_dict}")

    return recommendations_df

# Example usage:
recommendations_df = get_recomm("The Hunger Games", num_neighbors=10)
recommendations_df

Dictionary of paired indices and book IDs: {0: 'The Hunger Games', 1: 'Catching Fire', 2: 'The Help', 3: "Harry Potter and the Philosopher's Stone", 4: 'Mockingjay', 5: 'Twilight', 6: 'The Secret Garden', 7: 'The Great Gatsby', 8: 'Män som hatar kvinnor', 9: 'Angels & Demons ', 10: 'The Lion, the Witch and the Wardrobe'}


Index,Book_Title,Recommended_Book_ID,Distance
0,The Hunger Games,0,0.0
1,Catching Fire,16,0.405326
2,The Help,30,0.426746
3,Harry Potter and the Philosopher's Stone,1,0.444474
4,Mockingjay,19,0.452303
5,Twilight,2,0.490848
6,The Secret Garden,90,0.511991
7,The Great Gatsby,4,0.519767
8,Män som hatar kvinnor,15,0.522912
9,Angels & Demons,8,0.527192


# **Testing the Results based on the 'The Hunger Games'**

In [108]:
# Top 10 recommendations for The Hunger Games
# Example usage: recommendations based on the KNN model
# Taking "The Hunger Games" as the reference point, 10 books are recommended and stored in the DataFrame recommendations_for_TheHungerGames.
recommendations_for_TheHungerGames = get_recomm('The Hunger Games', num_neighbors=10)
recommendations_for_TheHungerGames

Dictionary of paired indices and book IDs: {0: 'The Hunger Games', 1: 'Catching Fire', 2: 'The Help', 3: "Harry Potter and the Philosopher's Stone", 4: 'Mockingjay', 5: 'Twilight', 6: 'The Secret Garden', 7: 'The Great Gatsby', 8: 'Män som hatar kvinnor', 9: 'Angels & Demons ', 10: 'The Lion, the Witch and the Wardrobe'}


Index,Book_Title,Recommended_Book_ID,Distance
0,The Hunger Games,0,0.0
1,Catching Fire,16,0.405326
2,The Help,30,0.426746
3,Harry Potter and the Philosopher's Stone,1,0.444474
4,Mockingjay,19,0.452303
5,Twilight,2,0.490848
6,The Secret Garden,90,0.511991
7,The Great Gatsby,4,0.519767
8,Män som hatar kvinnor,15,0.522912
9,Angels & Demons,8,0.527192


In [109]:
# Iterate over the rows of recommendations_for_TheHungerGames, starting from the second row (position 1, which skips the query book itself),
# and print the ID and title of each recommendation.
# The IDs are read from the Recommended_Book_ID column and passed to get_title to look up each title.
# Note: the titles printed below do not match the Book_Title column above (16 is "Catching Fire" in the table but prints as "Män som hatar kvinnor"),
# which suggests Recommended_Book_ID is a 0-based row position while get_title expects a 1-based id.
for index, row in recommendations_for_TheHungerGames.iloc[1:].iterrows():
    recommended_book_id = row['Recommended_Book_ID']
    if recommended_book_id > 0 and recommended_book_id <= len(ratings_df):
        print('id:', recommended_book_id, '\t\tBook: ', get_title(recommended_book_id))
    else:
        print(f"Book ID {recommended_book_id} not found in the dataset.")

id: 16 		Book:  Män som hatar kvinnor
id: 30 		Book:  Gone Girl
id: 1 		Book:  The Hunger Games
id: 19 		Book:   The Fellowship of the Ring
id: 2 		Book:  Harry Potter and the Philosopher's Stone
id: 90 		Book:  The Outsiders
id: 4 		Book:  To Kill a Mockingbird
id: 15 		Book:  Het Achterhuis: Dagboekbrieven 14 juni 1942 - 1 augustus 1944
id: 8 		Book:  The Catcher in the Rye
id: 36 		Book:  The Giver
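
The printed titles are shifted by one relative to the `Book_Title` column of the table. A minimal sketch of why that can happen, assuming `get_title` expects 1-based ids while `Recommended_Book_ID` holds 0-based row positions (an inference from the outputs, not confirmed by the code shown here). The helpers `title_at_position` and `title_for_one_based_id` are hypothetical; the four position/title pairs are copied from the tables in this section.

```python
# Position -> title pairs copied from the recommendation tables in this section.
title_by_position = {
    0: "The Hunger Games",
    1: "Harry Potter and the Philosopher's Stone",
    15: "Män som hatar kvinnor",
    16: "Catching Fire",
}

def title_at_position(pos):
    """Look up a title by 0-based row position (hypothetical helper)."""
    return title_by_position.get(pos, "<unknown>")

def title_for_one_based_id(book_id):
    """Mimic a lookup that expects 1-based ids (assumed behaviour of get_title)."""
    return title_by_position.get(book_id - 1, "<unknown>")

recommended_position = 16  # first neighbor of "The Hunger Games" in the table

print(title_at_position(recommended_position))       # Catching Fire
print(title_for_one_based_id(recommended_position))  # Män som hatar kvinnor
```

Under that assumption, reading the title from the `Book_Title` column of the returned DataFrame, or passing the position plus one to the id-based lookup, would keep the two outputs consistent.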


# **Testing the Results Based on 'Harry Potter and the Philosopher's Stone'**

In [110]:
# Top 15 recommendations for Harry Potter and the Philosopher's Stone
# Example usage: recommendations based on the KNN model
# Using "Harry Potter and the Philosopher's Stone" as the reference point, 15 books are recommended and stored in the DataFrame recommendations_for_HarryPotterAndThePhilosophersStone.
recommendations_for_HarryPotterAndThePhilosophersStone = get_recomm("Harry Potter and the Philosopher's Stone", num_neighbors=15)
recommendations_for_HarryPotterAndThePhilosophersStone

Dictionary of paired indices and book IDs: {0: "Harry Potter and the Philosopher's Stone", 1: 'To Kill a Mockingbird', 2: 'Memoirs of a Geisha', 3: 'Nineteen Eighty-Four', 4: 'The Great Gatsby', 5: ' The Fellowship of the Ring', 6: 'Lord of the Flies ', 7: 'Harry Potter and the Prisoner of Azkaban', 8: 'The Hobbit or There and Back Again', 9: 'Het Achterhuis: Dagboekbrieven 14 juni 1942 - 1 augustus 1944', 10: 'Jane Eyre', 11: 'The Hunger Games', 12: 'Harry Potter and the Goblet of Fire', 13: 'Harry Potter and the Chamber of Secrets', 14: 'Pride and Prejudice', 15: 'Animal Farm: A Fairy Story'}


Index,Book_Title,Recommended_Book_ID,Distance
0,Harry Potter and the Philosopher's Stone,1,9.992007e-16
1,To Kill a Mockingbird,3,0.3892592
2,Memoirs of a Geisha,32,0.4024131
3,Nineteen Eighty-Four,12,0.4097615
4,The Great Gatsby,4,0.412056
5,The Fellowship of the Ring,18,0.4146445
6,Lord of the Flies,27,0.4154206
7,Harry Potter and the Prisoner of Azkaban,17,0.4226021
8,The Hobbit or There and Back Again,6,0.4273337
9,Het Achterhuis: Dagboekbrieven 14 juni 1942 - ...,14,0.4317701
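
Row 0 reports a self-distance of 9.992007e-16 rather than exactly 0. A likely explanation is floating-point rounding in the distance computation. The sketch below assumes a cosine distance (not stated in this section) and a hypothetical rating vector, and shows how to treat such values as zero with a tolerance instead of an exact comparison.

```python
import math

def cosine_distance(u, v):
    """Return 1 - cosine similarity of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)

vector = [0.1, 0.2, 0.3, 0.4, 0.7]  # hypothetical rating vector

# Mathematically this is 0, but rounding can leave a residue near 1e-16.
self_distance = cosine_distance(vector, vector)
print(self_distance)

# Compare against zero with an absolute tolerance, never with ==.
print(math.isclose(self_distance, 0.0, abs_tol=1e-12))
```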


# **Testing the Results Based on 'The Great Gatsby'**

In [112]:
# Top 20 recommendations for The Great Gatsby
# Example usage: recommendations based on the KNN model
# Using "The Great Gatsby" as the reference point, 20 books are recommended and stored in the DataFrame recommendations_for_TheGreatGatsby.
recommendations_for_TheGreatGatsby = get_recomm("The Great Gatsby", num_neighbors=20)
recommendations_for_TheGreatGatsby

Dictionary of paired indices and book IDs: {0: 'The Great Gatsby', 1: 'To Kill a Mockingbird', 2: 'The Catcher in the Rye', 3: "Harry Potter and the Philosopher's Stone", 4: 'Memoirs of a Geisha', 5: 'Animal Farm: A Fairy Story', 6: 'Little Women', 7: 'Of Mice and Men ', 8: 'The Scarlet Letter', 9: 'Jane Eyre', 10: 'Nineteen Eighty-Four', 11: 'Pride and Prejudice', 12: 'Lord of the Flies ', 13: 'The Hobbit or There and Back Again', 14: 'Het Achterhuis: Dagboekbrieven 14 juni 1942 - 1 augustus 1944', 15: 'Män som hatar kvinnor', 16: 'The Lovely Bones', 17: 'Brave New World', 18: 'The Grapes of Wrath', 19: 'An Excellent conceited Tragedie of Romeo and Juliet', 20: ' The Fellowship of the Ring'}


Index,Book_Title,Recommended_Book_ID,Distance
0,The Great Gatsby,4,1.110223e-15
1,To Kill a Mockingbird,3,0.2966652
2,The Catcher in the Rye,7,0.3660262
3,Harry Potter and the Philosopher's Stone,1,0.412056
4,Memoirs of a Geisha,32,0.4127042
5,Animal Farm: A Fairy Story,13,0.4169207
6,Little Women,41,0.4231796
7,Of Mice and Men,31,0.426379
8,The Scarlet Letter,132,0.4280373
9,Jane Eyre,42,0.4400242
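
One quick sanity check on these outputs is symmetry: the distance from book A to book B should equal the distance from B to A. The sketch below copies a few distances from the Harry Potter and Great Gatsby tables and checks every pair that appears in both directions. The `results` dictionary and the `asymmetric_pairs` helper are hypothetical names introduced for illustration.

```python
import math

# Distances copied from the two result tables in this section.
results = {
    "Harry Potter and the Philosopher's Stone": {
        "The Great Gatsby": 0.412056,
        "To Kill a Mockingbird": 0.3892592,
    },
    "The Great Gatsby": {
        "Harry Potter and the Philosopher's Stone": 0.412056,
        "To Kill a Mockingbird": 0.2966652,
    },
}

def asymmetric_pairs(results, tol=1e-6):
    """Return (a, b) pairs whose distance is known both ways and disagrees."""
    mismatches = []
    for a, neighbors in results.items():
        for b, d_ab in neighbors.items():
            d_ba = results.get(b, {}).get(a)
            if d_ba is not None and not math.isclose(d_ab, d_ba, abs_tol=tol):
                mismatches.append((a, b))
    return mismatches

print(asymmetric_pairs(results))  # [] means every two-way pair agrees
```

Here the only pair reported in both directions is Harry Potter and The Great Gatsby, and both tables give 0.412056.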
