The file **liked_books.csv** contains the list of books the user has liked. The system searches for other users whose reading overlaps with this list and uses their ratings to produce recommendations through *collaborative filtering*.

In [1]:
import pandas as pd

my_books = pd.read_csv("liked_books.csv", index_col=0) # Load the CSV of books the user liked
my_books["book_id"] = my_books["book_id"].astype(str) # Convert the int book_id to str, for consistency with the other files

In [2]:
my_books # display the DataFrame of liked books

Unnamed: 0,user_id,book_id,rating,title
0,-1,2517439,5,"The Forever War (The Forever War, #1)"
1,-1,113576,5,The Smartest Guys in the Room: The Amazing Ris...
2,-1,35100,5,Battle Cry of Freedom
3,-1,228221,5,The Mask of Command
5,-1,17662739,5,"2001: A Space Odyssey (Space Odyssey, #1)"
6,-1,356824,5,India After Gandhi: The History of the World's...
7,-1,12125412,5,The Lady or the Tiger?: and Other Logic Puzzles
8,-1,139069,5,Endurance: Shackleton's Incredible Voyage
10,-1,76680,5,"Foundation (Foundation, #1)"
11,-1,1898,5,Into Thin Air: A Personal Account of the Mount...


The file **book_id_map.csv** provides the *mapping* between the book ids used in **goodreads_interactions.csv** and those used in **books_titles.json**.  
**books_titles.json** holds the book data, i.e. book_id, title, ratings, url, etc.
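
As a minimal sketch of what such a mapping looks like, the snippet below builds the same kind of dictionary from a small in-memory sample. The header name and the ids are invented for illustration, and `csv_book_mapping_demo` is a hypothetical name; the real file is read from disk in the cell that follows.

```python
import csv
import io

# Hypothetical sample in a two-column layout like book_id_map.csv
# (the header name and the ids are invented for this sketch).
sample = io.StringIO("book_id_csv,book_id\n0,34684622\n1,34536488\n2,34017076\n")

reader = csv.reader(sample)
next(reader)  # skip the header row
csv_book_mapping_demo = {csv_id: book_id for csv_id, book_id in reader}

print(csv_book_mapping_demo["1"])  # look up a book_id by its CSV id
```

Both keys and values stay strings, which matches the `astype(str)` conversion applied to `my_books["book_id"]`.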

In [3]:
csv_book_mapping = {} # maps the id used in the interactions CSV to the book_id

with open("book_id_map.csv", "r") as f:
    for line in f: # read the CSV file line by line
        csv_id, book_id = line.strip().split(",") # every line is split on ','
        csv_book_mapping[csv_id] = book_id # key: csv_id, value: book_id

In [4]:
book_set = set(my_books["book_id"]) # a set holds the unique liked book ids and gives fast membership tests

The file **goodreads_interactions.csv** provides the full history of how users have rated different books on Goodreads.

In [5]:
overlap_users = {} # dict: user_id -> number of books that user shares with our liked list
# this file has many lines, so we read it line by line to keep memory use low
with open("goodreads_interactions.csv", 'r') as f:
    for line in f:
        user_id, csv_id, _, rating, _ = line.split(",")
        
        book_id = csv_book_mapping.get(csv_id)
        
        if book_id in book_set: # if we have read this book
            if user_id not in overlap_users:
                overlap_users[user_id] = 1 # first shared book: add the user
            else:
                overlap_users[user_id] += 1 # one more book in common with this user
                # thus the keys of overlap_users are user_ids and the values count
                # how many books from our list appear in that user's interactions

In [6]:
len(overlap_users) # number of users who share at least one book with us

316341
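
The counting step above can be sketched on a toy sample. The data and the names `liked`, `rows` and `overlap_demo` are hypothetical; `collections.Counter` is used here as a compact stand-in for the explicit `if`/`else` dictionary update, not as the notebook's own method.

```python
from collections import Counter

# Hypothetical data: the books we liked and a few (user_id, book_id) interactions.
liked = {"b1", "b2", "b3"}
rows = [("u1", "b1"), ("u1", "b2"), ("u2", "b9"), ("u3", "b3"), ("u1", "b7")]

# Count, per user, how many of their books are also in our liked set.
overlap_demo = Counter(user for user, book in rows if book in liked)

print(dict(overlap_demo))
```

A user with no shared books (`u2` here) never enters the counter, which is why the length of `overlap_users` counts only users with at least one book in common.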

In [7]:
# this line keeps only the users whose overlap exceeds a threshold
# Threshold: the overlap must be more than 20% of our total book count
filtered_overlap_users = {k for k in overlap_users if overlap_users[k] > my_books.shape[0]/5}

In [8]:
len(filtered_overlap_users) # number of users who are above the threshold

1258
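
To make the threshold concrete, here is a small sketch with made-up numbers. The counts, the total of 10 liked books and the names `n_liked`, `threshold` and `kept` are assumptions for illustration only.

```python
# Hypothetical overlap counts and liked-book total, for illustration only.
overlap_demo = {"u1": 5, "u2": 2, "u3": 3, "u4": 1}
n_liked = 10

threshold = n_liked / 5  # 20% of the liked-book count
# The comparison is strict, so a user sitting exactly on the threshold is dropped.
kept = {user for user, count in overlap_demo.items() if count > threshold}

print(threshold, sorted(kept))
```

With 10 liked books the threshold is 2.0, so a user needs at least 3 shared books to be kept.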

The main **collaborative filtering** process starts from the next cell. We look at people who have read the same books as us and recommend books based on their other interactions.

In [9]:
interactions_list = []
# this collects the ratings given by the overlapping users to every book in their history
with open("goodreads_interactions.csv", 'r') as f:
    for line in f:
        user_id, csv_id, _, rating, _ = line.split(",")
        # if the user is in the filtered overlap set, keep this interaction
        if user_id in filtered_overlap_users: 
            book_id = csv_book_mapping[csv_id]
            interactions_list.append([user_id, book_id, rating])

In [10]:
len(interactions_list) 

5638701

In [11]:
interactions_list[0]

['282', '627206', '4']

In [12]:
# this loads the ratings of different users into pandas
interactions = pd.DataFrame(interactions_list, columns=["user_id", "book_id", "rating"])

In [13]:
# this combines our ratings with everyone else's
interactions = pd.concat([my_books[["user_id", "book_id", "rating"]], interactions])

In [14]:
interactions # display the combined interactions (ours plus everyone else's)

Unnamed: 0,user_id,book_id,rating
0,-1,2517439,5
1,-1,113576,5
2,-1,35100,5
3,-1,228221,5
5,-1,17662739,5
...,...,...,...
5638696,804100,475178,0
5638697,804100,186074,0
5638698,804100,153008,0
5638699,804100,45107,0


In [15]:
interactions["book_id"] = interactions["book_id"].astype(str)
interactions["user_id"] = interactions["user_id"].astype(str)
interactions["rating"] = pd.to_numeric(interactions["rating"])

In [16]:
# this maps every user_id to an integer code (0, 1, 2, ...)
# the code is used as the row position of the user in the ratings matrix
interactions["user_index"] = interactions["user_id"].astype("category").cat.codes

In [17]:
# this maps every book_id to an integer code in the same way
# the code is used as the column position of the book in the ratings matrix
interactions["book_index"] = interactions["book_id"].astype("category").cat.codes

In [18]:
# We build a users x books matrix of ratings from user_index and book_index
# We use a sparse matrix to save memory, since most entries are empty
from scipy.sparse import coo_matrix

ratings_mat_coo = coo_matrix((interactions["rating"], (interactions["user_index"], interactions["book_index"])))

In [19]:
ratings_mat_coo.shape

(1259, 802870)
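
The indexing and matrix construction can be sketched on a handful of rows. The ids and ratings below are hypothetical, and `demo` and `mat` are illustrative names; the point is only to show how category codes become row and column positions.

```python
import pandas as pd
from scipy.sparse import coo_matrix

# Hypothetical interactions; ids are strings, as in the notebook.
demo = pd.DataFrame({
    "user_id": ["-1", "-1", "282", "4133"],
    "book_id": ["35100", "1898", "35100", "5359"],
    "rating": [5, 5, 4, 3],
})

# Category codes give each distinct id a contiguous integer position.
demo["user_index"] = demo["user_id"].astype("category").cat.codes
demo["book_index"] = demo["book_id"].astype("category").cat.codes

# Rows are users, columns are books, values are ratings.
mat = coo_matrix((
    demo["rating"].to_numpy(),
    (demo["user_index"].to_numpy(), demo["book_index"].to_numpy()),
))

print(mat.shape)      # (number of distinct users, number of distinct books)
print(mat.toarray())  # dense view, only sensible for tiny examples
```

Because string categories are sorted lexicographically, the id `"-1"` sorts before any id starting with a digit, which is why our own user ends up at index 0.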

In [20]:
# We convert the matrix from coordinate (COO) format to compressed sparse row (CSR) format, which supports row slicing
ratings_mat = ratings_mat_coo.tocsr()

In [21]:
# We look up the rows of our own user ("-1") to find its row position: user_index is 0
interactions[interactions["user_id"] == "-1"]

Unnamed: 0,user_id,book_id,rating,user_index,book_index
0,-1,2517439,5,0,414880
1,-1,113576,5,0,38971
2,-1,35100,5,0,575858
3,-1,228221,5,0,356004
5,-1,17662739,5,0,214285
6,-1,356824,5,0,581743
7,-1,12125412,5,0,59763
8,-1,139069,5,0,124430
10,-1,76680,5,0,722098
11,-1,1898,5,0,276178


In [22]:
my_index = 0

In [23]:
# We use cosine similarity to find which users are most similar to us
# Similarity is based on what each user read and how they rated it
from sklearn.metrics.pairwise import cosine_similarity
# We take our own row of ratings and compare it with every user's row
similarity = cosine_similarity(ratings_mat[my_index,:], ratings_mat).flatten()

In [24]:
similarity[0] # similarity with user_index 0, i.e. ourselves (1 up to floating-point rounding)

0.9999999999999999
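
As a sketch of what the similarity measures, the snippet below computes the cosine formula directly with NumPy on a tiny dense matrix. The ratings are made up, and this plain-NumPy version is a stand-in for illustration; the notebook itself uses `cosine_similarity` from scikit-learn on the sparse matrix.

```python
import numpy as np

# Hypothetical dense ratings: rows are users, columns are books, 0 = no rating.
ratings = np.array([
    [5.0, 5.0, 0.0],   # row 0: our own ratings
    [4.0, 4.0, 0.0],   # same books, slightly lower ratings
    [0.0, 0.0, 3.0],   # no books in common with us
])

me = ratings[0]
# cosine(a, b) = (a . b) / (|a| * |b|)
norms = np.linalg.norm(ratings, axis=1) * np.linalg.norm(me)
similarity_demo = ratings @ me / norms

print(similarity_demo)
```

Cosine similarity depends on direction rather than magnitude, so the second user scores as high as we do against ourselves, while a user with no shared books scores 0.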

In [25]:
# This cell finds the indices of the users who are most similar to us
# We take the 15 highest similarities; this includes our own row (index 0), which is removed later
import numpy as np

indices = np.argpartition(similarity, -15)[-15:]

In [26]:
indices

array([1188,  942,  218,  129,  496,  435, 1208,  795, 1213, 1210, 1143,
        321,  294,  862,    0], dtype=int64)
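
A short sketch of how `np.argpartition` picks the top entries, using made-up similarity values. The names `scores`, `k`, `top_k` and `top_k_sorted` are hypothetical.

```python
import numpy as np

scores = np.array([1.0, 0.2, 0.9, 0.1, 0.7, 0.4])  # hypothetical similarities
k = 3

# argpartition places the k largest values in the last k slots,
# but it does not sort them among themselves.
top_k = np.argpartition(scores, -k)[-k:]

# Optional extra step: order the selected indices from best to worst.
top_k_sorted = top_k[np.argsort(scores[top_k])[::-1]]

print(sorted(top_k.tolist()), top_k_sorted.tolist())
```

This explains why the indices in the output above are not in order of similarity: partitioning is cheaper than a full sort, and the ranking within the top group is not needed for the next step.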

In [27]:
# We select every interaction that belongs to one of these user indices
similar_users = interactions[interactions["user_index"].isin(indices)].copy()

In [28]:
# We remove our entry from similar_users
similar_users = similar_users[similar_users["user_id"]!="-1"]

In [29]:
similar_users

Unnamed: 0,user_id,book_id,rating,user_index,book_index
45312,4133,5359,3,942,632143
45313,4133,10464963,4,942,13492
45314,4133,3858,3,942,593622
45315,4133,11827808,4,942,51904
45316,4133,7913305,4,942,732465
...,...,...,...,...,...
5638521,712588,32388712,3,1143,543119
5638522,712588,16322,5,1143,183365
5638523,712588,860543,0,1143,759827
5638524,712588,853510,5,1143,756768


In [30]:
# We group by book_id to find the books that similar_users have interacted with
# For each book we compute how often it appears and its average rating
book_recs = similar_users.groupby("book_id").rating.agg(['count', 'mean'])

In [31]:
book_recs

book_id,count,mean
1,6,3.833333
100322,1,0.000000
100365,1,0.000000
10046142,1,0.000000
1005,3,0.000000
...,...,...
99561,2,2.500000
99610,1,3.000000
99664,1,4.000000
9969571,3,2.333333
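
The aggregation can be sketched on a few hypothetical rows. `demo` and `recs_demo` are illustrative names and the ratings are invented.

```python
import pandas as pd

# Hypothetical interactions from similar users.
demo = pd.DataFrame({
    "book_id": ["1", "1", "1005", "1", "1005"],
    "rating": [5, 4, 0, 3, 0],
})

# One row per book: how many similar users have it, and their mean rating.
recs_demo = demo.groupby("book_id").rating.agg(["count", "mean"])

print(recs_demo)
```

Ratings of 0 are included in both the count and the mean. If 0 stands for "no rating given" in this dataset (an assumption worth checking against the data description), such rows pull the mean down, which would account for the many `0.000000` means in the output above.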


In [32]:
# Load books_titles.json into a DataFrame
books_titles = pd.read_json("books_titles.json")
books_titles["book_id"] = books_titles["book_id"].astype(str) # to avoid inconsistency

In [33]:
# Merge the recommendations with book details (combine 2 datasets)
book_recs = book_recs.merge(books_titles, how="inner", on="book_id")

In [34]:
book_recs

Unnamed: 0,book_id,count,mean,title,ratings,url,cover_image,mod_title
0,1,6,3.833333,Harry Potter and the Half-Blood Prince (Harry ...,1713866,https://www.goodreads.com/book/show/1.Harry_Po...,https://images.gr-assets.com/books/1361039191m...,harry potter and the halfblood prince harry po...
1,100322,1,0.000000,Assata: An Autobiography,11057,https://www.goodreads.com/book/show/100322.Assata,https://images.gr-assets.com/books/1328857268m...,assata an autobiography
2,100365,1,0.000000,The Mote in God's Eye,48736,https://www.goodreads.com/book/show/100365.The...,https://images.gr-assets.com/books/1399490037m...,the mote in gods eye
3,10046142,1,0.000000,Dancing in the Glory of Monsters: The Collapse...,2391,https://www.goodreads.com/book/show/10046142-d...,https://images.gr-assets.com/books/1328757755m...,dancing in the glory of monsters the collapse ...
4,1005,3,0.000000,Think and Grow Rich,87634,https://www.goodreads.com/book/show/1005.Think...,https://s.gr-assets.com/assets/nophoto/book/11...,think and grow rich
...,...,...,...,...,...,...,...,...
2843,99561,2,2.500000,Looking for Alaska,804587,https://www.goodreads.com/book/show/99561.Look...,https://images.gr-assets.com/books/1394798630m...,looking for alaska
2844,99610,1,3.000000,The Best Laid Plans,17434,https://www.goodreads.com/book/show/99610.The_...,https://images.gr-assets.com/books/1353374848m...,the best laid plans
2845,99664,1,4.000000,The Painted Veil,24606,https://www.goodreads.com/book/show/99664.The_...,https://images.gr-assets.com/books/1320421719m...,the painted veil
2846,9969571,3,2.333333,Ready Player One,376328,https://www.goodreads.com/book/show/9969571-re...,https://images.gr-assets.com/books/1500930947m...,ready player one
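
A small sketch of what the inner merge does, with invented ids and titles. The names `recs`, `titles` and `merged` are hypothetical.

```python
import pandas as pd

# Hypothetical recommendation counts, indexed by book_id as after the groupby.
recs = pd.DataFrame(
    {"count": [6, 1]},
    index=pd.Index(["1", "424242"], name="book_id"),
)
# Hypothetical title table; note that "424242" has no entry here.
titles = pd.DataFrame({"book_id": ["1", "1005"], "title": ["Book A", "Book B"]})

# An inner merge keeps only the book_ids present on both sides.
merged = recs.merge(titles, how="inner", on="book_id")

print(merged)
```

With `how="inner"`, any recommended book without a matching entry in the titles table is silently dropped, so the merged frame can have fewer rows than the recommendations it started from.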


In [35]:
# adjusted_count multiplies the count among similar users by the ratio of that count
# to the book's total number of ratings, so books that are popular with everyone are penalized
book_recs["adjusted_count"] = book_recs["count"] * (book_recs["count"] / book_recs["ratings"])
# score indicates how much we might like each book
book_recs["score"] = book_recs["mean"] * book_recs["adjusted_count"]
# Now we remove any books that we have already read
# First we remove books by book_id
book_recs = book_recs[~book_recs["book_id"].isin(my_books["book_id"])]
# Then we remove books by normalized title
my_books["mod_title"] = my_books["title"].str.replace("[^a-zA-Z0-9 ]", "", regex=True).str.lower()
my_books["mod_title"] = my_books["mod_title"].str.replace(r"\s+", " ", regex=True)
book_recs = book_recs[~book_recs["mod_title"].isin(my_books["mod_title"])]
# Keep only books with an average rating of 4 or more
book_recs = book_recs[book_recs["mean"] >=4]
# Remove any book that appeared 2 or fewer times in our recommendations
book_recs = book_recs[book_recs["count"]>2]
# We sort the recommendations by mean rating, highest first
top_recs = book_recs.sort_values("mean", ascending=False)

In [36]:
def make_clickable(val):
    return '<a target="_blank" href="{}">Goodreads</a>'.format(val)

def show_image(val):
    return '<a href="{}"><img src="{}" width=50></img></a>'.format(val, val)

top_recs.style.format({'url': make_clickable, 'cover_image': show_image})

Unnamed: 0,book_id,count,mean,title,ratings,url,cover_image,mod_title,adjusted_count,score
2260,62291,5,4.8,"A Storm of Swords (A Song of Ice and Fire, #3)",477834,Goodreads,,a storm of swords a song of ice and fire 3,5.2e-05,0.000251
600,157993,3,4.333333,The Little Prince,763309,Goodreads,,the little prince,1.2e-05,5.1e-05
1100,22034,3,4.333333,The Godfather,259150,Goodreads,,the godfather,3.5e-05,0.00015
1173,2318271,3,4.333333,The Last Lecture,245804,Goodreads,,the last lecture,3.7e-05,0.000159
1906,4381,3,4.333333,Fahrenheit 451,591506,Goodreads,,fahrenheit 451,1.5e-05,6.6e-05
243,119322,4,4.25,"The Golden Compass (His Dark Materials, #1)",973154,Goodreads,,the golden compass his dark materials 1,1.6e-05,7e-05
1441,2767793,4,4.25,"The Hero of Ages (Mistborn, #3)",149260,Goodreads,,the hero of ages mistborn 3,0.000107,0.000456
2558,78983,4,4.25,"Kane and Abel (Kane and Abel, #1)",75215,Goodreads,,kane and abel kane and abel 1,0.000213,0.000904
244,119324,3,4.0,"The Subtle Knife (His Dark Materials, #2)",246697,Goodreads,,the subtle knife his dark materials 2,3.6e-05,0.000146
398,13497,4,4.0,"A Feast for Crows (A Song of Ice and Fire, #4)",437398,Goodreads,,a feast for crows a song of ice and fire 4,3.7e-05,0.000146
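
The effect of `adjusted_count` and `score` can be seen on two hypothetical books that differ only in overall popularity. The titles and numbers are invented, and sorting by `score` is shown purely as an alternative to explore; the notebook itself sorts by `mean`.

```python
import pandas as pd

demo = pd.DataFrame({
    "title": ["Niche favourite", "Global bestseller"],
    "count": [4, 4],                # similar users who have the book
    "mean": [4.25, 4.25],           # their mean rating
    "ratings": [80_000, 800_000],   # total ratings on Goodreads
})

# Same formulas as in the notebook cell.
demo["adjusted_count"] = demo["count"] * (demo["count"] / demo["ratings"])
demo["score"] = demo["mean"] * demo["adjusted_count"]

# Alternative ordering (an assumption, not the notebook's choice): rank by score.
by_score = demo.sort_values("score", ascending=False)

print(by_score[["title", "adjusted_count", "score"]])
```

Both books have the same count and mean, yet the less widely rated one gets a score ten times higher, because the same four similar users make up a larger share of its total ratings.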
