<a href="https://www.kaggle.com/code/captaindylan/book-recommender-by-topics?scriptVersionId=109587293" target="_blank"><img align="left" alt="Kaggle" title="Open in Kaggle" src="https://kaggle.com/static/images/open-in-kaggle.svg"></a>

In [1]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

/kaggle/input/books-movies-reviews/books_films_reviews.sqlite3


# Books and movies dataset
## Overview
This notebook serves as an example of how to process the books-movies-reviews dataset.  The dataset contains a set of books and the movies adapted from them.

### The Data
The raw data was scraped from Wikipedia, Goodreads, and IMDb, then cleaned and organized.  A gensim topic model was created from the book corpus and the movie corpus, and topics were assigned to the books and movies.

### Sample code
In this notebook we read the data for the books and the topics assigned to them.  We take the books' topic assignments and build a matrix in which each row holds one book's topic percentages.  Then we compute a similarity matrix across all books and test it by finding books similar to a given title.

In [2]:
import sqlite3

# Read sqlite query results into a pandas DataFrame
con = sqlite3.connect("/kaggle/input/books-movies-reviews/books_films_reviews.sqlite3")
df = pd.read_sql_query("SELECT * from book_topics_by_book_id order by book_id", con)
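
Before querying, it can help to check which tables exist and what column types they declare. The following is a minimal sketch run against an in-memory stand-in database rather than the real file; the column types shown are assumptions, and only the table and column names are taken from this notebook.

```python
import sqlite3
from contextlib import closing

import pandas as pd

# In-memory stand-in for the real file; the declared types are assumptions.
with closing(sqlite3.connect(":memory:")) as demo_con:
    demo_con.execute(
        "CREATE TABLE book_topics_by_book_id "
        "(book_id TEXT, topicid INTEGER, topicprob REAL)"
    )
    demo_con.execute("INSERT INTO book_topics_by_book_id VALUES ('1', 0, 0.25)")
    demo_con.commit()

    # sqlite_master lists every table and view in the database
    tables = pd.read_sql_query(
        "SELECT name FROM sqlite_master "
        "WHERE type IN ('table', 'view') ORDER BY name",
        demo_con,
    )
    # PRAGMA table_info reports each column's name and declared type
    columns = pd.read_sql_query(
        "PRAGMA table_info(book_topics_by_book_id)", demo_con
    )

print(tables)
print(columns[["name", "type"]])
```

Wrapping the connection in `closing` releases the file handle when the block ends, which the plain `sqlite3.connect` call does not do on its own.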

In [3]:
# Important: some column types in the dataset are inconsistent.  Here we want book_id to always be numeric
df["book_id"] = pd.to_numeric(df["book_id"])
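
Note that `pd.to_numeric` returns floats when a value such as "3.0" is present, and raises on values it cannot parse. As a hedged sketch on toy data (the sample values below are invented for illustration), `errors="coerce"` can be used to surface bad ids before casting to a true integer type:

```python
import pandas as pd

# Toy ids of mixed type, standing in for the real book_id column
raw_ids = pd.Series(["1", 2, "3.0", "n/a"])

# errors="coerce" turns unparseable values into NaN instead of raising
numeric_ids = pd.to_numeric(raw_ids, errors="coerce")
bad_rows = raw_ids[numeric_ids.isna()]
print(bad_rows)

# Once the bad rows are dropped, the ids can be cast to an integer dtype
clean_ids = numeric_ids.dropna().astype("int64")
print(clean_ids.tolist())
```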

Pandas provides a handy function to create a pivot table, with a parameter to set a fill value wherever the data has no entry.  This creates a table with N rows (one per book) and 25 columns (one per topic).

In [4]:
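# Passing aggfunc as the string "sum" avoids the FutureWarning newer pandas versions raise for np.sum
book_df = df.pivot_table(values='topicprob', index=['book_id'],
                    columns=['topicid'], aggfunc="sum", fill_value=0)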

In [5]:
book_df.head()

book_id,0,1,2,3,4,5,6,7,8,9,...,15,16,17,18,19,20,21,22,23,24
1,0.0,0.059232,0.212774,0.014783,0.026242,0.05015,0.0,0.035432,0.021628,0.11986,...,0.0,0.248951,0.010361,0.0,0.0,0.0,0.06209,0.017426,0.019287,0.017191
2,0.0,0.043166,0.24952,0.0,0.020014,0.059316,0.0,0.03089,0.011256,0.103281,...,0.013226,0.248836,0.0,0.0,0.0,0.0,0.087513,0.012659,0.018427,0.011505
3,0.0,0.054286,0.269742,0.012997,0.020168,0.107609,0.0,0.047831,0.012898,0.063758,...,0.013647,0.146733,0.010919,0.0,0.0,0.0,0.05025,0.016362,0.022101,0.0
5,0.0,0.060975,0.246407,0.010719,0.047789,0.072775,0.0,0.075175,0.027414,0.092118,...,0.015072,0.167944,0.0,0.0,0.0,0.0,0.032467,0.012964,0.013444,0.011853
6,0.0,0.045385,0.263036,0.011657,0.031252,0.074425,0.0,0.04441,0.014734,0.071082,...,0.016286,0.230089,0.0,0.0,0.0,0.0,0.06612,0.013635,0.020587,0.010801
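
To see what the pivot does on a small scale, here is a sketch using a toy frame with the same three column names. The values are invented for illustration: duplicate (book, topic) pairs are summed, and missing pairs are filled with 0.

```python
import pandas as pd

# Toy long-format data: book 2 has two entries for topic 1 and none for topic 2
toy = pd.DataFrame({
    "book_id":   [1, 1, 2, 2, 2],
    "topicid":   [0, 2, 0, 1, 1],
    "topicprob": [0.7, 0.3, 0.5, 0.2, 0.1],
})

toy_pivot = toy.pivot_table(values="topicprob", index="book_id",
                            columns="topicid", aggfunc="sum", fill_value=0)
print(toy_pivot)
```

Book 1 has no entry for topic 1, so that cell becomes 0, while book 2's two entries for topic 1 are added together.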


Turn the data frame into a matrix, then use scikit-learn's metrics module to calculate the pairwise cosine distances, treating each row as an "embedding" in the topic space.  Subtracting the distances from 1 gives cosine similarities.

In [6]:
matrix = np.array(book_df)

In [7]:
from sklearn.metrics import pairwise_distances

# Cosine similarity = 1 - cosine distance; entry [i, j] compares book i with book j
similarities = 1-pairwise_distances(matrix, metric="cosine")
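
As a check on what this computes, the sketch below works out cosine similarity by hand on a toy matrix (values invented for illustration) and compares it with the `pairwise_distances` result.

```python
import numpy as np
from sklearn.metrics import pairwise_distances

toy_matrix = np.array([
    [0.6, 0.4, 0.0],
    [0.3, 0.2, 0.0],   # same direction as row 0, half the length
    [0.0, 0.0, 1.0],   # shares no topic with row 0
])

# Cosine similarity: dot product of two rows divided by the product of their norms
norms = np.linalg.norm(toy_matrix, axis=1, keepdims=True)
unit_rows = toy_matrix / norms
manual = unit_rows @ unit_rows.T

via_sklearn = 1 - pairwise_distances(toy_matrix, metric="cosine")
print(np.allclose(manual, via_sklearn))
print(manual.round(3))
```

Because cosine similarity ignores vector length, rows 0 and 1 score 1 even though their values differ. The manual version divides by the row norm, so it would fail on a row of all zeros.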


Here we retrieve the book details so we can examine the titles of similar books

In [8]:
df_bookinfo = pd.read_sql_query("SELECT * from books order by book_id", con)
df_bookinfo.head()

,book_id,title,year,avg_rating,rating_count,review_count,series,series_num,author,description,length,five_stars,four_stars,three_stars,two_stars,one_star,cover_image,standardized_rating,normalized_rating
0,1.0,Harry Potter and the Half-Blood Prince,2005,4.56,2056192.0,32941.0,Harry Potter,6,J.K. Rowling,When Harry Potter and the Half-Blood Prince op...,652.0,1366613.0,510219.0,147878.0,23181.0,8301.0,https://images.gr-assets.com/books/1361039191l...,2.483514,0.853333
1,2.0,Harry Potter and the Order of the Phoenix,2003,4.48,2106674.0,34749.0,Harry Potter,5,J.K. Rowling,There is a door at the end of a silent corrido...,870.0,1315756.0,550836.0,195471.0,34007.0,10604.0,https://images.gr-assets.com/books/1546910265l...,2.205437,0.826667
2,3.0,Harry Potter and the Sorcerer's Stone,1997,4.46,5799402.0,93027.0,Harry Potter,1,J.K. Rowling,Harry Potter's life is miserable. His parents ...,320.0,3719830.0,1353816.0,516455.0,117095.0,92206.0,https://images.gr-assets.com/books/1474154022l...,2.135918,0.82
3,5.0,Harry Potter and the Prisoner of Azkaban,1999,4.55,2298966.0,44970.0,Harry Potter,3,J.K. Rowling,Harry Potter's third year at Hogwarts is full ...,435.0,1516987.0,571921.0,179958.0,22200.0,7900.0,https://images.gr-assets.com/books/1499277281l...,2.448755,0.85
4,6.0,Harry Potter and the Goblet of Fire,2000,4.55,2151870.0,37860.0,Harry Potter,4,J.K. Rowling,Harry Potter is midway through his training as...,734.0,1406331.0,551538.0,164462.0,21925.0,7614.0,https://images.gr-assets.com/books/1361482611l...,2.448755,0.85


Use argsort to get the row positions sorted by their similarity to the first book (*Harry Potter and the Half-Blood Prince*).  The matches array is ordered from least to most similar, so the last ten entries are the closest matches and the final entry is the book itself.

In [9]:
matches = np.argsort(similarities[0])
matches[-10:]

array([195, 344, 115,   3, 217, 568, 364,   4,   1,   0])

Here are the similarity values.

In [10]:
similarities[0][matches[-10:]]

array([0.9340955 , 0.93617758, 0.95043715, 0.95256799, 0.96938585,
       0.9742625 , 0.97508604, 0.97584823, 0.98876016, 1.        ])
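
The argsort pattern above can be wrapped in a small helper that returns the best matches first and leaves out the query book. This is a sketch on a toy similarity matrix; `top_matches` is a hypothetical name, not something defined by the dataset or by any library.

```python
import numpy as np

def top_matches(similarities, row, k=5):
    """Return (positions, scores) of the k rows most similar to `row`,
    best first, excluding `row` itself."""
    order = np.argsort(similarities[row])[::-1]   # descending similarity
    order = order[order != row][:k]               # drop the query row
    return order, similarities[row][order]

# Toy symmetric similarity matrix with ones on the diagonal
toy_sim = np.array([
    [1.0, 0.9, 0.2, 0.7],
    [0.9, 1.0, 0.1, 0.4],
    [0.2, 0.1, 1.0, 0.3],
    [0.7, 0.4, 0.3, 1.0],
])
positions, scores = top_matches(toy_sim, row=0, k=2)
print(positions, scores)
```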

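The values in matches are row positions, so we pass them to iloc to look up the corresponding rows in the book data frame.  Both tables were read in book_id order, so the positions line up as long as each contains the same set of books.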

In [11]:
df_bookinfo.iloc[matches[-10:]][["title","year","author"]]

,title,year,author
195,Twilight,2008,Stephenie Meyer
344,Prince Caspian,1955,C.S. Lewis
115,Harry Potter and the Chamber of Secrets,1998,J.K. Rowling
3,Harry Potter and the Prisoner of Azkaban,1999,J.K. Rowling
217,New Moon,2008,Stephenie Meyer
568,The Unknown Soldier,1954,V
364,Harry Potter and the Deathly Hallows,2007,J.K. Rowling
4,Harry Potter and the Goblet of Fire,2000,J.K. Rowling
1,Harry Potter and the Order of the Phoenix,2003,J.K. Rowling
0,Harry Potter and the Half-Blood Prince,2005,J.K. Rowling
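
The positional lookup only works while the topic matrix and the book table hold the same books in the same order. A more defensive variant, sketched here on toy data with invented values, translates positions to book ids first and then looks the ids up:

```python
import pandas as pd

# Toy stand-ins: topic rows exist for books 1, 3 and 5,
# while the info table also lists book 2
toy_topics = pd.DataFrame({"t0": [0.9, 0.8, 0.1]},
                          index=pd.Index([1, 3, 5], name="book_id"))
toy_info = pd.DataFrame({"book_id": [1.0, 2.0, 3.0, 5.0],
                         "title": ["A", "B", "C", "D"]})

match_positions = [1, 0]   # row positions in the topic matrix, e.g. from argsort

# Translate positions to book ids, then look the ids up in the info table
match_ids = toy_topics.index[match_positions]
by_id = (toy_info
         .assign(book_id=toy_info["book_id"].astype("int64"))
         .set_index("book_id"))
print(by_id.loc[match_ids, "title"].tolist())

# Positional lookup picks a wrong row once the two tables diverge
print(toy_info.iloc[match_positions]["title"].tolist())
```

In this toy case the id-based lookup returns the titles for books 3 and 1, while the positional lookup returns book 2 in place of book 3.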


Our experiment is a success.  Six of the ten closest matches to *Harry Potter and the Half-Blood Prince*, including the book itself, are Harry Potter books.

Let's try one more example, this time with a James Bond book.

In [12]:
df_bookinfo.iloc[38][["title","year","author"]]

title       Moonraker
year             1955
author    Ian Fleming
Name: 38, dtype: object

In [13]:
matches = np.argsort(similarities[38])
df_bookinfo.iloc[matches[-10:]][["title","year","author"]]

,title,year,author
169,Clear and Present Danger,1989,Tom Clancy
45,From Russia With Love,1957,Ian Fleming
1248,The Bishop's Wife,1928,Robert Nathan
43,On Her Majesty's Secret Service,1963,Ian Fleming
134,The Hunt for Red October,1984,Tom Clancy
41,Diamonds Are Forever,1956,Ian Fleming
410,Three-Ten to Yuma and Other Stories,1953,Elmore Leonard
42,You Only Live Twice,1964,Ian Fleming
411,The Man With the Golden Gun,1965,Ian Fleming
38,Moonraker,1955,Ian Fleming


Again the process yielded close matches: six of the ten are Ian Fleming's Bond novels (including *Moonraker* itself) and two are Tom Clancy thrillers.  This suggests that the topics and their assignments are useful enough to find books with similar topic profiles.
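
One rough way to put a number on "close matches" is to count how many of a book's nearest neighbours share its author. The sketch below runs on toy data with invented similarity values; `same_author_share` is a hypothetical helper, and shared authorship is only a crude proxy for topical similarity.

```python
import numpy as np
import pandas as pd

def same_author_share(similarities, authors, row, k=9):
    """Fraction of the k nearest neighbours of `row` (query excluded)
    that share its author."""
    order = np.argsort(similarities[row])[::-1]
    neighbours = order[order != row][:k]
    return float((authors.iloc[neighbours] == authors.iloc[row]).mean())

# Toy similarity matrix and author column, aligned by row position
toy_sim = np.array([
    [1.0, 0.9, 0.8, 0.1],
    [0.9, 1.0, 0.7, 0.2],
    [0.8, 0.7, 1.0, 0.3],
    [0.1, 0.2, 0.3, 1.0],
])
toy_authors = pd.Series(["Fleming", "Fleming", "Clancy", "Leonard"])
print(same_author_share(toy_sim, toy_authors, row=0, k=2))
```

With the real data, this could be called as `same_author_share(similarities, df_bookinfo["author"], row=38)`, assuming the rows of the two objects line up.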