## Producing book recommendations 

To ground my learning in a practical problem, I'll be using the [Goodbooks-10k dataset](http://fastml.com/goodbooks-10k-a-new-dataset-for-book-recommendations/) 

[Goodbooks](https://www.goodbooks.io/)  is an online book recommendation service that pairs readers with their next favourite read. The dataset contains information on 10,000 books from the service’s catalogue, along with ~80,000 reviews generated by site visitors. I’ll use this rich information to try our best to recommend what good books you (or your friends/family) should read next.


### Dataset overview: Brief EDA

I'll be making use of two main files derived from the dataset$^*$;
 
 - **Books_with_tags.csv**: I created this file for the convenience of this train. It contains book_id, title, author, date, etc. data from the original `books.csv` file, along with user tags merged from the `book_tags.csv` and `tags.csv` files. 
 
 
 - **Book_ratings.csv**: This is a subset of the `ratings.csv` file, with a field for the book titles added for convenience. This file contains the important mapping between users and item ratings.
 
The full dataset can be found [here](https://github.com/zygmuntz/goodbooks-10k).

In [36]:
import numpy as np
import pandas as pd
import scipy as sp
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.metrics.pairwise import cosine_similarity

import warnings
warnings.filterwarnings('ignore')

In [13]:
# Load the datasets
books_df = pd.read_csv('books.csv')
book_tags_df = pd.read_csv('book_tags.csv')
tags_df = pd.read_csv('tags.csv')
ratings_df = pd.read_csv('ratings.csv')

In [14]:
# Display column names for debugging
print("Books DataFrame columns:", books_df.columns)
print("Book Tags DataFrame columns:", book_tags_df.columns)
print("Tags DataFrame columns:", tags_df.columns)

Books DataFrame columns: Index(['book_id', 'goodreads_book_id', 'best_book_id', 'work_id',
       'books_count', 'isbn', 'isbn13', 'authors', 'original_publication_year',
       'original_title', 'title', 'language_code', 'average_rating',
       'ratings_count', 'work_ratings_count', 'work_text_reviews_count',
       'ratings_1', 'ratings_2', 'ratings_3', 'ratings_4', 'ratings_5',
       'image_url', 'small_image_url'],
      dtype='object')
Book Tags DataFrame columns: Index(['goodreads_book_id', 'tag_id', 'count'], dtype='object')
Tags DataFrame columns: Index(['tag_id', 'tag_name'], dtype='object')


In [20]:
# Merge book_tags with tags to get tag names
book_tags_merged = pd.merge(book_tags_df, tags_df, left_on='tag_id', right_on='tag_id')

# Aggregate tags for each book
book_tags_aggregated = book_tags_merged.groupby('goodreads_book_id')['tag_name'].apply(list).reset_index()

# Display the first few rows for debugging
print("Aggregated Book Tags DataFrame:", book_tags_aggregated.head())

# Merge with books data
books_with_tags_df = pd.merge(books_df, book_tags_aggregated, left_on='book_id', right_on='goodreads_book_id', how='left')

# Check for the existence of 'goodreads_book_id' before dropping
if 'goodreads_book_id' in books_with_tags_df.columns:
    books_with_tags_df.drop(columns=['goodreads_book_id'], inplace=True)

# Merge ratings with books to get book titles
book_ratings_df = pd.merge(ratings_df, books_df[['book_id', 'title']], on='book_id', how='left')

Aggregated Book Tags DataFrame:    goodreads_book_id                                           tag_name
0                  1  [to-read, fantasy, favorites, currently-readin...
1                  2  [to-read, fantasy, favorites, currently-readin...
2                  3  [to-read, fantasy, favorites, currently-readin...
3                  5  [to-read, fantasy, favorites, currently-readin...
4                  6  [to-read, fantasy, young-adult, fiction, harry...


In [16]:
# Save the dataframes to CSV
books_with_tags_df.to_csv('Books_with_tags.csv', index=False)
print("Books_with_tags.csv created successfully!")

book_ratings_df.to_csv('Book_ratings.csv', index=False)
print("Book_ratings.csv created successfully!")

Books_with_tags.csv created successfully!
Book_ratings.csv created successfully!


In [10]:
books = pd.read_csv('Books_with_tags.csv')
books.head(3)

Unnamed: 0,book_id,goodreads_book_id_x,best_book_id,work_id,books_count,isbn,isbn13,authors,original_publication_year,original_title,...,work_text_reviews_count,ratings_1,ratings_2,ratings_3,ratings_4,ratings_5,image_url,small_image_url,goodreads_book_id_y,tag_name
0,1,2767052,2767052,2792775,272,439023483,9780439000000.0,Suzanne Collins,2008.0,The Hunger Games,...,155254,66715,127936,560092,1481305,2706317,https://images.gr-assets.com/books/1447303603m...,https://images.gr-assets.com/books/1447303603s...,1.0,"['to-read', 'fantasy', 'favorites', 'currently..."
1,2,3,3,4640799,491,439554934,9780440000000.0,"J.K. Rowling, Mary GrandPré",1997.0,Harry Potter and the Philosopher's Stone,...,75867,75504,101676,455024,1156318,3011543,https://images.gr-assets.com/books/1474154022m...,https://images.gr-assets.com/books/1474154022s...,2.0,"['to-read', 'fantasy', 'favorites', 'currently..."
2,3,41865,41865,3212258,226,316015849,9780316000000.0,Stephenie Meyer,2005.0,Twilight,...,95009,456191,436802,793319,875073,1355439,https://images.gr-assets.com/books/1361039443m...,https://images.gr-assets.com/books/1361039443s...,3.0,"['to-read', 'fantasy', 'favorites', 'currently..."


In [11]:
print (f'Number of books in dataset: {books.shape[0]}')

Number of books in dataset: 10000


In [34]:
book_ratings = pd.read_csv('Book_ratings.csv')
book_ratings.head(3)

Unnamed: 0,user_id,book_id,rating,title
0,1,258,5,The Shadow of the Wind (The Cemetery of Forgot...
1,2,4081,4,I am Charlotte Simmons
2,2,260,5,How to Win Friends and Influence People


In [35]:
print (f'Number of ratings in dataset: {book_ratings.shape[0]}')

Number of ratings in dataset: 5976479


Let's look at the distribution of the ratings given by users. Here, we see that readers generally are on the kinder end of the rating spectrum, with a far higher proportion of positive reviews (> 3) being given over negative ones (< 3):