# Goodreads EDA and Recommendations Algorithm Development

## Setup
---

In [1]:
import getpass
import pandas as pd
import tensorflow as tf
from pymongo import MongoClient

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score

2024-09-09 06:22:32.954142: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:9261] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2024-09-09 06:22:32.954185: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:607] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2024-09-09 06:22:32.955038: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1515] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2024-09-09 06:22:32.960437: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.


In [2]:
password = getpass.getpass("MongoDB password: ")

MongoDB password:  ········


## Pull Goodreads Data
---

In [3]:
client = MongoClient(f'mongodb://book_group:{password}@macragge.reika.io:47017/?authSource=books')
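
One caveat with interpolating the password directly into the URI: PyMongo expects the username and password in a connection string to be percent-escaped, so a password containing characters such as `@`, `:` or `/` would produce a malformed URI. A minimal sketch of escaping the credentials with `urllib.parse.quote_plus` follows. The helper name `build_mongo_uri` and the sample password are hypothetical; the user, host, port and `authSource` are copied from the cell above.

```python
from urllib.parse import quote_plus


def build_mongo_uri(user: str, password: str, host: str, port: int, auth_source: str) -> str:
    """Build a MongoDB URI with percent-escaped credentials (hypothetical helper)."""
    return (
        f"mongodb://{quote_plus(user)}:{quote_plus(password)}"
        f"@{host}:{port}/?authSource={quote_plus(auth_source)}"
    )


# The sample password is made up for illustration; in the notebook it comes from getpass.
uri = build_mongo_uri("book_group", "p@ss:word/1", "macragge.reika.io", 47017, "books")
print(uri)
```

The resulting string could then be passed to `MongoClient(uri)` in place of the f-string.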

In [4]:
db = client['books']
collection = db['books']

In [5]:
# Fetch data from MongoDB
data = list(collection.find(limit=10000))  # Retrieve at most 10,000 documents as a list of dictionaries

In [6]:
# Convert to Pandas DataFrame
df = pd.DataFrame(data)

In [7]:
client.close()

## Data Preprocessing & Exploration
---
### Understand the Data

In [8]:
df.head()

|  | _id | isbn | text_reviews_count | series | country_code | language_code | popular_shelves | asin | is_ebook | average_rating | ... | publication_month | edition_information | publication_year | url | image_url | book_id | ratings_count | work_id | title | title_without_series |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 66da49047084538b3e00f9c2 | 312853122.0 | 1 | [] | US |  | [{'count': '3', 'name': 'to-read'}, {'count': ... |  | False | 4.0 | ... | 9.0 |  | 1984.0 | https://www.goodreads.com/book/show/5333265-w-... | https://images.gr-assets.com/books/1310220028m... | 5333265 | 3 | 5400751 | W.C. Fields: A Life on Film | W.C. Fields: A Life on Film |
| 1 | 66da49047084538b3e00f9c3 | 743509986.0 | 6 | [] | US |  | [{'count': '2634', 'name': 'to-read'}, {'count... |  | False | 3.23 | ... | 10.0 | Abridged | 2001.0 | https://www.goodreads.com/book/show/1333909.Go... | https://s.gr-assets.com/assets/nophoto/book/11... | 1333909 | 10 | 1323437 | Good Harbor | Good Harbor |
| 2 | 66da49047084538b3e00f9c4 |  | 7 | [189911] | US | eng | [{'count': '58', 'name': 'to-read'}, {'count':... | B00071IKUY | False | 4.03 | ... |  | Book Club Edition | 1987.0 | https://www.goodreads.com/book/show/7327624-th... | https://images.gr-assets.com/books/1304100136m... | 7327624 | 140 | 8948723 | The Unschooled Wizard (Sun Wolf and Starhawk, ... | The Unschooled Wizard (Sun Wolf and Starhawk, ... |
| 3 | 66da49047084538b3e00f9c5 | 743294297.0 | 3282 | [] | US | eng | [{'count': '7615', 'name': 'to-read'}, {'count... |  | False | 3.49 | ... | 7.0 |  | 2009.0 | https://www.goodreads.com/book/show/6066819-be... | https://s.gr-assets.com/assets/nophoto/book/11... | 6066819 | 51184 | 6243154 | Best Friends Forever | Best Friends Forever |
| 4 | 66da49047084538b3e00f9c6 | 850308712.0 | 5 | [] | US |  | [{'count': '32', 'name': 'to-read'}, {'count':... |  | False | 3.4 | ... |  |  |  | https://www.goodreads.com/book/show/287140.Run... | https://images.gr-assets.com/books/1413219371m... | 287140 | 15 | 278577 | Runic Astrology: Starcraft and Timekeeping in ... | Runic Astrology: Starcraft and Timekeeping in ... |


In [9]:
# available columns
for c in df.columns:
    print(c)

_id
isbn
text_reviews_count
series
country_code
language_code
popular_shelves
asin
is_ebook
average_rating
kindle_asin
similar_books
description
format
link
authors
publisher
num_pages
publication_day
isbn13
publication_month
edition_information
publication_year
url
image_url
book_id
ratings_count
work_id
title
title_without_series
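
Several cells in the `df.head()` preview are blank, so it may help to count missing values per column before choosing features. The sketch below assumes blanks can arrive either as empty strings or as `None`/`NaN` (not verified against the actual collection) and runs on a made-up stand-in frame rather than the real `df`.

```python
import numpy as np
import pandas as pd

# Tiny stand-in for df, with made-up rows that mimic a few of the columns above.
sample = pd.DataFrame({
    "isbn": ["312853122", "", None],
    "language_code": ["", "eng", "eng"],
    "num_pages": ["256", "", "320"],
})

# Treat empty strings as missing so they are counted alongside None/NaN.
normalized = sample.replace("", np.nan)

# Number of missing entries per column, largest first.
missing = normalized.isna().sum().sort_values(ascending=False)
print(missing)
```

Applied to the real data, the same two lines would run on `df` instead of `sample`.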


### Data Cleaning

Drop
- _id - identifier
- isbn - identifier
- link - URL of the book's Goodreads entry
- url
- image_url
- book_id
- work_id

Features
- text_reviews_count
- series
- country_code
- language_code
- popular_shelves
- is_ebook
- average_rating
- description
- format
- authors
- publisher
- num_pages
- publication_day
- publication_month
- edition_information
- publication_year
- ratings_count
- title
- title_without_series

Target
- Needs to be added? Possibly a user-preference signal.
- similar_books - could our model reproduce Goodreads' own similar-book lists?

Unknown
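- asin - Amazon Standard Identification Number; usefulness as a feature unknown
- kindle_asin - ASIN of the Kindle edition; usefulness as a feature unknown
- isbn13 - 13-digit ISBN; an identifier like isbn, so possibly a drop candidate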
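
The plan above could be applied along these lines. This is a minimal sketch, not the notebook's actual cleaning step: the helper name `clean_books` is hypothetical, the choice of which feature columns to coerce to numbers is an assumption, and the sample rows are made up.

```python
import pandas as pd

# Columns listed under "Drop" above.
DROP_COLUMNS = ["_id", "isbn", "link", "url", "image_url", "book_id", "work_id"]
# Assumption: these feature columns are meant to be numeric.
NUMERIC_COLUMNS = [
    "text_reviews_count", "average_rating", "num_pages",
    "publication_day", "publication_month", "publication_year", "ratings_count",
]


def clean_books(df: pd.DataFrame) -> pd.DataFrame:
    """Drop identifier/URL columns and coerce numeric features (hypothetical helper)."""
    # drop() returns a new frame, so the caller's DataFrame is left untouched.
    cleaned = df.drop(columns=DROP_COLUMNS, errors="ignore")
    for col in NUMERIC_COLUMNS:
        if col in cleaned.columns:
            # Unparseable or empty values become NaN instead of raising.
            cleaned[col] = pd.to_numeric(cleaned[col], errors="coerce")
    return cleaned


# Made-up rows for illustration only.
sample = pd.DataFrame({
    "_id": ["a", "b"],
    "url": ["https://example.com/1", "https://example.com/2"],
    "average_rating": ["4.0", "3.23"],
    "num_pages": ["256", ""],
    "title": ["Book A", "Book B"],
})
cleaned = clean_books(sample)
print(cleaned.dtypes)
```

Using `errors="ignore"` in `drop` keeps the helper usable even when some of the listed columns are absent from a given sample.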

### Explore and Analyze the Data

In [11]:
df['similar_books'].value_counts()

similar_books
[]                                                                                                                                                                                    5276
[31242, 374380, 20564, 383206, 7891, 6335178, 31175, 372811, 77395, 856190, 686278, 5797, 32110, 3102, 264, 99329, 31667]                                                                5
[8359929, 723742, 297130, 7570244, 397904, 22889, 89395, 1688926, 64694, 89115, 126816]                                                                                                  4
[87580, 837422, 429024, 12923, 588747, 472966, 207313, 175516, 1137702, 1275404, 6138, 733957, 29981, 1153738, 189746, 2677, 272751, 535856]                                             3
[160010, 16810, 3102, 606805, 517188, 18799, 91494, 7628, 11013, 11230, 24100, 18846, 53064, 5849, 438078, 144073, 11899, 53101]                                                         3
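
The counts show that the most common value by far is the empty list (5,276 rows), so a large share of books carry no `similar_books` signal at all. A hedged sketch of separating those rows and reshaping the rest into one row per (book, similar book) pair follows. It assumes the values are either real lists or their string form such as `"[31242, 374380]"`; the helper `parse_id_list` and the sample IDs are made up.

```python
import ast

import pandas as pd


def parse_id_list(value):
    """Return a list of ID strings from either a real list or its string form."""
    if isinstance(value, str):
        # Assumption: string values look like "[31242, 374380]".
        value = ast.literal_eval(value)
    return [str(v) for v in value]


# Made-up values for illustration; the IDs are not real lookups.
sample = pd.DataFrame({
    "title": ["Book A", "Book B", "Book C"],
    "similar_books": ["[]", "[101, 102]", ["103"]],
})

parsed = sample["similar_books"].map(parse_id_list)
has_similar = parsed.map(len) > 0

print(f"Share without similar books: {(~has_similar).mean():.1%}")

# One row per (title, similar book ID) pair, keeping only books that have any.
pairs = (
    sample.assign(similar_books=parsed)[has_similar]
    .explode("similar_books")
    .reset_index(drop=True)
)
print(pairs)
```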
                                                   

## Recommendation System
---
Implement the Recommendation System with TensorFlow

### Prepare the Data for TensorFlow

In [10]:
feature_columns = [
    'text_reviews_count', 'series', 'country_code', 'language_code', 'popular_shelves', 'is_ebook',
    'average_rating', 'description', 'format', 'authors', 'publisher', 'num_pages',
    'publication_day', 'publication_month', 'edition_information', 'publication_year', 'ratings_count', 'title',
    'title_without_series'
]

# Separate the similar_books target from the feature columns
y = df['similar_books'].values
X = df[feature_columns]

# Split training/test datasets.
# Passing stratify=y raised "ValueError: The least populated class in y has only 1 member",
# because at least one similar_books value occurs only once, so the split is left unstratified.
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)
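
`train_test_split` can only stratify when every class in `y` has at least two members, which a target like `similar_books` does not satisfy when some values are unique. A minimal sketch of checking this before deciding whether to pass `stratify` follows. The helper `split_with_optional_stratify` is hypothetical, the data is made up, and the check assumes the target values are hashable (as they must be for `value_counts` to work).

```python
import pandas as pd
from sklearn.model_selection import train_test_split


def split_with_optional_stratify(X, y, **kwargs):
    """Stratify only when every class has at least two members (hypothetical helper)."""
    counts = pd.Series(y).value_counts()
    stratify = y if counts.min() >= 2 else None
    return train_test_split(X, y, stratify=stratify, **kwargs)


# Made-up data: the target contains a class ("c") with a single member.
X = pd.DataFrame({"ratings_count": [3, 10, 140, 51184, 15, 8]})
y = pd.Series(["a", "a", "b", "b", "a", "c"])

X_train, X_test, y_train, y_test = split_with_optional_stratify(
    X, y, test_size=0.5, random_state=42
)
print(len(X_train), len(X_test))
```

This only guards against the error shown here; a stratified split can still fail for other reasons, for example when the test set is smaller than the number of classes.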

### Build and Train the Model

### Evaluate the Model

### Make Recommendations