# Data preparation

This notebook prepares the data for the project. Specifically, it attaches each book's metadata (especially the summary), which comes from one dataset, to the list of user-book ratings, which comes from another.

In [1]:
import pandas as pd

### Extract the books' complete metadata, including the description (books_1.Best_Books_Ever.csv)

In [2]:
books_full_metadata = pd.read_csv('../books_1.Best_Books_Ever.csv')
books_full_metadata.head(1)
len(books_full_metadata)

52478

Keep only books that are in English

In [3]:
books_full_metadata = books_full_metadata[books_full_metadata["language"] == "English"]
len(books_full_metadata)

42661

Keep only the columns we are interested in

In [4]:
books_full_metadata = books_full_metadata[["title","series","author","description","genres","pages", "publisher","firstPublishDate","awards","setting","coverImg"]]

Parse titles

In [5]:
books_full_metadata['mod_title'] = books_full_metadata['title'].str.replace(r"\s+", " ", regex=True)  # Collapse runs of whitespace into a single space
books_full_metadata['mod_title'] = books_full_metadata['mod_title'].str.replace(r"[^\w\s]", "", regex=True).str.lower()  # Remove punctuation and convert to lower case
books_full_metadata[books_full_metadata["mod_title"] == "mobydick or the whale"]

Unnamed: 0,title,series,author,description,genres,pages,publisher,firstPublishDate,awards,setting,coverImg,mod_title
100,"Moby-Dick or, the Whale",,"Herman Melville, Andrew Delbanco (Introduction...","""It is the horrible texture of a fabric that s...","['Classics', 'Fiction', 'Literature', 'Adventu...",654,Penguin Classics,10/18/51,['Audie Award for Solo Narration - Male (2006)...,"['Nantucket Island, Massachusetts (United Stat...",https://i.gr-assets.com/images/S/compressed.ph...,mobydick or the whale
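As a rough illustration of the normalization above, the same two regex steps can be written in plain Python with the standard `re` module. The helper name `normalize_title` and the second sample title are hypothetical and only serve the example.

```python
import re

def normalize_title(title: str) -> str:
    """Mirror the two regex steps applied to the `title` column."""
    collapsed = re.sub(r"\s+", " ", title)        # collapse runs of whitespace
    no_punct = re.sub(r"[^\w\s]", "", collapsed)  # drop punctuation
    return no_punct.lower()

print(normalize_title("Moby-Dick or, the Whale"))  # mobydick or the whale
print(repr(normalize_title("War - Peace")))        # 'war  peace' (two spaces)
```

The second call shows a side effect of the step order: because punctuation is removed after whitespace is collapsed, a punctuation mark surrounded by spaces leaves a double space behind. Whether that matters depends on how `mod_title` was built in the other dataset, which this notebook does not show.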


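Remove every book whose normalized title (`mod_title`) appears more than once. All copies are dropped, not just the extras, since the title alone cannot tell them apart.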

In [6]:
# Drop every row whose normalized title occurs more than once (keep=False flags all copies)
books_full_metadata = books_full_metadata[~books_full_metadata["mod_title"].duplicated(keep=False)]

In [7]:
len(books_full_metadata)

38829
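The duplicate removal above discards every copy of a repeated title rather than keeping the first one. A minimal sketch of that rule in plain Python, using a hypothetical helper `drop_ambiguous` and made-up titles:

```python
from collections import Counter

def drop_ambiguous(titles):
    """Keep only titles that occur exactly once; every copy of a repeated title is dropped."""
    counts = Counter(titles)
    return [t for t in titles if counts[t] == 1]

sample = ["dune", "emma", "dune", "ulysses"]  # hypothetical normalized titles
print(drop_ambiguous(sample))  # ['emma', 'ulysses']
```

Dropping all copies trades coverage for safety: a title shared by two different books can no longer be matched to the wrong description.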

### Extract the Goodreads books' incomplete metadata (books_titles.json)

In [19]:
books_partial_metadata = pd.read_json("../books_titles.json")
books_partial_metadata["book_id"] = books_partial_metadata["book_id"].astype(str)
books_partial_metadata = books_partial_metadata.drop(columns=["title", "ratings"])
books_partial_metadata.head()

Unnamed: 0,book_id,url,cover_image,mod_title
0,1333909,https://www.goodreads.com/book/show/1333909.Go...,https://s.gr-assets.com/assets/nophoto/book/11...,good harbor
1,7327624,https://www.goodreads.com/book/show/7327624-th...,https://images.gr-assets.com/books/1304100136m...,the unschooled wizard sun wolf and starhawk 12
2,6066819,https://www.goodreads.com/book/show/6066819-be...,https://s.gr-assets.com/assets/nophoto/book/11...,best friends forever
3,287140,https://www.goodreads.com/book/show/287140.Run...,https://images.gr-assets.com/books/1413219371m...,runic astrology starcraft and timekeeping in t...
4,287141,https://www.goodreads.com/book/show/287141.The...,https://s.gr-assets.com/assets/nophoto/book/11...,the aeneid for boys and girls


In [20]:
len(books_partial_metadata["mod_title"].unique())

1227673

### Import the mapping between book ids in the interactions CSV and in books_titles.json (book_id_map.csv)

In [21]:
csv_book_mapping = {}

with open("../book_id_map.csv", "r") as file:  # Stream the large file line by line
    next(file)  # Skip header
    for line in file:
        line = line.strip()
        if not line:
            continue  # Ignore blank lines instead of stopping at the first one
        csv_id, book_id = line.split(",")
        csv_book_mapping[csv_id] = book_id
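One possible alternative to splitting lines by hand is the standard `csv` module, which also copes with blank lines. The sketch below assumes a two-column file with a header row; the function name `load_id_mapping`, the header names, and the ids in the sample are invented for illustration.

```python
import csv
import io

def load_id_mapping(lines):
    """Build {csv_id: book_id} from an iterable of CSV lines with a header row."""
    reader = csv.reader(lines)
    next(reader)  # skip header
    return {row[0]: row[1] for row in reader if row}  # `if row` skips blank lines

# Hypothetical sample; the header names and ids are made up for illustration.
sample = io.StringIO("csv_id,book_id\n0,34684622\n1,34536488\n\n2,34017076\n")
mapping = load_id_mapping(sample)
print(mapping)
```

Both keys and values stay strings here, which matches the notebook, where `book_id` is cast to `str` before the lookup.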

### Find the intersection between the two book description dataframes

In [22]:
books_intersection_full_partial = pd.merge(books_full_metadata, books_partial_metadata, how="inner", on=["mod_title"])

In [23]:
books_intersection_full_partial.head(1)
books_intersection_full_partial.set_index("book_id")

book_id,title,series,author,description,genres,pages,publisher,firstPublishDate,awards,setting,coverImg,mod_title,url,cover_image
14796360,The Hunger Games,The Hunger Games #1,Suzanne Collins,WINNING MEANS FAME AND FORTUNE.LOSING MEANS CE...,"['Young Adult', 'Fiction', 'Dystopia', 'Fantas...",374,Scholastic Press,,['Locus Award Nominee for Best Young Adult Boo...,"['District 12, Panem', 'Capitol, Panem', 'Pane...",https://i.gr-assets.com/images/S/compressed.ph...,the hunger games,https://www.goodreads.com/book/show/14796360-t...,https://images.gr-assets.com/books/1355036953m...
11534111,The Hunger Games,The Hunger Games #1,Suzanne Collins,WINNING MEANS FAME AND FORTUNE.LOSING MEANS CE...,"['Young Adult', 'Fiction', 'Dystopia', 'Fantas...",374,Scholastic Press,,['Locus Award Nominee for Best Young Adult Boo...,"['District 12, Panem', 'Capitol, Panem', 'Pane...",https://i.gr-assets.com/images/S/compressed.ph...,the hunger games,https://www.goodreads.com/book/show/11534111-t...,https://images.gr-assets.com/books/1328214586m...
15784152,The Hunger Games,The Hunger Games #1,Suzanne Collins,WINNING MEANS FAME AND FORTUNE.LOSING MEANS CE...,"['Young Adult', 'Fiction', 'Dystopia', 'Fantas...",374,Scholastic Press,,['Locus Award Nominee for Best Young Adult Boo...,"['District 12, Panem', 'Capitol, Panem', 'Pane...",https://i.gr-assets.com/images/S/compressed.ph...,the hunger games,https://www.goodreads.com/book/show/15784152-t...,https://images.gr-assets.com/books/1344000603m...
14289293,The Hunger Games,The Hunger Games #1,Suzanne Collins,WINNING MEANS FAME AND FORTUNE.LOSING MEANS CE...,"['Young Adult', 'Fiction', 'Dystopia', 'Fantas...",374,Scholastic Press,,['Locus Award Nominee for Best Young Adult Boo...,"['District 12, Panem', 'Capitol, Panem', 'Pane...",https://i.gr-assets.com/images/S/compressed.ph...,the hunger games,https://www.goodreads.com/book/show/14289293-t...,https://images.gr-assets.com/books/1337792923m...
16051061,The Hunger Games,The Hunger Games #1,Suzanne Collins,WINNING MEANS FAME AND FORTUNE.LOSING MEANS CE...,"['Young Adult', 'Fiction', 'Dystopia', 'Fantas...",374,Scholastic Press,,['Locus Award Nominee for Best Young Adult Boo...,"['District 12, Panem', 'Capitol, Panem', 'Pane...",https://i.gr-assets.com/images/S/compressed.ph...,the hunger games,https://www.goodreads.com/book/show/16051061-t...,https://images.gr-assets.com/books/1363545717m...
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
270435,Heal Your Body: The Mental Causes for Physical...,,Louise L. Hay,Heal Your Body is a fresh and easy step-by-ste...,"['Self Help', 'Health', 'Nonfiction', 'Spiritu...",96,Hay House,May 1st 1976,[],[],https://i.gr-assets.com/images/S/compressed.ph...,heal your body the mental causes for physical ...,https://www.goodreads.com/book/show/270435.Hea...,https://images.gr-assets.com/books/1404193356m...
15840361,Heal Your Body: The Mental Causes for Physical...,,Louise L. Hay,Heal Your Body is a fresh and easy step-by-ste...,"['Self Help', 'Health', 'Nonfiction', 'Spiritu...",96,Hay House,May 1st 1976,[],[],https://i.gr-assets.com/images/S/compressed.ph...,heal your body the mental causes for physical ...,https://www.goodreads.com/book/show/15840361-h...,https://images.gr-assets.com/books/1345590708m...
11115191,Attracted to Fire,,DiAnn Mills (Goodreads Author),Special Agent Meghan Connors' dream of one day...,"['Christian Fiction', 'Christian', 'Suspense',...",416,Tyndale House Publishers,September 16th 2011,['HOLT Medallion by Virginia Romance Writers N...,['West Texas (United States)'],https://i.gr-assets.com/images/S/compressed.ph...,attracted to fire,https://www.goodreads.com/book/show/11115191-a...,https://s.gr-assets.com/assets/nophoto/book/11...
602931,Anasazi,Sense of Truth #2,Emma Michaels,"'Anasazi', sequel to 'The Thirteenth Chime' by...","['Mystery', 'Young Adult']",190,Bokheim Publishing,August 3rd 2011,[],[],https://i.gr-assets.com/images/S/compressed.ph...,anasazi,https://www.goodreads.com/book/show/602931.Ana...,https://images.gr-assets.com/books/1287546026m...
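The output above shows that the inner merge is one-to-many: a single metadata row (for example "The Hunger Games") is repeated for every `book_id` that shares its `mod_title`. A small plain-Python sketch of that join behaviour follows; the helper `inner_join` and the unmatched row are hypothetical, while the ids are taken from the output above.

```python
from collections import defaultdict

# Hypothetical miniature versions of the two tables.
full = [{"mod_title": "the hunger games", "author": "Suzanne Collins"},
        {"mod_title": "anasazi", "author": "Emma Michaels"},
        {"mod_title": "unmatched title", "author": "Nobody"}]
partial = [{"book_id": "14796360", "mod_title": "the hunger games"},
           {"book_id": "11534111", "mod_title": "the hunger games"},
           {"book_id": "602931", "mod_title": "anasazi"}]

def inner_join(left, right, key):
    """Return one combined row per (left, right) pair sharing the same key value."""
    by_key = defaultdict(list)
    for row in right:
        by_key[row[key]].append(row)
    return [{**l, **r} for l in left for r in by_key.get(l[key], [])]

joined = inner_join(full, partial, "mod_title")
print([row["book_id"] for row in joined])  # ['14796360', '11534111', '602931']
```

Rows without a partner on the other side, such as the unmatched title, disappear from the result, which is why the merged table only covers books present in both datasets.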


In [25]:
intersection_book_id = set(books_intersection_full_partial["book_id"])
len(intersection_book_id)

130808

### Extract book ratings for books for which we have an actual description

In [14]:
known_book_ratings = []
total_rows = 229_000_000  # Hard-coded row count, used only for the progress display
with open("../goodreads_interactions.csv", "r") as file:
    next(file)  # Skip header
    for i, line in enumerate(file, start=1):
        line = line.rstrip()
        if not line:
            continue  # Ignore blank lines
        # Retrieve user, book id and associated rating
        user_id, csv_book_id, _, rating, _ = line.split(",")
        book_id = csv_book_mapping.get(csv_book_id)
        if book_id in intersection_book_id:
            known_book_ratings.append([user_id, book_id, rating])
        if i % 5_000_000 == 0:
            print(f"{round(i / total_rows * 100, 1)}% completed")

2.2% completed
4.4% completed
6.6% completed
8.7% completed
10.9% completed
13.1% completed
15.3% completed
17.5% completed
19.7% completed
21.8% completed
24.0% completed
26.2% completed
28.4% completed
30.6% completed
32.8% completed
34.9% completed
37.1% completed
39.3% completed
41.5% completed
43.7% completed
45.9% completed
48.0% completed
50.2% completed
52.4% completed
54.6% completed
56.8% completed
59.0% completed
61.1% completed
63.3% completed
65.5% completed
67.7% completed
69.9% completed
72.1% completed
74.2% completed
76.4% completed
78.6% completed
80.8% completed
83.0% completed
85.2% completed
87.3% completed
89.5% completed
91.7% completed
93.9% completed
96.1% completed
98.3% completed
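The filtering loop above can also be expressed as a generator, which keeps the streaming behaviour but makes the logic easy to try on a small sample. This is only a sketch: the function name `filter_ratings`, the header names, and the sample rows are assumptions; the only thing taken from the notebook is that each row has five comma-separated fields, with the user id, the CSV book id, and the rating in positions 1, 2, and 4.

```python
import csv
import io

def filter_ratings(lines, id_mapping, wanted_ids):
    """Yield [user_id, book_id, rating] for rows whose mapped book id is wanted."""
    reader = csv.reader(lines)
    next(reader)  # skip header
    for row in reader:
        if not row:
            continue  # skip blank lines
        user_id, csv_book_id, _, rating, _ = row
        book_id = id_mapping.get(csv_book_id)  # None when the id is unknown
        if book_id in wanted_ids:
            yield [user_id, book_id, rating]

# Hypothetical data; header names and values are made up for illustration.
sample = io.StringIO(
    "user_id,book_id,is_read,rating,is_reviewed\n"
    "0,10,1,5,0\n"
    "0,11,1,4,0\n"
    "1,10,1,3,1\n"
)
mapping = {"10": "21", "11": "999"}
kept = list(filter_ratings(sample, mapping, {"21"}))
print(kept)  # [['0', '21', '5'], ['1', '21', '3']]
```

Keeping the wanted ids in a `set`, as the notebook does with `intersection_book_id`, makes each membership check constant time on average, which matters when the loop runs over hundreds of millions of rows.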


In [15]:
len(known_book_ratings) 
#63825044

56572131

Put the user-book ratings in a DataFrame

In [16]:
users_ratings = pd.DataFrame(known_book_ratings, columns=["user_id", "book_id", "rating"])
users_ratings["rating"] = pd.to_numeric(users_ratings["rating"])
users_ratings.head()

Unnamed: 0,user_id,book_id,rating
0,0,21,5
1,0,30,5
2,0,1022863,5
3,0,830,4
4,0,835,4


### Export the dataframes to csv

Export the user ratings for the books whose description we know

In [18]:
users_ratings.to_csv("./data/user_book_ratings.csv")

Export the list of rated books with their full metadata

In [26]:
books_intersection_full_partial.to_csv("./data/books_metadata.csv")