**6. Exploring book rating data**

In [1]:
liked_books = ["4408", "31147619", "29983711", "9401317", "9317691", "8153988", "20494944"]

**Find all the users that like the same books as us then find all the books that they liked (Assumption: They have similar taste).**

In [2]:
import pandas as pd
data = pd.read_csv("book_id_map.csv")
data.head(n = 9)

Unnamed: 0,book_id_csv,book_id
0,0,34684622
1,1,34536488
2,2,34017076
3,3,71730
4,4,30422361
5,5,33503613
6,6,33517540
7,7,34467031
8,8,6383669


In [3]:
csv_book_mapping = {}
with open("book_id_map.csv", "r") as f:
    while True:
        line = f.readline()
        if not line:
            break
        csv_id, book_id = line.split(",")
        csv_book_mapping[csv_id] = book_id

In [4]:
len(csv_book_mapping)

2360651

In [5]:
!wc -l goodreads_interactions.csv

 228648343 goodreads_interactions.csv


**Size of goodreads_interactions.csv**

In [6]:
!ls -lh | grep goodreads_interactions.csv

-rw-r--r--@ 1 britneyagius  staff   4.0G Mar  8 15:47 goodreads_interactions.csv


**7. Finding users who like the same books as us**

**A set is a python data structure where every element is unique.**

**if user_id in overlap_users: continue - means that if we already added this user to our overlap user set we don't need to keep processing.**

In [7]:
data = pd.read_csv("goodreads_interactions.csv")
data.head(n=9)

Unnamed: 0,user_id,book_id,is_read,rating,is_reviewed
0,0,948,1,5,0
1,0,947,1,5,1
2,0,946,1,5,0
3,0,945,1,5,0
4,0,944,1,5,0
5,0,943,1,5,0
6,0,942,1,5,0
7,0,941,1,5,0
8,0,940,1,5,0


In [8]:
overlap_users = set()

with open("goodreads_interactions.csv", 'r') as f:
    while True:
        line = f.readline()
        if not line:
            break
        user_id, csv_id, _, rating, _ = line.split(",")
        
        if user_id in overlap_users:
            continue

        try:
            rating = int(rating)
        except ValueError:
            continue
        
        book_id = csv_book_mapping[csv_id]
        
        if book_id in liked_books and rating >= 4:
            overlap_users.add(user_id)

**8. Finding what books those users liked**

**rec_lines will only contin books that users who likes the same books as us have read.**

**rec_lines will contain books we might want to read.**

In [9]:
rec_lines = []
with open("goodreads_interactions.csv", 'r') as f:
    while True:
        line = f.readline()
        if not line:
            break
        user_id, csv_id, _, rating, _ = line.split(",")
        
        if user_id in overlap_users:
            book_id = csv_book_mapping[csv_id]
            rec_lines.append([user_id, book_id, rating])

In [10]:
len(overlap_users)

0

In [11]:
len(rec_lines)

0

In [12]:
import pandas as pd

recs = pd.DataFrame(rec_lines, columns=["user_id", "book_id", "rating"])
recs["book_id"] = recs["book_id"].astype(str)

**top_recs = recs("book_id").value_counts() - It counts up how many times each book id occured and it shows you the most common ones so our top recommendations will be our book ids that occured the most frequently.**

In [13]:
top_recs = recs["book_id"].value_counts().head(10)
top_recs = top_recs.index.values

**To get from the book id to a title: Read in from the file books_titles.json**

In [14]:
books_titles = pd.read_json("books_titles.json")
books_titles["book_id"] = books_titles["book_id"].astype(str)

In [15]:
books_titles.head()

Unnamed: 0,book_id,title,ratings,url,cover_image,mod_title
0,7327624,"The Unschooled Wizard (Sun Wolf and Starhawk, ...",140,https://www.goodreads.com/book/show/7327624-th...,https://images.gr-assets.com/books/1304100136m...,the unschooled wizard sun wolf and starhawk 12
1,6066819,Best Friends Forever,51184,https://www.goodreads.com/book/show/6066819-be...,https://s.gr-assets.com/assets/nophoto/book/11...,best friends forever
2,287141,The Aeneid for Boys and Girls,46,https://www.goodreads.com/book/show/287141.The...,https://s.gr-assets.com/assets/nophoto/book/11...,the aeneid for boys and girls
3,6066812,All's Fairy in Love and War (Avalon: Web of Ma...,98,https://www.goodreads.com/book/show/6066812-al...,https://images.gr-assets.com/books/1316637798m...,alls fairy in love and war avalon web of magic 8
4,287149,The Devil's Notebook,986,https://www.goodreads.com/book/show/287149.The...,https://images.gr-assets.com/books/1328768789m...,the devils notebook


**9. Creating initial book recommendations**

**Finds all of the book titles where the book id is in the top recommendations.**

In [16]:
books_titles[books_titles["book_id"].isin(top_recs)]

Unnamed: 0,book_id,title,ratings,url,cover_image,mod_title


In [17]:
top_recs

array([], dtype=object)

**10. Improving our book recommendations**

In [18]:
all_recs = recs["book_id"].value_counts()

In [19]:
all_recs

Series([], Name: count, dtype: int64)

**Make index a column because the index is the actual book id.**

In [20]:
all_recs = all_recs.to_frame().reset_index()

In [21]:
all_recs

Unnamed: 0,book_id,count


In [22]:
all_recs.columns = ["book_id", "book_count"]

In [23]:
all_recs

Unnamed: 0,book_id,book_count


**Inner merge means that if the data disen't exist in both get rid of the row.**

In [24]:
all_recs = all_recs.merge(books_titles, how="inner", on="book_id")

In [25]:
all_recs

Unnamed: 0,book_count,book_id,title,ratings,url,cover_image,mod_title


**A score will be created which we be used to sort these recommendations.**

In [26]:
all_recs["score"] = all_recs["book_count"] * (all_recs["book_count"] / all_recs["ratings"])

In [27]:
all_recs.sort_values("score", ascending=False).head(10)

Unnamed: 0,book_count,book_id,title,ratings,url,cover_image,mod_title,score


**Popular recommendations.**

In [28]:
popular_recs = all_recs[all_recs["book_count"] > 75].sort_values("score", ascending=False)

In [29]:
def make_clickable(val):
    return '<a target="_blank" href="{}">Goodreads</a>'.format(val, val)

def show_image(val):
    return '<a href="{}"><img src="{}" width=50></img></a>'.format(val, val)


popular_recs[~popular_recs["book_id"].isin(liked_books)].head(10).style.format({'url': make_clickable, 'cover_image': show_image})

Unnamed: 0,book_count,book_id,title,ratings,url,cover_image,mod_title,score
