# Assignment Goodreads

In this assignment, you are going to work with datasets from Goodreads to automatically collect comics/graphic novels that could be relevant for the Graphic Medicine Archive. Goodreads offers a very large collection of comics and graphic novels. Each title in this collection comes with metadata (i.e. additional information). This information contains (amongst other aspects):

* a description
* a list of 'popular shelves' (something like sub-genre lists created by users)
* an author id (can be mapped to author information in the Goodreads Authors dataset)


## Task
Your task is to use a keyword search in the popular shelves and book descriptions to extract potentially relevant titles. This notebook contains a couple of steps and hints that can help you along the way. 
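
To make the idea of a keyword search in a description concrete, here is a minimal sketch. It is not the search strategy used for the example result; the helper `find_keywords` and the example description are made up for illustration. It matches keywords against whole tokens, so that a short keyword such as 'ill' does not fire on words like 'still'.

```python
import re

def find_keywords(text, keywords):
    """Return the keywords that occur as whole tokens in text (case-insensitive)."""
    # Hypothetical helper: lowercase the text and keep runs of letters/digits as tokens
    tokens = set(re.findall(r"[a-z0-9]+", text.lower()))
    return [kw for kw in keywords if kw in tokens]

# Made-up description, for illustration only
example_description = "She is still recovering from a long illness and depression."
print(find_keywords(example_description, ['ill', 'illness', 'depression']))
```

A plain substring test (`'ill' in text`) would also match 'still' here, which is why the sketch compares tokens instead.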

## Goal

I made a quick start (using a very sloppy list of keywords and a sloppy search strategy) to create a first version of the result we are aiming for (see `../results/graphic_medicine_goodreads.csv`). I want you to improve this result in several ways:

* extend/improve the list of keywords I used (see below)
* improve the search strategy (I tokenized the shelf names and descriptions and directly searched in them; it would be much better to use lemmatization as well)
* if you have time and energy: take descriptions in different languages into account
* any additional information you want to include and can obtain from the data (or other data on the Goodreads website)
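
To illustrate why lemmatization helps, the sketch below maps inflected forms to a base form before comparing them with the keywords. The lookup table `LEMMA_TABLE` and the functions `normalize` and `matches` are hypothetical stand-ins, not a real lemmatizer; in practice you could swap in something like NLTK's `WordNetLemmatizer` (which needs the WordNet data to be downloaded first).

```python
# Hypothetical lookup table: inflected form -> base form.
# A real lemmatizer would replace this hand-written table.
LEMMA_TABLE = {
    'illnesses': 'illness',
    'disorders': 'disorder',
    'cancers': 'cancer',
    'traumas': 'trauma',
}

def normalize(token):
    """Lowercase a token and map it to its base form if the table knows it."""
    token = token.lower()
    return LEMMA_TABLE.get(token, token)

def matches(tokens, keywords):
    """Return the keywords found among the normalized tokens, sorted alphabetically."""
    lemmas = {normalize(t) for t in tokens}
    return sorted(lemmas & set(keywords))

print(matches(['Chronic', 'illnesses', 'and', 'eating', 'disorders'],
              ['illness', 'disorder', 'cancer']))
```

Without the normalization step, neither 'illnesses' nor 'disorders' would match the singular keywords.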


## Data download

Please download the following files from the Goodreads datasets page (https://cseweb.ucsd.edu/~jmcauley/datasets/goodreads.html) and put them in a directory called `data` (see code below to create the directory in the right location). Note that the file paths defined further down point to `../data/goodreads/`, so either place the files there or adjust the paths. Once you have downloaded the files, please unpack them (by double-clicking on the file).

* goodreads_books_comics_graphic.json.gz (under By Genre) - henceforth referred to as **books data**
* goodreads_book_authors.json.gz (under Meta-Data of Books) - henceforth referred to as **authors data**
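
If double-clicking does not work on your system, the files can also be unpacked from Python. This is a minimal sketch using only the standard library; the function name `unpack_gz` is made up, and the paths in the commented example call are assumptions that you should adjust to where you saved the downloads.

```python
import gzip
import shutil

def unpack_gz(path_gz, path_out):
    """Decompress path_gz into path_out without loading the whole file into memory."""
    with gzip.open(path_gz, 'rb') as f_in, open(path_out, 'wb') as f_out:
        shutil.copyfileobj(f_in, f_out)

# Example call (paths are assumptions):
# unpack_gz('../data/goodreads/goodreads_book_authors.json.gz',
#           '../data/goodreads/goodreads_book_authors.json')
```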

In [18]:
# packages you can/will want to use

import json
from collections import defaultdict
import csv
from nltk import tokenize
import os

In [19]:
# Create a data dir (if it doesn't exist already) and place downloads there. 
try:
    os.mkdir('../data')
except OSError as error:
    print(error, 'no need to create the directory again')

[Errno 17] File exists: '../data' no need to create the directory again


In [20]:
# Filepaths
path_dir = '../data/goodreads'
path_comics = f'{path_dir}/goodreads_books_comics_graphic.json'
path_authors = f'{path_dir}/goodreads_book_authors.json'

## Step 1: Keywords

Define keywords that could help you find relevant titles when searching shelves and the descriptions. Feel free to use the list below as a starting point, but please also feel free to deviate from it. 

In [21]:
# my sloppy keywords - to be extended and improved

keywords = ['illness', 'mental', 'health', 'sickness', 
            'ill', 'sick', 'cancer', 'depression', 'ocd', 'trauma', 
            'suicide', 'anxiety', 'disorder']

## Step 2: Understand the data structure from Goodreads

Both datasets we work with are stored with one JSON object per line. When we load a file line by line, we end up with a list of dictionaries. Each dictionary represents one book (in the books data) or one author (in the authors data). 
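
As a small illustration of how such a file is read line by line, the sketch below parses two made-up records from an in-memory file. The field names `book_id` and `title` are used only for this toy example.

```python
import io
import json

# Two made-up records, one JSON object per line
toy_file = io.StringIO('{"book_id": "1", "title": "A"}\n{"book_id": "2", "title": "B"}\n')

# Parse every non-empty line into its own dictionary
toy_books = [json.loads(line) for line in toy_file if line.strip()]
print(len(toy_books), toy_books[0]['title'])
```

Calling `json.load` on the whole file would fail here, because the file as a whole is not a single JSON document.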

In [22]:
# load the files (code provided)

#comics dataset
with open(path_comics) as infile:
    books = [json.loads(line) for line in infile]

In [23]:
# authors dataset
with open(path_authors) as infile:
    authors = [json.loads(line) for line in infile]

In [24]:
# run this and appreciate how massive these datasets are
print(len(books))
print(len(authors))

89411
829529


In [25]:
# Explore the structure of one dictionary in the books data
test_book = books[0]
for k, v in test_book.items():
    print(k, '\t', v)

isbn 	 
text_reviews_count 	 1
series 	 []
country_code 	 US
language_code 	 
popular_shelves 	 [{'count': '228', 'name': 'to-read'}, {'count': '2', 'name': 'graphic-novels'}, {'count': '1', 'name': 'ff-re-2011-till-2015'}, {'count': '1', 'name': 'calibre-list'}, {'count': '1', 'name': 'linseyschussan'}, {'count': '1', 'name': '1-person-narrative'}, {'count': '1', 'name': 'lgbtq-ya'}, {'count': '1', 'name': 'watchlist'}, {'count': '1', 'name': 'next-to-read'}, {'count': '1', 'name': 'sf'}, {'count': '1', 'name': 'sachiko'}, {'count': '1', 'name': 'giveaway-add'}, {'count': '1', 'name': 'friends-in-mind'}, {'count': '1', 'name': 'free-to-read-or-preview-on-goodread'}, {'count': '1', 'name': 'fantasy'}, {'count': '1', 'name': 'dystopian'}, {'count': '1', 'name': 'ck-library'}, {'count': '1', 'name': '23089-ya-fantasy-sf-w-major-lgbt'}]
asin 	 B00NLXQ534
is_ebook 	 true
average_rating 	 4.12
kindle_asin 	 
similar_books 	 ['25653153', '25699172', '23530486', '12984185', '25538377', '23525

In [26]:
# Explore the structure of one dictionary in the authors data
test_author = authors[0]
for k, v in test_author.items():
    print(k, '\t', v)

average_rating 	 3.98
author_id 	 604031
text_reviews_count 	 7
name 	 Ronald J. Fields
ratings_count 	 49


## Step 3: Map author ids to author names (warm-up)

As you may have noticed in the previous step, the books data contain author information in the form of an author ID (a number). To find the name associated with the ID, we have to look up the ID in the authors data. To make things a bit easier, we will create a dictionary mapping IDs to names, so we can easily obtain the name of an author whenever we have an ID. The dictionary should look like this:

```
{
    123: 'first_name1 last_name1',
    456: 'first_name2 last_name2'
    
}
```

Complete the code below to fill the dictionary called `author_dict`:

In [27]:
author_dict = dict()
for author in authors:
    aid = author['author_id']
    name = author['name']
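
One possible way to build such a mapping, shown on made-up records rather than the real authors data. The names `toy_authors` and `toy_author_dict` are hypothetical. Check whether the IDs in the real data are strings or integers, because a lookup only succeeds if the key has the same type.

```python
# Made-up author records with the same keys as the authors data
toy_authors = [
    {'author_id': 123, 'name': 'first_name1 last_name1'},
    {'author_id': 456, 'name': 'first_name2 last_name2'},
]

toy_author_dict = {}
for author in toy_authors:
    # Map each ID to the corresponding name
    toy_author_dict[author['author_id']] = author['name']

print(toy_author_dict[456])
# .get() avoids a KeyError for IDs that are not in the mapping
print(toy_author_dict.get(999, 'unknown'))
```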

## Step 4: Search for relevant shelves

Each book in the books data comes with a list of popular shelves. Use your keywords to search for potentially relevant shelves. We will include all books associated with any of the potentially relevant shelves in the final dataset. You can play with this step a bit to check if your keywords work. It will be helpful to explore what kinds of shelf-names you can find and whether they are likely to contain relevant books. 

Please also keep track of which keyword you found in each shelf name. To do this, please store your results in a dictionary called `target_shelves` mapping each shelf to the keyword you identified in its name:

`{'shelf_name1': 'keyword2', 'shelf_name2': 'keyword3', 'shelf_name3': 'keyword1'}`

Use the code below to get started:

In [28]:
# First extract all shelf names:
all_shelves = set()
for b in books: 
    shelves = b['popular_shelves']
    for sh in shelves:
        name = sh['name']
        all_shelves.add(name)

In [30]:
# Print some names to get a feeling for what they look like:
n = 0
for sh in all_shelves:
    print(sh)
    n+=1 
    if n == 30:
        break

fantasy-supernatural-paranormal

books-to-read-not-in-library
social-emotional
2011fiction
500-essential-gn
koleksi-anak-anak
read-in-8-17
other-2016
w-2000-2100
m-m-yaoi
2017-books-finished
one-shots-no-pair
book-girlfriends
books-that-fucked-me-up
boxed
couldnt_finish
most-wanted-books
lit-france
sept-2015
gladiatorial-or-arena-games
nonfiction-adult
peanut-butter-sandwich
vertaald
new-check-this-out
13-white
fantasy-mystery
survival
rock-he-kim
mlnavidad


Now it's time to apply the first keyword search! To search for keywords, tokenize the shelf names. As you can see above, the words in a name are joined by a '-' character. Some are also joined by a '\_' character (not shown in the examples). Check if the name contains any of these characters and then split the name using a string method. In the next step, iterate over your keywords and check if any of them is in your shelf name. Tip: try it out on one name first, then apply it to all. 

In [32]:
# Dummy example: 
keywords_example = ['illness', 'example', 'games']
shelf_name = 'gladiatorial-or-arena-games'
shelf_name_words = shelf_name.split('-')
for keyword in keywords_example:
    if keyword in shelf_name_words:
        print('found word', keyword, shelf_name)

found word games gladiatorial-or-arena-games


In [None]:
# now apply it to all shelf_names
target_shelves = dict()
for sh in all_shelves:
    pass 
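
The sketch below shows one way the search could look on a handful of made-up shelf names. It is only an illustration: `shelf_tokens`, `toy_keywords`, `toy_shelves` and `toy_target_shelves` are hypothetical names, and it splits on both '-' and '_' in one go with a regular expression instead of a string method.

```python
import re

def shelf_tokens(shelf_name):
    """Split a shelf name on '-' and '_' and drop empty pieces."""
    return [t for t in re.split(r'[-_]', shelf_name.lower()) if t]

toy_keywords = ['illness', 'mental', 'health']
toy_shelves = {'mental-health', 'chronic_illness', 'fantasy-mystery', 'healthy-eating'}

toy_target_shelves = {}
for sh in sorted(toy_shelves):
    tokens = shelf_tokens(sh)
    for kw in toy_keywords:
        if kw in tokens:
            toy_target_shelves[sh] = kw
            break  # keep only the first matching keyword per shelf

print(toy_target_shelves)
```

Note that 'healthy-eating' is not matched, because 'health' is compared with whole tokens. Whether that is what you want is a design choice; a shelf such as 'mental-health' also contains two keywords, and this sketch records only the first one it finds.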

## Step 5: Find potentially relevant comics using your shelves

Now it's time to go through all books and extract the ones linked to relevant shelves! Store the results in such a way that the books are sorted by the keywords you used to identify the shelves (remember, we stored the target shelves in such a way that each shelf is mapped to its respective keyword). 

Please store the results in a dictionary called `target_comics` whose keys are the keywords used to identify the shelves and whose values are lists containing all books associated with any of the shelves identified by the keyword. 

```
{
'keyword1' : [book_dict1, book_dict2, book_dict3, ...],
 'keyword2' : [book_dict4, book_dict5, book_dict6, ...],
 ...
 }
```

Tip: You can use defaultdict to define a dictionary whose values are lists. See toy example below:

In [33]:
# Toy example: Sort words by first letter
toy_dict = defaultdict(list)

words = ['hi', 'toy', 'game', 'hotel', 'thing', 'great']

for word in words:
    first_letter = word[0]
    toy_dict[first_letter].append(word)
toy_dict

defaultdict(list,
            {'h': ['hi', 'hotel'],
             't': ['toy', 'thing'],
             'g': ['game', 'great']})

In [34]:
# target_comics:
target_comics = defaultdict(list)

for b in books:
    shelves = b['popular_shelves']
    # Go through the shelves one by one and check if they are in your target shelves.
    # If yes, retrieve the keyword associated with the shelf from your dictionary target_shelves.
    # Add the keyword and book to the dictionary target_comics in such a way that the book is appended
    # to the list associated with the keyword.
    pass  # placeholder so the cell runs; replace with your code

In [None]:
# check how many books you collected per keyword:

for kw, comics in target_comics.items():
    print(kw, len(comics))
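
A possible shape for this step, again on made-up data (all `toy_` names and the book titles are hypothetical). The keywords found for a book are first collected in a set, so that a book with several matching shelves is added at most once per keyword.

```python
from collections import defaultdict

toy_target_shelves = {'mental-health': 'mental', 'chronic_illness': 'illness'}
toy_books = [
    {'title': 'Book A', 'popular_shelves': [{'count': '5', 'name': 'mental-health'},
                                            {'count': '1', 'name': 'to-read'}]},
    {'title': 'Book B', 'popular_shelves': [{'count': '2', 'name': 'fantasy'}]},
    {'title': 'Book C', 'popular_shelves': [{'count': '3', 'name': 'chronic_illness'},
                                            {'count': '1', 'name': 'mental-health'}]},
]

toy_target_comics = defaultdict(list)
for b in toy_books:
    # Keywords of all target shelves this book sits on (set removes duplicates)
    found = {toy_target_shelves[sh['name']] for sh in b['popular_shelves']
             if sh['name'] in toy_target_shelves}
    for kw in found:
        toy_target_comics[kw].append(b)

for kw in sorted(toy_target_comics):
    print(kw, [b['title'] for b in toy_target_comics[kw]])
```

A book can still appear under several keywords (like 'Book C' here), which you may want to keep in mind when counting results.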

## Step 6: Create the final csv file

Have a look at the example file: '../results/graphic_medicine_goodreads.csv'. Use the books you have collected to create such a file. Add whatever other information you would like to add. 
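
As a rough sketch of how the csv file could be written with `csv.DictWriter`: the column names, the `toy_` variables and the `author_id` key on the toy book are assumptions made for this example, so compare them with the example file and the real book dictionaries before reusing anything. The sketch writes to an in-memory buffer, which stands in for a real file opened with `open(path, 'w', newline='', encoding='utf-8')`.

```python
import csv
import io

toy_author_dict = {123: 'first_name1 last_name1'}
toy_target_comics = {
    'illness': [{'title': 'Book C', 'author_id': 123,
                 'description': 'A made-up description.'}],
}

fieldnames = ['keyword', 'title', 'author', 'description']  # assumed columns
out = io.StringIO()  # stands in for a real output file
writer = csv.DictWriter(out, fieldnames=fieldnames)
writer.writeheader()
for kw, comics in toy_target_comics.items():
    for b in comics:
        writer.writerow({
            'keyword': kw,
            'title': b['title'],
            # Fall back to 'unknown' if the ID is missing from the mapping
            'author': toy_author_dict.get(b['author_id'], 'unknown'),
            'description': b['description'],
        })

print(out.getvalue())
```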