In [44]:
import pandas as pd
import json

#### Read & inspect data

In [45]:
df_books = pd.read_csv('data/books_metadata.csv')

In [46]:
print(len(df_books))

26764


In [47]:
df_books.head()

Unnamed: 0,book_id,title,series,author,description,genres,pages,publisher,firstPublishDate,awards,setting,coverImg
0,14796360,The Hunger Games,The Hunger Games #1,Suzanne Collins,WINNING MEANS FAME AND FORTUNE.LOSING MEANS CE...,"['Young Adult', 'Fiction', 'Dystopia', 'Fantas...",374,Scholastic Press,,['Locus Award Nominee for Best Young Adult Boo...,"['District 12, Panem', 'Capitol, Panem', 'Pane...",https://i.gr-assets.com/images/S/compressed.ph...
1,7743507,Harry Potter and the Order of the Phoenix,Harry Potter #5,"J.K. Rowling, Mary GrandPré (Illustrator)",There is a door at the end of a silent corrido...,"['Fantasy', 'Young Adult', 'Fiction', 'Magic',...",870,Scholastic Inc.,06/21/03,['Bram Stoker Award for Works for Young Reader...,['Hogwarts School of Witchcraft and Wizardry (...,https://i.gr-assets.com/images/S/compressed.ph...
2,23390821,To Kill a Mockingbird,To Kill a Mockingbird,Harper Lee,The unforgettable novel of a childhood in a sl...,"['Classics', 'Fiction', 'Historical Fiction', ...",324,Harper Perennial Modern Classics,07/11/60,"['Pulitzer Prize for Fiction (1961)', 'Audie A...","['Maycomb, Alabama (United States)']",https://i.gr-assets.com/images/S/compressed.ph...
3,1555826,Pride and Prejudice,,"Jane Austen, Anna Quindlen (Introduction)",Alternate cover edition of ISBN 9780679783268S...,"['Classics', 'Fiction', 'Romance', 'Historical...",279,Modern Library,01/28/13,[],"['United Kingdom', 'Derbyshire, England (Unite...",https://i.gr-assets.com/images/S/compressed.ph...
4,28109798,The Book Thief,,Markus Zusak (Goodreads Author),Librarian's note: An alternate cover edition c...,"['Historical Fiction', 'Fiction', 'Young Adult...",552,Alfred A. Knopf,09/01/05,['National Jewish Book Award for Children’s an...,"['Molching (Germany)', 'Germany']",https://i.gr-assets.com/images/S/compressed.ph...


### Create dictionary with the item profiles

Use book_id as the index of the dataframe, replacing the default unnamed integer index (the left-most column above)

In [48]:
df_books = df_books.set_index("book_id")

Create the dictionary from the above dataframe

In [49]:
books_map = df_books.to_dict(orient='index')
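
To make the shape of the result concrete, here is a small sketch on a toy dataframe. The two rows and the names `toy` and `toy_map` are invented for illustration and are not part of the dataset. With `orient='index'`, `to_dict` returns one inner dictionary per index value, keyed by column name.

```python
import pandas as pd

# Toy data, invented for illustration only
toy = pd.DataFrame(
    {"book_id": [1, 2], "title": ["A", "B"], "pages": [100, 200]}
).set_index("book_id")

# One entry per index value; each value maps column name -> cell value
toy_map = toy.to_dict(orient="index")
print(toy_map[1])
```

Because book_id is the index, it becomes the outer key and no longer appears inside the inner dictionaries.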

In [50]:
books_map[14796360] #test

{'title': 'The Hunger Games',
 'series': 'The Hunger Games #1',
 'author': 'Suzanne Collins',
 'description': "WINNING MEANS FAME AND FORTUNE.LOSING MEANS CERTAIN DEATH.THE HUNGER GAMES HAVE BEGUN. . . .In the ruins of a place once known as North America lies the nation of Panem, a shining Capitol surrounded by twelve outlying districts. The Capitol is harsh and cruel and keeps the districts in line by forcing them all to send one boy and once girl between the ages of twelve and eighteen to participate in the annual Hunger Games, a fight to the death on live TV.Sixteen-year-old Katniss Everdeen regards it as a death sentence when she steps forward to take her sister's place in the Games. But Katniss has been close to dead before—and survival, for her, is second nature. Without really meaning to, she becomes a contender. But if she is to win, she will have to start making choices that weight survival against humanity and life against love.",
 'genres': "['Young Adult', 'Fiction', 'Dysto

**Problem**: The lists in some cells are stored as strings that look like lists; we want actual lists in the item profiles.<br/>

Columns where we have lists in the cells are: genres, awards, setting.<br/>
Formats:<br/> 
genres = "['item1', 'item2', ...]"<br/>
awards = "[\\'item1\\', \\'item2\\', ...]"<br/>
setting = "['item1', 'item2', ...]"<br/>

We create a function that turns such a string into a list, with a flag to select between the two formats above.
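
As a point of comparison, and not the approach used in this notebook, the standard library's `ast.literal_eval` can parse a string that is a valid Python list literal. The sketch below assumes the cell really is a well-formed literal, as in the genres and setting format above. The `parse_list_cell` name is hypothetical, and the fallback to an empty list for anything unparseable (such as a missing value) is an assumption about the desired behaviour.

```python
import ast

def parse_list_cell(cell):
    """Hypothetical helper: parse "['a', 'b']" into ['a', 'b']."""
    if not isinstance(cell, str):
        return []  # assumption: treat a missing value (e.g. NaN) as an empty list
    try:
        value = ast.literal_eval(cell)
    except (ValueError, SyntaxError):
        return []  # not a valid Python literal
    return list(value) if isinstance(value, (list, tuple)) else []

print(parse_list_cell("['Young Adult', 'Fiction', 'Dystopia']"))
print(parse_list_cell("['District 12, Panem', 'Capitol, Panem']"))
```

Unlike splitting on a separator, this keeps items that contain a comma intact, but it only works when the string is a valid literal.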

In [51]:
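def string_to_list(s, awards=False):
    """Turn a string that looks like a list into an actual list of strings."""
    inner = s.strip("[]")
    if not inner:
        return []  # "[]" is an empty list, not a list holding one empty string
    if awards:
        # Split on ", " and chop the quote character off the front and back of
        # each item; somewhat tedious but works. Note that an award name which
        # itself contains ", " would be split into several items.
        return [aw[1:-1] for aw in inner.split(", ")]
    items = inner.split("', '")
    # Remove "'" from start of first and end of last item
    items[0] = items[0][1:]
    items[-1] = items[-1][:-1]
    return items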

Clean the lists in books_map:

In [52]:
for book_id, metadata in books_map.items():
    # metadata is the same dict object as books_map[book_id], so update it in place
    metadata["awards"] = string_to_list(metadata["awards"], awards=True)
    metadata["genres"] = string_to_list(metadata["genres"], awards=False)
    metadata["setting"] = string_to_list(metadata["setting"], awards=False)

In [53]:
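# Write one JSON object per line, each of the form {book_id: metadata}.
# JSON object keys are always strings, so the integer book_id is written as a string.
with open('item_profiles.txt', 'w') as outfile:
    for book_id, metadata in books_map.items():
        json.dump({book_id: metadata}, outfile)
        outfile.write('\n')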

books_map is now our set of item profiles; item_profiles.txt holds the same profiles, one JSON object per line.
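
To reuse the saved profiles later, the file can be read back line by line. The following is a minimal sketch, not part of the original notebook: it writes a toy `profiles` dictionary (invented for illustration) to a temporary file in the same one-object-per-line layout, then rebuilds the dictionary. It assumes the book ids should be integers again after loading.

```python
import json
import os
import tempfile

# Toy profiles, invented for illustration; the real ones live in books_map
profiles = {
    1: {"title": "A", "genres": ["Fiction"]},
    2: {"title": "B", "genres": []},
}

# Write in the same layout as item_profiles.txt: one {book_id: metadata} per line
path = os.path.join(tempfile.mkdtemp(), "item_profiles.txt")
with open(path, "w") as outfile:
    for book_id, metadata in profiles.items():
        json.dump({book_id: metadata}, outfile)
        outfile.write("\n")

# Read the file back: each line is a JSON object with a single key
loaded = {}
with open(path) as infile:
    for line in infile:
        for key, metadata in json.loads(line).items():
            loaded[int(key)] = metadata  # JSON keys are strings, so convert back

print(len(loaded))
```

The `int(key)` conversion matters because lookups such as `books_map[14796360]` use integer ids, while JSON stores the keys as strings.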