# Goodreads Data 

## Summary

### Goals
* Data collection 
* Data extraction from Goodreads
* Feature Extraction 

### Data Collection
The data was obtained from users choice for their favorite books in a following forum for udacity bertelsmann scholarship. Then the books were listed in the [Bertelsmann Data Science book](https://www.goodreads.com/group/show/603467-bertelsmann-data-science-book-readers)readers group under [udacians-favorites shelf](https://www.goodreads.com/group/bookshelf/603467-bertelsmann-data-science-book-readers?order=d&per_page=30&shelf=udacians-favorites&sort=date_added&view=main). 

### Goodreads API access 

* __what is an APIs?__ 
APIs are tools or the mean that allow us to connect as a third party directly to another code or data, in our case we are trying to access goodreads data.
* __what we need to access goodreads API?__ register for an API developer to access the API. this provides the developer with the authorization tokens required to access goodreads website and extract data such as user books, user reviews, authors details, book information such as title, generes and so on. some of the methods that require accessing user profiles or writing would require oauth2 authorization protocol to get access tokens which have limited lifetime. But for our application we won't need the access tokens. 
There are two options to access the API after acquiring the access: first to use  goodread directly api which return the data as xml and then one has to parse the data to the required format( an example code for this is provided below with function get_access_token). The second option is to use a ready python library which what we used here.
* __package used for goodreads API:__ we used a python package for goodreads API interface [link](url), [github](https://github.com/sefakilic/goodreads). 

* __goodreads API limitations:__ it does not provide the following: first, book generes listed in the book page. second, we can't access books listed in a goodreads group shelves. both of these were tackled as will be shown.   

In [1]:
CLIENT_ID = 'doDe9gQzQN1nhSLAkBAGQ'
CLIENT_SECRET = 'dZIiAKXZ1gRU1oWxxBy5ED1S6l78B65VgRV2RCS0'

In [2]:
import oauth2


In [3]:
# example code to access the Goodreads API 
def get_access_token(client_key,client_secret):
    url = 'http://www.goodreads.com'
    request_token_url = '%s/oauth/request_token' % url
    authorize_url = '%s/oauth/authorize' % url
    access_token_url = '%s/oauth/access_token' % url
    
    # consumer authentication using oauth2
    consumer = oauth2.Consumer(key= CLIENT_ID, secret= CLIENT_SECRET)
    client = oauth2.Client(consumer) #create the client
    # The OAuth Client request to get access token
    response, content = client.request(request_token_url, 'GET')
    if response['status'] != '200':
        raise Exception('Invalid response: %s' % response['status'])
    
    #get access token    
    request_token = dict(urllib.parse.parse_qsl(content))
    oauth_token_key,oauth_token_secret = request_token[b'oauth_token'], request_token[b'oauth_token_secret']
    return oauth_token_key, oauth_token_secret


# Step 1: Data extraction from Goodreads
as mentioned above. Extracting a group bookshelves through API is not possible so an alternative is to scrap the page using beautiful_soup library in python.
#### this step include:
* Extracting [Bertelsmann Data Science book](https://www.goodreads.com/group/show/603467-bertelsmann-data-science-book-readers) group bookshelves. which is not possible through the API , so an alternative was to scrap the page using beautiful_soup module.
* Read the book ids from the html table which had an id= ‘groupBooks’ then use the Goodreads API to extract each book details. Here we are creating a list of book links. an example of a book link is as follows 
https://www.goodreads.com/book/show/68428.The_Final_Empire. we know from this link that each book id follows /show/
the id for the book is what comes after (https://www.goodreads.com/book/show/book_id). 

In [4]:
#extract the books list from goodreads group website 
from urllib.request import urlopen
from bs4 import BeautifulSoup
def extract_books_list():
    books_list = []
    BASE_URL = "https://www.goodreads.com/group/bookshelf/603467-bertelsmann-data-science-book-readers?order=d&per_page=200&shelf=udacians-favorites&sort=date_added&view=main"
    html = urlopen(BASE_URL).read()
    soup = BeautifulSoup(html,'html.parser')
    table = soup.find(id='groupBooks')
    for row in table.findAll("tr"):
        if row.td:
            td = row.findAll("td")[0]
            if td.find('a'):
                link = td.find('a')
                books_list.append(link.get('href'))
                continue
    return books_list

In [5]:
books_list = extract_books_list()
books_list[0]

'/book/show/68428.The_Final_Empire'

In [6]:
# imports
import pandas as pd
import re
import goodreads_api_client as gr

In [7]:
# create data frame to hold relevant info
books_df = pd.DataFrame({'link': books_list})

In [8]:
# extract book ids from link 
books_df['id'] = books_df['link'].apply(lambda x: re.findall(r'\d+', x)[0])

# Step 2: Feature Extraction

* we use the Goodreads API python interface libray to extract book attributes from the ids extracted in the previous step.
* save the books in a dictionary with the following attributes: title,  reviews_count, ratings_count, original_publication_year, first author name, first author average rating, second author name, second author average rating, third author name, third author average rating, genres.
* defining book genres

#### book genres 
* book genres in Goodreads are selected based on each book top shelves. So from the API we got the top shelves for each book and then these shelves were preprocessed to include shelf names representing a proper book genre.
* we used 4 types of filtering:
    * the first is the usage of stop-words or in other words shelves that don’t qualify as book genres such as  to-read, audiobook, read, book-club, read-2017,...etc this list was a created manually to add new words to the stop word. 
    * The second filtering was the synonym list (correction list ),so for example: child-book, children-books, children, childrens all will be listed under children genre as a way to minimize the data and genres redundancy. 
    * the third filtering is choosing 10 shelves as book genres. after this step all books will have the same number of books
    * the fourth filtering is choosing only top 25 common genres as book genres (minimize the overall number of genres in the list).

In [9]:
# connect. Set up connection to Good Reads
gr_client = gr.Client(developer_key=CLIENT_ID)

In [10]:
# This list of words will be used as a filtering mechansim to remove
# shelves that don't qualify as a book genre (filter 1)
# this code get the stop words from the csv file
import csv
STOP_WORDS = []
with open('stop_list.csv', 'r') as csvfile:
    reader = csv.reader(csvfile, delimiter=' ')
    for row in reader:
        STOP_WORDS.append(row[0].lower()) 

In [11]:
# This list of words will be used as a filtering mechansim to remove
# and group similar shelves (filter 2)
# this code get the synonym list from the csv file
STEM_DICT = {}
with open('corrections.csv', 'r') as csvfile:
    reader = csv.reader(csvfile, delimiter=',')
    STEM_DICT = {rows[0]:rows[2] for rows in reader}

In [12]:
# This function return the book attribute if it exists,
# otherwise it returns None
def try_or_none(obj, attribute):
    try: 
        result = obj[attribute]
        return result
    except: 
        return None

In [13]:
#Make the corrections in the file according to dicitonary 
def get_standardized_genres(genre):
    return STEM_DICT.get(genre) if genre in STEM_DICT else genre

In [14]:
# take the top n shelves from  a book 
# this function call the STEM_dicitionary and the stop words

from collections import OrderedDict
 
def take_top_n_meaningful_shelves(shelves: list, n: int) -> list:
    result = []
    for x in shelves: 
        if (try_or_none(x,'@name').lower() not in STOP_WORDS and try_or_none(x,'@name') is not None):
            result.append(get_standardized_genres(try_or_none(x, '@name')))
    result = list(OrderedDict.fromkeys(result))
    return result[:n]


In [15]:
# this function is used to extract the fields from the book
NUMBER_OF_FIELDS_FOR_AUTHOR = 9
def extract_fields(book,num_genres):
    book_id = int(book['id'])
    title = book['title']
    reviews_count = try_or_none(book['work']['reviews_count'],'#text')
    ratings_count = try_or_none(book['work']['ratings_count'],'#text')
    publication_year = try_or_none(book['work']['original_publication_year'],'#text')
    average_rating = float(book['average_rating'])
    ratings_sum = try_or_none(book['work']['ratings_sum'],'#text')
    ratings_count = try_or_none(book['work']['ratings_count'],'#text')
    if (ratings_sum is not None) and (ratings_count is not None) and (int(ratings_count) > 0):
        alt_averate_rating = int(ratings_sum)/int(ratings_count)
    else: 
        alt_averate_rating = None
    n_authors = len(book['authors']['author'])
    single_author = False
    if n_authors == NUMBER_OF_FIELDS_FOR_AUTHOR:
        single_author = True
    if n_authors > 0:
        if single_author:
            author1 = book['authors']['author']
        else:
            author1 = book['authors']['author'][0]
        author1_name = author1['name']
        author1_role = author1['role']
        author1_average_rating = author1['average_rating']
    else: 
        author1 = None
        author1_name = None
        author1_role = None
        author1_average_rating = None
    if (not single_author) and (n_authors > 1):
        author2 = book['authors']['author'][1]
        author2_name = author2['name']
        author2_role = author2['role']
        author2_average_rating = author2['average_rating']
    else: 
        author2 = None
        author2_name = None
        author2_role = None
        author2_average_rating = None
    if (not single_author) and (n_authors > 2):
        author3 = book['authors']['author'][2]
        author3_name = author3['name']
        author3_role = author3['role']
        author3_average_rating = author3['average_rating']
    else:
        author3 = None
        author3_name = None
        author3_role = None
        author3_average_rating = None 
    if (try_or_none(book, 'popular_shelves') is not None) and (try_or_none(book['popular_shelves'], 'shelf') is not None):
        genres = ','.join(take_top_n_meaningful_shelves(book['popular_shelves']['shelf'],num_genres))
    else:
        genres = None                                      
    result = pd.DataFrame({'id': [book_id],
                           'title': [title],
                           'reviews_cnt': [reviews_count],
                           'ratings_cnt': [ratings_count],
                           'pub_year': [publication_year],
                           'avg_rating': [average_rating],
                           'alt_avg_rating': [alt_averate_rating],
                           'author1_name': [author1_name],
                           'author1_role': [author1_role], 
                           'author1_avg_rating': [author1_average_rating],
                           'author2_name': [author2_name],
                           'author2_role': [author2_role], 
                           'author2_avg_rating': [author2_average_rating],
                           'author3_name': [author3_name],
                           'author3_role': [author3_role], 
                           'author3_avg_rating': [author3_average_rating],
                           'genres': [genres] 
    })
    return result

In [16]:
# this function call the extract field function for each book in the list of fields
def build_books_data_frame(num_genres):
    book = gr_client.Book.show(books_df.id[0])
    result = extract_fields(book,num_genres)
    for book_id in books_df.id[1:]:
        book = gr_client.Book.show(book_id)
        row = extract_fields(book,num_genres)
        result = pd.concat([result, row])
    return result

In [17]:
df = build_books_data_frame(10)

In [18]:
df.head()

Unnamed: 0,id,title,reviews_cnt,ratings_cnt,pub_year,avg_rating,alt_avg_rating,author1_name,author1_role,author1_avg_rating,author2_name,author2_role,author2_avg_rating,author3_name,author3_role,author3_avg_rating,genres
0,68428,"The Final Empire (Mistborn, #1)",508741,273394,2006.0,4.44,4.435617,Brandon Sanderson,,4.38,,,,,,,"fantasy,fiction,epic-fantasy,magic,brandon-san..."
0,37424706,The Art of Gathering: How We Meet and Why It M...,2273,98,,4.17,4.173469,Priya Parker,,4.19,,,,,,,"non-fiction,economics,personal-development,to-..."
0,117833,The Master and Margarita,329523,176610,1967.0,4.31,4.312202,Mikhail Bulgakov,,4.25,Katherine Tiernan O'Connor,Translator,4.17,Ellendea Proffer,Annotations and Afterword,4.09,"fiction,classics,russian,fantasy,russia,litera..."
0,18632929,Kaip atpažinti psichopatą,299310,97864,2012.0,3.92,3.92275,Jon Ronson,,3.9,Linas Vasara,Translator,4.0,,,,"non-fiction,psychology,science,Mental-illness,..."
0,1953,A Tale of Two Cities,1247849,713767,1859.0,3.82,3.821206,Charles Dickens,,3.87,Richard Maxwell,"Editor, Introduction",3.81,Hablot Knight Browne,Illustrator,3.83,"classics,fiction,literature,history,novel,brit..."


In [19]:
df.to_csv('gr_books_v2.csv', index=False)

In [20]:
# extract most common genres_list dictionary 
from collections import Counter
genres_dict = {}
def get_most_common_genres(df):
    genres_list =[genre for genre_list in df['genres'] for genre in genre_list.split(',')]
    genres_dict = (Counter(genres_list).most_common())
    return genres_dict
genres_dict = get_most_common_genres(df)    

In [21]:
# new dataframe with number of selected most common genres
# so now the books genre list will limited to the
# subset of 25 most common genres
def update_df_with_most_common_genres(original_df, most_common_num, genres_list):
    genres_to_keep = [key for key,value in Counter(genres_list).most_common(most_common_num)[:]]
    df2 = original_df.copy()
    df2.reset_index(inplace=True)
    for index,row in df2.iterrows():
        original_genres = row['genres'].split(',')
        new_genres = [genre for genre in original_genres if genre in genres_to_keep]
        df2.at[index,'genres']=','.join (new_genres)
    return df2

In [23]:
genres_list = [genre for genre_list in df['genres'] for genre in genre_list.split(',')]
df2 = update_df_with_most_common_genres(df, 25, genres_list)
df2.head()

Unnamed: 0,index,id,title,reviews_cnt,ratings_cnt,pub_year,avg_rating,alt_avg_rating,author1_name,author1_role,author1_avg_rating,author2_name,author2_role,author2_avg_rating,author3_name,author3_role,author3_avg_rating,genres
0,0,68428,"The Final Empire (Mistborn, #1)",508741,273394,2006.0,4.44,4.435617,Brandon Sanderson,,4.38,,,,,,,"fantasy,fiction,magic,adult,science-fiction,yo..."
1,0,37424706,The Art of Gathering: How We Meet and Why It M...,2273,98,,4.17,4.173469,Priya Parker,,4.19,,,,,,,"non-fiction,economics,personal-development"
2,0,117833,The Master and Margarita,329523,176610,1967.0,4.31,4.312202,Mikhail Bulgakov,,4.25,Katherine Tiernan O'Connor,Translator,4.17,Ellendea Proffer,Annotations and Afterword,4.09,"fiction,classics,fantasy,literature,novel"
3,0,18632929,Kaip atpažinti psichopatą,299310,97864,2012.0,3.92,3.92275,Jon Ronson,,3.9,Linas Vasara,Translator,4.0,,,,"non-fiction,psychology,science"
4,0,1953,A Tale of Two Cities,1247849,713767,1859.0,3.82,3.821206,Charles Dickens,,3.87,Richard Maxwell,"Editor, Introduction",3.81,Hablot Knight Browne,Illustrator,3.83,"classics,fiction,literature,history,novel,british"


In [24]:
# map the genres into new columns 
# create new columns for the genres 
columns = ['Genre_1','Genre_2','Genre_3','Genre_4','Genre_5','Genre_6','Genre_7','Genre_8','Genre_9','Genre_10','Genre_11']
df3 = df2.copy()
df55 = df['genres'].str.split(',', expand=True)
# df3[columns] = pd.DataFrame(df2['genres'].str.split(',').tolist()) this not working ??
column_dict = {i:columns[i]  for i in  df55.columns.values}
df55.rename(index=str, columns=column_dict)
df55.reset_index(inplace=True)
df3[columns] =pd.DataFrame(df55,columns = columns)

In [25]:
df3.head()

Unnamed: 0,index,id,title,reviews_cnt,ratings_cnt,pub_year,avg_rating,alt_avg_rating,author1_name,author1_role,...,Genre_2,Genre_3,Genre_4,Genre_5,Genre_6,Genre_7,Genre_8,Genre_9,Genre_10,Genre_11
0,0,68428,"The Final Empire (Mistborn, #1)",508741,273394,2006.0,4.44,4.435617,Brandon Sanderson,,...,,,,,,,,,,
1,0,37424706,The Art of Gathering: How We Meet and Why It M...,2273,98,,4.17,4.173469,Priya Parker,,...,,,,,,,,,,
2,0,117833,The Master and Margarita,329523,176610,1967.0,4.31,4.312202,Mikhail Bulgakov,,...,,,,,,,,,,
3,0,18632929,Kaip atpažinti psichopatą,299310,97864,2012.0,3.92,3.92275,Jon Ronson,,...,,,,,,,,,,
4,0,1953,A Tale of Two Cities,1247849,713767,1859.0,3.82,3.821206,Charles Dickens,,...,,,,,,,,,,
