# EAs Goodreads Analysis
### Questions:
- ~~what are most popular books~~
- ~~what are most to-read but not read books~~
- ~~highest/lowest rated~~
- average books read per year
- books that many people read before EA was a thing
- ~~books that are relatively fringe but read by EAs~~
- express the numbers as percentages, too

### ToDo:
- for a few profiles, e.g. "117194676" and "52226471", my program didn't return any books even though the profiles are public
    - I might have messed something up when I split the scraping into multiple sessions

### Scraping ethics
- as far as I can see, none of the pages I scrape are disallowed by https://www.goodreads.com/robots.txt
    - (weirdly, there seem to be ~200 individual books that are not allowed to be scraped)
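
The manual robots.txt check above could also be automated. Here is a minimal sketch using the standard library's `urllib.robotparser`. The rules in `SAMPLE_ROBOTS_TXT` are a made-up excerpt for illustration, not the real Goodreads file, and `make_checker` and `is_allowed` are hypothetical helper names.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical excerpt for illustration only; the real rules live at
# https://www.goodreads.com/robots.txt and may look different.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /book/show/12345
Disallow: /api
"""


def make_checker(robots_txt):
    """Build a parser from the text of a robots.txt file."""
    rp = RobotFileParser()
    rp.parse(robots_txt.splitlines())
    return rp


def is_allowed(rp, url, user_agent="*"):
    """Return True if the rules permit user_agent to fetch url."""
    return rp.can_fetch(user_agent, url)


checker = make_checker(SAMPLE_ROBOTS_TXT)
members_ok = is_allowed(
    checker,
    "https://www.goodreads.com/group/151274-effective-altruists/members",
)
book_ok = is_allowed(checker, "https://www.goodreads.com/book/show/12345")
print(members_ok, book_ok)
```

To check against the live file instead, one would call `rp.set_url(...)` and `rp.read()` rather than `rp.parse(...)`.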

In [111]:
from bs4 import BeautifulSoup
import datetime
from dateutil import parser
import lxml
import pandas as pd
import random
import requests
import time

In [198]:
url = 'https://www.goodreads.com/group/151274-effective-altruists/members'
#response = requests.get(url)
#response

In [199]:
#soup = BeautifulSoup(response.text, "html.parser")

### Get all user sites

In [16]:
users = []
for i in range(1,13):
    time.sleep(random.uniform(16,34))
    response = requests.get(url + "?page=" + str(i))
    soup = BeautifulSoup(response.text, "html.parser")
    all_tags = soup.findAll("a", attrs={"class": "userName"})
    all_refs = []
    for tag in all_tags:
        all_refs.append(tag.get("href"))
    users = users + all_refs

In [18]:
len(users)

333

In [19]:
#df_users = pd.DataFrame(users, columns=["URL"])
#df_users["ID"] = df_users["URL"].str.replace(r"\D", '', regex=True)
#df_users.to_csv("GoodreadsEAs.csv")
df_users = pd.read_csv("GoodreadsEAs.csv")
df_users.head()

Unnamed: 0,URL,ID
0,/user/show/85957352-kiim,85957352
1,/user/show/92096951-max,92096951
2,/user/show/38695642-fin-moorhouse,38695642
3,/user/show/5583842-radovan-kavick,5583842
4,/user/show/124437386-fiona,124437386
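
One caveat about deriving `ID` by stripping every non-digit: a username that itself contains digits would leak into the ID. A sketch of a stricter extraction, anchored on the `/user/show/` prefix, is below. `extract_user_id` is a hypothetical helper, and the third href is an invented example rather than a real group member.

```python
import re

# Matches only the numeric part directly after "/user/show/".
USER_HREF = re.compile(r"^/user/show/(\d+)")


def extract_user_id(href):
    """Return the numeric user ID as a string, or None if href doesn't match."""
    match = USER_HREF.match(href)
    return match.group(1) if match else None


hrefs = [
    "/user/show/85957352-kiim",
    "/user/show/5583842-radovan-kavick",
    "/user/show/123-reader42",  # invented example: digits in the username
]
ids = [extract_user_id(h) for h in hrefs]
print(ids)
```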


### Get books
#### Plan
- go through each EA one by one
- go through their books
- save
    - read or not
    - if rated: rating
    - if read: (first) date it was read
- data structure:
    - one df for each user
    - one row for each book
    - columns for read/to-read, rating, date
    

#### How to navigate along the page
- Format of list of books: https://www.goodreads.com/review/list/{userID}
- Master list doesn't have reading status info, so it seems I have to go separately through "read", "currently reading" and "to read"
    - ?shelf=read
    - ?shelf=currently-reading
    - ?shelf=to-read
- &page={i}
    - starting with 1
    - ends with "No books matching"
- &per_page=100
    - Goodreads gives me 30 books per page no matter what, weird
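
The URL scheme described above can be sketched as a small helper. The names `shelf_page_url` and `is_end_of_shelf` are hypothetical, and the end-of-shelf check simply looks for the "No books matching" text mentioned above.

```python
from urllib.parse import urlencode

BASE = "https://www.goodreads.com/review/list/"
SHELVES = ("read", "currently-reading", "to-read")


def shelf_page_url(user_id, shelf, page):
    """Build the URL for one page (1-based) of one shelf of one user."""
    if shelf not in SHELVES:
        raise ValueError(f"unknown shelf: {shelf!r}")
    if page < 1:
        raise ValueError("pages start at 1")
    query = urlencode({"shelf": shelf, "page": page})
    return f"{BASE}{user_id}?{query}"


def is_end_of_shelf(page_text):
    """True if the page text says there are no more books on this shelf."""
    return "No books matching" in page_text


example_url = shelf_page_url("68316850", "currently-reading", 2)
print(example_url)
```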

In [153]:
def extract_books(soup, userID, shelf):
    """Gets soup of whole page, returns df with books of that page."""
    all_books = soup.findAll("tr", attrs={"class": "bookalike review"})
    
    titles = []
    alt_titles = []
    authors = []
    avg_ratings = []
    num_reviews_list = []
    ratings = []
    dates_added = []
    dates_read = []
    
    
    for book in all_books:
        title = book.findAll("td", attrs={"class": "field title"})[0].findAll("a")[0].string
        titles.append(title)

        alt_title = book.findAll("td", attrs={"class": "field title"})[0].findAll("a")[0]["title"]
        alt_titles.append(alt_title)

        author = book.findAll("td", attrs={"class": "field author"})[0].findAll("a")[0].string
        authors.append(author)

        avg_rating = book.findAll("td", attrs={"class": "avg_rating"})[0].findAll("div")[0].string
        avg_rating = float(avg_rating)
        avg_ratings.append(avg_rating)

        num_reviews = book.findAll("td", attrs={"class": "field num_ratings"})[0].findAll("div")[0].string
        num_reviews = int(num_reviews.replace(",", ""))
        num_reviews_list.append(num_reviews)

        # rating - catching unrated books
        try:
            rating = book\
                    .findAll("td", attrs={"class": "field rating"})[0]\
                    .findAll("span", attrs={"class": "staticStars notranslate"})[0]["title"]
        except (KeyError, IndexError):  # no "title" attribute, or no stars span at all
            rating = None
        ratings.append(rating)

        # date added - catching undated books
        date_added = book.findAll("td", attrs={"class": "field date_added"})[0].findAll("span")[0].string
        try:
            date_added = parser.parse(date_added)
        except (TypeError, ValueError, OverflowError):  # missing or unparseable date
            date_added = "not set"
        dates_added.append(date_added)

        # date read - catching unread books
        date_read = book.findAll("td", attrs={"class": "field date_read"})[0].findAll("span")[0].string
        try:
            date_read = parser.parse(date_read)
        except (TypeError, ValueError, OverflowError):  # missing or unparseable date
            date_read = "not set"
        dates_read.append(date_read)

        
    # these all should be the same length
    assert len(titles) == len(alt_titles) == len(authors) == len(avg_ratings) == len(num_reviews_list) == len(ratings) == len(dates_added) == len(dates_read)
    
    d = {"userID": [userID]*len(titles), "shelf": [shelf]*len(titles), "title": titles, "alt_title": alt_titles, "author": authors, "avg_rating": avg_ratings, "num_reviews": num_reviews_list,
        "rating": ratings, "date_added": dates_added, "date_read": dates_read}
    
    return pd.DataFrame(d)    
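
`extract_books` stores the rating as the tooltip text of the stars (e.g. "liked it"). For numeric analysis, a mapping to star counts is needed. The sketch below assumes the usual Goodreads tooltip labels. That label set is an assumption not confirmed anywhere in this notebook, and `rating_to_stars` is a hypothetical helper.

```python
# Assumed mapping from star tooltip text to number of stars.
RATING_LABELS = {
    "did not like it": 1,
    "it was ok": 2,
    "liked it": 3,
    "really liked it": 4,
    "it was amazing": 5,
}


def rating_to_stars(label):
    """Convert a tooltip label to 1-5 stars; None for unrated or unknown labels."""
    if label is None:
        return None
    return RATING_LABELS.get(label.strip().lower())


stars = [rating_to_stars(x) for x in ["liked it", None, "It was amazing"]]
print(stars)
```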

In [207]:
d = {"userID": ["Test ID"], "shelf": "to-test", "title": ["Test Title"], "alt_title": ["Test alt. Title"], "author": ["Test Author"], "avg_rating": ["3.33"], "num_reviews": ["69"],
        "rating": ["liked it"], "date_added": ["March 3rd, 1933"], "date_read": ["May 4th, 1999"]}
df_books = pd.DataFrame(d)
#df_books
#df_books = pd.read_csv("GoodreadsEAs_books.csv")

### Go through all users
This got a little messy because Goodreads blocked my requests now and then, so I split the user IDs into multiple lists and scraped them in separate sessions.
To start this again from scratch, `to_scrape` should just be `list(set(df_users["ID"]))`.
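
A sketch of how the resume logic could be written in plain Python: compute the IDs that still need scraping and split them into session-sized batches. `remaining_ids` and `chunk` are hypothetical helpers. Normalizing everything to strings is a precaution, on the assumption that IDs read back from a CSV arrive as integers while the scraping loop works with strings.

```python
def remaining_ids(all_ids, scraped_ids):
    """Return the IDs in all_ids that are not in scraped_ids, as sorted strings."""
    scraped = {str(i) for i in scraped_ids}
    todo = {str(i) for i in all_ids} - scraped
    return sorted(todo, key=int)


def chunk(items, size):
    """Split items into consecutive lists of at most `size` elements."""
    return [items[i:i + size] for i in range(0, len(items), size)]


# IDs from the member list (ints) vs. IDs already present in the books table.
todo = remaining_ids([85957352, 92096951, 38695642], ["92096951", "Test ID"])
batches = chunk(todo, 60)
print(todo, batches)
```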

In [175]:
#to_scrape = list(set(df_users["ID"].astype(str)) - set(df_books["userID"].astype(str)))  # users not scraped yet
#test = pd.Series(to_scrape)
#test.to_csv("to_scrape.csv")
#to_scrape = pd.read_csv("to_scrape.csv")

In [240]:
#to_scrape1 = to_scrape[0:60]
#to_scrape2 = to_scrape[60:120]
#to_scrape3 = to_scrape[120:180]
#to_scrape4 = to_scrape[180:240]
#to_scrape5 = to_scrape[240:300]
#to_scrape6 = to_scrape[300:]
#missing = ["68316850"]

In [242]:
base = "https://www.goodreads.com/review/list/"
for userID in missing:
    print("\n", userID)
    userURL = base + userID
    for shelf in ["read", "currently-reading", "to-read"]:
        shelfURL = userURL + "?shelf=" + shelf
        for page in range(1, 200): # don't want to do a while-loop, the super bookworms require ~150 loops
            print(page, end="-")
            pageURL = shelfURL + "&page=" + str(page)
            # pageURL += "&per_page=100" # for some reason Goodreads returns 30 per page no matter what I call
            
            time.sleep(random.uniform(22,32))
            
            response = requests.get(pageURL)
            soup = BeautifulSoup(response.text, "html.parser")
            extracted = extract_books(soup, userID, shelf)
            df_books = pd.concat([df_books, extracted])  # DataFrame.append was removed in pandas 2.0
            
            if extracted.shape[0] < 5:
                break
    df_books.to_csv("books1.csv")
    time.sleep(random.uniform(12,24))


 68316850
1-2-3-4-5-6-7-8-9-10-11-12-13-14-15-16-17-18-19-20-21-22-23-24-25-26-27-28-29-30-31-32-33-34-35-36-37-38-39-40-1-1-2-3-4-5-6-7-8-9-10-11-12-13-14-15-16-17-18-19-20-21-22-23-24-25-26-27-28-29-30-31-32-33-34-35-36-37-38-39-40-41-42-43-44-