## Web Scraping Exercise: NYT Best Sellers List

This notebook is an exercise to learn and practice web scraping. I had no previous experience with web scraping, so this is very much a learning exercise! I wanted to create a small dataframe with the top books from a section of the [New York Times Best Sellers list](https://www.nytimes.com/books/best-sellers/). The main page shows the top 5 books in each category, broken down by book type and format (hardcover, paperback, e-book), so I decided to focus on the [Combined Print & E-Book Fiction](https://www.nytimes.com/books/best-sellers/combined-print-and-e-book-fiction/) page, which lists the top 15 books, to make things more interesting. In hindsight, I would probably have had an easier time with the Hardcover Fiction category, since most of the inconsistencies I had to work around came from the few e-books on the list.

### References

I referenced the following post ([https://ezzeddinabdullah.com/post/scrape-amazon-bestseller/](https://ezzeddinabdullah.com/post/scrape-amazon-bestseller/)) to help me get started with my import statements, requests, and the .find and .find_all methods.

### Reflection

Because I wanted to bring in additional information for each book, I made an extra request per book to get the HTML of its linked Amazon page. This slowed the run time significantly, but it let me bring in extra information such as the number of pages, language, publication date, and ISBNs for most of the books.
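
The per-book requests are the slow part, so while developing it can help to avoid downloading the same page twice in one session. Below is a minimal sketch of that idea. The helper `make_cached_fetcher` is hypothetical and not part of the notebook, and `fake_fetch` is a stand-in for the real download so the example runs offline.

```python
from typing import Callable, Dict


def make_cached_fetcher(fetch: Callable[[str], str]) -> Callable[[str], str]:
    """Wrap a fetch function so each URL is downloaded at most once (hypothetical helper)."""
    cache: Dict[str, str] = {}

    def cached_fetch(url: str) -> str:
        if url not in cache:
            cache[url] = fetch(url)  # only call the real fetcher on a cache miss
        return cache[url]

    return cached_fetch


# Stand-in for a real download so the sketch runs without network access.
calls = []


def fake_fetch(url: str) -> str:
    calls.append(url)
    return f"<html>page for {url}</html>"


get_html = make_cached_fetcher(fake_fetch)
get_html("https://example.com/book-1")
get_html("https://example.com/book-1")  # served from the cache
get_html("https://example.com/book-2")
print(len(calls))  # 2 downloads for 3 requests
```

This only helps when a cell is re-run within the same session; the first pass still needs one request per book.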

On each book's Amazon page, I was hoping to use the section near the top that lists this additional information. At first I thought that length, language, publisher, and publication date were always the first four items listed, followed by ISBNs for hardcover or paperback books, or by file size and other information for e-books. But even when I accounted for e-books and skipped the ISBNs, the first four items weren't always the same. So while I had originally hoped to just take the first four items, I opted instead to check each item's label with if statements to see what it was. Otherwise I would end up with the '5 Years and Up' recommended reading age for George R.R. Martin's Fire & Blood in the Pages column!
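
The label-matching idea can also be written as a dictionary lookup instead of a chain of if statements. This is only a sketch under assumptions: `LABEL_TO_COLUMN` and `pick_attributes` are hypothetical names, the `'Reading age'` label is a guess at how the page words it, and the sample values are made up for illustration rather than scraped.

```python
# Map each label of interest to the dataframe column it should fill.
# The first four labels are the ones the notebook's loop compares against.
LABEL_TO_COLUMN = {
    'Print length': 'Pages',
    'Language': 'Language',
    'Publisher': 'Publisher',
    'Publication date': 'Publication Date',
}


def pick_attributes(pairs):
    """Build one row from (label, value) pairs, ignoring labels we don't want."""
    row = {column: None for column in LABEL_TO_COLUMN.values()}
    for label, value in pairs:
        column = LABEL_TO_COLUMN.get(label)
        if column is not None:
            row[column] = value
    return row


# Illustrative values only; the order deliberately puts an unwanted item first.
sample = [
    ('Reading age', '5 years and up'),
    ('Print length', '100 pages'),
    ('Language', 'English'),
]
row = pick_attributes(sample)
print(row)
```

Because the lookup is by label rather than by position, an unexpected first item such as a reading age is skipped, and any attribute that never appears stays `None`.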

Similarly, I ran into issues with one book not listing the publisher in that section of the page. This caused errors when I tried to make my dataframe, because my list of publishers was shorter than the other lists. As a workaround, I append a null value when no publisher is found. Once that was in place, I decided to also grab the ISBNs for all the non-e-books listed, appending a null value for the e-books, where no ISBN is found.
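
When the lists do get out of step, it can be quicker to find the culprit before building the dataframe than to read the resulting error. A small sketch of such a check, where `find_short_columns` is a hypothetical helper and the sample lists are made up:

```python
def find_short_columns(columns):
    """Return {name: length} for every list shorter than the longest one."""
    longest = max((len(values) for values in columns.values()), default=0)
    return {
        name: len(values)
        for name, values in columns.items()
        if len(values) < longest
    }


# Made-up data: one publisher is missing, so that list is one item short.
columns = {
    'Title': ['A', 'B', 'C'],
    'Publisher': ['X', 'Y'],
    'Language': ['English', 'English', 'English'],
}
short = find_short_columns(columns)
print(short)  # {'Publisher': 2}
```

An empty result means every list has the same length and the dataframe can be built safely.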

In [94]:
from urllib.request import urlopen, Request
from bs4 import BeautifulSoup
import pandas as pd
import datetime as dt

In [100]:
# New York Times Best Sellers page - Combined Print & E-Book Fiction
url = 'https://www.nytimes.com/books/best-sellers/combined-print-and-e-book-fiction/'
h = {'User-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:53.0) Gecko/20100101 Firefox/53.0'}

request = Request(url, headers=h)
html = urlopen(request)

soup = BeautifulSoup(html, 'html.parser')

books = soup.find_all('li', class_='css-13y32ub')

In [101]:
ranks = []
titles = []
authors = []
weeks = []
descriptions = []
amazon_links = []
lengths = []
languages = []
publishers = []
publish_date = []
isbn_10s = []
isbn_13s = []


rank = 1

for book in books:
    ranks.append(rank)
    
    ## grab Title
    title = book.find('h3', class_='css-5pe77f').get_text().title()
    titles.append(title)
    
    ## grab Author
    author = book.find('p', class_='css-hjukut').get_text().replace('by ', '')
    authors.append(author)
    
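    ## grab how many weeks it's been on the best sellers list
    onlist = book.find('p', class_='css-1o26r9v').get_text()
    if onlist == 'New this week':
        w = 1
    else:
        ## text looks like '40 weeks on the list'; keep the leading number as an int
        ## (str.strip() removes a set of characters, not a suffix, so split instead)
        w = int(onlist.split()[0])
    weeks.append(w)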
    
    ## grab description
    blurb = book.find('p', class_='css-14lubdp').get_text()
    descriptions.append(blurb)
    

    ## grab link to amazon page to get further info...
    amazon_link = book.find('a', class_='css-114t425', href=True)['href']
   
    ## if amazon_link is http:// instead of https://, change it
    ## was getting HTTP Error 308: Permanent Redirect errors from a couple http:// links
    if amazon_link.startswith('http://'):
        amazon_link = amazon_link.replace('http://', 'https://', 1)
        
    ## first four will NOT always be length, language, publisher, publication date 
    ## so I had to iterate through and compare with if statements
    request = Request(amazon_link, headers=h)
    html = urlopen(request)
    book_page = BeautifulSoup(html, 'html.parser')
    cards = book_page.find_all('li', class_='a-carousel-card rpi-carousel-attribute-card')
    
    ##print(rank)
    
    ## not every page lists every attribute (one book didn't list the publisher
    ## in this section, and the e-book pages didn't list ISBNs), so start each
    ## value as None and overwrite it only when its label is found
    length = None
    language = None
    publisher = None
    published = None
    isbn_10 = None
    isbn_13 = None
    
    for card in cards:
        label = card.find('div', class_='a-section a-spacing-small a-text-center rpi-attribute-label').get_text().strip()
        attribute = card.find('div', class_='a-section a-spacing-none a-text-center rpi-attribute-value').get_text().strip()
        
        if label == 'Print length':
            length = attribute.replace(' pages', '')
        elif label == 'Language':
            language = attribute
        elif label == 'Publisher':
            publisher = attribute
        elif label == 'Publication date':
            published = pd.to_datetime(attribute)
        elif label == 'ISBN-10':
            isbn_10 = attribute
        elif label == 'ISBN-13':
            isbn_13 = attribute
    
    ## append exactly one value per book to every list; otherwise the lists
    ## end up with different lengths and the dataframe can't be built
    lengths.append(length)
    languages.append(language)
    publishers.append(publisher)
    publish_date.append(published)
    isbn_10s.append(isbn_10)
    isbn_13s.append(isbn_13)
                
    rank += 1
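

Two of the small text-cleaning steps in the loop above, the weeks count and the link scheme, can be pulled out into standalone functions that are easy to check without any network access. This is a sketch only: `parse_weeks` and `force_https` are hypothetical names, and the assumed input formats are 'New this week' and text like '40 weeks on the list'.

```python
import re


def parse_weeks(text):
    """Turn the weeks-on-list text into an int, or None if it isn't recognised."""
    text = text.strip()
    if text.lower() == 'new this week':
        return 1
    match = re.match(r'(\d+)\s+weeks?\b', text)
    return int(match.group(1)) if match else None


def force_https(url):
    """Swap a leading http:// for https://, leaving other URLs untouched."""
    if url.startswith('http://'):
        return 'https://' + url[len('http://'):]
    return url


print(parse_weeks('New this week'))         # 1
print(parse_weeks('40 weeks on the list'))  # 40
print(force_https('http://example.com/a'))  # https://example.com/a
```

Returning `None` for unrecognised text keeps one odd entry from stopping the whole run, at the cost of a missing value to check afterwards.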
    


In [98]:
top_15 = pd.DataFrame({
    'Rank': ranks,
    'Title': titles,
    'Author': authors,
    'Weeks on List': weeks,
    'Description': descriptions,
    'Pages': lengths,
    'Language': languages,
    'Publisher': publishers,
    'Publication Date': publish_date,
    'ISBN-10': isbn_10s,
    'ISBN-13': isbn_13s
})

top_15

| | Rank | Title | Author | Weeks on List | Description | Pages | Language | Publisher | Publication Date | ISBN-10 | ISBN-13 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | Fairy Tale | Stephen King | 1 | A high school kid inherits a shed that is a po... | 608 | English | Scribner | 2022-09-06 | 1668002175 | 978-1668002179 |
| 1 | 2 | Desperation In Death | J.D. Robb | 1 | The 55th book of the In Death series. Eve Dall... | 368 | English | St. Martin's Press | 2022-09-06 | 1250278236 | 978-1250278234 |
| 2 | 3 | Verity | Colleen Hoover | 40 | Lowen Ashleigh is hired by the husband of an i... | 331 | English | | 2018-12-10 | 1791392792 | 978-1791392796 |
| 3 | 4 | It Ends With Us | Colleen Hoover | 65 | A battered wife raised in a violent home attem... | 381 | English | Atria Books | 2016-08-02 | | |
| 4 | 5 | Where The Crawdads Sing | Delia Owens | 177 | In a quiet town on the North Carolina coast in... | 384 | English | G.P. Putnam's Sons | 2018-08-14 | 0735219095 | 978-0735219090 |
| 5 | 6 | A Court Of Silver Flames | Sarah J. Maas | 3 | The fifth book in the Court of Thorns and Rose... | 768 | English | Bloomsbury Publishing | 2021-02-16 | 168119628X | 978-1681196282 |
| 6 | 7 | Ugly Love | Colleen Hoover | 35 | Tate Collins and Miles Archer, an airline pilo... | 333 | English | Atria Books | 2014-08-05 | | |
| 7 | 8 | The Seven Husbands Of Evelyn Hugo | Taylor Jenkins Reid | 63 | A movie icon recounts stories of her loves and... | 400 | English | Washington Square Press | 2018-05-29 | 1501161938 | 978-1501161933 |
| 8 | 9 | November 9 | Colleen Hoover | 25 | Is Ben using his relationship with Fallon as f... | 314 | English | Atria Books | 2015-11-10 | | |
| 9 | 10 | The American Roommate Experiment | Elena Armas | 1 | A romance writer goes on experimental dates wi... | 400 | English | Atria Books | 2022-09-06 | 1668002779 | 978-1668002773 |


In [99]:
top_15.isna().sum()

Rank                0
Title               0
Author              0
Weeks on List       0
Description         0
Pages               0
Language            0
Publisher           1
Publication Date    0
ISBN-10             3
ISBN-13             3
dtype: int64