# Amazon web scraping: books for kids

# 1. Aims, objectives and background

## 1.1. Introduction

Children's books play a crucial role in nurturing young minds and fostering a love of reading. Online marketplaces such as Amazon have made a vast selection of children's books easier to access than ever before.

This project aims to explore and gather valuable information about children's books available on Amazon. By leveraging web scraping techniques and data analysis, we can delve into various aspects of these books, including their genres, ratings, reviews, and popularity.

## 1.2. Aims and objectives

The objective of this project is to create a comprehensive dataset that provides insights into children's books on Amazon. By extracting information such as book titles, authors, publication dates, target age, descriptions, language, and prices for the different formats (hardcover, paperback, and Kindle), we can gain a deeper understanding of the landscape of children's literature. Additionally, we will analyze customer reviews and ratings to evaluate the reception and quality of these books.

In [1]:
import pandas as pd
import numpy as np
import time
from selenium import webdriver
from webdriver_manager.chrome import ChromeDriverManager
from selenium.webdriver.common.by import By

## Get the books' attributes

In [2]:
# Selenium 4 expects the driver path to be wrapped in a Service object
from selenium.webdriver.chrome.service import Service

driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()))


In [3]:
books_list = []

for pageNo in range(1, 10):
    print(pageNo)
    # Get the url page
    page_url = 'https://www.amazon.es/-/pt/gp/bestsellers/books/902621031/ref=zg_bs_pg_'+str(pageNo)+'?ie=UTF8&pg='+str(pageNo)
    driver.get(page_url)

    if pageNo == 1:
        time.sleep(3)
        driver.find_element(By.CLASS_NAME, 'a-button-inner').click()

    # Collect the href of every link on the page, skipping links without one
    books = driver.find_elements(By.CLASS_NAME, 'a-link-normal')
    books_link = [b.get_attribute('href') for b in books]
    books_link = [href for href in books_link if href]

    # Drop duplicates (keeping order), then split review links from product links
    books_link = list(dict.fromkeys(books_link))
    books_reviews_links = [r for r in books_link if 'reviews' in r]
    books_link = [r for r in books_link if 'reviews' not in r]

    characteristics_list = []
    
    for link in books_link:
        # Get the attributes of each book
        driver.get(link)
        title = driver.find_elements(By.ID, 'productTitle')     
        score = driver.find_elements(By.ID, 'acrPopover')
        Num_reviews = driver.find_elements(By.ID, 'acrCustomerReviewText')
        age = driver.find_elements(By.ID, 'rpi-attribute-book_details-customer_recommended_age')
        pages = driver.find_elements(By.ID, 'rpi-attribute-book_details-ebook_pages')
        pages2 = driver.find_elements(By.ID, 'rpi-attribute-book_details-fiona_pages')
        language = driver.find_elements(By.ID, 'rpi-attribute-language')
        date = driver.find_elements(By.ID, 'rpi-attribute-book_details-publication_date')
        price_tapa_dura =  driver.find_elements(By.ID, 'a-autoid-2')
        price_tapa_blanda =  driver.find_elements(By.ID, 'a-autoid-3')
        price_kindle = driver.find_elements(By.ID, 'a-autoid-1')
        
        # If a book is missing any of the attributes above, store NaN instead.
        # This keeps the dataframe consistent and simplifies later analysis.
        
        if price_tapa_dura == []:
            tapa_dura = np.nan
        else:
            tapa_dura = price_tapa_dura[0].text

        if price_tapa_blanda == []:
            tapa_blanda = np.nan
        else:
            tapa_blanda = price_tapa_blanda[0].text

        if price_kindle == []:
            kindle = np.nan
        else:
            kindle = price_kindle[0].text

        if language == []:
            lang = np.nan
        else:
            lang = language[0].text

        if pages == [] and pages2 == []:
            pag = np.nan
        elif pages == []:
            pag = pages2[0].text
        else:
            pag = pages[0].text

        if age == []:
            edad = np.nan
        else:
            edad = age[0].text

        if date == []:
            dat = np.nan
        else:
            dat = date[0].text

        if score == []:
            sc = np.nan
        else:
            sc = score[0].get_attribute('title')

        if Num_reviews == []:
            reviews = np.nan
        else:
            reviews = Num_reviews[0].text

    

        # Create a dictionary with the book's attributes
        books_list.append({'Title': title[0].text if title else np.nan,
                                    'Score': sc,
                                    'Num_reviews':reviews,
                                    'Age': edad,
                                    'Pages': pag,
                                    'Language': lang,
                                    'Date': dat,
                                    'Tapa_Dura': tapa_dura,
                                    'Tapa_Blanda': tapa_blanda,
                                    'Kindle': kindle                              
                                    })
    
    
    

1
2
3
4
5
6
7
8
9
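
The repeated if/else blocks in the scraping loop all follow the same pattern: take the first matching element, or fall back to NaN. A minimal sketch of a helper that captures this pattern is shown below. The names `first_text` and `FakeElement` are hypothetical and not part of the original notebook; `FakeElement` only stands in for a Selenium element so the example runs without a browser.

```python
import math


def first_text(elements, attribute=None, default=math.nan):
    """Return the text (or an attribute) of the first element, or `default` if the list is empty."""
    if not elements:
        return default
    first = elements[0]
    return first.get_attribute(attribute) if attribute else first.text


class FakeElement:
    """Stand-in for a Selenium WebElement, used only to demonstrate the helper."""

    def __init__(self, text, title=None):
        self.text = text
        self._title = title

    def get_attribute(self, name):
        return self._title if name == "title" else None


# Same three cases as in the loop: plain text, an attribute, and a missing element
language = first_text([FakeElement("Idioma\nEspañol")])
score = first_text([FakeElement("", title="4.7 de 5 estrelas")], attribute="title")
missing = first_text([])
```

With such a helper, a line like `lang = first_text(language)` could replace each four-line if/else block, assuming the lists come from `driver.find_elements` as in the loop above.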


In [89]:
# Create a dataframe with the books
books_df = pd.DataFrame(books_list)

In [90]:
books_df.head()

Unnamed: 0,Title,Score,Num_reviews,Age,Pages,Language,Date,Tapa_Dura,Tapa_Blanda,Kindle
0,Aprender a leer en la Escuela de Monstruos 1 -...,4.7 de 5 estrelas,"1,020 avaliações",,,,,Seguir,"Kindle\n3,32 €",Vender na Amazon
1,Encuentra tu persona vitamina (F. COLECCION),4.6 de 5 estrelas,"4,090 avaliações",Idade de leitura\nIdade sugerida pelo cliente:...,Comprimento da Impressão\n328 páginas,Idioma\nEspañol,,Seguir,"Kindle\n8,54 €",Vender na Amazon
2,Tu cuerpo es tuyo (11ªED) (ESPAÑOL SOMOS8),4.7 de 5 estrelas,571 avaliações,Idade de leitura\nIdade sugerida pelo cliente:...,Comprimento da Impressão\n40 páginas,Idioma\nEspañol,,,,
3,My Hero Academia nº 35 (Manga Shonen),,,Idade de leitura\n9 anos ou mais,Comprimento da Impressão\n192 páginas,Idioma\nEspañol,,"Tapa blanda\n7,70 €",,Vender na Amazon
4,Animales de la granja (Mi primer libro de pega...,4.5 de 5 estrelas,102 avaliações,Idade de leitura\nIdade sugerida pelo cliente:...,Comprimento da Impressão\n8 páginas,Idioma\nEspañol,,,,


## Cleaning the dataset

In [92]:
# Remove the extra text in Score, Num_reviews, Age, Pages, Language and Date,
# such as 'de 5 estrelas', 'avaliações', 'Idioma', etc.
# regex=True makes '|' act as alternation (the default is regex=False since pandas 2.0)
books_df['Score'] = books_df['Score'].str.replace('de 5 estrelas| ', '', regex=True)
books_df['Num_reviews'] = books_df['Num_reviews'].str.replace(',|avaliações| ', '', regex=True)

pattern = '|'.join(["Idade de leitura\n",
                    'Idade sugerida pelo cliente',
                    ':'])
books_df['Age'] = books_df['Age'].str.replace(pattern, '', regex=True)

pattern = '|'.join(["Comprimento da Impressão\n",
                    ' ',
                    "páginas"])

books_df['Pages'] = books_df['Pages'].str.replace(pattern, '', regex=True)
books_df['Language'] = books_df['Language'].str.replace(" |Idioma\n", '', regex=True)
books_df['Date'] = books_df['Date'].str.replace(" |Data de publicação\n", '', regex=True)
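
As an alternative to stripping fixed substrings, the numeric part of each field can be extracted directly with a regular expression. The sketch below uses only the standard library; `parse_score` and `parse_count` are hypothetical helper names, and the sample strings are taken from the raw table shown earlier.

```python
import re


def parse_score(raw):
    """Extract the leading rating from a string such as '4.7 de 5 estrelas'."""
    match = re.match(r"\s*(\d+(?:[.,]\d+)?)", raw)
    return float(match.group(1).replace(",", ".")) if match else None


def parse_count(raw):
    """Extract the first integer, ignoring thousands separators."""
    match = re.search(r"\d[\d.,]*", raw)
    if not match:
        return None
    return int(re.sub(r"[.,]", "", match.group(0)))


score = parse_score("4.7 de 5 estrelas")
reviews = parse_count("1,020 avaliações")
pages = parse_count("Comprimento da Impressão\n328 páginas")
```

Assuming the columns still hold the raw strings, these functions could be applied with, for example, `books_df['Score'].map(parse_score, na_action='ignore')`, which leaves missing values untouched.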


Check the unique values in the Tapa_Dura, Tapa_Blanda and Kindle columns

In [93]:
unique_values = books_df[['Tapa_Dura', 'Tapa_Blanda', 'Kindle']].values.flatten()
unique_values = pd.unique(unique_values)
unique_values

array(['Seguir', 'Kindle\n3,32 €', 'Vender na Amazon', 'Kindle\n8,54 €',
       nan, 'Tapa blanda\n7,70 €', '', 'Kindle\n7,59 €', 'Kindle\n1,49 €',
       'Kindle\n6,64 €', 'Kindle\n5,69 €', 'Tapa dura\n19,04 €',
       'Tapa blanda\n17,37 €', 'Tapa dura\n5,76 €', 'Tapa dura\n10,60 €',
       'Kindle\n0,00 €', 'Tapa dura\ndesde 1,99 €', 'Tapa dura\n185,11 €',
       'Livro em cartão\n9,64 €', 'Tapa dura\n12,09 €',
       'Tapa dura\n13,16 €', 'Tapa dura\n12,54 €', 'Kindle\n3,79 €',
       'Tapa dura\n7,70 €', 'Tapa blanda\ndesde 12,30 €',
       'Kindle\n4,74 €', 'Tapa dura\n9,64 €', 'Tapa blanda\n12,54 €',
       'Tapa blanda\n10,60 €', 'Tapa dura\ndesde 15,95 €',
       'Tapa dura\n19,31 €', 'Livro em cartão\n10,60 €',
       'Tapa blanda\n7,12 €', 'Tapa dura\ndesde 2,80 €',
       'Tapa blanda\n1,98 €', 'Tapa blanda\n3,81 €',
       'Tapa blanda\ndesde 13,95 €', 'Tapa dura\ndesde 13,25 €',
       'Tapa dura\n24,16 €', 'Livro em cartão\n12,54 €',
       'Tapa blanda\n6,73 €', 'Kindle

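Create new columns with the corresponding price for Tapa Dura (hardcover), Tapa Blanda (paperback) and Kindle. The scraped button texts do not always land in the column named after their format (in the first row, for example, the Kindle price sits in Tapa_Blanda), so each new column searches all three raw columns for its label.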

In [94]:
books_df['Price_tapa_dura'] = np.where(books_df['Tapa_Dura'].str.contains('Tapa dura') , books_df['Tapa_Dura'], 
                                       np.where(books_df['Tapa_Blanda'].str.contains('Tapa dura'), books_df['Tapa_Blanda'],
                                       np.where(books_df['Kindle'].str.contains('Tapa dura'), books_df['Kindle'], '0')))

books_df['Price_tapa_blanda'] = np.where(books_df['Tapa_Dura'].str.contains('Tapa blanda|Livro em cartão') , books_df['Tapa_Dura'], 
                                       np.where(books_df['Tapa_Blanda'].str.contains('Tapa blanda|Livro em cartão'), books_df['Tapa_Blanda'],
                                       np.where(books_df['Kindle'].str.contains('Tapa blanda'), books_df['Kindle'], '0')))

books_df['Price_Kindle'] = np.where(books_df['Tapa_Dura'].str.contains('Kindle') , books_df['Tapa_Dura'], 
                                       np.where(books_df['Tapa_Blanda'].str.contains('Kindle'), books_df['Tapa_Blanda'],
                                       np.where(books_df['Kindle'].str.contains('Kindle'), books_df['Kindle'], '0')))
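
The same lookup can be written as a small function that scans the three raw cells of a row for a format label and parses the price in one step. This is only a sketch under stated assumptions: `price_for` and `PRICE_RE` are hypothetical names, prices are assumed to use a comma as the decimal separator with no thousands separator, and a missing format yields `None` rather than the `'0'` used in the cell above.

```python
import re

# A number with an optional decimal comma, followed by the euro sign
PRICE_RE = re.compile(r"(\d+(?:,\d+)?)\s*€")


def price_for(cells, labels):
    """Return the price from the first cell that starts with one of `labels`, else None."""
    for cell in cells:
        if not isinstance(cell, str):
            continue  # skip NaN / missing cells
        if cell.startswith(tuple(labels)):
            match = PRICE_RE.search(cell)
            if match:
                return float(match.group(1).replace(",", "."))
    return None


# Raw values copied from the unique-values output above
row = ["Seguir", "Kindle\n3,32 €", "Vender na Amazon"]
kindle_price = price_for(row, ["Kindle"])
hardcover_price = price_for(row, ["Tapa dura"])
paperback_price = price_for(["Tapa blanda\ndesde 12,30 €", float("nan"), ""],
                            ["Tapa blanda", "Livro em cartão"])
```

Returning `None` for an absent format keeps "not sold in this format" distinct from a genuine price of 0.00 €, which does appear in the data for some Kindle editions.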

In [95]:
# Strip labels, currency symbols and spaces (regex patterns), then use '.' as the decimal separator
books_df['Price_tapa_dura'] = books_df['Price_tapa_dura'].str.replace("€|desde|Tapa dura\n| ", '', regex=True)
books_df['Price_tapa_dura'] = books_df['Price_tapa_dura'].str.replace(",", '.', regex=False)
books_df['Price_tapa_blanda'] = books_df['Price_tapa_blanda'].str.replace(" |€|desde|Tapa blanda\n|Livro em cartão\n", '', regex=True)
books_df['Price_tapa_blanda'] = books_df['Price_tapa_blanda'].str.replace(",", '.', regex=False)
books_df['Price_Kindle'] = books_df['Price_Kindle'].str.replace(" |€|Kindle\n", '', regex=True)
books_df['Price_Kindle'] = books_df['Price_Kindle'].str.replace(",", '.', regex=False)


Remove the extra columns

In [96]:
books_df.drop(['Tapa_Dura', 'Tapa_Blanda', 'Kindle'], axis=1, inplace=True)

In [97]:
books_df.head()

Unnamed: 0,Title,Score,Num_reviews,Age,Pages,Language,Date,Price_tapa_dura,Price_tapa_blanda,Price_Kindle
0,Aprender a leer en la Escuela de Monstruos 1 -...,4.7,1020.0,,,,,0.0,0.0,3.32
1,Encuentra tu persona vitamina (F. COLECCION),4.6,4090.0,13 anos ou mais,328.0,Español,,0.0,0.0,8.54
2,Tu cuerpo es tuyo (11ªED) (ESPAÑOL SOMOS8),4.7,571.0,3 - 6 anos,40.0,Español,,,,
3,My Hero Academia nº 35 (Manga Shonen),,,9 anos ou mais,192.0,Español,,0.0,7.7,0.0
4,Animales de la granja (Mi primer libro de pega...,4.5,102.0,1 - 2 anos,8.0,Español,,,,


Set the page count to 0 for books with no page information

In [98]:
books_df['Pages'] = np.where(books_df['Pages']=='', 0, books_df['Pages'] )

Change data type to float for the numeric columns

In [99]:
numeric_cols = ['Score', 'Num_reviews', 'Pages', 'Price_tapa_dura', 'Price_tapa_blanda', 'Price_Kindle']
books_df[numeric_cols] = books_df[numeric_cols].astype(float)

Check the publication dates

In [88]:
books_df['Date'].unique()

array(['', '20janeiro2023'], dtype=object)

In [100]:
books_df[books_df['Date']=='20janeiro2023']

Unnamed: 0,Title,Score,Num_reviews,Age,Pages,Language,Date,Price_tapa_dura,Price_tapa_blanda,Price_Kindle
136,I SPEAK ENGLISH! LIBRO PREESCOLAR MAXI: 110 pá...,4.9,30.0,,0.0,Español,20janeiro2023,,,


The Date column is empty for every book except one (published on 20 January 2023), so it is not a useful feature and we will exclude it.
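
Had more dates been scraped, the compact strings left after cleaning (such as `20janeiro2023`) could be turned into proper dates. The sketch below is one possible approach using only the standard library; `parse_pt_date` and `PT_MONTHS` are hypothetical names, and the day-month-year layout is an assumption based on the single value seen above.

```python
import re
from datetime import date

# Portuguese month names mapped to month numbers
PT_MONTHS = {
    "janeiro": 1, "fevereiro": 2, "março": 3, "abril": 4,
    "maio": 5, "junho": 6, "julho": 7, "agosto": 8,
    "setembro": 9, "outubro": 10, "novembro": 11, "dezembro": 12,
}


def parse_pt_date(raw):
    """Parse strings like '20janeiro2023' or '20 janeiro 2023'; return None if not recognised."""
    match = re.fullmatch(r"(\d{1,2})\s*([a-zç]+)\s*(\d{4})", raw.strip().lower())
    if not match:
        return None
    day, month_name, year = match.groups()
    month = PT_MONTHS.get(month_name)
    if month is None:
        return None
    return date(int(year), month, int(day))


published = parse_pt_date("20janeiro2023")
empty = parse_pt_date("")
```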

In [101]:
books_df.drop('Date', axis=1, inplace=True)

In [102]:
books_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 244 entries, 0 to 243
Data columns (total 9 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   Title              244 non-null    object 
 1   Score              236 non-null    float64
 2   Num_reviews        236 non-null    float64
 3   Age                231 non-null    object 
 4   Pages              240 non-null    float64
 5   Language           244 non-null    object 
 6   Price_tapa_dura    139 non-null    float64
 7   Price_tapa_blanda  139 non-null    float64
 8   Price_Kindle       139 non-null    float64
dtypes: float64(6), object(3)
memory usage: 17.3+ KB


Verify that the values of the features make sense: there should be no books with excessively high prices, an excessive number of pages, or negative numbers.

In [103]:
books_df.describe(include='all')

Unnamed: 0,Title,Score,Num_reviews,Age,Pages,Language,Price_tapa_dura,Price_tapa_blanda,Price_Kindle
count,244,236.0,236.0,231,240.0,244.0,139.0,139.0,139.0
unique,239,,,77,,4.0,,,
top,Cuentos Clásicos (Cuentos clásicos con pictogr...,,,9 anos ou mais,,,,,
freq,3,,,14,,132.0,,,
mean,,4.641525,1532.211864,,148.65,,4.553525,2.158633,2.484245
std,,0.253596,6941.732364,,200.156681,,16.586921,4.32955,3.040846
min,,2.3,1.0,,0.0,,0.0,0.0,0.0
25%,,4.6,134.0,,32.0,,0.0,0.0,0.0
50%,,4.7,416.0,,96.0,,0.0,0.0,0.0
75%,,4.8,1024.25,,192.0,,3.08,1.885,5.69


In [105]:
books_df[books_df['Num_reviews']== max(books_df['Num_reviews'])]

Unnamed: 0,Title,Score,Num_reviews,Age,Pages,Language,Price_tapa_dura,Price_tapa_blanda,Price_Kindle
232,Harry Potter and the Philosopher's Stone (Engl...,4.7,101753.0,8 anos ou mais,345.0,,18.3,0.0,0.0


In [108]:
books_df[books_df['Pages']== max(books_df['Pages'])]

Unnamed: 0,Title,Score,Num_reviews,Age,Pages,Language,Price_tapa_dura,Price_tapa_blanda,Price_Kindle
210,"La Santa Biblia, surtido: colores aleatorios (...",4.5,119.0,9 anos ou mais,2016.0,Español,,,


In [109]:
books_df[books_df['Price_tapa_dura']== max(books_df['Price_tapa_dura'])]

Unnamed: 0,Title,Score,Num_reviews,Age,Pages,Language,Price_tapa_dura,Price_tapa_blanda,Price_Kindle
31,Un beso antes de dormir [Español],4.5,1679.0,1 - 2 anos,0.0,,185.11,9.64,0.0


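The tables above show that Harry Potter and the Philosopher's Stone has the highest number of reviews (101,753), which makes sense for such a well-known book, and that the Bible has the most pages (2,016). The highest hardcover price, 185.11 € for Un beso antes de dormir, is unusually high for a book recommended for ages 1 - 2 and may be an outlier worth checking before further analysis.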
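
The manual checks above could also be automated with a simple range check. The following is a minimal sketch, not the notebook's method: `out_of_range` is a hypothetical helper, and the limits are illustrative assumptions rather than values derived from the data. The three sample records reuse values from the tables above.

```python
def out_of_range(records, limits):
    """Return (row index, column, value) for every value outside its allowed range."""
    problems = []
    for i, rec in enumerate(records):
        for col, (low, high) in limits.items():
            value = rec.get(col)
            if value is None or value != value:  # skip missing values (NaN != NaN)
                continue
            if not low <= value <= high:
                problems.append((i, col, value))
    return problems


# Illustrative limits (assumptions, adjust to the dataset)
limits = {"Score": (0, 5), "Pages": (0, 1500), "Price_tapa_dura": (0, 100)}

records = [
    {"Score": 4.7, "Pages": 345.0, "Price_tapa_dura": 18.3},          # Harry Potter
    {"Score": 4.5, "Pages": 0.0, "Price_tapa_dura": 185.11},          # Un beso antes de dormir
    {"Score": 4.5, "Pages": 2016.0, "Price_tapa_dura": float("nan")}, # La Santa Biblia
]

flagged = out_of_range(records, limits)
```

For the full dataframe, the records could be obtained with `books_df.to_dict('records')`.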

In [110]:
# Save the clean dataset as a CSV file for future analysis
books_df.to_csv('books_df.csv', mode='w', index=False, header=True, encoding='utf-8-sig')