# Web scraping products using La Central 📚

In this workbook, we will see an example of how to scrape product info from a website. This is useful for comparing or monitoring prices, collecting product characteristics, or studying competitors. In this example, we will also store all the product info that we collect in a DataFrame so that it is easier to work with.

In [1]:
# First step: import the libraries 
import requests
from bs4 import BeautifulSoup
import numpy as np
import pandas as pd
import os

In [2]:
# Setting the url of the page we want to scrape (again, feel free to change the page from La Central)
url = 'https://www.lacentral.com/recomendados/?mat=S'

In [3]:
# Passing the url to the requests library 
page = requests.get(url)
page.content

b'<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"  "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">\n<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="es"><head>\n  <title>ATRIL - La Central - 2020</title>\n  <meta name="google-site-verification" content="vMwf5tWiU8JsXdg2vJZGnd1zGY1XPbxjxsTCdQPTUPY" />\n  <meta name="google-site-verification" content="zecGJ_65TrdLPSDcUEpkv8D9HVmDgSmhLI-OLDrerok" />\n  <meta http-equiv="content-type" content="text/html; charset=iso-8859-1" />\n  <meta http-equiv="content-language" content="es" />\n  <meta name="revisit-after" content="7 days" />\n  <meta name="distribution" content="global" />\n  <meta name="rating" content="general" />\n  <meta name="DC.language" content="es" />\n  <meta name="autor" content="Llibreria La Central" />\n  <meta name="description" content="ATRIL - La Central - 2020" />\n  <meta name="keywords" content="companyia,central,llibretera,lacentral,barcelona,llibreria,librer\xc3\xada,humanidades,arte,teatro,litera
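One detail in the raw response is worth a look: the `<meta>` tag declares `charset=iso-8859-1`, yet the keywords contain the bytes `\xc3\xad`, which is how UTF-8 encodes `í`. The short sketch below uses plain Python and a byte string copied from the output above to show what each decoding gives. A parser that trusts the declared charset could therefore return garbled accents for some fields.

```python
# Bytes copied from the keywords in the response above
raw = b'librer\xc3\xada'

as_utf8 = raw.decode('utf-8')         # interpret the bytes as UTF-8: 'librería'
as_latin1 = raw.decode('iso-8859-1')  # interpret them as the declared charset

# The Latin-1 reading turns one accented letter into two characters
print(len(as_utf8), len(as_latin1))  # 8 9
```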

In [4]:
# Creating the soup object 
soup = BeautifulSoup(page.content, 'html5lib')

In [5]:
# Extracting info from the soup object using tags
soup.title, type(soup.title), soup.title.get_text()

(<title>ATRIL - La Central - 2020</title>,
 bs4.element.Tag,
 'ATRIL - La Central - 2020')

Now, again, let's go to the page and use the inspector tool to look at how we can extract information on the products.

In [6]:
# Getting the names of the authors 
authors = soup.find_all('h4')
for name in authors: 
    print(name.find('a').get_text())

Marrero Rocha, Inmaculada
Robinson, Andy
Preciado, Paul B.
Quinones, Sam
Rich, Nathaniel
Vargas, Fred
Esteban Muñoz, José
Frase, Peter
Horvat, Srecko
Innerarity, Daniel
Mansbridge, Jane
Tolentino, Jia
Alemán, Jorge
Rich, Adrienne
Rich, Adrienne
Zizek, Slavoj
Maalouf, Amin 
Frankopan, Peter
Forti, Steven
Abdo Férez, Cecilia
Preciado, Paul B.
Preciado, Paul B.

Wallace-Wells, David
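If some scraped strings come back with garbled accents (for instance `JosÃ©` where `José` was expected), the usual cause is UTF-8 bytes that were decoded as ISO-8859-1. Below is a minimal sketch of a repair step, assuming that is the only mix-up involved. `fix_mojibake` is a hypothetical helper name, not part of `bs4` or `requests`.

```python
def fix_mojibake(text):
    """Undo a UTF-8-read-as-Latin-1 mix-up; return text unchanged if it does not apply."""
    try:
        # Recover the original bytes, then decode them as UTF-8
        return text.encode('iso-8859-1').decode('utf-8')
    except (UnicodeEncodeError, UnicodeDecodeError):
        # The string was not mis-decoded in this way, so leave it alone
        return text


repaired = fix_mojibake('JosÃ©')        # becomes 'José'
untouched = fix_mojibake('misión')      # already correct, returned as-is
```

Strings that are already correct fail the round trip and are returned unchanged, so the helper can be applied to every field.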


In [7]:
# Getting the book titles 
titles = soup.find_all('h5')
for title in titles: 
    print(title.get_text())


                  Soldados del terrorismo global. Los nuevos combatientes extranjeros
                

                  Oro, petróleo y aguacates
                

                  Manifiesto contrasexual
                

                  Tierra de sueños
                

                  Perdiendo la tierra
                

                  La humanidad en peligro
                

                  Utopía queer
                

                  Cuatro futuros
                

                  Poesía del futuro
                

                  Una teoría de la democracia compleja
                

                  Feminismo: breve introducción a una ideología política
                

                  Falso espejo
                

                  Razón fronteriza y sujeto del inconsciente
                

                  Nacemos de mujer
                

                  Ensayos esenciales. Cultura, política y el arte de la poesía
               

As we can see, there is an issue: the book titles and authors do not follow the same pattern. Let's go back to the webpage with the inspector tool to see if we can find a way to extract this info using hierarchy and iteration.

In [8]:
# Zooming in on the info that we want and getting the category 
body = soup.find('div', id='multiCol3')
body.find('h3').get_text()

'Ciencias Sociales'

In [9]:
# Creating a list of all the products and checking the length 
items = body.find_all('li')
len(items)

24

In [10]:
# Let's quickly inspect one element 
items[0]

<li>
                <p class="imgLeft">
                  <a href="/book/?id=9788430977833"><img src="/9788430977833.jpg" style="border:1px solid black; width:76px"/></a>
                </p>
                <h4>
                  <a href="/book/?id=9788430977833">Marrero Rocha, Inmaculada</a>
                </h4>
                <h5>
                  <a href="/book/?id=9788430977833">Soldados del terrorismo global. Los nuevos combatientes extranjeros</a>
                </h5>
                <p>Esta obra tiene como misión ofrecer una visión más profunda y completa de la nat...</p>
              </li>
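The item also carries a relative link (`/book/?id=...`) that is not used in the steps that follow. As a small sketch, the standard library can turn it into an absolute URL and pull out the `id` value. The value looks like an ISBN-13, but treating it as one is an assumption.

```python
from urllib.parse import urljoin, urlparse, parse_qs

page_url = 'https://www.lacentral.com/recomendados/?mat=S'
href = '/book/?id=9788430977833'  # relative link copied from the item above

# Resolve the relative link against the page it was found on
book_url = urljoin(page_url, href)

# Read the "id" query parameter from the resolved URL
book_id = parse_qs(urlparse(book_url).query)['id'][0]

print(book_url)  # https://www.lacentral.com/book/?id=9788430977833
print(book_id)   # 9788430977833
```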

In the above example, we can see that the author name is stored in the 'h4' tag, the title in the 'h5' tag and the description in the 'p' tag that has no class (the other 'p', with class "imgLeft", holds the cover image). So now let's separate out this info and extract it for the example item.

In [13]:
# Getting author 
items[0].find('h4').get_text().strip()

'Marrero Rocha, Inmaculada'

In [14]:
# Getting title 
items[0].find('h5').get_text().strip()

'Soldados del terrorismo global. Los nuevos combatientes extranjeros'

In [15]:
# Getting description 
items[0].find('p', class_="").get_text().strip()

'Esta obra tiene como misión ofrecer una visión más profunda y completa de la nat...'

In [16]:
# Looping through each item in the list we have created and extracting the info for each one
authors = []
titles = []
descriptions = []

for item in items:
    authors.append(item.find('h4').get_text().strip())
    titles.append(item.find('h5').get_text().strip())
    descriptions.append(item.find('p', class_="").get_text().strip())
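The loop above assumes every item contains an 'h4', an 'h5' and a classless 'p'. `find` returns `None` when a tag is missing, and calling `.get_text()` on `None` raises an `AttributeError`. Here is a hedged sketch of a more tolerant variant. It uses the built-in `'html.parser'` so it runs without `html5lib`, the helper names are hypothetical, and the second `<li>` is a made-up item with no author or description.

```python
from bs4 import BeautifulSoup

html = """
<ul>
  <li>
    <p class="imgLeft"><a href="/book/?id=9788430977833"><img src="/9788430977833.jpg"/></a></p>
    <h4><a href="/book/?id=9788430977833">Marrero Rocha, Inmaculada</a></h4>
    <h5><a href="/book/?id=9788430977833">Soldados del terrorismo global. Los nuevos combatientes extranjeros</a></h5>
    <p>Esta obra tiene como misión ofrecer una visión más profunda y completa de la nat...</p>
  </li>
  <li>
    <h5><a href="#">Hypothetical title without author</a></h5>
  </li>
</ul>
"""


def text_or_empty(tag):
    """Return the stripped text of a tag, or '' when the tag was not found."""
    return tag.get_text().strip() if tag is not None else ''


def is_plain_paragraph(tag):
    """Match <p> tags that carry no class attribute (the description)."""
    return tag.name == 'p' and not tag.has_attr('class')


sample_soup = BeautifulSoup(html, 'html.parser')
records = []
for item in sample_soup.find_all('li'):
    records.append({
        'author': text_or_empty(item.find('h4')),
        'title': text_or_empty(item.find('h5')),
        'description': text_or_empty(item.find(is_plain_paragraph)),
    })

print(len(records))          # 2
print(records[1]['author'])  # empty string: the tag was missing
```

A list of dictionaries like `records` can be passed straight to `pd.DataFrame`, which keeps each book's fields together in one row.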

In [17]:
# Checking that it looks ok
authors

['Marrero Rocha, Inmaculada',
 'Robinson, Andy',
 'Preciado, Paul B.',
 'Quinones, Sam',
 'Rich, Nathaniel',
 'Vargas, Fred',
 'Esteban Muñoz, José',
 'Frase, Peter',
 'Horvat, Srecko',
 'Innerarity, Daniel',
 'Mansbridge, Jane',
 'Tolentino, Jia',
 'Alemán, Jorge',
 'Rich, Adrienne',
 'Rich, Adrienne',
 'Zizek, Slavoj',
 'Maalouf, Amin',
 'Frankopan, Peter',
 'Forti, Steven',
 'Abdo Férez, Cecilia',
 'Preciado, Paul B.',
 'Preciado, Paul B.',
 '',
 'Wallace-Wells, David']
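The list above contains one empty string, so it can help to run a quick check before building a table: confirm that the lists are the same length and find the positions with no author. The sketch below uses a shortened sample, and the title paired with the empty author is a placeholder, not real data.

```python
# Shortened sample: the first entry is from the output above,
# the second stands in for the item whose author came back empty.
sample_authors = ['Marrero Rocha, Inmaculada', '']
sample_titles = ['Soldados del terrorismo global. Los nuevos combatientes extranjeros',
                 'Placeholder title']

# Lists of unequal length cannot be combined column-wise into a DataFrame
same_length = len(sample_authors) == len(sample_titles)

# Positions where the author string is empty
empty_positions = [i for i, author in enumerate(sample_authors) if not author]

print(same_length)      # True
print(empty_positions)  # [1]
```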

In [18]:
# Storing the info in a dataframe 
books_df = pd.DataFrame({'author': authors, 
                         'title': titles, 
                         'description': descriptions})
books_df

|   | author | title | description |
|---|---|---|---|
| 0 | Marrero Rocha, Inmaculada | Soldados del terrorismo global. Los nuevos com... | Esta obra tiene como misión ofrecer una visión... |
| 1 | Robinson, Andy | Oro, petróleo y aguacates | Andy Robinson cuenta en estas crónicas persona... |
| 2 | Preciado, Paul B. | Manifiesto contrasexual | «En el principio era el dildo. El dildo antece... |
| 3 | Quinones, Sam | Tierra de sueños | Quinones teje dos historias clásicas sobre el ... |
| 4 | Rich, Nathaniel | Perdiendo la tierra | En 1979 sabíamos casi todo lo que entendemos h... |
| 5 | Vargas, Fred | La humanidad en peligro | Hace diez años, Fred Vargas publicó un breve t... |
| 6 | Esteban Muñoz, José | Utopía queer | Lo queer aún no ha llegado. Es una idealidad. ... |
| 7 | Frase, Peter | Cuatro futuros | ¿CÓMO SERÁ LA VIDA DESPUÉS DEL CAPITALISMO?\nP... |
| 8 | Horvat, Srecko | Poesía del futuro | Cómo un movimiento de liberación global es la ... |
| 9 | Innerarity, Daniel | Una teoría de la democracia compleja | La principal amenaza de la democracia no es la... |


Now that we have seen how to scrape product websites, you could:
* Apply this to other websites that may also include pricing and ratings
* Build a recommendation engine based on popular products
* Continue the analysis over a period of time to see how tastes change
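For the last idea, one possible approach is to append each scrape to a CSV file with a date stamp so that runs can be compared later. This is only a sketch using the standard library. The function name `append_snapshot`, the file layout and the column names are all assumptions made for illustration.

```python
import csv
import os
from datetime import date


def append_snapshot(path, rows, day=None):
    """Append rows (dicts with 'author' and 'title') to a CSV, stamped with a date."""
    day = day or date.today().isoformat()
    fieldnames = ['date', 'author', 'title']
    # Write the header only when the file is new or empty
    write_header = not os.path.exists(path) or os.path.getsize(path) == 0
    with open(path, 'a', newline='', encoding='utf-8') as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames)
        if write_header:
            writer.writeheader()
        for row in rows:
            writer.writerow({'date': day,
                             'author': row['author'],
                             'title': row['title']})
```

Calling `append_snapshot('snapshots.csv', rows)` after each scrape would build up one file that `pd.read_csv` can load for comparison across dates.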