
# Web scraping using Python

In this notebook we show how to automate the extraction of information from the web using Python libraries. This skill is important for data scientists, who need data to build their models and gain knowledge.

We will extract information from the site:

 https://alura-site-scraping.herokuapp.com/index.php

which contains both text and images.

## Importing libraries and checking versions

To assist us in performing our scraping, we will use three packages:

In [1]:
import bs4
import urllib.request as urllib_request
import pandas as pd

Let's check the version of each package:

In [2]:
print(f"BeautifulSoup -> {bs4.__version__}")
print(f"urllib -> {urllib_request.__version__}")
print(f"pandas -> {pd.__version__}")

BeautifulSoup -> 4.6.3
urllib -> 3.7
pandas -> 1.3.5


In [3]:
%reset -f

# Our first scraping

We will scrape a website that lists luxury vehicles:

https://alura-site-scraping.herokuapp.com/index.php

The website has information on 246 vehicles, but each page shows only 10 of them. During this course, we will scrape both text and images, as one may need different types of data for a model.

First, let's scrape some text from the page:

https://alura-site-scraping.herokuapp.com/hello-world.php

which shows some basic text. Let's see how we can do this. Later, we will explain why we are using each method.

The first thing we need to know: it is important to **inspect** the website (for example, with the browser's developer tools). This tells us how the page is structured in its HTML code.

Let's import our packages:

In [4]:
from urllib.request import urlopen
from bs4 import BeautifulSoup

Now let's store the url from our site:

In [17]:
url = "https://alura-site-scraping.herokuapp.com/hello-world.php"

And then let's open our url:

In [18]:
response = urlopen(url)

Now, we can read our response. However, it does not mean very much right now:

In [19]:
html = response.read( )
html

b'<!DOCTYPE html>\r\n<html lang="pt-br">\r\n<head>\r\n    <meta charset="utf-8">\r\n    <meta name="viewport" content="width=device-width, initial-scale=1, shrink-to-fit=no">\r\n\r\n    <title>Alura Motors</title>\r\n\t<link rel="stylesheet" href="https://maxcdn.bootstrapcdn.com/bootstrap/3.3.7/css/bootstrap.min.css" integrity="sha384-BVYiiSIFeK1dGmJRAkycuHAHRg32OmUcww7on3RYdg4Va+PmSTsz/K68vbdEjh4u" crossorigin="anonymous">\r\n\t<link rel="stylesheet" href="css/styles.css" media="all">\r\n\r\n\t<script src="https://code.jquery.com/jquery-1.12.4.js"></script>\r\n\t<script src="https://maxcdn.bootstrapcdn.com/bootstrap/3.3.7/js/bootstrap.min.js" integrity="sha384-Tc5IQib027qvyjSMfHjOMaLkfuWVxZxUPnCJA7l2mCWNIpG9mGCD8wGNIcPD7Txa" crossorigin="anonymous"></script>\r\n\t<script type="text/javascript" src="js/index.js"></script>\r\n\r\n</head>\r\n<body cz-shortcut-listen="true">\r\n    <noscript>You need to enable JavaScript to run this app.</noscript>\r\n\r\n    <div id="root">\r\n        <h

To make it "readable", we can use BeautifulSoup:

In [20]:
soup = BeautifulSoup(html, 'html.parser')
soup

<!DOCTYPE html>

<html lang="pt-br">
<head>
<meta charset="utf-8"/>
<meta content="width=device-width, initial-scale=1, shrink-to-fit=no" name="viewport"/>
<title>Alura Motors</title>
<link crossorigin="anonymous" href="https://maxcdn.bootstrapcdn.com/bootstrap/3.3.7/css/bootstrap.min.css" integrity="sha384-BVYiiSIFeK1dGmJRAkycuHAHRg32OmUcww7on3RYdg4Va+PmSTsz/K68vbdEjh4u" rel="stylesheet"/>
<link href="css/styles.css" media="all" rel="stylesheet"/>
<script src="https://code.jquery.com/jquery-1.12.4.js"></script>
<script crossorigin="anonymous" integrity="sha384-Tc5IQib027qvyjSMfHjOMaLkfuWVxZxUPnCJA7l2mCWNIpG9mGCD8wGNIcPD7Txa" src="https://maxcdn.bootstrapcdn.com/bootstrap/3.3.7/js/bootstrap.min.js"></script>
<script src="js/index.js" type="text/javascript"></script>
</head>
<body cz-shortcut-listen="true">
<noscript>You need to enable JavaScript to run this app.</noscript>
<div id="root">
<header>
<nav class="navbar navbar-inverse" style="margin-bottom: 0;">
<div class="container" styl

Now our HTML is better organized. We can also extract text from it:

In [21]:
soup.find('h1', id = "hello-world").get_text( )

'Hello World!!!'

In [22]:
soup.find('p').get_text( )

'Web Scraping é o termo utilizado para definir a prática de coletar automaticamente informações na Internet. Isto é feito, geralmente, por meio de programas que simulam a navegação humana na Web.'

In [23]:
soup.find('h1', {'class' : 'sub-header'}).get_text( )

'Curso de Web Scraping'

Now, let's start to understand what we just did.

# Obtaining and cleaning our HTML

OK, so how exactly do we obtain our data when web scraping?

When a **client** wants to access a page, it sends a **request** to the **server**. Then, if everything is working, the server sends back a **response**, which, for a webpage, is usually an **HTML** document.

So we basically have to receive this response using Python and extract data from it.

## Obtaining the HTML

To obtain our HTML, we use the **urllib.request** module.

In [24]:
from urllib.request import urlopen

url = 'https://alura-site-scraping.herokuapp.com/index.php'

Then, we may open our url using:

In [25]:
response = urlopen(url)

Finally, to get our html file:

In [26]:
html = response.read( )

In [27]:
html

b'<!DOCTYPE html>\r\n<html lang="pt-br">\r\n<head>\r\n    <meta charset="utf-8">\r\n    <meta name="viewport" content="width=device-width, initial-scale=1, shrink-to-fit=no">\r\n\r\n    <title>Alura Motors</title>\r\n\r\n\t<style>\r\n\t\t/*Regra para a animacao*/\r\n\t\t@keyframes spin {\r\n\t\t\t0% { transform: rotate(0deg); }\r\n\t\t\t100% { transform: rotate(360deg); }\r\n\t\t}\r\n\t\t/*Mudando o tamanho do icone de resposta*/\r\n\t\tdiv.glyphicon {\r\n\t\t\tcolor:#6B8E23;\r\n\t\t\tfont-size: 38px;\r\n\t\t}\r\n\t\t/*Classe que mostra a animacao \'spin\'*/\r\n\t\t.loader {\r\n\t\t\tborder: 16px solid #f3f3f3;\r\n\t\t\tborder-radius: 50%;\r\n\t\t\tborder-top: 16px solid #3498db;\r\n\t\t\twidth: 80px;\r\n\t\t\theight: 80px;\r\n\t\t\t-webkit-animation: spin 2s linear infinite;\r\n\t\t\tanimation: spin 2s linear infinite;\r\n\t\t}\r\n\t</style>\r\n\t<link rel="stylesheet" href="https://maxcdn.bootstrapcdn.com/bootstrap/3.3.7/css/bootstrap.min.css" integrity="sha384-BVYiiSIFeK1dGmJRAkycuH

Great! Our html variable is not very organized, but we were able to get the html. 

### Dealing with forbidden accesses

On some webpages, we may get an error when trying to access them from a Python script:

In [28]:
from urllib.request import urlopen

url = 'https://www.alura.com.br'

response = urlopen(url)
html = response.read( )
html

HTTPError: ignored

To deal with this problem, we have to build a *Request* object with custom headers. Basically, we send a *User-Agent* header that says we are accessing the page via a browser. Thus, we can write:

In [None]:
from urllib.request import Request, urlopen
from urllib.error   import URLError, HTTPError

url = 'https://www.alura.com.br'

user = {'User-Agent' : 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/100.0.4896.127 Safari/537.36'}

req = Request(url, headers = user)

We have made our request. Now, we can just access the webpage as we did before:

In [None]:
response = urlopen(req)
html = response.read( )
html

Note that we now pass the *Request* object *req* to *urlopen( )*, and not *url*.

For cleaner code, we can catch the errors and print a message. For instance:

In [29]:
from urllib.request import Request, urlopen
from urllib.error   import URLError, HTTPError

url = 'https://www.alura.com.br'
user = {'User-Agent' : 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/100.0.4896.127 Safari/537.36'}

try:
  response = urlopen(url)     # No custom headers here, so the server rejects us
  html = response.read( )
  print(html)
except HTTPError as e:      # The server answered, but with an error status (e.g. 403)
  print(e.code, e.reason)
except URLError as e:       # The server could not be reached (e.g. misspelled url)
  print(e.reason)

403 Forbidden


Now, if we make our request:

In [30]:
from urllib.request import Request, urlopen
from urllib.error   import URLError, HTTPError

url = 'https://www.alura.com.br'
user = {'User-Agent' : 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/100.0.4896.127 Safari/537.36'}

try:
  req = Request(url, headers = user)   # Send a browser-like User-Agent header
  response = urlopen(req)
  html = response.read( )
  print(html)
except HTTPError as e:      # The server answered, but with an error status (e.g. 403)
  print(e.code, e.reason)
except URLError as e:       # The server could not be reached (e.g. misspelled url)
  print(e.reason)

b'\n\n<!DOCTYPE html><html\nlang="pt-BR"><head><meta\ncharset="UTF-8"><meta\nname="viewport" content="width=device-width,initial-scale=1,minimum-scale=1.0"><title>Alura | Cursos online de Tecnologia</title><meta\nname="description" content="Aprenda Programa\xc3\xa7\xc3\xa3o, Front-end, Data Science, UX, DevOps, Marketing, Inova\xc3\xa7\xc3\xa3o e Gest\xc3\xa3o na maior plataforma de tecnologia do Brasil"><link\nrel="canonical" href="https://www.alura.com.br"><link\nrel="icon" href="/assets/favicon.1647533642.ico" /><link\nhref="https://fonts.googleapis.com/css?display=swap&family=Inter:wght@400;700;900" rel="stylesheet" crossorigin><link\nrel="preconnect" href="https://fonts.gstatic.com/" crossorigin><link\nrel="stylesheet" href="/bundle,base/_reset,base/base,base/buttons,base/colors-apostilas,base/colors,base/titulos.1650916444.css"><link\nrel="stylesheet" href="/bundle,home/homeNova/career-colors,home/homeNova/careers,home/homeNova/cases,home/homeNova/companies,home/homeNova/features

Great! Everything worked out fine.
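The same behaviour can be reproduced without touching a real site. The sketch below is only an illustration under stated assumptions: `PickyHandler` is a hypothetical local server that answers 403 to urllib's default `User-Agent`, and `fetch` is a hypothetical helper name. It uses `build_opener(ProxyHandler({}))` so that the local request bypasses any system proxy; `opener.open(req)` otherwise behaves like `urlopen(req)`.

```python
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.error import HTTPError, URLError
from urllib.request import ProxyHandler, Request, build_opener


class PickyHandler(BaseHTTPRequestHandler):
    """Hypothetical test server: rejects urllib's default User-Agent."""

    def do_GET(self):
        agent = self.headers.get("User-Agent", "")
        if agent.startswith("Python-urllib"):
            self.send_error(403)  # Forbidden
            return
        body = b"<html><body><h1>Hello</h1></body></html>"
        self.send_response(200)
        self.send_header("Content-Type", "text/html; charset=utf-8")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):
        pass  # keep the output quiet


# The test server is local, so skip any proxy configured on the machine.
opener = build_opener(ProxyHandler({}))


def fetch(url, headers=None):
    """Return (status, body); body is None when the request fails."""
    req = Request(url, headers=headers or {})
    try:
        with opener.open(req, timeout=10) as response:
            return response.status, response.read()
    except HTTPError as e:   # server answered with an error status
        return e.code, None
    except URLError:         # server could not be reached at all
        return None, None


# Port 0 asks the operating system for any free port.
server = HTTPServer(("127.0.0.1", 0), PickyHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()
local_url = f"http://127.0.0.1:{server.server_port}/"

plain_status, plain_body = fetch(local_url)
browser_status, browser_body = fetch(local_url, {"User-Agent": "Mozilla/5.0"})

server.shutdown()
server.server_close()

print(plain_status, plain_body)
print(browser_status, browser_body)
```

Returning the status together with the body keeps the error handling in one place, so the calling code only has to check whether the body is `None`.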

In [31]:
%reset -f

## Treating our string

Ok, so we were able to read our html using:

In [32]:
from urllib.request import urlopen

url = 'https://alura-site-scraping.herokuapp.com/index.php'

response = urlopen(url)
html = response.read( )
html

b'<!DOCTYPE html>\r\n<html lang="pt-br">\r\n<head>\r\n    <meta charset="utf-8">\r\n    <meta name="viewport" content="width=device-width, initial-scale=1, shrink-to-fit=no">\r\n\r\n    <title>Alura Motors</title>\r\n\r\n\t<style>\r\n\t\t/*Regra para a animacao*/\r\n\t\t@keyframes spin {\r\n\t\t\t0% { transform: rotate(0deg); }\r\n\t\t\t100% { transform: rotate(360deg); }\r\n\t\t}\r\n\t\t/*Mudando o tamanho do icone de resposta*/\r\n\t\tdiv.glyphicon {\r\n\t\t\tcolor:#6B8E23;\r\n\t\t\tfont-size: 38px;\r\n\t\t}\r\n\t\t/*Classe que mostra a animacao \'spin\'*/\r\n\t\t.loader {\r\n\t\t\tborder: 16px solid #f3f3f3;\r\n\t\t\tborder-radius: 50%;\r\n\t\t\tborder-top: 16px solid #3498db;\r\n\t\t\twidth: 80px;\r\n\t\t\theight: 80px;\r\n\t\t\t-webkit-animation: spin 2s linear infinite;\r\n\t\t\tanimation: spin 2s linear infinite;\r\n\t\t}\r\n\t</style>\r\n\t<link rel="stylesheet" href="https://maxcdn.bootstrapcdn.com/bootstrap/3.3.7/css/bootstrap.min.css" integrity="sha384-BVYiiSIFeK1dGmJRAkycuH

However, it is hard to understand what this variable means. Let's look into it further. What is its type?

In [33]:
type(html)

bytes

So, our *html* variable is of type *bytes*. Let's try to transform our html into a string, to make it simpler to understand. This can be performed using a decode:

In [34]:
html = html.decode('utf-8')
html

'<!DOCTYPE html>\r\n<html lang="pt-br">\r\n<head>\r\n    <meta charset="utf-8">\r\n    <meta name="viewport" content="width=device-width, initial-scale=1, shrink-to-fit=no">\r\n\r\n    <title>Alura Motors</title>\r\n\r\n\t<style>\r\n\t\t/*Regra para a animacao*/\r\n\t\t@keyframes spin {\r\n\t\t\t0% { transform: rotate(0deg); }\r\n\t\t\t100% { transform: rotate(360deg); }\r\n\t\t}\r\n\t\t/*Mudando o tamanho do icone de resposta*/\r\n\t\tdiv.glyphicon {\r\n\t\t\tcolor:#6B8E23;\r\n\t\t\tfont-size: 38px;\r\n\t\t}\r\n\t\t/*Classe que mostra a animacao \'spin\'*/\r\n\t\t.loader {\r\n\t\t\tborder: 16px solid #f3f3f3;\r\n\t\t\tborder-radius: 50%;\r\n\t\t\tborder-top: 16px solid #3498db;\r\n\t\t\twidth: 80px;\r\n\t\t\theight: 80px;\r\n\t\t\t-webkit-animation: spin 2s linear infinite;\r\n\t\t\tanimation: spin 2s linear infinite;\r\n\t\t}\r\n\t</style>\r\n\t<link rel="stylesheet" href="https://maxcdn.bootstrapcdn.com/bootstrap/3.3.7/css/bootstrap.min.css" integrity="sha384-BVYiiSIFeK1dGmJRAkycuHA

Now, if we check the type:

In [35]:
type(html)

str

The *decode( )* method is also important to make our text **readable**. For instance, before decoding, a piece of text appeared as:

*C\xc3\xa2mera de estacionamento*

Now, after decoding, it is shown correctly:

*Câmera de estacionamento*
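As a small offline illustration of why the codec matters, the sketch below decodes the same bytes with the right and with a wrong codec. The variable names are only illustrative, and the `errors="replace"` call is an optional extra for bytes that are not valid UTF-8.

```python
# UTF-8 bytes, as they would come out of response.read()
raw = b"C\xc3\xa2mera de estacionamento"

text = raw.decode("utf-8")     # correct codec
wrong = raw.decode("latin-1")  # wrong codec: no error, but garbled text

# Invalid UTF-8 raises UnicodeDecodeError by default;
# errors="replace" swaps the bad byte for U+FFFD instead.
safe = b"caf\xe9".decode("utf-8", errors="replace")

print(text)   # Câmera de estacionamento
print(wrong)  # CÃ¢mera de estacionamento
print(safe)
```

The codec to use is normally the one declared by the page, here `<meta charset="utf-8">`.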



To improve the readability of our string, we can also do:

In [36]:
" ".join(html.split( ))

'<!DOCTYPE html> <html lang="pt-br"> <head> <meta charset="utf-8"> <meta name="viewport" content="width=device-width, initial-scale=1, shrink-to-fit=no"> <title>Alura Motors</title> <style> /*Regra para a animacao*/ @keyframes spin { 0% { transform: rotate(0deg); } 100% { transform: rotate(360deg); } } /*Mudando o tamanho do icone de resposta*/ div.glyphicon { color:#6B8E23; font-size: 38px; } /*Classe que mostra a animacao \'spin\'*/ .loader { border: 16px solid #f3f3f3; border-radius: 50%; border-top: 16px solid #3498db; width: 80px; height: 80px; -webkit-animation: spin 2s linear infinite; animation: spin 2s linear infinite; } </style> <link rel="stylesheet" href="https://maxcdn.bootstrapcdn.com/bootstrap/3.3.7/css/bootstrap.min.css" integrity="sha384-BVYiiSIFeK1dGmJRAkycuHAHRg32OmUcww7on3RYdg4Va+PmSTsz/K68vbdEjh4u" crossorigin="anonymous"> <link rel="stylesheet" href="css/styles.css" media="all"> <script src="https://code.jquery.com/jquery-1.12.4.js"></script> <script src="https://

Also, it is common to erase the space between *> <*. This can be done with:

In [37]:
" ".join(html.split( )).replace('> <', '><')

'<!DOCTYPE html><html lang="pt-br"><head><meta charset="utf-8"><meta name="viewport" content="width=device-width, initial-scale=1, shrink-to-fit=no"><title>Alura Motors</title><style> /*Regra para a animacao*/ @keyframes spin { 0% { transform: rotate(0deg); } 100% { transform: rotate(360deg); } } /*Mudando o tamanho do icone de resposta*/ div.glyphicon { color:#6B8E23; font-size: 38px; } /*Classe que mostra a animacao \'spin\'*/ .loader { border: 16px solid #f3f3f3; border-radius: 50%; border-top: 16px solid #3498db; width: 80px; height: 80px; -webkit-animation: spin 2s linear infinite; animation: spin 2s linear infinite; } </style><link rel="stylesheet" href="https://maxcdn.bootstrapcdn.com/bootstrap/3.3.7/css/bootstrap.min.css" integrity="sha384-BVYiiSIFeK1dGmJRAkycuHAHRg32OmUcww7on3RYdg4Va+PmSTsz/K68vbdEjh4u" crossorigin="anonymous"><link rel="stylesheet" href="css/styles.css" media="all"><script src="https://code.jquery.com/jquery-1.12.4.js"></script><script src="https://maxcdn.boo

Great!

In [38]:
%reset -f

Finally, to simplify our future scrapings, we can define a function to treat our html. Thus, we may define:

In [39]:
def treat_html(raw_html):
  # Decode the bytes, collapse all whitespace, and drop the space left between tags
  text = raw_html.decode('utf-8')
  return " ".join(text.split( )).replace('> <', '><')
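To see what this function does step by step, here is a minimal sketch that applies it to a small hand-written byte string instead of a downloaded page. The `sample` value is made up for illustration, and the function is repeated so that the block runs on its own.

```python
def treat_html(raw_html):
    """Decode UTF-8 bytes and collapse the whitespace between tags."""
    text = raw_html.decode("utf-8")
    # split() with no argument breaks on any run of whitespace (\r, \n, \t, spaces)
    return " ".join(text.split()).replace("> <", "><")


sample = b"<div>\r\n    <p>Ol\xc3\xa1,   mundo</p>\r\n</div>"
cleaned = treat_html(sample)

print(cleaned)  # <div><p>Olá, mundo</p></div>
```

Note that this also collapses whitespace inside the text itself, which is usually what we want for scraping but would alter preformatted content such as a `<pre>` block.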

Now, we can simply do:

In [40]:
from urllib.request import urlopen

url = 'https://alura-site-scraping.herokuapp.com/index.php'

response = urlopen(url)
html = response.read( )

html = treat_html(html)

And we get, as output:

In [41]:
html

'<!DOCTYPE html><html lang="pt-br"><head><meta charset="utf-8"><meta name="viewport" content="width=device-width, initial-scale=1, shrink-to-fit=no"><title>Alura Motors</title><style> /*Regra para a animacao*/ @keyframes spin { 0% { transform: rotate(0deg); } 100% { transform: rotate(360deg); } } /*Mudando o tamanho do icone de resposta*/ div.glyphicon { color:#6B8E23; font-size: 38px; } /*Classe que mostra a animacao \'spin\'*/ .loader { border: 16px solid #f3f3f3; border-radius: 50%; border-top: 16px solid #3498db; width: 80px; height: 80px; -webkit-animation: spin 2s linear infinite; animation: spin 2s linear infinite; } </style><link rel="stylesheet" href="https://maxcdn.bootstrapcdn.com/bootstrap/3.3.7/css/bootstrap.min.css" integrity="sha384-BVYiiSIFeK1dGmJRAkycuHAHRg32OmUcww7on3RYdg4Va+PmSTsz/K68vbdEjh4u" crossorigin="anonymous"><link rel="stylesheet" href="css/styles.css" media="all"><script src="https://code.jquery.com/jquery-1.12.4.js"></script><script src="https://maxcdn.boo

Nice! Now we can continue our scraping.

# Introduction to BeautifulSoup

BeautifulSoup is the main library we will use for web scraping. First, let's understand a bit more about HTML.

HTML is a markup language made of different tags. Each tag defines the role of a part of our document and wraps the contents of our page. Some examples of tags are:

*   \<html\>  - Root of the document
*   \<head\>  - Metadata (title, styles, scripts)
*   \<title\> - Title
*   \<body\>  - Body of the document
*   \<h1\>    - Top-level heading
*   \<h2\>    - Second-level heading
*   \<p\>     - Paragraph
*   \<a\>     - Hyperlink
*   \<img\>   - Image
*   \<table\> - Table

Note that tags are written between angle brackets: \<\>.

Tags can also have attributes such as *id* or *class*. We will often need these to locate our data.
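To make the vocabulary concrete, the sketch below parses a tiny hand-written page (the `snippet` content is invented for illustration) and lists its tag names, then reads the *id* and *class* of one tag.

```python
from bs4 import BeautifulSoup

# A made-up page, small enough to read at a glance
snippet = """
<html>
  <head><title>Demo</title></head>
  <body>
    <h1 id="main-title" class="sub-header big">Cars</h1>
    <p class="txt-value">R$ 100.000</p>
    <a href="https://example.com">A link</a>
  </body>
</html>
"""
soup = BeautifulSoup(snippet, "html.parser")

# find_all(True) matches every tag, in document order
tag_names = [tag.name for tag in soup.find_all(True)]
h1 = soup.find("h1")

print(tag_names)    # ['html', 'head', 'title', 'body', 'h1', 'p', 'a']
print(h1["id"])     # main-title
print(h1["class"])  # ['sub-header', 'big']
```

A tag can carry several classes at once, which is why BeautifulSoup returns *class* as a list.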

## Creating a BeautifulSoup object

BeautifulSoup is a tool used to **parse** our html. The documentation for BeautifulSoup can be found at: https://www.crummy.com/software/BeautifulSoup/bs4/doc/

In [45]:
from bs4 import BeautifulSoup

soup = BeautifulSoup(html, 'html.parser')

Now, we have a soup object. Let's check its type:

In [47]:
type(soup)

bs4.BeautifulSoup

We can take a look at our object using:

In [48]:
print(soup.prettify( ))

<!DOCTYPE html>
<html lang="pt-br">
 <head>
  <meta charset="utf-8"/>
  <meta content="width=device-width, initial-scale=1, shrink-to-fit=no" name="viewport"/>
  <title>
   Alura Motors
  </title>
  <style>
   /*Regra para a animacao*/ @keyframes spin { 0% { transform: rotate(0deg); } 100% { transform: rotate(360deg); } } /*Mudando o tamanho do icone de resposta*/ div.glyphicon { color:#6B8E23; font-size: 38px; } /*Classe que mostra a animacao 'spin'*/ .loader { border: 16px solid #f3f3f3; border-radius: 50%; border-top: 16px solid #3498db; width: 80px; height: 80px; -webkit-animation: spin 2s linear infinite; animation: spin 2s linear infinite; }
  </style>
  <link crossorigin="anonymous" href="https://maxcdn.bootstrapcdn.com/bootstrap/3.3.7/css/bootstrap.min.css" integrity="sha384-BVYiiSIFeK1dGmJRAkycuHAHRg32OmUcww7on3RYdg4Va+PmSTsz/K68vbdEjh4u" rel="stylesheet"/>
  <link href="css/styles.css" media="all" rel="stylesheet"/>
  <script src="https://code.jquery.com/jquery-1.12.4.js">
  

## Accessing tags

OK, we have created a soup object from our html. Now, we have to access the tags in our BeautifulSoup object.

For instance, let's get the title of our page. Basically, we have to walk through the nested tags until we get to the title. If we inspect our page:

https://alura-site-scraping.herokuapp.com/index.php

we can see that we have three tags until getting to the title: 

html -> head -> title

So, let's get the title using:

In [49]:
soup.html.head.title

<title>Alura Motors</title>

Great! Instead, since we have only one title, we can simply do:

In [50]:
soup.title

<title>Alura Motors</title>

Note that, when we do this, we get the text together with its tag. We will see how to keep only the text in the section on accessing information from our tags.

So, to reach a tag, we simply navigate through its parent tags. Note that, if the document has several tags with the same name and we don't give the full path to the one we want, the first one found is returned.
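The "first entry wins" rule can be checked on a small made-up document. In this sketch (the `snippet` is invented for illustration), the same `.p` access gives different results depending on where the navigation starts, and a tag that does not exist gives `None`.

```python
from bs4 import BeautifulSoup

snippet = """
<body>
  <div id="a"><p>first</p></div>
  <div id="b"><p>second</p></div>
</body>
"""
soup = BeautifulSoup(snippet, "html.parser")

first = soup.p                       # first <p> in the whole document
second = soup.find("div", id="b").p  # first <p> inside the second <div>
missing = soup.table                 # there is no <table>, so this is None

print(first)    # <p>first</p>
print(second)   # <p>second</p>
print(missing)  # None
```

Because a missing tag yields `None`, chaining further attribute access on it (for example `soup.table.tr`) would raise an `AttributeError`, so it is worth checking for `None` first.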

Now, sometimes, our tags have a lot of information:

In [51]:
soup.h5

<h5 class="modal-title" id="loadingModal_label"><span class="glyphicon glyphicon-refresh"></span>Aguarde... </h5>

How can we get only the information we need?

## Accessing information from our tags

In the previous example, we got the whole \<h5\> tag. However, how can we get only the string "Aguarde...", at the end of the tag?

To get only the text from the tag, we can simply do:

In [53]:
soup.h5.get_text( )

'Aguarde... '

This can also be done in the other examples to remove the tags from our output:

In [54]:
soup.title.get_text( )

'Alura Motors'
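Two details of *get_text( )* are worth a quick sketch: it concatenates the text of nested tags, and it accepts `strip=True` to remove leading and trailing whitespace. The snippets below are hand-written for illustration.

```python
from bs4 import BeautifulSoup

h5_html = '<h5 id="loadingModal_label"><span class="glyphicon"></span>Aguarde... </h5>'
p_html = "<p>Ano <b>1993</b> - 55.286 km</p>"

h5 = BeautifulSoup(h5_html, "html.parser").h5
p = BeautifulSoup(p_html, "html.parser").p

raw_text = h5.get_text()              # keeps the trailing space
clean_text = h5.get_text(strip=True)  # whitespace stripped
nested_text = p.get_text()            # text of <p> and of the nested <b>

print(repr(raw_text))    # 'Aguarde... '
print(repr(clean_text))  # 'Aguarde...'
print(nested_text)       # Ano 1993 - 55.286 km
```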

## Getting attributes from our tags

Often, we want information that is hidden behind some attributes from our tags. For instance:

In [62]:
soup.img

<img alt="Alura" class="d-inline-block align-top" src="img/alura-logo.svg"/>

So, we have an image with some attributes, such as *class* and *src*. We can see these attributes using:

In [63]:
soup.img.attrs

{'alt': 'Alura',
 'class': ['d-inline-block', 'align-top'],
 'src': 'img/alura-logo.svg'}

Note that this is a dictionary. Thus, we can also use:

In [64]:
soup.img.attrs.keys( )

dict_keys(['src', 'class', 'alt'])

In [65]:
soup.img.attrs.values( )

dict_values(['img/alura-logo.svg', ['d-inline-block', 'align-top'], 'Alura'])

Also, we can get our attributes from:

In [68]:
soup.img['src']

'img/alura-logo.svg'

or we can use the *get( )* method:

In [67]:
soup.img.get('src')

'img/alura-logo.svg'
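The practical difference between the two forms shows up when the attribute is missing. The sketch below, on a hand-written `<img>` tag, illustrates that square brackets raise a `KeyError` while *get( )* returns `None` or a default of our choice (the `"no title"` default is just an example value).

```python
from bs4 import BeautifulSoup

snippet = '<img alt="Alura" class="d-inline-block align-top" src="img/alura-logo.svg"/>'
img = BeautifulSoup(snippet, "html.parser").img

src = img["src"]                        # existing attribute
classes = img["class"]                  # class is returned as a list
title = img.get("title")                # missing attribute -> None
fallback = img.get("title", "no title") # missing attribute -> our default

try:
    img["title"]                        # missing attribute -> KeyError
    raised = False
except KeyError:
    raised = True

print(src, classes, title, fallback, raised)
```

When scraping many tags in a loop, *get( )* is usually the safer choice, since one tag without the attribute will not stop the whole loop.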

# Searching our HTML using BeautifulSoup

Let's see more elegant ways to search using BeautifulSoup. These use the *find( )* and *findAll( )* methods (*findAll( )* is the older name of *find_all( )*; both work in the version used here).

First, *find( )* works very similarly to simply accessing the tag as an attribute of our object:

In [69]:
soup.find('img')

<img alt="Alura" class="d-inline-block align-top" src="img/alura-logo.svg"/>

In [70]:
soup.img

<img alt="Alura" class="d-inline-block align-top" src="img/alura-logo.svg"/>

The *findAll( )* method, however, is very useful: it returns a list with every tag that matches:

In [71]:
soup.findAll('img')

[<img alt="Alura" class="d-inline-block align-top" src="img/alura-logo.svg"/>,
 <img alt="Foto" height="155" src="https://caelum-online-public.s3.amazonaws.com/1381-scraping/01/img-cars/lamborghini-aventador/lamborghini-aventador-2932196__340.jpg" width="220"/>,
 <img alt="Foto" height="155" src="https://caelum-online-public.s3.amazonaws.com/1381-scraping/01/img-cars/bmw-m2/bmw-m2-2970882__340.jpg" width="220"/>,
 <img alt="Foto" height="155" src="https://caelum-online-public.s3.amazonaws.com/1381-scraping/01/img-cars/alfa/alfa-1823056__340.jpg" width="220"/>,
 <img alt="Foto" height="155" src="https://caelum-online-public.s3.amazonaws.com/1381-scraping/01/img-cars/puech/puech-4055386__340.jpg" width="220"/>,
 <img alt="Foto" height="155" src="https://caelum-online-public.s3.amazonaws.com/1381-scraping/01/img-cars/lamborghini-murcielago/lamborghini-murcielago-2872974__340.jpg" width="220"/>,
 <img alt="Foto" height="155" src="https://caelum-online-public.s3.amazonaws.com/1381-scraping/

We can also limit the number of results returned:

In [75]:
soup.findAll('img', limit = 2)

[<img alt="Alura" class="d-inline-block align-top" src="img/alura-logo.svg"/>,
 <img alt="Foto" height="155" src="https://caelum-online-public.s3.amazonaws.com/1381-scraping/01/img-cars/lamborghini-aventador/lamborghini-aventador-2932196__340.jpg" width="220"/>]
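The sketch below contrasts the two methods on a small hand-written list, including what each returns when nothing matches. It uses the *find_all( )* spelling of *findAll( )*; the `snippet` is invented for illustration.

```python
from bs4 import BeautifulSoup

snippet = """
<ul>
  <li>4 X 4</li>
  <li>Câmera de estacionamento</li>
  <li>Controle de tração</li>
</ul>
"""
soup = BeautifulSoup(snippet, "html.parser")

first_item = soup.find("li")              # a single tag
all_items = soup.find_all("li")           # every match, as a list
two_items = soup.find_all("li", limit=2)  # stop after two matches

no_table = soup.find("table")             # nothing found -> None
no_tables = soup.find_all("table")        # nothing found -> empty list

texts = [li.get_text() for li in all_items]
print(texts)
```

Since *find( )* gives `None` and *find_all( )* gives an empty list when there is no match, looping over the result of *find_all( )* never fails, while the result of *find( )* should be checked before use.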

A shortcut for the method *findAll( )* is to simply do:

In [76]:
soup('img')

[<img alt="Alura" class="d-inline-block align-top" src="img/alura-logo.svg"/>,
 <img alt="Foto" height="155" src="https://caelum-online-public.s3.amazonaws.com/1381-scraping/01/img-cars/lamborghini-aventador/lamborghini-aventador-2932196__340.jpg" width="220"/>,
 <img alt="Foto" height="155" src="https://caelum-online-public.s3.amazonaws.com/1381-scraping/01/img-cars/bmw-m2/bmw-m2-2970882__340.jpg" width="220"/>,
 <img alt="Foto" height="155" src="https://caelum-online-public.s3.amazonaws.com/1381-scraping/01/img-cars/alfa/alfa-1823056__340.jpg" width="220"/>,
 <img alt="Foto" height="155" src="https://caelum-online-public.s3.amazonaws.com/1381-scraping/01/img-cars/puech/puech-4055386__340.jpg" width="220"/>,
 <img alt="Foto" height="155" src="https://caelum-online-public.s3.amazonaws.com/1381-scraping/01/img-cars/lamborghini-murcielago/lamborghini-murcielago-2872974__340.jpg" width="220"/>,
 <img alt="Foto" height="155" src="https://caelum-online-public.s3.amazonaws.com/1381-scraping/

## Using a list of tags

If we want to search multiple tags, we can simply pass a list of tags to our method:

In [77]:
soup(['h1', 'h2', 'h3', 'h4', 'h5'])

[<h5 class="modal-title" id="loadingModal_label"><span class="glyphicon glyphicon-refresh"></span>Aguarde... </h5>,
 <h4><b id="loadingModal_content"></b></h4>,
 <h1 class="sub-header">Veículos de Luxo Novos e Usados - Todas as Marcas</h1>]

## Using attributes

We can also filter by attributes when searching with *findAll( )*:

In [78]:
soup.findAll('p', {'class' : "txt-value"})

[<p class="txt-value">R$ 338.000</p>,
 <p class="txt-value">R$ 346.000</p>,
 <p class="txt-value">R$ 480.000</p>,
 <p class="txt-value">R$ 133.000</p>,
 <p class="txt-value">R$ 175.000</p>,
 <p class="txt-value">R$ 239.000</p>,
 <p class="txt-value">R$ 115.000</p>,
 <p class="txt-value">R$ 114.000</p>,
 <p class="txt-value">R$ 75.000</p>,
 <p class="txt-value">R$ 117.000</p>]

Thus, we found all *p* tags with *class = 'txt-value'*.

## Using text

We can also find tags by their text:

In [79]:
soup.findAll('p', text = "USADO")

[<p class="txt-category badge badge-secondary inline">USADO</p>,
 <p class="txt-category badge badge-secondary inline">USADO</p>,
 <p class="txt-category badge badge-secondary inline">USADO</p>,
 <p class="txt-category badge badge-secondary inline">USADO</p>,
 <p class="txt-category badge badge-secondary inline">USADO</p>,
 <p class="txt-category badge badge-secondary inline">USADO</p>,
 <p class="txt-category badge badge-secondary inline">USADO</p>,
 <p class="txt-category badge badge-secondary inline">USADO</p>,
 <p class="txt-category badge badge-secondary inline">USADO</p>]
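The attribute and text filters can be tried on a small hand-written fragment. In this sketch, `class_` is BeautifulSoup's keyword form of the `{'class': ...}` dictionary, and `string` is the newer name of the `text` argument used above; the `snippet` is invented for illustration.

```python
from bs4 import BeautifulSoup

snippet = """
<div>
  <p class="txt-name inline">LAMBORGHINI AVENTADOR</p>
  <p class="txt-category badge">USADO</p>
  <p class="txt-value">R$ 338.000</p>
  <p class="txt-name inline">BMW M2</p>
  <p class="txt-category badge">NOVO</p>
  <p class="txt-value">R$ 346.000</p>
</div>
"""
soup = BeautifulSoup(snippet, "html.parser")

by_dict = soup.find_all("p", {"class": "txt-value"})  # attribute dictionary
by_keyword = soup.find_all("p", class_="txt-value")   # keyword form, same result
names = soup.find_all("p", class_="txt-name")         # matches one of several classes
used = soup.find_all("p", string="USADO")             # exact text match

prices = [p.get_text() for p in by_dict]
print(prices)  # ['R$ 338.000', 'R$ 346.000']
```

Note that a class filter matches a tag when any one of its classes equals the given value, which is why `txt-name` finds the tags declared as `class="txt-name inline"`.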

## Extra 01: Getting images from our html

Images in an HTML document are usually declared with the tag \<img\>. In our case, we want to get the pictures of the cars in the document:

https://alura-site-scraping.herokuapp.com/index.php

If we inspect each figure, we will see that, in all of them, we have an attribute:

*alt = "Foto"*

So, let's try to search all of our figures:

In [80]:
soup.findAll('img', {"alt": "Foto"})

[<img alt="Foto" height="155" src="https://caelum-online-public.s3.amazonaws.com/1381-scraping/01/img-cars/lamborghini-aventador/lamborghini-aventador-2932196__340.jpg" width="220"/>,
 <img alt="Foto" height="155" src="https://caelum-online-public.s3.amazonaws.com/1381-scraping/01/img-cars/bmw-m2/bmw-m2-2970882__340.jpg" width="220"/>,
 <img alt="Foto" height="155" src="https://caelum-online-public.s3.amazonaws.com/1381-scraping/01/img-cars/alfa/alfa-1823056__340.jpg" width="220"/>,
 <img alt="Foto" height="155" src="https://caelum-online-public.s3.amazonaws.com/1381-scraping/01/img-cars/puech/puech-4055386__340.jpg" width="220"/>,
 <img alt="Foto" height="155" src="https://caelum-online-public.s3.amazonaws.com/1381-scraping/01/img-cars/lamborghini-murcielago/lamborghini-murcielago-2872974__340.jpg" width="220"/>,
 <img alt="Foto" height="155" src="https://caelum-online-public.s3.amazonaws.com/1381-scraping/01/img-cars/aston-martin/aston-martin-2977916__340.jpg" width="220"/>,
 <img al

If we only want the URL of each figure, we can also do:

In [82]:
for item in soup.findAll('img', {"alt": "Foto"}):
  print(item.get('src'))

https://caelum-online-public.s3.amazonaws.com/1381-scraping/01/img-cars/lamborghini-aventador/lamborghini-aventador-2932196__340.jpg
https://caelum-online-public.s3.amazonaws.com/1381-scraping/01/img-cars/bmw-m2/bmw-m2-2970882__340.jpg
https://caelum-online-public.s3.amazonaws.com/1381-scraping/01/img-cars/alfa/alfa-1823056__340.jpg
https://caelum-online-public.s3.amazonaws.com/1381-scraping/01/img-cars/puech/puech-4055386__340.jpg
https://caelum-online-public.s3.amazonaws.com/1381-scraping/01/img-cars/lamborghini-murcielago/lamborghini-murcielago-2872974__340.jpg
https://caelum-online-public.s3.amazonaws.com/1381-scraping/01/img-cars/aston-martin/aston-martin-2977916__340.jpg
https://caelum-online-public.s3.amazonaws.com/1381-scraping/01/img-cars/tvr/tvr-2943925__340.jpg
https://caelum-online-public.s3.amazonaws.com/1381-scraping/01/img-cars/excalibur/excalibur-2916730__340.jpg
https://caelum-online-public.s3.amazonaws.com/1381-scraping/01/img-cars/mclaren/mclaren-2855240__340.jpg
htt

Great! We were able to scrape the URLs of our figures using Python.

So, basically, to perform efficient web scraping, we simply have to **understand** our html document, look for the tags and classes we need, and then access those using BeautifulSoup.
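One detail to keep in mind is that some *src* values are relative, such as the logo's `img/alura-logo.svg`. A minimal sketch, assuming we want absolute URLs for every image, is to resolve each *src* against the page URL with `urllib.parse.urljoin`. The `snippet` below is a shortened, hand-written version of the page.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

page_url = "https://alura-site-scraping.herokuapp.com/index.php"
snippet = """
<img alt="Alura" src="img/alura-logo.svg"/>
<img alt="Foto" src="https://caelum-online-public.s3.amazonaws.com/1381-scraping/01/img-cars/bmw-m2/bmw-m2-2970882__340.jpg"/>
"""
soup = BeautifulSoup(snippet, "html.parser")

# urljoin leaves absolute URLs untouched and resolves relative ones
image_urls = [urljoin(page_url, img["src"]) for img in soup.find_all("img")]

for image_url in image_urls:
    print(image_url)
```

Each resulting URL could then be downloaded, for example with `urllib.request.urlretrieve(image_url, filename)`, where `filename` is a local path of our choice.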

## Extra 02: More search methods

For an efficient search, we can also use the concepts of parents and siblings (previous or next sibling). They follow directly from the html hierarchy.

    <html>
        <body>
            <div id="container-a">
                <h1>Título A</h1>
                <h2 class="ref-a">Sub título A</h2>
                <p>Texto de conteúdo A</p>
            </div>
            <div id="container-b">
                <h1>Título B</h1>
                <h2 class="ref-b">Sub título B</h2>
                <p>Texto de conteúdo B</p>
            </div>
        </body>
    </html>

For instance, here, looking at the first h2 tag:

*   h1 is its previous sibling.
*   p is its next sibling.
*   div is its direct parent, and body and html are its more distant parents (ancestors).



In [84]:
from bs4 import BeautifulSoup

html = """
    <html>
        <body>
            <div id="container-a">
                <h1>Título A</h1>
                <h2 class="ref-a">Sub título A</h2>
                <p>Texto de conteúdo A</p>
            </div>
            <div id="container-b">
                <h1>Título B</h1>
                <h2 class="ref-b">Sub título B</h2>
                <p>Texto de conteúdo B</p>
            </div>
        </body>
    </html>
"""
soup = BeautifulSoup(html, 'html.parser')

In [85]:
soup.find('h2')

<h2 class="ref-a">Sub título A</h2>

OK, we have already learned how to find h2. However, we can also chain a call to find its parents:

In [91]:
soup.find('h2').find_parents( )

[<div id="container-a">
 <h1>Título A</h1>
 <h2 class="ref-a">Sub título A</h2>
 <p>Texto de conteúdo A</p>
 </div>, <body>
 <div id="container-a">
 <h1>Título A</h1>
 <h2 class="ref-a">Sub título A</h2>
 <p>Texto de conteúdo A</p>
 </div>
 <div id="container-b">
 <h1>Título B</h1>
 <h2 class="ref-b">Sub título B</h2>
 <p>Texto de conteúdo B</p>
 </div>
 </body>, <html>
 <body>
 <div id="container-a">
 <h1>Título A</h1>
 <h2 class="ref-a">Sub título A</h2>
 <p>Texto de conteúdo A</p>
 </div>
 <div id="container-b">
 <h1>Título B</h1>
 <h2 class="ref-b">Sub título B</h2>
 <p>Texto de conteúdo B</p>
 </div>
 </body>
 </html>, 
 <html>
 <body>
 <div id="container-a">
 <h1>Título A</h1>
 <h2 class="ref-a">Sub título A</h2>
 <p>Texto de conteúdo A</p>
 </div>
 <div id="container-b">
 <h1>Título B</h1>
 <h2 class="ref-b">Sub título B</h2>
 <p>Texto de conteúdo B</p>
 </div>
 </body>
 </html>]

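It returned a list with all of its parents, from the closest (*div*) to the outermost. The last entry is the BeautifulSoup object itself, which prints just like the *html* tag.

Now, for the siblings. Note that *findNext( )* and *findPrevious( )* return the next and previous tag in document order, which here happen to be the siblings of h2 (the dedicated sibling methods are *findNextSibling( )* and *findPreviousSibling( )*):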

In [92]:
soup.find('h2').findNext( )

<p>Texto de conteúdo A</p>

In [93]:
soup.find('h2').findPrevious( )

<h1>Título A</h1>
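As a complement, the sketch below uses the snake_case methods that target parents and siblings specifically (*find_parent( )*, *find_parents( )*, *find_next_sibling( )*, *find_previous_sibling( )*) on a shortened copy of the document above.

```python
from bs4 import BeautifulSoup

html = """
<html><body>
  <div id="container-a">
    <h1>Título A</h1>
    <h2 class="ref-a">Sub título A</h2>
    <p>Texto de conteúdo A</p>
  </div>
</body></html>
"""
soup = BeautifulSoup(html, "html.parser")
h2 = soup.find("h2")

parent_id = h2.find_parent("div")["id"]              # closest enclosing <div>
ancestors = [tag.name for tag in h2.find_parents()]  # innermost to outermost
next_tag = h2.find_next_sibling()                    # next sibling that is a tag
prev_tag = h2.find_previous_sibling()                # previous sibling that is a tag
raw_next = h2.next_sibling                           # the whitespace right after </h2>

print(parent_id)   # container-a
print(ancestors)   # ['div', 'body', 'html', '[document]']
print(next_tag)    # <p>Texto de conteúdo A</p>
print(prev_tag)    # <h1>Título A</h1>
```

The `.next_sibling` attribute returns the whitespace between the tags here, because line breaks are text nodes too. The *find_next_sibling( )* method skips them and returns the next tag.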

# Capturing data from an ad

OK, we have already learned how to navigate a soup object. Now, let's try to capture data from an ad, to use in our future analyses.

In [95]:
from urllib.request import urlopen

url = 'https://alura-site-scraping.herokuapp.com/index.php'

response = urlopen(url)
html = response.read( )

html = treat_html(html)

soup = BeautifulSoup(html, 'html.parser')

So, in our webpage:

https://alura-site-scraping.herokuapp.com/index.php

we have a lot of ads for different luxury cars. Also, there is a lot of information about each car, such as: name, year, engine, km, price...

If we inspect our page, we can see that all of this information is inside a *div* tag with the class *container-cards*. The information for each car is stored in a *div* tag with the classes *well* and *card* (written as *class="well card"*). Thus, we can get the information for the first card using:

In [103]:
ad = soup.find('div', {'class': 'well card'})
print(ad.prettify( ))

<div class="well card">
 <div class="col-md-3 image-card">
  <img alt="Foto" height="155" src="https://caelum-online-public.s3.amazonaws.com/1381-scraping/01/img-cars/lamborghini-aventador/lamborghini-aventador-2932196__340.jpg" width="220"/>
 </div>
 <div class="col-md-6 body-card">
  <p class="txt-name inline">
   LAMBORGHINI AVENTADOR
  </p>
  <p class="txt-category badge badge-secondary inline">
   USADO
  </p>
  <p class="txt-motor">
   Motor 1.8 16v
  </p>
  <p class="txt-description">
   Ano 1993 - 55.286 km
  </p>
  <ul class="lst-items">
   <li class="txt-items">
    ► 4 X 4
   </li>
   <li class="txt-items">
    ► Câmera de estacionamento
   </li>
   <li class="txt-items">
    ► Controle de tração
   </li>
   <li class="txt-items">
    ► Sensor de estacionamento
   </li>
   <li class="txt-items">
    ...
   </li>
  </ul>
  <p class="txt-location">
   Belo Horizonte - MG
  </p>
 </div>
 <div class="col-md-3 value-card">
  <div class="value">
   <p class="txt-value">
    R$ 338

Note that our ad has information about the name of the car, the items that it has, its value, its year, its km, its location, and others. Let's get the value of our car:

In [105]:
ad.find('p', {'class':'txt-value'})

<p class="txt-value">R$ 338.000</p>

Great! Now, to eliminate the tags:

In [107]:
ad.find('p', {'class':'txt-value'}).get_text( )

'R$ 338.000'
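
The value is still a string, so it cannot be used in calculations yet. A minimal sketch of a conversion follows, assuming that prices on the site carry no cents and that "." is only a thousands separator. The helper `parse_price` is a hypothetical name, not part of any library.

```python
def parse_price(text):
    """Convert a string such as 'R$ 338.000' into an integer number of reais."""
    # Keep only the digits: this drops 'R$', spaces and the '.' separators
    digits = ''.join(ch for ch in text if ch.isdigit())
    return int(digits)

price = parse_price('R$ 338.000')
print(price)
```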

Another example: to get the items from our car, we can use:

In [108]:
ad.find('li', {'class':'txt-items'}).get_text( )

'► 4 X 4'

However, note that our car has multiple items. To get all of them at once, we may use:

In [110]:
ad.findAll('li', {'class':'txt-items'})

[<li class="txt-items">► 4 X 4</li>,
 <li class="txt-items">► Câmera de estacionamento</li>,
 <li class="txt-items">► Controle de tração</li>,
 <li class="txt-items">► Sensor de estacionamento</li>,
 <li class="txt-items">...</li>]

and, to only get the text:

In [111]:
for item in ad.findAll('li', {'class':'txt-items'}):
  print(item.get_text( ))

► 4 X 4
► Câmera de estacionamento
► Controle de tração
► Sensor de estacionamento
...


## Getting a lot of information at once

Also, we may note that this information usually comes from the same kind of tag. For instance, if we search for tag \<p\> under tag \<div\>, class "body-card":

In [113]:
infos = ad.find('div', {'class': 'body-card'}).findAll('p')
infos

[<p class="txt-name inline">LAMBORGHINI AVENTADOR</p>,
 <p class="txt-category badge badge-secondary inline">USADO</p>,
 <p class="txt-motor">Motor 1.8 16v</p>,
 <p class="txt-description">Ano 1993 - 55.286 km</p>,
 <p class="txt-location">Belo Horizonte - MG</p>]

We got a lot of information at once.

Another way to handle multiple values at once is to store the fields of a card in an auxiliary variable. Usually, we use a dictionary. Let's understand step by step how to do this.

First, we can see each of the items from our list using:

In [114]:
for info in infos:
  print(info)

<p class="txt-name inline">LAMBORGHINI AVENTADOR</p>
<p class="txt-category badge badge-secondary inline">USADO</p>
<p class="txt-motor">Motor 1.8 16v</p>
<p class="txt-description">Ano 1993 - 55.286 km</p>
<p class="txt-location">Belo Horizonte - MG</p>


However, there is a lot of information there we don't want. Let's try to get only the class and the text of each item:

In [115]:
for info in infos:
  print(info.get('class'), ' - ', info.get_text( ))

['txt-name', 'inline']  -  LAMBORGHINI AVENTADOR
['txt-category', 'badge', 'badge-secondary', 'inline']  -  USADO
['txt-motor']  -  Motor 1.8 16v
['txt-description']  -  Ano 1993 - 55.286 km
['txt-location']  -  Belo Horizonte - MG


That looks a bit better. However, our class is a list, where only the first value is important. Thus, we can do:

In [117]:
for info in infos:
  print(info.get('class')[0], ' - ', info.get_text( ))

txt-name  -  LAMBORGHINI AVENTADOR
txt-category  -  USADO
txt-motor  -  Motor 1.8 16v
txt-description  -  Ano 1993 - 55.286 km
txt-location  -  Belo Horizonte - MG


Now that looks a lot better. To make things clearer, we can also drop the 'txt-' prefix from our class using:

In [118]:
for info in infos:
  print(info.get('class')[0].split("-")[-1], ' - ', info.get_text( ))

name  -  LAMBORGHINI AVENTADOR
category  -  USADO
motor  -  Motor 1.8 16v
description  -  Ano 1993 - 55.286 km
location  -  Belo Horizonte - MG


Good! Now, let's create a dictionary using these ideas:

In [120]:
card = {}

for info in infos:
  card[info.get('class')[0].split("-")[-1]] = info.get_text( )

In [121]:
card

{'category': 'USADO',
 'description': 'Ano 1993 - 55.286 km',
 'location': 'Belo Horizonte - MG',
 'motor': 'Motor 1.8 16v',
 'name': 'LAMBORGHINI AVENTADOR'}
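
One caveat about `split("-")[-1]`: it keeps only what comes after the last hyphen. That is fine for the classes on this page, but it would shorten a class with more than one hyphen. The sketch below shows an alternative that removes only the leading prefix. The helper `field_name` and the class `txt-top-speed` are hypothetical, used only for illustration.

```python
def field_name(css_class, prefix='txt-'):
    """Strip a leading prefix from a class name, keeping any later hyphens."""
    if css_class.startswith(prefix):
        return css_class[len(prefix):]
    return css_class

print(field_name('txt-name'))
print(field_name('txt-top-speed'))      # hypothetical class, not present on the site
print('txt-top-speed'.split('-')[-1])   # the split approach keeps only the last piece
```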

Great! Everything worked out. However, there is still some other information we can get: the items of our car.

As we showed before, we can list them using:

In [123]:
for item in ad.findAll('li', {'class':'txt-items'}):
  print(item)

<li class="txt-items">► 4 X 4</li>
<li class="txt-items">► Câmera de estacionamento</li>
<li class="txt-items">► Controle de tração</li>
<li class="txt-items">► Sensor de estacionamento</li>
<li class="txt-items">...</li>


Ok. Now, we have to clean this output, and remove what we don't want. First, to eliminate the last item of the list:

In [131]:
items = ad.findAll('li', {'class':'txt-items'})

In [133]:
items.pop( )

<li class="txt-items">...</li>

In [134]:
items

[<li class="txt-items">► 4 X 4</li>,
 <li class="txt-items">► Câmera de estacionamento</li>,
 <li class="txt-items">► Controle de tração</li>,
 <li class="txt-items">► Sensor de estacionamento</li>]

Now, let's store the text on a list:

In [137]:
list_items = []
for item in items:
  list_items.append(item.get_text( ))

list_items

['► 4 X 4',
 '► Câmera de estacionamento',
 '► Controle de tração',
 '► Sensor de estacionamento']

However, let's make our text cleaner by removing the '► ' marker:

In [138]:
list_items = []
for item in items:
  list_items.append(item.get_text( ).replace('► ', ''))

list_items

['4 X 4',
 'Câmera de estacionamento',
 'Controle de tração',
 'Sensor de estacionamento']
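
The three steps above (pop, loop, replace) can also be condensed into one list comprehension. The sketch below is one possible version, assuming the item list of the first card. Instead of relying on the '...' placeholder being the last element, it filters it by content.

```python
from bs4 import BeautifulSoup

# Item list copied from the first card shown above
html = (
    '<ul class="lst-items">'
    '<li class="txt-items">► 4 X 4</li>'
    '<li class="txt-items">► Câmera de estacionamento</li>'
    '<li class="txt-items">► Controle de tração</li>'
    '<li class="txt-items">► Sensor de estacionamento</li>'
    '<li class="txt-items">...</li>'
    '</ul>'
)
ad = BeautifulSoup(html, 'html.parser')

list_items = [
    item.get_text().replace('► ', '')
    for item in ad.find_all('li', {'class': 'txt-items'})
    if item.get_text().strip() != '...'  # skip the placeholder wherever it appears
]
print(list_items)
```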

Great! Now, let's add a new key to our dictionary:

In [139]:
card['items'] = list_items

In [140]:
card

{'category': 'USADO',
 'description': 'Ano 1993 - 55.286 km',
 'items': ['4 X 4',
  'Câmera de estacionamento',
  'Controle de tração',
  'Sensor de estacionamento'],
 'location': 'Belo Horizonte - MG',
 'motor': 'Motor 1.8 16v',
 'name': 'LAMBORGHINI AVENTADOR'}

Now, let's get the final information: the value of our car.

In [141]:
ad.find('p', {'class':'txt-value'})

<p class="txt-value">R$ 338.000</p>

Ok, so let's just treat our string:

In [144]:
value_tag = ad.find('p', {'class':'txt-value'})  # look the tag up only once
print(value_tag.get('class')[0].split('-')[-1], ' - ', value_tag.get_text( ))

value  -  R$ 338.000


Nice! Now let's add to our dictionary:

In [145]:
value_tag = ad.find('p', {'class':'txt-value'})  # look the tag up only once
card[value_tag.get('class')[0].split('-')[-1]] = value_tag.get_text( )

In [146]:
card

{'category': 'USADO',
 'description': 'Ano 1993 - 55.286 km',
 'items': ['4 X 4',
  'Câmera de estacionamento',
  'Controle de tração',
  'Sensor de estacionamento'],
 'location': 'Belo Horizonte - MG',
 'motor': 'Motor 1.8 16v',
 'name': 'LAMBORGHINI AVENTADOR',
 'value': 'R$ 338.000'}
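
Note that the *description* field mixes two pieces of information: the year and the km. A minimal sketch for splitting them follows, assuming every description follows the layout "Ano YYYY - N km" seen in the cards above. The helper `parse_description` is a hypothetical name.

```python
import re

def parse_description(text):
    """Split a string such as 'Ano 1993 - 55.286 km' into year and mileage."""
    match = re.fullmatch(r'Ano (\d{4}) - ([\d.]+) km', text.strip())
    if match is None:
        return None  # unexpected layout: let the caller decide what to do
    year = int(match.group(1))
    km = int(match.group(2).replace('.', ''))  # '.' is a thousands separator
    return {'year': year, 'km': km}

print(parse_description('Ano 1993 - 55.286 km'))
```

Returning `None` for an unexpected layout is just one design choice; raising an exception would also be reasonable.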

## Creating a DataFrame from our dictionary

After getting our information, it is common to send it to a Pandas DataFrame, so that analyses can be performed more easily. So, we can do:

In [147]:
import pandas as pd

data = pd.DataFrame(card)

In [148]:
data

| | name | category | motor | description | location | items | value |
|---|---|---|---|---|---|---|---|
| 0 | LAMBORGHINI AVENTADOR | USADO | Motor 1.8 16v | Ano 1993 - 55.286 km | Belo Horizonte - MG | 4 X 4 | R$ 338.000 |
| 1 | LAMBORGHINI AVENTADOR | USADO | Motor 1.8 16v | Ano 1993 - 55.286 km | Belo Horizonte - MG | Câmera de estacionamento | R$ 338.000 |
| 2 | LAMBORGHINI AVENTADOR | USADO | Motor 1.8 16v | Ano 1993 - 55.286 km | Belo Horizonte - MG | Controle de tração | R$ 338.000 |
| 3 | LAMBORGHINI AVENTADOR | USADO | Motor 1.8 16v | Ano 1993 - 55.286 km | Belo Horizonte - MG | Sensor de estacionamento | R$ 338.000 |


However, note that our DataFrame is not right: because *items* is a list with four elements, pandas created one row per accessory and repeated the other fields. To fix this, we can create our DataFrame using:

In [152]:
data = pd.DataFrame.from_dict(card, orient = 'index').T

In [153]:
data

| | name | category | motor | description | location | items | value |
|---|---|---|---|---|---|---|---|
| 0 | LAMBORGHINI AVENTADOR | USADO | Motor 1.8 16v | Ano 1993 - 55.286 km | Belo Horizonte - MG | [4 X 4, Câmera de estacionamento, Controle de ... | R$ 338.000 |


Nice! Now, everything worked out. From a pandas DataFrame, it is also easy to export our data to a csv file:

In [154]:
data.to_csv('test.csv', sep = ';', index = False, encoding = 'utf-8')
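
One thing to keep in mind when exporting: a column that holds Python lists is written to the csv as text, so it does not come back as a list when the file is read again. The sketch below is one possible way around this, assuming we are happy to join the items into a single string. The ' | ' separator is an arbitrary choice, and the card is trimmed to three fields for brevity. It also builds the one-row frame with `pd.DataFrame([card])`, an alternative to `from_dict(...).T`.

```python
import io
import pandas as pd

# Part of the card built above, trimmed to three fields for brevity
card = {
    'name': 'LAMBORGHINI AVENTADOR',
    'items': ['4 X 4', 'Câmera de estacionamento',
              'Controle de tração', 'Sensor de estacionamento'],
    'value': 'R$ 338.000',
}

# A list with one dictionary yields exactly one row
data = pd.DataFrame([card])

# Flatten the list so that the csv cell holds plain text
data['items'] = data['items'].apply(' | '.join)

# Write to an in-memory buffer instead of a file, just to inspect the result
buffer = io.StringIO()
data.to_csv(buffer, sep=';', index=False)
csv_text = buffer.getvalue()
print(csv_text)
```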

## Getting the picture from our ad

Ok, so we have captured almost all information from our card. However, we are still missing one piece: the image. Getting the image may be of interest if you want to train a Convolutional Neural Network for object recognition, for example.

If we inspect the image from our card, we can see that the image is under the tag:

In [155]:
ad.find('div', {'class' : 'image-card'})

<div class="col-md-3 image-card"><img alt="Foto" height="155" src="https://caelum-online-public.s3.amazonaws.com/1381-scraping/01/img-cars/lamborghini-aventador/lamborghini-aventador-2932196__340.jpg" width="220"/></div>

Actually, we want to get attribute *src* from tag \<img\>. Thus, we can do:

In [156]:
ad.find('div', {'class' : 'image-card'}).img.get('src')

'https://caelum-online-public.s3.amazonaws.com/1381-scraping/01/img-cars/lamborghini-aventador/lamborghini-aventador-2932196__340.jpg'

To visualize our figure here, in our notebook, we can do:

In [158]:
from IPython.display import display, HTML  # public import path; IPython.core.display is deprecated for this

display(HTML(str(ad.find('div', {'class' : 'image-card'}).img)))

To download our figure, we can do:

In [166]:
from urllib.request import urlretrieve

image = ad.find('div', {'class' : 'image-card'}).img

urlretrieve(image.get('src'), image.get('src').split('/')[-1])

('lamborghini-aventador-2932196__340.jpg',
 <http.client.HTTPMessage at 0x7fadbfcf2250>)
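
Taking the file name with `split('/')[-1]` works for this url, but it would keep a query string if the url had one. A small sketch of a more careful version follows, using only the standard library. The helper `image_filename` and the query string in the second call are hypothetical.

```python
import posixpath
from urllib.parse import urlparse

def image_filename(url):
    """Return the last path component of a url, ignoring query string and fragment."""
    return posixpath.basename(urlparse(url).path)

src = ('https://caelum-online-public.s3.amazonaws.com/1381-scraping/01/img-cars/'
       'lamborghini-aventador/lamborghini-aventador-2932196__340.jpg')

print(image_filename(src))
print(image_filename(src + '?size=large'))  # hypothetical query string
```

The result can then be passed as the second argument of `urlretrieve`, in place of `image.get('src').split('/')[-1]`.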

# Repeating the steps for our other cards

Ok, we got information from the first card. However, we need to do the same thing for all 10 cards from the same page. First, let's understand what we want to get. If we inspect the page:

https://alura-site-scraping.herokuapp.com/index.php

we are able to see that our first card was under the tag *div*, under class *well card*. Now, this card itself is under the tag *div*, id *container-cards*. Thus, we can do:



In [169]:
container = soup.find('div', {'id':'container-cards'})

In [170]:
container

<div id="container-cards" style="height: 100%"><div class="well card"><div class="col-md-3 image-card"><img alt="Foto" height="155" src="https://caelum-online-public.s3.amazonaws.com/1381-scraping/01/img-cars/lamborghini-aventador/lamborghini-aventador-2932196__340.jpg" width="220"/></div><div class="col-md-6 body-card"><p class="txt-name inline">LAMBORGHINI AVENTADOR</p><p class="txt-category badge badge-secondary inline">USADO</p><p class="txt-motor">Motor 1.8 16v</p><p class="txt-description">Ano 1993 - 55.286 km</p><ul class="lst-items"><li class="txt-items">► 4 X 4</li><li class="txt-items">► Câmera de estacionamento</li><li class="txt-items">► Controle de tração</li><li class="txt-items">► Sensor de estacionamento</li><li class="txt-items">...</li></ul><p class="txt-location">Belo Horizonte - MG</p></div><div class="col-md-3 value-card"><div class="value"><p class="txt-value">R$ 338.000</p></div></div></div><div class="well card"><div class="col-md-3 image-card"><img alt="Foto" h

Now, we can see that all cards are listed under the tag *div*, class *well card*. Thus, to get all cards, we can do:

In [178]:
ads = container.findAll('div', {'class':'well card'})

In [179]:
ads

[<div class="well card"><div class="col-md-3 image-card"><img alt="Foto" height="155" src="https://caelum-online-public.s3.amazonaws.com/1381-scraping/01/img-cars/lamborghini-aventador/lamborghini-aventador-2932196__340.jpg" width="220"/></div><div class="col-md-6 body-card"><p class="txt-name inline">LAMBORGHINI AVENTADOR</p><p class="txt-category badge badge-secondary inline">USADO</p><p class="txt-motor">Motor 1.8 16v</p><p class="txt-description">Ano 1993 - 55.286 km</p><ul class="lst-items"><li class="txt-items">► 4 X 4</li><li class="txt-items">► Câmera de estacionamento</li><li class="txt-items">► Controle de tração</li><li class="txt-items">► Sensor de estacionamento</li><li class="txt-items">...</li></ul><p class="txt-location">Belo Horizonte - MG</p></div><div class="col-md-3 value-card"><div class="value"><p class="txt-value">R$ 338.000</p></div></div></div>,
 <div class="well card"><div class="col-md-3 image-card"><img alt="Foto" height="155" src="https://caelum-online-publ

Great! To see the first card, which we used up until now, we can do:

In [180]:
print(ads[0].prettify( ))

<div class="well card">
 <div class="col-md-3 image-card">
  <img alt="Foto" height="155" src="https://caelum-online-public.s3.amazonaws.com/1381-scraping/01/img-cars/lamborghini-aventador/lamborghini-aventador-2932196__340.jpg" width="220"/>
 </div>
 <div class="col-md-6 body-card">
  <p class="txt-name inline">
   LAMBORGHINI AVENTADOR
  </p>
  <p class="txt-category badge badge-secondary inline">
   USADO
  </p>
  <p class="txt-motor">
   Motor 1.8 16v
  </p>
  <p class="txt-description">
   Ano 1993 - 55.286 km
  </p>
  <ul class="lst-items">
   <li class="txt-items">
    ► 4 X 4
   </li>
   <li class="txt-items">
    ► Câmera de estacionamento
   </li>
   <li class="txt-items">
    ► Controle de tração
   </li>
   <li class="txt-items">
    ► Sensor de estacionamento
   </li>
   <li class="txt-items">
    ...
   </li>
  </ul>
  <p class="txt-location">
   Belo Horizonte - MG
  </p>
 </div>
 <div class="col-md-3 value-card">
  <div class="value">
   <p class="txt-value">
    R$ 338

Ok. So, we have all 10 cards for the first page. Now, we have to do the same procedure we did for the first card, but now for all 10 cards. To do so, let's create a function to extract data from a single card:

In [196]:
def ExtractData(ad):
  card = {}

  # Basic information

  infos = ad.find('div', {'class': 'body-card'}).findAll('p')
  for info in infos:
    card[info.get('class')[0].split("-")[-1]] = info.get_text( )

  # Items

  items = ad.findAll('li', {'class':'txt-items'})
  list_items = []
  for item in items:
    list_items.append(item.get_text( ).replace('► ', ''))

  card['accessories'] = list_items

  # Value

  card[ad.find('p', {'class':'txt-value'}).get('class')[0].split('-')[-1]] = ad.find('p', {'class':'txt-value'}).get_text( )

  # Figure

  image = ad.find('div', {'class' : 'image-card'}).img

  urlretrieve(image.get('src'), image.get('src').split('/')[-1])

  return card

In [197]:
cards = []
for i in range(len(ads)):
  cards.append(ExtractData(ads[i]))  # one call per ad, so each image is downloaded only once

In [198]:
cards

[{'accessories': ['4 X 4',
   'Câmera de estacionamento',
   'Controle de tração',
   'Sensor de estacionamento',
   '...'],
  'category': 'USADO',
  'description': 'Ano 1993 - 55.286 km',
  'location': 'Belo Horizonte - MG',
  'motor': 'Motor 1.8 16v',
  'name': 'LAMBORGHINI AVENTADOR',
  'value': 'R$ 338.000'},
 {'accessories': ['Câmera de estacionamento',
   'Controle de estabilidade',
   'Travas elétricas',
   'Freios ABS',
   '...'],
  'category': 'USADO',
  'description': 'Ano 2018 - 83.447 km',
  'location': 'Belo Horizonte - MG',
  'motor': 'Motor 3.0 32v',
  'name': 'BMW M2',
  'value': 'R$ 346.000'},
 {'accessories': ['Central multimídia',
   'Bancos de couro',
   'Rodas de liga',
   'Câmera de estacionamento',
   '...'],
  'category': 'USADO',
  'description': 'Ano 2004 - 19.722 km',
  'location': 'Rio de Janeiro - RJ',
  'motor': 'Motor 1.8 16v',
  'name': 'ALFA',
  'value': 'R$ 480.000'},
 {'accessories': ['Bancos de couro',
   'Freios ABS',
   'Rodas de liga',
   'Câmbio 

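Note that, unlike in our step-by-step walkthrough, the function does not drop the trailing '...' placeholder, so it shows up at the end of each list of accessories.

Now, let's create a DataFrame using this information: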

In [199]:
cards_df = pd.DataFrame(cards)

In [200]:
cards_df

| | name | category | motor | description | location | accessories | value | opportunity |
|---|---|---|---|---|---|---|---|---|
| 0 | LAMBORGHINI AVENTADOR | USADO | Motor 1.8 16v | Ano 1993 - 55.286 km | Belo Horizonte - MG | [4 X 4, Câmera de estacionamento, Controle de ... | R$ 338.000 | |
| 1 | BMW M2 | USADO | Motor 3.0 32v | Ano 2018 - 83.447 km | Belo Horizonte - MG | [Câmera de estacionamento, Controle de estabil... | R$ 346.000 | |
| 2 | ALFA | USADO | Motor 1.8 16v | Ano 2004 - 19.722 km | Rio de Janeiro - RJ | [Central multimídia, Bancos de couro, Rodas de... | R$ 480.000 | |
| 3 | PUECH | USADO | Motor Diesel V8 | Ano 1992 - 34.335 km | São Paulo - SP | [Bancos de couro, Freios ABS, Rodas de liga, C... | R$ 133.000 | |
| 4 | LAMBORGHINI MURCIELAGO | USADO | Motor 1.0 8v | Ano 1991 - 464 km | Belo Horizonte - MG | [Central multimídia, Teto panorâmico, Sensor c... | R$ 175.000 | |
| 5 | ASTON MARTIN | USADO | Motor Diesel V6 | Ano 2004 - 50.189 km | Belo Horizonte - MG | [Painel digital, Controle de tração, Teto pano... | R$ 239.000 | OPORTUNIDADE |
| 6 | TVR | USADO | Motor 4.0 Turbo | Ano 2014 - 17.778 km | Belo Horizonte - MG | [4 X 4, Teto panorâmico, Central multimídia, C... | R$ 115.000 | |
| 7 | EXCALIBUR | USADO | Motor 3.0 32v | Ano 2009 - 81.251 km | Rio de Janeiro - RJ | [Painel digital, Câmbio automático, Sensor de ... | R$ 114.000 | |
| 8 | MCLAREN | NOVO | Motor Diesel | Ano 2019 - 0 km | São Paulo - SP | [Central multimídia, Câmera de estacionamento,... | R$ 75.000 | |
| 9 | TOYOTA | USADO | Motor 4.0 Turbo | Ano 1999 - 12.536 km | São Paulo - SP | [Bancos de couro, Freios ABS, Piloto automátic... | R$ 117.000 | OPORTUNIDADE |


Nice! We were able to get the information from our 10 cars into a single DataFrame! Note that we have also downloaded the image of each card.

# Getting all information from the website

Ok, so we were able to get information from the 10 cars of our page. However, our website has information about 246 cars! How can we get information for all of them?

Our 246 cars are listed across 25 pages on our website. If we go to page 2, we get a different url:

https://alura-site-scraping.herokuapp.com/index.php?page=2

Note that, at the end of the url, there is the number of the page we are on.

Thus, to run through all pages, we can simply change the parameter in the end of the url.
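
As a small illustration of this idea, the sketch below builds the url of each page from the page number. It uses `urlencode` from the standard library instead of string concatenation. The names `BASE_URL` and `page_url` are hypothetical, and only the first three pages are generated here.

```python
from urllib.parse import urlencode

BASE_URL = 'https://alura-site-scraping.herokuapp.com/index.php'

def page_url(page):
    """Build the url of a given results page (pages are numbered from 1)."""
    return f"{BASE_URL}?{urlencode({'page': page})}"

urls = [page_url(page) for page in range(1, 4)]
for url in urls:
    print(url)
```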

First: How can we, automatically, get the number of pages in our website?

Well, we can scrape it.

If we inspect our page one more time, we can see that information regarding the page is stored under the tag *\<span\>*, class *info-pages*. Thus, we can do:

In [201]:
soup.find('span', {'class': 'info-pages'})

<span class="info-pages">Página 1 de 25</span>

Now, cleaning our result:

In [203]:
int(soup.find('span', {'class': 'info-pages'}).get_text( ).split(' ')[-1])

25

Nice!
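
The `split(' ')[-1]` approach depends on the exact spacing of the text. A slightly more tolerant sketch follows, assuming only that the total number of pages is the last number in the string. The helper `total_pages` is a hypothetical name.

```python
import re

def total_pages(text):
    """Read the last number from a string such as 'Página 1 de 25'."""
    numbers = re.findall(r'\d+', text)
    if not numbers:
        raise ValueError(f'no page count found in {text!r}')
    return int(numbers[-1])

print(total_pages('Página 1 de 25'))
```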

Before going further into the code, let's re-implement our function to extract data, now removing the part where we download the pictures (as this may take too much time right now):

In [206]:
def ExtractData(ad):
  card = {}

  # Basic information

  infos = ad.find('div', {'class': 'body-card'}).findAll('p')
  for info in infos:
    card[info.get('class')[0].split("-")[-1]] = info.get_text( )

  # Items

  items = ad.findAll('li', {'class':'txt-items'})
  list_items = []
  for item in items:
    list_items.append(item.get_text( ).replace('► ', ''))

  card['accessories'] = list_items

  # Value

  card[ad.find('p', {'class':'txt-value'}).get('class')[0].split('-')[-1]] = ad.find('p', {'class':'txt-value'}).get_text( )

  return card

Now, let's create a loop to perform scraping over all of our pages!

In [207]:
num_pages = int(soup.find('span', {'class': 'info-pages'}).get_text( ).split(' ')[-1])  # Get the number of pages

cards = []

for page in range(num_pages):                                                           # Run through all pages
  url = "https://alura-site-scraping.herokuapp.com/index.php?page=" + str(page + 1)     # Create the url

  response = urlopen(url)
  html = response.read( )
  soup = BeautifulSoup(html, 'html.parser')

  # Getting information from the cards in our page

  container = soup.find('div', {'id':'container-cards'})
  ads = container.findAll('div', {'class':'well card'})

  # Extracting data from each ad (one call per ad)

  for i in range(len(ads)):
    cards.append(ExtractData(ads[i]))
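
When we request many pages in a row, a single failed request interrupts the whole loop. The sketch below shows one possible way to retry a request a few times before giving up. The function `fetch_html` and its parameters `opener`, `retries` and `delay` are hypothetical names. The `opener` parameter exists only so that the function can be tried without network access.

```python
import time
from urllib.error import URLError
from urllib.request import urlopen

def fetch_html(url, opener=urlopen, retries=3, delay=1.0):
    """Return the raw bytes at `url`, retrying when a URLError is raised."""
    for attempt in range(1, retries + 1):
        try:
            with opener(url) as response:
                return response.read()
        except URLError:
            if attempt == retries:
                raise              # out of attempts: let the caller see the error
            time.sleep(delay)      # wait a little before trying again
```

Inside the loop over pages, `html = fetch_html(url)` could then take the place of the `urlopen(url)` and `response.read( )` pair.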

Now, creating a DataFrame from our list of dictionaries:

In [208]:
cards_df = pd.DataFrame(cards)

In [209]:
cards_df.head( )

| | name | category | motor | description | location | accessories | value | opportunity |
|---|---|---|---|---|---|---|---|---|
| 0 | LAMBORGHINI AVENTADOR | USADO | Motor 1.8 16v | Ano 1993 - 55.286 km | Belo Horizonte - MG | [4 X 4, Câmera de estacionamento, Controle de ... | R$ 338.000 | |
| 1 | BMW M2 | USADO | Motor 3.0 32v | Ano 2018 - 83.447 km | Belo Horizonte - MG | [Câmera de estacionamento, Controle de estabil... | R$ 346.000 | |
| 2 | ALFA | USADO | Motor 1.8 16v | Ano 2004 - 19.722 km | Rio de Janeiro - RJ | [Central multimídia, Bancos de couro, Rodas de... | R$ 480.000 | |
| 3 | PUECH | USADO | Motor Diesel V8 | Ano 1992 - 34.335 km | São Paulo - SP | [Bancos de couro, Freios ABS, Rodas de liga, C... | R$ 133.000 | |
| 4 | LAMBORGHINI MURCIELAGO | USADO | Motor 1.0 8v | Ano 1991 - 464 km | Belo Horizonte - MG | [Central multimídia, Teto panorâmico, Sensor c... | R$ 175.000 | |


In [210]:
cards_df.info( )

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 246 entries, 0 to 245
Data columns (total 8 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   name         246 non-null    object
 1   category     246 non-null    object
 2   motor        246 non-null    object
 3   description  246 non-null    object
 4   location     246 non-null    object
 5   accessories  246 non-null    object
 6   value        246 non-null    object
 7   opportunity  39 non-null     object
dtypes: object(8)
memory usage: 15.5+ KB


Great! Now, we have information from all 246 cars!
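
From here, the DataFrame can be prepared for analysis. The sketch below shows two possible steps, assuming a small hand-made sample in the shape of `cards_df`. The accessories of the third car are truncated to the two entries visible in the output above. It turns the mostly empty *opportunity* column into a boolean flag, and counts how often each accessory appears using `explode`.

```python
import pandas as pd

# Hand-made sample in the shape of cards_df (third accessory list is truncated)
sample = pd.DataFrame({
    'name': ['LAMBORGHINI AVENTADOR', 'BMW M2', 'ASTON MARTIN'],
    'accessories': [
        ['4 X 4', 'Câmera de estacionamento',
         'Controle de tração', 'Sensor de estacionamento'],
        ['Câmera de estacionamento', 'Controle de estabilidade',
         'Travas elétricas', 'Freios ABS'],
        ['Painel digital', 'Controle de tração'],
    ],
    'opportunity': [None, None, 'OPORTUNIDADE'],
})

# Missing values become False, any text becomes True
sample['opportunity'] = sample['opportunity'].notna()

# One row per accessory, then count the occurrences of each one
accessory_counts = sample['accessories'].explode().value_counts()

print(sample[['name', 'opportunity']])
print(accessory_counts)
```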