# Section 9 - Web Scraping and Web Crawling
Data from websites are not always neatly organized. We call this **unstructured data**. Most webpages are written in **HTML**, a markup language designed for layout and formatting. To work with unstructured data from websites, we need to know a little bit of HTML. Codecademy has a great free [HTML tutorial](https://www.codecademy.com/learn/learn-html), which covers even more than what we need for this section. We are also going to use a few **regular expressions** (or RegEx), which are special strings that describe search patterns. Here is a great [RegEx tutorial](https://regexone.com/).

Now that we know the basics of HTML, you should also learn how to **open the source code panel** of your web browser. If you are using Google Chrome, just press F12. If you are using Safari on a Mac, you need to **enable the developer menu**:
1. Click on *Safari Menu* > *Preferences* > *Advanced*
1. Check *"Show Develop menu in menu bar"*
1. Now there is a drop-down menu called *Develop*, with the *view source* option.

In this notebook we are going to use some new libraries:
* **`requests`**: Lets you grab the source code and other characteristics of a webpage.
* **`BeautifulSoup4`**: Parses text as HTML source code and lets you grab the parts you want.
* **`re`**: The RegEx module that is built into Python.

In [1]:
import requests
from bs4 import BeautifulSoup as bs
import re
import pandas as pd

For now, our target is going to be the **G1 website**.

https://g1.globo.com/

We can use the requests library to `get` the contents of the page.

In [3]:
response = requests.get('http://g1.globo.com')
type(response)

requests.models.Response

Notice that `get` returns a `Response` object, a type specific to the `requests` library. Let's take a look at the attributes of this object.

In [4]:
dir(response)

['__attrs__',
 '__bool__',
 '__class__',
 '__delattr__',
 '__dict__',
 '__dir__',
 '__doc__',
 '__enter__',
 '__eq__',
 '__exit__',
 '__format__',
 '__ge__',
 '__getattribute__',
 '__getstate__',
 '__gt__',
 '__hash__',
 '__init__',
 '__init_subclass__',
 '__iter__',
 '__le__',
 '__lt__',
 '__module__',
 '__ne__',
 '__new__',
 '__nonzero__',
 '__reduce__',
 '__reduce_ex__',
 '__repr__',
 '__setattr__',
 '__setstate__',
 '__sizeof__',
 '__str__',
 '__subclasshook__',
 '__weakref__',
 '_content',
 '_content_consumed',
 '_next',
 'apparent_encoding',
 'close',
 'connection',
 'content',
 'cookies',
 'elapsed',
 'encoding',
 'headers',
 'history',
 'is_permanent_redirect',
 'is_redirect',
 'iter_content',
 'iter_lines',
 'json',
 'links',
 'next',
 'ok',
 'raise_for_status',
 'raw',
 'reason',
 'request',
 'status_code',
 'text',
 'url']

Try a few of these attributes in the next cell. The most important one for us is going to be the `.text` attribute, which holds the page source as a string.

In [10]:
response.text

'<!DOCTYPE HTML>\n<html lang="pt-br" itemscope itemtype="http://schema.org/WebPage"> <head><meta charset="utf-8"><meta http-equiv="x-ua-compatible" content="ie=edge,chrome=1"><meta name="viewport" content="width=device-width, initial-scale=1"><title>G1 - O portal de notícias da Globo</title><link rel="preconnect" href="https://s.glbimg.com"><link rel="preconnect" href="https://s2.glbimg.com"><link rel="preconnect" href="https://s3.glbimg.com"><link rel="preconnect" href="https://tags.globo.com"><link rel="preconnect" href="https://tags.tiqcdn.com"><link rel="preconnect" href="https://www.google-analytics.com"><meta name="title" content="G1 - O portal de notícias da Globo"><meta name="description" content="Últimas notícias de economia, política, carros, emprego, educação, ciência, saúde, cultura do Brasil e do mundo. Vídeos dos telejornais da TV Globo e da GloboNews."><meta rel="icon" type="image/jpeg" href="https://s2.glbimg.com/nMQRSo35mM4QdMphrT4FsjQuMeM=/32x32/smart/filters:strip_ic

We can use RegEx to search content in this page. For example, let's **grab all the URLs from G1 website**.

In [21]:
regex_html_url = r'''href=["']([^"']+)["']'''  # RegEx for URLs: capture what sits between the quotes after href=
re.findall(regex_html_url, response.text)

['https://s.glbimg.com',
 'https://s2.glbimg.com',
 'https://s3.glbimg.com',
 'https://tags.globo.com',
 'https://tags.tiqcdn.com',
 'https://www.google-analytics.com',
 'https://s2.glbimg.com/nMQRSo35mM4QdMphrT4FsjQuMeM=/32x32/smart/filters:strip_icc()/s.glbimg.com/jo/g1/f/original/2018/05/23/favicon-g1.jpeg',
 'https://g1.globo.com/',
 'https://s2.glbimg.com/Nme7nzPOlSYd1wZQzw294kUVR4Y=/16x16/smart/filters:strip_icc()/s.glbimg.com/jo/g1/f/original/2018/05/23/favicon-g1.jpeg',
 'https://s2.glbimg.com/nMQRSo35mM4QdMphrT4FsjQuMeM=/32x32/smart/filters:strip_icc()/s.glbimg.com/jo/g1/f/original/2018/05/23/favicon-g1.jpeg',
 'https://s2.glbimg.com/HgQSI36Q2YVwzKwqPRAdK_AuZy8=/192x192/smart/filters:strip_icc()/s.glbimg.com/jo/g1/f/original/2018/05/23/favicon-g1.jpeg',
 'https://s2.glbimg.com/FuRy3SDTOpEJIivfVlsCRzQ7XFE=/57x57/smart/filters:strip_icc()/s.glbimg.com/jo/g1/f/original/2018/05/23/favicon-g1.jpeg',
 'https://s2.glbimg.com/aaAnfcAkozhA0CPrwxtMXUgmYOc=/72x72/smart/filters:strip_icc(

Notice that you can now write a program to visit all the available URLs from the G1 home page and turn your web scraper into a web crawler.
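As a rough sketch of that first crawling step, the cell below collects the links of a page, resolves relative links, and keeps only the unique ones that stay on the same site. The `sample_html` string and the `extract_links` helper are hypothetical stand-ins (in practice the text would come from `response.text`), and the same-host filter is just one possible design choice.

```python
import re
from urllib.parse import urljoin, urlsplit

# Hypothetical page source standing in for response.text
sample_html = '''
<a href="https://g1.globo.com/economia/">Economia</a>
<a href='/politica/'>Politica</a>
<a href="https://g1.globo.com/economia/">Economia again</a>
<link rel="preconnect" href="https://s.glbimg.com">
'''

regex_html_url = r'''href=["']([^"']+)["']'''


def extract_links(html, base_url, allowed_host):
    """Return the unique absolute URLs in html that point to allowed_host."""
    links = []
    for href in re.findall(regex_html_url, html):
        absolute = urljoin(base_url, href)  # resolves relative links such as '/politica/'
        if urlsplit(absolute).netloc == allowed_host and absolute not in links:
            links.append(absolute)
    return links


to_visit = extract_links(sample_html, 'https://g1.globo.com/', 'g1.globo.com')
print(to_visit)
```

A crawler would then fetch each URL in `to_visit` with `requests.get`, extract its links in the same way, and keep track of the pages already visited, ideally pausing between requests so the server is not overloaded.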

Now, let's say we want to **grab the text from all the headlines**.

In [39]:
response = requests.get('http://g1.globo.com')
headlines = bs(response.content, 'html.parser').find_all('a', class_='feed-post-link gui-color-primary gui-color-hover')

print(len(headlines))

headlines

9


[<a class="feed-post-link gui-color-primary gui-color-hover" href="https://g1.globo.com/economia/noticia/2018/10/17/elevacao-do-piso-salarial-de-agentes-de-saude-vai-custar-r-48-bi-em-tres-anos-diz-planejamento.ghtml">Governo prevê gasto de R$ 4,8 bi em 3 anos após Congresso derrubar veto de Temer</a>,
 <a class="feed-post-link gui-color-primary gui-color-hover" href="https://g1.globo.com/sp/sao-paulo/noticia/2018/10/17/relatorio-final-do-inquerito-dos-portos-traz-detalhes-sobre-esquema-de-corrupcao-no-porto-de-santos.ghtml">PF vê indícios de que Temer participa de esquema de propina há 20 anos</a>,
 <a class="feed-post-link gui-color-primary gui-color-hover" href="https://g1.globo.com/politica/noticia/2018/10/17/barroso-valida-delacao-de-dono-da-engevix-que-cita-temer-pf-pede-novo-inquerito.ghtml">Após delação de empresário, PF pede novo inquérito sobre Temer</a>,
 <a class="feed-post-link gui-color-primary gui-color-hover" href="https://g1.globo.com/politica/eleicoes/2018/noticia/201

To grab the contents inside an HTML tag, use the `.get_text` method.

In [45]:
print(type(headlines[0]), '\n')

print(headlines[0])

headlines[0].get_text()

<class 'bs4.element.Tag'> 

<a class="feed-post-link gui-color-primary gui-color-hover" href="https://g1.globo.com/economia/noticia/2018/10/17/elevacao-do-piso-salarial-de-agentes-de-saude-vai-custar-r-48-bi-em-tres-anos-diz-planejamento.ghtml">Governo prevê gasto de R$ 4,8 bi em 3 anos após Congresso derrubar veto de Temer</a>


'Governo prevê gasto de R$ 4,8 bi em 3 anos após Congresso derrubar veto de Temer'

In [48]:
for headline in headlines:
    print(headline.get_text(), '\n')

Governo prevê gasto de R$ 4,8 bi em 3 anos após Congresso derrubar veto de Temer 

PF vê indícios de que Temer participa de esquema de propina há 20 anos 

Após delação de empresário, PF pede novo inquérito sobre Temer 

Saiba mais: o que Bolsonaro e Haddad propõem sobre armas de fogo 

Bolsonaro visita a PF e assina carta de compromisso na Arquidiocese do Rio 

FHC diz que 'porta' entre ele e Haddad está 'enferrujada' 

Haddad se reúne com representantes de igrejas evangélicas em SP 

Após críticas, Cid Gomes grava vídeo de apoio a Haddad a pedido do PT 

Presidente do PSL defende negociar comando da Câmara com 'centrão' 
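Besides the text, each tag also carries its attributes, so the link of every headline can be collected in the same pass. The cell below is a minimal sketch on a hypothetical HTML snippet (`sample_html`, with made-up `example.com` links) that imitates the structure of the G1 feed. It uses a CSS selector through `select` instead of `find_all`, which is an alternative and not the approach used above.

```python
from bs4 import BeautifulSoup

# Hypothetical snippet imitating the structure of the G1 feed
sample_html = '''
<div>
  <a class="feed-post-link gui-color-primary gui-color-hover" href="https://example.com/news/1">First headline</a>
  <a class="feed-post-link gui-color-primary gui-color-hover" href="https://example.com/news/2">Second headline</a>
  <a class="other-link" href="https://example.com/about">About</a>
</div>
'''

soup = BeautifulSoup(sample_html, 'html.parser')

# select() takes a CSS selector; 'a.feed-post-link' matches <a> tags carrying that class
rows = [
    {'headline': tag.get_text(strip=True), 'url': tag.get('href')}
    for tag in soup.select('a.feed-post-link')
]

for row in rows:
    print(row['headline'], '->', row['url'])
```

A list of dictionaries like `rows` can be passed directly to `pd.DataFrame` if a table is more convenient.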



---
# <font color='#ff0000'> DELETE THIS SECTION </font>
# Scraping Emails

[PUC-Rio master's students](http://www.econ.puc-rio.br/pessoas/alunos-mestrado)

In [49]:
response = requests.get('http://www.econ.puc-rio.br/pessoas/alunos-mestrado')

regExpEmail = r'[\w\.-]+@[\w\.-]+'  # RegEx for email (raw string, so the backslashes reach the regex engine untouched)
re.findall(regExpEmail, response.text)

['alex.carmar93@gmail.com',
 'alex.carmar93@gmail.com',
 'aliceodrumond@gmail.com',
 'aliceodrumond@gmail.com',
 'biaribeiro11@hotmail.com',
 'biaribeiro11@hotmail.com',
 'carlosalbertobdc@gmail.com',
 'carlosalbertobdc@gmail.com',
 'carloshg96@gmail.com',
 'carloshg96@gmail.com',
 'caterina.vieira@gmail.com',
 'caterina.vieira@gmail.com',
 'cesar.zambrano@live.com',
 'cesar.zambrano@live.com',
 'daniel_cardoso@poli.ufrj.br',
 'daniel_cardoso@poli.ufrj.br',
 'doine.daniel@gmail.com',
 'doine.daniel@gmail.com',
 'daniel.mar.coutinho@gmail.com',
 'daniel.mar.coutinho@gmail.com',
 'davi.doneda@gmail.com',
 'davi.doneda@gmail.com',
 'efagundesdecarvalho@gmail.com',
 'efagundesdecarvalho@gmail.com',
 'eduardo.leitner@gmail.com',
 'eduardo.leitner@gmail.com',
 'felipekotinda@gmail.com',
 'felipekotinda@gmail.com',
 'fernandolmc@al.insper.edu.br',
 'fernandolmc@al.insper.edu.br',
 'gabrielgranato@gmail.com',
 'gabrielgranato@gmail.com',
 'kaian.arantes@gmail.com',
 'kaian.arantes@gmail.com',


Some of the profile numbers do not exist.

In [50]:
requests.get('http://www.econ.puc-rio.br/pessoas/perfil/17').text

'<br />\n<b>Fatal error</b>:  Call to a member function getNucleosId() on boolean in <b>/var/www/Puc-econ/apps/frontend/modules/nucleos/actions/actions.class.php</b> on line <b>87</b><br />\n'

Whenever the error occurs, the term **Fatal error** appears in the text, so we can use it as a condition to skip that number.

In [51]:
re.findall('Fatal error', requests.get('http://www.econ.puc-rio.br/pessoas/perfil/17').text)

['Fatal error']

When the page exists, we use Beautiful Soup to grab the tag that holds the name and the regular expression to grab the email.

In [60]:
response = requests.get('http://www.econ.puc-rio.br/pessoas/perfil/18')

soup = bs(response.content.decode('utf-8','ignore'), 'html.parser')
print(soup.find_all('h5')[0].get_text())

print(re.findall(regExpEmail , response.text)[0])

Carlos Viana de  Carvalho
cvianac@econ.puc-rio.br


Now we just need to iterate over all the pages, skipping those that contain "Fatal error".

In [69]:
# Create empty lists. The data will be appended to them.
Nomes = []
Emails = []

# Loop over the page numbers
for i in range(1, 3000):

    if i % 250 == 0:
        print('buscando pagina', i)  # progress message ("fetching page")

    response = requests.get('http://www.econ.puc-rio.br/pessoas/perfil/' + str(i))

    # Skip the pages that do not exist
    if 'Fatal error' in response.text:
        continue

    nome_i = bs(response.content.decode('utf-8', 'ignore'), 'html.parser').find_all('h5')[0].text

    try:
        email_i = re.findall(regExpEmail, response.text)[0]
    except IndexError:
        # no email on the page
        email_i = ''

    Nomes.append(nome_i)
    Emails.append(email_i)

EmailsPUC = pd.DataFrame(data={'Nomes': Nomes, 'Emails': Emails}).sort_values('Nomes')

buscando pagina 250
buscando pagina 500
buscando pagina 750
buscando pagina 1000
buscando pagina 1250
buscando pagina 1500
buscando pagina 1750
buscando pagina 2000
buscando pagina 2250
buscando pagina 2500
buscando pagina 2750


In [81]:
EmailsPUC[EmailsPUC['Nomes'].str.contains('Gustavo')].dropna().sort_values('Nomes')

Unnamed: 0,Nomes,Emails
189,Carlos Gustavo Machicado Salas,guccio@hotmail.com
760,Gustavo Amoras Souza Lima,gustslima@gmail.com
1775,Gustavo Arantes Camargo,
1011,Gustavo Barbosa de Almeida,
1066,Gustavo Bicharra Pinto,
2012,Gustavo Cardoso de Castro,
1730,Gustavo Cesar Lima,
1776,Gustavo Chalhoub Garcez,
739,Gustavo Cicchelli de Sá Vieira,gustavocsv@gmail.com
1670,Gustavo Cunha Garcia,


One way to protect yourself from web scraping is to not write your email **explicitly**. Here is an example from Marcos Lopez de Prado's [website](http://www.quantresearch.info/).
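To see why this works, the sketch below runs the email pattern on a hypothetical piece of page text. Both addresses are made up, and the `[at]` / `[dot]` notation is only one common way of obfuscating an address, not necessarily the one used on the website linked above.

```python
import re

regExpEmail = r'[\w\.-]+@[\w\.-]+'  # same pattern as above, written as a raw string

# Hypothetical page text: one explicit address and one obfuscated address
page_text = 'Contact: jane.doe@example.com or john [at] example [dot] com'

found = re.findall(regExpEmail, page_text)
print(found)
```

Only the explicit address is captured, because the obfuscated one contains no `@` for the pattern to anchor on.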


# <font color='#ff0000'> DELETE UNTIL HERE </font>


---
## Query Strings: URLs with Parameters
Everything after the "?" is the query string. It is meant to contain data that does not fit within a URL’s normal hierarchical path structure.

https://www.google.com.br/search?q=pesquisa+eleitoral+bolsonaro&num=3&as_sitesearch=g1.com.br

* The query string comes at the end of a URL, starting with a single question mark, “?”.
* Parameters are provided as key-value pairs and separated by an ampersand, “&”.
* The key and value are separated using an equals sign, “=”.
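These rules can be checked by taking the URL above apart with the standard library's `urllib.parse` module. This is only a small illustration and not something the rest of the section depends on.

```python
from urllib.parse import urlsplit, parse_qs

url = 'https://www.google.com.br/search?q=pesquisa+eleitoral+bolsonaro&num=3&as_sitesearch=g1.com.br'

parts = urlsplit(url)
print(parts.path)   # the hierarchical path
print(parts.query)  # everything after "?"

# Split on "&" and "=", and decode each value ("+" stands for a space)
params = parse_qs(parts.query)
print(params)
```

Each value comes back as a list because a key is allowed to appear more than once in a query string.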

In [82]:
url = 'https://www.google.com.br/search?q=pesquisa+eleitoral+bolsonaro&num=3&as_sitesearch=g1.com.br'
res = requests.get(url)
res.text

'<!doctype html><html itemscope="" itemtype="http://schema.org/SearchResultsPage" lang="pt-BR"><head><meta content="text/html; charset=UTF-8" http-equiv="Content-Type"><meta content="/images/branding/googleg/1x/googleg_standard_color_128dp.png" itemprop="image"><noscript><meta content="0;url=/search?q=pesquisa+eleitoral+bolsonaro+site:g1.com.br&amp;num=3&amp;ie=UTF-8&amp;gbv=1&amp;sei=Y6DHW8GFNsOpwgTYvYuQDw" http-equiv="refresh"><style>table,div,span,p{display:none}</style><div style="display:block">Clique <a href="/search?q=pesquisa+eleitoral+bolsonaro+site:g1.com.br&amp;num=3&amp;ie=UTF-8&amp;gbv=1&amp;sei=Y6DHW8GFNsOpwgTYvYuQDw">aqui</a> se você não for redirecionado em alguns segundos.</div></noscript><title>pesquisa eleitoral bolsonaro site:g1.com.br - Pesquisa Google</title><style>#gbar,#guser{font-size:13px;padding-top:1px !important;}#gbar{height:22px}#guser{padding-bottom:7px !important;text-align:right}.gbh,.gbd{border-top:1px solid #c9d7f1;font-size:1px}.gbh{height:0;positio

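We want an easy way to change the parameters of our search. Luckily, the `requests` library already supports URL queries through the `params` argument of `get`. The values are URL-encoded automatically, which is why the literal `+` signs below show up as `%2B` in the response.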

In [83]:
url = 'https://www.google.com.br/search'
param_dict = {'q': 'pesquisa+eleitoral+bolsonaro', 
              'num': '3', 
              'as_sitesearch': 'g1.com.br'}

res = requests.get(url, params=param_dict)
res.text

'<!doctype html><html itemscope="" itemtype="http://schema.org/SearchResultsPage" lang="pt-BR"><head><meta content="text/html; charset=UTF-8" http-equiv="Content-Type"><meta content="/images/branding/googleg/1x/googleg_standard_color_128dp.png" itemprop="image"><noscript><meta content="0;url=/search?q=pesquisa%2Beleitoral%2Bbolsonaro+site:g1.com.br&amp;num=3&amp;ie=UTF-8&amp;gbv=1&amp;sei=c6DHW9i4A4SkwgSIxJbQDw" http-equiv="refresh"><style>table,div,span,p{display:none}</style><div style="display:block">Clique <a href="/search?q=pesquisa%2Beleitoral%2Bbolsonaro+site:g1.com.br&amp;num=3&amp;ie=UTF-8&amp;gbv=1&amp;sei=c6DHW9i4A4SkwgSIxJbQDw">aqui</a> se você não for redirecionado em alguns segundos.</div></noscript><title>pesquisa+eleitoral+bolsonaro site:g1.com.br - Pesquisa Google</title><style>#gbar,#guser{font-size:13px;padding-top:1px !important;}#gbar{height:22px}#guser{padding-bottom:7px !important;text-align:right}.gbh,.gbd{border-top:1px solid #c9d7f1;font-size:1px}.gbh{height:0
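The `%2B` in the output above is the encoded form of a literal `+`. The sketch below reproduces the effect with the standard library's `urlencode`, under the assumption that `requests` encodes dictionary parameters in essentially the same way. It suggests that the search words should be separated by spaces in the dictionary, leaving the encoding to the library.

```python
from urllib.parse import urlencode

# A literal "+" in the value is escaped as %2B
literal_plus = urlencode({'q': 'pesquisa+eleitoral+bolsonaro', 'num': '3'})
print(literal_plus)

# Spaces in the value are encoded as "+", which is what the search expects
with_spaces = urlencode({'q': 'pesquisa eleitoral bolsonaro', 'num': '3'})
print(with_spaces)
```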

## Practical Example - Derivatives Settlement Prices from B3
A relevant application of this is scraping the settlement prices of B3 derivatives.

http://www2.bmf.com.br/pages/portal/bmfbovespa/lumis/lum-sistema-pregao-enUS.asp?Data=10/01/2018&Mercadoria=DOL

In [11]:
url = 'http://www2.bmf.com.br/pages/portal/bmfbovespa/lumis/lum-sistema-pregao-enUS.asp'
param_dict = {'Data': '10/01/2018', 'Mercadoria': 'DOL'}
r = requests.get(url, params=param_dict)
r.text

'\r\n\r\n\r\n<!doctype html>\r\n<html class="no-js" lang="en-us">\r\n<head>\r\n<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">\r\n<meta name="viewport" content="width=device-width, initial-scale=1.0" />\r\n<link rel="stylesheet" href="css/foundation.css" />\n<link rel="stylesheet" href="css/jquery.ui.datepicker.css" />\n<script src="js/vendor/modernizr.js"></script>\r\n\r\n<script type="text/javascript">\r\nfunction Retroativo(theForm) {\r\n    if(!CkDate(theForm.dData1, theForm.dData1.value))\r\n        {\r\n            \r\n                alert("Select the query period.");\r\n            \r\n            theForm.dData1.focus();\r\n            theForm.dData1.select();\r\n            return false;\r\n        }\r\n    if(!CkDate(theForm.dData2, theForm.dData2.value))\r\n        {\r\n            \r\n                alert("Select the query period.");\r\n            \r\n            theForm.dData2.focus();\r\n            theForm.dData2.select();\r\n            return
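Since only the parameters change from one query to the next, the requests for several dates and contracts can be prepared in a loop. The cell below is a sketch under two assumptions that are not confirmed by the page itself: that `Data` expects dates in MM/DD/YYYY format, and that `DI1` is accepted as a second commodity code. The `build_queries` helper is hypothetical.

```python
from datetime import date, timedelta
from urllib.parse import urlencode

url = 'http://www2.bmf.com.br/pages/portal/bmfbovespa/lumis/lum-sistema-pregao-enUS.asp'


def build_queries(start, n_days, commodities):
    """Return one parameter dictionary per (weekday, commodity) pair."""
    queries = []
    for offset in range(n_days):
        day = start + timedelta(days=offset)
        if day.weekday() >= 5:  # skip Saturdays and Sundays
            continue
        for code in commodities:
            # ASSUMPTION: the page expects MM/DD/YYYY dates
            queries.append({'Data': day.strftime('%m/%d/%Y'), 'Mercadoria': code})
    return queries


# 'DI1' is an illustrative second code, not taken from the example above
queries = build_queries(date(2018, 10, 1), 7, ['DOL', 'DI1'])
for q in queries[:3]:
    print(url + '?' + urlencode(q))
```

Each dictionary in `queries` can be passed to `requests.get(url, params=q)`. Exchange holidays are not handled here, so some of the dates may return a page with no prices.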

---
# Webdrivers