# Scraping the web with Python.

## BeautifulSoup


 - `urllib` is the standard-library package for accessing information on the web by sending requests; the cells below use the third-party `requests` package for the same purpose.
 - An HTML response contains both structured and unstructured data.
 - BeautifulSoup parses the HTML returned in the response and extracts the structured data from it.
 
 **BeautifulSoup website:**
 https://www.crummy.com/software/BeautifulSoup/bs4/doc/#problems-after-installation
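The first bullet mentions `urllib`, while the cells below rely on `requests`. As a rough sketch of what the standard-library route might look like, the block below prepares a request without sending it. The helper names `build_request` and `fetch_html` and the user-agent string are assumptions made for illustration, and decoding as UTF-8 assumes the page declares that charset.

```python
from urllib.request import Request, urlopen


def build_request(url, user_agent="scraping-notebook-example/0.1"):
    """Prepare a GET request; nothing is sent until urlopen() is called."""
    return Request(url, headers={"User-Agent": user_agent})


def fetch_html(url, timeout=10):
    """Send the request and decode the body as UTF-8 text (needs network access)."""
    with urlopen(build_request(url), timeout=timeout) as response:
        return response.read().decode("utf-8")


# Only build the request here, so the example runs offline
req = build_request("https://es.wikipedia.org/wiki/Dan_Brown")
print(req.get_method(), req.full_url)
```

Calling `fetch_html(url)` would play the same role as `requests.get(url).text` in the cells that follow.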

In [1]:
# Import the packages needed for scraping.

import requests
from bs4 import BeautifulSoup

In [10]:
# Specify the URL

url = 'https://es.wikipedia.org/wiki/Dan_Brown'

In [11]:
# Send the request and receive the response

r = requests.get(url)

In [12]:
# Extract the response body as HTML text

html_doc = r.text

print(html_doc)

<!DOCTYPE html>
<html class="client-nojs" lang="es" dir="ltr">
<head>
<meta charset="UTF-8"/>
<title>Dan Brown - Wikipedia, la enciclopedia libre</title>
<script>document.documentElement.className = document.documentElement.className.replace( /(^|\s)client-nojs(\s|$)/, "$1client-js$2" );</script>
<script>(window.RLQ=window.RLQ||[]).push(function(){mw.config.set({"wgCanonicalNamespace":"","wgCanonicalSpecialPageName":false,"wgNamespaceNumber":0,"wgPageName":"Dan_Brown","wgTitle":"Dan Brown","wgCurRevisionId":107240258,"wgRevisionId":107240258,"wgArticleId":41826,"wgIsArticle":true,"wgIsRedirect":false,"wgAction":"view","wgUserName":null,"wgUserGroups":["*"],"wgCategories":["Wikipedia:Artículos que necesitan referencias adicionales","Wikipedia:Referenciar (aún sin clasificar)","Personas vivas","Wikipedia:Control de autoridades con 15 elementos","Wikipedia:Artículos con identificadores VIAF","Wikipedia:Artículos con identificadores ISNI","Wikipedia:Artículos con identificadores BNE","Wiki

In [13]:
# Create a BeautifulSoup object from the HTML
soup = BeautifulSoup(html_doc, "lxml")

In [14]:
# Get the title of Dan Brown's web page

dan_title = soup.title

print(dan_title)

<title>Dan Brown - Wikipedia, la enciclopedia libre</title>


In [15]:
# Extract the text of Dan Brown's Wikipedia page.

texto_dan = soup.get_text()

In [16]:
print(texto_dan)




Dan Brown - Wikipedia, la enciclopedia libre
document.documentElement.className = document.documentElement.className.replace( /(^|\s)client-nojs(\s|$)/, "$1client-js$2" );
(window.RLQ=window.RLQ||[]).push(function(){mw.config.set({"wgCanonicalNamespace":"","wgCanonicalSpecialPageName":false,"wgNamespaceNumber":0,"wgPageName":"Dan_Brown","wgTitle":"Dan Brown","wgCurRevisionId":107240258,"wgRevisionId":107240258,"wgArticleId":41826,"wgIsArticle":true,"wgIsRedirect":false,"wgAction":"view","wgUserName":null,"wgUserGroups":["*"],"wgCategories":["Wikipedia:Artículos que necesitan referencias adicionales","Wikipedia:Referenciar (aún sin clasificar)","Personas vivas","Wikipedia:Control de autoridades con 15 elementos","Wikipedia:Artículos con identificadores VIAF","Wikipedia:Artículos con identificadores ISNI","Wikipedia:Artículos con identificadores BNE","Wikipedia:Artículos con identificadores BNF","Wikipedia:Artículos con identificadores CANTIC","Wikipedia:Artículos con identificadores 
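The output above still contains JavaScript from the page's `<script>` tags (whether `get_text()` includes it may depend on the BeautifulSoup version). One possible way to keep only the readable text is to remove those tags before extracting. This is a minimal sketch on a hypothetical miniature page; `sample_html` and `visible_text` are illustrative names, and the built-in `"html.parser"` is used here instead of `"lxml"` only to avoid an extra dependency.

```python
from bs4 import BeautifulSoup

# Hypothetical miniature page standing in for the downloaded html_doc
sample_html = """
<html><head><title>Dan Brown - Wikipedia</title>
<script>document.documentElement.className = "client-js";</script>
<style>body { margin: 0; }</style></head>
<body><p>Dan Brown is an author.</p></body></html>
"""


def visible_text(html):
    """Return the page text with <script> and <style> contents removed."""
    soup = BeautifulSoup(html, "html.parser")
    # Remove script and style elements from the tree entirely
    for tag in soup(["script", "style"]):
        tag.decompose()
    # Join the remaining text fragments, skipping whitespace-only ones
    return " ".join(soup.stripped_strings)


clean_text = visible_text(sample_html)
print(clean_text)
```

The same function could be applied to `html_doc` from the earlier cell.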

In [17]:
# find_all() returns every hyperlink on the page; in HTML, hyperlinks are <a> tags,
# so we search for 'a'

a_tags = soup.find_all('a')

In [19]:
# For each tag found, extract the link target (href) and print it;
# None means the tag has no href attribute

for link in a_tags:
    print(link.get('href'))

None
#mw-head
#p-search
/wiki/Wikipedia:Referencias
/wiki/Wikipedia:Verificabilidad
/wiki/Wikipedia:Fuentes_fiables
/wiki/Ayuda:C%C3%B3mo_referenciar
/wiki/Wikipedia:Wikipedia_no_es_una_fuente_primaria
/wiki/Archivo:Dan_Brown_bookjacket_cropped.jpg
/wiki/22_de_junio
/wiki/1964
/wiki/Exeter_(Nuevo_Hampshire)
/wiki/Nuevo_Hampshire
/wiki/Estados_Unidos
/wiki/Estados_Unidos
https://www.wikidata.org/wiki/Q7345?uselang=es#P27
/wiki/Idioma_ingl%C3%A9s
https://www.wikidata.org/wiki/Q7345?uselang=es#P103
/wiki/Agnosticismo
https://www.wikidata.org/wiki/Q7345?uselang=es#P140
/wiki/Amherst_College
/wiki/Phillips_Exeter_Academy
https://www.wikidata.org/wiki/Q7345?uselang=es#P69
/wiki/Novela
/wiki/La_fortaleza_digital
/wiki/La_conspiraci%C3%B3n
/wiki/%C3%81ngeles_y_demonios_(novela)
/wiki/El_c%C3%B3digo_Da_Vinci
/wiki/El_s%C3%ADmbolo_perdido
/wiki/Inferno_(novela_de_Dan_Brown)
/wiki/Origen_(novela_de_Dan_Brown)
https://www.wikidata.org/wiki/Q7345?uselang=es#P800
https://www.wikidata.org/wiki/Q7345?
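The list above mixes missing values (`None`), in-page anchors (`#...`), site-relative paths (`/wiki/...`) and absolute URLs. A small sketch of how these might be normalized with the standard library follows; the helper name `absolute_links` is an assumption, and the sample values are copied from the output above.

```python
from urllib.parse import urljoin

BASE_URL = "https://es.wikipedia.org/wiki/Dan_Brown"

# A few href values copied from the output above
sample_hrefs = [
    None,
    "#mw-head",
    "/wiki/1964",
    "https://www.wikidata.org/wiki/Q7345?uselang=es#P27",
]


def absolute_links(hrefs, base_url):
    """Drop missing and fragment-only hrefs; resolve the rest against base_url."""
    links = []
    for href in hrefs:
        if href is None or href.startswith("#"):
            continue
        # urljoin leaves absolute URLs unchanged and resolves relative ones
        links.append(urljoin(base_url, href))
    return links


resolved = absolute_links(sample_hrefs, BASE_URL)
for link in resolved:
    print(link)
```

With the real page, the input could be `[link.get('href') for link in a_tags]`.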

---
# APIs and JSON


**API**: Application Programming Interface.

APIs provide interfaces between programs and commonly exchange data in JSON format.

**JSON**: JavaScript Object Notation.

A JSON document is built from key:value pairs, much like a Python dictionary.
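To make the dictionary analogy concrete, here is a minimal sketch that decodes a JSON string with the standard `json` module. The `raw` string is a hypothetical, shortened sample shaped like a movie API response, not real API output.

```python
import json

# Hypothetical JSON text, shaped like a small API response
raw = (
    '{"Title": "The Social Network", "Year": "2010", '
    '"Ratings": [{"Source": "Metacritic", "Value": "95/100"}]}'
)

# json.loads turns JSON objects into dicts and JSON arrays into lists
movie = json.loads(raw)
print(type(movie).__name__)
print(movie["Title"])
print(movie["Ratings"][0]["Value"])

# json.dumps goes the other way, from Python objects back to JSON text
round_trip = json.dumps(movie, indent=2)
```

Nested values are reached by chaining keys and indexes, as in the `Ratings` lookup.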

In [21]:
# Assign URL to variable: url
url = ('http://www.omdbapi.com/?apikey=ff21610b&t=social+network')

# Package the request, send the request and catch the response: r
r = requests.get(url)

# Decode the JSON data into a dictionary: json_data
json_data = r.json()

# Print each key-value pair in json_data
for key, value in json_data.items():
    print(key + ': ', value)

Title:  The Social Network
Year:  2010
Rated:  PG-13
Released:  01 Oct 2010
Runtime:  120 min
Genre:  Biography, Drama
Director:  David Fincher
Writer:  Aaron Sorkin (screenplay), Ben Mezrich (book)
Actors:  Jesse Eisenberg, Rooney Mara, Bryan Barter, Dustin Fitzsimons
Plot:  Harvard student Mark Zuckerberg creates the social networking site that would become known as Facebook, but is later sued by two brothers who claimed he stole their idea, and the co-founder who was later squeezed out of the business.
Language:  English, French
Country:  USA
Awards:  Won 3 Oscars. Another 165 wins & 168 nominations.
Poster:  https://ia.media-imdb.com/images/M/MV5BMTM2ODk0NDAwMF5BMl5BanBnXkFtZTcwNTM1MDc2Mw@@._V1_SX300.jpg
Ratings:  [{'Source': 'Internet Movie Database', 'Value': '7.7/10'}, {'Source': 'Rotten Tomatoes', 'Value': '96%'}, {'Source': 'Metacritic', 'Value': '95/100'}]
Metascore:  95
imdbRating:  7.7
imdbVotes:  537,084
imdbID:  tt1285016
Type:  movie
DVD:  11 Jan 2011
BoxOffice:  $96,400
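The request URL in the cell above carries its parameters in the query string (`apikey` and `t`). A sketch of how such a URL might be assembled from a dictionary is shown below; `build_query_url` is an illustrative helper and `YOUR_API_KEY` is a placeholder, not a working key.

```python
from urllib.parse import urlencode

BASE_URL = "http://www.omdbapi.com/"


def build_query_url(base_url, params):
    """Append URL-encoded query parameters to base_url."""
    # urlencode escapes values; spaces become '+' by default
    return base_url + "?" + urlencode(params)


params = {"apikey": "YOUR_API_KEY", "t": "social network"}
url = build_query_url(BASE_URL, params)
print(url)
```

`requests.get` also accepts a `params` argument that performs this encoding, so the dictionary could be passed to it directly.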

In [27]:
# Assign URL to variable: url
url = 'https://en.wikipedia.org/w/api.php?action=query&prop=extracts&format=json&exintro=&titles=Dan_Brown'

# Package the request, send the request and catch the response: r
r = requests.get(url)

# Decode the JSON data into a dictionary: json_data
json_data = r.json()

# Print the 'pages' mapping, which holds the page extract keyed by page ID
DanBrown_extract = json_data['query']['pages']
print(DanBrown_extract)

{'444645': {'pageid': 444645, 'ns': 0, 'title': 'Dan Brown', 'extract': '<p><b>Daniel Gerhard Brown</b> (born June 22, 1964) is an American author of thriller novels, most notably the Robert Langdon stories: <i>Angels &amp; Demons</i> (2000), <i>The Da Vinci Code</i> (2003), <i>The Lost Symbol</i> (2009), <i>Inferno</i> (2013) and <i>Origin</i> (2017). His novels are treasure hunts set in a 24-hour period, and feature the recurring themes of cryptography, keys, symbols, codes, and conspiracy theories. His books have been translated into 56 languages, and as of 2012, sold over 200 million copies. Three of them, <i>Angels &amp; Demons</i> (2000), <i>The Da Vinci Code</i> (2003) and <i>Inferno</i> (2013) have been adapted into films.</p>\n<p>Brown\'s novels that feature the lead character, Langdon, also include historical themes and Christianity as motifs, and have generated controversy. Brown states on his website that his books are not anti-Christian, though he is on a \'constant spirit
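The printed value is a dictionary keyed by page ID, so reaching the extract itself takes one more step. The sketch below shows one way to do that without hard-coding the ID; `sample_json` is a hypothetical, shortened stand-in for the response above, and `first_page` is an illustrative helper name.

```python
# Hypothetical, shortened response shaped like the output above
sample_json = {
    "query": {
        "pages": {
            "444645": {
                "pageid": 444645,
                "ns": 0,
                "title": "Dan Brown",
                "extract": "<p><b>Daniel Gerhard Brown</b> (born June 22, 1964) "
                           "is an American author of thriller novels.</p>",
            }
        }
    }
}


def first_page(json_data):
    """Return the first page record without hard-coding its page ID."""
    pages = json_data["query"]["pages"]
    return next(iter(pages.values()))


page = first_page(sample_json)
print(page["title"])
print(page["extract"])
```

Since the extract is an HTML fragment, it could be passed to `BeautifulSoup` and `get_text()` as in the first half of the notebook.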