# Initial webscraping example

This tutorial uses the Python packages *requests* to download HTML and *BeautifulSoup* to parse it.

First, try to download a sample page from the web:

http://dataquestio.github.io/web-scraping-pages/simple.html

In [5]:
import requests
from bs4 import BeautifulSoup

site = "http://dataquestio.github.io/web-scraping-pages/simple.html"
page = requests.get(site)
page 

<Response [200]>

In [6]:
page.status_code

200

Status codes starting with 2 indicate success; codes starting with 4 or 5 typically indicate errors.
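
The rule about status codes can be sketched as a small helper. This is only an illustration: the function name `classify_status` and the category labels are assumptions made here, not part of *requests*. The reason phrases come from the standard library's `http.HTTPStatus`.

```python
from http import HTTPStatus


def classify_status(code: int) -> str:
    """Return a coarse category for an HTTP status code (hypothetical helper)."""
    categories = {
        1: "informational",
        2: "success",
        3: "redirection",
        4: "client error",
        5: "server error",
    }
    # The first digit of the code determines its class.
    return categories.get(code // 100, "unknown")


for code in (200, 404, 500):
    print(code, HTTPStatus(code).phrase, "->", classify_status(code))
```

In practice, calling `page.raise_for_status()` on a *requests* response raises an `HTTPError` for 4xx and 5xx codes, which is usually enough to stop a script before it parses an error page.
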
We can now simply print the page contents.

In [7]:
page.content


b'<!DOCTYPE html>\n<html>\n    <head>\n        <title>A simple example page</title>\n    </head>\n    <body>\n        <p>Here is some simple content for this page.</p>\n    </body>\n</html>'

Using *BeautifulSoup*, the page contents can be parsed and printed nicely:

In [20]:
soup = BeautifulSoup(page.content, 'html.parser')
print(soup.prettify())

<!DOCTYPE html>
<html>
 <head>
  <title>
   A simple example page
  </title>
 </head>
 <body>
  <p>
   Here is some simple content for this page.
  </p>
 </body>
</html>




The *children* attribute of the *BeautifulSoup* object is an iterator over the top-level nodes of the document, which can be turned into a list. The third child (index 2) is the actual nested *html Tag* object, whose own children can be extracted in the same way.


In [23]:
list(soup.children)

['html', '\n', <html>
 <head>
 <title>A simple example page</title>
 </head>
 <body>
 <p>Here is some simple content for this page.</p>
 </body>
 </html>]

In [24]:
[type(item) for item in list(soup.children)]

[bs4.element.Doctype, bs4.element.NavigableString, bs4.element.Tag]

In [25]:
html = list(soup.children)[2]

In [26]:
list(html.children)

['\n', <head>
 <title>A simple example page</title>
 </head>, '\n', <body>
 <p>Here is some simple content for this page.</p>
 </body>, '\n']

This now gives access to the body:

In [29]:
body = list(html.children)[3]

In [30]:
list(body.children)

['\n', <p>Here is some simple content for this page.</p>, '\n']

The single paragraph (*p* tag) can be accessed in a similar way, and its text extracted with *get_text*:

In [31]:
p = list(body.children)[1]

In [32]:
p.get_text()

'Here is some simple content for this page.'
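
The manual navigation above depends on knowing which list positions hold whitespace strings. A minimal sketch of a more robust variant follows. It assumes the same HTML as the sample page, pasted inline so that it runs without a network connection, and it introduces a hypothetical helper `tag_children` that keeps only *Tag* objects.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Same markup as the sample page, inlined so no download is needed.
html_doc = """<!DOCTYPE html>
<html>
    <head>
        <title>A simple example page</title>
    </head>
    <body>
        <p>Here is some simple content for this page.</p>
    </body>
</html>"""

soup = BeautifulSoup(html_doc, "html.parser")


def tag_children(node):
    """Return only the Tag children of a node, skipping strings and the doctype."""
    return [child for child in node.children if isinstance(child, Tag)]


html = tag_children(soup)[0]      # the <html> tag
head, body = tag_children(html)   # <head> and <body>
p = tag_children(body)[0]         # the single <p> tag
print(p.get_text())

# Attribute access is a shortcut to the first matching descendant tag.
print(soup.body.p.get_text())
```

Both `print` calls show the paragraph text. The attribute form `soup.body.p` is shorter, but it only ever returns the first matching tag.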

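## Finding all instances of a tag

Instead of navigating the tree level by level, *find_all* searches the whole document and returns every instance of a tag as a list.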


In [36]:
soup = BeautifulSoup(page.content, 'html.parser')
soup.find_all('p')

[<p>Here is some simple content for this page.</p>]

In [37]:
soup.find_all('p')[0].get_text()

'Here is some simple content for this page.'
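
The sample page has only one paragraph, so the list returned by *find_all* has a single element. The sketch below uses a made-up HTML snippet with three paragraphs (an assumption for illustration, not one of the tutorial pages) to show the more general case, together with *find*, which returns only the first match.

```python
from bs4 import BeautifulSoup

# Hypothetical markup with several <p> tags, inlined for illustration.
html_doc = """
<html><body>
  <p>First paragraph.</p>
  <p>Second paragraph.</p>
  <p>Third paragraph.</p>
</body></html>
"""

soup = BeautifulSoup(html_doc, "html.parser")

# find_all returns every matching tag, in document order.
texts = [p.get_text() for p in soup.find_all("p")]
print(texts)

# find returns the first match only, or None if there is no match.
first = soup.find("p")
print(first.get_text())
print(soup.find("table"))
```

Since *find* returns `None` when nothing matches, it is worth checking the result before calling *get_text* on pages whose structure is not known in advance.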