## 2. Making your own API: Web scraping

Sometimes data is on the web but there is no API that grants access to it, the API lacks functionality, or its terms of service are not adequate. Since we, as humans, have visual access to the data, we may wonder how to extract it automatically. The discipline for doing so is **Web Scraping**.

Before we start, it is useful to understand a little about how web pages are created and how their data is stored. This section gives a brief introduction to web front-end development. We will focus on two basic aspects:

+ Basic HTML + CSS static pages.
+ Dynamic HTML (a basic JavaScript example using jQuery).


### 2.1 Basic HTML + CSS 101

The most basic web pages are built upon HTML + CSS technology. This division stands for content and design, respectively. **HTML (HyperText Markup Language)** gives websites structure and stores their content. This is our target for scraping. **CSS (Cascading Style Sheets)**, on the other hand, formats the content and singles out parts of it for visualization purposes, i.e. it defines the style (e.g. font family, color, borders, image style, relative positioning of the content, etc.). HTML files include tags and references to style, so it is worthwhile to understand a little of both technologies: this helps us scrape data more efficiently.


HTML is a tagged language usually rendered by a browser. Tags are specified in the following format:

<p style="text-align: center">&lt;tag_name *attributes*&gt; content &lt;/tag_name&gt;</p>

<p>
<div class = "alert alert-info" style = "border-radius:10px;border-width:3px;border-color:darkblue;font-family:Verdana,sans-serif;font-size:16px;">STRUCTURE of an HTML file:

<ul>
    <li> HTML files start with &lt;!DOCTYPE html&gt;. This tells the browser that we will use HTML5. Former versions of the HTML standard used different, longer declarations. </li>
    <li> The first tag in a web page is &lt;html&gt; and its corresponding &lt;/html&gt; closing tag. All the web page is found inside these tags. </li>
    <li> HTML files have a &lt;head&gt; and a &lt;body&gt; </li>
    <li> In the head, we have the &lt;title&gt; tag, which we use to specify the webpage's name. We can also find references to CSS stylesheets (&lt;link&gt;) used for formatting the page and links to JavaScript files (&lt;script&gt;) that give the web page dynamic behavior.</li>
    <li> In the body we find the content of the page.
        <ul>
            <li> Headings and text paragraphs can be created using &lt;h#&gt; (# is a number from 1 to 6) and &lt;p&gt;, respectively. </li>
            <li> Hyperlinks (links) are given in the <strong>href</strong> attribute of the &lt;a&gt; (anchor) tag. </li>
            <li> Images can be embedded using the &lt;img&gt; tag and setting the <strong>src</strong> attribute to the resource. Caution: img is a special tag that does not have a closing tag, e.g. &lt;img src = "my_pic.jpg" /&gt; </li>
        </ul>
    </li>
</ul>
</div>
</p>
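
The structure described above can be explored from Python. The following is a minimal sketch that uses only the standard library `html.parser` module (not the libraries used later in this section); the class name `PageSummary` and the sample page are made up for illustration. It collects the title, the `href` of every anchor and the `src` of every image.

```python
from html.parser import HTMLParser


class PageSummary(HTMLParser):
    """Collect the title, link targets and image sources of an HTML page."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.links = []
        self.images = []
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "title":
            self._in_title = True
        elif tag == "a" and "href" in attrs:
            self.links.append(attrs["href"])      # hyperlinks live in href
        elif tag == "img" and "src" in attrs:
            self.images.append(attrs["src"])      # images live in src

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


# Hypothetical page, loosely modelled on the example of this section
page = """<!DOCTYPE html>
<html>
  <head><title>Basic knowledge for web scraping</title></head>
  <body>
    <h1>About HTML</h1>
    <p>Read more at the <a href="http://www.w3.org/">W3C</a>.</p>
    <img src="files/rubberduck.jpg" />
  </body>
</html>"""

summary = PageSummary()
summary.feed(page)
summary.close()
print(summary.title.strip())
print(summary.links)
print(summary.images)
```

The parser never builds a tree; it only reacts to opening tags, closing tags and text as they stream by, which is enough for simple extractions like this one.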

<div class = "alert alert-success" style = "border-radius:10px;border-width:3px;border-color:darkgreen;font-family:Verdana,sans-serif;font-size:16px;">

Let us build a basic HTML web page, adding the following tags. Remember that nearly all tags must be closed using &lt;/tag&gt;.

+ DOCTYPE
+ html
+ head
+ title
+ body

<ol>
<li>Create a file 'example.html' in your favorite editor.</li>
<li>Create a basic html web page containing a *title*, *h1*, *p*, *img* and *a* tags.</li>
</ol>
</div>

If you are feeling lazy, go to the files folder and double-click on "example.html". You can check the HTML code in the following cell.

<html>
	<head>
		<title>
			Basic knowledge for web scraping.
		</title>	
	</head>
	<body>
		<h1>About HTML
		</h1>
		<p>HTML (HyperText Markup Language) is the basic language to provide content on the web. It is a tagged language. You can check more about it at the <a href="http://www.w3.org/community/webed/wiki/HTML">World Wide Web Consortium.</a></p>
        
        <p> One of the following rubberduckies is clickable
	</p>
	<p>
            <img src = "files/rubberduck.jpg"/>
        
            <a href="http://www.pinterest.com/misscannabliss/rubber-duck-mania/"><img src = "files/rubberduck.jpg"/></a>
        </p>
	</body>
</html>


Because IPython notebook cells directly interpret Markdown and HTML, we can use a cell as an interactive editor to improve our understanding of HTML.


In [5]:
from IPython.display import HTML

# Pass the path with filename=; a bare string would be rendered as literal HTML text
HTML(filename="files/example.html")

<div class = "alert alert-info" style = "border-radius:10px;border-width:3px;border-color:darkblue;font-family:Verdana,sans-serif;font-size:16px;">
**Old style HTML** static pages rely heavily on tables and lists: 

<ul>
<li> Making ordered and unordered lists is simple: *ol* (ordered list), *ul* (unordered list) are the main tags. Each item is inserted as *li* (list item) </li>
<li> *table* is the containing tag for building tables, each table row is given as *tr* and columns depend on the table data elements *td*. Tables may have a head (*thead*) and a body (*tbody*). *th* is the same as *td* but for the header. If you want a multi column cell then use colspan=number of cells to cover.
</li>
</ul>
</div>

The next example builds a simple table. Check the source of this cell to see the markup.

<table>
<thead>
<tr><th colspan = 2>A table</th></tr>
</thead>
<tbody>
<tr>
<td>Hello I am element 1.1</td><td>Hello I am element 1.2</td>
</tr>
<tr>
<td colspan=2>Hello I am element 2.1 and 2.2</td>
</tr>
</tbody>
</table>
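
Tables are a frequent scraping target because rows and cells map directly onto lists. The sketch below assumes a small, well-formed table (attributes quoted, every tag closed) so that the standard library `xml.etree.ElementTree` can parse it; real pages are rarely that clean and usually need a more tolerant HTML parser.

```python
import xml.etree.ElementTree as ET

# Well-formed version of a small table (hypothetical content)
table_html = """<table>
  <thead>
    <tr><th colspan="2">A table</th></tr>
  </thead>
  <tbody>
    <tr><td>element 1.1</td><td>element 1.2</td></tr>
    <tr><td colspan="2">element 2.1 and 2.2</td></tr>
  </tbody>
</table>"""

table = ET.fromstring(table_html)

# Header cells are th elements inside thead
header = [th.text for th in table.findall("./thead/tr/th")]

# Each tr in tbody becomes one list of td texts
rows = [[td.text for td in tr.findall("td")]
        for tr in table.findall("./tbody/tr")]

print(header)
print(rows)
```

Note that the second row yields a single cell: `colspan` changes how the cell is drawn, not how many `td` elements exist.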

<div class = "alert alert-info" style = "border-radius:10px;border-width:3px;border-color:darkblue;font-family:Verdana,sans-serif;font-size:16px;">
**Current HTML** static pages rely heavily on containers and style: 

<ul>
<li> *div* stands for division and marks a block of content.
</li>
<li> *span* is used to single out an element of a block content.
</li>
</ul>

</div>

By themselves they are not much, but when combined with the *style* attribute they become interesting.

For example, consider the following code:

<div style = "width:100px;height:100px;background-color:red;padding:10px;font-family:Verdana;font-size:24px;color:pink;display:inline-block">  Box 1
</div>
<div style = "width:100px;height:100px;background-color:blue;padding:10px;font-family:Futura;font-size:24px;color:lightblue;display:inline-block">  Box 2
</div>
<div style = "width:100px;height:100px;background-color:yellow;padding:10px;font-family:Garamond;font-size:24px;color:orange;display:inline-block">  Box 3
</div>
<div style = "width:100px;height:100px;background-color:green;padding:10px;font-family:ArialNarrow;font-size:24px;color:lightgreen;display:inline-block">  Box 4
</div>

The *style* attribute is also referred to as *inline CSS* and lets us give the skeleton some skin and makeup.

<div class = "alert alert-success" style = "border-radius:10px;border-width:3px;border-color:darkgreen;font-family:Verdana,sans-serif;font-size:16px;">

Let us build a basic HTML web page and check the magic of CSS in action before going in detail into CSS.
<ol>
<li>Create a file 'example2.html' using your favorite editor.</li>
<li>Fill the header and body basic HTML structure</li>
<li>Let us add three containers *div* in the body.</li> 
<li>Select one of them. This will be used as a navigation bar and will contain an unordered list with three elements: Home, Brief Bio, Hobbies</li>
<li>Select another division and create a table inside. Each row will contain information about your profile, e.g. the first row may contain Name: Your Name, the second row Position: Your current position, etc.</li>
<li>The last one will contain an image of yourself and a paragraph with your contact info (email)</li>
</ol>
<p>
Check the [result](files/example2.html). Nearly professional, isn't it?
</p>
</div>

<div class = "alert alert-success" style = "border-radius:10px;border-width:3px;border-color:darkgreen;font-family:Verdana,sans-serif;font-size:16px;">

Let us add some style.
<ol>
<li>Add the class "navbar" as an attribute to the *div* containing the list. (eg. class = "navbar")</li>
<li>Add the class "head" to the *div* containing the image and the email.</li>
<li>Add the class "right" to the *div* containing the table.</li>
<li>Add the identifier "email" to the paragraph containing the email. (eg. id = "email")</li>
<li>Finally, let us link the class and id definitions we have just written by adding the following line to the head tag:
<p>&lt;link type="text/css" rel="stylesheet" href="stylesheet.css"/&gt;</p>
</li>
</ol>
<p>
Check the [result](files/example2f.html) now. Do not forget to hover over your navigation bar.
</p>
</div>

The previous exercise is extremely simple, but it shows the separation between content and styling. Observe that the HTML file you have created does not contain any explicit styling. However, we have added two new elements to the mix: classes and identifiers, given as attributes of the tags. As you can imagine, the styling rules for each class and ID are collected in the stylesheet.css file we have just linked.

<div class = "alert alert-warning" style = "border-radius:10px;border-width:3px;border-color:orange;font-family:Verdana,sans-serif;font-size:16px;">**COMMENT:**
Very simple formatting can also be given using HTML tags. For example, the *strong* and *em* tags produce bold and italic text.
</div>

### HTML Structure

The HTML document can be seen as a tree structure. The root of the tree is the *html* tag. It has two children, *head* and *body*. Head may have different children such as *title*, *link* or *script*. Body may have any combination of tags: *div*, *p*, *a*, etc. These tags can be nested, e.g. we can find a *div* inside a *div* inside a *div*. In the example we have seen how to refer to nested elements. The elements can be HTML tags, classes or identifiers:

+ "elem1 elem2" refers to any elem2 inside any elem1, regardless of the degree of nesting (there may be any number of elements in between).
+ "elem1>elem2" refers only to those elem2 that are direct children of an elem1.
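
The difference between the two forms can be checked on a small tree. The sketch below is a stand-in rather than a CSS engine: it uses the limited XPath support of the standard library `xml.etree.ElementTree`, where `//` plays the role of the descendant form ("elem1 elem2") and `/` the role of the direct-child form ("elem1>elem2"). The navigation bar snippet is hypothetical.

```python
import xml.etree.ElementTree as ET

snippet = """<html><body>
<div class="navbar">
  <a href="/home">Home</a>
  <ul>
    <li><a href="/bio">Brief Bio</a></li>
    <li><a href="/hobbies">Hobbies</a></li>
  </ul>
</div>
</body></html>"""

root = ET.fromstring(snippet)

# Equivalent of the CSS selector "div a": any a nested at any depth in a div
descendants = [a.text for a in root.findall(".//div//a")]

# Equivalent of the CSS selector "div>a": only a elements whose parent is a div
children = [a.text for a in root.findall(".//div/a")]

print(descendants)
print(children)
```

Only the "Home" link is a direct child of the *div*; the other two sit inside *li* elements, so they match the descendant form only.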

### 2.2 Hands on with CSS selection

Several web-focused parsing libraries support CSS selection. In this course we will see a couple of them. The first one is **LXML**.

LXML is built upon the C libraries libxml2 and libxslt. These libraries bring standards-compliant XML support as well as support for (broken) HTML, and they are very, very fast!

LXML supports CSS selection through the cssselect package. Let us do some drills with lxml.

`pip install cssselect`

<div class = "alert alert-success" style = "border-radius:10px;border-width:3px;border-color:darkgreen;font-family:Verdana,sans-serif;font-size:16px;">**LXML PRACTICE** With the source of python.org
<ol>
<li>How many paragraphs are on the page?</li>
<li>What is the text content of the div with the class "shrubbery"? What are the links in that same div?</li>

</ol>
</div>

In [6]:
from urllib.request import urlopen

source = urlopen('http://python.org')


from lxml import html
from lxml import cssselect
tree = html.document_fromstring(source.read())
tree


<Element html at 0x194dc688bd0>

In [7]:
# How many paragraphs are on the page?
# tree.cssselect("p") returns every <p> element of the document.
# Be aware of comments: tree.iter() also yields comment nodes, cssselect does not.
count = len(tree.cssselect("p"))
print("1- How many paragraphs are on the page? ", count)



1- How many paragraphs are on the page?  23


In [8]:
import urllib.request
source = urllib.request.urlopen('http://python.org')
tree = html.document_fromstring(source.read())
print("2- What is the text content of the div with the class \"shrubbery\"? What are the links in that same div?")
# "div.shrubbery" selects every div that carries the class "shrubbery"
for el in tree.cssselect("div.shrubbery"):
    print(el.text_content())


2- What is the text content of the div with the class "shrubbery"? What are the links in that same div?

                        
                            Latest News
                            More
                            
                            
                                
                                
                                
2020-10-05
 Python 3.9.0 is now available, and you can already test 3.10.0a1!
                                
                                
2020-10-02
 Python 3.5 is no longer supported
                                
                                
2020-10-02
 Join the Python Developers Survey 2020: Share and learn about the community
                                
                                
2020-09-24
 Python 3.8.6 is now available
                                
                                
2020-09-22
 The Python Software Foundation re-opens its Grants Program!
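
The second half of the question, the links inside that div, is not answered by the cell above. A possible sketch follows; it runs on a hypothetical, well-formed snippet with the standard library so it does not need network access. With lxml the same query could be written as `tree.cssselect("div.shrubbery a")`. The helper `has_class` is made up here; it splits the class attribute because one element may carry several classes.

```python
import xml.etree.ElementTree as ET

# Hypothetical stand-in for the relevant part of the page
snippet = """<body>
<div class="medium-widget shrubbery">
  <h2>Latest News</h2>
  <ul>
    <li><a href="/news/1">First item</a></li>
    <li><a href="/news/2">Second item</a></li>
  </ul>
</div>
<div class="other"><a href="/elsewhere">Elsewhere</a></div>
</body>"""

root = ET.fromstring(snippet)


def has_class(element, name):
    """True if name is one of the space-separated classes of the element."""
    return name in element.get("class", "").split()


shrubbery = [div for div in root.iter("div") if has_class(div, "shrubbery")]

# href of every anchor nested anywhere inside the selected divs
links = [a.get("href") for div in shrubbery for a in div.iter("a")]

# Text content with runs of whitespace collapsed to single spaces
text = [" ".join("".join(div.itertext()).split()) for div in shrubbery]

print(links)
print(text)
```

Comparing the whole attribute with `==` would miss the first div here, since its class attribute holds two names.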
                                
              

<div class = "alert alert-success" style = "border-radius:10px;border-width:3px;border-color:darkgreen;font-family:Verdana,sans-serif;font-size:16px;">
**LXML EXERCISE:** <BR>

Scrape 20 pages from python.org and store the text inside the code elements in MongoDB.

</div>

In [9]:
print ("3- What is the text in the code elements?")

3- What is the text in the code elements?
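
One possible starting point for the exercise is sketched below. It only covers the extraction step for a single page and uses the standard library; the names `CodeTextParser` and `code_documents` and the sample markup are hypothetical. The result is a list of dictionaries, which is the shape a document store expects (with pymongo, for instance, such a list could be handed to a collection's `insert_many`; that step is left out here).

```python
from html.parser import HTMLParser


class CodeTextParser(HTMLParser):
    """Collect the text found inside <code> elements."""

    def __init__(self):
        super().__init__()
        self.snippets = []
        self._depth = 0      # > 0 while we are inside a <code> element
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        if tag == "code":
            self._depth += 1

    def handle_endtag(self, tag):
        if tag == "code" and self._depth:
            self._depth -= 1
            if self._depth == 0:
                self.snippets.append("".join(self._buffer))
                self._buffer = []

    def handle_data(self, data):
        if self._depth:
            self._buffer.append(data)


def code_documents(url, page_source):
    """Return one dictionary per <code> element found in page_source."""
    parser = CodeTextParser()
    parser.feed(page_source)
    parser.close()
    return [{"url": url, "code": text} for text in parser.snippets]


# Hypothetical page fragment
sample = '<pre><code>&gt;&gt;&gt; print("Hello")</code></pre><p>Use <code>pip</code>.</p>'
docs = code_documents("http://python.org", sample)
print(docs)
```

Keeping the source URL in each document makes it possible to tell the 20 pages apart once they are stored.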


## 3. Advanced scraping using automation tools


As a simple exercise, try to scrape the numerical value in the text box of the hidden.html file.

In [13]:
from IPython.display import HTML
HTML('<iframe src=./files/hidden.html width=700 height=300></iframe>')

In [14]:
from urllib.request import urlopen
socket = urlopen("file:./files/hidden.html")
print(socket.read())

b'<!DOCTYPE html>\n<html>\n<head>\n<title>The hidden scraper</title>\n<link rel=\'stylesheet\' type=\'text/css\' href=\'hiddenstylesheet.css\'/>\n        <script type=\'text/javascript\' src="http://ajax.googleapis.com/ajax/libs/jquery/2.0.0/jquery.min.js">\n</script>\n        <script type=\'text/javascript\' src=\'hiddenscript.js\'></script>\n</head>\n<body>\n<div></div>\n</body>\n</html>\n'


... and the value?

Problems and limitations of LXML and basic scraping techniques:

+ Content loaded into the DOM after the response. We only acquire what has arrived when the response is closed; any data added later is not loaded.
+ Really broken HTML/XML.
+ Proprietary or login-protected content, which can be difficult to reach depending on the login flow of the page.
+ JS form interaction.


We see the data in our web browser, but it is not directly found in the HTML. However, "the data is out there": it has been dynamically generated by a function call. Thus, there are two versions of the web page. The first contains static data and function calls; the second contains the static data plus the result of interpreting those function calls. The question now is how to access this post-interpretation data. There are many different ways. One is to run our own interpreter, such as node.js. Another is to take advantage of the browser's own interpretation capabilities and use it as the interpreter.
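
The "first version" of the page can be inspected directly. The sketch below takes the static source of hidden.html as printed earlier and lists what it actually holds, using the standard library only; the class name `StaticInventory` is made up. It shows that the body carries no text at all, while the head points to the scripts that will generate it.

```python
from html.parser import HTMLParser

# Static source of hidden.html, as returned by urlopen above
static_source = """<!DOCTYPE html>
<html>
<head>
<title>The hidden scraper</title>
<link rel='stylesheet' type='text/css' href='hiddenstylesheet.css'/>
        <script type='text/javascript' src="http://ajax.googleapis.com/ajax/libs/jquery/2.0.0/jquery.min.js">
</script>
        <script type='text/javascript' src='hiddenscript.js'></script>
</head>
<body>
<div></div>
</body>
</html>
"""


class StaticInventory(HTMLParser):
    """Record which scripts a page loads and what text its body holds."""

    def __init__(self):
        super().__init__()
        self.scripts = []
        self.body_text = []
        self._in_body = False

    def handle_starttag(self, tag, attrs):
        if tag == "body":
            self._in_body = True
        elif tag == "script":
            self.scripts.append(dict(attrs).get("src"))

    def handle_endtag(self, tag):
        if tag == "body":
            self._in_body = False

    def handle_data(self, data):
        # Keep only non-blank text that appears inside the body
        if self._in_body and data.strip():
            self.body_text.append(data.strip())


inventory = StaticInventory()
inventory.feed(static_source)
inventory.close()
print(inventory.scripts)
print(inventory.body_text)
```

An empty `body_text` together with a non-empty `scripts` list is a quick hint that the page needs a JavaScript interpreter before its data becomes visible.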

Automation tools such as mechanize or Selenium are suites designed for testing web interfaces automatically from scripts. Selenium in particular lets us start a real browser and interact with the web page in the same way a human user would, so the JavaScript on the page is executed. We can use these tools for our scraping purposes.


### 3.1 Starting with Selenium

+ Requirements: `pip install selenium`

+ If you want to use Chrome you also need the Chrome webdriver interface, `chromedriver`. Download it from https://sites.google.com/a/chromium.org/chromedriver/downloads

+ When you create the webdriver, pass the path to `chromedriver`.

The code of the following demo shows these steps in practice.



## The Cepstral demo and our new goal.
<small>An updated version of the case study of Asheesh Laroia (PaulProtheus at Github)</small>

Our new goal is to deal with dynamically generated data, as in the following case. Cepstral is a text-to-speech provider. Let us check its web page.

In [1]:
from IPython.display import HTML
HTML('<iframe src="http://cepstral.com" width=700 height=350></iframe>')



Our goal is to retrieve the audio file that has been played using web scraping techniques. Let us check how we can do it.

In [6]:
# CEPSTRAL DEMO
%reset -f

# Download the Chrome driver: https://sites.google.com/a/chromium.org/chromedriver/downloads

from selenium import webdriver
import os
import time

url = 'http://www.cepstral.com/en/demos'  # address of the web page
# Open a Chrome browser. Passing the driver path positionally is the Selenium 3 style;
# in Selenium 4 the path is given through a Service object instead.
browser = webdriver.Chrome("chromedriver.exe")

In [7]:
browser.get(url)

In [8]:
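from selenium.webdriver.common.by import By

# find_element(By..., ...) works in Selenium 3 and 4;
# the find_element_by_* helpers were removed in Selenium 4.3
element = browser.find_element(By.CSS_SELECTOR, "#demo_text")
element.clear()
s = 'My name is Eloi and I am so cool!!!'
element.send_keys(s)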

In [10]:
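from selenium.webdriver.common.by import By

browser.implicitly_wait(5)  # wait up to 5 s whenever an element is looked up
browser.find_element(By.ID, 'demo_submit').click()
browser.find_element(By.CSS_SELECTOR, 'audio')  # returns once the audio element exists
html = browser.page_source
# Quit the browser so that the page cannot delete the file
browser.quit()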


In [11]:
# Check that the data is in: take the first quoted chunk that mentions an .mp3 file
chunk = next((c for c in html.split('"') if '.mp3' in c), None)
print(chunk)

/demos/audio/q8qp2phjhb92vhl644jo1t8627.1603105563956.mp3


In [12]:
from urllib.parse import urljoin
furl=urljoin(url,chunk)
print (furl)

http://www.cepstral.com/demos/audio/q8qp2phjhb92vhl644jo1t8627.1603105563956.mp3
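
The two steps above, locating the quoted `.mp3` path and resolving it against the page address, can be folded into one helper. This is a sketch under assumptions: the function name `find_audio_url` and the sample markup are hypothetical, and the regular expression only looks at double-quoted values, as the `split('"')` approach does.

```python
import re
from urllib.parse import urljoin


def find_audio_url(page_url, page_source):
    """Return the absolute URL of the first quoted .mp3 path, or None."""
    match = re.search(r'"([^"]+\.mp3)"', page_source)
    if match is None:
        return None
    # urljoin resolves a path such as /demos/audio/x.mp3 against the page URL
    return urljoin(page_url, match.group(1))


# Hypothetical fragment of a page source
sample_source = '<audio controls src="/demos/audio/example.mp3"></audio>'
audio_url = find_audio_url("http://www.cepstral.com/en/demos", sample_source)
print(audio_url)
```

Returning `None` when nothing matches makes the failure explicit instead of silently passing on an unrelated chunk of the page.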


In [13]:
import os

# Command-line audio player. Replace "mpv " with "mplayer " on Linux.
player = "mpv "

# mpv was installed on Mac OS X using Homebrew
# (mplayer was not installed because of dependency troubles):
#   brew tap mpv-player/mpv
#   brew install --HEAD mpv-player/mpv/libass-ct
#   brew install mpv

os.system(player + furl)

1

<div class = "alert alert-info" style = "border-radius:10px;border-width:3px;border-color:darkblue;font-family:Verdana,sans-serif;font-size:16px;">**Element manipulation in Selenium:**
<p>
Consider the result of a selection, e.g. 

<span style = "font-family:Courier;">element = browser.find_element(By.CSS_SELECTOR, 'div')</span>

We can do several things with it.
</p>
<ul>
<li>element**.click()** - click on the selected element</li>
<li>Element properties:
<ul>
<li>element**.location**: x, y location of the element</li>
<li>element**.parent**: the WebDriver instance the element was found from (not its parent element in the DOM)</li>
<li>element**.tag_name**: the tag of the element</li>
<li>element**.text**: text of the element and its children</li>
</ul>
</li>
   
</ul>
</div>




<div class = "alert alert-info" style = "border-radius:10px;border-width:3px;border-color:darkblue;font-family:Verdana,sans-serif;font-size:16px;">**Form input with Selenium:**
<ul>
<li> element**.send_keys()** - send keystrokes to the element: text, commands, arrows, etc. </li>
<li> element**.clear()** - clear the content of the element</li>
</ul>


</div>

<div class = "alert alert-info" style = "background-color:lightyellow;border-radius:10px;border-width:3px;border-color:darkorange;font-family:Verdana,sans-serif;font-size:16px;color:brown">**Other web driver utilities:**
<ul>
<li>browser.execute_script('window.close()') - execute any JavaScript on the loaded page</li>
<li>browser.save_screenshot('foo.png') - save a screenshot of the current window</li>
<li>browser.switch_to.alert: handle pop-ups (alerts) automatically</li>
<li>browser.forward() / browser.back(): navigation</li>
</ul>
</div>

<div class = "alert alert-info" style = "border-radius:10px;border-width:3px;border-color:darkblue;font-family:Verdana,sans-serif;font-size:16px;">**Basic manipulation in Selenium:**
<p>
A webdriver instance lets us manipulate the web session, control cookies, retrieve the HTML code or find elements in the source code.
</p>
Given a webdriver instance (e.g.<span style = "font-family:Courier;">
            browser = webdriver.Firefox()</span>), the most relevant methods are:

<ul>
<li>**Open URL:**  .get(url) (e.g.
<span style = "font-family:Courier;"> browser.get(url)</span>)</li>
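<li>**Selection:** .find_element(s)... [element returns the first match, elements the complete list]. The ..._by_... helpers below were removed in Selenium 4.3; the equivalent call is .find_element(By.CSS_SELECTOR, 'foo'), using the By constant shown next to each helper.
<ul>
<li>..._by_link_text('foo') - find the link with text foo (By.LINK_TEXT)</li>
<li>..._by_partial_link_text() - find the link whose text contains the given string (By.PARTIAL_LINK_TEXT)</li>
<li>..._by_css_selector() (By.CSS_SELECTOR)</li>
<li>..._by_tag_name() (By.TAG_NAME)</li>
<li>..._by_xpath() (By.XPATH)</li>
<li>..._by_class_name() (By.CLASS_NAME)</li>
</ul>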
</li>
<li>**Retrieve source:** .page_source</li>
  
</ul>
</div>

<div class = "alert alert-info" style = "border-radius:10px;border-width:3px;border-color:darkblue;font-family:Verdana,sans-serif;font-size:16px;">**Scrolling and moving:**
Moving around the page is tricky, so be prepared to show a little patience.

ActionChains provide a way of stringing together one or more actions and then performing them with perform().
<ul>
<li>move_by_offset(x, y)</li>
<li>move_to_element(elem) - for highlighting, hovering, rollover, etc.</li>
<li>move_to_element_with_offset(elem, x, y)</li>
</ul>
</div>

<div class = "alert alert-info" style = "border-radius:10px;border-width:3px;border-color:darkblue;font-family:Verdana,sans-serif;font-size:16px;"> **Wait**

We can distinguish two types of waiting strategies, namely, implicit and explicit waits.

*Implicit waits* set a timeout that lasts for the full life of the web driver and is applied whenever an element is looked up. *Explicit waits*, on the other hand, tell the driver to poll the DOM until some condition is met, e.g. until a certain element has finished loading on the page.

Example:
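<p style="font-family:Courier;">
from selenium.common.exceptions import TimeoutException<br>
from selenium.webdriver.common.by import By<br>
from selenium.webdriver.support import expected_conditions as EC<br>
from selenium.webdriver.support.ui import WebDriverWait<br>
<br>
try:<br>
&nbsp;&nbsp;&nbsp;&nbsp;movie_info = WebDriverWait(browser, 10).until(EC.element_to_be_clickable((By.ID, 'BotMovie')))<br>
&nbsp;&nbsp;&nbsp;&nbsp;title = movie_info.find_element(By.CLASS_NAME, 'title').text<br>
&nbsp;&nbsp;&nbsp;&nbsp;link = movie_info.find_element(By.CLASS_NAME, 'mdpLink').get_attribute('href')<br>
except TimeoutException:<br>
&nbsp;&nbsp;&nbsp;&nbsp;print('taking too long!!')<br>
 </p>
 
*EC* stands for Expected Condition; expected conditions are the basis of explicit waits (see http://selenium-python.readthedocs.org for more information).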
</div>
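
The idea behind an explicit wait, polling until a condition holds or a timeout expires, can be illustrated without a browser. The sketch below is not Selenium's implementation; `wait_until` and `element_ready` are hypothetical names, and the condition simply becomes true on the third poll.

```python
import time


def wait_until(condition, timeout=10.0, poll_interval=0.5):
    """Call condition() repeatedly until it returns a truthy value.

    Returns that value, or raises TimeoutError once timeout seconds have passed.
    """
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError(f"condition not met within {timeout} seconds")
        time.sleep(poll_interval)


attempts = []


def element_ready():
    """Pretend the element appears on the third poll."""
    attempts.append(1)
    return "ready" if len(attempts) >= 3 else None


value = wait_until(element_ready, timeout=2.0, poll_interval=0.01)
print(value, len(attempts))
```

The caller receives the value as soon as the condition holds, so a generous timeout costs nothing when the page is fast.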


<div class = "alert alert-success" style = "border-radius:10px;border-width:3px;border-color:darkgreen;font-family:Verdana,sans-serif;font-size:16px;">
**Selenium EXERCISE:** <BR>
<ul>
<li> Open a browser</li>
<li>Go to tripadvisor/Restaurants</li>
<li>Find the search text box</li>
<li>Clear it, input the query "Sant Cugat" and send it</li>
<li>Go to "Restaurants" and get all the links and names of the top 10 restaurants in Sant Cugat</li>
<li> Store them in a MongoDB database</li>
</ul>

</div>

In [15]:
from selenium import webdriver
browser = webdriver.Chrome("chromedriver")
browser.get("https://www.tripadvisor.es/Restaurants")


<div class = "alert alert-info" style = "border-radius:10px;border-width:3px;border-color:darkblue;font-family:Verdana,sans-serif;font-size:16px;"> **Wrap-up**
<ul>
<li>We have seen how data is usually stored in a web site and how to access it using different kinds of accessors, namely APIs and direct selectors.</li>
<li>We have seen how to capture different kinds of data (text, audio and pictures).</li>
<li>We are now familiar with JSON data and basic NoSQL databases.</li>
</ul>
</div>