# Data Acquisition: Web Scraping

Hi everyone, <br>

This session deals with web scraping. <br>

It will walk you through the following sections. <br>

1. What is web scraping?
2. XML, HTML and web pages
3. Using XPath
4. Scraping Yahoo Finance

# 1. What is web scraping?

Web scraping is a technique used to extract data from web pages. <br>

Web scraping typically includes two steps:

1. fetching: downloading a web page (in the exact same way your web browser does)
2. extraction: acquiring data out of the fetched page

Typical applications of web scraping then save the scraped data for later use.
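
The steps above could be sketched as follows. This is a minimal illustration rather than a production scraper: the function names `fetch`, `extract` and `save`, the choice of the page title as the extracted value, and the JSON output format are all assumptions made for the example. It relies on the `requests` and `lxml` libraries.

```python
import json

import requests
from lxml import html


def fetch(url):
    """Step 1, fetching: download the page source, as a browser would."""
    response = requests.get(url, timeout=10)
    response.raise_for_status()  # raise an exception on 4xx/5xx responses
    return response.text


def extract(page_source):
    """Step 2, extraction: pull one value (here, the page title) out of the HTML."""
    tree = html.fromstring(page_source)
    titles = tree.xpath('//title/text()')
    return {'title': titles[0].strip() if titles else None}


def save(record, path):
    """Optional last step: store the scraped data, here as a JSON file."""
    with open(path, 'w', encoding='utf-8') as f:
        json.dump(record, f)


# Extraction is demonstrated on an inline page so the example runs offline
sample_page = '<html><head><title> Example page </title></head><body></body></html>'
record = extract(sample_page)
print(record)
```

With a network connection, the three functions would be chained as `save(extract(fetch(url)), 'page.json')`, where `url` and the file name are placeholders.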

**Should I care about web scraping?**

Web scraping is one of the most important sources of data. The growing interest in data science is fuelled by the rapid increase in data produced by human activity. A large proportion of this new data is digital, and much of it is published on web pages, where it can be reached through web scraping. <br>

**Mmmh... Is web scraping completely legal?**

Technology usually evolves faster than legislation, and legislation about technology often varies across continents. <br>

Some applications of web scraping are definitely forbidden (e.g. acquiring all data from Facebook accounts) and some are not. <br>

Legality depends on different factors: <br>
- source of data <br>
- volume of data <br>
- acquisition strategy <br>

# 2. XML, HTML and Web Pages

Extensible Markup Language (XML) is a markup language that encodes a document as a tree of nested, tagged elements. <br>

An .xml file is a document that uses the XML structure to encode data. <br>


**Why should I care about XML?**

While XML was designed to store data in documents, the same tag-based design is also used to structure the content of web pages. <br>

HTML (Hypertext Markup Language), the dominant language of web pages, is not strictly XML but follows a very similar design of nested elements. <br>

As a result, tools built for XML, such as the XPath query language, can be used to navigate web content and access the data you want to scrape. <br>
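
As an illustration of navigating web content this way, the sketch below parses a short HTML page into a tree and queries it. It uses `lxml.html`, which tolerates HTML that is not well-formed XML; the sample markup and the variable names are made up for the example.

```python
from lxml import html

# A small, made-up HTML page. The <li> tags are left unclosed on purpose:
# this is acceptable in HTML but would not be well-formed XML.
page = '''
<html>
  <body>
    <h2>Shopping list</h2>
    <ul>
      <li>apples
      <li>bread
    </ul>
  </body>
</html>
'''

tree = html.fromstring(page)

# Navigate the parsed tree: the heading text, then every <li> under <ul>
heading = tree.xpath('//h2/text()')[0]
items = [li.text.strip() for li in tree.xpath('//ul/li')]
print(heading, items)
```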

**2.1 Can you visualize the following HTML code in an HTML viewer?**

http://htmledit.squarefree.com/

In [None]:
<!DOCTYPE html>
<html>
<body>

<h2>Data science is like teenage sex</h2>

<ol type="a">
  <li>everyone talks about it </li>
  <li>nobody really knows how to do it </li>
  <li>everyone thinks everyone else is doing it </li>
  <li>so everyone claims they are doing it... </li>
</ol>  

</body>
</html>

**2.2 Can you modify the HTML code above and visualize it again?**

Congratulations, you just wrote your first website! <br>

# 3. Using XPath

**What is XPath?**

XPath is a query language used to select nodes from an XML document. <br>

An XML document can be seen as a tree of nodes. In this course, we will consider 3 types of nodes:

- element
- attribute
- text

**Element** <br>
An HTML element consists of a 'start tag' and an 'end tag', with the content inserted in between. <br>

**Attributes** <br>
HTML elements can have attributes that provide additional information about the element. <br>
Attributes are specified in the start tag using a name/value format: name="value"

**Text** <br>
An HTML element can contain a text value. The text value is placed between the 'start tag' and the 'end tag'.
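
The three node types can be inspected from Python. The snippet below is a small sketch using `lxml`; the one-element document is invented for the illustration.

```python
from lxml import etree

# Made-up one-element document, used only for this illustration
node = etree.fromstring('<city country="Belgium">Brussels</city>')

print(node.tag)             # the element name
print(dict(node.attrib))    # all attributes as name/value pairs
print(node.get('country'))  # the value of a single attribute
print(node.text)            # the text value between the start and end tags
```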

**3.1 In the following XML code:**
- can you identify an element?
- can you identify an attribute?
- can you identify a text value?

In [None]:
<bookstore>
  <book>
    <title lang="en">Harry Potter</title>
    <author>J K. Rowling</author>
    <year>2005</year>
    <price>29.99</price>
  </book>
</bookstore>

**Relationship of Nodes**

In an XML tree: <br>
- Each element and attribute has one **parent** <br>
- Element nodes can have 0, 1 or n **children** <br>
- Nodes sharing the same parent are **siblings** <br>
- A node's parent, its parent's parent, and so on up the tree are its **ancestors** <br>
- A node's children, its children's children, and so on down the tree are its **descendants** <br>
- The topmost element in the tree is called the **root element**
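
These relationships can be explored programmatically. The sketch below uses `lxml` on a made-up document (not the one from the exercises); the variable names are illustrative only.

```python
from lxml import etree

# Made-up document, used only for this illustration
catalog = etree.fromstring('''
<catalog>
  <cd>
    <artist>Miles Davis</artist>
    <album>Kind of Blue</album>
  </cd>
</catalog>
''')

album = catalog.xpath('//album')[0]

parent = album.getparent().tag
# Siblings: nodes before and after <album> that share its parent
siblings = [el.tag for el in album.itersiblings(preceding=True)]
siblings += [el.tag for el in album.itersiblings()]
# Ancestors: the parent, then the parent's parent, up to the root
ancestors = [el.tag for el in album.iterancestors()]
# Children of <cd>: iterating over an element yields its direct children
children_of_cd = [el.tag for el in album.getparent()]
# Descendants of the root: every element below it, in document order
descendants = [el.tag for el in catalog.iterdescendants()]
root = album.getroottree().getroot().tag

print(parent, siblings, ancestors, children_of_cd, descendants, root)
```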


**3.2 In the following XML code:**
- can you identify the parent, the siblings and the ancestors of the title element?
- can you identify the children of the book element?
- can you identify the root element?

In [None]:
<bookstore>
  <book>
    <title lang="en">Harry Potter</title>
    <author>J K. Rowling</author>
    <year>2005</year>
    <price>29.99</price>
  </book>
</bookstore>

**XPath Syntax**

XPath uses path expressions to select nodes in an XML document. <br>

XPath expressions are composed of one or several steps. <br>
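
A few commonly used building blocks of path expressions are sketched below with `lxml`. The `library` document is invented for this illustration, and the list is a small selection rather than a complete reference of the XPath syntax.

```python
from lxml import etree

# Made-up document, used only for this illustration
library = etree.fromstring('''
<library>
  <shelf topic="science">
    <book year="1988">A Brief History of Time</book>
    <book year="1859">On the Origin of Species</book>
  </shelf>
  <shelf topic="fiction">
    <book year="1949">Nineteen Eighty-Four</book>
  </shelf>
</library>
''')

# '/' starts at the root and separates steps, one step per level of the tree
shelves = library.xpath('/library/shelf')
# '//' selects matching nodes wherever they are in the document
books = library.xpath('//book')
# [n] keeps the n-th match among siblings (positions start at 1, not 0)
first_title = library.xpath('/library/shelf[1]/book[1]/text()')
# '@name' selects an attribute
topics = library.xpath('/library/shelf/@topic')
# 'text()' selects the text value of an element
titles = library.xpath('//book/text()')
# [@name="value"] keeps only elements whose attribute has that value
fiction = library.xpath('//shelf[@topic="fiction"]/book/text()')

print(len(shelves), len(books), first_title, topics, fiction)
```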

**Run the following lines to parse an XML string into a tree of elements**

In [1]:
html_string = '''
<wikimedia>
  <projects>
    <project name="Wikipedia" launch="2001-01-05">
      <editions>
        <edition language="English">en.wikipedia.org</edition>
        <edition language="German">de.wikipedia.org</edition>
        <edition language="French">fr.wikipedia.org</edition>
        <edition language="Polish">pl.wikipedia.org</edition>
        <edition language="Spanish">es.wikipedia.org</edition>
      </editions>
    </project>
    <project name="Wiktionary" launch="2002-12-12">
      <editions>
        <edition language="English">en.wiktionary.org</edition>
        <edition language="French">fr.wiktionary.org</edition>
        <edition language="Vietnamese">vi.wiktionary.org</edition>
        <edition language="Turkish">tr.wiktionary.org</edition>
        <edition language="Spanish">es.wiktionary.org</edition>
      </editions>
    </project>
  </projects>
</wikimedia>
'''

from lxml import etree

# Parse the string; fromstring returns the root element (<wikimedia>)
tree = etree.fromstring(html_string)

Let's try some of the most useful path expressions.

**3.3 Can you select the wikimedia element from the root node?**

In [4]:
tree.xpath('...')

[<Element wikimedia at 0xafdf0f8c>]

**3.4 Can you select the project elements starting from the root node?**

In [5]:
tree.xpath('...')

[<Element project at 0xafdf0fac>, <Element project at 0xafe027ec>]

**3.5 Can you select the first project element starting from the root node?**

In [6]:
tree.xpath('...')

[<Element project at 0xafdf0fac>]

**3.6 Can you select the name attribute of the first project element starting from the root node?**

In [7]:
tree.xpath('...')

['Wikipedia']

**3.7 Can you select the edition elements, no matter where they are in the tree?**

In [None]:
tree.xpath('...')

**3.8 Can you select the text values of the edition elements, no matter where they are in the tree?**

In [None]:
tree.xpath('...')

**3.9 Can you select the text value of the edition element whose language attribute is "Polish"?**

In [10]:
tree.xpath('...')

['pl.wikipedia.org']

# 4. Scraping Yahoo Finance

**4.1 Go to Yahoo Finance and open the summary page for Apple.**

https://finance.yahoo.com/quote/AAPL?p=AAPL

**4.2 Can you inspect the page source of this web page? Comment.**

This is the source code of the web page we just visited. <br>

The source code is written in HTML. <br>

When you request a page from the Yahoo Finance web server, this is what your browser receives. <br>

It is then the responsibility of your web browser to turn it into human-readable content. <br>

**4.3 In your web browser, can you use XPath to extract Apple's official name?** <br>

$x('...')[0]

**4.4 In your web browser, can you use XPath to extract the market cap?** <br>

$x('...')[0]

**4.5 Can you fetch the summary page of Yahoo Finance for Apple?**

In [1]:
import requests
r = requests.get('...')

requests.get returns a Response object. <br>

We can then get all the information we need from this Response object.
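
A few attributes of the Response object are consulted most often. The helper below is only an illustration: its name `summarize_response` is made up, and it works on any object exposing the `status_code`, `ok`, `headers` and `text` attributes that `requests.Response` provides. A stand-in object is used so that the example runs without network access.

```python
from types import SimpleNamespace


def summarize_response(response):
    """Collect commonly inspected fields of a requests.Response-like object."""
    return {
        'status_code': response.status_code,  # e.g. 200 for success, 404 for not found
        'ok': response.ok,                    # True when the status code is below 400
        'content_type': response.headers.get('Content-Type'),  # media type sent by the server
        'length': len(response.text),         # number of characters in the decoded body
    }


# Stand-in for a real response, built by hand for this offline example
fake_response = SimpleNamespace(
    status_code=200,
    ok=True,
    headers={'Content-Type': 'text/html'},
    text='<html></html>',
)
summary = summarize_response(fake_response)
print(summary)
```

With a live connection, the same helper would be called as `summarize_response(r)` on the object returned by `requests.get`.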

**4.6 Can you print the response status code from the Yahoo Finance web server? Comment.**

**4.7 Can you print the response content from the Yahoo Finance web server? Comment.**

**4.8 Convert r.text into an actual HTML document named tree. Comment.**

**4.9 Print the tree document.**

**4.10 Can you extract Apple's market cap from the fetched web page?** <br>

**4.11 Can you use XPath to extract the following values for Apple?** <br>

'Company name', 'opening price', 'Previous Close', 'Market Cap', 'Beta', 'PE Ratio'

Store them in a dictionary named 'apple_data' and print the dictionary.

**4.12 Can you create a web scraper that:**

- fetches the Yahoo Finance web pages for 'Apple', 'Microsoft', 'Facebook', 'Twitter', 'Yahoo' and 'Tesla'
- extracts 'Company name', 'opening price', 'Previous Close', 'Market Cap', 'Beta', 'PE Ratio' for each company
- saves the data into a Mongo collection named 'yahoo_finance'

In [12]:
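from lxml import html                    # HTML parsing and XPath queries
import requests                          # fetching web pages
import time                              # pausing between requests
from pymongo import MongoClient, errors  # MongoDB client and its exception classes

client = MongoClient()  # connects to a MongoDB server on localhost, default port
db = client['Solvay']   # database that will hold the 'yahoo_finance' collection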

**4.13 Can you check how many documents are in the yahoo_finance collection?**

**4.14 Can you retrieve the document for Twitter?**