# There are mainly two ways to extract data from a website:

* Use the API of the website (if it exists). Wikipedia and Facebook, for example, both provide one: the Facebook Graph API allows retrieval of data posted on Facebook.
* Access the HTML of the webpage and extract useful information/data from it. This technique is called web scraping or web harvesting or web data extraction.

# API
API stands for Application Programming Interface. It is a set of rules, usually exposed by a server, that lets programs send and retrieve data using code. Most commonly it is used to retrieve data from a source and display it in a software application for its users.

APIs work in much the same way as a browser visiting a webpage: a request for information is sent to a server and the server responds. The main difference is the type of data that the server responds with. A browser usually receives HTML, while an API usually returns structured data, most often JSON.

JSON stands for JavaScript Object Notation, a text format for structured data that is widely used by APIs and supported in most programming languages.
(we will work with JSON later in this course)
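
To make the request/response idea concrete, here is a minimal sketch of what a JSON response looks like once it reaches Python. The payload below is made up for illustration (it is not real output from any API); the standard-library `json` module turns the JSON text into ordinary dictionaries and lists.

```python
import json

# Hypothetical JSON text, standing in for the body of an API response.
response_text = '{"query": "Coronavirus", "results": ["Coronavirus", "COVID-19"], "count": 2}'

# json.loads converts JSON text into Python objects (dict, list, str, int, ...).
payload = json.loads(response_text)

print(payload["query"])       # a string
print(payload["results"][0])  # first element of a list
print(payload["count"] + 1)   # numbers arrive as int/float, not as text
```

Once the data is a dictionary, no HTML parsing is needed, which is why an API is usually the more convenient option when one exists.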

#### Example of Wikipedia API

In [1]:
import wikipedia
print(wikipedia.search("Coronavirus"))

['Coronavirus', 'COVID-19 pandemic', 'COVID-19', 'COVID-19 pandemic in the United States', 'COVID-19 pandemic by country and territory', 'SARS-CoV-2', 'COVID-19 pandemic on cruise ships', 'Coronavirus (film)', 'COVID-19 pandemic in Algeria', 'Novel coronavirus']


In [2]:
# article entire content
print(wikipedia.page("Coronavirus").content)

Coronaviruses are a group of related RNA viruses that cause diseases in mammals and birds. In humans and birds, they cause respiratory tract infections that can range from mild to lethal. Mild illnesses in humans include some cases of the common cold (which is also caused by other viruses, predominantly rhinoviruses), while more lethal varieties can cause SARS, MERS and COVID-19, which is causing an ongoing pandemic. In cows and pigs they cause diarrhea, while in mice they cause hepatitis and encephalomyelitis.
Coronaviruses constitute the subfamily Orthocoronavirinae, in the family Coronaviridae, order Nidovirales and realm Riboviria. They are enveloped viruses with a positive-sense single-stranded RNA genome and a nucleocapsid of helical symmetry. The genome size of coronaviruses ranges from approximately 26 to 32 kilobases, one of the largest among RNA viruses. They have characteristic club-shaped spikes that project from their surface, which in electron micrographs create an image 

In [3]:
# article summary
print(wikipedia.summary("Coronavirus"))

Coronaviruses are a group of related RNA viruses that cause diseases in mammals and birds. In humans and birds, they cause respiratory tract infections that can range from mild to lethal. Mild illnesses in humans include some cases of the common cold (which is also caused by other viruses, predominantly rhinoviruses), while more lethal varieties can cause SARS, MERS and COVID-19, which is causing an ongoing pandemic. In cows and pigs they cause diarrhea, while in mice they cause hepatitis and encephalomyelitis.
Coronaviruses constitute the subfamily Orthocoronavirinae, in the family Coronaviridae, order Nidovirales and realm Riboviria. They are enveloped viruses with a positive-sense single-stranded RNA genome and a nucleocapsid of helical symmetry. The genome size of coronaviruses ranges from approximately 26 to 32 kilobases, one of the largest among RNA viruses. They have characteristic club-shaped spikes that project from their surface, which in electron micrographs create an image 

#### But not all sites provide an API,
so we actually have to extract the information from the site by reading its HTML code! (kind of a brute-force method but ... it works)
# Web scraping / Web crawling

#### What is Web Scraping?

Web scraping refers to the extraction of data from a website or webpage. Usually, this data is saved in a new format. For example, data from a website can be extracted into a pandas DataFrame or an Excel spreadsheet.

#### What is Web Crawling?

Web Crawling refers to the process of using bots (or spiders) to read and store all of the content on a website for archiving or indexing purposes.

#### Why web scraping? (it's not always legal - that's why no API)

* Web scraping is a great tool for automating much of what a human does while browsing. Enterprises use it in a variety of ways.


* Scraping hotel data from travel portals. Hotel websites contain large amounts of data on prices, reviews, ratings, the people who rated the hotel, etc. With a few lines of code that read the HTML, this data can be collected quickly.

* Scraping data from social media. A study by Oberlo found that around 3.78 billion people used social media in 2021. This amounts to approximately 48% of the world's population. This number will likely reach 4.41 billion by 2025. Hence social media is an attractive source for web scraping, as data on a large share of people is available there. The code differs from platform to platform, but once written it can collect huge amounts of data.

* Scraping data from Yahoo Finance with Python. Python is one of the most widely used programming languages, so it is a natural choice for web scraping. The stock market is one of the biggest repositories of data, used by almost everyone interested in studying the market and deciding which stocks to buy and when. Access to current and past trends gives scrapers a picture of how the market performs.
    
* Scraping job recruiting data. Like social media, the job recruitment industry has gained a lot of attention with the growth of the internet, and it has become a major target for scraping. Experts scrape job listings from private and company websites and use the data to their advantage. A common web scraping solution for job recruitment data is JobsPikr, an automated tool that retrieves listings based on filters you set.
    
* Scraping for insights about various industries. Experts use web scraping to build massive databases and then draw insights related to a specific industry, which they sell to people interested in that industry. For example, a company might scrape information about the oil industry, such as oil prices and stock prices, and sell this data to oil companies for a profit.

* Scraping data for comparison websites. Numerous websites let you research and compare prices for the same product across different retailers. To do this, they scrape competitors' product prices, specifications, and performance data daily. This helps them give an accurate comparison to users who search for a product.

* Social media sentiment analysis. With the recent boom of social media and the internet, sentiment analysis based on web scraping has risen. For example, if you have ever tweeted about an episode of Chernobyl (a series from HBO about the nuclear disaster at Chernobyl), there is a chance it might have been scraped and analyzed by HBO. This helps them understand how the show is received on social media. That is an example of a company using web scraping, but individuals can do it too: politicians, for instance, can scrape social media to get an idea of how they are perceived.

* Scraping is used to get information about website traffic: its size, sources, etc. Traffic size, sources, and how long a visitor stays on the website can be analyzed by web scraping. In this process, entire domains are analyzed and the scraper estimates the traffic. You can also understand the traffic source by scraping the top five countries with the most visits and the top five domains used by them.

* Web scraping for content creation. To ensure they retain their visibility, companies need to post huge amounts of high-quality and relevant content. By having regular content, they can create a trustworthy and visible brand. The marketing and advertising experts scrape the internet. They get plenty of new ideas for content, and they need not brainstorm for ideas. This helps them create new content and attract more visitors, thereby increasing their sales and revenue.

#### HTML

HyperText Markup Language (HTML) is the foundation of the web. This markup language uses tags to tell the browser how to display the content when we access a URL.

If we go to our homepage and right-click and choose Inspect, or press Ctrl/Command + Shift + C to open the inspector tool, we'll be able to see the HTML source code of the page.

Here are a few of the most common tags:

* div – specifies an area or section on a page. Divs are mostly used to organize the page's content.
* h1 to h6 – define headings.
* p – tells the browser the content is a paragraph.
* a – tells the browser the text or element is a link to another page. This tag is used together with an href attribute that contains the target URL of the link.


#### Here are the main elements of an HTML page
https://developer.mozilla.org/en-US/docs/Web/HTML/Element
* but you don't have to be a web developer to scrape pages

# Tags and attributes:

* `<body>` is a tag
* `<a...>` is a tag
* `<p...>` is a tag
* `<p ... id="title">` `id` is an attribute of a tag `<p>`
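
As a rough illustration of the difference between a tag and its attributes, the sketch below walks through a small, made-up HTML snippet with `HTMLParser` from Python's standard library and prints every opening tag together with its attributes. The `TagLister` class and the snippet are hypothetical names used only here; Beautiful Soup, introduced further down, offers a much more convenient interface for the same information.

```python
from html.parser import HTMLParser


class TagLister(HTMLParser):
    """Collects (tag, attributes) pairs for every opening tag it sees."""

    def __init__(self):
        super().__init__()
        self.seen = []

    def handle_starttag(self, tag, attrs):
        # attrs arrives as a list of (name, value) pairs, e.g. [("id", "title")]
        self.seen.append((tag, dict(attrs)))


# Hypothetical snippet used only for this illustration.
snippet = '<body><p id="title">Hello</p><a href="http://example.com">link</a></body>'

lister = TagLister()
lister.feed(snippet)
for tag, attributes in lister.seen:
    print(tag, attributes)
```

Here `body` has no attributes, `p` carries an `id`, and `a` carries an `href`.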

### Example

* this jupyter notebook
* file:///home/michal/index.html
* any website
* https://www.amazon.com/Best-Sellers-Books/zgbs/books/

# Python tools for web scraping
* Scrapy 
* Selenium 
* **Beautiful Soup**
* any of them + RegEx

* All three web scraping libraries are open-source and completely free to use, so money is not a deciding factor. Each of them has a community of developers supporting its development. So which of them should you use?

* This depends on the project requirements. If a project is complex, Scrapy is the tool for the job. This is because it is a framework designed for handling complex web scraping tasks. It even allows you to extend its functionality.

* For smaller projects, BeautifulSoup is the library of choice. You just have to install the requests module and your preferred HTML parser (html.parser is included in Python's standard library, so it needs no installation). Selenium comes in handy when you are handling JavaScript-heavy websites.

Web Scraping (Beautiful Soup)
* famous New York Times art installation in Manhattan: https://vimeo.com/1850342 (scraping news from the NYT magazine)
* Alexander Harrowell uses Beautiful Soup to track the business activities of an arms merchant. (https://yorksranter.wordpress.com/2008/07/06/the-viktorfeed-documentation/)
* Jiabao Lin's DXY-COVID-19-Crawler uses Beautiful Soup to scrape a Chinese medical site for information about COVID-19, making it easier for researchers to track the spread of the virus. (Source: "How open source software is fighting COVID-19") (https://blog.tidelift.com/how-open-source-software-is-fighting-covid-19)
* The Lawrence Journal-World uses Beautiful Soup to gather statewide election results. (https://www2.ljworld.com/)

#### Beautiful soup

* Beautiful Soup cannot make a request to a server by itself, so a separate library is **needed** to download the page. The most popular choices are `requests` and the standard library's `urllib` (`urllib2` in Python 2). These libraries make the request to the server for us.
* After downloading the HTML or XML data to our local machine, Beautiful Soup requires a parser to process it. The most common parsers are: lxml's XML parser, lxml's HTML parser, html5lib, and html.parser.
    
Beautiful Soup provides simple methods for navigating, searching, and modifying a parse tree in HTML, XML files.
It transforms a complex HTML document into a tree of Python objects. It also automatically converts the document to Unicode, so you don't have to think about encodings. This tool helps you not only to scrape but also to clean the data. Beautiful Soup supports the HTML parser included in Python's standard library, but it also supports several third-party Python parsers like lxml or html5lib.
* pip3 install bs4 # (in case of linux, install packages as a user not a root)

# With Beautiful Soup we can access:
* the entire tag using `select`,`find_all`
* tags within tags using `select`,`find_all`
* attribute of a tag using `attrs.get('...')`

#### some useful links:
https://www.crummy.com/software/BeautifulSoup/bs4/doc/
* again, like RegEx, it's a tool with many options...
* install with `pip3 install bs4` (as a user)

In [5]:
from bs4 import BeautifulSoup
import re

Before extracting content from the document, you have to parse the HTML. To do so, pass the data and the parser name (here `html.parser`) as arguments to the `BeautifulSoup()` constructor.

In [37]:
data = """
<html>
<head>
<title>Data Science Learner</title>
</head>

<body>
<p class="title"> id="title" <b>Data Science Learner Links</b></p>
<p class="links">Links
<a href="http://example.com/dsl1" class="element" id="link1">1</a>
<a href="http://example.com/dsl2" class="element" id="link2">2</a>
<a href="http://example.com/dsl3" class="avatar" id="link3">3</a>
<p> line ends</p>
</body>
</html>

"""

In [38]:
soup = BeautifulSoup(data, "html.parser")

#### Parsing basically means resolving (a sentence) into its component parts and describing their syntactic roles.
* According to Wikipedia, parsing or syntactic analysis is the process of analysing a string of symbols, either in natural language or in computer languages, according to the rules of a formal grammar. The term parsing comes from Latin pars (orationis), meaning part (of speech).
* Browsers have a built-in way of reading (referred to as parsing) HTML tags and rendering the result. HTML files are just text files that you can open and read with a text editor.
* Tags are enclosed in angle brackets `<>`, like `<body>` or `<h1>`.

In [7]:
# prettify - to make the html code easier to read:
# print(soup.prettify())

Extract content from the HTML document using the Beautiful Soup `select()` method. `select()` takes a CSS selector, such as a tag name, a class name, or an id, and returns the elements that match it.

In [8]:
soup.select("head") # returns a list

[<head>
 <title>Data Science Learner</title>
 </head>]

Suppose I want to get the title inside this head tag. Then I will use the code below.

In [9]:
soup.select("head title")
#soup.select("head > title")

[<title>Data Science Learner</title>]

`select` returns a list of all matching elements, so we can access each of them by indexing

In [10]:
# we can get the text enclosed by the title tag using `get_text()`
soup.select("head title")[0].get_text()

'Data Science Learner'

In [11]:
# we can get the text enclosed by the title tag using a regex
regex=r'<title>(.*)</title>'
re.findall(regex,str(soup.select("head title")[0]))[0]

'Data Science Learner'
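
One caveat with the regex approach: the pattern above works because there is only one `<title>` element in the string. The sketch below, on a made-up one-line snippet, shows what can happen when the same tag appears twice: the greedy `.*` swallows everything up to the last closing tag, while the lazy `.*?` stops at the first one.

```python
import re

# Hypothetical one-line snippet with two <b> elements.
html = "<b>one</b> and <b>two</b>"

greedy = re.findall(r"<b>(.*)</b>", html)   # .* matches as much as possible
lazy = re.findall(r"<b>(.*?)</b>", html)    # .*? matches as little as possible

print(greedy)
print(lazy)
```

This is one reason `get_text()` is usually the safer choice for pulling text out of a tag; regex is better kept for patterns inside the extracted text.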

In [12]:
# find all <p> tags 
soup.select("p")

[<p class="title"> id="title" <b>Data Science Learner Links</b></p>,
 <p class="links">Links
 <a class="element" href="http://example.com/dsl1" id="link1">1</a>
 <a class="element" href="http://example.com/dsl2" id="link2">2</a>
 <a class="avatar" href="http://example.com/dsl3" id="link3">3</a>
 <p> line ends</p>
 </p>,
 <p> line ends</p>]

In [13]:
print(len(soup.select("p")))

3


In [14]:
# find all <a> tags inside <p> tags
soup.select("p a")

[<a class="element" href="http://example.com/dsl1" id="link1">1</a>,
 <a class="element" href="http://example.com/dsl2" id="link2">2</a>,
 <a class="avatar" href="http://example.com/dsl3" id="link3">3</a>]

In [15]:
# find all <a> tags inside <p> tags and the <a> tag must be of class="element"
soup.select("p a.element")

[<a class="element" href="http://example.com/dsl1" id="link1">1</a>,
 <a class="element" href="http://example.com/dsl2" id="link2">2</a>]

In [16]:
# find all <a> tags inside <p> tags where the <a> tag has the attribute id="link1"
soup.select("p a[id=link1]")

[<a class="element" href="http://example.com/dsl1" id="link1">1</a>]

In [17]:
# find all <p> tags that have a particular class: `class="title"`
soup.select("p.title")

[<p class="title"> id="title" <b>Data Science Learner Links</b></p>]

In [18]:
# find all <p> tags that have a particular class: `class="links"`
soup.select("p.links")

[<p class="links">Links
 <a class="element" href="http://example.com/dsl1" id="link1">1</a>
 <a class="element" href="http://example.com/dsl2" id="link2">2</a>
 <a class="avatar" href="http://example.com/dsl3" id="link3">3</a>
 <p> line ends</p>
 </p>]

In [19]:
soup.select("p.links a")

[<a class="element" href="http://example.com/dsl1" id="link1">1</a>,
 <a class="element" href="http://example.com/dsl2" id="link2">2</a>,
 <a class="avatar" href="http://example.com/dsl3" id="link3">3</a>]

* `select` is one option to extract (select) all the tags that we want
* `find_all` is another option to select tags 
  * find by TAG name
  * find by attribute
  * find by class 


In [20]:
#find all by attribute: id="link1":
soup.find_all(id="link1")

[<a class="element" href="http://example.com/dsl1" id="link1">1</a>]

In [21]:
#find all tags by class="element":
soup.find_all(class_="element")
#Notice how we have to use class_ rather than class as it is a reserved word in Python.

[<a class="element" href="http://example.com/dsl1" id="link1">1</a>,
 <a class="element" href="http://example.com/dsl2" id="link2">2</a>]

In [22]:
#find all tags with href that matches a regex!
# use re.compile before using it
import re
regex=re.compile(r'http.*2')
soup.find_all(href=regex)

[<a class="element" href="http://example.com/dsl2" id="link2">2</a>]

In [23]:
#find all tags <a> with href that matches a regex!
# use re.compile before using it
import re
regex=re.compile(r'http.*2')
soup.find_all('a', {'href': regex})

[<a class="element" href="http://example.com/dsl2" id="link2">2</a>]

### let's look at tag `<p>` with class "links"

In [24]:
soup.select("p.links")

[<p class="links">Links
 <a class="element" href="http://example.com/dsl1" id="link1">1</a>
 <a class="element" href="http://example.com/dsl2" id="link2">2</a>
 <a class="avatar" href="http://example.com/dsl3" id="link3">3</a>
 <p> line ends</p>
 </p>]

### let's say from there we want only links from tag `<a>` with class "element"

In [25]:
soup.select("p.links a.element")

[<a class="element" href="http://example.com/dsl1" id="link1">1</a>,
 <a class="element" href="http://example.com/dsl2" id="link2">2</a>]

#### so we can get the links using `attrs.get`

In [26]:
# we can use list comprehension
links = [a.attrs.get('href') for a in soup.select('a.element')]
print(links)

['http://example.com/dsl1', 'http://example.com/dsl2']


* `select` vs `find_all`
  * both can do the same job: `select` takes a CSS selector, `find_all` takes a tag name and attribute filters.
  * I prefer `select`
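
A minimal sketch of the comparison, using a small hypothetical document similar in shape to the `data` string above. The variable names (`html`, `soup_demo`, `via_select`, `via_find_all`) are chosen just for this illustration.

```python
from bs4 import BeautifulSoup

# Small hypothetical document, similar in shape to the `data` string above.
html = """
<p class="links">
<a href="http://example.com/a" class="element" id="link1">1</a>
<a href="http://example.com/b" class="element" id="link2">2</a>
<a href="http://example.com/c" class="avatar" id="link3">3</a>
</p>
"""
soup_demo = BeautifulSoup(html, "html.parser")

# CSS selector: <a> tags with class "element"
via_select = soup_demo.select("a.element")

# Keyword filters: tag name plus class_ (class is a reserved word in Python)
via_find_all = soup_demo.find_all("a", class_="element")

# Both calls pick out the same two tags.
print([a.get("id") for a in via_select])
print([a.get("id") for a in via_find_all])
```

Which one to use is largely a matter of taste: CSS selectors are compact when nesting matters (`p.links a.element`), while `find_all` accepts compiled regular expressions as filter values, as shown earlier.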

### This looks nice and easy for a short string like the `data` string above, but things get messy once we import an entire page. For that we will use `requests`

### Every well-built website has a structure, but every structure can be different. So we have to open the website, look at it, and understand its content!

* Request the content (source code) of a specific URL from the server
* Download the content that is returned
* Identify the elements of the page that are part of the table we want
* Extract and (if necessary) reformat those elements into a dataset we can analyze or use in whatever way we require.
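
The four steps above can be sketched as follows. To keep the sketch runnable without network access, the request and download steps are replaced by a hard-coded string; the table content, the class names `name` and `rate`, and the variable names are all made up for illustration.

```python
from bs4 import BeautifulSoup

# Steps 1-2 (request + download) are replaced here by a hard-coded string,
# so the sketch runs without network access. The table content is made up.
downloaded_html = """
<table>
  <tr><th>Currency</th><th>Rate</th></tr>
  <tr><td class="name">Euro</td><td class="rate">0.98</td></tr>
  <tr><td class="name">British Pound</td><td class="rate">0.83</td></tr>
</table>
"""

# Step 3: identify the elements that belong to the table we want.
page_soup = BeautifulSoup(downloaded_html, "html.parser")
rows = page_soup.select("table tr")

# Step 4: extract and reformat into a list of dictionaries.
records = []
for row in rows:
    name = row.select_one("td.name")
    rate = row.select_one("td.rate")
    if name is None or rate is None:
        continue  # the header row has <th> cells only
    records.append({"currency": name.get_text(strip=True),
                    "rate": float(rate.get_text(strip=True))})

print(records)
```

A list of dictionaries like `records` can be passed directly to `pandas.DataFrame(...)` if a table-like object is needed for analysis.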

# Example 1

### Currency exchange rates extraction using RegEx
https://www.x-rates.com/table/?from=USD&amount=1

In [27]:
import pandas as pd
import requests
from bs4 import BeautifulSoup

In [28]:
# Requests is a widely used third-party library for making HTTP requests to a specified URL
page  = requests.get("https://www.x-rates.com/table/?from=USD&amount=1")
print(page.content)

b'<!DOCTYPE html>\n<html>\n    <head>\n        <meta http-equiv="Content-Type" content="text/html; charset=utf-8"/>\n<meta name="format-detection" content="telephone=no"/>\n<meta name="description" content="This currency rates table lets you compare an amount in US Dollar to all other currencies."/>\n<meta name="keywords" content="USD EUR, currency exchange table, exchange rate table, convert, euro, american dollar, british pound, canadian dollar, australian dollar, x-rates"/>\n<link rel="canonical" href="https://www.x-rates.com/table/?from=USD&amount=1"/>\n\t\t\t<script type=\'text/javascript\'>\n\t\t\t\tvar e9AdSlots  = { \n\t\t\t\t  output_lb : {site:\'ExchangeRates\', adSpace:\'Homepage\', size:\'728x90,468x60\', noAd: \'1\'},\noutput_rs : {site:\'ExchangeRates\', adSpace:\'Homepage\', size:\'300x250,300x600,160x600\', noAd: \'1\'},\nros_ls : {site:\'XEInternal\', adSpace:\'HRROS\', size:\'300x250\', rsize: \'238x230\', noAd: \'1\', async: false},\nros_ms : {site:\'XEInternal\', ad

In [29]:
print(page.status_code)
# A status_code of 200 means that the page downloaded successfully.

200
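
Other status codes signal different outcomes. As a small sketch, the standard-library `http.HTTPStatus` enum can translate a numeric code into its standard phrase; `describe_status` is a hypothetical helper written for this illustration, not part of `requests`.

```python
from http import HTTPStatus


def describe_status(code):
    """Return a short description for an HTTP status code (hypothetical helper)."""
    status = HTTPStatus(code)   # raises ValueError for unknown codes
    ok = 200 <= code < 300      # 2xx codes signal success
    return f"{code} {status.phrase} ({'success' if ok else 'not a success'})"


for code in (200, 301, 404, 500):
    print(describe_status(code))
```

In a scraper it is usually worth checking the status code before parsing, since an error page is still HTML and would otherwise be parsed without complaint.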


In [30]:
soup = BeautifulSoup(page.content, 'html.parser')

In [31]:
#print(soup.prettify())

In [32]:
stock_picks = soup.select('table td.rtRates')
for pick in stock_picks:
    print(str(pick))

<td class="rtRates"><a href="https://www.x-rates.com/graph/?from=USD&amp;to=EUR">0.977311</a></td>
<td class="rtRates"><a href="https://www.x-rates.com/graph/?from=EUR&amp;to=USD">1.023216</a></td>
<td class="rtRates"><a href="https://www.x-rates.com/graph/?from=USD&amp;to=GBP">0.834024</a></td>
<td class="rtRates"><a href="https://www.x-rates.com/graph/?from=GBP&amp;to=USD">1.199006</a></td>
<td class="rtRates"><a href="https://www.x-rates.com/graph/?from=USD&amp;to=INR">79.875374</a></td>
<td class="rtRates"><a href="https://www.x-rates.com/graph/?from=INR&amp;to=USD">0.012520</a></td>
<td class="rtRates"><a href="https://www.x-rates.com/graph/?from=USD&amp;to=AUD">1.449801</a></td>
<td class="rtRates"><a href="https://www.x-rates.com/graph/?from=AUD&amp;to=USD">0.689750</a></td>
<td class="rtRates"><a href="https://www.x-rates.com/graph/?from=USD&amp;to=CAD">1.289322</a></td>
<td class="rtRates"><a href="https://www.x-rates.com/graph/?from=CAD&amp;to=USD">0.775601</a></td>
<td class

In [33]:
import re
regex=r'.*from=(\w{3}).*to=(\w{3}).*>(\d+\.\d+).*'
for pick in stock_picks:
    print(re.findall(regex,str(pick)))

[('USD', 'EUR', '0.977311')]
[('EUR', 'USD', '1.023216')]
[('USD', 'GBP', '0.834024')]
[('GBP', 'USD', '1.199006')]
[('USD', 'INR', '79.875374')]
[('INR', 'USD', '0.012520')]
[('USD', 'AUD', '1.449801')]
[('AUD', 'USD', '0.689750')]
[('USD', 'CAD', '1.289322')]
[('CAD', 'USD', '0.775601')]
[('USD', 'SGD', '1.391884')]
[('SGD', 'USD', '0.718450')]
[('USD', 'CHF', '0.968996')]
[('CHF', 'USD', '1.031996')]
[('USD', 'MYR', '4.450184')]
[('MYR', 'USD', '0.224710')]
[('USD', 'JPY', '138.211403')]
[('JPY', 'USD', '0.007235')]
[('USD', 'CNY', '6.744590')]
[('CNY', 'USD', '0.148267')]
[('USD', 'ARS', '129.145464')]
[('ARS', 'USD', '0.007743')]
[('USD', 'AUD', '1.449801')]
[('AUD', 'USD', '0.689750')]
[('USD', 'BHD', '0.376000')]
[('BHD', 'USD', '2.659574')]
[('USD', 'BWP', '12.744567')]
[('BWP', 'USD', '0.078465')]
[('USD', 'BRL', '5.387638')]
[('BRL', 'USD', '0.185610')]
[('USD', 'BND', '1.391884')]
[('BND', 'USD', '0.718450')]
[('USD', 'BGN', '1.911454')]
[('BGN', 'USD', '0.523162')]
[('USD',