# Web Crawling and Web Scraping

Sometimes a website offers an API but no well-written package for the language you prefer (e.g., only a Java library and no Python one). Even worse, there may be no public API at all, and we have to design a scraper to retrieve the relevant information ourselves. In such cases, we can build our own wrapper functions.

Web crawling and web scraping are two related techniques used to extract information from websites.

Web crawling, also known as web indexing or web spidering, is the process of automatically exploring and indexing web pages on the internet. Web crawlers, also called spiders, bots, or robots, navigate through websites, follow links, and index the content of the pages they encounter. Search engines like Google and Bing use web crawlers to build their indexes of web pages, which enables users to find information easily.
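
To make the "follow links and index pages" idea concrete, here is a minimal sketch of a breadth-first crawler. It is an illustration only: the `PAGES` dictionary is a hypothetical in-memory stand-in for a website (a real crawler would download each page over HTTP), and the names `LinkCollector` and `crawl` are made up for this example.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin

# Hypothetical in-memory "website": URL -> HTML.
# A real crawler would fetch each page over HTTP instead.
PAGES = {
    "https://example.com/": '<a href="/a">A</a> <a href="/b">B</a>',
    "https://example.com/a": '<a href="/b">B</a> <a href="/c">C</a>',
    "https://example.com/b": '<a href="/">Home</a>',
    "https://example.com/c": "<p>No links here.</p>",
}


class LinkCollector(HTMLParser):
    """Collect the href value of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(start_url, pages):
    """Breadth-first crawl: visit each reachable page exactly once."""
    visited = []
    seen = {start_url}
    queue = deque([start_url])
    while queue:
        url = queue.popleft()
        html = pages.get(url)
        if html is None:  # page not available: skip it
            continue
        visited.append(url)
        collector = LinkCollector()
        collector.feed(html)
        for href in collector.links:
            absolute = urljoin(url, href)  # resolve relative links
            if absolute not in seen:
                seen.add(absolute)
                queue.append(absolute)
    return visited


crawl_order = crawl("https://example.com/", PAGES)
print(crawl_order)
```

The `seen` set is what keeps the crawler from looping forever when pages link back to each other, as the home page and `/b` do here.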

Web scraping, on the other hand, is the process of extracting specific data from web pages. Web scraping involves analyzing the HTML structure of a webpage, identifying the relevant information, and extracting it into a structured format such as a CSV or JSON file. Web scraping can be used to extract product information, pricing data, news articles, and more.

Web crawling and web scraping can be done manually, but it's often more efficient to use specialized software tools. Python is a popular language for web crawling and web scraping, and there are many libraries available, including BeautifulSoup, Scrapy, and Selenium.

However, it's important to note that web scraping can raise legal and ethical concerns, particularly if done without permission or in violation of website terms of service. Web scraping can also put a strain on website servers, potentially causing them to crash or become unavailable. As such, it's important to use web scraping responsibly and within legal and ethical boundaries.
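
One simple way to reduce the load on a server is to pause between requests. The sketch below assumes a fixed minimum delay is acceptable; the `Throttle` class and the page names are hypothetical, and the 0.2 second delay is only for demonstration (a real scraper would typically wait longer).

```python
import time


class Throttle:
    """Enforce a minimum delay between consecutive requests."""

    def __init__(self, min_delay_seconds):
        self.min_delay = min_delay_seconds
        self._last_call = None

    def wait(self):
        """Sleep just long enough to respect the minimum delay."""
        now = time.monotonic()
        if self._last_call is not None:
            remaining = self.min_delay - (now - self._last_call)
            if remaining > 0:
                time.sleep(remaining)
        self._last_call = time.monotonic()


throttle = Throttle(min_delay_seconds=0.2)
start = time.monotonic()
for page in ["page-1", "page-2", "page-3"]:
    throttle.wait()
    # A real scraper would send its HTTP request here.
    print("fetching", page)
elapsed = time.monotonic() - start
print(f"elapsed: {elapsed:.2f}s")
```

The first call does not wait; each later call sleeps only for the part of the delay that has not already passed.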

##### Preliminary examples

Examples from <a href="https://www.w3schools.com/html/tryit.asp?filename=tryhtml_basic_document" target="_blank">w3schools</a>.

```html
<!DOCTYPE html>
<html>
<body>

<h1>My First Heading</h1>

<p>My 1st paragraph.</p>
<p>My 2nd paragraph.</p>
<p>My 3rd paragraph.</p>

</body>
</html>
```

Save this code to disk as `data/sample.html` (or any other path, as long as you adjust the `open` call below). We will use a great library called ___`Beautiful Soup`___ to read the contents from Python. You may also need to install `lxml`, the parser we use below for HTML and XML.

    poetry add beautifulsoup4 lxml

In [1]:
# Import Beautiful Soup (install it first with the command above if you have not)

from bs4 import BeautifulSoup as Soup

In [2]:
with open("data/sample.html", "r") as sample:
    sample_contents = sample.read()

Without BeautifulSoup, the HTML is just one long string and its structure is hard to see. This is where BeautifulSoup comes in really handy!

In [3]:
sample_contents

'<!DOCTYPE html>\n<html>\n<body>\n\n<h1>My First Heading</h1>\n\n<p>My 1st paragraph.</p>\n<p>My 2nd paragraph.</p>\n<p>My 3rd paragraph.</p>\n\n</body>\n</html>'

In [4]:
type(sample_contents)

str

In [5]:
sample_soup = Soup(sample_contents, 'lxml')

In [6]:
type(sample_soup)

bs4.BeautifulSoup

By printing it with `prettify()`, we can see the same contents as shown above, with proper indentation.

In [7]:
print(sample_soup.prettify())

<!DOCTYPE html>
<html>
 <body>
  <h1>
   My First Heading
  </h1>
  <p>
   My 1st paragraph.
  </p>
  <p>
   My 2nd paragraph.
  </p>
  <p>
   My 3rd paragraph.
  </p>
 </body>
</html>



Get the contents of interest: all the `p` tags.

`p` means paragraph in HTML. Check more tag definitions on [w3schools.com](https://www.w3schools.com/tags/default.asp).

In [8]:
p_tags = sample_soup.find_all("p")

In [9]:
p_tags

[<p>My 1st paragraph.</p>, <p>My 2nd paragraph.</p>, <p>My 3rd paragraph.</p>]

In [10]:
p_tags[1]

<p>My 2nd paragraph.</p>

In [11]:
type(p_tags[0])

bs4.element.Tag

For each `p` tag, we extract its text.

In [12]:
for p in p_tags:
    print(p.text)

My 1st paragraph.
My 2nd paragraph.
My 3rd paragraph.
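
The extracted text can then be stored in a structured format such as JSON or CSV. The following is a minimal sketch, assuming the same sample HTML as above; it uses Python's built-in `html.parser` so that it runs without `lxml`, and the field names `index` and `text` are arbitrary choices for this example.

```python
import csv
import io
import json

from bs4 import BeautifulSoup

html = """
<html><body>
<h1>My First Heading</h1>
<p>My 1st paragraph.</p>
<p>My 2nd paragraph.</p>
<p>My 3rd paragraph.</p>
</body></html>
"""

soup = BeautifulSoup(html, "html.parser")

# One record per paragraph, numbered from 1.
rows = [
    {"index": i, "text": p.get_text(strip=True)}
    for i, p in enumerate(soup.find_all("p"), start=1)
]

# JSON: serialize the list of records directly.
json_output = json.dumps(rows, indent=2)

# CSV: write to an in-memory buffer; use open(...) to write a real file.
csv_buffer = io.StringIO()
writer = csv.DictWriter(csv_buffer, fieldnames=["index", "text"])
writer.writeheader()
writer.writerows(rows)
csv_output = csv_buffer.getvalue()

print(json_output)
print(csv_output)
```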


---

#### A real example

Let's use a real website for illustration. For example, suppose we are interested in [borgerforslag](https://www.borgerforslag.dk), the Danish Parliament's website for handling citizen proposals.

To view the "text style" or the real structure of a web page, you can use the ___`developer tools`___ in your browser.

Recall that [`requests`](http://docs.python-requests.org/) is a convenient package for sending HTTP requests.

In [13]:
import requests

In [14]:
borger_url = "https://www.borgerforslag.dk/se-og-stoet-forslag/?Id=FT-14316"
r = requests.get(borger_url)
r.status_code

200

In [15]:
r.content

b'\n\n<!DOCTYPE html>\n<html lang="da-DK">\n<head>\n    <title>Udskiftning af dans i idr&#xE6;tundervisningen med alternative aktiviteter</title>\n\n\n    <meta charset="utf-8" />\n\n    \n<script id="Cookiebot" data-cbid="51f634e9-7d87-4212-af57-edd0e26f6f06" data-blockingmode="none" type="text/javascript" src="https://consent.cookiebot.com/uc.js"></script>\n\n<script type="text/javascript">\n       window.__THIRD_PARTY_KEYS = { sentry: "https://984e0a1cc92a49acaf5b315f1f3f1cd1@sentry.io/216509" };\n\n       window.addEventListener(\'CookiebotOnAccept\', function (e) {\n           if (Cookiebot.consent.statistics) {\n               // load app insight after consent\n               var appInsights = window.appInsights || function (a) {\n                   function b(a) { c[a] = function () { var b = arguments; c.queue.push(function () { c[a].apply(c, b) }) } } var c = { config: a }, d = document, e = window; setTimeout(function () { var b = d.createElement("script"); b.src = a.url || "

In [16]:
r.text[300:1000]

'"https://consent.cookiebot.com/uc.js"></script>\n\n<script type="text/javascript">\n       window.__THIRD_PARTY_KEYS = { sentry: "https://984e0a1cc92a49acaf5b315f1f3f1cd1@sentry.io/216509" };\n\n       window.addEventListener(\'CookiebotOnAccept\', function (e) {\n           if (Cookiebot.consent.statistics) {\n               // load app insight after consent\n               var appInsights = window.appInsights || function (a) {\n                   function b(a) { c[a] = function () { var b = arguments; c.queue.push(function () { c[a].apply(c, b) }) } } var c = { config: a }, d = document, e = window; setTimeout(function () { var b = d.createElement("script"); b.src = a.url || "https://az416426.vo.msec'

Convert it to a soup object.

In [17]:
borger_soup = Soup(r.text, 'lxml')

Find the corresponding tag. Note that the keyword argument `class_` has a trailing underscore `_`, because `class` is a reserved word in Python.

In [18]:
borger_soup

<!DOCTYPE html>
<html lang="da-DK">
<head>
<title>Udskiftning af dans i idrætundervisningen med alternative aktiviteter</title>
<meta charset="utf-8"/>
<script data-blockingmode="none" data-cbid="51f634e9-7d87-4212-af57-edd0e26f6f06" id="Cookiebot" src="https://consent.cookiebot.com/uc.js" type="text/javascript"></script>
<script type="text/javascript">
       window.__THIRD_PARTY_KEYS = { sentry: "https://984e0a1cc92a49acaf5b315f1f3f1cd1@sentry.io/216509" };

       window.addEventListener('CookiebotOnAccept', function (e) {
           if (Cookiebot.consent.statistics) {
               // load app insight after consent
               var appInsights = window.appInsights || function (a) {
                   function b(a) { c[a] = function () { var b = arguments; c.queue.push(function () { c[a].apply(c, b) }) } } var c = { config: a }, d = document, e = window; setTimeout(function () { var b = d.createElement("script"); b.src = a.url || "https://az416426.vo.msecnd.net/scripts/a/ai.0.js"

In [19]:
summary_tag = borger_soup.find_all('div')

In [20]:
summary_tag

[<div class="site--wrapper" id="leseweb">
 <div id="B1gxcBWsJA"><div class="no-print FjLrEd" data-reactroot="" role="navigation"><button class="_2gvjgg _2cq934 _13eHrl"><span class="tEWsZk" style="color:#fff">MENU</span><span class="_1QnHfQ"><span class="_1RIvq2"></span></span></button></div></div><script data-module="Menu">_components.push({"name":"Menu","props":{"menuTitleOpen":"Menu","menuTitleClose":"Luk","links":[{"label":"Forside","href":"https:\u002F\u002Fwww.borgerforslag.dk\u002F"},{"label":"Opret forslag","href":"https:\u002F\u002Fwww.borgerforslag.dk\u002Fopret-forslag\u002F"},{"label":"Vejledninger","href":"https:\u002F\u002Fwww.borgerforslag.dk\u002Fvejledninger\u002F"},{"label":"Ikkedigitale borgere","href":"https:\u002F\u002Fwww.borgerforslag.dk\u002Fikkedigitale-borgere\u002F"},{"label":"Ofte stillede spørgsmål","href":"https:\u002F\u002Fwww.borgerforslag.dk\u002Fofte-stillede-spoergsmaal\u002F"},{"label":"Om borgerforslag","href":"https:\u002F\u002Fwww.borgerforslag.dk

In [21]:
summary_tag = borger_soup.find('div', class_='article')

print('--------------------')
print(summary_tag)
print('--------------------')


--------------------
<div class="article">
<div id="rJf4CWs1C"><div class="_3dLODA" data-reactroot="" data-readaloud-ancestor="true"><div class="_3l86Vg" data-readaloud="true"><div><span>Startdato</span><strong>17. marts 2023</strong></div><div><span>Slutdato</span><strong>13. september 2023</strong></div><div><span>Antal støtter</span><strong>41</strong><div class="ssLhPH"><div class="_2KXwjO" style="width:1.4317821063276352%"></div></div></div></div><section data-readaloud="true"></section><button class="_3iicqI" disabled=""><svg height="28px" viewbox="0 0 20 20" width="28px" xmlns="http://www.w3.org/2000/svg"><path d="M11.536 14.01A8.473 8.473 0 0 0 14.026 8a8.473 8.473 0 0 0-2.49-6.01l-.708.707A7.476 7.476 0 0 1 13.025 8c0 2.071-.84 3.946-2.197 5.303l.708.707z"></path><path d="M10.121 12.596A6.48 6.48 0 0 0 12.025 8a6.48 6.48 0 0 0-1.904-4.596l-.707.707A5.483 5.483 0 0 1 11.025 8a5.483 5.483 0 0 1-1.61 3.89l.706.706z"></path><path d="M8.707 11.182A4.486 4.486 0 0 0 10.025 8a4.486 4
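
The `article` block printed above contains label/value pairs (`span` followed by `strong`). As a sketch of how such pairs could be turned into a dictionary, the example below works on a shortened, hand-copied version of that fragment rather than on the live page, and again uses `html.parser` to avoid the `lxml` dependency. The real markup is more deeply nested and may change over time.

```python
from bs4 import BeautifulSoup

# Shortened copy of the fragment printed above; the real page has more markup.
fragment = """
<div class="article">
  <div><span>Startdato</span><strong>17. marts 2023</strong></div>
  <div><span>Slutdato</span><strong>13. september 2023</strong></div>
  <div><span>Antal støtter</span><strong>41</strong></div>
</div>
"""

soup = BeautifulSoup(fragment, "html.parser")
article = soup.find("div", class_="article")

facts = {}
for span in article.find_all("span"):
    # The value sits in the <strong> tag right next to the label.
    value = span.find_next_sibling("strong")
    if value is not None:
        facts[span.get_text(strip=True)] = value.get_text(strip=True)

print(facts)
```

Skipping any `span` without a `strong` sibling keeps the loop from failing on unrelated `span` tags elsewhere in the block.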

In [22]:
borger_content = summary_tag.contents

In [23]:
borger_content

['\n',
 <div id="rJf4CWs1C"><div class="_3dLODA" data-reactroot="" data-readaloud-ancestor="true"><div class="_3l86Vg" data-readaloud="true"><div><span>Startdato</span><strong>17. marts 2023</strong></div><div><span>Slutdato</span><strong>13. september 2023</strong></div><div><span>Antal støtter</span><strong>41</strong><div class="ssLhPH"><div class="_2KXwjO" style="width:1.4317821063276352%"></div></div></div></div><section data-readaloud="true"></section><button class="_3iicqI" disabled=""><svg height="28px" viewbox="0 0 20 20" width="28px" xmlns="http://www.w3.org/2000/svg"><path d="M11.536 14.01A8.473 8.473 0 0 0 14.026 8a8.473 8.473 0 0 0-2.49-6.01l-.708.707A7.476 7.476 0 0 1 13.025 8c0 2.071-.84 3.946-2.197 5.303l.708.707z"></path><path d="M10.121 12.596A6.48 6.48 0 0 0 12.025 8a6.48 6.48 0 0 0-1.904-4.596l-.707.707A5.483 5.483 0 0 1 11.025 8a5.483 5.483 0 0 1-1.61 3.89l.706.706z"></path><path d="M8.707 11.182A4.486 4.486 0 0 0 10.025 8a4.486 4.486 0 0 0-1.318-3.182L8 5.525A3.48

In [24]:
print(borger_content[2].text)

_components.push({"name":"ProposalEditor","props":{"proposalCreationViewModel":{"proposal":{"id":15727,"title":"Udskiftning af dans i idrætundervisningen med alternative aktiviteter","proposalContent":"Jeg foreslår, at dans fjernes fra den obligatoriske idrætundervisning i skolerne i Danmark og erstattes med alternative fysiske aktiviteter. Mange elever oplever udfordringer med dans, da det kan være en intimiderende og ubehagelig oplevelse for nogle elever. Desuden kan dansen have en kønsstereotyp påvirkning, der kan føre til kønsdiskrimination.\n\nVed at erstatte dansen med alternative fysiske aktiviteter vil eleverne stadig have mulighed for at deltage i sjove og udfordrende aktiviteter, der vil forbedre deres fysiske og mentale sundhed. Der er mange alternative aktiviteter, som kan udvikle elevernes styrke, smidighed, koordination og samarbejdsevner, som f.eks. yoga, fitness, svømning, eller andre idrætsgrene.\n\nJeg tror, at fjernelse af dans fra idrætundervisningen vil give elever
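
The text printed above is a JavaScript call whose argument looks like JSON. Assuming the script consists of a single `_components.push(<JSON>)` call, the payload can be sliced out and parsed with the `json` module. The `script_text` below is a shortened, hand-made stand-in for the real script, and real pages may embed JavaScript that is not strictly valid JSON, in which case `json.loads` raises an error.

```python
import json

# Shortened stand-in for the script text printed above (an assumption:
# the real script is one `_components.push(<JSON>)` call).
script_text = (
    '_components.push({"name":"ProposalEditor","props":'
    '{"proposalCreationViewModel":{"proposal":'
    '{"id":15727,"title":"Udskiftning af dans i idrætundervisningen '
    'med alternative aktiviteter"}}}});'
)

# Keep everything between the first "(" and the last ")".
start = script_text.index("(") + 1
end = script_text.rindex(")")
payload = json.loads(script_text[start:end])

proposal = payload["props"]["proposalCreationViewModel"]["proposal"]
print(proposal["id"], proposal["title"])
```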

With this in mind, you can scrape almost any webpage of interest. Other formats such as <a href="http://www.json.org/" target="_blank">JSON</a> and <a href="https://www.w3.org/XML/" target="_blank">XML</a> are handled similarly: they are also nested, tree-like structures, though each has its own syntax and parsing tools.

***But keep in mind that you should act politely and with proper permission! To find out whether specific paths/contents are allowed to be scraped, you can check the site's ___`robots.txt`___. For example, <a href="https://www.google.com/robots.txt" target="_blank">here's</a> the permission information set by Google.***
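
Python's standard library can evaluate `robots.txt` rules for you. The sketch below parses a small, hypothetical `robots.txt` given as text, so it runs without network access; the user agent name `my-course-bot` and the `example.com` URLs are made up. For a real site you could instead call `set_url(...)` and `read()` on the parser to download the file.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, for illustration only.
robots_txt = """
User-agent: *
Disallow: /private/
Allow: /
""".strip().splitlines()

parser = RobotFileParser()
parser.parse(robots_txt)

allowed_public = parser.can_fetch(
    "my-course-bot", "https://example.com/articles/1"
)
allowed_private = parser.can_fetch(
    "my-course-bot", "https://example.com/private/data"
)
print(allowed_public, allowed_private)
```

Keep in mind that `robots.txt` only states the site owner's wishes for crawlers; it does not replace reading the terms of service.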

---

Note that the examples we are using here are relatively simple. In some cases we cannot get past pagination or scrolling with `requests` alone, for example when content is loaded by JavaScript. In those cases, [Selenium](http://selenium-python.readthedocs.io/) will save our lives by ___simulating browsers___!

Some more tutorials/tools:

- https://scrapy.org/ (for building a crawler)
- https://www.dataquest.io/blog/web-scraping-tutorial-python/
- https://www.quora.com/Python-programming-language-1/How-is-BeautifulSoup-different-from-Scrapy

---

return to [overview](../00_overview.ipynb)