# Web Crawling and Web Scraping

Sometimes a website offers an API, but no well-written package exists in the language you prefer (e.g., there is a Java library but no Python one). Even worse, there may be no public API at all, and we have to design a scraper to retrieve the relevant information. In such cases, we can build our own wrapper functions.

Web crawling and web scraping are two related techniques used to extract information from websites.

Web crawling, also known as web indexing or web spidering, is the process of automatically exploring and indexing web pages on the internet. Web crawlers, also called spiders, bots, or robots, navigate through websites, follow links, and index the content of the pages they encounter. Search engines like Google and Bing use web crawlers to build their indexes of web pages, which enables users to find information easily.
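
The follow-the-links loop described above can be sketched with the standard library alone. This is a minimal illustration, not how any search engine works: the `PAGES` dictionary is a hypothetical in-memory stand-in for a website (so nothing is downloaded), and `LinkParser` and `crawl` are names made up for this sketch.

```python
from collections import deque
from html.parser import HTMLParser

# Hypothetical in-memory "website": URL path -> HTML source.
PAGES = {
    "/": '<a href="/about">About</a> <a href="/news">News</a>',
    "/about": '<a href="/">Home</a>',
    "/news": '<a href="/news/1">Story</a> <a href="/about">About</a>',
    "/news/1": "<p>No links here.</p>",
}


class LinkParser(HTMLParser):
    """Collect the href value of every <a> tag fed to the parser."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(start):
    """Breadth-first crawl: visit each reachable page exactly once."""
    visited = []
    seen = {start}
    queue = deque([start])
    while queue:
        url = queue.popleft()
        visited.append(url)
        parser = LinkParser()
        parser.feed(PAGES.get(url, ""))  # a real crawler would download the page here
        for link in parser.links:
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return visited


print(crawl("/"))
```

A real crawler would replace the dictionary lookup with an HTTP request and resolve relative links, for example with `urllib.parse.urljoin`.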

Web scraping, on the other hand, is the process of extracting specific data from web pages. Web scraping involves analyzing the HTML structure of a webpage, identifying the relevant information, and extracting it into a structured format such as a CSV or JSON file. Web scraping can be used to extract product information, pricing data, news articles, and more.

Web crawling and web scraping can be done manually, but it's often more efficient to use specialized software tools. Python is a popular language for web crawling and web scraping, and there are many libraries available, including BeautifulSoup, Scrapy, and Selenium.

However, it's important to note that web scraping can raise legal and ethical concerns, particularly if done without permission or in violation of website terms of service. Web scraping can also put a strain on website servers, potentially causing them to crash or become unavailable. As such, it's important to use web scraping responsibly and within legal and ethical boundaries.
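
One common way to limit the load on a server is to pause between requests. The sketch below assumes a hypothetical `fetch` callable supplied by the caller (for example, a wrapper around an HTTP request); the function name `fetch_politely` and the delay value are illustrative choices, not a standard.

```python
import time


def fetch_politely(urls, fetch, delay_seconds=1.0):
    """Call fetch(url) for each URL, sleeping between consecutive calls."""
    results = {}
    for index, url in enumerate(urls):
        if index > 0:
            time.sleep(delay_seconds)  # pause so the server is not flooded
        results[url] = fetch(url)
    return results


# Usage with a stand-in fetcher that downloads nothing.
pages = fetch_politely(
    ["/a", "/b"],
    fetch=lambda url: f"contents of {url}",
    delay_seconds=0.1,
)
print(pages)
```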

##### Preliminary examples

Example from <a href="https://www.w3schools.com/html/tryit.asp?filename=tryhtml_basic_document" target="_blank">w3schools</a>.

```html
<!DOCTYPE html>
<html>
<body>

<h1>My First Heading</h1>

<p>My 1st paragraph.</p>
<p>My 2nd paragraph.</p>
<p>My 3rd paragraph.</p>

</body>
</html>
```

Save this code as `data/sample.html` (or under any other name, as long as you adjust the path in the code below). We will use a great library called ___`Beautiful Soup`___ to read the contents from Python. You may also need to install lxml, the parser used below for HTML and XML.

    poetry add beautifulsoup4 lxml

In [None]:
## Do the following if you have not

from bs4 import BeautifulSoup as Soup

# Beautiful Soup is a package for parsing HTML and XML, commonly used for web scraping
# It is installed as beautifulsoup4 and imported as bs4; this version supports Python 3

In [None]:
with open("data/sample.html", "r") as sample:
    sample_contents = sample.read()

The raw string does not show the structure of the HTML clearly, which is where BeautifulSoup comes in really handy!

In [None]:
sample_contents

In [None]:
type(sample_contents)

In [None]:
sample_soup = Soup(sample_contents, 'lxml')

In [None]:
type(sample_soup)

Printing it shows the same contents as above, with proper indentation.

In [None]:
print(sample_soup.prettify())

Get the contents of interest: all the `p`'s

`p` means paragraph in HTML. Check more tag definitions on [w3schools.com](https://www.w3schools.com/tags/default.asp).

In [None]:
p_tags = sample_soup.find_all("p")
# finding all the p's in the sample_soup and putting it in a list 
# p = paragraph 

In [None]:
p_tags
# Calling the list from above

In [None]:
p_tags[1]

In [None]:
type(p_tags[0])

For each `p` tag, we extract its text.

In [None]:
for p in p_tags:
    print(p.text)
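
The steps above can be collected into one self-contained sketch. It assumes the HTML is held in a string rather than read from `data/sample.html`, and it uses Python's built-in `html.parser` backend instead of `lxml` so that no extra parser is required. The JSON export at the end is an illustrative addition.

```python
import json

from bs4 import BeautifulSoup

html = """
<!DOCTYPE html>
<html>
<body>
<h1>My First Heading</h1>
<p>My 1st paragraph.</p>
<p>My 2nd paragraph.</p>
<p>My 3rd paragraph.</p>
</body>
</html>
"""

soup = BeautifulSoup(html, "html.parser")

# get_text() returns the same string as the .text attribute used above.
paragraphs = [p.get_text() for p in soup.find_all("p")]
heading = soup.find("h1").get_text()

# Store the extracted values in a structured format.
record = {"heading": heading, "paragraphs": paragraphs}
print(json.dumps(record, indent=2))
```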

---

#### A real example

Let's use a real website for illustration: [borgerforslag](https://www.borgerforslag.dk), the Danish Parliament's web page for handling citizen proposals.

To view the source, that is, the real structure of a web page, you can use the ___`developer tools`___ in your browser.

Recall that [`requests`](http://docs.python-requests.org/) is a convenient package for sending HTTP requests.

In [None]:
import requests

In [None]:
borger_url = "https://www.borgerforslag.dk/se-og-stoet-forslag/?Id=FT-14316"
r = requests.get(borger_url)
r.status_code
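
`r.status_code` only reports the outcome; it does not stop the script when a request fails. A slightly more defensive sketch follows. The helper name `fetch_html`, the ten-second timeout, and the `User-Agent` string are assumptions made for illustration; `raise_for_status()` and the `timeout` and `headers` arguments are part of `requests`.

```python
import requests


def fetch_html(url, timeout=10.0):
    """Download a page and return its text, raising on HTTP errors."""
    headers = {"User-Agent": "course-scraper-example/0.1"}  # hypothetical identifier
    response = requests.get(url, headers=headers, timeout=timeout)
    response.raise_for_status()  # raises requests.HTTPError for 4xx/5xx responses
    return response.text
```

Calling `fetch_html(borger_url)` should then return the page source as a string, or raise an exception instead of silently handing back an error page.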

In [None]:
r.content

In [None]:
r.text[300:1000]

Convert it to a soup object

In [None]:
borger_soup = Soup(r.text, 'lxml')

Find the corresponding tag. Note that `class_` has a trailing underscore `_`.
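
The underscore is needed because `class` is a reserved word in Python. The sketch below uses a small hypothetical HTML snippet (not the real borgerforslag markup) to show three equivalent ways of selecting by CSS class in Beautiful Soup.

```python
from bs4 import BeautifulSoup

# Hypothetical markup for illustration only.
html = """
<div class="header">Menu</div>
<div class="article"><p>Proposal text</p></div>
"""
soup = BeautifulSoup(html, "html.parser")

by_keyword = soup.find("div", class_="article")          # keyword argument
by_attrs = soup.find("div", attrs={"class": "article"})  # attribute dictionary
by_selector = soup.select_one("div.article")             # CSS selector

print(by_keyword.get_text())
```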

In [None]:
borger_soup

In [None]:
summary_tag = borger_soup.find_all('div')

In [None]:
summary_tag

In [None]:
summary_tag = borger_soup.find('div', class_='article')

print('--------------------')
print(summary_tag)
print('--------------------')


In [None]:
borger_content = summary_tag.contents

In [None]:
borger_content

In [None]:
print(borger_content[2].text)
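
Two details of this step are worth guarding against. `find` returns `None` when nothing matches, and `.contents` also includes whitespace-only text nodes, so the element of interest is not necessarily at index `0`. A minimal sketch on hypothetical markup (the real page structure may differ and can change over time):

```python
from bs4 import BeautifulSoup

# Hypothetical markup for illustration only.
html = """<div class="article">
<h2>Title</h2>
<p>Proposal text</p>
</div>"""
soup = BeautifulSoup(html, "html.parser")

article = soup.find("div", class_="article")
missing = soup.find("div", class_="does-not-exist")
print(missing)  # find() returns None when no tag matches

if article is not None:
    # .contents mixes tags with the newline strings between them.
    print(len(article.contents))
    # find_all(True) keeps only tags, skipping the whitespace text nodes.
    child_tags = article.find_all(True, recursive=False)
    texts = [tag.get_text(strip=True) for tag in child_tags]
    print(texts)
```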

With this in mind, you can scrape almost any webpage of interest. Other formats such as <a href="http://www.json.org/" target="_blank">JSON</a> and <a href="https://www.w3.org/XML/" target="_blank">XML</a> are also hierarchical, so they can be handled in much the same way, with a few differences.
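
As a rough illustration of that similarity, the sketch below parses the same hypothetical record from JSON and from XML using only the standard library. Both are tree-shaped like HTML; the main difference is that JSON maps directly onto Python dictionaries and lists, while XML is navigated by tag name.

```python
import json
import xml.etree.ElementTree as ET

# The same made-up record in two formats.
json_text = '{"proposal": {"id": "FT-00000", "title": "Example title"}}'
xml_text = "<proposal><id>FT-00000</id><title>Example title</title></proposal>"

json_title = json.loads(json_text)["proposal"]["title"]
xml_title = ET.fromstring(xml_text).find("title").text

print(json_title, xml_title)
```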

***But keep in mind that you should act politely and with proper permission! To find out whether specific paths/contents are allowed to be scraped, you can check the site's ___`robots.txt`___. For example, <a href="https://www.google.com/robots.txt" target="_blank">here's</a> the permission information set by Google.***
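
Python's standard library can read these rules. The sketch below parses a hypothetical `robots.txt` from a string so that it runs offline; for a live site, `set_url()` followed by `read()` would download the file instead. The user-agent name and the URLs are made up.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules; a real file would come from https://<site>/robots.txt
rules = """
User-agent: *
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(rules.splitlines())

allowed = parser.can_fetch("my-scraper", "https://example.com/public/page.html")
blocked = parser.can_fetch("my-scraper", "https://example.com/private/page.html")
print(allowed, blocked)
```

Note that `robots.txt` only expresses the site owner's wishes for automated clients; it does not replace reading the terms of service.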

---

Note that the examples we are using here are relatively simple. In some cases, we cannot reach paginated or scroll-loaded content with `requests` alone. In those cases, [Selenium](http://selenium-python.readthedocs.io/) will save our lives by ___simulating browsers___!

Some more tutorials/tools:

- https://scrapy.org/ (building a crawler)
- https://www.dataquest.io/blog/web-scraping-tutorial-python/
- https://www.quora.com/Python-programming-language-1/How-is-BeautifulSoup-different-from-Scrapy

---

return to [overview](../00_overview.ipynb)