# Python Introduction - Webscraping


## Contents 

[1. Rules](#Rules)

[2. APIs](#APIs)

[3. Websites](#Websites)

[4. Exercises](#Exercises)

## Rules

1. Respect the wishes of the targeted website(s):

$\qquad$ - Check whether an API is available or whether the data can be downloaded in another way.<br>
$\qquad$ - Keep in mind where the data comes from, respect copyrights and cite the source if necessary.<br>
$\qquad$ - Play with open cards, i.e. do not pose as a normal internet user.<br>

2. Wait one or two seconds after each request

$\qquad$ - Scrape only what is needed for your project and only once (e.g. save html data on your hard disk and edit it afterwards)

3. How do we find out whether access is authorized?

$\qquad$ - Some websites prohibit access by scrapers via a robots.txt file.<br>
$\qquad$ - The Terms of Service (German: AGB) often indicate whether scraping is allowed. When in doubt, always contact the operators first.
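
The rules "play with open cards" and "scrape only once" can be sketched as follows. This is a minimal illustration, not a definitive implementation: the `USER_AGENT` string, the helper names `build_request`, `download` and `fetch_once`, and the stand-in downloader are all assumptions made for this example. The demonstration at the end runs offline and sends no real request.

```python
import hashlib
import tempfile
import urllib.request
from pathlib import Path

# Hypothetical identification string; replace it with your own project name and contact address
USER_AGENT = "my-research-scraper/0.1 (contact: you@example.org)"

def build_request(url):
    """Create a request that openly identifies the scraper via the User-Agent header."""
    return urllib.request.Request(url, headers={"User-Agent": USER_AGENT})

def download(url):
    """Send the identifying request and return the response body as bytes."""
    with urllib.request.urlopen(build_request(url)) as response:
        return response.read()

def fetch_once(url, cache_dir, downloader=download):
    """Return the page content, downloading it only if no local copy exists yet."""
    cache_dir = Path(cache_dir)
    cache_dir.mkdir(parents=True, exist_ok=True)
    # derive the file name from the URL so that every page gets its own cache file
    cache_file = cache_dir / (hashlib.sha256(url.encode("utf-8")).hexdigest() + ".html")
    if not cache_file.exists():
        cache_file.write_bytes(downloader(url))
    return cache_file.read_bytes()

# Offline demonstration with a stand-in downloader (no real request is sent)
calls = []

def fake_downloader(url):
    calls.append(url)
    return b"<html>demo</html>"

with tempfile.TemporaryDirectory() as tmp:
    first = fetch_once("https://example.com/page", tmp, fake_downloader)
    second = fetch_once("https://example.com/page", tmp, fake_downloader)

print(len(calls))  # number of times the stand-in downloader was actually called
```

In a real project you would call `fetch_once(url, "cache")` with the default downloader and then parse the saved file as often as needed without contacting the server again.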

### Example ResearchGate.net - is webscraping allowed?

Check https://www.researchgate.net/robots.txt:

````
User-agent: *
Allow: /
Disallow: /connector/
Disallow: /deref/
Disallow: /plugins.
Disallow: /firststeps.
Disallow: /publicliterature.PublicLiterature.search.html
Disallow: /amp/authorize
Allow: /signup.SignUp.html
Disallow: /signup.
````

`User-agent: *` means that the following rules apply to all user agents (e.g. Google's bots or our Python application).
The rest of the file defines which parts of the website must not be scraped, e.g. `/connector/`.
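
Such rules can also be checked programmatically with the standard library's `urllib.robotparser`. The sketch below is an illustration under assumptions: it parses a shortened, hand-written robots.txt that keeps only two `Disallow` rules from the example, and the user agent name and page paths are hypothetical. The `Allow: /` line is left out on purpose because, as far as I know, Python's parser applies the first matching rule, so a leading `Allow: /` would make every path look permitted.

```python
from urllib.robotparser import RobotFileParser

# Shortened, hand-written robots.txt with two Disallow rules taken from the example
robots_txt = """\
User-agent: *
Disallow: /connector/
Disallow: /deref/
"""

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())  # parse the text directly, no download needed

def is_allowed(path, user_agent="my-python-scraper"):
    """Check whether the given path may be fetched by the (hypothetical) user agent."""
    return parser.can_fetch(user_agent, "https://www.researchgate.net" + path)

allowed_profile = is_allowed("/profile/some-page")
allowed_connector = is_allowed("/connector/some-page")
print(allowed_profile, allowed_connector)
```

For a live site, `parser.set_url(...)` followed by `parser.read()` downloads the file instead. A positive result only covers robots.txt; the Terms of Service still have to be checked separately.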

Even though the robots.txt file seems to allow scraping most of the content, the terms and conditions may say otherwise:

The [Terms of Service](https://www.researchgate.net/application.TermsAndConditions.html) clearly state that the website provider does not allow webscraping:
<img src="https://www.dropbox.com/s/6o3m0yj59j9ks9t/researchgate_tos.PNG?dl=1" alt="Drawing" style="width: 800px;"/>


**Conclusion**: You should not scrape the site without the permission of the operators.

## APIs

Many websites offer free APIs for accessing their data. Where an API is available, using it instead of scraping the website directly is considered best practice.

In [62]:
import requests
req = requests.get("http://data.thecrix.de/data/crix.json")

In [64]:
req

<Response [200]>

In [63]:
req.encoding

In [65]:
req.status_code

200

#### 1xx: Informational
It means the request has been received and the process is continuing.
#### 2xx: Success
It means the action was successfully received, understood, and accepted.
#### 3xx: Redirection
It means further action must be taken in order to complete the request.
#### 4xx: Client Error
It means the request contains incorrect syntax or cannot be fulfilled.
#### 5xx: Server Error
It means the server failed to fulfill an apparently valid request.
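
The five classes are determined by the first digit of the status code. A small sketch of that mapping (the helper name `status_class` is an assumption made for this example):

```python
def status_class(status_code):
    """Map an HTTP status code to the name of its class via its first digit."""
    classes = {
        1: "Informational",
        2: "Success",
        3: "Redirection",
        4: "Client Error",
        5: "Server Error",
    }
    return classes.get(status_code // 100, "Unknown")

print(status_class(200))
print(status_class(404))
```

With `requests`, calling `req.raise_for_status()` is a common way to stop early on a 4xx or 5xx response instead of inspecting the code by hand.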

In [66]:
req.json()[0:5]

[{'date': '2014-07-31', 'price': 1000},
 {'date': '2014-08-01', 'price': 1018.202717},
 {'date': '2014-08-02', 'price': 1008.772389},
 {'date': '2014-08-03', 'price': 1004.4165},
 {'date': '2014-08-04', 'price': 1004.984138}]

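In [67]:
import pandas as pd  # needed here; pd was previously used before being imported

rows = req.json()  # list of {'date': ..., 'price': ...} dictionaries

In [68]:
# DataFrame.append was removed in pandas 2.0, so build the frame directly from the list of records
df = pd.DataFrame(rows, columns=['date', 'price'])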

In [69]:
df.head()

| | date | price |
|---|---|---|
| 0 | 2014-07-31 | 1000.0 |
| 1 | 2014-08-01 | 1018.2 |
| 2 | 2014-08-02 | 1008.77 |
| 3 | 2014-08-03 | 1004.42 |
| 4 | 2014-08-04 | 1004.98 |


In [70]:
import pandas as pd # Powerful package introducing the datastructure DataFrame

crix = pd.read_json("http://data.thecrix.de/data/crix.json")
crix.tail(5)

| | date | price |
|---|---|---|
| 2098 | 2020-04-28 | 18331.330183 |
| 2099 | 2020-04-29 | 18379.326121 |
| 2100 | 2020-04-30 | 20511.177401 |
| 2101 | 2020-05-01 | 20154.301782 |
| 2102 | 2020-05-02 | 20624.094692 |


Please note that life isn't always that easy; usually much more work is needed to download the desired information. Sometimes the information you are interested in is spread over several JSON files or links. In that case, you need to loop over all relevant links to retrieve it. Make sure that you do not overload the server, since the operators might block you if you send too many requests within a certain amount of time. A good rule of thumb is to send no more than 1-2 requests per second if the website operator states nothing else. You can enforce this with the `sleep` function from the `time` package.

In [71]:
import time

# example to illustrate the sleep function -> print the counter i and then do nothing for 2 seconds
for i in range(0, 5):
    print(i+1)
    time.sleep(2)

1
2
3
4
5


In [73]:
for i in range(0, 5):
    print(i+1)

1
2
3
4
5
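
Applied to a list of links, the same pattern could look like the sketch below. The function name `fetch_all`, the example URLs and the stand-in `fetch` callable are assumptions made for illustration; the demonstration runs offline and uses a short delay only to keep it fast.

```python
import time

def fetch_all(urls, fetch, delay_seconds=1.0):
    """Call fetch(url) for every URL and pause between consecutive requests."""
    results = {}
    for position, url in enumerate(urls):
        if position > 0:
            time.sleep(delay_seconds)  # wait before every request except the first
        results[url] = fetch(url)
    return results

# Offline demonstration: the stand-in fetch function only builds a string
demo_urls = ["https://example.com/a.json", "https://example.com/b.json"]
demo = fetch_all(demo_urls, fetch=lambda url: "content of " + url, delay_seconds=0.1)
print(demo)
```

With `requests`, the stand-in could be replaced by something like `lambda url: requests.get(url).json()` and a delay of one or two seconds.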


Let's try an API that contains more data and is therefore more complex in terms of data extraction. We will use the CoinGecko API for cryptocurrency data. The API documentation and terms of use can be found [here](https://www.coingecko.com/en/api#explore-api).


The API's base URL is: https://api.coingecko.com/api/v3/

Let's try it and find out how many cryptocurrencies are available via this API.

In [74]:
# save base link in variable
base = 'https://api.coingecko.com/api/v3/'
data = pd.read_json(base + 'coins/list')
print(data.shape) # -> the number of rows is the number of cryptocurrencies currently available via CoinGecko
data.head(5)

(7193, 3)


| | id | name | symbol |
|---|---|---|---|
| 0 | 01coin | 01coin | zoc |
| 1 | 02-token | O2 Token | o2t |
| 2 | 0cash | 0cash | zch |
| 3 | 0chain | 0chain | zcn |
| 4 | 0x | 0x | zrx |


Now that we know how to extract the ids and symbols, we want to extract specific data. For example, we are interested in the price and market capitalization in USD for Bitcoin and Ethereum. We therefore run the following code:

In [76]:
# extract the ids of interest
ids = ['bitcoin','ethereum']

# extract the relevant information
pd.read_json(base + 'coins/' + ids[0]) # -> not all json structure can be accessed using pandas

ValueError: arrays must all be same length

Because the entries in this JSON file have different lengths, we need to use different packages to load the data into our environment.

In [77]:
base + 'coins/' + ids[0]

'https://api.coingecko.com/api/v3/coins/bitcoin'

In [78]:
import json
import urllib.request as request

rows = []  # one dictionary per coin

for idx in ids:
    # fetch and parse the JSON document for this coin
    with request.urlopen(base + 'coins/' + idx) as response:
        data = json.loads(response.read())

    # pick the fields of interest from the nested structure
    rows.append({'id': data['id'],
                 'name': data['name'],
                 'symbol': data['symbol'],
                 'price': data['market_data']['current_price']['usd'],
                 'market_cap': data['market_data']['market_cap']['usd']})

# DataFrame.append was removed in pandas 2.0, so build the frame once from the collected rows
df = pd.DataFrame(rows, columns=['id','name','symbol','price','market_cap'])

df

| | id | name | symbol | price | market_cap |
|---|---|---|---|---|---|
| 0 | bitcoin | Bitcoin | btc | 8709.21 | 159925209232 |
| 1 | ethereum | Ethereum | eth | 201.38 | 22318542510 |
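
The nested lookups used in the loop above can be tried offline on a small sample. The JSON below is hand-written for illustration: it only mirrors the key names used above, and its values are made up. The helper `get_nested` is a hypothetical convenience function that returns a default instead of raising a `KeyError` when a key is missing.

```python
import json

# Hand-written sample that mirrors the key structure used above; all values are made up
sample = """
{"id": "examplecoin", "name": "Example Coin", "symbol": "exc",
 "market_data": {"current_price": {"usd": 1.5, "eur": 1.4},
                 "market_cap": {"usd": 1000000}}}
"""

def get_nested(document, *keys, default=None):
    """Walk down nested dictionaries; return default if any key is missing."""
    current = document
    for key in keys:
        if not isinstance(current, dict) or key not in current:
            return default
        current = current[key]
    return current

data = json.loads(sample)
price_usd = get_nested(data, "market_data", "current_price", "usd")
volume_usd = get_nested(data, "market_data", "total_volume", "usd")  # key absent in the sample
print(price_usd, volume_usd)
```

A helper like this is mainly useful when looping over many coins, where a single missing field would otherwise abort the whole run.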


## Websites

If no API is available, the information of interest has to be scraped directly from the HTML code of the website itself.


### HTML -  Hypertext Markup Language

[HTML](https://en.wikipedia.org/wiki/HTML) is a language for structuring digital documents. It consists of multiple elements that are organized in a tree structure. Elements are usually built from three parts:

```html
<a href="https://www.hu-berlin.de/">Link to HU Berlin</a>
```

1. Opening and closing **Tags**.
2. **Attributes** are set within the tags
3. The **Text** that needs to be structured
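
The three parts can be made visible with Python's built-in `html.parser` module. This is a minimal sketch; the class name `PartsCollector` is an assumption made for this example, and BeautifulSoup, introduced later, is far more convenient for real scraping.

```python
from html.parser import HTMLParser

class PartsCollector(HTMLParser):
    """Record opening tags with their attributes, text, and closing tags in order."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_starttag(self, tag, attrs):
        self.parts.append(("start tag", tag, dict(attrs)))

    def handle_data(self, data):
        if data.strip():  # ignore whitespace-only text
            self.parts.append(("text", data))

    def handle_endtag(self, tag):
        self.parts.append(("end tag", tag))

collector = PartsCollector()
collector.feed('<a href="https://www.hu-berlin.de/">Link to HU Berlin</a>')
collector.close()

for part in collector.parts:
    print(part)
```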

What we see in the browser is the rendered interpretation of the HTML document. Commonly used tags include:


````
Elements: <head>, <body>, <footer>...
Components: <title>,<h1>, <div>...
Text Styles: <b>, <i>, <strong>...
Hyperlinks: <a>
````

Besides HTML, CSS and JavaScript are also relevant for webscraping:

#### CSS
- [Cascading Style Sheets](https://en.wikipedia.org/wiki/Cascading_Style_Sheets) (CSS) describe the formatting, for example the colouring, of HTML components (e.g. ``<h1>``, ``<div>``...)
- CSS is useful for us because CSS selectors can be used to locate HTML elements.

#### Javascript
- [Javascript](https://en.wikipedia.org/wiki/JavaScript) extends the functionality of websites (e.g. hide and display certain objects based on user input)

### HTML in Chrome Browser

- Open the [HU Berlin website](https://www.hu-berlin.de) in Chrome and open Chrome's Developer Tools.
- Hover over different elements within the Developer Console and check which parts are highlighted on the regular website.
- In the Developer Console you can see all relevant information about a given HTML object, e.g. `id`, `class`, etc.

### BeautifulSoup

BeautifulSoup is a Python parser package that reads HTML and XML strings. The documentation can be found [here](http://www.crummy.com/software/BeautifulSoup/bs4/doc/#).

In [79]:
from bs4 import BeautifulSoup

html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
"""

# requests package lets you request the HTML code of a certain website 
# -> requests.get("type in your URL of interest here").text 

soup = BeautifulSoup(html_doc, "html5lib") 
# html5lib -> parser; install it via pip beforehand if necessary

Subsequently one can retrieve certain attributes from the tree structure:

In [80]:
print(soup.title.text)

The Dormouse's story


In particular, searching the complete HTML document for certain tags helps us find the information we are interested in (e.g. all paragraphs or all links on a webpage):

In [82]:
soup.find_all('p')

[<p class="title"><b>The Dormouse's story</b></p>,
 <p class="story">Once upon a time there were three little sisters; and their names were
 <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
 <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
 <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
 and they lived at the bottom of a well.</p>,
 <p class="story">...</p>]

Additionally, elements can be selected by their id, href, or class:

In [83]:
print('id',  soup.find(id="link2"))
print('----')
print('href', soup.find(href='http://example.com/lacie'))
print('----')
print('class', soup.find(class_='story'))

id <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
----
href <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
----
class <p class="story">Once upon a time there were three little sisters; and their names were
<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>


You can also search the document by applying regular expressions:

In [None]:
import re

# again find all links
soup.find_all('a', id=re.compile(r'link\d+'))  # \d is short for [0-9]; the raw string (r'...') avoids an invalid-escape warning
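
The pattern itself can be tested without any HTML. The sketch below applies it to a hand-written list of candidate ids (the list is an assumption made for illustration) and contrasts `search`, which matches anywhere in the string, with `fullmatch`, which requires the whole string to match.

```python
import re

pattern = re.compile(r'link\d+')  # "link" followed by at least one digit

# Hand-written candidate ids for illustration
candidate_ids = ["link1", "link2", "link3", "title", "link", "mylink12"]

# search() accepts a match anywhere in the string
found_anywhere = [c for c in candidate_ids if pattern.search(c)]

# fullmatch() requires the entire string to match the pattern
found_exact = [c for c in candidate_ids if pattern.fullmatch(c)]

print(found_anywhere)
print(found_exact)
```

If only ids of the exact form `link` plus digits are wanted, anchoring the pattern (for example `r'^link\d+$'`) is the safer choice.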

#### Summary -> The Web Scraping - How To

1.  Thoroughly inspect the website structure
2.  Choose your scraping strategy
3.  Write your Prototype: Extract, process and validate data
4.  Generalize: Functions, Loops, Debugging
5.  Data preparation (Store, clean and make accessible)

## Exercises

### 1. VCRIX API

Write a function that runs automatically once a day and downloads the HF Crix (http://data.thecrix.de/data/crix_hf.json). The returned results shall be written to a CSV file.

In [None]:
# Type your solution here

### 2. HU Berlin

Write a function that saves all external links on the main page of [HU Berlin](https://www.hu-berlin.de/de), together with the timestamp of retrieval, to a CSV file.

In [None]:
# Type your solution here