
# Week 05 - Lecture Walkthrough (45 min)

In this part of the lecture, we will explore how to acquire information from web pages using the `requests` library in Python.

We will work with both HTML and APIs. 

The HTML part will consist of the following steps:

1. Getting a response from a web page
2. Parsing the response
3. Navigating the HTML code
4. Extracting information

The API part will involve sending several different requests.

## HTML

1. Let's go to [Google's Key ML Terminology site](https://developers.google.com/machine-learning/crash-course/framing/ml-terminology) and explore the server's response using the browser's **Inspect element** tool.
2. Change some of the HTML. For instance, edit the text of a heading.

## Sending a request

In [2]:
# importing required packages 
import requests 
from bs4 import BeautifulSoup

In [49]:
# sending a request to a website
response_google = requests.get('https://developers.google.com/machine-learning/crash-course/framing/ml-terminology')

# printing the response 
print(response_google)

<Response [200]>


### Common response status codes
**200** OK  
**204** No Content  
**400** Bad Request  
**401** Unauthorized  
**402** Payment Required   
**403** Forbidden  
**404** Not Found  
**500** Internal Server Error  
**502** Bad Gateway  
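
In `requests`, the numeric code is available as `response.status_code`. The sketch below shows one way to turn such a code into a readable label using only the standard library. The helper name `describe_status` and its category labels are illustrative assumptions, not part of `requests`.

```python
from http import HTTPStatus


def describe_status(code: int) -> str:
    """Return a label such as '404 Not Found (client error)'.

    Raises ValueError for codes that HTTPStatus does not know.
    """
    phrase = HTTPStatus(code).phrase
    if 200 <= code < 300:
        category = "success"
    elif 400 <= code < 500:
        category = "client error"
    elif 500 <= code < 600:
        category = "server error"
    else:
        category = "other"
    return f"{code} {phrase} ({category})"


# In the notebook this could be called as describe_status(response_google.status_code)
for code in (200, 404, 502):
    print(describe_status(code))
```

Codes in the 4xx range usually mean the request itself needs fixing, while 5xx codes point to a problem on the server side.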

---

Let's try to request the week 12 page of our course, which does not exist!

In [22]:
# sending a request to a website
response_not_found = requests.get('https://lse-dsi.github.io/lse-ds105-course-notes/weeks/week12.html')

# printing the response 
print(response_not_found)

<Response [404]>


---
Let's go back to our Google example and explore the headers.

---

In [50]:
# looking inside the response
print(response_google.headers)

print(response_google.url)

{'Last-Modified': 'Mon, 18 Jul 2022 20:59:31 GMT', 'Content-Type': 'text/html; charset=utf-8', 'Set-Cookie': '_ga_devsite=GA1.3.456951144.1665757388; Expires=Sun, 13 Oct 2024 14:23:08 GMT; Max-Age=63072000; Path=/', 'Content-Security-Policy': "base-uri 'self'; object-src 'none'; script-src 'strict-dynamic' 'unsafe-inline' https: http: 'nonce-UqNFxTjmFnYkDKoNFJOdeLOmN5qwUh' 'unsafe-eval'; report-uri https://csp.withgoogle.com/csp/devsite/v2", 'Strict-Transport-Security': 'max-age=63072000; includeSubdomains; preload', 'X-Frame-Options': 'SAMEORIGIN', 'X-XSS-Protection': '0', 'X-Content-Type-Options': 'nosniff', 'Cache-Control': 'no-cache, must-revalidate', 'Expires': '0', 'Pragma': 'no-cache', 'Content-Encoding': 'gzip', 'X-Cloud-Trace-Context': '23b2c0b512875725bf483e2d6696b581', 'Vary': 'Accept-Encoding', 'Date': 'Fri, 14 Oct 2022 14:23:08 GMT', 'Server': 'Google Frontend', 'Alt-Svc': 'h3=":443"; ma=2592000,h3-29=":443"; ma=2592000,h3-Q050=":443"; ma=2592000,h3-Q046=":443"; ma=2592000

## Parsing the response

In [None]:
# parsing the response (naming the parser explicitly avoids a bs4 warning)
soup = BeautifulSoup(response_google.content, 'html.parser')

# looking inside the soup (displays a very large amount of text)
soup

## Extract one `<h2>` header

In [54]:
# extract the first h2 header
print(soup.find('h2'))

<h2 class="hide-from-toc" data-text="Labels" id="labels">Labels</h2>


In [55]:
# get text from it
print(soup.find('h2').get_text())

Labels


## Extract all the `<h2>` headers

In [63]:
# extract all h2 headers
print(soup.find_all('h2'))

[<h2 class="hide-from-toc" data-text="Labels" id="labels">Labels</h2>, <h2 class="hide-from-toc" data-text="Features" id="features">Features</h2>, <h2 class="hide-from-toc" data-text=" Examples" id="examples"> Examples</h2>, <h2 class="hide-from-toc" data-text=" Models" id="models"> Models</h2>, <h2 class="hide-from-toc" data-text=" Regression vs. classification" id="regression-vs.-classification"> Regression vs. classification</h2>]


In [65]:
# extract text from each of them
headers = soup.find_all('h2')

for head in headers:
    print(head.get_text().strip())

Labels
Features
Examples
Models
Regression vs. classification
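
The same loop can be written more compactly. Below is a minimal sketch on a small, made-up HTML snippet (`sample_html` is a stand-in for the downloaded page, so it runs without a network connection). It also shows that `find` returns `None` when nothing matches, which is worth checking before calling `get_text`.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML snippet standing in for the downloaded page
sample_html = """
<html><body>
  <h2 id="labels">Labels</h2>
  <h2 id="examples"> Examples</h2>
</body></html>
"""

sample_soup = BeautifulSoup(sample_html, "html.parser")

# get_text(strip=True) removes the surrounding whitespace for us
h2_titles = [h.get_text(strip=True) for h in sample_soup.find_all("h2")]
print(h2_titles)

# find() returns None when no matching tag exists, so guard before using it
missing = sample_soup.find("h3")
h3_title = missing.get_text(strip=True) if missing is not None else None
print(h3_title)
```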


## Extracting other attributes

Let's extract the links to key terms at the bottom of the page.

In [75]:
# find the <aside> element with class "key-term" and collect all links inside it
soup.find('aside', attrs={'class':'key-term'}).find_all('a')

[<a href="/machine-learning/glossary#classification_model" target="G">classification model</a>,
 <a href="/machine-learning/glossary#example" target="G">example</a>,
 <a href="/machine-learning/glossary#feature" target="G">feature</a>,
 <a href="/machine-learning/glossary#inference" target="G">inference</a>,
 <a href="/machine-learning/glossary#label" target="G">label</a>,
 <a href="/machine-learning/glossary#model" target="G">model</a>,
 <a href="/machine-learning/glossary#regression_model" target="G">regression model</a>,
 <a href="/machine-learning/glossary#training" target="G">training</a>]

In [74]:
# extract one link 
soup.find('aside', attrs={'class':'key-term'}).find_all('a')[0].get('href')

'/machine-learning/glossary#classification_model'

In [76]:
# extract links one by one
all_terms = soup.find('aside', attrs={'class':'key-term'}).find_all('a')

for term in all_terms:
    print(term.get('href'))

/machine-learning/glossary#classification_model
/machine-learning/glossary#example
/machine-learning/glossary#feature
/machine-learning/glossary#inference
/machine-learning/glossary#label
/machine-learning/glossary#model
/machine-learning/glossary#regression_model
/machine-learning/glossary#training


In [78]:
# build full links by prepending the domain
for term in all_terms:
    print("https://developers.google.com" + term.get('href'))

https://developers.google.com/machine-learning/glossary#classification_model
https://developers.google.com/machine-learning/glossary#example
https://developers.google.com/machine-learning/glossary#feature
https://developers.google.com/machine-learning/glossary#inference
https://developers.google.com/machine-learning/glossary#label
https://developers.google.com/machine-learning/glossary#model
https://developers.google.com/machine-learning/glossary#regression_model
https://developers.google.com/machine-learning/glossary#training
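
String concatenation works here because every `href` starts with `/`. A slightly more general option, sketched below, is `urljoin` from the standard library, which resolves a relative link against the URL of the page it came from. The list `relative_links` is a hand-copied subset of the output above, used for illustration only.

```python
from urllib.parse import urljoin

# URL of the page the links were scraped from
page_url = "https://developers.google.com/machine-learning/crash-course/framing/ml-terminology"

# A hand-copied subset of the relative links printed above
relative_links = [
    "/machine-learning/glossary#label",
    "/machine-learning/glossary#model",
]

# urljoin resolves each relative link against the page URL
full_links = [urljoin(page_url, href) for href in relative_links]

for link in full_links:
    print(link)
```

In the notebook, the same call could be applied as `urljoin(response_google.url, term.get('href'))` inside the loop.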


## APIs 

In this part, we will explore a web API and see how to send requests and read the responses.

We will use the [Frankfurter API](https://www.frankfurter.app/docs/), which provides exchange rates for many currencies.

In [86]:
# save the base url
base_url = 'https://api.frankfurter.app'

In [99]:
# send a request to the API for the latest rates
API_response = requests.get(base_url + '/latest')

# print the response code
print(API_response)

# inspect the content
API_response.json()

<Response [200]>


{'amount': 1.0,
 'base': 'EUR',
 'date': '2022-10-14',
 'rates': {'AUD': 1.5493,
  'BGN': 1.9558,
  'BRL': 5.1177,
  'CAD': 1.3426,
  'CHF': 0.9757,
  'CNY': 6.9952,
  'CZK': 24.587,
  'DKK': 7.4378,
  'GBP': 0.86823,
  'HKD': 7.6278,
  'HRK': 7.5266,
  'HUF': 418.24,
  'IDR': 15032,
  'ILS': 3.444,
  'INR': 79.97,
  'ISK': 140.5,
  'JPY': 143.63,
  'KRW': 1398.5,
  'MXN': 19.5032,
  'MYR': 4.5689,
  'NOK': 10.3323,
  'NZD': 1.7302,
  'PHP': 57.375,
  'PLN': 4.8328,
  'RON': 4.9335,
  'SEK': 11.0035,
  'SGD': 1.3852,
  'THB': 37.109,
  'TRY': 18.0614,
  'USD': 0.9717,
  'ZAR': 17.6932}}
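
Since `.json()` returns an ordinary Python dictionary, the rates can be used directly in calculations. A minimal sketch, assuming a hand-copied subset of the response above (`sample_response` and `convert_from_base` are illustrative names, not part of the API):

```python
# A subset of the response shown above, copied by hand for illustration
sample_response = {
    "amount": 1.0,
    "base": "EUR",
    "date": "2022-10-14",
    "rates": {"GBP": 0.86823, "JPY": 143.63, "USD": 0.9717},
}


def convert_from_base(amount, currency, response):
    """Convert `amount` of the base currency into `currency`."""
    return amount * response["rates"][currency]


# How many pounds would 100 euros buy at these rates?
gbp = convert_from_base(100, "GBP", sample_response)
print(round(gbp, 2))
```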

### Adding parameters

Now let's add request parameters to choose the base currency and the target currency.

In [98]:
# creating parameters
params = {"from": "USD", 
         "to": "GBP"}

# run the query with the parameters
API_response = requests.get(base_url + '/latest', params=params)

# inspect the content
API_response.json()

{'amount': 1.0, 'base': 'USD', 'date': '2022-10-14', 'rates': {'GBP': 0.89352}}
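
Behind the scenes, the `params` dictionary ends up as a query string appended to the URL. The sketch below reproduces that step with `urlencode` from the standard library as a stand-in. It is meant to illustrate the idea rather than to replace `requests`, and the variable name `full_url` is an assumption for this example.

```python
from urllib.parse import urlencode

base_url = "https://api.frankfurter.app"
params = {"from": "USD", "to": "GBP"}

# urlencode turns the dictionary into "key=value" pairs joined by "&"
query_string = urlencode(params)
full_url = base_url + "/latest?" + query_string

print(full_url)
```

In the notebook, printing `API_response.url` shows the URL that `requests` actually used, which is a quick way to check that the parameters were passed as intended.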

### Maybe even more parameters?

In [100]:
# creating parameters
params = {"from": "USD", 
         "to": "GBP,JPY"}

# run the query with the parameters
API_response = requests.get(base_url + '/2020-01-01..2020-01-31', params=params)

# inspect the content
API_response.json()

{'amount': 1.0,
 'base': 'USD',
 'start_date': '2020-01-02',
 'end_date': '2020-01-31',
 'rates': {'2020-01-02': {'GBP': 0.75787, 'JPY': 108.77},
  '2020-01-03': {'GBP': 0.76357, 'JPY': 108.14},
  '2020-01-06': {'GBP': 0.76126, 'JPY': 108.11},
  '2020-01-07': {'GBP': 0.76247, 'JPY': 108.44},
  '2020-01-08': {'GBP': 0.76354, 'JPY': 108.74},
  '2020-01-09': {'GBP': 0.76764, 'JPY': 109.4},
  '2020-01-10': {'GBP': 0.76467, 'JPY': 109.64},
  '2020-01-13': {'GBP': 0.77081, 'JPY': 109.88},
  '2020-01-14': {'GBP': 0.77029, 'JPY': 110.05},
  '2020-01-15': {'GBP': 0.76901, 'JPY': 109.88},
  '2020-01-16': {'GBP': 0.76524, 'JPY': 109.95},
  '2020-01-17': {'GBP': 0.76616, 'JPY': 110.11},
  '2020-01-20': {'GBP': 0.76928, 'JPY': 110.18},
  '2020-01-21': {'GBP': 0.765, 'JPY': 110.04},
  '2020-01-22': {'GBP': 0.76159, 'JPY': 109.97},
  '2020-01-23': {'GBP': 0.76186, 'JPY': 109.55},
  '2020-01-24': {'GBP': 0.76405, 'JPY': 109.61},
  '2020-01-27': {'GBP': 0.76515, 'JPY': 108.94},
  '2020-01-28': {'GBP': 