**LSE DATA SCIENCE INSTITUTE** 


[DS105M](https://lse-dsi.github.io/lse-ds105-course-notes/) | **Week 05: Data types & data frames**

---


🎯 **OBJECTIVE:** Recap of essential concepts of programming & Introduction of the data frame

👨‍💻 **AUTHOR:** [@jonjoncardoso](https://github.com/jonjoncardoso)

📅 **LAST UPDATED:** 25 October 2022


---


# Setup

If you want to get the most out of this session, install [Python](https://www.python.org/downloads/) and [Jupyter Lab](https://jupyter.org/install) on your computer before coming to the lecture.

Or, if you are learning R, ensure you have R and the RStudio IDE installed. The main content is on Jupyter but I will switch to RStudio 

# Webscraping

1. Let's go to [Google's Key ML Terminology site](https://developers.google.com/machine-learning/crash-course/framing/ml-terminology) and explore the server's response using the browser's **Inspect element** functionality.
2. Change some HTML information. For instance, change the name of the heading. 

### Importing the packages we need

In [None]:
# importing required packages 
import requests 
from bs4 import BeautifulSoup

⚠️ **If you try to replicate this in your own Python environment, the lines above might throw an error.** 

You might need to install these packages first.

Close Python, open your terminal and type:

```shell
pip install requests==2.28.1
pip install beautifulsoup4==4.11.1
```

Then, try opening the Python shell and importing the packages again.

💡**Tip: A good way to find out more about a particular package is to read its documentation page.**

You can usually find a link on the **pypi** page of the package:

- [requests](https://pypi.org/project/requests/)
- [beautifulsoup4](https://pypi.org/project/beautifulsoup4/)

### Send a request to Google ML course website

We use `requests` for this:

In [None]:
# sending a request to a website
response_google = requests.get('https://developers.google.com/machine-learning/crash-course/framing/ml-terminology')

# printing the response 
print(response_google)

📜 **Other possible response status codes**

**200** OK  
**204** No Content  
**400** Bad Request  
**401** Unauthorized  
**402** Payment Required   
**403** Forbidden  
**404** Not Found  
**500** Internal Server Error  
**502** Bad Gateway  
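
To make the list above easier to remember, here is a minimal sketch that turns a status code into a readable label using Python's built-in `http.HTTPStatus`. The helper name `describe_status` is made up for this illustration; it is not part of `requests`.

```python
from http import HTTPStatus


def describe_status(code: int) -> str:
    """Return a short label for an HTTP status code (hypothetical helper)."""
    status = HTTPStatus(code)  # raises ValueError if the code is unknown

    # The first digit of the code tells you the broad family
    if 200 <= code < 300:
        family = "success"
    elif 400 <= code < 500:
        family = "client error"
    elif 500 <= code < 600:
        family = "server error"
    else:
        family = "other"

    return f"{code} {status.phrase} ({family})"


for code in (200, 404, 502):
    print(describe_status(code))
```

With a real response object, the integer code is available as `response_google.status_code`, so you could pass that value to a helper like this one.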

---

Let's try to get to week 12 for our course!

In [None]:
# sending a request to a page that does not exist
response_not_found = requests.get('https://lse-dsi.github.io/lse-ds105-course-notes/weeks/week12.html')

# printing the response 
print(response_not_found)

---
Let's go back to our Google example and explore the headers.

---

In [None]:
# looking inside the response
print(response_google.headers)

Convert the headers to a regular Python dictionary:



In [None]:
dict(response_google.headers)

What else is returned?

In [None]:
response_google.url ## URL

In [None]:
response_google.reason

### Reading the content

Perhaps the most important data you can get out of this response is the `content`. This is what we normally want from a `response` object: what is in this webpage?

What is the **data type** of `response_google.content`?

In [None]:
type(response_google.content)

**💡Tip:** We will explore the different data types in [Week 05](https://lse-dsi.github.io/lse-ds105-course-notes/main/syllabus.html).

Let's decode these bytes into a string:

In [None]:
response_google.content.decode("utf-8")

Double check it is a string:

In [None]:
type(response_google.content.decode("utf-8"))

Let's print the first 2000 characters of that string:

In [None]:
print(response_google.content.decode("utf-8")[0:2000])

This is the same HTML you would see using the **Inspect element** tool of your browser!
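
If the difference between bytes and strings is still fuzzy, the small sketch below may help. It uses a made-up four-letter string rather than the web page, so you can run it without an internet connection.

```python
# A string is a sequence of characters
text = "café"

# Encoding turns the characters into raw bytes (what travels over the network)
raw = text.encode("utf-8")
print(type(raw), raw)        # the "é" takes two bytes in UTF-8
print(len(text), len(raw))   # number of characters vs number of bytes

# Decoding with the same encoding recovers the original string
decoded = raw.decode("utf-8")
print(decoded == text)

# Decoding with a different encoding does not fail here,
# but it silently produces the wrong characters
wrong = raw.decode("latin-1")
print(wrong)
```

This is why it matters to decode with the encoding the server actually used. The `requests` library also offers `response_google.text`, which decodes the content for you using the encoding it detects.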

## _Parse_ the content

We don't just want to read this HTML, we want to **extract** data from it in a structured way.

The [BeautifulSoup](https://www.crummy.com/software/BeautifulSoup/bs4/doc/) library is extremely useful for this task.

In [None]:
# parsing the response (naming the parser explicitly avoids a bs4 warning)
soup = BeautifulSoup(response_google.content, 'html.parser')

# looking inside the soup (RETURNS A VERY LARGE TEXT)
soup

## Extract one `<h2>` header

In [None]:
# extract the first h2 header
print(soup.find('h2'))

In [None]:
# get text from it
print(soup.find('h2').get_text())

## Extract all the `<h2>` headers

In [None]:
# extract all h2 headers
print(soup.find_all('h2'))

How many `<h2>` headers are there in total?

In [None]:
len(soup.find_all('h2'))

In [None]:
# extract text from each of them
headers = soup.find_all('h2')

for head in headers:
    print(head.get_text().strip())
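
To see the difference between `find` and `find_all` without depending on the live Google page, here is a minimal sketch on a tiny HTML snippet. The snippet is invented for this illustration and is not the real page content.

```python
from bs4 import BeautifulSoup

# A tiny, made-up HTML document (NOT the real Google page)
html = """
<html><body>
  <h2> Labels </h2>
  <p>A label is the thing we are predicting.</p>
  <h2>Features</h2>
</body></html>
"""

soup = BeautifulSoup(html, "html.parser")

# find() returns only the first matching tag
first = soup.find("h2")
print(first.get_text().strip())

# find_all() returns a list-like collection of every matching tag
titles = [h.get_text().strip() for h in soup.find_all("h2")]
print(titles)

# find() returns None when nothing matches, so check before calling get_text()
missing = soup.find("h3")
print(missing)
```

Keep the last case in mind: calling `.get_text()` on a `None` result raises an `AttributeError`, which is a common source of errors when a page changes its layout.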

## Extracting other attributes

Let's extract the links to key terms at the bottom of the page:

In [None]:
# find the 'key-term' <aside> by its class attribute, then all links inside it
soup.find('aside', attrs={'class':'key-term'}).find_all('a')

In [None]:
# extract one link 
soup.find('aside', attrs={'class':'key-term'}).find_all('a')[0].get('href')

In [None]:
# extract links one by one
all_terms = soup.find('aside', attrs={'class':'key-term'}).find_all('a')

for term in all_terms:
    print(term.get('href'))

In [None]:
# maybe create a full link?
for term in all_terms:
    print("https://developers.google.com" + term.get('href'))
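
Gluing strings together works when every `href` starts with `/`, but links on real pages can also be absolute or relative to the current page. As a sketch of a more robust alternative, the standard library's `urllib.parse.urljoin` handles all three cases. The `href` values below are hypothetical examples, not taken from the page.

```python
from urllib.parse import urljoin

# The page we scraped
page_url = "https://developers.google.com/machine-learning/crash-course/framing/ml-terminology"

# Hypothetical href values of the three common kinds
hrefs = [
    "/machine-learning/glossary#label",  # relative to the site root
    "https://example.com/page",          # already a full URL
    "video-lecture",                     # relative to the current page
]

# urljoin resolves each href the same way a browser would
full_links = [urljoin(page_url, href) for href in hrefs]

for link in full_links:
    print(link)
```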

# APIs 

In this part we will explore a web API and see how to send requests and get responses. 

We will use the [Frankfurter API](https://www.frankfurter.app/docs/), which provides exchange rates for many different currencies. 

In [None]:
# save the base url
base_url = 'https://api.frankfurter.app'

In [None]:
# send a request to the API for the latest rates
API_response = requests.get(base_url + '/latest')

# print the response code
print(API_response)

# inspect the content
API_response.json()

### Adding parameters

Now let's add more request parameters.

In [None]:
# creating parameters
params = {"from": "USD", 
         "to": "GBP"}

# run the query with the parameters
API_response = requests.get(base_url + '/latest', params=params)

# inspect the content
API_response.json()
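
What does `params=params` actually do? Roughly speaking, the dictionary is turned into a query string and appended to the URL. The sketch below imitates that step with the standard library; `build_url` is a made-up helper for illustration and not how you would normally call the API.

```python
from urllib.parse import urlencode


def build_url(base_url: str, path: str, params: dict) -> str:
    """Append a dictionary of parameters to a URL as a query string (hypothetical helper)."""
    query = urlencode(params)  # e.g. {"a": "1", "b": "2"} becomes "a=1&b=2"
    if query:
        return f"{base_url}{path}?{query}"
    return f"{base_url}{path}"


url = build_url("https://api.frankfurter.app", "/latest", {"from": "USD", "to": "GBP"})
print(url)
```

To check the URL that `requests` really used, inspect `API_response.url` after sending the request.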

### Maybe even more parameters?

In [None]:
# creating parameters
params = {"from": "USD", 
         "to": "GBP,JPY"}

# run the query with the parameters
API_response = requests.get(base_url + '/2020-01-01..2020-01-31', params=params)

# inspect the content
API_response.json()
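
The `.json()` method gives you plain Python dictionaries, so you can navigate the result with ordinary loops. The sketch below assumes the time-series response is a nested dictionary keyed first by date and then by currency, which is my reading of the API's output. The numbers are invented placeholders, **not real exchange rates**.

```python
# A made-up sample shaped like the assumed response (values are NOT real rates)
sample = {
    "amount": 1.0,
    "base": "USD",
    "start_date": "2020-01-02",
    "end_date": "2020-01-03",
    "rates": {
        "2020-01-02": {"GBP": 0.75, "JPY": 108.0},
        "2020-01-03": {"GBP": 0.76, "JPY": 108.5},
    },
}

# Flatten the nested dictionary into one record per (date, currency) pair
rows = []
for date, rates_on_day in sorted(sample["rates"].items()):
    for currency, rate in rates_on_day.items():
        rows.append({"date": date, "currency": currency, "rate": rate})

for row in rows:
    print(row)
```

A flat list of records like `rows` is a convenient starting point for building a table later on.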

# What now?

- **(Practice)** You can try to do [this week's lab exercises](https://lse-dsi.github.io/lse-ds105-course-notes/weeks/week04/lab.html) by yourself before you attend the class on Friday. Try to reuse some of what you saw in this notebook.
- **(Challenge)** Write a script to parse [Amazon.co.uk's Today's Deals page](https://www.amazon.co.uk/deals?ref_=nav_cs_gb) and return a list with the names of all products listed on the first page.
- **(Challenge)** Write a script to parse the [BBC home page](https://www.bbc.co.uk/) and get a list of the main headlines.