# 🕸️ Web Scraping 🕸️
Web scraping is a technique that data scientists use to collect data and content from the internet.

Common data types:
+ images
+ text
+ product information
+ customer reviews
+ customer data

How does it work?
1. We make an HTTP request to a server.
2. The server sends us a response.
3. We extract the valuable information from the server response.
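
The three steps above can be sketched in a few lines. This is only a minimal sketch, assuming the `requests` library is installed; the helper names `fetch_html` and `extract_title` and the sample HTML are made up for illustration, and a regular expression is just one simple way to do the extraction step.

```python
import re

import requests


def fetch_html(url):
    """Steps 1 and 2: send an HTTP GET request and return the response body as text."""
    response = requests.get(url, timeout=10)
    response.raise_for_status()  # raise an error for 4xx/5xx responses
    return response.text


def extract_title(html):
    """Step 3: extract the text inside the <title> tag, or None if there is none."""
    match = re.search(r"<title>(.*?)</title>", html, re.IGNORECASE | re.DOTALL)
    return match.group(1).strip() if match else None


# Made-up sample page, so that step 3 can be tried without network access
sample_html = "<html><head><title>Example Domain</title></head><body></body></html>"
print(extract_title(sample_html))  # Example Domain
```

With network access, the two helpers can be chained, e.g. `extract_title(fetch_html("https://www.lyrics.com"))`.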

How to avoid getting blacklisted?
+ see [How to prevent getting blacklisted while scraping](https://www.scrapehero.com/how-to-prevent-getting-blacklisted-while-scraping/)

How to disguise as a browser?

```python
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36'}
```
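
A `headers` dictionary like this only has an effect once it is attached to a request. The following sketch, assuming the `requests` library, prepares a request without sending it, so the header that would go over the wire can be inspected offline. The target URL is taken from the exercise below; the variable names are illustrative.

```python
import requests

# Same browser-like User-Agent string as in the snippet above
headers = {
    "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) "
    "AppleWebKit/537.36 (KHTML, like Gecko) "
    "Chrome/39.0.2171.95 Safari/537.36"
}

# Without custom headers, requests identifies itself as "python-requests/<version>"
print(requests.utils.default_user_agent())

# Build the request without sending it and check which User-Agent it carries
prepared = requests.Request("GET", "https://www.lyrics.com", headers=headers).prepare()
print(prepared.headers["User-Agent"])
```

To actually send the request with these headers, pass them along: `requests.get(url, headers=headers)`.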

## Warm-up exercise
**Match the concepts with the correct descriptions:**

<table border="1">
    <tr>
        <th>Concept</th>
        <th>Description</th>
    </tr>
    <tr>
        <td>Requests</td>
        <td>Python library for sending HTTP GET and HTTP POST requests</td>
    </tr>
    <tr>
        <td>Regular Expressions</td>
        <td>Powerful language to find patterns in text</td>
    </tr>
    <tr>
        <td>Scrapy</td>
        <td>Python library for downloading entire web sites</td>
    </tr>
    <tr>
        <td>BeautifulSoup4</td>
        <td>Python library for parsing HTML pages</td>
    </tr>
    <tr>
        <td>Uniform Resource Locator (URL)</td>
        <td>Address of a website, file or similar</td>
    </tr>
    <tr>
        <td>HyperText Markup Language (HTML)</td>
        <td>Text format in which most web pages are written</td>
    </tr>
    <tr>
        <td>HyperText Transfer Protocol (HTTP)</td>
        <td>Method to send text messages from one computer to another, built on top of TCP/IP</td>
    </tr>
    <tr>
        <td>HTTP GET</td>
        <td>Request for a web page</td>
    </tr>
    <tr>
        <td>HTTP POST</td>
        <td>Request that sends data to the server, e.g. to submit large forms or upload files</td>
    </tr>
    <tr>
        <td>API</td>
        <td>Generic name for a (web) programming interface</td>
    </tr>
    <tr>
        <td>200</td>
        <td>Response indicating that a web page was successfully delivered</td>
    </tr>
    <tr>
        <td>404</td>
        <td>Response indicating that a web page could not be found</td>
    </tr>
</table>


## Sending HTTP requests with *requests*
We want to scrape song lyrics from [lyrics.com](https://www.lyrics.com).

In [None]:
import requests
import os

In [None]:
# Define URL
URL = ...

In [None]:
# Send an HTTP GET request to the URL
response = ...
response

In [None]:
# Check the status of the response
response...

In [None]:
# Get the content
content = response...
content


In [None]:
# Save the content to a file
...
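
For reference, one possible shape of the request, status check and save steps is sketched below. It is not the only solution to the cells above: the helper names `save_text` and `download_page` are hypothetical, and the sketch assumes the page is text that can be written as UTF-8.

```python
from pathlib import Path

import requests


def save_text(text, path):
    """Write text to path (creating parent folders if needed) and return the path."""
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(text, encoding="utf-8")
    return path


def download_page(url, path, timeout=10):
    """Send a GET request, stop on a non-successful status, and save the page to disk."""
    response = requests.get(url, timeout=timeout)
    response.raise_for_status()  # raises requests.HTTPError for 4xx/5xx status codes
    return save_text(response.text, path)
```

A call such as `download_page("https://www.lyrics.com", "pages/home.html")` would then store the page locally, so that parsing can be repeated without sending new requests.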

## 🍜 Parsing HTML with BeautifulSoup 🍜

**Read the code and discuss the questions below.**

```python
from bs4 import BeautifulSoup

html = """<html><head></head><body>
<h1>Hamlet</h1>
<ul class="cast"> 
  <li>Hamlet</li>
  <li>Polonius</li>
  <li>Ophelia</li>
  <li>Claudius</li>
</ul>
<ul class="authors">
  <li>William Shakespeare</li>
</ul>
</body></html>"""

soup = BeautifulSoup(html, "html.parser")

for ul in soup.find_all('ul'):
    if "cast" in ul.get('class'):
        for item in ul.find_all('li'):
            print(item.get_text(), end=", ")
        print()
```

+ Q: what is the data type of the HTML document?
  + A: ...
+ Q: what does the find_all() function return?
  + A: ...
+ Q: what does the argument of the find_all() function refer to?
  + A: ...  
+ Q: what does the argument of the get() function refer to? 
  + A: ...
+ Q: what does the get_text() function extract?
  + A: ...
+ Q: how would you extract the title of the play?
  + A: ...
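
The loop in the code above is one of several ways to reach the same elements. As a hedged alternative, the sketch below filters by class directly, once with a CSS selector (`select`) and once with the `class_` keyword of `find`. It reuses the HTML from the example; the variable names `cast` and `authors` are illustrative.

```python
from bs4 import BeautifulSoup

html = """<html><head></head><body>
<h1>Hamlet</h1>
<ul class="cast">
  <li>Hamlet</li>
  <li>Polonius</li>
  <li>Ophelia</li>
  <li>Claudius</li>
</ul>
<ul class="authors">
  <li>William Shakespeare</li>
</ul>
</body></html>"""

soup = BeautifulSoup(html, "html.parser")

# CSS selector: every <li> inside a <ul> that has the class "cast"
cast = [li.get_text(strip=True) for li in soup.select("ul.cast li")]
print(", ".join(cast))  # Hamlet, Polonius, Ophelia, Claudius

# Keyword filter: the first <ul> with class "authors", then its <li> children
authors_list = soup.find("ul", class_="authors")
authors = [li.get_text(strip=True) for li in authors_list.find_all("li")]
print(authors)  # ['William Shakespeare']
```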

In [None]:
from bs4 import BeautifulSoup

In [None]:
# Let's create a soup from content
soup_content = ...
soup_content

In [None]:
# Let's find the lyrics from the soup

In [None]:
# Let's extract the title


In [None]:
# Save the lyrics to a file with the title as the file name

## Web API
+ Very roughly speaking:
    + just a special URL that returns data, e.g. in JSON format
+ Advantages over web scraping:
    + JSON is easier to parse than HTML
    + Companies/organizations (the data owners) can:
        + keep more control and provide cleaner data
        + set up rate limits (e.g. 100 requests per minute)
        + collect data about you as well
        + charge for access once people depend on the API


![Beehive](api_example2.png)

Let's send a GET request to the API of this website: https://open-meteo.com

In [None]:
url = ...


In [None]:
# Send a GET request
response = requests...
json_response = ...

In [None]:
json_response
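
As a rough sketch of what such a request could look like: the endpoint `https://api.open-meteo.com/v1/forecast` and the parameter names `latitude`, `longitude` and `hourly` are assumptions based on the Open-Meteo documentation and should be checked there before use. The coordinates are an arbitrary example, and `fetch_forecast` is a hypothetical helper.

```python
import requests

# Assumed endpoint and query parameters (verify against the Open-Meteo docs)
API_URL = "https://api.open-meteo.com/v1/forecast"
params = {"latitude": 52.52, "longitude": 13.41, "hourly": "temperature_2m"}


def fetch_forecast(url=API_URL, query=params, timeout=10):
    """Send the GET request and return the decoded JSON body as a dictionary."""
    response = requests.get(url, params=query, timeout=timeout)
    response.raise_for_status()
    return response.json()


# Show the full URL that requests would build, without sending anything
prepared = requests.Request("GET", API_URL, params=params).prepare()
print(prepared.url)
```

Calling `fetch_forecast()` with network access returns a regular Python dictionary, which can be explored with `.keys()` like any other.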

### Further Readings 📚
+ [Disguising as a Browser](https://stackoverflow.com/questions/27652543/how-can-i-use-pythons-requests-to-fake-a-browser-visit-a-k-a-and-generate-user)
+ [More on Web API](https://developer.mozilla.org/en-US/docs/Learn/JavaScript/Client-side_web_APIs/Introduction)
+ [Learn Regular Expressions I](https://regexone.com)
+ [Learn Regular Expressions II](https://alf.nu/RegexGolf?world=regex&level=r00)
+ [Regular Expressions in Python](https://www.w3schools.com/python/python_regex.asp)
+ [Selenium: A program to mimic a Web Browser](https://www.selenium.dev/documentation/webdriver/getting_started/first_script/)