<div class='bar_title'></div>

*Introduction to Data Science (IDS)*

# Data Acquisition

Gunther Gust <br>
Chair for Enterprise AI<br>
Data Driven Decisions (D3) Group<br>
Center for Artificial Intelligence and Data Science (CAIDAS)

<img src="https://raw.githubusercontent.com/vhaus63/ids_data/main/d3.png" style="width:20%; float:left;" />

<img src="https://raw.githubusercontent.com/vhaus63/ids_data/main/CAIDASlogo.png" style="width:20%; float:left;" />

## Sources
This lecture relies on:
- https://github.com/kwaldenphd/apis-python
- https://github.com/oxylabs/Python-Web-Scraping-Tutorial/tree/main

<img src="https://raw.githubusercontent.com/vhaus63/ids_data/main/ao_data_acquisition.png" style="width:100%; float:left;" />

## Obtaining Data
Before diving into APIs and web scraping, it’s important to understand the different ways data scientists commonly obtain data. Here’s an overview of the primary methods:

### Open Datasets
Many organizations, universities, and governments make datasets publicly available online. These open datasets are usually provided in downloadable formats, such as CSV, Excel, or JSON, and are often already cleaned and well-documented (hopefully).

Examples:
- [Kaggle Datasets](https://www.kaggle.com/datasets)
- GitHub repositories, which often include datasets
- [Data.gov](https://data.gov/)
- [UCI Machine Learning Repo](https://archive.ics.uci.edu/)
- specifically for Würzburg: https://opendata.wuerzburg.de/pages/wue-dashboard/

### APIs
APIs provide a __structured and standardized way__ to retrieve data from an application or service. Companies and organizations often make APIs available to allow developers to access data in real-time, with data often delivered in formats like __JSON or XML.__

Examples:
- Twitter API
- Weather API
- NASA API
- Spotify API

Such APIs often provide (near) real-time data. Access may be restricted, and __rate limits may apply__ in some cases. Handling APIs is the first part of this lecture.

### Web Scraping
Web scraping involves extracting data __directly from websites__ by __parsing their HTML__ content. This technique is useful when there’s no API or dataset available, but the data is accessible on a website.

We will look into this topic in a minute.

### Data Collection via Surveys and Experiments

In cases where no existing data sources are available, researchers may conduct surveys or experiments to generate their own data. This method is commonly used in fields like social science and psychology, but it can also be highly relevant in computer science, where data is collected from sensors.

In general, if no open dataset is available for your use case and you can't conduct an experiment due to the nature of the data you are looking for or __time constraints,__ the two most advantageous ways to obtain data are __APIs and Web Scraping.__

## APIs

<img src="https://raw.githubusercontent.com/vhaus63/ids_data/main/api.jpg" style="width:40%;" />


An **API** (Application Programming Interface) is a __set of rules and protocols__ that allows one application to interact with another. APIs let developers access data and services from other applications without knowing how they're implemented.

If an API is available, using it is often __preferred over web scraping__: APIs return structured data that is easier to work with than HTML scraped from web pages, they are usually faster and more reliable, and they help ensure that you are accessing data legally and that it is up to date (more on this later).

### Making an API call
We use the `requests` module to send __HTTP requests__ (i.e., to request data over the web) from Python. The module provides a function for each HTTP method, which lets us interact with the API in different ways:

- **delete(url, args)** sends a DELETE request to the specified URL
- **get(url, params, args)** sends a GET request to the specified URL
- **head(url, args)** sends a HEAD request to the specified URL
- **patch(url, data, args)** sends a PATCH request to the specified URL
- **post(url, data, json, args)** sends a POST request to the specified URL
- **put(url, data, args)** sends a PUT request to the specified URL
- **request(method, url, args)** sends a request of the specified method to the specified URL

The HTTP request returns a __response object__ that includes whatever __data__ is returned as part of the API call.

Let's try to call the [GitHub API](https://docs.github.com/en/rest?) and search for Python repositories sorted by stars. 

In [None]:
import requests
import json
import pandas as pd
import lxml

In [None]:
# store API url
url = 'https://api.github.com/search/repositories?q=language:python&sort=stars'

# assign the headers- not always necessary, but something we have to do with the GitHub API
headers = {'Accept': 'application/vnd.github.v3+json'}

# assign the requests method
r = requests.get(url, headers=headers)

# print a status update for the requests command
print(f"Status code: {r.status_code}")

# store API response to variable
response_dict = r.json()

# process results
print(response_dict)

Let's go through the individual elements.

#### URL

Here's a breakdown of the URL:

- Base URL: `https://api.github.com/search/repositories` This is the endpoint for searching repositories on GitHub.
- Query Parameters: 
    - `q=language:python`: This filters the search to repositories where the primary language is Python.
    - `sort=stars`: This sorts the repositories by the number of stars, showing the most popular ones first.
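
The breakdown above can be reproduced in code. The following is a minimal sketch that uses only the standard library's `urllib.parse` to assemble the same URL from a base URL and a dictionary of query parameters, and then splits it back into its parts. The variable names are chosen for illustration only.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

base_url = "https://api.github.com/search/repositories"
query_params = {"q": "language:python", "sort": "stars"}

# urlencode percent-encodes reserved characters, so ":" becomes "%3A"
full_url = f"{base_url}?{urlencode(query_params)}"
print(full_url)

# Split the URL back into its parts to confirm the breakdown
parts = urlsplit(full_url)
print(parts.netloc)            # host
print(parts.path)              # endpoint path
print(parse_qs(parts.query))   # decoded query parameters
```

With Requests, the same dictionary can be passed via the `params` argument, for example `requests.get(base_url, params=query_params, headers=headers)`, so the query string does not have to be written by hand.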

#### Header of an HTTP Request

A **header of an HTTP request** is a key part of the request that provides metadata about the request being sent to the server. It is a collection of key-value pairs that convey additional information such as how to handle the request, client capabilities, and preferences.


HTTP request headers are part of the request message and are sent **after the request line** (e.g., `GET /index.html HTTP/1.1`) and **before the request body** (if any). They are structured as key-value pairs. Common request headers include:

- `Host`: Specifies the domain name of the server (required in HTTP/1.1).
- `User-Agent`: Provides information about the client (e.g., browser, device).
- `Accept`: Informs the server about the media types the client can process.
- `Authorization`: Contains credentials for authentication.
- `Cookie`: Sends cookies to the server.
- `Content-Type`: Indicates the MIME type of the body content.
- `Content-Length`: Specifies the size of the request body in bytes.
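
To make the position of the headers within a request more concrete, here is a small illustrative sketch that lays out an HTTP/1.1 request message as plain text. The helper `format_http_request` and the `User-Agent` value are hypothetical; in practice a library such as Requests builds this message for you.

```python
def format_http_request(method, path, headers, body=""):
    """Lay out an HTTP/1.1 request message as plain text (illustration only)."""
    request_line = f"{method} {path} HTTP/1.1"
    header_lines = [f"{name}: {value}" for name, value in headers.items()]
    # An empty line separates the headers from the (optional) body
    return "\r\n".join([request_line, *header_lines, "", body])


message = format_http_request(
    "GET",
    "/search/repositories?q=language:python&sort=stars",
    {
        "Host": "api.github.com",
        "User-Agent": "ids-lecture-example",  # hypothetical client name
        "Accept": "application/vnd.github.v3+json",
    },
)
print(message)
```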

#### Responses

When you interact with web servers (e.g., via APIs or web scraping), servers __respond with HTTP status codes__ that indicate the result of your request. Here are some common codes that can help you debug and manage your access:

- 200 OK: The request was successful, and the server returned the data.
- 201 Created: The request was successful, and something new was created (e.g., a new record).
- 400 Bad Request: The request was invalid, often due to missing or incorrect parameters.
- 401 Unauthorized: Authentication is required, and the provided credentials are missing or invalid.
- 403 Forbidden: The request was understood, but the server refuses to fulfill it (often due to permissions).
- 404 Not Found: The server couldn’t find the requested resource (e.g., a non-existent endpoint).
- 500 Internal Server Error: An error occurred on the server, unrelated to the request itself.
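
As a rough sketch of how these codes can be handled in code, the standard library's `http.HTTPStatus` enum maps a numeric code to its official phrase. The helper `describe_status` below is a hypothetical name used for illustration; it groups codes into the broad classes from the list above.

```python
from http import HTTPStatus


def describe_status(code):
    """Return the status phrase and a coarse category for a status code."""
    status = HTTPStatus(code)  # raises ValueError for unknown codes
    if 200 <= code < 300:
        category = "success"
    elif 400 <= code < 500:
        category = "client error"
    elif 500 <= code < 600:
        category = "server error"
    else:
        category = "other"
    return f"{code} {status.phrase}: {category}"


for code in (200, 201, 400, 401, 403, 404, 500):
    print(describe_status(code))
```

When working with Requests, calling `r.raise_for_status()` on a response is a common shortcut: it raises an exception for 4xx and 5xx codes and does nothing otherwise.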

Now we can start to explore the data returned by the API.

In [None]:
print(f"Total repositories: {response_dict['total_count']}")

repo_dicts = response_dict['items']
print(f"Repositories returned: {len(repo_dicts)}")

repo_dict = repo_dicts[0]
print("\nSelected information about first repository:")
print(f"Name: {repo_dict['name']}")
print(f"Owner: {repo_dict['owner']['login']}")
print(f"Stars: {repo_dict['stargazers_count']}")
print(f"Repository URL: {repo_dict['html_url']}")
print(f"Created: {repo_dict['created_at']}")
print(f"Updated: {repo_dict['updated_at']}")
print(f"Description: {repo_dict['description']}")
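
To look at more than the first repository, the same keys can be pulled out of every item. The sketch below is a hedged illustration: `sample_items` is a hypothetical stand-in for `response_dict['items']` with placeholder values, and `summarize_repos` is a helper name chosen for this example.

```python
# Hypothetical stand-in for response_dict['items']; all values are placeholders
sample_items = [
    {"name": "repo-a", "owner": {"login": "alice"}, "stargazers_count": 120},
    {"name": "repo-b", "owner": {"login": "bob"}, "stargazers_count": 340},
    {"name": "repo-c", "owner": {"login": "carol"}, "stargazers_count": 75},
]


def summarize_repos(items):
    """Flatten each repository dict to the few fields we care about."""
    return [
        {
            "name": item["name"],
            "owner": item["owner"]["login"],
            "stars": item["stargazers_count"],
        }
        for item in items
    ]


rows = summarize_repos(sample_items)
rows_by_stars = sorted(rows, key=lambda row: row["stars"], reverse=True)
for row in rows_by_stars:
    print(f"{row['name']:<10} {row['owner']:<8} {row['stars']:>6}")
```

Applied to the real response, `pd.DataFrame(summarize_repos(repo_dicts))` would give a table with one row per repository.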

Here is a GitHub repository with an extensive collection of public APIs you can try out:

https://github.com/public-apis/public-apis

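Many APIs require credentials. For services that use HTTP Basic authentication, you can pass them directly when calling the `requests.get()` function:
`requests.get(url, auth=('user', 'password'))`. Other APIs expect an API key or token instead, typically sent in a request header or as a query parameter; check the documentation of the API you are using.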
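
For token-based APIs, credentials usually travel in the `Authorization` header. The following is a minimal sketch under stated assumptions: the helper `build_auth_headers` and the environment variable name `GITHUB_TOKEN` are hypothetical, and the `Bearer` scheme shown here is one common convention that a given API may or may not use.

```python
import os


def build_auth_headers(token):
    """Return request headers, adding an Authorization header if a token is given."""
    headers = {"Accept": "application/vnd.github.v3+json"}
    if token:
        headers["Authorization"] = f"Bearer {token}"
    return headers


# Read the token from the environment instead of hard-coding it in the notebook
token = os.environ.get("GITHUB_TOKEN")  # hypothetical variable name
headers = build_auth_headers(token)
print(sorted(headers))  # print header names only, never the token itself
```

The resulting dictionary can then be passed as `headers=headers` to `requests.get()`.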

## Exercise 1
What is wrong in this API request? Correct the error.

In [None]:
joke_url = "https://official-joke-api.appspot.com/joke/random"
response = requests.get(joke_url)
response.json()

## Web Scraping

Web scraping is a technique to automatically extract data from websites. Unlike APIs, which provide structured data, web scraping involves fetching and parsing HTML.

In [None]:
from bs4 import BeautifulSoup

This fetches the Wikipedia page on web scraping and prints the first paragraph:

In [None]:
response = requests.get("https://en.wikipedia.org/wiki/Web_scraping")
bs = BeautifulSoup(response.text,"lxml")
print(bs.find("p").text)

### Components of Web Scraping Code in Python

The main building blocks of any web scraping code are:

1. Get HTML
2. Parse HTML into Python object
3. Save the data extracted

In most cases, there is no need to use a browser to get the HTML. While HTML contains the data, the other files that the browser loads, like images, CSS, JavaScript, etc., just make the website pretty and functional. Web scraping is focused on data. Thus in most cases, there is no need to get these helper files.

There will be some cases when you do need to open the browser. Python makes that easy too. 


Web scraping with Python is easy due to the many useful libraries available:
- The [Requests](https://docs.python-requests.org/en/master/) library is used to get the HTML files, bypassing the need to use a browser. We already saw that one in the API call.
- [BeautifulSoup](https://www.crummy.com/software/BeautifulSoup/) is used to convert the raw HTML into a Python object, also called parsing. We will be working with Version 4 of this library, also known as `bs4` or `BeautifulSoup4`.
- The [CSV](https://docs.python.org/3/library/csv.html) library is part of the standard Python installation. No separate installation is required.

In [None]:
url_to_parse = "https://en.wikipedia.org/wiki/Python_(programming_language)"
response = requests.get(url_to_parse)
print(response)
print(response.text)

This messy string is the HTML code that the website is made of. All the content, its position and formatting is specified here. In order to convert this string into something that can be queried to find the specific information we will use `BeautifulSoup`.

### BeautifulSoup
Beautiful Soup provides simple methods for __navigating, searching, and modifying the HTML__ (check out the [documentation](https://www.crummy.com/software/BeautifulSoup/bs4/doc/) for more usage examples). It also takes care of encoding: incoming documents are converted to Unicode and outgoing documents to UTF-8.
The first step is to choose a parser. `lxml` is a common choice; it needs to be installed separately.

In [None]:
import lxml

In [None]:
soup = BeautifulSoup(response.text, 'lxml')
print(soup.prettify()) #returns the document as a well-formatted, readable string with proper indentation

We can then use the following __syntax__ to access HTML tags:

In [None]:
print(soup.title)
print(soup.title.text)

Similarly `soup.h1` will return the **first** `h1` tag it finds:

In [None]:
soup.h1

### `find()` and `find_all()`

Perhaps the most commonly used methods are `find()` and `find_all()`. 

In current versions of Beautiful Soup, the signature of `find()` looks roughly like this:

`find(name=None, attrs={}, recursive=True, string=None, **kwargs)`

To understand how you can search for a certain element in a webpage, go to the webpage and click on `Inspect` to open the HTML view.
Once you have identified the information you are looking for, the `find()` method can be used to find elements based on `name`, `attrs` (attributes), or `string` (the text content; this argument was called `text` in older versions). This should cover most scenarios. For scenarios like finding by `class`, there is `**kwargs`, which can take other filters.



#### Example

Let’s open the [Wikipedia Python page](https://en.wikipedia.org/wiki/Python_(programming_language)) and get the __table of contents.__

The first step is to look at the __HTML markup for the table of contents__ that we want to extract.

Right-click on the part (a so-called division or `div`) that contains the table of contents and examine its markup. The whole table of contents sits in a `div` tag whose `id` attribute is set to `vector-toc`:
```html
<div id="vector-toc" class="vector-toc vector-pinnable-element">
```
<img src="https://raw.githubusercontent.com/vhaus63/ids_data/main/scraping_python.png" style="width:80%;" />




If we simply run `soup.find("div")`, it will return the first div it finds - similar to writing `soup.div`:

In [None]:
toc = soup.find("div")
print(toc.text)

#### Find & filter

Finding the table of contents needs __filtering, as we need a specific div.__ We are lucky in this case, as it has an `id` attribute. The following line of code extracts the div element:

In [None]:
toc = soup.find("div",id="vector-toc")
print(toc.text)

#### Further details

- Note the second parameter here - `id="vector-toc"`.  The find method does not have a named parameter `id`, but still this works because of the implementation of the filter using the `**kwargs`. (`**kwargs` is a way to pass a variable number of keyword arguments to a function. It allows you to accept any number of named arguments in a function call. The kwargs stands for "keyword arguments," and the ** syntax is what makes it special. See e.g. [this example](https://www.geeksforgeeks.org/args-kwargs-python/))

- Be careful with CSS classes, though. `class` is a reserved keyword in Python and cannot be used as a parameter name directly. There are two workarounds: first, use `class_` instead of `class`; second, pass a dictionary as the second argument.

- This means that the following two statements are equivalent:

In [None]:
toc2 = soup.find("div",class_="vector-toc vector-pinnable-element")
toc3 = soup.find("div",{"class": "vector-toc vector-pinnable-element"}) 


#### Filtering based on several criteria

The advantage of using a __dictionary__ is that __more than one attribute__ can be specified. For example, if you need to specify both class and id, you can use the find method in the following manner:

In [None]:
soup.find("div",{"class": "vector-toc vector-pinnable-element", "id":"vector-toc"})
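
The filtering variants above can be compared side by side on a small, self-contained snippet. The markup below is hypothetical and heavily simplified; it only mimics the attribute values discussed in this section, and the built-in `html.parser` is used so that no extra parser is needed.

```python
from bs4 import BeautifulSoup

# Hypothetical, simplified markup that mimics the attributes discussed above
snippet = """
<div class="header">Top</div>
<div id="vector-toc" class="vector-toc vector-pinnable-element">Contents</div>
<div class="vector-toc">Other</div>
"""
demo_soup = BeautifulSoup(snippet, "html.parser")

toc_classes = "vector-toc vector-pinnable-element"

first_div = demo_soup.find("div")                        # first div in document order
by_id = demo_soup.find("div", id="vector-toc")           # keyword filter on id
by_class_kw = demo_soup.find("div", class_=toc_classes)  # class_ keyword
by_class_dict = demo_soup.find("div", {"class": toc_classes})  # attrs dictionary
by_both = demo_soup.find("div", {"class": toc_classes, "id": "vector-toc"})
missing = demo_soup.find("div", id="does-not-exist")     # no match

print(first_div.text)
print(by_id.text, by_class_kw.text, by_class_dict.text, by_both.text)
print(missing)
```

Note that `find()` returns `None` when nothing matches, so it is worth checking the result before accessing `.text`.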

### `find_all()` to find multiple elements

Consider this scenario: the objective is to create a CSV file with two columns. The first column contains the __heading number__ and the second column contains the __heading text.__

To find multiple elements, we can use the `find_all()` method.

This method works the __same way as the `find()` method__, except that instead of a single element it returns a list of all elements that match the criteria. If we look at the source code, we can see that every heading sits inside a `div` with the class `vector-toc-text`. We can use `find_all()` to extract all of them:

In [None]:
toc_elements = soup.find_all("div",class_="vector-toc-text")

This returns all divs with the class `vector-toc-text`. Each numbered entry consists of two spans: the first holds the number of the entry in the ToC, the second holds the title. To work with this information properly, it is more useful to collect it as a list of dictionaries and turn that into a DataFrame.

In [None]:
toc_rows = []
for element in toc_elements:
    # Only numbered entries carry a span with the heading number
    number_span = element.find("span", class_="vector-toc-numb")
    if number_span:
        heading_number = number_span.text
        # The last span inside the div holds the heading text
        heading_text = element.find_all("span")[-1].text
        toc_rows.append({
            'heading_number': heading_number,
            'heading_text': heading_text,
        })

toc_data = pd.DataFrame(toc_rows)
toc_data

Now you could export the scraped information as a CSV file in order to work with it in another project.

In [None]:
toc_data.to_csv('toc.csv', index=False)
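
The difference between `find()` and `find_all()` is easiest to see on a tiny example. The snippet below is a hypothetical, simplified version of the ToC markup described above; it is a sketch for illustration and does not reproduce the real Wikipedia page.

```python
from bs4 import BeautifulSoup

# Hypothetical, simplified ToC markup modelled on the structure described above
snippet = """
<div class="vector-toc-text"><span class="vector-toc-numb">1</span><span>History</span></div>
<div class="vector-toc-text"><span class="vector-toc-numb">2</span><span>Design</span></div>
<div class="vector-toc-text"><span>Top</span></div>
"""
demo_soup = BeautifulSoup(snippet, "html.parser")

entries = demo_soup.find_all("div", class_="vector-toc-text")
print(len(entries))  # find_all() returns a list-like collection of all matches

rows = []
for entry in entries:
    number = entry.find("span", class_="vector-toc-numb")
    if number is None:
        continue  # find() returns None when nothing matches
    rows.append((number.text, entry.find_all("span")[-1].text))
print(rows)

print(demo_soup.find_all("table"))  # no match gives an empty list, not None
```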

## Exercise 2

Go to [Quotes to Scrape](http://quotes.toscrape.com/) and use the browser's Inspect tool.
Notice that each quote is inside a 
```html
<span class="text">
```
element. We’ll target only these elements.

Scrape all the quotes displayed on this page and print them.

## Common Data Formats
When you interact with APIs or scrape websites, you will encounter different data formats. Here are the main ones:

- JSON (JavaScript Object Notation): Commonly used with APIs. We already encountered it when using the GitHub API.
- HTML: Used in web pages and requires parsing to extract data.
- XML: Sometimes used by older APIs or services.

### JSON (JavaScript Object Notation)
JSON is a lightweight data format that is widely used in web APIs for exchanging data between a server and a client. It’s easy for both humans and machines to read and write. It is organized in a key-value structure, similar to Python dictionaries, making it easy to work with in Python. It can contain nested objects and arrays.

The example.json file contains a short example on how a JSON file might look.
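
As a short illustration of the key-value structure and nesting, the standard library's `json` module converts between JSON text and Python objects. The document below is a made-up example (it is not the content of `example.json`); its field names are assumptions chosen for this sketch.

```python
import json

# Hypothetical JSON document; the field names are made up for illustration
json_text = """
{
  "course": "Data Science",
  "modules": [
    {"title": "Introduction to Python", "weeks": 2},
    {"title": "Data Analysis with Pandas", "weeks": 3}
  ],
  "online": true
}
"""

data = json.loads(json_text)           # JSON text -> Python objects
print(type(data).__name__)             # a JSON object becomes a dict
print(data["modules"][0]["title"])     # nested access via keys and indices

total_weeks = sum(module["weeks"] for module in data["modules"])
print(total_weeks)

round_trip = json.dumps(data, indent=2)  # Python objects -> JSON text
print(round_trip)
```

JSON objects map to dictionaries, arrays to lists, and `true`/`false`/`null` to `True`/`False`/`None`.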

### HTML (HyperText Markup Language)
HTML is the standard markup language for creating web pages. When we scrape websites, we usually work with HTML documents to extract data displayed on the page, such as text, images, links, and tables. It uses tags to define elements (e.g., `<div>, <p>, <span>, <table>`). Elements are often nested, forming a tree structure (DOM - Document Object Model) that represents the layout of the webpage.

We already encountered this data type in the scraping example.

### XML (eXtensible Markup Language)
XML is another markup language like HTML but is designed to store and transport data. It’s commonly used in data interchange between systems, especially in older APIs or specific industries (e.g., banking, healthcare).

It uses nested tags to define data, similar to HTML, but it’s more flexible as developers define their own tags based on the type of data. It often has a hierarchical structure, making it useful for representing complex, nested data.

In [None]:
xml_content = """
<course>
    <name>Data Science</name>
    <module>
        <title>Introduction to Python</title>
        <duration>2 weeks</duration>
    </module>
    <module>
        <title>Data Analysis with Pandas</title>
        <duration>3 weeks</duration>
    </module>
</course>
"""

XML can be handled in Python via the xml.etree.ElementTree module:

In [None]:
import xml.etree.ElementTree as ET

In [None]:
root = ET.fromstring(xml_content)

for module in root.findall('module'):
    title = module.find('title').text
    print(title)

## Outlook: Advanced Data Collection Techniques


### Selenium: Working with Dynamic Websites

Selenium is especially useful for automating interactions on websites that rely heavily on JavaScript (programming language that enables __interactive web pages and dynamic user experiences__). It allows for actions such as:

- Clicking buttons
- Filling out forms
- Scrolling through pages
- Taking screenshots
- Running custom JavaScript

$\Rightarrow$ Great option for scraping data from dynamic websites. Unlike traditional tools that only retrieve raw HTML and JavaScript code, Selenium can simulate human interaction, enabling access to the underlying data on these complex pages.

<img src="https://raw.githubusercontent.com/vhaus63/ids_data/main/selenium_interacting.png" style="width:20%; float:left;" />


If you are interested in using Selenium for your projects, [this video](https://www.youtube.com/watch?v=nOV-UrRU9N4) is a good resource to start with the setup.

### Automating Data Collection

Automating data collection can be useful for ensuring that we gather data consistently and without needing manual intervention. For example, we might want to:
- Collect data from an API every day at a certain time.
- Capture changes on a website at regular intervals.
- Aggregate data over time for long-term analysis.

Two tools for scheduling such tasks are:
- **Cron**: A time-based job scheduler commonly used in Unix-like operating systems (Linux and macOS).
- **Task Scheduler**: A Windows utility for automating tasks by scheduling programs or scripts to run at specific times.

**Cron**

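A cron entry consists of five time fields (minute, hour, day of month, month, day of week) followed by the command to run. The syntax looks like this: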
```{cron}
* * * * * command
```
For example, for a script called `scrape_data.py` that should be executed every day at midnight:
```{cron}
0 0 * * * /usr/bin/python3 /path/scrape_data.py
```
<img src="https://raw.githubusercontent.com/vhaus63/ids_data/main/cron_syntax.png" style="width:50%; float:left;" />
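
What a scheduled script such as `scrape_data.py` might contain is sketched below. This is a hypothetical skeleton, not a prescribed structure: `collect_value` is a placeholder for the real API call or scraping step, and the output file name `observations.csv` is an assumption. Each run appends one timestamped row, so repeated runs build up a time series.

```python
import csv
from datetime import datetime, timezone
from pathlib import Path


def append_observation(csv_path, value):
    """Append one timestamped row; write the header only for a new file."""
    path = Path(csv_path)
    is_new_file = not path.exists()
    with path.open("a", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle)
        if is_new_file:
            writer.writerow(["collected_at_utc", "value"])
        writer.writerow([datetime.now(timezone.utc).isoformat(), value])


def collect_value():
    """Placeholder for the real API call or scraping step."""
    return 42


if __name__ == "__main__":
    append_observation("observations.csv", collect_value())
```

Since a scheduled job may start in a different working directory than your interactive session, using an absolute path for the output file is usually the safer choice.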


**Task Scheduler**

Example Setup:
1. Open Task Scheduler and create a basic task.
2. Trigger: Daily at 1:00 AM.
3. Action: Run `python` with the path to your script as its argument.

See e.g. [this post](https://www.jcchouinard.com/python-automation-using-task-scheduler/) for a more detailed description.

## Ethical Considerations

Web scraping gives the scraper a lot of power, especially when it comes to websites that handle a lot of user data and contain personal information. Without setting up ethical standards and a moral code for web scraping, it can be hard to differentiate between sleazy web scrapers looking to profit from their data at the expense of others, and those who wish to innovate and learn new things using the data available online.

Key considerations to take into account when you scrape a website:

1. If there is an __API available__, use it.

2. Respect __Robots.txt:__ Websites use a robots.txt file to communicate their scraping policies. This file specifies which pages or sections of a website can be crawled and scraped by bots. Always check and respect these guidelines, as ignoring them may violate the website's terms of service.

3. Abide by __Terms of Service:__ Many websites have terms of service (ToS) that explicitly prohibit or restrict web scraping. Violating these terms could lead to legal consequences, including being banned from the site or facing litigation. It's important to read and understand a site's ToS before scraping.

4. Avoid __Overloading Servers:__ Sending too many requests in a short period can overwhelm a website’s server. Use throttling, rate limiting, or pauses between requests to reduce server load and avoid causing disruptions.

5. Respect __Copyright and Intellectual Property:__ Scraping copyrighted content (such as articles, images, or databases) without permission could be a violation of intellectual property laws. Always ensure you are scraping publicly available data or data that you have explicit permission to use.

6. __Data Privacy:__ Avoid scraping personal or sensitive information.

7. __Transparency and Fair Use:__ When scraping, be transparent about your intentions if asked, and ensure that your use of the scraped data aligns with fair use principles.
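
Two of the considerations above, respecting `robots.txt` and avoiding server overload, can be sketched with the standard library. The example is a simplified illustration under stated assumptions: the rules in `robots_lines`, the URLs on `example.com`, and the bot name `ids-lecture-bot` are all hypothetical, and the rules are parsed from memory rather than downloaded.

```python
import time
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, parsed in memory instead of being downloaded
robots_lines = [
    "User-agent: *",
    "Disallow: /private/",
]

parser = RobotFileParser()
parser.parse(robots_lines)

candidate_urls = [
    "https://example.com/public/page.html",
    "https://example.com/private/data.html",
]
# Keep only the URLs that the rules allow our (hypothetical) bot to fetch
allowed = [url for url in candidate_urls if parser.can_fetch("ids-lecture-bot", url)]


def throttled(items, delay_seconds):
    """Yield items one by one, pausing between consecutive items."""
    for index, item in enumerate(items):
        if index > 0:
            time.sleep(delay_seconds)
        yield item


for url in throttled(allowed, delay_seconds=0.1):
    print(f"would fetch: {url}")
```

For a real website, the rules would be loaded with `parser.set_url(...)` followed by `parser.read()`, and the delay should be chosen generously, ideally following any guidance the site itself provides.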

Consider [this blog post](https://towardsdatascience.com/ethics-in-web-scraping-b96b18136f01) on ethical web scraping.

# Mentimeter

# Next lecture

<img src="https://raw.githubusercontent.com/vhaus63/ids_data/main/ao_data_prep.png" style="width:100%; float:left;" />