# Part 2, Lesson 3: APIs, Web Scraping & JSON Handling in R
**Author:** Your Name  
**Date:** Block Lecture (4 hours)

Welcome to the **third 4-hour session** of Part 2! This lesson explores how to **obtain** data from the web—either through **public APIs** or by **scraping** HTML pages—and handle **JSON** or other semi-structured formats in R.

## Topics
1. **Recap & Motivation**
2. **Accessing APIs with R**
3. **JSON Basics & Parsing**
4. **Web Scraping** with `rvest`
5. **Storing Scraped Data in a Database**
6. **Practical Exercises & Best Practices**

> **Outcome**: By the end, you should be comfortable making **HTTP requests**, parsing **JSON** or **HTML** results, and saving that data for further analysis.


---
## 1. Recap & Motivation
**Why** focus on APIs or scraping?
- Many **public datasets** are available via REST APIs (e.g., government portals, social media, weather).
- Some data isn’t provided in a convenient format, so you might need to **scrape** it from websites.
- Journalists often need to gather info from multiple sources quickly.

### Recap & Q&A
- Questions from previous lessons about **SQL**, **window functions**, or **database** usage?
- Experience or challenges retrieving external data?


---
## 2. Accessing APIs with R
### 2.1 HTTP & REST Basics
Most modern data APIs use **HTTP** methods like `GET`, `POST`, etc. A typical request might look like:
```
GET https://api.openweathermap.org/data/2.5/weather?q=London&appid=YOUR_API_KEY
```
The server returns **JSON** (or XML), which we parse.
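
As a small illustration of how such a request URL is put together, the sketch below uses `httr::modify_url()` to attach the query parameters from the example above. `YOUR_API_KEY` is a placeholder, and nothing is sent over the network.

```r
library(httr)

# Base endpoint from the example above
base_url <- "https://api.openweathermap.org/data/2.5/weather"

# Attach query parameters; httr URL-encodes the values
request_url <- modify_url(
  base_url,
  query = list(q = "London", appid = "YOUR_API_KEY")
)
request_url

# Split the URL back into its components to inspect it
parts <- parse_url(request_url)
parts$hostname
parts$query$q
```

Building the query from a named list avoids manual string pasting and keeps special characters in parameter values properly encoded.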

### 2.2 Using `httr` or `curl`
R packages for making requests:
- **`httr`**: High-level functions for GET/POST.
- **`curl`**: Lower-level approach.
- **`jsonlite`**: Often used for parsing JSON.

Let’s demonstrate a **public** API example. (We’ll use a mock or test endpoint if a real key is needed.)


```r
# Install if needed:
# install.packages("httr")
# install.packages("jsonlite")

library(httr)
library(jsonlite)

# Example: GitHub's public API for user data (no API key needed)
url <- "https://api.github.com/users/hadley"

res <- GET(url)  # Make GET request

# Check status
status_code(res)

content_raw <- content(res, as = "text")  # Get raw text
content_json <- fromJSON(content_raw)

str(content_json)
```

The `content_json` variable is now a **list** in R containing user details from GitHub’s API. We can extract fields like `content_json$login`, `content_json$name`, etc.

### 2.3 Handling API Keys
Some APIs require an **API key** or **OAuth** token. Typically:
```r
res <- GET(
  "https://api.example.com/data",
  add_headers(Authorization = paste("Bearer", MY_API_TOKEN))
)
```
You’d store `MY_API_TOKEN` in an **environment variable** or a secure file, not directly in your script.
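
One possible way to read the token from an environment variable is sketched below. The helper name `get_api_token` is hypothetical, and the demo sets the variable in-session only so the example runs on its own; in practice the value would live in `~/.Renviron`.

```r
# Read a secret from an environment variable and fail loudly if it is missing.
# `get_api_token` and the variable name MY_API_TOKEN are illustrative choices.
get_api_token <- function(var = "MY_API_TOKEN") {
  token <- Sys.getenv(var, unset = "")
  if (!nzchar(token)) {
    stop("Environment variable ", var, " is not set; add it to ~/.Renviron",
         call. = FALSE)
  }
  token
}

# For this demo only: set the variable in the current session
Sys.setenv(MY_API_TOKEN = "demo-token-123")

# Build the value that would go into the Authorization header
auth_value <- paste("Bearer", get_api_token())
auth_value
```

Failing early with a clear message is usually easier to debug than sending a request with an empty token and getting a `401` back.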


### 2.4 Rate Limits & Pagination
Many APIs have **rate limits** (e.g., 60 requests/hour). For large data, you might need to **paginate** or request data in **batches**.
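```r
# Sketch of paginated retrieval; the endpoint and its `page` parameter are hypothetical
library(httr)
library(jsonlite)
library(dplyr)

page <- 1
results_all <- list()
repeat {
  res <- GET("https://api.example.com/data", query = list(page = page))
  stop_for_status(res)  # abort on HTTP errors (4xx/5xx)
  data_page <- fromJSON(content(res, as = "text", encoding = "UTF-8"))
  if (length(data_page) == 0) break  # no more data
  results_all[[page]] <- data_page
  page <- page + 1
  Sys.sleep(1)  # pause between requests to stay within rate limits
}
final_data <- bind_rows(results_all)
```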
Always read the **API docs** for how to navigate or handle large queries.
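
When a rate limit is hit, a common response is to wait and try again. The sketch below shows one way to do that with exponential backoff. `with_backoff` and `flaky_request` are hypothetical names, and the failing request is simulated so the example runs offline.

```r
# Retry a function call, doubling the wait after each failure.
with_backoff <- function(fn, max_tries = 4, base_wait = 1) {
  for (attempt in seq_len(max_tries)) {
    result <- tryCatch(fn(), error = function(e) e)
    if (!inherits(result, "error")) return(result)  # success
    if (attempt == max_tries) stop(result)          # give up after the last try
    Sys.sleep(base_wait * 2^(attempt - 1))          # base_wait, 2x, 4x, ...
  }
}

# Simulated request that fails twice, then succeeds
calls <- 0
flaky_request <- function() {
  calls <<- calls + 1
  if (calls < 3) stop("HTTP 429: too many requests")
  "ok"
}

outcome <- with_backoff(flaky_request, base_wait = 0.01)
outcome
calls
```

For real requests, `httr` also provides a `RETRY()` function that follows a similar retry-with-pause strategy.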


---
## 3. JSON Basics & Parsing
**JSON** (JavaScript Object Notation) is a lightweight format for data exchange, frequently used by APIs. R can parse JSON with **`jsonlite`** or **`rjson`**. We’ll use `jsonlite`.

### 3.1 Parsing JSON from a String
```r
library(jsonlite)
json_text <- '{"name": "Alice", "age": 30, "pets": ["dog", "cat"]}'
obj <- fromJSON(json_text)
str(obj)
# $ name: chr "Alice"
# $ age : num 30
# $ pets: chr [1:2] "dog" "cat"
```
We can **navigate** `obj` like a list: `obj$name` or `obj$pets`.

### 3.2 Converting R Objects to JSON
```r
my_list <- list(city = "Berlin", population = 3.6, landmarks = c("Brandenburg Gate", "Reichstag"))
toJSON(my_list, pretty = TRUE)
```
Useful if you need to send data **to** an API or store config in JSON.
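
One detail worth knowing when sending JSON to an API: by default `toJSON()` writes length-one vectors as one-element arrays. The sketch below compares the default with `auto_unbox = TRUE`, reusing the list from above.

```r
library(jsonlite)

my_list <- list(
  city = "Berlin",
  population = 3.6,
  landmarks = c("Brandenburg Gate", "Reichstag")
)

# Default: length-one vectors become arrays, e.g. "city":["Berlin"]
json_default <- toJSON(my_list)

# auto_unbox = TRUE writes length-one vectors as plain scalars
json_unboxed <- toJSON(my_list, auto_unbox = TRUE)

json_default
json_unboxed

# Round trip: parse the JSON back into an R list
back <- fromJSON(json_unboxed)
back$landmarks
```

Which form is right depends on what the receiving API expects, so check its documentation before choosing.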


### 3.3 Flattening Nested JSON
APIs often return **nested** structures. The function **`fromJSON(..., flatten = TRUE)`** can help produce a more tabular structure. If that fails, you can **manually** navigate and reformat.
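
To make the effect of `flatten = TRUE` concrete, here is a minimal sketch using a made-up JSON array in which each record has a nested `user` object.

```r
library(jsonlite)

# Made-up JSON: an array of records, each with a nested "user" object
json_text <- '[
  {"id": 1, "user": {"name": "Alice", "city": "Berlin"}},
  {"id": 2, "user": {"name": "Bob",   "city": "Paris"}}
]'

nested <- fromJSON(json_text)                  # "user" is a data frame inside a column
flat   <- fromJSON(json_text, flatten = TRUE)  # nested fields become their own columns

names(nested)
names(flat)
flat$user.city
```

The flattened column names join the nesting levels with a dot (`user.name`, `user.city`), which you may want to rename before writing to a database.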

> **Exercise**: Use a public JSON API, parse the data, and try to flatten it into a dataframe. Then, store that data in a database table (tying in your **SQL** skills from previous lessons).


---
## 4. Web Scraping with `rvest`
When an API doesn’t exist, or the data is only available in an **HTML** page, we can **scrape** it. **`rvest`** (part of the tidyverse ecosystem) is a widely used package for this.

### 4.1 Basic Workflow
1. **Read** the webpage’s HTML.
2. Identify **CSS selectors** or **XPath** that point to the desired elements.
3. Extract text or attributes.
4. Clean & structure the data.
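
The four steps above can be sketched offline with `rvest::minimal_html()`, which builds a page from an inline string. The HTML snippet, class name, and links are made up for illustration.

```r
library(rvest)

# 1) "Read" HTML: here from an inline string instead of a live URL
page <- minimal_html('
  <ul class="stories">
    <li><a href="/story-1">First story</a></li>
    <li><a href="/story-2">Second story</a></li>
  </ul>
')

# 2) Select elements with a CSS selector
links <- page %>% html_elements("ul.stories li a")

# 3) Extract text and attributes
titles <- html_text2(links)
hrefs  <- html_attr(links, "href")

# 4) Structure the result as a data frame
stories <- data.frame(title = titles, url = hrefs)
stories
```

Testing selectors against a small inline snippet like this is a cheap way to debug them before pointing the code at a real site.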

### 4.2 Example
We’ll scrape the placeholder site `https://example.com/`. (For real scraping, pick a site that allows it, and check its **robots.txt** and terms of service first!)


```r
# install.packages("rvest") if needed
library(rvest)

# Let's use a sample page for demonstration:
url2 <- "https://example.com/"

# 1) Read HTML
page <- read_html(url2)

# 2) Extract the <h1> tag text
h1_text <- page %>% html_element("h1") %>% html_text()
h1_text
```

You’d follow the same approach for **tables**, **lists**, or **div** elements. For example, if a page has a table:
```r
table_data <- page %>% 
  html_element("table") %>% 
  html_table()
```
This transforms the HTML table into a dataframe (assuming it’s well-structured HTML).
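
Since `example.com` has no table, the following sketch runs `html_table()` on a small inline table instead. The table contents are illustrative values, not scraped data.

```r
library(rvest)

# A tiny, well-formed HTML table with a header row
page_with_table <- minimal_html('
  <table>
    <tr><th>City</th><th>Population</th></tr>
    <tr><td>Berlin</td><td>3.6</td></tr>
    <tr><td>Hamburg</td><td>1.8</td></tr>
  </table>
')

table_data <- page_with_table %>%
  html_element("table") %>%
  html_table()

table_data
str(table_data)
```

Because every cell in the first row is a `<th>`, that row is used for the column names, and the numeric-looking column is converted to numbers by default.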

### 4.3 Handling Pagination or Multiple Pages
Similar to an API, you might have multiple pages to scrape. You can loop over URLs:
```r
library(rvest)
library(dplyr)

pages <- c("page1.html", "page2.html")
all_data <- list()
for (i in seq_along(pages)) {
  pg <- read_html(pages[i])
  data_extracted <- pg %>% ...  # placeholder: put your extraction steps here
  all_data[[i]] <- data_extracted
}
final <- bind_rows(all_data)
```
Again, **respect** the site’s usage policy, consider **delays** between requests, and handle **error checking** for missing pages.
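
A minimal sketch of the delay and error-handling advice follows. It writes two small HTML files to a temporary directory so it runs offline, and deliberately leaves a third file missing. The helper name `scrape_heading` and the file names are hypothetical.

```r
library(rvest)

# Create two small local HTML files so the example runs offline
page_files <- file.path(tempdir(), c("page1.html", "page2.html", "missing.html"))
writeLines("<html><body><h1>Page one</h1></body></html>", page_files[1])
writeLines("<html><body><h1>Page two</h1></body></html>", page_files[2])
unlink(page_files[3])  # make sure the third file does not exist

# Return the <h1> text of a page, or NA if the page cannot be read
scrape_heading <- function(path) {
  tryCatch(
    read_html(path) %>% html_element("h1") %>% html_text2(),
    error = function(e) NA_character_
  )
}

headings <- character(length(page_files))
for (i in seq_along(page_files)) {
  headings[i] <- scrape_heading(page_files[i])
  Sys.sleep(0.1)  # short pause; use a longer one (e.g., 1 second) on real sites
}
headings
```

Returning `NA` for a failed page keeps the loop running and leaves a visible marker of which pages need a second look.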


---
## 5. Storing Scraped Data in a Database
Tie in your **SQL** knowledge. Once you have the final dataframe from the API or web page, you can store it in a database (SQLite, MySQL, etc.).
```r
library(DBI)
con <- dbConnect(RSQLite::SQLite(), "scraped_data.sqlite")

# Suppose we have final_data from an API or scraped table
dbWriteTable(con, "my_table", final_data)
```
Then you can run queries, merges, or further analysis on that table—**closing the loop** on your data pipeline.


### A Mini ETL Example
1. **Extract** from an API or website with `httr`/`rvest`.
2. **Transform** the JSON/HTML structures into clean data.
3. **Load** the data into a database with `dbWriteTable`.
4. Use SQL or `dplyr` to do further summarization or linking to other datasets.
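
The steps above can be sketched end to end without a network connection. In the example below, the JSON payload is made up to stand in for an API response, and the database is an in-memory SQLite instance, so nothing is written to disk.

```r
library(jsonlite)
library(DBI)

# Extract: a made-up JSON payload standing in for an API response
payload <- '[
  {"city": "Berlin",  "reading": {"temp_c": 21.5, "humidity": 40}},
  {"city": "Hamburg", "reading": {"temp_c": 18.0, "humidity": 55}},
  {"city": "Munich",  "reading": {"temp_c": 24.5, "humidity": 35}}
]'

# Transform: flatten nested fields and make column names SQL-friendly
weather <- fromJSON(payload, flatten = TRUE)
names(weather) <- gsub(".", "_", names(weather), fixed = TRUE)

# Load: write the data frame to an in-memory SQLite database
con <- dbConnect(RSQLite::SQLite(), ":memory:")
dbWriteTable(con, "weather", weather, overwrite = TRUE)

# Query: use SQL on the loaded table
warm <- dbGetQuery(
  con,
  "SELECT city, reading_temp_c FROM weather
   WHERE reading_temp_c > 20 ORDER BY city"
)
warm

dbDisconnect(con)  # close the connection when done
```

Swapping `":memory:"` for a file path such as `"scraped_data.sqlite"` would persist the table between sessions.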


---
## 6. Practical Exercises & Best Practices
Here are some suggestions for hands-on **exercises**:

1. **API**: Pick a public API (GitHub, OpenWeatherMap, etc.). Make a **GET** request, parse the JSON, flatten it, and store the relevant fields in a local SQLite database.
2. **Web Scraping**: Identify a website that lists data in a table or items. Use `rvest` to scrape it into a dataframe. Clean up the text (remove whitespace, parse numbers), and store it in your DB.
3. **Combine**: If you have local CSVs or a second database table, do a **join** or **merge** to add context.
4. **Rate Limits / Politeness**: If you scrape multiple pages, insert a short delay with `Sys.sleep(1)` to avoid hammering the server.
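
For the clean-up step in exercise 2, a minimal base-R sketch might look like this. The raw values are made up to mimic what a scraped table often contains: stray whitespace and numbers with thousands separators.

```r
# Made-up raw values as they might come out of a scraped table
raw <- data.frame(
  city       = c("  Berlin ", "Hamburg\n", " Munich"),
  population = c("3,645,000", " 1,841,000", "1,472,000 "),
  stringsAsFactors = FALSE
)

clean <- data.frame(
  # strip leading/trailing whitespace, including newlines
  city = trimws(raw$city),
  # drop everything except digits and the decimal point, then convert
  population = as.numeric(gsub("[^0-9.]", "", raw$population))
)

clean
str(clean)
```

This pattern assumes a comma is a thousands separator; for sources that use a comma as the decimal mark, the regular expression would need to change.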

### Best Practices
- Check **robots.txt** or the site’s terms to ensure scraping is allowed.
- Use **caching** if you repeatedly fetch the same data.
- Don’t store **API keys** in your code repository. Use environment variables or `.Renviron`.
- For large-scale scraping, consider specialized tools or a queue-based approach.
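
The caching advice can be sketched with a small file-based helper. `cached_fetch` and `fake_fetch` are hypothetical names, and the fetch is simulated so you can see that it runs only once.

```r
# Run `fetch_fn` only if no cached copy exists at `cache_file`.
cached_fetch <- function(cache_file, fetch_fn) {
  if (file.exists(cache_file)) {
    return(readRDS(cache_file))  # reuse the saved result
  }
  result <- fetch_fn()           # the expensive step (e.g., an API call)
  saveRDS(result, cache_file)
  result
}

# Simulated fetch that counts how often it actually runs
fetch_count <- 0
fake_fetch <- function() {
  fetch_count <<- fetch_count + 1
  data.frame(id = 1:3, value = c("a", "b", "c"))
}

cache_file <- tempfile(fileext = ".rds")
first  <- cached_fetch(cache_file, fake_fetch)  # runs fake_fetch
second <- cached_fetch(cache_file, fake_fetch)  # read from the cache
fetch_count
```

This simple version never expires the cache; deleting the file (or adding a check on its modification time) forces a fresh fetch.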


---
## Wrap-Up & Next Steps

In **Part 2, Lesson 3**, you learned:
- **Accessing APIs** in R via `httr`, handling JSON with `jsonlite`.
- **Web scraping** basics using `rvest`.
- Converting messy nested data into R data frames.
- Saving the final results into your **database** or local files for further analysis.

### What’s Next?
1. **Expand** your scraping or API usage: handle pagination, error handling, or session-based logins.
2. Integrate **SQL** window functions or joins on your newly scraped data.
3. Build a **Shiny** app or custom script to automate data ingestion from an API or website.

### Additional Resources
- [**rvest** documentation](https://rvest.tidyverse.org/)
- [**httr** package vignettes](https://httr.r-lib.org/articles/)
- [**jsonlite** reference](https://github.com/jeroen/jsonlite)
- [**W3C** docs on JSON-LD, if you deal with structured web data (Schema.org)](https://json-ld.org/)
- **Scraping guidance** from NICAR or other journalism organizations, for legal/ethical considerations.

This completes your introduction to **APIs, web scraping, and JSON** in R. Happy data hunting!

