# Getting Data from the Web

This notebook introduces ways to collect data from the web using Python. You'll learn how to fetch data from APIs, explore JSON responses, and (optionally) scrape simple web pages. These skills help you automate gathering open data for your research.

## 1. Introduction

Automating data collection saves time and reduces errors compared to downloading files by hand. Common research uses include:
- Downloading weather or environmental data
- Collecting census or population statistics
- Gathering real estate or planning datasets

**APIs** (Application Programming Interfaces) are documented, supported interfaces for requesting data from a website or service, usually in a structured format such as JSON. **Web scraping** means extracting data from the HTML of pages written for human readers. It is less reliable, because it breaks whenever the page layout changes, and it may not be allowed by the website's terms.

> **Tip:** Always check a website's terms of use and robots.txt before scraping. Use APIs when available, respect rate limits, and avoid overloading servers.
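
As a rough illustration of the robots.txt check mentioned in the tip, Python's standard library includes `urllib.robotparser`. The sketch below parses a small, made-up robots.txt (the rules, the `my-research-bot` name, and the `example.org` URLs are all assumptions for this example) so that it runs without touching the network.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, written for this example only
sample_robots_txt = """\
User-agent: *
Disallow: /private/
Crawl-delay: 5
"""

parser = RobotFileParser()
parser.parse(sample_robots_txt.splitlines())

# can_fetch(useragent, url) reports whether the rules allow that URL
print(parser.can_fetch("my-research-bot", "https://example.org/data/page.html"))
print(parser.can_fetch("my-research-bot", "https://example.org/private/report.html"))

# crawl_delay() returns the requested pause in seconds, or None if unspecified
print(parser.crawl_delay("my-research-bot"))
```

For a real site you would typically call `parser.set_url(...)` with the site's robots.txt address and then `parser.read()` instead of `parse()`. Note that robots.txt does not replace reading the terms of use.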

## 2. Accessing an API with `requests`

Let's use the `requests` library to fetch data from a public API. We'll use [JSONPlaceholder](https://jsonplaceholder.typicode.com/) as a simple example.

In [None]:
import requests

# Fetch a list of sample users
url = "https://jsonplaceholder.typicode.com/users"
response = requests.get(url, timeout=10)  # without a timeout, a stalled request can hang indefinitely

# Always check the status code first (200 means success)
print("Status code:", response.status_code)

# Print a preview of the raw response text
print(response.text[:200])  # Show only the first 200 characters
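
Printing the status code works for exploration, but in a script you may prefer the request to fail loudly. One possible way to wrap the steps above is sketched here; `fetch_json` and its `session` parameter are hypothetical names chosen for this example, not part of `requests`.

```python
import requests


def fetch_json(url, params=None, timeout=10, session=None):
    """Fetch a URL and return the parsed JSON body.

    `fetch_json` and `session` are illustrative names, not a requests API.
    `session` may be a requests.Session(); by default the requests module is used.
    """
    http = session if session is not None else requests
    response = http.get(url, params=params, timeout=timeout)
    # raise_for_status() raises requests.HTTPError for 4xx and 5xx responses
    response.raise_for_status()
    return response.json()
```

With this helper, `users = fetch_json("https://jsonplaceholder.typicode.com/users")` would either return the parsed data or raise an exception. Wrapping the call in `try` / `except requests.RequestException` is one way to handle network and HTTP errors in a single place.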

## 3. Exploring JSON Responses

Most APIs return data in **JSON** format, which maps naturally onto nested Python dictionaries and lists. Let's convert the response and explore it.

In [None]:
# Convert the response to JSON (Python objects)
data = response.json()

print(type(data))  # Should be a list
print(len(data), "records")

# Look at the first record
print(data[0])

# Print the available keys in the first record
print(list(data[0].keys()))
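
To see how JSON values turn into Python objects, here is a small offline sketch using the standard `json` module. The JSON text is made up for this example and only loosely resembles a user record; `response.json()` performs the same kind of conversion on the response body.

```python
import json

# Made-up JSON text, shaped loosely like one record from a users endpoint
raw_text = (
    '{"id": 1, "name": "Example User", "active": true, '
    '"tags": ["a", "b"], "phone": null, "address": {"city": "Sampletown"}}'
)

record = json.loads(raw_text)

# Show which Python type each JSON value became
for key, value in record.items():
    print(f"{key!r}: {type(value).__name__}")
```

JSON objects become `dict`, arrays become `list`, `true`/`false` become `bool`, and `null` becomes `None`.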

In [None]:
# Extract and print a few fields from each user
for user in data[:3]:  # Just show first 3 users
    print("Name:", user['name'], "| Email:", user['email'], "| City:", user['address']['city'])
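
Real API records sometimes lack a field, in which case chained lookups such as `user['address']['city']` raise a `KeyError`. A minimal sketch of one defensive approach follows; `get_nested` is a hypothetical helper written for this example, and the records are made up.

```python
def get_nested(record, *keys, default=None):
    """Walk nested dictionaries, returning `default` if any key is missing."""
    current = record
    for key in keys:
        if not isinstance(current, dict) or key not in current:
            return default
        current = current[key]
    return current


# Made-up records: the second one has no address
users = [
    {"name": "Example User", "address": {"city": "Sampletown"}},
    {"name": "Another User"},
]

for user in users:
    print(user["name"], "|", get_nested(user, "address", "city", default="unknown"))
```

Whether a missing field should be replaced by a default or treated as an error depends on your analysis; silently filling gaps can hide data-quality problems.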

## 4. From JSON to DataFrame

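You can load JSON data into a pandas DataFrame for analysis. If the data is nested, `pd.json_normalize()` flattens nested dictionaries into separate columns, naming each column by joining the keys with the separator you choose (for example, `address_city` when `sep='_'`).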

In [None]:
import pandas as pd

# Flatten nested fields like address.city
df = pd.json_normalize(data, sep='_')
df.head()
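
To build intuition for what "flattening" means, the sketch below does it by hand in plain Python. It is a conceptual illustration only, not how pandas implements `json_normalize()`; the `flatten` function and the record are made up for this example.

```python
def flatten(record, parent_key="", sep="_"):
    """Flatten nested dictionaries into one level with joined key names."""
    flat = {}
    for key, value in record.items():
        new_key = f"{parent_key}{sep}{key}" if parent_key else key
        if isinstance(value, dict):
            # Recurse into nested dictionaries, carrying the joined prefix along
            flat.update(flatten(value, new_key, sep))
        else:
            flat[new_key] = value
    return flat


# Made-up record with two levels of nesting
record = {
    "name": "Example User",
    "address": {"city": "Sampletown", "geo": {"lat": "0.0", "lng": "0.0"}},
}

print(flatten(record))
```

Each nested key path becomes a single name such as `address_geo_lat`, which is the same naming pattern you should see in the DataFrame's columns.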

## 5. Optional: Scraping a Simple Web Page

If no API is available, you might need to scrape data from a web page. Let's use `requests` and `BeautifulSoup` to extract section headings from a web page.

> **Ethics & Legality:**
> - Always check the website's robots.txt and terms of service.
> - Don't overload servers—add delays if scraping many pages.
> - Attribute your sources and respect data ownership.
> - Some sites may block or ban scrapers.

Below is a basic example that prints the first few section headings of a Wikipedia article:

In [None]:
import requests
from bs4 import BeautifulSoup

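url = "https://en.wikipedia.org/wiki/Architecture"
# Wikimedia asks automated clients to send a descriptive User-Agent;
# the value below is a placeholder, so replace it with your own details
headers = {"User-Agent": "research-notebook-example/0.1 (contact: you@example.org)"}
response = requests.get(url, headers=headers, timeout=10)
response.raise_for_status()  # stop here if the page could not be fetched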

soup = BeautifulSoup(response.text, 'html.parser')

# Find all the section headings (h2 tags)
headings = soup.find_all('h2')
for h in headings[:5]:  # Show first 5 headings
    print(h.text.strip())
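
The advice above about adding delays when scraping many pages could be sketched as follows. `fetch_politely` is a hypothetical helper for this example; it takes whatever `fetch` function you supply, so the pacing logic stays separate from the download code.

```python
import time


def fetch_politely(urls, fetch, delay_seconds=1.0):
    """Call `fetch(url)` for each URL, pausing between consecutive requests."""
    results = {}
    for index, url in enumerate(urls):
        if index > 0:
            time.sleep(delay_seconds)  # pause before every request after the first
        results[url] = fetch(url)
    return results
```

A possible call would be `fetch_politely(page_urls, lambda u: requests.get(u, timeout=10).text, delay_seconds=2)`, where `page_urls` is your own list of addresses. A suitable delay depends on the site; if its robots.txt states a crawl delay, that is a sensible lower bound.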

## 6. Saving and Reusing Fetched Data

It's a good idea to save the raw data you fetch, so you can reuse it or check your work later.

In [None]:
import json

# Save JSON data to a file
with open('users_data.json', 'w') as f:
    json.dump(data, f, indent=2)

# Save DataFrame to CSV
df.to_csv('users_data.csv', index=False)
print("Data saved to users_data.json and users_data.csv")
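
If you fetch the same source repeatedly, one option is to put the fetch time in the file name so earlier downloads are not overwritten, and to reload the file to confirm it was written correctly. This is a sketch under assumptions: `sample_data` is a made-up stand-in for fetched data, and the file-naming scheme is just one possible convention.

```python
import json
import tempfile
from datetime import datetime, timezone
from pathlib import Path

# Made-up stand-in for data returned by an API
sample_data = [{"id": 1, "name": "Example User"}]

# A temporary directory keeps this example from cluttering your project folder
output_dir = Path(tempfile.mkdtemp())

# Include the fetch time (UTC) in the file name so later downloads get their own file
stamp = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
raw_path = output_dir / f"users_raw_{stamp}.json"

with open(raw_path, "w", encoding="utf-8") as f:
    json.dump(sample_data, f, indent=2)

# Reload the file and confirm it matches what was saved
with open(raw_path, encoding="utf-8") as f:
    reloaded = json.load(f)

print(raw_path.name, "round-trips correctly:", reloaded == sample_data)
```

In your own project you would likely replace the temporary directory with a dedicated folder for raw data.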

## 7. Practice Challenge (Optional)

**Task:**
- Use an API to get data about a place (e.g. city weather, population, or similar).
- Extract 5 fields from the response.
- Convert the result into a CSV file.

## 8. Summary & Links

- Use APIs when possible: they return structured data, change less often than page layouts, and are an access route the provider has sanctioned.
- Scraping should be a last resort, and always done with care and respect for the source.
- Store your raw data before transforming it.

### Useful Links
- [List of public APIs](https://github.com/public-apis/public-apis)
- [pandas documentation](https://pandas.pydata.org/docs/)
- [requests documentation](https://requests.readthedocs.io/en/latest/)
- [BeautifulSoup documentation](https://www.crummy.com/software/BeautifulSoup/bs4/doc/)

Next up: How to structure and share your own Python projects!