# Week 1: Python for Research Data Collection (Human Factors, Privacy & Security)

**Time budget:** ~2 hours  
**Goal:** Get comfortable using Python (in a notebook) to fetch a web page, extract a few signals, and store them in simple data structures.

This week is intentionally "small but real." We'll scrape **public pages** that are relevant to privacy & security research (e.g., *privacy policy* pages).

---

## What you'll learn
### Python fundamentals (applied)
- Variables and basic data types (`str`, `int`, `float`, `bool`, `None`)
- Lists and dictionaries (your first "dataset")
- `for` loops and `if` statements
- Writing small functions

### Web fundamentals (lightweight intro)
- What HTTP is (request → response)
- HTML as a tree (tags)
- How a "scraper" works at the simplest level

### Output
By the end you will produce a **list of dictionaries** like:

```python
[
  {"url": "...", "title": "...", "num_links": 42, "mentions_cookies": True},
  ...
]
```


## Setup

We'll use:
- `requests` to download HTML
- `bs4` (BeautifulSoup) to parse HTML

If you run this notebook locally and don't have these installed, run:

```bash
pip install requests beautifulsoup4
```

In many notebook environments, they may already be available.


In [None]:
import requests
from bs4 import BeautifulSoup

## 1) Variables + data types (in the context of scraping)

When we scrape, we deal with:
- URLs (strings)
- status codes (integers)
- HTML text (strings)
- flags like "mentions cookies" (booleans)
- "maybe missing" values (`None`)

Let's create a few on purpose.


In [None]:
url = "https://www.wikipedia.org/"
status_code_example = 200
mentions_cookies = False
missing_value = None

print(type(url), url)
print(type(status_code_example), status_code_example)
print(type(mentions_cookies), mentions_cookies)
print(type(missing_value), missing_value)

### 🧠 Concept: The Client-Server Model (HTTP)

Imagine ordering at a restaurant:
1.  **You (Client)**: "I'd like a burger, please." (This is the **Request**)
2.  **Waiter (Server)**: Checks kitchen, comes back with a burger on a plate. (This is the **Response**)

In the web world:
-   **You** are `requests` (Python).
-   **The Restaurant** is Wikipedia.
-   The **Burger** is the HTML code.
-   The **Plate** is the Response object (holding the HTML, the status code, etc).

## 2) Your first HTTP request

`requests.get(url)` sends an HTTP GET request.

Useful fields on the response:
- `response.status_code` (e.g., 200 = OK)
- `response.text` (HTML as a string)
- `response.headers` (metadata)
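
Status codes fall into families: 2xx means success, 3xx a redirect, 4xx a client error, and 5xx a server error. As a minimal sketch using only the standard library's `http.HTTPStatus`, the helper below turns a numeric code into a readable label. The name `describe_status` is hypothetical and is not part of `requests`.

```python
from http import HTTPStatus


def describe_status(code: int) -> str:
    """Return a short human-readable label for an HTTP status code.

    Hypothetical helper for illustration; not part of `requests`.
    """
    try:
        phrase = HTTPStatus(code).phrase  # e.g. 200 -> "OK"
    except ValueError:
        phrase = "Unknown"  # code is not a registered HTTP status

    if 100 <= code < 200:
        family = "informational"
    elif 200 <= code < 300:
        family = "success"
    elif 300 <= code < 400:
        family = "redirect"
    elif 400 <= code < 500:
        family = "client error"
    elif 500 <= code < 600:
        family = "server error"
    else:
        family = "other"
    return f"{code} {phrase} ({family})"


print(describe_status(200))  # 200 OK (success)
print(describe_status(404))  # 404 Not Found (client error)
```

In a scraper, a label like this could be logged next to `response.status_code` so that failed fetches are easy to spot later.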

We'll fetch one page and inspect it.


In [None]:
url = "https://www.wikipedia.org/"
response = requests.get(url, timeout=20)

print("Status:", response.status_code)
print("Content-Type:", response.headers.get("Content-Type"))
print("First 300 characters of HTML:")
print(response.text[:300])

### 🧠 Concept: What is HTML?

HTML is just **text with tags**.

Think of it like a Russian Matryoshka doll or a family tree:
-   `<html>` is the grandmother.
-   `<body>` is the mother.
-   `<div>` elements are the children.
-   `<p>` (paragraphs) and `<a>` (links) are the grandchildren.

**Tags** tell the browser (and us) what the content *is*.
-   `<p>` = Paragraph
-   `<a>` = Anchor (Link)
-   `<h1>` = Heading 1 (Big Title)

**Attributes** give extra info:
-   `<a href="https://google.com">` -> `href` tells us *where* the link goes.
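
To make the "family tree" picture concrete, here is a small sketch that prints an indented outline of a tiny HTML string. It uses the standard library's `html.parser.HTMLParser` rather than BeautifulSoup, purely so it runs without any installs. The class name `OutlinePrinter` and the sample HTML are made up for illustration, and the sketch assumes every tag is explicitly closed (unclosed tags such as `<br>` would throw off the indentation).

```python
from html.parser import HTMLParser


class OutlinePrinter(HTMLParser):
    """Collect an indented outline of tags to show how HTML nests."""

    def __init__(self):
        super().__init__()
        self.depth = 0   # how many open tags we are currently inside
        self.lines = []  # one outline line per opening tag

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs, e.g. [("href", "https://...")]
        attr_text = " ".join(f'{name}="{value}"' for name, value in attrs)
        label = f"<{tag} {attr_text}>" if attr_text else f"<{tag}>"
        self.lines.append("  " * self.depth + label)
        self.depth += 1

    def handle_endtag(self, tag):
        self.depth -= 1


# Hypothetical sample page: grandmother, mother, child, grandchildren.
sample_html = (
    "<html><body><div>"
    '<p>Read our <a href="https://example.org/privacy">privacy policy</a>.</p>'
    "</div></body></html>"
)

outline_parser = OutlinePrinter()
outline_parser.feed(sample_html)
outline_parser.close()
outline = outline_parser.lines
print("\n".join(outline))
```

Each extra level of indentation in the output is one more generation in the tree, and the `href` attribute shows up on the `<a>` line.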

## 3) HTML parsing: turning text into a tree

HTML is a nested structure. BeautifulSoup helps you query it.

Key ideas:
- Tags like `<title>`, `<a>`, `<p>`
- Attributes like `href`, `class`, `id`
- `.find(...)` gets one element
- `.find_all(...)` gets many elements


In [None]:
soup = BeautifulSoup(response.text, "html.parser")

title_tag = soup.find("title")
print("Title tag:", title_tag)
print("Title text:", title_tag.get_text(strip=True))

## 4) Lists: collecting repeated items (links)

A typical scraping pattern:
1. Find many elements (`find_all`)
2. Loop through them
3. Extract what you want
4. Store results in a list


In [None]:
links = soup.find_all("a")
print("Number of <a> tags:", len(links))

# Take a quick look at the first 5 links
for a in links[:5]:
    print("-", a.get_text(strip=True)[:40], "=>", a.get("href"))
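
The `href` values printed above are often relative (for example `/about`) or protocol-relative (starting with `//`), and some `<a>` tags have no `href` at all. A minimal sketch of the same find, loop, extract, store pattern, applied to cleaning link targets with the standard library's `urllib.parse.urljoin`, might look like this. The `raw_hrefs` list is hypothetical sample data standing in for what `a.get("href")` could return.

```python
from urllib.parse import urljoin

base_url = "https://www.wikipedia.org/"

# Hypothetical href values, as a.get("href") might return them.
raw_hrefs = ["//en.wikipedia.org/", "/about", "https://example.org/", None, "#top"]

absolute_links = []
for href in raw_hrefs:
    if href is None:          # <a> tag without an href attribute
        continue
    if href.startswith("#"):  # in-page anchor, not a separate page
        continue
    # urljoin resolves relative targets against the page they came from
    absolute_links.append(urljoin(base_url, href))

print(absolute_links)
```

Storing absolute URLs makes the collected links usable later, independent of which page they were found on.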

## 5) Dictionaries: making each row of data structured

For research work, a dictionary is a great "row" format:
- consistent keys
- readable
- easy to convert later to CSV/JSON

Let's create a small record for a page.


In [None]:
record = {
    "url": url,
    "title": title_tag.get_text(strip=True),
    "num_links": len(links),
}
record
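
Because each record is a plain dictionary, a list of them converts to JSON or CSV with the standard library alone. The sketch below uses hypothetical records with the same keys as the one built in this section, and writes the CSV to an in-memory buffer instead of a real file.

```python
import csv
import io
import json

# Hypothetical records with the same keys as the record built in this section.
demo_records = [
    {"url": "https://example.org/a", "title": "Page A", "num_links": 12},
    {"url": "https://example.org/b", "title": "Page B", "num_links": 30},
]

# JSON: one call serializes the whole list of dictionaries.
json_text = json.dumps(demo_records, indent=2)

# CSV: the dictionary keys become the header row.
buffer = io.StringIO()  # in-memory stand-in for open("pages.csv", "w", newline="")
writer = csv.DictWriter(buffer, fieldnames=["url", "title", "num_links"])
writer.writeheader()
writer.writerows(demo_records)
csv_text = buffer.getvalue()

print(json_text)
print(csv_text)
```

Keeping the keys consistent across records is what makes the CSV header meaningful, so it pays to decide on the keys early.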

## 6) `if` statements: extracting a simple privacy-related signal

Human factors privacy & security research often analyzes:
- language like "cookies", "third-party", "consent", "data retention"
- user choices: opt-out, control, settings
- transparency cues: "we collect...", "we share..."

We'll implement a simple detector: does a page mention **cookies**?


In [None]:
html_lower = response.text.lower()
mentions_cookies = "cookie" in html_lower  # crude but useful start
mentions_cookies
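
A plain substring check also matches longer words that happen to contain the term, and it handles only one term at a time. As a hedged sketch, the helper below checks several of the terms listed above using case-insensitive, whole-phrase matching with the standard `re` module. The name `find_terms` and the sample sentence are made up for illustration.

```python
import re

# Terms taken from the list above; the helper name is hypothetical.
TERMS = ["cookies", "third-party", "consent", "data retention"]


def find_terms(text: str, terms: list[str]) -> dict[str, bool]:
    """Return {term: True/False} using case-insensitive whole-phrase matching."""
    found = {}
    for term in terms:
        # \b anchors the match at word boundaries; re.escape protects the hyphen.
        pattern = r"\b" + re.escape(term) + r"\b"
        found[term] = re.search(pattern, text, flags=re.IGNORECASE) is not None
    return found


sample_text = "We use Cookies and share data with third-party partners."
signals = find_terms(sample_text, TERMS)
print(signals)
```

The trade-off is strictness: with word boundaries, the term `cookies` no longer matches the singular `cookie`, so the term list has to include every form you care about.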

## 7) Functions: turning steps into a reusable tool

A function packages logic so you can apply it to *many* pages.

We'll build `analyze_page(url)` that returns one dictionary.


In [None]:
def analyze_page(url: str, timeout: int = 20) -> dict:
    """Fetch a URL and extract a few simple signals.

    Returns a dict suitable for putting into a list (dataset).
    """
    r = requests.get(url, timeout=timeout)
    soup = BeautifulSoup(r.text, "html.parser")
    
    title = None
    title_tag = soup.find("title")
    if title_tag:
        title = title_tag.get_text(strip=True)

    links = soup.find_all("a")
    text_lower = soup.get_text(" ", strip=True).lower()

    return {
        "url": url,
        "status": r.status_code,
        "title": title,
        "num_links": len(links),
        "mentions_cookies": ("cookie" in text_lower),
        "mentions_privacy": ("privacy" in text_lower),
        "mentions_security": ("security" in text_lower),
    }

In [None]:
analyze_page("https://www.wikipedia.org/")

## 8) Mini "dataset": list of dictionaries

Now we can run the same function across multiple pages.

For privacy/security, here are example targets:
- A few major sites‚Äô privacy policies (public pages)
- A standards or regulatory page (public pages)
- A research lab page

**Important research ethics note:** This curriculum is about scraping *public pages responsibly*. We will:
- scrape slowly (few pages)
- respect Terms of Service / robots.txt when scaling later
- avoid personal data collection
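
As a rough sketch of what "slowly" and "respect robots.txt" can look like in code, the standard library's `urllib.robotparser` can check whether a URL is allowed before fetching it. The rules, URLs, and the `research-bot` user agent below are all hypothetical, and the rules are parsed from an in-memory list so the example makes no network calls.

```python
import time
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; a real one lives at https://<site>/robots.txt.
robots_lines = [
    "User-agent: *",
    "Disallow: /private/",
]

rules = RobotFileParser()
rules.parse(robots_lines)  # parse in-memory lines instead of fetching over the network

candidate_urls = [
    "https://example.org/privacy",
    "https://example.org/private/report",
]

allowed = []
for candidate in candidate_urls:
    # "research-bot" is a made-up user agent name for this sketch
    if rules.can_fetch("research-bot", candidate):
        allowed.append(candidate)
        time.sleep(0.1)  # placeholder pause; use a longer delay for real requests

print(allowed)
```

Against a live site you would call `rules.set_url(...)` and `rules.read()` to load the real file, and the pause between requests would typically be a second or more.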

For Week 1, we'll do a **small, manual list** of URLs.


In [None]:
urls = [
    "https://www.wikipedia.org/",
    "https://www.mozilla.org/en-US/privacy/",
    "https://support.google.com/accounts/answer/112802?hl=en",  # example help page
]

dataset = []
for u in urls:
    try:
        dataset.append(analyze_page(u))
    except Exception as e:
        dataset.append({"url": u, "error": str(e)})

dataset

## 9) Quick descriptive summary (without pandas)

Even without pandas, you can compute simple stats.

Example:
- How many pages mention "privacy"?
- Average number of links?


In [None]:
# Count mentions
privacy_count = sum(1 for row in dataset if row.get("mentions_privacy") is True)
cookies_count = sum(1 for row in dataset if row.get("mentions_cookies") is True)

# Average links (skip rows that errored)
link_counts = [row["num_links"] for row in dataset if "num_links" in row]
avg_links = sum(link_counts) / len(link_counts) if link_counts else None

print("Pages mentioning privacy:", privacy_count, "/", len(dataset))
print("Pages mentioning cookies:", cookies_count, "/", len(dataset))
print("Average num_links:", avg_links)
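
The standard library also offers `statistics` and `collections.Counter` for slightly richer summaries. The sketch below runs on hypothetical rows shaped like the output of `analyze_page`, including one error row, so it does not depend on the live `dataset`.

```python
from collections import Counter
from statistics import mean, median

# Hypothetical rows shaped like the output of analyze_page, plus one error row.
demo_rows = [
    {"url": "https://example.org/a", "num_links": 10, "mentions_privacy": True},
    {"url": "https://example.org/b", "num_links": 40, "mentions_privacy": False},
    {"url": "https://example.org/c", "num_links": 25, "mentions_privacy": True},
    {"url": "https://example.org/d", "error": "timeout"},
]

# Separate successful rows from failed ones before computing anything.
ok_rows = [row for row in demo_rows if "error" not in row]
demo_link_counts = [row["num_links"] for row in ok_rows]

summary = {
    "n_ok": len(ok_rows),
    "n_failed": len(demo_rows) - len(ok_rows),
    "mean_links": mean(demo_link_counts),      # raises StatisticsError on an empty list
    "median_links": median(demo_link_counts),
    "privacy_counts": Counter(row["mentions_privacy"] for row in ok_rows),
}
print(summary)
```

Reporting how many rows failed alongside the averages keeps the summary honest about how much of the URL list it actually covers.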

## Wrap-up

You just built:
- a **basic scraper function**
- a **mini dataset**
- a **mini descriptive analysis**

Next week we'll:
- extract more structured signals (headings, paragraphs)
- store results cleanly
- start thinking about "data schemas" for research scraping
