# Week 05 — Pandas for Descriptive Stats (Research-friendly)

**Time budget:** ~2 hours  
**Goal:** Load scraped data into pandas and compute descriptive statistics and simple plots.

**Theme (PhD focus):** Human factors of privacy & security — scraping public pages (privacy policies, cookie notices, security help pages, standards/regulator guidance) and extracting *UX-relevant* signals.

---


## Responsible scraping note (important)
We will only scrape **public pages** and keep the volume small.
- Prefer a few pages, not thousands
- Respect robots.txt and Terms of Service, especially when you scale later
- Avoid collecting personal data
- Add delays for politeness when doing multi-page work
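
The points above can be sketched in code. This is a minimal illustration, not a complete crawler: the helper names `is_allowed` and `polite_fetch_all`, the `USER_AGENT` string, and the two-second default delay are all assumptions made for this example. The `fetch` argument is any function that returns a page body for a URL, so the sketch runs without touching the network.

```python
import time
from urllib.robotparser import RobotFileParser

USER_AGENT = "research-notebook-example"  # hypothetical identifier for this sketch


def is_allowed(robots_txt, url, user_agent=USER_AGENT):
    """Check a URL against the text of a site's robots.txt file."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


def polite_fetch_all(urls, fetch, robots_txt, delay_seconds=2.0, sleep=time.sleep):
    """Fetch each allowed URL in turn, pausing between consecutive requests."""
    pages = {}
    for url in urls:
        if not is_allowed(robots_txt, url):
            continue  # skip pages the site asks crawlers to avoid
        if pages:
            sleep(delay_seconds)  # politeness delay before every request after the first
        pages[url] = fetch(url)
    return pages


# Offline demonstration with a made-up robots.txt and a fake fetch function.
example_robots = "User-agent: *\nDisallow: /private/\n"
example_urls = [
    "https://example.org/privacy",
    "https://example.org/private/notes",
]
pages = polite_fetch_all(
    example_urls,
    fetch=lambda url: f"<html>{url}</html>",
    robots_txt=example_robots,
    delay_seconds=0,
)
print(list(pages))
```

For real pages, `fetch` could be something like `lambda url: requests.get(url, timeout=10).text`, with the robots.txt text downloaded once per site.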


## Setup
We’ll use `requests` + `BeautifulSoup` for scraping, plus `pandas` and `matplotlib` for analysis and plots. Install if needed:

```bash
pip install requests beautifulsoup4 pandas matplotlib
```


In [None]:
import re
import time
import json
from urllib.parse import urljoin, urlparse

import requests
from bs4 import BeautifulSoup

import pandas as pd
import matplotlib.pyplot as plt

## Pandas basics (researcher-friendly)
We’ll:
- create a DataFrame from rows
- compute descriptive stats
- plot simple charts

Keep it simple and interpretable.


### 🧠 Concept: The DataFrame

Think of a **DataFrame** (`df`) as a **programmable Excel sheet**.
- It has **Rows** (labelled by the Index) and **Columns** (each column is a Series).
- Unlike Excel, you can't "click and drag". You write instructions.

**Why use it?**
- It handles millions of rows comfortably, as long as the data fits in memory.
- It allows repeatable analysis (run the cell again -> get same charts).
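
To make the rows/columns picture concrete, here is a small sketch. The table name `pages` and its values are made up for illustration; the point is only to show where the index, the columns, and a single Series live, and what "writing an instruction" looks like for a filter.

```python
import pandas as pd

# Made-up values, for illustration only.
pages = pd.DataFrame(
    {
        "source": ["site_a", "site_b", "site_c"],
        "num_headings": [18, 9, 12],
    }
)

print(pages.shape)          # (number of rows, number of columns)
print(list(pages.columns))  # column labels
print(list(pages.index))    # row labels; pandas numbers them from 0 by default

# Selecting one column gives a Series.
headings = pages["num_headings"]
print(type(headings).__name__)

# A filter is an instruction rather than a click: keep rows with more than 10 headings.
long_pages = pages[pages["num_headings"] > 10]
print(long_pages["source"].tolist())
```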

In [None]:
rows = [
    {"source":"mozilla", "mentions_choices": True, "mentions_retention": True, "num_headings": 18},
    {"source":"nist", "mentions_choices": True, "mentions_retention": False, "num_headings": 9},
    {"source":"enisa", "mentions_choices": False, "mentions_retention": False, "num_headings": 12},
]
df = pd.DataFrame(rows)
df
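
The rows above are typed in by hand. As a rough sketch of how such a row could be derived from a scraped page, the example below parses an inline HTML string with `BeautifulSoup`. The helper `page_to_row`, the sample HTML, and the keyword lists are assumptions for illustration; simple keyword matching is a crude signal and would need validation before being used in a study.

```python
import pandas as pd
from bs4 import BeautifulSoup

# Hypothetical stand-in for the body of a downloaded page.
EXAMPLE_HTML = """
<html><body>
  <h1>Privacy Notice</h1>
  <h2>Your choices</h2>
  <p>You can opt out of analytics at any time.</p>
  <h2>Data retention</h2>
  <p>We keep account data for 12 months.</p>
</body></html>
"""


def page_to_row(source, html):
    """Turn one page into a dict with the same keys as the rows above."""
    soup = BeautifulSoup(html, "html.parser")
    text = soup.get_text(" ", strip=True).lower()
    headings = soup.find_all(["h1", "h2", "h3", "h4", "h5", "h6"])
    return {
        "source": source,
        # Assumed keyword lists; adjust them to your own coding scheme.
        "mentions_choices": any(k in text for k in ("choice", "opt out", "opt-out")),
        "mentions_retention": any(k in text for k in ("retention", "retain")),
        "num_headings": len(headings),
    }


example_row = page_to_row("example", EXAMPLE_HTML)
example_df = pd.DataFrame([example_row])
print(example_df)
```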

### 🧠 Concept: `describe()` = Instant Summary

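This one command replaces a long series of manual steps in Excel. With `include="all"` it reports:
- Counts
- Averages (mean) and standard deviation
- Min/Max and quartiles
- Unique values, the most frequent value, and its frequency (for text and boolean columns)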

In [None]:
df.describe(include="all")
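
To see where each number in the summary comes from, the sketch below rebuilds the same three rows (so it runs on its own) and pulls out individual entries. The variable names `numeric_summary` and `full_summary` are just labels for this example.

```python
import pandas as pd

# Same three rows as above, repeated so this cell is self-contained.
df = pd.DataFrame(
    [
        {"source": "mozilla", "mentions_choices": True, "mentions_retention": True, "num_headings": 18},
        {"source": "nist", "mentions_choices": True, "mentions_retention": False, "num_headings": 9},
        {"source": "enisa", "mentions_choices": False, "mentions_retention": False, "num_headings": 12},
    ]
)

# Numeric column: count, mean, std, min, quartiles, max.
numeric_summary = df["num_headings"].describe()
print(numeric_summary)

# include="all" adds unique/top/freq for the text and boolean columns.
# Cells that do not apply to a column (e.g. the mean of "source") show as NaN.
full_summary = df.describe(include="all")
print(full_summary.loc[["count", "unique", "top", "freq"], ["source", "mentions_choices"]])
```

With only three pages the standard deviation and quartiles are not very informative; they become useful once the table holds more rows.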

In [None]:
df["mentions_choices"].value_counts()
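
Raw counts are often reported as proportions. The sketch below shows three options on the same three rows: `value_counts(normalize=True)`, the mean of a boolean column (which equals the share of `True` values), and a cross-tabulation of the two flags. The variable names are illustrative, and with three pages the numbers only demonstrate the mechanics.

```python
import pandas as pd

# Same three rows as above, repeated so this cell is self-contained.
df = pd.DataFrame(
    [
        {"source": "mozilla", "mentions_choices": True, "mentions_retention": True, "num_headings": 18},
        {"source": "nist", "mentions_choices": True, "mentions_retention": False, "num_headings": 9},
        {"source": "enisa", "mentions_choices": False, "mentions_retention": False, "num_headings": 12},
    ]
)

# Share of pages with each value instead of raw counts.
shares = df["mentions_choices"].value_counts(normalize=True)
print(shares)

# The mean of a boolean column is the proportion of True values.
flag_rates = df[["mentions_choices", "mentions_retention"]].mean()
print(flag_rates)

# How the two flags co-occur: rows are mentions_choices, columns are mentions_retention.
co_occurrence = pd.crosstab(df["mentions_choices"], df["mentions_retention"])
print(co_occurrence)
```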

In [None]:
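# Histogram of heading counts; label the x-axis so the chart stands on its own.
ax = df["num_headings"].plot(kind="hist", title="Distribution of number of headings")
ax.set_xlabel("Number of headings")
plt.show()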
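
For the yes/no columns, a bar chart of counts is usually easier to read than a histogram. A minimal sketch on the same three rows, where the chart title and axis label are just suggestions:

```python
import pandas as pd
import matplotlib.pyplot as plt

# Same three rows as above, repeated so this cell is self-contained.
df = pd.DataFrame(
    [
        {"source": "mozilla", "mentions_choices": True, "mentions_retention": True, "num_headings": 18},
        {"source": "nist", "mentions_choices": True, "mentions_retention": False, "num_headings": 9},
        {"source": "enisa", "mentions_choices": False, "mentions_retention": False, "num_headings": 12},
    ]
)

# Summing a boolean column counts the True values.
flag_counts = df[["mentions_choices", "mentions_retention"]].sum()

ax = flag_counts.plot(kind="bar", title="Pages mentioning each topic", rot=0)
ax.set_ylabel("Number of pages")
plt.show()
```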