# Lab 12

_[General notebook information](https://computing-in-context.afeld.me/notebooks.html)_

You will be given a bunch of instructions without (fully) understanding what they do yet. That's ok.

## Setup

[Install](https://computing-in-context.afeld.me/notebooks.html#installing-packages) [`lxml`](https://anaconda.org/conda-forge/lxml) and [`requests`](https://anaconda.org/conda-forge/requests) packages.

## Scraping

Common tools:

- [Beautiful Soup package](https://realpython.com/beautiful-soup-web-scraper-python/)
- [pandas' `read_html()`](https://pandas.pydata.org/docs/user_guide/io.html#html)

Beautiful Soup package: parses HTML and XML and extracts data from them.  
pandas' `read_html()` function: reads HTML tables directly from a web page and converts them into pandas DataFrames.  


Pull [Wikipedia's list of countries by area](https://en.wikipedia.org/wiki/List_of_countries_and_dependencies_by_area#Countries_and_dependencies_by_area) into a DataFrame using `read_html()`.

In [2]:
import pandas as pd  # data manipulation
import lxml          # HTML/XML parser
import html5lib      # another HTML parser
import requests      # HTTP requests

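Target page: the code pulls tables from the Wikipedia page that lists countries and dependencies by area.  
Filtering tables: `match="Country / dependency"` keeps only the tables whose header or cells contain the text "Country / dependency".  
If the page has several tables, `match` narrows the result to the ones we need.  
All matching tables are stored in the `tables` list.  
`tables[0]` picks the first matching table.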
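
The effect of `match` can be seen without touching the network. The sketch below is only an illustration: the two-table HTML snippet is made up (the "Planet" table is hypothetical, and the country rows reuse two values from the Wikipedia table), and it assumes `lxml` is installed so that `read_html()` has a parser available.

```python
from io import StringIO

import pandas as pd

# Hypothetical page with two tables; only the second mentions "Country / dependency".
html = """
<table>
  <tr><th>Planet</th><th>Moons</th></tr>
  <tr><td>Mars</td><td>2</td></tr>
</table>
<table>
  <tr><th>Country / dependency</th><th>% water</th></tr>
  <tr><td>Russia</td><td>4.2</td></tr>
  <tr><td>Canada</td><td>8.9</td></tr>
</table>
"""

# Without match, every table on the page is returned.
all_tables = pd.read_html(StringIO(html))

# With match, only tables containing the given text are kept.
matched = pd.read_html(StringIO(html), match="Country / dependency")

print(len(all_tables), len(matched))
print(matched[0])
```

Because `read_html()` always returns a list, indexing with `[0]` is still needed even when only one table matches.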

In [3]:
tables = pd.read_html(
    "https://en.wikipedia.org/wiki/List_of_countries_and_dependencies_by_area",
    match="Country / dependency",
)
countries = tables[0]
countries

Unnamed: 0.1,Unnamed: 0,Country / dependency,Total in km2 (mi2),Land in km2 (mi2),Water in km2 (mi2),% water,Unnamed: 6
0,–,Earth,"510,072,000 (196,940,000)","148,940,000 (57,506,000)","361,132,000 (139,434,000)",70.8,
1,1,Russia,"17,098,246 (6,601,667)","16,376,870 (6,323,142)","721,380 (278,530)",4.2,[b]
2,–,Antarctica,"14,200,000 (5,480,000)","14,200,000 (5,480,000)",0,0.0,[c]
3,2,Canada,"9,984,670 (3,855,100)","9,093,507 (3,511,021)","891,163 (344,080)",8.9,[d]
4,3/4 [e],China,"9,596,960 (3,705,410)","9,326,410 (3,600,950)","270,550 (104,460)",2.8,[f]
...,...,...,...,...,...,...,...
259,–,Ashmore and Cartier Islands (Australia),5.0 (1.9),5.0 (1.9),0,0.0,[q]
260,–,Coral Sea Islands (Australia),3.0 (1.2),3.0 (1.2),0,0.0,[db]
261,–,Spratly Islands (disputed),2.0 (0.77),2.0 (0.77),0,0.0,[54]
262,194,Monaco,2.0 (0.77),2.0 (0.77),0,0.0,[dc]


## FEC data

We'll make an API call in the browser.

1. Visit https://www.fec.gov/data/candidates/
1. [Open Developer Tools](https://developer.mozilla.org/en-US/docs/Learn/Common_questions/What_are_browser_developer_tools#how_to_open_the_devtools_in_your_browser).
1. Reload the page.
1. In the Network tab's request list:
   1. Filter to Fetch/XHR/AJAX (terminology will differ by browser)
   1. Right-click the API call row.
1. Click `Open in New Tab`. You will see an error.
1. In the URL bar, replace the `api_key` value with `DEMO_KEY`. The URL should therefore contain `api_key=DEMO_KEY`.

You should see a big wall of JSON data.


This is the same request that the code below makes.

### Querying

Retrieve candidates who have raised funds. [API documentation.](https://api.open.fec.gov/developers/)

In [4]:
import requests

candidates_with_funds = {
    "api_key": "DEMO_KEY",
    "has_raised_funds": "true",
}
response = requests.get("https://api.open.fec.gov/v1/candidates/", params=candidates_with_funds)
data = response.json()
data

{'api_version': '1.0',
 'pagination': {'count': 26220,
  'is_count_exact': True,
  'page': 1,
  'pages': 1311,
  'per_page': 20},
 'results': [{'active_through': 2022,
   'candidate_id': 'H2CO07170',
   'candidate_inactive': False,
   'candidate_status': 'P',
   'cycles': [2022, 2024],
   'district': '07',
   'district_number': 7,
   'election_districts': ['07'],
   'election_years': [2022],
   'federal_funds_flag': False,
   'first_file_date': '2021-12-27',
   'has_raised_funds': True,
   'inactive_election_years': None,
   'incumbent_challenge': 'O',
   'incumbent_challenge_full': 'Open seat',
   'last_f2_date': '2022-08-10',
   'last_file_date': '2022-08-10',
   'load_date': '2023-03-09T10:16:03',
   'name': 'AADLAND, ERIK',
   'office': 'H',
   'office_full': 'House',
   'party': 'REP',
   'party_full': 'REPUBLICAN PARTY',
   'state': 'CO'},
  {'active_through': 2022,
   'candidate_id': 'H2UT03280',
   'candidate_inactive': False,
   'candidate_status': 'C',
   'cycles': [2022],
  

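- `api_key`: used for authentication. `DEMO_KEY` is a sample key; for real use you may need to replace it with your own key.
- `has_raised_funds`: a query filter that tells the API to return only candidates who have raised funds.
- `"https://api.open.fec.gov/v1/candidates/"`: the API's base URL.
- `params=candidates_with_funds`: the query parameters.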
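
To see how the base URL and the query parameters fit together, here is a minimal sketch that builds the full URL by hand with the standard library instead of `requests`. It makes no network call; `base_url` and `full_url` are names chosen for this illustration. The result should look like the URL you edited in the browser's URL bar.

```python
from urllib.parse import urlencode

base_url = "https://api.open.fec.gov/v1/candidates/"
candidates_with_funds = {
    "api_key": "DEMO_KEY",
    "has_raised_funds": "true",
}

# urlencode() turns the dictionary into key=value pairs joined by "&".
full_url = base_url + "?" + urlencode(candidates_with_funds)
print(full_url)
```

Passing `params=` to `requests.get()` does this encoding for you, which avoids mistakes when values contain spaces or other special characters.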

Turn those results into a DataFrame.

In [5]:
results = data.get("results", [])
df = pd.json_normalize(results)
print(df.head())

   active_through candidate_id  candidate_inactive candidate_status  \
0            2022    H2CO07170               False                P   
1            2022    H2UT03280               False                C   
2            2018    S2UT00229               False                P   
3            2020    H0TX22260               False                C   
4            1978    H6PA16106               False                P   

                           cycles district  district_number  \
0                    [2022, 2024]       07                7   
1                          [2022]       03                3   
2  [2012, 2014, 2016, 2018, 2020]       00                0   
3                          [2020]       22               22   
4              [1976, 1978, 1980]       16               16   

  election_districts election_years  federal_funds_flag  ...  \
0               [07]         [2022]               False  ...   
1               [03]         [2022]               False  ...   
2 

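`results = data.get("results", [])` is a common dictionary operation in Python: it looks up the value for a key and supplies a default to return if the key is missing.  
`data`: a dictionary object.  

`get("results", [])`:  
`get()` tries to fetch the value for the key `"results"` from the dictionary `data`.  
If `data` has no `"results"` key, it returns the default value `[]` (an empty list).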
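
A small sketch of the difference between `get()` with a default and plain square-bracket indexing. The two dictionaries are made-up stand-ins for API responses, not real FEC output.

```python
# Hypothetical responses: one with a "results" key, one without (e.g. an error body).
good = {"results": [{"name": "AADLAND, ERIK"}]}
bad = {"error": "something went wrong"}

good_results = good.get("results", [])
bad_results = bad.get("results", [])  # key is missing, so the default [] is returned

print(len(good_results))
print(bad_results)

# Square-bracket indexing raises KeyError when the key is missing.
try:
    bad["results"]
except KeyError as err:
    print("missing key:", err)
```

Using `[]` as the default means later code, such as `pd.json_normalize(results)`, receives a list either way.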

## Pagination

Get _all_ [NYC film permits](https://data.cityofnewyork.us/City-Government/Film-Permits/tg4x-b46p/about_data) through [the API](https://dev.socrata.com/foundry/data.cityofnewyork.us/tg4x-b46p). [Documentation on paging.](https://dev.socrata.com/docs/paging)

**Hints**

You'll probably want to create DataFrames for each page, then "concatenate" them. Here's a structure you can start with:

Pagination: the API returns results one page at a time rather than all at once.

In [None]:
# in a loop
#     get the first/next page of data
#     combine with the data that's already been retrieved
#     if there are fewer than the default number of records returned, stop the loop
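
One possible shape for that loop, shown as a sketch rather than the expected solution. To keep it runnable offline, the API is replaced by a hypothetical `fake_get_page(limit, offset)` function serving 25 made-up records; `fetch_all` and `page_size` are also names invented for this example.

```python
import pandas as pd


def fetch_all(get_page, page_size):
    """Collect every page from get_page(limit, offset) into one DataFrame."""
    pages = []
    offset = 0
    while True:
        records = get_page(page_size, offset)
        if records:
            pages.append(pd.DataFrame(records))
        # A short (or empty) page means there is nothing left to request.
        if len(records) < page_size:
            break
        offset += page_size
    # Combine the per-page DataFrames; return an empty one if nothing came back.
    return pd.concat(pages, ignore_index=True) if pages else pd.DataFrame()


# Stand-in for the API: 25 fake records served in slices.
fake_records = [{"id": i} for i in range(25)]


def fake_get_page(limit, offset):
    return fake_records[offset : offset + limit]


permits = fetch_all(fake_get_page, page_size=10)
print(len(permits))
```

For the real dataset, `get_page` would presumably wrap a `requests.get(...)` call that passes the limit and offset as the `$limit` and `$offset` query parameters described in the paging documentation, and return `response.json()`. The exact endpoint URL is left for you to find in the API documentation linked above.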

## GitHub

If you miss a step, don't worry: some are more important than others, and it's possible to do them out of order. Ask for help if you need it.

1. [Sign up.](https://github.com/signup)
    - If you have an account already, [sign in](https://github.com/login).
    - A [Free plan](https://github.com/pricing) is sufficient.
1. [Open](https://code.visualstudio.com/docs/editor/workspaces#_how-do-i-open-a-vs-code-workspace) the folder/repository from [Lecture 23](https://computing-in-context.afeld.me/lecture_23.html) in VSCode.
1. Click [`Publish Branch`](https://code.visualstudio.com/docs/sourcecontrol/intro-to-git#_using-branches).
1. [Allow signing in with GitHub](https://code.visualstudio.com/docs/sourcecontrol/github#_authenticating-with-an-existing-repository), if prompted.
1. Click `Publish to GitHub public repository`.
    1. [Public vs. private](https://docs.github.com/en/repositories/creating-and-managing-repositories/about-repositories#about-repository-visibility)
1. [Have VSCode periodically fetch](https://code.visualstudio.com/docs/sourcecontrol/overview#_remotes), if asked.
1. Visit the repository on GitHub.
    1. Click into the files.
1. Make a change in the repository in VSCode (locally).
1. Commit the change.
1. [Push (a.k.a. "sync")](https://code.visualstudio.com/docs/sourcecontrol/overview#_remotes) the change to GitHub.
1. Open the repository on GitHub, which should look something like this:

    ![repository file list](img/repo_initial.png)

## [JupyterBook](https://jupyterbook.org/)

- [Static website builder](https://jupyterbook.org/en/stable/publish/web.html)
- Used to build the [course site](https://computing-in-context.afeld.me/)
- Setting up for [Project 3](https://computing-in-context.afeld.me/project_3.html), which we'll [kick off in Lecture 24](https://computing-in-context.afeld.me/lecture_24.html#project-3)
- From here on, we recommend following [these instructions on the web](https://computing-in-context.afeld.me/lab_12.html#jupyterbook) rather than through a downloaded notebook, which:
    - Makes copying-and-pasting easier
    - Ensures you're seeing the latest version of the instructions

### Install

[Install `jupyter-book` via Anaconda.](https://computing-in-context.afeld.me/notebooks.html#installing-packages) Make sure you've done the `conda-forge` step.

The install might take a while; you can continue up until [building the site](#build-the-site) in the meantime.

### [Config](https://jupyterbook.org/en/stable/customize/config.html)

Using VSCode (you _could_ use JupyterLab), create a minimal `_config.yml` file containing the following:

```yaml
title: Computing in Context - NAME
author: NAME

only_build_toc_files: true

execute:
  execute_notebooks: "off"

sphinx:
  config:
    # https://myst-parser.readthedocs.io/en/latest/syntax/cross-referencing.html#implicit-targets
    myst_heading_anchors: 4
    # https://jupyterbook.org/en/stable/interactive/interactive.html#plotly
    html_js_files:
      - https://cdnjs.cloudflare.com/ajax/libs/require.js/2.3.4/require.min.js
    suppress_warnings: ["mystnb.unknown_mime_type"]
```

Replace the `NAME` with your name. [More about YAML](https://www.cloudbees.com/blog/yaml-tutorial-everything-you-need-get-started), which rhymes with "mammal".

### [Table of contents](https://jupyterbook.org/customize/toc.html)

1. Move/copy the Lab 12 notebook to this folder, if it's not there already.
1. Create a `_toc.yml` containing the following:

    ```yaml
    format: jb-book
    root: lab_12
    chapters:
      - file: project_3
    ```

### [Build the site](https://jupyterbook.org/en/stable/start/build.html#build-your-books-html)

[Open a terminal](https://code.visualstudio.com/docs/terminal/getting-started) and run:

```sh
jupyter-book build --all .
```

This converts your notebooks to HTML.

#### Troubleshooting

If you get an error like `jupyter-book: command not found`:

1. Double-check you've done [the install](#install).
1. Windows: Confirm you're using Git BASH, not [Command Prompt](https://www.lifewire.com/command-prompt-2625840) or [PowerShell](https://learn.microsoft.com/en-us/powershell/scripting/overview).
1. Run `conda activate base` to [activate the environment](https://docs.conda.io/projects/conda/en/latest/user-guide/tasks/manage-environments.html#activating-an-environment).

### View the site (locally)

1. The build command will output "Your book's HTML pages are here … paste this line directly into your browser bar".
1. Copy that [`file://` URL](https://en.wikipedia.org/wiki/File_URI_scheme) into your web browser.
1. You should see your notebook as a JupyterBook site.

    ![JupyterBook](img/book_local.png)

## Commit changes

1. [View the diff](https://code.visualstudio.com/docs/sourcecontrol/overview#_viewing-diffs)
1. [Ignore](https://docs.github.com/en/get-started/getting-started-with-git/ignoring-files#configuring-ignored-files-for-a-single-repository) generated files: Create a `.gitignore` file containing the following:

    ```
    .ipynb_checkpoints/
    _build/
    ```

1. View the diff again
1. Commit
1. Push
1. The GitHub repository should then look like this:

    ![repository with JupyterBook files](img/repo_book.png)

---

Submit via Gradescope.