# Cheat Sheet: APIs and Data Collection

In [1]:
from bs4 import BeautifulSoup
import requests
import warnings

## 1. `BeautifulSoup()`

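<p>Parse the HTML content of a web page using BeautifulSoup. The second argument names the parser: "html.parser" ships with Python, while alternatives such as "lxml" or "html5lib" must be installed separately.</p>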

**Syntax**:
```python
soup = BeautifulSoup(html, "html.parser")
```

In [2]:
example1 = "<a href=\"https://example.com\" class=\"external-link\" target=\"_blank\">Visit Example</a>"
bs = BeautifulSoup(example1, "html.parser")
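
As a minimal sketch of the parser argument, the snippet below parses a small, hypothetical HTML string (not taken from any real page) with the built-in parser and reads two values back from the resulting tree.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML used only for illustration.
html = "<html><head><title>Demo page</title></head><body><p>Hello</p></body></html>"

# "html.parser" ships with Python, so no extra install is needed.
soup = BeautifulSoup(html, "html.parser")

title_text = soup.title.string   # text inside <title>
first_paragraph = soup.p.string  # text inside the first <p>
print(title_text, first_paragraph)
```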

## 2. `find()`

<p>Find the first HTML element that matches the specified tag and attributes.</p>

**Syntax**:
```python
element = soup.find(tag, attrs)
```

In [3]:
element1 = bs.find("a")

In [4]:
element1

<a class="external-link" href="https://example.com" target="_blank">Visit Example</a>
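
The following sketch, which assumes a small hypothetical HTML fragment, shows how the `attrs` argument narrows the match and what `find()` returns when nothing matches.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML with two links that differ by class.
html = """
<a href="/home" class="internal-link">Home</a>
<a href="https://example.com" class="external-link">Visit Example</a>
"""
soup = BeautifulSoup(html, "html.parser")

# Without attrs, find() returns the first matching tag.
first_link = soup.find("a")

# With attrs, only tags whose attributes match are considered.
external = soup.find("a", attrs={"class": "external-link"})

# find() returns None when nothing matches, so check before using the result.
missing = soup.find("table")

print(first_link["href"], external["href"], missing)
```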

## 3. Accessing element attribute

<p>Access the value of a specific attribute of an HTML element.</p>

**Syntax**:
```python
attribute = element["attribute_name"]
```

In [5]:
href = element1["href"]

In [6]:
href

'https://example.com'
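
Square-bracket access raises `KeyError` when the attribute is missing. The sketch below, using a hypothetical link with an extra class, contrasts it with `get()` and shows how multi-valued attributes such as `class` are returned.

```python
from bs4 import BeautifulSoup

# Hypothetical link used only for illustration.
html = '<a href="https://example.com" class="external-link btn" target="_blank">Visit Example</a>'
link = BeautifulSoup(html, "html.parser").find("a")

href = link["href"]        # raises KeyError if the attribute is absent
title = link.get("title")  # returns None if the attribute is absent
classes = link["class"]    # multi-valued attributes come back as a list
all_attrs = link.attrs     # dictionary of every attribute on the tag

print(href, title, classes, all_attrs)
```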

## 4. `find_all()`

<p>Find all HTML elements that match the specified tag and attributes.</p>

**Syntax**:
```python
elements = soup.find_all(tag, attrs)
```

In [7]:
response = requests.get("https://www.google.com")
bs = BeautifulSoup(response.text, "html.parser")
element2 = bs.find_all("div")

In [8]:
element2

[<div id="mngb"><div id="gbar"><nobr><b class="gb1">搜尋</b> <a class="gb1" href="https://www.google.com/imghp?hl=zh-TW&amp;tab=wi">圖片</a> <a class="gb1" href="https://maps.google.com.hk/maps?hl=zh-TW&amp;tab=wl">地圖</a> <a class="gb1" href="https://play.google.com/?hl=zh-TW&amp;tab=w8">Play</a> <a class="gb1" href="https://www.youtube.com/?tab=w1">YouTube</a> <a class="gb1" href="https://news.google.com/?tab=wn">新聞</a> <a class="gb1" href="https://mail.google.com/mail/?tab=wm">Gmail</a> <a class="gb1" href="https://drive.google.com/?tab=wo">雲端硬碟</a> <a class="gb1" href="https://www.google.com.hk/intl/zh-TW/about/products?tab=wh" style="text-decoration:none"><u>更多</u> »</a></nobr></div><div id="guser" width="100%"><nobr><span class="gbi" id="gbn"></span><span class="gbf" id="gbf"></span><span id="gbe"></span><a class="gb4" href="http://www.google.com.hk/history/optout?hl=zh-TW">網頁記錄</a> | <a class="gb4" href="/preferences?hl=zh-TW">設定</a> | <a class="gb4" href="https://accounts.google.com
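
Because live pages change and produce long output, a small offline fragment is easier to reason about. This sketch assumes hypothetical markup loosely modelled on the navigation bar above and shows filtering by attribute, limiting the result count, and collecting `href` values.

```python
from bs4 import BeautifulSoup

# Hypothetical navigation markup used only for illustration.
html = """
<div id="nav">
  <a class="gb1" href="/images">Images</a>
  <a class="gb1" href="/maps">Maps</a>
  <a class="gb4" href="/settings">Settings</a>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

all_links = soup.find_all("a")                          # every <a> tag
gb1_links = soup.find_all("a", attrs={"class": "gb1"})  # only <a class="gb1">
first_two = soup.find_all("a", limit=2)                 # stop after two matches
hrefs = [a["href"] for a in all_links]                  # pull one attribute from each

print(hrefs)
```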

## 5. [Deprecated since v3.0.0] `findChildren()`

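<p>Find all descendant elements of an HTML element, not only its direct children. This method is a deprecated alias of find_all(); prefer find_all() in new code.</p>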

**Syntax**:
```python
children = element.findChildren()
```

In [9]:
element3 = bs.find("div")
with warnings.catch_warnings(): # ignore warnings
    warnings.simplefilter("ignore", category=DeprecationWarning)
    element3_children = element3.findChildren()

In [10]:
element3_children

[<div id="gbar"><nobr><b class="gb1">搜尋</b> <a class="gb1" href="https://www.google.com/imghp?hl=zh-TW&amp;tab=wi">圖片</a> <a class="gb1" href="https://maps.google.com.hk/maps?hl=zh-TW&amp;tab=wl">地圖</a> <a class="gb1" href="https://play.google.com/?hl=zh-TW&amp;tab=w8">Play</a> <a class="gb1" href="https://www.youtube.com/?tab=w1">YouTube</a> <a class="gb1" href="https://news.google.com/?tab=wn">新聞</a> <a class="gb1" href="https://mail.google.com/mail/?tab=wm">Gmail</a> <a class="gb1" href="https://drive.google.com/?tab=wo">雲端硬碟</a> <a class="gb1" href="https://www.google.com.hk/intl/zh-TW/about/products?tab=wh" style="text-decoration:none"><u>更多</u> »</a></nobr></div>,
 <nobr><b class="gb1">搜尋</b> <a class="gb1" href="https://www.google.com/imghp?hl=zh-TW&amp;tab=wi">圖片</a> <a class="gb1" href="https://maps.google.com.hk/maps?hl=zh-TW&amp;tab=wl">地圖</a> <a class="gb1" href="https://play.google.com/?hl=zh-TW&amp;tab=w8">Play</a> <a class="gb1" href="https://www.youtube.com/?tab=w1">You
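
A non-deprecated way to get the same result is `find_all(True)`, and passing `recursive=False` restricts it to direct children. The sketch below assumes a small hypothetical fragment to make the difference visible.

```python
from bs4 import BeautifulSoup

# Hypothetical nested markup used only for illustration.
html = '<div id="outer"><div id="inner"><b>Search</b></div><span>Sign in</span></div>'
outer = BeautifulSoup(html, "html.parser").find("div")

# True matches every tag; by default the search covers all descendants.
descendants = outer.find_all(True)

# recursive=False restricts the search to direct children only.
children = outer.find_all(True, recursive=False)

descendant_names = [tag.name for tag in descendants]
child_names = [tag.name for tag in children]
print(descendant_names, child_names)
```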

## 6. `find_next_sibling()`

<p>Find the next sibling element in the DOM.</p>

**Syntax**:
```python
sibling = element.find_next_sibling()
```

In [11]:
element3.find_next_sibling()

<center><br clear="all" id="lgpd"/><div id="XjhHGf"><img alt="Google" height="92" id="hplogo" src="/images/branding/googlelogo/1x/googlelogo_white_background_color_272x92dp.png" style="padding:28px 0 14px" width="272"/><br/><br/></div><form action="/search" name="f"><table cellpadding="0" cellspacing="0"><tr valign="top"><td width="25%"> </td><td align="center" nowrap=""><input name="ie" type="hidden" value="ISO-8859-1"/><input name="hl" type="hidden" value="zh-HK"/><input name="source" type="hidden" value="hp"/><input name="biw" type="hidden"/><input name="bih" type="hidden"/><div class="ds" style="height:32px;margin:4px 0"><input autocomplete="off" class="lst" maxlength="2048" name="q" size="57" style="margin:0;padding:5px 8px 0 6px;vertical-align:top;color:#000" title="Google 搜尋" value=""/></div><br style="line-height:0"/><span class="ds"><span class="lsbb"><input class="lsb" name="btnG" type="submit" value="Google 搜尋"/></span></span><span class="ds"><span class="lsbb"><input class=
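
The sibling lookup is easier to follow on a short list. This sketch assumes a hypothetical three-item list and also shows the plural `find_next_siblings()` and the `None` result at the end of the list.

```python
from bs4 import BeautifulSoup

# Hypothetical list used only for illustration.
html = "<ul><li>one</li><li>two</li><li>three</li></ul>"
first = BeautifulSoup(html, "html.parser").find("li")

second = first.find_next_sibling()      # next tag at the same level
later = first.find_next_siblings("li")  # every following <li> sibling
last = later[-1]
after_last = last.find_next_sibling()   # None: nothing follows the last item

print(second.text, [li.text for li in later], after_last)
```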

## 7. `parent`

<p>Access the parent element in the Document Object Model (DOM).</p>

**Syntax**:
```python
parent = element.parent
```

In [12]:
parent = element3.parent

In [13]:
parent

<body bgcolor="#fff"><script nonce="TXkFwUOHk0ET3dQZGxeH2Q">(function(){var src='/images/nav_logo229.png';var iesg=false;document.body.onload = function(){window.n && window.n();if (document.images){new Image().src=src;}
if (!iesg){document.f&&document.f.q.focus();document.gbqf&&document.gbqf.q.focus();}
}
})();</script><div id="mngb"><div id="gbar"><nobr><b class="gb1">搜尋</b> <a class="gb1" href="https://www.google.com/imghp?hl=zh-TW&amp;tab=wi">圖片</a> <a class="gb1" href="https://maps.google.com.hk/maps?hl=zh-TW&amp;tab=wl">地圖</a> <a class="gb1" href="https://play.google.com/?hl=zh-TW&amp;tab=w8">Play</a> <a class="gb1" href="https://www.youtube.com/?tab=w1">YouTube</a> <a class="gb1" href="https://news.google.com/?tab=wn">新聞</a> <a class="gb1" href="https://mail.google.com/mail/?tab=wm">Gmail</a> <a class="gb1" href="https://drive.google.com/?tab=wo">雲端硬碟</a> <a class="gb1" href="https://www.google.com.hk/intl/zh-TW/about/products?tab=wh" style="text-decoration:none"><u>更多</u> »</a>
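
As a compact illustration, the sketch below walks upward from a link in a hypothetical document using `parent`, the `parents` generator, and `find_parent()`.

```python
from bs4 import BeautifulSoup

# Hypothetical document used only for illustration.
html = "<html><body><div id='mngb'><a href='/x'>link</a></div></body></html>"
link = BeautifulSoup(html, "html.parser").find("a")

direct_parent = link.parent                    # the enclosing <div>
ancestors = [tag.name for tag in link.parents] # walks up toward the document root
body = link.find_parent("body")                # nearest ancestor with a given tag name

print(direct_parent.name, ancestors[:3], body.name)
```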

## 8. `select()`

<p>Select HTML elements from the parsed HTML using a CSS selector.</p>

**Syntax**:
```python
elements = soup.select(selector)
```

In [14]:
bs.select("br")

[<br clear="all" id="lgpd"/>, <br/>, <br/>, <br style="line-height:0"/>, <br/>]
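
`select()` always returns a list, while `select_one()` returns the first match or `None`. The sketch below assumes a hypothetical fragment and tries a few common CSS selector forms.

```python
from bs4 import BeautifulSoup

# Hypothetical markup used only for illustration.
html = """
<div id="main">
  <p class="intro">First</p>
  <p>Second</p>
  <a href="https://example.com">Link</a>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

paragraphs = soup.select("p")            # by tag name
intro = soup.select("p.intro")           # by tag and class
child_links = soup.select("#main > a")   # direct children of the element with id="main"
first_link = soup.select_one("a[href]")  # first <a> that has an href attribute
no_table = soup.select_one("table")      # None when nothing matches

print(len(paragraphs), intro[0].text, first_link["href"], no_table)
```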

## 9. tags for `find()` and `find_all()`

<p>Specify any valid HTML tag as the tag parameter to search for elements of that type. Here are some common HTML tags that you can use with the tag parameter.</p>

<p>Tag Examples:</p>
<ul>
    <li><b>a:</b> Find anchor ("a") tags.</li>
    <li><b>p:</b> Find paragraph ("p") tags.</li>
    <li><b>h1-h6:</b> Find heading tags from level 1 to 6 ("h1", "h2", "h3", "h4", "h5", "h6").</li>
    <li><b>table:</b> Find table ("table") tags.</li>
    <li><b>tr:</b> Find table row ("tr") tags.</li>
    <li><b>td:</b> Find table cell ("td") tags.</li>
    <li><b>th:</b> Find table header cell ("th") tags.</li>
    <li><b>img:</b> Find image ("img") tags.</li>
    <li><b>form:</b> Find form ("form") tags.</li>
    <li><b>button:</b> Find button ("button") tags.</li>
</ul>
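
Several tag names can also be searched at once by passing a list. The sketch below assumes a hypothetical fragment containing headings and a one-row table.

```python
from bs4 import BeautifulSoup

# Hypothetical markup used only for illustration.
html = (
    "<h1>Title</h1><p>Intro</p><h2>Section</h2>"
    "<table><tr><th>Name</th><td>Ada</td></tr></table>"
)
soup = BeautifulSoup(html, "html.parser")

# A list of names matches any of them, in document order.
headings = soup.find_all(["h1", "h2"])
cells = soup.find_all(["th", "td"])

heading_texts = [h.text for h in headings]
cell_texts = [c.text for c in cells]
print(heading_texts, cell_texts)
```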

## 10. `text`

<p>Retrieve the text content of an HTML element.</p>

**Syntax**:
```python
text = element.text
```

In [15]:
element3.text

'搜尋 圖片 地圖 Play YouTube 新聞 Gmail 雲端硬碟 更多 »網頁記錄 | 設定 | 登入'
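
The `text` property concatenates every string inside the element exactly as written, which can leave stray whitespace or run words together. As a sketch on a hypothetical fragment, `get_text()` with `separator` and `strip` gives more control.

```python
from bs4 import BeautifulSoup

# Hypothetical markup with uneven whitespace, used only for illustration.
html = "<div><p> Hello </p><p>World</p></div>"
div = BeautifulSoup(html, "html.parser").find("div")

raw = div.text                                   # strings joined as-is
clean = div.get_text(separator=" ", strip=True)  # strip each piece, join with a space
pieces = list(div.stripped_strings)              # the individual stripped strings

print(repr(raw), repr(clean), pieces)
```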

## 11. `delete()`

<p>Send a DELETE request to remove a specified resource from the server.</p>

**Syntax**:
```python
response = requests.delete(url)
```

In [16]:
response = requests.delete("https://httpbin.org/delete", headers={"X-Custom-Header": "test"})
response.text

'{\n  "args": {}, \n  "data": "", \n  "files": {}, \n  "form": {}, \n  "headers": {\n    "Accept": "*/*", \n    "Accept-Encoding": "gzip, deflate", \n    "Content-Length": "0", \n    "Host": "httpbin.org", \n    "User-Agent": "python-requests/2.32.4", \n    "X-Amzn-Trace-Id": "Root=1-686a0b67-11fd7d37793136e06f2ebcc2", \n    "X-Custom-Header": "test"\n  }, \n  "json": null, \n  "origin": "103.151.172.32", \n  "url": "https://httpbin.org/delete"\n}\n'

## 12. `get()`

<p>Perform a GET request to retrieve data from a specified URL. GET requests are typically used for reading data from an API. The response variable will contain the server's response, which you can process further.</p>

**Syntax**:
```python
response = requests.get(url)
```

In [17]:
response = requests.get("https://www.bing.com")
response.text
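
In practice a GET call is often wrapped with a timeout and basic error handling. The helper below is a hypothetical sketch (the name `fetch_text` and the 10-second default are assumptions, not part of the original notebook).

```python
from typing import Optional

import requests


def fetch_text(url: str, timeout: float = 10.0) -> Optional[str]:
    """Return the response body as text, or None if the request fails."""
    try:
        # Without a timeout, requests can wait indefinitely for a slow server.
        response = requests.get(url, timeout=timeout)
        response.raise_for_status()  # turn 4xx/5xx responses into exceptions
    except requests.exceptions.RequestException as error:
        print(f"Request failed: {error}")
        return None
    return response.text
```

Calling `fetch_text("https://www.bing.com")` would return the page HTML on success and `None` on a network or HTTP error.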



## 13. `post()`

<p>Send a POST request to a specified URL with data. POST requests are typically used to create or update resources on the server. The data parameter contains the data to send to the server; a dictionary passed this way is sent form-encoded, while the json parameter sends a JSON body.</p>

**Syntax**:
```python
response = requests.post(url, data)
```

In [18]:
response = requests.post("https://httpbin.org/post", {"Name": "John", "Age": 25})
response.text

'{\n  "args": {}, \n  "data": "", \n  "files": {}, \n  "form": {\n    "Age": "25", \n    "Name": "John"\n  }, \n  "headers": {\n    "Accept": "*/*", \n    "Accept-Encoding": "gzip, deflate", \n    "Content-Length": "16", \n    "Content-Type": "application/x-www-form-urlencoded", \n    "Host": "httpbin.org", \n    "User-Agent": "python-requests/2.32.4", \n    "X-Amzn-Trace-Id": "Root=1-686a0b68-1f98488c3d782d103563b282"\n  }, \n  "json": null, \n  "origin": "103.151.172.32", \n  "url": "https://httpbin.org/post"\n}\n'
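
The difference between `data=` and `json=` can be inspected without sending anything by preparing the request first. This is a minimal sketch using `requests.Request(...).prepare()`; the payload mirrors the example above.

```python
import json

import requests

payload = {"Name": "John", "Age": 25}
url = "https://httpbin.org/post"

# prepare() builds the request without sending it, so nothing touches the network.
form_request = requests.Request("POST", url, data=payload).prepare()
json_request = requests.Request("POST", url, json=payload).prepare()

form_type = form_request.headers["Content-Type"]  # form-encoded body
json_type = json_request.headers["Content-Type"]  # JSON body
decoded = json.loads(json_request.body)           # JSON body round-trips to the payload

print(form_type, form_request.body)
print(json_type, json_request.body)
```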

## 14. `put()`

<p>Send a PUT request to update data on the server. PUT requests update an existing resource with the data provided in the data parameter. A dictionary passed as data is sent form-encoded; use the json parameter to send a JSON body.</p>

**Syntax**:
```python
response = requests.put(url, data)
```

In [19]:
response = requests.put("https://httpbin.org/put", {"Name": "Jennifer", "Age": 20})
response.text

'{\n  "args": {}, \n  "data": "", \n  "files": {}, \n  "form": {\n    "Age": "20", \n    "Name": "Jennifer"\n  }, \n  "headers": {\n    "Accept": "*/*", \n    "Accept-Encoding": "gzip, deflate", \n    "Content-Length": "20", \n    "Content-Type": "application/x-www-form-urlencoded", \n    "Host": "httpbin.org", \n    "User-Agent": "python-requests/2.32.4", \n    "X-Amzn-Trace-Id": "Root=1-686a0b69-5a1170e526cbea0a030cc1b5"\n  }, \n  "json": null, \n  "origin": "103.151.172.32", \n  "url": "https://httpbin.org/put"\n}\n'

## 15. Headers

<p>Include custom headers in the request. Headers can provide additional information to the server, such as authentication tokens or content types.</p>

**Syntax**:
```python
headers = {"Header-Name": "value"}
```

In [20]:
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36"
}
response = requests.get("https://httpbin.org/get", headers=headers)
response.text

'{\n  "args": {}, \n  "headers": {\n    "Accept": "*/*", \n    "Accept-Encoding": "gzip, deflate", \n    "Host": "httpbin.org", \n    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36", \n    "X-Amzn-Trace-Id": "Root=1-686a0b6a-29d4ba345a95b75867828a1f"\n  }, \n  "origin": "103.151.172.32", \n  "url": "https://httpbin.org/get"\n}\n'
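
Headers are also where authentication tokens usually go. The sketch below prepares a request offline to show what would be sent; the `Authorization` value is a placeholder, not a real token.

```python
import requests

headers = {
    "Accept": "application/json",
    "Authorization": "Bearer YOUR_TOKEN_HERE",  # placeholder value for illustration
}

# prepare() builds the request without sending it.
prepared = requests.Request("GET", "https://httpbin.org/get", headers=headers).prepare()

# Header lookups on the prepared request are case-insensitive.
accept = prepared.headers["accept"]
auth = prepared.headers["AUTHORIZATION"]

print(accept, auth)
```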

## 16. Query parameters

<p>Pass query parameters in the URL to filter or customize the request. Query parameters specify conditions or limits for the requested data.</p>

**Syntax**:
```python
params = {"param_name": "value"}
```

In [21]:
params = {"page": 1, "per_page": 10}
response = requests.get("https://httpbin.org/get", params=params)
response.text

'{\n  "args": {\n    "page": "1", \n    "per_page": "10"\n  }, \n  "headers": {\n    "Accept": "*/*", \n    "Accept-Encoding": "gzip, deflate", \n    "Host": "httpbin.org", \n    "User-Agent": "python-requests/2.32.4", \n    "X-Amzn-Trace-Id": "Root=1-686a0b6c-742516745b9f8a6c327a740a"\n  }, \n  "origin": "103.151.172.32", \n  "url": "https://httpbin.org/get?page=1&per_page=10"\n}\n'
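
To see how `params` is encoded into the URL without making a request, the request can be prepared offline. In this sketch the `tag` and `q` parameters are hypothetical and only illustrate list values and values that need escaping.

```python
from urllib.parse import parse_qs, urlsplit

import requests

url = "https://httpbin.org/get"

# Same parameters as the example above.
paged = requests.Request("GET", url, params={"page": 1, "per_page": 10}).prepare()

# A list value repeats the key; special characters are escaped automatically.
multi = requests.Request(
    "GET", url, params={"tag": ["python", "api"], "q": "web scraping"}
).prepare()

# Decode the query string back into a dictionary to inspect it.
multi_query = parse_qs(urlsplit(multi.url).query)

print(paged.url)
print(multi.url)
```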

## 17. `json()`

<p>Parse JSON data from the response so you can extract and work with the data returned by the API. The response.json() method converts the JSON response into a Python data structure (usually a dictionary or list) and raises an error if the body is not valid JSON.</p>

**Syntax**:
```python
data = response.json()
```

In [22]:
response.json()

{'args': {'page': '1', 'per_page': '10'},
 'headers': {'Accept': '*/*',
  'Accept-Encoding': 'gzip, deflate',
  'Host': 'httpbin.org',
  'User-Agent': 'python-requests/2.32.4',
  'X-Amzn-Trace-Id': 'Root=1-686a0b6c-742516745b9f8a6c327a740a'},
 'origin': '103.151.172.32',
 'url': 'https://httpbin.org/get?page=1&per_page=10'}
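
Not every response body is valid JSON, so it can help to guard the call. The sketch below builds `Response` objects by hand purely as an offline stand-in: `make_response` is a hypothetical helper, and setting the private `_content` attribute is an illustration shortcut, not something to do with real responses.

```python
import requests


def make_response(body: bytes) -> requests.Response:
    """Build an offline Response for illustration only."""
    response = requests.Response()
    response.status_code = 200
    response._content = body  # private attribute; real responses fill this from the network
    return response


good = make_response(b'{"args": {"page": "1", "per_page": "10"}}')
data = good.json()           # parsed into a dict
page = data["args"]["page"]  # nested values are accessed like any dict

bad = make_response(b"<html>not json</html>")
try:
    bad.json()
    parsed_ok = True
except ValueError:
    # The JSON decode errors raised by requests are subclasses of ValueError.
    parsed_ok = False

print(page, parsed_ok)
```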

## 18. `status_code`

<p>Check the HTTP status code of the response. The status code indicates the result of the request (success, redirection, or error) and can be used for error handling and decision-making in your code.</p>

**Syntax**:
```python
response.status_code
```

In [23]:
response.status_code

200
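
A common pattern is to branch on the status code or let `raise_for_status()` raise on 4xx and 5xx responses. The sketch below uses hand-built `Response` objects as an offline stand-in; `make_response` is a hypothetical helper, not part of the original notebook.

```python
import requests


def make_response(status_code: int) -> requests.Response:
    """Build an offline Response with a given status code, for illustration only."""
    response = requests.Response()
    response.status_code = status_code
    return response


ok_response = make_response(200)
missing_response = make_response(404)

is_ok = ok_response.ok                                   # True for codes below 400
is_ok_code = ok_response.status_code == requests.codes.ok

try:
    missing_response.raise_for_status()                  # raises for 4xx and 5xx codes
    error_raised = False
except requests.exceptions.HTTPError:
    error_raised = True

print(is_ok, is_ok_code, missing_response.ok, error_raised)
```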