# 📘 **How `wget` and `requests` Obtain HTML: Internal Mechanics and Network Processes**

Web scraping begins with a fundamental operation: **retrieving the HTML of a page**. Tools such as **`wget`**, **`curl`**, or Python's **`requests`** library perform this action by communicating with a web server over **HTTP or HTTPS**. Although they all achieve the same goal, their internal behavior and flexibility differ significantly.

This section explains:

* How `wget` retrieves pages
* How Python's `requests` retrieves pages
* When and why each is used in scraping

---

## 🟦 **1. How `wget` Works Internally**

`wget` is a **command-line HTTP client** designed for:

* Recursively downloading sites
* Mirroring servers
* Handling files and binary data

### **Key Characteristics**

| Feature            | Details                        |
| ------------------ | ------------------------------ |
| Environment        | Cross-platform CLI tool (standard on most Linux systems) |
| Header control     | Limited compared to `requests` |
| JavaScript support | None (static pages only)       |
| Cookies            | Basic support                  |
| Redirects          | Automatic                      |
| Speed              | Very fast for simple downloads |

---

### **Example: Download HTML with `wget`**

```bash
wget https://example.com
```

The process:

1. Parse URL
2. Resolve DNS
3. TCP/TLS handshake
4. Send GET request
5. Write received HTML to a file named `index.html`
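
Steps 1 and 4 can be sketched in Python with the standard library alone. This is only an illustration of what a generic HTTP client prepares before it opens a socket, not `wget`'s actual implementation; the helper name `build_get_request` is hypothetical.

```python
from urllib.parse import urlsplit


def build_get_request(url: str):
    """Return (host, port, raw request bytes) for a minimal HTTP/1.1 GET."""
    parts = urlsplit(url)
    # Default ports: 443 for https, 80 for http, unless the URL names one.
    port = parts.port or (443 if parts.scheme == "https" else 80)
    path = parts.path or "/"
    if parts.query:
        path += "?" + parts.query
    request = (
        f"GET {path} HTTP/1.1\r\n"
        f"Host: {parts.hostname}\r\n"
        "Connection: close\r\n"
        "\r\n"
    ).encode("ascii")
    return parts.hostname, port, request


host, port, raw = build_get_request("https://example.com")
print(host, port)
print(raw.decode("ascii"))
```

DNS resolution and the TCP/TLS handshake (steps 2 and 3) would happen between building these bytes and sending them.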

---

### **Custom Headers (limited)**

```bash
wget --header="User-Agent: Mozilla/5.0" https://example.com
```

Headers influence whether the server accepts the request and which version of the content it returns.

---

## 🟦 **2. How Python's `requests` Works Internally (Detailed Focus)**

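`requests` is built on top of:

* **urllib3**, which handles connection pooling and the HTTP exchange
* **`http.client`** (named `httplib` in Python 2), the standard-library module that urllib3 builds on
* Python's **`ssl`** module (usually backed by OpenSSL) for HTTPS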

It abstracts all complexities into a clean API but still performs the full HTTP sequence explained earlier.

---

### 🧠 **Internal Workflow of `requests.get()`**

When you write:

```python
import requests
r = requests.get("https://example.com")
```

Internally:

#### **1. URL is parsed**

`requests` extracts protocol, host, port, path.

#### **2. Session object uses urllib3 to:**

* Perform DNS lookup
* Open TCP connection
* Perform TLS handshake (HTTPS)

#### **3. HTTP GET request is built**

Headers added automatically:

```
User-Agent: python-requests/2.x.x
Accept-Encoding: gzip, deflate
Accept: */*
Connection: keep-alive
```
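
You can inspect these defaults on your own installation without sending anything. The exact `User-Agent` version and the `Accept-Encoding` list depend on the installed `requests` release and on optional compression packages, so treat the values above as typical rather than fixed.

```python
import requests

# default_headers() returns the headers requests attaches when you supply none.
defaults = requests.utils.default_headers()
for name, value in defaults.items():
    print(f"{name}: {value}")
```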

#### **4. Request is sent over TCP**

#### **5. Response is received**

* Status code (e.g., `200`)
* Headers
* Body (HTML or JSON)

#### **6. Response body is decoded**

* gzip or deflate decompressed
* charset decoded
* Text made available as `r.text`
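
The two decoding stages can be reproduced by hand with the standard library. The sketch below simulates a gzip-compressed UTF-8 body instead of contacting a server; the sample HTML is made up for the illustration.

```python
import gzip

# Simulate what arrives on the wire: UTF-8 HTML, gzip-compressed by the server.
original_html = "<html><body><p>Café</p></body></html>"
wire_body = gzip.compress(original_html.encode("utf-8"))

# Stage 1: undo the Content-Encoding (gzip) to get the raw bytes.
raw_bytes = gzip.decompress(wire_body)

# Stage 2: decode the bytes to text using the charset the server declared.
text = raw_bytes.decode("utf-8")
print(text)
```

In `requests`, the bytes after stage 1 correspond to `r.content` and the string after stage 2 corresponds to `r.text`.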

---

### ✔️ Basic HTML Download with `requests`

```python
import requests

url = "https://example.com"
response = requests.get(url)

html = response.text
print(html)
```

---

### ✔️ Adding Headers (important for scraping)

```python
headers = {
    "User-Agent": "Mozilla/5.0",
    "Accept-Language": "en-US,en;q=0.8",
}
response = requests.get(url, headers=headers)
```

Websites often block default Python user agents.

---

### ✔️ Handling Cookies

```python
response = requests.get(url)
cookies = response.cookies
```

Cookies may be required for authenticated or persistent scraping.

---

### ✔️ Connection Pooling with Sessions

```python
s = requests.Session()
r = s.get("https://example.com")
```

Sessions keep:

* Cookies
* Headers
* Connections (keep-alive)

This reduces latency for repeated scraping.

---

## 🟦 **3. When to Use `wget` vs. `requests`**

| Situation                         | Use `wget` | Use `requests`     |
| --------------------------------- | ---------- | ------------------ |
| Simple download of HTML or files  | ✔️         | ✔️                 |
| Web scraping with logic           | ❌         | ✔️                 |
| Custom headers                    | ⚠️ Limited | ✔️ Advanced        |
| Managing cookies                  | Basic      | Full               |
| Need to interact programmatically | ❌         | ✔️                 |
| Handling forms, APIs              | ❌         | ✔️                 |
| Large-scale automation            | ✔️         | ✔️ (with sessions) |

**Conclusion:**
For data science and scraping, **`requests` is the standard tool**, while `wget` is useful for basic, static retrieval.

---

## 📘 **Advanced Guide to `requests`: Configurations, Parameters, and Techniques for Web Scraping**

Python's `requests` is a high-level HTTP client library built to make network communication *simple*, *readable*, and *powerful*. Although calling `requests.get()` is straightforward, professional scraping requires mastering its advanced capabilities:

* Custom headers
* Cookies and sessions
* Authentication
* Query parameters
* Timeouts
* Error handling
* Redirect control
* File downloads
* Streaming responses
* Proxy usage
* SSL configuration
* Retries

Below you will find a **detailed and practical explanation** of each of these features.

---

### 🔵 1. **HTTP Methods**

Although `GET` is most common in scraping, you should know the others.

```python
requests.get(url)
requests.post(url, data={})
requests.put(url, data={})
requests.delete(url)
```

* **GET** → retrieve information (HTML, JSON, images)
* **POST** → submit forms, log in, call APIs
* **PUT / DELETE** → less common in scraping, used for RESTful APIs

---

### 🔵 2. **Headers (Critical for Scraping)**

Headers define how your client "behaves" when talking to servers.

#### The most important header: `User-Agent`

Many websites block default Python agents.

```python
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)",
}
response = requests.get(url, headers=headers)
```

#### Other useful headers

```python
headers = {
    "User-Agent": "...",
    "Accept-Language": "en-US,en;q=0.9",
    "Accept": "text/html,application/xhtml+xml",
    "Referer": "https://google.com",
}
```

**Purpose of headers:**

* Avoid blocks
* Mimic real browsers
* Handle localization
* Access hidden/conditional content

---

### 🔵 3. **Query Parameters (GET Params)**

You can attach parameters using the `params` argument:

```python
params = {"search": "python", "page": 2}
response = requests.get(url, params=params)
```

`requests` will turn this into:

```
GET /?search=python&page=2 HTTP/1.1
```

Used heavily when interacting with pagination, search filters, or APIs.
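
To check the URL that `requests` will build without contacting any server, you can prepare the request instead of sending it. The base URL here is a placeholder.

```python
import requests

# prepare() encodes the params into the URL but does not send the request.
request = requests.Request(
    "GET", "https://example.com/", params={"search": "python", "page": 2}
)
prepared = request.prepare()
print(prepared.url)  # https://example.com/?search=python&page=2
```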

---

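### 🔵 4. **POST Requests and Form Submission**

Many forms, login forms in particular, submit with `POST` rather than `GET`. Check the form's `method` attribute to see which one applies.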

```python
payload = {"username": "ariadna", "password": "1234"}
response = requests.post(url, data=payload)
```

This is equivalent to submitting an HTML form.

---

### 🔵 5. **JSON Requests (Common in APIs)**

```python
response = requests.post(url, json={"key": "value"})
data = response.json()
```

`json=` automatically sets:

```
Content-Type: application/json
```
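
The difference between `data=` and `json=` is easiest to see by preparing both requests and comparing what would go on the wire. Nothing is sent in this sketch, and the endpoint URL is hypothetical.

```python
import json
import requests

url = "https://example.com/login"  # hypothetical endpoint

# data= encodes the payload like an HTML form submission.
form = requests.Request(
    "POST", url, data={"username": "ariadna", "password": "1234"}
).prepare()

# json= serializes the payload to JSON and sets the matching Content-Type.
api = requests.Request("POST", url, json={"key": "value"}).prepare()

print(form.headers["Content-Type"], form.body)
print(api.headers["Content-Type"], json.loads(api.body))
```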

---

### 🔵 6. **Cookies Handling**

Servers often use cookies to manage logins or sessions.

#### Get cookies returned by the server:

```python
response = requests.get(url)
print(response.cookies)
```

#### Send your own cookies:

```python
cookies = {"session_id": "ABC123"}
response = requests.get(url, cookies=cookies)
```

---

### 🔵 7. **Sessions (Persistent Connections, Cookies, and Headers)**

A `Session` object keeps:

* Cookies
* Connection reuse (Keep-Alive)
* Default headers
* Authentication

```python
session = requests.Session()
session.headers.update({"User-Agent": "Mozilla/5.0"})

r1 = session.get(url)
r2 = session.get(url2)
```

**Why sessions matter in scraping:**

* Faster (connection pooling)
* Needed for login flows
* Maintains cookies (maintains login state)
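
A quick way to see what a session carries over is to let it prepare a request without sending it. In this sketch the cookie value and the URL are placeholders; `prepare_request()` merges the session's headers and cookies with those of the individual request.

```python
import requests

session = requests.Session()
session.headers.update({"User-Agent": "Mozilla/5.0"})
session.cookies.set("session_id", "ABC123")  # hypothetical cookie

# Per-request headers are merged with the session defaults.
request = requests.Request(
    "GET", "https://example.com/profile", headers={"Accept-Language": "en-US"}
)
prepared = session.prepare_request(request)

print(prepared.headers["User-Agent"])
print(prepared.headers["Accept-Language"])
print(prepared.headers.get("Cookie"))
```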

---

### 🔵 8. **Timeouts (Critical for avoiding freezes)**

Always set a timeout to avoid infinite wait.

```python
response = requests.get(url, timeout=5)  
```

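A single number applies separately to two phases:

* Connection establishment (connect timeout)
* Waiting for the server to send data (read timeout)

It is not a limit on the total download time. Pass a tuple such as `timeout=(3.05, 10)` to set the two values independently.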

---

### 🔵 9. **Error and Status Code Handling**

#### Basic:

```python
if response.status_code == 200:
    ...
```

#### More robust:

```python
response.raise_for_status()  # raises exceptions for 4xx/5xx
```
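
Timeouts and status checks are usually combined in one small wrapper. The following is a minimal sketch, assuming you want `None` on any failure; the function name `fetch_html` and the default timeout values are illustrative choices, not part of `requests`.

```python
import requests


def fetch_html(url, timeout=(3.05, 10)):
    """Return the page text, or None if the request fails."""
    try:
        response = requests.get(url, timeout=timeout)
        response.raise_for_status()  # raises HTTPError on 4xx/5xx
    except requests.exceptions.Timeout:
        print(f"Timed out: {url}")
    except requests.exceptions.HTTPError as exc:
        print(f"Bad status {exc.response.status_code}: {url}")
    except requests.exceptions.RequestException as exc:
        # Base class for every other requests error (DNS, connection, ...).
        print(f"Request failed: {exc}")
    else:
        return response.text
    return None
```

A caller would then write `html = fetch_html("https://example.com")` and skip the page when the result is `None`.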

---

### 🔵 10. **Disable or Control Redirects**

By default, redirects are followed.

```python
response = requests.get(url, allow_redirects=False)
```

This is useful when investigating security, logins, or anti-scraping behavior.

---

### 🔵 11. **Downloading Files**

```python
url = "https://example.com/file.pdf"
response = requests.get(url)

with open("file.pdf", "wb") as f:
    f.write(response.content)
```

#### Large files: use streaming

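```python
with requests.get(url, stream=True) as response:
    response.raise_for_status()
    with open("file.pdf", "wb") as f:
        # Write the body in 1 KiB chunks instead of loading it all into memory
        for chunk in response.iter_content(chunk_size=1024):
            f.write(chunk)
```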

---

### 🔵 12. **Streaming Responses (Avoid Memory Overload)**

Useful for:

* Large pages
* Live data
* Big files

```python
response = requests.get(url, stream=True)
```

You read the content progressively rather than loading everything into RAM.

---

### 🔵 13. **Using Proxies**

Necessary for:

* Geo-targeting
* Rotating IPs
* Avoiding bans

```python
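proxies = {
    # The key is the scheme of the target URL; the value is the proxy's own URL.
    # Most HTTP proxies are reached over plain http://, even for HTTPS targets.
    "http": "http://123.123.123.123:8080",
    "https": "http://123.123.123.123:8080",
}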
response = requests.get(url, proxies=proxies)
```

---

### 🔵 14. **Authentication**

#### Basic Auth:

```python
requests.get(url, auth=('user', 'pass'))
```

#### Token-based:

```python
headers = {"Authorization": "Bearer <TOKEN>"}
requests.get(url, headers=headers)
```

Common in APIs and private dashboards.

---

### 🔵 15. **SSL Certificate Verification**

Enabled by default (recommended).

Disable only when testing:

```python
requests.get(url, verify=False)
```

Risk: disabling verification exposes you to man-in-the-middle (MITM) attacks. Use it only on controlled networks.

---

### 🔵 16. **Retry Logic (Very Important for Stability)**

`requests` does NOT retry automatically.
You must configure retries yourself:

```python
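import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

session = requests.Session()
# Retry up to 3 times on connection errors and on common transient statuses
retry = Retry(total=3, backoff_factor=0.5, status_forcelist=[429, 500, 502, 503, 504])
adapter = HTTPAdapter(max_retries=retry)
session.mount("https://", adapter)
session.mount("http://", adapter)

response = session.get(url)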
```

Retries prevent your scraper from breaking due to:

* Temporary server overload
* Network hiccups
* Proxy failures

---

### 🔵 17. **Handling Gzip/Deflate Compression**

`requests` automatically sends:

```
Accept-Encoding: gzip, deflate
```

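The body is decompressed transparently before you read it:

```python
response.text
```

So you receive readable HTML without decompressing anything yourself (`response.content` is decompressed as well).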

---

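### 🔵 18. **Encoding Control**

Sometimes a server declares the wrong encoding (or none), and `response.text` comes out garbled. Override the encoding before reading the text: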

```python
response.encoding = "utf-8"
html = response.text
```
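
The effect of a wrong charset is easy to reproduce offline. `response.text` decodes the raw bytes in `response.content` using `response.encoding`; the sketch below performs that decoding step by hand on a made-up sample string.

```python
# Bytes as a server might send them: UTF-8 encoded text.
body = "Café".encode("utf-8")

wrong = body.decode("latin-1")  # what a mislabeled charset produces
right = body.decode("utf-8")    # what the correct charset produces

print(wrong)  # CafÃ©
print(right)  # Café
```

When no charset is declared at all, `response.apparent_encoding` offers a guess based on the bytes themselves.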

---

### 🔵 19. **Handling Redirections, History, and Chain Requests**

```python
response = requests.get(url)
print(response.history)
```

Useful for login sequences or anti-bot redirection loops.

---

### 📘 Summary: Most Important Features for Scraping

| Feature        | Why It Matters                  |
| -------------- | ------------------------------- |
| Headers        | Avoid blocks; mimic browsers    |
| Sessions       | Keep cookies, reuse connections |
| Cookies        | Maintain login state            |
| Timeouts       | Avoid freezes                   |
| Retries        | Handle unstable websites        |
| Proxies        | Avoid bans, geolocation control |
| Streams        | Handle large files              |
| Params & POST  | Interact with web apps          |
| SSL, redirects | Control connection behavior     |


# 📘 **Understanding HTML for Web Scraping (Static Pages)**

Static HTML refers to webpages where the content is delivered **directly by the server**, without requiring JavaScript to modify or generate it.
When a scraper downloads such a page, the HTML it receives already contains the text, tags, and data that appear in the browser.

To extract data correctly, you must understand:

* How HTML is structured
* How tags and attributes work
* How CSS classes and IDs allow identification of elements
* How nesting and DOM hierarchy affect extraction
* How lists, tables, links, and forms are represented
* How to distinguish meaningful from decorative markup

Let us examine each of these elements in detail.

---

## 🔵 1. **HTML as a Tree (Document Object Model)**

HTML is not just text; it is a **hierarchical tree** structure.

Example (simplified):

```html
<html>
  <body>
    <h1>Title</h1>
    <p class="description">This is a description.</p>
  </body>
</html>
```

Hierarchy:

* `<html>` (root)

  * `<body>`

    * `<h1>`
    * `<p>`

Why this matters for scraping:

* BeautifulSoup and lxml interpret HTML as a tree
* You navigate parent → child → sibling
* Precise extraction depends on understanding this structure
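
As a small illustration, the example document above can be parsed with BeautifulSoup and walked in each direction. This sketch assumes the `beautifulsoup4` package is installed and uses Python's built-in `html.parser`.

```python
from bs4 import BeautifulSoup

html = """
<html>
  <body>
    <h1>Title</h1>
    <p class="description">This is a description.</p>
  </body>
</html>
"""
soup = BeautifulSoup(html, "html.parser")
h1 = soup.find("h1")

# Child -> parent
print(h1.parent.name)  # body
# Sibling: the next <p> at the same level
print(h1.find_next_sibling("p").get_text())  # This is a description.
# Parent -> direct tag children only
print([tag.name for tag in soup.body.find_all(recursive=False)])  # ['h1', 'p']
```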

---

## 🔵 2. **Tags (Elements): The Fundamental Units**

Tags define what each part of the document *means*.

Some of the most relevant tags for scrapers are:

| Tag                    | Purpose             | Relevance for Scraping                                    |
| ---------------------- | ------------------- | --------------------------------------------------------- |
| `<div>`                | Generic container   | Frequently used; usually requires class or ID to identify |
| `<span>`               | Inline container    | Often used for labels, metadata                           |
| `<a>`                  | Link                | Critical for pagination, crawling                         |
| `<img>`                | Images              | Scrapers need `src` attribute                             |
| `<ul>`, `<ol>`, `<li>` | Lists               | Used for menus, product lists                             |
| `<table>`              | Structured data     | Easy to scrape into datasets                              |
| `<form>`               | Login, search       | Important for automating interactions                     |
| `<input>`              | Form fields         | Required for POST requests                                |
| `<script>`             | JavaScript code     | Can reveal API endpoints                                  |
| `<meta>`               | Encodings, metadata | Useful for page information                               |

---

## 🔵 3. **Attributes (critical for identifying elements)**

Attributes are key/value pairs inside tags.

Example:

```html
<p class="summary" id="product-description" data-id="123">
    Great product!
</p>
```

Important attributes for scraping:

| Attribute | Description            | Usage in Scraping                    |
| --------- | ---------------------- | ------------------------------------ |
| `id`      | Unique identifier      | Best for selecting specific elements |
| `class`   | Category/group label   | Most common selector in modern HTML  |
| `href`    | Link URL               | Used to crawl pages                  |
| `src`     | Image or script source | Downloads, media scraping            |
| `data-*`  | Custom metadata        | Hidden but extremely useful          |
| `name`    | Input field name       | Required for form submission         |
| `content` | Meta tag content       | Useful for metadata extraction       |

### Why attributes matter:

Page structure alone is rarely unique. **Classes**, **IDs**, and **data attributes** exist for styling and scripting, but scrapers can use them to locate elements reliably.
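
Reading attributes with BeautifulSoup looks like dictionary access. The sketch below parses the `<p>` element from the example above, assuming the `beautifulsoup4` package is installed.

```python
from bs4 import BeautifulSoup

html = """
<p class="summary" id="product-description" data-id="123">
    Great product!
</p>
"""
soup = BeautifulSoup(html, "html.parser")
p = soup.find("p", id="product-description")

print(p["id"])          # product-description
print(p["class"])       # ['summary'] (class can hold several values, so it is a list)
print(p["data-id"])     # 123
print(p.get("title"))   # None: .get() avoids a KeyError for missing attributes
print(p.get_text(strip=True))  # Great product!
```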

---

## 🔵 4. **Classes and IDs: The Backbone of Scraper Selectors**

### ✔️ IDs should be unique

Ideal selector:

```html
<div id="main-title">Product A</div>
```

In BeautifulSoup:

```python
soup.find(id="main-title")
```

### ✔️ Classes identify groups of elements

Example:

```html
<div class="product">...</div>
<div class="product">...</div>
<div class="product">...</div>
```

In BeautifulSoup:

```python
soup.find_all(class_="product")
```

**Classes and IDs are the most important parts of HTML for scraping**, because they allow you to locate data precisely and reliably.

---

## 🔵 5. **Nesting and Hierarchy (Critical for Navigation)**

HTML elements are usually nested:

```html
<div class="product">
    <h2 class="name">Laptop X</h2>
    <span class="price">$999</span>
</div>
```

To extract the name and price:

1. Identify the `product` container
2. Extract children elements (`h2`, `span`)

BeautifulSoup example:

```python
product = soup.find("div", class_="product")
name = product.find("h2").text
price = product.find("span", class_="price").text
```

Understanding nesting allows:

* Filtering complex data
* Extracting elements that depend on context
* Avoiding errors when multiple sections share the same classes
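
When a page contains several such containers, the same idea extends to a loop: find every container first, then search inside each one so that names and prices stay paired. The second product in this sketch is invented for the example.

```python
from bs4 import BeautifulSoup

html = """
<div class="product">
    <h2 class="name">Laptop X</h2>
    <span class="price">$999</span>
</div>
<div class="product">
    <h2 class="name">Laptop Y</h2>
    <span class="price">$1299</span>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

products = []
for card in soup.find_all("div", class_="product"):
    # Searching from `card` limits the match to this container's children.
    products.append({
        "name": card.find("h2", class_="name").get_text(strip=True),
        "price": card.find("span", class_="price").get_text(strip=True),
    })

print(products)
```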

---

## 🔵 6. **Tables (Perfect for Structured Data)**

Tables are a goldmine for data scientists:

```html
<table id="stats-table">
  <tr>
    <th>Name</th><th>Value</th>
  </tr>
  <tr>
    <td>Height</td><td>180</td>
  </tr>
</table>
```

Process:

1. Find table
2. Extract header rows
3. Extract data rows
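
These three steps could look like this with BeautifulSoup. It is a minimal sketch that assumes a simple table whose first row holds the headers and that has no merged cells.

```python
from bs4 import BeautifulSoup

html = """
<table id="stats-table">
  <tr>
    <th>Name</th><th>Value</th>
  </tr>
  <tr>
    <td>Height</td><td>180</td>
  </tr>
</table>
"""
soup = BeautifulSoup(html, "html.parser")

# 1. Find the table
table = soup.find("table", id="stats-table")
rows = table.find_all("tr")

# 2. Extract the header row
headers = [th.get_text(strip=True) for th in rows[0].find_all("th")]

# 3. Extract the data rows as dictionaries keyed by header
records = [
    dict(zip(headers, (td.get_text(strip=True) for td in row.find_all("td"))))
    for row in rows[1:]
]
print(records)  # [{'Name': 'Height', 'Value': '180'}]
```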

---

## 🔵 7. **Lists (Common in Product Listings, Menus, Comments)**

HTML lists:

```html
<ul class="items">
  <li>Item A</li>
  <li>Item B</li>
</ul>
```

Useful for iterating through repeated structures.

---

## 🔵 8. **Links: Navigating the Website (Crawling)**

`<a>` tags define navigation:

```html
<a href="/product/123">View product</a>
```

Scraper logic:

* Extract `href`
* Convert to full URL if needed
* Follow link

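Handling relative paths is essential. Plain concatenation (`base_url + href`) breaks for hrefs such as `../page` or full URLs, so resolve them with `urllib.parse.urljoin`:

`requests.get(urljoin(page_url, href))`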
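
The standard library's `urljoin` resolves an `href` against the URL of the page it was found on, the same way a browser does. The URLs in this sketch are placeholders.

```python
from urllib.parse import urljoin

page_url = "https://example.com/catalog/page1.html"  # page the links came from
hrefs = ["/product/123", "page2.html", "../about", "https://other.example/x"]

# Resolve each href exactly as a browser would.
resolved = {href: urljoin(page_url, href) for href in hrefs}
for href, full_url in resolved.items():
    print(f"{href} -> {full_url}")
```

Root-relative, document-relative, and parent-relative links all resolve correctly, and a link that is already absolute is returned unchanged.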

---

## 🔵 9. **Forms: Behind Logins, Searches, Filters**

Example:

```html
<form action="/search" method="post">
    <input type="text" name="query">
    <input type="submit">
</form>
```

To simulate a form submission:

1. Extract `action`
2. Extract `method`
3. Extract all inputs with `name`

Example with `requests`:

```python
payload = {"query": "python"}
requests.post("https://example.com/search", data=payload)
```
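
The three extraction steps can also be automated instead of read off by eye. The following sketch parses the form above and builds the pieces a `requests` call would need; `page_url` is a hypothetical address for the page that contains the form, and nothing is sent.

```python
from urllib.parse import urljoin
from bs4 import BeautifulSoup

page_url = "https://example.com/"  # hypothetical page containing the form
html = """
<form action="/search" method="post">
    <input type="text" name="query">
    <input type="submit">
</form>
"""
soup = BeautifulSoup(html, "html.parser")
form = soup.find("form")

# 1. action, resolved against the page URL
action = urljoin(page_url, form.get("action", ""))
# 2. method (HTML forms default to GET when none is given)
method = form.get("method", "get").upper()
# 3. every input that has a name, with its default value if any
payload = {
    field["name"]: field.get("value", "")
    for field in form.find_all("input")
    if field.get("name")
}
payload["query"] = "python"  # fill in the value we want to submit

print(method, action, payload)
```

With these values, the submission would be `requests.request(method, action, data=payload)` for a POST form.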

---

## 🔵 10. **Meta Tags (Encoding, Description, Keywords)**

Example:

```html
<meta charset="UTF-8">
<meta name="description" content="Product info">
```

Useful for:

* Setting `response.encoding`
* Understanding content language
* Extracting SEO metadata
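
Reading these values with BeautifulSoup might look like the sketch below. Note that the `name` attribute has to be passed through `attrs=`, because `name` is already the keyword BeautifulSoup uses for the tag name.

```python
from bs4 import BeautifulSoup

html = """
<head>
  <meta charset="UTF-8">
  <meta name="description" content="Product info">
</head>
"""
soup = BeautifulSoup(html, "html.parser")

# charset=True matches any <meta> tag that has a charset attribute.
charset = soup.find("meta", charset=True)["charset"]
description = soup.find("meta", attrs={"name": "description"})["content"]

print(charset)      # UTF-8
print(description)  # Product info
```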

---

## 🔵 11. **Comments and Hidden Elements**

HTML sometimes hides data:

```html
<div style="display:none">Secret</div>
```

Or:

```html
<!-- Price: $499 -->
```

JavaScript frameworks often put JSON inside `<script>` tags:

```html
<script id="data">
  {"price": 100, "name": "Laptop"}
</script>
```

Scrapers frequently extract this to avoid parsing complex DOM trees.
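
All three kinds of hidden data from the examples above can be pulled out with BeautifulSoup and the standard `json` module. This is a minimal sketch that assumes the script tag contains pure JSON; real pages often wrap it in a JavaScript assignment that must be stripped first.

```python
import json
from bs4 import BeautifulSoup, Comment

html = """
<div style="display:none">Secret</div>
<!-- Price: $499 -->
<script id="data">
  {"price": 100, "name": "Laptop"}
</script>
"""
soup = BeautifulSoup(html, "html.parser")

# Hidden element: still present in the HTML, only invisible in the browser.
hidden = soup.find("div", style="display:none").get_text()

# Comments are nodes of type Comment in the parse tree.
comments = [c.strip() for c in soup.find_all(string=lambda s: isinstance(s, Comment))]

# Embedded JSON: read the script's text and parse it.
data = json.loads(soup.find("script", id="data").string)

print(hidden, comments, data)
```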

---

## 🔵 12. **Patterns to Recognize in Static HTML Scraping**

✔️ **Repeated structures**

* Product listings
* Forum threads
* Tables

✔️ **Key-value structures**

* Profile pages
* Specs pages
* Metadata blocks

✔️ **Hidden JSON inside script tags**

* Used by modern websites as data storage
* Easier to parse than HTML

✔️ **Breadcrumbs (navigation hierarchy)**

* Useful for categorizing items

✔️ **Relative vs. absolute URLs**

* Important for crawling the entire site

---

## 🔵 13. **Non-useful Parts of HTML (Usually Ignored)**

* CSS styling
* Layout-related divs (unless they contain structured content)
* JavaScript code (unless it contains hidden data)
* Decorative icons
* Advertisements

A scraper should focus on the elements that carry the data you need and skip presentational markup.

---

## ✔️ Summary: What Matters Most in Static HTML Scraping

| Element          | Critical Role                        |
| ---------------- | ------------------------------------ |
| **Tags**         | Define structure of content          |
| **Classes/IDs**  | Provide stable selectors             |
| **Attributes**   | Contain actual data or links         |
| **Hierarchy**    | Determines extraction strategy       |
| **Lists/Tables** | Perfect sources of structured data   |
| **Links**        | Enable crawling and exploration      |
| **Forms**        | Enable interactions (search, login)  |
| **Meta tags**    | Control encoding and metadata        |
| **Hidden data**  | Often contains essential information |


# 📘 **How BeautifulSoup Navigates HTML in Web Scraping**

## 1. 🧩 What Is BeautifulSoup and What Does It Do Internally?

BeautifulSoup is a Python library designed to:

1. **Parse HTML or XML**.
2. Transform it into a tree structure (*parse tree*).
3. Let you traverse and search elements using:

   * Tag names
   * Attributes
   * Text
   * CSS selectors
   * DOM structure

### What does it do internally?

When you pass it HTML, BeautifulSoup:

1. **Parses** it with Python's built-in parser (`html.parser`) or a more robust third-party one (`lxml`, `html5lib`).
2. Builds a **DOM-like tree**.
3. Represents each node as a Python object of one of these types:

   * `Tag`
   * `NavigableString`
   * `Comment`
   * `Doctype`

### Minimal example:

```python
from bs4 import BeautifulSoup

html = """
<html>
  <body>
    <h1 class="title">Hola</h1>
    <p id="intro">Bienvenida al scraping</p>
  </body>
</html>
"""

soup = BeautifulSoup(html, "html.parser")
print(soup.prettify())
```

---

## 2. 🌳 **The DOM Tree in BeautifulSoup**

BeautifulSoup converts HTML into nodes that you can traverse as a tree.

### Node types:

| Type              | Description                              |
| ----------------- | ---------------------------------------- |
| `Tag`             | Represents a tag (`<div>`, `<p>`, `<a>`) |
| `NavigableString` | Represents text inside a tag             |
| `Comment`         | Represents a comment `<!-- ... -->`      |
| `BeautifulSoup`   | Root object of the document              |

### Visual example:

HTML:

```html
<div>
    <p>Hola <b>mundo</b></p>
</div>
```

Conceptual tree:

```
div
 └── p
      ├── "Hola "
      └── b
           └── "mundo"
```

---

## 3. 🔍 Searching the DOM: `find()` and `find_all()`

### ✔ `find()` → first match

### ✔ `find_all()` → all matches

### Search by tag:

```python
soup.find("p")
soup.find_all("p")
```

### Search by class:

```python
soup.find("h1", class_="title")
soup.find_all("div", class_="item")
```

### Search by id:

```python
soup.find(id="intro")
```

### Search by several attributes:

```python
soup.find("a", {"href": True, "class": "link"})
```

---

## 4. 🎯 Advanced Search with Custom Functions

You can pass a function to filter elements:

```python
def is_download_link(tag):
    return tag.name == "a" and tag.get("href", "").endswith(".zip")

links = soup.find_all(is_download_link)
```

---

## 5. 🎨 Using **CSS Selectors** (very powerful)

BeautifulSoup lets you search with CSS syntax:

| Selector            | Meaning                              |
| ------------------- | ------------------------------------ |
| `"div p"`           | `<p>` anywhere inside a `<div>`      |
| `".clase"`          | elements with that class             |
| `"#id"`             | element with that id                 |
| `"div > p"`         | `<p>` that is a direct child of `<div>` |
| `"a[href]"`         | `<a>` with an `href` attribute       |
| `"a[href$='.pdf']"` | `<a>` whose href ends in ".pdf"      |

### Example:

```python
soup.select("div.item > a.link")
```
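
To see how the selectors in the table differ, the sketch below runs several of them against a small, invented HTML fragment. `select()` returns a list of matches and `select_one()` returns the first match or `None`.

```python
from bs4 import BeautifulSoup

html = """
<div class="item"><a class="link" href="/docs/a.pdf">A</a></div>
<div class="item"><p><a class="link" href="/b.html">B</a></p></div>
<p id="intro">Intro</p>
"""
soup = BeautifulSoup(html, "html.parser")

# Direct children only: the second link is nested inside a <p>, so it is skipped.
direct = [a["href"] for a in soup.select("div.item > a.link")]
# Any descendant: both links match.
nested = [a["href"] for a in soup.select("div.item a.link")]
# Attribute suffix match.
pdfs = [a["href"] for a in soup.select("a[href$='.pdf']")]
# Lookup by id.
intro = soup.select_one("#intro").get_text()

print(direct, nested, pdfs, intro)
```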

---

## 6. 📂 Accessing Content: Text, Attributes, and Structure

### Get text:

```python
tag = soup.find("p")
tag.text          # all inner text (shortcut for get_text())
tag.get_text()    # same result; also accepts separator= and strip=
```

### Get one attribute:

```python
a = soup.find("a")
a["href"]
```

### Get all attributes (as a dictionary):

```python
a.attrs
```

---

## 7. 🧭 Navigating the DOM: Parents, Children, and Siblings

BeautifulSoup lets you walk the structure as a tree.

### ▼ 7.1 Children

```python
tag.contents   # list of children
tag.children   # iterator over the same children
```

Example:

```python
for child in tag.children:
    print(child)
```

### ▼ 7.2 Parent

```python
tag.parent
```

### ▼ 7.3 Ancestors

```python
for ancestor in tag.parents:
    print(ancestor.name)
```

### ▼ 7.4 Siblings

#### Next sibling:

```python
tag.next_sibling
```

#### Previous sibling:

```python
tag.previous_sibling
```

Example:

```python
h1 = soup.find("h1")
h1.next_sibling
```
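
One detail worth knowing: in formatted HTML, the next sibling of a tag is often the whitespace between tags rather than the next element. The sketch below shows the difference on a small, made-up fragment; `find_next_sibling("p")` skips text nodes and returns the next `<p>` tag.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<div>\n<h1>Hola</h1>\n<p>Texto</p>\n</div>", "html.parser")
h1 = soup.find("h1")

# The newline between </h1> and <p> is itself a node in the tree.
print(repr(h1.next_sibling))

# find_next_sibling() looks only at tags, so it reaches the paragraph.
paragraph = h1.find_next_sibling("p")
print(paragraph.get_text())
```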

---

## 8. 🧼 Cleaning and Filtering HTML

### Remove tags:

```python
tag.decompose()
```

### Clean text:

```python
tag.get_text(strip=True)
```

### Replace a tag's contents:

```python
tag.string = "Nuevo texto"
```
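
A common cleaning routine combines these calls: remove the tags that never hold useful text, then extract what remains. The HTML fragment in this sketch is invented for the example.

```python
from bs4 import BeautifulSoup

html = "<div><script>track()</script><p> Hello <b>world</b> </p><style>p{}</style></div>"
soup = BeautifulSoup(html, "html.parser")

# Calling soup(...) is shorthand for soup.find_all(...).
for tag in soup(["script", "style"]):
    tag.decompose()  # removes the tag and its contents from the tree

# Join the remaining text fragments with a space, trimming each one.
text = soup.get_text(" ", strip=True)
print(text)  # Hello world
```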

---

## 9. 📝 Complete Navigation Example

Suppose we have this HTML:

```html
<div class="product">
  <h2 class="name">Teclado Mecánico</h2>
  <span class="price">$25</span>
  <a href="/compra" class="buy">Comprar</a>
</div>
```

### Extract the data:

```python
product = soup.find("div", class_="product")

nombre = product.find("h2", class_="name").text
precio = product.find("span", class_="price").text
link = product.find("a", class_="buy")["href"]

print(nombre, precio, link)
```

Output:

```
Teclado Mecánico $25 /compra
```