# HTML & CSS Basics for Web Scraping

# 🌐 HTML for Web Scraping — Complete Guide

---

## 1. What is HTML?

**HTML (HyperText Markup Language)** is the skeleton of every web page.  
It describes the structure of content like text, images, links, and tables.

For **web scraping**, we don’t need to create HTML — we just need to **understand** it well enough to locate the data we want.

---

## 2. Minimal HTML Page Example

Here is a minimal, complete HTML page:

```html
<!DOCTYPE html>
<html>
  <head>
    <title>My First Page</title>
  </head>
  <body>
    <h1>Hello World!</h1>
    <p>This is a paragraph.</p>
  </body>
</html>
```

💡 Save this code into a file called `test.html` and open it in your browser.
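
A scraper reads this same structure, only through a parser instead of a browser. The following is a minimal sketch using Python's built-in `html.parser` module. The `OutlineParser` class and the `PAGE` variable are names made up for this illustration.

```python
from html.parser import HTMLParser

# The minimal page from above, stored as a string.
PAGE = """<!DOCTYPE html>
<html>
  <head>
    <title>My First Page</title>
  </head>
  <body>
    <h1>Hello World!</h1>
    <p>This is a paragraph.</p>
  </body>
</html>"""


class OutlineParser(HTMLParser):
    """Records every opening tag together with its nesting depth."""

    def __init__(self):
        super().__init__()
        self.depth = 0
        self.outline = []  # list of (depth, tag) pairs

    def handle_starttag(self, tag, attrs):
        self.outline.append((self.depth, tag))
        self.depth += 1

    def handle_endtag(self, tag):
        self.depth -= 1


parser = OutlineParser()
parser.feed(PAGE)

# Print the tags indented by depth to show the tree shape.
for depth, tag in parser.outline:
    print("  " * depth + f"<{tag}>")
```

The indentation in the output mirrors the nesting: `<head>` and `<body>` sit inside `<html>`, and the heading and paragraph sit inside `<body>`.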

---

## 3. Page Structure (Anatomy)

| Tag     | Description                          |
|---------|--------------------------------------|
| `<html>`| Root of the HTML document            |
| `<head>`| Metadata, styles, and scripts        |
| `<body>`| All visible content on the page      |

---

## 4. Common HTML Tags

These are the most common tags you’ll encounter:

| Tag         | Purpose                        |
|-------------|--------------------------------|
| `<h1>`–`<h6>` | Headings (from largest to smallest) |
| `<p>`       | Paragraph of text              |
| `<a>`       | Hyperlink                      |
| `<img>`     | Image                          |
| `<ul>` / `<li>` | Unordered list and list item |
| `<table>`   | Table                          |

### Example:

```html
<a href="https://example.com">Visit Site</a>

<img src="photo.jpg" alt="A photo">

<ul>
  <li>Item 1</li>
  <li>Item 2</li>
</ul>

<table>
  <tr><th>City</th><th>Country</th></tr>
  <tr><td>Paris</td><td>France</td></tr>
</table>
```
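
Tables are a frequent scraping target because their rows map naturally to records. As a rough sketch, the table above could be turned into a list of rows with Python's built-in `html.parser`. `TableParser` is a hypothetical helper written for this example, and it assumes every cell sits inside a `<tr>`.

```python
from html.parser import HTMLParser

TABLE_HTML = """
<table>
  <tr><th>City</th><th>Country</th></tr>
  <tr><td>Paris</td><td>France</td></tr>
</table>
"""


class TableParser(HTMLParser):
    """Collects the text of each <th>/<td> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self.in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.rows.append([])        # start a new row
        elif tag in ("th", "td"):
            self.in_cell = True
            self.rows[-1].append("")    # start a new, empty cell

    def handle_endtag(self, tag):
        if tag in ("th", "td"):
            self.in_cell = False

    def handle_data(self, data):
        # Only keep text that appears inside a cell.
        if self.in_cell:
            self.rows[-1][-1] += data


table = TableParser()
table.feed(TABLE_HTML)
print(table.rows)
```

The first row holds the header cells and the second row holds the data, in the same order as in the HTML.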

---

## 5. HTML Attributes

Tags can have attributes, which provide extra information or behavior.

| Attribute | Used in  | Purpose                                  |
|-----------|----------|------------------------------------------|
| `class`   | Any tag  | Used for styling or targeting with CSS/scraping |
| `id`      | Any tag  | Unique identifier on the page            |
| `href`    | `<a>`    | The link’s destination                   |
| `src`     | `<img>`  | The image source URL                     |
| `alt`     | `<img>`  | Alternative text: shown if the image fails to load, read by screen readers |

### Example:

```html
<a href="https://example.com" class="nav-link">Click Here</a>
<img src="photo.jpg" alt="A beautiful view">
```
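
In scraping, attributes are often the data itself: the URL in `href`, the file in `src`. The sketch below reads them with Python's built-in `html.parser`. The `AttrParser` class is an illustrative name, not part of any library.

```python
from html.parser import HTMLParser

ATTR_HTML = """
<a href="https://example.com" class="nav-link">Click Here</a>
<img src="photo.jpg" alt="A beautiful view">
"""


class AttrParser(HTMLParser):
    """Collects link targets and image sources from tag attributes."""

    def __init__(self):
        super().__init__()
        self.links = []   # (href, class) pairs
        self.images = []  # (src, alt) pairs

    def handle_starttag(self, tag, attrs):
        # attrs arrives as a list of (name, value) tuples.
        attributes = dict(attrs)
        if tag == "a" and "href" in attributes:
            self.links.append((attributes["href"], attributes.get("class")))
        elif tag == "img":
            self.images.append((attributes.get("src"), attributes.get("alt")))


found = AttrParser()
found.feed(ATTR_HTML)
print(found.links)
print(found.images)
```

Using `.get()` for optional attributes avoids a `KeyError` when a tag lacks `class` or `alt`.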

---

## 6. HTML Containers: `<div>` and `<span>`

Some tags don’t show content themselves — they are used to organize or wrap other elements.

| Tag     | Type        | Used for                            |
|---------|-------------|--------------------------------------|
| `<div>` | Block-level | Grouping larger sections             |
| `<span>`| Inline      | Wrapping small text chunks inline    |

### Example of `<div>` (used for layout):

```html
<div class="card">
  <h2>Article Title</h2>
  <p>This is inside a section.</p>
</div>
```

### Example of `<span>` (used inline):

```html
<p>This is a <span class="highlight">highlighted</span> word.</p>
```

✅ Use `<div>` to group multiple elements in a block  
✅ Use `<span>` to target part of a sentence

---


## 🎨 7. CSS Basics & Selectors 

---

### 🧠 What is CSS?

**CSS** stands for **Cascading Style Sheets**.  
It’s a language used to **control how HTML looks** — fonts, colors, layout, spacing, and more.

In web scraping, we don’t use CSS to make things pretty — we use **CSS selectors** to locate the exact elements we want to extract data from.

---

### 🧩 HTML + CSS: How They Work Together

HTML gives us the **structure** of a webpage.  
CSS gives us the **style** — but also a powerful way to **target elements**.

Example:

```html
<p class="highlight">This is important</p>
```

```css
.highlight {
  color: red;
  font-weight: bold;
}
```

- The HTML creates a paragraph with the class name `highlight`
- The CSS targets all elements with the class `highlight` and makes them red and bold

---

### 🎯 What is a CSS Selector?

A **CSS selector** is a pattern used to select specific parts of an HTML page.  
It tells the browser (or scraper) which elements to act on.

| Selector Type   | Syntax         | What It Selects                          |
|------------------|----------------|------------------------------------------|
| **Tag**          | `p`            | All `<p>` tags                           |
| **Class**        | `.highlight`   | All elements with class="highlight"      |
| **ID**           | `#footer`      | The element with id="footer"             |
| **Attribute**    | `img[alt]`     | All `<img>` tags that have an `alt` attr |
| **Descendant**   | `div p`        | All `<p>` inside any `<div>`             |
| **Child**        | `div > p`      | Only `<p>` that are direct children of `<div>` |

---

### 🔍 HTML + CSS Selectors in Practice

Here's an example HTML structure:

```html
<div class="card">
  <h2>Article Title</h2>
  <p>This is a paragraph.</p>
  <p class="highlight">Important info</p>
</div>

<p id="footer">This is the footer text</p>

<a href="https://example.com" class="nav-link">Visit site</a>

<img src="photo.jpg" alt="A beautiful view">
```

**Selector usage examples:**

| Selector             | What it targets                                           |
|----------------------|-----------------------------------------------------------|
| `p`                  | All `<p>` tags                                            |
| `.highlight`         | `<p class="highlight">Important info</p>`                |
| `#footer`            | `<p id="footer">This is the footer text</p>`             |
| `a.nav-link`         | `<a>` tag with class="nav-link"                          |
| `img[alt]`           | `<img>` tags that have an `alt` attribute                |
| `div p`              | All `<p>` inside any `<div>`                             |
| `div > p`            | Only `<p>` tags that are **direct children** of a `<div>`|
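
To check these matches yourself, the selectors can be run against the example HTML. This is a minimal sketch that assumes the third-party `beautifulsoup4` package is installed; its `select()` and `select_one()` methods accept CSS selector strings. The variable names are chosen for this example only.

```python
from bs4 import BeautifulSoup

SELECTOR_HTML = """
<div class="card">
  <h2>Article Title</h2>
  <p>This is a paragraph.</p>
  <p class="highlight">Important info</p>
</div>

<p id="footer">This is the footer text</p>

<a href="https://example.com" class="nav-link">Visit site</a>

<img src="photo.jpg" alt="A beautiful view">
"""

soup = BeautifulSoup(SELECTOR_HTML, "html.parser")

# Tag selector: every <p> on the page.
all_paragraphs = [p.get_text() for p in soup.select("p")]

# Class and ID selectors: select_one() returns the first match.
highlighted = soup.select_one(".highlight").get_text()
footer = soup.select_one("#footer").get_text()

# Tag + class, then read an attribute from the matched element.
link = soup.select_one("a.nav-link")["href"]

# Attribute selector: only <img> tags that carry an alt attribute.
images_with_alt = [img["src"] for img in soup.select("img[alt]")]

# Descendant vs. child combinator.
inside_div = [p.get_text() for p in soup.select("div p")]
direct_children = [p.get_text() for p in soup.select("div > p")]

print(all_paragraphs)
print(inside_div)
print(direct_children)
```

In this page `div p` and `div > p` return the same two paragraphs, because both sit directly inside the `<div>`. The results would differ if a paragraph were wrapped in another element inside the `<div>`.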

---

### ✅ Why This Matters for Scraping

Later, when we use **Scrapy**, you’ll write selectors like:

```python
response.css("p.highlight::text").get()
response.css("a.nav-link::attr(href)").get()
```

So understanding how selectors match elements is **critical** to writing scrapers that extract the right data.

---

### 📌 Recap

- **CSS selectors** let us point at specific elements in an HTML page
- We can use:
  - `tag` (like `p`, `img`, `a`)
  - `.class`
  - `#id`
  - attribute selectors like `[href]`, `[alt]`
  - parent-child relationships like `div > p`

✔ Learning to read and write CSS selectors means learning to **navigate any web page’s structure**.


## 8. Inspecting a Page (Practical Demo)

For practice, we will use the training website 👉 [http://quotes.toscrape.com](http://quotes.toscrape.com)

1. Open the website in Chrome or Firefox.
2. Right-click on a quote → select **Inspect**.
3. The Developer Tools will open and highlight the corresponding **HTML element**.

Example:

![Inspect Quote](img/scrapping_demo1.png)

- `div.quote` → the full quote box
- `span.text` → the text of the quote
- `small.author` → the author name
- `a.tag` → each tag associated with the quote


## 9. XPath vs CSS Selectors

![Inspect Quote](img/scrapping_demo2.png)


When you right-click an element in the Inspector, you can choose:

**Copy XPath** → gives the exact path in the HTML tree  
Example: `/html/body/div[2]/div[1]/div[2]/span[1]`

**Copy Selector** → gives the CSS selector  
Example: `div.quote > span.text`
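
To compare the two notations, here is a small sketch using Python's built-in `xml.etree.ElementTree`, which supports a limited subset of XPath. The `SNIPPET` markup is a hypothetical, simplified quote box written for this example (it is not the site's exact HTML), and it has to be well-formed XML for this parser to accept it.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified markup; real pages are longer and often not valid XML.
SNIPPET = """
<html>
  <body>
    <div class="quote">
      <span class="text">Sample quote text</span>
      <span>by <small class="author">Sample Author</small></span>
    </div>
  </body>
</html>
"""

root = ET.fromstring(SNIPPET.strip())  # root is the <html> element

# Position-based path, in the style of "Copy XPath".
# ElementTree paths are relative to the element you search from.
by_position = root.find("./body/div[1]/span[1]")

# Attribute-based path, the XPath counterpart of `div.quote > span.text`.
# Note: @class='quote' compares the whole attribute value, while the CSS
# selector .quote also matches elements that carry additional classes.
by_class = root.find(".//div[@class='quote']/span[@class='text']")

print(by_position.text)
print(by_class.text)
```

Both paths reach the same element here. The position-based path stops working as soon as the page layout shifts, which is why class-based selectors are usually the more robust choice.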

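👉 BeautifulSoup does not support XPath.  
Instead, we filter by tag name and attributes such as `class` and `id` using `find()` and `find_all()` (CSS selector strings go through `select()` and `select_one()`):

`soup.find("span", class_="text").get_text()`  
`soup.find("small", class_="author").get_text()`  
`soup.find_all("a", class_="tag")`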
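
Put together, these calls can turn each quote box into a record. The following is a minimal sketch, assuming the third-party `beautifulsoup4` package is installed. The `SAMPLE` markup and its contents are invented stand-ins that only reuse the class names listed above, so the example runs without a network request.

```python
from bs4 import BeautifulSoup

# Invented sample markup reusing the class names seen in the Inspector.
SAMPLE = """
<div class="quote">
  <span class="text">First sample quote</span>
  <span>by <small class="author">Author One</small></span>
  <div class="tags">
    <a class="tag" href="/tag/alpha/">alpha</a>
    <a class="tag" href="/tag/beta/">beta</a>
  </div>
</div>
<div class="quote">
  <span class="text">Second sample quote</span>
  <span>by <small class="author">Author Two</small></span>
  <div class="tags">
    <a class="tag" href="/tag/gamma/">gamma</a>
  </div>
</div>
"""

soup = BeautifulSoup(SAMPLE, "html.parser")

quotes = []
# select() takes a CSS selector and returns every matching element.
for box in soup.select("div.quote"):
    quotes.append({
        # find() filters by tag name and class, searching only inside this box.
        "text": box.find("span", class_="text").get_text(),
        # select_one() is the CSS-selector equivalent of find().
        "author": box.select_one("small.author").get_text(),
        "tags": [a.get_text() for a in box.find_all("a", class_="tag")],
    })

for quote in quotes:
    print(quote)
```

Searching inside each `box` rather than on the whole `soup` keeps every author and tag list attached to the right quote.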
