# 🎓 Lesson 6: 🧹 Cleaning Extracted Data

## 🎯 Goal

In this lesson, you’ll learn:
1. How to clean raw HTML text
2. How to remove unwanted characters and whitespace
3. How to normalize inconsistent data (e.g. numbers or units)

## 💻 Real Example Site

📍 https://scrapethissite.com/pages/simple/

Let’s extract and clean country names, populations, and area data.

## ✅ Example: Extract Raw and Cleaned Data

In [None]:
import requests
from bs4 import BeautifulSoup

url = "https://scrapethissite.com/pages/simple/"
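response = requests.get(url, timeout=10)
response.raise_for_status()  # stop here if the request failed
soup = BeautifulSoup(response.text, "lxml")  # "lxml" must be installed; "html.parser" is built in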

# Get all country blocks
countries = soup.select("div.country")

for country in countries:
    raw_name = country.select_one("h3.country-name").text
    raw_population = country.select_one("span.country-population").text
    raw_area = country.select_one("span.country-area").text

    # 🧼 Clean the data
    name = raw_name.strip()
    population = raw_population.strip().replace(",", "")
    area = raw_area.strip().replace(",", "")

    print(f"🌍 {name} | Population: {population} | Area: {area} km²")
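
If you'd like to see what cleaning changes without making a network request, here is a minimal sketch that parses a small inline HTML fragment. The fragment is made up to resemble the page's markup: it reuses the class names from the example above, but the values are invented and "Examplia" is not a real country. It also assumes Python's built-in `html.parser` instead of `lxml`.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML fragment shaped like one country block (invented values)
SAMPLE_HTML = """
<div class="country">
  <h3 class="country-name">
      Examplia
  </h3>
  <span class="country-population"> 84,000 </span>
  <span class="country-area">468.0</span>
</div>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
country = soup.select_one("div.country")

raw_name = country.select_one("h3.country-name").text
name = raw_name.strip()

# repr() makes the hidden newlines and spaces visible
print(repr(raw_name))
print(repr(name))

# get_text(strip=True) extracts and strips the text in one step
population_text = country.select_one("span.country-population").get_text(strip=True)
print(repr(population_text))
```

`repr()` is handy while debugging because it shows whitespace that `print()` alone hides. For a tag that contains a single piece of text, `get_text(strip=True)` gives the same result as `.text.strip()`.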

## 🧠 Useful Cleaning Functions

| Method                  | Purpose                                 |
| ----------------------- | --------------------------------------- |
| `.strip()`              | Removes leading and trailing whitespace |
| `.replace(old, new)`    | Replaces characters or substrings       |
| `.lower()` / `.upper()` | Normalizes text case                    |
| `.split()`              | Splits text into a list of parts        |
| `int()` / `float()`     | Converts numeric text to numbers        |
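
Here is a quick sketch of each method from the table applied to sample strings. The strings are made up for illustration; they stand in for the kind of text `.text` might return from a tag.

```python
# Made-up raw strings, similar to what .text can return
raw = "\n      United   Arab Emirates \n"

stripped = raw.strip()                    # drop leading/trailing whitespace
collapsed = " ".join(stripped.split())    # split on whitespace, rejoin with single spaces
upper = collapsed.upper()                 # normalize to uppercase
lower = collapsed.lower()                 # normalize to lowercase

no_commas = "4,975,593".replace(",", "")  # remove thousands separators
as_int = int(no_commas)                   # whole numbers
as_float = float("82880.0")               # numbers with a decimal point

print(repr(stripped))
print(repr(collapsed))
print(upper, lower)
print(as_int, as_float)
```

Combining `.split()` with `" ".join(...)` is a common way to collapse repeated spaces inside a string, which `.strip()` alone does not touch.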


## 🔁 Converting to Integers and Floats

To store the population and area values as numbers, or to use them in calculations, convert them first:

In [2]:
population = int(population)
area = float(area)

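⚠️ Remove commas first (`"1,000,000"` → `"1000000"`), because `int("1,000,000")` raises a `ValueError`. Use `float()` for values with a decimal point: `int("468.0")` also raises a `ValueError`.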
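
One way to avoid repeating the same cleaning steps is to wrap them in small helper functions. The names `parse_int` and `parse_float` below are hypothetical, chosen for this sketch, and the input strings are sample values.

```python
def parse_int(text):
    """Turn text like ' 1,000,000 ' into the integer 1000000."""
    return int(text.strip().replace(",", ""))


def parse_float(text):
    """Turn text like ' 82,880.5 ' into the float 82880.5."""
    return float(text.strip().replace(",", ""))


print(parse_int(" 1,000,000 "))   # 1000000
print(parse_float("82,880.5"))    # 82880.5
print(parse_float("1.7E7"))       # 17000000.0 (float() accepts scientific notation)

# If a value carries a unit, remove the unit before converting
print(parse_float("468.0 km²".replace("km²", "")))  # 468.0
```

Keeping the cleaning in one place means that if the data format changes (for example, a different thousands separator), only the helper needs updating.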

## 🧪 Practice Tasks

1. Extract the raw `.text` from each `.country-area` element and clean it (remove commas).

2. Convert population and area values into integers/floats and calculate population density.

3. Normalize all country names to uppercase using `.upper()`.

Example for task 2:

density = population / area
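
As a hint for the tasks above, here is a minimal sketch that runs the whole clean-convert-calculate sequence on a few hard-coded records. The country names and numbers are invented sample data, not values from the site, and the zero-area guard is an assumption about what the data might contain.

```python
# Made-up sample records: (name, population text, area text)
rows = [
    ("Examplia", "1,000,000", "2,500.0"),
    ("Testland", "50,000", "250.0"),
    ("Nowhere", "0", "0.0"),
]

results = []
for name, population_text, area_text in rows:
    population = int(population_text.replace(",", ""))
    area = float(area_text.replace(",", ""))
    # A zero area would raise ZeroDivisionError, so skip the division
    density = population / area if area > 0 else None
    results.append((name.upper(), density))

for name, density in results:
    if density is None:
        print(f"{name}: density not available")
    else:
        print(f"{name}: {density:.1f} people per km²")
```

In your own solution, replace the hard-coded `rows` with the values you extract from each `div.country` block.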

## 🔜 Next up: Lesson 7 – Error Handling Basics

You’ll learn how to use `try`/`except`, how to handle missing tags, and how to avoid common Beautiful Soup pitfalls.