# Web Scraping 
## Introduction to Web Scraping

Web scraping is the process of extracting data from websites automatically using code. Instead of manually copying and pasting information, web scraping allows us to collect data efficiently and systematically.

**Why Web Scraping?**

Many websites contain valuable data that can be used for analysis, research, or automation. Some common applications of web scraping include:

- Collecting product prices from e-commerce sites 📊
- Gathering real estate listings for market analysis 🏠
- Extracting news articles for sentiment analysis 📰
- Scraping job listings for employment trends 💼

**How Web Scraping Works**

At a high level, web scraping involves:

1. **Sending an HTTP request** – Accessing a webpage using Python libraries like `requests`.
2. **Parsing the HTML content** – Extracting the required data using tools like BeautifulSoup or lxml.
3. **Navigating through the webpage structure** – Identifying elements like headings, tables, and links.
4. **Storing the extracted data** – Saving the information in a structured format such as CSV, JSON, or a database.

**Legal and Ethical Considerations**

Before scraping a website, always check:

- `robots.txt` file – Websites use it to state which parts of the site automated clients may access.
- Terms of service – Some websites prohibit automated data collection.
- Ethical use – Only scrape publicly available data and avoid overloading a website with frequent requests.
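
The `robots.txt` check can be automated. The following is a minimal sketch using the standard library's `urllib.robotparser`; the rules and the `example.com` URLs are hypothetical and are parsed from a string so the example runs without network access.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content (an assumption for illustration only)
robots_txt = """\
User-agent: *
Disallow: /private/
"""

parser = RobotFileParser()
# parse() accepts the file's lines, so no HTTP request is needed here
parser.parse(robots_txt.splitlines())

# can_fetch(user_agent, url) reports whether the rules permit the request
allowed = parser.can_fetch("*", "https://example.com/quotes/")
blocked = parser.can_fetch("*", "https://example.com/private/data")
print(allowed, blocked)
```

Against a live site, `parser.set_url(...)` followed by `parser.read()` downloads the real file instead of parsing a string.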

In the next section, we'll set up our Python environment and start scraping real data! 🚀

## Web Scraping in Action
### Installing Web Scraping Libraries in Python

Before we start web scraping, we need to install some essential libraries that will help us extract data from websites. These libraries include:

- `requests`: Used for making HTTP requests to web pages and retrieving their content.
- `beautifulsoup4`: A library for parsing HTML and XML documents, making it easy to extract specific data.
- `selenium`: A powerful tool for automating web browsers, which is useful for scraping websites that require interaction (e.g., clicking buttons, handling JavaScript-rendered content).

To install these libraries, run the following command in your terminal or command prompt:
```bash
pip install requests beautifulsoup4 selenium
```

Once installed, verify that the libraries import correctly by running:

```python
import requests
from bs4 import BeautifulSoup
from selenium import webdriver

print("Libraries installed successfully!")
```

```
Libraries installed successfully!
```


## Connect to the target URL

Now that we have installed the necessary libraries, let’s start by making a simple web request to a website.

The requests library allows us to send HTTP requests to a website and retrieve its HTML content.

The following code sends a request to the website quotes.toscrape.com, which is a test site designed for practicing web scraping:

```python
# Define the target URL
url = "https://quotes.toscrape.com"

# Send a GET request to the website
page = requests.get(url)

# Print the status code of the response
print(f"Status Code: {page.status_code}")
```

```
Status Code: 200
```


Here is what happens in the code cell above:

- `requests.get(url)`: Sends a GET request to the website and returns the response.
- `page.status_code`: Holds the HTTP status code of the response.
	- `200` means the request was successful.
	- `404` means the page was not found.

In our case, the status code was `200`, so we successfully accessed the website.
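
Other status codes follow the same pattern. As a small sketch, the standard library's `http.HTTPStatus` enum can translate a code into its standard phrase; `describe_status` and `CATEGORIES` are hypothetical helpers introduced here for illustration, not part of `requests`.

```python
from http import HTTPStatus

# Broad meaning of each status-code class (first digit of the code)
CATEGORIES = {
    1: "informational",
    2: "success",
    3: "redirection",
    4: "client error",
    5: "server error",
}

def describe_status(code: int) -> str:
    """Return a readable description such as '404 Not Found (client error)'."""
    try:
        phrase = HTTPStatus(code).phrase
    except ValueError:
        # The code is not one of the registered HTTP status codes
        phrase = "Unknown"
    category = CATEGORIES.get(code // 100, "unknown category")
    return f"{code} {phrase} ({category})"

for code in (200, 404, 503):
    print(describe_status(code))
```

With `requests`, calling `page.raise_for_status()` raises an exception for 4xx and 5xx responses, which is a common way to stop early when a request fails.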

## Parsing HTML with BeautifulSoup
Now that we have retrieved the webpage content using requests, we need to parse the HTML so that we can extract useful information. This is where BeautifulSoup comes in!

**What is BeautifulSoup?**

BeautifulSoup is a Python library used for parsing HTML and XML documents. It allows us to navigate and search through the webpage’s structure easily.

```python
# Parse the page content using BeautifulSoup
soup = BeautifulSoup(page.text, 'html.parser')

# Print the formatted HTML content
print(soup.prettify()[:500])  # Display only the first 500 characters
```

```
<!DOCTYPE html>
<html lang="en">
 <head>
  <meta charset="utf-8"/>
  <title>
   Quotes to Scrape
  </title>
  <link href="/static/bootstrap.min.css" rel="stylesheet"/>
  <link href="/static/main.css" rel="stylesheet"/>
 </head>
 <body>
  <div class="container">
   <div class="row header-box">
    <div class="col-md-8">
     <h1>
      <a href="/" style="text-decoration: none">
       Quotes to Scrape
      </a>
     </h1>
    </div>
    <div class="col-md-4">
     <p>
      <a href="/login">
```


In the code above:

- `BeautifulSoup(page.text, 'html.parser')`:

	- Converts the HTML content (`page.text`) into a structured format that we can navigate.
	- `'html.parser'` selects the HTML parser built into Python's standard library.

- `soup.prettify()`:

	- Formats the HTML in a readable structure.
	- We use `[:500]` to show only the first 500 characters (to avoid printing too much data).

## Select HTML elements with Beautiful Soup

Now that we have parsed the HTML content using BeautifulSoup, let’s explore how to extract specific elements from the page.

1. **Finding Elements by Tag Name**

To retrieve all elements of a particular tag, such as `<h1>`, use `find_all()`:

```python
# Get all <h1> elements on the page
h1_elements = soup.find_all('h1')
print(h1_elements)
```

```
[<h1>
<a href="/" style="text-decoration: none">Quotes to Scrape</a>
</h1>]
```


This returns a list of all `<h1>` elements present in the HTML.

2. **Finding Elements by ID**

HTML elements often have unique `id` attributes. You can retrieve an element by its `id` using `find()`:


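```python
# Get the element with id="main-title"
main_title_element = soup.find(id='main-title')
print(main_title_element)
```

```
None
```

`find()` returns `None` when no element matches. The quotes page has no element with `id="main-title"`, so nothing is found here.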


3. **Finding Elements by Text Content**

Sometimes, elements don’t have specific attributes, but we can identify them based on their text:

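```python
# Find the footer text by its exact content
# (string= replaces the deprecated text= argument)
footer_element = soup.find(string='Powered by WordPress')
print(footer_element)
```

```
None
```

The page does not contain this exact text, so the result is `None`. When there is a match, `find(string=...)` returns the matching string itself; its `.parent` attribute gives the enclosing element.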


4. **Finding Elements by Attribute**

If an element has a unique attribute, we can search for it using `attrs`:

```python
# Find the email input element through its "name" attribute
email_element = soup.find(attrs={'name': 'email'})
print(email_element)
```

```
None
```


5. **Finding Elements by Class**

To find elements by class name, use the `class_` parameter:

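```python
# Find all the centered elements on the page
centered_elements = soup.find_all(class_='text-center')
print(centered_elements)
```

```
[]
```

Unlike `find()`, `find_all()` returns an empty list rather than `None` when nothing matches.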


6. **Combining Methods for Precise Selection**
    
We can chain methods to extract elements within specific sections:

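```python
# Get all "li" elements inside the ".navbar" element.
# find() returns None when nothing matches, so guard before chaining.
navbar = soup.find(class_='navbar')
navbar_items = navbar.find_all('li') if navbar is not None else []
print(navbar_items)
```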

7. **Using CSS Selectors with select()**

Instead of using multiple `.find()` calls, we can use CSS selectors with `select()`:


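```python
# Get all "li" elements inside the ".navbar" element.
# ".navbar li" matches nested items at any depth;
# ".navbar > li" would match direct children only.
navbar_items = soup.select('.navbar li')
print(navbar_items)
```

```
[]
```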


**In Summary**

- `find_all('tag')` → Retrieves all elements of a given tag.
- `find(id='some_id')` → Finds an element by its unique ID.
- `find(string='some_text')` → Finds a string based on its exact text content (`text=` is the older, deprecated name of this argument).
- `find(attrs={'name': 'some_name'})` → Searches for an element by attribute.
- `find_all(class_='some_class')` → Retrieves all elements with a specific class.
- `soup.select('CSS_selector')` → Uses CSS selectors for more flexible selection.

Using these methods, we can precisely extract any HTML element from a webpage and manipulate it as needed! 🚀
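
Several of the lookups above return `None` or `[]` on the quotes page because it has no matching elements. To see each method find something, here is a minimal sketch run against a small HTML snippet; the snippet and the `sample` variable are hypothetical and were written for this illustration, not taken from the site.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML snippet (an assumption, not the quotes site's markup)
html = """
<div id="main-title" class="text-center">Demo Page</div>
<ul class="navbar">
  <li>Home</li>
  <li>About</li>
</ul>
<input name="email" type="text">
<footer>Powered by WordPress</footer>
"""
sample = BeautifulSoup(html, "html.parser")

by_id = sample.find(id="main-title")                  # lookup by id
by_text = sample.find(string="Powered by WordPress")  # exact text match
by_attr = sample.find(attrs={"name": "email"})        # lookup by attribute
by_class = sample.find_all(class_="text-center")      # lookup by class
nav_items = sample.select(".navbar li")               # CSS selector
missing = sample.find(id="does-not-exist")            # no match

print(by_id.get_text())
print(by_text.parent.name)   # tag that encloses the matched string
print(by_attr["type"])
print(len(by_class))
print([li.get_text() for li in nav_items])
print(missing)
```

Because `find()` gives back `None` on a miss, it is worth checking the result before calling further methods on it.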

## Extract data from the elements

Now that we have learned how to navigate and extract elements from an HTML page, let's store the scraped data in a structured format.

1. **Initializing a Data Structure**

Before extracting data, we need a storage structure. Since we are working with multiple quotes, we’ll use a list of dictionaries:

```python
# Initialize a list to store the scraped quotes
quotes = []
```

2. **Finding the Quote Elements**

Each quote on the page is enclosed in a `<div>` with the class `"quote"`. We use `find_all()` to get all such elements:

```python
# Find all <div> elements with the class "quote"
quote_elements = soup.find_all('div', class_='quote')
```

Now, `quote_elements` is a list of all the quote elements on the page.

3. **Extracting Quote Data**

To retrieve the text, author, and tags for each quote, iterate over the list and extract the relevant information:

```python
for quote_element in quote_elements:
    # Extract the quote text
    text = quote_element.find('span', class_='text').text

    # Extract the author of the quote
    author = quote_element.find('small', class_='author').text

    # Extract the tag elements associated with the quote
    tag_elements = quote_element.select('.tags .tag')

    # Store the tags in a list
    tags = []
    for tag_element in tag_elements:
        tags.append(tag_element.text)
```

After the loop finishes, `text`, `author`, and `tags` still hold the values from the last quote processed:

```python
tags
```

```
['humor', 'obvious', 'simile']
```

4. **Storing Data in a Dictionary**

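After extracting the necessary details, we store the quote as a dictionary and add it to our list. Note that this cell runs once, after the loop has finished, so it stores only the last quote the loop processed. To collect every quote on the page, the `quotes.append(...)` call must be placed inside the loop body.

```python
# Append extracted data to the quotes list
quotes.append(
    {
        'text': text,
        'author': author,
        'tags': ', '.join(tags)  # Convert list of tags into a single string
    }
)
```

```python
quotes
```

```
[{'text': '“A day without sunshine is like, you know, night.”',
  'author': 'Steve Martin',
  'tags': 'humor, obvious, simile'}]
```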

Now, `quotes` is a list of dictionaries, each containing:

- `"text"` → The quote itself
- `"author"` → The person who said it
- `"tags"` → Relevant topics
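
To collect one dictionary per quote rather than a single entry, the `append` call can sit inside the loop. The sketch below shows that pattern on a hypothetical two-quote snippet that mimics the class names used above; `sample` and `all_quotes` are illustrative names, and the snippet is not the site's actual HTML.

```python
from bs4 import BeautifulSoup

# Hypothetical two-quote snippet reusing the class names from this section
html = """
<div class="quote">
  <span class="text">First quote</span>
  <small class="author">Author A</small>
  <div class="tags"><a class="tag">one</a><a class="tag">two</a></div>
</div>
<div class="quote">
  <span class="text">Second quote</span>
  <small class="author">Author B</small>
  <div class="tags"><a class="tag">three</a></div>
</div>
"""
sample = BeautifulSoup(html, "html.parser")

all_quotes = []
for quote_element in sample.find_all("div", class_="quote"):
    text = quote_element.find("span", class_="text").text
    author = quote_element.find("small", class_="author").text
    tags = [tag.text for tag in quote_element.select(".tags .tag")]

    # Appending inside the loop stores one dictionary per quote
    all_quotes.append({"text": text, "author": author, "tags": ", ".join(tags)})

print(len(all_quotes))
for quote in all_quotes:
    print(quote)
```

The same change applied to the loop over `quote_elements` would fill `quotes` with every quote found on the page.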
  
5. **Printing the Scraped Data**

To verify the extracted data, print up to the first five entries of `quotes` (at this point the list holds a single quote):

```python
# Display first 5 quotes
for quote in quotes[:5]:
    print(quote)
```

```
{'text': '“A day without sunshine is like, you know, night.”', 'author': 'Steve Martin', 'tags': 'humor, obvious, simile'}
```


## Exporting Scraped Data to a CSV File

Once we have successfully extracted quotes from the website, the next step is to save them in a structured format. One of the most common ways to store tabular data is in a CSV (Comma-Separated Values) file.

Python provides a built-in `csv` module, which makes it easy to write data to a CSV file. First, we import the module:

```python
import csv
```

We use the `open()` function to create (or overwrite) a CSV file named `quotes.csv` in the `../data` folder. That folder must already exist, because `open()` does not create missing directories. The arguments are:

- `'w'` → Write mode (creates a new file or overwrites an existing one).
- `encoding='utf-8'` → Ensures proper handling of special characters.
- `newline=''` → Prevents extra blank lines when writing to the file (especially on Windows).

After opening the file, we initialize the CSV writer with `csv.writer(csv_file)`. The `writer` object inserts rows into the CSV file.

The first row of a CSV file typically contains column headers. Since we extracted `text`, `author`, and `tags`, we write `Text`, `Author`, and `Tags` as the header row.

We then iterate over our `quotes` list and write each quote to a new row:

- `quote.values()` retrieves the dictionary values (text, author, and tags) in insertion order.
- The `writer.writerow()` method writes each quote as a new row in the CSV file.

Because we open the file in a `with open(...) as csv_file:` block, the file is closed automatically when the block ends, freeing system resources. If you use `open()` without `with`, you should close the file manually with `csv_file.close()`.

```python
# Open "quotes.csv" for writing, creating it if it does not exist
with open('../data/quotes.csv', 'w', encoding='utf-8', newline='') as csv_file:
    # Initialize the writer object that inserts rows into the CSV file
    writer = csv.writer(csv_file)

    # Write the header row
    writer.writerow(['Text', 'Author', 'Tags'])

    # Write one row per quote
    for quote in quotes:
        writer.writerow(quote.values())

# The with block closes the file automatically, so no close() call is needed
```

After running the cell, check the `../data` folder (relative to where the notebook runs). You should see a `quotes.csv` file containing the quote:
```
“A day without sunshine is like, you know, night.”
```
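
As an alternative to `csv.writer`, the standard library's `csv.DictWriter` maps dictionary keys to columns by name, so the output does not depend on the order of `quote.values()`. This is a minimal sketch under stated assumptions: the `rows` data is hypothetical, and an in-memory `io.StringIO` buffer stands in for the file so the example runs anywhere.

```python
import csv
import io

# Hypothetical rows shaped like the dictionaries in the quotes list
rows = [
    {"text": "First quote", "author": "Author A", "tags": "one, two"},
    {"text": "Second quote", "author": "Author B", "tags": "three"},
]

# In-memory buffer used instead of a file on disk
buffer = io.StringIO(newline="")
writer = csv.DictWriter(buffer, fieldnames=["text", "author", "tags"])

writer.writeheader()    # header row taken from fieldnames
writer.writerows(rows)  # one CSV row per dictionary

csv_text = buffer.getvalue()
print(csv_text)
```

To write to disk instead, pass the file object from `open(..., 'w', encoding='utf-8', newline='')` in place of the buffer.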