# **Web Scraping Practice: Wikipedia, Quotes to Scrape, and Books to Scrape**


## **Implementing Web Scraping in Python with BeautifulSoup**

Web scraping is a powerful technique for extracting data from websites. BeautifulSoup, a Python library, provides tools for parsing HTML and navigating the parse tree, making it an excellent choice for web scraping tasks.

### **Extracting Data from Websites**

There are two main approaches to extracting data from a website:

1. **Using the Website's API**: Some websites provide APIs (Application Programming Interfaces) that allow developers to retrieve data in a structured format. For example, Facebook offers the Facebook Graph API for accessing data posted on the platform.

2. **Web Scraping**: When a website doesn't offer an API, we can access the HTML of its webpages and extract useful information directly. This technique, known as web scraping, web harvesting, or web data extraction, involves sending HTTP requests to the website and parsing the HTML content.

### **Steps Involved in Web Scraping**

1. **Sending HTTP Requests**: We start by sending an HTTP request to the URL of the webpage we want to access. Python's `requests` library is commonly used for this task.

2. **Parsing HTML Content**: Once we receive the HTML content of the webpage, we need to parse it to extract the desired data. HTML parsing libraries such as Beautiful Soup build a nested tree structure from the raw HTML.

3. **Navigating and Searching the Parse Tree**: With the parsed HTML content, we can navigate and search the parse tree to locate specific elements and extract relevant information. Beautiful Soup provides convenient methods for traversing the parse tree and extracting data.

By following these steps, we can efficiently scrape data from websites and use it for various purposes, such as data analysis, research, and automation.
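
The three steps above can be sketched on a small, made-up HTML snippet. To keep the sketch runnable without network access, a literal string stands in for the body that `requests.get(url).text` would return; the markup and the class name `item` are assumptions chosen for illustration only.

```python
from bs4 import BeautifulSoup

# Step 1 (stand-in): in a real script this string would come from
# requests.get(url).text; a literal keeps the sketch runnable offline.
html = """
<html><body>
  <h1>Sample page</h1>
  <ul>
    <li class="item"><a href="/a">First</a></li>
    <li class="item"><a href="/b">Second</a></li>
  </ul>
</body></html>
"""

# Step 2: parse the HTML into a tree.
soup = BeautifulSoup(html, "html.parser")

# Step 3: navigate and search the tree.
heading = soup.find("h1").text.strip()
links = []
for item in soup.find_all("li", class_="item"):
    anchor = item.find("a")
    links.append((anchor.text.strip(), anchor["href"]))

print(heading)
print(links)
```

The same pattern, `find()` for a single element and `find_all()` for every match, is what the scripts in the following sections rely on.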




- The `!` symbol at the beginning of the command below tells Colab to execute it as a shell command rather than as a Python statement.
- `pip` is the package manager used to install and manage Python packages.
- `install` is the pip command that installs packages.
- `requests` and `beautifulsoup4` are the packages to install. `requests` sends the HTTP requests, and `beautifulsoup4` provides the `bs4` module that `BeautifulSoup` is imported from.

Running this command in a cell of your Google Colab notebook installs the necessary packages, after which you can proceed with the web scraping tasks using `requests` and `BeautifulSoup`.


In [54]:
!pip install requests beautifulsoup4



Three websites we can scrape data from:

1. **Wikipedia**: https://www.wikipedia.org/
   - Wikipedia is a free online encyclopedia with articles covering a wide range of topics. It provides valuable information on various subjects, making it a useful source for web scraping.

2. **Quotes to Scrape**: http://quotes.toscrape.com/
   - Quotes to Scrape is a website specifically designed for practicing web scraping. It contains a collection of quotes from various authors, along with their tags.

3. **Books to Scrape**: http://books.toscrape.com/
   - Books to Scrape is another website designed for practicing web scraping. It contains a collection of books, including their titles, prices, ratings, and availability.


### **Wikipedia: https://www.wikipedia.org/**

This script performs web scraping on the Wikipedia homepage (https://www.wikipedia.org/) to extract the titles of the languages available on the site.

1. **Import Libraries**: The script imports `requests` for making HTTP requests, `BeautifulSoup` for parsing HTML, and the built-in `csv` module for writing the output file.

2. **Define URL**: The URL of the Wikipedia homepage is defined as `url`.

3. **Make GET Request**: The script makes a GET request to fetch the HTML content of the Wikipedia homepage using the `requests.get()` function. The response is stored in the `response` variable.

4. **Parse HTML**: The HTML content of the Wikipedia homepage is parsed by passing it to the `BeautifulSoup` constructor together with Python's built-in `html.parser`. The parsed HTML is stored in the `soup` variable.

5. **Find Language Links**: BeautifulSoup's `find_all()` method is used to find all the `<a>` elements with the class `link-box`, which represent language links on the Wikipedia homepage. These links contain the titles of the languages available on the site.

6. **Create CSV File**: A CSV file named "wikipedia_languages.csv" is created using Python's built-in `csv` module. The file is opened in write mode (`'w'`), and the column header "Language" is written to the file using the `writer.writerow()` method.

7. **Write Language Titles to CSV**: For each language link found, the script locates the `<strong>` element with `find()`, reads its `.text`, and strips the surrounding whitespace. The resulting language title is then written to the CSV file using the `writer.writerow()` method.

8. **Print Success Message**: Finally, a success message is printed, indicating that the language titles have been written to the CSV file.

This script demonstrates a basic web scraping workflow for extracting data from a webpage and saving it to a CSV file.

In [44]:
import requests
from bs4 import BeautifulSoup
import csv

# Define the URL of the Wikipedia homepage
url = 'https://www.wikipedia.org/'

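# Make a GET request to fetch the HTML content of the page.
# Wikimedia asks clients to send a descriptive User-Agent; the value below is a placeholder.
headers = {'User-Agent': 'ScrapingPractice/0.1 (educational notebook)'}
response = requests.get(url, headers=headers, timeout=10)
response.raise_for_status()  # stop on a 4xx/5xx response instead of parsing an error page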

# Parse the HTML content using BeautifulSoup
soup = BeautifulSoup(response.text, 'html.parser')

# Find the list of language links on the Wikipedia homepage
language_links = soup.find_all('a', class_='link-box')

# Create a CSV file to write the data
filename = 'wikipedia_languages.csv'
with open(filename, 'w', newline='', encoding='utf-8') as file:
    writer = csv.writer(file)
    writer.writerow(['Language'])

    # Write language titles to the CSV file
    for link in language_links:
        language = link.find('strong').text.strip()
        writer.writerow([language])

print("Language titles have been written to", filename)


Language titles have been written to wikipedia_languages.csv
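
The loop above calls `.text` on the result of `link.find('strong')`. `find()` returns `None` when nothing matches, so a link without a `<strong>` child would raise an `AttributeError`. A minimal sketch of a guard follows; the sample markup and the helper name `extract_language_names` are hypothetical and only illustrate the check.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: the second link has no <strong> child.
sample_html = """
<a class="link-box" href="//en.wikipedia.org/"><strong>English</strong></a>
<a class="link-box" href="//example.org/"><span>No strong tag</span></a>
<a class="link-box" href="//fr.wikipedia.org/"><strong> Français </strong></a>
"""


def extract_language_names(html):
    """Return the text of each <strong> found inside an <a class="link-box">."""
    soup = BeautifulSoup(html, "html.parser")
    names = []
    for link in soup.find_all("a", class_="link-box"):
        strong = link.find("strong")
        if strong is None:
            # Skip links that do not have the expected structure.
            continue
        names.append(strong.text.strip())
    return names


print(extract_language_names(sample_html))
```

Skipping malformed entries keeps the run going if the page layout changes slightly; logging the skipped links instead would be an equally reasonable choice.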


### **Quotes to Scrape: http://quotes.toscrape.com/**

This script scrapes data from the "Quotes to Scrape" website (http://quotes.toscrape.com/).

1. **Import Libraries**: The script imports `requests` for making HTTP requests, `BeautifulSoup` for parsing HTML, and the built-in `csv` module for writing the output file.

2. **Define URL**: The URL of the "Quotes to Scrape" website is defined as `url`.

3. **Make GET Request**: The script makes a GET request to fetch the HTML content of the webpage using the `requests.get()` function. The response is stored in the `response` variable.

4. **Check Response**: It checks if the request was successful by verifying the status code of the response. If the status code is 200, it prints a success message; otherwise, it prints an error message along with the status code.

5. **Parse HTML**: The HTML content of the webpage is parsed by passing it to the `BeautifulSoup` constructor together with Python's built-in `html.parser`. The parsed HTML is stored in the `soup` variable.

6. **Find Quote Containers**: BeautifulSoup's `find_all()` method is used to find all the `<div>` elements with the class `quote`, which represent individual quote containers on the webpage.

7. **Create CSV File**: A CSV file named "quotes_to_scrape.csv" is created using Python's built-in `csv` module. The file is opened in write mode (`'w'`), and the column headers "Author", "Quote", and "Tags" are written to the file using the `writer.writerow()` method.

8. **Write Quote Details to CSV**: For each quote container found, the script extracts the author name and quote text with `find()` and `.text`, and collects the tags with `find_all()`. The tags are joined into a single comma-separated string, and the row is written to the CSV file using the `writer.writerow()` method.

9. **Print Success Message**: Finally, a success message is printed, indicating that the quotes data has been written to the CSV file.

This script demonstrates a basic web scraping workflow for extracting data from a webpage and saving it to a CSV file.
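
Step 4 above compares the status code with 200 by hand. As an alternative, `requests` provides `Response.raise_for_status()`, which raises `requests.exceptions.HTTPError` for 4xx and 5xx responses. The sketch below builds a `Response` object by hand, purely so it runs without network access; in a real script the object would come from `requests.get(url)`.

```python
import requests

# Build a Response by hand so the sketch runs without network access;
# in a real script this object comes from requests.get(url).
response = requests.Response()
response.status_code = 404

try:
    response.raise_for_status()  # raises HTTPError for 4xx/5xx status codes
    outcome = "ok"
except requests.exceptions.HTTPError as exc:
    outcome = "failed"
    print("Request failed:", exc)

print(outcome)
```

Raising an exception stops the script before it tries to parse an error page, which a printed message alone does not do.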

In [52]:
import requests
from bs4 import BeautifulSoup
import csv

# Define the URL of Quotes to Scrape
url = 'http://quotes.toscrape.com/'

# Make a GET request to fetch the HTML content of the page
response = requests.get(url)

# Check if the request was successful
if response.status_code == 200:
    print("Successfully fetched the HTML content.")
else:
    print("Failed to fetch the HTML content. Status code:", response.status_code)
    response.raise_for_status()  # stop on a 4xx/5xx response instead of parsing an error page

# Parse the HTML content using BeautifulSoup
soup = BeautifulSoup(response.text, 'html.parser')

# Find all quote containers
quote_containers = soup.find_all('div', class_='quote')

# Create a CSV file to write the data
filename = 'quotes_to_scrape.csv'
with open(filename, 'w', newline='', encoding='utf-8') as file:
    writer = csv.writer(file)
    writer.writerow(['Author', 'Quote', 'Tags'])

    # Write quote details to the CSV file
    for quote in quote_containers:
        author = quote.find('small', class_='author').text.strip()
        quote_text = quote.find('span', class_='text').text.strip()
        tags = [tag.text for tag in quote.find_all('a', class_='tag')]
        writer.writerow([author, quote_text, ', '.join(tags)])

print("Quotes data from Quotes to Scrape has been written to", filename)


Successfully fetched the HTML content.
Quotes data from Quotes to Scrape has been written to quotes_to_scrape.csv
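
Quote text and the joined tag string both tend to contain commas, which might look like a problem for a comma-separated file. The `csv` module quotes such fields automatically, so they survive a round trip. A minimal sketch, using an in-memory buffer in place of the file and a made-up row:

```python
import csv
import io

# Hypothetical row: the quote text and the joined tags both contain commas.
row = ["Jane Doe", "“Stop, look, and listen.”", "advice, life"]

buffer = io.StringIO(newline="")  # in-memory stand-in for the opened file
writer = csv.writer(buffer)
writer.writerow(["Author", "Quote", "Tags"])
writer.writerow(row)

# Fields that contain commas are wrapped in double quotes in the output.
print(buffer.getvalue())

# Reading the text back yields the original three fields per row.
buffer.seek(0)
rows = list(csv.reader(buffer))
tags = rows[1][2].split(", ")
print(tags)
```

Splitting the third column on `', '` recovers the individual tags, mirroring the `', '.join(tags)` call in the script.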


### **Books to Scrape: http://books.toscrape.com/**

This script scrapes data from the "Books to Scrape" website (http://books.toscrape.com/).

1. **Import Libraries**: The script imports `requests` for making HTTP requests, `BeautifulSoup` for parsing HTML, and the built-in `csv` module for writing the output file.

2. **Define URL**: The URL of the "Books to Scrape" website is defined as `url`.

3. **Make GET Request**: The script makes a GET request to fetch the HTML content of the webpage using the `requests.get()` function. The response is stored in the `response` variable.

4. **Check Response**: It checks if the request was successful by verifying the status code of the response. If the status code is 200, it prints a success message; otherwise, it prints an error message along with the status code.

5. **Parse HTML**: The HTML content of the webpage is parsed by passing it to the `BeautifulSoup` constructor together with Python's built-in `html.parser`. The parsed HTML is stored in the `soup` variable.

6. **Find Book Containers**: BeautifulSoup's `find_all()` method is used to find all the `<article>` elements with the class `product_pod`, which represent individual book containers on the webpage.

7. **Create CSV File**: A CSV file named "books_to_scrape.csv" is created using Python's built-in `csv` module. The file is opened in write mode (`'w'`), and the column headers "Title", "Price", "Rating", and "Availability" are written to the file using the `writer.writerow()` method.

8. **Write Book Details to CSV**: For each book container found, the script reads the title from the `title` attribute of the link inside `<h3>`, the price and availability from the text of their `<p>` elements, and the rating from the second class name on the `star-rating` element (for example, `Three`). The extracted details are then written to the CSV file using the `writer.writerow()` method.

9. **Print Success Message**: Finally, a success message is printed, indicating that the books data has been written to the CSV file.

This script demonstrates a basic web scraping workflow for extracting data from a webpage and saving it to a CSV file.
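
Step 8 depends on how Beautiful Soup treats the `class` attribute: it is multi-valued, so indexing a tag with `['class']` returns a list of class names, and `class_=` matches an element if any one of its classes matches. The fragment below is a hypothetical stand-in shaped like one book container, not markup copied from the site.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment shaped like one book container.
sample_html = """
<article class="product_pod">
  <p class="star-rating Three"></p>
  <h3><a href="#" title="Example Book Title">Example Book</a></h3>
  <p class="price_color">£10.00</p>
  <p class="instock availability">
      In stock
  </p>
</article>
"""

soup = BeautifulSoup(sample_html, "html.parser")
book = soup.find("article", class_="product_pod")

# The class attribute comes back as a list of individual class names.
rating_classes = book.find("p", class_="star-rating")["class"]
rating = rating_classes[1]

# The full title lives in the link's title attribute, not its visible text.
title = book.find("h3").find("a")["title"]

# Matching on a single class name is enough to find the availability element.
availability = book.find("p", class_="availability").text.strip()

print(rating_classes, rating, title, availability)
```

Matching on one class name is less brittle than matching the whole attribute string, which depends on the classes appearing in exactly that order.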

In [53]:
import requests
from bs4 import BeautifulSoup
import csv

# Define the URL of Books to Scrape
url = 'http://books.toscrape.com/'

# Make a GET request to fetch the HTML content of the page
response = requests.get(url)

# Check if the request was successful
if response.status_code == 200:
    print("Successfully fetched the HTML content.")
else:
    print("Failed to fetch the HTML content. Status code:", response.status_code)
    response.raise_for_status()  # stop on a 4xx/5xx response instead of parsing an error page

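# Parse the HTML content using BeautifulSoup.
# Passing the raw bytes lets BeautifulSoup detect the page's own encoding; response.text
# may be decoded as ISO-8859-1 when the server sends no charset, which garbles the pound sign.
soup = BeautifulSoup(response.content, 'html.parser')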

# Find all book containers
book_containers = soup.find_all('article', class_='product_pod')

# Create a CSV file to write the data
filename = 'books_to_scrape.csv'
with open(filename, 'w', newline='', encoding='utf-8') as file:
    writer = csv.writer(file)
    writer.writerow(['Title', 'Price', 'Rating', 'Availability'])

    # Write book details to the CSV file
    for book in book_containers:
        title = book.find('h3').find('a')['title']
        price = book.find('p', class_='price_color').text
        rating = book.find('p', class_='star-rating')['class'][1]
        # Match on the single class name 'availability' rather than the exact attribute string
        availability = book.find('p', class_='availability').text.strip()
        writer.writerow([title, price, rating, availability])

print("Books data from Books to Scrape has been written to", filename)


Successfully fetched the HTML content.
Books data from Books to Scrape has been written to books_to_scrape.csv
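
Prices on this site start with a pound sign. If a scraped price shows up as `Â£51.77` instead of `£51.77`, a likely cause is that UTF-8 bytes were decoded as ISO-8859-1, the encoding `requests` falls back to for `text/*` responses that declare no charset. The sketch below reproduces the effect with plain byte strings; the price is a sample value.

```python
# Bytes as a server would send them for a UTF-8 encoded page (sample value).
raw = "£51.77".encode("utf-8")

# Decoding with the wrong codec turns the two-byte pound sign into two characters.
wrong = raw.decode("iso-8859-1")

# Decoding with the codec the page was written in gives the intended text.
right = raw.decode("utf-8")

print(wrong)
print(right)
```

In a scraping script, setting `response.encoding = 'utf-8'` before reading `response.text`, or passing `response.content` to `BeautifulSoup`, are two common ways to avoid the mismatch.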
