# WEB SCRAPING (Using Python)

# Project Objective

Website : https://quotes.toscrape.com/

The objective of this project is to scrape quotes, author names, details, and tags from multiple pages of a website. 

This involves:

    1. Initializing a list to store data from multiple pages.
    2. Looping through each page, sending a request to the website, and parsing the HTML content.
    3. Extracting the desired information (quote text, author's name, details link, and tags) from each page.
    4. Storing the extracted data in the list.
    5. Converting the list of data into a DataFrame.
    6. Printing the DataFrame and saving it to a CSV file, for both the single-page and the multiple-page data.

The code achieves this with three libraries: requests for making HTTP requests, BeautifulSoup for parsing HTML, and pandas for handling the data as a DataFrame.

# Prerequisites

    1. Basic knowledge of HTML and CSS.
    2. Proficiency in Python programming:
        o data types
        o loops
        o conditionals
        o functions
    3. Familiarity with web scraping libraries such as BeautifulSoup and Scrapy, including how to use them to parse HTML and extract data. This project uses BeautifulSoup only.
    4. An understanding of the HTTP protocol and of how to send HTTP requests with a library such as Requests, retrieve web pages, and handle different types of responses.
    5. Check the website's robots.txt file and terms of service to confirm that web scraping is allowed.
    6. Respect the website's terms and conditions to avoid legal issues.
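
The robots.txt check in prerequisite 5 can also be done programmatically. The sketch below uses Python's standard urllib.robotparser module to test URLs against a set of rules. The rules in SAMPLE_ROBOTS_TXT and the helper name is_allowed are hypothetical and chosen only for illustration; they are not the actual rules of quotes.toscrape.com.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, used only for illustration.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def is_allowed(robots_txt, url, user_agent="*"):
    """Return True if the given robots.txt rules permit user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())  # Parse rules from text, no network access
    return parser.can_fetch(user_agent, url)


print(is_allowed(SAMPLE_ROBOTS_TXT, "https://example.com/page/1"))        # True
print(is_allowed(SAMPLE_ROBOTS_TXT, "https://example.com/private/data"))  # False
```

For a live site, RobotFileParser can download the file itself through its set_url() and read() methods. A robots.txt check does not replace reading the terms of service.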

In [2]:
pip install pandas

Collecting pandas
  Downloading pandas-2.2.2-cp312-cp312-win_amd64.whl.metadata (19 kB)
Collecting pytz>=2020.1 (from pandas)
  Using cached pytz-2024.1-py2.py3-none-any.whl.metadata (22 kB)
Collecting tzdata>=2022.7 (from pandas)
  Using cached tzdata-2024.1-py2.py3-none-any.whl.metadata (1.4 kB)
Downloading pandas-2.2.2-cp312-cp312-win_amd64.whl (11.5 MB)

In [3]:
# Import necessary libraries
import requests  # For making HTTP requests
from bs4 import BeautifulSoup  # For parsing HTML
import pandas as pd  # For data manipulation and analysis

The code below fetches the HTML content of a webpage with the requests library and parses it with BeautifulSoup, so that specific information can then be extracted from the page.

    1. The variable link stores the URL of the website to scrape, in this case 'https://quotes.toscrape.com/'.
    2. The requests.get() function sends a GET request to that URL (link) and retrieves the HTML content of the webpage.
    3. The response from the website is stored in the variable res.
    4. Creating a BeautifulSoup object:
        a. BeautifulSoup is a Python library for parsing HTML and XML documents.
        b. BeautifulSoup(res.text, 'html.parser') creates a BeautifulSoup object named soup that parses the HTML content of the response (res.text).
        c. This object is then used to navigate the HTML structure of the webpage and extract data from it.

In [4]:
# Define the URL of the website to scrape
link = 'https://quotes.toscrape.com/'

In [5]:
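# Sending a request to the website and getting the response
res = requests.get(link, timeout=10)  # Give up if the server does not answer within 10 seconds
res.raise_for_status()  # Raise an error for 4xx/5xx responses instead of parsing an error page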

In [6]:
# Creating a soup object to parse the HTML content
soup = BeautifulSoup(res.text, 'html.parser')

# Basic Initialization 

    1. Here we initialize two empty lists, quotes and authors, which will store the quotes and authors' names scraped from the website, respectively. 
    2. These lists are meant to accumulate the data extracted during the scraping process.


In [7]:
# Scraping quotes and authors' names

# Initialize empty lists to store quotes and authors
quotes = []
authors = []

# Scraping quotes and authors' names individually

    • The program loops over all elements of the webpage with the class 'text' (the quote text), found with BeautifulSoup's find_all() method.
    • For each element found, it extracts the text content, removes the leading and trailing quotation marks with the slice [1:-1], and appends the cleaned text to the quotes list.
    • Similarly, it loops over all elements with the class 'author', extracts the text content (the author's name), and appends it to the authors list.
    • Together, these loops scrape the quotes and authors' names from the webpage and store them in their respective lists.
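
The slice [1:-1] removes one character from each end whatever that character is, so it would also trim a quote that has no surrounding quotation marks. A minimal alternative sketch, assuming the quotes are wrapped in curly or straight double quotation marks, strips only those characters. The helper name clean_quote is hypothetical.

```python
def clean_quote(raw_text):
    """Remove surrounding whitespace and double quotation marks, if present."""
    return raw_text.strip().strip('“”"')


print(clean_quote('“It is our choices.”'))  # It is our choices.
print(clean_quote('No marks here'))         # No marks here
print('No marks here'[1:-1])                # o marks her
```

Note that strip() removes every matching character at both ends, so a quote that itself ends with a nested quotation mark would lose that mark as well.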


In [8]:
# Loop through every <span class="text"> element and store the quote text
for quote in soup.find_all('span', class_='text'):
    quotes.append(quote.text[1:-1])  # Drop the surrounding quotation marks

# Loop through every <small class="author"> element and store the author's name
for author in soup.find_all('small', class_='author'):
    authors.append(author.text)

# Scraping quotes, authors, details, and tags

    For each element on the webpage with the class 'quote', the code extracts the following:
    • The text of the quote, from the <span> element with the class 'text'.
    • The author's name, from the <small> element with the class 'author'.
    • The details link, from the first <a> element within the quote.
    • The tags associated with the quote, from the <a> elements with the class 'tag'.
    • After extracting this information, the code:
        o formats the tags as a comma-separated string
        o appends all the extracted data (quote text, author's name, details link, and tags) as a list to the data list.
    • This collects the required information for each quote and stores it in a structured format.
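
The details link is stored exactly as it appears in the href attribute, which on this site is a relative path such as /author/Albert-Einstein. If an absolute URL is wanted, one option is urljoin from the standard library. This is a sketch and not part of the original code; BASE_URL and to_absolute are names introduced here for illustration.

```python
from urllib.parse import urljoin

BASE_URL = 'https://quotes.toscrape.com/'


def to_absolute(href, base_url=BASE_URL):
    """Resolve a relative href against the site's base URL."""
    return urljoin(base_url, href)


print(to_absolute('/author/Albert-Einstein'))
# https://quotes.toscrape.com/author/Albert-Einstein
```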


In [10]:
# Initialize the data list to store the scraped data
data = []

# Loop through all elements with the specified class and extract desired information
for sp in soup.find_all('div', class_='quote'):
    quote = sp.find('span', class_='text').text[1:-1]  # Extract quote text
    author = sp.find('small', class_='author').text  # Extract author's name
    details = sp.find('a').get('href')  # Extract details link
    tags = [tag.text for tag in sp.find_all('a', class_='tag')]  # Extract tags
    tags = ','.join(tags)  # Convert tags list to a comma-separated string
    data.append([quote, author, details, tags])  # Append data to the list
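
The same extraction logic is needed again for the multi-page loop, so it can help to wrap it in a function. The sketch below does that and runs it on a small hand-written HTML snippet, so it works without a network request. SAMPLE_HTML is a hypothetical fragment that only imitates the structure described above, and parse_quotes is a name introduced here.

```python
from bs4 import BeautifulSoup

# Hypothetical markup imitating one quote block; not copied from the site.
SAMPLE_HTML = """
<div class="quote">
  <span class="text">“A sample quote.”</span>
  <span>by <small class="author">Sample Author</small>
    <a href="/author/Sample-Author">(about)</a>
  </span>
  <div class="tags">
    <a class="tag" href="/tag/first/">first</a>
    <a class="tag" href="/tag/second/">second</a>
  </div>
</div>
"""


def parse_quotes(html):
    """Return one [quote, author, details, tags] row per quote block in html."""
    soup = BeautifulSoup(html, 'html.parser')
    rows = []
    for block in soup.find_all('div', class_='quote'):
        quote = block.find('span', class_='text').text[1:-1]  # Drop quotation marks
        author = block.find('small', class_='author').text
        details = block.find('a').get('href')  # First link in the block
        tags = ','.join(tag.text for tag in block.find_all('a', class_='tag'))
        rows.append([quote, author, details, tags])
    return rows


rows = parse_quotes(SAMPLE_HTML)
print(rows)
```

With a real response, the call would be parse_quotes(res.text).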

# Scraping data from multiple pages

In [11]:
# Initialize a list to store data from multiple pages
multiple_pages_data = []

# Loop through multiple pages and scrape data
for page in range(1, 11):
    # Construct the URL for each page
    page_link = f"https://quotes.toscrape.com/page/{page}"
    # Send a request to the website
    res = requests.get(page_link)
    # Create a soup object to parse the HTML content
    soup = BeautifulSoup(res.text, 'html.parser')
    # Loop through all elements with the specified class and extract desired information
    for sp in soup.find_all('div', class_='quote'):
        quote = sp.find('span', class_='text').text[1:-1]  # Extract quote text
        author = sp.find('small', class_='author').text  # Extract author's name
        details = sp.find('a').get('href')  # Extract details link
        tags = [tag.text for tag in sp.find_all('a', class_='tag')]  # Extract tags
        tags = ','.join(tags)  # Convert tags list to a comma-separated string
        multiple_pages_data.append([quote, author, details, tags])  # Append data to the list
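
The loop above assumes the site has exactly ten pages and sends the requests back to back. A hedged sketch of a more defensive loop follows: it stops at the first page that yields no rows and pauses between requests. The function scrape_pages, its parameters, and the stand-in page data are all hypothetical; fetch_page and parse_page stand in for the requests and BeautifulSoup steps shown above. The sketch assumes that a page past the end of the listing contains no quote blocks.

```python
import time


def scrape_pages(fetch_page, parse_page, max_pages=10, delay_seconds=1.0):
    """Collect rows page by page, stopping early at the first empty page."""
    all_rows = []
    for page in range(1, max_pages + 1):
        rows = parse_page(fetch_page(page))
        if not rows:
            break  # No quotes on this page, so assume the listing has ended
        all_rows.extend(rows)
        time.sleep(delay_seconds)  # Pause so the server is not flooded
    return all_rows


# Stand-in data so the sketch runs without network access.
FAKE_PAGES = {
    1: [['q1', 'a1', '/author/a1', 't1']],
    2: [['q2', 'a2', '/author/a2', 't2']],
}

rows = scrape_pages(
    fetch_page=lambda page: FAKE_PAGES.get(page, []),
    parse_page=lambda content: content,
    delay_seconds=0,
)
print(len(rows))  # 2
```

Against the real site, fetch_page could return the text of requests.get(page_link), and parse_page could build the rows from the soup as in the loop above.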


# Data processing and saving

    • Convert the single-page data to a DataFrame with the column names 'Quote', 'Author', 'Details', and 'Tags'.
    • Print the single-page DataFrame to the console.
    • Save the single-page DataFrame to a CSV file:
        o The DataFrame is saved to a file named 'scraped_data.csv' using the to_csv() method provided by pandas.
        o The parameter index=False ensures that the DataFrame index is not included in the CSV file.
    • Repeat the steps above for the multiple-page data, which is saved to 'scraped_data_multiple_pages.csv'.
    • After each save, a confirmation message is printed to the console, indicating that the scraped data has been written to the CSV file.
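
To see what index=False produces without writing a file, to_csv() can be called with no path, in which case it returns the CSV text as a string. The sketch below uses one hypothetical row in the same shape as the scraped data; the values are made up for illustration.

```python
import pandas as pd

# Hypothetical single row in the same shape as the scraped data.
sample = pd.DataFrame(
    [['A sample quote.', 'Sample Author', '/author/Sample-Author', 'first,second']],
    columns=['Quote', 'Author', 'Details', 'Tags'],
)

csv_text = sample.to_csv(index=False)  # With no path, to_csv returns the CSV as a string
for line in csv_text.splitlines():
    print(line)
```

The first line printed is the header Quote,Author,Details,Tags, with no index column. Because the Tags value contains a comma, the writer wraps it in double quotes, so it is read back as a single column.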

In [12]:
# Convert the list of data into a DataFrame
df = pd.DataFrame(data, columns=['Quote', 'Author', 'Details', 'Tags'])

# Print the DataFrame
print("Scraped Data from Single Page:")
print(df)

# Save the DataFrame to a CSV file
df.to_csv('scraped_data.csv', index=False)
print("Scraped data saved to 'scraped_data.csv'.")

# Convert the list of data from multiple pages into a DataFrame
df_multiple_pages = pd.DataFrame(multiple_pages_data, columns=['Quote', 'Author', 'Details', 'Tags'])

# Print the DataFrame from multiple pages
print("\nScraped Data from Multiple Pages:")
print(df_multiple_pages)

# Save the DataFrame from multiple pages to a CSV file
df_multiple_pages.to_csv('scraped_data_multiple_pages.csv', index=False)
print("Scraped data from multiple pages saved to 'scraped_data_multiple_pages.csv'.")


Scraped Data from Single Page:
                                               Quote             Author  \
0  The world as we have created it is a process o...    Albert Einstein   
1  It is our choices, Harry, that show what we tr...       J.K. Rowling   
2  There are only two ways to live your life. One...    Albert Einstein   
3  The person, be it gentleman or lady, who has n...        Jane Austen   
4  Imperfection is beauty, madness is genius and ...     Marilyn Monroe   
5  Try not to become a man of success. Rather bec...    Albert Einstein   
6  It is better to be hated for what you are than...         André Gide   
7  I have not failed. I've just found 10,000 ways...   Thomas A. Edison   
8  A woman is like a tea bag; you never know how ...  Eleanor Roosevelt   
9   A day without sunshine is like, you know, night.       Steve Martin   

                     Details                                      Tags  
0    /author/Albert-Einstein       change,deep-thoughts,thinking,world