
# **Lecture 18 : Fetching Data Using Web Scraping**

# **Why is This Important in Machine Learning?**


---


Sometimes there is no API available to fetch data.

But the data is visible on websites.

So, we scrape (extract) that data directly from websites — this is called Web Scraping.

# **Example:**

You want the latest news headlines or product prices from a website → You scrape it.

# **Part 1: What is Web Scraping?**


---



Web Scraping = Automatically reading the HTML code of a webpage and extracting useful information from it.

# **Daily Example:**

You check phone prices on an e-commerce website by hand.

Web Scraping does the same thing automatically!
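As a small illustration of the idea, the sketch below parses a hard-coded HTML snippet rather than a live page. The markup, the class names `product`, `name`, and `price`, and the phone names are all made up for this example; a real site will have its own structure.

```python
from bs4 import BeautifulSoup

# Made-up HTML standing in for a downloaded product page (an assumption for this sketch).
html = """
<div class="product"><span class="name">Phone A</span><span class="price">$199</span></div>
<div class="product"><span class="name">Phone B</span><span class="price">$349</span></div>
"""

soup = BeautifulSoup(html, "html.parser")

# Collect (name, price) pairs from every product block.
phones = []
for product in soup.find_all("div", class_="product"):
    name = product.find("span", class_="name").text
    price = product.find("span", class_="price").text
    phones.append((name, price))

print(phones)
```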

# **Part 2: Tools Needed for Web Scraping**

Here are a few example websites that are generally suitable for practicing web scraping:

*   Quotes to Scrape: http://quotes.toscrape.com/ (designed specifically for practicing web scraping)
*   Books to Scrape: http://books.toscrape.com/ (another site designed for scraping practice)
*   Weather.com: https://weather.com/ (you could scrape weather information)
*   Edu.gcfglobal.org: https://edu.gcfglobal.org/en/ (used later in this notebook, a good example of a less complex site)

Remember to always check a website's robots.txt file (e.g., https://edu.gcfglobal.org/robots.txt) and terms of service before scraping, to confirm that scraping is allowed and to understand any restrictions.
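The robots.txt check can also be done in code. The sketch below uses the standard library's `urllib.robotparser` on a small, invented set of rules so that it runs without network access; the rules and the `example.com` URLs are assumptions for illustration only. For a live site, `set_url()` followed by `read()` would load the real file instead.

```python
from urllib.robotparser import RobotFileParser

# Example rules, invented for this sketch; a real file lives at <site>/robots.txt.
robots_txt = """\
User-agent: *
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())

# Ask whether a generic crawler ("*") may fetch each URL under these rules.
allowed_home = parser.can_fetch("*", "https://example.com/index.html")
allowed_private = parser.can_fetch("*", "https://example.com/private/data.html")

print(allowed_home, allowed_private)
```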

# **We mostly use two Python libraries:**

requests → to download webpage HTML

BeautifulSoup → to parse and extract information

In [None]:
# !pip install requests
# !pip install beautifulsoup4


# **Part 3: Basic Steps in Web Scraping**

| Step | What We Do |
| --- | --- |
| 1 | Use requests to fetch the webpage HTML |
| 2 | Parse HTML with BeautifulSoup |
| 3 | Find and extract the specific data you want |

# **Part 4: Code: Basic Web Scraping Example**

In [1]:

import pandas as pd
import requests
from bs4 import BeautifulSoup
import numpy as np


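# **If the Response Code is 403**

A 403 often means the site is rejecting the default `requests` User-Agent. In that case, try sending a browser-like `User-Agent` header with the request:

headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.3; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.162 Safari/537.36'}

requests.get('url', headers=headers).text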

In [2]:

webpage=requests.get('https://www.ambitionbox.com/list-of-companies?page=1').text

In [5]:
# headers={'User-Agent':'Mozilla/5.0 (Windows NT 6.3; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.162 Safari/537.36'}
# webpage=requests.get('https://www.ambitionbox.com/list-of-companies?page=1',headers=headers).text

In [6]:
soup=BeautifulSoup(webpage,'lxml')

In [7]:
print(soup.prettify())

<html>
 <head>
  <title>
   Access Denied
  </title>
 </head>
 <body>
  <h1>
   Access Denied
  </h1>
  You don't have permission to access "http://www.ambitionbox.com/list-of-companies?" on this server.
  <p>
   Reference #18.a8dddd17.1749706165.6017f919
  </p>
  <p>
   https://errors.edgesuite.net/18.a8dddd17.1749706165.6017f919
  </p>
 </body>
</html>



In [8]:
soup.find_all('h1')[0].text

'Access Denied'
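The "Access Denied" text above is the body of an error page: `.text` returns whatever the server sent, even when the request was refused. A minimal sketch of checking the status before parsing is shown below. The helper name `fetch_html`, the `HEADERS` constant, and the 10-second timeout are assumptions for this example, and a browser-like header is not guaranteed to get past every site's protection.

```python
import requests

# Assumed browser-like header; any current browser's User-Agent string would do.
HEADERS = {
    "User-Agent": (
        "Mozilla/5.0 (Windows NT 6.3; Win64; x64) AppleWebKit/537.36 "
        "(KHTML, like Gecko) Chrome/80.0.3987.162 Safari/537.36"
    )
}

def fetch_html(url, headers=HEADERS, timeout=10):
    """Return the page HTML, raising requests.HTTPError on 4xx/5xx responses."""
    response = requests.get(url, headers=headers, timeout=timeout)
    response.raise_for_status()  # a 403 raises here instead of being parsed as content
    return response.text
```

Calling `fetch_html(url)` inside a `try`/`except requests.HTTPError` block lets the notebook report a blocked request instead of silently parsing the error page.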

In [9]:
for i in soup.find_all('h2'):
  print(i.text.strip())

In [11]:
import requests
from bs4 import BeautifulSoup

# Step 1: Website URL
url = "https://edu.gcfglobal.org/en/"  # Example site

# Step 2: Send HTTP Request
response = requests.get(url)

# Step 3: Parse the page
soup = BeautifulSoup(response.text, 'html.parser')

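# Step 4: Extract data (links with a given CSS class)
# NOTE: 'storylink' is a placeholder class name and may not exist on this page.
# If the loop below prints nothing, inspect the page's HTML and use a class that appears there.
headlines = soup.find_all('a', class_='storylink')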

# Step 5: Print the Headlines
for idx, headline in enumerate(headlines, start=1):
    print(f"{idx}. {headline.text}")

# **Part 5: Detailed Explanation of Code**

| Step | Code | Meaning |
| --- | --- | --- |
| Fetch HTML | `requests.get(url)` | Download the web page |
| Parse HTML | `BeautifulSoup(response.text, 'html.parser')` | Convert it into a searchable format |
| Find Elements | `soup.find_all('a', class_='storylink')` | Find news items |
| Print Data | `headline.text` | Extract and print text |

# **Part 6: Important Points to Remember**

✅ Always check website rules → Some sites don't allow scraping (check robots.txt).

✅ Don’t overload the website → Add delay between requests if scraping lots of pages.
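One way to add a delay between requests is sketched below. The helper name `fetch_all` and its `fetch` parameter are hypothetical; passing the fetch function in keeps the sketch runnable offline, and in a real run it could be something like `lambda url: requests.get(url).text`. The delay value is an arbitrary choice, not a recommendation from any particular site.

```python
import time

def fetch_all(urls, fetch, delay_seconds=1.0):
    """Call fetch(url) for each URL, pausing between consecutive requests."""
    pages = []
    for index, url in enumerate(urls):
        if index > 0:
            time.sleep(delay_seconds)  # pause before every request after the first
        pages.append(fetch(url))
    return pages

# Usage with a stand-in fetch function so the sketch runs without network access.
urls = [f"http://quotes.toscrape.com/page/{n}/" for n in range(1, 4)]
pages = fetch_all(urls, fetch=lambda url: f"<html>{url}</html>", delay_seconds=0.1)

print(len(pages))
```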




# Task
Extract 50 quotes and their authors from "http://quotes.toscrape.com/" using web scraping and store them in a pandas DataFrame.

## Initialize data structures

### Subtask:
Create empty lists to store the scraped quotes and authors.


**Reasoning**:
Initialize empty lists to store the quotes and authors that will be scraped from the website.



In [18]:
quotes = []
authors = []

## Loop through pages

### Subtask:
Iterate through multiple pages of the website to collect enough quotes (up to 50).


**Reasoning**:
Determine the number of pages needed and create a loop to iterate through them, dynamically constructing the URL for each page.



In [20]:
num_pages = 5  # Each page has 10 quotes, so 5 pages for 50 quotes
base_url = "http://quotes.toscrape.com/page/"

for page_num in range(1, num_pages + 1):
    url = f"{base_url}{page_num}/"
    print(f"Scraping page: {url}")

Scraping page: http://quotes.toscrape.com/page/1/
Scraping page: http://quotes.toscrape.com/page/2/
Scraping page: http://quotes.toscrape.com/page/3/
Scraping page: http://quotes.toscrape.com/page/4/
Scraping page: http://quotes.toscrape.com/page/5/


**Reasoning**:
The iteration through multiple pages is set up. The next step is to fetch the HTML content for each page within the loop.



In [22]:
num_pages = 5  # Each page has 10 quotes, so 5 pages for 50 quotes
base_url = "http://quotes.toscrape.com/page/"

for page_num in range(1, num_pages + 1):
    url = f"{base_url}{page_num}/"
    print(f"Scraping page: {url}")
    response = requests.get(url)
    print(f"Response status code: {response.status_code}")

Scraping page: http://quotes.toscrape.com/page/1/
Response status code: 200
Scraping page: http://quotes.toscrape.com/page/2/
Response status code: 200
Scraping page: http://quotes.toscrape.com/page/3/
Response status code: 200
Scraping page: http://quotes.toscrape.com/page/4/
Response status code: 200
Scraping page: http://quotes.toscrape.com/page/5/
Response status code: 200


## Parse html

### Subtask:
Use `BeautifulSoup` to parse the fetched HTML content.


**Reasoning**:
Parse the fetched HTML content using BeautifulSoup for easier navigation and data extraction within the loop.



In [24]:
num_pages = 5  # Each page has 10 quotes, so 5 pages for 50 quotes
base_url = "http://quotes.toscrape.com/page/"

for page_num in range(1, num_pages + 1):
    url = f"{base_url}{page_num}/"
    print(f"Scraping page: {url}")
    response = requests.get(url)
    print(f"Response status code: {response.status_code}")
    soup = BeautifulSoup(response.text, 'html.parser')

Scraping page: http://quotes.toscrape.com/page/1/
Response status code: 200
Scraping page: http://quotes.toscrape.com/page/2/
Response status code: 200
Scraping page: http://quotes.toscrape.com/page/3/
Response status code: 200
Scraping page: http://quotes.toscrape.com/page/4/
Response status code: 200
Scraping page: http://quotes.toscrape.com/page/5/
Response status code: 200


## Extract data

### Subtask:
Identify and extract the quotes and authors from the parsed HTML of each page.


**Reasoning**:
Identify and extract the quotes and authors from the parsed HTML of each page using BeautifulSoup's find_all method and append them to the respective lists.



In [26]:
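# Note: `quotes` and `authors` were created in an earlier cell and are not reset here,
# so running this cell more than once appends the same quotes again.
num_pages = 5  # Each page has 10 quotes, so 5 pages for 50 quotes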
base_url = "http://quotes.toscrape.com/page/"

for page_num in range(1, num_pages + 1):
    url = f"{base_url}{page_num}/"
    print(f"Scraping page: {url}")
    response = requests.get(url)
    print(f"Response status code: {response.status_code}")
    soup = BeautifulSoup(response.text, 'html.parser')

    # Find all quote elements on the page
    quote_elements = soup.find_all('div', class_='quote')

    # Extract quote text and author for each quote
    for quote_element in quote_elements:
        quote_text = quote_element.find('span', class_='text').text
        author_text = quote_element.find('small', class_='author').text
        quotes.append(quote_text)
        authors.append(author_text)

print(f"Total quotes scraped: {len(quotes)}")
print(f"Total authors scraped: {len(authors)}")

Scraping page: http://quotes.toscrape.com/page/1/
Response status code: 200
Scraping page: http://quotes.toscrape.com/page/2/
Response status code: 200
Scraping page: http://quotes.toscrape.com/page/3/
Response status code: 200
Scraping page: http://quotes.toscrape.com/page/4/
Response status code: 200
Scraping page: http://quotes.toscrape.com/page/5/
Response status code: 200
Total quotes scraped: 100
Total authors scraped: 100
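The count of 100 is double the 50 quotes that 5 pages of 10 should give, which is what happens when the lists carry over from a previous run. A minimal sketch of one way to avoid that is shown below: the results list is created inside the function on every call, and collection stops at a limit. The function names and the `sample_page` HTML are hypothetical; the snippet only mimics the tags and classes used in the cell above so that it runs offline.

```python
from bs4 import BeautifulSoup

def extract_quotes(html):
    """Return a list of (quote, author) pairs found in one page of HTML."""
    soup = BeautifulSoup(html, "html.parser")
    pairs = []
    for quote_element in soup.find_all("div", class_="quote"):
        text = quote_element.find("span", class_="text")
        author = quote_element.find("small", class_="author")
        if text is None or author is None:
            continue  # skip blocks that are missing either field
        pairs.append((text.text.strip(), author.text.strip()))
    return pairs

def collect_quotes(pages_html, limit=50):
    """Gather pairs from several pages into a fresh list, stopping at `limit`."""
    collected = []  # created on every call, so re-running never accumulates
    for html in pages_html:
        for pair in extract_quotes(html):
            if len(collected) == limit:
                return collected
            collected.append(pair)
    return collected

# Hand-written stand-in page (an assumption), so the sketch runs offline.
sample_page = """
<div class="quote"><span class="text">First quote</span><small class="author">Author A</small></div>
<div class="quote"><span class="text">Second quote</span><small class="author">Author B</small></div>
"""

result = collect_quotes([sample_page, sample_page], limit=3)
print(result)
```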


## Store data

### Subtask:
Append the extracted quotes and authors to the respective lists.


## Create dataframe

### Subtask:
Organize the extracted quotes and authors into a pandas DataFrame.


**Reasoning**:
Create a pandas DataFrame from the extracted quotes and authors lists.



In [28]:
df_quotes = pd.DataFrame({'Quote': quotes, 'Author': authors})
display(df_quotes.head())
display(df_quotes.tail())

| | Quote | Author |
| --- | --- | --- |
| 0 | “The world as we have created it is a process ... | Albert Einstein |
| 1 | “It is our choices, Harry, that show what we t... | J.K. Rowling |
| 2 | “There are only two ways to live your life. On... | Albert Einstein |
| 3 | “The person, be it gentleman or lady, who has ... | Jane Austen |
| 4 | “Imperfection is beauty, madness is genius and... | Marilyn Monroe |


| | Quote | Author |
| --- | --- | --- |
| 95 | “The real lover is the man who can thrill you ... | Marilyn Monroe |
| 96 | “A wise girl kisses but doesn't love, listens ... | Marilyn Monroe |
| 97 | “Only in the darkness can you see the stars.” | Martin Luther King Jr. |
| 98 | “It matters not what someone is born, but what... | J.K. Rowling |
| 99 | “Love does not begin and end the way we seem t... | James Baldwin |


## Display dataframe

### Subtask:
Display the created DataFrame.


**Reasoning**:
Display the created DataFrame `df_quotes` using the `display()` function as requested in the instructions.



In [30]:
display(df_quotes)

| | Quote | Author |
| --- | --- | --- |
| 0 | “The world as we have created it is a process ... | Albert Einstein |
| 1 | “It is our choices, Harry, that show what we t... | J.K. Rowling |
| 2 | “There are only two ways to live your life. On... | Albert Einstein |
| 3 | “The person, be it gentleman or lady, who has ... | Jane Austen |
| 4 | “Imperfection is beauty, madness is genius and... | Marilyn Monroe |
| ... | ... | ... |
| 95 | “The real lover is the man who can thrill you ... | Marilyn Monroe |
| 96 | “A wise girl kisses but doesn't love, listens ... | Marilyn Monroe |
| 97 | “Only in the darkness can you see the stars.” | Martin Luther King Jr. |
| 98 | “It matters not what someone is born, but what... | J.K. Rowling |


In [31]:
df_quotes.shape

(100, 2)

In [32]:
df_quotes.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100 entries, 0 to 99
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   Quote   100 non-null    object
 1   Author  100 non-null    object
dtypes: object(2)
memory usage: 1.7+ KB


In [33]:
df_quotes.describe()

| | Quote | Author |
| --- | --- | --- |
| count | 100 | 100 |
| unique | 50 | 28 |
| top | “The world as we have created it is a process ... | Albert Einstein |
| freq | 2 | 16 |


In [34]:
df_quotes.to_csv('quotes.csv',index=False)

## Summary:

### Data Analysis Key Findings

*   The scraping loop fetched 5 pages of the specified website (10 quotes each, 50 in total), but `df_quotes` ended up with 100 rows. `describe()` reports only 50 unique quotes, which suggests every quote appears twice, most likely because the extraction cell was run twice without resetting the `quotes` and `authors` lists.
*   The extracted data was successfully organized into a pandas DataFrame named `df_quotes` with two columns: 'Quote' and 'Author'.

### Insights or Next Steps

*   The task asks for 50 quotes, but the DataFrame holds 100 rows because of the duplicates. Resetting `quotes` and `authors` to empty lists at the start of the extraction cell, or calling `df_quotes.drop_duplicates()`, would leave the 50 unique quotes.
*   The DataFrame is ready for further analysis, such as exploring the most frequent authors or analyzing the sentiment of the quotes.
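The `describe()` output earlier shows 50 unique quotes among 100 rows, so duplicate rows are worth dropping before counting authors. The sketch below shows the idea on a small invented DataFrame (`df_demo` and its rows are made up for illustration); the same two calls should apply to `df_quotes`.

```python
import pandas as pd

# Tiny stand-in for df_quotes (invented rows), with every row repeated once.
df_demo = pd.DataFrame({
    "Quote": ["q1", "q2", "q3", "q1", "q2", "q3"],
    "Author": ["Author A", "Author B", "Author A", "Author A", "Author B", "Author A"],
})

# Drop exact duplicate rows and renumber the index.
df_unique = df_demo.drop_duplicates().reset_index(drop=True)

# Count how many quotes each author has, most frequent first.
author_counts = df_unique["Author"].value_counts()

print(len(df_unique))
print(author_counts.head())
```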
