
# Web Scraping Example: Fetching Computer Products Page

This notebook demonstrates how to use the `requests` library to fetch web pages
and `BeautifulSoup` (from the `bs4` package) to parse the HTML content.

We will:
1. Send an HTTP GET request to a test e-commerce site.
2. Parse the returned HTML content.
3. Inspect the HTML structure for further data extraction.
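
The fetch step above can be sketched a little more defensively. The helper below is a minimal illustration and not part of the original notebook: the name `fetch_html`, the `session` parameter, and the 10-second timeout are assumptions made for this example. It adds a timeout and calls `raise_for_status()` so that an HTTP error surfaces right away instead of being parsed as if it were the product page.

```python
import requests


def fetch_html(url, session=None, timeout=10.0):
    """Return the body of `url` as text, raising on HTTP errors.

    `fetch_html`, `session`, and `timeout` are illustrative names chosen
    for this sketch; they do not come from the notebook.
    """
    # Reuse a caller-supplied session if given, otherwise create one.
    if session is None:
        session = requests.Session()
    response = session.get(url, timeout=timeout)
    # Raises requests.HTTPError for 4xx/5xx responses.
    response.raise_for_status()
    return response.text


# Example call (performs a real network request, so it is left commented out):
# html = fetch_html("https://webscraper.io/test-sites/e-commerce/allinone/computers")
```

Accepting an optional session keeps the helper easy to exercise without network access, since any object with a compatible `get` method can be passed in.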


In [4]:
# Import required libraries
import requests  # For sending HTTP requests
from bs4 import BeautifulSoup  # For parsing HTML content

# Step 1: Specify the target URL
url = "https://webscraper.io/test-sites/e-commerce/allinone/computers"

# Step 2: Send a GET request to the URL
# The response object 'r' contains the server's reply to our request
r = requests.get(url)

# Optional: You can print the raw HTML to inspect (commented to avoid flooding output)
# print(r.text)

# Step 3: Parse the HTML content using BeautifulSoup
# The 'lxml' parser is fast, but it requires the lxml package to be installed
soup = BeautifulSoup(r.text, 'lxml')

# Display the parsed HTML object
soup


<!DOCTYPE html>
<html lang="en">
<head>
<!-- Google Tag Manager -->
<script nonce="e78D6tIDR1RmqdEfNNvYn3pIDfo1B09w">(function (w, d, s, l, i) {
		w[l] = w[l] || [];
		w[l].push({
			'gtm.start':
				new Date().getTime(), event: 'gtm.js'
		});
		var f = d.getElementsByTagName(s)[0],
			j = d.createElement(s), dl = l != 'dataLayer' ? '&l=' + l : '';
		j.async = true;
		j.src =
			'https://www.googletagmanager.com/gtm.js?id=' + i + dl;
		f.parentNode.insertBefore(j, f);
	})(window, document, 'script', 'dataLayer', 'GTM-NVFPDWB');</script>
<!-- End Google Tag Manager -->
<title>Allinone | Web Scraper Test Sites</title>
<meta charset="utf-8"/>
<meta content="IE=edge,chrome=1" http-equiv="X-UA-Compatible"/>
<meta content="web scraping,Web Scraper,Chrome extension,Crawling,Cross platform scraper" name="keywords"/>
<meta content="Test Web Scraper's features and performance on mock e-commerce sites. Extract product data, prices, and categories in a controlled environment." name="description"/>

# Web Scraping: Selecting a Specific HTML Element (`div`)

In this section, we:
1. Send an HTTP GET request to a test e-commerce site.
2. Parse the HTML using BeautifulSoup.
3. Access the **first `<div>` element** in the HTML document.
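
To see what "first `<div>`" means without fetching anything, here is a small sketch on inline HTML. The markup is made up for this example, and Python's built-in `html.parser` is used instead of `lxml` so that the snippet has no extra dependency. It shows that `soup.div` is shorthand for `soup.find("div")`, and that `find_all` returns every match, nested ones included.

```python
from bs4 import BeautifulSoup

# Inline HTML stands in for the downloaded page (an assumption for this sketch).
html = """
<html><body>
  <div class="container"><div class="navbar-header">Menu</div></div>
  <div class="content">Products</div>
</body></html>
"""

# html.parser ships with Python, so no extra parser package is needed here.
soup = BeautifulSoup(html, "html.parser")

first_div = soup.div             # attribute-style access: first <div> in document order
same_div = soup.find("div")      # equivalent explicit call
all_divs = soup.find_all("div")  # every <div>, including nested ones

print(first_div["class"])
print(first_div is same_div)
print(len(all_divs))
```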


In [5]:
# Import libraries
import requests  # For HTTP requests
from bs4 import BeautifulSoup  # For HTML parsing

# Step 1: Set the target URL
url = "https://webscraper.io/test-sites/e-commerce/allinone/computers"

# Step 2: Send GET request to fetch the web page
r = requests.get(url)

# Optional: Uncomment to see raw HTML (can be large)
# print(r.text)

# Step 3: Parse HTML using BeautifulSoup with the 'lxml' parser
soup = BeautifulSoup(r.text, 'lxml')

# Step 4: Access the first <div> element in the document
# soup.div returns the first <div> encountered in the HTML
soup.div


<div class="container">
<div class="navbar-header">
<a data-bs-target=".side-collapse" data-bs-target-2=".side-collapse-container" data-bs-toggle="collapse-side">
<button aria-controls="navbar" aria-expanded="false" class="navbar-toggler float-end collapsed" data-bs-target="#navbar" data-bs-target-2=".side-collapse-container" data-bs-target-3=".side-collapse" data-bs-toggle="collapse" type="button">
<span class="visually-hidden">Toggle navigation</span>
<span class="icon-bar top-bar"></span>
<span class="icon-bar middle-bar"></span>
<span class="icon-bar bottom-bar"></span>
<span class="icon-bar extra-bottom-bar"></span>
</button>
</a>
<div class="navbar-brand">
<a href="/"><img alt="Web Scraper" src="/img/logo_white.svg"/></a>
</div>
</div>
<div class="side-collapse in">
<nav class="navbar-collapse collapse" id="navbar" role="navigation">
<ul class="nav navbar-nav navbar-right">
<li class="nav-item">
<a class="nav-link menuitm" href="/">
<p>Web Scraper</p>
<div class="crta"></div>
</a

In [8]:
# Step 1: Set the target URL
url = "https://webscraper.io/test-sites/e-commerce/allinone/computers"

# Step 2: Send GET request
r = requests.get(url)

# Optional: Uncomment to print raw HTML
# print(r.text)

# Step 3: Parse the HTML content
soup = BeautifulSoup(r.text, 'lxml')

# Step 4: Select the <header> tag
tag = soup.header

# Step 5: Get the attributes of the <header> tag
atb = tag.attrs

# Step 6: Access the "class" attribute value
atb["class"]

['navbar',
 'fixed-top',
 'navbar-expand-lg',
 'navbar-dark',
 'navbar-static',
 'svg-background']
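
The list in the output above reflects how BeautifulSoup treats `class` as a multi-valued attribute. A minimal sketch on made-up inline HTML (the `id` and `role` attributes are assumptions for illustration) shows the difference between multi-valued and single-valued attributes, and how `.get()` avoids a `KeyError` for attributes that are absent.

```python
from bs4 import BeautifulSoup

# Hypothetical header markup, not copied from the test site.
html = '<header class="navbar fixed-top" id="top" role="banner"></header>'
soup = BeautifulSoup(html, "html.parser")
tag = soup.header

print(tag.attrs)         # dictionary of all attributes on the tag
print(tag["class"])      # multi-valued attribute: returned as a list
print(tag["id"])         # single-valued attribute: returned as a string
print(tag.get("style"))  # missing attribute: None instead of a KeyError
```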

# Web Scraping: Accessing Nested Tags and Strings

In this section, we:
1. Fetch the HTML content from the e-commerce test site.
2. Parse the HTML using BeautifulSoup.
3. Navigate through nested tags (`<div>` → `<p>`) to access text.
4. Use `.string` to directly extract the string content from the `<p>` tag.
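
One caveat about `.string` is worth a short sketch: it only returns text when the tag has a single child, and yields `None` otherwise. The inline HTML below is invented for this example; `get_text()` is shown as the usual fallback for tags that mix text and child tags.

```python
from bs4 import BeautifulSoup

# Made-up markup: one simple <p>, and one <p> that mixes text with a child tag.
html = """
<div><p>Web Scraper</p></div>
<div id="mixed"><p>Price: <b>$99</b></p></div>
"""
soup = BeautifulSoup(html, "html.parser")

simple = soup.div.p                     # first <p> inside the first <div>
mixed = soup.find("div", id="mixed").p  # <p> with two children

print(simple.string)     # the single text child
print(mixed.string)      # None, because the tag has more than one child
print(mixed.get_text())  # concatenated text of all descendants
```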


In [9]:
# Step 1: Set the target URL
url = "https://webscraper.io/test-sites/e-commerce/allinone/computers"

# Step 2: Send GET request to fetch the page
r = requests.get(url)

# Optional: Uncomment to print raw HTML
# print(r.text)

# Step 3: Parse HTML content
soup = BeautifulSoup(r.text, 'lxml')

# Step 4: Access nested tags
# Navigate: first <div> → its first <p> → .string to get the text inside <p>
text = soup.div.p.string

# Step 5: Display the extracted string
text

# Alternative approach (commented out):
# tag = soup.div.p        # Selects the <p> tag
# tag.string              # Retrieves the text from <p>

'Web Scraper'

In [13]:
# Step 1: Set the target URL (laptops category)
url = "https://webscraper.io/test-sites/e-commerce/allinone/computers/laptops"

# Step 2: Send GET request to fetch the HTML content
r = requests.get(url)

# Step 3: Parse HTML using BeautifulSoup with the 'lxml' parser
soup = BeautifulSoup(r.text, 'lxml')

# Optional: Uncomment to view the first <div> in the HTML
# print(soup.find("div"))

# Step 4: Find the first <h4> whose class attribute is exactly this string
# (same classes, same order)
print(soup.find('h4', {'class': 'price float-end card-title pull-right'}))

<h4 class="price float-end card-title pull-right" itemprop="offers" itemscope="" itemtype="https://schema.org/Offer">
<span itemprop="price">$295.99</span>
<meta content="USD" itemprop="priceCurrency"/>
</h4>
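
Matching on the full class string, as above, only succeeds when the classes appear in exactly that order. A hedged sketch on made-up HTML compares it with two looser alternatives: matching a single class name, and using a CSS selector. The second `<h4>` is invented to show the difference.

```python
from bs4 import BeautifulSoup

# Two hypothetical price tags with the classes in different combinations.
html = """
<h4 class="price float-end card-title pull-right"><span>$295.99</span></h4>
<h4 class="card-title price"><span>$299</span></h4>
"""
soup = BeautifulSoup(html, "html.parser")

# Exact attribute string: matches only when the classes appear in this order.
exact = soup.find_all("h4", class_="price float-end card-title pull-right")

# Single class name: matches any <h4> whose class list contains "price".
by_class = soup.find_all("h4", class_="price")

# CSS selector: the same single-class match, written as a selector.
by_css = soup.select("h4.price")

print(len(exact), len(by_class), len(by_css))
```

Matching on one stable class is usually less brittle if the site later adds or reorders its styling classes.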


# Web Scraping: Extracting All Product Prices

In this section, we:
1. Locate **all `<h4>` tags** with the class corresponding to product prices.
2. Iterate through the results.
3. Print the text content of each tag, which represents the product price.
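
Once the price text has been printed, a natural next step is turning it into numbers. The sketch below uses only the standard library; `parse_price` is a hypothetical helper, and the sample strings imitate the whitespace-padded tag text rather than being fetched from the site.

```python
from decimal import Decimal

# Sample strings shaped like the tag text in this notebook (whitespace included).
raw_prices = ["\n$295.99\n\n", "\n$299\n\n", "\n$1033.99\n\n"]


def parse_price(text):
    """Convert a string such as '$1,033.99' to a Decimal (hypothetical helper)."""
    # Drop surrounding whitespace, the leading dollar sign, and thousands separators.
    cleaned = text.strip().lstrip("$").replace(",", "")
    return Decimal(cleaned)


prices_as_numbers = [parse_price(p) for p in raw_prices]
print(prices_as_numbers)
print(max(prices_as_numbers))
```

`Decimal` is used here rather than `float` so that currency values keep their exact decimal representation.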


In [16]:
# Step 1: Find all <h4> tags with the price-related class
prices = soup.find_all('h4', class_='price float-end card-title pull-right')

# Step 2: Loop through each tag and print its text (the price)
# .text keeps the whitespace around the price, hence the blank lines in the output
for price_tag in prices:
    print(price_tag.text)


$295.99



$299



$299



$306.99



$321.94



$356.49



$364.46



$372.7



$379.94



$379.95



$391.48



$393.88



$399



$399.99



$404.23



$408.98



$409.63



$410.46



$410.66



$416.99



$433.3



$436.29



$436.29



$439.73



$454.62



$454.73



$457.38



$465.95



$468.56



$469.1



$484.23



$485.9



$487.8



$488.64



$488.78



$494.71



$497.17



$498.23



$520.99



$564.98



$577.99



$581.99



$609.99



$679



$679



$729



$739.99



$745.99



$799



$809



$899



$999



$1033.99



$1096.02



$1098.42



$1099



$1099



$1101.83



$1102.66



$1110.14



$1112.91



$1114.55



$1123.87



$1123.87



$1124.2



$1133.82



$1133.91



$1139.54



$1140.62



$1143.4



$1144.2



$1144.4



$1149



$1149



$1149.73



$1154.04



$1170.1



$1178.19



$1178.99



$1179



$1187.88



$1187.98



$1199



$1199



$1199.73



$1203.41



$1212.16



$1221.58



$1223.99



$1235.49



$1238.37



$1239.2



$1244.99


In [17]:
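# Find all <p> tags whose class list includes 'description'
desc = soup.find_all('p', class_='description')

# Display the fourth match (indexing is zero-based)
desc[3]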

<p class="description card-text" itemprop="description">15.6", Pentium N3520 2.16GHz, 4GB, 500GB, Linux</p>

# Web Scraping: Searching Text with Regular Expressions

In this section, we:
1. Fetch the HTML content from the tablets category page.
2. Parse the HTML using BeautifulSoup.
3. Use a **regular expression** with `.find_all(string=...)` to find text strings that match a pattern (e.g., any string containing "Lenovo").
4. Store and display the results.
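
A detail that is easy to miss: searching with `string=` returns the matching text nodes, not the tags around them. The sketch below runs on made-up inline HTML (not the live tablets page) and shows a case-insensitive search plus `.parent` to get from each matched string back to its enclosing tag.

```python
import re

from bs4 import BeautifulSoup

# Hypothetical snippet shaped like the product titles on the page.
html = """
<a class="title">Lenovo IdeaTab</a>
<a class="title">Asus MeMO Pad</a>
<p class="description">7" screen, Android</p>
"""
soup = BeautifulSoup(html, "html.parser")

# string= searches text nodes, so the results are strings, not tags.
matches = soup.find_all(string=re.compile("asus", re.IGNORECASE))
print(matches)

# .parent walks from each matched string back to its enclosing tag.
parent_tags = [m.parent for m in matches]
print([t.name for t in parent_tags])
```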


In [33]:
import re

# Step 1: Set the target URL (tablets category)
url = 'https://webscraper.io/test-sites/e-commerce/allinone/computers/tablets'

# Step 2: Send GET request to fetch the page
r = requests.get(url)

# Step 3: Parse HTML using BeautifulSoup
soup = BeautifulSoup(r.text, 'lxml')

# Step 4: Use a regular expression to find all strings containing "Lenovo"
# A compiled pattern is applied with search(), so it matches anywhere in the string
data = soup.find_all(string=re.compile('Lenovo'))

# Step 5: Display the matched results
data

[]

In [34]:
# Find all strings containing "Asus" using a regular expression
data = soup.find_all(string=re.compile('Asus'))

# Print the number of matches found
print(len(data))

1


In [40]:
# Find all product names
names = soup.find_all('a', class_='title')
# print(names)

product_name = []
for tag in names:
    product_name.append(tag.text)  # Extract text from <a> tag

print(product_name)

# Find all product prices
prices = soup.find_all('h4', class_='price float-end card-title pull-right')

prices_list = []
for i in prices:
    price = i.text  # Extract text from <h4> tag
    prices_list.append(price)

print(prices_list)

# Find all product descriptions
desc = soup.find_all('p', class_='description')
desc_list = []
for i in desc:
    des = i.text  # Extract text from <p> tag
    desc_list.append(des)

print(desc_list)

# Find all review counts
reviews = soup.find_all('p', class_='review-count float-end')
reviews_list = []
for i in reviews:
    rev = i.text  # Extract text from <p> tag
    reviews_list.append(rev)

print(reviews_list)

['\n\t\t\t\t\t\tLenovo IdeaTab\n\t\t\t\t\t', '\n\t\t\t\t\t\tIdeaTab A3500L\n\t\t\t\t\t', '\n\t\t\t\t\t\tAcer Iconia\n\t\t\t\t\t', '\n\t\t\t\t\t\tGalaxy Tab 3\n\t\t\t\t\t', '\n\t\t\t\t\t\tIconia B1-730H...\n\t\t\t\t\t', '\n\t\t\t\t\t\tMemo Pad HD 7\n\t\t\t\t\t', '\n\t\t\t\t\t\tAsus MeMO Pad\n\t\t\t\t\t', '\n\t\t\t\t\t\tAmazon Kindle\n\t\t\t\t\t', '\n\t\t\t\t\t\tGalaxy Tab 3\n\t\t\t\t\t', '\n\t\t\t\t\t\tIdeaTab A8-50\n\t\t\t\t\t', '\n\t\t\t\t\t\tMeMO Pad 7\n\t\t\t\t\t', '\n\t\t\t\t\t\tIdeaTab A3500-...\n\t\t\t\t\t', '\n\t\t\t\t\t\tIdeaTab S5000\n\t\t\t\t\t', '\n\t\t\t\t\t\tGalaxy Tab 4\n\t\t\t\t\t', '\n\t\t\t\t\t\tGalaxy Tab\n\t\t\t\t\t', '\n\t\t\t\t\t\tMeMo PAD FHD 1...\n\t\t\t\t\t', '\n\t\t\t\t\t\tGalaxy Note\n\t\t\t\t\t', '\n\t\t\t\t\t\tGalaxy Note\n\t\t\t\t\t', '\n\t\t\t\t\t\tiPad Mini Reti...\n\t\t\t\t\t', '\n\t\t\t\t\t\tGalaxy Note 10...\n\t\t\t\t\t', '\n\t\t\t\t\t\tApple iPad Air\n\t\t\t\t\t']
['\n$69.99\n\n', '\n$88.99\n\n', '\n$96.99\n\n', '\n$97.99\n\n', '\n$99.99\n\n', '\n$101
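
The tabs and newlines in the output above come from the indentation in the page source. As an alternative to cleaning in a second pass, `get_text(strip=True)` removes that whitespace at extraction time. This is a minimal sketch on invented HTML that imitates the indented `<a class="title">` tags.

```python
from bs4 import BeautifulSoup

# Made-up markup with the same kind of indentation as the real titles.
html = """
<a class="title">
        Lenovo IdeaTab
</a>
<a class="title">
        Acer Iconia
</a>
"""
soup = BeautifulSoup(html, "html.parser")
links = soup.find_all("a", class_="title")

raw = [a.text for a in links]                    # whitespace preserved
clean = [a.get_text(strip=True) for a in links]  # whitespace removed per string

print(raw)
print(clean)
```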

In [41]:
# Clean product names
product_name = [name.strip() for name in product_name]
print(product_name)

# Clean prices
prices_list = [price.strip() for price in prices_list]
print(prices_list)

# Clean descriptions
desc_list = [des.strip() for des in desc_list]
print(desc_list)

# Clean review counts
reviews_list = [rev.strip() for rev in reviews_list]
print(reviews_list)


['Lenovo IdeaTab', 'IdeaTab A3500L', 'Acer Iconia', 'Galaxy Tab 3', 'Iconia B1-730H...', 'Memo Pad HD 7', 'Asus MeMO Pad', 'Amazon Kindle', 'Galaxy Tab 3', 'IdeaTab A8-50', 'MeMO Pad 7', 'IdeaTab A3500-...', 'IdeaTab S5000', 'Galaxy Tab 4', 'Galaxy Tab', 'MeMo PAD FHD 1...', 'Galaxy Note', 'Galaxy Note', 'iPad Mini Reti...', 'Galaxy Note 10...', 'Apple iPad Air']
['$69.99', '$88.99', '$96.99', '$97.99', '$99.99', '$101.99', '$102.99', '$103.99', '$107.99', '$121.99', '$130.99', '$148.99', '$172.99', '$233.99', '$251.99', '$320.99', '$399.99', '$489.99', '$537.99', '$587.99', '$603.99']
['7" screen, Android', 'Black, 7" IPS, Quad-Core 1.2GHz, 8GB, Android 4.2', '7" screen, Android, 16GB', '7", 8GB, Wi-Fi, Android 4.2, White', 'Black, 7", 1.6GHz Dual-Core, 8GB, Android 4.4', 'IPS, Dual-Core 1.2GHz, 8GB, Android 4.3', '7" screen, Android, 8GB', '6" screen, wifi', '7", 8GB, Wi-Fi, Android 4.2, Yellow', 'Blue, 8" IPS, Quad-Core 1.3GHz, 16GB, Android 4.2', 'White, 7", Atom 1.2GHz, 8GB, Andro

In [44]:
import pandas as pd

# Create a DataFrame from the scraped data
df = pd.DataFrame({"Product Name": product_name,
                  'Prices' : prices_list,
                  'Description': desc_list,
                  'Reviews': reviews_list})

df

| | Product Name | Prices | Description | Reviews |
|---|---|---|---|---|
| 0 | Lenovo IdeaTab | $69.99 | 7" screen, Android | 7 reviews |
| 1 | IdeaTab A3500L | $88.99 | Black, 7" IPS, Quad-Core 1.2GHz, 8GB, Android 4.2 | 7 reviews |
| 2 | Acer Iconia | $96.99 | 7" screen, Android, 16GB | 7 reviews |
| 3 | Galaxy Tab 3 | $97.99 | 7", 8GB, Wi-Fi, Android 4.2, White | 2 reviews |
| 4 | Iconia B1-730H... | $99.99 | Black, 7", 1.6GHz Dual-Core, 8GB, Android 4.4 | 1 reviews |
| 5 | Memo Pad HD 7 | $101.99 | IPS, Dual-Core 1.2GHz, 8GB, Android 4.3 | 10 reviews |
| 6 | Asus MeMO Pad | $102.99 | 7" screen, Android, 8GB | 14 reviews |
| 7 | Amazon Kindle | $103.99 | 6" screen, wifi | 3 reviews |
| 8 | Galaxy Tab 3 | $107.99 | 7", 8GB, Wi-Fi, Android 4.2, Yellow | 14 reviews |
| 9 | IdeaTab A8-50 | $121.99 | Blue, 8" IPS, Quad-Core 1.3GHz, 16GB, Android 4.2 | 13 reviews |
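
The `Prices` and `Reviews` columns in this table are still text. A possible follow-up, sketched here on a three-row stand-in DataFrame, converts them to numeric columns. The new column names `Price (USD)` and `Review Count` are assumptions for this example.

```python
import pandas as pd

# Small stand-in for the scraped DataFrame (values copied from the table above).
df_sample = pd.DataFrame({
    "Prices": ["$69.99", "$88.99", "$97.99"],
    "Reviews": ["7 reviews", "7 reviews", "2 reviews"],
})

# Strip the currency symbol literally (regex=False), then convert to float.
df_sample["Price (USD)"] = (
    df_sample["Prices"].str.replace("$", "", regex=False).astype(float)
)

# Pull the leading integer out of strings such as "7 reviews".
df_sample["Review Count"] = (
    df_sample["Reviews"].str.extract(r"(\d+)", expand=False).astype(int)
)

print(df_sample.dtypes)
```

With numeric columns in place, sorting, filtering, and aggregation behave as expected instead of comparing strings.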


In [45]:
# Save to a CSV file (the DataFrame index is written as the first column;
# pass index=False to omit it)
df.to_csv('tablet_prices.csv')
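
By default `to_csv` also writes the DataFrame index, which shows up as an `Unnamed: 0` column when the file is read back. The sketch below illustrates this with in-memory buffers rather than files; the one-row DataFrame is a stand-in for the scraped data.

```python
import io

import pandas as pd

# One-row stand-in for the scraped DataFrame.
df_small = pd.DataFrame({"Product Name": ["Lenovo IdeaTab"], "Prices": ["$69.99"]})

# Default behaviour: the index is written as an unnamed first column.
with_index = io.StringIO()
df_small.to_csv(with_index)

# index=False leaves the index out of the file.
without_index = io.StringIO()
df_small.to_csv(without_index, index=False)

header_with = with_index.getvalue().splitlines()[0]
header_without = without_index.getvalue().splitlines()[0]
print(header_with)
print(header_without)

# Reading the first version back shows the extra "Unnamed: 0" column.
with_index.seek(0)
columns_read_back = pd.read_csv(with_index).columns.tolist()
print(columns_read_back)
```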