*Web scraping* is the process of extracting data from websites. It involves fetching the content of web pages and parsing it to extract useful information, such as text, images, or specific elements like product prices, news headlines, or user reviews.

### Common Uses of Web Scraping
- Collecting product prices and reviews from e-commerce websites.
- Aggregating news from various news portals.
- Extracting data for market research or competitive analysis.
- Building datasets for machine learning or research.

### The Web Scraping Process
1. *Identify the Website and Data*: Start by deciding what website you want to scrape and which specific data you need. For example, scraping the price and rating of products from an e-commerce site.

2. *Inspect the Web Page*: Use the browser's developer tools (right-click > Inspect) to analyze the HTML structure of the web page and identify the tags, classes, and attributes where the target data is located.

3. *Choose the Tools/Libraries*: Common Python libraries for web scraping include:
   - *Requests*: For sending HTTP requests to get the raw HTML content.
   - *BeautifulSoup*: For parsing HTML and XML documents, allowing easy navigation and extraction of elements.
   - *Selenium*: For scraping dynamic content rendered by JavaScript, as it can simulate user interactions like clicking buttons and filling out forms.
   - *Scrapy*: A more advanced and powerful web scraping framework used for large-scale scraping.

4. *Send a Request to the Web Page*: Use the requests library to send an HTTP request and retrieve the HTML content of the page.

5. *Parse the HTML Content*: Use BeautifulSoup to parse the HTML and locate the elements containing the desired data based on tags, classes, or IDs.

6. *Extract and Process the Data*: Navigate through the parsed HTML to extract and store the data in a structured format (e.g., CSV, JSON, or database).

7. *Handle Dynamic Content (Optional)*: If the website uses JavaScript to load content dynamically, you might need to use Selenium or the requests-html library to wait for the page to load fully and interact with elements.

8. *Store the Data*: Save the extracted data in a file or database for further analysis.

### Example: Basic Web Scraping with Python

Here’s a simple example of scraping product titles from an e-commerce website using Python with requests and BeautifulSoup:

python
import requests
from bs4 import BeautifulSoup

# Step 1: Send a request to the web page
url = "https://example.com/products"
response = requests.get(url)

# Step 2: Parse the HTML content using BeautifulSoup
soup = BeautifulSoup(response.content, "html.parser")

# Step 3: Find and extract the relevant data
product_titles = soup.find_all("h2", class_="product-title")

# Step 4: Print the extracted product titles
for title in product_titles:
    print(title.get_text())


### Handling Ethical Issues and Legal Considerations
- *Respect Robots.txt*: Always check the website’s robots.txt file, which specifies which parts of the site can be crawled or scraped.
- *Avoid Overloading Servers*: Implement delays between requests to avoid overwhelming the server (e.g., using time.sleep()).
- *Terms of Service*: Be mindful of the website’s terms and conditions, as some websites explicitly forbid scraping.

### Advanced Considerations
- *Pagination*: If the data spans multiple pages, your scraper should handle pagination by following the “next” links.
- *Authentication*: Some websites require you to log in or pass through CAPTCHA challenges. Selenium can help bypass these.

### Summary
Web scraping is a powerful technique for extracting information from websites. The process involves sending an HTTP request, parsing HTML, and extracting specific data. Python provides several libraries like requests, BeautifulSoup, and Selenium to perform these tasks effectively.

In [1]:
pip install requests

Note: you may need to restart the kernel to use updated packages.


In [2]:
pip install selenium

Collecting selenium
  Downloading selenium-4.24.0-py3-none-any.whl (9.6 MB)
     ---------------------------------------- 9.6/9.6 MB 368.7 kB/s eta 0:00:00
Collecting typing_extensions~=4.9
  Downloading typing_extensions-4.12.2-py3-none-any.whl (37 kB)
Collecting trio-websocket~=0.9
  Downloading trio_websocket-0.11.1-py3-none-any.whl (17 kB)
Collecting trio~=0.17
  Downloading trio-0.26.2-py3-none-any.whl (475 kB)
     ------------------------------------ 476.0/476.0 kB 426.1 kB/s eta 0:00:00
Collecting websocket-client~=1.8
  Downloading websocket_client-1.8.0-py3-none-any.whl (58 kB)
     -------------------------------------- 58.8/58.8 kB 445.7 kB/s eta 0:00:00
Collecting sniffio>=1.3.0
  Downloading sniffio-1.3.1-py3-none-any.whl (10 kB)
Collecting outcome
  Downloading outcome-1.3.0.post0-py2.py3-none-any.whl (10 kB)
Collecting exceptiongroup
  Downloading exceptiongroup-1.2.2-py3-none-any.whl (16 kB)
Collecting attrs>=23.2.0
  Downloading attrs-24.2.0-py3-none-any.whl (63 kB)
 

In [3]:
pip install beautifulsoup4

Note: you may need to restart the kernel to use updated packages.


In [1]:
import numpy as np # for arithmetic operations
import pandas as pd # data manipulation
import requests # getting access to the web page
from bs4 import BeautifulSoup as bs # parsing the data extracted from the webpage
import selenium # getting access and parsing data from webpage

In [2]:
url ="https://www.jumia.com.ng/catalog/?q=clothes"
response= requests.get(url)

In [3]:
html_content=response.content

In [4]:
soup=bs(html_content)

In [5]:
single_product=soup.find("div", class_="info") # find one product listed
single_product

<div class="info"><h3 class="name">Kingskartel Kings-Kartel Stylish Sweatshirt &amp; Joggers Pant (Brown &amp; Black)</h3><div class="prc">₦ 19,003 - ₦ 19,999</div><div class="s-prc-w"><div class="old">₦ 25,000</div><div class="bdg _dsct _sm">24%</div></div><div class="rev"><div class="stars _s">3.9 out of 5<div class="in" style="width:78%"></div></div>(20)</div></div>

In [6]:
name=single_product.find("h3", class_="name").text
name

'Kingskartel Kings-Kartel Stylish Sweatshirt & Joggers Pant (Brown & Black)'

In [7]:
price= single_product.find("div", class_="prc").text
price

'₦ 19,003 - ₦ 19,999'

In [8]:
old_price= single_product.find("div", class_="old").text
old_price

'₦ 25,000'

In [9]:
discount= single_product.find("div", class_="bdg _dsct _sm").text
discount

'24%'

In [10]:
no_ppl_rating= single_product.find("div", class_="rev").text
no_ppl_rating

'3.9 out of 5(20)'

In [11]:
rating= single_product.find("div", class_="rev").text.split()[0]
rating

'3.9'

In [12]:
no_ppl_rating2= single_product.find("div", class_="rev").text.split("(")[-1].strip(")")
no_ppl_rating2

'20'

In [16]:
dic = {"product_name":[],
      "product_price":[],
      "old_price":[],
      "discount_percentage":[],
      "rating":[],
      "number_of_people_rated":[]}
dic["product_name"].append(name)
dic["product_price"].append(price)
dic["discount_percentage"].append(discount)
dic["old_price"].append(old_price)
dic["number_of_people_rated"].append(no_ppl_rating2)
dic["rating"].append(rating)
df=pd.DataFrame(dic, index=[0])
df

Unnamed: 0,product_name,product_price,old_price,discount_percentage,rating,number_of_people_rated
0,Kingskartel Kings-Kartel Stylish Sweatshirt & ...,"₦ 19,003 - ₦ 19,999","₦ 25,000",24%,3.9,20


In [36]:
dic = {"product_name":[],
      "product_price":[],
      "old_price":[],
      "discount_percentage":[],
      "rating":[],
      "number_of_people_rated":[]}

all_product=soup.findAll("div", class_="info") # returns list of all the products listed
for i in all_product:
    try:
        dic["discount_percentage"].append(i.find("div", class_="bdg _dsct _sm").text)
    except:
        dic["discount_percentage"].append(np.nan)
    try:
        dic["product_name"].append(i.find("h3", class_="name").text)
    except:
        dic["product_name"].append(np.nan)
    try:
        dic["product_price"].append(i.find("div", class_="prc").text)
    except:
        dic["product_price"].append(np.nan)
    try:
        dic["old_price"].append(i.find("div", class_="old").text)
    except:
        dic["old_price"].append(np.nan)
    try:
        dic["rating"].append(i.find("div", class_="rev").text.split()[0])
    except:
        dic["rating"].append(np.nan)
    try:
        dic["number_of_people_rated"].append(i.find("div", class_="rev").text.split("(")[-1].strip(")"))
    except:
        dic["number_of_people_rated"].append(np.nan)

df=pd.DataFrame(dic)
df

Unnamed: 0,product_name,product_price,old_price,discount_percentage,rating,number_of_people_rated
0,Men's Shorts And Short Sleeves Suit Outdoor Sp...,"₦ 42,640","₦ 50,840",16%,,
1,Men's Casual Suits Hoodies And Trousers Sports...,"₦ 50,840","₦ 59,040",14%,,
2,Loose Waffle Casual Suit Men Loose Shortsleeve...,"₦ 40,305","₦ 40,334",1%,,
3,Men'S Sports Suit Fashion Casual Lightweight Q...,"₦ 32,497","₦ 36,639",11%,,
4,Loose Waffle Casual Suit Men Loose Shortsleeve...,"₦ 40,305","₦ 40,334",1%,,
5,Summer Men Shorts Set Matching Shirts Letter S...,"₦ 50,774","₦ 74,750",32%,,
6,Summer Men Shorts Set Matching Shirts Letter S...,"₦ 50,801","₦ 74,777",32%,,
7,Trousers Men's 2in1 Short Sleeved Tshirt And P...,"₦ 56,957","₦ 96,876",41%,,
8,Men Sets 2 Pieces Tracksuit Summer Short Sleev...,"₦ 52,934","₦ 53,001",1%,,
9,Couple Shorts Men's And Women's Tide Tshirt Fa...,"₦ 66,731","₦ 126,792",47%,,


In [37]:
page =1
dic = {"product_name":[],
      "product_price":[],
      "old_price":[],
      "discount_percentage":[],
      "rating":[],
      "number_of_people_rated":[]}
while page<=50:
    url= f"https://www.jumia.com.ng/mens-clothing-bundles/?page={page}#catalog-listing"
    response= requests.get(url)
    html_content=response.content
    soup=bs(html_content)
    all_products=soup.findAll("div", class_="info") # returns list of all products listed
    for i in all_product:
        try:
            dic["discount_percentage"].append(i.find("div", class_="bdg _dsct _sm").text)
        except:
            dic["discount_percentage"].append(np.nan)
        try:
            dic["product_name"].append(i.find("h3", class_="name").text)
        except:
            dic["product_name"].append(np.nan)
        try:
            dic["product_price"].append(i.find("div", class_="prc").text)
        except:
            dic["product_price"].append(np.nan)
        try:
            dic["old_price"].append(i.find("div", class_="old").text)
        except:
            dic["old_price"].append(np.nan)
        try:
            dic["rating"].append(i.find("div", class_="rev").text.split()[0])
        except:
            dic["rating"].append(np.nan)
        try:
            dic["number_of_people_rated"].append(i.find("div", class_="rev").text.split("(")[-1].strip(")"))
        except:
            dic["number_of_people_rated"].append(np.nan)
    page+=1
df=pd.DataFrame(dic)
df

Unnamed: 0,product_name,product_price,old_price,discount_percentage,rating,number_of_people_rated
0,Men's Shorts And Short Sleeves Suit Outdoor Sp...,"₦ 42,640","₦ 50,840",16%,,
1,Men's Casual Suits Hoodies And Trousers Sports...,"₦ 50,840","₦ 59,040",14%,,
2,Loose Waffle Casual Suit Men Loose Shortsleeve...,"₦ 40,305","₦ 40,334",1%,,
3,Men'S Sports Suit Fashion Casual Lightweight Q...,"₦ 32,497","₦ 36,639",11%,,
4,Loose Waffle Casual Suit Men Loose Shortsleeve...,"₦ 40,305","₦ 40,334",1%,,
...,...,...,...,...,...,...
1995,Men's casual suits Sports youth hooded sweatsh...,"₦ 49,200","₦ 57,400",14%,,
1996,2Pcs Portable Wardrobe Closet Storage Organize...,"₦ 50,000",,,,
1997,Men Sets 2 Pieces Tracksuit Summer Short Sleev...,"₦ 52,056","₦ 52,092",1%,,
1998,Cool The Wolf 3D Printed T_ShirtSuit Summer Sh...,"₦ 47,817","₦ 73,670",35%,,


In [40]:
df.to_csv("clothing.csv", index=False)

## Time when request is not working

In [40]:
url="https://www.amazon.com/s?k=clothes&crid=3VUXXM9I6LGFR&sprefix=clothes%2Caps%2C298&ref=nb_sb_noss_1"

In [42]:
res = requests.get(url)

In [43]:
res.content

b'<!--\n        To discuss automated access to Amazon data please contact api-services-support@amazon.com.\n        For information about migrating to our APIs refer to our Marketplace APIs at https://developer.amazonservices.com/ref=rm_5_sv, or our Product Advertising API at https://affiliate-program.amazon.com/gp/advertising/api/detail/main.html/ref=rm_5_ac for advertising use cases.\n-->\n<!doctype html>\n<html>\n<head>\n  <meta charset="utf-8">\n  <meta http-equiv="x-ua-compatible" content="ie=edge">\n  <meta name="viewport" content="width=device-width, initial-scale=1, shrink-to-fit=no">\n  <title>Sorry! Something went wrong!</title>\n  <style>\n  html, body {\n    padding: 0;\n    margin: 0\n  }\n\n  img {\n    border: 0\n  }\n\n  #a {\n    background: #232f3e;\n    padding: 11px 11px 11px 192px\n  }\n\n  #b {\n    position: absolute;\n    left: 22px;\n    top: 12px\n  }\n\n  #c {\n    position: relative;\n    max-width: 800px;\n    padding: 0 40px 0 0\n  }\n\n  #e, #f {\n    hei

## extracting info using selenium

In [41]:
import selenium
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.chrome.options import Options
from selenium.webdriver import Chrome

driver_path = "C:\\Users\\BlessedREI\\Desktop\\Data Science Course\\software\\data class web scrapping\\chromedriver.exe"
opt = Options() ## opt=Options("")
ser = Service(driver_path)
driver = Chrome(options=opt, service=ser)

In [42]:
url="https://www.amazon.com/s?k=clothes&crid=3VUXXM9I6LGFR&sprefix=clothes%2Caps%2C298&ref=nb_sb_noss_1"
res= driver.get(url)
html= driver.page_source

In [43]:
sp=bs(html)

In [44]:
sp

<html class="a-js a-audio a-video a-canvas a-svg a-drag-drop a-geolocation a-history a-webworker a-autofocus a-input-placeholder a-textarea-placeholder a-local-storage a-gradients a-hires a-transform3d a-touch-scrolling a-text-shadow a-text-stroke a-box-shadow a-border-radius a-border-image a-opacity a-transform a-transition null" data-19ax5a9jf="dingo" data-aui-build-date="3.24.7-2024-09-02" lang="en-us"><!-- sp:feature:head-start --><head><script async="" crossorigin="anonymous" src="https://c.amazon-adsystem.com/bao-csm/forensics/a9-tq-forensics-incremental.min.js"></script><script async="" crossorigin="anonymous" src="https://images-na.ssl-images-amazon.com/images/I/31bJewCvY-L.js"></script><script>var aPageStart = (new Date()).getTime();</script><meta charset="utf-8"/>
<!-- sp:end-feature:head-start -->
<!-- sp:feature:csm:head-open-part1 -->
<script type="text/javascript">var ue_t0=ue_t0||+new Date();</script>
<!-- sp:end-feature:csm:head-open-part1 -->
<!-- sp:feature:cs-optimiz

In [102]:
sp.find("div", class_="a-section a-spacing-small puis-padding-left-small puis-padding-right-small")

<div class="a-section a-spacing-small puis-padding-left-small puis-padding-right-small"><div class="a-section a-spacing-none a-spacing-top-small s-title-instructions-style" data-cy="title-recipe"><div class="a-row a-spacing-micro"><span class="a-declarative" data-a-popover='{"activate":"onmouseover","position":"triggerVertical","inlineContent":"You are seeing this product from an Amazon brand based on the product\u2019s relevance to your search query.","closeButton":"true","dataStrategy":"inline"}' data-action="a-popover" data-csa-c-func-deps="aui-da-a-popover" data-csa-c-id="xlcyqc-81t6fe-9lyzjm-oyovv" data-csa-c-type="widget" data-render-id="rr2i7sdyxcj7r2kvvyppqhjhqo" data-version-id="v1vcm3c30a2jx02lrsa34oun9uo"><a class="puis-label-popover puis-sponsored-label-text" href="javascript:void(0)" role="button" style="text-decoration: none;"><span class="puis-label-popover-default"><span class="a-size-micro a-color-secondary">Featured from Amazon brands</span></span><span class="puis-la

In [51]:
name=sp.find("a", class_="a-link-normal s-underline-text s-underline-link-text s-link-style a-text-normal").text
name

"Amazon Essentials Men's Slim-Fit Long-Sleeve Henley Shirt "

In [111]:
review=sp.find("span", class_="a-icon-alt").text
review

'4.3 out of 5 stars'

In [101]:
no_ppl_review=sp.find("span", class_="a-size-base s-underline-text").text
no_ppl_review

'11,289'

In [98]:
price=sp.find("span", class_="a-price-whole").text + sp.find("span", class_="a-price-fraction").text
price

'16.10'

In [108]:
old_price=sp.find("span", class_="a-offscreen").text
old_price

'$16.10'

In [116]:
coupon=sp.find("div", class_="a-row a-size-small a-color-secondary").text
coupon

'8% off coupon appliedSave 8%  with coupon (some sizes/colors) '

In [67]:
delivery=sp.find("div", class_="a-row a-size-base a-color-secondary s-align-children-center").text
delivery

'Delivery Tue, Sep 24 '

In [92]:
shipping_options=sp.find("span", class_="a-size-small a-color-base").text
shipping_options

'Ships to Nigeria'

In [105]:
all_product=sp.findAll("div", class_="a-section a-spacing-small puis-padding-left-small puis-padding-right-small")
all_product

[<div class="a-section a-spacing-small puis-padding-left-small puis-padding-right-small"><div class="a-section a-spacing-none a-spacing-top-small s-title-instructions-style" data-cy="title-recipe"><div class="a-row a-spacing-micro"><span class="a-declarative" data-a-popover='{"activate":"onmouseover","position":"triggerVertical","inlineContent":"You are seeing this product from an Amazon brand based on the product\u2019s relevance to your search query.","closeButton":"true","dataStrategy":"inline"}' data-action="a-popover" data-csa-c-func-deps="aui-da-a-popover" data-csa-c-id="xlcyqc-81t6fe-9lyzjm-oyovv" data-csa-c-type="widget" data-render-id="rr2i7sdyxcj7r2kvvyppqhjhqo" data-version-id="v1vcm3c30a2jx02lrsa34oun9uo"><a class="puis-label-popover puis-sponsored-label-text" href="javascript:void(0)" role="button" style="text-decoration: none;"><span class="puis-label-popover-default"><span class="a-size-micro a-color-secondary">Featured from Amazon brands</span></span><span class="puis-l

In [117]:
dic = {"name":[],
      "price":[],
      "old_price":[],
     
      "review":[],
      "no_ppl_review":[],
      "delivery":[],
       "coupon":[]}
dic["name"].append(name)
dic["price"].append(price)
dic["old_price"].append(old_price)
dic["coupon"].append(coupon)
dic["review"].append(review)
dic["no_ppl_review"].append(no_ppl_review)
dic["delivery"].append(delivery)
amazon=pd.DataFrame(dic, index=[0])
amazon

Unnamed: 0,name,price,old_price,review,no_ppl_review,delivery,coupon
0,Amazon Essentials Men's Slim-Fit Long-Sleeve H...,16.1,$16.10,4.3 out of 5 stars,11289,Ships to Nigeria,8% off coupon appliedSave 8% with coupon (som...


In [None]:
page = 1
dic = {"name":[],
      "price":[],
      "old_price":[],
      "review":[],
      "no_ppl_review":[],
      "delivery":[],
       "coupon":[]}
while page<=7:
    url= f"https://www.amazon.com/s?k=clothes&{page}&crid=3VUXXM9I6LGFR&qid=1725554647&sprefix=clothes%2Caps%2C298&ref=sr_pg_{page}"
    response= requests.get(url)
    html_content=response.content
    soup=bs(html_content)
    all_product=sp.findAll("div", class_="a-section a-spacing-small puis-padding-left-small puis-padding-right-small")
    for i in all_product:
        try:
            dic["name"].append(i.find("a", class_="a-link-normal s-underline-text s-underline-link-text s-link-style a-text-normal").text)
        except:
            dic["name"].append(np.nan)
        try:
            dic["price"].append(i.find("span", class_="a-price-whole").text)
        except:
            dic["price"].append(np.nan)
        try:
            dic["old_price"].append(i.find("div", class_="a-row a-size-base a-color-secondary s-align-children-center").text)
        except:
            dic["old_price"].append(np.nan)
        try:
            dic["review"].append(i.find("span", class_="a-icon-alt").text)
        except:
            dic["review"].append(np.nan)
        try:
            dic["no_ppl_review"].append(i.find("span", class_="a-size-base s-underline-text").text
        except:
            dic["no_ppl_review"].append(np.nan)
        try:
            dic["delivery"].append(i.find("div", class_="rev").text.split("(")[-1].strip(")"))
        except:
            dic["delivery"].append(np.nan)
        try:
            dic["coupon"].append(i.find("div", class_="rev").text.split("(")[-1].strip(")"))
        except:
            dic["coupon"].append(np.nan)
    page+=1
df=pd.DataFrame(dic)
df