*Web scraping* is the process of extracting data from websites. It involves fetching the content of web pages and parsing it to extract useful information, such as text, images, or specific elements like product prices, news headlines, or user reviews.

### Common Uses of Web Scraping
- Collecting product prices and reviews from e-commerce websites.
- Aggregating news from various news portals.
- Extracting data for market research or competitive analysis.
- Building datasets for machine learning or research.

### The Web Scraping Process
1. *Identify the Website and Data*: Start by deciding what website you want to scrape and which specific data you need. For example, scraping the price and rating of products from an e-commerce site.

2. *Inspect the Web Page*: Use the browser's developer tools (right-click > Inspect) to analyze the HTML structure of the web page and identify the tags, classes, and attributes where the target data is located.

3. *Choose the Tools/Libraries*: Common Python libraries for web scraping include:
   - *Requests*: For sending HTTP requests to get the raw HTML content.
   - *BeautifulSoup*: For parsing HTML and XML documents, allowing easy navigation and extraction of elements.
   - *Selenium*: For scraping dynamic content rendered by JavaScript, as it can simulate user interactions like clicking buttons and filling out forms.
   - *Scrapy*: A more advanced and powerful web scraping framework used for large-scale scraping.

4. *Send a Request to the Web Page*: Use the requests library to send an HTTP request and retrieve the HTML content of the page.

5. *Parse the HTML Content*: Use BeautifulSoup to parse the HTML and locate the elements containing the desired data based on tags, classes, or IDs.

6. *Extract and Process the Data*: Navigate through the parsed HTML to extract and store the data in a structured format (e.g., CSV, JSON, or database).

7. *Handle Dynamic Content (Optional)*: If the website uses JavaScript to load content dynamically, you might need to use Selenium or the requests-html library to wait for the page to load fully and interact with elements.

8. *Store the Data*: Save the extracted data in a file or database for further analysis.
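
Steps 6 and 8 can be sketched with the standard library alone. The records and field names below are made-up placeholders rather than real scraped data, so treat this as one possible shape for the storage step, not a fixed recipe.

```python
import csv
import io
import json

# Hypothetical records standing in for scraped results.
records = [
    {"name": "Sample shirt", "price": "1500", "rating": "4.5"},
    {"name": "Sample jeans", "price": "3200", "rating": None},
]

def to_csv_text(rows):
    """Serialize a list of dicts to CSV text; the first row's keys become the header."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=list(rows[0].keys()))
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()

def to_json_text(rows):
    """Serialize the same rows to JSON; None becomes null."""
    return json.dumps(rows, indent=2)

csv_text = to_csv_text(records)
json_text = to_json_text(records)
print(csv_text)
```

Writing to an in-memory buffer keeps the sketch free of side effects; replacing `io.StringIO()` with `open("products.csv", "w", newline="")` (a hypothetical file name) would write the same content to disk.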

### Example: Basic Web Scraping with Python

Here’s a simple example of scraping product titles from an e-commerce website using Python with requests and BeautifulSoup:

```python
import requests
from bs4 import BeautifulSoup

# Step 1: Send a request to the web page
url = "https://example.com/products"
response = requests.get(url)

# Step 2: Parse the HTML content using BeautifulSoup
soup = BeautifulSoup(response.content, "html.parser")

# Step 3: Find and extract the relevant data
product_titles = soup.find_all("h2", class_="product-title")

# Step 4: Print the extracted product titles
for title in product_titles:
    print(title.get_text())
```


### Handling Ethical Issues and Legal Considerations
- *Respect Robots.txt*: Always check the website’s robots.txt file, which specifies which parts of the site can be crawled or scraped.
- *Avoid Overloading Servers*: Implement delays between requests to avoid overwhelming the server (e.g., using time.sleep()).
- *Terms of Service*: Be mindful of the website’s terms and conditions, as some websites explicitly forbid scraping.
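
The first two points can be combined in a small helper. The sketch below uses the standard library's `urllib.robotparser` on a hypothetical robots.txt and takes the download function as a parameter, so `fake_fetch`, the URLs, and the rules are all illustrative assumptions.

```python
import time
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content. A real scraper would load the site's
# own file instead, for example with RobotFileParser.set_url() and read().
ROBOTS_LINES = [
    "User-agent: *",
    "Disallow: /private/",
]

def build_parser(lines):
    """Create a robots.txt parser from an iterable of lines."""
    parser = RobotFileParser()
    parser.parse(lines)
    return parser

def polite_fetch_all(urls, fetch, parser, user_agent="*", delay_seconds=1.0):
    """Fetch only the URLs robots.txt allows, pausing after each request."""
    results = {}
    for url in urls:
        if not parser.can_fetch(user_agent, url):
            continue  # skip paths the site asks crawlers to avoid
        results[url] = fetch(url)
        time.sleep(delay_seconds)  # spread requests out over time
    return results

def fake_fetch(url):
    """Stand-in for requests.get(url).content so the sketch runs offline."""
    return f"<html>{url}</html>"

parser = build_parser(ROBOTS_LINES)
pages = polite_fetch_all(
    ["https://example.com/products", "https://example.com/private/data"],
    fake_fetch,
    parser,
    delay_seconds=0,  # no pause needed for the offline demo
)
print(list(pages))  # ['https://example.com/products']
```

A delay of about one second is a common starting point, but the appropriate value depends on the site.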

### Advanced Considerations
- *Pagination*: If the data spans multiple pages, your scraper should handle pagination by following the “next” links.
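- *Authentication*: Some websites require you to log in or pass CAPTCHA challenges. Selenium can automate login forms, but CAPTCHAs exist specifically to block automated access, so treat one as a signal to stop or to look for an official API rather than something to bypass.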
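
Following "next" links, as the pagination point describes, comes down to a loop that stops when there is no next link. In this minimal sketch the site is simulated with a dictionary, so `FAKE_SITE`, `fake_fetch`, and the URLs are assumptions; a real `fetch` would download the page and pull the next link out of the HTML.

```python
# Hypothetical site: each URL maps to (items_on_page, next_url_or_None).
FAKE_SITE = {
    "/products?page=1": (["shirt", "jeans"], "/products?page=2"),
    "/products?page=2": (["jacket"], "/products?page=3"),
    "/products?page=3": (["socks"], None),
}

def fake_fetch(url):
    """Stand-in for downloading and parsing one page."""
    return FAKE_SITE[url]

def crawl(start_url, fetch, max_pages=50):
    """Collect items page by page until there is no next link.

    The visited set guards against link loops, and max_pages caps the
    total number of requests.
    """
    items = []
    visited = set()
    url = start_url
    while url is not None and url not in visited and len(visited) < max_pages:
        visited.add(url)
        page_items, url = fetch(url)
        items.extend(page_items)
    return items

all_items = crawl("/products?page=1", fake_fetch)
print(all_items)  # ['shirt', 'jeans', 'jacket', 'socks']
```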

### Summary
Web scraping is a powerful technique for extracting information from websites. The process involves sending an HTTP request, parsing HTML, and extracting specific data. Python provides several libraries like requests, BeautifulSoup, and Selenium to perform these tasks effectively.

In [1]:
pip install requests

Note: you may need to restart the kernel to use updated packages.


In [2]:
pip install selenium


In [3]:
pip install beautifulsoup4

Note: you may need to restart the kernel to use updated packages.


In [1]:
import numpy as np  # np.nan marks missing values
import pandas as pd  # data manipulation
import requests  # fetching the web page over HTTP
from bs4 import BeautifulSoup as bs  # parsing the HTML returned by the web page
import selenium  # browser automation for pages that requests cannot fetch

In [2]:
url = "https://www.jumia.com.ng/catalog/?q=clothes"
response = requests.get(url)

ConnectionError: HTTPSConnectionPool(host='www.jumia.com.ng', port=443): Max retries exceeded with url: /catalog/?q=clothes (Caused by NewConnectionError('<urllib3.connection.HTTPSConnection object at 0x000001F86071C640>: Failed to establish a new connection: [Errno 11001] getaddrinfo failed'))
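
The `getaddrinfo failed` message means the host name could not be resolved, which usually points to a missing internet connection or a DNS problem rather than a fault in the code. One way to keep a notebook from stopping on a transient failure is a small retry wrapper. The sketch below is an assumption about how that might look: `fetch_with_retries` and `flaky_fetch` are hypothetical names, and it relies on the fact that both the built-in `ConnectionError` and `requests.exceptions.RequestException` derive from `OSError`.

```python
import time

def fetch_with_retries(fetch, url, attempts=3, delay_seconds=1.0):
    """Call fetch(url), retrying when an OSError-based network error is raised."""
    last_error = None
    for attempt in range(1, attempts + 1):
        try:
            return fetch(url)
        except OSError as error:
            last_error = error
            if attempt < attempts:
                time.sleep(delay_seconds)  # wait before the next attempt
    raise last_error  # every attempt failed

# Simulated fetch that fails twice and then succeeds.
calls = {"count": 0}

def flaky_fetch(url):
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("simulated network failure")
    return f"content of {url}"

result = fetch_with_retries(flaky_fetch, "https://example.com", delay_seconds=0)
print(result, calls["count"])  # content of https://example.com 3
```

With the real library, `fetch` could be `lambda u: requests.get(u, timeout=10)`. Retrying does not help when the machine is fully offline, in which case the last error is raised again.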

In [None]:
html_content=response.content

In [None]:
soup = bs(html_content, "html.parser")  # name the parser explicitly to avoid a warning

In [None]:
single_product=soup.find("div", class_="info") # find one product listed
single_product

In [None]:
name=single_product.find("h3", class_="name").text
name

In [None]:
price= single_product.find("div", class_="prc").text
price

In [None]:
old_price= single_product.find("div", class_="old").text
old_price

In [None]:
discount= single_product.find("div", class_="bdg _dsct _sm").text
discount

In [None]:
no_ppl_rating= single_product.find("div", class_="rev").text
no_ppl_rating

In [3]:
rating= single_product.find("div", class_="rev").text.split()[0]
rating

NameError: name 'single_product' is not defined

In [4]:
no_ppl_rating2= single_product.find("div", class_="rev").text.split("(")[-1].strip(")")
no_ppl_rating2

NameError: name 'single_product' is not defined
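
The two string operations above can be wrapped in one function that also converts the pieces to numbers. The sample text `"4.5 out of 5(120)"` is an assumption about what the `rev` element contains, inferred from the split logic in these cells; the real page text may differ.

```python
def parse_rating(text):
    """Split review text into (rating, number_of_raters).

    A part that cannot be parsed is returned as None.
    """
    text = text.strip()
    try:
        rating = float(text.split()[0])  # first whitespace-separated token
    except (IndexError, ValueError):
        rating = None
    count = None
    if "(" in text:
        count_text = text.split("(")[-1].strip(")")  # text inside the last parentheses
        if count_text.isdigit():
            count = int(count_text)
    return rating, count

print(parse_rating("4.5 out of 5(120)"))  # (4.5, 120)
print(parse_rating("No reviews"))         # (None, None)
```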

In [5]:
dic = {"product_name":[],
      "product_price":[],
      "old_price":[],
      "discount_percentage":[],
      "rating":[],
      "number_of_people_rated":[]}
dic["product_name"].append(name)
dic["product_price"].append(price)
dic["discount_percentage"].append(discount)
dic["old_price"].append(old_price)
dic["number_of_people_rated"].append(no_ppl_rating2)
dic["rating"].append(rating)
df=pd.DataFrame(dic, index=[0])
df

NameError: name 'name' is not defined

In [6]:
dic = {"product_name":[],
      "product_price":[],
      "old_price":[],
      "discount_percentage":[],
      "rating":[],
      "number_of_people_rated":[]}

all_product = soup.find_all("div", class_="info")  # list of all the products listed
for i in all_product:
    # find() returns None when an element is missing, so .text raises AttributeError
    try:
        dic["discount_percentage"].append(i.find("div", class_="bdg _dsct _sm").text)
    except AttributeError:
        dic["discount_percentage"].append(np.nan)
    try:
        dic["product_name"].append(i.find("h3", class_="name").text)
    except AttributeError:
        dic["product_name"].append(np.nan)
    try:
        dic["product_price"].append(i.find("div", class_="prc").text)
    except AttributeError:
        dic["product_price"].append(np.nan)
    try:
        dic["old_price"].append(i.find("div", class_="old").text)
    except AttributeError:
        dic["old_price"].append(np.nan)
    try:
        dic["rating"].append(i.find("div", class_="rev").text.split()[0])
    except (AttributeError, IndexError):  # IndexError: the review text was empty
        dic["rating"].append(np.nan)
    try:
        dic["number_of_people_rated"].append(i.find("div", class_="rev").text.split("(")[-1].strip(")"))
    except AttributeError:
        dic["number_of_people_rated"].append(np.nan)

df=pd.DataFrame(dic)
df

NameError: name 'soup' is not defined
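
The repeated `try`/`except` blocks can be folded into one helper that returns `None` when an element is missing. The sketch below runs on a small invented HTML snippet so it works offline; the class names mirror the ones used above, while `text_or_none` and `SAMPLE_HTML` are hypothetical.

```python
from bs4 import BeautifulSoup

# Invented markup imitating two product cards; the second has no price.
SAMPLE_HTML = """
<div class="info"><h3 class="name">Sample Shirt</h3><div class="prc">1,500</div></div>
<div class="info"><h3 class="name">Sample Jeans</h3></div>
"""

def text_or_none(parent, tag, class_name):
    """Return the stripped text of the first matching child, or None if absent."""
    element = parent.find(tag, class_=class_name)
    return element.get_text(strip=True) if element is not None else None

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
rows = [
    {
        "product_name": text_or_none(card, "h3", "name"),
        "product_price": text_or_none(card, "div", "prc"),
    }
    for card in soup.find_all("div", class_="info")
]
print(rows)
```

Because every row has the same keys, the list can be passed directly to `pd.DataFrame(rows)`, where `None` appears as a missing value.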

In [7]:
page =1
dic = {"product_name":[],
      "product_price":[],
      "old_price":[],
      "discount_percentage":[],
      "rating":[],
      "number_of_people_rated":[]}
while page <= 50:
    url = f"https://www.jumia.com.ng/mens-clothing-bundles/?page={page}#catalog-listing"
    response = requests.get(url)
    html_content = response.content
    soup = bs(html_content, "html.parser")
    all_products = soup.find_all("div", class_="info")  # list of all products on this page
    for i in all_products:  # iterate over this page's products, not the earlier all_product list
        try:
            dic["discount_percentage"].append(i.find("div", class_="bdg _dsct _sm").text)
        except AttributeError:
            dic["discount_percentage"].append(np.nan)
        try:
            dic["product_name"].append(i.find("h3", class_="name").text)
        except AttributeError:
            dic["product_name"].append(np.nan)
        try:
            dic["product_price"].append(i.find("div", class_="prc").text)
        except AttributeError:
            dic["product_price"].append(np.nan)
        try:
            dic["old_price"].append(i.find("div", class_="old").text)
        except AttributeError:
            dic["old_price"].append(np.nan)
        try:
            dic["rating"].append(i.find("div", class_="rev").text.split()[0])
        except (AttributeError, IndexError):
            dic["rating"].append(np.nan)
        try:
            dic["number_of_people_rated"].append(i.find("div", class_="rev").text.split("(")[-1].strip(")"))
        except AttributeError:
            dic["number_of_people_rated"].append(np.nan)
    page += 1
df=pd.DataFrame(dic)
df

ConnectionError: HTTPSConnectionPool(host='www.jumia.com.ng', port=443): Max retries exceeded with url: /mens-clothing-bundles/?page=1 (Caused by NewConnectionError('<urllib3.connection.HTTPSConnection object at 0x000001F8608D5D00>: Failed to establish a new connection: [Errno 11001] getaddrinfo failed'))

In [8]:
df.to_csv("clothing.csv", index=False)

NameError: name 'df' is not defined

## When Requests Does Not Work

In [9]:
url="https://www.amazon.com/s?k=clothes&crid=3VUXXM9I6LGFR&sprefix=clothes%2Caps%2C298&ref=nb_sb_noss_1"

In [10]:
res = requests.get(url)

ConnectionError: HTTPSConnectionPool(host='www.amazon.com', port=443): Max retries exceeded with url: /s?k=clothes&crid=3VUXXM9I6LGFR&sprefix=clothes%2Caps%2C298&ref=nb_sb_noss_1 (Caused by NewConnectionError('<urllib3.connection.HTTPSConnection object at 0x000001F8608D5550>: Failed to establish a new connection: [Errno 11001] getaddrinfo failed'))

In [11]:
res.content

NameError: name 'res' is not defined

## Extracting Info Using Selenium

In [12]:
from selenium.webdriver import Chrome
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.chrome.service import Service

# Path to the local chromedriver executable
driver_path = "C:\\Users\\BlessedREI\\Desktop\\Data Science Course\\software\\data class web scrapping\\chromedriver.exe"
opt = Options()
ser = Service(driver_path)
driver = Chrome(options=opt, service=ser)

In [13]:
url = "https://www.amazon.com/s?k=clothes&crid=3VUXXM9I6LGFR&sprefix=clothes%2Caps%2C298&ref=nb_sb_noss_1"
driver.get(url)  # get() returns None; the rendered page is read from page_source
html = driver.page_source

WebDriverException: Message: unknown error: net::ERR_INTERNET_DISCONNECTED
  (Session info: chrome=128.0.6613.137)


In [None]:
sp = bs(html, "html.parser")  # name the parser explicitly to avoid a warning

In [None]:
sp

In [None]:
sp.find("div", class_="a-section a-spacing-small puis-padding-left-small puis-padding-right-small")

In [None]:
name=sp.find("a", class_="a-link-normal s-underline-text s-underline-link-text s-link-style a-text-normal").text
name

In [None]:
review=sp.find("span", class_="a-icon-alt").text
review

In [None]:
no_ppl_review=sp.find("span", class_="a-size-base s-underline-text").text
no_ppl_review

In [None]:
price=sp.find("span", class_="a-price-whole").text + sp.find("span", class_="a-price-fraction").text
price

In [None]:
old_price=sp.find("span", class_="a-offscreen").text
old_price

In [None]:
coupon=sp.find("div", class_="a-row a-size-small a-color-secondary").text
coupon

In [None]:
delivery=sp.find("div", class_="a-row a-size-base a-color-secondary s-align-children-center").text
delivery

In [None]:
shipping_options=sp.find("span", class_="a-size-small a-color-base").text
shipping_options

In [None]:
all_product = sp.find_all("div", class_="a-section a-spacing-small puis-padding-left-small puis-padding-right-small")
all_product

In [None]:
dic = {"name":[],
      "price":[],
      "old_price":[],
      "review":[],
      "no_ppl_review":[],
      "delivery":[],
      "coupon":[]}
dic["name"].append(name)
dic["price"].append(price)
dic["old_price"].append(old_price)
dic["coupon"].append(coupon)
dic["review"].append(review)
dic["no_ppl_review"].append(no_ppl_review)
dic["delivery"].append(delivery)
amazon=pd.DataFrame(dic, index=[0])
amazon

In [14]:
page = 1

driver_path = "C:\\Users\\BlessedREI\\Desktop\\Data Science Course\\software\\data class web scrapping\\chromedriver.exe"
opt = Options()
ser = Service(driver_path)
driver = Chrome(options=opt, service=ser)
dic = {"name":[],
      "price":[],
      "old_price":[],
      "review":[],
      "no_ppl_review":[],
      "delivery":[],
      "coupon":[]}
while page <= 7:
    url = f"https://www.amazon.com/s?k=clothes&page={page}&crid=3VUXXM9I6LGFR&qid=1725554647&sprefix=clothes%2Caps%2C298&ref=sr_pg_{page}"
    driver.get(url)
    html = driver.page_source
    sp = bs(html, "html.parser")  # parse the page just loaded by the driver
    all_product = sp.find_all("div", class_="a-section a-spacing-small puis-padding-left-small puis-padding-right-small")
    for i in all_product:
        # find() returns None when an element is missing, so .text raises AttributeError
        try:
            dic["name"].append(i.find("a", class_="a-link-normal s-underline-text s-underline-link-text s-link-style a-text-normal").text)
        except AttributeError:
            dic["name"].append(np.nan)
        try:
            # Build the full price first so exactly one value is appended per product
            dic["price"].append(i.find("span", class_="a-price-whole").text + i.find("span", class_="a-price-fraction").text)
        except AttributeError:
            dic["price"].append(np.nan)
        try:
            dic["old_price"].append(i.find("span", class_="a-offscreen").text)
        except AttributeError:
            dic["old_price"].append(np.nan)
        try:
            dic["review"].append(i.find("span", class_="a-icon-alt").text)
        except AttributeError:
            dic["review"].append(np.nan)
        try:
            dic["no_ppl_review"].append(i.find("span", class_="a-size-base s-underline-text").text)
        except AttributeError:
            dic["no_ppl_review"].append(np.nan)
        try:
            dic["delivery"].append(i.find("div", class_="a-row a-size-base a-color-secondary s-align-children-center").text)
        except AttributeError:
            dic["delivery"].append(np.nan)
        try:
            dic["coupon"].append(i.find("div", class_="a-row a-size-small a-color-secondary").text)
        except AttributeError:
            dic["coupon"].append(np.nan)
    page += 1
amazon=pd.DataFrame(dic)
amazon

WebDriverException: Message: unknown error: net::ERR_INTERNET_DISCONNECTED
  (Session info: chrome=128.0.6613.137)


In [None]:
amazon

In [None]:
url ="https://www.konga.com/category/smartphones-7539"
response= requests.get(url)

In [None]:
html_content=response.content

In [None]:
soup = bs(html_content, "html.parser")  # name the parser explicitly to avoid a warning

In [None]:
phones=soup.find("div", class_="a2cf5_2S5q5 cf5dc_3HhOq") 
phones

In [None]:
name=phones.find("div", class_="af885_1iPzH").text
name

In [96]:
old_price=perfume.find("div", class_="_4e81a_39Ehs").text
old_price

AttributeError: 'NoneType' object has no attribute 'text'

In [37]:
price=perfume.find("span", class_="price-item price-item--regular").text
price

'\n        ₦480,000.00 NGN\n      '
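
The raw price text includes whitespace, a currency symbol, thousands separators, and a currency code, so it usually needs cleaning before analysis. A minimal sketch, assuming prices use a comma as the thousands separator and a dot as the decimal point (`parse_price` is a hypothetical helper):

```python
import re

def parse_price(text):
    """Return the first number found in a price string as a float, or None."""
    match = re.search(r"\d[\d,]*(?:\.\d+)?", text)
    if match is None:
        return None
    return float(match.group().replace(",", ""))  # drop thousands separators

raw = "\n        ₦480,000.00 NGN\n      "
print(parse_price(raw))  # 480000.0
```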

In [97]:
name=perfume.find("a", class_="product-item__title text--strong link").text
name

'Xerjoff Erba Pura EDP'