*Web scraping* is the process of extracting data from websites. It involves fetching the content of web pages and parsing it to extract useful information, such as text, images, or specific elements like product prices, news headlines, or user reviews.

### Common Uses of Web Scraping
- Collecting product prices and reviews from e-commerce websites.
- Aggregating news from various news portals.
- Extracting data for market research or competitive analysis.
- Building datasets for machine learning or research.

### The Web Scraping Process
1. *Identify the Website and Data*: Start by deciding what website you want to scrape and which specific data you need. For example, scraping the price and rating of products from an e-commerce site.

2. *Inspect the Web Page*: Use the browser's developer tools (right-click > Inspect) to analyze the HTML structure of the web page and identify the tags, classes, and attributes where the target data is located.

3. *Choose the Tools/Libraries*: Common Python libraries for web scraping include:
   - *Requests*: For sending HTTP requests to get the raw HTML content.
   - *BeautifulSoup*: For parsing HTML and XML documents, allowing easy navigation and extraction of elements.
   - *Selenium*: For scraping dynamic content rendered by JavaScript, as it can simulate user interactions like clicking buttons and filling out forms.
   - *Scrapy*: A more advanced and powerful web scraping framework used for large-scale scraping.

4. *Send a Request to the Web Page*: Use the requests library to send an HTTP request and retrieve the HTML content of the page.

5. *Parse the HTML Content*: Use BeautifulSoup to parse the HTML and locate the elements containing the desired data based on tags, classes, or IDs.

6. *Extract and Process the Data*: Navigate through the parsed HTML to extract and store the data in a structured format (e.g., CSV, JSON, or database).

7. *Handle Dynamic Content (Optional)*: If the website uses JavaScript to load content dynamically, you might need to use Selenium or the requests-html library to wait for the page to load fully and interact with elements.

8. *Store the Data*: Save the extracted data in a file or database for further analysis.

### Example: Basic Web Scraping with Python

Here’s a simple example of scraping product titles from an e-commerce website using Python with requests and BeautifulSoup:

python
import requests
from bs4 import BeautifulSoup

# Step 1: Send a request to the web page
url = "https://example.com/products"
response = requests.get(url)

# Step 2: Parse the HTML content using BeautifulSoup
soup = BeautifulSoup(response.content, "html.parser")

# Step 3: Find and extract the relevant data
product_titles = soup.find_all("h2", class_="product-title")

# Step 4: Print the extracted product titles
for title in product_titles:
    print(title.get_text())


### Handling Ethical Issues and Legal Considerations
- *Respect Robots.txt*: Always check the website’s robots.txt file, which specifies which parts of the site can be crawled or scraped.
- *Avoid Overloading Servers*: Implement delays between requests to avoid overwhelming the server (e.g., using time.sleep()).
- *Terms of Service*: Be mindful of the website’s terms and conditions, as some websites explicitly forbid scraping.

### Advanced Considerations
- *Pagination*: If the data spans multiple pages, your scraper should handle pagination by following the “next” links.
- *Authentication*: Some websites require you to log in or pass through CAPTCHA challenges. Selenium can help bypass these.

### Summary
Web scraping is a powerful technique for extracting information from websites. The process involves sending an HTTP request, parsing HTML, and extracting specific data. Python provides several libraries like requests, BeautifulSoup, and Selenium to perform these tasks effectively.

In [1]:
pip install requests

Note: you may need to restart the kernel to use updated packages.


In [2]:
pip install selenium

Collecting selenium
  Downloading selenium-4.24.0-py3-none-any.whl (9.6 MB)
     ---------------------------------------- 9.6/9.6 MB 368.7 kB/s eta 0:00:00
Collecting typing_extensions~=4.9
  Downloading typing_extensions-4.12.2-py3-none-any.whl (37 kB)
Collecting trio-websocket~=0.9
  Downloading trio_websocket-0.11.1-py3-none-any.whl (17 kB)
Collecting trio~=0.17
  Downloading trio-0.26.2-py3-none-any.whl (475 kB)
     ------------------------------------ 476.0/476.0 kB 426.1 kB/s eta 0:00:00
Collecting websocket-client~=1.8
  Downloading websocket_client-1.8.0-py3-none-any.whl (58 kB)
     -------------------------------------- 58.8/58.8 kB 445.7 kB/s eta 0:00:00
Collecting sniffio>=1.3.0
  Downloading sniffio-1.3.1-py3-none-any.whl (10 kB)
Collecting outcome
  Downloading outcome-1.3.0.post0-py2.py3-none-any.whl (10 kB)
Collecting exceptiongroup
  Downloading exceptiongroup-1.2.2-py3-none-any.whl (16 kB)
Collecting attrs>=23.2.0
  Downloading attrs-24.2.0-py3-none-any.whl (63 kB)
 

In [3]:
pip install beautifulsoup4

Note: you may need to restart the kernel to use updated packages.


In [6]:
import numpy as np # for arithmetic operations
import pandas as pd # data manipulation
import requests # getting access to the web page
from bs4 import BeautifulSoup as bs # parsing the data extracted from the webpage
import selenium # getting access and parsing data from webpage

In [7]:
url ="https://www.jumia.com.ng/catalog/?q=clothes"
response= requests.get(url)

In [8]:
html_content=response.content

In [9]:
soup=bs(html_content)

In [21]:
single_product=soup.find("div", class_="info") # find one product listed
single_product

<div class="info"><h3 class="name">Kingskartel Kings-Kartel Stylish Sweatshirt &amp; Joggers Pant (Brown &amp; Black)</h3><div class="prc">₦ 19,003 - ₦ 19,999</div><div class="s-prc-w"><div class="old">₦ 25,000</div><div class="bdg _dsct _sm">24%</div></div><div class="rev"><div class="stars _s">3.9 out of 5<div class="in" style="width:78%"></div></div>(20)</div></div>

In [23]:
name=single_product.find("h3", class_="name").text
name

'Kingskartel Kings-Kartel Stylish Sweatshirt & Joggers Pant (Brown & Black)'

In [24]:
price= single_product.find("div", class_="prc").text
price

'₦ 19,003 - ₦ 19,999'

In [25]:
old_price= single_product.find("div", class_="old").text
old_price

'₦ 25,000'

In [26]:
discount= single_product.find("div", class_="bdg _dsct _sm").text
discount

'24%'

In [28]:
no_ppl_rating= single_product.find("div", class_="rev").text
no_ppl_rating

'3.9 out of 5(20)'

In [29]:
rating= single_product.find("div", class_="rev").text.split()[0]
rating

'3.9'

In [32]:
no_ppl_rating2= single_product.find("div", class_="rev").text.split("(")[-1].strip(")")
no_ppl_rating2

'20'

In [34]:
dic = {"product name":[],
      "product price":[],
      "old price":[],
      "discount percentage":[],
      "rating":[],
      "number of people rated":[]}
dic["product name"].append(name)
dic["product price"].append(price)
dic["discount percentage"].append(discount)
dic["old price"].append(old_price)
dic["number of people rated"].append(no_ppl_rating2)
dic["rating"].append(rating)
df=pd.DataFrame(dic, index=[0])
df

Unnamed: 0,product name,product price,old price,discount percentage,rating,number of people rated
0,Kingskartel Kings-Kartel Stylish Sweatshirt & ...,"₦ 19,003 - ₦ 19,999","₦ 25,000",24%,3.9,20


In [36]:
dic = {"product name":[],
      "product price":[],
      "old price":[],
      "discount percentage":[],
      "rating":[],
      "number of people rated":[]}

all_product=soup.findAll("div", class_="info") # returns list of all the products listed
for i in all_product:
    try:
        dic["discount percentage"].append(i.find("div", class_="bdg _dsct _sm").text)
    except:
        dic["discount percentage"].append(np.nan)
    try:
        dic["product name"].append(i.find("h3", class_="name").text)
    except:
        dic["product name"].append(np.nan)
    try:
        dic["product price"].append(i.find("div", class_="prc").text)
    except:
        dic["product price"].append(np.nan)
    try:
        dic["old price"].append(i.find("div", class_="old").text)
    except:
        dic["old price"].append(np.nan)
    try:
        dic["rating"].append(i.find("div", class_="rev").text)
    except:
        dic["rating"].append(np.nan)
    try:
        dic["number of people rated"].append(i.find("div", class_="rev").text)
    except:
        dic["number of people rated"].append(np.nan)

df=pd.DataFrame(dic)
df

Unnamed: 0,product name,product price,old price,discount percentage,rating,number of people rated
0,Kingskartel Kings-Kartel Stylish Sweatshirt & ...,"₦ 19,003 - ₦ 19,999","₦ 25,000",24%,3.9 out of 5(20),3.9 out of 5(20)
1,Kingskartel Kings-Kartel Stylish Hoodie & Jogg...,"₦ 19,997 - ₦ 19,999","₦ 25,000",20%,3.6 out of 5(18),3.6 out of 5(18)
2,Danami 2 In One Sweat Shorts- Black & Light Grey,"₦ 12,529","₦ 13,999",11%,4.1 out of 5(35),4.1 out of 5(35)
3,Three In One Smart Chinos Black + Chocolate B...,"₦ 37,700 - ₦ 38,999",,,,
4,Trousers Men's 2-in-1 Short Sleeved T-shirt An...,"₦ 9,850","₦ 14,350",31%,3.5 out of 5(62),3.5 out of 5(62)
5,VEJARO BC002 Infant Baby Clothes Sets Pattern ...,"₦ 5,180","₦ 10,800",52%,4.1 out of 5(125),4.1 out of 5(125)
6,2pcs Men's Sports Fashion Printed Hoodie Long ...,"₦ 10,784","₦ 20,400",47%,3.8 out of 5(4),3.8 out of 5(4)
7,Men's Loose Casual Sports Suit Ice Silk Breath...,"₦ 8,520","₦ 17,230",51%,2 out of 5(4),2 out of 5(4)
8,5-Tiers Metal Shoe Rack Hat Rack With 16 Hooks,"₦ 10,695","₦ 29,800",64%,4 out of 5(86),4 out of 5(86)
9,"Men's 2-piece Set - T-shirt + Trousers, Hip-ho...","₦ 11,444 - ₦ 12,990","₦ 25,800",56%,3.5 out of 5(11),3.5 out of 5(11)


In [38]:
page =1
dic = {"product name":[],
      "product price":[],
      "old price":[],
      "discount percentage":[],
      "rating":[],
      "number of people rated":[]}
while page<=50:
    url= f"https://www.jumia.com.ng/mens-clothing-bundles/?page={page}#catalog-listing"
    response= requests.get(url)
    html_content=response.content
    soup=bs(html_content)
    all_products=soup.findAll("div", class_="info") # returns list of all products listed
    for i in all_product:
        try:
            dic["discount percentage"].append(i.find("div", class_="bdg _dsct _sm").text)
        except:
            dic["discount percentage"].append(np.nan)
        try:
            dic["product name"].append(i.find("h3", class_="name").text)
        except:
            dic["product name"].append(np.nan)
        try:
            dic["product price"].append(i.find("div", class_="prc").text)
        except:
            dic["product price"].append(np.nan)
        try:
            dic["old price"].append(i.find("div", class_="old").text)
        except:
            dic["old price"].append(np.nan)
        try:
            dic["rating"].append(i.find("div", class_="rev").text)
        except:
            dic["rating"].append(np.nan)
        try:
            dic["number of people rated"].append(i.find("div", class_="rev").text)
        except:
            dic["number of people rated"].append(np.nan)
    page+=1
df=pd.DataFrame(dic)
df

Unnamed: 0,product name,product price,old price,discount percentage,rating,number of people rated
0,Kingskartel Kings-Kartel Stylish Sweatshirt & ...,"₦ 19,003 - ₦ 19,999","₦ 25,000",24%,3.9 out of 5(20),3.9 out of 5(20)
1,Kingskartel Kings-Kartel Stylish Hoodie & Jogg...,"₦ 19,997 - ₦ 19,999","₦ 25,000",20%,3.6 out of 5(18),3.6 out of 5(18)
2,Danami 2 In One Sweat Shorts- Black & Light Grey,"₦ 12,529","₦ 13,999",11%,4.1 out of 5(35),4.1 out of 5(35)
3,Three In One Smart Chinos Black + Chocolate B...,"₦ 37,700 - ₦ 38,999",,,,
4,Trousers Men's 2-in-1 Short Sleeved T-shirt An...,"₦ 9,850","₦ 14,350",31%,3.5 out of 5(62),3.5 out of 5(62)
...,...,...,...,...,...,...
1995,Best Styled Up And Down Wears With Prints Pk1,"₦ 16,500 - ₦ 20,000",,,4.3 out of 5(4),4.3 out of 5(4)
1996,Stylish Multicolor Unisex Up And Down Wears Pk3,"₦ 16,000 - ₦ 20,000",,,4.6 out of 5(5),4.6 out of 5(5)
1997,Stylish Multicolor Unisex Up And Down Wears Pk3,"₦ 14,000 - ₦ 17,500",,,4 out of 5(1),4 out of 5(1)
1998,Stylish Multicolor Unisex Up And Down PK13,"₦ 16,000 - ₦ 20,000",,,4 out of 5(2),4 out of 5(2)


In [39]:
df.to_csv("clothing.csv", index=False)

## Time when request is not working

In [40]:
url="https://www.amazon.com/s?k=clothes&crid=3VUXXM9I6LGFR&sprefix=clothes%2Caps%2C298&ref=nb_sb_noss_1"

In [42]:
res = requests.get(url)

In [43]:
res.content

b'<!--\n        To discuss automated access to Amazon data please contact api-services-support@amazon.com.\n        For information about migrating to our APIs refer to our Marketplace APIs at https://developer.amazonservices.com/ref=rm_5_sv, or our Product Advertising API at https://affiliate-program.amazon.com/gp/advertising/api/detail/main.html/ref=rm_5_ac for advertising use cases.\n-->\n<!doctype html>\n<html>\n<head>\n  <meta charset="utf-8">\n  <meta http-equiv="x-ua-compatible" content="ie=edge">\n  <meta name="viewport" content="width=device-width, initial-scale=1, shrink-to-fit=no">\n  <title>Sorry! Something went wrong!</title>\n  <style>\n  html, body {\n    padding: 0;\n    margin: 0\n  }\n\n  img {\n    border: 0\n  }\n\n  #a {\n    background: #232f3e;\n    padding: 11px 11px 11px 192px\n  }\n\n  #b {\n    position: absolute;\n    left: 22px;\n    top: 12px\n  }\n\n  #c {\n    position: relative;\n    max-width: 800px;\n    padding: 0 40px 0 0\n  }\n\n  #e, #f {\n    hei

## extracting info using selenium

In [49]:
import selenium
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.chrome.options import Options
from selenium.webdriver import Chrome

driver_path = "C:\\Users\\BlessedREI\\Desktop\\Data Science Course\\software\\data class web scrapping\\chromedriver.exe"
opt = Options()
ser = Service(driver_path)
driver = Chrome(options=opt, service=ser)

In [50]:
res= driver.get(url)

In [52]:
res.page_source

AttributeError: 'NoneType' object has no attribute 'page_source'