What is Web Scraping?

Web scraping is a way of obtaining large amounts of data from websites. It is an automated approach, as opposed to copying and pasting the data you want by hand, which stops being practical once you need hundreds of data points.

What is happening under the hood?

Web scraping relies on a scraper, which in this case is our Python script. These are the steps:

1. The scraper (Python program) sends a request to a website for a specific page, e.g. "give me the air fryer page on Jumia".
2. The website sends back the page.
3. Data extraction: you, the programmer, have coded the scraper to parse the returned page for specific data. Say you only want the prices from the air fryer page on Jumia and nothing else.
4. Clean, organize, and store the data.


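When you run requests.get(), you get a response object back. This object contains:
1. status_code: the HTTP status code (e.g. 200 for success, 404 or 500 for errors)
2. text: the response body decoded as a string (here, the HTML)
3. content: the response body as raw bytes
4. headers: the HTTP headers the server sent back with the response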
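
Before parsing, it helps to confirm that the request actually succeeded. The sketch below is one way to do that, not the only one. The helper name `fetch_page` is hypothetical and the 10-second timeout is an assumed value; adjust both to taste.

```python
import requests


def fetch_page(url, headers=None, timeout=10):
    """Return the page HTML as text, or None if the request fails.

    fetch_page is a hypothetical helper; the timeout is an assumed default.
    """
    try:
        response = requests.get(url, headers=headers, timeout=timeout)
        # raise_for_status() raises an HTTPError for 4xx and 5xx status codes
        response.raise_for_status()
    except requests.RequestException as error:
        # covers connection errors, timeouts, bad URLs and HTTP error codes
        print(f"Request failed: {error}")
        return None
    return response.text
```

A caller can then skip parsing whenever the function returns None, instead of feeding an error page to the parser.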

In [None]:
#install the necessary libraries before importing them
#run: pip install requests
#run: pip install beautifulsoup4
#read the Selenium documentation for more information

The path of the information is found 

In [4]:
#Scraping the names and prices of the products
import requests
from bs4 import BeautifulSoup

In [8]:
#define the URL you want to scrape
url = 'https://www.jumia.co.ke/catalog/?q=air+fryer'

#set a User-Agent header so the request looks like it comes from a browser
#and is less likely to be blocked
headers = {
    "User-Agent": "Chrome/5.0"
}



In [9]:
# now send a GET request
response = requests.get(url, headers=headers)

# then parse the HTML content of the page with Python's built-in html.parser
soup = BeautifulSoup(response.content, 'html.parser')
print(soup.prettify())


<!DOCTYPE html>
<html dir="ltr" lang="en">
 <head>
  <meta charset="utf-8"/>
  <title>
   Shop &amp; Buy Online | Jumia Kenya
  </title>
  <meta content="product" property="og:type"/>
  <meta content="Jumia Kenya" property="og:site_name"/>
  <meta content=" Shop &amp; Buy Online | Jumia Kenya" property="og:title"/>
  <meta content="/catalog/" property="og:url"/>
  <meta content="https://ke.jumia.is/cms/icons/jumialogo-x-4.png" property="og:image"/>
  <meta content="en_KE" property="og:locale"/>
  <meta content=" Shop &amp; Buy Online | Jumia Kenya" name="title"/>
  <meta content="noindex,follow" name="robots"/>
  <link href="android-app://com.jumia.android/JUMIA/KE/s/air fryer?utm_source=google&amp;utm_medium=organic&amp;adjust_tracker=j1hd8h&amp;adjust_campaign=GOOGLE_SEARCH&amp;adjust_adgroup=https%3A%2F%2Fwww.jumia.co.ke%2Fcatalog%2F%3Fq%3Dair%2Bfryer" rel="alternate"/>
  <link href="https://www.jumia.co.ke/catalog/" rel="canonical"/>
  <link href="https://www.jumia.co.ke/catalog/?q

In [10]:
products = soup.find_all('article', class_='prd _fb col c-prd')
products

[<article class="prd _fb col c-prd"><a class="btn _i _rnd -mas -fsh0 -me-start _wslt _sec" data-ga4-discount="0.87" data-ga4-is_second_chance="false" data-ga4-item_brand="Amoi" data-ga4-item_category="Home &amp; Office" data-ga4-item_category2="Home &amp; Kitchen" data-ga4-item_category3="Kitchen &amp; Dining" data-ga4-item_category4="Small Appliances" data-ga4-item_category5="Air Fryers" data-ga4-item_id="AM454HA5DK9C8NAFAMZ" data-ga4-item_name="Double Button Air Fryer 6.5L" data-ga4-item_variant="" data-ga4-price="23.10" data-moengage-brand_key="amoi" data-moengage-brand_name="Amoi" data-moengage-category_key="category-url-en-air-fryers" data-moengage-category_name="Air Fryers" data-moengage-discount="0.87" data-moengage-item_variant="" data-moengage-product_image="https://ke.jumia.is/unsafe/fit-in/300x300/filters:fill(white)/product/25/1111123/1.jpg?1641" data-moengage-product_name="Double Button Air Fryer 6.5L" data-moengage-product_price="23.10" data-moengage-product_sku="AM454HA5
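
The `class_` argument behaves differently depending on what you pass it, which matters because this notebook uses both forms. The small sketch below shows the difference on made-up HTML (the two `article` tags are invented for illustration and are not copied from Jumia).

```python
from bs4 import BeautifulSoup

# Made-up HTML for illustration only; it is not copied from Jumia.
html = """
<article class="prd _fb col c-prd"><h3 class="name">Fryer A</h3></article>
<article class="prd col"><h3 class="name">Fryer B</h3></article>
"""
soup = BeautifulSoup(html, "html.parser")

# A string with several class names must match the whole class attribute.
exact = soup.find_all("article", class_="prd _fb col c-prd")

# A single class name matches any element that carries that class.
loose = soup.find_all("article", class_="prd")

print(len(exact), len(loose))
```

Matching on the single class `prd` is usually the more forgiving choice, since it keeps working if the site adds or reorders the other class names.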

In [None]:
#Loop through the products and extract the name and price 

for product in products:
    name = product.find("h3", class_="name")
    new_price = product.find("div", class_="prc")
    old_price = product.find("div", class_="old")

    if name and new_price:
        print(f"Product: {name.text.strip()}")
        print(f"New Price: {new_price.text.strip()}")
        if old_price:
            print(f"Old Price: {old_price.text.strip()}")
        else:
            print("Old Price: Not listed")
        print("-" * 40)
#strip() removes leading and trailing whitespace characters
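
The scraped prices are still text, so the "clean" part of step 4 usually means turning them into numbers. Here is a minimal sketch, assuming the price text looks something like `KSh 7,999`; that format and the helper name `parse_price` are assumptions, so check what your own output looks like first.

```python
import re


def parse_price(text):
    """Convert a price string such as 'KSh 7,999' to a float, or None.

    parse_price is a hypothetical helper; the 'KSh 7,999' format is assumed.
    """
    # Find the first run of digits, allowing thousands commas and decimals
    match = re.search(r"\d[\d,]*(?:\.\d+)?", text)
    if match is None:
        return None  # e.g. the "Not listed" placeholder
    return float(match.group().replace(",", ""))
```

If a price is shown as a range, this sketch only keeps the first number it finds.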

In [12]:
#Save the data to a CSV file (which Excel can also open)

import csv

with open("air_fryers_jumia.csv", mode="w", newline="", encoding="utf-8") as file:
    writer = csv.writer(file)
    writer.writerow(["Product_name", "New_Price", "Old_Price"])

    for product in products:
        name = product.find("h3", class_="name")
        new_price = product.find("div", class_="prc")
        old_price = product.find("div", class_="old")

        if name and new_price:
            product_name = name.text.strip()
            current_price = new_price.text.strip()
            if old_price:
                original_price = old_price.text.strip()
            else:
                original_price = "Not listed"
            # write the row whether or not an old price was found
            writer.writerow([product_name, current_price, original_price])



Scraping multiple pages

In [None]:
import requests
from bs4 import BeautifulSoup
import csv


with open("air_fryers_multi_page2.csv", mode="w", newline="", encoding="utf-8") as file:
    writer = csv.writer(file)
    writer.writerow(["Product Name", "New Price", "Old Price"])  # CSV Header

    # We modify our code to loop through the pages and scrape each one; everything else remains the same
    for page in range(1, 7):  # pages 1 to 6; raise the upper bound to scrape more
        #print(f"Scraping Page {page}...")

        url = f"https://www.jumia.co.ke/catalog/?q=air+fryer&page={page}"
        headers = {
            "User-Agent": "Mozilla/5.0"
        }

        response = requests.get(url, headers=headers)
        soup = BeautifulSoup(response.content, "html.parser")

        products = soup.find_all("article", class_="prd")

        for product in products:
            name = product.find("h3", class_="name")
            new_price = product.find("div", class_="prc")
            old_price = product.find("div", class_="old")

            if name and new_price:
                product_name = name.text.strip()
                current_price = new_price.text.strip()
                original_price = old_price.text.strip() if old_price else "Not listed"

                writer.writerow([product_name, current_price, original_price])
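
The page range above is fixed, so the loop keeps requesting pages even if the results run out early. One possible variation is sketched below: it stops at the first page with no products and pauses between pages. The function name `scrape_pages`, the `fetch_html` callable and the one-second delay are all assumptions made for this illustration.

```python
import time

from bs4 import BeautifulSoup


def scrape_pages(fetch_html, max_pages, delay_seconds=1.0):
    """Collect product names page by page until a page has no products.

    fetch_html is a hypothetical callable: it takes a page number and
    returns that page's HTML as a string.
    """
    names = []
    for page in range(1, max_pages + 1):
        soup = BeautifulSoup(fetch_html(page), "html.parser")
        products = soup.find_all("article", class_="prd")
        if not products:
            break  # an empty page suggests we ran past the last page
        for product in products:
            name = product.find("h3", class_="name")
            if name:
                names.append(name.text.strip())
        time.sleep(delay_seconds)  # pause so the requests are spaced out
    return names
```

For the real site, `fetch_html` could wrap the `requests.get(...)` call from the cell above and return `response.text`. Passing the fetching step in as a function also makes the loop easy to try out on saved HTML without hitting the website.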
