What is Web Scraping?

Web scraping is simply a way of obtaining large amounts of data from websites on the internet. It is an automated approach, as opposed to copying and pasting the data you want from a website by hand (because gosh, what if you want hundreds of data points?).

What is happening under the hood?

Web scraping involves what we call a scraper; in this case, that is our Python script/program. These are the steps:

1. The scraper (Python program) sends a request to a website and asks for a specific page, e.g., "give me the air fryer page on Jumia".
2. The website sends back the page.
3. Data extraction: you, the programmer, have coded the scraper to parse the returned page for specific data. Say you only want the prices from the air fryer page on Jumia and nothing else.
4. Clean, organize, and store the data.
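
Step 4 (cleaning) can be sketched as follows. Scraped prices arrive as text, so they usually need converting to numbers before you can sort or average them. This is a minimal sketch, not part of the original notebook: the function name `parse_price` is hypothetical, and the "KSh 12,499" format is an assumption about how the prices are displayed.

```python
import re


def parse_price(price_text):
    """Convert a price string such as 'KSh 12,499' into an integer.

    Returns None when the text contains no digits (e.g. 'Not listed').
    For a range such as 'KSh 1,000 - KSh 2,000', only the first amount is used.
    """
    # Find the first run of digits, allowing thousands separators
    match = re.search(r"\d[\d,]*", price_text)
    if match is None:
        return None
    # Drop the commas so that int() can read the number
    return int(match.group().replace(",", ""))


print(parse_price("KSh 12,499"))
print(parse_price("Not listed"))
```

If the prices on the page use a different layout (decimals, another currency prefix), the pattern would need adjusting.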


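When you run requests.get(), you get a Response object as output. This object contains:
1. status_code: the HTTP status code (e.g., 200 for success, 404 or 500 for errors)
2. text: the body of the response decoded as a string (here, the HTML)
3. content: the body of the response as raw bytes
4. headers: the HTTP headers the server sent back with the response (the headers you sent are in response.request.headers)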
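
To make the status codes above concrete, here is a small sketch that uses only Python's standard library to turn a code into a readable label. The helper names `describe_status` and `is_success` are hypothetical, chosen for illustration.

```python
from http import HTTPStatus


def describe_status(code):
    """Return a short label for an HTTP status code, e.g. '404 Not Found'."""
    try:
        phrase = HTTPStatus(code).phrase
    except ValueError:
        # The code is not one of the standard HTTP status codes
        phrase = "Unknown status"
    return f"{code} {phrase}"


def is_success(code):
    """True for 2xx codes, the range that signals a successful request."""
    return 200 <= code < 300


for code in (200, 404, 500):
    print(describe_status(code), "-> success:", is_success(code))
```

With a real response you would pass in response.status_code. The requests library also provides response.raise_for_status(), which raises an exception for 4xx and 5xx codes.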

In [None]:
#install the necessary libraries before importing them
#run: pip install requests
#run: pip install beautifulsoup4
#see the Selenium documentation for more information

The path of the information is found 

In [8]:
#Scraping the names and prices of the products
import requests
from bs4 import BeautifulSoup

In [6]:
#define the URL you want to scrape
url = 'https://www.jumia.co.ke/catalog/?q=air+fryer'

#send a browser-like User-Agent header so that the request is less likely to be blocked
headers = {
    "User-Agent": "Chrome/5.0"
}



In [None]:
# now send a GET request

response = requests.get(url, headers=headers)
#then parse the HTML of the returned page with Python's built-in HTML parser
soup = BeautifulSoup(response.content, 'html.parser')
print(soup.prettify())


In [None]:
products = soup.find_all('article', class_='prd _fb col c-prd')
products
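
The class_ argument behaves differently depending on whether you pass one class name or several. The sketch below illustrates this on a made-up HTML snippet (an assumption for illustration, not copied from Jumia): a single name such as "prd" matches any element that has that class, while a string with several names matches only elements whose class attribute is exactly that string.

```python
from bs4 import BeautifulSoup

# Made-up HTML that imitates a product listing; not copied from Jumia
html = """
<article class="prd _fb col c-prd"><h3 class="name">Fryer A</h3></article>
<article class="prd _box"><h3 class="name">Fryer B</h3></article>
<article class="banner"><h3 class="name">Advert</h3></article>
"""

soup = BeautifulSoup(html, "html.parser")

# Several class names: matches only the exact class attribute string
exact = soup.find_all("article", class_="prd _fb col c-prd")

# One class name: matches every article that carries the "prd" class
single = soup.find_all("article", class_="prd")

print(len(exact))
print(len(single))
```

If the exact string returns an empty list on the live page, searching for a single class name is usually the more forgiving choice.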

In [None]:
#Loop through the products and extract the name and price 

for product in products:
    name = product.find("h3", class_="name")
    new_price = product.find("div", class_="prc")
    old_price = product.find("div", class_="old")

    if name and new_price:
        print(f"Product: {name.text.strip()}")
        print(f"New Price: {new_price.text.strip()}")
        if old_price:
            print(f"Old Price: {old_price.text.strip()}")
        else:
            print("Old Price: Not listed")
        print("-" * 40)
#strip() removes leading and trailing whitespace characters

In [12]:
#Save the data to a CSV file (which can also be opened in Excel)

import csv

with open("air_fryers_jumia.csv", mode="w", newline="", encoding="utf-8") as file:
    writer = csv.writer(file)
    writer.writerow(["Product_name", "New_Price", "Old_Price"])

    for product in products:
        name = product.find("h3", class_="name")
        new_price = product.find("div", class_="prc")
        old_price = product.find("div", class_="old")

        if name and new_price:
            product_name = name.text.strip()
            current_price = new_price.text.strip()
            if old_price:
                original_price = old_price.text.strip()
            else:
                original_price = "Not listed"
            # write a row for every product, whether or not it has an old price
            writer.writerow([product_name, current_price, original_price])
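
A quick way to check the saved data is to read it back with csv.DictReader, which returns each row as a dictionary keyed by the header names. The sketch below uses an in-memory io.StringIO with made-up rows as a stand-in for the file, so it runs without scraping anything first.

```python
import csv
import io

# Stand-in for the file on disk; the rows are made-up examples
sample = io.StringIO(
    'Product_name,New_Price,Old_Price\n'
    'Fryer A,"KSh 12,499","KSh 15,000"\n'
    'Fryer B,"KSh 8,000",Not listed\n'
)

# Each row becomes a dict such as {"Product_name": "Fryer A", ...}
rows = list(csv.DictReader(sample))

# Keep only the products that have an old price, i.e. the discounted ones
discounted = [row["Product_name"] for row in rows if row["Old_Price"] != "Not listed"]

print(len(rows))
print(discounted)
```

To read the real file, replace sample with open("air_fryers_jumia.csv", newline="", encoding="utf-8") inside a with block.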



Scraping multiple pages

In [None]:
import requests
from bs4 import BeautifulSoup
import csv


with open("air_fryers_multi_page2.csv", mode="w", newline="", encoding="utf-8") as file:
    writer = csv.writer(file)
    writer.writerow(["Product Name", "New Price", "Old Price"])  # CSV Header

    # We modify our code to loop through the pages and scrape each one; everything else remains the same
    for page in range(1, 7):  # Pages 1 to 6; raise the upper bound to scrape more pages
        #print(f"Scraping Page {page}...")

        url = f"https://www.jumia.co.ke/catalog/?q=air+fryer&page={page}"
        headers = {
            "User-Agent": "Mozilla/5.0"
        }

        response = requests.get(url, headers=headers)
        soup = BeautifulSoup(response.content, "html.parser")

        products = soup.find_all("article", class_="prd")

        for product in products:
            name = product.find("h3", class_="name")
            new_price = product.find("div", class_="prc")
            old_price = product.find("div", class_="old")

            if name and new_price:
                product_name = name.text.strip()
                current_price = new_price.text.strip()
                original_price = old_price.text.strip() if old_price else "Not listed"

                writer.writerow([product_name, current_price, original_price])
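
The page URLs in the loop above are built by hand with an f-string. As an alternative sketch, urllib.parse.urlencode from the standard library can build the query string and takes care of encoding spaces and special characters in the search term. The helper name `build_page_url` is hypothetical.

```python
from urllib.parse import urlencode

BASE_URL = "https://www.jumia.co.ke/catalog/"


def build_page_url(query, page):
    """Build the search URL for one results page, encoding the query safely."""
    query_string = urlencode({"q": query, "page": page})
    return f"{BASE_URL}?{query_string}"


# One URL per results page, for pages 1 to 6
urls = [build_page_url("air fryer", page) for page in range(1, 7)]

print(urls[0])
print(len(urls))
```

Each URL can then be passed to requests.get(url, headers=headers) exactly as in the loop above.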
