# The Jumia Product Data Scraper

This a scraper that allows you to scrape product links from the top selling division of https://jumia.co.ke homepage, follow the links and scrape for product data from all these product links. This data can then be used to determine popularity of a product and make recommendations to sellers about the sort of products they should be selling.

Here is the product data that this scraper will collect:
- Product Name
- Brand Name
- Price (Ksh)
- Discount if Available (%)
- Total Number of Reviews
- The Product Rating (out of 5)

*Note* that we make certain assumptions to deduce insights from our product data:
1. Number of reviews correlates directly to the number of product purchases. This data point will help us determine the popularity of a particular product.
2. The product rating does not depict the actual customer satisfaction given the disparity in review count. Therefore, we add one negative and one positive review to better estaimate the customer satisfaction from a product (the more accurate product rating)

In [339]:
import re
import csv
import requests
import pandas as pd
from bs4 import BeautifulSoup

#### Test Whether Site Can Be Scraped

In [267]:
url = input("Enter your url:")

response = requests.get(url)

print("This URL can be scraped: ", response.ok)
print("The Status Code is: ", response.status_code)

Enter your url: https://jumia.co.ke


This URL can be scraped:  True
The Status Code is:  200


## Product Page Scraper

### Retrieve the Product Name

In [268]:
def getproductname(url):
    
    #convert url to a soup object
    response = requests.get(url)
    html_data = response.text
    soup = BeautifulSoup(html_data, "html")

    #Extract product name from soup
    product_page = soup.title.text
    product_data = re.findall("(.+) @",product_page)
    product_name = product_data[0]

    return product_name

### Retrieve the Brand Name

In [269]:
def getbrandname(url):
    #convert url to a soup object
    response = requests.get(url)
    html_data = response.text
    soup = BeautifulSoup(html_data, "html")
    
    #extract brand name from soup
    brand_div = str(soup.find_all("div",{"class":"-fs14 -pvxs"}))
    brand_list = re.findall('Similar products from (.+)</a></div>',brand_div)
    brand = brand_list[0]
    
    return brand

### Retrieve the Price Data – Price and Discount

In [270]:
def getpricedata(url):
    #convert url to a soup object
    response = requests.get(url)
    html_data = response.text
    soup = BeautifulSoup(html_data, "html")

    #extract price from soup
    price_div = str(soup.find_all("div",{"class": "-hr -pvs -mtxs"}))
    price_data = re.findall(">KSh ([0-9,.]+)",price_div)
    price = float(re.sub(",", "",price_data[0]))
    
    #extract price discount from soup
    if len(price_data) > 1:
        discount_data = re.findall("([0-9]+)%",price_div)
        discount = float(discount_data[0])/100
    else:
        discount = 0
    
    return price, discount

### Retrieve the Review Data – Review Count and Rating

In [308]:
def getreviewdata(url):
    #convert url to a soup object
    response = requests.get(url)
    html_data = response.text
    soup = BeautifulSoup(html_data, "html")

    #extract rating from soup
    ratings_div = str(soup.find_all("div", {"class": "-df -i-ctr -pvxs"}))
    ratings_data = re.findall("([0-9.]+) out of", ratings_div)
    rating = float(ratings_data[0])

    #extract review count from soup
    reviews_data = re.findall("([0-9]+) rating", ratings_div)
    if len(reviews_data) > 0:
        reviews = int(reviews_data[0])
    else:
        reviews = 0

    return reviews, rating

#### Determine Customer Satisfaction (Actual Rating)

In [324]:
def getactualrating(reviews, rating):
    #Add two reviews (one negative, one positive) to determine satisfaction
    actual_rating = round(((reviews * rating) + 1 + 5)/(reviews + 2),2)

    return actual_rating

### Test Product Page Scraper

In [325]:
url = input("Enter url: ")

name = getproductname(url)
brand = getbrandname(url)
price,dicount = getpricedata(url)
reviews,rating = getreviewdata(url)
actual_rating = getactualrating(reviews, rating)

print(name, brand, price, discount, reviews, rating, actual_rating)

Enter url:  https://www.jumia.co.ke/gold-beer-330ml-24-pcs.-ruhr-gold-mpg227567.html


Ruhr Gold Gold Beer - 330ml (24 Pcs). Ruhr Gold 1250.0 0 48 4.5 4.44


## Top-Selling Products Div Scraper

### List All Top-Selling Product Links

In [294]:
def getproductlinks(url,links): #takes in the url and a list
    #convert url to a soup object
    response = requests.get(url)
    html_data = response.text
    soup = BeautifulSoup(html_data, "html")

    #Extract links from top-products division
    top_items = str(soup.find_all("div", {"class":"crs-w _main -pvs -phxs"}))
    product_links = re.findall('/(\S+)\.html',top_items)

    #Update the list of product links
    for link in product_links:
        link = url + link + ".html"
        links.append(link)

    return links #returns an updated list

### List Product links from Top-Selling Div

In [332]:
def gettopselling(url,links):
    count=len(links)
    top_selling = list()
    error_links = list()
    
    #loop through links scraped from top products div
    for link in getproductlinks(url,links):
        count += 1
        
        try:
            #retrieve product data
            name = getproductname(link)
            brand = getbrandname(link)
            price, discount = getpricedata(link)
            reviews,rating = getreviewdata(link)
            actual_rating = getactualrating(reviews, rating)

            #add product data to list output
            product = [name,brand,price,discount,reviews,rating,actual_rating]
            top_selling.append(product)
        except:
            #list all links that output errors
            error_links.append(count)
        
    return top_selling, error_links

### Test Top-Selling Scraper

In [333]:
url = "https://www.jumia.co.ke/"
links = list()

top_selling, error_links = gettopselling(url,links)

print(top_selling)
print("The number of links with errors:", len(error_links))

[['Ruhr Gold Gold Beer - 330ml (24 Pcs).', 'Ruhr Gold', 1250.0, 0.64, 49, 4.5, 4.44], ['Kabras Premium White Sugar - 2kg', 'Kabras', 200.0, 0.13, 1027, 4.5, 4.5], ['Lifebuoy Antibacterial Hand Sanitizer - 50ml', 'Lifebuoy', 135.0, 0.1, 143, 4.4, 4.38], ['A General 1x Anti-dust Mouth Face Mask Cycling Surgical Respirator Adult Reusable', 'A General', 230.0, 0.47, 0, 0.0, 3.0], ['Nice & Lovely Hand Sanitizing Gel - 65 Ml', 'Nice &amp; Lovely', 123.0, 0.05, 39, 4.5, 4.43], ['Jumia Chakula Box (One Click, One Delivery)', 'Jumia', 1499.0, 0, 8, 3.6, 3.48], ['Ajab All-Purpose Fortified Wheat Flour 2Kg', 'Ajab', 113.0, 0.05, 29, 4.8, 4.68], ['Dairy Dairy Top Milk 500ml-A  Pack of 12 Pieces', 'Dairy', 470.0, 0.22, 188, 4.6, 4.58], ['Top Fry Cooking Oil - 3 Litres', 'Top Fry', 480.0, 0.05, 233, 4.6, 4.59], ['Omo Hand Washing Powder Extra Fresh - 3.5kg', 'Omo', 665.0, 0.3, 64, 4.8, 4.75], ['Jogoo Maize Meal  - 2kg', 'Jogoo', 124.0, 0, 567, 4.4, 4.4], ['Exe All-Purpose Fortified Wheat Flour - 2Kg

## Save and Preview Product Data

### Export Product Data to CSV

In [337]:
with open('jumia_products.csv', 'w') as jumia_file:
    fieldnames = ["name", "brand", "price", "discount", "reviews", "rating", "actual_rating"]
    
    csvwriter = csv.writer(jumia_file)
    csvwriter.writerow(fieldnames)
    
    #loop through product list to update csv file
    for product in top_selling:
        csvwriter.writerow(product)
        
    print("Done! All products have been added to CSV file")

Done! All products have been added to CSV file


### Preview Product DataFrame

In [346]:
jumia = pd.read_csv("jumia_products.csv")

jumia.head(11)

Unnamed: 0,name,brand,price,discount,reviews,rating,actual_rating
0,Ruhr Gold Gold Beer - 330ml (24 Pcs).,Ruhr Gold,1250.0,0.64,49,4.5,4.44
1,Kabras Premium White Sugar - 2kg,Kabras,200.0,0.13,1027,4.5,4.5
2,Lifebuoy Antibacterial Hand Sanitizer - 50ml,Lifebuoy,135.0,0.1,143,4.4,4.38
3,A General 1x Anti-dust Mouth Face Mask Cycling...,A General,230.0,0.47,0,0.0,3.0
4,Nice & Lovely Hand Sanitizing Gel - 65 Ml,Nice &amp; Lovely,123.0,0.05,39,4.5,4.43
5,"Jumia Chakula Box (One Click, One Delivery)",Jumia,1499.0,0.0,8,3.6,3.48
6,Ajab All-Purpose Fortified Wheat Flour 2Kg,Ajab,113.0,0.05,29,4.8,4.68
7,Dairy Dairy Top Milk 500ml-A Pack of 12 Pieces,Dairy,470.0,0.22,188,4.6,4.58
8,Top Fry Cooking Oil - 3 Litres,Top Fry,480.0,0.05,233,4.6,4.59
9,Omo Hand Washing Powder Extra Fresh - 3.5kg,Omo,665.0,0.3,64,4.8,4.75


In [344]:
jumia.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 26 entries, 0 to 25
Data columns (total 7 columns):
name             26 non-null object
brand            26 non-null object
price            26 non-null float64
discount         26 non-null float64
reviews          26 non-null int64
rating           26 non-null float64
actual_rating    26 non-null float64
dtypes: float64(4), int64(1), object(2)
memory usage: 1.5+ KB
