<a href="https://colab.research.google.com/github/JaimeOlivaresNV/BizOpsExperimental/blob/main/Lazada_Shopee_Scrape_Prototype.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Lazada Scrape

I’ve been told frequently that between the two major e-commerce platforms in Malaysia (Lazada and Shopee), one is generally cheaper and attracts bargain hunters while the other generally caters to the less price sensitive.
Well, I’ve decided to find out myself… in a battle of the e-commerce platforms!
To do so I’ll be writing a Python script using Selenium and the Chrome web driver to automate the scraping process and build our dataset. Here, we will be scraping for the following:
- Product name; and
- Product price
I will then conduct some basic data analysis using Pandas on the dataset we have scraped. As part of this exercise, some data cleaning will also be required and at the end of the exercise I will be presenting the price comparison on a simple visual chart using Matplotlib and Seaborn.
Between the two platforms, I’ve found the Shopee website more difficult to scrape for data for a couple of reasons: (1) it contains annoying popup boxes which appear when entering the page; and (2) the website class elements are not as well defined (some elements have multiple classes).
For this reason we will start with scraping the Lazada website first, and then we will deal with Shopee in Part 2!
First, we import the necessary packages:

In [1]:
!pip install selenium
!apt-get update # to update ubuntu to correctly run apt install
!apt install chromium-chromedriver
!cp /usr/lib/chromium-browser/chromedriver /usr/bin
import sys
sys.path.insert(0,'/usr/lib/chromium-browser/chromedriver')
# Web Scraping
from selenium import webdriver
from selenium.common.exceptions import *
# Data manipulation
import pandas as pd
# Visualization
import matplotlib.pyplot as plt
import seaborn as sns

# chrome_options = webdriver.ChromeOptions()
# chrome_options.add_argument('--headless')
# chrome_options.add_argument('--no-sandbox')
# chrome_options.add_argument('--disable-dev-shm-usage')
# wd = webdriver.Chrome('chromedriver',chrome_options=chrome_options)
# wd.get("https://www.webite-url.com")

Collecting selenium
  Downloading selenium-3.141.0-py2.py3-none-any.whl (904 kB)
Installing collected packages: selenium
Successfully installed selenium-3.141.0

We then initialize the global variables, in this case:
- the path of the Chrome webdriver;
- the website url; and
- the item we want to search.

In [2]:
# webdriver_path = 'C://Users//me//chromedriver.exe' # Enter the file directory of the Chromedriver
Lazada_url = 'https://www.lazada.com.ph'
search_item = 'Nescafe Gold refill 170g' # Chose this because I often search for coffee!

A little bit about the options. The ‘--headless’ argument allows you to run the script with the browser operating in the background. Normally I would recommend not adding this argument to your Chrome options, so that you will be able to see the automation in action and identify bugs more easily. The downside to this is that it’s less efficient, of course!

The other arguments, ‘start-maximized’, ‘disable-infobars’ and ‘--disable-extensions’, are added to ensure a smoother operation of the browser (extensions that interfere with webpages especially can derail the automation process).


Running this short block of code will open the browser.
Once the browser is opened, we will need to automate the searching of the item. The Selenium tool allows you to find browser HTML elements using various methods including the id, class, CSS selectors, and also XPath which is an XML path expression.

But how do you identify which elements to find? An easy way to do this is to use Chrome’s very own inspect tool:

In [20]:
!pip install kora -q
from kora.selenium import wd



In [26]:
wd.get(Lazada_url)

# Display webdriver's current url
print(wd.current_url)


https://bixi.alicdn.com/punish/70571.html?uuid=b854260074946501af16bce4bc49ee12&action=deny


https://bixi.alicdn.com/punish/70571.html?uuid=3a075efa155b11cf7fc20191c91abad8&action=deny


In [19]:
from selenium.webdriver.common.keys import Keys  # needed for Keys.RETURN below
driver = webdriver.Firefox()
driver.get("http://www.python.org")
assert "Python" in driver.title
elem = driver.find_element_by_name("q")
elem.clear()
elem.send_keys("pycon")
elem.send_keys(Keys.RETURN)
assert "No results found." not in driver.page_source
driver.close()

WebDriverException: ignored

In [16]:
## Select custom Chrome options
options = webdriver.ChromeOptions()
options.add_argument('--headless') 
options.add_argument('start-maximized')
options.add_argument('--no-sandbox')
# options.add_argument('--disable-dev-shm-usage') 
options.add_argument('disable-infobars')
options.add_argument('--disable-extensions')
# Open the Chrome browser
browser = webdriver.Chrome('chromedriver', options=options)
browser.get(Lazada_url)
browser.maximize_window() # For maximizing window
browser.implicitly_wait(20) # gives an implicit wait for 20 seconds
# chrome_options = webdriver.ChromeOptions()
# chrome_options.add_argument('--headless')
# chrome_options.add_argument('--no-sandbox')
# chrome_options.add_argument('--disable-dev-shm-usage')
# wd = webdriver.Chrome('chromedriver',chrome_options=chrome_options)
# wd.get("https://www.webite-url.com")

You can open the inspect tool with CTRL+SHIFT+I. Use the element selector identified in the red circle to hover around elements you want to find. Here we can find that the search bar has an id = ‘q’ (seen within the red box).
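To make the idea of "finding an element by its id" concrete without opening a browser, here is a minimal sketch using only Python's standard-library `html.parser`. The HTML snippet and the `IdFinder` class are made up for illustration (only the id ‘q’ comes from the inspection above); Lazada's real markup is far more complex, and Selenium performs this kind of lookup against the live page for you.

```python
from html.parser import HTMLParser

# Hypothetical, simplified markup; only the id 'q' comes from the inspection above.
SAMPLE_HTML = '<form><input type="search" id="q" name="q" placeholder="Search"></form>'


class IdFinder(HTMLParser):
    """Collects the tag name and attributes of the first element with a given id."""

    def __init__(self, wanted_id):
        super().__init__()
        self.wanted_id = wanted_id
        self.match = None  # becomes (tag, attribute dict) once found

    def handle_starttag(self, tag, attrs):
        attr_map = dict(attrs)
        if self.match is None and attr_map.get("id") == self.wanted_id:
            self.match = (tag, attr_map)


def find_by_id(html, wanted_id):
    """Return (tag, attributes) for the element with this id, or None."""
    finder = IdFinder(wanted_id)
    finder.feed(html)
    return finder.match


found = find_by_id(SAMPLE_HTML, "q")
print(found)
```

If the id is not present, `find_by_id` returns `None`, which is roughly the situation in which Selenium raises `NoSuchElementException`.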

In [18]:
browser.maximize_window() # For maximizing window
browser.implicitly_wait(20) # gives an implicit wait for 20 seconds
search_bar = browser.find_element_by_id('q')
# send_keys() returns None, so type the query and submit in two separate calls
search_bar.send_keys(search_item)
search_bar.submit()

NoSuchElementException: ignored

Okay so that’s the easy part. Now comes the part which can be challenging, and even more so if you’re trying to scrape from the Shopee website!

To figure out how you would scrape the item names and prices from Lazada, imagine how you would do it manually. What you might do is this:

1.   Copy each item name and its price into a spreadsheet table;
2.   Go to the next page and repeat the first step until you’ve reached the last page.

That’s exactly how we would do it as well in this automation process! To do so, we will need to find the elements containing the item names and prices, and also the next page button.

Using the same Chrome inspect tool, we can see that the product titles and prices have the class names ‘c16H9d’ and ‘c13VH6’ respectively. It’s important to check that the same class names apply to all the items on the page, in order to ensure successful scraping of all the items on the page.

In [None]:
item_titles = browser.find_elements_by_class_name('c16H9d')
item_prices = browser.find_elements_by_class_name('c13VH6')

In [None]:
'''
Next, we unpack the item_titles and item_prices variables onto lists:
'''

# Initialize empty lists
titles_list = []
prices_list = []
# Loop over the item_titles and item_prices
for title in item_titles:
  titles_list.append(title.text)
for price in item_prices:
  prices_list.append(price.text)

**Sample Output**

[‘NESCAFE GOLD Refill 170g x2 packs’, ‘NESCAFE GOLD Original Refill Pack 170g’, ‘Nescafe Gold Refill Pack 170g’, ‘NESCAFE GOLD Refill 170g’, ‘NESCAFE GOLD REFILL 170g’, ‘NESCAFE GOLD Refill 170g’, ‘Nescafe Gold Refill 170g’, ‘[EXPIRY 09/2020] NESCAFE Gold Refill Pack 170g x 2 — NEW PACKAGING!’, ‘NESCAFE GOLD Refill 170g’] 
[‘RM55.00’, ‘RM22.50’, ‘RM26.76’, ‘RM25.99’, ‘RM21.90’, ‘RM27.50’, ‘RM21.88’, ‘RM27.00’, ‘RM26.76’, ‘RM23.00’, ‘RM46.50’, ‘RM57.30’, ‘RM28.88’]



Once we’re done scraping from this page, let’s move on to the next page. 

Again here we will use the find_element method, but this time using XPath. 

Using XPath is necessary here because the next page button has two classes, and the find_element_by_class_name method only finds elements from a single class.

Also important to note here, we need to tell the browser what to do if the next page button is disabled (meaning if the results are shown only in one page or if we’ve reached the end page of the results).

In [None]:
'''
Here we’ve instructed the browser to close if the button is disabled. If it’s not disabled, it will proceed to the next page and we will then need to repeat the scraping process.
'''
try:
  browser.find_element_by_xpath('//*[@class="ant-pagination-next" and not(@aria-disabled)]').click()
except NoSuchElementException: 
  browser.quit()
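Putting the two steps together, the whole scrape is a loop: collect the items on the current page, then click "next" until the button is no longer available. The sketch below shows only that control flow. It uses a made-up `FakeBrowser` class in place of Selenium so that it runs anywhere; `scrape_all_pages`, `FakeBrowser` and `NoNextPage` are hypothetical names, the title/price pairings are illustrative, and in the real script the corresponding calls would be the `find_elements_by_class_name` and `find_element_by_xpath` lines above.

```python
class NoNextPage(Exception):
    """Stand-in for Selenium's NoSuchElementException in this sketch."""


class FakeBrowser:
    """Hypothetical stand-in for the Selenium browser: serves canned pages."""

    def __init__(self, pages):
        self.pages = pages  # list of pages, each a list of (title, price) pairs
        self.current = 0

    def items_on_page(self):
        return self.pages[self.current]

    def click_next(self):
        # Mirrors the try/except above: fail when there is no enabled next button.
        if self.current + 1 >= len(self.pages):
            raise NoNextPage()
        self.current += 1


def scrape_all_pages(browser):
    """Collect titles and prices from every page until 'next' is unavailable."""
    titles, prices = [], []
    while True:
        for title, price in browser.items_on_page():
            titles.append(title)
            prices.append(price)
        try:
            browser.click_next()
        except NoNextPage:
            break  # last page reached; stop scraping
    return titles, prices


pages = [
    [("NESCAFE GOLD Refill 170g", "RM25.99"), ("Nescafe Gold Refill Pack 170g", "RM26.76")],
    [("NESCAFE GOLD REFILL 170g", "RM21.90")],
]
all_titles, all_prices = scrape_all_pages(FakeBrowser(pages))
print(len(all_titles), all_prices)
```

Keeping the loop separate from the element lookups makes it easier to test the stopping condition without hitting the website.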



We now begin to analyze the data we’ve scraped using Pandas. We start by converting the two lists into a dataframe:

In [None]:
dfL = pd.DataFrame(zip(titles_list, prices_list), columns=['ItemName', 'Price'])
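One thing to watch for: `zip` stops at the shorter of its inputs. If the two lists have different lengths (the sample output above shows 9 titles but 13 prices), the extra entries are dropped silently and titles may end up paired with the wrong prices. A small check like the one below makes the mismatch visible before the dataframe is built. `pair_items` is a hypothetical helper name, and the sample pairings are illustrative only.

```python
def pair_items(titles, prices):
    """Pair titles with prices, refusing to guess when the counts differ."""
    if len(titles) != len(prices):
        raise ValueError(
            "got {} titles but {} prices; check the class names".format(
                len(titles), len(prices)
            )
        )
    return list(zip(titles, prices))


titles = ["NESCAFE GOLD Refill 170g", "Nescafe Gold Refill Pack 170g"]
prices = ["RM25.99", "RM26.76", "RM21.90"]

# Plain zip silently drops the third price:
print(list(zip(titles, prices)))

try:
    pair_items(titles, prices)
except ValueError as err:
    print("mismatch:", err)
```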

While the dataset looks good, it isn’t very clean. If you print the information of the dataframe using the Pandas .info() method, it shows that the Price column is of string (object) type rather than float. This is expected, as each entry in the Price column contains the currency symbol ‘RM’ (Malaysian Ringgit). However, unless the Price column is an integer or float column, we will not be able to compute any statistics on it.
We will therefore need to remove the currency symbol and convert the entire column into a float type with the following method:

In [None]:
dfL['Price'] = dfL['Price'].str.replace('RM', '').astype(float)
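The same conversion can be written as a plain function, which is handy for checking edge cases before applying it to the whole column. This is only a sketch: `parse_price` is a hypothetical helper, and the thousands-separator handling is an assumption, since none of the prices scraped here are that large.

```python
def parse_price(text, currency="RM"):
    """Convert a price string such as 'RM22.50' into a float."""
    cleaned = text.strip()
    if cleaned.startswith(currency):
        cleaned = cleaned[len(currency):]
    # Assumption: a comma, if present, is a thousands separator (e.g. 'RM1,250.00').
    cleaned = cleaned.replace(",", "")
    return float(cleaned)


sample = ["RM55.00", "RM22.50", " RM26.76 "]
parsed = [parse_price(p) for p in sample]
print(parsed)
```

If you prefer to stay in pandas, the function can be applied to the column with `dfL['Price'].apply(parse_price)`.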

Awesome! However, there’s still some more cleaning to be done. You might have noticed an anomaly in the dataset. One of the items is actually a twin pack, which we will need to remove from our dataset.
Data cleaning is essential for any sort of data analysis and here we will weed out entries that we don’t want with the following:

In [None]:
# This removes any entry with 'x2' in its title
dfL = dfL[dfL['ItemName'].str.contains('x2') == False]

Although unnecessary here, you may also want to ensure that the items that appear are the items we specifically searched for. Sometimes other related products may appear in your search list, especially if your search term isn’t specific enough.
For example, if we had searched ‘nescafe gold refill’ instead of ‘nescafe gold refill 170g’, 117 items would have appeared instead of just the 9 we scraped earlier. The additional items aren’t the refill packs we were searching for, but rather capsule filter cups instead. Nonetheless, it doesn’t hurt to filter your dataset again with your search term:

In [None]:
dfL = dfL[dfL['ItemName'].str.contains('170g') == True]

As a final touch, we will also create a column ‘Platform’ and assign ‘Lazada’ to each of the entries here. This is done so that we can later group the entries by platforms (Lazada and Shopee) when we later conduct the price comparison between the two platforms.

In [None]:
dfL['Platform'] = 'Lazada'

Now it’s time to visualize our data with Matplotlib and Seaborn. We will be using a box plot, as it shows all of the following key statistical features (also known as the five-number summary) in one chart:
- Lowest price
- Highest price
- Median price
- 25th and 75th percentile price
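To see what those five numbers look like before plotting, here is a small sketch that computes them with the standard-library `statistics` module. The prices are the ones from the sample output earlier with ‘RM’ stripped; `five_number_summary` is a hypothetical helper, and the choice of the "inclusive" quantile method is an assumption on my part rather than something the box plot dictates.

```python
import statistics


def five_number_summary(values):
    """Return (min, Q1, median, Q3, max) for a list of numbers."""
    ordered = sorted(values)
    q1, median, q3 = statistics.quantiles(ordered, n=4, method="inclusive")
    return ordered[0], q1, median, q3, ordered[-1]


# Prices taken from the sample output earlier, with 'RM' already stripped
prices = [55.00, 22.50, 26.76, 25.99, 21.90, 27.50, 21.88,
          27.00, 26.76, 23.00, 46.50, 57.30, 28.88]
summary = five_number_summary(prices)
print(summary)
```

Note that this list still includes the twin packs, which is why the highest price sits so far from the median.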

In [None]:
# Plot the chart
sns.set()
_ = sns.boxplot(x='Platform', y='Price', data=dfL)
_ = plt.title('Comparison of Nescafe Gold Refill 170g prices between e-commerce platforms in Malaysia')
_ = plt.ylabel('Price (RM)')
_ = plt.xlabel('E-commerce Platform')
# Show the plot
plt.show()

# Shopee Scrape

Scraping Shopee is a bit more difficult, for the following reasons:

1.   Popup alerts (additional complexity = low)
2.   Multiple prices for the same item (additional complexity = low)
3.   Search returns 50 items per page but only 15 were selected (additional complexity = high)
4.   The item name elements cannot be selected

Instead, we can use the Shopee API.

I was incredibly lucky to come across this on the web. Not all sites will have an API, or will share it with you. As Shopee allows you to use their API to fetch product details directly, it is much easier to use this than to automate the scraping process with Selenium. The following short block of code does the job:



In [None]:
import requests
Shopee_url = 'https://shopee.com.my'
keyword_search = 'Nescafe Gold refill 170g'
headers = {
 'User-Agent': 'Chrome',
 'Referer': '{}/search?keyword={}'.format(Shopee_url, keyword_search)  # '/' added between host and path
}
url = 'https://shopee.com.my/api/v2/search_items/?by=relevancy&keyword={}&limit=100&newest=0&order=desc&page_type=search'.format(keyword_search)
# Shopee API request
r = requests.get(url, headers = headers).json()
# Shopee scraping script
titles_list = []
prices_list = []
for item in r['items']:
    titles_list.append(item['name'])
    prices_list.append(item['price_min'])
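The keyword contains spaces, so it can be worth building the query string with `urllib.parse.urlencode` rather than `str.format`, which leaves the escaping to whatever sends the request. The sketch below only assembles the URL and sends nothing. The endpoint and parameter names are copied from the cell above, `build_search_url` is a hypothetical helper, and whether this endpoint still responds in the same way is not something the sketch checks.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

SHOPEE_URL = "https://shopee.com.my"


def build_search_url(keyword, limit=100):
    """Assemble the search URL used above with a properly escaped keyword."""
    params = {
        "by": "relevancy",
        "keyword": keyword,
        "limit": limit,
        "newest": 0,
        "order": "desc",
        "page_type": "search",
    }
    return "{}/api/v2/search_items/?{}".format(SHOPEE_URL, urlencode(params))


url = build_search_url("Nescafe Gold refill 170g")
print(url)
```

With `requests`, passing the same dictionary as `params=` to `requests.get` achieves the same escaping without building the string by hand.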

In [None]:
# Named dfS because the cleaning cells below refer to dfS
dfS = pd.DataFrame(zip(titles_list, prices_list), columns=['ItemName', 'Price'])

As with the Lazada dataset, we will also need to conduct some cleaning with this dataset. The main things we need to do are the following:
- Transform the price column from integer type into a two decimal float type
- Remove unrelated entries from the dataset (I’m looking for coffee, not collagen eye masks!)
- Remove the twin packs

In [None]:
import re  # needed for re.IGNORECASE below
# Shopee returns prices as integers; divide by 100000 to get the price in RM as a float
dfS['Price'] = dfS['Price'] / 100000
# Remove false entries i.e. those which are not actually Nescafe Gold Refill 170g
dfS = dfS[dfS['ItemName'].str.contains('170g') == True] # Poor search function Shopee!!!
# Some of the items are actually x2 packs. Remove them too
dfS = dfS[dfS['ItemName'].str.contains(r'[2x\s]{3}|twin', flags=re.IGNORECASE, regex=True) == False]
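A caveat on the pattern above: `[2x\s]{3}` is a character class, so it matches any three consecutive characters drawn from `2`, `x` and whitespace, not specifically "x2" or "2x". A stricter alternative is sketched below. `MULTIPACK` and `is_multipack` are hypothetical names, and the pattern is an assumption about how sellers label twin packs, based only on the Lazada titles shown earlier (the second title is shortened here).

```python
import re

# Assumed labelling styles for twin packs: 'x2', 'x 2', '2x', '2 x' or the word 'twin'.
MULTIPACK = re.compile(r"\bx\s*2\b|\b2\s*x\b|twin", flags=re.IGNORECASE)


def is_multipack(title):
    """True if the title looks like a two-pack listing."""
    return MULTIPACK.search(title) is not None


titles = [
    "NESCAFE GOLD Refill 170g x2 packs",
    "[EXPIRY 09/2020] NESCAFE Gold Refill Pack 170g x 2",
    "NESCAFE GOLD Refill 170g",
]
singles = [t for t in titles if not is_multipack(t)]
print(singles)
```

On the dataframe this could be used as `dfS[~dfS['ItemName'].apply(is_multipack)]`. It will still miss labels written without a word break, such as "170gx2", so it is worth eyeballing the titles that remain.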

# Combine Datasets

Now let's combine the Lazada and Shopee datasets! We do this by using the pandas concatenation method:


In [None]:
# Add column ['Platform'] for each platform
dfL['Platform'] = 'Lazada'
dfS['Platform'] = 'Shopee'
# Concatenate the Dataframes
df = pd.concat([dfL,dfS])

In [None]:
print(df.groupby(['Platform']).describe())

In [None]:
sns.set()
_ = sns.boxplot(x='Platform', y='Price', data=df)
_ = plt.title('Comparison of Nescafe Gold Refill 170g prices between e-commerce platforms in Malaysia')
_ = plt.ylabel('Price (RM)')
_ = plt.xlabel('E-commerce Platform')
# Show the plot
plt.show()

In [None]:
# Add a timestamp column to the combined dataframe so it is included in the saved file
df['datetime'] = pd.Timestamp.today()
# Save dataframe to a csv file
timestamp = str(pd.Timestamp.today()).replace(":", ".")
df.to_csv('PriceComparison_{}.csv'.format(timestamp))
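Replacing the colons works, but the resulting file name still contains a space and fractional seconds. As an alternative sketch, `strftime` can produce a compact, filesystem-friendly stamp directly. `csv_name` is a hypothetical helper, and the exact format string is just one reasonable choice.

```python
from datetime import datetime


def csv_name(prefix="PriceComparison", now=None):
    """Build a file name such as 'PriceComparison_20210901_123456.csv'."""
    now = now or datetime.now()
    return "{}_{}.csv".format(prefix, now.strftime("%Y%m%d_%H%M%S"))


example_name = csv_name(now=datetime(2021, 9, 1, 12, 34, 56))
print(example_name)
```

The result can be passed straight to the save step, for example `df.to_csv(csv_name(), index=False)`.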