# Lyst Scraper: a quick tutorial on web scraping

This notebook is a quick tutorial on how to scrape a website. I am not a web scraping expert at all (I am still mostly a beginner), so this tutorial is aimed mostly at beginners. However, I tried not to focus only on the __code__ but also on other aspects of web scraping. You can find plenty of complete tutorials on Beautiful Soup or Selenium online. Instead, I wanted to give a glimpse of how to scrape a website and share some useful tricks I learnt during my work. I am a __machine learning researcher__, not a scraping specialist or a web programmer, but I think web scraping is a useful skill for data scientists and machine learning students who need to build datasets or simply gather data. For other examples, you can check the other scrapers on my GitHub; they are more detailed from a syntactic point of view.

# Part 1: Web scraping?

Before you start coding, it is important to understand what a web scraper is, and how and __when__ to use it. More importantly, web scraping has a tricky __legal__ side, so be sure to read these few lines before you start playing with data.

## What is a web scraper?

A web scraper is a program that __extracts__ data from the World Wide Web. In all my examples, I use web scrapers to download images and build datasets, but they can be used to extract pretty much any type of data you can find online.
Web scraping is different from __web crawling__, which basically means following links to gather and index everything you can find on a website. Since a crawler visits all the links on a page, I still use the term __crawl__ when a scraper visits a link.

## Why would you scrape a website?

There are many reasons to scrape a website. Personally, I started scraping to build datasets for my __machine learning__ projects. But as I got into web scraping, I enjoyed it so much that I started to scrape some websites for fun (like Google Trends or op.gg for instance).

## The big question: is web scraping legal?

Well... this is a good one. I am no lawyer, so I can't give you a detailed answer here, but here's what I understood after spending some time looking for a definitive answer: there is no __definitive__ answer. It heavily depends on what you do with the data you scrape. Here's a good blog post on the subject: https://benbernardblog.com/web-scraping-and-crawling-are-perfectly-legal-right/. But always keep in mind that the fact that data is publicly accessible __does not mean you have the right to scrape it and use it for your own projects__. Some simple advice:
-  If you're not sure, ask the owner of the website for __permission__ to scrape it
-  If you think the data might be sensitive, __ask a specialized lawyer__
-  Follow the rules specified in the __robots.txt__ file. Most websites publish this plain-text file to tell automated bots which pages they may visit. You can find it at _http://www.website.com/robots.txt_. You may get nice surprises, as some websites allow web scraping to some extent (e.g. op.gg for League of Legends statistics).
-  Be __polite__. I will insist a lot on this during this tutorial, but servers are not made to handle 10 requests per second from a single user. This is not normal behaviour for a human being, and it can seriously harm a website. If you're taking their data - especially without permission - at least do it right.
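
As a minimal sketch of the robots.txt advice, Python's standard library can parse such a file and tell you whether a path is allowed. The robots.txt content and the URLs below are made up for illustration; for a real site you would download the file first (for example with `set_url` and `read`).

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, used only to keep the example offline
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /account/
Crawl-delay: 2
"""

def build_parser(robots_txt):
    """Parse the text of a robots.txt file into a RobotFileParser."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser

parser = build_parser(SAMPLE_ROBOTS_TXT)

# Ask whether a generic bot ("*") may fetch each (hypothetical) URL
allowed = parser.can_fetch("*", "https://www.website.com/products/")
blocked = parser.can_fetch("*", "https://www.website.com/account/settings")

# Some sites also state how many seconds to wait between requests
delay = parser.crawl_delay("*")

print(allowed, blocked, delay)
```

If `crawl_delay` returns a number, it is a reasonable lower bound for the sleep between two requests.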
   

# Part 2: "Naive" web scraping

This "naive" part is very important. I call it naive because it requires some manual work to understand how to retrieve the data from a website, but this step is essential to understand how a scraper works, especially if - like me - you don't know much about web programming.

Keep in mind that this notebook is not designed to be a complete all-in-one tutorial for beginners. You can find plenty of tutorials on Beautiful Soup or Selenium syntax on the web, so I won't detail each line of code I wrote. This notebook is written by a beginner for beginners; it is here to give a glimpse of how to scrape a website, along with some useful tips I found by myself.

During this tutorial, we will try to get images from _lyst.fr_, an online fashion shopping website.

First, let's import the libraries we're gonna use during our project. They are all straightforward to install using __pip__, except Selenium, which also needs a browser driver; you can find everything in the Selenium documentation online.
During this tutorial, I use Beautiful Soup, a Python library for extracting data from HTML files. Here's the link to the documentation: https://www.crummy.com/software/BeautifulSoup/bs4/doc/.
Beautiful Soup is a very useful and complete tool for basic web scraping, but it cannot handle some more complex tasks. For example, _lyst.fr_ features an __infinite scroll__, which requires an automated browser driven by Selenium (documentation at https://www.seleniumhq.org/). Installation tip: if you use Firefox, go to https://github.com/mozilla/geckodriver/releases and get the geckodriver matching your Firefox version.


In [None]:
from bs4 import BeautifulSoup
import requests
import urllib.request
import sys
import os
import traceback
import re
import time
import json
import unidecode
import shutil

from selenium import webdriver
from selenium.webdriver.common.keys import Keys

When you scrape a website using Selenium or Beautiful Soup, __your__ computer is accessing the data. It seems dumb to state this, but remember that a website expects human users to access its data, not automated bots. That's why we need to disguise our bot to make it look like a normal human user (_spoiler: we won't succeed_).

Headers are the __fingerprint__ of your browser. I always specify my true headers (in my case, Firefox headers) when I scrape a website, because they don't give away much info about me and they look more realistic. You can find your headers using this website https://www.whatismybrowser.com/detect/what-http-headers-is-my-browser-sending; look for the __USER_AGENT__ info.

We're gonna try with a single set of headers first. If this doesn't work, we can try proxy rotation and multiple headers to hide our identity. But since we are not going to scrape a lot of sensitive data, we might get away without them. I've been blocked from several websites because I was careless before, so we should be careful. Anyway, when you crawl a website, remember that your requests are not normal behaviour for a human user. You may send 10 requests per second, which is way too much for a human user and might harm the website's server. Please, be polite and add some sleep to relieve the servers.
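
One way to make the "be polite" rule hard to forget is to put the waiting logic in a small helper instead of scattering `time.sleep` calls. The sketch below is only an illustration: the `PoliteThrottle` class, its parameters and the delay values are my own assumptions, not something the rest of this notebook uses.

```python
import random
import time

class PoliteThrottle:
    """Keep consecutive requests at least `min_delay` seconds apart."""

    def __init__(self, min_delay=1.0, jitter=0.5, clock=time.monotonic, sleep=time.sleep):
        self.min_delay = min_delay
        self.jitter = jitter          # extra random wait, up to this many seconds
        self._clock = clock           # injectable, which makes the class easy to test
        self._sleep = sleep
        self._last_call = None

    def wait(self):
        """Sleep if the previous request was too recent; return the pause in seconds."""
        pause = 0.0
        if self._last_call is not None:
            elapsed = self._clock() - self._last_call
            target = self.min_delay + random.uniform(0, self.jitter)
            pause = max(0.0, target - elapsed)
            if pause > 0:
                self._sleep(pause)
        self._last_call = self._clock()
        return pause

throttle = PoliteThrottle(min_delay=1.0, jitter=0.5)

# Intended use (not executed here):
# for url in urls:
#     throttle.wait()
#     response = requests.get(url, headers=headers)
```

The small random jitter avoids sending requests at a perfectly regular rhythm, which is another thing a human user never does.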

In [None]:
# Specify your web browser's headers
headers = {'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:62.0) Gecko/20100101 Firefox/62.0'}

# Specify the location to save the data
DownloadPath = "/media/arthur/DATA/Databases/LystScrapingH"

# We create a dictionary with all cat. and subcat. of the Lyst catalogue
words_to_search =  {
    'shirts':['casual+shirts','formal+shirts'],
    'suits':['2+piece+suits', '3+piece+suits','evening+suits'],
    'jeans':['bootcut+jeans', 'straight+jeans', 'relaxed+jeans', 'tapered+jeans', 'skinny+jeans', 'slim+jeans'],
    'coats':['trench+coats', 'short+coats', 'long+coats',  'parka+coats'],
    'pants':['casual+pants', 'formal+pants'],
    'beachwear':['trunks', 'boardshorts'],
    'knitwear':['cardigans','v+neck', 'crew+neck', 'turtlenecks', 'zipped+sweaters', 'sleeveless+sweaters'],
    'shorts':['bermudas+shorts', 'formal+shorts', 'casual+shorts','cargo+shorts'],
    'underwear':['boxers','socks', 'undershirt','briefs'],
    'sweats':['sweatpants', 'sweatshorts','tracksuits','sweatshirts', 'hoodies'],
    't-shirts':['polo+shirts','sleeveless+t-shirts','short+sleeve+t-shirts','long+sleeve+t-shirts'],
    'jackets':['formal+jackets', 'leather+jackets','waistcoast','casual+jackets','parka+jackets'],
    'nightwear':['pyjamas', 'robes']
}

# Create the directory tree on your hard disk (exist_ok avoids an error when the cell is re-run)
for category, subcategories in words_to_search.items():
    for clothing in subcategories:
        # makedirs also creates the parent category folder if needed
        os.makedirs(os.path.join(DownloadPath, category, clothing), exist_ok=True)
print("Tree created")

url_download_base = "https://cdna.lystit.com/520/650/n/"


Pay attention to the __url_download_base__ variable. This is common when scraping images: websites often show you only small images, so when you __inspect__ the element, you get the source of a small image. However, the website usually stores the image somewhere in better quality; you just need to dig a little to find it.
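
To make the idea concrete, here is a small sketch of how a thumbnail URL could be turned into a full-size one. It assumes that thumbnail URLs start with three size-related path segments (something like `300/375/tr/`), which is only my guess at the layout; the example URL is invented, and `full_size_url` is a hypothetical helper.

```python
from urllib.parse import urlparse

url_download_base = "https://cdna.lystit.com/520/650/n/"

def full_size_url(thumbnail_src, base=url_download_base, prefix_segments=3):
    """Swap the leading size segments of a thumbnail URL for the full-size base."""
    path = urlparse(thumbnail_src).path.lstrip("/")
    parts = path.split("/")
    if len(parts) <= prefix_segments:
        raise ValueError("Unexpected image URL: " + thumbnail_src)
    # Keep everything after the size segments and glue it to the full-size base
    return base + "/".join(parts[prefix_segments:])

# Hypothetical thumbnail URL, for illustration only
example = "https://cdna.lystit.com/300/375/tr/photos/shop/abc123/shirt.jpeg"
print(full_size_url(example))
```

Splitting on path segments is a bit more robust than slicing at a fixed character offset, because it still works if the size numbers change length.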

First, we're gonna do a little test on just the 'Shirts' category. Our goal is to download all the images from this category and save them locally on our computer. Selenium is gonna create an instance of our web browser (in my case, Mozilla Firefox), which will be controlled by this script. Be careful which retriever you use for the images: for example, I first tried urllib.request.urlretrieve, but I got blocked by the website. If that happens, try another retriever (here, requests) or build your own.

In [None]:
to_scrap = words_to_search['shirts']
    
# Use selenium to create a firefox instance
driver = webdriver.Firefox()
extensions = {"jpg", "jpeg", "png"}

# Loop on all the clothes
for clothing in to_scrap:
    url = "https://www.lyst.fr/parcourir/vetements-pour-homme/?category=" + 'shirts' +  "&subcategory=" + clothing
    driver.get(url)
    for _ in range(100):
        driver.execute_script("window.scrollBy(0,document.body.scrollHeight)")
        
        # If the website is a bit buggy, be smart
        time.sleep(0.7)
        driver.execute_script("window.scrollBy(0, -300)")
        
        # Don't forget this sleep to let the image appear on the page
        time.sleep(1.2)
        
    # Now we download all the images. BE POLITE AND DON'T GET BANNED.
    images = driver.find_elements_by_xpath('//div[contains(@class, "product-card__image")]')
    i = 0
    for img in images:
        time.sleep(0.1)
        i+=1
        # Find the image and its src
        image = img.find_element_by_tag_name("img")
        print("Downloading image no", i, "/", len(images))
        
        # We modify the src to access the full size image
        # (a missing src gives None, and slicing None raises an error)
        try:
            image_url = url_download_base + image.get_attribute("src")[35:]
        except Exception:
            print("Source not found.")
            continue
        image_name = image_url.split("/")[-1]
        file_name = DownloadPath + '/shirts/' + clothing + "/" + image_name
        try:
            r = requests.get(image_url, stream=True, headers=headers)
            # The with block closes the file even if the copy fails
            with open(file_name, 'wb') as f:
                shutil.copyfileobj(r.raw, f)
        except Exception as e:
            print("Error while downloading the image:", e)
            continue
        

    print("Downloading over.")


This was a "basic" scraper: it only downloaded the images and saved them where we wanted (and many images can't be downloaded because their source is not specified). Now, we want all the info about each product: retailer, brand, description, URL... Let's write a smarter scraper that doesn't lose all this.

First, we try to get all the data from a single URL and save it in a nice dictionary.

In [None]:
url = 'https://www.lyst.fr/vetements/etoile-isabel-marant-pre-owned-chemise-en-lin/'
html = requests.get(url, headers=headers)
soup = BeautifulSoup(html.text, 'lxml')
dico = {}

# Name of the product
name_brand = soup.find('div', {'itemprop':'brand'}).contents[1].string.replace("\n", "")
dico['brand'] = unidecode.unidecode(name_brand)
print(name_brand)

# Short description
short_description = soup.find('div', {'itemprop':'name'}).contents[0].string.replace("\n", "")
dico['short-description'] = unidecode.unidecode(short_description)
print(short_description)

# Retailer 
retailer = soup.find('a', {'reason':'retailer-link'}).contents[1].string[19::].replace("\n", "")
dico['retailer'] = unidecode.unidecode(retailer)
print(retailer)

# Long description
long_description = soup.find('div', {'class':'product-description__details text-paragraph mb0'}).contents[1].string.replace("\n", "")
dico['long-description'] = unidecode.unidecode(long_description)
print(long_description)

# Url of the main image
url_main_img = soup.find('img', {'class':'image-gallery-main-img'})['src'].replace("\n", "")
dico['url-main-image'] = url_main_img
print(url_main_img)

# Url of thumbnail images (if they exist)
other_imgs = soup.find('div', {'is':'gallery-thumbnails'}).contents

# Other_imgs is a list of length: 2*NB_OF_OTHER_IMGS+1
nb_imgs = int(len(other_imgs)/2)
dico['other-images-url'] = []
for i in range(1, nb_imgs):
    print(other_imgs[2*i+1]['href'])
    dico['other-images-url'].append(other_imgs[2*i+1]['href'])
    
print(dico)


Now that we have Beautiful Soup code that extracts all of this data, we can test it on a full category. We just have to be careful not to download the same image twice. Assuming that an image is located at only one address, we can handle this by keeping a list of the products we have already downloaded.
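
The "keep a list of what we already downloaded" idea can be sketched in a few lines. This is a simplified illustration with made-up URLs: the `unique_urls` helper is hypothetical and uses a set, which makes the "already seen?" check fast even with many products.

```python
def unique_urls(urls):
    """Return the URLs in their original order, skipping any already seen."""
    seen = set()
    result = []
    for url in urls:
        # Strip whitespace so a stray newline does not create a false duplicate
        key = url.strip()
        if key in seen:
            continue
        seen.add(key)
        result.append(key)
    return result

# Hypothetical product URLs, for illustration only
found = [
    "https://www.lyst.fr/vetements/product-a/",
    "https://www.lyst.fr/vetements/product-b/",
    "https://www.lyst.fr/vetements/product-a/",
]
print(unique_urls(found))
```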

In [None]:
# Same start as before, we need to reach the bottom 
to_scrap = words_to_search['shirts']
Base = 'https://www.lyst.fr'
    
# Use selenium to create a firefox instance
driver = webdriver.Firefox()
extensions = {"jpg", "jpeg", "png"}

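# Number of times we scroll to the bottom of the page (100, as in the first test)
number_of_scrolls = 100

# Create a list to save the data in json format
dico = []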


# Loop on all the clothes
for clothing in to_scrap:
    print("Scrolling to the bottom of the page for category", clothing, "...")

    url = "https://www.lyst.fr/parcourir/vetements-pour-homme/?category=" + 'shirts' +  "&subcategory=" + clothing
    driver.get(url)
    
    for _ in range(number_of_scrolls):
        driver.execute_script("window.scrollBy(0,document.body.scrollHeight)")
        
        # If the website is a bit buggy, be smart
        time.sleep(0.7)
        driver.execute_script("window.scrollBy(0, -300)")
        
        # Don't forget this sleep to let the image appear on the page
        time.sleep(1.2)
    print("Done !")
    
    # Now, we're gonna use the Lyst page of the product
    url_products_driver = driver.find_elements_by_xpath('//div[contains(@class, "product-card__details")]')
    for ind, product in enumerate(list(set(url_products_driver))):
        print("Products ready to be scraped for category", clothing, ":", len(list(set(url_products_driver))))
        time.sleep(0.1)
        print("Extracting data for product no", ind, "/", len(url_products_driver))
        dico_product = {}
        prod = product.find_element_by_tag_name("a")
        # Get the url of the product page
        url_prod = prod.get_attribute("href")
        
        # We use Beautiful Soup now, cause it is simpler to use and doesn't open a browser page
        html = requests.get(url_prod, headers = headers)
        soup = BeautifulSoup(html.text, 'lxml')
        
        # Be careful: not all products have every features
        if soup.find('div', {'itemprop':'brand'}) is not None:
            name_brand = soup.find('div', {'itemprop':'brand'}).contents[1].string.replace("\n", "")
            dico_product['brand'] = unidecode.unidecode(name_brand)
        if soup.find('div', {'itemprop':'name'}) is not None:
            short_description = soup.find('div', {'itemprop':'name'}).contents[0].string.replace("\n", "")
            dico_product['short-description'] = unidecode.unidecode(short_description)
        if soup.find('a', {'reason':'retailer-link'}) is not None:
            retailer = soup.find('a', {'reason':'retailer-link'}).contents[1].string[19::].replace("\n", "")
            dico_product['retailer'] = unidecode.unidecode(retailer)
        if soup.find('div', {'class':'product-description__details text-paragraph mb0'}) is not None:
            long_description = soup.find('div', {'class':'product-description__details text-paragraph mb0'}).contents[1].string.replace("\n", "")
            dico_product['long-description'] = unidecode.unidecode(long_description)
        # Download the main image with the name of the retailer in its name
        url_main_img = soup.find('img', {'class':'image-gallery-main-img'})['src'].replace("\n", "")
        
        # Choose the name carefully, as several products from different catalogues have the same name...
        image_name = url_main_img.split("/")[-2] + url_main_img.split("/")[-1]
        file_name = DownloadPath + '/shirts/' + clothing + "/" + retailer.replace(" ", "_") + "_main_" + image_name
        try:
            r = requests.get(url_main_img, stream=True, headers=headers)
            # The with block closes the file even if the copy fails
            with open(file_name, 'wb') as f:
                shutil.copyfileobj(r.raw, f)
            dico_product['url-main-image'] = url_main_img
        except Exception as e:
            print("Error while downloading the image:", e)
            continue

        
        # Download the thumbnail image also with the name of the retailer in their name
        other_imgs = soup.find('div', {'is':'gallery-thumbnails'})
        dico_product['other-images-url'] = []
        
        # Not all images have thumbnails images
        if other_imgs is not None:
            other_imgs = soup.find('div', {'is':'gallery-thumbnails'}).contents
            nb_imgs = int(len(other_imgs)/2)
            for i in range(1, nb_imgs):
                dico_product['other-images-url'].append(other_imgs[2*i+1]['href'])
                image_name = other_imgs[2*i+1]['href'].split("/")[-2] + other_imgs[2*i+1]['href'].split("/")[-1]
                file_name =DownloadPath + '/shirts/' + clothing + "/" + retailer.replace(" ", "_") + "_" + str(i) + image_name
                if (other_imgs[2*i+1]['href'].replace("\n", "") != url_main_img):
                    try:
                        r = requests.get(other_imgs[2*i+1]['href'].replace("\n", ""), stream=True, headers=headers)
                        with open(file_name, 'wb') as f:
                            shutil.copyfileobj(r.raw, f)
                    except Exception as e:
                        print("Error while downloading the image:", e)
                        continue
        
        # Put the dico of the product in the list of all data
        dico.append(dico_product)
        
# Save all to  a json file
with open("shirts.json", "w") as fout:
    json.dump(dico, fout)
        

Now we can write our "brutal" scraper that crawls all the subcategories we chose. This method works, but has several flaws: 
-  it is __slow__. With the infinite scroll, Selenium has to wait for the images to load on every scroll, and we had to use a small trick (scrolling back up) to work around the website's bugs. This is not a major concern if you have time, but it is always good to be fast when scraping. Quick maths: our scraper politely downloads ~1 image per second. Since we plan to download 1 million images, it would run for about 11.6 days. This is not OK.
-   you can get __blocked__ by the admins of the website. You send many requests in a fairly short amount of time, so you are likely to get caught.
-  to scrape the website this way, a lot of __manual__ work has to be done to understand how the website is built. This takes some time, and can be tricky if the website is obscure or complex.
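
The quick maths above is easy to redo for other speeds. A tiny sketch, where `scraping_days` is just a helper name I made up:

```python
SECONDS_PER_DAY = 24 * 60 * 60

def scraping_days(n_images, images_per_second):
    """Estimate how many days a scraper needs at a constant download rate."""
    return n_images / images_per_second / SECONDS_PER_DAY

# 1 million images at ~1 image per second
days = scraping_days(1_000_000, 1.0)
print(f"{days:.2f} days")
```

At 1 image per second this gives a bit more than eleven and a half days, so any real gain has to come from doing less work per image, not from sleeping less.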

In [None]:
# Complete naive webscraper for lyst.fr Men's catalogue

# Create a unique list of entries for all the products
dico = []

# Number of times we scroll to the bottom of each page
number_of_scrolls = 100

# Use selenium to create a firefox instance
driver = webdriver.Firefox()
extensions = {"jpg", "jpeg", "png"}

# Iterate over the keys of the words_to_search dictionary
for category in words_to_search:
    # Image counter, needed because not every product has the same number of images available
    cpt = 0
    print("Processing category", category, "...")
    subcategories = words_to_search[category]
    # Loop on all the clothes
    for clothing in subcategories:
        print("Scrolling to the bottom of the page for category", clothing, "...")

        url = "https://www.lyst.fr/parcourir/vetements-pour-homme/?category=" + category +  "&subcategory=" + clothing
        driver.get(url)
        for _ in range(number_of_scrolls):
            driver.execute_script("window.scrollBy(0,document.body.scrollHeight)")
        
            # If the website is a bit buggy, be smart
            time.sleep(0.7)
            driver.execute_script("window.scrollBy(0, -300)")
        
            # Don't forget this sleep to let the image appear on the page
            time.sleep(1.2)
        print("Done !")
    
        # Now, we're gonna use the Lyst page of the product
        url_products_driver = driver.find_elements_by_xpath('//div[contains(@class, "product-card__details")]')
        print("Products ready to be scraped for category", clothing, ":", len(list(set(url_products_driver))))
        for ind, product in enumerate(list(set(url_products_driver))):
            time.sleep(0.1)
            print("Extracting data for product no", ind, "/", len(url_products_driver))
            dico_product = {}
            prod = product.find_element_by_tag_name("a")
            # Get the url of the product page
            url_prod = prod.get_attribute("href")
        
            # We use Beautiful Soup now, cause it is simpler to use and doesn't open a browser page
            html = requests.get(url_prod, headers = headers)
            soup = BeautifulSoup(html.text, 'lxml')
            
            # Be careful: not all products have every features
            if soup.find('div', {'itemprop':'brand'}) is not None:
                if len(soup.find('div', {'itemprop':'brand'}).contents)>1:
                    name_brand = soup.find('div', {'itemprop':'brand'}).contents[1].string.replace("\n", "")
                    dico_product['brand'] = unidecode.unidecode(name_brand)
            if soup.find('div', {'itemprop':'name'}) is not None:
                short_description = soup.find('div', {'itemprop':'name'}).contents[0].string.replace("\n", "")
                dico_product['short-description'] = unidecode.unidecode(short_description)
            if soup.find('a', {'reason':'retailer-link'}) is not None:
                retailer = soup.find('a', {'reason':'retailer-link'}).contents[1].string[19::].replace("\n", "")
                dico_product['retailer'] = unidecode.unidecode(retailer)
            if soup.find('div', {'class':'product-description__details text-paragraph mb0'}) is not None:
                long_description = soup.find('div', {'class':'product-description__details text-paragraph mb0'}).contents[1].string.replace("\n", "")
                dico_product['long-description'] = unidecode.unidecode(long_description)
            # Download the main image with the name of the retailer in its name
            url_main_img = soup.find('img', {'class':'image-gallery-main-img'})['src'].replace("\n", "")
            image_name = url_main_img.split("/")[-2] + url_main_img.split("/")[-1]
            file_name =DownloadPath + '/'+ category + '/' + clothing + "/" + retailer.replace(" ", "_") + "_main_" + image_name
            try:
                r = requests.get(url_main_img, stream=True, headers=headers)
                with open(file_name, 'wb') as f:
                    shutil.copyfileobj(r.raw, f)
                dico_product['url-main-image'] = url_main_img
                cpt +=1
            except Exception as e:
                print("Error while downloading the image:", e)
                continue

        
            # Download the thumbnail image also with the name of the retailer in their name
            other_imgs = soup.find('div', {'is':'gallery-thumbnails'})
            dico_product['other-images-url'] = []
        
            # Not all images have thumbnails images
            if other_imgs is not None:
                other_imgs = soup.find('div', {'is':'gallery-thumbnails'}).contents
                nb_imgs = int(len(other_imgs)/2)
                for i in range(1, nb_imgs):
                    dico_product['other-images-url'].append(other_imgs[2*i+1]['href'])
                    image_name = other_imgs[2*i+1]['href'].split("/")[-2] + other_imgs[2*i+1]['href'].split("/")[-1] 
                    file_name =DownloadPath + '/' + category + '/' + clothing + "/" + retailer.replace(" ", "_") + "_" + str(i) + image_name
                    if (other_imgs[2*i+1]['href'].replace("\n", "") != url_main_img):
                        try:
                            r = requests.get(other_imgs[2*i+1]['href'].replace("\n", ""), stream=True, headers=headers)
                            with open(file_name, 'wb') as f:
                                shutil.copyfileobj(r.raw, f)
                            cpt +=1
                        except Exception as e:
                            print("Error while downloading the image:", e)
                            continue
                            
            # Put the dico of the product in the list of all data
            dico.append(dico_product)

    # Get the number of images downloaded for each category
    print("Downloaded", cpt, "images from category", category)
        
# Save all to  a json file
with open("lyst.json", "w") as fout:
    json.dump(dico, fout)
    

# Part 3: Using the inspection tool

In part 2, we implemented a basic web scraper. It required some sweat to understand how the website is built, inspect many elements... In short, it was enlightening but somewhat long. If we want to get to the data faster, or simply spend less time inspecting elements and scrolling endlessly, we need to be smarter, i.e. inspect our website in more depth.

To inspect a website, I use Chrome because its interface seems friendlier to me. This is just a personal preference; everything I am going to explain works in Firefox or IE as well. Until now, we only used the __Elements__ tab of the browser's inspection tool, but the other tabs can also be very useful. In our case, we're gonna look at the __Network__ tab, which records all network requests. This is especially useful for infinite scrolling. When you scroll down the page, new images appear (you can track this in the Network tab). This is nothing spectacular, but every time you reach the bottom of the page, new images need to be downloaded, and that's where it gets interesting. In the record of network activity, we find a request that is not a link to an image. It is called _"parcourir"_; try to open it. You see a page filled with info. When you look closely, it is actually a dictionary containing all the info available for the products that were just loaded.

The URL is https://www.lyst.fr/api/rothko/modules/product_feed/?url=%2Fparcourir%2Fchemises-pour-homme%2F%3Fpage%3D5. After some quick tests, we realize that we can load all the products by changing only the final number, which is the page number. I can go up to 100, so let's see how many products we can get using this method (on the category 'Shirts', to compare with our first scraper).


In [None]:
# Get the HTML file and process it. 
url = 'https://www.lyst.fr/api/rothko/modules/product_feed/?url=%2Fparcourir%2Fchemises-pour-homme%2F%3Fpage%3D50'
html = requests.get(url)
soup = BeautifulSoup(html.text, 'lxml')
text_data = soup.text

# Try to access some useful data, it is a bit messy at first glance
dico = json.loads(text_data)
for product in dico['data']['feed_items']:
    dico_product = {}
    dico_product["id"] = product['product_card']['id']
    dico_product["price"] = product['product_card']['full_price_with_currency_symbol']
    dico_product["retailer"] = product['product_card']['retailer_name']
    dico_product["designer"] = product['product_card']['designer_name']
    # We could keep all this data, but some are not useful for us

# Let's first check that we have a same number of products using Selenium and this method on the category 'Shirts'
url_dico = 'https://www.lyst.fr/api/rothko/modules/product_feed/?url=%2Fparcourir%2Fchemises-pour-homme%2F%3Fsubcategory%3Dcasual%2Bshirts%26final_price_from%3D0%26final_price_to%3D1000000%26ref%3D%252Fparcourir%252Fchemises-pour-homme%252F%26page%3D'
nb_products = 0
for i in range (2, 100):
    url = url_dico + str(i)
    html = requests.get(url)
    soup = BeautifulSoup(html.text, 'lxml')
    text_data = soup.text
    dico = json.loads(text_data)
    nb_products += len(dico['data']['feed_items'])
print("Total products found:", nb_products)
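
The loop above always asks for a fixed range of pages. A possible variant is to stop as soon as a page comes back empty. The sketch below assumes that a page past the end returns an empty `feed_items` list, which I have not verified; `collect_feed_items` and `fake_fetch` are hypothetical names, and the fake function stands in for the real network call so the example runs offline.

```python
def collect_feed_items(fetch_page, first_page=1, max_pages=100):
    """Call fetch_page(n) for successive pages and stop at the first empty feed."""
    items = []
    for page in range(first_page, max_pages + 1):
        payload = fetch_page(page)
        feed = payload.get("data", {}).get("feed_items", [])
        if not feed:
            # Nothing left to load: no need to request the remaining pages
            break
        items.extend(feed)
    return items

# Stand-in for the real request: three pages of fake items, then nothing
def fake_fetch(page):
    if page <= 3:
        return {"data": {"feed_items": [{"page": page, "rank": k} for k in range(2)]}}
    return {"data": {"feed_items": []}}

print(len(collect_feed_items(fake_fetch)))
```

Passing the fetching function as an argument also makes it easy to plug in a version that sleeps between requests.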

    

This seems promising! What we need to understand before launching the definitive scraper is how the URL of the page containing the dictionary is built for each category. We first notice that this part 'https://www.lyst.fr/api/rothko/modules/product_feed/?url=%2Fparcourir%2F' is common to all the categories. The value of the '_url=_' parameter is just the path of the category or subcategory page, percent-encoded: "/" becomes "%2F", "?" becomes "%3F", "=" becomes "%3D", and so on. So we have actually done all the work already, this is a free meal!
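
Python's standard library can do this encoding for us. As a small sketch (the `api_url` helper and the `API_PREFIX` name are mine), `urllib.parse.quote` with `safe=""` encodes every reserved character, including "/":

```python
from urllib.parse import quote, unquote

API_PREFIX = "https://www.lyst.fr/api/rothko/modules/product_feed/?url="

def api_url(page_path):
    """Percent-encode a site path and append it to the API prefix."""
    # safe="" forces "/" to be encoded too (quote leaves it alone by default)
    return API_PREFIX + quote(page_path, safe="")

path = "/parcourir/chemises-pour-homme/?page=5"
print(api_url(path))

# unquote goes the other way, which helps when reading URLs from the Network tab
print(unquote("%2Fparcourir%2Fchemises-pour-homme%2F%3Fpage%3D5"))
```

The first line printed is the same URL as the one found in the Network tab.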

We are actually now using Lyst's __API__ to get the data, which is __usually__ good practice. If you can, always use the provided API to collect data. For instance, Firefox displays the response in a nicely formatted JSON view, which lets me see quickly how the data is organized. What is funny here is that Lyst probably doesn't intend to provide this API; it is likely a private API their engineers use to organize their data.

We can now get all the images using this API quite simply. Again, we test it on the category 'shirts'. Since some info (URLs of thumbnail images, long description...) is not available in the API, we're going to take the URL of each product and use our Beautiful Soup parser to retrieve the data we can't get through the API. So our "naive" work is useful after all! We expect this scraper to be faster, as we skip the (long) infinite scroll.
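
Since this is probably a private API, its fields may be missing or change without notice, so it can be worth reading them defensively. The sketch below uses a made-up response that only mimics the key names seen earlier in this notebook; the values, the `FIELDS` mapping and the `parse_feed` helper are all hypothetical.

```python
import json

# Hypothetical response, shaped like the feed described in this notebook
sample_response = """
{"data": {"feed_items": [
  {"product_card": {"id": 1, "retailer_name": "Shop A", "designer_name": "Brand X",
                    "full_price_with_currency_symbol": "120 EUR", "url": "/vetements/item-1/"}},
  {"product_card": {"id": 2, "retailer_name": "Shop B", "url": "/vetements/item-2/"}}
]}}
"""

# Our own field name -> key used in the API response
FIELDS = {
    "id": "id",
    "price": "full_price_with_currency_symbol",
    "retailer": "retailer_name",
    "designer": "designer_name",
    "url": "url",
}

def parse_feed(text):
    """Turn the JSON text of one feed page into a list of flat dictionaries."""
    payload = json.loads(text)
    products = []
    for item in payload["data"]["feed_items"]:
        card = item.get("product_card", {})
        # .get returns None instead of raising when a field is missing
        products.append({name: card.get(key) for name, key in FIELDS.items()})
    return products

products = parse_feed(sample_response)
print(products)
```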

In [None]:
# Test scraper using the API on the category "Shirts"

# Create a unique list of entries for all the products
dico_final = []

# Number of image total
total_img = 0

Base_api = 'https://www.lyst.fr/api/rothko/modules/product_feed/?url=%2F'
Base = 'https://www.lyst.fr'
to_scrap = words_to_search['shirts']
nb_products = 0
# Loop on the subcategories
for clothing in to_scrap:
    url_true = "parcourir/vetements-pour-homme/?category=" + 'shirts' +  "&subcategory=" + clothing + "&page="
    url_api_base = Base_api + url_true.replace("+", "%2B").replace("&", "%26").replace("=", "%3D").replace("/", "%2F").replace("?", "%3F")
    cpt = 0
    for i in range(2, 100):
        print("Parsing page no", i, "/", 100, "...")
        url = url_api_base + str(i)
        try:
            html = requests.get(url)
        except Exception as e:
            print("Error: can't reach.", e)
            continue
        soup = BeautifulSoup(html.text, 'lxml')
        text_data = soup.text
        dico = json.loads(text_data)
        num = 0
        for product in dico['data']['feed_items']:
            time.sleep(0.1)
            num +=1
            dico_product = {}
            dico_product['id'] = product['product_card']['id']
            url_product = Base + product['product_card']['url']
            html_product = requests.get(url_product)
            soup_product = BeautifulSoup(html_product.text, 'lxml')
            print("Extracting data for product no", num)
            
            
            # We retrieve all the data
            if soup_product.find('div', {'itemprop':'brand'}) is not None:
                if len(soup_product.find('div', {'itemprop':'brand'}).contents)>1:
                    name_brand = soup_product.find('div', {'itemprop':'brand'}).contents[1].string.replace("\n", "")
                    dico_product['brand'] = unidecode.unidecode(name_brand)
            if soup_product.find('div', {'itemprop':'name'}) is not None:
                short_description = soup_product.find('div', {'itemprop':'name'}).contents[0].string.replace("\n", "")
                dico_product['short-description'] = unidecode.unidecode(short_description)
            # Default value, so the file names below never reuse the retailer of a previous product
            retailer = "unknown"
            if soup_product.find('a', {'reason':'retailer-link'}) is not None:
                retailer = soup_product.find('a', {'reason':'retailer-link'}).contents[1].string[19::].replace("\n", "")
                dico_product['retailer'] = unidecode.unidecode(retailer)
            if soup_product.find('div', {'class':'product-description__details text-paragraph mb0'}) is not None:
                long_description = soup_product.find('div', {'class':'product-description__details text-paragraph mb0'}).contents[1].string.replace("\n", "")
                dico_product['long-description'] = unidecode.unidecode(long_description)
                
            # Download the main image with the name of the retailer in its name
            if soup_product.find('img', {'class':'image-gallery-main-img'}) is not None:
                url_main_img = soup_product.find('img', {'class':'image-gallery-main-img'})['src'].replace("\n", "")
                image_name = str(dico_product["id"]) + url_main_img.split("/")[-1] 
                file_name =DownloadPath + '/'+ 'shirts' + '/' + clothing + "/" + retailer.replace(" ", "_") + "_main_" + image_name
                try:
                    r = requests.get(url_main_img, stream=True, headers=headers)
                    # The with-statement closes the file even if the copy fails
                    with open(file_name, 'wb') as f:
                        shutil.copyfileobj(r.raw, f)
                    dico_product['url-main-image'] = url_main_img
                    cpt +=1
                except Exception as e:
                    print("Error while downloading the image:", e)
                    continue
            else:
                dico_product = {}
                continue
                
                
            

        
            # Download the thumbnail image also with the name of the retailer in their name
            other_imgs = soup_product.find('div', {'is':'gallery-thumbnails'})
            dico_product['other-images-url'] = []
        
            # Not all products have thumbnail images
            if other_imgs is not None:
                other_imgs = soup_product.find('div', {'is':'gallery-thumbnails'}).contents
                nb_imgs = int(len(other_imgs)/2)
                for i in range(1, nb_imgs):
                    dico_product['other-images-url'].append(other_imgs[2*i+1]['href'])
                    image_name = str(dico_product["id"]) + other_imgs[2*i+1]['href'].split("/")[-1]
                    file_name =DownloadPath + '/' + 'shirts'+ '/' + clothing + "/" + retailer.replace(" ", "_") + "_" + str(i) + image_name
                    if (other_imgs[2*i+1]['href'].replace("\n", "") != url_main_img):
                        try:
                            r = requests.get(other_imgs[2*i+1]['href'].replace("\n", ""), stream=True, headers=headers)
                            with open(file_name, 'wb') as f:
                                shutil.copyfileobj(r.raw, f)
                            cpt +=1
                        except Exception as e:
                            print("Error while downloading the image:", e)
                            continue
            # Append the dict of this product to the final list
            dico_final.append(dico_product)
            
    # Count the number of images for this subcategory
    print("Number of images downloaded for subcategory", clothing, ":", cpt)
    # Add the subcategory count once, after all its pages have been parsed
    total_img += cpt
    
# Total images downloaded
print("Total images downloaded:", total_img)
    

# Save everything to a JSON file
with open("lyst_shirts.json", "w") as fout:
    json.dump(dico_final, fout)


It works! We can now write the complete scraper for menswear using this method.
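
The chain of `.replace()` calls that builds `url_api_base` is percent-encoding done by hand. As a minimal sketch, assuming the API only expects standard percent-encoding of the browse path, the same string can be produced with `urllib.parse.quote` from the standard library. The helper name `build_feed_url` is hypothetical (it is not part of the scraper), and the subcategory `casual+shirts` is only an illustrative value.

```python
from urllib.parse import quote

BASE_API = 'https://www.lyst.fr/api/rothko/modules/product_feed/?url=%2F'

def build_feed_url(browse_path, category, subcategory, page):
    """Return the product-feed API URL for one page of a subcategory.

    Hypothetical helper: it mirrors the manual .replace() chain of the
    scraper, but lets urllib.parse.quote do the encoding.
    """
    url_true = (browse_path + "?category=" + category
                + "&subcategory=" + subcategory + "&page=" + str(page))
    # safe="" forces "/", "?", "&", "=" and "+" to be percent-encoded as well
    return BASE_API + quote(url_true, safe="")

# "casual+shirts" is an illustrative subcategory, not taken from the notebook
url = build_feed_url("parcourir/vetements-pour-homme/", "shirts", "casual+shirts", 2)
print(url)
```

Using `quote` also covers characters the manual chain does not handle (spaces or accented letters, for instance), which may or may not matter for this particular API.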

In [None]:
# Final scraper for Lyst men's catalogue

# We create 2 save lists: 1 for all the products, and 1 per category in case an issue occurs.
dico_final = []
total_imgs = 0
Base_api = 'https://www.lyst.fr/api/rothko/modules/product_feed/?url=%2F'
Base = 'https://www.lyst.fr'

# Let's loop on all the words to search
for category in words_to_search:
    print("Processing category", category, "...")
    dico_category = []
    total_cat = 0
    subcategories = words_to_search[category]

    # Loop on the subcategories
    for clothing in subcategories:
        print("Processing subcategory", clothing, "...")
        url_true = "parcourir/vetements-pour-homme/?category=" + category +  "&subcategory=" + clothing + "&page="
        url_api_base = Base_api + url_true.replace("+", "%2B").replace("&", "%26").replace("=", "%3D").replace("/", "%2F").replace("?", "%3F")
        cpt = 0
        for i in range(2, 100):
            print("Parsing page no", i, "/", 100, "...")
            url = url_api_base + str(i)
            try:
                html = requests.get(url)
            except Exception as e:
                print("Error: can't reach.", e)
                continue
            soup = BeautifulSoup(html.text, 'lxml')
            text_data = soup.text
            dico = json.loads(text_data)
            num = 0
            for product in dico['data']['feed_items']:
                time.sleep(0.1)
                num +=1
                dico_product = {}
                dico_product['id'] = product['product_card']['id']
                url_product = Base + product['product_card']['url']
                html_product = requests.get(url_product)
                soup_product = BeautifulSoup(html_product.text, 'lxml')
                print("Extracting data for product no", num)
            
            
                # We retrieve all the data
                if soup_product.find('div', {'itemprop':'brand'}) is not None:
                    if len(soup_product.find('div', {'itemprop':'brand'}).contents)>1:
                        name_brand = soup_product.find('div', {'itemprop':'brand'}).contents[1].string.replace("\n", "")
                        dico_product['brand'] = unidecode.unidecode(name_brand)
                if soup_product.find('div', {'itemprop':'name'}) is not None:
                    short_description = soup_product.find('div', {'itemprop':'name'}).contents[0].string.replace("\n", "")
                    dico_product['short-description'] = unidecode.unidecode(short_description)
                # Default value, so the file names below never reuse the retailer of a previous product
                retailer = "unknown"
                if soup_product.find('a', {'reason':'retailer-link'}) is not None:
                    retailer = soup_product.find('a', {'reason':'retailer-link'}).contents[1].string[19::].replace("\n", "")
                    dico_product['retailer'] = unidecode.unidecode(retailer)
                if soup_product.find('div', {'class':'product-description__details text-paragraph mb0'}) is not None:
                    long_description = soup_product.find('div', {'class':'product-description__details text-paragraph mb0'}).contents[1].string.replace("\n", "")
                    dico_product['long-description'] = unidecode.unidecode(long_description)
                
                # Download the main image with the name of the retailer in its name
                if soup_product.find('img', {'class':'image-gallery-main-img'}) is not None:
                    url_main_img = soup_product.find('img', {'class':'image-gallery-main-img'})['src'].replace("\n", "")
                    image_name = str(dico_product["id"]) + "_" + url_main_img.split("/")[-1]
                    file_name =DownloadPath + '/'+ category + '/' + clothing + "/" + retailer.replace(" ", "_") + "_main_" + image_name
                    try:
                        r = requests.get(url_main_img, stream=True, headers=headers)
                        # The with-statement closes the file even if the copy fails
                        with open(file_name, 'wb') as f:
                            shutil.copyfileobj(r.raw, f)
                        dico_product['url-main-image'] = url_main_img
                        cpt +=1
                    except Exception as e:
                        print("Error while downloading the image:", e)
                        continue
                else:
                    dico_product = {}
                    continue
                
                
                # Download the thumbnail image also with the name of the retailer in their name
                other_imgs = soup_product.find('div', {'is':'gallery-thumbnails'})
                dico_product['other-images-url'] = []
        
                # Not all products have thumbnail images
                if other_imgs is not None:
                    other_imgs = soup_product.find('div', {'is':'gallery-thumbnails'}).contents
                    nb_imgs = int(len(other_imgs)/2)
                    for i in range(1, nb_imgs):
                        dico_product['other-images-url'].append(other_imgs[2*i+1]['href'])
                        image_name = str(dico_product["id"]) + "_" + other_imgs[2*i+1]['href'].split("/")[-1]
                        file_name =DownloadPath + '/' + category + '/' + clothing + "/" + retailer.replace(" ", "_") + "_" + str(i) + image_name
                        if (other_imgs[2*i+1]['href'].replace("\n", "") != url_main_img):
                            try:
                                r = requests.get(other_imgs[2*i+1]['href'].replace("\n", ""), stream=True, headers=headers)
                                with open(file_name, 'wb') as f:
                                    shutil.copyfileobj(r.raw, f)
                                cpt +=1
                            except Exception as e:
                                print("Error while downloading the image:", e)
                                continue
                # Append the dict of this product to the final lists
                dico_final.append(dico_product)
                dico_category.append(dico_product)
        total_cat += cpt
            
        # Count the number of images for this subcategory
        print("Number of images downloaded for subcategory", clothing, ":", cpt)
    
    # Total images downloaded for this category 
    print("Total images downloaded for category", category, ":", total_cat)
    total_imgs += total_cat
    

    # Save all the products of this category to a JSON file
    name = "lyst_" + category + ".json"
    with open(name, "w") as fout:
        json.dump(dico_category, fout)

# Total images downloaded 
print("Total images downloaded:", total_imgs)
# Finally, create the final json
with open("lyst_total.json", "w") as fout:
    json.dump(dico_final, fout)

    
    
    

Now we can do the same with the women's catalogue: we just need to change a few parameters.
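
Only a handful of values differ between the two catalogues: the browse path, the dictionary of categories, the download folder and the prefix of the JSON files. One way to avoid copying the whole loop is to group these values in a small configuration object. The sketch below is only an assumption about how this could be organised: `Catalogue` and its field names are hypothetical, and the download paths, JSON prefixes and category dictionaries are placeholder values.

```python
from dataclasses import dataclass, field

@dataclass
class Catalogue:
    """Hypothetical container for the parameters that change per catalogue."""
    browse_path: str      # path of the listing pages on lyst.fr
    download_path: str    # root folder for the images (placeholder value)
    json_prefix: str      # prefix of the JSON file written for each category
    words_to_search: dict = field(default_factory=dict)

    def url_true(self, category, subcategory):
        # Same unencoded URL as in the scraper, without the page number
        return (self.browse_path + "?category=" + category
                + "&subcategory=" + subcategory + "&page=")

    def json_name(self, category):
        return self.json_prefix + category + ".json"

# Placeholder dictionaries: the real ones hold every category and subcategory
men = Catalogue("parcourir/vetements-pour-homme/", "/tmp/LystScraping", "lyst_",
                {"shirts": ["casual+shirts"]})
women = Catalogue("parcourir/vetements/", "/tmp/LystScrapingF", "lyst_f_",
                  {"skirts": ["mini+skirts", "maxi+skirts"]})

for catalogue in (men, women):
    for category, subcategories in catalogue.words_to_search.items():
        for clothing in subcategories:
            print(catalogue.url_true(category, clothing), "->", catalogue.json_name(category))
```

With such an object, the scraping loop could take a `Catalogue` as its only argument and be called once per catalogue.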

In [None]:
# Specify the location to save the data
DownloadPath = "/media/arthur/DATA/Databases/LystScrapingF"

# We create a dictionary with all cat. and subcat. of the Lyst catalogue
words_to_search_f =  {
    'hosiery':['stocking','socks', 'tights'],
    'jumpsuits':['full+length+jumpsuits', 'playsuits'],
    'skirts':['maxi+skirts', 'knee+length+skirts', 'mid+length+skirts', 'mini+skirts'],
    'lingerie':['bodysuits', 'camisoles', 'panties','sets','corsetry','bras'],
    'coats':['capes','trench+coats','short+coats','fur+coats','long+coats','parka+coats'],
    'jeans':['bootcut+jeans', 'straight+jeans', 'wide-leg+jeans', 'flared+jeans', 'skinny+jeans', 'cropped+jeans'],
    'pants':['leggings','cropped+pants', 'straight-leg+pants', 'full+length+pants','skinny+pants', 'wide-leg+pants', 'harem+pants', 'cargo+pants'],
    'beachwear':['bikinis', 'one-piece+swimsuits', 'sarongs','towels', 'kaftans'],
    'knitwear':['cardigans','ponchos', 'sweaters', 'turtlenecks', 'zipped+sweaters', 'sleeveless+sweaters'],
    'dresses':['mini+dresses', 'cocktail+dresses','gowns','casual+dresses','formal+dresses','maxi+dresses'],
    'shorts':['bermudas+shorts','cargo+shorts','denim+shorts', 'mini+shorts','formal+shorts', 'knee+length+shorts'],
    'sweats':['sweatpants','tracksuits','sweatshirts', 'hoodies'],
    'tops':['shirts','blouses','short+sleeve+tops','long+sleeved+tops', 't-shirts','sleeveless+tops'],
    'jackets':['formal+jackets','leather+jackets','fur+jackets','denim+jackets','waistcoats', 'casual+jackets','parka+jackets'],
    'nightwear':['nightgowns','robes','pyjamas'],
    'suits':['skirt+suits', 'pant+suits']
}

# Create the tree on your hard disk (exist_ok=True lets you re-run this cell safely)
for category in words_to_search_f:
    os.makedirs(DownloadPath + "/" + category, exist_ok=True)
    subcategories = words_to_search_f[category]
    for clothing in subcategories:
        os.makedirs(DownloadPath + "/" + category + "/" + clothing, exist_ok=True)
print("Tree created")
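
A tree-creation step like this one tends to be run several times while debugging, and `os.makedirs` raises `FileExistsError` on an existing folder unless `exist_ok=True` is passed. Here is a minimal sketch of a re-runnable version, tested in a temporary directory so it does not touch your real download folder. The helper name `create_tree` is hypothetical, and `sample` is a two-category extract of the dictionary above.

```python
import os
import tempfile

def create_tree(root, words):
    """Create root/category/subcategory folders; safe to call twice.

    Hypothetical helper, not part of the original notebook.
    """
    created = []
    for category, subcategories in words.items():
        for clothing in subcategories:
            path = os.path.join(root, category, clothing)
            # makedirs also creates the parent category folder when needed
            os.makedirs(path, exist_ok=True)
            created.append(path)
    return created

# Demo in a throwaway directory with a two-category extract of the dictionary
sample = {'hosiery': ['stocking', 'socks', 'tights'],
          'suits': ['skirt+suits', 'pant+suits']}
with tempfile.TemporaryDirectory() as root:
    create_tree(root, sample)
    paths = create_tree(root, sample)  # the second call does not raise
    all_exist = all(os.path.isdir(p) for p in paths)
    print(len(paths), "folders, all present:", all_exist)
```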

In [None]:
# Final scraper for womenswear on Lyst.fr

# We create 2 save lists: 1 for all the products, and 1 per category in case an issue occurs.
dico_final_f = []
total_imgs_f = 0
Base_api = 'https://www.lyst.fr/api/rothko/modules/product_feed/?url=%2F'
Base = 'https://www.lyst.fr'

# Let's loop on all the words to search
for category in words_to_search_f:
    print("Processing category", category, "...")
    dico_category = []
    total_cat = 0
    subcategories = words_to_search_f[category]

    # Loop on the subcategories
    for clothing in subcategories:
        print("Processing subcategory", clothing, "...")
        url_true = "parcourir/vetements/?category=" + category +  "&subcategory=" + clothing + "&page="
        print(url_true)
        url_api_base = Base_api + url_true.replace("+", "%2B").replace("&", "%26").replace("=", "%3D").replace("/", "%2F").replace("?", "%3F")
        cpt = 0
        for i in range(2, 100):
            print("Parsing page no", i, "/", 100, "...")
            url = url_api_base + str(i)
            try:
                html = requests.get(url)
            except Exception as e:
                print("Error: can't reach.", e)
                continue
            soup = BeautifulSoup(html.text, 'lxml')
            text_data = soup.text
            dico = json.loads(text_data)
            num = 0
            for product in dico['data']['feed_items']:
                time.sleep(0.1)
                num +=1
                dico_product = {}
                dico_product['id'] = product['product_card']['id']
                url_product = Base + product['product_card']['url']
                html_product = requests.get(url_product)
                soup_product = BeautifulSoup(html_product.text, 'lxml')
                print("Extracting data for product no", num)
            
            
                # We retrieve all the data
                if soup_product.find('div', {'itemprop':'brand'}) is not None:
                    if len(soup_product.find('div', {'itemprop':'brand'}).contents)>1:
                        name_brand = soup_product.find('div', {'itemprop':'brand'}).contents[1].string.replace("\n", "")
                        dico_product['brand'] = unidecode.unidecode(name_brand)
                if soup_product.find('div', {'itemprop':'name'}) is not None:
                    short_description = soup_product.find('div', {'itemprop':'name'}).contents[0].string.replace("\n", "")
                    dico_product['short-description'] = unidecode.unidecode(short_description)
                # Default value, so the file names below never reuse the retailer of a previous product
                retailer = "unknown"
                if soup_product.find('a', {'reason':'retailer-link'}) is not None:
                    retailer = soup_product.find('a', {'reason':'retailer-link'}).contents[1].string[19::].replace("\n", "")
                    dico_product['retailer'] = unidecode.unidecode(retailer)
                if soup_product.find('div', {'class':'product-description__details text-paragraph mb0'}) is not None:
                    long_description = soup_product.find('div', {'class':'product-description__details text-paragraph mb0'}).contents[1].string.replace("\n", "")
                    dico_product['long-description'] = unidecode.unidecode(long_description)
                
                # Download the main image with the name of the retailer in its name
                if soup_product.find('img', {'class':'image-gallery-main-img'}) is not None:
                    url_main_img = soup_product.find('img', {'class':'image-gallery-main-img'})['src'].replace("\n", "")
                    image_name = str(dico_product["id"]) + "_" + url_main_img.split("/")[-1] 
                    file_name =DownloadPath + '/'+ category + '/' + clothing + "/" + retailer.replace(" ", "_") + "_main_" + image_name
                    try:
                        r = requests.get(url_main_img, stream=True, headers=headers)
                        # The with-statement closes the file even if the copy fails
                        with open(file_name, 'wb') as f:
                            shutil.copyfileobj(r.raw, f)
                        dico_product['url-main-image'] = url_main_img
                        cpt +=1
                    except Exception as e:
                        print("Error while downloading the image:", e)
                        continue
                else:
                    dico_product = {}
                    continue
                
                
                # Download the thumbnail image also with the name of the retailer in their name
                other_imgs = soup_product.find('div', {'is':'gallery-thumbnails'})
                dico_product['other-images-url'] = []
        
                # Not all products have thumbnail images
                if other_imgs is not None:
                    other_imgs = soup_product.find('div', {'is':'gallery-thumbnails'}).contents
                    nb_imgs = int(len(other_imgs)/2)
                    for i in range(1, nb_imgs):
                        dico_product['other-images-url'].append(other_imgs[2*i+1]['href'])
                        image_name = str(dico_product["id"]) + "_" + other_imgs[2*i+1]['href'].split("/")[-1]
                        file_name =DownloadPath + '/' + category + '/' + clothing + "/" + retailer.replace(" ", "_") + "_" + str(i) + image_name
                        if (other_imgs[2*i+1]['href'].replace("\n", "") != url_main_img):
                            try:
                                r = requests.get(other_imgs[2*i+1]['href'].replace("\n", ""), stream=True, headers=headers)
                                with open(file_name, 'wb') as f:
                                    shutil.copyfileobj(r.raw, f)
                                cpt +=1
                            except Exception as e:
                                print("Error while downloading the image:", e)
                                continue
                # Append the dict of this product to the final lists
                dico_final_f.append(dico_product)
                dico_category.append(dico_product)
        total_cat += cpt
            
        # Count the number of images for this subcategory
        print("Number of images downloaded for subcategory", clothing, ":", cpt)
    
    
    # Total images downloaded for this category 
    print("Total images downloaded for category", category, ":", total_cat)
    total_imgs_f += total_cat
    

    # Save all the products of this category to a JSON file
    name = "lyst_f_" + category + ".json"
    with open(name, "w") as fout:
        json.dump(dico_category, fout)
        
# Total images downloaded
print("Total images downloaded:", total_imgs_f)

# Finally, create the final json
with open("lyst_total_f.json", "w") as fout:
    json.dump(dico_final_f, fout)

That's it: you are ready to extract all the data we wanted. One issue remains, though: downloading the images may take a long time, depending on your Internet connection and your hardware. As my connection is not great and I don't use an SSD for this job, I will run the scraper on a __Virtual Machine (VM)__. In fact, from a performance point of view, the best solution would be to use several VMs in parallel with different IPs to download all the data; this would save a lot of time and make a ban less likely. But as I said, remember to be polite when scraping data, so I'll only use one Google Cloud instance. With the right parameters, these VMs are relatively cheap for this kind of work.

I hope you found this quick tutorial useful. As I said, I am still a beginner, so any feedback is very welcome!