## Web scraping prices from sargenta.se

Web scraping of the Swedish smithing website sargenta.se

Dad wanted an Excel file with the prices and article numbers of all tools and machines at sargenta.se. This web scraper produces the requested Excel sheet.

Importing packages

In [1]:
import requests
import re
from bs4 import BeautifulSoup

The website we are going to scrape is https://www.sargenta.se/, and the categories of interest are _verktyg_ (Swedish for _tools_) and _maskiner_ (Swedish for _machines_).

In [2]:
sargenta = 'https://www.sargenta.se/'
categories_of_interest = ['verktyg', 'maskiner']

Builds the URL of the sargenta.se shop page for a given `category`.

In [3]:
def url_from_category(category, url = 'https://www.sargenta.se/'):
    """
    Returns the url of the sargenta.se shop for category.
    """
    return url + 'shop/' + category

Downloads the page at `url` and parses it into a `BeautifulSoup` object `soup`.

In [4]:
def soup_from_url(url):
    """
    Returns a BeautifulSoup object
    """
    client = requests.get(url).text
    soup = BeautifulSoup(client, "html.parser")
    return soup

Collects the sidebar on the webpage from `soup`.

In [5]:
def sidebar_from_soup(soup):
    """
    Returns a BeautifulSoup object containing the sidebar of the webpage
    """
    return soup.find_all('div', {'id': 'sidebar_v'})[0].find_all('a')

Finds all items in a category. For example, the category _tools_ contains _hammer_.

In [6]:
def list_urls_from_sidebar(sidebar, url):
    """
    Returns a list of all items belonging to the category
    """
    item_list = []
    article_type = url.split('/')[-1]
    shop_article = "/shop/" + str(article_type) + "/"
    
    for item in sidebar:
        if shop_article in item.attrs['href']:
            item_list.append(item.attrs['href'])
    
    return item_list
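Whether the sidebar links are relative (`/shop/verktyg/...`) or absolute is not stated here, so the following is only a sketch of one way to make the filtered links safe to request in either case, using `urljoin` from the standard library. The helper name `absolute_item_urls` and the sample hrefs are hypothetical and not taken from the real site.

```python
from urllib.parse import urljoin


def absolute_item_urls(hrefs, category_url):
    """Keep the hrefs that sit under the category path and make them absolute."""
    category = category_url.rstrip('/').split('/')[-1]
    shop_path = '/shop/' + category + '/'
    # urljoin leaves absolute URLs untouched and resolves relative ones
    # against the category URL
    return [urljoin(category_url, href) for href in hrefs if shop_path in href]


# Hypothetical hrefs, only for illustration
sample_hrefs = [
    '/shop/verktyg/hammare',
    'https://www.sargenta.se/shop/verktyg/tanger',
    '/shop/maskiner/svarvar',
    '/kontakt',
]

item_urls = absolute_item_urls(sample_hrefs, 'https://www.sargenta.se/shop/verktyg')
print(item_urls)
```

With absolute URLs in hand, the page parameter can be appended directly to each item URL instead of concatenating it with the category URL.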

Goes through an item and returns all article numbers and prices for the item.

In [7]:
def append_from_item(item, url):
    """
    Returns all article numbers and prices for an item.
    """
    this_item_list = []
    seen = set()
    page_number = 1
    
    while True:
        # Remember how many articles we had before scraping this page
        articles_before = len(this_item_list)
        item_url = url + item + '?p=' + str(page_number)
        soup_item = soup_from_url(item_url)
        articles = soup_item.find_all('div', {'class':'listpris'})
        
        for article in articles:
            price_article = article.text.replace('\t', '')\
                                        .replace('\n', '')\
                                        .replace(' ', '')\
                                        .replace('\r', '')\
                                        .replace('Artnr', '')\
                                        .split('*')
            
            # Skip articles we have already collected; this is how a
            # repeated page (assumed to be served when the page number
            # runs past the last page) is detected
            key = tuple(price_article)
            if key in seen:
                continue

            seen.add(key)
            this_item_list.append(price_article)
        
        # Stop as soon as a page adds no new articles
        if len(this_item_list) == articles_before:
            break
        page_number += 1
        
    return this_item_list
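A paginated listing needs a rule for when to stop requesting pages. As a minimal sketch that can be tried without network access, the loop below stops once a page contributes no rows that were not already seen. The callable `fetch_page`, the `max_pages` safety cap, and the fake two-page listing are all assumptions made for illustration; in particular, it is only assumed that a page number past the end repeats earlier content.

```python
def collect_all_pages(fetch_page, max_pages=100):
    """
    Collect rows page by page until a page adds nothing new.

    fetch_page(page_number) is a hypothetical callable that returns a list
    of [price, article_number] pairs for that page.
    """
    collected = []
    seen = set()

    for page_number in range(1, max_pages + 1):
        new_rows = 0
        for row in fetch_page(page_number):
            key = tuple(row)
            if key not in seen:
                seen.add(key)
                collected.append(row)
                new_rows += 1
        # An empty page or a fully repeated page both end the loop
        if new_rows == 0:
            break

    return collected


# Fake listing with two pages; any other page number repeats page 1
FAKE_PAGES = {
    1: [['75,00kr/par', '7888-1'], ['69,00kr/st', '9220']],
    2: [['172,00kr/st', '7804']],
}


def fake_fetch(page_number):
    return FAKE_PAGES.get(page_number, FAKE_PAGES[1])


rows = collect_all_pages(fake_fetch)
print(rows)
```

The cap on `max_pages` is a design choice that keeps the loop finite even if the site behaves differently from what is assumed here.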

Loops through all items in a category and builds a list of article numbers and prices for all of them.

In [8]:
def articles_from_item_list(item_list, url, articles_list):
    """
    Returns a list of article number and prices for all items in a category.
    """
    for item in item_list:
        this_item_list = append_from_item(item, url)
        articles_list = articles_list + this_item_list
            
    return articles_list

Finally, a function that puts it all together. It returns the list of all article numbers and prices for all `categories`.

In [9]:
def articles_list_from_category(categories):
    """
    Returns the list containing all article numbers and prices for all categories
    """
    articles_list = []
    
    for category in categories:
        url_category = url_from_category(str(category))
        soup = soup_from_url(url_category)
        sidebar = sidebar_from_soup(soup)
        item_list = list_urls_from_sidebar(sidebar, url_category)
        
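        # articles_from_item_list already returns the accumulated list,
        # so adding it to articles_list again would duplicate earlier categories
        articles_list = articles_from_item_list(item_list, url_category, articles_list)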
        
    return articles_list

Run the function `articles_list_from_category` and show the first 5 articles with their corresponding prices.

In [10]:
articles_list = articles_list_from_category(categories_of_interest)
articles_list[:5]

[['75,00kr/par', '7888-1'],
 ['69,00kr/st', '9220'],
 ['172,00kr/st', '7804'],
 ['495,00kr/st', '7805'],
 ['120,00kr/st', '9221']]
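Each pair above is what remains after the listing text has been stripped of whitespace and the `Artnr` label and then split on `*`. The sketch below does the same cleanup with one regular expression instead of a chain of `replace` calls. The raw string is a made-up example of what a `listpris` element might contain, not text copied from the site.

```python
import re


def clean_listing(raw_text):
    """Remove all whitespace and the 'Artnr' label, then split on '*'."""
    compact = re.sub(r'\s+', '', raw_text)
    return compact.replace('Artnr', '').split('*')


# Hypothetical raw text of one listing
raw = "\n\t75,00 kr/par *\r\n Artnr 7888-1 "
print(clean_listing(raw))
```

Unlike `replace(' ', '')`, the `\s` pattern also removes other whitespace characters such as non-breaking spaces, which may or may not occur on the real page.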

We want to export the results to an Excel file. This is done using `pandas`.

In [11]:
import pandas as pd

In [12]:
df = pd.DataFrame(articles_list, columns = ['Pris', 'Artikelnummer'])
df.head()

|   | Pris | Artikelnummer |
|---|------|---------------|
| 0 | 75,00kr/par | 7888-1 |
| 1 | 69,00kr/st | 9220 |
| 2 | 172,00kr/st | 7804 |
| 3 | 495,00kr/st | 7805 |
| 4 | 120,00kr/st | 9221 |
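The `Pris` column holds strings such as `75,00kr/par`, which Excel would treat as text rather than numbers. If numeric prices are wanted, one possible approach is to split each string into an amount and a unit before exporting. The function `parse_price` and its pattern are assumptions based only on the five rows shown here; other price formats on the site may need a different pattern.

```python
import re

# Amount with a decimal comma, then 'kr/', then the unit (e.g. 'st' or 'par')
PRICE_PATTERN = re.compile(r'^(\d+(?:,\d+)?)kr/(\w+)$')


def parse_price(price_text):
    """Split a price string like '75,00kr/par' into (amount, unit)."""
    match = PRICE_PATTERN.match(price_text)
    if match is None:
        # Unknown format: return nothing rather than guess
        return None, None
    amount = float(match.group(1).replace(',', '.'))
    return amount, match.group(2)


for price_text in ['75,00kr/par', '69,00kr/st', '172,00kr/st']:
    print(price_text, parse_price(price_text))
```

The resulting amounts and units could then be stored in two separate columns of the data frame before it is written out.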


In [13]:
df.to_excel("sargenta.xlsx")