# Galmart Shop Web Scraping

In [12]:
from bs4 import BeautifulSoup
import requests
import numpy as np
import pandas as pd
import re

A function that opens a shop page by its number:

In [13]:
def open_page(page_num, url = 'https://store.galmart.kz/shop/page'):
    headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.9; rv:45.0) Gecko/20100101 Firefox/45.0'}
    page_url = url + '/' + str(page_num)
    return requests.get(page_url, headers = headers)
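A long crawl will occasionally hit a page that does not return status 200. The following is a minimal sketch of a retry wrapper, not part of the original notebook: `fetch_with_retries`, `attempts` and `delay` are hypothetical names, and `fetch` is assumed to be any callable like `open_page` that returns an object with a `status_code` attribute. Network exceptions are not handled here.

```python
import time


def fetch_with_retries(fetch, page_num, attempts=3, delay=1.0):
    """Call fetch(page_num) until it returns status 200 or attempts run out.

    Returns the last response received, whatever its status code.
    """
    response = None
    for attempt in range(1, attempts + 1):
        response = fetch(page_num)
        if response.status_code == 200:
            return response
        if attempt < attempts:
            # Wait a little longer after each failed attempt.
            time.sleep(delay * attempt)
    return response
```

With the function above, a call would look like `page = fetch_with_retries(open_page, 2)`.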

Getting the number of pages to scrape:

In [20]:
# Any listing page carries the pagination block; page 2 is used here.
page = open_page(2)
pages = []
if page.status_code == 200:
    soup = BeautifulSoup(page.text, "html.parser")
    # Text of the pagination block, one entry per line.
    content = soup.find('div', class_ = 'ast-woocommerce-container').find(class_ = 'page-numbers').text
    for token in content.split('\n'):
        # Keep only the numeric entries (skips arrows and ellipses).
        if token.isdigit():
            pages.append(int(token))
    num_pages = max(pages)

In [21]:
num_pages

267
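The page-count logic can be isolated into a small function that works on the pagination text alone, which makes it easy to check without a network request. This is a sketch under assumptions: `last_page` is a hypothetical helper, and the sample text in the usage line is invented to resemble a pagination block.

```python
def last_page(pagination_text):
    """Return the largest page number found in the text, or None if there is none."""
    # isdecimal() guarantees that int() can parse the token.
    numbers = [int(tok) for tok in pagination_text.split() if tok.isdecimal()]
    return max(numbers) if numbers else None


# Invented sample: arrows and an ellipsis mixed with page numbers.
print(last_page("←\n1\n2\n3\n…\n266\n267\n→"))
```

Returning `None` instead of raising lets the caller decide what to do when the pagination block is missing or empty.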

Reading the shop pages one by one. Each product card is a _div_ with the _astra-shop-summary-wrap_ class; inside it, the product description is in the element with the _woocommerce-loop-product__title_ class, and the price is in the _span_ with the _woocommerce-Price-amount amount_ classes.

In [5]:
%%time
products = []
prices = []
for i in range(1, num_pages + 1):
    page = open_page(i)
    if page.status_code == 200:
        soup = BeautifulSoup(page.text, "html.parser")
        content = soup.find_all('div', class_ = 'astra-shop-summary-wrap')
        for product in content:
            # Skip cards that have no product title.
            if product.find(class_='woocommerce-loop-product__title') is not None:
                products.append(product.find('h2').text)
                price = product.find('span', class_='woocommerce-Price-amount amount')
                # Use 0 as a placeholder when no price is listed.
                prices.append(price.text if price is not None else 0)
    else:
        # Report pages that failed to load.
        print(page.status_code)

Wall time: 5min 57s
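The card-parsing step can be tried offline on a small HTML fragment before running the full crawl. The sketch below is illustrative only: `SAMPLE_HTML` is an invented fragment that mimics the class names used above (the real markup of the shop may differ), and `parse_cards` is a hypothetical helper. It returns `None` rather than 0 when a card has no price.

```python
from bs4 import BeautifulSoup

# Invented fragment that imitates two product cards, one without a price.
SAMPLE_HTML = """
<ul class="products">
  <li>
    <div class="astra-shop-summary-wrap">
      <h2 class="woocommerce-loop-product__title">Milk 3.2%, 1000 ml, Kazakhstan</h2>
      <span class="price"><span class="woocommerce-Price-amount amount">305 ₸</span></span>
    </div>
  </li>
  <li>
    <div class="astra-shop-summary-wrap">
      <h2 class="woocommerce-loop-product__title">Water 5000 ml</h2>
    </div>
  </li>
</ul>
"""


def parse_cards(html):
    """Return a list of (title, price_text_or_None) tuples, one per product card."""
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    for card in soup.find_all("div", class_="astra-shop-summary-wrap"):
        title = card.find(class_="woocommerce-loop-product__title")
        if title is None:
            continue  # not a product card
        # Matching on a single class also finds elements that carry several classes.
        price = card.find("span", class_="woocommerce-Price-amount")
        rows.append((
            title.get_text(strip=True),
            price.get_text(strip=True) if price is not None else None,
        ))
    return rows


for row in parse_cards(SAMPLE_HTML):
    print(row)
```

Collecting title and price as one tuple per card keeps the two values aligned even when a price is missing.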


Deleting irrelevant symbols from the price:

In [7]:
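prices_clean = []
# Raw string, so that \d is not treated as a string escape.
pattern = r'(\d+[,])?\d+'
for price in prices:
    # Drop thousands separators, then keep the first run of digits.
    price = str(price).replace(',','')
    match = re.search(pattern, price)
    prices_clean.append(match[0])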
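The same cleaning step can be written as a function that returns an integer, which avoids keeping prices as strings. This is a sketch under assumptions: `clean_price` is a hypothetical helper, the currency sign and the separators in the sample inputs are guesses about the raw price text, and any fractional part would be discarded.

```python
import re


def clean_price(raw):
    """Return the first integer in raw, ignoring thousands separators; 0 if none."""
    # Remove commas, regular spaces and non-breaking spaces used as separators.
    text = str(raw).replace(",", "").replace("\u00a0", "").replace(" ", "")
    match = re.search(r"\d+", text)
    return int(match.group()) if match else 0


# Invented sample inputs, including the 0 placeholder used for missing prices.
for sample in ["1,010 ₸", "305 ₸", 0, "n/a"]:
    print(repr(sample), clean_price(sample))
```

Returning 0 when no digits are found mirrors the placeholder used for missing prices and avoids an error on an unmatched string.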

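Extracting the country of origin from the product description. The country is assumed to be the last or second-to-last word of a description that contains a comma, and it must be written in title case: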

In [8]:
origins = []
for product in products:
    separated = str(product).strip().split(',')
    if len(separated) > 1:
        string = str(product).strip().split(' ')
        if string[-1].istitle():
            origins.append(string[-1].strip()) 
        elif string[-2].istitle():
            origins.append(string[-2].strip()) 
        else:
            origins.append('NONE')
    else:
        origins.append('NONE')
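The heuristic above keeps any punctuation attached to the word, so a description ending in a comma yields a label with a trailing comma. A slightly more defensive variant is sketched below. It is an illustration, not the notebook's method: `extract_origin` is a hypothetical helper, the third sample description is invented, and multi-word country names would still be cut to their last word.

```python
def extract_origin(description, missing="NONE"):
    """Guess the country of origin from the last two words of a description."""
    text = str(description).strip()
    if "," not in text:
        return missing
    # Check the last word first, then the second-to-last one.
    for word in reversed(text.split()[-2:]):
        candidate = word.strip(".,;")  # drop attached punctuation
        if candidate.istitle():
            return candidate
    return missing


samples = [
    "Молоко Родина 3,2% финпак, 1000мл, Казахстан",
    "Корм д/кошек Gourmet perle ягненок влажн 85г",
    "Сыр твердый, 200г, Литва,",  # invented example with a trailing comma
]
for s in samples:
    print(extract_origin(s))
```

Slicing with `[-2:]` also avoids an `IndexError` when the description consists of a single word.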

Putting the data into a DataFrame:

In [9]:
data = pd.DataFrame({'product':products, 'price':prices_clean, 'origin':origins})
data

| | product | price | origin |
|---|---|---|---|
| 0 | Напиток Actimel кисломолочный Гранат, 100 гр, ... | 210 | Казахстан |
| 1 | Молоко Родина 3,2% финпак, 1000мл, Казахстан | 305 | Казахстан |
| 2 | Вода Tassay негазированная 5000 мл, Казахстан | 460 | Казахстан |
| 3 | Яйцо Казгер-Құс премиум, 15шт, Казахстан | 740 | Казахстан |
| 4 | Картофель молодой, 1000г, Казахстан | 91 | Казахстан |
| ... | ... | ... | ... |
| 5601 | Зубная щетка Colgate “Шелковые нити” с древесн... | 1010 | Китай |
| 5602 | Ополаскиватели д/полости рта Listerine total 2... | 950 | NONE |
| 5603 | Корм для кошек Gourmet Perle Индейка влажный 8... | 190 | Россия |
| 5604 | Корм д/кошек Gourmet perle ягненок влажн 85г | 190 | NONE |


In [10]:
data.origin.value_counts(dropna = False)

NONE          1876
Россия        1124
Казахстан      808
Китай          558
Германия       267
              ... 
Литва,           1
Куба             1
Тунис            1
Беларуссия       1
Мексика          1
Name: origin, Length: 95, dtype: int64

Exporting to Excel:

In [11]:
data.to_excel('galmart.kz.xlsx')