# Galmart Shop Web Scraping

In [1]:
from bs4 import BeautifulSoup
import requests
import numpy as np
import pandas as pd
import re

A helper function that requests a single shop page:

In [2]:
def open_page(page_num, url='https://store.galmart.kz/shop/page'):
    """Request shop page number `page_num` and return the response."""
    # Send a browser-like User-Agent header with the request
    headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.9; rv:45.0) Gecko/20100101 Firefox/45.0'}
    page_url = url + '/' + str(page_num)
    return requests.get(page_url, headers=headers)

Getting the number of pages to scrape:

In [3]:
# Reuse open_page instead of repeating the headers; the pagination is read from page 2
page = open_page(2)
pages = []
if page.status_code == 200:
    soup = BeautifulSoup(page.text, "html.parser")
    # Text of the pagination block, one entry per line
    content = soup.find('div', class_='ast-woocommerce-container').find(class_='page-numbers').text
    for token in content.split('\n'):
        # Keep only the numeric entries (skips arrows and ellipses)
        if token.isdigit():
            pages.append(int(token))
    num_pages = max(pages)
else:
    print(page.status_code)

In [4]:
num_pages

268
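
The page-count logic above can be isolated into a small function so that an empty or unexpected pagination block does not raise an error. The following is a minimal sketch: the function name `max_page_number` and the sample text are hypothetical, chosen only to mimic a pagination block with one entry per line.

```python
def max_page_number(pagination_text):
    """Return the largest integer token in the text, or None if there is none."""
    # isdecimal() accepts only characters that int() can parse
    numbers = [int(token) for token in pagination_text.split() if token.isdecimal()]
    return max(numbers) if numbers else None


# Hypothetical pagination text: arrows and an ellipsis mixed with page numbers
sample = "←\n1\n2\n3\n…\n267\n268\n→"
print(max_page_number(sample))            # 268
print(max_page_number("no digits here"))  # None
```

Returning `None` instead of calling `max` on an empty list leaves it to the caller to decide what to do when no page numbers are found.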

The shop pages are read one by one. Each product sits in a `div` with the _astra-shop-summary-wrap_ class. Inside it, the product title is in the element with the _woocommerce-loop-product__title_ class, and the price is in the `span` with the _woocommerce-Price-amount amount_ classes.

In [5]:
%%time
products = []
prices = []
for i in range(1, num_pages + 1):
    page = open_page(i)
    if page.status_code == 200:
        soup = BeautifulSoup(page.text, "html.parser")
        content = soup.find_all('div', class_='astra-shop-summary-wrap')
        for product in content:
            # Skip blocks that have no product title
            if product.find(class_='woocommerce-loop-product__title') is not None:
                products.append(product.find('h2').text)
                price_tag = product.find('span', class_='woocommerce-Price-amount amount')
                # Record 0 when no price is listed so both lists stay the same length
                prices.append(price_tag.text if price_tag is not None else 0)
    else:
        print(page.status_code)

ConnectionError: HTTPSConnectionPool(host='store.galmart.kz', port=443): Max retries exceeded with url: /shop/page/11 (Caused by NewConnectionError('<urllib3.connection.VerifiedHTTPSConnection object at 0x7fa2489d8150>: Failed to establish a new connection: [Errno 60] Operation timed out'))
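
The run above stopped at page 11 with a connection timeout, which ends the whole loop. One way to make the loop more tolerant is to retry a failed request a few times before giving up. The sketch below assumes a hypothetical helper `fetch_with_retries`; the `fetch` argument stands for any callable such as `open_page`, and the attempt count and delay are illustrative values. It catches `OSError`, from which `requests.exceptions.RequestException` derives.

```python
import time


def fetch_with_retries(fetch, page_num, attempts=3, delay=1.0):
    """Call fetch(page_num), retrying on OSError up to `attempts` times.

    Returns the result of fetch, or None if every attempt failed.
    """
    for attempt in range(1, attempts + 1):
        try:
            return fetch(page_num)
        except OSError as error:
            print(f"page {page_num}: attempt {attempt} failed ({error})")
            if attempt < attempts:
                # Wait a little longer after each failed attempt
                time.sleep(delay * attempt)
    return None
```

In the scraping loop, `page = fetch_with_retries(open_page, i)` could stand in for `page = open_page(i)`, with a check for `None` before reading `page.status_code`.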

Stripping non-numeric characters from the price:

In [6]:
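prices_clean = []
# Raw string avoids an invalid-escape warning; commas are removed before matching,
# so a plain run of digits is enough
pattern = r'\d+'
for price in prices:
    price = str(price).replace(',', '')
    match = re.search(pattern, price)
    prices_clean.append(match[0])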
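
To make the cleaning step easier to check in isolation, it can be written as a function that returns an integer. This is a minimal sketch: `clean_price` is a hypothetical name, and the sample strings only assume that prices arrive as digits with optional comma separators and a currency symbol.

```python
import re


def clean_price(raw):
    """Return the first run of digits in `raw` as an int, ignoring commas.

    Falls back to 0 when no digits are present.
    """
    match = re.search(r'\d+', str(raw).replace(',', ''))
    return int(match[0]) if match else 0


# Assumed input shapes, not taken from the site
print(clean_price('1,250 ₸'))  # 1250
print(clean_price(0))          # 0
```

Returning integers rather than strings would let the price column be summed or sorted numerically once it is in a DataFrame.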

Extracting the country of origin from the product description:

In [7]:
origins = []
for product in products:
    separated = str(product).strip().split(',')
    # Only comma-separated descriptions are expected to end with a country
    if len(separated) > 1:
        words = str(product).strip().split(' ')
        # The origin is taken to be a capitalised word in the last or second-to-last position
        if words[-1].istitle():
            origins.append(words[-1].strip())
        elif words[-2].istitle():
            origins.append(words[-2].strip())
        else:
            origins.append('NONE')
    else:
        origins.append('NONE')

Putting the data into a DataFrame:

In [8]:
data = pd.DataFrame({'product':products, 'price':prices_clean, 'origin':origins})
data

| | product | price | origin |
|---|---|---|---|
| 0 | Напиток Actimel кисломолочный Гранат, 100 гр, ... | 210 | Казахстан |
| 1 | Молоко Родина 3,2% финпак, 1000мл, Казахстан | 305 | Казахстан |
| 2 | Вода Tassay негазированная 5000 мл, Казахстан | 480 | Казахстан |
| 3 | Яйцо Казгер-Құс премиум, 15шт, Казахстан | 740 | Казахстан |
| 4 | Напиток Actimel кисломолочный Клубника, 100 гр... | 210 | Казахстан |
| ... | ... | ... | ... |
| 205 | Оби-нан с кунжутом, 400гр, Казахстан | 160 | Казахстан |
| 206 | Сметана President 10% пластиковый стакан 400г... | 465 | Казахстан |
| 207 | Йогурт Actimel Гуава-эхинацея, 100 гр, Казахстан | 210 | Казахстан |
| 208 | Дрожжи Royal Food сухие, 80 гр, Казахстан | 280 | Казахстан |


In [9]:
data.origin.value_counts(dropna = False)

Казахстан     118
NONE           34
Россия         34
Узбекистан      4
Турция          3
Корея           2
Италия          2
Грузия          2
Казахстан,      1
Германия        1
(Италья)        1
Италья          1
Эквадор         1
Франция         1
Польша          1
Чехия           1
Литва,          1
Мексика         1
Кыргызстан      1
Name: origin, dtype: int64
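
The counts above include entries such as `Казахстан,`, `Литва,` and `(Италья)`, which differ from their clean forms only by punctuation. A possible refinement is to strip surrounding punctuation from each word before testing it. The sketch below is one way to do that, assuming a hypothetical function `extract_origin`; it mirrors the notebook's rule of checking the last two words of a comma-separated description. The second sample description is invented for illustration.

```python
import string


def extract_origin(description):
    """Return the last title-cased word among the final two, or 'NONE'."""
    text = str(description).strip()
    if ',' not in text:
        return 'NONE'
    # Strip punctuation such as trailing commas or surrounding parentheses
    words = [word.strip(string.punctuation) for word in text.split()]
    # Check the last word first, then the second-to-last
    for word in reversed(words[-2:]):
        if word.istitle():
            return word
    return 'NONE'


print(extract_origin('Молоко Родина 3,2% финпак, 1000мл, Казахстан'))  # Казахстан
print(extract_origin('Сыр твердый, 200 гр, (Италья)'))                 # Италья
print(extract_origin('Вода негазированная 5000 мл'))                   # NONE
```

Slicing with `words[-2:]` also avoids an `IndexError` when a description contains a single word.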

Exporting to Excel:

In [10]:
data.to_excel('galmart.kz.xlsx')