# Web Scraping with BeautifulSoup in Python

Import the required libraries

BeautifulSoup documentation: https://www.crummy.com/software/BeautifulSoup/bs4/doc/

Pandas documentation: https://pandas.pydata.org/docs/getting_started/intro_tutorials/01_table_oriented.html



**First, connect Google Drive by clicking the third icon in the side menu. We need the connection to save the data at the end.**

In [1]:
import requests
from IPython.display import display, HTML
from bs4 import BeautifulSoup
import pandas as pd

Define the start URL for scraping the page

In [2]:
base_url = "https://www.setec.mk/"

In [3]:
response = requests.get(base_url)

In [4]:
response  # a status code of 200 means the page was fetched successfully

<Response [200]>

In [5]:
response.status_code

200
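
The manual status check can also be wrapped in a small helper. The sketch below is one possible way to do it and is not part of the original notebook: `fetch_html` is a hypothetical name and the 10-second timeout is an arbitrary assumption. It relies on `raise_for_status()` from `requests`, which raises an exception for 4xx and 5xx responses.

```python
import requests


def fetch_html(url: str, timeout: float = 10.0) -> str:
    """Download a page and return its HTML, raising on HTTP errors."""
    # A timeout keeps the call from waiting forever on an unresponsive server.
    response = requests.get(url, timeout=timeout)
    # raise_for_status() raises requests.HTTPError for 4xx and 5xx responses,
    # so an error page is never parsed by accident.
    response.raise_for_status()
    return response.text
```

With such a helper, `fetch_html(base_url)` would return the same string as `response.text`, or fail loudly if the request did not succeed.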

Display the fetched HTML in the notebook, without the site's styles

In [6]:
display(HTML(response.text))

In [7]:
raw_html = response.text  # the HTML as a string; the response object also holds the status code and other metadata

Parse the HTML using BeautifulSoup

This creates an object that lets us access HTML elements by tag name, CSS selector, and attribute value.

In [8]:
soup = BeautifulSoup(raw_html, "html.parser")

In [9]:
print(soup.prettify())

<!DOCTYPE html>
<html class="responsive" lang="mk">
 <head>
  <title>
   Сетек | сè од техника
  </title>
  <meta content="1200" property="og:image:width"/>
  <meta content="630" property="og:image:height"/>
  <meta content="image/jpeg" property="og:image:type"/>
  <meta content="Сетек | сè од техника, Сетек се од техника" property="og:title"/>
  <meta content="product" property="og:type"/>
  <meta content="https://www.setec.mk/image/catalog/Promo/setec_logo_modal.jpg" property="og:image"/>
  <meta content="Сетек | Сè од техника" name="description" property="og:description"/>
  <meta content="Laptops, Computers and IT, Notebooks, Tablets, Accessories, PC and equipment, Tv, Audio, Video, LED TV, LCD TV, Phones and Navigation, Mobile Phones and Accessories, White Goods, Washing, Dryers, Diswasher, Refrigerators, Freezers, Photo, Camera, Built in appliances, Small domestic appliances, Kitchen appliances, Personal and beauty care, Products for home use, Kitchen equipment, Heating and Cooli

* class selector: starts with '.' followed by the name of the class (example: '.sale')

* id selector: starts with '#' followed by the name of the id (example: '#id')

* tag name: the bare tag, for example 'p' (paragraph), 'a' (anchor, i.e. link), 'div' (division)
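
To see the three selector kinds side by side, here is a minimal sketch on a tiny HTML string. The markup and the variable names (`sample_html`, `sample`) are invented for illustration and are not taken from the scraped page.

```python
from bs4 import BeautifulSoup

# Hypothetical markup, made up for this example.
sample_html = """
<div id="offers">
  <p class="sale">-10%</p>
  <p class="sale">-20%</p>
  <a href="/tv">TVs</a>
</div>
"""
sample = BeautifulSoup(sample_html, "html.parser")

by_class = sample.select(".sale")            # class selector
by_id = sample.select("#offers")             # id selector
by_tag = sample.select("a")                  # tag name
combined = sample.select("#offers p.sale")   # <p class="sale"> elements inside #offers

print([el.text for el in by_class])  # ['-10%', '-20%']
```

Selectors can be combined, as in the last call, to narrow a match to elements nested inside another element.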

In [10]:
discounts = soup.select(".sale")  # get every element that has the class 'sale'

In [11]:
soup.select_one(".sale")  # get the first element that has the class 'sale'

<div class="sale">-45%</div>
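
The difference between `select` and `select_one` matters most when nothing matches. A short sketch on a one-element snippet (the `.missing` class is a made-up name that matches nothing):

```python
from bs4 import BeautifulSoup

snippet = BeautifulSoup('<div class="sale">-45%</div>', "html.parser")

matches = snippet.select(".sale")         # list containing one element
first = snippet.select_one(".sale")       # the element itself
no_matches = snippet.select(".missing")   # empty list, not an error
nothing = snippet.select_one(".missing")  # None

print(len(matches), first.text, len(no_matches), nothing)  # 1 -45% 0 None
```

Because `select_one` returns `None` for a missing element, calling `.text` on its result without a check would raise an `AttributeError`.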

In [12]:
soup.select("p")  # select all paragraphs

[<p class="close-menu"></p>,
 <p class="open-menu"></p>,
 <p class="arrow"></p>,
 <p class="close-menu"></p>,
 <p class="open-menu"></p>,
 <p class="arrow"></p>,
 <p class="close-menu"></p>,
 <p class="open-menu"></p>,
 <p class="arrow"></p>,
 <p class="close-menu"></p>,
 <p class="open-menu"></p>,
 <p class="arrow"></p>,
 <p class="close-menu"></p>,
 <p class="open-menu"></p>,
 <p class="arrow"></p>,
 <p class="close-menu"></p>,
 <p class="open-menu"></p>,
 <p class="arrow"></p>,
 <p class="close-menu"></p>,
 <p class="open-menu"></p>,
 <p class="arrow"></p>,
 <p class="close-menu"></p>,
 <p class="open-menu"></p>,
 <p class="arrow"></p>,
 <p class="close-menu"></p>,
 <p class="open-menu"></p>,
 <p class="arrow"></p>,
 <p class="close-menu"></p>,
 <p class="open-menu"></p>,
 <p class="arrow"></p>,
 <p><a href="http://midea.mk/?page_id=3425" style="border-bottom: 1px solid #e5e5e5;"><b>Мидеа VRF системи</b></a></p>,
 <p><a href="http://midea.mk/?page_id=8105">Внатрешни единици за VRF с

In [13]:
soup.find_all("a", {"target": "_blank"})  # find all links whose target attribute has the value _blank (target='_blank')

[<a class="clearfix" href="http://midea.mk/" target="_blank"><span><strong>Професионална Климатизација</strong></span></a>,
 <a href="http://midea.mk/" target="_blank"><img src="https://setec.mk/image/catalog/Meni/Midea-Chilers.jpg"/></a>,
 <a href="https://setec.mk/index.php?route=product/search&amp;search=FIELDMANN" target="_blank"><img src="https://setec.mk/image/catalog/Meni/FIELDMANN_29x292.jpg"/></a>,
 <a class="clearfix" href="" target="_blank"><span><strong>Оптички Уреди</strong></span></a>,
 <a class="clearfix" href="https://cc.st.mk/Account/SignIn.aspx" target="_blank"><span><strong>Cashback проверка</strong></span></a>,
 <a class="clearfix" href="index.php?route=information/information&amp;information_id=15" target="_blank"><span><strong>Кариера</strong></span></a>,
 <a class="clearfix" href="index.php?route=information/contact" target="_blank"><span><strong><img alt="" src="https://setec.mk/image/catalog/stationery2/icon-phone.png"/>Онлајн нарачка 02 30 80 899</strong></spa
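
The same attribute filter can also be written as a CSS attribute selector. The sketch below compares both forms on two invented links; the markup and the `example.com` URL are placeholders.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: one link opens in a new tab, the other does not.
links_html = """
<a href="https://example.com/a" target="_blank">External</a>
<a href="/b">Internal</a>
"""
links_soup = BeautifulSoup(links_html, "html.parser")

via_find_all = links_soup.find_all("a", {"target": "_blank"})  # attribute dictionary
via_select = links_soup.select('a[target="_blank"]')           # CSS attribute selector

print([a["href"] for a in via_find_all])  # ['https://example.com/a']
```

Both calls return the same single link here; which form to use is mostly a matter of taste.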

In [14]:
regular_price = soup.select(".price-old-new")  # select all elements having the class 'price-old-new'

In [15]:
regular_price

[<span class="price-old-new">39,999 Ден.</span>,
 <span class="price-old-new">49,999 Ден.</span>,
 <span class="price-old-new">40,999 Ден.</span>,
 <span class="price-old-new">45,999 Ден.</span>,
 <span class="price-old-new">159,999 Ден.</span>,
 <span class="price-old-new">64,999 Ден.</span>,
 <span class="price-old-new">74,999 Ден.</span>,
 <span class="price-old-new">27,999 Ден.</span>,
 <span class="price-old-new">39,999 Ден.</span>,
 <span class="price-old-new">43,999 Ден.</span>,
 <span class="price-old-new">28,999 Ден.</span>,
 <span class="price-old-new">19,999 Ден.</span>,
 <span class="price-old-new">28,999 Ден.</span>,
 <span class="price-old-new">42,999 Ден.</span>,
 <span class="price-old-new">27,999 Ден.</span>,
 <span class="price-old-new">28,999 Ден.</span>,
 <span class="price-old-new">10,999 Ден.</span>,
 <span class="price-old-new">11,999 Ден.</span>,
 <span class="price-old-new">4,999 Ден.</span>,
 <span class="price-old-new">12,999 Ден.</span>,
 <span class="price-

In [16]:
len(discounts)  # check the length of the list

110

In [17]:
len(regular_price)

110
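
Equal lengths alone do not guarantee that the two lists line up product by product. The following sketch, on an invented listing where one product has no discount badge, shows how page-level lists can pair values with the wrong product, and why selecting inside each product element is safer. The class names mirror the ones used in this notebook; the markup itself is hypothetical.

```python
from bs4 import BeautifulSoup

# Hypothetical listing: the second product has no discount badge.
listing_html = """
<div class="product"><span class="name">A</span><div class="sale">-10%</div></div>
<div class="product"><span class="name">B</span></div>
<div class="product"><span class="name">C</span><div class="sale">-30%</div></div>
"""
listing = BeautifulSoup(listing_html, "html.parser")

# Page-level selection: two badges for three products, so zip() gives C's badge to B.
names = [el.text for el in listing.select(".name")]
badges = [el.text for el in listing.select(".sale")]
misaligned = list(zip(names, badges))

# Per-product selection keeps each badge attached to its own product.
aligned = []
for item in listing.select(".product"):
    badge = item.select_one(".sale")
    aligned.append((item.select_one(".name").text, badge.text if badge else ""))

print(misaligned)  # [('A', '-10%'), ('B', '-30%')]
print(aligned)     # [('A', '-10%'), ('B', ''), ('C', '-30%')]
```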

In [18]:
products = soup.select(".product")  # select all elements having the class 'product'

In [19]:
product = products[0]

In [20]:
discount = product.select_one(".sale")  # select the first element having the class 'sale'

In [21]:
discount.text

'-45%'

In [22]:
product.select_one("a").get("href")  # find the first anchor element and get the value of its "href" attribute

'https://setec.mk/index.php?route=product/product&product_id=75937'
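
This link is absolute, but some `href` values on the page are relative (for example `index.php?route=information/contact` in the `target="_blank"` output). One way to normalise them, sketched here with the standard library's `urljoin`, is to resolve every link against the base URL. The variable names are chosen for this example only.

```python
from urllib.parse import urljoin

base = "https://www.setec.mk/"

# An absolute href comes back unchanged; a relative one is resolved against the base.
absolute = urljoin(base, "https://setec.mk/index.php?route=product/product&product_id=75937")
relative = urljoin(base, "index.php?route=information/contact")

print(relative)  # https://www.setec.mk/index.php?route=information/contact
```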

Define a dictionary of lists for creating a pandas DataFrame.

When several lists hold values that describe the same objects (and we do not have a class for them), it is good practice to group them in a dictionary, with the list names as keys and the lists as values.

In [23]:
prod_dict = {
    "discounts": [],
    "regular_prices": [],
    "names": [],
    "links": [],
}

In [24]:
prod_dict

{'discounts': [], 'regular_prices': [], 'names': [], 'links': []}

In [25]:
for product in products:
  discount = product.select_one(".sale")
  if discount:
    discount = discount.text
  else:
    discount = ""

  prod_dict['discounts'].append(discount)  # we access the value of an entry in the dictionary by providing the key in square brackets

  regular_price = product.select_one(".price-old-new")
  if regular_price:
    regular_price = regular_price.text
  else:
    regular_price = ""

  prod_dict['regular_prices'].append(regular_price)

  name = product.select_one(".name").text
  prod_dict['names'].append(name)
  link = product.select_one("a").get("href")
  prod_dict['links'].append(link)
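
The loop repeats the same "select, then fall back to an empty string" pattern for each field. A possible refactoring is sketched below. The helper names `text_or_empty` and `extract_products` are hypothetical, and the behaviour differs slightly from the loop above: text is stripped of surrounding whitespace, and a missing name or link yields an empty string instead of an error.

```python
from bs4 import BeautifulSoup


def text_or_empty(parent, selector: str) -> str:
    """Return the stripped text of the first match inside parent, or '' if none."""
    element = parent.select_one(selector)
    return element.get_text(strip=True) if element else ""


def extract_products(page_soup) -> dict:
    """Collect the same four columns as the loop above into a dictionary of lists."""
    rows = {"discounts": [], "regular_prices": [], "names": [], "links": []}
    for item in page_soup.select(".product"):
        rows["discounts"].append(text_or_empty(item, ".sale"))
        rows["regular_prices"].append(text_or_empty(item, ".price-old-new"))
        rows["names"].append(text_or_empty(item, ".name"))
        anchor = item.select_one("a")
        # Tag.get() accepts a default, used when the href attribute is absent.
        rows["links"].append(anchor.get("href", "") if anchor else "")
    return rows
```

Calling `extract_products(soup)` would then produce a dictionary with the same keys as `prod_dict`.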

In [27]:
prod_dict

{'discounts': ['-45%',
  '-30%',
  '-37%',
  '-21%',
  '-31%',
  '-28%',
  '-45%',
  '-43%',
  '-18%',
  '-32%',
  '-28%',
  '-25%',
  '-28%',
  '-42%',
  '-36%',
  '-47%',
  '-45%',
  '-25%',
  '-24%',
  '-27%',
  '-23%',
  '-22%',
  '-19%',
  '-12%',
  '-23%',
  '-31%',
  '-18%',
  '-39%',
  '-24%',
  '-25%',
  '-26%',
  '-32%',
  '-18%',
  '-30%',
  '-41%',
  '',
  '-23%',
  '-28%',
  '-15%',
  '-10%',
  '-16%',
  '-17%',
  '-13%',
  '-16%',
  '-16%',
  '-18%',
  '-16%',
  '-12%',
  '-11%',
  '-33%',
  '-31%',
  '-30%',
  '-17%',
  '-23%',
  '-17%',
  '-29%',
  '-29%',
  '-14%',
  '-21%',
  '-20%',
  '-26%',
  '-21%',
  '-17%',
  '-35%',
  '-43%',
  '-36%',
  '-30%',
  '-20%',
  '-16%',
  '-25%',
  '-29%',
  '-37%',
  '-13%',
  '-18%',
  '-23%',
  '-29%',
  '-27%',
  '-19%',
  '-40%',
  '-29%',
  '-25%',
  '-25%',
  '-32%',
  '-41%',
  '-30%',
  '-30%',
  '-33%',
  '-31%',
  '-41%',
  '-35%',
  '-35%',
  '-35%',
  '-35%',
  '-60%',
  '-63%',
  '-42%',
  '-34%',
  '-38%',
  '-41%',
 

In [28]:
len(prod_dict["discounts"]) == len(prod_dict["regular_prices"]) == len(prod_dict["names"]) == len(prod_dict["links"])  # check that all lists have the same length

True
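
The check matters because pandas cannot build a DataFrame from lists of different lengths. A small sketch with deliberately uneven, made-up lists:

```python
import pandas as pd

# Three names but only two discounts.
uneven = {"names": ["A", "B", "C"], "discounts": ["-10%", "-30%"]}

try:
    pd.DataFrame(uneven)
    failed = False
except ValueError as error:
    # pandas refuses to build columns of different lengths from plain lists.
    failed = True
    print(f"Could not build the DataFrame: {error}")
```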

In [30]:
# we can create a dataframe by passing the dictionary of lists
# the keys will be the column names and each list will represent the values of the column
data = pd.DataFrame(prod_dict)

In [31]:
data

| | discounts | regular_prices | names | links |
|---|---|---|---|---|
| 0 | -45% | 39,999 Ден. | TCL 58P635 | https://setec.mk/index.php?route=product/produ... |
| 1 | -30% | 49,999 Ден. | SAMSUNG UE55CU8072UXXH | https://setec.mk/index.php?route=product/produ... |
| 2 | -37% | 40,999 Ден. | SAMSUNG UE55AU7092UXXH | https://setec.mk/index.php?route=product/produ... |
| 3 | -21% | 45,999 Ден. | SAMSUNG UE65AU7092UXXH | https://setec.mk/index.php?route=product/produ... |
| 4 | -31% | 159,999 Ден. | SONY XR65A75KAEP | https://setec.mk/index.php?route=product/produ... |
| ... | ... | ... | ... | ... |
| 106 | -33% | 14,999 Ден. | XIAOMI Mi Robot Vacuum E10 | https://setec.mk/index.php?route=product/produ... |
| 107 | -48% | 49,999 Ден. | Midea Breezeless E CB1-12HRFN8-I / CB1-12HFNX-O | https://setec.mk/index.php?route=product/produ... |
| 108 | -21% | 11,999 Ден. | ST FOREST PLUS | https://setec.mk/index.php?route=product/produ... |
| 109 | -45% | 1,999 Den. | ST AH-281C | https://setec.mk/index.php?route=product/produ... |
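
The discount and price columns are still strings. If numbers are needed for sorting or arithmetic, one possible cleaning step is sketched below on three sample rows in the same format. The new column names `price` and `discount_pct` are assumptions made for this example.

```python
import pandas as pd

# Three rows in the same shape as the scraped table; the last one has no discount.
raw = pd.DataFrame({
    "discounts": ["-45%", "-30%", ""],
    "regular_prices": ["39,999 Ден.", "159,999 Ден.", "1,999 Ден."],
})

clean = raw.copy()
# Keep digits only ("39,999 Ден." -> "39999"), then convert to numbers.
clean["price"] = pd.to_numeric(
    clean["regular_prices"].str.replace(r"\D", "", regex=True), errors="coerce"
)
# Drop the percent sign ("-45%" -> "-45"); the empty string becomes NaN.
clean["discount_pct"] = pd.to_numeric(
    clean["discounts"].str.rstrip("%"), errors="coerce"
)

print(clean[["price", "discount_pct"]])
```

With `errors="coerce"`, any value that cannot be parsed becomes `NaN` instead of stopping the conversion.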


In [32]:
# save the dataframe as a csv somewhere in Google Drive
data.to_csv("/content/drive/MyDrive/VNP-2023 24 - Milena/Auditoriski/g1/data.csv")
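
With the default settings, `to_csv` also writes the row index as an unnamed first column, which pandas labels `Unnamed: 0` when the file is read back. A minimal sketch of saving without the index follows. It uses a temporary folder instead of Google Drive so that it runs anywhere, and the one-row frame is sample data.

```python
import os
import tempfile

import pandas as pd

frame = pd.DataFrame({"names": ["TCL 58P635"], "regular_prices": ["39,999 Ден."]})

with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, "data.csv")
    # index=False leaves the row index out of the file, so reading it back
    # does not produce an extra "Unnamed: 0" column.
    frame.to_csv(path, index=False, encoding="utf-8")
    restored = pd.read_csv(path, encoding="utf-8")

print(list(restored.columns))  # ['names', 'regular_prices']
```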