<p style="color: darkred; font-size: 30px; text-align: center;"><b>Comparative Web Scraping Analysis:</b> Extracting and Evaluating Product Data from Auchan and Biedronka Stores on Glovo</p>
<p style="color: darkred; font-size: 20px; text-align: center;">Webscraping and Social Media Scraping Class Project</p>
<p style="font-size: 15px; text-align: center;">Paula Gwanchele</p>
<p style="font-size: 10px; text-align: center;">Spring 2025</p>
<p align="center">
  <img src="logo_UW_WNE.jpg" alt="WNE Logo" width="398" height="53">
</p>

## Objective: 
Scrape product data from the Auchan store on Glovo and compare it with Biedronka's data to determine which store offers lower prices and a wider product range.
## Tools: 
Selenium (for dynamic content), BeautifulSoup (for parsing), Pandas (for data processing), Matplotlib/Seaborn (for visualizations).

# 1. Scraping using Selenium

In [4]:
# Selenium needs the browser's WebDriver and the libraries below to be installed.
#!pip install pandas numpy selenium webdriver-manager requests beautifulsoup4

In [5]:
# FOR DATA PROCESSING:
import pandas as pd
import numpy as np
import os  
import re
from itertools import zip_longest

# FOR MEASURING COMPUTATION TIME, CREATING FIXED DELAYS:
import time

# FOR APPLYING SELENIUM:
import selenium 
from selenium import webdriver
import requests  # For downloading images

# FOR WEB DRIVER:
from webdriver_manager.chrome import ChromeDriverManager 
from selenium.webdriver.chrome.service import Service 
from selenium.webdriver.support.ui import WebDriverWait 
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By

# FOR HTML PARSING:
from bs4 import BeautifulSoup  # BeautifulSoup parses HTML



**Step 1**: Open the Auchan store on the Glovo website with Selenium.

In [7]:
# Define Website URL
website = "https://glovoapp.com/pl/en/warsaw/auchan-waw"

# Initialize Selenium WebDriver
service_chrome = Service(ChromeDriverManager().install()) 
options_chrome = webdriver.ChromeOptions()
driver_chrome = webdriver.Chrome(service = service_chrome, options = options_chrome)

driver_chrome.maximize_window()
driver_chrome.get(website) #opens the website

# Handle Cookies
cookies_button_xpath = '''//button[@id='onetrust-accept-btn-handler']'''
try:
    WebDriverWait(driver_chrome, 10).until(
        EC.element_to_be_clickable((By.XPATH, cookies_button_xpath))
    ).click()
    print("Cookies accepted.")
except Exception:  # typically a timeout when the banner never appears
    print("No cookies banner found or already accepted.")

No cookies banner found or already accepted.


**Step 2:** Collect the links to all category subpages that list products.

In [9]:
start = time.time()
# Find all collection (category) links on the store page
tags = driver_chrome.find_elements(By.XPATH, "//a[@data-test-id='collection-link']")

# Collect the unique subpage URLs from the 'tags'
product_links = []

for tag in tags:
    href = tag.get_attribute("href")
    if href and href not in product_links:  # skip empty values and duplicates
        product_links.append(href)

print(f"✅ Collected {len(product_links)} product links.")
end = time.time()
print(end-start)

✅ Collected 132 product links.
2.134073257446289


In [10]:
product_links[0:3]

['https://glovoapp.com/pl/en/warsaw/auchan-waw/?content=prawdopodobnie-najlepsze-sc.23917251%2Fnajlepsze-piwa-c.23917252',
 'https://glovoapp.com/pl/en/warsaw/auchan-waw/?content=marki-auchan-sc.2284469%2Fnajtansze-c.2284480',
 'https://glovoapp.com/pl/en/warsaw/auchan-waw/?content=marki-auchan-sc.2284469%2Fauchan-c.2284489']
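
The list membership test in Step 2 works well for about a hundred links, but it rescans the list on every check. As a minimal alternative sketch, assuming the links are plain strings, `dict.fromkeys` removes duplicates while keeping first-seen order (dictionaries preserve insertion order since Python 3.7). The helper name `unique_in_order` and the sample URLs are illustrative and not part of the original notebook.

```python
def unique_in_order(hrefs):
    """Return hrefs without duplicates or empty values, keeping first-seen order."""
    # Dictionary keys are unique and keep insertion order (Python 3.7+)
    return list(dict.fromkeys(h for h in hrefs if h))


# Illustrative input: one duplicate and one missing href
sample = [
    "https://example.com/a",
    "https://example.com/b",
    "https://example.com/a",
    None,
]
print(unique_in_order(sample))
```

In the notebook this could be applied as `unique_in_order(tag.get_attribute("href") for tag in tags)`.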

**Step 3**: Visit each collected link, extract the product data, and store it.

In [12]:
all_products = []
all_prices = []
all_images = []

for link in product_links:
    try:
        driver_chrome.get(link)
        # Random pause of at least 3 s (about 4 s on average) so the page can render
        time.sleep(np.random.chisquare(1) + 3)

        # Extract product details
        product_elements = driver_chrome.find_elements(By.CLASS_NAME, "tile__description")
        product_prices = driver_chrome.find_elements(By.CLASS_NAME, "tile__price")
        product_images = driver_chrome.find_elements(By.XPATH, "//img[contains(@class, 'tile__image')]")

        # Keep only non-empty values. The three lists are filled independently,
        # so their lengths can differ if a tile lacks a price or an image.
        all_products.extend([product.text.strip() for product in product_elements if product.text.strip()])
        all_prices.extend([price.text.strip() for price in product_prices if price.text.strip()])
        all_images.extend([img.get_attribute("src") for img in product_images if img.get_attribute("src")])
    except Exception as e:
        # Report the failing page instead of skipping it silently
        print(f"Skipped {link}: {e}")
        continue

print(all_products[0:3])
print(all_prices[0:3])
print(all_images[0:3])

['Karmi - Piwo ciemne bezalkoholowe alk.do 0.5% obj. - 4 x 400 ml', 'Zatecky - Piwo bezalkoholowe pasteryzowane alk.0.0% obj. - 4 x 500 ml', 'Auchan - Ser mozzarella w zalewie solankowej - 100 g']
['18,72 zł', '16,75 zł', '2,79 zł']
['https://glovo.dhmedia.io/image/global-catalog-glovo/nv-global-catalog/sw/198c0bcd-c6dc-4c23-8346-67db10a574a9.jpg?t=W3siYXV0byI6eyJxIjoibG93In19LHsicmVzaXplIjp7IndpZHRoIjoxNTAsImhlaWdodCI6MTUwfX1d', 'https://glovo.dhmedia.io/image/global-catalog-glovo/nv-global-catalog/sw/4e8be438-dba7-49d2-81e0-65ad6fddb071.jpg?t=W3siYXV0byI6eyJxIjoibG93In19LHsicmVzaXplIjp7IndpZHRoIjoxNTAsImhlaWdodCI6MTUwfX1d', 'https://glovo.dhmedia.io/image/global-catalog-glovo/nv-global-catalog/sw/4774d8be-5caf-4e87-8998-fc12c50c8abb.jpg?t=W3siYXV0byI6eyJxIjoibG93In19LHsicmVzaXplIjp7IndpZHRoIjoxNTAsImhlaWdodCI6MTUwfX1d']
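
Step 3 gathers names, prices, and images into three separate lists, so a single product without a price or image shifts every later row. One possible way to keep each product's fields together is to read them from the same product card. The sketch below is only an illustration under assumptions: `extract_tile` and `first_or_none` are hypothetical helpers, and the function accepts any object with a Selenium-style `find_elements(by, value)` method. The strings `"class name"` and `"xpath"` are the values of Selenium's `By.CLASS_NAME` and `By.XPATH` constants.

```python
def first_or_none(elements):
    """Return the first element of a list, or None if the list is empty."""
    return elements[0] if elements else None


def extract_tile(tile):
    """Read name, price, and image URL from one product-card element.

    Missing parts become None, so every card yields exactly one row.
    """
    # "class name" is the value of selenium's By.CLASS_NAME
    name_el = first_or_none(tile.find_elements("class name", "tile__description"))
    price_el = first_or_none(tile.find_elements("class name", "tile__price"))
    # "xpath" is By.XPATH; the leading dot restricts the search to this card
    img_el = first_or_none(
        tile.find_elements("xpath", ".//img[contains(@class, 'tile__image')]")
    )
    return {
        "name": name_el.text.strip() if name_el else None,
        "price": price_el.text.strip() if price_el else None,
        "image": img_el.get_attribute("src") if img_el else None,
    }
```

With a live driver, the cards themselves still have to be located first. A container class such as `tile` is only a guess based on the `tile__*` class names and should be checked in the page source before use.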


**Step 4**: Download the product images

In [25]:
# Define the correct folder path
image_folder = r"C:\Users\Surface 4\OneDrive\Documents\Web&Social Media Scrapping\Project\glovo_images"
os.makedirs(image_folder, exist_ok=True)  # Ensure the folder exists

# Download and save images
for image_url in all_images:
    try:
        # Time out instead of hanging on a stalled request
        response = requests.get(image_url, timeout=10)
        if response.status_code == 200:
            img_name = os.path.basename(image_url.split("?")[0])  # file name without the query string
            image_path = os.path.join(image_folder, img_name)
            with open(image_path, "wb") as file:
                file.write(response.content)
        else:
            print(f"Image not downloaded (HTTP {response.status_code}): {image_url}")
    except requests.exceptions.RequestException as e:
        print(f"⚠️ Failed to download {image_url}: {e}")
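
The file name in Step 4 comes from cutting the URL at `?` and taking the base name. A slightly more robust sketch, using only the standard library, parses the URL first so that both the query string and any fragment are ignored. The helper name `image_filename` is hypothetical, and the query string in the sample URL is shortened to a placeholder.

```python
import posixpath
from urllib.parse import urlsplit


def image_filename(image_url):
    """Return the last path segment of a URL, ignoring query string and fragment."""
    path = urlsplit(image_url).path
    # URL paths always use forward slashes, so posixpath behaves the same on every OS
    return posixpath.basename(path)


# Path taken from the scraped output above; the query string is a placeholder
url = (
    "https://glovo.dhmedia.io/image/global-catalog-glovo/nv-global-catalog/sw/"
    "198c0bcd-c6dc-4c23-8346-67db10a574a9.jpg?t=placeholder"
)
print(image_filename(url))  # 198c0bcd-c6dc-4c23-8346-67db10a574a9.jpg
```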

**Step 5**: Save the data to a DataFrame and export it to CSV

In [34]:
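# Ensure all lists have the same length (pd.DataFrame needs equal-length columns)
max_length = max(len(all_products), len(all_prices), len(all_images))

# Pad shorter lists with None to make them equal.
# Note: padding only equalizes lengths; it does not realign rows if a product
# was missing a price or an image during scraping.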
all_products = all_products + [None] * (max_length - len(all_products))
all_prices = all_prices + [None] * (max_length - len(all_prices))
all_images = all_images + [None] * (max_length - len(all_images))

# Function to split "Category - Product name - Weight" into its three parts
def split_product_details(product_text):
    if not product_text:
        return None, None, None  # Handle empty values

    parts = product_text.split(" - ")  # Split by ' - '

    category_name = parts[0] if len(parts) > 0 else None
    # Everything between the first and last part is the product name
    product_name = " - ".join(parts[1:-1]) if len(parts) > 2 else (parts[1] if len(parts) > 1 else None)
    # The last part is the weight/size; with only two parts it equals the product name
    weight = parts[-1] if len(parts) > 1 else None

    return category_name, product_name, weight

# Process product details
category_names = []
product_names = []
weights = []

for product in all_products:
    category, name, weight = split_product_details(product)
    category_names.append(category)
    product_names.append(name)
    weights.append(weight)

# Create DataFrame
df = pd.DataFrame({
    "Category Name": category_names,
    "Product Name": product_names,
    "Weight/Size": weights,
    "Price": all_prices,
    "Image URL": all_images
})

# Save DataFrame to CSV
df.to_csv(r"C:\Users\Surface 4\OneDrive\Documents\Web&Social Media Scrapping\Project\glovo_auchan_products.csv", index=False, encoding="utf-8-sig")

print("✅ Data saved to glovo_auchan_products.csv successfully!")

# Close browser
driver_chrome.quit()

✅ Data saved to glovo_auchan_products.csv successfully!


In [28]:
df.head()

| | Category Name | Product Name | Weight/Size | Price | Image URL |
|---|---|---|---|---|---|
| 0 | Karmi | Piwo ciemne bezalkoholowe alk.do 0.5% obj. | 4 x 400 ml | 18,72 zł | https://glovo.dhmedia.io/image/global-catalog-... |
| 1 | Zatecky | Piwo bezalkoholowe pasteryzowane alk.0.0% obj. | 4 x 500 ml | 16,75 zł | https://glovo.dhmedia.io/image/global-catalog-... |
| 2 | Auchan | Ser mozzarella w zalewie solankowej | 100 g | 2,79 zł | https://glovo.dhmedia.io/image/global-catalog-... |
| 3 | Auchan | Ser gouda plastry | 150 g | 4,98 zł | https://glovo.dhmedia.io/image/global-catalog-... |
| 4 | Auchan | Ser Edamski | 150 g | 5,49 zł | https://glovo.dhmedia.io/image/global-catalog-... |


In [29]:
df.tail()

| | Category Name | Product Name | Weight/Size | Price | Image URL |
|---|---|---|---|---|---|
| 4039 | Auchan | Cienkopisy różne kolory 0.4 mm 10 kolorów | 10 sztuk | 15,99 zł | https://glovo.dhmedia.io/image/global-catalog-... |
| 4040 | Auchan | Kredki ołówkowe 12 kolorów | 12 sztuk | 6,99 zł | https://glovo.dhmedia.io/image/global-catalog-... |
| 4041 | Auchan | Długopis automatyczny 4 kolory | 4 sztuki | 8,99 zł | https://glovo.dhmedia.io/image/global-catalog-... |
| 4042 | Agrecol | Nawóz do storczyków Biohumus Forte | 500 ml | 4,99 zł | https://glovo.dhmedia.io/image/global-catalog-... |
| 4043 | Florovit | Płynny nawóz uniwersalny 1 kg | 1 kg | 18,99 zł | https://glovo.dhmedia.io/image/global-catalog-... |
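
The `Price` column holds strings such as `18,72 zł`, which cannot be compared or averaged directly. Before any price comparison, they would need converting to numbers. The sketch below shows one way to do this, assuming each string contains a single price with a comma or dot as the decimal separator. `parse_price` is a hypothetical helper, not part of the original notebook.

```python
import re


def parse_price(price_text):
    """Convert a string such as '18,72 zł' to a float; return None if no number is found."""
    if not price_text:
        return None
    # Digits, optional spaces between thousands, optional decimal part
    match = re.search(r"\d[\d\s]*(?:[.,]\d+)?", price_text)
    if not match:
        return None
    # Drop any whitespace and switch the decimal comma to a dot
    number = re.sub(r"\s", "", match.group(0)).replace(",", ".")
    return float(number)


for text in ["18,72 zł", "2,79 zł", None]:
    print(text, "->", parse_price(text))
```

Applied to the DataFrame, this could look like `df["Price"].map(parse_price)`, stored in a new numeric column.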
