# Webscraper Albert Heijn

### Product information
The following details for each product (if detail is available) will be sraped from the website:
* Date (= at which time in the year the product is available and/or in 'bonus')
* Product id
* Product title
* Product category
* In bonus (yes/no)
* Price
* Quantity (= at which product is offered)
* Price per KG
* Bonus type
* Bonus price
* Nutri-score
* Characteristics
* Allergy information
* Nutrition values
* Same, same but different SKU of a particular product
* Product that are recommended (others bought)

### 1.1 Workflow

## 2. Set-up 

### 2.1 Install Selenium and Chromedrive

In [3]:
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from webdriver_manager.chrome import ChromeDriverManager
PATH = "C:\\Users\\luukv\\anaconda3\\chromedriver.exe"
driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()))

driver.get("https://www.ah.nl")




[WDM] - Current google-chrome version is 102.0.5005
[WDM] - Get LATEST chromedriver version for 102.0.5005 google-chrome
[WDM] - Driver [C:\Users\luukv\.wdm\drivers\chromedriver\win32\102.0.5005.61\chromedriver.exe] found in cache


In [None]:
PATH = "C:\\Users\\luukv\\.wdm\\drivers\\chromedriver\\win32\\102.0.5005.61\\chromedriver.exe"
driver = webdriver.Chrome(PATH)
driver.get("https://www.ah.nl")

### 2.2 Import packages

In [4]:
import requests
from bs4 import BeautifulSoup
from time import sleep
import re
from datetime import datetime
import csv 
import pandas as pd
import json

## 3. Scrape Program

### 3.1 Define base url and generate product category urls

In [5]:
base_url = "https://www.ah.nl/producten"

def generate_product_category_urls(base_url):
    
    get_driver = driver.get(base_url)
    res = driver.page_source.encode("utf-8")
    soup = BeautifulSoup(res, "html.parser")
    product_categories = soup.find_all(class_="taxonomy-card_title__JMJ3-")
    
    product_category_urls = []
    
    for product_category in product_categories: # where 'pc' stands for 'product_category'
        title = product_category.find("a").attrs["title"].replace(" ", "-").replace("--", "-").replace(",", "").lower()
        url = base_url + product_category.find("a").attrs["href"].replace("producten/", "") + "?page="
        product_category_urls.append({"product_category": title,
                                      "url": url})  
            
    sleep(1)
    
    return product_category_urls

product_category_urls = generate_product_category_urls(base_url)

product_category_urls

[{'product_category': 'aardappel-groente-fruit',
  'url': 'https://www.ah.nl/producten/aardappel-groente-fruit?page='},
 {'product_category': 'salades-pizza-maaltijden',
  'url': 'https://www.ah.nl/producten/salades-pizza-maaltijden?page='},
 {'product_category': 'vlees-kip-vis-vega',
  'url': 'https://www.ah.nl/producten/vlees-kip-vis-vega?page='},
 {'product_category': 'kaas-vleeswaren-tapas',
  'url': 'https://www.ah.nl/producten/kaas-vleeswaren-tapas?page='},
 {'product_category': 'zuivel-plantaardig-en-eieren',
  'url': 'https://www.ah.nl/producten/zuivel-plantaardig-en-eieren?page='},
 {'product_category': 'bakkerij-en-banket',
  'url': 'https://www.ah.nl/producten/bakkerij-en-banket?page='},
 {'product_category': 'ontbijtgranen-en-beleg',
  'url': 'https://www.ah.nl/producten/ontbijtgranen-en-beleg?page='},
 {'product_category': 'snoep-koek-chips-en-chocolade',
  'url': 'https://www.ah.nl/producten/snoep-koek-chips-en-chocolade?page='},
 {'product_category': 'tussendoortjes',
  

### 3.2 Identify and generate max page number per product category

In [6]:
def generate_max_page_url(product_category_urls):
    
    max_page_urls = []
    
    for product_category in product_category_urls:
        url = product_category["url"] + "0"
        print("Scraping for:", url)
        get_driver = driver.get(url)
        res = driver.page_source.encode("utf-8") 
        soup = BeautifulSoup(res, "html.parser")
        load_max_pages = soup.find(class_="load-more_root__9MiHC").text
        load_max_pages = load_max_pages.split()
        
        results = []
        
        for char in load_max_pages:
            char = char.strip()
            if char.isdigit():
                results.append(char)
            else:
                continue
                
        tot_pages = int(round(int(results[1]) / int(results[0]), 0)) + 1
                
        max_product_category_url = product_category["url"] + str(tot_pages)
        max_page_urls.append({"product_category": product_category['product_category'],
                              "url": max_product_category_url})
            
        sleep(2)
    
    return max_page_urls
        
product_category_max_page_urls = generate_max_page_url(product_category_urls)

product_category_max_page_urls

Scraping for: https://www.ah.nl/producten/aardappel-groente-fruit?page=0
Scraping for: https://www.ah.nl/producten/salades-pizza-maaltijden?page=0
Scraping for: https://www.ah.nl/producten/vlees-kip-vis-vega?page=0
Scraping for: https://www.ah.nl/producten/kaas-vleeswaren-tapas?page=0
Scraping for: https://www.ah.nl/producten/zuivel-plantaardig-en-eieren?page=0
Scraping for: https://www.ah.nl/producten/bakkerij-en-banket?page=0
Scraping for: https://www.ah.nl/producten/ontbijtgranen-en-beleg?page=0
Scraping for: https://www.ah.nl/producten/snoep-koek-chips-en-chocolade?page=0
Scraping for: https://www.ah.nl/producten/tussendoortjes?page=0
Scraping for: https://www.ah.nl/producten/frisdrank-sappen-koffie-thee?page=0
Scraping for: https://www.ah.nl/producten/wijn-en-bubbels?page=0
Scraping for: https://www.ah.nl/producten/bier-en-aperitieven?page=0
Scraping for: https://www.ah.nl/producten/pasta-rijst-en-wereldkeuken?page=0
Scraping for: https://www.ah.nl/producten/soepen-sauzen-kruiden-

[{'product_category': 'aardappel-groente-fruit',
  'url': 'https://www.ah.nl/producten/aardappel-groente-fruit?page=33'},
 {'product_category': 'salades-pizza-maaltijden',
  'url': 'https://www.ah.nl/producten/salades-pizza-maaltijden?page=19'},
 {'product_category': 'vlees-kip-vis-vega',
  'url': 'https://www.ah.nl/producten/vlees-kip-vis-vega?page=28'},
 {'product_category': 'kaas-vleeswaren-tapas',
  'url': 'https://www.ah.nl/producten/kaas-vleeswaren-tapas?page=43'},
 {'product_category': 'zuivel-plantaardig-en-eieren',
  'url': 'https://www.ah.nl/producten/zuivel-plantaardig-en-eieren?page=31'},
 {'product_category': 'bakkerij-en-banket',
  'url': 'https://www.ah.nl/producten/bakkerij-en-banket?page=27'},
 {'product_category': 'ontbijtgranen-en-beleg',
  'url': 'https://www.ah.nl/producten/ontbijtgranen-en-beleg?page=50'},
 {'product_category': 'snoep-koek-chips-en-chocolade',
  'url': 'https://www.ah.nl/producten/snoep-koek-chips-en-chocolade?page=74'},
 {'product_category': 'tus

### 3.3 Generate product urls for each product category

In [53]:
'''
Because of demonstration purposes and time limitations, only the first two product categories
will be scraped to generate all product urls for that particular category!
'''

demo_product_category_urls = product_category_max_page_urls[7:8]          

def extract_product_urls(base_url, product_category_max_page_urls):
    
    product_urls = []
    
    for product_category in product_category_max_page_urls: 
        url = product_category["url"]
        print("Scraping for:", url)
        get_driver = driver.get(url)
        res = driver.page_source.encode("utf-8") 
        soup = BeautifulSoup(res, "html.parser")
        products = soup.find_all(attrs={"data-testhook": "product-card"})

        for product in products:
            try:
                title = product.find_all('div')[0].find("a").attrs["title"]
                url = base_url + product.find_all('div')[0].find("a").attrs["href"].replace("producten/", "")
                theme = product.find_all("button")[2].attrs["theme"]
                in_bonus = ""
                if theme == "bonus":
                    in_bonus = "Yes"
                else:
                    in_bonus = "No"
                    
                product_urls.append({"product_category": product_category["product_category"],
                                     "product_title": title,
                                     "url": url,
                                     "in_bonus": in_bonus})
            except:
                continue 
            
        sleep(5)
            
    return product_urls

product_urls = extract_product_urls(base_url, demo_product_category_urls)

product_urls

Scraping for: https://www.ah.nl/producten/snoep-koek-chips-en-chocolade?page=74


[{'product_category': 'snoep-koek-chips-en-chocolade',
  'product_title': "Lay's Paprika flavour",
  'url': 'https://www.ah.nl/producten/product/wi193679/lay-s-paprika-flavour',
  'in_bonus': 'No'},
 {'product_category': 'snoep-koek-chips-en-chocolade',
  'product_title': "Lay's Naturel",
  'url': 'https://www.ah.nl/producten/product/wi193680/lay-s-naturel',
  'in_bonus': 'No'},
 {'product_category': 'snoep-koek-chips-en-chocolade',
  'product_title': 'AH Roomboter stroopwafels',
  'url': 'https://www.ah.nl/producten/product/wi142548/ah-roomboter-stroopwafels',
  'in_bonus': 'No'},
 {'product_category': 'snoep-koek-chips-en-chocolade',
  'product_title': 'AH Ribbelchips naturel',
  'url': 'https://www.ah.nl/producten/product/wi448475/ah-ribbelchips-naturel',
  'in_bonus': 'No'},
 {'product_category': 'snoep-koek-chips-en-chocolade',
  'product_title': 'AH Chocolate chip cookies',
  'url': 'https://www.ah.nl/producten/product/wi221017/ah-chocolate-chip-cookies',
  'in_bonus': 'No'},
 {'

### 3.4 Iterate through all product urls and extract product information

In [55]:
demo_product_urls = product_urls

def extract_product_info(product_urls):
    
    product_data = []
    
    f = open("albert_heijn_product_data.json", "w", encoding = "utf-8")
    
    for product in product_urls:
        url = product["url"]
        print('Scraping for:', url)
        get_driver = driver.get(url)
        res = driver.page_source.encode("utf-8") 
        soup = BeautifulSoup(res, "html.parser")
                      
        try:
            # Extract product_id from product_url
            product_id = re.findall("\d+", url)[0]
            
            # Retrieve product_price 
            if soup.find_all(attrs={"data-testhook": "product-card"})[0].find(attrs={"data-testhook": "price-amount"}) != None:
                product_price = float(soup.find_all(attrs={"data-testhook": "product-card"})[0].find_all(attrs={"data-testhook": "price-amount"})[0].text)
            else:
                product_price = "NA"
                
            if soup.find_all(attrs={"data-testhook": "product-card"})[0].find(class_="product-card-header_unitInfo__2ncbP") != None:
                product_quantity = soup.find_all(attrs={"data-testhook": "product-card"})[0].find(class_="product-card-header_unitInfo__2ncbP").text#.split("P")[0] # splitting text by character 'P' (eg. 'per stukPrijs per kg') and indexing the first element
                product_quantity = re.split("P|N", product_quantity)[0]
                is_price_in_kg = soup.find_all(attrs={"data-testhook": "product-card"})[0].find(class_="product-card-header_unitInfo__2ncbP").text.split() # splitting text by whitespaces (eg. ['Prijs', 'per', 'KG', ..., '7,92'])
                if len(is_price_in_kg) > 2 and ("KG" in is_price_in_kg and "€" in is_price_in_kg): # checking if product page contains price in kg info
                    product_price_in_kg = float(soup.find_all(attrs={"data-testhook": "product-card"})[0].find(class_="product-card-header_unitInfo__2ncbP").text.split()[-1].replace(",", ".")) # 
                else:
                    product_price_in_kg = 'NA'
            else:
                product_quantity = "NA"
                product_price_in_kg = 'NA'
                
            if soup.find_all(attrs={"data-testhook": "product-card"})[0].find(attrs={"data-testhook": "promo-sticker"}) != None:
                product_bonus_type = soup.find_all(attrs={"data-testhook": "product-card"})[0].find(attrs={"data-testhook": "promo-sticker"}).text.replace("\n", ' ')
                if len(soup.find_all(attrs={"data-testhook": "product-card"})[0].find_all(attrs={"data-testhook": "price-amount"})) > 1:
                    product_bonus_price = soup.find_all(attrs={"data-testhook": "product-card"})[0].find_all(attrs={"data-testhook": "price-amount"})[1].text
                else:
                    product_bonus_price = "NA"
            else:
                product_bonus_type = "NA"
                product_bonus_price = "NA"
                
            if soup.find_all(attrs={"data-testhook": "product-card"})[0].find(class_="nutriscore_root__cYcXV product-card-hero_nutriscore__1g_JA") != None:
                product_nutri_score = soup.find_all(attrs={"data-testhook": "product-card"})[0].find(class_="nutriscore_root__cYcXV product-card-hero_nutriscore__1g_JA").find("title").text.split()[-1]
            else:
                product_nutri_score = "NA"
               
            # Characteristics
            if soup.find_all(attrs={"class": "product-info-content-block"}) != None:
                for content_block in soup.find_all(attrs={"class": "product-info-content-block"}):
                    content_text = content_block.text
                    content_search = re.search('Kenmerken', content_text)
                    if content_search:       
                        characteristics = content_block.find_all(attrs={"class": "product-info-icons_root__1ZWKc"})[0].find_all('p')
                        characteristics_list = []
                        for p in characteristics:
                            characteristics_list.append(p.text)  
                        break
                    else:
                        characteristics_list = "NA"
            else:
                characteristics_list = "NA"
            
            # Allergies
            if soup.find_all(attrs={"class": "product-info-content-block"}) != None:
                for content_block in soup.find_all(attrs={"class": "product-info-content-block"}):
                    content_text = content_block.text
                    content_search = re.search('Allergie-informatie', content_text)
                    if content_search:                      
                        allergies = content_block.find_all('dd')
                        allergies_string = ""
                        for allergy in allergies:
                            allergies_string += (allergy.text + ",").replace(", ", ",")
                        allergies_list = allergies_string.strip(',').split(',')
                        break
                    else:
                        allergies_list = "NA"
            else:
                allergies_list = "NA" 
            
            # Nutrition table
            if soup.find(attrs={"class": "product-info-nutrition_table__1PDio"}) != None:
                nutritions_json = {}
                for nutrition_header in soup.find_all(attrs={"class": "product-info-nutrition_table__1PDio"})[0].find('thead').find_all('tr'):
                    nutritions_json[nutrition_header.find_all('th')[0].text] = nutrition_header.find_all('th')[1].text
                for nutrition_value in soup.find_all(attrs={"class": "product-info-nutrition_table__1PDio"})[0].find('tbody').find_all('tr'):
                    nutritions_json[nutrition_value.find_all('td')[0].text] = nutrition_value.find_all('td')[1].text
            else:
                nutritions_json = "NA"     
            
            if soup.find_all(attrs={'class': 'product-recommendations_root__1LK53'}) != None:
                for content in soup.find_all(attrs={'class': 'product-recommendations_root__1LK53'}):
                    variation_text = content.text
                    variation_search = re.search('soorten', variation_text)
                    if variation_search:
                        product_variation_ids = []
                        for variation in content.find_all(attrs={"data-testhook": "product-card"}):
                            variation_url = variation.find_all('div')[0].find("a").attrs["href"]
                            product_variation_id = re.findall("\d+", variation_url)[0]
                            product_variation_ids.append(product_variation_id)
                        break
                    else:
                        product_variation_ids ="NA"
            else:
                product_variation_ids = "NA"
                
            if soup.find_all(attrs={'class': 'product-recommendations_root__1LK53'}) != None:
                for content in soup.find_all(attrs={'class': 'product-recommendations_root__1LK53'}):
                    recommendation_text = content.text
                    recommendation_search = re.search('kochten', recommendation_text)
                    if recommendation_search:
                        product_recommendation_ids = []
                        for recommendation in content.find_all(attrs={"data-testhook": "product-card"}):
                            recommendation_url = recommendation.find_all('div')[0].find("a").attrs["href"]
                            product_recommendation_id = re.findall("\d+", recommendation_url)[0]
                            product_recommendation_ids.append(product_recommendation_id)
                        break
                    else:
                        product_recommendation_ids ="NA"
            else:
                product_recommendation_ids = "NA"
        
            product_unique_id = {"product_id": product_id}
            
            if product_unique_id not in product_data:
                date_time = datetime.now().replace(microsecond = 0)
                date_time = date_time.strftime("%d/%m/%Y")
                product_data_json = {"date_time": date_time,
                                     "product_id": product_id,
                                     "product_title": product["product_title"],
                                     "product_category": product["product_category"],
                                     "in_bonus": product["in_bonus"],
                                     "price": product_price,
                                     "quantity": product_quantity,
                                     "price_in_kg": product_price_in_kg,
                                     "bonus_type": product_bonus_type,
                                     "bonus_price": product_bonus_price,
                                     "nutri_score": product_nutri_score,
                                     "characteristics": characteristics_list,
                                     "allergies": allergies_list,
                                     "nutritions": nutritions_json,
                                     "product_variation_ids": product_variation_ids,
                                     "product_recommmend_ids": product_recommendation_ids}
                
                product_data.append(product_data_json)
                f.write(json.dumps(product_data_json) + "\n")
        
        except:
            continue
        
        sleep(1)
    
    f.close()
    
    return product_data
        
product_data = extract_product_info(demo_product_urls)

product_data

Scraping for: https://www.ah.nl/producten/product/wi193679/lay-s-paprika-flavour
Scraping for: https://www.ah.nl/producten/product/wi193680/lay-s-naturel
Scraping for: https://www.ah.nl/producten/product/wi142548/ah-roomboter-stroopwafels
Scraping for: https://www.ah.nl/producten/product/wi448475/ah-ribbelchips-naturel
Scraping for: https://www.ah.nl/producten/product/wi221017/ah-chocolate-chip-cookies
Scraping for: https://www.ah.nl/producten/product/wi3640/lu-bastogne-orginal
Scraping for: https://www.ah.nl/producten/product/wi511002/ah-kristalsuiker
Scraping for: https://www.ah.nl/producten/product/wi2432/lu-mini-crackers-olijfolie-en-oregano
Scraping for: https://www.ah.nl/producten/product/wi457644/lay-s-oven-baked-roasted-paprika-flavour
Scraping for: https://www.ah.nl/producten/product/wi458839/ah-cashewnoten-ongezouten
Scraping for: https://www.ah.nl/producten/product/wi387615/cheetos-nibb-it-sticks
Scraping for: https://www.ah.nl/producten/product/wi229807/lay-s-bugles-nacho-c

Scraping for: https://www.ah.nl/producten/product/wi168710/verkade-maria
Scraping for: https://www.ah.nl/producten/product/wi523311/ah-zoute-knabbel-pakket
Scraping for: https://www.ah.nl/producten/product/wi447956/studenten-haver-ongezouten
Scraping for: https://www.ah.nl/producten/product/wi165965/oreo-original
Scraping for: https://www.ah.nl/producten/product/wi517313/ah-confettini-s
Scraping for: https://www.ah.nl/producten/product/wi517477/ah-roomboter-gevulde-koeken
Scraping for: https://www.ah.nl/producten/product/wi127297/knoppers-melk-hazelnootwafel
Scraping for: https://www.ah.nl/producten/product/wi517490/ah-roomboter-zandkoekjes
Scraping for: https://www.ah.nl/producten/product/wi397898/ah-ongebrande-cashewnoten
Scraping for: https://www.ah.nl/producten/product/wi3706/ah-suikerklontjes
Scraping for: https://www.ah.nl/producten/product/wi517312/ah-digestive-0-suiker
Scraping for: https://www.ah.nl/producten/product/wi477621/duyvis-borrelnootjes-provencale
Scraping for: https

Scraping for: https://www.ah.nl/producten/product/wi517550/jules-destrooper-natuurboterwafels
Scraping for: https://www.ah.nl/producten/product/wi508438/ah-kaneelbroodjes
Scraping for: https://www.ah.nl/producten/product/wi517426/ah-tiffins-gezouten-karamel
Scraping for: https://www.ah.nl/producten/product/wi518815/body-en-fit-perfection-bar-crunchy-chocolate-cookie
Scraping for: https://www.ah.nl/producten/product/wi450473/peijnenburg-zero-gesneden
Scraping for: https://www.ah.nl/producten/product/wi168713/verkade-san-francisco-naturel
Scraping for: https://www.ah.nl/producten/product/wi196329/ah-fourre-vanillesmaak
Scraping for: https://www.ah.nl/producten/product/wi169635/ah-bokkenpootjes
Scraping for: https://www.ah.nl/producten/product/wi413153/lay-s-mixups-paprika
Scraping for: https://www.ah.nl/producten/product/wi474778/venco-schoolkrijt
Scraping for: https://www.ah.nl/producten/product/wi177119/smint-peppermint-xl-sugarfree-2-pack
Scraping for: https://www.ah.nl/producten/prod

Scraping for: https://www.ah.nl/producten/product/wi229282/peijnenburg-ontbijtkoek-parelkandij-ongesneden
Scraping for: https://www.ah.nl/producten/product/wi211349/ah-bitterkoekjes
Scraping for: https://www.ah.nl/producten/product/wi438890/brinky-choco-fourre-volkoren
Scraping for: https://www.ah.nl/producten/product/wi447643/ah-luchtige-pinda-flips
Scraping for: https://www.ah.nl/producten/product/wi384703/lu-prince-ministars-black-en-white
Scraping for: https://www.ah.nl/producten/product/wi508489/ah-mini-crackers-olijfolie-en-oregano
Scraping for: https://www.ah.nl/producten/product/wi60106/ah-tortilla-chips-nacho-cheese-flavour
Scraping for: https://www.ah.nl/producten/product/wi457694/bolletje-goed-bezig-stevige-havermoutreep-naturel
Scraping for: https://www.ah.nl/producten/product/wi1016/delicata-reep-melk
Scraping for: https://www.ah.nl/producten/product/wi451273/haribo-bananas
Scraping for: https://www.ah.nl/producten/product/wi457646/lay-s-oven-baked-mediterranean-herbs
Scra

Scraping for: https://www.ah.nl/producten/product/wi30182/ah-haverkoekjes
Scraping for: https://www.ah.nl/producten/product/wi188031/chio-kaseln-emmentaler
Scraping for: https://www.ah.nl/producten/product/wi140736/ah-walnoten-ongezouten
Scraping for: https://www.ah.nl/producten/product/wi375962/ah-amandelen-met-rooksmaak
Scraping for: https://www.ah.nl/producten/product/wi402618/ritter-sport-melk-hazelnoot
Scraping for: https://www.ah.nl/producten/product/wi211975/milka-reep-melk-choco-biscuit
Scraping for: https://www.ah.nl/producten/product/wi142553/ah-roomboter-boterkoekreepjes
Scraping for: https://www.ah.nl/producten/product/wi517004/cheetos-nibbit-sticks-party-pack
Scraping for: https://www.ah.nl/producten/product/wi129564/ah-galette-wafels
Scraping for: https://www.ah.nl/producten/product/wi436457/lay-s-bugles-mixups-cheese-flavour
Scraping for: https://www.ah.nl/producten/product/wi437574/santa-maria-nacho-tortilla-chips
Scraping for: https://www.ah.nl/producten/product/wi1687

Scraping for: https://www.ah.nl/producten/product/wi52896/dr-oetker-backin-bakpoeder
Scraping for: https://www.ah.nl/producten/product/wi407177/ah-mueslireep-naturel
Scraping for: https://www.ah.nl/producten/product/wi516959/lay-s-max-double-crunch-red-sweet-chilli
Scraping for: https://www.ah.nl/producten/product/wi470172/ah-biologisch-zonnebloempitten
Scraping for: https://www.ah.nl/producten/product/wi520072/lay-s-naturel-zout-chips-borrel-duo-pack
Scraping for: https://www.ah.nl/producten/product/wi54149/ah-vanillesuiker
Scraping for: https://www.ah.nl/producten/product/wi142550/ah-choc-chip-cookies-triple-chocolate
Scraping for: https://www.ah.nl/producten/product/wi168249/ah-roze-koeken
Scraping for: https://www.ah.nl/producten/product/wi480706/red-band-winegummix
Scraping for: https://www.ah.nl/producten/product/wi492999/milka-cookie-sensations-oreo
Scraping for: https://www.ah.nl/producten/product/wi214065/chio-popcorn-salt
Scraping for: https://www.ah.nl/producten/product/wi52

Scraping for: https://www.ah.nl/producten/product/wi460468/ah-robuuste-chips-zeezout
Scraping for: https://www.ah.nl/producten/product/wi191264/twix-chocolade-reep-5-pack
Scraping for: https://www.ah.nl/producten/product/wi202155/ah-ontbijtkoekreep-naturel
Scraping for: https://www.ah.nl/producten/product/wi493562/m-en-m-s-salted-caramel
Scraping for: https://www.ah.nl/producten/product/wi88649/celebrations-assortiments-mix
Scraping for: https://www.ah.nl/producten/product/wi228598/dr-oetker-vanille-aroma
Scraping for: https://www.ah.nl/producten/product/wi374114/verkade-cafe-noir-bitesize
Scraping for: https://www.ah.nl/producten/product/wi161499/ah-amandelschaafsel
Scraping for: https://www.ah.nl/producten/product/wi194556/ah-ontbijtkoekreep-met-minder-suiker
Scraping for: https://www.ah.nl/producten/product/wi3200/merci-finest-selection
Scraping for: https://www.ah.nl/producten/product/wi407156/trek-multipack-cocoa-oat
Scraping for: https://www.ah.nl/producten/product/wi477606/lay-s

Scraping for: https://www.ah.nl/producten/product/wi375968/haust-snack-cups-ovaal
Scraping for: https://www.ah.nl/producten/product/wi480720/venco-droptoppers-krakend-en-zacht
Scraping for: https://www.ah.nl/producten/product/wi129447/smint-mints-sugarfree-2-pack
Scraping for: https://www.ah.nl/producten/product/wi376167/bonne-maman-la-madeleine
Scraping for: https://www.ah.nl/producten/product/wi55232/cote-d-or-l-original-reep-puur
Scraping for: https://www.ah.nl/producten/product/wi167628/ah-duo-wafels
Scraping for: https://www.ah.nl/producten/product/wi196103/lay-s-paprika-flavour
Scraping for: https://www.ah.nl/producten/product/wi468146/ah-sumo-mix-3v5
Scraping for: https://www.ah.nl/producten/product/wi486794/klene-zoete-mix-drop-suikervrij
Scraping for: https://www.ah.nl/producten/product/wi159495/cote-d-or-bonbonbloc-chocolade-reep-praline-puur
Scraping for: https://www.ah.nl/producten/product/wi1390/cote-d-or-chokotoff
Scraping for: https://www.ah.nl/producten/product/wi1426/b

Scraping for: https://www.ah.nl/producten/product/wi515672/nestle-mini-s
Scraping for: https://www.ah.nl/producten/product/wi395080/verkade-glutenvrije-oaties
Scraping for: https://www.ah.nl/producten/product/wi420734/molensteen-amandelmeel
Scraping for: https://www.ah.nl/producten/product/wi450269/nakd-fruitreep-met-noten-salted-caramel
Scraping for: https://www.ah.nl/producten/product/wi962/ah-pepermunt
Scraping for: https://www.ah.nl/producten/product/wi139742/duyvis-oven-roasted-pinda-s-original
Scraping for: https://www.ah.nl/producten/product/wi457125/chio-xxl-flippies-paprika
Scraping for: https://www.ah.nl/producten/product/wi505206/bolletje-noten-en-granen-amandel-havermout
Scraping for: https://www.ah.nl/producten/product/wi520961/molino-rossetto-fijne-tarwebloem
Scraping for: https://www.ah.nl/producten/product/wi1019/delicata-reep-melk-hazelnoot
Scraping for: https://www.ah.nl/producten/product/wi470029/ah-biologisch-appelchips
Scraping for: https://www.ah.nl/producten/prod

Scraping for: https://www.ah.nl/producten/product/wi196499/punselie-s-stroopkoekjes-classic
Scraping for: https://www.ah.nl/producten/product/wi196541/maoam-pinballs
Scraping for: https://www.ah.nl/producten/product/wi407150/liga-belvita-koeken-soft-bakes-stukjes
Scraping for: https://www.ah.nl/producten/product/wi460470/ah-notenmix-cranberry-rozijn-ongebrand
Scraping for: https://www.ah.nl/producten/product/wi196446/haribo-aardbeienschuim
Scraping for: https://www.ah.nl/producten/product/wi439265/nakd-fruitreep-met-noten-carrot-cake
Scraping for: https://www.ah.nl/producten/product/wi456293/ah-parelkandijkoek
Scraping for: https://www.ah.nl/producten/product/wi502476/sportlife-smashmint-gum-sugarfree
Scraping for: https://www.ah.nl/producten/product/wi516579/sultana-fruitbiscuits-framboos
Scraping for: https://www.ah.nl/producten/product/wi516969/lay-s-max-salt-en-black-pepper-flavour
Scraping for: https://www.ah.nl/producten/product/wi517329/ah-multimix-bbq-stijl
Scraping for: https:

Scraping for: https://www.ah.nl/producten/product/wi387565/croky-ribble-rock-paprika
Scraping for: https://www.ah.nl/producten/product/wi445926/ritter-sport-nut-selection-honing-zout-amandel
Scraping for: https://www.ah.nl/producten/product/wi517267/ah-cookie-bites-koektoefjes-met-citroen
Scraping for: https://www.ah.nl/producten/product/wi117811/ah-walnoten
Scraping for: https://www.ah.nl/producten/product/wi370272/van-gilse-originele-suikerklontjes
Scraping for: https://www.ah.nl/producten/product/wi370290/fairtrade-original-ruwe-rietsuiker-sticks
Scraping for: https://www.ah.nl/producten/product/wi502037/ah-gemengde-drop-zoet-en-zout
Scraping for: https://www.ah.nl/producten/product/wi517494/bonne-maman-la-madeleine-chocolat-au-lait
Scraping for: https://www.ah.nl/producten/product/wi132698/lu-prince-cake-en-choc
Scraping for: https://www.ah.nl/producten/product/wi237927/koetjesreep-chocolade-8-pack
Scraping for: https://www.ah.nl/producten/product/wi374843/ah-schuimbanaantjes
Scrap

[{'date_time': '18/06/2022',
  'product_id': '447566',
  'product_title': 'AH Provencale borrelnoten',
  'product_category': 'snoep-koek-chips-en-chocolade',
  'in_bonus': 'No',
  'price': 1.08,
  'quantity': '250 g',
  'price_in_kg': 4.32,
  'bonus_type': 'NA',
  'bonus_price': 'NA',
  'nutri_score': 'D',
  'characteristics': ['Veganistisch', 'Lactosevrij'],
  'allergies': ["Pinda'S",
   'Glutenbevattende Granen',
   'Tarwe',
   'Noten',
   'Amandel',
   'Cashewnoot',
   'Hazelnoot',
   'Macadamianoot',
   'Pecannoot',
   'Paranoot',
   'Pistache-Noot',
   'Walnoot'],
  'nutritions': {'Soort': 'Per 100 Gram',
   'Energie': '2134 kJ (511 kcal)',
   'Vet': '30 g',
   'waarvan verzadigd': '4.1 g',
   'waarvan onverzadigd': '26 g',
   'Koolhydraten': '44 g',
   'waarvan suikers': '9.1 g',
   'Voedingsvezel': '4.7 g',
   'Eiwitten': '14 g',
   'Zout': '1.5 g'},
  'product_variation_ids': ['447572', '467565', '447570', '448484', '448485'],
  'product_recommmend_ids': 'NA'},
 {'date_time': '

## 4. Final Dataset

### 4.1 Create dataframe, write to csv and access data

The Albert Heijn dataset can be created by running this scrape program once every week. The assumption is that every week new products will be sold at a discount ('Bonus'). Albert Heijn update its website according to the refreshness of the weekly 'bonus' products. Therefore, running this program will provide a dataset containing the whole product assortment and the 'bonus' products in that particular week.

In [67]:
product_data_df = pd.DataFrame(product_data)

def check_header(file_name):

    try:
        with open(file_name) as csv_file:
            lines = csv_file.readlines()
            is_header = not any(char.isdigit() for char in lines[0])
        return is_header
    except:
        False
    
def write_csv(dataframe, csv_file):

    if check_header(csv_file):
        product_data_df.to_csv(csv_file, mode="a", index=False, header=False)
    else:
        product_data_df.to_csv(csv_file, mode="w", index=False, header=True)      

write_csv(product_data_df, "albert_heijn_product_data.csv")  

product_data_df.head()

Unnamed: 0,date_time,product_id,product_title,product_category,in_bonus,price,quantity,price_in_kg,bonus_type,bonus_price,nutri_score,characteristics,allergies,nutritions,product_variation_ids,product_recommmend_ids
0,18/06/2022,447566,AH Provencale borrelnoten,snoep-koek-chips-en-chocolade,No,1.08,250 g,4.32,,,D,"[Veganistisch, Lactosevrij]","[Pinda'S, Glutenbevattende Granen, Tarwe, Note...","{'Soort': 'Per 100 Gram', 'Energie': '2134 kJ ...","[447572, 467565, 447570, 448484, 448485]",
1,18/06/2022,187449,AH Popcorn Caramel,snoep-koek-chips-en-chocolade,No,1.09,150 g,7.27,,,D,[Glutenvrij],"[Melk, Lactose]","{'Soort': 'Per 100 Gram', 'Energie': '1836 kJ ...","[447572, 467565, 447570, 448484, 448485]",
2,18/06/2022,467585,AH Dry roast pinda's gezouten,snoep-koek-chips-en-chocolade,No,0.69,150 g,4.6,,,C,"[Veganistisch, Glutenvrij, Lactosevrij]","[Pinda'S, Noten, Amandel, Cashewnoot, Hazelnoo...","{'Soort': 'Per 100 Gram', 'Energie': '2469 kJ ...","[447572, 467565, 447570, 448484, 448485]",
3,18/06/2022,35001,LU Tuc bacon,snoep-koek-chips-en-chocolade,Yes,0.99,100 g,9.9,VOOR 0.75,0.75,,[Groene Punt],"[Gerst, Glutenbevattende Granen, Eieren, Melk,...","{'Soort': 'Per 100 Gram', 'Energie': '2007 kJ ...","[447572, 467565, 447570, 448484, 448485]",
4,18/06/2022,62056,Lay's Light Paprika,snoep-koek-chips-en-chocolade,No,1.45,170 g,8.53,,,C,"[Groene Punt, Recyclebaar]","[Melk, Glutenbevattende Granen, Tarwe]","{'Soort': 'Per 100 Gram', 'Energie': '2010 kJ ...","[447572, 467565, 447570, 448484, 448485]",
