<a href="https://colab.research.google.com/github/BartoszMietlicki/aniagotuje-recipes-scraper/blob/main/AniaGotuje_Recipes_Scraper.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

---
## **0 - Introduction**
---


This notebook presents the complete process of collecting and analyzing recipe data from **AniaGotuje.pl**. The project is divided into two main sections:

1. **Fetching Recipe Links**  
   - Automatically retrieve the list of all culinary categories.  
   - Select which categories to scrape: all of them by default, or a single one through the optional interactive prompt.  
   - Detect the number of pages in each selected category.  
   - Gather all URLs pointing to individual recipes.

2. **Fetching and Processing Recipe Content**  
   - Parse each recipe page with BeautifulSoup and select only the relevant elements.  
   - Implement a function that, for each URL, extracts:  
     - the recipe title and intro text  
     - the list of ingredients  
     - the step‑by‑step description  
     - preparation times and number of portions  
     - nutritional values and diet type  
   - Download all recipes in parallel, with retry logic for HTTP 429 rate limits.  
   - Combine the results into a single DataFrame and report any failures.

<pre>
┌─────────────────────┐
│    AniaGotuje.pl    │
└─────────────────────┘
│
├── 2 - Fetching Recipe Links
│   ├── 2.1 - Function to Retrieve Category List
│   ├── 2.2 - Selecting Category to Fetch
│   ├── 2.3 - Function to Fetch Links for Selected Categories
│   ├── 2.4 - Detecting Number of Pages in a Category
│   │   ├── 2.4.1 - Function to Check for Recipe Presence
│   │   ├── 2.4.2 - Function to Count Pages in a Category
│   │   └── 2.4.3 - Function to Parallel Count Pages Across Categories
│   ├── 2.5 - Function to Fetch All Recipe Links from Page
│   └── 2.6 - Fetching Recipe Links
│
┌─────────────────────┐
│     Recipe Links    │
└─────────────────────┘
│
└── 3 - Fetching Recipe Content
│   ├── 3.1 - Recipe Data Fetching Function
│   ├── 3.2 - Parallel Recipe Download Function
│   └── 3.3 - Executing Fetch and Displaying Results
│
┌─────────────────────┐
│      Recipes        │
└─────────────────────┘
│
└─────────────────────────────┐
│    Recipes Data Frame       │
└─────────────────────────────┘
</pre>

---
## **1 - Environment Setup**
---


In [None]:
# ── Library imports ───────────────────────────────────────────────
import re  # used by _norm_space in section 3.1
import time
from collections import defaultdict
from urllib.parse import urljoin
from concurrent.futures import (
    ThreadPoolExecutor,
    ProcessPoolExecutor,
    as_completed,
    wait,
    FIRST_COMPLETED
)
from requests.exceptions import HTTPError
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry
import pandas as pd
from bs4 import BeautifulSoup, SoupStrainer
from tqdm import tqdm

# ── Display settings ───────────────────────────────────────────────
pd.set_option('display.max_colwidth', 50)

# ── HTTP session configuration ─────────────────────────────────────
session = requests.Session()
session.headers.update({
    "User-Agent": (
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
        "AppleWebKit/537.36 (KHTML, like Gecko) "
        "Chrome/114.0.0.0 Safari/537.36"
    ),
    "Accept-Encoding": "gzip, deflate"
})

retry_strategy = Retry(
    total=5,
    backoff_factor=1,
    status_forcelist=[429, 500, 502, 503, 504],
    allowed_methods=["GET"],
    raise_on_status=False
)

adapter = HTTPAdapter(
    max_retries=retry_strategy,
    pool_connections=20,
    pool_maxsize=20
)

session.mount("https://", adapter)
session.mount("http://", adapter)

BASE_URL = "https://aniagotuje.pl"

---
## **2 - Fetching Recipe Links**
---


### 2.1 - Function to Retrieve Category List

In [None]:
def get_recipe_categories():
    """
    Retrieve recipe category names and URLs from the site navigation.
    """
    html = session.get(BASE_URL, timeout=5).text
    soup = BeautifulSoup(html, "lxml")
    nav = soup.select_one("nav.cat-menu")
    if not nav:
        return []
    categories = []
    for link in nav.select("div.cat-group-list a[href^='/przepisy/']"):
        title = link.get_text(strip=True)
        url = urljoin(BASE_URL, link["href"])
        categories.append((title, url))
    return categories
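
As a quick illustration of how `urljoin` turns the relative `href` values from the navigation into absolute URLs, here is a minimal sketch. The category path `/przepisy/przykladowa-kategoria` is a made-up example and is not taken from the site.

```python
from urllib.parse import urljoin

BASE_URL = "https://aniagotuje.pl"

# Hypothetical relative href, shaped like those matched by a[href^='/przepisy/']
relative_href = "/przepisy/przykladowa-kategoria"

# A root-relative path replaces the path of the base URL
absolute_url = urljoin(BASE_URL, relative_href)
print(absolute_url)  # https://aniagotuje.pl/przepisy/przykladowa-kategoria
```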


### 2.2 - Selecting Category to Fetch

In [None]:
# recipe_categories = get_recipe_categories()
# print("0) all categories")
# for i, (name, _) in enumerate(recipe_categories, start=1):
#     print(f"{i}) {name}")
# choice = int(input(f"choose category (0–{len(recipe_categories)}): "))

# if choice == 0:
#     chosen_categories = recipe_categories
#     print("→ all categories")
# else:
#     chosen_categories = [recipe_categories[choice-1]]
#     print(f"→ {chosen_categories[0][0]}")


recipe_categories = get_recipe_categories()
chosen_categories = recipe_categories

### 2.3 - Function to Fetch Links for Selected Categories


In [None]:
def fetch_recipe_links_from_page(category_url, page_number=1):
    """
    Return a dict of {title: full_link} for recipes on the specified category page.
    """
    # construct page URL
    if page_number <= 1:
        page_url = category_url
    else:
        page_url = f"{category_url.rstrip('/')}/strona/{page_number}"

    # retrieve and parse HTML
    response = session.get(page_url, timeout=5)
    response.raise_for_status()
    soup = BeautifulSoup(response.text, "lxml")

    # locate recipe container
    container = soup.select_one("div.res-col-con")
    if not container:
        return {}

    # extract recipe titles and links
    links = {}
    for article in container.select("article.res-col"):
        anchor = article.select_one("div.col-title a[href^='/przepis/']")
        if not anchor:
            continue
        title = anchor.get_text(strip=True)
        href = urljoin(BASE_URL, anchor["href"])
        links[title] = href

    return links


### 2.4 - Detecting Number of Pages in a Category


##### 2.4.1 - Function to Check for Recipe Presence


In [None]:
def page_has_any_recipe(category_url, page_number=1):
    """
    Load the page and return True if at least one recipe element is present.
    """
    # build page URL
    if page_number <= 1:
        page_url = category_url
    else:
        page_url = f"{category_url.rstrip('/')}/strona/{page_number}"

    # fetch and parse
    response = session.get(page_url, timeout=5)
    response.raise_for_status()
    soup = BeautifulSoup(response.text, "lxml")

    # return presence flag
    return bool(soup.select_one("div.res-col-con article.res-col"))
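
Both `fetch_recipe_links_from_page` and `page_has_any_recipe` build the page URL in the same way. One possible refactoring, shown here only as a sketch, moves that logic into a small helper. The name `build_page_url` and the example category URL are hypothetical and do not appear in the notebook.

```python
def build_page_url(category_url: str, page_number: int = 1) -> str:
    """Return the URL of a given page within a category listing."""
    # Page 1 is the category URL itself; later pages live under /strona/<n>
    if page_number <= 1:
        return category_url
    return f"{category_url.rstrip('/')}/strona/{page_number}"


# Hypothetical category URL with a trailing slash
example_category = "https://aniagotuje.pl/przepisy/przykladowa-kategoria/"

first_page = build_page_url(example_category)
third_page = build_page_url(example_category, 3)
print(first_page)
print(third_page)  # https://aniagotuje.pl/przepisy/przykladowa-kategoria/strona/3
```

Stripping the trailing slash before appending `/strona/<n>` avoids a double slash when the category URL already ends with one.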


##### 2.4.2 - Function to Count Pages in a Category


In [None]:
def detect_total_pages(category_url, max_page=32):
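    """
    Find the last page that contains recipes using O(log N) calls to
    page_has_any_recipe: double the page number until a page is empty
    (or max_page is exceeded), then binary-search the boundary.
    Assumes page 1 exists; the result is capped at max_page.
    """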
    low, high = 1, 2
    # expand high exponentially while pages exist
    while high <= max_page and page_has_any_recipe(category_url, high):
        low, high = high, high * 2
    high = min(high, max_page + 1)

    # refine boundary with binary search
    while low + 1 < high:
        mid = (low + high) // 2
        if page_has_any_recipe(category_url, mid):
            low = mid
        else:
            high = mid
    return low
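
To see the search pattern without touching the network, the sketch below runs the same exponential-then-binary logic against a fake predicate. The names `find_last_true` and `fake_exists`, and the choice of 13 as the last page, are assumptions made for this illustration only.

```python
def find_last_true(exists, max_page=30):
    """Largest page p in [1, max_page] for which exists(p) is True.

    Assumes page 1 exists and that exists() is True up to some page
    and False for every page after it.
    """
    low, high = 1, 2
    # Phase 1: double `high` while that page still exists
    while high <= max_page and exists(high):
        low, high = high, high * 2
    high = min(high, max_page + 1)

    # Phase 2: binary search between the last hit and the first miss
    while low + 1 < high:
        mid = (low + high) // 2
        if exists(mid):
            low = mid
        else:
            high = mid
    return low


calls = []  # records every page number that was probed


def fake_exists(page, last_page=13):
    """Stand-in for page_has_any_recipe: pages 1..last_page exist."""
    calls.append(page)
    return page <= last_page


result = find_last_true(fake_exists)
print(result)  # 13
print(calls)   # [2, 4, 8, 16, 12, 14, 13]
```

Seven probes locate page 13, compared with 14 for a linear scan that stops at the first empty page. If the real last page lies beyond `max_page`, the function returns `max_page`.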


##### 2.4.3 - Function to Parallel Count Pages Across Categories


In [None]:
def detect_pages_for_all(categories, max_workers=8, max_page=30):
    """
    Compute page counts for all selected categories in parallel.
    Returns a dict mapping category name to page count.
    """
    def _detect(cat):
        name, url = cat
        count = detect_total_pages(url, max_page)
        return name, count

    results = {}
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        futures = {executor.submit(_detect, cat): cat for cat in categories}
        for fut in as_completed(futures):
            name, count = fut.result()
            results[name] = count

    return results

### 2.5 - Function to Fetch All Recipe Links from Page


In [None]:
def fetch_all_recipes_for_category(categories, pages_map, max_workers=10):
    """
    Fetch links for all recipe pages in each category concurrently.
    Returns a dict mapping category name to {recipe_title: recipe_url}.
    """
    def _worker(name, url, page):
        return name, fetch_recipe_links_from_page(url, page)

    all_links = {name: {} for name, _ in categories}
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        # schedule fetch tasks for every page in each category
        futures = [
            executor.submit(_worker, name, url, page)
            for name, url in categories
            for page in range(1, pages_map[name] + 1)
        ]
        # merge results as they complete
        for fut in as_completed(futures):
            name, links = fut.result()
            all_links[name].update(links)

    return all_links

### 2.6 - Fetching Recipe Links


In [None]:
pages_per_category = detect_pages_for_all(
    chosen_categories,
    max_workers=10,
    max_page=30
)

recipes_links = fetch_all_recipes_for_category(
    chosen_categories,
    pages_per_category,
    max_workers=10
)

flat_links = [
    (title, category, url)
    for category, items in recipes_links.items()
    for title, url in items.items()
]

print(f"{len(flat_links)} recipes found")

2070 recipes found
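
If the same recipe is listed under more than one category, `flat_links` will contain its URL several times. Whether this actually happens on AniaGotuje.pl is not checked in this notebook, so the sketch below is only an illustration on toy data of how duplicates could be dropped while preserving order. All names and URLs in it are hypothetical.

```python
# Toy data in the same shape as recipes_links: {category: {title: url}}
toy_links = {
    "category A": {
        "Recipe one": "https://example.com/przepis/one",
        "Recipe two": "https://example.com/przepis/two",
    },
    "category B": {
        "Recipe two": "https://example.com/przepis/two",  # also listed under A
        "Recipe three": "https://example.com/przepis/three",
    },
}

# Same flattening as in the cell above
toy_flat = [
    (title, category, url)
    for category, items in toy_links.items()
    for title, url in items.items()
]

# Keep the first occurrence of each URL, preserving the original order
seen_urls = set()
unique_links = []
for title, category, url in toy_flat:
    if url not in seen_urls:
        seen_urls.add(url)
        unique_links.append((title, category, url))

print(len(toy_flat), len(unique_links))  # 4 3
```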


---
## **3 - Fetching Recipe Content**
---


### 3.1 - Recipe Data Fetching Function


In [None]:
def _norm_space(s: str | None) -> str | None:
    """Collapse whitespace; return None if empty."""
    if not s:
        return None
    t = re.sub(r"\s+", " ", s).strip()
    return t if t else None

def fetch_recipe_data(name: str, category: str, link: str) -> dict[str, str | None]:
    """
    Retrieve recipe details: intro, ingredients, description, times, portions, nutrition, diet.
    Returns a dict mapping field names to values (or None if not found).
    """
    # 1) HTTP
    resp = session.get(link, timeout=10)
    resp.raise_for_status()

    # 2) FULL parse
    soup = BeautifulSoup(resp.text, "lxml")

    # 3) Intro
    intro_div = soup.select_one("div.article-intro")
    intro = _norm_space(intro_div.get_text(" ", strip=True)) if intro_div else None

    # 4) Ingredients
    ing_nodes = soup.select("ul.recipe-ing-list li span.ingredient")
    if not ing_nodes:
        ing_nodes = soup.select("ul.recipe-ing-list li span")
    seen = set()
    ilist = [
        txt for txt in (n.get_text(" ", strip=True) for n in ing_nodes)
        if (txt and (txt not in seen) and (not seen.add(txt)))
    ]
    ingredients = "; ".join(ilist) if ilist else None

    # 5) Description
    step_ps = soup.select("div.steps div.step div.step-text p")
    step_texts = [_norm_space(p.get_text(" ", strip=True)) for p in step_ps]
    step_texts = [t for t in step_texts if t]
    if step_texts:
        desc = " ".join(step_texts)
    else:
        content_div = soup.select_one("div.article-content")
        desc = _norm_space(content_div.get_text(" ", strip=True)) if content_div else None

    # 6) Times / portions / nutrition / diet
    prep_time = other_time = num_portions = nutr_unit = nutr_val = diet = None
    info_p = soup.select_one("p.recipe-info")
    if info_p:
        lines = [ln.strip() for ln in info_p.get_text("\n", strip=True).split("\n") if ln.strip()]
        parsed = {}
        i = 0
        while i < len(lines):
            line = lines[i]
            if line.endswith(":"):
                label = line[:-1]
                if i+1 < len(lines) and lines[i+1] == "Wartość energetyczna":
                    nutr_unit = label
                    nutr_val  = lines[i+2] if i+2 < len(lines) else None
                    i += 3
                else:
                    parsed[label] = lines[i+1] if i+1 < len(lines) else None
                    i += 2
            else:
                i += 1

        prep_time    = parsed.get("Czas przygotowania")
        num_portions = parsed.get("Liczba porcji")
        diet         = parsed.get("Dieta")
        others = [(k, v) for k, v in parsed.items() if k.startswith("Czas ") and k != "Czas przygotowania"]
        other_time = "; ".join([f"{k}: {v}" for k, v in others]) if others else None

    return {
        "name":         name,
        "category":     category,
        "link":         link,
        "intro":        intro,
        "ingredients":  ingredients,
        "description":  desc,
        "prep_time":    prep_time,
        "other_time":   other_time,
        "num_portions": num_portions,
        "nutr_unit":    nutr_unit,
        "nutr_val":     nutr_val,
        "diet":         diet
    }
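
The label/value loop in step 6 is easier to follow in isolation. The sketch below copies that logic into a standalone function and runs it on a hand-written list of lines. The function name `parse_info_lines` is hypothetical. The sample values come from the "Ciasto z owocami" row in the results table, but the exact order and layout of the lines on the real page is an assumption.

```python
def parse_info_lines(lines):
    """Pair each 'Label:' line with the value that follows it.

    A label followed by 'Wartość energetyczna' is treated as the
    nutrition unit, and the line after that as the energy value.
    """
    parsed = {}
    nutr_unit = nutr_val = None
    i = 0
    while i < len(lines):
        line = lines[i]
        if line.endswith(":"):
            label = line[:-1]
            if i + 1 < len(lines) and lines[i + 1] == "Wartość energetyczna":
                nutr_unit = label
                nutr_val = lines[i + 2] if i + 2 < len(lines) else None
                i += 3
            else:
                parsed[label] = lines[i + 1] if i + 1 < len(lines) else None
                i += 2
        else:
            i += 1  # skip lines that are not labels
    return parsed, nutr_unit, nutr_val


# Assumed layout of the text inside p.recipe-info, one item per line
sample_lines = [
    "Czas przygotowania:", "25 minut",
    "Czas pieczenia:", "45 minut",
    "Liczba porcji:", "forma 20/30 cm",
    "W 100 g ciasta:", "Wartość energetyczna", "272 kcal",
    "Dieta:", "wegetariańska",
]

info, unit, value = parse_info_lines(sample_lines)
print(info["Czas przygotowania"], "|", unit, "|", value)
# 25 minut | W 100 g ciasta | 272 kcal
```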


### 3.2 - Parallel Recipe Download Function


In [None]:
def fetch_all_recipes_parallel(
    flat_links,
    max_workers=3,
    max_retries=3,
    show_progress=False
):
    """
    Fetch all recipes in parallel, retrying HTTP 429 up to max_retries.
    Returns (df_full, failures).
    If show_progress is True, a tqdm bar is displayed.
    """
    records = []
    failures = []
    retry_counts = defaultdict(int)

    bar = tqdm(total=len(flat_links)) if show_progress else None

    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        future_to_meta = {
            executor.submit(fetch_recipe_data, name, cat, url): (name, cat, url)
            for name, cat, url in flat_links
        }

        while future_to_meta:
            done, _ = wait(future_to_meta, return_when=FIRST_COMPLETED)
            for fut in done:
                name, cat, url = future_to_meta.pop(fut)
                complete = False
                try:
                    rec = fut.result()
                    records.append(rec)
                    complete = True
                except HTTPError as e:
                    status = getattr(e.response, "status_code", None)
                    if status == 429 and retry_counts[(name, url)] < max_retries:
                        retry_counts[(name, url)] += 1
                        backoff = 2 ** (retry_counts[(name, url)] - 1)
                        time.sleep(backoff)
                        new_fut = executor.submit(fetch_recipe_data, name, cat, url)
                        future_to_meta[new_fut] = (name, cat, url)
                    else:
                        failures.append((name, cat, url, str(e)))
                        complete = True
                except Exception as e:
                    failures.append((name, cat, url, str(e)))
                    complete = True

                # compare with None: a tqdm bar is falsy when its total is 0
                if complete and bar is not None:
                    bar.update(1)

    if bar is not None:
        bar.close()

    df = pd.DataFrame.from_records(records)
    cols = [
        "name", "category", "link",
        "intro", "ingredients", "description",
        "prep_time", "other_time", "num_portions",
        "nutr_unit", "nutr_val", "diet"
    ]

    if df.empty:
        df_full = pd.DataFrame(columns=cols)
    else:
        missing = [c for c in cols if c not in df.columns]
        for c in missing:
            df[c] = None
        df_full = df[cols]

    return df_full, failures
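
The resubmission loop above can be tried out without any HTTP traffic. The following is a minimal sketch of the same pattern, in which a fake task fails twice with a stand-in exception before succeeding. `RateLimited`, `flaky_task`, `run_with_retries` and the shortened `base_delay` are all assumptions made for this illustration; they are not part of the notebook.

```python
import time
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor, wait, FIRST_COMPLETED


class RateLimited(Exception):
    """Stand-in for an HTTP 429 error (hypothetical, for this sketch only)."""


attempts = defaultdict(int)  # how many times each key was attempted


def flaky_task(key):
    """Raise RateLimited on the first two calls for key 'b', then succeed."""
    attempts[key] += 1
    if key == "b" and attempts[key] <= 2:
        raise RateLimited(key)
    return key.upper()


def run_with_retries(keys, max_workers=3, max_retries=3, base_delay=0.01):
    results, failures = [], []
    retry_counts = defaultdict(int)
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        pending = {executor.submit(flaky_task, k): k for k in keys}
        while pending:
            done, _ = wait(pending, return_when=FIRST_COMPLETED)
            for fut in done:
                key = pending.pop(fut)
                try:
                    results.append(fut.result())
                except RateLimited as exc:
                    if retry_counts[key] < max_retries:
                        # back off exponentially, then put the task back in the queue
                        retry_counts[key] += 1
                        time.sleep(base_delay * 2 ** (retry_counts[key] - 1))
                        pending[executor.submit(flaky_task, key)] = key
                    else:
                        failures.append((key, str(exc)))
    return results, failures


results, failures = run_with_retries(["a", "b", "c"])
print(sorted(results), failures)  # ['A', 'B', 'C'] []
```

As in the notebook's function, the sleep happens in the coordinating thread, so it also delays the handling of other completed futures. That keeps the code simple and slows the whole pipeline down when the server signals rate limiting.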

### 3.3 - Executing Fetch and Displaying Results


In [None]:
# set show_progress=True to display a tqdm bar
df_full, failures = fetch_all_recipes_parallel(
    flat_links,
    max_workers=3,
    max_retries=3,
    show_progress=True
)

if failures:
    print(f"\n❌ {len(failures)} recipes failed after retries:")
    for name, category, url, error in failures:
        print(f"   • {name!r} ({url}): {error}")

100%|██████████| 2070/2070 [04:17<00:00,  8.04it/s]


---
## **4 - Summary**
---


### 4.1 - Summary DataFrame


In [None]:
display(df_full)

| | name | category | link | intro | ingredients | description | prep_time | other_time | num_portions | nutr_unit | nutr_val | diet |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Ciasta z truskawkami | ciasta i torty | https://aniagotuje.pl/przepis/ciasta-z-truskaw... | Ciasta z truskawkami to przepyszne słodkości, ... | | Ciasta z truskawkami Ciasta z truskawkami to p... | | | | | | |
| 1 | Szybkie ciasta | ciasta i torty | https://aniagotuje.pl/przepis/szybkie-ciasta | Szybkie ciasta to słodkości na każdą okazję. Z... | | Szybkie ciasta Szybkie ciasta to słodkości na ... | | | | | | |
| 2 | Ciasto z owocami | ciasta i torty | https://aniagotuje.pl/przepis/ciasto-z-owocami | To ciasto z owocami jest przepyszne! Bardzo wi... | 360 g mąki pszennej tortowej - nieco ponad 2 s... | Ciasto z owocami To ciasto z owocami jest prze... | 25 minut | Czas pieczenia: 45 minut | forma 20/30 cm | W 100 g ciasta | 272 kcal | wegetariańska |
| 3 | Placek z rabarbarem | ciasta i torty | https://aniagotuje.pl/przepis/placek-z-rabarbarem | Doskonały i niezwykle prosty w przygotowaniu p... | 2 szklanki mąki pszennej tortowej - 320 g; 700... | Dno formy wyłóż papierem do pieczenia, a odrob... | 20 minut | Czas chłodzenia ciasta: 40 minut | forma 24 x 24 cm - 1250 g ciasta | W 100 g ciasta | 280 kcal | wegetariańska |
| 4 | Ciasto bez jajek | ciasta i torty | https://aniagotuje.pl/przepis/ciasto-bez-jajek | Rewelacyjne i niezwykle pyszne, czekoladowe ci... | 240 g mąki pszennej tortowej - około 1,5 szkla... | Ciasto bez jajek Rewelacyjne i niezwykle pyszn... | 15 minut | Czas pieczenia: 45 minut | tortownica 22 cm - 940 g | W 100 g ciasta | 199 kcal | wegańska, wegetariańska |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 2065 | Cymes 100% pozytywnego smaku z natury | porady | https://aniagotuje.pl/przepis/soki-cymes-smak-... | | | Cymes 100% pozytywnego smaku z natury 100% poz... | | | | | | |
| 2066 | Jajko sadzone | porady | https://aniagotuje.pl/przepis/jak-zrobic-jajko... | Poznaj mój sposób na idealne jajko sadzone . J... | 2 średnie jajka; 2 łyżeczki masła klarowanego;... | Jajko sadzone Poznaj mój sposób na idealne jaj... | 2 minuty | Czas smażenia: 5 minut | 2 | W 1 jajku | 127 kcal | bezglutenowa, wegetariańska |
| 2067 | Lukier | porady | https://aniagotuje.pl/przepis/jak-zrobic-lukie... | Domowy lukier to podstawa przy wypieku babek c... | 3 łyżki soku z cytryny lub wody - 18 g; 10 łyż... | Lukier Domowy lukier to podstawa przy wypieku ... | 5 minut | | 138 gramów lukru | W 100 gramach | 340 kcal | bezglutenowa, wegańska, wegetariańska |
| 2068 | Ciasto francuskie | porady | https://aniagotuje.pl/przepis/co-zrobic-z-cias... | Ciasto francuskie jest świetną bazą pod wiele ... | 1 płat ciasta francuskiego - 375 g (wymiar 25... | Ciasto francuskie Ciasto francuskie jest świet... | 20 minut | Czas pieczenia: 20 minut | ciasto 25 x 25 cm | W 100 g ciasta | 375 kcal | wegetariańska |


### 4.2 - Summary

**Key results**  
- Collected **2,070** recipes (data snapshot: April 2025)  
- Full pipeline completes in ~4 minutes, most of it in the recipe download stage (4 min 17 s at about 8 recipes per second in the run shown above)  
- Encountered on average 1–2 HTTP 429 rate‑limit errors per run, all of which were retried automatically  

**Technical highlights**  
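- Automatic retry logic (up to 3 resubmissions per recipe, with exponential backoff) recovers from transient 429 errors; any recipe that still fails is reported in `failures` instead of being silently dropped  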
- Request cadence is tuned to respect AniaGotuje.pl’s limits; scaling up would require proxies, which we deliberately avoided out of respect for the site’s resources  
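
For reference, the delays used by the manual 429 handler in `fetch_all_recipes_parallel` follow directly from its `2 ** (n - 1)` formula. The short calculation below is a sketch that covers only that resubmission loop. The `Retry` adapter configured in section 1 applies its own, separate backoff, which is not modelled here.

```python
max_retries = 3  # value passed to fetch_all_recipes_parallel in section 3.3

# Delay in seconds before retry n, as computed in the 429 handler
delays = [2 ** (n - 1) for n in range(1, max_retries + 1)]
total_wait = sum(delays)

print(delays, total_wait)  # [1, 2, 4] 7
```

A single recipe therefore spends at most 7 seconds sleeping in this loop before it is recorded as a failure.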

**Final outcome**  
A fully automated, end‑to‑end scraper that discovers categories, gathers recipe URLs, parses all relevant details and returns a clean pandas.DataFrame of culinary data.  
