This notebook focuses on collecting and preparing data from the Ravelry API.
The main goals are to:
- fetch publicly available knitting pattern data using the Ravelry API,
- build an initial raw dataset containing pattern metadata,
- clean and structure the data for further analysis,
- prepare the dataset for exploratory data analysis (EDA) and later modelling.

This is the first step of the project, so I concentrate on data collection and basic data preparation; the analysis itself comes later.

1. Import libraries for data handling, API calls, and quick visual checks

In [14]:
# BASIC DATA HANDLING
import pandas as pd
import numpy as np
import os
import json
import time

# API & REQUESTS
import requests
from requests.auth import HTTPBasicAuth

# VISUALISATION (later use)
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px

# Settings for nicer plots
sns.set(style="whitegrid")



2. Define and create folders for raw and processed data.

In [15]:
RAW_DIR = "../data/raw/v1"
PROCESSED_DIR = "../data/processed/v1"

os.makedirs(RAW_DIR, exist_ok=True)
os.makedirs(PROCESSED_DIR, exist_ok=True)

RAW_JSON_PATH = os.path.join(RAW_DIR, "patterns_raw.json")
RAW_CSV_PATH = os.path.join(RAW_DIR, "patterns_raw.csv")

PROCESSED_CSV_PATH = os.path.join(PROCESSED_DIR, "patterns_clean.csv")

RAW_DIR, PROCESSED_DIR


('../data/raw/v1', '../data/processed/v1')

3. API credentials. They are stored locally in a .env file and loaded as environment variables, which prevents accidental exposure in the codebase.


In [16]:
from dotenv import load_dotenv

load_dotenv()  # loads variables from .env

RAVELRY_USER = os.getenv("RAVELRY_ACCESS_KEY")
RAVELRY_PASS = os.getenv("RAVELRY_PERSONAL_KEY")

if not RAVELRY_USER or not RAVELRY_PASS:
    raise ValueError(
        "API credentials not found. "
        "Make sure RAVELRY_ACCESS_KEY and RAVELRY_PERSONAL_KEY are set in .env file."
    )

print("Credentials loaded ✅")



Credentials loaded ✅
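If one of the two keys is missing, it can help to know which one. The check above could be extended along these lines; this is only a sketch, and `REQUIRED_VARS` and `missing_env_vars` are hypothetical names that do not appear elsewhere in this notebook.

```python
import os

# Hypothetical constant: the two variable names used in the cell above
REQUIRED_VARS = ("RAVELRY_ACCESS_KEY", "RAVELRY_PERSONAL_KEY")


def missing_env_vars(names, env=None):
    """Return the names that are unset or empty in the given mapping."""
    env = os.environ if env is None else env
    return [name for name in names if not env.get(name)]


# Demo against a plain dict so the example runs without a real .env file
example_env = {"RAVELRY_ACCESS_KEY": "abc123", "RAVELRY_PERSONAL_KEY": ""}
print(missing_env_vars(REQUIRED_VARS, example_env))  # → ['RAVELRY_PERSONAL_KEY']
```

In the notebook itself, calling `missing_env_vars(REQUIRED_VARS)` after `load_dotenv()` would check the real environment, and the returned list could be included in the `ValueError` message.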


4. Test public API endpoint (read-only) and inspect response structure

In [17]:
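# BASE_URL and auth are not defined in the cells above; the values below are assumptions
BASE_URL = "https://api.ravelry.com"  # assumed root URL of the Ravelry API
auth = HTTPBasicAuth(RAVELRY_USER, RAVELRY_PASS)  # basic auth built from the keys loaded in step 3

search_url = f"{BASE_URL}/patterns/search.json"  # Endpoint for pattern search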
params = {"query": "sweater", "page": 1, "page_size": 5}

resp = requests.get(search_url, params=params, auth=auth, timeout=30) # Make the GET request

print("Status:", resp.status_code)
print("Content-Type:", resp.headers.get("Content-Type"))
print("Preview:", resp.text[:200])

if resp.ok:
    data = resp.json()
    print("Top-level keys:", list(data.keys()))
    print("Patterns returned:", len(data.get("patterns", [])))



Status: 200
Content-Type: application/json; charset=utf-8
Preview: {"patterns": [{"free":false,"id":7497838,"name":"BABA sweater Chunky","permalink":"baba-sweater-chunky","personal_attributes":null,"first_photo":{"id":145451206,"sort_order":1,"user_id":811200,"x_offs
Top-level keys: ['patterns', 'paginator']
Patterns returned: 5
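The preview shows that each pattern record contains nested objects such as `first_photo`. To see how such a record would look as a table row, one option is `pd.json_normalize`. The sketch below uses a single sample record trimmed to the fields visible in the preview; real records contain more fields, and `sample_patterns` is a hypothetical name.

```python
import pandas as pd

# Sample record trimmed from the response preview (only the fields visible there)
sample_patterns = [
    {
        "free": False,
        "id": 7497838,
        "name": "BABA sweater Chunky",
        "permalink": "baba-sweater-chunky",
        "personal_attributes": None,
        "first_photo": {"id": 145451206, "sort_order": 1, "user_id": 811200},
    }
]

# Nested dicts become prefixed columns, e.g. first_photo["id"] -> first_photo_id
flat = pd.json_normalize(sample_patterns, sep="_")
print(sorted(flat.columns))
```

Flattening with a `_` separator keeps the column names valid as Python attribute names, which tends to be convenient later in pandas.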


5. Fetch patterns from the Ravelry API (RAW) and save the raw response to JSON

In [18]:
search_url = f"{BASE_URL}/patterns/search.json" # Endpoint for pattern search

query = "sweater"   # Search term
page = 1
page_size = 100   # Results requested per page
max_pages = 3     # Upper limit on pages fetched in this run

all_patterns = []

while page <= max_pages:
    print(f"Fetching page {page}...")
    
    params = {
        "query": query,
        "page": page,
        "page_size": page_size
    }
    
    resp = requests.get(search_url, params=params, auth=auth, timeout=30)
    
    if not resp.ok:
        print(f"Stopped at page {page}, status {resp.status_code}")
        break
    
    data = resp.json()
    patterns = data.get("patterns", [])
    
    all_patterns.extend(patterns)
    
    if len(patterns) < page_size:
        # no more pages
        break
    
    page += 1
    time.sleep(1)  # Pause between requests to avoid hammering the API

print(f"Total patterns collected: {len(all_patterns)}")

# Save RAW data to JSON
with open(RAW_JSON_PATH, "w", encoding="utf-8") as f:
    json.dump(all_patterns, f, ensure_ascii=False, indent=2)

RAW_JSON_PATH


Fetching page 1...
Fetching page 2...
Fetching page 3...
Total patterns collected: 300


'../data/raw/v1\\patterns_raw.json'
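The loop above stops at the first non-OK response. For larger runs, a temporary error (for example HTTP 429 or a 5xx status) might be worth retrying before giving up. The following is a minimal sketch of retrying with exponential backoff; `get_with_retry` is a hypothetical helper, and the retry count and delays are arbitrary assumptions, not values recommended by Ravelry.

```python
import time


def get_with_retry(get, url, *, retries=3, backoff=1.0, sleep=time.sleep, **kwargs):
    """Call get(url, **kwargs) and retry on HTTP 429 or 5xx responses.

    `get` is any callable with the same shape as requests.get.
    The wait doubles after each failed attempt: backoff, 2*backoff, 4*backoff, ...
    """
    for attempt in range(retries + 1):
        resp = get(url, **kwargs)
        # Anything other than 429 or 5xx is returned to the caller as-is
        if resp.status_code != 429 and resp.status_code < 500:
            return resp
        # Wait before the next attempt, but not after the final one
        if attempt < retries:
            sleep(backoff * 2 ** attempt)
    return resp  # Last response after all retries were used up
```

Inside the loop, the request line could then become `resp = get_with_retry(requests.get, search_url, params=params, auth=auth, timeout=30)`, leaving the existing `if not resp.ok` check in place as the final stop condition.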