# Clothes multimodal search with Scrapegraph, Jina Clip v2 and Qdrant Vector DB 👗

Hi there 👋 Today we'll build a small demo to search clothes from [Zalando](https://zalando.com/) directly with natural language or images. Our plan of attack is to scrape the articles first, embed their images using a multimodal model, and then store the embeddings in a vector DB so we can search!


Scraping websites is not an easy task: most of them cannot simply be fetched with an HTTP request and require JavaScript to be loaded. If we try to make a plain HTTP request to Zalando, we'll be blocked.

In [88]:
import requests

res = requests.get("https://www.zalando.it/jeans-donna")
# we'll get 403
res.status_code

403

We need something smarter. [Scrapegraph](https://scrapegraphai.com/) is a perfect tool for the job: it can bypass website blockers and lets us define a [pydantic schema](https://docs.pydantic.dev/latest/concepts/models/) describing the information we want to scrape. It works by loading the website, parsing it, and using LLMs to fill our schema with the data found on the page.

Once we have the data, we need a way to create vectors to store and search. Since we want to work with both images and text, we need the heavy guns. [Jina ClipV2](https://jina.ai/news/jina-clip-v2-multilingual-multimodal-embeddings-for-text-and-images/) is a wonderful open source model that can represent both images and text as vectors, which makes it a perfect pick for the task.

Finally, we need to save our juicy vectors somewhere. [Qdrant](https://qdrant.tech/) is my go-to vector database: you can self-host it with [docker](https://hub.docker.com/r/qdrant/qdrant) and it comes with a handy UI. It supports different vector quantization techniques, so we can squeeze out a lot of performance!

So, to recap, our plan of attack looks something like this:

![alt](./images/flow.png)

1. Scrape with Scrapegraph
2. Embed with Jina ClipV2
3. Store with Qdrant

Let's get started!

## Preamble

We'll need a bunch of packages. I am using `uv`, so we'll stick with it. You can init your project using

```
uv init
uv add python-dotenv scrapegraph-py==1.24.0 aiofiles sentence-transformers qdrant-client
```

Or if you prefer `pip`

```
pip install python-dotenv scrapegraph-py==1.24.0 aiofiles sentence-transformers qdrant-client
```

In [90]:
!uv add python-dotenv scrapegraph-py==1.24.0 aiofiles sentence-transformers qdrant-client

Resolved 158 packages in 0.43ms
Audited 138 packages in 0.10ms


huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)


## Scrape

First of all, head over to the [scrapegraph dashboard](https://dashboard.scrapegraphai.com/) and get your API key. Create a `.env` file and put it inside:

```
SGAI_API_KEY="YOUR_API_KEY"
```

Then we can load it

In [91]:
from dotenv import load_dotenv
import os
load_dotenv()

SGAI_API_KEY = os.getenv("SGAI_API_KEY")

True
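
One thing to keep in mind: `os.getenv` quietly returns `None` when the variable is missing, for example when the name in `.env` does not match the one we read. Here is a minimal sketch of a fail-fast check. `require_env` and `DEMO_API_KEY` are hypothetical names used only for illustration; they are not part of `python-dotenv` or scrapegraph.

```python
import os


def require_env(name: str) -> str:
    """Return the value of an environment variable or raise a clear error."""
    value = os.getenv(name)
    if not value:
        raise RuntimeError(f"{name} is not set, check your .env file")
    return value


# Demo with a placeholder variable so the snippet runs on its own
os.environ["DEMO_API_KEY"] = "YOUR_API_KEY"
demo_key = require_env("DEMO_API_KEY")
print(len(demo_key))
```

In the notebook you could call `require_env("SGAI_API_KEY")` right after `load_dotenv()` to catch a missing key before any scraping request is sent.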

Now, we need to define the data we want. Each article/item in the website looks like:


![alt](images/zalando-article.png)

We have a brand, name, description, price, image, review etc.

In order to tell scrapegraph what we want to extract, we have to define a couple of pydantic schemas. Since a page contains multiple items, we'll create an `ArticleModel` for a single article and an `ArticlesModel` containing a list of them.

We can add a `description` to each field to guide the LLM into extracting the correct info.

In [92]:
from pydantic import BaseModel, Field
from typing import Optional
import asyncio


class ArticleModel(BaseModel):
    name: str = Field(description="Name of the article")
    brand: str = Field(description="Brand of the article")
    description: str = Field(description="Description of the article")
    price: float = Field(description="Price of the article")
    review_score: float = Field(description="Review score of the article, out of five.")
    url: str = Field(description="Article url")
    image_url: Optional[str] = Field(description="Article's image url")


class ArticlesModel(BaseModel):
    articles: list[ArticleModel] = Field(description="Articles on the page, only the ones with price, review and image. Discard the others")


Now, the fun part. We'll store our scraped data locally in a `.jsonl` file. We'll also add a `user_prompt` to guide scrapegraph even further. Since the scraping process is heavily I/O bound, we'll use their `AsyncClient` so we can fire many requests at once.

Let's import everything and define our variables

In [97]:
from time import perf_counter
# let's use async
from scrapegraph_py import AsyncClient
from scrapegraph_py.logger import sgai_logger
sgai_logger.set_logging(level="INFO")

# let's use async to write to the file as well
import aiofiles
import json
import asyncio
import os

JSON_PATH = "scrape.jsonl"
# how many scraping requests to fire at once
BATCH_SIZE = 8
# how many pages per category
MAX_PAGES = 100

# the user prompt to send to scrapegraph along with the pydantic schemas

user_prompt = """Extract ONLY the articles in the page with price, review and image url. Discard all the others."""

Then we go bottom up. First, a function to save a result:

In [95]:
async def save(result: dict):
    async with aiofiles.open(JSON_PATH, 'a') as f:
        await f.write(json.dumps(result) + '\n')
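
The JSON Lines format is what makes appending cheap here: every line is a complete JSON document, so results can be written one at a time and read back line by line. Below is a small synchronous sketch of the round trip using only the standard library. The helper names and the records are made up for illustration; the notebook itself uses `aiofiles` for the writes.

```python
import json
import os
import tempfile


def append_jsonl(path: str, record: dict) -> None:
    """Append one record as a single JSON line."""
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps(record) + "\n")


def read_jsonl(path: str) -> list[dict]:
    """Parse every non-empty line as its own JSON document."""
    with open(path, "r", encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


# Demo on a temporary file with invented records
demo_path = os.path.join(tempfile.mkdtemp(), "demo.jsonl")
append_jsonl(demo_path, {"result": {"articles": [{"name": "a"}]}})
append_jsonl(demo_path, {"result": {"articles": [{"name": "b"}]}})
records = read_jsonl(demo_path)
print(len(records))
```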

Then one to scrape a page and save the response:

In [96]:
async def scrape_and_save(client: AsyncClient, url: str): 
    start = perf_counter()
    sgai_logger.info(f"Scraping url={url}")
    response = await client.smartscraper(
                website_url=url,
                user_prompt=user_prompt,
                output_schema=ArticlesModel)
    await save(response)
    sgai_logger.info(f"Took {perf_counter() - start:.2f}s")

Finally, let's put it all together. We'll scrape women's jeans and t-shirts/tops. We'll first check whether `JSON_PATH` exists; if it does, we'll assume we have already scraped and skip the run.

In [98]:
async def main():
    get_urls = [
        lambda page: f"https://www.zalando.it/jeans-donna/?p={page}",
        lambda page: f"https://www.zalando.it/t-shirt-top-donna/?p={page}"
    ]
    if os.path.exists(JSON_PATH):
        sgai_logger.info(f"jsonl file exists, assuming we had scrape already. Quitting ...")
        return
    async with AsyncClient() as client:
        for get_url in get_urls:
            for i in range(1, MAX_PAGES + 1, BATCH_SIZE):
                pages = list(range(i, min(i + BATCH_SIZE, MAX_PAGES + 1)))
                tasks = [scrape_and_save(client, get_url(page)) for page in pages]
                await asyncio.gather(*tasks)
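
To make the batching in `main` easier to follow, here is a stripped-down sketch of the same pattern with a stand-in coroutine instead of a real scraping call. `page_batches`, `fake_scrape` and `run_all` are hypothetical names; no network request is made.

```python
import asyncio


def page_batches(max_pages: int, batch_size: int) -> list[list[int]]:
    """Split pages 1..max_pages into consecutive chunks of at most batch_size."""
    return [
        list(range(start, min(start + batch_size, max_pages + 1)))
        for start in range(1, max_pages + 1, batch_size)
    ]


async def fake_scrape(page: int) -> int:
    """Stand-in for a real scraping call (does no I/O)."""
    await asyncio.sleep(0)
    return page


async def run_all(max_pages: int, batch_size: int) -> list[int]:
    done = []
    for pages in page_batches(max_pages, batch_size):
        # all pages of a batch run concurrently; the next batch starts
        # only once the whole batch has finished
        done.extend(await asyncio.gather(*(fake_scrape(p) for p in pages)))
    return done


batches = page_batches(100, 8)
print(len(batches), batches[0], batches[-1])
# in a notebook: await run_all(100, 8)
# in a script:   asyncio.run(run_all(100, 8))
```

With 100 pages and a batch size of 8, the last batch is shorter than the others, which is why the `min(...)` is needed.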


In [101]:
# this will take a few minutes
await main()

💬 2025-09-21 11:00:49,526 jsonl file exists, assuming we had scrape already. Quitting ...


In [80]:
with open(JSON_PATH, "r") as f:
    for line in f.readlines():
        data = json.loads(line)
        break

data["result"]["articles"][0]

{'name': 'PULL&BEAR BAGGY - Jeans baggy - white',
 'brand': 'PULL&BEAR',
 'description': 'Cropped top bianco senza maniche abbinato a pantaloni bianchi a gamba larga, con tasche frontali e chiusura a bottone. Sandali piatti marroni con borchie.',
 'price': 35.99,
 'review_score': 0,
 'url': 'https://www.zalando.it/pullandbear-jeans-bootcut-white-puc21n0rs-a11.html',
 'image_url': 'https://img01.ztat.net/article/spp-media-p1/ff33dd220e7c4827ba1b8be760e6de7c/b9ca1dcb64b04fa98b0e0d5fa38fff14.jpg?imwidth=300'}

Once Qdrant is running locally, its dashboard is available at http://localhost:6333/dashboard#/welcome

In [49]:
def get_articles_from_disk():
    with open(JSON_PATH, "r") as f:
        for line in f.readlines():
            data = json.loads(line)
            yield data["result"]["articles"]

articles_gen = get_articles_from_disk()

In [81]:
from sentence_transformers import SentenceTransformer

# Choose a matryoshka dimension
EMBEDDING_SIZE = 512

# Initialize the model
model = SentenceTransformer(
    "jinaai/jina-clip-v2", trust_remote_code=True, truncate_dim=EMBEDDING_SIZE
)
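
A quick note on `truncate_dim`: matryoshka-style embeddings are trained so that a prefix of the full vector is still a usable embedding, so we can keep only the first `EMBEDDING_SIZE` components and save space. The toy sketch below shows the idea in plain Python, assuming truncation is followed by re-normalization to unit length. The vector is invented, and this is an illustration rather than the library's actual code.

```python
import math


def truncate_and_normalize(vector: list[float], dim: int) -> list[float]:
    """Keep the first `dim` components, then rescale to unit length."""
    head = vector[:dim]
    norm = math.sqrt(sum(x * x for x in head))
    if norm == 0.0:
        return head
    return [x / norm for x in head]


full = [0.5, -0.5, 0.5, 0.5, 0.1, -0.2]  # toy 6-d "embedding", not real model output
small = truncate_and_normalize(full, 4)
print(len(small), math.sqrt(sum(x * x for x in small)))
```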





In [82]:
image_embeddings = model.encode(data["result"]["articles"][0]["image_url"], normalize_embeddings=True)

image_embeddings

array([-1.28987029e-01,  1.39657214e-01, -1.31025478e-01,  9.68374386e-02,
       -2.69599911e-02,  4.83107083e-02, -1.53913930e-01,  1.22468648e-02,
       -1.03504017e-01,  5.69794662e-02, -9.76331085e-02, -3.51643823e-02,
       -5.69246598e-02, -2.31039617e-03, -6.89019784e-02,  2.16930825e-02,
        1.19779579e-01,  2.22846001e-01,  1.75678268e-01, -4.91316523e-03,
       -1.80066481e-01,  4.29708622e-02,  5.84954470e-02, -6.49221167e-02,
        1.27607718e-01, -1.78001001e-01, -1.01392306e-01,  9.52235460e-02,
       -3.21293138e-02,  1.77865714e-01, -4.00542766e-02, -1.70831531e-02,
        8.89032148e-03, -9.91963074e-02, -9.23199654e-02,  7.11153671e-02,
       -1.20687401e-02,  4.08993065e-02, -9.51866899e-03,  7.04408735e-02,
        6.54054433e-02,  1.05399072e-01, -1.37217818e-02, -1.09636031e-01,
       -2.46394686e-02,  1.02678955e-01, -7.89565668e-02, -1.83367297e-01,
        6.84156939e-02, -8.01245421e-02, -5.04700560e-03, -2.22889017e-02,
        2.15363712e-03, -

In [61]:
QDRANT_COLLECTION_NAME = "clothes"
QDRANT_URL = "http://localhost:6333"


from qdrant_client import QdrantClient, models
import numpy as np
import asyncio


client = QdrantClient(url=QDRANT_URL)

if not client.collection_exists(QDRANT_COLLECTION_NAME):
    print(f"{QDRANT_COLLECTION_NAME} created!")
    client.create_collection(
        collection_name=QDRANT_COLLECTION_NAME,
        vectors_config=models.VectorParams(
            size=EMBEDDING_SIZE, distance=models.Distance.COSINE, on_disk=True
        ),
        quantization_config=models.ScalarQuantization(
            scalar=models.ScalarQuantizationConfig(
                type=models.ScalarType.INT8,
                quantile=0.99,
                always_ram=True,
            ),
        ),
    )
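
To get a feel for what `ScalarType.INT8` buys us: each component is stored as a single byte instead of a 4-byte float, and the `quantile` setting trims the most extreme values before choosing the range to encode. Below is a toy sketch of the idea in plain Python. The `quantize`/`dequantize` helpers and the bounds are made up for illustration, and this is not how Qdrant implements it internally.

```python
def quantize(values: list[float], lo: float, hi: float) -> list[int]:
    """Map floats in [lo, hi] to integers 0..255, clamping outliers."""
    scale = (hi - lo) / 255
    return [min(255, max(0, round((v - lo) / scale))) for v in values]


def dequantize(codes: list[int], lo: float, hi: float) -> list[float]:
    """Approximate inverse of quantize."""
    scale = (hi - lo) / 255
    return [lo + c * scale for c in codes]


vector = [-0.13, 0.14, -0.03, 0.22, 0.0]  # toy values
lo, hi = -0.25, 0.25  # assumed bounds; Qdrant derives its own from the data
codes = quantize(vector, lo, hi)
restored = dequantize(codes, lo, hi)
max_error = max(abs(a - b) for a, b in zip(vector, restored))
print(codes, max_error)
```

The reconstruction error stays within half a quantization step, which is the trade-off we accept in exchange for the smaller memory footprint.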

In [62]:
BATCH_SIZE = 8
import uuid
from tqdm.autonotebook import tqdm

def embed_articles(batch):
    # embed the image behind each article's image_url
    image_urls = [el["image_url"] for el in batch]
    image_embeddings = model.encode(image_urls, normalize_embeddings=True)
    return image_embeddings

def insert_articles_in_db(batch, embeddings):
    client.upsert(
        collection_name=QDRANT_COLLECTION_NAME,
        points=[
            models.PointStruct(
                id=str(uuid.uuid4()),
                vector=vector,
                payload=payload
            )
            for payload, vector in zip(batch, embeddings)
        ],
    )
    

with tqdm(articles_gen, desc="Article Collections", position=0) as pbar_collections:
    for articles in pbar_collections:
        batches = list(range(0, len(articles), BATCH_SIZE))
        with tqdm(batches, desc="Processing Batches", position=1, leave=False) as pbar_batches:
            for i in pbar_batches:
                batch = articles[i:i + BATCH_SIZE]
                embeddings = embed_articles(batch)
                insert_articles_in_db(batch, embeddings)        

Article Collections: 0it [00:00, ?it/s]
Processing Batches:   0%|          | 0/4 [00:00<?, ?it/s]
Processing Batches:  25%|██▌       | 1/4 [00:05<00:15,  5.22s/it]
Processing Batches:  50%|█████     | 2/4 [00:09<00:09,  4.80s/it]
Processing Batches:  75%|███████▌  | 3/4 [00:14<00:04,  4.92s/it]
Processing Batches: 100%|██████████| 4/4 [00:17<00:00,  4.17s/it]
Article Collections: 1it [00:17, 17.82s/it]
Processing Batches:   0%

IndexError: list index out of range

In [71]:
query = 't-shirt black'  # English
# to search with text instead of an image, embed the query like this:
# query_embeddings = model.encode(
#     query, prompt_name='retrieval.query', normalize_embeddings=True
# )

# here we search with an image instead
query_embeddings = model.encode(
    "https://d1fufvy4xao6k9.cloudfront.net/images/landings/43/shirts-mob-1.jpg",
    normalize_embeddings=True,
)

res = client.search(
        collection_name=QDRANT_COLLECTION_NAME,
        query_vector=query_embeddings.tolist(),
        limit=10,
    )
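
Since our embeddings are normalized, cosine similarity boils down to a dot product, and that similarity is what shows up in the `score` field of each result. Conceptually the search ranks the stored vectors by this score. The sketch below does it with an exhaustive loop over invented 2-d vectors, whereas Qdrant relies on an approximate index to avoid comparing against every point. `dot` and `top_k` are hypothetical helpers.

```python
def dot(a: list[float], b: list[float]) -> float:
    """Dot product; equals cosine similarity for unit-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def top_k(
    query_vec: list[float], items: dict[str, list[float]], k: int
) -> list[tuple[str, float]]:
    """Score every item against the query and return the k best, highest first."""
    scored = [(name, dot(query_vec, vec)) for name, vec in items.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]


# toy unit-length vectors, invented for illustration
toy_items = {
    "jeans": [1.0, 0.0],
    "t-shirt": [0.0, 1.0],
    "top": [0.6, 0.8],
}
toy_query = [0.0, 1.0]
best = top_k(toy_query, toy_items, k=2)
print(best)
```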




In [72]:
res

[ScoredPoint(id='2422ba6f-262a-4e04-9126-f874593e4c7a', version=16, score=0.7814283, payload={'name': 'Mango MIT ZIERNÄHTEN', 'brand': 'Mango', 'description': '', 'price': 45.99, 'review_score': 0, 'url': 'https://www.zalando.it/mango-wide-leg-blau-m9121n26u-k11.html', 'image_url': 'https://img01.ztat.net/article/spp-media-p1/13f14ab6cacc4f33aea6cfadb0fac207/7aef5c18accf478d860a3331103b73f6.jpg?imwidth=300'}, vector=None, shard_key=None, order_value=None),
 ScoredPoint(id='cb86cef5-ba3f-4b05-b8c8-6cfda9f9333a', version=16, score=0.7630111, payload={'name': 'Calvin Klein Jeans MID RISE', 'brand': 'Calvin Klein Jeans', 'description': 'Promo', 'price': 59.99, 'review_score': 0, 'url': 'https://www.zalando.it/calvin-klein-jeans-mid-rise-jeans-skinny-fit-denim-dark-c1821n0lx-k11.html', 'image_url': 'https://img01.ztat.net/article/spp-media-p1/7677dca00ac64142bbd7f40a123fa9f1/5e505f360e094ad89fe394ac376261a8.jpg?imwidth=300'}, vector=None, shard_key=None, order_value=None),
 ScoredPoint(id='