# 🧢 Mixed-Type Domain-Specific Embeddings (Fashion/Retail Example)

Our custom embedding model represents each item as a 12-dimensional feature vector in which:
- Some dimensions are binary (0/1)
- Others are integers (0–255) taken from domain-specific lookups (e.g., color IDs, size ranks)



## 🧢 FashionVectorizer: Domain-Specific Vector Encoder for Retail Search

The `FashionVectorizer` class is a utility that converts fashion catalog data and user search descriptions into 12-dimensional semantic vectors. These vectors are designed for **clothing and shoe product indexing, similarity search, and recommendation systems**.

This approach enables **interpretable, explainable embeddings** with both binary and integer values reflecting domain-specific features like color, size, material, weather resistance, and use case.

---

### 🧬 Vector Schema (12 Dimensions)

| Dim | Feature                     | Type     | Notes                                                   |
|-----|-----------------------------|----------|----------------------------------------------------------|
| 0   | Is footwear                 | binary   | 1 = shoes, boots, sneakers                               |
| 1   | Is outerwear                | binary   | 1 = jacket, coat                                         |
| 2   | For kids                    | binary   | based on 'target' field or query context                |
| 3   | Waterproof                  | binary   | 1 = water resistant gear                                 |
| 4   | Use case code               | integer  | 0 = casual, 1 = formal, 2 = sport, 3 = workwear          |
| 5   | Color code                  | integer  | Based on lookup table (0–255)                            |
| 6   | Size code                   | integer  | XS = 80, S = 100, M = 150, L = 200, XL = 230             |
| 7   | Has memory foam insole      | binary   | comfort enhancement, mostly in shoes                    |
| 8   | Insulated/thermal lining    | binary   | relevant for winter gear                                |
| 9   | Oversized                   | binary   | relaxed fit style                                        |
| 10  | Denim/jeans-style           | binary   | checks for 'denim' in material or query text             |
| 11  | Brand popularity score      | integer  | default = 50, range 0–100 (manual or learned ranking)    |

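To make the interpretability claim concrete, the schema above can be turned into a small decoder that labels each position of a vector. This is a minimal sketch: `FEATURE_NAMES` and `explain_vector` are hypothetical names that are not part of `FashionVectorizer`, and the sample vector is hand-written for illustration.

```python
from typing import Dict, List

# Hypothetical labels, one per dimension of the schema table (index = Dim).
FEATURE_NAMES = [
    "is_footwear", "is_outerwear", "for_kids", "waterproof",
    "use_case_code", "color_code", "size_code", "memory_foam",
    "thermal_lining", "oversized", "denim", "brand_score",
]

def explain_vector(vector: List[int]) -> Dict[str, int]:
    """Pair each of the 12 values with its feature name."""
    if len(vector) != len(FEATURE_NAMES):
        raise ValueError(f"expected {len(FEATURE_NAMES)} values, got {len(vector)}")
    return dict(zip(FEATURE_NAMES, vector))

# Hand-written sample: kids' waterproof thermal boots, brown (35), size M (150).
sample = [1, 0, 1, 1, 0, 35, 150, 0, 1, 0, 0, 75]
labeled = explain_vector(sample)
print(labeled["color_code"], labeled["size_code"])  # 35 150
```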
---

### ✅ Methods

- `json_to_vector(item)`  
  Converts a parsed catalog record (a dict loaded from one JSONL line) into a 12-dimensional embedding.

- `description_to_vector(description)`  
  Converts a free-text search query into the same 12-dimensional embedding by keyword matching.

- `vector_to_jsonl(vector, name, item_id)`  
  Wraps a vector and its metadata in a dict with `id`, `name`, and `vector` keys, ready to be serialized as one JSONL line for indexing.

---

### 💡 Example: Vector from Catalog Entry

```python
fv = FashionVectorizer()

item = {
  "id": "item_42",
  "name": "Brown waterproof boots, thermal lining",
  "type": "boots",
  "color": "brown",
  "size": "M",
  "target": "kids",
  "use_case": "casual",
  "material": "leather",
  "foam": False,
  "thermal": True,
  "oversized": False,
  "waterproof": True,
  "brand_score": 75
}

vector = fv.json_to_vector(item)
# vector == [1, 0, 1, 1, 0, 35, 150, 0, 1, 0, 0, 75]
```

---

### 💬 Example: Vector from Natural Query

```python
query = "Looking for warm waterproof kids boots, size M, brown"
vector = fv.description_to_vector(query)
# vector == [1, 0, 1, 1, 0, 35, 150, 0, 1, 0, 0, 50]  (brand score falls back to the default of 50)
```

---

### 🔁 Rebuild JSONL Line from Vector

```python
# Returns a dict with "id", "name", and "vector" keys; pass it to json.dumps to get a JSONL line.
record = fv.vector_to_jsonl(vector, name="Query: boots", item_id="query_1")
```
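
Since `vector_to_jsonl` returns a plain dictionary (per the class definition in this notebook), producing an actual JSONL line still needs a serialization step. The following is a minimal sketch using the standard `json` module; the `record` literal is a hand-written stand-in for the method's return value.

```python
import json

# Stand-in for the dict returned by vector_to_jsonl (same three keys).
record = {
    "id": "query_1",
    "name": "Query: boots",
    "vector": [1, 0, 1, 1, 0, 35, 150, 0, 1, 0, 0, 50],
}

# JSONL convention: one JSON object per line, no embedded newlines.
line = json.dumps(record)

# Parsing the line recovers the same record.
restored = json.loads(line)
print(restored == record)  # True
```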






In [None]:
import re
from typing import Dict, List

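class FashionVectorizer:
    """Encode catalog items and free-text queries as 12-dimensional feature vectors."""

    def __init__(self):
        # Lookup tables that map categorical values to integer codes.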
        self.color_lookup = {
            "black": 0, "white": 255, "gray": 128, "red": 50,
            "blue": 75, "brown": 35, "green": 90, "beige": 60
        }
        self.size_lookup = {
            "xs": 80, "s": 100, "m": 150, "l": 200, "xl": 230
        }
        self.use_case_lookup = {
            "casual": 0, "formal": 1, "sport": 2, "workwear": 3
        }

    def json_to_vector(self, item: Dict) -> List[float]:
        return [
            1 if item["type"] in ["shoes", "boots", "sneakers"] else 0,
            1 if item["type"] in ["jacket", "coat"] else 0,
            1 if item.get("target") == "kids" else 0,
            1 if item.get("waterproof", False) else 0,
            self.use_case_lookup.get(item.get("use_case", "casual"), 0),
            self.color_lookup.get(item.get("color", "gray"), 128),
            self.size_lookup.get(item.get("size", "m").lower(), 150),
            1 if item.get("foam", False) else 0,
            1 if item.get("thermal", False) else 0,
            1 if item.get("oversized", False) else 0,
            1 if "denim" in item.get("material", "") else 0,
            item.get("brand_score", 50)
        ]

    def description_to_vector(self, description: str) -> List[float]:
        description = description.lower()
        return [
            int(any(k in description for k in ["shoes", "boots", "sneakers"])),
            int(any(k in description for k in ["jacket", "coat"])),
            int(any(k in description for k in ["kid", "child", "children", "toddler", "youth"])),
            int(any(k in description for k in ["waterproof", "rain", "wet"])),
            next((v for k, v in self.use_case_lookup.items() if k in description), 0),
            next((v for k, v in self.color_lookup.items() if k in description), 128),
            # Match size tokens as whole words; in a raw string the word boundary is \b, not \\b.
            next((v for k, v in self.size_lookup.items() if re.search(rf"\b{k}\b", description)), 150),
            int(any(k in description for k in ["foam", "insole", "comfort"])),
            int(any(k in description for k in ["warm", "thermal", "insulated", "winter"])),
            int(any(k in description for k in ["oversized", "loose", "relaxed"])),
            int(any(k in description for k in ["denim", "jean"])),
            50  # default brand popularity
        ]

    def vector_to_jsonl(self, vector: List[float], name: str, item_id: str) -> Dict:
        return {
            "id": item_id,
            "name": name,
            "vector": vector
        }
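
The keyword checks in `description_to_vector` use substring matching, so a query such as "running trainers" sets the waterproof flag because "trainers" contains "rain". One possible refinement, sketched here under the assumption that whole-word matching (with an optional plural "s") suits this vocabulary, is a small regex helper. `has_keyword` is a hypothetical name and is not part of the class above.

```python
import re
from typing import Iterable

def has_keyword(text: str, keywords: Iterable[str]) -> bool:
    """Return True if any keyword occurs as a whole word (optionally plural)."""
    pattern = r"\b(?:" + "|".join(re.escape(k) for k in keywords) + r")s?\b"
    return re.search(pattern, text.lower()) is not None

waterproof_words = ["waterproof", "rain", "wet"]
print(has_keyword("running trainers", waterproof_words))        # False
print(has_keyword("boots for rain and mud", waterproof_words))  # True
print(has_keyword("shoes for kids", ["kid", "child"]))          # True
```

The trade-off is that compounds such as "raincoat" no longer match "coat", so the keyword lists would need to be extended to cover them.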


## Generate Sample Data for Clothing and Footwear Items

In [None]:
import json
import random

fv = FashionVectorizer()

types = ["sneakers", "boots", "jacket", "coat", "dress", "pants", "sweater"]
colors = list(fv.color_lookup.keys())
sizes = list(fv.size_lookup.keys())
targets = ["men", "women", "kids", "unisex"]
use_cases = list(fv.use_case_lookup.keys())
materials = ["denim", "cotton", "leather", "polyester", "wool", "canvas"]

def random_item(i):
    # Pick the attributes first so the display name agrees with the item's fields.
    item_type = random.choice(types)
    color = random.choice(colors)
    size = random.choice(sizes)
    return {
        "id": f"item_{i}",
        "name": f"{color} {size} {item_type}",
        "type": item_type,
        "color": color,
        "size": size,
        "target": random.choice(targets),
        "use_case": random.choice(use_cases),
        "material": random.choice(materials),
        "foam": random.choice([True, False]),
        "thermal": random.choice([True, False]),
        "oversized": random.choice([True, False]),
        "waterproof": random.choice([True, False]),
        "brand_score": random.randint(30, 100)
    }

with open("products.jsonl", "w") as f:
    for i in range(500):
        product = random_item(i)
        f.write(json.dumps(product) + "\n")



## Index Embeddings into ChromaDB

In [None]:
!pip install chromadb

In [None]:
import chromadb
from chromadb.config import Settings

chroma_client = chromadb.Client(Settings(anonymized_telemetry=False))
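# Chroma defaults to squared L2 distance; request cosine explicitly so the query
# section really ranks by cosine distance (assumes a Chroma version that honors
# the "hnsw:space" metadata key).
collection = chroma_client.get_or_create_collection(
    name="fashion_items",
    metadata={"hnsw:space": "cosine"},
)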

fv = FashionVectorizer()

ids, docs, vectors = [], [], []

with open("products.jsonl", "r") as f:
    for line in f:
        item = json.loads(line)
        vector = fv.json_to_vector(item)
        ids.append(item["id"])
        docs.append(item["name"])
        vectors.append(vector)

collection.add(
    ids=ids,
    documents=docs,
    embeddings=vectors
)

print(f"✅ {len(ids)} items indexed to ChromaDB.")


## Query the Closest Matches by Cosine Distance

In [None]:
# Human input
query = "I'm looking for a black oversized denim jacket for casual use"

# Convert using FashionVectorizer
fv = FashionVectorizer()
query_vector = fv.description_to_vector(query)

# Search ChromaDB
results = collection.query(
    query_embeddings=[query_vector],
    n_results=3,
    include=["distances", "documents"]
)

# Show matches
for doc, dist in zip(results["documents"][0], results["distances"][0]):
    print(f"🧵 Match: {doc} (Distance: {dist:.4f})")
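
Because the integer dimensions (color, size, brand score) are one to two orders of magnitude larger than the binary flags, they dominate cosine distance on the raw vectors. The sketch below illustrates this in plain Python and shows one possible mitigation, dividing each dimension by an assumed maximum. `MAX_VALUES`, `scale`, and `cosine_distance` are hypothetical names, and the scaling constants are read off the schema table rather than taken from any tuned model.

```python
import math
from typing import List

# Assumed per-dimension maxima from the schema: flags 1, use case 3,
# color 255, size 230, brand score 100.
MAX_VALUES = [1, 1, 1, 1, 3, 255, 230, 1, 1, 1, 1, 100]

def scale(vector: List[float]) -> List[float]:
    """Map every dimension into [0, 1] by dividing by its assumed maximum."""
    return [v / m for v, m in zip(vector, MAX_VALUES)]

def cosine_distance(a: List[float], b: List[float]) -> float:
    """1 - cosine similarity; 0 means the vectors point the same way."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm

# Hand-written vectors with the same color, size, and brand score but
# different binary flags (kids' waterproof boots vs. an oversized denim jacket).
boots = [1, 0, 1, 1, 0, 35, 150, 0, 1, 0, 0, 50]
jacket = [0, 1, 0, 0, 0, 35, 150, 0, 0, 1, 1, 50]

raw = cosine_distance(boots, jacket)
scaled = cosine_distance(scale(boots), scale(jacket))
print(f"raw: {raw:.4f}, scaled: {scaled:.4f}")
```

On the raw vectors the two items look almost identical even though they are different product types; after scaling, the flags contribute on a comparable footing. If scaling is adopted, it would need to be applied to both the indexed vectors and the query vector.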
