## Model Training & Data Ingestion Notebook (Final Stable Version)

This notebook uses the `all-MiniLM-L6-v2` sentence-transformer model, a public model that is normally downloaded from the Hugging Face Hub without authentication. It produces 384-dimensional vectors.

**Workflow:**
1. **Load Data**: Read the `intern_data_ikarus.csv` file.
2. **Generate Embeddings**: Use `sentence-transformers/all-MiniLM-L6-v2` to create a vector for each product.
3. **Connect to Weaviate**: Establish a connection with the running Weaviate instance.
4. **Batch Ingestion**: Upload the products and their new vectors to Weaviate.

In [1]:
import pandas as pd
import weaviate
import ast
import numpy as np
import uuid
from sentence_transformers import SentenceTransformer




### 1. Load and Preprocess Data

In [2]:
df = pd.read_csv('../backend/data/intern_data_ikarus.csv')

df['description'] = df['description'].fillna('')
df['title'] = df['title'].fillna('')
df['text_for_embedding'] = df['title'] + ". " + df['description']

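def safe_literal_eval(s):
    """Parse a stringified Python list; return [] for missing or malformed values."""
    try:
        parsed = ast.literal_eval(s)
    except (ValueError, SyntaxError, TypeError):
        return []
    # Guard against literals that parse but are not lists (e.g. a bare string).
    return parsed if isinstance(parsed, list) else []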

df['images'] = df['images'].apply(safe_literal_eval)
df['primary_image'] = df['images'].apply(lambda x: x[0] if x else None)

print(f"Loaded and preprocessed {len(df)} products.")

Loaded and preprocessed 312 products.
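
To make the image parsing step concrete, here is a small standalone sketch of how the helper behaves on a well-formed list, an empty list, and a malformed string. The sample values and `example.com` URLs are made up for illustration and are not taken from the dataset; the helper is a minimal version of the one in the cell above.

```python
import ast

def safe_literal_eval(s):
    # Minimal version of the notebook's helper: fall back to [] on bad input.
    try:
        return ast.literal_eval(s)
    except (ValueError, SyntaxError):
        return []

# Hypothetical cell values, not taken from the dataset.
samples = [
    "['https://example.com/a.jpg', 'https://example.com/b.jpg']",  # well-formed list
    "[]",                                                           # empty list
    "not a list",                                                   # malformed string
]

parsed = [safe_literal_eval(s) for s in samples]
# Same rule as the notebook: first image if any, otherwise None.
primary = [x[0] if x else None for x in parsed]
print(primary)
```

Only the first sample yields a primary image; the other two fall through to `None`.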


### 2. Generate Embeddings (384 Dimensions)

In [3]:
# all-MiniLM-L6-v2 produces 384-dimensional vectors; the first call downloads it from the Hugging Face Hub.
embedding_model = SentenceTransformer('sentence-transformers/all-MiniLM-L6-v2')

print("Generating text embeddings (this may take a minute)...")
product_vectors = embedding_model.encode(df['text_for_embedding'].tolist(), show_progress_bar=True)

print(f"Embeddings created with shape: {product_vectors.shape}")

OSError: sentence-transformers/all-MiniLM-L6-v2 is not a local folder and is not a valid model identifier listed on 'https://huggingface.co/models'
If this is a private repository, make sure to pass a token having permission to this repo either by logging in with `huggingface-cli login` or by passing `token=<your_token>`
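
The traceback above says the identifier could not be resolved to either a local folder or a Hub repository when the cell ran. One possible workaround, assuming a copy of the model has been saved to disk beforehand, is to prefer that folder when it exists, since `SentenceTransformer` also accepts a local path. The helper name `resolve_model_source` and the `./models/all-MiniLM-L6-v2` path are hypothetical and only serve as a sketch.

```python
from pathlib import Path

def resolve_model_source(model_id: str, local_dir: str) -> str:
    """Return local_dir if it is an existing folder, otherwise the Hub identifier."""
    path = Path(local_dir)
    return str(path) if path.is_dir() else model_id

source = resolve_model_source(
    "sentence-transformers/all-MiniLM-L6-v2",
    "./models/all-MiniLM-L6-v2",  # hypothetical location of a saved copy
)
print(f"Loading model from: {source}")

# In the notebook, the result would then be passed on:
# embedding_model = SentenceTransformer(source)
```

This does not address the underlying cause (for example a network or cache problem); it only avoids the Hub lookup when a local copy is available.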

### 3. Connect to Weaviate and Define Schema

In [None]:
client = weaviate.connect_to_local()

collection_name = "Product"
if client.collections.exists(collection_name):
    print(f"Deleting existing collection: {collection_name}")
    client.collections.delete(collection_name)

products = client.collections.create(
    name=collection_name,
    description="A collection of furniture and home products",
    vectorizer_config=weaviate.classes.config.Configure.Vectorizer.none()
)
print(f"Collection '{collection_name}' created successfully.")

### 4. Batch Data Ingestion into Weaviate

In [None]:
print("Starting data ingestion into Weaviate...")
product_collection = client.collections.get("Product")

with product_collection.batch.dynamic() as batch:
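    # enumerate() yields the row position, which matches the order of
    # product_vectors even if the DataFrame index is not 0..n-1.
    for pos, (_, row) in enumerate(df.iterrows()):
        properties = {
            "title": str(row['title']),
            "brand": str(row['brand']),
            "description": str(row['description']),
            "price": str(row['price']),
            "material": str(row['material']),
            "color": str(row['color']),
            "primary_image": str(row['primary_image']),
            "uniq_id": str(row['uniq_id'])
        }
        
        batch.add_object(
            properties=properties,
            vector=product_vectors[pos].tolist(),
            # Deterministic ID derived from uniq_id, so reruns produce the same UUIDs.
            uuid=uuid.uuid5(uuid.NAMESPACE_DNS, str(row['uniq_id']))
        )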

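# Report failures instead of assuming every object was accepted.
failed = product_collection.batch.failed_objects
if failed:
    print(f"\nData ingestion finished with {len(failed)} failed objects.")
else:
    print("\nData ingestion complete. All products have been uploaded to Weaviate.")
client.close()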
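
The ingestion loop derives each object ID with `uuid.uuid5`, which is deterministic: the same `uniq_id` maps to the same UUID on every run. The short sketch below illustrates that property using only the standard library; `product_uuid` is a hypothetical wrapper and the `uniq_id` values are made up.

```python
import uuid

def product_uuid(uniq_id: str) -> uuid.UUID:
    # Same derivation as the ingestion loop: namespace + name -> stable UUID.
    return uuid.uuid5(uuid.NAMESPACE_DNS, str(uniq_id))

first = product_uuid("abc-123")   # hypothetical uniq_id
second = product_uuid("abc-123")  # same input again
other = product_uuid("abc-124")   # different input

print(first == second)  # True
print(first == other)   # False
print(first.version)    # 5
```

A random `uuid.uuid4()` would instead give every product a new ID on each run, which makes it harder to match objects across ingestions.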