In the rapidly growing e-commerce landscape, customers expect fast, accurate, and personalized search experiences. However, traditional keyword-based search engines often fall short when handling vague, visual, or ambiguous queries — such as "Nike shoes with blue swoosh" or "Apple logo hoodie" — where intent cannot be fully captured by text alone.
This project addresses this challenge by leveraging multimodal machine learning techniques to improve product search. Specifically, it uses both textual and visual data from the Shopping Queries Image Dataset (SQID) and extracts semantic embeddings via CLIP, a vision-language model.
The machine learning problem is framed as a multimodal ranking/retrieval task, where the system computes a semantic similarity score between a user query and multiple product candidates, each represented through both their text description and image. The goal is to return a ranked list of products most relevant to the user's intent.
This approach enables improved understanding of visual and textual attributes, making the search more intuitive, robust to vocabulary gaps, and capable of handling open-ended or visually grounded queries.
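To make the scoring step concrete, a minimal sketch of the retrieval logic over precomputed CLIP embeddings is shown below. The array names and the text/image fusion weight are illustrative assumptions, not the project's exact code:
import numpy as np

def rank_products(query_emb, product_text_embs, product_image_embs, alpha=0.5, top_k=10):
    """Rank products by a weighted combination of text and image similarity.

    query_emb: (d,) CLIP embedding of the user query
    product_text_embs, product_image_embs: (n, d) CLIP embeddings, one row per product
    alpha: weight given to the text modality (hypothetical fusion scheme)
    """
    def normalize(x):
        return x / np.linalg.norm(x, axis=-1, keepdims=True)

    q = normalize(query_emb)
    text_sim = normalize(product_text_embs) @ q      # cosine similarity, text side
    image_sim = normalize(product_image_embs) @ q    # cosine similarity, image side
    score = alpha * text_sim + (1 - alpha) * image_sim
    return np.argsort(-score)[:top_k]                # indices of the top-k products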
Modern e-commerce platforms face increasing pressure to deliver accurate and engaging search experiences. When users enter queries, they expect results that match both their functional and visual expectations. For example, a search for "slim fit striped shirt" should return products that actually have the right cut and visual pattern — not just any shirt with matching keywords.
However, most search engines still rely on traditional information retrieval methods that focus only on textual metadata. These methods often fail when:
- The product image contradicts the description.
- Visual details are not fully captured by the product title or description.
- The user's intent is visual or ambiguous.
This project aims to improve the product search experience by using multimodal machine learning, combining product titles, descriptions, and images to better understand relevance. It leverages the Shopping Queries Image Dataset (SQID), an enriched dataset built on top of Amazon’s Shopping Queries Dataset (SQD), which includes product images and pretrained CLIP embeddings.
By understanding both the visual and textual context of each product, we can return more meaningful and satisfying search results — enhancing customer experience, increasing conversion rates, and reducing bounce rates.
Using machine learning, particularly multimodal models, provides substantial business value for modern e-commerce platforms. Traditional search engines based on keyword matching often fail to capture subtle user intent or visual expectations, leading to irrelevant results and poor user experience.
By introducing CLIP-based multimodal search, the system gains the ability to:
- Interpret complex queries more semantically, not just syntactically.
- Leverage both product images and text to rank relevance more effectively.
- Adapt to user preferences over time without manual rule-based tuning.
In summary, applying machine learning in this context helps companies stay competitive by improving user satisfaction, driving revenue, and enabling smarter automation in product search and recommendation systems.
This project uses the Shopping Queries Image Dataset (SQID), an extension of the original Shopping Queries Dataset (SQD) released by Amazon for the KDDCup’22 challenge. SQID enriches the SQD dataset by including product images and precomputed CLIP embeddings, allowing for multimodal learning and ranking tasks.
Key dataset components:
- Queries: User search terms in natural language (e.g., "men's red sneakers").
- Products: Product listings with titles, descriptions, brand, color, etc.
- Images: Main product image for each item, linked via URL.
- Labels (ESCI): Relevance judgments:
  - E – Exact match
  - S – Substitute
  - C – Complement
  - I – Irrelevant
- Embeddings: Precomputed CLIP text and image embeddings for each product.
Dataset stats:
- ~190,000 product listings
- ~1.1M query-product pairs
- Covers US, ES, and JP locales (this project focuses on the US locale)
The dataset is publicly available at: 🔗 https://github.com/Crossing-Minds/shopping-queries-image-dataset
Based on the archetype framework taught in class, this project falls under the Software 2.0 category.
Software 2.0 refers to systems where traditional code is replaced or augmented by machine-learned logic — typically models trained on data. Instead of writing explicit rules for ranking products or understanding user queries, the system learns from examples and uses embeddings (text + image) to drive behavior.
In this project:
- The ranking function is learned through semantic similarity between query and product embeddings.
- The rules for what is relevant are not hardcoded — they are inferred from CLIP's understanding of visual-textual alignment.
- The system adapts and improves based on the distribution of queries and product listings.
This is in contrast to:
- Autonomous systems: which act in the world (e.g., robotics).
- Human-in-the-loop: where a person is part of the feedback cycle (e.g., labeling data or approving results).
Thus, this project is a clear example of Software 2.0 — where ML replaces handcrafted relevance ranking logic.
This project is inspired by two key papers:
- Shopping Queries Dataset (SQD) – arXiv:2206.06588
  Introduced by Amazon, this dataset provides a large-scale benchmark for product search using query-product pairs labeled by relevance. It defined the ESCI labels (Exact, Substitute, Complement, Irrelevant) and proposed ranking, classification, and substitution detection tasks.
- Shopping Queries Image Dataset (SQID) – arXiv:2405.15190
  SQID builds on SQD by enriching it with product images and pretrained embeddings (text and image) using CLIP. It supports multimodal learning and highlights the value of combining text + image for better product ranking.
Baseline Model Summary
Model Name | Developer | Purpose | Performance (NDCG) |
---|---|---|---|
CLIP (ViT-L/14) | OpenAI | Multimodal embedding for image-text matching | ~0.82 (SQID paper) |
SBERT (MiniLM-L12-v2) | Hugging Face | Text-only semantic similarity model | ~0.83 |
ESCI Baseline (MS MARCO Cross-Encoder) | Amazon | Fine-tuned text-only ranker for SQD | 0.85+ |
The baseline model used in this project is CLIP (Contrastive Language-Image Pretraining), specifically the clip-vit-large-patch14 variant, available via Hugging Face.
- It provides joint embeddings for both text (user queries) and images (product photos).
- Embeddings can be compared using cosine similarity to estimate semantic relevance.
This model is:
- Open source
- Available as a pretrained binary
- Usable without retraining (zero-shot setup)
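A minimal zero-shot usage sketch with the Hugging Face transformers API is shown below (the query text and image path are illustrative):
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-large-patch14")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-large-patch14")

# Encode a user query and a product image into the shared embedding space.
text_inputs = processor(text=["men's red sneakers"], return_tensors="pt", padding=True)
image = Image.open("product.jpg").convert("RGB")  # illustrative path
image_inputs = processor(images=image, return_tensors="pt")

with torch.no_grad():
    text_emb = model.get_text_features(**text_inputs)
    image_emb = model.get_image_features(**image_inputs)

# Cosine similarity between query and product image serves as the relevance score.
score = torch.nn.functional.cosine_similarity(text_emb, image_emb).item()
print(f"query-image similarity: {score:.3f}")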
This project focuses on ranking product search results by relevance to a given user query. Therefore, the most appropriate evaluation metrics are:
- NDCG (Normalized Discounted Cumulative Gain)
  - Why: NDCG measures the quality of a ranked list by rewarding correct items at higher positions.
  - How it works:
    - Takes into account the position of relevant items in the list.
    - ESCI labels are mapped to graded relevance scores: E (Exact) = 1.0, S (Substitute) = 0.1, C (Complement) = 0.01, I (Irrelevant) = 0.0.
  - Used in: the official SQD and SQID benchmarks.
- Precision@K
  - Measures how many of the top-K retrieved products are truly relevant (e.g., Exact or Substitute).
  - Useful for evaluating short result lists, like the top 5 or top 10.
- Cosine similarity
  - Not a final evaluation metric, but used internally to compare query and product embeddings (via CLIP).
  - It is how product candidates are scored and ranked before computing NDCG.
These metrics provide a robust framework to evaluate how well the system understands and ranks product relevance based on multimodal inputs.
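To make NDCG concrete, the following small sketch applies the graded relevance mapping above to a ranked list (it simplifies the ideal ranking to the returned items only):
import numpy as np

ESCI_GAIN = {'E': 1.0, 'S': 0.1, 'C': 0.01, 'I': 0.0}

def ndcg_at_k(ranked_labels, k=10):
    """ranked_labels: ESCI labels of the returned products, in ranked order."""
    gains = np.array([ESCI_GAIN[l] for l in ranked_labels[:k]])
    discounts = 1.0 / np.log2(np.arange(2, len(gains) + 2))
    dcg = float(np.sum(gains * discounts))
    # Simplification: the ideal ordering here only reuses the returned items' gains.
    ideal = np.sort(gains)[::-1]
    idcg = float(np.sum(ideal * discounts))
    return dcg / idcg if idcg > 0 else 0.0

print(ndcg_at_k(['S', 'E', 'I', 'E', 'C']))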
Milestone 2 Demo
This milestone implements a functional Streamlit prototype (PoC) of the multimodal product search engine using OpenAI's CLIP model. The system supports natural language queries and returns visually and semantically relevant product images.
Hugging Face MultimodalSearchML Space
- Build an interactive UI to test query-to-image search
- Use real product data from the SQID dataset
- Deploy the app publicly on Hugging Face Spaces
- Model: CLIP (ViT-L/14) from OpenAI
- Embeddings:
  - Product image features: precomputed and stored in products_sample.csv
  - Query text: encoded in real time using CLIP
- Source: SQID (Shopping Queries Image Dataset)
- Sample size: 15,000 products
- Format: product_id, product_title, image_url, clip_image_features (768-dim)
- Built with Streamlit
- Accepts a free-text user query
- Encodes the query with CLIP
- Computes cosine similarity between query and all image embeddings (batched)
- Ranks the top 10 matches and displays:
  - Product image
  - Title / ID
  - Normalized match score (%)
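A rough sketch of this Streamlit flow, assuming the image features are stored as stringified float lists in products_sample.csv (parsing and caching details are illustrative):
import ast

import clip
import numpy as np
import pandas as pd
import streamlit as st
import torch

@st.cache_resource
def load_model():
    return clip.load("ViT-L/14", device="cpu")

@st.cache_data
def load_products(path="products_sample.csv"):
    df = pd.read_csv(path)
    # Assumes clip_image_features is stored as a stringified list of 768 floats.
    feats = np.stack(df["clip_image_features"].apply(ast.literal_eval).values)
    return df, feats / np.linalg.norm(feats, axis=1, keepdims=True)

model, _ = load_model()
df, image_feats = load_products()

query = st.text_input("Search products")
if query:
    with torch.no_grad():
        q = model.encode_text(clip.tokenize([query])).numpy()[0]
    q = q / np.linalg.norm(q)
    scores = image_feats @ q
    for i in np.argsort(-scores)[:10]:
        st.image(df.iloc[i]["image_url"], width=200)
        st.write(df.iloc[i]["product_title"], f"{scores[i] * 100:.1f}%")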
The working PoC is live on Hugging Face Spaces:
Component | Status |
---|---|
Query input | ✅ |
CLIP query encoding | ✅ |
Embedding similarity | ✅ |
Image ranking | ✅ |
Visual output | ✅ |
This milestone focuses on building the data pipeline for the MultimodalSearchML project. We automated the ingestion, validation, preprocessing, and preparation of the dataset using TFX, ensured proper data versioning with DVC, and managed features with Feast as our feature store.
- Library used: TFX ExampleGen
- Dataset used: query_features_with_timestamp.csv
- Task: Ingest raw data into the pipeline for future steps (statistics, schema generation, validation, and transformation).
I created a TFX pipeline component for ingestion using CsvExampleGen. Then, I loaded the dataset from:
data/query/query_features_with_timestamp.csv
The dataset is already prepared with query_id, f0 to f767 features, and an event_timestamp column.
from tfx.components import CsvExampleGen
from tfx.proto import example_gen_pb2
input_config = example_gen_pb2.Input(splits=[
example_gen_pb2.Input.Split(name='train', pattern='query_features_with_timestamp.csv'),
])
example_gen = CsvExampleGen(input_base=data_path, input_config=input_config)
The CsvExampleGen component is used to automatically:
- Ingest the CSV file
- Convert it into TFRecords
- Split the data into train and eval sets
- Store generated artifacts under artifacts/CsvExampleGen

Results:
- The ingested data can be observed under tfx_pipeline/artifacts/CsvExampleGen/examples/, inside Split-train/ and Split-eval/, each containing generated data_tfrecord-* files.
To automatically generate statistics, infer a data schema, and detect anomalies in the ingested query dataset before applying transformations or training models.
I added three essential components:
- StatisticsGen: Computes descriptive statistics on the ingested data.
- SchemaGen: Automatically infers the schema from the statistics.
- ExampleValidator: Detects data anomalies and schema drift.
from tfx.components import StatisticsGen, SchemaGen, ExampleValidator
# Compute statistics
statistics_gen = StatisticsGen(examples=example_gen.outputs['examples'])
# Infer schema
schema_gen = SchemaGen(statistics=statistics_gen.outputs['statistics'])
# Validate dataset against the inferred schema
example_validator = ExampleValidator(
    statistics=statistics_gen.outputs['statistics'],
    schema=schema_gen.outputs['schema']
)
Pipeline Flow:
- Statistics are automatically computed from the TFRecords generated by ExampleGen.
- SchemaGen infers a schema without any manual intervention.
- ExampleValidator checks whether:
  - The data conforms to the inferred schema.
  - There are missing values, unusual distributions, or unexpected types.
Results:
- Statistics are stored under: tfx_pipeline/artifacts/StatisticsGen/statistics/
- Schema is stored under: tfx_pipeline/artifacts/SchemaGen/schema/
- This is the output of our schema (schema.pbtxt):
feature {
name: "event_timestamp"
type: BYTES
domain: "event_timestamp"
presence {
min_fraction: 1.0
min_count: 1
}
shape {
dim {
size: 1
}
}
}
feature {
name: "f0"
type: FLOAT
presence {
min_fraction: 1.0
min_count: 1
}
shape {
dim {
size: 1
}
}
}
feature {
name: "f1"
type: FLOAT
presence {
min_fraction: 1.0
min_count: 1
}
shape {
dim {
size: 1
}
}
}
.....
.....
feature {
name: "f767"
type: FLOAT
presence {
min_fraction: 1.0
min_count: 1
}
shape {
dim {
size: 1
}
}
}
feature {
name: "query_id"
type: INT
presence {
min_fraction: 1.0
min_count: 1
}
shape {
dim {
size: 1
}
}
}
string_domain {
name: "event_timestamp"
value: "2025-03-27 07:01:04.332726+00:00"
}
- The anomalies report is stored under: tfx_pipeline/artifacts/ExampleValidator/anomalies/
NB:
- This process is fully automated thanks to TensorFlow Data Validation (TFDV).
- The schema file (schema.pbtxt) will serve as input for the next Transform step.
- Any anomalies detected can be analyzed in TensorBoard or directly via the generated artifacts.
Prepare the dataset for model training by applying transformations, selecting useful features, and managing missing data using TensorFlow Transform (TFX Transform).
I built a dedicated preprocessing module responsible for:
- Selecting only useful columns (768 embedding features).
- Handling potential missing data.
- Preparing the data for modeling in a scalable and production-friendly way.
Preprocessing Module (preprocessing.py):
import tensorflow as tf
import tensorflow_transform as tft
def preprocessing_fn(inputs):
    # Keep only feature columns f0 to f767
    outputs = {}
    for i in range(768):
        col = f"f{i}"
        outputs[col] = inputs[col]
    # Keep the event timestamp as is
    outputs['event_timestamp'] = inputs['event_timestamp']
    return outputs
TFX Transform Component (in the pipeline):
from tfx.components import Transform
transform = Transform(
    examples=example_gen.outputs['examples'],
    schema=schema_gen.outputs['schema'],
    module_file=module_file,  # preprocessing.py script
)
What this step achieves:
- Reduces the dataset to only the needed features (f0 - f767 + timestamp).
- Ensures data is formatted consistently.
- Scales and prepares data in a reproducible way for future serving.
- The transformed dataset is automatically split into:
- Split-train/
- Split-eval/
Output Directories:
- Transform Graph: tfx_pipeline/artifacts/Transform/transform_graph/
- Transformed Dataset: tfx_pipeline/artifacts/Transform/transformed_examples/
In this step, I integrated Feast as a feature store to manage and serve features for the MultimodalSearchML project. The feature store plays a crucial role in centralizing, versioning, and serving feature data for both training and serving machine learning models.
Objectives:
- Organize, version, and serve query and product dataset features.
- Enable consistent feature access during both training and inference.
- Prepare for later integration with the future TFX pipeline model training phase.
Data Used:
- query_features_flat.csv
- product_features_flat.parquet
These datasets contain extracted features representing query embeddings and product embeddings computed using CLIP models.
I installed Feast in a dedicated environment:
pip install feast==0.40.1
Then, initialized the feature repository:
feast init feature_store/multimodal_features
I defined two main FeatureViews:
- Query Feature View from query_features_flat.csv
- Product Feature View from product_features_flat.parquet
Example from query_features_view.py:
query_feature_view = FeatureView(
    name="query_features_view",
    entities=[query_entity],
    ttl=Duration(seconds=86400),
    schema=[Field(name=f"f{i}", dtype=Float32) for i in range(768)],
    source=query_source
)
Example from product_features_view.py:
product_feature_view = FeatureView(
    name="product_features_view",
    entities=[product_entity],
    ttl=Duration(seconds=86400),
    schema=[Field(name=f"f{i}", dtype=Float32) for i in range(768)],
    source=product_source
)
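For context, the entities and sources referenced by these views can be defined roughly as follows (paths are illustrative; the project's actual definitions may differ):
from feast import Entity, FileSource

# Entities identify the key that features are joined on.
query_entity = Entity(name="query_id", join_keys=["query_id"])
product_entity = Entity(name="product_id", join_keys=["product_id"])

# File-based offline source for the product embeddings (Parquet, with an event timestamp).
# The query source is defined analogously from the query feature file.
product_source = FileSource(
    path="data/product_features_flat.parquet",  # illustrative path
    timestamp_field="event_timestamp",
)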
Once the FeatureViews were defined, we applied them:
feast apply
Feast registered:
- 2 Entities (query_id and product_id)
- 2 FeatureViews
- Features with embedding dimensions: (f0 to f767 for queries) and (f0 to f1535 for products)
Feature Materialization:
I materialized features to be ready for offline and online serving:
feast materialize-incremental $(date -u +%Y-%m-%d)
Feast then generated:
- Offline feature data (for training pipelines)
- Online feature data (for future model serving)
Summary:
- Both datasets were successfully managed in Feast.
- Redis was configured as an online store (development purpose).
- The setup ensures that during model training or serving, the exact same feature values are retrieved as during feature generation.
In this step, I integrated DVC (Data Version Control) to manage and version the raw data used throughout the milestone. DVC provides a Git-like workflow for datasets, making sure that every experiment or pipeline execution always uses the correct version of the data.
Objectives:
- Track the query and product datasets systematically.
- Avoid redundant storage.
- Enable reproducibility of experiments.
- Prepare the project for collaborative work and future experiments.
I versioned the raw data folder containing:
- query_features_flat.csv
- query_features_with_timestamp.csv
- product_features_flat.parquet
- product_image_urls.csv
- supp_product_image_urls.csv
I initialized DVC inside the project directory:
dvc init
This created the .dvc/ folder and the configuration files necessary to start versioning data.
Then, to track the dataset, I used the following command to version the whole folder:
dvc add Milestone\ 3/data/raw/
DVC created:
- raw.dvc file to track changes.
- .dvc/cache/ for storing dataset versions efficiently.
I also configured a local remote to store data artifacts:
dvc remote add -d localremote ~/dvcstore
dvc remote modify localremote type local
This ensures all future dataset versions will be pushed to ~/dvcstore safely.
The datasets were pushed into the local remote:
dvc push
This command:
- Stored the dataset version in the ~/dvcstore directory.
- Allowed us to share datasets without duplicating the data.
In this step, I designed and implemented a fully functional TFX (TensorFlow Extended) pipeline for ingesting, validating, and preparing the query dataset for modeling. We focused on separating ingestion, validation, and preprocessing clearly while ensuring the pipeline is clean, professional, and reproducible.
This pipeline included the following TFX components:
- CsvExampleGen
- StatisticsGen
- SchemaGen
- ExampleValidator
- Transform
Each component generated useful artifacts under the artifacts/ directory automatically, maintaining the TFX structure of numbered runs and split directories (train/eval).
This component ingests the query_features_with_timestamp.csv dataset.
input_config = example_gen_pb2.Input(splits=[
example_gen_pb2.Input.Split(name="train", pattern="query_features_with_timestamp.csv"),
])
example_gen = CsvExampleGen(input_base=data_path, input_config=input_config)
It automatically:
- Split the dataset.
- Converted it into TFRecords under:
  tfx_pipeline/artifacts/CsvExampleGen/examples/1/Split-train/
  tfx_pipeline/artifacts/CsvExampleGen/examples/1/Split-eval/
It computed descriptive statistics of the dataset:
statistics_gen = StatisticsGen(examples=example_gen.outputs['examples'])
Generated:
- FeatureStats.pb containing complete statistics of all columns.
- Stored per split under: artifacts/StatisticsGen/statistics/{run_id}/Split-train/FeatureStats.pb
Automatically inferred the data schema:
schema_gen = SchemaGen(statistics=statistics_gen.outputs['statistics'])
Generated:
- schema.pbtxt defining data types, expected values, and feature constraints.
- Saved under: artifacts/SchemaGen/schema/{run_id}/schema.pbtxt
Validated dataset against the inferred schema:
example_validator = ExampleValidator(
statistics=statistics_gen.outputs['statistics'],
schema=schema_gen.outputs['schema']
)
Generated:
- SchemaDiff.pb showing anomalies and schema drifts.
- Stored under: artifacts/ExampleValidator/anomalies/{run_id}/Split-train/SchemaDiff.pb
I used a custom preprocessing module to:
- Remove unwanted features.
- Scale and normalize input features.
- Keep only f0 to f767 (the 768 CLIP text embedding dimensions).
Example code:
def preprocessing_fn(inputs):
    return {
        f"f{i}": inputs[f"f{i}"] for i in range(768)
    }
TFX Transform component:
transform = Transform(
    examples=example_gen.outputs["examples"],
    schema=schema_gen.outputs["schema"],
    module_file=module_file,
)
Generated:
- transformed_examples/ (TFRecords ready for modeling)
- transform_graph/ containing transformation logic for serving
We executed the pipeline with:
python tfx_pipeline/scripts/run_ingest_query_pipeline.py
It completed successfully and populated the following:
- All component artifacts
- Automatically versioned runs (run_id folders)
- Split-aware artifacts (train/eval)
Example Final Pipeline Structure (auto-generated by TFX):
artifacts/
  CsvExampleGen/
  StatisticsGen/
  SchemaGen/
  ExampleValidator/
  Transform/
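The run script referenced above presumably wires these components into a pipeline object and executes it with a local orchestrator; a minimal sketch (pipeline name and paths are illustrative) follows:
from tfx.orchestration import metadata, pipeline
from tfx.orchestration.local.local_dag_runner import LocalDagRunner

def build_pipeline(pipeline_root: str, metadata_path: str) -> pipeline.Pipeline:
    # example_gen, statistics_gen, schema_gen, example_validator and transform
    # are the components defined in the previous subsections.
    return pipeline.Pipeline(
        pipeline_name="ingest_query_pipeline",
        pipeline_root=pipeline_root,
        components=[example_gen, statistics_gen, schema_gen, example_validator, transform],
        metadata_connection_config=metadata.sqlite_metadata_connection_config(metadata_path),
        enable_cache=True,
    )

if __name__ == "__main__":
    LocalDagRunner().run(
        build_pipeline("tfx_pipeline/artifacts", "tfx_pipeline/metadata.sqlite")
    )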
Data Visualization Notebook
After running the ingestion and preprocessing pipeline, we prepared a dedicated Jupyter Notebook to:
- Explore TFX-generated artifacts.
- Visualize statistics, schema, and anomalies.
- Confirm data correctness before training.
Notebook objectives:
- Load pipeline artifacts automatically.
- Show summary statistics.
- Visualize feature distributions.
- Display the schema.
- Report anomalies detected by ExampleValidator.
- Prepare the pipeline for potential monitoring (TensorBoard or other tools).
As shown in the notebook, every component's artifacts are generated under:
artifacts/
  CsvExampleGen/examples/{run_id}/Split-*/
  StatisticsGen/statistics/{run_id}/Split-*/FeatureStats.pb
  SchemaGen/schema/{run_id}/schema.pbtxt
  ExampleValidator/anomalies/{run_id}/Split-*/SchemaDiff.pb
  Transform/...
This allowed us to dynamically load latest runs using helper functions like:
import os

def get_latest_subdir(directory):
    subdirs = [os.path.join(directory, o) for o in os.listdir(directory)
               if os.path.isdir(os.path.join(directory, o))]
    if not subdirs:
        raise FileNotFoundError(f"No subdirectories found in {directory}")
    return max(subdirs, key=os.path.getmtime)
I used tensorflow_data_validation (tfdv) to load and visualize statistics:
train_stats = tfdv.load_statistics(os.path.join(stats_path, "Split-train", "FeatureStats.pb"))
tfdv.visualize_statistics(train_stats)
Output:
- Rich interactive charts
- Feature distributions
- Missing value analysis
- Data type distributions
We also displayed the schema:
schema = tfdv.load_schema_text(os.path.join(schema_path, "schema.pbtxt"))
tfdv.display_schema(schema)
Schema gave us:
- Inferred types
- Domains (min/max)
- Presence constraints
ExampleValidator anomalies were visualized as:
anomalies = tfdv.load_anomalies_text(os.path.join(anomalies_path, "SchemaDiff.pb"))
tfdv.display_anomalies(anomalies)
Result:
- Quickly spotted missing values, type mismatches, or distribution drifts
This milestone focuses on the training and evaluation phase of the MultimodalSearchML pipeline using real-world relevance labels from the Amazon ESCI dataset. It builds on Milestone 3, which handled data ingestion and preprocessing, and advances the pipeline with a robust training setup using TensorFlow Extended (TFX), MLflow, and CodeCarbon.
I used multi-class labels (Exact, Substitute, Complement, Irrelevant) from the ESCI dataset. This allowed for meaningful model training, accurate performance evaluation, and more realistic experimentation.
This milestone integrates several tools from the MLOps ecosystem to manage the training, tracking, and monitoring of the model lifecycle.
Tools | Role in the Project |
---|---|
TFX (TensorFlow Extended) | Powers the end-to-end training pipeline and handles modular orchestration |
MLflow | Tracks experiments, logs training metrics, and stores model versions |
CodeCarbon | Logs energy usage and carbon emissions of model training |
Git & GitHub | Maintains version control and collaborative development |
DVC | Tracks and manages dataset versions (e.g., ESCI data) |
Python 3.9 | Used to implement preprocessing, training logic, and orchestration code |
This project was initially scaffolded using a modified version of the Cookiecutter Data Science structure, then adapted to fit milestone-based development.
The core ideas from Cookiecutter, such as the separation of data/, notebooks/, scripts/, and models/, are reflected across the milestones.
The dataset used in this milestone is the Amazon ESCI dataset, which provides real multi-class relevance labels for query-product pairs. The ESCI labels include:
- E: Exact match
- S: Substitute
- C: Complement
- I: Irrelevant
This allowed us to simulate a realistic product search environment, using actual labels rather than synthetic ones.
We combined this dataset with the query CLIP embeddings from the Shopping Queries Image Dataset (SQID) used in Milestone 3. The following files were used:
- query_features_with_timestamp.csv (from Milestone 3): CLIP text embeddings for shopping queries
- shopping_queries_dataset_examples.parquet: query-product pairs with real esci_label values
- The two datasets were merged on the common query_id field.
- The esci_label column was mapped to integer labels as follows:
  label_map = {'I': 0, 'C': 1, 'S': 2, 'E': 3}
- The resulting file (merged_queries.csv) was saved in the data/processed/ directory and used as input to the pipeline.
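A minimal sketch of this merge and label-mapping step (file locations follow the description above; exact column handling may differ):
import pandas as pd

label_map = {'I': 0, 'C': 1, 'S': 2, 'E': 3}

queries = pd.read_csv("data/query/query_features_with_timestamp.csv")
examples = pd.read_parquet("shopping_queries_dataset_examples.parquet")

# Join CLIP query embeddings with the ESCI-labelled query-product pairs.
merged = examples.merge(queries, on="query_id", how="inner")
merged["label"] = merged["esci_label"].map(label_map)

merged.to_csv("data/processed/merged_queries.csv", index=False)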
The Transform component in TFX was updated to:
- Pass all 768 features (f0 to f767) unchanged
- Accept the real label column directly without synthetic generation
- Ensure compatibility with sparse_categorical_crossentropy for multi-class classification
This marks a significant upgrade over the earlier synthetic-label strategy, making training outcomes more interpretable and relevant to the underlying task.
We designed a multi-class classification model using TensorFlow Keras to predict the ESCI relevance label (E, S, C, I) based on the 768-dimensional CLIP text embeddings of shopping queries.
- Input: 768-dimensional CLIP embedding vector (f0 to f767)
- Architecture:
  - Dense layer with 256 units, ReLU activation
  - Dropout layer (0.3)
  - Dense layer with 128 units, ReLU activation
  - Output layer: Dense layer with 4 units, softmax activation
- Loss function: sparse_categorical_crossentropy (used with integer labels 0–3)
- Optimizer: adam
- Metrics: accuracy
- Batch size: 32
- Epochs: 50 (early stopping applied with patience = 5)
- Steps per epoch: 200 (controlled via TrainArgs)
- Validation: provided via eval_dataset in the TFX pipeline
- Checkpoints: saved after every epoch via the ModelCheckpoint callback
This architecture balances simplicity and scalability, offering a strong baseline while keeping the model TFX-friendly and resource-efficient.
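A standalone Keras sketch of this architecture (outside the TFX run_fn wrapper) looks like this:
import tensorflow as tf

def build_model(input_dim: int = 768, num_classes: int = 4) -> tf.keras.Model:
    model = tf.keras.Sequential([
        tf.keras.layers.Input(shape=(input_dim,)),
        tf.keras.layers.Dense(256, activation="relu"),
        tf.keras.layers.Dropout(0.3),
        tf.keras.layers.Dense(128, activation="relu"),
        tf.keras.layers.Dense(num_classes, activation="softmax"),
    ])
    model.compile(
        optimizer="adam",
        loss="sparse_categorical_crossentropy",
        metrics=["accuracy"],
    )
    return model

model = build_model()
model.summary()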
MLflow was integrated into the Trainer component of the TFX pipeline to enable end-to-end experiment tracking for model development. The tracking setup captures all key aspects of the training run.
- Training and validation metrics: accuracy, val_accuracy, loss, val_loss
- Hyperparameters: batch_size, epochs, input_dim, optimizer
- Model versions: saved under individual MLflow run IDs for comparison
Both manual logging (mlflow.log_param) and autologging (mlflow.tensorflow.autolog()) were used to ensure full visibility of model behavior.
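Inside the Trainer's run_fn, the tracking setup looks roughly like the sketch below (model, train_dataset, and eval_dataset come from the surrounding run_fn; parameter values follow the configuration above):
import mlflow
import mlflow.tensorflow
import tensorflow as tf

mlflow.set_experiment("milestone4_training")
mlflow.tensorflow.autolog()  # captures per-epoch metrics and the saved model

with mlflow.start_run():
    mlflow.log_param("batch_size", 32)
    mlflow.log_param("epochs", 50)
    mlflow.log_param("input_dim", 768)
    mlflow.log_param("optimizer", "adam")
    # model, train_dataset and eval_dataset are built earlier in run_fn.
    model.fit(train_dataset, validation_data=eval_dataset,
              epochs=50, steps_per_epoch=200,
              callbacks=[tf.keras.callbacks.EarlyStopping(patience=5)])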
- Experiment name: milestone4_training
- Tracking UI hosted locally via: mlflow ui --host 0.0.0.0
- Tracking Source: Model training was launched via LocalDagRunner inside the TFX pipeline.
Metric | Value |
---|---|
Accuracy | 0.6756 |
Val Accuracy | 0.7652 |
Loss | 0.6471 |
Val Loss | 0.5114 |
These metrics reflect training with real ESCI labels and indicate the model's ability to generalize relevance decisions across multiple product-query pairs.
To measure the environmental impact of the training process, the CodeCarbon library was integrated into the run_fn() method of the TFX Trainer.
The EmissionsTracker was configured to output logs into the pipeline's artifact directory for each training run. This included detailed metrics such as energy consumed, CO₂ emissions, CPU power usage, and location-based carbon impact estimation.
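A sketch of how the tracker can wrap the training call inside run_fn (the output directory and training objects are assumptions based on the description above):
from codecarbon import EmissionsTracker

tracker = EmissionsTracker(
    output_dir=fn_args.model_run_dir,   # emissions.csv lands next to the run artifacts
    tracking_mode="machine",
)
tracker.start()
try:
    # model, train_dataset and eval_dataset come from the surrounding run_fn.
    model.fit(train_dataset, validation_data=eval_dataset, epochs=50)
finally:
    tracker.stop()  # writes emissions.csv with energy and CO2 estimates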
Metric | Value |
---|---|
Duration | ~6 seconds |
Emissions | 0.000056 kg CO₂ |
CPU Power Used | 42.5 W |
RAM Power Used | 11.4 W |
CPU Energy Consumed | 0.00007 kWh |
Region | Casablanca, Morocco |
CPU Model | AMD Ryzen 7 7700 |
Tracking Mode | machine (local hardware) |
This demonstrates a growing awareness of sustainable ML practices and carbon-aware experimentation.
The CodeCarbon output can be found at the following path:
./tfx_pipeline/artifacts/training_pipeline/Trainer/model_run/6/emissions.csv
The goal of this milestone is to productionize the MultimodalSearchML system built in earlier milestones. The system allows users to search for relevant products based on natural language queries using CLIP-based multimodal embeddings. Rather than training a new model, the system leverages a pretrained CLIP model (ViT-B/32) to encode both product information and user queries, and serves predictions via a FastAPI backend and a React-based frontend.
The entire system is modular, automated, and integrates tools like TFX, Feast, and DVC to handle data processing, feature storage, and versioning.
The following architecture diagram shows the full machine learning system, including all tools and services involved in the pipeline and serving infrastructure:
- CSV / Parquet Files: Raw query and product data in tabular format. These files contain embeddings, metadata, and product image links.
- DVC (Data Version Control): Tracks versions of input data and metadata. Connected to Google Drive for remote storage.
- Google Drive: Remote DVC storage used to store raw data files for reproducibility and sharing.
- Redis: Serves as the online feature store for fast retrieval of product embeddings.
- TFX (TensorFlow Extended): Used for building the end-to-end ML pipeline: ingestion, validation, transformation, trainer, and pusher. It outputs the final model to be deployed.
- MLflow: Handles model tracking, experiment logging, and metric visualization. Integrated inside the TFX trainer.
- GitHub: Hosts the source code and CI/CD pipeline.
- GitHub Actions: Automates the deployment of the backend and frontend to Hugging Face and cPanel, respectively.
- FastAPI: Lightweight Python API framework used to serve predictions and expose endpoints.
- CLIP (ViT-B/32): Pretrained model used for encoding queries and product metadata into embeddings.
- Gemini LLM: Used for query rewriting to enhance semantic understanding before matching.
- Docker: Containers are used to package the FastAPI service for deployment.
- Hugging Face Spaces: Hosts the Dockerized FastAPI backend.
- ReactJS: Frontend application that users interact with.
- cPanel: Used to host the React frontend publicly.
Each arrow in the diagram illustrates data or control flow — from dataset versioning to real-time inference served on Hugging Face and visualized in the React app.
In this system, we implemented two serving modes:
🔹 1. On-Demand Serving to Humans (via UI)
A React.js frontend lets users submit:
- Text queries
- Image uploads
These are sent to the backend for inference, and matching products are returned instantly.
🔹 2. On-Demand Serving to Machines (via API)
A FastAPI backend exposes two main endpoints:
- POST /predict for text queries
- POST /predict-image for image-based search
The backend is accessible as a public API hosted on Hugging Face Spaces and can be queried programmatically using tools like curl or axios.
🔧 Deployment Setup
- The FastAPI backend is containerized using Docker.
- The API is deployed on Hugging Face via a Dockerfile and .huggingface.yml.
- The frontend is deployed separately on a cloud server (cPanel).
The model service was designed to serve on-demand, low-latency predictions based on multimodal embeddings from the CLIP model. Instead of training a new model, we use the pretrained ViT-B/32 CLIP model to encode queries and products. This design ensures a consistent, production-ready service for image–text matching tasks.
- Framework: The FastAPI framework was chosen for its speed, simplicity, and OpenAPI integration.
- Endpoints (a minimal sketch follows this list):
  - POST /predict: Accepts a query string and returns the top-k matching products using cosine similarity over embeddings.
  - POST /predict-image: Accepts an image upload, embeds it via CLIP, and returns ranked product matches.
- LLM Rewriter: The backend optionally integrates Gemini LLM to rewrite natural language queries before encoding to improve semantic matching.
- Redis: Used for temporary embedding caching and fast retrieval operations.
- DVC + Google Drive: Data access and versioning are managed through DVC, with remote storage pointing to Google Drive.
- Model: ViT-B/32, loaded via the clip Python package.
- Usage:
  - Both product descriptions and queries/images are encoded using CLIP.
  - Embeddings are precomputed and compared using cosine similarity.
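The endpoint sketch referenced above, with a simplified in-memory embedding store (paths, variable names, and the assumption that product embeddings are precomputed with the same model and L2-normalized are illustrative):
import io

import clip
import numpy as np
import torch
from fastapi import FastAPI, File, UploadFile
from PIL import Image
from pydantic import BaseModel

app = FastAPI()
device = "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

# Hypothetical precomputed, L2-normalized product embeddings loaded at startup.
product_embs = np.load("product_embs.npy")
product_titles = np.load("product_titles.npy", allow_pickle=True)

def rank_top_k(query_emb: np.ndarray, k: int = 10):
    q = query_emb / np.linalg.norm(query_emb)
    scores = product_embs @ q
    idx = np.argsort(-scores)[:k]
    return [{"title": str(product_titles[i]), "score": float(scores[i])} for i in idx]

class Query(BaseModel):
    text: str
    top_k: int = 10

@app.post("/predict")
def predict(query: Query):
    with torch.no_grad():
        emb = model.encode_text(clip.tokenize([query.text]).to(device)).cpu().numpy()[0]
    return {"results": rank_top_k(emb, query.top_k)}

@app.post("/predict-image")
async def predict_image(file: UploadFile = File(...)):
    image = Image.open(io.BytesIO(await file.read())).convert("RGB")
    with torch.no_grad():
        emb = model.encode_image(preprocess(image).unsqueeze(0).to(device)).cpu().numpy()[0]
    return {"results": rank_top_k(emb)}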
- The full backend is containerized using Docker and deployed on Hugging Face Spaces.
- The Docker image includes:
  - All Python dependencies via requirements.lock
  - The FastAPI app in app/main.py
  - CLIP model loading and pre/post-processing utils
- The container is automatically built and redeployed via GitHub Actions CI/CD on every push.
The deployed MultimodalSearchML system leverages cloud infrastructure to serve predictions in real-time:
- Hosting Platform: Hugging Face Spaces
- Serving Framework: FastAPI with automatic OpenAPI docs
- Containerization: Docker is used to isolate and run the backend service
- The service is launched using a custom Dockerfile and requirements.txt.
- The runtime environment pulls the latest code and models from GitHub on each deploy.
- On every new push to main, GitHub Actions triggers an automatic container rebuild and deployment.
- Low-latency serving with Redis caching for embeddings
- On-the-fly embedding using CLIP for both image and text
- LLM-based query rewriting to improve semantic accuracy before matching
This setup enables both human (React UI) and machine (public API) clients to interact with the model in a scalable and production-ready runtime environment.
The front-end application provides an interactive user interface that allows users to perform multimodal search through both text and image inputs. It was built using ReactJS with modern best practices in component-based architecture and state management.
- Text-based Search: Users can input natural language queries describing the product they are looking for. The input is processed and sent to the FastAPI backend for similarity-based retrieval using CLIP embeddings.
- Image-based Search: Users can upload an image to find visually similar products. The image is embedded using the CLIP model and matched against product embeddings.
- Result Display: Retrieved products are displayed as responsive cards containing:
  - Product title
  - Thumbnail image
  - Relevance score (cosine similarity)
  - "More Info" button for a detailed view
- Search Customization Controls: The UI includes toggles for setting:
  - top_k (number of results)
  - threshold (minimum similarity score)
  - Whether to allow text/image input independently
- Navigation and Routing: Implemented using react-router-dom, allowing clean navigation between the search results page and the product detail page.
Technology | Purpose |
---|---|
ReactJS | Front-end framework |
Axios | API requests to FastAPI |
React Router DOM | Client-side routing |
Tailwind CSS | Styling and responsive layout |
React Icons | Icons for UI feedback |
State Hooks | Search input and result state management |
The frontend is hosted via cPanel, deployed manually through file upload and auto-refresh. The final UI is fully responsive and publicly accessible Here
The backend is hosted via Hugging Face spaces. It is accessible Here
- Screenshot Example
To ensure portability, reproducibility, and isolation of the backend inference service, the FastAPI server was packaged into a Docker container.
- Dockerfile
A custom Dockerfile was written to:
- Set up a minimal Python environment using an official python:3.10-slim base image
- Install all dependencies from a pinned requirements.txt
- Copy the application codebase into the container
- Set the working directory and expose the serving port
- Launch the FastAPI server using uvicorn
Key Dockerfile commands:
FROM python:3.10-slim
COPY requirements.txt .
RUN pip install -r requirements.txt
COPY app /app
WORKDIR /app
CMD ["uvicorn", "main:app", "--host", "0.0.0.0", "--port", "7860"]
- .dockerignore
Unnecessary files such as local datasets, .git, and cache directories are excluded from the container image build context.
- Testing Locally
The Docker image was tested using:
docker build -t multimodal-backend .
docker run -p 7860:7860 multimodal-backend
The API was successfully served on http://localhost:7860.
- Ensures a consistent runtime environment across development, testing, and deployment
- Makes the backend easily deployable to cloud platforms like Hugging Face or DigitalOcean
- Reduces environment conflicts and simplifies dependency management
The MultimodalSearchML project supports continuous integration and deployment (CI/CD) to streamline the release of updates to both the backend and frontend services.
- Platform: GitHub Actions
- Target: Hugging Face Spaces
Every push to the main branch automatically triggers the backend deployment pipeline via a .github/workflows/deploy.yml file.
Key Steps in CI/CD Workflow:
- Checkout repository
- Set up Python environment
- Install dependencies from requirements.txt
- Build and push Docker container
- Trigger Hugging Face deployment using .huggingface.yml
Hugging Face Integration:
# .huggingface.yml
title: MultimodalSearchML
sdk: docker
emoji: 🔍
This automation ensures that any committed changes to the main branch result in a refreshed, containerized FastAPI app being redeployed live on Hugging Face Spaces.
The React frontend was deployed manually via cPanel:
- The build output from npm run build was uploaded to the public_html folder on the hosting server.
- The interface is now publicly available via your domain (insert link if available).
- Any frontend update is handled by rebuilding locally and replacing the previous build on the server.
- Environment secrets should be defined in GitHub repository settings under Settings > Secrets and variables.
- Hugging Face deployment uses a personal access token (HF_TOKEN) securely stored in the repo secrets.
This milestone focuses on evaluating, auditing, and analyzing the deployed MultimodalSearchML system beyond accuracy metrics. It covers unseen data evaluation, online A/B testing, bias auditing, robustness testing, and model explainability.
Evaluate the system using unseen product data split (80/20 train/test split).
- Used unseen data split from our dataset.
- Performed similarity search using ViT-B/32.
- Calculated:
- Top-5 Accuracy
- Mean Reciprocal Rank (MRR)
- Top-5 Accuracy: 14.77%
- Mean Reciprocal Rank (MRR): 14.56%
Model shows moderate accuracy on unseen data, with room for improvement.
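For reference, these two metrics can be computed from ranked results as in the sketch below (the results structure, a list of ranked product ids paired with the relevant id per query, is a simplifying assumption):
def top_k_accuracy(results, k=5):
    """results: list of (ranked_ids, relevant_id) pairs, one per query."""
    hits = sum(1 for ranked_ids, relevant_id in results if relevant_id in ranked_ids[:k])
    return hits / len(results)

def mean_reciprocal_rank(results):
    total = 0.0
    for ranked_ids, relevant_id in results:
        if relevant_id in ranked_ids:
            total += 1.0 / (ranked_ids.index(relevant_id) + 1)
    return total / len(results)

# Example with two queries (hypothetical product ids):
results = [(["p3", "p7", "p1"], "p7"), (["p2", "p9", "p4"], "p5")]
print(top_k_accuracy(results, k=5), mean_reciprocal_rank(results))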
Evaluate two models online using A/B Testing framework integrated in the API:
- Model A: ViT-B/32
- Model B: ViT-L/14
- Created a FastAPI endpoint /search_best that dynamically selects the best model per query using z-score normalized top-k similarity (sketched below).
- Integrated the endpoint with the deployed React frontend.
- Users can see which model was used per query.
The system can dynamically select the best model per user query, based on online similarity evaluation.
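The /search_best selection logic can be sketched as comparing z-score normalized top-k similarities from the two models (a simplified illustration, not the deployed code):
import numpy as np

def zscore_topk(scores: np.ndarray, k: int = 10) -> float:
    """Mean of the top-k similarities after z-score normalizing the full score vector."""
    z = (scores - scores.mean()) / (scores.std() + 1e-8)
    return float(np.sort(z)[::-1][:k].mean())

def select_best_model(scores_a: np.ndarray, scores_b: np.ndarray, k: int = 10) -> str:
    # Model A: ViT-B/32, Model B: ViT-L/14 (per the setup above).
    return "ViT-B/32" if zscore_topk(scores_a, k) >= zscore_topk(scores_b, k) else "ViT-L/14"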
- Used Aequitas toolkit.
- Sent predefined demographic-sensitive queries (e.g., "shoes for women", "ramadan gifts", "toys for kids").
- Created proxy positives (label_value=1) and added dummy negatives (label_value=0).
- Analyzed representation and coverage per demographic.
- See: bias_audit_aequitas_report.csv
- Observations:
- Underrepresentation observed for certain demographics.
- Junk queries still returned products.
- Need for input filtering and demographic-aware training.
bias_audit_aequitas.py
- Tested the system under perturbations:
- Typos, slang, spacing.
- Gibberish.
- Irrelevant queries.
- Multilingual input.
- See: robustness_test_results.csv
- Observations:
- Model is robust to typos, spacing, and verbose queries.
- Fails to reject gibberish or junk inputs.
- Acceptable multilingual handling.
robustness_test.py
- Used TSNE projection of query and product title embeddings (text side only).
- Created cosine similarity heatmaps.
- Used CLIP ViT-L/14 text encoder.
Figures (images omitted here):
- Similarity heatmap between queries and products
- TSNE projection of queries and products in the embedding space
- Clear clustering between similar queries and products.
- Gibberish queries placed isolated, validating model behavior.
- The visualization supports model explainability for semantic search.
explainability_tsne_umap.py
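A rough sketch of this kind of TSNE projection (not the project's explainability_tsne_umap.py script; file paths are illustrative):
import matplotlib.pyplot as plt
import numpy as np
from sklearn.manifold import TSNE

# query_embs and product_embs are CLIP text embeddings (e.g., from the ViT-L/14 text encoder).
query_embs = np.load("query_embs.npy")      # illustrative paths
product_embs = np.load("product_embs.npy")

all_embs = np.vstack([query_embs, product_embs])
proj = TSNE(n_components=2, perplexity=30, init="pca").fit_transform(all_embs)

n_q = len(query_embs)
plt.scatter(proj[:n_q, 0], proj[:n_q, 1], label="queries", s=10)
plt.scatter(proj[n_q:, 0], proj[n_q:, 1], label="products", s=10)
plt.legend()
plt.title("TSNE projection of query and product embeddings")
plt.savefig("tsne_projection.png")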
- Integrated MLflow tracking inside the A/B testing API for query logging and model decision recording.
- Logged metrics:
- Query.
- Selected route.
- Latency.
- Top returned products.
- Run using: mlflow ui
- Access at: http://localhost:5000
- Used MLflow as the main tool to monitor:
  - Query lengths.
  - Model selection (route).
  - Latency per request.
  - Top-k results returned.
- Tracked every user query through the API, directly integrated with MLflow.
- Observed variations in latency depending on the model selected (ViT-L/14 being slower).
- Query length monitoring showed input variations between short and verbose queries.
Simulate a Continual Learning Trigger and Delivery pipeline based on the logged queries.
- Collected all queries into a structured CSV log (query_logs_for_retraining.csv).
- Implemented an automatic retraining trigger (continual_trigger_retrain.py, sketched below) that:
  - Checks whether the number of logged queries exceeds a threshold of 100 queries.
  - Simulates the retraining process by generating a marker file with a timestamp (simulated_retrained_model_YYYYMMDD_HHMMSS.txt).
  - Clears the query log file for the next cycle.
- Successfully simulated automatic retraining triggering and model version updates.
- Query logs are cleared after every simulated retraining to maintain a clean data flow.
- Script: continual_trigger_retrain.py
- Queries logged in: query_logs_for_retraining.csv
- Simulated retrained model files saved in: simulated_models/
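A minimal sketch of such a trigger script (file names follow the description above; the real continual_trigger_retrain.py may differ):
import os
from datetime import datetime

import pandas as pd

LOG_FILE = "query_logs_for_retraining.csv"
MODEL_DIR = "simulated_models"
THRESHOLD = 100

def maybe_trigger_retraining():
    if not os.path.exists(LOG_FILE):
        return
    logged = pd.read_csv(LOG_FILE)
    if len(logged) < THRESHOLD:
        return
    # Simulate retraining by writing a timestamped marker file.
    os.makedirs(MODEL_DIR, exist_ok=True)
    stamp = datetime.now().strftime("%Y%m%d_%H%M%S")
    marker = os.path.join(MODEL_DIR, f"simulated_retrained_model_{stamp}.txt")
    with open(marker, "w") as f:
        f.write(f"retrained on {len(logged)} logged queries\n")
    # Clear the log for the next cycle, keeping only the header row.
    logged.iloc[0:0].to_csv(LOG_FILE, index=False)

if __name__ == "__main__":
    maybe_trigger_retraining()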
- The MultimodalSearchML system was extended to support advanced testing, evaluation, monitoring, and lifecycle management.
- Successfully fulfilled all Milestone 6 requirements, including:
  - Unseen data evaluation.
  - Online A/B testing.
  - Bias auditing.
  - Robustness testing.
  - Explainability and interpretability.
  - Query and model monitoring with MLflow.
  - Simulated CT/CD retraining trigger and model delivery.
- The project is now ready for future enhancements, like integrating real retraining pipelines and advanced drift detection tools.