In [2]:
import numpy as np
import pandas as pd
from pathlib import Path
data_dir = Path("../data").absolute()

In [3]:
df = pd.read_parquet(data_dir / "product_images.parquet")
df.sample(10)

Unnamed: 0,asin,title,primary_image
16091,B09H3NBDYF,"New Balance Men's 515 V3 Sneaker, Castlerock/H...",https://m.media-amazon.com/images/I/412H8-gxnL...
14949,B08P2C8VFW,New Balance Kid's Fresh Foam Arishi V2 Lace-Up...,https://m.media-amazon.com/images/I/41aJSBNIF8...
28000,B08LMS8KWZ,Under Armour womens Tech Spacedye Short Sleeve...,https://m.media-amazon.com/images/I/41lb9FZ9Mt...
23017,B00J6F9UA6,Premium Raybestos Element3 EHT™ Replacement Fr...,https://m.media-amazon.com/images/I/41+5AtenbP...
34151,B09H3NNZWW,New Balance Women's DynaSoft Nergize V3 Cross ...,https://m.media-amazon.com/images/I/41jbB2Zb6+...
59226,B09D77KNSP,Valpweet Womens Military Anorak Jacket Lightwe...,https://m.media-amazon.com/images/I/41CaUVfzgX...
69451,B0922C9DVG,Bala&Fillic Size 2mm 12/0 Seed Beads Jewelry M...,https://m.media-amazon.com/images/I/61nifDIH3x...
30059,B096NFB5R3,New Balance Men's Fresh Foam Roav V2 Running S...,https://m.media-amazon.com/images/I/41n0XG-2CF...
47228,B00HQ2323M,3M DBI-SALA 9505712 Suspension Trauma Safety S...,https://m.media-amazon.com/images/I/41oRXhBQPv...
68809,B08LVPYW3D,[ 3 Pack] UniqueMe Camera Lens Protector Compa...,https://m.media-amazon.com/images/I/51UT8hDO01...


In [6]:
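# Keep only rows whose image URL ends in .jpg or .png, and rename `asin` to `id`.
is_image = df["primary_image"].str.endswith(".jpg") | df["primary_image"].str.endswith(".png")
df = df[is_image].rename(columns={"asin": "id"})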
df.head()


Unnamed: 0,id,title,primary_image
0,B0B7NV73PJ,SanDisk 256GB Ultra microSDXC UHS-I Memory Car...,https://m.media-amazon.com/images/I/41MpKnSAd0...
1,B0B7NVXLLM,SanDisk 512GB Ultra microSDXC UHS-I Memory Car...,https://m.media-amazon.com/images/I/41ikVLl+gp...
2,B0B985Q9F1,SanDisk 256GB microSDXC -Card Licensed for Nin...,https://m.media-amazon.com/images/I/51X0sUzOQ7...
3,B071FVF3T7,Zinus 6 Inch Green Tea Memory Foam Mattress / ...,https://m.media-amazon.com/images/I/51t3X6VyK+...
4,B00Q7EPY04,Zinus 12 Inch Green Tea Memory Foam Mattress /...,https://m.media-amazon.com/images/I/51tBWhJnUF...


In [9]:
# Pick 10 random product ids and keep only those rows.
embedding_ids = list(df["id"].sample(10))
df["title"] = df["title"].fillna("")
df["has_emb"] = df["id"].isin(embedding_ids)
df = df[df["has_emb"]]
embedding_ids


['B08BND9PQL',
 'B093QMRXDB',
 'B0BGBZTPNH',
 'B082DZ56MT',
 'B07X3ZDW6J',
 'B09JNJ18S1',
 'B09DC8MN1Y',
 'B08LNQQMZ2',
 'B08P1ZJKZD',
 'B096NF7VDK']

# Task description
## The Data
The dataframe contains the top 100k best-selling items on Amazon (as of November 2022) and has 3 columns:

1. `asin` - The Amazon identifier.
1. `title` - The product title, as listed on the Amazon store.
1. `primary_image` - The image to be listed in search results.

## Goal
The goal of the task is to be able to search products by both textual similarity and image similarity.

For example, a customer walking down the street could take a picture of a red dress she likes and get similar items from Amazon.

Alternatively, that same customer might open the Amazon website, search for "red dress", and find items that correspond to that query.

## Implementation

### Embedding
We will use [CLIP](https://github.com/openai/CLIP) embeddings for this task.
<img src="https://openaiassets.blob.core.windows.net/$web/clip/draft/20210104b/overview-b.svg" width="400">

CLIP maps images and their text descriptions into the same embedding space, so an image vector and a text vector can be compared directly.

### Similarity search

Once the embedding is done, we need to run a nearest-neighbor search using the `cosine` similarity measure.

The products closest to the query vector should (hopefully) match what the customer is looking for.

The query vector can come from either a `CLIP` image embedding or a `CLIP` text embedding.

We will use the [vecsim](https://github.com/argmaxml/vecsim) module to do the similarity search.
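
To make the search step concrete, here is a minimal brute-force sketch in plain NumPy. It is not the `vecsim` API and is not meant to replace it; the function name `cosine_top_k` and the toy 3-dimensional vectors are assumptions made for illustration, standing in for real CLIP embeddings.

```python
import numpy as np

def cosine_top_k(query, item_vectors, item_ids, k=3):
    """Return the k (id, score) pairs most cosine-similar to `query`.

    Hypothetical helper for illustration only; not part of vecsim.
    """
    items = np.asarray(item_vectors, dtype=np.float32)
    q = np.asarray(query, dtype=np.float32)
    # L2-normalise both sides so a dot product equals cosine similarity.
    items = items / np.linalg.norm(items, axis=1, keepdims=True)
    q = q / np.linalg.norm(q)
    scores = items @ q
    # Sort by descending similarity and keep the first k indices.
    top = np.argsort(-scores)[:k]
    return [(item_ids[i], float(scores[i])) for i in top]

# Toy vectors standing in for CLIP embeddings (assumed data).
toy_ids = ["a", "b", "c"]
toy_vectors = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.9, 0.1, 0.0]]
results = cosine_top_k([1.0, 0.0, 0.0], toy_vectors, toy_ids, k=2)
print(results)
```

Because image and text embeddings share one space, the same function would serve both query types: only the way the query vector is produced differs. A brute-force scan like this is linear in the number of products, which is one reason to use a dedicated similarity-search module for the full 100k catalogue.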

### Serving

We used [Flask](https://flask.palletsprojects.com/en/2.2.x/) to implement the web server; the code is in `server.py`.

**Note**: The server code contains several `TODO:` comments that you will need to implement. The server is currently functional, but it returns random results.

# Submission


1. Please clone this repo to a **private** repo on your GitHub account.
1. Implement the missing parts.
1. Please fill in [this form](https://forms.gle/apMr8zPLbBf9pQY7A).
1. Once done, please schedule an interview with Uri to review the code.

## Submission deadline:
December 21st, 2022

## Good luck!
