<a href="https://colab.research.google.com/gist/LanaMavy/9e975a7279bfef4ec6b0bbe9389984e1/shopee-generate-data-for-triplet-loss.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Generate triplets training data

_[Shopee - Price Match Guarantee](https://www.kaggle.com/c/shopee-product-matching)_

This notebook shows how to generate training triplets for the Shopee product-matching data. Positives are sampled from the anchor's own `label_group`, and negatives are sampled at the group level: a different label group is picked first, then a posting within it. The output is three CSV files, each with `anchor`, `positive`, and `negative` columns, whose values are posting IDs, image file names, or product titles, respectively.

In [2]:
import random

import pandas as pd
from tqdm.auto import tqdm

tqdm.pandas()

In [9]:
from google.colab import files
uploaded = files.upload()


Saving train.csv to train.csv


The helper below produces, for each row, an anchor, a positive sample from the same label group, and a negative sample from a different label group. The outer function takes a dataframe and builds the group lookup once; the inner function it returns should be applied to each row of that same `df`.

In [3]:
def generate_triplets(df):
    # Source: https://www.kaggle.com/xhlulu/shopee-generate-data-for-triplet-loss
    random.seed(42)
    group2df = dict(list(df.groupby('label_group')))

    def aux(row):
        anchor = row.posting_id

        # We sample a positive data point from the same group, but
        # exclude the anchor itself
        ids = group2df[row.label_group].posting_id.tolist()
        ids.remove(row.posting_id)
        positive = random.choice(ids)

        # Now sample a group from all groups except the anchor's own, then
        # sample a product from that group
        groups = list(group2df.keys())
        groups.remove(row.label_group)
        neg_group = random.choice(groups)
        negative = random.choice(group2df[neg_group].posting_id.tolist())

        return anchor, positive, negative

    return aux
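
To see the closure pattern in isolation, here is a minimal sketch that applies the helper to a tiny, made-up dataframe. The `toy` data and its posting IDs are hypothetical, and the helper is restated in condensed form only so that the block runs on its own. Because every toy group has exactly two postings, the positive is always the other member of the anchor's group, while the negative depends on the random draw.

```python
import random

import pandas as pd


def generate_triplets(df):
    # Condensed restatement of the helper above so this block is self-contained.
    random.seed(42)
    group2df = dict(list(df.groupby('label_group')))

    def aux(row):
        # Positive: any posting of the same group except the anchor itself.
        ids = group2df[row.label_group].posting_id.tolist()
        ids.remove(row.posting_id)
        positive = random.choice(ids)

        # Negative: pick another group first, then a posting inside it.
        groups = list(group2df.keys())
        groups.remove(row.label_group)
        neg_group = random.choice(groups)
        negative = random.choice(group2df[neg_group].posting_id.tolist())

        return row.posting_id, positive, negative

    return aux


# Hypothetical toy data: three label groups with two postings each.
toy = pd.DataFrame({
    'posting_id': ['a1', 'a2', 'b1', 'b2', 'c1', 'c2'],
    'label_group': [1, 1, 2, 2, 3, 3],
})

# Plain `apply` is enough here; `progress_apply` only adds a progress bar.
toy_triplets = pd.DataFrame(
    toy.apply(generate_triplets(toy), axis=1).tolist(),
    columns=['anchor', 'positive', 'negative'],
)
print(toy_triplets)
```

Building `group2df` once in the outer function means the per-row work is limited to list copies and random draws, rather than regrouping the dataframe for every row.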

Load the training data and create some useful dictionaries for later:

In [18]:
import pandas as pd
train = pd.read_csv('train.csv')
print(train.head())  # Sanity check: preview the first rows


         posting_id                                 image       image_phash  \
0   train_129225211  0000a68812bc7e98c42888dfb1c07da0.jpg  94974f937d4c2433   
1  train_3386243561  00039780dfc94d01db8676fe789ecd05.jpg  af3f9460c2838f0f   
2  train_2288590299  000a190fdd715a2a36faed16e2c65df7.jpg  b94cb00ed3e50f78   
3  train_2406599165  00117e4fc239b1b641ff08340b429633.jpg  8514fc58eafea283   
4  train_3369186413  00136d1cf4edede0203f32f05f660588.jpg  a6f319f924ad708c   

                                               title  label_group  
0                          Paper Bag Victoria Secret    249114794  
1  Double Tape 3M VHB 12 mm x 4,5 m ORIGINAL / DO...   2937985045  
2        Maling TTS Canned Pork Luncheon Meat 397 gr   2395904891  
3  Daster Batik Lengan pendek - Motif Acak / Camp...   4093212188  
4                  Nescafe \xc3\x89clair Latte 220ml   3648931069  


In [20]:
train = pd.read_csv('train.csv')

# Useful dictionaries; use below to convert if needed
id_to_img = train.set_index('posting_id').image.to_dict()
id_to_title = train.set_index('posting_id').title.to_dict()
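
The helper removes the anchor from its own group before sampling a positive, so a label group containing a single posting would leave an empty list and `random.choice` would raise an `IndexError`. The sketch below shows one way to check for that before generating triplets. `singleton_groups` is a hypothetical helper name, and the toy dataframe stands in for `train`.

```python
import pandas as pd


def singleton_groups(df):
    """Return the label groups that contain exactly one posting."""
    sizes = df.groupby('label_group')['posting_id'].size()
    return sizes[sizes == 1].index.tolist()


# Hypothetical toy data: group 2 has only one posting.
toy = pd.DataFrame({
    'posting_id': ['a1', 'a2', 'b1'],
    'label_group': [1, 1, 2],
})

print(singleton_groups(toy))  # [2]
```

On the real data this would be called as `singleton_groups(train)`. If it returns a non-empty list, those rows need to be dropped or handled separately before calling the helper.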

Here, we use the `generate_triplets` helper function defined above and create a new dataframe from it:

In [24]:
train_triplets = train.progress_apply(generate_triplets(train), axis=1).tolist()
train_triplets_df = pd.DataFrame(train_triplets, columns=['anchor', 'positive', 'negative'])
train_triplets_df.head()


| | anchor | positive | negative |
|---|---|---|---|
| 0 | train_129225211 | train_2278313361 | train_924467621 |
| 1 | train_3386243561 | train_3423213080 | train_898433678 |
| 2 | train_2288590299 | train_3803689425 | train_3197900532 |
| 3 | train_2406599165 | train_3342059966 | train_2986920724 |
| 4 | train_3369186413 | train_921438619 | train_3549128200 |
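
Before using the triplets for training, it can be worth confirming that they respect the sampling rules described earlier: the positive shares the anchor's label group but is a different posting, and the negative comes from another group. The following is a minimal sketch of such a check. `check_triplets` is a hypothetical helper name, and the toy frames stand in for `train` and `train_triplets_df`.

```python
import pandas as pd


def check_triplets(triplets_df, df):
    """Return True if every triplet respects the label-group constraints."""
    id_to_group = df.set_index('posting_id')['label_group']
    # Replace every posting ID by its label group, column by column.
    groups = triplets_df.apply(lambda col: col.map(id_to_group))
    same_group = (groups['anchor'] == groups['positive']).all()
    other_group = (groups['anchor'] != groups['negative']).all()
    distinct = (triplets_df['anchor'] != triplets_df['positive']).all()
    return bool(same_group and other_group and distinct)


# Hypothetical toy data standing in for `train`.
toy = pd.DataFrame({
    'posting_id': ['a1', 'a2', 'b1', 'b2'],
    'label_group': [1, 1, 2, 2],
})
columns = ['anchor', 'positive', 'negative']
good = pd.DataFrame([('a1', 'a2', 'b1'), ('b2', 'b1', 'a1')], columns=columns)
bad = pd.DataFrame([('a1', 'b1', 'a2')], columns=columns)  # positive and negative swapped

print(check_triplets(good, toy))  # True
print(check_triplets(bad, toy))   # False
```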


From the `train_triplets_df` you can create a triplet dataframe of titles:

In [21]:
# Column-wise Series.map avoids DataFrame.applymap, which is deprecated since pandas 2.1
train_triplets_titles = train_triplets_df.apply(lambda col: col.map(id_to_title))
train_triplets_titles.head()


| | anchor | positive | negative |
|---|---|---|---|
| 0 | Paper Bag Victoria Secret | PAPER BAG VICTORIA SECRET | [GOSEND] Amplop Coklat Ukuran Folio WPS ( 23,5... |
| 1 | Double Tape 3M VHB 12 mm x 4,5 m ORIGINAL / DO... | Double Tape VHB 3M ORIGINAL 12mm x 4.5mm Busa ... | Kikkoman Bulgogi Sauce - jerigen 2.2 Kg |
| 2 | Maling TTS Canned Pork Luncheon Meat 397 gr | Maling Ham Pork Luncheon Meat TTS 397gr | KAOS ANAK BARONG KAOS BALI PIYAMA ANAK KAOS BA... |
| 3 | Daster Batik Lengan pendek - Motif Acak / Camp... | DASTER PIYAMA KATUN JEPANG(TIDAK BISA PILIH MO... | Hot Drinks Milk Coffee Frother Foamer Whisk |
| 4 | Nescafe \xc3\x89clair Latte 220ml | Nescafe Eclair Latte Pet 220 Ml | Nuvo Hand Gel @50ml/Antis/Dettol |


The same works for images:

In [22]:
# Column-wise Series.map avoids DataFrame.applymap, which is deprecated since pandas 2.1
train_triplets_imgs = train_triplets_df.apply(lambda col: col.map(id_to_img))
train_triplets_imgs.head()


| | anchor | positive | negative |
|---|---|---|---|
| 0 | 0000a68812bc7e98c42888dfb1c07da0.jpg | f83b49a86a0ee8592e3bf0204da3fbdf.jpg | deb80330b925161fc59d78a13fcee3bd.jpg |
| 1 | 00039780dfc94d01db8676fe789ecd05.jpg | 8cbe4bf9706bc177fd61071ef776be8c.jpg | 28cbb067e780fb784714f7e86a14dd40.jpg |
| 2 | 000a190fdd715a2a36faed16e2c65df7.jpg | 75dbd1e9f31f2d0f21d31c08b3e0b94e.jpg | 2c6803502ce2b962a7b3591bf8fa652e.jpg |
| 3 | 00117e4fc239b1b641ff08340b429633.jpg | 52f5b2e6f6647325817eb99db17709f0.jpg | 9b0cdfb0f02ddc1b3b755f87587e91fe.jpg |
| 4 | 00136d1cf4edede0203f32f05f660588.jpg | 8b0e60baf319fa282242ab1739df10e0.jpg | 2c42eeb2a0c0df8ff3082c17e9b5e687.jpg |



Finally, save all three dataframes so the output of this notebook is easy to reuse. Alternatively, copy the helper function into your own code and call it as shown above.

In [23]:
train_triplets_imgs.to_csv('train_triplets_imgs.csv', index=False)
train_triplets_titles.to_csv('train_triplets_titles.csv', index=False)
train_triplets_df.to_csv('train_triplets_ids.csv', index=False)
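
Writing with `index=False` keeps the row index out of the file, so reading it back gives exactly the three triplet columns. The sketch below illustrates the round trip with an in-memory buffer in place of a file on disk. The single-row `triplets` frame is hypothetical.

```python
import io

import pandas as pd

# Hypothetical one-row triplet frame standing in for `train_triplets_df`.
triplets = pd.DataFrame(
    {'anchor': ['a1'], 'positive': ['a2'], 'negative': ['b1']}
)

# Write to an in-memory buffer instead of a file, then read it back.
buffer = io.StringIO()
triplets.to_csv(buffer, index=False)
buffer.seek(0)
loaded = pd.read_csv(buffer)

print(loaded.columns.tolist())
print(loaded.values.tolist())
```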