# pHash - Notebook

In this notebook, we use the perceptual hashes (pHashes) of the products, which are provided in the CSV file. We define a distance function to measure the similarity of two images. At the end of the notebook, we build clusters optimized for precision. In the NLP notebook, these predictions are combined with the results of the NLP method.

## Preliminaries




In [1]:
# import libraries
import numpy as np
import pandas as pd
import imagehash
import pickle

In [2]:
# root folder that contains the competition data
MY_PATH = './data'

df_train_all = pd.read_csv(MY_PATH + '/shopee-product-matching/train.csv')
# full path to the image of each posting
train_images = MY_PATH + '/shopee-product-matching/train_images' + '/' + df_train_all['image']
df_train_all['path'] = train_images

# ground truth: for each posting, all posting_ids that share its label_group
dic_label_group_posting_id = df_train_all.groupby('label_group').posting_id.agg('unique').to_dict()
df_train_all['target'] = df_train_all.label_group.map(dic_label_group_posting_id)

## Create feature vector


In [3]:
# convert the hexadecimal pHash strings into ImageHash objects
df_train_all['phash_hex_to_hash'] = df_train_all['image_phash'].apply(imagehash.hex_to_hash)

## Create a distance function

In [4]:
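def get_phash_distance(p_hash_vec_1, p_hash_vec_2):
    '''
    input:  p_hash_vec_1: a single pHash (ImageHash object)
            p_hash_vec_2: one pHash or a pandas Series of pHashes
    output: Hamming distance(s), i.e. the number of differing bits
    '''
    # subtracting two ImageHash objects returns their Hamming distance
    return p_hash_vec_2 - p_hash_vec_1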

In [5]:
# example: distance between the first pHash and each of the first four pHashes
get_phash_distance(df_train_all['phash_hex_to_hash'][0], df_train_all['phash_hex_to_hash'][0:4])

0     0
1    42
2    40
3    30
Name: phash_hex_to_hash, dtype: object
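
To make the distance above more concrete: subtracting two `ImageHash` objects counts the bits in which the two hashes differ (the Hamming distance). The following minimal sketch reproduces that idea directly on hexadecimal strings without `imagehash`. The helper name `hamming_distance_hex` and the three hash values are made up for illustration and are not taken from `train.csv`.

```python
def hamming_distance_hex(hex_a: str, hex_b: str) -> int:
    """Count the bits that differ between two equally long hex strings."""
    if len(hex_a) != len(hex_b):
        raise ValueError("hashes must have the same length")
    # XOR leaves a 1 bit exactly where the two hashes disagree.
    return bin(int(hex_a, 16) ^ int(hex_b, 16)).count("1")


# Hypothetical 64-bit hashes (16 hex characters each), made up for illustration.
hash_a = "ffff0000ffff0000"
hash_b = "ffff0000ffff00ff"  # differs from hash_a in the last 8 bits
hash_c = "0000ffff0000ffff"  # every bit differs from hash_a

print(hamming_distance_hex(hash_a, hash_a))  # 0
print(hamming_distance_hex(hash_a, hash_b))  # 8
print(hamming_distance_hex(hash_a, hash_c))  # 64
```

For 64-bit hashes the distance lies between 0 (identical hashes) and 64 (all bits differ), which matches the range of the values in the output above.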

## Evaluation

We define helper functions that compute the F1 score, the recall and the precision of the predicted cluster of a single posting.

In [6]:
def pred_cluster_of_i(threshold, feature_vec_i,
                      feature_vec_all = df_train_all['phash_hex_to_hash'],
                      posting_id_ls   = df_train_all['posting_id'].to_list()):
    '''Return the posting_ids whose pHash distance to feature_vec_i is <= threshold.'''
    distances = get_phash_distance(feature_vec_i, feature_vec_all)
    return [post_id for dist, post_id in zip(distances, posting_id_ls) if dist <= threshold]


def f_score_i(i, threshold,
              feature_vec_all = df_train_all['phash_hex_to_hash'],
              posting_id_ls   = df_train_all['posting_id'].to_list()):
    '''Return (F1 score, predicted cluster, size of predicted cluster, real cluster) of posting i.'''
    feature_vec_i = feature_vec_all[i]
    # predicted cluster
    s_pred = set(pred_cluster_of_i(threshold, feature_vec_i, feature_vec_all, posting_id_ls))
    # real cluster
    s_real = set(df_train_all['target'][i])
    # intersection of real and predicted cluster
    int_sec_pred_real = s_pred.intersection(s_real)
    return (2 * len(int_sec_pred_real) / (len(s_pred) + len(s_real))), s_pred, len(s_pred), s_real


def recall_i(i, threshold,
             feature_vec_all = df_train_all['phash_hex_to_hash'],
             posting_id_ls   = df_train_all['posting_id'].to_list()):
    '''Return (recall, predicted cluster, size of predicted cluster, real cluster) of posting i.'''
    feature_vec_i = feature_vec_all[i]
    s_pred = set(pred_cluster_of_i(threshold, feature_vec_i, feature_vec_all, posting_id_ls))
    s_real = set(df_train_all['target'][i])
    # real members that were missed by the prediction
    c = s_real.difference(s_pred)
    return (len(s_real) - len(c)) / len(s_real), s_pred, len(s_pred), s_real


def precision_i(i, threshold,
                feature_vec_all = df_train_all['phash_hex_to_hash'],
                posting_id_ls   = df_train_all['posting_id'].to_list()):
    '''Return (precision, predicted cluster, size of predicted cluster, real cluster) of posting i.'''
    feature_vec_i = feature_vec_all[i]
    s_pred = set(pred_cluster_of_i(threshold, feature_vec_i, feature_vec_all, posting_id_ls))
    s_real = set(df_train_all['target'][i])
    # predicted members that do not belong to the real cluster are false positives
    return ((len(s_pred) - len(s_pred.difference(s_real))) / len(s_pred)), s_pred, len(s_pred), s_real
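
The three metrics are plain set computations on the predicted and the real cluster. As a small worked example, the sketch below evaluates them on two hand-made clusters. The function `cluster_metrics` and the posting names are hypothetical and only serve as an illustration; they are not part of the notebook.

```python
def cluster_metrics(predicted, actual):
    """Return (precision, recall, F1) for one predicted and one real cluster."""
    s_pred, s_real = set(predicted), set(actual)
    overlap = len(s_pred & s_real)  # correctly predicted postings
    precision = overlap / len(s_pred)
    recall = overlap / len(s_real)
    f1 = 2 * overlap / (len(s_pred) + len(s_real))
    return precision, recall, f1


# Made-up clusters: two of the three predictions are correct,
# and two of the four real members are found.
predicted = ["post_a", "post_b", "post_c"]
actual = ["post_a", "post_b", "post_d", "post_e"]

precision, recall, f1 = cluster_metrics(predicted, actual)
print(round(precision, 3))  # 0.667
print(round(recall, 3))     # 0.5
print(round(f1, 3))         # 0.571
```

In the notebook, neither set is ever empty: every posting has distance 0 to itself, so it is always part of its own predicted cluster, and it is always part of its own label group.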

### Prediction with threshold = 9

In [7]:
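# estimate the metrics on a sample: every 100th posting (rows 0, 100, ..., 34100)
fsc_9 = []
prec_9 = []
for i in range(0,342):
    fsc_9.append(f_score_i(i*100,9)[0])
    prec_9.append(precision_i(i*100,9)[0])

# average over the sampled postings
print("Estimated precision:   ", sum(prec_9)/len(prec_9))
print("Estimated F1-Score:   ", sum(fsc_9)/len(fsc_9))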

Estimated precision:    0.963087748833363
Estimated F1-Score:    0.5874131213621182
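
The high precision together with the much lower F1 score suggests that a threshold of 9 rarely groups unrelated products but misses many true matches. The sketch below illustrates this trade-off on a toy example. The distances, the labels and the helper `precision_recall_at` are made up and only assumed for illustration.

```python
# Hypothetical distances from one query image to six candidates, and whether
# each candidate truly belongs to the query's label group (made-up values).
candidates = [
    (0, True),    # the query itself
    (4, True),
    (9, True),
    (14, False),
    (20, True),
    (30, False),
]


def precision_recall_at(threshold, candidates):
    """Precision and recall of the cluster {candidates with distance <= threshold}."""
    predicted = [same for dist, same in candidates if dist <= threshold]
    true_positives = sum(predicted)
    total_positives = sum(same for _, same in candidates)
    return true_positives / len(predicted), true_positives / total_positives


for threshold in (9, 20, 30):
    p, r = precision_recall_at(threshold, candidates)
    print(threshold, round(p, 2), round(r, 2))
# 9 1.0 0.75
# 20 0.8 1.0
# 30 0.67 1.0
```

In this toy case a strict threshold keeps the predicted cluster clean but misses one match, while a looser threshold finds every match at the cost of false positives.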


## Save results

Now we make a prediction optimized for precision and save the results with pickle. These results are used at the end of the NLP notebook.
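
As a minimal sketch of the save and load round trip, the example below writes a small dictionary to a temporary folder and reads it back. Opening the files in `with` blocks ensures that they are closed again. The dictionary content and the file name `predictions_example.p` are hypothetical stand-ins for the real results.

```python
import os
import pickle
import tempfile

# Hypothetical stand-in for the real result dictionary:
# posting_id -> set of posting_ids in the predicted cluster.
predictions = {"post_1": {"post_1", "post_2"}, "post_3": {"post_3"}}

# Write into a temporary folder so the example leaves the working directory untouched.
path = os.path.join(tempfile.mkdtemp(), "predictions_example.p")

# Save
with open(path, "wb") as f:
    pickle.dump(predictions, f)

# Load
with open(path, "rb") as f:
    restored = pickle.load(f)

print(restored == predictions)  # True
```

Pickle files should only be loaded from trusted sources, since unpickling can execute arbitrary code.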

In [8]:
# set to False to recompute the predictions; True only loads the saved results
already_done = True

if not already_done:

    n_postings = len(df_train_all)  # 34250 in the original run

    dict_prec_all_9 = {}
    for i in range(n_postings):
        # keep (precision, predicted cluster) of posting i
        dict_prec_all_9[i] = precision_i(i, threshold=9)[0:2]

        if i % 3000 == 0:
            # Save a checkpoint
            with open("dict_prec_all_9.p", "wb") as f:
                pickle.dump(dict_prec_all_9, f)
            print(i)

    # Final save
    with open("dict_prec_all_9.p", "wb") as f:
        pickle.dump(dict_prec_all_9, f)

    # Load
    with open("dict_prec_all_9.p", "rb") as f:
        dict_prec_all_9_load = pickle.load(f)

    # Change the keys of the dictionary from row index to posting_id
    list_post_id = df_train_all['posting_id'].tolist()

    dict_phash_prec_all_9 = {}
    for i in range(n_postings):
        dict_phash_prec_all_9[list_post_id[i]] = dict_prec_all_9_load[i][1]
    with open("dict_phash_prec_all_9.p", "wb") as f:
        pickle.dump(dict_phash_prec_all_9, f)

# Load
with open("dict_phash_prec_all_9.p", "rb") as f:
    dict_phash_prec_all_9_load = pickle.load(f)