<div>
    <center><img src="https://pic2.zhimg.com/v2-303ba1a7c5eef0dd535c0f0f3e4a4f33_1440w.jpg?source=172ae18b"></center>
    </div>


<a id="intro"></a>
<div class="list-group" id="list-tab" role="tablist">
<h2 class="list-group-item list-group-item-action active" data-toggle="list" style='background:green; border:0; color:white' role="tab" aria-controls="home"><center>0. Introduction</center></h2>

Please refer to:
 - [Overview Description](https://www.kaggle.com/c/shopee-product-matching/overview) 

 - [Overview Evaluation](https://www.kaggle.com/c/shopee-product-matching/overview/evaluation) 

 - [Data](https://www.kaggle.com/c/shopee-product-matching/data) 
 
 Notes:
 1.- [What is a perceptual hash?](https://www.phash.org/) A perceptual hash is a fingerprint of a multimedia file derived from various features from its content. Unlike cryptographic hash functions which rely on the avalanche effect of small changes in input leading to drastic changes in the output, perceptual hashes are "close" to one another if the features are similar. 
 
 2.- [md5sum](https://en.wikipedia.org/wiki/Md5sum) is a computer program that calculates and verifies 128-bit MD5 hashes. It is used to verify the integrity of files, as virtually any change to a file will cause its MD5 hash to change.
 
 3.- Submissions will be evaluated based on their mean [F1-score](https://en.wikipedia.org/wiki/F-score), where the F1-score is the harmonic mean of precision and recall:
<div>
   <left><img src="https://wikimedia.org/api/rest_v1/media/math/render/svg/4179c69cf1dde8418c4593177521847e862e7df8"></left>
   <div>
where:
       
       - tp = true positive
       - fp = false positive
       - fn = false negative
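
The formula above can be sketched in a few lines of Python. This is a minimal illustration, assuming the score is computed per posting on space-delimited `posting_id` strings and then averaged; the function names `f1_per_posting` and `mean_f1` and the toy ids are hypothetical and not part of the competition code.

```python
def f1_per_posting(true_matches, pred_matches):
    """F1 for one posting; both arguments are space-delimited id strings."""
    true_set = set(true_matches.split())
    pred_set = set(pred_matches.split())
    tp = len(true_set & pred_set)   # ids predicted and correct
    fp = len(pred_set - true_set)   # ids predicted but wrong
    fn = len(true_set - pred_set)   # ids missed
    if tp == 0:
        return 0.0
    # Equivalent to 2 * precision * recall / (precision + recall)
    return 2 * tp / (2 * tp + fp + fn)


def mean_f1(true_list, pred_list):
    """Average the per-posting F1 over all rows."""
    scores = [f1_per_posting(t, p) for t, p in zip(true_list, pred_list)]
    return sum(scores) / len(scores)


# Toy example with made-up posting ids
y_true = ["a b c", "d e"]
y_pred = ["a b", "d e f"]
score = mean_f1(y_true, y_pred)
print(score)
```

The first toy row misses one id (one false negative) and the second adds one wrong id (one false positive), so both kinds of error lower the score.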

<div class="list-group" id="list-tab" role="tablist">
<h2 class="list-group-item list-group-item-action active" data-toggle="list" style='background:green; border:0; color:white' role="tab" aria-controls="home"><center>1. Contents</center></h2>

0. [Introduction](#intro)  
1. [Contents](#contents)
2. [Libraries](#import-libraries)  
3. [Datasets Exploration](#datasets-exploration)  
4. [Tabular Exploration](#tabular-exploration)  
5. [Image Title Exploration](#image-title-exploration) 
6. [Image Shapes Distribution Exploration](#image-shapes-exploration)
7. [Image PHASH Exploration](#image-phash)
8. [Baseline Prediction](#baseline-prediction)
9. [References](#references)  

<a id="import-libraries"></a>
<div class="list-group" id="list-tab" role="tablist">
<h2 class="list-group-item list-group-item-action active" data-toggle="list" style='background:green; border:0; color:white' role="tab" aria-controls="home"><center>2. Libraries</center></h2>

In [None]:
import numpy as np
import pandas as pd
from pandas import DataFrame
import plotly.express as px
import os
import matplotlib.pyplot as plt
import seaborn as sns
import cv2
import tqdm
from tqdm.auto import tqdm as tqdmp
tqdmp.pandas()

#Text Color
from termcolor import colored

#NLP
from sklearn.feature_extraction.text import CountVectorizer

#WordCloud
!pip install stylecloud
from wordcloud import WordCloud, STOPWORDS, ImageColorGenerator

#Unigrams, bigrams, trigrams
from sklearn.feature_extraction.text import CountVectorizer as CV

#Text Processing
import re
import nltk
nltk.download('popular')

# Work with phash
import imagehash

import skimage.io as io
from PIL import Image

<a id="datasets-exploration"></a>
<div class="list-group" id="list-tab" role="tablist">
<h2 class="list-group-item list-group-item-action active" data-toggle="list" style='background:green; border:0; color:white' role="tab" aria-controls="home"><center>3. Datasets Exploration</center></h2>

### Datasets Directory

In [None]:
WORK_DIR = '../input/shopee-product-matching'
os.listdir(WORK_DIR)

### Datasets

In [None]:
train = pd.read_csv('../input/shopee-product-matching/train.csv')
test = pd.read_csv('../input/shopee-product-matching/test.csv')
ss = pd.read_csv('../input/shopee-product-matching/sample_submission.csv')

### Images Folder Paths

In [None]:
train_jpg_directory = '../input/shopee-product-matching/train_images'
test_jpg_directory = '../input/shopee-product-matching/test_images'

### Complete image paths for train and test datasets

In [None]:
train_images_path = WORK_DIR + "/train_images/" + train['image'] 
test_images_path = WORK_DIR + "/test_images/" + test['image'] 

### Dataset Shape

In [None]:
train.shape

As mentioned in the 'Data' section, this dataset is made of:
 - 34250 rows, and
 - 5 columns

### Datasets Heads

In [None]:
train.head()

In [None]:
test.head()

In [None]:
train.info()

We can conclude that there are no **NaN** (missing) values in the training dataset. 

### Dataset Size

In [None]:
print(f"Training Dataset Shape: {colored(train.shape, 'blue')}")
print(f"Test Dataset Shape: {colored(test.shape, 'green')}")

### Column Unique Values

In [None]:
for col in train.columns:
    print('{} unique values: {}'.format(col,colored(str(train[col].nunique()), 'blue')))

### Number of Images in Each Directory

In [None]:
print(f"Number of train images: {colored(len(train_images_path), 'blue')}")
print(f"Number of test images:  {colored(len(test_images_path), 'green')}")

<a id="image-title-exploration"></a>
<div class="list-group" id="list-tab" role="tablist">
<h2 class="list-group-item list-group-item-action active" data-toggle="list" style='background:green; border:0; color:white' role="tab" aria-controls="home"><center>5. Image Title Exploration</center></h2>

### Wordcloud

In [None]:
from IPython.display import display, HTML, Javascript
def nb():
    # Load the custom notebook stylesheet and wrap it in a <style> tag
    with open("../input/css-style/edit.css", "r") as f:
        styles = f.read()
    return HTML("<style>"+styles+"</style>")

In [None]:
import stylecloud

In [None]:
stylecloud.gen_stylecloud(text=' '.join(train['title']),
                          icon_name='fas fa-shopping-cart',
                          palette='colorbrewer.qualitative.Accent_8',
                          background_color='black',
                          gradient='horizontal',
                          size=1024)

from IPython.display import Image
Image(filename="./stylecloud.png", width=604, height=604)

Titles contain words in English, but also in other languages (e.g., Indonesian, Malay, or Latvian).

### Unigrams, bigrams and trigrams baseline.

In [None]:
def _get_top_n_ngrams(corpus, n=None, ngram_range=(1, 1)):
    # Count every n-gram in the corpus and return the n most frequent as (ngram, count) pairs
    vec = CV(ngram_range=ngram_range).fit(corpus)
    bag_of_words = vec.transform(corpus)
    sum_words = bag_of_words.sum(axis=0) 
    words_freq = [(word, sum_words[0, idx]) for word, idx in vec.vocabulary_.items()]
    words_freq = sorted(words_freq, key = lambda x: x[1], reverse=True)
    return words_freq[:n]

def get_top_n_words(corpus, n=None):
    return _get_top_n_ngrams(corpus, n, ngram_range=(1, 1))

def get_top_n_bigram(corpus, n=None):
    return _get_top_n_ngrams(corpus, n, ngram_range=(2, 2))

def get_top_n_trigram(corpus, n=None):
    return _get_top_n_ngrams(corpus, n, ngram_range=(3, 3))

In [None]:
def plot_bt(x,w,p):
    common_words = x(train['title'], 10)
    common_words_df = DataFrame (common_words,columns=['word','freq'])

    plt.figure(figsize=(14, 6))
    sns.barplot(x='freq', y='word', data=common_words_df,palette=p)
    plt.title("Top 10 "+ w , fontsize=14)
    plt.xlabel("Frequency", fontsize=11)
    plt.yticks(fontsize=11)
    plt.xticks(rotation=45, fontsize=11)
    plt.ylabel("");
    return common_words_df

In [None]:
common_words = get_top_n_words(train['title'], 10)
common_words_df1 = DataFrame(common_words,columns=['word','freq'])
plt.figure(figsize=(14, 6))
ax = sns.barplot(x='freq', y='word', data=common_words_df1,palette='Blues')

plt.title("Top 10 unigrams", fontsize=14)
plt.xlabel("Frequency", fontsize=11)
plt.yticks(fontsize=11)
plt.xticks(rotation=45, fontsize=11)
plt.ylabel("");

common_words_df2 = plot_bt(get_top_n_bigram,"bigrams",'BuGn')
common_words_df3 = plot_bt(get_top_n_trigram,"trigrams",'YlGnBu')

As per above, the **most common words and phrases** are not English but Indonesian.

### Basic NLP 

In [None]:
train.head()

In [None]:
def preprocess_text(text, flg_stemm=False, flg_lemm=True):

    lst_stopwords = nltk.corpus.stopwords.words("english")
    
    ## clean (convert to lowercase and remove punctuations and characters and then strip)
    text = re.sub(r'[^\w\s]', '', str(text).lower().strip())
            
    ## Tokenize (convert from string to list)
    lst_text = text.split()
    ## remove Stopwords
    if lst_stopwords is not None:
        lst_text = [word for word in lst_text if word not in 
                    lst_stopwords]
                
    ## Stemming (to remove -ing, -ly, ...)
    if flg_stemm == True:
        ps = nltk.stem.porter.PorterStemmer()
        lst_text = [ps.stem(word) for word in lst_text]
                
    ## Lemmatisation (to convert the word into root word)
    if flg_lemm == True:
        lem = nltk.stem.wordnet.WordNetLemmatizer()    
        lst_text = [lem.lemmatize(word) for word in lst_text]
            
    ## back to string from list
    text = " ".join(lst_text)
    return text

In [None]:
#Clean Title
train["clean_title"] = train["title"].apply(lambda x: preprocess_text(x, flg_stemm=False, flg_lemm=True))

In [None]:
#Length of Title
train['clean_title_len'] = train['clean_title'].apply(lambda x: len(x))

#Word Count
train['clean_title_word_count'] =train["clean_title"].apply(lambda x: len(str(x).split(" ")))

#Character Count
train['clean_title_char_count'] = train["clean_title"].apply(lambda x: sum(len(word) for word in str(x).split(" ")))

#Average Word Length
train['clean_title_avg_word_length'] = train['clean_title_char_count'] / train['clean_title_word_count']

In [None]:
train.head()

### Distribution Plots

In [None]:
def plot_distribution(x, title):

    fig = px.histogram(
    train, 
    x = x,
    width = 800,
    height = 500,
    title = title
    )
    
    fig.show()

In [None]:
#Distribution of cleaned title length in characters (lowercased, punctuation removed, stripped)
plot_distribution(x = 'clean_title_len', title = 'Title Length Distribution')

The most common (clean) title length is 33 characters.

In [None]:
#Distribution of titles word count
plot_distribution(x = 'clean_title_word_count', title = 'Word Count Distribution')

In [None]:
#Distribution of titles characters count
plot_distribution(x = 'clean_title_char_count', title = 'Character Count Distribution')

In [None]:
#Distribution of titles average word length
plot_distribution(x = 'clean_title_avg_word_length', title = 'Average Word Length Distribution')

As per above, the most common average word length is almost 5 characters.

No further title analysis is performed at this point. Most of the analysis above was done for academic purposes; exploring titles further without relating them to the images is unlikely to contribute to the goal of this competition.

<a id="image-shapes-exploration"></a>
<div class="list-group" id="list-tab" role="tablist">
<h2 class="list-group-item list-group-item-action active" data-toggle="list" style='background:green; border:0; color:white' role="tab" aria-controls="home"><center>6. Image Shapes Distribution Exploration</center></h2>

In [None]:
#Datasets
train = pd.read_csv('../input/shopee-product-matching/train.csv')
test = pd.read_csv('../input/shopee-product-matching/test.csv')
ss = pd.read_csv('../input/shopee-product-matching/sample_submission.csv')

In [None]:
#Addition of column 'path' to both datasets 

train_images = WORK_DIR + "/train_images/" + train['image']
train['path'] = train_images

test_images = WORK_DIR + "/test_images/" + test['image']
test['path'] = test_images

In [None]:
# Shape of each image, obtained by loading it from its file path
train['img_shape'] = train['path'].progress_apply(lambda x: np.shape(io.imread(x)))

In [None]:
train['img_shape'].describe()

In [None]:
# Plot of images width and height
# np.shape of an image array is (height, width, channels)
shapes = pd.DataFrame().from_records(train['img_shape'])
shapes.columns = ['Height', 'Width', 'Colors']

sns.set_style("white")
sns.jointplot(x = shapes['Width'].astype('float32'), 
              y = shapes['Height'].astype('float32'),
              height = 6, color = '#222353')
plt.show()

As per above, not all images have the same size.

The most frequent size is 604 x 604.
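
The most frequent size can also be read off numerically instead of from the plot. The sketch below counts shape tuples with `collections.Counter`; the list `toy_shapes` is made-up data standing in for the `img_shape` column, so the numbers are illustrative only.

```python
from collections import Counter

# Hypothetical (height, width, channels) tuples standing in for train['img_shape']
toy_shapes = [(604, 604, 3), (640, 640, 3), (604, 604, 3), (1024, 768, 3)]

# Count how often each shape occurs and take the most frequent one
shape_counts = Counter(toy_shapes)
most_common_shape, occurrences = shape_counts.most_common(1)[0]
print(most_common_shape, occurrences)
```

On the real data the same call could be applied to `train['img_shape']`, assuming that column holds tuples as produced above.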

<a id="image-phash"></a>
<div class="list-group" id="list-tab" role="tablist">
<h2 class="list-group-item list-group-item-action active" data-toggle="list" style='background:green; border:0; color:white' role="tab" aria-controls="home"><center>7. Image PHASH Exploration</center></h2>

As mentioned above, items with the same image_phash are potentially duplicates. Using the [imagehash library](https://pypi.org/project/ImageHash/), the image hashes can be loaded for comparison. 
- Perceptual hashing acts as the image fingerprint, which is generated by analyzing the image content mathematically.
- Here it is a 64-bit representation.
- It is also widely used in copyright-infringement use cases.

Perceptual hashing converts an image, by degrading it and turning it into "pixels", into a binary (or hexadecimal) sequence. Unlike cryptographic hashing, perceptual hashing lacks the avalanche effect, so a small change in the image produces only a small change in the hash.
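
The contrast with cryptographic hashing can be illustrated with the standard library alone. In this sketch the two byte strings and the two 16-digit phash values are made up for illustration; `bit_difference` is a hypothetical helper that counts differing bits between two hex strings of equal length.

```python
import hashlib


def bit_difference(hex_a, hex_b):
    """Count the bits that differ between two equal-length hex strings."""
    return bin(int(hex_a, 16) ^ int(hex_b, 16)).count("1")


# Cryptographic hash: inputs differing in one character give unrelated digests
md5_a = hashlib.md5(b"shopee product image 1").hexdigest()
md5_b = hashlib.md5(b"shopee product image 2").hexdigest()
md5_bits = bit_difference(md5_a, md5_b)

# Perceptual hash: made-up values for two nearly identical images
phash_a = "94974f937d4c2433"
phash_b = "94974f937d4c2432"
phash_bits = bit_difference(phash_a, phash_b)

print(f"MD5 bits changed (out of 128): {md5_bits}")
print(f"phash bits changed (out of 64): {phash_bits}")
```

With MD5, a large share of the 128 bits typically flips, whereas the two similar perceptual hashes differ in a single bit.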

<div>
    <center><img src="https://miro.medium.com/max/460/0*zfY4Co3OIXnuJ-96."></center>
    </div>

Exploring `image_phash` for the number of duplicates in the train dataset.

In [None]:
phash_count = train.groupby(['image_phash']).size().reset_index()
phash_count.columns = ['image_phash', 'amount']
phash_count.sort_values(by='amount', ascending=False, inplace=True)
phash_count

In [None]:
fig = px.histogram(
    phash_count, 
    x = phash_count['amount'],
    width = 800,
    height = 500,
    title = 'Phash amount distribution'
    )
    
fig.show()

As shown above, there are more than two thousand images with an identical phash.

Roughly speaking, the phash algorithm breaks an image into fragments, analyzes the image structure on luminance (without color information) and assigns True or False to each fragment depending on whether its value is above or below a reference value such as the mean. To measure similarity, one phash matrix is subtracted from another; in the `imagehash` library this subtraction returns the number of positions at which the two matrices differ (the Hamming distance). Identical fragments contribute zero (True - True = 0, False - False = 0). The closer the sum of all differences is to zero, the more similar the images are.

For instance, the phash matrix of the first image looks like this:

In [None]:
imagehash.hex_to_hash(train['image_phash'][0])

In [None]:
#Notice the hash is an 8x8 matrix, so its length is 64 bits
len(imagehash.hex_to_hash(train['image_phash'][0]))
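
The subtraction described above can also be written with NumPy only, which makes the 8x8 matrix and the distance explicit. This is a sketch under stated assumptions: `hex_to_bits` and `pairwise_hamming` are hypothetical helpers, the hex strings are made up, and the bit ordering inside the matrix is an assumption (the distance itself does not depend on it).

```python
import numpy as np


def hex_to_bits(hex_hash, hash_size=8):
    """Unpack a hex string into a hash_size x hash_size boolean matrix."""
    n_bits = hash_size * hash_size
    bits = bin(int(hex_hash, 16))[2:].zfill(n_bits)
    return np.array([b == "1" for b in bits]).reshape(hash_size, hash_size)


def pairwise_hamming(hex_hashes):
    """Return an (n, n) matrix with the number of differing bits per pair."""
    bits = np.stack([hex_to_bits(h).ravel() for h in hex_hashes])  # shape (n, 64)
    # Broadcasting compares every hash with every other hash at once
    return (bits[:, None, :] != bits[None, :, :]).sum(axis=2)


# Made-up hashes: the first two differ in two bits, the third is very different
toy_hashes = ["ffff000000000000", "ffff000000000003", "0000ffffffffffff"]
matrix_a = hex_to_bits(toy_hashes[0])
distances = pairwise_hamming(toy_hashes)
print(distances)
```

A zero off the diagonal would mark an exact phash duplicate, and small values mark near duplicates. For n hashes the intermediate array holds n x n x 64 booleans, so memory grows quadratically.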

Function to check for matches by phash value.

In [None]:
def match_matrix(phash_array):
    """
    A function that checks for matches by phash value.
    Input - takes phash values as input.
    Output - phash diff matrix (pandas data frame)
    """
    phashs = phash_array.apply(lambda x: imagehash.hex_to_hash(x))
    # Collect one column of differences per image and concatenate once,
    # which avoids re-copying the growing matrix on every iteration
    diff_columns = []
    for h in tqdm.tqdm(phashs, total = len(phashs), desc = 'Progress:', 
                       position = 0, leave = True):
        diff_columns.append(phashs - h)
    phash_matrix = pd.concat(diff_columns, axis = 1)
    phash_matrix.columns = range(len(phash_array))
    return phash_matrix

In [None]:
#Only the first thousand train images will be analysed since the process of building up the matrix is quite resource-intensive.
train_part = train.iloc[:1000, :]
matches = match_matrix(train_part['image_phash'])
matches

In [None]:
#All test images can be taken
test_match = match_matrix(test['image_phash'][:3])
test_match

As displayed above, test images are not alike.

In [None]:
#Checking for matches among train images
match = []
for i in range(len(matches)):
    match.append(matches.iloc[i, :][(matches.iloc[i, :] == 0)].index.values)
match = pd.Series(match)

match[match.apply(lambda x: len(x) > 1)]

In [None]:
match[match.apply(lambda x: len(x) > 1)].head(20)

Among those first thousand, 90 matches were found. 

Plotting some matches:

In [None]:
def image_viz(image_path):
    """
    Function for visualization.
    Takes path to image as input.
    """
    img = cv2.imread(image_path)
    img = cv2.cvtColor(img, cv2.COLOR_BGR2RGB)    
    plt.imshow(img)
    plt.axis('off')

In [None]:
train_part.loc[[98,99],['posting_id','image','image_phash','title','label_group']]

In [None]:
example1 = train_part.loc[[98,99],['posting_id','image','image_phash','title','label_group']]
for col in example1.columns:
    print('{} unique values: {}'.format(col,colored(str(example1[col].nunique()), 'blue')))

Notice that **'image'** and **'image_phash'** each have a single unique value, i.e. both postings use the same image. The same holds for **'label_group'**, in this case as expected from the field definition.

In [None]:
plt.figure(figsize = (15, 10))
for idx, i in enumerate([train_part.loc[98, 'path'], 
                         train_part.loc[99, 'path']]):
    plt.subplot(1, 2, idx + 1)
    image_viz(i)
plt.show()

In [None]:
#Another match group
train_part.loc[[58,59,482],['posting_id','image','image_phash','title','label_group']]

In [None]:
example2 = train_part.loc[[58,59,482],['posting_id','image','image_phash','title','label_group']]
for col in example2.columns:
    print('{} unique values: {}'.format(col,colored(str(example2[col].nunique()), 'blue')))

In [None]:
plt.figure(figsize = (15, 10))
for idx, i in enumerate([train_part.loc[58, 'path'], 
                         train_part.loc[59, 'path'], 
                         train_part.loc[482, 'path']]):
    plt.subplot(1, 3, idx + 1)
    image_viz(i)
plt.show()

In [None]:
#Last example
train_part.loc[[104,105,106,107],['posting_id','image','image_phash','title','label_group']]

In [None]:
example3 = train_part.loc[[104,105,106,107],['posting_id','image','image_phash','title','label_group']]
for col in example3.columns:
    print('{} unique values: {}'.format(col,colored(str(example3[col].nunique()), 'blue')))

In [None]:
plt.figure(figsize = (15, 10))
for idx, i in enumerate([train_part.loc[104, 'path'], 
                         train_part.loc[105, 'path'],
                         train_part.loc[106, 'path'],
                         train_part.loc[107, 'path']]):
    plt.subplot(1, 4, idx + 1)
    image_viz(i)
plt.show()

Hence, phash analysis makes it possible to find matches.

In addition to the above, phash analysis also makes it possible to find matches that are **not exact** but very similar:

In [None]:
#Using the previous matrix made of the first thousand train images
match = []
for i in range(len(matches)):
    match.append(matches.iloc[i, :][(matches.iloc[i, :] > 0) & 
                                    (matches.iloc[i, :] <= 5)].index.values)
match = pd.Series(match)

match[match.apply(lambda x: len(x) >= 1)]

In [None]:
len(match[match.apply(lambda x: len(x) >= 1)])

Notice that 5 near matches are found in the **first thousand train images** alone.

In [None]:
#First almost match
plt.figure(figsize = (10, 5))
for idx, i in enumerate([train_part.loc[55, 'path'], 
                         train_part.loc[312, 'path']]):
    plt.subplot(1, 2, idx + 1)
    image_viz(i)
plt.show()

In [None]:
#Another example
plt.figure(figsize = (10, 5))
for idx, i in enumerate([train_part.loc[469, 'path'], 
                         train_part.loc[194, 'path']]):
    plt.subplot(1, 2, idx + 1)
    image_viz(i)
plt.show()

This phash analysis leads to the conclusion that **postings with duplicate 'image_phash' values show the same image, but there are also postings whose images are not duplicates and whose products still match.**

<a id="baseline-prediction"></a>
<div class="list-group" id="list-tab" role="tablist">
<h2 class="list-group-item list-group-item-action active" data-toggle="list" style='background:green; border:0; color:white' role="tab" aria-controls="home"><center>8. Baseline Prediction</center></h2>

In [None]:
# Work functions
def phash_match(phash_array, element):
    """
    A function that calculates phash diffs.
    Input - phashs array and element as input.
    Output - phash diff
    """
    phash_diff = phash_array - phash_array[element]
    return phash_diff

def add_match(phash, i, dataset = train, threshold = 5):
    """
    A function that returns match names.
    Input - phash array, i element, applicable dataset and threshold (default = 5).
    Output - match names.
    """
    diffs = phash_match(phash, i)
    matches = [x for x in diffs[diffs <= threshold].index.drop(i).values]
    str_matches = ''
    str_matches = str_matches + dataset.iloc[i, 0] + ' '
    for j in matches:
        str_matches = str_matches + dataset.iloc[j, 0] + ' '
    str_matches = str_matches[:-1]
    return str_matches

In [None]:
phashs = train['image_phash'][:1000].apply(lambda x: imagehash.hex_to_hash(x))
str_matches = []

for i in tqdm.tqdm(range(len(phashs)), desc = 'Progress:', position = 0, leave = True):
    str_matches.append(add_match(phashs, i))

str_matches[:15]

### Test images

In [None]:
plt.figure(figsize = (15, 10))
for idx, i in enumerate(test['path']):
    plt.subplot(1, 3, idx + 1)
    image_viz(i)
plt.show()

In [None]:
test

In [None]:
test_phashs = test['image_phash'].apply(lambda x: imagehash.hex_to_hash(x))
test_matches = []

for i in tqdm.tqdm(range(len(test_phashs)), desc = 'Progress:', 
                   position = 0, leave = True):
    test_matches.append(add_match(test_phashs, i, test, threshold = 7))

test_matches

In [None]:
ss['matches'] = test_matches
ss.to_csv("submission.csv", index = False)
ss

This approach has a speed disadvantage: every hash is compared against every other hash, so it cannot be used for large test data. In that case, the search is restricted to the usual finding of all identical phash codes for each image.
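
Finding all identical phash codes does not need pairwise comparisons; a single `groupby` is enough. The sketch below uses a made-up `toy` frame with the same `posting_id` and `image_phash` column names as the competition data; the hash values are placeholders.

```python
import pandas as pd

# Toy stand-in for the test set; ids and hashes are made up
toy = pd.DataFrame({
    "posting_id": ["p1", "p2", "p3", "p4"],
    "image_phash": ["aaaa", "bbbb", "aaaa", "cccc"],
})

# Join all posting ids that share a phash into one space-delimited string
groups = toy.groupby("image_phash")["posting_id"].agg(" ".join)

# Map each row to the group string of its own phash (always includes itself)
toy["matches"] = toy["image_phash"].map(groups)
print(toy)
```

This only captures exact phash duplicates, so near duplicates within a small Hamming distance are missed. That is the trade-off for avoiding the pairwise comparison.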

<a id="references"></a>
<div class="list-group" id="list-tab" role="tablist">
<h2 class="list-group-item list-group-item-action active" data-toggle="list" style='background:green; border:0; color:white' role="tab" aria-controls="home"><center>9. References</center></h2>

This analysis was developed for academic purposes and outside of competition scoring.

Main references:
 - [[V7]Shopee InDepth EDA:One stop for all your needs](https://www.kaggle.com/ishandutta/v7-shopee-indepth-eda-one-stop-for-all-your-needs)
 - Shopee: Before we start (EDA, PHASH, Baseline)
 - [🛍️ Shopee: EDA + RAPIDS preprocessing + W&B](https://www.kaggle.com/ruchi798/shopee-eda-rapids-preprocessing-w-b)
 - [EDA WordCloud Indo->Eng Insights](https://www.kaggle.com/abhishekvermasg1/eda-wordcloud-indo-eng-insights)
 
 
 

