<h1 style='color:white; background:orange; border:0'>CNN Semantic Clustering - by Kay Delventhal</h1>

### Capstone Project: Product Clustering

Coauthors: Elias Büchner, Kay Delventhal, Niels-Christian Leight and Phillip McRae

<h3 style='color:white; background:orange; border:0'>Classify images and collect labels to generate semantic features.</h3>

This Notebook is based on Python 3.6 - for the benefit of GPU processing.

### The Idea:

As an alternative to purely technical algorithms such as pHash or ColorMatch, which rely on numeric features, semantic features would allow images to be clustered not only by visual similarity but also by their embedded semantic meaning.

A number of pre-trained CNNs were used to classify all 32412 Shopee images. All of these CNNs were pre-trained on the "ImageNet" data set with its 1000 classes, so they all predict the same set of classes.

The goal is **not** to obtain a single valid classification, but rather to obtain an **array of labels** with rated probabilities. Because this **array of labels** is generated by different CNNs, it lets us count how often each class appears.

This counting is done based on different metrics:
- N best of all CNNs
- N better than 'threshold'
- Misc

After a selection is made, a feature-generating process creates new semantic features per image. These new semantic features can be used with an NLP algorithm.

### Note:

For this notebook the following data is needed: https://www.kaggle.com/c/shopee-product-matching/overview/evaluation
- shopee-product-matching/train_images
- shopee-product-matching/test_images
- train.csv
- test.csv
- sample_submission.csv

For this notebook the following python files are needed:
- SemanticClustering.py
- CSData.py
- CSScoring.py
- CSTools.py

### Python Setup

In [1]:
# basic python module import
import os
import pickle
from glob import glob
from time import time

import importlib # to reload modules
from tqdm import tqdm

import numpy as np
import pandas as pd

In [3]:
# basic CNN tools module import
import PIL
from sklearn.model_selection import train_test_split

from keras.preprocessing.image import load_img
from keras.preprocessing.image import img_to_array

Using TensorFlow backend.


In [5]:
# module import for CNN grid-search
import keras.applications.xception as xcptn
import keras.applications.vgg16 as vgg16
import keras.applications.vgg19 as vgg19
import keras.applications.densenet as denet
import keras.applications.resnet50 as rsnt50
import keras.applications.resnet as resnet
import keras.applications.nasnet as nasnet
import keras.applications.mobilenet_v2 as mblntv2
import keras.applications.inception_v3 as incptv3
import keras.applications.inception_resnet_v2 as incntv2
#import keras.applications.efficientnet as efcntnt

In [6]:
# used for NLP setup
import random
np.random.seed(2018)

import nltk
from tensorflow.keras.preprocessing import text

#from sklearn.decomposition import PCA
#from sklearn.feature_extraction.text import TfidfVectorizer

import gensim
from gensim.utils import simple_preprocess
from gensim.parsing.preprocessing import STOPWORDS
from nltk.stem import WordNetLemmatizer, SnowballStemmer
from nltk.stem.porter import *
from gensim.models import Word2Vec
import nltk
#nltk.download('wordnet')
#[nltk_data] Downloading package wordnet to
#[nltk_data]     C:\Users\vfx\AppData\Roaming\nltk_data...
#[nltk_data]   Unzipping corpora\wordnet.zip.
stemmer = SnowballStemmer('english')

In [6]:
# tools to solve GPU issues
import gc
gc.collect()
from keras import backend as K
K.clear_session()

### Project Tools and Environment Setup

In [3]:
# python module import - cgn-data-21-1 Capstone Project tools
import os
import importlib # to reload modules

import CSTools as ct
importlib.reload(ct)

import CSData as cd
importlib.reload(cd)

import CSScoring as sc
importlib.reload(sc)

import SemanticClustering as sem
importlib.reload(sem)

# base path setup
PATH = os.getcwd().replace('\\','/')+'/'
print(PATH)
DATA = PATH+'CnnGridSearch'+'/'
if not os.path.isdir(DATA):
    os.makedirs(DATA)
print(PATH)
print(DATA)

D:/_DataScience/CapstoneProject/
D:/_DataScience/CapstoneProject/
D:/_DataScience/CapstoneProject/CnnGridSearch/


## Define CNN Grid-Search Parameter

Pre-trained **Keras Applications** models are used.

In [23]:
# define global vars for image and label data
SHOPEEIMG_TRAIN = PATH+'shopee-product-matching/train_images'
SHOPEEDATA_TRAIN = PATH+'shopee-product-matching/train.csv'

# define CNN grid-search dict()
# to be used with classify_all()
models = {
    'Xception':    {'CNN': xcptn.Xception, 'PREP': xcptn.preprocess_input, 'DECO': xcptn.decode_predictions, 'SIZE': (299, 299)},
    'VGG16':       {'CNN': vgg16.VGG16, 'PREP': vgg16.preprocess_input, 'DECO': vgg16.decode_predictions, 'SIZE': (224, 224)},
    'VGG19':       {'CNN': vgg19.VGG19, 'PREP': vgg19.preprocess_input, 'DECO': vgg19.decode_predictions, 'SIZE': (224, 224)},
    'DenseNet201': {'CNN': denet.DenseNet201, 'PREP': denet.preprocess_input, 'DECO': denet.decode_predictions, 'SIZE': (224, 224)},
    'NASNetLarge': {'CNN': nasnet.NASNetLarge, 'PREP': nasnet.preprocess_input, 'DECO': nasnet.decode_predictions, 'SIZE': (331, 331)},
    'MobileNetV2': {'CNN': mblntv2.MobileNetV2, 'PREP': mblntv2.preprocess_input, 'DECO': mblntv2.decode_predictions, 'SIZE': (224, 224)},
    'ResNet50':    {'CNN': rsnt50.ResNet50, 'PREP': rsnt50.preprocess_input, 'DECO': rsnt50.decode_predictions, 'SIZE': (224, 224)},
    'ResNet152':   {'CNN': resnet.ResNet152, 'PREP': resnet.preprocess_input, 'DECO': resnet.decode_predictions, 'SIZE': (224, 224)},
    'InceptionV3': {'CNN': incptv3.InceptionV3, 'PREP': incptv3.preprocess_input, 'DECO': incptv3.decode_predictions, 'SIZE': (299, 299)},
    'IncResNetV2': {'CNN': incntv2.InceptionResNetV2, 'PREP': incntv2.preprocess_input, 'DECO': incntv2.decode_predictions, 'SIZE': (299, 299)}#,
    #'EfficNetB7':  {'CNN': efcntnt.EfficientNetB7, 'PREP': efcntnt.preprocess_input, 'DECO': efcntnt.decode_predictions} # not included in python 3.6 keras version
}

### Find Images for CNN Grid-Search

In [12]:
%%time
# gather images for the classification grid-search with classify_all()

images = [x.replace('\\','/') for x in glob(SHOPEEIMG_TRAIN+'/*.jpg')]
print('images:', len(images))

use = models.keys()
print('cnn #:', len(use))
print(use)

images: 32412
cnn #: 10
dict_keys(['Xception', 'VGG16', 'VGG19', 'DenseNet201', 'NASNetLarge', 'MobileNetV2', 'ResNet50', 'ResNet152', 'InceptionV3', 'IncResNetV2'])
Wall time: 146 ms


### CNN Grid-Search (Execution)

In [None]:
%%time

# classify grid-search

print('images:', len(images))
collect = sem.classify_all(images,models,top=10)
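
`sem.classify_all()` lives in SemanticClustering.py and is not shown in this notebook. The following is only a rough sketch of what such a grid-search loop could look like, assuming each entry of `models` is used as constructor, preprocessing function and decoder. The names `classify_all_sketch` and `load_image`, as well as the layout of the returned dictionary, are hypothetical.

```python
import os

def classify_all_sketch(images, models, load_image, top=10):
    """Hypothetical stand-in for sem.classify_all(); not the project's code.

    images     -- list of image file paths
    models     -- dict shaped like the `models` grid (keys CNN, PREP, DECO, SIZE)
    load_image -- callable(path, size) returning one image batch; with Keras it
                  could wrap load_img(path, target_size=size) and img_to_array()
    """
    collect = {os.path.basename(path): {} for path in images}
    for name, cfg in models.items():
        # build each network once, then reuse it for every image
        model = cfg['CNN'](weights='imagenet')
        for path in images:
            batch = cfg['PREP'](load_image(path, cfg['SIZE']))
            preds = model.predict(batch)
            # decode_predictions returns (class_id, label, probability) tuples
            decoded = cfg['DECO'](preds, top=top)[0]
            collect[os.path.basename(path)][name] = [
                (label, float(prob)) for _, label, prob in decoded
            ]
    return collect
```

Passing the image loader in as a parameter keeps the loop independent of the image library, which also makes it easy to try the logic with stub models before starting a long GPU run.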

### Save/Load CNN Grid-search Data

In [14]:
#ct.write_dict('collect',collect,DATA)
collect = ct.read_dict('collect',DATA)
print(len(collect))

read: D:/_DataScience/CapstoneProject/CnnGridSearch/collect.pickle
32412


## Process CNN Grid-Search Data

### Define Functions to Retrieve Data from Raw Collection

### Execute Data Processing

In [17]:
%%time

# use apply_threshold() to gather classifications with probability over 'threshold' per CNN

processing = dict()
joind, found, count = sem.apply_threshold(collect, threshold = 0.1, show=0)
processing['apply'] = joind, found, count

100%|██████████████████████████████████████████████████████████████████████████| 32412/32412 [00:09<00:00, 3340.14it/s]

Wall time: 9.7 s





In [20]:
%%time

# use find_predictions() to gather the three best classification guesses per CNN

joind, found, count = sem.find_predictions(collect, get=3, show=0)
processing['find'] = joind, found, count

100%|█████████████████████████████████████████████████████████████████████████| 32412/32412 [00:02<00:00, 11678.78it/s]


Wall time: 2.94 s
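
The two selection strategies used above can be sketched as follows. This is a simplified illustration, not the code from SemanticClustering.py: it assumes `collect` maps each image to a dictionary of `(label, probability)` lists per CNN, and the function names and the demo values are made up.

```python
def apply_threshold_sketch(collect, threshold=0.1):
    """Keep, per image and CNN, every label whose probability exceeds threshold."""
    return {
        image: {cnn: [label for label, prob in preds if prob > threshold]
                for cnn, preds in per_cnn.items()}
        for image, per_cnn in collect.items()
    }

def find_predictions_sketch(collect, get=3):
    """Keep, per image and CNN, the `get` most probable labels."""
    return {
        image: {cnn: [label for label, _ in
                      sorted(preds, key=lambda lp: lp[1], reverse=True)[:get]]
                for cnn, preds in per_cnn.items()}
        for image, per_cnn in collect.items()
    }

# made-up probabilities for a single image and two CNNs
collect_demo = {
    'img.jpg': {
        'VGG16': [('wool', 0.45), ('pillow', 0.30), ('sock', 0.05)],
        'ResNet50': [('pajama', 0.08), ('comic_book', 0.07), ('sock', 0.02)],
    }
}
print(apply_threshold_sketch(collect_demo, threshold=0.1))
print(find_predictions_sketch(collect_demo, get=2))
```

In this sketch the threshold variant can return no label at all for an uncertain CNN, while the top-N variant always returns N labels, however low their probabilities are.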


### Save/Load Processed Data

In [21]:
ct.write_dict('processing',processing,DATA)
processing = ct.read_dict('processing',DATA)
print(len(processing))

write: D:/_DataScience/CapstoneProject/CnnGridSearch/processing.pickle
read: D:/_DataScience/CapstoneProject/CnnGridSearch/processing.pickle
2


##  Generate New Features from CNN Grid-Search Data

Find label frequency data per image across all CNNs.

In [30]:
%%time

newtags = dict()
#joind, found, count = processing['apply']

joind, found, count = sem.find_predictions(collect, get=5, show=0)
all_tags, file_tags, tags_list, count_num, count_cnn = sem.tag_grouping(joind, booster=False, show=2)

newtags['find'] = all_tags, file_tags, tags_list, count_num, count_cnn

print()
print('collect  ', len(collect))
print('joind    ', len(joind))
print('file_tags', len(file_tags))
print()
for cnn in count_cnn:
    print(count_cnn[cnn], cnn)
print()
for num in count_num:
    print(num, count_num[num])
print()     
print('0:',count_num[0])
print('2-5 :',sum([count_num[x] for x in range(2,5+1)]))
print('5-10:',sum([count_num[x] for x in range(5,len(models.keys())+1)]))
print('2-10:',sum([count_num[x] for x in range(2,len(models.keys())+1)]))

joind, found, count = sem.apply_threshold(collect, threshold=0.1, show=0)
all_tags, file_tags, tags_list, count_num, count_cnn = sem.tag_grouping(joind, booster=False, show=2)

newtags['apply'] = all_tags, file_tags, tags_list, count_num, count_cnn
print()
print('collect  ', len(collect))
print('joind    ', len(joind))
print('file_tags', len(file_tags))
print()
for cnn in count_cnn:
    print(count_cnn[cnn], cnn)
print()
for num in count_num:
    print(num, count_num[num])
print()     
print('0:',count_num[0])
print('2-5 :',sum([count_num[x] for x in range(2,5+1)]))
print('5-10:',sum([count_num[x] for x in range(5,len(models.keys())+1)]))
print('2-10:',sum([count_num[x] for x in range(2,len(models.keys())+1)]))

ct.write_dict('newtags',newtags,DATA)

100%|█████████████████████████████████████████████████████████████████████████| 32412/32412 [00:02<00:00, 11960.06it/s]



0000a68812bc7e98c42888dfb1c07da0.jpg 2
                                     LABEL ['confectionery*pillow*sock*wooden_spoon*wool', 'book_jacket*brassiere*comic_book*pillow*wool', 'brassiere*head_cabbage*shower_cap*wig*wool', 'brassiere*face_powder*handkerchief*rubber_eraser*wool', 'artichoke*confectionery*perfume*plastic_bag*wool', 'comic_book*jersey*maillot*pajama*pillow', 'book_jacket*comic_book*pajama*sock*sweatshirt', 'book_jacket*comic_book*pillow*sock*wool', 'bakery*comic_book*cowboy_hat*wooden_spoon*wool', 'artichoke*bakery*book_jacket*comic_book*wool']
                                     TAGCK ['confectionery', 'pillow', 'sock', 'wooden_spoon', 'wool', 'book_jacket', 'brassiere', 'comic_book', 'pillow', 'wool', 'brassiere', 'head_cabbage', 'shower_cap', 'wig', 'wool', 'brassiere', 'face_powder', 'handkerchief', 'rubber_eraser', 'wool', 'artichoke', 'confectionery', 'perfume', 'plastic_bag', 'wool', 'comic_book', 'jersey', 'maillot', 'pajama', 'pillow', 'book_jacket', 'comic_bo

  2%|█▋                                                                          | 733/32412 [00:00<00:08, 3643.26it/s]


collect   32412
joind     32412
file_tags 32412

143939 Xception
138194 VGG16
138894 VGG19
140296 DenseNet201
138125 NASNetLarge
129329 MobileNetV2
138902 ResNet50
139203 ResNet152
135120 InceptionV3
142924 IncResNetV2

0 0
1 0
2 0
3 4
4 156
5 605
6 1380
7 2554
8 3928
9 5594
10 18191

0: 0
2-5 : 765
5-10: 32252
2-10: 32412


100%|██████████████████████████████████████████████████████████████████████████| 32412/32412 [00:10<00:00, 3216.00it/s]



0000a68812bc7e98c42888dfb1c07da0.jpg 2
                                     LABEL ['pillow', 'book_jacket*pillow*wool', 'brassiere*head_cabbage*wool', 'face_powder', 'wool', 'jersey*pajama', 'comic_book*pajama*sock', 'book_jacket*comic_book', 'cowboy_hat*wool', 'artichoke']
                                     TAGCK ['pillow', 'book_jacket', 'pillow', 'wool', 'brassiere', 'head_cabbage', 'wool', 'face_powder', 'wool', 'jersey', 'pajama', 'comic_book', 'pajama', 'sock', 'book_jacket', 'comic_book', 'cowboy_hat', 'wool', 'artichoke']
                                     FOUND 4 wool ['VGG16', 'VGG19', 'NASNetLarge', 'InceptionV3']
                                     FOUND 2 pillow ['Xception', 'VGG16']
                                     FOUND 2 book_jacket ['VGG16', 'ResNet152']
                                     FOUND 2 pajama ['MobileNetV2', 'ResNet50']
                                     FOUND 2 comic_book ['ResNet50', 'ResNet152']
                                     REST 4 12

### Save/Load New-Tags Data

In [50]:
print(newtags.keys())
print(newtags['apply'][3].keys())


dict_keys(['find', 'apply'])
dict_keys([0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10])


### Print New-Tags aka CNN Grid-Search Features

In [31]:
show = 10
for file in file_tags:
    print(file,file_tags[file])
    show -= 1
    if show == 0:
        break

0000a68812bc7e98c42888dfb1c07da0.jpg (4, 12, ['book_jacket', 'comic_book', 'pajama', 'pillow', 'wool'])
00039780dfc94d01db8676fe789ecd05.jpg (9, 18, ['brass', 'face_powder', 'puck'])
000a190fdd715a2a36faed16e2c65df7.jpg (8, 16, ['face_powder', 'magnetic_compass', 'measuring_cup'])
00117e4fc239b1b641ff08340b429633.jpg (9, 16, ['gown', 'pajama', 'sarong'])
00136d1cf4edede0203f32f05f660588.jpg (7, 19, ['digital_watch', 'hair_spray', 'lotion', 'water_bottle'])
0013e7355ffc5ff8fb1ccad3e42d92fe.jpg (4, 11, ['jean', 'miniskirt', 'poncho', 'stole'])
00144a49c56599d45354a1c28104c039.jpg (5, 15, ['academic_gown', 'bearskin', 'drum', 'vestment'])
0014f61389cbaa687a58e38a97b6383d.jpg (9, 22, ['cloak', 'gown', 'hoopskirt', 'overskirt'])
0019a3c6755a194cb2e2c12bfc63972e.jpg (6, 15, ['bib', 'hair_slide', 'rubber_eraser', 'safety_pin'])
001be52b2beec40ddc1d2d7fc7a68f08.jpg (9, 14, ['clog', 'loafer', 'sandal'])
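
The tuples printed above appear to hold the highest CNN agreement for a label, the summed agreement of all shared labels, and the sorted list of labels found by at least two CNNs. That reading is inferred from the printed rows, not taken from `sem.tag_grouping()` itself. Under that assumption, the first row can be reproduced from the per-CNN label strings of the threshold run; `summarize_tags` is a hypothetical helper.

```python
from collections import Counter

# Per-CNN label strings for 0000a68812bc7e98c42888dfb1c07da0.jpg, copied from
# the threshold output; labels of one CNN are joined with '*'.
labels_per_cnn = [
    'pillow', 'book_jacket*pillow*wool', 'brassiere*head_cabbage*wool',
    'face_powder', 'wool', 'jersey*pajama', 'comic_book*pajama*sock',
    'book_jacket*comic_book', 'cowboy_hat*wool', 'artichoke',
]

def summarize_tags(labels_per_cnn, min_cnns=2):
    """Return (best count, summed counts, sorted tags) for tags seen by >= min_cnns CNNs."""
    counts = Counter()
    for joined in labels_per_cnn:
        # set() so that one CNN can vote for a label at most once
        counts.update(set(joined.split('*')))
    shared = {tag: n for tag, n in counts.items() if n >= min_cnns}
    if not shared:
        return (0, 0, [])
    return (max(shared.values()), sum(shared.values()), sorted(shared))

print(summarize_tags(labels_per_cnn))
# -> (4, 12, ['book_jacket', 'comic_book', 'pajama', 'pillow', 'wool'])
```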


## Cross Analyse New Grid-Search Feature with Shopee Features

In [19]:
%%time
classes_thr, classes_god, classes_all = sem.nlp_tag_grouping(joind, booster=False, show=5, threshold=3)


0000a68812bc7e98c42888dfb1c07da0.jpg 5 train_129225211
                                     FOUND 4 wool
                                     FOUND 2 pillow
                                     FOUND 2 book_jacket
                                     FOUND 2 pajama
                                     FOUND 2 comic_book
                                     REST 12 ['brassiere', 'head_cabbage', 'face_powder', 'jersey', 'sock', 'cowboy_hat', 'artichoke']

00039780dfc94d01db8676fe789ecd05.jpg 4 train_3386243561
                                     FOUND 9 puck
                                     FOUND 7 face_powder
                                     FOUND 2 brass
                                     REST 18 ['bottlecap', 'barrel']

000a190fdd715a2a36faed16e2c65df7.jpg 3 train_2288590299
                                     FOUND 8 face_powder
                                     FOUND 6 measuring_cup
                                     FOUND 2 magnetic_compass
                       

In [20]:
print(len(classes_thr))
print(classes_thr['train_129225211'])
print(classes_god['train_129225211'])
print(classes_all['train_129225211'])

34250
['wool', 'wool', 'wool', 'wool']
['wool', 'wool', 'wool', 'wool', 'pillow', 'pillow', 'book_jacket', 'book_jacket', 'pajama', 'pajama', 'comic_book', 'comic_book']
['wool', 'wool', 'wool', 'wool', 'pillow', 'pillow', 'book_jacket', 'book_jacket', 'pajama', 'pajama', 'comic_book', 'comic_book', 'brassiere', 'head_cabbage', 'face_powder', 'jersey', 'sock', 'cowboy_hat', 'artichoke']
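
The three lists printed above differ only in which labels they keep: labels that reach the threshold, labels shared by at least two CNNs, and all labels. A minimal sketch that reproduces them for `train_129225211` is shown below. It assumes that a label is repeated once per agreeing CNN and that the threshold test is `>=`; the output above cannot distinguish `>=` from `>`, and `weighted_token_lists` is a hypothetical name.

```python
from collections import Counter

def weighted_token_lists(tags, threshold=3, min_shared=2):
    """Build the 'thr', 'god' and 'all' style token lists from a flat tag list."""
    # highest count first; ties stay in first-seen order on Python 3.7+
    counts = Counter(tags).most_common()
    thr = [t for t, n in counts if n >= threshold for _ in range(n)]
    god = [t for t, n in counts if n >= min_shared for _ in range(n)]
    rest = [t for t, n in counts if n < min_shared]  # single votes, kept once
    return thr, god, god + rest

# flat tag list (TAGCK) of the threshold run for the first image
tags = ['pillow', 'book_jacket', 'pillow', 'wool', 'brassiere', 'head_cabbage',
        'wool', 'face_powder', 'wool', 'jersey', 'pajama', 'comic_book',
        'pajama', 'sock', 'book_jacket', 'comic_book', 'cowboy_hat', 'wool',
        'artichoke']

thr, god, everything = weighted_token_lists(tags, threshold=3)
print(thr)
print(god)
print(everything)
```

Repeating a label once per vote is what later lets a bag-of-words style model weight strongly agreed labels more heavily than single guesses.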


# Word2Vec

This NLP solution is courtesy of Niels-Christian Leight and Elias Büchner, taken from: NLP-Notebook.ipynb


### Preprocessing etc.

In [21]:
# Kindly borrowed from Nikhil Manali
# https://www.kaggle.com/coder247/similarity-using-word2vec-text
def lemmatize_stemming(text):
    return stemmer.stem(WordNetLemmatizer().lemmatize(text, pos='v'))

def preprocess(text):
    result = []
    for token in gensim.utils.simple_preprocess(text):
        if token not in gensim.parsing.preprocessing.STOPWORDS and len(token) > 3:
            if token == 'xxxx':
                continue
            result.append(lemmatize_stemming(token))
    
    return result

def word2vec_model(processed_docs, size_feat_vec=50):
    w2v_model = Word2Vec(min_count=1,
                         window=3,
                         size=size_feat_vec,
                         sample=6e-5, 
                         alpha=0.03, 
                         min_alpha=0.0007, 
                         negative=20)
    
    w2v_model.build_vocab(processed_docs)
    w2v_model.train(processed_docs, 
                    total_examples=w2v_model.corpus_count, 
                    epochs=300, 
                    report_delay=1)
    
    return w2v_model

In [22]:
df_train_w2v = load_trainCsv()
processed_docs = df_train_w2v['title'].map(preprocess)
processed_docs = list(processed_docs)

df_train_w2v['preprocess_title']=processed_docs
df_train_w2v[['posting_id','preprocess_title']][0:2]

Unnamed: 0,posting_id,preprocess_title
0,train_129225211,"[paper, victoria, secret]"
1,train_3386243561,"[doubl, tape, origin, doubl, foam, tape]"


### Creation of DataFrame() with CNNSemanticClustering data

In [30]:
data = dict()
features = dict() 
df = pd.DataFrame()

key = 'all' # 'thr'  'god'  'all'

if key == 'thr':
    classes = classes_thr
elif key == 'god':
    classes = classes_god
else:
    classes = classes_all

print(len(classes_thr))
print(classes_thr['train_129225211'])

for pid in classes:
    label_group = str(dict_begin['dic_posting_id_label_group'][pid][0])
    features.update({'posting_id':pid})
    features.update({'label_group':label_group})
    # note: the title is always built from classes_thr, whatever `key` selects
    features.update({'title':' '.join(classes_thr[pid])})
    df = df.append(features, ignore_index=True)

# convert dict() into pd.DataFrame()
df.head(5)

34250
['wool', 'wool', 'wool', 'wool']


Unnamed: 0,label_group,posting_id,title
0,249114794,train_129225211,wool wool wool wool
1,2937985045,train_3386243561,puck puck puck puck puck puck puck puck puck f...
2,2395904891,train_2288590299,face_powder face_powder face_powder face_powde...
3,4093212188,train_2406599165,pajama pajama pajama pajama pajama pajama paja...
4,3648931069,train_3369186413,lotion lotion lotion lotion lotion lotion loti...


In [31]:
#if not 'df_train_w2v' in locals():
#    df_train_w2v = dict()

df_train_w2v = dict()

df_train_w2v[key] = df.copy()

processed_docs = dict()
processed_docs[key] = df_train_w2v[key]['title'].map(preprocess)
processed_docs[key] = list(processed_docs[key])

df_train_w2v[key]['preprocess_title']=processed_docs[key]
df_train_w2v[key][['posting_id','preprocess_title']][0:5]

Unnamed: 0,posting_id,preprocess_title
0,train_129225211,"[wool, wool, wool, wool]"
1,train_3386243561,"[puck, puck, puck, puck, puck, puck, puck, puc..."
2,train_2288590299,"[face_powd, face_powd, face_powd, face_powd, f..."
3,train_2406599165,"[pajama, pajama, pajama, pajama, pajama, pajam..."
4,train_3369186413,"[lotion, lotion, lotion, lotion, lotion, lotio..."


In [40]:
processed_docs[key][13]

['odomet',
 'odomet',
 'odomet',
 'odomet',
 'analog_clock',
 'analog_clock',
 'analog_clock',
 'analog_clock']

### Building a Word2Vec Model

In [32]:
%%time
build_new_model_bool = False

processed = list(processed_docs[key])

if build_new_model_bool:
    w2v_model = word2vec_model(processed, size_feat_vec=50)
    w2v_model.save(DATA+'word2vec_model_'+key+'.pickle')
else:
    # load with gensim's own loader, the counterpart of w2v_model.save()
    w2v_model = Word2Vec.load(DATA+'word2vec_model_'+key+'.pickle')

if not 'emb_vec' in locals():
    emb_vec = dict()
emb_vec[key] = w2v_model.wv

Wall time: 18.5 ms


In [33]:
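from numpy.linalg import norm  # needed for the L2 normalisation below

def get_feature_vec_v2(sen1, model, size_feat_vec=50):
    # sum the word vectors of all tokens and scale the result to unit length
    sen_vec1 = np.zeros(size_feat_vec)
    for val in sen1:
        sen_vec1 = np.add(sen_vec1, model[val])    
    return sen_vec1/norm(sen_vec1)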

df_train_w2v[key]['word_vec'] = df_train_w2v[key].apply(
    lambda row: get_feature_vec_v2(row['preprocess_title'], emb_vec[key]), axis=1)

print(df_train_w2v[key].shape)
df_train_w2v[key].head(2)

  


(34250, 5)


Unnamed: 0,label_group,posting_id,title,preprocess_title,word_vec
0,249114794,train_129225211,wool wool wool wool,"[wool, wool, wool, wool]","[0.006743095288388861, -0.004858696665403262, ..."
1,2937985045,train_3386243561,puck puck puck puck puck puck puck puck puck f...,"[puck, puck, puck, puck, puck, puck, puck, puc...","[-0.20203430882502466, 0.006678147491851313, -..."
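
The `word_vec` column holds one unit-length vector per posting: the sum of the word vectors of its tokens, divided by the norm of that sum. The plain-Python sketch below illustrates the same idea with made-up two-dimensional embeddings. The guards for unknown tokens and for an all-zero sum are additions of this sketch and are not part of the notebook's function.

```python
import math

def sentence_vector(tokens, vectors, size):
    """Sum the token vectors and scale the result to unit length."""
    total = [0.0] * size
    for tok in tokens:
        if tok in vectors:  # skip tokens without an embedding
            total = [a + b for a, b in zip(total, vectors[tok])]
    length = math.sqrt(sum(x * x for x in total))
    # an all-zero sum (e.g. no known tokens) cannot be normalised
    return total if length == 0.0 else [x / length for x in total]

toy_vectors = {'wool': [3.0, 4.0], 'pillow': [1.0, 0.0]}  # made-up embeddings
print(sentence_vector(['wool'], toy_vectors, size=2))
print(sentence_vector(['wool', 'wool', 'wool', 'wool'], toy_vectors, size=2))
print(sentence_vector(['wool', 'wool', 'pillow'], toy_vectors, size=2))
```

Because of the normalisation, a posting that contains one label repeated four times gets the same vector as a posting that contains it once. The repetition only matters when several different labels are mixed.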


### Calculate distances

In [34]:
id_1=8400
i_vec_1   = df_train_w2v[key]['word_vec'][id_1]
i_vec_all = df_train_w2v[key]['word_vec'].values
i_vec_all = np.vstack(i_vec_all)

labels     = df_train_w2v[key]['label_group'].to_list()
posting_id = df_train_w2v[key]['posting_id'].to_list()
list1      = list(sc.get_sim_all_pi(i_vec_1,i_vec_all))

if not 'df_nlp' in locals():
    df_nlp = dict()
df_nlp[key] = pd.DataFrame(data=[list1,labels,posting_id]).transpose()
df_nlp[key] = df_nlp[key].sort_values(by=[0,1],ascending=False)
df_nlp[key].head(10)

Unnamed: 0,0,1,2
8400,1.0,912146474,train_598686012
17008,1.0,912146474,train_578257500
22962,0.994298,912146474,train_395732057
21920,0.944525,3185007248,train_574223332
25472,0.937145,452508504,train_3343934633
5006,0.937145,4288078681,train_635337083
28614,0.937139,3185007248,train_4082031443
14260,0.935965,1790344296,train_657274189
3326,0.935072,912146474,train_2710843880
26938,0.935072,2393818132,train_3813210970
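
The first column of the table above is a similarity score between posting 8400 and every other posting. `sc.get_sim_all_pi()` is not shown here; assuming it computes a cosine similarity, the ranking can be sketched with a dot product, since all `word_vec` entries already have unit length. The function name and the toy vectors are hypothetical.

```python
def cosine_rank(query, rows):
    """Rank (posting_id, unit_vector) rows by similarity to a unit-length query."""
    # for unit vectors the cosine similarity is simply the dot product
    scored = [(sum(q * v for q, v in zip(query, vec)), pid) for pid, vec in rows]
    return sorted(scored, reverse=True)  # most similar first

toy_rows = [
    ('post_a', [1.0, 0.0]),
    ('post_b', [0.6, 0.8]),
    ('post_c', [0.0, 1.0]),
]
for score, pid in cosine_rank([1.0, 0.0], toy_rows):
    print(pid, score)
```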


### Create clusters optimized on recall

In [211]:
%%time
rec_values = []
cl_size = []

key = 'all' # 'thr'  'god'  'all'

threshold = 0.98

#for i in list(np.random.randint(34250, size=50)): <- ... does not work here
for i in range(100):
    clreal = sc.real_cluster_of_i_w2v(i*100,df_train_w2v[key])
    clpred = sc.pred_cluster_of_i_w2v(i*100,threshold,df_train_w2v[key],labels,posting_id)
    rec_values.append(recall_i(clreal,clpred))
    cl_size.append(len(clpred))
    
print("Mean Recall: ",sum(rec_values)/len(rec_values), "  Mean Length of cluster: ", sum(cl_size)/len(cl_size))

Mean Recall:  0.5294256514727107   Mean Length of cluster:  188.68
Wall time: 4min 56s
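
To make the recall figure more concrete, the sketch below applies the threshold of 0.98 to the ten most similar postings listed earlier for index 8400. It assumes that the predicted cluster contains every posting with a similarity of at least the threshold and that the real cluster consists of all postings with the same `label_group`. The helper names are hypothetical, and the result covers only these ten rows; the real cluster may have further members elsewhere in the data.

```python
# (similarity, label_group, posting_id) rows copied from the distance table
rows = [
    (1.0,      912146474,  'train_598686012'),
    (1.0,      912146474,  'train_578257500'),
    (0.994298, 912146474,  'train_395732057'),
    (0.944525, 3185007248, 'train_574223332'),
    (0.937145, 452508504,  'train_3343934633'),
    (0.937145, 4288078681, 'train_635337083'),
    (0.937139, 3185007248, 'train_4082031443'),
    (0.935965, 1790344296, 'train_657274189'),
    (0.935072, 912146474,  'train_2710843880'),
    (0.935072, 2393818132, 'train_3813210970'),
]

def predicted_cluster(rows, threshold):
    return {pid for sim, _, pid in rows if sim >= threshold}

def real_cluster(rows, label_group):
    return {pid for _, group, pid in rows if group == label_group}

def recall(real, pred):
    # share of the real cluster that the prediction recovers
    return len(real & pred) / len(real) if real else 0.0

pred = predicted_cluster(rows, threshold=0.98)
real = real_cluster(rows, label_group=912146474)
print(recall(real, pred))  # 0.75 on this excerpt
```

Lowering the threshold would recover the fourth member of the label group, but only at the price of pulling in five postings from other groups, which is the trade-off behind the large mean cluster size reported above.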


### Cluster creation for all-2-all (34250x34250): threshold = 0.99

In [247]:
%%time
already_done = True
already_done = False

threshold = 0.99
DO = 34250
DO = 100
key = 'all' # 'thr'  'god'  'all'

labels     = df_train_w2v[key]['label_group'].to_list()
posting_id = df_train_w2v[key]['posting_id'].to_list()
list1      = list(sc.get_sim_all_pi(i_vec_1,i_vec_all))

if already_done == False:

    dict_nlp_prec_all_97 = {}
    list_post_id = df_train_w2v[key]['posting_id'].tolist()
   
    for i in range(25501,34250):    
        dict_nlp_prec_all_97[list_post_id[i]] = set(sc.pred_cluster_of_i_w2v(i,threshold,df_train_w2v[key],labels,posting_id))
        if i%1000 == 0:    # Display progress and save 
            print(i)
            pickle.dump(dict_nlp_prec_all_97, open( DATA+"dict_nlp_prec_all_97_run1.pickle", "wb" ) )
    pickle.dump(dict_nlp_prec_all_97, open( DATA+"dict_nlp_prec_all_97_run1.pickle", "wb" ) )    # Final save

26000
27000
28000
29000
30000
31000
32000
33000
34000
Wall time: 8h 11min 13s


In [12]:
# Load results
dict_nlp_prec_all_97_load = pickle.load( open( DATA+"dict_nlp_prec_all_97_run1.pickle", "rb" ) )
temp = pickle.load( open( DATA+"dict_nlp_prec_all_97_run2.pickle", "rb" ) )
dict_nlp_prec_all_97_load.update(temp)
temp = pickle.load( open( DATA+"dict_nlp_prec_all_97_run3.pickle", "rb" ) )
dict_nlp_prec_all_97_load.update(temp)
temp = pickle.load( open( DATA+"dict_nlp_prec_all_97_run4.pickle", "rb" ) )
dict_nlp_prec_all_97_load.update(temp)

### Calculate the f1-Score

In [38]:
%%time
# Combine two predictions
def combi_pred(pred1, pred2):

    combi_set = {} 
    for i in pred1.keys():
        assert i in set(pred2.keys())
        combi_set[i] = pred1[i].union(pred2[i])
    return combi_set

# Load results from the pHash Notebook
dict_phash_prec_all_9_load = pickle.load( open( DATA+"dict_phash_prec_all_9.pickle", "rb" ) )

pred_nlp_phash = combi_pred(dict_nlp_prec_all_97_load, dict_phash_prec_all_9_load)

Wall time: 52.5 s


In [37]:
%%time
# Calculate F1-Score of the combination NLP and pHash

fscores = []

for i in range(1000,1100):
    x = sc.f_score_i(sc.real_cluster_of_i_w2v(i,df_train_w2v[key]), pred_nlp_phash[df_train_w2v[key].posting_id.values[i]])
    fscores.append(x)
    
print("F1-Score: ", sum(fscores)/len(fscores))

F1-Score:  0.28776551732169997
Wall time: 210 ms


### Calculate the f1-Score (with pHash)

In [36]:
# NLP-only prediction: the union of the NLP clusters with themselves (no pHash)
pred_nlp_phash = combi_pred(dict_nlp_prec_all_97_load, dict_nlp_prec_all_97_load)

In [39]:
%%time
# Calculate F1-Score of the combination NLP and pHash

fscores = []

for i in range(1000,1100):
    x = sc.f_score_i(sc.real_cluster_of_i_w2v(i,df_train_w2v[key]), pred_nlp_phash[df_train_w2v[key].posting_id.values[i]])
    fscores.append(x)
    
print("F1-Score: ", sum(fscores)/len(fscores))

F1-Score:  0.31655435955308664
Wall time: 250 ms
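
`sc.f_score_i()` is not shown in this notebook. A common per-posting definition, and the one assumed in this sketch, is the F1 score between the real and the predicted set of posting IDs. The toy sets below are made up and only illustrate why the union of two predictions, as built by `combi_pred()`, can score higher than one prediction alone.

```python
def f1_of_cluster(real, pred):
    """F1 between two ID sets: 2 * |intersection| / (|real| + |pred|)."""
    real, pred = set(real), set(pred)
    if not real or not pred:
        return 0.0
    return 2 * len(real & pred) / (len(real) + len(pred))

real = {'a', 'b'}
pred_nlp = {'a', 'x'}     # one hit, one false positive
pred_phash = {'a', 'b'}   # both members found

print(f1_of_cluster(real, pred_nlp))               # 0.5
print(f1_of_cluster(real, pred_nlp | pred_phash))  # 0.8
```

The union never loses a correct match, but it also keeps every false positive of both methods, so the combined score is not guaranteed to improve.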


## Conclusion

Measured on a sample of 100 postings, the CNN semantic clustering on its own reaches an **f1-Score** of **0.288**, which is not as good as expected. Together with the **pHash** method the **f1-score** reaches **0.317**.

With this result the method is not suited for prediction and cluster creation in the **Shopee** data set. The NLP algorithm was used as is and may not be set up correctly for CNNSemanticClustering data - see **Future Work**.

## Future Work

Due to time constraints the hyperparameters for Word2Vec were not optimized. The **f1-Score** could potentially be improved by hyperparameter tuning.

However, experience shows that the improvement from hyperparameter tuning is often limited.

Methods like **Caption Prediction** or a **Self-Trained CNN** seem more promising.