# NLP - Notebook

In this notebook we try to cluster our data set using NLP (Natural Language Processing) methods. Since the data set contains not only visual information but also very specific product titles, it seems likely that this text information can contribute substantially to the accuracy of our predictions.

At the end of this notebook, the results we got from our NLP approach are combined with those from the pHash-Notebook.

## Preliminaries

In [1]:
import warnings
warnings.filterwarnings("ignore")

In [2]:
# Load all the necessary libraries
import pandas as pd
import numpy as np
from numpy import dot
from numpy.linalg import norm
import pickle
import time
import random
np.random.seed(2018)

import nltk
from tensorflow.keras.preprocessing import text

from sklearn.decomposition import PCA
from sklearn.feature_extraction.text import TfidfVectorizer

import gensim
from gensim.utils import simple_preprocess
from gensim.parsing.preprocessing import STOPWORDS
from nltk.stem import WordNetLemmatizer, SnowballStemmer
from nltk.stem.porter import *
from gensim.models import Word2Vec
#nltk.download('wordnet')
stemmer = SnowballStemmer('english')



In [3]:
# Load the .csv file containing the text information to each image 
MY_PATH = './data'
def load_trainCsv():
    df = pd.read_csv(MY_PATH + "/train.csv")   
    return df

# Functions for evaluation

In this notebook, we will build two NLP models: one based on the TfidfVectorizer from sklearn and one based on Word2Vec. In order to evaluate these models later on, we first define some helper functions.

In [4]:
def f_score_i(cl_real_i, cl_pred_i):
    '''
    Description:
    Calculate f-score for a single posting_id
    f1-score is the mean of all f-scores
    
    Parameters: 
    argument1 (list): list of posting_id's belonging to the real cluster
    argument2 (list): list of posting_id's belonging to the predicted cluster
    
    Returns: 
    float value of f-score   
    '''
    
    s_pred = set(cl_pred_i)
    s_real = set(cl_real_i)
    s_intsec = s_pred.intersection(s_real)

    return 2*len(s_intsec) / (len(s_pred)+len(s_real))


def recall_i(cl_real_i, cl_pred_i):      
    '''
    Description:
    Calculate recall for a single posting_id
    
    Parameters: 
    argument1 (list): list of posting_id's belonging to the real cluster
    argument2 (list): list of posting_id's belonging to the predicted cluster
    
    Returns: 
    float value of recall   
    '''
        
    s_pred = set(cl_pred_i)
    s_real = set(cl_real_i)   
    s_diff_r_p = s_real.difference(s_pred)
    
    return (len(s_real) - len(s_diff_r_p)) / len(s_real) 


def precision_i(cl_real_i, cl_pred_i):      
    '''
    Description:
    Calculate precision for a single posting_id
    
    Parameters: 
    argument1 (list): list of posting_id's belonging to the real cluster
    argument2 (list): list of posting_id's belonging to the predicted cluster
    
    Returns: 
    float value of precision   
    '''
    
    s_pred = set(cl_pred_i)
    s_real = set(cl_real_i)    
    s_diff_p_r = s_pred.difference(s_real)
    
    return (len(s_pred) - len(s_diff_p_r)) / len(s_pred)

In [5]:
def real_cluster_of_i_w2v(i):
    '''
    Description:
    Find real cluster for a single posting_id
    Use this function when working with Word2Vec
    
    Parameters: 
    argument1 (int): position of posting_id in DataFrame
    
    Returns: 
    list of all posting_id's  
    '''
    
    l_g = (df_train_w2v.iloc[i].at['label_group'])
    df_red = df_train_w2v[df_train_w2v['label_group'] == l_g]
    df_red_list = df_red['posting_id'].tolist()
    return df_red_list


def real_cluster_of_i_tfidf(i):
    '''
    Description:
    Find real cluster for a single posting_id
    Use this function when working with TfidfVectorizer
    
    Parameters: 
    argument1 (int): position of posting_id in DataFrame
    
    Returns: 
    list of all posting_id's  
    '''
        
    l_g = (df_train_tfidf.iloc[i].at['label_group'])
    df_red = df_train_tfidf[df_train_tfidf['label_group'] == l_g]
    df_red_list = df_red['posting_id'].tolist()
    return df_red_list


def pred_cluster_of_i_tfidf(i,threshold):
    '''
    Description:
    Find predicted cluster for a single posting_id
    Use this function when working with TfidfVectorizer
    
    Parameters: 
    argument1 (int): position of posting_id in DataFrame
    argument2 (float): maximum Euclidean distance for an image to be included
    
    Returns: 
    list of all posting_id's  
    '''
    
    list1 = []
    list2 = []
    list3 = []
    
    for j in range(len(corpus)):
        list1.append(round(dist2(i, j),3))
        list2.append(labels_tfidf[j])
        list3.append(posting_id_tfidf[j])
        
    df_nlp = pd.DataFrame(data = [list1,list2,list3]).transpose()    
    df_nlp = df_nlp[df_nlp[0] <= threshold]    
    ls = df_nlp[2].tolist()
    
    return ls


def pred_cluster_of_i_w2v(i,threshold):
    '''
    Description:
    Find predicted cluster for a single posting_id
    Use this function when working with Word2Vec
    
    Parameters: 
    argument1 (int): position of posting_id in DataFrame
    argument2 (float): minimum cosine similarity for an image to be included
    
    Returns: 
    list of all posting_id's  
    '''

    list1 = []
    list2 = []
    list3 = []
        
    for j in range(len(df_train_w2v)):

        i_vec_1 = df_train_w2v['word_vec'][j]
        i_vec_2 = df_train_w2v['word_vec'][i]
        list1.append(round(get_sim_two_pi(i_vec_1, i_vec_2),4))
        
        list2.append(labels[j])
        list3.append(posting_id[j])
                
    df_nlp = pd.DataFrame(data = [list1,list2,list3]).transpose()
    df_nlp = df_nlp.sort_values(by = 0)
    
    df_nlp = df_nlp[df_nlp[0] >= threshold]
    
    ls = df_nlp[2].tolist()
    
    return ls

# TfidfVectorizer

## Creating a feature vector

In order to compare two images, we first transform each title into a feature vector. The distance between two images can then be measured by the distance between their title vectors.

In [6]:
# Create a function to clean all the titles of our images
def clean_title(title):
    title = title.replace("-", " ").replace("//", " ").replace("[", " ").replace("]", " ").replace("(", " ").replace(")", " ")
    # Note: replacing "x" also removes the letter x inside ordinary words (e.g. "box" becomes "bo ")
    title = title.replace("+", " ").replace("/", " ").replace("x", " ").replace("\\", " ")
    title = title.replace(",", " ")
    return title

In [7]:
df_train_tfidf = load_trainCsv()
df_train_tfidf['cleanTitle'] = df_train_tfidf['title'].apply(clean_title)

In [8]:
corpus = df_train_tfidf['cleanTitle'].to_list()
tokenizer = text.Tokenizer()
tokenizer.fit_on_texts(corpus)
word_index = tokenizer.word_index

st_words = nltk.corpus.stopwords.words("english")
st_words.extend(nltk.corpus.stopwords.words("indonesian"))

# Treat all words of one or two characters and all purely numeric words as stop words
for w in word_index.keys():
    if len(w)<3:
        st_words.append(w)
    if w.isnumeric():
        st_words.append(w)

The hyperparameter max_features is very important. It determines the length of the feature vectors. Longer vectors keep more of the vocabulary and therefore tend to make our predictions more precise, but they also make the calculations slower.

In [9]:
# Now vectorize all the strings in the corpus; use TfidfVectorizer from Sklearn
vectorizer = TfidfVectorizer(stop_words=st_words, max_features=2500)   
vectorizer.fit(corpus)
label_vec = vectorizer.transform(corpus)

label_vec = label_vec.toarray()
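
As a minimal sketch of how max_features bounds the length of the feature vectors, the following toy example vectorizes three made-up titles. The titles and variable names are assumptions for illustration and are not taken from the data set.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Three made-up product titles (hypothetical, not taken from train.csv)
toy_corpus = ["red cotton shirt", "blue cotton shirt", "leather wallet"]

# Without a limit, every distinct word becomes one vector component
full_vec = TfidfVectorizer().fit_transform(toy_corpus)

# With max_features=4, only the 4 most frequent words are kept
small_vec = TfidfVectorizer(max_features=4).fit_transform(toy_corpus)

print(full_vec.shape)   # (number of titles, size of the full vocabulary)
print(small_vec.shape)  # (number of titles, at most max_features)
```

Words that fall outside the kept vocabulary are simply ignored, so a title made up only of rare words can end up with an all-zero vector.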

### Defining a distance function

In theory there are infinitely many ways to define a distance function, since the Minkowski metric can be used with any real number p ≥ 1. Here we only experimented with p=1 (the so-called Manhattan metric) and p=2 (the Euclidean metric). It turned out that the Euclidean metric gives us better results.

In [10]:
def dist2(x,y):          # x,y indicate the position of the two images in our DataFrame
    a = label_vec[x]
    b = label_vec[y]
    dist = np.sqrt(sum([(a[i] - b[i])**2 for i in range(label_vec.shape[1])]))        # (Euclidean Metric)
    #dist = sum([abs((a[i] - b[i])) for i in range(label_vec.shape[1])])              # (Manhattan-Metric)
    
    return dist
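
The dist2 function above loops over the vector components in pure Python. A possible alternative, shown here only as a sketch on toy data, computes the Minkowski distance from one row to all rows at once with NumPy. The function name minkowski_to_all and the toy array are assumptions for illustration.

```python
import numpy as np

def minkowski_to_all(vectors, idx, p=2):
    """Minkowski distance (order p) from row idx to every row of vectors."""
    diff = np.abs(vectors - vectors[idx])          # broadcast over all rows
    return (diff ** p).sum(axis=1) ** (1.0 / p)

# Toy feature vectors (hypothetical stand-in for label_vec)
toy = np.array([[0.0, 0.0],
                [3.0, 4.0],
                [1.0, 1.0]])

d_euclid = minkowski_to_all(toy, 0, p=2)      # Euclidean metric
d_manhattan = minkowski_to_all(toy, 0, p=1)   # Manhattan metric
print(d_euclid, d_manhattan)
```

On the real data, a call such as minkowski_to_all(label_vec, 8400) would play the role of the loop over dist2(8400, i) and should give the same values up to floating-point rounding.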

### Finding semantically similar pictures

We now have a look at a particular picture and display the 15 images out of our data set which are semantically closest to this picture.

In [11]:
labels_tfidf = df_train_tfidf['label_group'].to_list()
posting_id_tfidf = df_train_tfidf['posting_id'].to_list()

dist_ls = []

for i in range(len(corpus)):
    dist_ls.append(round(dist2(8400, i),2))

df_nlp = pd.DataFrame(data = [dist_ls,labels_tfidf,posting_id_tfidf]).transpose()    

df_nlp = df_nlp.sort_values(by = 0)
df_nlp.head(15)

| | 0 | 1 | 2 |
|---|---|---|---|
| 8400 | 0.0 | 912146474 | train_598686012 |
| 14689 | 0.29 | 912146474 | train_4187857506 |
| 17008 | 0.29 | 912146474 | train_578257500 |
| 3326 | 0.51 | 912146474 | train_2710843880 |
| 22962 | 0.67 | 912146474 | train_395732057 |
| 20890 | 1.0 | 30737591 | train_462541577 |
| 7262 | 1.0 | 2077604114 | train_4161009603 |
| 21009 | 1.0 | 2545212005 | train_1675502287 |
| 4097 | 1.0 | 179570104 | train_3639374022 |
| 27513 | 1.0 | 599988605 | train_3547727499 |


Column 0 shows the distance between image 8400 and the image in the given row, column 1 shows the cluster (label_group) that image belongs to, and column 2 shows its posting_id. We notice that the semantically nearest neighbours of image 8400 are images which actually belong to the same cluster. (Of course, our metric does not always perform that well.)

One important hyperparameter we will have to tune is the threshold. This threshold determines how "close" another image has to be to the image we are currently looking at in order to assume that the two images display the same product. In the table above, all images from the same cluster lie within a distance of 0.67 and the first images from other clusters appear at 1.0, which suggests that the threshold could be at around 0.85.

### Estimating the F1-Score

In order to estimate the F1-Score we can expect from this method, we now apply our f_score_i function to a sample of 100 images (every 100th image among the first 10000). The exact F1-Score would be the mean of these f-score values over all 34250 images.

What does the f_score_i function do? It first constructs the intersection of the two sets that represent the real cluster and the predicted cluster. It then divides the size of this intersection by the sum of the sizes of the real and the predicted cluster. The result is multiplied by two and returned.
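
As a small worked example of this calculation, the sketch below uses two made-up clusters and checks that the set-based formula agrees with the usual harmonic mean of precision and recall. The function name f_score and the ids are assumptions for illustration; the logic mirrors f_score_i.

```python
def f_score(real, pred):
    """Set-based f-score: 2 * |real ∩ pred| / (|real| + |pred|)."""
    real, pred = set(real), set(pred)
    return 2 * len(real & pred) / (len(real) + len(pred))

# Made-up clusters (hypothetical ids)
real_cluster = ["a", "b", "c", "d"]
pred_cluster = ["c", "d", "e"]

overlap = set(real_cluster) & set(pred_cluster)      # {"c", "d"}
precision = len(overlap) / len(set(pred_cluster))    # 2 / 3
recall = len(overlap) / len(set(real_cluster))       # 2 / 4

f_from_sets = f_score(real_cluster, pred_cluster)            # 2*2 / (4+3) = 4/7
f_from_pr = 2 * precision * recall / (precision + recall)    # harmonic mean

print(round(f_from_sets, 4), round(f_from_pr, 4))
```

Both values are 4/7, so the shortcut used in f_score_i is just another way of writing the harmonic mean of precision and recall.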

In [12]:
pred_tfidf = []

for i in range(100):
    clreal = real_cluster_of_i_tfidf(i*100)
    clpred = pred_cluster_of_i_tfidf(i*100,0.7)
    pr = f_score_i(clreal,clpred)
    pred_tfidf.append(pr)

In [14]:
sum(pred_tfidf)/len(pred_tfidf)

0.5831381080605215

### Excursion: PCA

In order to decrease the calculation time, we now will do a dimensionality reduction via Principal Component Analysis. 

In [15]:
def create_fit_PCA(data, n_components=300):
    p = PCA(n_components=n_components, random_state=42)
    p.fit(data)    
    return p

feat_vec_pca = create_fit_PCA(label_vec)
vec_pca = feat_vec_pca.transform(label_vec)

def dist_pca(x,y):          # x,y indicate the position of the two images in our DataFrame
    a = vec_pca[x]
    b = vec_pca[y]
    # Note: only the first 100 of the 300 PCA components enter the distance
    dist = np.sqrt(sum([(a[i] - b[i])**2 for i in range(100)]))           # p=2 (Euclidean Metric)
    #dist = sum([abs((a[i] - b[i])) for i in range(100)])                 # p=1 (Manhattan-Metric)
    
    return dist

In [16]:
dist_ls_pca = []

for i in range(len(corpus)):
    dist_ls_pca.append(round(dist_pca(1100, i),2))

df_nlp = pd.DataFrame(data = [dist_ls_pca,labels_tfidf,posting_id_tfidf]).transpose()    

df_nlp = df_nlp.sort_values(by = 0)
df_nlp.head(15)

| | 0 | 1 | 2 |
|---|---|---|---|
| 1100 | 0.0 | 3697043299 | train_804001417 |
| 6175 | 0.11 | 410540079 | train_3958025844 |
| 2513 | 0.11 | 387772861 | train_3913619870 |
| 973 | 0.11 | 3697043299 | train_1337940039 |
| 7601 | 0.11 | 1377712939 | train_1496501644 |
| 31227 | 0.12 | 3697043299 | train_2388554702 |
| 12453 | 0.12 | 410540079 | train_625253343 |
| 9916 | 0.12 | 961662899 | train_1414655801 |
| 13127 | 0.13 | 1460273711 | train_2153757835 |
| 24479 | 0.14 | 4138426580 | train_1312282384 |


We see that calculation time decreases a lot. However, the results are not very reliable: in the example above, only two of the nine nearest neighbours of image 1100 belong to its cluster (3697043299). Future work on this issue could include checking more precisely the benefits and the limits of the PCA method for the NLP approach.
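
One possible starting point for such a check is to look at how much of the variance the retained components preserve. The sketch below does this on synthetic random data, which is an assumption standing in for the TF-IDF matrix; the retained variance is only a proxy and does not replace measuring the F1-Score.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# Synthetic stand-in for the TF-IDF matrix: 200 samples, 50 features
toy_features = rng.normal(size=(200, 50))

pca = PCA(n_components=10, random_state=42).fit(toy_features)
reduced = pca.transform(toy_features)

# Cumulative share of the variance retained by the first k components
cumulative = np.cumsum(pca.explained_variance_ratio_)

print(reduced.shape)     # (samples, n_components)
print(cumulative[-1])    # share of variance kept by all 10 components
```

Repeating this for several values of n_components, together with the f-score estimate from above, would show where the trade-off between speed and accuracy becomes acceptable.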

# Word2Vec

As the TfidfVectorizer approach needs quite a lot of computation time, we now try out a second NLP method: Word2Vec.

### Preprocessing etc.

In [17]:
# Kindly borrowed from Nikhil Manali
# https://www.kaggle.com/coder247/similarity-using-word2vec-text
def lemmatize_stemming(text):
    return stemmer.stem(WordNetLemmatizer().lemmatize(text, pos='v'))

def preprocess(text):
    result = []
    for token in gensim.utils.simple_preprocess(text):
        if token not in gensim.parsing.preprocessing.STOPWORDS and len(token) > 3:
            if token == 'xxxx':
                continue
            result.append(lemmatize_stemming(token))
    
    return result

def word2vec_model(size_feat_vec=50):
    w2v_model = Word2Vec(min_count=1,
                     window=3,
                     vector_size=size_feat_vec,
                     sample=6e-5, 
                     alpha=0.03, 
                     min_alpha=0.0007, 
                     negative=20)
    
    w2v_model.build_vocab(processed_docs)
    w2v_model.train(processed_docs, total_examples=w2v_model.corpus_count, epochs=300, report_delay=1)
    
    return w2v_model

In [18]:
df_train_w2v = load_trainCsv()
processed_docs = df_train_w2v['title'].map(preprocess)
processed_docs = list(processed_docs)

df_train_w2v['preprocess_title']=processed_docs
df_train_w2v[['posting_id','preprocess_title']][0:2]

| | posting_id | preprocess_title |
|---|---|---|
| 0 | train_129225211 | [paper, victoria, secret] |
| 1 | train_3386243561 | [doubl, tape, origin, doubl, foam, tape] |


### Building a Word2Vec Model

In [19]:
build_new_model_bool = False

if build_new_model_bool:
    w2v_model = word2vec_model()
    w2v_model.save('word2vec_model')

else:
    w2v_model = Word2Vec.load('word2vec_model')   # counterpart of w2v_model.save()
    
emb_vec = w2v_model.wv

### Create a feature vector

In [20]:
def get_feature_vec_v2(sen1, model, size_feat_vec=50):
    
    sen_vec1 = np.zeros(size_feat_vec)
    for val in sen1:
        sen_vec1 = np.add(sen_vec1, model[val])    
    return sen_vec1/norm(sen_vec1)

df_train_w2v['word_vec'] = df_train_w2v.apply(
                        lambda row: 
                        get_feature_vec_v2(row['preprocess_title'], emb_vec), 
                        axis=1)

df_train_w2v.head(2)

| | posting_id | image | image_phash | title | label_group | preprocess_title | word_vec |
|---|---|---|---|---|---|---|---|
| 0 | train_129225211 | 0000a68812bc7e98c42888dfb1c07da0.jpg | 94974f937d4c2433 | Paper Bag Victoria Secret | 249114794 | [paper, victoria, secret] | [-0.04777907373336296, 0.16484181570560288, 0.... |
| 1 | train_3386243561 | 00039780dfc94d01db8676fe789ecd05.jpg | af3f9460c2838f0f | Double Tape 3M VHB 12 mm x 4,5 m ORIGINAL / DO... | 2937985045 | [doubl, tape, origin, doubl, foam, tape] | [0.14724423593913194, 0.1192708373649622, -0.0... |


### Calculate distances

In [21]:
def get_sim_all_pi(i_vec_1,i_vec_all):  
    # The word vectors are normalized to unit length, so the dot product equals the cosine similarity
    return i_vec_all.dot(i_vec_1)


def get_sim_two_pi(i_vec_1,i_vec_2):
    sim = dot(i_vec_1,i_vec_2)/(norm(i_vec_1)*norm(i_vec_2))
    return sim

In [22]:
# Finding semantically similar images, same as above

id_1=8400
i_vec_1 = df_train_w2v['word_vec'][id_1]
i_vec_all = df_train_w2v['word_vec'].values
i_vec_all = np.vstack(i_vec_all)

labels = df_train_w2v['label_group'].to_list()
posting_id = df_train_w2v['posting_id'].to_list()
list1 = list(get_sim_all_pi(i_vec_1,i_vec_all))

df_nlp = pd.DataFrame(data = [list1,labels,posting_id]).transpose()
df_nlp = df_nlp.sort_values(by = [0,1],ascending=False)
df_nlp.head(10)

| | 0 | 1 | 2 |
|---|---|---|---|
| 8400 | 1.0 | 912146474 | train_598686012 |
| 3326 | 0.986869 | 912146474 | train_2710843880 |
| 14689 | 0.985473 | 912146474 | train_4187857506 |
| 17008 | 0.981489 | 912146474 | train_578257500 |
| 22962 | 0.908996 | 912146474 | train_395732057 |
| 29207 | 0.701713 | 1813765221 | train_3947026216 |
| 20472 | 0.695985 | 3195941759 | train_1751507239 |
| 10514 | 0.692491 | 1813765221 | train_722294539 |
| 12462 | 0.663119 | 342597810 | train_599753440 |
| 16961 | 0.659976 | 1776453982 | train_3522099472 |


### Create clusters optimized on recall

In [23]:
rec_values = []
cl_size = []

for i in range(100):
    clreal = real_cluster_of_i_w2v(i*100)
    clpred = pred_cluster_of_i_w2v(i*100,0.7)
    rec_values.append(recall_i(clreal,clpred))
    cl_size.append(len(clpred))
    
print("Mean Recall: ",sum(rec_values)/len(rec_values), "  Mean Length of cluster: ", sum(cl_size)/len(cl_size))

Mean Recall:  0.9343347338935574   Mean Length of cluster:  177.13


By building clusters using a threshold of 0.7 we get a mean cluster size of around 180 (177.13 in the run above). At the same time, the mean recall of about 0.93 is close to 1.

This is one of the most important findings of this notebook. It lets us significantly reduce the number of images that could possibly belong to a certain cluster: when looking for the cluster of a given image, instead of having to compare it to 34249 images, we now only need to look at about 180 candidates, e.g. with visual methods. That way, the NLP method can be combined with other methods later on.
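
The candidate selection described here can be sketched in a few lines of NumPy. The example below is an illustration on made-up vectors rather than the notebook's implementation; the function name candidates_above and the toy data are assumptions.

```python
import numpy as np

def candidates_above(unit_vectors, idx, threshold):
    """Row positions whose cosine similarity to row idx is >= threshold."""
    # For unit-length vectors the dot product equals the cosine similarity
    sims = unit_vectors @ unit_vectors[idx]
    return np.flatnonzero(sims >= threshold)

# Toy vectors (hypothetical), normalized to unit length like the word_vec column
raw = np.array([[1.0, 0.0],
                [0.9, 0.1],
                [0.0, 1.0],
                [0.7, 0.7]])
unit = raw / np.linalg.norm(raw, axis=1, keepdims=True)

cand = candidates_above(unit, 0, 0.7)
print(cand)   # positions of the remaining candidates for image 0
```

Only the rows returned here would then have to be passed on to a more expensive comparison such as a visual method.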

## Save results and combine them with pHash 

Now we want to predict clusters for all 34250 images using a high threshold (0.97). This way we obtain clusters which are optimized for precision. The results are saved in a dictionary and then combined with the results obtained in the pHash Notebook. This combination leads to an F1-Score of about 66 %.

In [24]:
already_done = True

if not already_done:

    dict_nlp_prec_all_97 = {}
    list_post_id = df_train_w2v['posting_id'].tolist()

    for i in range(len(list_post_id)):
        dict_nlp_prec_all_97[list_post_id[i]] = set(pred_cluster_of_i_w2v(i,0.97))
        if i%1000 == 0:    # Display progress and save an intermediate result
            print(i)
            with open("dict_nlp_prec_all_97.p", "wb") as f:
                pickle.dump(dict_nlp_prec_all_97, f)
    with open("dict_nlp_prec_all_97.p", "wb") as f:    # Final save
        pickle.dump(dict_nlp_prec_all_97, f)

# Load results
with open("dict_nlp_prec_all_97.p", "rb") as f:
    dict_nlp_prec_all_97_load = pickle.load(f)

In [25]:
# Combine two predictions
def combi_pred(pred1, pred2):
    # Union of the two predicted clusters for every posting_id
    combi_set = {} 
    for i in pred1.keys():
        assert i in pred2    # both predictions must cover the same posting_ids
        combi_set[i] = pred1[i].union(pred2[i])
    return combi_set

# Load results from the pHash Notebook
dict_phash_prec_all_9_load = pickle.load( open( "dict_phash_prec_all_9.p", "rb" ) )

pred_nlp_phash = combi_pred(dict_nlp_prec_all_97_load, dict_phash_prec_all_9_load)
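
To make the effect of this combination concrete, here is a minimal sketch on made-up predictions. The function name combine_by_union and the ids are assumptions for illustration; the logic is the same key-by-key union as in combi_pred.

```python
def combine_by_union(pred_a, pred_b):
    """Merge two {posting_id: set_of_posting_ids} predictions key by key."""
    return {key: pred_a[key] | pred_b[key] for key in pred_a}

# Made-up predictions (hypothetical ids)
text_pred = {"p1": {"p1", "p2"}, "p2": {"p2"}}
image_pred = {"p1": {"p1", "p3"}, "p2": {"p1", "p2"}}

combined = combine_by_union(text_pred, image_pred)
print(combined)
```

A union can only enlarge a predicted cluster, so recall cannot decrease while precision may. This is why both inputs are built with thresholds that favour precision.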

In [26]:
# Calculate F1-Score of the combination NLP and pHash

fscores = []
for i in range(34250):
    x = f_score_i(real_cluster_of_i_w2v(i), pred_nlp_phash[df_train_w2v.posting_id.values[i]])
    fscores.append(x)
    
print("F1-Score: ", sum(fscores)/len(fscores))

F1-Score:  0.6647914689708506


## Future Work

- Most important: Saving (via pickle) clusters optimized for recall and combining them with visual methods based on CNNs
- Experimenting more on the hyperparameters of the W2V model
- Use further NLP models based on the transformer framework (such as BERT) 
- Find out if PCA can be applied in order to decrease computation time without losing too much accuracy