# Part 1: DBSCAN — Pre-Processing

## Contents

[Merging Synonyms with WordNet](#merging-synonyms-to-reveal-hidden-patterns)\
[Improving the Set of Features in Other Ways](#improving-the-set-of-features-in-other-ways)

In [None]:
import spacy, nltk
import pandas as pd
import numpy as np
nltk.download("wordnet")
nltk.download("omw-1.4")
from nltk.corpus import wordnet as wn
from spacy.lang.en.stop_words import STOP_WORDS
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD, PCA
from sklearn.cluster import DBSCAN


In [None]:
nlp = spacy.load("en_core_web_md")

In [None]:
data_df = pd.read_csv("../../Data/sample-data.csv")

data_df.head()

Unnamed: 0,id,description
0,1,Active classic boxers - There's a reason why o...
1,2,Active sport boxer briefs - Skinning up Glory ...
2,3,Active sport briefs - These superbreathable no...
3,4,"Alpine guide pants - Skin in, climb ice, switc..."
4,5,"Alpine wind jkt - On high ridges, steep ice an..."


## Merging Synonyms to Reveal Hidden Patterns?

We may be able to reduce the size of the vocabulary and reveal patterns that way.

To pick the right synonym in context, I first need to see how well spaCy's document-vector similarity works. I'll test it on the two meanings of "bank" (a place where you store your money, and the banks of a river).

In [None]:
sent_1 = nlp("This cardigan is the best cardigan. It has impeccable seams and I would gladly wear it to the bank.")
sent_2 = nlp("The Bank of Ireland is a financial institution involved with many countries' international currency bonds.")

sent_1.similarity(sent_2)

0.7902679443359375

In [None]:
sent_3 = nlp("This cardigan is the best cardigan. It has impeccable seams and I would gladly wear it to the banks of the Danube.")
sent_3.similarity(sent_1)

0.9916204214096069

In [None]:
sent_4 = nlp("The river Boyne has seen many civilisations on its banks.")
sent_4.similarity(sent_1)

0.8406769037246704

In [None]:
sent_3.similarity(sent_2)

0.8120417594909668

In [None]:
sent_3.similarity(nlp("hello"))

0.36783286929130554
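
A possible reading of these scores: for pipelines with static word vectors such as `en_core_web_md`, `Doc.similarity` defaults to the cosine similarity between the averaged token vectors of the two texts. Two texts that share most of their words would then score close to 1, whichever sense of "bank" is meant. The sketch below mimics that behaviour with NumPy only; the 3-dimensional vectors in `toy_vectors` are invented for illustration and are not spaCy's.

```python
import numpy as np

# Hypothetical 3-d word vectors, invented for illustration (not spaCy's).
toy_vectors = {
    "cardigan": np.array([0.9, 0.1, 0.0]),
    "seams":    np.array([0.8, 0.2, 0.1]),
    "wear":     np.array([0.7, 0.3, 0.0]),
    "bank":     np.array([0.1, 0.9, 0.2]),
    "river":    np.array([0.0, 0.2, 0.9]),
    "money":    np.array([0.1, 0.9, 0.1]),
}

def doc_vector(words):
    """Average the word vectors, as a stand-in for a document vector."""
    return np.mean([toy_vectors[w] for w in words], axis=0)

def cosine(a, b):
    """Cosine similarity between two 1-d vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

clothing_money = doc_vector(["cardigan", "seams", "wear", "bank", "money"])
clothing_river = doc_vector(["cardigan", "seams", "wear", "bank", "river"])
finance_only = doc_vector(["bank", "money"])

# Same clothing words, different sense of "bank"
shared_words_score = cosine(clothing_money, clothing_river)
# Same sense of "bank", different surrounding words
same_sense_score = cosine(clothing_money, finance_only)
print(shared_words_score, same_sense_score)
```

In this toy setup the shared clothing words dominate the average, so the pair with different senses of "bank" scores higher than the pair with the same sense. That is the same ordering as `sent_3` vs `sent_1` compared with `sent_1` vs `sent_2` above.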

#### Without trying to merge synonyms

In [None]:
df = data_df.copy()
df["clean_docs"] = df["description"].str.replace(r"[^a-zA-Z0-9']+", " ", regex=True)\
                                    .apply(lambda desc: nlp(desc.lower()))\
                                    .apply(lambda doc: [token.lemma_ for token in doc
                                                        if token.text not in STOP_WORDS])\
                                    .apply(lambda ls: " ".join(ls))

In [None]:
df.head()

Unnamed: 0,id,description,clean_docs
0,1,Active classic boxers - There's a reason why o...,active classic boxer reason boxer cult favorit...
1,2,Active sport boxer briefs - Skinning up Glory ...,active sport boxer brief skin glory require mo...
2,3,Active sport briefs - These superbreathable no...,active sport brief superbreathable fly brief m...
3,4,"Alpine guide pants - Skin in, climb ice, switc...",alpine guide pant skin climb ice switch rock t...
4,5,"Alpine wind jkt - On high ridges, steep ice an...",alpine wind jkt high ridge steep ice alpine ja...


In [None]:
no_syn_merge_vectoriser = TfidfVectorizer(stop_words="english")
no_syn_merge_X = no_syn_merge_vectoriser.fit_transform(df["clean_docs"])
len(no_syn_merge_vectoriser.vocabulary_)

3825

It seems, though, that each token has several WordNet synsets available on average (about 4.7), so we should at least try to reduce the size of the vocabulary:

In [None]:
# Keep this cell to motivate my choice of using wordnet synonyms
test_descs = df["description"].apply(lambda desc: nlp(desc))\
                              .apply(lambda doc: np.mean([len(wn.synsets(token.lemma_)) for token in doc])) 
test_descs.mean()

np.float64(4.695089691414051)

#### With trying to merge synonyms

Using WordNet's Synsets feature.

In [None]:
df = data_df.copy()

def get_best_synonym(token):
    sent_context = token.sent
    synset_list = wn.synsets(token.lemma_)

    if len(synset_list) <= 1:
        return token.lemma_
    
    best, best_score = None, -1
    for synonym in synset_list:
        definition_doc = nlp(synonym.definition())
        similarity_score = sent_context.similarity(definition_doc)
        if similarity_score > best_score:
            best, best_score = synonym, similarity_score
    return best.name()

df["clean_docs"] = df["description"].str.replace(r"[^a-zA-Z0-9']+", " ", regex=True)\
                                    .apply(lambda desc: nlp(desc.lower()))\
                                    .apply(lambda doc: [token for token in doc
                                                        if token.text not in STOP_WORDS])\
                                    .apply(lambda doc: [get_best_synonym(token) for token in doc])
df.head()



Unnamed: 0,id,description,clean_docs
0,1,Active classic boxers - There's a reason why o...,"[Synset('active.a.14'), Synset('classical.a.01..."
1,2,Active sport boxer briefs - Skinning up Glory ...,"[Synset('active_voice.n.01'), Synset('sport.n...."
2,3,Active sport briefs - These superbreathable no...,"[Synset('active.a.01'), Synset('sport.n.04'), ..."
3,4,"Alpine guide pants - Skin in, climb ice, switc...","[Synset('alpine.s.03'), Synset('guide.n.06'), ..."
4,5,"Alpine wind jkt - On high ridges, steep ice an...","[Synset('alpine.s.03'), Synset('wind_instrumen..."


In [None]:
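from nltk.corpus.reader.wordnet import Synset

# Convert any Synset objects to their names (e.g. "skin.n.01"); leave plain strings as they are
df["clean_docs"] = df["clean_docs"].apply(lambda ls: " ".join([syn.name() if isinstance(syn, Synset)
                                                               else syn
                                                               for syn in ls]))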
df.head()

Unnamed: 0,id,description,clean_docs
0,1,Active classic boxers - There's a reason why o...,active.a.14 classical.a.01 packer.n.01 reason....
1,2,Active sport boxer briefs - Skinning up Glory ...,active_voice.n.01 sport.n.04 packer.n.01 brief...
2,3,Active sport briefs - These superbreathable no...,active.a.01 sport.n.04 brief.n.01 superbreatha...
3,4,"Alpine guide pants - Skin in, climb ice, switc...",alpine.s.03 guide.n.06 trouser.n.01 skin.n.01 ...
4,5,"Alpine wind jkt - On high ridges, steep ice an...",alpine.s.03 wind_instrument.n.01 jkt high_gear...


In [None]:
df["clean_docs"] = df["clean_docs"].apply(lambda st: " ".join([word.split(".")[0] if ("." in word)
                                                               else word
                                                               for word in st.split(" ")]))
df.head()

Unnamed: 0,id,description,clean_docs
0,1,Active classic boxers - There's a reason why o...,active classical packer reason packer cult fro...
1,2,Active sport boxer briefs - Skinning up Glory ...,active_voice sport packer brief skin aura nece...
2,3,Active sport briefs - These superbreathable no...,active sport brief superbreathable fly brief m...
3,4,"Alpine guide pants - Skin in, climb ice, switc...",alpine guide trouser skin climb methamphetamin...
4,5,"Alpine wind jkt - On high ridges, steep ice an...",alpine wind_instrument jkt high_gear ridge ste...


In [None]:
syn_merge_vectoriser = TfidfVectorizer(stop_words="english")
syn_merge_X = syn_merge_vectoriser.fit_transform(df["clean_docs"])
len(syn_merge_vectoriser.vocabulary_)

3947

Using the sentence context actually increases the dimensionality, from 3,825 to 3,947 terms.
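
One plausible explanation, judging from the output above: the same lemma can be mapped to different synsets in different sentences, and each distinct synset head word becomes its own TF-IDF feature. The sketch below uses the three synset names that appear to have been chosen for "active" in the first three descriptions. It is an illustration, not a full account of where the extra terms come from.

```python
# Synset names that appear first in the first three rows of the output above,
# presumably the choices made for the lemma "active".
chosen_for_active = ["active.a.14", "active_voice.n.01", "active.a.01"]

# Strip the part-of-speech and sense suffix, as the notebook does after this step.
stripped = [name.split(".")[0] for name in chosen_for_active]

vocabulary_before = {"active"}    # one lemma, one feature
vocabulary_after = set(stripped)  # one feature per distinct head word

print(sorted(vocabulary_after))
```

Here a single lemma turns into two features (`active` and `active_voice`), so context-dependent selection can split a term instead of merging several terms into one.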

In [None]:
df = data_df.copy()

def get_best_synonym(token):
    # Context-free baseline: always take the first synset WordNet lists for the lemma
    synset_list = wn.synsets(token.lemma_)
    return synset_list[0].name().split(".")[0] if len(synset_list) >= 1 else token.lemma_

df["clean_docs"] = df["description"].str.replace(r"[^a-zA-Z0-9']+", " ", regex=True)\
                                    .apply(lambda desc: nlp(desc.lower()))\
                                    .apply(lambda doc: [token for token in doc
                                                        if token.text not in STOP_WORDS])\
                                    .apply(lambda doc: [get_best_synonym(token) for token in doc])\
                                    .apply(lambda ls: " ".join(ls))
df.head()

Unnamed: 0,id,description,clean_docs
0,1,Active classic boxers - There's a reason why o...,active_agent classic boxer reason boxer cult f...
1,2,Active sport boxer briefs - Skinning up Glory ...,active_agent sport boxer brief skin glory nece...
2,3,Active sport briefs - These superbreathable no...,active_agent sport brief superbreathable fly b...
3,4,"Alpine guide pants - Skin in, climb ice, switc...",alpine usher pant skin ascent ice switch rock ...
4,5,"Alpine wind jkt - On high ridges, steep ice an...",alpine wind jkt high ridge steep ice alpine ja...


In [None]:
syn_merge_vectoriser = TfidfVectorizer(stop_words="english")
syn_merge_X = syn_merge_vectoriser.fit_transform(df["clean_docs"])
len(syn_merge_vectoriser.vocabulary_)

3434

Simply taking the first synset and stripping the extra information from its name (e.g. "skin.n.01" becomes "skin") does reduce the dimensionality (3,434 terms instead of 3,825), but it ignores the context the word appeared in. Let's see how much more of a difference we can get by removing terms that appear in only one document of the corpus:

In [None]:
df = data_df.copy()
df["clean_docs"] = df["description"].str.replace(r"[^a-zA-Z0-9']+", " ", regex=True)\
                                    .apply(lambda desc: nlp(desc.lower()))\
                                    .apply(lambda doc: [token.lemma_ for token in doc
                                                        if token.text not in STOP_WORDS])\
                                    .apply(lambda ls: " ".join(ls))

In [None]:
df.head()

Unnamed: 0,id,description,clean_docs
0,1,Active classic boxers - There's a reason why o...,active classic boxer reason boxer cult favorit...
1,2,Active sport boxer briefs - Skinning up Glory ...,active sport boxer brief skin glory require mo...
2,3,Active sport briefs - These superbreathable no...,active sport brief superbreathable fly brief m...
3,4,"Alpine guide pants - Skin in, climb ice, switc...",alpine guide pant skin climb ice switch rock t...
4,5,"Alpine wind jkt - On high ridges, steep ice an...",alpine wind jkt high ridge steep ice alpine ja...


In [None]:
vectoriser = TfidfVectorizer(stop_words="english")
X = vectoriser.fit_transform(df["clean_docs"])
len(vectoriser.vocabulary_)

3825

In [None]:
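dense = X.todense()
# vocabulary_ is a dict in first-seen order, not column order;
# get_feature_names_out() returns the terms in the matrix's column order
tfidf_df = pd.DataFrame(dense,
                        columns=vectoriser.get_feature_names_out(),
                        index=[f"doc_{x}" for x in range(1, dense.shape[0]+1)])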
tfidf_df.head()

Unnamed: 0,active,classic,boxer,reason,cult,favorite,cool,especially,sticky,situation,...,493,paint,splatter,cake,washing,bellow,349,arduous,unfazed,282
doc_1,0.0,0.0,0.0,0.073055,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
doc_2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
doc_3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
doc_4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.100574,0.146854,0.103911,0.0,0.0
doc_5,0.0,0.0,0.047799,0.047159,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.151113,0.0,0.0,0.0,0.0


In [None]:
len([term for term in vectoriser.vocabulary_
     if len(tfidf_df[tfidf_df[term] != 0.0]) > 1])

2222
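
The count above can also be obtained in one vectorised step by computing each term's document frequency (the number of documents in which it is non-zero). This is a minimal sketch on a made-up matrix; `toy_tfidf` and its values are hypothetical.

```python
import numpy as np

# Toy TF-IDF matrix: 4 documents (rows) x 5 terms (columns); values are made up.
toy_tfidf = np.array([
    [0.5, 0.0, 0.2, 0.0, 0.0],
    [0.4, 0.0, 0.0, 0.0, 0.7],
    [0.0, 0.9, 0.3, 0.0, 0.0],
    [0.6, 0.0, 0.1, 0.0, 0.0],
])

# Document frequency: number of documents in which each term is non-zero.
doc_freq = np.count_nonzero(toy_tfidf, axis=0)

# Keep only the terms that appear in more than one document.
keep_mask = doc_freq > 1
n_kept = int(keep_mask.sum())
print(doc_freq, n_kept)
```

On the real data, calling `getnnz(axis=0)` on the sparse matrix returned by `fit_transform` should give the same per-term counts without building a dense DataFrame, assuming the matrix stores no explicit zeros.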

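Removing lemmas that appear in only one document works much better: the vocabulary drops from 3,825 to 2,222 terms, against 3,434 for the first-synset merge.\
Just out of curiosity, let's see how many terms survive the same filter on the first-synset vocabulary: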

In [None]:
syn_merge_dense = syn_merge_X.todense()
# Use get_feature_names_out() so the column labels follow the matrix's column order
syn_merge_tfidf_df = pd.DataFrame(syn_merge_dense,
                                  columns=syn_merge_vectoriser.get_feature_names_out(),
                                  index=[f"doc_{x}" for x in range(1, syn_merge_dense.shape[0]+1)])
syn_merge_tfidf_df.head()

Unnamed: 0,active_agent,classic,boxer,reason,cult,favorite,cool,particularly,gluey,situation,...,jeer,493,paint,spatter,cake,bellow,349,arduous,unfazed,282
doc_1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
doc_2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
doc_3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
doc_4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.14985,0.0
doc_5,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [None]:
len([term for term in syn_merge_vectoriser.vocabulary_
     if len(syn_merge_tfidf_df[syn_merge_tfidf_df[term] != 0.0]) > 1])

2033

It seems like it did reduce the proportion of single-document terms a little: 1,401 of 3,434 terms (about 41%) with the first-synset merge, against 1,603 of 3,825 (about 42%) without it. But at what cost?\
With more time, I might be able to find a way to merge synonyms, but the risk of merging words with different meanings and significantly reducing the quality of the data is too big, and removing single-document features already makes a big difference, so I won't pursue the synonym pre-processing step further.

## Improving the Set of Features in Other Ways

The idea now is to run TfidfVectorizer on the data with 1- to 4-grams to reveal more patterns, and then remove any feature (n-gram or single lemma) that is non-zero in only one document of the TF-IDF matrix. Such features cannot help show common patterns between docs.

I also realised that I had left the HTML tags in the descriptions until now. The cell below strips them.
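
To see why 1- to 4-grams inflate the vocabulary so much, here is a minimal pure-Python sketch of n-gram generation. The helper `ngrams` is hypothetical and only approximates what `TfidfVectorizer(ngram_range=(1, 4))` does after tokenising.

```python
def ngrams(tokens, n_min=1, n_max=4):
    """Return all contiguous n-grams of length n_min..n_max as strings."""
    return [
        " ".join(tokens[i:i + n])
        for n in range(n_min, n_max + 1)
        for i in range(len(tokens) - n + 1)
    ]

# First few lemmas of the fourth cleaned description shown earlier.
tokens = ["alpine", "guide", "pant", "skin", "climb", "ice"]
features = ngrams(tokens)
print(len(tokens), len(features))
```

Six tokens already produce 18 candidate features (6 + 5 + 4 + 3), and most of the longer n-grams are unlikely to recur in another description, which is why the single-document filter matters even more at this stage.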

In [None]:
df = data_df.copy()
df["clean_docs"] = df["description"].str.replace(r"<[^>]*>", " ", regex=True)\
                                    .str.replace(r"[^a-zA-Z0-9']+", " ", regex=True)\
                                    .apply(lambda desc: nlp(desc.lower()))\
                                    .apply(lambda doc: [token.lemma_ for token in doc
                                                        if token.text not in STOP_WORDS])\
                                    .apply(lambda ls: " ".join(ls))

In [None]:
# I am not using a max or min document frequency for terms because I want to decide which cols to keep after the n-grams are made
vectoriser = TfidfVectorizer(stop_words="english", ngram_range=(1, 4))
X = vectoriser.fit_transform(df["clean_docs"])
len(vectoriser.vocabulary_)

80511

In [None]:
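dense = X.todense()
# vocabulary_ is a dict in first-seen order, not column order;
# get_feature_names_out() returns the terms in the matrix's column order
tfidf_df = pd.DataFrame(dense,
                        columns=vectoriser.get_feature_names_out(),
                        index=[f"doc_{x}" for x in range(1, dense.shape[0]+1)])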
tfidf_df.head()

Unnamed: 0,active,classic,boxer,reason,cult,favorite,cool,especially,sticky,situation,...,flat zip fly button,entry drop pocket welt,drop pocket welt pocket,welt pocket inseam update,pocket inseam update fit,inseam update fit fabric,update fit fabric oz,recycle program weight 282,program weight 282 oz,weight 282 oz thailand
doc_1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
doc_2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
doc_3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
doc_4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
doc_5,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [None]:
cols_more_than_one = [term for term in vectoriser.vocabulary_
                      if len(tfidf_df[tfidf_df[term] != 0.0]) > 1] # I checked that no elements were negative, but I used `!=` just to be sure.
denser_df = tfidf_df[cols_more_than_one]
denser_df.head()

Unnamed: 0,active,classic,boxer,reason,lightweight,travel,pack,expose,softness,traditional,...,flat zip fly button,entry drop pocket welt,drop pocket welt pocket,welt pocket inseam update,pocket inseam update fit,inseam update fit fabric,update fit fabric oz,recycle program weight 282,program weight 282 oz,weight 282 oz thailand
doc_1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
doc_2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
doc_3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
doc_4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
doc_5,0.0,0.0,0.0,0.0,0.019869,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


Our lessons suggest aiming for about 1/5 non-zero values in the TF-IDF matrix, and reducing the number of features when the matrix is sparser than that:

In [None]:
denser_df.to_numpy().nonzero()

(array([  0,   0,   0, ..., 499, 499, 499], shape=(120046,)),
 array([   35,   132,   137, ..., 23352, 23357, 23358], shape=(120046,)))

In [None]:
(120046 / (500 * 23757)) * 100

1.0106158184956013

Only about 1% of the values are non-zero! Getting that to 20% is going to be hard, but let's see what it would take.
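
The percentage computed above is the number of non-zero entries divided by the total number of entries. A small helper makes that reusable; `percent_non_zero` and the toy matrix are illustrative names and values, not part of the original notebook.

```python
import numpy as np

def percent_non_zero(matrix):
    """Percentage of entries in a 2-D array that are non-zero."""
    return np.count_nonzero(matrix) / matrix.size * 100

# Toy matrix with 3 non-zero entries out of 12 (values are made up).
toy = np.array([
    [0.0, 0.3, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.8],
    [0.5, 0.0, 0.0, 0.0],
])
toy_density = percent_non_zero(toy)

# The same arithmetic with the counts reported above:
# 120,046 non-zero entries in a 500 x 23,757 matrix.
reported_density = 120046 / (500 * 23757) * 100
print(toy_density, reported_density)
```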

First, I'll try removing features with a low average TF-IDF (features that are comparatively less useful for extracting topics that stand out) before performing a truncated SVD. This may remove very common words first and make the matrix even sparser, but I want to see whether there is a sweet spot or whether it is impossible.

Then, I'll see how many rare features I would need to remove to reach 20%, which seems a lot more doable.
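
The worry that an average-TF-IDF filter removes common words first follows from how IDF is defined. With scikit-learn's default `smooth_idf=True`, the documented weight is `ln((1 + n) / (1 + df)) + 1`, where `n` is the number of documents and `df` is the number of documents containing the term. The sketch below compares a rare and a common term; the two `doc_freq` values are hypothetical.

```python
import math

def smooth_idf(n_docs, doc_freq):
    """IDF as documented for scikit-learn's default smooth_idf=True."""
    return math.log((1 + n_docs) / (1 + doc_freq)) + 1

n_docs = 500  # number of descriptions in the corpus

rare_idf = smooth_idf(n_docs, doc_freq=2)      # hypothetical term found in 2 documents
common_idf = smooth_idf(n_docs, doc_freq=400)  # hypothetical term found in 400 documents
print(round(rare_idf, 2), round(common_idf, 2))
```

The common term gets a much smaller weight than the rare one, yet it contributes 400 non-zero entries to the matrix against 2. Dropping low-weight terms therefore tends to discard the very columns that make the matrix denser.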

In [None]:
# Removing terms with low average TF-IDF

percentage_non_zeros = -1
num_features = np.inf
threshold = 0.05
full_num_terms = len(tfidf_df.columns)
eligible_features = [term for term in denser_df.columns
                    if (denser_df[denser_df[term] > 0][term].mean() > threshold)]
current_df = denser_df[eligible_features]

while (percentage_non_zeros < 20) and num_features > 1:
    threshold += 0.001
    eligible_features = [term for term in current_df.columns
                        if (current_df[current_df[term] > 0][term].mean() > threshold)]
    current_df = current_df[eligible_features]
    num_features = len(eligible_features)
    percentage_non_zeros = (
        len(current_df.to_numpy().nonzero()[0])
        / (500 * num_features)
        ) * 100
    print(f"There are {num_features}/{full_num_terms} terms with an average TF-IDF above {threshold}, with {percentage_non_zeros}% non-zero values.")
    

There are 9442/80511 terms with an average TF-IDF above 0.051000000000000004, with 0.7570853632704935% non-zero values.
There are 8720/80511 terms with an average TF-IDF above 0.052000000000000005, with 0.7638302752293578% non-zero values.
There are 8255/80511 terms with an average TF-IDF above 0.053000000000000005, with 0.767244094488189% non-zero values.
There are 7619/80511 terms with an average TF-IDF above 0.054000000000000006, with 0.7735135844599028% non-zero values.
There are 7182/80511 terms with an average TF-IDF above 0.05500000000000001, with 0.7752715121136173% non-zero values.
There are 6283/80511 terms with an average TF-IDF above 0.05600000000000001, with 0.7828425911188923% non-zero values.
There are 5915/80511 terms with an average TF-IDF above 0.05700000000000001, with 0.7761284868977177% non-zero values.
There are 5360/80511 terms with an average TF-IDF above 0.05800000000000001, with 0.7747014925373134% non-zero values.
There are 5131/80511 terms with an average TF

In [None]:
# Removing rare features

percentage_non_zeros = -1
num_features = np.inf
threshold = 1
full_num_terms = len(denser_df.columns)
eligible_features = [term for term in denser_df.columns
                     if len(denser_df[denser_df[term] != 0.0]) > threshold]
current_df = denser_df[eligible_features]

while (percentage_non_zeros < 20) and num_features > 1:
    threshold += 1
    eligible_features = [term for term in current_df.columns
                         if len(current_df[current_df[term] != 0.0]) > threshold]
    current_df = current_df[eligible_features]
    num_features = len(eligible_features)
    percentage_non_zeros = (
        len(current_df.to_numpy().nonzero()[0])
        / (500 * num_features)
        ) * 100
    print(f"There are {num_features}/{full_num_terms} terms that occur more than {threshold} times in the TF-IDF matrix, with {percentage_non_zeros}% non-zero values.")

There are 10717/23757 terms that occur more than 2 times in the TF-IDF matrix, with 1.753587757768032% non-zero values.
There are 7333/23757 terms that occur more than 3 times in the TF-IDF matrix, with 2.285940270012273% non-zero values.
There are 5035/23757 terms that occur more than 4 times in the TF-IDF matrix, with 2.9641310824230387% non-zero values.
There are 4002/23757 terms that occur more than 5 times in the TF-IDF matrix, with 3.4711144427786103% non-zero values.
There are 3047/23757 terms that occur more than 6 times in the TF-IDF matrix, with 4.18293403347555% non-zero values.
There are 2471/23757 terms that occur more than 7 times in the TF-IDF matrix, with 4.831647106434642% non-zero values.
There are 1959/23757 terms that occur more than 8 times in the TF-IDF matrix, with 5.676263399693721% non-zero values.
There are 1614/23757 terms that occur more than 9 times in the TF-IDF matrix, with 6.504832713754646% non-zero values.
There are 1425/23757 terms that occur more tha

In both cases, trying to reach 1/5 non-zero values seems like a bad idea. Raising the threshold on average TF-IDF left us with a single feature before we came anywhere close to 20% non-zero values. Raising the threshold on term rarity did reach 20%, but only 303 of the 23,757 features remain.