# Semantic Analysis

TF-IDF vectors can be used to analyze both words and n-grams.

We now aim to find the meaning of a sentence by looking at its overall word content. We will construct *topic* (or *semantic*) vectors that carry the meaning of a sentence rather than its word statistics.


Note that a rephrased sentence can have the same meaning while having a different TF-IDF vector! Therefore closeness measures such as cosine similarity, when applied to TF-IDF vectors, cannot compare semantics.
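
A minimal sketch of this limitation. To stay short it uses plain word counts instead of full TF-IDF weights; the three sentences and the helper `cosine` are made up for illustration:

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity between two sparse count vectors stored as Counters."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b)

s1 = Counter("the film was great".split())
s2 = Counter("an excellent movie".split())      # same meaning, no shared words
s3 = Counter("the film was terrible".split())   # opposite meaning, shared words

paraphrase_similarity = cosine(s1, s2)
opposite_similarity = cosine(s1, s3)
print(paraphrase_similarity, opposite_similarity)
```

The paraphrase shares no words with the first sentence, so its similarity is zero, while the sentence with the opposite meaning scores high because it reuses most of the words.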

LSA = Latent Semantic Analysis

Some important notions in analysing semantics (LSA):

- Polysemy - The existence of words and phrases with more than one meaning
- Homonyms - Words with the same spelling and pronunciation, but different meanings
- Zeugma - Use of two meanings of a word simultaneously in the same sentence
- Homographs - Words spelled the same, but with different pronunciations and meanings
- Homophones - Words with the same pronunciation, but different spellings and meanings (an NLP challenge with voice interfaces)

All of these phenomena pose problems for LSA.

## Initial experiments

We construct topic/semantic vectors by weighting the contributions of different words.

In [5]:
import numpy as np
topic = {}
tfidf = dict(list(zip('cat dog apple lion NYC love'.split(), np.random.rand(6))))

print("tfidf = ", tfidf)

topic['petness'] = (.3* tfidf['cat'] +\
    .3 * tfidf['dog'] +\
     0 * tfidf['apple'] +\
     0 * tfidf['lion'] -\
     .2 * tfidf['NYC'] +\
     .2 * tfidf['love'])

topic['animalness'] = (.1 * tfidf['cat'] +\
    .1 * tfidf['dog'] -\
    .1 * tfidf['apple'] +\
    .5 * tfidf['lion'] +\
    .1 * tfidf['NYC'] -\
    .1 * tfidf['love'])

topic['cityness'] = ( 0 * tfidf['cat'] -\
    .1 * tfidf['dog'] +\
    .2 * tfidf['apple'] -\
    .1 * tfidf['lion'] +\
    .5 * tfidf['NYC'] +\
    .1 * tfidf['love'])

print("petness = ", topic['petness'])
print("animalness = ", topic['animalness'])
print("cityness = ", topic['cityness'])

tfidf =  {'cat': 0.08219887120266478, 'dog': 0.6434922923613229, 'apple': 0.7259162900863801, 'lion': 0.2601513375554295, 'NYC': 0.5989888276780737, 'love': 0.2718156894499929}
petness =  0.15227272142358014
animalness =  0.1627704699482836
cityness =  0.3814948778096369


**Note** Negative weights represent the opposite influence of a word on a topic.
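
The hand-written sums above are a matrix-vector product. A minimal sketch, assuming the same weights arranged as a 3x6 matrix (the names `W`, `tfidf_vec` and `topic_vec` are introduced here for illustration only):

```python
import numpy as np

words = ['cat', 'dog', 'apple', 'lion', 'NYC', 'love']
# Rows: petness, animalness, cityness; columns follow `words`.
W = np.array([
    [0.3,  0.3,  0.0,  0.0, -0.2,  0.2],
    [0.1,  0.1, -0.1,  0.5,  0.1, -0.1],
    [0.0, -0.1,  0.2, -0.1,  0.5,  0.1],
])

rng = np.random.default_rng(0)
tfidf_vec = rng.random(len(words))   # stand-in TF-IDF vector

topic_vec = W @ tfidf_vec            # 3 topic scores from 6 word weights
print(dict(zip(['petness', 'animalness', 'cityness'], topic_vec.round(3))))
```

Writing the weights as a matrix makes it clear that the whole question of "finding topics" reduces to finding a good `W`.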

We can also do the reverse transform:

In [7]:
word_vector = {}
 
word_vector['cat']  = .3*topic['petness'] +\
    .1*topic['animalness'] +\
    0*topic['cityness']

word_vector['dog'] = .3*topic['petness'] +\
    .1*topic['animalness'] -\
    .1*topic['cityness']
 
word_vector['apple'] = 0*topic['petness'] -\
    .1*topic['animalness'] +\
    .2*topic['cityness']

word_vector['lion'] = 0*topic['petness'] +\
    .5*topic['animalness'] -\
    .1*topic['cityness']

word_vector['NYC'] = -.2*topic['petness'] +\
    .1*topic['animalness'] +\
    .5*topic['cityness']

word_vector['love'] = .2*topic['petness'] -\
    .1*topic['animalness'] +\
    .1*topic['cityness']

print("love = ", word_vector['love'])

love =  0.05232698507085136


In this way we can see how strongly a given word is tied to the topics.

<img src ="3DWords.png">

We compressed a 6D vector of words into a 3D vector of topics.
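
The reverse transform above uses the same weights read column by column, which is multiplication by the transposed weight matrix. A small sketch, assuming the weights are stored in a matrix `W` (a name introduced here) and reusing the topic scores printed earlier:

```python
import numpy as np

words = ['cat', 'dog', 'apple', 'lion', 'NYC', 'love']
# Rows: petness, animalness, cityness; columns follow `words`.
W = np.array([
    [0.3,  0.3,  0.0,  0.0, -0.2,  0.2],
    [0.1,  0.1, -0.1,  0.5,  0.1, -0.1],
    [0.0, -0.1,  0.2, -0.1,  0.5,  0.1],
])

# Topic scores copied from the printed output of the first cell.
topic_vec = np.array([0.15227272142358014, 0.1627704699482836, 0.3814948778096369])

word_vec = W.T @ topic_vec            # back to 6 dimensions, one value per word
love = word_vec[words.index('love')]
print("love = ", love)
print(np.linalg.matrix_rank(W))       # at most 3, so the round trip is lossy
```

Because `W` has rank at most 3, going from 6D to 3D and back cannot recover the original vector in general; the compression discards information.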

**Question:** How can we automate finding topics and scoring sentences with respect to the topics they represent?

**Answer:** 'You shall know a word by the company it keeps.' J. R. Firth



**Definition** LSA is an algorithm to analyze your TF-IDF matrix (table of TF-IDF vectors) to gather up words into topics. It works on bag-of-words vectors, too, but TF-IDF vectors give slightly better results. LSA also optimizes these topics to maintain diversity in the topic dimensions.


Possible LSA algorithms:
- SVD (Singular Value Decomposition)
- PCA (Principal Component Analysis)

Other related algorithms:
- LDA (Linear Discriminant Analysis)
- LDiA (Latent Dirichlet Allocation)

### A note on scikit-learn classifiers

Typical workflow of a scikit-learn classifier:
- create the classifier
- fit the model by calling the fit() method
- predict with the fitted model by calling the predict() method
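
The three steps can be sketched on toy data as follows. The data points are made up; any scikit-learn classifier could stand in for `LinearDiscriminantAnalysis` here:

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

# Toy 2-D data (made up): class 0 clusters near (0, 0), class 1 near (3, 3).
X = np.array([[0.0, 0.2], [0.3, 0.1], [0.1, 0.4], [0.4, 0.3],
              [3.0, 3.1], [3.2, 2.9], [2.8, 3.3], [3.1, 3.4]])
y = np.array([0, 0, 0, 0, 1, 1, 1, 1])

clf = LinearDiscriminantAnalysis()                    # 1. create the classifier
clf.fit(X, y)                                         # 2. fit the model
predictions = clf.predict([[0.2, 0.2], [3.0, 3.0]])   # 3. predict
print(predictions)
```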


<img src = "01_08.png">

<img src="01_02.png">

<img src="01_01.png">

For more info about ML see:
- S. Raschka, V. Mirjalili, 'Python Machine Learning', Packt Publishing 2nd edition 2017

## LDA classifier

[LDA](https://en.wikipedia.org/wiki/Linear_discriminant_analysis) is a supervised learning algorithm: you need class labels to train it.


LDA training overview:
1. Compute the average position (centroid) of all the TF-IDF vectors within the class (such as spam SMS messages).
2. Compute the average position (centroid) of all the TF-IDF vectors not in the class (such as nonspam SMS messages).
3. Compute the vector difference between the centroids (the line that connects them).
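
The three steps can be sketched with NumPy on made-up 2-D vectors. This mirrors only the simplified description above; scikit-learn's `LinearDiscriminantAnalysis` additionally takes the within-class covariance into account:

```python
import numpy as np

# Made-up 2-D "TF-IDF" vectors; label 1 marks the class of interest (e.g. spam).
X = np.array([[0.9, 0.1], [0.8, 0.3], [0.1, 0.7], [0.2, 0.9], [0.0, 0.8]])
y = np.array([1, 1, 0, 0, 0])

in_class_centroid = X[y == 1].mean(axis=0)            # step 1
out_class_centroid = X[y == 0].mean(axis=0)           # step 2
direction = in_class_centroid - out_class_centroid    # step 3

scores = X @ direction   # projection of every vector onto the connecting line
print(direction.round(2), scores.round(2))
```

In this toy set every in-class vector gets a higher score than every out-of-class vector, so a single threshold on the score separates the two groups.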



#### Example - Spam classification using LDA

We use a one-dimensional measure of spamminess: a single scalar spamminess score per message.

In [13]:
import pandas as pd
pd.options.display.width = 120
sms = pd.read_csv('sms-spam.csv')
index = ['sms{}{}'.format(i, '!'*j) for (i,j) in zip(range(len(sms)), sms.spam)]
sms = pd.DataFrame(sms.values, columns=sms.columns, index=index)
sms['spam'] = sms.spam.astype(int)
print("len = ", len(sms))
print("spam msg sum = ", sms.spam.sum())

len =  4837
spam msg sum =  638


In [15]:
sms.head(6)

Unnamed: 0.1,Unnamed: 0,spam,text
sms0,0,0,"Go until jurong point, crazy.. Available only ..."
sms1,1,0,Ok lar... Joking wif u oni...
sms2!,2,1,Free entry in 2 a wkly comp to win FA Cup fina...
sms3,3,0,U dun say so early hor... U c already then say...
sms4,4,0,"Nah I don't think he goes to usf, he lives aro..."
sms5!,5,1,FreeMsg Hey there darling it's been 3 week's n...


Now we tokenize the messages and compute their TF-IDF vectors:

In [17]:
from sklearn.feature_extraction.text import TfidfVectorizer
from nltk.tokenize.casual import casual_tokenize
tfidf_model = TfidfVectorizer(tokenizer=casual_tokenize)
tfidf_docs = tfidf_model.fit_transform(raw_documents=sms.text).toarray()
print("shape tfidf = ", tfidf_docs.shape)
print("spam msg sum = ", sms.spam.sum())


shape tfidf =  (4837, 9232)
spam msg sum =  638


**Note** The nltk casual_tokenize tokenizer gave you 9,232 words in your vocabulary. You have almost twice as many words as you have messages. And you have more than ten times as many words as spam messages. So your model won’t have a lot of information about the words that will indicate whether a message is spam or not. Usually, a Naive Bayes classifier won’t work well when your vocabulary is much larger than the number of labeled examples in your dataset. That’s where the semantic analysis techniques of this chapter can help.


Now we do LDA step by step (see [sklearn.discriminant_analysis.LinearDiscriminantAnalysis](https://scikit-learn.org/stable/modules/generated/sklearn.discriminant_analysis.LinearDiscriminantAnalysis.html) for black-box):

In [18]:
mask = sms.spam.astype(bool).values
spam_centroid = tfidf_docs[mask].mean(axis=0)
ham_centroid = tfidf_docs[~mask].mean(axis=0)
print("spam centroids = ", spam_centroid.round(2))
print("non-spam centroids = ", ham_centroid.round(2))

spam centroids =  [0.06 0.   0.   ... 0.   0.   0.  ]
non-spam centroids =  [0.02 0.01 0.   ... 0.   0.   0.  ]


Project the TF-IDF vectors onto the direction between the centroid positions:

In [21]:
spamminess_score = tfidf_docs.dot(spam_centroid - ham_centroid)
print("spamminess score = ", spamminess_score.round(2))


spamminess score =  [-0.01 -0.02  0.04 ... -0.01 -0.    0.  ]


<img src= "SpamCloudPoints.png">

The arrow from the nonspam centroid to the spam centroid is the line that defines
your trained model. You can see how some of the green dots are on the back side of
the arrow, so you could get a negative spamminess score when you project them onto
this line between the centroids.


We can normalize the spamminess score to the [0, 1] interval.

In [23]:
from sklearn.preprocessing import MinMaxScaler
sms['lda_score'] = MinMaxScaler().fit_transform(spamminess_score.reshape(-1,1))
sms['lda_predict'] = (sms.lda_score > .5).astype(int)
sms['spam lda_predict lda_score'.split()].round(2).head(20)


Unnamed: 0,spam,lda_predict,lda_score
sms0,0,0,0.23
sms1,0,0,0.18
sms2!,1,1,0.72
sms3,0,0,0.18
sms4,0,0,0.29
sms5!,1,1,0.55
sms6,0,0,0.32
sms7,0,0,0.5
sms8!,1,1,0.89
sms9!,1,1,0.77


Notice that lda_score values above roughly 0.5 indicate potential spam.
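
What the scaling step does can be written out by hand. A sketch with made-up raw scores, using the standard min-max formula:

```python
import numpy as np

raw_scores = np.array([-0.02, -0.01, 0.0, 0.05, 0.10])  # made-up projections

# Min-max scaling: the smallest score maps to 0, the largest to 1.
scaled = (raw_scores - raw_scores.min()) / (raw_scores.max() - raw_scores.min())
predicted = (scaled > 0.5).astype(int)
print(scaled.round(2), predicted)
```

Since the scaling depends only on the two extreme scores, a single outlier message can shift where the 0.5 threshold falls.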

In [24]:
(1. - (sms.spam - sms.lda_predict).abs().sum() / len(sms)).round(3)

0.977

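We got 97.7% of the predictions correct! Keep in mind that this accuracy is measured on the same messages that were used to compute the centroids.

Calculate the confusion matrix: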

<img src="ConfusionMatrix.png">

In [28]:
from pugnlp.stats import Confusion
Confusion(sms['spam lda_predict'.split()])



lda_predict,0,1
spam,Unnamed: 1_level_1,Unnamed: 2_level_1
0,4135,64
1,45,593


## LSA

**Latent semantic analysis** is based on the oldest and most commonly used technique for dimension reduction, singular value decomposition (SVD). SVD was in widespread use long before the term “machine learning” even existed. SVD decomposes a matrix into three matrices, one of which is diagonal.

Using SVD, LSA can break down your TF-IDF term-document matrix into three simpler matrices. And they can be multiplied back together to produce the original matrix, without any changes. This is like factorization of a large integer.
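
A quick check of this property with NumPy, on a random matrix standing in for a small term-document matrix (the sizes are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(42)
A = rng.random((6, 4))            # stand-in for a 6-term x 4-document matrix

U, s, Vt = np.linalg.svd(A, full_matrices=False)
A_rebuilt = U @ np.diag(s) @ Vt   # multiply the three factors back together

print(U.shape, s.shape, Vt.shape)
print(np.allclose(A, A_rebuilt))
```

`np.linalg.svd` returns the singular values as a 1-D array sorted from largest to smallest, which is why `np.diag` is needed to rebuild the diagonal matrix.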


<img src = "1024px-Singular-Value-Decomposition.svg.png">

<img src = "Singular_value_decomposition_visualisation.svg">

### PCA

One of the more advanced LSA approaches is to use [**Principal Component Analysis (PCA)**](https://en.wikipedia.org/wiki/Principal_component_analysis).


The basic idea is to transform the data into a coordinate system whose axes are defined by the eigenvectors of the covariance matrix.

**Remember** the more variance, the more information!

One can think of PCA as SVD with improvements, most notably that the data are mean-centered before the decomposition.
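
The link between the two views can be checked numerically. A sketch on random toy data, comparing the SVD of the centered data with the eigenvalues of the covariance matrix:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3)) * np.array([3.0, 1.0, 0.3])  # toy data, unequal spread

Xc = X - X.mean(axis=0)                    # center each column
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)

# Eigen-decomposition of the covariance matrix for comparison.
eigvals, eigvecs = np.linalg.eigh(np.cov(Xc, rowvar=False))
eigvals = eigvals[::-1]                    # eigh returns ascending order

variances_from_svd = s**2 / (len(X) - 1)
print(np.allclose(variances_from_svd, eigvals))
```

The squared singular values of the centered data, divided by n - 1, are the variances along the principal axes, so both routes lead to the same components.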

PCA is implemented in the [scikit-learn library](https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.PCA.html).

#### Example - SMS spam topics

Read data from csv:

In [2]:
import pandas as pd
pd.options.display.width = 120
sms = pd.read_csv('sms-spam.csv')

index = ['sms{}{}'.format(i, '!'*j) for (i,j) in zip(range(len(sms)), sms.spam)]
sms.index = index
sms.head(6)


Unnamed: 0.1,Unnamed: 0,spam,text
sms0,0,0,"Go until jurong point, crazy.. Available only ..."
sms1,1,0,Ok lar... Joking wif u oni...
sms2!,2,1,Free entry in 2 a wkly comp to win FA Cup fina...
sms3,3,0,U dun say so early hor... U c already then say...
sms4,4,0,"Nah I don't think he goes to usf, he lives aro..."
sms5!,5,1,FreeMsg Hey there darling it's been 3 week's n...


Construct TF-IDF vectors:

In [3]:
from sklearn.feature_extraction.text import TfidfVectorizer
from nltk.tokenize.casual import casual_tokenize
tfidf = TfidfVectorizer(tokenizer=casual_tokenize)
tfidf_docs = tfidf.fit_transform(raw_documents=sms.text).toarray()
len(tfidf.vocabulary_)

9232

In [4]:
tfidf_docs = pd.DataFrame(tfidf_docs)
tfidf_docs = tfidf_docs - tfidf_docs.mean()
tfidf_docs.shape


(4837, 9232)

In [5]:
sms.spam.sum()  

638

Do PCA:

In [6]:
from sklearn.decomposition import PCA
pca = PCA(n_components=16)
pca = pca.fit(tfidf_docs)
pca_topic_vectors = pca.transform(tfidf_docs)
columns = ['topic{}'.format(i) for i in range(pca.n_components)]
pca_topic_vectors = pd.DataFrame(pca_topic_vectors, columns=columns, index=index)
pca_topic_vectors.round(3).head(6)


Unnamed: 0,topic0,topic1,topic2,topic3,topic4,topic5,topic6,topic7,topic8,topic9,topic10,topic11,topic12,topic13,topic14,topic15
sms0,0.201,0.003,0.037,0.011,-0.019,-0.053,0.039,-0.065,0.011,-0.082,0.01,0.007,-0.001,-0.035,-0.006,-0.041
sms1,0.404,-0.094,-0.078,0.051,0.1,0.047,0.023,0.065,0.024,-0.025,-0.005,-0.028,0.046,-0.016,0.049,0.045
sms2!,-0.03,-0.048,0.09,-0.067,0.091,-0.043,-0.0,-0.001,-0.057,0.053,0.123,-0.014,0.023,-0.018,-0.049,-0.041
sms3,0.329,-0.033,-0.035,-0.016,0.052,0.056,-0.166,-0.073,0.062,-0.108,0.024,-0.014,0.066,-0.043,0.034,0.054
sms4,0.002,0.031,0.038,0.034,-0.075,-0.093,-0.044,0.06,-0.046,0.029,0.031,0.007,0.018,0.028,-0.082,0.029
sms5!,-0.016,0.059,0.014,-0.006,0.122,-0.04,0.005,0.167,-0.023,0.065,0.04,-0.062,-0.03,0.074,0.014,-0.034


**Note** We can find the weights of any fitted sklearn transformation by examining its .components_ attribute.
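
For PCA in particular, the rows of `.components_` are exactly the weights applied to the centered data. A small sketch on random toy data (sizes chosen arbitrarily, default `whiten=False`):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(1)
X = rng.random((20, 5))                 # toy data: 20 "documents", 5 "terms"

pca = PCA(n_components=2).fit(X)
print(pca.components_.shape)            # one row of term weights per topic

# transform() centers the data and projects it onto the component rows.
manual = (X - pca.mean_) @ pca.components_.T
print(np.allclose(manual, pca.transform(X)))
```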

In [7]:
tfidf.vocabulary_

{'go': 3807,
 'until': 8487,
 'jurong': 4675,
 'point': 6296,
 ',': 13,
 'crazy': 2549,
 '..': 21,
 'available': 1531,
 'only': 5910,
 'in': 4396,
 'bugis': 1973,
 'n': 5594,
 'great': 3894,
 'world': 8977,
 'la': 4811,
 'e': 3056,
 'buffet': 1971,
 '...': 25,
 'cine': 2277,
 'there': 8071,
 'got': 3855,
 'amore': 1296,
 'wat': 8736,
 'ok': 5874,
 'lar': 4848,
 'joking': 4642,
 'wif': 8875,
 'u': 8395,
 'oni': 5906,
 'free': 3604,
 'entry': 3195,
 '2': 471,
 'a': 1054,
 'wkly': 8933,
 'comp': 2386,
 'to': 8192,
 'win': 8890,
 'fa': 3328,
 'cup': 2608,
 'final': 3450,
 'tkts': 8180,
 '21st': 497,
 'may': 5272,
 '2005': 487,
 '.': 15,
 'text': 8020,
 '87121': 948,
 'receive': 6688,
 'question': 6574,
 '(': 9,
 'std': 7651,
 'txt': 8379,
 'rate': 6628,
 ')': 10,
 't': 7889,
 '&': 7,
 "c's": 2020,
 'apply': 1383,
 '08452810075': 115,
 'over': 6003,
 '18': 438,
 "'": 8,
 's': 6959,
 'dun': 3041,
 'say': 7034,
 'so': 7438,
 'early': 3069,
 'hor': 4207,
 'c': 2019,
 'already': 1268,
 'then': 

In [8]:
column_nums, terms = zip(*sorted(zip(tfidf.vocabulary_.values(), tfidf.vocabulary_.keys())))
terms


('!',
 '"',
 '#',
 '#150',
 '#5000',
 '$',
 '%',
 '&',
 "'",
 '(',
 ')',
 '*',
 '+',
 ',',
 '-',
 '.',
 '. .',
 '. . .',
 '. . . .',
 '. . . . .',
 '. ..',
 '..',
 '.. .',
 '.. . . .',
 '.. ... ...',
 '...',
 '... . . . .',
 '/',
 '0',
 '00',
 '00870405040',
 '0089',
 '01',
 '0121 2025050',
 '01223585236',
 '01223585334',
 '01256987',
 '02',
 '02/06',
 '02/09',
 '0207 153 9153',
 '0207 153 9996',
 '0207-083-6089',
 '02072069400',
 '02073162414',
 '02085076972',
 '03',
 '03530150',
 '04',
 '04/09',
 '05',
 '050703',
 '06',
 '06.05',
 '06/11',
 '07/11',
 '07008009200',
 '07046744435',
 '07090201529',
 '07090298926',
 '07099833605',
 '07123456789',
 '07732584351',
 '07734396839',
 '07742676969',
 '07753741225',
 '0776xxxxxxx',
 '07786200117',
 '077xxx',
 '078',
 '07801543489',
 '07808',
 '07808247860',
 '07808726822',
 '07815296484',
 '07821230901',
 '078498',
 '07880867867',
 '0789xxxxxxx',
 '07946746291',
 '0796xxxxxx',
 '07973788240',
 '07xxxxxxxxx',
 '08',
 '0800',
 '0800 0721072',
 '

In [9]:
weights = pd.DataFrame(pca.components_, columns=terms, index=['topic{}'.format(i) for i in range(16)])
pd.options.display.max_columns = 8
weights.head(4).round(3)


Unnamed: 0,!,"""",#,#150,...,…,┾,〨ud,鈥
topic0,-0.071,0.008,-0.001,-0.0,...,-0.002,0.001,0.001,0.001
topic1,0.063,0.008,0.0,-0.0,...,0.003,0.001,0.001,0.001
topic2,0.071,0.027,0.0,0.001,...,0.002,-0.001,-0.001,-0.001
topic3,-0.059,-0.033,-0.001,-0.0,...,0.001,0.001,0.001,0.001


Now we can see which terms contribute to each topic. We restrict ourselves to a few meaningful tokens:

In [10]:
pd.options.display.max_columns = 12
deals = weights['! ;) :) half off free crazy deal only $ 80 %'.split()].round(3) * 100
deals


Unnamed: 0,!,;),:),half,off,free,crazy,deal,only,$,80,%
topic0,-7.1,0.1,-0.5,-0.0,-0.4,-2.0,-0.0,-0.1,-2.2,0.3,-0.0,-0.0
topic1,6.3,0.0,7.4,0.1,0.4,-2.3,-0.2,-0.1,-3.8,-0.1,-0.0,-0.2
topic2,7.1,0.2,-0.1,0.0,0.3,4.4,0.1,-0.1,0.7,0.0,0.0,0.1
topic3,-5.9,-0.3,-7.1,0.2,0.3,-0.2,0.0,0.1,-2.3,0.1,-0.1,-0.3
topic4,38.0,-0.1,-12.4,-0.1,-0.2,9.9,0.1,-0.2,3.0,0.3,0.1,-0.1
topic5,-26.5,0.1,-1.6,-0.3,-0.7,-1.4,-0.6,-0.2,-1.8,-0.9,0.0,0.0
topic6,-10.9,-0.5,19.9,-0.4,-0.9,-0.6,-0.2,-0.1,-1.4,-0.0,-0.0,-0.1
topic7,16.2,0.1,-18.0,0.8,0.8,-2.9,0.0,0.1,-2.0,-0.3,0.0,-0.1
topic8,34.2,0.1,5.0,-0.4,-0.5,0.3,-0.4,-0.4,3.2,-0.6,-0.0,-0.2
topic9,7.5,-0.3,16.3,1.5,-0.9,6.3,-0.5,-0.4,3.1,-0.4,-0.0,0.0


**Note** The casual_tokenize tokenizer splits "80%" into ["80", "%"] and "$80 million" into ["$", "80", "million"]. So unless you use LSA or a 2-gram tokenizer, your NLP pipeline wouldn’t notice the difference between 80% and $80 million. They’d both share the token “80.”
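
The effect of 2-grams can be illustrated without any library. In this sketch the token lists are written by hand to mirror the splitting described in the note, and `ngrams` is a helper introduced only for this example:

```python
def ngrams(tokens, n=2):
    """Return the list of n-grams (as tuples) of a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

# Token lists written by hand to mirror the splitting described in the note.
percent_tokens = ['80', '%']
dollar_tokens = ['$', '80', 'million']

shared_unigrams = set(percent_tokens) & set(dollar_tokens)
shared_bigrams = set(ngrams(percent_tokens)) & set(ngrams(dollar_tokens))
print(shared_unigrams, shared_bigrams)
```

The two phrases overlap on the single token "80" but share no 2-gram, so a 2-gram vocabulary keeps them apart.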

In [11]:
deals.T.sum()

topic0    -11.9
topic1      7.5
topic2     12.7
topic3    -15.5
topic4     38.3
topic5    -33.9
topic6      4.8
topic7     -5.3
topic8     40.3
topic9     32.2
topic10   -29.1
topic11   -49.2
topic12    16.9
topic13    49.0
topic14    23.8
topic15     1.0
dtype: float64

## Yet another PCA semantic analysis

We prepare TF-IDF vectors and then use PCA to create semantic vectors and to look for interesting relations. The code is more extensive and closer to a realistic NLP pipeline:

In [12]:
from nltk.tokenize import casual_tokenize
import pandas as pd
import numpy as np
import os
from sklearn.feature_extraction.text import TfidfVectorizer
# from nltk.stem import PorterStemmer
from sklearn.decomposition import PCA



NUM_TOPICS = 3
NUM_WORDS = 6
NUM_DOCS = NUM_PRETTY = 16
SAVE_SORTED_CORPUS = ''  # 'cats_and_dogs_sorted.txt'
# import nltk
# nltk.download('wordnet')  # noqa
# from nltk.stem.wordnet import WordNetLemmatizer


# STOPWORDS = 'a an and or the do are with from for of on in by if at to into them'.split()
# STOPWORDS += 'to at it its it\'s that than our you your - -- " \' ? , . !'.split()
STOPWORDS = []

# SYNONYMS = dict(zip(
#     'wolv people person women woman man human he  we  her she him his hers'.split(),
#     'wolf her    her    her   her   her her   her her her her her her her'.split()))
# SYNONYMS.update(dict(zip(
#     'ate pat smarter have had isn\'t hasn\'t no  got get become been was were wa be sat seat sit'.split(),
#     'eat pet smart   has  has not    not     not has has is     is   is  is   is is sit sit  sit'.split())))
# SYNONYMS.update(dict(zip(
#     'i me my mine our ours catbird bird birds birder tortoise turtle turtles turtle\'s don\'t'.split(),
#     'i i  i  i    i   i    bird    bird birds bird   turtle   turtle turtle  turtle    not'.split())))
SYNONYMS = {}

stemmer = None  # PorterStemmer()

pd.options.display.width = 110
pd.options.display.max_columns = 14
pd.options.display.max_colwidth = 32




def normalize_corpus_words(corpus, stemmer=stemmer, synonyms=SYNONYMS, stopwords=STOPWORDS):
    docs = [doc.lower() for doc in corpus]
    docs = [casual_tokenize(doc) for doc in docs]
    docs = [[synonyms.get(w, w) for w in words if w not in stopwords] for words in docs]
    if stemmer:
        docs = [[stemmer.stem(w) for w in words if w not in stopwords] for words in docs]
        docs = [[synonyms.get(w, w) for w in words if w not in stopwords] for words in docs]
    docs = [' '.join(w for w in words if w not in stopwords) for words in docs]
    return docs


def tokenize(text, vocabulary, synonyms=SYNONYMS, stopwords=STOPWORDS):
    doc = normalize_corpus_words([text.lower()], synonyms=synonyms, stopwords=stopwords)[0]
    stems = [w for w in doc.split() if w in vocabulary]
    return stems


fun_words = vocabulary = 'cat dog apple lion nyc love big small'
fun_stems = normalize_corpus_words([fun_words])[0].split()[:NUM_WORDS]
fun_words = fun_words.split()


In [20]:
#corpus = get_data('cats_and_dogs_sorted')[:NUM_PRETTY]
with open("CatsAndDogs.txt", "r") as f:  # close the file automatically
    corpus = f.readlines()

docs = normalize_corpus_words(corpus)
tfidfer = TfidfVectorizer(min_df=1, max_df=.99, stop_words=None, token_pattern=r'(?u)\b\w+\b',
                          vocabulary=fun_stems)
tfidf_dense = pd.DataFrame(tfidfer.fit_transform(docs).todense())
id_words = [(i, w) for (w, i) in tfidfer.vocabulary_.items()]
tfidf_dense.columns = list(zip(*sorted(id_words)))[1]
tfidfer.use_idf = False
tfidfer.norm = None
bow_dense = pd.DataFrame(tfidfer.fit_transform(docs).todense())
bow_dense.columns = list(zip(*sorted(id_words)))[1]
bow_dense = bow_dense.astype(int)
tfidfer.use_idf = True
tfidfer.norm = 'l2'
bow_pretty = bow_dense.copy()
bow_pretty = bow_pretty[fun_stems]
bow_pretty['text'] = corpus
for col in fun_stems:
    bow_pretty.loc[bow_pretty[col] == 0, col] = ''
# print(bow_pretty)
word_tfidf_dense = pd.DataFrame(tfidfer.transform(fun_stems).todense())
word_tfidf_dense.columns = list(zip(*sorted(id_words)))[1]
word_tfidf_dense.index = fun_stems

tfidf_pretty = tfidf_dense.copy()
tfidf_pretty = tfidf_pretty[fun_stems]
tfidf_pretty = tfidf_pretty.round(2)
for col in fun_stems:
    tfidf_pretty.loc[tfidf_pretty[col] == 0, col] = ''


tfidf_pretty.head(10)

Unnamed: 0,cat,dog,apple,lion,nyc,love
0,,,0.79,,0.62,
1,,,0.79,,0.62,
2,,,,,0.66,0.75
3,,,0.79,,0.62,
4,,,0.79,,0.62,
5,,,1.0,,,
6,1.0,,,,,
7,0.5,,,0.87,,
8,0.54,,,,,0.84
9,,,,,0.66,0.75


Start semantic analysis using PCA:

In [23]:
tfidf_zeros = tfidf_dense.T.sum()[tfidf_dense.T.sum() == 0]
pcaer = PCA(n_components=NUM_TOPICS)

doc_topic_vectors = pd.DataFrame(pcaer.fit_transform(tfidf_dense.values), columns=['top{}'.format(i) for i in range(NUM_TOPICS)])
doc_topic_vectors['text'] = corpus
pd.options.display.max_colwidth = 55
doc_topic_vectors.round(1)

Unnamed: 0,top0,top1,top2,text
0,-0.1,-0.3,0.7,NYC is the Big Apple.\n
1,-0.1,-0.3,0.7,NYC is known as the Big Apple.\n
2,-0.1,-0.3,0.6,I love NYC!\n
3,-0.1,-0.3,0.7,I wore a hat to the Big Apple party in NYC.\n
4,-0.1,-0.3,0.7,Come to NYC. See the Big Apple!\n
...,...,...,...,...
258,-0.1,-0.2,-0.1,Are you a vet?\n
259,-0.1,-0.2,-0.1,My flowers are blooming.\n
260,-0.1,-0.2,-0.1,A single flower grew in Benji's grave.\n
261,-0.1,-0.2,-0.1,Char chased the squirrel.\n


Now we connect terms with topics and text:

In [24]:
word_topic_vectors = pd.DataFrame(pcaer.transform(word_tfidf_dense.values), columns=['top{}'.format(i) for i in range(NUM_TOPICS)])
word_topic_vectors.index = fun_stems

In [26]:
word_topic_vectors

Unnamed: 0,top0,top1,top2
cat,0.800488,0.29634,0.016234
dog,-0.555923,0.695405,0.018666
apple,-0.087553,-0.216128,0.117937
lion,-0.062965,-0.192958,-0.128977
nyc,-0.111172,-0.298512,0.880894
love,-0.085256,-0.247502,0.012591


This is exactly what we want - we associated terms with topics automatically.

### Return to SVD

We now return to the analysis of SVD properties. It will also show what PCA is based on.

In [56]:
U, Sigma, VT = np.linalg.svd(tfidf_dense.T)  # <1> Transpose the doc-word tfidf matrix, because SVD works on column vectors
S = pd.DataFrame(np.diag(Sigma.copy()))
doc_labels = ['doc{}'.format(i) for i in range(len(tfidf_dense))]
U_df = pd.DataFrame(U, index=fun_stems)
VT_df = pd.DataFrame(VT, index=doc_labels, columns=doc_labels)

We have the following matrices:

<img src = "Singular_value_decomposition_visualisation.svg">

#### U - left singular vectors (term-topic matrix)

The left singular vectors tell you how to "rotate" the TF-IDF vectors into the topic space, which is equivalent to creating topics.

The U matrix contains the term-topic matrix that tells you about “the company a word keeps.” This is the most important matrix for semantic analysis in NLP. The U matrix is called the “left singular vectors” because it contains row vectors that should be multiplied by a matrix of column vectors from the left. U is the cross-correlation between words and topics based on word co-occurrence in the same document. It’s a square matrix until you start truncating it (deleting columns). 

In [46]:
U_df

Unnamed: 0,0,1,2,3,4,5
cat,-0.991106,0.128944,0.019021,0.010894,0.024432,-0.00227
dog,-0.128622,-0.99143,0.022126,-0.004297,-0.000856,-0.00366
apple,-0.001022,-0.001375,-0.210788,0.091791,-0.001459,-0.97321
lion,-0.024051,0.004013,0.002123,0.003015,-0.999695,0.001343
nyc,-0.019346,-0.02046,-0.953934,0.196251,-0.000748,0.225173
love,-0.014555,0.001572,-0.211479,-0.976173,-0.003099,-0.046249


#### S - singular values

The Sigma or S matrix contains the topic “singular values” in a square diagonal matrix. The singular values tell you how much information is captured by each dimension in your new semantic (topic) vector space. A diagonal matrix has nonzero values only along the diagonal from the upper left to the lower right.

In [57]:
S

Unnamed: 0,0,1,2,3,4,5
0,6.342806,0.0,0.0,0.0,0.0,0.0
1,0.0,5.688725,0.0,0.0,0.0,0.0
2,0.0,0.0,3.502347,0.0,0.0,0.0
3,0.0,0.0,0.0,2.76255,0.0,0.0
4,0.0,0.0,0.0,0.0,2.115862,0.0
5,0.0,0.0,0.0,0.0,0.0,1.741329
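
One common way to quantify "how much information" each dimension captures is the share of the squared singular values. This reading is an assumption made for illustration; the values are copied from the diagonal of S above:

```python
import numpy as np

# Diagonal of S, copied from the output above.
sigma = np.array([6.342806, 5.688725, 3.502347, 2.76255, 2.115862, 1.741329])

share = sigma**2 / np.sum(sigma**2)      # squared singular values, normalized
cumulative = np.cumsum(share)
print(share.round(3))
print(cumulative.round(3))
```

The cumulative share helps when deciding how many topic dimensions to keep before truncating.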


#### $V^T$ right singular vectors


The $V^T$ matrix contains the “right singular vectors” as the rows of the document-document matrix (they are the columns of $V$). This gives you the shared meaning between documents, because it measures how often documents use the same topics in your new semantic model of the documents. It has as many rows and columns as you have documents in your small corpus.


In [51]:
VT_df.head(10)

Unnamed: 0,doc0,doc1,doc2,doc3,doc4,doc5,doc6,doc7,...,doc255,doc256,doc257,doc258,doc259,doc260,doc261,doc262
doc0,-0.002006,-0.002006,-0.003736,-0.002006,-0.002006,-0.000161,-0.156257,-0.081561,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
doc1,-0.002407,-0.002407,-0.002162,-0.002407,-0.002407,-0.000242,0.022667,0.011966,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
doc2,-0.215227,-0.215227,-0.224876,-0.215227,-0.215227,-0.060185,0.005431,0.003245,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
doc3,0.069942,0.069942,-0.219016,0.069942,0.069942,0.033227,0.003944,0.00292,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
doc4,-0.000761,-0.000761,-0.001335,-0.000761,-0.000761,-0.00069,0.011547,-0.403128,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
doc5,-0.360514,-0.360514,0.065218,-0.360514,-0.360514,-0.558889,-0.001304,1.4e-05,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
doc6,-0.134383,0.027968,0.053156,0.033837,0.021439,0.040278,0.97495,-0.00837,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
doc7,-0.003905,0.077428,0.165968,-0.328559,0.075929,0.141068,-0.008205,0.830735,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
doc8,-0.116764,-0.028473,-0.141993,-0.025281,-0.107706,0.219134,-0.012857,-0.004296,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
doc9,-0.078585,-0.078585,-0.251298,-0.078585,0.061376,0.137343,0.001279,0.000427,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


## Truncated SVD

<img src="TruncatedSVM.png">

Truncated SVD is implemented in the [scikit-learn library](https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.TruncatedSVD.html).
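
The idea of truncation can be sketched with plain NumPy: keep only the `k` largest singular values and the matching singular vectors. The matrix below is random and `k = 2` is an arbitrary choice:

```python
import numpy as np

rng = np.random.default_rng(7)
A = rng.random((8, 5))

U, s, Vt = np.linalg.svd(A, full_matrices=False)
k = 2                                             # number of topics to keep
A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]       # rank-k approximation

error = np.linalg.norm(A - A_k)                   # Frobenius norm
print(np.isclose(error, np.sqrt(np.sum(s[k:]**2))))
```

The approximation error is fully determined by the discarded singular values, which is why dropping the smallest ones loses the least information.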

## Truncated SVD for SMS semantic analysis

In [60]:
from sklearn.decomposition import TruncatedSVD
svd = TruncatedSVD(n_components=16, n_iter=100)
svd_topic_vectors = svd.fit_transform(tfidf_docs.values)
svd_topic_vectors = pd.DataFrame(svd_topic_vectors, columns=columns, index=index)
svd_topic_vectors.round(3).head(6)


Unnamed: 0,topic0,topic1,topic2,topic3,topic4,topic5,topic6,topic7,topic8,topic9,topic10,topic11,topic12,topic13,topic14,topic15
sms0,0.201,0.003,0.037,0.011,-0.019,-0.053,0.039,-0.066,0.012,-0.083,0.007,-0.007,0.002,-0.036,-0.014,0.037
sms1,0.404,-0.094,-0.078,0.051,0.1,0.047,0.023,0.065,0.023,-0.024,-0.004,0.036,0.043,-0.021,0.051,-0.042
sms2!,-0.03,-0.048,0.09,-0.067,0.091,-0.043,-0.0,-0.001,-0.057,0.051,0.125,0.023,0.026,-0.02,-0.042,0.052
sms3,0.329,-0.033,-0.035,-0.016,0.052,0.056,-0.166,-0.074,0.063,-0.108,0.022,0.023,0.073,-0.046,0.022,-0.07
sms4,0.002,0.031,0.038,0.034,-0.075,-0.093,-0.044,0.061,-0.045,0.029,0.028,-0.009,0.027,0.034,-0.083,-0.021
sms5!,-0.016,0.059,0.014,-0.006,0.122,-0.04,0.005,0.167,-0.023,0.064,0.041,0.055,-0.037,0.075,-0.001,0.02


We can now check how well TruncatedSVD works for spam analysis. Compute the cosine similarity between messages:

In [62]:
import numpy as np
svd_topic_vectors = (svd_topic_vectors.T / np.linalg.norm(\
svd_topic_vectors, axis=1)).T
svd_topic_vectors.iloc[:10].dot(svd_topic_vectors.iloc[:10].T).round(1)


Unnamed: 0,sms0,sms1,sms2!,sms3,sms4,sms5!,sms6,sms7,sms8!,sms9!
sms0,1.0,0.6,-0.1,0.6,-0.0,-0.3,-0.3,-0.1,-0.3,-0.3
sms1,0.6,1.0,-0.2,0.8,-0.2,0.0,-0.2,-0.2,-0.1,-0.1
sms2!,-0.1,-0.2,1.0,-0.2,0.1,0.4,0.0,0.3,0.5,0.4
sms3,0.6,0.8,-0.2,1.0,-0.2,-0.3,-0.1,-0.3,-0.2,-0.1
sms4,-0.0,-0.2,0.1,-0.2,1.0,0.2,0.0,0.1,-0.4,-0.2
sms5!,-0.3,0.0,0.4,-0.3,0.2,1.0,-0.1,0.1,0.3,0.4
sms6,-0.3,-0.2,0.0,-0.1,0.0,-0.1,1.0,0.1,-0.2,-0.2
sms7,-0.1,-0.2,0.3,-0.3,0.1,0.1,0.1,1.0,0.1,0.4
sms8!,-0.3,-0.1,0.5,-0.2,-0.4,0.3,-0.2,0.1,1.0,0.3
sms9!,-0.3,-0.1,0.4,-0.1,-0.2,0.4,-0.2,0.4,0.3,1.0


You should see larger positive cosine similarities (dot products) between pairs of spam messages, such as “sms2!” and “sms8!”, than between a spam message and a non-spam message.

This gives you an idea for spam detection.
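
One way to turn this idea into a detector is to give a new message the label of its most similar labelled message. This is only a sketch on made-up 3-D topic vectors, not the approach used in the rest of the notebook:

```python
import numpy as np

# Made-up 3-D topic vectors with known labels (1 = spam, 0 = not spam).
labelled = np.array([[0.9, 0.1, 0.0],
                     [0.8, 0.2, 0.1],
                     [-0.1, 0.9, 0.3],
                     [0.0, 0.7, 0.6]])
labels = np.array([1, 1, 0, 0])

def unit_rows(m):
    """Scale every row to length 1 so that dot products are cosines."""
    return m / np.linalg.norm(m, axis=1, keepdims=True)

new_message = np.array([[0.7, 0.0, 0.1]])
similarities = unit_rows(new_message) @ unit_rows(labelled).T
predicted_label = labels[similarities.argmax()]
print(similarities.round(2), predicted_label)
```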

## Latent Dirichlet allocation (LDiA)

[LDiA](https://en.wikipedia.org/wiki/Latent_Dirichlet_allocation) creates a semantic vector space model (like your topic vectors) using an approach similar to how your brain worked in the initial experiment: you manually allocated words to topics based on how often they occurred together in the same document. The topic mix for a document can then be determined from the words it contains and the topics those words were assigned to. This makes an LDiA topic model much easier to understand, because the words assigned to topics and the topics assigned to documents tend to make more sense than for LSA.


The LDiA approach was developed in 2000 by geneticists in the UK to help them “infer population structure” from sequences of genes. Stanford Researchers (including Andrew Ng) popularized the approach for NLP in 2003.

LDiA is implemented in the [scikit-learn library](https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.LatentDirichletAllocation.html).

#### LDiA example

Create BOW vectors:

In [63]:
from sklearn.feature_extraction.text import CountVectorizer
from nltk.tokenize import casual_tokenize
np.random.seed(42)
counter = CountVectorizer(tokenizer=casual_tokenize)
bow_docs = pd.DataFrame(counter.fit_transform(raw_documents=sms.text).toarray(), index=index)
column_nums, terms = zip(*sorted(zip(counter.vocabulary_.values(), counter.vocabulary_.keys())))
bow_docs.columns = terms
bow_docs

Unnamed: 0,!,"""",#,#150,#5000,$,%,&,...,—,‘,’,“,…,┾,〨ud,鈥
sms0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0
sms1,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0
sms2!,0,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0
sms3,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0
sms4,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
sms4832!,1,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0
sms4833,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0
sms4834,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0
sms4835,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0


In [65]:
sms.loc['sms0'].text


'Go until jurong point, crazy.. Available only in bugis n great world la e buffet... Cine there got amore wat...'

In [66]:
bow_docs.loc['sms0'][bow_docs.loc['sms0'] > 0].head()


,            1
..           1
...          2
amore        1
available    1
Name: sms0, dtype: int64

Fit the LDiA model:

In [68]:
from sklearn.decomposition import LatentDirichletAllocation as LDiA
ldia = LDiA(n_components=16, learning_method='batch')
ldia = ldia.fit(bow_docs)
ldia.components_.shape
 

(16, 9232)

In [71]:
pd.set_option('display.width', 75)
components = pd.DataFrame(ldia.components_.T, index=terms, columns=columns)
components.round(2).head(3)


Unnamed: 0,topic0,topic1,topic2,topic3,topic4,topic5,topic6,topic7,topic8,topic9,topic10,topic11,topic12,topic13,topic14,topic15
!,184.03,15.0,72.22,394.95,45.48,36.14,9.55,44.81,0.43,90.23,37.42,44.18,64.4,297.29,41.16,11.7
"""",0.68,4.22,2.41,0.06,152.35,0.06,0.06,0.06,0.45,0.68,8.42,11.42,0.07,62.72,12.27,0.06
#,0.06,0.06,0.06,0.06,0.06,2.07,0.06,0.06,0.06,0.06,0.06,0.06,1.07,4.05,0.06,0.06


In [72]:
components.topic3.sort_values(ascending=False)[:10]


!       394.952246
.       218.049724
to      119.533134
u       118.857546
call    111.948541
£       107.358914
,        96.954384
*        90.314783
your     90.215961
is       75.750037
Name: topic3, dtype: float64

Before you fit your LDA classifier, you need to compute these LDiA topic vectors for all your documents (SMS messages). And let’s see how they are different from the topic vectors produced by SVD and PCA for those same documents:


In [74]:
ldia16_topic_vectors = ldia.transform(bow_docs)
ldia16_topic_vectors = pd.DataFrame(ldia16_topic_vectors, index=index, columns=columns)
ldia16_topic_vectors.round(2).head(10)


Unnamed: 0,topic0,topic1,topic2,topic3,topic4,topic5,topic6,topic7,topic8,topic9,topic10,topic11,topic12,topic13,topic14,topic15
sms0,0.0,0.62,0.0,0.0,0.0,0.0,0.0,0.0,0.34,0.0,0.0,0.0,0.0,0.0,0.0,0.0
sms1,0.01,0.01,0.01,0.01,0.01,0.01,0.01,0.01,0.78,0.01,0.01,0.12,0.01,0.01,0.01,0.01
sms2!,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.98,0.0,0.0,0.0,0.0,0.0,0.0
sms3,0.0,0.0,0.0,0.0,0.09,0.0,0.0,0.0,0.85,0.0,0.0,0.0,0.0,0.0,0.0,0.0
sms4,0.39,0.0,0.33,0.0,0.0,0.0,0.14,0.0,0.0,0.0,0.0,0.0,0.09,0.0,0.0,0.0
sms5!,0.0,0.0,0.28,0.0,0.0,0.0,0.0,0.17,0.0,0.26,0.05,0.0,0.11,0.08,0.05,0.0
sms6,0.0,0.0,0.5,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.45,0.0,0.0
sms7,0.0,0.0,0.0,0.0,0.97,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
sms8!,0.57,0.0,0.0,0.16,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.25,0.0
sms9!,0.0,0.0,0.0,0.43,0.0,0.0,0.0,0.0,0.0,0.11,0.0,0.0,0.0,0.44,0.0,0.0


You can see that these topics are more cleanly separated. There are a lot of zeros in your allocation of topics to messages. This is one of the things that makes LDiA topics easier to explain to coworkers when making business decisions based on your NLP pipeline results.
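
One reason the allocations are easy to read is that scikit-learn's `LatentDirichletAllocation.transform` returns a normalized topic distribution for each document, so every row sums to 1. The sketch below checks this and measures the share of entries that round to zero. It uses a hypothetical toy corpus, so the sparsity it reports will not match the SMS data, and very short documents tend to stay close to the uniform prior (as `sms1` does above).

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation as LDiA

# Hypothetical toy corpus, not the SMS dataset
toy_docs = [
    "call now to win a free prize call now",
    "free prize call this number now free prize",
    "are you coming home for dinner tonight",
    "see you at home after dinner tonight",
    "win free cash now call to claim",
    "home late tonight save me some dinner",
]
toy_bow = CountVectorizer().fit_transform(toy_docs)

toy_vectors = LDiA(n_components=4, learning_method='batch',
                   random_state=0).fit_transform(toy_bow)

# Each row is a probability distribution over topics
row_sums = toy_vectors.sum(axis=1)
print("row sums:", row_sums.round(6))

# Share of topic weights that display as 0.0 after rounding to 2 decimals
near_zero_share = float((toy_vectors.round(2) == 0).mean())
print("share of entries rounding to zero:", round(near_zero_share, 2))
```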


#### LDiA + LDA = spam classifier

We use the LDiA topic vectors as input features for an LDA classifier:

In [76]:
from sklearn.model_selection import train_test_split
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis as LDA
X_train, X_test, y_train, y_test =  train_test_split(ldia16_topic_vectors, sms.spam, test_size=0.5, random_state=271828)
lda = LDA(n_components=1)
lda = lda.fit(X_train, y_train)
sms['ldia16_spam'] = lda.predict(ldia16_topic_vectors)
round(float(lda.score(X_test, y_test)), 2)


0.94

An accuracy of 94% on the test set is pretty good, but not quite as good as with PCA (see below).
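
Note that the cell above fits LDiA on all messages before splitting them into training and test sets. A stricter variant, sketched below as an assumption rather than as the method used in this notebook, chains the steps in a scikit-learn `Pipeline` so that the vocabulary and topics are learned from the training messages only. The messages and labels are hypothetical toy data.

```python
from sklearn.pipeline import make_pipeline
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation as LDiA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis as LDA

# Hypothetical toy messages and labels (1 = spam), not the SMS dataset
train_texts = [
    "win a free prize call now",
    "free cash prize claim now",
    "call now to claim your free prize",
    "are you coming home for dinner",
    "see you at home tonight",
    "dinner at home tonight then",
]
train_labels = [1, 1, 1, 0, 0, 0]
test_texts = ["claim your free prize", "coming home tonight"]

# BOW counts -> LDiA topic vectors -> LDA classifier, fitted in one go
spam_pipeline = make_pipeline(
    CountVectorizer(),
    LDiA(n_components=2, learning_method='batch', random_state=0),
    LDA(),
)
spam_pipeline.fit(train_texts, train_labels)

# Test messages are transformed with the topics learned from training data
test_predictions = spam_pipeline.predict(test_texts)
print(test_predictions)
```

Because LDiA topic vectors sum to 1, the features are collinear and LDA may emit a warning about that; its default SVD solver still fits.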
 

We compare it with LDA on TF-IDF vectors, i.e., without LDiA:

In [78]:
from sklearn.feature_extraction.text import TfidfVectorizer
from nltk.tokenize.casual import casual_tokenize
tfidf = TfidfVectorizer(tokenizer=casual_tokenize)
tfidf_docs = tfidf.fit_transform(raw_documents=sms.text).toarray()
tfidf_docs = tfidf_docs - tfidf_docs.mean(axis=0)

X_train, X_test, y_train, y_test = train_test_split(tfidf_docs,sms.spam.values, test_size=0.5, random_state=271828)
lda = LDA(n_components=1)
lda = lda.fit(X_train, y_train)
print("score on train data = ", round(float(lda.score(X_train, y_train)), 3))
print("score for test data = ", round(float(lda.score(X_test, y_test)), 3))


score on train data =  1.0
score for test data =  0.748


#### 32 LDiA topics

Above we used 16 topics. Now we increase the number of topics to 32:

In [79]:
ldia32 = LDiA(n_components=32, learning_method='batch')
ldia32 = ldia32.fit(bow_docs)
ldia32.components_.shape

(32, 9232)

In [80]:
ldia32_topic_vectors = ldia32.transform(bow_docs)
columns32 = ['topic{}'.format(i) for i in range(ldia32.n_components)]
ldia32_topic_vectors = pd.DataFrame(ldia32_topic_vectors, index=index, columns=columns32)
ldia32_topic_vectors.round(2).head()


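| | topic0 | topic1 | topic2 | topic3 | topic4 | topic5 | topic6 | topic7 | ... | topic24 | topic25 | topic26 | topic27 | topic28 | topic29 | topic30 | topic31 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| sms0 | 0.0 | 0.0 | 0.0 | 0.06 | 0.14 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| sms1 | 0.0 | 0.0 | 0.0 | 0.0 | 0.53 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.14 | 0.0 | 0.0 |
| sms2! | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.65 | 0.0 | 0.0 | ... | 0.0 | 0.33 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| sms3 | 0.0 | 0.11 | 0.0 | 0.0 | 0.39 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| sms4 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.09 | 0.0 | 0.0 | 0.47 | 0.0 | 0.0 | 0.0 | 0.0 |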


In [81]:
X_train, X_test, y_train, y_test = train_test_split(ldia32_topic_vectors, sms.spam, test_size=0.5, random_state=271828)
lda = LDA(n_components=1)
lda = lda.fit(X_train, y_train)
sms['ldia32_spam'] = lda.predict(ldia32_topic_vectors)

print("score on train data = ", round(float(lda.score(X_train, y_train)), 3))
print("score on test data = ",  round(float(lda.score(X_test, y_test)), 3))



score on train data =  0.933
score on test data =  0.936


## LDA Spam classifier again

We again fit LDA on the TF-IDF vectors, this time using cross-validation for a better estimate of the score:

In [82]:
from sklearn.model_selection import cross_val_score
lda = LDA(n_components=1)
scores = cross_val_score(lda, tfidf_docs, sms.spam, cv=5)
"Accuracy: {:.2f} (+/-{:.2f})".format(scores.mean(), scores.std() * 2)


'Accuracy: 0.77 (+/-0.02)'

#### LSA + LDA SMS spam classification

We use PCA data:

In [87]:
X_train, X_test, y_train, y_test = train_test_split(pca_topic_vectors.values, sms.spam, test_size=0.3,random_state=271828)
lda = LDA(n_components=1)
lda.fit(X_train, y_train)
# Hold-out accuracy, rounded the same way as in the earlier cells
print("score on test data = ", round(float(lda.score(X_test, y_test)), 3))

lda = LDA(n_components=1)
scores = cross_val_score(lda, pca_topic_vectors, sms.spam, cv=10)
"Accuracy: {:.3f} (+/-{:.3f})".format(scores.mean(), scores.std() * 2)


score on test data =  0.963


'Accuracy: 0.957 (+/-0.022)'

This is a higher accuracy than we obtained with the Latent Dirichlet Allocation (LDiA) topic vectors.

## Topic vs Semantic Search

When you search for a document based on a word or partial word it contains, that’s
called **full text search**. This is what search engines do. They break a document into
chunks (usually words) that can be indexed with an inverted index like you’d find at the
back of a textbook. It takes a lot of bookkeeping and guesswork to deal with spelling
errors and typos, but it works pretty well.
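
To make the idea of an inverted index concrete, here is a minimal sketch in plain Python. It maps each token to the set of documents that contain it and answers a query by intersecting those sets. The function names and the toy documents are hypothetical, and a real search engine would add normalization, spelling correction and ranking on top.

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each lowercase token to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in enumerate(docs):
        for token in text.lower().split():
            index[token].add(doc_id)
    return index

def search(index, query):
    """Return ids of documents that contain every token of the query."""
    postings = [index.get(token, set()) for token in query.lower().split()]
    return set.intersection(*postings) if postings else set()

# Hypothetical toy documents
toy_docs = [
    "free prize call now",
    "call me when you get home",
    "win a free holiday",
]
toy_index = build_inverted_index(toy_docs)

print(search(toy_index, "call"))       # {0, 1}
print(search(toy_index, "free call"))  # {0}
print(search(toy_index, "pizza"))      # set()
```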


[**Semantic search**](https://en.wikipedia.org/wiki/Semantic_search) is full text search that takes into account the meaning of the words in your query and the documents you’re searching. You’ve learned two ways to compute topic vectors that capture the semantics (meaning) of words and documents in a vector:
- LSA
- LDiA

One of the reasons that latent semantic analysis was first called latent semantic indexing is that it promised to power semantic search with an index of numerical values, like BOW and TF-IDF tables.

Semantic search was the next big thing in information retrieval.

Unlike BOW and TF-IDF tables, tables of semantic vectors can’t be easily discretized and indexed using traditional inverted index techniques. Traditional indexing approaches work with binary word occurrence vectors, discrete vectors (BOW vectors), sparse continuous vectors (TF-IDF vectors), and low-dimensional continuous
vectors (3D GIS data). 

But high-dimensional continuous vectors, such as topic vectors from LSA or LDiA, are a challenge. Inverted indexes work for discrete vectors or binary vectors, like tables of binary or integer word-document vectors, because the index only needs to maintain an entry for each nonzero discrete dimension. The value of that dimension is either present or not present in the referenced vector or document. Because TF-IDF vectors are sparse (mostly zero), you don’t need an entry in your index for most dimensions for most documents.

<img src="SemanticSearch.png">

Note that semantic search gets worse at around 12 dimensions.
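
Since dense topic vectors cannot be looked up in an inverted index, the simplest baseline is a brute-force scan: compute the cosine similarity between the query vector and every document vector and keep the best matches. The sketch below does this with NumPy on random stand-in vectors. The function name `cosine_search` and the toy data are hypothetical, and the cost grows linearly with the number of documents.

```python
import numpy as np

def cosine_search(query_vec, doc_vecs, top_n=3):
    """Return indices and cosine similarities of the top_n closest rows."""
    doc_norms = np.linalg.norm(doc_vecs, axis=1)
    query_norm = np.linalg.norm(query_vec)
    # Small constant guards against division by zero for all-zero vectors
    sims = doc_vecs @ query_vec / (doc_norms * query_norm + 1e-12)
    best = np.argsort(-sims)[:top_n]
    return best, sims[best]

# Random stand-in for 100 documents with 16-dimensional topic vectors
rng = np.random.default_rng(0)
toy_topic_vectors = rng.random((100, 16))

# Use one of the documents as the query: it should be its own best match
query = toy_topic_vectors[42]
best_ids, best_sims = cosine_search(query, toy_topic_vectors)
print(best_ids, best_sims.round(3))
```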

## Distances and similarities

**Distance** measures how far apart two vectors are. It can be the ordinary Euclidean distance or any other 'distance' that measures dissimilarity between vectors.

Distance is associated with the mathematical notion of a [metric](https://en.wikipedia.org/wiki/Metric_(mathematics)).

A metric must have the following four properties:
- Nonnegativity: metrics can never be negative. metric(A, B) >= 0
- Indiscernibility: two objects are identical if the metric between them is zero. if metric(A, B) == 0: assert(A == B)
- Symmetry: metrics don’t care about direction. metric(A, B) = metric(B, A)
- Triangle inequality: you can’t get from A to C faster by going through B in-between. metric(A, C) <= metric(A, B) + metric(B, C)
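
Not every commonly used 'distance' is a metric in this strict sense. The sketch below checks the triangle inequality on three 2D points: the Euclidean distance passes, while the cosine distance (1 minus cosine similarity) fails. The helper names are hypothetical.

```python
import numpy as np

def euclidean(a, b):
    """Ordinary Euclidean (L2) distance."""
    return float(np.linalg.norm(np.asarray(a, dtype=float) - np.asarray(b, dtype=float)))

def cosine_distance(a, b):
    """1 minus the cosine similarity of the two vectors."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return 1.0 - float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def satisfies_triangle(dist, a, b, c, tol=1e-12):
    """Check dist(a, c) <= dist(a, b) + dist(b, c) for one triple of points."""
    return dist(a, c) <= dist(a, b) + dist(b, c) + tol

A, B, C = [1.0, 0.0], [1.0, 1.0], [0.0, 1.0]

euclid_ok = satisfies_triangle(euclidean, A, B, C)
cosine_ok = satisfies_triangle(cosine_distance, A, B, C)

print("euclidean:", euclid_ok)  # True:  1.414 <= 1 + 1
print("cosine:   ", cosine_ok)  # False: 1.0 > 0.293 + 0.293
```

Cosine distance is still useful for comparing document vectors, but algorithms that rely on the triangle inequality may not behave as expected with it.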


**Similarity** is the counterpart of distance: it ranges between 0.0 (totally dissimilar) and 1.0 (totally similar).


Typical conversion formula:
- similarity = 1. / (1. + distance)
- distance = (1. / similarity) - 1.

For distances and similarities restricted to the interval $[0, 1]$, the conversion formulas are simpler:
- similarity = 1. - distance
- distance = 1. - similarity

Example:
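
A minimal sketch of the first pair of conversion formulas above, showing that they are inverses of each other. The function names are hypothetical.

```python
def distance_to_similarity(distance):
    """Map a distance in [0, inf) to a similarity in (0, 1]."""
    return 1.0 / (1.0 + distance)

def similarity_to_distance(similarity):
    """Inverse mapping: a similarity in (0, 1] back to a distance."""
    return 1.0 / similarity - 1.0

for d in (0.0, 1.0, 3.0):
    s = distance_to_similarity(d)
    print(d, s, similarity_to_distance(s))
# 0.0 1.0 0.0
# 1.0 0.5 1.0
# 3.0 0.25 3.0
```

A distance of zero gives the maximum similarity of 1.0, and the similarity approaches 0.0 as the distance grows.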

Standard distance metrics are available in the [scikit-learn library](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.pairwise_distances.html):

- 'cityblock'
- 'cosine'
- 'euclidean'
- 'l1'
- 'l2'
- 'manhattan'
- 'braycurtis'
- 'canberra'
- 'chebyshev'
- 'correlation'
- 'dice'
- 'hamming'
- 'jaccard'
- 'kulsinski'
- 'mahalanobis'
- 'matching'
- 'minkowski'
- 'rogerstanimoto'
- 'russellrao'
- 'seuclidean'
- 'sokalmichener'
- 'sokalsneath'
- 'sqeuclidean'
- 'yule'

Each metric has its own specific applications, and there is no one-size-fits-all solution.
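
To see how the choice of metric changes the answer, the sketch below computes pairwise distances between three hypothetical 2D vectors with scikit-learn's `pairwise_distances`, using three of the metric names listed above. The first two vectors point in the same direction but differ in length, so they are 1.0 apart by Euclidean distance and 0.0 apart by cosine distance.

```python
import numpy as np
from sklearn.metrics import pairwise_distances

# Hypothetical toy vectors: rows 0 and 1 are parallel, row 2 is orthogonal
toy_vectors = np.array([
    [1.0, 0.0],
    [2.0, 0.0],
    [0.0, 3.0],
])

# One square distance matrix per metric
distances = {
    metric: pairwise_distances(toy_vectors, metric=metric)
    for metric in ('euclidean', 'manhattan', 'cosine')
}

for metric, matrix in distances.items():
    print(metric)
    print(matrix.round(3))
```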