# WMD and Cosine Similarity: General Examples

GENERAL examples of WMD and cosine similarity:
1. Example 1: WMD, based on https://towardsdatascience.com/word-distance-between-word-embeddings-cc3e9cf1d632 with source code at https://github.com/makcedward/nlp/blob/master/sample/nlp-word_mover_distance.ipynb
2. Example 2: Cosine similarity & TF-IDF, from https://leantechblog.wordpress.com/2020/08/23/como-estimar-la-similitud-entre-documentos-con-python/ with source code at https://github.com/cjcarvajal/text-similarity-obama-trump
3. Example 3: WMD using spaCy, based on https://stackoverflow.com/questions/54535535/how-to-improve-word-mover-distance-similarity-in-python-and-provide-similarity-s (the code is from the first answer).
4. Example 4: a very complete WMD example: https://github.com/Seif-Tarek/Document-Similarity-using-Word-Mover-Distance/blob/master/WMD_TextSimilarity.ipynb

### Example 1: WMD

##### Word Mover's Distance (WMD) was proposed for measuring the distance between two documents (or sentences). It leverages the power of word embeddings to overcome the limitations of basic distance measures.
WMD was introduced by Kusner et al. in 2015. Instead of Euclidean distance or other bag-of-words based distance measures, they proposed using word embeddings to compute similarity. To be precise, it combines a normalized bag-of-words representation with word embeddings to compute the distance between documents.
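
For reference, the optimization problem behind WMD can be written as below. This is my reading of the formulation in Kusner et al.; the symbols are theirs, not this notebook's: $d$ and $d'$ are the normalized bag-of-words vectors of the two documents, $x_i$ is the embedding of word $i$, $n$ is the vocabulary size, and $T_{ij}$ is how much of word $i$ in the first document travels to word $j$ in the second.

$$
\mathrm{WMD}(d, d') = \min_{T \ge 0} \sum_{i=1}^{n} \sum_{j=1}^{n} T_{ij}\, c(i, j)
\quad \text{subject to} \quad
\sum_{j=1}^{n} T_{ij} = d_i \;\; \forall i,
\qquad
\sum_{i=1}^{n} T_{ij} = d'_j \;\; \forall j,
$$

$$
c(i, j) = \lVert x_i - x_j \rVert_2 .
$$

The first constraint says that all the weight of each word in the first document must leave it; the second says that each word in the second document must receive exactly its own weight.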

In the previous blog post, I shared simple ways to measure the "similarity" between two documents (or sentences). Euclidean distance, cosine distance and Jaccard similarity were introduced there, but they have some limitations. WMD is designed to __overcome the synonym problem__.

The typical example is:
- Sentence 1: Obama speaks to the media in Illinois
- Sentence 2: The president greets the press in Chicago

Apart from stop words, the two sentences have no words in common, yet both are talking about the same topic (at that time).

WMD uses word embeddings to calculate the distance, so it can produce a meaningful value even when the sentences share no words. The assumption is that similar words have similar vectors.

First of all, lowercasing and removing stop words is an essential step to reduce complexity and avoid misleading matches.
- Sentence 1: obama speaks media illinois
- Sentence 2: president greets press chicago
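
To see why word-overlap measures struggle with this pair, here is a minimal sketch of the preprocessing step followed by a Jaccard similarity check. The tiny `STOP_WORDS` set and the helper names `preprocess` and `jaccard` are assumptions made for illustration; a real pipeline would use a full stop-word list.

```python
# Tiny hand-written stop-word list: an assumption for this sketch only.
STOP_WORDS = {"to", "the", "in"}

def preprocess(sentence):
    """Lowercase the sentence and drop stop words."""
    return [word for word in sentence.lower().split() if word not in STOP_WORDS]

def jaccard(tokens_a, tokens_b):
    """Jaccard similarity: shared words divided by all distinct words."""
    set_a, set_b = set(tokens_a), set(tokens_b)
    return len(set_a & set_b) / len(set_a | set_b)

tokens_1 = preprocess("Obama speaks to the media in Illinois")
tokens_2 = preprocess("The president greets the press in Chicago")

print(tokens_1)  # ['obama', 'speaks', 'media', 'illinois']
print(tokens_2)  # ['president', 'greets', 'press', 'chicago']
print(jaccard(tokens_1, tokens_2))  # 0.0
```

With no shared tokens the overlap is zero, even though a human reader sees two closely related sentences.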

Next, retrieve the vectors from any pre-trained word embedding model. It can be GloVe, word2vec, fastText or custom vectors. WMD then uses a normalized bag-of-words (nBOW) to represent the weight, or importance, of each word. It assumes that a higher frequency implies greater importance.
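
The nBOW weighting can be sketched in a few lines. This is an illustration under my own naming (`nbow` is not a gensim function): each distinct word gets its count divided by the total number of tokens, so the weights of a document sum to 1.

```python
from collections import Counter

def nbow(tokens):
    """Map each distinct word to its count divided by the document length."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return {word: count / total for word, count in counts.items()}

weights = nbow(["obama", "speaks", "media", "illinois"])
print(weights)  # every word gets 0.25

# Hypothetical document with a repeated word: "chicago" gets twice the weight of "press".
repeated = nbow(["chicago", "press", "chicago"])
print(repeated)
```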

Every word in sentence 1 is allowed to travel to every word in sentence 2, because the algorithm does not know in advance that "obama" should be moved to "president". In the end, it chooses the flow with the minimum total transportation cost for moving every word from sentence 1 to sentence 2.
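
The transport idea can be illustrated on the example pair with a brute-force search. This is a sketch of a special case, not the general algorithm and not what gensim does internally: it assumes both documents have the same number of distinct words with equal nBOW weights, in which case an optimal flow is a one-to-one matching and all matchings can simply be enumerated. The 2-D vectors in `toy_vectors` are invented for illustration; real embeddings have tens or hundreds of dimensions.

```python
from itertools import permutations
from math import dist

# Hypothetical 2-D "embeddings", invented so that related words sit close together.
toy_vectors = {
    "obama": (1.0, 0.0), "president": (1.0, 0.2),
    "speaks": (0.0, 1.0), "greets": (0.2, 1.0),
    "media": (-1.0, 0.0), "press": (-1.0, 0.2),
    "illinois": (0.0, -1.0), "chicago": (0.2, -1.0),
}

def wmd_uniform(doc_a, doc_b, vectors):
    """Minimum transport cost when both docs have equally many, equally weighted words."""
    assert len(doc_a) == len(doc_b), "this sketch only covers equal-length documents"
    weight = 1.0 / len(doc_a)  # nBOW weight of every word
    best_cost, best_pairs = float("inf"), None
    # Try every one-to-one assignment of words in doc_a to words in doc_b.
    for perm in permutations(doc_b):
        cost = sum(weight * dist(vectors[a], vectors[b]) for a, b in zip(doc_a, perm))
        if cost < best_cost:
            best_cost, best_pairs = cost, list(zip(doc_a, perm))
    return best_cost, best_pairs

sentence_1 = ["obama", "speaks", "media", "illinois"]
sentence_2 = ["president", "greets", "press", "chicago"]

distance, pairs = wmd_uniform(sentence_1, sentence_2, toy_vectors)
print(round(distance, 4))  # 0.2
print(pairs)  # obama -> president, speaks -> greets, media -> press, illinois -> chicago
```

Enumerating permutations grows factorially with document length, so this only works for toy inputs; the general problem with unequal weights needs a proper optimal-transport solver.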

##### WMD Implementation
With gensim, we only need to provide two lists of tokens and it takes care of the rest of the calculation.

In [6]:
"""
    News headlines taken from
    
    https://www.reuters.com/article/us-musk-tunnel/elon-musks-boring-co-to-build-high-speed-airport-link-in-chicago-idUSKBN1JA224
    http://money.cnn.com/2018/06/14/technology/elon-musk-boring-company-chicago/index.html
    https://www.theverge.com/2018/6/13/17462496/elon-musk-boring-company-approved-tunnel-chicago

"""

news_headline1 = "Elon Musk's Boring Co to build high-speed airport link in Chicago"
news_headline2 = "Elon Musk's Boring Company to build high-speed Chicago airport link"
news_headline3 = "Elon Musk’s Boring Company approved to build high-speed transit between downtown Chicago and O’Hare Airport"
news_headline4 = "Both apple and orange are fruit"

news_headlines = [news_headline1, news_headline2, news_headline3, news_headline4]

In [7]:
# Load Word Embedding Model
import gensim
from gensim.models.keyedvectors import KeyedVectors

print('gensim version: %s' % gensim.__version__)
#glove_model = gensim.models.KeyedVectors.load_word2vec_format('../model/text/stanford/glove/glove.6B.50d.vec')
# The raw GloVe .txt file has no word2vec header line (vocabulary size and dimensions).
# Without no_header=True (a gensim 4.x option), this call fails with:
#   ValueError: invalid literal for int() with base 10: 'the'
glove_model = KeyedVectors.load_word2vec_format('/home/fedricio/Desktop/Glove_Word_Emb/glove.6B.50d.txt', binary=False, no_header=True)

gensim version: 4.0.0

In [None]:
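# Remove stopwords
import spacy
# The 'en' shortcut was removed in spaCy 3; load the model by its full name
# (install it with: python -m spacy download en_core_web_sm)
spacy_nlp = spacy.load('en_core_web_sm')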

headline_tokens = []
for news_headline in news_headlines:
    headline_tokens.append([token.text.lower() for token in spacy_nlp(news_headline) if not token.is_stop])

print(headline_tokens)

In [None]:
subject_headline = news_headlines[0]
subject_token = headline_tokens[0]

print('Headline: ', subject_headline)
print('=' * 50)
print()

for token, headline in zip(headline_tokens, news_headlines):
    print('-' * 50)
    print('Comparing to:', headline)
    distance = glove_model.wmdistance(subject_token, token)
    print('distance = %.4f' % distance)

In the gensim implementation, out-of-vocabulary (OOV) words are removed, so it neither throws an exception nor falls back to a random vector.
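
The OOV filtering amounts to something like the sketch below. This is not gensim's code: `drop_oov` is an illustrative helper, and the `vocabulary` set is a hypothetical stand-in for the words the embedding model knows.

```python
def drop_oov(tokens, vocabulary):
    """Keep only the tokens that have a vector in the vocabulary."""
    return [token for token in tokens if token in vocabulary]

# Hypothetical vocabulary standing in for the embedding model's known words.
vocabulary = {"elon", "musk", "boring", "chicago", "airport"}

tokens = ["elon", "musk", "hyperloopish", "chicago"]
print(drop_oov(tokens, vocabulary))  # ['elon', 'musk', 'chicago']
```

If every token of a document is dropped, nothing is left to transport; as far as I know, gensim returns an infinite distance in that case, so it is worth checking for it before ranking results.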

### Example 2: Cosine Similarity & TF-IDF

In [3]:
from string import punctuation
from nltk.corpus import stopwords
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.feature_extraction.text import TfidfVectorizer

language_stopwords = stopwords.words('english')
non_words = list(punctuation)

def remove_stop_words(dirty_text):
    cleaned_text = ''
    for word in dirty_text.split():
        if word in language_stopwords or word in non_words:
            continue
        else:
            cleaned_text += word + ' '
    return cleaned_text

def remove_punctuation(dirty_string):
    for word in non_words:
        dirty_string = dirty_string.replace(word, '')
    return dirty_string

def process_file(file_name):
    # Read the whole file and close it when done
    with open(file_name, "r") as file:
        file_content = file.read()
    # All to lower case
    file_content = file_content.lower()
    # Remove punctuation and English stopwords
    file_content = remove_punctuation(file_content)
    file_content = remove_stop_words(file_content)
    return file_content

nlp_article = process_file("Archivos 1-General/Ejemplo2/nlp.txt")
sentiment_analysis_article = process_file("Archivos 1-General/Ejemplo2/sentiment_analysis.txt")
java_certification_article = process_file("Archivos 1-General/Ejemplo2/java_cert.txt")

# TF-IDF
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform([nlp_article, sentiment_analysis_article, java_certification_article])
#X = count.fit_transform([nlp_article,sentiment_analysis_article,java_certification_article])
similarity_matrix = cosine_similarity(X, X)

print('----------------------------------')
print('Leantechblog article similarity:')
print('----------------------------------')
print(similarity_matrix)

michelle_speech = process_file("Archivos 1-General/Ejemplo2/michelle_speech.txt")
melania_speech = process_file("Archivos 1-General/Ejemplo2/melania_speech.txt")

# TF-IDF
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform([michelle_speech, melania_speech])
similarity_matrix = cosine_similarity(X, X)

print('-----------------------------------------')
print('Melania and Michelle speeches similarity:')
print('-----------------------------------------')
print(similarity_matrix)

----------------------------------
Leantechblog article similarity:
----------------------------------
[[1.         0.217227   0.05744137]
 [0.217227   1.         0.04773379]
 [0.05744137 0.04773379 1.        ]]
-----------------------------------------
Melania and Michelle speeches similarity:
-----------------------------------------
[[1.         0.29814417]
 [0.29814417 1.        ]]
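
To help read these matrices, here is a minimal from-scratch sketch of cosine similarity on plain Python lists. It is not how scikit-learn computes it internally, and the `cosine` helper and the small vectors are made up for illustration. Each matrix entry above is this value for one pair of TF-IDF document vectors, which is why the diagonal is 1 and the matrix is symmetric.

```python
from math import sqrt

def cosine(u, v):
    """Cosine of the angle between two non-zero vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = sqrt(sum(a * a for a in u))
    norm_v = sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

print(cosine([1.0, 0.0], [1.0, 0.0]))  # 1.0: same direction
print(cosine([1.0, 0.0], [0.0, 1.0]))  # 0.0: no shared terms
print(cosine([1.0, 1.0], [1.0, 0.0]))  # about 0.707: partial overlap
```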


### Example 3: WMD using spaCy

In [3]:
# To run this, install spaCy: conda install -c conda-forge spacy
# And the model loaded below: python -m spacy download en_core_web_lg

import spacy
spacy_nlp = spacy.load('en_core_web_lg')
text = "Some hotel description"
doc = spacy_nlp(text)
current_tokens = [token.text for token in doc]
# Optionally remove tokens of a given entity type:
# for item in doc:
#     if item.ent_type_ == "the_type_to_be_removed":
#         remove word from `current_tokens` list
new_text = " ".join(current_tokens)
doc = spacy_nlp(new_text)

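# Install wmd: pip install wmd
import wmd
# Note: passing a component object to add_pipe is the spaCy 2.x API;
# spaCy 3.x expects the name of a registered component instead.
spacy_nlp.add_pipe(wmd.WMD.SpacySimilarityHook(spacy_nlp), last=True)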
doc_2 = spacy_nlp("Another hotel description")
print(doc.similarity(doc_2))

0.9536311618244545


### Example 4: A very complete WMD example