# WMD and Cosine Similarity: General Examples

General examples of WMD and Cosine Similarity:
1. Example 1: WMD, based on https://towardsdatascience.com/word-distance-between-word-embeddings-cc3e9cf1d632 with source code at https://github.com/makcedward/nlp/blob/master/sample/nlp-word_mover_distance.ipynb
2. Example 2: Cosine Similarity & TF-IDF, from https://leantechblog.wordpress.com/2020/08/23/como-estimar-la-similitud-entre-documentos-con-python/ with source code at https://github.com/cjcarvajal/text-similarity-obama-trump
3. Example 3: WMD using spaCy, based on https://stackoverflow.com/questions/54535535/how-to-improve-word-mover-distance-similarity-in-python-and-provide-similarity-s (the code is from the first answer).
4. Example 4: a very complete WMD example: https://github.com/Seif-Tarek/Document-Similarity-using-Word-Mover-Distance/blob/master/WMD_TextSimilarity.ipynb
5. Example 5: Keyword extraction using TF*IDF: https://www.analyticsvidhya.com/blog/2020/11/words-that-matter-a-simple-guide-to-keyword-extraction-in-python/

### Example 1: WMD

##### Word Mover's Distance (WMD) was proposed for measuring the distance between two documents (or sentences). It leverages the power of word embeddings to overcome the limitations of basic distance measures.
WMD was introduced by Kusner et al. in 2015. Instead of Euclidean distance and other bag-of-words based distance measures, they proposed using word embeddings to calculate similarity. To be precise, it uses normalized bag-of-words and word embeddings to calculate the distance between documents.

In the previous blog post, I shared simple ways to find the "similarity" between two documents (or sentences). Euclidean Distance, Cosine Distance and Jaccard Similarity were introduced there, but they have some limitations. WMD is designed to __overcome the synonym problem__.

The typical example is 
- Sentence 1: Obama speaks to the media in Illinois
- Sentence 2: The president greets the press in Chicago

Apart from the stop words, the two sentences have no words in common, yet both are talking about the same topic (at that time).

WMD uses word embeddings to calculate the distance, so it can produce a meaningful result even when there are no common words. The assumption is that similar words have similar vectors.

First of all, lowercasing and removing stop words is an essential step to reduce complexity and avoid misleading results.
- Sentence 1: obama speaks media illinois
- Sentence 2: president greets press chicago

Retrieve the vectors from any pre-trained word embedding model. It can be GloVe, word2vec, fastText or custom vectors. After that, normalized bag-of-words (nBOW) is used to represent the weight or importance of each word. It assumes that a higher frequency implies greater importance.
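As a quick illustration of the nBOW weights, here is a minimal sketch. The helper name `nbow` is made up for this note; it only divides each word count by the total number of tokens.

```python
from collections import Counter

def nbow(tokens):
    """Return the normalized bag-of-words weight of every distinct token."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return {word: count / total for word, count in counts.items()}

# Sentence 1 after lowercasing and stop-word removal
weights = nbow(["obama", "speaks", "media", "illinois"])
print(weights)

# A repeated word gets a proportionally larger weight
repeated = nbow(["python", "python", "easy"])
print(repeated)
```

Every word of sentence 1 appears once, so all four words share the same weight; in the second call "python" carries two thirds of the total.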

Every word in sentence 1 is allowed to transfer to every word in sentence 2, because the algorithm does not know in advance that "obama" should transfer to "president". In the end it chooses the minimum transportation cost to move every word from sentence 1 to sentence 2.
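To make the "minimum transportation cost" idea concrete, here is a toy sketch. The 2-D vectors are invented for illustration (they are not real embeddings), and the search is restricted to one-to-one matchings, which is enough only because both sentences here have the same number of words with equal nBOW weight. A real implementation such as gensim's `wmdistance` solves the general transport problem.

```python
from itertools import permutations
from math import dist

# Hypothetical 2-D "embeddings": related words are placed close together
toy_vectors = {
    "obama": (1.0, 0.0), "president": (0.9, 0.1),
    "speaks": (0.0, 1.0), "greets": (0.1, 0.9),
    "media": (-1.0, 0.0), "press": (-0.9, 0.1),
    "illinois": (0.0, -1.0), "chicago": (0.1, -0.9),
}

doc1 = ["obama", "speaks", "media", "illinois"]
doc2 = ["president", "greets", "press", "chicago"]

def matching_cost(perm):
    """Cost of sending doc1[i] to doc2[perm[i]]; each word carries weight 1/len(doc1)."""
    total = sum(dist(toy_vectors[a], toy_vectors[doc2[j]]) for a, j in zip(doc1, perm))
    return total / len(doc1)

# Try every one-to-one assignment and keep the cheapest one
best = min(permutations(range(len(doc2))), key=matching_cost)
pairs = [(a, doc2[j]) for a, j in zip(doc1, best)]
cost = matching_cost(best)
print(pairs, round(cost, 4))
```

With these toy vectors the cheapest assignment pairs "obama" with "president", "speaks" with "greets", and so on, even though the two sentences share no word.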

##### WMD Implementation
With gensim, we only need to provide two lists of tokens and it takes care of the rest of the calculation.

In [1]:
"""
    News headlines taken from
    
    https://www.reuters.com/article/us-musk-tunnel/elon-musks-boring-co-to-build-high-speed-airport-link-in-chicago-idUSKBN1JA224
    http://money.cnn.com/2018/06/14/technology/elon-musk-boring-company-chicago/index.html
    https://www.theverge.com/2018/6/13/17462496/elon-musk-boring-company-approved-tunnel-chicago

"""

news_headline1 = "Elon Musk's Boring Co to build high-speed airport link in Chicago"
news_headline2 = "Elon Musk's Boring Company to build high-speed Chicago airport link"
news_headline3 = "Elon Musk’s Boring Company approved to build high-speed transit between downtown Chicago and O’Hare Airport"
news_headline4 = "Both apple and orange are fruit"

news_headlines = [news_headline1, news_headline2, news_headline3, news_headline4]

In [2]:
# Load Word Embedding Model
import gensim
from gensim.models.keyedvectors import KeyedVectors

print('gensim version: %s' % gensim.__version__)
#glove_model = gensim.models.KeyedVectors.load_word2vec_format('../model/text/stanford/glove/glove.6B.50d.vec')
glove_model = KeyedVectors.load_word2vec_format('/home/fedricio/Desktop/Embeddings_Utilizados/Glove/glove.6B.50d.txt', binary=False, no_header=True)



gensim version: 4.0.0


In [4]:
# Remove stopwords
import spacy
spacy_nlp = spacy.load('en_core_web_lg')

headline_tokens = []
for news_headline in news_headlines:
    headline_tokens.append([token.text.lower() for token in spacy_nlp(news_headline) if not token.is_stop])

print(headline_tokens)

[['elon', 'musk', 'boring', 'co', 'build', 'high', '-', 'speed', 'airport', 'link', 'chicago'], ['elon', 'musk', 'boring', 'company', 'build', 'high', '-', 'speed', 'chicago', 'airport', 'link'], ['elon', 'musk', 'boring', 'company', 'approved', 'build', 'high', '-', 'speed', 'transit', 'downtown', 'chicago', 'o’hare', 'airport'], ['apple', 'orange', 'fruit']]


In [5]:
subject_headline = news_headlines[0]
subject_token = headline_tokens[0]

print('Headline: ', subject_headline)
print('=' * 50)
print()

for token, headline in zip(headline_tokens, news_headlines):
    print('-' * 50)
    print('Comparing to:', headline)
    distance = glove_model.wmdistance(subject_token, token)
    print('distance = %.4f' % distance)

Headline:  Elon Musk's Boring Co to build high-speed airport link in Chicago

--------------------------------------------------
Comparing to: Elon Musk's Boring Co to build high-speed airport link in Chicago
distance = 0.0000
--------------------------------------------------
Comparing to: Elon Musk's Boring Company to build high-speed Chicago airport link
distance = 0.0734
--------------------------------------------------
Comparing to: Elon Musk’s Boring Company approved to build high-speed transit between downtown Chicago and O’Hare Airport
distance = 0.3675
--------------------------------------------------
Comparing to: Both apple and orange are fruit
distance = 1.1590


In the gensim implementation, out-of-vocabulary (OOV) words are removed before the distance is computed, so they neither raise an exception nor get a random vector.
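The effect of dropping OOV words can be sketched without loading a model. The `vocabulary` set below is a hypothetical stand-in; with a real gensim 4 model the membership test could be written against `glove_model.key_to_index` instead.

```python
def drop_oov(tokens, vocabulary):
    """Keep only the tokens that have an entry in the vocabulary."""
    return [token for token in tokens if token in vocabulary]

# Hypothetical stand-in for the keys of a loaded embedding model
vocabulary = {"elon", "musk", "boring", "company", "chicago", "airport"}

tokens = ["elon", "musk", "o’hare", "chicago", "airport"]
kept = drop_oov(tokens, vocabulary)
print(kept)
```

Here "o’hare" is silently dropped, which is worth keeping in mind when a headline consists mostly of rare names: the distance is then computed on very few words.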

### Example 2: Cosine Similarity & TF-IDF

In [3]:
from string import punctuation
from nltk.corpus import stopwords
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.feature_extraction.text import TfidfVectorizer

language_stopwords = stopwords.words('english')
non_words = list(punctuation)

def remove_stop_words(dirty_text):
    cleaned_text = ''
    for word in dirty_text.split():
        if word in language_stopwords or word in non_words:
            continue
        else:
            cleaned_text += word + ' '
    return cleaned_text

def remove_punctuation(dirty_string):
    for word in non_words:
        dirty_string = dirty_string.replace(word, '')
    return dirty_string

def process_file(file_name):
    # Read the whole file; the context manager closes it afterwards
    with open(file_name, "r") as f:
        file_content = f.read()
    # All to lower case
    file_content = file_content.lower()
    # Remove punctuation and English stopwords
    file_content = remove_punctuation(file_content)
    file_content = remove_stop_words(file_content)
    return file_content

nlp_article = process_file("Archivos 1-General/Ejemplo2/nlp.txt")
sentiment_analysis_article = process_file("Archivos 1-General/Ejemplo2/sentiment_analysis.txt")
java_certification_article = process_file("Archivos 1-General/Ejemplo2/java_cert.txt")

#TF-IDF
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform([nlp_article,sentiment_analysis_article,java_certification_article])
#X = count.fit_transform([nlp_article,sentiment_analysis_article,java_certification_article])
similarity_matrix = cosine_similarity(X,X)

print('----------------------------------')
print('Leantechblog article similarity:')
print('----------------------------------')
print(similarity_matrix)

michelle_speech = process_file("Archivos 1-General/Ejemplo2/michelle_speech.txt")
melania_speech = process_file("Archivos 1-General/Ejemplo2/melania_speech.txt")

#TF-IDF
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform([michelle_speech,melania_speech])
similarity_matrix = cosine_similarity(X,X)

print('-----------------------------------------')
print('Melania and Michelle speeches similarity:')
print('-----------------------------------------')
print(similarity_matrix)

----------------------------------
Leantechblog article similarity:
----------------------------------
[[1.         0.217227   0.05744137]
 [0.217227   1.         0.04773379]
 [0.05744137 0.04773379 1.        ]]
-----------------------------------------
Melania and Michelle speeches similarity:
-----------------------------------------
[[1.         0.29814417]
 [0.29814417 1.        ]]


### Example 3: WMD using spaCy

In [3]:
# To run this, install spaCy: conda install -c conda-forge spacy
# and the model loaded below: python -m spacy download en_core_web_lg

import spacy
spacy_nlp = spacy.load('en_core_web_lg')
text = "Some hotel description"
doc = spacy_nlp(text)
current_tokens = [token.text for token in doc]
# Optional: drop tokens of an unwanted entity type before re-joining, e.g.
# current_tokens = [item.text for item in doc
#                   if item.ent_type_ != "the_type_to_be_removed"]
new_text = " ".join(current_tokens)
doc = spacy_nlp(new_text)

# Install wmd: pip install wmd
import wmd
# Note: passing a component object to add_pipe is the spaCy 2.x calling convention
spacy_nlp.add_pipe(wmd.WMD.SpacySimilarityHook(spacy_nlp), last=True)
doc_2 = spacy_nlp("Another hotel description")
print(doc.similarity(doc_2))

0.9536311618244545


### Example 4: A very complete WMD example

In [2]:
from gensim.models import KeyedVectors
import matplotlib.pyplot as  plt
from collections import Counter
import pandas as pd
import numpy as np
import random
import string
import math
import re
import nltk
nltk.download("stopwords")
from nltk.corpus import stopwords

[nltk_data] Downloading package stopwords to
[nltk_data]     /home/fedricio/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


#### Using the downloaded Word2vec model

In [3]:
# Link the file was downloaded from: https://s3.amazonaws.com/dl4j-distribution/GoogleNews-vectors-negative300.bin.gz
EMBEDDING_FILE = '/home/fedricio/Desktop/Embeddings_Utilizados/Word2vec/GoogleNews-vectors-negative300.bin.gz'
word2vec = KeyedVectors.load_word2vec_format(EMBEDDING_FILE, binary=True)

#### Reading the dataset

In [4]:
data = pd.read_csv('Archivos 1-General/Ejemplo4/DocumentSimilarity_Dataset.csv')
data = data.sample(frac=1).reset_index(drop=True)

data.head()

Unnamed: 0,articles,abstracts,similarity
0,as republicans wrestle with how to oppose pres...,group: chinese police have required some forei...,0
1,three faculty members were killed and three ot...,"the mary rose, flagship of henry viii, was rai...",0
2,all lyle petersen wanted to do was get his mai...,more than 1.5 million people have been infecte...,1
3,a somali suspect in the hijacking of the u.s.-...,"u.n. chief calls pakistan floods ""a global dis...",0
4,can we predict the future of medicine? althoug...,the cleveland clinic has published its top 10 ...,1


#### Cosine Similarity and WMD explained

#### 1. Cosine Similarity
Cosine similarity is a measure of similarity between two non-zero vectors of an inner product space that measures the cosine of the angle between them.

![Alt Text](https://i.imgur.com/HqKjGoQ.jpg)
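A small numeric check of the definition may help. The vectors below are arbitrary examples, and the helper name `cosine` is only for this note: two vectors pointing in the same direction give a cosine of 1, and perpendicular vectors give 0, regardless of their lengths.

```python
import math

def cosine(u, v):
    """Cosine of the angle between two non-zero vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Same direction, different length
same_direction = cosine([1, 2, 0], [2, 4, 0])
# Perpendicular vectors: no shared non-zero component
orthogonal = cosine([1, 0, 0], [0, 3, 0])
print(same_direction, orthogonal)
```

For documents, the components are word counts, so two texts with no word in common behave like the perpendicular case.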


#### 2. Word Mover's Distance
Word Mover's Distance (WMD) uses the word embeddings of the words in two texts to measure the minimum distance that the words in one text need to travel in semantic space to reach the words in the other text.

WMD is computed from the Euclidean distances between the words of the two documents in word2vec space: it is the minimum total distance that the words of one document must travel to reach the words of the other. If the distance is small, the words in the two documents are close to each other.

So, taking the same two sentences:
- sentence 1: "Obama speaks to the media in Illinois"
- sentence 2: "The president greets the press in Chicago"

After removing stop words, the word mover's distance is small, as shown in the figure.


![Alt Text](https://imgur.com/L1QNfPK.jpg)

In [11]:
WORD = re.compile(r"\w+")

WORD    # Compiled regex matching runs of word characters; text_to_vector below uses it to tokenize

re.compile(r'\w+', re.UNICODE)
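To see what the compiled pattern does, here is a short sketch on a made-up sentence: `findall` returns every run of letters, digits and underscores, so punctuation and spaces act as separators.

```python
import re

# Same pattern as in the cell above
WORD = re.compile(r"\w+")

tokens = WORD.findall("Obama speaks to the media, in Illinois!")
print(tokens)
```

Note that the pattern does not lowercase anything, so "Obama" and "obama" would be counted as different words unless the text is lowercased first.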

In [12]:
def text_to_vector(text):
    '''
        Convert the document to a term-frequency mapping: every non-stop word with its number of occurrences.

        -- input:
                    text: the document as string 
        -- output:
                    Counter: each word in the document and its frequency 
    '''
    words = WORD.findall(text)
    stopwords = nltk.corpus.stopwords.words('english')
    words = [w for w in words if w not in stopwords]
    return Counter(words)

def get_cosine(doc1, doc2):
    '''
        Get the cosine similarity between two documents.
        Depends on the angle between two non zero vectors which are constructed by each word frequency in the two documents.

        -- input:
                      doc1: the first document as string
                      doc2: the second document as string
        -- output:
                      cosine similarity score

    '''
    vec1 = text_to_vector(doc1)
    vec2 = text_to_vector(doc2)
    intersection = set(vec1.keys()) & set(vec2.keys())
    numerator = sum([vec1[x] * vec2[x] for x in intersection])

    sum1 = sum([vec1[x] ** 2 for x in list(vec1.keys())])
    sum2 = sum([vec2[x] ** 2 for x in list(vec2.keys())])
    denominator = math.sqrt(sum1) * math.sqrt(sum2)

    if not denominator:
        return 0.0
    else:
        return float(numerator) / denominator
        
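def wordMdistance(doc1,doc2):
  '''
      Return a relaxed word mover distance between two documents: the average,
      over the words of doc1, of the Euclidean distance to the nearest word of doc2.

      -- input:
                      doc1: the first document as list of words
                      doc2: the second document as list of words
      -- output:
                      Word Mover's Distance score
  '''
  sum_dist = 0
  i = 0
  for word in doc1:
    mindist = 1000.0  # sentinel: stays at 1000.0 if no pair of vectors is found
    for word2 in doc2:
      try:
        j = word2vec.get_vector(word)
        t = word2vec.get_vector(word2)
      except KeyError:
        # skip words that are missing from the word2vec vocabulary
        continue
      dista = np.linalg.norm(j - t)
      if dista < mindist:
        mindist = dista
    sum_dist += mindist
    i += 1
  # an empty doc1 has no defined distance; avoid dividing by zero
  return sum_dist / i if i else float('inf')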

def WMD(doc1,doc2):
  '''
      Preprocess the document first and remove english stopwords then call the function that calculates the word mover distance
     
      -- input:
                      doc1: the first document as string
                      doc2: the second document as string
      -- output:
                      Word Mover's Distance score
  '''
  first_doc = doc1.lower().split()
  second_doc = doc2.lower().split()
  stopwords = nltk.corpus.stopwords.words('english')
  first_doc = [w for w in first_doc if w not in stopwords]
  second_doc = [w for w in second_doc if w not in stopwords]
  return (word2vec.wmdistance(second_doc, first_doc))

### Viewing the dataset again and applying the functions above

In [28]:
data.head()

Unnamed: 0,articles,abstracts,similarity
0,a 6.2-magnitude earthquake struck off the sout...,there is no tsunami threat . the quake was al...,1
1,facebook ceo mark zuckerberg said in an interv...,new york police officer seen in video kicking ...,0
2,andre villas-boas' troubles as chelsea manager...,chelsea manager andre villas-boas under furthe...,1
3,"university park, pennsylvania the fatal expl...",the training materials are a result of a 2013 ...,0
4,"after five months of detention in north korea,...",yohan blake beats usain bolt over 200m at the ...,0


In [23]:
Article_1 = data['articles'][0]
Article_1



In [24]:
Abstract_1 = data['abstracts'][0]
Abstract_1

' there is no tsunami threat . the quake was almost 330 miles southwest of panama city . there are no immediate reports of injuries or damage .'

In [25]:
Sim_1 = data['similarity'][0]
Sim_1

1

In [26]:
Article_2 = data['articles'][1]
Abstract_2 = data['abstracts'][1]
Sim_2 = data['similarity'][1]
Sim_2

0

### get_cosine, one pair at a time

In [52]:
round(get_cosine(Article_1,Abstract_1),3)

0.565

In [53]:
round(get_cosine(Article_2,Abstract_2),3)

0.017

### WMD, one pair at a time

In [54]:
round(WMD(Article_1,Abstract_1),3)

0.694

In [55]:
round(WMD(Article_2,Abstract_2),3)

1.193

### New DataFrame with get_cosine and WMD for the whole previous DataFrame

In [51]:
articles = data['articles'].astype('str')
abstracts = data['abstracts'].astype('str')
similarities = data['similarity']

NewDF = pd.DataFrame({"article":articles,"summary":abstracts, "sim":similarities})
# Take an explicit copy so that adding columns does not trigger pandas' SettingWithCopyWarning
New_DF_Acotado = NewDF.iloc[0:11].copy()

New_DF_Acotado['Cosine'] = New_DF_Acotado.apply(lambda row: round(get_cosine(row['article'],row['summary']),3), axis=1)
New_DF_Acotado['WMD'] = New_DF_Acotado.apply(lambda row: round(WMD(row['article'],row['summary']),3), axis=1)

New_DF_Acotado


Unnamed: 0,article,summary,sim,Cosine,WMD
0,a 6.2-magnitude earthquake struck off the sout...,there is no tsunami threat . the quake was al...,1,0.565,0.694
1,facebook ceo mark zuckerberg said in an interv...,new york police officer seen in video kicking ...,0,0.017,1.193
2,andre villas-boas' troubles as chelsea manager...,chelsea manager andre villas-boas under furthe...,1,0.466,0.933
3,"university park, pennsylvania the fatal expl...",the training materials are a result of a 2013 ...,0,0.031,1.246
4,"after five months of detention in north korea,...",yohan blake beats usain bolt over 200m at the ...,0,0.02,1.23
5,a pair of georgia men faced more than a half-h...,ballet opening thursday features live performa...,0,0.051,1.205
6,washington president barack obama's keystone p...,the senate blocked a keystone bill from advanc...,1,0.414,1.119
7,"tina fey's follow-up to ""30 rock"" is getting a...","tina fey's series ""the unbreakable kimmy schmi...",1,0.229,1.193
8,as north koreans face an uncertain future with...,"u.s. secretary of state, israeli prime ministe...",0,0.014,1.167
9,long gone are the days of ice sculptures and c...,much of premium class airline food is hand-pre...,1,0.385,1.021


### Example 5: Keyword extraction using TF*IDF

The example has 11 steps.

#### In summary: Document -> Remove stop words -> Find Term Frequency (TF) -> Find Inverse Document Frequency (IDF) -> Find TF*IDF -> Get top N Keywords

1. Import the necessary packages.

We need tokenize to create word tokens, itemgetter to sort the dictionary, and math to perform the natural-log (base e) operation.

In [36]:
from nltk import tokenize
from operator import itemgetter
import math

2. Declare the variables.

We will declare a string variable. It will be a placeholder for the sample text document.

In [37]:
doc = 'I am a graduate. I want to learn Python. I like learning Python. Python is easy. Python is interesting. Learning increases thinking. Everyone should invest time in learning'

3. Remove the stop words.

Stop words are frequently occurring words that may not carry significance for our analysis. We can remove them using the nltk library.

In [38]:
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize 
stop_words = set(stopwords.words('english'))

4. Find the total number of words (total_words) in the document.

This will be required when calculating Term Frequency.

In [39]:
total_words = doc.split()
total_word_length = len(total_words)
print(total_word_length)

28


5. Find the total number of sentences (total_sent_len).

This will be required when calculating Inverse Document Frequency.

In [40]:
total_sentences = tokenize.sent_tokenize(doc)
total_sent_len = len(total_sentences)
print(total_sent_len)

7


6. Calculate TF for each word.

We begin by counting the occurrences of each non-stop word and finally divide each count by the result of step 4.

In [41]:
tf_score = {}
for each_word in total_words:
    each_word = each_word.replace('.','')
    if each_word not in stop_words:
        if each_word in tf_score:
            tf_score[each_word] += 1
        else:
            tf_score[each_word] = 1
            
# Dividing by total_word_length for each dictionary element
tf_score.update((x, y/int(total_word_length)) for x, y in tf_score.items())
print(tf_score)

{'I': 0.10714285714285714, 'graduate': 0.03571428571428571, 'want': 0.03571428571428571, 'learn': 0.03571428571428571, 'Python': 0.14285714285714285, 'like': 0.03571428571428571, 'learning': 0.07142857142857142, 'easy': 0.03571428571428571, 'interesting': 0.03571428571428571, 'Learning': 0.03571428571428571, 'increases': 0.03571428571428571, 'thinking': 0.03571428571428571, 'Everyone': 0.03571428571428571, 'invest': 0.03571428571428571, 'time': 0.03571428571428571}


7. Function to check whether the word (word) is present in the list of sentences (sentences).

This function will be required when calculating IDF.

In [42]:
def check_sent(word, sentences):
    # Count the sentences that contain `word` as a whole token.
    # (Iterating over `word` itself would test its individual characters instead.)
    final = [word in x.replace('.', '').split() for x in sentences]
    return sum(final)

8. Calculate IDF for each word.

We will use the function from step 7 while iterating over the non-stop words and store the result as the Inverse Document Frequency.

In [43]:
idf_score = {}
for each_word in total_words:
    each_word = each_word.replace('.','')
    if each_word not in stop_words:
        if each_word in idf_score:
            idf_score[each_word] = check_sent(each_word, total_sentences)
        else:
            idf_score[each_word] = 1

# Performing a log and divide
idf_score.update((x, math.log(int(total_sent_len)/y)) for x, y in idf_score.items())

print(idf_score)

{'I': 0.8472978603872037, 'graduate': 1.9459101490553132, 'want': 1.9459101490553132, 'learn': 1.9459101490553132, 'Python': 0.5596157879354227, 'like': 1.9459101490553132, 'learning': 1.252762968495368, 'easy': 1.9459101490553132, 'interesting': 1.9459101490553132, 'Learning': 1.9459101490553132, 'increases': 1.9459101490553132, 'thinking': 1.9459101490553132, 'Everyone': 1.9459101490553132, 'invest': 1.9459101490553132, 'time': 1.9459101490553132}


9. Calculate TF * IDF.

Since both dictionaries have the same keys, we can iterate over one dictionary to get the keys and multiply the values of both.

In [44]:
tf_idf_score = {key: tf_score[key] * idf_score.get(key, 0) for key in tf_score.keys()}
print(tf_idf_score)

{'I': 0.09078191361291467, 'graduate': 0.06949679103768976, 'want': 0.06949679103768976, 'learn': 0.06949679103768976, 'Python': 0.07994511256220323, 'like': 0.06949679103768976, 'learning': 0.08948306917824057, 'easy': 0.06949679103768976, 'interesting': 0.06949679103768976, 'Learning': 0.06949679103768976, 'increases': 0.06949679103768976, 'thinking': 0.06949679103768976, 'Everyone': 0.06949679103768976, 'invest': 0.06949679103768976, 'time': 0.06949679103768976}
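As a sanity check, the score for 'Python' can be reproduced by hand from the numbers printed in the earlier steps. The two counts (4 occurrences, 4 sentences) are read off the sample document; the variable names are only for this note.

```python
import math

total_word_length = 28   # result of step 4
total_sent_len = 7       # result of step 5
python_count = 4         # occurrences of 'Python' in the document
python_sentences = 4     # sentences that contain 'Python'

tf = python_count / total_word_length
idf = math.log(total_sent_len / python_sentences)
tf_idf = tf * idf
print(tf, idf, tf_idf)
```

The three values agree with the 'Python' entries in the TF, IDF and TF * IDF dictionaries above.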


10. Create a function to get the N most important words in the document.

In [45]:
def get_top_n(dict_elem, n):
    result = dict(sorted(dict_elem.items(), key = itemgetter(1), reverse = True)[:n]) 
    return result

11. As a test, get the top 5 words:

In [46]:
print(get_top_n(tf_idf_score, 5))

{'I': 0.09078191361291467, 'learning': 0.08948306917824057, 'Python': 0.07994511256220323, 'graduate': 0.06949679103768976, 'want': 0.06949679103768976}
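One thing the output above suggests: 'I' ranks first, and 'learning' and 'Learning' are scored separately, because the comparison is case-sensitive while NLTK's English stop-word list is lower-case. A possible adjustment, sketched here with a small hand-picked stop-word set standing in for the NLTK list (an assumption of this note), is to lowercase each token before filtering and counting.

```python
from collections import Counter

doc = ('I am a graduate. I want to learn Python. I like learning Python. '
       'Python is easy. Python is interesting. Learning increases thinking. '
       'Everyone should invest time in learning')

# Hypothetical stand-in for the lower-case NLTK English stop-word list
stop_words = {'i', 'am', 'a', 'to', 'is', 'should', 'in'}

# Lowercase every token before the stop-word test and before counting
tokens = [w.replace('.', '').lower() for w in doc.split()]
counts = Counter(w for w in tokens if w not in stop_words)
print(counts.most_common(2))
```

With this change 'i' is filtered out and the three spellings of "learning" are merged into one entry, so the TF and IDF steps would work on cleaner counts.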
