# Text Similarity


A common reason for representing documents as vectors is to determine the **similarity between pairs of documents**, or the **similarity between a specific document** and a set of other documents (such as a user query vs. indexed documents).
 
SOURCES: http://localhost:8888/notebooks/Similarity_Queries.ipynb

https://radimrehurek.com/topic_modeling_tutorial/3%20-%20Indexing%20and%20Retrieval.html

In [5]:
from IPython.display import display, HTML
import configparser

config = configparser.ConfigParser()
config.read('../../config.ini')

GENSIM_DICTIONARY_PATH = config['SIMILARITY']['GENSIM_DICTIONARY_PATH']
GENSIM_CORPUS_PATH = config['SIMILARITY']['GENSIM_CORPUS_PATH']
SIMILARITY_INDEX = config['SIMILARITY']['SIMILARITY_INDEX']
AIRLINE_CLEANED_TEXT_PATH = config['SIMILARITY']['AIRLINE_CLEANED_TEXT_PATH']

'../../raw_data/gensim/airline.mm'
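
The `config.ini` read above is not included in this notebook. As a rough sketch, the cell expects a `[SIMILARITY]` section holding the four keys it looks up. Every path below is a placeholder chosen for illustration; only the `.mm` path echoes the output above, and assigning it to `GENSIM_CORPUS_PATH` is an assumption.

```python
import configparser

# Hypothetical contents of config.ini; every path here is a placeholder.
EXAMPLE_INI = """
[SIMILARITY]
GENSIM_DICTIONARY_PATH = ../../raw_data/gensim/airline.dict
GENSIM_CORPUS_PATH = ../../raw_data/gensim/airline.mm
SIMILARITY_INDEX = ../../raw_data/gensim/airline.index
AIRLINE_CLEANED_TEXT_PATH = ../../raw_data/airline_cleaned.txt
"""

config = configparser.ConfigParser()
config.read_string(EXAMPLE_INI)  # parse the text above instead of a file on disk

# Collect the section into a plain dict (configparser lower-cases key names).
similarity_settings = dict(config['SIMILARITY'])
for key, value in similarity_settings.items():
    print(key, '=', value)
```

Key lookups are case-insensitive by default, so `config['SIMILARITY']['GENSIM_CORPUS_PATH']` still resolves even though the stored key is lower-cased.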

### Review Cleaned Text

In [60]:
import pandas as pd

# read the cleaned sentences, one document per line
with open(AIRLINE_CLEANED_TEXT_PATH, encoding='utf-8') as f:
    cleaned_text = [line.strip() for line in f]

# widen the column so long sentences are truncated later
pd.set_option('display.max_colwidth', 200)
pd.DataFrame(cleaned_text)

Unnamed: 0,0
0,no show policy fund apply future travel southwest change fee future_travel
1,for example company transfarencysm campaign emphasize southwest approach treat customer fairly honestly respectfully low fare unexpected bag fee change fee hidden fee low_fare bag_fee
2,for example company transfarencysm campaign emphasize southwest approach treat customer fairly honestly respectfully low fare unexpected bag fee change fee hidden fee low_fare bag_fee
3,the campaign highlight importance southwest customer service show southwest understand plan change charge change fee customer_service
4,while customer pay difference airfare customer charge change fee difference airfare
5,the passenger protection rules require airline pay compensation passenger deny boarding involuntarily oversold flight ii refund check bag fee permanently lose luggage iii prominently disclose pote...
6,the passenger protection rules require airline pay compensation passenger deny boarding involuntarily oversold flight ii refund check bag fee permanently lose luggage iii prominently disclose pote...
7,the passenger protection rules require advertise airfare include governmentmandat tax fee ii passenger allow hold reservation hour make payment iii passenger allow cancel pay reservation penalty h...
8,in december congress enact statute index immigration custom fee inflation begin
9,finally department agriculture animal plant health inspection service publish final regulation october modify international agriculture inspection fee the_U.S._Department Plant_Health_Inspection_S...


### Index Documents

In [6]:
from gensim import corpora, models, similarities


In [13]:
dictionary = corpora.Dictionary.load(GENSIM_DICTIONARY_PATH)
corpus = corpora.MmCorpus(GENSIM_CORPUS_PATH)

In [15]:
print(dictionary)
print(corpus)

Dictionary(134 unique tokens: ['show', 'apply', 'travel', 'southwest', 'change']...)
MmCorpus(32 documents, 134 features, 419 non-zero entries)


In [26]:
lsi = models.LsiModel(corpus, id2word=dictionary, num_topics=2)
lsi_corpus = lsi[corpus]

In [None]:
# index all the documents

In [32]:
from gensim.similarities import MatrixSimilarity, SparseMatrixSimilarity, Similarity

# transform corpus to LSI space and index it
index = similarities.MatrixSimilarity(lsi_corpus)


index.save(SIMILARITY_INDEX)
#index = similarities.MatrixSimilarity.load(SIMILARITY_INDEX)

In [None]:
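# add new documents to the index
# useful to continually add new documents to search against
# note: add_documents() is provided by similarities.Similarity (the sharded index),
# not by MatrixSimilarity, so build that index type first
sharded_index = similarities.Similarity(SIMILARITY_INDEX, lsi_corpus, num_features=lsi.num_topics)
sharded_index.add_documents(lsi_corpus)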

In [33]:
# alternative ways to initialize the index
# (num_features is the dimensionality of the LSI space, i.e. lsi.num_topics)
#%time index_dense = MatrixSimilarity(lsi_corpus, num_features=lsi.num_topics)
#index = similarities.Similarity(SIMILARITY_INDEX, lsi_corpus, num_features=lsi.num_topics) # transform corpus to LSI space and index it

### Similarity Query

In [38]:
doc = "Airline fee for landing_fee"
vec_bow = dictionary.doc2bow(doc.lower().split())
vec_lsi = lsi[vec_bow] # convert the query to LSI space
print(vec_lsi)

[(0, -0.10312934054394524), (1, 0.34198630506276362)]
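
To make the bag-of-words step concrete, here is a plain-Python sketch of the idea behind `doc2bow`: count the tokens the dictionary knows and drop the rest. It is not gensim's implementation, and the `token2id` mapping below is hypothetical; the real ids live in the saved dictionary.

```python
from collections import Counter

# Hypothetical token -> id mapping, for illustration only.
token2id = {'airline': 0, 'fee': 1, 'change': 2}

def bow_sketch(tokens, token2id):
    """Count known tokens and return sorted (token_id, count) pairs."""
    counts = Counter(token2id[t] for t in tokens if t in token2id)
    return sorted(counts.items())

query = "Airline fee for landing_fee"
query_bow = bow_sketch(query.lower().split(), token2id)
print(query_bow)  # [(0, 1), (1, 1)]
```

Tokens missing from the mapping (`for` and `landing_fee` in this made-up vocabulary) contribute nothing to the query vector.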


In [63]:
sims = index[vec_lsi] # perform a similarity query against the corpus
sims

array([ 0.59441066,  0.83202821,  0.83202821,  0.31721264,  0.37263808,
        0.22230884,  0.22230884,  0.33240592,  0.78784192,  0.21768281,
        0.5102374 ,  0.5102374 ,  0.99999094,  0.35403904,  0.45829552,
        0.62159282,  0.62159282,  0.95994979,  0.61456001,  0.98673123,
        0.8169539 ,  0.9634344 ,  0.97074395,  0.97286832,  0.9840433 ,
        0.9642309 ,  0.97074395,  0.97286832,  0.90252972,  0.96595037,
        0.29459408,  0.97178912], dtype=float32)

In [67]:
# set the number of matches to return (sorted by most relevant)
index.num_best = 3
print(index[vec_lsi])

[(12, 0.99999094009399414), (19, 0.98673123121261597), (24, 0.98404330015182495)]
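
Conceptually, `num_best` amounts to sorting the similarity scores and keeping the top few. The sketch below redoes that ranking in plain Python on the `sims` values printed above; `top_matches` is a helper name introduced here, not part of gensim.

```python
# Similarity scores copied from the `sims` output above (one per indexed document).
sims = [
    0.59441066, 0.83202821, 0.83202821, 0.31721264, 0.37263808,
    0.22230884, 0.22230884, 0.33240592, 0.78784192, 0.21768281,
    0.5102374, 0.5102374, 0.99999094, 0.35403904, 0.45829552,
    0.62159282, 0.62159282, 0.95994979, 0.61456001, 0.98673123,
    0.8169539, 0.9634344, 0.97074395, 0.97286832, 0.9840433,
    0.9642309, 0.97074395, 0.97286832, 0.90252972, 0.96595037,
    0.29459408, 0.97178912,
]

def top_matches(scores, n):
    """Return the n highest-scoring (document_id, score) pairs, best first."""
    ranked = sorted(enumerate(scores), key=lambda pair: pair[1], reverse=True)
    return ranked[:n]

best = top_matches(sims, 3)
print(best)
```

The document ids come out as 12, 19, and 24, matching the `num_best = 3` result; each id can then be used to look up the matching row in the cleaned text.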


### The Norm

### Cosine Similarity

In [1]:
cosine_similarity_url = 'http://blog.christianperone.com/2013/09/machine-learning-cosine-similarity-for-vector-space-models-part-iii/'
iframe = '<iframe src="{}" width="1100" height="300"></iframe>'.format(cosine_similarity_url)
HTML(iframe)

The similarity in vector space models is determined by associative coefficients based on the inner product of the document vector and the query vector, where word overlap indicates similarity. The inner product is usually normalized. The most popular similarity measure is the cosine coefficient, which measures the cosine of the angle between a document vector and the query vector.

SOURCE: http://cogsys.imm.dtu.dk/thor/projects/multimedia/textmining/node5.html
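
Written out, with $\mathbf{d}$ for the document vector and $\mathbf{q}$ for the query vector (symbols introduced here for illustration, not taken from the source), the cosine coefficient is the inner product normalized by the two vector lengths:

```latex
\cos\theta
  = \frac{\mathbf{d} \cdot \mathbf{q}}{\lVert \mathbf{d} \rVert \, \lVert \mathbf{q} \rVert}
  = \frac{\sum_{i} d_i \, q_i}{\sqrt{\sum_{i} d_i^{2}} \; \sqrt{\sum_{i} q_i^{2}}}
```

Here $d_i$ and $q_i$ are the weights of term $i$ in the document and the query. When all weights are non-negative, as with tf-idf, the value lies between 0 and 1.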

Think about it this way. In the numerator of cosine similarity, only terms that exist in both documents contribute to the dot product. If a shared term has a high tf-idf value in both documents, it adds a lot to the numerator. If a term is missing from either document, it adds nothing to the numerator. On the other hand, the denominator normalizes by the length of each document vector, so a document with many heavily weighted terms is penalized with a larger denominator.
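
The effect of the numerator and denominator can be checked on a small example. This is a minimal sketch using made-up tf-idf weights and sparse vectors stored as `{term: weight}` dicts; the function name and the data are assumptions for illustration.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two sparse vectors given as {term: weight} dicts."""
    # Only terms present in both vectors contribute to the dot product.
    shared_terms = set(a) & set(b)
    dot = sum(a[t] * b[t] for t in shared_terms)
    # Each norm uses every term of its own vector, shared or not.
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # convention chosen here for empty vectors
    return dot / (norm_a * norm_b)

# Made-up tf-idf weights, for illustration only.
query = {'fee': 3.0, 'bag': 4.0}
doc_short = {'fee': 3.0, 'bag': 4.0}
doc_long = {'fee': 3.0, 'bag': 4.0, 'airline': 12.0}

score_short = cosine_similarity(query, doc_short)
score_long = cosine_similarity(query, doc_long)
print(score_short, score_long)
```

Both documents share the same two terms with the query, so the numerator is 25 in each case. The extra `airline` term raises the norm of `doc_long` from 5 to 13, which drops its score from 1.0 to 25/65, roughly 0.385.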