# Cosine Similarity Using Gensim


This notebook builds an embedding vector for a chat message and then uses cosine similarity to find messages with a similar meaning in another chat.

In [2]:
import utils
import numpy as np
import pandas as pd
import nltk
import string
import gensim
import logging

from os import getcwd
from gensim import corpora
from gensim import models
from gensim import similarities

#logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)

### Loading data 

The data comes from our earlier pandas manipulations.

In [2]:
# Load the pickled texts and tokens into separate variables
idf_text = pd.read_pickle('idf_text')
idf_text_list = idf_text.tolist()
idf_tokens = pd.read_pickle('idf_tokens')
hammas_text = pd.read_pickle('hammas_text')
hammas_text_list = hammas_text.tolist()
hammas_tokens = pd.read_pickle('hammas_tokens')

Using the gensim.corpora.Dictionary class, we create a dictionary for all tokens in each chat. Each dictionary defines the vocabulary of all words that our processing knows about.
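To make the vocabulary idea concrete, here is a minimal pure-Python sketch of what a token dictionary and a bag-of-words vector look like conceptually. It is not gensim's implementation: the toy `docs`, the `token2id` mapping, and the `to_bow` helper are hypothetical names used only for illustration, and gensim may assign ids in a different order.

```python
from collections import Counter

# Hypothetical pre-tokenized messages, standing in for hammas_tokens / idf_tokens.
docs = [
    ["attack", "border", "attack"],
    ["border", "fire"],
]

# Assign each distinct token an integer id, in order of first appearance.
# (Assumption for illustration; gensim's id order may differ.)
token2id = {}
for tokens in docs:
    for token in tokens:
        token2id.setdefault(token, len(token2id))


def to_bow(tokens):
    """Return sorted (token_id, count) pairs; out-of-vocabulary tokens are dropped."""
    counts = Counter(token2id[t] for t in tokens if t in token2id)
    return sorted(counts.items())


corpus = [to_bow(tokens) for tokens in docs]
print(token2id)
print(corpus)
print(to_bow(["fire", "unseen", "fire"]))
```

The last call shows why the vocabulary matters for queries: a token that never appeared in the chat ("unseen" here) contributes nothing to the query vector.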

In [3]:
# Create a dictionary and a bag-of-words (frequency) corpus from the tokens produced by the pandas preprocessing.
h_dictionary = corpora.Dictionary(hammas_tokens)
h_corpus = [h_dictionary.doc2bow(text) for text in hammas_tokens]
i_dictionary = corpora.Dictionary(idf_tokens)
i_corpus = [i_dictionary.doc2bow(text) for text in idf_tokens]

# Save the results to disk
h_dictionary.save('h_dictionary')
i_dictionary.save('i_dictionary')
corpora.MmCorpus.serialize('h_corpus', h_corpus)
corpora.MmCorpus.serialize('i_corpus', i_corpus)

# Fit TF-IDF models (the LSI models below do not use them)
h_tfidf = models.TfidfModel(h_corpus)
i_tfidf = models.TfidfModel(i_corpus)
# Fit two-topic LSI models directly on the bag-of-words corpora
h_lsi = models.LsiModel(h_corpus, id2word=h_dictionary, num_topics=2)
i_lsi = models.LsiModel(i_corpus, id2word=i_dictionary, num_topics=2)


In [10]:
h_dict = corpora.dictionary.Dictionary.load('h_dictionary')

In [12]:
h_dict[0]

'amplifi'

I reuse the preprocessing function I created earlier to process the text I want to query.

In [14]:
doc = "7 october lauch attack"
h_vec_bow = h_dictionary.doc2bow(utils.process_text(doc))
h_vec_lsi = h_lsi[h_vec_bow]  # convert the query to LSI space
print(f'Original text - {doc}\nProcessed text - {utils.process_text(doc)}\nEmbedding vector - {h_vec_lsi}')

i_vec_bow = i_dictionary.doc2bow(utils.process_text(doc))
i_vec_lsi = i_lsi[i_vec_bow]  # convert the query to LSI space
print(f'Original text - {doc}\nProcessed text - {utils.process_text(doc)}\nEmbedding vector - {i_vec_lsi}')

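# Build cosine-similarity indexes over the LSI-transformed corpora
h_index = similarities.MatrixSimilarity(h_lsi[h_corpus])
i_index = similarities.MatrixSimilarity(i_lsi[i_corpus])
# Score every document in each chat against the query vector
h_sims = h_index[h_vec_lsi]
i_sims = i_index[i_vec_lsi]

# Sort (document position, score) pairs by descending similarity
h_sims = sorted(enumerate(h_sims), key=lambda item: -item[1])
i_sims = sorted(enumerate(i_sims), key=lambda item: -item[1])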

Original text - 7 october lauch attack
Processed text - ['7', 'octob', 'lauch', 'attack']
Embedding vector - [(0, 0.05136082870771824), (1, 0.030826041360716774)]
Original text - 7 october lauch attack
Processed text - ['7', 'octob', 'lauch', 'attack']
Embedding vector - [(0, 0.052981933711779616), (1, -0.043356627502943515)]
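The similarity indexes rank documents by the cosine of the angle between the query's LSI vector and each document's LSI vector. The sketch below illustrates that computation in pure Python; it is not gensim's code. The `query` vector is rounded from the first embedding printed above, while the `cosine` helper and the entries in `doc_vectors` are hypothetical and not taken from either corpus.

```python
import math


def cosine(u, v):
    """Cosine similarity of two dense vectors; 0.0 if either has zero length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


# Query vector in the two-topic space, rounded from the first output above.
query = [0.0514, 0.0308]

# Hypothetical document vectors, chosen only to show how direction drives the score.
doc_vectors = {
    "same direction, 10x longer": [0.514, 0.308],
    "slightly rotated": [0.50, 0.32],
    "orthogonal": [-0.0308, 0.0514],
}

# Rank the documents by descending cosine similarity to the query.
ranked = sorted(
    ((cosine(query, vec), name) for name, vec in doc_vectors.items()),
    reverse=True,
)
for score, name in ranked:
    print(f"{score:.4f}  {name}")
```

Cosine similarity ignores vector length, so in a space with only two topics any document pointing in roughly the same direction as the query scores close to 1. This may explain why many top results receive near-identical scores; a larger num_topics would likely spread the scores out more.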


In [16]:
for doc_position, doc_score in h_sims[:10]:
    print(doc_score, hammas_text_list[doc_position])


1.0 🚨 An injury of a Palestinian with live fire in the village of Nabi Saleh, west of Ramallah.
1.0 Video from Lebanese journalist Ali Shuaib via Al-Manar TV shows a Lebanese farmer defending his land from the zionist occupation that is attempting to build a fence on Lebanese lands.Ali Shuaib eloquently says:  “When raising your voice no longer makes sense…He wanted to speak with his body to the zionist war machine.”The southern Lebanese hero is farmer Ismail Nasser, from the town of Kafr Shuba.
1.0 🟢 Ismail Haniyeh, head of the political bureau of Hamas,:March towards the border! Think outside the box! Spread out the equations!The resistance has begun its strategic and thunderous strikes, and it still controls the pace of this battle despite the occupier's brutality, indiscriminate killings, and deliberate striking of homes.These crimes, which the world is also witnessing, reflect once again the nature of this Nazi enemy, this fascist monster, where many of our martyrs, hundreds of th

In [17]:
for doc_position, doc_score in i_sims[:10]:
    print(doc_score, idf_text_list[doc_position])


0.99999946 IDF: A short while ago, Israeli civilians burned vehicles and possessions belonging to Palestinians in the town of Turmus Aya.Security forces entered the town in order to extinguish the fires, prevent clashes and to collect evidence. The Israeli civilians exited the town and the Israel Police has opened an investigation into the event.The IDF condemns these serious incidents of violence and destruction of property. Such events prevent the IDF and security forces from focusing on their main mission  maintaining the security of the State of Israel and preventing terrorism.
0.9999989 IDF: IDF, ISA and Israel Police forces conducted counterterrorism activities in Judea and SamariaOvernight, IDF, ISA and Israel Border Police forces conducted counterterrorism activities in order to apprehend Islamic Jihad terrorist operatives in a number of locations in Judea and Samaria, including the towns of Bayt Sira and in the Qalandiya camp.In the town of Idhna and in the city of Hebron, six