# Finding the similarity between the query and the documents

Gensim: an open-source Python library written by Radim Řehůřek for unsupervised topic modelling and natural language processing. It is designed to extract semantic topics from documents and can handle large text collections.

In [None]:
import gensim
import pandas as pd
from gensim import corpora, models, similarities
from gensim.parsing.preprocessing import stem_text, strip_numeric

In [None]:
# Read the corpus: one document (a physics problem) per line
cor = pd.read_csv('/content/sample_data/phy_corpus.txt', sep ='\n', header = None)[0]

In [None]:
cor

0      An airplane accelerates down a runway at 3.20 ...
1      A car starts from rest and accelerates uniform...
2      Upton Chuck is riding the Giant Drop at Great ...
3      A race car accelerates uniformly from 18.5 m/s...
4      A feather is dropped on the moon from a height...
                             ...                        
266    A rocket  is fired  vertically  and  ascends  ...
267    The Earth orbits the Sun at a distance of 1500...
268    The cyclist in Figure 2.15 is travelling at 15...
269    88. Bicycle A bicycle accelerates from 0.0 m/s...
270     A weather balloon is floating at a constant h...
Name: 0, Length: 271, dtype: object

In [None]:
cor.shape

(271,)
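
Recent pandas releases may reject a newline as the `sep` argument of `read_csv`. As a fallback, here is a minimal sketch that reads one document per line with plain Python. The helper name `parse_documents` is hypothetical, the UTF-8 encoding in the usage comment is an assumption, and the sample lines are toy data. Unlike the cell above, it returns a list rather than a pandas Series and strips surrounding whitespace.

```python
def parse_documents(lines):
    """Return the non-empty lines of an iterable of strings, one document per line."""
    documents = []
    for line in lines:
        text = line.strip()
        if text:  # skip blank lines
            documents.append(text)
    return documents

# Usage with a file (path taken from this notebook, encoding assumed):
# with open('/content/sample_data/phy_corpus.txt', encoding='utf-8') as handle:
#     cor = parse_documents(handle)

# Toy lines standing in for the file content
sample = ["A car starts from rest.\n", "\n", "  A feather is dropped on the moon.\n"]
documents = parse_documents(sample)
print(documents)
```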

In [None]:
# Generator that stems each document, strips the digits, and then tokenizes it.
# stem_text works on whitespace-separated words, so it runs before punctuation is removed.
def preprocessing():
  for document in cor:
    doc = strip_numeric(stem_text(document))
    yield gensim.utils.tokenize(doc, lower=True)

texts = preprocessing()  # lazily yields the tokens of each of the 271 documents

In [None]:
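texts = preprocessing()  # recreate the generator, since it can be consumed only once

# Build a dictionary that maps every token in the vocabulary to an integer id
dictionary = corpora.Dictionary(texts)
print(dictionary)  # vocabulary size before filtering
# Keep tokens that occur in at least 1 document and cap the vocabulary at the 700 most frequent.
# The default no_above=0.5 also drops tokens found in more than half of the documents.
dictionary.filter_extremes(no_below=1, keep_n=700)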

Dictionary(921 unique tokens: ['a', 'acceler', 'airplan', 'an', 'at']...)


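The corpus has 271 documents. The 921 tokens printed above are the vocabulary before filtering; after filter_extremes the dictionary keeps at most 700 tokens.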
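
To make the filtering step concrete, the following is a rough pure-Python sketch of document-frequency filtering in the spirit of `filter_extremes`. It is not Gensim's implementation: the function name `filter_vocabulary`, the alphabetical tie-break, and the toy documents are assumptions made for illustration. The default values mirror Gensim's documented defaults (`no_below=5`, `no_above=0.5`, `keep_n=100000`).

```python
from collections import Counter

def filter_vocabulary(tokenized_docs, no_below=5, no_above=0.5, keep_n=100000):
    """Return the tokens that survive document-frequency filtering."""
    n_docs = len(tokenized_docs)
    doc_freq = Counter()
    for tokens in tokenized_docs:
        doc_freq.update(set(tokens))  # count each token once per document

    max_docs = int(no_above * n_docs)  # largest allowed document frequency
    kept = [t for t, n in doc_freq.items() if no_below <= n <= max_docs]
    # Most frequent first; ties broken alphabetically (a choice made for this sketch)
    kept.sort(key=lambda t: (-doc_freq[t], t))
    return kept[:keep_n]

# Toy corpus: "a" occurs in every document, so no_above=0.5 removes it
toy_docs = [["a", "car", "speed"], ["a", "car"], ["a", "moon"], ["a", "rocket"]]
vocabulary = filter_vocabulary(toy_docs, no_below=1, keep_n=2)
print(vocabulary)
```

Because the `no_above` filter is applied as well, the final vocabulary can be smaller than `keep_n`.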

In [None]:
# Convert every document to a bag-of-words vector: a list of (token id, count) pairs.
# TfidfModel then reweights the counts. Conceptually the result is a term-document matrix
# (rows: tokens, columns: documents, values: TF-IDF weights), stored sparsely per document.
doc_term_matrix = [dictionary.doc2bow(tokens) for tokens in preprocessing()]
tfidf = models.TfidfModel(doc_term_matrix)
corpus_tfidf = tfidf[doc_term_matrix]
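
As a rough illustration of what the TF-IDF step computes, here is a pure-Python sketch. It assumes the weighting `tf * log2(N / df)` followed by L2 normalization, which matches Gensim's default `TfidfModel` settings as far as I know. The function name `tfidf_vectors` and the toy documents are hypothetical.

```python
import math
from collections import Counter

def tfidf_vectors(tokenized_docs):
    """Return one {token: weight} dict per document, L2-normalized."""
    n_docs = len(tokenized_docs)
    doc_freq = Counter()
    for tokens in tokenized_docs:
        doc_freq.update(set(tokens))  # document frequency of each token

    vectors = []
    for tokens in tokenized_docs:
        term_freq = Counter(tokens)
        weights = {t: c * math.log2(n_docs / doc_freq[t]) for t, c in term_freq.items()}
        # A token that occurs in every document has idf 0 and carries no information
        weights = {t: w for t, w in weights.items() if w > 0.0}
        norm = math.sqrt(sum(w * w for w in weights.values()))
        vectors.append({t: w / norm for t, w in weights.items()} if norm else {})
    return vectors

toy_docs = [["car", "speed", "car"], ["car", "rocket"], ["car", "moon", "speed"]]
vectors = tfidf_vectors(toy_docs)
for vector in vectors:
    print(vector)
```

In this toy corpus "car" appears in every document, so it drops out of all three vectors even though it is the most frequent word.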

In [None]:
# LsiModel computes a truncated SVD (U Σ V^T) of the TF-IDF term-document matrix.
# num_topics: number of requested factors (latent dimensions), i.e. how many singular values are kept.
# It plays the same role as the number of singular values kept for dimension reduction in the image program.
# corpus: stream of document vectors (iterable of lists of (int, float)) or a scipy.sparse.csc matrix;
# here we pass the TF-IDF corpus.
# id2word: mapping from token id to token.
lsi = models.LsiModel(corpus_tfidf, id2word=dictionary, num_topics=1)

# Build a query
doc = 'car acceleration speed far'  # four query terms
print(type(doc))  # str

# doc2bow maps the query tokens to (token id, count) pairs using the dictionary.
# The query is only lowercased and split here; it is not stemmed like the corpus.
vec_bow = dictionary.doc2bow(doc.lower().split())
print(type(vec_bow))  # list
print(vec_bow)  # every count is 1 because each query term occurs once
print(dictionary[19])
print(dictionary[32])
print(dictionary[56])
print(dictionary[63])

<class 'str'>
<class 'list'>
[(19, 1), (32, 1), (56, 1), (63, 1)]
car
far
acceleration
speed
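
The comments in the cell above describe LSI as the decomposition U Σ V^T with only `num_topics` singular values kept. The following NumPy sketch shows that idea on a toy term-document matrix. It is not Gensim's implementation: the helper names `lsi_project` and `cosine`, the matrix, and the query are made up for illustration, and projecting the query with U_k^T (without rescaling by the singular values) is an assumption about the closest equivalent.

```python
import numpy as np

def lsi_project(term_doc, k):
    """Rank-k truncated SVD: return (U_k, document coordinates in the latent space)."""
    U, s, Vt = np.linalg.svd(term_doc, full_matrices=False)
    U_k = U[:, :k]
    doc_coords = (np.diag(s[:k]) @ Vt[:k, :]).T  # one row per document
    return U_k, doc_coords

def cosine(a, b):
    """Cosine similarity, defined as 0.0 when either vector is all zeros."""
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom else 0.0

# Toy matrix: 4 terms (rows) x 3 documents (columns)
A = np.array([[1.0, 1.0, 0.0],
              [1.0, 0.0, 0.0],
              [0.0, 1.0, 1.0],
              [0.0, 0.0, 1.0]])

U_k, docs_k = lsi_project(A, k=2)
query = np.array([1.0, 1.0, 0.0, 0.0])  # uses the same terms as document 0
query_k = U_k.T @ query                  # fold the query into the latent space
scores = [cosine(query_k, d) for d in docs_k]
print(scores)
```

A query identical to a document lands on the same latent coordinates as that document, so its similarity to that document is 1 up to rounding.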


In [None]:
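vec_lsi = lsi[vec_bow]  # project the query into LSI space
# Index every document in LSI space. The model was trained on corpus_tfidf,
# while the raw bag-of-words vectors are projected here.
index = similarities.MatrixSimilarity(lsi[doc_term_matrix])
unsorted_similarity = index[vec_lsi]  # cosine similarity between the query and each document
sorted_similarity = sorted(enumerate(unsorted_similarity), key=lambda item: -item[1])
for doc_id, similarity in sorted_similarity:
  print(similarity, cor[doc_id])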



1.0 An airplane accelerates down a runway at 3.20 m/s2 for 32.8 s until is finally lifts off the ground. Determine the distance traveled before takeoff.
1.0 A car starts from rest and accelerates uniformly over a time of 5.21 seconds for a distance of 110 m. Determine the acceleration of the car.
1.0 Upton Chuck is riding the Giant Drop at Great America. If Upton free falls for 2.60 seconds, what will be his final velocity and how far will he fall?
1.0 A race car accelerates uniformly from 18.5 m/s to 46.1 m/s in 2.47 seconds. Determine the acceleration of the car and the distance traveled.
1.0 A feather is dropped on the moon from a height of 1.40 meters. The acceleration of gravity on the moon is 1.67 m/s2. Determine the time for the feather to fall to the surface of the moon.
1.0 Rocket-powered sleds are used to test the human response to acceleration. If a rocket-powered sled is accelerated to a speed of 444 m/s in 1.83 seconds, then what is the acceleration and what is the distanc

In [None]:
vec_lsi[0]

(0, 0.595422052592812)

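Here we kept only one singular value (num_topics=1), so the query and every document are projected onto a single latent dimension. vec_lsi therefore holds a single coordinate, about 0.595 at index 0. The cosine similarity between two one-dimensional vectors depends only on their signs, which is why every document listed above scores 1.0. Such a ranking cannot separate relevant documents from irrelevant ones.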
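
The uniform scores of 1.0 can be reproduced with a few lines of plain Python. In this sketch the query coordinate 0.595 is taken from vec_lsi above, while the document coordinates and the two-dimensional vectors are made-up values for illustration.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

query_1d = [0.595]                          # single LSI coordinate of the query
docs_1d = [[0.12], [3.4], [0.001], [-0.8]]  # hypothetical document coordinates
scores_1d = [cosine(query_1d, d) for d in docs_1d]
print(scores_1d)  # magnitudes are lost; only the sign survives

# With two dimensions the angle between vectors starts to matter
score_2d = cosine([1.0, 0.0], [1.0, 1.0])
print(score_2d)
```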

Now we keep two singular values (num_topics=2).

In [None]:
lsi = models.LsiModel(corpus_tfidf, id2word=dictionary, num_topics=2) 
doc = 'car acceleration speed far' 
vec_bow = dictionary.doc2bow(doc.lower().split())

vec_lsi = lsi[vec_bow] # convert the query to LSI space
index = similarities.MatrixSimilarity(lsi[doc_term_matrix])
unsorted_similarity = index[vec_lsi]
sorted_similarity = sorted(enumerate(unsorted_similarity), key= lambda item: -item[1])
for doc_id, similarity in sorted_similarity:
  print(similarity, cor[doc_id])



0.9999981 A race car accelerates uniformly from 18.5 m/s to 46.1 m/s in 2.47 seconds. Determine the acceleration of the car and the distance traveled.
0.9998269 A car traveling at 22.4 m/s skids to a stop in 2.55 s. Determine the skidding distance of the car (assume uniform acceleration).
0.9994363 A car starts from rest and accelerates uniformly for 5.21 seconds over a distance of 110 m.  Determine the acceleration of the car.
0.9990283 A car starts from rest and accelerates uniformly over a time of 5.21 seconds for a distance of 110 m. Determine the acceleration of the car.
0.998698 A race car accelerates uniformly from 18.5 m/s to 46.1 m/s in 2.47 seconds. Determine the acceleration of the car. 
0.99853987 The speed of a car is reduced from 90 km/hr to 36 km/hr in 5 s. What is a distance travelled by the car during this time interval
0.997599 A dragster accelerates to a speed of 112 m/s over a distance of 398 m. Determine the acceleration (assume uniform) of the dragster.
0.9956529 

In [None]:
lsi = models.LsiModel(corpus_tfidf, id2word=dictionary, num_topics=15) 
doc = 'car acceleration speed far' 
vec_bow = dictionary.doc2bow(doc.lower().split())

vec_lsi = lsi[vec_bow] # convert the query to LSI space
index = similarities.MatrixSimilarity(lsi[doc_term_matrix])
unsorted_similarity = index[vec_lsi]
sorted_similarity = sorted(enumerate(unsorted_similarity), key= lambda item: -item[1])
for doc_id, similarity in sorted_similarity:
  print(similarity, cor[doc_id])



0.95803994 A driver of a car traveling at -15 m/s applies the brakes, causing a uniform accelerationof +2 m/s2 . If the brakes are applied for 2.5 seconds what is the velocity of the car at theend of the braking period? How far has the car moved during the braking period?
0.9331158 A car has an acceleration of 3 m/s². If the initial velocity of the car is 5 m/s, determine: (a) How far the car travels in 6 s; (b) How far the car has travelled when it reaches a velocity of 30 m/s. 
0.9267618 A driver of a car traveling at 15.0 m/s applies the brakes, causing a uniform acceleration of -2.0 m/s2.  How long does it take the car to accelerate to a final speed of 10.0 m/s?  How far has the car moved during the braking period? 
0.9209407 A car starts from rest and travels for 5 seconds with a uniform acceleration of– 1.5 m/s2 . What is the final velocity of the car.? How far does the car travel in this timeinterval?
0.91146415 A car starts from rest and travels for 5.0 s with a uniform acceler