Word2Vec

In [6]:
import re
import pandas as pd
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# Sample corpus
documents = [
'Owning a car is no longer a luxury, but it has become a necessity. Whether you drive to work or enjoy weekend drives with the family,\
 having a car can simplify your travels and not to forget the ease and comfort it brings',
'Purchasing their dream vehicle is easier than ever before for millions of Indians – thanks to the widespread availability of car loans in India.',
'Car loans offer you the money for the vehicle upfront. You can then comfortably repay the borrowed amount via affordable monthly EMIs.',
'An auto loan is a secured loan, as the car acts as the guarantee. There is no need to provide any additional asset or mortgage while procuring the loan.',
'Before you apply for an auto loan, you need to compare the interest rates charged by lenders. Even slight variations in the interest rates can play a huge role in increasing or reducing your overall burden.',
'To make it easy for you, here in this guide, we list out the interest rates charged by leading lenders for auto loans in India. You can use this handy table to quickly compare the interest rates before you make a decision.'
]

# Sample queries

queries = [
'what does car loan offer',
'guide me about loans'
]

documents_df=pd.DataFrame(documents,columns=['documents'])
queries_df = pd.DataFrame(queries,columns=['queries'])

In [7]:
documents_df

Unnamed: 0,documents
0,"Owning a car is no longer a luxury, but it has..."
1,Purchasing their dream vehicle is easier than ...
2,Car loans offer you the money for the vehicle ...
3,"An auto loan is a secured loan, as the car act..."
4,"Before you apply for an auto loan, you need to..."
5,"To make it easy for you, here in this guide, w..."


In [8]:
queries_df

Unnamed: 0,queries
0,what does car loan offer
1,guide me about loans


In [9]:
import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords

[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/himanshujanbandhu/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [10]:
stop_words_l=stopwords.words('english')
documents_df['documents_cleaned']=documents_df.documents.apply(lambda x: " ".join(re.sub(r'[^a-zA-Z]',' ',w).lower() for w in x.split() if re.sub(r'[^a-zA-Z]',' ',w).lower() not in stop_words_l) )


In [11]:
documents_df

Unnamed: 0,documents,documents_cleaned
0,"Owning a car is no longer a luxury, but it has...",owning car longer luxury become necessity wh...
1,Purchasing their dream vehicle is easier than ...,purchasing dream vehicle easier ever millions ...
2,Car loans offer you the money for the vehicle ...,car loans offer money vehicle upfront comfort...
3,"An auto loan is a secured loan, as the car act...",auto loan secured loan car acts guarantee ne...
4,"Before you apply for an auto loan, you need to...",apply auto loan need compare interest rates c...
5,"To make it easy for you, here in this guide, w...",make easy you guide list interest rates char...


In [12]:
queries_df['queries_cleaned']=queries_df.queries.apply(lambda x: " ".join(re.sub(r'[^a-zA-Z]',' ',w).lower() for w in x.split() if re.sub(r'[^a-zA-Z]',' ',w).lower() not in stop_words_l) )


In [13]:
queries_df

Unnamed: 0,queries,queries_cleaned
0,what does car loan offer,car loan offer
1,guide me about loans,guide loans


In [14]:
# tokenize and pad every document to make them of the same size
# from keras.preprocessing.text import Tokenizer
# from keras.preprocessing.sequence import pad_sequences
# tokenizer=Tokenizer()
# tokenizer.fit_on_texts(documents_df.documents_cleaned)
# tokenized_documents=tokenizer.texts_to_sequences(documents_df.documents_cleaned)
# tokenized_paded_documents=pad_sequences(tokenized_documents,maxlen=64,padding='post')
# vocab_size=len(tokenizer.word_index)+1
# print (tokenized_documents[0])

In [15]:
# tokenizer.fit_on_texts(queries_df.queries_cleaned)
# tokenized_queries = tokenizer.texts_to_sequences(queries_df.queries_cleaned)
# tokenized_paded_queries=pad_sequences(tokenized_queries,maxlen=64,padding='post')
# print (tokenized_queries[0])

In [16]:
# from sklearn.metrics.pairwise import cosine_similarity

In [17]:
# loading pre-trained embeddings, each word is represented as a 300 dimensional vector
from gensim.models.word2vec import Word2Vec

In [18]:
lst = documents_df.documents_cleaned.tolist()
corpus = [x.split() for x in lst]
corpus

[['owning',
  'car',
  'longer',
  'luxury',
  'become',
  'necessity',
  'whether',
  'drive',
  'work',
  'enjoy',
  'weekend',
  'drives',
  'family',
  'car',
  'simplify',
  'travels',
  'forget',
  'ease',
  'comfort',
  'brings'],
 ['purchasing',
  'dream',
  'vehicle',
  'easier',
  'ever',
  'millions',
  'indians',
  'thanks',
  'widespread',
  'availability',
  'car',
  'loans',
  'india'],
 ['car',
  'loans',
  'offer',
  'money',
  'vehicle',
  'upfront',
  'comfortably',
  'repay',
  'borrowed',
  'amount',
  'via',
  'affordable',
  'monthly',
  'emis'],
 ['auto',
  'loan',
  'secured',
  'loan',
  'car',
  'acts',
  'guarantee',
  'need',
  'provide',
  'additional',
  'asset',
  'mortgage',
  'procuring',
  'loan'],
 ['apply',
  'auto',
  'loan',
  'need',
  'compare',
  'interest',
  'rates',
  'charged',
  'lenders',
  'even',
  'slight',
  'variations',
  'interest',
  'rates',
  'play',
  'huge',
  'role',
  'increasing',
  'reducing',
  'overall',
  'burden'],
 ['

In [19]:
model = Word2Vec(corpus,min_count=1,size= 50,window =5, sg = 1)

In [20]:
print('Vocabulary size:', len(model.wv.vocab))

Vocabulary size: 80


In [21]:
model.vector_size

50

In [22]:
my_dict = dict({})
for idx, key in enumerate(model.wv.vocab):
    my_dict[key] = model.wv[key]

In [23]:
my_dict

{'owning': array([ 0.00136722,  0.00073302, -0.00312678, -0.00843529, -0.00012492,
         0.00358586,  0.00995158,  0.00784087, -0.00539048,  0.00616643,
        -0.00563996,  0.00810973,  0.00928352, -0.00347594,  0.00532333,
         0.00586007,  0.00423579,  0.00237673, -0.0093098 ,  0.00609129,
        -0.00837261, -0.00056596, -0.00981516, -0.00349451, -0.00253408,
         0.00594895,  0.00265248, -0.00679193,  0.00739154,  0.00383918,
        -0.00739351,  0.00939363, -0.00628943, -0.0023068 , -0.00632067,
         0.00488165,  0.00902812, -0.0073284 , -0.00842327,  0.00097731,
        -0.00383816,  0.00144603, -0.00429299,  0.00019618,  0.00794138,
        -0.00628094, -0.00216637,  0.0013938 , -0.00870527, -0.00493647],
       dtype=float32),
 'car': array([-0.00384832,  0.00579113, -0.00235709, -0.00637885, -0.0036277 ,
         0.00776676,  0.00751472, -0.00672752,  0.00175185, -0.00259413,
        -0.00324512,  0.00325516, -0.00257964, -0.00025642, -0.0008734 ,
         0

In [24]:
model.wv.vocab

{'owning': <gensim.models.keyedvectors.Vocab at 0x7fee30755a00>,
 'car': <gensim.models.keyedvectors.Vocab at 0x7fee30755b20>,
 'longer': <gensim.models.keyedvectors.Vocab at 0x7fee30755b80>,
 'luxury': <gensim.models.keyedvectors.Vocab at 0x7fee30755a60>,
 'become': <gensim.models.keyedvectors.Vocab at 0x7fee30755af0>,
 'necessity': <gensim.models.keyedvectors.Vocab at 0x7fee30755910>,
 'whether': <gensim.models.keyedvectors.Vocab at 0x7fee30755670>,
 'drive': <gensim.models.keyedvectors.Vocab at 0x7fee30755e50>,
 'work': <gensim.models.keyedvectors.Vocab at 0x7fee30755eb0>,
 'enjoy': <gensim.models.keyedvectors.Vocab at 0x7fee30755f10>,
 'weekend': <gensim.models.keyedvectors.Vocab at 0x7fee30755f70>,
 'drives': <gensim.models.keyedvectors.Vocab at 0x7fee30755fd0>,
 'family': <gensim.models.keyedvectors.Vocab at 0x7fee30769070>,
 'simplify': <gensim.models.keyedvectors.Vocab at 0x7fee307690d0>,
 'travels': <gensim.models.keyedvectors.Vocab at 0x7fee30769130>,
 'forget': <gensim.model

In [25]:
# Function returning vector reperesentation of a query
# def get_embedding_w2v(query_tokens):
#     embeddings = []
#     if len(query_tokens)<1:
#         return np.zeros(300)
#     else:
#         for tok in query_tokens:
#             if tok in model.wv.vocab:
#                 embeddings.append(model.wv.word_vec(tok))
#             else:
#                 embeddings.append(np.random.rand(300))
#         # mean the vectors of individual words to get the vector of the document
#         return np.mean(embeddings, axis=0)

# # Getting Word2Vec Vectors for Queries
# queries_df['vector']=queries_df['queries_cleaned'].apply(lambda x :get_embedding_w2v(x.split()))

In [26]:
# from sklearn.metrics.pairwise import cosine_similarity

# # Function for calculating average precision for a query
# def average_precision(qid,qvector):
  
#   # Getting the ground truth and document vectors
#   qresult=testing_result.loc[testing_result['qid']==qid,['docid','rel']]
#   qcorpus=testing_corpus.loc[testing_corpus['docid'].isin(qresult['docid']),['docid','vector']]
#   qresult=pd.merge(qresult,qcorpus,on='docid')
  
#   # Ranking documents for the query
#   qresult['similarity']=qresult['vector'].apply(lambda x: cosine_similarity(np.array(qvector).reshape(1, -1),np.array(x).reshape(1, -1)).item())
#   qresult.sort_values(by='similarity',ascending=False,inplace=True)

#   # Taking Top 10 documents for the evaluation
#   ranking=qresult.head(10)['rel'].values
  
#   # Calculating precision
#   precision=[]
#   for i in range(1,11):
#     if ranking[i-1]:
#       precision.append(np.sum(ranking[:i])/i)
  
#   # If no relevant document in list then return 0
#   if precision==[]:
#     return 0

#   return np.mean(precision)

# # Calculating average precision for all queries in the test set
# testing_queries['AP']=testing_queries.apply(lambda x: average_precision(x['qid'],x['vector']),axis=1)

# # Finding Mean Average Precision
# print('Mean Average Precision=>',testing_queries['AP'].mean())


In [27]:
search = queries_df['queries_cleaned'][0].split()

In [28]:
search

['car', 'loan', 'offer']

In [29]:
res = model.wv.most_similar(positive=search,topn=4)
res

[('increasing', 0.2691025137901306),
 ('burden', 0.25217533111572266),
 ('quickly', 0.24354708194732666),
 ('handy', 0.22967523336410522)]

Doc2Vec

In [30]:
import gensim
from gensim.models.doc2vec import Doc2Vec, TaggedDocument
from nltk.tokenize import word_tokenize
from gensim.models.doc2vec import Doc2Vec

In [31]:
data = ["The process of searching for a job can be very stressful, but it doesn’t have to be. Start with a\
        well-written resume that has appropriate keywords for your occupation. Next, conduct a targeted job search\
        for positions that meet your needs.",
        "Gardening in mixed beds is a great way to get the most productivity from a small space. Some investment\
        is required, to purchase materials for the beds themselves, as well as soil and compost. The\
        investment will likely pay-off in terms of increased productivity.",
        "Looking for a job can be very stressful, but it doesn’t have to be. Begin by writing a good resume with\
        appropriate keywords for your occupation. Second, target your job search for positions that match your\
        needs."]

In [32]:
# import nltk
# nltk.download('all')

In [33]:
tagged_data = [TaggedDocument(words=word_tokenize(_d.lower()), tags=[str(i)]) for i, _d in enumerate(data)]

In [34]:
print (tagged_data)

[TaggedDocument(words=['the', 'process', 'of', 'searching', 'for', 'a', 'job', 'can', 'be', 'very', 'stressful', ',', 'but', 'it', 'doesn', '’', 't', 'have', 'to', 'be', '.', 'start', 'with', 'a', 'well-written', 'resume', 'that', 'has', 'appropriate', 'keywords', 'for', 'your', 'occupation', '.', 'next', ',', 'conduct', 'a', 'targeted', 'job', 'search', 'for', 'positions', 'that', 'meet', 'your', 'needs', '.'], tags=['0']), TaggedDocument(words=['gardening', 'in', 'mixed', 'beds', 'is', 'a', 'great', 'way', 'to', 'get', 'the', 'most', 'productivity', 'from', 'a', 'small', 'space', '.', 'some', 'investment', 'is', 'required', ',', 'to', 'purchase', 'materials', 'for', 'the', 'beds', 'themselves', ',', 'as', 'well', 'as', 'soil', 'and', 'compost', '.', 'the', 'investment', 'will', 'likely', 'pay-off', 'in', 'terms', 'of', 'increased', 'productivity', '.'], tags=['1']), TaggedDocument(words=['looking', 'for', 'a', 'job', 'can', 'be', 'very', 'stressful', ',', 'but', 'it', 'doesn', '’', '

In [35]:
model = gensim.models.doc2vec.Doc2Vec(vector_size=50, min_count=1)

In [36]:
model.build_vocab(tagged_data)

In [37]:
model.corpus_count

3

In [38]:
model.train(tagged_data, total_examples=model.corpus_count,epochs=100)

In [39]:
query = 'process of searching a job'.lower()

In [40]:
#query_tagged_data = [TaggedDocument(words=word_tokenize(_d.lower()), tags=[str(i)]) for i, _d in enumerate(query)]

In [41]:
query_tokenized = word_tokenize(query)

In [42]:
print(query)
print(query_tokenized)

process of searching a job
['process', 'of', 'searching', 'a', 'job']


In [43]:
#print(query_tagged_data)
query_vec = model.infer_vector(query_tokenized)

In [44]:
len(query_vec)

50

In [45]:
from sklearn.metrics.pairwise import cosine_similarity

In [46]:
model.docvecs[1]

array([-0.03016173, -0.180061  ,  0.01067809,  0.03839802,  0.1139955 ,
        0.21592875, -0.41447943,  0.5810642 ,  0.24364698, -0.09098065,
        0.1062103 , -0.39281523,  0.29902577,  0.18467925,  0.52572256,
        0.08900933, -0.25466555,  0.1321458 , -0.59352624, -0.32279444,
        0.01047566, -0.19169186,  0.16327046,  0.28052536, -0.63208956,
       -0.1084457 , -0.26082107, -0.35793042, -0.15362631,  0.08357804,
        0.17266004,  0.21756546,  0.17086874, -0.14229459,  0.579431  ,
       -0.16992411, -0.45083657, -0.9059317 , -0.120997  ,  0.27234802,
        0.18110901,  0.35833496, -0.05521316,  0.47466668, -0.12469569,
        0.03187736,  0.50552547, -0.07633592, -0.25747195, -0.6123367 ],
      dtype=float32)

In [47]:
query_vec

array([-0.00982835, -0.08048742,  0.01718349,  0.02073304,  0.05296005,
        0.0990138 , -0.18309185,  0.26962605,  0.11534223, -0.03852508,
        0.0544016 , -0.18225178,  0.14874598,  0.07485695,  0.24424621,
        0.03098448, -0.12384521,  0.0519974 , -0.27887937, -0.14808168,
        0.00786661, -0.09095431,  0.07908096,  0.13627738, -0.30050296,
       -0.05592032, -0.11396559, -0.17436309, -0.07770725,  0.02663039,
        0.07478848,  0.10936848,  0.08822712, -0.06428901,  0.2596563 ,
       -0.06788894, -0.2032232 , -0.4232457 , -0.0610315 ,  0.13013156,
        0.07520054,  0.15975401, -0.0245743 ,  0.22894537, -0.06321605,
        0.00396605,  0.22424985, -0.04089557, -0.11540613, -0.28411666],
      dtype=float32)

In [48]:
model.docvecs[0]

array([-0.00702693, -0.125128  ,  0.02104058,  0.03788082,  0.07891085,
        0.15297   , -0.31005806,  0.43893063,  0.18619025, -0.05855317,
        0.07008303, -0.2934002 ,  0.23925248,  0.13832709,  0.39931768,
        0.05294726, -0.20011395,  0.07651761, -0.44981158, -0.228433  ,
        0.00897306, -0.14460835,  0.13567004,  0.20386796, -0.48529738,
       -0.08220822, -0.19643193, -0.28162366, -0.1089528 ,  0.04368239,
        0.11772045,  0.16304518,  0.13455425, -0.10592616,  0.43866968,
       -0.11157077, -0.31827307, -0.6803794 , -0.07548494,  0.20232473,
        0.13810536,  0.26604298, -0.04847639,  0.35716367, -0.09231748,
        0.01001503,  0.36654377, -0.06047364, -0.20695136, -0.46088654],
      dtype=float32)

In [49]:
import numpy as np

In [50]:
data_array = np.array([model.docvecs[0],model.docvecs[1],model.docvecs[2]])

In [51]:
query_v = query_vec.reshape(1,50)

In [52]:
data_array.shape

(3, 50)

In [53]:
results = cosine_similarity(data_array, query_v)

In [54]:
res = np.argsort(results, axis=0)

In [55]:
res

array([[2],
       [1],
       [0]])

In [56]:
k=1
for i in res[-1:-4:-1]:
    print("result ", k, "=============================")
    print(data[i[0]])
    print("=============================")
    k+=1

The process of searching for a job can be very stressful, but it doesn’t have to be. Start with a        well-written resume that has appropriate keywords for your occupation. Next, conduct a targeted job search        for positions that meet your needs.
Gardening in mixed beds is a great way to get the most productivity from a small space. Some investment        is required, to purchase materials for the beds themselves, as well as soil and compost. The        investment will likely pay-off in terms of increased productivity.
Looking for a job can be very stressful, but it doesn’t have to be. Begin by writing a good resume with        appropriate keywords for your occupation. Second, target your job search for positions that match your        needs.


In [57]:
similar_doc = model.docvecs.most_similar('0')

In [58]:
print(similar_doc)

[('1', 0.999192476272583), ('2', 0.999113917350769)]


In [59]:
data[int(similar_doc[0][0])]

'Gardening in mixed beds is a great way to get the most productivity from a small space. Some investment        is required, to purchase materials for the beds themselves, as well as soil and compost. The        investment will likely pay-off in terms of increased productivity.'

Pretrained word2vec model

In [60]:
import pandas as pd
import re

In [61]:
import gensim

In [62]:
W2V_PATH='GoogleNews-vectors-negative300.bin'
model_w2v = gensim.models.KeyedVectors.load_word2vec_format(W2V_PATH, binary=True)

In [63]:
documents = [
"The terms ‘work’, ‘energy’ and ‘power’ are frequently used\
in everyday language. A farmer ploughing the field, a\
construction worker carrying bricks, a student studying for\
a competitive examination, an artist painting a beautiful\
landscape, all are said to be working. In physics, however,\
the word ‘Work’ covers a definite and precise meaning.\
Somebody who has the capacity to work for 14-16 hours a\
day is said to have a large stamina or energy. We admire a\
long distance runner for her stamina or energy. Energy is\
thus our capacity to do work. In Physics too, the term ‘energy’\
is related to work in this sense, but as said above the term\
‘work’ itself is defined much more precisely. The word ‘power’\
is used in everyday life with different shades of meaning. In\
karate or boxing we talk of ‘powerful’ punches. These are\
delivered at a great speed. This shade of meaning is close to\
the meaning of the word ‘power’ used in physics. We shall\
find that there is at best a loose correlation between the\
physical definitions and the physiological pictures these\
terms generate in our minds. The aim of this chapter is to\
develop an understanding of these three physical quantities.\
Before we proceed to this task, we need to develop a\
mathematical prerequisite, namely the scalar product of two vectors." ,
    
"6.1.1 The Scalar Product\
We have learnt about vectors and their use in Chapter 4.\
Physical quantities like displacement, velocity, acceleration,\
force etc. are vectors. We have also learnt how vectors are\
added or subtracted. We now need to know how vectors are\
multiplied. There are two ways of multiplying vectors which\
we shall come across : one way known as the scalar product\
gives a scalar from two vectors and the other known as the\
vector product produces a new vector from two vectors. We\
shall look at the vector product in Chapter 7. Here we take\
up the scalar product of two vectors. The scalar product or\
dot product of any two vectors A and B, denoted as A.B",
    
"We see that if there is no displacement, there\
is no work done even if the force is large. Thus,\
when you push hard against a rigid brick wall,\
the force you exert on the wall does no work. Yet\
your muscles are alternatively contracting and\
relaxing and internal energy is being used up\
and you do get tired. Thus, the meaning of work\
in physics is different from its usage in everyday\
language.\
No work is done if :\
(i) the displacement is zero as seen in the\
example above. A weightlifter holding a 150\
kg mass steadily on his shoulder for 30 s\
does no work on the load during this time.\
(ii) the force is zero. A block moving on a smooth\
horizontal table is not acted upon by a\
horizontal force (since there is no friction), but\
may undergo a large displacement.\
(iii) the force and displacement are mutually\
perpendicular. This is so since, for θ = π/2 rad\
(= 90o\
), cos (π /2) = 0. For the block moving on\
a smooth horizontal table, the gravitational\
force mg does no work since it acts at right\
angles to the displacement. If we assume that\
the moon’s orbits around the earth is\
perfectly circular then the earth’s\
gravitational force does no work. The moon’s\
instantaneous displacement is tangential\
while the earth’s force is radially inwards and\
From Eq. (6.4) it is clear that work and energy\
have the same dimensions, . The SI unit\
of these is joule (J), named after the famous British\
physicist James Prescott Joule (1811-1869). Since\
work and energy are so widely used as physical\
concepts, alternative units abound and some of\
these are listed in Table 6.1.",
    
"The word potential suggests possibility or\
capacity for action. The term potential energy\
brings to one’s mind ‘stored’ energy. A stretched\
bow-string possesses potential energy. When it\
is released, the arrow flies off at a great speed.\
The earth’s crust is not uniform, but has\
discontinuities and dislocations that are called\
fault lines. These fault lines in the earth’s crust\
are like ‘compressed springs’. They possess a\
large amount of potential energy. An earthquake\
results when these fault lines readjust. Thus,\
potential energy is the ‘stored energy’ by virtue\
of the position or configuration of a body. The\
body left to itself releases this stored energy in\
the form of kinetic energy. Let us make our notion\
of potential energy more concrete.\
The gravitational force on a ball of mass m is\
mg . g may be treated as a constant near the earth\
surface. By ‘near’ we imply that the height h of\
the ball above the earth’s surface is very small\
compared to the earth’s radius RE (h <<RE) so that\
we can ignore the variation of g near the earth’s\
surface*. In what follows we have taken the\
upward direction to be positive. Let us raise the\
ball up to a height h. The work done by the external\
agency against the gravitational force is mgh. This\
work gets stored as potential energy.\
Gravitational potential energy of an object, as a\
function of the height h, is denoted by V(h) and it\
is the negative of work done by the gravitational\
force in raising the object to that height.\
V (h) = mgh\
If h is taken as a variable, it is easily seen that\
the gravitational force F equals the negative of\
the derivative of V(h) with respect to h. Thus,\
d\
d F V(h) m g h =− =−\
The negative sign indicates that the\
gravitational force is downward. When released,\
the ball comes down with an increasing speed.\
Just before it hits the ground, its speed is given\
by the kinematic relation,\
v2 = 2gh",
    
"Heat\
We have seen that the frictional force is excluded\
from the category of conservative forces. However,\
work is associated with the force of friction. A\
block of mass m sliding on a rough horizontal\
surface with speed v0 comes to a halt over a\
distance x0. The work done by the force of kinetic\
friction f over x0 is –f x0. By the work-energy\
theorem 2 m v /2 f x . o 0 = If we confine our scope\
to mechanics, we would say that the kinetic\
energy of the block is ‘lost’ due to the frictional\
force. On examination of the block and the table\
we would detect a slight increase in their\
temperatures. The work done by friction is not\
‘lost’, but is transferred as heat energy. This\
raises the internal energy of the block and the\
table. In winter, in order to feel warm, we\
generate heat by vigorously rubbing our palms\
together. We shall see later that the internal\
energy is associated with the ceaseless, often\
random, motion of molecules. A quantitative idea\
of the transfer of heat energy is obtained by\
noting that 1 kg of water releases about 42000 J\
of energy when it cools by10 °C.\
6.10.2 Chemical Energy\
One of the greatest technical achievements of\
humankind occurred when we discovered how\
to ignite and control fire. We learnt to rub two\
flint stones together (mechanical energy), got\
them to heat up and to ignite a heap of dry leaves\
(chemical energy), which then provided\
sustained warmth. A matchstick ignites into a\
bright flame when struck against a specially\
prepared chemical surface. The lighted\
matchstick, when applied to a firecracker,\
results in a spectacular display of sound and\
light.\
Chemical energy arises from the fact that the\
molecules participating in the chemical reaction\
have different binding energies. A stable chemical\
compound has less energy than the separated parts.\
A chemical reaction is basically a rearrangement\
of atoms. If the total energy of the reactants is more\
than the products of the reaction, heat is released\
and the reaction is said to be an exothermic\
reaction. If the reverse is true, heat is absorbed and\
the reaction is endothermic. Coal consists of\
carbon and a kilogram of it when burnt releases\
3 × 107 J of energy.\
Chemical energy is associated with the forces\
that give rise to the stability of substances. These\
forces bind atoms into molecules, molecules into\
polymeric chains, etc. The chemical energy\
arising from the combustion of coal, cooking gas,\
wood and petroleum is indispensable to our daily\
existence.\
6.10.3 Electrical",

"We have seen that the total mechanical energy\
of the system is conserved if the forces doing work\
on it are conservative. If some of the forces\
involved are non-conservative, part of the\
mechanical energy may get transformed into\
other forms such as heat, light and sound.\
However, the total energy of an isolated system\
does not change, as long as one accounts for all\
forms of energy. Energy may be transformed from\
one form to another but the total energy of an\
isolated system remains constant. Energy can\
neither be created, nor destroyed.\
Since the universe as a whole may be viewed\
as an isolated system, the total energy of the\
universe is constant. If one part of the universe\
loses energy, another part must gain an equal\
amount of energy.\
The principle of conservation of energy cannot\
be proved. However, no violation of this principle\
has been observed. The concept of conservation\
and transformation of energy into various forms\
links together various branches of physics,\
chemistry and life sciences. It provides a\
unifying, enduring element in our scientific\
pursuits. From engineering point of view all\
electronic, communication and mechanical\
devices rely on some forms of energy\
transformation."
    

]



In [64]:
query = ['what is work done']

In [65]:
documents_df=pd.DataFrame(documents,columns=['documents'])

In [66]:
documents_df

Unnamed: 0,documents
0,"The terms ‘work’, ‘energy’ and ‘power’ are fre..."
1,6.1.1 The Scalar ProductWe have learnt about v...
2,"We see that if there is no displacement, there..."
3,The word potential suggests possibility orcapa...
4,HeatWe have seen that the frictional force is ...
5,We have seen that the total mechanical energyo...


In [67]:
import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords

[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/himanshujanbandhu/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [68]:
stop_words_l=stopwords.words('english')
documents_df['documents_cleaned']=documents_df.documents.apply(lambda x: " ".join(re.sub(r'[^a-zA-Z]',' ',w).lower() for w in x.split() if re.sub(r'[^a-zA-Z]',' ',w).lower() not in stop_words_l) )

In [71]:
# tokenize and pad every document to make them of the same size
# !pip install tensorflow

# from keras.preprocessing.text import Tokenizer
# from keras.preprocessing.sequence import pad_sequences
# tokenizer=Tokenizer()
# tokenizer.fit_on_texts(documents_df.documents_cleaned)
from nltk.tokenize import word_tokenize


NameError: name 'tokenizer' is not defined

In [68]:
tokenized_paded_documents=pad_sequences(tokenized_documents,maxlen=64,padding='post')
print(tokenized_paded_documents)

[[ 2  1 10 11 12 20 21 22  2  1 12 23 13 24 14 25  4 26 27  4 15 16  2  1
  28 29  7 30  5 31  8 32 33 34 17  0  0  0  0  0  0  0  0  0  0  0  0  0
   0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0]
 [ 2  1 35 36 37 38 39 40 41 42  5 15 10 13 43 44 45 46  9 47 48  2  1  0
   0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
   0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0]
 [ 2  1 18  5 49 50  8 51 52 53 54 18  5  1  4 55 56 57  8  0  0  0  0  0
   0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
   0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0]
 [ 2  1  7 58 59 60 61 62 63 64 65 66 67 17  1 68 69 70 71  0  0  0  0  0
   0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
   0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0]
 [ 3  6 72  9  6  7 73  3  3  6 74 16  0  0  0  0  0  0  0  0  0  0  0  0
   0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
   0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0]
 [ 3 

In [69]:
tokenizer.word_index

{'accomplish': 31,
 'algorithm': 34,
 'algorithms': 12,
 'application': 9,
 'approaches': 7,
 'automatically': 21,
 'available': 17,
 'based': 14,
 'broad': 61,
 'build': 23,
 'carry': 56,
 'cases': 82,
 'categories': 62,
 'certain': 57,
 'closely': 35,
 'comfortable': 85,
 'computational': 37,
 'computer': 11,
 'computers': 5,
 'computing': 74,
 'concernedabout': 79,
 'correctness': 80,
 'creates': 75,
 'data': 4,
 'delivers': 44,
 'depending': 63,
 'developing': 88,
 'development': 73,
 'discipline': 16,
 'discovering': 49,
 'divided': 59,
 'domains': 47,
 'employs': 28,
 'engineer': 19,
 'engineering': 6,
 'execute': 78,
 'experience': 22,
 'explicitly': 52,
 'explorative': 90,
 'feedback': 67,
 'field': 48,
 'focuses': 39,
 'fully': 32,
 'improve': 20,
 'involves': 18,
 'iterative': 89,
 'known': 26,
 'learning': 1,
 'logic': 77,
 'machine': 2,
 'making': 40,
 'mathematical': 13,
 'meanwhile': 83,
 'methods': 45,
 'model': 24,
 'nature': 64,
 'optimization': 43,
 'or': 66,
 'perfor

In [70]:
len(tokenizer.word_index)

91

In [71]:
vocab_size=len(tokenizer.word_index)+1
print(vocab_size)

92


In [72]:
tokenizer.word_index.items()

dict_items([('learning', 1), ('machine', 2), ('software', 3), ('data', 4), ('computers', 5), ('engineering', 6), ('approaches', 7), ('tasks', 8), ('application', 9), ('study', 10), ('computer', 11), ('algorithms', 12), ('mathematical', 13), ('based', 14), ('the', 15), ('discipline', 16), ('available', 17), ('involves', 18), ('engineer', 19), ('improve', 20), ('automatically', 21), ('experience', 22), ('build', 23), ('model', 24), ('sample', 25), ('known', 26), ('training', 27), ('employs', 28), ('various', 29), ('teach', 30), ('accomplish', 31), ('fully', 32), ('satisfactory', 33), ('algorithm', 34), ('closely', 35), ('related', 36), ('computational', 37), ('statistics', 38), ('focuses', 39), ('making', 40), ('predictions', 41), ('using', 42), ('optimization', 43), ('delivers', 44), ('methods', 45), ('theory', 46), ('domains', 47), ('field', 48), ('discovering', 49), ('perform', 50), ('without', 51), ('explicitly', 52), ('programmed', 53), ('so', 54), ('provided', 55), ('carry', 56), (

In [73]:
import numpy as np

In [74]:
# creating embedding matrix, every row is a vector representation from the vocabulary indexed by the tokenizer index. 
embedding_matrix=np.zeros((vocab_size,300))
for word,i in tokenizer.word_index.items():
    if word in model_w2v:
        embedding_matrix[i]=model_w2v[word]

In [75]:
embedding_matrix.shape

(92, 300)

In [76]:
query[0]

'machine learning algorithms'

In [79]:
query_tokenizer = Tokenizer()
query_tokenizer.fit_on_texts(query)

In [80]:
query_tokenizer.word_index

{'algorithms': 3, 'learning': 2, 'machine': 1}

In [92]:
query_tokenizer.word_index.items()

dict_items([('machine', 1), ('learning', 2), ('algorithms', 3)])

In [81]:
query_vocab_size=len(query_tokenizer.word_index)+1
print(query_vocab_size)

4


In [82]:
query_matrix=np.zeros((query_vocab_size,300))
for word,i in query_tokenizer.word_index.items():
    if word in model_w2v:
        query_matrix[i]=model_w2v[word]

In [91]:
query_matrix.shape

(4, 300)

In [93]:
query_matrix

array([[ 0.        ,  0.        ,  0.        , ...,  0.        ,
         0.        ,  0.        ],
       [ 0.25585938, -0.02209473,  0.02905273, ...,  0.04541016,
        -0.33984375, -0.08154297],
       [-0.08837891,  0.1484375 , -0.06298828, ...,  0.02026367,
         0.11621094,  0.17578125],
       [ 0.22753906, -0.09130859,  0.45703125, ..., -0.11621094,
        -0.15917969,  0.30664062]])

In [85]:
from sklearn.metrics.pairwise import cosine_similarity

In [86]:
results = cosine_similarity(embedding_matrix, query_matrix)

In [95]:
results.shape

(92, 4)

In [88]:
res = np.argsort(results, axis=0)

In [96]:
# creating document-word embeddings
document_word_embeddings=np.zeros((len(tokenized_paded_documents),64,300))
for i in range(len(tokenized_paded_documents)):
    for j in range(len(tokenized_paded_documents[0])):
        document_word_embeddings[i][j]=embedding_matrix[tokenized_paded_documents[i][j]]
document_word_embeddings.shape

(6, 64, 300)

In [97]:
document_word_embeddings

array([[[ 0.25585938, -0.02209473,  0.02905273, ...,  0.04541016,
         -0.33984375, -0.08154297],
        [-0.08837891,  0.1484375 , -0.06298828, ...,  0.02026367,
          0.11621094,  0.17578125],
        [-0.05981445, -0.04223633, -0.07910156, ..., -0.11230469,
          0.12060547, -0.15429688],
        ...,
        [ 0.        ,  0.        ,  0.        , ...,  0.        ,
          0.        ,  0.        ],
        [ 0.        ,  0.        ,  0.        , ...,  0.        ,
          0.        ,  0.        ],
        [ 0.        ,  0.        ,  0.        , ...,  0.        ,
          0.        ,  0.        ]],

       [[ 0.25585938, -0.02209473,  0.02905273, ...,  0.04541016,
         -0.33984375, -0.08154297],
        [-0.08837891,  0.1484375 , -0.06298828, ...,  0.02026367,
          0.11621094,  0.17578125],
        [-0.08544922, -0.00230408, -0.0390625 , ...,  0.21191406,
         -0.31835938, -0.05200195],
        ...,
        [ 0.        ,  0.        ,  0.        , ...,  

In [98]:
tokenized_query=query_tokenizer.texts_to_sequences(query)
print(tokenized_query)

[[1, 2, 3]]


In [99]:
tokenized_paded_query=pad_sequences(tokenized_query,maxlen=64,padding='post')
print(tokenized_paded_query)

[[1 2 3 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
  0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]]


In [100]:
query_word_embeddings=np.zeros((len(tokenized_paded_query),64,300))
for i in range(len(tokenized_paded_query)):
    for j in range(len(tokenized_paded_query[0])):
        query_word_embeddings[i][j]=query_matrix[tokenized_paded_query[i][j]]
query_word_embeddings.shape

(1, 64, 300)

In [103]:
nsamples, nx, ny = document_word_embeddings.shape
document_word_embeddings = document_word_embeddings.reshape((nsamples,nx*ny))

In [104]:
nsamples, nx, ny = query_word_embeddings.shape
query_word_embeddings = query_word_embeddings.reshape((nsamples,nx*ny))

In [105]:
results = cosine_similarity(document_word_embeddings, query_word_embeddings)

In [106]:
results

array([[0.16410047],
       [0.20806154],
       [0.25309566],
       [0.24802323],
       [0.10865737],
       [0.05159032]])

In [107]:
res = np.argsort(results, axis=0,)
res

array([[5],
       [4],
       [0],
       [1],
       [3],
       [2]])

In [108]:
res[-1:-4:-1]

array([[2],
       [3],
       [1]])

In [112]:
k=1
for i in res[-1:-5:-1]:
    print("result ", k, "=============================")
    print(documents[i[0]])
    print("=============================")
    k+=1

Machine learning involves computers discovering how they can perform tasks without being explicitly programmed to do so. It involves computers learning from data provided so that they carry out certain tasks.
Machine learning approaches are traditionally divided into three broad categories, depending on the nature of the "signal"or "feedback" available to the learning system: Supervised, Unsupervised and Reinforcement
Machine learning is closely related to computational statistics, which focuses on making predictions using computers.The study of mathematical optimization delivers methods, theory and application domains to the field of machine learning.
Machine learning is the study of computer algorithms that improve automatically through experience.Machine learning algorithms build a mathematical model based on sample data, known as training data.The discipline of machine learning employs various approaches to teach computers to accomplish tasks where no fully satisfactory algorithm i