### Author: Armaan Bhullar
### Email: jhaniria@gmail.com

#### This notebook explains the basics of using LSI (Latent Semantic Indexing) for:
* Topic modelling
* Creating a simple searchable index over documents
* Similarity matching between documents

In [18]:
import gensim
from gensim import corpora
from gensim.models import LsiModel

In [None]:
book_names = ["A dive into maths", "Advanced maths", "Common probability distributions", "The maths of probability",
              "Stochastic probability maths", "Was Thanos right?", "Stark being stark !",
              "A study of marvel characters: Stark vs Dr. Strange", "Why is Dr. strange so strange ?"]
# The titles above come mainly from three topics: maths, probability and the Marvel universe

### Preprocessing
* Tokenization
* Lowercasing
* Stop word removal (not implemented here)

In [75]:
book_names_split = [[word.lower() for word in doc.split(" ")] for doc in book_names]
# A more robust tokenizer from gensim or nltk could be used here;
# splitting on spaces leaves punctuation attached to tokens (e.g. 'right?')
print(book_names_split)

[['a', 'dive', 'into', 'maths'], ['advanced', 'maths'], ['common', 'probability', 'distributions'], ['the', 'maths', 'of', 'probability'], ['stochastic', 'probability', 'maths'], ['was', 'thanos', 'right?'], ['stark', 'being', 'stark', '!'], ['a', 'study', 'of', 'marvel', 'characters:', 'stark', 'vs', 'dr.', 'strange'], ['why', 'is', 'dr.', 'strange', 'so', 'strange', '?']]
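
The output above shows two side effects of splitting on spaces: punctuation stays attached to tokens (`'right?'`, `'characters:'`), and common words such as `'a'` and `'of'` are kept. A minimal sketch of a slightly stricter preprocessing step follows, using only the standard library. The `STOP_WORDS` set and the `preprocess` helper are hypothetical names introduced for illustration; they are not part of this notebook's pipeline or of gensim, and the stop list is deliberately tiny.

```python
import re

# Hypothetical, deliberately small stop list (an assumption for illustration only)
STOP_WORDS = {"a", "of", "the", "is", "so", "was", "why", "into", "being", "vs"}

def preprocess(title):
    """Lowercase, keep alphabetic runs only, and drop stop words."""
    tokens = re.findall(r"[a-z]+", title.lower())
    return [tok for tok in tokens if tok not in STOP_WORDS]

print(preprocess("Was Thanos right?"))
print(preprocess("A study of marvel characters: Stark vs Dr. Strange"))
```

With this kind of cleaning, `'strange'` and `'Strange'` map to the same token and punctuation no longer produces separate dictionary entries such as `'!'` or `'?'`.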


### Let's start by creating:
* Dictionary: assigns an integer ID to each unique word
* Corpus: uses those IDs to represent each document as a bag of words, i.e. a list of (word ID, count) pairs

In [77]:
dictionary = corpora.dictionary.Dictionary(book_names_split)
print("Dictionary : ", dictionary.token2id)
corpus = [dictionary.doc2bow(doc) for doc in book_names_split]
print("Corpus: ", corpus[0])

Dictionary :  {'a': 0, 'dive': 1, 'into': 2, 'maths': 3, 'advanced': 4, 'common': 5, 'distributions': 6, 'probability': 7, 'of': 8, 'the': 9, 'stochastic': 10, 'right?': 11, 'thanos': 12, 'was': 13, '!': 14, 'being': 15, 'stark': 16, 'characters:': 17, 'dr.': 18, 'marvel': 19, 'strange': 20, 'study': 21, 'vs': 22, '?': 23, 'is': 24, 'so': 25, 'why': 26}
Corpus:  [(0, 1), (1, 1), (2, 1), (3, 1)]
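
To make the two structures concrete, here is a plain-Python sketch of what a dictionary and a bag-of-words corpus contain. `build_token2id` and `to_bow` are hypothetical helpers written for this illustration, not gensim functions. The sketch assumes that unseen tokens are numbered in sorted order within each document, which is consistent with the numbering printed above.

```python
from collections import Counter

def build_token2id(docs):
    """Assign an integer ID to every unique token, document by document."""
    token2id = {}
    for doc in docs:
        # Assumption: unseen tokens are numbered in sorted order per document
        for tok in sorted(set(doc)):
            if tok not in token2id:
                token2id[tok] = len(token2id)
    return token2id

def to_bow(doc, token2id):
    """Return sorted (token_id, count) pairs; unknown tokens are skipped."""
    counts = Counter(tok for tok in doc if tok in token2id)
    return sorted((token2id[tok], n) for tok, n in counts.items())

# Three of the preprocessed titles from above
docs = [["a", "dive", "into", "maths"],
        ["advanced", "maths"],
        ["stark", "being", "stark", "!"]]
token2id = build_token2id(docs)
bows = [to_bow(doc, token2id) for doc in docs]
print(token2id)
print(bows)
```

A repeated word shows up as a count greater than one: `stark` appears twice in the third title, so its pair carries a count of 2. Word order is discarded entirely.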


## Now we train an LSI model. The corpus and dictionary created above can be reused by all basic Gensim models

In [78]:
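### Now let's fit our model
# num_topics is left at its default (200); with only 9 documents, far fewer topics carry information
model = LsiModel(corpus=corpus, id2word=dictionary)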

Main features:
* A topic is a weighted combination of words
* Higher weight (in magnitude) => larger contribution to the topic
* Negative weights are natural; they imply a negative correlation with the topic
* Not all topics are meaningful; the top-ranked topics tend to be the most meaningful

In [79]:
model.show_topics(num_topics=5, num_words=10)

[(0,
  '0.560*"strange" + 0.383*"dr." + 0.333*"stark" + 0.238*"of" + 0.236*"a" + 0.206*"vs" + 0.206*"characters:" + 0.206*"marvel" + 0.206*"study" + 0.177*"so"'),
 (1,
  '-0.481*"maths" + -0.348*"probability" + 0.326*"strange" + -0.313*"stark" + -0.255*"of" + -0.215*"a" + 0.209*"is" + 0.209*"why" + 0.209*"?" + 0.209*"so"'),
 (2,
  '0.556*"stark" + -0.465*"maths" + -0.367*"probability" + 0.228*"!" + 0.228*"being" + -0.182*"strange" + -0.150*"the" + -0.142*"stochastic" + -0.141*"why" + -0.141*"is"'),
 (3,
  '-0.398*"stark" + 0.332*"a" + -0.320*"!" + -0.320*"being" + -0.277*"probability" + 0.241*"vs" + 0.241*"marvel" + 0.241*"study" + 0.241*"characters:" + 0.184*"of"'),
 (4,
  '0.433*"probability" + -0.375*"dive" + -0.375*"into" + -0.345*"maths" + 0.270*"common" + 0.270*"distributions" + -0.256*"a" + 0.248*"of" + -0.133*"advanced" + 0.129*"the"')]
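
LSI is based on a truncated singular value decomposition (SVD) of the term-document matrix, which is where signed weights like the ones above come from. The following NumPy sketch illustrates the idea on a tiny hand-made matrix. It is not how gensim computes the model internally (gensim has its own scalable SVD implementation), and the matrix `A` and all variable names are made up for this example.

```python
import numpy as np

# Rows are terms, columns are documents (toy counts, assumed for illustration)
terms = ["maths", "probability", "stark", "strange"]
A = np.array([
    [1.0, 1.0, 0.0, 0.0],   # maths
    [0.0, 1.0, 0.0, 0.0],   # probability
    [0.0, 0.0, 2.0, 1.0],   # stark
    [0.0, 0.0, 0.0, 1.0],   # strange
])

# Thin SVD: A = U @ diag(S) @ Vt, singular values in decreasing order
U, S, Vt = np.linalg.svd(A, full_matrices=False)

k = 2  # number of topics to keep
U_k, S_k, Vt_k = U[:, :k], S[:k], Vt[:k, :]

# Each column of U_k is a topic: one signed weight per term
for t in range(k):
    weights = ", ".join(f"{w:+.3f}*{term}" for w, term in zip(U_k[:, t], terms))
    print(f"topic {t}: {weights}")

# Documents in topic space: project each column of A onto the kept topics
doc_topics = U_k.T @ A
print(doc_topics.round(3))
```

Flipping the sign of a column of `U_k` together with the matching row of `Vt_k` leaves the product unchanged, so the sign of a topic carries no meaning on its own; only relative signs within a topic do.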

## Now let's do some topic analysis

### Given a new book title,
* preprocess it in the same way as the training data
* create a bag-of-words (BoW) representation using the same dictionary as the training data

In [83]:
doc = "Elementary maths"
doc_preprocessed = [word.lower() for word in doc.split(" ")]
# 'elementary' is not in the dictionary, so doc2bow drops it; only 'maths' is counted
doc_bow = dictionary.doc2bow(doc_preprocessed)

In [84]:
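per_topic_match = model[doc_bow]
print("The per topic match of document: {}".format(per_topic_match))
# Note: this ranks by signed score, so it selects topic 0 (score 0.08) even though
# topics 1 and 2 have much larger scores in magnitude; the sign of an LSI topic is arbitrary
matched_topic = sorted(per_topic_match, key=lambda x: x[1], reverse=True)[0][0]
print("Max match with the following topic : {}".format(model.show_topic(matched_topic)))
# Topic 2 is printed for comparison
print(model.show_topic(2))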

The per topic match of document: [(0, 0.07998423196480817), (1, -0.4814256385830975), (2, -0.4654984031690677), (3, -0.10482334268432869), (4, -0.34526711553703254), (6, -0.21329975586071215), (7, -0.19778523223281436), (8, -0.10654996870877514)]
Max match with the following topic : [('strange', 0.5604572621811904), ('dr.', 0.38334018270966697), ('stark', 0.33326651589548706), ('of', 0.23801568143827062), ('a', 0.23637241806558176), ('vs', 0.2062231032381434), ('characters:', 0.2062231032381434), ('marvel', 0.20622310323814338), ('study', 0.20622310323814338), ('so', 0.1771170794715234)]
[('stark', 0.5559415579606246), ('maths', -0.46549840316906793), ('probability', -0.3673124601702952), ('!', 0.22799864355803423), ('being', 0.22799864355803418), ('strange', -0.18247849776964653), ('the', -0.15027891565685814), ('stochastic', -0.14171380857932003), ('why', -0.14121138430710156), ('is', -0.14121138430710153)]
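
Sorting by the signed score picks topic 0 (score 0.08) for "Elementary maths", although topics 1 and 2 have far larger scores in magnitude and both load heavily on "maths". Since the sign of an LSI topic is arbitrary, one possible alternative, sketched below, is to rank topics by absolute score. The scores are copied from the output above; `strongest_topic` is a hypothetical helper name.

```python
# Per-topic scores copied from the printed output above
per_topic_match = [
    (0, 0.07998423196480817), (1, -0.4814256385830975),
    (2, -0.4654984031690677), (3, -0.10482334268432869),
    (4, -0.34526711553703254), (6, -0.21329975586071215),
    (7, -0.19778523223281436), (8, -0.10654996870877514),
]

def strongest_topic(scores):
    """Return the (topic_id, score) pair with the largest absolute score."""
    return max(scores, key=lambda pair: abs(pair[1]))

signed_best = max(per_topic_match, key=lambda pair: pair[1])
abs_best = strongest_topic(per_topic_match)
print("Best by signed score:  ", signed_best)
print("Best by absolute score:", abs_best)
```

In the notebook cell, the same idea would amount to using `key=lambda x: abs(x[1])` in the call to `sorted`.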


## Now let's create a searchable index and query it with a new document (docA) to find the most similar existing document:
1. Create an index of the existing documents
2. Get the per-topic match of docA
3. Query the index with the per-topic match of docA from step 2; this returns a similarity score for each existing document

In [85]:
# Build a similarity index over the LSI representation of every training document
from gensim import similarities
index = similarities.MatrixSimilarity(model[corpus])

In [86]:
for book_name, match in zip(book_names, index[per_topic_match]):
    print(book_name, match)

A dive into maths 0.60588354
Advanced maths 0.8568487
Common probability distributions 9.313226e-09
The maths of probability 0.60588354
Stochastic probability maths 0.6996141
Was Thanos right? 0.0
Stark being stark ! -1.3591489e-08
A study of marvel characters: Stark vs Dr. Strange -3.434252e-09
Why is Dr. strange so strange ? -1.7462298e-09
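
`MatrixSimilarity` compares the query with each indexed document using cosine similarity, which is why the scores lie between -1 and 1 and why values such as `9.313226e-09` are effectively zero (floating-point noise). A minimal pure-Python sketch of the computation follows. The vectors are toy values, and `cosine` is a hypothetical helper, not part of gensim; returning 0.0 for an all-zero vector is a choice made for this sketch.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)

# Toy topic-space vectors (made up for illustration)
indexed_docs = {
    "doc about maths": [0.1, -0.9, -0.4],
    "doc about marvel": [0.8, 0.3, 0.1],
    "empty doc": [0.0, 0.0, 0.0],
}
query = [0.08, -0.48, -0.47]

# Rank indexed documents by similarity to the query, best match first
ranked = sorted(((cosine(query, vec), name) for name, vec in indexed_docs.items()),
                reverse=True)
for score, name in ranked:
    print(f"{score:+.3f}  {name}")
```

Sorting the scores, as done here, is usually more convenient than printing them in corpus order when the index holds more than a handful of documents.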
