# **Gensim = "Generate Similar"**
Gensim is a Python library for topic modelling, document indexing and similarity retrieval with large corpora.

**Features-**
*   All algorithms are memory-independent w.r.t. the corpus size (can process input larger than RAM, streamed, out-of-core).
*   Intuitive interfaces-
    *   Easy to plug in your own input corpus/datastream (simple streaming API)
    *   Easy to extend with other Vector Space algorithms (simple transformation API)
*   Efficient multicore implementations of popular algorithms, such as online LSI, LDA, Random Projections (RP), Hierarchical Dirichlet Process (HDP) or word2vec deep learning.
*   Distributed computing: can run LSA and LDA on a cluster of computers.
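
To illustrate the streaming idea, here is a minimal sketch in plain Python (no Gensim required). The class name `StreamedCorpus` and the one-document-per-line file layout are assumptions made for illustration. The point is only that a corpus can be any iterable that yields one document at a time, so the whole collection never has to sit in memory.

```python
import os
import tempfile

class StreamedCorpus:
    """Hypothetical corpus that yields one tokenized document per line of a text file."""

    def __init__(self, path):
        self.path = path

    def __iter__(self):
        # The file is re-opened on every pass, so the corpus can be iterated repeatedly
        # while only one line is held in memory at a time.
        with open(self.path, encoding="utf-8") as handle:
            for line in handle:
                yield line.lower().split()

# Write a tiny two-document corpus to a temporary file and stream it back.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False, encoding="utf-8") as tmp:
    tmp.write("The fox jumps over the dog\nThe dog is slow and lazy\n")

corpus = StreamedCorpus(tmp.name)
streamed_docs = [doc for doc in corpus]
os.remove(tmp.name)
```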

In [109]:
import gensim
from pprint import pprint
import spacy
from gensim import corpora

In [110]:
text_corpus = ["The fox jumps over the dog",
              "The fox is very clever and quick",
              "The dog is slow and lazy",
              "The cat is smarter than the fox and the dog",
              "Python is an excellent programming language",
              "Java and ruby are other programming languages",
              "Python and Java are very popular programming languages",
              "Python programmes are smaller than java programs"]

In [111]:
#pre-processing

nlp = spacy.load("en_core_web_sm")

def preprocess(txt_corpus):
  pp_corpus = []
  for line in txt_corpus:
    doc = nlp(line)
    #Removing punctuation and stop words
    pp_tokens = [token for token in doc if not (token.is_punct or token.is_stop)]
    #Lemmatization (spaCy v2 returns '-PRON-' as the lemma of pronouns, so fall back to the lowercased token text)
    pp_corpus.append([token.lemma_ if token.lemma_ != '-PRON-' else token.lower_ for token in pp_tokens])
  return pp_corpus

pp_corpus  = preprocess(text_corpus)
pp_corpus

[['fox', 'jump', 'dog'],
 ['fox', 'clever', 'quick'],
 ['dog', 'slow', 'lazy'],
 ['cat', 'smart', 'fox', 'dog'],
 ['Python', 'excellent', 'programming', 'language'],
 ['Java', 'ruby', 'programming', 'language'],
 ['Python', 'Java', 'popular', 'programming', 'language'],
 ['Python', 'programme', 'small', 'java', 'program']]

**Bag of Words (BOW)**

In [112]:
dictionary = corpora.Dictionary(pp_corpus)
bow_corpus = [dictionary.doc2bow(doc) for doc in pp_corpus]

In [113]:
print(dictionary.token2id)

{'dog': 0, 'fox': 1, 'jump': 2, 'clever': 3, 'quick': 4, 'lazy': 5, 'slow': 6, 'cat': 7, 'smart': 8, 'Python': 9, 'excellent': 10, 'language': 11, 'programming': 12, 'Java': 13, 'ruby': 14, 'popular': 15, 'java': 16, 'program': 17, 'programme': 18, 'small': 19}


In [114]:
#To save memory, Gensim omits all vector elements with value 0.0
bow_corpus

[[(0, 1), (1, 1), (2, 1)],
 [(1, 1), (3, 1), (4, 1)],
 [(0, 1), (5, 1), (6, 1)],
 [(0, 1), (1, 1), (7, 1), (8, 1)],
 [(9, 1), (10, 1), (11, 1), (12, 1)],
 [(11, 1), (12, 1), (13, 1), (14, 1)],
 [(9, 1), (11, 1), (12, 1), (13, 1), (15, 1)],
 [(9, 1), (16, 1), (17, 1), (18, 1), (19, 1)]]
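
The id assignment and the sparse `(token_id, count)` pairs shown above can be imitated in a few lines of plain Python. This is only a sketch: `build_token2id` and `doc2bow` are illustrative helpers, not Gensim's implementation, and the rule that new tokens in each document are numbered in sorted order is an assumption inferred from the `token2id` output above.

```python
from collections import Counter

def build_token2id(docs):
    """Assign integer ids to tokens; new tokens in each document are numbered in sorted order."""
    token2id = {}
    for doc in docs:
        for token in sorted(set(doc)):
            if token not in token2id:
                token2id[token] = len(token2id)
    return token2id

def doc2bow(doc, token2id):
    """Return sorted (token_id, count) pairs; tokens with a zero count are simply absent."""
    counts = Counter(token2id[token] for token in doc if token in token2id)
    return sorted(counts.items())

docs = [["fox", "jump", "dog"], ["fox", "clever", "quick"], ["dog", "slow", "lazy"]]
token2id = build_token2id(docs)
bow = [doc2bow(doc, token2id) for doc in docs]
```

Because absent tokens never appear in the list, the representation stays small even when the vocabulary is large.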

**TF-IDF**

Tf-idf stands for **term frequency-inverse document frequency**.
This weight is a statistical measure used to evaluate how important a word is to a document in a collection or corpus. The importance increases proportionally to the number of times a word appears in the document but is offset by the frequency of the word in the corpus.
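
As a worked example of this weighting, the sketch below computes TF-IDF by hand for the pre-processed corpus. It assumes the raw term count as term frequency, `log2(N / df)` as inverse document frequency, and scaling of each document vector to unit length. The helper name `tfidf` is illustrative.

```python
import math

docs = [
    ["fox", "jump", "dog"],
    ["fox", "clever", "quick"],
    ["dog", "slow", "lazy"],
    ["cat", "smart", "fox", "dog"],
    ["Python", "excellent", "programming", "language"],
    ["Java", "ruby", "programming", "language"],
    ["Python", "Java", "popular", "programming", "language"],
    ["Python", "programme", "small", "java", "program"],
]

def tfidf(doc, docs):
    """Weight each term by count * log2(N / document frequency), then scale to unit length."""
    n_docs = len(docs)
    weights = {}
    for term in set(doc):
        tf = doc.count(term)
        df = sum(1 for other in docs if term in other)
        weights[term] = tf * math.log2(n_docs / df)
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {term: w / norm for term, w in weights.items()} if norm else weights

weights = tfidf(docs[0], docs)
print({term: round(w, 4) for term, w in sorted(weights.items())})
# {'dog': 0.3924, 'fox': 0.3924, 'jump': 0.8319}
```

"jump" occurs in a single document, so it receives a much higher weight than "fox" and "dog", which each occur in three.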

Applications-

*   Information Retrieval
*   Keyword Extraction
*   Stop-word filtering, etc.

In [117]:
from  gensim import models
# Create the TF-IDF model
tfidf = models.TfidfModel(bow_corpus,normalize=True)  #training
tfidf

<gensim.models.tfidfmodel.TfidfModel at 0x7f39ee66f780>

In [116]:
corpus_tfidf = tfidf[bow_corpus] 

*Note-*

Calling model[corpus] only creates a wrapper around the old corpus document stream – actual conversions are done on-the-fly, during document iteration.
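
The lazy behaviour described in this note can be sketched with a small wrapper class. `LazyTransformed` and `double_counts` are hypothetical names used only for illustration; they show the pattern, not Gensim's internals.

```python
class LazyTransformed:
    """Hypothetical wrapper: applies `transform` to each document only when iterated."""

    def __init__(self, transform, corpus):
        self.transform = transform
        self.corpus = corpus

    def __iter__(self):
        for doc in self.corpus:
            yield self.transform(doc)

calls = []

def double_counts(bow):
    calls.append(bow)  # record that the conversion actually ran
    return [(token_id, 2 * count) for token_id, count in bow]

wrapped = LazyTransformed(double_counts, [[(0, 1)], [(1, 3)]])
calls_before_iteration = len(calls)   # nothing has been converted yet
converted = list(wrapped)             # conversion happens here, one document at a time
```

One consequence of this design is that every new pass over the wrapper repeats the conversion, so a corpus that is iterated many times may be worth materializing once.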

In [18]:
for doc in corpus_tfidf:
    print(doc)

[(0, 0.39239043318859274), (1, 0.39239043318859274), (2, 0.8319011334792957)]
[(1, 0.31639356562839216), (3, 0.6707813025230176), (4, 0.6707813025230176)]
[(0, 0.31639356562839216), (5, 0.6707813025230176), (6, 0.6707813025230176)]
[(0, 0.30165504678093485), (1, 0.30165504678093485), (7, 0.6395343874660627), (8, 0.6395343874660627)]
[(9, 0.36527597081532565), (10, 0.7744161642390763), (11, 0.36527597081532565), (12, 0.36527597081532565)]
[(11, 0.3431499531386567), (12, 0.3431499531386567), (13, 0.4850047483738612), (14, 0.7275071225607918)]
[(9, 0.3245721388452534), (11, 0.3245721388452534), (12, 0.3245721388452534), (13, 0.4587470494748973), (15, 0.688120574212346)]
[(9, 0.22954235351308377), (16, 0.4866493367774363), (17, 0.4866493367774363), (18, 0.4866493367774363), (19, 0.4866493367774363)]


# **Topic Models-**

In machine learning and natural language processing, a topic model is a type of statistical model for discovering the abstract "topics" that occur in a collection of documents.

**Latent Dirichlet allocation (LDA)**

* In NLP, LDA is a generative statistical model that allows sets of observations to be explained by unobserved groups that explain why some parts of the data are similar.
* For example, if observations are words collected into documents, it posits that each document is a mixture of a small number of topics and that each word's presence is attributable to one of the document's topics. 
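
The generative story in the bullets above can be sketched as a small sampler: draw a topic mixture for the document, then draw a topic and a word for every position. The two topics and their probabilities are invented for illustration, and `sample_document` is a hypothetical helper. Fitting LDA means inferring such topics from data, which this sketch does not attempt.

```python
import random

def sample_document(topics, alpha, length, rng):
    """Draw one document: pick a topic mixture, then a topic and a word for every position."""
    # A Dirichlet draw obtained by normalising independent Gamma variates.
    gammas = [rng.gammavariate(alpha, 1.0) for _ in topics]
    total = sum(gammas)
    mixture = [g / total for g in gammas]
    words = []
    for _ in range(length):
        topic = rng.choices(range(len(topics)), weights=mixture)[0]
        vocab, probs = zip(*topics[topic].items())
        words.append(rng.choices(vocab, weights=probs)[0])
    return mixture, words

# Two hand-written topics (word -> probability); the numbers are made up.
topics = [
    {"fox": 0.5, "dog": 0.4, "cat": 0.1},
    {"Python": 0.5, "Java": 0.3, "language": 0.2},
]
rng = random.Random(0)
mixture, words = sample_document(topics, alpha=0.5, length=8, rng=rng)
```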

In [None]:
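# LDA is normally trained on raw counts (bow_corpus); the TF-IDF corpus is kept here so the outputs below still apply
lda_model = gensim.models.LdaModel(corpus_tfidf,
                                   num_topics=2,
                                   id2word=dictionary,     # pass the Dictionary itself; dictionary.id2token is filled lazily and may be empty
                                   passes=50,
                                   per_word_topics=True)   # boolean flag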

In [146]:
top_topics = lda_model.top_topics(corpus_tfidf ,topn =5)

Topic Coherence-

Topic Coherence is a measure used to evaluate topic models.
Each such generated topic consists of words, and the topic coherence is applied to the top N words from the topic. It is defined as the average / median of the pairwise word-similarity scores of the words in the topic (e.g. PMI).
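
A pairwise coherence score of this kind can be sketched from document co-occurrence counts. The version below is a UMass-style score, averaging `log((D(w_i, w_j) + 1) / D(w_j))` over word pairs, where `D` counts the documents containing the given words. It is an illustrative assumption rather than a reproduction of Gensim's coherence pipeline, and it expects every word to occur in at least one document.

```python
import math

def umass_style_coherence(top_words, docs):
    """Average log((D(w_i, w_j) + 1) / D(w_j)) over ordered pairs of top words (j before i)."""
    doc_sets = [set(doc) for doc in docs]

    def doc_freq(*words):
        return sum(1 for doc in doc_sets if all(w in doc for w in words))

    scores = []
    for i in range(1, len(top_words)):
        for j in range(i):
            joint = doc_freq(top_words[i], top_words[j])
            scores.append(math.log((joint + 1) / doc_freq(top_words[j])))
    return sum(scores) / len(scores)

docs = [
    ["fox", "jump", "dog"],
    ["fox", "clever", "quick"],
    ["dog", "slow", "lazy"],
    ["cat", "smart", "fox", "dog"],
    ["Python", "excellent", "programming", "language"],
]
related = umass_style_coherence(["fox", "dog"], docs)       # words that co-occur
unrelated = umass_style_coherence(["fox", "Python"], docs)  # words that never co-occur
```

Words that appear together score higher (closer to zero) than words that never share a document.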

In [151]:
pprint(top_topics)

[([(0.07817043, 'programming'),
   (0.07817026, 'language'),
   (0.0735714, 'Java'),
   (0.07235142, 'Python'),
   (0.06485425, 'excellent')],
  -3.1855571233754922),
 ([(0.09691611, 'fox'),
   (0.08535472, 'jump'),
   (0.085127614, 'dog'),
   (0.074984066, 'quick'),
   (0.07498405, 'clever')],
  -10.810484484839957)]


In [148]:
# Average topic coherence is the sum of topic coherences of all topics, divided by the number of topics.
avg_topic_coherence = sum(t[1] for t in top_topics) / len(top_topics)
print('Average topic coherence: %.4f.' % avg_topic_coherence)

Average topic coherence: -6.9980.


**Latent semantic indexing (LSI)**

*   LSI is an indexing and retrieval method that uses SVD to identify patterns in the relationships between the terms and concepts contained in an unstructured collection of text.

* LSI is based on the principle that words that are used in the same contexts tend to have similar meanings.

* A key feature of LSI is its ability to extract the conceptual content of a body of text by establishing associations between those terms that occur in similar contexts.
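
To give a feel for what the SVD extracts, the sketch below finds only the dominant singular direction of a tiny document-term matrix by power iteration in plain Python. The toy matrix and the helper name `top_singular_direction` are assumptions for illustration; `LsiModel` computes a full truncated SVD with a more scalable algorithm.

```python
import math

def top_singular_direction(rows, iterations=100):
    """Power iteration on the document-by-document Gram matrix of a small count matrix.

    Returns the largest singular value and each document's loading on that direction.
    """
    n = len(rows)
    gram = [[sum(a * b for a, b in zip(rows[i], rows[j])) for j in range(n)] for i in range(n)]
    vec = [1.0] * n
    for _ in range(iterations):
        vec = [sum(gram[i][j] * vec[j] for j in range(n)) for i in range(n)]
        norm = math.sqrt(sum(v * v for v in vec))
        vec = [v / norm for v in vec]
    # The Rayleigh quotient gives the top eigenvalue of the Gram matrix, i.e. sigma squared.
    gv = [sum(gram[i][j] * vec[j] for j in range(n)) for i in range(n)]
    sigma = math.sqrt(sum(v * g for v, g in zip(vec, gv)))
    return sigma, vec

# Rows are documents; columns are counts of the terms fox, dog, python, java.
doc_term = [
    [1, 1, 0, 0],
    [1, 1, 0, 0],
    [0, 0, 1, 1],
    [0, 0, 1, 0],
]
sigma, loadings = top_singular_direction(doc_term)
```

The two animal documents load equally on the first direction while the two programming documents load close to zero, mirroring how the LSI topics separate the two themes.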

In [130]:
from gensim.models import LsiModel

#Here we transformed our Tf-Idf corpus via LSI into a latent 3-D space (3-D because we set num_topics=3)
lsi_model = LsiModel(corpus_tfidf,id2word = dictionary, num_topics =3)  # train model

In [131]:
vector = lsi_model[corpus_tfidf[5]]  # apply model to tfidf document
vector

[(0, 0.7452348915110869), (2, -0.24031610763830696)]

In [134]:
lsi_model.print_topics(3)

[(0,
  '0.449*"programming" + 0.449*"language" + 0.429*"Java" + 0.327*"popular" + 0.322*"Python" + 0.315*"ruby" + 0.308*"excellent" + 0.047*"program" + 0.047*"java" + 0.047*"programme"'),
 (1,
  '0.459*"fox" + 0.459*"dog" + 0.444*"jump" + 0.322*"cat" + 0.322*"smart" + 0.208*"slow" + 0.208*"lazy" + 0.208*"quick" + 0.208*"clever" + 0.000*"Java"'),
 (2,
  '0.468*"small" + 0.468*"programme" + 0.468*"java" + 0.468*"program" + 0.238*"Python" + -0.174*"ruby" + -0.142*"Java" + 0.075*"excellent" + -0.065*"language" + -0.065*"programming"')]

In [133]:
corpus_lsi = lsi_model[corpus_tfidf]  # create a double wrapper over the original corpus: bow->tfidf->fold-in-lsi
for doc, as_text in zip(corpus_lsi, text_corpus):
    print(doc, as_text)

[(1, 0.729605340630538)] The fox jumps over the dog
[(1, 0.42469706069369845)] The fox is very clever and quick
[(1, 0.4246970606936987)] The dog is slow and lazy
[(1, 0.689295072973576)] The cat is smarter than the fox and the dog
[(0, 0.6839200687841975), (2, 0.09779960609472999)] Python is an excellent programming language
[(0, 0.7452348915110869), (2, -0.24031610763830696)] Java and ruby are other programming languages
[(0, 0.8174724665245316), (2, -0.05726191138435917)] Python and Java are very popular programming languages
[(0, 0.16458343363665945), (2, 0.9661657160778017)] Python programmes are smaller than java programs


# **Word2Vec Model-**

In [30]:
from gensim.models import Word2Vec

In [93]:
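# Download the GloVe vectors into the current directory (a second positional argument would be treated as another URL)
!wget 'http://nlp.stanford.edu/data/glove.6B.zip'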

--2020-07-21 14:45:37--  http://nlp.stanford.edu/data/glove.6B.zip
Resolving nlp.stanford.edu (nlp.stanford.edu)... 171.64.67.140
Connecting to nlp.stanford.edu (nlp.stanford.edu)|171.64.67.140|:80... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://nlp.stanford.edu/data/glove.6B.zip [following]
--2020-07-21 14:45:37--  https://nlp.stanford.edu/data/glove.6B.zip
Connecting to nlp.stanford.edu (nlp.stanford.edu)|171.64.67.140|:443... connected.
HTTP request sent, awaiting response... 301 Moved Permanently
Location: http://downloads.cs.stanford.edu/nlp/data/glove.6B.zip [following]
--2020-07-21 14:45:38--  http://downloads.cs.stanford.edu/nlp/data/glove.6B.zip
Resolving downloads.cs.stanford.edu (downloads.cs.stanford.edu)... 171.64.64.22
Connecting to downloads.cs.stanford.edu (downloads.cs.stanford.edu)|171.64.64.22|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 862182613 (822M) [application/zip]
Saving to: ‘glove.6B.zip’


2020-0

In [97]:
import zipfile
# The context manager closes the archive even if extraction fails
with zipfile.ZipFile("/content/glove.6B.zip", 'r') as zip_ref:
    zip_ref.extractall("drive/My Drive/")

In [98]:
from gensim.test.utils import get_tmpfile
from gensim.models import KeyedVectors
from gensim.scripts.glove2word2vec import glove2word2vec

# datapath() is meant for Gensim's bundled test data, so the extracted file is referenced by its plain path
glove_file = '/content/drive/My Drive/glove.6B.100d.txt'
word2vec_glove_file = get_tmpfile("glove.6B.100d.word2vec.txt")
glove2word2vec(glove_file, word2vec_glove_file)


(400000, 100)

In [None]:
model = KeyedVectors.load_word2vec_format(word2vec_glove_file)

In [108]:
model.most_similar('banana',topn=7)


[('coconut', 0.7097253799438477),
 ('mango', 0.7054824233055115),
 ('bananas', 0.6887733936309814),
 ('potato', 0.6629636287689209),
 ('pineapple', 0.6534532904624939),
 ('fruit', 0.6519855260848999),
 ('peanut', 0.6420576572418213)]
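
The scores above are cosine similarities between word vectors. As a minimal sketch of that ranking, the code below uses invented 3-dimensional vectors; `cosine` and the toy `most_similar` are illustrative helpers, not the `KeyedVectors` method.

```python
import math

def cosine(u, v):
    """Cosine of the angle between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def most_similar(word, vectors, topn=2):
    """Rank every other word by cosine similarity to `word`."""
    ranked = sorted(
        ((other, cosine(vectors[word], vec)) for other, vec in vectors.items() if other != word),
        key=lambda pair: pair[1],
        reverse=True,
    )
    return ranked[:topn]

# Made-up 3-dimensional vectors, purely for illustration.
toy_vectors = {
    "banana": [0.9, 0.1, 0.0],
    "mango": [0.8, 0.2, 0.1],
    "potato": [0.5, 0.5, 0.1],
    "laptop": [0.0, 0.1, 0.9],
}
neighbours = most_similar("banana", toy_vectors)
```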

In [107]:
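# `negative` expects a list of words; in Gensim 3.x a bare string is iterated character by character
model.most_similar(negative=['dog'], topn=4)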


[('aquaculturists', 0.5666294097900391),
 ('oakleys', 0.5649899244308472),
 ('http://www.opel.com', 0.5606860518455505),
 ('ricefields', 0.5550658702850342)]

In [101]:
result = model.most_similar(positive=['woman', 'king'], negative=['man'])
print("{}: {:.4f}".format(*result[0]))

queen: 0.7699



In [103]:
def analogy(x1, x2, y1):
    result = model.most_similar(positive=[y1, x2], negative=[x1])
    return result[0][0]

analogy('japan', 'japanese', 'australia')


'australian'
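
The analogy query amounts to vector arithmetic: add the positive vectors, subtract the negative one, and look for the nearest remaining word. The sketch below shows this with invented vectors. It works on raw vectors for simplicity, whereas Gensim normalizes them first, so treat it as an approximation of the idea rather than of the library's exact computation.

```python
import math

def cosine(u, v):
    """Cosine of the angle between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def toy_analogy(x1, x2, y1, vectors):
    """Return the word closest to vec(x2) - vec(x1) + vec(y1), excluding the query words."""
    target = [b - a + c for a, b, c in zip(vectors[x1], vectors[x2], vectors[y1])]
    candidates = [w for w in vectors if w not in {x1, x2, y1}]
    return max(candidates, key=lambda w: cosine(vectors[w], target))

# Made-up vectors: the second coordinate loosely encodes gender, the third royalty.
toy_vectors = {
    "man": [1.0, 0.0, 0.2],
    "woman": [1.0, 1.0, 0.2],
    "king": [1.0, 0.0, 1.0],
    "queen": [1.0, 1.0, 1.0],
    "apple": [0.1, 0.2, 0.1],
}
answer = toy_analogy("man", "woman", "king", toy_vectors)
```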

In [104]:
print(model.doesnt_match("breakfast cereal dinner lunch".split()))

cereal
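
One way to picture the odd-one-out query is to average the vectors of all the words and pick the word least similar to that average. The sketch below does this with invented 2-dimensional vectors; `odd_one_out` is a hypothetical helper, and Gensim's own computation works on normalized vectors.

```python
import math

def cosine(u, v):
    """Cosine of the angle between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def odd_one_out(words, vectors):
    """Return the word whose vector is least similar to the mean of all the given vectors."""
    dims = len(vectors[words[0]])
    mean = [sum(vectors[w][d] for w in words) / len(words) for d in range(dims)]
    return min(words, key=lambda w: cosine(vectors[w], mean))

# Made-up vectors: the three meals point one way, "cereal" another.
toy_vectors = {
    "breakfast": [0.9, 0.1],
    "lunch": [0.8, 0.2],
    "dinner": [0.85, 0.15],
    "cereal": [0.2, 0.9],
}
outlier = odd_one_out(["breakfast", "cereal", "dinner", "lunch"], toy_vectors)
```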



In [56]:
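# The adjacent string literals are joined without a separating space ("...applications.A survey..."),
# which is why spaCy leaves "EPS.Relation" unsplit in the sentence list below.
corpus = ( "Human machine interface for lab abc computer applications."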
                "A survey of user opinion of computer system response time."
                "The EPS user interface management system."
                "System and human system engineering testing of EPS."
                "Relation of user perceived response time to error measurement."
                "The intersection graph of paths in trees."
                "Graph minors IV Widths of trees and well quasi ordering."
                "Graph minors a survey")

doc1 = nlp(corpus)

In [57]:
sentences = [str(sentence) for sentence in doc1.sents]
sentences

['Human machine interface for lab abc computer applications.',
 'A survey of user opinion of computer system response time.',
 'The EPS user interface management system.',
 'System and human system engineering testing of EPS.Relation of user perceived response time to error measurement.',
 'The intersection graph of paths in trees.',
 'Graph minors IV Widths of trees and well quasi ordering.',
 'Graph minors a survey']

In [58]:
tokenized_sentences = preprocess(sentences)
print(tokenized_sentences)

[['human', 'machine', 'interface', 'lab', 'abc', 'computer', 'application'], ['survey', 'user', 'opinion', 'computer', 'system', 'response', 'time'], ['EPS', 'user', 'interface', 'management', 'system'], ['system', 'human', 'system', 'engineering', 'testing', 'eps.relation', 'user', 'perceive', 'response', 'time', 'error', 'measurement'], ['intersection', 'graph', 'path', 'tree'], ['Graph', 'minor', 'IV', 'Widths', 'tree', 'quasi', 'ordering'], ['Graph', 'minor', 'survey']]


In [76]:
# instantiating and training the Word2Vec model
# (Gensim 3.x argument names; Gensim 4 renamed `size` to `vector_size` and `iter` to `epochs`)
model = gensim.models.Word2Vec(tokenized_sentences, min_count=1, window=5, size=100, compute_loss=True, iter=15)

# getting the training loss value
training_loss = model.get_latest_training_loss()
training_loss

546.4014282226562

In [73]:
word_vec = model.wv['computer']
word_vec

array([-4.1612535e-04, -2.2380303e-03,  2.2937683e-03, -3.1812473e-03,
        2.8154387e-03, -3.7848067e-03, -1.4736252e-04, -7.4351655e-04,
        2.5403826e-03, -3.5814167e-04,  4.4296319e-03,  6.0696085e-04,
        1.7392690e-03,  1.1080391e-03, -2.8571922e-03,  4.3057656e-04,
        7.7703310e-04,  4.4810087e-03,  4.1730711e-03, -4.7920914e-03,
       -4.2877247e-04,  2.8659985e-03,  8.2127699e-05,  9.3507167e-04,
        2.7063708e-03,  4.6429560e-03,  1.8652041e-03,  3.3616212e-03,
       -3.5358602e-03, -4.1404823e-03, -3.0854866e-03,  9.0099184e-04,
       -4.7521577e-03, -3.2824520e-03,  6.3185004e-04,  2.6604035e-03,
        4.3843249e-03, -6.5386965e-04,  4.4207289e-03, -2.4266243e-03,
        5.6468189e-04, -3.3247694e-03,  3.8855260e-03, -2.6469135e-03,
       -2.4846315e-03,  4.8967609e-03, -1.3287842e-03,  2.2864668e-03,
       -3.8460798e-03, -8.0878247e-04, -6.9844187e-04,  3.8065044e-03,
       -1.0497979e-04, -2.7236375e-03, -1.2956148e-03,  4.7890400e-03,
      

In [77]:
model.save('./w2v')
model = gensim.models.Word2Vec.load('./w2v')


In [78]:
more_sentences = [['how', 'are' ,'you'] , ['have','a','nice','day']]
model.build_vocab(more_sentences, update=True)
model.train(more_sentences, total_examples=model.corpus_count, epochs=5)

(4, 35)

# **Text Summarization**

In [None]:
from gensim.summarization import summarize

* This module automatically summarizes the given text by extracting one or more important sentences from it.
In a similar way, it can also extract keywords.

* The summarizer is based on the [“TextRank” algorithm by Mihalcea et al.](https://web.eecs.umich.edu/~mihalcea/papers/mihalcea.emnlp04.pdf)


***Note-***
Gensim’s summarization only works for English for now, because the pre-processing (stop-word removal and stemming) is language-dependent. The `gensim.summarization` module was removed in Gensim 4.0, so the cells below need a Gensim 3.x installation.
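
The TextRank idea can be sketched in plain Python: build a graph whose nodes are sentences, weight the edges by word overlap, and rank the nodes with a PageRank-style iteration. The similarity formula follows the TextRank paper; the function names, damping factor and iteration count are assumptions, and Gensim's implementation differs in its details.

```python
import math
import re

def sentence_similarity(a, b):
    """Word-overlap similarity between two sets of words, normalised by their sizes."""
    if len(a) < 2 or len(b) < 2:
        return 0.0  # log(1) == 0 would make the normaliser vanish
    return len(a & b) / (math.log(len(a)) + math.log(len(b)))

def textrank(sentences, damping=0.85, iterations=50):
    """Score sentences by a weighted PageRank over the sentence-similarity graph."""
    words = [set(re.findall(r"\w+", s.lower())) for s in sentences]
    n = len(sentences)
    weights = [[sentence_similarity(words[i], words[j]) if i != j else 0.0
                for j in range(n)] for i in range(n)]
    out_sum = [sum(row) for row in weights]
    scores = [1.0] * n
    for _ in range(iterations):
        scores = [
            (1 - damping) + damping * sum(
                weights[j][i] / out_sum[j] * scores[j]
                for j in range(n) if out_sum[j] > 0
            )
            for i in range(n)
        ]
    return scores

sentences = [
    "The cat sat on the mat",
    "The cat lay on the mat",
    "Stocks fell sharply today",
]
scores = textrank(sentences)
```

The two overlapping sentences reinforce each other and end up with higher scores than the unrelated one. A summarizer would then keep the top-scoring sentences.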





In [28]:
text = (
    "Thomas A. Anderson is a man living two lives. By day he is an "
    "average computer programmer and by night a hacker known as "
    "Neo. Neo has always questioned his reality, but the truth is "
    "far beyond his imagination. Neo finds himself targeted by the "
    "police when he is contacted by Morpheus, a legendary computer "
    "hacker branded a terrorist by the government. Morpheus awakens "
    "Neo to the real world, a ravaged wasteland where most of "
    "humanity have been captured by a race of machines that live "
    "off of the humans' body heat and electrochemical energy and "
    "who imprison their minds within an artificial reality known as "
    "the Matrix. As a rebel against the machines, Neo must return to "
    "the Matrix and confront the agents: super-powerful computer "
    "programs devoted to snuffing out Neo and the entire human "
    "rebellion. "
)

In [None]:
pprint(summarize(text, split=True , ratio = .4))

['By day he is an average computer programmer and by night a hacker known as '
 'Neo. Neo has always questioned his reality, but the truth is far beyond his '
 'imagination.',
 'Morpheus awakens Neo to the real world, a ravaged wasteland where most of '
 'humanity have been captured by a race of machines that live off of the '
 "humans' body heat and electrochemical energy and who imprison their minds "
 'within an artificial reality known as the Matrix.']


In [None]:
pprint(summarize(text, split=False , word_count=100))

('By day he is an average computer programmer and by night a hacker known as '
 'Neo. Neo has always questioned his reality, but the truth is far beyond his '
 'imagination.\n'
 'Morpheus awakens Neo to the real world, a ravaged wasteland where most of '
 'humanity have been captured by a race of machines that live off of the '
 "humans' body heat and electrochemical energy and who imprison their minds "
 'within an artificial reality known as the Matrix.\n'
 'As a rebel against the machines, Neo must return to the Matrix and confront '
 'the agents: super-powerful computer programs devoted to snuffing out Neo and '
 'the entire human rebellion.')


In [None]:
from gensim.summarization import keywords
print(keywords(text))

humanity
human
neo
humans body
super
hacker
reality
