# Using the Gensim package in Python for topic modeling

First, a short description of topic modeling and why it is needed. The amount of data available on the internet grows every second, but most of it is irrelevant to any particular task. We therefore need a tool that helps us identify the main themes or topics of a document before reading it, and the technique must do so without being told the topics in advance. We may also be interested in how the themes of a document change over time. LDA (Latent Dirichlet Allocation) is a generative model that serves this purpose. Here we look at its application: for a given corpus, we will see how to extract the hidden topics with the help of the Gensim package in Python.
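
To make "generative model" a little more concrete, here is a minimal sketch of the story LDA assumes about how a document is produced: draw a topic mixture for the document, then for each word pick a topic from that mixture and a word from that topic. The topic names, word probabilities, and the helper names `sample_dirichlet` and `generate_document` are made up for illustration; this is not Gensim code, and fitting LDA means running this story in reverse.

```python
import random

def sample_dirichlet(alpha, rng):
    """Draw one Dirichlet sample by normalizing independent Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]

def generate_document(topic_word, alpha, length, rng):
    """Generate a toy document following the LDA generative story."""
    theta = sample_dirichlet(alpha, rng)      # per-document topic mixture
    topics = list(topic_word)
    words = []
    for _ in range(length):
        # Pick a topic for this word position, then a word from that topic.
        topic = rng.choices(topics, weights=theta)[0]
        vocab = list(topic_word[topic])
        probs = list(topic_word[topic].values())
        words.append(rng.choices(vocab, weights=probs)[0])
    return theta, words

# Hypothetical topics with hand-picked word probabilities.
toy_topics = {
    "biology": {"egg": 0.4, "host": 0.4, "larva": 0.2},
    "film": {"script": 0.5, "scene": 0.3, "alien": 0.2},
}
rng = random.Random(0)
theta, doc = generate_document(toy_topics, alpha=[0.5, 0.5], length=8, rng=rng)
print(theta)
print(doc)
```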

Prerequisites: you should have nltk, spaCy, gensim, and pyLDAvis installed, and the NLTK stopwords and spaCy's 'en' English model downloaded on your system.

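Run these in a terminal (if they don't work, try another installation method):
"pip3 install spacy" , "pip3 install gensim" , "pip3 install pyLDAvis" , "python3 -m spacy download en"
(On spaCy 3.x the 'en' shortcut no longer exists; download a full model name such as en_core_web_sm instead.)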

In [23]:
#Run this in python interpreter
import nltk
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to
[nltk_data]     /home/s18210071/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

Import all the needed packages:

In [24]:
from pprint import pprint
import pandas as pd
import nltk
import numpy as np
import re
from nltk.tokenize import sent_tokenize

# For Gensim
import gensim
import gensim.corpora as corpora
from gensim.utils import simple_preprocess
from gensim.models import CoherenceModel

# spacy for lemmatization
import spacy as spc

# Plotting tools
import pyLDAvis
import pyLDAvis.gensim  # newer pyLDAvis releases (3.x) rename this module to pyLDAvis.gensim_models
import matplotlib.pyplot as plt


# Enable logging for gensim - optional
import logging
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.ERROR)

import warnings
warnings.filterwarnings("ignore",category=DeprecationWarning)


Import the Stopwords: 

In [25]:
from nltk.corpus import stopwords
stop_words_list = stopwords.words('english')
# Extend the stop-word list if needed:
# stop_words_list.extend(['word1', 'word2', 'word3'])

Define a function to tokenize sentences into lists of words, ignoring punctuation:

In [26]:
def sent_word_converter(sentences):
    for s in sentences:
        # deacc=True strips accent marks; punctuation is already dropped by the tokenizer
        yield gensim.utils.simple_preprocess(str(s), deacc=True)
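
As a rough picture of what this tokenization step does, the sketch below lowercases a sentence, keeps only alphabetic runs, and drops tokens that are very short or very long. It is an approximation written with the standard library (the name `rough_preprocess` is hypothetical), not Gensim's actual implementation, and it ignores accent handling.

```python
import re

def rough_preprocess(text, min_len=2, max_len=15):
    """Lowercase, keep alphabetic runs, and drop very short or very long tokens."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [t for t in tokens if min_len <= len(t) <= max_len]

print(rough_preprocess("The Alien's life-cycle is well documented!"))
# ['the', 'alien', 'life', 'cycle', 'is', 'well', 'documented']
```

Punctuation disappears because it never matches the pattern, and the stray "s" left over from "Alien's" is removed by the minimum-length filter.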

Define function to remove stopwords:

In [27]:
def delete_stopwords(words):
    return [[word for word in simple_preprocess(str(element)) if word not in stop_words_list] for element in words]
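
The filtering idea can be tried on toy data without NLTK. The sketch below uses a small, made-up stop-word set (`toy_stop_words`) and a hypothetical helper `drop_stop_words`; the real list from NLTK is much longer.

```python
# A tiny stand-in for the NLTK stop-word list (assumption for illustration).
toy_stop_words = {"the", "is", "a", "of", "in"}

def drop_stop_words(token_lists, stop_words):
    """Return each token list with the stop words removed."""
    return [[tok for tok in tokens if tok not in stop_words]
            for tokens in token_lists]

docs = [["the", "alien", "is", "hostile"], ["a", "queen", "lays", "eggs"]]
print(drop_stop_words(docs, toy_stop_words))
# [['alien', 'hostile'], ['queen', 'lays', 'eggs']]
```

A set is used here because membership tests on a set are constant time on average; with a large corpus, converting `stop_words_list` to a set may speed up this step.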

Define a function to make bigrams:

In [28]:
def bigram_maker(words):
    return [bigram_mod[element] for element in words]

Define function to make Trigrams:

In [29]:
def trigram_maker(words):
    return [trigram_mod[bigram_mod[element]] for element in words]

Define function for the lemmatization task:

In [30]:
def lemmatize(text, allowed_postags=['NOUN', 'ADJ', 'VERB', 'ADV']):
    """Lemmatize each token list, keeping only tokens whose POS tag is allowed (tags: https://spacy.io/api/annotation)."""
    text_out = []
    for sen in text:
        doct = nlp(" ".join(sen)) 
        text_out.append([token.lemma_ for token in doct if token.pos_ in allowed_postags])
    return text_out

Import the dataset or document (here we are importing a science fiction book's dataset):

In [31]:
filename = "aliensfaq.txt"
# The with-block closes the file automatically
with open(filename, 'r') as f:
    file1 = f.read()

Divide the whole file into a list of sentences and remove the non-English words:

In [32]:
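# Needs the NLTK 'punkt' and 'words' resources: nltk.download('punkt'); nltk.download('words')
# (newer NLTK releases may ask for 'punkt_tab' instead of 'punkt')
list_sent=nltk.sent_tokenize(file1)
english_words=set(nltk.corpus.words.words())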
english_inputs=[]
for i in list_sent:
    j=" ".join(w for w in nltk.wordpunct_tokenize(i) if w.lower() in english_words or not w.isalpha())
    english_inputs.append(j)
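
To see what this filter keeps and drops, here is a standard-library sketch of the same logic. The regular expression mirrors the word/punctuation split, the vocabulary is a tiny made-up set rather than the NLTK word list, and `keep_known_words` is a hypothetical name.

```python
import re

def keep_known_words(sentence, vocabulary):
    """Keep tokens that are in the vocabulary, plus any non-alphabetic tokens."""
    # Split into runs of word characters and runs of punctuation.
    tokens = re.findall(r"\w+|[^\w\s]+", sentence)
    kept = [t for t in tokens if t.lower() in vocabulary or not t.isalpha()]
    return " ".join(kept)

toy_vocabulary = {"the", "alien", "has", "eggs"}
print(keep_known_words("The alien has 3 xqzt eggs.", toy_vocabulary))
# The alien has 3 eggs .
```

Note that numbers and punctuation survive this step because of the `not t.isalpha()` branch; only alphabetic tokens missing from the vocabulary are removed.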

With the help of regular expressions, let's clean the data first:

In [33]:
# Remove tokens that contain @ (e.g. e-mail addresses):
english_inputs=[re.sub(r'\S*@\S*\s?', '', sent) for sent in english_inputs]
# Collapse newlines and repeated whitespace into a single space:
english_inputs=[re.sub(r'\s+', ' ', sent) for sent in english_inputs]
# Remove distracting single quotes:
english_inputs=[re.sub("'", "", sent) for sent in english_inputs]
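
The effect of the three substitutions is easiest to see on a single string. The sketch below applies them in the same order to a made-up sentence; `clean_sentence` and the e-mail address are hypothetical.

```python
import re

def clean_sentence(sent):
    """Apply the three cleaning substitutions in order."""
    sent = re.sub(r"\S*@\S*\s?", "", sent)   # drop tokens containing @
    sent = re.sub(r"\s+", " ", sent)         # collapse whitespace and newlines
    sent = re.sub(r"'", "", sent)            # drop single quotes
    return sent

sample = "Mail bob@example.com\nabout the alien's   life cycle"
print(clean_sentence(sample))
# Mail about the aliens life cycle
```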

Let's break this list of sentences into lists of words, then build the bigram and trigram models, and remove stopwords:

In [34]:
input_words=list(sent_word_converter(english_inputs))

# Build the bigram and trigram models
bigrams = gensim.models.Phrases(input_words, min_count=5, threshold=100) # higher threshold fewer phrases.
trigrams = gensim.models.Phrases(bigrams[input_words], threshold=100)  

bigram_mod = gensim.models.phrases.Phraser(bigrams)
trigram_mod = gensim.models.phrases.Phraser(trigrams)
non_stopwords = delete_stopwords(input_words)
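
The `min_count` and `threshold` arguments decide which word pairs get merged into a phrase. The sketch below shows the kind of score involved, assuming the formula from Mikolov et al. that Gensim's default scorer is, to my understanding, based on; the counts, the vocabulary sizes, and the name `phrase_score` are made up for illustration.

```python
def phrase_score(count_a, count_b, count_ab, vocab_size, min_count=5):
    """Collocation score: high when a and b co-occur far more often than chance."""
    return (count_ab - min_count) * vocab_size / (count_a * count_b)

threshold = 100

# Two words that almost always appear together score high.
tight_pair = phrase_score(count_a=10, count_b=10, count_ab=10, vocab_size=3000)
# Two frequent words that only sometimes co-occur score low.
loose_pair = phrase_score(count_a=20, count_b=25, count_ab=15, vocab_size=2000)

print(tight_pair, tight_pair > threshold)   # 150.0 True
print(loose_pair, loose_pair > threshold)   # 40.0 False
```

Under this assumption, raising `threshold` means fewer pairs clear the bar, which matches the "higher threshold fewer phrases" comment in the cell above.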



Now let's make the bigrams, lemmatize them, and then build the topics:

In [35]:
# Form Bigrams
bigram_words = bigram_maker(non_stopwords)

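# Load the spaCy 'en' model; we only need the tagger component
# (on spaCy 3.x the 'en' shortcut is gone; load a full model name such as 'en_core_web_sm')
nlp = spc.load('en', disable=['parser', 'ner'])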

# Lemmatize keeping only noun, adj, vb, adv
lemmatized_words = lemmatize(bigram_words, allowed_postags=['NOUN', 'ADJ', 'VERB', 'ADV'])

# Create Dictionary
id2word = corpora.Dictionary(lemmatized_words)

# Create Corpus
texts = lemmatized_words

# Term Document Frequency
corpus = [id2word.doc2bow(text) for text in texts]

# Building LDA model
lda_model = gensim.models.ldamodel.LdaModel(corpus=corpus,
                                        id2word=id2word,
                                        num_topics=5, 
                                        random_state=100,
                                        update_every=1,
                                        chunksize=100,
                                        passes=10,
                                        alpha='auto',
                                        per_word_topics=True)
# Print the keywords of the 5 topics
#pprint(lda_model.print_topics())
topics=lda_model.print_topics()
print("Top 5 topics are: ")
ct=1
for i in topics:
    print("\nTopic ",ct,": ")
    ct+=1
    pprint(i[1])
    
doc_lda = lda_model[corpus]

Top 5 topics are: 

Topic  1 : 
('0.055*"alien" + 0.027*"could" + 0.024*"get" + 0.021*"specie" + '
 '0.020*"script" + 0.017*"merchandise" + 0.017*"memorable" + 0.016*"make" + '
 '0.015*"different" + 0.015*"scene"')

Topic  2 : 
('0.034*"organism" + 0.027*"bishop" + 0.018*"small" + 0.016*"find" + '
 '0.014*"creature" + 0.014*"life_cycle" + 0.014*"head" + 0.013*"probably" + '
 '0.013*"long" + 0.012*"tail"')

Topic  3 : 
('0.083*"host" + 0.023*"egg" + 0.019*"nest" + 0.016*"larva" + 0.013*"natural" '
 '+ 0.012*"would" + 0.011*"emergence" + 0.011*"order" + 0.011*"possible" + '
 '0.010*"food"')

Topic  4 : 
('0.044*"would" + 0.024*"frequently" + 0.022*"environment" + 0.021*"use" + '
 '0.018*"large" + 0.015*"alien" + 0.015*"queen" + 0.015*"embryo" + '
 '0.014*"case" + 0.014*"also"')

Topic  5 : 
('0.078*"may" + 0.044*"add" + 0.018*"adult" + 0.016*"form" + 0.015*"imago" + '
 '0.014*"queen" + 0.013*"nymph" + 0.013*"implantation" + 0.012*"new" + '
 '0.012*"section"')
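
The dictionary and bag-of-words steps in the cell above can be pictured with plain Python. The sketch below assigns an integer id to each distinct token and turns each document into a list of (id, count) pairs; the helper names are hypothetical, and Gensim's own id assignment may order tokens differently.

```python
from collections import Counter

def build_vocabulary(token_lists):
    """Map each distinct token to an integer id, in first-seen order."""
    token_to_id = {}
    for tokens in token_lists:
        for tok in tokens:
            token_to_id.setdefault(tok, len(token_to_id))
    return token_to_id

def to_bow(tokens, token_to_id):
    """Convert a token list into sorted (token_id, count) pairs."""
    counts = Counter(token_to_id[t] for t in tokens if t in token_to_id)
    return sorted(counts.items())

docs = [["alien", "egg", "host", "egg"], ["host", "queen"]]
vocab = build_vocabulary(docs)
bows = [to_bow(d, vocab) for d in docs]
print(vocab)  # {'alien': 0, 'egg': 1, 'host': 2, 'queen': 3}
print(bows)   # [[(0, 1), (1, 2), (2, 1)], [(2, 1), (3, 1)]]
```

Word order is discarded in this representation; the model only sees how often each token id occurs in each document.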


Note: I took help from these sites to understand how to use Gensim for topic modeling:
https://radimrehurek.com/topic_modeling_tutorial/2%20-%20Topic%20Modeling.html and https://www.machinelearningplus.com/nlp/topic-modeling-gensim-python/