<a href="https://colab.research.google.com/github/CgriefTesla/text_mining_report/blob/main/Text_mining_2.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Preparation

Install the required libraries and download the NLTK data used below.


In [1]:
!pip install nltk
!pip install gensim
!pip install pyLDAvis



## Downloading the data


In [2]:
import nltk
nltk.download("stopwords")
nltk.download("wordnet")
nltk.download("brown")
nltk.download("punkt")

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package brown to /root/nltk_data...
[nltk_data]   Package brown is already up-to-date!
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

## Dataset
Load the Brown corpus from the NLTK package.

In [3]:
from nltk.corpus import brown as corpus

### Check out the content of the corpus

In [4]:
# print the first 20 words of the first document (a line break follows every 25th word)
for n, item in enumerate(corpus.words(corpus.fileids()[0])[:20]):
    print(item, end=" ")
    if n % 25 == 24:
        print(" ")

The Fulton County Grand Jury said Friday an investigation of Atlanta's recent primary election produced `` no evidence '' that 

The total number of documents.

In [5]:
len(corpus.fileids())

500

Use all documents to train the model.

In [6]:
docs=[corpus.words(fileid) for fileid in corpus.fileids()]

print(docs[:5])
print("num of docs:", len(docs))

[['The', 'Fulton', 'County', 'Grand', 'Jury', 'said', ...], ['Austin', ',', 'Texas', '--', 'Committee', 'approval', ...], ['Several', 'defendants', 'in', 'the', 'Summerdale', ...], ['Oslo', 'The', 'most', 'positive', 'element', 'to', ...], ['East', 'Providence', 'should', 'organize', 'its', ...]]
num of docs: 500


## Data preprocessing
First, define the stopwords. We combine the English stopwords from the NLTK package with extra noise tokens that may distort the LDA result.

Numbers and punctuation are ignored as well. Here they are listed explicitly; a regular expression would be a more general way to match them.

In [127]:
# English stopwords defined by the NLTK package.
en_stop = nltk.corpus.stopwords.words('english')

# Extra noise tokens (punctuation, numbers, very frequent words) that might distort the result.
en_stop = ["``","/",",.",".,",";","--",":",")","(",'"','&',"'",'),',',"','-','.,','.,"','.-',"?",">","<","!"]                  \
         +["0","1","2","3","4","5","6","7","8","9","10","11","12","86","1986","1987","000","one","two","first"]                                                      \
         +["said","say","u","v","mln","ct","net","dlrs","tonne","pct","shr","nil","company","lt","share","year","billion","price","would","make","know","could","like", "go", "take","might","may"]          \
         +["mr.","mrs.","must", "even","new","state","get","man","come","time","see","many","little","years","day","also","af","give","men","use","seem","much","back","work"]   \
         +["well","look","tell","last","form","way","good","us","still","world","people","school","want","need","never"]   \
         +["since","high","life","become","however","small","another","long"]   \
         +en_stop
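
The cell above lists noise tokens by hand. As a minimal sketch of the regular-expression approach mentioned in the text, the snippet below flags tokens that are purely numeric or purely punctuation. The names `NUMBER_RE`, `PUNCT_RE` and `is_noise` are illustrative and not part of the notebook, and the patterns are an assumption about what counts as noise.

```python
import re

# Tokens made only of digits, optionally with "," or "." separators, e.g. "1986" or "1,000".
NUMBER_RE = re.compile(r"\d+(?:[.,]\d+)*")
# Tokens that contain no ASCII letter or digit at all, e.g. "``", "--" or "),".
PUNCT_RE = re.compile(r"[^A-Za-z0-9]+")


def is_noise(token):
    """Return True for purely numeric or purely punctuation tokens."""
    return bool(NUMBER_RE.fullmatch(token) or PUNCT_RE.fullmatch(token))


sample = ["The", "1986", "1,000", "``", "--", "jury", "),", "mr.", "3rd"]
kept = [t for t in sample if not is_noise(t)]
print(kept)
```

A check like this could run before the stopword lookup, so that the hand-written list only needs to contain real words.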

Next, define several preprocessing functions.

In [128]:
from nltk.corpus import wordnet as wn # import for lemmatize
from collections import defaultdict

def preprocess_word(word, stopwordset):
    
    
    #1.convert words to lowercase (e.g., Python =>python)
    word=word.lower()
    
    #2.remove "," and "." and "''"
    if word in [",",".","''"]:
        return None
    
    #3.remove stopwords  (e.g., the => (None)) 
    if word in stopwordset:
        return None
    
    #4.lemmatize  (e.g., cooked=>cook)
    lemma = wn.morphy(word)
    if lemma is None:
        return word

    # lemmatized words could be in the stopwords set
    elif lemma in stopwordset: 
        return None
    else:
        return lemma
    

def preprocess_document(document):
    document=[preprocess_word(w, en_stop) for w in document]
    document=[w for w in document if w is not None]
    frequency = defaultdict(int)
    # drop words that appear only once in this document; they are likely noise
    for token in document:
        frequency[token] += 1
    document = [token for token in document if frequency[token] > 1]
    return document

def preprocess_documents(documents):
    return [preprocess_document(document) for document in documents]
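
The frequency filter in `preprocess_document` can be looked at in isolation. The sketch below uses `collections.Counter` on a made-up token list; `drop_singletons` and `toy_doc` are hypothetical names introduced only for this illustration.

```python
from collections import Counter


def drop_singletons(tokens, min_count=2):
    """Keep only tokens that occur at least min_count times in this document."""
    counts = Counter(tokens)
    return [t for t in tokens if counts[t] >= min_count]


# Made-up tokens, standing in for one preprocessed document.
toy_doc = ["jury", "election", "jury", "county", "election", "atlanta"]
filtered = drop_singletons(toy_doc)
print(filtered)
```

Note that the count is taken per document, not over the whole corpus, so a word that is frequent elsewhere is still dropped from a document in which it appears only once.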

Check out the preprocessing result.

In [None]:
# before
print(docs[0][:25]) 

# after
print(preprocess_documents(docs)[0][:25])

Next, convert the documents into the bag-of-words format that the gensim LDA model expects.

In [129]:
import gensim
from gensim import corpora

In [130]:
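# preprocess once and reuse the result
processed_docs = preprocess_documents(docs)
# build the dictionary (mapping between words and IDs)
dictionary = corpora.Dictionary(processed_docs)
# construct the bag-of-words corpus: one list of (word ID, frequency) pairs per document
corpus_ = [dictionary.doc2bow(doc) for doc in processed_docs]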

Let us check out the contents of the built dictionary and corpus.

In [79]:
# token2id is the attribute that maps each word to its dictionary ID

print(dictionary.token2id)





In [80]:
# corpus_ stores each document as a list of (word ID, frequency) pairs

# note that the pairs follow the order of the dictionary IDs, not the order in which words appear in the document
print(corpus_[0][:10]) 


[(0, 2), (1, 4), (2, 3), (3, 2), (4, 2), (5, 2), (6, 2), (7, 2), (8, 2), (9, 2)]
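
To make the `(ID, frequency)` pairs above more concrete, here is a pure-Python sketch of the bag-of-words idea. It is an illustration only, not gensim's implementation: `build_token2id` and `to_bow` are hypothetical helpers, and the way IDs are assigned here (first-seen order) may differ from what `corpora.Dictionary` does.

```python
from collections import Counter


def build_token2id(documents):
    """Assign an integer ID to every distinct token, in first-seen order."""
    token2id = {}
    for doc in documents:
        for token in doc:
            token2id.setdefault(token, len(token2id))
    return token2id


def to_bow(doc, token2id):
    """Return sorted (token ID, count) pairs for the known tokens of one document."""
    counts = Counter(token2id[t] for t in doc if t in token2id)
    return sorted(counts.items())


# Made-up documents, standing in for the preprocessed corpus.
toy_docs = [["jury", "election", "jury"], ["election", "county", "county"]]
toy_token2id = build_token2id(toy_docs)
toy_bow = [to_bow(doc, toy_token2id) for doc in toy_docs]
print(toy_token2id)
print(toy_bow)
```

Word order is lost in this representation, which is why the pairs are sorted by ID rather than by position in the text.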


Compare the first sentence of the original document with the preprocessed bag-of-words representation that is passed to the LDA model.

In [81]:
# before
print([w.lower() for w in corpus.sents(corpus.fileids()[0])[0]])

# after
print(dictionary.doc2bow(preprocess_document(corpus.words(corpus.fileids()[0]))))


['the', 'fulton', 'county', 'grand', 'jury', 'said', 'friday', 'an', 'investigation', 'of', "atlanta's", 'recent', 'primary', 'election', 'produced', '``', 'no', 'evidence', "''", 'that', 'any', 'irregularities', 'took', 'place', '.']
[(0, 2), (1, 4), (2, 3), (3, 2), (4, 2), (5, 2), (6, 2), (7, 2), (8, 2), (9, 2), (10, 3), (11, 2), (12, 3), (13, 4), (14, 5), (15, 2), (16, 3), (17, 2), (18, 2), (19, 2), (20, 2), (21, 3), (22, 2), (23, 2), (24, 2), (25, 2), (26, 9), (27, 3), (28, 2), (29, 3), (30, 5), (31, 6), (32, 3), (33, 4), (34, 2), (35, 9), (36, 2), (37, 2), (38, 3), (39, 2), (40, 2), (41, 2), (42, 17), (43, 2), (44, 5), (45, 3), (46, 11), (47, 2), (48, 3), (49, 2), (50, 3), (51, 14), (52, 3), (53, 2), (54, 2), (55, 2), (56, 2), (57, 2), (58, 4), (59, 4), (60, 2), (61, 2), (62, 2), (63, 4), (64, 14), (65, 2), (66, 8), (67, 2), (68, 2), (69, 4), (70, 4), (71, 3), (72, 2), (73, 4), (74, 2), (75, 5), (76, 8), (77, 3), (78, 3), (79, 5), (80, 3), (81, 2), (82, 4), (83, 2), (84, 3), (85, 

## Training

In [131]:
ldamodel = gensim.models.ldamodel.LdaModel(corpus=corpus_,
                                           num_topics=15,
                                           id2word=dictionary,
                                           alpha=0.1,                 # optional: prior on the document-topic distributions (alpha)
                                           eta=0.1,                   # optional: prior on the topic-word distributions (often called beta)
                                           #minimum_probability=0.0    # optional: lower bound below which topic probabilities are not reported
                                          )

  score += np.sum(cnt * logsumexp(Elogthetad + Elogbeta[:, int(id)]) for id, cnt in doc)
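
The effect of the `alpha` hyperparameter can be explored without training a model. The sketch below draws document-topic distributions from a symmetric Dirichlet prior using only the standard library; `sample_dirichlet` is a hypothetical helper and the alpha values other than 0.1 are chosen only for comparison.

```python
import random


def sample_dirichlet(alpha, k, rng):
    """Draw one sample from a symmetric Dirichlet(alpha) over k topics."""
    # Normalizing independent Gamma(alpha, 1) draws yields a Dirichlet sample.
    draws = [rng.gammavariate(alpha, 1.0) for _ in range(k)]
    total = sum(draws)
    return [d / total for d in draws]


rng = random.Random(0)  # fixed seed so repeated runs print the same samples
samples = {}
for alpha in (0.1, 1.0, 10.0):
    samples[alpha] = sample_dirichlet(alpha, 15, rng)
    print(alpha, [round(p, 2) for p in samples[alpha]])
```

Small values such as 0.1 tend to put most of the probability mass on a few topics, while large values spread it more evenly across all 15.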

Check out the learned parameters.

In [132]:
# the top num_words words of each topic, printed as (topic ID, weighted words), where each weight is the word's probability under that topic

topics = ldamodel.print_topics(num_words=15)
for topic in topics:
    print(topic)

(0, '0.003*"ask" + 0.003*"right" + 0.003*"head" + 0.003*"place" + 0.003*"call" + 0.003*"old" + 0.003*"around" + 0.002*"home" + 0.002*"eyes" + 0.002*"thought" + 0.002*"night" + 0.002*"found" + 0.002*"write" + 0.002*"try" + 0.002*"going"')
(1, '0.003*"house" + 0.003*"increase" + 0.003*"city" + 0.002*"call" + 0.002*"place" + 0.002*"home" + 0.002*"right" + 0.002*"stand" + 0.002*"country" + 0.002*"problem" + 0.002*"church" + 0.002*"thought" + 0.002*"left" + 0.002*"point" + 0.002*"great"')
(2, '0.004*"american" + 0.003*"general" + 0.003*"house" + 0.003*"old" + 0.003*"right" + 0.002*"god" + 0.002*"program" + 0.002*"church" + 0.002*"interest" + 0.002*"present" + 0.002*"found" + 0.002*"poem" + 0.002*"show" + 0.002*"great" + 0.002*"unite"')
(3, '0.003*"american" + 0.003*"business" + 0.003*"great" + 0.003*"area" + 0.003*"system" + 0.002*"house" + 0.002*"move" + 0.002*"member" + 0.002*"government" + 0.002*"problem" + 0.002*"president" + 0.002*"show" + 0.002*"interest" + 0.002*"old" + 0.002*"point"
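
The topic strings printed above are convenient to read but awkward to process. As a sketch, the snippet below parses such a string into `(word, probability)` pairs with a regular expression; `TERM_RE` and `parse_topic` are hypothetical names. gensim's `show_topic` method should return similar pairs directly, so this is mainly useful when only the printed text is available.

```python
import re

# Matches one probability*"word" term in a print_topics string.
TERM_RE = re.compile(r'([0-9.]+)\*"([^"]+)"')


def parse_topic(topic_string):
    """Turn a print_topics string into a list of (word, probability) pairs."""
    return [(word, float(prob)) for prob, word in TERM_RE.findall(topic_string)]


# A shortened copy of the topic 0 line printed above.
example = '0.003*"ask" + 0.003*"right" + 0.003*"head" + 0.002*"home"'
pairs = parse_topic(example)
print(pairs)
```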

In [122]:
# for each document, show the topics whose probability exceeds minimum_probability, as [(topic ID, probability)]

for n,item in enumerate(corpus_[:10]):
    print("document ID "+str(n)+":" ,end="")
    print(ldamodel.get_document_topics(item))

document ID 0:[(2, 0.9977025)]
document ID 1:[(1, 0.51054955), (2, 0.057294566), (6, 0.4188303), (9, 0.011549506)]
document ID 2:[(0, 0.8056379), (1, 0.099599086), (4, 0.0929914)]
document ID 3:[(3, 0.05980213), (4, 0.5381774), (11, 0.39962235)]
document ID 4:[(0, 0.9976446)]
document ID 5:[(1, 0.08140957), (6, 0.9160681)]
document ID 6:[(2, 0.0410272), (7, 0.066643484), (9, 0.8902328)]
document ID 7:[(9, 0.99714315)]
document ID 8:[(7, 0.45780355), (14, 0.54016954)]
document ID 9:[(6, 0.017813614), (11, 0.97394204)]
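
A common next step is to reduce each distribution to its most probable topic. The sketch below does this for the first five rows of the output above, copied in as plain Python data; `dominant_topic` is a hypothetical helper name.

```python
# The first five rows of the output above, copied as plain Python data.
doc_topics = [
    [(2, 0.9977025)],
    [(1, 0.51054955), (2, 0.057294566), (6, 0.4188303), (9, 0.011549506)],
    [(0, 0.8056379), (1, 0.099599086), (4, 0.0929914)],
    [(3, 0.05980213), (4, 0.5381774), (11, 0.39962235)],
    [(0, 0.9976446)],
]


def dominant_topic(distribution):
    """Return the (topic ID, probability) pair with the highest probability."""
    return max(distribution, key=lambda pair: pair[1])


dominant = [dominant_topic(d) for d in doc_topics]
for doc_id, (topic_id, prob) in enumerate(dominant):
    print(f"document ID {doc_id}: topic {topic_id} ({prob:.2f})")
```

In the notebook itself, the same function could be applied to `ldamodel.get_document_topics(item)` for every item in `corpus_`.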


In [123]:
# the categories of documents
categories = [corpus.categories(fileid) for fileid in corpus.fileids()]

In [124]:
n=0

# nth document's topic distribution
print(ldamodel.get_document_topics(corpus_[n]))

# nth document's category
print(categories[n])

# show the original document
print(" ".join(docs[n]))

[(2, 0.9977025)]
['news']
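
One way to relate the learned topics to the corpus categories is a simple cross-tabulation. The sketch below counts dominant topics per category on hypothetical toy data; the labels and topic IDs are made up for illustration and are not results from the model. In the notebook, each entry of `categories` is a list such as `['news']`, so its first element would be used as the label.

```python
from collections import Counter, defaultdict

# Hypothetical toy inputs: one category label and one dominant topic ID per document.
toy_categories = ["news", "news", "fiction", "fiction", "news"]
toy_dominant_topics = [2, 2, 0, 0, 1]


def topics_by_category(categories, topic_ids):
    """Count how often each dominant topic occurs within each category."""
    table = defaultdict(Counter)
    for category, topic_id in zip(categories, topic_ids):
        table[category][topic_id] += 1
    return table


table = topics_by_category(toy_categories, toy_dominant_topics)
for category, counts in sorted(table.items()):
    print(category, counts.most_common())
```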


## Visualization
Analyze our result through visualization.

In [125]:
import pyLDAvis.gensim_models
pyLDAvis.enable_notebook()

  from collections import Iterable
Deprecated in NumPy 1.20; for more details and guidance: https://numpy.org/devdocs/release/1.20.0-notes.html#deprecations
  EPS = np.finfo(np.float).eps


In [126]:
# visualizing the result can take about 20 minutes when the model is trained on all documents
# note that gensim numbers topics from 0 to K-1, while pyLDAvis numbers them from 1 to K


lda_display = pyLDAvis.gensim_models.prepare(ldamodel, corpus_, dictionary, sort_topics=False)
pyLDAvis.display(lda_display)
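
Because of the index offset noted in the comment above, it helps to convert explicitly when comparing the chart with the printed topics. The two helpers below are hypothetical and assume `sort_topics=False` as in the call above; with sorting enabled, pyLDAvis reorders the topics and a fixed offset no longer applies.

```python
def gensim_to_pyldavis(topic_id):
    """Convert a 0-based gensim topic ID to the 1-based pyLDAvis label."""
    if topic_id < 0:
        raise ValueError("gensim topic IDs start at 0")
    return topic_id + 1


def pyldavis_to_gensim(label):
    """Convert a 1-based pyLDAvis label back to the 0-based gensim topic ID."""
    if label < 1:
        raise ValueError("pyLDAvis labels start at 1")
    return label - 1


# With 15 topics, gensim topic 0 is circle 1 and gensim topic 14 is circle 15.
print(gensim_to_pyldavis(0), gensim_to_pyldavis(14))
print(pyldavis_to_gensim(3))
```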