<a href="https://colab.research.google.com/github/CgriefTesla/text_mining_report/blob/main/Text_mining_2.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Preparation

Install the required libraries, then download the NLTK data.


In [1]:
!pip install nltk
!pip install gensim
!pip install pyLDAvis



## Downloading the NLTK data


In [2]:
import nltk
nltk.download("stopwords")
nltk.download("wordnet")
nltk.download("brown")
nltk.download("punkt")

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package brown to /root/nltk_data...
[nltk_data]   Package brown is already up-to-date!
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

## Dataset
Load the Brown corpus from the NLTK package.

In [3]:
from nltk.corpus import brown as corpus

### Check the content of the corpus

In [4]:
for n,item in enumerate(corpus.words(corpus.fileids()[0])[:20]):
    print(item, end=" ")
    if (n % 25) == 24:
        print(" ")

The Fulton County Grand Jury said Friday an investigation of Atlanta's recent primary election produced `` no evidence '' that 

The total number of documents.

In [5]:
len(corpus.fileids())

500

Train the model with all documents.

In [6]:
docs=[corpus.words(fileid) for fileid in corpus.fileids()]

print(docs[:5])
print("num of docs:", len(docs))

[['The', 'Fulton', 'County', 'Grand', 'Jury', 'said', ...], ['Austin', ',', 'Texas', '--', 'Committee', 'approval', ...], ['Several', 'defendants', 'in', 'the', 'Summerdale', ...], ['Oslo', 'The', 'most', 'positive', 'element', 'to', ...], ['East', 'Providence', 'should', 'organize', 'its', ...]]
num of docs: 500


## Data preprocessing
First, define the stopwords. We combine the English stopwords from the NLTK package with additional noise tokens that may distort the LDA result.

Numbers and other uninformative words are ignored as well; in the cell below they are listed explicitly rather than matched with a regular expression.

In [7]:
# English stopwords defined by the NLTK package.
en_stop = nltk.corpus.stopwords.words('english')

# Add noise tokens (punctuation, numbers, very frequent words) that might affect our result.
en_stop = ["``","/",",.",".,",";","--",":",")","(",'"','&',"'",'),',',"','-','.,','.,"','.-',"?",">","<","!"]                  \
         +["0","1","2","3","4","5","6","7","8","9","10","11","12","86","1986","1987","000","one","two","first"]                                                      \
         +["said","say","u","v","mln","ct","net","dlrs","tonne","pct","shr","nil","company","lt","share","year","billion","price","would","make","know","could","like", "go", "take","might","may"]          \
         +["mr.","mrs.","must", "even","new","state","get","man","come","time","see","many","little","years","day","also","af","give","men","use","seem","much","back","work"]   \
         +["well","look","tell","last","form","way","good","us","still","world","people","school","want","need","never"]   \
         +["since","high","life","become","however","small","another","long"]   \
         +en_stop
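
The punctuation and digit entries in the list above could also be matched by pattern instead of being enumerated. The following is a minimal sketch, assuming a hypothetical helper `is_noise_token` and pattern `NOISE_PATTERN` that are not part of the original notebook. Words such as "said" or "one" would still need the explicit list.

```python
import re

# Hypothetical pattern (an assumption, not from the notebook):
# a token is noise if it contains no ASCII letters at all, e.g. "1986", "--", ".,"
NOISE_PATTERN = re.compile(r"[^a-zA-Z]+")

def is_noise_token(token):
    """Return True for tokens made only of digits, punctuation or symbols."""
    return NOISE_PATTERN.fullmatch(token) is not None

sample = ["election", "1986", "--", "``", "atlanta's", "000", "jury"]
kept = [t for t in sample if not is_noise_token(t)]
print(kept)
```

In `preprocess_word`, such a check could sit next to the stopword test, for example `if is_noise_token(word): return None`.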

Next, define several preprocessing functions.

In [8]:
from nltk.corpus import wordnet as wn # import for lemmatize
from collections import defaultdict

def preprocess_word(word, stopwordset):
    # 1. convert the word to lowercase (e.g., Python => python)
    word = word.lower()

    # 2. remove "," and "." and "''"
    if word in [",", ".", "''"]:
        return None

    # 3. remove stopwords (e.g., the => None)
    if word in stopwordset:
        return None

    # 4. lemmatize (e.g., cooked => cook); morphy returns None when it finds no lemma
    lemma = wn.morphy(word)
    if lemma is None:
        return word

    # the lemma itself may be in the stopword set
    if lemma in stopwordset:
        return None
    return lemma
    

def preprocess_document(document):
    document=[preprocess_word(w, en_stop) for w in document]
    document=[w for w in document if w is not None]
    frequency = defaultdict(int)
    # drop words that appear only once in the document; they are treated as noise
    for token in document:
        frequency[token] += 1
    document = [token for token in document if frequency[token] > 1]
    return document

def preprocess_documents(documents):
    return [preprocess_document(document) for document in documents]
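
Note that the singleton filter in `preprocess_document` works per document: a word is dropped when it occurs once in that document, even if it is common elsewhere in the corpus. Here is a small sketch of the same step on a made-up token list, using `collections.Counter` in place of `defaultdict`. The helper name `drop_singletons` is hypothetical.

```python
from collections import Counter

def drop_singletons(tokens):
    """Keep only tokens that occur more than once in this token list."""
    frequency = Counter(tokens)
    return [token for token in tokens if frequency[token] > 1]

# Toy tokens for illustration only
toy_document = ["jury", "election", "city", "jury", "praise", "election", "jury"]
filtered = drop_singletons(toy_document)
print(filtered)
```

The order of the surviving tokens is preserved; only "city" and "praise", which occur once, are removed.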

Check the preprocessing result.

In [9]:
# before
print(docs[0][:25]) 

# after
print(preprocess_documents(docs)[0][:25])

['The', 'Fulton', 'County', 'Grand', 'Jury', 'said', 'Friday', 'an', 'investigation', 'of', "Atlanta's", 'recent', 'primary', 'election', 'produced', '``', 'no', 'evidence', "''", 'that', 'any', 'irregularities', 'took', 'place', '.']
['fulton', 'county', 'grand', 'jury', 'friday', "atlanta's", 'primary', 'election', 'irregularity', 'place', 'jury', 'city', 'executive', 'committee', 'charge', 'election', 'praise', 'city', 'atlanta', 'manner', 'election', 'term', 'jury', 'charge', 'fulton']


Next, convert the documents into the bag-of-words format that the gensim LDA model expects.

In [10]:
import gensim
from gensim import corpora

In [11]:
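# preprocess once and reuse the result
processed_docs = preprocess_documents(docs)
# build the dictionary (mapping between words and integer IDs)
dictionary = corpora.Dictionary(processed_docs)
# construct the bag-of-words corpus: one list of (ID, frequency) pairs per document
corpus_ = [dictionary.doc2bow(doc) for doc in processed_docs]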

Let us check out the contents of the built dictionary and corpus.

In [12]:
# token2id is the attribute which indicates the mapping between words and dictionary ID

print(dictionary.token2id)





In [13]:
# corpus_ holds, for each document, a list of (word ID, frequency) pairs

# note that the pairs follow the order of the dictionary IDs, not the order in which the words appear in the document
print(corpus_[0][:10]) 


[(0, 2), (1, 4), (2, 3), (3, 2), (4, 2), (5, 2), (6, 2), (7, 2), (8, 2), (9, 2)]
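
To make the `(ID, frequency)` pairs concrete, here is a simplified pure-Python stand-in for what the dictionary and `doc2bow` produce. It is only an illustration under assumptions: the helper names `build_token2id` and `to_bow` are hypothetical, the toy documents are made up, and the IDs gensim assigns may differ from the ones below.

```python
from collections import Counter

def build_token2id(documents):
    """Assign an integer ID to every distinct token, in first-seen order."""
    token2id = {}
    for document in documents:
        for token in document:
            if token not in token2id:
                token2id[token] = len(token2id)
    return token2id

def to_bow(document, token2id):
    """Return (token ID, count) pairs sorted by ID; word order is discarded."""
    counts = Counter(token2id[token] for token in document if token in token2id)
    return sorted(counts.items())

# Toy documents for illustration only
toy_docs = [["jury", "election", "jury"], ["city", "election", "city", "city"]]
token2id = build_token2id(toy_docs)
print(token2id)
print([to_bow(doc, token2id) for doc in toy_docs])
```

Each document becomes a sparse vector: only the words that actually occur in it are listed, which keeps the corpus compact.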


Compare the first sentence of the original document with the bag-of-words representation of the whole document that is passed to the LDA model.

In [14]:
# before
print([w.lower() for w in corpus.sents(corpus.fileids()[0])[0]])

# after
print(dictionary.doc2bow(preprocess_document(corpus.words(corpus.fileids()[0]))))


['the', 'fulton', 'county', 'grand', 'jury', 'said', 'friday', 'an', 'investigation', 'of', "atlanta's", 'recent', 'primary', 'election', 'produced', '``', 'no', 'evidence', "''", 'that', 'any', 'irregularities', 'took', 'place', '.']
[(0, 2), (1, 4), (2, 3), (3, 2), (4, 2), (5, 2), (6, 2), (7, 2), (8, 2), (9, 2), (10, 3), (11, 2), (12, 3), (13, 4), (14, 5), (15, 2), (16, 3), (17, 2), (18, 2), (19, 2), (20, 3), (21, 2), (22, 2), (23, 2), (24, 9), (25, 3), (26, 2), (27, 3), (28, 5), (29, 6), (30, 3), (31, 4), (32, 2), (33, 9), (34, 2), (35, 2), (36, 3), (37, 2), (38, 2), (39, 2), (40, 17), (41, 2), (42, 5), (43, 3), (44, 11), (45, 2), (46, 3), (47, 2), (48, 3), (49, 14), (50, 3), (51, 2), (52, 2), (53, 2), (54, 2), (55, 2), (56, 4), (57, 4), (58, 2), (59, 2), (60, 2), (61, 4), (62, 14), (63, 2), (64, 8), (65, 2), (66, 2), (67, 4), (68, 4), (69, 3), (70, 2), (71, 4), (72, 2), (73, 5), (74, 8), (75, 3), (76, 3), (77, 5), (78, 2), (79, 4), (80, 2), (81, 3), (82, 6), (83, 2), (84, 2), (85, 

## Training

In [15]:
ldamodel = gensim.models.ldamodel.LdaModel(corpus=corpus_,
                                           num_topics=15,
                                           id2word=dictionary,
                                           alpha=0.1,                 # optional: Dirichlet prior on the per-document topic distributions
                                           eta=0.1,                   # optional: Dirichlet prior on the per-topic word distributions (often written as beta)
                                           #minimum_probability=0.0    # optional: topics with a probability below this threshold are filtered out
                                          )

Check out the learned parameters.

In [16]:
# the top num_words words of each topic, printed as (topic ID, 'probability*"word" + ...')

topics = ldamodel.print_topics(num_words=15)
for topic in topics:
    print(topic)

(0, '0.004*"house" + 0.003*"unite" + 0.003*"point" + 0.002*"government" + 0.002*"child" + 0.002*"place" + 0.002*"system" + 0.002*"shall" + 0.002*"around" + 0.002*"left" + 0.002*"right" + 0.002*"upon" + 0.002*"side" + 0.002*"old" + 0.002*"call"')
(1, '0.003*"american" + 0.003*"program" + 0.002*"head" + 0.002*"right" + 0.002*"three" + 0.002*"found" + 0.002*"system" + 0.002*"show" + 0.002*"problem" + 0.002*"place" + 0.002*"business" + 0.002*"around" + 0.002*"call" + 0.002*"plan" + 0.002*"development"')
(2, '0.004*"line" + 0.002*"point" + 0.002*"system" + 0.002*"group" + 0.002*"house" + 0.002*"church" + 0.002*"week" + 0.002*"age" + 0.002*"provide" + 0.002*"present" + 0.002*"interest" + 0.002*"child" + 0.002*"three" + 0.002*"place" + 0.002*"head"')
(3, '0.003*"church" + 0.002*"program" + 0.002*"great" + 0.002*"end" + 0.002*"place" + 0.002*"war" + 0.002*"three" + 0.002*"call" + 0.002*"social" + 0.002*"old" + 0.002*"write" + 0.002*"turn" + 0.002*"open" + 0.002*"away" + 0.002*"unite"')
(4, '0.
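
The strings returned by `print_topics` are easy to read but awkward to process further. Below is a sketch of turning one such string into `(word, weight)` pairs with a regular expression. The helper name `parse_topic_string` is hypothetical, and the sample string is the beginning of topic 0 from the output above. gensim's `ldamodel.show_topic(topicid, topn=15)` returns such pairs directly and is probably the simpler route when the model is at hand.

```python
import re

# Matches fragments such as 0.004*"house"
TERM_PATTERN = re.compile(r'([0-9.]+)\*"([^"]+)"')

def parse_topic_string(topic_string):
    """Convert a print_topics string into a list of (word, weight) pairs."""
    return [(word, float(weight)) for weight, word in TERM_PATTERN.findall(topic_string)]

# First four terms of topic 0, copied from the output above
topic_0 = '0.004*"house" + 0.003*"unite" + 0.003*"point" + 0.002*"government"'
pairs = parse_topic_string(topic_0)
print(pairs)
```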

In [17]:
# for each document, show the topics whose probability exceeds minimum_probability, as [(topic ID, probability)]

for n in range(0,499,50):
    print("document ID "+str(n)+":" ,end="")
    print(ldamodel.get_document_topics(corpus_[n]))

document ID 0:[(7, 0.6319925), (12, 0.3645142)]
document ID 50:[(0, 0.03840721), (4, 0.9589309)]
document ID 100:[(12, 0.94691974), (13, 0.050316658)]
document ID 150:[(1, 0.997515)]
document ID 200:[(3, 0.9962907)]
document ID 250:[(10, 0.99512464)]
document ID 300:[(12, 0.998199)]
document ID 350:[(14, 0.99706745)]
document ID 400:[(10, 0.9968355)]
document ID 450:[(13, 0.9961897)]
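
Each result is a list of `(topic ID, probability)` pairs, so the dominant topic of a document is simply the pair with the largest probability. Here is a sketch using three of the distributions printed above as literal data. The helper `dominant_topic` is hypothetical.

```python
def dominant_topic(topic_distribution):
    """Return the (topic ID, probability) pair with the highest probability."""
    return max(topic_distribution, key=lambda pair: pair[1])

# Values copied from the output above
printed = {
    0: [(7, 0.6319925), (12, 0.3645142)],
    50: [(0, 0.03840721), (4, 0.9589309)],
    150: [(1, 0.997515)],
}
for doc_id, distribution in printed.items():
    topic_id, probability = dominant_topic(distribution)
    print(f"document {doc_id}: topic {topic_id} ({probability:.2f})")
```

With the trained model, the same helper could be applied to `ldamodel.get_document_topics(corpus_[n])`.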


In [18]:
# the categories of documents
categories = [corpus.categories(fileid) for fileid in corpus.fileids()]

In [19]:
for n in range(0,499,50):
  print('-------------------------')
  print("This is document:",n)
  # nth document's topic distribution
  print(ldamodel.get_document_topics(corpus_[n]))

  # nth document's category
  print(categories[n])

  # show the original document
  print(" ".join(docs[n]))

-------------------------
This is document: 0
[(7, 0.63611555), (12, 0.36175108)]
['news']
-------------------------
This is document: 50
[(0, 0.041525424), (4, 0.95581275)]
['editorial']
-------------------------
This is document: 100
[(12, 0.95028055), (13, 0.04695583)]
['religion']
If we look about the world today , we can see clearly that there are two especially significant factors shaping the future of our civilization : science and religion . Science is placing in our hands the ultimate power of the universe , the power of the atom . Religion , or the lack of it , will decide whether we use this power to build a brave new world of peace and abundance for all mankind , or whether we misuse this power to leave a world utterly destroyed . How can we have the wisdom to meet such a new and difficult challenge ? ? We may feel pessimistic at the outlook . And yet there is a note of hope , because this same science that is giving us the power of the atom is also giving us atomic vision 

## Visualization
Analyze our result through visualization.

In [20]:
import pyLDAvis.gensim_models
pyLDAvis.enable_notebook()



In [21]:
# visualizing the result may take about 20 minutes if the model was trained on all documents
# note that gensim numbers topics from 0 to K-1, whereas pyLDAvis numbers them from 1 to K


lda_display = pyLDAvis.gensim_models.prepare(ldamodel, corpus_, dictionary, sort_topics=False)
pyLDAvis.display(lda_display)
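
Because of the index offset noted in the comments, and because `sort_topics=False` keeps the topics in the model's order, a gensim topic ID should correspond to the pyLDAvis label that is larger by one. A tiny sketch of the conversion follows; the helper names are hypothetical and `NUM_TOPICS` is assumed to equal the `num_topics` value used for training.

```python
NUM_TOPICS = 15  # assumed to match num_topics used when training the model

def gensim_to_pyldavis(topic_id):
    """Map a gensim topic ID (0..K-1) to the pyLDAvis label (1..K)."""
    if not 0 <= topic_id < NUM_TOPICS:
        raise ValueError(f"topic_id must be in 0..{NUM_TOPICS - 1}")
    return topic_id + 1

def pyldavis_to_gensim(label):
    """Map a pyLDAvis label (1..K) back to the gensim topic ID (0..K-1)."""
    if not 1 <= label <= NUM_TOPICS:
        raise ValueError(f"label must be in 1..{NUM_TOPICS}")
    return label - 1

print(gensim_to_pyldavis(7))
print(pyldavis_to_gensim(13))
```

For example, the topic printed as `7` for document 0 would appear as topic 8 in the visualization.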