
## Preparation

Install the required libraries and download the NLTK resources used below.


In [1]:
!pip install nltk
!pip install gensim
!pip install pyLDAvis



## Downloading NLTK data


In [2]:
import nltk
nltk.download("stopwords")
nltk.download("wordnet")
nltk.download("brown")
nltk.download("punkt")

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package brown to /root/nltk_data...
[nltk_data]   Package brown is already up-to-date!
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

## Dataset
Load the Brown corpus from the NLTK package.

In [3]:
from nltk.corpus import brown as corpus

### Check out the content of the corpus.

In [4]:
for n, item in enumerate(corpus.words(corpus.fileids()[0])[:20]):
    print(item, end=" ")
    # start a new line after every 25 words
    if n % 25 == 24:
        print(" ")

The Fulton County Grand Jury said Friday an investigation of Atlanta's recent primary election produced `` no evidence '' that 

The total number of documents.

In [5]:
len(corpus.fileids())

500

Train the model with all documents.

In [6]:
docs=[corpus.words(fileid) for fileid in corpus.fileids()]

print(docs[:5])
print("num of docs:", len(docs))

[['The', 'Fulton', 'County', 'Grand', 'Jury', 'said', ...], ['Austin', ',', 'Texas', '--', 'Committee', 'approval', ...], ['Several', 'defendants', 'in', 'the', 'Summerdale', ...], ['Oslo', 'The', 'most', 'positive', 'element', 'to', ...], ['East', 'Providence', 'should', 'organize', 'its', ...]]
num of docs: 500


## Data preprocessing
First, define the stopwords. We combine the English stopwords from the NLTK package with noise tokens that may distort the LDA result.

Numbers and other noise tokens are also ignored; here they are listed explicitly rather than matched with a regular expression.

In [7]:
# English stopwords defined by the NLTK package.
en_stop = nltk.corpus.stopwords.words('english')

# Also ignore noise tokens (punctuation, numbers, very common words) that might affect our result.
en_stop = ["``","/",",.",".,",";","--",":",")","(",'"','&',"'",'),',',"','-','.,','.,"','.-',"?",">","<","!"]                  \
         +["0","1","2","3","4","5","6","7","8","9","10","11","12","86","1986","1987","000","one","two","first"]                                                      \
         +["said","say","u","v","mln","ct","net","dlrs","tonne","pct","shr","nil","company","lt","share","year","billion","price","would","make","know","could","like", "go", "take","might","may"]          \
         +["mr.","mrs.","must", "even","new","state","get","man","come","time","see","many","little","years","day","also","af","give","men","use","seem","much","back","work"]   \
         +["well","look","tell","last","form","way","good","us","still","world","people","school","want","need","never"]   \
         +["since","high","life","become","however","small","small","another","long"]   \
         +en_stop
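The noise list above is written by hand. As a minimal sketch of the regular-expression alternative, the snippet below flags every token that contains no alphabetic character, which covers numbers and punctuation-only tokens. The helper name `is_noise_token` and the sample tokens are hypothetical and are not part of the notebook.

```python
import re

# Hypothetical helper (an assumption, not used elsewhere in the notebook):
# a token is treated as noise when it contains no ASCII letter at all.
_NO_LETTER = re.compile(r"^[^a-zA-Z]+$")

def is_noise_token(token):
    """Return True for tokens made only of digits, punctuation, or symbols."""
    return bool(_NO_LETTER.match(token))

# Sample tokens chosen for illustration only
sample = ["``", "1986", "--", "election", "atlanta's", "000", "mr."]
kept = [t for t in sample if not is_noise_token(t)]
print(kept)
```

A pattern like this only removes numeric and punctuation noise; frequent but uninformative words such as "said" would still need an explicit list.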

Next, define several preprocessing functions.

In [8]:
from nltk.corpus import wordnet as wn  # used for lemmatization
from collections import defaultdict

def preprocess_word(word, stopwordset):
    
    
    #1.convert words to lowercase (e.g., Python =>python)
    word=word.lower()
    
    #2.remove "," and "." and "''"
    if word in [",",".","''"]:
        return None
    
    #3.remove stopwords  (e.g., the => (None)) 
    if word in stopwordset:
        return None
    
    #4.lemmatize  (e.g., cooked=>cook)
    lemma = wn.morphy(word)
    if lemma is None:
        return word

    # lemmatized words could be in the stopwords set
    elif lemma in stopwordset: 
        return None
    else:
        return lemma
    

def preprocess_document(document):
    document=[preprocess_word(w, en_stop) for w in document]
    document=[w for w in document if w is not None]
    frequency = defaultdict(int)
    # drop tokens that appear only once in the document; they are treated as noise
    for token in document:
        frequency[token] += 1
    document = [token for token in document if frequency[token] > 1]
    return document

def preprocess_documents(documents):
    return [preprocess_document(document) for document in documents]
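The frequency filter inside `preprocess_document` can be looked at in isolation. The sketch below reproduces only that step with `collections.Counter` on a toy token list; `drop_singletons` is a hypothetical name and the tokens are made up for illustration.

```python
from collections import Counter

def drop_singletons(tokens):
    """Keep only tokens that occur more than once in this token list."""
    counts = Counter(tokens)
    return [t for t in tokens if counts[t] > 1]

# Toy input: "city" and "praise" occur once and are removed
tokens = ["jury", "election", "city", "jury", "praise", "election", "jury"]
filtered = drop_singletons(tokens)
print(filtered)
```

Note that the count is taken per document, so a word that is rare in one document but common elsewhere in the corpus is still removed from that document.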

Check out the preprocessing result.

In [9]:
# before
print(docs[0][:25]) 

# after
print(preprocess_documents(docs)[0][:25])

['The', 'Fulton', 'County', 'Grand', 'Jury', 'said', 'Friday', 'an', 'investigation', 'of', "Atlanta's", 'recent', 'primary', 'election', 'produced', '``', 'no', 'evidence', "''", 'that', 'any', 'irregularities', 'took', 'place', '.']
['fulton', 'county', 'grand', 'jury', 'friday', "atlanta's", 'primary', 'election', 'irregularity', 'place', 'jury', 'city', 'executive', 'committee', 'charge', 'election', 'praise', 'city', 'atlanta', 'manner', 'election', 'term', 'jury', 'charge', 'fulton']


Next, convert the documents into the bag-of-words format that the gensim LDA model expects.

In [10]:
import gensim
from gensim import corpora

In [11]:
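# preprocess once and reuse the result
processed_docs = preprocess_documents(docs)
# build the dictionary (mapping between tokens and integer IDs)
dictionary = corpora.Dictionary(processed_docs)
# construct the bag-of-words corpus: one list of (token ID, count) pairs per document
corpus_ = [dictionary.doc2bow(doc) for doc in processed_docs]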

Let us check out the contents of the built dictionary and corpus.

In [12]:
# token2id is the attribute which indicates the mapping between words and dictionary ID

print(dictionary.token2id)





In [13]:
# corpus_ holds, for each document, a list of (token ID, frequency) pairs

# note: the pairs follow the dictionary ID order, not the order of appearance in the document
print(corpus_[0][:10]) 


[(0, 2), (1, 4), (2, 3), (3, 2), (4, 2), (5, 2), (6, 2), (7, 2), (8, 2), (9, 2)]
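To make the (ID, frequency) format concrete, here is a minimal pure-Python sketch of a bag-of-words conversion on two toy documents. It is not gensim's implementation: the helpers `build_token2id` and `to_bow` are hypothetical, and assigning IDs in order of first appearance is an assumption that may differ from how `corpora.Dictionary` numbers tokens.

```python
from collections import Counter

def build_token2id(documents):
    """Assign an integer ID to each distinct token (first-appearance order, an assumption)."""
    token2id = {}
    for doc in documents:
        for token in doc:
            if token not in token2id:
                token2id[token] = len(token2id)
    return token2id

def to_bow(doc, token2id):
    """Return sorted (token ID, count) pairs; unknown tokens are skipped."""
    counts = Counter(token2id[t] for t in doc if t in token2id)
    return sorted(counts.items())

toy_docs = [["jury", "election", "jury"], ["election", "city", "city"]]
toy_token2id = build_token2id(toy_docs)
toy_bow = [to_bow(d, toy_token2id) for d in toy_docs]
print(toy_token2id)
print(toy_bow)
```

As in the real corpus, word order inside a document is lost; only the counts per token ID remain.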


Compare the original document with the preprocessed bag-of-words representation that is passed to the LDA model.

In [14]:
# before
print([w.lower() for w in corpus.sents(corpus.fileids()[0])[0]])

# after
print(dictionary.doc2bow(preprocess_document(corpus.words(corpus.fileids()[0]))))


['the', 'fulton', 'county', 'grand', 'jury', 'said', 'friday', 'an', 'investigation', 'of', "atlanta's", 'recent', 'primary', 'election', 'produced', '``', 'no', 'evidence', "''", 'that', 'any', 'irregularities', 'took', 'place', '.']
[(0, 2), (1, 4), (2, 3), (3, 2), (4, 2), (5, 2), (6, 2), (7, 2), (8, 2), (9, 2), (10, 3), (11, 2), (12, 3), (13, 4), (14, 5), (15, 2), (16, 3), (17, 2), (18, 2), (19, 2), (20, 3), (21, 2), (22, 2), (23, 2), (24, 9), (25, 3), (26, 2), (27, 3), (28, 5), (29, 6), (30, 3), (31, 4), (32, 2), (33, 9), (34, 2), (35, 2), (36, 3), (37, 2), (38, 2), (39, 2), (40, 17), (41, 2), (42, 5), (43, 3), (44, 11), (45, 2), (46, 3), (47, 2), (48, 3), (49, 14), (50, 3), (51, 2), (52, 2), (53, 2), (54, 2), (55, 2), (56, 4), (57, 4), (58, 2), (59, 2), (60, 2), (61, 4), (62, 14), (63, 2), (64, 8), (65, 2), (66, 2), (67, 4), (68, 4), (69, 3), (70, 2), (71, 4), (72, 2), (73, 5), (74, 8), (75, 3), (76, 3), (77, 5), (78, 2), (79, 4), (80, 2), (81, 3), (82, 6), (83, 2), (84, 2), (85, 

## Training

In [15]:
ldamodel = gensim.models.ldamodel.LdaModel(corpus=corpus_,
                                           num_topics=15,
                                           id2word=dictionary,
                                           alpha=0.1,                 # optional LDA hyperparameter alpha
                                           eta=0.1,                   # optional LDA hyperparameter beta
                                           #minimum_probability=0.0    # optional the lower bound of the topic/word generative probability
                                          )

Check out the learned parameters.

In [16]:
# the top num_words words for each topic, shown as (topic ID, "probability*word + ...")

topics = ldamodel.print_topics(num_words=10)
for topic in topics:
    print(topic)

(0, '0.003*"child" + 0.003*"right" + 0.003*"home" + 0.003*"miss" + 0.003*"house" + 0.003*"program" + 0.003*"around" + 0.003*"interest" + 0.003*"ask" + 0.002*"american"')
(1, '0.003*"point" + 0.002*"try" + 0.002*"thought" + 0.002*"child" + 0.002*"place" + 0.002*"open" + 0.002*"home" + 0.002*"going" + 0.002*"right" + 0.002*"ask"')
(2, '0.003*"show" + 0.003*"american" + 0.003*"place" + 0.003*"great" + 0.003*"write" + 0.002*"class" + 0.002*"point" + 0.002*"three" + 0.002*"though" + 0.002*"junior"')
(3, '0.003*"government" + 0.003*"place" + 0.003*"system" + 0.003*"american" + 0.003*"program" + 0.002*"child" + 0.002*"problem" + 0.002*"public" + 0.002*"interest" + 0.002*"church"')
(4, '0.003*"area" + 0.003*"water" + 0.003*"show" + 0.003*"place" + 0.002*"house" + 0.002*"city" + 0.002*"child" + 0.002*"line" + 0.002*"system" + 0.002*"problem"')
(5, '0.004*"system" + 0.002*"great" + 0.002*"call" + 0.002*"member" + 0.002*"provide" + 0.002*"found" + 0.002*"place" + 0.002*"right" + 0.002*"hand" + 0.

In [17]:
# for each document, show the topics whose probability exceeds minimum_probability, as [(topic ID, probability)]

for n in range(0,499,50):
    print("document ID "+str(n)+":" ,end="")
    print(ldamodel.get_document_topics(corpus_[n]))

document ID 0:[(6, 0.88426626), (7, 0.028813224), (8, 0.08495126)]
document ID 50:[(4, 0.035569314), (6, 0.5475998), (8, 0.41214526)]
document ID 100:[(4, 0.20188853), (8, 0.7953478)]
document ID 150:[(8, 0.8137033), (12, 0.1839892)]
document ID 200:[(3, 0.14551952), (4, 0.17760538), (12, 0.67369556)]
document ID 250:[(5, 0.9970797)]
document ID 300:[(4, 0.998199)]
document ID 350:[(13, 0.9970675)]
document ID 400:[(6, 0.9968355)]
document ID 450:[(1, 0.25838518), (11, 0.7288575)]
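Only some of the 15 topics appear for each document because low-probability topics are cut off by `minimum_probability` (0.01 by default in gensim, to my knowledge). The sketch below imitates that cutoff on the pairs printed for document 0; `filter_topics` is a hypothetical helper and the 0.05 threshold is chosen only for illustration.

```python
def filter_topics(topic_probs, minimum_probability):
    """Keep (topic ID, probability) pairs whose probability reaches the threshold."""
    return [(t, p) for t, p in topic_probs if p >= minimum_probability]

# Topic distribution of document 0, copied from the output above
doc0 = [(6, 0.88426626), (7, 0.028813224), (8, 0.08495126)]
strong_topics = filter_topics(doc0, minimum_probability=0.05)
print(strong_topics)
```

Raising the threshold gives a shorter, easier-to-read list, at the cost of hiding weak topic memberships.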


In [18]:
# the categories of documents
categories = [corpus.categories(fileid) for fileid in corpus.fileids()]

In [19]:
for n in range(0,499,50):
  print('-------------------------')
  print("This is document:",n)
  # nth document's topic distribution
  print(ldamodel.get_document_topics(corpus_[n]))

  # nth document's category
  print(categories[n])

  # show the original document
  print(" ".join(docs[n]))

-------------------------
This is document: 0
[(6, 0.8917124), (7, 0.030430581), (8, 0.07588768)]
['news']
-------------------------
This is document: 50
[(4, 0.030092781), (6, 0.5539805), (8, 0.41312146)]
['editorial']
-------------------------
This is document: 100
[(4, 0.2026865), (8, 0.7945499)]
['religion']
If we look about the world today , we can see clearly that there are two especially significant factors shaping the future of our civilization : science and religion . Science is placing in our hands the ultimate power of the universe , the power of the atom . Religion , or the lack of it , will decide whether we use this power to build a brave new world of peace and abundance for all mankind , or whether we misuse this power to leave a world utterly destroyed . How can we have the wisdom to meet such a new and difficult challenge ? ? We may feel pessimistic at the outlook . And yet there is a note of hope , because this same science that is giving us the power of the atom is a
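One way to read the output above is to reduce each document to its most probable topic and group those topics by category. The sketch below does this for the three documents shown, using the values copied from the output; the variable names are hypothetical, and three documents are far too few to draw conclusions from.

```python
from collections import defaultdict

# (category, topic distribution) pairs copied from the output above
samples = [
    ("news", [(6, 0.8917124), (7, 0.030430581), (8, 0.07588768)]),
    ("editorial", [(4, 0.030092781), (6, 0.5539805), (8, 0.41312146)]),
    ("religion", [(4, 0.2026865), (8, 0.7945499)]),
]

topics_by_category = defaultdict(list)
for category, topic_probs in samples:
    # the dominant topic is the pair with the highest probability
    top_topic, _ = max(topic_probs, key=lambda pair: pair[1])
    topics_by_category[category].append(top_topic)

print(dict(topics_by_category))
```

Applied to all 500 documents, the same grouping would show whether genres map onto distinct topics or share them.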

## Visualization
Analyze our result through visualization.

In [20]:
import pyLDAvis.gensim_models
pyLDAvis.enable_notebook()



In [21]:
# visualization takes about 20 minutes if the model is trained on all documents
# note that gensim numbers topics from 0 to K-1, while pyLDAvis numbers them from 1 to K


lda_display = pyLDAvis.gensim_models.prepare(ldamodel, corpus_, dictionary, sort_topics=False)
pyLDAvis.display(lda_display)

# Content
## Summary of your dataset
I use the Brown corpus from NLTK. The corpus consists of 500 samples, distributed across 15 genres in rough proportion to the amount published in 1961 in each of those genres. All works sampled were published in 1961; as far as could be determined, they were first published then and were written by native speakers of American English.

Each sample began at a random sentence-boundary in the article or other unit chosen, and continued up to the first sentence boundary after 2,000 words. In a very few cases miscounts led to samples being just under 2,000 words.

The original data entry was done on upper-case only keypunch machines; capitals were indicated by a preceding asterisk, and various special items such as formulae also had special codes.

The corpus originally (1961) contained 1,014,312 words sampled from 15 text categories:

A. PRESS: Reportage (44 texts)

    1 Political

    2 Sports

    3 Society

    4 Spot News

    5 Financial

    6 Cultural

B. PRESS: Editorial (27 texts)

    1 Institutional Daily

    2 Personal

    3 Letters to the Editor

C. PRESS: Reviews (17 texts)

    1 Theatre

    2 Books

    3 Music

    4 Dance

D. RELIGION (17 texts)

    1 Books

    2 Periodicals

    3 Tracts

E. SKILLS AND HOBBIES (36 texts)

    1 Books

    2 Periodicals

F. POPULAR LORE (48 texts)

    1 Books

    2 Periodicals

G. BELLES-LETTRES - Biography, Memoirs, etc. (75 texts)

    1 Books

    2 Periodicals

H. MISCELLANEOUS: US Government & House Organs (30 texts)

    1 Government Documents

    2 Foundation Reports

    3 Industry Reports

    4 College Catalog

    5 Industry House organ

J. LEARNED (80 texts)

    1 Natural Sciences

    2 Medicine

    3 Mathematics

    4 Social and Behavioral Sciences

    5 Political Science, Law, Education

    6 Humanities

    7 Technology and Engineering

K. FICTION: General (29 texts)

    1 Novels

    2 Short Stories

L. FICTION: Mystery and Detective Fiction (24 texts)

    1 Novels

    2 Short Stories

M. FICTION: Science (6 texts)

    1 Novels

    2 Short Stories

N. FICTION: Adventure and Western (29 texts)

    1 Novels

    2 Short Stories

P. FICTION: Romance and Love Story (29 texts)

    1 Novels

    2 Short Stories

R. HUMOR (9 texts)

    1 Novels

    2 Essays, etc.

## Analysis results
### Top 10 keywords for each topic
(0, '0.004*"house" + 0.003*"unite" + 0.003*"point" + 0.002*"government" + 0.002*"child" + 0.002*"place" + 0.002*"system" + 0.002*"shall" + 0.002*"around" + 0.002*"left"')

(1, '0.003*"american" + 0.003*"program" + 0.002*"head" + 0.002*"right" + 0.002*"three" + 0.002*"found" + 0.002*"system" + 0.002*"show" + 0.002*"problem" + 0.002*"place"')

(2, '0.004*"line" + 0.002*"point" + 0.002*"system" + 0.002*"group" + 0.002*"house" + 0.002*"church" + 0.002*"week" + 0.002*"age" + 0.002*"provide" + 0.002*"present"')

(3, '0.003*"church" + 0.002*"program" + 0.002*"great" + 0.002*"end" + 0.002*"place" + 0.002*"war" + 0.002*"three" + 0.002*"call" + 0.002*"social" + 0.002*"old"')

(4, '0.003*"place" + 0.003*"problem" + 0.003*"government" + 0.002*"program" + 0.002*"church" + 0.002*"interest" + 0.002*"old" + 0.002*"show" + 0.002*"increase" + 0.002*"change"')

(5, '0.002*"program" + 0.002*"government" + 0.002*"force" + 0.002*"american" + 0.002*"city" + 0.002*"place" + 0.002*"right" + 0.002*"system" + 0.002*"house" + 0.002*"line"')

(6, '0.003*"area" + 0.003*"class" + 0.003*"place" + 0.002*"great" + 0.002*"general" + 0.002*"per" + 0.002*"program" + 0.002*"show" + 0.002*"water" + 0.002*"church"')

(7, '0.003*"area" + 0.003*"city" + 0.003*"court" + 0.003*"john" + 0.002*"interest" + 0.002*"group" + 0.002*"president" + 0.002*"old" + 0.002*"house" + 0.002*"call"')

(8, '0.003*"place" + 0.003*"show" + 0.003*"call" + 0.002*"home" + 0.002*"old" + 0.002*"problem" + 0.002*"foam" + 0.002*"area" + 0.002*"house" + 0.002*"course"')

(9, '0.003*"american" + 0.003*"interest" + 0.003*"great" + 0.002*"old" + 0.002*"place" + 0.002*"brown" + 0.002*"call" + 0.002*"system" + 0.002*"john" + 0.002*"fact"')

(10, '0.003*"child" + 0.003*"american" + 0.003*"program" + 0.003*"show" + 0.002*"ask" + 0.002*"car" + 0.002*"head" + 0.002*"place" + 0.002*"old" + 0.002*"right"')

(11, '0.003*"place" + 0.002*"system" + 0.002*"member" + 0.002*"child" + 0.002*"right" + 0.002*"point" + 0.002*"great" + 0.002*"american" + 0.002*"area" + 0.002*"john"')

(12, '0.003*"house" + 0.003*"american" + 0.002*"place" + 0.002*"great" + 0.002*"child" + 0.002*"general" + 0.002*"three" + 0.002*"head" + 0.002*"government" + 0.002*"public"')

(13, '0.004*"right" + 0.004*"old" + 0.003*"around" + 0.003*"miss" + 0.003*"thought" + 0.003*"ask" + 0.003*"turn" + 0.003*"eyes" + 0.003*"woman" + 0.002*"house"')

(14, '0.004*"church" + 0.003*"great" + 0.003*"show" + 0.003*"place" + 0.002*"point" + 0.002*"catholic" + 0.002*"child" + 0.002*"house" + 0.002*"system" + 0.002*"upon"')

### Choose 10 documents and show their topic distribution.
Run the notebook above to see the distributions; the output is too long to include here.

### Compare the results with different topic numbers.
You can also change the `num_topics` parameter in the code above. I found that the fewer the topics, the more the topics differ from one another. However, that does not necessarily mean a better result, because with fewer topics the noise affects the result more.
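The difference between topics can be given a rough number. The sketch below computes the Jaccard similarity between the top-10 keyword sets of topics 13 and 14 from the list above; a lower value means less overlap. Using Jaccard similarity here is my own choice of measure, not something the notebook computes, and the function name `jaccard` is hypothetical.

```python
def jaccard(words_a, words_b):
    """Jaccard similarity of two keyword collections: |A & B| / |A | B|."""
    set_a, set_b = set(words_a), set(words_b)
    return len(set_a & set_b) / len(set_a | set_b)

# Top 10 keywords of topics 13 and 14, copied from the list above
topic_13 = ["right", "old", "around", "miss", "thought",
            "ask", "turn", "eyes", "woman", "house"]
topic_14 = ["church", "great", "show", "place", "point",
            "catholic", "child", "house", "system", "upon"]

overlap = jaccard(topic_13, topic_14)
print(round(overlap, 3))
```

Averaging this value over all topic pairs for several settings of `num_topics` would give a simple way to check the observation about topic count and topic difference.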

# Consideration
I think this dataset is not very suitable for content analysis, because its documents are classified by genre rather than by content. In addition, the dataset contains so much noise that I could not find all of it while adding noise words to the stopword list.