===========================================


Title: 7.2 Exercises


Author: Chad Wood


Date: 31 Jan 2022


Modified By: Chad Wood


Description: This program demonstrates the use of different methods for key phrase extraction (Lord of The Rings Dialog) and topic modeling (research articles) to build and test a machine learning model that performs topic summarization on text documents. The model uses Latent Semantic Indexing.


=========================================== 

In [1]:
import pandas as pd

lotr_script = pd.read_csv('data/lotr_scripts.csv', index_col=0)

In [2]:
lotr_script.head()

Unnamed: 0,char,dialog,movie
0,DEAGOL,"Oh Smeagol Ive got one! , Ive got a fish Smeag...",The Return of the King
1,SMEAGOL,"Pull it in! Go on, go on, go on, pull it in!",The Return of the King
2,DEAGOL,Arrghh!,The Return of the King
3,SMEAGOL,Deagol!,The Return of the King
4,SMEAGOL,Deagol!,The Return of the King


### Keyphrase Extraction

<b><i>* Collocations</i></b>

In [3]:
from nltk.corpus import gutenberg
import lib.normalizer as nm
import nltk
from operator import itemgetter

norm_lotr = nm.Normalizer(lotr_script.dialog)
norm_lotr = norm_lotr.normalize(
    strip_html=True, remove_special_chars=True, 
    remove_digits=True, remove_stopwords=True,
    remove_accented_chars=True, expand_contractions=True,
    text_lower=True, to_str=True)

# Compare first lines
print(lotr_script.dialog[0], '\n', norm_lotr[0])

Oh Smeagol Ive got one! , Ive got a fish Smeagol, Smeagol!     
 oh smeagol ive got one ive got fish smeagol smeagol


In [4]:
# Builds n-grams by zipping n shifted copies of the sequence
def compute_ngrams(sequence, n): # n -> degree of n-gram
    return list(
        zip(*(sequence[index:]
              for index in range(n)))
    )

# Creates single body of text
def flatten_corpus(corpus):
    return ' '.join([document.strip()
                     for document in corpus])

def get_top_ngrams(corpus, ngram_val=1, limit=5):
    corpus = flatten_corpus(corpus)
    tokens = nltk.word_tokenize(corpus)
    ngrams = compute_ngrams(tokens, ngram_val)
    ngrams_freq_dist = nltk.FreqDist(ngrams) # Counts occurrences of each n-gram
    sorted_ngrams_fd = sorted(ngrams_freq_dist.items(), # Sorts by frequency, descending
                              key=itemgetter(1), reverse=True)
    sorted_ngrams = sorted_ngrams_fd[0:limit] # Keeps the top `limit` n-grams
    sorted_ngrams = [(' '.join(text), freq) # Joins each token tuple into a phrase string
                     for text, freq in sorted_ngrams]
    
    return sorted_ngrams


top_bigrams = get_top_ngrams(corpus=norm_lotr, ngram_val=2, limit=10)
top_trigrams = get_top_ngrams(corpus=norm_lotr, ngram_val=3, limit=10)

top_bigrams, top_trigrams

([('can not', 75),
  ('mr frodo', 48),
  ('let us', 18),
  ('minas tirith', 16),
  ('middle earth', 12),
  ('let go', 11),
  ('frodo frodo', 10),
  ('must go', 9),
  ('helms deep', 9),
  ('peregrin took', 9)],
 [('grond grond grond', 7),
  ('can not hold', 5),
  ('can not get', 5),
  ('death death death', 4),
  ('han mathon ne', 4),
  ('heh heh heh', 4),
  ('hmm hmm hmm', 4),
  ('baggins baggins baggins', 4),
  ('hold mr frodo', 3),
  ('can not leave', 3)])
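
The zip-based n-gram construction used above can be seen on a toy token list. This is a minimal sketch that swaps `nltk.FreqDist` for `collections.Counter` so it runs without NLTK; the names `ngrams` and `toy_tokens` are illustrative and not part of the notebook.

```python
from collections import Counter

def ngrams(tokens, n):
    # zip over n shifted views of the token list; zip stops at the shortest
    # view, so exactly len(tokens) - n + 1 n-grams are produced
    return list(zip(*(tokens[i:] for i in range(n))))

# Hypothetical tokens, not taken from the script data
toy_tokens = "let us go let us stay".split()

bigram_counts = Counter(ngrams(toy_tokens, 2))
top_bigram = bigram_counts.most_common(1)[0]
print(top_bigram)  # (('let', 'us'), 2)
```

Because the corpus is flattened into one token stream first, n-grams can span the boundary between two lines of dialog, which may explain repeated-word entries such as 'frodo frodo'.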

<b><i>* Weighted Tag-Based Phrase Extraction</i></b>

In [40]:
norm_sentences = nm.Normalizer(lotr_script.dialog)
norm_sentences = norm_sentences.normalize(
    strip_html=True, remove_special_chars=True, 
    remove_digits=True, remove_accented_chars=True, 
    expand_contractions=True, to_str=True)

# Splits each line of dialog into sentences
tok_sentences = lotr_script.dialog.apply(lambda x: nltk.sent_tokenize(str(x)))

# Returns sentences longer than 15 words for better data
# long_sents = tok_sentences.loc[tok_sentences.apply(lambda x: len(' '.join(x).split()) > 15)]

In [41]:
import itertools
stopwords = nltk.corpus.stopwords.words('english')

# Returns noun phrase chunks from text list
def get_chunks(sentences, grammar=r'NP: {<DT>? <JJ>* <NN.*>+}', 
               stopword_list=stopwords):
    
    all_chunks = []
    chunker = nltk.chunk.regexp.RegexpParser(grammar)

    for sentence in sentences:
        tagged_sents = [nltk.pos_tag(nltk.word_tokenize(sentence))] # Tags POS in tokenized text

        # Chunks text by regex
        chunks = [chunker.parse(tagged_sent)
                  for tagged_sent in tagged_sents]
        # Returns list of (word, tag, IOB-tag)
        wtc_sents = [nltk.chunk.tree2conlltags(chunk)
                     for chunk in chunks]    
        # Flattens list
        flattened_chunks = list(
            itertools.chain.from_iterable(
                wtc_sent for wtc_sent in wtc_sents)
        )        
        # Groups consecutive tokens by whether they fall inside a chunk (IOB tag other than 'O')
        valid_chunks_tagged = [(status, [wtc for wtc in chunk])
                           for status, chunk
                           in itertools.groupby(flattened_chunks,
                                                lambda word_pos_chunk: word_pos_chunk[2] != 'O')] 
        # Filters stopwords
        valid_chunks = [' '.join(word.lower()
                                 for word, tag, chunk in wtc_group
                                 if word.lower() not in stopword_list)
                        for status, wtc_group in valid_chunks_tagged if status]     
        
        all_chunks.append(valid_chunks)
        
    return all_chunks
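
The grouping step inside `get_chunks` can be isolated from the tagger. The sketch below assumes hand-written (word, POS tag, IOB tag) triples shaped like the output of `tree2conlltags`; the sentence and the `toy_stopwords` set are hypothetical.

```python
from itertools import groupby

# Hypothetical (word, POS tag, IOB tag) triples, shaped like tree2conlltags output
tagged = [
    ("the", "DT", "B-NP"), ("old", "JJ", "I-NP"), ("wizard", "NN", "I-NP"),
    ("rides", "VBZ", "O"),
    ("a", "DT", "B-NP"), ("white", "JJ", "I-NP"), ("horse", "NN", "I-NP"),
]
toy_stopwords = {"the", "a"}

phrases = []
# groupby yields runs of consecutive tokens that share the same key:
# True for tokens inside a chunk, False for tokens tagged 'O'
for in_chunk, group in groupby(tagged, key=lambda wtc: wtc[2] != "O"):
    if in_chunk:
        words = [w.lower() for w, _, _ in group if w.lower() not in toy_stopwords]
        phrases.append(" ".join(words))

print(phrases)  # ['old wizard', 'white horse']
```

Since the key only tests for 'O', two noun phrases that sit directly next to each other would be merged into a single phrase by this grouping.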

In [44]:
# Weighted Tag-Based Phrase Extraction
from gensim import corpora, models

def get_tfidf_weighted_keyphrases(sentences, 
                                  grammar=r'NP: {<DT>? <JJ>* <NN.*>+}', 
                                  top_n=10):
    # Flattens series to list of str
    sentences = [item for sublist in sentences for item in sublist]
    # Chunks text to noun phrases
    valid_chunks = get_chunks(sentences, grammar=grammar)
    # Maps norm words to integer IDs
    dictionary = corpora.Dictionary(valid_chunks)
    # Bag of Words
    corpus = [dictionary.doc2bow(chunk) for chunk in valid_chunks]
    tfidf = models.TfidfModel(corpus)
    corpus_tfidf = tfidf[corpus]
    # Maps dictionary IDs back to phrases; a repeated phrase keeps its last weight
    weighted_phrases = {dictionary.get(idx): value
                        for doc in corpus_tfidf
                        for idx, value in doc}
    # Sorts by weight, descending
    weighted_phrases = sorted(weighted_phrases.items(),
                              key=itemgetter(1), reverse=True)
    weighted_phrases = [(term, round(wt, 3)) for term, wt in weighted_phrases]
    
    return weighted_phrases[:top_n]

In [47]:
get_tfidf_weighted_keyphrases(sentences=tok_sentences, top_n=5)

[('smeagol ive', 1.0),
 ('arrghh', 1.0),
 ('murderer', 1.0),
 ('name', 1.0),
 ('oooohhh', 1.0)]
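
One possible reading of the uniform 1.0 scores is that each of these phrases is the only chunk in its sentence. The sketch below assumes the weighting is raw term frequency times log2(N / df), followed by unit-length (L2) normalization, which is my understanding of the `TfidfModel` defaults. It is written in plain Python rather than gensim, and the chunk lists are hypothetical.

```python
import math
from collections import Counter

# Hypothetical chunk lists, one per sentence (not taken from the script data)
docs = [["ring"], ["ring", "dark lord"], ["fish"]]

n_docs = len(docs)
# Document frequency: number of sentences in which each chunk appears
doc_freq = Counter(term for doc in docs for term in set(doc))

def tfidf_unit_length(doc):
    # Raw term frequency times log2(N / df), then L2-normalized
    tf = Counter(doc)
    weights = {t: tf[t] * math.log2(n_docs / doc_freq[t]) for t in tf}
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {t: w / norm for t, w in weights.items()} if norm else {}

single = tfidf_unit_length(docs[2])  # sentence with one chunk
pair = tfidf_unit_length(docs[1])    # sentence with two chunks
print(single)
print(pair)
```

Under these assumptions a sentence with a single chunk always scores 1.0 after normalization, so short exclamations can outrank more informative phrases. Filtering for longer sentences, as in the commented-out `long_sents` line, is one way to reduce that effect.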

### Topic Modeling

In [52]:
import os
import numpy as np
import pandas as pd

data_PATH = 'data/nipstxt/'
folders = ['nips{0:02}'.format(i) for i in range(0,13)]

articles = []
for folder in folders:
    file_names = os.listdir(data_PATH+folder)
    for file_name in file_names:
        # Read-only mode is sufficient; undecodable bytes are skipped
        with open(data_PATH + folder+'/'+file_name, encoding='utf-8',
                  errors='ignore',mode='r') as f:
            data = f.read()
        articles.append(data)

len(articles)

1740

In [101]:
stop_words = nltk.corpus.stopwords.words('english')
wtk = nltk.tokenize.RegexpTokenizer(r'\w+')
wnl = nltk.stem.wordnet.WordNetLemmatizer()

# Normalizes data 
def normalizer_toker(articles):
    norm_articles = []
    for article in articles:
        article = article.lower()
        article_tokens = [token.strip() for token in wtk.tokenize(article)]
        # Drops numeric tokens, single-character tokens, and stopwords
        article_tokens = [token for token in article_tokens if 
                          not token.isnumeric() and len(token) > 1 and token not in stop_words]
        # Drops any remaining empty tokens
        article_tokens = list(filter(None, article_tokens))
        
        if article_tokens:
            norm_articles.append(article_tokens)
            
    return norm_articles    

norm_articles = normalizer_toker(articles)
len(norm_articles)

1740

<b><i>* Text Representation with Feature Engineering</i></b>

In [106]:
# Attempts to extract useful bigrams and drop uninformative ones
import gensim

# Textbook uses delimiter=b'_' (bytes),
# ...which appears to fail on recent gensim releases that expect a str delimiter
bigram = gensim.models.Phrases(norm_articles, min_count=20, threshold=20, 
                               delimiter='_')
bigram_model = gensim.models.phrases.Phraser(bigram)

# Sample
print(bigram_model[norm_articles[0]][:50])

['connectivity', 'versus', 'entropy', 'yaser', 'abu_mostafa', 'california_institute', 'technology_pasadena', 'ca_abstract', 'connectivity', 'neural_network', 'number', 'synapses', 'per', 'neuron', 'relate', 'complexity', 'problems', 'handle', 'measured', 'entropy', 'switching', 'theory', 'would', 'suggest', 'relation', 'since', 'boolean_functions', 'implemented', 'using', 'circuit', 'low', 'connectivity', 'using', 'two', 'input', 'nand', 'gates', 'however', 'network', 'learns', 'problem', 'examples', 'using', 'local', 'learning', 'rule', 'prove', 'entropy', 'problem', 'becomes']
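
To give a feel for how `min_count` and `threshold` interact, the sketch below computes a bigram score of the form (count(a, b) - min_count) * |vocab| / (count(a) * count(b)). This is my understanding of the default scorer used by `Phrases`, so treat the formula as an assumption. The function `phrase_score` and all counts are hypothetical, not measured on the NIPS corpus.

```python
def phrase_score(bigram_count, count_a, count_b, vocab_size, min_count):
    # Assumed default score: rewards pairs that co-occur far more often
    # than the individual word frequencies would suggest
    return (bigram_count - min_count) * vocab_size / (count_a * count_b)

# Hypothetical counts
vocab_size = 50_000
threshold = 20

# Two rare words that almost always appear together
strong = phrase_score(bigram_count=120, count_a=150, count_b=400,
                      vocab_size=vocab_size, min_count=20)
# Two very common words that co-occur the same number of times
weak = phrase_score(bigram_count=120, count_a=9_000, count_b=12_000,
                    vocab_size=vocab_size, min_count=20)

print(round(strong, 2), round(weak, 4))
print(strong > threshold, weak > threshold)
```

Only pairs scoring above `threshold` would be joined with the delimiter, which is consistent with names like 'abu_mostafa' being merged while frequent word pairs stay separate.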


In [112]:
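# Applies the bigram model to every tokenized document, then maps each term or phrase to an integer ID
# The integer IDs allow the corpus to be represented numerically for ML

norm_corpus_bigrams = [bigram_model[doc] for doc in norm_articles]

# Creates a dictionary mapping each unique term/phrase to an ID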
dictionary = gensim.corpora.Dictionary(norm_corpus_bigrams)

print(f'Sample: {list(dictionary.items())[10:20]}')
print(f'Total: {len(dictionary)}')

Sample: [(10, 'abu_mostafa'), (11, 'access'), (12, 'accommodate'), (13, 'according'), (14, 'accumulated'), (15, 'acknowledgement_work'), (16, 'addison_wesley'), (17, 'afosr'), (18, 'aip'), (19, 'air_force')]
Total: 82825


In [113]:
# Filters tokens at both extremes of document frequency:
# ... those in fewer than 20 docs or in more than 60% of docs
# This keeps terms that are common yet still document-specific
dictionary.filter_extremes(no_below=20, no_above=0.6)
print(f'Total: {len(dictionary)}')

Total: 8684
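
The effect of `no_below` and `no_above` can be reproduced on a toy corpus. This is a plain-Python sketch of document-frequency filtering, not gensim's implementation: the helper `filter_by_doc_freq` and the documents are hypothetical, and it ignores any additional limits the real method may apply.

```python
from collections import Counter

# Hypothetical tokenized documents
docs = [
    ["network", "training", "error"],
    ["network", "neuron", "spike"],
    ["network", "training", "policy"],
    ["network", "image", "training"],
]

def filter_by_doc_freq(docs, no_below, no_above):
    # Document frequency: number of documents containing each token
    df = Counter(tok for doc in docs for tok in set(doc))
    max_docs = no_above * len(docs)
    # Keep tokens that are neither too rare nor too widespread
    return {tok for tok, n in df.items() if no_below <= n <= max_docs}

kept = filter_by_doc_freq(docs, no_below=2, no_above=0.8)
print(sorted(kept))  # ['training']
```

Here 'network' is dropped for appearing in every document and the one-off terms are dropped for being too rare, leaving only 'training'.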


In [114]:
# Creates Bag of Words vector
bow_corpus = [dictionary.doc2bow(text) for text in norm_corpus_bigrams] # Returns (tok_id, count)
print(f'Sample: {bow_corpus[1][:30]}') 

Sample: [(12, 3), (14, 1), (15, 1), (16, 1), (17, 16), (20, 1), (24, 1), (26, 1), (31, 3), (35, 1), (36, 1), (40, 3), (41, 5), (42, 1), (49, 1), (54, 3), (55, 1), (57, 1), (60, 1), (62, 3), (65, 5), (66, 4), (67, 2), (76, 1), (77, 1), (78, 1), (80, 3), (86, 1), (87, 4), (88, 1)]


<b><i>* Latent Semantic Indexing</i></b>

In [115]:
# LSI assumes that words used in the same contexts have similar meanings

TOPICS = 10 # Number of topics to extract from the corpus
lsi_bow_model = gensim.models.LsiModel(bow_corpus, id2word=dictionary, # Builds model for LSI
                                       num_topics=TOPICS, onepass=True, chunksize=1740, power_iters=1000)

In [116]:
# Prints 10 topics with 20 words from each topic
for topic_id, topic in lsi_bow_model.print_topics(num_topics=10, num_words=20):
    print(f'Topic #{str(topic_id+1)}:\n{topic}\n')

Topic #1:
0.213*"training" + 0.147*"error" + 0.140*"state" + 0.126*"units" + 0.125*"models" + 0.122*"weights" + 0.099*"parameters" + 0.098*"method" + 0.095*"unit" + 0.094*"neurons" + 0.093*"layer" + 0.092*"noise" + 0.090*"linear" + 0.089*"image" + 0.089*"vector" + 0.086*"patterns" + 0.084*"neural_network" + 0.083*"functions" + 0.082*"control" + 0.080*"neuron"

Topic #2:
0.297*"neurons" + -0.252*"training" + 0.248*"neuron" + 0.229*"cells" + 0.220*"cell" + -0.179*"error" + 0.157*"response" + 0.152*"activity" + 0.144*"visual" + 0.138*"stimulus" + 0.112*"motion" + 0.112*"synaptic" + 0.104*"firing" + -0.100*"class" + 0.095*"neural" + -0.092*"method" + 0.092*"cortical" + 0.091*"circuit" + 0.090*"layer" + 0.086*"synapses"

Topic #3:
-0.488*"state" + 0.245*"training" + 0.225*"image" + -0.195*"control" + 0.193*"units" + -0.190*"states" + 0.147*"images" + 0.133*"layer" + -0.129*"action" + 0.126*"features" + -0.125*"policy" + 0.114*"unit" + -0.113*"optimal" + -0.113*"neuron" + 0.107*"recognition"

In [117]:
# Attempts to separate subthemes within each topic by the sign of the term weights,
# treating positive and negative weights as two directions in the vector space
for n in range(TOPICS):
    print(f'Topic #{str(n+1)}:')
    print('*'*50)
    d1 = []
    d2 = []
    for term, weight in lsi_bow_model.show_topic(n, topn=20): #topn = number of words to be included
        if weight >= 0: # Positively orientated VS
            d1.append((term, round(weight, 3)))
        else: # Negatively orientated VS
            d2.append((term, round(weight, 3)))
    print(f'Direction 1: {d1}')
    print('-'*50)
    print(f'Direction 2: {d2}')
    print('-'*50)


Topic #1:
**************************************************
Direction 1: [('training', 0.213), ('error', 0.147), ('state', 0.14), ('units', 0.126), ('models', 0.125), ('weights', 0.122), ('parameters', 0.099), ('method', 0.098), ('unit', 0.095), ('neurons', 0.094), ('layer', 0.093), ('noise', 0.092), ('linear', 0.09), ('image', 0.089), ('vector', 0.089), ('patterns', 0.086), ('neural_network', 0.084), ('functions', 0.083), ('control', 0.082), ('neuron', 0.08)]
--------------------------------------------------
Direction 2: []
--------------------------------------------------
Topic #2:
**************************************************
Direction 1: [('neurons', 0.297), ('neuron', 0.248), ('cells', 0.229), ('cell', 0.22), ('response', 0.157), ('activity', 0.152), ('visual', 0.144), ('stimulus', 0.138), ('motion', 0.112), ('synaptic', 0.112), ('firing', 0.104), ('neural', 0.095), ('cortical', 0.092), ('circuit', 0.091), ('layer', 0.09), ('synapses', 0.086)]
-----------------------------

In [120]:
# Retrieves the SVD matrices (U, S, VT) from the fitted model

term_topic = lsi_bow_model.projection.u
singular_values = lsi_bow_model.projection.s
topic_document = (gensim.matutils.corpus2dense(lsi_bow_model[bow_corpus],
                 len(singular_values)).T / singular_values).T

term_topic.shape, singular_values.shape, topic_document.shape

((8684, 10), (10,), (10, 1740))
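
The division by the singular values can be checked against a plain NumPy SVD. The sketch below assumes that projecting a document into the topic space amounts to multiplying by U transposed, which yields S times VT; the small term-document matrix `A` is hypothetical.

```python
import numpy as np

# Hypothetical term-document count matrix: 5 terms (rows) x 4 documents (columns)
A = np.array([[2., 0., 1., 0.],
              [1., 0., 0., 1.],
              [0., 3., 0., 1.],
              [0., 1., 2., 0.],
              [1., 0., 0., 2.]])
k = 2  # number of topics to keep

U, s, VT = np.linalg.svd(A, full_matrices=False)
U_k, s_k, VT_k = U[:, :k], s[:k], VT[:k, :]

# Projecting the documents onto the top-k left singular vectors gives diag(s_k) @ VT_k
projected = U_k.T @ A

# Dividing each topic row by its singular value recovers VT_k, the topic-document matrix
topic_document = (projected.T / s_k).T

print(U_k.shape, s_k.shape, topic_document.shape)  # (5, 2) (2,) (2, 4)
print(np.allclose(topic_document, VT_k))           # True
```

The three shapes mirror the notebook output: terms by topics, one singular value per topic, and topics by documents.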

In [121]:
# Shows the weight of each topic in each doc; larger magnitudes indicate greater significance

document_topics = pd.DataFrame(np.round(topic_document.T, 3),
                              columns=['T'+str(i) for i in range(1, TOPICS+1)])
document_topics.head()

Unnamed: 0,T1,T2,T3,T4,T5,T6,T7,T8,T9,T10
0,0.015,0.012,-0.014,0.014,-0.016,-0.004,0.002,0.026,0.007,-0.021
1,0.038,0.032,-0.017,0.045,-0.001,0.006,0.059,0.04,-0.048,-0.006
2,0.025,-0.001,-0.021,0.021,-0.001,0.011,0.019,0.011,0.0,-0.018
3,0.027,0.026,-0.006,0.027,-0.009,-0.015,-0.004,0.042,-0.01,-0.031
4,0.036,0.002,-0.018,0.022,-0.021,0.042,0.027,0.05,0.036,0.042


In [126]:
# Ranks topics by absolute weight (ignoring +/- orientation), then prints the start
# ... of each doc next to its top topics for a visual accuracy check

doc_numbers = [13, 250, 500]

for doc_number in doc_numbers:
    top_topics = list(document_topics.columns[np.argsort(-
                                                        np.absolute(
                                                        document_topics.iloc[
                                                            doc_number].values))[:3]])
    print(f'Doc # {str(doc_number)}:')
    print(f'Top 3 Topics: {top_topics}')
    print(f'Article Summary: \n{articles[doc_number][:500]}')
    print('-'*50,'\n')

Doc # 13:
Top 3 Topics: ['T8', 'T3', 'T9']
Article Summary: 
137 
On the 
Power of Neural Networks for 
Solving Hard Problems 
Jehoshua Bruck 
Joseph W. Goodman 
Information Systems Laboratory 
Department of Electrical Engineering 
Stanford University 
Stanford, CA 94305 
Abstract 
This paper deals with a neural network model in which each neuron 
performs a threshold logic function. An important property of the model 
is that it always converges to a stable state when operating in a serial 
mode [2,5]. This property is the basis of the potential applicat
-------------------------------------------------- 

Doc # 250:
Top 3 Topics: ['T10', 'T1', 'T6']
Article Summary: 
542 Kassebaum, Tenorio and Schaefers 
The Cocktail Party Problem: 
Speech/Data Signal Separation Comparison 
between Backpropagation and SONN 
John Kassebaum 
jakec.ecn.purdue.edu 
Manoel Fernando Tenorio 
tenorioee.ecn.purdue.edu 
Chrlstoph Schaefers 
Parallel Distributed Structures Laboratory 
School of Electrical En
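
The ranking by absolute weight can be traced by hand on a single row. This sketch uses the weights of document 0 as printed in the `document_topics` table and replaces `np.argsort` with a plain `sorted` call; the variable names are illustrative.

```python
# Weights for document 0, copied from the document_topics table
weights = [0.015, 0.012, -0.014, 0.014, -0.016, -0.004, 0.002, 0.026, 0.007, -0.021]
labels = [f"T{i}" for i in range(1, len(weights) + 1)]

# Rank topics by magnitude, ignoring the sign of each weight
ranked = sorted(zip(labels, weights), key=lambda pair: abs(pair[1]), reverse=True)
top_topics = [label for label, _ in ranked[:3]]

print(top_topics)  # ['T8', 'T10', 'T5']
```

Two of the three top topics for this document carry negative weights, so a ranking on signed values alone would have missed them.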

<i>As we can see, document 13 discusses neurons and neural networks, and the model summarized it with topics ['T8', 'T3', 'T9']. Comparing these with the earlier cell that separates each topic's terms by direction in the vector space suggests that our topic summarization model is visually effective. The same holds for our other two tests, doc # 250 and doc # 500.</i>