# Modern Data Science 
**(Module 05: Deep Learning)**

---
- Materials in this module include resources collected from various open-source online repositories.
- You are free to use, change and distribute this package.

Prepared by and for 
**Student Members** |
2006-2018 [TULIP Lab](http://www.tulip.org.au), Australia

---


# Session E - Text Analytics (2) : Text Summarization and Information Extraction

**This session introduces techniques for working with textual data, which is produced in enormous volumes every day. In this practical session, we cover the following topics:**

1. A review of text pre-processing techniques, also called text normalization, which use a variety of methods to convert raw text into well-defined sequences of linguistic components with standard structure and notation.

2. Topic modelling.

3. Automated text summarization.

**References and additional reading and resources**
- [Introduction to Topic Modeling in Python](http://chdoig.github.io/pygotham-topic-modeling/#/)

---





## 0. Text Preprocessing (or Normalization) revisited

In the previous session, we went through the basic text pre-processing operations used in text analytics applications, including:
- Expanding contractions
- Lemmatization
- Removing special characters and symbols
- Removing stopwords


In [None]:
from contractions import CONTRACTION_MAP
import re # regular expression lib
# This routine expands the contractions in a text using pre-defined contractions and rules.
# See the list of English contractions at https://en.wikipedia.org/wiki/Wikipedia:List_of_English_contractions.
# The pre-defined contractions are stored as CONTRACTION_MAP in the contractions.py file.

# Find every contraction in the text and replace it using expand_match() below
def expand_contractions(text, contraction_mapping):
    contractions_pattern = re.compile('({})'.format('|'.join(contraction_mapping.keys())),flags=re.IGNORECASE|re.DOTALL)
    # this function returns each expanded contraction
    def expand_match(contraction):
        match = contraction.group(0)
        first_char = match[0]

        expanded_contraction = contraction_mapping.get(match)\
                                if contraction_mapping.get(match)\
                                else contraction_mapping.get(match.lower())  

        expanded_contraction = first_char+expanded_contraction[1:]
        return expanded_contraction
    
    expanded_text = contractions_pattern.sub(expand_match, text)
    expanded_text = re.sub("'", "", expanded_text)
    return expanded_text    

from nltk import pos_tag
# nltk.download('averaged_perceptron_tagger')
from nltk.corpus import wordnet as wn
import nltk
import string

# Annotate text tokens with POS tags
from nltk.stem import WordNetLemmatizer
wnl = WordNetLemmatizer()

def pos_tag_text(text):
    
    def to_wn_tags(pos_tag):
        if pos_tag.startswith('J'):
            return wn.ADJ
        elif pos_tag.startswith('V'):
            return wn.VERB
        elif pos_tag.startswith('N'):
            return wn.NOUN
        elif pos_tag.startswith('R'):
            return wn.ADV
        else:
            return None
    
    tagged_text = pos_tag(tokenize_text(text))
    tagged_lower_text = [(word.lower(), to_wn_tags(pos_tag))
                         for word, pos_tag in
                         tagged_text]
    return tagged_lower_text
    
# lemmatize text based on POS tags  
def lemmatize_text(text):
    
    pos_tagged_text = pos_tag_text(text)
    lemmatized_tokens = [wnl.lemmatize(word, pos_tag) if pos_tag else word  for word, pos_tag in pos_tagged_text]
    lemmatized_text = ' '.join(lemmatized_tokens)
    return lemmatized_text

def tokenize_text(text):
    tokens = nltk.word_tokenize(text) 
    tokens = [token.strip() for token in tokens]
    return tokens

def remove_special_characters(text):
    tokens = tokenize_text(text)
    pattern = re.compile('[{}]'.format(re.escape(string.punctuation)))
    filtered_tokens = filter(None, [pattern.sub('', token) for token in tokens])
    filtered_text = ' '.join(filtered_tokens)
    return filtered_text

stopword_list = nltk.corpus.stopwords.words('english')


def remove_stopwords(text):
    tokens = tokenize_text(text)
    filtered_tokens = [token for token in tokens if token not in stopword_list]
    filtered_text = ' '.join(filtered_tokens)    
    return filtered_text

# Combining all steps to normalize corpus
def normalize_corpus(corpus, tokenize=False):    
    normalized_corpus = []    
    for text in corpus: # we will process every document
        text = expand_contractions(text, CONTRACTION_MAP)
        text = lemmatize_text(text)
        text = remove_special_characters(text)
        text = remove_stopwords(text)
        if tokenize:
            text = tokenize_text(text)
        normalized_corpus.append(text)
            
    return normalized_corpus
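
To see the contraction-expansion step in isolation, here is a minimal sketch that does not need the `contractions.py` file. The three-entry `TINY_CONTRACTION_MAP` and the function name `expand_contractions_sketch` are hypothetical stand-ins for illustration only; the real `CONTRACTION_MAP` is much larger.

```python
import re

# Hypothetical three-entry map, standing in for the full CONTRACTION_MAP
TINY_CONTRACTION_MAP = {
    "can't": "cannot",
    "it's": "it is",
    "won't": "will not",
}

def expand_contractions_sketch(text, mapping=TINY_CONTRACTION_MAP):
    # Build one alternation pattern from the (escaped) contraction keys
    pattern = re.compile("|".join(re.escape(key) for key in mapping),
                         flags=re.IGNORECASE)

    def expand_match(match):
        found = match.group(0)
        expanded = mapping[found.lower()]
        # Keep a leading capital letter if the contraction had one
        if found[0].isupper():
            expanded = expanded[0].upper() + expanded[1:]
        return expanded

    return pattern.sub(expand_match, text)

print(expand_contractions_sketch("It's late and I can't stay."))
```

The lookup lower-cases the matched text first, so the map only needs lower-case keys while capitalised contractions at the start of a sentence are still handled.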

## 1. Topic modeling for information extraction

Topic modeling involves extracting features from document terms and using mathematical structures and frameworks such as matrix factorization and SVD to generate clusters or groups of terms that are distinguishable from each other; these clusters of words form topics or concepts. These concepts can be used to interpret the main themes of a corpus and to make semantic connections among words that frequently co-occur in various documents. In this session, we will cover the following two methods:
- Latent semantic indexing/analysis (LSI/A)
- Latent Dirichlet allocation (LDA)

We will leverage gensim and scikit-learn for our practical implementations and also look at how to build our own topic model based on latent semantic indexing. This will give you an idea of how these techniques work and also how to convert mathematical frameworks into practical implementations. We will initially use the following toy corpus to test our topic models:

In [None]:
toy_corpus = ["The fox jumps over the dog",
              "The fox is very clever and quick",
              "The dog is slow and lazy",
              "The cat is smarter than the fox and the dog",
              "Python is an excellent programming language",
              "Java and Ruby are other programming languages",
              "Python and Java are very popular programming languages",
              "Python programs are smaller than Java programs"]

You can see that we have eight documents in the preceding corpus: the first four talk about various animals, and the last four are about programming languages. The corpus therefore contains two distinct topics. We inferred this using our own judgment; the following sections try to extract the same information using computational methods.

### 1.1 Latent Semantic Indexing

In the practical session of Week 8, we demonstrated how to convert textual data to numeric data (e.g. a BOW matrix or a tf-idf feature matrix) using ``sklearn``. In this session, we implement LSI by leveraging **``gensim``** and extract topics from the toy corpus. To start, we load the necessary dependencies and normalize the toy corpus using the following code snippet:

In [None]:
from gensim import corpora, models
import numpy as np

norm_tokenized_corpus = normalize_corpus(toy_corpus, tokenize=True)

norm_tokenized_corpus

We now build a dictionary, or vocabulary, which gensim uses to map each unique term to a numeric ID. Once it is built, we convert the preceding tokenized corpus into a numeric Bag of Words vector representation, where each term in a sentence is represented by a tuple (term ID, frequency), as seen in the following snippets:

In [None]:
dictionary = corpora.Dictionary(norm_tokenized_corpus)
dictionary.token2id

In [None]:
# convert tokenized documents into bag of words vectors
corpus = [dictionary.doc2bow(text) for text in norm_tokenized_corpus]
corpus
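
To make the (term ID, frequency) representation concrete, the following pure-Python sketch mimics what a dictionary plus `doc2bow` produce. It is only an illustration: the three tokenized documents are made up, `doc2bow_sketch` is a hypothetical helper, and the IDs are assigned in order of first appearance, which is an assumption and may differ from the IDs gensim assigns.

```python
from collections import Counter

# Hypothetical pre-tokenized documents (not the notebook's normalized corpus)
docs = [["fox", "jump", "dog"],
        ["fox", "clever", "quick"],
        ["dog", "slow", "lazy", "dog"]]

# Assign an integer id to each unique token, in order of first appearance
token2id = {}
for doc in docs:
    for token in doc:
        token2id.setdefault(token, len(token2id))

def doc2bow_sketch(doc):
    # Count occurrences and return sorted (token_id, count) pairs
    counts = Counter(token2id[token] for token in doc)
    return sorted(counts.items())

bow_corpus = [doc2bow_sketch(doc) for doc in docs]
print(token2id)
print(bow_corpus)
```

Each document becomes a sparse vector: only the terms that actually occur are listed, and a repeated term such as `dog` in the third document gets a count of 2.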

In [None]:
# build tf-idf feature vectors
tfidf = models.TfidfModel(corpus)
corpus_tfidf = tfidf[corpus]

# fix the number of topics
total_topics = 2
# build the topic model
lsi = models.LsiModel(corpus_tfidf,
                      id2word=dictionary,
                      num_topics=total_topics)
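
The tf-idf step in the cell above re-weights the raw counts so that terms occurring in many documents contribute less. The sketch below shows the idea in pure Python under a stated assumption: the weight is count × log2(N / document frequency), followed by scaling each document vector to unit length. We believe this is close to gensim's default behaviour, but treat the exact formula, the made-up `bow_corpus` and the helper name `tfidf_sketch` as assumptions.

```python
import math

# Hypothetical bag-of-words corpus: lists of (token_id, count) pairs
bow_corpus = [[(0, 1), (1, 1)],
              [(0, 1), (2, 2)],
              [(0, 1), (1, 1), (3, 1)]]

num_docs = len(bow_corpus)

# Document frequency: in how many documents each token id occurs
doc_freq = {}
for doc in bow_corpus:
    for token_id, _ in doc:
        doc_freq[token_id] = doc_freq.get(token_id, 0) + 1

def tfidf_sketch(doc):
    # Weight = raw count * log2(N / df), then scale the vector to unit length
    weights = [(tid, count * math.log2(num_docs / doc_freq[tid]))
               for tid, count in doc]
    norm = math.sqrt(sum(w * w for _, w in weights))
    if norm == 0.0:
        return []
    # Terms with zero weight (present in every document) are dropped
    return [(tid, w / norm) for tid, w in weights if w > 0.0]

tfidf_corpus = [tfidf_sketch(doc) for doc in bow_corpus]
for doc in tfidf_corpus:
    print(doc)
```

Token 0 appears in every document, so its weight is zero and it disappears from all three vectors, which is exactly the behaviour we want for uninformative terms.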

Now that our topic modeling framework is built, we can see the generated topics in the following code snippet:

In [None]:
for index, topic in lsi.print_topics(total_topics):
    print('Topic #'+str(index+1))
    print(topic)
    print()

Ignoring the weights at first, you can see that the first topic contains terms related to programming languages and the second topic contains terms related to animals, which is in line with the two main concepts from our toy corpus mentioned earlier. If you now look at the weights, the terms that contribute most to a topic have larger magnitudes and share the same sign.
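
As a reminder of where these weights come from, LSI rests on a truncated singular value decomposition. In the sketch below, $X$ denotes the term-document tf-idf matrix and $k$ the number of topics; the symbol names are our own notation, not something taken from gensim.

$$
X \approx U_k S_k V_k^{T}
$$

Each column of $U_k$ holds the term weights of one topic, $S_k$ is the diagonal matrix of the $k$ largest singular values, and each column of $V_k^{T}$ describes one document in topic space. Flipping the sign of a column of $U_k$ together with the matching column of $V_k$ leaves the product unchanged, so the overall sign of a topic is arbitrary; only the relative signs of the terms within a topic carry meaning.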

Let us now look at the next technique to build topic models using latent Dirichlet allocation.

### 1.2 Latent Dirichlet Allocation

The latent Dirichlet allocation (LDA) technique is a generative probabilistic model in which each document is assumed to be a mixture of topics, similar to a probabilistic latent semantic indexing model, except that in LDA the latent topics have a Dirichlet prior over them.
We use gensim in the following implementation to build an LDA-based topic model:

In [None]:
lda = models.LdaModel(corpus_tfidf,
                      id2word=dictionary,
                      iterations=100,
                      num_topics=total_topics)

lda.show_topics()


We see that the concepts are quite distinct across the two topics, just as before, but note that in this case all the weights are positive, which makes the topics easier to interpret than those from LSI.
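
The generative story that LDA assumes can be written compactly. The notation below ($\alpha$, $\beta$, $\theta_d$, $\phi_t$, $z_{d,n}$, $w_{d,n}$) is the conventional one and does not appear in the code above:

$$
\begin{aligned}
\phi_t &\sim \mathrm{Dirichlet}(\beta) && \text{word distribution of topic } t \\
\theta_d &\sim \mathrm{Dirichlet}(\alpha) && \text{topic distribution of document } d \\
z_{d,n} &\sim \mathrm{Categorical}(\theta_d) && \text{topic of the } n\text{-th word in document } d \\
w_{d,n} &\sim \mathrm{Categorical}(\phi_{z_{d,n}}) && \text{the observed word}
\end{aligned}
$$

Because every $\phi_t$ is a probability distribution over the vocabulary, the term weights of an LDA topic are non-negative and sum to one, unlike the signed weights produced by LSI.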

### 1.3 Word Cloud visualization for topics

For a better representation, we can use [Wordcloud](https://github.com/amueller/word_cloud) to plot each topic with its terms and their probabilities. First, download the version of wordcloud that matches your Python installation from [Windows Binaries for Python Extension Packages](http://www.lfd.uci.edu/~gohlke/pythonlibs/#wordcloud), e.g. wordcloud-1.3.2-cp36-cp36m-win_amd64.whl, then:
- Copy the downloaded file to the folder containing this notebook
- Run the following command

In [None]:
# remember to replace your file name
!python -m pip install <filename>

# !python -m pip install wordcloud-1.3.2-cp36-cp36m-win_amd64.whl


We can create a word cloud for a chosen topic as follows:

In [None]:
import matplotlib.pyplot as plt
from nltk.corpus import stopwords

from wordcloud import WordCloud
topicID=0
weights={}
# Extract the top 5 terms from an arbitrarily chosen topic
for pair in lda.show_topic(topicID, topn=5):
    weights[pair[0]]=pair[1]
    
# Initialize the cloud

wc = WordCloud(
    background_color="black",
    max_words=20,
    width=256,
    height=180,
    relative_scaling=0,
    stopwords=stopwords.words('english')
)

# Generate the cloud from the term weights

wc.generate_from_frequencies(weights)
plt.imshow(wc)
plt.axis('off')
plt.show()
print(weights)

## 2. Topic modeling with a Real-world dataset

### 2.1 Dataset: 20 Newsgroups revisited

The 20 Newsgroups data set is a collection of approximately 20,000 newsgroup documents, partitioned (nearly) evenly across 20 different newsgroups. It was originally collected by Ken Lang, probably for his [Newsweeder: Learning to filter netnews](http://qwone.com/~jason/20Newsgroups/lang95.bib) paper, though he does not explicitly mention this collection. The 20 newsgroups collection has become a popular data set for experiments in text applications of machine learning techniques, such as text classification and text clustering.

The data is organized into 20 different newsgroups, each corresponding to a different topic. Some of the newsgroups are very closely related to each other (e.g. *comp.sys.ibm.pc.hardware / comp.sys.mac.hardware*), while others are highly unrelated (e.g. *misc.forsale / soc.religion.christian*). Here is a list of the 20 newsgroups, partitioned (more or less) according to subject matter:

``comp.graphics
comp.os.ms-windows.misc
comp.sys.ibm.pc.hardware
comp.sys.mac.hardware
comp.windows.x	
rec.autos
rec.motorcycles
rec.sport.baseball
rec.sport.hockey	
sci.crypt
sci.electronics
sci.med
sci.space
misc.forsale	
talk.politics.misc
talk.politics.guns
talk.politics.mideast	
talk.religion.misc
alt.atheism
soc.religion.christian``


### Data preparation
We limit ourselves to 5 classes of documents in this section to reduce processing time. Note that we will reuse the functions defined in the previous sections. Let us start by loading the necessary dataset and defining functions for building the training and testing datasets.

We define two functions: ``prepare_datasets()`` to split the given documents into training and testing data, and ``remove_empty_docs()`` to remove empty documents:

In [None]:
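from sklearn.model_selection import train_test_split

def prepare_datasets(corpus, labels, test_data_proportion=0.3):
    # split documents and labels, holding out test_data_proportion of them for testing
    train_X, test_X, train_Y, test_Y = train_test_split(corpus, labels, test_size=test_data_proportion, random_state=42)
    return train_X, test_X, train_Y, test_Y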

def remove_empty_docs(corpus, labels):
    filtered_corpus = []
    filtered_labels = []
    for doc, label in zip(corpus, labels):
        if doc.strip():
            filtered_corpus.append(doc)
            filtered_labels.append(label)
    return filtered_corpus, filtered_labels

We can now get the data (if you have not downloaded it already, make sure you are connected to the Internet and allow some time for the complete corpus to download):

In [None]:
from sklearn.datasets import fetch_20newsgroups

categories = ['alt.atheism','rec.sport.baseball','talk.politics.mideast','comp.graphics', 'sci.space']
dataset = fetch_20newsgroups(subset='train',categories=categories, remove=('headers', 'footers', 'quotes'))


In [None]:
news_corpus = normalize_corpus(dataset.data)  # the raw documents are in dataset.data

In [None]:
from sklearn.feature_extraction.text import CountVectorizer

def bow_extractor(corpus):    
    vectorizer = CountVectorizer()
    features = vectorizer.fit_transform(corpus)
    return vectorizer, features

from sklearn.feature_extraction.text import TfidfVectorizer
def tfidf_extractor(corpus, ngram_range=(1,1)):
    vectorizer = TfidfVectorizer(min_df=1,norm='l2',smooth_idf=True,use_idf=True,ngram_range=ngram_range)
    features = vectorizer.fit_transform(corpus)
    return vectorizer, features

In [None]:
# bag of words features
bow_vectorizer, bow_features = bow_extractor(news_corpus)


In [None]:
# tfidf features
tfidf_vectorizer, tfidf_features = tfidf_extractor(news_corpus)

### 2.2  Latent Semantic Indexing

<span style="color:red">**Exercise:** </span> Apply the *Latent Semantic Indexing* method to extract topics from the 20 Newsgroups data, using the code given in the previous section.

### 2.3  Latent Dirichlet Allocation

<span style="color:red">**Exercise:** </span> Apply the *Latent Dirichlet Allocation* method to extract topics from the 20 Newsgroups data, using the code given in the previous section.

## 3. Automated text summarization

The main objective of automated document summarization is to produce a summary without any human input beyond running the computer programs. Mathematical and statistical models help in building and automating the task of summarizing documents by observing their content and context. The idea of document summarization is a bit different from topic modeling: the end result is still a document, but one containing only a few sentences, depending on how long we want the summary to be. This is similar to a research paper having an abstract or an executive summary.

Here, we will be looking at summarizing text documents by utilizing document sentences, the terms in each sentence of the document, and applying SVD to them using some sort of feature weights like Bag of Words or TF-IDF weights. The core principle behind latent semantic analysis (LSA) is that in any document, there exists a latent structure among terms which are related contextually and hence should also be correlated in the same singular space.

The input parameters we need are the number of concepts ``k`` and the number of sentences ``n`` that we want the final summary to contain. After applying SVD to the term-sentence matrix, the algorithm proceeds as follows:
- Get the sentence vectors from the matrix $V^T$ (k rows, one column per sentence).
- Get the top k singular values from S.
- Apply a threshold-based approach to remove singular values that are less than half of the largest singular value, if any exist. This is a heuristic, and you can play around with this value if you want, i.e., $S_i=0$ if $S_i<\frac{1}{2}\max(S)$.
- Multiply each squared row of $V^T$ by the square of its corresponding singular value from S to get the sentence weights per topic.
- Compute the sum of the sentence weights across the topics and take the square root of the result to get the salience score of each sentence in the document, i.e., $SS_j=\sqrt{\sum_{i=1}^{k}S_i^{2}\,(V^{T}_{ij})^{2}}$ for sentence $j$.

Once we have these scores, we sort them in descending order, pick the top ``n`` sentences with the highest scores, and combine them, in the order in which they appeared in the original document, to form our final summary. We will now build a generic, reusable function for LSA based on this algorithm so that we can use it on our test document later on; you can also use this function on your own data.
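
Before looking at the full function, the thresholding and scoring steps can be traced by hand on a tiny example. The singular values and the `vt` matrix below are made-up numbers chosen for illustration, not the output of a real SVD, and the variable names are our own.

```python
import math

# Hypothetical SVD output for 2 topics and 4 sentences (illustrative values only)
singular_values = [3.0, 1.0]        # one value per topic
vt = [[0.8, 0.6, 0.0, 0.0],         # topic 1 weight of each sentence
      [0.0, 0.0, 0.6, 0.8]]         # topic 2 weight of each sentence

sv_threshold = 0.5
num_sentences = 2

# Zero out singular values below threshold * largest singular value
cutoff = max(singular_values) * sv_threshold
kept = [s if s >= cutoff else 0.0 for s in singular_values]

# Salience of sentence j: sqrt(sum over topics of s_i^2 * vt[i][j]^2)
salience = [math.sqrt(sum((s * row[j]) ** 2 for s, row in zip(kept, vt)))
            for j in range(len(vt[0]))]

# Take the top-n sentences by salience, then restore document order
top = sorted(range(len(salience)), key=lambda j: salience[j], reverse=True)[:num_sentences]
top.sort()
print(salience, top)
```

Here the second singular value (1.0) falls below the cutoff of 1.5 and is zeroed, so only the first topic contributes and the first two sentences are selected.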

In [None]:
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from scipy.sparse.linalg import svds

def lsa_text_summarizer(documents, num_sentences=2,num_topics=2, feature_type='frequency', sv_threshold=0.5):
    
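    # 'tfidf' is an assumed option name; any other value uses raw term frequencies
    vec = TfidfVectorizer() if feature_type == 'tfidf' else CountVectorizer()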
    dt_matrix=vec.fit_transform(documents).astype(float)
                            
    td_matrix = dt_matrix.transpose()
    td_matrix = td_matrix.multiply(td_matrix > 0)

    u, s, vt = svds(td_matrix, k=num_topics)

    min_sigma_value = max(s) * sv_threshold
    s[s < min_sigma_value] = 0
    
    salience_scores = np.sqrt(np.dot(np.square(s), np.square(vt)))

    top_sentence_indices = salience_scores.argsort()[-num_sentences:][::-1]
    top_sentence_indices.sort()
    print("Salience scores:\t", np.round(salience_scores,2))
    print("Selected sentences:\t", top_sentence_indices)
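    # NOTE: this reads the global `sentences` list (the original, un-normalized
    # sentences), which must be aligned index-for-index with `documents`
    for index in top_sentence_indices:
        print(sentences[index])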

We need to split a document into sentences. The following function takes in a text document, removes its newlines, converts the text into ASCII format, and breaks it down into its constituent sentences:

In [None]:
import re
import unicodedata
import nltk

def parse_document(document):
    if not isinstance(document, str):
        raise ValueError('Document is not a string!')
    # replace newlines with spaces
    document = re.sub('\n', ' ', document)
    # convert to plain ASCII, dropping characters that have no ASCII equivalent
    document = unicodedata.normalize('NFKD', document).encode('ascii', 'ignore').decode('ascii')
    document = document.strip()
    # split into sentences and trim surrounding whitespace
    sentences = nltk.sent_tokenize(document)
    sentences = [sentence.strip() for sentence in sentences]
    return sentences
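
The ASCII conversion step described above can be illustrated on its own using only the standard library. The helper name `to_ascii` is hypothetical and used only for this sketch.

```python
import unicodedata

def to_ascii(text):
    # Decompose accented characters (NFKD), then drop anything non-ASCII
    decomposed = unicodedata.normalize('NFKD', text)
    return decomposed.encode('ascii', 'ignore').decode('ascii')

print(to_ascii("Café déjà vu"))
```

NFKD normalization splits an accented letter into its base letter plus a combining mark, so encoding with `'ignore'` removes the mark while keeping the base letter. Characters with no ASCII base letter are dropped entirely.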

We will use a Wikipedia description of elephants as the document on which we test our summarization technique.

In [None]:
import numpy as np
toy_text = """
Elephants are large mammals of the family Elephantidae
and the order Proboscidea. Two species are traditionally recognised,
the African elephant and the Asian elephant. Elephants are scattered
throughout sub-Saharan Africa, South Asia, and Southeast Asia. Male
African elephants are the largest extant terrestrial animals. All
elephants have a long trunk used for many purposes,
particularly breathing, lifting water and grasping objects. Their
incisors grow into tusks, which can serve as weapons and as tools
for moving objects and digging. Elephants' large ear flaps help
to control their body temperature. Their pillar-like legs can
carry their great weight. African elephants have larger ears
and concave backs while Asian elephants have smaller ears
and convex or level backs.
"""

In [None]:
sentences = parse_document(toy_text)
norm_sentences = normalize_corpus(sentences) 
vec = CountVectorizer()
dt_matrix=vec.fit_transform(norm_sentences).astype(float)


print("Total Sentences:", len(norm_sentences) )

lsa_text_summarizer(norm_sentences, num_sentences=3,
                    num_topics=3, feature_type='frequency',
                    sv_threshold=0.5)  