# Automatic Learning of Key Phrases and Topics in Document Collections

## Part 1: Text Preprocessing

### Overview

This notebook is Part 1 of a six-part series that describes, step by step, how to process and analyze the contents of a large collection of text documents in an unsupervised manner. Using Python packages and custom code examples, we implement the basic framework that combines key phrase learning and latent topic modeling as described in the paper ["Modeling Multiword Phrases with Constrained Phrases Tree for Improved Topic Modeling of Conversational Speech"](http://people.csail.mit.edu/hazen/publications/Hazen-SLT-2012.pdf), originally presented at the 2012 IEEE Workshop on Spoken Language Technology.

This notebook demonstrates how to preprocess the raw text from a collection of documents as a precursor to applying the natural language processing techniques of unsupervised phrase learning and latent topic modeling.



**This series of notebooks can be run in any compute context. Before you run the notebooks, make sure all dependencies are installed in the compute context you choose as the kernel.**

* For a **local** kernel, click "_**Open Command Prompt**_" from the "_**File**_" menu in Azure Machine Learning Workbench, and then manually install the following packages:
```
$ conda install numpy
$ conda install nltk
$ conda install -c conda-forge wordcloud
$ conda install bokeh
$ pip install gensim
$ pip install matplotlib
```

* For local or remote **Docker kernels**:
    * ensure that **notebook**, **matplotlib**, **nltk**, **gensim**, and **wordcloud** are listed in the **conda_dependencies.yml** file under the **aml_config** folder.
    ```
        name: project_environment
        channels:
          - conda-forge
          - defaults
        dependencies:
          - python=3.5.2
          - numpy>=1.13
          - scikit-learn
          - nltk
          - pandas
          - azure
          - gensim
          - scipy
          - wordcloud
          - bokeh
          - pip:
            - notebook
            - gensim
            - matplotlib
    ```

### Import Relevant Python Packages

#### Importing NLTK Model for Sentence Tokenization


NLTK is a collection of Python modules, prebuilt models, and corpora that provides tools for complex natural language processing tasks. Because the toolkit is large, the base installation includes only its core skeleton. Specific modules, corpora, and prebuilt models can be installed from within Python using NLTK's download functionality.

In this notebook, we use NLTK's sentence tokenization capability, which takes a long string of text and splits it into sentence units. The tokenizer requires the 'punkt' tokenizer models. After importing nltk, the `nltk.download()` function can be used to download specific packages such as 'punkt'.

For more information on NLTK see http://www.nltk.org/.

In [1]:
import os
import urllib.request
import nltk
# The first time you run NLTK you will need to download the 'punkt' models 
# for breaking text strings into individual sentences
try:
    nltk.data.find('tokenizers/punkt')
except LookupError:
    nltk.download('punkt')
from nltk import tokenize

from azureml.logging import get_azureml_logger
aml_logger = get_azureml_logger()   # logger writes to AMLWorkbench runtime view
aml_logger.log('amlrealworld.document-collection-analysis.notebook1', 'true')

<azureml.logging.script_run_request.ScriptRunRequest at 0x237d04441d0>

#### Import Other Required Packages
The 'pandas' package is used for handling and manipulating data frames. The 're' package is used for applying regular expressions.

In [2]:
import pandas 
import re

### Load Text Data

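The datasets are downloaded from Blob Storage the first time they are needed; files already present in the share directory are not downloaded again.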

> **NOTE**: If you are running this notebook outside of Azure Machine Learning Workbench, you will need to change the `'shared_path'` to relative path `'../Data'`.

In [3]:
def download_file_from_blob(filename):
    shared_path = os.environ['AZUREML_NATIVE_SHARE_DIRECTORY']
    save_path = os.path.join(shared_path, filename)

    # Base URL for anonymous read access to Blob Storage container
    STORAGE_CONTAINER = 'https://bostondata.blob.core.windows.net/scenario-document-collection-analysis/'
    url = STORAGE_CONTAINER + filename
    urllib.request.urlretrieve(url, save_path)
    
    
def getData():
    shared_path = os.environ['AZUREML_NATIVE_SHARE_DIRECTORY']

    data_file = os.path.join(shared_path, DATASET_FILE)
    blacklist_file = os.path.join(shared_path, BLACK_LIST_FILE)
    function_words_file = os.path.join(shared_path, FUNCTION_WORDS_FILE)

    if not os.path.exists(data_file):
        download_file_from_blob(DATASET_FILE)
    if not os.path.exists(blacklist_file):
        download_file_from_blob(BLACK_LIST_FILE)
    if not os.path.exists(function_words_file):
        download_file_from_blob(FUNCTION_WORDS_FILE)

    df = pandas.read_csv(data_file, sep='\t')
    return df
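
The note above suggests replacing the share directory with `../Data` when running outside Workbench. As a rough sketch, the fallback and the download-if-missing check could be factored as follows. The names `resolve_data_dir`, `ensure_local_copy`, and the `fetch` parameter are hypothetical and do not appear in the notebook; the fake fetch function stands in for the real download so the example runs offline.

```python
import os
import tempfile

def resolve_data_dir(default='../Data'):
    """Return the Workbench share directory if it is set, else a relative fallback."""
    return os.environ.get('AZUREML_NATIVE_SHARE_DIRECTORY', default)

def ensure_local_copy(filename, data_dir, fetch):
    """Call fetch(filename, path) only when the file is not already cached."""
    os.makedirs(data_dir, exist_ok=True)
    path = os.path.join(data_dir, filename)
    if not os.path.exists(path):
        fetch(filename, path)
    return path

# Stand-in for the real download (assumption for illustration only)
calls = []
def fake_fetch(filename, path):
    calls.append(filename)
    with open(path, 'w', encoding='utf-8') as handle:
        handle.write('ID\tText\n')

with tempfile.TemporaryDirectory() as tmp_dir:
    ensure_local_copy('small_data.tsv', tmp_dir, fake_fetch)
    ensure_local_copy('small_data.tsv', tmp_dir, fake_fetch)  # already cached, no second fetch
    fetch_count = len(calls)

print(fetch_count)
```

In the notebook itself, `fetch` would wrap `urllib.request.urlretrieve` in the same way `download_file_from_blob` does.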


Define the constants used to fetch the data from Blob Storage and read it into a pandas DataFrame.

In [4]:
# The dataset file name
# DATASET_FILE = 'small_data.tsv'
DATASET_FILE = 'CongressionalDataAll_Jun_2017.tsv'

# The black list of words to ignore
BLACK_LIST_FILE = 'black_list.txt'

# The non-content bearing function words
FUNCTION_WORDS_FILE = 'function_words.txt'


> **NOTE:** By default, this notebook uses the entire congressional dataset, which contains about 290,000 bills. To run it on a smaller dataset of 50,000 bills instead, comment out the line `DATASET_FILE = 'CongressionalDataAll_Jun_2017.tsv'` and uncomment the line `DATASET_FILE = 'small_data.tsv'` in the cell above. Using the smaller dataset can significantly reduce the execution time of downstream notebooks.

Load full TSV file including a column of text

In [5]:
frame = getData()

In [6]:
print("Total documents in corpus: %d\n" % len(frame))

# Show the first five rows of the data in the frame
frame[0:5]

Total documents in corpus: 297462



|   | ID | Text | Date | SponsorName | Type | State | District | Party | Subjects |
|---|---|---|---|---|---|---|---|---|---|
| 0 | hconres1-93 | Provides that effective from January 3, 1973, ... | 1973-01-03 | O'Neill, Thomas P., Jr. | rep | MA | 8.0 | Democrat | congress,congressional joint committees,govern... |
| 1 | hconres2-93 | Makes it the sense of the Congress that the po... | 1973-01-03 | Bennett, Charles E. | rep | FL | 3.0 | Democrat | environmental protection,pollution,water resou... |
| 2 | hconres3-93 | Establishes a Joint Congressional Committee on... | 1973-01-03 | Bennett, Charles E. | rep | FL | 3.0 | Democrat | congress,congressional joint committees,congre... |
| 3 | hconres4-93 | Makes it the sense of the Congress that the Pr... | 1973-01-03 | Collier, Harold R. | rep | IL | 10.0 | Republican | armed forces and national security,missing in ... |
| 4 | hconres5-93 | Makes it the sense of the Congress that: (1) t... | 1973-01-03 | Collier, Harold R. | rep | IL | 10.0 | Republican | economics and public finance,federal budgets |


Print the full text of the first three documents

In [7]:
print(frame['Text'][0])
print('---')
print(frame['Text'][1])
print('---')
print(frame['Text'][2])

Provides that effective from January 3, 1973, the joint committee created to make the necessary arrangements for the inauguration of the President-elect and Vice President-elect of the United States on the 20th day of January 1973, is hereby continued and for such purpose shall have the same power and authority as that conferred by Senate Concurrent Resolution 63, of the Ninety-second Congress.
---
Makes it the sense of the Congress that the pollution of waters all over the world is a matter of vital concern to all nations and should be dealt with as a matter of the highest priority. Makes it the sense of the Congress that the President, acting through the United States delegation to the United National Conference on the Human Environment, should take such steps as may be necessary to propose an international agreement, or amendments to existing international agreements, as may be appropriate, providing for coordinated international activites to prohibit the disposal of munitions, chem

### Preprocess Text Data

The `CleanAndSplitText` function below takes as input a data frame in which each row holds one document as a single cohesive long string of text. The function first splits each string by various forms of punctuation into chunks of text that are likely sentences, phrases, or sub-phrases. The splitting is designed to prevent the phrase learning process from using cross-sentence or cross-phrase word strings when learning phrases.

The function creates a table where each row represents a chunk of text from the original documents. The DocID column holds the ID of the document from which the chunk originated, and the DocLine column gives the position of the chunk within that document. The CleanedText column contains the chunk's text with the section headers and the punctuation marks used for splitting removed.
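
To make the shape of the output concrete, here is a minimal, standard-library-only sketch of the same idea: split on section headers, then sentences, then phrase-level punctuation, and emit one row per chunk. The three patterns are deliberately simplified stand-ins (assumptions, not the notebook's actual expressions), the naive sentence splitter replaces NLTK's tokenizer, and the sample text is invented for illustration.

```python
import re

# Simplified stand-ins for the notebook's patterns (assumptions for illustration)
header_pattern = re.compile(r" *Title [IVXLC]+:? *| *\(Sec\. \d+\) *")
sentence_pattern = re.compile(r"(?<=[.!?])\s+")   # naive substitute for NLTK's sent_tokenize
phrase_pattern = re.compile(r"[,;:]+\s+")

def clean_and_split(documents):
    """Return [doc_id, line_index, chunk] rows for a dict of {doc_id: text}."""
    rows = []
    for doc_id, text in documents.items():
        line_index = 0
        for section in header_pattern.split(text):
            for sentence in sentence_pattern.split(section):
                for chunk in phrase_pattern.split(sentence):
                    # This sketch strips every trailing period and ignores abbreviations
                    chunk = chunk.rstrip('.').strip()
                    if chunk:
                        rows.append([doc_id, line_index, chunk])
                        line_index += 1
    return rows

# Invented sample document
docs = {'doc-1': "Title I: General Provisions (Sec. 101) Requires annual reports, "
                 "audits; and reviews. Repeals prior law."}
rows = clean_and_split(docs)
for row in rows:
    print(row)
```

Each printed row corresponds to one line of the output table: the document ID, a per-document line counter that restarts at 0 for every document, and the cleaned chunk.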


In [8]:
def CleanAndSplitText(textDataFrame):

    textDataOut = [] 
   
    # This regular expression is for section headers in the bill summaries that we wish to ignore
    reHeaders = re.compile(r" *TABLE OF CONTENTS:? *"
                           r"| *Title [IVXLC]+:? *"
                           r"| *Subtitle [A-Z]+:? *"
                           r"| *\(Sec\. \d+\) *")

    # This regular expression is for punctuation that we wish to clean out
    # We also will split sentences into smaller phrase like units using this expression
    rePhraseBreaks = re.compile(r"[\"\!\?\)\]\}\,\:\;\*\-]*\s+\([0-9]+\)\s+[\(\[\{\"\*\-]*"
                                r"|[\"\!\?\)\]\}\,\:\;\*\-]+\s+[\(\[\{\"\*\-]*"
                                r"|\.\.+"
                                r"|\s*\-\-+\s*"
                                r"|\s+\-\s+"
                                r"|\:\:+"
                                r"|\s+[\/\(\[\{\"\-\*]+\s*"
                                r"|[\,!\?\"\)\(\]\[\}\{\:\;\*](?=[a-zA-Z])"
                                r"|[\"\!\?\)\]\}\,\:\;]+[\.]*$"
                             )
    
    # Regex for underbars
    regexUnderbar = re.compile('_')
    
    # Regex for space
    regexSpace = re.compile(' +')
 
    # Regex for sentence final period
    regexPeriod = re.compile(r"\.$")

    # Iterate through each document and do:
    #    (1) Split documents into sections based on section headers and remove section headers
    #    (2) Split the sections into sentences using NLTK sentence tokenizer
    #    (3) Further split sentences into phrasal units based on punctuation and remove punctuation
    #    (4) Remove sentence final periods when not part of an abbreviation 

    for i in range(0, len(textDataFrame)):
        # Extract one document from the input frame
        docID = textDataFrame['ID'][i]
        docText = str(textDataFrame['Text'][i])

        # Set counter for output line count for this document
        lineIndex = 0

        # Split document into sections by finding sections headers and splitting on them 
        sections = reHeaders.split(docText)
        
        for section in sections:
            # Split section into sentence using NLTK tokenizer 
            sentences = tokenize.sent_tokenize(section)
            
            for sentence in sentences:
                # Split each sentence into phrase level chunks based on punctuation
                textSegs = rePhraseBreaks.split(sentence)
                numSegs = len(textSegs)
                
                for j in range(0,numSegs):
                    if len(textSegs[j])>0:
                        # Convert underbars to spaces 
                        # Underbars are reserved for building the compound word phrases                   
                        textSegs[j] = regexUnderbar.sub(" ",textSegs[j])
                    
                        # Split out the words so we can specially handle the last word
                        words = regexSpace.split(textSegs[j])
                        phraseOut = ""
                        # If the last word ends in a period then remove the period
                        lastWord = regexPeriod.sub("", words[-1])
                        # If the last word is an abbreviation like "U.S."
                        # then add the word final period back on
                        if lastWord != words[-1] and "." in lastWord:
                            lastWord += "."
                        words[-1] = lastWord
                        phraseOut = " ".join(words)  

                        textDataOut.append([docID, lineIndex, phraseOut])
                        lineIndex += 1
                        
    # Convert to pandas frame 
    frameOut = pandas.DataFrame(textDataOut, columns=['DocID', 'DocLine', 'CleanedText'])                      
    
    return frameOut
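
Step (4) in the function's comments, removing a sentence-final period unless the last word is an abbreviation, can be checked in isolation. The sketch below is one possible reading of that step, not the notebook's exact code. `strip_final_period` is a hypothetical helper, and the abbreviation test is a plain substring check for an interior period, which is a heuristic assumption.

```python
import re

final_period = re.compile(r"\.$")

def strip_final_period(chunk):
    """Drop a sentence-final period unless the last word looks like an abbreviation."""
    words = chunk.split(' ')
    stripped = final_period.sub("", words[-1])
    # Heuristic (assumption): an interior period, as in "U.S", marks an abbreviation,
    # so the period that was just removed is restored
    if stripped != words[-1] and "." in stripped:
        stripped += "."
    words[-1] = stripped
    return " ".join(words)

print(strip_final_period("of the Ninety-second Congress."))
print(strip_final_period("laws of the U.S."))
```

Note that the check uses `"." in stripped` rather than `"\." in stripped`: in an ordinary Python string, `"\."` is a backslash followed by a period, so a substring test against it would only match text that contains a literal backslash.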

In [9]:
# Set this to True to run the function
writeFile = True

if writeFile:
    cleanedDataFrame = CleanAndSplitText(frame)

#### Writing and reading text data to and from a file 

The cell below writes the text data to a file or reads it back in. If `writeFile` is `False`, it assumes we have already run the `CleanAndSplitText` function and written its output to file.

In [10]:
cleanedDataFile = os.path.join(os.environ['AZUREML_NATIVE_SHARE_DIRECTORY'], 'CongressionalDocsCleaned.tsv')

if writeFile:
    # Write frame with preprocessed text out to TSV file 
    cleanedDataFrame.to_csv(cleanedDataFile, sep='\t', index=False, encoding='utf-8')
else:
    # Read a cleaned data frame in from a TSV file, using the same encoding it was written with
    cleanedDataFrame = pandas.read_csv(cleanedDataFile, sep='\t', encoding='utf-8')
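
One detail worth checking in a write-once, read-later cache like this is that the file is read back with the same encoding it was written with. The following sketch shows a round trip using the standard-library `csv` module instead of pandas; `write_rows` and `read_rows` are hypothetical helpers, and the second row is made up to include a non-ASCII character.

```python
import csv
import os
import tempfile

def write_rows(path, header, rows):
    """Write a header and rows to a tab-separated file encoded as UTF-8."""
    with open(path, 'w', newline='', encoding='utf-8') as handle:
        writer = csv.writer(handle, delimiter='\t')
        writer.writerow(header)
        writer.writerows(rows)

def read_rows(path):
    """Read the file back with the same delimiter and encoding."""
    with open(path, newline='', encoding='utf-8') as handle:
        reader = csv.reader(handle, delimiter='\t')
        header = next(reader)
        return header, [row for row in reader]

header = ['DocID', 'DocLine', 'CleanedText']
rows = [['hconres1-93', '0', 'Provides that effective from January 3'],
        ['example-doc', '0', 'café subsidies']]   # made-up row with a non-ASCII character

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'cleaned.tsv')
    write_rows(path, header, rows)
    loaded_header, loaded_rows = read_rows(path)

print(loaded_rows == rows)
```

Reading the same file with a different single-byte encoding would not raise an error, but it would silently garble any non-ASCII characters.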


#### Examining the processed text data

In [11]:
cleanedDataFrame[0:25]

|   | DocID | DocLine | CleanedText |
|---|---|---|---|
| 0 | hconres1-93 | 0 | Provides that effective from January 3 |
| 1 | hconres1-93 | 1 | 1973 |
| 2 | hconres1-93 | 2 | the joint committee created to make the necess... |
| 3 | hconres1-93 | 3 | is hereby continued and for such purpose shall... |
| 4 | hconres1-93 | 4 | of the Ninety-second Congress |
| 5 | hconres2-93 | 0 | Makes it the sense of the Congress that the po... |
| 6 | hconres2-93 | 1 | Makes it the sense of the Congress that the Pr... |
| 7 | hconres2-93 | 2 | acting through the United States delegation to... |
| 8 | hconres2-93 | 3 | should take such steps as may be necessary to ... |
| 9 | hconres2-93 | 4 | or amendments to existing international agreem... |


In [12]:
print(cleanedDataFrame['CleanedText'][0])
print(cleanedDataFrame['CleanedText'][1])
print(cleanedDataFrame['CleanedText'][2])

Provides that effective from January 3
1973
the joint committee created to make the necessary arrangements for the inauguration of the President-elect and Vice President-elect of the United States on the 20th day of January 1973
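
The chunk boundaries matter because word sequences are meant to be learned within a chunk, never across two chunks. How the later notebooks count word sequences is not shown here, so the following is only an illustrative assumption: a bigram counter that respects chunk boundaries. `count_bigrams` is a hypothetical helper, and the third chunk is a shortened version of the one printed above.

```python
from collections import Counter

def count_bigrams(chunks):
    """Count adjacent word pairs inside each chunk, never across chunk boundaries."""
    counts = Counter()
    for chunk in chunks:
        words = chunk.lower().split()
        counts.update(zip(words, words[1:]))
    return counts

chunks = ["Provides that effective from January 3",
          "1973",
          "the joint committee created"]   # shortened for illustration
bigram_counts = count_bigrams(chunks)

print(bigram_counts[('joint', 'committee')])
print(bigram_counts[('3', '1973')])
```

The pair `('3', '1973')` is never counted because the two words sit in different chunks, even though they were adjacent in the original sentence.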


### Next

The data preprocessing step is finished. The next step, phrase learning, is covered in the second notebook of the series: [`2_Phrase_Learning.ipynb`](./2_Phrase_Learning.ipynb).