# Part 2: Phrase Learning

If you haven't completed **Part 1: Data Preparation**, please complete it before moving on to **Part 2: Phrase Learning**. Part 2 requires files created in Part 1.

**NOTE**: The Python 3 kernel does not include Azure Machine Learning Workbench functionality. Please switch the kernel to `local` before continuing.

## Import Required Python Modules

`modules.phrase_learning` contains the user-defined Python functions used in this example to learn informative phrases. You can find the source code in `modules/phrase_learning.py`.

In [13]:
import pandas as pd
import numpy as np
import re, os, requests, warnings
from collections import (namedtuple, Counter)
from modules.phrase_learning import (CleanAndSplitText, ComputeNgramStats, RankNgrams, ApplyPhraseRewrites,
                            ApplyPhraseLearning, ApplyPhraseRewritesInPlace, ReconstituteDocsFromChunks,
                            CreateVocabForTopicModeling)
warnings.filterwarnings("ignore")

## Access trainQ and testQ from Part 1

We prepared _trainQ_ and _testQ_ in `Part 1: Data Preparation`; here we load both datasets for further processing.

_trainQ_ contains 5,153 training examples and _testQ_ contains 1,735 test examples. Also, there are 103 unique answer classes in both datasets.

In [14]:
# load non-content bearing function words (.txt file) into a Python dictionary. 
def LoadListAsHash(fileURL):
    response = requests.get(fileURL, stream=True)
    wordsList = response.text.split('\n')

    # Read in lines one by one and strip away extra spaces, 
    # leading spaces, and trailing spaces and inserting each
    # cleaned up line into a hash table.
    listHash = {}
    re1 = re.compile(' +')
    re2 = re.compile('^ +| +$')
    for stringIn in wordsList:
        term = re2.sub("",re1.sub(" ",stringIn.strip('\n')))
        if term != '':
            listHash[term] = 1
    return listHash
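
To check the cleaning logic without a network call, the same normalization can be applied to an in-memory list of lines. The sketch below is only an illustration: `build_word_hash` and the sample lines are hypothetical and not part of `modules.phrase_learning`. Unlike the regular expressions above, `str.strip()` also removes tabs and carriage returns.

```python
import re

def build_word_hash(lines):
    """Collapse repeated spaces, trim each line, and store non-empty terms as dict keys."""
    multi_space = re.compile(' +')
    word_hash = {}
    for line in lines:
        term = multi_space.sub(' ', line).strip()
        if term:
            word_hash[term] = 1
    return word_hash

# Hypothetical sample: padded, empty, and duplicate lines.
sample_lines = ['the', '  of ', '', 'in   order  to', 'the']
word_hash = build_word_hash(sample_lines)
print(sorted(word_hash))
```

Storing the terms as dictionary keys gives constant-time membership tests, which is what the later `token in vocabHash` style checks rely on.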

In [15]:
workfolder = os.environ.get('AZUREML_NATIVE_SHARE_DIRECTORY')

# paths to trainQ, testQ and function words.
trainQ_path = os.path.join(workfolder, 'trainQ_part1')
testQ_path = os.path.join(workfolder, 'testQ_part1')
function_words_url = 'https://bostondata.blob.core.windows.net/stackoverflow/function_words.txt'

# load the training and test data.
trainQ = pd.read_csv(trainQ_path, sep='\t', index_col='Id', encoding='latin1')
testQ = pd.read_csv(testQ_path, sep='\t', index_col='Id', encoding='latin1')

# Load the list of non-content bearing function words.
functionwordHash = LoadListAsHash(function_words_url)
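
`os.environ.get` returns `None` when the variable is not set, and `os.path.join(None, ...)` then fails with a `TypeError` that does not name the cause. A small guard, sketched here with the hypothetical helper `resolve_workfolder`, makes that failure easier to read. The path in the usage line is a stand-in.

```python
import os

SHARE_DIR_VAR = 'AZUREML_NATIVE_SHARE_DIRECTORY'

def resolve_workfolder(env=None):
    """Return the share directory path, or raise a readable error when it is unset."""
    env = os.environ if env is None else env
    folder = env.get(SHARE_DIR_VAR)
    if not folder:
        raise RuntimeError(SHARE_DIR_VAR + ' is not set in this kernel environment')
    return folder

# Usage with a stand-in environment mapping (hypothetical path).
print(resolve_workfolder({SHARE_DIR_VAR: '/tmp/share'}))
```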

## Clean and Split the Text

The `CleanAndSplitText` function from `phrase_learning` takes as input a list where each row element is a single cohesive long string of text, i.e. a "question". The function first splits each string by various forms of punctuation into chunks of text that are likely sentences, phrases or sub-phrases. The splitting is designed to prevent the phrase learning process from using cross-sentence or cross-phrase word strings when learning phrases.

The function returns a table where each row represents a chunk of text from the questions. The `DocID` column indicates the row index of the question from which the chunk originated. The `DocLine` column gives the position of the chunk within that question. The `CleanedText` column contains the original text with the punctuation marks and HTML markup removed during the cleaning process. The `LowercaseText` column contains a fully lower-cased version of the text in the `CleanedText` column.
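
The idea of splitting on punctuation so that phrases never span chunk boundaries can be sketched in a few lines. This is not the implementation in `modules/phrase_learning.py`: the function name `split_into_chunks`, the delimiter set, and the sample text are assumptions chosen for illustration, and HTML stripping is omitted.

```python
import re

def split_into_chunks(doc_id, text):
    """Split one document on sentence- and phrase-level punctuation.

    Returns (doc_id, chunk_index, cleaned_chunk, lowercase_chunk) tuples.
    """
    rows = []
    # Assumed, simplified delimiter set; apostrophes are kept so "don't" stays whole.
    pieces = re.split(r'[.!?;:,()]+', text)
    line_no = 0
    for piece in pieces:
        cleaned = re.sub(r'\s+', ' ', piece).strip()
        if cleaned:
            rows.append((doc_id, line_no, cleaned, cleaned.lower()))
            line_no += 1
    return rows

rows = split_into_chunks(1, 'Why does this fail? I tried again, but nothing changed.')
for row in rows:
    print(row)
```

Each tuple mirrors the `DocID`, `DocLine`, `CleanedText`, and `LowercaseText` columns of the returned table.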

In [16]:
CleanedTrainQ = CleanAndSplitText(trainQ)
CleanedTestQ = CleanAndSplitText(testQ)

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\mez\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\mez\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


In [17]:
CleanedTrainQ.head(5)

| | DocID | DocLine | CleanedText | LowercaseText |
|---|---|---|---|---|
| 0 | 69913 | 0 | why don't self-closing script tags work | why don't self-closing script tags work |
| 1 | 69913 | 1 | what is the reason browsers do not correctly r... | what is the reason browsers do not correctly r... |
| 2 | 69913 | 2 | only this is recognized | only this is recognized |
| 3 | 69913 | 3 | does this break the concept of xhtml support | does this break the concept of xhtml support |
| 4 | 69913 | 4 | note | note |


## Learn Informative Phrases

Learned phrases can be treated as single compound word units in downstream processes such as discriminative training. To learn the phrases, we implemented the basic framework for key phrase learning described in the paper ["Modeling Multiword Phrases with Constrained Phrase Trees for Improved Topic Modeling of Conversational Speech"](http://people.csail.mit.edu/hazen/publications/Hazen-SLT-2012.pdf), which was originally presented at the 2012 IEEE Workshop on Spoken Language Technology. Although the paper examines the use of the technology for analyzing human-to-human conversations, the techniques are quite general and can be applied to a wide range of natural language data, including news stories, legal documents, research publications, social media forum discussions, customer feedback forms, product reviews, and many more.

The `ApplyPhraseLearning` function takes the following arguments:
- `textData`: a list of text data.
- `learnedPhrases`: a list of learned phrases. For initialization, pass an empty list.
- `maxNumPhrases`: maximum number of phrases to learn. To test the code quickly, set this to a small value (e.g. 100) and set `verbose` to true.
- `maxPhraseLength`: maximum number of words allowed in the learned phrases.
- `maxPhrasesPerIter`: maximum number of phrases to learn per iteration. Increasing this number may speed up processing, but it affects the order in which phrases are learned, and good phrases could be bypassed if `maxNumPhrases` is set to a small number.
- `minCount`: minimum number of times a phrase must occur in the data to be considered during the phrase learning process.
- `functionwordHash`: a precreated hash table containing the list of function words used during phrase learning.
- `blacklistHash`: a precreated hash table (default value: `{}`) containing the list of blacklisted words to be ignored during phrase learning.
- `verbose`: if true, prints the learned phrases to stdout while it is learning (default value: false). This generates a lot of text, so it is best to turn it off except for testing and debugging.
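
To make the roles of `minCount`, `maxNumPhrases`, and the function word table more concrete, here is a heavily simplified sketch. It is not the algorithm from the paper or from `modules/phrase_learning.py`: it ranks by raw frequency only, handles two-word phrases only, and runs a single pass. The names `learn_bigrams` and `rewrite_with_phrases` and the toy chunks are hypothetical.

```python
from collections import Counter

def learn_bigrams(chunks, function_words, min_count=2, max_phrases=10):
    """Return the most frequent adjacent word pairs that contain no function word."""
    counts = Counter()
    for chunk in chunks:
        words = chunk.split()
        for first, second in zip(words, words[1:]):
            if first not in function_words and second not in function_words:
                counts[(first, second)] += 1
    frequent = [pair for pair, n in counts.most_common() if n >= min_count]
    return [' '.join(pair) for pair in frequent[:max_phrases]]

def rewrite_with_phrases(chunk, phrases):
    """Replace each learned two-word phrase with one underscore-joined token."""
    phrase_set = set(phrases)
    words = chunk.split()
    out, i = [], 0
    while i < len(words):
        if i + 1 < len(words) and words[i] + ' ' + words[i + 1] in phrase_set:
            out.append(words[i] + '_' + words[i + 1])
            i += 2
        else:
            out.append(words[i])
            i += 1
    return ' '.join(out)

toy_chunks = ['the click event fires twice', 'my click event is ignored',
              'bind the click event again', 'the event fires']
toy_function_words = {'the', 'my', 'is'}

phrases = learn_bigrams(toy_chunks, toy_function_words, min_count=2)
print(phrases)
print(rewrite_with_phrases(toy_chunks[0], phrases))
```

Because chunks are processed one at a time, a candidate pair can never straddle a chunk boundary, which is the reason for the splitting step earlier in this part.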

In [18]:
# Initialize an empty list of learned phrases
# If you have completed a partial run of phrase learning
# and want to add more phrases, you can use the pre-learned 
# phrases as a starting point instead and the new phrases
# will be appended to the list
learnedPhrasesQ = []

# Create a copy of the original text data that will be used during learning
# The copy is needed because the algorithm does in-place replacement of learned
# phrases directly on the text data structure it is provided
phraseTextDataQ = []
for textLine in CleanedTrainQ['LowercaseText']:
    phraseTextDataQ.append(' ' + textLine + ' ')

# Run the phrase learning algorithm.
ApplyPhraseLearning(phraseTextDataQ, learnedPhrasesQ, maxNumPhrases=200, maxPhraseLength=7, maxPhrasesPerIter=50,
                    minCount=5, functionwordHash=functionwordHash)

# Add text with learned phrases back into data frame
CleanedTrainQ['TextWithPhrases'] = phraseTextDataQ

# Apply the phrase learning to test data.
CleanedTestQ['TextWithPhrases'] = ApplyPhraseRewritesInPlace(CleanedTestQ, 'LowercaseText', learnedPhrasesQ)

Start phrase learning with 0 phrases of 200 phrases learned
Iteration 1: Added 42 new phrases in 1.26 seconds (Learned 42 of max 200)
Iteration 2: Added 35 new phrases in 1.29 seconds (Learned 77 of max 200)
Iteration 3: Added 32 new phrases in 1.24 seconds (Learned 109 of max 200)
Iteration 4: Added 34 new phrases in 1.21 seconds (Learned 143 of max 200)
Iteration 5: Added 31 new phrases in 1.17 seconds (Learned 174 of max 200)
Iteration 6: Added 11 new phrases in 1.19 seconds (Learned 185 of max 200)
Iteration 7: Added 3 new phrases in 1.09 seconds (Learned 188 of max 200)
Iteration 8: Added 4 new phrases in 1.09 seconds (Learned 192 of max 200)
Iteration 9: Added 1 new phrases in 1.08 seconds (Learned 193 of max 200)
Iteration 10: Added 1 new phrases in 1.10 seconds (Learned 194 of max 200)
Iteration 11: Added 1 new phrases in 0.93 seconds (Learned 195 of max 200)
Iteration 12: Added 1 new phrases in 1.04 seconds (Learned 196 of max 200)
Iteration 13: Added 1 new phrases in 1.07 sec

In [19]:
print("\nHere are some phrases we learned in this part of the tutorial: \n")
print(learnedPhrasesQ[:20])


Here are some phrases we learned in this part of the tutorial: 

['possible duplicate', "i'm trying", 'works fine', 'doing wrong', 'click event', 'following code', 'using jquery', 'uncaught typeerror', 'ajax request', 'global variable', 'div class', 'json object', 'callback function', "i'm not sure", 'anonymous function', 'php file', 'return value', 'user clicks', 'dynamically created', 'input type']


## Reconstruct the Full Processed Text

After rewriting the text with the learned phrases, we reconstruct each question from its chunks of text and store the result in the `TextWithPhrases` field.

In [20]:
# reconstitute the text from separated chunks.
trainQ['TextWithPhrases'] = ReconstituteDocsFromChunks(CleanedTrainQ, 'DocID', 'TextWithPhrases')
testQ['TextWithPhrases'] = ReconstituteDocsFromChunks(CleanedTestQ, 'DocID', 'TextWithPhrases')
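
Conceptually, reconstructing the documents is a group-and-join over the chunk table. The sketch below shows that idea with `pandas.DataFrame.groupby` on toy data. Whether `ReconstituteDocsFromChunks` is implemented this way is an assumption, and the toy frame is hypothetical.

```python
import pandas as pd

# Toy chunk table: two chunks for document 10, one for document 20.
chunks = pd.DataFrame({
    'DocID': [10, 10, 20],
    'TextWithPhrases': [' click_event fires ', ' works_fine now ', ' json_object is empty '],
})

# Join the chunks of each document in their original order,
# trimming the padding spaces added before phrase learning.
docs = (chunks.groupby('DocID', sort=False)['TextWithPhrases']
              .apply(lambda parts: ' '.join(p.strip() for p in parts)))
print(docs.loc[10])
```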

## Tokenize Text with Learned Phrases

We learn a vocabulary by applying text exclusion criteria, such as stop words, non-alphabetic words, and words below a word count threshold.

The `TokenizeText` function breaks the reconstituted text into individual tokens and excludes any token that does not exist in the vocabulary.

In [21]:
def TokenizeText(textData, vocabHash):
    tokenizedText = ''
    for token in textData.split():
        if token in vocabHash:
            tokenizedText += (token.strip() + ',')
    return tokenizedText.strip(',')
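
For reference, the same filtering can be written with `str.join`. The sketch below uses a hypothetical name, `tokenize_with_join`, and a toy vocabulary. It should give the same result as `TokenizeText` for ordinary tokens, though it does not reproduce the final `strip(',')`.

```python
def tokenize_with_join(text, vocab):
    """Keep only in-vocabulary tokens and join them with commas."""
    return ','.join(token for token in text.split() if token in vocab)

# Toy vocabulary and input (hypothetical values).
toy_vocab = {'click_event': 1, 'fires': 1, 'twice': 1}
tokens = tokenize_with_join('the click_event fires twice today', toy_vocab)
print(tokens)
```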

In [22]:
# create the vocabulary.
vocabHashQ = CreateVocabForTopicModeling(trainQ['TextWithPhrases'], functionwordHash)

# tokenize the text.
trainQ['Tokens'] = trainQ['TextWithPhrases'].apply(lambda x: TokenizeText(x, vocabHashQ))
testQ['Tokens'] = testQ['TextWithPhrases'].apply(lambda x: TokenizeText(x, vocabHashQ))

Counting words
Building vocab
Excluded 307 stop words
Excluded 911 non-alphabetic words
Excluded 15266 words below word count threshold
Excluded 142 words below doc count threshold
Excluded 3 words above max doc frequency
Final Vocab Size: 3114 words
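
The exclusion steps reported in this log can be sketched as a sequence of filters over word and document counts. Everything here is an assumption made for illustration: the function name `build_vocab`, the threshold values, and the alphabetic test, which is stricter than whatever `CreateVocabForTopicModeling` applies, since tokens such as `self-closing` survive in the real output.

```python
from collections import Counter

def build_vocab(docs, stop_words, min_word_count=2, min_doc_count=2, max_doc_freq=0.9):
    """Return {word: count} after applying the exclusion filters in order."""
    word_counts = Counter()
    doc_counts = Counter()
    for doc in docs:
        words = doc.split()
        word_counts.update(words)
        doc_counts.update(set(words))  # count each word once per document

    vocab = {}
    for word, count in word_counts.items():
        if word in stop_words:
            continue  # stop word
        if not word.replace('_', '').isalpha():
            continue  # non-alphabetic (underscores from learned phrases allowed)
        if count < min_word_count or doc_counts[word] < min_doc_count:
            continue  # too rare overall or in too few documents
        if doc_counts[word] / len(docs) > max_doc_freq:
            continue  # appears in nearly every document
        vocab[word] = count
    return vocab

toy_docs = ['the click_event is slow', 'the click_event fails 404 404',
            'the callback is slow', 'the callback']
vocab = build_vocab(toy_docs, stop_words={'is'})
print(sorted(vocab))
```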


In [23]:
trainQ[['AnswerId', 'Tokens']].head(5)

| Id | AnswerId | Tokens |
|---|---|---|
| 69913 | 69984 | self-closing,script,tags,work,reason,browsers,... |
| 392561 | 69984 | firefox,script,tag,error,adding,basic,script,t... |
| 1297308 | 69984 | weird,javascript/jquery,behavior,possible_dupl... |
| 3352182 | 69984 | html,script,tags,ending,possible_duplicate,t,s... |
| 5355867 | 69984 | loading,scripts,possible_duplicate,don&#39,t,s... |


## Save Outputs to a Share Directory in the Workbench

In [24]:
trainQ.to_csv(os.path.join(workfolder, 'trainQ_part2'), sep='\t', header=True, index=True, index_label='Id')
testQ.to_csv(os.path.join(workfolder, 'testQ_part2'), sep='\t', header=True, index=True, index_label='Id')
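
A later step that reads these files back needs the same separator and index column that were used when writing. The round trip below is a minimal sketch on a toy frame written to a temporary directory. The frame contents are hypothetical, and the sketch does not cover how the real files are read in later parts.

```python
import os
import tempfile

import pandas as pd

# Toy frame shaped like the saved output: 'Id' index plus two columns.
frame = pd.DataFrame({'AnswerId': [69984], 'Tokens': ['script,tags,work']},
                     index=pd.Index([69913], name='Id'))

with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, 'trainQ_part2')
    frame.to_csv(path, sep='\t', header=True, index=True, index_label='Id')
    # Read back with the same separator and index column.
    restored = pd.read_csv(path, sep='\t', index_col='Id')

print(restored.loc[69913, 'Tokens'])
```

Because the file is tab-separated, the commas inside the `Tokens` column need no quoting or escaping.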