# Document (QnA) Matching Data Science Process

## Part 1: Text Preparation and Phrases Learning

### Overview

This notebook is Part 1 of 5, in a series providing a step-by-step description of how to create discriminative training methods to match the correct answer to a given question. Using Python packages and custom code examples, we have implemented the basic framework that combines key phrase learning and latent topic modeling as described in the paper entitled ["Modeling Multiword Phrases with Constrained Phrases Tree for Improved Topic Modeling of Conversational Speech"](http://people.csail.mit.edu/hazen/publications/Hazen-SLT-2012.pdf) which was originally presented in the 2012 IEEE Workshop on Spoken Language Technology.

Although the paper examines the use of the technology for analyzing human-to-human conversations, the techniques are quite general and can be applied to a wide range of natural language data including news stories, legal documents, research publications, social media forum discussions, customer feedback forms, product reviews, and many more.

Also, we implement a Naive Bayes Classifier as described in the paper entitled ["MCE Training Techniques for Topic Identification of Spoken Audio Documents"](http://ieeexplore.ieee.org/abstract/document/5742980/).

Part 1 of the series shows how to pre-process the text data, learn the most salient phrases present in a large collection of documents and save cleaned text data in the Azure Blob Storage. These phrases can be treated as single compound word units in down-stream processes such as discriminative training.

Note: This notebook series are built under Python 3.5 and NLTK 3.2.2.

## Import required Python modules

In this notebook, we use several open-source Python packages that need to be installed in a local machine or an Azure Notebook Server. An upgrade is requested if a previous version of a package has been installed in the past.

We make use of the NLTK sentence tokenization capability which takes a long string of text and splits it into sentence units. The tokenizer requires the installation of the 'punkt' tokenizer models. After importing nltk, the nltk.download() function can be used to download specific packages such as 'punkt'.

In [1]:
# uncomment the below code to install/upgrade the requested Python packages.
# !pip install --upgrade --no-deps smart_open azure pandas nltk

In [1]:
import pandas as pd
import gzip
import requests
import re
import itertools
import nltk
import math
import numpy as np
import csv 
import gc
import operator
import matplotlib.pyplot as plt
import matplotlib
from collections import (namedtuple, Counter)
from azure.storage import CloudStorageAccount
from IPython.display import display

# suppress all warnings
import warnings
warnings.filterwarnings("ignore")

In [3]:
EMPTY = ''
SPACE = ' '
nltk.download("punkt")
NLTK_PUNKT_EN = 'tokenizers/punkt/english.pickle'
SENTENCE_BREAKER = nltk.data.load(NLTK_PUNKT_EN)

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\mez\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\mez\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\mez\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


## Configure blob storage

Configure Azure Blob Storage to retrieve and store datasets.

In [4]:
storage_account_name = 'mezsa'
storage_account_key = 'X1Xwyn5ROxyQa4tmvjSza/Lv5bXLu7cZ1jWyfFhCEBCKFr78onDgFUH05F5iG2aq1IsU+DIooYDbPzKa821FSA=='
account = CloudStorageAccount(account_name=storage_account_name, account_key=storage_account_key)
blob_service = account.create_blob_service()

## Access sample data

We used three sets of data in this series of notebooks. We obtain the raw data from Stack Overflow Database and extract all question-answer pairs related to the "JavaScript" tag. For the question-answer pairs, we consider the following scenarios. 

1. Original Questions (Q): These questions have been asked and answered on the Stack Overflow.
2. Duplications (D): There is a linkage among questions. Some questions that have already been asked by others are linked to the previous/original questions as Duplications. In the Stack Overflow Database, this kind of linkage is determined by "LINK_TYPE_DUPE = 3". Each original question could have 0 to many duplications.
3. Answers (A): For each Original question and its Duplications, we found more than one answers resolved that question. In our analysis, we only select the Accepted answer or the answer with the highest score that resolved the Original question. Therefore, it's 1-to-1 mapping between Original questions and Answers and many-to-1 mapping between Duplications and Original questions. Each Original question and its Duplications has an unique AnswerId.
4. Function Words: we consider a list of words that can only be used in between content words in the creation of phrases. This list of words are also used as Stop Words.

See the below Data Diagram:

<img src="https://raw.githubusercontent.com/Azure/Document_Matching/master/pic/data_diagram.png">

In [13]:
# functions to load .tsv.gz file into Pandas data frame.
def read_csv_gz(url, **kwargs):
    return pd.read_csv(gzip.open(requests.get(url, stream=True).raw, mode='rb'), **kwargs)

def read_data_frame(url, **kwargs):
    return read_csv_gz(url, sep='\t', encoding='utf8', **kwargs).set_index('Id')

# functions to load .txt file into a Python dictionary. 
def load_from_url(fileURL):
    response = requests.get(fileURL, stream=True)
    return response.text.split('\n')

def LoadListAsHash(fileURL):
    listHash = {}
    wordsList = load_from_url(fileURL)
    # Read in lines one by one stripping away extra spaces, 
    # leading spaces, and trailing spaces and inserting each
    # cleaned up line into a hash table
    re1 = re.compile(' +')
    re2 = re.compile('^ +| +$')
    for stringIn in wordsList:
        term = re2.sub("",re1.sub(" ",stringIn.strip('\n')))
        if term != '':
            listHash[term] = 1
    return listHash

In [5]:
# URLs to Original questions, Duplications, Answers and Function Words.
questions_url = 'https://mezsa.blob.core.windows.net/stackoverflow/orig-q.tsv.gz'
dupes_url = 'https://mezsa.blob.core.windows.net/stackoverflow/dup-q.tsv.gz'
answers_url = 'https://mezsa.blob.core.windows.net/stackoverflow/ans.tsv.gz'
function_words_url = 'https://mezsa.blob.core.windows.net/stackoverflow/function_words.txt'

In [14]:
# load datasets.
questions = read_data_frame(questions_url, names=('Id', 'AnswerId', 'Text0', 'CreationDate'))
dupes = read_data_frame(dupes_url, names=('Id', 'AnswerId', 'Text0', 'CreationDate'))
answers = read_data_frame(answers_url, names=('Id', 'Text0'))
# Load the list of non-content bearing function words
functionwordHash = LoadListAsHash(function_words_url)

In [9]:
# examples of Original questions.
questions.head(2)

Unnamed: 0_level_0,AnswerId,Text0,CreationDate
Id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
220231,220233,Accessing the web page's HTTP Headers in JavaS...,2008-10-20 22:54:38.767
391979,810461,Get client IP using just JavaScript?. <p>I nee...,2008-12-24 18:22:30.780


## Text Pre-processing

### Clean up text

Since the raw data is in HTML format, we need to clean up HTML tags and links. We also remove embeded code chunks.

In [17]:
def strip_code(text):
    if not isinstance(text, str): return text
    return re.sub('<pre><code>.*?</code></pre>', EMPTY, text)

def strip_tags(text):
    if not isinstance(text, str): return text
    return re.sub('<[^>]+>', EMPTY, text)

def strip_links(text):
    if not isinstance(text, str): return text
    def replace_link(match):
        return EMPTY if re.match('[a-z]+://', match.group(1)) else match.group(1)
    return re.sub('<a[^>]+>(.*)</a>', replace_link, text)

def clean_text(text):
    return strip_tags(strip_links(strip_code(text)))

In [18]:
for df in (questions, dupes, answers):
    df['Text'] = df['Text0'].apply(clean_text).str.lower()
    df['NumChars'] = df['Text'].str.len()

In [19]:
# examples after the cleaning.
questions.head(2)

Unnamed: 0_level_0,AnswerId,Text0,CreationDate,Text,NumChars
Id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
220231,220233,Accessing the web page's HTTP Headers in JavaS...,2008-10-20 22:54:38.767,accessing the web page's http headers in javas...,284
391979,810461,Get client IP using just JavaScript?. <p>I nee...,2008-12-24 18:22:30.780,get client ip using just javascript?. i need t...,288


### Set data selection criteria

To obtain high quality datasets for learning phrases, we set a threshold of minimum length of characters in the text field. This threshold is considered seperatedly for Original questions, Duplications and Answers. 

For each Original question, we also make sure there are at least 3 linked Duplications so that we have enough data to learn from in the later Notebooks.

In [20]:
# a function to find the AnswerIds has at least 3 dupes.
def find_answerId(answersC, dupesC, num_dupes):
       
    countHash = {}
    for i in dupesC.AnswerId:
        if i not in answersC.index.values:
            continue
        if i not in countHash.keys():
            countHash[i] = 1
        else:
            countHash[i] += 1
            
    countHash = {k: v for k, v in countHash.items() if v >= num_dupes}
    commonAnswerId = countHash.keys()
    
    return commonAnswerId

# a function to extract data based on the selection criteria.
def select_data(questions, dupes, answers):
    # exclude the records without any text
    questions_nz = questions.query('NumChars > 0')
    dupes_nz = dupes.query('NumChars > 0')
    answers_nz = answers.query('NumChars > 0')

    # get the 10th percentile of text length as the minimum length of characters to consider in the text field
    minLenQ = questions_nz.quantile(.1)['NumChars']
    minLenD = dupes_nz.quantile(.1)['NumChars']
    minLenA = answers_nz.quantile(.1)['NumChars']
    
    # eliminate records with text less than the minimum length
    questionsC = questions.query('NumChars >' + str(int(minLenQ)))
    dupesC = dupes.query('NumChars >' + str(minLenD))
    answersC = answers.query('NumChars >' + str(minLenA))
    
    # make sure Questions 1:1 match with Answers 
    matches = questionsC.merge(answersC, left_on = 'AnswerId', right_index = True)
    questionsC = matches[['AnswerId', 'Text0_x', 'CreationDate', 'Text_x', 'NumChars_x']]
    questionsC.columns = ['AnswerId', 'Text0', 'CreationDate', 'Text', 'NumChars']

    answersC = matches[['Text0_y', 'Text_y', 'NumChars_y']]
    answersC.index = matches['AnswerId']
    answersC.columns = ['Text0', 'Text', 'NumChars']
    
    # find the AnswerIds has at least 3 dupes
    commonAnswerId = find_answerId(answersC, dupesC, 3)
    
    # select the records with those AnswerIds
    questionsC = questionsC.loc[questionsC.AnswerId.isin(commonAnswerId)]
    dupesC = dupesC.loc[dupesC.AnswerId.isin(commonAnswerId)]
    answersC = answersC.loc[commonAnswerId] 
    
    return questionsC, dupesC, answersC

In [21]:
questionsC, dupesC, answersC = select_data(questions, dupes, answers)

## Prepare training and scoring datasets

We split questions based on the creation date so that the training and the test sets are defined as below.
1. training set = Original quesiton + 75% of oldest Duplications per Original question
2. test set = remaining 25% of Duplications per Original question

In [22]:
# a function to split Original questions and their Duplications into training and test sets.
def split_data(questions, dupes, frac):
    trainQ = questions
    testQ = pd.DataFrame(columns = dupes.columns.values) # create an empty data frame
    for answerId in np.unique(dupes.AnswerId):
        df = dupes.query('AnswerId == ' + str(answerId))
        totalCount = len(df)
        splitPoint = int(totalCount * frac)
        dfSort = df.sort_values(by = ['CreationDate'])
        trainQ = trainQ.append(dfSort.head(splitPoint)) # oldest N percent of duplications
        testQ = testQ.append(dfSort.tail(totalCount - splitPoint))
    return trainQ, testQ

In [23]:
# prepare training and test
trainQ, testQ = split_data(questionsC, dupesC, 0.75)

## Process text data

The CleanAndSplitText function below takes as input a list where each row element is a single cohesive long string of text, i.e. a "document". The function first splits each string by various forms of punctuation into chunks of text that are likely sentences, phrases or sub-phrases. The splitting is designed to prohibit the phrase learning process from using cross-sentence or cross-phrase word strings when learning phrases.

The function creates a table where each row represents a chunk of text from the original documents. The DocIndex coulmn indicates the original row index from associated document in the input from which the chunk of text originated. The TextLine column contains the original text excluding the punctuation marks and HTML markup that have been during the cleaning process.The TextLineLower column contains a fully lower-cased verion of the text in the TextLIne column.

In [24]:
def CleanAndSplitText(frame):

    textDataOut = [] 

    # This regular expression is for punctuation that we wish to clean out
    # We also will split sentences into smaller phrase like units using this expression
    rePhraseBreaks = re.compile("[\"\!\?\)\]\}\,\:\;\*\-]*\s+\([0-9]+\)\s+[\(\[\{\"\*\-]*"   #                           
                                "|[\"\!\?\)\]\}\,\:\;\*\-]+\s+[\(\[\{\"\*\-]*"
                                "|\.\.+"       # ..
                                "|\s*\-\-+\s*" # --
                                "|\s+\-\s+"    # -  
                                "|\:\:+"       # ::
                                "|\s+[\/\(\[\{\"\-\*]+\s*"  
                                "|[\,!\?\"\)\(\]\[\}\{\:\;\*](?=[a-zA-Z])"
                                "|[\"\!\?\)\]\}\,\:\;]+[\.]*$"
                             )
    
    # Regex for underbars
    regexUnderbar = re.compile('_|_+')
    
    # Regex for space
    regexSpace = re.compile(' +')
 
    # Regex for sentence final period
    regexPeriod = re.compile("\.$")
    
    # Regex for parentheses
    regexParentheses = re.compile("\(\$?")
    
    # Regex for equal sign
    regexEqual = re.compile("=")

    # Iterate through each document and do:
    #    (1) Split documents into sections based on section headers and remove section headers
    #    (2) Split the sections into sentences using NLTK sentence tokenizer
    #    (3) Further split sentences into phrasal units based on punctuation and remove punctuation
    #    (4) Remove sentence final periods when not part of a abbreviation 

    for i in range(0,len(frame)):
        
        # Extract one document from frame
        docID = frame.index.values[i]
        docText = frame['Text'].iloc[i] 

        # Set counter for output line count for this document
        lineIndex=0

        sentences = SENTENCE_BREAKER.tokenize(docText)
        
        for sentence in sentences:

            # Split each sentence into phrase level chunks based on punctuation
            textSegs = rePhraseBreaks.split(sentence)
            numSegs = len(textSegs)

            for j in range(0,numSegs):
                if len(textSegs[j])>0:
                    # Convert underbars to spaces 
                    # Underbars are reserved for building the compound word phrases                   
                    textSegs[j] = regexUnderbar.sub(" ",textSegs[j])
                    
                    # Split out the words so we can specially handle the last word
                    words = regexSpace.split(textSegs[j])
                    
                    # Remove parentheses and equal signs
                    words = [regexEqual.sub("", regexParentheses.sub("", w)) for w in words]
                    
                    phraseOut = ""
                    last = len(words) -1
                    for i in range(0, last):
                        phraseOut += words[i] + " "
                    # If the last word ends in a period then remove the period
                    lastWord = regexPeriod.sub("", words[last])
                    # If the last word is an abbreviation like "U.S."
                    # then add the word final perios back on
                    if "\." in lastWord:
                        lastWord += "."
                    phraseOut += lastWord    

                    textDataOut.append([docID,lineIndex,phraseOut, phraseOut.lower()])
                    lineIndex += 1
                        
    # Convert to pandas frame 
    frameOut = pd.DataFrame(textDataOut, columns=['DocID','DocLine','CleanedText', 'LowercaseText'])                      
    
    return frameOut

In [35]:
CleanedTrainQ = CleanAndSplitText(trainQ)
CleanedTestQ = CleanAndSplitText(testQ)
CleanedAnswers = CleanAndSplitText(answersC)

In [36]:
CleanedTrainQ.head(2)

Unnamed: 0,DocID,DocLine,CleanedText,LowercaseText
0,220231,0,accessing the web page's http headers in javas...,accessing the web page's http headers in javas...
1,220231,1,how do i access a page's http response headers...,how do i access a page's http response headers...


## Compute N-gram Statistics for Phrase Learning 

In [28]:
# This is Step 1 for each iteration of phrase learning
# We count the number of occurances of all 2-gram, 3-ngram, and 4-gram
# word sequences 
def ComputeNgramStats(textData,functionwordHash,blacklistHash):
    
    # Create an array to store the total count of all ngrams up to 4-grams
    # Array element 0 is unused, element 1 is unigrams, element 2 is bigrams, etc.
    ngramCounts = [0]*5;
       
    # Create a list of structures to tabulate ngram count statistics
    # Array element 0 is the array of total ngram counts,
    # Array element 1 is a hash table of individual unigram counts
    # Array element 2 is a hash table of individual bigram counts
    # Array element 3 is a hash table of individual trigram counts
    # Array element 4 is a hash table of individual 4-gram counts
    ngramStats = [ngramCounts, {}, {}, {}, {}]
          
    # Create a regular expression for assessing validity of words
    # for phrase modeling. The expression says words in phrases
    # must either:
    # (1) contain an alphabetic character, or 
    # (2) be the single charcater '&', or
    # (3) be a one or two digit number
    reWordIsValid = re.compile('[A-Za-z]|^&$|^\d\d?$')
    
    # Go through the text data line by line collecting count statistics
    # for all valid n-grams that could appear in a potential phrase
    numLines = len(textData)
    for i in range(0, numLines):

        # Split the text line into an array of words
        wordArray = textData[i].split()
        numWords = len(wordArray)
        
        # Create an array marking each word as valid or invalid
        validArray = [];
        for word in wordArray:
            validArray.append(reWordIsValid.match(word) != None)        
            
        # Tabulate total raw ngrams for this line into counts for each ngram bin
        # The total ngrams counts include the counts of all ngrams including those
        # that we won't consider as parts of phrases
        for j in range(1,5):
            if j<=numWords:
                ngramCounts[j] += numWords - j + 1 
        
        # Collect counts for viable phrase ngrams and left context sub-phrases
        for j in range(0,numWords):
            word = wordArray[j]

            # Only bother counting the ngrams that start with a valid content word
            # i.e., valids words not in the function word list or the black list
            if ( ( word not in functionwordHash ) and ( word not in blacklistHash ) and validArray[j] ):

                # Initialize ngram string with first content word and add it to unigram counts
                ngramSeq = word 
                if ngramSeq in ngramStats[1]:
                    ngramStats[1][ngramSeq] += 1
                else:
                    ngramStats[1][ngramSeq] = 1

                # Count valid ngrams from bigrams up to 4-grams
                stop = 0
                k = 1
                while (k<4) and (j+k<numWords) and not stop:
                    n = k + 1
                    nextNgramWord = wordArray[j+k]
                    # Only count ngrams with valid words not in the blacklist
                    if ( validArray[j+k] and nextNgramWord not in blacklistHash ):
                        ngramSeq += " " + nextNgramWord
                        if ngramSeq in ngramStats[n]:
                            ngramStats[n][ngramSeq] += 1
                        else:
                            ngramStats[n][ngramSeq] = 1 
                        k += 1
                        if nextNgramWord not in functionwordHash:
                            # Stop counting new ngrams after second content word in 
                            # ngram is reached and ngram is a viable full phrase
                            stop = 1
                    else:
                        stop = 1
    return ngramStats

## Rank Potential Phrases by the Weighted Pointwise Mutual Information of their Constituent Words

In [29]:
def RankNgrams(ngramStats,functionwordHash,minCount):
    # Create a hash table to store weighted pointwise mutual 
    # information scores for each viable phrase
    ngramWPMIHash = {}
        
    # Go through each of the ngram tables and compute the phrase scores
    # for the viable phrases
    for n in range(2,5):
        i = n-1
        for ngram in ngramStats[n].keys():
            ngramCount = ngramStats[n][ngram]
            if ngramCount >= minCount:
                wordArray = ngram.split()
                # If the final word in the ngram is not a function word then
                # the ngram is a valid phrase candidate we want to score
                if wordArray[i] not in functionwordHash: 
                    leftNgram = wordArray[0]
                    for j in range(1,i):
                        leftNgram += ' ' + wordArray[j]
                    rightWord = wordArray[i]
                    
                    # Compute the weighted pointwise mutual information (WPMI) for the phrase
                    probNgram = float(ngramStats[n][ngram])/float(ngramStats[0][n])
                    probLeftNgram = float(ngramStats[n-1][leftNgram])/float(ngramStats[0][n-1])
                    probRightWord = float(ngramStats[1][rightWord])/float(ngramStats[0][1])
                    WPMI = probNgram * math.log(probNgram/(probLeftNgram*probRightWord));

                    # Add the phrase into the list of scored phrases only if WMPI is positive
                    if WPMI > 0:
                        ngramWPMIHash[ngram] = WPMI  
    
    # Create a sorted list of the phrase candidates
    rankedNgrams = sorted(ngramWPMIHash, key=ngramWPMIHash.__getitem__, reverse=True)

    # Force a memory clean-up
    ngramWPMIHash = None
    gc.collect()

    return rankedNgrams

## Apply Phrase Rewrites to Train Data

In [30]:
def ApplyPhraseRewrites(rankedNgrams,textData,learnedPhrases,                 
                        maxPhrasesToAdd,maxPhraseLength,verbose):

    # This function will consider at most maxRewrite 
    # new phrases to be added into the learned phrase 
    # list as specified by the calling fuinction
    maxRewrite=maxPhrasesToAdd

    # If the remaining number of proposed ngram phrases is less 
    # than the max allowed, then reset maxRewrite to the size of 
    # the proposed ngram phrases list
    numNgrams = len(rankedNgrams)
    if numNgrams < maxRewrite:
        maxRewrite = numNgrams
    
    # Create empty hash tables to keep track of phrase overlap conflicts
    leftConflictHash = {}
    rightConflictHash = {}
    
    # Create an empty hash table collecting the set of rewrite rules
    # to be applied during this iteration of phrase learning
    ngramRewriteHash = {}
    
    # Precompile the regex for finding spaces in ngram phrases
    regexSpace = re.compile(' ')

    # Initialize some bookkeeping variables
    numLines = len(textData)
    numPhrasesAdded = 0
    numConsidered = 0
    lastSkippedNgram = ""
    lastAddedNgram = ""
  
    # Collect list up to maxRewrite ngram phrase rewrites
    stop = False
    index = 0
    while not stop:

        # Get the next phrase to consider adding to the phrase list
        inputNgram = rankedNgrams[index]

        # Create the output compound word version of the phrase
        # The extra space is added to make the regex rewrite easier
        outputNgram = " " + regexSpace.sub("_",inputNgram)

        # Count the total number of words in the proposed phrase
        numWords = len(outputNgram.split("_"))

        # Only add phrases that don't exceed the max phrase length
        if (numWords <= maxPhraseLength):
    
            # Keep count of phrases considered for inclusion during this iteration
            numConsidered += 1

            # Extract the left and right words in the phrase to use
            # in checks for phrase overlap conflicts
            ngramArray = inputNgram.split()
            leftWord = ngramArray[0]
            rightWord = ngramArray[len(ngramArray)-1]

            # Skip any ngram phrases that conflict with earlier phrases added
            # These ngram phrases will be reconsidered in the next iteration
            if (leftWord in leftConflictHash) or (rightWord in rightConflictHash): 
                if verbose: 
                    print ("(%d) Skipping (context conflict): %s" % (numConsidered,inputNgram))
                lastSkippedNgram = inputNgram
                
            # If no conflict exists then add this phrase into the list of phrase rewrites     
            else: 
                if verbose:
                    print ("(%d) Adding: %s" % (numConsidered,inputNgram))
                ngramRewriteHash[" " + inputNgram] = outputNgram
                learnedPhrases.append(inputNgram) 
                lastAddedNgram = inputNgram
                numPhrasesAdded += 1
            
            # Keep track of all context words that might conflict with upcoming
            # propose phrases (even when phrases are skipped instead of added)
            leftConflictHash[rightWord] = 1
            rightConflictHash[leftWord] = 1

            # Stop when we've considered the maximum number of phrases per iteration
            if ( numConsidered >= maxRewrite ):
                stop = True
            
        # Increment to next phrase
        index += 1
    
        # Stop if we've reached the end of the ranked ngram list
        if index >= len(rankedNgrams):
            stop = True

    # Now do the phrase rewrites over the entire set of text data
    if numPhrasesAdded == 1:
        # If only one phrase to add use a single regex rule to do this phrase rewrite        
        inputNgram = " " + lastAddedNgram
        outputNgram = ngramRewriteHash[inputNgram]
        regexNgram = re.compile (r'%s(?= )' % re.escape(inputNgram)) 
        # Apply the regex over the full data set
        for j in range(0,numLines):
            textData[j] = regexNgram.sub(outputNgram, textData[j])
    elif numPhrasesAdded > 1:
        # Compile a single regex rule from the collected set of phrase rewrites for this iteration
        ngramRegex = re.compile(r'%s(?= )' % "|".join(map(re.escape, ngramRewriteHash.keys())))
        # Apply the regex over the full data set
        for i in range(0,len(textData)):
            # The regex substituion looks up the output string rewrite  
            # in the hash table for each matched input phrase regex
            textData[i] = ngramRegex.sub(lambda mo: ngramRewriteHash[mo.string[mo.start():mo.end()]], textData[i]) 
      
    return

## Run the full iterative phrase learning process

In [31]:
def ApplyPhraseLearning(textData,learnedPhrases,learningSettings):
    
    stop = 0
    iterNum = 0

    # Get the learning parameters from the structue passed in by thee calling function
    maxNumPhrases = learningSettings.maxNumPhrases
    maxPhraseLength = learningSettings.maxPhraseLength
    functionwordHash = learningSettings.functionwordHash
    blacklistHash = learningSettings.blacklistHash
    verbose = learningSettings.verbose
    minCount = learningSettings.minInstanceCount
    
    # Start timing the process
    functionStartTime = time.clock()
    
    numPhrasesLearned = len(learnedPhrases)
    print ("Start phrase learning with %d phrases of %d phrases learned" % (numPhrasesLearned,maxNumPhrases))

    while not stop:
        iterNum += 1
                
        # Start timing this iteration
        startTime = time.clock()
 
        # Collect ngram stats
        ngramStats = ComputeNgramStats(textData,functionwordHash,blacklistHash)

        # Rank ngrams
        rankedNgrams = RankNgrams(ngramStats,functionwordHash,minCount)
        
        # Incorporate top ranked phrases into phrase list
        # and rewrite the text to use these phrases
        maxPhrasesToAdd = maxNumPhrases - numPhrasesLearned
        if maxPhrasesToAdd > learningSettings.maxPhrasesPerIter:
            maxPhrasesToAdd = learningSettings.maxPhrasesPerIter
        ApplyPhraseRewrites(rankedNgrams,textData,learnedPhrases,maxPhrasesToAdd,maxPhraseLength,verbose)
        numPhrasesAdded = len(learnedPhrases) - numPhrasesLearned

        # Garbage collect
        ngramStats = None
        rankedNgrams = None
        gc.collect();
               
        elapsedTime = time.clock() - startTime

        numPhrasesLearned = len(learnedPhrases)
        print ("Iteration %d: Added %d new phrases in %.2f seconds (Learned %d of max %d)" % 
               (iterNum,numPhrasesAdded,elapsedTime,numPhrasesLearned,maxNumPhrases))
        
        if numPhrasesAdded >= maxPhrasesToAdd or numPhrasesAdded == 0:
            stop = 1
        
    # Remove the space padding at the start and end of each line
    regexSpacePadding = re.compile('^ +| +$')
    for i in range(0,len(textData)):
        textData[i] = regexSpacePadding.sub("",textData[i])
    
    gc.collect()
 
    elapsedTime = time.clock() - functionStartTime
    elapsedTimeHours = elapsedTime/3600.0;
    print ("*** Phrase learning completed in %.2f hours ***" % elapsedTimeHours) 

    return

In [32]:
# Create a structure defining the settings and word lists used during the phrase learning
learningSettings = namedtuple('learningSettings',['maxNumPhrases','maxPhrasesPerIter',
                                                  'maxPhraseLength','minInstanceCount'
                                                  'functionwordHash','blacklistHash','verbose'])

# If true it prints out the learned phrases to stdout buffer
# while its learning. This will generate a lot of text to stdout, 
# so best to turn this off except for testing and debugging
learningSettings.verbose = False

# Maximium number of phrases to learn
# If you want to test the code out quickly then set this to a small
# value (e.g. 100) and set verbose to true when running the quick test
learningSettings.maxNumPhrases = 200

# Maximum number of phrases to learn per iteration 
# Increasing this number may speed up processing but will affect the ordering of the phrases 
# learned and good phrases could be by-passed if the maxNumPhrases is set to a small number
learningSettings.maxPhrasesPerIter = 50

# Maximum number of words allowed in the learned phrases 
learningSettings.maxPhraseLength = 7

# Minimum number of times a phrase must occur in the data to 
# be considered during the phrase learning process
learningSettings.minInstanceCount = 5

# This is a precreated hash table containing the list 
# of function words used during phrase learning
learningSettings.functionwordHash = functionwordHash

# This is a precreated hash table containing the list 
# of black list words to be ignored during phrase learning
learningSettings.blacklistHash = {}

In [44]:
###### Questions:
# Initialize an empty list of learned phrases
# If you have completed a partial run of phrase learning
# and want to add more phrases, you can use the pre-learned 
# phrases as a starting point instead and the new phrases
# will be appended to the list
learnedPhrasesQ = []

# Create a copy of the original text data that will be used during learning
# The copy is needed because the algorithm does in-place replacement of learned
# phrases directly on the text data structure it is provided
phraseTextDataQ = []
for textLine in CleanedTrainQ['LowercaseText']:
    phraseTextDataQ.append(' ' + textLine + ' ')

# Run the phrase learning algorithm
ApplyPhraseLearning(phraseTextDataQ,learnedPhrasesQ,learningSettings)

# Add text with learned phrases back into data frame
CleanedTrainQ['TextWithPhrases'] = phraseTextDataQ

Start phrase learning with 0 phrases of 200 phrases learned
Iteration 1: Added 41 new phrases in 2.02 seconds (Learned 41 of max 200)
Iteration 2: Added 39 new phrases in 1.93 seconds (Learned 80 of max 200)
Iteration 3: Added 43 new phrases in 2.00 seconds (Learned 123 of max 200)
Iteration 4: Added 44 new phrases in 1.98 seconds (Learned 167 of max 200)
Iteration 5: Added 27 new phrases in 1.89 seconds (Learned 194 of max 200)
Iteration 6: Added 5 new phrases in 1.97 seconds (Learned 199 of max 200)
Iteration 7: Added 1 new phrases in 1.70 seconds (Learned 200 of max 200)
*** Phrase learning completed in 0.00 hours ***


In [47]:
###### Answers:
# Initialize an empty list of learned phrases
# If you have completed a partial run of phrase learning
# and want to add more phrases, you can use the pre-learned 
# phrases as a starting point instead and the new phrases
# will be appended to the list
learnedPhrasesA = []

# Create a copy of the original text data that will be used during learning
# The copy is needed because the algorithm does in-place replacement of learned
# phrases directly on the text data structure it is provided
phraseTextDataA = []
for textLine in CleanedAnswers['LowercaseText']:
    phraseTextDataA.append(' ' + textLine + ' ')

# Run the phrase learning algorithm
ApplyPhraseLearning(phraseTextDataA,learnedPhrasesA,learningSettings)

# Add text with learned phrases back into data frame
CleanedAnswers['TextWithPhrases'] = phraseTextDataA

Start phrase learning with 0 phrases of 200 phrases learned
Iteration 1: Added 44 new phrases in 0.45 seconds (Learned 44 of max 200)
Iteration 2: Added 42 new phrases in 0.48 seconds (Learned 86 of max 200)
Iteration 3: Added 39 new phrases in 0.50 seconds (Learned 125 of max 200)
Iteration 4: Added 34 new phrases in 0.40 seconds (Learned 159 of max 200)
Iteration 5: Added 34 new phrases in 0.39 seconds (Learned 193 of max 200)
Iteration 6: Added 6 new phrases in 0.39 seconds (Learned 199 of max 200)
Iteration 7: Added 1 new phrases in 0.39 seconds (Learned 200 of max 200)
*** Phrase learning completed in 0.00 hours ***


## Apply the Learned Phrases to Test Data

In [48]:
def ApplyPhraseRewritesInPlace(textFrame, textColumnName, phraseRules):
        
    # Get text data column from frame
    textData = textFrame[textColumnName]
    numLines = len(textData)
    
    # initial a list to store output text
    textOutput = [None] * numLines
    
    # Add leading and trailing spaces to make regex matching easier
    for i in range(0,numLines):
        textOutput[i] = " " + textData[i] + " "  

    # Make sure we have phrase to add
    numPhraseRules = len(phraseRules)
    if numPhraseRules == 0: 
        print ("Warning: phrase rule lise is empty - no phrases being applied to text data")
        return

    # Precompile the regex for finding spaces in ngram phrases
    regexSpace = re.compile(' ')
   
    # Initialize some bookkeeping variables

    # Iterate through full set of phrases to find sets of 
    # non-conflicting phrases that can be apply simultaneously
    index = 0
    outerStop = False
    while not outerStop:
       
        # Create empty hash tables to keep track of phrase overlap conflicts
        leftConflictHash = {}
        rightConflictHash = {}
        prevConflictHash = {}
    
        # Create an empty hash table collecting the next set of rewrite rules
        # to be applied during this iteration of phrase rewriting
        phraseRewriteHash = {}
    
        # Progress through phrases until the next conflicting phrase is found
        innerStop = 0
        numPhrasesAdded = 0
        while not innerStop:
        
            # Get the next phrase to consider adding to the phrase list
            nextPhrase = phraseRules[index]            
            
            # Extract the left and right sides of the phrase to use
            # in checks for phrase overlap conflicts
            ngramArray = nextPhrase.split()
            leftWord = ngramArray[0]
            rightWord = ngramArray[len(ngramArray)-1] 

            # Stop if we reach any phrases that conflicts with earlier phrases in this iteration
            # These ngram phrases will be reconsidered in the next iteration
            if ((leftWord in leftConflictHash) or (rightWord in rightConflictHash) 
                or (leftWord in prevConflictHash) or (rightWord in prevConflictHash)): 
                innerStop = True
                
            # If no conflict exists then add this phrase into the list of phrase rewrites     
            else: 
                # Create the output compound word version of the phrase
                                
                outputPhrase = regexSpace.sub("_",nextPhrase);
                
                # Keep track of all context words that might conflict with upcoming
                # propose phrases (even when phrases are skipped instead of added)
                leftConflictHash[rightWord] = 1
                rightConflictHash[leftWord] = 1
                prevConflictHash[outputPhrase] = 1           
                
                # Add extra space to input an output versions of the current phrase 
                # to make the regex rewrite easier
                outputPhrase = " " + outputPhrase
                lastAddedPhrase = " " + nextPhrase
                
                # Add the phrase to the rewrite hash
                phraseRewriteHash[lastAddedPhrase] = outputPhrase
                  
                # Increment to next phrase
                index += 1
                numPhrasesAdded  += 1
    
                # Stop if we've reached the end of the phrases list
                if index >= numPhraseRules:
                    innerStop = True
                    outerStop = True
                    
        # Now do the phrase rewrites over the entire set of text data
        if numPhrasesAdded == 1:
        
            # If only one phrase to add use a single regex rule to do this phrase rewrite        
            outputPhrase = phraseRewriteHash[lastAddedPhrase]
            regexPhrase = re.compile (r'%s(?= )' % re.escape(lastAddedPhrase)) 
        
            # Apply the regex over the full data set
            for j in range(0,numLines):
                textOutput[j] = regexPhrase.sub(outputPhrase, textOutput[j])
       
        elif numPhrasesAdded > 1:
            # Compile a single regex rule from the collected set of phrase rewrites for this iteration
            regexPhrase = re.compile(r'%s(?= )' % "|".join(map(re.escape, phraseRewriteHash.keys())))
            
            # Apply the regex over the full data set
            for i in range(0,numLines):
                # The regex substituion looks up the output string rewrite  
                # in the hash table for each matched input phrase regex
                textOutput[i] = regexPhrase.sub(lambda mo: phraseRewriteHash[mo.string[mo.start():mo.end()]], textOutput[i]) 
    
    # Remove the space padding at the start and end of each line
    regexSpacePadding = re.compile('^ +| +$')
    for i in range(0,len(textOutput)):
        textOutput[i] = regexSpacePadding.sub("",textOutput[i])
    
    return textOutput

In [49]:
CleanedTestQ['TextWithPhrases'] = ApplyPhraseRewritesInPlace(CleanedTestQ, 'LowercaseText', learnedPhrasesQ)

In [50]:
CleanedTestQ.loc[9]

DocID                                  15762825
DocLine                                       9
CleanedText        javascript html code c# code
LowercaseText      javascript html code c# code
TextWithPhrases    javascript html_code c# code
Name: 9, dtype: object

## Reconstruct the Full Processed Text of Each Document and Put it into a New Frame¶

In [39]:
def ReconstituteDocsFromChunks(textData, idColumnName, textColumnName):
    dataOut = []
    
    currentDoc = "";
    currentDocID = "";
    
    for i in range(0,len(textData)):
        textChunk = textData[textColumnName][i]
        docID = textData[idColumnName][i]
        if docID != currentDocID:
            if currentDocID != "":
                dataOut.append(currentDoc)
            currentDoc = textChunk
            currentDocID = docID
        else:
            currentDoc += " " + textChunk
    dataOut.append(currentDoc)
    
    return dataOut

In [51]:
trainQ['TextWithPhrases'] = ReconstituteDocsFromChunks(CleanedTrainQ, 'DocID', 'TextWithPhrases')
testQ['TextWithPhrases'] = ReconstituteDocsFromChunks(CleanedTestQ, 'DocID', 'TextWithPhrases')
answersC['TextWithPhrases'] = ReconstituteDocsFromChunks(CleanedAnswers, 'DocID', 'TextWithPhrases')

## Create the Vocabulary With Filtering Criteria

In [52]:
def CreateVocabForTopicModeling(textData,stopwordHash):

    print ("Counting words")
    numDocs = len(textData) 
    globalWordCountHash = {} 
    globalDocCountHash = {} 
    for textLine in textData:
        docWordCountHash = {}
        for word in textLine.split():
            if word in globalWordCountHash:
                globalWordCountHash[word] += 1
            else:
                globalWordCountHash[word] = 1
            if word not in docWordCountHash: 
                docWordCountHash[word] = 1
                if word in globalDocCountHash:
                    globalDocCountHash[word] += 1
                else:
                    globalDocCountHash[word] = 1

    minWordCount = 5;
    minDocCount = 2;
    maxDocFreq = .25;
    vocabCount = 0;
    vocabHash = {}

    excStopword = 0
    excNonalphabetic = 0
    excMinwordcount = 0
    excNotindochash = 0
    excMindoccount = 0
    excMaxdocfreq =0

    print ("Building vocab")
    for word in globalWordCountHash.keys():
        # Test vocabulary exclusion criteria for each word
        if ( word in stopwordHash ):
            excStopword += 1
        elif ( not re.search(r'[a-zA-Z]', word, 0) ):
            excNonalphabetic += 1
        elif ( globalWordCountHash[word] < minWordCount ):
            excMinwordcount += 1
        elif ( word not in globalDocCountHash ):
            print ("Warning: Word '%s' not in doc count hash") % (word)
            excNotindochash += 1
        elif ( globalDocCountHash[word] < minDocCount ):
            excMindoccount += 1
        elif ( float(globalDocCountHash[word])/float(numDocs) > maxDocFreq ):
            excMaxdocfreq += 1
        else:
            # Add word to vocab
            vocabHash[word]= globalWordCountHash[word];
            vocabCount += 1 
    print ("Excluded %d stop words" % (excStopword))       
    print ("Excluded %d non-alphabetic words" % (excNonalphabetic))  
    print ("Excluded %d words below word count threshold" % (excMinwordcount)) 
    print ("Excluded %d words below doc count threshold" % (excMindoccount))
    print ("Excluded %d words above max doc frequency" % (excMaxdocfreq)) 
    print ("Final Vocab Size: %d words" % vocabCount)
            
    return vocabHash

In [53]:
vocabHashQ = CreateVocabForTopicModeling(trainQ['TextWithPhrases'],functionwordHash)
vocabHashA = CreateVocabForTopicModeling(answersC['TextWithPhrases'],functionwordHash)

Counting words
Building vocab
Excluded 310 stop words
Excluded 1664 non-alphabetic words
Excluded 23968 words below word count threshold
Excluded 215 words below doc count threshold
Excluded 3 words above max doc frequency
Final Vocab Size: 4977 words
Counting words
Building vocab
Excluded 302 stop words
Excluded 524 non-alphabetic words
Excluded 7701 words below word count threshold
Excluded 56 words below doc count threshold
Excluded 2 words above max doc frequency
Final Vocab Size: 2258 words


## Tokenize Text with Learned Phrases

In [54]:
# start by tokenizing the full text string string for each document into list of tokens.
# any token that is in not in the pre-defined set of acceptable vocabulary words is execluded.
def TokenizeText(textData,vocabHash):
    tokenizedText = ''
    for token in textData.split():
        if token in vocabHash:
            tokenizedText += (token.strip() + ',')
    return tokenizedText.strip(',')

In [56]:
trainQ['Tokens'] = trainQ['TextWithPhrases'].apply(lambda x: TokenizeText(x, vocabHashQ))
testQ['Tokens'] = testQ['TextWithPhrases'].apply(lambda x: TokenizeText(x, vocabHashQ))
answersC['Tokens'] = answersC['TextWithPhrases'].apply(lambda x: TokenizeText(x, vocabHashA))

In [57]:
# an example of tokenized text in training set.
trainQ['Tokens'].iloc[0]

"accessing,http,headers,access,page's,http,response,headers,related,question,modified,ask,accessing,two,specific,http,headers,related,access,http_request,header,fields"

## Save cleaned data as .tsv and upload to Azure Blob

In [58]:
def save_upload_data(data, file_path, container_name, blob_name):
    data.to_csv(file_path, sep='\t', header=True, index=True, index_label='Id')
    blob_service.put_block_blob_from_path(container_name=container_name, blob_name=blob_name, file_path=file_path)

In [59]:
# modify the path in below script to upload the datasets to your own Blob Storage.
if False: 
    save_upload_data(trainQ, 'C:\\Users\\mez\\Desktop\\trainQwithTokens.tsv', 'stackoverflow', 'trainQwithTokens.tsv')
    save_upload_data(testQ, 'C:\\Users\\mez\\Desktop\\testQwithTokens.tsv', 'stackoverflow', 'testQwithTokens.tsv')
    save_upload_data(answersC, 'C:\\Users\\mez\\Desktop\\answersCwithTokens.tsv', 'stackoverflow', 'answersCwithTokens.tsv')