## Building a Custom Search Engine
### Step 1 - Collect, Pre-Process and Augment Content
- Collect relevant documents and decide content structure
- Parse and extract content fields
- Optional augmentation: Extract keyphrases
- Save processed content for indexing in step 2

**Note:** The following example uses a few sample pages from the [Cornell University US Tax Code](https://www.law.cornell.edu/uscode/text/26). The main content section of each page is shared in the "sample" folder. The parsing code below applies to the samples as well as to the original pages of the Cornell University online tax code. The sample pages are provided for demonstration purposes only.

The Cornell University US Tax Code is organized as a sequence of navigation pages leading to the final content pages. The navigation page titles provide valuable information about each content page. Examining the HTML structure of the content pages shows that the navigation path titles are included in each leaf content page and can be parsed directly from it. Alternatively, the navigation path titles can be captured while crawling the navigation links.

The first design decision is to choose the *unit of retrieval*, i.e., the unit of content that will be indexed and retrieved separately in the desired search experience.

Consider the sample page in the picture below. Two options are possible:
- The full page is to be considered as one unit (useful if the objective is to locate a page)
- Content under each highlighted header is a separate unit (useful if the objective is to locate specific answers)

![](sample_page.png?raw=true)
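
To make the two options concrete, the following sketch turns one parsed page into retrieval units. The helper `build_units` and its mode names are hypothetical and serve only to illustrate the trade-off; the subsection texts are abbreviated placeholders.

```python
def build_units(page_title, subsections, mode="split_page"):
    """Turn one parsed page into (title, text) retrieval units.

    subsections is a list of (header, text) pairs, one per highlighted header.
    """
    if mode == "full_page":
        # Option 1: the whole page is a single unit
        full_text = " ".join(body for _, body in subsections)
        return [(page_title, full_text)]
    if mode == "split_page":
        # Option 2: every header becomes its own unit
        return [((page_title + " - " + header) if header else page_title, body)
                for header, body in subsections]
    raise ValueError("Unknown mode: %s" % mode)


# Placeholder content, abbreviated for illustration
page = [("Married individuals filing joint returns and surviving spouses", "(a) ..."),
        ("Heads of households", "(b) ...")]

print(len(build_units("Tax imposed", page, "full_page")))   # one unit per page
print(len(build_units("Tax imposed", page, "split_page")))  # one unit per header
```

With the first option a query can only return the page as a whole, while the second returns the matching subsection and keeps the page title as context in the unit title.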

Let's parse a content page and extract the following fields:
- Chapter title
- Section title
- Subsection title
- Content text
- Content key phrases

In [4]:
# Import base packages
from bs4 import BeautifulSoup
import os, glob, sys, re, string
import pandas as pd

In [5]:
# Let's explore one content page
html = '../sample/html/1.1.1.1.1.1.html'

# Extract page contents (the file is closed as soon as it has been read)
with open(html, 'r') as page:
    soup = BeautifulSoup(page.read(), 'html.parser')

The navigation path titles (appearing at the top of the page) are included in the HTML page under {'class': 'breadcrumb'}. Example:

<ol class="breadcrumb" itemprop="breadcrumb"><li><a href="/uscode/text" title="United States Code">U.S. Code</a> › <a href="/uscode/text/26" rel="usc_sup_01_26" title="Title 26 - INTERNAL REVENUE CODE">Title 26</a> › <a href="/uscode/text/26/subtitle-A" rel="usc_sup_01_26_10_A" title="Subtitle A - Income Taxes">Subtitle A</a> › <a href="/uscode/text/26/subtitle-A/chapter-1" rel="usc_sup_01_26_10_A_20_1" title="Chapter 1 - NORMAL TAXES AND SURTAXES">Chapter 1</a> › <a href="/uscode/text/26/subtitle-A/chapter-1/subchapter-A" rel="usc_sup_01_26_10_A_20_1_30_A" title="Subchapter A - Determination of Tax Liability">Subchapter A</a> › <a href="/uscode/text/26/subtitle-A/chapter-1/subchapter-A/part-I" rel="usc_sup_01_26_10_A_20_1_30_A_40_I" title="Part I - TAX ON INDIVIDUALS">Part I</a> › § 1</li></ol>

In [6]:
# The navigation path titles are included in {'class': 'breadcrumb'}
titles = soup.find('ol', {'class': 'breadcrumb'}).findAll('a')
print('HTML source: %s' % titles)

HTML source: [<a href="/uscode/text" title="United States Code">U.S. Code</a>, <a href="/uscode/text/26" rel="usc_sup_01_26" title="Title 26 - INTERNAL REVENUE CODE">Title 26</a>, <a href="/uscode/text/26/subtitle-A" rel="usc_sup_01_26_10_A" title="Subtitle A - Income Taxes">Subtitle A</a>, <a href="/uscode/text/26/subtitle-A/chapter-1" rel="usc_sup_01_26_10_A_20_1" title="Chapter 1 - NORMAL TAXES AND SURTAXES">Chapter 1</a>, <a href="/uscode/text/26/subtitle-A/chapter-1/subchapter-A" rel="usc_sup_01_26_10_A_20_1_30_A" title="Subchapter A - Determination of Tax Liability">Subchapter A</a>, <a href="/uscode/text/26/subtitle-A/chapter-1/subchapter-A/part-I" rel="usc_sup_01_26_10_A_20_1_30_A_40_I" title="Part I - TAX ON INDIVIDUALS">Part I</a>]


In [7]:
# Extract title texts
for title in titles:
    print(title.get('title'))

United States Code
Title 26 - INTERNAL REVENUE CODE
Subtitle A - Income Taxes
Chapter 1 - NORMAL TAXES AND SURTAXES
Subchapter A - Determination of Tax Liability
Part I - TAX ON INDIVIDUALS


Let's ignore the first two entries in titles, as they are repeated on every content page, and drop the leading label (e.g., *Subtitle A -*) from each remaining title. Then let's use the following definitions for chapter, section and subsection titles:
- Chapter title: Entries 2 and 3 of titles (counting from 0), e.g., Income Taxes - NORMAL TAXES AND SURTAXES
- Section title: All remaining entries in titles, e.g., Determination of Tax Liability - TAX ON INDIVIDUALS
- Subsection title: Use current page title as the base subsection title
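
Dropping the leading label amounts to splitting each title at its separator. The sketch below is one possible variant, not the notebook's own code: the hypothetical helper `strip_prefix` splits only at the first ` - `, so a title that itself contains a hyphen stays whole. The hyphenated title in the demo is made up for illustration.

```python
def strip_prefix(title):
    """Drop the leading label ('Subtitle A', 'Chapter 1', ...) from a breadcrumb title."""
    # Split only on the first ' - ' so hyphens inside the title survive
    parts = title.split(" - ", 1)
    return parts[1].strip() if len(parts) == 2 else title.strip()


titles = ["Subtitle A - Income Taxes", "Chapter 1 - NORMAL TAXES AND SURTAXES"]
chapter_title = " - ".join(strip_prefix(t) for t in titles)
print(chapter_title)

# Made-up titles, used only to show the edge cases
print(strip_prefix("Part X - Short-term example"))
print(strip_prefix("No separator here"))
```

Returning the unchanged title when no separator is found avoids the index error that a plain `split('-')[1]` raises on such titles.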

In [8]:
# Extract chapter, section and subsection titles
chapter_title    = ' - '.join([x.get('title').split('-')[1].strip() for x in titles[2:4]])
section_title    = ' - '.join([x.get('title').split('-')[1].strip() for x in titles[4:]])
subsection_title = soup.find(id='page-title').text.split('-')[1].strip()

print('Chapter title   : %s' % chapter_title)
print('Section title   : %s' % section_title)
print('Subsection title: %s' % subsection_title)

Chapter title   : Income Taxes - NORMAL TAXES AND SURTAXES
Section title   : Determination of Tax Liability - TAX ON INDIVIDUALS
Subsection title: Tax imposed


#### Some utility functions for text processing

In [9]:
# Drop characters outside code points 1..254. Despite the name, this keeps
# Latin-1 symbols such as the section sign, not only 7-bit ASCII.
def strip_non_ascii(s):
    return ''.join(c for c in s if 0 < ord(c) < 255)

# Clean text: remove newlines, compact spaces, strip non_ascii, etc.
def clean_text(text, lowercase=False, nopunct=False):
    # Convert to lowercase
    if lowercase:
        text = text.lower()

    # Remove punctuation
    if nopunct:
        puncts = string.punctuation
        for c in puncts:
            text = text.replace(c, ' ')

    # Strip non-ascii characters
    text = strip_non_ascii(text)
    
    # Replace newlines with spaces, then collapse runs of whitespace
    text = re.sub(r'[\r\n]+', ' ', text)
    text = re.sub(r'\s+', ' ', text)
    return text.strip()
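
A quick check of what the character filter keeps and drops can help when reading the extracted text later on. The sketch below restates the filter so that it runs on its own; the input string is a made-up example.

```python
import re


def strip_non_ascii(s):
    # Same filter as above: keep code points 1..254
    return ''.join(c for c in s if 0 < ord(c) < 255)


# Made-up input: section sign (code point 167), newlines, extra spaces,
# and the breadcrumb arrow (code point 8250)
raw = u"  \u00a71.\r\n Tax   imposed \u203a "
kept = strip_non_ascii(raw)
cleaned = re.sub(r'\s+', ' ', kept).strip()

print(u'\u00a7' in kept)   # the section sign survives the filter
print(u'\u203a' in kept)   # the breadcrumb arrow is dropped
print(cleaned == u'\u00a71. Tax imposed')
```

This is why the section sign still appears in the extracted content, while symbols outside Latin-1 disappear.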

### Option #1 - Extract all page content

In [10]:
# Get content from all subsections in page at once
def get_content_all(soup):
    section = soup.find("div", { "class" : "section" })
    section_text = ''
    
    # If page is empty, return
    if section is None:
        return section_text
    
    divs = section.findAll('div')
    for div in divs:
        # Skip the 'sourceCredit' div; divs without a class attribute are kept
        if (div.get('class') or [''])[0] != 'sourceCredit':
            # Fix a formatting issue causing some text to be collated when extracted
            for sp in div.findAll("span", { "class" : "chapeau" }):
                sp.replaceWith('<sp>' + sp.text)
            section_text += div.text.replace('<sp>', ' ')
    
    # Clean text, do not convert to lowercase or remove punctuation (default)
    section_text = clean_text(section_text, lowercase=False, nopunct=False)
    
    return section_text

In [11]:
content = get_content_all(soup)
print('Content text:\n%s' % content)

Content text:
§1. Tax imposed (a) Married individuals filing joint returns and surviving spouses There is hereby imposed on the taxable income of (1) every married individual (as defined in section 7703) who makes a single return jointly with his spouse under section 6013, and (2) every surviving spouse (as defined in section 2(a)), a tax determined in accordance with the following table: If taxable income is: The tax is: Not over $36,900 15% of taxable income. Over $36,900 but not over $89,150 $5,535, plus 28% of the excess over $36,900. Over $89,150 but not over $140,000 $20,165, plus 31% of the excess over $89,150. Over $140,000 but not over $250,000 $35,928.50, plus 36% of the excess over $140,000. Over $250,000 $75,528.50, plus 39.6% of the excess over $250,000. (1) every married individual (as defined in section 7703) who makes a single return jointly with his spouse under section 6013, and every married individual (as defined in section 7703) who makes a single return jointly wi

### Keyphrase extraction
Let's extract keyphrases from the page content and examine the results. Many keyphrase extraction algorithms and implementations are available (see the [Text processing portal](http://textprocessing.org/tag/keyphrase-extraction) for some examples).

The following example uses the RAKE algorithm implementation from [https://github.com/aneesha/RAKE](https://github.com/aneesha/RAKE). The base rake.py script is included for convenience (turning off *test* mode).
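
In outline, RAKE splits the text into candidate phrases at stopwords and punctuation, scores each word as its degree (the total length of the candidate phrases it appears in) divided by its frequency, and scores a phrase as the sum of its word scores. The sketch below is a simplified illustration of that scoring, not the code in rake.py; the function name `rake_sketch`, the tokenization rules, and the tiny stopword list are assumptions.

```python
import re
from collections import defaultdict


def rake_sketch(text, stopwords):
    """Score candidate phrases with the RAKE degree/frequency heuristic."""
    stop = set(w.lower() for w in stopwords)
    phrases = []
    # Punctuation ends a candidate phrase
    for fragment in re.split(r'[.,;:!?()\[\]"\n]+', text.lower()):
        current = []
        for word in re.findall(r"[a-z0-9][a-z0-9'-]*", fragment):
            if word in stop:
                # A stopword also ends the current candidate phrase
                if current:
                    phrases.append(current)
                current = []
            else:
                current.append(word)
        if current:
            phrases.append(current)

    freq = defaultdict(int)
    degree = defaultdict(int)
    for phrase in phrases:
        for word in phrase:
            freq[word] += 1
            degree[word] += len(phrase)  # co-occurrences, counting the word itself

    scores = {}
    for phrase in phrases:
        scores[' '.join(phrase)] = sum(degree[w] / float(freq[w]) for w in phrase)
    # Highest score first, ties broken alphabetically
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))


sample = ("Married individuals filing joint returns and surviving spouses. "
          "Every surviving spouse makes a single return jointly.")
for phrase, score in rake_sketch(sample, ['and', 'every', 'a', 'makes']):
    print('%s -> %f' % (phrase, score))
```

Long phrases made of words that rarely occur elsewhere score highest, which is why multi-word legal terms tend to rise to the top of the RAKE output.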

In [12]:
from rake import *

# Extract keyphrases using RAKE algorithm. Limit results by minimum score.
def get_keyphrases_rake(text, stoplist_path=None, min_score=0):
    if stoplist_path is None:
        stoplist_path = 'SmartStoplist.txt'

    rake = Rake(stoplist_path)
    keywords = rake.run(text)
    phrases = []
    for keyword in keywords:
        score = keyword[1]
        if score >= min_score:
            phrases.append(keyword)

    return phrases

We'll use the stopword list file *SmartStoplist.txt* as the base list and add some of the common words found in the tax code text as stopwords, such as *paragraph, subparagraph, clause, section, subsection*. The custom list is included in the file *SmartStoplist_extended.txt*.
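
An extended list of this kind can be produced by hand or with a few lines of code. The sketch below assumes a stoplist file with one word per line and optional comment lines starting with `#`; the helper `extend_stoplist` and the file names in the demo are hypothetical, and the demo runs on a tiny stand-in file, not on the real *SmartStoplist.txt*.

```python
import os
import tempfile


def extend_stoplist(base_path, extra_words, out_path):
    """Copy a one-word-per-line stoplist and append extra words not already in it."""
    with open(base_path) as f:
        lines = [line.rstrip('\n') for line in f]
    # Comment lines (assumed to start with '#') are copied but not treated as words
    known = set(line.strip().lower() for line in lines if not line.startswith('#'))
    new_words = [w for w in extra_words if w.lower() not in known]
    with open(out_path, 'w') as f:
        f.write('\n'.join(lines + new_words) + '\n')
    return new_words


# Demo on a tiny stand-in file
tmpdir = tempfile.mkdtemp()
base = os.path.join(tmpdir, 'base_stoplist.txt')
with open(base, 'w') as f:
    f.write('#stop word list\na\nthe\nsection\n')

extended = os.path.join(tmpdir, 'extended_stoplist.txt')
added = extend_stoplist(base, ['paragraph', 'section', 'clause'], extended)
print(added)
```

Words already present in the base list are skipped, so the helper can be rerun with a growing list of domain terms without creating duplicates.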

In [13]:
stoplist_file = 'SmartStoplist_extended.txt'
keyphrases = get_keyphrases_rake(content, stoplist_path=stoplist_file, min_score=3)

print('Number of keyphrases = %d' % len(keyphrases))
for keyphrase in keyphrases:
    print('%s -> %f' % keyphrase)

Number of keyphrases = 206
term net capital gain means net capital gain -> 29.659770
term qualified dividend income means dividends received -> 23.802829
term adjusted net capital gain means -> 22.674502
including alaska permanent fund dividends -> 20.534582
term consumer price index means -> 18.835014
term net unearned income means -> 18.605338
term allocable parental tax means -> 17.590043
married individuals filing joint returns -> 17.479487
term 28-percent rate gain means -> 17.295988
married individuals filing separate returns -> 17.254487
term qualified foreign corporation means -> 17.172226
foreign tax credit limitation rules similar -> 16.281332
childs net unearned income bears -> 15.524800
investment income qualified dividend income -> 14.666851
including regulations requiring reporting -> 14.170455
net short-term capital loss -> 13.928969
adjusted net capital gain -> 13.898311
married individuals filing separately -> 13.879487
passive foreign investment company -> 13.868254
a

Now let's try another method using the [PKE keyphrase extraction module](https://github.com/boudinfl/pke). This module includes various algorithm implementations (TFIDF, topic rank, single rank, and KP Miner). A modified version of PKE that supports custom stopword lists and allows a *no stemming* option in pre-processing is available in [this branch](https://github.com/msolhab/pke). Please install PKE from the [branch](https://github.com/msolhab/pke) prior to running the example below, or simply skip it.

In [14]:
import pke

def get_keyphrases_pke(text, stoplist_path=None, postags=None):
    if stoplist_path is None:
        stoplist_path = 'SmartStoplist.txt'
    with open(stoplist_path, 'r') as f:
        stoplist = [f.read()]

    if postags is None:
        postags = ['NN', 'NNS', 'NNP', 'NNPS', 'JJ', 'JJR', 'JJS', 'VBN', 'VBD']

    # PKE expects an input file. Save text to temporary file to proceed.
    infile = 'tmp_%d.txt' % (os.getpid())
    with open(infile, 'wb') as f:
        f.write(text.encode('utf8') + b'\n')

    # Run keyphrase extractor (using TOPICRANK algorithm)
    try:
        extractor = pke.TopicRank(input_file=infile, language='english')
        extractor.read_document(format='raw', stemmer=None)
        extractor.candidate_selection(stoplist=stoplist, pos=postags)
        extractor.candidate_weighting(threshold=0.25, method='average')
        phrases = extractor.get_n_best(300, redundancy_removal=True)
    except Exception:
        phrases = []

    # (Optional) Keep unique keywords only
    #phrases = ' '.join(p for p in set(phrases.split()))
    os.remove(infile)
    return phrases

In [15]:
stoplist_file = 'SmartStoplist_extended.txt'
custom_tags = ['NN', 'NNS', 'NNP', 'NNPS', 'JJ', 'JJR', 'JJS', 'VBN', 'VBD', 'VBG']
keyphrases = get_keyphrases_pke(content, stoplist_path=stoplist_file, postags=custom_tags)

print('Number of keyphrases = %d' % len(keyphrases))
for keyphrase in keyphrases:
    print('%s -> %f' % keyphrase)

Number of keyphrases = 300
section -> 0.045406
paragraph -> 0.022824
subsection -> 0.022061
amount -> 0.019384
taxable year -> 0.017856
purposes -> 0.016114
tax -> 0.014472
excess -> 0.014312
taxable income -> 0.013728
such child -> 0.012697
determined -> 0.012665
case -> 0.012323
respect -> 0.011381
subparagraph -> 0.011070
defined -> 0.010140
percent -> 0.009967
gain -> 0.009923
iii -> 0.008852
regard -> 0.008689
treated -> 0.008612
parent -> 0.008585
sum -> 0.007696
clause -> 0.007369
amount described -> 0.007276
gross income -> 0.006940
general -> 0.006703
account -> 0.006297
substituting -> 0.005523
net capital gain -> 0.005243
such taxable year -> 0.004867
net unearned income -> 0.004822
adjusted net capital gain -> 0.004813
rate -> 0.004746
taxable years beginning -> 0.004685
exchange -> 0.004427
qualified dividend income -> 0.004361
taken -> 0.004276
taxpayer -> 0.004275
tax imposed -> 0.004252
less -> 0.004176
calendar year -> 0.004110
stock -> 0.004110
sale -> 0.003948
applic

### Option #2 - Extract content in each subsection separately

In [16]:
# Get header and content from each subsection separately
def get_content_subsections(soup):
    subs = []
    sections = soup.findAll('div', {'class': 'subsection indent2 firstIndent-2'})
    
    if len(sections) > 0:
        for div in sections:
            sub_header = div.find('span', {'class': 'heading bold'})
            # Check if subsection has a valid title
            if sub_header is not None:
                sub_title  = clean_text(sub_header.text)
            else:
                sub_title  = ''
            # Fix a formatting issue causing some text to be collated when extracted
            for sp in div.findAll("span", { "class" : "chapeau" }):
                sp.replaceWith('<sp>' + sp.text)
            sub_text = div.text.replace('<sp>', ' ')
            # Clean text, do not convert to lowercase or remove punctuation (default)
            sub_text = clean_text(sub_text, lowercase=False, nopunct=False)
            subs.append((sub_title, sub_text))
    else:
        # If page does not contain subsections, parse it as one
        sub_text = get_content_all(soup)
        # Check if page is empty
        if sub_text != '':
            subs.append(('', sub_text))

    return subs

In [17]:
subsections = get_content_subsections(soup)

print('Found %d subsections' % len(subsections))
for i, subsection in enumerate(subsections):
    print('Subsection# %d: %s' % (i, subsection[0]))
    print('%s\n' % subsection[1])

Found 9 subsections
Subsection# 0: Married individuals filing joint returns and surviving spouses
(a) Married individuals filing joint returns and surviving spouses There is hereby imposed on the taxable income of (1) every married individual (as defined in section 7703) who makes a single return jointly with his spouse under section 6013, and (2) every surviving spouse (as defined in section 2(a)), a tax determined in accordance with the following table: If taxable income is: The tax is: Not over $36,900 15% of taxable income. Over $36,900 but not over $89,150 $5,535, plus 28% of the excess over $36,900. Over $89,150 but not over $140,000 $20,165, plus 31% of the excess over $89,150. Over $140,000 but not over $250,000 $35,928.50, plus 36% of the excess over $140,000. Over $250,000 $75,528.50, plus 39.6% of the excess over $250,000.

Subsection# 1: Heads of households
(b) Heads of households There is hereby imposed on the taxable income of every head of a household (as defined in sect

Explore keyphrases in different subsections by changing `sub_ind`.

In [18]:
stoplist_file = 'SmartStoplist_extended.txt'
sub_ind = 0
sub_content = subsections[sub_ind][1]
keyphrases = get_keyphrases_rake(sub_content, stoplist_path=stoplist_file, min_score=1)

print('Number of keyphrases = %d' % len(keyphrases))
for keyphrase in keyphrases:
    print('%s -> %f' % keyphrase)
    
# Combined list of keyphrases to be used for indexing
all_phrases = ', '.join(p[0] for p in keyphrases)
print('\nKeyphrases list: %s' % all_phrases)

Number of keyphrases = 15
married individuals filing joint returns -> 23.500000
single return jointly -> 9.000000
married individual -> 5.500000
taxable income -> 4.000000
surviving spouses -> 4.000000
tax determined -> 3.500000
surviving spouse -> 3.500000
tax -> 1.500000
spouse -> 1.500000
defined -> 1.000000
table -> 1.000000
excess -> 1.000000
imposed -> 1.000000
accordance -> 1.000000
makes -> 1.000000

Keyphrases list: married individuals filing joint returns, single return jointly, married individual, taxable income, surviving spouses, tax determined, surviving spouse, tax, spouse, defined, table, excess, imposed, accordance, makes
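
The `min_score` threshold controls how many phrases end up in the combined list. As a small illustration, the sketch below applies a higher threshold to the first ten (phrase, score) pairs copied from the output above; the helper `keyword_field` is hypothetical.

```python
# First ten (phrase, score) pairs copied from the RAKE output above
keyphrases = [
    ('married individuals filing joint returns', 23.5),
    ('single return jointly', 9.0),
    ('married individual', 5.5),
    ('taxable income', 4.0),
    ('surviving spouses', 4.0),
    ('tax determined', 3.5),
    ('surviving spouse', 3.5),
    ('tax', 1.5),
    ('spouse', 1.5),
    ('defined', 1.0),
]


def keyword_field(pairs, min_score):
    """Join phrases scoring at least min_score into one comma-separated field."""
    return ', '.join(phrase for phrase, score in pairs if score >= min_score)


print(keyword_field(keyphrases, min_score=3))
```

A threshold of 3 keeps only the multi-word phrases here, dropping generic single words such as *tax* and *defined*.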


### Put it all together - Parse, process and prepare all content pages for indexing

In [19]:
def parse_contents(hfile, mode='full_page', stoplist_path=None, min_score=1):
    global df
    infile  = os.path.basename(hfile)
    print('Processing %s' % infile)
    
    # Parse and extract title and sections of interest
    soup  = BeautifulSoup(open(hfile, 'r').read(), 'html.parser')
    
    # The navigation path titles are included in {'class': 'breadcrumb'}
    titles = soup.find('ol', {'class': 'breadcrumb'}).findAll('a')
    
    # Extract chapter, section and subsection titles
    # Check if chapter title is valid - Handle exception cases
    try:
        chapter_title    = ' - '.join([x.get('title').split('-')[1].strip() for x in titles[2:4]])
    except IndexError:
        chapter_title    = ' - '.join([x.get('title').strip() for x in titles[2:4]])
       
    # Check if section title is valid - Handle exception cases
    try:
        section_title    = ' - '.join([x.get('title').split('-')[1].strip() for x in titles[4:]])
    except IndexError:
        section_title    = ' - '.join([x.get('title').strip() for x in titles[4:]])

    # Use page title as the base subsection title
    subsection_title = soup.find(id='page-title').text.split('-')[1].strip()
     
    # Option #1 - Extract all page content as one document
    if mode == 'full_page':
        page_text = get_content_all(soup)
        phrases  = get_keyphrases_rake(page_text, stoplist_path=stoplist_path, min_score=min_score)
        phrases  = ', '.join(p[0] for p in phrases)
        row = {'File'           : infile,
               'ChapterTitle'   : chapter_title.replace('\r', ''),
               'SectionTitle'   : section_title.replace('\r', ''),
               'SubsectionTitle': subsection_title.replace('\r', ''),
               'SubsectionText' : page_text.replace('\r', ''),
               'Keywords'       : phrases.replace('\r', '')}
        # DataFrame.append was removed in pandas 2.0; pd.concat works across versions
        df = pd.concat([df, pd.DataFrame([row])], ignore_index=True)
    
    # Option #2 - Extract header and content from each subsection separately
    elif mode == 'split_page':
        subsections = get_content_subsections(soup)
        for i, subsection in enumerate(subsections):
            # append subsection header to main subsection_title
            sub_title = subsection_title
            if subsection[0] != '':
                sub_title = sub_title + ' - ' + subsection[0]
            sub_text = subsection[1]
            phrases  = get_keyphrases_rake(sub_text, stoplist_path=stoplist_path, min_score=min_score)
            phrases  = ', '.join(p[0] for p in phrases)
            row = {'File'           : infile,
                   'ChapterTitle'   : chapter_title.replace('\r', ''),
                   'SectionTitle'   : section_title.replace('\r', ''),
                   'SubsectionTitle': sub_title.replace('\r', ''),
                   'SubsectionText' : sub_text.replace('\r', ''),
                   'Keywords'       : phrases.replace('\r', '')}
            # DataFrame.append was removed in pandas 2.0; pd.concat works across versions
            df = pd.concat([df, pd.DataFrame([row])], ignore_index=True)
    else:
        print('Invalid parsing mode %s ... Valid options: full_page or split_page' % mode)

    print('Finished processing %s ...' % infile)
    return

Loop over all content pages in the 'sample' folder, then parse and process their content. Save the extracted content fields in an Excel file to be used for indexing in step #2.

In [20]:
INDIR  = '../sample/html'
OUTDIR = '../sample'

# Select parsing option: Option #1 (FULL_PAGE), Option #2 (SPLIT_PAGE), or both
FULL_PAGE  = False
SPLIT_PAGE = True

if not os.path.exists(OUTDIR):
    os.makedirs(OUTDIR)

# Dataframe to keep all extracted content fields
df = pd.DataFrame(columns=['File', 'ChapterTitle', 'SectionTitle', 'SubsectionTitle',
                           'SubsectionText', 'Keywords'], dtype=str)
    
# Set custom stopwords list, if needed
stoplist_file = 'SmartStoplist_extended.txt'

# Process all content pages
for infile in glob.glob(INDIR + '/*.html'):
    if FULL_PAGE:
        parse_contents(infile, mode='full_page',  stoplist_path=stoplist_file, min_score=3)
    if SPLIT_PAGE:
        parse_contents(infile, mode='split_page', stoplist_path=stoplist_file, min_score=1)

# Save extracted content for indexing in step #2
#outfile = OUTDIR + '/parsed_content.tsv'
#df.to_csv(outfile, sep='\t', index_label='Index', encoding='utf-8')    
outxlsx = OUTDIR + '/parsed_content.xlsx'
df.to_excel(outxlsx, index_label='Index')

Processing 1.1.1.1.1.1.html
Finished processing 1.1.1.1.1.1.html ...
Processing 1.1.1.1.1.2.html
Finished processing 1.1.1.1.1.2.html ...
Processing 1.1.1.1.1.3.html
Finished processing 1.1.1.1.1.3.html ...
Processing 1.1.1.1.2.1.html
Finished processing 1.1.1.1.2.1.html ...
Processing 1.1.1.1.4.1.1.html
Finished processing 1.1.1.1.4.1.1.html ...
Processing 1.1.1.1.4.1.2.html
Finished processing 1.1.1.1.4.1.2.html ...
Processing 1.1.1.1.4.1.3.html
Finished processing 1.1.1.1.4.1.3.html ...
Processing 1.1.1.1.4.1.4.html
Finished processing 1.1.1.1.4.1.4.html ...
Processing 1.1.1.1.4.1.5.html
Finished processing 1.1.1.1.4.1.5.html ...
Processing 1.1.1.1.4.1.6.html
Finished processing 1.1.1.1.4.1.6.html ...
Processing 1.1.1.1.4.1.7.html
Finished processing 1.1.1.1.4.1.7.html ...
Processing 1.1.1.1.4.1.8.html
Finished processing 1.1.1.1.4.1.8.html ...
Processing 1.1.1.11.2.1.2.html
Finished processing 1.1.1.11.2.1.2.html ...
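
If Excel output is not required, the tab-separated option that is commented out above can also be produced without pandas. The sketch below is one way to do that under Python 3 with the standard `csv` module; the helper `rows_to_tsv` is hypothetical, and the single row is an abbreviated stand-in for the parsed content.

```python
import csv
import io

FIELDS = ['File', 'ChapterTitle', 'SectionTitle', 'SubsectionTitle',
          'SubsectionText', 'Keywords']

# One abbreviated stand-in row; real rows come from the parsing step
rows = [
    {'File': '1.1.1.1.1.1.html',
     'ChapterTitle': 'Income Taxes - NORMAL TAXES AND SURTAXES',
     'SectionTitle': 'Determination of Tax Liability - TAX ON INDIVIDUALS',
     'SubsectionTitle': 'Tax imposed - Heads of households',
     'SubsectionText': '(b) Heads of households ...',
     'Keywords': 'taxable income, tax determined'},
]


def rows_to_tsv(rows, fields=FIELDS):
    """Serialize row dicts to tab-separated text with a leading Index column."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=['Index'] + list(fields),
                            delimiter='\t', lineterminator='\n')
    writer.writeheader()
    for i, row in enumerate(rows):
        writer.writerow(dict(row, Index=i))
    return buf.getvalue()


tsv = rows_to_tsv(rows)
print(tsv.splitlines()[0])
```

Collecting plain dictionaries and serializing them once at the end also avoids growing a DataFrame row by row.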


In [21]:
df.head(5)

| | File | ChapterTitle | SectionTitle | SubsectionTitle | SubsectionText | Keywords |
|---|---|---|---|---|---|---|
| 0 | 1.1.1.1.1.1.html | Income Taxes - NORMAL TAXES AND SURTAXES | Determination of Tax Liability - TAX ON INDIVI... | Tax imposed - Married individuals filing joint... | (a) Married individuals filing joint returns a... | married individuals filing joint returns, sing... |
| 1 | 1.1.1.1.1.1.html | Income Taxes - NORMAL TAXES AND SURTAXES | Determination of Tax Liability - TAX ON INDIVI... | Tax imposed - Heads of households | (b) Heads of households There is hereby impose... | taxable income, tax determined, tax, defined, ... |
| 2 | 1.1.1.1.1.1.html | Income Taxes - NORMAL TAXES AND SURTAXES | Determination of Tax Liability - TAX ON INDIVI... | Tax imposed - Unmarried individuals (other tha... | (c) Unmarried individuals (other than survivin... | surviving spouse, taxable income, unmarried in... |
| 3 | 1.1.1.1.1.1.html | Income Taxes - NORMAL TAXES AND SURTAXES | Determination of Tax Liability - TAX ON INDIVI... | Tax imposed - Married individuals filing separ... | (d) Married individuals filing separate return... | married individuals filing separate returns, s... |
| 4 | 1.1.1.1.1.1.html | Income Taxes - NORMAL TAXES AND SURTAXES | Determination of Tax Liability - TAX ON INDIVI... | Tax imposed - Estates and trusts | (e) Estates and trusts There is hereby imposed... | taxable income, tax determined, taxable, tax, ... |


#### The content is now ready for indexing in step #2.