Work through http://shop.oreilly.com/product/9780596514235.do

Talk on https://github.com/rdempsey/pyparsing-dcpython

Examples http://pyparsing.wikispaces.com/Examples

In [57]:
from pyparsing import  *   # Word, alphas

greet = Word( alphas ) + "," + Word( alphas ) + "!" # <-- parser defined here
print (greet.parseString( "Hello, World!" ))

['Hello', ',', 'World', '!']


Basic grammar example.

To use the parser, call parseString() on the parser object:

In [16]:
number = Word(nums+".")         # define variables
identifier = Word(alphas, alphanums+"_")

assignmentExpr = identifier + "=" + (identifier | number) # define parser

assignmentTokens = assignmentExpr.parseString("pi=3.14159") # use the grammar to parse the input text

print(assignmentTokens)

['pi', '=', '3.14159']


Update assignmentExpr to use results names (such as lhs and rhs for the left- and right-hand sides of the assignment), so that the fields can be accessed as if they were attributes of the returned ParseResults:

In [15]:
assignmentExpr = identifier.setResultsName("lhs") + "=" +  (identifier | number).setResultsName("rhs")

assignmentTokens = assignmentExpr.parseString( "pi=3.14159" ) 

print(assignmentTokens.rhs, "is assigned to", assignmentTokens.lhs)

3.14159 is assigned to pi
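As a small aside, pyparsing also lets you attach a results name by calling the expression directly, which is shorthand for setResultsName. The sketch below uses that shorthand and shows attribute access, key access, and asDict() on the same parse; the snake_case variable names are illustrative and not from the original notebook.

```python
from pyparsing import Word, alphanums, alphas, nums

number = Word(nums + ".")
identifier = Word(alphas, alphanums + "_")

# expr("name") is shorthand for expr.setResultsName("name")
assignment_expr = identifier("lhs") + "=" + (identifier | number)("rhs")

tokens = assignment_expr.parseString("pi=3.14159")

print(tokens.lhs)        # attribute-style access
print(tokens["rhs"])     # dictionary-style access
print(tokens.asDict())   # only the named fields, as a plain dict
```

The unnamed "=" token still appears in the token list, but it is absent from the dictionary view because it has no results name.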


# Parse table data

In [23]:
file = open('university.txt', 'r')
for line in file:
    print(line)
file.close()

09/04/2004 Virginia 		44 	Temple		14

09/04/2004 LSU				22	Missouri	18

09/09/2004 Troy State 		01	Cambridge	22

01/02/2003 Florida State	55	Oxford		28


In [24]:
file = open('university.txt', 'r')

num = Word(nums)  # define variables
date = num + "/" + num + "/" + num 
schoolName = OneOrMore( Word(alphas) )
score = Word(nums) 

schoolAndScore = schoolName + score   # build up grammar
gameResult = date + schoolAndScore + schoolAndScore  # and grammar

for line in file:
    stats = gameResult.parseString(line) 
    print(stats.asList())

file.close()

['09', '/', '04', '/', '2004', 'Virginia', '44', 'Temple', '14']
['09', '/', '04', '/', '2004', 'LSU', '22', 'Missouri', '18']
['09', '/', '09', '/', '2004', 'Troy', 'State', '01', 'Cambridge', '22']
['01', '/', '02', '/', '2003', 'Florida', 'State', '55', 'Oxford', '28']


The first change we'll make is to combine the tokens returned by date into a single MM/DD/YYYY date string. The pyparsing Combine class does this for us by simply wrapping the composed expression:

In [25]:
file = open('university.txt', 'r')

num = Word(nums)  # define variables
date = Combine( num + "/" + num + "/" + num )
schoolName = OneOrMore( Word(alphas) )
score = Word(nums) 

schoolAndScore = schoolName + score   # build up grammar
gameResult = date + schoolAndScore + schoolAndScore  # and grammar

for line in file:
    stats = gameResult.parseString(line) 
    print(stats.asList())

file.close()

['09/04/2004', 'Virginia', '44', 'Temple', '14']
['09/04/2004', 'LSU', '22', 'Missouri', '18']
['09/09/2004', 'Troy', 'State', '01', 'Cambridge', '22']
['01/02/2003', 'Florida', 'State', '55', 'Oxford', '28']


The next change to make will be to combine the school names, too. Because Combine's default behavior requires that the tokens be adjacent, we will not use it, since some of the school names have embedded spaces. Instead we'll define a routine to be run at parse time to join and return the tokens as a single string.

For this example, we will define a parse action that takes the parsed tokens, uses the string join function, and returns the joined string. This is such a simple parse action that it can be written as a Python lambda. The parse action gets hooked to a particular expression by calling setParseAction, as in:

In [27]:
file = open('university.txt', 'r')

num = Word(nums)  # define variables
date = Combine( num + "/" + num + "/" + num )
schoolName = OneOrMore( Word(alphas) )

schoolName = schoolName.setParseAction( lambda tokens: " ".join(tokens) )  # Use python join in lambda

score = Word(nums) 

schoolAndScore = schoolName + score   # build up grammar
gameResult = date + schoolAndScore + schoolAndScore  # and grammar

for line in file:
    stats = gameResult.parseString(line) 
    print(stats.asList())

file.close()

['09/04/2004', 'Virginia', '44', 'Temple', '14']
['09/04/2004', 'LSU', '22', 'Missouri', '18']
['09/09/2004', 'Troy State', '01', 'Cambridge', '22']
['01/02/2003', 'Florida State', '55', 'Oxford', '28']
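To see why Combine was not used for the school names, the following sketch contrasts the two approaches on a single multi-word name. It is an illustration only: the sample string and variable names are assumptions, and it relies on Combine's default adjacent-token behavior described above.

```python
from pyparsing import Combine, OneOrMore, Word, alphas

# Combine requires its tokens to be adjacent by default,
# so matching stops at the first embedded space.
combined = Combine(OneOrMore(Word(alphas)))

# A parse action receives every matched word and can join them with a space.
joined = OneOrMore(Word(alphas)).setParseAction(lambda tokens: " ".join(tokens))

combined_result = combined.parseString("Troy State").asList()
joined_result = joined.parseString("Troy State").asList()

print(combined_result)
print(joined_result)
```

With Combine only the first word is captured, whereas the parse action returns the whole name as one token.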


Another common use for parse actions is to do additional semantic validation, beyond the basic syntax matching that is defined in the expressions. For instance, the expression for date will accept 03023/808098/29921 as a valid date, and this is certainly not desirable. A parse action to validate the input date could use time.strptime to parse the time string into an actual date:

    time.strptime(tokens[0],"%m/%d/%Y")
    
If strptime fails, then it will raise a ValueError exception. Pyparsing uses its own exception class, ParseException, for signaling whether an expression matched or not. Parse actions can raise their own exceptions to indicate that, even though the syntax matched, some higher-level validation failed. Our validation parse action would look like this:

In [39]:
import time

def validateDateString(tokens):
    try:
        time.strptime(tokens[0], "%m/%d/%Y")
        # print("time is in correct format :" + tokens[0])
    except ValueError:
        raise ParseException("Invalid date string (%s)" % tokens[0])

date = Combine( num + "/" + num + "/" + num )
date.setParseAction(validateDateString)  # attach the validator to the date expression

test_date1 = "01/03/1900"
date.parseString(test_date1)  # valid date, parses without error

test_date2 = "0101/03/1900"
# date.parseString(test_date2)  # raises ParseException: Invalid date string
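A minimal, self-contained sketch of how the validating parse action behaves on good and bad input. The helper is_valid_date and the snake_case names are hypothetical, introduced only for this illustration.

```python
import time

from pyparsing import Combine, ParseException, Word, nums


def validate_date_string(tokens):
    # Syntax already matched; now check that it is a real calendar date.
    try:
        time.strptime(tokens[0], "%m/%d/%Y")
    except ValueError:
        raise ParseException("Invalid date string (%s)" % tokens[0])


num = Word(nums)
date = Combine(num + "/" + num + "/" + num)
date.setParseAction(validate_date_string)


def is_valid_date(text):
    # Hypothetical helper: True if the text parses and passes validation.
    try:
        date.parseString(text)
        return True
    except ParseException:
        return False


print(is_valid_date("01/03/1900"))
print(is_valid_date("03023/808098/29921"))
print(is_valid_date("02/30/2004"))
```

The last two inputs match the digits-and-slashes syntax but are rejected by the parse action, which is the distinction between syntactic matching and semantic validation.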

Another modifier of the parsed results is the pyparsing Group class. Group does not change the parsed tokens; instead, it nests them within a sublist. Group is a useful class for providing structure to the results returned from parsing:

In [40]:
score = Word(nums) 
schoolAndScore = Group( schoolName + score )
gameResult = date + schoolAndScore + schoolAndScore  # and grammar

file = open('university.txt', 'r')
for line in file:
    stats = gameResult.parseString(line) 
    print(stats.asList())
file.close()

['09/04/2004', ['Virginia', '44'], ['Temple', '14']]
['09/04/2004', ['LSU', '22'], ['Missouri', '18']]
['09/09/2004', ['Troy State', '01'], ['Cambridge', '22']]
['01/02/2003', ['Florida State', '55'], ['Oxford', '28']]


Finally, we will add one more parse action to perform the conversion of numeric strings into actual integers. This is a very common use for parse actions, and it also shows how pyparsing can return structured data, not just nested lists of parsed strings. This parse action is also simple enough to implement as a lambda:

In [42]:
score = Word(nums).setParseAction( lambda tokens : int(tokens[0]) )

schoolAndScore = Group( schoolName + score )
gameResult = date + schoolAndScore + schoolAndScore  # and grammar

file = open('university.txt', 'r')
for line in file:
    stats = gameResult.parseString(line) 
    print(stats.asList())
file.close()

['09/04/2004', ['Virginia', 44], ['Temple', 14]]
['09/04/2004', ['LSU', 22], ['Missouri', 18]]
['09/09/2004', ['Troy State', 1], ['Cambridge', 22]]
['01/02/2003', ['Florida State', 55], ['Oxford', 28]]
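Because the scores are now integers and each team is grouped in its own sublist, the results can be used directly in ordinary Python logic. The sketch below assumes the same grammar and adds a hypothetical winner() helper; the sample lines are written inline rather than read from university.txt.

```python
from pyparsing import Combine, Group, OneOrMore, Word, alphas, nums

num = Word(nums)
date = Combine(num + "/" + num + "/" + num)
school_name = OneOrMore(Word(alphas)).setParseAction(lambda tokens: " ".join(tokens))
score = Word(nums).setParseAction(lambda tokens: int(tokens[0]))
school_and_score = Group(school_name + score)
game_result = date + school_and_score + school_and_score


def winner(line):
    # Hypothetical helper: return the higher-scoring school, or None on a tie.
    _, team1, team2 = game_result.parseString(line).asList()
    if team1[1] == team2[1]:
        return None
    if team1[1] > team2[1]:
        return team1[0]
    return team2[0]


print(winner("09/09/2004 Troy State 01 Cambridge 22"))
print(winner("01/02/2003 Florida State 55 Oxford 28"))
```

Had the scores stayed as strings, the comparison would be lexicographic rather than numeric, which is one reason to convert at parse time.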


Use results names to simplify access to specific tokens within the parsed results, to protect your parser from later text and grammar changes, and to cope with the variability of optional data fields. Accessing the parsed fields by list position, as we have done so far, leaves us sensitive to the order of the parsed data.

Instead, we can define names in the grammar that different expressions should use to label the resulting tokens returned by those expressions. To do this, we insert calls to setResultsName into our grammar, so that expressions will label the tokens as they are accumulated into the ParseResults for the overall grammar:

In [51]:
schoolAndScore = Group(schoolName.setResultsName("school") + score.setResultsName("score") )
gameResult = date.setResultsName("date") + schoolAndScore.setResultsName("team1") + schoolAndScore.setResultsName("team2")

file = open('university.txt', 'r')
for line in file:
    stats = gameResult.parseString(line) 
    print("%(date)s %(team1)s %(team2)s" % stats)
    # print(stats.dump())  # can see a dump of hierarchical listing of keys and values
file.close()

09/04/2004 ['Virginia', 44] ['Temple', 14]
09/04/2004 ['LSU', 22] ['Missouri', 18]
09/09/2004 ['Troy State', 1] ['Cambridge', 22]
01/02/2003 ['Florida State', 55] ['Oxford', 28]


ParseResults also implements the keys(), items(), and values() methods, and supports key testing with Python's in keyword.
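A short sketch of the dictionary-like behavior, assuming the same named grammar and one sample line written inline. It also shows that names nest: a named Group can be reached first, then its own named fields.

```python
from pyparsing import Combine, Group, OneOrMore, Word, alphas, nums

num = Word(nums)
date = Combine(num + "/" + num + "/" + num)
school_name = OneOrMore(Word(alphas)).setParseAction(lambda tokens: " ".join(tokens))
score = Word(nums).setParseAction(lambda tokens: int(tokens[0]))

school_and_score = Group(school_name.setResultsName("school")
                         + score.setResultsName("score"))
game_result = (date.setResultsName("date")
               + school_and_score.setResultsName("team1")
               + school_and_score.setResultsName("team2"))

stats = game_result.parseString("09/04/2004 LSU 22 Missouri 18")

print(sorted(stats.keys()))   # names defined at the top level
print("date" in stats)        # key testing with the in keyword
print("venue" in stats)       # a name the grammar never defines
print(stats.team1.school, stats.team2.score)  # nested named fields
```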

To check whether your grammar has processed the entire string, pyparsing provides a class StringEnd (and a built-in expression stringEnd) that you can add to the end of the grammar. This is your way of signifying, "at this point, I expect there to be no more text; this should be the end of the input string." If the grammar has left some part of the input unparsed, then StringEnd will raise a ParseException. Note that if there is trailing whitespace, pyparsing will automatically skip over it before testing for end-of-string.
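The effect of StringEnd can be sketched with the assignment grammar from earlier. The helper parses_completely and the test strings are assumptions made for this illustration.

```python
from pyparsing import ParseException, StringEnd, Word, alphanums, alphas, nums

number = Word(nums + ".")
identifier = Word(alphas, alphanums + "_")
assignment = identifier + "=" + (identifier | number)

# Same grammar, but it must now consume the whole input.
strict_assignment = assignment + StringEnd()


def parses_completely(text):
    # Hypothetical helper: True only if nothing is left unparsed.
    try:
        strict_assignment.parseString(text)
        return True
    except ParseException:
        return False


# Without StringEnd, leftover text is silently ignored.
print(assignment.parseString("pi=3.14159 extra").asList())

print(parses_completely("pi=3.14159   "))      # trailing whitespace is skipped
print(parses_completely("pi=3.14159 extra"))   # leftover text is rejected
```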

pyparsing can also be used to extract data from web pages.
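As a rough sketch of the web-page use case, pyparsing's makeHTMLTags helper builds expressions for an opening and closing tag, and scanString searches a larger text for matches. The HTML snippet and URLs below are made up for the example.

```python
from pyparsing import makeHTMLTags

# makeHTMLTags returns expressions for the opening and closing tag.
a_start, a_end = makeHTMLTags("a")

html = ('<p><a href="https://example.com/one">One</a> and '
        '<a href="https://example.com/two">Two</a></p>')

# scanString yields (tokens, start, end) for every match in the text;
# tag attributes are available as results names on the tokens.
links = [tokens.href for tokens, start, end in a_start.scanString(html)]

print(links)
```

Unlike parseString, scanString does not require the match to begin at the start of the input, which suits pages where the interesting tags are surrounded by other markup.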

# Now build code to parse pegs4.dat file

In [69]:
test1 = ' MEDIUM=AG521ICRU               ,STERNCID=AG521ICRU  '

identifier = Word(alphas, alphanums)

test1parse = "MEDIUM" + "=" + identifier.setResultsName("MEDIUM") + Suppress(',') +  "STERNCID" + "=" + identifier.setResultsName("STERNCID")

RESULT = test1parse.parseString(test1) 
print(RESULT.MEDIUM)

AG521ICRU


In [70]:
file = open('521icru_short.pegs4dat', 'r')
for line in file:
    if line.startswith(" MEDIUM"):   # note the whitespace at the beginning
        RESULT = test1parse.parseString(line) 
        print(RESULT.MEDIUM)
file.close()

AG521ICRU
ROBICRU
