<DIV ALIGN=CENTER>

# Introduction to Text Data Parsing
## Professor Robert J. Brunner
  
</DIV>  
-----
-----

## Introduction

In this Notebook we explore how to pull text data of interest out of
both unstructured and structured data sets. First we will review basic
Python tools that can be used for either an initial data exploration
or, in many cases, more advanced data processing tasks.
Next, we review another important tool, regular expressions, which can
simplify the task of finding and selecting specific data in a large
document. Python provides a native implementation of [regular
expressions][re] through the `re` module.

Finally, we move on to semi-structured and structured text data
processing by reviewing the concept of parsing, where we use the
structure of a document to extract contextual information. First, we
introduce the Python email parsing functionality and demonstrate how
to use this built-in library to extract structured and unstructured data
from an email. Next, we move on to parsing structured documents, for
which we use the parsing tool [BeautifulSoup][bs], which provides an
elegant and simple method to parse and access markup such as HTML and
XML. BeautifulSoup was designed to simplify the task of scraping data
from websites, and thus we can use it to parse HTML as well as
XML-based formats such as SVG.


-----
[bs]: http://www.crummy.com/software/BeautifulSoup/
[re]: https://docs.python.org/3/library/re.html

### Text Data Processing

In many cases, we will be presented with unstructured or even
semi-structured text data. For example, Tweet messages, email messages,
or other documents can often be considered as character sequences. In
these cases, we often can perform basic data processing by employing
built-in Python data structures and collections. 

The main tool we can use for text processing is the Python string type,
`str`, and its associated methods. One important point to remember is
that in Python a string is immutable, so any change creates a new
string. This affects the cost of using Python to process large text
data sets, which often leads to other solutions, several of which are
presented later in this Notebook. The `str` type has a number of
[useful methods][pystm]:

- `split`: return a list of token strings that are delimited by a
character, such as a space.

- `find`: return the lowest index in the string where a substring is
located, or `-1` if the substring is not found.

- `replace`: return a new string with all occurrences of a substring
replaced.

- `join`: return a string that combines the input strings, separated by
the string on which the method is called.

- `count`: return the number of non-overlapping instances of a substring.

- `lower`: return a copy of the text converted to lowercase characters.

- `lstrip` / `rstrip`: return a string with the leading/trailing
whitespace (or other specified characters) removed.
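
As a minimal sketch of the methods listed above, the following cell applies them to a short sample string. The sample text is made up purely for illustration and is not taken from the course data files.

```python
# Hypothetical sample text, used only for illustration.
text = "  Docker images and docker containers  "

# Strings are immutable: each method returns a new string
# and leaves the original unchanged.
stripped = text.strip()         # remove leading and trailing whitespace
left_only = text.lstrip()       # remove leading whitespace only
tokens = stripped.split()       # split on runs of whitespace
lowered = stripped.lower()      # lowercase copy of the text

print(tokens)
print(stripped.find("images"))            # index of the first match
print(lowered.count("docker"))            # count after lowercasing
print(stripped.replace("Docker", "VM"))   # replacement is case sensitive
print("-".join(tokens))                   # rejoin tokens with a hyphen
```

Note that `replace` changes only the capitalized `Docker` here, which is one reason to convert text to lowercase before counting words.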

In addition, one can make use of standard [Python sequence
operators][pso] to quickly perform basic text data processing.  Given a
value `v`, integer `n`, and similar typed sequences `s` and `t`:

| Operation | Description |
| ----- | ----- |
| `v in s`| `True` if `v` is in the sequence `s`, otherwise `False`|
| `v not in s`| `False` if `v` is in the sequence `s`, otherwise `True`|
| `s + t`| concatenation of `s` and `t`|
| `s * n` or `n * s`| `n` shallow copies of `s` concatenated|
| `len(s)`| the number of elements in the sequence `s`|
| `min(s)`| the smallest element in the sequence `s`|
| `max(s)`| the largest element in the sequence `s`|
| `s.count(v)`| number of times `v` appears in `s`|
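
The operators in this table can be tried directly on strings, as in the following short sketch. The two sample strings are arbitrary and chosen only for illustration.

```python
# Arbitrary sample strings, used only for illustration.
s = "spam"
t = "eggs"

print("pa" in s)            # for strings, `in` tests for a substring
print("z" not in s)
print(s + t)                # concatenation
print(s * 2)                # repetition
print(len(s))
print(min(s), max(s))       # characters compare by code point
print((s + t).count("s"))   # number of times "s" appears
```

For strings, `min` and `max` compare characters by their Unicode code points, so uppercase letters sort before lowercase ones.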

Finally, Python provides additional [data collection classes][cl] in the
`collections` library, which is part of the standard Python
distribution. Currently, this library introduces the `namedtuple`,
`deque`, `ChainMap`, `Counter`, `OrderedDict`, `defaultdict`,
`UserDict`, `UserList`, and `UserString` classes. In the following code
example, we demonstrate the use of a `Counter` object to perform a
simple word count.

-----
[pystm]: https://docs.python.org/3/library/stdtypes.html#string-methods
[pso]: https://docs.python.org/3/library/stdtypes.html#common-sequence-operations
[cl]: https://docs.python.org/3/library/collections.html

In [1]:
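import collections as cl

# Read the whole message, replacing newlines with spaces
with open("data/email.txt", "r") as myfile:
    msg = myfile.read().replace('\n', ' ')

# Split on whitespace to get a list of word tokens
words = msg.split()

# Count the tokens and show the 25 most common
mr = cl.Counter(words)

print(mr.most_common(25))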

[('a', 6), ('Docker', 5), ('on', 4), ('to', 4), ('the', 4), ('you', 4), ('at', 4), ('this', 4), ('your', 3), ('in', 3), ('new', 3), ('or', 3), ('professor.brunner@gmail.com', 2), ('New', 2), ('101', 2), ('us', 2), ('email', 2), ('Course', 2), ('Robert', 2), ('have', 2), ('course', 2), ('RP', 2), ('Piazza', 2), ('Image', 2), ('if', 2)]


### Regular Expressions

Regular expressions, also called REs or regexes, are expressions that
can be used to match one or more occurrences of a particular pattern.
Regular expressions are not unique to Python; they are used in many
programming languages and in many Unix command line tools like sed,
grep, or awk. [Regular expressions][re] are used in Python through the
`re` module. To build a regular expression, you need to understand the
syntax of the RE language. Once a regular expression is developed, it is
compiled and executed by an engine written in C in order to provide fast
execution.

To begin, most characters in a regular expression simply match
themselves. For example, `python` would match any occurrence of the six
letters `python`, either alone or embedded in another word. There are
several special characters, known as metacharacters, that control the
behaviour of the rest of the regular expression. These metacharacters
are listed in the following table.

| Metacharacter | Meaning | Example |
| ---- | ----- | ----- |
| . |  Matches any character except a newline | `1.3` matches `123`, `1a3`, and `1#3` among others |
| ^ | Matches sequence at the beginning of the line| `^Python` matches `Python` at the beginning of a line |
| $ | Matches sequence at the end of the line | `Python$` matches `Python` at the end of a line |
| * | Matches zero or more occurrences of a pattern | `12*3` matches `13`, `123`, `1223`, etc. |
| + |  Matches one or more occurrences of a pattern | `12+3` matches `123`, `1223`, etc. |
| ? |  Matches zero or one occurrences of a pattern | `12?3` matches `13` and `123` |
| { } | Matches a specified number of repetitions | `{m,n}` means match at least `m` and at most `n` occurrences (with no space after the comma) |
| [ ] | Used to specify a character class | `[a-z]` means match any lower case character |
| \ | Escape character | `\w` means match a word character (alphanumeric or underscore), `\s` means match any whitespace character, and `\\` means match a backslash |
| &#124; | or operator | `A ` &#124; ` B` match either `A` or `B` |
| ( ) | Grouping operator | `(ab)+` matches `ab`, `abab`, etc. |

One additional point to remember is that inside a character class (i.e.,
`[ ]`) many of these metacharacters lose their special meaning, and thus
can be used to match themselves; for example, `[.+]` matches a literal
period or plus sign. One exception is the `^` character, which means
_not_ when it is the first character in a class, so `[^\w]` means match
any character that is not a word character.
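
The following sketch exercises several of the metacharacters from the table with `re.findall`, which returns every non-overlapping match in a string. The sample strings are hypothetical and were chosen to mirror the examples in the table.

```python
import re

# Hypothetical sample strings, chosen to mirror the examples in the table.
any_char = re.findall(r'1.3', '123 1a3 1#3')         # . matches any one character
star = re.findall(r'12*3', '13 123 1223')            # zero or more 2s
plus = re.findall(r'12+3', '13 123 1223')            # one or more 2s
optional = re.findall(r'12?3', '13 123 1223')        # zero or one 2
repeat = re.findall(r'12{2,3}3', '123 1223 12223')   # two or three 2s
anchored = re.findall(r'^Python', 'Python and Python')  # start of string only
either = re.findall(r'cat|dog', 'cat dog bird')      # alternation
grouped = re.fullmatch(r'(ab)+', 'abab') is not None  # repeat a whole group
not_word = re.findall(r'[^\w\s]', 'Hi, there!')      # negated character class
literal = re.findall(r'[.+]', '1.5+2')               # . and + are literal in a class

for name, result in [('any_char', any_char), ('star', star), ('plus', plus),
                     ('optional', optional), ('repeat', repeat),
                     ('anchored', anchored), ('either', either),
                     ('grouped', grouped), ('not_word', not_word),
                     ('literal', literal)]:
    print(name, result)
```

Writing the patterns as raw strings (the `r` prefix) keeps Python from interpreting the backslashes before the regular expression engine sees them.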

To master regular expressions requires a lot of practice, but the
investment is well worth it as they are used in many different contexts
and can greatly simplify otherwise complex tasks. Given a regular
expression, there are several functions that can be used to process text
data.

- `compile`: compiles a regular expression into a reusable pattern object for faster evaluation.
- `search`: finds the first match of a regular expression anywhere in a string.
- `match`: matches a regular expression only at the start of a string.
- `split`: splits the string by matches of a regular expression.
- `sub`: replaces substrings that match a regular expression with a different string.
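
A minimal sketch of these five functions follows. The sample sentence is made up for illustration, and the address pattern is a deliberately simple assumption rather than a complete email address matcher.

```python
import re

# Made-up sample sentence, used only for illustration.
line = "Contact us at team@piazza.com or no-reply@piazza.com"

# Deliberately simple address pattern (an assumption, not a full validator).
address = re.compile(r'[\w.-]+@[\w.-]+')

first = address.search(line).group()   # first match anywhere in the string
at_start = address.match(line)         # None: the line does not start with an address
starts_ok = re.match(r'Contact', line) is not None
pieces = re.split(r'\s+or\s+', line)   # split on the word "or"
masked = address.sub('<address>', line)  # replace every match

print(first)
print(at_start)
print(starts_ok)
print(pieces)
print(masked)
```

The difference between `search` and `match` is a common source of confusion: `match` succeeds only if the pattern matches at the very beginning of the string.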

We can modify our previous string processing example by using a regular
expression to remove punctuation and any other characters that are
neither word characters nor whitespace.

-----
[re]: https://docs.python.org/3/howto/regex.html

In [2]:
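import re

# Match any character that is neither a word character nor whitespace
pattern = re.compile(r'[^\w\s]')
with open("data/email.txt", "r") as myfile:
    msg = myfile.read().replace('\n', ' ')

# Replace the matched characters with spaces, then split into words
words = pattern.sub(' ', msg).split()

# `cl` is the collections module imported in the previous code cell
mr = cl.Counter(words)

print(mr.most_common(25))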

[('a', 6), ('Docker', 5), ('com', 4), ('on', 4), ('to', 4), ('the', 4), ('you', 4), ('at', 4), ('this', 4), ('in', 3), ('your', 3), ('Piazza', 3), ('or', 3), ('new', 3), ('New', 2), ('101', 2), ('us', 2), ('email', 2), ('Course', 2), ('Robert', 2), ('image', 2), ('no', 2), ('have', 2), ('Note', 2), ('piazza', 2)]


-----

### Email Text Parsing

Python provides built-in support for [processing email messages][pem],
which are an often overlooked source of information in data science
projects. The `email` library is part of the standard Python
distribution, and includes support for parsing and constructing email
messages; sending and receiving email is handled by other standard
library modules, such as `smtplib` and `imaplib`. For our purpose, we
simply need to read in text and create an email `message`, which
provides access to the basic email contents. The `message` instance
provides access to the email header information as well as any payload
data.

Normally the payload is the email message, but with multipart messages,
like HTML email messages, an email can have multiple payloads. In the
next several code cells, we create an email `message` by reading an
email from a file (the email should look familiar). We subsequently
explore the Python email message interface to extract email headers and
the message payload, before grabbing the HTML message for later parsing.
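
The same workflow can be sketched on a small, self-contained message. The message text below is entirely made up (the addresses, subject, and `XYZ` boundary are assumptions for illustration); it is not the course email. The sketch also uses `get_body` and `get_content`, which return a part's text with the quoted-printable transfer encoding already decoded.

```python
import email
from email import policy

# A small, made-up multipart message used only for illustration.
raw = (
    "From: Sender <sender@example.com>\n"
    "To: reader@example.com\n"
    "Subject: Test message\n"
    "MIME-Version: 1.0\n"
    'Content-Type: multipart/alternative; boundary="XYZ"\n'
    "\n"
    "--XYZ\n"
    "Content-Type: text/plain; charset=UTF-8\n"
    "Content-Transfer-Encoding: quoted-printable\n"
    "\n"
    "Visit https://example.com/?id=3D42 for de=\n"
    "tails.\n"
    "--XYZ\n"
    "Content-Type: text/html; charset=UTF-8\n"
    "Content-Transfer-Encoding: quoted-printable\n"
    "\n"
    '<html><a href=3D"https://example.com/?id=3D42">details</a></html>\n'
    "--XYZ--\n"
)

msg = email.message_from_string(raw, policy=policy.default)

# Header lookup is case insensitive
print('Subject:', msg['subject'])
print('Multipart:', msg.is_multipart())

# List the content type of each payload part
part_types = [part.get_content_type() for part in msg.iter_parts()]
print(part_types)

# Select the HTML alternative and decode its transfer encoding
html_part = msg.get_body(preferencelist=('html',))
decoded_html = html_part.get_content()
print(decoded_html)
```

In the decoded text, `=3D` has been converted back to `=` and the soft line breaks are gone, which makes the result easier to hand to an HTML parser.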

-----

[pem]: https://docs.python.org/3/library/email.html

In [3]:
import email as em
from email import policy

with open("data/email.eml") as fin:
    msg = em.message_from_file(fin, policy=policy.default)
    
msg.keys()

['Delivered-To',
 'Received',
 'X-Received',
 'Return-Path',
 'Received',
 'Received-SPF',
 'Authentication-Results',
 'DKIM-Signature',
 'DKIM-Signature',
 'Received',
 'Received',
 'Date',
 'From',
 'To',
 'Message-ID',
 'In-Reply-To',
 'References',
 'Subject',
 'MIME-Version',
 'Content-Type',
 'X-SG-EID',
 'X-Feedback-ID']

In [4]:
print('To:', msg['to'])
print('From:', msg['from'])
print('Subject:', msg['subject'])

To: professor.brunner@gmail.com
From: RP 101 on Piazza <no-reply@piazza.com>
Subject: [Instr Note] New Docker Course Image


In [5]:
print(str(msg)[:349])

Delivered-To: professor.brunner@gmail.com
Received: by 10.37.214.196 with SMTP id n187csp2182549ybg;
        Tue, 29 Sep 2015 15:55:13 -0700 (PDT)
X-Received: by 10.107.25.143 with SMTP id 137mr1640798ioz.52.1443567313622;
        Tue, 29 Sep 2015 15:55:13 -0700 (PDT)
Return-Path: <bounces+5126-72bd-professor.brunner=gmail.com@sendgrid.piazza.com>


In [6]:
print([att for att in dir(msg) if '__' not in att])

['_add_multipart', '_body_types', '_charset', '_default_type', '_find_body', '_get_params_preserve', '_headers', '_make_multipart', '_payload', '_unixfrom', 'add_alternative', 'add_attachment', 'add_header', 'add_related', 'as_bytes', 'as_string', 'attach', 'clear', 'clear_content', 'defects', 'del_param', 'epilogue', 'get', 'get_all', 'get_body', 'get_boundary', 'get_charset', 'get_charsets', 'get_content', 'get_content_charset', 'get_content_maintype', 'get_content_subtype', 'get_content_type', 'get_default_type', 'get_filename', 'get_param', 'get_params', 'get_payload', 'get_unixfrom', 'is_attachment', 'is_multipart', 'items', 'iter_attachments', 'iter_parts', 'keys', 'make_alternative', 'make_mixed', 'make_related', 'policy', 'preamble', 'raw_items', 'replace_header', 'set_boundary', 'set_charset', 'set_content', 'set_default_type', 'set_param', 'set_payload', 'set_raw', 'set_type', 'set_unixfrom', 'values', 'walk']


In [7]:
if msg.is_multipart():
    # For a multipart message, the payload is a list of message parts
    data = msg.get_payload()

    print("Text Data:\n---------\n", data[0])

Text Data:
---------
 MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: quoted-printable

Instructor Robert J. Brunner posted a new Note.=20

New Docker Course Image

We generated a new Docker course image. If you want to follow along on your=
 laptop or work on the course Notebooks offline, you should download this n=
ew image by issuing a=C2=A0

docker pull lcdm/rppdm-standalone
command at a Docker prompt (i.e., in a Docker Quickstart Terminal). On the =
other hand, if you simply use the JupyterHub Server, no action is required =
on your part (we have already updated the server).

Let us know if you have any questions.

Robert


Go to https://piazza.com/class?cid=3Dif5yonj2fts4on&nid=3Die93g1v7xri4jg&to=
ken=3DlnIDF9d7Seu to view. Search or link to this question with @16.=20=20

Tell a colleague about Piazza. It's free, after all.

Thanks,
The Piazza Team
--
Contact us at team@piazza.com


You're receiving this email because professor.brunner@gmail.

In [8]:
from IPython.display import HTML

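# Skip the first 102 characters (the MIME headers of the HTML part);
# the remaining body is still quoted-printable encoded (e.g., =3D for =)
html = str(data[1])[102:]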

HTML(html)

In [9]:
print(html)

<html>
Instructor Robert J. Brunner posted a new Note. <br><br>
<b>New Docker Course Image</b><br>
<br>
<p>We generated a new Docker course image. If you want to follow along on y=
our laptop or work on the course Notebooks offline, you should download thi=
s new image by issuing a=C2=A0</p>
<p></p>
<pre>docker pull lcdm/rppdm-standalone</pre>
<p>command at a Docker prompt (i.e., in a Docker Quickstart Terminal). On t=
he other hand, if you simply use the JupyterHub Server, no action is requir=
ed on your part (we have already updated the server).</p>
<p></p>
<p>Let us know if you have any questions.</p>
<p></p>
<p>Robert</p>
<p></p>
<br>
<br>
<a href=3D"https://piazza.com/class?cid=3Dif5yonj2fts4on&nid=3Die93g1v7xri4=
jg&token=3DlnIDF9d7Seu">Click here</a> to view. Search or link to this ques=
tion with @16.   <br><br>
Tell a colleague about Piazza. It's free, after all.<br>
<br>
Thanks,<br>
The Piazza Team<br>
--<br>
Contact us at team@piazza.com<br><br><br>
<font size=3D'-2'>
You're

### Structured Text Parsing

To parse structured text, like an XML or an HTML document, we can use
the Python [Beautiful Soup][bs] library. This library uses an XML/HTML
parser to build a DOM tree, and Beautiful Soup then provides traversal
methods to access and modify the DOM for a specific document. Beautiful
Soup has been extremely popular for the ease with which it allows web
scraping; for example, you can pull data out of an HTML table. But it is
more powerful than this, as it allows you to easily parse and manipulate
any XML or HTML document.

To use Beautiful Soup, we first need to import the library, and then
create a BeautifulSoup object that provides access to the parsed data.
Document elements, like `body` or `table`, are directly accessed from
the parsed tree, and element attributes or data can be easily extracted,
deleted, or replaced. If required, new data can also be added to an
existing document, allowing for the dynamic creation of a new document.
These capabilities are demonstrated in the following code cells.

-----
[bs]: http://www.crummy.com/software/BeautifulSoup/



In [10]:
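# Let's parse our HTML document

# We use BeautifulSoup version 4
from bs4 import BeautifulSoup

# Name the parser explicitly. The output shown below (with added <body>
# and <p> elements) is consistent with the lxml parser, which is assumed
# to be installed.
soup = BeautifulSoup(html, 'lxml')

# Now let's print out the start of the HTML file
print(soup.prettify()[:134])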

<html>
 <body>
  <p>
   Instructor Robert J. Brunner posted a new Note.
   <br/>
   <br/>
   <b>
    New Docker Course Image
   </b>
 


In [11]:
# We can access document elements directly
print('code element:= ', soup.pre)
print('value:', soup.pre.string)

# We can access parent data
print('parent element: ', soup.pre.parent.name)

code element:=  <pre>docker pull lcdm/rppdm-standalone</pre>
value: docker pull lcdm/rppdm-standalone
parent element:  body


In [12]:
# We can directly access element attributes

print('font size attribute: ', soup.font['size'])

font size attribute:  3D'-2'


In [13]:
# We can access an entire element's content
print(soup.b)

<b>New Docker Course Image</b>


In [14]:
# We can find all occurrences of a particular element

for el in soup.find_all('p'):
    print(el)

<p>
Instructor Robert J. Brunner posted a new Note. <br/><br/>
<b>New Docker Course Image</b><br/>
<br/>
</p>
<p>We generated a new Docker course image. If you want to follow along on y=
our laptop or work on the course Notebooks offline, you should download thi=
s new image by issuing a=C2=A0</p>
<p></p>
<p>command at a Docker prompt (i.e., in a Docker Quickstart Terminal). On t=
he other hand, if you simply use the JupyterHub Server, no action is requir=
ed on your part (we have already updated the server).</p>
<p></p>
<p>Let us know if you have any questions.</p>
<p></p>
<p>Robert</p>
<p></p>


In [15]:
# We can also change data in the document

soup.body['class'] = 'newClass'

print("\nBody class attribute = ", soup.body['class'])



Body class attribute =  newClass


In [16]:
# We can delete elements

myCode = soup.pre.extract()

print(soup.pre)

None


In [17]:
# We can select elements based on CSS Selectors
target = soup.select('font[size]')
print(target)

[<font size="3D'-2'">
You're receiving this email because professor.brunner@gmail.com is enrolled=
 in RP 101 at University of Illinois, Research Park. <a href="3D'https://pia=" zza.com="">Sign in</a> to manage your email preferences or <a href="3D'h=" ttps:="">un-enroll</a> from thi=
s class.</font>]


In [18]:
# We need to pull out the first element in the list to get the tag
# Now we can insert our extracted pre element back into the DOM

target[0].insert_after(myCode)
print(soup.pre)

<pre>docker pull lcdm/rppdm-standalone</pre>


In [19]:
# We can also insert entirely new elements.

# First we create a new element (tag)
tag = soup.new_tag('h3', id='h3id')
tag.string = 'A New Header'

# Now we can append (in this case we put the new element at the end of the body)

soup.body.append(tag)

# Show the result
print(soup.h3)

<h3 id="h3id">A New Header</h3>


-----

While Beautiful Soup provides a great deal of power and simplicity in
DOM parsing and element retrieval, combining it with regular expressions
makes searching a document considerably more flexible. Given a regular
expression, the first task in Python is to compile the RE, which is done
by using the `compile` function in the `re` module. This is demonstrated
in the following code cell, where we use a compiled regular expression
to find and display the HTML elements whose text contains the word
_Docker_.

-----


In [20]:
# We need the re module
import re

# Parse our HTML document
soup = BeautifulSoup(html)

# Find elements containing the Docker string
# (the `string` argument was named `text` before Beautiful Soup 4.4.0)
for el in soup.find_all(string=re.compile('Docker')):
    print(el.parent)

<b>New Docker Course Image</b>
<p>We generated a new Docker course image. If you want to follow along on y=
our laptop or work on the course Notebooks offline, you should download thi=
s new image by issuing a=C2=A0</p>
<p>command at a Docker prompt (i.e., in a Docker Quickstart Terminal). On t=
he other hand, if you simply use the JupyterHub Server, no action is requir=
ed on your part (we have already updated the server).</p>


-----
## Breakout Session

During this breakout, you should work to improve your Python text data
processing skills. Specific problems you can attempt include the
following:

1. Modify the first string processing code example to convert all text
to lowercase characters before accumulating the word counts.

2. Use the Python set to obtain the list of unique words in the text
message.

3. Use Regular Expressions to remove the email encoding characters from
the message text.

Additional, more advanced problems:

1. Save several emails from within your mail reader and modify the
Python code to process them in bulk to extract out the sender, date
sent, and subject.

2. Save several webpages (perhaps by using `wget`), and modify the
BeautifulSoup code example to parse out and display the page title, any
JavaScript code libraries, and any CSS style file references.

-----

### Additional References

1. [Dive Into Python3][1] regular expression chapter.
2. The [Python Collections][pycol] documentation.
3. The Official [Python Email][pem] documentation.
4. [BeautifulSoup][2] tutorial.

-----

[1]: http://www.diveintopython3.net/regular-expressions.html
[2]: http://programminghistorian.org/lessons/intro-to-beautiful-soup
[pem]: https://docs.python.org/3/library/email.html
[pycol]: https://docs.python.org/3/library/collections.html

### Return to the [Week 3 Index](index.ipynb).

-----