Today we are going to dig deeper into bots and start to look at them not simply as responding to events in the world (retweeting, say), but as "conversational interfaces" to information and services, including voice. To get there, we need tools to help us interpret what people are saying to our bot and decide how we are going to respond. 

We will assume that there are APIs (which we will use) for Alexa and the like that handle speech recognition and text-to-speech generation. Our job will be editorial. What do we say? 


Regular Expressions
-------------------

<img src=https://imgs.xkcd.com/comics/regular_expressions.png width=400>


In this portion of class, we are going to work with a "language" for expressing patterns in text. By "pattern" I mean specifying repetitions of symbols -- words or punctuation or sequences of numbers or any combination of these. Given a collection of text, for example, regular expressions might help you find dates or telephone numbers or URLs or email addresses -- all of these obey certain formatting rules. 

Regular expressions, then, are a way to describe these formatting rules so that we can search a body of text for them. Sometimes we are doing this because we want to find lists of facts about people (email addresses and their telephone numbers, say), creating structured data out of unstructured data (a common theme in this class). And sometimes we appeal to regular expressions because they help us in the act of "cleaning" data -- we might be given a date column in a data set that contains dates in two different formats (y-m-d and m/d/y, say) and we need to transform them into just one consistent format throughout. 

The patterns we express might also be about content. Can we detect the gender of sources? Can we find new memes in a stack of text? Unlike the proto-natural language processing we saw with TextBlob, regular expressions deal with words as patterns of characters. There is no understanding here about parts of speech or grammar. Just patterns of symbols -- characters, numbers, and emoji, even.

**Trump's State of the Union Address**

To explain how regular expressions work, we will look at a large collection of text -- the transcript of the SOTU from January. Lots of things were discussed and we can sort through topics, his speech patterns and so on.

[The full transcript is here](https://www.whitehouse.gov/briefings-statements/president-donald-j-trumps-state-union-address/)

[A file with one line per sentence is here](https://raw.githubusercontent.com/computationaljournalism/columbia2018/master/data/sotu_sentences.txt)

You can download the file in a browser window and save it to your computer or you can use the `requests` package to access the file from Python directly. You can choose either but for now, let's download the file and read it in directly.

In [None]:
# open the file and read in the contents. recall
# that this will take each line in the file and make
# it an entry in a list. so the entire sotu is now a list
# of strings, one string per sentence spoken.

sentences = open("sotu_sentences.txt").readlines()

print("The object 'sentences' is of type", type(sentences))
print("There are", len(sentences), "sentences in the list")

In [None]:
# here is a slice with the first five entries
# (those having index 0, 1, 2, 3 and 4)
sentences[:5]

**Aside: List comprehensions**

Notice that each sentence has a "newline" at the end of it, the `\n` character. This is actually a common problem so it's worth mentioning. When you read in HTML, for example, the format doesn't care if you specify a heading like this
    
    <h1>A title</h1>

or like this

    <h1>         A title            </h1>
    
or like this

    <h1>
              A
                    title
                                 </h1>

So we are routinely needing to tidy up text to make it more readable for a human, and to provide it with a bit of regularity so any automated text processing is cleaner. This will happen with just about any data you come across -- some kind of regularizing is going to be needed, especially with text.

The pretext of cleaning data also gives us a chance to introduce something new. We can remove the newlines from the end of each sentence using something called a 'list comprehension.' This is a new piece of Python syntax that lets us create new lists from old ones, transforming each element of the old list. It is an alternative to a loop, and is, well, **pretty snazzy.** 

If we wanted to remove the `\n` from each sentence we might do something like the following if we had to use a loop. It goes over each sentence and strip()'s off the whitespace from the start and end of the string. Whitespace includes spaces and tabs and newlines.

In [None]:
new_sentences = []

for s in sentences:
    new_sentences.append(s.strip())
    
new_sentences[:5]

See the difference? The `\n`'s are gone!

To sum, we iterate through the list of sentences. Each sentence is `strip()`'d, has all the "white space" removed from its front and back end, and then appended to the new_sentences list. I hope you agree that this is kind of clunky notation. 

A "list comprehension" is a cleaner way to accomplish the same thing. So, let's reread the data and apply this new code construction.

In [None]:
# read in the speech
sentences = open("sotu_sentences.txt").readlines()

# use a list comprehension to strip out the newlines
new_sentences = [s.strip() for s in sentences]

new_sentences[:5]

The second expression above says that we cycle through all the elements in `sentences`, letting the variable name `s` represent each one in turn. The first element, for example, becomes the start of a new list, and it has had `.strip()` applied to it. The second element is then `strip()`'d and stored in the new list and so on. As you can see, a list comprehension reads like our loop in the previous cell and behaves similarly -- but it is syntactically nicer. 

You can also limit the number of results included in the new list by adding an `if` clause -- a logical expression. As the list comprehension is running through the elements of the old list, it can choose whether or not to include each one in the new list via a logical expression. 

Suppose, for example, we want to keep only sentences that contain the word "wall". In terms of programming, we can use the operator `in`. We've seen this logical operator before. 

In [None]:
"tide" in  'A new tide of optimism was already sweeping across our land.'

In [None]:
"cyclone" in  'A new tide of optimism was already sweeping across our land.'

To use this, in the expression below we run through each entry in the list `sentences`, labeling them `s` in turn, and keep only those with at least one occurrence of the word "wall". (Here we don't save the new, reduced and transformed list -- we just have a look at it.) 

In [None]:
[s for s in sentences if "wall" in s]

If we wanted to make a new list in a loop we might do the following. Hopefully you see the code above is slicker.

In [None]:
new_sentences = []

for s in sentences:
    if "wall" in s:
        new_sentences.append(s)
        
new_sentences

That's list comprehensions -- new lists from old with a syntax that's a lot cleaner to read than a loop. We'll be using them for the remainder of this lesson. Keep in mind, however, that while we are using them for text, they can be used for any list.

In [None]:
# a list of some numbers
x = [2,5,4,7,8,2,4,5]

# a list comprehension that just keeps two times
# the entries that are larger than 3
[2*i for i in x if i>3]


OK back to the SOTU. We'll read it in fresh and take out all the newlines. Here's a clean way to do it.

In [None]:
sentences = [s.strip() for s in open("sotu_sentences.txt").readlines()]

Make sure you understand the example above. This construction is pretty powerful!

**Back to the transcript and regular expressions**

Python implements the regular expression search framework through a package called `re` (aptly named). We are going to make use of the `search()` function in this package. It takes a pattern definition (a regular expression) and searches for it in a string, returning a match object for the first place the pattern occurs, or `None` if it occurs nowhere. You can use this in a list comprehension because a match object is treated as `True` in an `if` statement and `None` is treated as `False`.

In [None]:
from re import search
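
Before using `search()` inside a list comprehension, it can help to look at what it hands back for a single string. The cell below is a small sketch using the sentence from the `in` example above; the variable names `hit` and `miss` are just for illustration.

```python
from re import search

sentence = "A new tide of optimism was already sweeping across our land."

# search() returns a match object for the first occurrence of the pattern
hit = search("tide", sentence)
print(hit)            # a match object, which counts as True in an `if`
print(hit.group())    # the text that was matched
print(hit.span())     # where the match starts and ends in the string

# when the pattern is nowhere in the string, search() returns None
miss = search("cyclone", sentence)
print(miss)           # None counts as False in an `if`
```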

**Literals** 

As a way of specifying patterns, let's start with so-called "literals" -- these characters just match themselves. For example, the literal *"wall"* matches the following sentences from Trump’s transcript. This should be equivalent to the results we had by using the operator `in`. 

In [None]:
[s for s in sentences if search("wall",s)]

Replace the literal *"wall"* with a search for *"immigration"*. Do you find any sentences matching this pattern? What other searches like this might you do to highlight sentences about immigration, or about other topics that this search suggests?

In [None]:
# your code here


Now, in the search for "wall" we also had "walls" turn up. If we were interested in "media" we might look for the **literal string** consisting of the letters m-e-d-i-a. 

See what we get...

In [None]:
[s for s in sentences if search("media",s)]

Hmm, well, that's not what we wanted at all. So, how do we get just the word "media"? Let's take a moment and write down the kinds of patterns in text we might be interested in finding, either from this speech or from your reporting or other work. What sorts of things do you have to specify? I mean word boundaries seem like a good idea so we can differentiate "media" from "immediate". What else do you need?

.

.

.

To get us to more complex searches, we need some way to express these ideas. And that's where metacharacters come to the rescue!

**In Walks Metacharacters**

Any character except for [ ]\\^\$.|?\*+( )\{ and \} can be used to specify a literal -- they match a single instance of themselves. The string *"wall"* represented a series of literals and to have a match we need to find a "w" followed by an "a" followed by an "l" and so on. The non-literals, on the other hand, are known as **metacharacters** and are used to specify much more complicated text patterns.

They help us specify "whitespace," word boundaries, sets or classes of literals, the beginning and end of a line, and various alternatives ("war" or "peace"). For example **^ represents the start of a line.** Let's look at what we get by searching the SOTU for a pattern that includes this character.

*(We are now going to preface our strings representing regular expressions with the letter `r`. We have seen `u` before to mean Unicode. Here `r` means a "raw" string. Basically it tells Python that every character is to be interpreted as it is. We have seen that `\n` in a string means a single character, newline. In a raw string, `\n` is interpreted as two characters, a backslash and an "n". This matters because Python strings and regular expressions both give the backslash a special meaning: Python uses it to build characters like newlines, and regular expressions use it to build things like word boundaries. A raw string passes the backslash through to the regular expression untouched.)*
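
To see the difference a raw string makes, we can simply count characters. This is a small illustrative check, not part of the SOTU analysis; the second comparison uses `\b`, a shorthand that shows up later in this lesson.

```python
# an ordinary string: Python turns \n into one newline character
plain = "\n"

# a raw string: the backslash and the n stay as two separate characters
raw = r"\n"

print(len(plain), len(raw))

# in an ordinary string Python reads \b as a backspace character
# before the regular expression machinery ever sees it
print(len("\b"), len(r"\b"))
```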

In [None]:
[s for s in sentences if search(r"^I am",s)]

Try a new sentence starter and see what you find...

In [None]:
# your code here


**Some Help**

To interpret what a regular expression is doing, we can use a special tool from [regexper.com](http://regexper.com/). It takes regular expressions and renders them graphically so you can get a better sense of how the machinery is functioning. For example, here is the display for our *"^I am"* example.

[Graphical view of the pattern "^I am"](http://regexper.com/#%5EI%20am)

You see a window where you can change the regular expression and then a graphical interpretation of what you've asked for. This is an extremely handy tool. (Notice that you don't have to include the quotes or the `r` in the regexper.com interface.) 

Now, if specifying the start of a line is important, having a special character for the end of a line is likely to be handy also. **The \$ represents the end of a line.** Consider the pattern *"road.\$"*. Here are the lines it matches in the SOTU. Do you notice anything odd here?

In [None]:
[s for s in sentences if search(r"road.$",s)]

We see two sentences that end in "road." but really end with the word "abroad." Ah but notice that we also have one sentence that ends with "road?". 

What's going on here? Well, the dot, ".", is a metacharacter and represents a wildcard. It is used to refer to any character. So, *road.\$* will match lines that end in "road." or "road?" or even "road9" (should the transcriber be typing sloppily one day). Have a look at this at regexper.com.

[Graphical view of the pattern "road.\$"](http://regexper.com/#road.%24)

Putting a backslash \\ before one of the special metacharacters [ ]\\^\$.|?\*+( )\{ and \} lets us include these in a pattern as literals -- in technical terms, we have "escaped" the special meaning of these characters and they return to their literal meanings. 

Consider the pattern *"\\\$2"*. With the backslash, \$ no longer means the end of a line. We have returned the dollar sign to its literal meaning and the following lines match from the Trump transcript. 

>'Now, the first \$24,000 earned by a married couple is completely tax-free.'
<br><br>
>'A typical family of four making \$75,000 will see their tax bill reduced by \$2,000 — slashing their tax bill in half.'

And to bring the point home, look at the following.

[Graphical view of the pattern "\\\$2"](http://regexper.com/#%5C%242)
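
Here is a small sketch of the escaped and unescaped dollar sign side by side. To keep the cell self-contained it uses a three-sentence list named `examples` (a name chosen for this illustration) rather than the full transcript: the first sentence quoted above and two others that appear elsewhere in this lesson.

```python
from re import search

examples = [
    "Now, the first $24,000 earned by a married couple is completely tax-free.",
    "These changes alone are estimated to increase average family income by more than $4,000.",
    "A new tide of optimism was already sweeping across our land.",
]

# \$ is a literal dollar sign, so this asks for "$" followed by "2"
escaped = [s for s in examples if search(r"\$2", s)]

# without the backslash, $ means "end of line" and no "2" can follow it
unescaped = [s for s in examples if search(r"$2", s)]

print(escaped)
print(unescaped)
```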

So, given this, what do we need to do to match sentences ending with the word "country" followed by a period?

In [None]:
# your code here



**A character class matches a single character out of all the possibilities contained in brackets, [  ]** — There are certain rules that apply when specifying these classes that we’ll get to in a second. Let's look at the pattern *"[Tt]onight"* and see what lines it matches in the transcript.

In [None]:
[s for s in sentences if search(r"[Tt]onight",s)]

[Graphical view of the pattern "[Tt]onight"](http://regexper.com/#%5BTt%5Donight)

Keep in mind that while there might be lots of options in the square brackets, we are only trying to match one character out of this group. The graphical display makes this clear. We'll talk about specifying more than one match in a few minutes.

In terms of the rules that work within character classes, you can specify a range of letters [a-z] or [A-Z] or numbers [0-9] — Keep in mind that the order within the character class doesn’t matter, it specifies a bag of characters from which we select one item. Let's look at the pattern *"[0-9] years"* and see which sentences it will match.

In [None]:
[s for s in sentences if search(r"[0-9] years",s)]

[Graphical view of the pattern "[0-9] years"](http://regexper.com/#%5B0-9%5D%20years)

It's important to keep in mind what's being matched here. The expression *"[0-9] years"* will match "5 years" in the sentence that started "CJ served 15 years in the Air Force". The expression wants a number then a space then the word "years". Get it? 

**When used at the beginning of a character class ^ is also a metacharacter and it indicates matching characters NOT in the indicated class.** So the pattern *"[^?.]\$"* will match sentences that don't end in a period or a question mark (you don't have to "escape" characters in a character class -- or between [ and ]). 

In [None]:
[s for s in sentences if search(r"[^?.]$",s)]

We see a large number of quotes in this list. Maybe we should look for quoted statements next? Anyway, here is the graphical representation of the expression.

[Graphical view of the pattern "[^?.]\$"](http://regexper.com/#%5B%5E%3F.%5D%24)

Continuing on our survey of metacharacters, the **vertical bar "|" translates to “or”** — We can use it to combine expressions, the subexpressions being called alternatives. The expression *"good|bad"* will match these lines from the transcript file.

In [None]:
[s for s in sentences if search(r"good|bad",s)]

We'll jump ahead just briefly and say that we can solve the problem of matching "goodness" and "badly" when all we want are the words "good" or "bad". Some collections of characters, some "character classes", are used so often that they are given special notation. For example, `\w` is a word character. A related shorthand, `\b`, marks a "word boundary": it matches the position between a word character and a non-word character rather than a character itself. Here's a (not elegant but works) way to say you want "good" or "bad" alone.

In [None]:
[s for s in sentences if search(r"\bgood\b|\bbad\b",s)]

We will give a complete set of character classes at the end of this section.

Returning to choices, of course we can join several alternatives together. Consider *"year|month|day"*.

In [None]:
[s for s in sentences if search(r"day|month|year",s)]

Again, here we see a lot of matches to patterns like "birthday" or "someday". Both contain the literal *"day"* but they might not be what we had in mind. 

[Graphical view of the pattern "year|month|day"](http://regexper.com/#year%7Cmonth%7Cday)

Oh and the alternatives separated by "|" can be any regular expressions and not just literals. Here we ask for time or money.

In [None]:
[s for s in sentences if search(r"[0-9] year|\$[0-9]",s)]

And again, regexper.com to help us out.

[Graphical view of the pattern "[0-9] year|\$[0-9]"](https://regexper.com/#%5B0-9%5D%20year%7C%5C%24%5B0-9%5D)

**Subexpressions are often contained in parentheses (more metacharacters) to constrain the
alternatives in some way.** For example *"^(I will|I am)"* matches either expression, but at the start of a sentence. 

Later we will see that we can identify each subexpression separately, allowing us to extract (or capture) the content they match.

In [None]:
[s for s in sentences if search(r"^(I will|I am)",s)]

And the graphical representation -- notice the new reference to groups that are formed by the parentheses.

<a href=https://regexper.com/#%5E%28I%20will%7CI%20am%29>Graphical view of the pattern "^(I will|I am)"</a>

We're building up quite a vocabulary. Try a more complex expression on your own.

In [None]:
# Your code here


Before we leave this, let's use the shorthand \\b  for word boundaries and our parentheses construction to match what we really wanted with our original search a few cells back, *"\b(day|month|year)\b"*. Let's have a look.

In [None]:
[s for s in sentences if search(r"\b(day|month|year)\b",s)]

**The question mark indicates that the indicated expression is optional.** The expression *"George( W\\.)? Bush"* will match references to “George W. Bush” or just “George Bush”.

<a href=http://regexper.com/#George(%20W%5C.)%3F%20Bush>Graphical view of the pattern "George( W\\.)? Bush"</a>
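
The optional group is easy to try on a few strings. The list `names` below is made up for this illustration (these are not lines from the transcript), so treat it as a sketch of how `?` behaves rather than a search of real data.

```python
from re import search

# made-up test strings, not lines from the transcript
names = [
    "George W. Bush",
    "George Bush",
    "George H. W. Bush",
    "George Washington",
]

# the group " W\." may appear once or not at all
pattern = r"George( W\.)? Bush"

matches = [n for n in names if search(pattern, n)]
print(matches)
```

Only the first two strings match. The third has an extra initial between "George" and " W.", which the pattern makes no allowance for.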

**The \* and + signs are metacharacters used to indicate repetition** — the \* means “any number, including zero, of the item” and + means “at least one of the item”. So we can specify clauses after a colon with the following regular expression.

In [None]:
[s for s in sentences if search(r":.+$",s)]

(You see some errors here with how the sentences were pulled. You can blame TextBlob.) Now, to make this seem a bit more practical, we are going to consider another data source. You can download some of it, but it's big. It's essentially all of Jeb Bush's emails while he was in office in Florida. I mean literally in `mbox` format. The site is shown here via the Internet Archive...
<br><br>
<img src=https://github.com/computationaljournalism/columbia2018/raw/master/images/jb2.jpg style="width: 65%; border: #000000 1px outset;"/>
<br><br>
... because this was such a bad idea, they took the site down quickly. 
<br><br>
<img src=https://github.com/computationaljournalism/columbia2018/raw/master/images/jb1.jpg style="width: 65%; border: #000000 1px outset;"/>
<br><br>
Why a bad idea? Well, no one masked any possibly sensitive data from the emails. So it's easy to find phone numbers or social security numbers or military IDs in these text files.

For example, to grab phone numbers, we look for digits that are separated by hyphens. For social security numbers or phone numbers we could use this expression *"[0-9]+-[0-9]+-[0-9]+"*. Why? Here's a series of lines that match this pattern from Jeb's emails.

>Phone: 407-240-1891<br><br>In reference to your letter dated october 29, 1998 in which you offer to help me with my inmigration question, i am a us citizen who is petition for my husband (a mexican citizen) petition #SRC-98-204-50114 his name is FRANCISCO JAVIER CORTEZ HERNANDEZ.<br><br>Fax: 407-888-2445<br><br>Pager: 850-301-8072<br><br>Cell: 407-484-8167<br><br>The Reverned uses his pager# 813-303-4726 to get in contact with, or you may email
and I will get in touch with him. <br><br>

And from regexper.com... 

<a href=http://regexper.com/#%5B0-9%5D%2B-%5B0-9%5D%2B-%5B0-9%5D%2B>Graphical view of the pattern "[0-9]+-[0-9]+-[0-9]+"</a>

In words, we are looking for one or more numbers followed by a hyphen, followed by one or more numbers, and then another hyphen, and finally one or more numbers.
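
To see exactly which characters the pattern grabs, we can ask the match object for its text with `.group()`. The sketch below runs on a few short fragments cut from the lines quoted above; the list name `fragments` is just for illustration.

```python
from re import search

# short fragments taken from the matching lines quoted above
fragments = [
    "Phone: 407-240-1891",
    "petition #SRC-98-204-50114",
    "Pager: 850-301-8072",
    "dated october 29, 1998",
]

pattern = r"[0-9]+-[0-9]+-[0-9]+"

found = []
for f in fragments:
    m = search(pattern, f)
    if m:
        # m.group() is the exact text the pattern matched
        found.append(m.group())

print(found)
```

Notice that part of the petition number is picked up along with the phone numbers. The pattern only says "digits, hyphen, digits, hyphen, digits" and has no idea how long each run should be.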

**The curly braces \{ and \} are referred to as interval quantifiers** — they let us specify the exact number of occurrences of a pattern, its minimum, maximum or an acceptable range. For a Social Security Number we might want "[0-9]{3}-[0-9]{2}-[0-9]{4}", for example - three numbers, a hyphen, two numbers, a hyphen then four numbers.  

<a href=https://regexper.com/#%5B0-9%5D%7B3%7D-%5B0-9%5D%7B2%7D-%5B0-9%5D%7B4%7D>View of the pattern "[0-9]{3}-[0-9]{2}-[0-9]{4}"</a>

Here's the full list of what you can do with the curly braces as metacharacters. 
<table>
          <tr>
            <th>Expression</th>
            <th>What does it mean?</th>
          </tr>
          <tr>
            <td>{3}</td>
            <td>Looks for 3 occurrences of a pattern</td>
          </tr>
          <tr>
            <td>{,3}</td>
            <td>Matches at most 3 occurrences</td>
          </tr>
          <tr>
            <td>{3,}</td>
            <td>Matches at least 3 occurrences</td>
          </tr>
          <tr>
            <td>{3,5}</td>
            <td>Matches between 3 and 5 occurrences</td>
          </tr>
 </table>
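
Here is a minimal sketch of the interval quantifiers at work. The strings in `candidates` are for illustration only, and "123-45-6789" is a placeholder, not a real Social Security Number.

```python
from re import search

# illustrative strings; "123-45-6789" is a placeholder, not a real number
candidates = [
    "SSN: 123-45-6789",
    "Phone: 407-240-1891",
    "petition #SRC-98-204-50114",
]

# three digits, a hyphen, two digits, a hyphen, four digits
ssn_like = r"[0-9]{3}-[0-9]{2}-[0-9]{4}"
hits = [c for c in candidates if search(ssn_like, c)]
print(hits)

# {3,5} accepts three, four or five digits; ^ and $ pin the pattern
# to the whole string so longer runs are rejected
lengths = [n for n in ["12", "123", "12345", "123456"] if search(r"^[0-9]{3,5}$", n)]
print(lengths)
```

The phone number is no longer a match because its middle block has three digits. Without `^`, `$` or word boundaries, though, a pattern like this can still match inside a longer run of digits, so it is worth checking what you actually get back.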

With this information, how would we skim these emails for specific kinds of numbers? Credit card numbers? Phone numbers? VIN numbers?

**Groupings.** In most implementations of regular expressions, the parentheses not only limit the scope
of alternatives divided by a “|”, but also can be used to “remember” text matched by the
subexpression enclosed. We refer to the matched text with \1, \2, etc., depending on how many pairs of parentheses we have. 

As an example, the expression
*" ([a-zA-Z]+) \1 "* will match these lines in Jeb Bush’s inbox from January of 2000.

>I feel this is a **win win** situation for the Governor, the Reverend and the people that need help.<br><br>I insisted **that that** be the outcome in that court and that we did not recede from that position.<br><br>I guess you're embarrassed **that that** line got out.<br><br>

The pattern is asking for repeated words. We highlighted them in the text above. Also have a look at the graphical representation of this regular expression.

<a href=http://regexper.com/#%20(%5Ba-zA-Z%5D%2B)%20%5C1%20>Graphical view of the pattern " ([a-zA-Z]+) \1 "</a>.
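
The repeated-word pattern can be tried directly. The sketch below uses two of the lines quoted above plus one SOTU sentence that has no repeated word; `lines` and `repeated` are names chosen for this illustration.

```python
from re import search

lines = [
    "I feel this is a win win situation for the Governor, the Reverend and the people that need help.",
    "I guess you're embarrassed that that line got out.",
    "We will work to fix bad trade deals and negotiate new ones.",
]

# \1 means "whatever the first pair of parentheses just matched"
pattern = r" ([a-zA-Z]+) \1 "

repeated = []
for line in lines:
    m = search(pattern, line)
    if m:
        # group(1) is the text captured by the first pair of parentheses
        repeated.append(m.group(1))

print(repeated)
```

The raw string matters here: in an ordinary Python string, `"\1"` is read as a single control character rather than a backslash followed by a 1.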

**Substitution**

Groupings are also helpful when you want to make substitutions. We have already seen that string types offer you the opportunity to replace text...

In [None]:
s = 'We will work to fix bad trade deals and negotiate new ones.'
s.replace("We","You")

The `re` package includes a function `sub` that lets you act on groups. Here we define a group to be a dollar value and extract it. With this kind of construction, you can see how you might start to pull structured information from unstructured text.

In [None]:
from re import sub

s = "These changes alone are estimated to increase average family income by more than $4,000."

# pull out the dollar value
sub(r".*\$([0-9,]+).*",r"\1",s)
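
The same idea handles the data-cleaning problem mentioned at the start of the lesson, where a column mixes y-m-d and m/d/y dates. This is a minimal sketch with made-up dates; it assumes every slash-separated value really is month/day/year with a four-digit year.

```python
from re import sub

# made-up dates in the two formats mentioned at the start of the lesson
dates = ["2018-01-30", "1/30/2018", "12/5/2017"]

# capture month, day and year, then write them back as year-month-day;
# strings that don't fit the pattern are returned unchanged
cleaned = [sub(r"^([0-9]+)/([0-9]+)/([0-9]{4})$", r"\3-\1-\2", d) for d in dates]
print(cleaned)
```

The groups are simply reordered, so single-digit months and days stay unpadded. Adding leading zeros would take a further step.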

**To Sum**

The presentation here is meant to give you a flavor of how regular expressions are structured; you have seen the major metacharacters and how to use them to create patterns. Below I provide a useful cheat sheet to remember what the different metacharacters mean and what some of the useful shorthand character classes are. In addition, I can recommend [an interactive cheat sheet](https://www.debuggex.com/cheatsheet/regex/python), and the site [http://www.regular-expressions.info/](http://www.regular-expressions.info/) is also an excellent resource.

**Metacharacters**

<table>
          <tr>
            <th>Metacharacter</th>
            <th>What does it do?</th>
            <th>Examples</th>
            <th>Matches</th>
          </tr>
          <tr>
            <td>^</td>
            <td>Matches beginning of line</td>
            <td>^abc</td>
            <td>abc, abcdef.., abc123</td>
          </tr>
          <tr>
            <td>\$</td>
            <td>Matches end of line</td>
            <td>abc\$</td>
            <td>my:abc, 123abc, theabc</td>
          </tr>
          <tr>
            <td>.</td>
            <td>Match any character</td>
            <td>a.c</td>
            <td>abc, a-c, a9c</td>
          </tr>
          <tr>
            <td>[...]</td>
            <td>Matches one character contained in brackets</td>
            <td>[abc]</td>
            <td>a,b, or c</td>
          </tr>
          <tr>
            <td>[^...]</td>
            <td>Matches one character not contained in brackets</td>
            <td>[^abc]</td>
            <td>xyz, 123, 1de</td>
          </tr>
          <tr>
            <td>[a-z]</td>
            <td>Matches one character between 'a' and 'z'</td>
            <td>[b-z]</td>
            <td>bc, mind, xyz</td>
          </tr>
          <tr>
            <td>\*</td>
            <td>Matches character before \* 0 or more times</td>
            <td>ab\*c</td>
            <td>abc, abbc, ac</td>
          </tr>
          <tr>
            <td>+</td>
            <td>Matches character before + one or more times</td>
            <td>a+c</td>
            <td>ac, aac, aaac</td>
          </tr>
          <tr>
            <td>?</td>
            <td>Matches the character before the ? zero or one times. Also used after another quantifier to make it non-greedy</td>
            <td>ab?c</td>
            <td>ac, abc</td>
          </tr>
          <tr>
            <td>{x}</td>
            <td>Match exactly 'x' number of times</td>
            <td>(abc){2}</td>
            <td>abcabc</td>
          </tr>
          <tr>
            <td>{x,}</td>
            <td>Match 'x' number of times or more</td>
            <td>(abc){2,}</td>
            <td>abcabc, abcabcabc</td>
          </tr>
           <tr>
            <td>{,x}</td>
            <td>Match up to 'x' number of times</td>
            <td>(abc){,2}</td>
            <td>abc, abcabc</td>
          </tr>
          <tr>
            <td>{x,y}</td>
            <td>Match between 'x' and 'y' times.</td>
            <td>(a){2,4}</td>
            <td>aa, aaa, aaaa</td>
          </tr>
           <tr>
            <td>|</td>
            <td>OR operator</td>
            <td>abc|xyz</td>
            <td>abc or xyz</td>
          </tr>
          <tr>
            <td>(...)</td>
            <td>Capture anything matched</td>
            <td>(a)b(c)</td>
            <td>Captures 'a' and 'c'</td>
          </tr>
          <tr>
            <td>(?:...)</td>
            <td>Non-capturing group</td>
            <td>(a)b(?:c)</td>
            <td>Captures 'a' but only groups 'c'</td>
          </tr>
           <tr>
            <td>\</td>
            <td>Escape the character after the backslash; or create a special sequence (like word boundaries, \b, or a character representing a space, \s).</td>
            <td>a\sc</td>
            <td>a c</td>
          </tr>
        </table>

The special "metacharacters" () [] {} ^ \$ . | \* + ?  and \\ become "literals" again if you put a \\ in front of them -- That is, \\. matches a period and is no longer the wild card. We say we have "escaped" the metacharacter.
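
A few of these rules are easy to check in a cell. The sketch below uses `fullmatch()` from the `re` package, which succeeds only when the entire string fits the pattern; the test strings are made up for illustration.

```python
from re import fullmatch, search

# fullmatch() asks whether the entire string fits the pattern
print(bool(fullmatch(r"ab*c", "ac")))          # zero b's is fine for *
print(bool(fullmatch(r"a+c", "c")))            # + needs at least one a
print(bool(fullmatch(r"ab?c", "abbc")))        # ? allows at most one b
print(bool(fullmatch(r"(abc){2}", "abcabc")))  # the group, exactly twice

# an unescaped dot is a wildcard; an escaped dot is just a period
print(bool(search(r"3.5", "385")))
print(bool(search(r"3\.5", "385")))
print(bool(search(r"3\.5", "3.5")))
```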

**Shorthand character classes**

<table>
          <tr>
            <td>\d</td>
            <td>Match any digit (0-9)</td>
          </tr>
          <tr>
            <td>\D</td>
            <td>Match any non digit</td>
          </tr>
          <tr>
            <td >\t</td>
            <td>Match a tab</td>
          </tr>
          <tr>
            <td>\n</td>
            <td>Match a new line</td>
          </tr>
          <tr>
            <td>\r</td>
            <td>Match a carriage return</td>
          </tr>
          <tr>
            <td>\s</td>
            <td>Matches a space character (space, \t, \r, \n)</td>
          </tr>
          <tr>
            <td>\S</td>
            <td>Matches any non-space character </td>
          </tr>
          <tr>
            <td>\b</td>
            <td>Word boundary</td>
          </tr>
          <tr>
            <td>\B</td>
            <td>Non word boundary</td>
          </tr>
          <tr>
            <td>\w</td>
            <td>Matches any one word character [a-zA-Z_0-9]</td>
          </tr>
          <tr>
            <td>\W</td>
            <td>Matches any one non word character</td>
          </tr>
          </table>
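
As a quick sketch of the shorthand classes in use, the cell below rewrites a phone-number pattern with `\d`, tidies whitespace with `\s`, and uses `\b` to separate "media" from "immediate". The string `messy` and the two short phrases are made up for illustration; the phone fragment comes from the email lines quoted earlier.

```python
from re import search, sub

text = "Cell: 407-484-8167"

# \d is shorthand for [0-9]
m = search(r"\d{3}-\d{3}-\d{4}", text)
print(m.group())

# \s+ matches a run of whitespace; here each run becomes a single space
messy = "A   title \n with \t extra   whitespace"
tidy = sub(r"\s+", " ", messy)
print(tidy)

# \b marks the edge of a word, so "media" inside "immediate" is skipped
inside_word = search(r"\bmedia\b", "an immediate response")
whole_word = search(r"\bmedia\b", "the media response")
print(bool(inside_word), bool(whole_word))
```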

The Coming Age of Conversational Bots
--------------------------------------
<hr>
<img src="https://cdn-images-1.medium.com/max/1000/1*-uuhR1UX709LfUnDiS30Rg.png"  style="width: 65%;"/>
<br>

Thursday, we will have a bit more history courtesy of Mark Lavallee and Emily. In the meantime, FastCompany published a nice history of a particular kind of software robot: [How The New, Improved Chatbots Rewrite 50 Years Of Bot History](http://www.fastcompany.com/3059439/why-the-new-chatbot-invasion-is-so-different-from-its-predecessors). Chatbots are really conversation-based interfaces to services of various kinds. Sometimes the logic behind their inner workings is incredibly complex, but other times their operation is very service-oriented and shallow. Sometimes they are reading from a script, sometimes machine learning helps direct responses.

The FastCompany article makes the point that while strands of artificial intelligence research have been obsessed with making a bot that was believably human (or perhaps with the potential for one), the latest incarnations of these programs exist "to help you get things done." They have grown plentiful using the scale of new platforms (Facebook, Twitter, and so on) and fit tidily into those platforms' business interests.

Here's a snippet.

>But over the past few years, chatbots have made a comeback. With advancements in processing power, bots now have a better ability to interpret natural language and learn from users over time. Just as importantly, big companies like Facebook, Apple, and Microsoft are now eager to host our interactions with various services, and offer tools for developers to make those services available. Chatbots easily fit into their larger business models of advertising, e-commerce, online services, and device sales. Meanwhile, services that want to reach hundreds of millions of customers on a platform like Messenger will be helping to write the chat scripts.
<br><br>"A lot of the things that were barriers to us back then are no longer barriers to us today because of the evolution of the way technology works," Hoffer says.
<br><br>
Crucially, these bots are meant to be useful out of the gate, ... they no longer need conversation as a crutch for mass adoption. Sure, Apple’s Siri knows how to break the ice with a few jokes, but it largely exists to help you get things done. Rival assistants from Google and Amazon don’t exhibit much personality at all. Utility is winning out because the technology allows for it.

The article ends with a comment about early artificial intelligence researcher Joseph Weizenbaum. Weizenbaum created something called ELIZA, a computer program that was modeled after a psychiatrist (or an "active listener"). 

>If the latest round of chatbots succeed, they might prove that Weizenbaum, the creator of ELIZA, was right all along. These machines are not warm and cuddly replacements for the human intellect. They’re just another set of tools—an evolution of the apps that have served us for years.

As an aside, Weizenbaum didn't create ELIZA as a state-of-the-art conversational program. He was, in fact, troubled by its positive reception, and later in life he wrote about the limits of artificial intelligence. The passage below is from his 1976 text [Computer Power and Human Reason](https://en.wikipedia.org/wiki/Computer_Power_and_Human_Reason). He closes the third chapter with this image.

>Sometimes when my children were still little, my wife and I would stand over them as they lay sleeping in their beds. We spoke to each other only in silence, rehearsing a scene as old as mankind itself. It is as Ionesco told his journal: ‘Not everything is unsayable in words, only the living truth.’

Beautiful, right? For a data class, it's a great reminder that data will always exist at a distance from lived experience. We can try to pile more and more data on a given situation or phenomenon. But no matter how big our data gets, we are still missing something. That's why we've been stressing how your choice of data is a creative act.

Before we start up ELIZA, our dear friend Suman has listed out a number of identifiable elements to think about when you are designing a conversation. Again, not all of these will apply, as we might be very functionally oriented, but they are something to remember.

#### Some Identifiable Elements of Conversation:

| Element of Conversation | Possible Techniques to Compute/ Quantify |
| ------ | ----------- |
|1. Notifications/ Recalling relevant things   |  Time Series Analysis, Alerting, Keyword caches |
|2. Learning topics in context | Topic Mining/Modeling - extract the topic from the words in text |
|3. Understanding Social Networks (offline and online)  | Network Science, the study of the structure of how things are connected and how information flows through it |
|4. Responding to Emotion  | Sentiment Analysis |
|5. Having Episodic Memory  | Some kind of graphical model, [see Aditi's data post](https://medium.com/@aditinair/episodic-memory-modeling-for-conversational-agents-7c82e25b06b4#.9k65cziqw). |
|6. Portraying Personality  | Decision Tree, which is a tree-like graph or model of decisions and their possible consequences, including chance event outcomes, resource costs, and utility. |



**ELIZA**

Anyway, today we are going to work on the basics of a chatbot. Below we have the ELIZA program in Python form. Execute it and interact. You type "quit" to get out.

In [None]:
from re import match, IGNORECASE
from random import choice
 
reflections = {
    "am": "are",
    "was": "were",
    "i": "you",
    "i'd": "you would",
    "i've": "you have",
    "i'll": "you will",
    "my": "your",
    "are": "am",
    "you've": "I have",
    "you'll": "I will",
    "your": "my",
    "yours": "mine",
    "you": "me",
    "me": "you"
}
 
actions = [
    [r'I need (.*)',
     ["Why do you need {0}?",
      "Would it really help you to get {0}?",
      "Are you sure you need {0}?"]],
 
    [r'Why don\'?t you ([^\?]*)\??',
     ["Do you really think I don't {0}?",
      "Perhaps eventually I will {0}.",
      "Do you really want me to {0}?"]],
 
    [r'Why can\'?t I ([^\?]*)\??',
     ["Do you think you should be able to {0}?",
      "If you could {0}, what would you do?",
      "I don't know -- why can't you {0}?",
      "Have you really tried?"]],
 
    [r'I can\'?t (.*)',
     ["How do you know you can't {0}?",
      "Perhaps you could {0} if you tried.",
      "What would it take for you to {0}?"]],
 
    [r'I am (.*)',
     ["Did you come to me because you are {0}?",
      "How long have you been {0}?",
      "How do you feel about being {0}?"]],
 
    [r'I\'?m (.*)',
     ["How does being {0} make you feel?",
      "Do you enjoy being {0}?",
      "Why do you tell me you're {0}?",
      "Why do you think you're {0}?"]],
 
    [r'Are you ([^\?]*)\??',
     ["Why does it matter whether I am {0}?",
      "Would you prefer it if I were not {0}?",
      "Perhaps you believe I am {0}.",
      "I may be {0} -- what do you think?"]],
 
    [r'What (.*)',
     ["Why do you ask?",
      "How would an answer to that help you?",
      "What do you think?"]],
 
    [r'How (.*)',
     ["How do you suppose?",
      "Perhaps you can answer your own question.",
      "What is it you're really asking?"]],
 
    [r'Because (.*)',
     ["Is that the real reason?",
      "What other reasons come to mind?",
      "Does that reason apply to anything else?",
      "If {0}, what else must be true?"]],
 
    [r'(.*) sorry (.*)',
     ["There are many times when no apology is needed.",
      "What feelings do you have when you apologize?"]],
 
    [r'Hello(.*)',
     ["Hello... I'm glad you could drop by today.",
      "Hi there... how are you today?",
      "Hello, how are you feeling today?"]],
 
    [r'I think (.*)',
     ["Do you doubt {0}?",
      "Do you really think so?",
      "But you're not sure {0}?"]],
 
    [r'(.*) friend (.*)',
     ["Tell me more about your friends.",
      "When you think of a friend, what comes to mind?",
      "Why don't you tell me about a childhood friend?"]],
 
    [r'Yes',
     ["You seem quite sure.",
      "OK, but can you elaborate a bit?"]],
 
    [r'(.*) computer(.*)',
     ["Are you really talking about me?",
      "Does it seem strange to talk to a computer?",
      "How do computers make you feel?",
      "Do you feel threatened by computers?"]],
 
    [r'Is it (.*)',
     ["Do you think it is {0}?",
      "Perhaps it's {0} -- what do you think?",
      "If it were {0}, what would you do?",
      "It could well be that {0}."]],
 
    [r'It is (.*)',
     ["You seem very certain.",
      "If I told you that it probably isn't {0}, what would you feel?"]],
 
    [r'Can you ([^\?]*)\??',
     ["What makes you think I can't {0}?",
      "If I could {0}, then what?",
      "Why do you ask if I can {0}?"]],
 
    [r'Can I ([^\?]*)\??',
     ["Perhaps you don't want to {0}.",
      "Do you want to be able to {0}?",
      "If you could {0}, would you?"]],
 
    [r'You are (.*)',
     ["Why do you think I am {0}?",
      "Does it please you to think that I'm {0}?",
      "Perhaps you would like me to be {0}.",
      "Perhaps you're really talking about yourself?"]],
 
    [r'You\'?re (.*)',
     ["Why do you say I am {0}?",
      "Why do you think I am {0}?",
      "Are we talking about you, or me?"]],
 
    [r'I don\'?t (.*)',
     ["Don't you really {0}?",
      "Why don't you {0}?",
      "Do you want to {0}?"]],
 
    [r'I feel (.*)',
     ["Good, tell me more about these feelings.",
      "Do you often feel {0}?",
      "When do you usually feel {0}?",
      "When you feel {0}, what do you do?"]],
 
    [r'I have (.*)',
     ["Why do you tell me that you've {0}?",
      "Have you really {0}?",
      "Now that you have {0}, what will you do next?"]],
 
    [r'I would (.*)',
     ["Could you explain why you would {0}?",
      "Why would you {0}?",
      "Who else knows that you would {0}?"]],
 
    [r'Is there (.*)',
     ["Do you think there is {0}?",
      "It's likely that there is {0}.",
      "Would you like there to be {0}?"]],
 
    [r'My (.*)',
     ["I see, your {0}.",
      "Why do you say that your {0}?",
      "When your {0}, how do you feel?"]],
 
    [r'You (.*)',
     ["We should be discussing you, not me.",
      "Why do you say that about me?",
      "Why do you care whether I {0}?"]],
 
    [r'Why (.*)',
     ["Why don't you tell me the reason why {0}?",
      "Why do you think {0}?"]],
 
    [r'I want (.*)',
     ["What would it mean to you if you got {0}?",
      "Why do you want {0}?",
      "What would you do if you got {0}?",
      "If you got {0}, then what would you do?"]],
 
    [r'(.*) mother(.*)',
     ["Tell me more about your mother.",
      "What was your relationship with your mother like?",
      "How do you feel about your mother?",
      "How does this relate to your feelings today?",
      "Good family relations are important."]],
 
    [r'(.*) father(.*)',
     ["Tell me more about your father.",
      "How did your father make you feel?",
      "How do you feel about your father?",
      "Does your relationship with your father relate to your feelings today?",
      "Do you have trouble showing affection with your family?"]],
 
    [r'(.*) child(.*)',
     ["Did you have close friends as a child?",
      "What is your favorite childhood memory?",
      "Do you remember any dreams or nightmares from childhood?",
      "Did the other children sometimes tease you?",
      "How do you think your childhood experiences relate to your feelings today?"]],
 
    [r'(.*)\?',
     ["Why do you ask that?",
      "Please consider whether you can answer your own question.",
      "Perhaps the answer lies within yourself?",
      "Why don't you tell me?"]],
 
    [r'quit',
     ["Thank you for talking with me.",
      "Good-bye.",
      "Thank you, that will be $150.  Have a good day!"]],
 
    [r'(.*)',
     ["Please tell me more.",
      "Let's change focus a bit... Tell me about your family.",
      "Can you elaborate on that?",
      "Why do you say that {0}?",
      "I see.",
      "Very interesting.",
      "{0}.",
      "I see.  And what does that tell you?",
      "How does that make you feel?",
      "How do you feel when you say that?"]]
]
 
 
def reflect(fragment):
    
    # Turn a string into a series of words
    tokens = fragment.lower().split()
    
    # for each word...
    for i in range(len(tokens)):
        token = tokens[i]
    
        # see if the word is in the "reflections" list and if it
        # is, replace it with its reflection (you -> me, say)
        if token in reflections:
            tokens[i] = reflections[token]
            
    return ' '.join(tokens)
 
 
def respond(statement):
    
    # run through all the actions
    for j in range(len(actions)):
    
        # for each one, see if it matches the statement that was typed
        pattern = actions[j][0]
        responses = actions[j][1]
        found = match(pattern, statement.rstrip(".!"), IGNORECASE)
        
        if found:
        
            # for the first match, select a response at random and insert
            # the text from the statement into ELIZA's response
            response = choice(responses)
            return response.format(*[reflect(g) for g in found.groups()])
 
 
def eliza():
    # a friendly welcome
    print("Hello. How are you feeling today?")
 
    # talk forever...
    while True:
        
        # collect a statement and respond, stop the conversation on 'quit'
        statement = input("> ")
        print(respond(statement))
 
        if statement == "quit":
            break


There are some constructions here you haven't seen. The `def ... :` is a way to create new functions. That is, you can build your own operations that take data in, operate on it, and return a result.
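As a tiny illustration of `def`, here is a function that takes a string in, operates on it, and hands a new string back. The name `shout` and what it does are made up for this example; they are not part of ELIZA.

```python
def shout(text):
    # take data in (text), operate on it, and return the result
    return text.upper() + "!"


print(shout("hello eliza"))
```

Once a function is defined, you call it by name just like the built-in ones.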

The final function, eliza(), drops you into a loop (a "while" loop that you "break" out of by typing "quit"). The only other new thing here is the built-in function input(), which lets your reader type things and gives you access to their musings. Play with ELIZA a little. What do you think?

In [None]:
eliza()
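If you want to see the loop logic without typing, here is a minimal sketch that feeds a prepared list of statements through the same while/break structure. The function name `chat` and the canned "You said" reply are assumptions for illustration; `next()` stands in for `input()`.

```python
def chat(statements):
    """Reply to each statement in turn, stopping after "quit"."""
    replies = []
    source = iter(statements)

    # talk forever... or at least until we see "quit"
    while True:
        # next() plays the role of input("> "); if the list runs out,
        # it hands back "quit" so the loop always ends
        statement = next(source, "quit")
        replies.append("You said: " + statement)

        if statement == "quit":
            break

    return replies


print(chat(["hello", "I am fine", "quit", "never reached"]))
```

As in eliza(), the reply to "quit" is produced before the loop breaks, and anything after "quit" is never read.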

Before we turn you loose, let's examine the code a little. The reflect() function turns an "I" into a "you", allowing the program to turn a user's statement "Because I love apples" around into the question "If you love apples, what else must be true?" Here is reflect() working on single phrases.

In [None]:
reflect("I am troubled")

In [None]:
reflect("your analysis is wrong")

Next, let's look at the respond() function a little more closely. There are some new code constructions here. Here is respond() in action.

In [None]:
respond("I am doing fine.")

Below we take the same function but add a number of print statements to see what it's doing. The new function is called irespond() instead, to avoid confusion. 

In [None]:
from pprint import pprint

def irespond(statement):
    
    # run through all the actions
    for j in range(len(actions)):
        
        # for each one, see if it matches the statement that was typed
        pattern = actions[j][0]
        responses = actions[j][1]
        found = match(pattern, statement.rstrip(".!"), IGNORECASE)
        
        if found:
            
            # for the first match, select a response at random and insert
            # the text from the statement into ELIZA's response
            
            print("Found pattern:")
            print(pattern)
            print("--"*5)
            print("Choosing between responses:")
            pprint(responses)
            
            response = choice(responses)

            print("--"*5)
            print("The matched groups:")
            pprint([reflect(g) for g in found.groups()])
            print("--"*5)
            
            print("ELIZA's response:")
            print(response.format(*[reflect(g) for g in found.groups()]))
            print("--"*5)
            
            return           

In [None]:
irespond("My dog.")

The responses have references that look like "{0}" and "{1}" and so on. Given a string with these special character strings, the method format() will substitute its first argument for "{0}", its second for "{1}" and so on. Here we make two substitutions.

In [None]:
"Not everything is {0} in words, only {1}".format("sayable","the living truth.")

The only other magic is the "\*" inside the format() call in irespond() and respond(). The star notation takes a list and passes each of its elements to the function as a separate argument. So the list below has two elements, two strings, and the star makes the call below just like the one above. The first element of the list is the first argument to format() and the second is the second argument.

In [None]:
"Not everything is {0} in words, only {1}".format(*["sayable","the living truth."])

We do this because match() returns a match object, and its groups() method returns the groups identified in the regular expression -- the items marked out with parentheses. So above, the word "dog" is the only group, and groups() returns a tuple (which behaves like a list here) with just one item. format() then takes that item and plops it into the response string, replacing "{0}".
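As a small sketch of that hand-off, the cell below runs one rule by itself. The pattern and response are taken from the ELIZA rules above; the variable names `found`, `no_groups`, `not_at_start` and `template` are only for illustration. It also shows that match() only looks for the pattern at the start of the string.

```python
from re import match, IGNORECASE

# one group in the pattern, so groups() holds one item
found = match(r'My (.*)', "My dog", IGNORECASE)
print(found.groups())

# a pattern with no parentheses matches, but has no groups
no_groups = match(r'Yes', "yes", IGNORECASE)
print(no_groups.groups())

# match() anchors at the beginning, so "my" in mid-sentence is not found
not_at_start = match(r'My (.*)', "I love my dog", IGNORECASE)
print(not_at_start)

# the star hands each group to format() as a separate argument
template = "Why do you say that your {0}?"
print(template.format(*found.groups()))
```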

In [None]:
irespond("do you think my mother would approve?")

In [None]:
irespond("I'm sad.")

Your turn! Start by copying and adapting the ELIZA code to fix up where it seems to get stuck, conversationally. When you are ready, start on your own bot. The more rules you rewrite the better. What are you going to talk about? What are you going to ground your conversation in? Maybe we ground it in Trump commentary (unless you're exhausted -- I'll understand)? [Here](https://www.nytimes.com/2016/11/18/technology/automated-pro-trump-bots-overwhelmed-pro-clinton-messages-researchers-say.html?_r=0) is a great article on simple political chat bots and another one [here](https://www.askhillaryanddonald.com/assets/Sample_Questions.pdf).
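If you are not sure where to begin, here is one possible starting point, a minimal sketch rather than a finished bot. Everything in it is hypothetical: the topic (talking about the news), the rule list `my_actions`, the function `my_respond`, and the canned replies. It leaves out reflect() to stay short, so pronouns are not turned around.

```python
from re import match, IGNORECASE
from random import choice

# Hypothetical rules for a bot that talks about the news.
# Order matters: the first pattern that matches wins, so the
# catch-all (.*) rule has to come last.
my_actions = [
    [r'I (?:read|saw|heard) (.*)',
     ["Where did you come across {0}?",
      "Do you trust what you learned about {0}?"]],

    [r'(.*)\bnews\b(.*)',
     ["Where do you usually get your news?",
      "How much news is too much?"]],

    [r'(.*)',
     ["Tell me more.",
      "What makes you say that?"]]
]


def my_respond(statement, rules=my_actions):
    # same idea as respond(): find the first matching rule, pick a reply
    for pattern, responses in rules:
        found = match(pattern, statement.rstrip(".!"), IGNORECASE)
        if found:
            return choice(responses).format(*found.groups())


print(my_respond("I read a story about bots."))
print(my_respond("The news is exhausting"))
print(my_respond("Hmm"))
```

Try moving the catch-all rule to the top of the list and see what happens to the conversation. That is the quickest way to get a feel for why rule order matters.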