Three objectives:

1.  Help, baked into Python.
2.  Brief intro to regular expressions.
3.  Installing third-party modules ("easy" installations only).

### What type is an object?

Python has a number of [built-in functions](https://docs.python.org/2/library/functions.html), three of which (**type**, **dir**, and **help**) are useful for inspecting an object: its type, its attributes, and its documentation.

When we define a variable or create an object, we know its type.  From time to time, however, we get a result back from some third-party function (or we lose track of what we were doing!), hence the need for these functions.

In [None]:
from collections import defaultdict

a_number = 12
a_string = 'Some meaningless text'
a_list = ['a', 'b', 'c', 'd']
a_dictionary = {'a': 1, 'b':2, 'c': 3, 'd': 4}

a_defaultdict = defaultdict()
for n, a in enumerate(a_list):      # enumerate yields (index, item) pairs
    a_defaultdict[a] = n

print 'type(a_number)', type(a_number)
print 'type(a_string)', type(a_string)
print 'type(a_list)', type(a_list)
print 'type(a_dictionary)', type(a_dictionary)
print 'type(a_defaultdict)', type(a_defaultdict)
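
**type** reports the exact class of an object.  When a third-party function hands back a subclass of something familiar, **isinstance** (another built-in, not covered above) may be the more forgiving check.  A minimal sketch, written with single-argument print(...) calls so it should run under either Python 2 or 3:

```python
from collections import defaultdict

a_defaultdict = defaultdict(int)

exact = type(a_defaultdict) is dict            # exact-class comparison
compatible = isinstance(a_defaultdict, dict)   # also accepts subclasses of dict

print(exact)                                   # False
print(compatible)                              # True
print(type(a_defaultdict).__name__)            # defaultdict
```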

### What are the methods and properties of an object . . . 

. . . and which is which?

In [None]:
print
print dir(a_dictionary)
print
print a_dictionary.keys
print
print a_dictionary.__doc__

a_dictionary.keys
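
One rough way to answer "which is which?" is to ask whether each name returned by **dir** is callable.  The helper below, split_members, is a hypothetical name invented for this sketch; it skips names that start with an underscore and sorts the rest into methods and everything else.

```python
def split_members(obj):
    """Return (methods, other_attributes) for the public names of obj."""
    methods, others = [], []
    for name in dir(obj):
        if name.startswith('_'):
            continue                       # skip private and "dunder" names
        if callable(getattr(obj, name)):
            methods.append(name)           # callable, so it behaves like a method
        else:
            others.append(name)            # plain data attribute or property
    return methods, others

dict_methods, dict_others = split_members({'a': 1, 'b': 2})
print(dict_methods)
print(dict_others)

# A complex number has both kinds: conjugate is a method, real and imag are not.
complex_methods, complex_others = split_members(3 + 4j)
print(complex_methods)
print(complex_others)
```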

### What is a method's signature?

I.e., what arguments does it take, and what does it return?

For help to work, ["docstrings"](https://en.wikipedia.org/wiki/Docstring#Python) must be present in the code being "helped".

In [None]:
help(a_dictionary.keys)      # help() prints its own output and returns None, so no print is needed
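
To see why docstrings matter, here is a small sketch with two made-up functions (count_vowels and no_docs are illustrative names only).  The first carries a docstring and the second does not; the docstring is what **help** would display for each.

```python
def count_vowels(text):
    """Return the number of vowels (a, e, i, o, u) in text, ignoring case."""
    return sum(1 for ch in text.lower() if ch in 'aeiou')

def no_docs(text):
    return text

print(count_vowels.__doc__)    # the text help(count_vowels) would show
print(no_docs.__doc__)         # None: help(no_docs) can only show the signature
print(count_vowels('I am a teapot'))
```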

### type, dir and help work on modules

We're conditioned (or at least I am) to Google everything.  But for standard modules (i.e., those that ship with the core language) and for third-party modules written by good citizens, there is a ton of useful information at your fingertips.

In [None]:
import collections

print
print type(collections)
print
print dir(collections)
print
help(collections.Counter)      # help() prints its own output; wrapping it in print adds a stray "None"
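
The output of **dir** on a module can be long.  One way to trim it, sketched here under the assumption that names starting with an underscore are not meant for us, is to filter the list and then peek at the first line of a docstring before reading the full help.

```python
import collections

# Keep only the public names the module exposes.
public_names = [name for name in dir(collections) if not name.startswith('_')]
print(public_names)

# First line of Counter's docstring, part of what help() displays.
summary = collections.Counter.__doc__.strip().splitlines()[0]
print(summary)
```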

### Brief intro to regular expressions

See

* Doug Knox's ["Understanding Regular Expressions"](https://programminghistorian.org/lessons/understanding-regular-expressions)
* [Python's documentation](https://docs.python.org/2/howto/regex.html)

Make your regular expressions **raw strings** (an r before the opening quote), so the Python interpreter passes the escape character "\" through to re unchanged.
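
A quick sketch of what goes wrong without the raw string: in an ordinary string literal, Python turns \b into a backspace character before re ever sees it, so the pattern silently stops meaning "word boundary".

```python
import re

plain = '\bteapot\b'     # here \b becomes the backspace character (ASCII 8)
raw = r'\bteapot\b'      # backslash and b reach re intact: a word boundary

print(len(plain))        # 8
print(len(raw))          # 10

text = 'I am a teapot, short and stout.'
print(re.search(plain, text))             # None
print(re.search(raw, text).group(0))      # teapot
```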

The examples below touch on **metacharacters**, special sequences, capture groups, and character classes.

In [None]:
import re

test_string = 'I am a teapot, short and stout.'

result = re.sub(r'\bteapot\b', 'beverage brewing system', test_string)     # \b (backslash b) means word boundary

print
print 'test_string', test_string
print
print 'type(result)', type(result)
print
print 'result', result

result = re.sub(r'[aeiou]', '', test_string)

print
print result

result = re.sub(r'\b(tea)(pot)\b', r'\1\1\1\2\2', test_string)

print
print result
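
Capture groups are not only for substitutions.  The sketch below reuses the teapot pattern to show that a match object also hands back each parenthesized group by number, with group 0 standing for the whole match.

```python
import re

test_string = 'I am a teapot, short and stout.'

match = re.search(r'\b(tea)(pot)\b', test_string)

print(match.group(0))    # teapot
print(match.group(1))    # tea
print(match.group(2))    # pot
print(match.groups())    # ('tea', 'pot')

# The same numbers drive the \1 and \2 back-references in re.sub.
swapped = re.sub(r'\b(tea)(pot)\b', r'\2\1', test_string)
print(swapped)           # I am a pottea, short and stout.
```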


### Regex tokenizing

(There's more on this, and on tokenizing in general, in the NLTK book.)

In [None]:
import re

test_string = """
Semolina Pilchard,
Climbing up the Eiffel tower,
Elementary penguin singing Hare Krishna:
Man, you should have seen them kicking Edgar Allen Poe . . . 

I am the egg-man!
They are the egg-men!
I am the walrus!
Goo goo g'joob, goo goo goo g'joob
Goo goo g'joob, goo goo goo g'joob, goo goo
"""

tokens = re.split(r'\W', test_string)      # \W means any non-word character (anything but a letter, digit, or underscore)
 
print
print tokens

tokens = [t for t in re.split(r'\W', test_string) if t > '']      # t > '' keeps only the non-empty strings

print
print tokens

tokens = []
for t in re.split(r'\W', test_string):
    if t > '':
        tokens.append(t)

print
print tokens
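
An alternative to splitting and then filtering, offered here as a sketch rather than as the approach taken above: **re.findall** with \w+ describes the tokens themselves instead of the gaps between them, so no empty strings turn up.

```python
import re

line = 'I am the egg-man!'

split_tokens = re.split(r'\W+', line)     # describe the separators
found_tokens = re.findall(r'\w+', line)   # describe the tokens

print(split_tokens)    # ['I', 'am', 'the', 'egg', 'man', '']
print(found_tokens)    # ['I', 'am', 'the', 'egg', 'man']
```

The trailing empty string in the first result comes from the "!" at the end of the line: the split leaves nothing after the final separator.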

In [None]:
import re

test_string = """
Semolina Pilchard,
Climbing up the Eiffel tower,
Elementary penguin singing Hare Krishna:
Man, you should have seen them kicking Edgar Allen Poe . . . 

I am the egg-man!
They are the egg-men!
I am the walrus!
Goo goo g'joob, goo goo goo g'joob
Goo goo g'joob, goo goo goo g'joob, goo goo
"""

tokens = [t for t in re.split(r'\s', test_string) if t > '']      # \s means white space (space, new line, tab)

print
print tokens

tokens = [t for t in re.split(r'\s|\.|!', test_string) if t > '']  # | (pipe) means or; \. means period

print
print tokens

tokens = [t for t in re.split(r'[\s\.!]', test_string) if t > '']   # [] indicates a "character class"

print
print tokens

tokens = [t for t in re.split(r'([\s\.!])', test_string) if t > '']  # () here puts the split characters in the results

print
print tokens


In [None]:
import re

test_string = """
Semolina Pilchard,
Climbing up the Eiffel tower,
Elementary penguin singing Hare Krishna:
Man, you should have seen them kicking Edgar Allen Poe . . . 

I am the egg-man!
They are the egg-men!
I am the walrus!
Goo goo g'joob, goo goo goo g'joob
Goo goo g'joob, goo goo goo g'joob, goo goo
"""

tokens = [t for t in re.split(r'[^a-z]', test_string.lower()) if t > '']  # ^ at the start of [] means not; split on anything that is not a-z

print
print tokens

tokens = [t for t in re.split(r'(.)', test_string) if t > '']  # . unescaped means any character except a newline; note () in expression

print
print tokens
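
Splitting on everything that is not a-z breaks "egg-man" and "g'joob" in two.  If you would rather keep such words whole, one possible pattern (an assumption about what counts as a word, not a standard) allows an apostrophe or hyphen between runs of letters.

```python
import re

line = "Goo goo g'joob, I am the egg-man!"

# One or more letters, optionally followed by ('|-) plus more letters, repeated.
word_pattern = r"[a-z]+(?:['-][a-z]+)*"

tokens = re.findall(word_pattern, line.lower())
print(tokens)    # ['goo', 'goo', "g'joob", 'i', 'am', 'the', 'egg-man']
```

The (?: ... ) is a non-capturing group: it groups without changing what findall returns.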

#### [where did I leave macbeth, and what's it called?]

It's possible to [run shell commands from the notebook](https://jakevdp.github.io/PythonDataScienceHandbook/01.05-ipython-and-shell-commands.html) (Mac and Linux).

In [None]:
!ls corpora/shakespeare_plaintext

### Matching . . .

Do you want to search for the first match, or do you want to find all matches?

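[**Search vs match**](https://docs.python.org/2/library/re.html#search-vs-match) is yet another odd little corner of Python: re.match only matches at the beginning of the string, while re.search scans the whole string for the first match.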
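
A small sketch of the difference, using a line from the test string above.  The same pattern fails with re.match, which only tries the start of the string, and succeeds with re.search; re.findall collects every hit.

```python
import re

line = 'I am the walrus!'

at_start = re.match(r'walrus', line)      # match only tries position 0
anywhere = re.search(r'walrus', line)     # search scans forward for the first hit
every = re.findall(r'goo', "goo goo g'joob, goo goo goo g'joob")

print(at_start)            # None
print(anywhere.start())    # 9
print(len(every))          # 5
```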

In [None]:
import re

def demonstrate_re_search(regex_pattern, string_to_search):
    
    print 
    print 'regex_pattern', regex_pattern

    match = re.search(regex_pattern, string_to_search)
    
    #print 
    #print '\t', type(match), dir(match)
    #print
    
    if match is None:
        print '\t', 'NOT FOUND'
    else:
        print '\t', 'FOUND'
    
# ------------------------------------------------------------------------

with open('corpora/shakespeare_plaintext/macbeth.txt') as f:      # the with block closes the file for us
    macbeth_text = f.read()

macbeth_text = re.sub(r'\n\n+', '\n', macbeth_text)      # collapse runs of blank lines

demonstrate_re_search(r'\bsemolina pilchard\b', macbeth_text)

demonstrate_re_search(r'\bequivoca\w*', macbeth_text)      # \w* matches the rest of the word; .* would run to the end of the line

matches = re.finditer(r'\bequivoca\w*', macbeth_text.lower())
for m in matches:
    print
    print '-------------------------------------------------------------'
    #print type(m)
    #print type(m), dir(m)
    context_start = max(0, m.start() - 100)      # clamp so a negative index can't wrap around
    print m.start(), macbeth_text[context_start: m.end() + 100]
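
A note on .* versus \w* when the goal is "any word starting with equivoca".  The dot matches any character except a newline and the star is greedy, so .* swallows the rest of the line; \w* stops at the end of the word.  The sample line here is made up for the demonstration and is not a quotation from the play.

```python
import re

# A made-up sample line, not a quotation from the play.
sample = 'The porter jokes about an equivocator who could not equivocate to heaven.'

greedy = re.search(r'\bequivoca.*\b', sample).group(0)   # runs to the last word boundary
words = re.findall(r'\bequivoca\w*', sample)             # one hit per word

print(greedy)    # equivocator who could not equivocate to heaven
print(words)     # ['equivocator', 'equivocate']
```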


### Installing third-party packages

We're going to install

* [Beautiful Soup](https://www.crummy.com/software/BeautifulSoup/bs4/doc/), a library for parsing content from web pages;
* [lxml](http://lxml.de/), which is especially useful for parsing XML;
* [Scrapy](https://scrapy.org/), a web scraper;
* [spacy](https://spacy.io/), an NLP (Natural Language Processing) package;
* [textacy](http://textacy.readthedocs.io/en/latest/), which adds some goodness on top of spacy;
* [gensim](https://radimrehurek.com/gensim/), a nifty corpus loader/comparer/topic modeller;
* And [nltk](https://www.nltk.org/).

#### Mac and Linux

We're going to try to [**use conda**](https://conda.io/docs/user-guide/tasks/manage-pkgs.html) for everything.  In the cell below, remove the '#' that precedes each line, then run the cell.  Don't remove the '!'; otherwise, the notebook won't see the line as a shell command.

#### Windows

**I hope this works.  If not, make your way to Eads 004, and we'll figure it out.**

From the Start menu, you should have something called the "Anaconda Prompt."  Open it.  Copy the content of the next cell line by line, leaving off the leading '!', and hit Enter after each line.


In [None]:
!conda install -y beautifulsoup4
!conda install -y lxml
!conda install -y scrapy
!conda install -y spacy
!python -m spacy download en
!conda install -y textacy
!conda install -y gensim
!conda install -y nltk
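
Once the installs finish, a quick sanity check is to try importing each package.  This is a sketch only: check_modules is a hypothetical helper, and the import names in the list are assumptions about what each package is called at import time (beautifulsoup4, for instance, is imported as bs4).

```python
import importlib

def check_modules(module_names):
    """Map each module name to True if it imports cleanly, else False."""
    status = {}
    for name in module_names:
        try:
            importlib.import_module(name)
            status[name] = True
        except Exception:      # ImportError, or a broken install failing at import time
            status[name] = False
    return status

# Assumed import names for the packages installed above.
wanted = ['bs4', 'lxml', 'scrapy', 'spacy', 'textacy', 'gensim', 'nltk']

status = check_modules(wanted)
for name in wanted:
    print('%-8s %s' % (name, 'ok' if status[name] else 'MISSING'))
```

Anything reported as MISSING is worth reinstalling before moving on.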

#### Grab one nltk corpus

In [None]:
import nltk
nltk.download("stopwords")