## INVESTIGATING UNSTRUCTURED TEXT
As we've seen, even the sometimes messy and unpredictable Markup language of HTML can give us clues to how data may be structured. But language as a system (as we saw in Borges) also comes with its own structures. Python provides numerous methods for navigating through basic linguistic patterns. Let's begin with repetition itself:

In [1]:
speech = '''Tomorrow, and tomorrow, and tomorrow,
Creeps in this petty pace from day to day,
To the last syllable of recorded time;
And all our yesterdays have lighted fools
The way to dusty death. Out, out, brief candle!
Life's but a walking shadow, a poor player,
That struts and frets his hour upon the stage,
And then is heard no more. It is a tale
Told by an idiot, full of sound and fury,
Signifying nothing.'''

speech

"Tomorrow, and tomorrow, and tomorrow,\nCreeps in this petty pace from day to day,\nTo the last syllable of recorded time;\nAnd all our yesterdays have lighted fools\nThe way to dusty death. Out, out, brief candle!\nLife's but a walking shadow, a poor player,\nThat struts and frets his hour upon the stage,\nAnd then is heard no more. It is a tale\nTold by an idiot, full of sound and fury,\nSignifying nothing."

There're various ways to investigate Macbeth's famous, very short, speech. We begin by searching for the obvious, searching through the whole speech.

Here we are using Python's string methods. Note that a string is treated as a list of characters. The first character of the speach, "T" is in the [0] position.

In [2]:
speech[0]

'T'

In [3]:
'tomorrow' in speech

True

In [4]:
speech.find('candle')

202

In [5]:
speech[190:202+len('candle')]
speech[14:22]
len(speech)

402

In [6]:
speech.count('and')
speech.count('tomorrow')

2

In [7]:
speech.lower().count('tomorrow')
speech.lower().find('idiot')

352

In [8]:

speech.count('And')

2

In [9]:
speech.count('\nT')

4

In [10]:
speech.lower().count('and')
len(speech)

402

Of course, there is already a structure to the speech that we are ignoring--it has lines. Let's get out those lines and put them into a list.

In [11]:
#lines = speech.split('\n')
#spliting by lines is so common these is a method for it
#but I prefer the above
lines = speech.splitlines() 
lines

['Tomorrow, and tomorrow, and tomorrow,',
 'Creeps in this petty pace from day to day,',
 'To the last syllable of recorded time;',
 'And all our yesterdays have lighted fools',
 'The way to dusty death. Out, out, brief candle!',
 "Life's but a walking shadow, a poor player,",
 'That struts and frets his hour upon the stage,',
 'And then is heard no more. It is a tale',
 'Told by an idiot, full of sound and fury,',
 'Signifying nothing.']

In [12]:
firstline = lines[0]
firstline

'Tomorrow, and tomorrow, and tomorrow,'

Python has a handful of built-in ways to search a line. Here are just a few.

In [13]:
yest = firstline.replace('tomorrow','yesterday',)
yest

'Tomorrow, and yesterday, and yesterday,'

In [14]:
firstline.startswith('Tomorrow')

True

In [15]:
firstline.endswith('tomorrow,')

True

## List comprehensions
What if we want to search through every line. The obvious way is using a `for` loop.

In [16]:
helpful_list = []
for line in lines:
    if line.startswith('T'):
        helpful_list.append(line[0:10])
helpful_list

['Tomorrow, ', 'To the las', 'The way to', 'That strut', 'Told by an']

That is a very simple loop, so simple that Python has a solution for a looping through a list using a one-line statement, called a **list comprehension**

In [17]:
helpufl_list = [line[0:10] for line in lines if line.startswith('T')]
helpful_list

['Tomorrow, ', 'To the las', 'The way to', 'That strut', 'Told by an']

Whaaaat? Let's break down how a list comprehension works:

**helpful_list** = [line[0:10] for line in lines if line.startswith('T')]

*A variable:* that is the variable that's going to hold the final output of this loop.

helpful_list = **\[**line[0:10] for line in lines if line.startswith('T')**\]**

*Returns a list:* The **\[ \]** indicate that what is being returned is actually a list.

helpful_list= [**line[0:10]** for line in lines if line.startswith('T')]

*What gets placed in the list:* This first part inside the brackets is what is actually going to be entered in the list (if it passes the test at the end.)

helpful_list= [line[0:10] **for line in lines** if line.startswith('T')]

*The loop:* This defines the loop. We are looking through lines, and each element inside that list we are going to call line.

helpful_list= [line[0:10] for line in lines **if line.startswith('T')**]

*The test for each line:* Finally, this is the if statement that tests for something for each element in the list (lines). Only if it passes the test (if line starts with a "T") does that first part of the list comprehension (line[0:10]) get placed in the resulting list.

Remember this, when we start using more robust ways of searching line by line (sentence by sentence, etc) these will come in handy. But before we jump to those special searching methods, let's have a little detour on sorting.

## Sorting!
Say we want to investigate the lines in the speech, and order them from longest line to shortest line. Well we know how to get the length of each line using loop, but how can we measure them to reorder our list?

In [18]:
for line in lines:
    print(len(line))

37
42
38
41
47
43
46
39
41
19


We could write a function that pairs these numbers with each line, and then sorts through everything--but sort functions are notoriously challenging to write. And Python has a built in sorting function.

In [19]:
sortlines = lines.copy()
sortlines.sort()
sortlines

['And all our yesterdays have lighted fools',
 'And then is heard no more. It is a tale',
 'Creeps in this petty pace from day to day,',
 "Life's but a walking shadow, a poor player,",
 'Signifying nothing.',
 'That struts and frets his hour upon the stage,',
 'The way to dusty death. Out, out, brief candle!',
 'To the last syllable of recorded time;',
 'Told by an idiot, full of sound and fury,',
 'Tomorrow, and tomorrow, and tomorrow,']

But not only that, Python has a built in mini-function called a `lambda` function that you can nest inside at sorting function. lambda functions are a bit advance for where we are now, so don't lose any time and brain power on them, it's just go to know they exist.

In [20]:
sortlines = lines.copy()
sortlines.sort(key=lambda x: len(x), reverse=True)
# what is this one down here doing?
sortlines.sort(key=lambda x: x.split()[-1], reverse=True)
#the line above is using each key to split each line and take the last word then returing a sorted list of lines in a reversed alphabatical order.
sortlines

['Tomorrow, and tomorrow, and tomorrow,',
 'To the last syllable of recorded time;',
 'And then is heard no more. It is a tale',
 'That struts and frets his hour upon the stage,',
 "Life's but a walking shadow, a poor player,",
 'Signifying nothing.',
 'Told by an idiot, full of sound and fury,',
 'And all our yesterdays have lighted fools',
 'Creeps in this petty pace from day to day,',
 'The way to dusty death. Out, out, brief candle!']

## Regular Expressions
The more you work with unstructured text, the greater desire you will have for the power that regular expressions give you. Regular expressions are a mini-language to themselves (often sharing similarities across different programming languages). They allow you to search for a variety of patterns within text. The most obvious patterns you might find are telephone numbers, ZIP Codes, email addresses (social security numbers and credit card numbers for the more malicious)--and many regular expressions have been written to capture these with varying levels accuracy. Today, however, our focus will be on exploring text.

First import the built-in regular expression library `re`

In [21]:
import re

There are five main regular expression methods that we will work with:

**match()** & **search()**: these methods tell you whether or not they found a match, and where that match was located--although match() only searches at the very beginning of the line--so it is rarely useful.

**split()** & **sub()**: these two work just like split() & replace(), but they search for patterns and return a list or a substitute string respectively.

**findall()**: just as the name sounds, this method returns a list of matching patterns that were found throughout the entire string.

In [22]:

#found = 
if (re.search("omorrow",firstline,re.IGNORECASE)):
    print("Yes!")
else:
    print('No!')
#found
# found = re.search("morrow",firstline,re.IGNORECASE)
#found.start()
#found


Yes!


In [23]:
newlist = re.split("and",firstline,flags=re.IGNORECASE)
newstring = re.sub("tomorrow","yesterday",firstline,flags=re.IGNORECASE)
print(newlist,newstring)

['Tomorrow, ', ' tomorrow, ', ' tomorrow,'] yesterday, and yesterday, and yesterday,


In [24]:
words = re.findall("to",firstline,re.IGNORECASE)
len(words)

3

## Special characters
While the search methods above are more useful than what's built into Python, it is the pattern seeking commands that--once you get used to them--do the most powerful work.

Here's a list  of the most common pattern seeking characters:

| special character | what it does |
|--------|---------|
| `.` | Match any character except newline |
| `^` | match the beginning of string |
| `$` | match the end of string, including `\n` |
| `*` | match 0 or more repetitions |
| `+` | match 1 or more repetitions  |
| `?` | match 0 or 1 repetitions  |
| `{m}` | m specifies the number of repetitions  |
| `{m,n}` | m and n specifies a range of repetitions  |
| `{m,}` | m specifies the minimum number of repetitions  |


In [25]:
all_ll = re.findall("....l+.{4}",speech)
#re.search("^tomorrow",firstline)
#re.search("tomorrow\.$",firstline)
all_ll


['the last ',
 'llable of',
 'nd all our',
 'ave light',
 'a walking',
 'or player',
 ', full of ']

In [26]:
#a list comprehension again!
#Note that match() would produce the same thing
#\b is a word boundary, super important and magicial, see below!
[line for line in lines if re.search(r"\bidiot\b",line)]

['Told by an idiot, full of sound and fury,']

In [27]:
#what is this searching for...
[line for line in lines if re.search("[erw],$",line)]

['Tomorrow, and tomorrow, and tomorrow,',
 "Life's but a walking shadow, a poor player,",
 'That struts and frets his hour upon the stage,']

In [28]:
th_plus = re.findall("mor+..",speech)
th_plus

['morrow', 'morrow', 'morrow', 'more.']

In [29]:
names = "Jon, John, Jonn, Johhhn, Joan"
find_names = re.findall(r"Jo.{0,2}n\b",names)
find_names

['Jon', 'John', 'Jonn', 'Joan']

In [30]:
# see what happens if you replace the + with a *
l_plus = re.findall("..l+..",speech)
l_plus

['e las',
 'syllab',
 ' all o',
 'e lig',
 'ndle!',
 'walki',
 ' play',
 'Told ',
 'full o']

In [31]:
# ? means one occurance or zero.
l_plus = re.findall(".or?",speech)
l_plus

['To',
 'mor',
 'ro',
 'to',
 'mor',
 'ro',
 'to',
 'mor',
 'ro',
 'ro',
 'to',
 'To',
 ' o',
 'cor',
 ' o',
 'fo',
 'to',
 ' o',
 'do',
 'po',
 'ho',
 'po',
 'no',
 'mor',
 'To',
 'io',
 ' o',
 'so',
 'no']

In [32]:
o_2 = re.findall("..o{2}..",speech)
o_2

[' fools', ' poor ']

## Sets and Groups
**Sets**, which include `[]` in shortcuts like `\w`, allow you to search for certain types of characters. **Groups**, which are demarcated by `()` allow you to specify important sub-patterns that you can access individually.

| enclosures | what it does |
|--------|---------|
| `[]` | A defined set of characters to search for |
| `()` | A group of characters to search for, can be accessed individually in the results. |


| Examples of sets | what it does |
|--------|---------|
| `[aeiou]` | Find any vowel |
| `[Tt]` | Find a lowercase or uppercase t |
| `[0-9]` | Find any number, there is a shortcut for this |
| `[^0-9]` | Find anything that's not number, there is a shortcut for this |
| `[13579]` | Find any odd numer |
| `[A-Za-z]` | Find any letter, there is a shortcut for this too |
| `[+.*]` | Find those actual characters, special characters are canceled in sets (not including shortcuts: see below) |


| Shortcut | what it does |
|--------|---------|
| `\b` | Word boundary: spaces, commas, punctuation, end of line, anything that comes at the beginning or end of a word |
| `\B` | Not a word-boundary |
| `\d` | numbers [0-9] |
| `\D` | not numbers |
| `\s` | whitespace characters: space, tab... |
| `\S` | not space |
| `\w` | letters |
| `\W` | not letters |


In [33]:
words = re.findall(r"\b[Ss]\w{4}",speech)
words

['sylla', 'shado', 'strut', 'stage', 'sound', 'Signi']

In [34]:
#you can use a set to search for a special character
#this is searching for the last word in a sentence
words = re.findall(r"\b\w+[.]",speech)
words
#you can also use \ get a special character
#words = re.findall(r"\b\w+\.",speech)
#words

['death.', 'more.', 'nothing.']

In [35]:
words = re.findall(r"[tT]\w+",speech)
words

['Tomorrow',
 'tomorrow',
 'tomorrow',
 'this',
 'tty',
 'to',
 'To',
 'the',
 'time',
 'terdays',
 'ted',
 'The',
 'to',
 'ty',
 'th',
 'That',
 'truts',
 'ts',
 'the',
 'tage',
 'then',
 'tale',
 'Told',
 'thing']

Looking for phrases and grouping them

In [36]:
# three-word phrases that begin with two-letter words
phrases = re.findall(r"\b\w{2}\W+\w+\W+\w+",speech)
#overlapping using look back ?=, this is advanced...
#phrases = re.findall(r"(?=(\b\w{2}\W+\w+\W+\w+))",speech) 
phrases

['in this petty',
 'to day,\nTo',
 'of recorded time',
 'to dusty death',
 'is heard no',
 'It is a',
 'by an idiot',
 'of sound and']

GROUPS ( ) ALLOW YOU TO SEPARATE DIFFERENT TEXT ELEMENTS INTO TUPLES

In [37]:
phrases = re.findall(r"(\b\w{2})\W+(\w+)\W+(\w+)",speech)
#phrases = re.findall(r"(?=(\b\w{2})\W+(\w+)\W+(\w+))",speech)
phrases

[('in', 'this', 'petty'),
 ('to', 'day', 'To'),
 ('of', 'recorded', 'time'),
 ('to', 'dusty', 'death'),
 ('is', 'heard', 'no'),
 ('It', 'is', 'a'),
 ('by', 'an', 'idiot'),
 ('of', 'sound', 'and')]

Searching a longer poem

In [47]:
# Import text
f = open('wasteland.txt', encoding='utf-8')
wasteland = f.read()

In [48]:
#make a list of lines
poemlines = wasteland.split('\n')
#strip every line to get rid of leading/ending whitespace
poemlines = [line.strip() for line in poemlines]

In [40]:
[line for line in poemlines if re.search(r"what", line)]

['What are the roots that clutch, what branches grow',
 '“I never know what you are thinking. Think.”',
 'He’ll want to know what you done with that money he gave you',
 'Datta: what have we given?']

SOME MUCH MORE COMMON USES OF REGEX!!!

In [51]:
#proper email systax: (simplified, this is looking for .edu address with numbers, letters, _, or -)
import re
email = "thirj525@newschool.edu"
if (re.match(r"[-_\w\d]+@[-_\w\d]+[.]edu$",email,re.IGNORECASE)):
    print("Yes!")
else:
    print('No!')

Yes!


In [52]:
#proper phone number syntax: SO FUN!!!

phone = "(888) 929-9000"
# phone = "888-929-9000"
# phone = "8889299000"
# phone = "888929900033"
if (re.match(r"[(]*\d{3}[) -]*\d{3}[-]*\d{4}$",phone,re.IGNORECASE)):
    print("Yes!")
else:
    print('No!')

Yes!


THE GUARDIAN!!!
Using REGEX GROUPS to consistently structure of the list in a much more straightforward and accurate way!!

In [53]:
import requests
from bs4 import BeautifulSoup
my_url = "https://www.theguardian.com/books/2017/dec/31/the-100-best-nonfiction-books-of-all-time-the-full-list"
raw_html = requests.get(my_url).content
soup_doc = BeautifulSoup(raw_html, "html.parser")

In [67]:
all_entries=soup_doc.find_all('p')
get_better_list = []
for book in all_entries[1:-1]:
    whole_line = book.text
    print(whole_line)
    #pattern=re.findall(r"^\d{1,3}[.].+[(][\d]{4,}[)].+",whole_line)##all_but 29 and 83
    #pattern=re.findall(r"^\d{1,3}[.].+[(][\d/]{4,}[)].+",whole_line)#all but 83
    pattern=re.findall(r"^\d{1,3}[.].+[(][\d/-]{4,}[)].+",whole_line)#ALL!!!!!!
    #pattern_groups = re.findall(r"^(\d{1,3})[.](.+)[(]([\d/-]{4,})[)](.+)",whole_line)
    print(pattern_groups)
    get_better_list.append(pattern_groups)

1. The Sixth Extinction by Elizabeth Kolbert (2014) 
  An engrossing account of the looming catastrophe caused by ecology’s “neighbours from hell” – mankind.
[('100', ' King James Bible: The Authorised Version ', '1611', 'It is impossible to imagine the English-speaking world celebrated in this series without the King James Bible, which is as universal and influential as Shakespeare.')]
2. The Year of Magical Thinking by Joan Didion (2005)This steely and devastating examination of the author’s grief following the sudden death of her husband changed the nature of writing about bereavement.
[('100', ' King James Bible: The Authorised Version ', '1611', 'It is impossible to imagine the English-speaking world celebrated in this series without the King James Bible, which is as universal and influential as Shakespeare.')]
3. No Logo by Naomi Klein (1999)
  Naomi Klein’s timely anti-branding bible combined a fresh approach to corporate hegemony with potent reportage from the dark side of capi

In [81]:
##MAKE SURE ALL ENTRIES ARE SPLIT INTO 4s!!!
get_better_list
for each in get_better_list:
    if len(each[0]) != 4:
        print(each[0])

In [89]:
get_better_list[99][0]

('100',
 ' King James Bible: The Authorised Version ',
 '1611',
 'It is impossible to imagine the English-speaking world celebrated in this series without the King James Bible, which is as universal and influential as Shakespeare.')