## Regular expressions
More ways to use regular expressions...

More ways to think about parsing and reading unstructured texts...

In [6]:
import re

**Parsing an entire text** with groups and split

Open this text file for Hamlet and take a look at it! The text is very basic. 
What I do below is look for the various patterns that begin to group the parts of the play together.

Plays are well-structured texts. In Beautiful Soup or JavaScript Object Notation (JSON) terms, we could think of a play as being organized like this:
	play.act.scene.dialogue_stageDirection

These are the levels of organization of a play.
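
As a rough sketch of that hierarchy, here is how a tiny slice of the play might look as nested Python dictionaries and lists (the same shape JSON would use). The key names (`acts`, `scenes`, `lines`, `speaker`, `text`) are illustrative assumptions; the text file itself does not provide them.

```python
# A hypothetical nested structure: play -> acts -> scenes -> lines.
# The key names are made up for illustration.
play_outline = {
    "title": "The Tragedy of Hamlet, Prince of Denmark",
    "acts": [
        {
            "act": "ACT I",
            "scenes": [
                {
                    "scene": "SCENE I.",
                    "setting": "Elsinore. A platform before the castle.",
                    "lines": [
                        {"speaker": "BERNARDO", "text": "Who's there?"},
                        {"speaker": "FRANCISCO",
                         "text": "Nay, answer me: stand, and unfold yourself."},
                    ],
                }
            ],
        }
    ],
}

# Walk down the levels: play -> act -> scene -> line
first_line = play_outline["acts"][0]["scenes"][0]["lines"][0]
print(first_line["speaker"], "says:", first_line["text"])
```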


In [12]:
with open('hamlet.txt', 'r', encoding='utf8') as f:
    play = f.read()
print(play[:1000])
#That last line just shows us the first 1,000 characters of the play.


The Tragedy of Hamlet, Prince of Denmark

ACT I

SCENE I. Elsinore. A platform before the castle.

FRANCISCO at his post. Enter to him BERNARDO
BERNARDO
Who's there?
FRANCISCO
Nay, answer me: stand, and unfold yourself.
BERNARDO
Long live the king!
FRANCISCO
Bernardo?
BERNARDO
He.
FRANCISCO
You come most carefully upon your hour.
BERNARDO
'Tis now struck twelve; get thee to bed, Francisco.
FRANCISCO
For this relief much thanks: 'tis bitter cold,
And I am sick at heart.
BERNARDO
Have you had quiet guard?
FRANCISCO
Not a mouse stirring.
BERNARDO
Well, good night.
If you do meet Horatio and Marcellus,
The rivals of my watch, bid them make haste.
FRANCISCO
I think I hear them. Stand, ho! Who's there?
Enter HORATIO and MARCELLUS

HORATIO
Friends to this ground.
MARCELLUS
And liegemen to the Dane.
FRANCISCO
Give you good night.
MARCELLUS
O, farewell, honest soldier:
Who hath relieved you?
FRANCISCO
Bernardo has my place.
Give you good night.
Exit

MARCELLUS
Holla! Bernardo!
BERNARDO
Say,
Wh

**GROUPING PATTERNS**

We can isolate information types using regular expressions. Here are a bunch of different regular expressions using groups ( ) and re.findall() that pull out every instance of a pattern. Try them out!

In [26]:
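#This gets a list of character (speaker) names: any line made up only of capitals and spaces
#(so it also picks up act headings such as 'ACT I')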
all_chars = re.findall(r"[\n]([A-Z ]+)[\n]",play)
print(all_chars)

#Gets a list of act names
act_names = re.findall(r"[\n](ACT [IV]+)[\n]",play)
act_names

#Gets a list of act and scene names
act_scene=re.findall(r"[\n](ACT [IV]+[\n]+SCENE [IVX]+\.)",play)
act_scene

#List of scene names
all_scenes = re.findall(r"(SCENE [IVX]+)",play)
all_scenes

#List of all acts and all scenes with acts (with a blank when ACT doesn't appear)
act_w_scene = re.findall(r"(ACT [IV]+)?[\n]+(SCENE [IVX]+)",play)
act_w_scene

#List of all acts and all scenes plus the scene description
act_w_scene_des = re.findall(r"(ACT [IV]+)?[\n]+(SCENE [IVX]+\.)(.+)[\n]",play)
act_w_scene_des

['ACT I', 'BERNARDO', 'FRANCISCO', 'BERNARDO', 'FRANCISCO', 'BERNARDO', 'FRANCISCO', 'BERNARDO', 'FRANCISCO', 'BERNARDO', 'FRANCISCO', 'BERNARDO', 'FRANCISCO', 'HORATIO', 'MARCELLUS', 'FRANCISCO', 'MARCELLUS', 'FRANCISCO', 'MARCELLUS', 'BERNARDO', 'HORATIO', 'BERNARDO', 'MARCELLUS', 'BERNARDO', 'MARCELLUS', 'HORATIO', 'BERNARDO', 'HORATIO', 'BERNARDO', 'MARCELLUS', 'BERNARDO', 'MARCELLUS', 'BERNARDO', 'HORATIO', 'BERNARDO', 'MARCELLUS', 'HORATIO', 'MARCELLUS', 'BERNARDO', 'HORATIO', 'MARCELLUS', 'BERNARDO', 'HORATIO', 'MARCELLUS', 'HORATIO', 'MARCELLUS', 'HORATIO', 'MARCELLUS', 'HORATIO', 'BERNARDO', 'HORATIO', 'MARCELLUS', 'HORATIO', 'BERNARDO', 'HORATIO', 'MARCELLUS', 'BERNARDO', 'HORATIO', 'MARCELLUS', 'HORATIO', 'MARCELLUS', 'KING CLAUDIUS', 'CORNELIUS VOLTIMAND', 'KING CLAUDIUS', 'LAERTES', 'KING CLAUDIUS', 'LORD POLONIUS', 'KING CLAUDIUS', 'HAMLET', 'KING CLAUDIUS', 'HAMLET', 'QUEEN GERTRUDE', 'HAMLET', 'QUEEN GERTRUDE', 'HAMLET', 'KING CLAUDIUS', 'QUEEN GERTRUDE', 'HAMLET', 'KIN

[('ACT I', 'SCENE I.', ' Elsinore. A platform before the castle.'),
 ('', 'SCENE II.', ' A room of state in the castle.'),
 ('', 'SCENE III.', " A room in Polonius' house."),
 ('', 'SCENE IV.', ' The platform.'),
 ('', 'SCENE V.', ' Another part of the platform.'),
 ('ACT II', 'SCENE I.', " A room in POLONIUS' house."),
 ('', 'SCENE II.', ' A room in the castle.'),
 ('ACT III', 'SCENE I.', ' A room in the castle.'),
 ('', 'SCENE II.', ' A hall in the castle.'),
 ('', 'SCENE III.', ' A room in the castle.'),
 ('', 'SCENE IV.', " The Queen's closet."),
 ('ACT IV', 'SCENE I.', ' A room in the castle.'),
 ('', 'SCENE II.', ' Another room in the castle.'),
 ('', 'SCENE III.', ' Another room in the castle.'),
 ('', 'SCENE IV.', ' A plain in Denmark.'),
 ('', 'SCENE V.', ' Elsinore. A room in the castle.'),
 ('', 'SCENE VI.', ' Another room in the castle.'),
 ('', 'SCENE VII.', ' Another room in the castle.'),
 ('ACT V', 'SCENE I.', ' A churchyard.'),
 ('', 'SCENE II.', ' A hall in the castle

**regex split()**

If you use re.split() with groups ( ), the result keeps the text matched by each group as well as everything between the matches. So now you are getting an organized list of all of the components of this play.
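
To see the effect of the groups on something small, here is a minimal sketch that splits a made-up miniature "play" twice, once without groups and once with them. The `mini_play` string and the variable names are assumptions for illustration only.

```python
import re

# A made-up miniature "play" so the result is short enough to read.
mini_play = ("Title\nSCENE I. A platform.\nWho's there?\n"
             "SCENE II. A room.\nLong live the king!\n")

# Without groups: the scene headings act as separators and are thrown away.
no_groups = re.split(r"SCENE [IVX]+\..+\n", mini_play)

# With groups: the text matched by each group is kept as its own element.
with_groups = re.split(r"(SCENE [IVX]+\.)(.+)\n", mini_play)

print(no_groups)
print(with_groups)
```

In the second list, every scene contributes three elements in a row: the scene number, the description, and the text that follows.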

In [59]:
#EVERYTHING!!! this one regular expression, using split, 
#parses the entire structure of the play.
#It gets a list that has the act, the scene, the scene description, 
#and the entirety of that scene--each as a separate list element
#so every fourth element contains the complete text of a scene.
scenes = re.split(r"(ACT [IV]+)*[\n]+(SCENE [IVX]+[.])(.+)[\n]",play)

print(len(scenes))
scenes[5]

81


In [62]:
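#scenes[5] is the act slot for the second scene. The optional (ACT ...) group
#did not match there, so split() fills that slot with None.
print(scenes[5])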

None


**Making that list into a useful dictionary!**

A good thing to understand: **lists** are great for isolating and ordering a series of data.

Whereas **dictionaries** are great for grouping that data into units and making those units more meaningful by assigning keys to them.

The list I get from the regular expression above is extremely useful because it has every component part of the play isolated and in order. Below I transform it into a **list of every scene**, where each *scene* is a **dictionary** with **keys** for the *act number*, the *scene number*, the *scene description*, and all of the *dialogue* (and stage directions) from that scene.
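
A small sketch of why that shape is convenient: the list keeps the scenes in order, and the keys let you ask for fields by name. The three-item `scene_list` below is a shortened stand-in (an assumption for illustration), with dialogue cut down to the opening lines.

```python
# A shortened stand-in for the list of scene dictionaries.
scene_list = [
    {"act": "ACT I", "scene": "SCENE I.",
     "setting": " Elsinore. A platform before the castle.",
     "dialogue": "\nBERNARDO\nWho's there?\n"},
    {"act": "ACT I", "scene": "SCENE II.",
     "setting": " A room of state in the castle.",
     "dialogue": "\nKING CLAUDIUS\nThough yet of Hamlet our dear brother's death\n"},
    {"act": "ACT II", "scene": "SCENE I.",
     "setting": " A room in POLONIUS' house.",
     "dialogue": "\nLORD POLONIUS\nGive him this money and these notes, Reynaldo.\n"},
]

# Lists give order: the second scene in the play
second = scene_list[1]

# Dictionaries give meaning: pull out one field from every scene in ACT I
act_one_settings = [s["setting"].strip() for s in scene_list if s["act"] == "ACT I"]

print(second["scene"])
print(act_one_settings)
```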

In [68]:
#The list we get from that regular expression gives us a pattern
#(after the first element [0], which is just the title of the play).

# The pattern is: act [1], scene [2], description [3], dialogue [4]
#                 act [5], scene [6], description [7], dialogue [8]
#                 act [9], scene [10], description [11], dialogue [12]
#                 ...and so forth...
# So starting at element 1, every 4 elements match that pattern.
# This loop sets the range to start x at 1 and to jump ahead 4 at a time,
# and it uses x to pick out each element
# and enter it into a more meaningful dictionary by key.
# (One tricky part: the act element "ACT I" is None for all subsequent scenes
#  until we get to the next act, so I control for that with the variable current_act.)

hamlet_structure=[]
current_act = "Will Be the Act Name"
for x in range(1,len(scenes),4):
    if scenes[x] is not None:
        current_act = scenes[x]
    scene_dict = {}
    scene_dict['act'] = current_act
    scene_dict['scene'] = scenes[x+1]
    scene_dict['setting'] = scenes[x+2]
    scene_dict['dialogue'] = scenes[x+3]
    hamlet_structure.append(scene_dict)
hamlet_structure[5]

{'act': 'ACT II',
 'scene': 'SCENE I.',
 'setting': " A room in POLONIUS' house.",
 'dialogue': "\nEnter POLONIUS and REYNALDO\nLORD POLONIUS\nGive him this money and these notes, Reynaldo.\nREYNALDO\nI will, my lord.\nLORD POLONIUS\nYou shall do marvellous wisely, good Reynaldo,\nBefore you visit him, to make inquire\nOf his behavior.\nREYNALDO\nMy lord, I did intend it.\nLORD POLONIUS\nMarry, well said; very well said. Look you, sir,\nInquire me first what Danskers are in Paris;\nAnd how, and who, what means, and where they keep,\nWhat company, at what expense; and finding\nBy this encompassment and drift of question\nThat they do know my son, come you more nearer\nThan your particular demands will touch it:\nTake you, as 'twere, some distant knowledge of him;\nAs thus, 'I know his father and his friends,\nAnd in part him: ' do you mark this, Reynaldo?\nREYNALDO\nAy, very well, my lord.\nLORD POLONIUS\n'And in part him; but' you may say 'not well:\nBut, if't be he I mean, he's very w

In [69]:
hamlet_structure

[{'act': 'ACT I',
  'scene': 'SCENE I.',
  'setting': ' Elsinore. A platform before the castle.',
 {'act': 'ACT I',
  'scene': 'SCENE II.',
  'setting': ' A room of state in the castle.',
  'dialogue': "\nEnter KING CLAUDIUS, QUEEN GERTRUDE, HAMLET, POLONIUS, LAERTES, VOLTIMAND, CORNELIUS, Lords, and Attendants\nKING CLAUDIUS\nThough yet of Hamlet our dear brother's death\nThe memory be green, and that it us befitted\nTo bear our hearts in grief and our whole kingdom\nTo be contracted in one brow of woe,\nYet so far hath discretion fought with nature\nThat we with wisest sorrow think on him,\nTogether with remembrance of ourselves.\nTherefore our sometime sister, now our queen,\nThe imperial jointress to this warlike state,\nHave we, as 'twere with a defeated joy,--\nWith an auspicious and a dropping eye,\nWith mirth in funeral and with dirge in marriage,\nIn equal scale weighing delight and dole,--\nTaken to wife: nor have we herein barr'd\nYour better wisdoms, which have freely gone\

In [74]:
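sentence = 'The horse the boy the car the truck went to the ocean.'
#scenes = re.split(r"(ACT [IV]+)*[\n]+(SCENE [IVX]+[.])(.+)[\n]",play)
#re.split() with a group keeps each 'the' as its own element in the result:
kept_list = re.split(r"(the)",sentence)
#the plain string method split() drops the separator (this is the list shown below):
new_list= sentence.split(" the ")
new_list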


['The horse', 'boy', 'car', 'truck went to', 'ocean.']

**Groups for data type**

In [76]:
#also see this link for scraping
#https://www.house.gov/representatives

house_reps = '''1st Zeldin, Lee R 1517 LHOB (202) 225-3826 Financial Services Foreign Affairs 
2nd King, Pete R 339 CHOB (202) 225-7896  Financial Services Homeland Security Intelligence 
3rd Suozzi, Thomas D 226 CHOB (202) 225-3335  Armed Services Foreign Affairs 
4th Rice, Kathleen D 1508 LHOB (202) 225-5516  Homeland Security Veterans' Affairs 
5th Meeks, Gregory W. D 2234 RHOB (202) 225-3461  Financial Services Foreign Affairs 
6th Meng, Grace D 1317 LHOB (202) 225-2601  Appropriations 
7th Velázquez, Nydia M. D 2302 RHOB (202) 225-2361  Financial Services Natural Resources Small Business 
8th Jeffries, Hakeem D 1607 LHOB (202) 225-5936  Budget Judiciary 
9th Clarke, Yvette D. D 2058 RHOB (202) 225-6231  Energy and Commerce Small Business Ethics 
10th Nadler, Jerrold D 2109 RHOB (202) 225-5635  Judiciary 
11th Donovan, Daniel R 1541 LHOB (202) 225-3371  Foreign Affairs Homeland Security 
12th Maloney, Carolyn D 2308 RHOB (202) 225-7944  Financial Services Oversight and Government Reform 
13th Espaillat, Adriano D 1630 LHOB (202) 225-4365  Education and the Workforce Foreign Affairs Small Business 
14th Crowley, Joseph D 1035 LHOB (202) 225-3965  Ways and Means 
15th Serrano, José E. D 2354 RHOB (202) 225-4361  Appropriations 
16th Engel, Eliot D 2462 RHOB (202) 225-2464  Foreign Affairs Energy and Commerce 
17th Lowey, Nita D 2365 RHOB (202) 225-6506  Appropriations Joint Select Committee on Budget and APPNs Process Reform 
18th Maloney, Sean Patrick D 1027 LHOB (202) 225-5441  Agriculture Transportation and Infrastructure 
19th Faso, John R 1616 LHOB (202) 225-5614  Agriculture Budget Transportation and Infrastructure 
20th Tonko, Paul D. D 2463 RHOB (202) 225-5076  Energy and Commerce Science, Space, and Technology 
21st Stefanik, Elise R 318 CHOB (202) 225-4611  Armed Services Education and the Workforce Intelligence 
22nd Tenney, Claudia R 512 CHOB (202) 225-3665  Financial Services 
23rd Reed, Tom R 2437 RHOB (202) 225-3161  Ways and Means 
24th Katko, John R 1620 LHOB (202) 225-3701  Homeland Security Transportation and Infrastructure 
25th Slaughter, Louise McIntosh - Vacancy D 2469 RHOB (202) 225-3615  
26th Higgins, Brian D 2459 RHOB (202) 225-3306  Budget Ways and Means 
27th Collins, Chris R 1117 LHOB (202) 225-5265  Energy and Commerce
'''

In [78]:
house_list = house_reps.splitlines()
house_list

['1st Zeldin, Lee R 1517 LHOB (202) 225-3826 Financial Services Foreign Affairs ',
 '2nd King, Pete R 339 CHOB (202) 225-7896  Financial Services Homeland Security Intelligence ',
 '3rd Suozzi, Thomas D 226 CHOB (202) 225-3335  Armed Services Foreign Affairs ',
 "4th Rice, Kathleen D 1508 LHOB (202) 225-5516  Homeland Security Veterans' Affairs ",
 '5th Meeks, Gregory W. D 2234 RHOB (202) 225-3461  Financial Services Foreign Affairs ',
 '6th Meng, Grace D 1317 LHOB (202) 225-2601  Appropriations ',
 '7th Velázquez, Nydia M. D 2302 RHOB (202) 225-2361  Financial Services Natural Resources Small Business ',
 '8th Jeffries, Hakeem D 1607 LHOB (202) 225-5936  Budget Judiciary ',
 '9th Clarke, Yvette D. D 2058 RHOB (202) 225-6231  Energy and Commerce Small Business Ethics ',
 '10th Nadler, Jerrold D 2109 RHOB (202) 225-5635  Judiciary ',
 '11th Donovan, Daniel R 1541 LHOB (202) 225-3371  Foreign Affairs Homeland Security ',
 '12th Maloney, Carolyn D 2308 RHOB (202) 225-7944  Financial Servi

In [83]:
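everything = []
for person in house_list:
    #district: one or two digits plus a two-letter suffix (1st, 22nd)
    dists=re.findall(r"^\d\d?\w\w",person)[0]
    print("this is dists"+" || "+dists)
    #name: everything between the district and the party letter
    name = re.findall(r"\d\d?\w\w (\D+) [DR] \d\d+",person)[0]
    print(name)
    #committees: everything after the last four digits of the phone number
    comm = re.findall(r"-\d{4} (.+)$",person)[0]
    print(comm)
    #collect the three fields for this person
    everything.append({'district':dists,'name':name,'committees':comm})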

this is dists || 1st
Zeldin, Lee
Financial Services Foreign Affairs 
this is dists || 2nd
King, Pete
 Financial Services Homeland Security Intelligence 
this is dists || 3rd
Suozzi, Thomas
 Armed Services Foreign Affairs 
this is dists || 4th
Rice, Kathleen
 Homeland Security Veterans' Affairs 
this is dists || 5th
Meeks, Gregory W.
 Financial Services Foreign Affairs 
this is dists || 6th
Meng, Grace
 Appropriations 
this is dists || 7th
Velázquez, Nydia M.
 Financial Services Natural Resources Small Business 
this is dists || 8th
Jeffries, Hakeem
 Budget Judiciary 
this is dists || 9th
Clarke, Yvette D.
 Energy and Commerce Small Business Ethics 
this is dists || 10th
Nadler, Jerrold
 Judiciary 
this is dists || 11th
Donovan, Daniel
 Foreign Affairs Homeland Security 
this is dists || 12th
Maloney, Carolyn
 Financial Services Oversight and Government Reform 
this is dists || 13th
Espaillat, Adriano
 Education and the Workforce Foreign Affairs Small Business 
this is dists || 14th
Cro

In [90]:
#The multiline flag re.M makes ^ (and $) match at the start (and end) of every line
#in a string, not just at the start and end of the whole string.
dists = re.findall(r"^\d\d?\w+",house_reps,re.M)
dists 

['1st',
 '2nd',
 '3rd',
 '4th',
 '5th',
 '6th',
 '7th',
 '8th',
 '9th',
 '10th',
 '11th',
 '12th',
 '13th',
 '14th',
 '15th',
 '16th',
 '17th',
 '18th',
 '19th',
 '20th',
 '21st',
 '22nd',
 '23rd',
 '24th',
 '25th',
 '26th',
 '27th']
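
A minimal sketch of what the flag changes, using two shortened lines from the `house_reps` data above (the variable names here are assumptions for illustration):

```python
import re

# Two shortened lines from the house_reps string
two_lines = "1st Zeldin, Lee R 1517 LHOB\n2nd King, Pete R 339 CHOB"

# Without re.M, ^ matches only at the very start of the whole string
start_only = re.findall(r"^\d\d?\w+", two_lines)

# With re.M, ^ also matches right after each newline
each_line = re.findall(r"^\d\d?\w+", two_lines, re.M)

print(start_only)  # ['1st']
print(each_line)   # ['1st', '2nd']
```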

In [93]:
#different searching goodies in here!

#[re.findall(r"^\d+",line) for line in house_list]
#[re.findall(r"(^\d+[nrst][dht])",line) for line in house_list]
#[re.findall(r" [A-Z][\w]+, [A-Z][\w]+",line) for line in house_list]
#[re.findall(r"^\d+[nrst][dht] [A-Z][\w]+,",line) for line in house_list]
#[re.findall(r"[(]\d+[)][ 0-9\-]+",line) for line in house_list]
#[re.findall(r"[(]\d+[)] \d{3}-\d{4}",line) for line in house_list]
#[re.findall(r"[(]\d+[)] \d+-*\d+",line) for line in house_list]
#[re.findall(r"\D+$",line) for line in house_list]
#[re.findall(r" [DR] \d",line) for line in house_list]
#[re.findall(r", ([A-Z]\w+) ([A-Z]\w+ )*([A-Z][.])*",line) for line in house_list]
[re.findall(r"\d\d?\w\w (\D+) [DR] \d\d+",line) for line in house_list]

#sample line: 16th Engel, Eliot D 2462 RHOB (202) 225-2464  Foreign Affairs Energy and Commerce

[['Zeldin, Lee'],
 ['King, Pete'],
 ['Suozzi, Thomas'],
 ['Rice, Kathleen'],
 ['Meeks, Gregory W.'],
 ['Meng, Grace'],
 ['Velázquez, Nydia M.'],
 ['Jeffries, Hakeem'],
 ['Clarke, Yvette D.'],
 ['Nadler, Jerrold'],
 ['Donovan, Daniel'],
 ['Maloney, Carolyn'],
 ['Espaillat, Adriano'],
 ['Crowley, Joseph'],
 ['Serrano, José E.'],
 ['Engel, Eliot'],
 ['Lowey, Nita'],
 ['Maloney, Sean Patrick'],
 ['Faso, John'],
 ['Tonko, Paul D.'],
 ['Stefanik, Elise'],
 ['Tenney, Claudia'],
 ['Reed, Tom'],
 ['Katko, John'],
 ['Slaughter, Louise McIntosh - Vacancy'],
 ['Higgins, Brian'],
 ['Collins, Chris']]
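
The separate searches above could also be combined into a single pattern with one named group per data type. This is only a sketch under the assumption that every line follows the same layout as the sample; the group names (`district`, `name`, `party`, and so on) are made up here. Committee names are left as one string because nothing in the line marks where one committee ends and the next begins.

```python
import re

line = "5th Meeks, Gregory W. D 2234 RHOB (202) 225-3461  Financial Services Foreign Affairs "

# One named group per field; (?P<name>...) lets us look results up by name.
rep_pattern = re.compile(
    r"^(?P<district>\d+\w\w) "               # 5th
    r"(?P<name>\D+) "                        # Meeks, Gregory W.
    r"(?P<party>[DR]) "                      # D or R
    r"(?P<room>\d+) (?P<building>[A-Z]+) "   # 2234 RHOB
    r"(?P<phone>\(\d{3}\) \d{3}-\d{4})\s*"   # (202) 225-3461
    r"(?P<committees>.*)$"                   # the rest of the line
)

match = rep_pattern.search(line)
# groupdict() returns {group name: matched text}; strip stray spaces
rep = {key: value.strip() for key, value in match.groupdict().items()}
print(rep)
```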

**Phrase search and (?= ) (lookahead)**

Lookaheads, lookbehinds, and their negative versions are more advanced, and you can learn them when you need them. Basically, they check whether a pattern occurs (or does not occur) ahead of or behind the current position without consuming any characters. Here is a very simple example. This is a decent explanation online: https://www.rexegg.com/regex-lookarounds.html

In [107]:
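#regular search: each match starts where the last one ended, so pairs never overlap
phrases = re.findall(r"\w{2}","hello I am a string") 
#lookahead with a group: the match itself is empty, so the search moves ahead
#only one character at a time and the captured pairs overlap (shown below)
phrases = re.findall(r"(?=(\w\w))","hello I am a string") 
phrases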

['he', 'el', 'll', 'lo', 'am', 'st', 'tr', 'ri', 'in', 'ng']
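
The cell above shows a lookahead. For the lookbehind side, here is a minimal sketch on a made-up sentence: `(?<=...)` requires something just before the match, and `(?<!...)` requires that it not be there.

```python
import re

# A made-up sentence for illustration
prices = "Tickets cost $15 for 2 people, or $40 for 6."

# Positive lookbehind: digits that come right after a dollar sign
dollars = re.findall(r"(?<=\$)\d+", prices)

# Negative lookbehind: whole numbers that do NOT come right after a dollar sign
# (\b stops the search from starting in the middle of a number like 15)
counts = re.findall(r"(?<!\$)\b\d+", prices)

print(dollars)  # ['15', '40']
print(counts)   # ['2', '6']
```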

In [109]:
speech = '''Tomorrow, and tomorrow, and tomorrow,
Creeps in this petty pace from day to day,
To the last syllable of recorded time;
And all our yesterdays have lighted fools
The way to dusty death. Out, out, brief candle!
Life's but a walking shadow, a poor player,
That struts and frets his hour upon the stage,
And then is heard no more. It is a tale
Told by an idiot, full of sound and fury,
Signifying nothing.'''

Below is a search for three-word phrases that begin with a two-letter word. See the difference between the lookahead version and the regular version:

With the lookahead, in the line `And then is heard no more. It is a tale` it finds all of the phrases that begin with a two-letter word, even those that overlap:
```
'is heard no',
 'no more. It',
 'It is a',
 'is a tale',
```
Without the lookahead, once it finds a match it resumes searching after the end of that match, so it skips any two-letter words that are already included in a result:

```
'is heard no',
 'It is a',
```

In [114]:
# three-word phrases that begin with two-letter words
phrases = re.findall(r"\b\w{2}\W+\w+\W+\w+",speech)
#overlapping
phrases = re.findall(r"(?=(\b\w{2}\W+\w+\W+\w+))",speech) 


#groups--the same two searches, but using groups to isolate each word.
phrases = re.findall(r"(\b\w{2})\W+(\w+)\W+(\w+)",speech)
phrases = re.findall(r"(?=(\b\w{2})\W+(\w+)\W+(\w+))",speech)

phrases

[('in', 'this', 'petty'),
 ('to', 'day', 'To'),
 ('To', 'the', 'last'),
 ('of', 'recorded', 'time'),
 ('to', 'dusty', 'death'),
 ('is', 'heard', 'no'),
 ('no', 'more', 'It'),
 ('It', 'is', 'a'),
 ('is', 'a', 'tale'),
 ('by', 'an', 'idiot'),
 ('an', 'idiot', 'full'),
 ('of', 'sound', 'and')]

**Splitting a sentence**

This is tricky, and this isn't even the best or most robust regular expression, but it does work on an annoying group of sentences like this. Depending on the kind of text you are parsing, it can be virtually impossible to split by sentence with 100% accuracy.

In [116]:
to_sentence = '''
Ms. Smith bought cheapsite.com for 1.5 million 
dollars, i.e. he paid a lot for it. Did he 
mind? Adam Jones Jr. thinks he didn't want to. In any 
case, this isn't true... Well, with a 
probability of .9 it isn't. Right?! My name is Jon. Mr. Comey of 
the F. B. I. thinks not. 
'''

In [124]:
#Write a regex that accurately splits 
#the paragraph above into sentences

#this works ok!!
#The group looks for:
#     any three characters that are NOT capital letters, followed by . or ? or ! (or a quotation mark)
#           (this will not work for something that's written in all caps)
#           (this will not even work for a sentence ending: "said Jon.")
#The lookahead then checks, from the end of that group, for:
#     one or more spaces and a capital letter to start the next sentence.
#     Because it is a lookahead, that second part is not captured or consumed.
#split() splits by that first pattern as long as the lookahead is true. 

#So we get a list 'sents' that alternates between a chunk of text 
#and the captured sentence ending (three characters plus the punctuation mark).

sents = re.split(r"([^A-Z]{3}[.?!\"\'])(?=\s+[A-Z])",to_sentence)
sents



['\nMs. Smith bought cheapsite.com for 1.5 million \ndollars, i.e. he paid a lot for',
 ' it.',
 ' Did he \nm',
 'ind?',
 " Adam Jones Jr. thinks he didn't want",
 ' to.',
 " In any \ncase, this isn't tru",
 'e...',
 ' Well, with a \nprobability of .9 it is',
 "n't.",
 ' Rig',
 'ht?!',
 ' My name is Jon. Mr. Comey of \nthe F. B. I. thinks not. \n']

In [125]:
#Here we join each chunk of text with the sentence ending that follows it
#to re-combine the full sentences. The last element has no ending after it,
#so it is appended on its own.

join_sents = [sents[x] + sents[x+1] for x in range(0,len(sents)-2,2)]
join_sents.append(sents[-1])
join_sents

['\nMs. Smith bought cheapsite.com for 1.5 million \ndollars, i.e. he paid a lot for it.',
 ' Did he \nmind?',
 " Adam Jones Jr. thinks he didn't want to.",
 " In any \ncase, this isn't true...",
 " Well, with a \nprobability of .9 it isn't.",
 ' Right?!',
 ' My name is Jon. Mr. Comey of \nthe F. B. I. thinks not. \n']
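
One possible way to avoid the re-joining step is to put the sentence ending in a lookbehind as well, so the split happens at a zero-width position and nothing is cut off. This is a sketch of the same heuristic, not a more robust one, and it assumes Python 3.7 or later (where `re.split()` can split on empty matches). The names `boundary`, `sentences`, and `cleaned` are assumptions for illustration.

```python
import re

to_sentence = (
    "\nMs. Smith bought cheapsite.com for 1.5 million \n"
    "dollars, i.e. he paid a lot for it. Did he \n"
    "mind? Adam Jones Jr. thinks he didn't want to. In any \n"
    "case, this isn't true... Well, with a \n"
    "probability of .9 it isn't. Right?! My name is Jon. Mr. Comey of \n"
    "the F. B. I. thinks not. \n"
)

# (?<=...) looks behind for the sentence ending (fixed width: 3 chars + 1 mark)
# (?=...)  looks ahead for whitespace and a capital letter
# Neither part consumes text, so split() cuts between sentences without losing anything.
boundary = r"(?<=[^A-Z]{3}[.?!\"\'])(?=\s+[A-Z])"
sentences = re.split(boundary, to_sentence)

# Collapse line breaks and extra spaces inside each sentence
cleaned = [" ".join(s.split()) for s in sentences]
for s in cleaned:
    print(s)
```

Like the original pattern, this still fails to split after "My name is Jon." because of the capital letter in the three characters before the period.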

**More problems with text: The Waste Land**

In [121]:
with open('wasteland.txt', 'r', encoding='utf8') as f:
    wasteland = f.read()

In [21]:
#(?!...) is a negative lookahead: it matches only where the pattern does NOT come next
# Homework example of 2 occurrences "ow"
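
A minimal sketch of a negative lookahead, using a line adapted from the speech above: it keeps only the words that do not contain "ow". This is an illustration of `(?!...)`, not the homework solution, and the variable names are assumptions.

```python
import re

line = "Tomorrow, and tomorrow, creeps in this petty pace"

# \b         start at a word boundary
# (?!\w*ow)  continue only if the letters ahead do NOT contain "ow"
# \w+        then match the word itself
no_ow = re.findall(r"\b(?!\w*ow)\w+", line)
print(no_ow)  # ['and', 'creeps', 'in', 'this', 'petty', 'pace']
```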

**Sometimes phrases are more useful than word searches**

In [None]:
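#three-word phrases that begin with a two-letter word
phrases = re.findall(r"\b\w{2}\W+\w+\W+\w+",wasteland)
#three-word phrases that begin with "of the" (this replaces the result above)
phrases = re.findall(r"\bof\W+the\W+\w+",wasteland,re.IGNORECASE)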
phrases

**Writing a Most Frequent Words script**

In [23]:
waste_words = wasteland.lower()

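#get a list of words
#waste_words1 = re.split(r"\W+",waste_words)
#a word, optionally followed by 't or 's (isn't, life's);
#keeping the apostrophe inside the optional group avoids picking up a stray closing quote
waste_words2 = re.findall(r"\b\w+(?:'[ts])?\b",waste_words)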

In [None]:
waste_words2

In [None]:
#sort words alphabetically
sortwords = waste_words2.copy()
sortwords.sort()
sortwords

In [None]:
#the original list keeps its order, because we sorted a copy
waste_words2

In [27]:
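#Loop through the alphabetically arranged list
#count each instance of the word and make a dictionary
all_words = []
counter = 0
#start from the first word in the sorted list (assumes the list is not empty)
this_word = sortwords[0]
for word in sortwords:
    if word != this_word:
        all_words.append({'word':this_word,'count':counter})
        counter = 1
        this_word = word
    else:
        counter +=1
#the loop only records a word once the NEXT word appears, so add the last word here
all_words.append({'word':this_word,'count':counter})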

In [None]:
all_words

In [None]:
#sort the list of dictionaries by the value of each one's 'count' key
order_words = sorted(all_words, key=lambda d: d['count'], reverse=True)
order_words

In [30]:
#more efficient dictionary way, as a function:

def word_freq(my_text):
    word_dict = {}
    words = re.findall(r"\b\w+\b", my_text.lower())
    for word in words:
        if word in word_dict:
            word_dict[word] += 1
        else:
            word_dict[word] = 1
    return dict(sorted(word_dict.items(), key=lambda item: item[1], reverse=True))



In [None]:
word_count = word_freq(waste_words)
word_count

In [32]:
#Or use the Built-in version!!!!!
from collections import Counter
wordcount = Counter(waste_words2)

In [None]:
wordcount.most_common(10)