# Python Regex Cheatsheet

<https://www.debuggex.com/cheatsheet/regex/python>

# Regexp use from Python

## Defining regexps
You then define a regexp as a string,
as in:

    s = 'abc.def'

It’s important to point out that because all regexps in Python are first created
as strings, the Python parser may handle some regexps differently than you
might expect. For example, let’s say that your regexp is looking for the string
abc as a word on its own. You would likely want to use the \b (word boundary)
metacharacter to indicate this in your regexp, as follows:

    s = '\babc\b'

However, this will fail. That’s because \b is treated by Python’s string
parser as a special character (ASCII 8, or backspace). The regexp engine
will thus think that it’s to look for the backspace character, rather than the \b
metacharacter. The same is true if you use backreferences, which uses backslashes
followed by numbers, such as above.
This isn’t a legal character in a Python string, and you’ll get an error message
from Python.
In both of these cases, what you need to do is double your backslash, as
follows:

    s = '\\babc\\b' # doubled backslashes

If this gets annoying, then you can always use a “raw string” – just put an r
before the opening quote of a Python string, and the backslashes are automatically
doubled. You can think of a raw string as a way to tell Python that you
want the string to be precisely as you entered it:

    s = r'\babc\b' # raw string

## Finding one regexp

Once you have created a regexp string, you can then search for it inside of text.
Python provides you with two basic ways to search inside of text with regexps:
You can either search for a single occurrence, or for all of the occurrences.

To search for a single occurrence of your regexp within a string, you’ll
use the re.match or re.search functions. Both of them work in precisely
the same way, except that re.match automatically anchors your regexp to the
start of the screen.

Some examples:


In [23]:
import re
text = 'hello, world'
re.match('hello', text) # Find "hello" at the start of text
re.search('hello', text) # Find "hello" anywhere in text

<_sre.SRE_Match object; span=(0, 5), match='hello'>

Both re.search and re.match return either None (if no match was found)
or a “match object” if one was. A match object, traditionally named m, has a
number of useful attributes, the most popular of which is m.group(0). This
asks Python to display the entire string that the regexp matched. If there were
any groups within the regexp, then you can retrieve the individual groups with
m.group and then passing the group number.

In [24]:
text = 'hello, world'
m = re.search(r'\b(h.)(..o)\b', text)
if m:
    print("Full match: {}".format(m.group(0))) # hello
    print("First part: {}".format(m.group(1))) # he
    print("Last part: {}".format(m.group(2))) # llo

Full match: hello
First part: he
Last part: llo


## Finding more than one
To search for multiple occurences within a string, use re.findall. This function
also takes a regexp string and a text string, but is guaranteed to return a
Python list, with all of the matches for your regexp. If there were no matches,
then it returns an empty list. Note that if your regexp includes groups (i.e.,
parentheses), then re.findall returns a list of matches for your group (if
there was one group) or a list of tuples (if there were multiple groups).

For example:


In [25]:
# Find all matches of "hello" in book
text = 'hello, world and hello, trees!'
re.findall('hello', text) # ['hello', 'hello']

['hello', 'hello']

In [26]:
# Find "h", three characters, and then o -- and match the three
# inner characters. Result is a list of those three characters
re.findall('h(...)o', text) # ['ell', 'ell']

['ell', 'ell']

In [27]:
# Find all words start with h and ending with o.
# Put the first two characters in a group, and the final three
# characters in a separate group. Return a list of two-element
# tuples, one with "h." and the other with "..o"
re.findall(r'\b(h.)(..o)\b', text) # [('he', 'llo'), ('he', 'llo')]

[('he', 'llo'), ('he', 'llo')]

If you expect to find a large number of matches, then you might want
to use re.finditer rather than re.findall. The only difference is that
re.finditer is an iterator, so it won’t consume large amounts of memory.
re.findall, by contrast, will return a list of all matches, which might be
quite long.

# Simple regexps

## Five-letter words
Display words in the dictionary (which contains each word in a line) that are either four
letters long, or that are five letters long if they end with an s. The word – not
just a subset of the word – should be precisely four or five letters long.

For the purposes of this exercise, any character (not just a letter) can be
counted in the first four letters of the word. However, if there is a fifth letter, it
must be an s.

In [28]:
import re

# ro = re.compile('....s?') # will find words with at least 4 letters
ro = re.compile('^....s?$')

for line in open('words.txt'):
    if ro.search(line):
        print(line.rstrip("\n"))

Aani
Aaru
abac
abas
Abba
abbas
Abby
abed
Abel
abet
abey
Abie
Abies
abir
able
ably
abox
Abrus
Absi
abut
abyss
acca
Acer
ache
achy
acid
Acis
acle
aclys
acme
acne
acor
acre
acta
Acts
actu
acyl
Adad
adad
Adai
Adam
Adar
adat
adaw
aday
adays
Adda
adda
Addu
Addy
adet
Adib
Adin
adit
admi
adry
adze
Aedes
aegis
aeon
aero
aery
Afar
afar
affa
affy
Agag
agal
Agao
agar
Agau
Agaz
aged
agee
agen
ager
agha
Agib
agio
agla
Agnes
agnus
agog
agon
Agra
agre
agua
ague
ahem
Ahet
ahey
Ahir
Ahom
ahoy
ahum
Aias
aide
Aides
aiel
aile
aint
Ainu
aion
Aira
aire
airt
airy
ajar
ajog
Akal
Akan
akee
akey
Akha
akia
Akim
akin
Akka
akov
Akra
akra
Alan
alan
Alans
alar
alas
alba
albe
Albi
albus
Alca
Alces
alco
Aldus
Alea
alec
alee
alef
alem
alen
Alex
alfa
alga
Algy
alias
Alids
alif
alin
alit
Alix
Alkes
alky
Alle
Ally
ally
Alma
alma
alme
alms
Alnus
alod
aloe
Alois
alop
alow
also
alto
alum
Alur
alvus
Alya
amaas
amah
amar
amass
amba
ambo
ambos
Amen
amen
Amex
Amia
amic
amid
amil
amin
Amir
amir
amiss
amla
amli
Amma
amma
Ammi
ammo


### Illustration
There are two parts to this exercise. First of all, we need to create a regexp that
will match four letter words and five-letter words ending with s. Another way
of thinking about this is to say that we want to find four characters, followed
by an optional s. In regexps, we can use the ? metacharacter to indicate that
the preceding character is optional. Our regexp will thus be:

    ....s?

In other words, four characters that are not newlines (represented by .), and
then an optional s.

However, if we were merely to search for this regexp in each line of the
dictionary, we would find that many longer words would match, as well. That’s
because the regexp, left as it is above, will match any word with four or more
letters in it.

We have several ways to deal with this problem. One is to use anchors to
connect the regexp to the start and end of the line. For example:

    ^....s?$

The ˆ anchors the regexp to the front of the line, and the $ anchors it to the
end of the line. That’s probably the best way to go about this, I’d say.
Another solution is to use the programming language’s string-length function
to determine whether the word is either four or five characters in length,
and then fits our criteria.


## Double “f” in the middle
Find all of the words in the dictionary that contain a “ff” in them, so long as those f’s are not the first or final characters in the world. Thus, “affable” would be fine, but “quaff” would not.

In [1]:
import re

ro = re.compile('.ff.')

for line in open('words.txt'):
    if ro.search(line):
        print(line.rstrip())

acecaffine
affa
affability
affable
affableness
affably
affabrous
affair
affaite
affect
affectable
affectate
affectation
affectationist
affected
affectedly
affectedness
affecter
affectibility
affectible
affecting
affectingly
affection
affectional
affectionally
affectionate
affectionately
affectionateness
affectioned
affectious
affective
affectively
affectivity
affeer
affeerer
affeerment
affeir
affenpinscher
affenspalte
afferent
affettuoso
affiance
affiancer
affiant
affidation
affidavit
affidavy
affiliable
affiliate
affiliation
affinal
affination
affine
affined
affinely
affinitative
affinitatively
affinite
affinition
affinitive
affinity
affirm
affirmable
affirmably
affirmance
affirmant
affirmation
affirmative
affirmatively
affirmatory
affirmer
affirmingly
affix
affixal
affixation
affixer
affixion
affixture
afflation
afflatus
afflict
afflicted
afflictedness
afflicter
afflicting
afflictingly
affliction
afflictionless
afflictive
afflictively
affluence
affluent
affluently
affluentness
afflux

### Illustration

We know that the regexp will need to include ff inside of it. But if we use the
simple regexp

    ff

then we are telling the regexp engine that it’s OK to find ff anywhere in
our word, including the start or the finish. We could thus start to use all sorts of
metacharacters, to ensure that we have at least one character before and after
the ff. For example:

    .+ff.+

The above says that there can be any number of characters before and after
the ff. But if we think about it for a moment, all we care about is having at
least one character before and after the ff. We don’t care about anything else
in the string. We can thus whittle our regexp down to a more minimal version:

    .ff.


## Extract timestamp
It’s common to use regular expressions to extract information from logfiles. In the access-log.txt file that comes with this book, each HTTP request is accompanied by a timestamp, consisting of a date and time.

In this exercise, you must match and retrieve the entire timestamp from each line, starting with [ and ending with ]. For the purposes of this exercise, you cannot assume that this will be the only pair of [ and ] in the logfile, so you cannot use a regexp such as:
\[[^]]\]
which would mean, “start with [, end with ], and take everything in the middle.” You’ll need to specify the regexp more explicitly and carefully than that.
For example, the first line of access-log.txt contains the following timestamp:

    [30/Jan/2010:00:03:18 +0200]

In [30]:
import re

filename = 'access-log.txt'
# Example: [30/Jan/2010:00:03:18 +0200]
ro = re.compile('\[\d{2}/\w{3}/\d{4}:\d{2}:\d{2}:\d{2} \+\d{4}\]')

for line in open(filename):
    if ro.search(line):
        print(line)

67.218.116.165 - - [30/Jan/2010:00:03:18 +0200] "GET /robots.txt HTTP/1.0" 200 99 "-" "Mozilla/5.0 (Twiceler-0.9 http://www.cuil.com/twiceler/robot.html)"

66.249.71.65 - - [30/Jan/2010:00:12:06 +0200] "GET /browse/one_node/1557 HTTP/1.1" 200 39208 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"

65.55.106.183 - - [30/Jan/2010:01:29:23 +0200] "GET /robots.txt HTTP/1.1" 200 99 "-" "msnbot/2.0b (+http://search.msn.com/msnbot.htm)"

65.55.106.183 - - [30/Jan/2010:01:30:06 +0200] "GET /browse/one_model/2162 HTTP/1.1" 200 2181 "-" "msnbot/2.0b (+http://search.msn.com/msnbot.htm)"

66.249.71.65 - - [30/Jan/2010:02:07:14 +0200] "GET /browse/browse_applet_tab/2593 HTTP/1.1" 200 10305 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"

66.249.71.65 - - [30/Jan/2010:02:10:39 +0200] "GET /browse/browse_files_tab/2499?tab=true HTTP/1.1" 200 446 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"

66.249.65.12 - -

### Illustration

I’m going to use the built-in character classes \d (any digit) and \w (any
letter or number), as well as the {min,max} way of indicating how many characters
we want and the + metacharacter, which allows us to indicate that we
want one or more of the preceding character:
     
     '\[\d{2}/\w{3}/\d{4}:\d{2}:\d{2}:\d{2} \+\d{4}\]'

The above basically says that we want:
* literal opening [, so we precede it with a \
* two digits (date), followed by a slash
* three letters/numbers (month), followed by a slash
* four digits (year), followed by a colon
* two digits (hour), followed by a colon
* two digits (minute), followed by a colon
* two digits (seconds)
* space
* a literal +, so we add a \
* four digits (time zone)
* a literal closing ], so we precede it with a \


# Character classes

## End-of-sentence words
In Alice in Wonderland, find all of the words that are at the end of a sentence. In other words, find and display all of the words that end with ., ?, or !. You should display the punctuation mark along with the word. For the purposes of this exercise, a word is any string of alphanumeric characters at least two characters long

In [31]:
import re

filename = 'alice.txt'
ro = re.compile('\w{2,}[.?!]')

for line in open(filename):
    m = ro.findall(line) # use findall; maybe more than one sentence per line 
    if m:
        print(m)

['whatsoever.']
['www.', 'gutenberg.']
['www.', 'pgdp.']
['Duchess.']
['do.']
['conversations?']
['her.']
['dear!']
['dear!', 'late!']
['hedge.', 'it!']
['well.']
['her.']
['pegs.']
['passed.']
['it.']
['down!', 'end?']
['herself.']
['think!', 'cat.']
['time.']
['me!']
['thump!', 'thump!']
['over.']
['moment.']
['it.']
['lost.']
['getting!']
['seen.']
['roof.']
['again.']
['glass.']
['alas!']
['them.']
['high.']
['fitted!']
['saw.']
['doorway.', 'telescope!']
['begin.']
['telescopes.']
['letters.']
['later.']
['off.']
['feeling!', 'Alice.']
['telescope!']
['indeed!']
['garden.']
['Alice!']
['cried.']
['that!']
['sharply.', 'minute!']
['eyes.']
['currants.']
['happens!']
['way?']
['way?']
['size.', 'cake.']
['curiouser!']
['was!', 'feet!']
['dears?']
['you.']
['door.']
['Alice!']
['ever.', 'again.']
['hall.']
['coming.']
['other.']
['Oh!', 'Duchess!', 'Oh!']
['waiting!']
['go.']
['talking.', 'dear!', 'day!']
['usual.']
['morning?']
['puzzle!']
['talking.', 'that?', 'thought.']
['again.'

### Illustration

This is a classic case of using character classes. First of all, we’re looking
for three specific characters (., ?, and !). This means that we can define the
character class [.?!]. This might lead us to think that the regexp we want is:

    .[.?!]

But there are three problems with the above: First of all, it doesn’t restrict
the character before the punctuation mark to be alphanumeric. Secondly, it
only captures a single character, rather than the entire word. Thirdly, the specifications
indicate that our word must be at least two characters long.

We can solve all of these problems together by using the built-in \w character
class, which is the same as 

    [A-Za-z0-9_]. 

We can then indicate that we want a minimum of two such characters by using the {min,max} specifier.
Our final regexp thus looks like this:

    '\w{2,}[.?!]'

Note that because more than one sentence might appear on a single line of
text, we’ll need to use the functionality that finds all matches, rather than just
the first one on a line.


## Hex numbers
Given the following sentence:

    I like the hex numbers 0xfa and 0X123 and 0xcab and 0xff

retrieve all of the hexadecimal numbers. That is, it starts with 0x (or 0X),
then has a string of digits or the letters a through f, capital or lowercase.

In [32]:
import re

s = 'I like the hex numbers 0xfa and 0x123 and 0xcab and 0xff'

ro = re.compile('0[xX][A-Fa-f\d]+')

print(ro.findall(s))

['0xfa', '0x123', '0xcab', '0xff']


### Illustration

We cannot use the built-in \w character class here, because we want a more
restricted set of characters. So our character class will look like [A-Fa-f]..
However, we also want to allow for numeric digits, so we’ll add \d to our
custom class. We want any number of these following 0x, which means that
our final regexp will be:

    0[xX][A-Fa-f\d]+


## Hexwords
Which words in the dictionary only the letters a through f?

In [33]:
import re

filename = 'words.txt'
ro = re.compile('^[a-f]+$')

for line in open(filename):
    if ro.search(line):
        print(line.rstrip())

a
aa
aba
abac
abaca
abaff
abb
abed
acca
accede
ace
ad
adad
add
adda
added
ade
adead
ae
aface
affa
b
ba
baa
baba
babe
bac
bacaba
bacca
baccae
bad
bade
bae
baff
be
bead
beaded
bebed
bed
bedad
bedded
bedead
bedeaf
bee
beef
c
ca
cab
caba
cabda
cad
cade
caeca
caffa
ce
cede
cee
d
da
dab
dabb
dabba
dace
dad
dada
dade
dae
daff
de
dead
deaf
deb
decad
decade
dee
deed
deedeed
deface
e
ea
ebb
ecad
edea
efface
f
fa
facade
face
faced
fad
fade
faded
fae
faff
fe
fed
fee
feed


The solution to this exercise is a regexp that is anchored to the start and end of
a word, and contains a character class with the letters a through f:

    ^[a-f]+$

Notice the +, which indicates that the word might be more than one character
long. Forget to add that, and you’ll end up matching a much smaller set of
words!

Failing to anchor the word to the start and end with ˆ and $ will have the
result of finding words in which at least one character is from the set [a-f],
but other letters might not be.

## IP addresses
Each line of access-log.txt starts with an IP address. Each IP address has four numbers, each containing between one and three digits. The numbers are separated by periods (.).

In this exercise, you are to retrieve the IP addresses from access-log.txt.

In [34]:
import re

filename = 'access-log.txt'
ro = re.compile('^(\d{1,3}\.){3}\d{1,3}')

for line in open(filename):
    m = ro.search(line)
    if m:
        print(m.group(0)) # entire string that the regexp matched

67.218.116.165
66.249.71.65
65.55.106.183
65.55.106.183
66.249.71.65
66.249.71.65
66.249.65.12
66.249.65.12
66.249.65.12
66.249.65.12
66.249.65.12
65.55.106.131
65.55.106.131
66.249.65.12
66.249.65.12
66.249.65.12
66.249.65.12
66.249.65.12
66.249.65.12
66.249.65.12
65.55.106.186
65.55.106.186
66.249.65.12
66.249.65.12
66.249.65.12
74.52.245.146
74.52.245.146
66.249.65.43
66.249.65.43
66.249.65.43
66.249.65.12
66.249.65.12
66.249.65.12
66.249.65.12
66.249.65.12
66.249.65.12
65.55.207.25
65.55.207.25
66.249.65.12
66.249.65.12
66.249.65.12
66.249.65.12
66.249.65.12
66.249.65.12
66.249.65.12
65.55.207.94
65.55.207.94
66.249.65.12
65.55.207.71
66.249.65.12
66.249.65.12
66.249.65.12
98.242.170.241
66.249.65.38
66.249.65.38
66.249.65.38
66.249.65.38
66.249.65.38
66.249.65.38
66.249.65.38
66.249.65.38
66.249.65.38
66.249.65.38
66.249.65.38
66.249.65.38
66.249.65.38
66.249.65.38
66.249.65.38
66.249.65.38
66.249.65.38
66.249.65.38
66.249.65.38
66.249.65.38
66.249.65.38
66.249.65.38
66.249.65.38


### Illustration

If I were only interested in four character separated by periods, I coul use a
generic regexp, such as:

    \w\.\w\.\w\.\w

Notice how we need to use \., and not just .. That’s because we don’t
want to use the . metacharacter here, but rather a literal . character. To do
that, we need to use \..

But the above regexp doesn’t do what we want, in two different ways: First
of all, it captures only one \w, when we want to have between one and three.
Beyond that, we actually want to have digits (\d), not alphanumeric characters
(\w). So we can rewrite the regexp as follows:

    \d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}

The above will work, and isn’t a bad way to go about things. But we can
do one better, albeit using a more advanced technique of grouping: We can
notice that there is a pattern that repeats three times, and can then put that in
parentheses, and indicate it should happen three times:

    (\d{1,3}\.){3}\d{1,3}

In other words: We want to have 1-3 digits, followed by ., three times.
Then, we want to have 1-3 digits.

Finally, let’s ensure that we only find an IP address that is the first thing on
its line, by adding ˆ to the front:

    ^(\d{1,3}\.){3}\d{1,3}

Notice that this now means we’ve introduced a group to our regexp, via the
parentheses. In some languages and environments, this will change the way in
which we receive output.



## Long, weird words
Find all of the words in the dictionary that have the following characteristics:
* 10 letters long
* Start with a letter from the first half of the alphabet (a-m)
* End with a letter from the second half of the alphabet (n-z)
* Somewhere in the middle, there should be a “p”

In [35]:
import re

filename = 'words.txt'
ro = re.compile('^[a-m][a-z]*p[a-z]*[n-z]$')

for line in open(filename):
    if len(line) == 10 and ro.search(line):
        print(line.rstrip())

abruption
acalephan
acapsular
acceptant
acception
accipient
accompany
acephalan
acephalus
acropathy
acrophony
acropolis
adderspit
addleplot
ademption
adephagan
adeptness
adeptship
adiposity
adoptedly
adoptious
aeropathy
aerophagy
aeroscopy
afterpain
afterpart
afterpast
afterplay
aglyphous
alephzero
allopathy
allopatry
alloplast
allotropy
alopecist
alpasotes
alpenglow
alpenhorn
altiplano
amidships
amorphous
amphibian
amphibion
amphiboly
amphigean
amphigony
amphigory
amphilogy
amphioxus
amphitoky
amphorous
ampleness
amplifier
ampullary
amputator
amylopsin
anaglyphy
analepsis
ananaplas
ananaples
anaplasis
anaplasty
anapsidan
anaptyxis
anaspalin
antipathy
antiphony
antipodes
antipolar
antispast
antitropy
apartment
apartness
apertness
apetalous
apheresis
aphidious
aphidozer
aphlaston
aphnology
aphorizer
aphyllous
apiaceous
apishness
apivorous
apocalypt
apocenter
apodeixis
apogamous
apogenous
apolarity
apologist
apolousis
apophasis
apophysis
aposaturn
apostasis
apostaxis
apothesis
appalment


### Illustration

Our regular expression is basically defined by the specification here. Let’s start
with the fact that it must start with a letter from the character class [a-m],
and end with a letter from the character class n-z. If that, plus the need for
the word to be 10 characters long, were the only requirement, then our regexp
could look like this:

    [a-m].{8}[n-z]
    
Except that this isn’t enough – to begin with, regexps can match anywhere
in the target string. This regexp will thus match 10 characters within a longer
word, as well as a 10-letter word. We can add anchors to ensure that the word
is precisely 10 characters long:

    ^[a-m].{8}[n-z]$

But of course, we still haven’t indicated that there can or should be a letter
p in there somewhere. And that’s where things get a bit complicated.
One way to indicate that a p is in there is to add the following:

    ^[a-m][a-z]*p[a-z]*[n-z]$    

The above tells the regexp engine that we want to start with a character
from [a-m], end with a character from [n-z], and have a p somewhere in the
middle. But what about the length?

So far as I can tell, there isn’t any easy way to handle both specifications at
the same time. The moment that the p could be anywhere inside of that field,
we have lost the ability to specify that “we want eight letters, at least one of
which must be p.” In cases like this, I thus rely on the programming language
I’m using to do some of the checking for me.

We could, instead, check the length with the regexp and look for p inside
of our string using a function or method within our chosen language. But to
me, at least, that doesn’t seem as satisfying – and it’s likely to be less efficient,
as well, since many high-level languages can calculate the length of a string
quickly, but cannot calculate find a substring nearly as fast.

## Matching URLs
Let’s assume that we have defined a string:

    I love to visit https://example.com/foo.html every day!
    More than http://abc-def.co.il/.

Write a regexp that will match both URLs, but not the characters before or after them. Include the /foo.html in the first URL, but not the training period (.) in the second.

In [36]:
import re

s = '''I love to visit https://example.com/foo.html every day!
More than http://abc-def.co.il/.'''

ro = re.compile('https?://[\w./-]+[\w/-]')

print(ro.findall(s))

['https://example.com/foo.html', 'http://abc-def.co.il/']


### Illustration

We often think of URLs are fairly simple. However, matching them can be a
bit tricky, because of several variations in the URLs we see here. For example,
the first begins with https://, and the second begins with http://. The
first ends with a filename (including a “.html” suffix), while the second has a
hostname containing a - character.

Starting from the beginning, we can match the URLs with https?://.
The ? metacharacter indicates that the character preceding it (s) is optional,
and can appear zero or one times. While URLs can start with any number of
different protocol names, this particular exercise only required that we match
http and https at the start.

We then need to match the hostname. We don’t want to match every possible
character, since not all characters are valid in hostnames. I’m going to
assume, for these purposes, that hostnames might contain letters, numbers, underscores,
and dashes. We also need to take into account the periods that will
appear in the URL, And, of course, they might contain periods as well, separating
the host from the domain. (The solution I’m presenting here would also
match illegal URLs, such as those containing two consecutive . characters.)
We can shorten this character class definition by using the built-in \w character
class, which is defined to be the same as [A-Za-z0-9_].

If we want to create a character class that’ll match \w, ., /, and -, then 
the - character will need to be at the start or end of the character class. Otherwise,
it’ll be interpreted as defining a range. Also note that . inside of a character
class is treated literally, not as a metacharacter. We’ll match any number of
these characters, indicated by using a + sign following our character class.

Our URL ends with a repeat of our character class, but without any . inside
(since our URL cannot end with it). This ensures that we won’t match training
punctuation marks.
Given all of this, our regular expression could be:

    https?://[\w./-]+[\w/-]


## Non-zero hours
Once again, it’s time to search for certain patterns in access-log.txt: We
want to find all of the records in which the hour doesn’t begin with a 0. (Remember
that Apache logs, like many other logfiles, operates on a 24-hour clock.
Thus, 11 p.m. is written as 23:00.) Thus, you should not show the records from
00:00 through 09:59, and then show those from 10:00 through 23:59. For the
purposes of this exercise, you may assume that square brackets ([ and ]) only
occur around the timestamp.

In [55]:
import re

filename = 'access-log.txt'
ro = re.compile(r'/20\d\d:[1-9]\d:') # times occurs after the date 
#Example:  [30/Jan/2010:11:48:31 +0200]

for line in open(filename):
    if ro.search(line):
        print(line)

66.249.65.12 - - [30/Jan/2010:10:21:43 +0200] "GET /browse/one_node/2163 HTTP/1.1" 200 4124 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"

66.249.65.12 - - [30/Jan/2010:10:50:35 +0200] "GET /browse/one_node/1193 HTTP/1.1" 200 29168 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"

66.249.65.12 - - [30/Jan/2010:11:19:27 +0200] "GET /browse/one_node/1241 HTTP/1.1" 200 7032 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"

65.55.207.25 - - [30/Jan/2010:11:43:56 +0200] "GET /robots.txt HTTP/1.1" 200 99 "-" "msnbot/2.0b (+http://search.msn.com/msnbot.htm)"

65.55.207.25 - - [30/Jan/2010:11:44:35 +0200] "GET /help HTTP/1.1" 304 - "-" "msnbot/2.0b (+http://search.msn.com/msnbot.htm)"

66.249.65.12 - - [30/Jan/2010:11:48:31 +0200] "GET /browse/download_model/2508 HTTP/1.1" 200 11374 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"

66.249.65.12 - - [30/Jan/2010:12:17:24 +0

### Illustration

What we’re looking for is the hour, which consists of two digits surrounded
by colons (:), in which the first digit is not a zero. That can be expressed as
follows in a regexp:
    
    :[1-9]\d:

Normally, we can use \d to describe a digit. But in the case of the first
digit, we’re willing to have any digit but 0, This means that we can just create
our own, custom character class, setting a range from 1 to 9.

The problem is that while the above regexp will indeed find all of the nonzero
hours, it’ll also find many others. That’s because we might have such
patterns elsewhere in the line, and even elsewhere in the timestamp, thanks to
the fact that we also have two-digit minutes, surrounded by colons.

We’ll thus need to be a bit more specific. One easy way to do this is to
assume that the hour will come after the year, which is a four-digit number
starting with 20. That’s probably enough to find what we need; if you want to
be completely sure, then you can extend the regexp to match the opening [ or
the closing ]. Our regexp thus looks like this:

    /20\d\d:[1-9]\d:


## Quoted txt
In this exercise, we’re going to look for all of the quotations in Alice in Wonderland.
I’m looking for any stretch of text that starts with the double-quote
character (“) and ends with that same character.

I’m going to assume that quotes are never nested, and that there’s no use
of a programmer’s backslash () to escape the double quotes. However, quotes
might extend across more than one line.

In [56]:
import re

filename = 'alice.txt'

ro = re.compile('"[^"]+"')

s = open(filename).read() # whole file read in as a string

for quote in ro.findall(s):
    print(quote)


"STORYLAND"
"and what is the use of a book,"
"without pictures or
conversations?"
"Oh dear! Oh
dear! I shall be too late!"
"ORANGE MARMALADE,"
"Dinah'll miss me
very much to-night, I should think!"
"I hope
they'll remember her saucer of milk at tea-time. Dinah, my dear, I wish
you were down here with me!"
"Oh, my ears and whiskers, how late
it's getting!"
"Oh,"
"how I wish I could shut up like a telescope!
I think I could, if I only knew how to begin."
"which certainly
was not here before,"
"DRINK ME"
"No, I'll look first,"
"and see whether it's marked '_poison_'
or not,"
"poison,"
"poison,"
"What a curious feeling!"
"I must be shutting up like a
telescope!"
"Come, there's no use in crying like that!"
"I advise you to leave off this minute!"
"EAT
ME"
"Well, I'll eat it,"
"and if it makes me grow larger, I can reach the key; and if it
makes me grow smaller, I can creep under the door: so either way I'll
get into the garden, and I don't care which happens!"
"Which way? Which
way?"
"Curio

### Illustration

My solution to this problem is to use the following regexp:

    "[^"]+"

As we can see here, the start and end of the regexp are the double-quote
characters, which must appear at the start and finish of the matched text. Rather
than using a . character to indicate that anything might appear between the
double quotes, I’m just going to accept any character other than a quote quote.

This is a very common paradigm in regexp solutions; I often find myself
wanting to look for everything in a sentence, where “sentence” means, “anything
that isn’t a period ending a sentence.” Rather than create a regexp that
matches what I do want – which can be tricky! – I create a regexp that matches
that description, using the character class [ˆ?!.]. (Note that this can result
in false positives, given that people can use punctuation inside of words and
acronyms. The double quotes are far less likely to result in false positives!)

Now, you might be wondering why I didn’t make this non-greedy:

    "[^"]+?"

Remember that + always matches the maximum number of characters that
it can, whereas +? matches the minimum number of characters that it can. In
this particular case, though, there’s no difference between that minimum and
maximum, because we’ve stated that we want the regexp to match all non-“
characters, followedy by a “ character. There is only one string that will match
that; while it won’t hurt to add the ? to the +, it won’t help, either.

Another important point here is that this regexp won’t work if we read the
file line by line. (If we do that, then we will only see quotes that are on a single
line.) Rather, we’ll need to read the file in as a string, and then find all of the
matches caught by our string.

## Supervocalic
A word is considered “supervocalic” if it contains all five of the English-language
vowels (a, e, i, o, and u). Each letter should appear only once, and in that order.

For this task, you want to find all of the supervocalic words in the dictionary.

In [57]:
import re

filename = 'words.txt'
ro = re.compile('^[^aeiou]*a[^aeiou]*e[^aeiou]*i[^aeiou]*o[^aeiou]*u[^aeiou]*$')

for line in open(filename):
    if ro.search(line):
        print(line.rstrip())

abstemious
abstemiously
abstentious
acheilous
acheirous
acleistous
affectious
annelidous
arsenious
arterious
bacterious
caesious
facetious
facetiously
fracedinous
majestious


### Illustration

Let’s build this regexp up, slowly but surely: First of all, we want the word to
contain the letter a, which can appear anywhere:

    a

However, after a appears once, it may not appear again. So we’ll modify
our regexp to look as follows:

    [^a]*a[^a]*

In this way, we know that a appears only once, with zero or more non-a
characters coming before it. But now, we want to do the same with e, the next
vowel. Let’s do the same thing, indicating that e cannot come before a, and
that it can come at some point after a:

    [^ae]*a[^ae]*e

But of course, this will still match only part of the word. So let’s do two
things: Anchor the word to the regexp and end of the word we’re trying to
match, and ensure that after e we can have characters, but not e again (nor a
again, for that matter:

    ^[^ae]*a[^ae]*e[^ae]$

We can continue with this for some time. The bottom line is that we want
each of the vowels, in turn, with zero or more non-vowel characters coming
between them. Our regexp ends up looking like this:

    ^[^aeiou]*a[^aeiou]*e[^aeiou]*i[^aeiou]*o[^aeiou]*u[^aeiou]*$

This regexp should now match supervocalic words.

## Double triple vowel
In English, doubled vowels are a pretty common occurrence. Tripled vowels,
though, are a pretty rare thing.

Your task is to try to find something even rarer: Words in the dictionary
with two separate sets of triple vowels. (And yes, the dictionary I’ve included
with this book contains 69 such words.)

In [58]:
import re

filename = 'words.txt'
ro = re.compile('[aeiou]{3}.*[aeiou]{3}', re.IGNORECASE)

for line in open(filename):
    if ro.search(line):
        print(line.rstrip())

Actaeaceae
Andreaeaceae
androdioecious
aqueoigneous
beauteous
beauteously
beauteousness
calcareoargillaceous
Chromatioideae
Circaeaceae
dioecious
dioeciously
dioeciousness
Dionaeaceae
Dodonaeaceae
Elaeagnaceae
elaeagnaceous
Elaeocarpaceae
elaeocarpaceous
euouae
Fouquieriaceae
fouquieriaceous
glacioaqueous
Gloiosiphoniaceae
gynodioecious
gynodioeciously
heteroousious
homoiousious
homoousious
igneoaqueous
monoousious
Naiadaceae
naiadaceous
Nectrioidaceae
Nymphaeaceae
nymphaeaceous
ouabaio
Paeoniaceae
Palaeogaea
Palaeogaean
palaeotherioid
Penaeaceae
penaeaceous
Phaeophyceae
phaeophyceous
Phaeosporeae
polygamodioecious
Quaequae
Quiinaceae
quiinaceous
Rhodobacterioideae
Saurauiaceae
Schizaeaceae
schizaeaceous
Sphaerioidaceae
Spiraeaceae
tautoousious
thioantimonious
thioarsenious
Thymelaeaceae
thymelaeaceous
trioecious
trioeciously
tropaeolaceae
tropaeolaceous
ultraoutrageous
unbeauteous
unbeauteously
unbeauteousness


### Illustration

If we are looking for one vowel, then our regexp is

    [aeiou]

If we want three vowels in a row, then we can use the regexp

    [aeiou]{3}

This does not mean that we want the same vowel three times! Rather, it
means that three times in a row, the regexp engine should find one of the characters
located inside of the character class.

If we’re looking for a word with two such sets of letters, then we’ll want to
modify our regexp such that it has that pattern twice – but with zero or more
characters occurring between them:

    [aeiou]{3}.*[aeiou]{3}

But wait! What if the vowel is the first letter of the word, is is capitalized?
We should thus apply the appropriate flag to make our search case-insensitive.
Alternately, we could just modify our regexp to explicitly include [AEIOU], as
well. I’ve heard that this is somewhat faster, because you’re limiting the range
that the regexp engine should examine, but haven’t ever tested it. Here’s what
it would look like, if you weren’t to use the case-insensitive flag:

    [AEIOUaeiou]{3}.*[aeiou]{3}
    
In theory, we could also make the second set case insensitive, but I don’t
see a compelling reason to do that.

Now, some people might worry that the regexp engine will see four vowels
in a row as two sets of three vowels. That is, if I have aeio, then will the regexp
engine see this as aei folowed by eio? The answer is “no” – regexps are read
from left to right, and once the pointer moves to the right, it won’t go back.
Unless it is going to back off a bit, or you’re using lookahead/lookbehind. But
each character in a string is captured by a separate portion of the regexp, which
means that you needn’t worry about it.    


# Alternation

## Multiple date formats
Dates are a well-known problem in the world, in that the same representation
can mean different things. If you see the date 1/2/2016, does that mean
February 1st or January 2nd? It all depends on whether you’re in the United
States or Europe. Asian countries write dates altogether differently, starting
with the year, so 2016-2-1 would mean February 1st, 2016.

For this exercise, write a regular expression that finds all dates in the following
string:

    I write it as 2015-09-02, but he wrote 2/9/2015, and she wrote 9.2.2015.

In [41]:
import re

s = 'I write it as 2015-09-02, but he wrote 2/9/2015, and she wrote 9.2.2015.'

ro = re.compile("(\d{4}[-/.]\d{1,2}[-/.]\d{1,2})|(\d{1,2}[-/.]" +
"\d{1,2}[-/.]\d{4})|(\d{1,2}[-/.]\d{1,2}[-/.]\d{4})")

print(ro.findall(s))

[('2015-09-02', '', ''), ('', '2/9/2015', ''), ('', '9.2.2015', '')]


### Illustration

The key here, as you might imagine, is to use alternation. We can find all three
of the above dates by hard-coding them in a regexp:

    2015-09-02|2/9/2015|9\.2\.2015

This will work, but we need something a bit more robust and generic. We
can take advantage of the \d character class, which matches digits. And we
can use {min,max} to indicate how many numbers we want. Our regexp thus
becomes:

    \d{4}-\d{1,2}-\d{1,2}|\d{1,2}/\d{1,2}/\d{4}|\d{1,2}\.\d{1,2}\.\d{4}

Let’s finish this off by making the symbols a bit more generic, using a
character class:

    (\d{4}[-/.]\d{1,2}[-/.]\d{1,2})|(\d{1,2}[-/.]\d{1,2}[-/.]\d{4})|(\d{1,2}[-/.]\d{1,2}[-/.]\d{4})

Yes, this is a bit long and ugly. In such cases, it’s often a good idea to break
the regexp up, using the verbose/extended flag. Notice that I also used parentheses,
to ensure that our alternation is handled as a group not an individual
character. As a result of these additional parentheses we will get results that
contain a bit more than might like.

If you’re a bit more advanced with regexps, then you might want to use
non-capturing parentheses (with ?: inside of parentheses) for this purpose:

    (?:\d{4}[-/.]\d{1,2}[-/.]\d{1,2})|(?:\d{1,2}[-/.]
    \d{1,2}[-/.]\d{4})|(?:\d{1,2}[-/.]\d{1,2}[-/.]\d{4})

(Note that the above should be written as a single line.)
Using non-capturing parentheses is a bit advanced, and it makes the regexp
uglier, but it’s extremely useful.

## “oo” and “ee” words
Find all of the words containing the double-letter combination oo and/or ee in
the Alice in Wonderland, regardless of case.

In [42]:
import re

filename = 'alice.txt'
ro = re.compile(r'\b(\w*(oo|ee)\w*)\b', re.IGNORECASE)

s = open(filename).read()

for quote in ro.findall(s):
    print(quote)

('EBook', 'oo')
('eBook', 'oo')
('eBook', 'oo')
('EBook', 'oo')
('EBOOK', 'OO')
('Proofreading', 'oo')
('Room', 'oo')
('peeped', 'ee')
('book', 'oo')
('book', 'oo')
('feel', 'ee')
('sleepy', 'ee')
('too', 'oo')
('took', 'oo')
('looked', 'oo')
('feet', 'ee')
('seen', 'ee')
('see', 'ee')
('seemed', 'ee')
('deep', 'ee')
('deep', 'ee')
('look', 'oo')
('too', 'oo')
('see', 'ee')
('looked', 'oo')
('book', 'oo')
('took', 'oo')
('soon', 'oo')
('looked', 'oo')
('seen', 'ee')
('roof', 'oo')
('doors', 'oo')
('been', 'ee')
('door', 'oo')
('doors', 'oo')
('too', 'oo')
('too', 'oo')
('door', 'oo')
('fifteen', 'ee')
('door', 'oo')
('looked', 'oo')
('cool', 'oo')
('doorway', 'oo')
('book', 'oo')
('look', 'oo')
('see', 'ee')
('disagree', 'ee')
('sooner', 'oo')
('soon', 'oo')
('feeling', 'ee')
('indeed', 'ee')
('door', 'oo')
('poor', 'oo')
('door', 'oo')
('see', 'ee')
('too', 'oo')
('poor', 'oo')
('good', 'oo')
('Soon', 'oo')
('creep', 'ee')
('door', 'oo')
('feel', 'ee')
('soon', 'oo')
('POOL', 'OO')
('

### Illustration

We’re looking for either oo or ee. We’ll thus need to use alternation, the regexp
for which looks as follows:

    oo|ee

We’re interested not just in the doubled vowel, but in the word in which the
doubled vowel occurs. This means that we need to use parentheses to stop |
from extending to the edge of the regexp, as follows:

    (oo|ee)

With that in place, now we can extend the regexp to look for words:

    \b\w*(oo|ee)\w*\b

Because of the way parentheses and grouping works, we’ll put one final
group around the entire regexp:

        \b(\w*(oo|ee)\w*)\b


## British and American spelling
The problem here is a relatively simple one. We have a sentence:

    The new box of cheques is blue in colour.

Or I might have this sentence:

    The new box of checks is blue in color.

Write a regexp that matches either of these.

In [44]:
import re

s1 = 'The new box of cheques is blue in colour.'
s2 = 'The new box of checks is blue in color.'

ro = re.compile('The new box of che(que|ck)s is blue in colou?r.')

if ro.match(s1) and ro.match(s2):
    print("Matches!")

Matches!


### Illustration

One solution is to use a combination of alternation and the ? metacharacter:

    The new box of che(que|ck)s is blue in colou?r.

In the first case, we want to match either check or cheque. We could, of
course, use something like (check|cheque), and that would work just fine.
You could even argue that it would be more readable. But in many cases, we
want our regexps to be short and to the point – thus, if we have only a few
letters that are different

Notice that we put the word inside of parentheses. If we weren’t to do that,
the alternation character (|) would look all the way to the front of the string,
and all the way to the end of the string. Using parentheses in this way can have
some surprising side effects, because it means we have created a group, even if
we didn’t intend to do so.

In the second case, of color and colour, we could have used alternation.
But when it’s just a single character that is optional, I find it easier and more
intuitive to use ? to make a specific character optional.

Note that this regexp will also match the following sentence:
    
    The new box of checks is blue in colour.

Whether you see that as a bug or a feature is, of course, up to you; I’m
willing to live with it.    

# Anchoring

## Capital vowel starts
In this assignment, find and print all of words that begin with a capital vowel
(A, E, I, O, or U) and are at the start of a line.

In [60]:
import re

filename = 'words.txt'
ro = re.compile('^[AEIOU]\w*')

for line in open(filename):
    if ro.search(line):
        print(line.rstrip())

A
Aani
Aaron
Aaronic
Aaronical
Aaronite
Aaronitic
Aaru
Ab
Ababdeh
Ababua
Abadite
Abama
Abanic
Abantes
Abarambo
Abaris
Abasgi
Abassin
Abatua
Abba
Abbadide
Abbasside
Abbie
Abby
Abderian
Abderite
Abdiel
Abdominales
Abe
Abel
Abelia
Abelian
Abelicea
Abelite
Abelmoschus
Abelonian
Abencerrages
Aberdeen
Aberdonian
Aberia
Abhorson
Abie
Abies
Abietineae
Abiezer
Abigail
Abipon
Abitibi
Abkhas
Abkhasian
Ablepharus
Abnaki
Abner
Abo
Abobra
Abongo
Abraham
Abrahamic
Abrahamidae
Abrahamite
Abrahamitic
Abram
Abramis
Abranchiata
Abrocoma
Abroma
Abronia
Abrus
Absalom
Absaroka
Absi
Absyrtus
Abu
Abundantia
Abuta
Abutilon
Abyssinian
Acacia
Acacian
Academic
Academus
Acadia
Acadian
Acadie
Acaena
Acalepha
Acalephae
Acalypha
Acalypterae
Acalyptrata
Acalyptratae
Acamar
Acanthaceae
Acantharia
Acanthia
Acanthocephala
Acanthocephali
Acanthocereus
Acanthodea
Acanthodei
Acanthodes
Acanthodidae
Acanthodii
Acanthodini
Acantholimon
Acanthomeridae
Acanthopanax
Acanthophis
Acanthopteri
Acanthopterygii
Acanthuridae
Acanthuru

### Illustration

There are two basic ways to solve this problem. One, and the one I prefer,
is to read through the file line by line. When we do that, we can use ˆ to
anchor our regexp to the start of the string. Then all we have to do is continue
the word using \w, which represents any alphanumeric character, and then *,
which matches zero or more characters.

Why would I use *, rather than +? Because two of the capital vowels (A
and I) are words. If we were to use +, then the regexp would need to match at
least two letters, not just one.

Our regexp can thus look like this:

    ^[AEIOU]\w*

Another method would be to read the entire file as a single string, and then
to look for our capital-vowel-word at the start of each line – either by looking
for \n followed by our regexp, or by using a flag to indicate multi-line mode,
such that ˆ matches the start of a line, rather than the start of the entire string.

## Comment lines
Many Unix-style files, including programs written in such languages as Python
and Ruby, indicate comments by having a # at the start of the line. In this
exercise, you are to print all comment lines – meaning, all lines that start with
#, or that are preceded by whitespace. Comments that follow whitespace can
be ignored.
Thus, given the following file:

    # Comment 1
        # Comment 2
    
    print("Hello") # Comment 3

Your solution should print comments 1 and 2, but not comment 3.


In [46]:
import re

filename = 'words.txt'
ro = re.compile('^\s*#')

for line in open(filename):
    if ro.search(line):
        print(line)

### Illustration

We’re only interested in comments that appear at the beginning of the line, or
coming after whitespace at the start of the line. In other words, we’re looking
for a # character just after the start of the line, or with optional whitesapce
before the #. We can thus use the following regexp:

    ^\s*#


## Last five characters
In Alice inWonderland, print the last five characters of every line, in which the
third-to-last character is a lowercase letter in the second half of the alphabet
(i.e., starting with n). We’re looking for the final five characters, in which the first of those is in the range from n to z. 

In [47]:
import re

filename = 'alice.txt'
ro = re.compile('[n-z].{4}$')

for line in open(filename):
    m= ro.search(line)
    if m:
        print(m.group(0))
               


rroll
rland
rroll
nline
p.net
tion]
n the
o the
ns in
ons?"
r the
re of
p and
s ran
r! Oh
watch
never
under
r it!
tion]
think
well.
s and
r, so
t it.
thing
ss me
over.
not a
seen.
ow of
r the
was a
n key
tted!
tion]
ssage
ut of
s and
ters.
son_'
ottle
tured
or of
tered
ope!"
rden.
o the
n she
reach
pery,
thing
ried.
o her
tion]
w I'm
ouble
you."
white
tion]
zle!"
o see
while
st be
ther.
t the
y was
ver,"
ver!"
t she
tion]
self.
ouse?
o she
tired
rench
quite
you'd
you'd
not."
rty.]
rd as
went.
nd we
n the
ogs."
tion]
tion]
tural
nd it
ng to
poke.
tion]
st it
party
t off
over.
quite
won?"
ught.
zes."
sked.
t, in
zes!"
o her
y one
ound.
taste
onder
pt on
n the
th go
u_.--
nial:
rial;
rning
thing
se to
uch a
udge,
would
sting
udge,
ury,'
nning
whole
ndemn
ou to
th.'"
t are
p and
shook
sed a
rious
o cry
tance
tion]
tion]
oking
rd it
s are
tely.
tone,
t and
now!"
thout
t the
oves.
ottle
o her
ssing
quite
ore."
wing,
tever
tion]
tside
sten.
t was
ut as
t it,
ncied
t her
nto a
sort.
nor!"
nor!

### Illustration

When you hear that you’re looking to match “the first” or “the last” characters
on a line, then you almost certainly want to use an anchor. In this case, we’ll
use $, which anchors the regexp to the end of a line. If we were looking for the
last five characters, we could simply say:

    .{5}$

But we’re looking for the final five characters, in which the first of those is
in the range from n to z. In other words:

    [n-z].{4}$

## u in the 2nd-to-last word

Show the final two words of each line of Alice in Wonderland in which u is in
the second-to-last word.

In [48]:
import re

filename = 'alice.txt'
ro = re.compile(r'\b\w*u\w*\s+\S+$')

for line in open(filename):
    m= ro.search(line)
    if m:
        print(m.group(0))
        
# Remember to use a raw string (or a doubled backslash) when your raw string
# includes \b. Otherwise, Python will interpret \b as the backspace character
# (ASCII 8), which will lead to a mismatch.        

pictures or
pleasure of
up and
out again.
out of
through the
up like
beautifully printed
surprised that
about you."
could go.
must be
subject of
quite natural
authority among
found it
you know
turning to
must have
mouse to
Such a
you to
up and
caused a
up eagerly.
found the
full effect
but as
out her
thought Alice.
suppose I
question is
up on
because I'm
turned away.
minutes together!"
such a
mushroom and
you grow
would bend
succeeded in
just as
usual height.
suddenly a
you are;
you know."
Duchess sneezed
your cat
sure to
Queen to-day?"
house of
you know."
poured a
though she
mushroom (she
found herself
Duchess close
quarreling with
summer day;
quite away!"
you begin?"
jurymen are
out the
ventured Alice.
jury consider
turning purple.
found in:
Redistribution is
Full Project
intellectual property
your possession.
you are
you from
full Project
distributing or
redistribute this
including any
but he
such and
return or
inaccurate or
your equipment.
including legal
you can
medium with
you wi

### Illustration
If I want to see the final word in each line, then it’s probably easiest to iterate
over each line of the file, grabbing the final non-whitespace characters:

    \S+$

Note that the above is already potentially problematic: Because of the way
in which Unix and Windows mark line endings, using the $ to mark the end
of the line and then \S to indicate non-whitespace characters right before it,
means that you might miss lines that have a \r\n at the end, from Windows.
We will assume, for now, that the file has the appropriate line endings for your
operating system.

The thing is, we don’t want the final word. We want the final two words.
We’ll thus have to capture two such words:

    \S+\s+\S+$
    
This gives us the final two words, but we aren’t yet filtering through those
words. The first of the two words (i.e., the second-to-last word on the line)
must contain an u. We can do that with the following:

    \b\w*u\w*\s+\S+$

It’s helpful to read this regexp from the back, because of the $ at the end:
We want one or more non-whitespace characters at the end of the line. We
could probably have used \w instead of \S; the question is whether we want to
include punctuation or not. And indeed, the regexp
    
    \tb\W+\w+$

would have roughly the same result. That said, I’ll stick with the one that
uses whitespace.

The second-to-last word itself is found in the regexp’s first section:

    \b\w*u\w*

This means that we want to have zero or more letters (well, alphanumeric
characters), u, and then zero or more letters. This allows for words that start or
end with u, as well as those with u in the middle. By having a \b at the start of
the regexp, we ensure that we capture the entire word, rather than just a portion
of it.

Thus, our final regexp to match the final two words of any line in which the
second-to-last word contains a u is: 

    \b\w*u\w*\s+\S+$

# Groups

## Date and Time

In access-log.txt, each line contains a timestamp, which looks like this:

    [30/Jan/2010:00:03:18 +0200]

Notice that the timestamp starts with [, ends with ], and contains both the
date (in DD/MMM/YYYY format) and the time (in HH:MM:SS +TZ format).

For this exercise, you are to grab the date and time in separate groups. Each
language has a slightly different way of extracting the groups; the idea is that
for each line, it should be possible to extract and display the date and time
separately. The time should include the time zone; for now, we’ll leave it in the
format used by the access log.

In [49]:
import re

filename = 'access-log.txt'
ro = re.compile('(\d{2}/\w{3}/\d{4}):(\d{2}:\d{2}:\d{2} \+\d{4})')

for line in open(filename, 'U'):
    m = ro.search(line)
    if m:
        print("Date = '{0}', Time = '{1}'".format(m.group(1), m.group(2)))

Date = '30/Jan/2010', Time = '00:03:18 +0200'
Date = '30/Jan/2010', Time = '00:12:06 +0200'
Date = '30/Jan/2010', Time = '01:29:23 +0200'
Date = '30/Jan/2010', Time = '01:30:06 +0200'
Date = '30/Jan/2010', Time = '02:07:14 +0200'
Date = '30/Jan/2010', Time = '02:10:39 +0200'
Date = '30/Jan/2010', Time = '03:13:34 +0200'
Date = '30/Jan/2010', Time = '03:13:34 +0200'
Date = '30/Jan/2010', Time = '03:43:39 +0200'
Date = '30/Jan/2010', Time = '04:05:43 +0200'
Date = '30/Jan/2010', Time = '04:05:51 +0200'
Date = '30/Jan/2010', Time = '04:24:33 +0200'
Date = '30/Jan/2010', Time = '04:25:36 +0200'
Date = '30/Jan/2010', Time = '04:34:36 +0200'
Date = '30/Jan/2010', Time = '04:39:37 +0200'
Date = '30/Jan/2010', Time = '05:03:34 +0200'
Date = '30/Jan/2010', Time = '05:32:31 +0200'
Date = '30/Jan/2010', Time = '06:01:22 +0200'
Date = '30/Jan/2010', Time = '06:30:19 +0200'
Date = '30/Jan/2010', Time = '06:59:14 +0200'
Date = '30/Jan/2010', Time = '07:07:13 +0200'
Date = '30/Jan/2010', Time = '07:0



### Illustration

When working on such a problem, in which I have to match multiple parts of a
string, I always try to start by matching the first part, and only then by matching
the second part. To match our date, we know that we’ll need to find two digits,
three letters, and two digits, all separated by slashes. We can do that with:

    \d{2}/\w{3}/\d{4}

Now, you might be thinking that the middle should use a character class,
such as [a-z], rather than \w. But I don’t think that it’s crucial in this particular
case; it’s true that \w is more general, and thus slightly slower and more
general, but this is a case in which I prefer readability to speed.

Now, the above regexp matches the date. But I want to grab it in a group,
and be able to access the group later. Thus, I put it inside of parentheses:

    (\d{2}/\w{3}/\d{4})

With that in place, I can start to attack the second part, namely the time.
That consists of pairs of numbers separated by colons, followed by a space,
followed by a + and then four digits indicating the time zone. In other words,
the time, by itself, is identifiable as:

    \d{2}:\d{2}:\d{2} \+\d{4}

Remember that + is a metacharacter, which means that matching a literal +
requires using \+!
We can then find this as a group by putting parentheses around it:

    (\d{2}:\d{2}:\d{2} \+\d{4})

Now we can combine our two groups, joining them with the : that appears
between the date and time in the access log:

    (\d{2}/\w{3}/\d{4}):(\d{2}:\d{2}:\d{2} \+\d{4})

If we look for the above in access-log.txt, we’ll find that group #1 is
the date, and group #2 is the time.

## Config pairs

config.txt is a simple configuration file. Simple, in that the configuration is
set with lines that look like

    name:value

But as often happens in such files, the people writing the file have gone a
bit crazy, and have added lots of extra whitespace. Some lines contain only
whitespace, or are generally illegal, without either a name or a value.

We want to extract all of the name-value pairs from this file, grabbing the
name and value in separate groups from legal lines. Moreover, we want to
ignore any leading and trailing whitespace surrounding the name and value.

In [50]:
import re

filename = 'config.txt'
ro = re.compile('(\w+)\s*:\s*(\w+)')

for line in open(filename):
    m = ro.search(line)
    if m:
        print("Name = '{0}', Value = '{1}'".format(m.group(1), m.group(2)))

Name = 'a', Value = '1'
Name = 'c', Value = 'hello'
Name = 'b', Value = '100'


### Illustration

As usual, it’s a good idea to start with the simple part of the regexp, and then
work up to the more complex parts.
The simplest possible regexp is the one that matches our basic name:value:

    (\w+):(\w+)

In other words, we’re looking for all of the alphanumeric characters before
:, and then all of those after :. Those will be our name and value.

        But our name and value might have whitespace before and after them.
Thus, we need to account for that by using \s, along with *, indicating that
the whitespace is optional:

    (\w+)\s*:\s*(\w+)

Now, what about those illegal lines? We don’t need to worry about them,
since they won’t match our regexp: If there isn’t at least one alphanumeric
character before and after the colon, the line won’t match our regexp. This is
also true for lines that contain only whitespace.

And what about whitespace either before the name or after the value? Again,
we don’t need to worry about this, because they occur before and after our regexp’s
groups, and thus won’t be captured.


## Postfix dollar

In the United States, we put the dollar sign before the price of something, as
in \$123.45. In my travels, I’ve noticed and discovered that many people, in
many countries, aren’t used to this, and put the $ sign after the numbers. Given
the sentence:

    They wrote 1,000$ (not $1000), and 123.45$ (not $123.45).

For this exercise, write a regular expression that finds all of the cases of
numbers (including commas and decimal points) followed by dollar signs.
Thus, the results should find 1,000\$ and 123.45\$

In [51]:
import re
s = 'They wrote 1,000$ (not $1000), and 123.45$ (not $123.45).'
print(re.findall('[\d.,]+\$', s))

['1,000$', '123.45$']


### Illustration

    [\d.,]+\$

To find a decimal digit (0-9), we can use the built-in character class \d.
But we don’t want to find just digits; we also need to find decimal points and
commas. To that end, I create a new character class, containing not only \d,
but also periods and commas.

But of course, we’re not only interested in numbers. We’re interested in
numbers that have a trailing \$. Normally, you might think that you can use
a plain \$ at the end of this regular expression. But we can’t do that in this
case, because a \$ in the final position of a regexp becomes a metacharacter,
anchoring the regexp to the end of the string. (Or, if you’re in multi-line mode,
it matches the end of a line.) So in order to match a trailing dollar sign, we’ll
need to put a backslash before that final \$.

## Quote first and last words

In an earlier exercise (5.8), we found all of the quotations in Alice in Wonderland.
For this exercise, find the first word and last from each quotation, not
including the quotation marks and punctuation.

Thus, if the quote is

    "Hello out
    there!"

You should find Hello and there. Note that quotes might extend across
lines.

In [52]:
""" Because this regexp includes both double and single quotes, we’ll need to use
a backslash when defining our regexp string in Python, escaping the single
quotes within the regexp string
"""
import re

filename = 'alice.txt'
ro = re.compile('"([a-zA-Z\']+)[^"]+?([a-zA-Z\']+)[.?!]*"')

s = open(filename).read()

for quote in ro.findall(s):
    print(quote)

('STORYLA', 'D')
('without', 'conversations')
('Oh', 'late')
("Dinah'll", 'think')
('I', 'me')
('Oh', 'getting')
('how', 'begin')
('DRINK', 'ME')
('What', 'feeling')
('I', 'telescope')
('Come', 'that')
('I', 'minute')
('EAT', 'ME')
('and', 'happens')
('Which', 'way')
('Curiouser', 'curiouser')
('Now', 'you')
('Oh', 'waiting')
('Dear', 'puzzle')
('How', 'that')
('I', 'again')
('That', 'escape')
('And', 'garden')
('for', 'never')
('to', 'trying')
('O', 'Mouse')
('I', 'Conqueror')
('O', 'chatte')
('Oh', 'pardon')
('I', 'cats')
('Not', 'cats')
('Would', 'me')
("don't", 'thing')
('We', 'not')
('We', 'indeed')
('As', 'again')
('I', 'indeed')
('Are', 'dear')
("I'm", 'again')
('Mouse', 'them')
('Let', 'dogs')
('Sit', 'enough')
('Ah', 'm')
('U', 'h')
('of', 'means')
("it's", 'find')
("'", 'dear')
('it', 'all')
('Speak', 'English')
('I', 'either')
('is', 'race')
('What', 'race')
('the', 'it')
('One', 'away')
('The', 'over')
('But', 'won')
('But', 'prizes')
('Prizes', 'Prizes')
('Mine', 'tale')
(

The solution to our previous exercise on quoting was:

    "[^"]+"

Now we want to find the first and last words in that sentence. Let’s start with
the first word, which will contain letters immediately following the opening
quotes:

    "([a-zA-Z']+)[^"]+"

In this case, I decided to match all of the letters (capital and lowercase), as
well as apostrophes (’). If I run this regexp across the text of Alice – not line
by line, but rather across the entire book, so that I can grab quotes that exist
across newlines – then group #1 matches the first word.
Now let’s try to grab the last word. On the face of it, this should be the same
as the first word. However, the instructions for this exercise indicated that we
shouldn’t include any punctuation in our final word. Thus, we’ll need to grab
optional punctuation at the end of the quote (i.e., immediately preceding the
final quotes), and then letters and apostrophes before that:

    "([a-zA-Z']+)[^"]+([a-zA-Z']+)[.?!]*"

The thing is, this doesn’t quite work. Instead of the final word in our second
group, we get the final character of the final word. What went wrong?

The answer lies in the fact that regexps are greedy. This means that as the
regexp engine tries to match text, it grabs as much as it can, from left to right.
So the first expression in the regexp will get as much as it can, and then the
second will get as much as it can, and so forth.

The problem is that if you have two expressions in your regexp that are
right next to each other, and which can potentially match the same text, the one
on the left wins. For example:

    (\w+)(\w+)

If we match the above against abcde, group #1 will be abcd, and group
#2 will be e. This is normally a good thing, but in the case of this exercise, it
causes trouble. We don’t want the middle characters of the quotation to come
at the expense of the final word!

The solution is to make the middle section non-greedy. That is, we still
want it to grab characters, but it should grab the minimum possible for a match,
rather than the maximum. We can indicate that *, +, ?, and {} are non-greedy
by putting an ? after them. For example, let’s try our sample regexp again:

    (\w+?)(\w+)

Matched against the string abcde, group #1 will now be a, and group #2
will be bcde.
To get the full final word, we thus modify the regexp one last time:

    "([a-zA-Z']+)[^"]+?([a-zA-Z']+)[.?!]*"

## Question first word

Once again, let’s extract some text from Alice inWonderland: Retrieve the first
word of every question – meaning, every sentence that ends with a question
mark.


In [53]:
import re

filename = 'alice.txt'
ro = re.compile('(\w+)[^.?!]*\?')

s = open(filename).read()

for quote in ro.findall(s):
    print(quote)

Once
Would
She
Which
Oh
_Was_
But
How
Would
So
So
Would
Are
Are
And
The
How
What
and
But
It
What
I
Where
Very
What
Where
Now
She
Will
Who
Then
A
I
Who
What
Why
Is
It
What
One
The
And
Where
And
You
The
How
_Are_
Please
Wouldn
Alice
Cheshire
Do
The
Do
Illustration
What
Would
When
Can
Where
What
How
How
IX
When
What
What
Who


### Illustration

The first thing we need to figure out in order to solve this problem is how we
can describe a question using regular expressions.

We know that a question starts with a word – and that word might be only
one character long, as in I – and ends with a question mark. Maybe we could
identify questions this way:

    \w+\?

But of course, the above won’t work, because there might be spaces in the
middle. We could also use a non-greedy regexp, such as:

    .+\?

But that won’t go over the newlines, at least not without invoking the singleline
flag that most regexp engines offer. Instead, I’m going to use a technique
similar to what we saw in Exercise 5.8, in which we said that a quote started
with ", ended with ", and that in the middle we had everything that was not a
". That might lead us to the following:

    \w[^?]\?

But this will likely pick up all sorts of other things. I’m thus going to
expand the negated character class in the middle, to ensure that anything we
capture will not cross the boundary of a sentence:

    \w[^!.?]*\?

I use a * here after the negated character class, to allow for one-letter questions (e.g., I?) Finally, we can indicate that we want the first word, and then
capture that word:

    (\w+)[^.?!]*\?

## t, but no "ing"

In this exercise, you are to find all of the words in Alice in Wonderland that
start with t and end with ing. However, you are to return the portion of the
word that precedes the int. Thus, if the word is trailing, you should only
match and return trail.

In [54]:
import re

filename = 'alice.txt'
ro = re.compile(r'\b(t\w+)ing\b')

s = open(filename).read()

for quote in ro.findall(s):
    print(quote)

talk
try
try
th
trott
talk
talk
try
th
trembl
trembl
talk
th
th
turn
th
th
turn
think
talk
trembl
trott
th
talk
th
turn
th
th
th
tak
try
think
th
throw
th
th
talk
th
th
turn
try
tumbl
try
think
th
trembl
turn
turn
think


### Illustration

Let’s start by defining a regexp that’ll give us all of the words that start with t:

    \bt\w+\b

The above describes a word (because of the \b on either side). The words
starts with t and then continues with at least one more letter (thanks to the +)
until it reaches the end of the world.
Now, let’s add a check to see if the word ends with ing:

    \bt\w+ing\b

And finally, we’ll add parentheses to capture the initial part of the word:

    \b(t\w+)ing\b
