### Regular expressions and loops

Python loops

Regular expressions (regex) are patterns which match parts of text. Optionally, they can also replace. They are powerful ways of finding and changing strings.

The most important point is that regular expressions are not specific to Python. They're ubiquitous in programming and usually available in any piece of software that works with sequences of characters.

Data preparation is a large part of most DH projects. Regex is a key tool, whatever your software or programming language. Learn regex once: use everywhere!

Regex has limitations. We'll come back to this at the end, but regex only works with strings. In regex, the number _5_ is a string, not an integer. That means that you can find sequences of numbers, but with regex alone you can't increment those numbers or do other mathematical operations on them.

Fortunately a programming language can do that. So if you combine regex with something like Python you can have the best of both.
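
As a small preview of that combination, here is a minimal sketch. The sample string is made up for illustration, and the ```re``` module it uses is introduced properly further down.

```python
import re

sample = "Volume 2, chapter 11"  # made-up sample string

# Regex finds the digits, but hands them back as strings
numbers_as_text = re.findall(r"[0-9]+", sample)
print(numbers_as_text)

# Python converts each string to an integer, so we can do arithmetic
incremented = []
for number in numbers_as_text:
    incremented.append(int(number) + 1)
print(incremented)
```

The regex does the finding; Python does the maths.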

Let's read in _Persuasion_ again, as we did last week. First we need to get it into Colab:

In [34]:
!wget https://raw.githubusercontent.com/jonathanblaney/2025-1-plain-text/refs/heads/main/persuasion.txt

--2025-10-20 14:57:20--  https://raw.githubusercontent.com/jonathanblaney/2025-1-plain-text/refs/heads/main/persuasion.txt
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.111.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 497612 (486K) [text/plain]
Saving to: ‘persuasion.txt.2’


2025-10-20 14:57:20 (13.6 MB/s) - ‘persuasion.txt.2’ saved [497612/497612]



Now we can read in the text again and use the variable `persuasion` to reference it:

In [35]:
with open('persuasion.txt', 'r') as f:
    persuasion = f.read()

Last week we had a really awkward way of looking for the context in which 'Anne' occurs:

In [36]:
persuasion.find("Anne")

2133

In [47]:
persuasion[2113:2153]

' born\nJune 1, 1785; Anne, born August 9,'

To get the next occurrence we'd have to look in the slice after 2133 (note that parts of a slice can be omitted).

In [48]:
persuasion[2153:].find("Anne")

4320

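But that number is relative to the start of the slice, not to the whole text, so slicing the full text at that position misses _Anne_: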

In [49]:
persuasion[4300:4340]

'ealed his failings, and promoted his rea'

We don't want to do this manually 496 times. If you want to do something repeatedly in Python a good approach is often writing a loop. Here we'll look at a *while* loop (a bit later we'll do the *for* loop).

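Suppose you are with someone who is backing up a car towards a cliff edge. They ask you to get out and tell them when to stop. You say "keep coming" *while* there is space. When that changes you shout "Stop!" Let's write some Python for that.

A few things to notice in the code. The `f` before the opening quote makes an _f-string_, in which `{distance}` is replaced by the variable's value. The loop repeats *while* its condition is true, and `distance -= 10` is what eventually makes it false; the amount subtracted on each pass sets how quickly the car closes the gap. Indentation matters: if `print("Stop!")` is indented, it belongs to the loop and runs on every pass, as in the first cell below; unindented, it runs once, after the loop has finished.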


In [2]:
distance = 100
while distance > 5:
    print(f"Keep coming: you've got {distance} centimeters")
    distance -= 10
    print("Stop!")

Keep coming: you've got 100 centimeters
Stop!
Keep coming: you've got 90 centimeters
Stop!
Keep coming: you've got 80 centimeters
Stop!
Keep coming: you've got 70 centimeters
Stop!
Keep coming: you've got 60 centimeters
Stop!
Keep coming: you've got 50 centimeters
Stop!
Keep coming: you've got 40 centimeters
Stop!
Keep coming: you've got 30 centimeters
Stop!
Keep coming: you've got 20 centimeters
Stop!
Keep coming: you've got 10 centimeters
Stop!


In [None]:
distance = 100
while distance > 5:
    print(f"Keep coming: you've got {distance} centimeters")
    distance -= 10
print("Stop!")
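
A loop can also be ended early from inside its body with Python's `break` statement. The following is only a sketch of that alternative, reusing the same `distance` example; the `while True` form is an illustrative choice, not something the rest of this notebook depends on.

```python
distance = 100
while True:  # this condition never becomes false on its own
    if distance <= 5:
        print("Stop!")
        break  # leave the loop immediately
    print(f"Keep coming: you've got {distance} centimeters")
    distance -= 10
```

Both versions print the same messages; the first is usually easier to read because the stopping condition sits on the `while` line.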

In the same way, we could loop through the text of _Persuasion_ and look for the string we want in each fragment. Here's an approach to splitting the text in Python, using a `while` loop. But this is getting complicated immediately.

Note that the condition of a `while` loop must be true at the start, otherwise the body never runs.

In [50]:
offset = 0
while offset < 10000:
    persuasion_chunk = persuasion[offset:offset + 100]
    print(f"chunk range currently: {offset}:{offset + 100}:\n") # show the chunk range
    print(f"{persuasion_chunk}\n")
    offset += 1000
print("finished")

chunk range currently: 0:100:

﻿The Project Gutenberg eBook of Persuasion
    
This ebook is for the use of anyone anywhere in the 

chunk range currently: 1000:1100:

I.
 CHAPTER XIII.
 CHAPTER XIV.
 CHAPTER XV.
 CHAPTER XVI.
 CHAPTER XVII.
 CHAPTER XVIII.
 CHAPTER X

chunk range currently: 2000:2100:

Stevenson, Esq. of South Park, in the county of
Gloucester, by which lady (who died 1800) he has iss

chunk range currently: 3000:3100:

Elizabeths they had married; forming altogether two
handsome duodecimo pages, and concluding with th

chunk range currently: 4000:4100:

rior character to any
thing deserved by his own. Lady Elliot had been an excellent woman,
sensible a

chunk range currently: 5000:5100:

d advice, Lady Elliot mainly relied for the best help
and maintenance of the good principles and ins

chunk range currently: 6000:6100:

 his eldest, he would really have given up any thing,
which he had not been very much tempted to do.

chunk range currently: 7000:7100:

loom had v
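
For comparison, here is a minimal sketch of how a `while` loop could step through every occurrence of a literal string. It relies on the optional second argument of `find`, which says where to start searching, and it uses a short made-up sample rather than the whole novel.

```python
sample = "Anne sat down. Mary looked at Anne, and Anne smiled."  # made-up text

positions = []
position = sample.find("Anne")  # find returns -1 when there is no match
while position != -1:
    positions.append(position)
    # search again, starting just after the match we have just recorded
    position = sample.find("Anne", position + 1)

print(positions)
```

This works, but we still only get positions for one exact string, and we have to manage the bookkeeping ourselves.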

We're only looking for literal strings, so we can't ask for the context around a string like _Anne_, because we don't know in advance what that context is. This is where _regular expressions_, also called _regex_, can help.

First we need to import Python's ```re``` module. It's part of the standard library so it will be installed in any normal installation of Python.

Last week we didn't import anything, which is why we were very restricted in what we could do. Most Python code you see will have imports, usually at the top of the notebook or file.

In [52]:
import re

Now we can use regex, in which some characters are _literal_ and some are _special_. Nearly every regex has a combination of both.

But let's start just by running a regex that only looks for the literal string _Anne_. Note that, because we imported the whole ```re``` module, we have to refer to ```re.findall```, not just ```findall```.

We'll read the results into a variable called ```anne_context```.

In [37]:
anne_context = re.findall(r"Anne", persuasion)

The results are now held in ```anne_context```.



In [None]:
anne_context

A key special character in regex is ```.```, which means _any character_. We can add it either side of our literal _Anne_ string to get the context, eg about 10 characters either side. The regex is then a mix of literal characters and a pattern.

In [38]:
anne_context = re.findall(r"...........Anne..........", persuasion)
anne_context

['e 1, 1785; Anne, born Aug',
 'grove; but Anne, with an ',
 'as only in Anne that she ',
 'rs before, Anne Elliot ha',
 'e growing. Anne haggard, ',
 ' father of Anne and her s',
 'ndation of Anne’s had bee',
 'e on which Anne wanted he',
 'untry. All Anne’s wishes ',
 'al fate of Anne attended ',
 'e her dear Anne’s known w',
 'hbourhood. Anne herself w',
 ' regard to Anne’s dislike',
 ' What Miss Anne says, is ',
 'place; and Anne, after th',
 'ched, than Anne, who had ',
 'nths ended Anne’s share o',
 'ore, while Anne was ninet',
 'this case, Anne had left ',
 'ssness for Anne’s being t',
 'g point of Anne’s conduct',
 'ed to; but Anne, at seven',
 'uent could Anne Elliot ha',
 'inced that Anne would not',
 'ished, and Anne though dr',
 'f claiming Anne when anyt',
 'do without Anne,” was Mar',
 ' I am sure Anne had bette',
 't all; and Anne, glad to ',
 'ttled that Anne should no',
 'ntained to Anne, in Mrs C',
 'use, while Anne could be ',
 ',” replied Anne, “which a',
 'elves, a

What are we getting back from Python here? Is it a string? How can we check?

In [53]:
type(anne_context)

list

Because this turns out to be a list, we can use **slicing** again, just like we did last week for strings. Slices work in lots of contexts in Python. For example, to get some of the last mentions of Anne in _Persuasion_ we can do this (note that `[-10:-1]` stops just before the final item; `[-10:]` would include it):

In [54]:
anne_context[-10:-1]

['e loved Anne better th',
 'keeping Anne with her ',
 ' seeing Anne restored ',
 'lation. Anne had no',
 'ns with Anne.',
 ' cousin Anne’s engagem',
 'Anne, satisfie',
 'ices by Anne had been ',
 'friend Anne’s was in ']

If you're ever unsure about the syntax for lists, create a small list of your own to check your intuition.

In [55]:
mylist = [1, 2, 3, 4, 5, 6]
mylist[-3:-1]

[4, 5]

In [6]:
mylist = [1, 2, 3, 4, 5, 6]
mylist[-1]

6

In [5]:
mylist = [1, 2, 3, 4, 5, 6]
mylist[-5:]

[2, 3, 4, 5, 6]

With the regex, we can always add or remove full stops to get more or less context.

But there is a problem with the results of the regex. Since this is a list, we can get its length:

The regex found nearly all of the occurrences of _Anne_, but not quite all:

In [56]:
len(anne_context)

494

In [57]:
persuasion.count("Anne")

496

These kinds of sense checks are good to build in to your thinking, and your code, as much as possible.
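
One way to build such a check into code is sketched below. It compares two ways of counting on a short made-up sample and reports whether they agree; the variable names are illustrative.

```python
import re

sample = "Anne met Anne's sister. Then Anne left."  # made-up sample text

literal_count = sample.count("Anne")            # plain string method
regex_count = len(re.findall(r"Anne", sample))  # regex equivalent

if literal_count == regex_count:
    print(f"OK: both methods found {literal_count} matches")
else:
    print(f"Check the regex: count() found {literal_count}, findall() found {regex_count}")
```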

The next special character we'll use is ```?```, meaning _one or none_ of the preceding character.

If we're not sure of the spelling of _Anne_ we can now allow for _Ann_ as well:


In [None]:
anne_context = re.findall(r"..........Anne?..........", persuasion)
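
To see what ```?``` does on its own, here is a small sketch on a made-up string, without the surrounding context dots.

```python
import re

sample = "Ann, Anne and Anna"  # made-up sample string

# The e is optional, so both Ann and Anne match.
# Anna also gives a match, but only its first three letters.
matches = re.findall(r"Anne?", sample)
print(matches)
```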

We can also use ```[^]``` to ask for any characters other than the ones after the ```^``` symbol.

In [58]:
no_anne_context = re.findall(r".......Ann[^e]+......", persuasion)

In [59]:
no_anne_context

[]
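
An empty list means nothing in the novel matched. To check that the pattern itself behaves as expected, it can help to try a simplified version on a made-up string that does contain other spellings, as in this sketch.

```python
import re

sample = "Anne, Anna, Ann's and Annie"  # made-up sample string

# Ann followed by exactly one character that is not e
not_anne = re.findall(r"Ann[^e]", sample)
print(not_anne)
```

_Anne_ is skipped, while _Anna_, _Ann's_ and _Annie_ each contribute their first four characters.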

We can also use `?` to make the characters around our string optional. This is pretty crude but let's do it anyway:

In [60]:
anne_context = re.findall(r".?.?.?.?.?.?.?.?Anne.?.?.?.?.?.?.?.?.?.?", persuasion)

In [61]:
len(anne_context)

494

A much better way is to give a range of how many characters we want to match, using ```{``` and ```}```.

In [9]:
anne_snippet = "Anne Elliot was doing a web search for herself. She wanted to find Anne and Anne's"

In [63]:
anne_snippet

"Anne Elliot was doing a web search for herself. She wanted to find Anne and Anne's"

In [62]:
anne_context = re.findall(r".{0,20}Anne.{0,20}", anne_snippet)
print(anne_context)
print(len(anne_context))

['Anne Elliot was doing a ', " She wanted to find Anne and Anne's"]
2


In [11]:
anne_context = re.findall(r".{0,20}?Anne.{0,20}?", anne_snippet)
print(anne_context)
print(len(anne_context))

['Anne', ' She wanted to find Anne', ' and Anne']
3
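
The two cells above differ only in the extra ```?``` after each ```{0,20}```. After a quantifier, ```?``` switches it from _greedy_ (match as much as possible) to _lazy_ (match as little as possible). A minimal sketch of the difference, on a made-up string:

```python
import re

sample = 'She said "yes" and then "no".'  # made-up sample string

# Greedy: .+ runs on as far as it can, to the last quotation mark
greedy = re.findall(r'".+"', sample)
print(greedy)

# Lazy: .+? stops at the first quotation mark it can
lazy = re.findall(r'".+?"', sample)
print(lazy)
```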


In [64]:
anne_words = re.findall(r"Anne [^ ]\w+", persuasion)

In this regex `\w` stands for a _word_ character (a letter, digit or underscore) and `+` means _one or more_ of the preceding.

In [44]:
anne_words

['Anne that',
 'Anne Elliot',
 'Anne haggard',
 'Anne and',
 'Anne wanted',
 'Anne attended',
 'Anne herself',
 'Anne had',
 'Anne included',
 'Anne spoke',
 'Anne says',
 'Anne an',
 'Anne Elliot',
 'Anne Elliot',
 'Anne could',
 'Anne was',
 'Anne had',
 'Anne Elliot',
 'Anne found',
 'Anne would',
 'Anne though',
 'Anne when',
 'Anne had',
 'Anne should',
 'Anne could',
 'Anne herself',
 'Anne was',
 'Anne had',
 'Anne walked',
 'Anne was',
 'Anne had',
 'Anne said',
 'Anne had',
 'Anne always',
 'Anne very',
 'Anne had',
 'Anne could',
 'Anne often',
 'Anne was',
 'Anne to',
 'Anne had',
 'Anne gave',
 'Anne hoped',
 'Anne was',
 'Anne had',
 'Anne will',
 'Anne to',
 'Anne followed',
 'Anne will',
 'Anne undertakes',
 'Anne was',
 'Anne understood',
 'Anne were',
 'Anne might',
 'Anne fully',
 'Anne Elliot',
 'Anne Elliot',
 'Anne Elliot',
 'Anne Elliot',
 'Anne felt',
 'Anne could',
 'Anne of',
 'Anne suppressed',
 'Anne offered',
 'Anne did',
 'Anne had',
 'Anne Elliot',
 'Anne 

We already know that we're searching for the literal string "Anne" so maybe we want to exclude that from the results. You can do that with a slightly complicated regex, but instead let's use the other main type of loop: a for loop. A for loop loops over everything in a collection of things.

In [45]:
for word in anne_words:
    newword = word.replace("Anne", "")
    print(newword)

 that
 Elliot
 haggard
 and
 wanted
 attended
 herself
 had
 included
 spoke
 says
 an
 Elliot
 Elliot
 could
 was
 had
 Elliot
 found
 would
 though
 when
 had
 should
 could
 herself
 was
 had
 walked
 was
 had
 said
 had
 always
 very
 had
 could
 often
 was
 to
 had
 gave
 hoped
 was
 had
 will
 to
 followed
 will
 undertakes
 was
 understood
 were
 might
 fully
 Elliot
 Elliot
 Elliot
 Elliot
 felt
 could
 of
 suppressed
 offered
 did
 had
 Elliot
 had
 longed
 could
 felt
 have
 could
 found
 first
 if
 instead
 to
 distinguished
 could
 was
 was
 suspect
 to
 felt
 had
 thought
 found
 was
 could
 and
 smiled
 was
 Elliot
 felt
 quietly
 avoided
 found
 did
 found
 had
 walking
 was
 will
 was
 had
 wondered
 was
 found
 was
 thought
 was
 had
 would
 was
 Elliot
 did
 had
 the
 of
 conceived
 enquired
 twice
 hazarded
 were
 return
 paid
 did
 was
 entered
 had
 could
 wonder
 had
 listened
 mentioned
 was
 drew
 alone
 could
 was
 was
 and
 that
 so
 had
 presumed
 could
 had
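
Printing is fine for a quick look, but often you will want to keep the results. Here is a sketch of the same loop collecting the cleaned words into a new list; it uses just the first three results from above so the output is easy to check, and `strip` removes the leftover leading space.

```python
anne_words = ["Anne that", "Anne Elliot", "Anne haggard"]  # first few results from above

following_words = []
for word in anne_words:
    newword = word.replace("Anne", "").strip()  # drop the name and the space
    following_words.append(newword)

print(following_words)
```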


Last week we weren't sure if there were characters called _Annette_ or places called _Annecy_ in the text. With regex we can check that. Let's look for _A_ followed by any number of lower-case letters.

Square brackets represent a _character class_, meaning _any one of these in any order_. ```[a-z]``` is a convenience to save you from typing ```[abcdefghijklmnopqrstuvwxyz]``` every time.

```+``` is like the ```?``` we saw above, but it means _one or more_.

In [65]:
capital_a = re.findall(r"A[a-z]+", persuasion)
capital_a

['Author',
 'Austen',
 'Austen',
 'Anne',
 'August',
 'Anne',
 'Anne',
 'Anne',
 'Anne',
 'All',
 'Anne',
 'Always',
 'As',
 'Anne',
 'Anne',
 'Anne',
 'Anne',
 'Anne',
 'Anne',
 'All',
 'Anne',
 'Anne',
 'Anne',
 'Anne',
 'And',
 'Anne',
 'Anne',
 'Anne',
 'As',
 'After',
 'Anne',
 'Anne',
 'Admiral',
 'Admiral',
 'Admiral',
 'Admiral',
 'Admiral',
 'Admiral',
 'Admiral',
 'And',
 'Admiral',
 'Anne',
 'Admiral',
 'Admiral',
 'And',
 'Admiral',
 'At',
 'After',
 'Anne',
 'As',
 'Admiral',
 'Admiral',
 'Admiral',
 'Admiral',
 'An',
 'Anne',
 'Anne',
 'Anne',
 'Anne',
 'Anne',
 'Anne',
 'Anne',
 'Anne',
 'Anne',
 'Anne',
 'Anne',
 'Anne',
 'All',
 'Anne',
 'Admiral',
 'Admiral',
 'Anne',
 'Admiral',
 'Admiral',
 'Admiral',
 'Admiral',
 'Anne',
 'Anne',
 'Anne',
 'Anne',
 'Anne',
 'Anne',
 'Anne',
 'Anne',
 'Anne',
 'Anne',
 'Anne',
 'And',
 'Anne',
 'Anne',
 'Anne',
 'Admiral',
 'Anne',
 'Accordingly',
 'Anne',
 'Anne',
 'Anne',
 'Anne',
 'Anne',
 'Anne',
 'Anne',
 'Anne',
 'Anne',
 'Ann

In [66]:
capital_a.sort()
capital_a

['About',
 'About',
 'Abydos',
 'Accordingly',
 'Additional',
 'Admiral',
 'Admiral',
 'Admiral',
 'Admiral',
 'Admiral',
 'Admiral',
 'Admiral',
 'Admiral',
 'Admiral',
 'Admiral',
 'Admiral',
 'Admiral',
 'Admiral',
 'Admiral',
 'Admiral',
 'Admiral',
 'Admiral',
 'Admiral',
 'Admiral',
 'Admiral',
 'Admiral',
 'Admiral',
 'Admiral',
 'Admiral',
 'Admiral',
 'Admiral',
 'Admiral',
 'Admiral',
 'Admiral',
 'Admiral',
 'Admiral',
 'Admiral',
 'Admiral',
 'Admiral',
 'Admiral',
 'Admiral',
 'Admiral',
 'Admiral',
 'Admiral',
 'Admiral',
 'Admiral',
 'Admiral',
 'Admiral',
 'Admiral',
 'Admiral',
 'Admiral',
 'Admiral',
 'Admiral',
 'Admiral',
 'Admiral',
 'Admiral',
 'Admiral',
 'Admiral',
 'Admiral',
 'Admiral',
 'Admiral',
 'Admiral',
 'Admiral',
 'Admiral',
 'Admiral',
 'Admiral',
 'Admiral',
 'Admiral',
 'Admiral',
 'Admiral',
 'Admiralty',
 'After',
 'After',
 'After',
 'After',
 'After',
 'After',
 'After',
 'After',
 'After',
 'After',
 'After',
 'After',
 'After',
 'After',
 'Af

In [67]:
set(capital_a)

{'About',
 'Abydos',
 'Accordingly',
 'Additional',
 'Admiral',
 'Admiralty',
 'After',
 'Again',
 'Ah',
 'Alarming',
 'Alas',
 'Alicia',
 'All',
 'Allowances',
 'Altered',
 'Always',
 'An',
 'And',
 'Anne',
 'Another',
 'Anxious',
 'Any',
 'Anybody',
 'Anything',
 'Archibald',
 'Archive',
 'Are',
 'As',
 'Asp',
 'At',
 'Atkinson',
 'Atlantic',
 'August',
 'Austen',
 'Author',
 'Ay',
 'Aye'}

So there are, apparently, characters called _Alicia_, _Archibald_ and _Atkinson_.

In [68]:
len(set(capital_a))

37

Can we use regex to look at all the verbs associated with Anne in _Persuasion_? Here's a first attempt:

In [70]:
annes_verbs = re.findall(r"Anne [^ ]+ed\W", persuasion)

In [71]:
annes_verbs

['Anne wanted ',
 'Anne attended ',
 'Anne included ',
 'Anne walked ',
 'Anne hoped ',
 'Anne followed ',
 'Anne suppressed ',
 'Anne offered ',
 'Anne longed ',
 'Anne felt\npersuaded,',
 'Anne distinguished ',
 'Anne smiled ',
 'Anne avoided ',
 'Anne wondered ',
 'Anne conceived ',
 'Anne enquired ',
 'Anne hazarded ',
 'Anne entered ',
 'Anne listened,',
 'Anne mentioned ',
 'Anne presumed,',
 'Anne ventured ',
 'Anne smiled ',
 'Anne viewed ',
 'Anne ventured ',
 'Anne looked ',
 'Anne walked ',
 'Anne sighed ',
 'Anne named ',
 'Anne recollected ',
 'Anne admired ',
 'Anne walked ',
 'Anne convinced ',
 'Anne delighted ',
 'Anne talked ',
 'Anne hoped ',
 'Anne struggled,',
 'Anne smiled,',
 'Anne restored ']

These aren't, of course, all of Anne's verbs. Regex only operates on sequences of characters.

We've now seen quite a lot of the regex syntax you'll ever need to find things with. To sum up:

```.``` any character

```+``` one or more of the preceding (by default, matches as much as possible: 'greedy')

```?``` one or none of the preceding

```*``` zero or more of the preceding (by default, matches as much as possible: 'greedy')

```[]``` a character class, 'find any of these, in any order'

```[^]``` a negated character class 'find anything that is not one of these'

What about if you want to find literal versions of the above, like a literal full stop?

Put a ```\``` in front of it to _escape_ it: make it not special. For example ```\?``` matches a literal question mark.
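
A short sketch of the difference escaping makes, using a made-up string:

```python
import re

sample = "Mrs. Clay and Mrs Smith"  # made-up sample string

# Unescaped: the . matches any character, including a space
unescaped = re.findall(r"Mrs.", sample)
print(unescaped)

# Escaped: \. matches only a literal full stop
escaped = re.findall(r"Mrs\.", sample)
print(escaped)
```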

#### some shortcuts

```\w``` any word character: a letter, digit or underscore

```\W``` any non-word character, including whitespace and punctuation

```[0-9]``` any digit

```[a-z]``` any lowercase letter

```[A-Z]``` any uppercase letter
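
The shortcuts are easiest to understand by trying them. This sketch runs three of them over a short made-up string.

```python
import re

sample = "Anne, born August 9"  # made-up sample string

words = re.findall(r"\w+", sample)      # runs of word characters
non_words = re.findall(r"\W", sample)   # each single non-word character
digits = re.findall(r"[0-9]+", sample)  # runs of digits

print(words)
print(non_words)
print(digits)
```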

But if you're new to regex this will still be a lot to take in. Practice is the only way to learn regex, so don't worry. The key thing is to remember that there are many situations where regex will make your life easier and you can look up the syntax any time you need to.

Last week, splitting on whitespace was too crude for us to get all the words from _Persuasion_. Regex allows us to fix that.

In [72]:
persuasion_words = re.findall(r'\w+', persuasion)

In [73]:
persuasion_words

['The',
 'Project',
 'Gutenberg',
 'eBook',
 'of',
 'Persuasion',
 'This',
 'ebook',
 'is',
 'for',
 'the',
 'use',
 'of',
 'anyone',
 'anywhere',
 'in',
 'the',
 'United',
 'States',
 'and',
 'most',
 'other',
 'parts',
 'of',
 'the',
 'world',
 'at',
 'no',
 'cost',
 'and',
 'with',
 'almost',
 'no',
 'restrictions',
 'whatsoever',
 'You',
 'may',
 'copy',
 'it',
 'give',
 'it',
 'away',
 'or',
 're',
 'use',
 'it',
 'under',
 'the',
 'terms',
 'of',
 'the',
 'Project',
 'Gutenberg',
 'License',
 'included',
 'with',
 'this',
 'ebook',
 'or',
 'online',
 'at',
 'www',
 'gutenberg',
 'org',
 'If',
 'you',
 'are',
 'not',
 'located',
 'in',
 'the',
 'United',
 'States',
 'you',
 'will',
 'have',
 'to',
 'check',
 'the',
 'laws',
 'of',
 'the',
 'country',
 'where',
 'you',
 'are',
 'located',
 'before',
 'using',
 'this',
 'eBook',
 'Title',
 'Persuasion',
 'Author',
 'Jane',
 'Austen',
 'Release',
 'date',
 'February',
 '1',
 '1994',
 'eBook',
 '105',
 'Most',
 'recently',
 'updated',

In [74]:
biggest_words = sorted(persuasion_words, key=len, reverse=True)

In [75]:
biggest_words[:10]

['unreasonableness',
 'incomprehensible',
 'misconstructions',
 'acknowledgements',
 'unenforceability',
 'disappointments',
 'disrespectfully',
 'accomplishments',
 'undesirableness',
 'unexceptionable']

In [76]:
from collections import Counter

In [77]:
mycounts = Counter(persuasion_words)
mycounts.most_common(10) # or whatever number required

[('the', 3290),
 ('to', 2852),
 ('and', 2804),
 ('of', 2675),
 ('a', 1586),
 ('in', 1404),
 ('was', 1332),
 ('had', 1176),
 ('her', 1156),
 ('I', 1125)]

What about replacing? For that, in Python, we use ```re.sub```. It works the same way as ```findall``` but we need an extra argument to the function: the thing we want to put in place of what we found. As always, the simplest possible example is a good place to start.

_Elliot_ might also be spelt _Eliot_, so the regex below allows for one or more _l_.

In [78]:
sample = "Anne Elliot"
print(sample)
sample = re.sub(r"Anne? El+iot+", "the principal character", sample)
print(sample)

Anne Elliot
the principal character


The most powerful part of replacement is re-using parts of the find, for example to add to them or move them around.

To do this, put round brackets around the part of the regex you want to recall in the replacement. This is known as a _capture group_.

In the replacement text the contents of the first set of brackets are referred to with ```\\1```, the second set with ```\\2``` and so on. In regex this is known as a _back reference_. The backslash is doubled because, in an ordinary Python string, a backslash has to be escaped itself.

In [79]:
sample = "Anne Elliot"
print(sample)
print("But let's swap the names around:")
sample = re.sub(r"(Anne?) (El+iot+)", "\\2, \\1", sample)
print(sample)

Anne Elliot
But let's swap the names around:
Elliot, Anne
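
If the doubled backslashes look awkward, the replacement can be written as a raw string instead, just like the pattern. This sketch shows that the two spellings give the same result.

```python
import re

sample = "Anne Elliot"

# Ordinary string: each backslash must itself be escaped
with_escapes = re.sub(r"(Anne?) (El+iot+)", "\\2, \\1", sample)

# Raw string: backslashes are taken literally, so one is enough
with_raw = re.sub(r"(Anne?) (El+iot+)", r"\2, \1", sample)

print(with_escapes)
print(with_raw)
```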


#### when not to use regular expressions

Because regexes work on strings, they cannot reliably _parse_ data, that is, work with its structure.

Once you get good at regex, you might be tempted to use it to parse structured data. Here's a simple example of data in _CSV_ (_comma-separated values_) format:
```
character,novel,occurrence_count
Anne Elliot,Persuasion,486
Emma Woodhouse,Emma,397
Elizabeth Bennet,Pride and Prejudice,292
Fanny Price,Mansfield Park,331
Fanny Price,Mansfield Park,331
```

If you try to extract the middle column you will be trying to parse the data with regex. This is highly unreliable and inadvisable.
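
A more reliable route is a tool that understands the format. As a sketch, Python's standard ```csv``` module can pull out the middle column by name. Here the data is held in a string called `csv_text` (an illustrative name) and wrapped in `io.StringIO` so that it behaves like a file; with a real file you would pass the open file instead.

```python
import csv
import io

# The sample data from above, held in a string for this sketch
csv_text = """character,novel,occurrence_count
Anne Elliot,Persuasion,486
Emma Woodhouse,Emma,397
Elizabeth Bennet,Pride and Prejudice,292
Fanny Price,Mansfield Park,331
"""

# DictReader uses the first row as column names
reader = csv.DictReader(io.StringIO(csv_text))

novels = []
for row in reader:
    novels.append(row["novel"])

print(novels)
```

The parser also copes with cases that trip up a regex, such as a quoted field that itself contains a comma.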

In short, regex looks for characters; it does not understand structure.

### Group work

#### finding

1. Find the context around another main character in _Persuasion_, Captain Wentworth.

2. Does Captain ever get abbreviated to _Capt._?

3. Find the word following _Anne_. Can you make a unique list of these? Can you take account of punctuation between _Anne_ and the following word?

4. Can you create an alphabetised list of all 9-letter words in _Persuasion_?

5. By default in Python, a ```.``` won't run past a ```\n``` character. Can you modify one of the above searches to include characters from the next line? You might need to look at the ```re``` [documentation](https://docs.python.org/3/library/re.html) for the answer to this.

#### replacing

Use ```re.sub``` to replace some text in ```persuasion```. If you work on the whole novel, Python will have no trouble with this, but it might be hard to see the results. You might prefer to create a slice of Persuasion of a few hundred characters, so you can see the output of your replacement more easily.

If you assign the result back to ```persuasion``` you will overwrite the original text, so you may prefer to create a string with a different variable name, eg:

```modified_persuasion = re.sub(r"search string", "replacement", persuasion)```


#### finally

Can you explain why, above, we got slightly more results for ```persuasion.count("Anne")``` than for ```re.findall(r".{0,20}Anne.{0,20}", persuasion)```? This is a bit tricky! Maybe create a small text of your own to test the way these two behave.