# RegEx and Frankenstein

The online RegEx tester [regex101](https://regex101.com/) is a very helpful site for learning how to use regular expressions (RegEx).

[W3schools](https://www.w3schools.com/python/python_regex.asp) also has a very useful page about RegEx.

RegEx is very widely used because it is well suited to text processing: it lets you perform advanced searches. It is used in search engines and in search-and-replace functions. Working with RegEx takes some getting used to, but once you see the range of tasks it can solve, you realize how powerful a tool it is.

This notebook doesn't try to teach you everything about RegEx. It aims to give you a first feel for it, and only a few of the possibilities are illustrated below.

In addition to RegEx, this notebook contains many loops and list comprehensions, so you can also get an insight into how to write them.

## Get some data

In [3]:
import urllib.request

url = 'https://gutenberg.org/cache/epub/84/pg84.txt'
# Download the book and decode the bytes to a string (decode() defaults to UTF-8)
raw_text = urllib.request.urlopen(url).read().decode()

# Project Gutenberg wraps the book in START/END marker lines; keep only the text between them
start_marker = '*** START OF THE PROJECT GUTENBERG EBOOK FRANKENSTEIN; OR, THE MODERN PROMETHEUS ***'
end_marker = '*** END OF THE PROJECT GUTENBERG EBOOK FRANKENSTEIN; OR, THE MODERN PROMETHEUS ***'
text_start = raw_text.find(start_marker) + len(start_marker)
text_end = raw_text.find(end_marker)
text = raw_text[text_start:text_end].strip()

In [4]:
text[0:100]

'Frankenstein;\r\n\r\nor, the Modern Prometheus\r\n\r\nby Mary Wollstonecraft (Godwin) Shelley\r\n\r\n\r\n CONTENTS'
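The slicing above relies on `str.find`, which returns -1 when a marker is missing, so a changed header line would silently produce a wrong slice. Here is a minimal sketch of a more defensive variant. The helper name `strip_gutenberg_wrapper`, the shortened markers and the sample string are all made up for illustration and are not part of the notebook:

```python
def strip_gutenberg_wrapper(raw_text, start_marker, end_marker):
    """Return the text between two marker lines (hypothetical helper)."""
    start = raw_text.find(start_marker)
    end = raw_text.find(end_marker)
    # str.find returns -1 when the marker is absent; fail loudly instead of slicing wrongly
    if start == -1 or end == -1:
        raise ValueError("marker not found")
    return raw_text[start + len(start_marker):end].strip()


# Made-up sample that imitates the layout of a Project Gutenberg file
sample = "header\r\n*** START ***\r\nFrankenstein;\r\n\r\nor, the Modern Prometheus\r\n*** END ***\r\nlicence"
body = strip_gutenberg_wrapper(sample, "*** START ***", "*** END ***")
print(repr(body))
```

With the real download you would pass `raw_text` and the two full marker strings instead of the sample.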

# Metacharacters \b, \S and \w, and the + sign

We often need to clean the texts of symbols such as commas and full stops, etc.

Text can be cleaned in several ways. Below we try out two of them. We begin by importing Python's RegEx module (import re).

In [5]:
import re

## First way
The RegEx pattern is '\b\S+\b'.

\b: matches the position at the boundary of a word (word boundary).
\S: matches any non-whitespace character.
+: matches the preceding token one or more times, as many times as possible while still letting the rest of the pattern match. This is why the plus is called greedy.
\b: again matches a word boundary.

Put together, \b\S+\b matches a run of non-whitespace characters that starts and ends at a word boundary. Punctuation inside the run, such as an apostrophe or a dash, is kept, while punctuation at either end, such as periods, commas and question marks, is left out. Underscores are kept, because they count as word characters.
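To see what the first pattern does, here is a small sketch on a made-up sentence (the sample string is an assumption, not taken from the novel's cleaned output):

```python
import re

# Made-up sample with an apostrophe, brackets and trailing punctuation
sample = "About two o'clock (Godwin) said: stop, now!"

# \S+ first grabs everything up to the next whitespace, then gives characters
# back until the final \b finds a word boundary
tokens = re.findall(r'\b\S+\b', sample)
print(tokens)
# → ['About', 'two', "o'clock", 'Godwin', 'said', 'stop', 'now']
```

The apostrophe inside o'clock survives, while the brackets, colon, comma and exclamation mark at the edges are dropped.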


## Second way
\w: matches any letter (uppercase or lowercase), any digit, or an underscore (_).
+: matches the preceding token one or more times.

When you put \w+ together, you match whole words composed of letters, digits and underscores.
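A small sketch of the second pattern on a made-up sentence (the sample string is an assumption for illustration):

```python
import re

# Made-up sample with an apostrophe, brackets and an underscore
sample = "About two o'clock (Godwin) said: stop_now!"

# \w+ matches runs of letters, digits and underscores; everything else splits the run
words = re.findall(r'\w+', sample)
print(words)
# → ['About', 'two', 'o', 'clock', 'Godwin', 'said', 'stop_now']
```

Here the apostrophe splits o'clock into two pieces, while the underscore keeps stop_now together.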

In [6]:
def clean_text_1(text):
    # Use the \b\S+\b regex pattern to extract tokens
    words = re.findall(r'\b\S+\b', text)

    # Join the extracted words into a cleaned text
    cleaned_text = ' '.join(words)

    return cleaned_text


def clean_text_2(text):
    # Use \w+ regex pattern to extract words
    words = re.findall(r'\w+', text)

    # Join the extracted words into a cleaned text
    cleaned_text = ' '.join(words)

    return cleaned_text

In [7]:
cleaned_text = clean_text_1(text)

print(cleaned_text[:100])

Frankenstein or the Modern Prometheus by Mary Wollstonecraft Godwin Shelley CONTENTS Letter 1 Letter


In [8]:
cleaned_text = clean_text_2(text)

print(cleaned_text[:100])

Frankenstein or the Modern Prometheus by Mary Wollstonecraft Godwin Shelley CONTENTS Letter 1 Letter


Let us look at the two results and compare them.

Search e.g. after There—for
In the first method, this remains a single token. In the second method, it becomes two words.

Search, for example, for About two o'clock.

In the first method, o'clock remains a single word. In the second method, it becomes two words, "o clock".

Both methods leave us with underscores.
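If the underscores are unwanted, one possible approach is a character class that accepts word characters but excludes the underscore. This is a sketch on a made-up sample, not something the notebook itself does:

```python
import re

# Made-up sample containing underscores
sample = "the _day_ was long_awaited"

with_underscores = re.findall(r'\w+', sample)
# [^\W_] reads "not a non-word character and not an underscore",
# i.e. letters and digits only
without_underscores = re.findall(r'[^\W_]+', sample)

print(with_underscores)
# → ['the', '_day_', 'was', 'long_awaited']
print(without_underscores)
# → ['the', 'day', 'was', 'long', 'awaited']
```

Note that this also splits words joined by an underscore, which may or may not be what you want.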

# \w+ along with \b

Why doesn't anything happen on a Friday?

Finding words with particular endings, e.g. _day_, can help you gain insight into where and when a literary work takes place.

You can also use endings to find grammatical forms. For example, words with a long suffix are relatively easy to identify.

In [9]:
ending = re.findall(r'\w+day\b', text)
print(ending)

['yesterday', 'holiday', 'Monday', 'Yesterday', 'Sunday', 'Thursday', 'today', 'today', 'yesterday', 'yesterday', 'everyday']
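Two details of this pattern are easy to miss: `\w+` needs at least one character before "day", so the bare word "day" is not matched, and the final `\b` rules out words that merely start with "day". The sketch below shows this on a made-up sentence and counts the hits with `collections.Counter`. The sample and the counting step are assumptions added for illustration:

```python
import re
from collections import Counter

# Made-up sample; "day" and "daylight" should not be matched
sample = "On Monday and Friday we rest. Every day, yesterday included, ends at daylight. Monday again."

endings = re.findall(r'\w+day\b', sample)
print(endings)
# → ['Monday', 'Friday', 'yesterday', 'Monday']

# Count the hits, ignoring case so that 'Yesterday' and 'yesterday' would be merged
counts = Counter(word.lower() for word in endings)
print(counts.most_common())
```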


# More metacharacters, as well as pipes, lists and question marks

In literature, comparisons are often used to illustrate a point more clearly by painting a picture of what is being described. Comparisons also help make the text more lively and interesting.

RegEx makes it a manageable task to retrieve examples of comparisons in Frankenstein, because we can find text strings that follow the pattern of a typical comparison.

We can illustrate it in the following way. We look for phrases whose pattern is either as a ... or as an ....

The RegEx pattern can be written like this:

'as\sa\s\w+'

The word 'as' is followed by \s, meaning whitespace, followed by a, then by \s again, followed by \w, meaning a word character, followed by +, meaning "one or more of the preceding".


If you also want to search for "as an ..." there are two ways to do it.


The first way is to use the pipe |, which means "or". The regex pattern then looks like this: 'as\sa\s\w+|as\san\s\w+'

Another way is to use square brackets [] followed by a question mark ?.

It looks like this: 'as\sa[n]?\s\w+'. The square brackets hold the characters that are allowed at that position in the word. The question mark indicates that the character may or may not be there.
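The two ways should return the same matches. The sketch below checks this on a made-up sentence and also shows a caveat: without a word boundary, the pattern can match inside a longer word such as "has" or "was". The variant with a leading `\b` is a suggestion added here, not part of the original notebook:

```python
import re

# Made-up sample; note the "as a" hidden inside "has a plan"
sample = "She sang as a child and as an infant. He has a plan."

with_pipe = re.findall(r'as\sa\s\w+|as\san\s\w+', sample)
with_optional = re.findall(r'as\sa[n]?\s\w+', sample)
print(with_pipe)
# → ['as a child', 'as an infant', 'as a plan']

# A leading \b only accepts "as" at the start of a word
with_boundary = re.findall(r'\bas\san?\s\w+', sample)
print(with_boundary)
# → ['as a child', 'as an infant']
```

Since the brackets contain a single letter, `an?` behaves the same as `a[n]?`.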

In [10]:
comparison = re.findall(r'as\sa\s\w+', cleaned_text)
print(comparison)

['as a steady', 'as a child', 'as a most', 'as a Turk', 'as a remarkably', 'as a human', 'as a little', 'as a brother', 'as a double', 'as a halo', 'as a merchant', 'as a considerable', 'as a sense', 'as a show', 'as a fair', 'as a restorative', 'as a necessity', 'as a German', 'as a boy', 'as a promise', 'as a deformed', 'as a strong', 'as a narrow', 'as a little', 'as a dream', 'as a certain', 'as a bold', 'as a mystery', 'as a most', 'as a proof', 'as a tendency', 'as a divine', 'as a widow', 'as a servant', 'as a great', 'as a judgement', 'as a Roman', 'as a miniature', 'as a new', 'as a strange', 'as a girl', 'as a proof', 'as a dire', 'as a murderer', 'as a wretch', 'as a creature', 'as a murderess', 'as a wreck', 'as a lullaby', 'as a poor', 'as a new', 'as a little', 'as a small', 'as a lovely', 'as a lady', 'as a guide', 'as a vagabond', 'as a Turkish', 'as a Christian', 'as a boarder', 'as a distant', 'as a listener', 'as a true', 'as a luxury', 'as a fool', 'as a recompense'

In [11]:
comparison = re.findall(r'as\sa\s\w+|as\san\s\w+', cleaned_text)
print(comparison)

['as a steady', 'as a child', 'as an under', 'as a most', 'as a Turk', 'as a remarkably', 'as a human', 'as a little', 'as a brother', 'as a double', 'as a halo', 'as a merchant', 'as a considerable', 'as a sense', 'as a show', 'as a fair', 'as a restorative', 'as an infant', 'as a necessity', 'as a German', 'as a boy', 'as an inferior', 'as a promise', 'as a deformed', 'as a strong', 'as a narrow', 'as an uncouth', 'as a little', 'as a dream', 'as a certain', 'as an easier', 'as a bold', 'as a mystery', 'as a most', 'as a proof', 'as a tendency', 'as a divine', 'as an odious', 'as a widow', 'as a servant', 'as a great', 'as a judgement', 'as a Roman', 'as an irresistible', 'as an historical', 'as an air', 'as a miniature', 'as a new', 'as a strange', 'as a girl', 'as a proof', 'as a dire', 'as a murderer', 'as a wretch', 'as a creature', 'as a murderess', 'as a wreck', 'as a lullaby', 'as a poor', 'as a new', 'as a little', 'as a small', 'as a lovely', 'as a lady', 'as a guide', 'as a

In [12]:
comparison = re.findall(r'as\sa[n]?\s\w+', cleaned_text)
print(comparison)

['as a steady', 'as a child', 'as an under', 'as a most', 'as a Turk', 'as a remarkably', 'as a human', 'as a little', 'as a brother', 'as a double', 'as a halo', 'as a merchant', 'as a considerable', 'as a sense', 'as a show', 'as a fair', 'as a restorative', 'as an infant', 'as a necessity', 'as a German', 'as a boy', 'as an inferior', 'as a promise', 'as a deformed', 'as a strong', 'as a narrow', 'as an uncouth', 'as a little', 'as a dream', 'as a certain', 'as an easier', 'as a bold', 'as a mystery', 'as a most', 'as a proof', 'as a tendency', 'as a divine', 'as an odious', 'as a widow', 'as a servant', 'as a great', 'as a judgement', 'as a Roman', 'as an irresistible', 'as an historical', 'as an air', 'as a miniature', 'as a new', 'as a strange', 'as a girl', 'as a proof', 'as a dire', 'as a murderer', 'as a wretch', 'as a creature', 'as a murderess', 'as a wreck', 'as a lullaby', 'as a poor', 'as a new', 'as a little', 'as a small', 'as a lovely', 'as a lady', 'as a guide', 'as a

# Curly brackets

Keyword in context: find a text snippet based on a keyword and a range of characters around it.

We want to find text extracts that contain Turk or Roman, because we want to go into the text and see exactly how the terms are used.

For this we use the full stop ( . ), which matches any character except a line break, and {30}, which repeats it exactly 30 times. This gives us the 30 characters before the letters Turk.

The .{30} after Turk gives us the 30 characters that follow.
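Because `{30}` demands exactly 30 characters on each side, a hit that lies closer than 30 characters to the start or end of the text is skipped. One possible alternative is the range form `{0,30}`, which accepts up to 30 characters. The sketch uses made-up sample strings and is not the notebook's own approach:

```python
import re

# Made-up sample that is shorter than 30 characters on both sides of the keyword
short_sample = "The Turk left."

fixed = re.findall(r'.{30}Turk.{30}', short_sample)
flexible = re.findall(r'.{0,30}Turk.{0,30}', short_sample)
print(fixed)
# → []
print(flexible)
# → ['The Turk left.']

# The dot does not cross line breaks unless re.DOTALL is set
multiline = "a\nTurk\nb"
print(re.findall(r'.{0,5}Turk.{0,5}', multiline))
# → ['Turk']
print(re.findall(r'.{0,5}Turk.{0,5}', multiline, flags=re.DOTALL))
# → ['a\nTurk\nb']
```

The line-break behaviour matters if you search the raw text instead of the cleaned text, since the cleaned text no longer contains line breaks.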

Try to see if you can use some of what has been reviewed above to also include text extracts that contain the word Roman.

In [13]:
re.findall(r'.{30}Turk.{30}', cleaned_text)

['educated he is as silent as a Turk and a kind of ignorant carele',
 ' cause of their ruin He was a Turkish merchant and had inhabited',
 ' intentions in his favour The Turk amazed and delighted endeavou',
 'eward his toil and hazard The Turk quickly perceived the impress',
 'eized and made a slave by the Turks recommended by her beauty sh',
 ' day for the execution of the Turk was fixed but on the night pr',
 'passing into some part of the Turkish dominions Safie resolved t',
 'parture before which time the Turk renewed his promise that she ',
 'irs of her native country The Turk allowed this intimacy to take',
 ' He quickly arranged with the Turk that if the latter should fin',
 ' learned that the treacherous Turk for whom he and his family en',
 'it but the ingratitude of the Turk and the loss of his beloved S',
 ' mandate A few days after the Turk entered his daughter s apartm',
 'this emergency A residence in Turkey was abhorrent to her her re',
 'rstood the common language of Tu

# Square brackets [A-Z]

Find words that start with capital letters

In [14]:
upper_case_word = re.findall(r'[A-Z]\w+', text)
print(upper_case_word[:20])

['Frankenstein', 'Modern', 'Prometheus', 'Mary', 'Wollstonecraft', 'Godwin', 'Shelley', 'CONTENTS', 'Letter', 'Letter', 'Letter', 'Letter', 'Chapter', 'Chapter', 'Chapter', 'Chapter', 'Chapter', 'Chapter', 'Chapter', 'Chapter']
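The pattern `[A-Z]\w+` has two properties worth knowing: it needs at least two characters, so one-letter words such as "I" are skipped, and it can start in the middle of a word. A possible stricter variant, shown on a made-up sentence, anchors the capital letter at a word boundary and uses `*` instead of `+`:

```python
import re

# Made-up sample with one-letter words and a capital inside a word
sample = "I saw Mary in Geneva. A storm came; eBook and McDonald."

loose = re.findall(r'[A-Z]\w+', sample)
print(loose)
# → ['Mary', 'Geneva', 'Book', 'McDonald']

# \b forces the capital letter to be the first letter of the word,
# and \w* also allows one-letter words
strict = re.findall(r'\b[A-Z]\w*', sample)
print(strict)
# → ['I', 'Mary', 'Geneva', 'A', 'McDonald']
```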


Many of these words are capitalized because they appear after a period, and thus are not what I would call "real" capitalized words.

If you want to filter these "inauthentic" words out of your list, you can write a loop with a condition that checks whether the word also occurs in lowercase elsewhere in the text. If it does, it is "fake".

In [15]:
true_upper_case = []
for word in upper_case_word:
    if word.lower() not in text:
        true_upper_case.append(word)
print(true_upper_case[0:20])

['Frankenstein', 'Prometheus', 'Mary', 'Wollstonecraft', 'Godwin', 'Shelley', 'Chapter', 'Chapter', 'Chapter', 'Chapter', 'Chapter', 'Chapter', 'Chapter', 'Chapter', 'Chapter', 'Chapter', 'Chapter', 'Chapter', 'Chapter', 'Chapter']
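The condition `word.lower() not in text` is a substring test, so a word is also treated as "fake" when its lowercase form merely occurs inside another word. The output also contains many repeats. A minimal sketch of a word-level check with duplicates removed, using a made-up sample and illustrative variable names:

```python
import re

# Made-up sample: "art" only occurs in lowercase inside "start"
sample = "The storm passed. Mary watched the storm. Art is long, so we start again. Mary smiled."
capitalised = re.findall(r'[A-Z]\w+', sample)

# Substring test, as in the loop above: 'art' is found inside 'start'
substring_check = [w for w in capitalised if w.lower() not in sample]
print(substring_check)
# → ['Mary', 'Mary']

# Word-level test: compare against the set of whole words that start in lowercase
lowercase_words = set(re.findall(r'\b[a-z]\w*', sample))
word_check = [w for w in capitalised if w.lower() not in lowercase_words]

# dict.fromkeys keeps the first occurrence of each word, in order
unique = list(dict.fromkeys(word_check))
print(unique)
# → ['Mary', 'Art']
```

Which of the two tests is preferable depends on how strict the filter should be. The set lookup is also faster on a long text than repeated substring searches.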
