## Distant reading course week 2 (VT-23)

### Learning material 2b: Regular expressions

Matti La Mela

This learning material introduces regular expressions. A regular expression is a string of text (a pattern) that describes combinations of characters we want to match in other strings. Regular expressions are useful in many text mining operations: they enable advanced searches, help in cleaning data, and let us locate and parse interesting elements for analysis. An example would be to locate all year numbers, that is, occurrences of four digits in a row, in a large amount of text. In regular expressions these could be matched, for example, with \d\d\d\d (\d for any decimal digit) or the more concise \d{4}.
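As a quick illustration of the two year patterns mentioned above, here is a minimal sketch using Python's built-in `re` module (introduced in section 3). The sample sentence is made up for this illustration.

```python
import re

# Hypothetical sample sentence, invented for this illustration
sample = "The university was founded in 1477 and reopened in 1595."

# Four \d tokens in a row and the quantifier form \d{4} describe the same pattern
long_form = re.findall(r"\d\d\d\d", sample)
short_form = re.findall(r"\d{4}", sample)

print(long_form)
print(short_form)
```

Both calls should return the same list of four-digit strings, which is why the shorter form is usually preferred.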

We can use regular expressions in Python, but also in many other tools for searching and matching, e.g. Notepad++ or AntConc, which we will learn to use this week.

The reference readings for this learning material are:

o Tutorial at https://regexone.com/

o Doug Knox (2013), "Understanding Regular Expressions," Programming Historian, https://doi.org/10.46430/phen0033.

o Laura Turner O'Hara (2013), "Cleaning OCR'd text with Regular Expressions," Programming Historian, https://doi.org/10.46430/phen0024.

o Chapters 3.5 and 3.7 (some examples) in Bird, Steven, Ewan Klein, and Edward Loper (2019), Natural Language Processing with Python (NLTK 3.0), https://www.nltk.org/book/.


### 1. Tutorial

We will start with an online tutorial at https://regexone.com/. 

Now do the exercises, at least up to exercise 9.

### 2. Regex editors online

Regex editors help us test and build regular expressions for our matching purposes. Now visit https://regexr.com/ and/or https://regex101.com/, where you can enter text and then match patterns with your regular expressions.

### 3. Regular expressions in Python

Here are three regex functions with use cases: re.findall, re.search and re.sub.

In [2]:
# In Python, we use the "re" library for regular expression matching. We import it like this:

import re

# With re.findall(), we can find all occurrences of a regex in a string. findall returns a list.

# From Wikipedia: https://en.wikipedia.org/wiki/Uppsala_University

text = "Uppsala University is a public research university in Uppsala, Sweden. Founded in 1477, it is the oldest university in Sweden and the Nordic countries still in operation."

# We write the pattern as a raw string (r"...") so that Python passes the backslash on to the regex unchanged.
# We also avoid calling the variable "list", as that would hide Python's built-in list type.
years = re.findall(r"\d{4}", text)

print(years)


['1477']
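Note that \d{4} on its own also matches the first four digits of a longer number. The sketch below shows one possible way to be stricter, using the word boundary \b. The sample sentence and its catalogue number are hypothetical.

```python
import re

# Hypothetical sentence containing a year and a longer number
sample = "The item with catalogue number 5250000 was printed in 1477."

# Without boundaries, the first four digits of the long number also match
loose = re.findall(r"\d{4}", sample)

# \b requires a word boundary on both sides, so only stand-alone four-digit numbers match
strict = re.findall(r"\b\d{4}\b", sample)

print(loose)
print(strict)
```

Whether the stricter pattern is needed depends on the data; in text with many long numbers it can avoid false year matches.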


In [3]:
# re.search() enables us to match strings and get the location where the matched string can be found. It stops at the first match.

# Here we see if there are words with a capital letter:

match = re.search("[A-Z][a-z]{1,}", text)
print(match)

# re.search returns a Match object: we see the index numbers (.start() and .end()) where the word is found, and what word it is (.group())

start = match.start()
end = match.end()

print("The match is at index [" + str(start) + ":" + str(end) + "]")

word = match.group()

print("The first match was: " + word)


<re.Match object; span=(0, 7), match='Uppsala'>
The match is at index [0:7]
The first match was: Uppsala
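When nothing matches, re.search() returns None instead of a Match object, and calling .group() on None raises an error. A minimal sketch of checking for this case, using a shortened version of the text above that contains no year:

```python
import re

# Shortened sample text without any four-digit number
text = "Uppsala University is a public research university in Uppsala, Sweden."

match = re.search(r"\d{4}", text)

# re.search() returns None when the pattern is not found, so we test before using the match
if match is None:
    message = "No year found"
else:
    message = "First year found: " + match.group()

print(message)
```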


In [4]:
# Here we want to pick the words that start with a capital letter.

words = []
tokens = text.split()

for word in tokens:
    # The ^ anchors the pattern to the beginning of the word; without it, re.search would also accept a capital letter in the middle of a word.
    if re.search("^[A-Z][a-z]+", word):
        print(word + " -> starts with a capital letter")
        words.append(word)
    else:
        print(word + " -> does not start with a capital letter")

print(words)


Uppsala -> starts with a capital letter
University -> starts with a capital letter
is -> does not start with a capital letter
a -> does not start with a capital letter
public -> does not start with a capital letter
research -> does not start with a capital letter
university -> does not start with a capital letter
in -> does not start with a capital letter
Uppsala, -> starts with a capital letter
Sweden. -> starts with a capital letter
Founded -> starts with a capital letter
in -> does not start with a capital letter
1477, -> does not start with a capital letter
it -> does not start with a capital letter
is -> does not start with a capital letter
the -> does not start with a capital letter
oldest -> does not start with a capital letter
university -> does not start with a capital letter
in -> does not start with a capital letter
Sweden -> starts with a capital letter
and -> does not start with a capital letter
the -> does not start with a capital letter
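Nordic -> starts with a capital letter
countries -> does not start with a capital letter
still -> does not start with a capital letter
in -> does not start with a capital letter
operation. -> does not start with a capital letter
['Uppsala', 'University', 'Uppsala,', 'Sweden.', 'Founded', 'Sweden', 'Nordic']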
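The words collected above keep their trailing punctuation ('Uppsala,' and 'Sweden.'), because split() only separates on whitespace. One possible alternative, sketched here as an illustration rather than as the required solution, is to let re.findall() pick out the capitalised words directly with word boundaries.

```python
import re

text = "Uppsala University is a public research university in Uppsala, Sweden. Founded in 1477, it is the oldest university in Sweden and the Nordic countries still in operation."

# \b marks a word boundary, so commas and periods are left out of the matches
capitalised = re.findall(r"\b[A-Z][a-z]+\b", text)

print(capitalised)
```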

In [5]:
# Finally, we use re.sub() to replace something in our text. It is similar to .replace(), but more versatile, as we can use regular expressions.

# We want to replace all years in our text with YEAR.

output = re.sub(r"\d{4}", "YEAR", text)

print(output)



Uppsala University is a public research university in Uppsala, Sweden. Founded in YEAR, it is the oldest university in Sweden and the Nordic countries still in operation.
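re.sub() can also reuse the matched text in the replacement. In this small sketch, the parentheses in the pattern form a capture group, and \1 in the replacement refers back to it. The bracket format is just an illustrative choice.

```python
import re

text = "Founded in 1477, it is the oldest university in Sweden."

# (\d{4}) captures the year; \1 inserts the captured year into the replacement
marked = re.sub(r"(\d{4})", r"[YEAR: \1]", text)

print(marked)
```

This keeps the original year visible while still tagging it, which can be handy when the value is needed later in the analysis.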


In [7]:
# In this final example, we take Barack Obama's inaugural address (2009) and find all the words that start with a capital letter
# but are not at the beginning of a sentence.

with open("./obama_inaugural.txt", mode="r", encoding="utf-8") as file:
    text = file.read()

# We remove paragraph breaks and empty lines, replacing them with " " to keep a space between words.
# We could also convert the text to a list with split().
text = text.replace("\n\n", " ")
text = text.replace("\n", " ")

# For our regex, we want words that start with a capital letter but do not have ". " (a period and a space) before them.

# First, let's replace every capital letter that follows a period and a space with a small x, as these sentence-initial words are not something we want.

result = re.sub(r"\. [A-Z]", ". x", text)

# Then let's find all the remaining words that start with a capital letter.

capitalized_words = re.findall("[A-Z][A-Za-z]+", result)

print(capitalized_words)

    

['Obama', 'Inaugural', 'Address', 'January', 'My', 'President', 'Bush', 'Americans', 'America', 'We', 'People', 'Americans', 'America', 'America', 'Scripture', 'God', 'West', 'Concord', 'Gettysburg', 'Normandy', 'Khe', 'Sahn', 'America', 'Earth', 'America', 'Gross', 'Domestic', 'Product', 'Founding', 'Fathers', 'America', 'Iraq', 'Afghanistan', 'Christians', 'Muslims', 'Jews', 'Hindus', 'Earth', 'America', 'Muslim', 'West', 'Americans', 'Arlington', 'American', 'American', 'God', 'America', 'Let', 'America', 'God', 'God', 'United', 'States', 'America']
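The same idea can be expressed in a single pattern with a negative lookbehind, (?<!...), which requires that the text just before the match is not a period and a space. The sketch below is an alternative under that assumption, not the method used above, and it runs on a short made-up sample instead of the speech file. Like the two-step approach, it only treats ". " as a sentence boundary.

```python
import re

# Hypothetical sample text, invented for this illustration
sample = "We the people. Today America stands. God bless the United States."

# Two-step approach from the example above: mask sentence-initial capitals, then search
masked = re.sub(r"\. [A-Z]", ". x", sample)
two_step = re.findall("[A-Z][A-Za-z]+", masked)

# One-step alternative: (?<!\. ) rejects words that come directly after ". "
one_step = re.findall(r"(?<!\. )\b[A-Z][A-Za-z]+", sample)

print(two_step)
print(one_step)
```

On this sample both approaches give the same list. The very first word of the text is kept in both cases, since nothing precedes it.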
