# CompUp Workshop

17th of November, 2021

Thomas Arildsen, CLAAUDIA

# Text processing with regular expressions

What are regular expressions?

> A **regular expression** (shortened as **regex** or **regexp**; also referred to as rational expression) is a sequence of characters that specifies a search pattern. Usually such patterns are used by string-searching algorithms for "find" or "find and replace" operations on strings, or for input validation. It is a technique developed in theoretical computer science and formal language theory.

\[[Wikipedia](https://en.wikipedia.org/wiki/Regular_expression)\]

So, a "regex" is a sort of special text string that specifies a pattern you want to search for.

A simpler example that may be familiar to some is the "wildcard" text string:

    some*
    
- Matches: "something", "some", "somewhere", etc.
- Often used when listing / searching for files:

    !ls *.ipynb

"Wildcard" text strings are just a simple example of specifying text patterns.
- Regular expressions are more general than this
- Regular expressions are a combination of *literal characters* and *meta-characters*
- Literal characters: for example ordinary letters that should be taken as-is: "b"
- Meta-characters: typically special or punctuation characters that serve a functional role in the regex: "." (matches any 1 character)
  
      b.

- Matches "be", "by", "b5", "b:", etc. *Not* "beer", "banana", etc.
- Notice how the meta-character "." matches only 1 character. Must be combined with more meta-characters to match longer string parts.
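The behaviour of `b.` can be checked with a small sketch. It looks ahead to Python's `re` module, introduced in the next section, and uses `re.fullmatch` so that a string only counts if the *whole* string fits the pattern. The list `candidates` is just the examples from the bullets above.

```python
import re

# Example strings taken from the bullets above
candidates = ["be", "by", "b5", "b:", "beer", "banana"]

# re.fullmatch succeeds only if the entire string matches the pattern,
# so "beer" and "banana" (longer than two characters) are rejected
matches = [s for s in candidates if re.fullmatch(r'b.', s)]
print(matches)  # ['be', 'by', 'b5', 'b:']
```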

# Regular expressions in Python

Python has a module with functionality for working with regular expressions: `re`
- Part of the Python standard library, so no need to install anything extra


## How to write regex patterns in Python

`\` in Python strings is used to "escape" certain special characters.
- For example, we must type `'\\'` to get a backslash or `'\n'` to get a new line - try it:
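A minimal sketch of the two escapes mentioned above; the variable names are only for illustration.

```python
# '\\' is an escape sequence that produces ONE backslash character
backslash = '\\'
print(backslash)
print(len(backslash))  # 1

# '\n' produces a single newline character, so this prints on two lines
two_lines = 'first\nsecond'
print(two_lines)
```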

In regex, `\` is also used to denote special sequences - as we shall see shortly.
- Can "collide" with Python's "escaping" of characters in text strings - makes it cumbersome to write special sequences in regex.
- Solution: use Python's "raw text strings" for regex:
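A small sketch of the collision and how a raw string avoids it. It assumes the regex special sequences `\d` (a digit) and `\b` (a word boundary), which are documented for the `re` module but not yet introduced at this point.

```python
import re

# In a normal string the backslash must be doubled to reach the regex engine;
# in a raw string (prefix r) it is passed through unchanged
normal = '\\d'
raw = r'\d'
print(normal == raw)  # True

# '\b' in a normal string is a backspace character, not a regex word boundary,
# so this pattern looks for literal backspaces and finds nothing
print(re.search('\bcat\b', 'a cat sat'))  # None

# The raw string hands \b to the regex engine, where it means "word boundary"
boundary_match = re.search(r'\bcat\b', 'a cat sat')
print(boundary_match.group())  # cat
```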

## Searching with regex

Now we are ready to try some regex searching. What does `re.search` return?
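A minimal sketch, using a made-up `haystack` string: `re.search` returns a match object when the pattern is found somewhere in the string, and `None` when it is not.

```python
import re

haystack = 'a haystack with a needle in it'  # made-up example text

# A successful search returns a match object describing the first hit
found = re.search(r'needle', haystack)
print(found.group())  # needle
print(found.span())   # (18, 24)

# An unsuccessful search returns None
missing = re.search(r'thimble', haystack)
print(missing)  # None

# Because None is "falsy", the result can be used directly in an if-statement
if found:
    print('Found it at position', found.start())
```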

# Regex syntax

Now we know a little better how `re.search` behaves. Time to look closer at how to type regex.
- So far, we have seen simple examples with literal characters

      r'needle'
      
- The real strength of regex comes from meta-characters, or special characters, and special sequences

`.`
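As a sketch of how `.` behaves, the example below uses `re.findall`, which returns all non-overlapping matches as a list. The `text` string is made up for illustration.

```python
import re

text = 'bat bit b.t boot'  # made-up example text

# "." stands for any single character (except a newline, by default)
any_char = re.findall(r'b.t', text)
print(any_char)  # ['bat', 'bit', 'b.t']

# To match a literal full stop, escape the meta-character with a backslash
literal_dot = re.findall(r'b\.t', text)
print(literal_dot)  # ['b.t']

# By default "." does not match a newline character
print(re.search(r'b.t', 'b\nt'))  # None
```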

# Sifting through Twitter data

Let us put regex search to some real use. We download a selection of tweets from Twitter mentioning the hashtag '#metoo'.

Save [this file](https://gist.githubusercontent.com/ThomasA/9c524894e17d56b211c51cdc34c404ca/raw/8cfade6dbe859999dcfaaba3cc5d96f46cd43da9/twitter.csv).


    !wget https://gist.githubusercontent.com/ThomasA/9c524894e17d56b211c51cdc34c404ca/raw/8cfade6dbe859999dcfaaba3cc5d96f46cd43da9/twitter.csv
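One possible way to sift through the downloaded file is to search it line by line. The sketch below is only an assumption about how this could be done: the helper `matching_lines` is hypothetical, and the `sample` lines are made-up stand-ins, not actual content from `twitter.csv`.

```python
import re

def matching_lines(lines, pattern):
    """Return the lines in which the regex `pattern` is found (hypothetical helper)."""
    return [line for line in lines if re.search(pattern, line)]

# Made-up stand-in lines, NOT taken from twitter.csv
sample = [
    'I support #metoo',
    'lunch was great',
    'read this #metoo thread',
]

hits = matching_lines(sample, r'#metoo')
print(hits)  # ['I support #metoo', 'read this #metoo thread']
```

With the real data, the same helper could be given an open file instead of a list, for example `with open('twitter.csv', encoding='utf-8') as f: hits = matching_lines(f, r'#metoo')`, assuming the file was saved in the working directory and is UTF-8 encoded.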

## Character classes

Specifying the characters of a word with a class like `r'[A-Za-z]'` may be a bit cumbersome.
- Quicker way of specifying character classes like words (`\w` matches "word" *characters*; `[a-zA-Z0-9_]` in ASCII):
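A short sketch comparing the two ways of matching words. It assumes the `+` quantifier ("one or more of the preceding") and the `re.ASCII` flag from the `re` documentation; the example strings are made up.

```python
import re

text = 'Regex is fun_2021!'  # made-up example text

# Explicit character class: letters only
letters = re.findall(r'[A-Za-z]+', text)
print(letters)  # ['Regex', 'is', 'fun']

# \w also covers digits and the underscore
words = re.findall(r'\w+', text)
print(words)  # ['Regex', 'is', 'fun_2021']

# For ordinary (Unicode) strings, \w matches non-ASCII letters as well;
# the re.ASCII flag restricts it to [a-zA-Z0-9_]
unicode_words = re.findall(r'\w+', 'smørrebrød')
ascii_words = re.findall(r'\w+', 'smørrebrød', flags=re.ASCII)
print(unicode_words)  # ['smørrebrød']
print(ascii_words)    # ['sm', 'rrebr', 'd']
```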

See more character classes and regex features [in the documentation](https://docs.python.org/3.8/library/re.html#index-24).


## Substitution

Regex can also be used for more things than just searching:
- Substitution:
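Substitution can be sketched with `re.sub`, which replaces every match of a pattern with a replacement string. The example text is made up, and `\d+` (one or more digits) is an assumption about which pattern to illustrate.

```python
import re

text = 'The year 2021 follows 2020.'  # made-up example text

# Replace every run of digits with the word YEAR
all_replaced = re.sub(r'\d+', 'YEAR', text)
print(all_replaced)  # The year YEAR follows YEAR.

# The optional count argument limits the number of replacements
first_only = re.sub(r'\d+', 'YEAR', text, count=1)
print(first_only)  # The year YEAR follows 2020.
```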