#### Our first examples with metacharacters

To use regular expressions, we need to be familiar with a few special characters called metacharacters. Metacharacters let us match more than just fixed substrings. Let’s look at an example. Suppose we had the following Series of strings:

In [1]:
import pandas as pd
import numpy as np

s = pd.Series(
    [
        "0",
        "John Wood",
        "Colin Welsh",
        "my list",
        "02456",
        np.nan,
        "HELLO WORLD",
        "water%",
    ]
)

In [2]:
# we want to check whether each entry contains the string ‘John’
s.str.contains("John")


0    False
1     True
2    False
3    False
4    False
5      NaN
6    False
7    False
dtype: object

In [3]:
# What if we wanted to check whether an entry contains the string ‘John’ or ‘Colin’?
s.str.contains("John") | s.str.contains("Colin")

0    False
1     True
2     True
3    False
4    False
5    False
6    False
7    False
dtype: bool

In [4]:
# We can use the metacharacter |, which acts as the “or” operator.
s.str.contains("John|Colin")

0    False
1     True
2     True
3    False
4    False
5      NaN
6    False
7    False
dtype: object
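The NaN in the outputs above comes from the missing value at index 5: by default, str.contains passes missing values through. As a minimal sketch, assuming we would rather count missing entries as non-matches, we can pass na=False so that the result is a plain boolean mask that can be used for filtering. The names names, mask and matches are illustrative only.

```python
import numpy as np
import pandas as pd

names = pd.Series(["John Wood", "Colin Welsh", np.nan, "my list"])

# na=False reports a missing entry as False instead of NaN,
# so the result is a boolean mask
mask = names.str.contains("John|Colin", na=False)

# The mask can be used to select the matching entries
matches = names[mask]
print(matches.tolist())  # ['John Wood', 'Colin Welsh']
```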

Another useful metacharacter is the period (.), which matches any single character. In the next example we are looking for substrings of the form xar, where x can be any character. So, as long as the substring ar is found and there is at least one other character in front of the a, the result will be True.

In [6]:
s2 = pd.Series(["bar", "sugar", "cartoon", "argon"])
s2.str.contains(".ar")

0     True
1     True
2     True
3    False
dtype: bool

#### Matching sets of characters

Another very common pair of metacharacters is the square brackets []. Inside the brackets, we specify a set of characters, any one of which may match at that position. For example, [bc]ar matches bar or car:

In [7]:
s2.str.contains("[bc]ar")

0     True
1    False
2     True
3    False
dtype: bool

Inside the brackets we can also give ranges of characters:

* [a-z] - match any lowercase letter
* [A-Z] - match any uppercase letter
* [0-9] - match any digit
* [a-zA-Z0-9] - match any letter or digit

For example, we can search for all strings in the Series s that contain a digit:

In [8]:
s[s.str.contains("[0-9]", na=False)]

0        0
4    02456
dtype: object

Placing the ^ symbol directly after the opening square bracket matches any character NOT in the set. So we have:

* [^a-z] - match any character that is not a lowercase letter
* [^A-Z] - match any character that is not an uppercase letter
* [^0-9] - match any character that is not a digit
* [^a-zA-Z0-9] - match any character that is not a letter or digit
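To make the negated sets concrete, here is a small sketch on a made-up Series (the names items, has_non_digit and has_special are illustrative). Note that a negated set still has to match one character, so the check asks whether the string contains at least one character outside the set.

```python
import pandas as pd

items = pd.Series(["02456", "water%", "abc", "HELLO WORLD"])

# [^0-9] matches any single character that is not a digit, so the result
# is True for every string with at least one non-digit character
has_non_digit = items.str.contains("[^0-9]")
print(has_non_digit.tolist())  # [False, True, True, True]

# [^a-zA-Z0-9] matches anything that is neither a letter nor a digit,
# such as the percent sign or a space
has_special = items.str.contains("[^a-zA-Z0-9]")
print(has_special.tolist())  # [False, True, False, True]
```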

On top of this, we can use shorthand for some common character sets:

* \d - match any digit
* \D - match any non-digit
* \w - match any alphanumeric character (letter or digit) or an underscore (_)
* \W - match any character that is not alphanumeric or an underscore as described above
* \s - match whitespace (spaces, tabs, newlines, etc.)
* \S - match non-whitespace

Here, then, is another way to find all strings containing a digit.

In [9]:
# Use a raw string (r"...") so that Python does not treat \d as a string escape
s[s.str.contains(r"\d", na=False)]

0        0
4    02456
dtype: object
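The other shorthand classes work the same way. The sketch below runs a few of them on a made-up Series (the name codes and its contents are illustrative) and uses raw strings throughout, so the backslashes reach the regex engine unchanged.

```python
import pandas as pd

codes = pd.Series(["A_1", "hello world", "42", "%%"])

# \d - at least one digit
has_digit = codes.str.contains(r"\d")
print(has_digit.tolist())      # [True, False, True, False]

# \s - at least one whitespace character
has_space = codes.str.contains(r"\s")
print(has_space.tolist())      # [False, True, False, False]

# \W - at least one character that is not a letter, digit or underscore
has_non_word = codes.str.contains(r"\W")
print(has_non_word.tolist())   # [False, True, False, True]

# \D - at least one character that is not a digit
has_non_digit = codes.str.contains(r"\D")
print(has_non_digit.tolist())  # [True, True, False, True]
```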

#### Matching at the start and end of strings

We can also specify where in the string a match must occur by using:

* ^ - match at the beginning of a string
* $ - match at the end of a string

Note that outside square brackets, ^ anchors the match to the start of the string; it only means NOT when it comes first inside square brackets.

Suppose we want to find the strings in s2 that start with the letter 'b' or 'c'. Then we can write:


In [10]:
s2[s2.str.contains("^[bc]", na=False)]

0        bar
2    cartoon
dtype: object

In [11]:
# Similarly, find the strings that end with "ar"
s2[s2.str.contains("ar$", na=False)]

0      bar
1    sugar
dtype: object
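The two anchors can also be combined to describe the whole string. As a sketch on a made-up Series (entries is an illustrative name), the pattern ^[0-9]$ only matches strings that consist of exactly one digit, because the digit must sit at both the start and the end.

```python
import pandas as pd

entries = pd.Series(["0", "02456", "room 101", "7 days", "water%"])

# A digit anywhere in the string
contains_digit = entries.str.contains(r"[0-9]")
print(contains_digit.tolist())     # [True, True, True, True, False]

# A digit as the first character
starts_with_digit = entries.str.contains(r"^[0-9]")
print(starts_with_digit.tolist())  # [True, True, False, True, False]

# A digit as the last character
ends_with_digit = entries.str.contains(r"[0-9]$")
print(ends_with_digit.tolist())    # [True, True, True, False, False]

# The whole string is a single digit
single_digit = entries.str.contains(r"^[0-9]$")
print(single_digit.tolist())       # [True, False, False, False, False]
```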

#### Matching preceding characters

Often we want to match a character that is repeated, or one that may or may not be present. We can do this with the following metacharacters, which apply to the character immediately before them:

* `*` - match zero or more copies of the preceding character
* `?` - match zero or one copy of the preceding character
* `+` - match one or more copies of the preceding character

These symbols are written bare in a pattern. Inside square brackets, as in [*], they would instead match the literal symbol.

Or we can use curly braces to specify how many times we want to match the preceding element. We have the following choices:

* {m} - match the preceding element exactly m times
* {m,} - match the preceding element m times or more
* {m,n} - match the preceding element between m and n times (inclusive)

Let’s look at an example.



In [14]:
s3 = pd.Series(["forest", "o", "ff", "foo", "fof"])
s3.str.contains("f+o?f+")

0    False
1    False
2     True
3    False
4     True
dtype: bool

This will search for all strings that contain 1 or more f’s, then an optional o, and finally 1 or more f’s. We can see that the third and fifth strings satisfy this pattern, as shown in the output.
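The curly-brace forms can be tried out in the same way. In this sketch (the Series words is made up for illustration), the anchors ^ and $ pin the pattern to the whole string, so only the number of o's decides whether a string matches.

```python
import pandas as pd

words = pd.Series(["fo", "foo", "fooo", "foooo"])

# {2} - an f followed by exactly two o's
exactly_two = words.str.contains(r"^fo{2}$")
print(exactly_two.tolist())   # [False, True, False, False]

# {2,} - an f followed by two or more o's
two_or_more = words.str.contains(r"^fo{2,}$")
print(two_or_more.tolist())   # [False, True, True, True]

# {2,3} - an f followed by two or three o's
two_to_three = words.str.contains(r"^fo{2,3}$")
print(two_to_three.tolist())  # [False, True, True, False]
```

Without the anchors, all three patterns would also match "foooo", since str.contains only needs the pattern to occur somewhere in the string.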

#### Grouping

Groups are parts of regular expression patterns enclosed in parentheses (e.g. (abc)). We use them to combine smaller regular expressions into larger ones.

For example, when used with the str.extract() method, each captured group is returned in its own column of a DataFrame. Let’s look at an example.

In [15]:
s4 = pd.Series(["Monday5km", "Wednesday10km", "Saturday25km"])
# Extract weekday names in a new column
s4.str.extract(r"(\w+day)", expand=True)

           0
0     Monday
1  Wednesday
2   Saturday


In [16]:
# Extract weekday names and distances in km in separate columns
s4.str.extract(r"(\w+day)(\d+km)", expand=True)

           0     1
0     Monday   5km
1  Wednesday  10km
2   Saturday  25km
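As a small variation, assuming we would like labeled columns instead of 0 and 1: Python's regex syntax allows named groups of the form (?P<name>...), and str.extract uses the group names as column names. The names day and distance below are illustrative choices, not part of the example above.

```python
import pandas as pd

runs = pd.Series(["Monday5km", "Wednesday10km", "Saturday25km"])

# Each named group becomes a column labeled with the group's name
parts = runs.str.extract(r"(?P<day>\w+day)(?P<distance>\d+km)", expand=True)

print(parts.columns.tolist())      # ['day', 'distance']
print(parts["day"].tolist())       # ['Monday', 'Wednesday', 'Saturday']
print(parts["distance"].tolist())  # ['5km', '10km', '25km']
```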


Grouping also means that we can refer to the captured groups. Let’s see this with the example below.

In [17]:
# Define string sample
sample = 'Monday5km'
sample

'Monday5km'

We will use the match function from the re module to match the pattern against the beginning of the string sample from above. The groups method of the resulting match object returns the captured groups as a tuple.



In [18]:
# Import re library
import re

# Match groups according to regex pattern
m = re.match(r'(\w+day)(\d+km)',  # regex pattern (raw string)
             sample               # string sample
            )

# Show matched groups
m.groups()

('Monday', '5km')

In [19]:
# Show first matched group

m.groups()[0]

'Monday'

In [20]:
# Show second matched group

m.groups()[1]

'5km'

In [21]:
# Show the first three letters of the first matched group
m.groups()[0][:3]

'Mon'

In [22]:
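# Helper that takes a match object and returns the first
# three letters of its first captured group
def f(x):
    return x.groups()[0][:3]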

In [23]:
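# Replace each weekday name with its three-letter abbreviation.
# With regex=True, a callable replacement is called with each match object.
s4.str.replace(r"(\w+day)",
               f,
               regex=True
              )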

0     Mon5km
1    Wed10km
2    Sat25km
dtype: object
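Captured groups can also be referred to directly in a replacement string, without a helper function. In the sketch below, \1 and \2 stand for the first and second captured group; the Series runs repeats the data from s4, and the output format "distance on weekday" is only an illustration.

```python
import pandas as pd

runs = pd.Series(["Monday5km", "Wednesday10km", "Saturday25km"])

# \1 is the weekday group and \2 is the distance group;
# the replacement string swaps their order
reordered = runs.str.replace(r"(\w+day)(\d+km)", r"\2 on \1", regex=True)
print(reordered.tolist())
# ['5km on Monday', '10km on Wednesday', '25km on Saturday']
```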