# Regular Expressions in Python

## Introduction
This notebook introduces **regular expressions** (regex) in Python, focusing on their use for pattern matching and text manipulation. By the end of this notebook, you will be able to:
- Use basic string methods for text manipulation.
- Understand and apply regular expressions for pattern matching.
- Solve real-world text processing tasks using regex.

Python provides several built-in string methods for text manipulation, and methods such as `split()` are useful for extracting portions of a string. Let's start with a simple example.

In [None]:
string = '9 13 and 15 are odd numbers.'
string.split()

['9', '13', 'and', '15', 'are', 'odd', 'numbers.']

This could be used to find the numbers in a string.

In [None]:
string = '9 13 and 15 are odd numbers.'
numbers = []
for item in string.split():
    if item.isdigit():
        numbers.append(item)
print(numbers)

['9', '13', '15']


However, string methods have limitations when dealing with complex patterns. For example, if the string contains commas or other delimiters, `split()` may not work as expected. Let's run the solution above on the same string, now with a comma between 9 and 13.

In [None]:
string = '9, 13 and 15 are odd numbers.'
numbers = []
for item in string.split():
    if item.isdigit():
        numbers.append(item)
print(numbers)

['13', '15']


`'9'` is missing because `split()` returns `'9,'` instead of `'9'`, and `'9,'` is not made up of digits only.

In [None]:
'9,'.isdigit()

False

There are many ways to solve this issue. One way is to remove commas from the string.

In [None]:
# Solution 1
string = '9, 13 and 15 are odd numbers.'
string_wo_comma = string.replace(',', '')
numbers = []
for item in string_wo_comma.split():
    if item.isdigit():
        numbers.append(item)
print(numbers)

['9', '13', '15']


# Regular Expressions
But as the patterns grow more complex, it becomes harder to handle every case with string methods. Searching and extracting text is so common that Python ships a very powerful module for *regular expressions*, called `re`, that handles many of these tasks quite elegantly.

https://docs.python.org/3/library/re.html

Let's start by importing the module `re`.

In [None]:
import re

Let's solve the above task using regular expressions.

In [None]:
# Solution 2
string = '9, 13 and 15 are odd numbers.'
pattern = '[0-9]+'
re.findall(pattern, string)

['9', '13', '15']

In [None]:
# Additional Example:
example_string = '1, 22, 333 and 4444 are numbers'
pattern = r'\b\d+\b'
print(re.findall(pattern, example_string))

['1', '22', '333', '4444']


No `for` loops, no `if` statements and it works even when you add complexity.

In [None]:
# Find the numbers in a string with regular expressions
string = "9, $13 and (15) are odd numbers."
pattern = '[0-9]+'
re.findall(pattern, string)

['9', '13', '15']

In [None]:
# Another Example:
test_string = 'Items cost $20, $30, and $50'
pattern = r'\$\d+'
print(re.findall(pattern, test_string))  # Extract prices

['$20', '$30', '$50']


This is the essence of regular expressions. We define a `pattern` and search that pattern inside a `string`.

There are multiple functions in Python's regular expressions library. We will mostly use `re.findall()` in this DataLab but be aware of the others:

https://docs.python.org/3/library/re.html#functions
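To give a feel for "the others", here is a minimal sketch that contrasts `re.findall()` with `re.search()` and `re.match()`. The variable names `all_numbers`, `first` and `at_start` are only illustrative.

```python
import re

string = "9, $13 and (15) are odd numbers."
pattern = r'[0-9]+'

# re.findall() returns every non-overlapping match as a list of strings
all_numbers = re.findall(pattern, string)
print(all_numbers)    # ['9', '13', '15']

# re.search() returns a match object for the first match, or None if there is none
first = re.search(pattern, string)
print(first.group())  # 9
print(first.span())   # (0, 1)

# re.match() only tries to match at the very beginning of the string
at_start = re.match(pattern, "abc 9")
print(at_start)       # None
```

`re.search()` is handy when you only care whether a pattern occurs at all, or where the first occurrence is.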

**But what is the meaning of the pattern `[0-9]+`?**

Regular expressions are very powerful, but they are a little complicated and their syntax takes some getting used to. They are almost their own little programming language for searching and parsing strings.

Regular expressions are made of special characters. These special characters **match** the normal characters we are familiar with. For example, `\s` matches whitespace and `.` matches any character except a newline. Therefore `\s...\s` will match any three characters surrounded by whitespace, e.g. a three-letter word.

`[0-9]+` will be explained later.

In [None]:
string = 'I am looking for three letter words.'
pattern = r'\s...\s'  # raw string, so Python leaves the backslashes for the regex engine
re.findall(pattern, string)

Here is a look at the character-level matches:

| | | | | | |
|:-:|:-:|:-:|:-:|:-:|:-:|
|Pattern|`\s`|`.`|`.`|`.`|`\s`|
||&#8595;|&#8595;|&#8595;|&#8595;|&#8595;|
|String|` `|`f`|`o`|`r`|` `|


But if the word is at the beginning or the end of the string, it will not match, because there is no whitespace on one side of it.

In [None]:
string = 'But if the word is at the beginning...'
pattern = r'\s...\s'
re.findall(pattern, string)

Notice that `'But '` is not a match. Also notice that `re.findall()` can return multiple matches.

Here is a character-level look to see why it does not match "But":

| | | | | | |
|:-:|:-:|:-:|:-:|:-:|:-:|
|Pattern|`\s`|`.`|`.`|`.`|`\s`|
||&#10060;|&#8595;|&#8595;|&#8595;|&#8595;|
|String||`B`|`u`|`t`|` `|


In [None]:
string = 'It will also match numbers like 123 but not 456'
pattern = r'\s...\s'
re.findall(pattern, string)

| | | | | | |
|:-:|:-:|:-:|:-:|:-:|:-:|
|Pattern|`\s`|`.`|`.`|`.`|`\s`|
||&#8595;|&#8595;|&#8595;|&#8595;|&#10060;|
|String|` `|`4`|`5`|`6`||

There are many more special characters. There is no need to memorize them; you can use cheat sheets such as the one below. It is not exhaustive, but it covers the most common operations.

**Regular Expressions Cheatsheet**

| Character | Description |
| --- | --- |
|`^`|Matches the beginning of a line|
|`$`|Matches the end of the line|
|`.`|Matches any character except a newline|
|`\s`|Matches whitespace|
|`\S`|Matches any non-whitespace character|
|`*`|Repeats the preceding character zero or more times|
|`*?`|Repeats the preceding character zero or more times (non-greedy)|
|`+`|Repeats the preceding character one or more times|
|`+?`|Repeats the preceding character one or more times (non-greedy)|
|`[aeiou]`|Matches a single character in the listed set|
|`[^XYZ]`|Matches a single character not in the listed set|
|`[a-z0-9]`|The set of characters can include a range|
|`(`|Indicates where string extraction is to start|
|`)`|Indicates where string extraction is to end|

**Key Highlights:**
- Use `.` to match any character except newline.
- Use `*` for zero or more repetitions.
- Use `+` for one or more repetitions.
- Use `?` for zero or one repetition.
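As a small illustration of the three repetition characters listed above, the sketch below runs `*`, `+` and `?` against the same made-up text. The strings and variable names are chosen only for this example.

```python
import re

# 'u?' means the letter u may appear zero or one time
spellings = re.findall(r'colou?r', 'The color of the colour swatch')
print(spellings)  # ['color', 'colour']

# Contrast *, + and ? on the same text
text = 'ac abc abbc'
star = re.findall(r'ab*c', text)      # b repeated zero or more times
plus = re.findall(r'ab+c', text)      # b repeated one or more times
optional = re.findall(r'ab?c', text)  # b repeated zero or one time
print(star)      # ['ac', 'abc', 'abbc']
print(plus)      # ['abc', 'abbc']
print(optional)  # ['ac', 'abc']
```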

Refer to [regex101](https://regex101.com/) for interactive testing of regular expressions.

Additionally, there are tools that explain what a given expression matches. These tools are very useful when working with regular expressions.

**Task 1**
 - Go to https://regexr.com/
 - Enter the following regular expression:
         \s...\s
 - Enter the following text:
         I am looking for three letter words.
 - Check the explanation
 - Hover over elements in the explanation to highlight them in the expression above
 - Click on the elements to open them in reference on the left
 - **Use this tool from now on when you are in doubt**

**Example:**
Try creating a pattern to extract words starting with vowels in the sentence: `'An example of an apple and an orange'`.

Let's continue exploring more characters. In the previous example, you might have noticed that the surrounding whitespace is returned as part of each match.

In [None]:
string = 'I am looking for three letter words.'
pattern = r'\s...\s'
re.findall(pattern, string)

But we just want the word. We can wrap the section we want returned in parentheses.

Parentheses are another special character in regular expressions. When you add parentheses to a regular expression, they are ignored when matching the string. But when you are using `re.findall()`, parentheses indicate that while you want the whole expression to match, you are only interested in extracting a portion of the substring that matches the regular expression.

In [None]:
string = 'I am looking for three letter words.'
pattern = r'\s(...)\s'
re.findall(pattern, string)
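A pattern may contain more than one pair of parentheses. In that case `re.findall()` returns one tuple per match, with one entry per group. The sketch below shows both cases; the `'width=20 height=35'` string is a made-up example.

```python
import re

# One group: findall returns only the captured part of each match
string = 'I am looking for three letter words.'
one_group = re.findall(r'\s(...)\s', string)
print(one_group)   # ['for']

# Two groups: findall returns a tuple (group 1, group 2) for each match
settings = 'width=20 height=35'
two_groups = re.findall(r'([a-z]+)=([0-9]+)', settings)
print(two_groups)  # [('width', '20'), ('height', '35')]
```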

Now, let's understand the expression `[0-9]+` we saw previously.

- The first step is to understand sets `[ ]`
- The second step is to understand ranges `[0-9]`
- The third step is to understand the `+` character.

**Step 1:** A set matches a single character. For example, the set `[0123456789]` matches `0` or `1` or `2` or `3` or `4` or `5` or `6` or `7` or `8` or `9`. It does not match the whole sequence `0123456789`.

In [None]:
# Find single digits
string = '2, 5 and 7 are odd numbers.'
pattern = '[0123456789]'
re.findall(pattern, string)

| | |
|:-:|:-:|
|Pattern|`[0123456789]`|
||&#8595;|
|String|`2`|

What happens if the same pattern is applied to two-digit numbers?

In [None]:
string = '11, 13 and 15 are odd numbers.'
pattern = '[0123456789]'
re.findall(pattern, string)

Since the pattern `[0123456789]` matches single characters, it will match the digits in `11` but as two separate matches. Therefore the return is `['1', '1', ...]` instead of `['11', ...]`.

To match two-digit numbers we can simply use the pattern twice.

In [None]:
string = '11, 13 and 15 are odd numbers.'
pattern = '[0123456789][0123456789]'
re.findall(pattern, string)

| | | |
|:-:|:-:|:-:|
|Pattern|`[0123456789]`|`[0123456789]`|
||&#8595;|&#8595;|
|String|`1`|`1`|

You are probably thinking that `[0123456789]` is quite verbose. What if we want to match all letters? Do we have to write `[abcdefghijklmnopqrstuvwxyz]`?

**Step 2:** That would be a nightmare, but luckily we can define a range for common sets.
- Digits: `[0123456789]` can be written as `[0-9]`
- Lowercase letters: `[a-z]`
- Uppercase letters: `[A-Z]`

In [None]:
# Single digit with range
string = '2, 5 and 7 are odd numbers.'
pattern = '[0-9]'
re.findall(pattern, string)

In [None]:
# Two digits with range
string = '11, 13 and 15 are odd numbers.'
pattern = '[0-9][0-9]'
re.findall(pattern, string)

**Task 2**

Here is your second task. Define a pattern that returns lower case and upper case letters in a string.

Example string: `'Word1 woRd2'`

Output: `['W', 'o', 'r', 'd', 'w', 'o', 'R', 'd']`

**Hint:** Use the `re` library and experiment with different sets. A single set can combine several ranges, as `[a-z0-9]` does in the cheatsheet.

In [None]:
string = 'Word1 woRd2'
pattern = # YOUR CODE HERE
re.findall(pattern, string)

**Step 3:** But what if there are single and two digit numbers in one string?

In [None]:
# This won't work
string = "9, $13 and (15) are odd numbers."
pattern = '[0-9]'
re.findall(pattern, string)

In [None]:
# This won't work either
string = "9, $13 and (15) are odd numbers."
pattern = '[0-9][0-9]'
re.findall(pattern, string)

The `+` character repeats the preceding character or set one or more times. Therefore `[0-9]+` will match:
- `[0-9]`
- `[0-9][0-9]`
- `[0-9][0-9][0-9]`
- `[0-9][0-9][0-9][0-9]`
- ...

In [None]:
string = "9, $13 and (15) are odd numbers."
pattern = '[0-9]+'
re.findall(pattern, string)
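Some of the earlier examples used `\d` instead of `[0-9]`. For plain digits like the ones in this notebook the two behave the same, as the sketch below shows. Note that for Python `str` patterns `\d` also matches digit characters from other scripts, so `[0-9]` is the stricter choice.

```python
import re

string = "9, $13 and (15) are odd numbers."

with_range = re.findall(r'[0-9]+', string)  # explicit range
with_shorthand = re.findall(r'\d+', string)  # shorthand for "a digit"

print(with_range)                    # ['9', '13', '15']
print(with_shorthand)                # ['9', '13', '15']
print(with_range == with_shorthand)  # True
```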

Sets can also be used to define characters that you don't want to match:

| Character | Description |
| --- | --- |
|`[aeiou]`|Matches a single character in the listed set|
|`[^XYZ]`|Matches a single character not in the listed set|
|`[a-z0-9]`|The set of characters can include a range|

For example, `[^0-9]` matches every character that is not a digit, including spaces:

In [None]:
string = "123 abc 5d"
pattern = '[^0-9]'
re.findall(pattern, string)

**Task 3:** Find the number of apples in a string using regular expressions. The phrase `'x apples'`, where `x` is a number, can appear anywhere in the string.

---

Example string
`'There are 15 apples in the basket.'`

Expected output is `['15']`

---

Example string
`'There are 15 apples and 20 oranges in the basket.'`

Expected output is `['15']`

---

Example string
`'5 apples here 10 apples there.'`

Expected output is `['5', '10']`

---

Example string
`'There is only 1 apple left.'`

Expected output is `['1']`

---
For the following case, the output should be an empty list.

Example string
`'I have an apple'`

Expected output is `[]`



In [None]:
string = 'There are 15 apples in the basket.'
pattern = # YOUR CODE HERE
re.findall(pattern, string)

**Beginning/end of a string**

Sometimes we are interested in matching the beginning or the end of a string. There are special characters for that as well.

| Character | Description |
| --- | --- |
|`^`|Matches the beginning of a line|
|`$`|Matches the end of the line|

In [None]:
# Find the number at the beginning of the string
string = '1-There are 5 apples and 2 oranges in the basket.'
pattern = '^[0-9]+'
re.findall(pattern, string)
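The `$` character works the same way at the other end. The sketch below uses made-up strings. It also shows that, by default, `^` only matches at the start of the whole string; passing the `re.MULTILINE` flag makes it match at the start of every line.

```python
import re

# $ anchors the match to the end of the string
at_end = re.findall(r'[0-9]+$', 'The basket holds 5 apples and 12')
print(at_end)       # ['12']

# No number at the very end, so nothing matches
none_at_end = re.findall(r'[0-9]+$', '12 apples')
print(none_at_end)  # []

# ^ with and without re.MULTILINE on a string that contains several lines
lines = '1 apple\n2 pears\n3 plums'
default_mode = re.findall(r'^[0-9]+', lines)
multiline_mode = re.findall(r'^[0-9]+', lines, re.MULTILINE)
print(default_mode)    # ['1']
print(multiline_mode)  # ['1', '2', '3']
```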

**Escape character**

We use special characters in regular expressions, such as `.`. But what if we would like to match a literal `.` in a string? We need a way to indicate that these characters are “normal” and that we want to match the actual character in the string.

We can indicate that we want to simply match a character by prefixing that character with a backslash.

In [None]:
# Find all abbreviated titles
string = "Dr. A, Ms. B, Mr. C"
pattern = '...' # this won't work
re.findall(pattern, string)

In [None]:
# Find all abbreviated titles
string = "Dr. A, Ms. B, Mr. C"
pattern = r'..\.' # this will work
re.findall(pattern, string)
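There are two other ways to match a special character literally that are worth knowing. Inside a set, `.` loses its special meaning, and the standard-library function `re.escape()` adds the backslashes for you. This is a minimal sketch; the variable names are only illustrative.

```python
import re

string = "Dr. A, Ms. B, Mr. C"

# Inside a set, the dot is a literal dot
titles = re.findall(r'..[.]', string)
print(titles)   # ['Dr.', 'Ms.', 'Mr.']

# re.escape() escapes every special character in a piece of text
literal = re.escape('(15)')
print(literal)  # \(15\)

found = re.findall(literal, "9, $13 and (15) are odd numbers.")
print(found)    # ['(15)']
```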

**`+` and `*` are greedy**

`+` and `*` repeat a character, and it is crucial to understand that they are greedy: they match as many characters as possible. Let's examine this behaviour with an example.

Let's say you have the string

`"From: stephen.marquard@uct.ac.za, csev@umich.edu, and cwen @iupui.edu"`

and you would like to get the name of the first person after `'From:'`.

We can define a pattern such as
`'From:\s(.+)@'`

expecting it to give us the name between `From:` and `@`. Let's see what happens.

In [None]:
# The greedy + extends the match up to the last @ sign
string = "From: stephen.marquard@uct.ac.za, csev@umich.edu, and cwen @iupui.edu"
pattern = r'From:\s(.+)@'
re.findall(pattern, string)

We did not get just the first name. This is not because the pattern fails to match; it is because of the greedy behaviour of the `+` character. If we use `+?`, which is non-greedy, it will give us what we want.

In [None]:
# You can use +? for a non-greedy repetition
string = "From: stephen.marquard@uct.ac.za, csev@umich.edu, and cwen @iupui.edu"
pattern = r'From:\s(.+?)@'
re.findall(pattern, string)
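Another way to stop at the first `@`, sketched below as an alternative rather than a replacement for `+?`, is a negated set: `[^@]+` is still greedy, but it can never run past an `@`. The three results are shown side by side.

```python
import re

string = "From: stephen.marquard@uct.ac.za, csev@umich.edu, and cwen @iupui.edu"

greedy = re.findall(r'From:\s(.+)@', string)       # runs to the last @
lazy = re.findall(r'From:\s(.+?)@', string)        # stops at the first @
negated = re.findall(r'From:\s([^@]+)@', string)   # cannot cross an @ at all

print(greedy)   # ['stephen.marquard@uct.ac.za, csev@umich.edu, and cwen ']
print(lazy)     # ['stephen.marquard']
print(negated)  # ['stephen.marquard']
```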

If you read the `re.findall()` documentation it says the following:

"Return all **non-overlapping matches** of pattern in string, as a list of strings or tuples. The string is **scanned left-to-right**, and matches are returned in the order found. Empty matches are included in the result."

The fact that `re.findall()` reads left-to-right and finds non-overlapping matches has important implications.

Take a look at the example below:

In [None]:
string = "123456"
pattern = '...'
re.findall(pattern, string)

`'...'` could also match `'234'`, but that is not returned. Why?

This is because:

- `re.findall()` reads from left-to-right,
- finds the first match (`'123'` in this example),
- continues scanning from the next character after the match (`'4'` in this example),
- therefore "non-overlapping".
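To make the scanning visible, here is a small sketch that uses `re.finditer()`, which yields match objects instead of strings. Each match object reports where the match starts and ends through `span()`. The variable name `spans` is only illustrative.

```python
import re

string = "123456"
pattern = r'...'

# Collect each match together with its (start, end) position in the string
spans = [(match.group(), match.span()) for match in re.finditer(pattern, string)]
for text, position in spans:
    print(text, position)
# 123 (0, 3)
# 456 (3, 6)
```

The second match starts at index 3, exactly where the first one ended, so no character is ever used by two matches.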

**Task 4:** Given an arithmetic operation with nested parentheses, return each innermost pair of parentheses together with its contents.

---

Example string `"(5 * (3 + 2)) - 7"`

Expected output is `['(3 + 2)']`

---

Example string `"((7 - 2) * (1 + 2)) / 2"`

Expected output is `['(7 - 2)', '(1 + 2)']`

In [None]:
string = "((7 - 2) * (1 + 2)) / 2"
pattern = # YOUR CODE HERE
re.findall(pattern, string)

We have covered the fundamentals of regular expressions. But there are many more characters. It is helpful to skim a cheatsheet and see what is possible.

https://www.dataquest.io/wp-content/uploads/2019/03/python-regular-expressions-cheat-sheet.pdf

Now to practice regular expressions, please continue with the following tasks.

# Task 5

Given a list of tweets, remove links, hashtags and user handles from the tweets. Tweet processing will be essential for the creative brief. For this task, check the documentation for `re.sub()`.

Example tweet:

`'@BhaktisBanter @PallaviRuhail This one is irresistible :)\n#FlipkartFashionFriday http://t.co/EbZ0L2VENM'`

Expected output:

`'This one is irresistible :)\nFlipkartFashionFriday'`
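If you have not used `re.sub()` before, the sketch below shows its basic shape on an unrelated, made-up string: `re.sub(pattern, replacement, string)` returns a new string in which every match of the pattern is replaced. It is not a solution to the task.

```python
import re

string = 'Call 555-1234 or 555-9876   today'

# Replace every run of digits with a single # character
masked = re.sub(r'[0-9]+', '#', string)
print(masked)     # Call #-# or #-#   today

# Replace every run of whitespace with a single space
collapsed = re.sub(r'\s+', ' ', string)
print(collapsed)  # Call 555-1234 or 555-9876 today
```

Using an empty string `''` as the replacement removes the matches altogether.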


For this task you will use a sample Twitter dataset from the NLTK library.

In [None]:
import nltk
from nltk.corpus import twitter_samples
nltk.download('twitter_samples')
all_positive_tweets = twitter_samples.strings('positive_tweets.json')
all_negative_tweets = twitter_samples.strings('negative_tweets.json')
tweets = all_positive_tweets + all_negative_tweets

In [None]:
# Examine the first 10 tweets
tweets[:10]

In [None]:
string = '@BhaktisBanter @PallaviRuhail This one is irresistible :)\n#FlipkartFashionFriday http://t.co/EbZ0L2VENM'
processed_tw = # YOUR CODE HERE remove links
processed_tw = # YOUR CODE HERE remove handles
processed_tw = # YOUR CODE HERE remove hashtags

# Task 6

Find all the emails inside the file `'mbox-short.txt'`. It is a collection of email messages and metadata.

If you would like to know more about mbox files, read the following:
https://en.wikipedia.org/wiki/Mbox

In [None]:
# Let's examine the first 10 lines
with open('mbox-short.txt') as f:  # the file is closed automatically
    for counter, line in enumerate(f, start=1):
        print(line)
        if counter >= 10:
            break

Designing a pattern that matches an email address requires knowing its syntax rules. The format of an email address is `local-part@domain`. The syntax rules for the local-part and the domain differ from each other, and both are complex.

For example the following is a valid email address:

`"very.(),:;<>[]\".VERY.\"very@\\ \"very\".unusual"@strange.example.com`

You can find all the rules here:
https://en.wikipedia.org/wiki/Email_address#Syntax

**For this task assume that a valid email address can contain:**

- lowercase Latin letters `a` to `z`
- digits `0` to `9`
- dot `.`

and nothing else. You should be able to find a total of 305 email addresses, 16 of them distinct.

|id|email|count|
|---|---|---|
|0|`source@collab.sakaiproject.org`|135|
|1|`postmaster@collab.sakaiproject.org`|27|
|2|`apache@localhost`|27|
|3|`cwen@iupui.edu`|20|
|4|`zqian@umich.edu`|17|
|5|`david.horwitz@uct.ac.za`|17|
|6|`louis@media.berkeley.edu`|12|
|7|`gsilver@umich.edu`|12|
|8|`stephen.marquard@uct.ac.za`|8|
|9|`rjlowe@iupui.edu`|8|
|10|`wagnermr@iupui.edu`|6|
|11|`antranig@caret.cam.ac.uk`|4|
|12|`gopal.ramasammycook@gmail.com`|4|
|13|`ray@media.berkeley.edu`|4|
|14|`hu2@iupui.edu`|2|
|15|`josrodri@iupui.edu`|2|
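Once you have a list of matches, one possible way to build a count table like the one above is `collections.Counter` from the standard library. The sketch below assumes a short, hypothetical list named `found` standing in for your `re.findall()` results; it does not read the file or define the pattern for you.

```python
from collections import Counter

# Hypothetical stand-in for the list of addresses returned by re.findall()
found = ['cwen@iupui.edu', 'zqian@umich.edu', 'cwen@iupui.edu']

counts = Counter(found)

print(len(found))             # 3 (total number of addresses)
print(len(counts))            # 2 (number of distinct addresses)
print(counts.most_common(1))  # [('cwen@iupui.edu', 2)]
```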


In [None]:
# YOUR CODE HERE