
# Regular Expressions with Python

https://www.w3schools.com/python/python_regex.asp

In [1]:
import re

## What is a regular expression?

A regular expression (regex) is a string that combines normal characters and special metacharacters to describe a pattern. The pattern is used to find text, or positions, within a larger text.

Example: `r'st\d\s\w{3,10}'`

- the `r` at the beginning indicates a raw string
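    - use it for regex patterns so that Python leaves the backslashes untouched and sequences such as `\d` reach the regex engine as written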
- the **normal** characters match themselves
    - in this example, `st` exactly matches an `s` followed by a `t`
- the **metacharacter** `\` introduces a **special sequence**. It can also be used to escape a metacharacter so that it matches literally.
    - `\d` matches a digit: `0-9`
    - `\s` matches a whitespace character: `space, \t (tab), \n, \r, \f (form feed), \v (vertical tab)`
    - `\w` matches a word character: `a-z, A-Z, 0-9, _`
- the **metacharacter** `{}` is a quantifier that sets how many times the preceding token must occur: `{n}` means exactly `n` times, and `{m,n}` means between `m` and `n` times
    - `{3,10}` means the token immediately to the left, in this case `\w`, should appear between 3 and 10 times
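
As a minimal sketch of the points above, the example pattern can be run against a short string. The sample text is made up for illustration and is not from the original notebook; the first lines only show why the raw-string prefix matters.

```python
import re

# Without the r prefix, Python itself interprets the backslash:
# '\n' is one newline character, while r'\n' is a backslash plus 'n'.
print(len('\n'), len(r'\n'))  # prints: 1 2

# The example pattern: "st", one digit, one whitespace character,
# then 3 to 10 word characters.
pattern = r'st\d\s\w{3,10}'

# Hypothetical sample text, chosen only for illustration.
text = "st1 hello, st2 ab, st3\tgreetings_everyone"

matches = re.findall(pattern, text)
print(matches)  # prints: ['st1 hello', 'st3\tgreetings_']
```

`st2 ab` is skipped because `ab` has only two word characters, and the last match stops after ten word characters because `{3,10}` caps the repetition.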

## Find all matches of a pattern: `re.findall(r'regex', string)`

In [2]:
re.findall(r"#movies", "Love #movies! I had fun yesterday going to the #movies")

['#movies', '#movies']

## Split string at each match: `re.split(r'regex', string)`

In [3]:
re.split(r'!', 'Nice place to eat! I\'ll come back! Excellent meat!')

['Nice place to eat', " I'll come back", ' Excellent meat', '']
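
The trailing `''` appears because the string ends with the delimiter, so there is an empty piece after the last `!`. The sketch below shows one possible way to clean up the pieces, and how the standard `maxsplit` argument limits the number of splits. The variable names are chosen here for illustration.

```python
import re

review = "Nice place to eat! I'll come back! Excellent meat!"

parts = re.split(r'!', review)

# Strip surrounding whitespace and drop pieces that end up empty.
cleaned = [p.strip() for p in parts if p.strip()]
print(cleaned)  # prints: ['Nice place to eat', "I'll come back", 'Excellent meat']

# maxsplit=1 splits only at the first match and leaves the rest intact.
first_only = re.split(r'!', review, maxsplit=1)
print(first_only)  # prints: ['Nice place to eat', " I'll come back! Excellent meat!"]
```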

## Replace one or many matches with a string: `re.sub(r'regex', new, string)`

In [4]:
re.sub(r"yellow", "nice", "I have a yellow car and a yellow house in a yellow neighbourhood")

'I have a nice car and a nice house in a nice neighbourhood'
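
By default `re.sub` replaces every match. As a short sketch, the standard `count` argument limits the number of replacements, and `re.subn` also reports how many were made. The variable names are chosen here for illustration.

```python
import re

sentence = "I have a yellow car and a yellow house in a yellow neighbourhood"

# Replace only the first match.
once = re.sub(r"yellow", "nice", sentence, count=1)
print(once)  # prints: I have a nice car and a yellow house in a yellow neighbourhood

# re.subn returns a tuple: (new string, number of replacements made).
replaced, n = re.subn(r"yellow", "nice", sentence)
print(n)  # prints: 3
```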

## Supported metacharacters

### `\d`: digit

In [5]:
re.findall(r'User\d', "The winners are: User9, UserN, User8, User09")

['User9', 'User8', 'User0']
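
For `str` patterns in Python 3, `\d` also matches decimal digits from other scripts, not only `0-9`. A minimal sketch, assuming the input may contain such characters: the sample string is hypothetical, and `re.ASCII` is the standard flag that restricts `\d` to `0-9`.

```python
import re

# Hypothetical input: "User" followed by ARABIC-INDIC DIGIT THREE (U+0663),
# then "User" followed by an ordinary ASCII digit.
text = "User\u0663, User7"

unicode_matches = re.findall(r'User\d', text)
ascii_matches = re.findall(r'User\d', text, flags=re.ASCII)

print(len(unicode_matches))  # prints: 2
print(ascii_matches)         # prints: ['User7']
```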

### `\D`: non-digit

In [6]:
re.findall(r'User\D', "The winners are: UserN, User8, Userm, Usermn, User_, User*")

['UserN', 'Userm', 'Userm', 'User_', 'User*']

### `\w`: word character

In [7]:
re.findall(r'User\w', "The winners are: UserN, User8, Userm, Usermn, User_, User*")

['UserN', 'User8', 'Userm', 'Userm', 'User_']

### `\W`: non-word character

In [8]:
re.findall(r'\W\d', 'This skirt is on sale, only $59.99 today')

['$5', '.9']

### `\s`: whitespace

In [9]:
re.findall(r'Data\sScience', 'I like Data Science and Data  Science and Data\nScience and Data\fScience')

['Data Science', 'Data\nScience', 'Data\x0cScience']

In [10]:
re.findall(r'Data\s\sScience', 'I like Data Science and Data  Science and Data\nScience and Data\fScience')

['Data  Science']

In [11]:
re.findall(r'Data\s{2}Science', 'Data\nScience Data\n\nScience')

['Data\n\nScience']

In [12]:
re.findall(r'Data\s{2,4}Science', 'Data\nScience Data\n\nScience Data\n\n\tScience Data\n\n\t\t\tScience')

['Data\n\nScience', 'Data\n\n\tScience']
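
To make the bounds of `{2,4}` explicit, the sketch below builds test strings with one to five spaces and checks each against the same pattern. The loop and the `results` dictionary are illustrative additions; `re.fullmatch` is the standard function that requires the whole string to match.

```python
import re

pattern = r'Data\s{2,4}Science'

# Map the number of spaces between the words to whether the pattern matches.
results = {}
for n in range(1, 6):
    candidate = 'Data' + ' ' * n + 'Science'
    results[n] = bool(re.fullmatch(pattern, candidate))

print(results)  # prints: {1: False, 2: True, 3: True, 4: True, 5: False}
```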

### `\S`: non-whitespace

In [13]:
re.sub(r'ice\Scream', 'ice cream', '1.ice-cream, 2.ice*cream, 3.ice**cream, 4.icecream, 5.ice\tcream')

'1.ice cream, 2.ice cream, 3.ice**cream, 4.icecream, 5.ice\tcream'
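
Each uppercase sequence is the complement of its lowercase partner, so any single character matches exactly one sequence from each pair. A small sketch of that relationship follows; the helper name `classify` and the sample characters are hypothetical, chosen only for illustration.

```python
import re

# The six special sequences covered above, as lowercase/uppercase pairs.
sequences = [r'\d', r'\D', r'\w', r'\W', r'\s', r'\S']

def classify(ch):
    """Return the sequences from the list above that match the single character ch."""
    return [seq for seq in sequences if re.fullmatch(seq, ch)]

# Hypothetical sample characters, chosen only for illustration.
for ch in ['a', '1', '_', ' ', '\t', '*']:
    print(repr(ch), classify(ch))
```

Every character yields three sequences, one from each pair. For example, `'1'` gives `\d`, `\w` and `\S`, while `'*'` gives `\D`, `\W` and `\S`.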