
# Regular Expressions with Python

https://www.w3schools.com/python/python_regex.asp

In [1]:
import re

## What is a regular expression?

A regular expression (REGular EXpression, or regex) is a string that combines normal characters with special metacharacters to describe a pattern. The pattern is used to find text, or positions, within a string.

Example: `r'st\d\s\w{3,10}'`

- the `r` prefix marks a raw string, in which Python leaves backslashes untouched
    - use it for regex patterns so that sequences such as `\d` reach the `re` module unchanged
- the **normal** characters match themselves
    - in this example, `st` exactly matches an `s` followed by a `t`
- the **metacharacter** `\` introduces a **special** sequence. It is also used to escape metacharacters so that they match literally.
    - `\d` matches a digit: `0-9`
    - `\s` matches a whitespace character: space, `\t` (tab), `\n`, `\r`, `\f` (form feed), `\v` (vertical tab)
    - `\w` matches a word character: `a-z, A-Z, 0-9, _`
- the **metacharacter** `{}` specifies how many times the preceding item must occur
    - `{3,10}` means the item immediately to the left, in this case `\w`, must appear between 3 and 10 times
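
To see the example pattern at work, the sketch below runs it against two made-up strings and also shows what the `r` prefix changes. The sample strings and the names `pattern`, `hit` and `miss` are illustrative and not part of the original notebook.

```python
import re

pattern = r'st\d\s\w{3,10}'

# 'st', one digit, one whitespace character, then 3 to 10 word characters
hit = re.search(pattern, 'st1 hello')

# only two word characters follow the whitespace, so nothing matches
miss = re.search(pattern, 'st1 hi')

print(hit.group())  # st1 hello
print(miss)         # None

# a raw string keeps the backslash as a character of its own
print(len('\n'), len(r'\n'))  # 1 2
```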

## Find all matches of a pattern: `re.findall(r'regex', string)`

In [2]:
re.findall(r"#movies", "Love #movies! I had fun yesterday going to the #movies")

['#movies', '#movies']

## Split string at each match: `re.split(r'regex', string)`

In [3]:
re.split(r'!', 'Nice place to eat! I\'ll come back! Excellent meat!')

['Nice place to eat', " I'll come back", ' Excellent meat', '']
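
The trailing `''` in the output appears because the string ends with the separator. The sketch below shows two possible follow-ups, assuming you want clean sentences or only a limited number of splits. The variable names are illustrative.

```python
import re

review = "Nice place to eat! I'll come back! Excellent meat!"

# drop empty pieces and strip the leading spaces left by the split
parts = re.split(r'!', review)
cleaned = [p.strip() for p in parts if p.strip()]
print(cleaned)  # ['Nice place to eat', "I'll come back", 'Excellent meat']

# maxsplit limits how many splits are performed
first_split = re.split(r'!', review, maxsplit=1)
print(first_split)  # ['Nice place to eat', " I'll come back! Excellent meat!"]
```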

## Replace one or many matches with a string: `re.sub(r'regex', new, string)`

In [4]:
re.sub(r"yellow", 
       "nice", 
       "I have a yellow car and a yellow house in a yellow neighbourhood")

'I have a nice car and a nice house in a nice neighbourhood'

In [5]:
re.sub(r"yellow", 
       "nice", 
       "I have a yellow car and a yellow house in a yellow neighbourhood",
       1)

'I have a nice car and a yellow house in a yellow neighbourhood'
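
The fourth argument above is `count`, the maximum number of replacements. As a small sketch, passing it by keyword makes the intent easier to read (recent Python releases also discourage passing it positionally), and `re.subn` additionally reports how many replacements were made. The variable names are illustrative.

```python
import re

sentence = "I have a yellow car and a yellow house in a yellow neighbourhood"

# replace only the first occurrence, with count passed by keyword
once = re.sub(r"yellow", "nice", sentence, count=1)
print(once)  # I have a nice car and a yellow house in a yellow neighbourhood

# subn returns the new string together with the number of replacements
replaced, n_replacements = re.subn(r"yellow", "nice", sentence)
print(n_replacements)  # 3
```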

## Supported metacharacters

### `\d`: digit

In [6]:
re.findall(r'User\d', "The winners are: User9, UserN, User8, User09")

['User9', 'User8', 'User0']

### `\D`: non-digit

In [7]:
re.findall(r'User\D', 
           "The winners are: UserN, User8, Userm, Usermn, User_, User*")

['UserN', 'Userm', 'Userm', 'User_', 'User*']

### `\w`: word

In [8]:
re.findall(r'User\w', "The winners are: UserN, User8, Userm, Usermn, User_, User*")

['UserN', 'User8', 'Userm', 'Userm', 'User_']

### `\W`: non-word

In [9]:
re.findall(r'\W\d', 'This skirt is on sale, only $59.99 today')

['$5', '.9']

### `\s`: whitespace

In [10]:
re.findall(r'Data\sScience', 
           'I like Data Science and Data  Science and Data\nScience and Data\fScience, Data\n\nScience')

['Data Science', 'Data\nScience', 'Data\x0cScience']

In [11]:
re.findall(r'\dData\s\sScience', 
           'I like 1Data Science and 2Data  Science and 3Data\nScience and 4Data\fScience')

['2Data  Science']

In [12]:
re.findall(r'\dData\s{2}Science', '1Data\nScience 2Data\n\nScience')

['2Data\n\nScience']

In [13]:
re.findall(r'\dData\s{2,4}Science', 
           '1Data\nScience 2Data\n\nScience 3Data\n\n\t\nScience 4Data\n\n\t\t\tScience')

['2Data\n\nScience', '3Data\n\n\t\nScience']

### `\S`: non-whitespace

In [14]:
re.sub(r'ice\Scream', 'ice cream', '1.ice-cream, 2.ice*cream, 3.ice**cream, 4.icecream, 5.ice\tcream')

'1.ice cream, 2.ice cream, 3.ice**cream, 4.icecream, 5.ice\tcream'

## Repeated characters

A password should contain 8 word characters followed by 4 digits: `password1234`

In [15]:
password = 'passw ord1234'

re.search(r'\w\w\w\w\w\w\w\w\d\d\d\d', password)  # returns None: the space is not a word character

In [16]:
password = 'password4237'

re.search(r'\w\w\w\w\w\w\w\w\d\d\d\d', password)

<re.Match object; span=(0, 12), match='password4237'>
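
The first call shows no output because `re.search` returns `None` when nothing matches, so code that uses the result should check for that case. A minimal sketch, with a hypothetical helper `find_password` that is not part of the original notebook:

```python
import re

def find_password(candidate):
    """Return the matched text, or None when the pattern is not found."""
    match = re.search(r'\w\w\w\w\w\w\w\w\d\d\d\d', candidate)
    if match is None:
        return None
    return match.group()

print(find_password('passw ord1234'))  # None
print(find_password('password4237'))   # password4237
```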

### Quantifiers

- A quantifier applies ONLY to the item immediately to its left: a single character, a special sequence such as `\d`, a set, or a group
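
A short sketch of that scoping rule, using made-up input. The non-capturing group `(?:...)` is not covered elsewhere in these notes and is used here only to show how a quantifier can be applied to more than one character.

```python
import re

sample = 'ab abb abab'

# + applies only to 'b'
only_b = re.findall(r'ab+', sample)
print(only_b)  # ['ab', 'abb', 'ab', 'ab']

# + applies to the whole group 'ab'
whole_group = re.findall(r'(?:ab)+', sample)
print(whole_group)  # ['ab', 'ab', 'abab']

# + applies to the set [ab], so any mix of 'a' and 'b' repeats
any_of_set = re.findall(r'[ab]+', sample)
print(any_of_set)  # ['ab', 'abb', 'abab']
```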

#### `{n}` : n times

In [17]:
re.search(r'\w{8}\d{4}', password)

<re.Match object; span=(0, 12), match='password4237'>

#### `+` : one or more times

In [18]:
text = 'Date of start: 4-3. Date of registration: 10-04 and some garbage: 111-2378'

re.findall(r'\d+-\d+', text)

['4-3', '10-04', '111-2378']

#### `*` : zero or more times

In [19]:
my_string = 'The concert was amazing! @ameli!a @joh&&n @mary90 @carlos'

regex = r'@\w+\W*\w+'

re.findall(regex, my_string)

['@ameli!a', '@joh&&n', '@mary90 @carlos']
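
The last result, `'@mary90 @carlos'`, merges two handles because `\W*` also matches the space and the `@` between them. One possible tightening, sketched below, replaces `\W` with a set that excludes whitespace; this is an illustrative variation, not the only way to write it.

```python
import re

my_string = 'The concert was amazing! @ameli!a @joh&&n @mary90 @carlos'

# [^\w\s]* allows zero or more symbols, but no whitespace, inside a handle
stricter = r'@\w+[^\w\s]*\w+'
handles = re.findall(stricter, my_string)
print(handles)  # ['@ameli!a', '@joh&&n', '@mary90', '@carlos']
```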

#### `?` : zero or one time

In [20]:
text = 'The color of this image is amazing. However, the colour blue could be brighter'

regex = r'colou?r'

re.findall(regex, text)

['color', 'colour']

#### `{n,m}` : at least n times, at most m times

- omitting `m`, as in `{4,}`, means at least n times with no upper limit

In [21]:
phone_number = 'John: 1-966-847-3131 Michelle: 54-908-42-42424988'

regex = r'\d{1,2}-\d{3}-\d{2,3}-\d{4,}'

re.findall(regex, phone_number)

['1-966-847-3131', '54-908-42-42424988']
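
The last part of that pattern, `\d{4,}`, has no upper bound. A small sketch with made-up digit strings shows the open-ended form next to a bounded one:

```python
import re

# {4,} means four or more digits
at_least_four = re.findall(r'\d{4,}', '12 1234 123456')
print(at_least_four)  # ['1234', '123456']

# {2,3} takes as many digits as it can, up to three, for each match
two_or_three = re.findall(r'\d{2,3}', '1234567')
print(two_or_three)  # ['123', '456']
```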

## Two options for matching: `re.search` and `re.match`

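- `re.search(r'regex', string)`
    - scans the whole string and returns the first match
- `re.match(r'regex', string)`
    - **anchored** at the beginning of the string
- both return a `Match` object, or `None` when nothing matches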

In [22]:
[re.search(r'\d{4}', '4506 people attend the show'), 
 re.search(r'\d+', 'Yesterday I saw 3 shows')]

[<re.Match object; span=(0, 4), match='4506'>,
 <re.Match object; span=(16, 17), match='3'>]

In [23]:
[re.match(r'\d{4}', '4506 people attend the show'), 
 re.match(r'\d+', 'Yesterday I saw 3 shows')]

[<re.Match object; span=(0, 4), match='4506'>, None]
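
Note that `re.match` only anchors the start: text after the match is ignored. When the whole string has to fit the pattern, the `re` module also offers `re.fullmatch`, which these notes do not otherwise cover. A brief sketch:

```python
import re

# match: the string only has to START with four digits
starts_with = re.match(r'\d{4}', '4506 people attend the show')
print(starts_with.group())  # 4506

# fullmatch: the ENTIRE string has to be four digits
whole_fail = re.fullmatch(r'\d{4}', '4506 people attend the show')
whole_ok = re.fullmatch(r'\d{4}', '4506')
print(whole_fail)        # None
print(whole_ok.group())  # 4506
```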

## Special characters

### The `.` metacharacter

- matches any character except a newline (unless the `re.DOTALL` flag is set)

In [24]:
my_links = 'Just check out this link: www.amazingpics.com. It has amazing photos!'

re.findall(r'www\..+\.com', my_links)

['www.amazingpics.com']
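
Two details of `.` are worth a quick sketch. First, `.+` is greedy, so with two links in the text it runs to the last `.com`; adding `?` after the quantifier makes it stop as early as possible. Second, `.` stops at a newline unless `re.DOTALL` is passed. The sample strings are made up for illustration.

```python
import re

links = 'Visit www.first.com or www.second.com today'

# greedy: .+ grabs as much as possible
greedy = re.findall(r'www\..+\.com', links)
print(greedy)  # ['www.first.com or www.second.com']

# lazy: .+? grabs as little as possible
lazy = re.findall(r'www\..+?\.com', links)
print(lazy)  # ['www.first.com', 'www.second.com']

two_lines = 'first line\nsecond line'

# by default . does not match the newline
per_line = re.findall(r'.+', two_lines)
print(per_line)  # ['first line', 'second line']

# with re.DOTALL it does
across_lines = re.findall(r'.+', two_lines, flags=re.DOTALL)
print(across_lines)  # ['first line\nsecond line']
```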

### The `^` metacharacter

- anchors the **regex** to the start of the string

In [25]:
my_string = 'the 80s music was much better than the 90s'

re.findall(r'the\s\d+s', my_string)

['the 80s', 'the 90s']

In [26]:
re.findall(r'^the\s\d+s', my_string)

['the 80s']
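
By default `^` matches only at the very start of the string. As a sketch with a made-up two-line string, the `re.MULTILINE` flag makes it match at the start of every line as well:

```python
import re

lines = 'the 80s were loud\nthe 90s were louder'

# default: only the start of the whole string counts
start_only = re.findall(r'^the\s\d+s', lines)
print(start_only)  # ['the 80s']

# MULTILINE: the start of each line counts
each_line = re.findall(r'^the\s\d+s', lines, flags=re.MULTILINE)
print(each_line)  # ['the 80s', 'the 90s']
```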

### The `$` metacharacter

- anchors the **regex** to the end of the string

In [27]:
re.findall(r'the\s\d+s$', my_string)

['the 90s']
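
Combining `^` and `$` forces the pattern to cover the whole string, which is one way to tighten the earlier password check. The sketch below reuses that pattern with made-up inputs; note that `$` also matches just before a single trailing newline.

```python
import re

# without anchors, the pattern is found inside a longer string
loose = re.search(r'\w{8}\d{4}', 'my password4237!')
print(loose.group())  # password4237

# with both anchors, the whole string has to fit
strict_fail = re.search(r'^\w{8}\d{4}$', 'my password4237!')
strict_ok = re.search(r'^\w{8}\d{4}$', 'password4237')
print(strict_fail)        # None
print(strict_ok.group())  # password4237

# $ still matches when the string ends with one newline
with_newline = re.search(r'^\w{8}\d{4}$', 'password4237\n')
print(with_newline is not None)  # True
```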

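### `\` escape

- placing `\` before a metacharacter makes it match literally: `.` matches any character, while `\.` matches only a period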

In [28]:
my_string = 'I love the music of Mr.Go. However, the sound was too loud.'

print(re.split(r'.\s', my_string))

['', 'lov', 'th', 'musi', 'o', 'Mr.Go', 'However', 'th', 'soun', 'wa', 'to', 'loud.']


In [29]:
print(re.split(r'\.\s', my_string))

['I love the music of Mr.Go', 'However, the sound was too loud.']
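
When the text to match comes from a variable, escaping it by hand is error-prone. The `re.escape` function adds the backslashes for you. A minimal sketch, where `separator` and `price` are illustrative names:

```python
import re

my_string = 'I love the music of Mr.Go. However, the sound was too loud.'

# escape a literal separator before using it as a pattern
separator = '. '
sentences = re.split(re.escape(separator), my_string)
print(sentences)  # ['I love the music of Mr.Go', 'However, the sound was too loud.']

# unescaped, $ and . are metacharacters and the price is not found
price = '$59.99'
unescaped = re.findall(price, 'only $59.99 today')
escaped = re.findall(re.escape(price), 'only $59.99 today')
print(unescaped)  # []
print(escaped)    # ['$59.99']
```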


### The `|` metacharacter: OR

In [30]:
my_string = "Elephants are the world's largest land animal! I would love to see an elephant one day."

re.findall(r'Elephant|elephant', my_string)

['Elephant', 'elephant']
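
`|` separates everything to its left from everything to its right, so its scope is usually limited with a group. The sketch below uses a non-capturing group `(?:...)`, which these notes do not otherwise cover, and also shows the `re.IGNORECASE` flag as an alternative for the elephant example. The `gray`/`grey` input is made up.

```python
import re

# without a group, the alternatives are 'gra' and 'ey'
unscoped = re.findall(r'gra|ey', 'gray grey')
print(unscoped)  # ['gra', 'ey']

# with a group, only the middle letter varies
scoped = re.findall(r'gr(?:a|e)y', 'gray grey')
print(scoped)  # ['gray', 'grey']

my_string = "Elephants are the world's largest land animal! I would love to see an elephant one day."

# a flag can replace the alternation when only letter case differs
any_case = re.findall(r'elephant', my_string, flags=re.IGNORECASE)
print(any_case)  # ['Elephant', 'elephant']
```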

### Set of characters `[ ]`

In [31]:
re.findall(r'[Ee]lephant', my_string)

['Elephant', 'elephant']

In [32]:
my_string = 'Yesterday I spent my afternoon with my friends: MaryJohn2 Clary3'

re.findall(r'[a-zA-Z]+\d', my_string)

['MaryJohn2', 'Clary3']

In [33]:
my_string = 'My&names&is#John Smith. I%live$in#London.'

re.sub(r'[&#%$]', ' ', my_string)

'My names is John Smith. I live in London.'
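
The example above works without escaping `$` because most metacharacters lose their special meaning inside `[ ]`. A short sketch with made-up input; a hyphen is literal when it is placed last in the set:

```python
import re

# inside a set, '.' and '$' are ordinary characters
symbols = re.findall(r'[.$]', 'cost: $5.00')
print(symbols)  # ['$', '.']

# 'a-c' is a range, the final '-' is a literal hyphen
letters_and_hyphen = re.findall(r'[a-c-]', 'a-b-z')
print(letters_and_hyphen)  # ['a', '-', 'b', '-']
```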

#### `^` when used inside `[ ]`

- negates the set: it matches any single character that is NOT listed

In [34]:
my_links = 'Bad website: www.99.com. Favorite site: www.hola.com'

re.findall(r'www[^0-9]+com', my_links)

['www.hola.com']
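
A negated set is also handy for stripping unwanted characters. The sketch below uses a made-up phone string and shows that `^` only negates when it comes first inside the brackets:

```python
import re

# remove everything that is not a digit
digits_only = re.sub(r'[^0-9]', '', 'Call 555-0199 now')
print(digits_only)  # 5550199

# when ^ is not the first character, it is a literal caret
with_caret = re.findall(r'[0-9^]', '2^8')
print(with_caret)  # ['2', '^', '8']
```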