# Regular Expressions (Regex)
Regular expressions are a way of filtering strings and identifying string patterns. They allow us to identify certain patterns in a text and, if desired, to replace them with other strings.

We always start by importing the module **re**:

In [2]:
#regular expression operations (https://docs.python.org/3/library/re.html)
import re

Now, let's start using regular expressions.
## 01 The basics

We are using **re.search(pattern, string)** to find a pattern in a string. The documentation tells us:
>Scan through string looking for the first location where the regular expression pattern produces a match, and return a corresponding match object.

So let's have a look at some examples.


### Start and end of a text
Regular expressions are used to identify text patterns, so we are often interested in the position of the wanted pattern. To pin a pattern to a position, we need a way of identifying the start and the end of a string.
- **^** to identify the beginning of a string (with the flag **re.MULTILINE**, the beginning of every line), or
- **\$** to identify the end of a string (with the flag **re.MULTILINE**, the end of every line).

In [17]:
#Example 1
#define the regex and the text
text1 = "dogs are adorable"
regex1 = "^dogs"
regex2 = "are$"

#the search function of regex returns a match object, if the string pattern is found in the text
re.search(regex1,text1)

<re.Match object; span=(0, 4), match='dogs'>

In [18]:
#it returns nothing (None), if no match is made
re.search(regex2,text1)

#it can be made visible, using a print() command
print(re.search(regex2,text1))

None
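
Because **re.search** returns None when nothing matches, calling a method on the result without a check would raise an AttributeError. The following is a minimal sketch of a defensive pattern, reusing the strings from above. The helper name `first_match` is made up for illustration and is not part of the **re** module.

```python
import re

def first_match(pattern, string):
    """Return the matched text, or None if the pattern is not found."""
    match = re.search(pattern, string)
    if match is None:
        return None
    # group(0) returns the text that was matched
    return match.group(0)

found = first_match("^dogs", "dogs are adorable")
missing = first_match("are$", "dogs are adorable")

print(found)
print(missing)
```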


In [10]:
#the regex and text parameters can also be passed directly
re.search("adorable$","dogs are adorable")

<re.Match object; span=(9, 17), match='adorable'>
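
By default, **^** and **\$** refer to the start and the end of the whole string. The sketch below uses a made-up two-line string to show how the flag **re.MULTILINE** changes this so that every line is considered.

```python
import re

two_lines = "dogs are adorable\ncats are too"

# without a flag, "^" only matches at the very start of the string
default_result = re.search("^cats", two_lines)

# with re.MULTILINE, "^" also matches directly after each newline
multiline_result = re.search("^cats", two_lines, re.MULTILINE)

print(default_result)
print(multiline_result)
```

The first call prints None, the second one a match object for the second line.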

### Quantifiers
Often a pattern is described by how many times a certain element occurs in a string. Quantifiers express exactly that:

- **?** identifies zero or one occurrences of the preceding element, the element is optional
- **\*** identifies zero or more occurrences of the preceding element
- **+** identifies one or more occurrences of the preceding element

In [12]:
#let's say we want to match both ways of writing the word "color" (i.e. color and colour)

print(re.search("colou?r", "color"))
print(re.search("colou?r", "colour"))
#both strings are a match because the "u" is optional

<re.Match object; span=(0, 5), match='color'>
<re.Match object; span=(0, 6), match='colour'>


In [14]:
#it is unknown how often a character appears or if it appears at all

re.search("a*b*c*", "aaabbbb")

<re.Match object; span=(0, 7), match='aaabbbb'>

In [16]:
#A character appears for sure, but we don't know how often

print(re.search("ab+c", "abc"))
print(re.search("ab+c", "abbc"))

#if the character does not appear, we don't have a match
print(re.search("ab+c", "ac"))

<re.Match object; span=(0, 3), match='abc'>
<re.Match object; span=(0, 4), match='abbc'>
None
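
One consequence of "zero or more" is easy to overlook: a pattern that consists only of **\*** elements can match an empty string, so **re.search** returns a match object even if none of the characters occur. A small sketch with a made-up input string:

```python
import re

# none of the characters a, b or c appear,
# but every element is allowed to occur zero times
empty_match = re.search("a*b*c*", "xyz")
print(empty_match)

# with "+", at least one "b" is required, so there is no match
no_match = re.search("ab+c", "xyz")
print(no_match)
```

The first result is a match object with an empty match at position 0, the second one is None.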


### Sets

When regex analyzes a single position in a string, it can test whether the character at that position is one of several allowed characters. For example, the first position of a string can be tested to match one of three characters. To do so, we are using sets that are defined by **[ ]**.

In [20]:
#search for an "A", an "F" or an "h"; without a "^" in front, the set can match at any position, here it matches the first character

re.search("[AFh]","hallo")

<re.Match object; span=(0, 1), match='h'>
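
A set on its own is satisfied by the first allowed character anywhere in the string. To really test only the first position, the set can be combined with the **^** anchor. A short sketch, where the input "oh" is made up for illustration:

```python
import re

# the set alone matches the "h" at the second position
anywhere = re.search("[AFh]", "oh")

# combined with "^", the set has to match the very first character
anchored_hit = re.search("^[AFh]", "hallo")
anchored_miss = re.search("^[AFh]", "oh")

print(anywhere)
print(anchored_hit)
print(anchored_miss)
```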

### Ranges and a wildcard
To identify ranges, we use the aforementioned sets. Regex also has a universal wildcard that stands for any character.
- **-** is the symbol for a range
- **.** represents any character (by default, any character except a newline)

In [25]:
#ranges are always requested within sets and can be used for digits, capital and lowercase letters
print(re.search("[0-9][A-Z][a-z]", "5Ag"))

#match every character except those in a defined interval: the "^" sign at the start of the [] brackets is equivalent to NOT
print(re.search("[0-9][^B-F][a-z]", "5Ag"))

<re.Match object; span=(0, 3), match='5Ag'>
<re.Match object; span=(0, 3), match='5Ag'>


In [30]:
#asking for every character one or more times
re.search(".+", "aeÄ%ria'!gpaenn")

<re.Match object; span=(0, 15), match="aeÄ%ria'!gpaenn">
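
The wildcard has one exception: by default, **.** does not match a newline character. The sketch below uses a made-up two-line string and the flag **re.DOTALL**, which makes the wildcard match newlines as well.

```python
import re

two_lines = "first line\nsecond line"

# by default, ".+" stops at the end of the first line
default_dot = re.search(".+", two_lines).group(0)

# with re.DOTALL, "." also matches the newline
dotall_dot = re.search(".+", two_lines, re.DOTALL).group(0)

print(repr(default_dot))
print(repr(dotall_dot))
```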

### Special characters

Special characters identify a wide range of characters within a string and often have an equivalent but longer notation, such as a set.
- **\d** identifies any digit
- **\w** identifies any alphanumeric character (digits, upper and lower case letters and the "_")
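
As a rough sketch of how the short forms relate to sets, the following compares **\d** and **\w** with hand-written sets on a made-up sample string. The sets are only equivalent for plain ASCII text: in Python 3, **\d** and **\w** also match digits and letters from other scripts.

```python
import re

sample = "Room 42, user_7"

# \d compared with an explicit digit set
digits_short = re.findall(r"\d", sample)
digits_set = re.findall("[0-9]", sample)

# \w compared with an explicit set of letters, digits and "_"
words_short = re.findall(r"\w+", sample)
words_set = re.findall("[A-Za-z0-9_]+", sample)

print(digits_short == digits_set)
print(words_short == words_set)

# re.search returns a match object for the first number only
first_number = re.search(r"\d+", sample)
print(first_number)
```

The patterns are written as raw strings (with an **r** in front) so that Python passes the backslash on to the regex engine unchanged.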

The printed match object contains more information than we usually need, but the result is correct, as the search function returns a *match object*.
To make this readable, we can use **.group(0)**, which returns only the matched text. The reason for the index is that a regex can return one or more groups, but we will look into that later.

In [42]:
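#text is assumed to be defined in an earlier cell; its first line is the sentence shown in the output below
re.search(r"^.*",text).group(0)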

'The following text contains a certain amount of characters.'
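
As a brief preview of groups: parentheses in a pattern create numbered groups, and **.group(n)** returns the text matched by the n-th pair of parentheses, while **.group(0)** always returns the whole match. A minimal sketch with a made-up pattern:

```python
import re

match = re.search(r"(\w+) are (\w+)", "dogs are adorable")

whole = match.group(0)   # the complete match
first = match.group(1)   # text matched by the first pair of parentheses
second = match.group(2)  # text matched by the second pair of parentheses

print(whole)
print(first)
print(second)
```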

### Appendix

#### A - Regex patterns and their meaning

| Regex pattern | Explanation | Example |
| --- | --- | --- |
| ^ | Identifies the beginning of a line | re.search("^two","two dogs") |
| \$ | Identifies the end of a line | re.search("dogs$","two dogs") |
| ? | Identifies zero or one occurrences of the preceding element, the element is optional | re.search("dok?gs", "dogs") |
| * | Identifies zero or more occurrences of the preceding element | re.search("do*l*g", "doog") |
| + | Identifies one or more occurrences of the preceding element | re.search("do+g", "dog") |
| - | Is used for ranges within sets | re.search("[a-f]", "dog") |
| . | Represents any character | re.search("d.g", "dog") |
| \b...\b | Word boundary. Matches the beginning and end of a word | re.search(r"\bdog\b", "two dog toys") |
| \w | Word character. Matches a single letter, digit or "_" | re.search(r"\w", "dog") |
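
The word boundary deserves a short illustration, because it only works when the pattern is written as a raw string. In a normal Python string, "\b" is the backspace character and never reaches the regex engine as a boundary. The input strings below are made up for this sketch.

```python
import re

# "dog" as a separate word is found
standalone = re.search(r"\bdog\b", "my dog barks")

# "dog" inside another word is not found
embedded = re.search(r"\bdog\b", "hotdog stand")

# without the raw string prefix, "\b" is a backspace character, so nothing matches
not_raw = re.search("\bdog\b", "my dog barks")

print(standalone)
print(embedded)
print(not_raw)
```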


#### B - Regex functions
| Function | Explanation |
| --- | --- |
| re.search(regex, text) | Searches for the defined pattern. Stops at the first occurrence and returns a match object |
| re.findall(regex, text) | Returns all matches in the text as a list |
| re.fullmatch(regex, text) | Returns a match object only if the whole string matches the regex |
| re.sub(regex, replace, string) | Replaces every match of the regex in the string with the replacement and returns the new string |
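
The four functions can be compared side by side. The following sketch runs each of them on the same made-up sentence; the variable names are chosen for illustration only.

```python
import re

text = "two dogs and two cats"

# search stops at the first occurrence
first_hit = re.search("two", text)

# findall collects every occurrence in a list
all_hits = re.findall("two", text)

# fullmatch only succeeds if the pattern covers the whole string
full_ok = re.fullmatch("two.*cats", text)
full_fail = re.fullmatch("two", text)

# sub replaces every occurrence and returns a new string
replaced = re.sub("two", "three", text)

print(first_hit)
print(all_hits)
print(full_ok is not None, full_fail is None)
print(replaced)
```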


### Sources
- https://support.bettercloud.com/s/article/Creating-your-own-Custom-Regular-Expression-bc72153
- https://macromates.com/manual/en/regular_expressions
- Regex course