### Regular Expressions (Regex)
Regular expressions are a way of filtering strings and identifying string patterns. They allow you to identify certain patterns in a text and, if desired, to replace them with other strings.

We start with a simple example. Take the following text, and of course, import the module re:

In [27]:
#regular expression operations (https://docs.python.org/3/library/re.html)
import re

text = '''The following text contains a certain amount of characters.
Besides letters, there are also some numbers like 1234 and 5532. 
Additionally, special characters like /\\%&_-!? and $§ are part of the text.'''

Now, let's start using regular expressions.
### The basics
Regular expressions are used to identify patterns in text. Often we are interested in where a pattern occurs, so we need a way to refer to the start and the end of a string.
- **^** matches at the beginning of the string (and, with the **re.MULTILINE** flag, at the beginning of every line), and
- \\$ matches at the end of the string (and, with **re.MULTILINE**, at the end of every line).

If we want to select solely by position, we additionally need a way to match arbitrary characters in between.
- **.** matches any single character except a newline
- \* matches 0 or more repetitions of the preceding character
- \+ matches 1 or more repetitions of the preceding character

Now we know all the necessary patterns. But before we can try them out, we need our first function for applying a regex.
### Our first match
We use **re.search(pattern, string)** to find a pattern in a string. The documentation tells us:
>Scan through string looking for the first location where the regular expression pattern produces a match, and return a corresponding match object.
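
One detail worth knowing before the first example: **re.search** returns **None** when the pattern does not occur, so calling a method on the result directly would raise an `AttributeError`. The following is a minimal sketch of that behaviour; the sample string and the variable names are made up for this illustration and are not part of the notebook.

```python
import re

# Hypothetical sample string, used only for this illustration
sample = "Besides letters, there are also some numbers like 1234 and 5532."

# A pattern that occurs in the string: re.search returns a match object
found = re.search(r"numbers", sample)
if found is not None:
    # start() and end() give the position of the first match
    print(sample[found.start():found.end()])

# A pattern that does not occur: re.search returns None
missing = re.search(r"xyz", sample)
print(missing)
```

Checking for **None** first is a cautious habit whenever it is not certain that the pattern occurs in the text.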

Let's search for the first line. The following regex delivers the requested result: **^** anchors the match at the beginning of the text, followed by **.** and \*, i.e. zero or more arbitrary characters. Since **.** does not match a newline, the match ends at the end of the first line. Notice that we always write our regex pattern as a raw string using **r"..."** so that Python does not interpret backslash sequences itself (e.g. **\n** would otherwise turn into a newline).

In [40]:
re.search(r"^.*",text)

<re.Match object; span=(0, 59), match='The following text contains a certain amount of c>

That output contains a bit too much information, but it is correct: the search function returns a *match object*.
To make this readable, we can use **.group(0)**, which returns the entire matched text. The index is needed because a regex can contain one or more groups, but we will look into that later.

In [42]:
re.search(r"^.*",text).group(0)

'The following text contains a certain amount of characters.'
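
To see how the anchors and quantifiers from the basics section behave in Python, here is a small sketch. It assumes a short two-line string (`sample`, a hypothetical name) and additionally uses the **re.MULTILINE** flag, which lets **^** and \\$ anchor to each line instead of only to the whole string.

```python
import re

# Hypothetical two-line sample, used only for this illustration
sample = "first line\nsecond line"

# ^ anchors at the start of the string; . stops at the newline
first = re.search(r"^.*", sample).group(0)   # 'first line'

# $ anchors at the end of the string, so this yields the last line
last = re.search(r".*$", sample).group(0)    # 'second line'

# Without re.MULTILINE, ^ does not match at the start of line two
no_flag = re.search(r"^second", sample)                   # None
with_flag = re.search(r"^second", sample, re.MULTILINE)   # matches 'second'

# * accepts zero repetitions (an empty match), + needs at least one
zero_or_more = re.search(r"x*", sample)   # empty match at position 0
one_or_more = re.search(r"x+", sample)    # None, there is no 'x'

print(repr(first), repr(last))
print(no_flag, with_flag.group(0))
print(repr(zero_or_more.group(0)), one_or_more)
```

The last two lines illustrate why \* should be used with care: a pattern that may match zero characters always succeeds, even when the character is not present at all.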

Let's narrow our search. We are only interested in the first word.
We can use the following two methods:
- **\b** is a word boundary. It matches the position between a word character and a non-word character, so we can enclose the word we are looking for, in this case "The", between two boundaries.
- **\w** matches a single word character (a letter, a digit, or an underscore).

To match the first word using **\b**, we put the literal "The" between two boundaries.
To match the first word using **\w**, we look for the first run of word characters. The **+** indicates that a word character has to occur one or more times. Our first match is "The".

In [56]:
using_b = re.search(r"^\bThe\b",text).group(0)
using_w = re.search(r"^\w+", text).group(0)

print("Using b:", using_b)
print("Using w:", using_w)

Using b: The
Using w: The
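
The difference between a single word character and a whole word, and the effect of the word boundary, can be checked with a short sketch. The sample string and the variable names are made up for this illustration.

```python
import re

# Hypothetical sample string, used only for this illustration
sample = "some numbers like 1234"

# \w alone matches exactly one word character
single_char = re.search(r"\w", sample).group(0)    # 's'

# \w+ matches a run of word characters, i.e. a whole word
whole_word = re.search(r"\w+", sample).group(0)    # 'some'

# Without boundaries, "number" is found inside "numbers"
without_boundary = re.search(r"number", sample)

# With boundaries, "number" must stand alone, so nothing matches
with_boundary = re.search(r"\bnumber\b", sample)   # None

# Digits count as word characters too
last_word = re.search(r"\w+$", sample).group(0)    # '1234'

print(single_char, whole_word, last_word)
print(without_boundary.group(0), with_boundary)
```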


In [6]:
data ='''Buchungsdatum;Betrag;Buchungs-Info;category
03.01.2022;-262,15;Miete;
07.01.2022;-10,17;POS 10,17 DE K1 06.01. 19:44 REWE WIESBADEN LILIENC WIESBADEN 65189 280;
07.01.2022;-16,21;POS 16,21 DE K1 05.01. 17:51 REWE WIESBADEN, BLEICH WIESBADEN 65185 280;
10.01.2022;-56;go green energy GmbH+Co KG Summenbu chung faelliger Positionen Kun.Nr.4 23717;
10.01.2022;-13,5;POS 13,50 DE K1 08.01. 13:48 WIESBADENER NORDWAND IDSTEIN 65510 280;'''

print(data)

Buchungsdatum;Betrag;Buchungs-Info;category
03.01.2022;-262,15;Miete;
07.01.2022;-10,17;POS 10,17 DE K1 06.01. 19:44 REWE WIESBADEN LILIENC WIESBADEN 65189 280;
07.01.2022;-16,21;POS 16,21 DE K1 05.01. 17:51 REWE WIESBADEN, BLEICH WIESBADEN 65185 280;
10.01.2022;-56;go green energy GmbH+Co KG Summenbu chung faelliger Positionen Kun.Nr.4 23717;
10.01.2022;-13,5;POS 13,50 DE K1 08.01. 13:48 WIESBADENER NORDWAND IDSTEIN 65510 280;


### Summary

| Regex pattern | Explanation |
| --- | --- |
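| ^ | Matches at the beginning of the string (of each line with re.MULTILINE) |
| $ | Matches at the end of the string (of each line with re.MULTILINE) |
| \b...\b | Word boundaries. Enclose a pattern so that it matches only as a whole word |
| \w | Word character. Matches a single letter, digit, or underscore |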

### Sources
- https://support.bettercloud.com/s/article/Creating-your-own-Custom-Regular-Expression-bc72153
- https://macromates.com/manual/en/regular_expressions
