# Introduction to Regular Expressions
A **regular expression** (or **regex**) is a sequence of characters that defines a search pattern. Such patterns are typically used by string-searching algorithms for "find" or "find and replace" operations on strings, or for input validation. The technique was developed in theoretical computer science and formal language theory (see [wiki](https://en.wikipedia.org/wiki/Regular_expression) for more).

Regular expressions are used throughout IT. Because we are already familiar with Python, we are going to learn them in practice with Python.


## Specifying a pattern of text to search for
Let's start with an example. Say you want to identify a phone number by its pattern. You don't know exactly what all the possible numbers are, but you know the format: three digits, followed by a hyphen, and then four more digits (optionally preceded by a three-digit area code and a hyphen).

Example: 
`015-445-1234` is a phone number 
`6.849.8789.4.23` is not a phone number 

Pattern matching like this is very useful, even essential, both for text searching and for programming.

### 1. The traditional way: 

So let's do an exercise and try to find a phone number in a string. You already know the pattern: three digits, a hyphen, three digits, a hyphen, and four digits (another example: `015-150-4789`).

Let's write a function called `is_phone_number()`. The function should take a text as an argument and return `True` if the text is a valid phone number and `False` otherwise:



In [8]:
# "5".isdecimal() returns True because "5" is a decimal digit.
# isdecimal() is a string method, so we call it on text[i], not on an int.

def is_phone_number(text): 
    if len(text) != 12: 
        return False
    for i in range(0,3):
        if not text[i].isdecimal():
            return False
    if text[3] != '-':
        return False
    for i in range(4,7):
        if not text[i].isdecimal():
            return False
    if text[7] != '-':
        return False
    for i in range (8,12):
        if not text[i].isdecimal():
            return False
    return True


In [9]:
print(is_phone_number("015-150-4789"))
print(is_phone_number("154-000-8897"))
print(is_phone_number("154-123-abcd"))
print(is_phone_number("Hallo Aina!!"))
print(is_phone_number("000"))
print(is_phone_number("hahaha"))

True
True
False
False
False
False


Now try to use our function to find the two phone numbers in the following text. How would you do it?

`message = "Call me at 021-636-2906 (home) tomorrow. My office number is 021-666-4569!"`



In [11]:
message = "Call me at 021-636-2906 (home) tomorrow. My office number is 021-666-4569!"

# your code here 
for i in range(len(message)):
    part = message[i:i+12]
    # print(part)
    if is_phone_number(part):
        print("Phone number recognized: {}".format(part))
print("Done")

Phone number recognized: 021-636-2906
Phone number recognized: 021-666-4569
Done


### Let's try with a pattern: 
Our previous phone number matching works, but it uses a lot of code for very little matching. What if we also want to match numbers like `(049) 0421-4778-99` or `+49.489.889.12`? You would have to add more code for each additional pattern, and it quickly becomes a lot of code.

As mentioned above, a **regex** is a description of a pattern of text. A `\d` in a regex stands for a digit character, that is, any single numeral from 0 to 9. The regex `\d\d\d-\d\d\d-\d\d\d\d` matches the same text as our `is_phone_number()` function.

But regexes are more powerful than that. For example, adding a 3 in curly brackets (`{3}`) after a pattern means "match this pattern **three times**". So we can write our pattern as `\d{3}-\d{3}-\d{4}`.

### Regex in Python
In Python, all regex functions are in the `re` module. We can import it with `import re`.

P.S.: In Python strings, the backslash is used to escape characters. For example, `\n` represents a newline. If we want to print `\n` literally, we have to precede it with another backslash (`\\n`) **or** prefix the string with an `r` to make it a raw string, as in the third and fourth lines of the following example. Raw strings are handy for regexes, where a sequence like `\d` must reach the regex engine with its backslash intact.



In [12]:
print("Hallo \n")
print("Hallo \\n")
print(r"Hallo \n")
print(r'Hallo \n') # note the r before the string. 

Hallo 

Hallo \n
Hallo \n
Hallo \n


To create a regex object, we call `re.compile()` and pass it our **regex pattern**. It returns a regex object.

Now that we have our object, we can use the methods it provides. Let's start with `search()`. The `search()` method takes a string as input and returns `None` if it doesn't find a match. If there is a match, it returns a **Match object**. The Match object has a `group()` method (explained later) that returns the actual matched text from the searched string.

In [21]:
# import the regular expression module
import re 

# create our regular expression object with re.compile()
phoneNbRegex = re.compile(r"\d\d\d-\d\d\d-\d\d\d\d") # alternative?
phoneNbRegex2 = re.compile(r"\d{3}-\d{3}-\d{4}")


# call search() on our string; it returns a Match object if there is a match, otherwise None
matchObject = phoneNbRegex2.search("I am a phone number 123-456-7890 dfaldifs")

# check our result
if matchObject is not None:
    print(matchObject.group())
else:
    print("No match found")

123-456-7890


#### Regex in python (steps): 
1. import the regex module with `import re`
2. create a regex object with `re.compile()` . (use raw string) 
3. pass the string you want to search into the Regex object's `search()` method. This returns a Match object. 
4. Call the Match object's `group()` method to return a string of the actual matched text.  
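
The four steps above can be sketched as one small helper. This is only a minimal illustration: the name `find_phone_number` and the sample strings are made up for this example. It also checks for `None` before calling `group()`, because `search()` returns `None` when nothing matches.

```python
import re  # Step 1: import the regex module

# Step 2: compile the pattern once (raw string, so the backslashes reach the regex engine)
PHONE_PATTERN = re.compile(r"\d{3}-\d{3}-\d{4}")


def find_phone_number(text):
    """Return the first phone number found in text, or None (hypothetical helper)."""
    # Step 3: search() returns a Match object, or None if there is no match
    match = PHONE_PATTERN.search(text)
    if match is None:
        return None
    # Step 4: group() returns the matched text
    return match.group()


print(find_phone_number("My number is 415-555-4242."))
print(find_phone_number("No number here."))
```

The first call prints the number and the second prints `None`.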

# More pattern matching with Regex 

Now that you understand the basics of patterns, we can explore the real power of pattern matching. Remember, you can also use regular expressions in the find-and-replace dialogs of many text editors.

## Grouping with Parentheses (remember `group()`?)

Say you want to separate the first part of the phone number from the rest. Adding parentheses creates groups in the regex. You can then use the `group()` method to grab the matching text from just one group: `group(1)` returns the first group, while `group()` or `group(0)` returns the entire match.

In [26]:
import re 

phoneNbRegex = re.compile(r"(\d\d\d)-(\d\d\d-\d\d\d\d)") # alternative ? 
# phoneNbRegex = re.compile(r"(\d{3})-(\d{3})-(\d{4})")


# use our method search with our string. We will receive a MatchObject if there is a match
matchObject = phoneNbRegex.search("123-456-7890")

# check our result 
print(matchObject.group())
print(matchObject.group(1))
print(matchObject.group(2))
print(matchObject.group(0))

123-456-7890
123
456-7890
123-456-7890


In [31]:
# Try to make three groups from the number (separated with dash)
import re

phoneNbRegex3Groups = re.compile(r"(\d\d\d)-(\d\d\d)-(\d\d\d\d)")
matchObject3Groups = phoneNbRegex3Groups.search("123-456-7890")
# check our result 
print(matchObject3Groups.group())
print(matchObject3Groups.group(1))
print(matchObject3Groups.group(2))
print(matchObject3Groups.group(3))


123-456-7890
123
456
7890


The method `groups()` allows you to retrieve all the groups in one line: 


In [27]:
part1, part2 = matchObject.groups()

print(part1)
print(part2)

123
456-7890


In [32]:
group1, group2, group3 = matchObject3Groups.groups()

print(group1)
print(group2)
print(group3)

123
456
7890


**Literal parentheses?** As you might expect, you have to escape parentheses with `\(` and `\)` if you want to match them literally in your pattern.

In [33]:
import re 

# phoneNbRegex = re.compile(r"(\(\d\d\d\))-(\d\d\d-\d\d\d\d)") # alternative ? 
phoneNbRegex = re.compile(r"(\(\d{3}\))-(\d{3})-(\d{4})")

# use our method search with our string. We will receive a MatchObject if there is a match
matchObject = phoneNbRegex.search("(123)-456-7890")

# check our result 
print(matchObject.group(1))
print(matchObject.group(2))
print(matchObject.group(0))


(123)
456
(123)-456-7890


In [None]:
# Exercise: match all three groups, including the parentheses around the first two
phoneNb = "(123)-(456)-7890"

# good luck


## Matching Multiple Groups with the Pipe `|` 

The `|` character is called a **pipe**. Use it when you want to match one of several expressions. The pattern `r"Tea|Coffee"`, for example, matches either **Tea** or **Coffee**. If several alternatives occur in the searched text, `search()` returns the one that appears first.
Example: 

In [35]:
import re 

# sentence = "If What You Gave Me Last Was Tea, I Want Coffee. If It Was Coffee, I Want Tea."
#sentence = "I don't like Coffee."
sentence = "I like Coffee."

#create our object 
coffeeOrTea = re.compile(r"Tea|Coffee|Water") # just like or 

# get MatchObject
mo = coffeeOrTea.search(sentence)

print(type(mo)) #As you see, a match object
print(mo.group())

<class 're.Match'>
Coffee


You can also use the *pipe* inside a group to match one of several alternatives after a common prefix:

In [37]:
import re 

sentence = "Ich mag meine Wohnung und nicht Wohnaccessoire"

wohnRegex = re.compile(r"Wohn(mobile|ung|bau|accessoire)")

mo = wohnRegex.search(sentence) 

print(mo.group(0))
print(mo.group(1))


Wohnung
ung


## Optional Matching with the Question Mark `?`

Let's say we want to validate a phone number that may contain an optional part, so that both `421-009-8765` and `421-(017)-009-8765` should be valid. We can achieve this with `?` in regex: it marks the preceding group as optional (zero or one occurrence).

PS: Again, if you want to match a literal question mark, you have to escape it with `\?`.

In [41]:
import re 

phone_1 = "421-009-8765"
phone_2 = "421-(017)-009-8765"
phone_3 = "+49 151 4884 9484"

# Note the ? after the group (\(\d{3}\)-): it makes that group optional
phoneRegex = re.compile(r"\d{3}-(\(\d{3}\)-)?\d{3}-\d{4}")

mo1 = phoneRegex.search(phone_1)
mo2 = phoneRegex.search(phone_2)

print(mo1.group())
print(mo2.group())

421-009-8765
421-(017)-009-8765
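
To illustrate the note about escaping, here is a minimal sketch that contrasts `?` used as a quantifier with `\?` used as a literal character. The sample words and sentences are made up for this example.

```python
import re

# Unescaped ?: the preceding "u" is optional
optional_u = re.compile(r"colou?r")
# Escaped \?: matches a literal question mark
literal_question_mark = re.compile(r"really\?")

print(optional_u.findall("color and colour"))
print(literal_question_mark.search("Oh really? Yes.").group())
print(literal_question_mark.search("Oh really. Yes."))
```

The last call prints `None`, because the text contains no literal question mark after "really".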


In [None]:
# Exercise: also make the third part optional so that all of the following phone numbers match
phone_1 = "421-009-8765"
phone_2 = "421-(017)-009-8765"

#here 
phone_3 = "421-8765"

 
# Good Luck

## Matching Zero or More with the Star `*` 
The `*` means "match zero or more": the group that precedes the star can be absent or can repeat any number of times, and the match is still valid.

Let's take our phone number again: 


In [45]:
import re 

phone_1 = "421-(017)-009-8765"

phone_2 = "421-(017)-(015)-(102)-009-8765"

phone_3 = "421-009-8765"

phoneStarRegex = re.compile(r"\d{3}-(\(\d{3}\)-)*\d{3}-\d{4}")

matchObject_1 = phoneStarRegex.search(phone_1)
matchObject_2 = phoneStarRegex.search(phone_2)
matchObject_3 = phoneStarRegex.search(phone_3) 

print(matchObject_1.group())
print(matchObject_2.group())
print(matchObject_3.group())


421-(017)-009-8765
421-(017)-(015)-(102)-009-8765
421-009-8765


## Matching One or More with the Plus `+`
The `+` means "match one or more". It works exactly like the star `*`, except that the preceding group must appear at least once for the match to succeed.

In [None]:
# write an analogy from the * (what doesn't work? why?)
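
As a hint for the exercise, the difference between `*` and `+` can be seen on a pattern unrelated to phone numbers. This is only a sketch; the pattern `Ba(na)` and the test words are made up for this example.

```python
import re

star = re.compile(r"Ba(na)*")  # zero or more "na"
plus = re.compile(r"Ba(na)+")  # one or more "na"

for word in ["Ba", "Bana", "Banana"]:
    star_found = star.search(word) is not None
    plus_found = plus.search(word) is not None
    print(word, star_found, plus_found)
```

Only the first word behaves differently: "Ba" contains no "na" at all, so the star pattern still matches while the plus pattern does not.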


## Matching characters with repetition 
We already used `{}` to specify repetition, in `\d{3}` for example. It can also be applied to groups.

The regex `(Ho){3}`, for example, matches exactly three repetitions: "HoHoHo". In the longer string "HoHoHoHoHo" it matches only the first three "Ho", not the whole string.  
So, instead of writing something like `(Ho){5}|(Ho){4}|(Ho){3}`, we can just write `(Ho){3,5}`. It matches "HoHoHo", "HoHoHoHo", and "HoHoHoHoHo". As with list slices, you can leave out the first or the second number if there is no **minimum** or **maximum**. For example, `(Ho){3,}` matches three or more "Ho" and `(Ho){,5}` matches zero to five "Ho".

In [46]:
import re

# the following patterns match the same text (alternatives are tried left to right,
# so the longest alternative must come first)
reg_3 = re.compile(r'(Ho)(Ho)(Ho)(Ho)(Ho)|(Ho)(Ho)(Ho)(Ho)|(Ho)(Ho)(Ho)')
reg_2 = re.compile(r'(Ho){5}|(Ho){4}|(Ho){3}')
reg_1 = re.compile(r'(Ho){3,5}') # greedy by default; the non-greedy form adds a question mark: r'(Ho){3,5}?'

sentence = "HoHoHoHo"
mo = reg_1.search(sentence)
print(mo.group())

print(reg_1.search("HoHoHoHo").group())

HoHoHoHo
HoHoHoHo
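
The greedy and non-greedy forms mentioned in the comment above, as well as the open-ended range `{3,}`, can be compared side by side. This is a minimal sketch; the variable names are chosen for this example only.

```python
import re

text = "HoHoHoHoHo"  # five repetitions of "Ho"

greedy = re.compile(r"(Ho){3,5}")      # takes as many repetitions as allowed
lazy = re.compile(r"(Ho){3,5}?")       # stops as soon as the minimum is reached
at_least_three = re.compile(r"(Ho){3,}")

print(greedy.search(text).group())
print(lazy.search(text).group())
print(at_least_three.search("HoHo"))   # fewer than three repetitions: no match
```

The greedy pattern returns all five "Ho", the non-greedy one returns only three, and the last call prints `None`.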


## The `findall()` Method 
In addition to the `search()` method, the regex object has another method called `findall()`. The difference is that *search()* returns a *Match object* for the **FIRST** matched text in the searched string, while *findall()* returns the strings of **EVERY** match in the searched string.

Let's take our phone number example again: 


In [49]:
import re 

message = "Call me at 021-636-2906 (home) tomorrow. My office number is 021-666-4569!"
phoneRegex = re.compile(r'\d{3}-\d{3}-\d{4}')

# search() only finds the first match
# and returns it as a Match object
mo = phoneRegex.search(message)

print(mo.group())


021-636-2906


*findall()* does not return a match object but a **list** of strings. 

In [50]:
import re 

message = "Call me at 021-636-2906 (home) tomorrow. My office number is 021-666-4569!"
phoneRegex = re.compile(r'\d{3}-\d{3}-\d{4}')

# findall() returns a list of strings 
results = phoneRegex.findall(message)

print(len(results))
print(results)

2
['021-636-2906', '021-666-4569']


In [53]:
# if the pattern contains groups, findall() returns a list of tuples 
import re 

message = "Call me at 021-636-2906 (home) tomorrow. My office number is 021-666-4569!"
phoneRegex = re.compile(r'(\d{3})-(\d{3}-\d{4})')

# findall() returns a list of tuples here, one tuple per match 
results = phoneRegex.findall(message)

print(len(results))
print(results)
print(results[1])

2
[('021', '636-2906'), ('021', '666-4569')]
('021', '666-4569')


# Character Classes
We learned earlier that `\d` stands for any numeric digit. In other words, `\d` is shorthand for `(0|1|2|3|4|5|6|7|8|9)`. There are many such **shorthand character classes** in regex. Here is a list of them: 

| Shorthand character class        | Represents                                                                      |
| -------------------------------- |:-------------------------------------------------------------------------------:|
|\d                                | Any numeric digit from 0 to 9.                                                  |
|\D                                | Any character that is not a numeric digit from 0 to 9.                          |
|\w                                | Any letter, numeric digit, or the underscore character.                         |
|\W                                | Any character that is not a letter, numeric digit, or the underscore character. |
|\s                                | Any space, tab, or newline character.                                           |
|\S                                | Any character that is not a space, tab, or newline.                             | 
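
The classes in the table can be tried out on a short string. This is only a sketch; the sample text is made up and contains a digit run, an underscore, punctuation, spaces, and a tab so that each class has something to match.

```python
import re

sample = "Room 42, floor_3!\tOK"

print(re.findall(r"\d", sample))    # single digits
print(re.findall(r"\w+", sample))   # runs of letters, digits, and underscores
print(re.findall(r"\s", sample))    # whitespace characters (two spaces and a tab)
print(re.findall(r"\W", sample))    # everything that is not a letter, digit, or underscore
```

Note that `floor_3` comes back as one piece from `\w+`, because the underscore counts as a word character.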


You can also define your own character class with square brackets, which is handy for shortening regular expressions. The character class `[0-5]` matches only the digits 0 to 5 and is much shorter than typing `(0|1|2|3|4|5)`. 

In [54]:
# example? 
import re 

gifts = "12 drummers for Mike, 11 pipers (from oncly Goerge), 10 lords (4$), 7 snowballs, 6 remote_controll, 5 rings, \
4 birds, 3 cats, 2 doves, 1 tree as a decoration"

christmasRegex = re.compile(r'\d{1,2}\s\w+')

gift_list = christmasRegex.findall(gifts)
print(gift_list)

['12 drummers', '11 pipers', '10 lords', '7 snowballs', '6 remote_controll', '5 rings', '4 birds', '3 cats', '2 doves', '1 tree']
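
The `[0-5]` class described above can be checked against its longer pipe form. This is a small sketch with a made-up input string; both patterns should pick out the same digits.

```python
import re

digits = "0 1 2 3 4 5 6 7 8 9"

with_class = re.findall(r"[0-5]", digits)
with_pipes = re.findall(r"(0|1|2|3|4|5)", digits)

print(with_class)
print(with_class == with_pipes)
```

The digits 6 to 9 are skipped by both patterns, and the comparison prints `True`.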


## The Caret and Dollar Sign Characters
You can use the caret symbol `^` at the start of a regex to indicate that a match must occur at the **beginning** of the searched text. Likewise, you can put a dollar sign `$` at the end of the regex to indicate that the string must **end** with this pattern. 

In [None]:
# Example 
import re
fromGermany = re.compile(r'^\+49')

phoneNumberGermany = "+491893876"
phoneNumberFrance = "338876549"

res_1 = fromGermany.search(phoneNumberGermany)
res_2 = fromGermany.search(phoneNumberFrance)

print(res_1.group())

In [None]:
# A phrase should end with a dot, an exclamation mark, or a question mark 
import re 
completePhrase = re.compile(r'[.!?]$') # inside [], list the characters without |, which would match a literal pipe
sentence = "Hello world?"

print(completePhrase.search(sentence).group())
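
Using `^` and `$` together is a common way to require that the *whole* string fits a pattern, which is useful for input validation. The following is a minimal sketch; the variable names and sample strings are chosen for this example only.

```python
import re

anywhere = re.compile(r"\d{3}-\d{3}-\d{4}")        # may match in the middle of a text
whole_string = re.compile(r"^\d{3}-\d{3}-\d{4}$")  # the text must be exactly a phone number

candidates = ["415-555-1234", "Call 415-555-1234 now"]
for text in candidates:
    found_anywhere = anywhere.search(text) is not None
    is_exact = whole_string.search(text) is not None
    print(text, found_anywhere, is_exact)
```

Both patterns accept the first string, but only the unanchored pattern finds a match in the second one.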

## The Wildcard Character
The `.` (or dot) character in a regular expression is called a wildcard and will match any character except for a newline. 

PS: Dot `.` will match just one character. 

In [56]:
import re 

atRegex = re.compile(r'.at') # the dot matches exactly one character, so only 'lat' of 'flat' is matched

sentence = "The cat in the hat sat on the flat mat."

res = atRegex.findall(sentence) 

print(res)

['cat', 'hat', 'sat', 'lat', 'mat']


# Utilities 
To build and test a regex interactively, you can use online tools such as this website: 

[regexr](https://www.regexr.com)


## Exercise Lab 1:
Write a Python program that matches a string that has an __i__ followed by anything and ends with __z__. It should print "Found a match!" or "Not matched!". 



In [1]:
import re

def text_match(text):
    # an "i", then any characters (as few as possible), then a "z" at the end of the string
    pattern = r'i.*?z$'
    if re.search(pattern, text):
        return 'Found a match!'
    else:
        return 'Not matched!'

print(text_match("iizzzzd"))
print(text_match("iizAbbbc"))
print(text_match("iccddbbjjjz"))
print(text_match("ccddibbjjjz"))



Not matched!
Not matched!
Found a match!
Found a match!


## Exercise Lab 2: 
Write a Python program to check for a number at the end of a string. It should return `True` if there is a match, or `False` if there is no match. 



In [2]:
import re

def end_num(string):
    # any characters, followed by a digit at the very end of the string
    pattern = re.compile(r".*[0-9]$")
    return pattern.match(string) is not None


In [4]:
print(end_num("fddds124"))

True


## Exercise Lab 3: 
Write a function that extracts the year, month, and day from a URL. It should return them as a list containing one `(year, month, day)` tuple per date found.



In [2]:
import re

def extract_date(url):
    # each match is returned as a (year, month, day) tuple of strings
    return re.findall(r'/(\d{4})/(\d{1,2})/(\d{1,2})/', url)

url1 = "https://www.washingtonpost.com/news/football-insider/wp/2016/09/02/odell-beckhams-fame-rests-on-one-stupid-little-ball-josh-norman-tells-author/"
print(extract_date(url1))


[('2016', '09', '02')]


## Case-Insensitive Matching 
Normally, regular expressions match text with the exact casing you specify. Each of the following regex patterns matches a different string: 


In [None]:
regex1 = re.compile('Christmas')
regex2 = re.compile('CHRISTMAS')
regex3 = re.compile('CHRISTmas')
regex4 = re.compile('christmas')

To ignore case, you can pass `re.IGNORECASE` (or its short form `re.I`) as the second argument to `re.compile()`: 

In [None]:
regexChristmas = re.compile(r'christmas', re.IGNORECASE)
sentence_1 = "Christmas is coming! "
sentence_2 = "I love CHRISTMAS, it's the best time of the year christmas."

res = regexChristmas.findall(sentence_2) 
print(res)
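
To see what the flag changes, the same pattern can be compiled with and without it and run on the same sentence. This is a minimal sketch using the short form `re.I`; the variable names are chosen for this example only.

```python
import re

sentence = "I love CHRISTMAS, it's the best time of the year christmas."

case_sensitive = re.compile(r"christmas")
case_insensitive = re.compile(r"christmas", re.I)  # re.I is the short form of re.IGNORECASE

print(case_sensitive.findall(sentence))
print(case_insensitive.findall(sentence))
```

Without the flag only the lowercase occurrence is found; with it, both occurrences are returned in their original casing.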
