## Basic Regular Expressions 

- `^str`: starts with str
- `str`: contains str
- `str$`: ends with str
- `^.s`: the second character is s
- `c*`: matches c any number of times (including zero)
- `c+`: matches c one or more times
- `c?`: matches c at most once (zero or one time)
- `c{n}`: matches c exactly n times
- `c{n,}`: matches c at least n times
- `c{n,m}`: matches c at least n and at most m times
- `[abc]`: matches one character that is a, b, or c
- `[^abc]`: matches one character that is not a, b, or c
- `[0-9]`: matches one digit in the range zero to nine (likewise `[^0-9]`)
- `[a-z]`: matches one character from a to z (likewise `[^a-z]`) (not case sensitive in SQL `REGEXP` with a case-insensitive collation; case sensitive in Python)
- `str1|str2`: either str1 or str2
- `\\.`: matches a literal `.`
- `\\\\`: matches a literal `\` (in general: `\\special_char`)
- `\\d`: matches one digit

- Character classes
- Syntax: `WHERE string_column REGEXP '...[[:class:]]...'`
- They give us more options than plain ranges:
    - `alnum`: alphanumeric characters
    - `alpha`: alphabetic characters
    - `blank`: space and tab characters
    - `digit`: digit characters
    - `lower/upper`: lowercase/uppercase characters
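
To make the syntax above concrete, here is a minimal sketch that tries a few of the patterns with Python's built-in `re` module. The word list and the helper `matching_words` are made up for illustration. Two assumptions worth flagging: in a Python raw string a single backslash is enough (`r"\."` rather than `\\.`), and Python's `re` does not support the `[[:class:]]` bracket classes, so an explicit range such as `[a-zA-Z]` stands in for `alpha`.

```python
import re

# Hypothetical sample words, chosen only to exercise the patterns.
words = ["street", "cast", "psalm", "abc123", "a.b", "coool"]


def matching_words(pattern, candidates):
    """Return the candidates in which the pattern is found anywhere."""
    return [w for w in candidates if re.search(pattern, w)]


print(matching_words(r"^str", words))         # starts with str
print(matching_words(r"st$", words))          # ends with st
print(matching_words(r"^.s", words))          # second character is s
print(matching_words(r"o{3}", words))         # o exactly three times in a row
print(matching_words(r"[0-9]", words))        # contains a digit
print(matching_words(r"\.", words))           # contains a literal dot
print(matching_words(r"^[a-zA-Z]+$", words))  # alphabetic only (stand-in for alpha)
```

Because `re.search` looks for the pattern anywhere in the string, a bare pattern behaves like "contains", while `^` and `$` pin it to the start or the end.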

## Regular Expressions in Python

We can use regular expressions in Python through the built-in module `re`.

The functions that we have available are:
* `findall(expr, text)`: Returns a list containing all matches in the text.
* `search(expr, text)`: Returns a Match object for the first match anywhere in the text, or `None` if there is no match.
* `split(expr, text)`: Returns a list where the string has been split at each match (takes `maxsplit` as an optional argument).
* `sub(expr, repl, text)`: Replaces one or more matches with a string (takes `count` as an optional argument that limits the number of replacements).
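
The four functions can be tried side by side in a short sketch. The sample sentence is invented for illustration; `maxsplit` and `count` are passed as keyword arguments, which is the form that works across current Python versions.

```python
import re

# Made-up sample text for illustration.
sample = "apples 3, pears 12, plums 7"

# All non-overlapping matches, as a list of strings.
numbers = re.findall(r"\d+", sample)

# First match only; a Match object (or None when nothing matches).
first = re.search(r"\d+", sample)

# Split at the first comma only.
parts = re.split(r",\s*", sample, maxsplit=1)

# Replace only the first two numbers.
masked = re.sub(r"\d+", "N", sample, count=2)

print(numbers)        # ['3', '12', '7']
print(first.group())  # 3
print(parts)          # ['apples 3', 'pears 12, plums 7']
print(masked)         # apples N, pears N, plums 7
```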

## Use in NLP


Regular expressions are all about matching patterns in a text and capturing/returning key information from it.

In many cases, when dealing with easier problems, we can solve them entirely with regular expressions instead of building models. There is no need to put more obstacles in our own way.

For example, a simple NLP problem is a chatbot that, based on what you ask it, should suggest some actions.
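
As a rough illustration of that idea, the sketch below maps a few patterns to suggested actions. Everything in it is hypothetical: the rules in `INTENT_RULES`, the action texts, and the `suggest_action` helper are invented for this example and are not part of any chatbot library.

```python
import re

# Hypothetical intent rules: each pattern is paired with a suggested action.
INTENT_RULES = [
    (r"\b(refund|money back)\b", "Open the refund form"),
    (r"\b(track|where is)\b.*\border\b", "Show the order tracking page"),
    (r"\b(password|log ?in)\b", "Send a password reset link"),
]


def suggest_action(message):
    """Return the action of the first rule that matches, else a fallback."""
    for pattern, action in INTENT_RULES:
        if re.search(pattern, message, flags=re.IGNORECASE):
            return action
    return "Forward to a human agent"


print(suggest_action("Where is my order?"))    # Show the order tracking page
print(suggest_action("I forgot my password"))  # Send a password reset link
print(suggest_action("Tell me a joke"))        # Forward to a human agent
```

The rules are checked in order, so more specific patterns should come before more general ones.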

In [None]:
import re

# To build and test regular expressions more easily we can use: https://regex101.com/

## Regular Expressions for ChatBots

Let's say that, based on some text, we want to extract the phone number, which can have 3 representations (formats):
1. 1234567891
2. +12... 1234567891
3. (123)-456-7891

The same analysis can be done for various other examples, such as the format of an email address.

In [None]:
text_1 = "I live in Greece and my phone number is 6972366276 and 6994044141"
pattern_1 = r"\d{10}"  # raw string, so the backslash reaches the regex engine unchanged

# To get the phone numbers we use:
phone_number = re.findall(pattern_1, text_1)

print(phone_number)

['6972366276', '6994044141']


In [None]:
text_2 = "I live in Greece and my phone number is +30 6972366276"
pattern_2 = r"\+\d{1,} \d{10}"  # '+', country code, a space, then 10 digits

# To get the phone number we use:
phone_number = re.findall(pattern_2, text_2)

print(phone_number)

['+30 6972366276']


In [None]:
text_3 = "I live in Greece and my phone number is (697)-236-6276"
pattern_3 = r"\(\d{3}\)-\d{3}-\d{4}"  # escaped parentheses match literal '(' and ')'

# To get the phone number we use:
phone_number = re.findall(pattern_3, text_3)

print(phone_number)

['(697)-236-6276']


In [None]:
# Capturing at the same time the different phone number formats

text = text_1 + '\n' + text_2 + '\n' + text_3
pattern = pattern_1 + '|' + pattern_2 + '|' + pattern_3

phone_numbers = re.findall(pattern, text)

print(phone_numbers)

['6972366276', '6994044141', '+30 6972366276', '(697)-236-6276']
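
Once several formats are captured together, it can help to bring them to one canonical form. The sketch below is one possible way to do that and is not the notebook's own method: `PHONE_PATTERN` rewrites the three formats as a single verbose pattern, and `normalize` is a hypothetical helper that keeps only the digits and a leading `+`. The sample sentence is made up.

```python
import re

# The three formats from above as one commented pattern.
# In VERBOSE mode plain whitespace is ignored, so the space is written as \s.
PHONE_PATTERN = re.compile(
    r"""
    \+\d+\s\d{10}              # country code, then 10 digits
    | \(\d{3}\)-\d{3}-\d{4}    # area code in parentheses, dash separated
    | \d{10}                   # 10 plain digits
    """,
    re.VERBOSE,
)


def normalize(number):
    """Keep only the digits, plus a leading '+' when the number has one."""
    digits = re.sub(r"\D", "", number)
    return "+" + digits if number.startswith("+") else digits


sample = "Call 6972366276, +30 6972366276 or (697)-236-6276"
found = PHONE_PATTERN.findall(sample)

print(found)                          # ['6972366276', '+30 6972366276', '(697)-236-6276']
print([normalize(n) for n in found])  # ['6972366276', '+306972366276', '6972366276']
```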


In [None]:
# A matching example for email addresses

text_1 = "pjf@gmail.com"
text_2 = "pjf@gmail.gr"
text_3 = "pjFloratos@gmail.com"
text_4 = "p_jFlo@ntua.mail.gr"
text_5 = "pjf009@yahoo.gr"

text = text_1 + '\n' + text_2 + '\n' + text_3 + '\n' + text_4 + '\n' + text_5
base_pattern = r"[a-zA-Z0-9_]+@[a-z]+\."
# The longer alternative goes first; otherwise "ntua.mail.gr" is cut off at "ntua.mail"
pattern = base_pattern + r"[a-z]+\." + r"[a-z]+" + '|' + base_pattern + r"[a-z]+"

email = re.findall(pattern, text)

print(email)

['pjf@gmail.com', 'pjf@gmail.gr', 'pjFloratos@gmail.com', 'p_jFlo@ntua.mail.gr', 'pjf009@yahoo.gr']


Let's look at a more difficult example:
we have a text and we want to extract the order number from it.

In [None]:
text_1 = "Hello, I am having an issue with my order that has the number 369369369"
text_2 = "Hello, I am having an issue, 369369369 is my order"

text = text_1 + '\n' + text_2
pattern = r"[oO]rder[^\d]*(\d{1,})" + '|' + r"(\d{1,})[^\d]*[oO]rder"

# With multiple '()' groups, each match is a tuple with one element per group; a group that did not take part is an empty string
order = []
match_1 = re.findall(pattern, text)
for m in match_1:
    if m[0] != '':
        order.append(m[0])
    if m[1] != '':
        order.append(m[1])

print(order)

['369369369', '369369369']
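
The loop over empty tuple elements can be avoided with `re.finditer`, which yields Match objects instead of tuples. This is a sketch of an alternative, not a replacement for the cell above; `ORDER_PATTERN` and `order_numbers` are names introduced here for illustration.

```python
import re

# Same two alternatives as above; \D is shorthand for [^\d].
ORDER_PATTERN = re.compile(r"[oO]rder\D*(\d+)|(\d+)\D*[oO]rder")


def order_numbers(message):
    """Collect the order number from whichever alternative matched."""
    numbers = []
    for match in ORDER_PATTERN.finditer(message):
        # Only one of the two groups takes part in a match; the other is None.
        numbers.append(match.group(1) or match.group(2))
    return numbers


print(order_numbers("My order number is 369369369"))  # ['369369369']
print(order_numbers("369369369 is my order"))         # ['369369369']
print(order_numbers("I have no number at hand"))      # []
```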


## Regular Expressions for Information Extraction

In [None]:
text = """Born	Elon Reeve Musk
        June 28, 1971 (age 51)
        Pretoria, Transvaal, South Africa
Education University of Pennsylvania (BA, BS)
Title	Founder, CEO and chief engineer of SpaceX
        CEO and product architect of Tesla, Inc.
        CEO of Twitter, Inc.
        President of the Musk Foundation
        Founder of the Boring Company
        Co-founder of Neuralink, OpenAI, Zip2 and X.com (now part of PayPal)
Spouses	Justine Wilson
        (m. 2000; div. 2008)
        Talulah Riley
        (m. 2010; div. 2012)
        (m. 2013; div. 2016)
Partner  Grimes (2018–2021)[1]
Children 10[a][3]
Parents	 Errol Musk (father)
         Maye Musk (mother)
"""

print(text)

Born	Elon Reeve Musk
        June 28, 1971 (age 51)
        Pretoria, Transvaal, South Africa
Education University of Pennsylvania (BA, BS)
Title	Founder, CEO and chief engineer of SpaceX
        CEO and product architect of Tesla, Inc.
        CEO of Twitter, Inc.
        President of the Musk Foundation
        Founder of the Boring Company
        Co-founder of Neuralink, OpenAI, Zip2 and X.com (now part of PayPal)
Spouses	Justine Wilson
        (m. 2000; div. 2008)
        Talulah Riley
        (m. 2010; div. 2012)
        (m. 2013; div. 2016)
Partner  Grimes (2018–2021)[1]
Children 10[a][3]
Parents	 Errol Musk (father)
         Maye Musk (mother)



In [None]:
# Getting some information

In [None]:
age = re.findall(r"age (\d{1,})", text)

print(age)

['51']


In [None]:
name = re.findall(r"^[a-zA-Z]*\s+([^\n]+)", text)

print(name)

['Elon Reeve Musk']


In [None]:
birth_date = re.findall(r"[a-zA-Z]* \d{2}, \d{4}", text)

print(birth_date)

['June 28, 1971']


In [None]:
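# The last part allows spaces, so a two-word country such as "South Africa" is captured whole
birth_place = re.findall(r".*\)\n\s*([a-zA-Z]+, [a-zA-Z]+, [a-zA-Z ]+)", text)

print(birth_place)

['Pretoria, Transvaal, South Africa']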
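
The individual lookups can also be gathered into one pass that returns a dictionary. The sketch below uses a shortened copy of the same infobox text so that it runs on its own; `FIELD_PATTERNS`, `extract_fields`, and the `education` pattern are introduced here for illustration and assume the layout shown above.

```python
import re

# Shortened copy of the infobox text, so the example is self-contained.
infobox = """Born\tElon Reeve Musk
        June 28, 1971 (age 51)
        Pretoria, Transvaal, South Africa
Education University of Pennsylvania (BA, BS)
"""

# One pattern per field; each captures its value in group 1.
FIELD_PATTERNS = {
    "name": r"^[a-zA-Z]*\s+([^\n]+)",
    "birth_date": r"([a-zA-Z]+ \d{1,2}, \d{4})",
    "age": r"age (\d+)",
    "education": r"Education\s+([^\n(]+)",
}


def extract_fields(text):
    """Return a dict of field name to first captured value (None if absent)."""
    fields = {}
    for field, pattern in FIELD_PATTERNS.items():
        match = re.search(pattern, text)
        fields[field] = match.group(1).strip() if match else None
    return fields


info = extract_fields(infobox)
print(info)
```

Using `re.search` with one capturing group per field gives a single value or `None`, which is easier to handle downstream than the lists returned by `findall`.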
