# Regex in Python

## basics

- useful links:
    - [pythex](https://pythex.org/)

In [2]:
# importing the regex module
import re

In [3]:
# example of basic regex for phone number
phoneNumRegex = re.compile(r'\d\d\d-\d\d\d-\d\d\d\d')
# \d matches any single digit at that position

In [4]:
# the variable now holds a compiled regex object; search() looks for it in a string
mo = phoneNumRegex.search('my number is 415-555-4242.')

The search() method scans the string for the first place where the regex matches.
<br>If nothing matches, it returns None.
<br>If a match is found, it returns a match object whose group() method
<br>returns the matched text.

In [5]:
# mo is just a conventional name for the match object
print(f'Phone number is {mo.group()}')

Phone number is 415-555-4242


The group() method returns the matched text.
<br>If search() found nothing it returns None, so calling group() on that result raises an AttributeError.
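
A minimal sketch of how the None case could be handled before calling group(). The helper name `find_phone` is hypothetical and only used for illustration:

```python
import re

phoneNumRegex = re.compile(r'\d\d\d-\d\d\d-\d\d\d\d')

def find_phone(text):
    """Return the first phone number in text, or None if there is none."""
    mo = phoneNumRegex.search(text)
    # search() returns None when nothing matches, so test before calling group()
    if mo is None:
        return None
    return mo.group()

found = find_phone('my number is 415-555-4242.')
missing = find_phone('no number here')
print(found)
print(missing)
```

Checking for None first avoids the AttributeError that `None.group()` would raise.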

## parentheses

In [6]:
# parentheses create groups in the regex
phoneNumRegex = re.compile(r'(\d\d\d)-(\d\d\d)-(\d\d\d\d)')

In [7]:
mo = phoneNumRegex.search('my phone number is 415-555-4242')

In [8]:
mo.group(1), mo.group(2), mo.group(3)

('415', '555', '4242')

In [9]:
mo.group(0)

'415-555-4242'

In [10]:
mo.groups()
# returns a tuple

('415', '555', '4242')

In [11]:
# we can unpack them too
areaCode, mainNumber_1, mainNumber_2 = mo.groups()

In [12]:
print(areaCode, mainNumber_1, mainNumber_2, sep='\n')

415
555
4242

## the escape character

In a regex the backslash \ is the escape character:
<br> placed before a special character, it makes that character
<br> match literally instead of carrying its special meaning.
<br> To match a phone number written like (415) 555-4242
<br> we have to tell the regex that ( and ) are ordinary characters
<br> to match, not group delimiters.

In [13]:
phoneNumRegex = re.compile(r'(\(\d\d\d\)) (\d\d\d-\d\d\d\d)')

In [14]:
mo = phoneNumRegex.search('My phone number is (415) 555-4242.')

In [15]:
mo.group(1), mo.group(2)

('(415)', '555-4242')

\( and \) in the raw string match actual parentheses.
<br> The characters with a special meaning are:

In [16]:
print(r'. ^ $ * + ? {  }  [  ]  \  |  (  )')

. ^ $ * + ? {  }  [  ]  \  |  (  )


If you want to match them literally, you need to escape them with a backslash:

In [17]:
print(r'\. \^ \$ \* \+ \? \{ \} \[ \] \\ \| \( \)')

\. \^ \$ \* \+ \? \{ \} \[ \] \\ \| \( \)


Escaping is simple in principle but easy to get wrong,
<br> so be careful when you use it.
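
When the text to match is known in advance, the standard library's `re.escape()` can add the backslashes for us. A small sketch, assuming we want to find the literal string `(415) 555-4242`:

```python
import re

literal = '(415) 555-4242'
text = 'My phone number is (415) 555-4242.'

# used as-is, the parentheses are read as a group, so nothing matches here
unescaped = re.search(literal, text)

# re.escape() backslash-escapes the characters that could be special
escaped = re.compile(re.escape(literal)).search(text)

print(unescaped)
print(escaped.group())
```

This is handy mostly for fixed strings; hand-written patterns still need their own backslashes.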

## the pipe regex

| is called the pipe and is used to match one of several alternative expressions:

In [18]:
first_question = 'Do you know where Bob is? And what about Alice?'
second_question = 'Have you seen Alice recently? Bob\'s girlfriend'

In [19]:
pipeRegex = re.compile(r'Bob|Alice')
searching_bob = pipeRegex.search(first_question)

In [20]:
searching_bob.group()

'Bob'

In [21]:
searching_alice = pipeRegex.search(second_question)
searching_alice.group()

'Alice'

Using parentheses, we can also match one of several words that begin with the same characters:

In [22]:
question = 'Is that a serpent?'

In [23]:
pipeRegex = re.compile(r's(nake|erpent)')
mo = pipeRegex.search(question)
mo.group()

'serpent'

In [24]:
mo.group(1)

'erpent'
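
The two questions above returned different names with the same regex. A short sketch of why, using made-up sentences: search() reports whichever alternative occurs first in the text, not the one written first in the pattern.

```python
import re

pipeRegex = re.compile(r'Bob|Alice')

# same pattern, different texts: the name that appears earlier wins
first = pipeRegex.search('Alice and Bob went out').group()
second = pipeRegex.search('Bob and Alice went out').group()

print(first)
print(second)
```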

## the utility of the question mark

The question mark ? marks the group before it as optional: the regex matches whether or not that piece of text is present.
<br> A useful case is a phone number with an optional country prefix:

In [25]:
contact1 = 'You can contact me at +39 567 789 3409'
contact2 = 'You can contact me at 567 789 3409'

In [26]:
phoneNumRegex = re.compile(r'(\+\d\d )?\d\d\d \d\d\d \d\d\d\d')

In [27]:
phoneNum1 = phoneNumRegex.search(contact1)
phoneNum1.group()

'+39 567 789 3409'

In [28]:
phoneNum2 = phoneNumRegex.search(contact2)
phoneNum2.group()

'567 789 3409'
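
A quick sketch of what happens to the optional group itself: when the prefix is missing, the match still succeeds but group(1) is None. The variable names below are only illustrative.

```python
import re

phoneNumRegex = re.compile(r'(\+\d\d )?\d\d\d \d\d\d \d\d\d\d')

with_prefix = phoneNumRegex.search('You can contact me at +39 567 789 3409')
without_prefix = phoneNumRegex.search('You can contact me at 567 789 3409')

# repr() makes the trailing space captured by the group visible
print(repr(with_prefix.group(1)))
# the optional group did not take part in this match
print(without_prefix.group(1))
```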

## Specific repetitions

We can ask for a specific number of repetitions using braces { }:
<br> we put the part of the regex to repeat in parentheses and, right after it,
<br> the braces with the number of times it should be repeated.
<br> We can use this to improve our regex for phone numbers:

In [29]:
betterRegex = re.compile(r'(\+\d\d )?(\d\d\d ){2}\d\d\d\d')

In [30]:
phoneNum1 = betterRegex.search(contact1)
phoneNum1.group()

'+39 567 789 3409'

Let's see a simpler version:

In [31]:
repRegex = re.compile(r'(\d\d\d){2}')
repRegex.search('random random 123456 rand').group()

'123456'

Writing a single number between the braces matches
<br> the group only when it is repeated exactly that many times.
<br> {,5} matches from 0 to 5 repetitions,
<br> {5,} matches 5 or more repetitions and
<br> {3,5} matches between 3 and 5 repetitions (with no space after the comma,
<br> otherwise the braces are treated as literal text).
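
The forms above can be compared side by side. This is only a sketch: the helper `repeats` is hypothetical, and it uses `re.fullmatch()` so that the whole test string has to fit the pattern.

```python
import re

def repeats(pattern, text):
    """Return True when the whole text matches the pattern."""
    return re.fullmatch(pattern, text) is not None

# test strings: 'ha' repeated 1, 2, 3, 4 and 5 times
samples = ['ha' * n for n in range(1, 6)]

exactly_three = [repeats(r'(ha){3}', s) for s in samples]
three_or_more = [repeats(r'(ha){3,}', s) for s in samples]
up_to_two = [repeats(r'(ha){,2}', s) for s in samples]
three_to_four = [repeats(r'(ha){3,4}', s) for s in samples]

print(exactly_three)
print(three_or_more)
print(up_to_two)
print(three_to_four)
```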

## greediness

We saw that (\d\d\d){3,5} matches 3, 4 or 5 groups of three digits.
<br> In ambiguous situations Python's regex returns the longest
<br> possible match because quantifiers are greedy by default; a quantifier that
<br> matches the shortest possible text is called lazy or non-greedy.
<br> A lazy quantifier in Python is obtained by putting a question
<br> mark right after the closing brace.
<br> Note that in this case the question mark has nothing to do with
<br> the optional matching seen earlier.

In [32]:
numbers = '123456789'

# matching between 3 and 6 repetitions of
# a group of 2 digits
greedyRegex = re.compile(r'(\d\d){3,6}')
greedyRegex.search(numbers).group()

'12345678'

In [33]:
# creating a non-greedy regex
noGreedyRegex = re.compile(r'(\d\d){3,6}?')
noGreedyRegex.search(numbers).group()

'123456'
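
To keep the two roles of the question mark apart, here is a small sketch with made-up strings: after a quantifier it means "lazy", after a group it means "optional".

```python
import re

digits = 'id 12345'

# greedy: takes as many digits as the braces allow
greedy = re.search(r'\d{2,4}', digits).group()
# lazy: the ? after the braces stops at the minimum
lazy = re.search(r'\d{2,4}?', digits).group()

# here ? follows a group, not a quantifier, so the group is optional
optional = re.search(r'id( \d)?', 'id').group()

print(greedy, lazy, optional)
```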

## the findall method

The findall() method returns not just the first match
<br> but a list of all the non-overlapping matches in the string (an empty list if there are none).

In [34]:
twoNumbers = '2 multiplied by 10 will be 20'

In [35]:
findallRegex = re.compile(r'\d\d')
findallRegex.findall(twoNumbers)

['10', '20']

but a better example is one with phone numbers:

In [36]:
phoneNumbers = 'my personal number is 123 789 1256 while the one for work is +39 745 856 1965.'

In [37]:
phoneNumbersRegex = re.compile(r'(\+\d\d )?(\d\d\d ){2}\d\d\d\d')
phoneNumbersRegex.findall(phoneNumbers)
# the regex contains groups, so findall() returns tuples of the groups instead of
# the full matches, and a repeated group keeps only its last repetition

[('', '789 '), ('+39 ', '856 ')]

In [38]:
# with findall() it's better to write the regex explicitly,
# with one pair of parentheses for each part we want back
phoneNumbersRegex = re.compile(r'(\+\d\d )?(\d\d\d )(\d\d\d )(\d\d\d\d)')
phoneNumbersRegex.findall(phoneNumbers)

[('', '123 ', '789 ', '1256'), ('+39 ', '745 ', '856 ', '1965')]
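
If we only want the full numbers back, two alternatives are worth knowing. Neither appears in the cells above, so treat this as a sketch: non-capturing groups `(?:...)` make findall() return whole matches, and `finditer()` yields match objects whose group() works as usual.

```python
import re

phoneNumbers = ('my personal number is 123 789 1256 '
                'while the one for work is +39 745 856 1965.')

# (?:...) groups without capturing, so findall() returns the whole matches
nonCapturing = re.compile(r'(?:\+\d\d )?(?:\d\d\d ){2}\d\d\d\d')
full_matches = nonCapturing.findall(phoneNumbers)

# finditer() gives match objects, so group() returns the full text
capturing = re.compile(r'(\+\d\d )?(\d\d\d ){2}\d\d\d\d')
from_iter = [mo.group() for mo in capturing.finditer(phoneNumbers)]

print(full_matches)
print(from_iter)
```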

## character classes

Until now we have used the \d shortcut, which behaves
<br>like (0|1|2|3|4|5|6|7|8|9).

shortcut | behavior
--- | ---
\d | matches any digit
\D | matches any character that isn't a digit
\s | matches any whitespace character (space, tab, newline)
\S | matches any character that isn't whitespace
\w | matches any letter, digit or underscore
\W | matches any character that isn't a letter, digit or underscore
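
A short sketch of the shortcuts in the table, run on a made-up sample string that contains a space, a tab and an underscore:

```python
import re

sample = 'room_7 is\topen!'

digits = re.findall(r'\d', sample)
whitespace = re.findall(r'\s', sample)
non_word = re.findall(r'\W', sample)
# \w keeps letters, digits and the underscore
word_chars = ''.join(re.findall(r'\w', sample))

print(digits)
print(whitespace)
print(non_word)
print(word_chars)
```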

In [39]:
# regex that matches any lowercase vowel
vRegex = re.compile(r'[aeiou]')
vRegex.findall('alphabet')

['a', 'a', 'e']

In [40]:
# to match any digit between 1 and 6 we simply write a range:
Regex = re.compile(r'[1-6]')
# and the same works for letters:
capitalRegex = re.compile(r'[A-Z]')
# inside your own class most special characters, such as ? . ( ) etc.,
# need no \ ; only \ ] ^ and - can still need one, depending on position
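
A small sketch of custom classes that contain punctuation, with made-up sample strings. The names `endMarks` and `dashOrDigit` are only illustrative.

```python
import re

# . ? and ! are written without a backslash inside the brackets
endMarks = re.compile(r'[.?!]')
marks = endMarks.findall('Hi! How are you? Fine.')

# a hyphen placed last in the class is literal rather than a range
dashOrDigit = re.compile(r'[0-9-]')
pieces = dashOrDigit.findall('call 555-42')

print(marks)
print(pieces)
```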

In [46]:
# putting ^ as the first item of our custom class
# makes it match any character that isn't in it
nonCapitalRegex = re.compile(r'[^A-Z]')
nonCapitalRegex.findall('Welcome')

['e', 'l', 'c', 'o', 'm', 'e']

We can require that a pattern appears at the
<br>beginning or at the end of a string with the ^ (caret) and $ (dollar) characters.

In [54]:
string = 'Welcome to Arizona'

In [55]:
welcomeFirst = re.compile(r'^Welcome')
welcomeFirst.search(string)

<re.Match object; span=(0, 7), match='Welcome'>

In [56]:
arizonaLast = re.compile(r'Arizona$')
arizonaLast.search(string)

<re.Match object; span=(11, 18), match='Arizona'>
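
Using both anchors together requires the whole string to fit the pattern. A minimal sketch, assuming we want to accept only strings made of exactly four digits:

```python
import re

fourDigits = re.compile(r'^\d\d\d\d$')

whole = fourDigits.search('2024')
too_long = fourDigits.search('20245')
embedded = fourDigits.search('year 2024')

print(whole.group())
print(too_long)
print(embedded)
```

Without the anchors, the same four-digit pattern would also match inside the longer strings.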