# Regular Expressions

***

https://docs.python.org/3/library/re.html

https://realpython.com/regex-python/

https://realpython.com/regex-python-part-2/

https://developers.google.com/edu/python/regular-expressions

https://librarycarpentry.org/lc-data-intro/01-regular-expressions/



## Python's re module
The `re` module is part of Python's standard library and has extensive documentation.
It provides many functions, such as `re.search`, `re.split` and `re.sub`. `re.split` is particularly handy.

Regular expressions find patterns in text. 

In [1]:
import re

In [2]:
# A string to be manipulated
original = "Words, words, words."

# The pattern/regular expression to use on the above string
pattern_1 = r'\W+'

# Splits a string into substrings using a regular expression. 
result_1 = re.split(pattern_1, original)

# Print the result
print(result_1)

['Words', 'words', 'words', '']


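In an ordinary string literal, a backslash usually means 'escape the following character', for example:
`\r` carriage return
`\n` new line

An `r` in front of a string, as in `pattern_1` above, makes it a 'raw' string. Backslashes in a raw string are kept as literal characters instead of starting an escape sequence, so they reach the regular expression engine unchanged.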
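
A minimal sketch of the difference a raw string makes. The variable names `normal` and `raw` are only illustrative.

```python
import re

# In a normal string literal, "\n" is one character: a newline.
normal = "\n"
# In a raw string literal, r"\n" is two characters: a backslash and the letter n.
raw = r"\n"

print(len(normal))  # 1
print(len(raw))     # 2

# \b means "word boundary" to the regex engine, but in a normal string
# literal "\b" is turned into a backspace character before re ever sees it.
print(re.findall(r"\bwords\b", "Words, words, words."))  # ['words', 'words']
print(re.findall("\bwords\b", "Words, words, words."))   # []
```

Writing every pattern as a raw string avoids having to remember which escapes Python would otherwise interpret first.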

`re.split` takes a pattern and a string and returns a list of strings.

`\W` matches any single character that is not a word character. It does not mean white space; that is `\s`.

`\W` = ``[^a-zA-Z0-9_]`` matches any character other than a lower-case letter, an upper-case letter, a digit or an underscore. It matches a single character. (For `str` patterns, Python also counts non-ASCII letters and digits as word characters unless the `re.ASCII` flag is set.)

A `^` at the start of a character class means NOT.

`\w` = ``[a-zA-Z0-9_]`` matches any lower-case letter, upper-case letter, digit or underscore. It also matches a single character.

`+` is a quantifier. It says how many of the previous thing to match: one or more. It's greedy, so it matches as many as it can.

The final substring in the list is empty because the closing '.' matches `\W`. The split happens there, but nothing follows it, so the last piece is an empty string.
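
If the trailing empty string is unwanted, there are a couple of ways around it. The sketch below shows two options using only `re.findall`, `re.split` and a list comprehension; which one fits best depends on the data.

```python
import re

original = "Words, words, words."

# Splitting on non-word characters leaves a trailing empty string,
# because the final '.' is a separator with nothing after it.
print(re.split(r'\W+', original))   # ['Words', 'words', 'words', '']

# Option 1: collect the runs of word characters directly.
words = re.findall(r'\w+', original)
print(words)                        # ['Words', 'words', 'words']

# Option 2: keep re.split and drop any empty pieces.
pieces = [piece for piece in re.split(r'\W+', original) if piece]
print(pieces)                       # ['Words', 'words', 'words']
```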

In [3]:
original = "Words, words, words."

# The pattern/regular expression to use on the above string
pattern_2 = r'(\w+)'

# Search the string for the first place where the pattern matches.
result_2 = re.search(pattern_2, original)
print(result_2)

<re.Match object; span=(0, 5), match='Words'>


In [4]:
# maxsplit=1 stops after the first split; the rest of the string is returned as is.
re.split(r'\W+', 'Words, words, words.', maxsplit=1)

['Words', 'words, words.']

In [5]:
# Split on runs of the letters a-f in either case, leaving the digits between them.
re.split('[a-f]+', '0a3B9', flags=re.IGNORECASE)

['0', '3', '9']

## Real Python
Following this article: https://realpython.com/regex-python/

`re.search(<regex>, <string>)` scans the string for the first place where the regular expression matches.

Saying that a regular expression 'matches' a string can be ambiguous.

Sometimes you want the whole string to match the regular expression.

Sometimes you want only a substring to match.
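
The two cases can be sketched with `re.search` (match anywhere) and `re.fullmatch` (whole string must match). The pattern `[0-9]+` is just an illustrative choice.

```python
import re

pattern = r'[0-9]+'

# re.search succeeds if the pattern matches anywhere in the string.
print(re.search(pattern, 'foo123bar'))     # <re.Match object; span=(3, 6), match='123'>

# re.fullmatch succeeds only if the whole string matches the pattern.
print(re.fullmatch(pattern, 'foo123bar'))  # None
print(re.fullmatch(pattern, '123'))        # <re.Match object; span=(0, 3), match='123'>
```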

A set of characters specified in square brackets (`[]`) makes up a character class. A character class matches any single character that is in the class.


In [6]:
'abc' == 'abc'

True

In [7]:
'cda' == 'pda'

False

In [8]:
'abc' in 'cab'

False

In [9]:
'abc' in 'cababc'

True

In [10]:
'cbabac'.index('a')

2

In [11]:
'cbabac'[2]

'a'

In [12]:
'cbaabc'.find('aa')

2

In [13]:
s = 'foo123bar'

re.search('123', s)

<re.Match object; span=(3, 6), match='123'>

In [14]:
s[3:6]

'123'

In [15]:
s = 'foo123bar'
re.search(r'[0-9][0-9][0-9]', s)

<re.Match object; span=(3, 6), match='123'>

In [16]:
re.search(r'[0-9][0-9][0-9]', 'foo456bar')

<re.Match object; span=(3, 6), match='456'>

In [17]:
re.search(r'[0-9][0-9][0-9]', '234baz')

<re.Match object; span=(0, 3), match='234'>

In [18]:
re.search(r'[0-9][0-9][0-9]', 'qux678')

<re.Match object; span=(3, 6), match='678'>

In [19]:
print(re.search(r'[0-9][0-9][0-9]', '12foo34'))

None


In [20]:
s = 'foo123bar'
re.search('1.3', s)


<re.Match object; span=(3, 6), match='123'>

In [21]:
s = 'foo13bar'
print(re.search('1.3', s))

None
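
In the two cells above, the dot stands for any single character (except a newline by default), which is why `'1.3'` needs exactly one character between the `1` and the `3`. To match a literal dot instead, it can be escaped. A small sketch with made-up test strings:

```python
import re

# An unescaped dot matches any single character.
print(re.search(r'1.3', 'foo1x3bar'))    # <re.Match object; span=(3, 6), match='1x3'>

# An escaped dot matches only a literal '.'.
print(re.search(r'1\.3', 'foo1x3bar'))   # None
print(re.search(r'1\.3', 'foo1.3bar'))   # <re.Match object; span=(3, 6), match='1.3'>

# re.escape builds the escaped pattern from a literal string.
print(re.escape('1.3'))                  # 1\.3
```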


## Google Education
***
https://developers.google.com/edu/python/regular-expressions


In [22]:
import re

# Named text rather than str, so the built-in str type is not shadowed
text = 'an example word:cat!!'
match = re.search(r'word:\w\w\w', text)
# If-statement after search() tests if it succeeded
if match:
    print('found', match.group())  ## 'found word:cat'
else:
    print('did not find')

found word:cat


In [23]:
string = 'aaaaaa'
pattern = r'a+'

re.search(pattern, string)

<re.Match object; span=(0, 6), match='aaaaaa'>

In [24]:
string = 'aaaabaa'
pattern = r'a+'

re.search(pattern, string)

<re.Match object; span=(0, 4), match='aaaa'>

#### Repetitions

In [25]:
# i+ = one or more i's, as many as possible.
match = re.search(r'pi+', 'piiig') # found, match.group() == "piii"
match

<re.Match object; span=(0, 4), match='piii'>

In [26]:
# Finds the first/leftmost solution, and within it drives the +
# as far as possible (aka 'leftmost and largest').
# In this example, note that it does not get to the second set of i's.
match = re.search(r'i+', 'piigiiii') # found, match.group() == "ii"
match

<re.Match object; span=(1, 3), match='ii'>
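
As a rough illustration of 'leftmost and largest', the sketch below compares the greedy `i+` with its non-greedy form `i+?`, and uses `re.findall` to reach the later run of i's that `re.search` never gets to.

```python
import re

text = 'piigiiii'

# Greedy: take as many i's as possible at the leftmost position.
print(re.search(r'i+', text).group())    # ii

# Non-greedy: a trailing ? makes + take as few as possible.
print(re.search(r'i+?', text).group())   # i

# re.findall returns every non-overlapping match, not just the first.
print(re.findall(r'i+', text))           # ['ii', 'iiii']
```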

In [27]:
# \s* = zero or more whitespace chars
# Here look for 3 digits, possibly separated by whitespace.
match = re.search(r'\d\s*\d\s*\d', 'xx1 2   3xx') # found, match.group() == "1 2   3"
match

<re.Match object; span=(2, 9), match='1 2   3'>

In [28]:
match = re.search(r'\d\s*\d\s*\d', 'xx12  3xx') # found, match.group() == "12  3"
match

<re.Match object; span=(2, 7), match='12  3'>

In [29]:
match = re.search(r'\d\s*\d\s*\d', 'xx123xx') # found, match.group() == "123"
match

<re.Match object; span=(2, 5), match='123'>

In [30]:
# ^ = matches the start of string, so this fails:
match = re.search(r'^b\w+', 'foobar') # not found, match == None
match

In [31]:
# but without the ^ it succeeds:
match = re.search(r'b\w+', 'foobar') # found, match.group() == "bar"
match

<re.Match object; span=(3, 6), match='bar'>

#### Email example

In [32]:
# Common use of regex to search for email addresses
text = 'purple alice-b@google.com monkey dishwasher'
# \w does not match '-' or '.', so only part of the address is found
match = re.search(r'\w+@\w+', text)
if match:
    print(match.group())  ## 'b@google'

b@google


In [33]:
# Use [] to define a set of characters to match.
# Inside [], a dot (.) means a literal dot rather than 'any character'.
match = re.search(r'[\w.-]+@[\w.-]+', text)
if match:
    print(match.group())  ## 'alice-b@google.com'

alice-b@google.com
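
Building on the pattern above, parentheses can be added to pull out the username and host separately. This is a minimal sketch; the variable name `text` and the choice of two groups are illustrative assumptions, and the pattern is far from a full email validator.

```python
import re

text = 'purple alice-b@google.com monkey dishwasher'

# Parentheses create groups without changing what the pattern matches.
match = re.search(r'([\w.-]+)@([\w.-]+)', text)
if match:
    print(match.group())   # alice-b@google.com  (the whole match)
    print(match.group(1))  # alice-b             (first group: username)
    print(match.group(2))  # google.com          (second group: host)
```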


### Regex resources 
Regex golf 
https://alf.nu/RegexGolf?world=regex&level=r00

Regex101 explains the regular expression you enter
https://regex101.com/

Regexper draws a diagram of the regular expression you enter
https://regexper.com/

### Exercise 1 
***
Write a Python function to remove all non-alphanumeric characters from a string

In [34]:
# Adapted from https://bobbyhadz.com/blog/python-remove-non-alphanumeric-characters-from-string
# For a given string
my_str = 'Coding_ like- poetry| should{be}short#and; concise'

# Remove all non-alphanumeric characters, including white space and underscores, from the string
new_str = re.sub(r'[\W_]', '', my_str)

print(new_str)

Codinglikepoetryshouldbeshortandconcise
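
Since the exercise asks for a function, the same substitution could be wrapped up as below. The function name `remove_non_alphanumeric` is a hypothetical choice, not part of the original solution.

```python
import re


def remove_non_alphanumeric(text):
    """Return text with every character that is not a letter or digit removed."""
    # \W covers everything except word characters; _ is added because
    # \w counts the underscore as a word character.
    return re.sub(r'[\W_]', '', text)


print(remove_non_alphanumeric('Coding_ like- poetry| should{be}short#and; concise'))
# Codinglikepoetryshouldbeshortandconcise
```

Because `\w` is Unicode-aware for `str` patterns, accented letters are kept. Passing `flags=re.ASCII` to `re.sub` would restrict the result to ASCII letters and digits.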
