<a href="https://colab.research.google.com/github/OSGeoLabBp/tutorials/blob/master/english/python/regexp_in_python.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

##Regular expressions in Python

Regular expressions (regexps) are a powerful tool for describing and matching text patterns in text processing. Several text editors (e.g. Notepad++, vi) and most programming languages have regexp functionality.

To define text patterns, a special meaning is assigned to some characters. You can find below a very short and incomplete list of special regexp characters:

|character(s)|explanation                                                      |
|------------|-----------------------------------------------------------------|
|. (dot)     | any character except new line                                   |
|^           |beginning of the line                                            |
|$           |end of the line                                                  |
|[abc]       |any character from the set in the brackets                       |
|[^abc]      |any character not in the set in the brackets                     |
|[a-z]       |any character from the range in brackets (inclusive)             |
|[^a-z]      |any character not in the range in the brackets                   |
|( )         |make group in pattern                                            |
|{min,max}   |repetition of the previous character or group, the max part is optional|
|p1 \| p2    |p1 pattern or p2 pattern                                         |
|p\*         |any number of repetitions of the p pattern, including zero; equivalent to p{0,}|
|p+          |one or more repetitions of the p pattern, equivalent to p{1,}    |
|p?          |zero or one repetition of the p pattern, equivalent to p{0,1}    |
|\           |escape the special meaning of the next character (e.g. \. the dot character, not any character)|
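
The short sketch below tries out a few entries of the table. The sample strings and variable names are made up for illustration only.

```python
import re

# [set] and [range]: a vowel followed by a digit
vowel_digit = re.findall(r'[aeiou][0-9]', 'a1 b2 e3 x4')
print(vowel_digit)                    # ['a1', 'e3']

# {min,max}: runs of two to four digits (matching is greedy)
digit_runs = re.findall(r'[0-9]{2,4}', '1 22 333 55555')
print(digit_runs)                     # ['22', '333', '5555']

# ( ), | and ?: a group with two alternatives and an optional 's'
m = re.search(r'(cat|dog)s?', 'two dogs')
print(m.group(0), m.group(1))         # dogs dog

# . versus \. : any character versus a literal dot
any_char = re.findall(r'a.c', 'abc a.c')
literal_dot = re.findall(r'a\.c', 'abc a.c')
print(any_char, literal_dot)          # ['abc', 'a.c'] ['a.c']
```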


Python's standard library has a module named *re* to handle regular expressions. It has to be imported before use:

In [1]:
import re

Let's look at some examples using regexps.

##Pattern in string

In [10]:
text = """Python is an interpreted high-level general-purpose programming language. 
Its design philosophy emphasizes code readability with its use of significant indentation. 
Its language constructs as well as its object-oriented approach aim to help programmers write clear, 
logical code for small and large-scale projects."""   # citation from Wikipedia

*re.match* looks for the pattern only at the beginning of the string. It returns a match object, or *None* if the pattern is not found there.

In [6]:
re.match("Python", text)      # is Python at the beginning of the text?

<re.Match object; span=(0, 6), match='Python'>

In [8]:
if re.match("[Pp]ython", text): # is Python or python at the beginning of the text?
  print('text starts with Python')

text starts with Python


In [16]:
result = re.match("[Pp]ython", text)
result.span(), result.group(0)


((0, 6), 'Python')

*re.search* finds the first occurrence of the pattern anywhere in the string.

In [18]:
re.search('prog', text)

<re.Match object; span=(52, 56), match='prog'>

In [20]:
re.search('levels?', text)        # optional 's' after level

<re.Match object; span=(30, 35), match='level'>

In [21]:
re.findall('pro', text)

['pro', 'pro', 'pro', 'pro']
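
The three functions used so far differ in where they look and in what they return. A minimal side-by-side sketch, using a hypothetical sample string:

```python
import re

sample = 'Learning Python is fun, Python is popular'   # hypothetical sample

at_start = re.match('Python', sample)      # only checks position 0
first = re.search('Python', sample)        # first occurrence anywhere
every = re.findall('Python', sample)       # list of all matched strings

print(at_start)        # None
print(first.span())    # (9, 15)
print(every)           # ['Python', 'Python']
```

Note that *re.match* and *re.search* return a match object (or *None*), while *re.findall* returns a plain list of strings.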

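The *r* prefix (raw string) is often used for regular expressions, so that Python does not interpret the backslashes itself before the pattern reaches the regexp engine.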
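
The sketch below illustrates why the prefix matters. It uses `\b`, which is a backspace character in a normal Python string but a word boundary for the regexp engine. The sample string is hypothetical.

```python
import re

print(len('\bis\b'))     # 4: each '\b' became a single backspace character
print(len(r'\bis\b'))    # 6: the backslashes are kept as written

sample = 'this is it'
no_raw = re.findall('\bis\b', sample)    # searches for backspace + 'is' + backspace
raw = re.findall(r'\bis\b', sample)      # searches for the whole word 'is'
print(no_raw)            # []
print(raw)               # ['is']
```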

In [25]:
re.findall(r'[ \t\r\n]a[a-zA-Z0-9_][ \t\r\n]', text) # two letter words starting with letter 'a'

[' an ', ' as ', ' as ']

In [24]:
re.findall(r'\sa\w\s', text)   # the same as above but shorter       

[' an ', ' as ', ' as ']

In [26]:
re.findall(r'\sa\w*\s', text)    # words starting with 'a'

[' an ', ' as ', ' as ', ' approach ', ' and ']
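
The result above misses "aim": the space between "approach" and "aim" is consumed by the match of " approach ", so it is no longer available as the leading whitespace of the next match. One possible way around this, sketched here on a shortened copy of the sentence, is the zero-width word boundary `\b`, which does not consume any character.

```python
import re

sample = ('Its language constructs as well as its object-oriented '
          'approach aim to help programmers')

with_spaces = re.findall(r'\sa\w*\s', sample)   # whitespace is part of each match
with_boundary = re.findall(r'\ba\w*', sample)   # \b matches a position, not a character

print(with_spaces)      # [' as ', ' as ', ' approach ']
print(with_boundary)    # ['as', 'as', 'approach', 'aim']
```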

We can use the regexp find/match functions to validate input data. The example below checks whether a string is a valid number.

In [81]:
int_numbers = ('12356', '1ac', 'twelve', '23.65', '0', '-768')
for int_number in int_numbers:
  if re.match(r'[+-]?(0|[1-9][0-9]*)$', int_number):
    print(f'{int_number} is an integer number')

12356 is an integer number
0 is an integer number
-768 is an integer number
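
One detail worth knowing: `$` also matches just before a newline at the very end of the string, so input read from a file or the keyboard with a trailing newline still passes the test. If that is not wanted, *re.fullmatch* is a possible alternative, sketched below with the same integer pattern without the `$`:

```python
import re

int_pattern = r'[+-]?(0|[1-9][0-9]*)'

dollar_newline = bool(re.match(int_pattern + '$', '42\n'))   # '$' accepts the trailing newline
full_newline = bool(re.fullmatch(int_pattern, '42\n'))       # the whole string must match
full_plain = bool(re.fullmatch(int_pattern, '42'))

print(dollar_newline, full_newline, full_plain)   # True False True
```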


In [82]:
float_numbers = ('12', '0.0', '-43.56', '1.76e-1', '1.1.1', '00.289')
for float_number in float_numbers:
  if re.match(r'[+-]?(0|[1-9][0-9]*)(\.[0-9]*)?([eE][+-]?[0-9]+)?$', float_number):
    print(f'{float_number} is a float number')

12 is a float number
0.0 is a float number
-43.56 is a float number
1.76e-1 is a float number


There is another approach to check numerical values without regexp, as follows:

In [83]:
for float_number in float_numbers:
  try:
    float(float_number)     # try to convert to float number
  except ValueError:
    continue                # can't convert, skip it
  print(f'{float_number} is a float number')

12 is a float number
0.0 is a float number
-43.56 is a float number
1.76e-1 is a float number
00.289 is a float number


Email address validation: we'll use a precompiled regular expression (*re.compile*). Compiling the pattern once is more efficient than passing the same pattern string to the *re* functions again and again:

In [69]:
email = re.compile(r'^[a-zA-Z0-9.!#$%&\'*+/=?^_`{|}~-]+@[a-zA-Z0-9]([a-zA-Z0-9-]{0,61}[a-zA-Z0-9])?(\.[a-zA-Z0-9]([a-zA-Z0-9-]{0,61}[a-zA-Z0-9])?)*$')
addresses = ['a.b@c', 'siki.zoltan@emk.bme.hu', 'plainaddress', '#@%^%#$@#$@#.com', '@example.com', 'Joe Smith <email@example.com>',
            'email.example.com', 'email@example@example.com', 'email@123.123.123.123']
valid_addresses = [addr for addr in addresses if email.search(addr)]
print('valid email addresses:\n', valid_addresses)
invalid_addresses = [addr for addr in addresses if not email.search(addr)]
print('invalid email addresses:\n', invalid_addresses)

valid email addresses:
 ['a.b@c', 'siki.zoltan@emk.bme.hu', 'email@123.123.123.123']
invalid email addresses:
 ['plainaddress', '#@%^%#$@#$@#.com', '@example.com', 'Joe Smith <email@example.com>', 'email.example.com', 'email@example@example.com']
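
A compiled pattern offers the same methods as the module-level functions (*match*, *search*, *findall*, ...), and groups in the pattern give access to parts of the match. The sketch below uses a deliberately simplified, hypothetical pattern (`simple_email`), not the full validation pattern above, to split an address into local part and domain:

```python
import re

# Simplified pattern for illustration only: group 1 is the local part, group 2 the domain
simple_email = re.compile(r'([a-zA-Z0-9._-]+)@([a-zA-Z0-9.-]+)')

for addr in ('siki.zoltan@emk.bme.hu', 'plainaddress'):
    m = simple_email.fullmatch(addr)
    if m:
        print(m.group(1), '|', m.group(2))
    else:
        print(addr, 'does not match')
```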


##Other functions

*re.sub* replaces every occurrence of a regexp in a string with a given text.

In [87]:
print(re.sub(r'  *', ' ', 'Text     with     several unnecessary    spaces')) # collapse adjacent spaces to a single space
print(re.sub(r'[ \t,;]', ',', 'first,second;third fourth fifth'))             # unify separators

Text with several unnecessary spaces
first,second,third,fourth,fifth
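
The replacement text can also refer back to groups of the pattern with `\1`, `\2`, and so on. A small sketch, assuming a hypothetical ISO-style date string that should be reordered:

```python
import re

iso_date = '2021-03-15'   # hypothetical input
# three groups: year, month, day; the replacement reorders them
reordered = re.sub(r'([0-9]{4})-([0-9]{2})-([0-9]{2})', r'\3.\2.\1', iso_date)
print(reordered)          # 15.03.2021
```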


*re.split* splits a text into a list of parts, where separators are given by regexp.

In [30]:
words = re.split(r'[, \.\t\r\n]', text)   # word separators are comma, space, dot, tabulator and EOL
words

['Python',
 'is',
 'an',
 'interpreted',
 'high-level',
 'general-purpose',
 'programming',
 'language',
 '',
 '',
 'Its',
 'design',
 'philosophy',
 'emphasizes',
 'code',
 'readability',
 'with',
 'its',
 'use',
 'of',
 'significant',
 'indentation',
 '',
 '',
 'Its',
 'language',
 'constructs',
 'as',
 'well',
 'as',
 'its',
 'object-oriented',
 'approach',
 'aim',
 'to',
 'help',
 'programmers',
 'write',
 'clear',
 '',
 '',
 'logical',
 'code',
 'for',
 'small',
 'and',
 'large-scale',
 'projects',
 '']

Please note that the previous result contains some empty words where two or more separators are adjacent. Let's correct it:

In [32]:
words = re.split(r'[, \.\t\r\n]+', text)  # join adjacent separators
words

['Python',
 'is',
 'an',
 'interpreted',
 'high-level',
 'general-purpose',
 'programming',
 'language',
 'Its',
 'design',
 'philosophy',
 'emphasizes',
 'code',
 'readability',
 'with',
 'its',
 'use',
 'of',
 'significant',
 'indentation',
 'Its',
 'language',
 'constructs',
 'as',
 'well',
 'as',
 'its',
 'object-oriented',
 'approach',
 'aim',
 'to',
 'help',
 'programmers',
 'write',
 'clear',
 'logical',
 'code',
 'for',
 'small',
 'and',
 'large-scale',
 'projects',
 '']

Why is there an empty word at the end?
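
A hint, sketched on the last few words of the text: the string ends with a separator (the dot), and *re.split* also returns the (empty) part after the last separator. Two possible ways to get rid of it are filtering the result or collecting the non-separator runs with *re.findall*:

```python
import re

sample = 'small and large-scale projects.'

parts = re.split(r'[, \.\t\r\n]+', sample)
print(parts)                                 # ['small', 'and', 'large-scale', 'projects', '']

filtered = [p for p in parts if p]           # drop empty strings
found = re.findall(r'[^, \.\t\r\n]+', sample)   # runs of non-separator characters
print(filtered == found)                     # True
```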

##Complex example

Let's work through a more complex example: find the most frequent four-letter word starting with "s" in Kipling's *The Jungle Book*.

In [88]:
import urllib.request
url = 'https://www.gutenberg.org/files/236/236-0.txt'
words = {}
with urllib.request.urlopen(url) as file:
  for line in file:
    ws = re.split(r'[, \.\t\r\n]+', line.decode('utf8'))
    for w in ws:
      w = w.lower()
      if re.match('[sS][a-z]{3}', w):   # no end anchor: longer words starting with "s" match too
        if w in words:
          words[w] += 1
        else:
          words[w] = 1
print(f'{len(words.keys())} different four letter words starting with "s"')
m = max(words, key=words.get)
print(f'{m}: {words[m]}')


751 different four letter words starting with "s"
said: 426
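
The pattern in the cell above is not anchored at the end, so every word that starts with "s" followed by at least three more letters is counted, which explains the large number of different words. A possible stricter variant is sketched below. The function name `count_four_letter_s_words` and the sample lines are made up for illustration; it takes any iterable of text lines, so it can be tried without downloading the book, and no result for the book itself is claimed here.

```python
import re
from collections import Counter

def count_four_letter_s_words(lines):
    """Count words of exactly four letters that start with 's' (case-insensitive)."""
    counts = Counter()
    for line in lines:
        for w in re.split(r'[, \.\t\r\n]+', line):    # same separators as above
            w = w.lower()
            if re.fullmatch(r's[a-z]{3}', w):         # the whole word must match
                counts[w] += 1
    return counts

sample_lines = ['She said so, and said it again.', 'Some snakes stay still.']
counts = count_four_letter_s_words(sample_lines)
print(sorted(counts.items()))      # [('said', 2), ('some', 1), ('stay', 1)]
print(counts.most_common(1))       # [('said', 2)]
```

For the book, the decoded lines of the downloaded file could be passed in place of `sample_lines`.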


*Tasks*

*   Analyse and try to understand the regular expressions used for floats and email addresses
*   Create a regular expression for phone numbers
*   Which is the longest word in Kipling's book?
*   Are there words in the book that contain all the different vowels (aeiou) of the English alphabet?
*   How could we handle plurals and other non-dictionary forms (e.g. Mowgli's, sees, saw, seen, etc.)?

