# Regular expressions - re

A regular expression (often shortened to regex) is a text-matching pattern written in a formal syntax. It is used to find or match strings, or sets of strings, inside other text.

In [6]:
# Search for a pattern in a text:

import re

patterns = ['term1', 'term2']
text = 'This is a string with term1 but not the other term'

In [7]:
for pattern in patterns:
    print('Searching for "%s" in: \n"%s"' % (pattern, text))
    
    # Check for a match:
    if re.search(pattern, text):
        print('\n')
        print('Match was found.\n')
    else:
        print('\n')
        print('No match was found. \n')

Searching for "term1" in: 
"This is a string with term1 but not the other term"


Match was found.

Searching for "term2" in: 
"This is a string with term1 but not the other term"


No match was found. 



In [14]:
print(re.search('a', 'b'))

None


In [15]:
match = re.search(patterns[0], text)

In [16]:
# What is the type of match:
type(match)

re.Match

In [20]:
# Find the index where the match starts in the text:

match.start()

22

In [21]:
# Find the index just past the last matched character:
match.end()

27

In [22]:
# Get the substring that was matched:
match.group()

'term1'
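
The three accessors above fit together: slicing the original text with the start and end indices gives back the same substring that `group()` returns. A minimal sketch, reusing the text from the earlier cell; `span()` is a standard method on match objects that returns both indices as one tuple.

```python
import re

text = 'This is a string with term1 but not the other term'
match = re.search('term1', text)

# span() returns (start, end) as a single tuple
start, end = match.span()
print(start, end)   # 22 27

# Slicing the text with those indices recovers the matched substring
print(text[start:end] == match.group())   # True
```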

In [25]:
split_term = '@'

phrase = 'What is your email: hello@gmail.com?'

In [26]:
# Split the phrase at every occurrence of the term:

re.split(split_term, phrase)

['What is your email: hello', 'gmail.com?']
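
Because the first argument to `re.split` is a pattern and not just a literal string, it can describe several delimiters at once. The sketch below uses a bracketed character set (covered in the Character sets section) and a made-up phrase, so treat the input as illustrative only.

```python
import re

# Hypothetical phrase that mixes three different delimiters
phrase = 'apples,pears;plums grapes'

# Split wherever a comma, a semicolon, or a space appears
parts = re.split('[,; ]', phrase)
print(parts)   # ['apples', 'pears', 'plums', 'grapes']
```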

In [27]:
# Find all instances of a pattern:

re.findall('match', 'Here is one match, here another match')

['match', 'match']
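
`re.findall` returns only the matched strings. When the positions matter as well, `re.finditer` from the same module yields one match object per occurrence. A short sketch using the same phrase:

```python
import re

phrase = 'Here is one match, here another match'

# finditer yields match objects in order, so start() and end() are available
positions = [(m.start(), m.end()) for m in re.finditer('match', phrase)]
print(positions)   # [(12, 17), (32, 37)]
```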

In [28]:
# This function takes a list of regex patterns and prints all matches for each one:

def multi_re_find(patterns, phrase):
    """Print every match of each pattern in patterns found in phrase."""
    for pattern in patterns:
        print("Searching the phrase using the re check: %r" % pattern)
        print(re.findall(pattern, phrase))
        print('\n')

## Repetition syntax

There are five ways to express repetition in a pattern:

1) A pattern followed by the metacharacter * is repeated zero or more times.

2) With the +, a pattern must appear at least once.

3) Using ? means the pattern appears zero or one time.

4) For a specific number of occurrences, use {m} after the pattern, where m is the number of times the pattern should repeat.

5) Use {m,n}, where m is the minimum number of repetitions and n is the maximum. Leaving out n (writing {m,}) means the pattern repeats at least m times, with no maximum.

In [30]:
# This text is used to show the repetition quantifiers: *, +, ?, {}

test_phrase = 'sdsd..sssddd...sdddsddd...dsds...dsssss...sdddd'

test_patterns = ['sd*',      # s followed by zero or more d's
                 'sd+',      # s followed by one or more d's
                 'sd?',      # s followed by zero or one d
                 'sd{3}',    # s followed by exactly three d's
                 'sd{2,3}']  # s followed by two to three d's

multi_re_find(test_patterns, test_phrase)



Searching the phrase using the re check: 'sd*'
['sd', 'sd', 's', 's', 'sddd', 'sddd', 'sddd', 'sd', 's', 's', 's', 's', 's', 's', 'sdddd']


Searching the phrase using the re check: 'sd+'
['sd', 'sd', 'sddd', 'sddd', 'sddd', 'sd', 'sdddd']


Searching the phrase using the re check: 'sd?'
['sd', 'sd', 's', 's', 'sd', 'sd', 'sd', 'sd', 's', 's', 's', 's', 's', 's', 'sd']


Searching the phrase using the re check: 'sd{3}'
['sddd', 'sddd', 'sddd', 'sddd']


Searching the phrase using the re check: 'sd{2,3}'
['sddd', 'sddd', 'sddd', 'sddd']
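
The patterns above do not exercise the open-ended form of the braces. As a small sketch on the same test phrase, leaving out the upper bound lets the final run of four d's match in full, where 'sd{2,3}' stopped after three:

```python
import re

test_phrase = 'sdsd..sssddd...sdddsddd...dsds...dsssss...sdddd'

# s followed by at least two d's, with no upper bound
matches = re.findall('sd{2,}', test_phrase)
print(matches)   # ['sddd', 'sddd', 'sddd', 'sdddd']
```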




## Character sets

A character set matches any one of a group of characters at a given point in the input. Square brackets define the set: [abcd] matches a single occurrence of a, b, c, or d.

In [33]:
# Example of searching with character sets:

test_phrase = 'sdsd..sssddd...sdddsddd...dsds...dsssss...sdddd'

test_patterns = ['[sd]',    # either s or d
                 's[sd]+']  # s followed by one or more s's or d's

multi_re_find(test_patterns, test_phrase)

Searching the phrase using the re check: '[sd]'
['s', 'd', 's', 'd', 's', 's', 's', 'd', 'd', 'd', 's', 'd', 'd', 'd', 's', 'd', 'd', 'd', 'd', 's', 'd', 's', 'd', 's', 's', 's', 's', 's', 's', 'd', 'd', 'd', 'd']


Searching the phrase using the re check: 's[sd]+'
['sdsd', 'sssddd', 'sdddsddd', 'sds', 'sssss', 'sdddd']




## Exclusion

Use ^ (caret) as the first character inside the brackets to exclude characters: [^...] matches any single character that is not listed in the brackets.

In [34]:
test_phrase = 'This is a string? But it has punctuation! How can we remove it?'

In [36]:
# [^ ] matches any character that is not a space; + requires one or more of them in a row
re.findall('[^ ]+', test_phrase)

['This',
 'is',
 'a',
 'string?',
 'But',
 'it',
 'has',
 'punctuation!',
 'How',
 'can',
 'we',
 'remove',
 'it?']
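
The result above still carries the punctuation that the test phrase asks about. One way to drop it, sketched here as an assumption about the intended follow-up, is to add the punctuation marks to the excluded set. Inside brackets, . and ? are treated as literal characters.

```python
import re

test_phrase = 'This is a string? But it has punctuation! How can we remove it?'

# Match runs of characters that are not '!', '.', '?', or a space
words = re.findall('[^!.? ]+', test_phrase)
print(words)
# ['This', 'is', 'a', 'string', 'But', 'it', 'has', 'punctuation',
#  'How', 'can', 'we', 'remove', 'it']
```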

## Character ranges

When character sets are large, a more compact format using character ranges makes it possible to define a character set to include all of the contiguous characters between a start and stop point.

The format used is [start-end].

For example, [a-f] matches any single letter from a through f, inclusive.

In [40]:
test_phrase = "This is an example sentence. Let's see if we can find some letters."

test_patterns = ['[a-z]+',      # sequences of lower case letters
                 '[A-Z]+',      # sequences of upper case letters
                 '[a-zA-Z]+',   # sequences of lower or upper case letters
                 '[A-Z][a-z]+'  # one upper case letter followed by lower case letters
                ]
multi_re_find(test_patterns, test_phrase)

Searching the phrase using the re check: '[a-z]+'
['his', 'is', 'an', 'example', 'sentence', 'et', 's', 'see', 'if', 'we', 'can', 'find', 'some', 'letters']


Searching the phrase using the re check: '[A-Z]+'
['T', 'L']


Searching the phrase using the re check: '[a-zA-Z]+'
['This', 'is', 'an', 'example', 'sentence', 'Let', 's', 'see', 'if', 'we', 'can', 'find', 'some', 'letters']


Searching the phrase using the re check: '[A-Z][a-z]+'
['This', 'Let']
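
A range does not have to span the whole alphabet. As a brief sketch on the same sentence, a narrower range such as a through f picks out only runs of those six letters and skips everything else:

```python
import re

test_phrase = "This is an example sentence. Let's see if we can find some letters."

# Runs of lower case letters from a through f only
matches = re.findall('[a-f]+', test_phrase)
print(matches)

# Every character in every match falls inside the range
print(all(ch in 'abcdef' for m in matches for ch in m))   # True
```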




## Escape codes

Special escape codes match specific types of characters: digits, non-digits, whitespace, and so on.

Each escape code starts with a \ (backslash). Because \ is also the escape character in ordinary Python string literals, write patterns as raw strings with the r prefix (example: r'\d+' or r'\s+').

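| Code | Meaning |
|------|---------|
| \d | a digit |
| \D | a non-digit |
| \s | whitespace (tab, space, newline, etc.) |
| \S | non-whitespace |
| \w | alphanumeric (letters, digits, and the underscore) |
| \W | non-alphanumeric (anything \w does not match) |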
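
To make the raw-string advice concrete, the sketch below compares raw and ordinary string literals. The sample input 'a1b22' is made up for illustration.

```python
import re

# In an ordinary string '\n' is one newline character;
# in a raw string the backslash and the n are kept as two characters
print(len('\n'))    # 1
print(len(r'\n'))   # 2

# Without the r prefix, the backslash has to be doubled to reach the regex engine
raw = r'\d+'
escaped = '\\d+'
print(raw == escaped)   # True

# Both spellings therefore behave identically as patterns
digits = re.findall(raw, 'a1b22')
print(digits)   # ['1', '22']
```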

In [41]:
# Using escape codes:

test_phrase = "This is a string with some numbers 1233 and a symbol #hashtag."

test_patterns = [r'\d+',   # sequence of digits
                 r'\D+',   # sequence of non-digits
                 r'\s+',   # sequence of whitespace
                 r'\S+',   # sequence of non-whitespace
                 r'\w+',   # alphanumeric characters
                 r'\W+']   # non-alphanumeric characters

multi_re_find(test_patterns, test_phrase)

Searching the phrase using the re check: '\\d+'
['1233']


Searching the phrase using the re check: '\\D+'
['This is a string with some numbers ', ' and a symbol #hashtag.']


Searching the phrase using the re check: '\\s+'
[' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ']


Searching the phrase using the re check: '\\S+'
['This', 'is', 'a', 'string', 'with', 'some', 'numbers', '1233', 'and', 'a', 'symbol', '#hashtag.']


Searching the phrase using the re check: '\\w+'
['This', 'is', 'a', 'string', 'with', 'some', 'numbers', '1233', 'and', 'a', 'symbol', 'hashtag']


Searching the phrase using the re check: '\\W+'
[' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' #', '.']


