# Regular Expressions

Regular expressions are text-matching patterns described with a formal syntax. A pattern can express a variety of rules, from repetition to membership in a set of characters. Many text-parsing problems in Python can be solved with regular expressions, which are provided by the built-in <code>re</code> module.

In [1]:
import re

## Searching for Patterns in Text

In [2]:
# List of patterns to search for
patterns = ['hot', 'cold']

# Text to parse
text = 'This is a string with the word hot, but not the other one.'

for pattern in patterns:
    print('Searching for "%s" in:\n "%s"\n' %(pattern,text))
    
    #Check for match
    if re.search(pattern,text):
        print('Match was found. \n')
    else:
        print('No Match was found.\n')

Searching for "hot" in:
 "This is a string with the word hot, but not the other one."

Match was found. 

Searching for "cold" in:
 "This is a string with the word hot, but not the other one."

No Match was found.



We can see that <code>re.search()</code> takes the pattern, scans the text, and returns a **Match** object for the first match. If the pattern is not found, **None** is returned.

In [11]:
# Pattern to search for
pattern = 'hot'

# Text to parse
text = 'This is a string with the word hot, but not the other one.'

match = re.search(pattern,text)

type(match)

re.Match

The **Match** object returned by <code>re.search()</code> is more than a truthy value: it contains information about the match, including the original input string, the regular expression that was used, and the location of the match.

In [8]:
# Show start of match
match.start()

31

In [9]:
# Show end
match.end()

34
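
The other pieces of information carried by a **Match** object can be inspected in the same way. The following is a minimal sketch that reuses the text and pattern from the cells above; the `is not None` guard and the `missing` variable are additions for illustration, included because `re.search()` returns `None` when nothing matches.

```python
import re

text = 'This is a string with the word hot, but not the other one.'
match = re.search('hot', text)

# re.search() returns None on failure, so guard before using the result
if match is not None:
    print(match.group())     # the matched text
    print(match.span())      # (start, end) as a single tuple
    print(match.re.pattern)  # the pattern that produced the match
    print(match.string)      # the original input string

# A pattern that does not occur yields None rather than raising an error
missing = re.search('cold', text)
print(missing)
```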


## Splitting with Regular Expressions

We can also split a string with <code>re.split()</code>. It works much like the <code>split()</code> method of strings, except that the separator is a regular expression.

In [14]:
# Term to split on
split_term = '@'

phrase = 'What is the domain name of someone with the email: python@gmail.com'

# Split the phrase
re.split(split_term,phrase)

['What is the domain name of someone with the email: python', 'gmail.com']

Note how <code>re.split()</code> returns a list of the pieces of the string, with the term we split on removed.
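
Because the separator is a pattern rather than a fixed string, one call can split on several delimiters at once. The sketch below uses made-up sample strings (they are not from the cells above) to show a character-set separator, the `maxsplit` argument, and the effect of a capturing group.

```python
import re

# Split on a comma or semicolon followed by optional whitespace
fields = re.split(r'[,;]\s*', 'red, green;blue,  yellow')
print(fields)

# maxsplit limits how many splits are performed
limited = re.split(r'\s+', 'one two three four', maxsplit=2)
print(limited)

# A capturing group keeps the separator in the result list
kept = re.split(r'(@)', 'python@gmail.com')
print(kept)
```

Keeping the separator can be useful when the pieces need to be joined back together later.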

## Finding All Instances of a Pattern

We can use <code>re.findall()</code> to find all the instances of a pattern in a string.

In [16]:
re.findall('Python','we are learning regex in Python today.')

['Python']

In [17]:
text = 'sat,hat,cat,mat,pat'
for i in re.findall('[shcmp]at',text):
    print(i)

sat
hat
cat
mat
pat


## Repetition Syntax

There are five ways to express repetition in a pattern:

   1. A pattern followed by the meta-character <code>*</code> is repeated zero or more times. 
   2. Replace the <code>*</code> with <code>+</code> and the pattern must appear at least once. 
   3. Using <code>?</code> means the pattern appears zero or one time. 
   4. For a specific number of occurrences, use <code>{m}</code> after the pattern, where **m** is replaced with the number of times the pattern should repeat. 
   5. Use <code>{m,n}</code> where **m** is the minimum number of repetitions and **n** is the maximum. Leaving out **n** <code>{m,}</code> means the value appears at least **m** times, with no maximum.

Let's create a function that prints the matches for each regular expression in a list, given a phrase to parse:

In [19]:
def multi_re_find(patterns,phrase):
    '''
    Takes in a list of regex patterns
    Prints a list of all matches
    '''
    for pattern in patterns:
        print('Searching the phrase using the re check: %r' %(pattern))
        print(re.findall(pattern,phrase))
        print('\n')

In [20]:
test_phrase = 'sdsd..sssddd...sdddsddd...dsds...dsssss...sdddd'

test_patterns = [ 'sd*',        # s followed by zero or more d's
                'sd+',          # s followed by one or more d's
                'sd?',          # s followed by zero or one d's
                'sd{3}',        # s followed by three d's
                'sd{2,3}',      # s followed by two to three d's
                ]

multi_re_find(test_patterns,test_phrase)

Searching the phrase using the re check: 'sd*'
['sd', 'sd', 's', 's', 'sddd', 'sddd', 'sddd', 'sd', 's', 's', 's', 's', 's', 's', 'sdddd']


Searching the phrase using the re check: 'sd+'
['sd', 'sd', 'sddd', 'sddd', 'sddd', 'sd', 'sdddd']


Searching the phrase using the re check: 'sd?'
['sd', 'sd', 's', 's', 'sd', 'sd', 'sd', 'sd', 's', 's', 's', 's', 's', 's', 'sd']


Searching the phrase using the re check: 'sd{3}'
['sddd', 'sddd', 'sddd', 'sddd']


Searching the phrase using the re check: 'sd{2,3}'
['sddd', 'sddd', 'sddd', 'sddd']
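
The open-ended form <code>{m,}</code> from the list above is not part of that run. As a small sketch on the same test phrase, <code>sd{2,}</code> matches an s followed by at least two d's; the variable name `at_least_two` is just for illustration.

```python
import re

test_phrase = 'sdsd..sssddd...sdddsddd...dsds...dsssss...sdddd'

# No upper bound: the final run keeps all four d's
at_least_two = re.findall(r'sd{2,}', test_phrase)
print(at_least_two)

# With an upper bound of three, the same run is cut off after three d's
two_to_three = re.findall(r'sd{2,3}', test_phrase)
print(two_to_three)
```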




## Character Sets

Character sets are used when we want to match any one character from a group. Brackets are used to construct a character set. For example, the input <code>[ab]</code> searches for occurrences of either **a** or **b**.

In [21]:
test_phrase = 'sdsd..sssddd...sdddsddd...dsds...dsssss...sdddd'

test_patterns = ['[sd]',    # either s or d
                's[sd]+']   # s followed by one or more s or d

multi_re_find(test_patterns,test_phrase)

Searching the phrase using the re check: '[sd]'
['s', 'd', 's', 'd', 's', 's', 's', 'd', 'd', 'd', 's', 'd', 'd', 'd', 's', 'd', 'd', 'd', 'd', 's', 'd', 's', 'd', 's', 's', 's', 's', 's', 's', 'd', 'd', 'd', 'd']


Searching the phrase using the re check: 's[sd]+'
['sdsd', 'sssddd', 'sdddsddd', 'sds', 'sssss', 'sdddd']




The first input <code>[sd]</code> returns every instance of s or d. The second input <code>s[sd]+</code> returns each run that begins with an s and continues with s or d characters until some other character is reached.

## Exclusion

We can use <code>^</code> to exclude characters by placing it first inside the brackets. For example, <code>[^...]</code> will match any single character not in the brackets.

In [24]:
test_phrase = 'This is regular expression - python, I know its tough! This sentence has punctuations. How to remove them ?'

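The pattern <code>[^!-.? ]</code> matches any character that is not in the set. Note that inside brackets <code>!-.</code> is read as a range running from <code>!</code> to <code>.</code> in character-code order, which covers <code>,</code> and <code>-</code> along with several other symbols, so the comma is excluded even though it is not listed explicitly. The set also excludes <code>?</code> and the space. Adding <code>+</code> matches runs of one or more of the remaining characters, which yields whole words.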

In [25]:
re.findall('[^!-.? ]+',test_phrase)

# Punctuation and spaces are dropped; only the words remain

['This',
 'is',
 'regular',
 'expression',
 'python',
 'I',
 'know',
 'its',
 'tough',
 'This',
 'sentence',
 'has',
 'punctuations',
 'How',
 'to',
 'remove',
 'them']
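
A hyphen between two characters inside brackets forms a range, which is why the comma disappears in the result above. The sketch below, on a made-up sample string, contrasts the range with two ways of writing a literal hyphen: escaping it, or placing it last in the set.

```python
import re

sample = 'a,b-c.d!e'

# '!-.' is a range from '!' to '.', which also contains ',' and '-'
as_range = re.findall(r'[!-.]', sample)
print(as_range)

# Escaping the hyphen makes it a literal character
as_literal = re.findall(r'[!\-.]', sample)
print(as_literal)

# A hyphen placed last in the set is also taken literally
hyphen_last = re.findall(r'[!.-]', sample)
print(hyphen_last)
```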

## Character Ranges

As character sets grow larger, typing every character that should (or should not) match could become very tedious. A more compact format using character ranges lets you define a character set to include all of the characters between a start and stop point. The format used is <code>[start-end]</code>.

A common use case is to search for a specific range of letters in the alphabet. For instance, <code>[a-z]</code> matches any single lowercase letter from a to z.

In [30]:

test_phrase = 'This, is an example for Character Ranges! Lets see if we can find some Letters.'

test_patterns=['[a-z]+',      # sequences of lower case letters (excludes uppercase letters)
               '[A-Z]+',      # sequences of upper case letters (prints only uppercase letters)
               '[a-zA-Z]+',   # sequences of lower or upper case letters
               '[A-Z][a-z]+'] # one upper case letter followed by lower case letters (words with both upper and lower cases)
                
multi_re_find(test_patterns,test_phrase)

Searching the phrase using the re check: '[a-z]+'
['his', 'is', 'an', 'example', 'for', 'haracter', 'anges', 'ets', 'see', 'if', 'we', 'can', 'find', 'some', 'etters']


Searching the phrase using the re check: '[A-Z]+'
['T', 'C', 'R', 'L', 'L']


Searching the phrase using the re check: '[a-zA-Z]+'
['This', 'is', 'an', 'example', 'for', 'Character', 'Ranges', 'Lets', 'see', 'if', 'we', 'can', 'find', 'some', 'Letters']


Searching the phrase using the re check: '[A-Z][a-z]+'
['This', 'Character', 'Ranges', 'Lets', 'Letters']
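
Ranges are not limited to letters. As a brief sketch on a made-up sentence, the patterns below use a digit range, combine ranges in one pattern, and negate a set of ranges.

```python
import re

text = 'Order 66 shipped to Bay 12B on 2024-05-01'

# Runs of digits
digits = re.findall(r'[0-9]+', text)
print(digits)

# Digits immediately followed by one uppercase letter
tagged = re.findall(r'[0-9]+[A-Z]', text)
print(tagged)

# Runs of anything that is neither a letter nor a space
non_letters = re.findall(r'[^a-zA-Z ]+', text)
print(non_letters)
```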




## Escape Codes

You can use special escape codes to find specific types of patterns in your data, such as digits, non-digits, whitespace, and more. For example:

<table border="1" class="docutils">
<colgroup>
<col width="14%" />
<col width="86%" />
</colgroup>
<thead valign="bottom">
<tr class="row-odd"><th class="head">Code</th>
<th class="head">Meaning</th>
</tr>
</thead>
<tbody valign="top">
<tr class="row-even"><td><tt class="docutils literal"><span class="pre">\d</span></tt></td>
<td>a digit</td>
</tr>
<tr class="row-odd"><td><tt class="docutils literal"><span class="pre">\D</span></tt></td>
<td>a non-digit</td>
</tr>
<tr class="row-even"><td><tt class="docutils literal"><span class="pre">\s</span></tt></td>
<td>whitespace (tab, space, newline, etc.)</td>
</tr>
<tr class="row-odd"><td><tt class="docutils literal"><span class="pre">\S</span></tt></td>
<td>non-whitespace</td>
</tr>
<tr class="row-even"><td><tt class="docutils literal"><span class="pre">\w</span></tt></td>
<td>alphanumeric (letters, digits, and underscore)</td>
</tr>
<tr class="row-odd"><td><tt class="docutils literal"><span class="pre">\W</span></tt></td>
<td>non-alphanumeric (anything other than letters, digits, and underscore)</td>
</tr>
</tbody>
</table>

Escapes are indicated by prefixing the character with a backslash <code>\</code>. Unfortunately, a backslash must itself be escaped in normal Python strings, and that results in expressions that are difficult to read. Using raw strings, created by prefixing the literal value with <code>r</code>, eliminates this problem and maintains readability.
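
The difference is easy to see with an escape that Python itself also interprets. The sketch below uses <code>\b</code>, the regex word-boundary code, which is not in the table above and is brought in here only as an illustration: in a normal string, Python turns <code>'\b'</code> into a backspace character before the regex engine ever sees it.

```python
import re

# A raw string and a doubly escaped normal string are the same pattern
same = (r'\d+' == '\\d+')
print(same)

plain = '\bcat\b'   # backspace, 'cat', backspace: 5 characters
raw = r'\bcat\b'    # backslash-b, 'cat', backslash-b: 7 characters
print(len(plain), len(raw))

sentence = 'cat concat cat.'

# The raw pattern matches 'cat' only as a whole word
whole_words = re.findall(raw, sentence)
print(whole_words)

# The plain pattern looks for literal backspace characters and finds nothing
no_match = re.findall(plain, sentence)
print(no_match)
```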

In [31]:
test_phrase = 'This is a string with some numbers 1233 and a few symbols #,@,$'

test_patterns=[ r'\d+', # sequence of digits
                r'\D+', # sequence of non-digits
                r'\s+', # sequence of whitespace
                r'\S+', # sequence of non-whitespace
                r'\w+', # alphanumeric characters
                r'\W+', # non-alphanumeric
                ]

multi_re_find(test_patterns,test_phrase)

Searching the phrase using the re check: '\\d+'
['1233']


Searching the phrase using the re check: '\\D+'
['This is a string with some numbers ', ' and a few symbols #,@,$']


Searching the phrase using the re check: '\\s+'
[' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ']


Searching the phrase using the re check: '\\S+'
['This', 'is', 'a', 'string', 'with', 'some', 'numbers', '1233', 'and', 'a', 'few', 'symbols', '#,@,$']


Searching the phrase using the re check: '\\w+'
['This', 'is', 'a', 'string', 'with', 'some', 'numbers', '1233', 'and', 'a', 'few', 'symbols']


Searching the phrase using the re check: '\\W+'
[' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' #,@,$']




## More Examples

In [33]:
# findall returns every occurrence, so counting the results counts the matches
doc = ("'How much wood would a woodchuck chuck if a woodchuck could chuck wood?'")

allinform = re.findall('wood', doc)
print(allinform)
count = 0
for val in allinform:
    print(val)
    count = count + 1
print(count)

['wood', 'wood', 'wood', 'wood']
wood
wood
wood
wood
4


In [34]:
# finditer yields a Match object for each occurrence; span() gives its (start, end)
for i in re.finditer('wood',doc):
    print(i.span())

(10, 14)
(23, 27)
(44, 48)
(66, 70)
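
The counting loop and the position lookup can each be written in one line. This is a minimal sketch; the `sentence` string here is defined without the extra quote characters that `doc` contains, so each offset is one lower than in the output above.

```python
import re

sentence = 'How much wood would a woodchuck chuck if a woodchuck could chuck wood?'

# findall returns a list, so len() gives the number of matches
count = len(re.findall('wood', sentence))
print(count)

# Each Match object from finditer knows where it starts
starts = [m.start() for m in re.finditer('wood', sentence)]
print(starts)
```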


In [37]:
funny = 'This and that and those'
re.findall(r'th\w+', funny)

# 'This' is missing because the pattern is case-sensitive

['that', 'those']

In [38]:
foo = 'This and that and those'
re.findall(r'th\w+', foo, flags=re.IGNORECASE)

['This', 'that', 'those']
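
The same flag can be supplied in other ways. As a sketch, the block below attaches it to a compiled pattern and, alternatively, embeds it in the pattern itself with the inline form <code>(?i)</code>; the variable names are chosen here for illustration.

```python
import re

phrase = 'This and that and those'

# Flag attached once when the pattern is compiled
compiled = re.compile(r'th\w+', flags=re.IGNORECASE)
by_flag = compiled.findall(phrase)
print(by_flag)

# Inline flag at the start of the pattern
by_inline = re.findall(r'(?i)th\w+', phrase)
print(by_inline)
```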

In [40]:
woody='How much wood would a woodchuck chuck if a woodchuck could chuck wood?'
re.findall(r'wood\w',woody)

['woodc', 'woodc']

In [41]:
woody

'How much wood would a woodchuck chuck if a woodchuck could chuck wood?'

In [42]:
re.sub(r'[aeiou]+', '-',woody)

# Each run of one or more vowels (a, e, i, o, u) is replaced with a single -

'H-w m-ch w-d w-ld - w-dch-ck ch-ck -f - w-dch-ck c-ld ch-ck w-d?'
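
The <code>+</code> in the pattern matters: without it every vowel is replaced separately. The sketch below shows that difference on a shorter string, along with the `count` argument and `re.subn()`, which also reports how many replacements were made.

```python
import re

short = 'wood would'

# One hyphen per run of vowels
one_per_run = re.sub(r'[aeiou]+', '-', short)
print(one_per_run)

# One hyphen per individual vowel
one_per_vowel = re.sub(r'[aeiou]', '-', short)
print(one_per_vowel)

# count limits the number of replacements
first_only = re.sub(r'[aeiou]+', '-', short, count=1)
print(first_only)

# subn returns the new string together with the number of replacements
result, n = re.subn(r'[aeiou]+', '-', short)
print(result, n)
```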

In [43]:
myre = re.compile(r'\w+ou\w+')
myre.findall(woody)

# Words that contain 'ou' with at least one word character on each side

['would', 'could']

In [44]:
myre.findall('the thirty-three thieves thought they thrilled the throne through Thursday')

['thought', 'through']

In [54]:
prog = re.compile(r'y')
prog.match('python',pos=1)

# match() only succeeds at the starting position; pos=1 moves that start to index 1

<re.Match object; span=(1, 2), match='y'>

In [55]:
prog = re.compile(r'thon')
prog.match('python', pos=2)

<re.Match object; span=(2, 6), match='thon'>
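
The `pos` argument is needed here because `match()` only tests the position where the scan starts, while `search()` looks through the rest of the string. A minimal sketch of the difference, with variable names chosen for illustration:

```python
import re

prog = re.compile(r'thon')

# match() without pos tests index 0 only, where 'python' does not start with 'thon'
at_start = prog.match('python')
print(at_start)

# search() scans forward and finds the pattern wherever it occurs
anywhere = prog.search('python')
print(anywhere.span())

# match() with pos=2 tests index 2, which is where 'thon' begins
from_pos = prog.match('python', 2)
print(from_pos.span())
```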

In [56]:
prog = re.compile(r'ing')
words = ['Spring', 'Cycling', 'Ringtone', 'pinging']
for w in words:
    mt = prog.search(w)
    # span() returns a (start, end) tuple for the first match
    start_pos, end_pos = mt.span()
    print("the word '{}' contains 'ing' at positions {}-{}".format(w, start_pos, end_pos))

the word 'Spring' contains 'ing' at positions 3-6
the word 'Cycling' contains 'ing' at positions 4-7
the word 'Ringtone' contains 'ing' at positions 1-4
the word 'pinging' contains 'ing' at positions 1-4
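
The loop above works because every word contains the pattern; for a word without it, `search()` returns `None` and calling `span()` on that would raise an `AttributeError`. The following is a sketch of a guarded version, using a hypothetical word list and a `report` list added here for illustration.

```python
import re

prog = re.compile(r'ing')
words = ['Spring', 'Summer', 'pinging']

report = []
for w in words:
    mt = prog.search(w)
    if mt is None:
        # No match: record the word with no position
        report.append((w, None))
    else:
        # span() covers the first occurrence only
        report.append((w, mt.span()))

print(report)
```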
