# How to use regular expressions

This is a quick guide on how to use regular expressions with Python.

Some Markdown tricks for inserting spaces:

- Type nbsp to add a single space.
- Type ensp to add 2 spaces.
- Type emsp to add 4 spaces.

You can use a non-breaking space (nbsp) 4 times to insert a tab.

Please add <span style = "color:magenta"> & </span> at the beginning and <span style = "color:magenta"> ; </span> at the end of each of the space names suggested above (for example, `&nbsp;`).

## Regular Expression Quick Guide

- ^ &nbsp; Matches the <span style = "color:magenta"> beginning </span> of a line
- $ &nbsp; Matches the <span style = "color:magenta"> end </span> of the line
- . &nbsp; Matches <span style = "color:magenta"> any </span> character
- \s &nbsp; Matches <span style = "color:magenta"> whitespace </span>
- \S &nbsp; Matches any <span style = "color:magenta"> non-whitespace </span> character
- \* &nbsp; <span style = "color:magenta"> Repeats </span> a character zero or more times
- \*? &nbsp; <span style = "color:magenta"> Repeats </span> a character zero or more times (non-greedy)
- \+ &nbsp; <span style = "color:magenta"> Repeats </span> a character one or more times
- \+? &nbsp; <span style = "color:magenta"> Repeats </span> a character one or more times (non-greedy)
- [aeiou] &nbsp; Matches a single character <span style = "color:magenta"> in </span> the listed <span style = "color:magenta"> set </span>
- [^XYZ] &nbsp; Matches a single character <span style = "color:magenta"> not in </span> the listed <span style = "color:magenta"> set </span>
- [a-z0-9] &nbsp; The set of characters can include a <span style = "color:magenta"> range </span>
- ( &nbsp; Indicates where string <span style = "color:magenta"> extraction </span> is to start
- ) &nbsp; Indicates where string <span style = "color:magenta"> extraction </span> is to end
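
To illustrate a few of the symbols above, here is a minimal sketch using Python's `re` module. The sample strings and variable names are made up for this example and are not part of the files used later in this guide.

```python
import re

sample = 'From: user42@example.com Sat Jan  5'  # made-up sample line

# ^ anchors the match at the beginning of the line
starts_with_from = re.search(r'^From:', sample) is not None

# \S+ matches a run of non-whitespace characters;
# the parentheses mark the part to extract
address = re.findall(r'^From: (\S+)', sample)

# [0-9]+ matches one or more characters from the range 0-9
numbers = re.findall(r'[0-9]+', sample)

# [^a-z ] matches a single character that is neither
# a lowercase letter nor a space
not_lower = re.findall(r'[^a-z ]', 'ab C1 d')

print(starts_with_from)  # True
print(address)           # ['user42@example.com']
print(numbers)           # ['42', '5']
print(not_lower)         # ['C', '1']
```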

## The Regular Expression Module

- Before you can use regular expressions in your program, you must import the library using "<span style = "color:lime">import re</span>".
- You can use <span style = "color:lime">re.search()</span> to see if a string matches a regular expression, similar to using the <span style = "color:magenta">str.find()</span> method for strings.
- You can use <span style = "color:lime">re.findall()</span> to extract the portions of a string that match your regular expression, similar to a combination of <span style = "color:magenta">str.find()</span> and slicing.
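
The three calls mentioned above can be compared side by side. This is a small sketch on a made-up string; the variable names are only illustrative.

```python
import re

line = 'order 66 shipped, order 7 pending'  # made-up sample text

# str.find() returns the index of a fixed substring, or -1 if it is absent
index = line.find('order')

# re.search() returns a match object for the first match, or None
first = re.search(r'\d+', line)
if first:
    print('First number:', first.group(), 'at index', first.start())

# re.findall() returns every non-overlapping match as a list of strings
all_numbers = re.findall(r'\d+', line)
print(index, all_numbers)  # 0 ['66', '7']
```

Note that `re.search()` stops at the first match, while `re.findall()` keeps scanning the rest of the string.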

The code below walks through different patterns as examples of how you can use regular expressions. Each cell contains one pattern and prints, for every line of the input that matches, a list of all the matches found in that line. For example, if a line contains 4 matches of the regular expression, the output for that line is a list with each match:

- ['match 1','match 2','match 3','match 4']

Hence, each list of matches corresponds to an individual line.
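
The same "one list per matching line" idea can be sketched without reading a file. The helper name `matches_per_line` and the sample lines are hypothetical and only serve this illustration.

```python
import re

def matches_per_line(pattern, lines):
    """Return one list of matches for each line that matches the pattern."""
    results = []
    for line in lines:
        found = re.findall(pattern, line.rstrip())
        if found:  # skip lines with no match
            results.append(found)
    return results

lines = ['12d345g', 'no digits here', '7x']  # made-up sample lines
result = matches_per_line(r'\d+[a-z]', lines)
print(result)  # [['12d', '345g'], ['7x']]
```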

In [11]:
import re

In [12]:
# Open the file in a with-block so it is closed automatically
with open('text_files_for_python_notebook/text_file_1.txt', 'r') as text:
    print('The text we are working with is:\n')
    for line in text:
        print(line.rstrip())

The text we are working with is:

1
12
123
1234
12345
12345f6789
12d345g676h
12d345g676her
12d345g676her33654ytf
hi, this is a bunch of letter
worddddd
123 word 123343


### Example 1

Match one or more digits followed by a letter.

In [13]:
c1 = 0  # lines without a match
c2 = 0  # lines with at least one match
pattern = r'\d+[a-z]'  # raw string, so \d reaches the regex engine unchanged
pattern_text = 'Match one or more digits followed by a letter:'

with open('text_files_for_python_notebook/text_file_1.txt', 'r') as text:
    for line in text:
        line = line.rstrip()
        if not re.search(pattern, line):
            c1 = c1 + 1
            continue
        c2 = c2 + 1
        # Print every match found in this line as a list
        print(pattern_text, re.findall(pattern, line))

print('\nRows that do not match', c1)
print('Rows that do match', c2)
print('Total number of rows:', c1 + c2)

Match one or more digits followed by a letter: ['12345f']
Match one or more digits followed by a letter: ['12d', '345g', '676h']
Match one or more digits followed by a letter: ['12d', '345g', '676h']
Match one or more digits followed by a letter: ['12d', '345g', '676h', '33654y']

Rows that do not match 8
Rows that do match 4
Total number of rows: 12


### Example 2

Match zero or more digits followed by a letter.

In [14]:
c1 = 0  # lines without a match
c2 = 0  # lines with at least one match
pattern = r'\d*[a-z]'  # \d* also matches zero digits, so a lone letter is a match
pattern_text = 'Match zero or more digits followed by a letter:'

with open('text_files_for_python_notebook/text_file_1.txt', 'r') as text:
    for line in text:
        line = line.rstrip()
        if not re.search(pattern, line):
            c1 = c1 + 1
            continue
        c2 = c2 + 1
        print(pattern_text, re.findall(pattern, line))

print('\nRows that do not match', c1)
print('Rows that do match', c2)
print('Total number of rows:', c1 + c2)

Match zero or more digits followed by a letter: ['12345f']
Match zero or more digits followed by a letter: ['12d', '345g', '676h']
Match zero or more digits followed by a letter: ['12d', '345g', '676h', 'e', 'r']
Match zero or more digits followed by a letter: ['12d', '345g', '676h', 'e', 'r', '33654y', 't', 'f']
Match zero or more digits followed by a letter: ['h', 'i', 't', 'h', 'i', 's', 'i', 's', 'a', 'b', 'u', 'n', 'c', 'h', 'o', 'f', 'l', 'e', 't', 't', 'e', 'r']
Match zero or more digits followed by a letter: ['w', 'o', 'r', 'd', 'd', 'd', 'd', 'd']
Match zero or more digits followed by a letter: ['w', 'o', 'r', 'd']

Rows that do not match 5
Rows that do match 7
Total number of rows: 12
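
The difference between `+` and `*` is easiest to see on a single line. The sketch below runs both patterns on one of the sample lines shown earlier; the variable names are only illustrative.

```python
import re

line = '12d345g676her'  # one of the sample lines from the text file

plus = re.findall(r'\d+[a-z]', line)  # at least one digit is required
star = re.findall(r'\d*[a-z]', line)  # the digits are optional

print(plus)  # ['12d', '345g', '676h']
print(star)  # ['12d', '345g', '676h', 'e', 'r']
```

Because `\d*` is satisfied by zero digits, every lowercase letter matches on its own, which is why lines made only of letters also appear in the output of Example 2.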


### Example 3

Match one or more digits followed by two letters.

In [15]:
c1 = 0  # lines without a match
c2 = 0  # lines with at least one match
pattern = r'\d+[a-z][a-z]'  # both letters are required
pattern_text = 'Match one or more digits followed by two letters:'

with open('text_files_for_python_notebook/text_file_1.txt', 'r') as text:
    for line in text:
        line = line.rstrip()
        if not re.search(pattern, line):
            c1 = c1 + 1
            continue
        c2 = c2 + 1
        print(pattern_text, re.findall(pattern, line))

print('\nRows that do not match', c1)
print('Rows that do match', c2)
print('Total number of rows:', c1 + c2)

Match one or more digits followed by two letters: ['676he']
Match one or more digits followed by two letters: ['676he', '33654yt']

Rows that do not match 10
Rows that do match 2
Total number of rows: 12


### Example 4

Match one or more digits followed by two letters; the first letter is required and the second one is optional.

In [16]:
c1 = 0  # lines without a match
c2 = 0  # lines with at least one match
pattern = r'\d+[a-z][a-z]?'  # the trailing ? makes the second letter optional
pattern_text = 'Match one or more digits followed by a letter and an optional second letter:'

with open('text_files_for_python_notebook/text_file_1.txt', 'r') as text:
    for line in text:
        line = line.rstrip()
        if not re.search(pattern, line):
            c1 = c1 + 1
            continue
        c2 = c2 + 1
        print(pattern_text, re.findall(pattern, line))

print('\nRows that do not match', c1)
print('Rows that do match', c2)
print('Total number of rows:', c1 + c2)

Match one or more digits followed by a letter and an optional second letter: ['12345f']
Match one or more digits followed by a letter and an optional second letter: ['12d', '345g', '676h']
Match one or more digits followed by a letter and an optional second letter: ['12d', '345g', '676he']
Match one or more digits followed by a letter and an optional second letter: ['12d', '345g', '676he', '33654yt']

Rows that do not match 8
Rows that do match 4
Total number of rows: 12
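
A `?` placed after a character or a set makes that item optional. The sketch below compares the required and optional versions on one of the sample lines, plus a made-up spelling example; all variable names are illustrative.

```python
import re

line = '12d345g676her'  # one of the sample lines from the text file

two_letters = re.findall(r'\d+[a-z][a-z]', line)       # second letter required
optional_second = re.findall(r'\d+[a-z][a-z]?', line)  # second letter optional

print(two_letters)      # ['676he']
print(optional_second)  # ['12d', '345g', '676he']

# Made-up example: the u is optional, so both spellings match
spellings = re.findall(r'colou?r', 'color colour colr')
print(spellings)        # ['color', 'colour']
```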


### Example 5

Match one or more digits followed by any letter and then an optional 's'.

In [17]:
c1 = 0  # lines without a match
c2 = 0  # lines with at least one match
pattern = r'\d+[a-z]s?'  # s? matches an s only if one is present
pattern_text = 'Match one or more digits followed by a letter and an optional s:'

with open('text_files_for_python_notebook/text_file_1.txt', 'r') as text:
    for line in text:
        line = line.rstrip()
        if not re.search(pattern, line):
            c1 = c1 + 1
            continue
        c2 = c2 + 1
        print(pattern_text, re.findall(pattern, line))

print('\nRows that do not match', c1)
print('Rows that do match', c2)
print('Total number of rows:', c1 + c2)

Match one or more digits followed by a letter and an optional s: ['12345f']
Match one or more digits followed by a letter and an optional s: ['12d', '345g', '676h']
Match one or more digits followed by a letter and an optional s: ['12d', '345g', '676h']
Match one or more digits followed by a letter and an optional s: ['12d', '345g', '676h', '33654y']

Rows that do not match 8
Rows that do match 4
Total number of rows: 12


### Example 6 

Using {} to specify the number of digits to be matched. The syntax of the curly braces is as follows, written without spaces: {minimum,maximum}

In [18]:
c1 = 0  # lines without a match
c2 = 0  # lines with at least one match
pattern = r'\d{2,10}'  # between two and ten digits in a row
pattern_text = 'Match runs of digits (minimum two, maximum ten):'

with open('text_files_for_python_notebook/text_file_1.txt', 'r') as text:
    for line in text:
        line = line.rstrip()
        if not re.search(pattern, line):
            c1 = c1 + 1
            continue
        c2 = c2 + 1
        print(pattern_text, re.findall(pattern, line))

print('\nRows that do not match', c1)
print('Rows that do match', c2)
print('Total number of rows:', c1 + c2)

Match runs of digits (minimum two, maximum ten): ['12']
Match runs of digits (minimum two, maximum ten): ['123']
Match runs of digits (minimum two, maximum ten): ['1234']
Match runs of digits (minimum two, maximum ten): ['12345']
Match runs of digits (minimum two, maximum ten): ['12345', '6789']
Match runs of digits (minimum two, maximum ten): ['12', '345', '676']
Match runs of digits (minimum two, maximum ten): ['12', '345', '676']
Match runs of digits (minimum two, maximum ten): ['12', '345', '676', '33654']
Match runs of digits (minimum two, maximum ten): ['123', '123343']

Rows that do not match 3
Rows that do match 9
Total number of rows: 12
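
The braces also accept an exact count or an open-ended minimum. Here is a short sketch of the three forms on a made-up line; the variable names are only illustrative.

```python
import re

line = '1 12 123 1234 12345'  # made-up sample line

exactly_three = re.findall(r'\d{3}', line)   # exactly three digits
two_to_three = re.findall(r'\d{2,3}', line)  # two or three digits
four_or_more = re.findall(r'\d{4,}', line)   # at least four digits

print(exactly_three)  # ['123', '123', '123']
print(two_to_three)   # ['12', '123', '123', '123', '45']
print(four_or_more)   # ['1234', '12345']
```

Note how `\d{2,3}` splits `12345` into `123` and `45`: once three digits are consumed, the search restarts on the digits that are left.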


### Example 7 

Using [] to define a set of characters to match.

In [19]:
c1 = 0  # lines without a match
c2 = 0  # lines with at least one match
# pattern = r'[+]?(\d{2,4}[ ]?){2,4}'
pattern = r'(\d+[\-\._ ]?)'  # digits, then an optional -, ., _ or space
pattern_text = 'Match one or more digits followed by an optional -, ., _ or space:'

with open('text_files_for_python_notebook/telephone_numbers.txt', 'r') as text:
    for line in text:
        line = line.rstrip()
        if not re.search(pattern, line):
            c1 = c1 + 1
            continue
        c2 = c2 + 1
        print(pattern_text, re.findall(pattern, line))

print('\nRows that do not match', c1)
print('Rows that do match', c2)
print('Total number of rows:', c1 + c2)

Match one or more digits followed by an optional -, ., _ or space: ['6665473892']
Match one or more digits followed by an optional -, ., _ or space: ['665-', '6785-', '345']
Match one or more digits followed by an optional -, ., _ or space: ['657.', '6787.', '456']
Match one or more digits followed by an optional -, ., _ or space: ['56789-', '543456']
Match one or more digits followed by an optional -, ., _ or space: ['56789 ', '65422']
Match one or more digits followed by an optional -, ., _ or space: ['56785.', '67788']
Match one or more digits followed by an optional -, ., _ or space: ['13456_', '78995']
Match one or more digits followed by an optional -, ., _ or space: ['54 ', '388 ', '5377055']

Rows that do not m
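
The parentheses in the pattern above control what `re.findall()` extracts. The sketch below moves the group around on a made-up number written in the same style as the output; the variable names are illustrative.

```python
import re

number = '665-6785-345'  # made-up sample in the style of the output above

# Group around the whole pattern: separators are part of each result
with_separator = re.findall(r'(\d+[-._ ]?)', number)

# Group around the digits only: separators are matched but not extracted
digits_only = re.findall(r'(\d+)[-._ ]?', number)

# Two groups: findall returns one tuple per match
pairs = re.findall(r'(\d+)([-._ ]?)', number)

print(with_separator)  # ['665-', '6785-', '345']
print(digits_only)     # ['665', '6785', '345']
print(pairs)           # [('665', '-'), ('6785', '-'), ('345', '')]
```

Inside a set, a hyphen placed first, as in `[-._ ]`, is read as a literal hyphen, so it does not need a backslash.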

### Example 8

How to use the '?' character. By default, regular expressions are greedy, which means they match as much text as they can. For example, the expression '.+,' reads: find any character one or more times, up to a comma. Because the match is greedy, it runs to the last comma in the line. As you can see in the following examples, the first one returns a single match per line, while the second one, which adds '?', returns individual matches.

In [20]:
c1 = 0  # lines without a match
c2 = 0  # lines with at least one match
pattern = r'.+,'  # greedy: runs to the last comma in the line
pattern_text = 'Greedy match:'

with open('text_files_for_python_notebook/text_file_2.txt', 'r') as text:
    for line in text:
        line = line.rstrip()
        if not re.search(pattern, line):
            c1 = c1 + 1
            continue
        c2 = c2 + 1
        print(pattern_text, re.findall(pattern, line))

print('\nRows that do not match', c1)
print('Rows that do match', c2)
print('Total number of rows:', c1 + c2)

Greedy match: ['123, 456, 789, 120,']
Greedy match: ['abf, asd, fgh,']
Greedy match: ['fr43, ghy6, hju7,']

Rows that do not match 0
Rows that do match 3
Total number of rows: 3


Now using the '?' character to make the match non-greedy:

In [21]:
c1 = 0  # lines without a match
c2 = 0  # lines with at least one match
pattern = r'.+?,'  # non-greedy: stops at the first comma it can
pattern_text = 'Non-greedy match:'

with open('text_files_for_python_notebook/text_file_2.txt', 'r') as text:
    for line in text:
        line = line.rstrip()
        if not re.search(pattern, line):
            c1 = c1 + 1
            continue
        c2 = c2 + 1
        print(pattern_text, re.findall(pattern, line))

print('\nRows that do not match', c1)
print('Rows that do match', c2)
print('Total number of rows:', c1 + c2)

Non-greedy match: ['123,', ' 456,', ' 789,', ' 120,']
Non-greedy match: ['abf,', ' asd,', ' fgh,']
Non-greedy match: ['fr43,', ' ghy6,', ' hju7,']

Rows that do not match 0
Rows that do match 3
Total number of rows: 3
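
The greedy and non-greedy behaviour can also be reproduced on a single string. This sketch uses a made-up line in the style of the output above, and adds a third pattern based on a negated set as one possible alternative; the variable names are illustrative.

```python
import re

line = '123, 456, 789, 120, end'  # made-up sample line

greedy = re.findall(r'.+,', line)   # as much as possible, up to the last comma
lazy = re.findall(r'.+?,', line)    # as little as possible, up to the next comma

# Alternative without ?: match anything that is not a comma, then a comma
no_commas = re.findall(r'[^,]+,', line)

print(greedy)     # ['123, 456, 789, 120,']
print(lazy)       # ['123,', ' 456,', ' 789,', ' 120,']
print(no_commas)  # ['123,', ' 456,', ' 789,', ' 120,']
```

The trailing ` end` is never returned because no comma follows it.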
