
## Regular expressions
In this notebook we'll work through some regex examples to give you an intuition for how to use regular expressions.

In [None]:
import re

In [None]:
text = 'This is just a string, isn\'t it?'
p = r'is'
res = re.findall(p, text)
print(res)

['is', 'is', 'is']
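
re.findall returns only the matched substrings. When the position of each match matters, re.finditer yields match objects instead, and the word boundary `\b` restricts a match to a standalone word. A minimal sketch reusing the sentence above (the names `spans` and `whole_words` are illustrative and not part of the original notebook):

```python
import re

text = "This is just a string, isn't it?"

# finditer yields one match object per non-overlapping match
spans = [(m.start(), m.end()) for m in re.finditer(r'is', text)]
print(spans)

# \b is a word boundary, so only the standalone word "is" matches
whole_words = re.findall(r'\bis\b', text)
print(whole_words)
```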


Match text containing abc

In [None]:
text = ['abcdefg',
       'abcde',
       'abc',
       'xyz']
p = r'abc.*'
results = []
for t in text:
    results.append(re.findall(p, t))
print(results)

[['abcdefg'], ['abcde'], ['abc'], []]
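
Note that findall with a pattern like 'abc.*' returns the text from abc onward, not necessarily the whole string. If only a yes/no answer is needed, re.search is enough. A small sketch with made-up sample strings (`samples`, `tails` and `contains_abc` are hypothetical names):

```python
import re

samples = ['abcdefg', 'xyzabc!', 'xyz']  # illustrative inputs

# findall returns the text from 'abc' to the end of each string
tails = [re.findall(r'abc.*', s) for s in samples]
print(tails)

# search returns a match object or None, so bool() gives a containment test
contains_abc = [bool(re.search(r'abc', s)) for s in samples]
print(contains_abc)
```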


Match text containing digits

In [None]:
text = ['abc123xyz',
       'zipcode "114632"',
       'x = 9',
       'xyz']
p = r'.*[\d]+.*'
results = []
for t in text:
    results.append(re.findall(p, t))
print(results)

[['abc123xyz'], ['zipcode "114632"'], ['x = 9'], []]


Match only text ending with a period.

In [None]:
text = ['abc.',
       '654.',
       '@#$.',
       'xyz!']
p = r'.*\.$'
results = []
for t in text:
    results.append(re.findall(p, t))
print(results)

[['abc.'], ['654.'], ['@#$.'], []]
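
Whether the text really ends with a period depends on the `$` anchor. Without it, a pattern such as `.*\.` also matches the leading part of a string that merely contains a period. A short sketch comparing the two, using illustrative inputs that are not in the original notebook:

```python
import re

samples = ['abc.', 'a.b', 'no period']  # illustrative inputs

# Without an end anchor, the match stops at the last period it can find
unanchored = [re.findall(r'.*\.', s) for s in samples]
print(unanchored)

# With $, the period must be the final character of the string
anchored = [re.findall(r'.*\.$', s) for s in samples]
print(anchored)
```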


Match the first three words only (cat, rat and fat).

In [None]:
text = ['cat',
       'rat',
       'fat',
       'hat',
       'lat',
       'mat']
p = r'[crf]at'
results = []
for t in text:
    results.append(re.findall(p, t))
print(results)

[['cat'], ['rat'], ['fat'], [], [], []]


No rats!

In [None]:
text = ['cat',
       'rat',
       'fat',
       'hat',
       'lat',
       'mat']
p = r'[^r]at'
results = []
for t in text:
    results.append(re.findall(p, t))
print(results)

[['cat'], [], ['fat'], ['hat'], ['lat'], ['mat']]
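
A negated class such as `[^r]` matches any single character other than r, including digits, and findall will also pick up matches inside longer words. If the intent is "a three-letter lowercase word that is not rat", anchors and an explicit range are one way to tighten the pattern. A sketch under that assumption, with made-up inputs:

```python
import re

samples = ['cat', 'rat', '9at', 'that']  # illustrative inputs

# [^r] accepts any character except 'r', anywhere in the string
loose = [re.findall(r'[^r]at', s) for s in samples]
print(loose)

# Anchored version: exactly one lowercase letter other than 'r', then 'at'
strict = [re.findall(r'^[a-qs-z]at$', s) for s in samples]
print(strict)
```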


Match names only, assuming that they start with a capital letter.

In [None]:
text = ['Muhammed',
       'Ziyad',
       'Sara',
       'cmd',
       'folder']
p = r'[A-Z].*'
results = []
for t in text:
    results.append(re.findall(p, t))
print(results)

[['Muhammed'], ['Ziyad'], ['Sara'], [], []]
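
The pattern above is not anchored, so a capital letter anywhere in the string starts a match. The sketch below contrasts it with a stricter variant that requires the capital letter at the start, followed only by lowercase letters. The stricter rule and the sample inputs are assumptions for illustration, not part of the original notebook:

```python
import re

samples = ['Sara', 'iPhone', 'cmd']  # illustrative inputs

# Unanchored: matches from the first capital letter found
loose = [re.findall(r'[A-Z].*', s) for s in samples]
print(loose)

# Anchored: one capital letter at the start, then lowercase letters only
strict = [re.findall(r'^[A-Z][a-z]+$', s) for s in samples]
print(strict)
```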


Dealing with repetitions: match variants of 'hello' that contain 3 to 5 'l' letters.

In [None]:
text = ['helo',
       'hello',
       'helllo',
       'helllllo',
       'hellllllo']
p = r'hel{3,5}o'
results = []
for t in text:
    results.append(re.findall(p, t))
print(results)

[[], [], ['helllo'], ['helllllo'], []]
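
The braces accept a few other forms besides {3,5}: {n} for exactly n repetitions, {n,} for n or more, and {,m} for at most m. A minimal sketch using re.fullmatch so the whole word has to fit the pattern (the helper `matching` and the word list are hypothetical):

```python
import re

words = ['helo', 'hello', 'helllo', 'hellllo']  # illustrative inputs

def matching(pattern):
    """Return the words that match the pattern in full."""
    return [w for w in words if re.fullmatch(pattern, w)]

exactly_two = matching(r'hel{2}o')   # exactly two 'l'
two_or_more = matching(r'hel{2,}o')  # two or more 'l'
up_to_two = matching(r'hel{,2}o')    # zero, one or two 'l'

print(exactly_two)
print(two_or_more)
print(up_to_two)
```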


Match the first three strings only.

In [None]:
text = ['996677',
       '9777',
       '99966667777',
       '9']
p = r'9+6*7+'
results = []
for t in text:
    results.append(re.findall(p, t))
print(results)

[['996677'], ['9777'], ['99966667777'], []]
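
The string '9777' matches because `6*` allows zero sixes, whereas `+` requires at least one occurrence. A short sketch of the difference, with the variant pattern `9+6+7+` introduced here purely for comparison:

```python
import re

samples = ['996677', '9777', '9']

# 6* means zero or more sixes, so '9777' is accepted
with_star = [bool(re.fullmatch(r'9+6*7+', s)) for s in samples]
print(with_star)

# 6+ means one or more sixes, so '9777' is rejected
with_plus = [bool(re.fullmatch(r'9+6+7+', s)) for s in samples]
print(with_plus)
```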


Match text that reports how many patterns were found, i.e. text containing a number.

In [None]:
text = ['1 pattern found.',
       '2 patterns found.',
       '24 patterns found',
       'No patterns found']
p = r'\d+.*'
results = []
for t in text:
    results.append(re.findall(p, t))
print(results)

[['1 pattern found.'], ['2 patterns found.'], ['24 patterns found'], []]


Match lines that have white space between the dash and 123.

In [None]:
text = ['1- 123',
       '2-     123',
       '3-                123',
       '4-123']
p = r'\d-\s+123'
results = []
for t in text:
    results.append(re.findall(p, t))
print(results)

[['1- 123'], ['2-     123'], ['3-                123'], []]
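
`\s` covers tabs and newlines as well as spaces, and swapping `+` for `*` makes the white space optional. A small sketch with illustrative inputs (the tab-separated line is an added example):

```python
import re

samples = ['1- 123', '2-\t123', '4-123']  # illustrative inputs

# \s+ requires at least one whitespace character (space, tab, ...)
required = [bool(re.search(r'\d-\s+123', s)) for s in samples]
print(required)

# \s* also accepts lines with no whitespace at all
optional = [bool(re.search(r'\d-\s*123', s)) for s in samples]
print(optional)
```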


Match only lines that start with 'Udacity'.

In [None]:
text = ['Udacity',
       'Welcome to Udacity',
       'The Udacity Advantage']
p = r'^Udacity.*'
results = []
for t in text:
    results.append(re.findall(p, t))
print(results)

[['Udacity'], [], []]
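
The example above tests each line as a separate string. When the lines live in one multi-line string, `^` matches only at the very start unless the re.MULTILINE flag is passed. A minimal sketch, assuming the text is joined with newlines (the variable `doc` is hypothetical):

```python
import re

doc = 'Udacity\nWelcome to Udacity\nUdacity Advantage'  # illustrative input

# By default ^ matches only at the start of the whole string
single = re.findall(r'^Udacity.*', doc)
print(single)

# With MULTILINE, ^ also matches right after every newline
per_line = re.findall(r'^Udacity.*', doc, flags=re.MULTILINE)
print(per_line)
```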


Match file names that start with 'IMG' and have jpg as their extension, returning only the file name.

In [None]:
text = ['IMG_12072015.jpg',
       'IMG_testfile.jpg',
       'image_file.jpg',
       'IMG_file.jpg.zip',
       'temp_IMG_654.jpg']
p = r'^IMG\w+\.jpg$'
results = []
for t in text:
    results.append(re.findall(p, t))
print(results)

[['IMG_12072015.jpg'], ['IMG_testfile.jpg'], [], [], []]


Match file names that start with 'IMG', contain an image number, and have jpg as their extension.

Capture both the file name and the image number.

In [None]:
text = ['IMG_12072015.jpg',
       'IMG_testfile.jpg',
       'image_file.jpg',
       'IMG_file.jpg.zip',
       'temp_IMG_654.jpg']
p = r'^(IMG_(\d+)\.jpg)'
results = []
for t in text:
    results.append(re.findall(p, t))
print(results)

[[('IMG_12072015.jpg', '12072015')], [], [], [], []]
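
Positional tuples get harder to read as the number of groups grows. Named groups, written `(?P<name>...)`, are one alternative. A sketch using re.search on a single file name, where the group names `filename` and `number` are choices made for this illustration:

```python
import re

# Same pattern as above, with names attached to the two groups
p = re.compile(r'^(?P<filename>IMG_(?P<number>\d+)\.jpg)$')

m = p.search('IMG_12072015.jpg')
info = m.groupdict()  # maps each group name to its captured text
print(info)

# search returns None when the pattern does not match
missing = p.search('IMG_testfile.jpg')
print(missing)
```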


Match the first two lines only, also extracting the liked item.

In [None]:
text = ['I like pizza',
       'I like pasta',
       'I like steaks',
       'i don\'t like pizza']
p = r'^(I like (p.*))'
results = []
for t in text:
    results.append(re.findall(p, t))
print(results)

[[('I like pizza', 'pizza')], [('I like pasta', 'pasta')], [], []]
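
The shape of what findall returns depends on the number of capturing groups: the whole match with no groups, the group's text with one group, and a tuple with several. A non-capturing group `(?:...)` groups alternatives without changing that shape. A brief sketch on a single line (the variable names are illustrative):

```python
import re

line = 'I like pizza'

no_groups = re.findall(r'^I like p.*', line)        # whole match
one_group = re.findall(r'^I like (p.*)', line)      # only the group
two_groups = re.findall(r'^(I like (p.*))', line)   # tuple of both groups

# (?:...) groups the alternatives but captures nothing
non_capturing = re.findall(r'^I like (?:pizza|pasta)$', line)

print(no_groups, one_group, two_groups, non_capturing)
```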


The resolutions are typed with height and width reversed; fix them.

You can use re.sub to replace matching text.

In [None]:
texts = ['600x800',
       '768x1024',
       '1024x1280']
p = r'(\d+)x(\d+)'

for text in texts:
    # \2 and \1 refer to the 2nd and 1st (\d+) groups, so the two numbers swap places
    result = re.sub(p, r'\2x\1', text)
    print(result)

800x600
1024x768
1280x1024
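
The replacement can also refer to named groups with `\g<name>`, or be a function that receives the match object. A minimal sketch of both, where the group names `first` and `second` and the helper `swap` are hypothetical:

```python
import re

texts = ['600x800', '768x1024', '1024x1280']
p = r'(?P<first>\d+)x(?P<second>\d+)'

# \g<name> refers to a named group in the replacement string
swapped = [re.sub(p, r'\g<second>x\g<first>', t) for t in texts]
print(swapped)

def swap(m):
    """Build the replacement text from the match object."""
    return f"{m.group('second')}x{m.group('first')}"

# A callable is invoked once per match and returns the replacement
swapped_fn = [re.sub(p, swap, t) for t in texts]
print(swapped_fn)
```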
