# Regular Expressions in Python

> Some people, when confronted with a problem, think, "I know, I'll use regular expressions." Now they have two problems. - Jamie Zawinski

## First day: quick overview

This first day we will explore the basics of the `re` (standard library) module so you can start adding this powerful skill to your Python toolkit.

In [1]:
import re

### When not to use regexes?

Basically when regular string manipulations will do, for example:

In [2]:
text = 'Awesome, I am doing the #100DaysOfCode challenge'

Does text start with 'Awesome'?

In [3]:
text.startswith('Awesome')

True

Does text end with 'challenge'?

In [4]:
text.endswith('challenge')

True

Does text contain '100daysofcode' (case insensitive)?

In [5]:
'100daysofcode' in text.lower()

True

I am bold and want to do 200 days (note that strings are immutable, so save the result to a new string):

In [6]:
text.replace('100', '200')

'Awesome, I am doing the #200DaysOfCode challenge'

## Regex == Meta language

But what if you need to do something trickier, say matching any #(int)DaysOfCode? Here you want to use a regex pattern. Regular expressions are a (meta) language of their own, and I highly encourage you to read through [this HOWTO](https://docs.python.org/3.7/howto/regex.html#regex-howto) to become familiar with their syntax.

### search VS match

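The main functions you want to know about are `search` and `match`. The former matches a substring anywhere in the string; the latter only matches at the beginning of the string (the match does not have to reach the end). I always wrap my regex in a raw string, `r''`, so I don't have to double the backslashes in special sequences like `\d` (digit), `\w` (word character), `\s` (whitespace), `\S` (non-whitespace), etc.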

In [7]:
text = 'Awesome, I am doing the #100DaysOfCode challenge'

In [8]:
re.search(r'I am', text)

<re.Match object; span=(9, 13), match='I am'>

In [9]:
re.match(r'I am', text)

In [10]:
re.match(r'Awesome.*challenge', text)

<re.Match object; span=(0, 48), match='Awesome, I am doing the #100DaysOfCode challenge'>
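To make the difference concrete, here is a minimal sketch comparing `re.search`, `re.match`, and `re.fullmatch`. The last one is not used elsewhere in this notebook; it is included only to show what "the whole string must match" looks like.

```python
import re

text = 'Awesome, I am doing the #100DaysOfCode challenge'

# search scans the whole string for the first place the pattern matches
found_anywhere = re.search(r'I am', text) is not None

# match only tries at position 0, so 'I am' is not found ...
found_at_start = re.match(r'I am', text) is not None

# ... but a match does not need to reach the end of the string
prefix_only = re.match(r'Awesome', text).group()

# fullmatch requires the pattern to consume the entire string
whole_string = re.fullmatch(r'Awesome', text) is not None

print(found_anywhere, found_at_start, prefix_only, whole_string)
# True False Awesome False
```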

### Capturing strings

A common task is to retrieve part of a match. You can use capturing parentheses `()` for that:

In [11]:
hundred = 'Awesome, I am doing the #100DaysOfCode challenge'
two_hundred = hundred.replace('#100', '#200')

m = re.match(r'.*(#\d+DaysOfCode).*', hundred)
m.groups()[0]

'#100DaysOfCode'

In [12]:
m = re.search(r'.*(#\d+DaysOfCode).*', two_hundred)
m.groups()[0]

'#200DaysOfCode'
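The match object offers a few ways to get at the captured text, and they are easy to mix up. Here is a short sketch; the named group `tag` and the nested digit group are made up for illustration.

```python
import re

text = 'Awesome, I am doing the #200DaysOfCode challenge'

# groups are numbered by the position of their opening parenthesis
m = re.search(r'(?P<tag>#(\d+)DaysOfCode)', text)

whole = m.group()        # same as m.group(0): the entire match
first = m.group(1)       # first capturing group
days = m.group(2)        # second (nested) capturing group
named = m.group('tag')   # named groups can also be looked up by name
all_groups = m.groups()  # tuple with every capturing group

print(whole, days, all_groups)
# #200DaysOfCode 200 ('#200DaysOfCode', '200')
```

Note that `m.group()` returns a string, so indexing it with `[0]` gives the first character of the match, whereas `m.groups()` returns a tuple of the captured groups.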

### `findall` is your friend

What if you want to match multiple instances of a pattern? `re` has the convenient `findall` function, which I use a lot. For example, in [our 100 Days Of Code](https://github.com/pybites/100DaysOfCode/blob/master/LOG.md) we used the `re` module on the following days. How would I extract the days from this string?

In [13]:
text = '''
$ python module_index.py |grep ^re
re                 | stdlib | 005, 007, 009, 015, 021, 022, 068, 080, 081, 086, 095
'''

re.findall(r'\d+', text)

['005', '007', '009', '015', '021', '022', '068', '080', '081', '086', '095']
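One thing worth keeping in mind: when the pattern contains capturing groups, `findall` returns what the groups captured rather than the full match. A small sketch with a made-up sample string:

```python
import re

text = 'I am doing #100DaysOfCode, #200DaysOfDjango and #365DaysOfPyBites'

# no groups: findall returns the full matches
tags = re.findall(r'#\d+DaysOf\w+', text)

# one group: findall returns only what that group captured
days = re.findall(r'#(\d+)DaysOf\w+', text)

# several groups: findall returns a list of tuples
pairs = re.findall(r'#(\d+)DaysOf(\w+)', text)

print(tags)   # ['#100DaysOfCode', '#200DaysOfDjango', '#365DaysOfPyBites']
print(days)   # ['100', '200', '365']
print(pairs)  # [('100', 'Code'), ('200', 'Django'), ('365', 'PyBites')]
```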

How cool is that? Just because we can, look at how you can find the most common words by combining `findall` with `Counter`:

In [14]:
text = """Lorem Ipsum is simply dummy text of the printing and typesetting industry. Lorem Ipsum has been 
the industry's standard dummy text ever since the 1500s, when an unknown printer took a galley of type and 
scrambled it to make a type specimen book. It has survived not only five centuries, but also the leap into 
electronic typesetting, remaining essentially unchanged. It was popularised in the 1960s with the release of
Letraset sheets containing Lorem Ipsum passages, and more recently with desktop publishing software like Aldus
PageMaker including versions of Lorem Ipsum"""

In [15]:
text.split()[:5]

['Lorem', 'Ipsum', 'is', 'simply', 'dummy']

Of course, you can split the text into words with `text.split()`, but if you have more requirements you might be able to fit them into a single regex. For example, let's only count words that start with a capital letter.

I am using two character classes here (= pattern inside `[]`): the first matches a capital letter, the second matches 0 or more common word characters.

Note that I am escaping the single quote (`'`) inside the second character class, because the regex pattern is wrapped in single quotes as well:

In [16]:
from collections import Counter

cnt = Counter(re.findall(r'[A-Z][A-Za-z0-9\']*', text))
cnt.most_common(5)

[('Lorem', 4), ('Ipsum', 4), ('It', 2), ('Letraset', 1), ('Aldus', 1)]
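A possible refinement, not needed for the Lorem Ipsum text: the pattern above also matches a capital letter in the middle of a word. If that matters for your data, a word boundary `\b` in front of the first character class is one way to handle it. The sample sentence below is made up for illustration.

```python
import re
from collections import Counter

sample = "iPhone and iPad users read Lorem Ipsum on an iPhone"

# without \b the capital letter inside 'iPhone' starts a match ('Phone')
loose = Counter(re.findall(r"[A-Z][A-Za-z0-9']*", sample))

# with \b the capital letter has to be at the start of a word
strict = Counter(re.findall(r"\b[A-Z][A-Za-z0-9']*", sample))

print(loose.most_common(1))  # [('Phone', 2)]
print(sorted(strict))        # ['Ipsum', 'Lorem']
```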

### Compiling regexes

If you want to run the same regex multiple times, say in a for loop, it is best practice to define the regex once using `re.compile`. Here is an example:

In [17]:
movies = '''1. Citizen Kane (1941)
2. The Godfather (1972)
3. Casablanca (1942)
4. Raging Bull (1980)
5. Singin' in the Rain (1952)
6. Gone with the Wind (1939)
7. Lawrence of Arabia (1962)
8. Schindler's List (1993)
9. Vertigo (1958)
10. The Wizard of Oz (1939)'''.split('\n')
movies

['1. Citizen Kane (1941)',
 '2. The Godfather (1972)',
 '3. Casablanca (1942)',
 '4. Raging Bull (1980)',
 "5. Singin' in the Rain (1952)",
 '6. Gone with the Wind (1939)',
 '7. Lawrence of Arabia (1962)',
 "8. Schindler's List (1993)",
 '9. Vertigo (1958)',
 '10. The Wizard of Oz (1939)']

Let's find movie titles that have exactly 2 words, just for the sake of exercise. Before peeking at the solution, how would you define such a regex?

OK, here is one way to do it. I am using `re.VERBOSE`, which ignores whitespace in the pattern and allows comments, so I can explain what each part of the regex does (really nice!):

In [18]:
pat = re.compile(r'''
^             # start of string
\d+           # one or more digits
\.            # a literal dot
\s+           # one or more spaces
(?:           # non-capturing group: I don't want to store this match in groups()
[A-Za-z']+\s  # character class (note inclusion of ' for "Schindler's"), followed by a space
)             # closing of non-capturing parenthesis
{2}           # exactly 2 of the previously grouped subpattern
\(            # literal opening parenthesis
\d{4}         # exactly 4 digits (year)
\)            # literal closing parenthesis
$             # end of string
''', re.VERBOSE
)

As we've seen before, if the regex matches it returns a `re.Match` object; otherwise it returns `None`:

In [19]:
for movie in movies:
    print(movie, pat.match(movie))

1. Citizen Kane (1941) <re.Match object; span=(0, 22), match='1. Citizen Kane (1941)'>
2. The Godfather (1972) <re.Match object; span=(0, 23), match='2. The Godfather (1972)'>
3. Casablanca (1942) None
4. Raging Bull (1980) <re.Match object; span=(0, 21), match='4. Raging Bull (1980)'>
5. Singin' in the Rain (1952) None
6. Gone with the Wind (1939) None
7. Lawrence of Arabia (1962) None
8. Schindler's List (1993) <re.Match object; span=(0, 26), match="8. Schindler's List (1993)">
9. Vertigo (1958) None
10. The Wizard of Oz (1939) None
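Since `match` returns `None` for lines that do not fit, a compiled pattern is usually combined with an `if` check inside the loop. The sketch below uses a different, hypothetical pattern (`movie_pat`, with named groups `rank`, `title`, and `year`) to pull the fields out of a few of the movie lines; it is one possible approach, not the only one.

```python
import re

movies = [
    '1. Citizen Kane (1941)',
    "5. Singin' in the Rain (1952)",
    '9. Vertigo (1958)',
    'not a movie line',
]

# hypothetical pattern for illustration: rank, title and year as named groups
movie_pat = re.compile(r"^(?P<rank>\d+)\.\s+(?P<title>.+?)\s+\((?P<year>\d{4})\)$")

parsed = []
for movie in movies:
    m = movie_pat.match(movie)
    if m:  # skip lines where match returned None
        parsed.append((int(m.group('rank')), m.group('title'), int(m.group('year'))))

print(parsed)
# [(1, 'Citizen Kane', 1941), (5, "Singin' in the Rain", 1952), (9, 'Vertigo', 1958)]
```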


### Advanced string replacing

As shown before, `str.replace` probably covers a lot of your needs. For more advanced usage there is `re.sub`:

In [20]:
text = '''Awesome, I am doing #100DaysOfCode, #200DaysOfDjango and of course #365DaysOfPyBites'''

# I want all challenges to be 100 days, I need a break!
text.replace('200', '100').replace('365', '100')

'Awesome, I am doing #100DaysOfCode, #100DaysOfDjango and of course #100DaysOfPyBites'

`re.sub` makes this easy:

In [21]:
re.sub(r'\d+', '100', text)

'Awesome, I am doing #100DaysOfCode, #100DaysOfDjango and of course #100DaysOfPyBites'

Or what if I want to change all the #nDaysOf... to #nDaysOfPython? You can use `re.sub` for this. Note how I use capturing parentheses to carry the matching part of the string over to the replacement (2nd argument), where I use `\1` to reference it:

In [23]:
re.sub(r'(#\d+DaysOf)\w+', r'\1Python', text)

'Awesome, I am doing #100DaysOfPython, #200DaysOfPython and of course #365DaysOfPython'
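The replacement argument of `re.sub` can also be a function that receives each match object and returns the replacement string, which helps when the new text has to be computed. A minimal sketch, where the helper `halve_days` is a made-up name for illustration:

```python
import re

text = 'Awesome, I am doing #100DaysOfCode, #200DaysOfDjango and of course #365DaysOfPyBites'


def halve_days(match):
    """Return the hashtag with its day count halved (integer division)."""
    days = int(match.group(1))
    return f'#{days // 2}DaysOf{match.group(2)}'


# re.sub calls halve_days once per match and inserts its return value
result = re.sub(r'#(\d+)DaysOf(\w+)', halve_days, text)

print(result)
# Awesome, I am doing #50DaysOfCode, #100DaysOfDjango and of course #182DaysOfPyBites
```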