# Regex Character Classes and the findall() Method

In [1]:
import regex as re

## Find All

In [2]:
phoneRegex = re.compile(r'\d\d\d-\d\d\d-\d\d\d\d')

In [3]:
phoneRegex

regex.Regex('\\d\\d\\d-\\d\\d\\d-\\d\\d\\d\\d', flags=regex.V0)

In [6]:
resume =  '''
als;djfa;jshfadh;alsjgpasousa;jf
asdfasjasjfsa;ljfas
a;glkjas;lgjk;ashgirpwuyhgna
alk;gdhalk;gjpowiegj;lakjsg
a
011-555-1234 and 011-222-4567 as;ldfjpwoqig;laknbvrupinbvakrjfsajf alsn as
asa
sdjfa;slgf aoug
asgajgf;lasg
ag;oajhga
g;agow[rjgfr
akfgpaqoegrj Puta que pariu quanto momomomomomomomomomomomomo
'''

In [7]:
phoneRegex.findall(resume)

['011-555-1234', '011-222-4567']

If the regular expression has no groups (no parentheses "()"), findall() returns a list of the matched strings. You get a list of strings whenever the regex has zero or one group; with exactly one group, each string is the text captured by that group.
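The zero-group and one-group cases can be compared side by side. This is a minimal sketch: the sample text and the variable names are made up for illustration, and it uses the standard library `re` module rather than the third-party `regex` package imported above (the two should behave the same for these patterns).

```python
import re  # standard library; assumed equivalent to `regex` for these patterns

text = 'Call 011-555-1234 or 011-222-4567.'  # hypothetical sample text

# Zero groups: each list item is the full match.
no_group = re.compile(r'\d\d\d-\d\d\d-\d\d\d\d')
full_matches = no_group.findall(text)
print(full_matches)   # ['011-555-1234', '011-222-4567']

# One group: each list item is only the text captured by that group.
one_group = re.compile(r'(\d\d\d)-\d\d\d-\d\d\d\d')
area_codes = one_group.findall(text)
print(area_codes)     # ['011', '011']
```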

In [8]:
phoneRegex = re.compile(r'(\d\d\d)-(\d\d\d-\d\d\d\d)')

In [9]:
phoneRegex.findall(resume)

[('011', '555-1234'), ('011', '222-4567')]

Because the pattern now contains two groups, findall() returns a list of tuples, and each tuple holds the strings captured by the groups, in order.

If you also wrap the whole pattern in parentheses, you get 3 groups: 1 for the full number, 1 for the area code and 1 for the rest of the number.

In [16]:
phoneRegex = re.compile(r'((\d\d\d)-(\d\d\d-\d\d\d\d))')

In [17]:
phoneRegex.findall(resume)

[('011-555-1234', '011', '555-1234'), ('011-222-4567', '011', '222-4567')]

findall() does not return a match object. It returns a list of strings or a list of tuples of strings.
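When match objects are needed (for example, to get the position of each match), `finditer()` is one alternative to `findall()`. The sketch below assumes the standard library `re` module and a made-up sample string; it is an illustration, not part of the original notebook.

```python
import re  # standard library; assumed equivalent to `regex` for this pattern

text = 'Call 011-555-1234 or 011-222-4567.'  # hypothetical sample text
phone = re.compile(r'(\d\d\d)-(\d\d\d-\d\d\d\d)')

starts = []
for match in phone.finditer(text):
    # group(0) is the whole match, group(1) the area code,
    # and start() the offset of the match inside the text.
    print(match.group(0), match.group(1), match.start())
    starts.append(match.start())
```

Each `match` here is a full match object, so `group()`, `start()`, `end()` and `span()` are all available.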

## Character Classes

In [19]:
digitRegex = re.compile(r'\d')  # matches the same ASCII digits as r'(0|1|2|3|4|5|6|7|8|9)'

| Shorthand character class | Represents |
| -------------------- | ------------ |
| \d | Any numeric digit from 0 to 9 |
| \D | Any character that is *not* a numeric digit from 0 to 9 |
| \w | Any letter, numeric digit, or the underscore character. (Think of this as matching "word" characters.) |
| \W | Any character that is *not* a letter, numeric digit, or the underscore character. |
| \s | Any space, tab, or newline character. (Think of this as matching "space" characters.) |
| \S | Any character that is *not* a space, tab, or newline. |

Character classes are nice for shortening regular expressions. The character class [0-5] will match only the digits 0 through 5; this is much shorter than typing (0|1|2|3|4|5).
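A quick check that the range class and the longer alternation find the same characters. The sample string and variable names are invented for illustration, and the standard library `re` module is assumed.

```python
import re  # standard library; assumed equivalent to `regex` for these patterns

sample = 'Room 1057, floor 36'  # hypothetical sample text

# Character class with a range: any single digit from 0 through 5.
range_hits = re.findall(r'[0-5]', sample)

# The same set written as an alternation (no group, so findall returns strings).
alt_hits = re.findall(r'0|1|2|3|4|5', sample)

print(range_hits)  # ['1', '0', '5', '3']
print(alt_hits)    # ['1', '0', '5', '3']
```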

Example:

In [35]:
xmasRegex = re.compile(r'\d+\s\w+\s\w+')

In [36]:
xmasRegex.findall('12 drummers drumming, 11 pipers piping, 10 lords a leaping, 9 ladies dancing, 8 maids a milking,\
 7 swans a swimming, 6 geese a laying, 5 golden rings, 4 calling birds, 3 french hens, 2 turtle doves, 1 partridge in a pear tree')

['12 drummers drumming',
 '11 pipers piping',
 '10 lords a',
 '9 ladies dancing',
 '8 maids a',
 '7 swans a',
 '6 geese a',
 '5 golden rings',
 '4 calling birds',
 '3 french hens',
 '2 turtle doves',
 '1 partridge in']

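The regular expression \d+\s\w+\s\w+ will match text that has one or more numeric digits (\d+), followed by a whitespace character (\s), followed by one or more letter/digit/underscore characters (\w+), followed by another whitespace character and another run of word characters. Because \w+ stops at whitespace, "10 lords a leaping" is matched only as '10 lords a'. The findall() method returns all matching strings of the regex pattern in a list.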
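To see what each `\s\w+` piece contributes, the pattern can be built up one step at a time. The pattern used in the cell above has two `\s\w+` pieces; the shortened lyric string below is chosen for illustration, and the standard library `re` module is assumed.

```python
import re  # standard library; assumed equivalent to `regex` for these patterns

lyrics = '10 lords a leaping, 9 ladies dancing'  # shortened sample text

# Digits, one whitespace character, one word.
two_parts = re.findall(r'\d+\s\w+', lyrics)

# Digits, whitespace, word, whitespace, word (the notebook's pattern).
three_parts = re.findall(r'\d+\s\w+\s\w+', lyrics)

print(two_parts)    # ['10 lords', '9 ladies']
print(three_parts)  # ['10 lords a', '9 ladies dancing']
```

Each `\w+` stops at the next whitespace character, which is why "10 lords a leaping" is cut off after "a".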

### Creating your Own Classes

In [42]:
vowelRegex = re.compile(r'[aeiouAEIOU]')    # same matches as r'a|e|i|o|u|A|E|I|O|U'

In [43]:
vowelRegex.findall('Robocop eats baby food.')

['o', 'o', 'o', 'e', 'a', 'a', 'o', 'o']

In [44]:
doubleVowelRegex = re.compile(r'[aeiouAEIOU]{2}') 

In [45]:
doubleVowelRegex.findall('Robocop eats baby food.')

['ea', 'oo']

#### Negative character classes

The caret sign "^" at the start of a character class negates it. So this class matches every character that is NOT a vowel, including spaces and punctuation.

In [49]:
consonantsRegex = re.compile(r'[^aeiouAEIOU]')

In [50]:
consonantsRegex.findall('Robocop eats baby food.')

['R', 'b', 'c', 'p', ' ', 't', 's', ' ', 'b', 'b', 'y', ' ', 'f', 'd', '.']
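The output above includes spaces and the period, because a negated class matches anything not listed in it. One possible way to keep only consonant letters is to add the unwanted categories to the negated class as well. This is a sketch using the standard library `re` module; the name `lettersOnlyRegex` is hypothetical.

```python
import re  # standard library; assumed equivalent to `regex` for these patterns

# Exclude vowels, non-word characters (\W), digits (\d) and the underscore.
# What is left is letters that are not vowels.
lettersOnlyRegex = re.compile(r'[^aeiouAEIOU\W\d_]')
consonants = lettersOnlyRegex.findall('Robocop eats baby food.')
print(consonants)
# ['R', 'b', 'c', 'p', 't', 's', 'b', 'b', 'y', 'f', 'd']

# The caret only negates when it is the first character of the class;
# anywhere else it is an ordinary character.
literal_caret = re.findall(r'[ae^]', 'a^b')
print(literal_caret)  # ['a', '^']
```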