# Regular expressions in Python 2

In [1]:
import re

## Example 1: use groups

In [2]:
recids = ['(ABN)1224343343343', '(RERO)343423432432']
solution = ['ABN', 'RERO']

In [3]:
recids_found = []
for recid in recids:
    recid_match = re.match(r'\((\w+)\)\d+', recid)
    if recid_match is not None:
        recids_found.append(recid_match.group(1))
print(recids_found, recids_found==solution)
print(solution)

['ABN', 'RERO'] True
['ABN', 'RERO']


#### Explanations

`r'\((\w+)\)\d+'`

* Never forget the `r` in front of the regular expression. It marks a raw string, in which `\` is kept as it is. Without the `r` you have to escape the backslashes yourself, for example writing `\\b` instead of `\b`.
* `\w` matches any word character (`[a-zA-Z0-9_]` for ASCII text)
* `\d` matches any digit
* `+` quantifier: matches the preceding element 1 or more times
* `\` is the escape character. `(` and `)` have a special meaning in regular expressions (they delimit capturing groups), so to match a literal `(` we need to escape it
* To get the content of the entire match: `re.match(r'\((\w+)\)\d+', recid).group(0)`. To get only the content of the first capturing parenthesis: `re.match(r'\((\w+)\)\d+', recid).group(1)`
* To avoid errors, always check the result of the match before accessing the groups: `if recid_match is not None:`
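
The points above can be sketched in a few lines. This is only an illustration that reuses the first identifier from the example; the variable names `recid_match` and `no_match` are chosen here for the sketch.

```python
import re

recid = '(ABN)1224343343343'

# In a raw string the backslash is kept, so r'\b' has two characters.
# Without the r prefix, '\b' is a single backspace character.
print(len(r'\b'), len('\b'))  # 2 1
print(r'\b' == '\\b')         # True

recid_match = re.match(r'\((\w+)\)\d+', recid)
if recid_match is not None:
    print(recid_match.group(0))  # (ABN)1224343343343  (entire match)
    print(recid_match.group(1))  # ABN  (first capturing group)

# re.match returns None when the pattern does not match at the start of
# the string; calling .group() on None would raise AttributeError.
no_match = re.match(r'\((\w+)\)\d+', 'ABN1224343343343')
print(no_match)  # None
```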

## Example 2: use flags + quantifiers

In [4]:
issns = ['0170-5970 (ISSN)',
         '0170-597x print',
         ' 0170397X',
         '12-34-5678' # should be rejected
        ] 

solution = ['01705970', '0170597x', '0170397X']

In [5]:
# Incomplete solution
issns_found = []
for issn in issns:
    issn_match = re.search(r'\d{4}-?\d{3}[\dX]', issn, re.I)
    if issn_match is not None:
        issns_found.append(issn_match.group(0))
print(issns_found, issns_found==solution)
print(solution)

['0170-5970', '0170-597x', '0170397X'] False
['01705970', '0170597x', '0170397X']


#### Explanations

`r'\d{4}-?\d{3}[\dX]'`


* `{n}` quantifier: matches exactly the number of times indicated. `{2,4}` matches between 2 and 4 times
* `?` quantifier: matches 0 or 1 time
* `[\dX]` brackets match a single character among several options, here `X` or a digit
* `re.I` ignores case, so `[\dX]` also matches `x`
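
Each of these elements can be tried in isolation. The sample strings below are made up for this sketch and are not part of the exercise data.

```python
import re

# {n} and {m,n}: exact and bounded repetition
exact = re.findall(r'\d{4}', '12345678')
bounded = re.findall(r'\d{2,4}', '1 12 12345')
print(exact)    # ['1234', '5678']
print(bounded)  # ['12', '1234']

# ?: the hyphen may be present or absent
with_hyphen = re.search(r'\d{4}-?\d{4}', '0170-5970')
without_hyphen = re.search(r'\d{4}-?\d{4}', '01705970')
print(with_hyphen is not None, without_hyphen is not None)  # True True

# Character class, case-sensitive and with re.I
case_sensitive = re.findall(r'[\dX]', '7xX')
ignore_case = re.findall(r'[\dX]', '7xX', re.I)
print(case_sensitive)  # ['7', 'X']
print(ignore_case)     # ['7', 'x', 'X']
```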

**The hyphen is still a problem**: it remains in the matched text, so the result differs from the solution.

In [6]:
issns_found = []
for issn in issns:
    issn_match = re.search(r'(\d{4})-?(\d{3}[\dx])', issn, re.I)
    if issn_match is not None:
        issns_found.append(issn_match.group(1) + issn_match.group(2))
print(issns_found, issns_found==solution)
print(solution)

['01705970', '0170597x', '0170397X'] True
['01705970', '0170597x', '0170397X']


#### Explanations
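One possible solution is to remove the `-` from the matched text: `issn_match.group(0).replace('-', '')`. Removing it from the input before searching, as in `re.search(r'\d{4}-?\d{3}[\dX]', issn.replace('-', ''), re.I)`, is riskier: `'12-34-5678'` becomes `'12345678'` and is then accepted.

The other solution, used above, is to use capturing groups: `(\d{4})-?(\d{3}[\dx])`. The hyphen sits outside both groups, so concatenating the two groups gives the ISSN without it.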
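
As a rough comparison, the sketch below applies `str.replace` at two different moments, on the same four inputs as the exercise. The names `pattern`, `stripped_first` and `stripped_after` are introduced here for illustration only.

```python
import re

issns = ['0170-5970 (ISSN)', '0170-597x print', ' 0170397X', '12-34-5678']
pattern = re.compile(r'\d{4}-?\d{3}[\dX]', re.I)

# Hyphens removed from the input first: '12-34-5678' turns into
# '12345678', which looks like a valid identifier and is accepted.
stripped_first = []
for issn in issns:
    m = pattern.search(issn.replace('-', ''))
    if m is not None:
        stripped_first.append(m.group(0))

# Hyphen removed from the matched text only: the last entry never
# matches, so it stays rejected.
stripped_after = []
for issn in issns:
    m = pattern.search(issn)
    if m is not None:
        stripped_after.append(m.group(0).replace('-', ''))

print(stripped_first)  # ['01705970', '0170597x', '0170397X', '12345678']
print(stripped_after)  # ['01705970', '0170597x', '0170397X']
```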

## Example 3: Word boundaries and findall

In [7]:
# Get all words starting with "p"
txt = """Python was conceived in the late 1980s[43] by Guido van Rossum at Centrum Wiskunde & Informatica (CWI) in the Netherlands as a successor to the ABC programming language, which was inspired by SETL,[44] capable of exception handling and interfacing with the Amoeba operating system.[13] Its implementation began in December 1989.[45] Van Rossum shouldered sole responsibility for the project, as the lead developer, until 12 July 2018, when he announced his "permanent vacation" from his responsibilities as Python's "benevolent dictator for life", a title the Python community bestowed upon him to reflect his long-term commitment as the project's chief decision-maker.[46] In January 2019, active Python core developers elected a five-member Steering Council to lead the project."""
solution = ['Python',
 'programming',
 'project',
 'permanent',
 'Python',
 'Python',
 'project',
 'Python',
 'project']

In [8]:
# Incomplete solution
result = re.findall(r'P\w*', txt, re.I)
print(result == solution)
result

False


['Python',
 'programming',
 'pired',
 'pable',
 'ption',
 'perating',
 'plementation',
 'ponsibility',
 'project',
 'per',
 'permanent',
 'ponsibilities',
 'Python',
 'Python',
 'pon',
 'project',
 'Python',
 'pers',
 'project']

#### Explanations

`r'P\w*'`

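This regular expression does not restrict matches to whole words: it also starts matching at any `p` in the middle of a word, for example `pired` from `inspired`.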

In [9]:
# Another incomplete solution
result = re.findall(r'\s(P\w*)', txt, re.I)
print(result == solution)
result

False


['programming', 'project', 'Python', 'Python', 'project', 'Python', 'project']

#### Explanations

`r'\s(P\w*)'`

This regular expression will not match the first `Python` because there is no whitespace at the start of the string for `\s` to match.

In [10]:
# Again an incomplete solution
result = re.findall(r'(?:^|\s)(P\w*)', txt, re.I)
print(result == solution)
result

False


['Python',
 'programming',
 'project',
 'Python',
 'Python',
 'project',
 'Python',
 'project']

#### Explanations

`r'(?:^|\s)(P\w*)'`

* `(?: ... )` non-capturing group. I am interested in the context, but I don't want this part in the results of `findall`
* `^` indicates the start of the string, `$` indicates the end of the string
* `|` inside parentheses separates alternative patterns (`or`)

This regular expression will not match `permanent` because the word is preceded by `"`, which is neither whitespace nor the start of the string.
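
The effect of the non-capturing group on `findall` can be seen on a short string. The `sample` text below is invented for this sketch.

```python
import re

sample = 'Python "permanent" project'

# With two capturing groups, findall returns one tuple per match,
# including the context (empty string for ^, or the whitespace).
capturing = re.findall(r'(^|\s)(P\w*)', sample, re.I)
print(capturing)  # [('', 'Python'), (' ', 'project')]

# With a non-capturing group for the context, only the word is returned.
non_capturing = re.findall(r'(?:^|\s)(P\w*)', sample, re.I)
print(non_capturing)  # ['Python', 'project']
```

In both cases `permanent` is missing, since it follows a quotation mark.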

In [11]:
# Complete solution
result = re.findall(r'(?:^|[\W])(P\w*)', txt, re.I)
print(result == solution)
result
print(solution)

True
['Python', 'programming', 'project', 'permanent', 'Python', 'Python', 'project', 'Python', 'project']


#### Explanations

`r'(?:^|[\W])(P\w*)'`

`\W` matches every character that `\w` does not match (whitespace, quotes, punctuation, ...), so it is the right choice here.
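
A small side-by-side check of `\s` and `\W` as context, on a made-up `sample` string with quotes and parentheses:

```python
import re

sample = 'Python "permanent" (project)'

# \s only accepts whitespace before the word.
after_space = re.findall(r'(?:^|\s)(P\w*)', sample, re.I)
print(after_space)  # ['Python']

# \W accepts any non-word character: space, quote, parenthesis, ...
after_non_word = re.findall(r'(?:^|\W)(P\w*)', sample, re.I)
print(after_non_word)  # ['Python', 'permanent', 'project']
```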

In [12]:
# Simpler solution
result = re.findall(r'\bP\w*\b', txt, re.I)
print(result == solution)
result

True


['Python',
 'programming',
 'project',
 'permanent',
 'Python',
 'Python',
 'project',
 'Python',
 'project']

#### Explanations

`r'\bP\w*\b'`

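`\b` matches at a word boundary: the position between a `\w` character and a `\W` character, or the start or end of the string. It does not consume any character.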
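
To see where boundaries fall, here is a short sketch on an invented `sample` string. It also uses `\B`, the standard counterpart of `\b` that matches only where there is no word boundary.

```python
import re

sample = 'pre-print inspired by "Python"'

# \b: the word must start right after a non-word character
# (or at the start of the string).
at_boundary = re.findall(r'\bP\w*\b', sample, re.I)
print(at_boundary)  # ['pre', 'print', 'Python']

# \B: the "p" must be inside a word.
inside_word = re.findall(r'\Bp\w*', sample, re.I)
print(inside_word)  # ['pired']
```

Note that the hyphen counts as a non-word character, so `pre-print` yields two separate words.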

## Example 4: negative lookahead and lookbehind

In [13]:
recids = ['ABN1224343343343', 'RERO343423432432', 'ALEX44333343343343'] 
solution = ['1224343343343', '44333343343343'] # we want to avoid any RERO identifier

In [14]:
recids_found = []
for recid in recids:
    recid_match = re.search(r'^(?!RERO)[A-Z]+(\d+)$', recid)
    if recid_match is not None:
        recids_found.append(recid_match.group(1))
print(recids_found, recids_found==solution)
print(solution)

['1224343343343', '44333343343343'] True
['1224343343343', '44333343343343']


#### Explanations

`r'^(?!RERO)[A-Z]+(\d+)$'`

* `^` indicates the start of the string, `$` indicates the end of the string
* `(?! ... )` negative lookahead: the engine checks whether the enclosed pattern matches the following characters; if it does, the overall match fails. The lookahead itself consumes no characters
* `[A-Z]+` matches one or more uppercase letters
* To exclude single characters rather than a whole string, use `[^abc]`. It matches any character except 'a', 'b' and 'c'.
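
The behaviour of the lookahead and of the negated character class can be checked on a few hand-made identifiers. `XRERO123` is a hypothetical value, added here only to show that the lookahead tests the start of the string and nothing else.

```python
import re

pattern = re.compile(r'^(?!RERO)[A-Z]+(\d+)$')

accepted = pattern.search('ABN123')
rejected = pattern.search('RERO123')
# RERO is not at the start here, so the lookahead does not block the match.
contains_rero = pattern.search('XRERO123')

print(accepted.group(1))       # 123
print(rejected)                # None
print(contains_rero.group(1))  # 123

# [^abc] excludes single characters, not a sequence of characters.
not_abc = re.findall(r'[^abc]', 'aXbYc')
print(not_abc)  # ['X', 'Y']
```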

In [15]:
recids_found = []
for recid in recids:
    recid_match = re.search(r'^[A-Z]+(?<!RERO)(\d+)$', recid)
    if recid_match is not None:
        recids_found.append(recid_match.group(1))
print(recids_found, recids_found==solution)
print(solution)

['1224343343343', '44333343343343'] True
['1224343343343', '44333343343343']


#### Explanations

`r'^[A-Z]+(?<!RERO)(\d+)$'`

`(?<! ... )` is a negative lookbehind: same as `(?! ... )`, but it checks the text before the current position instead of the text after it.
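
The two patterns agree on the exercise data but not on every input. The sketch below compares them on two extra, hypothetical identifiers (`XRERO123` and `REROX123`); the names `lookahead`, `lookbehind` and `results` are chosen for this illustration.

```python
import re

lookahead = re.compile(r'^(?!RERO)[A-Z]+(\d+)$')
lookbehind = re.compile(r'^[A-Z]+(?<!RERO)(\d+)$')

results = {}
for recid in ['ABN123', 'RERO123', 'XRERO123', 'REROX123']:
    results[recid] = (lookahead.search(recid) is not None,
                      lookbehind.search(recid) is not None)
    print(recid, results[recid])

# ABN123 (True, True)
# RERO123 (False, False)
# XRERO123 (True, False)   letters end with RERO but do not start with it
# REROX123 (False, True)   letters start with RERO but do not end with it
```

The lookahead rejects prefixes that start with `RERO`, while the lookbehind rejects prefixes that end with `RERO` right before the digits. Python's `re` module requires the lookbehind pattern to have a fixed length, which is the case for `RERO`.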
