<h3>Special Characters in Regular Expressions</h3>
The power of regular expressions comes from the special characters we add to the search string, which let us control precisely which portions of the input string match the regular expression. In this notebook, we will look at some of the most commonly used special characters.

![RegExChars.png](attachment:RegExChars.png)

In [1]:
import re

`\A` matches only at the very start of the string. Note that the special character must be placed at the start of the regex.

The caret (`^`) behaves like `\A` by default.

In addition, there is a keyword argument called `flags`. When it is set to `re.MULTILINE` (or `re.M`), `^` also matches at the start of every line in the input string. `\A` is not affected by this flag.

In [2]:
string = "this is the first line\nthis is the second line\nthis is the third line this"

In [5]:
print(r'With \A ', re.findall(r'\Athis', string))

With \A  ['this']


In [6]:
print('With ^ but no MULTILINE', re.findall('^this', string))

With ^ but no MULTILINE ['this']


In [8]:
print('With ^ with MULTILINE', re.findall('^this', string, flags=re.MULTILINE))

With ^ with MULTILINE ['this', 'this', 'this']
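
It is worth checking that `re.MULTILINE` changes only `^`, while `\A` stays anchored to the start of the whole string. The following is a minimal sketch using the same input; the variable names `text`, `with_backslash_a` and `with_caret` are chosen here for illustration only.

```python
import re

text = "this is the first line\nthis is the second line\nthis is the third line this"

# \A ignores re.MULTILINE and matches only at the start of the whole string.
with_backslash_a = re.findall(r'\Athis', text, flags=re.MULTILINE)

# ^ honours re.MULTILINE and matches at the start of every line.
with_caret = re.findall(r'^this', text, flags=re.MULTILINE)

print(with_backslash_a)  # ['this']
print(with_caret)        # ['this', 'this', 'this']
```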


<h3>More about the raw string prefix</h3>

Recall that the raw string prefix `r` tells Python not to process backslash escape sequences in a string literal. We saw that the prefix makes no difference for `\t` and `\n` in a regex. The reason is that the regex engine gives these two sequences the same meaning that Python does. Without the prefix, Python converts `\n` to a newline before the regex is compiled. With the prefix, the regex engine receives the two characters `\` and `n` and interprets them as a newline itself. The special character `\b` is different: in a regular Python string it is the backspace character, while in a regex it specifies a word boundary. Hence, when using the regex special character `\b`, the pattern needs to be written with the raw string prefix `r`.
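
A quick way to see the difference is to compare the plain and raw versions of these literals. This is only a sketch; the sample strings `'a\nb'` and `'rain aint'` and the variable names are made up for illustration.

```python
import re

# Without the r prefix, Python turns '\b' into a single backspace character.
print(len('\b'), len(r'\b'))   # 1 2

# '\n' and r'\n' are different strings, but both match a newline in a regex.
newline_plain = re.findall('\n', 'a\nb')
newline_raw = re.findall(r'\n', 'a\nb')
print(newline_plain == newline_raw)   # True

# Only the raw version of \b behaves as a word boundary.
boundary_plain = re.findall('\bain', 'rain aint')   # looks for backspace + 'ain'
boundary_raw = re.findall(r'\bain', 'rain aint')    # looks for 'ain' at a word start
print(boundary_plain, boundary_raw)   # [] ['ain']
```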

Note the use of the raw string prefix in the cells below. `\b` matches at a word boundary, so the specified characters match only when they are at the start or the end of a word.

In [1]:
import re
string = 'there aint no rain in spain'

The code below returns a match if the specified characters are at the __start__ of a word.

In [2]:
# The r prefix makes \b a word boundary rather than a backspace.
# Do not add a space after \b, or the regex will also look for a space.
print("At the start: ", re.findall(r'\bain', string))

At the start:  ['ain']


The code below returns a match if the specified characters are at the __end__ of a word.

In [3]:
print("At the end: ", re.findall(r'ain\b', string))


At the end:  ['ain', 'ain']
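
Placing `\b` on both sides of the characters restricts the match to a whole word. A minimal sketch with the same input string follows; the names `anywhere` and `whole_word` are illustrative.

```python
import re

string = 'there aint no rain in spain'

# 'in' anywhere: inside aint, rain, in, and spain.
anywhere = re.findall(r'in', string)

# \b on both sides: only the standalone word 'in'.
whole_word = re.findall(r'\bin\b', string)

print(len(anywhere), whole_word)   # 4 ['in']
```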


`\B` is the opposite of `\b`: it matches only at positions that are NOT a word boundary. With `\B`, the specified characters match only when they are NOT at the start (or NOT at the end) of a word.

In [4]:
import re
string = 'there aint no rain in spain'

The code below returns a match if the specified characters are NOT at the start of a word

In [17]:
print("Not at the start: ", re.findall(r'\Bain', string))

Not at the start:  ['ain', 'ain']


The code below returns a match if the specified characters are NOT at the end of a word

In [None]:
print("Not at the end: ", re.findall(r'ain\B', string))
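
Because `\b` and `\B` are complementary, every occurrence of the characters is found by exactly one of them. The sketch below checks this for the start-of-word case; the variable names are illustrative.

```python
import re

string = 'there aint no rain in spain'

all_matches = re.findall(r'ain', string)          # in aint, rain, spain
at_word_start = re.findall(r'\bain', string)      # in aint
not_at_word_start = re.findall(r'\Bain', string)  # in rain, spain

print(len(all_matches), len(at_word_start), len(not_at_word_start))   # 3 1 2
```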

`\d` matches any single digit. Used with `findall()`, it returns all digits in a string.

In [6]:
string1 = "This string has zero digits but has more than fifteen characters."
string2 = "This string has 3 digits and more than 15 characters."

Using `\d` together with the `findall()` method and an input string with no digits

In [7]:
print(re.findall(r'\d', string1))

[]


Using `\d` together with the `findall()` method and an input string containing digits

In [8]:
print(re.findall(r'\d', string2))

['3', '1', '5']


Using `\d` together with the `search()` method and an input string containing digits

In [21]:
print(re.search(r'\d', string2))

<re.Match object; span=(16, 17), match='3'>
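
Unlike `findall()`, `search()` returns a single match object for the first match, or `None` when nothing matches. The sketch below shows one way to read the match object and guard against `None`; the names `match` and `no_match` are illustrative.

```python
import re

string1 = "This string has zero digits but has more than fifteen characters."
string2 = "This string has 3 digits and more than 15 characters."

# search() stops at the first digit it finds.
match = re.search(r'\d', string2)
if match is not None:
    print(match.group(), match.span())   # 3 (16, 17)

# search() returns None when the string contains no digit.
no_match = re.search(r'\d', string1)
print(no_match)   # None
```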


`\D` is the opposite of `\d`: it matches any character that is not a digit. Used with `findall()`, it returns all non-digits in a string.

In [5]:
string1 = "This string has zero digits but has more than fifteen characters."
string2 = "This string has 3 digits and more than 15 characters."

Using `\D` together with the `findall()` method and an input string with no digits

In [6]:
print('With no digits in the input string', re.findall(r'\D', string1))
print('\nlength of string1', len(string1), 'length of the matches', len(re.findall(r'\D', string1)))

With no digits in the input string ['T', 'h', 'i', 's', ' ', 's', 't', 'r', 'i', 'n', 'g', ' ', 'h', 'a', 's', ' ', 'z', 'e', 'r', 'o', ' ', 'd', 'i', 'g', 'i', 't', 's', ' ', 'b', 'u', 't', ' ', 'h', 'a', 's', ' ', 'm', 'o', 'r', 'e', ' ', 't', 'h', 'a', 'n', ' ', 'f', 'i', 'f', 't', 'e', 'e', 'n', ' ', 'c', 'h', 'a', 'r', 'a', 'c', 't', 'e', 'r', 's', '.']

length of string1 65 length of the matches 65


Using `\D` together with the `findall()` method and an input string containing digits

In [None]:
print(re.findall(r'\D', string2))
print('\nlength of string2', len(string2), 'length of the matches', len(re.findall(r'\D', string2)))
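
Since every character is either a digit or a non-digit, the `\d` and `\D` matches together account for the whole string. The following sketch checks this and also joins the `\D` matches to strip the digits; the variable names are illustrative.

```python
import re

string2 = "This string has 3 digits and more than 15 characters."

digits = re.findall(r'\d', string2)
non_digits = re.findall(r'\D', string2)

# The two match lists cover every character exactly once.
print(len(digits) + len(non_digits) == len(string2))   # True

# Joining the non-digit matches rebuilds the string without its digits.
without_digits = ''.join(non_digits)
print(without_digits)
```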

`\s` matches any whitespace character. Used with `findall()`, it returns all occurrences of whitespace in a string. Space, tab, and newline are whitespace characters.

In [9]:
string1 = "This string has\tzero digits\nbut has more than fifteen characters."
string2 = "Thisstringhaszerodigitsbuthasmorethanfifteencharacters."

Using `\s` together with the `findall()` method and an input string with whitespace characters

In [10]:
print(re.findall(r'\s', string1))
print('\nlength of string1', len(string1), 'length of matches', len(re.findall(r'\s', string1)))

[' ', ' ', '\t', ' ', '\n', ' ', ' ', ' ', ' ', ' ']

length of string1 65 length of matches 10


Using `\s` together with the `findall()` method and an input string with no whitespace characters

In [11]:
print(re.findall(r'\s', string2))
print('\nlength of string2', len(string2), 'length of matches', len(re.findall(r'\s', string2)))

[]

length of string2 55 length of matches 0
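
Besides space, tab and newline, `\s` also matches the less common whitespace characters carriage return, form feed and vertical tab. A small sketch, with a made-up `sample` string:

```python
import re

# One of each: space, tab, newline, carriage return, form feed, vertical tab.
sample = ' \t\n\r\f\v'

matches = re.findall(r'\s', sample)
print(matches)   # [' ', '\t', '\n', '\r', '\x0c', '\x0b']
```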


`\S` is the opposite of `\s`: it matches any character that is not whitespace. Used with `findall()`, it returns all non-whitespace characters in a string.

In [7]:
string1 = "This string has\tzero digits\nbut has more than fifteen characters."
string2 = "Thisstringhaszerodigitsbuthasmorethanfifteencharacters."

Using `\S` together with the `findall()` method and an input string with whitespace characters

In [8]:
print(re.findall(r'\S', string1))
print('\nlength of string1', len(string1), 'length of matches', len(re.findall(r'\S', string1)))

['T', 'h', 'i', 's', 's', 't', 'r', 'i', 'n', 'g', 'h', 'a', 's', 'z', 'e', 'r', 'o', 'd', 'i', 'g', 'i', 't', 's', 'b', 'u', 't', 'h', 'a', 's', 'm', 'o', 'r', 'e', 't', 'h', 'a', 'n', 'f', 'i', 'f', 't', 'e', 'e', 'n', 'c', 'h', 'a', 'r', 'a', 'c', 't', 'e', 'r', 's', '.']

length of string1 65 length of matches 55


Using `\S` together with the `findall()` method and an input string with no whitespace characters

In [9]:
print(re.findall(r'\S', string2))
print('\nlength of string2', len(string2), 'length of the matches', len(re.findall(r'\S', string2)))

['T', 'h', 'i', 's', 's', 't', 'r', 'i', 'n', 'g', 'h', 'a', 's', 'z', 'e', 'r', 'o', 'd', 'i', 'g', 'i', 't', 's', 'b', 'u', 't', 'h', 'a', 's', 'm', 'o', 'r', 'e', 't', 'h', 'a', 'n', 'f', 'i', 'f', 't', 'e', 'e', 'n', 'c', 'h', 'a', 'r', 'a', 'c', 't', 'e', 'r', 's', '.']

length of string2 55 length of the matches 55
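
The two outputs above are identical because `string2` is simply `string1` with its whitespace removed. The sketch below confirms this by joining the `\S` matches; the name `stripped` is illustrative.

```python
import re

string1 = "This string has\tzero digits\nbut has more than fifteen characters."
string2 = "Thisstringhaszerodigitsbuthasmorethanfifteencharacters."

# Joining the non-whitespace matches removes every space, tab and newline.
stripped = ''.join(re.findall(r'\S', string1))
print(stripped == string2)   # True
```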


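`\w` matches any word character. Used with `findall()`, it returns all word characters in the string. With `flags=re.ASCII`, word characters are the set `a to z, A to Z, 0 - 9 and the underscore character`. By default, for text (`str`) patterns in Python 3, `\w` also matches letters and digits from other alphabets.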

In [10]:
string = "there are 58 characters in the first line, and zero digits"
print("All word characters: ", re.findall(r'\w', string))
print("Length of string, number of word characters: ", len(string), len(re.findall(r'\w', string)))

All word characters:  ['t', 'h', 'e', 'r', 'e', 'a', 'r', 'e', '5', '8', 'c', 'h', 'a', 'r', 'a', 'c', 't', 'e', 'r', 's', 'i', 'n', 't', 'h', 'e', 'f', 'i', 'r', 's', 't', 'l', 'i', 'n', 'e', 'a', 'n', 'd', 'z', 'e', 'r', 'o', 'd', 'i', 'g', 'i', 't', 's']
Length of string, number of word characters:  58 47
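
One point to be aware of: for ordinary text (`str`) patterns in Python 3, `\w` is Unicode-aware by default, so it also matches accented letters. Passing `flags=re.ASCII` restricts it to `a-z`, `A-Z`, `0-9` and the underscore. A minimal sketch, using a made-up `sample` string:

```python
import re

# 'caf\u00e9_1' is the text café_1, with the accented letter written as an escape.
sample = 'caf\u00e9_1'

unicode_words = re.findall(r'\w', sample)                # default behaviour
ascii_words = re.findall(r'\w', sample, flags=re.ASCII)  # ASCII-only word characters

print(unicode_words)   # ['c', 'a', 'f', 'é', '_', '1']
print(ascii_words)     # ['c', 'a', 'f', '_', '1']
```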


`\W` is the opposite of `\w`.  
It finds and returns all non-word characters.

In [15]:
string = "there are 58 characters in the first line, and zero digits"
print("All non-word characters: ", re.findall(r'\W', string))
print("Length of string, number of non-word characters: ", len(string), len(re.findall(r'\W', string)))

All non-word characters:  [' ', ' ', ' ', ' ', ' ', ' ', ' ', ',', ' ', ' ', ' ']
Length of string, number of non-word characters:  58 11


`\Z` matches only at the very end of the string. Note that the special character must be placed at the end of the regex. The dollar sign (`$`) behaves similarly, and with `flags=re.MULTILINE` it also matches at the end of every line.

In [16]:
# $ matches at the end of the string, and at the end of every line with re.MULTILINE.
string = "this is the first line\nthis is line the second line\nthis is the third line"
print(r'With \Z ', re.findall(r'line\Z', string))
print('With $ and no MULTILINE ', re.findall('line$', string))
print('With $ and MULTILINE', re.findall('line$', string, flags=re.MULTILINE))

With \Z  ['line']
With $ and no MULTILINE  ['line']
With $ and MULTILINE ['line', 'line', 'line']
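
`\Z` and `$` give the same result above, but they differ when the string ends with a newline: `$` also matches just before a trailing newline, while `\Z` matches only at the very end. The sketch below uses a made-up `text` value to show this.

```python
import re

# The input ends with a newline character.
text = "the last line\n"

dollar = re.findall(r'line$', text)        # $ matches just before the final newline
backslash_z = re.findall(r'line\Z', text)  # \Z needs 'line' to be the last characters

print(dollar)        # ['line']
print(backslash_z)   # []
```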
