# RE | re


# Patterns: Special Characters

Many descriptions of regular expressions start with all the details of how to define
them. I think that’s a mistake. Regular expressions are a not-so-little language in their
own right, with too many details to fit in your head at once. They use so much punctuation that they look like cartoon characters swearing.
With these functions (match(), search(), findall(), and sub()) under your belt,
let’s get into the details of building them. The patterns you make apply to any of these
functions.
You’ve seen the basics:

• Literal matches with any nonspecial characters 

• Any single character except \n with .

• Any number of the preceding character (including zero) with *

• Optional (zero or one) of the preceding character with ?
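The four basics above can be tried out in a few lines. This is a minimal sketch; the sample string `text` and the variable names are made up for illustration and do not come from the original text.

```python
import re

# Hypothetical sample text, chosen so each pattern gives a different result.
text = "color colour colouur clr"

# Literal match: nonspecial characters match themselves.
literal = re.findall(r"colour", text)
print(literal)        # ['colour']

# . matches any single character except a newline.
any_char = re.findall(r"c.l", text)
print(any_char)       # ['col', 'col', 'col']

# * allows zero or more of the preceding character.
zero_or_more = re.findall(r"colou*r", text)
print(zero_or_more)   # ['color', 'colour', 'colouur']

# ? makes the preceding character optional (zero or one).
optional = re.findall(r"colou?r", text)
print(optional)       # ['color', 'colour']
```

Note that `clr` never matches `c.l`, because `.` must consume exactly one character between the `c` and the `l`.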

First, special characters are shown in Table 12-2.

## Table 12-2. Special characters

Pattern     Matches

\d      A single digit  => equivalent to [0-9] for ASCII text

\D      A single nondigit

\w      A word character (letter, digit, or underscore)  => equivalent to [a-zA-Z0-9_] for ASCII text

\W      A nonword character (anything \w does not match)

\s      A whitespace character

\S      A nonwhitespace character

\b      A word boundary (between a \w and a \W, in either order)

\B      A nonword boundary
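As a quick illustration of Table 12-2, the sketch below applies the character classes to one short string and the boundary escapes to another. Both strings and the helper `starts()` are assumptions made up for this example.

```python
import re

# Hypothetical sample; the \t inside this normal string is a real tab.
sample = "Room 42\tis open"

digits = re.findall(r"\d", sample)            # ['4', '2']
non_digit_runs = re.findall(r"\D+", sample)   # ['Room ', '\tis open']
whitespace = re.findall(r"\s", sample)        # [' ', '\t', ' ']
words = re.findall(r"\w+", sample)            # ['Room', '42', 'is', 'open']
print(digits, non_digit_runs, whitespace, words)

# \b and \B match positions, not characters, so report where each match starts.
text = "cat concat catalog"

def starts(pattern):
    """Return the start index of every match of pattern in text."""
    return [m.start() for m in re.finditer(pattern, text)]

print(starts(r"\bcat\b"))   # [0]      'cat' as a whole word
print(starts(r"\bcat"))     # [0, 11]  'cat' at the start of a word
print(starts(r"\Bcat"))     # [7]      'cat' not at the start of a word
print(starts(r"cat\B"))     # [11]     'cat' not at the end of a word
```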

-------------------------------------------

The Python string module has predefined string constants that we can use for testing.

Let’s use printable, which contains 100 printable ASCII characters, including
letters in both cases, digits, space characters, and punctuation:

In [7]:
import string
import re

printable = string.printable
print(len(printable))

# repr() makes the trailing whitespace characters visible
print(repr(printable))
print(repr(printable[0:50]))
print(repr(printable[50:]))

100
'0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~ \t\n\r\x0b\x0c'
'0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMN'
'OPQRSTUVWXYZ!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~ \t\n\r\x0b\x0c'

In [8]:
print(string.ascii_letters)
# print(string.capwords())
print(string.hexdigits)
print(string.octdigits)
print(string.punctuation)
print(string.whitespace)

abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ
0123456789abcdefABCDEF
01234567
!"#$%&'()*+,-./:;<=>?@[\]^_`{|}~
 	



# Which characters in printable are digits?

In [9]:
import re
re.findall('\d', printable)


['0', '1', '2', '3', '4', '5', '6', '7', '8', '9']

In [17]:
import re
re.findall(r'\d+', printable)

['0123456789']

In [18]:
import re
re.findall(r'\d', printable)

['0', '1', '2', '3', '4', '5', '6', '7', '8', '9']

# Which characters are digits, letters, or an underscore?

In [10]:
re.findall('\w', printable)


['0',
 '1',
 '2',
 '3',
 '4',
 '5',
 '6',
 '7',
 '8',
 '9',
 'a',
 'b',
 'c',
 'd',
 'e',
 'f',
 'g',
 'h',
 'i',
 'j',
 'k',
 'l',
 'm',
 'n',
 'o',
 'p',
 'q',
 'r',
 's',
 't',
 'u',
 'v',
 'w',
 'x',
 'y',
 'z',
 'A',
 'B',
 'C',
 'D',
 'E',
 'F',
 'G',
 'H',
 'I',
 'J',
 'K',
 'L',
 'M',
 'N',
 'O',
 'P',
 'Q',
 'R',
 'S',
 'T',
 'U',
 'V',
 'W',
 'X',
 'Y',
 'Z',
 '_']

# Syntax warning?!

<>:1: SyntaxWarning: invalid escape sequence '\w'
<>:1: SyntaxWarning: invalid escape sequence '\w'
C:\Users\illegible\AppData\Local\Temp\ipykernel_6368\1077115770.py:1: SyntaxWarning: invalid escape sequence '\w'
  re.findall('\w', printable)


The messages above are warnings, not errors: the call still returns the right result. In an ordinary (non-raw) string literal, Python treats a backslash as the start of an escape sequence such as \n or \t. \w is not an escape that Python recognizes, so Python keeps the backslash and the w as they are (which is why the regex still works) and warns that the escape sequence is invalid. The meaning of \w as "any word character" belongs to the regex engine, not to Python string literals.

To silence the warning, write the pattern as a raw string so that Python passes the backslash through to re unchanged. For the string named printable, that looks like this:

In [11]:
import re

# Assuming 'printable' is defined somewhere above
pattern = r'\w+'  # Raw string for the regex pattern
matches = re.findall(pattern, printable)

print(matches)

['0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ', '_']


In [19]:
import re

# Assuming 'printable' is defined somewhere above
pattern = r'\w'  # Raw string for the regex pattern
matches = re.findall(pattern, printable)

print(matches)

['0', '1', '2', '3', '4', '5', '6', '7', '8', '9', 'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z', 'A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J', 'K', 'L', 'M', 'N', 'O', 'P', 'Q', 'R', 'S', 'T', 'U', 'V', 'W', 'X', 'Y', 'Z', '_']


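In these corrected versions, the pattern is written as a raw string (r'\w+' or r'\w') instead of '\w'. The r before the opening quote tells Python to treat backslashes as literal characters rather than as the start of an escape sequence, so \w reaches the regex engine unchanged and no warning is issued. The + in r'\w+' is a separate change: it matches runs of one or more word characters, which is why the first result groups the digits and letters into one string and returns '_' on its own (the punctuation between Z and _ is not matched by \w).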

#  * A little BREAK *

#  RAW STRING?

### What is a raw string in Python?

A raw string in Python is a string prefixed with an `r` or `R`, which tells Python to interpret the string literally, without processing any escape sequences contained within it. Escape sequences are special character combinations that start with a backslash (`\`) and have a specific meaning, such as `\n` for a newline, `\t` for a tab, etc. When you use a raw string, Python treats the backslash as a normal character, not as the beginning of an escape sequence.

Here's a simple example to illustrate the difference:

```python
# Without a raw string
normal_string = "This is a normal string\nwith a newline."
print(normal_string)  # Prints: This is a normal string
                        #         with a newline.

# With a raw string
raw_string = r"This is a raw string\nwith a newline."
print(raw_string)  # Prints: This is a raw string\nwith a newline.
```

In the first example, because we didn't use a raw string, Python interprets `\n` as a newline character, so when printed, the string appears on two lines. In the second example, because we used a raw string (`r` prefix), Python treats `\n` as two separate characters: a backslash followed by an 'n'. Therefore, the entire string, including `\n`, is printed on one line.

Raw strings are particularly useful in regular expressions (regex) patterns where backslashes are commonly used. By using a raw string, you avoid having to double each backslash, making the pattern easier to read and write. For example:

```python
import re

# Using a raw string for a regex pattern
pattern = r"\d+"  # Matches one or more digits

# Without a raw string, you would need to escape the backslashes
# pattern = "\\d+"
```

In summary, raw strings allow you to work with strings that contain many backslashes without Python interpreting them as escape sequences, simplifying the handling of certain types of data like file paths or regular expression patterns.
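In some cases the raw prefix changes the result itself, not just the readability. The sketch below shows the `\b` case, where a normal string silently turns the word-boundary escape into a backspace character. The sample string `text` and the variable names are assumptions for this example.

```python
import re

# '\n' is one character (a newline); r'\n' is two (a backslash and an 'n').
lengths = (len('\n'), len(r'\n'))
print(lengths)                      # (1, 2)

# '\\d+' and r'\d+' spell exactly the same pattern.
same_pattern = ('\\d+' == r'\d+')
print(same_pattern)                 # True

# Pitfall: in a normal string, '\b' is the backspace character (\x08),
# so the regex engine never sees the word-boundary escape and no warning appears.
text = 'a dish of fish'
without_raw = re.findall('\bfish', text)
with_raw = re.findall(r'\bfish', text)
print(without_raw)                  # []
print(with_raw)                     # ['fish']
```

Because `\b` is a valid Python escape, this mistake produces no SyntaxWarning, which makes it harder to spot than the `\w` case.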

-------------------------------------------------------------------------

# Which are spaces?


In [14]:
re.findall('\s', printable)


[' ', '\t', '\n', '\r', '\x0b', '\x0c']

In [15]:
re.findall(r'\s', printable)    

[' ', '\t', '\n', '\r', '\x0b', '\x0c']

In [20]:
re.findall(r'\s+', printable) 

[' \t\n\r\x0b\x0c']

In [21]:
re.findall(r'\s-', printable) 

[]

In [23]:
re.findall(r'\S-', printable) 

[',-']

In [24]:
re.findall(r'\S', printable) 

['0',
 '1',
 '2',
 '3',
 '4',
 '5',
 '6',
 '7',
 '8',
 '9',
 'a',
 'b',
 'c',
 'd',
 'e',
 'f',
 'g',
 'h',
 'i',
 'j',
 'k',
 'l',
 'm',
 'n',
 'o',
 'p',
 'q',
 'r',
 's',
 't',
 'u',
 'v',
 'w',
 'x',
 'y',
 'z',
 'A',
 'B',
 'C',
 'D',
 'E',
 'F',
 'G',
 'H',
 'I',
 'J',
 'K',
 'L',
 'M',
 'N',
 'O',
 'P',
 'Q',
 'R',
 'S',
 'T',
 'U',
 'V',
 'W',
 'X',
 'Y',
 'Z',
 '!',
 '"',
 '#',
 '$',
 '%',
 '&',
 "'",
 '(',
 ')',
 '*',
 '+',
 ',',
 '-',
 '.',
 '/',
 ':',
 ';',
 '<',
 '=',
 '>',
 '?',
 '@',
 '[',
 '\\',
 ']',
 '^',
 '_',
 '`',
 '{',
 '|',
 '}',
 '~']

In [25]:
re.findall(r'\S+', printable) 

['0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~']

In order, the whitespace characters that \s matched above were: plain old space, tab, newline, carriage return, vertical tab, and form feed.

Regular expressions are not confined to ASCII. A \d will match whatever Unicode calls a digit, not just ASCII characters '0' through '9'. The same goes for \w, so let’s add two non-ASCII lowercase letters from FileFormat.info. In this test, we’ll throw in the following:
• Three ASCII letters
• Three punctuation symbols that should not match a \w
• A Unicode LATIN SMALL LETTER E WITH CIRCUMFLEX (\u00ea)
• A Unicode LATIN SMALL LETTER E WITH BREVE (\u0115)

In [31]:
x = 'abc' + '-/*' + '\u00ea' + '\u0115'
print('\u00ea')
print('\u0115')
re.findall(r'\w+', x)   # \w+ groups each run of word characters

ê
ĕ


['abc', 'êĕ']
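The claim that \d matches any Unicode digit can be checked the same way. The sketch below uses two ARABIC-INDIC digits as an assumed test case and the `re.ASCII` flag, which is not used in the text above, to restrict \d and \w to ASCII.

```python
import re

# Assumed sample: ASCII '1' and '2', then ARABIC-INDIC DIGIT ONE and TWO.
mixed = '12' + '\u0661\u0662'

unicode_digits = re.findall(r'\d', mixed)
ascii_digits = re.findall(r'\d', mixed, flags=re.ASCII)
print(len(unicode_digits))   # 4: all four characters count as digits
print(ascii_digits)          # ['1', '2']

# The same flag limits \w to [a-zA-Z0-9_].
x = 'abc' + '-/*' + '\u00ea' + '\u0115'
ascii_word_chars = re.findall(r'\w', x, flags=re.ASCII)
print(ascii_word_chars)      # ['a', 'b', 'c']
```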

# Patterns: Using Specifiers

## Table 12-3. Pattern specifiers


Pattern     Matches

abc      =>                    Literal abc

( expr )    =>                  expr

expr1 | expr2     =>               expr1 or expr2

.               =>             Any character except \n

^                 =>           Start of source string

$                   =>         End of source string

prev ?                =>       Zero or one prev

prev *            =>           Zero or more prev, as many as possible

prev *?         =>             Zero or more prev, as few as possible

prev +          =>             One or more prev, as many as possible

prev +?         =>             One or more prev, as few as possible

prev { m }        =>           m consecutive prev

prev { m, n }       =>         m to n consecutive prev, as many as possible

prev { m, n }?        =>       m to n consecutive prev, as few as possible

[ abc ]            =>          a or b or c (same as a|b|c)

[^ abc ]         =>            not (a or b or c)

prev (?= next )     =>         prev if followed by next

prev (?! next )    =>          prev if not followed by next

(?<= prev ) next      =>       next if preceded by prev

(?<! prev ) next     =>        next if not preceded by prev
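Several rows of Table 12-3 are easier to follow on small inputs. This is a minimal sketch; the strings `html` and `prices` and all variable names are made up for illustration.

```python
import re

html = "<b>bold</b> and <i>italic</i>"

# prev + takes as much as possible; prev +? takes as little as possible.
greedy = re.findall(r"<.+>", html)
lazy = re.findall(r"<.+?>", html)
print(greedy)         # ['<b>bold</b> and <i>italic</i>']
print(lazy)           # ['<b>', '</b>', '<i>', '</i>']

prices = "cost: 15 USD, 200 EUR, 7 USD"

# expr1 | expr2
either = re.findall(r"USD|EUR", prices)            # ['USD', 'EUR', 'USD']

# prev {m, n}: two or three consecutive digits
two_or_three = re.findall(r"\d{2,3}", prices)      # ['15', '200']

# [ ... ]: a run of characters from the class A-Z
codes = re.findall(r"[A-Z]+", prices)              # ['USD', 'EUR', 'USD']

# prev (?= next): digits only if followed by ' USD'
usd = re.findall(r"\d+(?= USD)", prices)           # ['15', '7']

# prev (?! next): whole numbers not followed by ' USD'
# (\b keeps the engine from backtracking to a partial number)
not_usd = re.findall(r"\b\d+\b(?! USD)", prices)   # ['200']

# (?<= prev ) next: digits only if preceded by 'cost: '
after_cost = re.findall(r"(?<=cost: )\d+", prices) # ['15']

print(either, two_or_three, codes, usd, not_usd, after_cost)
```

The lookahead and lookbehind parts check the surrounding text without including it in the match, which is why only the digits are returned.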

Your eyes might cross permanently when trying to read these examples. First, let’s
define our source string:

In [32]:
source = '''I wish I may, I wish I might
Have a dish of fish tonight.'''

Now we apply different regular expression pattern strings to try to match something
in the source string.

In [33]:
# First, find wish anywhere:
re.findall('wish', source)

['wish', 'wish']

In [35]:
# Next, find wish or fish anywhere:

re.findall('wish|fish', source)

['wish', 'wish', 'fish']

In [38]:
# Find wish at the beginning:
print(re.findall('^wish', source))
# compare the two outputs: findall returns an empty list, match returns None
print(re.match('wish', source))

[]
None


In [40]:
# Find I wish at the beginning:
print(re.findall('^I wish', source))
# compare the two outputs: findall returns a list of strings, match returns a Match object
print(re.match('I wish', source))

['I wish']
<re.Match object; span=(0, 6), match='I wish'>
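The anchors ^ and $ can be explored a little further on the same two-line rhyme. The sketch below redefines `source` so it runs on its own, and it uses `re.search()` and the `re.MULTILINE` flag; the flag is an addition not covered in the text above.

```python
import re

source = '''I wish I may, I wish I might
Have a dish of fish tonight.'''

# re.match() only tries position 0; re.search() scans forward for the first hit.
at_start = re.match('wish', source)
first_wish = re.search('wish', source).span()
print(at_start)      # None
print(first_wish)    # (2, 6)

# $ anchors the pattern at the end of the source string.
at_end = re.findall(r'tonight\.$', source)
print(at_end)        # ['tonight.']

# By default ^ matches only at the very start of the string.
# With re.MULTILINE it also matches right after each newline.
first_words = re.findall(r'^\w+', source)
line_starts = re.findall(r'^\w+', source, flags=re.MULTILINE)
print(first_words)   # ['I']
print(line_starts)   # ['I', 'Have']
```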
