# Character Classes

- **Character classes** (also known as **character sets**) let us define a single position in the pattern that matches any one of the characters listed in the set.


- To define a character class, write the opening square bracket metacharacter `[`, then the accepted characters, and finally the closing square bracket `]`.

### Example 1

Consider the example below, where the spellings `license` and `licence` have been mixed up and we want to find all occurrences of `license`/`licence` in the text.

In [1]:
import re
from utils import highlight_regex_matches
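The `utils` module imported above is not part of the standard library, and its source is not shown here. As a rough sketch, assuming the helper simply wraps every match in ANSI colour codes, something like the following would give similar terminal output. The name `highlight_matches_sketch` and its behaviour are assumptions for illustration, not the actual `utils` implementation.

```python
import re

# Hypothetical stand-in for utils.highlight_regex_matches (the real helper is not shown).
GREEN_BG_BOLD = "\033[42m\033[1m"  # ANSI codes: green background + bold
RESET = "\033[0m"                  # ANSI code: reset all attributes


def highlight_matches_sketch(pattern, text):
    """Return text with every match of the compiled pattern wrapped in ANSI codes."""
    return pattern.sub(lambda m: GREEN_BG_BOLD + m.group(0) + RESET, text)


demo = highlight_matches_sketch(re.compile("licen[cs]e"), "a licence or a license")
print(demo)
```

Printing the returned string in a terminal that supports ANSI codes shows each match on a green background.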

In [2]:
txt = """
Yesterday, I was driving my car without a driving licence. The traffic police stopped me and asked me for my 
license. I told them that I forgot my licence at home. 
"""

In [3]:
pattern = re.compile("licen[cs]e")

In [4]:
print(pattern)

re.compile('licen[cs]e')


In [5]:
pattern.findall(txt)

['licence', 'license', 'licence']

In [6]:
highlight_regex_matches(pattern, txt)


Yesterday, I was driving my car without a driving **licence**. The traffic police stopped me and asked me for my 
**license**. I told them that I forgot my **licence** at home. 



![](images/example2.png)
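One detail worth checking in Example 1 is that `[cs]` stands for exactly one character. The short sketch below (the test words are made up for illustration) shows that the pattern rejects a word with no `c`/`s` at that position, as well as a word with two such characters in a row.

```python
import re

pattern = re.compile("licen[cs]e")

# [cs] consumes exactly one character, so neither of these spellings matches.
no_middle = pattern.findall("licene")      # nothing from the set at that position
two_middle = pattern.findall("licencse")   # two characters where only one is allowed

# fullmatch shows which whole words the pattern accepts.
accepted = [w for w in ["licence", "license", "licenze"] if pattern.fullmatch(w)]

print(no_middle, two_middle, accepted)
```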

# Character Set Range

> A character set can also contain a range of characters. A range is written with a hyphen (`-`) between its first and last character; for example, `[a-z]` matches any lowercase letter, and `[0-9]` matches any single digit.

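Some example character sets and ranges:

- `[0123456789]`: any single digit, written out in full
- `[0-9]`: the same set, written as a range
- `[a-z]`: any lowercase letter
- `[A-Z]`: any uppercase letter
- `[a-zA-Z0-9]`: any letter or digit
- `[a-e]` is equivalent to `[abcde]`

Let us consider an example in which we want to retrieve all the years from the given text.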

In [7]:
txt = """
The first season of Indian Premiere League (IPL) was played in 2008. 
The second season was played in 2009 in South Africa. 
Last season was played in 2018 and won by Chennai Super Kings (CSK).
CSK won the title in 2010 and 2011 as well.
Mumbai Indians (MI) has also won the title 3 times in 2013, 2015 and 2017.
"""

In [8]:
pattern = re.compile("[1-9][0-9][0-9][0-9]")

In [9]:
pattern.findall(txt)

['2008', '2009', '2018', '2010', '2011', '2013', '2015', '2017']

In [10]:
highlight_regex_matches(pattern, txt)


The first season of Indian Premiere League (IPL) was played in **2008**. 
The second season was played in **2009** in South Africa. 
Last season was played in **2018** and won by Chennai Super Kings (CSK).
CSK won the title in **2010** and **2011** as well.
Mumbai Indians (MI) has also won the title 3 times in **2013**, **2015** and **2017**.
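The year pattern above relies on two properties of ranges that are easy to verify. The sketch below (sample strings are made up for illustration) checks that a range is just shorthand for listing its characters, that the leading `[1-9]` rejects a group starting with `0`, and that the pattern only looks at four consecutive digits, so it can also match inside a longer number.

```python
import re

letters = "abcdefgh"

# A range is shorthand for listing every character between its endpoints.
range_hits = re.findall("[a-e]", letters)
listed_hits = re.findall("[abcde]", letters)

year = re.compile("[1-9][0-9][0-9][0-9]")

# The leading [1-9] rejects a four-digit group that starts with 0.
leading_zero = year.findall("0999 and 2008")

# Only four consecutive digits are checked, so longer numbers match partially.
inside_longer = year.findall("order 12345")

print(range_hits == listed_hits, leading_zero, inside_longer)
```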



> Character sets can also be negated. We can invert the meaning of a character set by placing a caret (`^`) symbol right after the opening square bracket metacharacter (`[`).

For example, to find all the characters used in a text except lowercase vowels, we can use the pattern:

In [11]:
# [^A-Z0-9] matches any character except uppercase letters and digits

In [12]:
pattern = re.compile("[^aeiou]")

In [13]:
pattern.findall(txt)

['\n',
 'T',
 'h',
 ' ',
 'f',
 'r',
 's',
 't',
 ' ',
 's',
 's',
 'n',
 ' ',
 'f',
 ' ',
 'I',
 'n',
 'd',
 'n',
 ' ',
 'P',
 'r',
 'm',
 'r',
 ' ',
 'L',
 'g',
 ' ',
 '(',
 'I',
 'P',
 'L',
 ')',
 ' ',
 'w',
 's',
 ' ',
 'p',
 'l',
 'y',
 'd',
 ' ',
 'n',
 ' ',
 '2',
 '0',
 '0',
 '8',
 '.',
 ' ',
 '\n',
 'T',
 'h',
 ' ',
 's',
 'c',
 'n',
 'd',
 ' ',
 's',
 's',
 'n',
 ' ',
 'w',
 's',
 ' ',
 'p',
 'l',
 'y',
 'd',
 ' ',
 'n',
 ' ',
 '2',
 '0',
 '0',
 '9',
 ' ',
 'n',
 ' ',
 'S',
 't',
 'h',
 ' ',
 'A',
 'f',
 'r',
 'c',
 '.',
 ' ',
 '\n',
 'L',
 's',
 't',
 ' ',
 's',
 's',
 'n',
 ' ',
 'w',
 's',
 ' ',
 'p',
 'l',
 'y',
 'd',
 ' ',
 'n',
 ' ',
 '2',
 '0',
 '1',
 '8',
 ' ',
 'n',
 'd',
 ' ',
 'w',
 'n',
 ' ',
 'b',
 'y',
 ' ',
 'C',
 'h',
 'n',
 'n',
 ' ',
 'S',
 'p',
 'r',
 ' ',
 'K',
 'n',
 'g',
 's',
 ' ',
 '(',
 'C',
 'S',
 'K',
 ')',
 '.',
 '\n',
 'C',
 'S',
 'K',
 ' ',
 'w',
 'n',
 ' ',
 't',
 'h',
 ' ',
 't',
 't',
 'l',
 ' ',
 'n',
 ' ',
 '2',
 '0',
 '1',
 '0',
 ' ',
 'n',


In [14]:
print("".join(pattern.findall(txt)))


Th frst ssn f Indn Prmr Lg (IPL) ws plyd n 2008. 
Th scnd ssn ws plyd n 2009 n Sth Afrc. 
Lst ssn ws plyd n 2018 nd wn by Chnn Spr Kngs (CSK).
CSK wn th ttl n 2010 nd 2011 s wll.
Mmb Indns (MI) hs ls wn th ttl 3 tms n 2013, 2015 nd 2017.
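Note that the output above still contains `I` and `A`, because `[^aeiou]` excludes only the lowercase vowels. A minimal sketch of two related points, using a made-up sample string: listing both cases inside the negated set removes uppercase vowels too, and a caret that is not the first character of the set is treated as a literal `^`.

```python
import re

word = "Indian Premier"

# Only lowercase vowels are excluded, so the capital I survives.
lower_only = "".join(re.findall("[^aeiou]", word))

# Listing both cases excludes uppercase vowels as well.
both_cases = "".join(re.findall("[^aeiouAEIOU]", word))

# A caret that is not first in the set is an ordinary character.
literal_caret = re.findall("[a^]", "a^b")

print(lower_only, "|", both_cases, "|", literal_caret)
```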



# Predefined Character Classes

There exist some predefined character classes which can be used as a shortcut for some frequently used classes.


<table style="border: 1px solid black; font-size:15px;">
<thead>
<tr>
    <th>Element</th>
    <th>Description</th>
</tr>
</thead>
    
<tbody>
<tr>
    <td>.</td>
    <td>This element matches any character except newline</td>
</tr>

<tr>
    <td>\d</td>
    <td>This matches any decimal digit; for ASCII text this is equivalent to the class [0-9]</td>
</tr>

<tr>
    <td>\D</td>
    <td>This matches any non-digit character; this is equivalent to the class [^0-9]</td>
</tr>

<tr>
    <td>\s</td>
    <td>This matches any whitespace character; this is equivalent to the class
[ \t\n\r\f\v]</td>
</tr>

<tr>
    <td>\S</td>
    <td>This matches any non-whitespace character; this is equivalent to the class
[^ \t\n\r\f\v]</td>
</tr>

<tr>
    <td>\w</td>
    <td>This matches any alphanumeric character or underscore; for ASCII text this is equivalent to the class
[a-zA-Z0-9_]</td>
</tr>
    
<tr>
    <td>\W</td>
    <td>This matches any character that is neither alphanumeric nor an underscore; for ASCII text this is equivalent to the
class [^a-zA-Z0-9_]</td>
</tr>
</tbody>
</table>
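The equivalences in the table hold exactly for ASCII input. In Python 3, `\d` and `\w` in `str` patterns also match non-ASCII digits and letters unless the `re.ASCII` flag is passed. The sketch below uses a made-up sample string containing one Arabic-Indic digit to show the difference.

```python
import re

sample = "a1_\u0663 \t"  # \u0663 is ARABIC-INDIC DIGIT THREE

# By default, \d in a str pattern matches any Unicode decimal digit.
unicode_digits = re.findall(r"\d", sample)

# With re.ASCII, \d behaves exactly like [0-9].
ascii_digits = re.findall(r"\d", sample, flags=re.ASCII)
explicit_digits = re.findall(r"[0-9]", sample)

# The same holds for \w versus [a-zA-Z0-9_].
ascii_word = re.findall(r"\w", sample, flags=re.ASCII)
explicit_word = re.findall(r"[a-zA-Z0-9_]", sample)

print(unicode_digits, ascii_digits, ascii_word)
```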


Now we can shorten our pattern for finding years in a given text by using `\d`:

In [15]:
pattern = re.compile(r"[1-9]\d\d\d")  # raw string, so \d is not treated as a string escape

In [16]:
pattern.findall(txt)

['2008', '2009', '2018', '2010', '2011', '2013', '2015', '2017']

Let us try to find out all special symbols (non-alphanumeric, non-whitespace characters) in our text now.

In [17]:
re.findall(r"[^\w\s]", txt)

['(', ')', '.', '.', '(', ')', '.', '.', '(', ')', ',', '.']

In [18]:
re.findall(r"[\W]", txt)

['\n',
 ' ',
 ' ',
 ' ',
 ' ',
 ' ',
 ' ',
 ' ',
 '(',
 ')',
 ' ',
 ' ',
 ' ',
 ' ',
 '.',
 ' ',
 '\n',
 ' ',
 ' ',
 ' ',
 ' ',
 ' ',
 ' ',
 ' ',
 ' ',
 ' ',
 '.',
 ' ',
 '\n',
 ' ',
 ' ',
 ' ',
 ' ',
 ' ',
 ' ',
 ' ',
 ' ',
 ' ',
 ' ',
 ' ',
 ' ',
 '(',
 ')',
 '.',
 '\n',
 ' ',
 ' ',
 ' ',
 ' ',
 ' ',
 ' ',
 ' ',
 ' ',
 ' ',
 '.',
 '\n',
 ' ',
 ' ',
 '(',
 ')',
 ' ',
 ' ',
 ' ',
 ' ',
 ' ',
 ' ',
 ' ',
 ' ',
 ' ',
 ',',
 ' ',
 ' ',
 ' ',
 '.',
 '\n']
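Comparing the two outputs above, `\W` on its own also returns spaces and newlines, while `[^\w\s]` filters them out. A small sketch on a made-up sample string makes the relationship explicit: the symbols are the `\W` matches with the whitespace removed.

```python
import re

sample = "Hi (there), ok."

non_word = re.findall(r"\W", sample)      # includes the spaces
symbols = re.findall(r"[^\w\s]", sample)  # neither word characters nor whitespace

# Dropping whitespace from the \W matches gives the same list.
filtered = [ch for ch in non_word if not ch.isspace()]

print(non_word, symbols)
```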

In [None]:
# Just practice from Made Easy
#Character sets can match a set of characters

In [1]:
import re

In [2]:
re.search('abcd',"abcdefnc abcd" ) # earlier code

<re.Match object; span=(0, 4), match='abcd'>

In [3]:
re.search(r'\w\w\w\w', "abcdefnc abcd")  # \w matches alphanumeric characters
                                         # (letters, digits and underscore)

<re.Match object; span=(0, 4), match='abcd'>

In [4]:
#\w matches alpha numeric characters [a-zA-Z0-9_]
re.search(r'\w\w\w\w',"ab_cdefnc abcd" ) #matches _ character


<re.Match object; span=(0, 4), match='ab_c'>

In [5]:
re.search(r'\w\w\w', "a3.!-!")  # no match (returns None): \w matches letters,
                                # digits and _, not other symbols

In [6]:
re.search(r'\w\w\w', "a33-_!").group()

'a33'

In [7]:
# \W is the opposite of \w: it matches anything not in [a-zA-Z0-9_]

re.search(r'\w\w\W', "a3.-_!")  # \W matches the '.' here

<re.Match object; span=(0, 3), match='a3.'>

In [8]:
re.search(r'\w\w\W', "a3 .-_!")   # \W matches a space as well

# We will go over other character sets later on

<re.Match object; span=(0, 3), match='a3 '>

<pre>quantifiers
'+'     = 1 or more
'?'     = 0 or 1
'*'     = 0 or more
'{n,m}' = n to m repetitions; {,3} = up to 3, {3,} = 3 or more</pre>
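As a quick illustration of the table above, here is a minimal sketch with one made-up input per quantifier, including the open-ended forms `{,3}` and `{3,}`.

```python
import re

one_or_more = re.search(r"\d+", "ab123cd").group()        # + : one or more digits
zero_or_one = re.findall(r"colou?r", "color colour")      # ? : the 'u' is optional
zero_or_more = re.search(r"ab*", "a").group()             # * : zero b's still matches

# {,3} allows zero to three repetitions; four a's cannot be matched in full.
fits_three = re.fullmatch(r"a{,3}", "aaa") is not None
fits_four = re.fullmatch(r"a{,3}", "aaaa") is not None

# {3,} requires at least three repetitions.
at_least_three = re.findall(r"a{3,}", "aa aaa aaaaa")

print(one_or_more, zero_or_one, zero_or_more, fits_three, fits_four, at_least_three)
```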



In [10]:
re.search(r'\w\w',"abcdefnc abcd" )

<re.Match object; span=(0, 2), match='ab'>

In [11]:
re.search(r'\w+',"abcdefnc abcd" ).group()  # + is useful when we don't know the number of letters

'abcdefnc'

In [12]:
re.search(r'\w+\W+\w+',"abcdefnc abcd").group()

'abcdefnc abcd'

In [13]:
re.search(r'\w+\W+\w+',"abcdefnc       abcd").group()  #added spaces

'abcdefnc       abcd'

In [14]:
re.search(r'\w+\W?\w+',"abcdefnabcd").group()  # ? = 0 or 1 instances

'abcdefnabcd'

In [15]:
re.search(r'\w+\W?\w+',"abcde fnabcd").group()

'abcde fnabcd'

In [16]:
re.search(r'\w+\W+\w+', "abcdefnabcd")  # no match (returns None): \W+ needs at least one non-word character

In [17]:
#Pulling out specific amounts
re.search(r'\w{3}', 'aaaaaaaaaaa')   # exactly 3 \w characters

<re.Match object; span=(0, 3), match='aaa'>

In [18]:
re.search(r'\w{1,4}', 'aaaaaaaaaaa').group()   #1 is min, 4 is max

'aaaa'

In [19]:
 
re.search(r'\w{1,10}\W{0,4}\w+', "abcdefnc abcd").group()  # 1-10 \w characters,
                                                            # 0-4 \W characters,
                                                            # 1+ \w characters

'abcdefnc abcd'

In [20]:
re.search(r'\w{1,}\W{0,}\w+', "abcdefnc abcd").group()  # {1,} = at least 1
                                                         # {0,} = at least 0

'abcdefnc abcd'
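The open-ended forms used in the last cell are the long spellings of the shorthand quantifiers: `{1,}` behaves like `+`, `{0,}` like `*`, and `{0,1}` like `?`. A small sketch, with a made-up input string, that compares each pair:

```python
import re

text = "abcdefnc   abcd"

# Each tuple pairs a long-form quantifier with its shorthand.
pairs = [
    (r"\w{1,}", r"\w+"),
    (r"\w+\W{0,}", r"\w+\W*"),
    (r"\w+\W{0,1}", r"\w+\W?"),
]

results = []
for long_form, short_form in pairs:
    long_match = re.search(long_form, text).group()
    short_match = re.search(short_form, text).group()
    results.append((long_match, short_match))

print(results)
```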

# Other types of character sets

<pre>
'\d' = matches digits [0-9]
'\D' = matches any non-digit character (the opposite of \d)
'\s' = matches any whitespace character (new lines, tabs, spaces, etc.)
'\S' = matches any non-whitespace character (the opposite of \s)
</pre>

In [21]:
string = '23abced++'
re.search(r'\d+', string).group()

'23'

In [22]:
string = '23abced++'
re.search(r'\S+', string).group()  # the string has no whitespace, so everything matches

'23abced++'
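To see all four shorthand classes side by side, the sketch below runs them over one made-up string that mixes digits, letters, symbols and different kinds of whitespace.

```python
import re

string = "23abced++ 7\tx\n"

digits = re.findall(r"\d+", string)      # runs of digits
non_digits = re.findall(r"\D+", string)  # runs of everything else
whitespace = re.findall(r"\s", string)   # space, tab and newline
chunks = re.findall(r"\S+", string)      # runs separated by whitespace

print(digits, non_digits, whitespace, chunks)
```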

In [23]:
string = '''Robots are branching out. A new prototype soft robot takes inspiration from plants by growing to explore its environment.

Vines and some fungi extend from their tips to explore their surroundings. 
Elliot Hawkes of the University of California in Santa Barbara 
and his colleagues designed a bot that works 
on similar principles. Its mechanical body 
sits inside a plastic tube reel that extends 
through pressurized inflation, a method that some 
invertebrates like peanut worms (Sipunculus nudus)
also use to extend their appendages. The plastic 
tubing has two compartments, and inflating one 
side or the other changes the extension direction. 
A camera sensor at the tip alerts the bot when it’s 
about to run into something.

In the lab, Hawkes and his colleagues 
programmed the robot to form 3-D structures such 
as a radio antenna, turn off a valve, navigate a maze, 
swim through glue, act as a fire extinguisher, squeeze 
through tight gaps, shimmy through fly paper and slither 
across a bed of nails. The soft bot can extend up to 
72 meters, and unlike plants, it can grow at a speed of 
10 meters per second, the team reports July 19 in Science Robotics. 
The design could serve as a model for building robots 
that can traverse constrained environments

This isn’t the first robot to take 
inspiration from plants. One plantlike 
predecessor was a robot modeled on roots.'''

In [24]:
re.search('.+', string).group()  # . does not match a newline, so the match stops at the end of the first line

'Robots are branching out. A new prototype soft robot takes inspiration from plants by growing to explore its environment.'

In [25]:
re.search('.+', string, flags = re.DOTALL).group()

'Robots are branching out. A new prototype soft robot takes inspiration from plants by growing to explore its environment.\n\nVines and some fungi extend from their tips to explore their surroundings. \nElliot Hawkes of the University of California in Santa Barbara \nand his colleagues designed a bot that works \non similar principles. Its mechanical body \nsits inside a plastic tube reel that extends \nthrough pressurized inflation, a method that some \ninvertebrates like peanut worms (Sipunculus nudus)\nalso use to extend their appendages. The plastic \ntubing has two compartments, and inflating one \nside or the other changes the extension direction. \nA camera sensor at the tip alerts the bot when it’s \nabout to run into something.\n\nIn the lab, Hawkes and his colleagues \nprogrammed the robot to form 3-D structures such \nas a radio antenna, turn off a valve, navigate a maze, \nswim through glue, act as a fire extinguisher, squeeze \nthrough tight gaps, shimmy through fly paper 
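The same behaviour is easier to inspect on a short string. The sketch below uses a made-up two-line input and also shows the inline form `(?s)`, which switches on the same dot-matches-newline mode from inside the pattern.

```python
import re

text = "first line\nsecond line"

# By default, . stops at the newline.
default = re.search(r".+", text).group()

# With re.DOTALL, . also matches the newline, so the match spans both lines.
dotall = re.search(r".+", text, flags=re.DOTALL).group()

# (?s) at the start of the pattern is the inline equivalent of re.DOTALL.
inline = re.search(r"(?s).+", text).group()

print(repr(default), repr(dotall))
```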

# Creating your own character sets

In [None]:
# [A-Z]: '-' is a metacharacter when used inside [] (custom character sets); it denotes a range

In [26]:
string = 'Hello, There, How, Are, You'

In [27]:
re.findall('[A-Z]', string)  #pulls out all capital letters

['H', 'T', 'H', 'A', 'Y']

In [28]:
re.findall('[A-Z,]', string)  #here we search for any capital letters
                                #or a comma

['H', ',', 'T', ',', 'H', ',', 'A', ',', 'Y']

In [29]:
string = 'Hello, There, How, Are, You...'
re.findall('[A-Z,.]', string)

['H', ',', 'T', ',', 'H', ',', 'A', ',', 'Y', '.', '.', '.']

In [30]:
string = 'Hello, There, How, Are, You...'
re.findall(r'[A-Za-z,\s.]', string)

['H',
 'e',
 'l',
 'l',
 'o',
 ',',
 ' ',
 'T',
 'h',
 'e',
 'r',
 'e',
 ',',
 ' ',
 'H',
 'o',
 'w',
 ',',
 ' ',
 'A',
 'r',
 'e',
 ',',
 ' ',
 'Y',
 'o',
 'u',
 '.',
 '.',
 '.']
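Two details of custom sets are worth checking. Since `-` is a metacharacter inside `[]`, a literal hyphen has to be placed at the edge of the set or escaped, and `.` loses its "any character" meaning inside a set, which is why `[A-Z,.]` above matched only real dots. A minimal sketch with made-up inputs:

```python
import re

# A hyphen at the end of the set is literal; a-c is still a range.
hyphen_last = re.findall(r"[a-c-]", "a-d")

# An escaped hyphen is literal anywhere in the set (here: a, -, or c).
hyphen_escaped = re.findall(r"[a\-c]", "a-bc")

# Inside a set, . matches only a literal dot.
dot_in_set = re.findall(r"[.]", "a.b")

# Outside a set, . matches any character except a newline.
dot_outside = re.findall(r".", "a.b")

print(hyphen_last, hyphen_escaped, dot_in_set, dot_outside)
```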