### Character Classes
* A character class (also known as a character set) matches a single character, provided it is one of the characters listed in the set.

* To define a character class, we should use the opening square bracket metacharacter [, then any accepted characters, and finally close with a closing square bracket ].

### Example 1
Consider the example below, where a text mixes the spellings license and licence and we want to find all occurrences of either one.

In [29]:
import re

from colorama import Back, Style


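def highlight_regex_matches(pattern, text, print_output=True):
	"""Highlight every match of a compiled pattern in text using ANSI colours."""
	output = text
	len_inc = 0  # number of characters added to output by earlier highlights
	for match in pattern.finditer(text):
		# Shift the match offsets by the length of the escape codes inserted so far.
		start, end = match.start() + len_inc, match.end() + len_inc
		output = output[:start] + Back.YELLOW + Style.BRIGHT + output[start:end] + Style.RESET_ALL + output[end:]
		len_inc = len(output) - len(text)

	# Print the highlighted text, or return it when print_output is False.
	if print_output:
		print(output)
	else:
		return output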

In [30]:
txt = """
Yesterday, I was driving my car without a driving licence. The traffic police stopped me and asked me for my 
license. I told them that I forgot my licence at home. 
"""


In [4]:
pattern = re.compile('licen[cs]e')

In [5]:
pattern.findall(txt)

['licence', 'license', 'licence']

In [6]:
highlight_regex_matches(pattern,txt)


Yesterday, I was driving my car without a driving [43m[1mlicence[0m. The traffic police stopped me and asked me for my 
[43m[1mlicense[0m. I told them that I forgot my [43m[1mlicence[0m at home. 



### Character Set Range
A character set can also contain a range of characters. A range is written by placing a hyphen (-) between its two endpoints; for example, [a-z] matches any lowercase letter and [0-9] matches any single digit.

Let us consider an example in which we want to retrieve all the years from the given text.

In [7]:
txt = """
The first season of Indian Premiere League (IPL) was played in 2008. 
The second season was played in 2009 in South Africa. 
Last season was played in 2018 and won by Chennai Super Kings (CSK).
CSK won the title in 2010 and 2011 as well.
Mumbai Indians (MI) has also won the title 3 times in 2013, 2015 and 2017.
"""

In [8]:
pattern = re.compile("[1-9][0-9][0-9][0-9]")

In [9]:
pattern.findall(txt)

['2008', '2009', '2018', '2010', '2011', '2013', '2015', '2017']

In [10]:
highlight_regex_matches(pattern,txt)


The first season of Indian Premiere League (IPL) was played in [43m[1m2008[0m. 
The second season was played in [43m[1m2009[0m in South Africa. 
Last season was played in [43m[1m2018[0m and won by Chennai Super Kings (CSK).
CSK won the title in [43m[1m2010[0m and [43m[1m2011[0m as well.
Mumbai Indians (MI) has also won the title 3 times in [43m[1m2013[0m, [43m[1m2015[0m and [43m[1m2017[0m.
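
Ranges can also be combined inside a single set. The following is a minimal sketch of that idea; the sample strings and the variable names `hex_digit` and `digit_or_hyphen` are made up for illustration.

```python
import re

# Several ranges can share one pair of brackets.
hex_digit = re.compile("[0-9a-fA-F]")
print(hex_digit.findall("0x1F, 0xZZ"))      # -> ['0', '1', 'F', '0']

# A hyphen placed last (or first) in the set is a literal hyphen, not a range.
digit_or_hyphen = re.compile("[0-9-]")
print(digit_or_hyphen.findall("2008-09"))   # -> ['2', '0', '0', '8', '-', '0', '9']
```

Note that `x` and `Z` are skipped in the first call because neither falls inside any of the three ranges.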



There is another possibility: the negation of ranges. We can invert the meaning of a character set by placing a caret (^) symbol right after the opening square bracket metacharacter ([).

For example, to find all the characters used in a text except lowercase vowels, we can use the pattern:

In [11]:
pattern = re.compile('[^aeiou]')

In [12]:
pattern.findall(txt)

['\n',
 'T',
 'h',
 ' ',
 'f',
 'r',
 's',
 't',
 ' ',
 's',
 's',
 'n',
 ' ',
 'f',
 ' ',
 'I',
 'n',
 'd',
 'n',
 ' ',
 'P',
 'r',
 'm',
 'r',
 ' ',
 'L',
 'g',
 ' ',
 '(',
 'I',
 'P',
 'L',
 ')',
 ' ',
 'w',
 's',
 ' ',
 'p',
 'l',
 'y',
 'd',
 ' ',
 'n',
 ' ',
 '2',
 '0',
 '0',
 '8',
 '.',
 ' ',
 '\n',
 'T',
 'h',
 ' ',
 's',
 'c',
 'n',
 'd',
 ' ',
 's',
 's',
 'n',
 ' ',
 'w',
 's',
 ' ',
 'p',
 'l',
 'y',
 'd',
 ' ',
 'n',
 ' ',
 '2',
 '0',
 '0',
 '9',
 ' ',
 'n',
 ' ',
 'S',
 't',
 'h',
 ' ',
 'A',
 'f',
 'r',
 'c',
 '.',
 ' ',
 '\n',
 'L',
 's',
 't',
 ' ',
 's',
 's',
 'n',
 ' ',
 'w',
 's',
 ' ',
 'p',
 'l',
 'y',
 'd',
 ' ',
 'n',
 ' ',
 '2',
 '0',
 '1',
 '8',
 ' ',
 'n',
 'd',
 ' ',
 'w',
 'n',
 ' ',
 'b',
 'y',
 ' ',
 'C',
 'h',
 'n',
 'n',
 ' ',
 'S',
 'p',
 'r',
 ' ',
 'K',
 'n',
 'g',
 's',
 ' ',
 '(',
 'C',
 'S',
 'K',
 ')',
 '.',
 '\n',
 'C',
 'S',
 'K',
 ' ',
 'w',
 'n',
 ' ',
 't',
 'h',
 ' ',
 't',
 't',
 'l',
 ' ',
 'n',
 ' ',
 '2',
 '0',
 '1',
 '0',
 ' ',
 'n',


In [13]:
print(''.join(pattern.findall(txt)))


Th frst ssn f Indn Prmr Lg (IPL) ws plyd n 2008. 
Th scnd ssn ws plyd n 2009 n Sth Afrc. 
Lst ssn ws plyd n 2018 nd wn by Chnn Spr Kngs (CSK).
CSK wn th ttl n 2010 nd 2011 s wll.
Mmb Indns (MI) hs ls wn th ttl 3 tms n 2013, 2015 nd 2017.
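
The capital letters I and A survive in the output because the set lists only lowercase vowels. Below is a small sketch, on a made-up `sample` string, of how listing both cases changes the result, and of how a caret behaves when it is not the first character of the set.

```python
import re

sample = "Indian Premier League"

# The lowercase-only set keeps the capital 'I' because it is not listed.
print("".join(re.findall("[^aeiou]", sample)))       # -> Indn Prmr Lg

# Listing both cases removes every vowel.
print("".join(re.findall("[^aeiouAEIOU]", sample)))  # -> ndn Prmr Lg

# A caret that is not the first character is an ordinary member of the set.
print(re.findall("[a^]", "a^b"))                     # -> ['a', '^']
```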



### Predefined Character Classes
There exist some predefined character classes which can be used as a shortcut for some frequently used classes.

| Element | Description |
|---------|-------------|
| `.` | Matches any character except a newline |
| `\d` | Matches any decimal digit; for ASCII text this is equivalent to the class `[0-9]` |
| `\D` | Matches any non-digit character; for ASCII text this is equivalent to the class `[^0-9]` |
| `\s` | Matches any whitespace character; for ASCII text this is equivalent to the class `[ \t\n\r\f\v]` |
| `\S` | Matches any non-whitespace character; for ASCII text this is equivalent to the class `[^ \t\n\r\f\v]` |
| `\w` | Matches any alphanumeric character or underscore; for ASCII text this is equivalent to the class `[a-zA-Z0-9_]` |
| `\W` | Matches any character that is not alphanumeric or an underscore; for ASCII text this is equivalent to the class `[^a-zA-Z0-9_]` |
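
In Python 3, these shortcuts are Unicode-aware when the pattern is a `str`, so the bracketed equivalents hold exactly only under the `re.ASCII` flag. The sketch below illustrates this with a made-up `sample` string containing a non-ASCII digit, and also shows how `re.DOTALL` changes the behaviour of the dot.

```python
import re

# "\u0663" is ARABIC-INDIC DIGIT THREE, a decimal digit outside ASCII.
sample = "7 \u0663 x"

print(re.findall(r"\d", sample) == ["7", "\u0663"])   # -> True (Unicode-aware by default)
print(re.findall(r"[0-9]", sample))                   # -> ['7']
print(re.findall(r"\d", sample, re.ASCII))            # -> ['7']

# The dot skips newlines unless re.DOTALL is given.
print(re.findall(".", "a\nb"))                        # -> ['a', 'b']
print(re.findall(".", "a\nb", re.DOTALL))             # -> ['a', '\n', 'b']
```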

In [14]:
pattern = re.compile(r"[1-9]\d\d\d")

In [15]:
pattern.findall(txt)

['2008', '2009', '2018', '2010', '2011', '2013', '2015', '2017']

In [24]:
pattern = re.compile(r"[^\w\s]")
pattern.findall(txt)

['(', ')', '.', '.', '(', ')', '.', '.', '(', ')', ',', '.']

### Alternation
Just like character classes are used to match a single character out of several possible characters, alternation is used to match a single regular expression out of several possible regular expressions.

This is accomplished using the pipe symbol |.

Consider a scenario where you want to find all occurrences of and, or, the in a given text.

One way is to write and execute 3 separate regular expressions. Using alternation, it can be done in a single regular expression!

In [25]:
txt = """
the most common conjunctions are and, or and but.
"""

In [26]:
pattern = re.compile("and|or|the")

In [27]:
pattern.findall(txt)

['the', 'and', 'or', 'and']

In [32]:
highlight_regex_matches(pattern,txt)


[43m[1mthe[0m most common conjunctions are [43m[1mand[0m, [43m[1mor[0m [43m[1mand[0m but.
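
A plain alternation such as and|or|the matches those letters wherever they appear, including inside longer words. If only whole words are wanted, one possible refinement is to add word boundaries (`\b`) around a group. The following is a sketch on a made-up `sample` sentence; the names `loose` and `strict` are illustrative only.

```python
import re

sample = "I told them that I forgot the map and the torch."

# Plain alternation also matches inside "them", "forgot" and "torch".
loose = re.compile("and|or|the")
print(loose.findall(sample))    # -> ['the', 'or', 'the', 'and', 'the', 'or']

# \b anchors the alternatives at word boundaries;
# (?:...) groups them without creating a capturing group.
strict = re.compile(r"\b(?:and|or|the)\b")
print(strict.findall(sample))   # -> ['the', 'and', 'the']
```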



Consider one more example, in which we want to find the substrings what is and who is.



In [35]:
txt = """
what is your name?
who is that guy?
"""

In [36]:
pattern = re.compile('what is|who is')

pattern.findall(txt)

['what is', 'who is']

In [37]:
highlight_regex_matches(pattern,txt)


[43m[1mwhat is[0m your name?
[43m[1mwho is[0m that guy?



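Note that the shorter pattern what|who is would not give this result: it matches either what or who is, because alternation has lower precedence than concatenation and so applies to everything on each side of the pipe.

To limit the alternation to part of the pattern, we need to wrap the alternatives in parentheses.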

In [42]:
pattern = re.compile("(what|who) is")

In [43]:
pattern.findall(txt)

['what', 'who']
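
The result contains only what and who because, when a pattern has a capturing group, `findall` returns the text of the group rather than the whole match. A minimal sketch comparing the three spellings of the pattern on a made-up `sample` string:

```python
import re

sample = "what is your name?\nwho is that guy?"

# Without grouping, the alternatives are "what" and "who is".
print(re.findall("what|who is", sample))      # -> ['what', 'who is']

# With a capturing group, findall returns only the group's text.
print(re.findall("(what|who) is", sample))    # -> ['what', 'who']

# A non-capturing group (?:...) limits the alternation but returns the full match.
print(re.findall("(?:what|who) is", sample))  # -> ['what is', 'who is']
```

The highlighting helper is unaffected by this difference, since it works with match positions from `finditer` rather than with group contents.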

In [44]:
highlight_regex_matches(pattern,txt)


[43m[1mwhat is[0m your name?
[43m[1mwho is[0m that guy?

