### Regular Expressions
###### Regular expressions are text-matching patterns described with a formal syntax. You'll often hear regular expressions referred to as 'regex' or 'regexp' in conversation. Regular expressions can include a variety of rules, from finding repetition, to text-matching, and much more. As you advance in Python you'll see that a lot of your parsing problems can be solved with regular expressions (they're also a common interview question!).

###### If you're familiar with Perl, you'll notice that the syntax for regular expressions is very similar in Python. We will be using the re module with Python for this lecture.

###### Let's get started!

###### Searching for Patterns in Text
###### One of the most common uses for the re module is for finding patterns in text. Let's do a quick example of using the search method in the re module to find some text:

In [4]:
import re

patterns = ['term1','term2']

text = "hello everybody whats hapnin, i have term1 and term2"

for pattern in patterns:
    print("searching for '%s' in: \n '%s' " %(pattern, text))
    
    if re.search(pattern,text):
        print("match was found")
    else:
        print("no match was found")
    


searching for 'term1' in: 
 'hello everybody whats hapnin, i have term1 and term2' 
match was found
searching for 'term2' in: 
 'hello everybody whats hapnin, i have term1 and term2' 
match was found


In [5]:
for pattern in patterns:
    print(f"search for {pattern} in text: '{text}'")
    
    if re.search(pattern,text):
        print('match found')
    else:
        print('match not found')

search for term1 in text: 'hello everybody whats hapnin, i have term1 and term2'
match found
search for term2 in text: 'hello everybody whats hapnin, i have term1 and term2'
match found


In [7]:
match = re.search(patterns[0],text)

In [8]:
type(match)

_sre.SRE_Match

In [8]:
# starting index of the match within text
match.start()

37

In [9]:
# ending index of the match (one past its last character)
match.end()

42
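
###### The match object carries more than the two indices. The following is a minimal sketch, reusing the text and pattern from above, that shows span() and group() (both standard methods on match objects) and what search returns when nothing is found:

```python
import re

text = "hello everybody whats hapnin, i have term1 and term2"
match = re.search("term1", text)

# re.search returns None when nothing matches, so check before using the result
if match:
    start, end = match.span()        # same values as match.start() and match.end()
    matched_text = match.group()     # the substring that matched
    print(start, end, matched_text)  # 37 42 term1
    print(text[start:end] == matched_text)  # True

# "term3" is a made-up pattern that does not occur in the text
no_match = re.search("term3", text)
print(no_match)  # None
```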

In [8]:
# now we will see how we can split with regular expressions
split_term = "@"

text = "my email address is abraham@gmail.com"

re.split(split_term,text)

['my email address is abraham', 'gmail.com']

In [9]:
# for a fixed separator, re.split behaves like the built-in str.split method
text.split(' ') 
text.split(split_term)

['my email address is abraham', 'gmail.com']
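
###### Where re.split goes beyond str.split is when the separator is itself a pattern. A small sketch, assuming a made-up record string with mixed separators:

```python
import re

# Hypothetical input: items separated by commas, semicolons, or spaces
record = "apple, banana;cherry  date"

# The pattern matches one or more commas, semicolons, or spaces in a row,
# so every run of separators is treated as a single split point
parts = re.split("[,; ]+", record)
print(parts)  # ['apple', 'banana', 'cherry', 'date']
```

###### str.split accepts only one fixed separator string, so it cannot do this in a single call.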

In [12]:
# Finding all instances of a pattern
# re.findall() returns a list of every non-overlapping match of a pattern in a string. For example:
re.findall('match','test phrase match is in middle of the match')

['match', 'match']
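
###### findall only gives back the matched strings. If the positions matter too, re.finditer yields a match object for each hit. A short sketch using the same phrase:

```python
import re

phrase = 'test phrase match is in middle of the match'

# finditer yields one match object per hit instead of a plain string
spans = [(m.start(), m.end()) for m in re.finditer('match', phrase)]
print(spans)

# Each (start, end) pair slices back to the matched text
for start, end in spans:
    print(phrase[start:end])
```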

In [15]:
def multi_refind(pattern,text):
    print (f"find the pattern of {pattern} in the {text} ")
    
    for x in pattern:
        print(re.findall(x,text)) 
        

In [16]:
# A plain string is itself iterable, so the loop treats each character as its own pattern
multi_refind('match' , 'match is a match')

find the pattern of match in the match is a match 
['m', 'm']
['a', 'a', 'a']
['t', 't']
['c', 'c']
['h', 'h']


In [20]:
def multi_re_find(patterns,phrase):
    '''
    Takes in a list of regex patterns
    Prints a list of all matches
    '''
    for pattern in patterns:
        print('Searching the phrase using the re check: %s' %(pattern))
        print(re.findall(pattern,phrase))
        print('\n')

In [21]:
multi_re_find(['hello', 'wats', 'up'], 'hello bharat wats up myk')

Searching the phrase using the re check: hello
['hello']


Searching the phrase using the re check: wats
['wats']


Searching the phrase using the re check: up
['up']




### Repetition Syntax
##### There are five ways to express repetition in a pattern:

##### A pattern followed by the meta-character * is repeated zero or more times.
##### Replace the * with + and the pattern must appear at least once.
##### Using ? means the pattern appears zero or one time.
##### For a specific number of occurrences, use {m} after the pattern, where m is replaced with the number of times the pattern should repeat.
##### Use {m,n} where m is the minimum number of repetitions and n is the maximum. Leaving out n, as in {m,}, means the pattern appears at least m times, with no maximum.
##### Now we will see an example of each of these using our multi_re_find function:

In [22]:
test_phrase = 'sdsd..sssddd...sdddsddd...dsds...dsssss...sdddd'

test_patterns = [ 'sd*',     # s followed by zero or more d's
                'sd+',          # s followed by one or more d's
                'sd?',          # s followed by zero or one d's
                'sd{3}',        # s followed by three d's
                'sd{2,3}',      # s followed by two to three d's
                ]

multi_re_find(test_patterns,test_phrase)

Searching the phrase using the re check: sd*
['sd', 'sd', 's', 's', 'sddd', 'sddd', 'sddd', 'sd', 's', 's', 's', 's', 's', 's', 'sdddd']


Searching the phrase using the re check: sd+
['sd', 'sd', 'sddd', 'sddd', 'sddd', 'sd', 'sdddd']


Searching the phrase using the re check: sd?
['sd', 'sd', 's', 's', 'sd', 'sd', 'sd', 'sd', 's', 's', 's', 's', 's', 's', 'sd']


Searching the phrase using the re check: sd{3}
['sddd', 'sddd', 'sddd', 'sddd']


Searching the phrase using the re check: sd{2,3}
['sddd', 'sddd', 'sddd', 'sddd']
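
###### The open-ended form {m,} is not covered by the patterns above. Here is a minimal sketch that compares it with {m,n}, using re.fullmatch so that the whole candidate string has to fit the pattern (the candidates list is made up for illustration):

```python
import re

candidates = ['s', 'sd', 'sdd', 'sddd', 'sdddd']

# 'sd{2,}' means: s followed by at least two d's, with no upper limit
open_ended = [c for c in candidates if re.fullmatch('sd{2,}', c)]
print(open_ended)  # ['sdd', 'sddd', 'sdddd']

# 'sd{2,3}' caps the repetition at three d's
bounded = [c for c in candidates if re.fullmatch('sd{2,3}', c)]
print(bounded)     # ['sdd', 'sddd']
```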




### Character Sets
###### Character sets are used when you wish to match any one of a group of characters at a point in the input.
###### Brackets are used to construct character set inputs. 
###### For example: the input [ab] searches for occurrences of either a or b. Let's see some examples:

In [25]:
test_phrase = 'sdsd..sssddd...sdddsddd...dsds...dsssss...sdddd'

# [sd]: either s or d , s[sd]+: s followed by one or more s or d
test_patterns = ['[sd]', 's[sd]+']

multi_re_find(test_patterns,test_phrase)

# It makes sense that the first input [sd] returns every instance of s or d. Also, the second 
# input s[sd]+ returns any full strings that begin with an s and continue with s or d characters until another character is reached.

Searching the phrase using the re check: [sd]
['s', 'd', 's', 'd', 's', 's', 's', 'd', 'd', 'd', 's', 'd', 'd', 'd', 's', 'd', 'd', 'd', 'd', 's', 'd', 's', 'd', 's', 's', 's', 's', 's', 's', 'd', 'd', 'd', 'd']


Searching the phrase using the re check: s[sd]+
['sdsd', 'sssddd', 'sdddsddd', 'sds', 'sssss', 'sdddd']
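
###### A character set always stands for exactly one character at its position. A small sketch with made-up strings to make that concrete:

```python
import re

# [ae] accepts either spelling at the third position, but nothing else
spellings = re.findall('gr[ae]y', 'gray grey griy')
print(spellings)  # ['gray', 'grey']

# On its own, a set picks out single characters one at a time
vowels = re.findall('[aeiou]', 'regular expressions')
print(vowels)     # ['e', 'u', 'a', 'e', 'e', 'i', 'o']
```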




### Exclusion
###### We can use ^ to exclude characters by placing it at the start of the bracket syntax.
###### For example: [^...] will match any single character not in the brackets. Let's see some examples:
###### Use [^!.? ] to match any character that is not a !, ., ?, or space.
###### Add a + to match one or more such characters in a row. This basically translates into finding the words.

In [33]:
test_phrase = 'This is a string! But it has punctuation. How can we remove it?'

test_pattern = '[^?.! ]+'

re.findall(test_pattern,test_phrase)

['This',
 'is',
 'a',
 'string',
 'But',
 'it',
 'has',
 'punctuation',
 'How',
 'can',
 'we',
 'remove',
 'it']
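
###### The test phrase asks how the punctuation can be removed. One possible approach, sketched here with re.sub (a standard re function not used elsewhere in this lecture), is to delete the punctuation characters directly instead of collecting the words:

```python
import re

test_phrase = 'This is a string! But it has punctuation. How can we remove it?'

# Collect the runs of characters that are not punctuation or spaces
words = re.findall('[^?.! ]+', test_phrase)

# Or delete each punctuation character by replacing it with an empty string
cleaned = re.sub('[?.!]', '', test_phrase)
print(cleaned)  # This is a string But it has punctuation How can we remove it

# Both routes lead to the same text
print(' '.join(words) == cleaned)  # True
```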

### Character Ranges
###### As character sets grow larger, typing every character that should (or should not) match could become very tedious. A more compact format using character ranges lets you define a character set to include all of the contiguous characters between a start and stop point. The format used is [start-end].

###### Common use cases are to search for a specific range of letters in the alphabet. For instance, [a-f] would return matches with any occurrence of letters between a and f.

###### Let's walk through some examples:

In [37]:
test_phrase = "This is an example sentence, Lets see if we can find the letters in it"

find_patterns = [
    # sequences of lowercase letters
    '[a-z]+',
    # sequences of uppercase letters
    '[A-Z]+',
    # sequences of lowercase or uppercase letters
    '[a-zA-Z]+',
    # one uppercase letter followed by lowercase letters
    '[A-Z][a-z]+'
]

multi_re_find(find_patterns,test_phrase)

Searching the phrase using the re check: [a-z]+
['his', 'is', 'an', 'example', 'sentence', 'ets', 'see', 'if', 'we', 'can', 'find', 'the', 'letters', 'in', 'it']


Searching the phrase using the re check: [A-Z]+
['T', 'L']


Searching the phrase using the re check: [a-zA-Z]+
['This', 'is', 'an', 'example', 'sentence', 'Lets', 'see', 'if', 'we', 'can', 'find', 'the', 'letters', 'in', 'it']


Searching the phrase using the re check: [A-Z][a-z]+
['This', 'Lets']
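
###### Ranges can be combined inside one set and paired with the repetition syntax. As a rough sketch, the check below accepts six-character hexadecimal strings; the candidate values are invented for illustration:

```python
import re

candidates = ['ff8800', '00FF7f', 'zz9911', 'abc']

# Three ranges in one set: digits, lowercase a-f, uppercase A-F, repeated exactly six times
hex_pattern = '[0-9a-fA-F]{6}'

# fullmatch requires the entire string to fit the pattern
valid = [c for c in candidates if re.fullmatch(hex_pattern, c)]
print(valid)  # ['ff8800', '00FF7f']
```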




#### Escape Codes
###### You can use special escape codes to find specific types of patterns in your data, such as digits, non-digits, whitespace, and more. For example:

| Code | Meaning |
|------|---------|
| \d | a digit |
| \D | a non-digit |
| \s | whitespace (tab, space, newline, etc.) |
| \S | non-whitespace |
| \w | alphanumeric (letters, digits, and underscore) |
| \W | non-alphanumeric (anything \w does not match) |



###### Escapes are indicated by prefixing the character with a backslash \. Unfortunately, a backslash must itself be escaped in normal Python strings, and that results in expressions that are difficult to read. Using raw strings, created by prefixing the literal value with r, eliminates this problem and maintains readability.

###### Personally, I think the r prefix on pattern strings is probably one of the things that keeps someone who is not familiar with regex in Python from being able to read regex code at first. Hopefully after seeing these examples this syntax will become clear.
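
###### To make the difference concrete, here is a small sketch using \b, which is a backspace character in a normal Python string but a word boundary once it reaches the regex engine intact (the sample phrase is made up):

```python
import re

# Normal string: '\b' collapses to one backspace character.
# Raw string: r'\b' stays as two characters, backslash and b.
print(len('\b'), len(r'\b'))  # 1 2

phrase = 'cat concatenate cat'

# The non-raw pattern looks for literal backspace characters and finds nothing
without_raw = re.findall('\bcat\b', phrase)
print(without_raw)  # []

# The raw pattern matches 'cat' only as a whole word
with_raw = re.findall(r'\bcat\b', phrase)
print(with_raw)     # ['cat', 'cat']

# Without a raw string, each backslash has to be doubled to get the same pattern
print('\\d+' == r'\d+')  # True
```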

In [49]:
test_phrase = 'This is a string with some numbers 123467454 and some letters hcthctycty and #hashtag'

test_pattern = [
    # sequence of digits
    r'\d+',
    # sequence of non-digits
    r'\D+',
    # sequence of whitespace characters
    r'\s+',
    # sequence of non-whitespace characters
    r'\S+',
    # sequence of alphanumeric characters
    r'\w+',
    # sequence of non-alphanumeric characters
    r'\W+'
]

multi_re_find(test_pattern, test_phrase)

Searching the phrase using the re check: \d+
['123467454']


Searching the phrase using the re check: \D+
['This is a string with some numbers ', ' and some letters hcthctycty and #hashtag']


Searching the phrase using the re check: \s+
[' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ']


Searching the phrase using the re check: \S+
['This', 'is', 'a', 'string', 'with', 'some', 'numbers', '123467454', 'and', 'some', 'letters', 'hcthctycty', 'and', '#hashtag']


Searching the phrase using the re check: \w+
['This', 'is', 'a', 'string', 'with', 'some', 'numbers', '123467454', 'and', 'some', 'letters', 'hcthctycty', 'and', 'hashtag']


Searching the phrase using the re check: \W+
[' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' #']
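
###### Escape codes become more useful when combined with literal characters and with each other. A brief sketch on the same test phrase; the 'snake_case name' string is an invented extra to show that \w also accepts underscores:

```python
import re

test_phrase = 'This is a string with some numbers 123467454 and some letters hcthctycty and #hashtag'

# A literal # followed by alphanumeric characters
hashtags = re.findall(r'#\w+', test_phrase)
print(hashtags)  # ['#hashtag']

# A word, one whitespace character, then a run of digits
labelled = re.findall(r'\w+\s\d+', test_phrase)
print(labelled)  # ['numbers 123467454']

# \w covers letters, digits, and the underscore
identifiers = re.findall(r'\w+', 'snake_case name')
print(identifiers)  # ['snake_case', 'name']
```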




#### Conclusion
###### You should now have a solid understanding of how to use the regular expression module in Python. There are many more special characters and constructs, but it would be unreasonable to go through every single use case. Instead, take a look at the full documentation if you ever need to look up a particular pattern.

You can also check out the nice summary tables at this source.

Good job!