REGULAR EXPRESSIONS (REGEX) IN PYTHON

### Using the re module

The functionality for using regular expressions in Python is included in the `re` package, which is part of the standard library, so you should be able to import it directly:


In [1]:
import re

Let's do a quick example of using the `re` module to find some text. First we define a sentence to search in:

In [2]:
s = "In principio erat verbum,et verbum erat apud Deum."

The next thing we define is the actual expression we will use, that is, the string we will search for in the sentence defined above. We pass this string to the `compile()` function in the `re` package:

In [3]:
pattern = re.compile(r"verbum")

Next, we call the `sub()` method on this compiled pattern in order to replace (or "substitute") every match of our pattern with another word, like this:

In [4]:
text = pattern.sub("XXX",s)
print(text)

In principio erat XXX,et XXX erat apud Deum.


Note the order of the arguments passed to `sub()`: first, the word we would like to replace (or "substitute") our pattern with, and second, our original string. We can just as easily get back to our original string:

In [5]:
pattern2 = re.compile(r"XXX")
text = pattern2.sub("verbum", text)  # substitute back in the modified text, not in s
print(text)

In principio erat verbum,et verbum erat apud Deum.
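As a side note, the same substitution can also be written without compiling the pattern first, and `sub()` accepts an optional `count` argument that limits how many replacements are made. A minimal sketch (the names `sentence`, `all_replaced` and `first_only` are chosen here purely for illustration):

```python
import re

sentence = "In principio erat verbum,et verbum erat apud Deum."

# Module-level form: pattern first, then the replacement, then the string.
all_replaced = re.sub(r"verbum", "XXX", sentence)
print(all_replaced)

# count=1 replaces only the first occurrence and leaves the rest alone.
first_only = re.sub(r"verbum", "XXX", sentence, count=1)
print(first_only)
```

Compiling is mostly worthwhile when the same pattern is reused many times, as in the cells of this notebook.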


### Example 2
Replace all vowels in a string using a regular expression:

In [6]:
vowel_pattern = re.compile(r"a|e|o|u|i")
without_vowels = vowel_pattern.sub("X",s)
print(without_vowels)

In prXncXpXX XrXt vXrbXm,Xt vXrbXm XrXt XpXd DXXm.


Note how our pattern uses a special syntax: the pipe symbol `|`, which expresses that one character OR another is fine for the regular expression to match.
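The pipe is not limited to single characters. As a small sketch (the variable names below are made up for this illustration), the alternatives can be longer character sequences:

```python
import re

sentence = "In principio erat verbum,et verbum erat apud Deum."

# Match either the sequence "erat" or the sequence "apud".
word_pattern = re.compile(r"erat|apud")
replaced = word_pattern.sub("___", sentence)
print(replaced)
```

Each alternative is tried at every position in the string, so both occurrences of "erat" and the single "apud" are replaced.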

### Example 3
From the output of Example 2, we can see that the capital letter "I" at the beginning of the sentence hasn't been replaced, because we only included lowercase vowels in our pattern definition. Let's add the uppercase vowels to the regex:



In [8]:
vowel_pattern = re.compile(r"a|A|e|E|o|O|u|U|i|I")
without_vowels = vowel_pattern.sub("X",s)
print(without_vowels)

Xn prXncXpXX XrXt vXrbXm,Xt vXrbXm XrXt XpXd DXXm.


There is a better way to match all lowercase or all uppercase characters in a string, like this:

In [10]:
ups = re.compile(r"[A-Z]")
lows = re.compile(r"[a-z]")

without_ups = ups.sub("X",s)
print(without_ups)

without_lows = lows.sub("X",s)
print(without_lows)

Xn principio erat verbum,et verbum erat apud Xeum.
IX XXXXXXXXX XXXX XXXXXX,XX XXXXXX XXXX XXXX DXXX.


These specific patterns are called "ranges": `[A-Z]` matches any uppercase letter and `[a-z]` matches any lowercase letter. In fact, you can use the same square-bracket syntax to replace the pipe syntax we used earlier:

In [11]:
vowel_pattern = re.compile(r"[aeoui]")
without_vowels = vowel_pattern.sub("X",s)
print(without_vowels)

In prXncXpXX XrXt vXrbXm,Xt vXrbXm XrXt XpXd DXXm.
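To also catch the uppercase vowels with the bracket syntax, there are at least two options: list both cases inside the set, or pass the `re.IGNORECASE` flag when compiling. The sketch below compares the two (the names `explicit` and `flagged` are illustrative only):

```python
import re

sentence = "In principio erat verbum,et verbum erat apud Deum."

# Option 1: list both cases inside the character set.
explicit = re.compile(r"[aeiouAEIOU]")

# Option 2: keep the set short and let the flag ignore case.
flagged = re.compile(r"[aeiou]", re.IGNORECASE)

explicit_result = explicit.sub("X", sentence)
flagged_result = flagged.sub("X", sentence)
print(explicit_result)
print(flagged_result)
```

Both calls replace the leading "I" as well, so the two printed lines are identical.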


You can also look for more specific, as well as longer, letter groups by wrapping them in round brackets:

In [12]:
p = re.compile(r"(ri)|(um)|(Th)")
print(p.sub("X", s))  # use the new pattern p, not vowel_pattern

In pXncipio erat verbX,et verbX erat apud DeX.
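Round brackets do more than group letters: they also capture the text they matched, so it can be inspected afterwards. A minimal sketch, assuming a single group with three alternatives (the names `groups`, `found` and `first` are chosen for this example):

```python
import re

sentence = "In principio erat verbum,et verbum erat apud Deum."

# One group holding three alternatives; "Th" does not occur in the sentence.
groups = re.compile(r"(ri|um|Th)")

# With a single capturing group, findall() returns the text captured by it.
found = groups.findall(sentence)
print(found)

# search() returns the first match; group(1) is what the brackets captured.
first = groups.search(sentence)
print(first.group(1), first.span())
```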


There is also a syntax to match any character (except the newline):

In [13]:
any_char = re.compile(r".")
print(any_char.sub("X",s))

XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX


If you would like your expression to match an actual dot, you have to escape it using a backslash:

In [14]:
dot = re.compile(r"\.")
print(dot.sub("X",s))

In principio erat verbum,et verbum erat apud DeumX
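If you are not sure which characters need a backslash, the `re.escape()` function can add them for you. A short sketch (the variable names are illustrative):

```python
import re

sentence = "In principio erat verbum,et verbum erat apud Deum."

# re.escape() adds the backslashes needed to treat the text literally.
escaped = re.escape(".")
print(escaped)

dot = re.compile(escaped)
result = dot.sub("X", sentence)
print(result)
```

This is mainly handy when the text to search for comes from user input rather than being typed into the pattern by hand.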


In [16]:
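s = "In principio [erat] verbum,et verbum erat apud Deum."

# Wrong: unescaped square brackets define a character set, so this only matches a literal "|"
brackets_wrong = re.compile(r"[|]")
print(brackets_wrong.sub("X", s))

# Right: escape each square bracket with a backslash to match it literally
brackets_right = re.compile(r"(\[)|(\])")
print(brackets_right.sub("X", s))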

In principio [erat] verbum,et verbum erat apud Deum.
In principio XeratX verbum,et verbum erat apud Deum.
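The same result can probably be had a little more compactly by putting both escaped brackets inside one character set. This is only a sketch of an alternative spelling; the names `bracketed` and `brackets_set` are made up here:

```python
import re

bracketed = "In principio [erat] verbum,et verbum erat apud Deum."

# Inside a set, the escaped brackets are treated as ordinary characters.
brackets_set = re.compile(r"[\[\]]")
result = brackets_set.sub("X", bracketed)
print(result)
```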


The syntax for regular expressions includes a whole range of possibilities which we simply cannot all deal with here. Because of that, we will stick to a number of helpful examples. An interesting feature is that you can specify how many times a character has to occur: `m{2,4}` matches between two and four consecutive m characters. You can check whether the pattern occurs at the beginning of a string using the `match()` method, which returns `None` if the string does not start with the pattern:

In [19]:
pattern = re.compile(r"m{2,4}")
print(pattern.match(""))
print(pattern.match("m"))
print(pattern.match("mm"))
print(pattern.match("mmm"))
print(pattern.match("mmmm"))
print(pattern.match("mmmmm"))
print(pattern.match("mmmmmm"))
print(pattern.match("mmmmammm"))

None
None
<re.Match object; span=(0, 2), match='mm'>
<re.Match object; span=(0, 3), match='mmm'>
<re.Match object; span=(0, 4), match='mmmm'>
<re.Match object; span=(0, 4), match='mmmm'>
<re.Match object; span=(0, 4), match='mmmm'>
<re.Match object; span=(0, 4), match='mmmm'>
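Because `match()` only looks at the beginning of the string, it helps to compare it with `search()` and `fullmatch()`. The sketch below uses the same `m{2,4}` pattern on a few test strings picked for this illustration:

```python
import re

pattern = re.compile(r"m{2,4}")

# match() only looks at the start of the string, and "ammm" starts with "a".
at_start = pattern.match("ammm")
print(at_start)

# search() scans the whole string and returns the first match it finds.
anywhere = pattern.search("ammm")
print(anywhere)

# fullmatch() requires the entire string to match the pattern.
whole_ok = pattern.fullmatch("mmm")
whole_too_long = pattern.fullmatch("mmmmm")  # five m's exceed the {2,4} limit
print(whole_ok)
print(whole_too_long)
```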


In [22]:
# list of patterns to search for
patterns = ['term1', 'term2']
# text to parse
text = " this is a string with term1, but it does not have the other term."

for pattern in patterns:
    print(f'searching for {pattern} in:\n"{text}"')

    # check for a match
    if re.search(pattern, text):
        print('\n')
        print("match was found. \n")
    else:
        print('\n')
        print("No Match was found.\n")

searching for term1 in:
" this is a string with term1, but it does not have the other term."


match was found. 

searching for term2 in:
" this is a string with term1, but it does not have the other term."


No Match was found.



Now we have seen that `re.search()` takes the pattern, scans the text, and returns a Match object. If the pattern is not found, `None` is returned.
To give a clearer picture of this Match object, check out the cell below:


In [23]:
# pattern to search for
pattern = "term1"
#text to parse
text = "this is a string with term1, but it does not have the other term."
match = re.search(pattern,text)
type(match)

re.Match

The Match object returned by the `search()` method is more than just a Boolean or `None`. It contains information about the match, including the original input string, the regular expression that was used, and the location of the match. Let's see the methods we can use on the Match object:

In [24]:
match

<re.Match object; span=(22, 27), match='term1'>

In [25]:
#show start of match 
match.start()

22

In [26]:
#show end 
match.end()

27
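A few more attributes of the Match object expose the other pieces of information mentioned above. This is a small sketch using the same pattern and text; the `if match:` guard is included because `search()` may return `None`:

```python
import re

text = "this is a string with term1, but it does not have the other term."
match = re.search("term1", text)

# Guard against None before using the match object.
if match:
    print(match.span())                     # start and end as one tuple
    print(match.group())                    # the text that matched
    print(text[match.start():match.end()])  # slicing gives the same text
    print(match.re.pattern)                 # the pattern that was used
    print(match.string == text)             # the original input string
```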

### Splitting with regular expressions

Let's see how we can split with the `re` syntax. This should look similar to how you used the `split()` method with strings.

In [27]:
# term to split on 
split_term = "@"

phrase = " what is the domain name of someone with the email: hello@gmail.com"

#split the phrase
re.split(split_term,phrase)

[' what is the domain name of someone with the email: hello', 'gmail.com']
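Where `re.split()` goes beyond the string method is that the split term can be a pattern. As a sketch (the `email` string and variable names are just for illustration), we can split on either of two characters, or limit the number of splits with `maxsplit`:

```python
import re

email = "hello@gmail.com"

# Split wherever an @ or a literal dot appears.
parts = re.split(r"[@.]", email)
print(parts)

# maxsplit=1 stops after the first split, keeping the domain in one piece.
user, domain = re.split(r"@", email, maxsplit=1)
print(user, domain)
```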

### Finding all instances of a pattern
You can use `re.findall()` to find all instances of a pattern in a string. For example:

In [28]:
#returns a list of all matches
re.findall("match","test phrase match is in middle match")

['match', 'match']
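`findall()` only returns the matched text. If the positions are needed as well, `re.finditer()` yields one Match object per occurrence. A minimal sketch with the same phrase (the name `positions` is chosen here):

```python
import re

phrase = "test phrase match is in middle match"

# finditer() yields a Match object for every occurrence of the pattern.
positions = [(m.start(), m.end()) for m in re.finditer("match", phrase)]
print(positions)

# Each (start, end) pair slices the matched text back out of the phrase.
for start, end in positions:
    print(phrase[start:end])
```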

## re pattern syntax

Regular expressions support a much wider variety of patterns than simply finding where a single string occurred. We can also use metacharacters along with `re` to find specific types of patterns.
Since we will be testing multiple `re` syntax forms, let's create a function that prints the results given a list of regular expressions and a phrase to parse:

In [33]:
def multi_re_find(patterns, phrase):
    """
    Takes in a list of regex patterns.
    Prints a list of all matches for each pattern.
    """
    for pattern in patterns:
        print('searching the phrase using the re check: %r' % pattern)
        print(re.findall(pattern, phrase))
        print('\n')

## Repetition syntax: examples using our multi_re_find function

In [35]:
test_phrase = "sdsd..sssddd...sdddsddd...dsds...dsssss...sdddd"

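test_patterns = [ "sd*",     # s followed by zero or more d's
                  "sd+",     # s followed by one or more d's
                  "sd?",     # s followed by zero or one d
                  "sd{3}",   # s followed by exactly three d's
                  "sd{2,3}", # s followed by two to three d's
                ]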
multi_re_find(test_patterns,test_phrase)

searching the phrase using the re check: 'sd*'
['sd', 'sd', 's', 's', 'sddd', 'sddd', 'sddd', 'sd', 's', 's', 's', 's', 's', 's', 'sdddd']


searching the phrase using the re check: 'sd+'
['sd', 'sd', 'sddd', 'sddd', 'sddd', 'sd', 'sdddd']


searching the phrase using the re check: 'sd?'
['sd', 'sd', 's', 's', 'sd', 'sd', 'sd', 'sd', 's', 's', 's', 's', 's', 's', 'sd']


searching the phrase using the re check: 'sd{3}'
['sddd', 'sddd', 'sddd', 'sddd']


searching the phrase using the re check: 'sd{2,3}'
['sddd', 'sddd', 'sddd', 'sddd']
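Since `findall()` on a long phrase can make it hard to see what each quantifier accepts, here is a small sketch that checks a handful of short candidate strings in full with `re.fullmatch()`. The `candidates` list and the `accepted` dictionary are made up for this illustration:

```python
import re

candidates = ["s", "sd", "sdd", "sddd", "sdddd"]
quantified = ["sd*", "sd+", "sd?", "sd{3}", "sd{2,3}"]

# For each pattern, collect the candidates that it matches from start to end.
accepted = {
    pattern: [c for c in candidates if re.fullmatch(pattern, c)]
    for pattern in quantified
}

for pattern, matches in accepted.items():
    print(pattern, matches)
```

Reading the output row by row shows how many d's each form allows after the s.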




## character sets
They are used when you want to match any one of a group of characters at a point in the input. Square brackets are used to construct character sets. For example, the pattern `[ab]` searches for occurrences of either a or b. Let's see some examples:

In [36]:
test_phrase = 'sdsd..sssddd...sdddsddd...dsds...dsssss...sdddd'

test_patterns = [ '[sd]', #either s or d
              's[sd]+'] #s followed by one or more s or d

multi_re_find(test_patterns,test_phrase)
    

searching the phrase using the re check: '[sd]'
['s', 'd', 's', 'd', 's', 's', 's', 'd', 'd', 'd', 's', 'd', 'd', 'd', 's', 'd', 'd', 'd', 'd', 's', 'd', 's', 'd', 's', 's', 's', 's', 's', 's', 'd', 'd', 'd', 'd']


searching the phrase using the re check: 's[sd]+'
['sdsd', 'sssddd', 'sdddsddd', 'sds', 'sssss', 'sdddd']




It makes sense that the first pattern, `[sd]`, returns every instance of s or d. The second pattern returns anything that starts with an s, at least in this particular test phrase.

## Exclusion

We can use `^` to exclude terms by incorporating it into the bracket syntax. For example, `[^...]` will match any single character not in the brackets:

In [37]:
test_phrase = "this is a string! but it has punctuation. how do i remove it?"

Use `[^!.? ]` to check for matches that are not a `!`, `.`, `?`, or space. Add the `+` to check that the match appears at least once. This basically translates into finding the words:


In [40]:
re.findall('[^!.? ]+' , test_phrase)

['this',
 'is',
 'a',
 'string',
 'but',
 'it',
 'has',
 'punctuation',
 'how',
 'do',
 'i',
 'remove',
 'it']
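The test phrase asks how to remove the punctuation, and exclusion combined with `re.sub()` is one possible way to do it. The sketch below also shows that `^` means something different outside brackets, where it anchors the match to the start of the string (the names `cleaned` and `at_start` are illustrative):

```python
import re

test_phrase = "this is a string! but it has punctuation. how do i remove it?"

# Delete every character that is not a lowercase letter or a space.
cleaned = re.sub(r"[^a-z ]", "", test_phrase)
print(cleaned)

# Outside brackets, ^ does not exclude anything: it marks the start of the string.
at_start = re.findall(r"^this", test_phrase)
print(at_start)
```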

In [41]:
test_phrase = 'this is an example sentence. lets see if we can find some letters'

test_patterns = [ '[a-z]+', #sequences of lower case letters
                  '[A-Z]+',  #sequences of upper case letters
                  '[a-zA-Z]+',   #sequences of lower or upper case letters
                  '[A-Z][a-z]+'] #one upper case letter followed by lower case letters

multi_re_find(test_patterns,test_phrase)

searching the phrase using the re check: '[a-z]+'
['this', 'is', 'an', 'example', 'sentence', 'lets', 'see', 'if', 'we', 'can', 'find', 'some', 'letters']


searching the phrase using the re check: '[A-Z]+'
[]


searching the phrase using the re check: '[a-zA-Z]+'
['this', 'is', 'an', 'example', 'sentence', 'lets', 'see', 'if', 'we', 'can', 'find', 'some', 'letters']


searching the phrase using the re check: '[A-Z][a-z]+'
[]
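The two uppercase patterns return empty lists only because the test phrase contains no capital letters. With a capitalized variant of the phrase (made up here for illustration), the same patterns do find matches:

```python
import re

capitalized = "This is an Example sentence. Lets see if we can find Some letters"

# Sequences of uppercase letters only.
upper_runs = re.findall(r"[A-Z]+", capitalized)
print(upper_runs)

# One uppercase letter followed by one or more lowercase letters.
capital_words = re.findall(r"[A-Z][a-z]+", capitalized)
print(capital_words)
```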




In [44]:
test_phrase = 'this is a string with some numbers 1233 and a symbol #hashtag'

test_patterns = [r'\d+', #sequence of digits
                 r'\D+', #sequence of non digits
                 r'\s+', #sequence of white space
                 r'\S+', #sequence of non white space
                 r'\w+', #alphanumeric characters
                 r'\W+', #non_alphanumeric
                 ]
multi_re_find(test_patterns,test_phrase)

searching the phrase using the re check: '\\d+'
['1233']


searching the phrase using the re check: '\\D+'
['this is a string with some numbers ', ' and a symbol #hashtag']


searching the phrase using the re check: '\\s+'
[' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ']


searching the phrase using the re check: '\\S+'
['this', 'is', 'a', 'string', 'with', 'some', 'numbers', '1233', 'and', 'a', 'symbol', '#hashtag']


searching the phrase using the re check: '\\w+'
['this', 'is', 'a', 'string', 'with', 'some', 'numbers', '1233', 'and', 'a', 'symbol', 'hashtag']


searching the phrase using the re check: '\\W+'
[' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' #']
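These escape codes can be mixed with literal characters and with the repetition syntax seen earlier. As a sketch on the same test phrase (the variable names are chosen for this example):

```python
import re

test_phrase = "this is a string with some numbers 1233 and a symbol #hashtag"

# A literal # followed by one or more alphanumeric characters.
hashtags = re.findall(r"#\w+", test_phrase)
print(hashtags)

# Exactly four digits in a row.
four_digits = re.findall(r"\d{4}", test_phrase)
print(four_digits)
```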




You should learn how to use the `re` module well. To learn more, take a look at the full documentation on the Python website.