<h1>Regular Expressions</h1>
<p>
Regular expressions are text-matching patterns described with a formal syntax. They are often called 'regex' or 'regexp' in conversation. A pattern can express many kinds of rules, from matching literal text to finding repetition, and much more.</p>

<h3>Searching for Patterns in Text</h3>
<p>One of the most common uses for Python's built-in <code>re</code> module is finding patterns in text.</p>

In [3]:
import re
patterns=["foo1","foo2"]
foo="This is a foo string but not a foo2 string"
for pattern in patterns:
    print(f"Searching for {pattern} in :{foo}")
    if re.search(pattern,foo):
        print("Match was found")
    else:
        print("No Match was found")
    print()

Searching for foo1 in :This is a foo string but not a foo2 string
No Match was found

Searching for foo2 in :This is a foo string but not a foo2 string
Match was found



<p>Now we've seen that re.search() takes the pattern, scans the text, and returns a <strong>Match</strong> object for the first match. If the pattern is not found, <strong>None</strong> is returned.</p><p>The Match object returned by search() is more than a simple found/not-found flag: it contains <strong>information about the match</strong>, <strong>including the original input string, the regular expression that was used, and the location of the match.</strong></p>

In [6]:
match=re.search("foo",foo)
print(f"Foo match start: {match.start()}")
print(f"Foo match end: {match.end()}")
print(match.string)
print(match.group())

Foo match start: 10
Foo match end: 13
This is a foo string but not a foo2 string
foo
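
<p>The cell above does not show the regular expression stored on the Match object, or what happens when nothing is found. The following is a minimal sketch of both; the variable names <code>text</code>, <code>span</code>, <code>pattern_used</code> and <code>missing</code> are chosen here for illustration only.</p>

```python
import re

text = "This is a foo string but not a foo2 string"

match = re.search("foo", text)
if match is not None:                 # guard before calling Match methods
    span = match.span()               # (start, end) as a single tuple
    pattern_used = match.re.pattern   # the pattern string that produced the match
    print(span, pattern_used)

# search() returns None when the pattern is absent, so calling
# .start() on the result without a check would raise AttributeError.
missing = re.search("bar", text)
print(missing)
```

<p>Checking for None first is a common way to avoid an AttributeError when a pattern may not be present.</p>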


<h3>Split with regular expressions</h3>


In [11]:
sample_string='What is the domain name of someone with the email: hello@gmail.com'
split_string=re.split("@",sample_string)
print(split_string)

['What is the domain name of someone with the email: hello', 'gmail.com']
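
<p>Splitting on a single literal character could also be done with str.split(). re.split() becomes more useful when the delimiter is itself a pattern. The sketch below uses a made-up string, <code>shopping</code>, and a character set (covered later in this lecture) to split on either of two delimiters.</p>

```python
import re

# A character set such as [,;] matches any one of the listed characters,
# so this splits on either delimiter.
shopping = "apples,pears;grapes,plums"
items = re.split("[,;]", shopping)
print(items)

# maxsplit limits how many splits are made, counting from the left.
first, rest = re.split("[,;]", shopping, maxsplit=1)
print(first, "|", rest)
```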


<h3>Finding all instances of a pattern</h3>


In [14]:
find_instances=re.findall("the",sample_string)
print(find_instances)

['the', 'the']
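
<p>re.findall() returns only the matched text. If the position of each match is also needed, re.finditer() can be used instead, since it yields Match objects. A small sketch, reusing the same sample string (the name <code>spans</code> is illustrative):</p>

```python
import re

sample_string = "What is the domain name of someone with the email: hello@gmail.com"

# finditer yields one Match object per occurrence, in order of appearance.
spans = [m.span() for m in re.finditer("the", sample_string)]
print(spans)

# Each (start, end) pair slices the matched text back out of the string.
for start, end in spans:
    print(sample_string[start:end])
```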


<h3>re Pattern Syntax</h3>
<p>This will be the bulk of this lecture on using re with Python. Regular expressions support a huge variety of patterns beyond simply finding where a single string occurred.</p>
<p>We can use metacharacters along with re to find specific types of patterns.</p>

<h3>Repetition Syntax</h3>
<p>There are five ways to express repetition in a pattern:</p>
<ol>
<li>A pattern followed by the meta-character <code>*</code> is repeated zero or more times.</li>
<li>Replace the <code>*</code> with <code>+</code> and the pattern must appear at least once.</li>
<li>Using <code>?</code> means the pattern appears zero or one time.</li>
<li>For a specific number of occurrences, use <code>{m}</code> after the pattern, where m is the number of times the pattern should repeat.</li>
<li>Use <code>{m,n}</code>, where m is the minimum number of repetitions and n is the maximum. Leaving out n, as in <code>{m,}</code>, means the pattern appears at least m times, with no maximum.</li>
</ol>


In [4]:
def multi_re_find(patterns,phrase):
    for pattern in patterns:
        print(f"Searching the phrase with re check:{pattern}")
        print(re.findall(pattern,phrase))
        print()

test_phrase="sdsd..sssddd...sdddsddd...dsds...dsssss...sdddd"
test_patterns=["sd*","sd+","sd?","sd{3}","sd{2,3}"]
multi_re_find(patterns=test_patterns,phrase=test_phrase)

Searching the phrase with re check:sd*
['sd', 'sd', 's', 's', 'sddd', 'sddd', 'sddd', 'sd', 's', 's', 's', 's', 's', 's', 'sdddd']

Searching the phrase with re check:sd+
['sd', 'sd', 'sddd', 'sddd', 'sddd', 'sd', 'sdddd']

Searching the phrase with re check:sd?
['sd', 'sd', 's', 's', 'sd', 'sd', 'sd', 'sd', 's', 's', 's', 's', 's', 's', 'sd']

Searching the phrase with re check:sd{3}
['sddd', 'sddd', 'sddd', 'sddd']

Searching the phrase with re check:sd{2,3}
['sddd', 'sddd', 'sddd', 'sddd']
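
<p>The open-ended form <code>{m,}</code> from the list of repetition rules is not among the test patterns above. As a small sketch on the same phrase (the name <code>at_least_two</code> is illustrative):</p>

```python
import re

test_phrase = "sdsd..sssddd...sdddsddd...dsds...dsssss...sdddd"

# s followed by at least two d characters, with no upper limit
at_least_two = re.findall("sd{2,}", test_phrase)
print(at_least_two)
# ['sddd', 'sddd', 'sddd', 'sdddd']
```

<p>Compared with <code>sd{2,3}</code>, the only difference on this phrase is the last match: without an upper limit, all four d characters at the end are consumed.</p>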



<h3>Character Sets</h3>
<p>Character sets are used when you wish to match any one of a group of characters at a point in the input. Brackets are used to construct character set inputs. For example: the input [ab] searches for occurrences of either a or b.</p>


In [12]:
test_patterns=["[sd]","s[sd]+"]
test_phrase = 'sdd sdsds..sssddd...sdddsddd...dsds...dsssss...sdddd'
multi_re_find(test_patterns,test_phrase)

Searching the phrase with re check:[sd]
['s', 'd', 'd', 's', 'd', 's', 'd', 's', 's', 's', 's', 'd', 'd', 'd', 's', 'd', 'd', 'd', 's', 'd', 'd', 'd', 'd', 's', 'd', 's', 'd', 's', 's', 's', 's', 's', 's', 'd', 'd', 'd', 'd']

Searching the phrase with re check:s[sd]+
['sdd', 'sdsds', 'sssddd', 'sdddsddd', 'sds', 'sssss', 'sdddd']



<p>It makes sense that the first input <code>[sd]</code> returns every instance of s or d. Also, the second input <code>s[sd]+</code> returns any full strings that begin with an <strong>s</strong> and <strong>continue with s or d characters until another character is reached.</strong></p>
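
<p>The <code>[ab]</code> example mentioned earlier can be tried the same way. This sketch uses an arbitrary word, "cabbage", chosen only for illustration:</p>

```python
import re

word = "cabbage"

# [ab] matches exactly one character that is either a or b.
single = re.findall("[ab]", word)
print(single)   # ['a', 'b', 'b', 'a']

# Adding + groups consecutive a/b characters into one match.
runs = re.findall("[ab]+", word)
print(runs)     # ['abba']
```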

<h3>Exclusion</h3>
<p>
We can use ^ to exclude characters by placing it at the start of the bracket syntax. For example: [^...] will match any single character not in the brackets.</p><p><strong>Use [^!.? ] to match any character that is not a !, a ., a ?, or a space.</strong> Add a + so that one or more such characters are matched in a row. This effectively extracts the words. The example below also excludes the apostrophe.</p>

In [12]:
import re
test_phrase="This is a string with foo.Don't be foolish!Work Hard like Elon Musk."
print(re.findall("[^!?'. ]+",test_phrase)) 
print(re.findall("[^!?'.]+",test_phrase))

['This', 'is', 'a', 'string', 'with', 'foo', 'Don', 't', 'be', 'foolish', 'Work', 'Hard', 'like', 'Elon', 'Musk']
['This is a string with foo', 'Don', 't be foolish', 'Work Hard like Elon Musk']


<p>
<pre>
print(re.findall("[^!?'. ]+",test_phrase)) # The space is in the excluded set, so the text is split at
                                           # every !, ?, ', . and space, giving individual words.

print(re.findall("[^!?'.]+",test_phrase))  # The space is not excluded, so the text is split only at
                                           # !, ?, ' and ., giving phrases that keep their spaces.
</pre>
</p>
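
<p>One detail worth checking is that ^ only negates a set when it is the first character inside the brackets. Anywhere else in the set it is treated as a literal caret. A minimal sketch, with made-up input strings:</p>

```python
import re

# ^ first: match runs of characters that are NOT digits.
non_digits = re.findall("[^0-9]+", "room 101, floor 3")
print(non_digits)      # ['room ', ', floor ']

# ^ not first: it is an ordinary member of the set, alongside the digits.
literal_caret = re.findall("[0-9^]+", "2^10")
print(literal_caret)   # ['2^10']
```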

<h3>Character Ranges</h3>
<p>As character sets grow larger, typing every character that should (or should not) match could become very tedious. A more compact format using character ranges lets you define a character set to include all of the contiguous characters between a start and stop point. The format used is [start-end].</p>
<p>
A common use case is to search for a specific range of letters in the alphabet. For instance, [a-f] matches any single letter from a through f.</p>

In [12]:
test_phrase = 'This is an example sentence. Lets see if we can find some letters.'
test_patterns=["[a-z]+","[A-Z]+","[A-Z][a-z]+","[A-Z ][a-z ]+"] # The last pattern also allows spaces, so it matches runs of words.
multi_re_find(test_patterns,test_phrase)

Searching the phrase with re check:[a-z]+
['his', 'is', 'an', 'example', 'sentence', 'ets', 'see', 'if', 'we', 'can', 'find', 'some', 'letters']

Searching the phrase with re check:[A-Z]+
['T', 'L']

Searching the phrase with re check:[A-Z][a-z]+
['This', 'Lets']

Searching the phrase with re check:[A-Z ][a-z ]+
['This is an example sentence', 'Lets see if we can find some letters']



In [25]:
test_phrase = 'abcdef abcdefghijk jkabcdef'
test_patterns=["[a-f]+","[a-z]+","[A-Z]+","[a-z][A-Z]+"] # No space in these sets, so a match
multi_re_find(test_patterns,test_phrase)                 # never extends across a space.

Searching the phrase with re check:[a-f]+
['abcdef', 'abcdef', 'abcdef']

Searching the phrase with re check:[a-z]+
['abcdef', 'abcdefghijk', 'jkabcdef']

Searching the phrase with re check:[A-Z]+
[]

Searching the phrase with re check:[a-z][A-Z]+
[]



In [26]:
test_phrase = 'abcdef abcdefghijk jkabcdef'
test_patterns=["[a-f ]+","[a-z ]+","[A-Z ]+","[a-z ][A-Z ]+"] # These sets include a space, so
multi_re_find(test_patterns,test_phrase)                      # matches can span several words.

Searching the phrase with re check:[a-f ]+
['abcdef abcdef', ' ', 'abcdef']

Searching the phrase with re check:[a-z ]+
['abcdef abcdefghijk jkabcdef']

Searching the phrase with re check:[A-Z ]+
[' ', ' ']

Searching the phrase with re check:[a-z ][A-Z ]+
['f ', 'k ']
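
<p>Several ranges can also be combined inside one set of brackets. The sketch below, using an invented log-style string, matches runs of letters and digits while treating everything else as a separator:</p>

```python
import re

log_line = "user_42 logged-in at 10:30"

# Lowercase letters, uppercase letters and digits in a single character set.
# The underscore, hyphen, colon and space are not in the set, so they end a match.
tokens = re.findall("[a-zA-Z0-9]+", log_line)
print(tokens)
# ['user', '42', 'logged', 'in', 'at', '10', '30']
```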



In [13]:
print('He\'s Peter.')

He's Peter.
