# Regular Expressions

Regular expressions are text-matching patterns described with a formal syntax. You'll often hear regular expressions referred to as 'regex' or 'regexp' in conversation. Regular expressions can include a variety of rules, from finding repetition, to text-matching, and much more. As you advance in Python you'll see that a lot of your parsing problems can be solved with regular expressions (they're also a common interview question!).

If you're familiar with Perl, you'll notice that the syntax for regular expressions is very similar in Python. We will be using the re module with Python for this lecture.

In [1]:
# Searching text: one of the most common uses of re is finding a pattern in text

import re
# List of patterns to search for
patterns = ['term1','term2']

# Sentence in which to search for each pattern
sentence = 'This is string with term1 but not with term'



In [3]:
for pattern in patterns:
    print("Searching for {} in: \n{}.".format(pattern,sentence))
    
    if re.search(pattern,sentence):
        print("Match Found")
    else:
        print("Match not found")

Searching for term1 in: 
This is string with term1 but not with term.
Match Found
Searching for term2 in: 
This is string with term1 but not with term.
Match not found


Now we've seen that re.search() takes the pattern, scans the text, and returns a Match object if the pattern is found. If it is not found, re.search() returns None.
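A minimal sketch of the two possible return values, reusing the sentence from above. The variable names found and missing are only illustrative:

```python
import re

sentence = 'This is string with term1 but not with term'

found = re.search('term1', sentence)    # pattern is present
missing = re.search('term2', sentence)  # pattern is absent

print(found is None)    # False: a Match object was returned
print(missing is None)  # True: no match, so we get None

# A Match object is truthy and None is falsy,
# which is why `if re.search(pattern, sentence):` works as a test.
print(bool(found), bool(missing))
```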

In [4]:
# Let's see what happens if we pass the whole list to re.search

match = re.search(patterns,sentence)

TypeError: unhashable type: 'list'

In [5]:
# The error shows that we passed a list where re.search expects a single pattern.

# So let's search for the first item in the list instead.

match = re.search(patterns[0],sentence)

In [7]:
type(match)

re.Match

In [8]:
print(match)

<re.Match object; span=(20, 25), match='term1'>


In [10]:
# start() and end() show the indices where the match begins and ends
match.start()

20

In [11]:
match.end()

25
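As a small sketch of how these indices can be used, the snippet below also calls span() and group(), two other methods of the Match object, and slices the original sentence with the returned positions:

```python
import re

sentence = 'This is string with term1 but not with term'
match = re.search('term1', sentence)

if match is not None:
    start, end = match.span()   # same values as match.start() and match.end()
    print(start, end)           # 20 25
    print(match.group())        # the matched text: term1
    print(sentence[start:end])  # slicing with the indices gives the same text
```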

# Splitting with regular expressions



In [12]:
splitter = '@'


phrase = 'What is the domain name of someone with the email: hello@gmail.com'

re.split(splitter,phrase)

['What is the domain name of someone with the email: hello', 'gmail.com']

In [13]:
# re.split() returns the pieces of the string as a list

# We can also loop over the pieces and print them one by one

for sent in re.split(splitter,phrase):
    print(sent)

What is the domain name of someone with the email: hello
gmail.com


In [14]:
# Note how re.split() returns the output with the term we split on removed.
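Because the splitter is itself a pattern, it does not have to be a single fixed character. The sketch below is an illustration only: it uses a character set (covered later in this lecture) to split on either '@' or '.', and the maxsplit argument to limit the number of splits.

```python
import re

phrase = 'What is the domain name of someone with the email: hello@gmail.com'

# Split on either '@' or '.', so the domain is broken into its parts too
parts = re.split('[@.]', phrase)
print(parts)

# maxsplit=1 stops after the first split
first_split = re.split('@', phrase, maxsplit=1)
print(first_split)
```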


# Finding all instances of a pattern

Use re.findall() to find all the instances of a pattern in a string.

In [15]:
re.findall('business','this is my business, none of your business.So, keep you business seperate.')

['business', 'business', 'business']

In [16]:
# The findall() function returned every match as a list of strings
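re.findall() only returns the matched text. If the positions are needed as well, one option is re.finditer(), which yields a Match object for each occurrence. A short sketch, with a shortened example string chosen for illustration:

```python
import re

text = 'this is my business, none of your business.'

# Each item produced by finditer is a Match object with start() and end()
positions = [(m.start(), m.end()) for m in re.finditer('business', text)]

for start, end in positions:
    print(start, end, text[start:end])
```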

# Repetition Syntax

There are five ways to express repetition in a pattern:

    A pattern followed by the meta-character * is repeated zero or more times.
    Replace the * with + and the pattern must appear at least once.
    Using ? means the pattern appears zero or one time.
    For a specific number of occurrences, use {m} after the pattern, where m is replaced with the number of times the pattern should repeat.
    Use {m,n} where m is the minimum number of repetitions and n is the maximum. Leaving out n {m,} means the value appears at least m times, with no maximum.

Now we will see an example of each of these:


In [24]:
test_phrase = 'sdsd..sssddd...sdddsddd...dsds...dsssss...sdddd'

test_patterns = ['sd*',        # s followed by zero or more d's
                 'sd+',        # s followed by one or more d's
                 'sd?',        # s followed by zero or one d's
                 'sd{3}',      # s followed by three d's
                 'sd{2,3}',    # s followed by two to three d's
                 'sd{1}'       # s followed by 1 d
                 ]





In [22]:
def multi_re_find(patterns,phrase):
    '''
    Takes in a list of regex patterns
    Prints a list of all matches
    '''
    for pattern in patterns:
        print('Searching the phrase using the re check: {}'.format(pattern) )
        print(re.findall(pattern,phrase))
        print('\n')

In [23]:
multi_re_find(test_patterns,test_phrase)

Searching the phrase using the re check: sd*
['sd', 'sd', 's', 's', 'sddd', 'sddd', 'sddd', 'sd', 's', 's', 's', 's', 's', 's', 'sdddd']


Searching the phrase using the re check: sd+
['sd', 'sd', 'sddd', 'sddd', 'sddd', 'sd', 'sdddd']


Searching the phrase using the re check: sd?
['sd', 'sd', 's', 's', 'sd', 'sd', 'sd', 'sd', 's', 's', 's', 's', 's', 's', 'sd']


Searching the phrase using the re check: sd{3}
['sddd', 'sddd', 'sddd', 'sddd']


Searching the phrase using the re check: sd{2,3}
['sddd', 'sddd', 'sddd', 'sddd']


Searching the phrase using the re check: sd{1}
['sd', 'sd', 'sd', 'sd', 'sd', 'sd', 'sd']
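To check a single quantifier in isolation, one approach is re.fullmatch(), which succeeds only if the entire string fits the pattern. The candidate strings below are made up for illustration, and the open-ended form {2,} from the list above is included as well:

```python
import re

# (pattern, candidate string, whether the whole string should match)
checks = [
    ('sd*',     's',      True),   # zero d's is allowed
    ('sd+',     's',      False),  # at least one d is required
    ('sd?',     'sdd',    False),  # at most one d is allowed
    ('sd{3}',   'sddd',   True),   # exactly three d's
    ('sd{2,3}', 'sd',     False),  # needs two or three d's
    ('sd{2,}',  'sddddd', True),   # two or more d's, no upper limit
]

for pattern, candidate, expected in checks:
    matched = re.fullmatch(pattern, candidate) is not None
    print(pattern, candidate, matched)
    assert matched == expected
```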




In [37]:
test_phrase1 = 'my name is sandy and im nowhere and im working and im sleeping'

test_patterns1 = ['and']

multi_re_find(test_patterns1,test_phrase1)

Searching the phrase using the re check: and
['and', 'and', 'and', 'and']




In [38]:
# This gives four 'and' matches, but the sentence contains only three separate 'and' words. Why four?

# Look at the string again. Can you spot the extra one?

# There is an 'and' hidden inside the word 's-and-y', which is why findall returns four matches.
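If only the standalone word is wanted, one possible fix is the word-boundary escape \b, which matches the empty position between a word character and a non-word character. This is a sketch of that idea rather than part of the original example:

```python
import re

test_phrase1 = 'my name is sandy and im nowhere and im working and im sleeping'

# \b on both sides means 'and' must stand alone as a word,
# so the 'and' inside 'sandy' no longer counts
whole_words = re.findall(r'\band\b', test_phrase1)
print(whole_words)
print(len(whole_words))  # 3
```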



In [43]:
test_phrase2 = 'my name is sandy and im nowhere and im working and im sleeping'

test_patterns2 = ['[and]',   # any single character that is 'a', 'n' or 'd'
                  'a[nd]',   # a followed by either n or d
                  'an[d]*',  # an followed by zero or more d's
                  '[ ]'      # a single space (one match per space in the text)
                 ]

multi_re_find(test_patterns2,test_phrase2)

Searching the phrase using the re check: [and]
['n', 'a', 'a', 'n', 'd', 'a', 'n', 'd', 'n', 'a', 'n', 'd', 'n', 'a', 'n', 'd', 'n']


Searching the phrase using the re check: a[nd]
['an', 'an', 'an', 'an']


Searching the phrase using the re check: an[d]*
['and', 'and', 'and', 'and']


Searching the phrase using the re check: [ ]
[' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ']




# Exclusion

[^...] matches any single character that is not listed inside the brackets.

Let's see an example.

In [44]:
test_phrase = 'This is a string! But it has punctuation. How can we remove it?'

Use [^!.? ] to match any character that is not a "!", ".", "?" or space. Add a + so that the match must contain at least one such character. In practice this finds the words.


In [45]:
re.findall('[^!.? ]+',test_phrase)

['This',
 'is',
 'a',
 'string',
 'But',
 'it',
 'has',
 'punctuation',
 'How',
 'can',
 'we',
 'remove',
 'it']

In [47]:
# The pattern above returns every word, leaving out "!", ".", "?" and spaces.
# Dropping the ^ does the opposite: it finds the runs of "!", ".", "?" or " " (space).

re.findall('[!.? ]+',test_phrase)

[' ', ' ', ' ', '! ', ' ', ' ', ' ', '. ', ' ', ' ', ' ', ' ', '?']
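The test phrase asks how the punctuation can be removed. One way to do that, sketched here as an assumption about what the question is after, is re.sub(), which replaces every match of a pattern with a given string:

```python
import re

test_phrase = 'This is a string! But it has punctuation. How can we remove it?'

# Replace every '!', '.' or '?' with an empty string
cleaned = re.sub('[!.?]', '', test_phrase)
print(cleaned)

# Joining the words found with the exclusion pattern gives the same result
rejoined = ' '.join(re.findall('[^!.? ]+', test_phrase))
print(cleaned == rejoined)
```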

For a full list of patterns, see https://www.tutorialspoint.com/python/python_reg_expressions.htm

# Character Ranges

As character sets grow larger, typing every character that should (or should not) match could become very tedious. A more compact format using character ranges lets you define a character set to include all of the contiguous characters between a start and stop point. The format used is [start-end].

Common use cases are to search for a specific range of letters in the alphabet. For instance, [a-f] would return matches with any occurrence of letters between a and f.

In [49]:
test_phrase = 'This is an example sentence. Lets see if we can find some letters.'

In [51]:
test_patterns=['[a-z]+',      # sequences of lower case letters
               '[A-Z]+',      # sequences of upper case letters
               '[a-zA-Z]+',   # sequences of lower or upper case letters
               '[A-Z][a-z]+'] # one upper case letter followed by lower case letters
                
multi_re_find(test_patterns,test_phrase)

Searching the phrase using the re check: [a-z]+
['his', 'is', 'an', 'example', 'sentence', 'ets', 'see', 'if', 'we', 'can', 'find', 'some', 'letters']


Searching the phrase using the re check: [A-Z]+
['T', 'L']


Searching the phrase using the re check: [a-zA-Z]+
['This', 'is', 'an', 'example', 'sentence', 'Lets', 'see', 'if', 'we', 'can', 'find', 'some', 'letters']


Searching the phrase using the re check: [A-Z][a-z]+
['This', 'Lets']
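Ranges can also cover digits and can be combined with the repetition syntax. As a rough illustration, the sketch below looks for six-digit hexadecimal color codes; the sample text is invented for this example:

```python
import re

# Made-up sample text: the last "color" is not valid hexadecimal
text = 'Colors: #ff8800, #00FF7f and #zzzzzz'

# '#' followed by exactly six characters from 0-9, a-f or A-F
hex_colors = re.findall('#[0-9a-fA-F]{6}', text)
print(hex_colors)
```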




# Escape Codes:

You can use special escape codes to find specific types of patterns in your data, such as digits, non-digits, whitespace, and more. For example:

In [52]:


test_phrase = 'This is a string with some numbers 1233 and a symbol #hashtag'

test_patterns=[ r'\d+', # sequence of digits
                r'\D+', # sequence of non-digits
                r'\s+', # sequence of whitespace
                r'\S+', # sequence of non-whitespace
                r'\w+', # word characters (letters, digits and underscore)
                r'\W+', # non-word characters
                ]

multi_re_find(test_patterns,test_phrase)



Searching the phrase using the re check: \d+
['1233']


Searching the phrase using the re check: \D+
['This is a string with some numbers ', ' and a symbol #hashtag']


Searching the phrase using the re check: \s+
[' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ']


Searching the phrase using the re check: \S+
['This', 'is', 'a', 'string', 'with', 'some', 'numbers', '1233', 'and', 'a', 'symbol', '#hashtag']


Searching the phrase using the re check: \w+
['This', 'is', 'a', 'string', 'with', 'some', 'numbers', '1233', 'and', 'a', 'symbol', 'hashtag']


Searching the phrase using the re check: \W+
[' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' #']




Escapes are indicated by prefixing the character with a backslash \. Unfortunately, a backslash must itself be escaped in normal Python strings, and that results in expressions that are difficult to read. Using raw strings, created by prefixing the literal value with r, eliminates this problem and maintains readability.

Personally, I think the r prefix and the doubled backslashes are probably among the things that make regex code hard to read at first for someone who is not familiar with regex in Python.

In [58]:
# Alternatively, escape each backslash by doubling it ("\\")
test_patterns=[ '\\d+', # sequence of digits
                '\\D+', # sequence of non-digits
                ]

multi_re_find(test_patterns,test_phrase)


Searching the phrase using the re check: \d+
['1233']


Searching the phrase using the re check: \D+
['This is a string with some numbers ', ' and a symbol #hashtag']
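The difference between the two spellings is easiest to see by comparing the strings directly. The sketch below also shows a case where forgetting the r prefix silently changes the pattern: in a normal string, '\b' is a backspace character, not a word boundary. The sample text is made up for illustration.

```python
import re

# A raw string and a doubled backslash produce the same pattern text
same = r'\d+' == '\\d+'
print(same)  # True

# r'\b' is two characters (backslash, b); '\b' is one backspace character
print(len(r'\b'), len('\b'))  # 2 1

text = 'and sandy'
with_raw = re.findall(r'\band\b', text)    # word boundaries around 'and'
without_raw = re.findall('\band\b', text)  # looks for backspace characters
print(with_raw, without_raw)
```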




Refer to the official Python documentation for more information about regular expressions:

https://docs.python.org/3/library/re.html#regular-expression-syntax