* Function:   Description
* findall:    Returns a list containing all matches
* search:     Returns a Match object if there is a match anywhere in the string, or None if no position in the string matches the pattern
* split:      Returns a list where the string has been split at each match
* sub:        Replaces one or many matches with a string

# Detect Floating Point Number

In [3]:
import re

In [7]:
txt = "The rain in Spain"
x = re.search("^The.*Spain$", txt)
y = re.search("Spai", txt)

In [8]:
x

<re.Match object; span=(0, 17), match='The rain in Spain'>

In [9]:
y

<re.Match object; span=(12, 16), match='Spai'>

In [14]:
z = re.findall("Spais", txt)

In [15]:
z

[]

In [17]:
re.search(r"\s", txt) # The first white-space character is located at position 3

<re.Match object; span=(3, 4), match=' '>

In [18]:
re.split(r"\s", txt, maxsplit=1) # Split the string only at the first occurrence

['The', 'rain in Spain']

In [19]:
re.sub(r"\s", "9", txt)

'The9rain9in9Spain'

In [21]:
re.sub(r"\s", "9", txt, count=2) # Replace the first 2 occurrences

'The9rain9in Spain'

The Match object has attributes and methods for retrieving information about the search and its result:

* .span() returns a tuple containing the start and end positions of the match (the end position is exclusive).
* .string returns the string passed into the function.
* .group() returns the part of the string where there was a match.

In [23]:
y.span()

(12, 16)

In [24]:
txt = "The rain in Spain"
x1 = re.search(r"\bS\w+", txt)
print(x1.string)

The rain in Spain


In [25]:
x1

<re.Match object; span=(12, 17), match='Spain'>

In [26]:
x1.group()

'Spain'

re.match() only matches at the beginning of the string.
It returns a Match object, or None if the string does not match the pattern.

In [27]:
bool(re.search(r"ly","similarly"))

True

In [28]:
bool(re.match(r"ly","ly should be in the beginning"))

True

In [29]:
bool(re.match(r"ly","similarly"))

False

In [30]:
isinstance(-1.00,float)

True

Dot is a metacharacter in RegEx - it is used to match any character.

You need to escape it with \ when you want to match a literal dot.

In [31]:
bool(re.match(r'^[-+]?[0-9]*\.[0-9]+$','4.0O0'))

False

^ anchors the match at the start of the string.

[-+]? allows an optional leading - or +.

[0-9] matches any single digit from 0 to 9.

'*' means the item it follows (here [0-9]) may repeat any number of times, including zero.

'.' on its own is a placeholder for any character, so the pattern uses '\.' instead; the backslash is the escape character, which makes the dot mean a literal dot.

The second [0-9] again matches a single digit.

'+' means the item it follows (here [0-9]) may repeat any number of times, but must occur at least once.

$ anchors the match at the end of the string.
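
As a minimal sketch of how these pieces combine, the pattern can be wrapped in a small helper. The name `looks_like_float` is hypothetical, and the extra `float()` call is an assumption added here as a sanity check; it is not part of the pattern itself.

```python
import re

# Same pattern as above: optional sign, optional integer digits,
# a literal dot, then at least one fractional digit.
FLOAT_RE = re.compile(r'^[-+]?[0-9]*\.[0-9]+$')

def looks_like_float(text):
    """Return True if text matches the pattern and float() accepts it."""
    if not FLOAT_RE.match(text):
        return False
    try:
        float(text)
    except ValueError:
        return False
    return True

for candidate in ['4.000', '-1.00', '+.5', '12.', '4.0O0', '12']:
    print(candidate, looks_like_float(candidate))
```

The last three candidates fail for different reasons: no fractional digit, a letter O in place of a zero, and no dot at all.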

# Re.split()
re.split() takes the regular expression pattern as its first parameter and the target string as its second.

It splits the string at each occurrence of the pattern. The built-in str.split() does the same for a fixed separator, as the first two examples below show.

In [36]:
s = '100,000,000.000'
s.split(',')

['100', '000', '000.000']

In [35]:
txt = "apple#banana#cherry#orange"
x = txt.split("#")
print(x)

['apple', 'banana', 'cherry', 'orange']


In [37]:
re.split(r"-","+91-011-2711-1111")

['+91', '011', '2711', '1111']

In [45]:
re.split(r',|\.',s)

['100', '000', '000', '000']

In [47]:
re.split(r'[.,]+',s)

['100', '000', '000', '000']
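
Two further behaviours of re.split() are worth a short sketch: a capturing group in the pattern keeps the separators in the result, and `maxsplit` limits the number of splits. The variable names below are only illustrative.

```python
import re

s = '100,000,000.000'

# Without a group, the separators are discarded.
plain = re.split(r'[.,]', s)

# With a capturing group, each separator is kept as its own list item.
with_separators = re.split(r'([.,])', s)

# maxsplit limits how many splits are made, counting from the left.
first_only = re.split(r'[.,]', s, maxsplit=1)

print(plain)
print(with_separators)
print(first_only)
```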

# Group(), Groups() & Groupdict()

In [48]:
# group()
# The regular expression looks for any word that starts with an upper case "S":

txt = "The rain in Spain"
x = re.search(r"\bS\w+", txt)
print(x.group())

Spain


In [51]:
m = re.match(r'(\w+)@(\w+)\.(\w+)','username@hackerrank.com')
m.group(0)       # The entire match 

'username@hackerrank.com'

In [52]:
m.group(1)       # The first parenthesized subgroup.

'username'

In [53]:
m.group(2)       # The second parenthesized subgroup.

'hackerrank'

In [54]:
m.group(3)       # The third parenthesized subgroup.

'com'

In [55]:
m.group(1,2,3)   # Multiple arguments give us a tuple.

('username', 'hackerrank', 'com')

In [57]:
# groups
# A groups() expression returns a tuple containing all the subgroups of the match.

m = re.match(r'(\w+)@(\w+)\.(\w+)','username@hackerrank.com')
m.groups()

('username', 'hackerrank', 'com')

In [59]:
# groupdict()
# A groupdict() expression returns a dictionary containing all the named subgroups of the match, keyed by the subgroup name

m = re.match(r'(?P<user>\w+)@(?P<website>\w+)\.(?P<extension>\w+)','myname@hackerrank.com')
m.groupdict()

{'user': 'myname', 'website': 'hackerrank', 'extension': 'com'}
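
Named groups can also be read one at a time. The sketch below reuses the same pattern; the variable names are illustrative only.

```python
import re

m = re.match(r'(?P<user>\w+)@(?P<website>\w+)\.(?P<extension>\w+)',
             'myname@hackerrank.com')

# A named group can be read by name or by its position.
by_name = m.group('user')
by_index = m.group(1)

# start() and end() also accept a group name.
website_span = (m.start('website'), m.end('website'))

print(by_name, by_index, website_span)
```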

In [67]:
a = '..12345678910111213141516171820212223'
m= re.search(r'([a-zA-Z0-9])\1+', a)
print(m.group(1) if m else -1)

1


In [65]:
a.strip()

'..12345678910111213141516171820212223'

In [69]:
m.groups()

('1',)

# Re.findall() & Re.finditer()

re.findall(pattern, string, flags=0)

Return all non-overlapping matches of pattern in string, as a list of strings. The string is scanned left-to-right, and matches are returned in the order found. If one or more groups are present in the pattern, return a list of groups; this will be a list of tuples if the pattern has more than one group. Empty matches are included in the result.

Pattern.findall(string[, pos[, endpos]])

Similar to the findall() function, using the compiled pattern, but also accepts optional pos and endpos parameters that limit the search region like for search().

In [70]:
re.findall(r'\w','http://www.hackerrank.com/')

['h',
 't',
 't',
 'p',
 'w',
 'w',
 'w',
 'h',
 'a',
 'c',
 'k',
 'e',
 'r',
 'r',
 'a',
 'n',
 'k',
 'c',
 'o',
 'm']
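
The two points from the description above (groups change what findall returns, and a compiled pattern accepts pos and endpos) can be sketched as follows. The sample strings are made up for illustration.

```python
import re

text = 'alice@example.com, bob@example.org'

# One group: findall returns the group's text, not the whole match.
users = re.findall(r'(\w+)@', text)

# Two groups: findall returns one tuple per match.
pairs = re.findall(r'(\w+)@(\w+)', text)

# A compiled pattern can restrict the scanned region with pos and endpos.
digits = re.compile(r'\d+')
numbers = '10 20 30 40'
middle = digits.findall(numbers, 3, 8)   # scans numbers[3:8] == '20 30'

print(users)
print(pairs)
print(middle)
```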

re.finditer()

re.finditer() returns an iterator yielding Match objects over all non-overlapping matches of the pattern in the string.

In [71]:
re.finditer(r'\w','http://www.hackerrank.com/')

<callable_iterator at 0x1a63ef47cf8>

In [75]:
map(lambda x: x.group(),re.finditer(r'\w','http://www.hackerrank.com/'))

<map at 0x1a63ef5c828>
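
Both `finditer` and `map` are lazy, which is why the cells above show only iterator objects. A minimal sketch of consuming them (the variable names are illustrative):

```python
import re

url = 'http://www.hackerrank.com/'

# Wrapping the map in list() forces the lazy iterator to run.
letters = list(map(lambda m: m.group(), re.finditer(r'\w', url)))

# Unlike findall, finditer gives access to positions as well as text.
words = [(m.group(), m.span()) for m in re.finditer(r'\w+', url)]

print(letters[:4])
print(words)
```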

In [91]:
e = 'rabcdeefgyYhFjkIoomnpOeorteeeeet'
v = "aeiou"
c = "qwrtypsdfghjklzxcvbnm"
m = re.findall(r"(?<=[%s])([%s]{2,})[%s]" % (c, v, c), e, flags = re.I) #  re.I (ignore case)
print('\n'.join(m or ['-1']))

ee
Ioo
Oeo
eeeee


In [92]:
m = re.findall(r"(?<=[%s])([%s]{2,})[%s]" % (c, v, c), e) #  without re.I the match is case-sensitive
print('\n'.join(m or ['-1']))

ee
eeeee


# Re.start() & Re.end()

start() & end()

These expressions return the indices of the start and end of the substring matched by the group.

In [93]:
m = re.search(r'\d+','1234')
m.end()

4

In [94]:
m = re.search(r'\d+','1234')
m.start()

0

In [96]:
S= 'aaadaa'
k = 'aa'

In [97]:
n = re.search(k,S)

In [98]:
n.start()

0

In [99]:
n.end()

2

In [112]:
count = 0
n = len(k)
for i in range(len(S)):
    if S[i:i+n]==k:
        count = 1
        print((i,i+n-1))
if count == 0:
    print((-1,-1))    

(0, 1)
(1, 2)
(4, 5)


In [105]:
m = re.search(k, S)
pattern = re.compile(k)
if not m: print("(-1, -1)")
while m:
    print("({0}, {1})".format(m.start(),m.end()-1))
    m = pattern.search(S,m.start()+1)

(0, 1)
(1, 2)
(4, 5)


In [106]:
pattern

re.compile(r'aa', re.UNICODE)
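
A third way to report overlapping occurrences, offered here as an alternative sketch to the two loops above, is to wrap the pattern in a lookahead. Using `re.escape` is an assumption that `k` should be treated as literal text.

```python
import re

S = 'aaadaa'
k = 'aa'

# A lookahead consumes no characters, so every start position is tested
# and overlapping occurrences are all reported.
pattern = re.compile('(?=%s)' % re.escape(k))
spans = [(m.start(), m.start() + len(k) - 1) for m in pattern.finditer(S)]

for span in spans or [(-1, -1)]:
    print(span)
```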

# Regex Substitution
re.sub(pattern, repl, string, count=0, flags=0)

re.sub() returns the modified string. re.subn() takes the same arguments and returns a tuple of the modified string and the number of substitutions made.

In [117]:
#Squaring numbers
def square(match):
    number = int(match.group(0))
    return str(number**2)

re.sub(r"\d+", square, "1 2 3 4 5 6 7 8 9")

'1 4 9 16 25 36 49 64 81'
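
For comparison, a short sketch of re.subn(), which takes the same arguments as re.sub() and also reports how many replacements were made:

```python
import re

txt = "The rain in Spain"

# re.sub returns only the new string.
replaced = re.sub(r"\s", "9", txt)

# re.subn returns (new_string, number_of_substitutions_made).
replaced_n, n_subs = re.subn(r"\s", "9", txt)

print(replaced)
print(replaced_n, n_subs)
```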

In [115]:
html = """
<head>
<title>HTML</title>
</head>
<object type="application/x-flash" 
  data="your-file.swf" 
  width="0" height="0">
  <!-- <param name="movie"  value="your-file.swf" /> -->
  <param name="quality" value="high"/>
</object>
"""

print(re.sub("(<!--.*?-->)", "", html)) #remove comment


<head>
<title>HTML</title>
</head>
<object type="application/x-flash" 
  data="your-file.swf" 
  width="0" height="0">
  
  <param name="quality" value="high"/>
</object>



In [118]:
eg = '''11
a = 1;
b = input();

if a + b > 0 && a - b < 0:
    start()
elif a*b > 10 || a/b < 1:
    stop()
print set(list(a)) | set(list(b)) 
#Note do not change &&& or ||| or & or |
#Only change those '&&' which have space on both sides.
#Only change those '|| which have space on both sides.'''


re.sub(r'(?<= )(&&|\|\|)(?= )', lambda x: 'and' if x.group() == '&&' else 'or',eg)

"11\na = 1;\nb = input();\n\nif a + b > 0 and a - b < 0:\n    start()\nelif a*b > 10 or a/b < 1:\n    stop()\nprint set(list(a)) | set(list(b)) \n#Note do not change &&& or ||| or & or |\n#Only change those '&&' which have space on both sides.\n#Only change those '|| which have space on both sides."

# Validating Roman Numerals
re.match(pattern, string, flags=0)

If zero or more characters at the beginning of string match the regular expression pattern, return a corresponding match object. Return None if the string does not match the pattern; note that this is different from a zero-length match.

In [119]:
thousand = 'M{0,3}'
hundred = '(C[MD]|D?C{0,3})'
ten = '(X[CL]|L?X{0,3})'
digit = '(I[VX]|V?I{0,3})'
print(bool(re.match(thousand + hundred + ten + digit + '$', 'CDXXI')))

True
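
Every part of this pattern is optional, so it also matches the empty string. The sketch below wraps the same pattern in a hypothetical helper, `is_roman`, that rejects empty input and uses `fullmatch` in place of the trailing `$`.

```python
import re

ROMAN_RE = re.compile(
    'M{0,3}'               # thousands: 0 to 3000
    '(C[MD]|D?C{0,3})'     # hundreds: 900, 400, or 0 to 800
    '(X[CL]|L?X{0,3})'     # tens: 90, 40, or 0 to 80
    '(I[VX]|V?I{0,3})'     # units: 9, 4, or 0 to 8
)

def is_roman(text):
    """True for a non-empty numeral from I to MMMCMXCIX."""
    # Every part of the pattern is optional, so reject '' explicitly.
    return bool(text) and ROMAN_RE.fullmatch(text) is not None

for numeral in ['CDXXI', 'MCMXCIV', '', 'IIII', 'MMMM']:
    print(repr(numeral), is_roman(numeral))
```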


Python offers two different primitive operations based on regular expressions: re.match() checks for a match only at the beginning of the string, while re.search() checks for a match anywhere in the string (this is what Perl does by default).

# Validating phone numbers

In [121]:
if re.match(r'[789]\d{9}$','9587456281'):   
    print('YES')  
else:  
    print('NO')  

YES
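
One detail of the check above: `$` also matches just before a trailing newline, so input read with its line ending still passes. A minimal sketch of a stricter variant using `fullmatch` (the helper name `is_valid_mobile` is hypothetical):

```python
import re

PHONE_RE = re.compile(r'[789]\d{9}')

def is_valid_mobile(number):
    # fullmatch requires the pattern to cover the entire string,
    # so a trailing newline is rejected as well.
    return PHONE_RE.fullmatch(number) is not None

print(is_valid_mobile('9587456281'))
print(is_valid_mobile('9587456281\n'))
print(bool(re.match(r'[789]\d{9}$', '9587456281\n')))
```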


# Validating and Parsing Email Addresses

In [122]:
import email.utils

In [123]:
email.utils.parseaddr('DOSHI <DOSHI@hackerrank.com>')

('DOSHI', 'DOSHI@hackerrank.com')

In [124]:
email.utils.formataddr(('DOSHI', 'DOSHI@hackerrank.com'))

'DOSHI <DOSHI@hackerrank.com>'

In [125]:
y ='<dexter@hotmail.com>'

In [126]:
m = re.match(r'<[A-Za-z](\w|-|\.|_)+@[A-Za-z]+\.[A-Za-z]{1,3}>', y)
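
The parsing and validating steps can be combined. The following is a sketch only: `valid_address` is a hypothetical helper, and the pattern is the one above with the angle brackets removed, because `parseaddr` already strips them.

```python
import re
import email.utils

EMAIL_RE = re.compile(r'[A-Za-z](\w|-|\.|_)+@[A-Za-z]+\.[A-Za-z]{1,3}')

def valid_address(line):
    """Return the formatted address if it passes the pattern, else None."""
    name, addr = email.utils.parseaddr(line)
    if EMAIL_RE.fullmatch(addr):
        return email.utils.formataddr((name, addr))
    return None

print(valid_address('DEXTER <dexter@hotmail.com>'))
print(valid_address('DEXTER <1dexter@hotmail.com>'))
```

The second address is rejected because the pattern requires the user name to start with a letter.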

# Hex Color Code

(?<!^)(#(?:[\da-f]{3}){1,2})

In [131]:
import re
i='''
11
#BED
{
    color: #FfFdF8; background-color:#aef;
    font-size: 123px;
    background: -webkit-linear-gradient(top, #f9f9f9, #fff);
}
#Cab
{
    background-color: #ABC;
    border: 2px dashed #fff;
}   
'''
re.findall(r'[\s:](#[a-fA-F0-9]{3,6})',i)

['#BED', '#FfFdF8', '#aef', '#f9f9f9', '#fff', '#Cab', '#ABC', '#fff']
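
The quantifier `{3,6}` also accepts codes with 4 or 5 digits. A hedged sketch of a stricter alternative, which repeats a group of three hex digits once or twice; the sample CSS string is made up for illustration:

```python
import re

css = 'color: #abcd; border: #fff; fill: #A1B2C3;'

# 3 to 6 hex digits: also accepts the 4-digit '#abcd'.
loose = re.findall(r'#[0-9a-fA-F]{3,6}\b', css)

# Exactly 3 or exactly 6 hex digits.
strict = re.findall(r'#(?:[0-9a-fA-F]{3}){1,2}\b', css)

print(loose)
print(strict)
```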

# HTML Parser - Part 1

In [140]:
from html.parser import HTMLParser

# create a subclass and override the handler methods
class MyHTMLParser(HTMLParser):
    def handle_starttag(self, tag, attrs): # This method is called to handle the start tag of an element.
        print("Found a start tag  :", tag) # The attrs argument is a list of (name, value) pairs containing the attributes found inside the tag’s <> brackets.
    def handle_endtag(self, tag): # This method is called to handle the end tag of an element. 
        print("Found an end tag   :", tag)
    def handle_startendtag(self, tag, attrs): # This method is called to handle the empty tag of an element. (For example: <br />)
        print("Found an empty tag :", tag)

# instantiate the parser and feed it some HTML
parser = MyHTMLParser()
parser.feed("<html><head><title>HTML Parser - I</title></head>"
            +"<body><h1>HackerRank</h1><br /></body></html>")

Found a start tag  : html
Found a start tag  : head
Found a start tag  : title
Found an end tag   : title
Found an end tag   : head
Found a start tag  : body
Found a start tag  : h1
Found an end tag   : h1
Found an empty tag : br
Found an end tag   : body
Found an end tag   : html
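
The `attrs` argument is not used in the parser above. A small sketch of reading it; the class name `AttrPrinter` and the `collected` list are hypothetical additions for illustration:

```python
from html.parser import HTMLParser

class AttrPrinter(HTMLParser):
    def __init__(self):
        super().__init__()
        self.collected = []   # (tag, attribute name, attribute value)

    def handle_starttag(self, tag, attrs):
        print("Start tag:", tag)
        # attrs is a list of (name, value) pairs
        for name, value in attrs:
            print("  attr:", name, "=", value)
            self.collected.append((tag, name, value))

parser = AttrPrinter()
parser.feed('<a href="https://www.hackerrank.com" class="link">HackerRank</a>')
```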


# HTML Parser - Part 2
.handle_comment(data)

This method is called when a comment is encountered (e.g. <!--comment-->).

.handle_data(data)

This method is called to process arbitrary data (e.g. text nodes and the content of <script>...</script> and <style>...</style>).
The data argument is the text content of HTML.

In [142]:
class MyHTMLParser(HTMLParser):
    def handle_comment(self, data): 
        print("Comment  :", data)

In [143]:
class MyHTMLParser(HTMLParser):
    def handle_data(self, data):
        print("Data     :", data)
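
The two handlers can live in one subclass. A minimal sketch that also feeds some HTML; the class name `CommentDataParser` and the `found` list are hypothetical:

```python
from html.parser import HTMLParser

class CommentDataParser(HTMLParser):
    def __init__(self):
        super().__init__()
        self.found = []   # (kind, text) in the order encountered

    def handle_comment(self, data):
        self.found.append(("Comment", data))
        print("Comment  :", data)

    def handle_data(self, data):
        self.found.append(("Data", data))
        print("Data     :", data)

parser = CommentDataParser()
parser.feed("<!--a comment--><p>Hello</p>")
parser.close()
```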

# UID 

In [4]:
N = 2
a = '''
B1CD102354
B1CDEF2354
'''

In [5]:
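import re

for uid in a.split()[:N]:        # check each UID on its own line
    u = ''.join(sorted(uid))     # sorting puts equal characters side by side
    try:
        assert re.search(r'[A-Z]{2}', u)          # at least 2 uppercase letters
        assert re.search(r'\d\d\d', u)            # at least 3 digits
        assert not re.search(r'[^a-zA-Z0-9]', u)  # alphanumeric characters only
        assert not re.search(r'(.)\1', u)         # no repeated character
        assert len(u) == 10                       # exactly 10 characters
    except AssertionError:
        print('Invalid')
    else:
        print('Valid')

Invalid
Valid
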


# Credit Card

In [None]:
import re 
TESTER = re.compile(
    r"^"
    r"(?!.*(\d)(-?\1){3})"
    r"[456]"
    r"\d{3}"
    r"(?:-?\d{4}){3}"
    r"$")
i = '4123456789123456'
print("Valid" if TESTER.search(i) else "Invalid")

In [None]:
TESTER
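
The cells above show no output, so here is a sketch that runs the same pattern on a few sample numbers. The helper `check` and the sample inputs are illustrative assumptions.

```python
import re

TESTER = re.compile(
    r"^"
    r"(?!.*(\d)(-?\1){3})"   # no digit 4+ times in a row (hyphens ignored)
    r"[456]"                 # first digit is 4, 5, or 6
    r"\d{3}"                 # rest of the first group of four
    r"(?:-?\d{4}){3}"        # three more groups, each with an optional hyphen
    r"$")

def check(number):
    return "Valid" if TESTER.search(number) else "Invalid"

for number in ['4123456789123456', '5123-4567-8912-3456',
               '4424444424442444', '5122-2368-7954 - 3214']:
    print(number, check(number))
```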

# Postal Codes

In [None]:
P = '110000'
regex_integer_in_range = r'^[1-9][\d]{5}$'
regex_alternating_repetitive_digit_pair = r'(\d)(?=\d\1)
print(bool(re.match(regex_integer_in_range, P)) 
and len(re.findall(regex_alternating_repetitive_digit_pair, P)) < 2)
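
To show what the lookahead counts, the sketch below prints the captured digits for a few codes. The helper `is_valid_postal_code` and the sample codes are illustrative assumptions.

```python
import re

regex_integer_in_range = r'^[1-9]\d{5}$'
# A digit followed by any digit and then the same digit again.
# The lookahead consumes nothing, so overlapping pairs are all counted.
regex_alternating_repetitive_digit_pair = r'(\d)(?=\d\1)'

def is_valid_postal_code(code):
    in_range = bool(re.match(regex_integer_in_range, code))
    pairs = re.findall(regex_alternating_repetitive_digit_pair, code)
    return in_range and len(pairs) < 2

for code in ['121426', '552523', '110000']:
    print(code,
          re.findall(regex_alternating_repetitive_digit_pair, code),
          is_valid_postal_code(code))
```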