## Collection of Interesting Problems from Q&A Forum

In [1]:
import re # python regex module

### Question 1 : How to extract unit number from postal address by adil labiad

Link:  
https://www.udemy.com/course/python-regular-expressions/learn/#questions/17554094/  

*** input ***  
5035 68th St  
3310 B Wendy Woods Ln  
117-2555 A Branch Rd  
RC-123-998A Nowell St W  
1 rue de la fontaine  
333 RUE de la fontaine  
   
*** output ***  
5035  
3310 B  
117-2555 A  
RC-123-998A  
1  
333  

### Solution

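- In the example above, the unit number starts with the first word on each line: `^\w+`.
- That word may be followed by dash-separated parts such as `-2555` or `-123-998A`: `(?:[-\w]+)?` matches an optional run of dashes and word characters.
- It may end with an optional single-letter code after a space, such as ` B`: `(?:\s[a-z]\b)?`.
- The inline flags `(?im)` turn on case-insensitive and multi-line matching, which is convenient for interactive testing.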

In [2]:
pattern = r"(?im)^\w+(?:[-\w]+)?(?:\s[a-z]\b)?"

In [3]:
text = '''5035 68th St
3310 B Wendy Woods Ln
117-2555 A Branch Rd
RC-123-998A Nowell St W
1 rue de la fontaine
333 RUE de la fontaine'''

In [4]:
# iterate over every match; group(0) is the full matched text
match_iter = re.finditer(pattern, text)

print('Matches')
for match in match_iter:
    print('  ', match.group(0))

Matches
   5035
   3310 B
   117-2555 A
   RC-123-998A
   1
   333
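
The bullet points above can also be written out as a commented pattern. The following is a minimal sketch, not the original answer: it uses `re.VERBOSE` to document each part, and it assumes a slightly stricter variant of the pattern (`(?:-\w+)*` instead of `(?:[-\w]+)?`, and `[ \t]` instead of `\s` so the optional letter code cannot be picked up from the next line). The name `UNIT_PATTERN` is made up for this illustration.

```python
import re

# Hypothetical stricter variant of the pattern above, written in verbose mode.
UNIT_PATTERN = re.compile(
    r"""
    ^\w+                # first word at the start of a line
    (?:-\w+)*           # zero or more dash-plus-word parts
    (?:[ \t][a-z]\b)?   # optional single-letter code after a space or tab
    """,
    re.IGNORECASE | re.MULTILINE | re.VERBOSE,
)

addresses = """5035 68th St
3310 B Wendy Woods Ln
117-2555 A Branch Rd
RC-123-998A Nowell St W
1 rue de la fontaine
333 RUE de la fontaine"""

# Collect the full text of every match, one per address line.
units = [m.group(0) for m in UNIT_PATTERN.finditer(addresses)]
print(units)
```

On the sample addresses this variant produces the same six values as the original pattern. The two differ only on inputs such as a doubled dash, which the original accepts and this one does not.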


### Question 2 : Unusual Behavior When Using FindAll by Israel Carrillo Becerril and Krishna Chaitanya 

https://www.udemy.com/course/python-regular-expressions/learn/#questions/16171160/

This behavior is specific to Python's `re` module.

In the code below, the pattern matches `car` or `carpet`.

However, `findall` returns only `['pet', '']`.

Why does `findall` not return `carpet` and `car` as the matches?

`finditer` correctly returns `carpet` and `car` as the matching values.

What causes this inconsistency?

In [6]:
text = "carpet and car"

pattern = r"car(pet)?"

In [8]:
# findall

print('*** findall ***')
match = re.findall(pattern,text)

if match:
  print(match)
else:
  print("No match")

*** findall ***
['pet', '']


In [10]:
print('*** finditer - correctly matches carpet and car ***')

match_iter = re.finditer(pattern, text)

print ('Matches')
for match in match_iter:
    print('  ', match.group(0))

*** finditer - correctly matches carpet and car ***
Matches
   carpet
   car


### Solution

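Interesting issue. I had to check the documentation for the `re` module to see what is going on:

https://docs.python.org/3/library/re.html

When the pattern contains exactly one capturing group, `findall` returns the text captured by that group instead of the whole match. Here the group `(pet)` captures `pet` for the match `carpet` and captures nothing for the match `car`, which gives `['pet', '']`. The documentation states: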

"The result depends on the number of capturing groups in the pattern. If there are no groups, return a list of strings matching the whole pattern. If there is exactly one group, return a list of strings matching that group. If multiple groups are present, return a list of tuples of strings matching the groups. Non-capturing groups do not affect the form of the result."

Changing the group to a non-capturing group, `(?:pet)`, makes `findall` return the whole matches. The behavior is surprising at first, but it is documented. Here is the updated pattern:

In [11]:
text = "carpet and car"

pattern = r"car(?:pet)?"

In [13]:
# findall

print('*** findall - non capturing groups***')
match = re.findall(pattern,text)

if match:
  print(match)
else:
  print("No match")

*** findall - non capturing groups***
['carpet', 'car']
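
The cases described in the documentation quote can be compared side by side. This is a small sketch using the same `text` as above. The variable names are chosen only for this illustration, and the last line shows one possible way to keep a capturing group while still collecting whole matches through `finditer`.

```python
import re

text = "carpet and car"

# No capturing group: findall returns the whole matches.
no_group = re.findall(r"car(?:pet)?", text)

# Exactly one capturing group: findall returns only that group's text,
# and an empty string where the group did not take part in the match.
one_group = re.findall(r"car(pet)?", text)

# Several capturing groups: findall returns one tuple per match.
two_groups = re.findall(r"(car)(pet)?", text)

# Keeping the capturing group but reading the whole match via finditer.
whole_matches = [m.group(0) for m in re.finditer(r"car(pet)?", text)]

print(no_group)       # ['carpet', 'car']
print(one_group)      # ['pet', '']
print(two_groups)     # [('car', 'pet'), ('car', '')]
print(whole_matches)  # ['carpet', 'car']
```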


### Question 3 (from AWS Machine Learning Course): How to split results returned by a SageMaker endpoint when the results use commas and newlines as separators

Input 1: b"2.3,1.9,15.01,0.95"  
Input 2: b'2.3\n1.9\n15.01\n0.95'  
Input 3: b'2.3,\n1.9\n,15.01,\n0.95'  
  
Output:  
2.3  
1.9  
15.01  
0.95  

In [8]:
import re # python regex module

inputFormats = [b"2.3,1.9,15.01,0.95", 
                b'2.3\n1.9\n15.01\n0.95', 
                b"2.3,\n1.9\n,15.01,\n0.95"]

# pattern matches one or more characters that are neither a digit nor a decimal point
pattern = r'[^0-9.]+'

for s in inputFormats:
    print(re.split(pattern, s.decode()))

['2.3', '1.9', '15.01', '0.95']
['2.3', '1.9', '15.01', '0.95']
['2.3', '1.9', '15.01', '0.95']
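
Splitting on `[^0-9.]+` works for the three inputs shown, but it would also split on a minus sign or an exponent, and it leaves an empty string in the result when the payload starts or ends with a separator. A possible alternative is sketched below, assuming the response contains only numbers separated by commas and whitespace. `parse_predictions` is a hypothetical helper name, not part of any SageMaker API.

```python
import re
from typing import List


def parse_predictions(payload: bytes) -> List[float]:
    """Hypothetical helper: split a response body on commas and whitespace."""
    # Split on runs of commas and whitespace (spaces, tabs, newlines).
    tokens = re.split(r"[,\s]+", payload.decode("utf-8").strip())
    # Drop empty tokens left by a leading or trailing separator.
    return [float(tok) for tok in tokens if tok]


samples = [b"2.3,1.9,15.01,0.95",
           b"2.3\n1.9\n15.01\n0.95",
           b"2.3,\n1.9\n,15.01,\n0.95\n"]

for sample in samples:
    print(parse_predictions(sample))
```

Each sample yields `[2.3, 1.9, 15.01, 0.95]`. Because the split is on separators rather than on non-numeric characters, values such as `-1.5e-3` also survive intact and are converted by `float`.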
