## REGEX (regular expressions)
Regex is used to extract values from strings. In Python this is typically done with the re package.
Regex patterns are typically passed to functions in the re package, which take a pattern and a string and return a string, a list of strings, or a Match object.

Regex Cheat Sheet https://www.debuggex.com/cheatsheet/regex/python

#### Regex functions in the re package
+ findall: 	Returns a list containing all matches
+ search: 	Returns a Match object if there is a match anywhere in the string
+ split: 	Returns a list where the string has been split at each match
+ sub: 	Replaces one or many matches with a string
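
A quick sketch of the four functions applied to one sample string. The string and the variable names are illustrative assumptions, not part of the original notebook.

```python
import re

text = 'a1b22c333'  # hypothetical sample string

# findall: every non-overlapping match, as a list of strings
digits = re.findall(r'\d+', text)

# search: the first match anywhere in the string, or None if there is none
first = re.search(r'\d+', text)

# split: cut the string at each match
pieces = re.split(r'\d+', text)

# sub: replace each match with a replacement string
masked = re.sub(r'\d+', '#', text)

print(digits)         # ['1', '22', '333']
print(first.group())  # 1
print(pieces)         # ['a', 'b', 'c', '']
print(masked)         # a#b#c#
```

Note that `search` returns a Match object rather than a string, so `.group()` is needed to get the matched text.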


\A 	restricts the match to start of string

\Z 	restricts the match to end of string

^ 	restricts the match to start of string (start of each line with the re.MULTILINE flag)

$ 	restricts the match to end of string (end of each line with the re.MULTILINE flag)

\n 	newline character is used as line separator

\b 	restricts the match to start/end of words

\B 	matches wherever \b doesn’t match

| 	combines multiple REs as a conditional OR; each alternative can have independent anchors

. 	Match any character except the newline character 

[] 	Character class, matches one character among many


'*' 	Match zero or more times


'+' 	Match one or more times


? 	Match zero or one times

{m,n} 	Match m to n times (inclusive)

{m,} 	Match at least m times

{,n} 	Match up to n times (including 0 times)

{n} 	Match exactly n times

[aeiou] 	Match any vowel

[^aeiou] 	^ inverts selection, so this matches any character that is not a vowel (consonants, but also digits, spaces and punctuation)

[a-f] 	- defines a range, so this matches any of abcdef characters

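\d 	Match a digit, same as [0-9] with the re.ASCII flag; without it, Python 3 str patterns also match other Unicode digits (the same caveat applies to \w and \s below)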

\D 	Match non-digit, same as [^0-9] or [^\d]

\w 	Match word character, same as [a-zA-Z0-9_]

\W 	Match non-word character, same as [^a-zA-Z0-9_] or [^\w]

\s 	Match whitespace character, same as [\ \t\n\r\f\v]

\S 	Match non-whitespace character, same as [^\ \t\n\r\f\v] or [^\s]


    
    
referenced from https://learnbyexample.github.io/cheatsheet/python/python-regex-cheatsheet/
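
A few of the cheat sheet entries can be checked with a short sketch. The sample strings and variable names below are hypothetical, chosen only to show how anchors, quantifiers and negated character classes behave.

```python
import re

text = 'cat concat\ncatalog'  # hypothetical two-line sample

# \b anchors at word boundaries: only the standalone word 'cat' matches
whole_words = re.findall(r'\bcat\b', text)

# ^ matches at the start of every line only when re.MULTILINE is set
line_starts = re.findall(r'^cat', text, flags=re.MULTILINE)

# \A always anchors at the start of the whole string, even with re.MULTILINE
string_start = re.findall(r'\Acat', text, flags=re.MULTILINE)

# {m,n} quantifier: runs of 2 to 3 digits (greedy, so '4444' yields '444')
runs = re.findall(r'\d{2,3}', '1 22 333 4444')

# negated class: any character that is not a vowel, including space and digit
non_vowels = re.findall(r'[^aeiou]', 'regex 1')

print(whole_words)   # ['cat']
print(line_starts)   # ['cat', 'cat']
print(string_start)  # ['cat']
print(runs)          # ['22', '333', '444']
print(non_vowels)    # ['r', 'g', 'x', ' ', '1']
```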

In [2]:
## Find digits in strings
import re
x = '123test 456'
print(re.findall(r'\d+', x))  # raw string avoids the invalid-escape warning for \d


['123', '456']


In [4]:
## replaces all non-digits with ''
print(re.sub(r'[^\d]', '', x))  # a * inside [] is a literal asterisk, so it is left out

123456


In [10]:
## Extract floating point number
x = 'this is a test 1.23'
# \d+ keeps a multi-digit integer part such as 12.5 in one match
print(re.findall(r'\d+[.]?\d*', x))


['1.23']
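
To see how a float pattern behaves on multi-digit numbers, a small check can help. The sketch below assumes a pattern with `\d+` before the optional decimal point; the sample strings and the name `FLOAT_PATTERN` are hypothetical.

```python
import re

# Assumed pattern: one or more digits, an optional '.', then optional digits
FLOAT_PATTERN = r'\d+[.]?\d*'

samples = ['this is a test 1.23', 'price 12.5 usd', 'no digits here']
results = [re.findall(FLOAT_PATTERN, s) for s in samples]
print(results)  # [['1.23'], ['12.5'], []]

# With a single \d in front, a number such as 12.5 is split into two matches
split_result = re.findall(r'\d[.]?\d*', 'price 12.5 usd')
print(split_result)  # ['12', '5']
```

Neither pattern handles signs, exponents, or numbers written without a leading digit (such as `.5`).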


In [30]:
## Packing the float extraction into a function

x = 'this is a test 1.23'

# most basic function: raises if x is not a string or contains no number
def string_to_float(x):
    output = re.findall(r'\d+[.]?\d*', x)[0]
    return float(output)

# with error handling (redefines, and so replaces, the basic version above)
def string_to_float(x, fill_val=0):
    try:
        output = re.findall(r'\d+[.]?\d*', x)[0]
    except (TypeError, IndexError):  # x is not a string, or no match was found
        output = fill_val
    return float(output)

string_to_float(x)

1.23
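
An alternative sketch of the same idea uses a precompiled pattern with `re.search`, which returns `None` instead of raising when nothing matches. The names `FLOAT_RE` and `first_float` are hypothetical, and the pattern is an assumption rather than the notebook's own.

```python
import re

# Assumed pattern: digits, optionally followed by '.' and more digits
FLOAT_RE = re.compile(r'\d+(?:\.\d+)?')

def first_float(text, fill_val=0.0):
    """Return the first number found in text as a float, else fill_val."""
    if not isinstance(text, str):
        return float(fill_val)
    match = FLOAT_RE.search(text)
    return float(match.group()) if match else float(fill_val)

print(first_float('this is a test 1.23'))  # 1.23
print(first_float('no number'))            # 0.0
print(first_float(None))                   # 0.0
```

Compiling once can help when the same pattern is applied to many strings, and the explicit checks avoid relying on exceptions for control flow.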

#### Applying Regex to a list of Strings
There are four basic ways to apply a regex function to a list of strings:

+ list comprehension: simple and readable, but slow on large lists
+ generators: produce results lazily, one at a time; typically fast when embedded in a function
+ multithreading: threads share one memory space
+ multiprocessing: spins up multiple processes with separate memory spaces; cannot easily be buried inside a function

In [43]:
## List comprehension method
# Good for small data, (typically slow)
my_list = [ 'this is a test 1.23', '1.123x', None]
[string_to_float(x) for x in my_list]

[1.23, 1.123, 0.0]

In [42]:
# Apply to a list as a generator
my_list = ['this is a test 1.23', '1.123x', None]

def string_to_float_gen(input_list, fill_val=0):
    for x in input_list:
        try:
            output = re.findall(r'\d+[.]?\d*', x)[0]
        except (TypeError, IndexError):  # x is not a string, or no match was found
            output = fill_val
        yield float(output)


list(string_to_float_gen(my_list))


[1.23, 1.123, 0.0]

In [40]:
### Apply using the built-in map (runs sequentially in a single thread, not multithreaded)

my_list = [ 'this is a test 1.23', '1.123x', None]
list(map(string_to_float, my_list))



[1.23, 1.123, 0.0]
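
The built-in `map` applies the function one item at a time in a single thread. For an actual thread pool, one option is `concurrent.futures.ThreadPoolExecutor` from the standard library, sketched below. The helper is redefined so the block is self-contained, and `max_workers=4` is an arbitrary assumption.

```python
import re
from concurrent.futures import ThreadPoolExecutor

def string_to_float(x, fill_val=0):
    try:
        output = re.findall(r'\d+[.]?\d*', x)[0]
    except (TypeError, IndexError):  # x is not a string, or no match was found
        output = fill_val
    return float(output)

my_list = ['this is a test 1.23', '1.123x', None]

# executor.map returns results in the same order as the inputs
with ThreadPoolExecutor(max_workers=4) as executor:
    results = list(executor.map(string_to_float, my_list))

print(results)  # [1.23, 1.123, 0.0]
```

For CPU-bound regex work in standard CPython, threads may give little speedup because of the global interpreter lock, so this is mainly useful when the work involves waiting on I/O.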

In [39]:
## Using multiprocessing to apply to a list in parallel
from multiprocessing import Pool, cpu_count

n_cpus = cpu_count()
if __name__ == '__main__':  # guard so worker processes do not re-run this block
    with Pool(n_cpus) as p:  # the pool is shut down when the block exits
        output = p.map(string_to_float, my_list)
    print(output)

[1.23, 1.123, 0.0]
