# Text processing with regular expressions #

In this lecture we will discuss extracting information from strings/pieces of text. By information we mean a substring which satisfies certain specifications, e.g. it represents a number or a valid email address. We will see that this task can quickly get out of hand: capturing all the corner cases leads to code which may be hard to read and maintain. This is where regular expressions enter the stage to save the day.

## String methods
Recall the previously seen methods of `str` objects
```python
str.find
str.index
str.count
str.replace
```
which are suitable for working with specifications given by fixed characters. For example, a simple re-implementation of `os.path.splitext` could read

In [13]:
import os

def splitext(path):
    '''Naive split extension in path'''
    if path.count('.') == 0:
        return None
    # Unpack the list as list[:-1], list[-1]
    *noext, ext = path.split('.')
    # Rejoin with '.' so that dots inside the stem are kept
    return ('.'.join(noext), f'.{ext}')

print(splitext('foo_bar'), os.path.splitext('foo_bar'))
print(splitext('foo.bar'), os.path.splitext('foo.bar'))
print(splitext('foo.bar.txt'), os.path.splitext('foo.bar.txt'))

None ('foo_bar', '')
('foo', '.bar') ('foo', '.bar')
('foo.bar', '.txt') ('foo.bar', '.txt')
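
As a quick refresher, the four `str` methods recalled at the start of this section behave as follows when searching for a fixed substring. This is only a minimal sketch; the sample string `path` is made up for illustration.

```python
# Hypothetical sample string, chosen only for illustration
path = 'data/foo.bar.txt'

# find returns the index of the first occurrence, or -1 if there is none
first_dot = path.find('.')
missing = path.find('#')

# index behaves like find but raises ValueError when nothing is found
try:
    path.index('#')
    index_failed = False
except ValueError:
    index_failed = True

# count gives the number of non-overlapping occurrences
n_dots = path.count('.')

# replace returns a new string; the original is left unchanged
renamed = path.replace('.txt', '.csv')

print(first_dot, missing, index_failed, n_dots, renamed)
```

All four methods look for a literal piece of text; none of them can express a rule such as "any digit".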


### What is a number?
What if the specifications are not as easy to define? In the following we will attempt to write a function which returns a list of numbers contained in a string. Before finding many we should be able to find just one.

In [24]:
# Our first definition motivated by integers
def only_digits(string):
    '''True for a non-empty string made up of the digits 0-9 only'''
    return bool(string) and set(string) <= set('0123456789')

print(only_digits('123a'))
print(only_digits('123'))

# But what about negative numbers
def is_integer(string):
    '''True for a string of digits with an optional leading minus sign'''
    if string.startswith('-'):
        return only_digits(string[1:])
    return only_digits(string)
    
print(is_integer('-123'))

False
True
True


But what about floats such as 0.234, or scientific formatting such as 0.234E-04? We could probably handle all of these after a while, but here we settle for a different approach: a string is a number if it can be converted to one.

In [22]:
def is_number(string):
    '''It is a number if it can be converted'''
    try:
        float(string)
        return True
    except ValueError:
        return False
    
print(is_number('-3.43E-02'))

True
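
Defining a number as "whatever `float` converts" is convenient, but it also inherits everything `float` is willing to accept. The sketch below repeats `is_number` so that it runs on its own and probes a few sample strings; the lists `accepted` and `rejected` are chosen here for illustration.

```python
def is_number(string):
    '''It is a number if float() can convert it'''
    try:
        float(string)
        return True
    except ValueError:
        return False

# Strings that float() accepts, some of them perhaps unexpectedly:
# surrounding whitespace, 'nan', 'inf' and digit-group underscores all pass
accepted = ['42', '-3.43E-02', '.5', '5.', ' 12 ', 'nan', 'inf', '1_000']

# Strings that float() rejects
rejected = ['', '-', '12a', '1,5', '0x1F', '42.234E']

for s in accepted + rejected:
    print(repr(s), is_number(s))
```

Whether `' 12 '` or `'nan'` should count as a number depends on the application, so this definition may be too generous for some uses.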


So far we have considered strings that were either a number or not. But what if the strings are more involved? For example, let us consider a string encoding wind speed and temperature, `'wind27temperature32'`. We don't expect
```python
is_number('wind27temperature32')
```
to work but we could start building a solution on this function.

In [26]:
is_number('wind27temperature32')

False

In [30]:
def find_numbers(string):
    numbers = []
    first, n = 0, len(string)
    while first < n:
        maybe = string[first:first+1]
        # If we have a number, grow it as much as possible
        if is_number(maybe):
            last = first
            while last < n and is_number(string[first:last+1]):
                maybe = string[first:last+1]
                last += 1
            numbers.append(maybe)
            first = last
        else:
            first += 1
            
    return numbers
print(find_numbers('wind27temperature32'))
print(find_numbers('wind27temperature42.23423'))

['27', '32']
['27', '42.23423']


So this looks promising, but what about negative temperatures and that scientific formatting again?

In [31]:
print(find_numbers('wind27temperature-42.234E-23'))


['27', '42.234', '23']


Not quite what we want. The problem is that we grow matches from the smallest strings: the `-` is ignored because on its own it is not a number, and we never reach `42.234E-23` because `is_number('42.234E')` is `False`, which stops the growth. We are back to the drawing board.
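
To see why growing a match one character at a time fails, it helps to list which prefixes of the full number pass `is_number`. This is a small sketch that repeats `is_number` so it runs on its own; the names `prefixes`, `valid` and `invalid` are introduced here for illustration.

```python
def is_number(string):
    '''It is a number if float() can convert it'''
    try:
        float(string)
        return True
    except ValueError:
        return False

text = '-42.234E-23'

# All prefixes of text, from the shortest to the full string
prefixes = [text[:k] for k in range(1, len(text) + 1)]

valid = [p for p in prefixes if is_number(p)]
invalid = [p for p in prefixes if not is_number(p)]

print('valid:  ', valid)
print('invalid:', invalid)
```

The full string is a valid number, but it can only be reached by passing through the invalid prefixes `'-'`, `'-42.234E'` and `'-42.234E-'`, so a search that stops at the first failure never finds it.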

In [34]:
def get_substrings(string):
    '''Generate substrings of string ordered by descending size'''
    substrings = []
    n = len(string)
    for subl in range(n-1, 0, -1):
        for i in range(n+1-subl):
            substrings.append(string[i:i+subl])
    return substrings


def find_numbers1(string):
    '''Match from largest'''
    if not string:
        return []

    if is_number(string):
        return [string]

    numbers = []
    for substring in get_substrings(string):
        # print(substring, is_number(substring))
        # For a match we clip the string and ask for matches in the neighbors
        if is_number(substring):
            numbers.append(substring)
            # print('->', substring, string[:string.index(substring)], string[string.index(substring)+len(substring):])
            numbers.extend(find_numbers1(string[:string.index(substring)]))
            numbers.extend(find_numbers1(string[string.index(substring)+len(substring):]))
            break
    return numbers
            
print(find_numbers1('wind27temperature-42.234E-23'))
print(find_numbers1('wind27temperature-42.234E-23 wind27temperature-42.234e123'))

['-42.234E-23', '27']
['-42.234E-23 ', '27', '-42.234e123', '27']


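This seems to work, up to two blemishes visible in the output: the numbers are not listed in order of appearance, and the first match in the second example carries a trailing space because `float` ignores surrounding whitespace. While it may have been fun to write this code, you should have a feeling that [There must be a better way](https://fullstackfeed.com/there-must-be-a-better-way-raymond-hettinger-python/).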
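
Part of the unease with the code above is its cost. Before any recursion happens, `get_substrings` already produces every proper substring, and each one is passed to `float`. The sketch below repeats `get_substrings` and counts the candidates for one of the example strings; the closed-form count is derived here from the two loops and is not part of the original notes.

```python
def get_substrings(string):
    '''Generate substrings of string ordered by descending size'''
    substrings = []
    n = len(string)
    for subl in range(n-1, 0, -1):
        for i in range(n+1-subl):
            substrings.append(string[i:i+subl])
    return substrings

text = 'wind27temperature-42.234E-23'
n = len(text)
candidates = get_substrings(text)

# Lengths 1..n-1 contribute n, n-1, ..., 2 substrings respectively,
# which sums to n(n+1)/2 - 1
print(n, len(candidates), n*(n+1)//2 - 1)
```

The number of candidates grows quadratically with the length of the string, and the recursive calls on the left and right remainders add to that.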

# Regular expressions
- sequence of characters that defines a search pattern
- the pattern is parsed/interpreted in a special way (similar to a programming language)
- they are not unique to Python, see [RegEx](https://regex101.com/)
- in Python the functionality for RegEx is provided in the [re](https://docs.python.org/3/library/re.html) module 

Here we will demonstrate the basics of regex syntax using the top-level function `re.search`, and return to more of the module's functions later
```python
re.search(pattern:str, string:str) -> re.Match or None
```
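
As a first taste, here is a minimal sketch of calling `re.search`. The pattern `\d+` (one or more digits) is used only as a preview of the syntax, and the input strings are picked for illustration.

```python
import re

# \d+ means "one or more digits"; the raw string r'...' keeps the backslash intact
match = re.search(r'\d+', 'wind27temperature32')

# re.search reports only the first match, scanning from the left
print(match.group())   # the matched text
print(match.span())    # (start, end) indices of the match in the string

# When nothing matches, re.search returns None instead of a match object
no_match = re.search(r'\d+', 'no digits here')
print(no_match)
```

Because a failed search gives `None`, the result is usually checked, e.g. with `if match:`, before calling `group()` on it.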

## Specifying the pattern

## Additional Resources ##

Videos from Simon Funke from previous editions of the course

- [Regex 1](https://www.youtube.com/watch?v=ma93hpNFXZM)
- [Regex 2](https://www.youtube.com/watch?v=B6XoKtQA2Fc)
- [Regex 3](https://www.youtube.com/watch?v=jxNLY0L_N78)
- [Regex 4](https://www.youtube.com/watch?v=j1jW5EF5jfs)

RegEx [golf](https://nbviewer.org/url/norvig.com/ipython/xkcd1313.ipynb) by Google's Peter Norvig.