### Text parsing examples

This notebook demonstrates methods from the Python standard library to:

+ find specific text within a larger text string
+ replace the text with something else

Methods on Python `str` objects are used first, for exact matches.
Regular expressions from the `re` module then allow matching on patterns.

The [pythex](https://pythex.org/) website is a handy interactive tester and quick reference for regular expressions.

Beware: regular expressions are notoriously fiddly and can make a problem harder rather than easier (see https://xkcd.com/1171/).

For further reading, see the [Real Python Regular Expression tutorial](https://realpython.com/regex-python/).

In [2]:
from pathlib import Path
import re

In [3]:
# Simple text string
EMAIL_ADDRESS = "user123@example.com"

# A Volcanic Ash Advisory notice is a more complicated text string
VAA_TEXT = Path('VAA_EXAMPLE.DAT').read_text()
print(VAA_TEXT)

FVFE01 RJTD 090552                                              2014068 0553
VA ADVISORY
DTG: 20140309/0552Z
VAAC: TOKYO
VOLCANO: SAKURAJIMA 0802-08
PSN: N3135E13040
AREA: JAPAN
SUMMIT ELEV: 1060M
ADVISORY NR: 2014/90
INFO SOURCE: MTSAT-2
AVIATION COLOUR CODE: NIL
ERUPTION DETAILS: VA CONTINUOUSLY OBS ON SATELLITE IMAGERY.
OBS VA DTG: 09/0515Z
OBS VA CLD: SFC/FL120 N3105 E13115 - N3125 E13150 - N3115 E13210 -
N3
130 E13235 - N3115 E13245 - N3055 E13205 - N3100 E13150 - N3050
E1312
0 MOV SE 25KT
FCST VA CLD +6 HR: 09/1115Z SFC/FL110 N3010 E13225 - N3115 E13730 -
N
2945 E13730 - N2900 E13500 - N2900 E13230
FCST VA CLD +12 HR: 09/1715Z SFC/FL090 N2830 E13350 - N2835 E13720 -
N3030 E14105 - N2855 E14150 - N2705 E13905 - N2700 E13400
FCST VA CLD +18 HR: 09/2315Z SFC/FL080 N2735 E14035 - N2950 E14440 -
N2820 E14555 - N2545 E14200 - N2455 E13455 - N2620 E13455
RMK: NIL
NXT ADVISORY: 20140309/1200Z=



### String methods

In [5]:
# Check that text exists
"user" in EMAIL_ADDRESS

True

In [4]:
# Check where it is found
EMAIL_ADDRESS.find("example")

8
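
When the substring is absent, `str.find` returns `-1` rather than raising an error, so the result should be checked before it is used as an index. The sketch below contrasts it with `str.index`, which raises `ValueError` instead. The search term `"missing"` is only an illustration.

```python
EMAIL_ADDRESS = "user123@example.com"

# str.find returns -1 when the substring is not present
print(EMAIL_ADDRESS.find("missing"))

# str.index performs the same search but raises ValueError on failure
try:
    EMAIL_ADDRESS.index("missing")
except ValueError as err:
    print(f"index failed: {err}")
```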

In [5]:
# Use .casefold on both sides for a case-insensitive comparison
"USER".casefold() in EMAIL_ADDRESS.casefold()

True

In [6]:
# Use .replace to replace text
EMAIL_ADDRESS.replace("example.com", "bgs.ac.uk")

'user123@bgs.ac.uk'

In [14]:
# Use split to break text into lines
for line in VAA_TEXT.split('\n'):
    print(line)

FVFE01 RJTD 090552                                              2014068 0553
VA ADVISORY
DTG: 20140309/0552Z
VAAC: TOKYO
VOLCANO: SAKURAJIMA 0802-08
PSN: N3135E13040
AREA: JAPAN
SUMMIT ELEV: 1060M
ADVISORY NR: 2014/90
INFO SOURCE: MTSAT-2
AVIATION COLOUR CODE: NIL
ERUPTION DETAILS: VA CONTINUOUSLY OBS ON SATELLITE IMAGERY.
OBS VA DTG: 09/0515Z
OBS VA CLD: SFC/FL120 N3105 E13115 - N3125 E13150 - N3115 E13210 -
N3
130 E13235 - N3115 E13245 - N3055 E13205 - N3100 E13150 - N3050
E1312
0 MOV SE 25KT
FCST VA CLD +6 HR: 09/1115Z SFC/FL110 N3010 E13225 - N3115 E13730 -
N
2945 E13730 - N2900 E13500 - N2900 E13230
FCST VA CLD +12 HR: 09/1715Z SFC/FL090 N2830 E13350 - N2835 E13720 -
N3030 E14105 - N2855 E14150 - N2705 E13905 - N2700 E13400
FCST VA CLD +18 HR: 09/2315Z SFC/FL080 N2735 E14035 - N2950 E14440 -
N2820 E14555 - N2545 E14200 - N2455 E13455 - N2620 E13455
RMK: NIL
NXT ADVISORY: 20140309/1200Z=
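
Splitting on `'\n'` leaves a trailing empty string when the text ends with a newline, and keeps any `'\r'` from Windows-style line endings. `str.splitlines` may be a more forgiving alternative. The short `sample` string below is made up for illustration and is not the contents of `VAA_EXAMPLE.DAT`.

```python
# Hypothetical sample with Windows-style line endings
sample = "VA ADVISORY\r\nVAAC: TOKYO\r\n"

# split('\n') keeps the '\r' characters and a trailing empty string
print(sample.split('\n'))

# splitlines() recognises '\r\n' and drops the trailing empty entry
print(sample.splitlines())
```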



#### Exercise

+ Write a function, `extract_vaac`, to extract the VAAC from the VAA text.
  <details><summary>Hint</summary>
  Split twice, first on newlines, then on `:`
  </details>
+ Confirm the function works by running `assert extract_vaac(VAA_TEXT) == "TOKYO"`

### Regular expressions

In [12]:
# Simple match

match = re.search(r'user123', EMAIL_ADDRESS)
if match:
    print(match.group())

user123


In [None]:
# Simple replace (substitute); the dot is escaped so that it matches a literal "."
re.sub(r'example\.com', 'bgs.ac.uk', EMAIL_ADDRESS)

'user123@bgs.ac.uk'
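
In a pattern, an unescaped `.` matches any character, so a pattern such as `example.com` also matches look-alike text. The sketch below shows the difference using a made-up address (`lookalike` is a hypothetical value), and uses `re.escape` to build a literal pattern from plain text.

```python
import re

# Hypothetical address where "exampleXcom" is not the domain we mean
lookalike = "user123@exampleXcom.org"

# Unescaped dot: "X" is accepted in place of "."
loose = re.sub(r'example.com', 'bgs.ac.uk', lookalike)
print(loose)

# Escaped dot: only a literal "." matches, so nothing is replaced
strict = re.sub(r'example\.com', 'bgs.ac.uk', lookalike)
print(strict)

# re.escape escapes special characters in plain text for use as a pattern
literal_pattern = re.escape("example.com")
print(re.sub(literal_pattern, 'bgs.ac.uk', "user123@example.com"))
```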

In [15]:
# Special characters and groups (see pythex.org)
# . - any character
# \d - digit
# \s - whitespace
# \w - word character
# \. - literal .
# ? - zero or one
# + - one or more
# * - zero or more
# ^ - start of string
# $ - end of string
match = re.search(r'\d+', EMAIL_ADDRESS)
if match:
    print(match.group())

match = re.search(r'(.*)@(.*)', EMAIL_ADDRESS)
if match:
    print(match.groups())
    username, domain = match.groups()


123
('user123', 'example.com')


In [17]:
# Substitution with groups
# Groups are stored in numbered locations e.g. \1
re.sub(r'(.*)@(.*)', r'user: \1, email: \2', EMAIL_ADDRESS)

'user: user123, email: example.com'
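
Numbered groups become hard to follow as patterns grow. As an alternative sketch, the `re` module also supports named groups with `(?P<name>...)`, which can be referenced in the replacement as `\g<name>`. The group names `username` and `domain` are chosen here for illustration.

```python
import re

EMAIL_ADDRESS = "user123@example.com"
pattern = r'(?P<username>.*)@(?P<domain>.*)'

match = re.search(pattern, EMAIL_ADDRESS)
if match:
    # Access groups by name, or all at once as a dictionary
    print(match.group('username'))
    print(match.groupdict())

# Named groups in the replacement string use \g<name>
result = re.sub(pattern, r'user: \g<username>, email: \g<domain>', EMAIL_ADDRESS)
print(result)
```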

#### Exercise

+ Write a regular expression to extract the VAAC from the VAA text.  You will need the `re.MULTILINE` flag so that `^` and `$` match at the start and end of each line, not just at the start and end of the whole text.
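
To illustrate the effect of `re.MULTILINE` without solving the exercise, the sketch below extracts a different field from a short, made-up three-line `sample` (not the full VAA text). Without the flag, `^` and `$` only anchor to the whole string, so the search fails.

```python
import re

# Hypothetical three-line sample in the same style as the VAA text
sample = "VA ADVISORY\nAREA: JAPAN\nSUMMIT ELEV: 1060M"
pattern = r'^AREA: (.*)$'

# Without the flag, ^ and $ only match at the ends of the whole string
no_flag = re.search(pattern, sample)
print(no_flag)

# With re.MULTILINE, ^ and $ also match at the start and end of each line
with_flag = re.search(pattern, sample, flags=re.MULTILINE)
if with_flag:
    print(with_flag.group(1))
```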

In [21]:
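# findall returns every match as a list; the VAA text contains no
# ISO-style dates (YYYY-MM-DD), so the result here is empty
re.findall(r'\d{4}-\d{2}-\d{2}', VAA_TEXT)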

[]
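
The dates in the VAA text are written as eight digits, a slash, four digits and `Z`. A pattern following that layout does find matches. The sketch below uses a two-line `sample` copied from the `DTG` and `NXT ADVISORY` lines shown earlier, and shows that `findall` returns tuples when the pattern contains groups.

```python
import re

# Two lines copied from the VAA text shown earlier
sample = "DTG: 20140309/0552Z\nNXT ADVISORY: 20140309/1200Z="

# No groups: findall returns the full text of each match
timestamps = re.findall(r'\d{8}/\d{4}Z', sample)
print(timestamps)

# With groups: findall returns a tuple of groups for each match
parts = re.findall(r'(\d{8})/(\d{4})Z', sample)
print(parts)
```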

### Stretch exercise
+ Write a function that takes the observed cloud extent and returns a list of (latitude, longitude) pairs. Example input: `SFC/FL120 N3105 E13115 - N3125 E13150 - N3115 E13210 -
N3
130 E13235 - N3115 E13245 - N3055 E13205 - N3100 E13150 - N3050
E1312`
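
As a possible building block for this exercise, the sketch below converts a single coordinate token to signed decimal degrees. It assumes that tokens are a hemisphere letter followed by degrees and two digits of whole minutes (so `N3105` means 31 degrees 05 minutes north). The helper name `token_to_degrees` is hypothetical. Finding the tokens and rejoining the wrapped lines are left to the exercise.

```python
import re

def token_to_degrees(token):
    """Convert a token such as 'N3105' or 'E13115' to signed decimal degrees.

    Assumes a hemisphere letter, then 2 or 3 digits of degrees,
    then 2 digits of whole minutes.
    """
    match = re.fullmatch(r'([NSEW])(\d{2,3})(\d{2})', token)
    if match is None:
        raise ValueError(f"Unrecognised coordinate token: {token!r}")
    hemisphere, degrees, minutes = match.groups()
    value = int(degrees) + int(minutes) / 60
    # South and west are negative by convention
    return -value if hemisphere in 'SW' else value

print(token_to_degrees('N3105'))
print(token_to_degrees('E13115'))
```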