# String Regex Annotator

The `StringRegexAnnotator` makes it easy to match several complex regular expressions against 
a document and to annotate the whole match and/or the parts of a match that correspond to 
capturing groups. 

It also has a simple macro substitution feature that makes it easy to build more complex regular expressions from
simpler ones. 

In [1]:
import os
from gatenlp import Document
from gatenlp.processing.gazetteer import StringRegexAnnotator

## Creating the Annotator

Similar to the gazetteer annotators, the annotator can be created in several ways: from a file that 
contains the regular expression rules, from a multi-line string that contains the rules (basically the content of a rules file as a string), or from prepared rule objects. Which of these to use is specified with the `source_fmt` parameter of either the constructor or the `append` method. 

#### Create from a string with rules

The following example shows a string that contains a single simple rule which finds a date in ISO format (YYYY-MM-DD) and annotates it with the annotation type "Date".

In [4]:
rules1 = """
|[0-9]{4}-[0-9]{2}-[0-9]{2}
0 => Date
"""

annt1 = StringRegexAnnotator(source=rules1, source_fmt="string")

doc1 = Document("A document that contains a date here: 2013-01-12 and also here: 1999-12-31")

annt1(doc1)
doc1


## The rules file/string format

A rules file must contain one or more rules.

Each rule consists of:
* one or more pattern lines which must start with "|", followed by
* one or more action lines which must start with a comma-separated list of group numbers, followed by "=>", followed by the annotation type to assign, optionally followed by feature assignments.

The action line specifies how an annotation should get created for one or more groups of a matching regular 
expression. 
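
To make the line structure described above concrete, here is a minimal sketch that splits a rules string into pattern lines and action lines. The function `parse_rules` is a hypothetical helper written for illustration only; it is not the parser used by `StringRegexAnnotator`, and it deliberately ignores feature assignments.

```python
def parse_rules(text):
    """Split a rules string into a list of (patterns, actions) pairs.

    Illustration only: hypothetical helper, not the gatenlp parser.
    Each action is a (group_numbers, annotation_type) tuple;
    feature assignments after the annotation type are ignored here.
    """
    rules = []
    patterns, actions = [], []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith("|"):
            # A pattern line that follows action lines starts a new rule.
            if actions:
                rules.append((patterns, actions))
                patterns, actions = [], []
            patterns.append(line[1:])
        elif "=>" in line:
            groups, rest = line.split("=>", 1)
            group_numbers = [int(g) for g in groups.split(",")]
            # The first token after "=>" is the annotation type.
            annotation_type = rest.split(None, 1)[0]
            actions.append((group_numbers, annotation_type))
    if patterns:
        rules.append((patterns, actions))
    return rules


example_rules = """
|[0-9]{4}-[0-9]{2}-[0-9]{2}
0 => Date
"""
parsed = parse_rules(example_rules)
print(parsed)
```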

The simple rules string above contains one rule, with one pattern line and one action line: 

```
|[0-9]{4}-[0-9]{2}-[0-9]{2}
0 => Date
```

The pattern line `|[0-9]{4}-[0-9]{2}-[0-9]{2}` specifies the simple regular expression.

The action line `0 => Date` specifies that an annotation with the annotation type "Date" should get created 
for the match, spanning "group 0". The convention with regular expressions is that "group 0" always refers to 
whatever is matched by the _whole regular expression_. 
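
The meaning of "group 0" can be checked independently of the annotator. The following sketch uses Python's standard `re` module (an assumption made for illustration; it does not call the annotator) to show that group 0 covers the whole match and that its offsets delimit the span an annotation would cover.

```python
import re

text = "A document that contains a date here: 2013-01-12 and also here: 1999-12-31"
pattern = re.compile(r"[0-9]{4}-[0-9]{2}-[0-9]{2}")

# Collect (start offset, end offset, matched text) for group 0 of every match.
matches = [(m.start(0), m.end(0), m.group(0)) for m in pattern.finditer(text)]

for start, end, matched in matches:
    # The offsets of group 0 slice exactly the matched text out of the document.
    print(start, end, matched, text[start:end] == matched)
```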

## Using groups

In addition to group 0, anything within simple parentheses in the regular expression is a "capturing group". Capturing groups are numbered by the position of their opening parenthesis, counting from left to right. For example, the following regular expression has 3 additional groups for the year, month and day parts of the whole ISO date. The rule then refers to the whole matched date via group 0 but also creates annotations of type Year, Month and Day 
for each of the groups:

```
|([0-9]{4})-([0-9]{2})-([0-9]{2})
0 => Date
1 => Year
2 => Month
3 => Day
```


In [7]:
rules2 = """
|([0-9]{4})-([0-9]{2})-([0-9]{2})
0 => Date
1 => Year
2 => Month
3 => Day
"""

annt2 = StringRegexAnnotator(source=rules2, source_fmt="string")

doc2 = Document("A document that contains a date here: 2013-01-12 and also here: 1999-12-31")

annt2(doc2)
doc2
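
Numbering by opening parenthesis matters most when groups are nested. The sketch below uses Python's standard `re` module with a hypothetical variant of the date pattern (the extra outer group around year and month is not part of the rules above) to show which number each group receives.

```python
import re

# Hypothetical variant: an extra group wraps year and month together.
# Opening parentheses from left to right:
#   1 = year-month, 2 = year, 3 = month, 4 = day
pattern = re.compile(r"(([0-9]{4})-([0-9]{2}))-([0-9]{2})")

m = pattern.search("A date here: 2013-01-12")

# Map every group number, including 0 for the whole match, to its text.
groups = {n: m.group(n) for n in range(pattern.groups + 1)}
for number, value in groups.items():
    print(number, value)
```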


## Adding features to annotations

For each annotation that gets created for a match it is possible to also specify features to set in each action.
Feature values can be specified as constants or as the value of one of the matched groups. To illustrate this, the following example assigns the year, month and day strings as features to all annotations (Date, Day, Month, Year). In addition, it assigns the constant value "iso" to the "type" feature of the "Date" annotation. To assign the value of group number n, the variable "Gn" can be used, e.g. "G2" for group 2:

In [6]:
rules3 = """
|([0-9]{4})-([0-9]{2})-([0-9]{2})
0 => Date  type="iso", year=G1, month=G2, day=G3
1 => Year  year=G1, month=G2, day=G3
2 => Month year=G1, month=G2, day=G3
3 => Day year=G1, month=G2, day=G3
"""

annt3 = StringRegexAnnotator(source=rules3, source_fmt="string")

doc3 = Document("A document that contains a date here: 2013-01-12 and also here: 1999-12-31")

annt3(doc3)
doc3
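
As a rough mental model of the feature assignments in the rule above, the following sketch resolves a mix of constants and group references against a match from Python's standard `re` module. The names `G` and `resolve_features` are hypothetical and exist only for this illustration; they are not part of the gatenlp API, and the annotator's own implementation may differ.

```python
import re
from typing import NamedTuple


class G(NamedTuple):
    """Hypothetical marker meaning 'the text of group n' (stands in for Gn)."""
    n: int


def resolve_features(match, spec):
    """Replace group markers by the matched group text; keep constants as-is."""
    return {
        name: (match.group(value.n) if isinstance(value, G) else value)
        for name, value in spec.items()
    }


pattern = re.compile(r"([0-9]{4})-([0-9]{2})-([0-9]{2})")

# Mirrors: 0 => Date  type="iso", year=G1, month=G2, day=G3
date_features = {"type": "iso", "year": G(1), "month": G(2), "day": G(3)}

m = pattern.search("and also here: 1999-12-31")
features = resolve_features(m, date_features)
print(features)
```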
