# Advanced Processing
In this tutorial we introduce some more advanced tools for section processing.

1. Processing using a generator function
2. Rules and Rule Sets
3. Regular expressions for processing text
4. The FixedWidth and csv Parsing tools
5. Making use of the Context dictionary
    1. Setting context values when calling Section.read
    2. Accessing default context values
6. Passing other parameters to the processing methods.

### Processing Generator Functions

A Process applied to a Source (a sequence of SourceItems) results in
a sequence of ProcessedItems.  The relation between SourceItems and
ProcessedItems is not necessarily 1:1:

- 1 SourceItem → 1 ProcessedItem
- 1 SourceItem → 2+ ProcessedItems
- 2+ SourceItems → 1 ProcessedItem

Generator functions are used when multiple input items are
required to generate an output item, or when one SourceItem results in
multiple ProcessedItems. 

In general, regular functions are used when there
is a one-to-one correspondence between input item and output item.  

`ProcessedItems = Union[ProcessedItem, Generator[ProcessedItem, None, None]]`


Testing Source and Section item counting.

The process method should track the number of Source items used for each processed item.

The processor records `source.item_count` for each output item in the sequence `section._item_count`:
- `len(section._item_count)` = number of processed items
- `section._item_count[-1]` = number of source items (includes skipped source items)
- Property `item_count` returns `len(self._item_count)`
- Property `source_item_count` returns `self._item_count[-1]`
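
The counting scheme above can be illustrated without the sections package. The following is a minimal sketch, not the library's implementation: `CountingSource`, `pair_up` and `track_counts` are hypothetical names introduced only for this illustration. It records the number of source items consumed each time a processed item is produced.

```python
class CountingSource:
    """Hypothetical iterator wrapper that counts the items it has supplied."""

    def __init__(self, items):
        self._iterator = iter(items)
        self.item_count = 0

    def __iter__(self):
        return self

    def __next__(self):
        item = next(self._iterator)  # StopIteration ends the iteration
        self.item_count += 1
        return item


def pair_up(source):
    """Combine successive source items into 2-tuples (2 items -> 1 output)."""
    for first, second in zip(source, source):
        yield (first, second)


def track_counts(source, processor):
    """Record source.item_count after each processed item is produced."""
    results = []
    counts = []
    for processed in processor(source):
        results.append(processed)
        counts.append(source.item_count)
    return results, counts


source = CountingSource(range(6))
results, counts = track_counts(source, pair_up)
print(results)       # processed items
print(len(counts))   # number of processed items
print(counts[-1])    # number of source items used
```

Here the length of `counts` plays the role of the section's item count and its last element plays the role of the source item count.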



In [1]:

# %% Imports

from pprint import pprint
import random
from buffered_iterator import BufferedIterator

from sections import SectionBreak, Section
from sections import Rule, RuleSet, ProcessingMethods

from pathlib import Path

In [2]:

# %% Processing Functions
def pairs(source):
    '''Convert a sequence of items into a sequence of item pairs

    Successive items are combined into length 2 tuples.

    Args:
        source (Sequence): any sequence of hashable items

    Yields:
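        Tuple[Any, Any]: Successive items combined into length 2 tuples.
    '''
    source = iter(source)  # Allow plain sequences as well as iterators.
    for item in source:
        try:
            partner = next(source)
        except StopIteration:
            # Unpaired final item: stop cleanly (an uncaught StopIteration
            # inside a generator is converted to RuntimeError).
            return
        yield (item, partner)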


In [3]:


def n_split(source):
    '''Extract numbers from strings of comma separated integers.

    Numbers are extracted by splitting on the commas.  Spaces are ignored.

    Args:
        source (Sequence[str]): A sequence of strings composed of comma separated
            integers. e.g. ['0, 1', '2, 3', '4, 5' ...]

    Yields:
        int: Integer values extracted from the strings.
    '''
    for item in source:
        nums = [int(num_s.strip()) for num_s in item.split(',')]
        yield from nums



In [4]:

def odd_nums(source):
    '''Yield the odd items.

    Args:
        source (Sequence[int]): A sequence of integers

    Yields:
        int: odd integers from the source
    '''
    for item in source:
        if int(item) % 2 == 1:
            yield item


In [5]:

buffer_size = 5
num_items = 10

str_source = BufferedIterator(
    (str(i) for i in range(num_items)), 
    buffer_size=buffer_size
    )

int_source = BufferedIterator(
    (i for i in range(num_items)),
    buffer_size=buffer_size
    )

pairs_source = BufferedIterator(
    [f'{a}, {b}' for a, b in zip(range(0, num_items * 2, 2),
                                 range(1, num_items * 2, 2))],
    buffer_size=buffer_size
    )


1-to-1 match
    * `range(n)` as source
    * processor just returns item
    * for each section item:
        * source.item_count = item = section.source_item_count
        * source.item_count = section.item_count
        


In [6]:
section_1_1 = Section(section_name='1-to-1 match')
for item in section_1_1.process(int_source):
    print(item)


0
1
2
3
4
5
6
7
8
9


2-to-1 match
    * `range(n)` as source
    * processor converts 2 successive source items into tuple of length 2.
    * for each section item:
        * item = (source.item_count-2, source.item_count-1)
        * source.item_count = section.source_item_count
        * source.item_count = section.item_count * 2
        


In [7]:
section_2_1 = Section(section_name='2-to-1 match',
                        processor=[pairs])
for item in section_2_1.process(int_source):
    print(item)


1-to-2 match
    * Numerical pairs as source:
        `['0, 1', '2, 3', '4, 5'` $\cdots$`]`
    * processor converts 1 source item into 2 output lines
    * for each source item:
        * `nums = [int(num_s.strip()) for num_s in item.split(',')]`<br>
        * `section item 1 = nums[0]`,<br>
        * `section item 2 = nums[1]`
    * source.item_count = (section.item_count + 1) // 2
    * section.source_item_count = section.source_index[-1]
        


In [8]:
section_1_2 = Section(section_name='1-to-2 match',
                        processor=[n_split])
for item in section_1_2.process(pairs_source):
    print(item)


0
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19


Skip First Source Item
    * (str(i) for i in range(n)) as source
    * start_section='1', offset='Before'
    * processor returns int(item)
    * for each section item:
        * source.item_count = item
        * source.item_count = section.source_item_count
        * source.item_count = section.item_count
        


In [9]:
section_skip_0 = Section(
    section_name='Skipped First Source Item',
    start_section=SectionBreak('1', break_offset='Before')
    )
for item in section_skip_0.process(str_source):
    print(item)


1
2
3
4
5
6
7
8
9


Skip First 2 Source Items
    * (str(i) for i in range(n)) as source
    * start_section='1', offset='After'
    * processor returns int(item)
    * for each section item:
        * source.item_count = item + 1
        * source.item_count = section.source_item_count
        * source.item_count = section.item_count + 2
        


In [10]:
section_skip_2 = Section(
    section_name='Skipped First 2 Source Items',
    start_section=SectionBreak('1', break_offset='After')
    )
for item in section_skip_2.process(str_source):
    print(item)

Don't Count Dropped Items
    * range(n) as source
    * processor drops even items and yields odd items
    * for each section item:
        * item + 1 = source.item_count
        * source.item_count = section.source_item_count
        * source.item_count = section.item_count * 2
        


In [11]:
section_odd = Section(
    section_name='Odd Numbers',
    processor=[odd_nums]
    )
for item in section_odd.process(int_source):
    print(item)


Completed Section Item Count
    * (str(i) for i in range(n)) as source
    * processor drops even items and yields odd items
    * after section.read(source):
        * source.item_count = section.source_item_count
        * section.source_item_count = section.item_count * 2 = n
        


In [12]:
section_odd = Section(
    section_name='Odd Numbers',
    processor=[odd_nums]
    )
section_odd.read(int_source)



[]

Partial Source Completed Section
    * (str(i) for i in range(n)) as source
    * Random start_section and end_section
    * after section.read(source):
        * source.item_count = section.source_item_count
        * source.item_count = end_num
        * section.item_count = end_num - start_num
        


In [13]:
start_num = random.randint(1, num_items-2)
end_num = random.randint(start_num + 1, num_items)
part_section = Section(
    section_name='Partial Source Section',
    start_section=str(start_num),
    end_section=str(end_num)
    )
part_section.read(str_source)


[]

Completed Section With End Before
    * `(str(i) for i in range(n))` as source
    * end_section='2', offset='Before'
    * after section.read(source):
        * source.item_count = section.source_item_count
        * source.item_count = section.item_count = 2



In [14]:
section_end_before = Section(
    section_name='End Before',
    end_section=SectionBreak('2', break_offset='Before')
    )

section_end_before.read(str_source)



[]

Completed Section With End After
    * `(str(i) for i in range(n))` as source
    * end_section='2', offset='After'
    * after section.read(source):
        * source.item_count = section.source_item_count
        * source.item_count = section.item_count = 3
        


In [15]:
section_end_after = Section(
    section_name='End After',
    end_section=SectionBreak('2', break_offset='After')
    )

item_list = section_end_after.read(str_source)
source_count = str_source.item_count
source_item_count = section_end_after.source_item_count
item_count = section_end_after.item_count



## Section Processing

Once identified, a section's content can be *processed* before being returned.
Automatic processing of the items in a section's content is specified with the 
*processor* argument in the *Section* definition. 

The *processor* argument takes a list of functions, *Rules*, or *RuleSets*. If 
the processor argument is not given or is `None` the items in the section are 
returned as-is.  *Rules* and *RuleSets* will be discussed in the next section.

Processor functions have one required positional argument, the item to be 
processed.  In addition, the function may take a second positional argument,
a *context* dictionary.  The *context* dictionary will be discussed in more
detail in a later section.  Additional keyword arguments may also be included.  
If a keyword matches a key in the section's *context*, the corresponding 
*context* value will be supplied.  Otherwise the keyword argument will be 
ignored.

The functions will be applied in list order with the input of the function being 
the output from the previous function.  This means that the expected input type 
of a processor function should be able to handle all possible output types from 
the previous function in the list.
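
The list-order behaviour described above can be sketched in plain Python. This is only an illustration of the chaining idea, assuming simple one-argument functions; `apply_in_order`, `strip_item` and `to_int` are hypothetical names and the sketch does not use the `Section` class itself.

```python
def strip_item(item):
    """Remove leading and trailing whitespace from a text item."""
    return item.strip()


def to_int(item):
    """Convert a text item to an integer; accepts the str from strip_item."""
    return int(item)


def apply_in_order(item, functions):
    """Feed the output of each function into the next one in the list."""
    for func in functions:
        item = func(item)
    return item


source_items = [' 1 ', '2', ' 30']
processed = [apply_in_order(item, [strip_item, to_int])
             for item in source_items]
print(processed)
```

Reversing the list to `[to_int, strip_item]` would fail, because `strip_item` cannot handle the `int` that `to_int` returns. This is the type-compatibility requirement in practice.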

Processor functions may also be generator functions, in which case the required 
positional argument is the sequence to iterate over.  This can be useful if the 
processing involves skipping items or merging of multiple items.  Examples of 
this will be given in a separate tutorial.


Processing functions should accept one of the following argument sets:
    func(item)
    func(item, **context)
    func(item, context)
    func(item, [other(s),] **context)


Custom function
- The first argument is the item to be processed.
- An optional second argument is the Section's *Context* dictionary.
- Keyword arguments can be used to accept specific items from the Section's 
*Context* dictionary, provided there is a trailing `**kwargs` argument to catch 
the remainder of the Section's *Context* dictionary.
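
One way to picture how context values could be matched to keyword arguments is the sketch below. It is an assumption-laden illustration, not the package's actual dispatch logic: `call_with_context` is a hypothetical helper, it covers only the keyword forms of the call signature (not the positional `func(item, context)` form), and the context keys are invented for the example.

```python
import inspect


def call_with_context(func, item, context):
    """Call func(item, ...) passing only the context values it can accept."""
    params = inspect.signature(func).parameters
    accepts_any = any(param.kind is inspect.Parameter.VAR_KEYWORD
                      for param in params.values())
    if accepts_any:
        # A trailing **kwargs catches the remainder of the context.
        return func(item, **context)
    # Otherwise supply only the keywords that match a context key.
    names = list(params)[1:]  # skip the first parameter (the item)
    matching = {name: context[name] for name in names if name in context}
    return func(item, **matching)


def plain(item):
    return item.upper()


def with_units(item, units='', **context):
    return f'{item} {units}'.strip()


context = {'units': 'cm', 'section_name': 'Sizes'}  # hypothetical context
plain_result = call_with_context(plain, 'abc', context)
units_result = call_with_context(with_units, '12', context)
print(plain_result, units_result)
```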



The processing instruction(s) can be a ProcessingMethods
instance, or one of (or a list of) the following:
    Rule,
    RuleSet,
    section,
    list of sections,
    Any function with an appropriate call signature:
        func(items: SourceItem),
        func(items: SourceItem, context: ContextType),
        func(items: SourceItem, **kwargs: Any)

        Both regular functions and generator functions are
        accepted.

### Processing with a generator function

In a simple processor function the input is a single item from the section and
the output is a single processed item.

The relation between individual input (section) items and the resulting 
processed items is not necessarily 1:1. A single input item could be broken up 
into multiple processed items, or conversely, multiple section items could be 
converted into one processed item:

- 1 SourceItem → 1 ProcessedItem
- 1 SourceItem → 2+ ProcessedItems
- 2+ SourceItems → 1 ProcessedItem

In general, regular functions are only used when there is a one-to-one 
correspondence between input item and output item.  Generator functions are used 
when multiple input items are required to generate a processed item, or when one 
input item results in multiple processed items.   
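
As an illustration of the many-to-one case, the sketch below merges continuation lines into the line that precedes them. It is a plain-Python generator written under the assumption that a continuation line starts with a space; `merge_continuations` and the sample lines are hypothetical and not part of the package.

```python
def merge_continuations(source):
    """Merge lines that start with a space into the preceding line.

    A variable number of source items (1 or more) produce each output item.
    """
    current = None
    for line in source:
        if line.startswith(' ') and current is not None:
            current += ' ' + line.strip()  # extend the pending item
        else:
            if current is not None:
                yield current  # the pending item is complete
            current = line
    if current is not None:
        yield current  # emit the final pending item


lines = ['Name: A', '  continued', '  again', 'Name: B']
merged = list(merge_continuations(lines))
print(merged)
```

A regular function could not do this, because it only sees one item at a time and must return exactly one result per call.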

## Rules and RuleSets
Instead of having one function `process_directory()` that manages all possible 
text lines in the section, the function can be broken down into parts by 
defining *Rules*.

#### Rules
Rules define an action to take on an item depending on the result of a test.

A *Rule* definition has two parts:
1. Trigger:
   > Defines the test to be applied to the source item
   > Trigger related arguments:
   > - sentinel
   >   - For string items, sentinel can be a string or compiled regular expression.
   > - location
   >   - A sentinel modifier that applies to str or re.Pattern types of sentinels. One of  ['IN', 'START', 'END', 'FULL', None]. Default is None, which is treated as 'IN'

2. Action
   > Defines the actions to take depending on the Trigger outcome.
   > Action related arguments:
   > - pass_method
   > - fail_method
   >
   > Both take functions, or the name of standard actions to be implemented if the test passes or fails respectively.
   >
   > The pass_method and fail_method functions can be simple process functions, with one positional argument and additional keyword arguments. The functions can also contain a second positional argument *event* which allows the function to access information about the test results.  This is particularly useful when the sentinel is a regular expression.
   >
   > pass_method and fail_method can also be a string with the name of one of the standard actions.  The most common are:
   > - 'Original': return the item unchanged.
   > - 'None': return None
   > - 'Blank': return ''  (an empty string)


#### RuleSets
RuleSets combine related Rules to provide multiple choices for actions.
RuleSets are used when the function that should be applied to the SourceItem(s)
depends on the result of one or more tests (Triggers).  Individual Rules can
be used when only a single Trigger is required (by using both the Pass and
Fail methods of the Rule) or to modify some of the SourceItems while leaving
others unchanged (by setting the Fail method to 'Original').  For Rules or
RuleSets it is important that the output is of the same type regardless of
whether the Trigger(s) pass or fail.

- A RuleSet takes a sequence of Rules and a default method.
- Each Rule in the sequence is applied to the input until one of the Rules triggers. At that point the sequence ends.
- If no Rule triggers, the default method is applied.
- Each of the Rules (and the default) should expect the same input type and should produce the same output type.
- The default_method can be any valid process function or standard action.
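
The "first Rule that triggers wins" behaviour can be sketched as follows. This is a conceptual illustration only: it represents each rule as a hypothetical `(trigger, action)` pair of plain functions rather than using the `Rule` and `RuleSet` classes, and `apply_rule_set` is an invented helper.

```python
def apply_rule_set(item, rules, default_method):
    """Apply the action of the first rule whose trigger passes.

    rules is a sequence of (trigger, action) pairs; if no trigger passes,
    default_method is applied instead.
    """
    for trigger, action in rules:
        if trigger(item):
            return action(item)  # the sequence ends at the first match
    return default_method(item)


def is_file_summary(text):
    return 'File(s)' in text


def is_dir_summary(text):
    return 'Dir(s)' in text


# Every action (and the default) takes a str and returns a str.
rules = [
    (is_file_summary, lambda text: 'file summary'),
    (is_dir_summary, lambda text: 'directory summary'),
]

lines = [
    '         1 File(s)          59904 bytes',
    '        23 Dir(s)     63927545856 bytes free',
    'some other line',
]
labels = [apply_rule_set(line, rules, lambda text: 'other') for line in lines]
print(labels)
```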



*Triggers*, *TriggerEvent*, *Rules* and *RuleSets* will be covered in more detail in a separate tutorial.
        

## Regex-based processing

In [16]:
import re

#%% Regex Parsing patterns
# File Count and summary:
#          1 File(s)          59904 bytes
#         23 Dir(s)     63927545856 bytes free
folder_summary_pt = re.compile(
    '(?P<files>'       # beginning of files string group
    '[0-9]+'           # Integer number of files
    ')'                # end of files string group
    '[ ]+'             # Arbitrary number of spaces
    '(?P<type>'        # beginning of type string group
    'File|Dir'         # "File" or "Dir" text
    ')'                # end of type string group
    '\\(s\\)'          # "(s)" text
    '[ ]+'             # Arbitrary number of spaces
    '(?P<size>'        # beginning of size string group
    '[0-9]+'           # Integer size of folder
    ')'                # end of size string group
    ' bytes'           # "bytes" text
    )
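
# %% [markdown]
The pattern can be tried directly on the two summary lines quoted in the comments above. In this sketch the pattern is restated on a single line so the block is self-contained; the third sample line and the `summaries` list are hypothetical additions used to show a non-matching case and one possible way to convert the named groups.

```python
import re

# Compact restatement of the folder summary pattern defined above.
folder_summary_pt = re.compile(
    r'(?P<files>[0-9]+)[ ]+(?P<type>File|Dir)\(s\)[ ]+(?P<size>[0-9]+) bytes'
)

sample_lines = [
    '         1 File(s)          59904 bytes',
    '        23 Dir(s)     63927545856 bytes free',
    'a line without a summary',  # hypothetical non-matching line
]

summaries = []
for line in sample_lines:
    found = folder_summary_pt.search(line)
    if found is None:
        continue  # not a summary line
    groups = found.groupdict()
    summaries.append({
        'type': groups['type'],
        'files': int(groups['files']),
        'size': int(groups['size']),
    })

print(summaries)
```

`search` is used rather than `match` because the summary lines begin with a run of spaces before the file count.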



In [17]:
test_file = Path.cwd() / 'examples' / 'test_DIR_Data.txt'
dir_text = test_file.read_text().splitlines()