# Section Introductory Tutorial

## Section Components
A section defines a continuous portion of a text stream or other iterable.
A Section instance is the set of definitions for managing a sequence. The same instance can be used repeatedly on different sequence inputs or even on different portions of the same sequence.

The section definition may include:
- Boundary definitions for the section's start and end.
- Formatting or processing instructions for parsing text, changing data types, and merging or dropping sequence items.
- An assembly method, which combines all of the processed items into a single item such as a list, dictionary or pandas DataFrame.



### Boundaries
At its simplest, a section is a definition of the start and/or end of a sequence.

#### SectionBreak
### Formatting
#### ProcessingMethods
### Summary
#### assemble Functions

## "Hello World"
A Sectionary equivalent to the traditional "Hello World" example.

In [2]:
from sections import Section
s = Section()
s.read('Hello World')

['H', 'e', 'l', 'l', 'o', ' ', 'W', 'o', 'r', 'l', 'd']

1. Treats the string 'Hello World' as a sequence.
2. Iterates through the entire sequence (no Boundaries are defined), converting it into a list (the default assemble function).

In [3]:
from sections import Section
s = Section('W')
s.read('Hello World')

['W', 'o', 'r', 'l', 'd']

1. 'W' is a starting Boundary.
2. Iterates through the sequence starting with the first 'W' encountered.
3. No ending Boundary is defined, so the iteration continues to the end of the sequence.

In [4]:
from sections import Section
s = Section('H','W')
s.read('Hello World')

['H', 'e', 'l', 'l', 'o', ' ']

1. 'H' is a starting Boundary.
2. 'W' is an ending Boundary.
3. iterates through the sequence starting with the first 'H' encountered and ending with the first 'W' encountered.

In [3]:
from sections import Section
s = Section('H','W', str.upper)
s.read('Hello World')

['H', 'E', 'L', 'L', 'O', ' ']

1. str.upper is a Processor (Formatting) function
2. str.upper is applied to each item (letter) in the sequence (string) before building the list

In [9]:
from functools import partial
from sections import Section
j = partial(str.join, '.')
s = Section('H','W', str.upper, assemble=j)
s.read('Hello World')

'H.E.L.L.O. '
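
For comparison, the same start, end, process and assemble steps can be sketched with the standard library alone. This only illustrates what the example above computes; it is not a description of how `Section` is implemented, and the names `items`, `join_with_dots` and `result` are made up for this sketch.

```python
from functools import partial
from itertools import dropwhile, takewhile

text = 'Hello World'

# Start boundary: skip items until the first 'H' is found.
items = dropwhile(lambda char: char != 'H', text)
# End boundary: stop just before the first 'W'.
items = takewhile(lambda char: char != 'W', items)
# Processor: apply str.upper to every remaining item.
items = map(str.upper, items)
# Assemble: join the processed items with '.' instead of building a list.
join_with_dots = partial(str.join, '.')
result = join_with_dots(items)
print(repr(result))  # 'H.E.L.L.O. '
```

The trailing space in the result is the blank between the two words, which is the last item before the ending Boundary.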


## Introduction

There are many good text readers and parsers available for Python (such as *csv*), 
but they generally assume that the source they are reading from has uniform 
formatting throughout.  However, this is often not the case. Different parts 
of a text file may contain different types of information, each of which requires a different approach to reading the data. 

 The Sections module is used to define, read and process distinct groups of items
 -- usually lines of text -- from an iterable source.  

The principal class is:

    Section(name: str = 'Section',
            start_section: (SectionBreak, List[SectionBreak], str, Optional)
            end_section: (SectionBreak, List[SectionBreak], str, Optional)
            processor: (ProcessingMethods, Section, List[Section], Optional)
            assemble: (Callable, Optional)
            keep_partial: bool = False)

- Section defines a continuous portion of a text stream or other iterable.

- A section definition may include:

    - Starting and ending break points.
    - Processing instructions.
    - An assembly method.

- A Section instance is created by defining one or more of these parts. Once a section has been defined, it can be applied to an iterable using:

`read(source)`
> Where
> *source* is any iterable supplying the text lines to be parsed.

Supporting classes:

`Trigger(sentinel, location=None, name)`: 
>  Define a test for evaluating a source item.

`SectionBreak(sentinel, location, break_offset, name)`: 
>  Identify the start or end of a section.

`Rule(sentinel, location, pass_method, fail_method, name)`: 
>  Apply a method based on trigger test result.

`RuleSet(rule_list, default, name)`:  
>  Apply a sequence of Rules, stopping with the first Rule to pass.
        
`ProcessingMethods(processing_methods, name)`: 
>  Apply a series of functions to a supplied sequence of items.

**Note:** Although the examples given here are focused on text, the Sectionary package works with any type of sequence.

## Imports

#### Standard Python Modules

In [3]:
from pathlib import Path
from pprint import pprint
import inspect
import re

#### Useful Third Party Packages

In [4]:
import pandas as pd
#import xlwings as xw

#### Sectionary Imports

In [5]:
import sections
#import text_reader as tp
#from sections import Rule, RuleSet, SectionBreak, ProcessingMethods, Section

## The Text to be Processed

This tutorial uses the output from the Windows `dir` command:
> `DIR "Test Dir Structure" /S /N /-C /T:W >  "test_DIR_Data.txt"`

More information on this command syntax and resulting output can be found 
[here](MS_Dir_Output.html)

The output from the `Dir` command is read in as a list of text lines by the command:
`dir_text = Path('test_DIR_Data.txt').read_text().splitlines()`

In [8]:
dir_text = Path('test_DIR_Data.txt').read_text().splitlines()

`dir_text` also can be obtained directly using an _iPython_ command:
> `dir_text = !DIR "Test Dir Structure" /S /N /-C /T:W`

### `Dir` Output Structure

The first 20 lines of `dir_text` are:


In [9]:
for line in dir_text[0:20]:
    print('\t', line)

	  Volume in drive C is Windows
	  Volume Serial Number is DAE7-D5BA
	 
	  Directory of c:\users\...\Test Dir Structure
	 
	 2021-12-27  03:33 PM    <DIR>          .
	 2021-12-27  03:33 PM    <DIR>          ..
	 2021-12-27  04:03 PM    <DIR>          Dir1
	 2021-12-27  05:27 PM    <DIR>          Dir2
	 2016-02-25  09:59 PM                 3 TestFile1.txt
	 2016-02-15  06:46 PM                 7 TestFile2.rtf
	 2016-02-15  06:47 PM                 0 TestFile3.docx
	 2016-04-21  01:06 PM              3491 xcopy.txt
	                4 File(s)           3501 bytes
	 
	  Directory of c:\users\...\Test Dir Structure\Dir1
	 
	 2021-12-27  04:03 PM    <DIR>          .
	 2021-12-27  04:03 PM    <DIR>          ..
	 2016-02-15  06:48 PM                 0 File in Dir One.txt


We will ignore the first two lines (the *header section*)

In [10]:
for line in dir_text[0:2]:
    print('\t', line)

	  Volume in drive C is Windows
	  Volume Serial Number is DAE7-D5BA


After this come multiple Folder sections something like this:

In [11]:
print(dir_text[3])
print()
for line in dir_text[5:9]:
    print(line)
print(dir_text[13])

 Directory of c:\users\...\Test Dir Structure

2021-12-27  03:33 PM    <DIR>          .
2021-12-27  03:33 PM    <DIR>          ..
2021-12-27  04:03 PM    <DIR>          Dir1
2021-12-27  05:27 PM    <DIR>          Dir2
               4 File(s)           3501 bytes


## Defining a Section

The start and end of a folder listing can be identified by key phrases:
- The section start is identified by the text '*Directory of*'
- The section end is identified by the text '*File(s)*'

### Define a Section Based on these start and end identifiers

In [12]:
dir_section = sections.Section(
    start_section='Directory of', 
    end_section='File(s)'
    )
dir_section.read(dir_text)

[' Directory of c:\\users\\...\\Test Dir Structure',
 '',
 '2021-12-27  03:33 PM    <DIR>          .',
 '2021-12-27  03:33 PM    <DIR>          ..',
 '2021-12-27  04:03 PM    <DIR>          Dir1',
 '2021-12-27  05:27 PM    <DIR>          Dir2',
 '2016-02-25  09:59 PM                 3 TestFile1.txt',
 '2016-02-15  06:46 PM                 7 TestFile2.rtf',
 '2016-02-15  06:47 PM                 0 TestFile3.docx',
 '2016-04-21  01:06 PM              3491 xcopy.txt']

`dir_section.read(dir_text)` returned the first folder listing in *dir_text*.
However, it is missing the final line:

In [13]:
print(dir_text[13])

               4 File(s)           3501 bytes


To include this line, we need to define the `end_section` to end *After* the specified text.  We include this information by creating a `SectionBreak` object and explicitly including the last line using the `break_offset` argument:

In [14]:
dir_section = sections.Section(
    start_section='Directory of',
    end_section=sections.SectionBreak('File(s)', break_offset='After'))

dir_section.read(dir_text)

[' Directory of c:\\users\\...\\Test Dir Structure',
 '',
 '2021-12-27  03:33 PM    <DIR>          .',
 '2021-12-27  03:33 PM    <DIR>          ..',
 '2021-12-27  04:03 PM    <DIR>          Dir1',
 '2021-12-27  05:27 PM    <DIR>          Dir2',
 '2016-02-25  09:59 PM                 3 TestFile1.txt',
 '2016-02-15  06:46 PM                 7 TestFile2.rtf',
 '2016-02-15  06:47 PM                 0 TestFile3.docx',
 '2016-04-21  01:06 PM              3491 xcopy.txt',
 '               4 File(s)           3501 bytes']
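
The effect of the end offset can be sketched in plain Python. The function `take_section` and its `include_end` flag are hypothetical stand-ins written for this illustration; they are not part of the Sectionary API and only mimic the two behaviours shown above (end line excluded by default, end line included with `break_offset='After'`).

```python
def take_section(lines, start, end, include_end=False):
    """Return the lines from the first `start` match up to the first `end` match."""
    section = []
    started = False
    for line in lines:
        if not started:
            if start not in line:
                continue              # still before the section
            started = True            # the start line itself is kept
        elif end in line:
            if include_end:
                section.append(line)  # keep the end line as well
            break
        section.append(line)
    return section


sample = [
    ' Volume in drive C is Windows',
    ' Directory of c:\\demo',
    'a.txt',
    '   1 File(s)   3 bytes',
    ' Directory of c:\\demo\\sub',
]
without_end = take_section(sample, 'Directory of', 'File(s)')
with_end = take_section(sample, 'Directory of', 'File(s)', include_end=True)
```

Here `without_end` stops at `'a.txt'`, while `with_end` also contains the `File(s)` summary line.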

`dir_text` is a list, so `dir_section.read(dir_text)` starts over at the beginning each time it is called.

In [15]:
pprint(dir_section.read(dir_text))
pprint(dir_section.read(dir_text))

[' Directory of c:\\users\\...\\Test Dir Structure',
 '',
 '2021-12-27  03:33 PM    <DIR>          .',
 '2021-12-27  03:33 PM    <DIR>          ..',
 '2021-12-27  04:03 PM    <DIR>          Dir1',
 '2021-12-27  05:27 PM    <DIR>          Dir2',
 '2016-02-25  09:59 PM                 3 TestFile1.txt',
 '2016-02-15  06:46 PM                 7 TestFile2.rtf',
 '2016-02-15  06:47 PM                 0 TestFile3.docx',
 '2016-04-21  01:06 PM              3491 xcopy.txt',
 '               4 File(s)           3501 bytes']
[' Directory of c:\\users\\...\\Test Dir Structure',
 '',
 '2021-12-27  03:33 PM    <DIR>          .',
 '2021-12-27  03:33 PM    <DIR>          ..',
 '2021-12-27  04:03 PM    <DIR>          Dir1',
 '2021-12-27  05:27 PM    <DIR>          Dir2',
 '2016-02-25  09:59 PM                 3 TestFile1.txt',
 '2016-02-15  06:46 PM                 7 TestFile2.rtf',
 '2016-02-15  06:47 PM                 0 TestFile3.docx',
 '2016-04-21  01:06 PM              3491 xcopy.txt',
 '               4 File(s)           3501 bytes']

By creating an iterator from *dir_text* with `dir_text_iter = iter(dir_text)` 
(representing a text stream source), 
successive calls to `dir_section.read(dir_text_iter)` 
will each return the next directory group.

In [16]:
dir_text_iter = iter(dir_text)
pprint(dir_section.read(dir_text_iter))
pprint(dir_section.read(dir_text_iter))

[' Directory of c:\\users\\...\\Test Dir Structure',
 '',
 '2021-12-27  03:33 PM    <DIR>          .',
 '2021-12-27  03:33 PM    <DIR>          ..',
 '2021-12-27  04:03 PM    <DIR>          Dir1',
 '2021-12-27  05:27 PM    <DIR>          Dir2',
 '2016-02-25  09:59 PM                 3 TestFile1.txt',
 '2016-02-15  06:46 PM                 7 TestFile2.rtf',
 '2016-02-15  06:47 PM                 0 TestFile3.docx',
 '2016-04-21  01:06 PM              3491 xcopy.txt',
 '               4 File(s)           3501 bytes']
[' Directory of c:\\users\\...\\Test Dir Structure\\Dir1',
 '',
 '2021-12-27  04:03 PM    <DIR>          .',
 '2021-12-27  04:03 PM    <DIR>          ..',
 '2016-02-15  06:48 PM                 0 File in Dir One.txt',
 '2021-12-27  03:45 PM    <DIR>          SubFolder1',
 '2021-12-27  03:45 PM    <DIR>          SubFolder2',
 '               1 File(s)              0 bytes']
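
The difference between reading from a list and reading from an iterator is ordinary Python behaviour and can be shown without Sectionary. In this sketch `take_two` is a made-up helper that simply consumes the next two items of whatever it is given.

```python
def take_two(source):
    """Return the next two items from `source`."""
    result = []
    for item in source:
        result.append(item)
        if len(result) == 2:
            break
    return result


lines = ['a', 'b', 'c', 'd']

# A list hands out a fresh iterator for every loop, so each call starts over.
from_list = [take_two(lines), take_two(lines)]

# An iterator remembers its position, so the second call continues
# where the first one stopped.
line_iter = iter(lines)
from_iter = [take_two(line_iter), take_two(line_iter)]

print(from_list)  # [['a', 'b'], ['a', 'b']]
print(from_iter)  # [['a', 'b'], ['c', 'd']]
```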


## Section Processing
Identifying sections is only the first step.
Next let's do something with the section text.

### Section assemble functions
Summarize a section's content by supplying the section definition with 
an `assemble` method.

`assemble` (AssembleFunction, optional): 
A function used to collect and format the section into a single object.
Defaults to None, which returns a list.

An `AssembleFunction` is a function that accepts a sequence of items as its first argument.  It may also accept a *Context* dictionary as its second argument, which supplies attributes generated in the Section class.

#### Important Note:
The sequence object passed to the `AssembleFunction` is actually a generator.  This is done to allow stream type sequences to be handled well.

When used with a for loop, a generator behaves just like a list, but a generator cannot be *sliced*.  If slicing is needed, simply convert the generator to a list with:<br>
> `section_list = list(section_gen)`.
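
A short sketch of the slicing limitation, using a stand-in generator rather than a real section. `demo_lines` is a made-up name; `itertools.islice` is shown as an alternative when only the first few items are needed and the rest should stay in the stream.

```python
from itertools import islice


def demo_lines():
    """Stand-in for the generator passed to an assemble function."""
    yield from ['line 0', 'line 1', 'line 2', 'line 3']


section_gen = demo_lines()
try:
    section_gen[1:3]          # generators do not support slicing
    sliced = True
except TypeError:
    sliced = False

# islice consumes only the requested items ...
first_two = list(islice(section_gen, 2))
# ... and list() collects whatever is left.
remaining = list(section_gen)
```

Note that both `islice` and `list` consume the generator: once an item has been read it is not available again.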

 #### Format of a line from a dir listing
The DIR output is formatted into columns with spaces for padding.
The information can be extracted by identifying start and end columns.
(The numbers on the top and bottom are provided to aid with counting the text.)
```
00000000001111111111222222222233333333334444444444555555555566666666667777777777
01234567890123456789012345678901234567890123456789012345678901234567890123456789
2021-12-27  03:33 PM    <DIR>          .
2021-12-27  03:33 PM    <DIR>          ..
2021-12-27  04:03 PM    <DIR>          Dir1
2021-12-27  05:27 PM    <DIR>          Dir2
2016-02-25  09:59 PM                 3 TestFile1.txt
2016-02-15  06:46 PM                 7 TestFile2.rtf
2016-02-15  06:47 PM                 0 TestFile3.docx
2016-04-21  01:06 PM              3491 xcopy.txt
00000000001111111111222222222233333333334444444444555555555566666666667777777777
01234567890123456789012345678901234567890123456789012345678901234567890123456789
```
- The First **20** characters contain the date and time
- The ending characters, starting at character number **39** 
contain the name of the file or directory
- The file size is in characters **29** to **38**
- Sub-directory names can be identified by the text _`<DIR>`_
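
The column positions listed above can be tried out on single lines before a full section is processed. The sketch below is an illustration only: `parse_dir_line` is a hypothetical helper, the sample lines are built from pieces so that the column positions are explicit, and converting the date and size to `datetime` and `int` is an extra step that assumes an English locale for the `AM`/`PM` marker.

```python
from datetime import datetime


def parse_dir_line(dir_line):
    """Split one DIR listing line into typed fields using fixed columns."""
    # Characters 0-19: date and time (whitespace in the format matches
    # the double space between the date and the time).
    modified = datetime.strptime(dir_line[:20].strip(), '%Y-%m-%d %I:%M %p')
    is_dir = '<DIR>' in dir_line
    return {
        'DateModified': modified,
        'Name': dir_line[39:].strip(),          # character 39 onwards
        'IsDir': is_dir,
        # Characters 29-37: file size (blank for directories)
        'FileSize': None if is_dir else int(dir_line[29:38].strip()),
    }


file_line = '2016-04-21  01:06 PM' + '3491'.rjust(18) + ' ' + 'xcopy.txt'
dir_line = '2021-12-27  04:03 PM' + '    <DIR>' + ' ' * 10 + 'Dir1'

file_info = parse_dir_line(file_line)
dir_info = parse_dir_line(dir_line)
```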

The `summarize_directory` function converts a folder listing into a list of dictionaries and from there into a *Pandas* DataFrame

In [17]:
def summarize_directory(dir_src):
    dir_list = list(dir_src)  # Convert generator into a list
    # If the section is not found dir_list will be empty
    if not dir_list:
        return pd.DataFrame()  # Return an empty DataFrame instance, not the class
    
    # The first line contains the folder name: 
    #         Directory of c:\\users\\...\\Test Dir Structure
    folder_name = dir_list[0].rsplit('\\', 1)[1]
    
    # The last line contains the number of files
    file_count = dir_list[-1].strip().split(' ', 1)[0]
    
    # Process the rest of the lines 
    # (There is a blank line between the folder name and the first)
    folder_list = list()
    for dir_line in dir_list[2:-1]:
        # Include the folder info for each listing
        folder_dict = {
            'Folder': folder_name,
            'NumFiles': file_count
            }
        
        # First 20 characters contain the date
        folder_dict['DateModified'] = dir_line[:20].strip()
        
        # Ending Characters, starting at #39 contain the name of 
        # the file or directory
        folder_dict['Name'] = dir_line[39:].strip()
        
        # Check if the listing is a file or directory
        # Directories will contain the text <DIR>
        if '<DIR>' in dir_line:
            folder_dict['IsDir'] = True
            # No FileSize for directories
            folder_dict['FileSize'] = ''
        else:
            folder_dict['IsDir'] = False
            # The File Size is given in characters 29 to 38
            folder_dict['FileSize'] = dir_line[29:38].strip()
            
        # ignore the `.` and `..` entries (current directory and parent directory)
        if folder_dict['Name'] not in ['.', '..']:
            folder_list.append(folder_dict)
    # Convert the list to a Pandas Dataframe for easy viewing
    folder_data = pd.DataFrame(folder_list)
    return  folder_data

With `assemble=summarize_directory` in the *dir_section* definition,
the command `dir_section.read(dir_text)` results in a DataFrame object representing the folder listing.

In [18]:
dir_section = sections.Section(
    start_section='Directory of',
    end_section=sections.SectionBreak('File(s)', break_offset='After'),
    assemble=summarize_directory)

dir_section.read(dir_text)

|   | Folder | NumFiles | DateModified | Name | IsDir | FileSize |
|---|---|---|---|---|---|---|
| 0 | Test Dir Structure | 4 | 2021-12-27 04:03 PM | Dir1 | True | |
| 1 | Test Dir Structure | 4 | 2021-12-27 05:27 PM | Dir2 | True | |
| 2 | Test Dir Structure | 4 | 2016-02-25 09:59 PM | TestFile1.txt | False | 3.0 |
| 3 | Test Dir Structure | 4 | 2016-02-15 06:46 PM | TestFile2.rtf | False | 7.0 |
| 4 | Test Dir Structure | 4 | 2016-02-15 06:47 PM | TestFile3.docx | False | 0.0 |
| 5 | Test Dir Structure | 4 | 2016-04-21 01:06 PM | xcopy.txt | False | 3491.0 |


### Section Processing Tools

The `summarize_directory()` function, which converts the entire folder section into a DataFrame, can be broken down into parts by including a *processor* as part of the Section definition. There are a number of advantages to this modular design.
1. A sequence of smaller functions, each focusing on a single task, improves the readability of the overall code.
2. It makes it easier to modify the code if the structure of the source changes slightly.
> For example: If the date format or the column spacing changes, it is simpler to replace a function that reads dates or splits columns than to modify a large function that is doing many things.

3. As we will see later, it allows for much more complex Sections to be defined.

#### `processor` Instructions
The *processor* argument supplies either a single item or a list of functions or
*Standard Action* names to be applied to each of the section items.
It can also be a SectionProcessor instance, a Section instance, or a
list of Section instances. If it is not supplied (or is `None`),
the section will use the default SectionProcessor.

Standard Action names:
- 'Original': return the original item supplied.
- 'Blank': return ''  (an empty string).
- 'None': return None.

Custom functions:
- The first argument is the item to be processed.
- An optional second argument is the Section's *Context* dictionary.
- Keyword arguments can be used to accept specific items from the Section's 
*Context* dictionary, provided there is a trailing `**kwargs` argument to catch 
the remainder of the Section's *Context* dictionary.

- Returns the processed item, or a generator of processed items. The call signatures noted for custom functions are:

        function(SourceItem, ContextType) -> ProcessedItems
        function([SourceItem], ContextType) -> ProcessedItems
        function(SourceItem, **context) -> ProcessedItems
        func(item [, other(s)], **context)
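
The keyword-argument form can be illustrated in plain Python. The context keys used here (`folder`, `line_number`) are invented for the example; the real keys are whatever the Section class places in its *Context* dictionary. The sketch shows why the trailing `**context` catch-all is needed when the whole dictionary is unpacked into the call.

```python
def label_item(item, folder='unknown', **context):
    """Prefix an item with the `folder` value picked out of the context."""
    return f'{folder}: {item}'


def strict_label(item, folder='unknown'):
    """Same function, but without the catch-all for the remaining context."""
    return f'{folder}: {item}'


# Hypothetical context dictionary
context = {'folder': 'Dir1', 'line_number': 7}

labelled = label_item('File in Dir One.txt', **context)

try:
    strict_label('File in Dir One.txt', **context)
    accepted = True
except TypeError:
    # 'line_number' has nowhere to go without **context
    accepted = False
```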

If not supplied (or None), the section items are returned as 
they are obtained from the source iterator.

`ProcessingMethods` applies a series of functions to a supplied sequence of items.

    Processing Methods combines a series of functions, generator functions,
    Rules, and/or Rule Sets (Processes) to produce a single generator function.
    The generator function will iterate through a supplied source of items
    returning the final processed item. The output type of each Process must
    match the expected input type of the next Process in the series.  No
    validation tests are done on this.

    A Process applied to a Source (a sequence of SourceItems) results in
    a sequence of ProcessedItems.  The relation between SourceItems and
    ProcessedItems is not necessarily 1:1.
       1 SourceItem ≠1 ProcessedItem;
    	  • 1 SourceItem → 1 ProcessedItem
    	  • 1 SourceItem → 2+ ProcessedItems
    	  • 2+ SourceItems → 1 ProcessedItem

    Generator functions are used when multiple input items are
    required to generate an output item, or when one SourceItem results in
    multiple ProcessedItems. In general, regular functions are used when there
    is a one-to-one correspondence between input item and output item.  RuleSets
    are used when the function that should be applied to the SourceItem(s)
    depends on the result of one or more tests (Triggers).  Individual Rules can
    be used when only a single Trigger is required (by using both the Pass and
    Fail methods of the Rule) or to modify some of the SourceItems while leaving
    others unchanged (by setting the Fail method to 'Original').  For Rules or
    RuleSets it is important that the output is of the same type regardless of
    whether the Trigger(s) pass or fail.

    Processing functions should accept one of the following argument sets:
        func(item)
        func(item, ** context)
        func(item, context)
        func(item, [other(s),] ** context)
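
The distinction between regular functions (one item in, one item out) and generator functions (item counts may differ) can be sketched as follows. This is a simplified illustration, not the `ProcessingMethods` implementation: `apply_processes` and `drop_blank_lines` are hypothetical names, and Rules, RuleSets and the context argument are left out.

```python
import inspect


def drop_blank_lines(lines):
    """Generator function: may yield fewer items than it receives."""
    for line in lines:
        if line.strip():
            yield line


def apply_processes(source, processes):
    """Chain processes so that each one consumes the output of the previous one."""
    stream = iter(source)
    for process in processes:
        if inspect.isgeneratorfunction(process):
            # Generator functions receive the whole stream.
            stream = process(stream)
        else:
            # Regular functions are applied to one item at a time.
            stream = map(process, stream)
    return stream


raw = ['a 1', '   ', 'b 2']
processed = list(apply_processes(raw, [drop_blank_lines, str.split]))
print(processed)  # [['a', '1'], ['b', '2']]
```

As in the description above, nothing checks that the output of one step is a valid input for the next; swapping the two steps would pass lists to `drop_blank_lines` and fail.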

ContextType = Union[Dict[str, Any], None]
ProcessedItems = Union[ProcessedItem, Generator[ProcessedItem]]

ProcessedItem = TypeVar('ProcessedItem')
SourceItem = TypeVar('SourceItem')
process_actions = {
        'Original': lambda test_object, context: test_object,
        'Blank':  lambda test_object, context: '',
        'None':  lambda test_object, context: None
        }


In [None]:
?sections.ProcessingMethods

Init signature:
sections.ProcessingMethods(
    processing_methods: 'List[ProcessMethodOptions]' = None,
    name='Processor',
)
Docstring:     
Applies a series of functions to a supplied sequence of items.

Processing Methods combines a series of functions, generator functions,
Rules, and/or Rule Sets (Processes) to produce a single generator function.
The generator function will iterate through a supplied source of items
returning the final processed item. The output type of each Process must
match the expected input type of the next Process in the series.  No
validation tests are done on this.

A Process applied to a Source (a sequence of SourceItems) results in
a sequence of ProcessedItems.  The relation between SourceItems and
ProcessedItems is not necessarily 1:1.
   1 SourceItem ≠1 ProcessedItem;
      • 1 SourceItem → 1 ProcessedItem
      • 1 SourceItem → 2+ ProcessedItems
      • 2+ SourceItems → 1 ProcessedItem

Generator functions are used when multiple input items are
required to generate an output item, or when one SourceItem results in
multiple ProcessedItems. In general, regular functions are used when there
is a one-to-one correspondence between input item and output item.  RuleSets
are used when the function that should be applied to the SourceItem(s)
depends on the result of one or more tests (Triggers).  Individual Rules can
be used when only a single Trigger is required (by using both the Pass and
Fail methods of the Rule) or to modify some of the SourceItems while leaving
others unchanged (by setting the Fail method to 'Original').  For Rules or
RuleSets it is important that the output is of the same type regardless of
whether the Trigger(s) pass or fail.

Processing functions should accept one the following argument sets:
    func(item)
    func(item, ** context)
    func(item, context)
    func(item, [other(s),] ** context)

Arguments:
    processing_methods (ProcessGroup): The sequence of Processes (functions,
        generator functions, Rules, and/or RuleSets) to be applied to a
        source.
    name (str): Reference label for the processing method.
        Defaults to 'Processor'

Methods:
    process(self, item, context)->RuleResult:
    reader(self, buffered_source, context):
    read(self, buffered_source, context):
        a generator function, accepting a source text stream
            and yielding the processed text. Defaults to None, which sets
            a basic csv parser.


### Combine the folder tables (*DataFrames*) into one large table

The folder tables (*DataFrames*) can be combined to produce one large table

In [19]:
def combine_dataframes(df_list):
    # Remove empty DataFrames from the list
    valid_df = [frm for frm in df_list if not frm.empty]
    
    # Combine the folder tables
    folder_table = pd.concat(valid_df, ignore_index=True)
    
    # Set a table index
    folder_table.set_index(['Folder', 'Name'], inplace=True)
    return folder_table

Iterate through all of the `Dir` output text until it is completed, obtaining a table of information for each folder.  Then combine all of the folder tables into one large table.

In [20]:
dir_text_iter = (txt for txt in dir_text)
folder_list = list()
gen_status = inspect.getgeneratorstate(dir_text_iter)
# Keep reading folder sections until the source generator is exhausted
while gen_status != inspect.GEN_CLOSED:
    folder_list.append(dir_section.read(dir_text_iter))
    gen_status = inspect.getgeneratorstate(dir_text_iter)
combine_dataframes(folder_list)

| Folder | Name | NumFiles | DateModified | IsDir | FileSize |
|---|---|---|---|---|---|
| Test Dir Structure | Dir1 | 4 | 2021-12-27 04:03 PM | True | |
| Test Dir Structure | Dir2 | 4 | 2021-12-27 05:27 PM | True | |
| Test Dir Structure | TestFile1.txt | 4 | 2016-02-25 09:59 PM | False | 3.0 |
| Test Dir Structure | TestFile2.rtf | 4 | 2016-02-15 06:46 PM | False | 7.0 |
| Test Dir Structure | TestFile3.docx | 4 | 2016-02-15 06:47 PM | False | 0.0 |
| Test Dir Structure | xcopy.txt | 4 | 2016-04-21 01:06 PM | False | 3491.0 |
| Dir1 | File in Dir One.txt | 1 | 2016-02-15 06:48 PM | False | 0.0 |
| Dir1 | SubFolder1 | 1 | 2021-12-27 03:45 PM | True | |
| Dir1 | SubFolder2 | 1 | 2021-12-27 03:45 PM | True | |
| SubFolder1 | File in SubFolder One.rtf | 1 | 2016-03-19 09:26 AM | False | 7.0 |
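
The loop above depends on `inspect.getgeneratorstate`, and one detail of standard Python generators is worth keeping in mind: a generator only reports `GEN_CLOSED` after something has tried to read past its last item. The sketch below uses a tiny stand-in generator to show this. It suggests why a final read can come back empty, which is presumably the case `combine_dataframes` guards against when it drops empty DataFrames.

```python
import inspect

source = (line for line in ['a', 'b'])

states = [inspect.getgeneratorstate(source)]       # nothing read yet
next(source)
next(source)                                       # the last item has been read ...
states.append(inspect.getgeneratorstate(source))   # ... but the generator is not closed
leftover = list(source)                            # one more read finds nothing
states.append(inspect.getgeneratorstate(source))

print(states)    # ['GEN_CREATED', 'GEN_SUSPENDED', 'GEN_CLOSED']
print(leftover)  # []
```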


## Sections of Sections

Sections can be composed of other sections.
In this simple example the `all_folders` Section is obtained from the sequence of dir_section objects that can be extracted from the `Dir` output text, and the `combine_dataframes` function is given as the assemble method.

In [21]:
all_folders = sections.Section(
    processor=dir_section,
    assemble=combine_dataframes)

The iteration over all folders then reduces to one line:

In [22]:
all_folders.read(dir_text)

| Folder | Name | NumFiles | DateModified | IsDir | FileSize |
|---|---|---|---|---|---|
| Test Dir Structure | Dir1 | 4 | 2021-12-27 04:03 PM | True | |
| Test Dir Structure | Dir2 | 4 | 2021-12-27 05:27 PM | True | |
| Test Dir Structure | TestFile1.txt | 4 | 2016-02-25 09:59 PM | False | 3.0 |
| Test Dir Structure | TestFile2.rtf | 4 | 2016-02-15 06:46 PM | False | 7.0 |
| Test Dir Structure | TestFile3.docx | 4 | 2016-02-15 06:47 PM | False | 0.0 |
| Test Dir Structure | xcopy.txt | 4 | 2016-04-21 01:06 PM | False | 3491.0 |
| Dir1 | File in Dir One.txt | 1 | 2016-02-15 06:48 PM | False | 0.0 |
| Dir1 | SubFolder1 | 1 | 2021-12-27 03:45 PM | True | |
| Dir1 | SubFolder2 | 1 | 2021-12-27 03:45 PM | True | |
| SubFolder1 | File in SubFolder One.rtf | 1 | 2016-03-19 09:26 AM | False | 7.0 |
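
The idea of an outer section whose items are themselves sections can be pictured in plain Python as a generator of groups feeding a combining function. This is a conceptual sketch only; `iter_groups` and `count_entries` are hypothetical helpers and do not reflect how Sectionary nests sections internally.

```python
def iter_groups(lines, start, end):
    """Yield one list of lines per group, from a `start` line through an `end` line."""
    group = None
    for line in lines:
        if group is None:
            if start in line:
                group = [line]        # a new group begins
        else:
            group.append(line)
            if end in line:
                yield group           # the group is complete
                group = None


def count_entries(groups):
    """Assemble step: map each group's first line to its number of entries."""
    # Each group has one heading line and one summary line besides its entries.
    return {group[0].strip(): len(group) - 2 for group in groups}


sample = [
    'header',
    ' Directory of A', 'a.txt', 'b.txt', ' 2 File(s)',
    '',
    ' Directory of B', 'c.txt', ' 1 File(s)',
]
summary = count_entries(iter_groups(sample, 'Directory of', 'File(s)'))
print(summary)  # {'Directory of A': 2, 'Directory of B': 1}
```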


### The processor argument

`processor` is an optional argument that instructs the Section reader to apply a sequence of functions to the section items.

`processor` (ProcessMethodOptions, optional): Instructions for
    processing the section items.  *processor* can be
    a SectionProcessor instance, a Section instance, or a
    list of Section instances. It can also be omitted (or `None`),
    in which case the section will use the default SectionProcessor.

In [6]:
?sections.ProcessingMethods

[1;31mInit signature:[0m
[0msections[0m[1;33m.[0m[0mProcessingMethods[0m[1;33m([0m[1;33m
[0m    [0mprocessing_methods[0m[1;33m:[0m [1;34m'List[ProcessMethodOptions]'[0m [1;33m=[0m [1;32mNone[0m[1;33m,[0m[1;33m
[0m    [0mname[0m[1;33m=[0m[1;34m'Processor'[0m[1;33m,[0m[1;33m
[0m[1;33m)[0m[1;33m[0m[1;33m[0m[0m
[1;31mDocstring:[0m     
Applies a series of functions to a supplied sequence of items.

Processing Methods combines a series of functions, generator functions,
Rules, and/or Rule Sets (Processes) to produce a single generator function.
The generator function will iterate through a supplied source of items
returning the final processed item. The output type of each Process must
match the expected input type of the next Process in the series.  No
validation tests are done on this.

A Process applied to a Source (a sequence of SourceItems) results in
a sequence of ProcessedItems.  The relation between SourceItems and
ProcessedItems is not n

sections.ProcessingMethods(
    processing_methods: 'List[ProcessMethodOptions]' = None,
    name='Processor',
)
Docstring:     
Applies a series of functions to a supplied sequence of items.

Processing Methods combines a series of functions, generator functions,
Rules, and/or Rule Sets (Processes) to produce a single generator function.
The generator function will iterate through a supplied source of items
returning the final processed item. The output type of each Process must
match the expected input type of the next Process in the series.  No
validation tests are done on this.

A Process applied to a Source (a sequence of SourceItems) results in
a sequence of ProcessedItems.  The relation between SourceItems and
ProcessedItems is not necessarily 1:1.
   1 SourceItem ≠1 ProcessedItem;
      • 1 SourceItem → 1 ProcessedItem
      • 1 SourceItem → 2+ ProcessedItems
      • 2+ SourceItems → 1 ProcessedItem

Generator functions are used when multiple input items are
required to generate an output item, or when one SourceItem results in
multiple ProcessedItems. In general, regular functions are used when there
is a one-to-one correspondence between input item and output item.  RuleSets
are used when the function that should be applied to the SourceItem(s)
depends on the result of one or more tests (Triggers).  Individual Rules can
be used when only a single Trigger is required (by using both the Pass and
Fail methods of the Rule) or to modify some of the SourceItems while leaving
others unchanged (by setting the Fail method to 'Original').  For Rules or
RuleSets it is important that the output is of the same type regardless of
whether the Trigger(s) pass or fail.

Processing functions should accept one of the following argument sets:
    func(item)
    func(item, **context)
    func(item, context)
    func(item, [other(s),] **context)

Arguments:
    processing_methods (ProcessGroup): The sequence of Processes (functions,
        generator functions, Rules, and/or RuleSets) to be applied to a
        source.
    name (str): Reference label for the processing method.
        Defaults to 'Processor'

Methods:
    process(self, item, context)->RuleResult:
    reader(self, buffered_source, context):
    read(self, buffered_source, context):
        a generator function, accepting a source text stream
            and yielding the processed text. Defaults to None, which sets
            a basic csv parser.
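
The three SourceItem-to-ProcessedItem cases listed in the docstring above can be sketched with plain Python, independent of the `sections` classes. The function names below (`to_upper`, `split_words`, `pair_up`) are hypothetical and only illustrate when a regular function is enough and when a generator function is needed:

```python
from typing import Iterable, Iterator, Tuple


def to_upper(item: str) -> str:
    # Regular function: 1 SourceItem -> 1 ProcessedItem.
    return item.upper()


def split_words(source: Iterable[str]) -> Iterator[str]:
    # Generator function: 1 SourceItem -> 2+ ProcessedItems.
    for item in source:
        yield from item.split()


def pair_up(source: Iterable[str]) -> Iterator[Tuple[str, str]]:
    # Generator function: 2 SourceItems -> 1 ProcessedItem.
    # A trailing unpaired item is dropped in this sketch.
    iterator = iter(source)
    for first in iterator:
        second = next(iterator, None)
        if second is not None:
            yield (first, second)


lines = ['name alice', 'name bob']
upper = (to_upper(line) for line in lines)   # str -> str
words = split_words(upper)                   # str -> several str
pairs = list(pair_up(words))                 # two str -> one tuple
print(pairs)
```

Each stage consumes the output of the previous one, so the item type a stage yields has to match what the next stage expects. This is the same constraint the docstring places on a series of Processes.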


### Rule and RuleSets
Instead of having one function `summarize_directory()` that manages all possible 
text lines in the folder section, the function can be broken down into parts by 
defining *Rules*.  This modular design has a number of advantages:
1. The same *Rule* can be re-used for other similar Sections.
2. The specific condition(s) and action(s) of the *Rule* are separated from the 
code required to implement them.
3. *Rule* definitions can be placed in logical groups for clarity.


#### Rule
The ***Rule*** class defines the action to take on an item depending on the result of a test.

A *Rule* is defined by:
1. A *condition*, which defines a test to apply to an object.
2. One or both of:
    1. A pass_method: a function applied if the condition passes
    2. A fail_method: a function applied if the condition fails.

The *condition* can be:
- A string or list of strings.  The condition will pass if the item being tested 
matches with the string (or with any of the strings in the list).
- A compiled regular expression pattern or list of patterns. The condition 
will pass if the pattern (or one of the patterns in the list) successfully 
matches in the item being tested. 
- A function or list of functions. The condition will pass if the function 
(or one of the functions in the list) returns a non-blank value (anything 
other than None, '' or []) when applied to the item being tested.

The *location* argument is a sentinel modifier that applies to string or
regular expression sentinels. location can be one of:

|location    | str test                  | re.Pattern test          |
|------------|---------------------------|--------------------------|
|    IN      | sentinel in item          | sentinel.search(item)    |
|    START   | item.startswith(sentinel) | sentinel.match(item)     |
|    END     | item.endswith(sentinel)   | NotImplementedError      |
|    FULL    | sentinel == item          | sentinel.fullmatch(item) |
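
The location table above can be read as a small dispatch on the sentinel type. The sketch below illustrates that table in plain Python; it is not the library's implementation, and `sentinel_test` is a hypothetical name:

```python
import re
from typing import Union

Sentinel = Union[str, re.Pattern]


def sentinel_test(sentinel: Sentinel, item: str, location: str = 'IN'):
    """Apply a string or compiled-pattern sentinel to item (sketch only)."""
    if isinstance(sentinel, str):
        # String sentinels: the "str test" column of the table.
        str_tests = {
            'IN': lambda: sentinel in item,
            'START': lambda: item.startswith(sentinel),
            'END': lambda: item.endswith(sentinel),
            'FULL': lambda: sentinel == item,
        }
        return str_tests[location]()
    if location == 'END':
        # The table lists no regular-expression equivalent for END.
        raise NotImplementedError('END is not available for patterns')
    # Pattern sentinels: the "re.Pattern test" column of the table.
    pattern_tests = {
        'IN': sentinel.search,
        'START': sentinel.match,
        'FULL': sentinel.fullmatch,
    }
    return pattern_tests[location](item)


header = ' Directory of c:\\users'
count_pattern = re.compile('[0-9]+ File')

print(sentinel_test('Directory of', header))            # substring test
print(sentinel_test('Directory of', header, 'START'))   # leading space present
print(sentinel_test(count_pattern, '  4 File(s)', 'IN'))
```

String tests return a bool, while pattern tests return a match object or None, so a caller that only needs pass or fail would check the truthiness of the result.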

Both pass_method and fail_method should have one of the following
argument signatures:
- rule_method(item: SourceItem)
- rule_method(item: SourceItem, **context)
- rule_method(item: SourceItem, event: TriggerEvent)
- rule_method(item: SourceItem, event: TriggerEvent, **context)

Both pass_method and fail_method should return the same data type. No
checking is done to validate this.

In addition to a callable, the pass, fail and default attributes can be
the names of standard actions:

|String    | Resulting Action                         |
|----------|------------------------------------------|
|'Original'| return the item being tested.            |
|'Event'   | return the self.event object.            |
|'Value'   | return the self.event.test_value object. |
|'Name'    | return the self.event.test_name object.  |
|'None'    | return None                              |
|'Blank'   | return ''  (an empty string)             |
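
The standard action names amount to a lookup from a string to a simple return value. A minimal sketch of that lookup follows; the `Event` dataclass and `standard_action` function are hypothetical stand-ins for illustration, not the library's classes:

```python
from dataclasses import dataclass
from typing import Any


@dataclass
class Event:
    # Hypothetical stand-in for the event object referred to in the table.
    test_name: str = ''
    test_value: Any = None


def standard_action(name: str, item: Any, event: Event) -> Any:
    """Return the result the named standard action would give (sketch)."""
    actions = {
        'Original': lambda: item,
        'Event': lambda: event,
        'Value': lambda: event.test_value,
        'Name': lambda: event.test_name,
        'None': lambda: None,
        'Blank': lambda: '',
    }
    return actions[name]()


event = Event(test_name='Skip <DIR> Rule', test_value=' <DIR> ')
line = '2016-02-25  09:59 PM    <DIR>          Dir1'
print(standard_action('Blank', line, event) == '')
print(standard_action('Original', line, event) == line)
```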

#### RuleSet
The ***RuleSet*** class is composed of a sequence of *Rules* where one, and only one, of the *Rules* is applied.  The *Rules* are tested in order and once a *Rule* passes, no further *Rules* are applied.  If none of the *Rules* in the *RuleSet* pass, then a *Default Action* is applied.  In essence, a *RuleSet* acts like a chain of `if` / `elif` statements with a concluding `else`.
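
The `elif` analogy can be made concrete with a few lines of plain Python. This is a sketch of the first-match-wins behaviour only; `apply_rule_set`, `label_file` and the sample lines are assumptions for illustration, not the `RuleSet` implementation:

```python
from typing import Any, Callable, List, Tuple

# Each simplified rule is a (condition, action) pair.
SimpleRule = Tuple[Callable[[str], bool], Callable[[str], Any]]


def apply_rule_set(rules: List[SimpleRule],
                   default: Callable[[str], Any],
                   item: str) -> Any:
    # Rules are tried in order; the first passing rule wins (if / elif).
    for condition, action in rules:
        if condition(item):
            return action(item)
    # No rule passed, so the default action is applied (else).
    return default(item)


rules: List[SimpleRule] = [
    (lambda line: 'Directory of' in line,
     lambda line: ['Folder', line.rsplit('\\', 1)[1]]),
    (lambda line: 'File(s)' in line,
     lambda line: ['NumFiles', line.strip().split(' ', 1)[0]]),
    (lambda line: '<DIR>' in line,
     lambda line: ['Subdirectory', line.split('<DIR>', 1)[1].strip()]),
]


def label_file(line: str) -> List[str]:
    # Default action: treat the last whitespace-separated token as the name.
    return ['File', line.split()[-1]]


sample_lines = [
    ' Directory of c:\\users\\Test Dir Structure',
    '2016-02-25  09:59 PM    <DIR>          Dir1',
    '2016-02-25  09:59 PM                 3 TestFile1.txt',
    '               4 File(s)           3501 bytes',
]
labelled = [apply_rule_set(rules, label_file, line) for line in sample_lines]
for row in labelled:
    print(row)
```

Every action, including the default, returns a two-item list, which keeps the output type the same no matter which rule fires.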

A regular expression is used to parse the file and directory listing text.

#### File Listing Regular Expression
- Begins with a numeric date and time.
- Allows for different date delimiters: `-`, `/` or `\`.
- Time ends with a possible AM or PM.
- 2 to 20 spaces at the beginning of the line.
- Possible `<DIR>` text as a directory indicator.
- Possible integer for *FileSize*.
- Name is all remaining characters in the line.

In [3]:
import re

# Raw strings are used so that regex escapes such as \s are not treated as
# (invalid) Python string escapes.
file_listing = re.compile(
    r'^'                # Start of line
    # Date and Time
    r'(?P<DateTime>'    # Beginning of DateTime group
    r'[0-9 \-/\\:]+'    # Numeric date and time with different possible delimiters
    r'(?:[AaPp][Mm])?'  # Possible AM or PM
    r')'                # End of DateTime group

    r'\s+'              # Whitespace

    # Directory Indicator
    r'(?P<Directory>'   # Beginning of Directory group
    r'<DIR>'            # <DIR> text
    r')?'               # End of optional Directory group

    r'\s{6,}'           # Minimum of 6 spaces

    # FileSize
    r'(?P<FileSize>'    # Beginning of FileSize group
    r'[0-9]+'           # An integer number
    r')?'               # End of optional FileSize group

    r'\s+'              # Whitespace

    # File or Directory Name
    r'(?P<Name>'        # Beginning of Name group
    r'.*'               # All remaining characters
    r')'                # End of Name group

    # End of Line
    r'$'                # End of line
    )

In [25]:
dir_section = sections.Section(
    start_section='Directory of',
    end_section=sections.SectionBreak('File(s)', break_offset='After'))

a = dir_section.read(dir_text)

In [31]:
b = file_listing.fullmatch(a[6])
b.groupdict()

{'DateTime': '2016-02-25  09:59 PM',
 'Directory': None,
 'FileSize': '3',
 'Name': 'TestFile1.txt'}
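
All of the values returned by `groupdict()` are strings (or `None`), so a typical next step is converting them to useful types. The following is a minimal sketch of that step, assuming the date layout shown in the output above (year-month-day with a 12-hour clock); `to_record` is a hypothetical helper, not part of `sections`:

```python
from datetime import datetime

# Values copied from the groupdict() output shown above.
parsed = {
    'DateTime': '2016-02-25  09:59 PM',
    'Directory': None,
    'FileSize': '3',
    'Name': 'TestFile1.txt',
}


def to_record(groups: dict) -> dict:
    """Convert the regex group strings to typed values (sketch only)."""
    # Collapse runs of spaces so a single format string can be used.
    date_text = ' '.join(groups['DateTime'].split())
    return {
        'Name': groups['Name'],
        'IsDir': groups['Directory'] is not None,
        'Modified': datetime.strptime(date_text, '%Y-%m-%d %I:%M %p'),
        # Directories have no size, so FileSize may be None or ''.
        'FileSize': int(groups['FileSize']) if groups['FileSize'] else None,
    }


record = to_record(parsed)
print(record)
```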

In [None]:
def dir_name_split(line):
    return ['Folder', line.rsplit('\\', 1)[1]]
dir_name_rule = sections.Rule('Directory of', pass_method=dir_name_split)

def file_count_split(line):
    return ['NumFiles', line.strip().split(' ', 1)[0]]
file_count_rule = sections.Rule('File(s)', pass_method=file_count_split)

def subfolder(line):
    return ['Subdirectory:', line[36:]]
subfolder_rule = sections.Rule('<DIR>', pass_method=subfolder)

def file(line):
    # Return a [label, value] pair, matching the other rule methods
    return ['File:', line[36:]]

dir_process = sections.RuleSet([dir_name_rule, file_count_rule, subfolder_rule], 
                      default=file)

# The first line contains the folder name: 
#         Directory of c:\\users\\...\\Test Dir Structure
folder_name = dir_list[0].rsplit('\\', 1)[1]

# The last line contains the number of files
file_count = dir_list[-1].strip().split(' ', 1)[0]

# Process the rest of the lines 
# (There is a blank line between the folder name and the first)
folder_list = list()
for dir_line in dir_list[2:-1]:
    # Include the folder info for each listing
    folder_dict = {
        'Folder': folder_name,
        'NumFiles': file_count
        }
    
    # First 20 characters contain the date
    folder_dict['DateModified'] = dir_line[:20].strip()
    
    # Ending Characters, starting at #39 contain the name of 
    # the file or directory
    folder_dict['Name'] = dir_line[39:].strip()
    
    # Check if the listing is a file or directory
    # Directories will contain the text <DIR>
    if '<DIR>' in dir_line:
        folder_dict['IsDir'] = True
        # No FileSize for directories
        folder_dict['FileSize'] = ''
    else:
        folder_dict['IsDir'] = False
        # The File Size is given in characters 29 to 38
        folder_dict['FileSize'] = dir_line[29:38].strip()
        
    # ignore the `.` and `..` entries (current directory and parent directory)
    if folder_dict['Name'] not in ['.', '..']:
        folder_list.append(folder_dict)
# Convert the list to a Pandas Dataframe for easy viewing
folder_data = pd.DataFrame(folder_list)

In [None]:
dir_section = Section(start_section='Directory of', 
                      end_section=SectionBreak('File(s)', break_offset='After'),
                      processor=ProcessingMethods([dir_process]))

output = dir_section.read(dir_text)
for line in output:
    print(line)

In [None]:
dir_section = Section(start_section='Directory of', 
                      end_section=SectionBreak('File(s)', break_offset='After'),
                      processor=[dir_process])

output = dir_section.read(dir_text)
for line in output:
    print(line)

In [None]:
def dir_name_split(line):
    # Get the directory name
    if 'Directory of' in line:
        return ['Folder Name:', line.rsplit('\\', 1)[1]]
    return line

def subfolder(line):
    # Label the subdirectories
    if '<DIR>' in line:
        return ['Subdirectory:', line[36:]]
    return line

def file_count_split(line):
    # Label the file counts
    if 'File(s)' in line:
        return ['Number of Files:', line.strip().split(' ', 1)[0]]
    return line

def file(line):
    # Label the files
    # Return a [label, value] pair, matching the other functions
    return ['File:', line[36:]]

In [None]:
print('column index')
print(''.join(str(i)*10 for i in range(10)))
print(''.join(str(i) for i in range(10))*10)
print(dir_text[9])
    

In [None]:


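#%% Imports
# `tp` is the text-processing helper module used throughout this script; it is
# assumed to have been imported already.  `xw` is assumed to be xlwings.
import re
from pathlib import Path
from pprint import pprint
from typing import List

import pandas as pd
import xlwings as xw

from sections import ProcessingMethods, Rule, RuleSet, Section, SectionBreak


#%% Regex Parsing patterns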
# File Count and summary:
     #          1 File(s)          59904 bytes
     #         23 Dir(s)     63927545856 bytes free
folder_summary_pt = re.compile(
    '(?P<files>'       # beginning of files string group
    '[0-9]+'           # Integer number of files
    ')'                # end of files string group
    '[ ]+'             # Arbitrary number of spaces
    '(?P<type>'        # beginning of type string group
    'File|Dir'         # "File" or "Dir" text
    ')'                # end of type string group
    '\\(s\\)'          # "(s)" text
    '[ ]+'             # Arbitrary number of spaces
    '(?P<size>'        # beginning of size string group
    '[0-9]+'           # Integer size of folder
    ')'                # end of size string group
    ' bytes'           # "bytes" text
    )
date_pattern = tp.build_date_re(compile_re=False)
file_listing_pt = re.compile(
    f'{date_pattern}'  # Insert date pattern
    '[ ]+'             # Arbitrary number of spaces
    '(?P<size>'        # beginning of size string group
    '[0-9]+'           # Integer size of file
    ')'                # end of size string group
    ' '                # Single space
    '(?P<filename>'    # beginning of filename string group
    '.*'               # All remaining characters (the file name)
    ')'                # end of filename string group
    '$'                # end of string
    )


#%% Line Parsing Functions
# Directory Label Rule

def extract_directory(line: str, event, *args,
                    context=None, **kwargs) -> List[List[str]]:
    '''Extract Directory path from folder header.
    '''
    full_dir = line.replace('Directory of', '').strip()
    return [full_dir]


dir_header_rule = Rule(
    name='Dir Header Rule',
    sentinel='Directory of ',
    pass_method=extract_directory
    )


# skip <DIR>
def blank_line(*args, **kwargs) -> List[List[str]]:
    return [['']]


skip_dir_rule = Rule(
    name='Skip <DIR> Rule',
    sentinel=' <DIR> ',
    pass_method='Blank'
    )
skip_totals_rule = Rule(
    name='Skip Total Files Header Rule',
    sentinel='Total Files Listed:',
    pass_method='Blank'
    )


# Regular file listings
def file_parse(line: str, event, *args, **kwargs) -> List[List[str]]:
    '''Break file data into three columns containing Filename, Date, Size.

    Typical file is:
        2016-02-25  22:59     3 TestFile1.txt
    File line is parsed using a regular expression with 3 named groups.
    Output for the example above is:
        [[TestFile1.txt , 2016-02-25  22:59, 3]]

    Args:
        line (str): The text line to be parsed.
        event (re.match): The results of the trigger test on the line.
            Contains 3 named groups: ['date', 'size', 'filename'].
        *args & **kwargs: Catch unused extra parameters passed to file_parse.

    Returns:
        tp.ParseResults: A one-item list containing the parsed file
            information as a 3-item tuple:
                [(filename: str, date: str, file size: int)].
    '''
    file_line_parts = event.test_value.groupdict(default='')
    parsed_line = tuple([
        file_line_parts['filename'],
        tp.make_date_time_string(event),
        int(file_line_parts['size'])
        ])
    return parsed_line


# Regular File Parsing Rule
file_listing_rule = Rule(file_listing_pt, pass_method=file_parse,
                            name='Files_rule')


# File Count Parsing Rule
def file_count_parse(line: str, event, *args, **kwargs) -> List[List[str]]:
    '''Break file data into two rows containing:
           Number of files, & Directory size.

    Output has the following format:
        ['Number of files', file count value: int]
        ['Directory Size', directory size value: int]

    Typical line is:
        4 File(s)           3501 bytes
    File count is parsed using a regular expression with 3 named groups.

    Args:
        line (str): The text line to be parsed.
        event (re.match): The results of the trigger test on the line.
            Contains 3 named groups: ['files', 'type', 'size'].
        *args & **kwargs: Catch unused extra parameters passed to
            file_count_parse.

    Returns:
        tp.ParseResults: The parsed file information.
            The parsed file information consists of two lines with the
            following format:
                'Number of files', file count value: int
                'Directory Size', directory size value: int
    '''
    # The regex match is held in event.test_value, as in file_parse above
    file_count_parts = event.test_value.groupdict(default='')
    # Manage case where bytes free is given:
    # 23 Dir(s)     63927545856 bytes free
    if line.strip().endswith('free'):
        file_count_parts['size_label'] = 'Free Space'
    else:
        file_count_parts['size_label'] = 'Size'
    parsed_line_template = ''.join([
        'Number of {type}s, {files}\n',
        'Directory {size_label}, {size}'
        ])
    parsed_line_str = parsed_line_template.format(**file_count_parts)
    parsed_line = [new_line.split(',')
                   for new_line in parsed_line_str.splitlines()]
    return parsed_line
file_count_rule = Rule(folder_summary_pt, pass_method=file_count_parse,
                          name='File_Count_rule')


skip_file_count_rule = Rule(
    name='Skip File(s) Rule',
    sentinel=folder_summary_pt,
    pass_method='Blank'
    )


# Files / DIRs Parse
def make_files_rule() -> Rule:
    '''If File(s) or Dir(s), extract the number of files and the size.
    '''
    def files_total_parse(line, event, *args, **kwargs) -> List[List[str]]:
        '''Break file counts into three columns containing:
           Type (File or Dir), Count, Size.

        The line:
               11 File(s)          72507 bytes
        Results in:
            [('File', 11, 72507)]
        The line:
               23 Dir(s)     63927545856 bytes free
        Results in:
            [('Dir', 23, 63927545856)]

        Args:
            line (str): The text line to be parsed.
            event (re.match): The results of the trigger test on the line.
                Contains 3 named groups: ['type', 'files', 'size'].
            *args & **kwargs: Catch unused extra parameters passed to
                files_total_parse.

        Returns:
            tp.ParseResults: A one-item list containing the parsed file count
                information as a 3-item tuple:
                    [(Type: str (File or Dir), Count: int, Size: int)].
        '''
        files_dict = event.test_value.groupdict(default='')
        # Convert count and size to int, as described in the docstring
        parsed_line = tuple([
            files_dict["type"],
            int(files_dict["files"]),
            int(files_dict["size"])
            ])
        return [parsed_line]

    files_total_rule = Rule(folder_summary_pt,
                               pass_method=files_total_parse,
                               name='Files_Total_rule')
    return files_total_rule


default_csv = tp.define_csv_parser('dir_files', delimiter=':',
                                       skipinitialspace=True)


#%% Line Processing
def print_lines(parsed_list):
    output = list()
    for item in parsed_list:
        pprint(item)
        output.append(item)
    return output


def to_folder_dict(folder_list):
    '''Combine folder info into dictionary.
    '''
    # TODO separate directory info from file info
    #The first line in the folder list is the directory path
    directory = ''
    if folder_list:
        d_list = folder_list[0]
        if d_list:
            directory = d_list[0]
    folder_dict = {'Directory': directory}
    for folder_info in folder_list[1:]:
        filename, date, file_size = folder_info
        full_path = '\\'.join([directory, filename])
        file_parts = filename.rsplit('.', 1)
        if len(file_parts) > 1:
            extension = file_parts[1]
        else:
            extension = ''
        folder_dict = {
            'Path': full_path,
            'Directory': directory,
            'Filename': filename,
            'Extension': extension,
            'Date': date,
            'Size': file_size
            }
    return folder_dict


def make_files_table(dir_gen):
    '''Combine folder info dictionaries into Pandas DataFrame.
    '''
    list_of_folders = list(dir_gen)
    files_table = pd.DataFrame(list_of_folders)
    # set_index returns a new DataFrame, so keep the result
    files_table = files_table.set_index('Path')
    return files_table


#%% Reader definitions
default_parser = tp.define_csv_parser('dir_files', delimiter=':',
                                       skipinitialspace=True)
heading_reader = ProcessingMethods([
    default_parser,
    tp.trim_items
    ])
folder_reader = ProcessingMethods([
    RuleSet([skip_dir_rule, file_listing_rule, dir_header_rule,
             skip_file_count_rule], default=default_parser),
    tp.drop_blanks
    ])
summary_reader = ProcessingMethods([
    RuleSet([file_count_rule, skip_totals_rule], default=default_parser),
    tp.drop_blanks
    ])


#%% SectionBreak definitions
folder_start = SectionBreak(
    name='Start of Folder', sentinel='Directory of', break_offset='Before')
folder_end = SectionBreak(name='End of Folder',sentinel=folder_summary_pt,
                             break_offset='After')
summary_start = SectionBreak(sentinel='Total Files Listed:',
                                name='Start of DIR Summary', break_offset='Before')


#%% Section definitions
header_section = Section(
    name='Header',
    start_section=None,
    end_section=folder_start,
    processor=heading_reader,
    assemble=print_lines
    )
folder_section = Section(
    name='Folder',
    start_section=folder_start,
    end_section=folder_end,
    processor=folder_reader,
    assemble=to_folder_dict
    )
all_folder_section = Section(
    name='All Folders',
    start_section=folder_start,
    end_section=summary_start,
    processor=[folder_section],
    assemble=make_files_table
    )
summary_section = Section(
    name='Summary',
    start_section=summary_start,
    end_section=None,
    processor=summary_reader,
    assemble=tp.to_dict
    )


#%% Main Iteration
def main():
    # Test File
    base_path = Path.cwd() / 'examples'
    test_file = base_path / 'test_DIR_Data.txt'

    # Call Primary routine
    context = {
        'File Name': test_file.name,
        'File Path': test_file.parent,
        'top_dir': str(base_path),
        'tree_name': 'Test folder Tree'
        }

    source = tp.file_reader(test_file)
    file_info = all_folder_section.read(source, context)
    #summary = summary_section.read(source, **context)

    # Output  Data
    xw.view(file_info)
    print('done')

if __name__ == '__main__':
    main()
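
As a quick check of the folder summary pattern, the two sample lines quoted in the comments above `folder_summary_pt` can be run through it. The sketch below rebuilds an equivalent pattern so that it runs on its own; the name `summary_pattern` is used only here:

```python
import re

# Equivalent to folder_summary_pt, written with raw strings.
summary_pattern = re.compile(
    r'(?P<files>[0-9]+)'    # Integer number of files or directories
    r'[ ]+'                 # Arbitrary number of spaces
    r'(?P<type>File|Dir)'   # "File" or "Dir" text
    r'\(s\)'                # "(s)" text
    r'[ ]+'                 # Arbitrary number of spaces
    r'(?P<size>[0-9]+)'     # Integer size
    r' bytes'               # "bytes" text
)

samples = [
    '         1 File(s)          59904 bytes',
    '        23 Dir(s)     63927545856 bytes free',
]

# search() is used because the pattern is not anchored to the line start.
results = [summary_pattern.search(line).groupdict() for line in samples]
for result in results:
    print(result)
```

The pattern matches both the `File(s)` and the `Dir(s)` lines, so a method using it needs the `type` group (or the trailing `free` text) to tell them apart.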

In [None]:
a =dir_text[3]
a.index('\\')
a.rsplit('\\', 1)
#'Folder Name:\t' + a.rsplit('\\', 1)[0]

In [None]:
ones = ''.join([str(i) for i in range(10)])
ones*8
tens = ''.join([str(i)*10 for i in range(8)])
tens
print(tens)
print(ones*8)

00000000001111111111222222222233333333334444444444555555555566666666667777777777
01234567890123456789012345678901234567890123456789012345678901234567890123456789
