# Advanced Processing
In this tutorial we introduce some more advanced tools for section processing.

1. Processing using a generator function
2. Rules and Rule Sets
3. Regular expressions for processing text
4. The FixedWidth and csv Parsing tools
5. Making use of the Context dictionary
    1. Setting context values when calling Section.read
    2. Accessing default context values
6. Passing other parameters to the processing methods.

## Processing with a generator function

A simple processor function takes a single item from the section as input and
returns one processed item for each input item.

The relation between individual input (section) items and the resulting 
processed items is not necessarily 1:1. A single input item could be broken up 
into multiple processed items, or conversely, multiple section items could be 
converted into one processed item:

- 1 SourceItem → 1 ProcessedItem
- 1 SourceItem → 2+ ProcessedItems
- 2+ SourceItems → 1 ProcessedItem

In general, regular functions are only used when there is a one-to-one 
correspondence between input item and output item.  Generator functions are used 
when multiple input items are required to generate a processed item, or when one 
input item results in multiple processed items.   

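The three cases can be sketched in plain Python. This is a minimal illustration that does not use the Section class; the function names are hypothetical, and it assumes that a generator-based processor receives the whole sequence of items and yields the processed items.

```python
from typing import Iterable, Iterator, List


def to_upper(item: str) -> str:
    """Regular function: one input item gives one output item."""
    return item.upper()


def split_words(items: Iterable[str]) -> Iterator[str]:
    """Generator: one input item may yield several output items."""
    for item in items:
        for word in item.split():
            yield word


def join_pairs(items: Iterable[str]) -> Iterator[str]:
    """Generator: two input items are combined into one output item."""
    pending: List[str] = []
    for item in items:
        pending.append(item)
        if len(pending) == 2:
            yield ' '.join(pending)
            pending = []
    if pending:  # Emit a trailing unpaired item rather than dropping it.
        yield pending[0]


lines = ['alpha beta', 'gamma']
print([to_upper(line) for line in lines])   # ['ALPHA BETA', 'GAMMA']
print(list(split_words(lines)))             # ['alpha', 'beta', 'gamma']
print(list(join_pairs(['a', 'b', 'c'])))    # ['a b', 'c']
```

Only the generator versions can change the number of items, because a regular function returns exactly once per call.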
## Rules and RuleSets
Instead of having one function `process_directory()` that manages all possible 
text lines in the section, the function can be broken down into parts by 
defining *Rules*.

#### Rules
Rules define an action to take on an item depending on the result of a test.

A *Rule* definition has two parts:
1. Trigger:
   > Defines the test to be applied to the source item
   > Trigger related arguments:
   > - sentinel
   >   - For string items, sentinel can be a string or compiled regular expression.
   > - location
   >   - A sentinel modifier that applies to str or re.Pattern sentinels. One of ['IN', 'START', 'END', 'FULL', None]. The default is None, which is treated as 'IN'.

2. Action
   > Defines the actions to take depending on the Trigger outcome.
   > Action related arguments:
   > - pass_method
   > - fail_method
   >
   > Both take a function, or the name of a standard action, to be applied if the test passes or fails respectively.
   >
   > The pass_method and fail_method functions can be simple process functions, with one positional argument and additional keyword arguments. The functions can also take a second positional argument, *event*, which allows the function to access information about the test results. This is particularly useful when the sentinel is a regular expression.
   >
   > pass_method and fail_method can also be a string with the name of one of the standard actions. The most common are:
   > - 'Original': return the original item unchanged
   > - 'None': return None
   > - 'Blank': return ''  (an empty string)

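The trigger-and-action idea can be illustrated with a small stand-in written in plain Python. This is not the library's `Rule` class: `sentinel_test` and `apply_rule` are hypothetical helpers, only string sentinels are handled, the meanings given to 'START', 'END' and 'FULL' are assumptions based on their names, and defaulting both methods to 'Original' is a choice made for this sketch.

```python
from typing import Callable, Optional, Union

# Assumed behaviour of the standard action names listed above.
STANDARD_ACTIONS = {
    'Original': lambda item: item,
    'None': lambda item: None,
    'Blank': lambda item: '',
}


def sentinel_test(sentinel: str, item: str,
                  location: Optional[str] = None) -> bool:
    """Test a string sentinel against an item (regex sentinels omitted)."""
    location = location or 'IN'  # None is treated as 'IN'.
    if location == 'IN':
        return sentinel in item
    if location == 'START':
        return item.startswith(sentinel)
    if location == 'END':
        return item.endswith(sentinel)
    if location == 'FULL':
        return item == sentinel
    raise ValueError(f'Unknown location: {location!r}')


def apply_rule(item: str, sentinel: str, location: Optional[str] = None,
               pass_method: Union[str, Callable] = 'Original',
               fail_method: Union[str, Callable] = 'Original'):
    """Run pass_method or fail_method depending on the sentinel test."""
    passed = sentinel_test(sentinel, item, location)
    method = pass_method if passed else fail_method
    if isinstance(method, str):
        method = STANDARD_ACTIONS[method]  # Look up a named standard action.
    return method(item)


header = ' Directory of c:\\temp\\Test Dir'
print(apply_rule(header, 'Directory of', pass_method=str.strip))
print(repr(apply_rule('12/27  <DIR>  Dir1', '<DIR>', pass_method='Blank')))  # ''
print(apply_rule('plain line', '<DIR>', pass_method='Blank'))  # plain line
```

In the last call the test fails, so the fail method ('Original' in this sketch) returns the item unchanged.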

#### RuleSets
RuleSets combine related Rules to provide multiple choices for actions.
RuleSets are used when the function that should be applied to the SourceItem(s)
depends on the result of one or more tests (Triggers). Individual Rules can
be used when only a single Trigger is required (by using both the Pass and
Fail methods of the Rule) or to modify some of the SourceItems while leaving
others unchanged (by setting the Fail method to 'Original'). For Rules or
RuleSets it is important that the output is of the same type regardless of
whether the Trigger(s) pass or fail.

- A RuleSet takes a sequence of Rules and a default method.
- Each Rule in the sequence is applied to the input until one of the Rules triggers. At that point the sequence ends.
- If no Rule triggers, the default method is applied.
- Each of the Rules (and the default) should expect the same input type and should produce the same output type.
- The default method can be any valid process function or standard action.

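The "first Rule that triggers wins, otherwise use the default" behaviour described in the list above can be sketched without the library. This is a hypothetical stand-in, not the `RuleSet` implementation: each rule is reduced to a (test, action) pair, and all helper names are made up for illustration.

```python
from typing import Callable, List, Tuple

# Each hypothetical rule is a (test, action) pair.
SimpleRule = Tuple[Callable[[str], bool], Callable[[str], str]]


def apply_rule_set(item: str, rules: List[SimpleRule],
                   default: Callable[[str], str]) -> str:
    """Apply the action of the first rule whose test passes."""
    for test, action in rules:
        if test(item):
            return action(item)  # Stop at the first rule that triggers.
    return default(item)         # No rule triggered.


def folder_label(line: str) -> str:
    return 'Folder: ' + line.split('Directory of')[1].strip()


def subdir_label(line: str) -> str:
    return 'Subdirectory: ' + line.split('<DIR>')[1].strip()


def file_label(line: str) -> str:
    return 'File: ' + line.split()[-1]


rules = [
    (lambda line: 'Directory of' in line, folder_label),
    (lambda line: '<DIR>' in line, subdir_label),
]

sample = [
    ' Directory of c:\\temp\\Test',
    '2021-12-27  04:03 PM    <DIR>          Dir1',
    '2016-04-21  01:06 PM              3491 xcopy.txt',
]
for line in sample:
    print(apply_rule_set(line, rules, default=file_label))
```

Every action, including the default, takes a string and returns a string, which keeps the output type the same whichever branch is taken.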


*Triggers*, *TriggerEvent*, *Rules* and *RuleSets* will be covered in more detail in a separate tutorial.
        

### Convert the Process Directory Function into Rules
The `process_directory` function consists of a set of `if` statements, each of which calls a different function. Each `if` statement can be converted into its own rule.

#### Get the directory name
```
if 'Directory of' in dir_line:
    output_line = dir_name_split(dir_line)
```
**Becomes the Rule:**

In [None]:
dir_name_rule = Rule('Directory of', pass_method=dir_name_split)

#### Label the subdirectories
```
elif '<DIR>' in dir_line:
    output_line = get_subfolder_name(dir_line)
```
**Becomes the Rule:**

In [None]:
subfolder_rule = Rule('<DIR>', pass_method=get_subfolder_name)

#### Label the file counts
```
elif 'File(s)' in dir_line:
    output_line = file_count_split(dir_line)
```
**Becomes the Rule:**

In [None]:
file_count_rule = Rule('File(s)', pass_method=file_count_split)

#### Label the files
```
else:
    output_line = get_file_name(dir_line)

```
This is not converted into a rule because there is no conditional. Instead it becomes the default method for a *RuleSet*:

In [None]:
dir_process = RuleSet([dir_name_rule, subfolder_rule, file_count_rule], 
                      default=get_file_name)

#### New Dir Section Definition

In [None]:
dir_section = Section(start_section='Directory of', 
                      end_section=SectionBreak('File(s)', break_offset='After'),
                      processor=[dir_process])

output = dir_section.read(dir_text)
for line in output:
    print(line)

Folder Name:	Test Dir Structure
	File:		
	Subdirectory:	   .
	Subdirectory:	   ..
	Subdirectory:	   Dir1
	Subdirectory:	   Dir2
	File:		 3 TestFile1.txt
	File:		 7 TestFile2.rtf
	File:		 0 TestFile3.docx
	File:		91 xcopy.txt
Number of Files:	4


## Regex-based processing

In [None]:


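#%% Imports
# Rule, RuleSet, Section, SectionBreak, ProcessingMethods and tp are assumed
# to have been imported earlier in the notebook.
import re
from pathlib import Path
from pprint import pprint
from typing import List

import pandas as pd
import xlwings as xw  # Assumed source of the xw.view() call used in main()

#%% Regex Parsing patterns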
# File Count and summary:
     #          1 File(s)          59904 bytes
     #         23 Dir(s)     63927545856 bytes free
folder_summary_pt = re.compile(
    '(?P<files>'       # beginning of files string group
    '[0-9]+'           # Integer number of files
    ')'                # end of files string group
    '[ ]+'             # Arbitrary number of spaces
    '(?P<type>'        # beginning of type string group
    'File|Dir'         # "File" or " Dir" text
    ')'                # end of type string group
    '\\(s\\)'          # "(s)" text
    '[ ]+'             # Arbitrary number of spaces
    '(?P<size>'        # beginning of size string group
    '[0-9]+'           # Integer size of folder
    ')'                # end of size string group
    ' bytes'           # "bytes" text
    )
date_pattern = tp.build_date_re(compile_re=False)
file_listing_pt = re.compile(
    f'{date_pattern}'  # Insert date pattern
    '[ ]+'             # Arbitrary number of spaces
    '(?P<size>'        # beginning of size string group
    '[0-9]+'           # Integer size of file
    ')'                # end of size string group
    ' '                # Single space
    '(?P<filename>'    # beginning of filename string group
    '.*'               # Remaining text is the file name
    ')'                # end of filename string group
    '$'                # end of string
    )


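To see what the named groups capture, the summary pattern can be exercised on its own with the standard `re` module. The compact pattern below is intended to be equivalent to `folder_summary_pt`; the name `summary_pt` is hypothetical, and the use of `search` is an assumption about how a trigger applies the pattern to a line.

```python
import re

# Compact form of the folder summary pattern, written as a raw string.
summary_pt = re.compile(
    r'(?P<files>[0-9]+)[ ]+(?P<type>File|Dir)\(s\)[ ]+(?P<size>[0-9]+) bytes'
)

sample_lines = [
    '               4 File(s)           3501 bytes',
    '              23 Dir(s)     63927545856 bytes free',
]
for line in sample_lines:
    match = summary_pt.search(line)
    print(match.groupdict())
# {'files': '4', 'type': 'File', 'size': '3501'}
# {'files': '23', 'type': 'Dir', 'size': '63927545856'}
```

All captured values are strings, so any numeric use requires an explicit `int()` conversion.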

In [None]:

#%% Line Parsing Functions
# Directory Label Rule

def extract_directory(line: str, event, *args,
                      context=None, **kwargs) -> List[str]:
    '''Extract the directory path from a folder header line.
    '''
    full_dir = line.replace('Directory of', '').strip()
    return [full_dir]


dir_header_rule = Rule(
    name='Dir Header Rule',
    sentinel='Directory of ',
    pass_method=extract_directory
    )


# skip <DIR>
def blank_line(*args, **kwargs) -> List[List[str]]:
    return [['']]


skip_dir_rule = Rule(
    name='Skip <DIR> Rule',
    sentinel=' <DIR> ',
    pass_method='Blank'
    )
skip_totals_rule = Rule(
    name='Skip Total Files Header Rule',
    sentinel='Total Files Listed:',
    pass_method='Blank'
    )


# Regular file listings
def file_parse(line: str, event, *args, **kwargs) -> List[List[str]]:
    '''Break file data into three columns containing Filename, Date, Size.

    Typical file is:
        2016-02-25  22:59     3 TestFile1.txt
    File line is parsed using a regular expression with 3 named groups.
    Output for the example above is:
        ('TestFile1.txt', '2016-02-25  22:59', 3)

    Args:
        line (str): The text line to be parsed.
        event (re.match): The results of the trigger test on the line.
            Contains 3 named groups: ['date', 'size', 'filename'].
        *args & **kwargs: Catch unused extra parameters passed to file_parse.

    Returns:
        tuple: The parsed file information as a 3-item tuple:
            (filename: str, date: str, file size: int).
    '''
    file_line_parts = event.test_value.groupdict(default='')
    parsed_line = tuple([
        file_line_parts['filename'],
        tp.make_date_time_string(event),
        int(file_line_parts['size'])
        ])
    return parsed_line


# Regular File Parsing Rule
file_listing_rule = Rule(file_listing_pt, pass_method=file_parse,
                            name='Files_rule')


# File Count Parsing Rule
def file_count_parse(line: str, event, *args, **kwargs) -> List[List[str]]:
    '''Break file data into two rows containing:
           Number of files, & Directory size.

    Output has the following format:
        ['Number of files', file count value: int]
        ['Directory Size', directory size value: int]

    Typical line is:
        4 File(s)           3501 bytes
    File count is parsed using a regular expression with 3 named groups.

    Args:
        line (str): The text line to be parsed.
        event (re.match): The results of the trigger test on the line.
            Contains 3 named groups: ['files', 'type', 'size'].
        *args & **kwargs: Catch unused extra parameters passed to
            file_count_parse.

    Returns:
        tp.ParseResults: The parsed file information.
            The parsed file information consists of two lines with the
            following format:
                'Number of files', file count value: int
                'Directory Size', directory size value: int
    '''
    # Use event.test_value for the regex match, as the other parsers here do.
    file_count_parts = event.test_value.groupdict(default='')
    # Manage case where bytes free is given:
    # 23 Dir(s)     63927545856 bytes free
    if line.strip().endswith('free'):
        file_count_parts['size_label'] = 'Free Space'
    else:
        file_count_parts['size_label'] = 'Size'
    parsed_line_template = ''.join([
        'Number of {type}s, {files}\n',
        'Directory {size_label}, {size}'
        ])
    parsed_line_str = parsed_line_template.format(**file_count_parts)
    parsed_line = [new_line.split(',')
                   for new_line in parsed_line_str.splitlines()]
    return parsed_line


file_count_rule = Rule(folder_summary_pt, pass_method=file_count_parse,
                       name='File_Count_rule')


skip_file_count_rule = Rule(
    name='Skip File(s) Rule',
    sentinel=folder_summary_pt,
    pass_method='Blank'
    )


# Files / DIRs Parse
def make_files_rule() -> Rule:
    '''If File(s) or Dir(s), extract the number of files and the size.
    '''
    def files_total_parse(line, event, *args, **kwargs) -> List[tuple]:
        '''Break file counts into three columns containing:
           Type (File or Dir), Count, Size.

        The line:
               11 File(s)          72507 bytes
        Results in:
            [('File', 11, 72507)]
        The line:
           23 Dir(s)     63927545856 bytes free
        Results in:
            [('Dir', 23, 63927545856)]

        Args:
            line (str): The text line to be parsed.
            event (re.match): The results of the trigger test on the line.
                Contains 3 named groups: ['type', 'files', 'size'].
            *args & **kwargs: Catch unused extra parameters passed to
                files_total_parse.

        Returns:
            tp.ParseResults: A one-item list containing the parsed file count
                information as a 3-item tuple:
                    [(Type: str (File or Dir), Count: int, Size: int)].
        '''
        files_dict = event.test_value.groupdict(default='')
        # Convert the count and size to int, as the docstring promises.
        parsed_line = (
            files_dict["type"],
            int(files_dict["files"]),
            int(files_dict["size"])
            )
        return [parsed_line]

    files_total_rule = Rule(folder_summary_pt,
                            pass_method=files_total_parse,
                            name='Files_Total_rule')
    return files_total_rule


default_csv = tp.define_csv_parser('dir_files', delimiter=':',
                                       skipinitialspace=True)


#%% Line Processing
def print_lines(parsed_list):
    output = list()
    for item in parsed_list:
        pprint(item)
        output.append(item)
    return output


def to_folder_dict(folder_list):
    '''Combine folder info into a list of dictionaries, one per file.
    '''
    # TODO separate directory info from file info
    # The first line in the folder list is the directory path
    directory = ''
    if folder_list:
        d_list = folder_list[0]
        if d_list:
            directory = d_list[0]
    folder_files = []
    for folder_info in folder_list[1:]:
        filename, date, file_size = folder_info
        full_path = '\\'.join([directory, filename])
        file_parts = filename.rsplit('.', 1)
        extension = file_parts[1] if len(file_parts) > 1 else ''
        # Collect every file; the original loop kept only the last one.
        folder_files.append({
            'Path': full_path,
            'Directory': directory,
            'Filename': filename,
            'Extension': extension,
            'Date': date,
            'Size': file_size
            })
    return folder_files


def make_files_table(dir_gen):
    '''Combine the file dictionaries from all folders into a Pandas DataFrame.
    '''
    # Flatten the per-folder lists into a single list of file dictionaries.
    list_of_files = [file_info for folder in dir_gen for file_info in folder]
    files_table = pd.DataFrame(list_of_files)
    # set_index returns a new DataFrame, so the result must be kept.
    files_table = files_table.set_index('Path')
    return files_table


#%% Reader definitions
default_parser = tp.define_csv_parser('dir_files', delimiter=':',
                                       skipinitialspace=True)
heading_reader = ProcessingMethods([
    default_parser,
    tp.trim_items
    ])
folder_reader = ProcessingMethods([
    RuleSet([skip_dir_rule, file_listing_rule, dir_header_rule,
             skip_file_count_rule], default=default_parser),
    tp.drop_blanks
    ])
summary_reader = ProcessingMethods([
    RuleSet([file_count_rule, skip_totals_rule], default=default_parser),
    tp.drop_blanks
    ])


#%% SectionBreak definitions
folder_start = SectionBreak(
    name='Start of Folder', sentinel='Directory of', break_offset='Before')
folder_end = SectionBreak(
    name='End of Folder', sentinel=folder_summary_pt, break_offset='After')
summary_start = SectionBreak(
    name='Start of DIR Summary', sentinel='Total Files Listed:',
    break_offset='Before')


#%% Section definitions
header_section = Section(
    section_name='Header',
    start_section=None,
    end_section=folder_start,
    processor=heading_reader,
    aggregate=print_lines
    )
folder_section = Section(
    section_name='Folder',
    start_section=folder_start,
    end_section=folder_end,
    processor=folder_reader,
    aggregate=to_folder_dict
    )
all_folder_section = Section(
    section_name='All Folders',
    start_section=folder_start,
    end_section=summary_start,
    processor=[folder_section],
    aggregate=make_files_table
    )
summary_section = Section(
    section_name='Summary',
    start_section=summary_start,
    end_section=None,
    processor=summary_reader,
    aggregate=tp.to_dict
    )


#%% Main Iteration
def main():
    # Test File
    base_path = Path.cwd() / 'examples'
    test_file = base_path / 'test_DIR_Data.txt'

    # Call Primary routine
    context = {
        'File Name': test_file.name,
        'File Path': test_file.parent,
        'top_dir': str(base_path),
        'tree_name': 'Test folder Tree'
        }

    source = tp.file_reader(test_file)
    file_info = all_folder_section.read(source, context)
    #summary = summary_section.read(source, **context)

    # Output  Data
    xw.view(file_info)
    print('done')

if __name__ == '__main__':
    main()

In [None]:
print('column index')
print(''.join(str(i)*10 for i in range(10)))
print(''.join(str(i) for i in range(10))*10)
print(dir_text[9])
    

In [None]:
a = dir_text[3]
a.index('\\')
a.rsplit('\\', 1)
#'Folder Name:\t' + a.rsplit('\\', 1)[0]

## Fixed Width Parser

The main part of a directory listing is formatted into fixed-width columns:

In [None]:
print('column index')
print(''.join(str(i)*10 for i in range(7)))
print(''.join(str(i) for i in range(10))*7)
#print(''.join(divider))
print(dir_text[7])
print(dir_text[12])

column index
0000000000111111111122222222223333333333444444444455555555556666666666
0123456789012345678901234567890123456789012345678901234567890123456789
2021-12-27  04:03 PM    <DIR>          Dir1
2016-04-21  01:06 PM              3491 xcopy.txt


In [None]:
column_breaks = [11, 20, 29, 38]

divider_list = ['.']*70
for brk in column_breaks:
    divider_list[brk] = '|'
divider = ''.join(divider_list)

In [None]:
print('column breaks')
#print(''.join(str(i)*10 for i in range(7)))
#print(''.join(str(i) for i in range(10))*7)
print(divider)

for line in dir_text[6:13]:
    print(line)
    
print(divider)

column breaks
...........|........|........|........|...............................
2021-12-27  03:33 PM    <DIR>          ..
2021-12-27  04:03 PM    <DIR>          Dir1
2021-12-27  05:27 PM    <DIR>          Dir2
2016-02-25  09:59 PM                 3 TestFile1.txt
2016-02-15  06:46 PM                 7 TestFile2.rtf
2016-02-15  06:47 PM                 0 TestFile3.docx
2016-04-21  01:06 PM              3491 xcopy.txt
...........|........|........|........|...............................


In [None]:
a = ['Part 1', 'Part 2a', 'Part 2b']
b = tp.FixedWidthParser([4,3])
[item for item in b.parser(a)]


[['Part', ' 1'], ['Part', ' 2a'], ['Part', ' 2b']]

In [None]:
b.parse(a[1])

['Part', ' 2a']

In [None]:
a = tp.FixedWidthParser(locations=[20,30,39])
a.parse(dir_text[12])

['2016-04-21  01:06 PM', '          ', '    3491 ', 'xcopy.txt']

In [None]:
b = tp.define_fixed_width_parser(locations=[20,30,39])
b(dir_text[8])

<generator object FixedWidthParser.parser at 0x00000203D1DEE6D0>

In [None]:
list(b(dir_text[8]))

[['2021-12-27  05:27 PM', '    <DIR> ', '         ', 'Dir2']]

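The location-based splitting shown in the outputs above can be approximated with plain string slicing. This is a sketch of the idea only, not the `FixedWidthParser` implementation; `split_at` is a hypothetical helper, it covers only the `locations` style of definition, and the sample line is rebuilt from the `xcopy.txt` listing line.

```python
from typing import List


def split_at(line: str, locations: List[int]) -> List[str]:
    """Slice line at each column index in locations."""
    bounds = [0] + list(locations) + [len(line)]
    return [line[start:end] for start, end in zip(bounds[:-1], bounds[1:])]


# 20 characters of date and time, 14 spaces, then the size and file name.
line = '2016-04-21  01:06 PM' + ' ' * 14 + '3491 xcopy.txt'
print(split_at(line, [20, 30, 39]))
# ['2016-04-21  01:06 PM', '          ', '    3491 ', 'xcopy.txt']
```

The pieces keep their padding, so a follow-up step that trims whitespace is usually needed before the values are used.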
In [None]:
# Define Functions
def dir_name_split(dir_line):
    output_dict = {'Folder Name': dir_line.rsplit('\\', 1)[1]}
    return output_dict
def file_count_split(dir_line):
    output_dict = {'Number of Files': dir_line.strip().split(' ', 1)[0]}
    return output_dict
def get_subfolder_name(dir_line):
    output_dict = {'Subdirectory': dir_line[36:]}
    return output_dict
def get_file_name(dir_line):
    output_dict = {'File': dir_line[36:]}
    return output_dict

# Define Rules
dir_name_rule = Rule('Directory of', pass_method=dir_name_split)
subfolder_rule = Rule('<DIR>', pass_method=get_subfolder_name)
file_count_rule = Rule('File(s)', pass_method=file_count_split)

#Define Rule Set
dir_process = RuleSet([dir_name_rule, subfolder_rule, file_count_rule], 
                      default=get_file_name)


In [None]:
for line in dir_text[0:20]:
    print('\t', line)

	  Volume in drive C is Windows
	  Volume Serial Number is DAE7-D5BA
	 
	  Directory of c:\users\...\Test Dir Structure
	 
	 2021-12-27  03:33 PM    <DIR>          .
	 2021-12-27  03:33 PM    <DIR>          ..
	 2021-12-27  04:03 PM    <DIR>          Dir1
	 2021-12-27  05:27 PM    <DIR>          Dir2
	 2016-02-25  09:59 PM                 3 TestFile1.txt
	 2016-02-15  06:46 PM                 7 TestFile2.rtf
	 2016-02-15  06:47 PM                 0 TestFile3.docx
	 2016-04-21  01:06 PM              3491 xcopy.txt
	                4 File(s)           3501 bytes
	 
	  Directory of c:\users\...\Test Dir Structure\Dir1
	 
	 2021-12-27  04:03 PM    <DIR>          .
	 2021-12-27  04:03 PM    <DIR>          ..
	 2016-02-15  06:48 PM                 0 File in Dir One.txt
