# Example: Output from Windows Dir command
This tutorial demonstrates the main features of the Sectionary package with a simple example: parsing the output of the Windows `DIR` command.

### Imports

#### Standard Python Modules

In [39]:
from typing import List
from pathlib import Path
from pprint import pprint
import re
import sys

#### Useful Third Party Packages

In [40]:
import pandas as pd
import xlwings as xw

#### Sectionary Imports

In [41]:
#sys.path.append(r'../src/sectionary') 

import text_reader as tp
from sections import Rule, RuleSet, SectionBreak, ProcessingMethods, Section

## The Sample `Dir` Output

The Windows `dir` command displays a list of a directory's files and subdirectories.  
Its output will be used to showcase some of the features of the *sectionary* package.

Adding switches (options) to the `dir` command controls what it displays and the format of the output.
In these examples we will be using the command line:

`DIR "Test Dir Structure" /S /N /-C /T:W >  "test_DIR_Data.txt"`

| Switch | Description                                                                                              |
|--------|----------------------------------------------------------------------------------------------------------|
| /S     | Lists every occurrence of the specified file name within the specified directory and all subdirectories. |
| /N     | Displays a long list format with file names on the far right of the screen.                              |
| /-C    | Hides the thousand separator in file sizes.                                                              |
| /T:W   | Selects the time field to display or sort by; `W` means "Last written".                                  |
| >      | Redirect the output to the specified file.                                                               |

For more information, see [DIR Command Syntax](https://docs.microsoft.com/en-us/windows-server/administration/windows-commands/dir)

In [42]:
test_file = Path.cwd() / 'examples' / 'test_DIR_Data.txt'
dir_text = test_file.read_text().splitlines()

### `Dir` Output Structure

The first 20 lines of the directory listing are:

In [43]:
for line in dir_text[0:20]:
    print('\t', line)

	  Volume in drive C is Windows
	  Volume Serial Number is DAE7-D5BA
	 
	  Directory of c:\users\...\Test Dir Structure
	 
	 2021-12-27  03:33 PM    <DIR>          .
	 2021-12-27  03:33 PM    <DIR>          ..
	 2021-12-27  04:03 PM    <DIR>          Dir1
	 2021-12-27  05:27 PM    <DIR>          Dir2
	 2016-02-25  09:59 PM                 3 TestFile1.txt
	 2016-02-15  06:46 PM                 7 TestFile2.rtf
	 2016-02-15  06:47 PM                 0 TestFile3.docx
	 2016-04-21  01:06 PM              3491 xcopy.txt
	                4 File(s)           3501 bytes
	 
	  Directory of c:\users\...\Test Dir Structure\Dir1
	 
	 2021-12-27  04:03 PM    <DIR>          .
	 2021-12-27  04:03 PM    <DIR>          ..
	 2016-02-15  06:48 PM                 0 File in Dir One.txt


We want to ignore the first two lines (the *header section*):

In [44]:
for line in dir_text[0:2]:
    print('\t', line)

	  Volume in drive C is Windows
	  Volume Serial Number is DAE7-D5BA


After the header come multiple folder sections, each of which looks something like this:

In [45]:
print(dir_text[3][0:23], '...', dir_text[3][-19:])
print()
for line in dir_text[5:9]:
    print(line)
print(dir_text[13])

 Directory of c:\users\ ... \Test Dir Structure

2021-12-27  03:33 PM    <DIR>          .
2021-12-27  03:33 PM    <DIR>          ..
2021-12-27  04:03 PM    <DIR>          Dir1
2021-12-27  05:27 PM    <DIR>          Dir2
               4 File(s)           3501 bytes


## Define a Section (section breaks)

### Define a Section Based on start and end identifiers:

The start and end of the folder listing can be identified by key phrases:
- The section start is identified by the text '*Directory of*'
- The section end is identified by the text '*File(s)*'

In [46]:
dir_section = Section(start_section='Directory of', end_section='File(s)')
dir_section.read(dir_text)

[' Directory of c:\\users\\...\\Test Dir Structure',
 '',
 '2021-12-27  03:33 PM    <DIR>          .',
 '2021-12-27  03:33 PM    <DIR>          ..',
 '2021-12-27  04:03 PM    <DIR>          Dir1',
 '2021-12-27  05:27 PM    <DIR>          Dir2',
 '2016-02-25  09:59 PM                 3 TestFile1.txt',
 '2016-02-15  06:46 PM                 7 TestFile2.rtf',
 '2016-02-15  06:47 PM                 0 TestFile3.docx',
 '2016-04-21  01:06 PM              3491 xcopy.txt']

### SectionBreak objects

`dir_section.read(dir_text)` returned the first folder listing in *dir_text*.
However, it is missing the final line:

In [47]:
print(dir_text[13])

               4 File(s)           3501 bytes


To include this line, we need to define the `end_section` to end *After* the specified text.  We include this information by explicitly creating a `SectionBreak` object:

In [48]:
dir_section = Section(start_section='Directory of', 
                      end_section=SectionBreak('File(s)', break_offset='After'))

dir_section.read(dir_text)

[' Directory of c:\\users\\...\\Test Dir Structure',
 '',
 '2021-12-27  03:33 PM    <DIR>          .',
 '2021-12-27  03:33 PM    <DIR>          ..',
 '2021-12-27  04:03 PM    <DIR>          Dir1',
 '2021-12-27  05:27 PM    <DIR>          Dir2',
 '2016-02-25  09:59 PM                 3 TestFile1.txt',
 '2016-02-15  06:46 PM                 7 TestFile2.rtf',
 '2016-02-15  06:47 PM                 0 TestFile3.docx',
 '2016-04-21  01:06 PM              3491 xcopy.txt',
 '               4 File(s)           3501 bytes']
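
The difference between ending *Before* and *After* the matching line can be pictured with a plain-Python sketch. This is not how Sectionary is implemented; `extract_section` and its `include_end` flag are hypothetical names used only to illustrate the idea on a few sample lines.

```python
from typing import List


def extract_section(lines: List[str], start: str, end: str,
                    include_end: bool = False) -> List[str]:
    """Return lines from the first `start` match up to the first `end` match.

    Hypothetical helper for illustration only; not part of Sectionary.
    include_end=False mimics ending *Before* the matching line,
    include_end=True mimics ending *After* it.
    """
    section = []
    in_section = False
    for line in lines:
        if not in_section:
            if start in line:
                in_section = True  # The section begins with the matching line.
                section.append(line)
            continue
        if end in line:
            if include_end:
                section.append(line)
            break
        section.append(line)
    return section


sample = [
    ' Volume in drive C is Windows',
    ' Directory of c:\\users\\...\\Test Dir Structure',
    '2021-12-27  04:03 PM    <DIR>          Dir1',
    '               4 File(s)           3501 bytes',
    '',
]
before = extract_section(sample, 'Directory of', 'File(s)')
after = extract_section(sample, 'Directory of', 'File(s)', include_end=True)
print(len(before), len(after))
```

With `include_end=True` the "File(s)" line becomes the last item of the section; otherwise the section stops one line earlier.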

## Iterating through multiple sections

`dir_text` is a list, so `dir_section.read(dir_text)` starts over at the beginning each time it is called.

In [49]:
dir_section.read(dir_text)

[' Directory of c:\\users\\...\\Test Dir Structure',
 '',
 '2021-12-27  03:33 PM    <DIR>          .',
 '2021-12-27  03:33 PM    <DIR>          ..',
 '2021-12-27  04:03 PM    <DIR>          Dir1',
 '2021-12-27  05:27 PM    <DIR>          Dir2',
 '2016-02-25  09:59 PM                 3 TestFile1.txt',
 '2016-02-15  06:46 PM                 7 TestFile2.rtf',
 '2016-02-15  06:47 PM                 0 TestFile3.docx',
 '2016-04-21  01:06 PM              3491 xcopy.txt',
 '               4 File(s)           3501 bytes']

In [50]:
dir_section.read(dir_text)

[' Directory of c:\\users\\...\\Test Dir Structure',
 '',
 '2021-12-27  03:33 PM    <DIR>          .',
 '2021-12-27  03:33 PM    <DIR>          ..',
 '2021-12-27  04:03 PM    <DIR>          Dir1',
 '2021-12-27  05:27 PM    <DIR>          Dir2',
 '2016-02-25  09:59 PM                 3 TestFile1.txt',
 '2016-02-15  06:46 PM                 7 TestFile2.rtf',
 '2016-02-15  06:47 PM                 0 TestFile3.docx',
 '2016-04-21  01:06 PM              3491 xcopy.txt',
 '               4 File(s)           3501 bytes']

Creating an iterator from *dir_text* with `dir_text_iter = iter(dir_text)` 
(representing a text stream source) keeps track of the current position, so 
successive calls to `dir_section.read(dir_text_iter)` 
return the next directory group each time.

In [51]:
dir_text_iter = iter(dir_text)
dir_section.read(dir_text_iter)

[' Directory of c:\\users\\...\\Test Dir Structure',
 '',
 '2021-12-27  03:33 PM    <DIR>          .',
 '2021-12-27  03:33 PM    <DIR>          ..',
 '2021-12-27  04:03 PM    <DIR>          Dir1',
 '2021-12-27  05:27 PM    <DIR>          Dir2',
 '2016-02-25  09:59 PM                 3 TestFile1.txt',
 '2016-02-15  06:46 PM                 7 TestFile2.rtf',
 '2016-02-15  06:47 PM                 0 TestFile3.docx',
 '2016-04-21  01:06 PM              3491 xcopy.txt',
 '               4 File(s)           3501 bytes']

In [52]:
dir_section.read(dir_text_iter)

[' Directory of c:\\users\\...\\Test Dir Structure\\Dir1',
 '',
 '2021-12-27  04:03 PM    <DIR>          .',
 '2021-12-27  04:03 PM    <DIR>          ..',
 '2016-02-15  06:48 PM                 0 File in Dir One.txt',
 '2021-12-27  03:45 PM    <DIR>          SubFolder1',
 '2021-12-27  03:45 PM    <DIR>          SubFolder2',
 '               1 File(s)              0 bytes']
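
The list-versus-iterator behaviour is a general Python property rather than something specific to Sectionary. The sketch below uses a hypothetical stand-in, `read_one_section`, and made-up sample lines to show why a list restarts while a shared iterator resumes.

```python
from typing import Iterable, List


def read_one_section(source: Iterable[str], start: str, end: str) -> List[str]:
    """Hypothetical stand-in for a section read; not Sectionary's code."""
    section = []
    in_section = False
    for line in source:  # A list restarts here; an iterator resumes.
        if not in_section and start in line:
            in_section = True
        if in_section:
            section.append(line)
            if end in line:
                break
    return section


# Made-up sample lines in the same shape as the DIR output.
lines = [
    ' Directory of c:\\Top',
    '               4 File(s)           3501 bytes',
    ' Directory of c:\\Top\\Dir1',
    '               1 File(s)              0 bytes',
]

# Reading from the list twice returns the same first group both times.
first_from_list = read_one_section(lines, 'Directory of', 'File(s)')
again_from_list = read_one_section(lines, 'Directory of', 'File(s)')

# Reading from one shared iterator advances to the next group.
line_iter = iter(lines)
first_from_iter = read_one_section(line_iter, 'Directory of', 'File(s)')
second_from_iter = read_one_section(line_iter, 'Directory of', 'File(s)')
```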

## Section Processing

Once identified, a section's content can be *processed* before being returned.
Automatic processing of the items in a section's content is specified with the 
*processor* argument in the *Section* definition. 

The *processor* argument takes a list of functions, *Rules*, or *RuleSets*. If 
the processor argument is not given or is `None` the items in the section are 
returned as-is.  *Rules* and *RuleSets* will be discussed in the next section.

Processor functions have one required positional argument, the item to be 
processed.  In addition, the function may contain a second positional argument,
a *context* dictionary.  The *context* dictionary will be discussed in more
detail in a later section.  Additional keyword arguments may also be included.  
If the keyword matches a key in the section's *context*, the corresponding 
*context* value will be supplied.  Otherwise the keyword argument will be 
ignored.
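
One way to picture the keyword matching is with `inspect.signature` from the standard library. The sketch below is an assumption about the general idea only, not Sectionary's actual dispatch code; `call_with_context`, `label_line`, and the context keys are hypothetical, and the optional second positional *context* argument is left out for brevity.

```python
import inspect
from typing import Any, Callable, Dict


def call_with_context(func: Callable, item: Any, context: Dict[str, Any]) -> Any:
    """Call func(item, ...) passing only keyword arguments found in context.

    Hypothetical illustration; Sectionary's own mechanism may differ.
    """
    params = list(inspect.signature(func).parameters.values())
    keyword_kinds = (inspect.Parameter.POSITIONAL_OR_KEYWORD,
                     inspect.Parameter.KEYWORD_ONLY)
    kwargs = {}
    for param in params[1:]:  # Skip the required item argument.
        has_default = param.default is not inspect.Parameter.empty
        if param.kind in keyword_kinds and has_default and param.name in context:
            kwargs[param.name] = context[param.name]
    return func(item, **kwargs)


def label_line(line: str, prefix: str = 'Line:', unused: int = 0) -> str:
    """Example processor with two keyword arguments."""
    return f'{prefix}\t{line}'


# 'prefix' matches a keyword argument; 'other' matches nothing and is ignored.
context = {'prefix': 'Folder Name:', 'other': 123}
labelled = call_with_context(label_line, 'Dir1', context)
```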

The functions will be applied in list order with the input of the function being 
the output from the previous function.  This means that the expected input type 
of a processor function should be able to handle all possible output types from 
the previous function in the list.
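
The chaining described above can be sketched in plain Python. `apply_in_order`, `split_words`, and `count_items` are hypothetical names for illustration; note how each function must accept the output type of the one before it.

```python
from functools import reduce
from typing import Any, Callable, Iterable, List


def apply_in_order(item: Any, functions: Iterable[Callable[[Any], Any]]) -> Any:
    """Feed item through each function in turn (illustration only)."""
    return reduce(lambda value, func: func(value), functions, item)


def split_words(line: str) -> List[str]:
    """str in, list of str out."""
    return line.split()


def count_items(words: List[str]) -> int:
    """list in, int out; it must follow a function that returns a list."""
    return len(words)


line = '2016-04-21  01:06 PM              3491 xcopy.txt'
result = apply_in_order(line, [str.strip, split_words, count_items])
```

Swapping `split_words` and `count_items` would fail, because `len` of a string is an `int` and `int` has no `split` method.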

Processor functions may also be generator functions, in which case the required 
positional argument is the sequence to iterate over.  This can be useful if the 
processing involves skipping items or merging of multiple items.  Examples of 
this will be given in a separate tutorial.
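
As a rough preview, generator functions of the kind described above might look like the following. Both functions are hypothetical sketches written for this illustration, not taken from Sectionary.

```python
from typing import Iterable, Iterator


def drop_blank_lines(lines: Iterable[str]) -> Iterator[str]:
    """Generator-style processor sketch: yields only non-blank lines."""
    for line in lines:
        if line.strip():
            yield line


def merge_pairs(lines: Iterable[str]) -> Iterator[str]:
    """Generator-style processor sketch: joins each pair of items into one."""
    line_iter = iter(lines)
    for first in line_iter:
        second = next(line_iter, '')  # An odd final item is kept on its own.
        yield f'{first} {second}'.strip()


source = ['a', '', 'b', '   ', 'c']
kept = list(drop_blank_lines(source))
merged = list(merge_pairs(kept))
```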

### Processing Directory Listing Parts
There are 4 different text line types in a directory listing section as we have 
defined it.  
1. The directory path
2. Subdirectory listings
3. File listings
4. Number of files

Here we will write simple functions for each line type and a single processor 
function to handle all 4 types.

#### Directory Path
- The directory path line begins with the text *Directory of*:
> `Directory of c:\users\...\Test Dir Structure`
- Extract the directory name from the full path:
    1. Split the path at the last '\'. 
    2. Keep the right hand part after the split.<br>
    `text_line.rsplit('\\', 1)[1]`
- Return a tab delimited line with:
    - *Folder Name:* before the tab and 
    - The directory name after the tab
  
`output_line = 'Folder Name:\t' + dir_line.rsplit('\\', 1)[1]`

In [53]:
def dir_name_split(dir_line):
    output_line = 'Folder Name:\t' + dir_line.rsplit('\\', 1)[1]
    return output_line
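
A short walk-through of the `rsplit` step on the sample path line from the listing above may help. This only exercises standard `str` methods.

```python
sample_line = ' Directory of c:\\users\\...\\Test Dir Structure'

# rsplit with maxsplit=1 splits once, starting from the right-hand end.
parts = sample_line.rsplit('\\', 1)
folder_line = 'Folder Name:\t' + parts[1]
print(parts)
print(folder_line)
```

A line without any backslash would produce a one-item list, so indexing `[1]` would raise an `IndexError`; the sample data does not contain such a line.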

#### Number of Files
- The last line in the listing gives the number of files in the directory.
- That line contains the text *File(s)*:
> `	                4 File(s)           3501 bytes`
- Extract the number of files from the beginning of the line:
    1. Strip off the initial white space.
    2. Split the remaining text at the first space.
    3. Keep the left hand part before the split.<br>
    `text_line.strip().split(' ', 1)[0]`    
- Return a tab delimited line with:
    - The text *Number of Files:* before the tab and
    - The extracted number of files after the tab.

`output_line = 'Number of Files:\t' + dir_line.strip().split(' ', 1)[0]`

In [54]:
def file_count_split(dir_line):
    output_line = 'Number of Files:\t' + dir_line.strip().split(' ', 1)[0]
    return output_line
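
The strip-and-split steps can be checked on the sample "File(s)" line. The conversion to `int` at the end is an optional extra added for this illustration and is not part of the function above.

```python
sample_line = '               4 File(s)           3501 bytes'

stripped = sample_line.strip()          # Remove the leading white space.
count_text = stripped.split(' ', 1)[0]  # Text before the first space.
file_count = int(count_text)            # Optional conversion to a number.
print(repr(count_text), file_count)
```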

#### Subdirectories
- Lines containing a directory listing are indicated with the text *\<DIR\>*
> `2021-12-27  04:03 PM    <DIR>          Dir1`
- The name of the subdirectory begins at text column 36<br>
    `text_line[36:]`    
- Return a tab delimited line with:
    - An initial tab
    - The text *Subdirectory:* followed by another tab
    - The extracted name of the subdirectory.

`output_line = '\tSubdirectory:\t' + dir_line[36:]`

In [55]:
def get_subfolder_name(dir_line):
    output_line = '\tSubdirectory:\t' + dir_line[36:]
    return output_line

#### Files
- The remaining lines are assumed to contain file information.
> `2016-02-25  09:59 PM                 3 TestFile1.txt`
- The name of the file begins at text column 36<br>
    `text_line[36:]`    
- Return a tab delimited line with:
    - An initial tab
    - The text *File:* followed by another tab
    - The extracted name of the file.

`output_line = '\tFile:\t\t' + dir_line[36:]`

In [56]:
def get_file_name(dir_line):
    output_line = '\tFile:\t\t' + dir_line[36:]
    return output_line

#### Process Directory Function
Combine the above functions into one function that checks each line for its type and calls the appropriate function.

In [57]:
def process_directory(dir_line):
    # Get the directory name
    if 'Directory of' in dir_line:
        output_line = dir_name_split(dir_line)
    # Label the subdirectories
    elif '<DIR>' in dir_line:
        output_line = get_subfolder_name(dir_line)
    # Label the file counts
    elif 'File(s)' in dir_line:
        output_line = file_count_split(dir_line)
    # Label the files
    else:
        output_line = get_file_name(dir_line)
    return output_line 

#### New Dir Section Definition

In [58]:
dir_section = Section(start_section='Directory of',
                      end_section=SectionBreak('File(s)', break_offset='After'),
                      processor=[process_directory])

output = dir_section.read(dir_text)
for line in output:
    print(line)

Folder Name:	Test Dir Structure
	File:		
	Subdirectory:	   .
	Subdirectory:	   ..
	Subdirectory:	   Dir1
	Subdirectory:	   Dir2
	File:		 3 TestFile1.txt
	File:		 7 TestFile2.rtf
	File:		 0 TestFile3.docx
	File:		91 xcopy.txt
Number of Files:	4
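
In the output above the names carry a few extra leading characters, and the blank line after the directory path is labelled as a file. A quick check of the slice offset against two sample lines from the listing is sketched below; `name_column` is a hypothetical helper written for this check.

```python
dir_line = '2021-12-27  04:03 PM    <DIR>          Dir1'
file_line = '2016-04-21  01:06 PM              3491 xcopy.txt'


def name_column(line: str, name: str) -> int:
    """Return the index at which a known name starts in a listing line."""
    return line.rindex(name)


print(name_column(dir_line, 'Dir1'))
print(name_column(file_line, 'xcopy.txt'))
print(repr(dir_line[36:]), repr(dir_line[39:]))
print(repr(file_line[36:]), repr(file_line[39:]))
```

For these two lines the names start at index 39, so slicing at 36 keeps three characters that belong to the preceding column.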


# *DONE TO HERE*

1. Section Assembles
2. Subsections
3. Using the Context dictionary
    1. Setting context values when calling Section.read
    2. Accessing default context values
4. assemble options

## Sub-Sections
As mentioned above, there are 4 different text line types in a directory listing 
section *as we have defined it*.  However, we could define the directory listing section as a sequence of 3 sub-sections:
1. The directory path
2. Subdirectory and File listings
3. Number of files

Here we will write these three sub-section definitions and combine them as 
sub-sections in a directory listing section.

### Directory Name and File Count Processing Functions

As a reminder, a typical directory listing is shown below:

In [59]:
for line in dir_text[3:15]:
    print('\t', line)

	  Directory of c:\users\...\Test Dir Structure
	 
	 2021-12-27  03:33 PM    <DIR>          .
	 2021-12-27  03:33 PM    <DIR>          ..
	 2021-12-27  04:03 PM    <DIR>          Dir1
	 2021-12-27  05:27 PM    <DIR>          Dir2
	 2016-02-25  09:59 PM                 3 TestFile1.txt
	 2016-02-15  06:46 PM                 7 TestFile2.rtf
	 2016-02-15  06:47 PM                 0 TestFile3.docx
	 2016-04-21  01:06 PM              3491 xcopy.txt
	                4 File(s)           3501 bytes
	 


We will use the same directory name and file count processing functions as 
before, but for simplicity will combine the file and subdirectory processing 
functions.

In [60]:
def dir_name_split(dir_line: str) -> str:
    '''Extract the folder name from the full path.

    Args:
        dir_line (str): The directory path line from a DIR folder listing.
    
    Returns (str): A tab delimited line with 'Folder Name:' before the tab and
        the folder name after the tab.
    '''
    output_line = 'Folder Name:\t' + dir_line.rsplit('\\', 1)[1]
    return output_line
    

def file_count_split(dir_line: str) -> str:
    '''Extract the number of files from the "File(s)" DIR line.

    Args:
        dir_line (str): The "File(s)" line from a DIR folder listing.
    
    Returns (str): A tab delimited line with 'Number of Files:' before the tab 
        and the extracted number of files after the tab.
    '''
    output_line = 'Number of Files:\t' + dir_line.strip().split(' ', 1)[0]
    return output_line

#### File Names Processing Function

In [61]:
divider = ' '*39 + '|'

print(''.join(str(i)*10 for i in range(5)))
print(''.join(str(i) for i in range(10))*5)
print(divider)
print(dir_text[7])
print(dir_text[12])
print(divider)

00000000001111111111222222222233333333334444444444
01234567890123456789012345678901234567890123456789
                                       |
2021-12-27  04:03 PM    <DIR>          Dir1
2016-04-21  01:06 PM              3491 xcopy.txt
                                       |


- Lines in between the directory path line and the "File(s)" DIR line contain 
  subdirectory or file names.
- Lines containing a directory listing are indicated with the text '\<DIR\>'
    > `2021-12-27  04:03 PM    <DIR>          Dir1`
- Lines containing a file listing have the same format except without the 
  '\<DIR\>' text.
    > `2016-02-25  09:59 PM                 3 TestFile1.txt`
- The name of the file or subdirectory begins at text column 39
    > `text_line[39:]`   

In [62]:
def get_file_name(dir_line: str) -> str:
    '''Extract the name of the file or subdirectory from a DIR line.

    Args:
        dir_line (str): A main listing line from a DIR folder listing.
    
    Returns (str): A tab delimited line with 'File:' or 'Subdirectory:' before 
        the tab and the extracted name of the file or subdirectory after 
        the tab.
    '''
    if len(dir_line) < 39:  # This deals with blank lines.
        output_line = ''
    elif '<DIR>' in dir_line:  # Contains a subdirectory name.
        output_line = '\tSubdirectory:\t' + dir_line[39:]
    else:  # Contains a file name.
        output_line = '\tFile:\t\t' + dir_line[39:]
    return output_line
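
Fixed column offsets depend on the exact layout of the listing. As an alternative sketch, a regular expression can pick out the name without counting columns. The pattern below is inferred only from the sample lines in this tutorial (ISO-style date, 12-hour time) and is an assumption; `LISTING_PATTERN` and `parse_listing_line` are hypothetical names.

```python
import re
from typing import Optional, Tuple

# Pattern inferred from the sample lines in this tutorial only; other locales
# or DIR switches may produce a different layout.
LISTING_PATTERN = re.compile(
    r'^(?P<date>\d{4}-\d{2}-\d{2})\s+'
    r'(?P<time>\d{2}:\d{2} [AP]M)\s+'
    r'(?P<size><DIR>|\d+)\s+'
    r'(?P<name>.+)$'
)


def parse_listing_line(dir_line: str) -> Optional[Tuple[str, str]]:
    """Return (kind, name) for a listing line, or None if it does not match."""
    match = LISTING_PATTERN.match(dir_line)
    if match is None:
        return None  # Blank lines, path lines, and 'File(s)' lines.
    kind = 'Subdirectory' if match['size'] == '<DIR>' else 'File'
    return kind, match['name']


print(parse_listing_line('2021-12-27  04:03 PM    <DIR>          Dir1'))
print(parse_listing_line('2016-02-15  06:48 PM                 0 File in Dir One.txt'))
print(parse_listing_line(''))
```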

##### File and Sub-directory Section Start and End
- The subdirectory or file listings begin **after** the line containing the 
  text '*Directory of*'.
- The subdirectory or file listings end **before** the line containing the text 
  '*File(s)*'.

### Sub-Section Definitions
The start and end of the folder listing can be identified by key phrases:
- The section start is identified by the text '*Directory of*'
- The section end is identified by the text '*File(s)*'


- The directory path line is the first line in the folder listing, so we do not 
  need to define a `start_section`[1].
- All of the sub-sections we define for the folder listing section can be assumed 
  to begin immediately after the end of the previous sub-section. This means we
  do not need to define a `start_section` for any of the sub-sections.
- The `dir_path` and `file_count` sections are both single lines, so we wish 
  to stop after the first line.
- To do this, `end_section` is set to `True` so that it will unconditionally 
  stop after the first line.
- The `file_listing` section end is identified by the text '*File(s)*' on the 
  next line. The `SectionBreak` argument `break_offset` is set to '*Before*', which allows the next sub-section `file_count` to begin with that line.

[1]: **NOTE:** A section's `end_section` and any sub-section's `start_section` begin checking with the *second item* in the section.  If we set 
`start_section='Directory of'` for this sub-section, it would search for the 
*next* occurrence of 'Directory of', which doesn't exist, and so would return 
an empty sub-section.  This is a deliberate arrangement to make it possible to identify repeating sub-sections using identical `start_section` and 
`end_section`.
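
The intended three-way split can be pictured in plain Python, independent of Sectionary. The sketch below consumes one folder listing in order, so each part begins where the previous one ended; `split_folder_listing` is a hypothetical helper and assumes the first line is the path line.

```python
from itertools import islice
from typing import Dict, Iterable, List


def split_folder_listing(section_lines: Iterable[str]) -> Dict[str, List[str]]:
    """Split one folder listing into path, listing, and count parts.

    Illustration only; assumes the first item is the 'Directory of' line.
    """
    line_iter = iter(section_lines)
    parts: Dict[str, List[str]] = {
        'dir_path': list(islice(line_iter, 1)),  # Stop after the first line.
        'file_listing': [],
        'file_count': [],
    }
    for line in line_iter:
        if 'File(s)' in line:
            # Break *Before*: the matching line belongs to the next part.
            parts['file_count'] = [line]
            break
        parts['file_listing'].append(line)
    return parts


folder = [
    ' Directory of c:\\users\\...\\Test Dir Structure',
    '',
    '2021-12-27  04:03 PM    <DIR>          Dir1',
    '2016-04-21  01:06 PM              3491 xcopy.txt',
    '               4 File(s)           3501 bytes',
]
parts = split_folder_listing(folder)
```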


In [63]:
dir_path = Section(
    end_section=SectionBreak(True), 
    processor=[dir_name_split]
    )


file_count = Section(
    end_section=SectionBreak(True), 
    processor=[file_count_split]
    )


file_listing = Section(
    start_section=SectionBreak('Directory of', break_offset='After'),
    end_section=SectionBreak('File(s)', break_offset='Before'),
    processor=[get_file_name]
    )

#### Combining Sub-Sections to Form a New Section
The `dir_path`, `file_listing`, and `file_count` sub-sections can be combined to form a full directory listing section.

*Note:* The sub-sections must be combined in the order they will appear in the source.

In [64]:
dir_section = Section(
    start_section='Directory of',
    end_section=SectionBreak('File(s)', break_offset='After'),
    processor=[dir_path, file_listing, file_count]
    )

dir_section.read(dir_text)
# FIXME List of subsections returning None

[None]

In [39]:
for line in dir_text[3:20]:
    print('\t', line)

	  Directory of c:\users\...\Test Dir Structure
	 
	 2021-12-27  03:33 PM    <DIR>          .
	 2021-12-27  03:33 PM    <DIR>          ..
	 2021-12-27  04:03 PM    <DIR>          Dir1
	 2021-12-27  05:27 PM    <DIR>          Dir2
	 2016-02-25  09:59 PM                 3 TestFile1.txt
	 2016-02-15  06:46 PM                 7 TestFile2.rtf
	 2016-02-15  06:47 PM                 0 TestFile3.docx
	 2016-04-21  01:06 PM              3491 xcopy.txt
	                4 File(s)           3501 bytes
	 
	  Directory of c:\users\...\Test Dir Structure\Dir1
	 
	 2021-12-27  04:03 PM    <DIR>          .
	 2021-12-27  04:03 PM    <DIR>          ..
	 2016-02-15  06:48 PM                 0 File in Dir One.txt


## Section Assembles

A section's content can be summarized by supplying the section with an 
`assemble` method.  The `assemble` argument takes an *assemble* function: one
that combines the section sequence into a single object.

#### assemble Functions
Assemble functions are functions that act on a sequence to combine its items in 
some form.  The simplest assemble function (and also the default) is the 
built-in `list` function.

The assemble function has one required positional argument, the sequence to be 
assembled.  In addition, the function may contain a second positional argument,
a *context* dictionary.  The *context* dictionary will be discussed in more
detail in a later section.  Additional keyword arguments may also be included.  
If the keyword matches a key in the section's *context*, the corresponding 
*context* value will be supplied.  Otherwise the keyword argument will be 
ignored.
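
As a sketch of an assemble-style function that builds a dictionary, the function below groups tab-delimited lines like those produced by the processing functions earlier in this tutorial. `assemble_folder` is a hypothetical name, and the example calls it directly rather than through a `Section`.

```python
from typing import Dict, Iterable, List


def assemble_folder(lines: Iterable[str]) -> Dict[str, List[str]]:
    """Group tab-delimited label/value lines into a dictionary.

    Hypothetical assemble-style function for illustration only.
    """
    summary: Dict[str, List[str]] = {}
    for line in lines:
        label, _, value = line.strip().partition('\t')
        if not label:
            continue  # Skip blank items.
        summary.setdefault(label.rstrip(':'), []).append(value.strip())
    return summary


processed = [
    'Folder Name:\tTest Dir Structure',
    '',
    '\tSubdirectory:\tDir1',
    '\tFile:\t\txcopy.txt',
    'Number of Files:\t4',
]
summary = assemble_folder(processed)
```

Collecting values in lists keeps every subdirectory and file name, at the cost of single-valued entries such as the folder name also being wrapped in a list.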

1. Use Processor function as assemble
2. assemble function to build dictionary
3. assemble function for user interaction (for Later)

## Documentation

In [40]:
#print(Section.__doc__)
#print(SectionBreak.__init__.__doc__)
#print(Section.__init__.__doc__)
#print(ProcessingMethods.__doc__)
#print(ProcessingMethods.__init__.__doc__)
#print(Rule.__doc__)
#print(Rule.__init__.__doc__)
#print(RuleSet.__doc__)
#print(RuleSet.__init__.__doc__)

#from sections import Trigger, TriggerEvent
#print(Trigger.__doc__)
#print(Trigger.__init__.__doc__)
#print(TriggerEvent.__doc__)
#print(TriggerEvent.__init__.__doc__)
