# Example: Output from Windows Dir command

The Windows `dir` command displays a list of a directory's files and subdirectories.  
Its output is used to showcase some of the features of the *sectionary* package.

Adding switches (options) to the `dir` command controls what it displays and how the output is formatted.
These examples use the command line:

`DIR "Test Dir Structure" /S /N /-C /T:W >  "test_DIR_Data.txt"`

| Switch | Description                                                                                              |
|--------|----------------------------------------------------------------------------------------------------------|
| /S     | Lists every occurrence of the specified file name within the specified directory and all subdirectories. |
| /N     | Displays a long list format with file names on the far right of the screen.                              |
| /-C    | Hides the thousand separator in file sizes.                                                              |
| /T:W   | Uses the "Last written" time field for display and sorting.                                              |
| >      | Redirects the output to the specified file.                                                              |

For more information, see [DIR Command Syntax](https://docs.microsoft.com/en-us/windows-server/administration/windows-commands/dir).
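If you would rather generate the sample file from Python than from a command prompt, something like the following may work. It is only a sketch: `build_dir_command` is a hypothetical helper (not part of *sectionary*), and the command is only attempted on Windows because `dir` is built into `cmd.exe`.

```python
import subprocess
import sys


def build_dir_command(folder: str, output_file: str) -> str:
    """Return the DIR command line used in these examples (hypothetical helper)."""
    switches = ' '.join(['/S', '/N', '/-C', '/T:W'])
    return f'DIR "{folder}" {switches} > "{output_file}"'


command = build_dir_command('Test Dir Structure', 'test_DIR_Data.txt')
print(command)

# DIR is a cmd.exe built-in, so it has to be run through the shell.
# On other platforms the command is only printed, not executed.
if sys.platform == 'win32':
    subprocess.run(command, shell=True, check=False)
```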

### Imports

#### Standard Python Modules

In [1]:
from pathlib import Path
from pprint import pprint
import re
import sys
from typing import List  # used by the type hints in the parsing functions

#### Useful Third Party Packages

In [2]:
import pandas as pd
import xlwings as xw

#### Sectionary Imports

In [3]:
sys.path.append(r'../src/sectionary') 

import text_reader as tp
from sections import Rule, RuleSet, SectionBreak, ProcessingMethods, Section

In [4]:
#print(Section.__doc__)

In [5]:
#print(SectionBreak.__init__.__doc__)

## The Sample `Dir` Output

In [6]:
dir_text = Path('test_DIR_Data.txt').read_text().splitlines()

### `Dir` Output Structure

The first 20 lines of the directory listing are:

In [7]:
for line in dir_text[0:20]:
    print('\t', line)

	  Volume in drive C is Windows
	  Volume Serial Number is DAE7-D5BA
	 
	  Directory of c:\users\...\Test Dir Structure
	 
	 2021-12-27  03:33 PM    <DIR>          .
	 2021-12-27  03:33 PM    <DIR>          ..
	 2021-12-27  04:03 PM    <DIR>          Dir1
	 2021-12-27  05:27 PM    <DIR>          Dir2
	 2016-02-25  09:59 PM                 3 TestFile1.txt
	 2016-02-15  06:46 PM                 7 TestFile2.rtf
	 2016-02-15  06:47 PM                 0 TestFile3.docx
	 2016-04-21  01:06 PM              3491 xcopy.txt
	                4 File(s)           3501 bytes
	 
	  Directory of c:\users\...\Test Dir Structure\Dir1
	 
	 2021-12-27  04:03 PM    <DIR>          .
	 2021-12-27  04:03 PM    <DIR>          ..
	 2016-02-15  06:48 PM                 0 File in Dir One.txt


We want to ignore the first two lines (the *header section*):

In [31]:
for line in dir_text[0:2]:
    print('\t', line)

	  Volume in drive C has no label.
	  Volume Serial Number is 56DB-14A7


After the header come multiple folder sections, each of which looks something like this:

In [32]:
print(dir_text[3][0:23], '...', dir_text[3][-19:])
print()
for line in dir_text[5:9]:
    print(line)
print(dir_text[13])

 Directory of C:\Users\ ... \Test Dir Structure

2021-06-18  14:54    <DIR>          .
2021-06-18  14:54    <DIR>          ..
2021-06-18  14:54    <DIR>          Dir1
2021-06-18  14:54    <DIR>          Dir2
               4 File(s)           3501 bytes


The start and end of the folder listing can be identified by key phrases:
- The section start is identified by the text '*Directory of*'
- The section end is identified by the text '*File(s)*'

**Define a Section Based on these start and end identifiers:**

In [33]:
dir_section = Section(start_section='Directory of', end_section='File(s)')
dir_section.read(dir_text)

[" Directory of C:\\Users\\gsalomon\\OneDrive - Queen's University\\Python\\Projects\\EclipseRelated\\PyUtilities\\Text Processing\\Text Files\\Test Dir Structure",
 '',
 '2021-06-18  14:54    <DIR>          .',
 '2021-06-18  14:54    <DIR>          ..',
 '2021-06-18  14:54    <DIR>          Dir1',
 '2021-06-18  14:54    <DIR>          Dir2',
 '2016-02-25  22:59                 3 TestFile1.txt',
 '2016-02-15  19:46                 7 TestFile2.rtf',
 '2016-02-15  19:47                 0 TestFile3.docx',
 '2016-04-21  14:06              3491 xcopy.txt']

`dir_section.read(dir_text)` returned the first folder listing in *dir_text*.
However, it is missing the final line:

In [34]:
print(dir_text[13])

               4 File(s)           3501 bytes


To include this line, the `end_section` break must fall *after* the specified text. We supply this information by explicitly creating a `SectionBreak` object:

In [35]:
dir_section = Section(start_section='Directory of', 
                      end_section=SectionBreak('File(s)', break_offset='After'))

dir_section.read(dir_text)

[" Directory of C:\\Users\\gsalomon\\OneDrive - Queen's University\\Python\\Projects\\EclipseRelated\\PyUtilities\\Text Processing\\Text Files\\Test Dir Structure",
 '',
 '2021-06-18  14:54    <DIR>          .',
 '2021-06-18  14:54    <DIR>          ..',
 '2021-06-18  14:54    <DIR>          Dir1',
 '2021-06-18  14:54    <DIR>          Dir2',
 '2016-02-25  22:59                 3 TestFile1.txt',
 '2016-02-15  19:46                 7 TestFile2.rtf',
 '2016-02-15  19:47                 0 TestFile3.docx',
 '2016-04-21  14:06              3491 xcopy.txt',
 '               4 File(s)           3501 bytes']
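The effect of the break offset can be pictured with a few lines of plain Python. The sketch below is an illustration only, not *sectionary*'s implementation: `read_section` and its `end_offset` parameter are hypothetical stand-ins for `Section.read` and `break_offset`, and the sample lines are made up.

```python
from typing import Iterable, List


def read_section(lines: Iterable[str], start: str, end: str,
                 end_offset: str = 'Before') -> List[str]:
    """Collect lines from the first `start` line up to the first `end` line.

    Hypothetical stand-in for illustration only.
    """
    section: List[str] = []
    in_section = False
    for line in lines:
        if not in_section:
            # Skip everything until the start text is found.
            if start in line:
                in_section = True
                section.append(line)
            continue
        if end in line:
            # 'After' places the break after the matching line, so keep it.
            if end_offset == 'After':
                section.append(line)
            break
        section.append(line)
    return section


# Made-up sample listing.
sample = [
    ' Volume in drive C is Windows',
    ' Directory of c:\\demo',
    '2016-04-21  14:06              3491 xcopy.txt',
    '               1 File(s)           3491 bytes',
    ' Directory of c:\\demo\\Dir1',
]

before = read_section(sample, 'Directory of', 'File(s)')
after = read_section(sample, 'Directory of', 'File(s)', end_offset='After')
print(len(before), len(after))
```

With the default offset the `File(s)` line is left out; with `'After'` it becomes the last line of the section.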

`dir_text` is a list, so `dir_section.read(dir_text)` starts over at the beginning each time it is called.

In [36]:
dir_section.read(dir_text)

[" Directory of C:\\Users\\gsalomon\\OneDrive - Queen's University\\Python\\Projects\\EclipseRelated\\PyUtilities\\Text Processing\\Text Files\\Test Dir Structure",
 '',
 '2021-06-18  14:54    <DIR>          .',
 '2021-06-18  14:54    <DIR>          ..',
 '2021-06-18  14:54    <DIR>          Dir1',
 '2021-06-18  14:54    <DIR>          Dir2',
 '2016-02-25  22:59                 3 TestFile1.txt',
 '2016-02-15  19:46                 7 TestFile2.rtf',
 '2016-02-15  19:47                 0 TestFile3.docx',
 '2016-04-21  14:06              3491 xcopy.txt',
 '               4 File(s)           3501 bytes']

In [37]:
dir_section.read(dir_text)

[" Directory of C:\\Users\\gsalomon\\OneDrive - Queen's University\\Python\\Projects\\EclipseRelated\\PyUtilities\\Text Processing\\Text Files\\Test Dir Structure",
 '',
 '2021-06-18  14:54    <DIR>          .',
 '2021-06-18  14:54    <DIR>          ..',
 '2021-06-18  14:54    <DIR>          Dir1',
 '2021-06-18  14:54    <DIR>          Dir2',
 '2016-02-25  22:59                 3 TestFile1.txt',
 '2016-02-15  19:46                 7 TestFile2.rtf',
 '2016-02-15  19:47                 0 TestFile3.docx',
 '2016-04-21  14:06              3491 xcopy.txt',
 '               4 File(s)           3501 bytes']

Creating an iterator from *dir_text* with `dir_text_iter = iter(dir_text)` 
makes it behave like a text stream source: 
successive calls to `dir_section.read(dir_text_iter)` 
return the next directory group.

In [38]:
dir_text_iter = iter(dir_text)
dir_section.read(dir_text_iter)


[" Directory of C:\\Users\\gsalomon\\OneDrive - Queen's University\\Python\\Projects\\EclipseRelated\\PyUtilities\\Text Processing\\Text Files\\Test Dir Structure",
 '',
 '2021-06-18  14:54    <DIR>          .',
 '2021-06-18  14:54    <DIR>          ..',
 '2021-06-18  14:54    <DIR>          Dir1',
 '2021-06-18  14:54    <DIR>          Dir2',
 '2016-02-25  22:59                 3 TestFile1.txt',
 '2016-02-15  19:46                 7 TestFile2.rtf',
 '2016-02-15  19:47                 0 TestFile3.docx',
 '2016-04-21  14:06              3491 xcopy.txt',
 '               4 File(s)           3501 bytes']

In [39]:
dir_section.read(dir_text_iter)

[" Directory of C:\\Users\\gsalomon\\OneDrive - Queen's University\\Python\\Projects\\EclipseRelated\\PyUtilities\\Text Processing\\Text Files\\Test Dir Structure\\Dir1",
 '',
 '2021-06-18  14:54    <DIR>          .',
 '2021-06-18  14:54    <DIR>          ..',
 '2016-02-15  19:48                 0 File in Dir One.txt',
 '2021-06-18  14:54    <DIR>          SubFolder1',
 '2021-06-18  14:54    <DIR>          SubFolder2',
 '               1 File(s)              0 bytes']
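The difference between reading from a list and reading from an iterator is ordinary Python behaviour and can be reproduced without *sectionary*. In this sketch `next_group` is a hypothetical helper and the four-line listing is made up.

```python
from typing import Iterable, List


def next_group(lines: Iterable[str]) -> List[str]:
    """Return lines up to and including the next 'File(s)' line.

    Hypothetical helper for illustration only.
    """
    group = []
    for line in lines:  # the for loop calls iter(lines) once per call
        group.append(line)
        if 'File(s)' in line:
            break
    return group


listing = ['Folder A', '1 File(s)', 'Folder B', '2 File(s)']

# A list hands out a fresh iterator each time, so every call restarts.
first = next_group(listing)
second = next_group(listing)
print(first, second)

# An iterator remembers its position, so each call continues from there.
stream = iter(listing)
third = next_group(stream)
fourth = next_group(stream)
print(third, fourth)
```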

## Section Processing
Next, let's do something with the section text.

A section's content can be summarized by supplying the section with an `aggregate` function.

The `aggregate` argument of *Section* (`AggregateCallableOptions`, optional) is a function used to
collect and format the processor output into a single object.
It defaults to `None`, which returns a list of the processor output.
The type aliases involved are:

AggregateFunc = Callable[[ProcessedList, ContextType], AggregatedItem]
AggregateCallableOptions = Union[AggregateFunc,
                               Callable[[ProcessedList], AggregatedItem],
                               Callable[..., ProcessedList]]
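To make the one-argument form `Callable[[ProcessedList], AggregatedItem]` concrete, here is a sketch of a function with that shape. `summarize_folder` is hypothetical, and the `(label, value)` pairs are an assumed processor output chosen for illustration; whether such a function can be passed unchanged as `aggregate=` depends on what the section's processor actually emits.

```python
from typing import Any, Dict, List, Tuple


def summarize_folder(processed_items: List[Tuple[str, str]]) -> Dict[str, Any]:
    """Collapse a list of (label, value) pairs into one summary dictionary.

    Hypothetical aggregate-style function: it receives the whole list of
    processed items and returns a single object.
    """
    summary: Dict[str, Any] = {'Folder': '', 'Subdirectories': [], 'Files': []}
    for label, value in processed_items:
        if label == 'Folder Name':
            summary['Folder'] = value
        elif label == 'Subdirectory':
            summary['Subdirectories'].append(value)
        elif label == 'File':
            summary['Files'].append(value)
    return summary


# Assumed processor output for one folder.
processed = [
    ('Folder Name', 'Test Dir Structure'),
    ('Subdirectory', 'Dir1'),
    ('Subdirectory', 'Dir2'),
    ('File', 'TestFile1.txt'),
    ('File', 'xcopy.txt'),
]
summary = summarize_folder(processed)
print(summary)
```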
                                          

In [46]:
print(Section.__init__.__doc__)

Creates an Section instance that defines a continuous portion of a
        text stream to be processed in a specific way.

        Arguments:
            section_name (str, optional): A label to be applied to the section.
                Defaults to 'Section'.
            start_section (BreakOptions, optional): The SectionBreak(s) used
                to identify the location of the start of the section. Defaults
                to None, indicating the section begins with the first text
                line in the iterator.
            end_section (BreakOptions, optional): The SectionBreak(s) used
                to identify the location of the end of the section. Defaults
                to None, indicating the section ends with the last text line
                in the iterator.
            processor (ProcessMethodOptions, optional): Instructions for
                processing and the section items.  processor can be None, in
                which case the section will use the defaul


> For the directory line, extract the directory name from the full path:
`'Folder Name:\t' + dir_line.rsplit('\\', 1)[1]`

> Get the number of files in the directory:
`'Number of Files:\t' + dir_line.strip().split(' ', 1)[0]`

> Identify subdirectories:
`'\tSubdirectory:\t' + dir_line[36:]`

> Identify files:
`'\tFile:\t\t' + dir_line[36:]`
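The four expressions can be tried on individual lines before they are wrapped in a function. In the sketch below the sample lines are assembled piece by piece so that the column positions are explicit; the layout is assumed to match the 24-hour listing shown above, and the directory path is left elided as in that output.

```python
# Sample lines laid out like the 24-hour listing (assumed layout).
dir_header = ' Directory of C:\\Users\\ ... \\Test Dir Structure'
subdir_line = '2021-06-18  14:54' + '    <DIR>' + ' ' * 10 + 'Dir1'
file_line = '2016-02-25  22:59' + ' ' + '3'.rjust(17) + ' ' + 'TestFile1.txt'
count_line = '4'.rjust(16) + ' File(s)' + '3501'.rjust(15) + ' bytes'

# Everything after the last backslash is the folder name.
folder_name = dir_header.rsplit('\\', 1)[1]
# The first word of the stripped summary line is the file count.
file_count = count_line.strip().split(' ', 1)[0]
# Date, time and size (or <DIR>) fill the first 36 columns.
subdir_name = subdir_line[36:]
file_name = file_line[36:]

print(folder_name, file_count, subdir_name, file_name, sep=' | ')
```

The fixed offset of 36 depends on the date and time format. With a 12-hour time such as `03:33 PM` the names start further to the right, so the offset would need adjusting.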

In [41]:
def process_directory(dir_line):
    # Get the directory name
    if 'Directory of' in dir_line:
        output_line = 'Folder Name:\t' + dir_line.rsplit('\\', 1)[1]
    # Label the subdirectories
    elif '<DIR>' in dir_line:
        output_line = '\tSubdirectory:\t' + dir_line[36:]
    # Label the file counts
    elif 'File(s)' in dir_line:
        output_line = 'Number of Files:\t' + dir_line.strip().split(' ', 1)[0]
    # Label the files
    else:
        output_line = '\tFile:\t\t' + dir_line[36:]
    return output_line 

In [42]:
dir_section = Section(start_section='Directory of', 
                      end_section=SectionBreak('File(s)', break_offset='After'),
                      processor=ProcessingMethods([process_directory]))

output = dir_section.read(dir_text)
for line in output:
    print(line)

Folder Name:	Test Dir Structure
	File:		
	Subdirectory:	.
	Subdirectory:	..
	Subdirectory:	Dir1
	Subdirectory:	Dir2
	File:		TestFile1.txt
	File:		TestFile2.rtf
	File:		TestFile3.docx
	File:		xcopy.txt
Number of Files:	4


### Rule and RuleSets
Instead of having one function `process_directory()` that manages all possible 
text lines in the section, the function can be broken down into parts by 
defining *Rules*.
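The idea behind a rule set can be sketched in plain Python as a list of (sentinel, function) pairs with a fallback. This is an illustration under assumptions, not *sectionary*'s `Rule` or `RuleSet` code: `apply_rules` is hypothetical, it assumes the first matching rule wins, and the sample lines are made up.

```python
from typing import Callable, List, Tuple

LineFunc = Callable[[str], List[str]]


def folder_name(line: str) -> List[str]:
    return ['Folder Name:', line.rsplit('\\', 1)[1]]


def file_count(line: str) -> List[str]:
    return ['Number of Files:', line.split()[0]]


def subfolder(line: str) -> List[str]:
    # Date, time and <DIR> are the first three fields; the name is the rest.
    return ['Subdirectory:', line.split(maxsplit=3)[-1]]


def plain_file(line: str) -> List[str]:
    # Date, time and size are the first three fields; the name is the rest.
    return ['File:', line.split(maxsplit=3)[-1]]


def apply_rules(line: str, rules: List[Tuple[str, LineFunc]],
                default: LineFunc) -> List[str]:
    """Run the function of the first rule whose sentinel text is in the line."""
    for sentinel, func in rules:
        if sentinel in line:
            return func(line)
    return default(line)


rules = [('Directory of', folder_name),
         ('File(s)', file_count),
         ('<DIR>', subfolder)]

# Made-up sample lines.
sample = [
    ' Directory of c:\\demo\\Dir1',
    '2021-06-18  14:54    <DIR>          SubFolder1',
    '2016-02-15  19:48                 0 File in Dir One.txt',
    '               1 File(s)              0 bytes',
]
results = [apply_rules(line, rules, plain_file) for line in sample]
for row in results:
    print(row)
```

Each small function only has to handle the one kind of line it is registered for, which keeps the individual pieces easy to test.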

In [43]:
def dir_name_split(line):
    return ['Folder Name:', line.rsplit('\\', 1)[1]]
dir_name_rule = Rule('Directory of', pass_method=dir_name_split)

def file_count_split(line):
    return ['Number of Files:', line.strip().split(' ', 1)[0]]
file_count_rule = Rule('File(s)', pass_method=file_count_split)

def subfolder(line):
    return ['Subdirectory:', line[36:]]
subfolder_rule = Rule('<DIR>', pass_method=subfolder)

def file(line):
    return ['File:' + line[36:]]

dir_process = RuleSet([dir_name_rule, file_count_rule, subfolder_rule], 
                      default=file)

In [44]:
dir_section = Section(start_section='Directory of', 
                      end_section=SectionBreak('File(s)', break_offset='After'),
                      processor=ProcessingMethods([dir_process]))

output = dir_section.read(dir_text)
for line in output:
    print(line)

['Folder Name:', 'Test Dir Structure']
['File:']
['Subdirectory:', '.']
['Subdirectory:', '..']
['Subdirectory:', 'Dir1']
['Subdirectory:', 'Dir2']
['File:TestFile1.txt']
['File:TestFile2.rtf']
['File:TestFile3.docx']
['File:xcopy.txt']
['Number of Files:', '4']


# *DONE TO HERE*

In [45]:
dir_section = Section(start_section='Directory of', 
                      end_section=SectionBreak('File(s)', break_offset='After'),
                      processor=[dir_process])

output = dir_section.read(dir_text)
for line in output:
    print(line)

TypeError: 'NoneType' object is not callable

In [None]:
def dir_name_split(line):
    # Get the directory name
    if 'Directory of' in line:
        return ['Folder Name:', line.rsplit('\\', 1)[1]]
    return line

def subfolder(line):
    # Label the subdirectories
    if '<DIR>' in line:
        return ['Subdirectory:', line[36:]]
    return line

def file_count_split(line):
    # Label the file counts
    if 'File(s)' in line:
        return ['Number of Files:', line.strip().split(' ', 1)[0]]
    return line

def file(line):
    # Label the files
    return ['File:' + line[36:]] 

In [None]:
print('column index')
print(''.join(str(i)*10 for i in range(10)))
print(''.join(str(i) for i in range(10))*10)
print(dir_text[9])
    

In [None]:


#%% Regex Parsing patterns
# File Count and summary:
     #          1 File(s)          59904 bytes
     #         23 Dir(s)     63927545856 bytes free
folder_summary_pt = re.compile(
    '(?P<files>'       # beginning of files string group
    '[0-9]+'           # Integer number of files
    ')'                # end of files string group
    '[ ]+'             # Arbitrary number of spaces
    '(?P<type>'        # beginning of type string group
    'File|Dir'         # "File" or "Dir" text
    ')'                # end of type string group
    '\\(s\\)'          # "(s)" text
    '[ ]+'             # Arbitrary number of spaces
    '(?P<size>'        # beginning of size string group
    '[0-9]+'           # Integer size of folder
    ')'                # end of size string group
    ' bytes'           # "bytes" text
    )
date_pattern = tp.build_date_re(compile_re=False)
file_listing_pt = re.compile(
    f'{date_pattern}'  # Insert date pattern
    '[ ]+'             # Arbitrary number of spaces
    '(?P<size>'        # beginning of size string group
    '[0-9]+'           # Integer size of folder
    ')'                # end of size string group
    ' '                # Single space
    '(?P<filename>'    # beginning of filename string group
    '.*'               # Remaining text is the file name
    ')'                # end of filename string group
    '$'                # end of string
    )


#%% Line Parsing Functions
# Directory Label Rule

def extract_directory(line: str, event, *args,
                    context=None, **kwargs) -> List[List[str]]:
    '''Extract Directory path from folder header.
    '''
    full_dir = line.replace('Directory of', '').strip()
    return [full_dir]


dir_header_rule = Rule(
    name='Dir Header Rule',
    sentinel='Directory of ',
    pass_method=extract_directory
    )


# skip <DIR>
def blank_line(*args, **kwargs) -> List[List[str]]:
    return [['']]


skip_dir_rule = Rule(
    name='Skip <DIR> Rule',
    sentinel=' <DIR> ',
    pass_method='Blank'
    )
skip_totals_rule = Rule(
    name='Skip Total Files Header Rule',
    sentinel='Total Files Listed:',
    pass_method='Blank'
    )


# Regular file listings
def file_parse(line: str, event, *args, **kwargs) -> List[List[str]]:
    '''Break file data into three columns containing Filename, Date, Size.

    Typical file is:
        2016-02-25  22:59     3 TestFile1.txt
    File line is parsed using a regular expression with 3 named groups.
    Output for the example above is:
        [[TestFile1.txt , 2016-02-25  22:59, 3]]

    Args:
        line (str): The text line to be parsed.
        event (re.match): The results of the trigger test on the line.
            Contains 3 named groups: ['date', 'size', 'filename'].
        *args & **kwargs: Catch unused extra parameters passed to file_parse.

    Returns:
        tp.ParseResults: A one-item list containing the parsed file
            information as a 3-item tuple:
                [(filename: str, date: str, file size: int)].
    '''
    file_line_parts = event.test_value.groupdict(default='')
    parsed_line = tuple([
        file_line_parts['filename'],
        tp.make_date_time_string(event),
        int(file_line_parts['size'])
        ])
    return parsed_line


# Regular File Parsing Rule
file_listing_rule = Rule(file_listing_pt, pass_method=file_parse,
                            name='Files_rule')


# File Count Parsing Rule
def file_count_parse(line: str, event, *args, **kwargs) -> List[List[str]]:
    '''Break file data into two rows containing:
           Number of files, & Directory size.

    Output has the following format:
        ['Number of files', file count value: int]
        ['Directory Size', directory size value: int]

    Typical line is:
        4 File(s)           3501 bytes
    File count is parsed using a regular expression with 3 named groups.

    Args:
        line (str): The text line to be parsed.
        event (re.match): The results of the trigger test on the line.
            Contains 3 named groups: ['files', 'type', 'size'].
        *args & **kwargs: Catch unused extra parameters passed to
            file_count_parse.

    Returns:
        tp.ParseResults: The parsed file information.
            The parsed file information consists of two lines with the
            following format:
                'Number of files', file count value: int
                'Directory Size', directory size value: int
    '''
    file_count_parts = event.test_value.groupdict(default='')
    # Manage case where bytes free is given:
    # 23 Dir(s)     63927545856 bytes free
    if line.strip().endswith('free'):
        file_count_parts['size_label'] = 'Free Space'
    else:
        file_count_parts['size_label'] = 'Size'
    parsed_line_template = ''.join([
        'Number of {type}s, {files}\n',
        'Directory {size_label}, {size}'
        ])
    parsed_line_str = parsed_line_template.format(**file_count_parts)
    parsed_line = [new_line.split(',')
                   for new_line in parsed_line_str.splitlines()]
    return parsed_line
file_count_rule = Rule(folder_summary_pt, pass_method=file_count_parse,
                          name='File_Count_rule')


skip_file_count_rule = Rule(
    name='Skip File(s) Rule',
    sentinel=folder_summary_pt,
    pass_method='Blank'
    )


# Files / DIRs Parse
def make_files_rule() -> Rule:
    '''If  File(s) or  Dir(s) extract # files & size
        '''
    def files_total_parse(line, event, *args, **kwargs) -> List[List[str]]:
        '''Break file counts into three columns containing:
           Type (File or Dir), Count, Size.

        The line:
               11 File(s)          72507 bytes
        Results in:
            [('File', 11, 72507)]
        The line:
               23 Dir(s)     63927545856 bytes free
        Results in:
            [('Dir', 23, 63927545856)]

        Args:
            line (str): The text line to be parsed.
            event (re.match): The results of the trigger test on the line.
                Contains 3 named groups: ['type', 'files', 'size'].
            *args & **kwargs: Catch unused extra parameters passed to
                files_total_parse.

        Returns:
            tp.ParseResults: A one-item list containing the parsed file count
                information as a 3-item tuple:
                    [(Type: str (File or Dir), Count: int, Size: int)].
        '''
        files_dict = event.test_value.groupdict(default='')
        # Convert count and size to int so the result matches the docstring.
        parsed_line = (
            files_dict["type"],
            int(files_dict["files"]),
            int(files_dict["size"])
            )
        return [parsed_line]

    files_total_rule = Rule(folder_summary_pt,
                               pass_method=files_total_parse,
                               name='Files_Total_rule')
    return files_total_rule


default_csv = tp.define_csv_parser('dir_files', delimiter=':',
                                       skipinitialspace=True)


#%% Line Processing
def print_lines(parsed_list):
    output = list()
    for item in parsed_list:
        pprint(item)
        output.append(item)
    return output


def to_folder_dict(folder_list):
    '''Combine folder info into dictionary.
    '''
    # TODO separate directory info from file info
    #The first line in the folder list is the directory path
    directory = ''
    if folder_list:
        d_list = folder_list[0]
        if d_list:
            directory = d_list[0]
    folder_dict = {'Directory': directory}
    for folder_info in folder_list[1:]:
        filename, date, file_size = folder_info
        full_path = '\\'.join([directory, filename])
        file_parts = filename.rsplit('.', 1)
        if len(file_parts) > 1:
            extension = file_parts[1]
        else:
            extension = ''
        folder_dict = {
            'Path': full_path,
            'Directory': directory,
            'Filename': filename,
            'Extension': extension,
            'Date': date,
            'Size': file_size
            }
    return folder_dict


def make_files_table(dir_gen):
    '''Combine folder info dictionaries into Pandas DataFrame.
    '''
    list_of_folders = list(dir_gen)
    files_table = pd.DataFrame(list_of_folders)
    # set_index returns a new DataFrame, so keep the result.
    files_table = files_table.set_index('Path')
    return files_table


#%% Reader definitions
default_parser = tp.define_csv_parser('dir_files', delimiter=':',
                                       skipinitialspace=True)
heading_reader = ProcessingMethods([
    default_parser,
    tp.trim_items
    ])
folder_reader = ProcessingMethods([
    RuleSet([skip_dir_rule, file_listing_rule, dir_header_rule,
             skip_file_count_rule], default=default_parser),
    tp.drop_blanks
    ])
summary_reader = ProcessingMethods([
    RuleSet([file_count_rule, skip_totals_rule], default=default_parser),
    tp.drop_blanks
    ])


#%% SectionBreak definitions
folder_start = SectionBreak(
    name='Start of Folder', sentinel='Directory of', break_offset='Before')
folder_end = SectionBreak(name='End of Folder',sentinel=folder_summary_pt,
                             break_offset='After')
summary_start = SectionBreak(sentinel='Total Files Listed:',
                                name='Start of DIR Summary', break_offset='Before')


#%% Section definitions
header_section = Section(
    section_name='Header',
    start_section=None,
    end_section=folder_start,
    processor=heading_reader,
    aggregate=print_lines
    )
folder_section = Section(
    section_name='Folder',
    start_section=folder_start,
    end_section=folder_end,
    processor=folder_reader,
    aggregate=to_folder_dict
    )
all_folder_section = Section(
    section_name='All Folders',
    start_section=folder_start,
    end_section=summary_start,
    processor=[folder_section],
    aggregate=make_files_table
    )
summary_section = Section(
    section_name='Summary',
    start_section=summary_start,
    end_section=None,
    processor=summary_reader,
    aggregate=tp.to_dict
    )


#%% Main Iteration
def main():
    # Test File
    base_path = Path.cwd() / 'examples'
    test_file = base_path / 'test_DIR_Data.txt'

    # Call Primary routine
    context = {
        'File Name': test_file.name,
        'File Path': test_file.parent,
        'top_dir': str(base_path),
        'tree_name': 'Test folder Tree'
        }

    source = tp.file_reader(test_file)
    file_info = all_folder_section.read(source, context)
    #summary = summary_section.read(source, **context)

    # Output  Data
    xw.view(file_info)
    print('done')

if __name__ == '__main__':
    main()
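The folder summary pattern can be checked on its own, without *sectionary*, against the two kinds of summary line it is meant to match. The sketch below rebuilds an equivalent pattern with raw strings and applies it with `re.search`; how *sectionary* itself applies a compiled pattern to a line is not shown here and is an assumption.

```python
import re

folder_summary_pt = re.compile(
    r'(?P<files>[0-9]+)'   # number of files or directories
    r'[ ]+'                # arbitrary number of spaces
    r'(?P<type>File|Dir)'  # which kind of total this is
    r'\(s\)'               # literal "(s)"
    r'[ ]+'                # arbitrary number of spaces
    r'(?P<size>[0-9]+)'    # size in bytes
    r' bytes'
    )

samples = [
    '               4 File(s)           3501 bytes',
    '              23 Dir(s)     63927545856 bytes free',
]

results = []
for text in samples:
    match = folder_summary_pt.search(text)
    parts = match.groupdict()
    # Lines ending in "free" report free disk space rather than folder size.
    label = 'Free Space' if text.strip().endswith('free') else 'Size'
    results.append((parts['type'], int(parts['files']), label, int(parts['size'])))

for row in results:
    print(row)
```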

In [None]:
a =dir_text[3]
a.index('\\')
a.rsplit('\\', 1)
#'Folder Name:\t' + a.rsplit('\\', 1)[0]