#### Sanitizing Pathnames (Flex)

In POSIX pathnames, _components_ are separated by `/`. Multiple consecutive `/` have the same meaning as a single `/`. A trailing `/` has no meaning, but a leading `/` is significant. Leading and trailing spaces are allowed but have no significance. A component consists of the characters `a-z`, `A-Z`, `0-9`, and `.` (dot), with two special cases: a `.` component refers to the current directory and a `..` component refers to the parent directory. Note that `.` can also appear inside a longer component, as in `b.b` or `...`. Portable pathnames restrict each component to at most 14 characters and the whole pathname to at most 255 characters.

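The component rules can also be written as a small check in plain C. The following is a minimal sketch for illustration, not part of the required Flex solution: the helper names `is_component_char` and `is_portable_component` are hypothetical, and the sketch assumes that an empty string is not a component, since consecutive `/` never produce one.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MAX_COMPONENT_LENGTH 14  /* portable limit stated in the text */

/* True for the characters allowed inside a component. */
static bool is_component_char(char c) {
    return (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z') ||
           (c >= '0' && c <= '9') || c == '.';
}

/* Hypothetical helper: checks one NUL-terminated component (no '/' inside).
   Assumption: the empty string is rejected. */
static bool is_portable_component(const char *s) {
    assert(s != NULL);
    size_t len = 0;
    while (s[len] != '\0') {
        if (!is_component_char(s[len])) {
            return false;            /* character outside the allowed set */
        }
        len++;
    }
    return len >= 1 && len <= MAX_COMPONENT_LENGTH;
}
```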
Implement a sanitizer for pathnames using Flex and C. Your implementation should read from standard input and produce a sanitized portable pathname on standard output or an error message on standard error. The implementation must use the regular expression facilities of Flex to check that the input is well-formed.

| standard input       | standard output |
|:---------------------|:----------------|
| `/aaa//bb/c/`        | `/aaa/bb/c`     |
| `aaa/b.b/../cc/./dd` | `aaa/cc/dd`     |

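The two rows above can be reproduced by keeping a stack of components: an ordinary component is pushed, `.` is skipped, and `..` pops. The following is a minimal sketch in plain C, without Flex. The names `sanitize` and `MAX_DEPTH` are hypothetical, and the sketch assumes the characters were already validated and surrounding spaces already removed.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

#define MAX_DEPTH 128  /* assumption: enough for a 255-character pathname */

/* Sketch: writes the sanitized form of `in` to `out` (capacity `cap`).
   Returns 0 on success, -1 if ".." climbs above the start, the path is
   deeper than MAX_DEPTH, or `out` is too small. */
static int sanitize(const char *in, char *out, size_t cap) {
    assert(in != NULL && out != NULL);
    const char *start[MAX_DEPTH];    /* stack of kept components */
    size_t length[MAX_DEPTH];
    int depth = 0;
    bool absolute = (in[0] == '/');  /* a leading '/' is significant */

    const char *p = in;
    while (*p != '\0') {
        while (*p == '/') {
            p++;                     /* collapse runs of '/' */
        }
        size_t n = strcspn(p, "/");  /* length of the next component */
        if (n == 0) {
            break;                   /* only trailing '/' was left */
        }
        if (n == 1 && p[0] == '.') {
            /* "." is the current directory: nothing to keep */
        } else if (n == 2 && p[0] == '.' && p[1] == '.') {
            if (depth == 0) {
                return -1;           /* no parent to go up to */
            }
            depth--;
        } else {
            if (depth == MAX_DEPTH) {
                return -1;
            }
            start[depth] = p;
            length[depth] = n;
            depth++;
        }
        p += n;
    }

    size_t used = 0;
    for (int i = 0; i < depth; i++) {
        size_t slash = (absolute || i > 0) ? 1 : 0;
        if (used + slash + length[i] + 1 > cap) {
            return -1;               /* output buffer too small */
        }
        if (slash) {
            out[used++] = '/';
        }
        memcpy(out + used, start[i], length[i]);
        used += length[i];
    }
    if (used + 1 > cap) {
        return -1;
    }
    out[used] = '\0';
    return 0;
}
```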
The sanitizer should read the input line by line from standard input until the end of the file. For each line, the sanitizer should either print one line with the sanitized pathname on standard output, or print an error message on standard error and terminate immediately:

| standard input        | standard error       |
|:----------------------|:---------------------|
| `/a//b/#/c`           | `invalid character`  |
| `/012345678901234/bb` | `component too long` |
| `aa/../..`            | `malformed pathname` |
| `/this/is/a/path/name/that/is/really/too/long/.../way/too/long/` | `pathname too long` |

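The `pathname too long` row depends on what exactly is counted, which the text leaves open. The following is a minimal sketch in plain C under one possible reading: the line terminator and surrounding spaces are stripped first, and the 255-character limit is then applied to the remaining input rather than to the sanitized output. The helper names `trim_line` and `pathname_too_long` are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

#define MAX_PATHNAME_LENGTH 255  /* portable limit stated in the text */

/* Hypothetical helper: strips the line terminator and surrounding spaces
   in place and returns a pointer to the first significant character. */
static char *trim_line(char *line) {
    assert(line != NULL);
    size_t end = strlen(line);
    while (end > 0 && (line[end - 1] == '\n' || line[end - 1] == '\r' ||
                       line[end - 1] == ' ')) {
        end--;                       /* drop trailing terminator and spaces */
    }
    line[end] = '\0';
    while (*line == ' ') {
        line++;                      /* skip leading spaces */
    }
    return line;
}

/* Assumption: the limit applies to the trimmed input, not to the output. */
static bool pathname_too_long(const char *trimmed) {
    return strlen(trimmed) > MAX_PATHNAME_LENGTH;
}
```

Under this reading, a line is rejected even if `..` components would shorten the sanitized result below the limit.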

Hint: use the regular expression features of Flex to check for invalid characters and overlong components, and to "swallow" leading and trailing spaces, multiple consecutive `/`, and `.` components.

In [62]:
%%writefile spn.l
%option noyywrap
%{
#include <stdio.h>
#include <stdlib.h>
#include <stdbool.h>
#include <string.h>

#define MAX_COMPONENT_LENGTH 14
#define MAX_PATHNAME_LENGTH 255

int error_flag = 0;

typedef struct Node {
    char* data;
    struct Node* next;
} Node;

Node* stack = NULL;

int total_char_count = 0;
bool last_was_slash = false;

void print_error(const char *msg) {
    fprintf(stderr, "%s", msg);
    error_flag = 1;
}

Node* create_node(const char* data){
    Node* new_node = (Node*)malloc(sizeof(Node));
    new_node->data = (char*)malloc(strlen(data) + 1);
    strcpy(new_node->data, data);
    new_node->next = NULL;
    return new_node;
}

void push(const char* data){
    Node* new_node = create_node(data);
    if(stack == NULL){
        stack = new_node;
        return;
    }
    new_node->next = stack;
    stack = new_node;
}

void pop(){
    if(stack == NULL){
        print_error("malformed pathname");
        return;
    }
    Node* tmp = stack;
    stack = stack->next;
    free(tmp->data); // release the copied string as well as the node
    free(tmp);
}

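void clear_stack(){
    // Free every node and its string, not just the top of the stack
    while(stack != NULL){
        Node* tmp = stack;
        stack = stack->next;
        free(tmp->data);
        free(tmp);
    }
}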

void print_node(Node* node){
    if(node == NULL){
        return;
    }
    print_node(node->next);
    printf("%s", node->data);
}

void print_sanitized_pathname(){
    print_node(stack);
    printf("\n");
}

char* concat_to_front(const char c, char* str) {
    size_t len = strlen(str);
    char* result = (char*)malloc(len + 2);
    strcpy(result + 1, str);
    result[0] = c;
    result[len + 1] = '\0';
    return result;
}

%}

%%

[\t ]+            ; // Skip spaces and tabs wherever they occur, including leading and trailing ones

\.\/*             ; // Swallow a "." component together with any slashes that follow it
    
\/                { // Forward slash found
                    if(error_flag){
                        continue;
                    }
                    total_char_count++;
                    last_was_slash = true;
                  }

[a-zA-Z0-9\.]+    { // Component found
                    if(error_flag){
                        continue;
                    }
    
                    int len = strlen(yytext);
                    total_char_count += len;

                    if(strcmp(yytext, "..") == 0){
                        pop();
                        continue;
                    }
                    if(len > MAX_COMPONENT_LENGTH){
                        print_error("component too long");
                    }
                    else if(total_char_count > MAX_PATHNAME_LENGTH){
                        print_error("pathname too long");
                    }
                    if(last_was_slash){
                        // push() copies its argument, so free the temporary
                        char* with_slash = concat_to_front('/', yytext);
                        push(with_slash);
                        free(with_slash);
                    } else {
                        push(yytext);
                    }
                    last_was_slash = false;
                  }

\n                {
                    if(!error_flag){
                        print_sanitized_pathname();
                    }
                    clear_stack();
                    total_char_count = 0;
                    error_flag = 0;
                    last_was_slash = false;
                  }

.                 { // Invalid character
                    if(error_flag){
                        continue;
                    }
                    print_error("invalid character");
                  }

%%

int main(void) {
    yylex();
    // print_error() writes no newline, so this one ends the final line of output
    printf("\n");
    return 0;
}

Overwriting spn.l


In [63]:
!flex spn.l

In [64]:
!cc -o spn -std=c99 lex.yy.c -D_POSIX_C_SOURCE=1

The file `goodpaths.txt` contains a set of paths to test. The extra blank line at the end is needed because `%%writefile` trims a trailing newline at the end of the input.

In [65]:
%%writefile goodpaths.txt
/aaa//bb/c/
aaa/b.b/../cc/./dd
a45/b.b/../cc/./dd/.
./////def/ghi///jkl//mno/pqr/../../././../../ghi/./jkl////
/.../.abc/./123/456/789/../../
./test/ing


Overwriting goodpaths.txt


In [66]:
!cat goodpaths.txt

/aaa//bb/c/
aaa/b.b/../cc/./dd
a45/b.b/../cc/./dd/.
./////def/ghi///jkl//mno/pqr/../../././../../ghi/./jkl////
/.../.abc/./123/456/789/../../
./test/ing


In [67]:
%%capture output
!cat goodpaths.txt | ./spn
# Should output
#/aaa/bb/c
#aaa/cc/dd
#a45/cc/dd
#def/ghi/jkl
#/.../.abc/123
#test/ing
#

In [68]:
print(output) # for testing purposes

/aaa/bb/c
aaa/cc/dd
a45/cc/dd
def/ghi/jkl
/.../.abc/123
test/ing




In [69]:
expected =  """/aaa/bb/c\r
aaa/cc/dd\r
a45/cc/dd\r
def/ghi/jkl\r
/.../.abc/123\r
test/ing\r
\r
"""
actual = str(output)
# Use these outputs to help debug line endings if needed
print(repr(actual))
print(repr(expected))
assert actual == expected

'/aaa/bb/c\r\naaa/cc/dd\r\na45/cc/dd\r\ndef/ghi/jkl\r\n/.../.abc/123\r\ntest/ing\r\n\r\n'
'/aaa/bb/c\r\naaa/cc/dd\r\na45/cc/dd\r\ndef/ghi/jkl\r\n/.../.abc/123\r\ntest/ing\r\n\r\n'


In [70]:
%%capture output
!echo "/a//b/#/c" | ./spn # Should output `invalid character`

In [71]:
print(output) # for testing purposes

invalid character



In [72]:
assert str(output) == 'invalid character\r\n'

In [73]:
%%capture output
!echo "/012345678901234/bb" | ./spn # Should output `component too long`

In [74]:
print(output) # for testing purposes

component too long



In [75]:
assert str(output) == 'component too long\r\n'

In [76]:
%%capture output
!echo "aa/../.." | ./spn # Should output `malformed pathname`

In [77]:
print(output) # for testing purposes

malformed pathname



In [78]:
assert str(output) == 'malformed pathname\r\n'

In [79]:
%%capture output
#long_path = '/'.join(['0123456789' for _ in range(26)])
long_path = '/'.join(['0123456789' for _ in range(26)]) + '/'.join(['..' for _ in range(23)])
!echo $long_path | ./spn # should output `pathname too long`

In [80]:
print(output) # for testing purposes

pathname too long



In [81]:
assert str(output) == 'pathname too long\r\n'

In [82]:
%%capture output
!echo "/abc/123/abcdefghijklmno" | ./spn # Should output `component too long`

In [83]:
print(output) # for testing purposes

component too long



In [84]:
assert str(output) == 'component too long\r\n'

In [85]:
%%capture output
!echo "/abc/def\xE2\x98\xA0/" | ./spn # Should output `invalid character`

In [86]:
print(output) # for testing purposes

invalid character



In [87]:
assert str(output) == 'invalid character\r\n'

In [88]:
%%capture output
!echo "abcdef/./def/../../.." | ./spn # Should output `malformed pathname`

In [89]:
print(output) # for testing purposes

malformed pathname



In [90]:
assert str(output) == 'malformed pathname\r\n'