# Structural pattern matching for data science

## Matching a string to parse malformed CSV

Messy data is everywhere. Let's say we have a string, `bad_csv`, containing malformed CSV that could, for example, have been read from a file and looks like this:

In [1]:
bad_csv = """
0,1,2
1,2,3
1,2
0
1
"""

We now want to convert `bad_csv` to a rectangular list of lists according to the following rules:

- keep lines with three values
- for lines with two values only, add a `None`
- skip empty lines
- for lines whose single value is a 1, add 2 and 3
- for lines whose single value is not a 1, add `None` and `None`

This can easily be translated into a structural pattern matching statement:

In [2]:
values = []
for line in bad_csv.split("\n"):
    match line.split(","):
        case [x, y, z]:
            values.append([x, y, z])
        case [x, y]:
            values.append([x, y, None])
        case [""]:
            continue
        case ["1"]:
            values.append([1, 2, 3])
        case [x]:
            values.append([x, None, None])
        case _:  # always matches
            raise Exception("This should not happen as we want to handle every case explicitly.")

For comparison, doing the above with `if`/`elif` blocks would involve repeated length checks on `line.split(",")` and manual indexing into the result, which is harder to read.

In [3]:
values

[['0', '1', '2'],
 ['1', '2', '3'],
 ['1', '2', None],
 ['0', None, None],
 [1, 2, 3]]
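
To make the comparison with `if`/`elif` blocks concrete, here is a minimal sketch of the same rules without pattern matching. The names `fields` and `values_if` are introduced only for this illustration, and `bad_csv` is repeated so the snippet runs on its own. It should produce the same list as the `match` version.

```python
bad_csv = """
0,1,2
1,2,3
1,2
0
1
"""

values_if = []
for line in bad_csv.split("\n"):
    fields = line.split(",")
    if len(fields) == 3:
        # keep lines with three values
        values_if.append(fields)
    elif len(fields) == 2:
        # pad lines with two values
        values_if.append([fields[0], fields[1], None])
    elif fields == [""]:
        # skip empty lines
        continue
    elif fields == ["1"]:
        # a single 1 is completed to 1, 2, 3
        values_if.append([1, 2, 3])
    elif len(fields) == 1:
        # any other single value is padded with None
        values_if.append([fields[0], None, None])
    else:
        raise ValueError(f"Unexpected number of fields in line: {line!r}")
```

Each branch has to spell out a length check and index into `fields` by hand, whereas the `case` patterns state the expected shape and bind the names in one step.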

Note that the order of the cases matters and that there is no fall-through once a case matches.
Compare

In [4]:
match ["a"]:
    case [x]:
        print("x")
    case ["a"]:
        print("a")

x


to

In [5]:
match ["a"]:
    case ["a"]:
        print("a")
    case [x]:
        print("x")

a


Before going all-in on structural pattern matching, note that, at the time of this writing, the popular code formatter **black does not support reformatting structural pattern matching and fails with an error when encountering such a construct**.

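Can `case` clauses be nested directly inside another `case` block? No: a `case` clause is only valid directly inside a `match` statement, so the following attempt fails with a `SyntaxError`: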

In [7]:
values = []
for line in bad_csv.split("\n"):
    match line.split(","):
        case [x, y, z]:
            values.append([x, y, z])
        case [x, y]:
            values.append([x, y, None])
        case [x]:
            case [""]:
                continue
            case ["1"]:
                values.append([1, 2, 3])
        case [x]:
            values.append([x, None, None])
        case _:  # always matches
            raise Exception("This should not happen as we want to handle every case explicitly.")

SyntaxError: invalid syntax (1997687921.py, line 9)

## Matching a dataclass

In [6]:
from dataclasses import dataclass

## Matching a REST response

## Matching a pandas DataFrame

In [6]:
# TODO

## Matching a scikit-learn model

In [7]:
# TODO

## Matching a keras model

In [7]:
# TODO

## Further reading

[PEP 636 -- Structural Pattern Matching: Tutorial](https://www.python.org/dev/peps/pep-0636/)