array/tensor refinement #14

dvolgyes · 2020-11-13T20:37:32Z

Hi,

Great project!
I think it would be relatively easy to make it even more awesome. :)

Many people work in data science nowadays, and there
tensor properties such as shape are a more frequent source of errors
than, e.g., the values in the array.

Currently arrays / tensors are displayed like this:

x: ndarray = array([ 1, 2...  ])
y: Tensor = tensor([ 1, 2...  ])

I would recommend at least two extra pieces of information: shape and dtype, something like this:

x: ndarray[float64] (200,300) = array([ 1, 2...  ])

Or maybe:

x: Tensor, cpu/f32, 2d, (200,300) = tensor([ 1, 2...  ])

Maybe it would be even nicer to make the hook configurable, like:

import frosch
frosch.hook(array_shapes=True, array_types=True, array_device=False, requires_grad=True, array_values=False)

Of course, you can always add an almost infinite number of useful tricks for numpy/pytorch, e.g.
isnan/isinf checks, so here is a final variant to consider:

x: Tensor, type=cpu/f32, grad=F, shape=(200,300), inf:0 NaN:0, min: -3, max:1e8, mean: 0.0001

Of course, these might take time to print, but if it is configurable, then it doesn't really matter.
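
A minimal sketch of how such a header could be built, assuming duck typing on the `shape` and `dtype` attributes so that neither numpy nor torch has to be imported. The function name `describe_header` is hypothetical, and the keyword flags only mirror the proposed `frosch.hook(...)` arguments above; none of this is an existing frosch API.

```python
from typing import Any


def describe_header(name: str, value: Any, *, array_shapes: bool = True,
                    array_types: bool = True) -> str:
    """Build a 'name: Type[dtype] (shape)' header for array-like values.

    Hypothetical helper: anything exposing `dtype` and/or `shape` attributes
    (ndarray, Tensor, ...) gets the extra fields; other values get only
    their type name.
    """
    header = f"{name}: {type(value).__name__}"
    dtype = getattr(value, "dtype", None)
    shape = getattr(value, "shape", None)
    if array_types and dtype is not None:
        header += f"[{dtype}]"
    if array_shapes and shape is not None:
        # Assumes `shape` is iterable, as it is for ndarray and Tensor.
        header += f" {tuple(shape)}"
    return header
```

Reading two attributes is cheap, so this part could stay enabled by default, while the costlier statistics (NaN counts, min/max) remain opt-in.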

Another way, which would be also cool, is having a configurable extra printer.
E.g.

import numpy as np
import frosch
import my_tricks
frosch.hook( (np.ndarray, my_tricks.numpy_printer) )

In this case anybody could customize their own printer, and add an extra representation.
I imagine the print like this:

x: ndarray "Whatever string the function returned"

And the function interface would be:

def numpy_printer(value, default_printers_text):
    extra_info = ...  # placeholder: whatever the user wants to add
    return extra_info + default_printers_text

This might mess up the colors, I can see that.

There could be other ways to define values, like returning a dictionary of texts,
like "pre_type", "post_type", "pre_value", "post_value" , and format them something like this:

f'{pre_type}{type}{post_type} = {pre_value}{value}{post_value}'

This way you don't have many issues with the coloring: it is still controlled by you,
but people could customize the printer.
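
A rough sketch of that dictionary-based variant, assuming the library keeps control of the type and value text (and therefore of the coloring) while the user only supplies the four surrounding fragments. The `render` function and its `colorize` callback are hypothetical names for illustration, not part of frosch.

```python
from typing import Callable, Dict

PARTS = ("pre_type", "post_type", "pre_value", "post_value")


def render(name: str, type_text: str, value_text: str,
           extras: Dict[str, str],
           colorize: Callable[[str, str], str] = lambda role, text: text) -> str:
    """Combine user-supplied fragments with library-controlled type/value text."""
    unknown = set(extras) - set(PARTS)
    if unknown:
        raise KeyError(f"unsupported parts: {sorted(unknown)}")
    # Missing fragments default to the empty string.
    p = {key: extras.get(key, "") for key in PARTS}
    # Only the library-owned pieces go through the colorizer.
    typed = f"{p['pre_type']}{colorize('type', type_text)}{p['post_type']}"
    valued = f"{p['pre_value']}{colorize('value', value_text)}{p['post_value']}"
    return f"{name}: {typed} = {valued}"
```

For example, `render("x", "ndarray", "array([1, 2])", {"post_type": " (200, 300)"})` yields `x: ndarray (200, 300) = array([1, 2])`.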

Which one to choose?
Well, I like my own printers, e.g. testing for NaNs, but for most people a reasonable built-in would be the easiest,
maybe with some verbosity parameter in the hook installer.

But once again: it is a great project already, and much more readable than any default.
I really appreciate your efforts, and even if you don't incorporate anything from the above,
it is already a lot of help for a lot of people. But maybe the above could make it even better. :)


HallerPatrick · 2020-11-14T23:23:59Z

I really like the idea of some extra formatting for well-established data structures like np arrays or pandas DataFrames. I can see that frosch will have a lot more configuration options backed up by some default settings. Allowing for extra "callback" hooks for different types is a really cool idea.

Something like what you already showed:

import numpy as np

def nparray_printer(np_array: np.ndarray) -> str:
    return "Custom repr here"

# hook provides an interface where type hooks can be customized
hook(
    type_hooks={
        np.ndarray: nparray_printer,  # np.ndarray is the type; np.array is only a factory function
        # Other <type>: <function>
    }
)

I really appreciate your suggestions and the time you put into them. I will definitely take a look at this. But you are also very welcome to implement them yourself :)

dvolgyes · 2020-11-14T23:39:44Z

Hi,

I would like to discuss the direction first. :) Many of the ideas are mutually exclusive, so nobody should waste effort. Of course, if I deeply disagree, I could have my own fork/version, but I think everybody sees a different aspect, so common wisdom leads to a better solution. If we agree on a general direction, I will send PRs. :)

E.g. I could imagine an exception formatter for logging purposes too (file/string) which uses the same backbone, but instead of installing a hook, it could be called as frosch.pretty_exception_str(exception) in a try/except block.

Anyway, I think there are basically two parts. The first is the interface for custom types. Here it would be nice to use an isinstance(...) match, so people could write a generic solution (e.g. for all tensor types). The other part is a mechanism to access the default formatter, e.g. frosch.default_printer(type). Implementing array formatting is a pain, so having the default is nice, but being able to extend it would be cool.
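
To make the two parts concrete, here is a minimal sketch of an isinstance-based registry whose printers also receive the default formatter, so they can extend it rather than reimplement it. The class name `TypeHooks` and its methods are assumptions for illustration and do not describe how frosch resolves hooks.

```python
from typing import Any, Callable, List, Tuple, Type

Default = Callable[[Any], str]
Printer = Callable[[Any, Default], str]


class TypeHooks:
    """Ordered registry: the first hook whose type matches via isinstance wins."""

    def __init__(self) -> None:
        self._hooks: List[Tuple[Type, Printer]] = []

    def register(self, type_: Type, printer: Printer) -> None:
        self._hooks.append((type_, printer))

    def format(self, value: Any, default: Default = repr) -> str:
        for type_, printer in self._hooks:
            # isinstance also matches subclasses, so a single hook can
            # cover a whole family of types.
            if isinstance(value, type_):
                return printer(value, default)
        return default(value)
```

A hook registered as `hooks.register(dict, lambda d, default: f"size:{len(d)} {default(d)}")` would then apply to `dict` and to every subclass of it, while unmatched values fall through to the default formatter.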


HallerPatrick · 2020-11-17T23:51:14Z

@dvolgyes Hey, I am fiddling around with this feature right now. Could you provide me with a sample "display" function that I can test with?

Something in the form of:

def display_np_array(np_array) -> str:
    # Here your implementation
    return "Here the string representation you want"

Thanks in advance :)

dvolgyes · 2020-11-18T13:04:55Z

Hi,

I meant something like this below.
I assume that the base array formatting is still provided,
it is just about the customization of the other parts.
Having options would make it more configurable.
There could be generic options too, e.g. verbose, etc. which would apply to all plugins.

Marking: I just made a dummy example, but a generic marker would be nice, e.g.
one that renders warnings in red.
This could be provided by the base library, e.g. frosch.warning(x),
or passed in as a parameter, e.g.
def display_np_array(np_array, config=None, warning: Callable = None) -> str:

Anyway, here is a short example:

#!/usr/bin/env python3

import numpy as np

x = np.arange(400*400).reshape(4,1,100,20,20).astype(np.float32)
x[1,0,3,2,1] = np.inf
x[1,0,2,2,1] = np.nan

# Short labels for common dtypes; anything else is reported as 'unknown'.
DTYPE_NAMES = {
    np.dtype('float32'): 'f32',
    np.dtype('float64'): 'f64',
    np.dtype('int8'): 'i8',
    np.dtype('int16'): 'i16',
    np.dtype('int32'): 'i32',
    np.dtype('int64'): 'i64',
    np.dtype('uint8'): 'ui8',
    np.dtype('uint16'): 'ui16',
    np.dtype('uint32'): 'ui32',
    np.dtype('uint64'): 'ui64',
    np.dtype('bool'): 'bool',
    np.dtype('object'): 'obj',
}


def warning_marker(text):
    # It would mark the content as a warning, e.g. coloring it red
    return f'!!{text}!!'


def display_np_array(np_array, config=None) -> str:
    if config is None:
        config = dict()

    # isnan/isinf and the statistics only make sense for numeric data
    # (dtype kinds: i = signed int, u = unsigned int, f = float).
    numeric = np_array.dtype.kind in 'iuf'

    result = ""
    if config.get('np_array.shape', True):
        result = f'{result} {np_array.shape}'

    if config.get('np_array.dtype', True):
        if np_array.dtype.kind == 'U':  # unicode strings of any length
            dtype = 'str'
        else:
            dtype = DTYPE_NAMES.get(np_array.dtype, 'unknown')
        result = f'{result}[{dtype}]'

    if numeric and config.get('np_array.finite', True):
        nan = np.isnan(np_array).sum()
        inf = np.isinf(np_array).sum()
        nan = warning_marker(f'NaNs:{nan}') if nan > 0 else f'NaNs:{nan}'
        result = f'{result} {nan} Infs:{inf}'

    if numeric and config.get('np_array.stat', True):
        tmp = np_array.ravel()
        tmp = tmp[np.isfinite(tmp)]
        if tmp.size:  # min/max raise on an empty selection
            min_ = tmp.min()
            max_ = tmp.max()
            med_ = np.median(tmp)
            result = f'{result} min:{min_} max:{max_} median:{med_}'

    # I assume the array will be printed anyway
    # So I expect it in this form:
    # VARIABLE = {result} \n
    #            [[ .... ],
    #               .....]]

    return result


print(display_np_array(x))

HallerPatrick · 2020-11-19T16:07:45Z

Doesn't look too shabby.

I am not sure about the configs for those functions. This could turn out to be a little too confusing: allowing custom datatype display functions while also being able to configure preexisting display functions. For big datatypes like DataFrames and np arrays this makes sense, but in most scenarios it would be overkill.

While messing around with this I realised that this feature is not far away from print debugging on steroids, which I am a fan of. I like tools like icecream. But we could make it even better with our custom datatype display functions.

from frosch import frosch_print # Needs a shorter name tho

frosch_print(np_array)

# Output: Like the one in the image

Frosch is going in a direction of a multifunctional debugging tool...

What is your opinion, @dvolgyes?

dvolgyes · 2020-11-21T10:45:04Z

There are many different ways; in the end, you need to choose what you prefer. :)
I also like icecream, who doesn't? :) But there I need to call print (ic) manually,
while frosch can customize the exceptions I don't expect. Information-wise,
I think more or less the same content is expected: as much debug information as possible,
while still keeping it concise. And of course, there is always a performance tradeoff.
Finding NaNs, infs, the median, etc. might not be important, might be extended (min/max, 5/95 percentile, etc.),
and might take a lot of time or even cause trouble: imagine an out-of-memory exception.
The instructions above also take significant memory, so they would raise new exceptions.
And some of these exceptions, like PyTorch's CUDA out of memory, are not the standard
Python out-of-memory error, so you would need special handling.

That is more or less why I used configuration parameters: to balance these tradeoffs.
For the user-facing part, I would go for sensible defaults and/or replaceable formatters.

Advanced users could replace anything and everything they want, or in the worst case just spin up their own version,
but for average users sensible defaults are the best.
Replacing/customizing: there could be a million ways, see the functional version above,
but it could also be that you need to provide a class, and this class gets called with
all the information, and the customization is to handle the print,
e.g.

import pprint
import frosch

class MyPrint(frosch.CustomPrinter):  # CustomPrinter is a proposed base class

    def __init__(self, configuration):
        super().__init__(configuration)

    def print_type(self, var_name, var):
        if isinstance(var, dict):
            name = var_name
            meta = f'size:{len(var)} None-key:{None in var.keys()} None-value:{None in var.values()} ...'
            value = pprint.pformat(var)
        # elif ...: (other supported types)
        else:
            # Unsupported type: either call 'super',
            return super().print_type(var_name, var)
            # or signal a fallback to the built-in printer, e.g.:
            #   return None
            #   raise NotImplementedError

        return {'name': name, 'meta': meta, 'value': value}

In this formulation, you could have your own class, even frosch could provide a few alternatives, e.g.

import frosch
frosch.hook()
frosch.hook(frosch.data_science_tweaks)

Another point to consider: installing the hook directly on module import,
so the user side would be just:

import frosch
# frosch.set_custom_printer(...)

The class-based, exception-signaled approach has its own nice property: if there is an error, it would automatically fall back
to a less advanced implementation. From your point of view, it could have three stages:

basic printer: just names and types
advanced printer, e.g. median
user printer: whatever

If there is a CUDA error, user code falls back to your advanced plugin.
If there is a bug/exception there, it falls back to the basic printer.
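
The staged fallback could look roughly like the sketch below, assuming each printer is a plain callable that either returns text, returns None, or raises. The names `print_with_fallback` and `basic_printer` are hypothetical and only illustrate the idea.

```python
from typing import Any, Callable, Optional, Sequence

Printer = Callable[[str, Any], Optional[str]]


def basic_printer(name: str, value: Any) -> str:
    """Last resort: only the name and the type, which should not fail."""
    return f"{name}: {type(value).__name__}"


def print_with_fallback(name: str, value: Any,
                        printers: Sequence[Printer]) -> str:
    """Try printers from most to least advanced; fall through on any failure."""
    for printer in printers:
        try:
            text = printer(name, value)
        except Exception:
            # NotImplementedError, an out-of-memory error, or a plain bug in
            # a plugin all mean "let the next, simpler printer try".
            continue
        if text is not None:
            return text
    return basic_printer(name, value)
```

Called as `print_with_fallback("x", x, [user_printer, advanced_printer])`, a failure in the user printer hands over to the advanced one, and a failure there ends at the basic printer.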

However, of course, messing with the exception hooks, and using exceptions at the same time
might cause some challenges to nicely test all corner cases.

From my point of view, I don't like IDEs. I usually work on the command line with basic editors,
and with long runs (hours/days). So if my code raises an exception, it is a pain to figure it out,
because a re-run might be very complicated. So I prefer to have as much information as possible.
But I am not a typical user; I probably focus on other parts than, e.g., a pandas user.
And there are so many options: machine learning alone has numpy, pandas, h2o, tensorflow, pytorch, ...
Not to mention web libraries like urllib, where error codes could be printed,
or regex errors, where the regex could be printed next to the input string, etc.

It is a never ending story, but plugins could make it flexible enough, but at the same time,
you don't have to implement all the possible types. (Well, proper types should have a nice 'repr'
format anyway, but we all know this isn't always the case.)

Of course, you could have a plugin system similar to flake8's, which discovers plugins through
Python package entry points, but I think this is overkill. One or multiple custom functions/classes
should be enough, and config values could be looked up from the environment (e.g. $HOME/.frosch.config, etc.),
or from the hook function arguments.
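
A small sketch of how the two configuration sources could be layered, assuming the user file is JSON (the thread does not specify a format) and that hook arguments should win over the file. `load_config`, `DEFAULTS`, and the option names are hypothetical.

```python
import json
import os
from typing import Any, Dict, Optional

# Hypothetical option names, borrowed from the hook proposal in this thread.
DEFAULTS: Dict[str, Any] = {
    "array_shapes": True,
    "array_types": True,
    "array_values": False,
}


def load_config(path: Optional[str] = None, **overrides: Any) -> Dict[str, Any]:
    """Merge defaults < config file < hook arguments (later sources win)."""
    if path is None:
        path = os.path.join(os.path.expanduser("~"), ".frosch.config")
    config = dict(DEFAULTS)
    try:
        with open(path, encoding="utf-8") as handle:
            config.update(json.load(handle))  # assumes a JSON object
    except FileNotFoundError:
        pass  # no user file: keep the defaults
    config.update(overrides)
    return config
```

With this ordering, a user can set personal defaults once in the file and still override a single option per call.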

HallerPatrick added the feature New feature or request label Nov 14, 2020

HallerPatrick added this to the 0.2.0 Release milestone Nov 17, 2020

alexmojaki mentioned this issue Feb 6, 2021

Values with large reprs #34

Closed

HallerPatrick closed this as completed Dec 24, 2021
