array/tensor refinement #14

Closed
dvolgyes opened this issue Nov 13, 2020 · 6 comments

@dvolgyes

Hi,

Great project!
I think it would be relatively easy to make it even more awesome. :)

Many people work in data science nowadays, and there
tensor properties such as shape are a more frequent source of errors
than, e.g., the values in the array.

Currently arrays / tensors are displayed like this:

x: ndarray = array([ 1, 2...  ])
y: Tensor = tensor([ 1, 2...  ])

I would recommend at least two extra pieces of information: shape and dtype, something like this:

x: ndarray[float64] (200,300) = array([ 1, 2...  ])

Or maybe:

x: Tensor, cpu/f32, 2d, (200,300) = array([ 1, 2...  ])

Maybe it would be even nicer to make the hook configurable, like:

import frosch
frosch.hook(array_shapes=True, array_types=True, array_device=False, requires_grad=True, array_values=False)

Of course, you can always add almost infinite useful tricks to numpy/pytorch, e.g.
isnan/isinf, so here is a final variant to consider:

x: Tensor, type=cpu/f32, grad=F, shape=(200,300), inf:0 NaN:0, min: -3, max:1e8, mean: 0.0001

Of course, these might take time to print, but if it is configurable, then it doesn't really matter.

Another way, which would be also cool, is having a configurable extra printer.
E.g.

import numpy as np
import frosch
import my_tricks
frosch.hook( (np.ndarray, my_tricks.numpy_printer) )

In this case anybody could customize their own printer, and add an extra representation.
I imagine the print like this:

x: ndarray "Whatever string the function returned"

And the function interface would be:

        extra_info = ....
        return extra_info + default_printers_text

This might mess up the colors, I can see that.

There could be other ways to define values, like returning a dictionary of texts,
like "pre_type", "post_type", "pre_value", "post_value" , and format them something like this:

f'{pre_type}{type}{post_type} = {pre_value}{value}{post_value}'

This way you don't have many issues with the coloring: it is still controlled by you,
but people could customize the printer.
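
A minimal sketch of the dictionary variant, assuming a hypothetical `render` helper on the library side (this is not frosch's actual API, and `FakeArray` is only a stand-in so the example runs without numpy). The user printer returns just the fragments it wants to add, and the library keeps control of the layout and therefore of the coloring:

```python
# Hypothetical sketch, not frosch's real API: the user printer returns optional
# text fragments; the library fills in the rest and owns the final layout.
DEFAULT_PARTS = {"pre_type": "", "post_type": "", "pre_value": "", "post_value": ""}


def numpy_like_printer(value):
    """Example user printer: only adds shape information after the type."""
    return {"post_type": f" {getattr(value, 'shape', '?')}"}


def render(name, value, printer):
    """Merge the printer's fragments with empty defaults and format the line."""
    parts = {**DEFAULT_PARTS, **printer(value)}  # missing keys fall back to ""
    type_name = type(value).__name__
    return (
        f"{name}: {parts['pre_type']}{type_name}{parts['post_type']}"
        f" = {parts['pre_value']}{value!r}{parts['post_value']}"
    )


class FakeArray:
    """Stand-in for an array type so the sketch runs without numpy."""
    shape = (200, 300)

    def __repr__(self):
        return "array([1, 2, ...])"


print(render("x", FakeArray(), numpy_like_printer))
```

Because the fragments are plain strings, the library could color the type and the value separately before joining them.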

Which one to choose?
Well, I like my own printers, e.g. testing for NaNs, but for most people a reasonable built-in would be the easiest,
maybe with some verbosity parameter in the hook installer.

But once again: it is a great project already, and much more readable than any default.
I really appreciate your efforts, and even if you don't incorporate anything from the above,
it is already a lot of help for a lot of people. But maybe the above could make it even better. :)

@HallerPatrick
Owner

I really like the idea of some extra formatting for well-established data structures like NumPy arrays or pandas DataFrames. I can see that frosch will have a lot more configuration options backed up by some default settings. Allowing for extra "callback" hooks for different types is a really cool idea.

Something like this, what you already showed:

def nparray_printer(np_array: np.ndarray) -> str:
    return "Custom repr here"

# Hook provides an interface where type hooks can be customized
hook(
    type_hooks={
        np.ndarray: nparray_printer,
        # Other <type>: <function>
    }
)

I really appreciate your suggestions and the time you put into them. I will definitely take a look at this. But you are also very welcome to implement them yourself :)
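
One way such a `type_hooks` mapping could be consulted is sketched below. The `display` function and both printers are hypothetical illustrations, not frosch's implementation, and built-in types are used so the example runs without numpy:

```python
# Hypothetical sketch of a type-hook lookup; `type_hooks` mirrors the proposal
# in this thread but this is not frosch's actual implementation.
from collections import OrderedDict


def bytes_printer(value: bytes) -> str:
    return f"<{len(value)} bytes>"


def mapping_printer(value: dict) -> str:
    return f"<dict with {len(value)} keys>"


def display(value, type_hooks) -> str:
    """Use the first hook whose type matches, else fall back to repr()."""
    for hooked_type, printer in type_hooks.items():
        if isinstance(value, hooked_type):  # also matches subclasses
            return printer(value)
    return repr(value)


type_hooks = {bytes: bytes_printer, dict: mapping_printer}

print(display(b"abc", type_hooks))
print(display(OrderedDict(a=1), type_hooks))  # subclass of dict
print(display(42, type_hooks))                # no hook registered
```

Matching with `isinstance` means a hook registered for a base class also covers its subclasses; an exact `type(value)` lookup would be stricter but faster.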

@HallerPatrick HallerPatrick added the feature New feature or request label Nov 14, 2020
@dvolgyes
Author

dvolgyes commented Nov 14, 2020 via email

@HallerPatrick
Owner

@dvolgyes Hey, I am fiddling around with this feature right now. Could you provide me with a sample "display" function that I can test with?

Something in the form of:

def display_np_array(np_array) -> str:
    # Here your implementation
    return "Here the string representation you want"

Thanks in advance :)

@HallerPatrick HallerPatrick added this to the 0.2.0 Release milestone Nov 17, 2020
@dvolgyes
Author

Hi,

I meant something like this below.
I assume that the base array formatting is still provided,
it is just about the customization of the other parts.
Having options would make it more configurable.
There could be generic options too, e.g. verbose, etc. which would apply to all plugins.

Marking: I just made a dummy example, but a generic marker would be nice, e.g.
one that renders warning results in red.
This could be provided by the base library, e.g. frosch.warning(x),
or provided as a parameter, e.g.
def display_np_array(np_array, config=None, warning:Callable=None) -> str:
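
The marker-as-parameter variant could look roughly like the sketch below. The names `ansi_red` and `display_values` are made up for illustration, and plain Python floats stand in for an array so the example has no dependencies; ANSI escape codes are just one possible way to color terminal output:

```python
import math
from typing import Callable, Optional, Sequence


def ansi_red(text: str) -> str:
    """One possible marker: wrap the text in ANSI escape codes for red."""
    return f"\033[31m{text}\033[0m"


def display_values(values: Sequence[float], config=None,
                   warning: Optional[Callable[[str], str]] = None) -> str:
    # `warning` is the hypothetical marker parameter from the signature above;
    # without one, the text is left unmarked. `config` is unused in this sketch.
    mark = warning if warning is not None else (lambda text: text)
    nan_count = sum(1 for v in values if math.isnan(v))
    text = f"NaNs:{nan_count}"
    return mark(text) if nan_count > 0 else text


print(display_values([1.0, float("nan")], warning=ansi_red))
```

Passing the marker in keeps the display function free of any coloring logic, so the library can swap in its own styling.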

Anyway, here is a short example:

#!/usr/bin/env python3

import numpy as np

x = np.arange(400*400).reshape(4,1,100,20,20).astype(np.float32)
x[1,0,3,2,1] = np.inf
x[1,0,2,2,1] = np.nan


def warning_marker(text):
    # It would mark the content as warning, e.g. coloring to red
    return f'!!{text}!!'


def display_np_array(np_array, config=None) -> str:
    if config is None:
        config = dict()

    # NaN/Inf checks and statistics only make sense for numeric data
    # (dtype kinds: signed integer, unsigned integer, float)
    numeric = np_array.dtype.kind in 'iuf'

    result = ""
    if config.get('np_array.shape', True):
        result = f'{result} {np_array.shape}'

    if config.get('np_array.dtype', True):
        if np_array.dtype.kind in 'SU':
            # Fixed-width string dtypes (e.g. '<U5') differ per width,
            # so they cannot be matched by a single dictionary key
            dtype = 'str'
        else:
            dtype = { np.dtype('float32'): 'f32',
                      np.dtype('float64'): 'f64',
                      np.dtype('int8'): 'i8',
                      np.dtype('int16'): 'i16',
                      np.dtype('int32'): 'i32',
                      np.dtype('int64'): 'i64',
                      np.dtype('uint8'): 'ui8',
                      np.dtype('uint16'): 'ui16',
                      np.dtype('uint32'): 'ui32',
                      np.dtype('uint64'): 'ui64',
                      np.dtype('bool'): 'bool',
                      np.dtype('object'): 'obj',
                      }.get(np_array.dtype, 'unknown')

        result = f'{result}[{dtype}]'

    if numeric and config.get('np_array.finite', True):
        nan = np.isnan(np_array).sum()
        inf = np.isinf(np_array).sum()
        if nan > 0:
            nan = warning_marker(f'NaNs:{nan}')
        else:
            nan = f'NaNs:{nan}'

        result = f'{result} {nan} Infs:{inf}'

    if numeric and config.get('np_array.stat', True):
        tmp = np_array.ravel()
        tmp = tmp[np.isfinite(tmp)]
        if tmp.size > 0:  # min/max raise on an empty array
            min_ = tmp.min()
            max_ = tmp.max()
            med_ = np.median(tmp)
            result = f'{result} min:{min_} max:{max_} median:{med_}'

    # I assume the array will be printed anyway
    # So I expect it in this form:
    # VARIABLE = {result} \n
    #            [[ .... ],
    #               .....]]

    return result


print(display_np_array(x))

@HallerPatrick
Owner

Doesn't look too shabby.
[Image: Screenshot 2020-11-19 at 16 58 29]

I am not sure about the configs for those functions. This could turn out to be a little too confusing: allowing for custom datatype display functions, but also being able to configure preexisting display functions. For big datatypes like DataFrames and NumPy arrays this makes sense, but in most scenarios it would be overkill.

While messing around with this I realised that this feature is not far away from print debugging on steroids, which I am a fan of. I like tools like icecream. But we could make it even better with our custom datatype display functions.

from frosch import frosch_print # Needs a shorter name tho

frosch_print(np_array)

# Output: Like the one in the image

Frosch is going in a direction of a multifunctional debugging tool...

What is your opinion, @dvolgyes?

@dvolgyes
Author

There are many different ways; in the end, you need to choose what you prefer. :)
I also like icecream, who doesn't? :) But there I need to call print (ic) manually,
while frosch can customize the exceptions I don't expect. Information-wise,
I think more or less the same content is expected: as much debug information as possible,
while still keeping it concise. And of course, there is always a performance tradeoff.
Finding NaNs, Infs, the median, etc. might not be important, might be extended (min/max, 5/95 percentile, etc.),
and might take a lot of time or even cause trouble, e.g. imagine handling an out-of-memory exception.
The operations above also take significant memory, so they could raise new exceptions.
And some of these exceptions, like PyTorch's CUDA out-of-memory error, are not the standard
Python out-of-memory error, so you would need special considerations.

That is more or less why I used configuration parameters: to balance these tradeoffs.
For the user-facing part, I would go for sensible defaults and/or replaceable formatters.
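
A rough sketch of such a tradeoff knob, assuming a hypothetical `max_elements` limit and using plain Python sequences instead of arrays so it runs without dependencies. Cheap information is always reported; the expensive statistics are skipped for large inputs or when computing them fails:

```python
import statistics


def summarize(values, max_elements=1_000_000):
    """Return cheap info always; expensive statistics only for small inputs."""
    info = f"len:{len(values)}"
    if len(values) == 0 or len(values) > max_elements:
        return info  # skip min/max/median when empty or too costly
    try:
        info += (f" min:{min(values)} max:{max(values)}"
                 f" median:{statistics.median(values)}")
    except (MemoryError, TypeError):
        # Out of memory or non-comparable data: keep the cheap part only.
        pass
    return info


print(summarize([3, 1, 2]))
print(summarize([3, 1, 2], max_elements=2))
```

Framework-specific failures, such as the CUDA out-of-memory case mentioned above, would need their own `except` clause, since they are not the standard `MemoryError`.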

Advanced users could replace anything and everything they want, or in the worst case just spin their own version,
but for average users sensible defaults are best.
Replacing/customizing: there could be a million ways, see the functional version above,
but it could also be that you need to provide a class, and this class gets called with
all the information, and the customization is in how it handles the print,
e.g.

import pprint
import frosch

class MyPrint(frosch.CustomPrinter):

    def __init__(self, configuration):
        super().__init__(configuration)

    def print_type(self, var_name, var):
        if isinstance(var, dict):
            name = var_name
            meta = f'size:{len(var)} None-key:{None in var.keys()} None-value:{None in var.values()} ...'
            value = pprint.pformat(var)
        # elif ...: (further types, e.g. np.ndarray)
        else:
            # Unhandled type: either call 'super'
            return super().print_type(var_name, var)
            # or signal a fall back to the built-in printer, e.g.:
            #   return None
            #   raise NotImplementedError

        return {'name': name, 'meta': meta, 'value': value}

In this formulation, you could have your own class, even frosch could provide a few alternatives, e.g.

import frosch
frosch.hook()
frosch.hook(frosch.data_science_tweaks)

Another point to consider: installing the hook directly on module import,
so the user side would be just:

import frosch
# frosch.set_custom_printer(...)

The class-based, exception-signaled approach has its own advantage: if there is an error, it would automatically fall back
to a less advanced implementation. From your point of view, there could be three stages:

  • basic printer: just names and types
  • advanced printer, e.g. median
  • user printer: whatever

If there is a CUDA error, user code falls back to your advanced plugin.
If there is a bug/exception there, it falls back to the basic printer.
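
The fallback chain could be sketched as follows. All three printers and `print_with_fallback` are hypothetical stand-ins, not frosch code; the point is only that any exception in one stage hands over to the next, simpler stage:

```python
def basic_printer(name, value):
    """Basic stage: just names and types; should never fail."""
    return f"{name}: {type(value).__name__}"


def advanced_printer(name, value):
    """Advanced stage: adds a statistic; fails on non-iterable values."""
    ordered = sorted(value)
    return f"{name}: {type(value).__name__} middle={ordered[len(ordered) // 2]}"


def user_printer(name, value):
    """User stage: here it deliberately signals 'not handled'."""
    raise NotImplementedError


def print_with_fallback(name, value, printers):
    """Try printers from most to least advanced; any exception moves on."""
    for printer in printers:
        try:
            return printer(name, value)
        except Exception:
            continue
    return basic_printer(name, value)  # last resort if every stage failed


chain = [user_printer, advanced_printer, basic_printer]
print(print_with_fallback("x", [3, 1, 2], chain))  # handled by the advanced stage
print(print_with_fallback("y", 42, chain))         # falls through to the basic stage
```

Catching `Exception` broadly is deliberate here: inside an exception hook, a failing printer should never hide the original traceback.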

However, of course, messing with the exception hooks and using exceptions at the same time
might make it challenging to test all corner cases properly.

From my point of view: I don't like IDEs, I usually work on the command line with basic editors,
and with long runs (hours/days). So if my code raises an exception, it is a pain to figure it out,
because a re-run might be very complicated. So I prefer to have as much information as possible.
But I am not a typical user; I probably focus on other parts than e.g. a pandas user.
And there are so many options: machine learning alone has numpy, pandas, h2o, tensorflow, pytorch, ...
Not to mention the web frameworks, like urllib, where error codes could be printed,
or regex errors, where the regex could be printed next to the input string, etc.

It is a never-ending story, but plugins could make it flexible enough while, at the same time,
you don't have to implement all the possible types. (Well, proper types should have a nice 'repr'
format anyway, but we all know this isn't always the case.)

Of course, you could have a plugin system similar to flake8's, which discovers plugins through
Python package entry points, but I think this is overkill. One or multiple custom functions/classes
should be enough, and config values could be looked up from the environment (e.g. $home/.frosch.config, etc.),
or from the hook function arguments.
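
A sketch of how such a lookup could be layered, under several assumptions that are not from frosch itself: the config file is JSON, environment variables use a `FROSCH_` prefix, and the precedence is defaults, then file, then environment, then explicit hook arguments:

```python
import json
import os
from pathlib import Path

# Hypothetical option names, borrowed from the hook example earlier in this thread.
DEFAULTS = {"array_shapes": True, "array_values": False}


def load_config(config_path=None, environ=None, **overrides):
    """Merge defaults < config file < environment variables < explicit arguments."""
    config_path = Path(config_path) if config_path else Path.home() / ".frosch.config"
    environ = os.environ if environ is None else environ

    config = dict(DEFAULTS)
    if config_path.is_file():
        config.update(json.loads(config_path.read_text()))  # assumes JSON content
    for key in DEFAULTS:
        env_value = environ.get(f"FROSCH_{key.upper()}")     # assumed naming scheme
        if env_value is not None:
            config[key] = env_value.lower() in ("1", "true", "yes")
    config.update(overrides)  # hook arguments win over everything else
    return config


print(load_config(config_path="does-not-exist.json",
                  environ={"FROSCH_ARRAY_VALUES": "1"}))
```

Letting the hook arguments win keeps a script's behavior reproducible regardless of what is configured on the machine it runs on.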
