Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Logging refactor #842

Open
aaron-blox opened this issue Feb 8, 2023 · 1 comment
Open

Logging refactor #842

aaron-blox opened this issue Feb 8, 2023 · 1 comment
Assignees
Labels
enhancement New feature or request

Comments

@aaron-blox
Copy link
Contributor

Background

SSV project had one type of logger. A refactor was started but not completed.

Assumptions/Goals

  1. We want to optimize for logging performance
  2. We want to optimize for understanding the lifecycle of an event though logging
  3. We don’t want un-logged debug logs to use resources when they will not be logged anyway

Existing problems

1. Log quality/content needs to be better

Message and field names are not standardized/uniform

Potential Solutions

  1. Wrappers for field names, specific for zap fields
  2. Declare/enforce that logs use field names which are defined as constants
  3. Add contextual information to logs earlier so they don’t need to get added to each log line. Fixing problem (3.) will naturally improve this.

2. Too many logs

For example, an operator has a problem and enables debug logs. This results in too many logs to be useful

Potential Solutions

  1. Tracing could potentially solve this issue. In the meantime until tracing is implemented, we can just dealing with it with a hacky solution in the meantime, like commenting out logs/wrapping logs in if globalconfig.VerboseDebugLogEnabled 

  2. Specify logging levels by component. We maybe already have this to some extent in runners, not everywhere. Specify components by regex in config. Change config struct to a more key:value design.

  3. Additional log level to support production debugging vs development debugging

    1. Criticism: why in principle is a production issue separate from an issue in development such that it requires different logs to solve?
    2. We will skip doing this for now, a better solution is 6. anyway
  4. Global variable which is used to gate additional debug logs. 6. Is better than this don’t do this

  5. Devs use debugger locally & in staging. Debugging server can be easily started locally. Debugging server can be started in staging and can be connected to with some amount of work/configuration.

  6. Write a zap middleware which operates on a global bool similar to 4. to selectively log logs with a specific magic value like zap.Bool(“verboseDebugLog”, true)

3. Logging is not request-scoped for global shared variables 

When instances are shared between different go-routines and use the same logger, for the logger to log request specific values, values are typically added into each log execution of the logger in a redundant manner. A global or instance level logger should not be used for the same reason.

For example: The database

Potential Solutions

  1. A logger with contextual fields already loaded should be passed in to the instance methods, for structs/instances which are shared between multiple message processing routines.

4. Methods return errors which are debug logged

Errors are encountered and are “handled” by debug logging the error.

Examples:

   ev, err := contractAbi.EventByID(vLog.Topics[0])

   if err != nil { // unknown event -> ignored

       ec.logger.Debug("could not read event by ID",

           zap.String("eventId", vLog.Topics[0].Hex()),

           zap.Uint64("block", vLog.BlockNumber),

           zap.String("txHash", vLog.TxHash.Hex()),

           zap.Error(err),

       )

       return "", nil

   }

           if isCloseError(err) {

               logger.Debug("failed to read message as the connection was closed", zap.Error(err))

               return

           }

if err != nil {

           dvs.logger.Debug("could not read subnets", zap.String("enr", e.Node.String()))

           return

       }

Potential Solutions

  1. Wrap these errors in a specific error type to handle them in specific ways, in a central place, higher up the stack.
  2. Log these errors at a trace level, which we don’t have today.

5. Heavy/Large logs

Logs are too large in size/length. This causes performance issues in the system, performance issues with the logging service (kibana), and makes reading logs difficult. 

Potential Solutions

  1. Zap middleware to truncate long logs. We maybe don’t want to truncate these.
  2. Zap middleware to change log level

6. Some log field values are expensive functions

Example: log.Debug(“msg”, zap.String(“field”, parsing.ExpensiveComputation(functionInput))

In this case, if debug logging is not enabled,  parsing.ExpensiveComputation(functionInput) is still executed anyway.

Potential Solutions

  1. Create a custom zap field type for lazy evaluation of log values, wrapped in an anonymous function:

log.Debug(“msg”, logginglib.LazyLoad(“field”, func() { parsing.ExpensiveComputation(functionInput) } )

7. Log search is difficult when trying to find one single “trace”

It would be nice to search for one ID that is logged for the entire flow of a certain message process

Fox example, in the blox project, a globally unique key is constructed and added to logs like: 

<taskType>-e<epoch>-s<slot>-v<validatorIndex>

Potential Solutions

  1. Generate a unique deterministic key which can be added to a loggers field like in the blox project.
  2. Mark critical event logs with an emoji which is easily identifiable in logs

Overall solution

Commit to these changes

  1. Pass request specific logger through program via context.Context

  2. Identify which fields/types of data are most useful to query

  3. Create documentation about how to write good logs. Have some process/mechanism to verify this in pull requests

  4. Get rid of global / struct level loggers whos individual instance is used by multiple go-routines/message processings

  5. Add contextual information to loggers which are instantiated when a new request/message enters the system (like a ctx.Context)

  6. Add zap middleware to change the log level of too long logs

  7. Define constants for log field names, or write functions which wrap values and return a full zap field.

  8. Change logger level config to a key:value structure, regex-path:level. Logger names/subNames should be defined as constants.

  9. Standardize logger naming so regex can be used to map components to different log levels

  10. Devs use debugger locally & in staging. Debugging server can be easily started locally. Debugging server can be started in staging and can be connected to with some amount of work/configuration.

  11. Custom zap field type for lazy evaluation of log values. Need one field per type, or create a type switch in the implementation.

  12. Identify where errors are being logged and not handled. Determine if that is good practice for the few cases where it happens.

  13. Create documentation &/or tooling to help devs use debugger locally and in staging easier.

  14. Rethink which logs we need when we are debugging. Look though current logs and validate if current is good (delete bad/add good)

  15. Identify a way to create a composite key to set in logs which is unique per message/process, like in the blox project.

Won’t do

  1. Refactor project to use a different logger that gets us another log level

Implementation Phases

Code changes which have no dependencies

  • Pass request specific logger through program via context.Context
  • Get rid of global / struct level loggers whos individual instance is used by multiple go-routines/message processings
  • Add zap middleware to change the log level of too long logs
  • Custom zap field type for lazy evaluation of log values. Need one field per type, or create a type switch in the implementation.

Path based log level configuration

  • Standardize logger naming so regex can be used to map components to different log levels. Right now it seems filepath based, is that really useful? Why not use package name? Current path used / it should use .
  • Change logger level config to a key:value structure, regex-path:level. Logger names/subNames should be defined as constants.

Standardize 

  • Identify which fields/types of data are most useful to query
  • Identify a way to create a composite key to set in logs which is unique per message/process, like in the blox project.
  • Mark critical event logs with an emoji which is easily identifiable in logs.
  • Define functions for log fields which wrap values and return a zap field type
  • Clean up existing logging “message” field.
  • Create documentation &/or tooling to help devs use debugger locally and in staging easier

Clean up unhandled errors

  • Identify where errors are being handled as being logged as debug logs. Determine the cases where the error occurs and what the intent of logging the error is. 
  • Create a new error type, or types, for these kinds of errors

Documentation (do last)

  • Create documentation about how to write good logs. Have some process/mechanism to verify this in pull requests. Maybe just use github PR template.
  • Create documentation &/or tooling to help devs use debugger locally and in staging easier. Give examples like conditional breakpoint & how to connect to remove debugging server. Docs for VSCode and goland

Review and improve existing logs (do last)

  • Rethink which logs we need when we are debugging. Look though current logs and validate if current is good (delete bad/add good)
@aaron-blox aaron-blox added the enhancement New feature or request label Feb 8, 2023
@aaron-blox aaron-blox self-assigned this Feb 8, 2023
@y0sher
Copy link
Contributor

y0sher commented Mar 15, 2023

Hi @aaron-blox , I have some experience with the zap logger.
I want to add some additional points to the "Too many logs" issue that might come in handy,

  • Use zap's With(k,v) for adding "fixed" fields and Named(string) for names, for example the node ID should be attached to probably almost every log line for a specific operator.
    This is done by initializing a logger in main and using With to add the ID when we have it ready (load from file/cfg/generate), then passing the logger down to any other component, of course this approach can be used within every component, for example adding the P2P ID on the P2P constructor to the logger using With and adding p2p using Named.
    resulting in a logger that logs something like this:
    TIME LEVEL node.p2p messagehere { "operator_id": "", "p2p_id": "" } for every log from this component.
  • Using config or GRPC to toggle log levels for a specific component. This wasn't too straight forward with zap but I've done it before, this way if a developer debugs a specific component and wants verbose logs he can just turn on the DEBUG level for this specific component instead of flooding the test/prod env with debug logs from all the components.

Also I agree that relaying only on logs is an issue, tracing and metrics might give much more reliable feedback, but tracing is hard adding to an existing big project.

let me know what you think, I can help/give examples.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

No branches or pull requests

2 participants