collector v.3 - log4j2 - log-file per crawler #790

Open
jetnet opened this issue Jun 17, 2022 · 5 comments
jetnet commented Jun 17, 2022

Is it possible to configure a log file per crawler, as it worked in v2? I tried the following config, but sd:type does not get resolved. Thanks

<Configuration status="INFO" name="Norconex HTTP Collector">
  <Appenders>
    <Console name="Console" target="SYSTEM_OUT">
      <PatternLayout>
        <pattern>%d{HH:mm:ss.SSS} [%t] %-5level %c{1} - %msg%n</pattern>
      </PatternLayout>
    </Console>
    <RollingFile name="RollingFile" fileName="${env:NC_LOGDIR}/latest/logs/${sd:type}.log"
                 filePattern="${env:NC_LOGDIR}/backup/logs/$${date:yyyy}/$${date:MM}/$${date:dd}/${sd:type}.log.%i.gz">
      <PatternLayout>
        <Pattern>%d %p %c{1.} [%t] %m%n</Pattern>
      </PatternLayout>
      <Policies>
        <OnStartupTriggeringPolicy />
      </Policies>
    </RollingFile>
  </Appenders>
  <!-- ... -->
</Configuration>
@jetnet changed the title from "collection v.3 - log4j2 - log-file per crawler" to "collector v.3 - log4j2 - log-file per crawler" on Jun 17, 2022
@essiembre (Contributor)

As you found out, in v3 the code base no longer controls log writing, so people can implement logging however they want with their favourite logger implementation. For Log4j2, your approach looks conceptually good, but I believe variables are replaced before the RollingFile is created. To get around that, I think you have to use Routing, likely combined with filters. By default, the crawler id is printed with each log line (except for logging that is not specific to a crawler), so you can use a mix of filters and regular expressions to route entries.

Still, you may have a hard time getting it to work: while the crawler id is available during Log4j2 pattern-layout resolution, I am not sure it is available for routing variable substitution. For the latter, you may need to rely on Log4j2 MDC (Mapped Diagnostic Context). Unfortunately, the crawler ids are currently not set in the logger thread context; that has to be done explicitly in code.

SLF4J (the logging abstraction framework used) supports MDC and will pass it to supporting logging implementations. For that reason, I am marking this as a feature request since I think we could take advantage of that. I'd like to make use of MDC in the code base to simplify routing to different files without the need for regular expressions or filters.

Once implemented, I'll share a Log4j2 configuration sample here.

@essiembre (Contributor)

Just to add, since the crawler name appears in the thread name, you can use the following variable in your routing:

${event:ThreadName}

See https://logging.apache.org/log4j/log4j-2.15.1/manual/lookups.html#EventLookup
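An untested sketch of what routing on that lookup could look like, assuming thread names are safe to use as file names and with placeholder paths (not an official example from this project):

```xml
<!-- Sketch only: route log events to one file per thread name.
     Assumes the thread name is usable as a file name;
     /path/to/my/logs is a placeholder. -->
<Routing name="RoutingByThread">
  <Routes pattern="$${event:ThreadName}">
    <Route>
      <File name="File-${event:ThreadName}"
            fileName="/path/to/my/logs/${event:ThreadName}.log">
        <PatternLayout pattern="%d %p %c{1.} [%t] %m%n"/>
      </File>
    </Route>
  </Routes>
  <IdlePurgePolicy timeToLive="15" timeUnit="minutes"/>
</Routing>
```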

jetnet commented Jun 29, 2022

Thanks Pascal! I had no luck with Routing and ended up with a workaround using an environment variable, which is set to the site name in collector-http.sh. I'll try event:ThreadName and let you know if that works.
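For reference, that environment-variable workaround might look something like this sketch, where SITE_NAME is a hypothetical variable exported by collector-http.sh before launching:

```xml
<!-- Workaround sketch: SITE_NAME is a hypothetical environment
     variable exported by collector-http.sh before launch. -->
<RollingFile name="RollingFile"
             fileName="${env:NC_LOGDIR}/latest/logs/${env:SITE_NAME}.log"
             filePattern="${env:NC_LOGDIR}/backup/logs/${env:SITE_NAME}.log.%i.gz">
  <PatternLayout>
    <Pattern>%d %p %c{1.} [%t] %m%n</Pattern>
  </PatternLayout>
  <Policies>
    <OnStartupTriggeringPolicy />
  </Policies>
</RollingFile>
```

Environment lookups are resolved when the configuration is loaded, which is why this works where per-event lookups like sd:type do not.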

@essiembre (Contributor)

Wouldn't that give you only one log file per collector, as opposed to one per crawler? I thought you were trying to get one log per crawler in cases where you have multiple crawlers defined in a single collector config. If you just want one log per collector, that is already the case: you only need to change the Console appender to a file-based one (which you can parameterize as you did). Alternatively, you can keep the default logging (to STDOUT) and redirect the command-line output to a file when you launch the script.
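Swapping the Console appender for a file-based one could look like this sketch (the path is a placeholder; parameterize it with lookups such as ${env:NC_LOGDIR} as needed):

```xml
<!-- Sketch: one log file per collector, replacing the
     Console appender. Path is a placeholder. -->
<File name="File" fileName="/path/to/my/logs/collector.log">
  <PatternLayout>
    <pattern>%d{HH:mm:ss.SSS} [%t] %-5level %c{1} - %msg%n</pattern>
  </PatternLayout>
</File>
```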

@essiembre (Contributor)

I just made a snapshot release that adds a few attributes to the logging context. They are:

  • crawler.id → the crawler id, as configured.
  • crawler.id.safe → the crawler id encoded to be safe to use as a file name on any file system.
  • collector.id → the collector id, as configured.
  • collector.id.safe → the collector id encoded to be safe to use as a file name on any file system.
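
Since these attributes live in the logging context (SLF4J MDC), they should also be usable directly in a pattern layout via %X, in addition to the ${ctx:...} lookups shown below. A sketch, assuming the snapshot release described above:

```xml
<!-- Sketch: prefix each log line with the crawler id taken from
     the logging context (MDC). Assumes the snapshot release that
     sets crawler.id in the context. -->
<PatternLayout>
  <Pattern>%d{HH:mm:ss.SSS} [%X{crawler.id}] %-5level %c{1} - %msg%n</Pattern>
</PatternLayout>
```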

Using Log4j2, the following produces one file per configured crawler, and any non-crawler-specific log entries go into a collector log.

<Configuration status="INFO" name="my-collector-logs">
  <Properties>
    <Property name="pattern">%d{HH:mm:ss.SSS} [%t] %-5level %c{1} - %msg%n</Property>
  </Properties>
  <Appenders>
    <Routing name="Routing">
      <Routes pattern="$${ctx:crawler.id.safe}">
        <Route>
          <RollingFile
              name="Crawler-${ctx:crawler.id.safe:-${ctx:collector.id.safe}}"
              fileName="/path/to/my/logs/${ctx:crawler.id.safe:-${ctx:collector.id.safe}}.log"
              filePattern="/path/to/my/logs/${ctx:crawler.id.safe:-${ctx:collector.id.safe}}-%d{yyyyMMdd-HHmm}.log.gz">
            <PatternLayout>
              <pattern>${pattern}</pattern>
            </PatternLayout>
            <SizeBasedTriggeringPolicy size="10 MB" />   
          </RollingFile>
        </Route>
      </Routes>
      <IdlePurgePolicy timeToLive="15" timeUnit="minutes"/>
    </Routing>    
  </Appenders>
  
  <Loggers>
    <Logger name="com.norconex.collector.http" level="INFO" additivity="false">
      <AppenderRef ref="Routing"/>
    </Logger>
    <Logger name="com.norconex.collector.core" level="INFO" additivity="false">
      <AppenderRef ref="Routing"/>
    </Logger>
    <Logger name="com.norconex.importer" level="INFO" additivity="false">
      <AppenderRef ref="Routing"/>
    </Logger>
    <Logger name="com.norconex.committer" level="INFO" additivity="false">
      <AppenderRef ref="Routing"/>
    </Logger>
    <Logger name="com.norconex.commons.lang" level="INFO" additivity="false">
      <AppenderRef ref="Routing"/>
    </Logger>
    <!-- ... -->

    <Root level="INFO">
      <AppenderRef ref="Routing"/>
    </Root>
  </Loggers>
</Configuration>  

Suppose a collector with id my-collector defines three crawlers with ids my-crawler-A, my-crawler-B, and my-crawler-C. You would then get the following files in the /path/to/my/logs/ folder:

  • my-collector.log
  • my-crawler-A.log
  • my-crawler-B.log
  • my-crawler-C.log
