ExpediaGroup/housekeeping

Housekeeping

Start using

You can obtain Housekeeping from Maven Central.

Overview

A database-backed module that stores orphaned paths in a table for later clean up.

Housekeeping Configuration

The housekeeping module defaults to using the H2 Database Engine; however, it can be configured to use any flavour of SQL database that is supported by JDBC, Spring Boot and Hibernate. Prefer a database that is not in-memory when spinning up short-lived instances that run jobs and are then torn down. This ensures that the orphaned data is stored in a persistent database and is still considered for housekeeping even if the cluster ceases to exist.

Database Connectors

In order to connect to your SQL database, you must place a database connector jar that is compatible with your database onto your application's classpath.
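A quick way to confirm that the connector jar really is on the classpath is to try loading the driver class by name. The following is a minimal sketch using only the JDK; the class name `DriverCheck` and its method are hypothetical helpers for illustration and are not part of Housekeeping.

```java
final class DriverCheck {

  /** Returns true if the named class can be located on the current classpath. */
  static boolean isOnClasspath(String className) {
    try {
      // initialize=false: only locate the class, do not run its static initialisers.
      Class.forName(className, false, DriverCheck.class.getClassLoader());
      return true;
    } catch (ClassNotFoundException | LinkageError e) {
      return false;
    }
  }

  public static void main(String[] args) {
    // Defaults to the MariaDB driver class name; pass another name as the first argument.
    String driver = args.length > 0 ? args[0] : "org.mariadb.jdbc.Driver";
    System.out.println(driver + (isOnClasspath(driver) ? " found" : " NOT found on classpath"));
  }
}
```

Running this with the same classpath as your application shows whether the driver named in `housekeeping.data-source.driver-class-name` can be resolved.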

Spring YAML Housekeeping Configuration

If your project utilises Spring YAML you can define your Housekeeping configuration within the YAML. For example, Housekeeping can be set up to use a MariaDB schema:

housekeeping:
  # Name of the schema/database to use - defaults to housekeeping
  schema-name: my_db
  # Connection details
  data-source:
    # The name of your JDBC Driver class
    driver-class-name: org.mariadb.jdbc.Driver
    # JDBC URL for your database
    url: jdbc:mariadb://foo1baz123.us-east-1.rds.amazonaws.com:3306/${housekeeping.schema-name}
    # Database username
    username: bdp
    # Database password
    password: Ch4ll3ng3

Notes:

  • Some users have seen issues using the MySQL driver on AWS Aurora for MySQL. AWS recommends using the MariaDB driver for those use cases.
  • To use MariaDB, MySQL and similar database systems, the schema specified in the configuration needs to exist (i.e. the value for housekeeping.data-source.url needs to be a valid URI).
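The second note can be checked before starting the application. The sketch below, using only the JDK, strips the `jdbc:` prefix, parses the remainder as a URI and extracts the schema segment. `JdbcUrlCheck` and `schemaOf` are hypothetical names for illustration, and the approach assumes a URL of the `jdbc:<vendor>://host:port/<schema>` shape shown above.

```java
import java.net.URI;

final class JdbcUrlCheck {

  /** Extracts the schema (database) segment from a jdbc:<vendor>://host:port/<schema> URL. */
  static String schemaOf(String jdbcUrl) {
    if (!jdbcUrl.startsWith("jdbc:")) {
      throw new IllegalArgumentException("Not a JDBC URL: " + jdbcUrl);
    }
    // Strip the "jdbc:" prefix so that the remainder parses as a hierarchical URI.
    // URI.create throws IllegalArgumentException on malformed input,
    // for example an unresolved ${...} placeholder.
    URI uri = URI.create(jdbcUrl.substring("jdbc:".length()));
    String path = uri.getPath();
    // The path is "/<schema>"; an empty or missing path means no schema was given.
    return (path == null || path.length() <= 1) ? "" : path.substring(1);
  }
}
```

An empty result indicates that the URL does not name a schema, which is worth catching early when the database does not create schemas on demand.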

Housekeeping can also be set up to use the default database engine (H2) and schema:

housekeeping:
  schema-name: my_db
  db-init-script: classpath:/schema.sql
  data-source:
    username: bdp
    password: Ch4ll3ng3

If the schema does not already exist and the db-init-script is not in the default location (classpath:/schema.sql), then a custom path can be provided to initialise it, as shown in the following example:

housekeeping:
  schema-name: my_db
  db-init-script: file:///tmp/schema.sql
...

Where /tmp/schema.sql contains: CREATE SCHEMA IF NOT EXISTS my_db;
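If the init script is generated as part of a deployment step, it could be written with a few lines of plain Java. This is only a sketch: `InitScriptWriter` and its methods are hypothetical, and the identifier check is an assumption made here to avoid writing arbitrary text into the SQL file.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

final class InitScriptWriter {

  /** Builds the single statement that creates the schema if it is missing. */
  static String initScriptFor(String schemaName) {
    // Accept simple identifiers only (an assumption of this sketch).
    if (!schemaName.matches("[A-Za-z_][A-Za-z0-9_]*")) {
      throw new IllegalArgumentException("Unexpected schema name: " + schemaName);
    }
    return "CREATE SCHEMA IF NOT EXISTS " + schemaName + ";";
  }

  /** Writes the script to target and returns a file: URI string for that path. */
  static String writeInitScript(Path target, String schemaName) {
    try {
      Files.write(target, initScriptFor(schemaName).getBytes(StandardCharsets.UTF_8));
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
    return target.toUri().toString();
  }
}
```

The string returned by `writeInitScript` is intended as a candidate value for `housekeeping.db-init-script`.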

Full list of configuration options:

| Property | Required | Description |
|---|---|---|
| housekeeping.expired-path-duration | No | Time To Live (TTL) of legacy replica paths in ISO 8601 format: only days, hours, minutes and seconds can be specified in the expression. |
| housekeeping.schema-name | Yes | Database schema name to use. Tables will be created or used (if already existing) using this schema. Default: 'housekeeping' |
| housekeeping.db-init-script | No | Database init script to use. |
| housekeeping.data-source.driver-class-name | No | Java classname of the database JDBC driver. |
| housekeeping.data-source.url | No | JDBC connection URL. |
| housekeeping.h2.database | No | If the housekeeping.data-source.url is not overridden then the default H2 database can be configured using this property which also controls where H2 will write its database files. Defaults to ${instance.home}/data/${instance.name}/${housekeeping.schema-name} (where instance.home, instance.name and housekeeping.schema-name can be configured separately for more fine-grained control). |
| housekeeping.data-source.username | No | Database user with access to schema. |
| housekeeping.data-source.password | No | Database user's password. |
| housekeeping.fetch-legacy-replica-path-page-size | No | Number of paths to fetch on each call to the database. Tune this if you run out of memory or if the query seems too slow. The higher the number, the more memory is required. Default: '500' |
| housekeeping.cleanup-threads | No | Number of threads used to cleanup the files. Default: '10' |
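To illustrate the format expected by housekeeping.expired-path-duration, the sketch below parses ISO 8601 durations with the JDK's `java.time.Duration`, which likewise accepts only days, hours, minutes and seconds. Whether Housekeeping itself parses the value with this class is an assumption, and the `isExpired` rule is a hypothetical reading of "Time To Live" rather than the project's actual logic.

```java
import java.time.Duration;
import java.time.Instant;

final class ExpiredPathDurationExample {

  /** Parses an ISO 8601 duration made up of days, hours, minutes and seconds. */
  static Duration parseTtl(String iso8601) {
    // Duration.parse rejects month and year designators such as "P1M" or "P1Y"
    // by throwing java.time.format.DateTimeParseException.
    return Duration.parse(iso8601);
  }

  /** Assumed rule: a path is expired once creation time plus TTL lies before now. */
  static boolean isExpired(Instant creationTime, Duration ttl, Instant now) {
    return creationTime.plus(ttl).isBefore(now);
  }

  public static void main(String[] args) {
    System.out.println(parseTtl("P3D").toHours());        // three days expressed in hours
    System.out.println(parseTtl("PT12H30M").toMinutes()); // twelve and a half hours in minutes
  }
}
```

Values such as `P3D` or `PT12H30M` are therefore valid, whereas expressions that use months or years are not.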

You can override Spring Boot (HikariCP/Hibernate) settings in the YAML by providing the relevant properties. Housekeeping defaults are added with lower precedence. For example, to override the default connection pool maximum active size to 5 add this:

spring.datasource.max-active: 5

Refer to the Spring Boot documentation for a full list of properties that can be set.

Programmatic Housekeeping Configuration

Housekeeping allows you to configure your housekeeping job in a more fine-grained manner by providing a certain set of Spring beans in your application.

You can configure your housekeeping data source in code by defining the bean DataSource housekeepingDataSource(...). For example:

@Bean(destroyMethod = "close")
DataSource housekeepingDataSource(
    String driverClassName,
    String jdbcUrl,
    String username,
    String encryptedPassword) {
  return DataSourceBuilder
      .create()
      .driverClassName(driverClassName)
      .url(jdbcUrl)
      .username(username)
      .password(encryptedPassword)
      .build();
}

Housekeeping comes with a default HousekeepingService implementation; however, you can choose to provide your own. To run housekeeping you must provide a HousekeepingService bean that returns either the default FileSystemHousekeepingService or a custom implementation of the HousekeepingService interface:

@Bean
HousekeepingService housekeepingService(
    LegacyReplicaPathRepository legacyReplicaPathRepository, Housekeeping housekeeping) {
  return new FileSystemHousekeepingService(
      legacyReplicaPathRepository,
      new org.apache.hadoop.conf.Configuration(),
      housekeeping.getFetchLegacyReplicaPathPageSize());
}

The default provided housekeeping implementation creates a database named housekeeping and a table named legacy_replica_path to store the housekeeping data. To enable this database you must provide a schema.sql file which contains any SQL code that must be run to initialise your database upon application startup. This is particularly important if running Housekeeping in your application for the first time.

An example schema.sql file for use with the default housekeeping entity configuration is given below:

CREATE SCHEMA IF NOT EXISTS housekeeping;

Applications which leverage housekeeping support can define their own schema and table within which housekeeping data is to be stored. This can be achieved by following the steps below.

You must create your database initialisation schema.sql script and either add it to your classpath, provide it as a resource in your application or configure the path to it via the YAML configuration property housekeeping.db-init-script. The simplest schema.sql initialisation script will create your schema if it does not exist.

CREATE SCHEMA IF NOT EXISTS my_custom_schema;

The database name must be configured in the YAML property housekeeping.schema-name.

Whether you are using a custom housekeeping configuration, or the defaults, your application must provide two crucial annotations which will load the Entities and CrudRepositories that you require. These are the @EntityScan and @EnableJpaRepositories annotations. These annotations are best demonstrated in an example:

//The class annotated with `@Entity` that defines the required LegacyReplicaPath implementation
@EntityScan(basePackageClasses = HousekeepingLegacyReplicaPath.class)
//The class which extends LegacyReplicaPathRepository and contains your desired `CrudRepository` implementation
@EnableJpaRepositories(basePackageClasses = HousekeepingLegacyReplicaPathRepository.class)

Customising the Housekeeping table name

By default Housekeeping will create a legacy_replica_path table in the specified schema. If you need to customise the table name you can do this by extending the base classes and configuring the JPA annotations as desired. The class which extends EntityLegacyReplicaPath must be annotated with both the @Entity and @Table annotations. The example below provides the basis for creating a schema named my_custom_schema in your database, and a table named my_custom_replica_path within that schema.

@Entity
@Table(schema = "my_custom_schema", name = "my_custom_replica_path",
       uniqueConstraints = @UniqueConstraint(columnNames = { "path", "creation_timestamp" }))
public class MyJobsLegacyReplicaPath extends EntityLegacyReplicaPath {
  //required inherited constructors etc. go here
}

To accompany the custom EntityLegacyReplicaPath implementation you need to extend the LegacyReplicaPathRepository interface providing the custom EntityLegacyReplicaPath implementation as a generic type argument. This simplifies the creation of a CrudRepository for your EntityLegacyReplicaPath. For example:

public interface MyJobLegacyReplicaPathRepository
    extends LegacyReplicaPathRepository<MyJobsLegacyReplicaPath> {
}

Password Encryption

Housekeeping allows you to provide encrypted passwords in your configuration or programs. The Housekeeping project depends on the jasypt library that can be used to generate encrypted passwords which in turn can be decrypted by Spring Boot's jasypt support.

An encrypted password can be generated by doing the following:

java -cp jasypt-1.9.2.jar org.jasypt.intf.cli.JasyptPBEStringEncryptionCLI input="Ch4ll3ng3" password=db_password algorithm=PBEWithMD5AndDES

----ENVIRONMENT-----------------

Runtime: Oracle Corporation OpenJDK 64-Bit Server VM 25.121-b13


----ARGUMENTS-------------------

algorithm: PBEWithMD5AndDES
input: Ch4ll3ng3
password: db_password


----OUTPUT----------------------

EHL/foiBKY2Ucy3oYmxdkFiXzWnOu7by

The 'input' is your database password. The 'password' is a password specified by you that is used to decrypt the data. The 'output' is your encrypted password. This encrypted password can then be used in the YAML configuration:

housekeeping:
  data-source:
    # The name of your JDBC Driver class
    driver-class-name: org.mariadb.jdbc.Driver
    # JDBC URL for your Database
    url: jdbc:mariadb://housekeeping.foo1baz123.us-east-1.rds.amazonaws.com:3306/housekeeping_db
    # Database Username
    username: bdp
    # Encrypted Database Password
    password: ENC(EHL/foiBKY2Ucy3oYmxdkFiXzWnOu7by)

Alternatively, encrypted values can be decrypted from dedicated properties files on your classpath:

@Configuration
@EncryptablePropertySources({@EncryptablePropertySource("classpath:encrypted.properties"), @EncryptablePropertySource("classpath:encrypted2.properties")})
public class MyApplication {
  ...
}

The encrypted.properties file would look something like this:

database.username=ENC(nrmZtkF7T0kjG/VodDvBw93Ct8EgjCA+)
database.password=ENC(EHL/foiBKY2Ucy3oYmxdkFiXzWnOu7by)

You can then access the decrypted username and password in your application by doing something akin to the following:

private @Autowired ConfigurableEnvironment env;

@Bean(destroyMethod = "close")
DataSource housekeepingDataSource(
    String driverClassName,
    String jdbcUrl) {

  String username = env.getProperty("database.username");
  String password = env.getProperty("database.password");

  return DataSourceBuilder
      .create()
      .driverClassName(driverClassName)
      .url(jdbcUrl)
      .username(username)
      .password(password)
      .build();
}

Or, by injecting the properties with @Value:

@Bean(destroyMethod = "close")
DataSource housekeepingDataSource(
    String driverClassName,
    String jdbcUrl,
    @Value("${database.username}") String username,
    @Value("${database.password}") String password) {
  return DataSourceBuilder
      .create()
      .driverClassName(driverClassName)
      .url(jdbcUrl)
      .username(username)
      .password(password)
      .build();
}

Finally, if you are using an encrypted password, when you run your application you must provide the application with your jasypt.encryptor.password.

There are a few approaches to doing so:

Run your application with the jasypt.encryptor.password parameter:

java -jar <your-jar> --jasypt.encryptor.password=db_password

Supply the password through JASYPT_ENCRYPTOR_PASSWORD, set as an environment variable or a system property, by adding the following to application.properties:

jasypt.encryptor.password=${JASYPT_ENCRYPTOR_PASSWORD:}

Or, in application.yml:

jasypt:
  encryptor:
    password: ${JASYPT_ENCRYPTOR_PASSWORD:}
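To make the roles of 'input' and 'password' in this section concrete, the following sketch performs a password-based encryption round trip with the JDK's own `PBEWithMD5AndDES` cipher. It is not jasypt's implementation: the salt size, iteration count and "salt followed by ciphertext, Base64 encoded" layout are assumptions of this sketch, so its output should not be expected to interoperate with jasypt. `PbeRoundTrip` is a hypothetical class name.

```java
import java.nio.charset.StandardCharsets;
import java.security.GeneralSecurityException;
import java.security.SecureRandom;
import java.util.Arrays;
import java.util.Base64;
import javax.crypto.Cipher;
import javax.crypto.SecretKey;
import javax.crypto.SecretKeyFactory;
import javax.crypto.spec.PBEKeySpec;
import javax.crypto.spec.PBEParameterSpec;

final class PbeRoundTrip {

  private static final String ALGORITHM = "PBEWithMD5AndDES";
  private static final int SALT_BYTES = 8;    // salt length required by this cipher
  private static final int ITERATIONS = 1000; // assumed value for this sketch

  /** Encrypts input with a key derived from password; returns Base64(salt + ciphertext). */
  static String encrypt(String input, String password) {
    byte[] salt = new byte[SALT_BYTES];
    new SecureRandom().nextBytes(salt); // fresh salt, so each call yields different output
    byte[] cipherText =
        run(Cipher.ENCRYPT_MODE, input.getBytes(StandardCharsets.UTF_8), password, salt);
    byte[] out = new byte[salt.length + cipherText.length];
    System.arraycopy(salt, 0, out, 0, salt.length);
    System.arraycopy(cipherText, 0, out, salt.length, cipherText.length);
    return Base64.getEncoder().encodeToString(out);
  }

  /** Reverses encrypt: splits off the salt, then decrypts the rest with the same password. */
  static String decrypt(String encoded, String password) {
    byte[] all = Base64.getDecoder().decode(encoded);
    byte[] salt = Arrays.copyOfRange(all, 0, SALT_BYTES);
    byte[] cipherText = Arrays.copyOfRange(all, SALT_BYTES, all.length);
    return new String(
        run(Cipher.DECRYPT_MODE, cipherText, password, salt), StandardCharsets.UTF_8);
  }

  private static byte[] run(int mode, byte[] data, String password, byte[] salt) {
    try {
      SecretKey key = SecretKeyFactory.getInstance(ALGORITHM)
          .generateSecret(new PBEKeySpec(password.toCharArray()));
      Cipher cipher = Cipher.getInstance(ALGORITHM);
      cipher.init(mode, key, new PBEParameterSpec(salt, ITERATIONS));
      return cipher.doFinal(data);
    } catch (GeneralSecurityException e) {
      throw new IllegalStateException("PBE operation failed", e);
    }
  }
}
```

The same password must be available at decryption time, which is why the application needs jasypt.encryptor.password at startup.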

Housekeeping Vacuum Tool

Housekeeping also provides a "Vacuum" tool that can be run against a Hive table to detect any orphaned data that is available for Housekeeping. For more information on running and configuring this refer to the housekeeping-vacuum-tool documentation.

Legal

This project is available under the Apache 2.0 License.

Copyright 2016-2019 Expedia Inc.