This repository has been archived by the owner on Oct 12, 2021. It is now read-only.

README for Shunting Yard #35

Merged
merged 20 commits into master from readme-file
Apr 16, 2019

Conversation

abhimanyugupta07 (Member) commented Apr 12, 2019

First attempt at an exhaustive README for Shunting Yard.

@abhimanyugupta07 abhimanyugupta07 changed the title Exh README for Shunting Yard Apr 12, 2019
coveralls commented Apr 12, 2019

Pull Request Test Coverage Report for Build 397

  • 0 of 0 changed or added relevant lines in 0 files are covered.
  • No unchanged relevant lines lost coverage.
  • Overall coverage remained the same at 78.632%

Totals Coverage Status
Change from base Build 363: 0.0%
Covered Lines: 736
Relevant Lines: 936

💛 - Coveralls

README.md Outdated
@@ -1,61 +1,147 @@
# Shunting Yard

A Spring Boot app that reads serialized Hive MetaStore Events and builds a YAML file with the information provided in the event which is then passed to [Circus Train](https://github.com/HotelsDotCom/circus-train) to perform the replication.
Shunting Yard reads serialized Hive MetaStore Events from a queue (currently supports [AWS SQS](https://aws.amazon.com/sqs/)) and replicates the data between two Data lakes. It does this by building a YAML file with the information provided in the event which is then passed to [Circus Train](https://github.com/HotelsDotCom/circus-train) to perform the replication.
Contributor

"between two Data lakes": does "data" have to be Uppercase?

Member Author

done.

README.md Outdated
hive.metastore.event.listeners = com.hotels.shunting.yard.event.emitter.sqs.listener.SqsMetaStoreEventListener
com.hotels.shunting.yard.event.emitter.sqs.queue = https://sqs.<region>.amazonaws.com/<account-id>/<topic-name>-queue.fifo
com.hotels.shunting.yard.event.emitter.sqs.group.id = <group-id>
Note that the paths above are correct as of when this document was last updated but may differ across EMR versions, refer to the [EMR release guide](http://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-release-components.html) for more up to date information if necessary.
Contributor

There should be a full stop instead of the comma, as these are two separate sentences.

Member Author

done.

README.md Outdated
database-name: replica_database
table-name: test_table_1

#### Only Change the target database but the table name remains same as source
Contributor

"change" with lowercase "c"

Contributor

And maybe it would be good to rephrase this as "Change only the target database while keeping the same source table name" to keep the same tone of the writing (as an instruction)

Member Author

done.

Contributor

what about the second half?

Contributor

We talked about this and I'm ok with what's in the document now :)

README.md Outdated
replica-table:
database-name: replica_database

#### Only Change the target table name but the database remains same as source
Contributor

Same thing here, but the database and table switched around :)

Member Author

done.

README.md Outdated
|`source-catalog.hive-metastore-uris`|No|Fully qualified URI of the source cluster's Hive metastore Thrift service.|
|`replica-catalog.name`|Yes|A name for the replica catalog for events and logging.|
|`replica-catalog.hive-metastore-uris`|Yes|Fully qualified URI of the replica cluster's Hive metastore Thrift service.|
|`event-receiver.configuration-properties.com.hotels.shunting.yard.event.receiver.sqs.queue`|Yes|Fully qualified URI of the [AWS SQS](https://aws.amazon.com/sqs/) Queue to read the hive events from.|
Contributor

Uppercase H on "hive"

Member Author

done.

README.md Outdated
|`replica-catalog.name`|Yes|A name for the replica catalog for events and logging.|
|`replica-catalog.hive-metastore-uris`|Yes|Fully qualified URI of the replica cluster's Hive metastore Thrift service.|
|`event-receiver.configuration-properties.com.hotels.shunting.yard.event.receiver.sqs.queue`|Yes|Fully qualified URI of the [AWS SQS](https://aws.amazon.com/sqs/) Queue to read the hive events from.|
|`event-receiver.configuration-properties.com.hotels.shunting.yard.event.receiver.sqs.wait.time.seconds`|No|Wait time in seconds for which the receiver will poll the SQS queue for a batch of messages. Default is 10 seconds. Read more about long polling with AWS SQS [here](https://docs.aws.amazon.com/AWSSimpleQueueService/latest/SQSDeveloperGuide/sqs-long-polling.html)|
Contributor

Full stop at the end 🙂

Member Author

done.

README.md Outdated
|`replica-catalog.hive-metastore-uris`|Yes|Fully qualified URI of the replica cluster's Hive metastore Thrift service.|
|`event-receiver.configuration-properties.com.hotels.shunting.yard.event.receiver.sqs.queue`|Yes|Fully qualified URI of the [AWS SQS](https://aws.amazon.com/sqs/) Queue to read the hive events from.|
|`event-receiver.configuration-properties.com.hotels.shunting.yard.event.receiver.sqs.wait.time.seconds`|No|Wait time in seconds for which the receiver will poll the SQS queue for a batch of messages. Default is 10 seconds. Read more about long polling with AWS SQS [here](https://docs.aws.amazon.com/AWSSimpleQueueService/latest/SQSDeveloperGuide/sqs-long-polling.html)|
|`source-table-filter.table-names`|No|A list of tables selected for Shunting Yard replication. Supported Format:`database_1.table_1, database_2.table_2`|
Contributor

space after colon:
"format: database_1.table_1, database_2.table_2"

and lowercase f on "Format"

Member Author

done.

Contributor

There's still no space after colon

Member Author

It is there now :)

README.md Outdated
|`table-replications[n].replica-table.database-name`|No|The name of the destination database in which to replicate the table. Defaults to source database name.|
|`table-replications[n].replica-table.table-name`|No|The name of the table at the destination. Defaults to source table name.|

### Configuring Graphite Metrics
Contributor

I think "Metrics" should be with lowercase m. "System architecture" has 2 hashes (compared to 3 here) and doesn't have uppercase for each letter in the subtitle, so this shouldn't either.

Member Author

done.

Contributor

Graphite can stay with uppercase, because it's the name of the tool, right?

Member Author

Yeah

Contributor

@AnanaMJ AnanaMJ left a comment

I definitely wasn't looking for a missing full stop on purpose :)

@abhimanyugupta07
Member Author

> I definitely wasn't looking for a missing full stop on purpose :)

I think you were 🙈

Contributor

@massdosage massdosage left a comment

Good skeleton, just needs some fleshing out.

README.md Outdated
@@ -1,6 +1,6 @@
# Shunting Yard

Shunting Yard reads serialized Hive MetaStore Events from a queue (currently supports [AWS SQS](https://aws.amazon.com/sqs/)) and replicates the data between two Data lakes. It does this by building a YAML file with the information provided in the event which is then passed to [Circus Train](https://github.com/HotelsDotCom/circus-train) to perform the replication.
Shunting Yard reads serialized Hive MetaStore Events from a queue (currently supports [AWS SQS](https://aws.amazon.com/sqs/)) and replicates the data between two data lakes. It does this by building a YAML file with the information provided in the event which is then passed to [Circus Train](https://github.com/HotelsDotCom/circus-train) to perform the replication.
Contributor

currently supports AWS SQS -> AWS SQS is currently supported

README.md Outdated

You can obtain Shunting Yard from Maven Central:
1. Download the version to use from [Maven Central](https://mvnrepository.com/artifact/com.hotels/shunting-yard-binary) and uncompress it in a directory of your choosing.
Contributor

It's a bit unclear here which file one should actually download; for a new user we really need to spell it out. I know this is what it says in the Circus Train docs, but that could also be improved (once we get the wording sorted out here, let's do the same for Circus Train). I think Waggle Dance is a better example of this: https://github.com/HotelsDotCom/waggle-dance/#install

README.md Outdated

[![Maven Central](https://maven-badges.herokuapp.com/maven-central/com.hotels/shunting-yard/badge.svg?subject=com.hotels:shunting-yard)](https://maven-badges.herokuapp.com/maven-central/com.hotels/shunting-yard) [![Build Status](https://travis-ci.org/HotelsDotCom/shunting-yard.svg?branch=master)](https://travis-ci.org/HotelsDotCom/shunting-yard) [![Coverage Status](https://coveralls.io/repos/github/HotelsDotCom/shunting-yard/badge.svg?branch=master)](https://coveralls.io/github/HotelsDotCom/shunting-yard?branch=master) ![GitHub license](https://img.shields.io/github/license/HotelsDotCom/shunting-yard.svg)
2. Download and install the latest version of [Circus Train](http://mvnrepository.com/artifact/com.hotels/circus-train/) and set the `CIRCUS_TRAIN_HOME` environment variable:
Contributor

Once we sort out the "Install" section for Circus Train let's rather link to the section in its README instead of duplicating links to Maven repos etc. here.

README.md Outdated
[![Maven Central](https://maven-badges.herokuapp.com/maven-central/com.hotels/shunting-yard/badge.svg?subject=com.hotels:shunting-yard)](https://maven-badges.herokuapp.com/maven-central/com.hotels/shunting-yard) [![Build Status](https://travis-ci.org/HotelsDotCom/shunting-yard.svg?branch=master)](https://travis-ci.org/HotelsDotCom/shunting-yard) [![Coverage Status](https://coveralls.io/repos/github/HotelsDotCom/shunting-yard/badge.svg?branch=master)](https://coveralls.io/github/HotelsDotCom/shunting-yard?branch=master) ![GitHub license](https://img.shields.io/github/license/HotelsDotCom/shunting-yard.svg)
2. Download and install the latest version of [Circus Train](http://mvnrepository.com/artifact/com.hotels/circus-train/) and set the `CIRCUS_TRAIN_HOME` environment variable:

export CIRCUS_TRAIN_HOME=/home/hadoop/circus-train-<circus-train-version>
Contributor

Should we use /opt as a base instead of /home/hadoop in the examples?

README.md Outdated

On the target cluster, download and install the latests version of Circus Train and set the `CIRCUS_TRAIN_HOME` environment variable:
The YAML fragment below shows some common options for setting up the base source (where data is coming from), replica (where data is going to) and the SQS queue to read hive events from.
Contributor

Hive

README.md Outdated
|Property|Required|Description|
|:----|:----:|:----|
|`source-catalog.name`|Yes|A name for the source catalog for events and logging.|
|`source-catalog.hive-metastore-uris`|No|Fully qualified URI of the source cluster's Hive metastore Thrift service.|
Contributor

What happens if you don't provide this? We should probably explain...

Contributor

This ^

Member Author

oh. That is mandatory. Changed now.

README.md Outdated
|`replica-catalog.hive-metastore-uris`|Yes|Fully qualified URI of the replica cluster's Hive metastore Thrift service.|
|`event-receiver.configuration-properties.com.hotels.shunting.yard.event.receiver.sqs.queue`|Yes|Fully qualified URI of the [AWS SQS](https://aws.amazon.com/sqs/) Queue to read the Hive events from.|
|`event-receiver.configuration-properties.com.hotels.shunting.yard.event.receiver.sqs.wait.time.seconds`|No|Wait time in seconds for which the receiver will poll the SQS queue for a batch of messages. Default is 10 seconds. Read more about long polling with AWS SQS [here](https://docs.aws.amazon.com/AWSSimpleQueueService/latest/SQSDeveloperGuide/sqs-long-polling.html).|
|`source-table-filter.table-names`|No|A list of tables selected for Shunting Yard replication. Supported format:`database_1.table_1, database_2.table_2`|
Contributor

For all the values which are optional we should explain what the behaviour is if they're not set to anything.
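
To illustrate the required versus optional properties listed in the table, here is a minimal sketch of a base configuration. The property names come from the table; the nesting is an assumption derived from the dotted property paths, and every value (catalog names, Thrift hosts, queue URL, table names) is a placeholder.

```yaml
# Sketch only: property names are from the table, all values are hypothetical.
source-catalog:
  name: source-cluster                                              # required
  hive-metastore-uris: thrift://source-metastore.example.com:9083   # required
replica-catalog:
  name: replica-cluster                                             # required
  hive-metastore-uris: thrift://replica-metastore.example.com:9083  # required
event-receiver:
  configuration-properties:
    # Required: the SQS queue to read the Hive events from.
    com.hotels.shunting.yard.event.receiver.sqs.queue: https://sqs.us-west-2.amazonaws.com/123456789012/example-queue.fifo
    # Optional: the table states a default of 10 seconds when this is omitted.
    com.hotels.shunting.yard.event.receiver.sqs.wait.time.seconds: 15
source-table-filter:
  # Optional: the behaviour when omitted is not described in this thread.
  table-names: database_1.table_1, database_2.table_2
```

Annotating each optional key with its default in this way is one option for addressing the request above.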

README.md Outdated
|`source-table-filter.table-names`|No|A list of tables selected for Shunting Yard replication. Supported format:`database_1.table_1, database_2.table_2`|
|`table-replications[n].source-table.database-name`|No|The name of the database in which the table you wish to replicate is located.|
|`table-replications[n].source-table.table-name`|No|The name of the table which you wish to replicate.|
|`table-replications[n].replica-table.database-name`|No|The name of the destination database in which to replicate the table. Defaults to source database name.|
Contributor

Yes, here and the next one we explain what happens if not set.

README.md Outdated
|`table-replications[n].replica-table.database-name`|No|The name of the destination database in which to replicate the table. Defaults to source database name.|
|`table-replications[n].replica-table.table-name`|No|The name of the table at the destination. Defaults to source table name.|

### Configuring Graphite metrics

Graphite configurations can be passed to Shunting Yard using an optional `--ct-config` argument which takes a YAML file and passes it directly to internal Circus Train instance. Refer to the [Circus Train README](https://github.com/HotelsDotCom/circus-train#graphite) for more details.
Contributor

Make clear this is a different YAML configuration file to the one described above.
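
To make the distinction concrete: the file passed via `--ct-config` is a separate Circus Train configuration file (the README's sample is named `ct-config.yml`), not the Shunting Yard configuration described earlier. A minimal sketch follows. The `graphite` key names are assumed from the Graphite section of the Circus Train README and should be checked against it; the host, prefix and namespace values are placeholders.

```yaml
# ct-config.yml: passed with --ct-config and handed to the internal Circus Train instance.
# Key names are an assumption based on the Circus Train README; values are hypothetical.
graphite:
  host: graphite.example.com:2003
  prefix: dev
  namespace: com.example.shuntingyard
```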

README.md Outdated
@@ -1,61 +1,191 @@
# Shunting Yard

A Spring Boot app that reads serialized Hive MetaStore Events and builds a YAML file with the information provided in the event which is then passed to [Circus Train](https://github.com/HotelsDotCom/circus-train) to perform the replication.
Shunting Yard reads serialized Hive MetaStore Events from a queue ([AWS SQS](https://aws.amazon.com/sqs/) is currently supported) and replicates the data between two data lakes. It does this by building a YAML file with the information provided in the event which is then passed to [Circus Train](https://github.com/HotelsDotCom/circus-train) to perform the replication.
Contributor

MetaStore → Metastore

README.md Outdated

export HIVE_LIB=/usr/lib/hive/lib/
export HCAT_LIB=/usr/lib/hive-hcatalog/share/hcatalog/
Note that the paths above are correct as of when this document was last updated but may differ across EMR versions. Refer to the [EMR release guide](http://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-release-components.html) for more up to date information if necessary.
Contributor

up to date information → up-to-date information

Up to date is without hyphens when it's used after the noun it describes: "The information is up to date."
When it's before the noun, it's with hyphens: "This is up-to-date information."
source: https://dictionary.cambridge.org/dictionary/english/up-to-date

Member Author

.... closes laptop and goes home ;)

README.md Outdated
#### Sample ct-config.yml:
### Specifying target database & table names

Shunting Yard will by default replicate the data into the replica data lake with same replica database name and table name as the source. Sometimes a user might need to change the replica database name or table name or both. The YAML fragments below shows some common options for specifying the replica database and table name for the selected tables.
Contributor

with same replica database name and table name → with the same replica database name and table name

Contributor

If you want to not have that many "the", you can have
"Shunting Yard will by default replicate data into a replica data lake with the same replica database name and table name as the source."
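
The three cases this part of the README covers can be sketched as one `table-replications` list. The property names are taken from the configuration table, and the defaulting behaviour is as that table describes it; the database and table names are hypothetical.

```yaml
# Sketch only: database and table names are hypothetical.
table-replications:
  # Change both the replica database and the replica table name.
  - source-table:
      database-name: source_database
      table-name: test_table_1
    replica-table:
      database-name: replica_database
      table-name: test_table_1_replica
  # Change only the replica database; table-name is omitted,
  # so it defaults to the source table name.
  - source-table:
      database-name: source_database
      table-name: test_table_2
    replica-table:
      database-name: replica_database
  # Change only the replica table name; database-name is omitted,
  # so it defaults to the source database name.
  - source-table:
      database-name: source_database
      table-name: test_table_3
    replica-table:
      table-name: test_table_3_replica
```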

README.md Outdated

#### Change only the replica database but the table name remains same as source

In this case, the replica table name is not provided in the `table-replications` and hence, it will be same as source table name.
Contributor

and hence, → and hence (no comma after "hence")
It's like saying "and therefore" or "and so"

@abhimanyugupta07 abhimanyugupta07 merged commit d4a2b2f into master Apr 16, 2019
@abhimanyugupta07 abhimanyugupta07 deleted the readme-file branch April 16, 2019 10:06
5 participants