Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Refactor datasqueeze code structure and namespace as well as add publishing to maven central #4

Merged
merged 6 commits into from
Nov 27, 2018

Conversation

michaelzhu2003
Copy link
Contributor

No description provided.

@yashrajrs
Copy link
Contributor

Please update the readme to include all the changes. From what I see there are more changes than just publishing to maven central

@yashrajrs
Copy link
Contributor

Can we add the DataSqueeze logo commit from our internal code base for readMe as a part of this pull request?

pom.xml Outdated
<repository>
<id>central</id>
<name>public-maven-virtual</name>
<url>https://artylab.expedia.biz:443/public-maven-virtual</url>
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is internal to Expedia.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch. Thanks.

@yashrajrs
Copy link
Contributor

The branch fails to build locally

@michaelzhu2003
Copy link
Contributor Author

It builds on my local from both IntelliJ and maven command line.
What is the failure on your local?

README.md Outdated

## Documentation
This README is intended to provide detailed technical documentation for advanced users.

## Changes since last release

* Edided the pom file for publishing this project's artifacts to the Maven Central
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Type - Edited

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks so much.

README.md Outdated

1. CLI -
a. For TEXT/ORC/SEQ
```java
hadoop jar dataSqueeze-manager-1.0-SNAPSHOT.jar com.expedia.edw.data.squeeze.Utility
hadoop jar Datasqueeze.jar com.expedia.dsp.data.squeeze.Utility
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

typo - "datasqueeze"

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great catch. Thanks so much.

README.md Outdated
-sp s3a://edwprod/user/ysontakke/compactiontest1/ -tp s3a://edwprod/user/ysontakke/compactionoutput_text_yash_1/
-threshold 12345
```

b. For AVRO
```java
hadoop jar dataSqueeze-manager-1.0-SNAPSHOT.jar com.expedia.edw.data.squeeze.Utility
hadoop jar Datasqueeze.jar com.expedia.dsp.data.squeeze.Utility
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

typo - "datasqueeze"

README.md Outdated
@@ -87,22 +95,22 @@ There are two different ways of running DataSqueeze:
* fileType - Type of file to be compacted (AVRO / TEXT / SEQ / ORC). It is mandatory for AVRO
* schemaPath - schema used for compaction (mandatory for AVRO)

2. API - [CompactionManager](dataSqueeze-manager/src/main/java/com/expedia/edw/data/squeeze/CompactionManager.java)
2. API - [CompactionManager](Datasqueeze/src/main/java/com/expedia/dsp/data/squeeze/CompactionManager.java)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

typo - "datasqueeze"

pom.xml Outdated
<apache.commons.io.version>1.3.2</apache.commons.io.version>
<aws.java.sdk.version>1.11.98</aws.java.sdk.version>

<commons-cli.version>1.2</commons-cli.version>
<commons.io.version>2.4</commons.io.version>
<commons.lang3.version>3.0</commons.lang3.version>

<hadoop.version>2.7.4</hadoop.version>
<!-- Use 2.7.4 for github-->
<hadoop.version>2.7.1.2.4.3.0-227</hadoop.version>
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please revert this change

pom.xml Outdated

<version>2.0-SNAPSHOT</version>
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This needs to move right below artifactId

baumandm
baumandm previously approved these changes Nov 27, 2018
README.md Outdated


a. Fetch the file paths to be compacted from the source path provided.
b. Perform mapreduce job using the following configuration
1. Mapper maps records together based on same parent directory and emits parent directory as key.
2. Reducer reduces records based on same key but writes data to the target directory provided by the user, retaining
the directory structure.
2. Reducer reduces records based on same key but writes data to the target directory provided by the user, retaining the directory structure.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Format looks off

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the format looks better now than before?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would prefer not adding this in the commit range if the text is not updated.
Essentially, keep the master copy for this as is

@yashrajrs
Copy link
Contributor

Update Pull request description. Looks good otherwise

@yashrajrs
Copy link
Contributor

can you also update readMe to say where the jars can be found

@baumandm
Copy link
Contributor

can you also update readMe to say where the jars can be found

This will be easier to update after we actually publish the 2.0 release. So I would suggest we wait on this.

@michaelzhu2003 michaelzhu2003 changed the title Refactor datasqueeze for publishing to maven central. Refactor datasqueeze code structure and namespace as well as add publishing to maven central Nov 27, 2018
Copy link
Contributor

@yashrajrs yashrajrs left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@yashrajrs yashrajrs merged commit 42d886a into master Nov 27, 2018
@yashrajrs yashrajrs deleted the feature/RefactForPublishToMavenCentral branch November 27, 2018 22:02
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants