Commit
This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.
Showing 49 changed files with 3,286 additions and 191 deletions.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,31 @@ | ||
Contributions require sign-off. Every patch or pull request must be signed off, which certifies the agreement given below. | ||
|
||
Contributor Agreement | ||
--------------------- | ||
|
||
By making a contribution to this project, I certify that: | ||
|
||
(a) The contribution was created in whole or in part by me and I have the right to submit it under the open source license indicated in the file; or | ||
|
||
(b) The contribution is based upon previous work that, to the best of my knowledge, is covered under an appropriate open source license and I have the right under that license to submit that work with modifications, whether created in whole or in part by me, under the same open source license (unless I am permitted to submit under a different license), as indicated in the file; or | ||
|
||
(c) The contribution was provided directly to me by some other person who certified (a), (b) or (c) and I have not modified it. | ||
|
||
(d) I understand and agree that this project and the contribution are public and that a record of the contribution (including all personal information I submit with it, including my sign-off) is maintained indefinitely and may be redistributed consistent with this project or the open source license(s) involved. | ||
|
||
(e) I also agree to the following terms and conditions: | ||
|
||
(1) Grant of Copyright License. Subject to the terms and conditions of this agreement, You hereby grant to the maintainer and to recipients of software distributed by the maintainer a perpetual, worldwide, non-exclusive, no-charge, royalty-free, irrevocable copyright license to reproduce, prepare derivative works of, publicly display, publicly perform, sublicense, and distribute your contributions and such derivative works. | ||
|
||
(2) Grant of Patent License. Subject to the terms and conditions of this agreement, You hereby grant to the maintainer and to recipients of software distributed by the maintainer a perpetual, worldwide, non-exclusive, no-charge, royalty-free, irrevocable (except as stated in this section) patent license to make, have made, use, offer to sell, sell, import, and otherwise transfer the work, where such license applies only to those patent claims licensable by you that are necessarily infringed by your contribution(s) alone or by combination of your contribution(s) with the work to which such contribution(s) was submitted. If any entity institutes patent litigation against you or any other entity (including a cross-claim or counterclaim in a lawsuit) alleging that your contribution, or the work to which you have contributed, constitutes direct or contributory patent infringement, then any patent licenses granted to that entity under this agreement for that contribution or work shall terminate as of the date such litigation is filed. | ||
|
||
Committing | ||
---------- | ||
|
||
Add a line stating | ||
|
||
Signed-off-by: Random J Developer <random@developer.example.org> | ||
|
||
When committing using the command line you can sign off using the --signoff or -s flag. This adds a Signed-off-by line by the committer at the end of the commit log message. | ||
|
||
git commit -s -m "Commit message" |
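A reviewer or CI step could check for the sign-off line described above. The following is a minimal sketch, not part of this commit: `has_signoff` is a hypothetical helper, and the pattern assumes the trailer has the `Signed-off-by: Name <email>` shape shown in the example line.

```python
import re

# Matches a trailer line such as:
#   Signed-off-by: Random J Developer <random@developer.example.org>
# The exact shape is an assumption based on the example in the agreement file.
SIGNOFF = re.compile(r"^Signed-off-by: .+ <[^<>\s]+@[^<>\s]+>$", re.MULTILINE)


def has_signoff(message):
    """Return True if the commit message contains a Signed-off-by line."""
    return SIGNOFF.search(message) is not None
```

The check only looks at the message text; it does not verify that the name and address belong to the committer.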
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -1,2 +1,69 @@ | ||
# MegaSparkDiff | ||
at scale data comparison tool that can compare pair combination of data sources | ||
<h1>MegaSparkDiff</h1> | ||
|
||
MegaSparkDiff is an open source tool that helps you compare any pair | ||
combination of data sets of the following types: | ||
|
||
HDFS, JDBC, S3, HBase, text files, and Hive. | ||
|
||
MegaSparkDiff can run on Amazon EMR (Elastic MapReduce), | ||
Amazon EC2 instances, and other cloud environments | ||
with compatible Spark distributions. | ||
|
||
How to Use from Within a Java or Scala Project | ||
---------------------------------------------- | ||
```xml | ||
<dependency> | ||
    <groupId>org.finra.megasparkdiff</groupId> | ||
    <artifactId>mega-spark-diff</artifactId> | ||
    <version>0.1</version> | ||
</dependency> | ||
``` | ||
|
||
SparkFactory | ||
------------ | ||
It parallelizes source/target data. | ||
|
||
The data sources can be in the following forms: | ||
Text File | ||
HDFS File | ||
SQL query over a JDBC data source | ||
Hive Table | ||
|
||
SparkCompare | ||
------------ | ||
Compares pair combinations of supported sources. | ||
When a schema-based source is compared to a non-schema-based source, the SparkCompare | ||
class attempts to flatten the schema-based source to delimited values before comparing. The delimiter | ||
can be specified when launching the compare job. | ||
|
||
How to use via shell script in EMR | ||
---------------------------------- | ||
A shell script named a3a.sh will wrap | ||
this Java/Scala project. The script will accept several parameters | ||
related to source definitions, output destination, and run | ||
configurations, as well as which two data sets to compare. | ||
|
||
The parameters are as follows: | ||
-ds=<data_source_folder>: The folder where the database | ||
connection parameters and data queries reside | ||
-od=<output_directory>: The directory where MegaSparkDiff will write | ||
its output | ||
-rc=<run_config_file_name>: The file that will be used to load | ||
any special run and Spark configurations. This parameter is | ||
optional | ||
To specify a data set to compare, pass in the name of one of the | ||
data queries found in a config file inside <data_source_folder> | ||
prepended by "--". The program will execute the queries assigned to | ||
the names passed into the command line, store them into tables, and | ||
perform the comparison. | ||
|
||
Example call: | ||
./msd.sh -ds=./data_sources/ -od=output --shraddha --carlos | ||
Additionally, users can add JDBC driver jar files to the | ||
classpath so that MegaSparkDiff can extract data from whichever | ||
database they choose. |
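The argument convention described in the README (`-ds=`, `-od=`, an optional `-rc=`, and data set names prefixed with `--`) can be sketched as follows. This is an illustration only, not the wrapper script's implementation: `parse_args` is a hypothetical helper, and the requirement of exactly two data set names is an assumption drawn from the phrase "which two data sets to compare".

```python
def parse_args(argv):
    """Split arguments into options (-ds/-od/-rc) and data set names (--name)."""
    options = {"ds": None, "od": None, "rc": None}  # rc is optional
    data_sets = []
    for arg in argv:
        if arg.startswith("--"):
            # "--shraddha" names a data query defined in the data source folder.
            data_sets.append(arg[2:])
        elif arg.startswith("-") and "=" in arg:
            key, value = arg[1:].split("=", 1)
            if key not in options:
                raise ValueError(f"unknown option: -{key}")
            options[key] = value
        else:
            raise ValueError(f"unrecognized argument: {arg}")
    if len(data_sets) != 2:
        # Assumption: a comparison always involves exactly two data sets.
        raise ValueError("exactly two data set names are required")
    return options, data_sets
```

Applied to the arguments of the example call, this yields the data source folder `./data_sources/`, the output directory `output`, no run configuration file, and the data set names `shraddha` and `carlos`.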
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,15 @@ | ||
(/\*|#|<!--)(| -->)$ | ||
( \*|#|<!--) Copyright 2014 DataGenerator Contributors(| -->)$ | ||
( \*|#|<!--)(| -->)$ | ||
( \*|#|<!--) Licensed under the Apache License, Version 2\.0 \(the "License"\);(| -->)$ | ||
( \*|#|<!--) you may not use this file except in compliance with the License\.(| -->)$ | ||
( \*|#|<!--) You may obtain a copy of the License at(| -->)$ | ||
( \*|#|<!--)(| -->)$ | ||
( \*|#|<!--) http://www\.apache\.org/licenses/LICENSE-2\.0(| -->)$ | ||
( \*|#|<!--)(| -->)$ | ||
( \*|#|<!--) Unless required by applicable law or agreed to in writing, software(| -->)$ | ||
( \*|#|<!--) distributed under the License is distributed on an "AS IS" BASIS,(| -->)$ | ||
( \*|#|<!--) WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied\.(| -->)$ | ||
( \*|#|<!--) See the License for the specific language governing permissions and(| -->)$ | ||
( \*|#|<!--) limitations under the License\.(| -->)$ | ||
( \*/|#|<!--)(| -->)$ |
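Each line of the header file above reads as a regular expression for one line of a license header, with alternatives for `/* ... */`, `#`, and `<!-- -->` comment styles. A minimal sketch of how such a file could be applied, assuming line N of the patterns is matched against line N of the source file; `header_matches` is a hypothetical helper, and only the first three patterns are copied here for brevity.

```python
import re

# First three patterns, copied from the header file in this diff.
HEADER_PATTERNS = [
    r"(/\*|#|<!--)(| -->)$",
    r"( \*|#|<!--) Copyright 2014 DataGenerator Contributors(| -->)$",
    r"( \*|#|<!--)(| -->)$",
]


def header_matches(lines, patterns=HEADER_PATTERNS):
    """Return True if each leading line matches the pattern in the same position."""
    if len(lines) < len(patterns):
        return False
    # re.match anchors at the start of the line; the patterns end with "$".
    return all(re.match(p, line) for p, line in zip(patterns, lines))
```

Under this assumption, both a Java-style block comment header and a `#`-style header satisfy the same pattern file, which is presumably why each pattern lists several comment prefixes.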
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,4 @@ | ||
<?xml version="1.0" encoding="UTF-8"?> | ||
<checkstyle version="5.0"> | ||
|
||
</checkstyle> |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,4 @@ | ||
<?xml version="1.0" encoding="UTF-8"?> | ||
<checkstyle version="5.0"> | ||
|
||
</checkstyle> |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,158 @@ | ||
<scalastyle commentFilter="enabled"> | ||
<name>Scalastyle standard configuration</name> | ||
<check level="warning" class="org.scalastyle.file.FileTabChecker" enabled="true"></check> | ||
<check level="warning" class="org.scalastyle.file.FileLengthChecker" enabled="true"> | ||
<parameters> | ||
<parameter name="maxFileLength"><![CDATA[800]]></parameter> | ||
</parameters> | ||
</check> | ||
<check level="warning" class="org.scalastyle.file.HeaderMatchesChecker" enabled="false"> | ||
<parameters> | ||
<parameter name="header"><![CDATA[<!-- | ||
~ Copyright 2014 DataGenerator Contributors | ||
~ | ||
~ Licensed under the Apache License, Version 2.0 (the "License"); | ||
~ you may not use this file except in compliance with the License. | ||
~ You may obtain a copy of the License at | ||
~ | ||
~ http://www.apache.org/licenses/LICENSE-2.0 | ||
~ | ||
~ Unless required by applicable law or agreed to in writing, software | ||
~ distributed under the License is distributed on an "AS IS" BASIS, | ||
~ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. | ||
~ See the License for the specific language governing permissions and | ||
~ limitations under the License. | ||
-->]]></parameter> | ||
</parameters> | ||
</check> | ||
<!-- Catches plus sign in type parameter variance annotations, not just the plus operator. --> | ||
<check level="warning" class="org.scalastyle.scalariform.SpacesAfterPlusChecker" enabled="false"></check> | ||
<!-- Not sure what the point of this check is. --> | ||
<check level="warning" class="org.scalastyle.file.WhitespaceEndOfLineChecker" enabled="false"></check> | ||
<check level="warning" class="org.scalastyle.scalariform.SpacesBeforePlusChecker" enabled="true"></check> | ||
<check level="warning" class="org.scalastyle.file.FileLineLengthChecker" enabled="true"> | ||
<parameters> | ||
<parameter name="maxLineLength"><![CDATA[160]]></parameter> | ||
<parameter name="tabSize"><![CDATA[4]]></parameter> | ||
</parameters> | ||
</check> | ||
<check level="warning" class="org.scalastyle.scalariform.ClassNamesChecker" enabled="true"> | ||
<parameters> | ||
<parameter name="regex"><![CDATA[[A-Z][A-Za-z]*]]></parameter> | ||
</parameters> | ||
</check> | ||
<check level="warning" class="org.scalastyle.scalariform.ObjectNamesChecker" enabled="true"> | ||
<parameters> | ||
<parameter name="regex"><![CDATA[[A-Z][A-Za-z]*]]></parameter> | ||
</parameters> | ||
</check> | ||
<check level="warning" class="org.scalastyle.scalariform.PackageObjectNamesChecker" enabled="true"> | ||
<parameters> | ||
<parameter name="regex"><![CDATA[^[a-z][A-Za-z]*$]]></parameter> | ||
</parameters> | ||
</check> | ||
<check level="warning" class="org.scalastyle.scalariform.EqualsHashCodeChecker" enabled="true"></check> | ||
<check level="warning" class="org.scalastyle.scalariform.IllegalImportsChecker" enabled="true"> | ||
<parameters> | ||
<parameter name="illegalImports"><![CDATA[sun._,java.awt._]]></parameter> | ||
</parameters> | ||
</check> | ||
<check level="warning" class="org.scalastyle.scalariform.ParameterNumberChecker" enabled="true"> | ||
<parameters> | ||
<parameter name="maxParameters"><![CDATA[8]]></parameter> | ||
</parameters> | ||
</check> | ||
<!-- | ||
The magic-number checker has limited usefulness because it flags default values in method parameters. | ||
--> | ||
<check level="warning" class="org.scalastyle.scalariform.MagicNumberChecker" enabled="false"> | ||
<parameters> | ||
<parameter name="ignore"><![CDATA[-1,0,1,2,3]]></parameter> | ||
</parameters> | ||
</check> | ||
<check level="warning" class="org.scalastyle.scalariform.NoWhitespaceBeforeLeftBracketChecker" | ||
enabled="true"></check> | ||
<check level="warning" class="org.scalastyle.scalariform.NoWhitespaceAfterLeftBracketChecker" | ||
enabled="true"></check> | ||
<check level="warning" class="org.scalastyle.scalariform.ReturnChecker" enabled="true"></check> | ||
<!-- | ||
Best to avoid null in Scala code, but sometimes necessary, e.g., if writing a library to be used by Java code, | ||
even something like assert(param != null) would violate the NullChecker. | ||
--> | ||
<check level="warning" class="org.scalastyle.scalariform.NullChecker" enabled="true"></check> | ||
<check level="warning" class="org.scalastyle.scalariform.NoCloneChecker" enabled="true"></check> | ||
<check level="warning" class="org.scalastyle.scalariform.NoFinalizeChecker" enabled="true"></check> | ||
<check level="warning" class="org.scalastyle.scalariform.CovariantEqualsChecker" enabled="true"></check> | ||
<check level="warning" class="org.scalastyle.scalariform.StructuralTypeChecker" enabled="true"></check> | ||
<check level="warning" class="org.scalastyle.file.RegexChecker" enabled="true"> | ||
<parameters> | ||
<parameter name="regex"><![CDATA[println]]></parameter> | ||
</parameters> | ||
</check> | ||
<check level="warning" class="org.scalastyle.scalariform.NumberOfTypesChecker" enabled="true"> | ||
<parameters> | ||
<parameter name="maxTypes"><![CDATA[30]]></parameter> | ||
</parameters> | ||
</check> | ||
<check level="warning" class="org.scalastyle.scalariform.CyclomaticComplexityChecker" enabled="true"> | ||
<parameters> | ||
<parameter name="maximum"><![CDATA[75]]></parameter> | ||
</parameters> | ||
</check> | ||
<check level="warning" class="org.scalastyle.scalariform.UppercaseLChecker" enabled="true"></check> | ||
<check level="warning" class="org.scalastyle.scalariform.SimplifyBooleanExpressionChecker" enabled="true"></check> | ||
<check level="warning" class="org.scalastyle.scalariform.IfBraceChecker" enabled="true"> | ||
<parameters> | ||
<parameter name="singleLineAllowed"><![CDATA[true]]></parameter> | ||
<parameter name="doubleLineAllowed"><![CDATA[false]]></parameter> | ||
</parameters> | ||
</check> | ||
<check level="warning" class="org.scalastyle.scalariform.MethodLengthChecker" enabled="true"> | ||
<parameters> | ||
<parameter name="maxLength"><![CDATA[125]]></parameter> | ||
</parameters> | ||
</check> | ||
<check level="warning" class="org.scalastyle.scalariform.MethodNamesChecker" enabled="true"> | ||
<parameters> | ||
<parameter name="regex"><![CDATA[^[a-z][A-Za-z0-9_]*(_=)?$]]></parameter> | ||
</parameters> | ||
</check> | ||
<check level="warning" class="org.scalastyle.scalariform.NumberOfMethodsInTypeChecker" enabled="true"> | ||
<parameters> | ||
<parameter name="maxMethods"><![CDATA[30]]></parameter> | ||
</parameters> | ||
</check> | ||
<check level="warning" class="org.scalastyle.scalariform.PublicMethodsHaveTypeChecker" enabled="true"></check> | ||
<!-- Not sure what the point of this check is. --> | ||
<check level="warning" class="org.scalastyle.file.NewLineAtEofChecker" enabled="false"></check> | ||
<check level="warning" class="org.scalastyle.file.NoNewLineAtEofChecker" enabled="false"></check> | ||
<check level="warning" class="org.scalastyle.scalariform.WhileChecker" enabled="false"></check> | ||
<check level="warning" class="org.scalastyle.scalariform.VarFieldChecker" enabled="false"></check> | ||
<check level="warning" class="org.scalastyle.scalariform.VarLocalChecker" enabled="false"></check> | ||
<check level="warning" class="org.scalastyle.scalariform.RedundantIfChecker" enabled="true"></check> | ||
<check level="warning" class="org.scalastyle.scalariform.TokenChecker" enabled="false"> | ||
<parameters> | ||
<parameter name="regex"><![CDATA[println]]></parameter> | ||
</parameters> | ||
</check> | ||
<check level="warning" class="org.scalastyle.scalariform.DeprecatedJavaChecker" enabled="true"></check> | ||
<check level="warning" class="org.scalastyle.scalariform.EmptyClassChecker" enabled="true"></check> | ||
<check level="warning" class="org.scalastyle.scalariform.ClassTypeParameterChecker" enabled="false"> | ||
<parameters> | ||
<parameter name="regex"><![CDATA[^+?[A-Za-z0-9_]( [<>]: [A-Za-z0-9]+)?$]]></parameter> | ||
</parameters> | ||
</check> | ||
<!-- | ||
IntelliJ auto-generates wildcard imports... Anyway, they're easier to read and maintain. | ||
--> | ||
<check level="warning" class="org.scalastyle.scalariform.UnderscoreImportChecker" enabled="false"></check> | ||
<check level="warning" class="org.scalastyle.scalariform.LowercasePatternMatchChecker" enabled="true"></check> | ||
<check level="warning" class="org.scalastyle.scalariform.MultipleStringLiteralsChecker" enabled="false"> | ||
<parameters> | ||
<parameter name="allowed"><![CDATA[2]]></parameter> | ||
<parameter name="ignoreRegex"><![CDATA[^"\s*"$]]></parameter> | ||
</parameters> | ||
</check> | ||
<!-- Disabled because we might not want an import to take effect in the first part of some code. --> | ||
<check level="warning" class="org.scalastyle.scalariform.ImportGroupingChecker" enabled="false"></check> | ||
</scalastyle> |
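To see what the `MethodNamesChecker` pattern in this configuration accepts, the regular expression can be tried on a few names. This is a sketch only: it uses Python's `re` module as a stand-in for the regex engine Scalastyle runs on, assumes the checker requires a method name to match the pattern, and `is_allowed_method_name` is a hypothetical helper.

```python
import re

# Pattern copied from the MethodNamesChecker "regex" parameter in the config above.
METHOD_NAME = re.compile(r"^[a-z][A-Za-z0-9_]*(_=)?$")


def is_allowed_method_name(name):
    """Return True if the name satisfies the configured pattern."""
    return METHOD_NAME.match(name) is not None
```

With this pattern, camel-case names starting with a lowercase letter and setter-style names ending in `_=` pass, while names starting with an uppercase letter and symbolic operator names such as `+` do not.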