Initial Commit
aosama committed Dec 11, 2017
1 parent 868db4b commit 2980203
Showing 49 changed files with 3,286 additions and 191 deletions.
31 changes: 31 additions & 0 deletions DCO
@@ -0,0 +1,31 @@
Contributions require sign-off. A sign-off is required on every patch or pull request, and it certifies the agreement given below.

Contributor Agreement
---------------------

By making a contribution to this project, I certify that:

(a) The contribution was created in whole or in part by me and I have the right to submit it under the open source license indicated in the file; or

(b) The contribution is based upon previous work that, to the best of my knowledge, is covered under an appropriate open source license and I have the right under that license to submit that work with modifications, whether created in whole or in part by me, under the same open source license (unless I am permitted to submit under a different license), as indicated in the file; or

(c) The contribution was provided directly to me by some other person who certified (a), (b) or (c) and I have not modified it.

(d) I understand and agree that this project and the contribution are public and that a record of the contribution (including all personal information I submit with it, including my sign-off) is maintained indefinitely and may be redistributed consistent with this project or the open source license(s) involved.

(e) I also agree to the following terms and conditions:

(1) Grant of Copyright License. Subject to the terms and conditions of this agreement, You hereby grant to the maintainer and to recipients of software distributed by the maintainer a perpetual, worldwide, non-exclusive, no-charge, royalty-free, irrevocable copyright license to reproduce, prepare derivative works of, publicly display, publicly perform, sublicense, and distribute your contributions and such derivative works.

(2) Grant of Patent License. Subject to the terms and conditions of this agreement, You hereby grant to the maintainer and to recipients of software distributed by the maintainer a perpetual, worldwide, non-exclusive, no-charge, royalty-free, irrevocable (except as stated in this section) patent license to make, have made, use, offer to sell, sell, import, and otherwise transfer the work, where such license applies only to those patent claims licensable by you that are necessarily infringed by your contribution(s) alone or by combination of your contribution(s) with the work to which such contribution(s) was submitted. If any entity institutes patent litigation against you or any other entity (including a cross-claim or counterclaim in a lawsuit) alleging that your contribution, or the work to which you have contributed, constitutes direct or contributory patent infringement, then any patent licenses granted to that entity under this agreement for that contribution or work shall terminate as of the date such litigation is filed.

Committing
----------

Add a line stating

Signed-off-by: Random J Developer <random@developer.example.org>

When committing using the command line you can sign off using the --signoff or -s flag. This adds a Signed-off-by line by the committer at the end of the commit log message.

git commit -s -m "Commit message"
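
As a rough illustration of what the sign-off line looks like to tooling, the sketch below checks a commit message for a `Signed-off-by:` line. The class `SignOffCheck` and its regular expression are hypothetical helpers written for this example; they are not part of this repository or of git, and the pattern is only an approximation of a name followed by an e-mail address in angle brackets.

```java
import java.util.regex.Pattern;

// Hypothetical helper (not part of this commit): detects a sign-off line in a commit message.
final class SignOffCheck {
    // A line of the form "Signed-off-by: Name <user@host>", matched line by line (MULTILINE).
    private static final Pattern SIGN_OFF =
            Pattern.compile("^Signed-off-by: .+ <[^<>\\s]+@[^<>\\s]+>$", Pattern.MULTILINE);

    static boolean hasSignOff(String commitMessage) {
        return SIGN_OFF.matcher(commitMessage).find();
    }

    public static void main(String[] args) {
        String signed = "Commit message\n\n"
                + "Signed-off-by: Random J Developer <random@developer.example.org>";
        System.out.println(hasSignOff(signed));
        System.out.println(hasSignOff("Commit message"));
    }
}
```

A check like this could run in a review script or hook, but how sign-off is actually enforced for this project is not stated here.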
368 changes: 179 additions & 189 deletions LICENSE

Large diffs are not rendered by default.

71 changes: 69 additions & 2 deletions README.md
@@ -1,2 +1,69 @@
# MegaSparkDiff
at scale data comparison tool that can compare pair combination of data sources
<h1>MegaSparkDiff</h1>

MegaSparkDiff is an open source tool that compares any pair of data sets
drawn from the following source types:

HDFS, JDBC, S3, HBase, text files, and Hive.

MegaSparkDiff can run on Amazon EMR (Elastic MapReduce),
on Amazon EC2 instances, and in other cloud environments
with compatible Spark distributions.

How to Use from Within a Java or Scala Project
----------------------------------------------
Add the following Maven dependency to your project:

```xml
<dependency>
    <groupId>org.finra.megasparkdiff</groupId>
    <artifactId>mega-spark-diff</artifactId>
    <version>0.1</version>
</dependency>
```

SparkFactory
------------
SparkFactory parallelizes the source and target data.

The data sources can take the following forms:

* Text file
* HDFS file
* SQL query over a JDBC data source
* Hive table

SparkCompare
------------
SparkCompare compares any pair of supported sources.
When a schema-based source is compared with a non-schema-based source, the SparkCompare
class attempts to flatten the schema-based source into delimited values before running the comparison. The delimiter
can be specified when launching the compare job.
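
The flattening idea can be sketched in plain Java without Spark. This is only an illustration of the concept, not MegaSparkDiff's API: the class `FlattenCompareSketch` and its methods are hypothetical, rows are modelled as simple lists of column values, and the comparison is an in-memory difference rather than a distributed one.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical illustration only; this is not the SparkCompare class.
final class FlattenCompareSketch {
    // Flatten one schema-based row (a list of column values) into a delimited line.
    static String flatten(List<?> row, String delimiter) {
        return row.stream()
                .map(value -> String.valueOf(value))
                .collect(Collectors.joining(delimiter));
    }

    // Lines present in `left` but missing from `right`; duplicates are counted.
    static List<String> onlyIn(List<String> left, List<String> right) {
        List<String> remaining = new ArrayList<>(right);
        List<String> result = new ArrayList<>();
        for (String line : left) {
            // remove(Object) drops a single matching occurrence, if there is one
            if (!remaining.remove(line)) {
                result.add(line);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Schema-based side: rows with typed columns.
        List<List<Object>> tableRows = Arrays.asList(
                Arrays.<Object>asList(1, "apple"),
                Arrays.<Object>asList(2, "banana"));
        // Non-schema side: already-delimited text lines.
        List<String> textLines = Arrays.asList("1,apple", "3,cherry");

        List<String> flattened = tableRows.stream()
                .map(row -> flatten(row, ","))
                .collect(Collectors.toList());

        System.out.println("Only in schema-based source: " + onlyIn(flattened, textLines));
        System.out.println("Only in text source: " + onlyIn(textLines, flattened));
    }
}
```

The choice of delimiter matters: if the delimiter also occurs inside a column value, two different rows can flatten to the same line, so it is worth picking one that does not appear in the data.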

How to use via shell script in EMR
----------------------------------
A shell script named msd.sh will wrap this Java/Scala project.
The script accepts parameters for the source definitions,
the output destination, and the run configuration,
as well as the names of the two data sets to compare.

The parameters are as follows:

* `-ds=<data_source_folder>`: the folder where the database
  connection parameters and data queries reside
* `-od=<output_directory>`: the directory where MegaSparkDiff will write
  its output
* `-rc=<run_config_file_name>`: the file used to load
  any special run and Spark configurations (optional)

To specify a data set to compare, pass the name of one of the
data queries found in a config file inside `<data_source_folder>`,
prefixed with "--". The program executes the queries assigned to
the names passed on the command line, stores the results in tables, and
performs the comparison.

Example call:

    ./msd.sh -ds=./data_sources/ -od=output --shraddha --carlos

Users can also add JDBC driver JAR files by including them in the
classpath, which lets them extract data from whichever database they
choose.
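
To make the argument format concrete, here is a minimal sketch of how the parameters described above could be parsed. The class `MsdArgsSketch` is hypothetical and is not taken from the project. It assumes that `-ds` and `-od` are required (only `-rc` is described as optional) and that exactly two data set names are passed.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical parser for the documented flags; not the project's actual launcher code.
final class MsdArgsSketch {
    String dataSourceFolder;   // from -ds=
    String outputDirectory;    // from -od=
    String runConfigFile;      // from -rc=, optional, stays null when absent
    final List<String> dataSetNames = new ArrayList<>(); // names passed as --<name>

    static MsdArgsSketch parse(String[] args) {
        MsdArgsSketch parsed = new MsdArgsSketch();
        for (String arg : args) {
            if (arg.startsWith("-ds=")) {
                parsed.dataSourceFolder = arg.substring("-ds=".length());
            } else if (arg.startsWith("-od=")) {
                parsed.outputDirectory = arg.substring("-od=".length());
            } else if (arg.startsWith("-rc=")) {
                parsed.runConfigFile = arg.substring("-rc=".length());
            } else if (arg.startsWith("--")) {
                parsed.dataSetNames.add(arg.substring(2));
            } else {
                throw new IllegalArgumentException("Unrecognized argument: " + arg);
            }
        }
        // Assumption: both folders are mandatory and exactly two data sets are compared.
        if (parsed.dataSourceFolder == null || parsed.outputDirectory == null) {
            throw new IllegalArgumentException("Both -ds and -od are required");
        }
        if (parsed.dataSetNames.size() != 2) {
            throw new IllegalArgumentException("Expected exactly two data set names");
        }
        return parsed;
    }

    public static void main(String[] args) {
        MsdArgsSketch parsed = parse(new String[] {
                "-ds=./data_sources/", "-od=output", "--shraddha", "--carlos"});
        System.out.println(parsed.dataSourceFolder + " " + parsed.outputDirectory
                + " " + parsed.dataSetNames);
    }
}
```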
15 changes: 15 additions & 0 deletions config/java.header
@@ -0,0 +1,15 @@
(/\*|#|<!--)(| -->)$
( \*|#|<!--) Copyright 2014 DataGenerator Contributors(| -->)$
( \*|#|<!--)(| -->)$
( \*|#|<!--) Licensed under the Apache License, Version 2\.0 \(the "License"\);(| -->)$
( \*|#|<!--) you may not use this file except in compliance with the License\.(| -->)$
( \*|#|<!--) You may obtain a copy of the License at(| -->)$
( \*|#|<!--)(| -->)$
( \*|#|<!--) http://www\.apache\.org/licenses/LICENSE-2\.0(| -->)$
( \*|#|<!--)(| -->)$
( \*|#|<!--) Unless required by applicable law or agreed to in writing, software(| -->)$
( \*|#|<!--) distributed under the License is distributed on an "AS IS" BASIS,(| -->)$
( \*|#|<!--) WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied\.(| -->)$
( \*|#|<!--) See the License for the specific language governing permissions and(| -->)$
( \*|#|<!--) limitations under the License\.(| -->)$
( \*/|#|<!--)(| -->)$
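
Each line of config/java.header is a regular expression that is compared with the corresponding line at the top of a source file, and the alternations allow the same header to appear in `/* */`, `#`, or `<!-- -->` comments. The sketch below shows that idea using the first three patterns copied from the file. The class `HeaderRegexSketch` is hypothetical, and the use of `find()` (a match anywhere in the line, anchored only by the trailing `$`) is an assumption about how the checking tool applies the patterns.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical illustration of line-by-line header matching; not the checking tool's code.
final class HeaderRegexSketch {
    // First three lines of config/java.header, escaped for a Java string literal.
    static final List<Pattern> HEADER = Arrays.asList(
            Pattern.compile("(/\\*|#|<!--)(| -->)$"),
            Pattern.compile("( \\*|#|<!--) Copyright 2014 DataGenerator Contributors(| -->)$"),
            Pattern.compile("( \\*|#|<!--)(| -->)$"));

    // Returns the 1-based number of the first line that does not match, or 0 if all match.
    static int firstMismatch(List<String> fileLines) {
        for (int i = 0; i < HEADER.size(); i++) {
            boolean missing = i >= fileLines.size();
            if (missing || !HEADER.get(i).matcher(fileLines.get(i)).find()) {
                return i + 1;
            }
        }
        return 0;
    }

    public static void main(String[] args) {
        List<String> javaStyle = Arrays.asList(
                "/*", " * Copyright 2014 DataGenerator Contributors", " *");
        List<String> hashStyle = Arrays.asList(
                "#", "# Copyright 2014 DataGenerator Contributors", "#");
        System.out.println(firstMismatch(javaStyle));
        System.out.println(firstMismatch(hashStyle));
    }
}
```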
4 changes: 4 additions & 0 deletions config/scalastyle-output.xml
@@ -0,0 +1,4 @@
<?xml version="1.0" encoding="UTF-8"?>
<checkstyle version="5.0">

</checkstyle>
4 changes: 4 additions & 0 deletions config/scalastyle-output_dg-common.xml
@@ -0,0 +1,4 @@
<?xml version="1.0" encoding="UTF-8"?>
<checkstyle version="5.0">

</checkstyle>
158 changes: 158 additions & 0 deletions config/scalastyle_config.xml
@@ -0,0 +1,158 @@
<scalastyle commentFilter="enabled">
<name>Scalastyle standard configuration</name>
<check level="warning" class="org.scalastyle.file.FileTabChecker" enabled="true"></check>
<check level="warning" class="org.scalastyle.file.FileLengthChecker" enabled="true">
<parameters>
<parameter name="maxFileLength"><![CDATA[800]]></parameter>
</parameters>
</check>
<check level="warning" class="org.scalastyle.file.HeaderMatchesChecker" enabled="false">
<parameters>
<parameter name="header"><![CDATA[<!--
~ Copyright 2014 DataGenerator Contributors
~
~ Licensed under the Apache License, Version 2.0 (the "License");
~ you may not use this file except in compliance with the License.
~ You may obtain a copy of the License at
~
~ http://www.apache.org/licenses/LICENSE-2.0
~
~ Unless required by applicable law or agreed to in writing, software
~ distributed under the License is distributed on an "AS IS" BASIS,
~ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
~ See the License for the specific language governing permissions and
~ limitations under the License.
-->]]></parameter>
</parameters>
</check>
<!-- Catches plus sign in type parameter variance annotations, not just the plus operator. -->
<check level="warning" class="org.scalastyle.scalariform.SpacesAfterPlusChecker" enabled="false"></check>
<!-- Not sure what the point of this check is. -->
<check level="warning" class="org.scalastyle.file.WhitespaceEndOfLineChecker" enabled="false"></check>
<check level="warning" class="org.scalastyle.scalariform.SpacesBeforePlusChecker" enabled="true"></check>
<check level="warning" class="org.scalastyle.file.FileLineLengthChecker" enabled="true">
<parameters>
<parameter name="maxLineLength"><![CDATA[160]]></parameter>
<parameter name="tabSize"><![CDATA[4]]></parameter>
</parameters>
</check>
<check level="warning" class="org.scalastyle.scalariform.ClassNamesChecker" enabled="true">
<parameters>
<parameter name="regex"><![CDATA[[A-Z][A-Za-z]*]]></parameter>
</parameters>
</check>
<check level="warning" class="org.scalastyle.scalariform.ObjectNamesChecker" enabled="true">
<parameters>
<parameter name="regex"><![CDATA[[A-Z][A-Za-z]*]]></parameter>
</parameters>
</check>
<check level="warning" class="org.scalastyle.scalariform.PackageObjectNamesChecker" enabled="true">
<parameters>
<parameter name="regex"><![CDATA[^[a-z][A-Za-z]*$]]></parameter>
</parameters>
</check>
<check level="warning" class="org.scalastyle.scalariform.EqualsHashCodeChecker" enabled="true"></check>
<check level="warning" class="org.scalastyle.scalariform.IllegalImportsChecker" enabled="true">
<parameters>
<parameter name="illegalImports"><![CDATA[sun._,java.awt._]]></parameter>
</parameters>
</check>
<check level="warning" class="org.scalastyle.scalariform.ParameterNumberChecker" enabled="true">
<parameters>
<parameter name="maxParameters"><![CDATA[8]]></parameter>
</parameters>
</check>
<!--
The magic-number checker has limited usefulness because it flags default values in method parameters.
-->
<check level="warning" class="org.scalastyle.scalariform.MagicNumberChecker" enabled="false">
<parameters>
<parameter name="ignore"><![CDATA[-1,0,1,2,3]]></parameter>
</parameters>
</check>
<check level="warning" class="org.scalastyle.scalariform.NoWhitespaceBeforeLeftBracketChecker"
enabled="true"></check>
<check level="warning" class="org.scalastyle.scalariform.NoWhitespaceAfterLeftBracketChecker"
enabled="true"></check>
<check level="warning" class="org.scalastyle.scalariform.ReturnChecker" enabled="true"></check>
<!--
Best to avoid null in Scala code, but sometimes necessary, e.g., if writing a library to be used by Java code,
even something like assert(param != null) would violate the NullChecker.
-->
<check level="warning" class="org.scalastyle.scalariform.NullChecker" enabled="true"></check>
<check level="warning" class="org.scalastyle.scalariform.NoCloneChecker" enabled="true"></check>
<check level="warning" class="org.scalastyle.scalariform.NoFinalizeChecker" enabled="true"></check>
<check level="warning" class="org.scalastyle.scalariform.CovariantEqualsChecker" enabled="true"></check>
<check level="warning" class="org.scalastyle.scalariform.StructuralTypeChecker" enabled="true"></check>
<check level="warning" class="org.scalastyle.file.RegexChecker" enabled="true">
<parameters>
<parameter name="regex"><![CDATA[println]]></parameter>
</parameters>
</check>
<check level="warning" class="org.scalastyle.scalariform.NumberOfTypesChecker" enabled="true">
<parameters>
<parameter name="maxTypes"><![CDATA[30]]></parameter>
</parameters>
</check>
<check level="warning" class="org.scalastyle.scalariform.CyclomaticComplexityChecker" enabled="true">
<parameters>
<parameter name="maximum"><![CDATA[75]]></parameter>
</parameters>
</check>
<check level="warning" class="org.scalastyle.scalariform.UppercaseLChecker" enabled="true"></check>
<check level="warning" class="org.scalastyle.scalariform.SimplifyBooleanExpressionChecker" enabled="true"></check>
<check level="warning" class="org.scalastyle.scalariform.IfBraceChecker" enabled="true">
<parameters>
<parameter name="singleLineAllowed"><![CDATA[true]]></parameter>
<parameter name="doubleLineAllowed"><![CDATA[false]]></parameter>
</parameters>
</check>
<check level="warning" class="org.scalastyle.scalariform.MethodLengthChecker" enabled="true">
<parameters>
<parameter name="maxLength"><![CDATA[125]]></parameter>
</parameters>
</check>
<check level="warning" class="org.scalastyle.scalariform.MethodNamesChecker" enabled="true">
<parameters>
<parameter name="regex"><![CDATA[^[a-z][A-Za-z0-9_]*(_=)?$]]></parameter>
</parameters>
</check>
<check level="warning" class="org.scalastyle.scalariform.NumberOfMethodsInTypeChecker" enabled="true">
<parameters>
<parameter name="maxMethods"><![CDATA[30]]></parameter>
</parameters>
</check>
<check level="warning" class="org.scalastyle.scalariform.PublicMethodsHaveTypeChecker" enabled="true"></check>
<!-- Not sure what the point of this check is. -->
<check level="warning" class="org.scalastyle.file.NewLineAtEofChecker" enabled="false"></check>
<check level="warning" class="org.scalastyle.file.NoNewLineAtEofChecker" enabled="false"></check>
<check level="warning" class="org.scalastyle.scalariform.WhileChecker" enabled="false"></check>
<check level="warning" class="org.scalastyle.scalariform.VarFieldChecker" enabled="false"></check>
<check level="warning" class="org.scalastyle.scalariform.VarLocalChecker" enabled="false"></check>
<check level="warning" class="org.scalastyle.scalariform.RedundantIfChecker" enabled="true"></check>
<check level="warning" class="org.scalastyle.scalariform.TokenChecker" enabled="false">
<parameters>
<parameter name="regex"><![CDATA[println]]></parameter>
</parameters>
</check>
<check level="warning" class="org.scalastyle.scalariform.DeprecatedJavaChecker" enabled="true"></check>
<check level="warning" class="org.scalastyle.scalariform.EmptyClassChecker" enabled="true"></check>
<check level="warning" class="org.scalastyle.scalariform.ClassTypeParameterChecker" enabled="false">
<parameters>
<parameter name="regex"><![CDATA[^+?[A-Za-z0-9_]( [<>]: [A-Za-z0-9]+)?$]]></parameter>
</parameters>
</check>
<!--
IntelliJ auto-generates wildcard imports... Anyway, they're easier to read and maintain.
-->
<check level="warning" class="org.scalastyle.scalariform.UnderscoreImportChecker" enabled="false"></check>
<check level="warning" class="org.scalastyle.scalariform.LowercasePatternMatchChecker" enabled="true"></check>
<check level="warning" class="org.scalastyle.scalariform.MultipleStringLiteralsChecker" enabled="false">
<parameters>
<parameter name="allowed"><![CDATA[2]]></parameter>
<parameter name="ignoreRegex"><![CDATA[^"\s*"$]]></parameter>
</parameters>
</check>
<!-- Disabled because we might not want an import to take effect in the first part of some code. -->
<check level="warning" class="org.scalastyle.scalariform.ImportGroupingChecker" enabled="false"></check>
</scalastyle>
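
To see what the naming rules in this configuration accept, the sketch below applies three of the configured regular expressions to sample names. The class `NamingRegexSketch` is hypothetical, and whole-string matching via `matches()` is an assumption. The class-name pattern in the configuration has no `^` or `$` anchors, so a tool that accepts partial matches would let through more names than shown here.

```java
import java.util.regex.Pattern;

// Hypothetical illustration; it does not reproduce Scalastyle's own matching logic.
final class NamingRegexSketch {
    // Patterns copied from the ClassNamesChecker, PackageObjectNamesChecker
    // and MethodNamesChecker entries above.
    static final Pattern CLASS_NAME = Pattern.compile("[A-Z][A-Za-z]*");
    static final Pattern PACKAGE_OBJECT_NAME = Pattern.compile("^[a-z][A-Za-z]*$");
    static final Pattern METHOD_NAME = Pattern.compile("^[a-z][A-Za-z0-9_]*(_=)?$");

    // Assumption: the whole name must match the pattern.
    static boolean accepts(Pattern pattern, String name) {
        return pattern.matcher(name).matches();
    }

    public static void main(String[] args) {
        String[] classNames = {"SparkCompare", "sparkCompare", "Spark2Compare"};
        for (String name : classNames) {
            System.out.println("class " + name + ": " + accepts(CLASS_NAME, name));
        }
        String[] methodNames = {"compare", "value_=", "Compare"};
        for (String name : methodNames) {
            System.out.println("method " + name + ": " + accepts(METHOD_NAME, name));
        }
    }
}
```

Under the whole-string assumption, the class-name rule rejects digits and underscores, while the method-name rule allows both and also permits the `_=` suffix used by Scala setters.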
