
hadoop-small-files-merger

A Spark application to merge small files
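
At a high level, the merge amounts to reading the small files under a directory and rewriting them into fewer, block-sized files. Below is a minimal Scala sketch of that general approach, not the project's actual code; the paths, object name, and output location are hypothetical:

  import org.apache.hadoop.fs.{FileSystem, Path}
  import org.apache.spark.sql.SparkSession

  object MergeSketch {
    def main(args: Array[String]): Unit = {
      val spark = SparkSession.builder().appName("small-files-merger-sketch").getOrCreate()

      val directory = "hdfs:///data/events"   // hypothetical input directory
      val blockSize = 131072000L              // the 125 MB default described below

      // Total size of the input files, via the Hadoop FileSystem API.
      val fs = FileSystem.get(spark.sparkContext.hadoopConfiguration)
      val totalBytes = fs.listStatus(new Path(directory)).filter(_.isFile).map(_.getLen).sum

      // One output partition per ~125 MB of input, so each merged file
      // roughly fills (but does not exceed) a 128 MB HDFS block.
      val numPartitions = math.max(1, (totalBytes / blockSize).toInt)

      // Shown for the text format; avro and parquet follow the same pattern.
      spark.read.text(directory)
        .repartition(numPartitions)
        .write.text(directory + "_merged")    // hypothetical output location

      spark.stop()
    }
  }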

Hadoop Small Files Merger Application
Usage: hadoop-small-files-merger.jar [options]

  -b, --blockSize <value>  Specify your cluster's block size in bytes. The default is 131072000 (125 MB),
                           which is slightly less than the actual 128 MB block size. It is intentionally
                           kept at 125 MB so that the data of a single partition fits into a 128 MB block:
                           Spark does not produce exact file sizes after partitioning, only sizes
                           approximately equal to the specified block size.
  -f, --format <value>     Values: avro, text, parquet
  -d, --directory <value>  Path of the directory containing the small files, starting with hdfs:///
  -c, --compression <value>
                           Values: `none`, `snappy`, `gzip`, and `lzo`. Default: none
  -s, --schemaStr <value>  A stringified avro schema
  -s, --schemaPath <value>
                           HDFS path to a .avsc file, used when the specified format is avro

Specify either `schemaStr` or `schemaPath` when the format is avro.
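
For example, a hypothetical invocation that merges avro files might look like the following (the main class is a placeholder, as it is not documented here; check the jar's manifest for the actual one, and the paths are illustrative):

  spark-submit --class <main-class> hadoop-small-files-merger.jar \
    --format avro \
    --directory hdfs:///data/events \
    --schemaPath hdfs:///schemas/event.avsc \
    --blockSize 131072000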


The options below apply only if the data inside `directory` is partitioned by date:

  --from <value>           From Date
  --to <value>             To Date
  --partitionBy <value>    Values: day or hour. Granularity of the date partitioning. Default: day
  --partitionFormat <value>
                           Directory partition format, given as a valid SimpleDateFormat pattern,
                           e.g. "'/year='yyyy'/month='MM'/day='dd" or "'/'yyyy'/'MM'/'dd"
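
For example, with a layout like hdfs:///data/events/year=2021/month=01/day=15, the date-range options might be combined as follows (the main class and paths are placeholders, and the accepted date format for --from/--to is not documented here; yyyy-MM-dd is shown as a plausible guess):

  spark-submit --class <main-class> hadoop-small-files-merger.jar \
    --format avro \
    --directory hdfs:///data/events \
    --schemaPath hdfs:///schemas/event.avsc \
    --from 2021-01-01 --to 2021-01-31 \
    --partitionBy day \
    --partitionFormat "'/year='yyyy'/month='MM'/day='dd"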
                           
