# Top 4 Open-source Tools For Object Storage
![](images/pexels.jpg)
<figcaption style="text-align: center;">
    <strong>
        Photo by 
        <a href='https://www.pexels.com/@timmossholder?utm_content=attributionCopyText&utm_medium=referral&utm_source=pexels'>Tim Mossholder</a>
        on 
        <a href='https://www.pexels.com/photo/shallow-focus-photo-of-white-open-sigange-3345876/?utm_content=attributionCopyText&utm_medium=referral&utm_source=pexels'>Pexels</a>
    </strong>
</figcaption>

### Introduction

As noted in Forbes, more than 80% of the data in organizations is unstructured. Traditionally, companies ignored this type of data because it is hard to analyze and to turn into meaningful insights. However, the landscape is changing rapidly thanks to the availability of different types of storage systems, such as block, file and object-based storage.

Among the three, object storage seems the most promising, as shown by the fact that massive companies like Amazon, Google and IBM already offer enterprise object storage services. While such commercial options certainly offer many features, it is worth exploring free alternatives that can contribute to a successful object storage implementation within your company. In this article, we will discuss the top 4 open-source object storage tools and how they compare to each other.

### 1. [JuiceFS](https://juicefs.com/?hl=en)

![image.png](attachment:9b36b153-a15b-468f-ae8f-453d2c704508.png)
<figcaption style="text-align: center;">
    <strong>
        JuiceFS landing page
    </strong>
</figcaption>

The first on the list, with 3.3k GitHub stars, is the [JuiceFS project](https://github.com/juicedata/juicefs). Its main purpose is to turn any object storage into a full file system compatible with POSIX, HDFS and NFS.

A defining feature of object storage is its lack of any organizational hierarchy. All data is stored in a single, flat repository, and each object can only be accessed through its GUID (Globally Unique Identifier). While this design provides high speed and storage flexibility, it can create problems for existing applications that expect a file system.
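
To make the flat, ID-based access model concrete, here is a minimal in-memory sketch. The `FlatObjectStore` class and its method names are hypothetical and only illustrate the idea; they are not the API of any real object storage product.

```python
import uuid


class FlatObjectStore:
    """Toy in-memory object store: one flat namespace, no directories."""

    def __init__(self):
        # object ID -> (data, metadata); there is no hierarchy of any kind
        self._objects = {}

    def put(self, data, **metadata):
        """Store an object and return the unique ID needed to read it back."""
        object_id = str(uuid.uuid4())
        self._objects[object_id] = (data, dict(metadata))
        return object_id

    def get(self, object_id):
        """Fetch an object's data by its unique ID."""
        data, _metadata = self._objects[object_id]
        return data


store = FlatObjectStore()
report_id = store.put(b"quarterly numbers", content_type="text/plain")
print(report_id, store.get(report_id))
```

Note that there is no way to ask this store for "everything under `/reports`": the caller has to keep track of the IDs, which is exactly what file system-based applications are not built to do.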

JuiceFS solves this by providing a fully POSIX-compatible file system, so existing applications can keep working without changes to the business logic. It can be built on top of almost any cloud storage provider, such as Amazon S3, to store the data as objects. In addition, it simplifies management by saving metadata in familiar database engines such as Redis, MySQL, PostgreSQL and SQLite.

JuiceFS prides itself on its performance, offering latency as low as a few milliseconds. The tool also offers a Hadoop Java SDK, which makes it easy to integrate into the Hadoop ecosystem, and provides a Kubernetes CSI driver for businesses that use Kubernetes. You can refer to the [quick start guide](https://github.com/juicedata/juicefs/blob/main/docs/en/quick_start_guide.md) to start using it immediately.
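
The split between object data and metadata can be illustrated with a small sketch. This is a simplified model under stated assumptions, not JuiceFS's actual design: the `PathLayer` class is hypothetical, one dictionary stands in for the object storage and another for the metadata engine.

```python
class PathLayer:
    """Toy file-system view layered over a flat object store."""

    def __init__(self):
        self.objects = {}    # stands in for object storage: object key -> bytes
        self.metadata = {}   # stands in for the metadata engine: path -> object key
        self._next_key = 0

    def write(self, path, data):
        """Store the data as an object and record its path in the metadata."""
        key = f"chunk-{self._next_key}"
        self._next_key += 1
        self.objects[key] = data
        self.metadata[path] = key

    def read(self, path):
        """Resolve the path through the metadata, then fetch the object."""
        return self.objects[self.metadata[path]]

    def listdir(self, directory):
        """List the immediate children of a directory using metadata only."""
        prefix = directory.rstrip("/") + "/"
        names = {
            path[len(prefix):].split("/", 1)[0]
            for path in self.metadata
            if path.startswith(prefix)
        }
        return sorted(names)

    def rename(self, old_path, new_path):
        """Rename is a metadata-only operation: no object data is copied."""
        self.metadata[new_path] = self.metadata.pop(old_path)


fs = PathLayer()
fs.write("/logs/app.txt", b"started")
fs.rename("/logs/app.txt", "/logs/app-old.txt")
print(fs.listdir("/logs"), fs.read("/logs/app-old.txt"))
```

Because directories and names live only in the metadata, operations such as listing or renaming never have to touch the stored objects.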

### 2. [SeaweedFS](https://github.com/chrislusf/seaweedfs)

![image.png](https://raw.githubusercontent.com/chrislusf/seaweedfs/master/note/seaweedfs.png)
<figcaption style="text-align: center;">
    <strong>
        SeaweedFS brand logo from GitHub
    </strong>
</figcaption>

SeaweedFS is a direct alternative to JuiceFS in terms of features, but it is much more popular. The credibility and the very future of an open-source project depend on its community and how active it is, and SeaweedFS does well on both counts. The GitHub repository has more than 12k stars and 119 active contributors as of June 2021.

SeaweedFS makes speed and scalability a top priority. Quoting their objective:

1. To store billions of files!
2. To serve those files fast!

Unlike other object storage systems, SeaweedFS does not save all data in a single repository. Instead, a single central master controls clusters of volume servers, and these volume servers manage the files and their metadata. This design makes the tool much faster because it relieves the central master of concurrency pressure.

SeaweedFS introduces and handles directories with its stateless server, called Filer. It is linearly scalable and supports dozens of customizable metadata stores, e.g., MySQL, PostgreSQL, Redis, Cassandra, HBase, MongoDB, Elasticsearch, LevelDB, RocksDB, SQLite, MemSQL, TiDB, etcd, CockroachDB, etc.
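
The division of labor between the master and the volume servers can be sketched as follows. This is a toy model, not SeaweedFS's real protocol: the class names, the `"volume,key"` file ID format and the round-robin assignment are all simplifying assumptions made for illustration.

```python
import itertools


class VolumeServer:
    """Holds the actual files and their per-file bookkeeping."""

    def __init__(self, name):
        self.name = name
        self.files = {}  # (volume_id, file_key) -> bytes

    def write(self, volume_id, file_key, data):
        self.files[(volume_id, file_key)] = data

    def read(self, volume_id, file_key):
        return self.files[(volume_id, file_key)]


class Master:
    """Knows only which server hosts which volume, never individual files."""

    def __init__(self, servers, volumes_per_server=2):
        self.volume_locations = {}  # volume_id -> VolumeServer
        volume_ids = itertools.count(1)
        for server in servers:
            for _ in range(volumes_per_server):
                self.volume_locations[next(volume_ids)] = server
        self._file_keys = itertools.count(1)
        self._round_robin = itertools.cycle(sorted(self.volume_locations))

    def assign(self):
        """Hand out a new file ID of the form 'volume_id,file_key'."""
        return f"{next(self._round_robin)},{next(self._file_keys)}"

    def lookup(self, volume_id):
        return self.volume_locations[volume_id]


def upload(master, data):
    file_id = master.assign()
    volume_id, file_key = (int(part) for part in file_id.split(","))
    # The data goes straight to the volume server; the master never sees it.
    master.lookup(volume_id).write(volume_id, file_key, data)
    return file_id


def download(master, file_id):
    volume_id, file_key = (int(part) for part in file_id.split(","))
    return master.lookup(volume_id).read(volume_id, file_key)


master = Master([VolumeServer("vs-a"), VolumeServer("vs-b")])
photo_id = upload(master, b"cat picture")
print(photo_id, download(master, photo_id))
```

In this sketch the master's state grows with the number of volumes rather than the number of files, which is why it stays small even when the volume servers hold very many files.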

Here is the [quick start guide](https://github.com/chrislusf/seaweedfs#quick-start) of the tool.

### 3. [LakeFS](https://lakefs.io/)

![image.png](attachment:768848d0-8956-4e71-b795-9a5618618c87.png)
<figcaption style="text-align: center;">
    <strong>
        LakeFS landing page
    </strong>
</figcaption>

While several open-source tools simply enhance the underlying storage system, the LakeFS project lets you version-control your object storage repository, mainly data lakes.

Its objective is to provide a Git-like data versioning tool that is also compatible with existing cloud storage. With LakeFS, you can version-control terabytes of data just like code. It allows you to build repeatable, atomic operations on your data repository, making it possible to run large-scale ETL jobs, data analytics and machine learning.

LakeFS allows you to create a development environment where you can perform experiments and document them in a reproducible manner. Just like in Git, you can create commits and branches, which makes it possible to move along the timeline of your application development and try out new features in isolation. And by the way, LakeFS does all this without duplicating any data - everything is done through metadata management.
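
The idea of branching without copying data can be sketched with a small content-addressed store. This is an illustrative model only, not LakeFS's implementation: the `VersionedStore` class is hypothetical and the merge is a naive "source wins" update.

```python
import hashlib


class VersionedStore:
    """Toy store where branches are just path -> content-hash mappings."""

    def __init__(self):
        self.blobs = {}               # content hash -> bytes, each stored once
        self.branches = {"main": {}}  # branch name -> {path: content hash}

    def put(self, branch, path, data):
        digest = hashlib.sha256(data).hexdigest()
        self.blobs.setdefault(digest, data)
        self.branches[branch][path] = digest

    def get(self, branch, path):
        return self.blobs[self.branches[branch][path]]

    def create_branch(self, name, source):
        # Copies only the small path -> hash mapping, never the blobs.
        self.branches[name] = dict(self.branches[source])

    def merge(self, source, target):
        # Naive merge: paths from the source branch overwrite the target.
        self.branches[target].update(self.branches[source])


store = VersionedStore()
store.put("main", "events.csv", b"id,value\n1,10\n")
store.create_branch("experiment", "main")
store.put("experiment", "events.csv", b"id,value\n1,11\n")
print(store.get("main", "events.csv"), store.get("experiment", "events.csv"))
```

Creating the `experiment` branch adds no new blob; only the changed file adds one, and `main` stays untouched until the branch is merged.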

LakeFS also supports strict data integration and deployment best practices. It provides format, schema and file metadata validation to prevent low-quality data from entering the data lake and turning it into a data swamp.

You can try out the LakeFS command-line tool on the [Katacoda playground](https://www.katacoda.com/lakefs/scenarios/lakefs-play) and learn how to use it from the [official documentation](https://docs.lakefs.io/) and the [GitHub repository](https://github.com/treeverse/lakeFS).
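
As a rough illustration of what such a validation gate checks, here is a minimal schema check for a CSV file. It is a generic sketch and does not use LakeFS's own validation mechanism; the `EXPECTED_COLUMNS` schema and the `validate_csv` function are hypothetical.

```python
import csv
import io

# Hypothetical schema: column name -> function that parses a valid value
EXPECTED_COLUMNS = {"user_id": int, "amount": float}


def validate_csv(raw):
    """Return a list of problems found in the file; an empty list means it passes."""
    reader = csv.DictReader(io.StringIO(raw.decode("utf-8")))
    if reader.fieldnames is None or set(reader.fieldnames) != set(EXPECTED_COLUMNS):
        return [f"unexpected columns: {reader.fieldnames}"]

    problems = []
    # Data rows start on line 2 because line 1 is the header.
    for line_number, row in enumerate(reader, start=2):
        for column, parse in EXPECTED_COLUMNS.items():
            try:
                parse(row[column])
            except (TypeError, ValueError):
                problems.append(
                    f"line {line_number}: bad {column!r} value {row[column]!r}"
                )
    return problems


print(validate_csv(b"user_id,amount\n1,9.5\nabc,3\n"))
```

A check like this would run before new data is merged into the main branch, so a file that fails it never reaches the consumers of the data lake.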

### 4. [MinIO](https://min.io/)

![image.png](attachment:56ccfb92-ef05-40a9-b966-40d888a0c027.png)
<figcaption style="text-align: center;">
    <strong>
        MinIO landing page
    </strong>
</figcaption>

Another, more powerful alternative to JuiceFS and SeaweedFS is MinIO. Even though it is fairly young, MinIO has been the leader in the [hybrid cloud](https://en.wikipedia.org/wiki/Cloud_computing#Hybrid_cloud) sphere. It runs seamlessly in private and public clouds and supports a wide range of use cases, including AI/ML, analytics, backup/restore, and mobile and web applications.

The project boasts more than 28k stars on GitHub and almost 300 active contributors, making it the leading open-source object storage tool. For stricter security and continuous support, there are two paid plans as well.

MinIO also places a strong emphasis on software design: it is Kubernetes-native and has been S3-compatible from inception. It has more than 7.7M running instances in AWS, Azure and GCP, which is more than the rest of the private cloud combined.

In terms of performance, it can reach read and write speeds of 183 GB/s and 171 GB/s, respectively, and it integrates seamlessly into the Hadoop ecosystem.

You can start experimenting with it by following its [docs](https://docs.min.io/).

### Summary

Today, we discussed the 4 most popular open-source tools for working with object storage systems. While JuiceFS, SeaweedFS and MinIO provide object storage solutions that are built on top of cloud providers, LakeFS offers a Git-like data versioning system that can be used on top of any of the tools introduced today.

Choosing one over the other depends on your company and its business needs. If you only want a completely open-source tool, SeaweedFS might be a good option. If you want a tool that is backed by a massive community and also offers enterprise solutions for your specific needs, MinIO is the perfect candidate. If you are considering to 