

Big Data and Data Pipelines

  • Application Architecture, Configuration, and Deployment
    • Apache Brooklyn - A framework for modeling, monitoring, and managing applications through autonomic blueprints
    • Apache Bigtop - Project for Infrastructure Engineers and Data Scientists looking for comprehensive packaging, testing, and configuration of the leading open source big data components
    • Apache REEF - The Retainable Evaluator Execution Framework, a library for developing portable applications for cluster resource managers such as Apache Hadoop YARN or Apache Mesos
    • Apache Slider - An application to deploy existing distributed applications on an Apache Hadoop YARN cluster, monitor them, and make them larger or smaller as desired, even while the application is running
  • Data Storage, Resource Management, and Architecture
    • Apache Hadoop - Open-source software for reliable, scalable, distributed computing
    • Apache Hadoop HDFS - The primary distributed storage used by Hadoop applications
    • Apache Hadoop YARN - Pluggable architecture and resource management for data processing engines to interact with data stored in HDFS
    • Kite - A high-level data layer for Hadoop
  • Data Access
    • Apache Pig - A platform for analyzing large data sets that consists of a high-level language for expressing data analysis programs, coupled with infrastructure for evaluating these programs
    • Apache Hive - Data warehouse software that facilitates reading, writing, and managing large datasets residing in distributed storage using SQL
    • Apache Tez - An application framework which allows for a complex directed-acyclic-graph of tasks for processing data
    • Apache HBase - The Hadoop database, a distributed, scalable, big data store
    • Apache Kudu - Completes Hadoop's storage layer to enable fast analytics on fast data
    • Cloudera Impala - The open source, analytic MPP database for Apache Hadoop that provides the fastest time-to-insight
    • Apache Hive HCatalog - A table and storage management layer for Hadoop that enables users with different data processing tools — Pig, MapReduce — to more easily read and write data on the grid
    • Apache Drill - Schema-free SQL Query Engine for Hadoop, NoSQL and Cloud Storage
    • Presto - An open source distributed SQL query engine for running interactive analytic queries against data sources of all sizes ranging from gigabytes to petabytes
  • Data Cleaning and Integrity
    • OpenRefine - A powerful tool for working with messy data: cleaning it; transforming it from one format into another; and extending it with web services and external data
    • DataCleaner - A strong data profiling engine for discovering and analyzing the quality of your data
  • Data Ingestion and Integration
    • Apache Kafka - A distributed streaming platform
    • Apache Spark Streaming - An extension of the core Spark API that enables scalable, high-throughput, fault-tolerant stream processing of live data streams
    • Apache Sqoop - A tool designed for efficiently transferring bulk data between Apache Hadoop and structured datastores such as relational databases
    • Apache Flume - A distributed, reliable, and available service for efficiently collecting, aggregating, and moving large amounts of log data
    • Apache Storm - A free and open source distributed realtime computation system
    • Talend Open Studio - Open source integration software provider to data-driven enterprises
    • Pentaho Kettle
    • Blockspring
    • Apache Falcon - A feed processing and feed management system aimed at making it easier for end consumers to onboard their feed processing and feed management on Hadoop clusters
    • Logstash - An open source, server-side data processing pipeline that ingests data from a multitude of sources simultaneously, transforms it, and then sends it to your favorite “stash”
  • Data Processing (Batch, real-time/streaming, ...)
    • Apache Spark - A fast and general engine for large-scale data processing
      • Spark SQL - A Spark module for structured data processing
      • MLlib - Spark’s machine learning (ML) library
      • GraphX - Graphs and graph-parallel computation
      • Spark Streaming - An extension of the core Spark API that enables scalable, high-throughput, fault-tolerant stream processing of live data streams
    • AWS Kinesis - Real-time streaming data in the AWS cloud
      • Firehose - Easily load real-time streaming data into AWS
      • Analytics - Get actionable insights from streaming data in real-time
      • Streams - Build custom applications that process or analyze streaming data for specialized needs
    • Apache Hadoop MapReduce - A YARN-based system for parallel processing of large data sets
    • Apache Apex - Enterprise-grade unified stream and batch processing engine
    • Apache Samza - A distributed stream processing framework
    • Apache Storm - A free and open source distributed realtime computation system
    • Apache Ignite - A high-performance, integrated and distributed in-memory platform for computing and transacting on large-scale data sets in real-time, orders of magnitude faster than possible with traditional disk-based or flash technologies
    • Apache Flink - A streaming dataflow engine that provides data distribution, communication, and fault tolerance for distributed computations over data streams
    • Google Cloud Dataflow - A fully-managed cloud service and programming model for batch and streaming big data processing
    • AWS Data Pipeline - Easily automate the movement and transformation of data
    • AWS EMR - Easily run and scale Apache Hadoop, Spark, HBase, Presto, Hive, and other big data frameworks
    • Google Cloud Dataproc - A managed Apache Spark and Apache Hadoop service that lets you take advantage of open source data tools for batch processing, querying, streaming, and machine learning
  • Serverless, real-time analytics
  • Scheduling and Jobs
  • Analytics
  • Hadoop Enabled Applications
    • Cascading - The proven application development platform for building data applications on Hadoop
    • Cascalog - Fully-featured data processing and querying library for Clojure or Java
    • Scalding - An extension to Cascading that enables application development with Scala, a powerful language for solving functional problems
    • PyCascading - A Python wrapper for Cascading
  • Security
  • Operations
    • Apache Ambari - Tool for provisioning, managing, and monitoring Apache Hadoop clusters
  • Massively parallel processing database (MPP)

Enterprise Big Data and Analytics Products and Services

  • Databricks - Data integration, real-time exploration, and production pipelines in the cloud, powered by Apache® Spark
  • Talend - Open source integration software provider to data-driven enterprises
  • Teradata
    • Business Analytics Solutions
    • Analytical Architecture Consulting
    • Hybrid Cloud Products
  • Pentaho
  • Matlab - The Language of Technical Computing
  • HPE Vertica - Enables organizations to manage and analyze massive volumes of structured and semi-structured data quickly and reliably with no limits or business compromises
  • IBM SPSS Modeler - A predictive analytics platform that helps you build accurate predictive models quickly and deliver predictive intelligence to individuals, groups, systems and the enterprise
  • IBM SPSS Statistics - An integrated family of products that addresses the entire analytical process, from planning to data collection to analysis, reporting and deployment
  • SAS - Business intelligence software
  • Alteryx
  • Qubole
  • SAP
  • Splunk
  • FICO Big Data Analyzer - Formerly Karmasphere
  • - A powerful platform for enterprise data science
  • DataRobot - Advanced enterprise machine learning platform


  • AWS IoT - Easily and securely connect devices to the cloud


  • Load testing
    • Apache JMeter - Java application designed to load test functional behavior and measure performance