# **1.3 Exercise**
Michael J. Montana
College of Science and Technology, Bellevue University
DSC400: Big Data, Technology, and Algorithms
Professor Shawn Hermans
June 11, 2023

# Creating a Big Data Stack

## Assignment 2

Big data is a rapidly evolving, ever-changing field. Keeping track of the latest big data stacks, programming libraries, software, and other tools requires constant vigilance. Any book on big data will be out of date by the time it is published. We need a resource that is updated on a more frequent basis. 

This assignment will help create that resource by researching the latest big data tools and technologies. We will use this research to create an *Awesome Big Data* list. Below is a list of similar *awesome* lists that may be useful when creating our *Awesome Big Data* list. 

*[Awesome Python](https://awesome-python.com/)* is a curated list of awesome Python frameworks, libraries, software and resources. It was inspired by [awesome-php](https://github.com/ziadoz/awesome-php). 

*[Awesome Jupyter](https://github.com/markusschanta/awesome-jupyter)* is a curated list of awesome Jupyter projects, libraries and resources. Jupyter is an open-source web application that allows you to create and share documents that contain live code, equations, visualizations and narrative text.

*[Awesome Dash](https://github.com/ucg8j/awesome-dash)* is a curated list of awesome Dash (plotly) resources. Dash is a productive Python framework for building web applications. Written on top of Flask, Plotly.js, and React.js, Dash is ideal for building data visualization apps with highly custom user interfaces in pure Python. It's particularly suited for anyone who works with data in Python.

*[Awesome JavaScript](https://github.com/sorrycc/awesome-javascript)* is a collection of awesome browser-side JavaScript libraries, resources and shiny things. The [data visualization section](https://github.com/sorrycc/awesome-javascript#data-visualization) may be of use. 

*[Awesome Deep Learning](https://github.com/ChristosChristofidis/awesome-deep-learning)* is a curated list of awesome Deep Learning tutorials, projects and communities.

*[Awesome Machine Learning](https://github.com/josephmisiti/awesome-machine-learning)* is a curated list of awesome machine learning frameworks, libraries and software (by language).

*[Awesome Data Engineering](https://github.com/igorbarinov/awesome-data-engineering)* is a curated list of data engineering tools for software developers. 

*[Awesome Public Datasets](https://github.com/awesomedata/awesome-public-datasets)* is a topic-centric list of high-quality public data sources. They are collected and tidied from blogs, answers, and user responses. 

*[Awesome](https://github.com/sindresorhus/awesome)* is a list of awesome lists about all kinds of interesting topics.

### Assignment 2.1

Before we get started, we will assess your knowledge of big data by taking the [Pokémon or Big Data Quiz](http://pixelastic.github.io/pokemonorbigdata/). Don't worry. The quiz results won't impact your grade. 

Included below is code that fetches the answers to the questions and provides the results in a Pandas dataframe. 

In [None]:
import pandas as pd
quiz_answers_json = 'https://raw.githubusercontent.com/pixelastic/pokemonorbigdata/master/app/questions.json'
df_all = pd.read_json(quiz_answers_json)
# Pokémon answers
df_all[df_all['type'] == 0]

In [None]:
# Big data answers
df_all[df_all['type'] == 1]

### Assignment 2.2

In the next part of the assignment, you will populate the items with categories for our list. The first chapter of the textbook, *Big Data Science & Analytics: A Hands-On Approach*, provides a list of categories and subcategories for the big data stack. We will use these categories as a starting point, but will not constrain ourselves to them. 

When creating categories, avoid deeply nested layers of categories and subcategories. At most, define a top-level category with multiple subcategories. We will start with the following high-level categories and subcategories. 

***Categories***

We will use the distutils trove classification convention defined in [PEP 301](https://www.python.org/dev/peps/pep-0301/) when defining a category with a subcategory. This convention uses the double-colon ("::") to separate a category and subcategory. The following is an example of categories and subcategories as defined in the first chapter of the textbook, *Big Data Science & Analytics: A Hands-On Approach*. 

- Batch Analysis :: DAG
- Batch Analysis :: Machine Learning
- Batch Analysis :: MapReduce
- Batch Analysis :: Script
- Batch Analysis :: Search
- Batch Analysis :: Workflow Scheduling
- Data Access Connector :: Custom Connectors
- Data Access Connector :: Publish-Subscribe
- Data Access Connector :: Queues
- Data Access Connector :: SQL
- Data Access Connector :: Source-Sink
- Data Storage :: Distributed File System
- Data Storage :: NoSQL
- Deployment :: NoSQL
- Deployment :: SQL
- Deployment :: Visualization Frameworks
- Deployment :: Web Frameworks
- Interactive Querying :: Analytic SQL
- Real-Time Analysis :: In-Memory
- Real-Time Analysis :: Stream Processing
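
The double-colon convention is easy to handle in code. The following is a minimal sketch of splitting a category label into its top-level category and optional subcategory. The `Category` tuple and the `parse_category` helper are hypothetical names introduced here for illustration; they are not part of the assignment.

```python
from typing import NamedTuple, Optional


class Category(NamedTuple):
    """A top-level category with an optional subcategory (illustrative type)."""
    top_level: str
    subcategory: Optional[str]


def parse_category(label: str) -> Category:
    """Split 'Top :: Sub' on the first '::'; subcategory is None when absent."""
    top_level, separator, subcategory = label.partition("::")
    return Category(top_level.strip(), subcategory.strip() if separator else None)


print(parse_category("Batch Analysis :: DAG"))
print(parse_category("Message Queues"))
```

Because the split happens only on the first `::`, a label without a subcategory still parses cleanly, which matches the advice to keep nesting to one level at most.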

Below is a list containing categories and suggested starting points for research. Fill in at least two items from each of the suggested categories. Create at least one category that is not listed and add two items to that category. 

* AI and Machine Learning
    * Apache Spark's MLlib
    * H2O
    * TensorFlow
* Batch Processing
    * Apache
    * Apache Spark
    * Dask
    * MapReduce
* Cloud and Data Platforms
    * Amazon Web Services
    * Cloudera Data Platform
    * Google Cloud Platform
    * Microsoft Azure
* Container Engines and Orchestration
    * Docker
    * Docker Swarm
    * Kubernetes
    * Podman
* Data Storage :: Block Storage
    * Amazon EBS
    * OpenEBS
* Data Storage :: Cluster Storage
    * Ceph
    * HDFS
* Data Storage :: Object Storage
    * Amazon S3
    * MinIO
* Data Transfer Tools
    * Apache Sqoop
* Full-Text Search
    * Apache Solr
    * Elasticsearch
* Interactive Query
    * Apache Hive
    * Google BigQuery
    * Spark SQL
* Message Queues
    * Apache Kafka
    * RabbitMQ
* NoSQL :: Document Databases
    * CouchDB
    * Google Firestore
    * MongoDB
* NoSQL :: Graph Databases
    * DGraph
    * Neo4j
* NoSQL :: Key-Value Databases
    * Amazon DynamoDB
* NoSQL :: Time-Series Databases
    * TSDB
* Serverless Functions
    * AWS Lambda
    * OpenFaaS
* Stream Processing
    * Apache Spark's Structured Streaming
    * Apache Storm
    * Google Dataflow
* Visualization Frameworks
    * Apache Superset
    * Redash
* Workflow Engine
    * Apache Airflow
    * Google Cloud Composer
    * Oozie
    
We populate the list items using the `ListItem` class, defined below. The following is a description of the `ListItem` fields. 

**name**

The proper name of the list item

**website**

Link to the item's website.  Include `http://` or `https://` in the link. 

**category**

Category and optional subcategory for the item. 

**short_description**

Provide a short, one to two-sentence description of the item. 

In [41]:
from dataclasses import dataclass

@dataclass(frozen=True)
class ListItem:
    name: str
    website: str
    category: str
    short_description: str
    
all_items = set()

The following is an example of creating the entry for AWS as a separate variable and then adding it to the `all_items` set. 

In [42]:
aws = ListItem(
    'Amazon Web Services',
    'https://aws.amazon.com/',
    'Cloud and Data Platforms',
    """Provides on-demand cloud computing platforms and APIs to individuals, 
    companies, and governments, on a metered pay-as-you-go basis."""
)

all_items.add(aws)

You can also add an item to `all_items` directly, without assigning it to a variable first.

In [43]:
all_items.remove(aws)
all_items.add(ListItem(
    'Amazon Web Services',
    'https://aws.amazon.com/',
    'Cloud and Data Platforms',
    """Provides on-demand cloud computing platforms and APIs to individuals, 
    companies, and governments, on a metered pay-as-you-go basis."""
))
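
The reason `ListItem` instances can live in a set is that `@dataclass(frozen=True)` makes them immutable and hashable. The following is a small sketch of that behavior. The class is repeated so the block runs on its own, and the short description text is a placeholder, not an official description.

```python
from dataclasses import dataclass, FrozenInstanceError


@dataclass(frozen=True)
class ListItem:
    name: str
    website: str
    category: str
    short_description: str


items = set()
first = ListItem("Dask", "https://www.dask.org/", "Batch Processing", "placeholder description")
duplicate = ListItem("Dask", "https://www.dask.org/", "Batch Processing", "placeholder description")

items.add(first)
items.add(duplicate)  # equal field values hash the same, so the set keeps one entry

print(len(items))          # 1
print(first == duplicate)  # True

try:
    first.name = "dask"    # frozen dataclasses reject attribute assignment
except FrozenInstanceError:
    print("ListItem fields are read-only")
```

One consequence is that adding the same entry twice is harmless when `all_items` is a set, so the `remove` call is only needed when you want to replace an entry with different field values.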

In [44]:
# TODO: Fill in at least two items from each of the suggested categories.

# Note: this rebinds all_items from the set created above to a list.
all_items = [
    # AI and Machine Learning
    ListItem(
        "Apache Spark's MLlib",
        "https://spark.apache.org/mllib/",
        "AI and Machine Learning",
        """MLlib is Apache Spark's scalable machine learning library."""
    ),
    ListItem(
        "H2O",
        "https://h2o.ai/",
        "AI and Machine Learning",
        """Powered by world-class automated machine learning, the H2O AI Cloud enables organizations to build predictive models and gain insights from their data quickly and easily."""
    ),
    ListItem(
        "TensorFlow",
        "https://www.tensorflow.org/",
        "AI and Machine Learning",
        """TensorFlow is an end-to-end open source platform for machine learning."""
    ),
    # Batch Processing
    ListItem(
        "Apache Spark",
        "https://spark.apache.org/",
        "Batch Processing",
        """Apache Spark™ is a multi-language engine for executing data engineering, data science, and machine learning on single-node machines or clusters."""
    ),
    ListItem(
        "Dask",
        "https://www.dask.org/",
        "Batch Processing",
        """Dask makes it easy to scale the Python libraries that you know and love like NumPy, pandas, and scikit-learn."""
    ),
    ListItem(
        "MapReduce",
        "https://hadoop.apache.org/docs/stable/hadoop-mapreduce-client/hadoop-mapreduce-client-core/MapReduceTutorial.html",
        "Batch Processing",
        """Hadoop MapReduce is a software framework for easily writing applications which process vast amounts of data (multi-terabyte data-sets) in-parallel on large clusters (thousands of nodes) of commodity hardware in a reliable, fault-tolerant manner."""
    ),
    # Cloud and Data Platforms
    ListItem(
        "Amazon Web Services",
        "https://aws.amazon.com/",
        "Cloud and Data Platforms",
        """Provides on-demand cloud computing platforms and APIs to individuals, companies, and governments, on a metered pay-as-you-go basis."""
    ),
    ListItem(
        "Cloudera Data Platform",
        "https://www.cloudera.com/",
        "Cloud and Data Platforms",
        """Cloudera Data Platform (CDP) is a hybrid data platform designed for unmatched freedom to choose—any cloud, any analytics, any data."""
    ),
    ListItem(
        "Google Cloud Platform",
        "https://cloud.google.com/",
        "Cloud and Data Platforms",
        """A transformation cloud accelerates an organization’s digital transformation through data democratization, app and infrastructure modernization, people connections, and trusted transactions."""
    ),
    ListItem(
        "Microsoft Azure",
        "https://azure.microsoft.com/en-us",
        "Cloud and Data Platforms",
        """On-premises, hybrid, multicloud, or at the edge—create secure, future-ready cloud solutions on Azure"""
    ),
    # Container Engines and Orchestration
    ListItem(
        "Docker",
        "https://www.docker.com/",
        "Container Engines and Orchestration",
        """Docker Engine is an open source containerization technology for building and containerizing your applications."""
    ),
    ListItem(
        "Docker Swarm",
        "https://docs.docker.com/engine/swarm/",
        "Container Engines and Orchestration",
        """Swarm mode is an advanced feature for managing a cluster of Docker daemons."""
    ),
    ListItem(
        "Kubernetes",
        "https://kubernetes.io/",
        "Container Engines and Orchestration",
        """Kubernetes, also known as K8s, is an open-source system for automating deployment, scaling, and management of containerized applications."""
    ),
    ListItem(
        "Podman",
        "https://podman.io/",
        "Container Engines and Orchestration",
        """Manage containers, pods, and images with Podman. Seamlessly work with containers and Kubernetes from your local environment."""
    ),
    # Data Storage :: Block Storage
    ListItem(
        "Amazon EBS",
        "https://aws.amazon.com/ebs/",
        "Data Storage :: Block Storage",
        """Amazon Elastic Block Store (Amazon EBS) is an easy-to-use, scalable, high-performance block-storage service designed for Amazon Elastic Compute Cloud (Amazon EC2)."""
    ),
    ListItem(
        "OpenEBS",
        "https://openebs.io/",
        "Data Storage :: Block Storage",
        """OpenEBS helps Developers and Platform SREs easily deploy Kubernetes Stateful Workloads that require fast and highly reliable container attached storage. OpenEBS turns any storage available on the Kubernetes worker nodes into local or distributed Kubernetes Persistent Volumes."""
    ),
    # Data Storage :: Cluster Storage
    ListItem(
        "Ceph",
        "https://ceph.io/en/",
        "Data Storage :: Cluster Storage",
        """Use Ceph to transform your storage infrastructure. Ceph provides a unified storage service with object, block, and file interfaces from a single cluster built from commodity hardware components."""
    ),
    ListItem(
        "HDFS",
        "https://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-hdfs/HdfsDesign.html",
        "Data Storage :: Cluster Storage",
        """The Hadoop Distributed File System (HDFS) is a distributed file system designed to run on commodity hardware. HDFS is highly fault-tolerant and provides high throughput access to application data, making it suitable for applications that have large data sets."""
    ),
    # Data Storage :: Object Storage
    ListItem(
        "Amazon S3",
        "https://aws.amazon.com/s3/",
        "Data Storage :: Object Storage",
        """Amazon Simple Storage Service (Amazon S3) is an object storage service offering industry-leading scalability, data availability, security, and performance. Customers of all sizes and industries can store and protect any amount of data for virtually any use case, such as data lakes, cloud-native applications, and mobile apps. With cost-effective storage classes and easy-to-use management features, you can optimize costs, organize data, and configure fine-tuned access controls to meet specific business, organizational, and compliance requirements."""
    ),
    ListItem(
        "MinIO",
        "https://min.io/",
        "Data Storage :: Object Storage",
        """MinIO is a high-performance, S3 compatible object store. It is built for large scale AI/ML, data lake and database workloads. It runs on-prem and on any cloud (public or private) and from the data center to the edge. MinIO is software-defined and open source under GNU AGPL v3."""
    ),
    # Data Transfer Tools
    ListItem(
        "Apache Sqoop",
        "https://sqoop.apache.org/",
        "Data Transfer Tools",
        """This project has retired.  Apache Sqoop(TM) is a tool designed for efficiently transferring bulk data between Apache Hadoop and structured datastores such as relational databases."""
    ),
    # Full-Text Search
    ListItem(
        "Apache Solr",
        "https://solr.apache.org/",
        "Full-Text Search",
        """Solr is highly reliable, scalable and fault tolerant, providing distributed indexing, replication and load-balanced querying, automated failover and recovery, centralized configuration and more. Solr powers the search and navigation features of many of the world's largest internet sites."""
    ),
    ListItem(
        "Elasticsearch",
        "https://www.elastic.co/",
        "Full-Text Search",
        """Elasticsearch is a distributed, RESTful search and analytics engine built on Apache Lucene."""
    ),
    # Interactive Query
    ListItem(
        "Apache Hive",
        "https://hive.apache.org/",
        "Interactive Query",
        """The Apache Hive ™ is a distributed, fault-tolerant data warehouse system that enables analytics at a massive scale and facilitates reading, writing, and managing petabytes of data residing in distributed storage using SQL."""
    ),
    ListItem(
        "Google BigQuery",
        "https://cloud.google.com/bigquery/",
        "Interactive Query",
        """BigQuery is a serverless and cost-effective enterprise data warehouse that works across clouds and scales with your data. Use built-in ML/AI and BI for insights at scale."""
    ),
    ListItem(
        "Spark SQL",
        "https://spark.apache.org/sql/",
        "Interactive Query",
        """Spark SQL is Apache Spark's module for working with structured data. Spark SQL lets you query structured data inside Spark programs, using either SQL or a familiar DataFrame API. Usable in Java, Scala, Python and R."""
    ),
    # Message Queues
    ListItem(
        "Apache Kafka",
        "https://kafka.apache.org/",
        "Message Queues",
        """Apache Kafka is an open-source distributed event streaming platform used by thousands of companies for high-performance data pipelines, streaming analytics, data integration, and mission-critical applications."""
    ),
    ListItem(
        "RabbitMQ",
        "https://www.rabbitmq.com/",
        "Message Queues",
        """RabbitMQ is an open source message broker.  RabbitMQ is lightweight and easy to deploy on premises and in the cloud. It supports multiple messaging protocols. RabbitMQ can be deployed in distributed and federated configurations to meet high-scale, high-availability requirements."""
    ),
    # NoSQL :: Document Databases
    ListItem(
        "CouchDB",
        "https://couchdb.apache.org/",
        "NoSQL :: Document Databases",
        """Seamless multi-master sync, that scales from Big Data to Mobile, with an Intuitive HTTP/JSON API and designed for Reliability."""
    ),
    ListItem(
        "Google Firestore",
        "https://firebase.google.com/docs/firestore",
        "NoSQL :: Document Databases",
        """Cloud Firestore is a flexible, scalable database for mobile, web, and server development from Firebase and Google Cloud."""
    ),
    ListItem(
        "MongoDB",
        "https://www.mongodb.com/",
        "NoSQL :: Document Databases",
        """MongoDB Atlas combines the leading document-oriented database with a full suite of developer tools for accelerating app development while eliminating integration work."""
    ),
    # NoSQL :: Graph Databases
    ListItem(
        "DGraph",
        "https://dgraph.io/",
        "NoSQL :: Graph Databases",
        """Graph database with native GraphQL and graph query language"""
    ),
    ListItem(
        "Neo4j",
        "https://neo4j.com/",
        "NoSQL :: Graph Databases",
        """The world's leading graph database as a fully-managed cloud service — zero-admin, globally available and always-on."""
    ),
    # NoSQL :: Key-Value Databases
    ListItem(
        "Amazon DynamoDB",
        "https://aws.amazon.com/dynamodb/",
        "NoSQL :: Key-Value Databases",
        """Amazon DynamoDB is a fully managed, serverless, key-value NoSQL database designed to run high-performance applications at any scale. DynamoDB offers built-in security, continuous backups, automated multi-Region replication, in-memory caching, and data import and export tools."""
    ),
    # NoSQL :: Time-Series Databases
    ListItem(
        "OpenTSDB",
        "http://opentsdb.net/overview.html",
        "NoSQL :: Time-Series Databases",
        """OpenTSDB consists of a Time Series Daemon (TSD) as well as a set of command line utilities. Interaction with OpenTSDB is primarily achieved by running one or more of the TSDs."""
    ),
    # Serverless Functions
    ListItem(
        "AWS Lambda",
        "https://aws.amazon.com/lambda/",
        "Serverless Functions",
        """AWS Lambda is a serverless, event-driven compute service that lets you run code for virtually any type of application or backend service without provisioning or managing servers. You can trigger Lambda from over 200 AWS services and software as a service (SaaS) applications, and only pay for what you use."""
    ),
    ListItem(
        "OpenFaaS",
        "https://www.openfaas.com/",
        "Serverless Functions",
        """OpenFaaS® makes it easy for developers to deploy event-driven functions and microservices to Kubernetes without repetitive, boiler-plate coding. Package your code or an existing binary in a Docker image to get a highly scalable endpoint with auto-scaling and metrics."""
    ),
    # Stream Processing
    ListItem(
        "Apache Spark's Structured Streaming",
        "https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#overview",
        "Stream Processing",
        """Spark Structured Streaming is a stream processing engine built on Spark SQL that processes data incrementally and updates the final results as more streaming data arrives. It brought a lot of ideas from other structured APIs in Spark (Dataframe and Dataset) and offered query optimizations similar to SparkSQL."""
    ),
    ListItem(
        "Apache Storm",
        "https://storm.apache.org/",
        "Stream Processing",
        """Apache Storm is a free and open source distributed realtime computation system. Apache Storm makes it easy to reliably process unbounded streams of data, doing for realtime processing what Hadoop did for batch processing. Apache Storm is simple, can be used with any programming language, and is a lot of fun to use!"""
    ),
    ListItem(
        "Google Dataflow",
        "https://cloud.google.com/dataflow",
        "Stream Processing",
        """Unified stream and batch data processing that's serverless, fast, and cost-effective.  Fully managed data processing service.  Automated provisioning and management of processing resources."""
    ),
    # Visualization Frameworks
    ListItem(
        "Apache Superset",
        "https://superset.apache.org/",
        "Visualization Frameworks",
        """Apache Superset is a modern data exploration and visualization platform. Superset is fast, lightweight, intuitive, and loaded with options that make it easy for users of all skill sets to explore and visualize their data, from simple line charts to highly detailed geospatial charts."""
    ),
    ListItem(
        "Redash",
        "https://redash.io/",
        "Visualization Frameworks",
        """Redash helps you make sense of your data.  Connect and query your data sources, build dashboards to visualize data and share them with your company."""
    ),
    # Workflow Engine
    ListItem(
        "Apache Airflow",
        "https://airflow.apache.org/",
        "Workflow Engine",
        """Airflow is a platform created by the community to programmatically author, schedule and monitor workflows."""
    ),
    ListItem(
        "Google Cloud Composer",
        "https://cloud.google.com/composer",
        "Workflow Engine",
        """A fully managed workflow orchestration service built on Apache Airflow. Author, schedule, and monitor pipelines that span across hybrid and multi-cloud environments."""
    ),
    ListItem(
        "Oozie",
        "https://oozie.apache.org/",
        "Workflow Engine",
        """Oozie is a workflow scheduler system to manage Apache Hadoop jobs.  Oozie Workflow jobs are Directed Acyclical Graphs (DAGs) of actions.  Oozie Coordinator jobs are recurrent Oozie Workflow jobs triggered by time (frequency) and data availability."""
    ),

# TODO: Create at least one category that is not listed and add two items to that category.

    # Data Access Connector :: Publish-Subscribe
    ListItem(
        "Amazon Kinesis",
        "https://aws.amazon.com/kinesis/",
        "Data Access Connector :: Publish-Subscribe",
        """Amazon Kinesis cost-effectively processes and analyzes streaming data at any scale as a fully managed service. With Kinesis, you can ingest real-time data, such as video, audio, application logs, website clickstreams, and IoT telemetry data, for machine learning (ML), analytics, and other applications."""
    ),
    ListItem(
        "Azure Event Hubs",
        "https://azure.microsoft.com/en-us/products/event-hubs",
        "Data Access Connector :: Publish-Subscribe",
        """Event Hubs is a fully managed, real-time data ingestion service that’s simple, trusted, and scalable. Stream millions of events per second from any source to build dynamic data pipelines and immediately respond to business challenges."""
    )
]
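
Once the list is populated, it can help to check it against the assignment's rules: at least two items per category, and every website starting with `http://` or `https://`. The following is a minimal sketch of such a check. The helper names `categories_below_minimum` and `websites_missing_scheme` are hypothetical, the class is repeated so the block runs on its own, and the sample entries use placeholder descriptions.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class ListItem:
    name: str
    website: str
    category: str
    short_description: str


def categories_below_minimum(items, minimum=2):
    """Return {category: count} for categories with fewer than `minimum` items."""
    counts = Counter(item.category for item in items)
    return {category: count for category, count in counts.items() if count < minimum}


def websites_missing_scheme(items):
    """Return the names of items whose website lacks an http:// or https:// prefix."""
    return [
        item.name
        for item in items
        if not item.website.startswith(("http://", "https://"))
    ]


# Small sample with placeholder descriptions; swap in all_items in the notebook.
sample = [
    ListItem("Apache Kafka", "https://kafka.apache.org/", "Message Queues", "(description omitted)"),
    ListItem("RabbitMQ", "https://www.rabbitmq.com/", "Message Queues", "(description omitted)"),
    ListItem("Apache Sqoop", "https://sqoop.apache.org/", "Data Transfer Tools", "(description omitted)"),
]

print(categories_below_minimum(sample))  # {'Data Transfer Tools': 1}
print(websites_missing_scheme(sample))   # []
```

Both helpers only read the `category` and `website` fields, so they work whether `all_items` is a list or a set.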

### Assignment 2.3 (Optional)

Use the `all_items` data to create Markdown output that mirrors the output of [Awesome Python](https://raw.githubusercontent.com/vinta/awesome-python/master/README.md). You can use the `jinja2` template engine to complete this task. This part of the assignment is entirely optional and is not graded. 
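
As one possible approach, the following sketch groups items by category and emits one Markdown bullet per item. It deliberately uses only the standard library (`itertools.groupby`) rather than `jinja2`, so it is an alternative to the template route, not a replacement for it. The `render_markdown` name, the `##` heading level, and the bullet layout are assumptions that only loosely follow the Awesome Python README, and the sample descriptions are shortened.

```python
from dataclasses import dataclass
from itertools import groupby


@dataclass(frozen=True)
class ListItem:
    name: str
    website: str
    category: str
    short_description: str


def render_markdown(items):
    """Render items as an Awesome-style Markdown list grouped by category."""
    lines = [
        "# Awesome Big Data",
        "",
        "A curated list of awesome big data frameworks, libraries, software and resources.",
        "",
    ]
    # groupby only merges adjacent items, so sort by category (then name) first.
    ordered = sorted(items, key=lambda item: (item.category, item.name.lower()))
    for category, group in groupby(ordered, key=lambda item: item.category):
        lines.append(f"## {category}")
        lines.append("")
        for item in group:
            # Collapse whitespace so multi-line descriptions stay on one bullet.
            description = " ".join(item.short_description.split())
            lines.append(f"* [{item.name}]({item.website}) - {description}")
        lines.append("")
    return "\n".join(lines)


sample = [
    ListItem("RabbitMQ", "https://www.rabbitmq.com/", "Message Queues",
             "RabbitMQ is an open source message broker."),
    ListItem("Apache Kafka", "https://kafka.apache.org/", "Message Queues",
             """Apache Kafka is an open-source distributed
             event streaming platform."""),
]

print(render_markdown(sample))
```

The same grouping could be expressed inside a `jinja2` template; the plain-Python version is shown here because it has no dependencies beyond the standard library.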

In [45]:
import jinja2

template = jinja2.Template("""
# Awesome Big Data

A curated list of awesome big data frameworks, libraries, software and resources.

Inspired by [awesome-php](https://github.com/ziadoz/awesome-php).
""")