KennethNjuguna/Exploration-of-Cloudera-VM

Training exercises for Cloudera's Distribution for Hadoop; Cloudera is a Hybrid Data company.


Cloudera's Distribution for Hadoop provides a number of services, including:

  • HBase
  • Hue
  • Impala
  • Oozie
  • Spark

  • Oozie
  • Workflow scheduler system to manage Apache Hadoop jobs.
  • Oozie Coordinator jobs.
  • Supports MapReduce, Pig, Apache Hive, and Sqoop.
  • Workflows in Oozie are defined as a collection of control-flow and action nodes in a directed acyclic graph. Control-flow nodes define the beginning and the end of a workflow (start, end, and failure nodes) as well as mechanisms to control the workflow execution path (decision, fork, and join nodes).


  • Hue (Hadoop User Experience)
  • Hue is a web-based interactive query editor that enables you to interact with data warehouses. For example, Hue can display a graphic representation of Impala SQL query results.


    You can use Hue to:

  • Explore, browse, and import your data through guided navigation in the left panel of the page.

    This panel enables you to:

    • Browse your databases
    • Drill down to specific tables
    • View HDFS directories and Cloud storage
    • Discover indexes and HDFS or Kudu tables
    • Find documents

    Objects can be tagged for quick retrieval, project association, or to assign a more "human readable" name if desired.

  • Query your data, create a custom dashboard, or schedule repetitive jobs in the central panel of the page.


    The central panel of the page provides a rich toolset, including:

    • Versatile editors that enable you to create a wide variety of scripts.
    • Dashboards that you can create "on-the-fly" by dragging and dropping elements into the central panel of the Hue interface. No programming is required. Then you can use your custom dashboard to explore your data.
    • Schedulers that you can create by dragging and dropping, just like the dashboards. This feature enables you to create custom workflows and to schedule them to run automatically on a regular basis. A monitoring interface shows the progress and logs, and makes it possible to stop or pause jobs.
  • Get expert advice on how to complete a task in the assistant panel on the right of the page.

      The assistant panel on the right provides expert advice and hints for whatever application is currently being used in the central panel. For example, when the Impala SQL editor is active, Impala SQL hints are provided to help construct queries in the central panel.

    • Impala
    • Impala provides fast, interactive SQL queries directly on your Apache Hadoop data stored in HDFS, HBase, or Amazon Simple Storage Service (S3). In addition to using the same unified storage platform, Impala also uses the same metadata, SQL syntax (Hive SQL), ODBC driver, and user interface (Impala query UI in Hue) as Apache Hive. This provides a familiar and unified platform for real-time or batch-oriented queries.

      Impala is an addition to the tools available for querying big data. Impala does not replace batch processing frameworks built on MapReduce, such as Hive. Hive and other frameworks built on MapReduce are best suited for long-running batch jobs, such as batch Extract, Transform, and Load (ETL) jobs.

      How Impala works with Apache Hadoop

      The Impala solution is composed of the following components:

    • Clients - Entities including Hue, ODBC clients, JDBC clients, and the Impala Shell can all interact with Impala. These interfaces are typically used to issue queries or complete administrative tasks such as connecting to Impala.
    • Hive Metastore - Stores information about the data available to Impala. For example, the metastore lets Impala know what databases are available and what the structure of those databases is. As you create, drop, and alter schema objects, load data into tables, and so on through Impala SQL statements, the relevant metadata changes are automatically broadcast to all Impala nodes by the dedicated catalog service introduced in Impala 1.2.
    • Impala - This process, which runs on DataNodes, coordinates and executes queries. Each instance of Impala can receive, plan, and coordinate queries from Impala clients. Queries are distributed among Impala nodes, and these nodes then act as workers, executing parallel query fragments.
    • HBase and HDFS - Storage for data to be queried.
    Queries executed using Impala are handled as follows (a minimal client-side sketch appears after the steps):

    • User applications send SQL queries to Impala through ODBC or JDBC, which provide standardized querying interfaces. The user application may connect to any impalad in the cluster. This impalad becomes the coordinator for the query.
    • Impala parses the query and analyzes it to determine what tasks need to be performed by impalad instances across the cluster. Execution is planned for optimal efficiency.
    • Services such as HDFS and HBase are accessed by local impalad instances to provide data.
    • Each impalad returns data to the coordinating impalad, which sends these results to the client.
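
    As an illustration of the client side of this flow, the sketch below uses the third-party impyla Python package to connect to a single impalad (which then acts as the query coordinator) and run a query. The host, port, database, and table names are assumptions made up for illustration.

```python
# Minimal client-side sketch using the impyla package (pip install impyla).
# Host, database, and table names are placeholders, not values from this repo.
from impala.dbapi import connect

# Connect to any impalad in the cluster; that daemon becomes the
# coordinator for queries issued over this connection.
conn = connect(host="impalad-host.example.com", port=21050)
cur = conn.cursor()

# The coordinator plans the query, distributes fragments to the other
# impalad workers, and streams the combined results back to the client.
cur.execute("SELECT COUNT(*) FROM my_database.my_table")
print(cur.fetchall())

cur.close()
conn.close()
```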
      Primary Impala Features

      The key features of Impala are:

      • Provides support for in-memory data processing; it can access and analyze data stored on Hadoop DataNodes without any data movement.
      • Using Impala, we can access data with SQL-like queries.
      • Apache Impala provides faster access to data stored in the Hadoop Distributed File System compared to other SQL engines like Hive.
      • Impala can work with data stored in systems such as Apache HBase, HDFS, and Amazon S3.
      • We can easily integrate Impala with business intelligence tools such as Tableau, MicroStrategy, Pentaho, and Zoomdata.
      • Provides support for various file formats such as LZO, Avro, RCFile, SequenceFile, and Parquet (see the short example after this list).
      • Apache Impala uses the same ODBC driver, user interface, metadata, and SQL syntax as Apache Hive.
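
      As a small illustration of the Hive-compatible SQL and Parquet support mentioned above, the hedged sketch below creates a Parquet-backed table from an existing table. The database and table names are hypothetical.

```python
# Hypothetical sketch: create a Parquet table from an existing text table
# using Impala's CREATE TABLE ... AS SELECT (CTAS) syntax.
from impala.dbapi import connect

conn = connect(host="impalad-host.example.com", port=21050)
cur = conn.cursor()
cur.execute("""
    CREATE TABLE IF NOT EXISTS analytics.events_parquet
    STORED AS PARQUET
    AS SELECT * FROM analytics.events_text
""")
conn.close()
```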

      Conclusion.

      In short, we can say that Impala is an open-source, native analytics database for Hadoop. Impala is a powerful SQL engine that provides very fast access to data stored in HDFS (Hadoop Distributed File System). Impala uses the same Hive Query Language (SQL) syntax, metadata, user interface, and ODBC drivers as Apache Hive. Unlike traditional storage systems, Apache Impala is not tied to its storage core. It consists of three core components: the Impala daemon, the Impala statestore, and the Impala catalog service. The Impala Shell, the Hue browser, and the JDBC/ODBC drivers are three query-processing interfaces we can use to interact with Apache Impala.

    • Oozie
    • The blueprint for Enterprise Hadoop includes Apache™ Hadoop’s original data storage and data processing layers and also adds components for services that enterprises must have in a modern data architecture: data integration and governance, security and operations. Apache Oozie provides some of the operational services for a Hadoop cluster, specifically around job scheduling within the cluster.

      What Oozie Does

      Apache Oozie is a Java Web application used to schedule Apache Hadoop jobs. Oozie combines multiple jobs sequentially into one logical unit of work. It is integrated with the Hadoop stack, with YARN as its architectural center, and supports Hadoop jobs for Apache MapReduce, Apache Pig, Apache Hive, and Apache Sqoop. Oozie can also schedule jobs specific to a system, like Java programs or shell scripts.

      Apache Oozie is a tool for Hadoop operations that allows cluster administrators to build complex data transformations out of multiple component tasks. This provides greater control over jobs and also makes it easier to repeat those jobs at predetermined intervals. At its core, Oozie helps administrators derive more value from Hadoop.

      It consists of two parts:

      • Workflow engine: The responsibility of a workflow engine is to store and run workflows composed of Hadoop jobs, e.g., MapReduce, Pig, and Hive.
      • Coordinator engine: It runs workflow jobs based on predefined schedules and the availability of data.

        Oozie is scalable and can manage the timely execution of thousands of workflows, each consisting of dozens of jobs, in a Hadoop cluster.


        Oozie is also very flexible. One can easily start, stop, suspend, and rerun jobs, and Oozie makes it very easy to rerun failed workflows. It is easy to see how difficult it can be to catch up on jobs missed or failed because of downtime or failure; with Oozie it is even possible to skip a specific failed node.

        How does Oozie work?

        • Oozie runs as a service in the cluster and clients submit workflow definitions for immediate or later processing.
        • An Oozie workflow consists of action nodes and control-flow nodes (a minimal workflow sketch follows this list).
        • An action node represents a workflow task, e.g., moving files into HDFS, running a MapReduce, Pig, or Hive job, importing data using Sqoop, or running a shell script or a program written in Java.
        • A control-flow node controls the workflow execution between actions by allowing constructs like conditional logic, wherein different branches may be followed depending on the result of an earlier action node.
        • Start Node, End Node, and Error Node fall under this category of nodes.
        • Start Node designates the start of the workflow job.
        • End Node signals the end of the job.
        • Error Node designates the occurrence of an error and the corresponding error message to be printed.
        • At the end of execution of a workflow, an HTTP callback is used by Oozie to update the client with the workflow status. Entry to or exit from an action node may also trigger the callback.
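
        Below is a minimal, hypothetical sketch of what such a workflow definition and its submission can look like. The XML follows the standard Oozie workflow schema, but the action, script, host, and property names are placeholders, and the Python wrapper simply shells out to the standard Oozie command-line client.

```python
# Sketch only: a minimal Oozie workflow with one action node and the usual
# control-flow nodes (start, kill, end). Hosts, paths, and property names
# below are placeholders, not values from this repository.
import subprocess

WORKFLOW_XML = """
<workflow-app name="demo-wf" xmlns="uri:oozie:workflow:0.5">
    <start to="pig-node"/>
    <action name="pig-node">
        <pig>
            <job-tracker>${jobTracker}</job-tracker>
            <name-node>${nameNode}</name-node>
            <script>clean_data.pig</script>
        </pig>
        <ok to="end"/>
        <error to="fail"/>
    </action>
    <kill name="fail">
        <message>Pig action failed</message>
    </kill>
    <end name="end"/>
</workflow-app>
"""

# After workflow.xml and the Pig script have been copied to HDFS and a
# job.properties file points at them, the workflow can be run with the
# Oozie command-line client (the Oozie server listens on port 11000 by default):
subprocess.run(
    ["oozie", "job",
     "-oozie", "http://oozie-host.example.com:11000/oozie",
     "-config", "job.properties", "-run"],
    check=True,
)
```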
        • Spark
        • Apache Spark is the open standard for flexible in-memory data processing that enables batch, real-time, and advanced analytics on the Apache Hadoop platform. Cloudera is committed to helping the ecosystem adopt Spark as the default data execution engine for analytic workloads.

          Easy, Productive Development.

          Simple, yet rich, APIs for Java, Scala, and Python open up data for interactive discovery and iterative development of applications. Through shared common code, data scientists and developers can increase productivity with rapid prototyping for batch and streaming applications, using the language and third-party tools on which they already rely.
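
          As a quick taste of that Python API, here is a minimal PySpark sketch; the input path and the "region" column are assumptions made up for illustration.

```python
# Minimal PySpark sketch (run with the pyspark shell or spark-submit).
# The CSV path and the "region" column are illustrative assumptions.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("interactive-discovery").getOrCreate()

# Load a CSV file into a DataFrame and explore it interactively.
df = spark.read.csv("/data/sales.csv", header=True, inferSchema=True)
df.printSchema()
df.groupBy("region").count().show()

spark.stop()
```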

          Fast Processing.

          Take advantage of Spark’s distributed in-memory storage for high performance processing across a variety of use cases, including batch processing, real-time streaming, and advanced modeling and analytics. With significant performance improvements over MapReduce, Spark is the tool of choice for data scientists and analysts to turn their data into real results.

          Integrated across the platform.

          As an integrated part of Cloudera’s platform, Spark benefits from unified resource management (through YARN), simple administration (through Cloudera Manager), and compliance-ready security and governance (through Apache Sentry and Cloudera Navigator) — all critical for running in production.

          The Cloudera difference for Apache Spark.

          The first integrated solution to support Apache Spark, Cloudera not only has the most experience — with production customers across industries — but also has built the deepest engineering integration between Spark and the rest of the ecosystem, including bringing Spark to YARN and adding necessary security and management integrations (500+ patches contributed, to date). Cloudera also has multiple Spark committers on staff, so you get direct access and influence to the roadmap based on your needs and use cases.

          Features of Apache Spark.

        • Batch/streaming data.
        • Unify the processing of your data in batches and real-time streaming, using your preferred language: Python, SQL, Scala, Java, or R (a short sketch follows this list).

        • SQL Analytics.
        • Execute fast, distributed ANSI SQL queries for dashboarding and ad-hoc reporting. Runs faster than most data warehouses.

        • Data Science at Scale.
        • Perform Exploratory Data Analysis (EDA) on petabyte-scale data without having to resort to downsampling.

        • Machine learning.
        • Train machine learning algorithms on a laptop and use the same code to scale to fault-tolerant clusters of thousands of machines.
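
          To make the batch/streaming point above concrete, the hedged sketch below applies the same word-count logic to a batch source and to a Structured Streaming socket source; the paths, host, and port are illustrative assumptions.

```python
# Sketch: the same DataFrame logic over a batch source and a streaming source.
# Paths, host, and port are illustrative assumptions.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, explode, split

spark = SparkSession.builder.appName("batch-and-streaming").getOrCreate()

# Batch: word counts over files already sitting in storage.
batch_words = (spark.read.text("/data/articles/")
               .select(explode(split(col("value"), " ")).alias("word"))
               .groupBy("word").count())
batch_words.show()

# Streaming: the same transformation over lines arriving on a socket.
stream_lines = (spark.readStream.format("socket")
                .option("host", "localhost")
                .option("port", 9999)
                .load())
stream_words = (stream_lines
                .select(explode(split(col("value"), " ")).alias("word"))
                .groupBy("word").count())

query = (stream_words.writeStream
         .outputMode("complete")
         .format("console")
         .start())
query.awaitTermination()
```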

          Differences between Apache Spark and Hadoop

          Spark and Hadoop are leading open source big data infrastructure frameworks that are used to store and process large data sets.

          Since Spark's introduction to the Apache Software Foundation in 2014, it has received massive interest from developers, enterprise software providers, and independent software vendors looking to capitalize on its in-memory processing speed and cohesive, uniform APIs.

          However, there is a hot debate on whether Spark can replace Hadoop to become the top big data analytics tool.

          In this post, I have tried to explain the difference between Spark and Hadoop easily so that anyone, even those without a background in computer science, can understand.

          Distributed Storage System.

          Even though Spark is said to work faster than Hadoop in certain circumstances, it doesn't have its own distributed storage system. So first, let's understand the concept of a distributed file system.

          A distributed storage system lets you store large datasets across a virtually unlimited number of servers, rather than storing all the data on a single server.

          As the amount of data grows, you can add as many servers as you need to the distributed storage system. This makes a distributed storage system scalable and cost-efficient, because you are adding hardware (servers) only when there is demand.

          How Spark and Hadoop Process Data.

          Spark does not have its own system for organizing files in a distributed way (a file system). For this reason, programmers install Spark on top of Hadoop so that Spark's advanced analytics applications can make use of the data stored in the Hadoop Distributed File System (HDFS). Hadoop has a file system that is much like the one on your desktop computer, but it allows files to be distributed across many machines. HDFS organizes information into consistent file blocks that are stored across the nodes of the cluster.
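
          For example, a PySpark job can read files that HDFS has already split into blocks across the cluster; the NameNode address and path below are assumptions for illustration.

```python
# Sketch: Spark reading data that HDFS has distributed in blocks across nodes.
# The NameNode host/port and the path are placeholders.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("read-from-hdfs").getOrCreate()

# Tasks are scheduled close to the HDFS blocks they read where possible,
# so the computation moves to the data rather than the other way around.
logs = spark.read.text("hdfs://namenode.example.com:8020/user/cloudera/logs/")
print(logs.count())
```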


          Hadoop uses MapReduce to process and analyze the data stored in HDFS. MapReduce writes its data back to disk after each operation, because data stored in RAM is volatile, whereas data persisted to disk survives failures.


          In contrast, Spark keeps most of the data in RAM rather than writing it back to disk; this is called "in-memory" processing. It reduces the time required to interact with disks and servers and makes Spark faster than Hadoop's MapReduce. Spark uses an abstraction called Resilient Distributed Datasets (RDDs) to recover data when there is a failure.
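
          A small sketch of that in-memory reuse, with an assumed Parquet dataset path and made-up column names:

```python
# Sketch of Spark's in-memory reuse: cache a dataset once, then run several
# operations on it without re-reading from disk. Path and columns are made up.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("in-memory-reuse").getOrCreate()

events = spark.read.parquet("hdfs://namenode.example.com:8020/data/events/")
events.cache()                                     # keep in executor memory

events.count()                                     # first action fills the cache
events.filter(events.status == "error").count()    # served from memory
events.groupBy("status").count().show()            # served from memory
```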

          Spark and Hadoop’s Role in Real-time Analytics

          Real-time processing means that the moment data is captured, it is fed into an analytical application, which processes and analyzes the data and delivers insights quickly to the user through a dashboard, so that the user can take the necessary action based on those insights.


          An excellent example of real-time streaming is a recommendation engine; similar products are shown based on your browsing history.


          Nowadays, Spark is used in machine learning projects due to its ability to process real-time data effectively. Machine learning is a subfield of artificial intelligence. It is a method of teaching computers to make and improve predictions or behaviors based on some data.


          Spark has its own machine learning library, called MLlib, whereas Hadoop must be interfaced with an external machine learning library, for example Apache Mahout.
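
          As a minimal illustration of MLlib (Spark's spark.ml API), the sketch below trains a logistic regression model on a tiny made-up dataset.

```python
# Minimal MLlib sketch: logistic regression on a small, made-up dataset.
from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import LogisticRegression

spark = SparkSession.builder.appName("mllib-example").getOrCreate()

data = spark.createDataFrame(
    [(0.0, 1.2, 0.7), (1.0, 3.4, 2.1), (0.0, 0.9, 0.3), (1.0, 4.1, 2.8)],
    ["label", "x1", "x2"],
)

# Assemble the feature columns into the single vector column MLlib expects.
assembler = VectorAssembler(inputCols=["x1", "x2"], outputCol="features")
train = assembler.transform(data)

model = LogisticRegression(maxIter=10).fit(train)
model.transform(train).select("label", "prediction").show()
```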


          Because Spark is faster than Hadoop's MapReduce, it is better suited to advanced analytics operations such as real-time data processing.

          Why Spark and Hadoop Are Not Competitors

          Many prominent data professionals argue that "Spark is better than Hadoop" or "Hadoop is better than Spark." In my opinion, Hadoop and Spark are not competitors, because Hadoop was designed to handle data that does not fit in memory, whereas Spark was designed to deal with data that fits in memory.

          Even companies like Cloudera, which provide installation and support services for open-source big data software, deliver both Hadoop and Spark as services. These big data companies also help their clients choose the best big data software for their needs.

          For instance, if a corporation has mostly structured data (customer names and email IDs) in its database, it might not need the advanced streaming analytics and machine learning capabilities provided by Spark, and it need not spend time and money installing Spark as a layer on top of its Hadoop stack.

          Conclusion.

          Although the adoption of Spark has increased, it hasn't caused any panic in the big data community. Experts predict that Spark will facilitate the growth of another stack, which could be even more powerful, but this new stack would be very similar to Hadoop and its ecosystem of software packages.

          Simplicity and speed are the most significant advantages of Spark. Even if Spark is a big winner, unless there is a new distributed file system, we will be using Hadoop alongside Spark for a complete big data package.
