
Big Data Preparation and Exploration using R4ML

In this Code Pattern we will use R4ML, a scalable R package, running on IBM Watson Studio to perform various Machine Learning exercises. For those users who are unfamiliar with Watson Studio, it is an interactive, collaborative, cloud-based environment where data scientists, developers, and others interested in data science can use tools (e.g., RStudio, Jupyter Notebooks, Spark, etc.) to collaborate, share, and gather insight from their data.

When the reader has completed this Code Pattern, they will understand how to perform scalable data preparation and exploratory analysis with R4ML on Watson Studio.

The intended audience for this Code Pattern is data scientists who wish to perform scalable feature engineering and data exploration.

This specific Code Pattern provides an end-to-end example that demonstrates the ease and power of R4ML for data preprocessing and data exploration. R4ML provides various out-of-the-box tools, including a preprocessing utility for feature engineering, along with utilities for sampling data and performing exploratory analysis. For more information about additional R4ML functionality, support, documentation, and the roadmap, please visit the R4ML project.
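
To give a flavor of what the notebooks do, here is a minimal sketch of starting an R4ML session and loading the bundled airline sample. Function and dataset names follow the R4ML documentation; treat them as assumptions if your version differs.

```R
# Load R4ML and start a session; this also brings up SparkR and
# Apache SystemML behind the scenes.
library(R4ML)
r4ml.session()

# R4ML ships a ~1% sample of the RITA "airline" dataset as a plain
# R data.frame; convert it into a distributed r4ml.frame.
air_hf <- as.r4ml.frame(airline)

# Inspect it much as you would a regular data.frame.
head(air_hf)

# Stop the session when finished.
r4ml.session.stop()
```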

This Code Pattern will walk the user through the following conceptual steps:

  • Large-scale exploratory analytics and data preparation.
  • Dimensionality reduction.
  • How to use your favorite R utilities on big data.
  • The steps necessary to complete data preparation and exploration.

Source of data

  • We will use the Airline On-Time Statistics and Delay Causes dataset from RITA. A 1% sample of the "airline" dataset is available here. All of the data is in the public domain.
  • For this Code Pattern, we will use a subset of the above dataset, which is shipped with R4ML.
  • This Code Pattern can also work with the larger RITA dataset; a sketch of loading it appears below.
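
If you download the full RITA data, loading it might look like the following sketch, which reads the CSV with SparkR and hands it to R4ML. The file path is hypothetical, and the CSV options are assumptions:

```R
library(SparkR)
library(R4ML)
r4ml.session()

# Hypothetical path to a full RITA on-time performance CSV.
rita_df <- read.df("/data/rita/On_Time_Performance_2007.csv",
                   source = "csv", header = "true", inferSchema = "true")

# Promote the SparkR DataFrame to a distributed r4ml.frame so the
# R4ML preprocessing and sampling utilities can operate on it.
rita_hf <- as.r4ml.frame(rita_df)
```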

Notebooks

This Code Pattern includes two notebooks: an exploratory analysis notebook and a data processing notebook.

Flow

  1. Load the provided notebook into IBM Watson Studio.
  2. The notebook interacts with an Apache Spark instance.
  3. A sample big-data dataset is loaded into the Jupyter Notebook.
  4. R4ML, running atop Apache Spark, is used to perform data preprocessing and exploratory analysis.

Included Components

  • IBM Watson Studio: Analyze data using RStudio, Jupyter, and Python in a configured, collaborative environment that includes IBM value-adds, such as managed Spark.
  • IBM Analytics for Apache Spark: An open-source cluster computing framework optimized for extremely fast, large-scale data processing.
  • Jupyter Notebooks: An open-source web application that allows you to create and share documents that contain live code, equations, visualizations and explanatory text.

Featured Technologies

  • Data Science: Systems and scientific methods to analyze structured and unstructured data in order to extract knowledge and insights.
  • R4ML: R4ML is a scalable, hybrid approach to ML/Stats using R, Apache SystemML, and Apache Spark.

Steps

  1. Create a new Watson Studio project
  2. Create the notebooks
  3. Run the notebooks
  4. Save and Share
  5. Explore and Analyze the Data

1. Create a new Watson Studio project

  • Log into IBM's Watson Studio. Once in, you'll land on the dashboard.

  • Create a new project by clicking + New project and choosing Data Science:

    (Screenshot: creating a new Watson Studio project)

  • Enter a name for the project and click Create.

  • NOTE: Creating a project in Watson Studio will create a free-tier Object Storage service and a Watson Machine Learning service in your IBM Cloud account. Select the Free storage type to avoid fees.

    (Screenshot: the new project dialog)

  • Upon successful project creation, you are taken to a dashboard view of your project. Take note of the Assets and Settings tabs; we'll be using them to associate our project with any external assets (datasets and notebooks) and any IBM Cloud services.

    (Screenshot: the project dashboard)

2. Create the Notebooks

  • From the new project Overview panel, click + Add to project on the top right and choose the Notebook asset type.

(Screenshot: adding a notebook to the project)

3. Run the notebooks

Run the exploratory notebook first. Once it completes, run the data processing notebook.

Note: Running the exploratory notebook first is a requirement. It loads libraries and packages that are required in the data processing notebook.

When a notebook is executed, each code cell in the notebook is run, in order, from top to bottom.

Each code cell is selectable and is preceded by a tag in the left margin. The tag format is In [x]:. Depending on the state of the notebook, the x can be:

  • A blank indicates that the cell has never been executed.
  • A number represents the relative order in which this code step was executed.
  • A * indicates that the cell is currently executing.

There are several ways to execute the code cells in your notebook:

  • One cell at a time.
    • Select the cell, and then press the Play button in the toolbar.
  • Batch mode, in sequential order.
    • From the Cell menu bar, there are several options available. For example, you can Run All cells in your notebook, or you can Run All Below, which will start executing from the first cell under the currently selected cell and then continue executing all cells that follow.
  • At a scheduled time.
    • Press the Schedule button located in the top right section of your notebook panel. Here you can schedule your notebook to be executed once at some future time, or repeatedly at your specified interval.

4. Save and Share

How to save your work:

Under the File menu, there are several ways to save your notebook:

  • Save will simply save the current state of your notebook, without any version information.
  • Save Version will save the current state of your notebook with a version tag that contains a date and time stamp. Up to 10 versions of your notebook can be saved, each one retrievable by selecting the Revert To Version menu item.

How to share your work:

You can share your notebook by selecting the Share button located in the top right section of your notebook panel. The end result of this action will be a URL link that will display a “read-only” version of your notebook. You have several options to specify exactly what you want shared from your notebook:

  • Only text and output: will remove all code cells from the notebook view.
  • All content excluding sensitive code cells: will remove any code cells that contain a sensitive tag. For example, # @hidden_cell is used to protect your credentials from being shared (see the sketch after this list).
  • All content, including code: displays the notebook as is.
  • A variety of download as options are also available in the menu.
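
For example, a credentials cell in one of these R notebooks might look like the sketch below. The values are placeholders, and the # @hidden_cell comment on the first line is what marks the cell as sensitive:

```R
# @hidden_cell
# This cell is excluded when sharing with the
# "All content excluding sensitive code cells" option.
cos_credentials <- list(
  api_key  = "YOUR_API_KEY",       # placeholder, not a real key
  endpoint = "YOUR_ENDPOINT_URL",  # placeholder
  bucket   = "YOUR_BUCKET_NAME"    # placeholder
)
```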

5. Explore and Analyze the Data

Both notebooks are well documented and will guide you through the exercise. Some of the main tasks that will be covered include:

  • Load the packages and data, then perform the initial transformations and various feature engineering steps.

  • Sample the dataset and use the powerful ggplot2 library from R to perform various exploratory analyses.

  • Run PCA (Principal Component Analysis) to reduce the dimensions of the dataset and select the k components that cover 90% of the variance. A condensed sketch of these tasks follows this list.
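
A condensed sketch of those three tasks is shown below. It assumes the R4ML API (r4ml.ml.preprocess, r4ml.sample, r4ml.pca) and airline column names as given in the R4ML documentation; the exact arguments used in the notebooks may differ:

```R
library(R4ML)
library(ggplot2)
r4ml.session()

air_hf <- as.r4ml.frame(airline)

# 1. Feature engineering: recode categorical columns and impute
#    missing delay values (argument names per the R4ML docs).
prep <- r4ml.ml.preprocess(
  air_hf,
  transformPath    = "/tmp/airline.transform",
  recodeAttrs      = c("Origin", "Dest"),
  missingAttrs     = c("DepDelay", "ArrDelay"),
  imputationMethod = c("global_mean", "global_mean")
)

# 2. Pull a small random sample back to the driver and explore it
#    with ggplot2.
sample_df <- as.data.frame(r4ml.sample(prep$data, perc = 0.1)[[1]])
ggplot(sample_df, aes(x = ArrDelay)) + geom_histogram(binwidth = 15)

# 3. Run PCA on the distributed matrix (assuming the preprocessed
#    frame is fully numeric at this point); in the notebook, k = 6
#    components turn out to cover ~90% of the variance.
pca <- r4ml.pca(as.r4ml.matrix(prep$data), k = 6)
```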

You will also see the advantages of using R4ML, an open-source R package from IBM that can be downloaded via git. Some of these include:

  • Created on top of SparkR and Apache SystemML, so it supports features from both.

  • Acts as an R bridge between SparkR and Apache SystemML.

  • Provides a collection of canned algorithms.

  • Provides the ability to create custom ML algorithms.

  • Provides both SparkR and Apache SystemML functionality.

  • APIs that should be familiar to R users (see the sketch below).
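
For instance, invoking one of the canned algorithms is meant to feel like base R. The sketch below fits a linear model with r4ml.lm, one of the pre-built algorithms; the formula interface and column names are assumptions based on the R4ML documentation:

```R
library(R4ML)
r4ml.session()

# Build a small numeric matrix from the bundled airline sample
# (dropping rows with missing values first).
air_mat <- as.r4ml.matrix(
  as.r4ml.frame(na.omit(airline[, c("Distance", "ArrDelay")]))
)

# Fit a distributed linear model, mirroring stats::lm.
model <- r4ml.lm(ArrDelay ~ ., data = air_mat)
model
```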

Sample output

The following screenshot shows the histogram from the exploratory analysis.

Exploratory Analysis Histogram

The following screenshot shows the correlation between various features from the exploratory analysis.

Exploratory Analysis Correlation between various features

The following screenshot shows the output of the dimensionality reduction using PCA, where only 6 PCA components carry 90% of the information.

Dimension Reduction using PCA

Awesome job following along! Now go try to take this further or apply it to a different use case!

Learn more

  • Data Analytics Code Patterns: Enjoyed this Code Pattern? Check out our other Data Analytics Code Patterns
  • AI and Data Code Pattern Playlist: Bookmark our playlist with all of our Code Pattern videos
  • Watson Studio: Master the art of data science with IBM's Watson Studio

License

This code pattern is licensed under the Apache License, Version 2. Separate third-party code objects invoked within this code pattern are licensed by their respective providers pursuant to their own separate licenses. Contributions are subject to the Developer Certificate of Origin, Version 1.1 and the Apache License, Version 2.

Apache License FAQ
