In this Code Pattern we will use R4ML, a scalable R package, running on IBM Watson Studio to perform various Machine Learning exercises. For those users who are unfamiliar with Watson Studio, it is an interactive, collaborative, cloud-based environment where data scientists, developers, and others interested in data science can use tools (such as RStudio, Jupyter Notebooks, and Spark) to collaborate, share, and gather insight from their data.
When the reader has completed this Code Pattern, they will understand how to:
- Use Jupyter Notebooks to load, visualize, and analyze data.
- Run Notebooks in IBM Watson Studio.
- Leverage R4ML to conduct data preparation and exploratory analysis with big data.
The intended audience for this Code Pattern is data scientists who wish to perform scalable feature engineering and data exploration.
This Code Pattern provides an end-to-end example demonstrating the ease and power of R4ML for data preprocessing and data exploration. R4ML provides various out-of-the-box tools and a preprocessing utility for feature engineering, as well as utilities to sample data and perform exploratory analysis. For more information about additional R4ML functionality, support, documentation, and roadmap, please visit R4ML.
This Code Pattern will walk the user through the following conceptual steps:
- Large-scale exploratory analytics and data preparation.
- Dimensionality reduction.
- How to use your favorite R utilities on big data.
- The steps necessary to complete data preparation and exploration.
- We will use the Airline On-Time Statistics and Delay Causes from RITA. A 1% sample of the "airline" dataset is available here. All of the data is in the public domain.
- For this Code Pattern, we will use a subset of the above dataset, which is shipped with R4ML.
- This Code Pattern can also work with the larger RITA dataset.
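As a quick orientation, loading the bundled sample into an R4ML session looks roughly like the following. This is a sketch only: `r4ml.session()`, `as.r4ml.frame()`, and the bundled `airline` data frame follow R4ML's documented API, but the notebooks contain the authoritative code.

```r
# Sketch: start an R4ML session and load the bundled 1% airline sample.
# Assumes R4ML is installed and a Spark environment is available.
library(R4ML)
r4ml.session()

# 'airline' is the sample data frame shipped with the R4ML package;
# as.r4ml.frame() converts it into a distributed r4ml.frame.
df <- as.r4ml.frame(airline)

head(df)  # peek at the first rows
dim(df)   # rows and columns in the sample
```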
- R4ML_Introduction_Exploratory_DataAnalysis.ipynb: explores the data we will be using.
- R4ML_Data_Preprocessing_and_Dimension_Reduction.ipynb: performs data pre-processing and dimension reduction analysis.
- Load the provided notebook into IBM Watson Studio.
- The notebook interacts with an Apache Spark instance.
- A sample big data dataset is loaded into a Jupyter Notebook.
- R4ML, running atop Apache Spark, is used to perform data preprocessing and exploratory analysis.
- IBM Watson Studio: Analyze data using RStudio, Jupyter, and Python in a configured, collaborative environment that includes IBM value-adds, such as managed Spark.
- IBM Analytics for Apache Spark: An open-source cluster computing framework optimized for extremely fast, large-scale data processing.
- Jupyter Notebooks: An open-source web application that allows you to create and share documents that contain live code, equations, visualizations and explanatory text.
- Data Science: Systems and scientific methods to analyze structured and unstructured data in order to extract knowledge and insights.
- R4ML: A scalable, hybrid approach to ML/statistics using R, Apache SystemML, and Apache Spark.
- Create a new Watson Studio project
- Create the notebooks
- Run the notebooks
- Save and Share
- Explore and Analyze the Data
1. Log into IBM's Watson Studio. Once in, you'll land on the dashboard.

2. Create a new project by clicking `+ New project` and choosing `Data Science`:

3. Enter a name for the project and click `Create`.

   > NOTE: By creating a project in Watson Studio, a free tier `Object Storage` service and a `Watson Machine Learning` service will be created in your IBM Cloud account. Select the `Free` storage type to avoid fees.

4. Upon a successful project creation, you are taken to a dashboard view of your project. Take note of the `Assets` and `Settings` tabs; we'll be using them to associate our project with any external assets (datasets and notebooks) and any IBM Cloud services.
1. From the new project `Overview` panel, click `+ Add to project` on the top right and choose the `Notebook` asset type.

2. Fill in the following information:

   - Select the `From URL` tab. [1]
   - Enter a `Name` for the notebook and optionally a description. [2]
   - Under `Notebook URL` provide the following URL: https://github.com/IBM/r4ml-on-watson-studio/blob/master/notebooks/R4ML_Introduction_Exploratory_DataAnalysis.ipynb [3]
   - For `Runtime` select the `Spark R 3.4` option. [4]

3. Click the `Create` button.

4. Repeat these steps for the second notebook, which has the following URL: https://github.com/IBM/r4ml-on-watson-studio/blob/master/notebooks/R4ML_Data_Preprocessing_and_Dimension_Reduction.ipynb

> TIP: Once successfully imported, the notebook should appear in the `Notebooks` section of the `Assets` tab.
Run the exploratory notebook first. Once complete, run the data processing notebook.
Note: Running the exploratory notebook first is a requirement. It loads libraries and packages that are required in the data processing notebook.
When a notebook is executed, what is actually happening is that each code cell in the notebook is executed, in order, from top to bottom.
Each code cell is selectable and is preceded by a tag in the left margin. The tag format is `In [x]:`. Depending on the state of the notebook, the `x` can be:

- A blank, which indicates that the cell has never been executed.
- A number, which represents the relative order in which this code step was executed.
- A `*`, which indicates that the cell is currently executing.
There are several ways to execute the code cells in your notebook:
- One cell at a time.
  - Select the cell, and then press the `Play` button in the toolbar.
- Batch mode, in sequential order.
  - From the `Cell` menu bar, there are several options available. For example, you can `Run All` cells in your notebook, or you can `Run All Below`, which will start executing from the first cell under the currently selected cell and then continue executing all cells that follow.
- At a scheduled time.
  - Press the `Schedule` button located in the top right section of your notebook panel. Here you can schedule your notebook to be executed once at some future time, or repeatedly at your specified interval.
Under the `File` menu, there are several ways to save your notebook:

- `Save` will simply save the current state of your notebook, without any version information.
- `Save Version` will save the current state of your notebook with a version tag that contains a date and time stamp. Up to 10 versions of your notebook can be saved, each one retrievable by selecting the `Revert To Version` menu item.
You can share your notebook by selecting the `Share` button located in the top right section of your notebook panel. The end result of this action will be a URL link that will display a "read-only" version of your notebook. You have several options to specify exactly what you want shared from your notebook:

- `Only text and output`: will remove all code cells from the notebook view.
- `All content excluding sensitive code cells`: will remove any code cells that contain a sensitive tag. For example, `# @hidden_cell` is used to protect your credentials from being shared.
- `All content, including code`: displays the notebook as is.

A variety of `download as` options are also available in the menu.
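For example, a credentials cell can be protected simply by making `# @hidden_cell` its first line; the "excluding sensitive code cells" share option then strips it from the read-only view. The variable and field names below are placeholders, not part of any real configuration:

```r
# @hidden_cell
# This cell is tagged as sensitive, so it is removed from shared views.
cos_credentials <- list(
  api_key = "YOUR_API_KEY",  # placeholder; never share real keys
  bucket  = "YOUR_BUCKET"    # placeholder bucket name
)
```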
Both notebooks are well documented and will guide you through the exercise. Some of the main tasks that will be covered include:
- Load packages and data, then perform the initial transformations and various feature engineering.
- Sample the dataset and use the powerful ggplot2 library from R to do various exploratory analyses.
- Run PCA (Principal Component Analysis) to reduce the dimensions of the dataset and select the k components that cover 90% of the variance.
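Those three tasks can be sketched end to end as follows. The function names (`r4ml.ml.preprocess`, `r4ml.sample`, `r4ml.pca`) reflect R4ML's documented API, but the column names and parameter values here are illustrative assumptions; the notebooks contain the working code.

```r
# Sketch of the full workflow, under the assumptions noted above.
library(R4ML)
library(ggplot2)
r4ml.session()

df <- as.r4ml.frame(airline)

# 1. Feature engineering: recode categorical columns and impute
#    missing values (column names here are illustrative).
prep <- r4ml.ml.preprocess(
  df,
  transformPath    = "/tmp/airline.transform",
  recodeAttrs      = c("Origin", "Dest"),
  imputationMethod = "global_mean"
)

# 2. Sample ~10% of the rows and bring the sample to the driver,
#    where standard R tooling such as ggplot2 can be used.
samp <- as.data.frame(r4ml.sample(prep$data, perc = 0.1)[[1]])
ggplot(samp, aes(x = ArrDelay)) + geom_histogram(binwidth = 15)

# 3. PCA: keep the k components (here k = 6) that together
#    cover at least 90% of the variance.
pca <- r4ml.pca(prep$data, k = 6, center = TRUE, scale = TRUE)
```

The exact argument names may differ across R4ML versions; consult the package documentation before adapting this sketch.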
You will also see the advantages of using R4ML, an open-source R package from IBM that can be downloaded from GitHub. Some of these include:
- Created on top of SparkR and Apache SystemML, so it supports features from both.
- Acts as an R bridge between SparkR and Apache SystemML.
- Provides a collection of canned algorithms.
- Provides the ability to create custom ML algorithms.
- Provides both SparkR and Apache SystemML functionality.
- APIs that should be familiar to R users.
The following screenshot shows the histograms from the exploratory analysis.
The following screenshot shows the correlation between various features from the exploratory analysis.
The following screenshot shows the output of the dimensionality reduction using PCA, where only 6 PCA components carry 90% of the information.
Awesome job following along! Now go try to take this further or apply it to a different use case!
- Data Analytics Code Patterns: Enjoyed this Code Pattern? Check out our other Data Analytics Code Patterns
- AI and Data Code Pattern Playlist: Bookmark our playlist with all of our Code Pattern videos
- Watson Studio: Master the art of data science with IBM's Watson Studio
This code pattern is licensed under the Apache License, Version 2. Separate third-party code objects invoked within this code pattern are licensed by their respective providers pursuant to their own separate licenses. Contributions are subject to the Developer Certificate of Origin, Version 1.1 and the Apache License, Version 2.