Short Name

Building a recommender using Data Science Experience Local and HortonWorks Data Platform

Short Description

Use IBM's Data Science Experience Local and HortonWorks Data Platform to leverage Spark and HDP Search to build a movie recommender.

Offering Type

Data Analytics

Introduction

Recommendation engines are among the best-known, most widely used, and highest-value applications of machine learning. Yet while there are many resources covering the basics of training a recommendation model, relatively few explain how to actually deploy those models to build a large-scale recommender system.

This Code Pattern demonstrates the key elements of creating such a system by using Apache Spark and HDP Search (Solr).

Author

By Dilip Biswal and Kevin Yu

Code

Demo

Video

  • Coming soon

Overview

This Code Pattern contains a Jupyter notebook illustrating how to use Spark to train a collaborative filtering recommendation model from ratings data stored in Solr, save the model factors to Solr, and then use Solr to serve real-time recommendations with that model. The data comes from MovieLens, a common benchmark dataset in the recommendations community. It consists of ratings given by users of the MovieLens movie rating system to various movies, along with metadata (title and genres) for each movie.
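
For orientation, the core training step looks roughly like the sketch below. It assumes the MovieLens ratings are already in a Spark DataFrame named `ratings` with `userId`, `movieId`, and `rating` columns (the column names and hyperparameters are illustrative assumptions, not the notebook's exact code):

```python
from pyspark.ml.recommendation import ALS

# Train a collaborative filtering model with Spark MLlib's ALS
# (alternating least squares). Column names and hyperparameters
# here are assumptions for illustration.
als = ALS(
    userCol="userId",
    itemCol="movieId",
    ratingCol="rating",
    rank=20,       # dimension of the latent factor vectors
    regParam=0.1,
    seed=42,
)
model = als.fit(ratings)

# The learned factors are plain DataFrames of (id, features) rows,
# which can later be exported to Solr for real-time scoring.
user_factors = model.userFactors
item_factors = model.itemFactors
```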

A key element of this Code Pattern is the environment in which it is run. To store, process, and analyze the data, we will use the Hortonworks Data Platform (HDP). HDP is a massively scalable platform consisting of a number of essential Apache Hadoop projects, such as MapReduce and the Hadoop Distributed File System (HDFS). It also provides applications and tools for indexing content from an HDP cluster into Solr, an open source enterprise search platform.
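
In practice, the raw MovieLens files land in HDFS first and are read back through Spark, producing the `ratings` DataFrame used in the training sketch above. A minimal sketch, assuming a hypothetical HDFS path and the standard notebook `spark` session:

```python
# Read the MovieLens ratings CSV from HDFS into a Spark DataFrame.
# The path is an illustrative assumption.
ratings = (
    spark.read
    .option("header", "true")
    .option("inferSchema", "true")
    .csv("hdfs:///user/demo/movielens/ratings.csv")
)
ratings.printSchema()
```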

On the notebook side, this Code Pattern will utilize DSX Local, an on-premises solution for data scientists and data engineers. It offers a suite of data science tools that integrate with RStudio, Spark, Jupyter, and Zeppelin notebook technologies. Here, we will demonstrate how to configure it to use HDP.
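
Under the covers, DSX Local reaches the remote HDP Spark service through Livy's REST API rather than a local Spark installation. The handshake looks roughly like the following (the host and port are assumptions; in DSX Local this is typically hidden behind the sparkmagic/Livy integration):

```python
import json
import requests

# Hypothetical Livy endpoint on the HDP cluster's edge node.
LIVY_URL = "http://hdp-edge-node:8998"
HEADERS = {"Content-Type": "application/json"}

# 1. Ask Livy to start a PySpark session on the cluster.
resp = requests.post(
    f"{LIVY_URL}/sessions",
    data=json.dumps({"kind": "pyspark"}),
    headers=HEADERS,
)
session_id = resp.json()["id"]

# 2. Once the session is idle, submit code that runs on the cluster.
requests.post(
    f"{LIVY_URL}/sessions/{session_id}/statements",
    data=json.dumps({"code": "spark.range(10).count()"}),
    headers=HEADERS,
)
```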

When the reader has completed this Code Pattern, they will understand how to:

  • Ingest and index user event data into Solr using the Solr Spark connector (see the sketch after this list)
  • Load event data into Spark DataFrames and use Spark's machine learning library (MLlib) to train a collaborative filtering recommender model
  • Export the trained model into Solr
  • Use a custom Solr plugin to compute personalized user and similar-item recommendations, and combine recommendations with search and content filtering
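
The ingest and export steps above rely on the spark-solr connector, which exposes Solr as a Spark data source. A minimal sketch, assuming a local ZooKeeper ensemble and illustrative collection names (in practice the factor arrays may need to be serialized into a Solr-friendly field first):

```python
# Index the ratings DataFrame into a Solr collection via the
# spark-solr connector ("solr" data source). The zkhost value and
# collection names are assumptions for illustration.
(ratings.write
    .format("solr")
    .option("zkhost", "localhost:2181")
    .option("collection", "ratings")
    .option("gen_uniq_key", "true")  # let the connector generate ids
    .save())

# Exporting the trained model works the same way: write the item
# (and user) factor DataFrames to their own collections.
(model.itemFactors.write
    .format("solr")
    .option("zkhost", "localhost:2181")
    .option("collection", "item_factors")
    .save())
```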

Flow

[Architecture diagram]

  1. Load the movie dataset into Apache Hadoop HDFS.
  2. Use Spark DataFrame operations to clean the dataset and use Spark MLlib to train a collaborative filtering recommendation model.
  3. Save the resulting model into Apache Solr.
  4. The user can run the provided notebook in IBM's Data Science Experience Local.
  5. As the notebook runs, Apache Livy will be called to interact with the Spark service in HDP.
  6. Using Solr queries and a custom vector scoring plugin, generate example recommendations (see the sketch after these steps).
  7. When necessary, retrieve information about movies, such as poster images, using The Movie Database APIs.
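
Step 6 boils down to Solr queries whose scores are dot products between a stored factor vector and a query vector. A rough sketch using the pysolr client; note that the `{!vp ...}` query syntax belongs to the custom vector scoring plugin and will vary with the plugin, and the collection, field names, and factor values here are assumptions:

```python
import pysolr

solr = pysolr.Solr("http://localhost:8983/solr/movies", timeout=10)

# A user's latent factor vector from the trained ALS model
# (illustrative values; the notebook would fetch these from Solr).
user_vector = ",".join(str(x) for x in [0.12, -0.53, 0.88, 0.07])

# Hypothetical plugin query: score every movie against the user's
# vector and return the ten with the highest dot product.
results = solr.search(
    '{!vp f=factor_vector vector="%s"}' % user_vector,
    **{"fl": "title,genres,score", "rows": 10},
)
for doc in results:
    print(doc["title"], doc["score"])
```

Step 7 is a plain REST call; for example, a hypothetical lookup of a movie's poster path via The Movie Database v3 search API (the API key is a placeholder):

```python
import requests

# Look up a movie's poster path via TMDb's public v3 search API.
resp = requests.get(
    "https://api.themoviedb.org/3/search/movie",
    params={"api_key": "YOUR_TMDB_API_KEY", "query": "Toy Story"},
)
poster_path = resp.json()["results"][0]["poster_path"]
```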

Included components

  • IBM Data Science Experience Local: An out-of-the-box, on-premises solution for data scientists and data engineers. It offers a suite of data science tools that integrate with RStudio, Spark, Jupyter, and Zeppelin notebook technologies.
  • Apache Spark: An open-source, fast and general-purpose cluster computing system.
  • Hortonworks Data Platform (HDP): A massively scalable platform for storing, processing, and analyzing large volumes of data. HDP consists of the essential set of Apache Hadoop projects, including MapReduce, Hadoop Distributed File System (HDFS), HCatalog, Pig, Hive, HBase, ZooKeeper, and Ambari.
  • HDP Search: HDP Search provides applications and tools for indexing content from your HDP cluster to Solr.
  • Apache Livy: A service that enables easy interaction with a Spark cluster over a REST interface.
  • Jupyter Notebooks: An open-source web application that allows you to create and share documents that contain live code, equations, visualizations and explanatory text.

Featured technologies

  • Data Science: Systems and scientific methods to analyze structured and unstructured data in order to extract knowledge and insights.
  • Artificial Intelligence: Artificial intelligence can be applied to disparate solution spaces to deliver disruptive technologies.
  • Python: Python is a programming language that lets you work more quickly and integrate your systems more effectively.

Blog

Links