An example project demonstrating unsupervised learning with the Gaussian Mixture clusterer and synthetic color data generation.

Color Clusterer

An unsupervised learning problem that involves clustering similar shades of 10 different base colors generated on the fly using Rubix Generators. The objective is to generate a training and testing set full of synthetic data that we'll later use to train and test a Gaussian Mixture clusterer. In this tutorial, you'll learn the concepts of unsupervised clustering and synthetic data generation.

  • Difficulty: Easy
  • Training time: < 1 Minute
  • Memory needed: < 1G

Installation

Clone the repository locally using Git:

$ git clone https://github.com/RubixML/Colors

Install dependencies using Composer:

$ composer install

Requirements

  • PHP 7.1.3 or above

Tutorial

In machine learning, synthetic data is used either to test an estimator or to augment a small dataset with more training data. Rubix provides a number of Generators which output a dataset in a particular shape and dimensionality. For this example project, we are going to generate Blobs of colors using their RGB values as features. We'll form an Agglomerate of color Blobs and give each one a label corresponding to its base color name.

Note: Generators can generate both labeled and unlabeled datasets. The type of Dataset object returned depends on the generator. See the API Reference for more details.

The source code can be found in the train.php file in the project root.

use Rubix\ML\Datasets\Generators\Agglomerate;
use Rubix\ML\Datasets\Generators\Blob;

$generator = new Agglomerate([
    'red' => new Blob([255, 0, 0], 20.),
    'orange' => new Blob([255, 128, 0], 20.),
    'yellow' => new Blob([255, 255, 0], 20.),
    'green' => new Blob([0, 128, 0], 20.),
    'blue' => new Blob([0, 0, 255], 20.),
    'aqua' => new Blob([0, 255, 255], 20.),
    'purple' => new Blob([128, 0, 255], 20.),
    'pink' => new Blob([255, 0, 255], 20.),
    'magenta' => new Blob([255, 0, 128], 20.),
    'black' => new Blob([0, 0, 0], 20.),
]);
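
To build some intuition for what a single Blob produces, here is a minimal sketch in plain PHP. It is not the Rubix implementation. It assumes that the second constructor argument is a standard deviation applied equally to every feature, and the helper names standardNormal() and sampleBlob() are hypothetical.

```php
<?php

/**
 * Draw one value from a standard normal distribution using the
 * Box-Muller transform.
 */
function standardNormal(): float
{
    // Shift into (0, 1] so that log() never receives zero.
    $u1 = (mt_rand() + 1) / (mt_getrandmax() + 1);
    $u2 = mt_rand() / mt_getrandmax();

    return sqrt(-2.0 * log($u1)) * cos(2.0 * M_PI * $u2);
}

/**
 * Sample n points scattered around a center with the same spread in
 * every dimension (hypothetical helper, for illustration only).
 */
function sampleBlob(array $center, float $stdDev, int $n): array
{
    $samples = [];

    for ($i = 0; $i < $n; $i++) {
        $sample = [];

        foreach ($center as $mean) {
            $sample[] = $mean + $stdDev * standardNormal();
        }

        $samples[] = $sample;
    }

    return $samples;
}

mt_srand(42); // Seed the generator so repeated runs draw the same points

// 1000 shades of red scattered around pure red (255, 0, 0)
$reds = sampleBlob([255, 0, 0], 20.0, 1000);
```

Note that this sketch does not clip values to the 0 to 255 range, so some samples fall slightly outside the valid RGB cube.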

To generate a dataset, call generate() with the number of samples (n). A Dataset object is returned which allows you to fluently process the data further by stratifying and splitting the dataset into a training and testing set. Stratifying the dataset before splitting creates balanced training and testing sets by label. The proportion of samples in the left (training) set to the right (testing) set is given by the ratio parameter to the stratifiedSplit() method. Let's choose to generate a set of 5000 samples and then split it 80/20 (4000 for training and 1000 for testing).

[$training, $testing] = $generator->generate(5000)->stratifiedSplit(0.8);
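
The idea behind a stratified split can be sketched in plain PHP as follows. This is an illustration under our own assumptions rather than the library's code: the stratifiedSplit() function below is a hypothetical standalone helper, and it does not shuffle the samples.

```php
<?php

/**
 * Split samples into a left and a right set while keeping the
 * proportion of each label the same in both (hypothetical helper).
 */
function stratifiedSplit(array $samples, array $labels, float $ratio): array
{
    // Group the sample indices by label. Each group is one stratum.
    $strata = [];

    foreach ($labels as $i => $label) {
        $strata[$label][] = $i;
    }

    $left = $right = ['samples' => [], 'labels' => []];

    // Cut every stratum at the same ratio.
    foreach ($strata as $label => $indices) {
        $cut = (int) floor($ratio * count($indices));

        foreach ($indices as $position => $i) {
            if ($position < $cut) {
                $left['samples'][] = $samples[$i];
                $left['labels'][] = $label;
            } else {
                $right['samples'][] = $samples[$i];
                $right['labels'][] = $label;
            }
        }
    }

    return [$left, $right];
}

// Toy dataset: 10 red and 10 blue samples with a single feature each
$samples = [];
$labels = [];

for ($i = 0; $i < 10; $i++) {
    $samples[] = [255 - $i];
    $labels[] = 'red';
    $samples[] = [$i];
    $labels[] = 'blue';
}

[$left, $right] = stratifiedSplit($samples, $labels, 0.8);

$leftCounts = array_count_values($left['labels']);
$rightCounts = array_count_values($right['labels']);
```

With a ratio of 0.8, each color contributes 8 samples to the left set and 2 to the right set, so both sets stay balanced by label.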

Let's take a look at the data we've just generated using plotting software such as Plotly. We've used the label to color the data such that each point is represented by its base color.

[Figure: Synthetic Color Data]

Now we'll define our Gaussian Mixture clusterer. Gaussian Mixture Models (GMMs) are a type of probabilistic model for finding subpopulations within a dataset. They place a Gaussian component over each target cluster, which allows a likelihood function to be computed. The learner is then trained with Expectation Maximization (EM) to maximize the likelihood of the training samples under the mixture, so that each Gaussian component ideally ends up covering the samples of a single class. The target number of clusters k is a hyper-parameter of the GMM that we need to set up front. Since we already know the number of different labeled color Blobs in our dataset, we'll choose a value of 10.

use Rubix\ML\Clusterers\GaussianMixture;

$estimator = new GaussianMixture(10);
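
To make the expectation step of EM more concrete, here is a simplified sketch in plain PHP. It works on a single feature with two components, whereas the real clusterer handles three RGB features and 10 components and also re-estimates the component parameters in a maximization step. The function names, component names, and parameter values are all assumptions made for illustration.

```php
<?php

/**
 * Probability density of a one-dimensional Gaussian at x.
 */
function gaussianDensity(float $x, float $mean, float $stdDev): float
{
    $z = ($x - $mean) / $stdDev;

    return exp(-0.5 * $z * $z) / ($stdDev * sqrt(2.0 * M_PI));
}

/**
 * E-step for a single sample: the posterior probability that each
 * component generated x, given the current parameters.
 */
function responsibilities(float $x, array $components): array
{
    $weighted = [];

    foreach ($components as $name => $component) {
        $weighted[$name] = $component['prior']
            * gaussianDensity($x, $component['mean'], $component['stdDev']);
    }

    $total = array_sum($weighted);

    // Normalize so that the responsibilities sum to 1.
    return array_map(function (float $weight) use ($total): float {
        return $weight / $total;
    }, $weighted);
}

// Two hypothetical components over one color channel
$components = [
    'dark' => ['prior' => 0.5, 'mean' => 0.0, 'stdDev' => 20.0],
    'bright' => ['prior' => 0.5, 'mean' => 255.0, 'stdDev' => 20.0],
];

$posterior = responsibilities(240.0, $components);
```

A channel value of 240 lies far closer to the bright component, so nearly all of the responsibility goes to it. Assigning a sample to the component with the highest responsibility is what turns these probabilities into a cluster label.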

Once our estimator is instantiated we can call train() passing in the training set we generated earlier.

$estimator->train($training);

Lastly, to test the model, let's create a report that compares the clustering to some ground truth given by the labels we've assigned to each Blob. A Contingency Table is a clustering report similar to a Confusion Matrix. It counts the number of times a particular label was assigned to a cluster. A good clustering will show that each cluster contains mostly samples with the same label.

We'll need the predictions made by the Gaussian Mixture clusterer as well as the labels from the testing set to pass to the Contingency Table report's generate() method. Once that's done, we'll save the output to a JSON file so we can review it later.

use Rubix\ML\CrossValidation\Reports\ContingencyTable;

$predictions = $estimator->predict($testing);

$report = new ContingencyTable();

$results = $report->generate($predictions, $testing->labels());
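
Saving the output can be done with PHP's built-in JSON functions. The sketch below assumes that $results is a plain nested array. It uses a small hypothetical stand-in for the report output, and the file name and location are our own choice.

```php
<?php

// Hypothetical stand-in for the $results returned by the report
$results = [
    '8' => ['red' => 100, 'magenta' => 1],
];

// Assumed file name and location
$path = sys_get_temp_dir() . '/report.json';

// Pretty print so that the file is easy to read later
file_put_contents($path, json_encode($results, JSON_PRETTY_PRINT));

// Read the report back in as an associative array
$restored = json_decode(file_get_contents($path), true);
```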

Here is an example of a cluster that contains a misclustered magenta point with the reds.

{
    "8": {
        "red": 100,
        "orange": 0,
        "yellow": 0,
        "green": 0,
        "blue": 0,
        "aqua": 0,
        "purple": 0,
        "pink": 0,
        "magenta": 1,
        "black": 0
    }
}
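
The counting behind a report like this can be sketched in a few lines of plain PHP. This is an illustration of the idea rather than the library's implementation, and the contingencyTable() function and toy data are hypothetical.

```php
<?php

/**
 * Count how many samples of each label landed in each cluster
 * (hypothetical helper, for illustration only).
 */
function contingencyTable(array $predictions, array $labels): array
{
    $classes = array_unique($labels);

    $table = [];

    foreach ($predictions as $i => $cluster) {
        // Start every new cluster with a zero count for each label.
        if (!isset($table[$cluster])) {
            $table[$cluster] = array_fill_keys($classes, 0);
        }

        $table[$cluster][$labels[$i]]++;
    }

    return $table;
}

// Toy example: cluster 8 holds two reds and one stray magenta
$predictions = [8, 8, 8, 3, 3];
$labels = ['red', 'red', 'magenta', 'blue', 'blue'];

$table = contingencyTable($predictions, $labels);
```

Here cluster 8 counts 2 reds and 1 magenta, while cluster 3 contains only blues, which is what a clean cluster looks like.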

To run the training script from the project root:

$ php train.php

Wrap Up

  • Clustering is a type of unsupervised learning which aims to group similar samples by predicting a cluster label for each one
  • A Gaussian Mixture model is a type of clusterer
  • Synthetic data can be used as a way to test models or augment small datasets
  • Rubix Generators are used to generate synthetic data in various shapes and dimensionalities
  • A Contingency Table is a report that allows you to evaluate a clustering