# Machine Learning at Scale, Part I

## KMeans clustering at scale

Training only on data that fits in memory is very limiting. Minibatch learners, however, can easily work with data directly from disk.

We'll use the MNIST8M data set, which has 8 million images (about 17 GB). The dataset has been partitioned into groups of 100k images (using the Unix split command) and saved in compressed lz4 files. This dataset is very large and doesn't get loaded by default by <code>getdata.sh</code>. You have to load it explicitly by calling <code>getmnist.sh</code> from the scripts directory. The script automatically splits the data into files that are small enough to be loaded into memory.

Let's load BIDMat/BIDMach.

In [3]:
import BIDMat.{CMat,CSMat,DMat,Dict,IDict,Image,FMat,FND,GDMat,GMat,GIMat,GSDMat,GSMat,HMat,IMat,Mat,SMat,SBMat,SDMat}
import BIDMat.MatFunctions._
import BIDMat.SciFunctions._
import BIDMat.Solvers._
import BIDMat.Plotting._
import BIDMach.Learner
import BIDMach.models.{FM,GLM,KMeans,KMeansw,ICA,LDA,LDAgibbs,Model,NMF,RandomForest,SFA}
import BIDMach.datasources.{DataSource,MatSource,FileSource,SFileSource}
import BIDMach.mixins.{CosineSim,Perplexity,Top,L1Regularizer,L2Regularizer}
import BIDMach.updaters.{ADAGrad,Batch,BatchNorm,IncMult,IncNorm,Telescoping}
import BIDMach.causal.{IPTW}

Mat.checkMKL
Mat.checkCUDA
Mat.plotInline = true
if (Mat.hasCUDA > 0) GPUmem

1 CUDA device found, CUDA version 7.0


(0.14971523,1808470016,12079398912)

And define the root directory for this dataset.

In [4]:
val mdir = "../data/MNIST8M/parts/"



../data/MNIST8M/parts/

The files we need are named "alls00.fmat.lz4", "alls01.fmat.lz4" etc. We can create a learner using a pattern for accessing these files:

In [5]:
val (mm, opts) = KMeans.learner(mdir+"alls%02d.fmat.lz4",1024)



BIDMach.models.KMeans$fsopts@6593bce2

The string "%02d" is a C/Scala format string that expands into a two-digit ASCII number to help with the enumeration.
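As a quick illustration of that expansion, here is a minimal sketch in plain Scala, independent of BIDMach. The helper name `partName` is hypothetical and exists only for this example.

```scala
// Expand the file-name pattern for a given file number.
// `partName` is a hypothetical helper, not part of BIDMach.
def partName(i: Int): String = "alls%02d.fmat.lz4".format(i)

// The first three names the pattern generates
val firstNames = (0 until 3).map(partName)
firstNames.foreach(println)
```

Single-digit numbers are padded with a leading zero, so the generated names line up with the files on disk.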

There are several new options that can tailor a file datasource, but we'll mostly use the defaults. One thing we will do is define where the training files end (<code>nend = 20</code>). This leaves us with some held-out files to use for testing.

In [6]:
opts.nend = 20



20

Note that the training data include image data and labels (0-9). KMeans is an unsupervised algorithm, and if we used image data only, it would often build clusters containing images of different digits. To produce cleaner clusters, and to facilitate classification later on, the <code>alls</code> data includes both the labels, in the first 10 rows, and the image data, in the remaining rows. The label features are scaled by a large constant factor. That means that images of different digits will be far apart in feature space, which effectively prevents different digits from occurring in the same cluster.
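To see why the scaled labels dominate the distance, here is a minimal sketch in plain Scala (not BIDMach code). It assumes a label scale of 10000, which matches the values visible in the model matrix printed later in this notebook, and it uses tiny made-up "image" vectors. The helpers `sqDist` and `withLabel` are hypothetical.

```scala
// Squared Euclidean distance between two feature vectors of equal length.
def sqDist(a: Array[Float], b: Array[Float]): Float =
  a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum

// Prepend a one-hot label block (10 entries) scaled by `scale` to the pixel features.
// The scale of 10000 is an assumption taken from the printed model matrix.
def withLabel(digit: Int, pixels: Array[Float], scale: Float = 10000f): Array[Float] =
  Array.tabulate(10)(k => if (k == digit) scale else 0f) ++ pixels

val pixelsA = Array(0f, 255f, 128f)   // made-up pixel values
val pixelsB = Array(0f, 250f, 130f)   // nearly identical image

val sameDigit = sqDist(withLabel(3, pixelsA), withLabel(3, pixelsB))
val diffDigit = sqDist(withLabel(3, pixelsA), withLabel(8, pixelsB))
println(s"same digit: $sameDigit, different digits: $diffDigit")
```

With this scale, two different labels contribute about 2 x 10^8 to the squared distance. If pixel values lie between 0 and 255, 784 pixels can contribute at most 784 x 255^2, roughly 5.1 x 10^7, so the label term always outweighs the image term.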

## Tuning Options

The following options are the important ones for tuning. For KMeans, batchSize has no effect on accuracy since the algorithm uses all the data instances to perform an update. So you're free to tune it for best speed. Generally larger is better, as long as you don't use too much GPU RAM.

npasses is the number of passes over the dataset. Larger is typically better, but the model may overfit at some point. 

In [7]:
opts.batchSize = 20000
opts.npasses = 4



4
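The claim that batchSize does not change the KMeans result can be checked on a toy problem. The following is a minimal sketch of a standard Lloyd-style pass in plain Scala, not BIDMach's implementation. The function `onePass` and the 1-D data are made up for illustration.

```scala
// One pass of a Lloyd-style KMeans update over 1-D data, processed in minibatches.
// Per-cluster sums and counts are accumulated across batches and the centroids are
// only recomputed at the end of the pass, so the batch size cannot change the result.
def onePass(data: Seq[Double], centroids: Seq[Double], batchSize: Int): Seq[Double] = {
  val sums = Array.fill(centroids.length)(0.0)
  val counts = Array.fill(centroids.length)(0)
  for (batch <- data.grouped(batchSize); x <- batch) {
    // assign the point to its nearest centroid
    val k = centroids.indices.minBy(j => math.abs(x - centroids(j)))
    sums(k) += x
    counts(k) += 1
  }
  // keep the old centroid if a cluster received no points
  centroids.indices.map(j => if (counts(j) > 0) sums(j) / counts(j) else centroids(j))
}

val data = Seq(1.0, 2.0, 3.0, 10.0, 11.0, 12.0)
val init = Seq(0.0, 9.0)
val small = onePass(data, init, batchSize = 2)
val large = onePass(data, init, batchSize = 6)
println(s"batchSize=2: $small, batchSize=6: $large")
```

Both calls produce the same centroids because assignments depend only on the centroids at the start of the pass, which leaves batchSize as a pure speed and memory knob.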

You invoke the learner the same way as before. You can change the options above after each run to optimize performance. 

In [8]:
mm.train

pass= 0
First pass random centroid initialization
 2.00%, ll=0.00000, gf=62.245, secs=0.5, GB=0.13, MB/s=242.44, GPUmem=0.109619
12.00%, ll=0.00000, gf=32.781, secs=2.0, GB=0.83, MB/s=414.95, GPUmem=0.109087
24.00%, ll=0.00000, gf=28.554, secs=3.4, GB=1.52, MB/s=444.84, GPUmem=0.109000
35.00%, ll=0.00000, gf=25.349, secs=5.1, GB=2.22, MB/s=431.94, GPUmem=0.108913
45.00%, ll=0.00000, gf=22.749, secs=7.2, GB=2.92, MB/s=407.58, GPUmem=0.108826
57.00%, ll=0.00000, gf=22.725, secs=8.6, GB=3.62, MB/s=420.42, GPUmem=0.108826
68.00%, ll=0.00000, gf=22.493, secs=10.2, GB=4.32, MB/s=425.51, GPUmem=0.108826
79.00%, ll=0.00000, gf=22.721, secs=11.5, GB=5.02, MB/s=436.92, GPUmem=0.108740
89.00%, ll=0.00000, gf=21.064, secs=13.9, GB=5.72, MB/s=410.19, GPUmem=0.108740
100.00%, ll=0.00000, gf=21.137, secs=15.4, GB=6.35, MB/s=411.61, GPUmem=0.108653
pass= 1
 2.00%, ll=-2463799.75000, gf=26.068, secs=16.3, GB=6.48, MB/s=398.32, GPUmem=0.101600
12.00%, ll=-2467914.25000, gf=65.288, secs=17.0, GB=7.18, MB

Now let's extract the model as a floating-point matrix. We included the category features for clustering to make sure that each cluster is a subset of images for one digit.

In [9]:
val modelmat = FMat(mm.modelmat)



      0      0      0      0      0  10000      0      0      0      0...
  10000      0      0      0      0      0      0      0      0      0...
      0      0      0      0  10000      0      0      0      0      0...
  10000      0      0      0      0      0      0      0      0      0...
      0      0      0      0  10000      0      0      0      0      0...
      0      0      0  10000      0      0      0      0      0      0...
      0  10000      0      0      0      0      0      0      0      0...
      0      0      0      0      0  10000      0      0      0      0...
     ..     ..     ..     ..     ..     ..     ..     ..     ..     ..


Next we build a 30 x 10 array of images to view the first 300 cluster centers as images.

In [10]:
val nx = 30
val ny = 10
val im = zeros(28,28)
val allim = zeros(28*nx,28*ny)
for (i<-0 until nx) {
    for (j<-0 until ny) {
        val slice = modelmat(i+nx*j,10->794)
        im(?) = slice(?)
        allim((28*i)->(28*(i+1)), (28*j)->(28*(j+1))) = im
    }
}
Image.show(allim kron ones(2,2))



javax.swing.JFrame[frame0,0,0,1696x598,layout=java.awt.BorderLayout,title=Image 0,resizable,normal,defaultCloseOperation=HIDE_ON_CLOSE,rootPane=javax.swing.JRootPane[,8,30,1680x560,layout=javax.swing.JRootPane$RootLayout,alignmentX=0.0,alignmentY=0.0,border=,flags=16777673,maximumSize=,minimumSize=,preferredSize=],rootPaneCheckingEnabled=true]

We'll predict using the closest cluster (or 1-NN if you like). First we read some data directly. We could also try to do evaluation directly from disk, but this would usually be overkill.

In [11]:
val test = loadFMat(mdir+"alls70.fmat.lz4")   // Load a test data file
val testdata = test.copy                      // copy it
testdata(0->10,?) = 0                         // and remove the digit labels
val preds = izeros(1, test.ncols)             // make a container to hold the predictions
1                                             // avoids a monster data cell being printed



1
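For intuition, closest-cluster prediction amounts to an argmin over distances to the centroids. The sketch below is plain Scala with made-up 2-D centroids and labels, not the BIDMach predictor; `nearest` is a hypothetical helper. Note that if every centroid carries the same label magnitude, zeroing the label rows of the test data adds the same constant to every distance, so the ranking is decided by the pixels alone.

```scala
// Index of the centroid closest (in squared Euclidean distance) to point `x`.
def nearest(x: Array[Float], centroids: Array[Array[Float]]): Int =
  centroids.indices.minBy { k =>
    x.zip(centroids(k)).map { case (a, b) => (a - b) * (a - b) }.sum
  }

// Toy example: two centroids, each tagged with a digit label (hypothetical values).
val centroids = Array(Array(0f, 0f), Array(10f, 10f))
val centroidLabel = Array(3, 8)

val query = Array(9f, 8f)
val cluster = nearest(query, centroids)   // index of the best-matching cluster
val predicted = centroidLabel(cluster)    // label attached to that cluster
println(s"cluster=$cluster, predicted digit=$predicted")
```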

Next we define a predictor from the just-computed model and the testdata, with the preds matrix to catch the predictions.

In [12]:
val (pp, popts) = KMeans.predictor(mm.model, testdata, preds)   // pass preds to receive the predictions



: 



Let's run the predictor.

In [None]:
pp.predict 

The <code>preds</code> matrix now contains the numbers of the best-matching cluster centers. We still need to look up the category label for each one. We also need to look up the category for each of the test inputs.

In [None]:
val (vmax, predcat) = maxi2(modelmat(preds,0->10).t)   // Lookup the cat for the matching cluster
val (wmax, truecat) = maxi2(test(0->10,?))             // Reference cats for test items
val inds = predcat.t \ truecat.t                       // Concatenate them into a two-column matrix

From the actual and predicted categories, we can compute a confusion matrix:

In [None]:
val conf = accum(inds, 1f, 10, 10)  // accumulate the (estimate,exact) ids into a matrix
conf ~ conf / sum(conf)             // normalize
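The accumulate-and-normalize step can be sketched in plain Scala as follows. This is an illustration only, assuming that <code>sum(conf)</code> returns column sums (as in Matlab), so that each column of the normalized matrix sums to 1. The function `confusion` and the label sequences are made up.

```scala
// Build an n x n confusion matrix from (predicted, true) label pairs and
// normalize each column so that it sums to 1.
// Rows index the predicted digit, columns the true digit.
def confusion(pred: Seq[Int], truth: Seq[Int], n: Int = 10): Array[Array[Double]] = {
  val counts = Array.fill(n, n)(0.0)
  for ((p, t) <- pred.zip(truth)) counts(p)(t) += 1.0
  val colSums = (0 until n).map(t => (0 until n).map(p => counts(p)(t)).sum)
  // leave all-zero columns at zero to avoid dividing by zero
  Array.tabulate(n, n)((p, t) => if (colSums(t) > 0) counts(p)(t) / colSums(t) else 0.0)
}

val pred  = Seq(0, 0, 1, 1, 1)   // made-up predictions
val truth = Seq(0, 0, 1, 1, 0)   // made-up reference labels
val c = confusion(pred, truth)
println(s"P(pred=0 | true=0) = ${c(0)(0)}, P(pred=1 | true=0) = ${c(1)(0)}")
```

Under this normalization, each diagonal entry is the fraction of a digit's test images that were assigned the correct label.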

Now let's create an image by multiplying each confusion matrix cell by a white square:

In [None]:
Image.show((conf * 250f) ⊗ ones(64,64))

It's useful to isolate the correct classification rate by digit, which is:

In [None]:
val dacc = getdiag(conf).t

We can take the mean of the diagonal accuracies to get an overall accuracy for this model. 

In [None]:
mean(dacc)

Run the experiment again with a larger number of clusters (3000, then 30000). Keep the batchSize option at 20000 or lower to avoid memory problems.

Include the training time output by the call to <code>mm.train</code> but not the evaluation time (the evaluation code above is not using the GPU). Rerun and fill out the table below:

<table>
<tr>
<th>KMeans Clusters</th>
<th>Training time</th>
<th>Avg. gflops</th>
<th>Accuracy</th>
</tr>
<tr>
<td>300</td>
<td>...</td>
<td>...</td>
<td>...</td>
</tr>
<tr>
<td>3000</td>
<td>...</td>
<td>...</td>
<td>...</td>
</tr>
<tr>
<td>30000</td>
<td>...</td>
<td>...</td>
<td>...</td>
</tr>
</table>