Lab
===

**Repeat the exercise from class, but this time use StarCluster.**

Add the `ipcluster` plugin if you haven't already.

Near the bottom of `.starcluster/config`:
```bash
######################
## Built-in Plugins ##
######################
# The following plugins ship with StarCluster and should work out-of-the-box.
# Uncomment as needed. Don't forget to update your PLUGINS list!
# See http://star.mit.edu/cluster/docs/latest/plugins for plugin details.
# .
# .
# .
[plugin ipcluster]
SETUP_CLASS = starcluster.plugins.ipcluster.IPCluster
# Enable the IPython notebook server (optional)
ENABLE_NOTEBOOK = True
# Set a password for the notebook for increased security
# This is optional but *highly* recommended
NOTEBOOK_PASSWD = a-secret-password
```

Set `CLUSTER_SIZE` to `3` for more memory *(see [aws.amazon.com/ec2/instance-types](http://aws.amazon.com/ec2/instance-types/) for details)*:
```bash
[cluster smallcluster]
# number of ec2 instances to launch
CLUSTER_SIZE = 3
NODE_IMAGE_ID = ami-6b211202
PLUGINS = ipcluster
SPOT_BID = 0.10
```
Also set `SPOT_BID` to `0.10` (or less?) to save \$\$\$ *(see [aws.amazon.com/ec2/purchasing-options/spot-instances](http://aws.amazon.com/ec2/purchasing-options/spot-instances/) for details)*
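
Before starting the cluster, it can help to confirm that the settings above are actually picked up. The following is a minimal sketch, assuming the relevant sections are plain INI text that Python 3's standard `configparser` can read. The inline `SAMPLE_CONFIG` string and the `check_cluster_section` helper are illustrative names, not part of StarCluster.

```python
import configparser

# Illustrative copy of the relevant sections (assumption: your real file
# at ~/.starcluster/config contains equivalent entries).
SAMPLE_CONFIG = """
[plugin ipcluster]
SETUP_CLASS = starcluster.plugins.ipcluster.IPCluster
ENABLE_NOTEBOOK = True

[cluster smallcluster]
CLUSTER_SIZE = 3
NODE_IMAGE_ID = ami-6b211202
PLUGINS = ipcluster
SPOT_BID = 0.10
"""

def check_cluster_section(text, cluster="smallcluster"):
    """Return (size, spot_bid, plugins) for a cluster template."""
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(text)
    section = parser["cluster " + cluster]
    size = section.getint("CLUSTER_SIZE")
    bid = section.getfloat("SPOT_BID")
    raw_plugins = section.get("PLUGINS", fallback="")
    plugins = [p.strip() for p in raw_plugins.split(",") if p.strip()]
    # Every plugin named in PLUGINS needs a matching [plugin ...] section.
    missing = [p for p in plugins if not parser.has_section("plugin " + p)]
    if missing:
        raise ValueError("PLUGINS refers to undefined plugin(s): %s" % missing)
    return size, bid, plugins

size, bid, plugins = check_cluster_section(SAMPLE_CONFIG)
print(size, bid, plugins)
```

The check for a matching `[plugin ...]` section mirrors the reminder in the config comments to keep the `PLUGINS` list in sync with the plugin definitions.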

Start your new cluster:

`$ starcluster start my_cluster`

Copy your credentials to your cluster:

`$ starcluster put my_cluster --user sgeadmin ~/Downloads/credentials.csv /home/sgeadmin/`

This should, as a side effect, add your cluster to the list of known hosts on your machine. 
In my experience, it often doesn't, however. 
Therefore, **you will want to:**

```bash
starcluster sshmaster my_cluster
```
NOTE: This logs you into the master (leader) machine.

**before you do the following (or `Client` will hang forever):**

In [1]:
from IPython.parallel import Client
from os.path import expanduser

url_file = expanduser('~/.starcluster/ipcluster/SecurityGroup:@sc-my_cluster-us-east-1.json')
sshkey = expanduser('~/.ssh/Amazon_AWS_DataGuy.pem') 
client = Client(url_file, 
                sshkey = sshkey)

# the 'client' object can be used to reference the leader & worker instances 
# that are working on the cloud cluster
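
If the cluster tag or region changes, the connector file name changes with it. The sketch below builds the path from those two values, assuming the naming pattern seen in the cell above holds in general (this is inferred from that one path, not from StarCluster documentation). `connector_file` is a hypothetical helper.

```python
from os.path import expanduser, join

def connector_file(cluster_tag, region="us-east-1"):
    """Build the connector file path for a cluster.

    Assumption: the file name follows the pattern seen above,
    'SecurityGroup:@sc-<cluster_tag>-<region>.json'.
    """
    name = "SecurityGroup:@sc-{0}-{1}.json".format(cluster_tag, region)
    return join(expanduser("~"), ".starcluster", "ipcluster", name)

path = connector_file("my_cluster")
print(path)
```

Checking `os.path.exists(path)` before creating the `Client` is a cheap way to tell a missing connector file apart from a connection that hangs.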

In [2]:
# Check to see how many engines you have running:
# One engine is the Leader
# The other two engines are the followers

dview = client.direct_view()
len(client.ids)

3

In [3]:
# a view over all engines, so we can run code on each of them
all_engines = client[:]

In [4]:
def hostname():
    """Return the name of the host where the function is being called"""
    import socket
    return socket.gethostname()

hostname_apply_result = all_engines.apply(hostname)

In [5]:
# get engine names
hostname_apply_result.get()

['master', 'node002', 'node001']

In [6]:
# organize engine names as key-value pairs (engine id -> hostname)
hostnames = hostname_apply_result.get_dict()
hostname_apply_result.get_dict()

{0: 'master', 1: 'node002', 2: 'node001'}

In [7]:
# By using the engine name as the key, 
# we can refer to unique engines 
# and assign tasks to a specific engine 
one_engine_by_host = dict((hostname, engine_id) for engine_id, hostname
                      in hostnames.items())

In [8]:
one_engine_by_host

{'master': 0, 'node001': 2, 'node002': 1}
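
Inverting the mapping keeps only one engine per host, because later engine ids overwrite earlier ones. That is harmless here with one engine per machine, but a node may run several engines. The sketch below reuses the sample output from `In [6]` and also collects every engine id per host; `engines_by_host` is an illustrative name.

```python
from collections import defaultdict

# Sample mapping copied from the output of In [6].
hostnames = {0: 'master', 1: 'node002', 2: 'node001'}

# One engine per host: a later engine id overwrites an earlier one.
one_engine_by_host = {host: engine_id for engine_id, host in hostnames.items()}

# All engines per host, which matters once a node runs several engines.
engines_by_host = defaultdict(list)
for engine_id, host in sorted(hostnames.items()):
    engines_by_host[host].append(engine_id)

print(one_engine_by_host)
print(dict(engines_by_host))
```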

In [56]:
# import the needed libraries on all engines in use
with all_engines.sync_imports():
    import numpy

importing numpy on engine(s)


In [49]:
%%px  --targets=1 

!pip install scikit-learn

Downloading/unpacking scikit-learn
  Running setup.py egg_info for package scikit-learn
    Partial import of sklearn during the build process.
    
Installing collected packages: scikit-learn
  Running setup.py install for scikit-learn
    Partial import of sklearn during the build process.
    blas_opt_info:
    blas_mkl_info:
      libraries mkl,vml,guide not found in ['/usr/local/lib', '/usr/lib', '/usr/lib/x86_64-linux-gnu']
      NOT AVAILABLE
    
    openblas_info:
      libraries openblas not found in ['/usr/local/lib', '/usr/lib', '/usr/lib/x86_64-linux-gnu']
      NOT AVAILABLE
    
    atlas_blas_threads_info:
    Setting PTATLAS=ATLAS
      libraries ptf77blas,ptcblas,atlas not found in ['/usr/local/lib', '/usr/lib', '/usr/lib/x86_64-linux-gnu']
      NOT AVAILABLE
    
    atlas_blas_info:
      libraries f77blas,cblas,atlas not found in ['/usr/local/lib', '/usr/lib', '/usr/lib/x86_64-linux-gnu']
      NOT AVAILABLE
    
        Atlas (http://math

In [10]:
# copy the cross-validation data files to the StarCluster master node

! starcluster put my_cluster --user sgeadmin digits_cv_00* /mnt/sgeadmin/

StarCluster - (http://star.mit.edu/cluster) (v. 0.95.6)
Software Tools for Academics and Researchers (STAR)
Please submit bug reports to starcluster@mit.edu

digits_cv_000.pkl 100% |||||||||||||||||||||||||||||| Time: 00:00:00 577.39 K/s
digits_cv_000.pkl_01.npy 100% ||||||||||||||||||||||| Time: 00:00:01 639.64 K/s
digits_cv_000.pkl_02.npy 100% ||||||||||||||||||||||| Time: 00:00:00  25.84 M/s
digits_cv_000.pkl_03.npy 100% ||||||||||||||||||||||| Time: 00:00:00  28.14 M/s
digits_cv_000.pkl_04.npy 100% ||||||||||||||||||||||| Time: 00:00:00   4.09 M/s
digits_cv_001.pkl 100% |||||||||||||||||||||||||||||| Time: 00:00:00 572.60 K/s
digits_cv_001.pkl_01.npy 100% ||||||||||||||||||||||| Time: 00:00:00 875.33 K/s
digits_cv_001.pkl_02.npy 100% ||||||||||||||||||||||| Time: 00:00:00  14.99 M/s
digits_cv_001.pkl_03.npy 100% ||||||||||||||||||||||| Time: 00:00:00  27.39 M/s
digits_cv_001.pkl_04.npy 100% ||||||||||||||||||||||| Time: 00:00:00   7.16 M/s
digits_cv_002.pkl 100% |||||||||||||||||||

We need to copy the files from the ephemeral drive (*i.e.* `/mnt/`) on the master node to each of the other nodes, 
*e.g.*

In [11]:
%%px -t0
%%bash
scp /mnt/sgeadmin/digits_cv_00* node001:/mnt/sgeadmin/
scp /mnt/sgeadmin/digits_cv_00* node002:/mnt/sgeadmin/
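
With more than two workers, typing one `scp` line per node gets tedious. As a rough sketch, the commands can be generated from the node names. `scp_commands` is a hypothetical helper, and the `nodeNNN` naming is assumed from the hostname output earlier.

```python
def scp_commands(pattern, nodes, dest="/mnt/sgeadmin/"):
    """Return one scp command string per worker node."""
    return ["scp {0} {1}:{2}".format(pattern, node, dest) for node in nodes]

# Worker names follow the nodeNNN pattern seen in the hostname output.
workers = ["node{0:03d}".format(i) for i in range(1, 3)]
for command in scp_commands("/mnt/sgeadmin/digits_cv_00*", workers):
    print(command)
```

The function only builds strings. Running them is left to the `%%bash` cell above or to `subprocess` on the master node.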

You will also want to create a new list of filenames that point to the copies on the cluster.

## Memmapping CV Splits for Multiprocess Dataset Sharing

We can leverage the previous tools to build a utility function that **extracts cross-validation splits ahead of time** and persists them on the hard drive in a format suitable for memmapping by the IPython engine processes.

In [12]:
from sklearn.externals import joblib
from sklearn.cross_validation import ShuffleSplit
import os

def persist_cv_splits(X, y, n_cv_iter=5, name='data',
    suffix="_cv_%03d.pkl", test_size=0.25, random_state=None):
    """Materialize randomized train test splits of a dataset."""

    cv = ShuffleSplit(X.shape[0], n_iter=n_cv_iter,
        test_size=test_size, random_state=random_state)
    cv_split_filenames = []
    
    for i, (train, test) in enumerate(cv):
        cv_fold = (X[train], y[train], X[test], y[test])
        cv_split_filename = name + suffix % i
        cv_split_filename = os.path.abspath(cv_split_filename)
        joblib.dump(cv_fold, cv_split_filename)
        cv_split_filenames.append(cv_split_filename)
    
    return cv_split_filenames
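
To make the splitting step concrete, here is a standard-library sketch of what a shuffle split does: each iteration draws a fresh random permutation and cuts it into a test part and a train part. This is not scikit-learn's implementation. `shuffle_split_indices` is an illustrative name, and rounding the test size up is an assumption chosen to match the 1347/450 split that the digits data (1797 samples) produces.

```python
import math
import random

def shuffle_split_indices(n_samples, n_iter=5, test_size=0.25, random_state=None):
    """Yield (train, test) index lists from independent random permutations."""
    rng = random.Random(random_state)
    # Assumption: the test fraction is rounded up to a whole sample count.
    n_test = int(math.ceil(test_size * n_samples))
    for _ in range(n_iter):
        indices = list(range(n_samples))
        rng.shuffle(indices)
        yield indices[n_test:], indices[:n_test]

splits = list(shuffle_split_indices(1797, random_state=42))
train, test = splits[0]
print(len(splits), len(train), len(test))
```

Because every iteration reshuffles the full index range, the test sets of different iterations can overlap, unlike the folds of a k-fold split.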

In [13]:
from sklearn.datasets import load_digits

digits = load_digits()
digits_split_filenames = persist_cv_splits(digits.data, digits.target,
    name='digits', random_state=42)

In [31]:
digits_split_filenames

['/Users/Alexander/DSCI6007-student/week1/1.4/digits_cv_000.pkl',
 '/Users/Alexander/DSCI6007-student/week1/1.4/digits_cv_001.pkl',
 '/Users/Alexander/DSCI6007-student/week1/1.4/digits_cv_002.pkl',
 '/Users/Alexander/DSCI6007-student/week1/1.4/digits_cv_003.pkl',
 '/Users/Alexander/DSCI6007-student/week1/1.4/digits_cv_004.pkl']

In [14]:
remote_filenames = ['/mnt/sgeadmin/' + filename.split('/')[-1] for filename in digits_split_filenames]

In [15]:
remote_filenames

['/mnt/sgeadmin/digits_cv_000.pkl',
 '/mnt/sgeadmin/digits_cv_001.pkl',
 '/mnt/sgeadmin/digits_cv_002.pkl',
 '/mnt/sgeadmin/digits_cv_003.pkl',
 '/mnt/sgeadmin/digits_cv_004.pkl']
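
The same mapping can be written with path functions instead of string splitting. This sketch uses `posixpath` because the remote side is a Linux machine; `to_remote` is a hypothetical helper and the two local paths are copied from the listing above.

```python
import posixpath

def to_remote(local_paths, remote_dir="/mnt/sgeadmin"):
    """Map local file paths to the same file names under remote_dir."""
    return [posixpath.join(remote_dir, posixpath.basename(p))
            for p in local_paths]

local = ['/Users/Alexander/DSCI6007-student/week1/1.4/digits_cv_000.pkl',
         '/Users/Alexander/DSCI6007-student/week1/1.4/digits_cv_001.pkl']
print(to_remote(local))
```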

In [16]:
ls -lh digits*

-rw-r--r--  1 Alexander  staff   280B Aug 31 19:40 digits_cv_000.pkl
-rw-r--r--  1 Alexander  staff   674K Aug 31 19:40 digits_cv_000.pkl_01.npy
-rw-r--r--  1 Alexander  staff    11K Aug 31 19:40 digits_cv_000.pkl_02.npy
-rw-r--r--  1 Alexander  staff   225K Aug 31 19:40 digits_cv_000.pkl_03.npy
-rw-r--r--  1 Alexander  staff   3.6K Aug 31 19:40 digits_cv_000.pkl_04.npy
-rw-r--r--  1 Alexander  staff   280B Aug 31 19:40 digits_cv_001.pkl
-rw-r--r--  1 Alexander  staff   674K Aug 31 19:40 digits_cv_001.pkl_01.npy
-rw-r--r--  1 Alexander  staff    11K Aug 31 19:40 digits_cv_001.pkl_02.npy
-rw-r--r--  1 Alexander  staff   225K Aug 31 19:40 digits_cv_001.pkl_03.npy
-rw-r--r--  1 Alexander  staff   3.6K Aug 31 19:40 digits_cv_001.pkl_04.npy
-rw-r--r--  1 Alexander  staff   280B Aug 31 19:40 digits_cv_002.pkl
-rw-r--r--  1 Alexander  staff   674K Aug 31 19:40 digits_cv_002.pkl_01.npy
-rw-r--r--  1 Alexander  staff    11K Aug 31 19:40 digits_cv_002.pkl_02.npy
-rw-r--r--  1 Alexan

## Parallel Model Selection and Grid Search

In [26]:
import numpy as np
from pprint import pprint

svc_params = {
    'C': np.logspace(-1, 2, 4),
    'gamma': np.logspace(-4, 0, 5),
}
pprint(svc_params)

{'C': array([   0.1,    1. ,   10. ,  100. ]),
 'gamma': array([  1.00000000e-04,   1.00000000e-03,   1.00000000e-02,
         1.00000000e-01,   1.00000000e+00])}


`GridSearchCV` internally uses the following `ParameterGrid` utility iterator class to build the possible combinations of parameters:

In [27]:
from sklearn.grid_search import ParameterGrid

list(ParameterGrid(svc_params))

[{'C': 0.10000000000000001, 'gamma': 0.0001},
 {'C': 0.10000000000000001, 'gamma': 0.001},
 {'C': 0.10000000000000001, 'gamma': 0.01},
 {'C': 0.10000000000000001, 'gamma': 0.10000000000000001},
 {'C': 0.10000000000000001, 'gamma': 1.0},
 {'C': 1.0, 'gamma': 0.0001},
 {'C': 1.0, 'gamma': 0.001},
 {'C': 1.0, 'gamma': 0.01},
 {'C': 1.0, 'gamma': 0.10000000000000001},
 {'C': 1.0, 'gamma': 1.0},
 {'C': 10.0, 'gamma': 0.0001},
 {'C': 10.0, 'gamma': 0.001},
 {'C': 10.0, 'gamma': 0.01},
 {'C': 10.0, 'gamma': 0.10000000000000001},
 {'C': 10.0, 'gamma': 1.0},
 {'C': 100.0, 'gamma': 0.0001},
 {'C': 100.0, 'gamma': 0.001},
 {'C': 100.0, 'gamma': 0.01},
 {'C': 100.0, 'gamma': 0.10000000000000001},
 {'C': 100.0, 'gamma': 1.0}]
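
The expansion itself is a Cartesian product over the value lists. A standard-library sketch, not scikit-learn's code, with `parameter_grid` as an illustrative name and plain float lists standing in for the NumPy arrays:

```python
from itertools import product

def parameter_grid(param_grid):
    """Return every combination of the values in param_grid as a list of dicts."""
    keys = sorted(param_grid)
    return [dict(zip(keys, values))
            for values in product(*(param_grid[k] for k in keys))]

grid = parameter_grid({'C': [0.1, 1.0, 10.0, 100.0],
                       'gamma': [1e-4, 1e-3, 1e-2, 1e-1, 1.0]})
print(len(grid))
print(grid[0], grid[-1])
```

With 4 values for `C` and 5 for `gamma` this gives 20 combinations, and with 5 CV splits each the grid search below launches 100 tasks.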

Let's write a function to load the data from a CV split file and compute the validation score for a given parameter set and model:

In [29]:
def compute_evaluation(cv_split_filename, model, params):
    """Function executed by a worker to evaluate a model on a CV split"""
    # All module imports should be executed in the worker namespace
    from sklearn.externals import joblib

    X_train, y_train, X_validation, y_validation = joblib.load(
        cv_split_filename, mmap_mode='c')
    
    model.set_params(**params)
    model.fit(X_train, y_train)
    validation_score = model.score(X_validation, y_validation)
    return validation_score

In [30]:
def grid_search(lb_view, model, cv_split_filenames, param_grid):
    """Launch all grid search evaluation tasks."""
    all_tasks = []
    all_parameters = list(ParameterGrid(param_grid))
    
    for i, params in enumerate(all_parameters):
        task_for_params = []
        
        for j, cv_split_filename in enumerate(cv_split_filenames):    
            t = lb_view.apply(
                compute_evaluation, cv_split_filename, model, params)
            task_for_params.append(t) 
        
        all_tasks.append(task_for_params)
        
    return all_parameters, all_tasks

In [39]:
from sklearn.svm import SVC

lb_view = client.load_balanced_view()
model = SVC()
svc_params = {
    'C': np.logspace(-1, 2, 4),
    'gamma': np.logspace(-4, 0, 5),
}

# use the file paths on the cluster nodes, not the local ones
all_parameters, all_tasks = grid_search(
   lb_view, model, remote_filenames, svc_params)

The `grid_search` function uses the asynchronous API of the `LoadBalancedView`, so we can monitor the progress while the tasks run:

In [40]:
def progress(tasks):
    return np.mean([task.ready() for task_group in tasks
                                 for task in task_group])

In [42]:
print("Tasks completed: {0}%".format(100 * progress(all_tasks)))

Tasks completed: 100.0%
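
The submit-then-poll pattern can be tried without a cluster. In this sketch a `ThreadPoolExecutor` stands in for the `LoadBalancedView`, `Future.done()` plays the role of `ready()`, and `fake_evaluation` returns made-up scores. All of these are substitutions for illustration, not the IPython.parallel API.

```python
from concurrent.futures import ThreadPoolExecutor, wait

def fake_evaluation(split_id, params):
    """Stand-in for compute_evaluation: returns a deterministic fake score."""
    return 1.0 / (1.0 + abs(params['C'] - 10.0)) - 0.01 * split_id

def progress(tasks):
    """Fraction of finished tasks across all task groups."""
    flat = [task for group in tasks for task in group]
    return sum(task.done() for task in flat) / float(len(flat))

all_parameters = [{'C': c} for c in (0.1, 1.0, 10.0, 100.0)]
with ThreadPoolExecutor(max_workers=3) as pool:
    # One group of tasks per parameter set, one task per CV split.
    all_tasks = [[pool.submit(fake_evaluation, split_id, params)
                  for split_id in range(5)]
                 for params in all_parameters]
    wait([task for group in all_tasks for task in group])

print("Tasks completed: {0}%".format(100 * progress(all_tasks)))
```

Note that a task counts as finished even if it raised an exception, which is why 100% completion above does not guarantee that any score was computed.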


Even better, we can inspect the completed tasks to find the best parameter sets so far:

In [43]:
def find_bests(all_parameters, all_tasks, n_top=5):
    """Compute the mean score of the completed tasks"""
    mean_scores = []
    
    for param, task_group in zip(all_parameters, all_tasks):
        scores = [t.get() for t in task_group if t.ready()]
        if len(scores) == 0:
            continue
        mean_scores.append((np.mean(scores), param))
                   
    return sorted(mean_scores, reverse=True)[:n_top]
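
One caveat if this is run under Python 3: sorting `(score, params)` tuples compares the dicts whenever two mean scores tie, and dicts cannot be ordered there. The sketch below ranks on the score alone. It works on plain lists of scores rather than task objects, and both `top_mean_scores` and the numbers are made up for illustration.

```python
def top_mean_scores(all_parameters, all_scores, n_top=5):
    """Rank parameter sets by mean score, skipping sets with no scores yet."""
    ranked = []
    for params, scores in zip(all_parameters, all_scores):
        if not scores:
            continue
        ranked.append((sum(scores) / float(len(scores)), params))
    # Sort on the score only, so tied scores never compare the dicts.
    return sorted(ranked, key=lambda pair: pair[0], reverse=True)[:n_top]

params = [{'C': 1.0}, {'C': 10.0}, {'C': 100.0}]
scores = [[0.5, 0.75], [], [1.0, 0.5, 0.75]]   # made-up numbers
best = top_mean_scores(params, scores, n_top=2)
print(best)
```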

In [50]:
from pprint import pprint

print("Tasks completed: {0}%".format(100 * progress(all_tasks)))
pprint(find_bests(all_parameters, all_tasks))

Tasks completed: 100.0%


RemoteError: ImportError(No module named sklearn.svm.classes)

## Randomized Parameter Search

It is often wasteful to search all the possible combinations of parameters as done previously, especially if the number of parameters is large (e.g. more than 3).

To speed up the discovery of good parameter combinations, it is often faster to randomize the search order and allocate a budget of evaluations, e.g. 10 or 100 combinations.

See [this JMLR paper by James Bergstra](http://jmlr.csail.mit.edu/papers/v13/bergstra12a.html) for an empirical analysis of the problem. The interested reader should also have a look at [hyperopt](https://github.com/jaberg/hyperopt) that further refines this parameter search method using meta-optimizers.

Randomized Parameter Search has just been implemented in the master branch of scikit-learn and will be part of the 0.14 release.
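
The budget idea can be sketched in a few lines: enumerate the grid, shuffle it, and keep only the first `budget` combinations. This is a simplified illustration over a finite grid, not scikit-learn's randomized search; `sample_parameters` is a hypothetical name.

```python
import random
from itertools import product

def sample_parameters(param_grid, budget, random_state=None):
    """Return up to `budget` distinct parameter combinations in random order."""
    keys = sorted(param_grid)
    combos = [dict(zip(keys, values))
              for values in product(*(param_grid[k] for k in keys))]
    random.Random(random_state).shuffle(combos)
    return combos[:budget]

candidates = sample_parameters(
    {'C': [0.1, 1.0, 10.0, 100.0], 'gamma': [1e-4, 1e-3, 1e-2, 1e-1, 1.0]},
    budget=10, random_state=0)
print(len(candidates))
```

Here the budget halves the work, evaluating 10 of the 20 combinations. The savings grow quickly as more parameters are added to the grid.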

## A More Complete Parallel Model Selection and Assessment Example

In [51]:
%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np

# Some nice default configuration for plots
plt.rcParams['figure.figsize'] = 10, 7.5
plt.rcParams['axes.grid'] = True
plt.gray();

<matplotlib.figure.Figure at 0x107a81790>

In [52]:
lb_view = client.load_balanced_view()
model = SVC()

In [57]:
import sys, imp
from collections import OrderedDict
sys.path.append('..')
import model_selection, mmap_utils
imp.reload(model_selection), imp.reload(mmap_utils)

lb_view.abort()

svc_params = OrderedDict([
    ('gamma', np.logspace(-4, 0, 5)),
    ('C', np.logspace(-1, 2, 4)),
])

search = model_selection.RandomizedGridSeach(lb_view)
search.launch_for_splits(model, svc_params, remote_filenames)

CompositeError: one or more exceptions from call to method: load_in_memory
[1:apply]: ImportError: No module named sklearn.externals
[2:apply]: ImportError: No module named sklearn.externals
[0:apply]: ImportError: No module named sklearn.externals

# I can't go any further without first importing (or installing) sklearn on the engines,
# but StarCluster doesn't seem to let me do either.
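
When every engine reports an `ImportError`, a useful first step is to find out which engines can import the package at all. The helper below is a minimal sketch: `can_import` is a hypothetical name, and the import sits inside the function for the same reason as in `hostname()` above, so that it still works when the function is shipped to an engine. Run as shown, it only checks the local interpreter. Passing it to the engines, for example with `all_engines.apply_sync(can_import, 'sklearn')`, is an assumption about how you might use it with the view created earlier.

```python
def can_import(module_name):
    """Return True if module_name can be imported in the current process."""
    import importlib
    try:
        importlib.import_module(module_name)
    except ImportError:
        return False
    return True

print(can_import("json"), can_import("surely_not_a_real_module_xyz"))
```

If only some engines return `True`, the package was probably installed on a subset of nodes, as happens when a `%%px` cell is restricted with `--targets`.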