# Hadoop Livy test - run a remote Spark MLlib job

Load the sparkmagic notebook extension.

In [None]:
%load_ext sparkmagic.magics

*Connect to the Spark cluster running on Hadoop.*

*Replace YOUR_HADOOP_HOSTNAME with the Hadoop host that is running Livy.*

*You may also need to change the port in the URL. Common defaults are 8998 and 8999.*

In [None]:
%spark add -s test -l python -u http://YOUR_HADOOP_HOSTNAME:8999 -a u -k
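
If the connection above fails, it can help to check that the Livy REST endpoint answers at all before debugging sparkmagic. The following is a minimal sketch, meant to run in a local (non-`%%spark`) cell, that queries Livy's `GET /sessions` endpoint. The helper names `livy_sessions_url` and `list_livy_sessions` are hypothetical, and the sketch assumes plain HTTP without authentication.

```python
import json
from urllib.error import URLError
from urllib.request import urlopen


def livy_sessions_url(host, port=8998):
    """Build the URL of Livy's session listing endpoint (hypothetical helper)."""
    return "http://{}:{}/sessions".format(host, port)


def list_livy_sessions(host, port=8998, timeout=5):
    """Return the parsed JSON from GET /sessions, or None if Livy is unreachable."""
    try:
        with urlopen(livy_sessions_url(host, port), timeout=timeout) as resp:
            return json.loads(resp.read().decode("utf-8"))
    except (URLError, OSError, ValueError):
        # Connection refused, DNS failure, timeout, or a non-JSON reply
        return None


# Example call (replace the placeholder host name first):
# print(list_livy_sessions("YOUR_HADOOP_HOSTNAME", port=8999))
```

A `None` result suggests that the host name or port is wrong, or that Livy is not running there.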

Execute the Spark KMeans Clustering sample shown in https://spark.apache.org/docs/latest/ml-clustering.html

First, load sample_kmeans_data.txt into HDFS.

In [None]:
%%spark
# Declare a utility function to run commands on the Hadoop cluster
import time
from subprocess import Popen, PIPE, STDOUT

def run_command(command, sleepAfter=None):
    # Run the command in a shell, merging stderr into stdout
    p = Popen(command, shell=True, stdin=PIPE, stdout=PIPE, stderr=STDOUT, close_fds=True)
    output = p.stdout.read()
    if output:
        print(output)
    # Optionally pause (in seconds) after the command has produced its output
    if sleepAfter is not None:
        time.sleep(sleepAfter)

# download the sample_kmeans_data.txt raw data file 
run_command("curl -X GET https://raw.githubusercontent.com/apache/spark/master/data/mllib/sample_kmeans_data.txt --output sample_kmeans_data.txt")

# put the data in hdfs 
run_command("hdfs dfs -put -f sample_kmeans_data.txt /tmp")
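
The sample file is in LIBSVM text format, where each line holds a label followed by `index:value` pairs. As a rough illustration of that layout, here is a small plain-Python sketch that parses one such line. `parse_libsvm_line` is a hypothetical helper, not part of Spark, and the example line is assumed to match the first row of sample_kmeans_data.txt.

```python
def parse_libsvm_line(line):
    """Split a LIBSVM-format line into (label, {index: value}).

    Indices are kept exactly as written in the file (1-based).
    """
    parts = line.split()
    label = float(parts[0])
    features = {}
    for item in parts[1:]:
        index, value = item.split(":")
        features[int(index)] = float(value)
    return label, features


# Assumed first row of the sample file
label, features = parse_libsvm_line("0 1:0.0 2:0.0 3:0.0")
print(label, features)
```

Spark's `libsvm` data source does this parsing itself and produces a `label` column and a `features` vector column.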

Run the clustering model.

Note: Spark MLlib requires the Python numpy package on the Hadoop worker nodes. If it is missing, you may need to install it on each node with commands such as:

`yum install python-pip`

`pip install numpy`
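
One quick way to see whether numpy is importable is sketched below. `has_module` is a hypothetical helper and the sketch assumes a Python 3 interpreter. Run inside a `%%spark` cell, it only checks the Python used by the Livy session (the driver), not every worker node.

```python
import importlib.util


def has_module(name):
    """Return True if this interpreter can locate the named top-level module."""
    return importlib.util.find_spec(name) is not None


print("numpy available:", has_module("numpy"))
```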

In [None]:
%%spark
from pyspark.ml.clustering import KMeans
from pyspark.ml.evaluation import ClusteringEvaluator

# Loads data.
dataset = spark.read.format("libsvm").load("/tmp/sample_kmeans_data.txt")

# Trains a k-means model.
kmeans = KMeans().setK(2).setSeed(1)
model = kmeans.fit(dataset)

# Make predictions
predictions = model.transform(dataset)

# Evaluate clustering by computing Silhouette score
evaluator = ClusteringEvaluator()

silhouette = evaluator.evaluate(predictions)
print("Silhouette with squared euclidean distance = " + str(silhouette))

# Shows the result.
centers = model.clusterCenters()
print("Cluster Centers: ")
for center in centers:
    print(center)
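
To give a feel for what the Silhouette score measures, the sketch below computes a mean silhouette with squared Euclidean distance in plain Python. It follows the textbook definition and is not Spark's implementation. The points are assumed to match the six rows of sample_kmeans_data.txt, and the labels assume the two obvious groups found with k=2.

```python
def squared_distance(p, q):
    """Squared Euclidean distance between two equal-length points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def mean_silhouette(points, labels):
    """Average silhouette coefficient over all points."""
    scores = []
    for i, p in enumerate(points):
        # Group squared distances from p to every other point by cluster label
        by_cluster = {}
        for j, q in enumerate(points):
            if i != j:
                by_cluster.setdefault(labels[j], []).append(squared_distance(p, q))
        own = by_cluster.pop(labels[i], None)
        if not own or not by_cluster:
            scores.append(0.0)  # singleton cluster, or only one cluster present
            continue
        a = sum(own) / len(own)  # mean distance within the point's own cluster
        b = min(sum(d) / len(d) for d in by_cluster.values())  # nearest other cluster
        denom = max(a, b)
        scores.append((b - a) / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Assumed contents of the sample file: three points near 0 and three near 9
points = [(v, v, v) for v in (0.0, 0.1, 0.2, 9.0, 9.1, 9.2)]
labels = [0, 0, 0, 1, 1, 1]
print(mean_silhouette(points, labels))
```

A value close to 1 means each point lies much nearer to its own cluster than to the other one, which is what two well-separated groups should produce.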

In [None]:
%spark cleanup