Again, please insert the credentials code for your Apache CouchDB based Cloudant instance below using the "Insert Code" function of Watson Studio.


In [1]:
#your cloudant credentials go here
# @hidden_cell
credentials_1 = {
  'password':"""""",
  'custom_url':'',
  'username':'',
  'url':'https://undefined'
}


Let's create a SparkSession object and put the Cloudant credentials into it.

In [2]:
from pyspark.sql import SparkSession

spark = SparkSession\
    .builder\
    .appName("Cloudant Spark SQL Example in Python using temp tables")\
    .config("cloudant.host",credentials_1['custom_url'].split('@')[1])\
    .config("cloudant.username", credentials_1['username'])\
    .config("cloudant.password",credentials_1['password'])\
    .config("jsonstore.rdd.partitions", 1)\
    .getOrCreate()
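
The `cloudant.host` setting above keeps whatever follows the `@` in `custom_url`. As a minimal sketch, assuming `custom_url` has the form `https://username:password@host` (the URL below is a made-up placeholder, not a real account), the extraction can be checked in isolation like this:

```python
from urllib.parse import urlparse

# Hypothetical placeholder URL; the real value comes from "Insert Code".
custom_url = "https://myuser:mypassword@myaccount.cloudant.com"

# The approach used in the cell above: keep everything after the '@'.
host_by_split = custom_url.split("@")[1]

# An alternative that lets the standard library parse the URL.
host_by_parse = urlparse(custom_url).hostname

print(host_by_split)
print(host_by_parse)
```

Both variants give the same host for this placeholder. If a password ever contained an `@` itself, `split('@')[1]` would return the wrong piece, so the `urlparse` variant may be the safer choice in that case.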

Now it’s time to have a look at the recorded sensor data. You should see data similar to the example below.


In [12]:
df=spark.read.load('shake_classification1', "com.cloudant.spark")

df.createOrReplaceTempView("df")
spark.sql("SELECT * from df").show()


+-----+--------+-----+-----+-----+--------------------+--------------------+
|CLASS|SENSORID|    X|    Y|    Z|                 _id|                _rev|
+-----+--------+-----+-----+-----+--------------------+--------------------+
|    0|bronsist| 0.42| 0.42| 0.42|3d6d1bfe7b2d5e286...|1-fd44d2472d5546a...|
|    0|bronsist|-0.13|-0.13|-0.13|3d6d1bfe7b2d5e286...|1-3ba067a4dd6a5ec...|
|    0|bronsist|-0.02|-0.02|-0.02|3d6d1bfe7b2d5e286...|1-a0cf67f39799ea2...|
|    0|bronsist|  0.0|  0.0|  0.0|3d6d1bfe7b2d5e286...|1-bda6b817f4b289f...|
|    0|bronsist|  0.0|  0.0|  0.0|3d6d1bfe7b2d5e286...|1-bda6b817f4b289f...|
|    0|bronsist| 0.01| 0.01| 0.01|3d6d1bfe7b2d5e286...|1-47f59512a7fe600...|
|    0|bronsist| 0.01| 0.01| 0.01|3d6d1bfe7b2d5e286...|1-47f59512a7fe600...|
|    0|bronsist|-0.01|-0.01|-0.01|3d6d1bfe7b2d5e286...|1-4cf105a41e10f81...|
|    0|bronsist|  0.0|  0.0|  0.0|3d6d1bfe7b2d5e286...|1-bda6b817f4b289f...|
|    0|bronsist| 0.01| 0.01| 0.01|3d6d1bfe7b2d5e286...|1-47f59512a7fe600...|

Let’s check whether the classes are balanced, meaning that we have roughly the same number of examples for each class we want to predict. This is important for classification and also helpful for clustering.

In [13]:
spark.sql("SELECT count(class), class from df group by class").show()

+------------+-----+
|count(class)|class|
+------------+-----+
|         834|    0|
|        1064|    1|
+------------+-----+
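
To make "roughly the same number" concrete, the counts from the output above can be turned into shares. This is only a small plain-Python sketch; the variable names are illustrative and not part of the assignment.

```python
# Counts copied from the query output above.
counts = {0: 834, 1: 1064}

total = sum(counts.values())
shares = {label: n / total for label, n in counts.items()}
imbalance_ratio = max(counts.values()) / min(counts.values())

for label, share in shares.items():
    print(f"class {label}: {share:.1%}")
print(f"largest / smallest class: {imbalance_ratio:.2f}")
```

With these counts the split is roughly 44% to 56%, so neither class dominates the data set.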



Let's create a VectorAssembler that consumes the columns X, Y and Z and produces a “features” column.


In [14]:
from pyspark.ml.feature import VectorAssembler
vectorAssembler = VectorAssembler(inputCols=["X","Y","Z"],
                                  outputCol="features")
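
As a rough illustration of what the assembler does to each row, here is a plain-Python analogue. It is not Spark code and only mimics the idea: the chosen input columns are collected, in order, into one new "features" value while the other columns stay untouched.

```python
# Values copied from the first row of the sensor data shown earlier.
row = {"CLASS": 0, "SENSORID": "bronsist", "X": 0.42, "Y": 0.42, "Z": 0.42}

input_cols = ["X", "Y", "Z"]

# Pick the input columns in order and store them under a new "features" key.
assembled = dict(row, features=[row[c] for c in input_cols])

print(assembled["features"])
```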

Please instantiate a clustering algorithm from the SparkML package and assign it to the clust variable. We don’t need to take care of the “CLASS” column here since we are in unsupervised learning mode, so let’s pretend we don’t even have it for now. It will become very handy later for assessing the clustering performance. PLEASE NOTE: IN REAL-WORLD SCENARIOS THERE IS NO CLASS COLUMN, SO YOU CAN’T ASSESS CLUSTERING PERFORMANCE USING THIS COLUMN.



In [15]:
from pyspark.ml.clustering import KMeans

clust = KMeans().setK(2).setSeed(1)
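
For readers who want to see what a k-means run with k = 2 does conceptually, here is a minimal plain-Python sketch of Lloyd's algorithm. It is not Spark's implementation (Spark runs distributed and, by default, uses a different initialisation); the sample points and the fixed starting centers are hypothetical and chosen only to keep the example deterministic.

```python
import math

def kmeans(points, centers, iterations=10):
    """Lloyd's algorithm: alternate between assignment and centroid update."""
    labels = []
    for _ in range(iterations):
        # Assignment step: index of the nearest center for each point.
        labels = [
            min(range(len(centers)), key=lambda j: math.dist(p, centers[j]))
            for p in points
        ]
        # Update step: move each center to the mean of its assigned points.
        new_centers = []
        for j, old in enumerate(centers):
            members = [p for p, l in zip(points, labels) if l == j]
            if members:
                new_centers.append(
                    tuple(sum(axis) / len(members) for axis in zip(*members))
                )
            else:
                new_centers.append(old)  # keep the center of an empty cluster
        if new_centers == centers:
            break  # centers stopped moving, so the assignment is stable
        centers = new_centers
    return labels, centers

# Hypothetical (X, Y, Z) readings: three near rest, three with larger values.
points = [(0.0, 0.0, 0.0), (0.01, 0.01, 0.01), (-0.02, -0.02, -0.02),
          (1.5, 1.4, 1.6), (1.7, 1.5, 1.6), (1.6, 1.5, 1.5)]

labels, centers = kmeans(points, centers=[points[0], points[3]])
print(labels)
```

If the accuracy later turns out to be too low, other estimators in `pyspark.ml.clustering`, such as `BisectingKMeans` or `GaussianMixture`, can be assigned to `clust` in a similar way.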

Let’s train...


In [16]:
from pyspark.ml import Pipeline
pipeline = Pipeline(stages=[vectorAssembler, clust])
model = pipeline.fit(df)

...and evaluate...

In [17]:
prediction = model.transform(df)
prediction.show()

+-----+--------+-----+-----+-----+--------------------+--------------------+-------------------+----------+
|CLASS|SENSORID|    X|    Y|    Z|                 _id|                _rev|           features|prediction|
+-----+--------+-----+-----+-----+--------------------+--------------------+-------------------+----------+
|    0|bronsist| 0.42| 0.42| 0.42|3d6d1bfe7b2d5e286...|1-fd44d2472d5546a...|   [0.42,0.42,0.42]|         1|
|    0|bronsist|-0.13|-0.13|-0.13|3d6d1bfe7b2d5e286...|1-3ba067a4dd6a5ec...|[-0.13,-0.13,-0.13]|         1|
|    0|bronsist|-0.02|-0.02|-0.02|3d6d1bfe7b2d5e286...|1-a0cf67f39799ea2...|[-0.02,-0.02,-0.02]|         1|
|    0|bronsist|  0.0|  0.0|  0.0|3d6d1bfe7b2d5e286...|1-bda6b817f4b289f...|          (3,[],[])|         1|
|    0|bronsist|  0.0|  0.0|  0.0|3d6d1bfe7b2d5e286...|1-bda6b817f4b289f...|          (3,[],[])|         1|
|    0|bronsist| 0.01| 0.01| 0.01|3d6d1bfe7b2d5e286...|1-47f59512a7fe600...|   [0.01,0.01,0.01]|         1|
|    0|bronsist| 0.01| 0.01|

In [18]:
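# KMeans assigns cluster ids arbitrarily, so cluster 1 may correspond to class 0.
# The query therefore counts both the matches and the mismatches between
# class and prediction and reports the larger share as the accuracy.
prediction.createOrReplaceTempView('prediction')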
spark.sql('''
select max(correct)/max(total) as accuracy from (

    select sum(correct) as correct, count(correct) as total from (
        select case when class != prediction then 1 else 0 end as correct from prediction 
    ) 
    
    union
    
    select sum(correct) as correct, count(correct) as total from (
        select case when class = prediction then 1 else 0 end as correct from prediction 
    ) 
)
''').rdd.map(lambda row: row.accuracy).collect()[0]

0.6654373024236038
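
The query above can look odd at first, because one branch counts rows where class and prediction differ as "correct". The reason is that cluster ids are arbitrary: with two clusters, either the matches or the mismatches represent the better mapping between clusters and classes. The same logic in plain Python, using hypothetical labels for eight samples, might look like this:

```python
# Hypothetical true classes and cluster ids for eight samples.
classes     = [0, 0, 0, 1, 1, 1, 1, 0]
predictions = [1, 1, 1, 0, 0, 0, 1, 1]

total = len(classes)
matches = sum(c == p for c, p in zip(classes, predictions))
mismatches = total - matches

# Take whichever mapping between cluster ids and classes fits better.
accuracy = max(matches, mismatches) / total
print(accuracy)
```

Here the clusters are almost perfectly swapped relative to the classes, so the mismatch count is the one that reflects how well the clustering separates the two classes. Note that this shortcut only works for two clusters and two classes.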

If you reached an accuracy of at least 55%, you are fine to submit your predictions to the grader. Otherwise, or in case you are stuck, please experiment with the parameter settings of your clustering algorithm, use a different algorithm, or just re-record your data and try again. Please note again that in a real-world scenario there is no way of doing this, since there is no class label in your data. Please have a look at this further reading on clustering performance evaluation: https://en.wikipedia.org/wiki/Cluster_analysis#Evaluation_and_assessment
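
One label-free measure covered in that further reading is the silhouette coefficient, which compares how close each point is to its own cluster with how close it is to the nearest other cluster. The following is a minimal plain-Python sketch using Euclidean distance on hypothetical points; it assumes at least two clusters and not all points identical. Recent Spark versions also ship a `ClusteringEvaluator` in `pyspark.ml.evaluation` that computes a silhouette score on a DataFrame, which would be the more practical choice inside this notebook.

```python
import math

def silhouette(points, labels):
    """Mean silhouette coefficient (Euclidean); assumes >= 2 clusters."""
    scores = []
    for i, (p, own) in enumerate(zip(points, labels)):
        # a: mean distance to the other members of the point's own cluster.
        same = [math.dist(p, q)
                for j, (q, l) in enumerate(zip(points, labels))
                if l == own and j != i]
        if not same:
            scores.append(0.0)  # common convention for singleton clusters
            continue
        a = sum(same) / len(same)
        # b: smallest mean distance to the members of any other cluster.
        b = min(
            sum(math.dist(p, q) for q, l in zip(points, labels) if l == other)
            / labels.count(other)
            for other in set(labels) if other != own
        )
        scores.append((b - a) / max(a, b))
    return sum(scores) / len(scores)

# Hypothetical (X, Y, Z) readings forming two well-separated groups.
points = [(0.0, 0.0, 0.0), (0.01, 0.01, 0.01), (-0.02, -0.02, -0.02),
          (1.5, 1.4, 1.6), (1.7, 1.5, 1.6), (1.6, 1.5, 1.5)]

good = silhouette(points, [0, 0, 0, 1, 1, 1])
bad = silhouette(points, [0, 1, 0, 1, 0, 1])
print(good, bad)
```

Values close to 1 indicate compact, well-separated clusters, while values near 0 or below suggest that the assignment does not reflect the structure of the data.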
