In [None]:
REQUIREMENTS 
Classify sensor data into multiple categories: brushing teeth, climbing stairs, etc.

REFERENCE 
https://spark.apache.org/docs/latest/ml-classification-regression.html 
https://spark.apache.org/docs/latest/api/python/_modules/pyspark/ml/classification.html

This notebook is designed to run in an IBM Watson Studio default runtime (NOT the Watson Studio Apache Spark runtime, because the default runtime with 1 vCPU is free of charge). We therefore install Apache Spark in local mode, for test purposes only. Don't use this setup in production.

The notebook should also work outside Watson Studio. If you already have an Apache Spark context there, remove the Apache Spark setup in the first notebook cells.

In [1]:
from IPython.display import Markdown, display
def printmd(string):
    display(Markdown('# <span style="color:red">'+string+'</span>'))


if ('sc' in locals() or 'sc' in globals()):
    printmd('<<<<<!!!!! It seems that you are running in an IBM Watson Studio Apache Spark Notebook. Please run it in an IBM Watson Studio Default Runtime (without Apache Spark) !!!!!>>>>>')


In [2]:
!pip install pyspark==2.4.5

Collecting pyspark==2.4.5
  Downloading https://files.pythonhosted.org/packages/9a/5a/271c416c1c2185b6cb0151b29a91fff6fcaed80173c8584ff6d20e46b465/pyspark-2.4.5.tar.gz (217.8MB)
Collecting py4j==0.10.7 (from pyspark==2.4.5)
  Downloading https://files.pythonhosted.org/packages/e3/53/c737818eb9a7dc32a7cd4f1396e787bd94200c3997c72c1dbe028587bd76/py4j-0.10.7-py2.py3-none-any.whl (197kB)
Building wheels for collected packages: pyspark
  Building wheel for pyspark (setup.py) ... done
  Stored in directory: /home/dsxuser/.cache/pip/wheel

In [3]:
try:
    from pyspark import SparkContext, SparkConf
    from pyspark.sql import SparkSession
except ImportError as e:
    printmd('<<<<<!!!!! Please restart your kernel after installing Apache Spark !!!!!>>>>>')

In [4]:
sc = SparkContext.getOrCreate(SparkConf().setMaster("local[*]"))

spark = SparkSession \
    .builder \
    .getOrCreate()

In [5]:
!wget https://github.com/IBM/coursera/raw/master/coursera_ml/a2.parquet

--2020-10-02 06:37:33--  https://github.com/IBM/coursera/raw/master/coursera_ml/a2.parquet
Resolving github.com (github.com)... 140.82.113.3
Connecting to github.com (github.com)|140.82.113.3|:443... connected.
HTTP request sent, awaiting response... 301 Moved Permanently
Location: https://github.com/IBM/skillsnetwork/raw/master/coursera_ml/a2.parquet [following]
--2020-10-02 06:37:33--  https://github.com/IBM/skillsnetwork/raw/master/coursera_ml/a2.parquet
Reusing existing connection to github.com:443.
HTTP request sent, awaiting response... 302 Found
Location: https://raw.githubusercontent.com/IBM/skillsnetwork/master/coursera_ml/a2.parquet [following]
--2020-10-02 06:37:34--  https://raw.githubusercontent.com/IBM/skillsnetwork/master/coursera_ml/a2.parquet
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 199.232.8.133
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|199.232.8.133|:443... connected.
HTTP request sent, awaiting response... 200 OK

Now it’s time to look at the recorded sensor data. You should see data similar to the sample below.


In [6]:
df=spark.read.load('a2.parquet')

df.createOrReplaceTempView("df")
spark.sql("SELECT * from df").show()

splits = df.randomSplit([0.8, 0.2])
df_train = splits[0]
df_test = splits[1]


+-----+-----------+-------------------+-------------------+-------------------+
|CLASS|   SENSORID|                  X|                  Y|                  Z|
+-----+-----------+-------------------+-------------------+-------------------+
|    0|         26| 380.66434005495194| -139.3470983812975|-247.93697521077704|
|    0|         29| 104.74324299209692| -32.27421440203938|-25.105013725863852|
|    0| 8589934658| 118.11469236129976| 45.916682927433534| -87.97203782706572|
|    0|34359738398| 246.55394030642543|-0.6122810693132044|-398.18662513951506|
|    0|17179869241|-190.32584900181487|  234.7849657520335|-206.34483804019288|
|    0|25769803830| 178.62396382387422| -47.07529438881511|  84.38310769821979|
|    0|25769803831|  85.03128805189493|-4.3024316644854546|-1.1841857567516714|
|    0|34359738411| 26.786262674736566| -46.33193951911338| 20.880756008396055|
|    0| 8589934592|-16.203752396859194| 51.080957032176954| -96.80526656416971|
|    0|25769803852|   47.2048142440404| 

Please create a VectorAssembler that consumes columns X, Y and Z and produces a column “features”.


In [9]:
from pyspark.ml.feature import VectorAssembler
vectorAssembler = VectorAssembler(inputCols=["X","Y","Z"], outputCol="features")

Please instantiate a classifier from the SparkML package and assign it to the classifier variable. Make sure to either
1.	Rename the “CLASS” column to “label”, or
2.	Set the classifier's labelCol parameter to “CLASS”.


In [17]:
# REMEMBER the REQUIREMENTS: 
# Classify sensor data into multiple categories: brushing teeth, climbing stairs, etc. 
# Note: Since most of the Machine Learning algorithms work with numerical data, we map the target categories: 
# brushing teeth, climbing stairs, etc. to CLASS 0, 1, etc.
# 
# Try solving the problem using the following Spark ML classifiers and compare the results (the prediction accuracy):
# LogisticRegression 
# LinearSVC or Support Vector Machine (SVM)
# DecisionTreeClassifier 
# RandomForestClassifier 
# GBTClassifier or Gradient-Boosted Trees 
# 
# REMEMBER: DECISION TREE vs RANDOM FOREST vs GRADIENT BOOSTING MACHINES 
# A decision tree is a simple decision-making diagram. 
# Random forests are a large number of trees, combined (using averages or "majority rules") at the end of the process. 
# Gradient boosting machines also combine decision trees, but start the combining process at the beginning, instead of at the end.
# 
# Reference links with code examples: 
# https://spark.apache.org/docs/latest/ml-classification-regression.html 
# https://spark.apache.org/docs/latest/api/python/_modules/pyspark/ml/classification.html
# 
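# NOTE:
# NaiveBayes CANNOT be used in this case since Naive Bayes requires nonnegative feature values.
# In Spark ML, LinearSVC and GBTClassifier support binary classification only, so they apply here only if CLASS takes exactly two values.
# XGBoost is not included in Spark ML, but it is an implementation of Gradient Boosted Decision Trees designed for improved speed and performance.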

# SOLUTION 1: LogisticRegression
from pyspark.ml.classification import LogisticRegression
# Make sure to also specify featuresCol and labelCol
# classifier = ### YOUR CODE HERE ###

# SOLUTION 2: LinearSVC or Support Vector Machine (SVM)
# from pyspark.ml.classification import LinearSVC
# Make sure to also specify featuresCol and labelCol
# classifier = ### YOUR CODE HERE ###

# SOLUTION 3: DecisionTreeClassifier
# from pyspark.ml.classification import DecisionTreeClassifier
# Make sure to also specify featuresCol and labelCol
# classifier = ### YOUR CODE HERE ###

# SOLUTION 4: RandomForestClassifier
# from pyspark.ml.classification import RandomForestClassifier
# Make sure to also specify featuresCol and labelCol
# classifier = ### YOUR CODE HERE ###

# SOLUTION 5: GBTClassifier
# from pyspark.ml.classification import GBTClassifier
# Make sure to also specify featuresCol and labelCol
# classifier = ### YOUR CODE HERE ###
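The contrast between random forests and gradient boosting described in the comments above can be sketched in plain Python. This is a toy with hand-written one-split rules (all thresholds and weights are made up for illustration) and it shows only how predictions are combined; it is not how Spark ML implements RandomForestClassifier or GBTClassifier. In real boosting, each stage would additionally be trained on the errors of the running score.

```python
from collections import Counter

# Three hypothetical one-split "trees", each voting for class 0 or 1.
trees = [
    lambda x: 1 if x[0] > 0.0 else 0,
    lambda x: 1 if x[1] > 1.0 else 0,
    lambda x: 1 if x[2] > -1.0 else 0,
]

def forest_predict(x):
    """Forest style: trees are independent, votes are combined at the end."""
    votes = Counter(tree(x) for tree in trees)
    return votes.most_common(1)[0][0]

# Boosting style: each stage adds a signed correction to a running score.
stages = [
    lambda x: 0.8 if x[0] > 0.0 else -0.8,
    lambda x: 0.5 if x[1] > 1.0 else -0.5,
    lambda x: 0.2 if x[2] > -1.0 else -0.2,
]

def boosted_predict(x):
    """Sum the stage outputs one after another, then threshold the total."""
    score = 0.0
    for stage in stages:
        score += stage(x)
    return 1 if score > 0 else 0

sample = (0.5, 0.0, -2.0)
forest_label = forest_predict(sample)    # votes are 1, 0, 0
boosted_label = boosted_predict(sample)  # score is 0.8 - 0.5 - 0.2
```

On this sample the two schemes disagree: the majority vote yields 0, while the weighted running score stays slightly positive and yields 1.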


Let’s train and evaluate…


In [18]:
from pyspark.ml import Pipeline
pipeline = Pipeline(stages=[vectorAssembler, classifier])

In [19]:
model = pipeline.fit(df_train)

In [20]:
prediction = model.transform(df_train)

In [21]:
prediction.show()

+-----+--------+-------------------+-------------------+-------------------+--------------------+--------------------+--------------------+----------+
|CLASS|SENSORID|                  X|                  Y|                  Z|            features|       rawPrediction|         probability|prediction|
+-----+--------+-------------------+-------------------+-------------------+--------------------+--------------------+--------------------+----------+
|    0|       1|-122.39060867226797|  46.13548501249578|-45.727305937345506|[-122.39060867226...|[1.32875775550098...|[0.93447269679847...|       0.0|
|    0|       1| 15.798748332829806| -86.21159407546875|   85.2514617870864|[15.7987483328298...|[1.32590267922033...|[0.93412217565278...|       0.0|
|    0|       2|-60.287010425683505| 18.442246406638773|  88.30025324517945|[-60.287010425683...|[1.32590267922033...|[0.93412217565278...|       0.0|
|    0|       3| 122.79284074820067| -88.19527091272191|-185.40334606851977|[122.792840748200.

In [22]:
from pyspark.ml.evaluation import MulticlassClassificationEvaluator
binEval = MulticlassClassificationEvaluator().setMetricName("accuracy").setPredictionCol("prediction").setLabelCol("CLASS")
# prediction accuracy on training data:    
binEval.evaluate(prediction) 

0.9975165562913907
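For reference, the accuracy metric reported by the evaluator is the fraction of rows whose prediction equals the label. A plain-Python sketch on made-up values (the helper name and the toy lists are hypothetical, not part of the notebook):

```python
def accuracy(labels, predictions):
    """Return the fraction of positions where the prediction equals the label."""
    if len(labels) != len(predictions):
        raise ValueError("labels and predictions must have the same length")
    if not labels:
        raise ValueError("cannot compute accuracy of an empty list")
    correct = sum(1 for y, p in zip(labels, predictions) if y == p)
    return correct / len(labels)

# Made-up labels (ints, like CLASS) and predictions (floats, like the prediction column).
toy_labels = [0, 0, 1, 1, 0]
toy_predictions = [0.0, 0.0, 1.0, 0.0, 0.0]
toy_accuracy = accuracy(toy_labels, toy_predictions)  # 4 of 5 rows match
```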

In [23]:
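# Reuse the model fitted on df_train; refitting on df_test would score the model on rows it was trained on.
# NOTE: the recorded value below came from a run that refit the pipeline on df_test, so rerunning this cell may give a different number.
prediction = model.transform(df_test)
# prediction accuracy on test data:
binEval.evaluate(prediction)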

0.9984025559105432
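Why the test score should come from a model that has never seen the test rows can be illustrated with a deliberately extreme toy: a hypothetical classifier that simply memorizes its training rows. Everything below is made up for illustration and is independent of Spark.

```python
class MemorizingClassifier:
    """Hypothetical toy model: remembers training rows, answers a default otherwise."""

    def __init__(self, default=0):
        self.default = default
        self.table = {}

    def fit(self, rows, labels):
        self.table = dict(zip(rows, labels))
        return self

    def predict(self, rows):
        return [self.table.get(row, self.default) for row in rows]

def accuracy(labels, predictions):
    return sum(y == p for y, p in zip(labels, predictions)) / len(labels)

train_rows = [(1.0, 2.0), (3.0, 4.0), (-1.0, -2.0), (-3.0, -4.0)]
train_labels = [1, 1, 0, 0]
test_rows = [(2.0, 3.0), (-2.0, -3.0)]
test_labels = [1, 0]

# Fit once on the training rows, then score on both sets.
toy_model = MemorizingClassifier().fit(train_rows, train_labels)
train_accuracy = accuracy(train_labels, toy_model.predict(train_rows))
heldout_accuracy = accuracy(test_labels, toy_model.predict(test_rows))

# Refitting on the test rows before scoring them hides the problem.
refit_model = MemorizingClassifier().fit(test_rows, test_labels)
refit_accuracy = accuracy(test_labels, refit_model.predict(test_rows))
```

The memorizing model looks perfect on any data it was fitted on, and only the held-out score reveals that it has learned nothing general.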

If you are happy with the result (I’m happy with > 0.55), share your solution with the others.

This exercise was inspired by the second assignment of the Coursera course "Advanced Machine Learning and Signal Processing".
Reference: https://www.coursera.org/learn/advanced-machine-learning-signal-processing