## Apache Spark Machine Learning using DataFrames in Google Colab

1.	Set up an Apache Spark instance in Google Colab

In [1]:
# install java
!apt-get install openjdk-8-jdk-headless -qq > /dev/null

# download Spark (change the version number if needed)
!wget -q https://archive.apache.org/dist/spark/spark-3.0.2/spark-3.0.2-bin-hadoop2.7.tgz

# extract the Spark archive into the current folder
!tar xf spark-3.0.2-bin-hadoop2.7.tgz

# point the SPARK_HOME environment variable at the extracted Spark folder
import os
os.environ["SPARK_HOME"] = "/content/spark-3.0.2-bin-hadoop2.7"

# install findspark using pip
!pip install -q findspark
import findspark
findspark.init()

2.	Create a Spark session

In [2]:
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .master("local")
         .appName("Colab")
         .config("spark.ui.port", "4050")
         .getOrCreate())

spark

3.	Download the Iris dataset and another dataset of your choosing

In [3]:
!wget "https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data" -O sample_data/iris.data

--2022-03-20 12:05:03--  https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data
Resolving archive.ics.uci.edu (archive.ics.uci.edu)... 128.195.10.252
Connecting to archive.ics.uci.edu (archive.ics.uci.edu)|128.195.10.252|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 4551 (4.4K) [application/x-httpd-php]
Saving to: ‘sample_data/iris.data’


2022-03-20 12:05:03 (89.1 MB/s) - ‘sample_data/iris.data’ saved [4551/4551]



4.	Import the Iris dataset into a DataFrame and use df.show() to display it. The file has no header row, so toDF() supplies the column names.

In [4]:
df = spark.read.csv("sample_data/iris.data", inferSchema=True)\
.toDF("SepalLength", "SepalWidth", "PetalLength", "PetalWidth", "Class")

In [5]:
df.show()

+-----------+----------+-----------+----------+-----------+
|SepalLength|SepalWidth|PetalLength|PetalWidth|      Class|
+-----------+----------+-----------+----------+-----------+
|        5.1|       3.5|        1.4|       0.2|Iris-setosa|
|        4.9|       3.0|        1.4|       0.2|Iris-setosa|
|        4.7|       3.2|        1.3|       0.2|Iris-setosa|
|        4.6|       3.1|        1.5|       0.2|Iris-setosa|
|        5.0|       3.6|        1.4|       0.2|Iris-setosa|
|        5.4|       3.9|        1.7|       0.4|Iris-setosa|
|        4.6|       3.4|        1.4|       0.3|Iris-setosa|
|        5.0|       3.4|        1.5|       0.2|Iris-setosa|
|        4.4|       2.9|        1.4|       0.2|Iris-setosa|
|        4.9|       3.1|        1.5|       0.1|Iris-setosa|
|        5.4|       3.7|        1.5|       0.2|Iris-setosa|
|        4.8|       3.4|        1.6|       0.2|Iris-setosa|
|        4.8|       3.0|        1.4|       0.1|Iris-setosa|
|        4.3|       3.0|        1.1|    

5.	Spark ML estimators expect all features in a single vector column, so we use VectorAssembler to combine the four measurement columns into one.

In [6]:
from pyspark.ml.feature import VectorAssembler

# combine the four numeric columns into a single vector column named "features"
vector_assembler = VectorAssembler(
    inputCols=["SepalLength", "SepalWidth", "PetalLength", "PetalWidth"],
    outputCol="features")
df_temp = vector_assembler.transform(df)
df_temp.show(3)

+-----------+----------+-----------+----------+-----------+-----------------+
|SepalLength|SepalWidth|PetalLength|PetalWidth|      Class|         features|
+-----------+----------+-----------+----------+-----------+-----------------+
|        5.1|       3.5|        1.4|       0.2|Iris-setosa|[5.1,3.5,1.4,0.2]|
|        4.9|       3.0|        1.4|       0.2|Iris-setosa|[4.9,3.0,1.4,0.2]|
|        4.7|       3.2|        1.3|       0.2|Iris-setosa|[4.7,3.2,1.3,0.2]|
+-----------+----------+-----------+----------+-----------+-----------------+
only showing top 3 rows
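
As a rough mental model of what the assembler does to each row, here is a plain-Python sketch. It is not Spark's implementation: the `assemble` helper and the dict-based row are illustrative assumptions, and Spark stores the result as a vector type rather than a Python list.

```python
# Plain-Python sketch of the per-row effect of VectorAssembler.
# `assemble` is a hypothetical helper, not part of PySpark.
INPUT_COLS = ["SepalLength", "SepalWidth", "PetalLength", "PetalWidth"]

def assemble(row, input_cols=INPUT_COLS, output_col="features"):
    """Return a copy of `row` with the input columns packed into one list."""
    out = dict(row)
    out[output_col] = [float(row[c]) for c in input_cols]
    return out

first_row = {"SepalLength": 5.1, "SepalWidth": 3.5,
             "PetalLength": 1.4, "PetalWidth": 0.2, "Class": "Iris-setosa"}
assembled = assemble(first_row)
print(assembled["features"])  # [5.1, 3.5, 1.4, 0.2]
```

The original columns are kept alongside the new one, which is why the Spark output above still shows all five input columns.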



Drop the original feature columns and display only Class and features.

In [7]:
df = df_temp.drop("SepalLength", "SepalWidth", "PetalLength", "PetalWidth")
df.show(3)

+-----------+-----------------+
|      Class|         features|
+-----------+-----------------+
|Iris-setosa|[5.1,3.5,1.4,0.2]|
|Iris-setosa|[4.9,3.0,1.4,0.2]|
|Iris-setosa|[4.7,3.2,1.3,0.2]|
+-----------+-----------------+
only showing top 3 rows



6.	The final data preparation step is to index the Class column, because Spark ML classifiers need a numeric label rather than text.

In [8]:
from pyspark.ml.feature import StringIndexer
l_indexer = StringIndexer(inputCol="Class", outputCol="ClassIndex")
df = l_indexer.fit(df).transform(df)

df.show(10)

+-----------+-----------------+----------+
|      Class|         features|ClassIndex|
+-----------+-----------------+----------+
|Iris-setosa|[5.1,3.5,1.4,0.2]|       0.0|
|Iris-setosa|[4.9,3.0,1.4,0.2]|       0.0|
|Iris-setosa|[4.7,3.2,1.3,0.2]|       0.0|
|Iris-setosa|[4.6,3.1,1.5,0.2]|       0.0|
|Iris-setosa|[5.0,3.6,1.4,0.2]|       0.0|
|Iris-setosa|[5.4,3.9,1.7,0.4]|       0.0|
|Iris-setosa|[4.6,3.4,1.4,0.3]|       0.0|
|Iris-setosa|[5.0,3.4,1.5,0.2]|       0.0|
|Iris-setosa|[4.4,2.9,1.4,0.2]|       0.0|
|Iris-setosa|[4.9,3.1,1.5,0.1]|       0.0|
+-----------+-----------------+----------+
only showing top 10 rows
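
By default, StringIndexer orders labels by descending frequency, so the most common label gets index 0.0. In Spark 3.x, ties are, as far as I know, broken alphabetically. The plain-Python sketch below imitates that rule to show which index each Iris class should receive. The `fit_string_index` helper is hypothetical and only illustrates the ordering; it is not how Spark computes it internally.

```python
from collections import Counter

def fit_string_index(labels):
    """Map each distinct label to a float index: most frequent first, ties alphabetical."""
    counts = Counter(labels)
    ordered = sorted(counts, key=lambda label: (-counts[label], label))
    return {label: float(i) for i, label in enumerate(ordered)}

# The Iris file holds 50 rows per class, so the alphabetical tie-break decides the order.
labels = ["Iris-setosa"] * 50 + ["Iris-versicolor"] * 50 + ["Iris-virginica"] * 50
mapping = fit_string_index(labels)
print(mapping)
```

Under these assumptions Iris-setosa maps to 0.0, which matches the ClassIndex column shown above.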



7.	Split your data into training and test datasets.

In [9]:
# 70/30 random split; pass seed=<int> as a second argument for a reproducible split
(trainingData, testData) = df.randomSplit([0.7, 0.3])
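
The split is made by random sampling, so the 70/30 proportions are approximate, and without a seed the rows in each split (and therefore the accuracy figures later on) can differ from run to run. The plain-Python sketch below illustrates the idea of assigning each row independently according to the weights. `random_split` is a hypothetical helper written for this illustration, not Spark's implementation.

```python
import random

def random_split(rows, weights, seed=None):
    """Assign each row independently to a split with probability proportional to its weight."""
    rng = random.Random(seed)
    total = sum(weights)
    # Cumulative upper bounds, e.g. roughly [0.7, 1.0] for weights [0.7, 0.3].
    bounds = []
    running = 0.0
    for w in weights:
        running += w / total
        bounds.append(running)
    splits = [[] for _ in weights]
    for row in rows:
        draw = rng.random()
        for i, bound in enumerate(bounds):
            # The last split also catches any draw left over by rounding error.
            if draw < bound or i == len(bounds) - 1:
                splits[i].append(row)
                break
    return splits

rows = list(range(150))
train, test = random_split(rows, [0.7, 0.3], seed=42)
print(len(train), len(test))  # close to 105 and 45, but usually not exact
```

Reusing the same seed reproduces the same split, which is the reason to pass `seed` to `randomSplit` when results need to be comparable across runs.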

8.	**Decision Tree Classifier** \
Specify the DecisionTreeClassifier and train the model on your training dataset.


In [10]:
from pyspark.ml.classification import DecisionTreeClassifier
from pyspark.ml.evaluation import MulticlassClassificationEvaluator

dt = DecisionTreeClassifier(labelCol="ClassIndex", featuresCol="features")
model = dt.fit(trainingData)

9.	Test your model with your test dataset.

In [11]:
predictions = model.transform(testData)

predictions.select("prediction", "ClassIndex").show(15)

+----------+----------+
|prediction|ClassIndex|
+----------+----------+
|       0.0|       0.0|
|       0.0|       0.0|
|       0.0|       0.0|
|       0.0|       0.0|
|       0.0|       0.0|
|       0.0|       0.0|
|       0.0|       0.0|
|       0.0|       0.0|
|       0.0|       0.0|
|       0.0|       0.0|
|       0.0|       0.0|
|       0.0|       0.0|
|       0.0|       0.0|
|       0.0|       0.0|
|       0.0|       0.0|
+----------+----------+
only showing top 15 rows



10.	Run an evaluator function to show the accuracy of your model.

In [12]:
evaluator = MulticlassClassificationEvaluator(
    labelCol="ClassIndex", predictionCol="prediction",
    metricName="accuracy")
accuracy = evaluator.evaluate(predictions)
print("Test Error = %g" % (1.0 - accuracy))
print("Test set accuracy = " + str(accuracy))

Test Error = 0.0444444
Test set accuracy = 0.9555555555555556
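
The accuracy metric is the fraction of test rows whose prediction equals the label, and the test error is its complement. The plain-Python sketch below shows the calculation on made-up data. The 43-correct-out-of-45 counts are an assumption chosen because they are consistent with the figures printed above; the notebook output does not show the actual counts.

```python
def accuracy(pairs):
    """Fraction of (prediction, label) pairs that match."""
    pairs = list(pairs)
    if not pairs:
        raise ValueError("no predictions to evaluate")
    correct = sum(1 for prediction, label in pairs if prediction == label)
    return correct / len(pairs)

# Hypothetical results: 43 correctly classified rows and 2 misclassified rows.
pairs = ([(0.0, 0.0)] * 15 + [(1.0, 1.0)] * 14 + [(2.0, 2.0)] * 14
         + [(2.0, 1.0), (1.0, 2.0)])
acc = accuracy(pairs)
print("Test Error = %g" % (1.0 - acc))
print("Test set accuracy = " + str(acc))
```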


11.	**Random Forest Classifier** \
Specify the RandomForestClassifier, train the model on your training dataset, predict using your test dataset, and run an evaluator to test accuracy.


In [13]:
from pyspark.ml.classification import RandomForestClassifier
rf = RandomForestClassifier(labelCol="ClassIndex",
                            featuresCol="features", numTrees=10)
model = rf.fit(trainingData)
predictions = model.transform(testData)
predictions.select("prediction", "ClassIndex").show(10)

+----------+----------+
|prediction|ClassIndex|
+----------+----------+
|       0.0|       0.0|
|       0.0|       0.0|
|       0.0|       0.0|
|       0.0|       0.0|
|       0.0|       0.0|
|       0.0|       0.0|
|       0.0|       0.0|
|       0.0|       0.0|
|       0.0|       0.0|
|       0.0|       0.0|
+----------+----------+
only showing top 10 rows



In [14]:
evaluator = MulticlassClassificationEvaluator(labelCol="ClassIndex",
                                              predictionCol="prediction", metricName="accuracy")
accuracy = evaluator.evaluate(predictions)
print("Test Error = %g" % (1.0 - accuracy))
print("Test set accuracy = " + str(accuracy))

Test Error = 0.0444444
Test set accuracy = 0.9555555555555556


12.	**Naive Bayes Classifier** \
Specify the NaiveBayes classifier, train the model on your training dataset, predict using your test dataset, and run an evaluator to test accuracy.


In [15]:
from pyspark.ml.classification import NaiveBayes
nb = NaiveBayes(labelCol="ClassIndex",
                featuresCol="features",
                smoothing=1.0,
                modelType="multinomial")
model = nb.fit(trainingData)

In [16]:
predictions = model.transform(testData)
predictions.select("Class", "ClassIndex", "probability", "prediction").show(10)

+-----------+----------+--------------------+----------+
|      Class|ClassIndex|         probability|prediction|
+-----------+----------+--------------------+----------+
|Iris-setosa|       0.0|[0.63661112686987...|       0.0|
|Iris-setosa|       0.0|[0.68041878851573...|       0.0|
|Iris-setosa|       0.0|[0.70492462645364...|       0.0|
|Iris-setosa|       0.0|[0.63729810318337...|       0.0|
|Iris-setosa|       0.0|[0.64764630563287...|       0.0|
|Iris-setosa|       0.0|[0.69848104148986...|       0.0|
|Iris-setosa|       0.0|[0.71056281280751...|       0.0|
|Iris-setosa|       0.0|[0.58320889629176...|       0.0|
|Iris-setosa|       0.0|[0.74812136684639...|       0.0|
|Iris-setosa|       0.0|[0.71469612086946...|       0.0|
+-----------+----------+--------------------+----------+
only showing top 10 rows
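
With default settings (no class thresholds), the prediction column should be the index of the largest entry in the probability vector. The sketch below shows that relationship in plain Python. The `predict_from_probability` helper and the probability values are hypothetical; the real vectors are truncated in the output above.

```python
def predict_from_probability(probability):
    """Return the index of the highest probability as a float, like Spark's prediction column."""
    best = max(range(len(probability)), key=lambda i: probability[i])
    return float(best)

# Hypothetical three-class probability vector; not taken from the output above.
probability = [0.64, 0.22, 0.14]
print(predict_from_probability(probability))  # 0.0
```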



In [17]:
evaluator = MulticlassClassificationEvaluator(labelCol="ClassIndex",
                                              predictionCol="prediction", metricName="accuracy")
accuracy = evaluator.evaluate(predictions)
print("Test Error = %g" % (1.0 - accuracy))
print("Test set accuracy = " + str(accuracy))

Test Error = 0.0444444
Test set accuracy = 0.9555555555555556
