<a href="https://colab.research.google.com/github/jalorenzo/SparkNotebookColab/blob/master/BDF_10_Spark_MLib.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# 00 - Configuration of Apache Spark on Colaboratory


### Installing Java, Spark, and Findspark


---


This code installs Apache Spark 3.3.1, Java 8, and [Findspark](https://github.com/minrk/findspark), a library that makes it easy for Python to find Spark.

In [None]:
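# Refresh the package lists and install a headless Java 8 runtime
!apt-get update -qq
!apt-get install openjdk-8-jdk-headless -qq > /dev/null
# Download and unpack Spark 3.3.1 (older releases are kept on archive.apache.org)
!wget -q https://archive.apache.org/dist/spark/spark-3.3.1/spark-3.3.1-bin-hadoop3.tgz
!tar xf spark-3.3.1-bin-hadoop3.tgz
!rm spark-3.3.1-bin-hadoop3.tgz
!pip install -q findspark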

### Set Environment Variables
Set the locations where Spark and Java are installed.

In [None]:
import os
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"
os.environ["SPARK_HOME"] = "/content/spark/"
os.environ["DRIVE_DATA"] = "/content/gdrive/My Drive/Enseignement/2022-2023/ING3/HPDA/BigDataFrameworks/data/"
# A `!export` line only affects its own throwaway subshell, so extend PATH from Python instead
os.environ["PATH"] += os.pathsep + "/content/spark/bin" + os.pathsep + "/content/spark/sbin"

# Point /content/spark at the unpacked distribution (-f: no error if the link does not exist yet)
!rm -f /content/spark
!ln -s /content/spark-3.3.1-bin-hadoop3 /content/spark
!echo $SPARK_HOME
!env | grep "DRIVE_DATA"
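
Each `!` line in a notebook runs in its own short-lived shell, so a variable exported there never reaches the Python kernel or later cells, whereas a value set through `os.environ` is inherited by every child process started afterwards. The sketch below illustrates the difference with `subprocess`; the variable name `DEMO_SPARK_VAR` is a made-up placeholder, and the example assumes a POSIX shell is available.

```python
import os
import subprocess

# Start from a clean slate (DEMO_SPARK_VAR is a hypothetical name used only here).
os.environ.pop("DEMO_SPARK_VAR", None)

# Exporting inside a child shell: the value vanishes when that shell exits.
subprocess.run("export DEMO_SPARK_VAR=from_subshell", shell=True, check=True)
seen_after_export = os.environ.get("DEMO_SPARK_VAR")  # the parent never saw it

# Setting the variable in the Python process: later child processes inherit it.
os.environ["DEMO_SPARK_VAR"] = "from_python"
child = subprocess.run(
    "echo $DEMO_SPARK_VAR", shell=True, check=True, capture_output=True, text=True
)
seen_in_child = child.stdout.strip()

print(seen_after_export, seen_in_child)
```

This is why `JAVA_HOME` and `SPARK_HOME` are assigned through `os.environ` in the cell above.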

### Start a SparkSession
This will start a local Spark session.

In [2]:
!python -V

#import findspark
#findspark.init()

from pyspark import SparkContext
sc = SparkContext.getOrCreate()

# Example: shows the PySpark version
print("PySpark version {0}".format(sc.version))

# Example: parallelise a list and show its first two elements
sc.parallelize([2, 3, 4, 5, 6]).cache().take(2)

Python 3.9.2


/usr/local/lib/python3.9/dist-packages/pyspark/bin/load-spark-env.sh: line 68: ps: command not found
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).


22/12/26 15:07:08 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
22/12/26 15:07:12 WARN Utils: Service 'SparkUI' could not bind on port 4040. Attempting port 4041.
22/12/26 15:07:12 WARN Utils: Service 'SparkUI' could not bind on port 4041. Attempting port 4042.
PySpark version 3.3.1


                                                                                

[2, 3]

In [3]:
from pyspark.sql import SparkSession
# Create a SparkSession (or retrieve it if one already exists)
spark = SparkSession \
    .builder \
    .appName("My application") \
    .config("spark.some.config.option", "some-value") \
    .master("local[4]") \
    .getOrCreate()
# Get the underlying SparkContext
sc = spark.sparkContext

In [None]:
# Mount Google Drive
from google.colab import drive
drive.mount('/content/gdrive')


---


# 10 - Spark MLlib

A library of parallel machine learning algorithms for large-scale data:

-   Classic machine learning algorithms: classification, regression, clustering, collaborative filtering
-   Feature algorithms: extraction, transformation, dimensionality reduction, and selection
-   Tools to build, evaluate, and tune ML pipelines
-   Utilities: linear algebra, statistics, data handling, etc.


Two packages:

-   **spark.mllib:** Original RDD-based API
-   **spark.ml:** High-level API, based on DataFrames

Documentation and APIs:

- ML
    - Guide: http://spark.apache.org/docs/latest/ml-guide.html
    - Python API: https://spark.apache.org/docs/latest/api/python/reference/pyspark.ml.html
    - Scala API: https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.ml.package
- MLlib
    - Guide: http://spark.apache.org/docs/latest/mllib-guide.html
    - Python API: https://spark.apache.org/docs/latest/api/python/reference/pyspark.mllib.html
    - Scala API: https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.mllib.package



## Example

Use the [KMeans](https://spark.apache.org/docs/latest/ml-clustering.html#k-means) clustering algorithm from the DataFrame-based `spark.ml` API to group a small set of vectors that fall into two well-separated clusters.


In [4]:
from pyspark.ml.clustering import KMeans, KMeansModel
from pyspark.ml.linalg import Vectors

#  Define an array of 4 sparse vectors, 3 elements each 
sparseData = [
     Vectors.sparse(3, {1: 1.2}),
     Vectors.sparse(3, {1: 1.1}),
     Vectors.sparse(3, {0: 0.9, 2: 1.0}),
     Vectors.sparse(3, {0: 1.0, 2: 1.1})
 ]
print(sparseData)
for i in range(4):
    print(sparseData[i].toArray())

[SparseVector(3, {1: 1.2}), SparseVector(3, {1: 1.1}), SparseVector(3, {0: 0.9, 2: 1.0}), SparseVector(3, {0: 1.0, 2: 1.1})]
[0.  1.2 0. ]
[0.  1.1 0. ]
[0.9 0.  1. ]
[1.  0.  1.1]
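
A sparse vector stores only its size and its non-zero entries; every other position is implicitly zero. As a rough illustration of what `toArray()` produces, here is a plain-Python sketch (the helper `sparse_to_dense` is hypothetical and not part of PySpark):

```python
def sparse_to_dense(size, entries):
    """Expand a {index: value} mapping into a dense list of length `size`."""
    dense = [0.0] * size
    for index, value in entries.items():
        if not 0 <= index < size:
            raise IndexError(f"index {index} is outside a vector of size {size}")
        dense[index] = value
    return dense


# The same four vectors as in the cell above, written as (size, entries) pairs.
sparse_rows = [
    (3, {1: 1.2}),
    (3, {1: 1.1}),
    (3, {0: 0.9, 2: 1.0}),
    (3, {0: 1.0, 2: 1.1}),
]
dense_rows = [sparse_to_dense(size, entries) for size, entries in sparse_rows]
for row in dense_rows:
    print(row)
```

With only three dimensions the saving is negligible; the sparse form pays off when vectors have thousands of mostly zero features.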


In [5]:
# Turn the array into a DataFrame
dfSD = sc.parallelize([
  (1, sparseData[0]),
  (2, sparseData[1]),
  (3, sparseData[2]),
  (4, sparseData[3])
]).toDF(["row", "features"])

dfSD.show()



+---+-------------------+
|row|           features|
+---+-------------------+
|  1|      (3,[1],[1.2])|
|  2|      (3,[1],[1.1])|
|  3|(3,[0,2],[0.9,1.0])|
|  4|(3,[0,2],[1.0,1.1])|
+---+-------------------+



                                                                                

In [6]:
# Create an untrained KMeans estimator configured for 2 clusters
# For more information, see https://spark.apache.org/docs/latest/api/python/reference/pyspark.ml.html#module-pyspark.ml.clustering
kmeans = KMeans()\
    .setInitMode("k-means||")\
    .setFeaturesCol("features")\
    .setPredictionCol("prediction")\
    .setK(2)\
    .setSeed(1)
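
For intuition about what fitting such an estimator involves, the following is a minimal plain-Python sketch of Lloyd's iterations (assign each point to its nearest centre, then move each centre to the mean of its points). It is not Spark's implementation: Spark distributes the work and uses the `k-means||` initialisation selected above, whereas this sketch assumes hand-picked starting centres and the helper names are invented for illustration.

```python
def squared_distance(p, q):
    """Squared Euclidean distance between two equal-length points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def nearest(point, centres):
    """Index of the centre closest to `point`."""
    return min(range(len(centres)), key=lambda i: squared_distance(point, centres[i]))


def lloyd(points, centres, max_iter=20):
    """Run Lloyd's iterations until the centres stop changing."""
    for _ in range(max_iter):
        # Assignment step: group the points by their nearest centre.
        groups = [[] for _ in centres]
        for p in points:
            groups[nearest(p, centres)].append(p)
        # Update step: each centre becomes the mean of its group
        # (an empty group keeps its previous centre).
        new_centres = [
            [sum(col) / len(g) for col in zip(*g)] if g else c
            for g, c in zip(groups, centres)
        ]
        if new_centres == centres:
            break
        centres = new_centres
    return centres


points = [[0.0, 1.2, 0.0], [0.0, 1.1, 0.0], [0.9, 0.0, 1.0], [1.0, 0.0, 1.1]]
# Assumed initial centres: one point from each visually obvious group.
centres = lloyd(points, [points[0], points[2]])
print(centres)
```

On this tiny dataset the centres settle after a single update, at approximately `[0, 1.15, 0]` and `[0.95, 0, 1.05]`.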

In [7]:
!pyspark --version

/usr/local/lib/python3.9/dist-packages/pyspark/bin/load-spark-env.sh: line 68: ps: command not found
/usr/local/lib/python3.9/dist-packages/pyspark/bin/load-spark-env.sh: line 68: ps: command not found
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 3.3.1
      /_/
                        
Using Scala version 2.12.15, OpenJDK 64-Bit Server VM, 1.8.0_342
Branch HEAD
Compiled by user yumwang on 2022-10-15T09:47:01Z
Revision fbbcf9434ac070dd4ced4fb9efe32899c6db12a9
Url https://github.com/apache/spark
Type --help for more information.


In [8]:
kmeans

KMeans_5b0235cd5e70

In [10]:
# Fit the model to the previous DataFrame and show the cluster centres
kmModel = kmeans.fit(dfSD)
print("Clusters centres: {0}".format(kmModel.clusterCenters()))

                                                                                

Clusters centres: [array([0.  , 1.15, 0.  ]), array([0.95, 0.  , 1.05])]


In [11]:
# Check how the model assigns the training rows to clusters
kmModel.transform(dfSD).show()
# The training cost is the sum of squared distances between each input point
# and the centre of the cluster it is assigned to
print("Cost = {0}".format(
    kmModel.summary.trainingCost))

+---+-------------------+----------+
|row|           features|prediction|
+---+-------------------+----------+
|  1|      (3,[1],[1.2])|         0|
|  2|      (3,[1],[1.1])|         0|
|  3|(3,[0,2],[0.9,1.0])|         1|
|  4|(3,[0,2],[1.0,1.1])|         1|
+---+-------------------+----------+

Cost = 0.014999999999999236
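
The reported cost can be checked by hand. The sketch below recomputes the sum of squared distances in plain Python, copying the centres and cluster assignments from the outputs above (the variable names are illustrative only):

```python
def squared_distance(p, q):
    """Squared Euclidean distance between two equal-length points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


# Training points in dense form, with the centres and predictions shown above.
points = [[0.0, 1.2, 0.0], [0.0, 1.1, 0.0], [0.9, 0.0, 1.0], [1.0, 0.0, 1.1]]
centres = [[0.0, 1.15, 0.0], [0.95, 0.0, 1.05]]
assignments = [0, 0, 1, 1]

# Sum, over all points, of the squared distance to the assigned centre.
cost = sum(squared_distance(p, centres[k]) for p, k in zip(points, assignments))
print("Cost = {0}".format(cost))
```

Each of the two points in cluster 0 contributes about 0.0025 and each of the two in cluster 1 about 0.005, which adds up to roughly 0.015, in line with the value Spark reports up to floating-point rounding.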


In [12]:
# Test the model with other points
dfTest = sc.parallelize([
  (1, Vectors.sparse(3, {0: 0.9, 1:1.0, 2: 1.0})),
  (2, Vectors.sparse(3, {1: 1.5, 2: 0.3}))
]).toDF(["row", "features"])

kmModel.transform(dfTest).show(truncate=False)

# Note: summary.trainingCost is the cost computed on the training data (dfSD),
# not on dfTest, so the value printed here is the same as in the previous cell
print("Cost = {0}".format(
    kmModel.summary.trainingCost))

+---+-------------------------+----------+
|row|features                 |prediction|
+---+-------------------------+----------+
|1  |(3,[0,1,2],[0.9,1.0,1.0])|1         |
|2  |(3,[1,2],[1.5,0.3])      |0         |
+---+-------------------------+----------+

Cost = 0.014999999999999236
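
The cost printed above is still the training cost, because `summary.trainingCost` does not depend on `dfTest`. As a rough cross-check, this plain-Python sketch reproduces the nearest-centre predictions for the two test points and computes the sum of squared distances on the test points themselves (the helper `predict` is hypothetical, and the centres are copied from the earlier output):

```python
def squared_distance(p, q):
    """Squared Euclidean distance between two equal-length points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def predict(point, centres):
    """Return (index of the nearest centre, squared distance to it)."""
    distances = [squared_distance(point, c) for c in centres]
    best = min(range(len(centres)), key=distances.__getitem__)
    return best, distances[best]


centres = [[0.0, 1.15, 0.0], [0.95, 0.0, 1.05]]
test_points = [[0.9, 1.0, 1.0], [0.0, 1.5, 0.3]]  # dense form of the dfTest rows

results = [predict(p, centres) for p in test_points]
predictions = [index for index, _ in results]
test_cost = sum(distance for _, distance in results)

print(predictions)
print("Test cost = {0}".format(test_cost))
```

The predictions agree with the `prediction` column above, while the cost on the test points is about 1.2175, far larger than the training cost because these points lie well away from both centres.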


In [13]:
# Save the model in a directory
kmModel.save("/tmp/kmModel")

                                                                                

In [14]:
# Reload the model
sameModel = KMeansModel.load("/tmp/kmModel")

sameModel.transform(dfTest).show(truncate=False)
# A model loaded from disk carries no training summary (hasSummary is False),
# so the training cost cannot be read from it
#print("Cost = {0}".format(sameModel.summary.trainingCost))

                                                                                

+---+-------------------------+----------+
|row|features                 |prediction|
+---+-------------------------+----------+
|1  |(3,[0,1,2],[0.9,1.0,1.0])|1         |
|2  |(3,[1,2],[1.5,0.3])      |0         |
+---+-------------------------+----------+
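
The reloaded model predicts exactly as before because a k-means prediction depends only on the cluster centres. The sketch below illustrates the idea with a JSON round trip of the centres in plain Python; the file name and layout are assumptions made for this example and are not the on-disk format Spark uses for `save` and `load`.

```python
import json
import os
import tempfile


def predict(point, centres):
    """Index of the centre nearest to `point` (squared Euclidean distance)."""
    def squared_distance(p, q):
        return sum((a - b) ** 2 for a, b in zip(p, q))
    return min(range(len(centres)), key=lambda i: squared_distance(point, centres[i]))


centres = [[0.0, 1.15, 0.0], [0.95, 0.0, 1.05]]
test_points = [[0.9, 1.0, 1.0], [0.0, 1.5, 0.3]]

with tempfile.TemporaryDirectory() as tmp_dir:
    # Hypothetical file layout: just k and the list of centres.
    path = os.path.join(tmp_dir, "centres.json")
    with open(path, "w") as f:
        json.dump({"k": len(centres), "centres": centres}, f)
    with open(path) as f:
        restored = json.load(f)["centres"]

before = [predict(p, centres) for p in test_points]
after = [predict(p, restored) for p in test_points]
print(before, after)
```

Training-time statistics such as the cost are not needed for prediction, so a persisted model can omit them and still reproduce the same assignments.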



In [15]:
!rm -rf /tmp/kmModel

----------------------------------------
Exception occurred during processing of request from ('127.0.0.1', 58354)
Traceback (most recent call last):
  File "/usr/lib/python3.9/socketserver.py", line 316, in _handle_request_noblock
    self.process_request(request, client_address)
  File "/usr/lib/python3.9/socketserver.py", line 347, in process_request
    self.finish_request(request, client_address)
  File "/usr/lib/python3.9/socketserver.py", line 360, in finish_request
    self.RequestHandlerClass(request, client_address, self)
  File "/usr/lib/python3.9/socketserver.py", line 720, in __init__
    self.handle()
  File "/usr/local/lib/python3.9/dist-packages/pyspark/accumulators.py", line 281, in handle
    poll(accum_updates)
  File "/usr/local/lib/python3.9/dist-packages/pyspark/accumulators.py", line 253, in poll
    if func():
  File "/usr/local/lib/python3.9/dist-packages/pyspark/accumulators.py", line 257, in accum_updates
    num_updates = read_int(self.rfile)
  File "/usr/lo