# **K-Means Using Spark**

# **1. Setting Up Spark in the Google Colab Environment**

**Let's set up Spark by running the following code**

**Installing pyspark**

In [None]:
!pip install pyspark

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting pyspark
  Downloading pyspark-3.3.0.tar.gz (281.3 MB)
Collecting py4j==0.10.9.5
  Downloading py4j-0.10.9.5-py2.py3-none-any.whl (199 kB)
Building wheels for collected packages: pyspark
  Building wheel for pyspark (setup.py) ... done
  Created wheel for pyspark: filename=pyspark-3.3.0-py2.py3-none-any.whl size=281764026 sha256=8385a8856ab5f414aa671effa7384bb5661505a9fdb5e0115f6493387db56c08
  Stored in directory: /root/.cache/pip/wheels/7a/8e/1b/f73a52650d2e5f337708d9f6a1750d451a7349a867f928b885
Successfully built pyspark
Installing collected packages: py4j, pyspark
Successfully installed py4j-0.10.9.5 pyspark-3.3.0


**Installing PyDrive, a high-level Python wrapper for the Google Drive API. It lets us easily upload, download, and delete files in our Google Drive from a Python script.**

In [None]:
!pip install -U -q PyDrive

**Spark is written in the Scala programming language and requires the Java Virtual Machine to run. Therefore, we have to install Java in order to proceed.**

In [None]:
!apt install openjdk-8-jdk-headless -qq

The following package was automatically installed and is no longer required:
  libnvidia-common-460
Use 'apt autoremove' to remove it.
The following additional packages will be installed:
  openjdk-8-jre-headless
Suggested packages:
  openjdk-8-demo openjdk-8-source libnss-mdns fonts-dejavu-extra
  fonts-ipafont-gothic fonts-ipafont-mincho fonts-wqy-microhei
  fonts-wqy-zenhei fonts-indic
The following NEW packages will be installed:
  openjdk-8-jdk-headless openjdk-8-jre-headless
0 upgraded, 2 newly installed, 0 to remove and 22 not upgraded.
Need to get 36.6 MB of archives.
After this operation, 143 MB of additional disk space will be used.
Selecting previously unselected package openjdk-8-jre-headless:amd64.
(Reading database ... 123941 files and directories currently installed.)
Preparing to unpack .../openjdk-8-jre-headless_8u342-b07-0ubuntu1~18.04_amd64.deb ...
Unpacking openjdk-8-jre-headless:amd64 (8u342-b07-0ubuntu1~18.04) ...
Selecting previously unselected package openjdk-8-

**Now that we have installed all the necessary dependencies in Colab, it is time to set the environment path. The code below sets the `JAVA_HOME` environment variable so that Spark can find Java.**

In [None]:
import os
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"
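
Because the path above is hard-coded, it can be worth checking that it really contains a Java executable before starting Spark. The cell below is a minimal sketch of such a check; the helper name `java_home_is_valid` is hypothetical and not part of any library, and the fallback path is simply the one used in this notebook.

```python
import os

def java_home_is_valid(java_home):
    """Return True if java_home contains an executable bin/java."""
    java_bin = os.path.join(java_home, "bin", "java")
    return os.path.isfile(java_bin) and os.access(java_bin, os.X_OK)

# Path used in this notebook; it may differ on other images.
java_home = os.environ.get("JAVA_HOME", "/usr/lib/jvm/java-8-openjdk-amd64")
if not java_home_is_valid(java_home):
    print(f"No Java executable found under {java_home}; check the install step.")
```

If the message is printed, the install cell probably failed or Java was installed to a different directory.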

**Now, let's import the usual data-science libraries**

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as pyplot
%matplotlib  inline

**Now we import PySpark along with the modules we need in order to proceed**

In [None]:
import pyspark
from pyspark.sql import *            # Spark module for structured data processing (SparkSession, Row, ...)
from pyspark.sql.types import *      # data types for defining DataFrame schemas
from pyspark.sql.functions import *  # built-in column functions such as array()
from pyspark import SparkContext, SparkConf

**We have imported SparkContext and SparkConf. Let's understand the use case of these classes:**

**SparkContext** represents the connection to a Spark cluster and can be used to create RDDs (resilient distributed datasets) and broadcast variables on that cluster. It is the entry point to the Spark environment.

**SparkConf** is the class that holds the configuration parameters of a Spark application as key-value pairs.


**Now, it's time to initialize SparkConf and SparkContext**

In [None]:
# To create a SparkContext, we first build a SparkConf object that holds the application's settings.
# Here we configure the port of the Spark UI.
conf = SparkConf().set("spark.ui.port", "4050")

Once the session exists, we can easily check the current Spark version and get the link to the web interface. In the Spark UI, we can monitor the progress of a job and debug performance bottlenecks.

In [None]:
# Create the context from the configuration
sc = pyspark.SparkContext(conf=conf)
# getOrCreate() returns the existing SparkSession, or creates a new one if none exists
spark = SparkSession.builder.getOrCreate()

**With the configuration set and the SparkContext and SparkSession created, let's display the session to see its full details**

In [None]:
spark

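When the notebook runs on the Google Colab hosted runtime, the cell below creates an ngrok tunnel to port 4050 so that we can still reach the Spark UI. The last command queries ngrok's own local API, which listens on port 4040 by default, to print the tunnel's public URL.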

In [None]:
!wget https://bin.equinox.io/c/4VmDzA7iaHb/ngrok-stable-linux-amd64.zip
!unzip ngrok-stable-linux-amd64.zip
get_ipython().system_raw('./ngrok http 4050 &')
!curl -s http://localhost:4040/api/tunnels | python3 -c \
    "import sys, json; print(json.load(sys.stdin)['tunnels'][0]['public_url'])"

--2022-10-20 05:21:39--  https://bin.equinox.io/c/4VmDzA7iaHb/ngrok-stable-linux-amd64.zip
Resolving bin.equinox.io (bin.equinox.io)... 54.237.133.81, 52.202.168.65, 18.205.222.128, ...
Connecting to bin.equinox.io (bin.equinox.io)|54.237.133.81|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 13832437 (13M) [application/octet-stream]
Saving to: ‘ngrok-stable-linux-amd64.zip’


2022-10-20 05:21:40 (19.6 MB/s) - ‘ngrok-stable-linux-amd64.zip’ saved [13832437/13832437]

Archive:  ngrok-stable-linux-amd64.zip
  inflating: ngrok                   
Traceback (most recent call last):
  File "<string>", line 1, in <module>
IndexError: list index out of range
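
The `IndexError` in the output above is raised because `['tunnels'][0]` was evaluated on an empty tunnel list, so no public URL was printed. One plausible cause (an assumption, since the log does not confirm it) is that the API was queried before ngrok had finished opening the tunnel; ngrok may also refuse to open a tunnel without an auth token. The sketch below polls the local API a few times instead of indexing blindly. The function names are hypothetical helpers, and the API address is the one used in the cell above.

```python
import json
import time
import urllib.request

def first_public_url(payload):
    """Return the first tunnel's public URL, or None if no tunnel is listed."""
    tunnels = payload.get("tunnels", [])
    return tunnels[0].get("public_url") if tunnels else None

def fetch_tunnels(api_url="http://localhost:4040/api/tunnels"):
    """Query ngrok's local API and return the decoded JSON payload."""
    with urllib.request.urlopen(api_url, timeout=5) as response:
        return json.load(response)

def wait_for_public_url(fetch=fetch_tunnels, attempts=10, delay=1.0):
    """Poll until a tunnel appears; return its URL, or None after all attempts."""
    for _ in range(attempts):
        try:
            url = first_public_url(fetch())
        except OSError:  # the local API is not reachable yet
            url = None
        if url:
            return url
        time.sleep(delay)
    return None
```

Calling `print(wait_for_public_url())` after starting ngrok would then print either the public URL or `None`, rather than raising a traceback.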


# **2. Data Preprocessing**

**Now, let's load the required dataset and proceed with data preprocessing**

In [None]:
from sklearn.datasets import load_breast_cancer #importing breast cancer dataset and loading it
breast_cancer = load_breast_cancer()

**For convenience, since the dataset is small, we first construct a pandas DataFrame, convert it into a Spark DataFrame, and then adjust its schema**

In [None]:
pd_df = pd.DataFrame(breast_cancer.data, columns= breast_cancer.feature_names)
df = spark.createDataFrame(pd_df)
df

DataFrame[mean radius: double, mean texture: double, mean perimeter: double, mean area: double, mean smoothness: double, mean compactness: double, mean concavity: double, mean concave points: double, mean symmetry: double, mean fractal dimension: double, radius error: double, texture error: double, perimeter error: double, area error: double, smoothness error: double, compactness error: double, concavity error: double, concave points error: double, symmetry error: double, fractal dimension error: double, worst radius: double, worst texture: double, worst perimeter: double, worst area: double, worst smoothness: double, worst compactness: double, worst concavity: double, worst concave points: double, worst symmetry: double, worst fractal dimension: double]

In [None]:
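from pyspark.ml.linalg import Vectors  # needed by the map at the end of this cell

def set_df_columns_nullable(spark, df, column_list, nullable=False):
    # Change the nullability flag of the listed columns in the schema
    for struct_field in df.schema:
        if struct_field.name in column_list:
            struct_field.nullable = nullable
    # Rebuild the DataFrame so that the modified schema takes effect
    df_mod = spark.createDataFrame(df.rdd, df.schema)
    return df_mod

# Mark every column as non-nullable
df = set_df_columns_nullable(spark, df, df.columns)
# Collect all feature columns into a single array column
df = df.withColumn('features', array(df.columns))
# Convert each row's feature array into a dense vector
vectors = df.rdd.map(lambda row: Vectors.dense(row.features))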

df.printSchema()

root
 |-- mean radius: double (nullable = false)
 |-- mean texture: double (nullable = false)
 |-- mean perimeter: double (nullable = false)
 |-- mean area: double (nullable = false)
 |-- mean smoothness: double (nullable = false)
 |-- mean compactness: double (nullable = false)
 |-- mean concavity: double (nullable = false)
 |-- mean concave points: double (nullable = false)
 |-- mean symmetry: double (nullable = false)
 |-- mean fractal dimension: double (nullable = false)
 |-- radius error: double (nullable = false)
 |-- texture error: double (nullable = false)
 |-- perimeter error: double (nullable = false)
 |-- area error: double (nullable = false)
 |-- smoothness error: double (nullable = false)
 |-- compactness error: double (nullable = false)
 |-- concavity error: double (nullable = false)
 |-- concave points error: double (nullable = false)
 |-- symmetry error: double (nullable = false)
 |-- fractal dimension error: double (nullable = false)
 |-- worst radius: double (nullable

In the cell below, we build the two data structures that we will use throughout this notebook:

**features:** a DataFrame of dense vectors containing all the original features in the dataset;

**labels:** a pandas Series of binary labels indicating whether the tumor described by the corresponding set of features is malignant (0) or benign (1).

In [None]:
from pyspark.ml.linalg import Vectors
features = spark.createDataFrame(vectors.map(Row), ["features"])
labels = pd.Series(breast_cancer.target)

# **3. Building the Machine Learning Model**

Now we cluster the data with the K-means algorithm included in MLlib (Spark's machine learning library). We set the k parameter to 2 because the dataset has only two classes. Then we fit the model and compute the Silhouette score (i.e., a measure of the quality of the obtained clustering).

Here, we use the MLlib implementation of the Silhouette score via `ClusteringEvaluator`.
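
Before running the MLlib version, the idea behind K-means can be illustrated with a small, single-machine sketch of Lloyd's algorithm in plain Python. This is not how Spark implements it (MLlib runs distributed and uses the `k-means||` initialization by default); the function names and the toy data are assumptions made for illustration only.

```python
import random

def squared_distance(p, q):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def kmeans(points, k, iterations=20, seed=1):
    """Plain Lloyd's algorithm; returns (centroids, assignments)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # random initial centers
    assignments = []
    for _ in range(iterations):
        # Assignment step: nearest centroid by squared Euclidean distance
        assignments = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its points
        new_centroids = []
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:
                new_centroids.append(
                    tuple(sum(col) / len(members) for col in zip(*members))
                )
            else:
                new_centroids.append(centroids[c])  # keep an empty cluster's center
        if new_centroids == centroids:  # converged: centers did not move
            break
        centroids = new_centroids
    return centroids, assignments

# Toy data: two well-separated groups of points
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
          (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
centroids, assignments = kmeans(points, k=2)
print(centroids, assignments)
```

The two alternating steps (assign each point to its nearest center, then recompute each center as the mean of its points) are what `KMeans().fit(...)` performs at scale.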

In [None]:
# Import what we need to build and evaluate a K-Means model with PySpark
from pyspark.ml.clustering import KMeans
from pyspark.ml.evaluation import ClusteringEvaluator

# Train a K-Means model with k=2 and a fixed seed for reproducibility
kmeans = KMeans().setK(2).setSeed(1)
model = kmeans.fit(features)

# Make predictions (adds a "prediction" column with the cluster index)
predictions = model.transform(features)

# Evaluate the clustering by computing the Silhouette score
evaluator = ClusteringEvaluator()

silhouette = evaluator.evaluate(predictions)
print("Silhouette with squared euclidean distance = " + str(silhouette))

Silhouette with squared euclidean distance = 0.8342904262826145


Here, `ClusteringEvaluator` evaluates the clustering results. It expects two input columns, `prediction` and `features`, and computes the Silhouette measure using the squared Euclidean distance.

The Silhouette measures the consistency within clusters. It ranges from -1 to 1, where a value close to 1 means that the points in a cluster are close to the other points in the same cluster and far from the points of the other clusters.
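
To make the definition concrete, here is a minimal sketch of the textbook Silhouette computation in plain Python, using the squared Euclidean distance as the dissimilarity. For each point, `a` is its mean dissimilarity to the other points of its own cluster, `b` is the smallest mean dissimilarity to the points of any other cluster, and the point's score is `(b - a) / max(a, b)`. The function names are hypothetical, and this is not the distributed implementation that `ClusteringEvaluator` uses internally.

```python
def squared_distance(p, q):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((x - y) ** 2 for x, y in zip(p, q))

def silhouette_score(points, assignments):
    """Mean Silhouette over all points; requires at least two clusters."""
    clusters = {}
    for p, c in zip(points, assignments):
        clusters.setdefault(c, []).append(p)
    scores = []
    for p, c in zip(points, assignments):
        own = clusters[c]
        if len(own) == 1:
            scores.append(0.0)  # usual convention for singleton clusters
            continue
        # a: mean dissimilarity to the other points of the same cluster
        a = sum(squared_distance(p, q) for q in own) / (len(own) - 1)
        # b: smallest mean dissimilarity to the points of any other cluster
        b = min(
            sum(squared_distance(p, q) for q in members) / len(members)
            for other, members in clusters.items()
            if other != c
        )
        denom = max(a, b)
        scores.append((b - a) / denom if denom > 0 else 0.0)
    return sum(scores) / len(scores)

# Toy data: a good grouping scores high, a mixed-up grouping scores low
points = [(0.0,), (1.0,), (10.0,), (11.0,)]
print(silhouette_score(points, [0, 0, 1, 1]))
print(silhouette_score(points, [0, 1, 0, 1]))
```

On the toy data, the first grouping keeps neighbors together and the second mixes them, so the first score is close to 1 while the second is negative.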




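In other words, the Silhouette score computed by `ClusteringEvaluator` compares how close each point is to the points of its own cluster with how close it is to the points of the neighboring clusters, which helps identify clusters that are compact and well separated. The score of about 0.83 obtained here suggests two compact, well-separated clusters.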
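
The Silhouette score uses no ground truth. Since the `labels` series built earlier is available, one possible complementary check is to compare the cluster assignments with it. The sketch below computes cluster purity, i.e. the fraction of points that carry the majority label of their cluster. The helper name `cluster_purity` is hypothetical, and the commented lines showing how the notebook's variables could be fed to it are an assumption, not part of the original workflow.

```python
from collections import Counter

def cluster_purity(cluster_ids, true_labels):
    """Fraction of points that carry the majority label of their cluster."""
    by_cluster = {}
    for c, y in zip(cluster_ids, true_labels):
        by_cluster.setdefault(c, Counter())[y] += 1
    # For each cluster, count the points that have its most common label
    matched = sum(counts.most_common(1)[0][1] for counts in by_cluster.values())
    return matched / len(cluster_ids)

# Possible use with this notebook's variables (assumed, not run here):
# cluster_ids = [row.prediction for row in predictions.select("prediction").collect()]
# print(cluster_purity(cluster_ids, labels.tolist()))
```

Purity ignores which cluster index is paired with which class, which matters because K-means numbers its clusters arbitrarily.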

# **4. Conclusion**

In this notebook, we built a K-Means clustering model with 2 clusters in PySpark on the breast cancer dataset and evaluated it with the Silhouette score.