
# K-Means Clustering
K-means is an unsupervised learning algorithm that groups similar data points into clusters.

Typical clustering problems include:
* Cluster Similar Documents
* Cluster Customers based on Features
* Market Segmentation
* Identify similar physical groups
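
To make the grouping idea concrete, here is a minimal pure-Python sketch of the classic k-means loop (Lloyd's algorithm): assign each point to its nearest center, then move each center to the mean of its points. This is only an illustration, not Spark's implementation. The function names and the toy points are hypothetical.

```python
import random

def squared_distance(p, q):
    """Squared Euclidean distance between two points of equal length."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def kmeans(points, k, iterations=20, seed=1):
    """Tiny k-means sketch: alternate assignment and center update."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)  # start from k distinct data points
    for _ in range(iterations):
        # Assignment step: each point joins the cluster of its nearest center.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: squared_distance(p, centers[i]))
            clusters[nearest].append(p)
        # Update step: each center moves to the mean of its members.
        new_centers = []
        for i, members in enumerate(clusters):
            if members:
                new_centers.append(tuple(sum(c) / len(members) for c in zip(*members)))
            else:
                new_centers.append(centers[i])  # keep a center that lost all its points
        if new_centers == centers:  # no movement means the loop has converged
            break
        centers = new_centers
    return centers

# Hypothetical toy data: two well-separated groups in 2D.
toy_points = [(0.0, 0.0), (0.2, 0.1), (0.1, 0.3), (9.0, 9.1), (9.2, 8.9), (9.1, 9.0)]
toy_centers = kmeans(toy_points, k=2)
print(sorted(toy_centers))
```

On well-separated data like this, the two centers settle on the means of the two groups regardless of which points are picked as the starting centers.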

## Install pyspark and download the data file

In [1]:
# Install pyspark
!pip install pyspark

Collecting pyspark
  Downloading pyspark-3.5.0.tar.gz (316.9 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 316.9/316.9 MB 4.5 MB/s eta 0:00:00
  Preparing metadata (setup.py) ... done
Building wheels for collected packages: pyspark
  Building wheel for pyspark (setup.py) ... done
  Created wheel for pyspark: filename=pyspark-3.5.0-py2.py3-none-any.whl size=317425344 sha256=7c784360d744b8ec242e7e46408d5a1e6b7d998cd26974e276cba335408152ef
  Stored in directory: /root/.cache/pip/wheels/41/4e/10/c2cf2467f71c678cfc8a6b9ac9241e5e44a01940da8fbb17fc
Successfully built pyspark
Installing collected packages: pyspark
Successfully installed pyspark-3.5.0


In [6]:
# Download the necessary data files
!wget https://raw.githubusercontent.com/Ricardo-Jaramillo/PySpark/main/datasets/KMeans/sample_kmeans_data.txt

--2023-10-04 17:06:38--  https://raw.githubusercontent.com/Ricardo-Jaramillo/PySpark/main/datasets/KMeans/sample_kmeans_data.txt
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.111.133, 185.199.108.133, 185.199.109.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.111.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 120 [text/plain]
Saving to: ‘sample_kmeans_data.txt’


2023-10-04 17:06:38 (7.33 MB/s) - ‘sample_kmeans_data.txt’ saved [120/120]



## Read in the data

In [39]:
# Import libraries
from pyspark.sql import SparkSession
from pyspark.ml.clustering import KMeans

In [4]:
# Create a session
spark = SparkSession.builder.appName('cluster').getOrCreate()

In [7]:
# Read in the data file
dataset = spark.read.format('libsvm').load('sample_kmeans_data.txt')
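
The `libsvm` format stores one row per line as `label index:value index:value ...`, with 1-based feature indices that Spark converts to 0-based positions in the `features` vector. The sketch below parses one such line in plain Python to show the layout. The helper name and the example line are hypothetical, and this is not how Spark reads the file internally.

```python
def parse_libsvm_line(line):
    """Parse 'label index:value ...' into (label, {zero_based_index: value})."""
    label_text, *pairs = line.split()
    features = {}
    for pair in pairs:
        index_text, value_text = pair.split(":")
        # LIBSVM indices start at 1; shift them to 0-based positions.
        features[int(index_text) - 1] = float(value_text)
    return float(label_text), features

# Hypothetical example line in LIBSVM layout.
example_line = "1 1:0.1 2:0.1 3:0.1"
parsed = parse_libsvm_line(example_line)
print(parsed)
```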

In [9]:
# Show data
dataset.show()

+-----+--------------------+
|label|            features|
+-----+--------------------+
|  0.0|           (3,[],[])|
|  1.0|(3,[0,1,2],[0.1,0...|
|  2.0|(3,[0,1,2],[0.2,0...|
|  3.0|(3,[0,1,2],[9.0,9...|
|  4.0|(3,[0,1,2],[9.1,9...|
|  5.0|(3,[0,1,2],[9.2,9...|
+-----+--------------------+
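
Each `features` entry is printed in sparse-vector notation, `(size,[indices],[values])`, so `(3,[],[])` is a length-3 vector with no stored entries, that is, all zeros. The sketch below expands that notation into a plain list. The `to_dense` helper and the second example are hypothetical and only illustrate the notation.

```python
def to_dense(size, indices, values):
    """Expand (size, indices, values) into a dense list, filling gaps with 0.0."""
    dense = [0.0] * size
    for index, value in zip(indices, values):
        dense[index] = value
    return dense

empty_row = to_dense(3, [], [])                # the first row shown above
example_row = to_dense(3, [0, 2], [1.5, 4.0])  # hypothetical sparse vector
print(empty_row, example_row)
```

To see the full, untruncated vectors in the notebook itself, `dataset.show(truncate=False)` can be used.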



In [10]:
# Select the features column
final_data = dataset.select('features')

In [12]:
# Show features
final_data.show()

+--------------------+
|            features|
+--------------------+
|           (3,[],[])|
|(3,[0,1,2],[0.1,0...|
|(3,[0,1,2],[0.2,0...|
|(3,[0,1,2],[9.0,9...|
|(3,[0,1,2],[9.1,9...|
|(3,[0,1,2],[9.2,9...|
+--------------------+



## Create the KMeans model and see results

In [48]:
# Create a kmeans object
kmeans = KMeans().setK(2).setSeed(1)

In [49]:
# Fit the model
model = kmeans.fit(final_data)

In [50]:
# Evaluate the fit: within-set sum of squared errors (WSSSE), exposed as the training cost
wssse = model.summary.trainingCost
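
The training cost is the sum, over all points, of the squared distance from each point to its nearest cluster center. As a rough sanity check, the sketch below recomputes that quantity by hand. The point coordinates are an assumption (the notebook output truncates them), chosen to be consistent with the centers this notebook prints further down. All variable names here are hypothetical.

```python
def squared_distance(p, q):
    """Squared Euclidean distance between two points of equal length."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

# ASSUMPTION: six 3D points whose coordinates repeat a single value per row.
assumed_points = [[v] * 3 for v in (0.0, 0.1, 0.2, 9.0, 9.1, 9.2)]
# Centers as reported by model.clusterCenters() in this notebook.
reported_centers = [[9.1, 9.1, 9.1], [0.1, 0.1, 0.1]]

# WSSSE: each point contributes its squared distance to the closest center.
manual_wssse = sum(
    min(squared_distance(p, c) for c in reported_centers) for p in assumed_points
)
print(manual_wssse)
```

Under these assumptions the value comes out at about 0.12. A lower WSSSE means tighter clusters, but it always decreases as `k` grows, so it is most useful for comparing models at the same `k` or for spotting where the improvement levels off.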

In [51]:
# Get the centers of clusters
centers = model.clusterCenters()

In [52]:
# Show centers
centers

[array([9.1, 9.1, 9.1]), array([0.1, 0.1, 0.1])]

In [54]:
# Assign each row of final_data to its nearest cluster
results = model.transform(final_data)

In [55]:
# Print results
results.show()

+--------------------+----------+
|            features|prediction|
+--------------------+----------+
|           (3,[],[])|         1|
|(3,[0,1,2],[0.1,0...|         1|
|(3,[0,1,2],[0.2,0...|         1|
|(3,[0,1,2],[9.0,9...|         0|
|(3,[0,1,2],[9.1,9...|         0|
|(3,[0,1,2],[9.2,9...|         0|
+--------------------+----------+
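
The `prediction` column holds the index of the nearest cluster center, using the order returned by `model.clusterCenters()`. The sketch below reproduces that assignment in plain Python. The point coordinates are an assumption (the output above truncates them), and the helper names are hypothetical.

```python
def squared_distance(p, q):
    """Squared Euclidean distance between two points of equal length."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def nearest_center(point, centers):
    """Index of the center closest to the given point."""
    return min(range(len(centers)), key=lambda i: squared_distance(point, centers[i]))

# ASSUMPTION: six 3D points whose coordinates repeat a single value per row.
assumed_points = [[v] * 3 for v in (0.0, 0.1, 0.2, 9.0, 9.1, 9.2)]
# Centers in the order printed by this notebook: index 0 is near 9.1, index 1 near 0.1.
reported_centers = [[9.1, 9.1, 9.1], [0.1, 0.1, 0.1]]

predictions = [nearest_center(p, reported_centers) for p in assumed_points]
print(predictions)
```

Under these assumptions the result is `[1, 1, 1, 0, 0, 0]`, matching the table above. The cluster numbers themselves carry no meaning: a different seed could swap which group is labelled 0 and which is labelled 1.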

