<a href="https://colab.research.google.com/github/FedorTaggenbrock/data_intensive_systems/blob/main/Spark_experiments.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
# %pip install pyspark

In [2]:
import pyspark
import math
from pyspark.sql import SparkSession
from pyspark.ml.clustering import KMeans
from pyspark.ml.linalg import Vectors
from pyspark import RDD
from pyspark import SparkContext

In [3]:
spark = SparkSession.builder.appName("Practise").getOrCreate()

In [9]:
# Create an RDD of data points
data = spark.sparkContext.parallelize([
    [0.0, 0.0],
    [1.0, 1.0],
    [9.0, 8.0],
    [8.0, 12.0]
])

# Sample k distinct points (without replacement) to serve as initial centroids
k = 3
init_centroids = [tuple(x) for x in data.takeSample(withReplacement=False, num=k)]
print(init_centroids)

[(9.0, 8.0), (1.0, 1.0), (0.0, 0.0)]


In [17]:
def kMeans(data: RDD, k: int, maxIterations: int, debug: bool) -> list:
    """Cluster the points in `data` into k clusters and return the centroids as lists."""
    # Initialize centroids by sampling k distinct points from the data
    centroids = [tuple(x) for x in data.takeSample(withReplacement=False, num=k)]

    # Custom distance function (Euclidean); replace it to change the metric
    def distance(point1, point2):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(point1, point2)))

    # Key each point by its closest centroid, then group the points per centroid
    def assign(points: RDD) -> RDD:
        return points.map(
            lambda point: (min(centroids, key=lambda c: distance(point, c)), point)
        ).groupByKey()

    # Run exactly maxIterations iterations (there is no convergence check)
    for i in range(maxIterations):
        # Compute each cluster's new centroid as the per-dimension mean of its points
        newCentroids = assign(data).mapValues(
            lambda points: tuple(sum(x) / len(points) for x in zip(*points))
        ).collect()

        # Replace each old centroid with its updated value; a centroid that
        # attracted no points is absent from newCentroids and stays unchanged
        for oldCentroid, newCentroid in newCentroids:
            centroids[centroids.index(oldCentroid)] = newCentroid

        if debug:
            # Re-assign the points to the updated centroids and print each cluster
            print("Iteration {}".format(i))
            for centroid, points in assign(data).collect():
                print("Cluster with centroid {}: {}".format(centroid, list(points)))
            print("-------------------------------------------")

    return [list(x) for x in centroids]


This code defines a kMeans function that takes an RDD of data points (each a list of floats), the number of clusters k, the maximum number of iterations, and a debug flag. The function initializes the centroids by sampling k points from the data at random and then runs maxIterations iterations; it does not stop early when the centroids stop changing. In each iteration, every point is assigned to its closest centroid using a custom distance function (here, the Euclidean distance), and each centroid is then replaced by the mean of the points assigned to it. When debug is true, the cluster membership is printed after every iteration. Finally, the function returns the computed centroids as a list of lists.
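The assignment and update steps described above can be checked by hand without a Spark session. The following is a minimal plain-Python sketch, not the notebook's RDD code: the names `kmeans_step`, `kmeans_local`, and the `tol` parameter are hypothetical, and the early stop on centroid movement is an added assumption, since the kMeans function above always runs all maxIterations iterations.

```python
import math

def distance(p, q):
    """Euclidean distance between two points of equal dimension."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(p, q)))

def kmeans_step(points, centroids):
    """One assignment + update step; returns the new list of centroids."""
    # Group points by the index of their closest centroid
    clusters = {}
    for point in points:
        idx = min(range(len(centroids)), key=lambda i: distance(point, centroids[i]))
        clusters.setdefault(idx, []).append(point)
    # New centroid = per-dimension mean; empty clusters keep their old centroid
    new_centroids = list(centroids)
    for idx, members in clusters.items():
        new_centroids[idx] = tuple(sum(dim) / len(members) for dim in zip(*members))
    return new_centroids

def kmeans_local(points, centroids, max_iterations, tol=1e-9):
    """Repeat kmeans_step until no centroid moves more than tol (assumed
    stopping rule) or max_iterations is reached."""
    centroids = [tuple(c) for c in centroids]
    iterations_run = 0
    for _ in range(max_iterations):
        updated = kmeans_step(points, centroids)
        shift = max(distance(old, new) for old, new in zip(centroids, updated))
        centroids = updated
        iterations_run += 1
        if shift <= tol:
            break
    return centroids, iterations_run

# Same four points as the notebook, with fixed (not random) initial centroids
points = [[0.0, 0.0], [1.0, 1.0], [9.0, 8.0], [8.0, 12.0]]
final_centroids, iterations_run = kmeans_local(
    points, [(0.0, 0.0), (9.0, 8.0)], max_iterations=10
)
print(final_centroids, iterations_run)
```

With these fixed starting centroids the first step moves both centroids and the second step leaves them unchanged, so the loop stops after two iterations instead of using all ten.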

This is just one way to implement K-Means clustering with a custom distance function in PySpark, here using the RDD map and groupByKey transformations. You could adapt this code to use a different distance function or to implement other variations of the K-Means algorithm.
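As an illustration of swapping the distance function, the sketch below passes the metric in as an argument to the assignment step. It is plain Python rather than RDD code, and the names `euclidean`, `manhattan`, and `closest_centroid` are hypothetical. Only the assignment step is shown: the mean-based update minimizes squared Euclidean distance, so a different metric may call for a different update rule (for example, the per-dimension median for Manhattan distance).

```python
import math

def euclidean(p, q):
    """Straight-line distance, as used in the notebook's kMeans function."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(p, q)))

def manhattan(p, q):
    """Sum of absolute per-dimension differences."""
    return sum(abs(x - y) for x, y in zip(p, q))

def closest_centroid(point, centroids, distance):
    """Return the centroid nearest to `point` under the given distance function."""
    return min(centroids, key=lambda c: distance(point, c))

centroids = [(0.0, 0.0), (2.0, 4.0)]
point = [6.0, 0.0]

# Euclidean: 6.0 to (0, 0) versus about 5.66 to (2, 4)
by_euclidean = closest_centroid(point, centroids, euclidean)
# Manhattan: 6.0 to (0, 0) versus 8.0 to (2, 4)
by_manhattan = closest_centroid(point, centroids, manhattan)
print(by_euclidean, by_manhattan)
```

The same point lands in different clusters depending on the metric, which is why the choice of distance function should be made deliberately rather than left at the default.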

In [18]:
# Cluster the data into two clusters; maxIterations=4 matches the four iterations in the recorded output below
centroids = kMeans(data, k=2, maxIterations=4, debug=True)

# Print the resulting centroids
for centroid in centroids:
    print(centroid)

Iteration 0
Cluster with centroid (8.5, 10.0): [[9.0, 8.0], [8.0, 12.0]]
Cluster with centroid (0.5, 0.5): [[0.0, 0.0], [1.0, 1.0]]
-------------------------------------------
Iteration 1
Cluster with centroid (8.5, 10.0): [[9.0, 8.0], [8.0, 12.0]]
Cluster with centroid (0.5, 0.5): [[0.0, 0.0], [1.0, 1.0]]
-------------------------------------------
Iteration 2
Cluster with centroid (8.5, 10.0): [[9.0, 8.0], [8.0, 12.0]]
Cluster with centroid (0.5, 0.5): [[0.0, 0.0], [1.0, 1.0]]
-------------------------------------------
Iteration 3
Cluster with centroid (8.5, 10.0): [[9.0, 8.0], [8.0, 12.0]]
Cluster with centroid (0.5, 0.5): [[0.0, 0.0], [1.0, 1.0]]
-------------------------------------------

