<a href="https://colab.research.google.com/github/FedorTaggenbrock/data_intensive_systems/blob/main/Spark_experiments.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [2]:
%pip install pyspark

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting pyspark
  Downloading pyspark-3.4.0.tar.gz (310.8 MB)
  Preparing metadata (setup.py) ... done
Building wheels for collected packages: pyspark
  Building wheel for pyspark (setup.py) ... done
  Created wheel for pyspark: filename=pyspark-3.4.0-py2.py3-none-any.whl size=311317130 sha256=9bdecaa438ac985b6011d19b02f3d7323f24439f79e6bddf729142eb42ba6885
  Stored in directory: /root/.cache/pip/wheels/7b/1b/4b/3363a1d04368e7ff0d408e57ff57966fcdf00583774e761327
Successfully built pyspark
Installing collected packages: pyspark
Successfully installed pyspark-3.4.0


In [3]:
import pyspark
import math
from pyspark.sql import SparkSession
from pyspark.ml.clustering import KMeans
from pyspark.ml.linalg import Vectors
from pyspark import RDD
from pyspark import SparkContext

In [4]:
spark = SparkSession.builder.appName("Practise").getOrCreate()

In [135]:
def kMeans(distance, data: RDD, k: int, maxIterations: int) -> list:
    # Initialize centroids randomly
    centroids = [tuple(x) for x in data.takeSample(withReplacement=False, num=k)]

    # Run a fixed number of iterations (there is no explicit convergence check)
    for i in range(maxIterations):
        # Assign each point to the closest centroid
        clusters = data.map(lambda point: (min(centroids, key=lambda centroid: distance(point, centroid)), point)).groupByKey()

        # Compute new centroids as the mean of the points in each cluster
        newCentroids = clusters.mapValues(lambda points: tuple([sum(x) / len(points) for x in zip(*points)])).collect()
        print(newCentroids)  # (old centroid, new centroid) pairs, as shown in the test output

        # Update centroids
        for oldCentroid, newCentroid in newCentroids:
            index = centroids.index(oldCentroid)
            centroids[index] = newCentroid

    return [list(x) for x in centroids]


This code defines a kMeans function that takes a distance function, an RDD of data points (each a list of numbers), the number of clusters k, and the maximum number of iterations. It initializes the centroids by sampling k points at random without replacement and then runs maxIterations update steps; there is no explicit convergence check, so the loop always runs the full number of iterations. In each iteration, every point is assigned to its closest centroid under the supplied distance function (Euclidean distance in the test below), and each new centroid is computed as the coordinate-wise mean of the points in its cluster. Finally, the function returns the centroids as a list of lists.
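
To make the assign-and-update step concrete, here is a minimal plain-Python sketch of a single iteration, without Spark. The helper names `euclidean` and `kmeans_step` and the starting centroids are illustrative assumptions, not part of the notebook's code; the sketch only mirrors what the `map`, `groupByKey` and `mapValues` calls compute.

```python
import math
from collections import defaultdict

def euclidean(p, q):
    """Euclidean distance between two points of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(p, q)))

def kmeans_step(points, centroids, distance):
    """One iteration: assign each point to its nearest centroid, then average each cluster."""
    clusters = defaultdict(list)
    for point in points:
        nearest = min(centroids, key=lambda c: distance(point, c))
        clusters[nearest].append(point)
    # A centroid that attracts no points is kept unchanged.
    return [
        tuple(sum(col) / len(clusters[c]) for col in zip(*clusters[c])) if clusters[c] else c
        for c in centroids
    ]

points = [(0, 0), (10, 10), (2, 2), (12, 12)]
new_centroids = kmeans_step(points, [(0, 0), (10, 10)], euclidean)
print(new_centroids)  # [(1.0, 1.0), (11.0, 11.0)]
```

Starting from (0, 0) and (10, 10), one step already moves the centroids to the two cluster means.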

This is just one example of how K-Means clustering with a custom distance function can be implemented in PySpark using RDD transformations (map, groupByKey and mapValues). The code can be adapted to a different distance function or to other variants of the K-Means algorithm.
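
As an illustration of such adaptations, the sketch below shows an alternative distance function and a possible convergence test. Both `manhattan_distance` and `has_converged` (including its `tol` parameter) are hypothetical helpers introduced here, not functions from the notebook; any function taking two points and returning a number could be passed as `distance`.

```python
def manhattan_distance(point1, point2):
    """Sum of absolute coordinate differences (L1 distance)."""
    return sum(abs(x - y) for x, y in zip(point1, point2))

def has_converged(old_centroids, new_centroids, distance, tol=1e-9):
    """True when no centroid moved by more than tol between two iterations."""
    return all(distance(o, n) <= tol for o, n in zip(old_centroids, new_centroids))

d = manhattan_distance((0, 0), (2, 2))
moved = has_converged([(0, 0), (10, 10)], [(1.0, 1.0), (11.0, 11.0)], manhattan_distance)
stable = has_converged([(1.0, 1.0), (11.0, 11.0)], [(1.0, 1.0), (11.0, 11.0)], manhattan_distance)
print(d, moved, stable)  # 4 False True
```

A check like `has_converged` could be used to break out of the iteration loop early once the centroids stop changing.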

In [172]:
#TESTING K-MEANS

data2 = spark.sparkContext.parallelize([
    [0, 0],
    [10,10],
    [2,2],
    [12,12]
])

# Define a custom distance function
def euclid_distance(point1, point2):
    return math.sqrt(sum([(x - y)**2 for x, y in zip(point1, point2)]))

# Cluster the data into two clusters
centroids = kMeans(euclid_distance, data2, k=2, maxIterations=10)

# Print the resulting centroids
for centroid in centroids:
    print(centroid)

[((0, 0), (1.0, 1.0)), ((10, 10), (11.0, 11.0))]
[((1.0, 1.0), (1.0, 1.0)), ((11.0, 11.0), (11.0, 11.0))]
[((1.0, 1.0), (1.0, 1.0)), ((11.0, 11.0), (11.0, 11.0))]
[((1.0, 1.0), (1.0, 1.0)), ((11.0, 11.0), (11.0, 11.0))]
[((1.0, 1.0), (1.0, 1.0)), ((11.0, 11.0), (11.0, 11.0))]
[((1.0, 1.0), (1.0, 1.0)), ((11.0, 11.0), (11.0, 11.0))]
[((1.0, 1.0), (1.0, 1.0)), ((11.0, 11.0), (11.0, 11.0))]
[((1.0, 1.0), (1.0, 1.0)), ((11.0, 11.0), (11.0, 11.0))]
[((1.0, 1.0), (1.0, 1.0)), ((11.0, 11.0), (11.0, 11.0))]
[((1.0, 1.0), (1.0, 1.0)), ((11.0, 11.0), (11.0, 11.0))]
[11.0, 11.0]
[1.0, 1.0]


In [190]:
from collections import Counter

def kModes_v1(distance, data: RDD, k: int, maxIterations: int, list_size: int) -> list:
    # Initialize centroids randomly
    centroids = [tuple(x) for x in data.takeSample(withReplacement=False, num=k)]

    # Run a fixed number of iterations (there is no explicit convergence check)
    for i in range(maxIterations):
        print("centroids = ", centroids)

        # Assign each point to the closest centroid
        clusters = data.map(lambda point: (min(centroids, key=lambda centroid: distance(point, centroid)), point))

        #print("clusters1 = ", clusters.collect())

        # Count the items of every point, then sum the counts per cluster;
        # the most frequent items of a cluster form its new centroid
        clusters = clusters.mapValues(lambda items: Counter(items))
        clusters = clusters.reduceByKey(lambda a, b: a + b)

        #print("clusters2 = ", clusters.collect())

        newCentroids = clusters.mapValues(lambda counter: tuple([c[0] for c in counter.most_common(list_size)])).collect()

        #print("newCentroids = ", newCentroids)

        # Update centroids
        for oldCentroid, newCentroid in newCentroids:
            index = centroids.index(oldCentroid)
            centroids[index] = newCentroid

    return [list(x) for x in centroids]

In [191]:
#TESTING K-MODES USING CATEGORICAL DATA

# The data is encoded so that each column contains a 1 or a 0, depending on whether that trip is made.
data = spark.sparkContext.parallelize([
    ["A", "B", "C", "K"],
    ["F", "E", "D", "B"],
    ["B", "C", "A","J"],
    ["D", "E", "A", "F"],
])
# Define a custom distance function
def jaccard_distance(a, b):
    a = set(a)
    b = set(b)
    intersection = len(a & b)
    union = len(a | b)
    return 1 - (intersection / union)

# Cluster the data into two clusters
centroids = kModes_v1(jaccard_distance, data, k=2, maxIterations=10, list_size=4)

# Print the resulting centroids
for centroid in centroids:
    print(centroid)


centroids =  [('A', 'B', 'C', 'K'), ('B', 'C', 'A', 'J')]
centroids =  [('A', 'B', 'F', 'E'), ('B', 'C', 'A', 'J')]
centroids =  [('F', 'E', 'D', 'B'), ('A', 'B', 'C', 'K')]
centroids =  [('F', 'E', 'D', 'B'), ('A', 'B', 'C', 'K')]
centroids =  [('F', 'E', 'D', 'B'), ('A', 'B', 'C', 'K')]
centroids =  [('F', 'E', 'D', 'B'), ('A', 'B', 'C', 'K')]
centroids =  [('F', 'E', 'D', 'B'), ('A', 'B', 'C', 'K')]
centroids =  [('F', 'E', 'D', 'B'), ('A', 'B', 'C', 'K')]
centroids =  [('F', 'E', 'D', 'B'), ('A', 'B', 'C', 'K')]
centroids =  [('F', 'E', 'D', 'B'), ('A', 'B', 'C', 'K')]
['F', 'E', 'D', 'B']
['A', 'B', 'C', 'K']
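
The two ingredients of this run can be checked by hand on the sample rows. The sketch below repeats the Jaccard distance so that it runs on its own, and then combines two rows with `Counter` the way the per-cluster update does; the variable names are illustrative assumptions. Rows 1 and 3 share three of five distinct items, so their distance is 1 - 3/5.

```python
from collections import Counter

def jaccard_distance(a, b):
    """1 minus the ratio of shared items to all distinct items."""
    a, b = set(a), set(b)
    return 1 - len(a & b) / len(a | b)

row1 = ["A", "B", "C", "K"]
row2 = ["F", "E", "D", "B"]
row3 = ["B", "C", "A", "J"]

d13 = jaccard_distance(row1, row3)  # 3 shared, 5 distinct: about 0.4
d12 = jaccard_distance(row1, row2)  # 1 shared, 7 distinct: about 0.857

# Summing the counters of a cluster's rows gives per-item frequencies.
counts = Counter(row1) + Counter(row3)
top = [item for item, _ in counts.most_common(4)]
print(round(d13, 3), round(d12, 3), top)
```

Because "K" and "J" both occur once, which of them ends up in the top four depends on how ties are ordered, so centroids built this way can vary between runs when counts are tied.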


In [226]:
from statistics import mode

def kModes_v2(distance, data: RDD, k: int, maxIterations: int, list_size: int) -> list:
    # Initialize centroids randomly
    centroids = [tuple(x) for x in data.takeSample(withReplacement=False, num=k)]

    # Run a fixed number of iterations (there is no explicit convergence check)
    for i in range(maxIterations):
        print("centroids = ", centroids)

        # Assign each point to the closest centroid
        clusters = data.map(lambda point: (min(centroids, key=lambda centroid: distance(point, centroid)), point)).groupByKey()

        #print("clusters1 = ", clusters.collect())

        #Compute new centroids as the mode of the points in each cluster
        newCentroids = clusters.mapValues(lambda arrays: tuple([mode(x) for x in zip(*arrays)]) ).collect()

        #print("newCentroids = ", newCentroids)

        # Update centroids
        for oldCentroid, newCentroid in newCentroids:
            index = centroids.index(oldCentroid)
            centroids[index] = newCentroid

    return [list(x) for x in centroids]

In [230]:
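import scipy.spatial.distance  # import the submodule explicitly; a bare `import scipy` may not load scipy.spatial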

# This is probably the best way to do it: it uses an encoding. Each column represents a particular trip, e.g. Utrecht -> Amsterdam, and a route has a 1 in that column if the trip is part of the route.

data = spark.sparkContext.parallelize([
    [1,1,0,1,0],
    [1,1,1,1,0],
    [0,0,1,0,1],
    [1,0,0,0,1],
    [1,0,0,1,0],
    [1,1,1,1,0],
    [0,1,1,0,1],
    [1,0,0,1,0],
])

# Cluster the data into two clusters using the k-modes algorithm with a custom distance function. 
centroids = kModes_v2(scipy.spatial.distance.jaccard, data, k=2, maxIterations=10, list_size = 5)

# Print the resulting centroids
for centroid in centroids:
    print(centroid)

centroids =  [(0, 0, 1, 0, 1), (1, 1, 1, 1, 0)]
centroids =  [(0, 0, 1, 0, 1), (1, 1, 0, 1, 0)]
centroids =  [(0, 0, 1, 0, 1), (1, 1, 0, 1, 0)]
centroids =  [(0, 0, 1, 0, 1), (1, 1, 0, 1, 0)]
centroids =  [(0, 0, 1, 0, 1), (1, 1, 0, 1, 0)]
centroids =  [(0, 0, 1, 0, 1), (1, 1, 0, 1, 0)]
centroids =  [(0, 0, 1, 0, 1), (1, 1, 0, 1, 0)]
centroids =  [(0, 0, 1, 0, 1), (1, 1, 0, 1, 0)]
centroids =  [(0, 0, 1, 0, 1), (1, 1, 0, 1, 0)]
centroids =  [(0, 0, 1, 0, 1), (1, 1, 0, 1, 0)]
[0, 0, 1, 0, 1]
[1, 1, 0, 1, 0]
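
The binary encoding behind this test data can be sketched as follows. The trip names, the `TRIPS` vocabulary and the `encode_route` helper are hypothetical and chosen only for illustration; the notebook's data is already encoded. The last step takes the column-wise mode, which is how a cluster's new centroid is derived from its member rows.

```python
from statistics import mode

# Hypothetical vocabulary of trips; each trip becomes one column.
TRIPS = ["Utrecht->Amsterdam", "Amsterdam->Rotterdam", "Rotterdam->Utrecht"]

def encode_route(route, trips=TRIPS):
    """Return a 0/1 vector with a 1 for every trip that occurs in the route."""
    present = set(route)
    return [1 if trip in present else 0 for trip in trips]

routes = [
    ["Utrecht->Amsterdam", "Amsterdam->Rotterdam"],
    ["Utrecht->Amsterdam"],
    ["Utrecht->Amsterdam", "Rotterdam->Utrecht"],
]
encoded = [encode_route(r) for r in routes]
print(encoded)  # [[1, 1, 0], [1, 0, 0], [1, 0, 1]]

# Column-wise mode of the encoded rows.
column_modes = [mode(col) for col in zip(*encoded)]
print(column_modes)  # [1, 0, 0]
```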
