# K-Means Clustering

K-means clustering is a simple and very popular unsupervised machine learning algorithm.

Typically, unsupervised algorithms make inferences from datasets using only input vectors without referring to known, or labelled, outcomes.

The k-means algorithm searches for a pre-determined number of clusters within an unlabeled multidimensional dataset. It accomplishes this using a simple conception of what the optimal clustering looks like:

- The "cluster center" is the arithmetic mean of all the points belonging to the cluster.
- Each point is closer to its own cluster center than to other cluster centers.

Those two assumptions are the basis of the k-means model. We will soon dive into exactly how the algorithm reaches this solution, but for now let's take a look at a simple dataset and see the k-means result.
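As a rough illustration of the two properties above, the sketch below checks them on a tiny dataset. The points, the hand-picked `labels`, and the helper names `mean` and `distance` are assumptions made for this example only; it does not run k-means itself.

```python
# Toy data and a hand-picked clustering (both are illustrative assumptions).
points = [[0, 0], [1, 1], [4, 4], [5, 5]]
labels = [0, 0, 1, 1]  # cluster index of each point


def mean(group):
    # Arithmetic mean of a list of equal-length points, one coordinate at a time.
    return [sum(coords) / len(coords) for coords in zip(*group)]


def distance(p, q):
    # Euclidean distance between two points.
    return sum((a - b) ** 2 for a, b in zip(p, q)) ** 0.5


# Property 1: each cluster center is the mean of the points in that cluster.
centers = [
    mean([p for p, label in zip(points, labels) if label == k])
    for k in sorted(set(labels))
]
print(centers)  # [[0.5, 0.5], [4.5, 4.5]]

# Property 2: every point is closer to its own center than to any other center.
nearest_is_own = all(
    min(range(len(centers)), key=lambda k: distance(p, centers[k])) == label
    for p, label in zip(points, labels)
)
print(nearest_is_own)  # True
```

Both properties hold for this labelling, so it is the kind of clustering k-means is looking for.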

### How K-Means Works
  Initialization: pick k random points as the cluster centers (centroids)  

  Then repeat the following steps until the centroids no longer change  
    - Compute the distance from every point to each centroid and assign the point to its nearest centroid  
    - Update each centroid to the mean of the points in its cluster  

Ways to measure the distance from a point to a centroid  
  - Euclidean distance  
  - Manhattan distance  
  - Cosine distance
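For comparison, here is a minimal sketch of the Manhattan and cosine distances in plain Python. The function names `manhattan_distance` and `cosine_distance` are assumptions for this example and are not used elsewhere in this notebook.

```python
def manhattan_distance(point_1: list, point_2: list) -> float:
    # Sum of the absolute differences along each coordinate.
    return sum(abs(a - b) for a, b in zip(point_1, point_2))


def cosine_distance(point_1: list, point_2: list) -> float:
    # One minus the cosine of the angle between the two vectors.
    dot = sum(a * b for a, b in zip(point_1, point_2))
    norm_1 = sum(a * a for a in point_1) ** 0.5
    norm_2 = sum(b * b for b in point_2) ** 0.5
    if norm_1 == 0 or norm_2 == 0:
        # The angle is undefined when either vector has zero length.
        raise ValueError("cosine distance is undefined for a zero vector")
    return 1 - dot / (norm_1 * norm_2)


print(manhattan_distance([0, 0, 0], [2, 2, 2]))  # 6
print(cosine_distance([1, 0], [0, 1]))           # 1.0
```

Cosine distance compares direction only, so two points on the same ray from the origin have a distance of (approximately) zero regardless of their length.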

Here we use the Euclidean distance to measure how far each point is from a centroid.

In [33]:
def euclidean_distance(point_1: list, point_2: list) -> float:
    return sum([(a - b) ** 2 for a, b in zip(point_1, point_2)]) ** 0.5

euclidean_distance([0,0,0], [2,2,2])

3.4641016151377544


To find the centroid that is closest to a given point, we use the following function

In [None]:
def closest_point(point: list, centroids: list) -> list:
    return min(centroids, key=lambda centroid: euclidean_distance(point, centroid))
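A quick check of the function above. The two helpers are repeated so the snippet runs on its own, and the `centroids` values are hypothetical, chosen only for illustration.

```python
def euclidean_distance(point_1: list, point_2: list) -> float:
    return sum((a - b) ** 2 for a, b in zip(point_1, point_2)) ** 0.5


def closest_point(point: list, centroids: list) -> list:
    return min(centroids, key=lambda centroid: euclidean_distance(point, centroid))


centroids = [[0, 0], [5, 5]]  # hypothetical centroids for this example

print(closest_point([1, 1], centroids))  # [0, 0]
print(closest_point([4, 3], centroids))  # [5, 5]
```

Despite its name, `closest_point` returns the nearest centroid, not a data point. When two centroids are equally close, `min` returns the one that appears first in the list.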

Next, we compute the mean of the points that belong to the same cluster

In [39]:
def mean(points: list) -> list:
    return [sum(x) / len(x) for x in zip(*points)]

mean([[1,1], [2,2], [3,3]])

[2.0, 2.0]

With all of the functions above in place, we can implement the KMeans class

In [None]:
import random

In [40]:
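class KMeans:
    def __init__(self, data: list, n_clusters=2, max_iteration=300):
        self.n_clusters = n_clusters
        self.data = data
        self.max_iteration = max_iteration
        self.clusters = []
        self.centroids = self.fit()

    def fit(self) -> list:
        # Initialization: pick k distinct data points as the starting centroids
        centroids = random.sample(self.data, self.n_clusters)
        for _ in range(self.max_iteration):
            # Assignment step: put each point in the cluster of its nearest centroid
            clusters = [[] for _ in range(self.n_clusters)]
            for x in self.data:
                closest = closest_point(x, centroids)
                clusters[centroids.index(closest)].append(x)
            # Store the latest assignment so it exists even without convergence
            self.clusters = clusters
            # Update step: move each centroid to the mean of its cluster
            # (an empty cluster keeps its previous centroid)
            new_centroids = [
                mean(cluster) if cluster else centroids[i]
                for i, cluster in enumerate(clusters)
            ]
            # Stop once the centroids no longer change
            if new_centroids == centroids:
                break
            centroids = new_centroids
        return centroids
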
k_means = KMeans([[0,0],[1,1],[4,4],[5,5]])
print(k_means.clusters)





[[[4, 4], [5, 5]], [[0, 0], [1, 1]]]
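The order of the clusters in the output above depends on which points `random.sample` happens to pick as initial centroids, so another run may list the two clusters the other way round. One way to make the initialization repeatable is to draw from a seeded generator. The helper `initial_centroids` and its `seed` parameter below are hypothetical and are not part of the class above.

```python
import random

data = [[0, 0], [1, 1], [4, 4], [5, 5]]


def initial_centroids(data: list, n_clusters: int, seed: int) -> list:
    # Hypothetical helper: a dedicated Random instance seeded with `seed`
    # always draws the same sample for the same inputs.
    rng = random.Random(seed)
    return rng.sample(data, n_clusters)


first = initial_centroids(data, 2, seed=0)
second = initial_centroids(data, 2, seed=0)
print(first == second)  # True
```

Using a separate `random.Random` instance leaves the global random state untouched, which can be convenient when other cells in the notebook also use `random`.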
