# Introduction to Clustering

The field of Machine Learning is broadly categorized into **_4_** main approaches, as illustrated below.

---

$$
\text{🤖 Machine Learning}
\begin{cases}
  \text{🧑‍🏫 Supervised Learning} \\
  \text{🧩 Unsupervised Learning} \\
  \text{⚖️ Semi-supervised Learning} \\
  \text{🕹️ Reinforcement Learning}
\end{cases}
$$

---

This report will focus on **Clustering**, one of the primary tasks within **'Unsupervised Learning'**.

In many machine learning tasks, particularly in **'Supervised Learning'**, the primary goal is prediction. An algorithm is trained on labeled data to predict an output, such as assigning a label to new, unseen data.

On the other hand, **'Clustering'** is not designed to predict a specific output. Instead, its objective is to discover structures within the data by organizing it into meaningful groups, or "clusters."

---

# A real-world example

## Bank Example: Finding Risky Customers

Banks use clustering to figure out which customers are financially similar and what risk they pose.

*   **Goal:** Group customers to manage financial risk.
*   **Input:** The bank uses unlabeled financial data such as a customer's income, debt, payment history, age, and so on.
*   **Clustering Action:** The algorithm automatically sorts all customers into distinct groups, such as a **"Low-Risk"** cluster and a **"High-Risk"** cluster.
*   **Impact:** The bank can then use these groups to make decisions. For example, they can offer their best interest rates and products to the Low-Risk group while restricting loans or services to the High-Risk group.

---

## Where to use clustering

*   **Exploratory data analysis**
*   **Summary generation**
*   **Outlier detection**
*   **Finding duplicates**
*   **Pre-processing data**

---

Now that you know the main idea behind clustering, let's dive into the scientific parts.

Every clustering algorithm relies on finding similarities and dissimilarities between data points.

You might wonder why we need dissimilarities! Well, the answer is straightforward. We are trying to place similar items in the same cluster, as close together as possible. On the other hand, we are also trying to maximize the distances between different clusters. Look at the image below to see it clearly:

---

![](assets/intra_inter.png)

---

As you can see, the yellow items are trying to stick together and stay close, but the green and blue ones are trying to get as far away from each other as possible! If we somehow manage to minimize the **'intra-distances'** (distances between points inside the same cluster) and also maximize the **'inter-distances'** (distances between different clusters), we can claim that we are on the right track in clustering.

So, the goals are:

$$\text{Dis} (x_1, x_2) \downarrow$$
*(Minimize the distance between points $\mathbf{x_1}$ and $\mathbf{x_2}$ in the same cluster)*

$$\text{Dis} (c_1, c_2) \uparrow$$
*(Maximize the distance between clusters $\mathbf{c_1}$ and $\mathbf{c_2}$)*
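
To make the two goals concrete, here is a minimal Python sketch. The points and their cluster assignments are made up for illustration, and distance is taken to be straight-line (Euclidean) distance, which is explained in the next section. Measuring the distance between clusters through their centroids (mean points) is only one possible choice, assumed here for simplicity.

```python
from itertools import combinations
from math import dist  # straight-line (Euclidean) distance, Python 3.8+

# Hypothetical 2-D points, already assigned to two clusters for illustration.
cluster_1 = [(1.0, 1.0), (1.0, 2.0), (2.0, 1.0)]
cluster_2 = [(8.0, 8.0), (9.0, 8.0), (8.0, 9.0)]


def mean_intra_distance(points):
    """Average distance over all pairs of points inside one cluster."""
    pairs = list(combinations(points, 2))
    return sum(dist(p, q) for p, q in pairs) / len(pairs)


def centroid(points):
    """Coordinate-wise mean of the points in a cluster."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))


# Distances inside each cluster: we want these to be small.
intra_1 = mean_intra_distance(cluster_1)
intra_2 = mean_intra_distance(cluster_2)

# Distance between the two clusters (via centroids): we want this to be large.
inter = dist(centroid(cluster_1), centroid(cluster_2))

print(f"intra-distance, cluster 1: {intra_1:.3f}")
print(f"intra-distance, cluster 2: {intra_2:.3f}")
print(f"inter-distance:            {inter:.3f}")
```

For these toy points the intra-distances come out much smaller than the inter-distance, which is what a good grouping should look like.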

---

### What is distance and how to measure that?

Well, there are many different ways to measure the distance between two points in a vector space, such as:

*   **Euclidean**
*   **Cosine**
*   **Average distance**

and ...

For now, we are going to talk about '**Euclidean**' distance.

In mathematics, the Euclidean distance between two points in Euclidean space is the length of the line segment between them. It can be calculated from the Cartesian coordinates of the points using the Pythagorean theorem, and therefore is occasionally called the Pythagorean distance.

You can see it clearly in the image below:

---

![](assets/euclidean_distance.jpg)

---

So, it's just simple math. For each feature, we take the difference between the two points, square it, add up all the squares, and take the square root of the sum:

$$
\text{Dis}(x_i, x_j) = \sqrt{\sum_{k=1}^{n} (x_{ik} - x_{jk})^2}
$$

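You might ask why we add the extra steps of squaring each difference and then taking the square root of the sum. Squaring makes every difference non-negative, so positive and negative differences cannot cancel each other out, and it gives larger differences more weight. Taking the square root then brings the result back to the same units as the original features. Note that this is separate from normalizing or scaling the data, which is a pre-processing step applied to the features before any distance is computed.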
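
The formula above translates almost directly into code. The following is a minimal sketch in plain Python; the function name `euclidean_distance` and the sample points are chosen here for illustration only.

```python
import math


def euclidean_distance(x_i, x_j):
    """Euclidean distance between two points with the same number of features."""
    if len(x_i) != len(x_j):
        raise ValueError("both points must have the same number of features")
    # (x_ik - x_jk)^2 for every feature k
    squared_diffs = [(a - b) ** 2 for a, b in zip(x_i, x_j)]
    # Sum over all features, then take the square root
    return math.sqrt(sum(squared_diffs))


# Illustrative points with n = 3 features: 3^2 + 4^2 + 0^2 = 25
print(euclidean_distance([1, 2, 3], [4, 6, 3]))  # 5.0
```

The function works for any number of features, as long as both points have the same length.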

---

### Example Distance Calculation (Euclidean)

Imagine we have two data points (e.g., two customers) measured by two features: **Feature A** and **Feature B**.

| Feature | Point $x_i$ | Point $x_j$ |
| :---: | :---: | :---: |
| **Feature A** | 2 | 5 |
| **Feature B** | 8 | 4 |

To calculate the Euclidean distance between $x_i$ and $x_j$, we follow the steps of the formula:

**1. Calculate the difference and square it for each feature:**

*   **Feature A:** $(2 - 5)^2 = (-3)^2 = 9$
*   **Feature B:** $(8 - 4)^2 = (4)^2 = 16$

**2. Sum the squared differences and take the square root:**

The calculation for the distance is as follows:

$$
\text{Dis}(x_i, x_j) = \sqrt{(2-5)^2 + (8-4)^2}
$$

$$
\text{Dis}(x_i, x_j) = \sqrt{9 + 16} = \sqrt{25} = 5
$$

The distance between the two points is **5**. This single number is what the clustering algorithm uses to determine how similar $x_i$ and $x_j$ are. A lower distance means higher similarity.
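
The same worked example can be checked in a few lines of Python. This is only a sketch that mirrors the two steps above; the variable names are illustrative, and the last line cross-checks the result with the standard library's `math.dist` (available from Python 3.8).

```python
import math

# The two points from the table: (Feature A, Feature B)
x_i = (2, 8)
x_j = (5, 4)

# Step 1: difference per feature, squared
squared_diffs = [(a - b) ** 2 for a, b in zip(x_i, x_j)]
print(squared_diffs)  # [9, 16]

# Step 2: sum the squared differences and take the square root
distance = math.sqrt(sum(squared_diffs))
print(distance)  # 5.0

# Cross-check with the built-in Euclidean distance function
builtin_distance = math.dist(x_i, x_j)
print(math.isclose(distance, builtin_distance))  # True
```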

---

All right, now that we know the main concepts, let's dive into our very first algorithm, called '**_K-means_**'.