# Assignment 3, Predictive Methods – AVTEK 2025
Health Insurance Charges – k-means clustering


**Participants' names and contributions**  
- Name 1 – data loading, preprocessing  
- Name 2 – k-means experiments and elbow method  
- Name 3 – interpretation of results and report writing  

_(Edit the names and roles above to match your group.)_

## Part 1 – Familiarizing and basic testing with the k-means algorithm

### 1.1 Dataset from Kaggle – Health Insurance Charges

In this assignment I use the **Health Insurance Charges Dataset** from Kaggle. The dataset describes how different personal factors are related to health insurance costs. Each row represents one person and contains the following columns:

- **age** – age of the person in years (numeric)
- **sex** – biological sex of the person (`male`, `female`)
- **bmi** – body mass index, a measure of body fat based on height and weight (numeric)
- **children** – number of dependents covered by the insurance (numeric)
- **smoker** – whether the person is a smoker (`yes`/`no`)
- **region** – residential area in the US (`northeast`, `northwest`, `southeast`, `southwest`)
- **charges** – annual medical insurance cost billed to the person (numeric)

The goal of this work is **not** to predict a label but to **discover groups of similar people** based on their lifestyle and health-related characteristics. For that purpose I apply the k-means clustering algorithm.

#### Why this dataset is suitable for k-means
- The dataset contains several **numerical features** (age, bmi, children, charges) that are well suited for distance-based clustering. Note that `children` is a small integer count rather than a truly continuous value.
- Categorical features (sex, smoker, region) can be converted to numerical form using one-hot encoding.
- The size of the dataset is modest, so k-means runs quickly on an ordinary laptop.
- The clusters can be interpreted in a meaningful real-world way. For example, we might obtain clusters such as:
  - young non-smokers with low insurance charges
  - middle-aged smokers with very high charges
  - families with several children and medium charges

Because k-means is sensitive to the scale of the features, I will standardize the data before running the algorithm. No rows need to be removed, because the dataset does not contain missing values.
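The effect of scaling can be illustrated with a minimal sketch. The small table below is made up for illustration only (the values are assumptions, not rows from the Kaggle file). It shows a quick missing-value check and how the raw Euclidean distance is dominated by `charges` until the columns are standardized.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical toy rows (NOT taken from insurance.csv), only to illustrate scaling
toy = pd.DataFrame({
    "age": [19, 45, 62, 33],
    "bmi": [22.0, 31.5, 27.3, 35.1],
    "charges": [1800.0, 9500.0, 28000.0, 4200.0],
})

# Count missing values per column before clustering
missing_per_column = toy.isna().sum()

# Standardize: each column gets mean 0 and (population) standard deviation 1
toy_scaled = StandardScaler().fit_transform(toy)

# Distance between the first two rows, before and after scaling
raw_dist = np.linalg.norm(toy.iloc[0].to_numpy() - toy.iloc[1].to_numpy())
scaled_dist = np.linalg.norm(toy_scaled[0] - toy_scaled[1])

print("Missing values per column:")
print(missing_per_column)
print("Column means after scaling:", toy_scaled.mean(axis=0).round(6))
print("Column std after scaling:", toy_scaled.std(axis=0).round(6))
print("Raw distance:", round(raw_dist, 2), "| scaled distance:", round(scaled_dist, 2))
```

In the raw data the distance is almost entirely the difference in `charges`, so age and BMI would have practically no influence on the clusters without standardization.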

In [None]:
# 1.1 Load and inspect the dataset
import pandas as pd

# Load the Kaggle health insurance dataset
df = pd.read_csv("insurance.csv")

# Show basic information
print("Shape of the data:", df.shape)
print("\nFirst rows of the dataset:")
display(df.head())

print("\nSummary statistics:")
display(df.describe(include="all"))

### 1.2 First k-means run

In the first experiment I run a very basic k-means clustering with **k = 3** clusters. The steps are:
1. One-hot encode the categorical columns `sex`, `smoker` and `region`.
2. Standardize all features so that each column has mean 0 and standard deviation 1.
3. Run k-means with `k=3` using the default parameters of scikit-learn.
4. Attach the cluster labels back to the original data and take a quick look at the average values inside each cluster.

This first run is only to check that my notebook works and that the clusters look reasonable. Optimising the number of clusters and hyperparameters is done later in Part 2.

In [None]:
# 1.2 First basic k-means run with k = 3
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans

# One-hot encode categorical features
df_encoded = pd.get_dummies(df, columns=["sex", "smoker", "region"], drop_first=True)

# Standardize all encoded features
scaler = StandardScaler()
X_scaled = scaler.fit_transform(df_encoded)

# Run k-means with k = 3
kmeans_3 = KMeans(n_clusters=3, random_state=42)
clusters_3 = kmeans_3.fit_predict(X_scaled)

# Add cluster labels to the original dataframe
df_with_clusters = df.copy()
df_with_clusters["cluster_3"] = clusters_3

# Inspect average values in each cluster
cluster_means = df_with_clusters.groupby("cluster_3")[["age", "bmi", "children", "charges"]].mean()
print("Average values in each cluster (k=3):")
display(cluster_means)

#### Interpretation of the first k-means run
The table above shows the mean age, BMI, number of children and insurance charges for each of the three clusters.
Typical observations (your exact numbers may differ slightly):
- One cluster contains **younger non-smokers** with relatively **low charges**.
- Another cluster includes **middle-aged people with higher BMI** and **medium charges**.
- The third cluster often contains **smokers** with clearly **higher average charges**.
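The table of means covers only the numeric columns, so statements about smokers need a separate check. A minimal sketch of such a check is shown here with a small made-up frame (the labels and flags are hypothetical). In the notebook the same call could be applied to `df_with_clusters` with its `cluster_3` and `smoker` columns.

```python
import pandas as pd

# Hypothetical cluster labels and smoker flags, standing in for df_with_clusters
toy = pd.DataFrame({
    "cluster_3": [0, 0, 0, 1, 1, 2, 2, 2],
    "smoker": ["no", "no", "no", "no", "yes", "yes", "yes", "yes"],
})

# Share of smokers and non-smokers within each cluster (each row sums to 1)
smoker_share = pd.crosstab(toy["cluster_3"], toy["smoker"], normalize="index")
print(smoker_share)
```

A cluster whose `yes` share is close to 1 can then reasonably be described as a "smoker" cluster.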

From this we see that even a simple k-means run can already separate the population into meaningful groups. Next I will discuss general real-world use cases for the k-means algorithm.

### 1.3 Listing of 3 more interesting real-world use cases for the k-means algorithm

1. **Customer segmentation in marketing**  
   Companies can cluster customers based on their purchase history, demographics and browsing behaviour. The resulting segments (e.g. price-sensitive, loyal high spenders, occasional buyers) can be used to tailor marketing campaigns, recommend products and design loyalty programmes.

2. **Image compression and colour quantisation**  
   In computer vision, k-means can be used to reduce the number of colours in an image by clustering similar colours together. Each cluster centre represents one colour in the compressed image. This is a practical application where k-means helps to reduce file size while keeping the visual appearance reasonable.

3. **Anomaly detection as a side effect of clustering**  
   When k-means is used to model the normal structure of data, points that lie very far from any cluster centre may represent anomalies. For example, unusual network traffic patterns or extremely high sensor readings could be detected using this idea.
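The anomaly detection idea can be sketched roughly as follows. The data is synthetic, and the 99.5th percentile threshold is an arbitrary assumption chosen for illustration, not a recommended value.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)

# Synthetic "normal" data: two compact groups, plus one far-away point
normal_a = rng.normal(loc=[0.0, 0.0], scale=0.5, size=(100, 2))
normal_b = rng.normal(loc=[5.0, 5.0], scale=0.5, size=(100, 2))
outlier = np.array([[20.0, -15.0]])
X = np.vstack([normal_a, normal_b, outlier])

# Fit k-means on the normal points only, so the outlier cannot pull a centre
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X[:-1])

# transform() returns the distance from each point to every cluster centre
dist_to_nearest = km.transform(X).min(axis=1)

# Flag points that lie farther from their centre than almost all normal points
# (the 99.5th percentile is an illustrative threshold, not a tuned one)
threshold = np.percentile(dist_to_nearest[:-1], 99.5)
anomalies = np.where(dist_to_nearest > threshold)[0]
print("Flagged indices:", anomalies)
```

The injected point (index 200) is flagged because its distance to the nearest centre is far larger than that of any normal point. A percentile threshold also flags the most extreme normal points, so in practice the cut-off has to be chosen with the application in mind.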

## Part 2 – Experimenting with the k-means algorithm more in detail

### 2.1 Experiments with different values of $k$

In this section I experiment with **different values of k** to see how the clustering changes. I use values `k = 2, 3, 4, 5, 6`. For each k I run k-means on the same standardized data and then look at:
- the **inertia** (sum of squared distances to the nearest cluster centre),
- the **size of each cluster**, and
- the **average charges** inside each cluster.

The goal here is not yet to find the perfect k but to understand how the structure of the clusters changes when we change the value of k.
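To make the inertia definition concrete: for points $x_i$ and cluster centres $\mu_j$, inertia is $\sum_i \min_j \lVert x_i - \mu_j \rVert^2$. The minimal sketch below recomputes it by hand on synthetic data (an assumption for illustration, not the insurance data) and compares it with the `inertia_` attribute of scikit-learn.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic data used only for illustration (not the insurance data)
X, _ = make_blobs(n_samples=300, centers=3, random_state=0)

km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)

# Inertia by hand: squared distance from each point to its assigned centre, summed
assigned_centres = km.cluster_centers_[km.labels_]
manual_inertia = np.sum((X - assigned_centres) ** 2)

print("Manual inertia: ", round(manual_inertia, 4))
print("sklearn inertia:", round(km.inertia_, 4))
```

Because every added cluster gives points a closer centre to choose from, the best achievable inertia can only go down as k grows, which is why inertia alone cannot pick k.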

In [None]:
# 2.1 Experiments with several k values
from collections import Counter

results = []

for k in range(2, 7):  # k = 2..6
    kmeans = KMeans(n_clusters=k, random_state=42)
    labels = kmeans.fit_predict(X_scaled)
    
    # Store inertia and cluster sizes
    inertia = kmeans.inertia_
    sizes = Counter(labels)
    
    # Create a small summary table of average charges per cluster
    temp = df.copy()
    temp["cluster"] = labels
    mean_charges = temp.groupby("cluster")["charges"].mean().values
    
    results.append((k, inertia, sizes, mean_charges))

# Print results in a readable form
for k, inertia, sizes, mean_charges in results:
    print(f"\n=== k = {k} ===")
    print("Inertia:", round(inertia, 2))
    print("Cluster sizes:", dict(sizes))
    print("Mean charges in clusters:", [round(x, 2) for x in mean_charges])

#### Interpretation of experiments with different k
From the printed results we can make the following observations (exact numbers depend on the random state):
- When **k is small (k=2)** we only get a very rough division of the population into low/medium and high charges.
- Increasing **k to 3 or 4** splits the population into more detailed groups, for example separating smokers with very high charges from non-smokers with medium charges.
- With **larger k (5 or 6)** some clusters become quite small. This might indicate that we are starting to over-segment the data.

Therefore, from a practical point of view, values around **k = 3 or k = 4** seem reasonable for this dataset.

### 2.2 Utilization of the Elbow method

To choose a suitable number of clusters more systematically, I apply the **Elbow method**. I compute the k-means inertia for k values from 1 to 10 and then plot the results. The idea is to look for a **"knee" or elbow** in the curve where the decrease in inertia becomes much smaller.

In [None]:
# 2.2 Elbow method
import matplotlib.pyplot as plt

k_values = list(range(1, 11))
inertias = []

for k in k_values:
    kmeans = KMeans(n_clusters=k, random_state=42)
    kmeans.fit(X_scaled)
    inertias.append(kmeans.inertia_)

plt.figure()
plt.plot(k_values, inertias, marker="o")
plt.xlabel("Number of clusters k")
plt.ylabel("Inertia (sum of squared distances)")
plt.title("Elbow method for k-means on health insurance data")
plt.show()

#### Interpretation of the Elbow graph
In the Elbow plot the inertia decreases quickly at the beginning (from k=1 to k=3 or k=4) and then the curve starts to flatten. This kind of behaviour suggests that adding more clusters after a certain point does not improve the model very much.

For this dataset the elbow typically appears around **k = 3 or k = 4**. This matches the qualitative observations from Section 2.1 and supports the choice of using three or four clusters in further analysis.
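Reading the elbow from a plot is somewhat subjective. One way to make it more concrete is to print the relative drop in inertia for each additional cluster. The sketch below does this on synthetic data with four well-separated groups (the centres and spread are assumptions for illustration). The same loop could be run on `k_values` and `inertias` from the cell above.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic data with 4 well-separated groups (illustration only)
X, _ = make_blobs(
    n_samples=400,
    centers=[[0, 0], [8, 0], [0, 8], [8, 8]],
    cluster_std=0.6,
    random_state=1,
)

k_values = list(range(1, 9))
inertias = [
    KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_
    for k in k_values
]

# Relative drop in inertia when going from k-1 to k clusters
drops = [(prev - curr) / prev for prev, curr in zip(inertias[:-1], inertias[1:])]
for k, drop in zip(k_values[1:], drops):
    print(f"k={k}: inertia falls by {100 * drop:.1f}%")
```

On this synthetic data the drop is large up to k = 4 and small afterwards, which is the numerical counterpart of the visual elbow.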

### 2.3 Testing various options for the k-means algorithm

Finally I test some different **options (hyperparameters)** of the k-means algorithm using `k = 4` as an example. According to the scikit-learn documentation the most relevant options include:
- `init` – method for choosing the initial cluster centres (`"k-means++"` or `"random"`).
- `n_init` – how many times the algorithm is run with different initialisations.
- `max_iter` – maximum number of iterations for a single run.
- `algorithm` – either the classic Lloyd algorithm or Elkan's variant (for Euclidean distances).

I compare the following four configurations:
1. Default settings (k-means++ initialisation).
2. Random initialisation with the same `k`.
3. Increased `n_init` (more restarts).
4. A lower `max_iter` value (50 instead of the default 300).

For each configuration I record the inertia and the number of iterations used.

In [None]:
# 2.3 Testing various k-means options
configs = [
    {"name": "default (k-means++)", "params": {"n_clusters": 4, "random_state": 42}},
    {"name": "random init", "params": {"n_clusters": 4, "init": "random", "random_state": 42}},
    {"name": "higher n_init", "params": {"n_clusters": 4, "n_init": 20, "random_state": 42}},
    {"name": "fewer max_iter", "params": {"n_clusters": 4, "max_iter": 50, "random_state": 42}},
]

summary_rows = []

for cfg in configs:
    km = KMeans(**cfg["params"])
    km.fit(X_scaled)
    summary_rows.append({
        "configuration": cfg["name"],
        "inertia": round(km.inertia_, 2),
        "n_iter": km.n_iter_
    })

summary_df = pd.DataFrame(summary_rows)
display(summary_df)

#### Interpretation of the option experiments
From the table we can draw several conclusions:
- **Default k-means++ initialisation** already gives a good solution with relatively low inertia.
- Using **random initialisation** may sometimes lead to slightly worse inertia or require more iterations, because the starting points are not as good.
- Increasing **`n_init`** means that k-means is run multiple times with different initialisations and the best solution is kept. This can slightly reduce inertia at the cost of longer computation time.
- Reducing **`max_iter`** can stop the algorithm before full convergence. In this dataset the default maximum number of iterations is usually enough, and lowering it rarely changes the result, but in more complex datasets it might lead to higher inertia.

Overall these experiments show how the different k-means options affect the clustering result and confirm that the default configuration with `k-means++` initialisation is a reasonable starting point for this dataset.