# **Lesson_5.1**

## In this lecture

* Fork repository

* K-means clustering: cluster prediction for an individual input
* How to save and reuse your trained model
* In-class exercise: clustering helper function
* Linear regression code-along walkthrough project **Medical Charges**
* In-class exercise: medical data clustering

---

## Cluster prediction for an individual input

#### Prepare input
The new person (customer) is:
* Age: 30
* Annual income: 60k
* Spending score: 50

#### Create input for this customer:

In [1]:
import pandas as pd
import joblib

In [2]:
new_customer_df = pd.DataFrame([[30, 60, 50]], columns=['Age', 'Annual_Income', 'Spending_Score'])  # Note: the data must be a 2D array (one row per customer)
new_customer_df

| | Age | Annual_Income | Spending_Score |
|---|---|---|---|
| 0 | 30 | 60 | 50 |


#### Predict the cluster

In [3]:
kmeans = joblib.load("../models/kmeans_v1.pkl")
kmeans

KMeans(n_clusters=6, init='k-means++', n_init=10, max_iter=300, tol=0.0001, verbose=0, random_state=42, copy_x=True, algorithm='lloyd')


In [4]:
cluster_label = kmeans.predict(new_customer_df)
print(f"The customer belongs to cluster: {cluster_label[0]}")

The customer belongs to cluster: 3
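The `joblib.load` call above assumes the model was saved to disk earlier. A minimal sketch of the full save-and-reload round trip is shown here; the synthetic training data, the file name `kmeans_demo.pkl` and the temporary folder are assumptions made for illustration and are not the lesson's actual dataset or model file.

```python
import tempfile
from pathlib import Path

import joblib
import numpy as np
from sklearn.cluster import KMeans

# Synthetic stand-in for the training data (hypothetical, not the lesson's dataset)
rng = np.random.default_rng(42)
X_train = rng.uniform(low=[18, 15, 1], high=[70, 140, 100], size=(200, 3))

# Fit a model with the same non-default settings the loaded model reports
model = KMeans(n_clusters=6, n_init=10, random_state=42).fit(X_train)

# Save to disk, then load it back (a temporary folder keeps the sketch self-contained)
with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = Path(tmp_dir) / "kmeans_demo.pkl"
    joblib.dump(model, model_path)
    reloaded = joblib.load(model_path)

# The reloaded model should give the same prediction as the original
new_point = np.array([[30, 60, 50]])
same_prediction = model.predict(new_point)[0] == reloaded.predict(new_point)[0]
print("Same prediction after reload:", same_prediction)
```

In a real project you would save to a stable path (for example a `models` folder) and load the file with the same scikit-learn version that created it.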



<fieldset>
<legend>DANGER ZONE</legend>
If you applied any preprocessing to the original dataset (such as scaling or PCA), you must apply the same fitted transformation to <b>new_customer_df</b> before predicting; otherwise the predicted cluster will be wrong!
</fieldset>
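One way to reduce the risk described in the warning is to keep the preprocessing and the model together in a scikit-learn `Pipeline`, so new input always passes through the same fitted steps. The sketch below assumes `StandardScaler` as the preprocessing step and uses synthetic training data; neither is confirmed to match the model used in this lesson.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

feature_names = ['Age', 'Annual_Income', 'Spending_Score']

# Synthetic stand-in for the training data (hypothetical values)
rng = np.random.default_rng(0)
train_df = pd.DataFrame(
    rng.uniform(low=[18, 15, 1], high=[70, 140, 100], size=(200, 3)),
    columns=feature_names,
)

# The pipeline learns the scaling from the training data and stores it with the model
pipeline = make_pipeline(
    StandardScaler(),
    KMeans(n_clusters=6, n_init=10, random_state=42),
)
pipeline.fit(train_df)

# New input goes through the same fitted scaler before KMeans sees it
new_customer_df = pd.DataFrame([[30, 60, 50]], columns=feature_names)
label = pipeline.predict(new_customer_df)[0]
print(f"Cluster for the new customer: {label}")
```

Saving the whole pipeline with `joblib.dump` then keeps the scaler and the clustering model in a single file.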

In [5]:
distances = kmeans.transform(new_customer_df)
print("Distances to cluster centers:", distances)

Distances to cluster centers: [[26.99624116 44.76195542 45.37829969  4.57425652 41.75835277 48.41321103]]
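`kmeans.transform` returns the distance from the input to each cluster center, and `kmeans.predict` picks the nearest one. As a quick check, the sketch below takes the distances printed above and finds the index of the smallest value.

```python
import numpy as np

# Distances copied from the output above (one row per customer, one column per cluster)
distances = np.array([[26.99624116, 44.76195542, 45.37829969,
                       4.57425652, 41.75835277, 48.41321103]])

# The predicted cluster is the index of the smallest distance in each row
nearest_cluster = np.argmin(distances, axis=1)
print("Nearest cluster:", nearest_cluster[0])  # 3, the same label kmeans.predict returned
```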


### In-class exercise

Write a helper function which would ask a user to enter *age*, *annual income* and *spending score* and **return** which *cluster* the new customer belongs to.

In [6]:
# Write your code here ...

def input_numbers():
    """Ask the user for the three features the model was trained on."""
    age = int(input("Enter your age: "))
    annual_income = int(input("Enter your annual income (in thousands): "))
    spending_score = int(input("Enter your spending score: "))
    return age, annual_income, spending_score

# Build a one-row DataFrame; the column names must match the training data
new_customer_input_df = pd.DataFrame(
    [input_numbers()],
    columns=['Age', 'Annual_Income', 'Spending_Score']
)

kmeans = joblib.load("../models/kmeans_v1.pkl")
cluster_label = kmeans.predict(new_customer_input_df)
print(f"The customer belongs to cluster: {cluster_label[0]}")

# Better code written in class and photographed (23/02/2026)

In [None]:
distances = kmeans.transform(new_customer_input_df)
print("Distances to cluster centers:", distances)

---

## Linear regression model for Medical Charges prediction

### Business objectives

* Purpose of the project: predicting medical expenses using linear regression
* Business question: what would the medical charges be for new customers?

### Import and settings

In [None]:
import numpy as np
import pandas as pd
import plotly.express as px # Interactive charts with less code; .express is the high-level API
import matplotlib
import matplotlib.pyplot as plt
import seaborn as sns

In [None]:
# Change settings to improve default style (optional)
sns.set_style('darkgrid')
matplotlib.rcParams['font.size'] = 14
matplotlib.rcParams['figure.figsize'] = (10, 6)
matplotlib.rcParams['figure.facecolor'] = '#00000000'

### Load data

In [None]:
# Data path
data_path = '../datasets/medical-charges.csv'

# Load data
medical_df = pd.read_csv(data_path)

### EDA

In [None]:
medical_df.info()

In [None]:
medical_df.describe()

In [None]:
medical_df.head()

#### Visualisation

* Age

In [None]:
fig = px.histogram(
    medical_df,
    x='age',
    marginal='box',
    nbins=47, # bin for each year. Calculated from min and max age in the dataset.
    title="Age distribution"
)
fig.update_layout(bargap=0.1)
fig.show()
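The comment on `nbins=47` says the value was calculated from the minimum and maximum age. A minimal sketch of that calculation follows, assuming whole-year ages that span 18 to 64 (the `ages` series here is made up; in the notebook the values would come from `medical_df['age']`).

```python
import pandas as pd

# Hypothetical ages; the real values come from medical_df['age']
ages = pd.Series([18, 19, 25, 33, 47, 52, 64])

# One bin per whole year between the youngest and the oldest person, inclusive
nbins = int(ages.max() - ages.min() + 1)
print(nbins)  # 47 for ages from 18 to 64
```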

* BMI

In [None]:
fig = px.histogram(
    medical_df,
    x='bmi',
    marginal='box',
    color_discrete_sequence=['red'],
    nbins=47, # number of bins (same value as in the age plot)
    title="BMI distribution"
)
fig.update_layout(bargap=0.1)
fig.show()

* Charges

In [None]:
fig = px.histogram(
    medical_df,
    x='charges',
    # color='smoker',
    marginal='box',
    # color_discrete_sequence=['green', 'grey'],
    nbins=47, # number of bins (same value as in the age plot)
    title="Annual medical charges"
)
fig.update_layout(bargap=0.1)
fig.show()

* Smoker

In [None]:
medical_df.smoker.value_counts()

In [None]:
fig = px.histogram(
    medical_df,
    x='smoker',
    color='sex',
    title="Smoker"
)

fig.show()

* Age and charges

In [None]:
fig = px.scatter(
    medical_df,
    x='age',
    y='charges',
    color='smoker',
    opacity=0.8,
    hover_data=['sex'],
    title="Age vs. Charges"    
)
fig.update_traces(marker_size=5)
fig.show()

* BMI and charges

In [None]:
fig = px.scatter(
    medical_df,
    x='bmi',
    y='charges',
    color='smoker',
    opacity=0.8,
    hover_data=['sex'],
    title="BMI vs. Charges"    
)
fig.update_traces(marker_size=5)
fig.show()

* Number of children

In [None]:
fig = px.violin(  # Violin plot used as an example
    medical_df,
    x='children',
    y='charges'
)
fig.show()

### In-class exercise
Before we proceed with building a **linear regression model** to predict medical charges, apply k-means clustering using all applicable numerical features in the dataset (reuse the workflow we studied in the previous lesson).
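As a starting point for this exercise, here is a rough sketch of one possible workflow: select the numeric columns, scale them, compare inertia for several values of `k`, then fit a final model. It runs on synthetic data with column names borrowed from the plots above. The scaling step, the elbow-style comparison and the choice of `k=4` are assumptions and may differ from the workflow used in the previous lesson.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for medical_df (hypothetical values, same column names)
rng = np.random.default_rng(1)
demo_df = pd.DataFrame({
    'age': rng.integers(18, 65, size=300),
    'bmi': rng.normal(30, 6, size=300),
    'children': rng.integers(0, 6, size=300),
    'charges': rng.gamma(shape=2.0, scale=6000.0, size=300),
    'smoker': rng.choice(['yes', 'no'], size=300),
})

# Keep only the numeric columns; text columns such as 'smoker' are dropped
numeric_df = demo_df.select_dtypes(include='number')

# Scale so that 'charges' (large values) does not dominate the distances
X_scaled = StandardScaler().fit_transform(numeric_df)

# Inertia for several k values, to inspect with an elbow plot
inertias = {}
for k in range(2, 8):
    model = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X_scaled)
    inertias[k] = model.inertia_

# Fit the chosen model (k=4 is an arbitrary choice for this sketch)
final_model = KMeans(n_clusters=4, n_init=10, random_state=42).fit(X_scaled)
demo_df['cluster'] = final_model.labels_
print(demo_df['cluster'].value_counts().sort_index())
```

With the real data, replace `demo_df` with `medical_df` and choose `k` from the inertia values rather than fixing it in advance.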

---

##### Reminder: do not forget to **Clear All Outputs**
### Now you can commit and push your code to **GitHub**