# **Lesson_5.1**

## In this lecture

* Fork repository

* K-means clustering: cluster prediction for an individual input
* How to save and reuse your trained model
* In-class exercise: clustering helper function
* Linear regression code-along project: **Medical Charges**
* In-class exercise: medical data clustering

---

## Cluster prediction for an individual input

#### Prepare input
The new person (customer) is:
* Age: 30
* Annual income: 60k
* Spending score: 50

#### Create input for this customer:

In [None]:
import pandas as pd
import joblib

In [None]:
# Note: predict() expects 2D input (one row per customer), hence the nested list
new_customer_df = pd.DataFrame([[30, 60, 50]], columns=['Age', 'Annual_Income', 'Spending_Score'])
new_customer_df

#### Predict the cluster

In [None]:
kmeans = joblib.load("../models/kmeans_v1.pkl")
kmeans
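The cell above only loads an already saved model. As a minimal sketch of the full save-and-reuse round trip, assuming a toy dataset and a temporary file (both hypothetical, not the lesson's real data or the `../models/` folder), `joblib.dump` writes the fitted model to disk and `joblib.load` restores it:

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical toy data standing in for the three customer features
rng = np.random.default_rng(42)
X = rng.normal(size=(60, 3))

# Fit a small model so that there is something to save
model = KMeans(n_clusters=3, n_init=10, random_state=42).fit(X)

# Save to a temporary folder (hypothetical path, used only for this demo)
model_path = os.path.join(tempfile.mkdtemp(), "kmeans_demo.pkl")
joblib.dump(model, model_path)

# Reload the model and check that it predicts the same labels as before
reloaded = joblib.load(model_path)
same_labels = bool((reloaded.predict(X) == model.predict(X)).all())
print("Reloaded model gives identical labels:", same_labels)
```

A reloaded model can be used right away for `predict`; there is no need to fit it again.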

In [None]:
cluster_label = kmeans.predict(new_customer_df)
print(f"The customer belongs to cluster: {cluster_label[0]}")


<fieldset>
<legend>DANGER ZONE</legend>
If you applied any preprocessing to the original dataset (scaling, PCA, ...), you must apply the same preprocessing to <b>new_customer_df</b>; otherwise the prediction will be wrong!
</fieldset>
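To illustrate the warning, here is a hedged sketch that assumes the training data was scaled with `StandardScaler` before clustering. The training data, `scaler` and `demo_kmeans` are hypothetical stand-ins, not the model stored in `kmeans_v1.pkl`. The key point is that the new customer is passed through the scaler that was fitted on the training data, using `transform` rather than `fit_transform`:

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

cols = ['Age', 'Annual_Income', 'Spending_Score']

# Hypothetical training data (random values, for illustration only)
rng = np.random.default_rng(0)
train_df = pd.DataFrame({
    'Age': rng.integers(18, 70, size=100),
    'Annual_Income': rng.integers(15, 140, size=100),
    'Spending_Score': rng.integers(1, 100, size=100),
})

# Fit the scaler on the training data, then cluster the scaled features
scaler = StandardScaler().fit(train_df[cols])
X_scaled = scaler.transform(train_df[cols])
demo_kmeans = KMeans(n_clusters=4, n_init=10, random_state=0).fit(X_scaled)

# New customer: reuse the SAME fitted scaler (transform, not fit_transform)
new_customer_df = pd.DataFrame([[30, 60, 50]], columns=cols)
new_scaled = scaler.transform(new_customer_df)
label = demo_kmeans.predict(new_scaled)[0]
print(f"The customer belongs to cluster: {label}")
```

If scaling was part of training, the fitted scaler has to be saved together with the model (for example with `joblib.dump`) so that both can be reloaded later.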

In [None]:
distances = kmeans.transform(new_customer_df)
print("Distances to cluster centers:", distances)
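For a fitted `KMeans` model, `transform` returns the Euclidean distance from each input row to every cluster center, and `predict` picks the nearest center. The sketch below checks this relationship on hypothetical toy data (`demo_kmeans` and `new_point` are illustrative names, not the lesson's model or customer):

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical toy data with three features
rng = np.random.default_rng(1)
X = rng.normal(size=(90, 3))
demo_kmeans = KMeans(n_clusters=3, n_init=10, random_state=1).fit(X)

# One new observation as a 2D array (one row, three features)
new_point = np.array([[0.5, -0.2, 1.0]])

# Distances to each cluster center: shape (1, n_clusters)
distances = demo_kmeans.transform(new_point)

# The index of the smallest distance is the predicted cluster
nearest = int(np.argmin(distances, axis=1)[0])
predicted = int(demo_kmeans.predict(new_point)[0])
print("Nearest center:", nearest, "| predict():", predicted)
```

The distances are useful when a point lies almost equally close to two centers, because they show how confident the assignment is.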

### In-class exercise

Write a helper function that asks the user to enter *age*, *annual income* and *spending score* and **returns** the *cluster* the new customer belongs to.

In [None]:
# Write your code here ...

---

## Linear regression model for Medical Charges prediction

### Business objectives

* Purpose of the project: predict medical expenses using linear regression
* Business question: what would the medical charges be for new customers?

### Import and settings

In [None]:
import numpy as np
import pandas as pd
import plotly.express as px  # Interactive charts with less code; plotly.express is Plotly's high-level API
import matplotlib
import matplotlib.pyplot as plt
import seaborn as sns

In [None]:
# Change settings to improve default style (optional)
sns.set_style('darkgrid')
matplotlib.rcParams['font.size'] = 14
matplotlib.rcParams['figure.figsize'] = (10, 6)
matplotlib.rcParams['figure.facecolor'] = '#00000000'

### Load data

In [None]:
# Data path
data_path = '../datasets/medical-charges.csv'

# Load data
medical_df = pd.read_csv(data_path)

### EDA

In [None]:
medical_df.info()

In [None]:
medical_df.describe()

In [None]:
medical_df.head()

#### Visualisation

* Age

In [None]:
fig = px.histogram(
    medical_df,
    x='age',
    marginal='box',
    nbins=47, # bin for each year. Calculated from min and max age in the dataset.
    title="Age distribution"
)
fig.update_layout(bargap=0.1)
fig.show()
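The comment on `nbins` says the value is calculated from the minimum and maximum age. A minimal sketch of that calculation, using a small made-up series of ages rather than the real dataset:

```python
import pandas as pd

# Hypothetical ages, for illustration only
ages = pd.Series([18, 19, 23, 35, 35, 52, 64], name='age')

# One bin per year of age: max - min + 1
nbins = int(ages.max() - ages.min() + 1)
print(nbins)  # 47 for ages ranging from 18 to 64
```

In the notebook, the same expression can be applied to `medical_df['age']` instead of hard-coding the number.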

* BMI

In [None]:
fig = px.histogram(
    medical_df,
    x='bmi',
    marginal='box',
    color_discrete_sequence=['red'],
    nbins=47,  # number of bins (same value as in the age chart; adjust as needed)
    title="BMI distribution"
)
fig.update_layout(bargap=0.1)
fig.show()

* Charges

In [None]:
fig = px.histogram(
    medical_df,
    x='charges',
    # color='smoker',
    marginal='box',
    # color_discrete_sequence=['green', 'grey'],
    nbins=47,  # number of bins (same value as in the age chart; adjust as needed)
    title="Annual medical charges"
)
fig.update_layout(bargap=0.1)
fig.show()

* Smoker

In [None]:
medical_df.smoker.value_counts()

In [None]:
fig = px.histogram(
    medical_df,
    x='smoker',
    color='sex',
    title="Smoker"
)

fig.show()

* Age and charges

In [None]:
fig = px.scatter(
    medical_df,
    x='age',
    y='charges',
    color='smoker',
    opacity=0.8,
    hover_data=['sex'],
    title="Age vs. Charges"    
)
fig.update_traces(marker_size=5)
fig.show()

* BMI and charges

In [None]:
fig = px.scatter(
    medical_df,
    x='bmi',
    y='charges',
    color='smoker',
    opacity=0.8,
    hover_data=['sex'],
    title="BMI vs. Charges"    
)
fig.update_traces(marker_size=5)
fig.show()

* Number of children

In [None]:
fig = px.violin(  # Violin plot used as an example
    medical_df,
    x='children',
    y='charges'
)
fig.show()

### In-class exercise
Before we proceed with building a **linear regression model** to predict medical charges, apply k-means clustering using all applicable numerical features in the dataset (reuse the workflow we studied in the previous lesson).

---

##### Reminder: do not forget to **Clear All Outputs**
### Now you can commit and push your code to **GitHub**