# **Lesson_4.2**

## In this lecture

* Fork repository

* In-class exercise
* **Scikit-learn** Python library
* ML model selection (again)
* Unsupervised ML
* **K-means** clustering
* In-class exercise

---

## In-class exercise

#### Objective

* Load dataset from internet (provided)

* Handle non-standard CSV formatting (example shown)
* Inspect data
* Handle missing values
* Sort by performance metric
* Visualise using seaborn:
	* Plot a selected feature as a function of car weight
	* Plot mpg boxplots for each number of cylinders

In [None]:
# Import libraries
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

In [None]:
# Provide the URL and read csv
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/auto-mpg/auto-mpg.data"
df = pd.read_csv(url)

In [None]:
# Inspect what the dataset looks like
df.head()

In [None]:
# The file is whitespace-delimited, has no header row, and marks missing values with "?",
# so it has to be read with explicit parsing options
column_names = [
    "mpg", "cylinders", "displacement", "horsepower",
    "weight", "acceleration", "model_year", "origin", "car_name"
]

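df = pd.read_csv(
    url,
    sep=r"\s+",            # split on runs of whitespace (delim_whitespace is deprecated in recent pandas)
    names=column_names,    # the file has no header row
    na_values="?"          # missing values are encoded as "?"
)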

In [None]:
df.head()
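
The parsing options can be checked offline on a tiny inline sample. This is a minimal sketch: the two rows in `sample` are typed in by hand to mimic the layout of `auto-mpg.data`, and `sample_df` is a hypothetical name used only here. It uses `sep=r"\s+"`, which splits on runs of whitespace.

```python
import io

import pandas as pd

column_names = [
    "mpg", "cylinders", "displacement", "horsepower",
    "weight", "acceleration", "model_year", "origin", "car_name"
]

# Two hand-typed rows in the same layout as the real file (illustration only)
sample = (
    '18.0   8   307.0   130.0   3504.   12.0   70  1\t"chevrolet chevelle malibu"\n'
    '25.0   4   98.00   ?       2046.   19.0   71  1\t"ford pinto"\n'
)

sample_df = pd.read_csv(
    io.StringIO(sample),   # read from a string instead of a URL
    sep=r"\s+",            # any run of spaces or tabs separates columns
    names=column_names,    # supply the header ourselves
    na_values="?"          # turn "?" into NaN
)

print(sample_df.dtypes)
print("Missing horsepower values:", sample_df["horsepower"].isna().sum())
```

Because `"?"` is converted to `NaN`, `horsepower` is parsed as a numeric column rather than as text.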


In [None]:
# Write the rest of your code here ...

---

## Scikit-learn (structure)

#### <u>**Scikit-learn** is a Python library that makes machine learning practical and accessible. It provides all the essential tools needed to build a complete machine learning pipeline — from data preprocessing to model training and evaluation</u>

### Core scikit-learn Modules
`sklearn.linear_model` **Linear and logistic regression models for regression and classification tasks**

`sklearn.tree`  **Decision tree algorithms for classification and regression**

`sklearn.ensemble` **Ensemble methods like Random Forest and Gradient Boosting that combine multiple models**

`sklearn.svm` **Support Vector Machines for classification and regression**

`sklearn.neighbors` **K-Nearest Neighbors algorithms for classification and regression**

`sklearn.naive_bayes` **Probabilistic classifiers based on Bayes’ theorem**

`sklearn.neural_network` **Basic feedforward neural networks for classification and regression**

### Model Selection & Evaluation
`sklearn.model_selection` **Tools for train/test splitting, cross-validation, and hyperparameter tuning**

`sklearn.metrics` **Performance metrics like accuracy, precision, recall, MSE, ROC-AUC**

### Data Preprocessing
`sklearn.preprocessing` **Scaling, encoding, normalization, and feature transformations**

`sklearn.impute` **Handling missing values**

`sklearn.feature_selection` **Selecting the most relevant features for modeling**

`sklearn.decomposition` **Dimensionality reduction techniques like PCA**

### Clustering & Unsupervised Learning
`sklearn.cluster` **Clustering algorithms like K-Means and DBSCAN**

`sklearn.mixture` **Gaussian Mixture Models for probabilistic clustering**

`sklearn.manifold` **Manifold learning techniques like t-SNE and Isomap**

### Pipelines & Utilities
`sklearn.pipeline` **Builds ML pipelines to chain preprocessing and models**

`sklearn.compose` **Combines different preprocessing steps for different feature types**

`sklearn.utils` **Utility functions used internally and for advanced workflows**

### Datasets
`sklearn.datasets` **Built-in toy datasets and dataset loading utilities**

[Scikit-learn user's guide](https://scikit-learn.org/stable/user_guide.html)

[Scikit-learn page](https://pypi.org/project/scikit-learn/)
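
To see how several of these modules fit together, here is a minimal sketch of a complete workflow. The dataset (iris) and the model (logistic regression) are chosen only for illustration and are not part of this lesson's exercises; the variable names are hypothetical.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# sklearn.datasets: load a built-in toy dataset
X_iris, y_iris = load_iris(return_X_y=True)

# sklearn.model_selection: hold out part of the data for testing
X_train, X_test, y_train, y_test = train_test_split(
    X_iris, y_iris, test_size=0.25, random_state=42, stratify=y_iris
)

# sklearn.pipeline + sklearn.preprocessing + sklearn.linear_model:
# chain feature scaling and a classifier into one estimator
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
model.fit(X_train, y_train)

# sklearn.metrics: evaluate on the held-out data
accuracy = accuracy_score(y_test, model.predict(X_test))
print(f"Test accuracy: {accuracy:.2f}")
```

Most scikit-learn estimators follow the same `fit` / `predict` pattern, so swapping in a different model usually means changing a single line.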

---

## ML model selection (again)

<p align="center">
<img src="../assets/img/model_selection_v1.jpg" width="800">
</p>

### **Supervised** vs. **Unsupervised** ML. What is the difference?

![Buttons](https://rainydaymum.co.uk/wp-content/uploads/2016/04/button-box-abc-3.jpg)

[Source](https://rainydaymum.co.uk)


* Supervised: learning from **labeled** data in a training dataset. Examples:
	* Classification
	* Regression
* Unsupervised: learning from **unlabeled** data. The algorithm tries to find hidden patterns in the dataset without being told what they are. Example:
	* Clustering: grouping similar data points together and looking for the best way to group them. Applications: customer segmentation, document grouping, anomaly detection, fraud detection ...
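
In scikit-learn the difference shows up directly in the call to `fit`. The sketch below uses synthetic data from `make_blobs`; the data and the names `X_demo`, `y_demo`, `clf` and `km_demo` are assumptions made for illustration.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.neighbors import KNeighborsClassifier

# Synthetic 2D data: X_demo holds the features, y_demo the "true" group of each point
X_demo, y_demo = make_blobs(n_samples=150, centers=3, cluster_std=0.8, random_state=0)

# Supervised: the model is given the features AND the labels
clf = KNeighborsClassifier(n_neighbors=5)
clf.fit(X_demo, y_demo)
predicted_labels = clf.predict(X_demo[:5])

# Unsupervised: the model is given the features only
km_demo = KMeans(n_clusters=3, n_init=10, random_state=0)
cluster_ids = km_demo.fit_predict(X_demo)

print("Classifier predictions:", predicted_labels)
print("Cluster ids:           ", cluster_ids[:5])
```

The cluster ids are arbitrary integers: cluster `0` found by k-means does not have to correspond to label `0` in `y_demo`.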

### K-means clustering
* **K-means clustering** is an *unsupervised* machine learning algorithm used to group similar data points into clusters based on their proximity to **cluster centroids**
* The number of clusters, ***k***, is a **hyperparameter**

<p align="center">
	<img src="../assets/img/k-means_2.png" width="900">
</p>

We all belong to a cluster

<p align="center">
<img src="../assets/img/k-means_1_v1.png" width="800">
</p>

[Image source](https://www.lancaster.ac.uk/stor-i-student-sites/harini-jayaraman/k-means-clustering/)

### Centroid
* In k-means clustering, a **centroid** represents the center of a cluster
* It is typically calculated as the mean of all data points within that cluster
* The algorithm aims to find these centroids such that data points are grouped into clusters where points within each cluster are closer to their respective centroid than to any other cluster's centroid
* A centroid is not necessarily a member of the dataset
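
The two alternating steps behind these bullet points (assign each point to its nearest centroid, then move each centroid to the mean of its points) can be sketched in plain NumPy. This is a simplified illustration on made-up data, not scikit-learn's implementation: the initial centroids are simply the first and last data point, whereas `KMeans` uses `k-means++` initialisation by default.

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up data: two well-separated blobs of 50 points each
points = np.vstack([
    rng.normal(loc=(0.0, 0.0), scale=0.5, size=(50, 2)),
    rng.normal(loc=(5.0, 5.0), scale=0.5, size=(50, 2)),
])

k = 2
# Simple deterministic start: one point from each end of the array
centroids = points[[0, -1]].copy()

for _ in range(20):
    # Assignment step: distance from every point to every centroid, pick the nearest
    distances = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
    labels = distances.argmin(axis=1)

    # Update step: each centroid becomes the mean of the points assigned to it
    # (an empty cluster keeps its previous centroid)
    new_centroids = np.array([
        points[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
        for j in range(k)
    ])

    # Stop when the centroids no longer move
    if np.allclose(new_centroids, centroids):
        break
    centroids = new_centroids

print("Centroids:\n", centroids)
# A centroid is a mean, so it usually does not coincide with any data point
print("Centroid 0 is a data point:", bool((points == centroids[0]).all(axis=1).any()))
```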

---

## Clustering of mall customers

### Import libraries

In [None]:
from sklearn.cluster import KMeans

[Kmeans documentation](https://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html)

### Import data

In [None]:
df = pd.read_csv('../datasets/mall_customers_k-means.csv')
df.head(3)

### EDA

In [None]:
df.describe()

In [None]:
df.info()

In [None]:
df.isnull().sum()

##### Rename columns

In [None]:
df.rename(columns={'Annual Income (k$)': 'Annual_Income',
                   'Spending Score (1-100)': 'Spending_Score'
                   }, inplace=True)

In [None]:
df.head()

### Data graphical overview - sns pairplot

In [None]:
sns.pairplot(data=df[['Age', 'Annual_Income', 'Spending_Score']])
plt.show()

##### There is obvious clustering if we look at **Annual_Income** vs. **Spending_Score**
* Let's have a closer look

In [None]:
plt.figure(figsize=(10, 6))
plt.scatter(df['Annual_Income'], df['Spending_Score'], s=50)
plt.title('Spending score as a function of annual income')
plt.xlabel('Annual income')
plt.ylabel('Spending Score')
plt.show()

In [None]:
df.columns

### Applying k-means model to our dataset (pair of features of interest)

#### Select features for clustering

In [None]:
X = df[['Annual_Income', 'Spending_Score']]

### **Elbow** method to figure out the number of clusters

* Be economical when selecting the number of clusters
* The number of clusters should be as small as possible while still making sense for the data

#### WCSS - Within-cluster sum of squares
* Quantifies how close the data points in a cluster are to the cluster centroid
* Lower WCSS means tighter, more compact clusters
* As you increase *k* (the number of clusters), WCSS decreases, but with diminishing returns
* This is why the elbow method is used to find the optimal number of clusters — by plotting WCSS vs. number of clusters and finding the "elbow" point
* <u>**Number of clusters is a hyperparameter!**</u>
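
Written out, $\mathrm{WCSS} = \sum_{j=1}^{k} \sum_{x \in C_j} \lVert x - \mu_j \rVert^2$, where $C_j$ is cluster $j$ and $\mu_j$ its centroid. The sketch below computes this sum by hand and compares it with the `inertia_` attribute of a fitted `KMeans` model. The data come from `make_blobs` and the names `X_blobs` and `km_blobs` are hypothetical, used only for this illustration.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic data with four groups (illustration only)
X_blobs, _ = make_blobs(n_samples=200, centers=4, random_state=42)

km_blobs = KMeans(n_clusters=4, n_init=10, random_state=42).fit(X_blobs)

# For every point, look up the centroid of the cluster it was assigned to
assigned_centroids = km_blobs.cluster_centers_[km_blobs.labels_]

# Sum of squared distances between each point and its own centroid
wcss_manual = np.sum((X_blobs - assigned_centroids) ** 2)

print(f"Manual WCSS:     {wcss_manual:.3f}")
print(f"KMeans.inertia_: {km_blobs.inertia_:.3f}")
```

The two numbers should agree up to floating-point error, which is why the elbow loop can simply collect `kmeans.inertia_`.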

In [None]:
wcss = []
for i in range(1, 11):
    kmeans = KMeans(n_clusters=i, init='k-means++', max_iter=300, n_init=10, random_state=42)
    kmeans.fit(X)
    wcss.append(kmeans.inertia_)


#### WCSS vs. cluster number "elbow" plot

In [None]:
plt.figure(figsize=(10, 6))
plt.plot(range(1, 11), wcss)
plt.xlabel("Number of clusters")
plt.ylabel("WCSS")
plt.title("Cluster number optimisation by elbow")
plt.show()

The elbow in the plot above suggests that the optimal number of clusters is 5.

In [None]:
kmeans = KMeans(n_clusters=5, init='k-means++', max_iter=300, n_init=10, random_state=42)

In [None]:
y_kmeans = kmeans.fit_predict(X)

In [None]:
df['Cluster'] = y_kmeans

In [None]:
df.head()

In [None]:
plt.figure(figsize=(10, 6))
plt.scatter(X.iloc[:,0], X.iloc[:,1], c=y_kmeans, s=150, cmap='viridis')  # iloc[:, 0] and iloc[:, 1] select the first and second columns of X
centers = kmeans.cluster_centers_  # Retrieves coordinates of cluster centers
plt.scatter(centers[:,0], centers[:,1], c='red', s=200, alpha=.75, marker='X')
plt.xlabel("Annual Income")
plt.ylabel("Spending Score")
plt.title("Customer Segments")
plt.show()

The clusters still need interpretation.

#### How much do people spend within each cluster?

In [None]:
avg_spending_income_in_cluster = df.groupby("Cluster")[["Spending_Score", "Annual_Income"]].mean().sort_index()

avg_spending_income_in_cluster


In [None]:
avg_spending_income_in_cluster.plot(
    kind="bar",
    figsize=(7, 4)
)
plt.xlabel("Cluster")
plt.ylabel("Average Value")
plt.title("Average Spending Score and Income per Cluster")
plt.tight_layout()
plt.show()


* The clusters are starting to make sense, but...
* Deeper analysis is still required

---

### In-class exercise
Perform a similar analysis for the two other pairs of numerical features in this dataset.

---

### K-means clustering using all three numerical features

In [None]:
X = df[['Age', 'Annual_Income', 'Spending_Score']]
wcss = []
for i in range(1, 11):
    kmeans = KMeans(n_clusters=i, init='k-means++', max_iter=300, n_init=10, random_state=42)
    kmeans.fit(X)
    wcss.append(kmeans.inertia_)

plt.figure(figsize=(10, 6))
plt.plot(range(1, 11), wcss, marker='o')
plt.xlabel("Number of clusters")
plt.ylabel("WCSS")
plt.title("Cluster number optimisation by elbow")
plt.show()


The elbow in the plot above suggests that the optimal number of clusters is 6.

In [None]:
kmeans = KMeans(n_clusters=6, init='k-means++', max_iter=300, n_init=10, random_state=42)
y_kmeans = kmeans.fit_predict(X)

In [None]:
df['Cluster_Age_Income_Spend'] = y_kmeans

In [None]:
df.head()

## Visualising clusters in 3D

In [None]:
centroids = kmeans.cluster_centers_
fig = plt.figure(figsize=(10, 7))
ax = fig.add_subplot(111, projection='3d')
ax.scatter(
    df['Age'],
    df['Annual_Income'],
    df['Spending_Score'],
    c=df['Cluster_Age_Income_Spend'],
    s=50,
    cmap='viridis'
)
ax.scatter(
    centroids[:, 0],  # Age
    centroids[:, 1],  # Annual Income
    centroids[:, 2],  # Spending Score
    s=200,
    c='red',
    marker='X',
    edgecolor='black',
    label='Centroids'
)
ax.set_xlabel("Age")
ax.set_ylabel("Annual income")
ax.set_zlabel("Spending score")
plt.title("Customer segments based on Age, Annual income, and Spending score")
plt.show()

## Prediction

#### Prepare input
The new person (customer) is:
* Age: 30
* Annual income: 60k
* Spending score: 50

#### Create input for this customer:

In [None]:
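# Alternative NumPy input (requires `import numpy as np`): new_customer = np.array([[30, 60, 50]])
# The input must be 2D (one row per customer), with the same columns as the data used in fit
new_customer_df = pd.DataFrame([[30, 60, 50]], columns=['Age', 'Annual_Income', 'Spending_Score'])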

#### Predict the cluster

In [None]:
cluster_label = kmeans.predict(new_customer_df)
print(f"The customer belongs to cluster: {cluster_label[0]}")

#### Print the distances to each cluster centroid

In [None]:
distances = kmeans.transform(new_customer_df)
print("Distances to cluster centers:", distances)
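
`predict` and `transform` are closely related: the predicted cluster is the one whose centroid is nearest. The sketch below checks this on synthetic three-feature data; `X_toy`, `km_toy` and `new_point` are hypothetical names and values chosen for illustration, not the mall customer data.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic data with three features (illustration only)
X_toy, _ = make_blobs(n_samples=120, centers=3, n_features=3, random_state=1)
km_toy = KMeans(n_clusters=3, n_init=10, random_state=1).fit(X_toy)

# One new observation, as a 2D array with a single row
new_point = np.array([[0.0, 0.0, 0.0]])

# transform() returns the distance to every centroid: shape (1, n_clusters)
dists = km_toy.transform(new_point)

# The same distances computed by hand (Euclidean norm)
manual_dists = np.linalg.norm(km_toy.cluster_centers_ - new_point, axis=1)

# The index of the smallest distance is the predicted cluster
nearest = int(np.argmin(dists, axis=1)[0])
predicted = int(km_toy.predict(new_point)[0])

print("Distances:", dists[0])
print("Nearest centroid:", nearest, "| predict():", predicted)
```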

### In-class exercise
Plot a bar chart of all three features for each cluster

---

##### Reminder: do not forget to **Clear All Outputs**
### Now you can commit and push your code to **GitHub**