## First K-Means Clustering Code

Find the tutorial here: https://realpython.com/k-means-clustering-python/

### Install required packages if you do not have them

In [None]:
!conda install matplotlib numpy pandas seaborn scikit-learn ipython
!conda install -c conda-forge kneed

In [1]:
import matplotlib.pyplot as plt
from kneed import KneeLocator
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

### Synthetic data to explore

In [2]:
features, true_labels = make_blobs(
    n_samples=200,
    centers=3,
    cluster_std=2.75,
    random_state=42
)

# Results from nondeterministic machine learning algorithms like k-means are difficult to reproduce.
# random_state is set to an integer here so that your data matches the data shown in the tutorial.
# In practice, it’s best to leave random_state as the default value, None.
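As a quick check of the reproducibility point above, the sketch below calls `make_blobs` twice with the same seed and once with a different one. The variable names (`X_a`, `X_b`, `X_c`) and the second seed value `0` are illustrative choices, not part of the tutorial.

```python
import numpy as np
from sklearn.datasets import make_blobs

# Same parameters as the cell above; only the seed varies between calls.
params = dict(n_samples=200, centers=3, cluster_std=2.75)

X_a, y_a = make_blobs(random_state=42, **params)
X_b, y_b = make_blobs(random_state=42, **params)
X_c, _ = make_blobs(random_state=0, **params)  # arbitrary different seed

# Identical seed: identical samples and labels.
same_seed_matches = np.array_equal(X_a, X_b) and np.array_equal(y_a, y_b)
# Different seed: a different random draw.
different_seed_matches = np.array_equal(X_a, X_c)

print("same seed identical:", same_seed_matches)
print("different seed identical:", different_seed_matches)
```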

In [4]:
features[:5]

array([[  9.77075874,   3.27621022],
       [ -9.71349666,  11.27451802],
       [ -6.91330582,  -9.34755911],
       [-10.86185913, -10.75063497],
       [ -8.50038027,  -4.54370383]])

In [5]:
true_labels[:5]

array([1, 0, 2, 2, 2])

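Distance-based machine learning algorithms such as k-means need to consider all features on an even playing field. Otherwise, a feature measured in large units dominates the distance calculation, so the values for all features must be transformed to the same scale.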

The process of transforming numerical features to use the same scale is known as feature scaling. It’s an important data preprocessing step for most distance-based machine learning algorithms because it can have a significant impact on the performance of your algorithm.

There are several approaches to implementing feature scaling. A great way to determine which technique is appropriate for your dataset is to read scikit-learn’s preprocessing documentation.

In this example, you’ll use the StandardScaler class. This class implements a type of feature scaling called standardization. 

**Standardization shifts and scales the values for each numerical feature in your dataset so that each feature has a mean of 0 and a standard deviation of 1:**

In [6]:
scaler = StandardScaler()
scaled_features = scaler.fit_transform(features)

In [7]:
scaled_features[:5]

array([[ 2.13082109,  0.25604351],
       [-1.52698523,  1.41036744],
       [-1.00130152, -1.56583175],
       [-1.74256891, -1.76832509],
       [-1.29924521, -0.87253446]])
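To make the transformation concrete, here is a minimal sketch that reproduces standardization by hand with NumPy and compares it against `StandardScaler`. It assumes the same synthetic data as above; the name `manual` is illustrative only.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.preprocessing import StandardScaler

features, _ = make_blobs(
    n_samples=200, centers=3, cluster_std=2.75, random_state=42
)
scaled_features = StandardScaler().fit_transform(features)

# z = (x - mean) / std, computed per column (feature).
# np.std defaults to the population standard deviation (ddof=0),
# which is what StandardScaler uses as well.
manual = (features - features.mean(axis=0)) / features.std(axis=0)

print("matches StandardScaler:", np.allclose(manual, scaled_features))
print("column means:", scaled_features.mean(axis=0))
print("column stds: ", scaled_features.std(axis=0))
```

The column means print as values very close to 0 rather than exactly 0 because of floating-point rounding.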

In [8]:
kmeans = KMeans(
    init="random",
    n_clusters=3,
    n_init=10,
    max_iter=300,
    random_state=42
)

In [14]:
kmeans.fit(scaled_features)

KMeans(init='random', n_clusters=3, random_state=42)

In [15]:
# The lowest SSE (sum of squared errors) among the n_init runs
kmeans.inertia_

74.57960106819854
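The SSE reported by `inertia_` is the sum of squared distances from each sample to its assigned centroid. The sketch below recomputes it by hand from `cluster_centers_` and `labels_`, assuming the same data and `KMeans` settings as the cells above; `assigned_centroids` and `manual_sse` are illustrative names.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.preprocessing import StandardScaler

features, _ = make_blobs(
    n_samples=200, centers=3, cluster_std=2.75, random_state=42
)
scaled_features = StandardScaler().fit_transform(features)

kmeans = KMeans(
    init="random", n_clusters=3, n_init=10, max_iter=300, random_state=42
).fit(scaled_features)

# Look up the centroid assigned to each sample, then sum the squared
# Euclidean distances between samples and their centroids.
assigned_centroids = kmeans.cluster_centers_[kmeans.labels_]
manual_sse = np.sum((scaled_features - assigned_centroids) ** 2)

print("manual SSE:", manual_sse)
print("inertia_:  ", kmeans.inertia_)
```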

In [16]:
# Final locations of the centroids
kmeans.cluster_centers_

array([[-0.25813925,  1.05589975],
       [-0.91941183, -1.18551732],
       [ 1.19539276,  0.13158148]])

In [17]:
# The number of iterations required to converge
kmeans.n_iter_

2

In [18]:
# First five predicted labels:
kmeans.labels_[:5]

array([2, 0, 1, 1, 1], dtype=int32)
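The predicted labels above differ from `true_labels[:5]` even where the grouping agrees, because k-means numbers its clusters arbitrarily. One way to compare the two without depending on the numbering is scikit-learn's `adjusted_rand_score`, which this notebook has not imported so far. The sketch below is an illustration under that assumption; `relabeled` is a hypothetical renaming of the cluster IDs.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import adjusted_rand_score
from sklearn.preprocessing import StandardScaler

features, true_labels = make_blobs(
    n_samples=200, centers=3, cluster_std=2.75, random_state=42
)
scaled_features = StandardScaler().fit_transform(features)

kmeans = KMeans(
    init="random", n_clusters=3, n_init=10, max_iter=300, random_state=42
).fit(scaled_features)

# The adjusted Rand index compares groupings, not label values.
ari = adjusted_rand_score(true_labels, kmeans.labels_)

# Rename the cluster IDs (0 -> 1, 1 -> 2, 2 -> 0); the grouping is unchanged,
# so the score should be unchanged too.
relabeled = (kmeans.labels_ + 1) % 3
ari_relabeled = adjusted_rand_score(true_labels, relabeled)

print("ARI:", ari)
print("ARI after renaming clusters:", ari_relabeled)
```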

### Choosing the Appropriate Number of Clusters