# k-MEANS SOLUTION

**File:** kMeansSolution.ipynb

**Course:** Data Science Foundations: Data Mining in Python

# CHALLENGE

For this challenge, I invite you to do the following:

1. Import and prepare the `iris` dataset.
1. Conduct a k-means cluster analysis.
1. Visualize the clusters.

# IMPORT LIBRARIES

In [None]:
import pandas as pd                               # For dataframes
import matplotlib.pyplot as plt                   # For plotting data
import seaborn as sns                             # For plotting data
from sklearn.cluster import KMeans                # For k-Means
from sklearn.model_selection import GridSearchCV  # For grid search
from sklearn.metrics import silhouette_score      # For metrics and scores
from sklearn.preprocessing import StandardScaler  # For standardizing data

# LOAD DATA
Read the `iris` dataset from "iris.csv" in the data folder and save it in `df`.

In [None]:
# Reads the .csv file into variable df
df = pd.read_csv('data/iris.csv')

# Displays the first 5 rows of df
df.head()

# PREPARE DATA

In [None]:
# Separates the class variable in y
y = df.species

# Drops the species column from df and saves the features in X
X = df.drop('species', axis=1)

# Standardizes X (each column gets mean 0 and standard deviation 1)
X = pd.DataFrame(
    StandardScaler().fit_transform(X),
    columns=X.columns)

# Displays the first 5 rows of X
X.head()
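
As a quick check of what standardizing does, the sketch below applies `StandardScaler` to a small made-up array (the `toy` values are hypothetical, not the iris data) and compares the result with a manual z-score. It assumes the population standard deviation (`ddof=0`), which is what `StandardScaler` uses.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data: two features on very different scales
toy = np.array([[1.0, 100.0],
                [2.0, 200.0],
                [3.0, 300.0],
                [4.0, 400.0]])

# Standardizes each column with StandardScaler
scaled = StandardScaler().fit_transform(toy)

# Manual z-score: subtract the column mean, divide by the
# population standard deviation (NumPy's default, ddof=0)
manual = (toy - toy.mean(axis=0)) / toy.std(axis=0)

# After scaling, every column has mean 0 and standard deviation 1
col_means = scaled.mean(axis=0)
col_stds = scaled.std(axis=0)
print(col_means, col_stds)
```

Because k-means relies on distances, putting the features on a common scale keeps any one feature from dominating the clustering.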

# k-MEANS

## Train the Model
We'll set up a `KMeans` object with the following parameters:

- `n_clusters`: Total number of clusters to make.
- `random_state`: Seed for the random number generator; we set it to 1 so the results are reproducible.
- `init`: How to initialize the k-means centers; we'll use `k-means++`.
- `n_init`: Number of times k-means is run with different initial centers; the model returned is the run with the lowest `inertia_`.

A few attributes of the fitted `KMeans` object, some of which are used in this demo:
- `cluster_centers_`: Stores the discovered cluster centers.
- `labels_`: Cluster label of each instance.
- `inertia_`: Sum of squared distances from each instance to its assigned cluster center.
- `n_iter_`: Number of iterations run to find the centers.
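
To make these attributes concrete, here is a minimal sketch on made-up data (the `toy` points and the names `toy_km` and `manual_inertia` are hypothetical, not part of the notebook). It fits a model with the same settings and recomputes the inertia by hand from `cluster_centers_` and `labels_`.

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical toy data: three well-separated groups of 2-D points
rng = np.random.default_rng(0)
toy = np.vstack([
    rng.normal(loc=(0, 0), scale=0.3, size=(20, 2)),
    rng.normal(loc=(5, 5), scale=0.3, size=(20, 2)),
    rng.normal(loc=(0, 5), scale=0.3, size=(20, 2)),
])

# Same settings as the model in this notebook
toy_km = KMeans(
    n_clusters=3,
    random_state=1,
    init='k-means++',
    n_init=10).fit(toy)

# Looks up the center assigned to each point, then sums the
# squared distances from the points to those centers
assigned = toy_km.cluster_centers_[toy_km.labels_]
manual_inertia = ((toy - assigned) ** 2).sum()

print(toy_km.cluster_centers_.shape)  # (n_clusters, n_features)
print(toy_km.labels_.shape)           # one label per instance
print(toy_km.inertia_, manual_inertia)
print(toy_km.n_iter_)
```

Note the trailing underscore on each name: scikit-learn uses it for attributes that exist only after `fit` has been called.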

In [None]:
# Sets up the kMeans object
km = KMeans(
    n_clusters=3,
    random_state=1,
    init='k-means++',
    n_init=10)

# Fits the model to the data
km.fit(X)

# Displays the parameters of the fitted model
km.get_params()
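
One way to compare the fitted clusters with the known species is a cross-tabulation. The sketch below is self-contained and assumes that scikit-learn's bundled copy of iris (`load_iris`) can stand in for `data/iris.csv`; the names `species`, `features`, `model`, and `table` are hypothetical.

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler

# Assumption: scikit-learn's bundled iris data stands in for data/iris.csv
iris = load_iris()
species = pd.Series(iris.target_names[iris.target], name='species')

# Same preparation and model settings as in this notebook
features = StandardScaler().fit_transform(iris.data)
model = KMeans(
    n_clusters=3,
    random_state=1,
    init='k-means++',
    n_init=10).fit(features)

# Rows are the actual species; columns are the cluster labels
table = pd.crosstab(species, pd.Series(model.labels_, name='cluster'))
print(table)
```

The cluster numbers are arbitrary, so cluster 0 need not correspond to the first species; what matters is how cleanly each row concentrates in a single column.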

## Visualize the Clusters
The code below creates a scatterplot of the first two features. Each point is colored according to its actual label. For comparison, each instance is drawn with a marker according to the label found by the clustering algorithm.

In [None]:
# Creates a scatter plot
sns.scatterplot(
    x='sepal_length', 
    y='sepal_width',
    data=X, 
    hue=y,
    style=km.labels_,
    palette=["orange", "green", "blue"])

# Adds cluster centers to the same plot
plt.scatter(
    km.cluster_centers_[:,0],
    km.cluster_centers_[:,1],
    marker='x',
    s=200,
    c='red')
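
The plot gives a visual impression of the clusters; `silhouette_score`, which is imported at the top of this notebook but not otherwise used, gives a numeric one. The sketch below is one possible use, not part of the original solution: it assumes scikit-learn's bundled iris data in place of `data/iris.csv`, and the range of `k` values is an arbitrary choice for illustration.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

# Assumption: scikit-learn's bundled iris data stands in for data/iris.csv
features = StandardScaler().fit_transform(load_iris().data)

# Hypothetical range of cluster counts to compare
scores = {}
for k in range(2, 6):
    labels = KMeans(
        n_clusters=k,
        random_state=1,
        init='k-means++',
        n_init=10).fit_predict(features)
    # Mean silhouette coefficient over all instances
    scores[k] = silhouette_score(features, labels)

for k, score in scores.items():
    print(f"k={k}: silhouette={score:.3f}")
```

Silhouette values range from -1 to 1, and values closer to 1 indicate tighter, better-separated clusters.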

# CLEAN UP

- If desired, clear the results with Cell > All Output > Clear. 
- Save your work by selecting File > Save and Checkpoint.
- Shut down the Python kernel and close the file by selecting File > Close and Halt.