# Exercise 7: Clustering the fish data

Now use your standardization and clustering pipeline from the previous exercise to cluster the fish by their measurements, and then create a cross-tabulation to compare the cluster labels with the fish species.

**Step 1:** Load the dataset and extract the fish species as a list, `species` _(done for you)._

In [1]:
import pandas as pd

df = pd.read_csv('../datasets/fish.csv')

# extract the species as a list, then remove that column so only the measurements are left
species = list(df['species'])
del df['species']

In [3]:
df[:3]

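|   | weight | length1 | length2 | length3 | height | width |
|---|--------|---------|---------|---------|--------|-------|
| 0 | 242.0  | 23.2    | 25.4    | 30.0    | 38.4   | 13.4  |
| 1 | 290.0  | 24.0    | 26.3    | 31.2    | 40.0   | 13.8  |
| 2 | 340.0  | 23.9    | 26.5    | 31.1    | 39.8   | 15.1  |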


**Step 2:** Build the pipeline as in the previous exercise _(filled in for you)._

In [4]:
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline

In [5]:
scaler = StandardScaler()
kmeans = KMeans(n_clusters=4)
pipeline = make_pipeline(scaler, kmeans)
pipeline

Pipeline(steps=[('standardscaler', StandardScaler()),
                ('kmeans', KMeans(n_clusters=4))])

**Step 3:** Fit the pipeline to the fish measurements `samples`.

In [6]:
samples = df.values
samples[:3]

array([[242. ,  23.2,  25.4,  30. ,  38.4,  13.4],
       [290. ,  24. ,  26.3,  31.2,  40. ,  13.8],
       [340. ,  23.9,  26.5,  31.1,  39.8,  15.1]])

In [7]:
pipeline.fit(samples)

Pipeline(steps=[('standardscaler', StandardScaler()),
                ('kmeans', KMeans(n_clusters=4))])
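
Once fitted, the individual steps can be inspected by the names shown in the output above (`'standardscaler'` and `'kmeans'`). The sketch below is only an illustration: it uses random stand-in data with six columns in place of the fish measurements, and the names `demo_samples` and `demo_pipeline` as well as the `n_init` and `random_state` values are assumptions, not part of the exercise.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical stand-in data: 85 samples with 6 numeric features.
rng = np.random.default_rng(42)
demo_samples = rng.normal(size=(85, 6))

demo_pipeline = make_pipeline(
    StandardScaler(),
    KMeans(n_clusters=4, n_init=10, random_state=0),
)
demo_pipeline.fit(demo_samples)

# make_pipeline names each step after its lowercased class name.
fitted_scaler = demo_pipeline.named_steps['standardscaler']
fitted_kmeans = demo_pipeline.named_steps['kmeans']

# One mean per feature, learned by the scaler during fit.
print(fitted_scaler.mean_.shape)
# One centre per cluster, expressed in standardized units.
print(fitted_kmeans.cluster_centers_.shape)
```

Because the scaler runs first, the cluster centres live in standardized feature space rather than in the original measurement units.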

**Step 4:** Obtain the cluster labels for `samples` by using the `.predict()` method of `pipeline`, assigning the result to `labels`.

In [8]:
labels = pipeline.predict(samples)
labels

array([0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       0, 2, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3])
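
Cluster numbers are arbitrary, and since the `KMeans` step here has no fixed `random_state`, a rerun may produce the same grouping under different numbers. Fitting and labelling can also be combined into a single `fit_predict()` call on the pipeline. A minimal sketch on synthetic data, assuming three well-separated groups (the names `toy_samples` and `toy_pipeline`, the group centres, and the `n_init`/`random_state` settings are all made up for illustration):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data: three groups of 20 samples each,
# with two features on very different scales.
rng = np.random.default_rng(0)
centers = np.array([[100.0, 1.0], [500.0, 5.0], [900.0, 9.0]])
toy_samples = np.vstack([
    c + rng.normal(scale=[10.0, 0.1], size=(20, 2)) for c in centers
])

toy_pipeline = make_pipeline(
    StandardScaler(),
    KMeans(n_clusters=3, n_init=10, random_state=0),  # fixed seed for repeatable labels
)

# fit_predict() standardizes, fits KMeans, and returns the training labels.
toy_labels = toy_pipeline.fit_predict(toy_samples)
print(toy_labels.shape)
print(np.unique(toy_labels))
```

Calling `.fit()` followed by `.predict()`, as in the exercise, is still the right pattern when the labels are needed for samples other than the ones used for fitting.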

**Step 5:** Using `pd.DataFrame()`, create a DataFrame `df` with two columns named `'labels'` and `'species'`, using `labels` and `species`, respectively, for the column values.

In [9]:
df = pd.DataFrame({'labels': labels, 'species': species})
df[:3]

|   | labels | species |
|---|--------|---------|
| 0 | 0      | Bream   |
| 1 | 1      | Bream   |
| 2 | 1      | Bream   |


**Step 6:** Using `pd.crosstab()`, create a cross-tabulation `ct` of `df['labels']` and `df['species']`.

In [10]:
ct = pd.crosstab(df['labels'], df['species'])

**Step 7:** Display your cross-tabulation and check how well the clusters match the species. In a good clustering, each row is dominated by a single species.

In [11]:
ct

| labels / species | Bream | Pike | Roach | Smelt |
|------------------|-------|------|-------|-------|
| 0                | 1     | 0    | 19    | 1     |
| 1                | 33    | 0    | 1     | 0     |
| 2                | 0     | 0    | 0     | 13    |
| 3                | 0     | 17   | 0     | 0     |
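
One way to put a number on "how good" is to find the dominant species in each cluster and compute the share of fish that fall in their cluster's dominant species, often called purity. The sketch below rebuilds the counts shown above by hand so that it runs on its own. The names `ct_example`, `dominant`, and `purity` are illustrative, and purity is only one possible summary, not something the exercise asks for.

```python
import pandas as pd

# Counts copied from the cross-tabulation above (rows: cluster labels).
ct_example = pd.DataFrame(
    {
        'Bream': [1, 33, 0, 0],
        'Pike': [0, 0, 0, 17],
        'Roach': [19, 1, 0, 0],
        'Smelt': [1, 0, 13, 0],
    },
    index=pd.Index([0, 1, 2, 3], name='labels'),
)
ct_example.columns.name = 'species'

# Most frequent species within each cluster.
dominant = ct_example.idxmax(axis=1)

# Fraction of all fish that belong to their cluster's dominant species.
purity = ct_example.max(axis=1).sum() / ct_example.to_numpy().sum()

print(dominant)
print(round(purity, 3))
```

With these counts, 82 of the 85 fish sit in the dominant species of their cluster, so the purity is about 0.965.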
