# Exercise 6: Scaling fish data for clustering

You are given an array `samples` of fish measurements. Each row represents a single fish. The measurements, such as weight in grams, length in centimeters, and the percentage ratio of height to length, are on very different scales. To cluster this data effectively, you need to standardize the features first. In this exercise, you'll build a pipeline that standardizes and then clusters the data.
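
To see why scale matters, here is a minimal sketch that standardizes two of the columns. The numbers are the `weight` and `length1` values of the first five rows of the dataset; the variable names (`toy`, `raw_std`, and so on) are illustrative and not part of the exercise.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# weight (g) and length1 (cm) for the first five fish in the dataset
toy = np.array([
    [242.0, 23.2],
    [290.0, 24.0],
    [340.0, 23.9],
    [363.0, 26.3],
    [430.0, 26.5],
])

# Before scaling, weight varies far more than length in absolute terms,
# so it would dominate the Euclidean distances used by k-means.
raw_std = toy.std(axis=0)

# StandardScaler rescales each column to zero mean and unit variance.
scaled = StandardScaler().fit_transform(toy)
scaled_mean = scaled.mean(axis=0)
scaled_std = scaled.std(axis=0)

print("std before scaling:", raw_std)
print("std after scaling: ", scaled_std)
```

After scaling, both columns contribute on a comparable footing to the distance between two fish.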

This dataset was derived from the one [here](http://svitsrv25.epfl.ch/R-doc/library/rrcov/html/fish.html), where you can see a description of each measurement.

**Step 1:** Load the dataset _(this bit is written for you)_.

In [19]:
import pandas as pd

df = pd.read_csv('../datasets/fish.csv')
species = list(df['species'])
df.head()

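|   | species | weight | length1 | length2 | length3 | height | width |
|---|---------|--------|---------|---------|---------|--------|-------|
| 0 | Bream | 242.0 | 23.2 | 25.4 | 30.0 | 38.4 | 13.4 |
| 1 | Bream | 290.0 | 24.0 | 26.3 | 31.2 | 40.0 | 13.8 |
| 2 | Bream | 340.0 | 23.9 | 26.5 | 31.1 | 39.8 | 15.1 |
| 3 | Bream | 363.0 | 26.3 | 29.0 | 33.5 | 38.0 | 13.3 |
| 4 | Bream | 430.0 | 26.5 | 29.0 | 34.0 | 36.6 | 15.1 |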


In [20]:
# forget the species column for now - we'll use it later!
del df['species']

**Step 2:** Call `df.head()` to inspect the dataset:

In [21]:
df.head()

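|   | weight | length1 | length2 | length3 | height | width |
|---|--------|---------|---------|---------|--------|-------|
| 0 | 242.0 | 23.2 | 25.4 | 30.0 | 38.4 | 13.4 |
| 1 | 290.0 | 24.0 | 26.3 | 31.2 | 40.0 | 13.8 |
| 2 | 340.0 | 23.9 | 26.5 | 31.1 | 39.8 | 15.1 |
| 3 | 363.0 | 26.3 | 29.0 | 33.5 | 38.0 | 13.3 |
| 4 | 430.0 | 26.5 | 29.0 | 34.0 | 36.6 | 15.1 |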


**Step 3:** Extract all the measurements as a 2D NumPy array and assign it to `samples` (hint: use the `.values` attribute of `df`).

In [22]:
samples = df.values
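
As a side note, recent pandas versions also offer `DataFrame.to_numpy()` for the same purpose. The sketch below uses a small hypothetical frame (`df_demo`, built from the first two rows shown above) to check that both approaches return the same 2D array.

```python
import numpy as np
import pandas as pd

# Hypothetical two-row frame mirroring the columns of the fish data
df_demo = pd.DataFrame({
    'weight': [242.0, 290.0],
    'length1': [23.2, 24.0],
    'length2': [25.4, 26.3],
    'length3': [30.0, 31.2],
    'height': [38.4, 40.0],
    'width': [13.4, 13.8],
})

via_values = df_demo.values        # attribute used in this exercise
via_to_numpy = df_demo.to_numpy()  # equivalent method call

print(via_to_numpy.shape)  # one row per fish, one column per measurement
```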

**Step 4:** Perform the necessary imports:

- `make_pipeline` from `sklearn.pipeline`.
- `StandardScaler` from `sklearn.preprocessing`.
- `KMeans` from `sklearn.cluster`.



In [23]:
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans

**Step 5:** Create an instance of `StandardScaler` called `scaler`.

In [24]:
scaler = StandardScaler()

**Step 6:** Create an instance of `KMeans` with `4` clusters called `kmeans`.

In [25]:
kmeans = KMeans(n_clusters=4)
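
`KMeans` starts from randomly chosen centroids, so the cluster numbers (and occasionally the clusters themselves) can change from run to run. If you want repeatable results, one option is to fix `random_state`. The sketch below is an illustration on synthetic data from `make_blobs`, not on the fish data; the seed values are arbitrary assumptions.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic 2D points grouped around 4 centers (illustrative only)
X, _ = make_blobs(n_samples=60, centers=4, random_state=0)

# Two separately constructed models with the same seed and settings
labels_a = KMeans(n_clusters=4, n_init=10, random_state=42).fit_predict(X)
labels_b = KMeans(n_clusters=4, n_init=10, random_state=42).fit_predict(X)

# With a fixed seed, both runs assign identical labels
same = np.array_equal(labels_a, labels_b)
print(same)
```

The exercise itself leaves `random_state` unset, so your label numbers may differ from the ones shown in this notebook.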

**Step 7:** Create a pipeline called `pipeline` that chains `scaler` and `kmeans`. To do this, you just need to pass them in as arguments to `make_pipeline()`.

In [26]:
pipeline = make_pipeline(scaler, kmeans)
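
A pipeline is a convenience wrapper: fitting it standardizes the data and then runs k-means on the standardized values. The following sketch compares a pipeline with the equivalent manual steps on hypothetical random data (`demo`); the seeds and the choice of two clusters are assumptions made only for the illustration.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical data: two features on very different scales
rng = np.random.default_rng(0)
demo = np.column_stack([rng.normal(300, 100, 40), rng.normal(25, 2, 40)])

# Pipeline version: scaling and clustering chained together
pipe = make_pipeline(
    StandardScaler(),
    KMeans(n_clusters=2, n_init=10, random_state=0),
)
pipe_labels = pipe.fit(demo).predict(demo)

# Manual version: scale first, then cluster the scaled values
scaled = StandardScaler().fit_transform(demo)
manual = KMeans(n_clusters=2, n_init=10, random_state=0).fit(scaled)
manual_labels = manual.predict(scaled)

print(np.array_equal(pipe_labels, manual_labels))
print(list(pipe.named_steps))  # step names derived from the class names
```

The benefit of the pipeline is that `.predict()` on new samples applies the fitted scaling automatically, so you cannot forget it.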

**Step 8:** Fit the pipeline to the fish measurements `samples`.

In [27]:
pipeline.fit(samples)

Pipeline(memory=None,
     steps=[('standardscaler', StandardScaler(copy=True, with_mean=True, with_std=True)), ('kmeans', KMeans(algorithm='auto', copy_x=True, init='k-means++', max_iter=300,
    n_clusters=4, n_init=10, n_jobs=1, precompute_distances='auto',
    random_state=None, tol=0.0001, verbose=0))])

**Step 9:** Obtain the cluster labels for `samples` by using the `.predict()` method of `pipeline`, assigning the result to `labels`.

In [28]:
labels = pipeline.predict(samples)
labels

array([1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 0, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 1, 2, 3,
       3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3], dtype=int32)

**Step 10:** Using `pd.DataFrame()`, create a DataFrame `df` with two columns named `'labels'` and `'species'`, using `labels` and `species`, respectively, for the column values.

In [29]:
df = pd.DataFrame({'labels': labels, 'species': species})

**Step 11:** Using `pd.crosstab()`, create a cross-tabulation `ct` of `df['labels']` and `df['species']`.

In [30]:
ct = pd.crosstab(df['labels'], df['species'])

**Step 12:** Display your cross-tabulation, and check out how good your clustering is!

In [31]:
ct

| labels | Bream | Pike | Roach | Smelt |
|--------|-------|------|-------|-------|
| 0 | 33 | 0 | 1 | 0 |
| 1 | 1 | 0 | 19 | 1 |
| 2 | 0 | 0 | 0 | 13 |
| 3 | 0 | 17 | 0 | 0 |
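
One way to put a number on "how good" is cluster purity: the fraction of fish that belong to the dominant species of their cluster. This is not part of the original exercise; the sketch below simply re-enters the counts from the cross-tabulation above into a hypothetical frame `ct_shown` and computes it.

```python
import pandas as pd

# Counts copied from the cross-tabulation shown above
ct_shown = pd.DataFrame(
    {
        'Bream': [33, 1, 0, 0],
        'Pike': [0, 0, 0, 17],
        'Roach': [1, 19, 0, 0],
        'Smelt': [0, 1, 13, 0],
    },
    index=pd.Index([0, 1, 2, 3], name='labels'),
)

# Most common species within each cluster
dominant = ct_shown.idxmax(axis=1)

# Purity: fish in their cluster's dominant species / all fish
purity = ct_shown.max(axis=1).sum() / ct_shown.to_numpy().sum()

print(dominant)
print(round(purity, 3))  # 82 of 85 fish, i.e. 0.965
```

Each cluster here is dominated by a different species, and only 3 of the 85 fish fall outside their cluster's dominant species.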
