## KMeans
- New samples can be assigned to existing clusters
- k-means remembers the mean of each cluster ('centroids')
- Find the nearest centroid to each new sample

In [1]:
from sklearn.cluster import KMeans

model = KMeans(n_clusters=3)

model.fit(sample_data)

labels = model.predict(sample_data)

new_labels = model.predict(new_sample_data)
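
The nearest-centroid rule from the bullets above can be checked by hand. This is a minimal sketch on synthetic data: the blob centers, `rng` seed, and `manual_labels` name are assumptions made for illustration, not part of the original notes.

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic stand-in data (an assumption): three well-separated 2-D blobs
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [5.0, 5.0], [0.0, 5.0]])
sample_data = np.vstack([c + rng.normal(scale=0.5, size=(50, 2)) for c in centers])
new_sample_data = np.array([[0.2, -0.1], [4.8, 5.3], [0.1, 4.7]])

model = KMeans(n_clusters=3, n_init=10, random_state=0)
model.fit(sample_data)

# predict() assigns each new sample to its nearest stored centroid
new_labels = model.predict(new_sample_data)

# Same assignment computed by hand: distance from every new sample to every centroid
distances = np.linalg.norm(
    new_sample_data[:, None, :] - model.cluster_centers_[None, :, :], axis=2
)
manual_labels = distances.argmin(axis=1)

print(new_labels)
print(manual_labels)
```

Both printed arrays should agree, since predicting is just a lookup against `model.cluster_centers_`.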

## Scatterplot for KMeans

In [None]:
import matplotlib.pyplot as plt

x_features = samples[:,0]

y_features = samples[:,2]

plt.scatter(x_features, y_features, c=labels)

## Evaluating A Clustering:

#### Cross Tabulation with pandas

import pandas as pd

cross_tab = pd.crosstab(df['column1'], df['column2'])
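
A small sketch of how a cross tabulation reads when the two columns are cluster labels and known categories. The `labels` and `varieties` values here are made up for illustration.

```python
import pandas as pd

# Hypothetical cluster labels and known categories for seven samples
labels = [0, 0, 1, 1, 2, 2, 2]
varieties = ['A', 'A', 'B', 'B', 'C', 'C', 'A']

df = pd.DataFrame({'labels': labels, 'varieties': varieties})

# Rows are cluster labels, columns are varieties, cells are counts
cross_tab = pd.crosstab(df['labels'], df['varieties'])
print(cross_tab)
```

A clustering that matches the known categories well puts most of each column's count in a single row.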

#### Inertia 
- Measures how spread out the clusters are (the *lower* the *better*)
- KMeans tries to minimize the inertia when choosing clusters
- Choose the 'elbow' of the inertia graph: the point where inertia begins to decrease more slowly as k increases

In [None]:
model.fit(sample_data)

# inertia_ is available on the model after fitting
print(model.inertia_)
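
To make "how spread out the clusters are" concrete: inertia is the sum of squared distances from each sample to the centroid of its cluster. The sketch below recomputes it by hand on synthetic data (the data and the `manual_inertia` name are assumptions).

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic stand-in data (an assumption): two 2-D blobs
rng = np.random.default_rng(1)
samples = np.vstack([
    rng.normal(loc=(0, 0), scale=0.6, size=(40, 2)),
    rng.normal(loc=(4, 4), scale=0.6, size=(40, 2)),
])

model = KMeans(n_clusters=2, n_init=10, random_state=0).fit(samples)

# Look up the centroid assigned to each sample, then sum the squared distances
assigned_centroids = model.cluster_centers_[model.labels_]
manual_inertia = ((samples - assigned_centroids) ** 2).sum()

print(model.inertia_, manual_inertia)
```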

In [None]:
ks = range(1, 6)
inertias = []

for k in ks:
    # Create a KMeans instance with k clusters: model
    model = KMeans(n_clusters=k)
    
    # Fit model to samples
    model.fit(samples)
    
    # Append the inertia to the list of inertias
    inertias.append(model.inertia_)
    
# Plot ks vs inertias
plt.plot(ks, inertias, '-o')
plt.xlabel('number of clusters, k')
plt.ylabel('inertia')
plt.xticks(ks)
plt.show()

## Transforming features for better clusterings
- In KMeans, feature variance = feature influence
- **Preprocessing**: use StandardScaler, which transforms each feature to have mean 0 and variance 1

In [None]:
from sklearn.preprocessing import StandardScaler

In [None]:
scaler = StandardScaler()

scaler.fit(samples)

samples_scaled = scaler.transform(samples)
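
A quick check of the "mean 0 and variance 1" claim, sketched on two synthetic features with very different scales (the feature values are assumptions).

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Two synthetic features on very different scales (an assumption)
rng = np.random.default_rng(2)
samples = np.column_stack([
    rng.normal(loc=100.0, scale=25.0, size=200),
    rng.normal(loc=0.5, scale=0.01, size=200),
])

samples_scaled = StandardScaler().fit_transform(samples)

print(samples.var(axis=0))          # very unequal variances before scaling
print(samples_scaled.mean(axis=0))  # approximately 0 for each feature
print(samples_scaled.var(axis=0))   # approximately 1 for each feature
```

Without scaling, the first feature would dominate the distances KMeans uses.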

#### Transformation via Pipeline Combination

In [None]:
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
from sklearn.pipeline import make_pipeline

In [None]:
scaler = StandardScaler()
kmeans = KMeans(n_clusters=3)

pipeline = make_pipeline(scaler, kmeans)

pipeline.fit(samples)

labels = pipeline.predict(samples)

df = pd.DataFrame({'labels': labels, 'varieties': varieties})

cross_tab = pd.crosstab(df['labels'], df['varieties'])

## Visualization: Agglomerative Hierarchical Clustering

In [None]:
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram

In [None]:
mergings = linkage(samples, method='complete')

dendrogram(mergings, labels=country_names_list, leaf_rotation=90, leaf_font_size=6)

In [None]:
# Extract cluster labels 
from scipy.cluster.hierarchy import fcluster

In [None]:
# max_height: the dendrogram height at which to cut, chosen by inspecting the dendrogram
labels = fcluster(mergings, max_height, criterion='distance')
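
A self-contained sketch of the linkage and fcluster steps above. The synthetic data and the cut height of 5 are assumptions chosen so that the two groups separate cleanly; on real data the height comes from reading the dendrogram.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Synthetic stand-in data (an assumption): two tight, far-apart groups of 10 points
rng = np.random.default_rng(3)
samples = np.vstack([
    rng.normal(loc=(0, 0), scale=0.2, size=(10, 2)),
    rng.normal(loc=(10, 10), scale=0.2, size=(10, 2)),
])

# Each row of mergings records one merge: the two clusters joined,
# the distance between them, and the size of the new cluster
mergings = linkage(samples, method='complete')

# Cut the hierarchy at height 5 (hypothetical value): merges above it are undone
labels = fcluster(mergings, 5, criterion='distance')
print(labels)
```

Note that fcluster labels start at 1, not 0.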

## Visualization: t-SNE (2-D map of Dataset)

In [None]:
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

In [None]:
model = TSNE(learning_rate=100)
# learning_rate is usually between 50 and 200 (a bad choice bunches the points together)

transformed = model.fit_transform(samples)
# 'samples' is a 2D array

xs = transformed[:, 0]
ys = transformed[:, 1]

plt.scatter(xs, ys, c=species)
# 'species' is a list

In [None]:
# annotate the points
for x, y, company in zip(xs, ys, companies):
    plt.annotate(company, (x, y), fontsize=5, alpha=0.75)
plt.show()

## Visualization: PCA Transformation
- PCA = Principal Component Analysis
- shifts the samples so they have a mean of 0
- rotates the samples so they are aligned with the coordinate axes
- NOTE: the principal components are the directions of greatest variance; they align with the axes of the point cloud

In [None]:
from sklearn.decomposition import PCA

In [None]:
model = PCA()

model.fit(samples)

transformed = model.transform(samples)
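
A sketch of what the shift and rotation do, using two strongly correlated synthetic features (the data are an assumption). After the transform, the features have mean 0 and are no longer linearly correlated.

```python
import numpy as np
from sklearn.decomposition import PCA

# Two strongly correlated synthetic features with non-zero means (an assumption)
rng = np.random.default_rng(4)
x = rng.normal(size=300)
samples = np.column_stack([
    x + 5.0,
    2.0 * x + rng.normal(scale=0.3, size=300) - 1.0,
])

model = PCA()
model.fit(samples)
transformed = model.transform(samples)

print(transformed.mean(axis=0))                      # shifted: means near 0
print(np.corrcoef(samples, rowvar=False)[0, 1])      # original features: correlated
print(np.corrcoef(transformed, rowvar=False)[0, 1])  # PCA features: near 0
```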

#### Intrinsic Dimension of a Dataset: 
- the number of features needed to approximate the dataset
- essential idea behind dimension reduction ('most compact representation of data')
- detected with PCA -> intrinsic dimension = number of PCA features with significant variance

In [None]:
# 'model' is the PCA instance fitted above
features = range(model.n_components_)

plt.bar(features, model.explained_variance_)
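
The idea can be sketched without a plot. The synthetic data below have 3 features but vary almost entirely along one direction, so only one PCA feature should carry significant variance. The 5% threshold and the name `intrinsic_dim` are assumptions; in practice the cutoff is a judgment call made from the bar chart.

```python
import numpy as np
from sklearn.decomposition import PCA

# 3 features, but nearly all variation lies along a single direction (an assumption)
rng = np.random.default_rng(5)
t = rng.normal(size=500)
samples = np.column_stack([t, 2 * t, -t]) + rng.normal(scale=0.05, size=(500, 3))

pca = PCA().fit(samples)
print(pca.explained_variance_)
print(pca.explained_variance_ratio_)

# Count PCA features carrying more than 5% of the variance (threshold is an assumption)
intrinsic_dim = int((pca.explained_variance_ratio_ > 0.05).sum())
print(intrinsic_dim)
```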

In [None]:
# Get the mean of the grain samples: mean
mean = model.mean_

# Get the first principal component: first_pc
first_pc = model.components_[0,:]

# Plot first_pc as an arrow, starting at mean
plt.arrow(mean[0], mean[1], first_pc[0], first_pc[1], color='red', width=0.01)

# Keep axes on same scale
plt.axis('equal')
plt.show()

#### Dimension Reduction with PCA: 
- specify the number of dimensions to keep

In [None]:
# specify number of dimensions to keep
pca = PCA(n_components=2)
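
A short sketch of the reduction itself, assuming synthetic 4-feature data in which two features carry most of the variance. The name `pca_features` is illustrative.

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic 4-feature data; the per-feature scales are an assumption
rng = np.random.default_rng(6)
samples = rng.normal(size=(100, 4)) * np.array([5.0, 3.0, 0.5, 0.1])

# Keep only the 2 PCA features with the highest variance
pca = PCA(n_components=2)
pca_features = pca.fit_transform(samples)

print(pca_features.shape)
# Fraction of the total variance retained by the 2 kept features
print(pca.explained_variance_ratio_.sum())
```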

#### tf-idf word-frequency array: 
- TfidfVectorizer transforms a list of documents into a word-frequency array, which it outputs as a csr_matrix
- TruncatedSVD performs the same kind of dimension reduction as PCA, but accepts sparse arrays in csr_matrix format

In [None]:
# Import TfidfVectorizer and TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.cluster import KMeans
from sklearn.pipeline import make_pipeline

In [None]:
# Create a TfidfVectorizer: tfidf
tfidf = TfidfVectorizer()

# Apply fit_transform to document: csr_mat
csr_mat = tfidf.fit_transform(documents)

# Print result of toarray() method
print(csr_mat.toarray())

# Get the words: words
# (get_feature_names() was removed in scikit-learn 1.2)
words = tfidf.get_feature_names_out()

# Print words
print(words)

In [None]:
# Create a TruncatedSVD instance: svd
svd = TruncatedSVD(n_components=50)

# Create a KMeans instance: kmeans
kmeans = KMeans(n_clusters=6)

# Create a pipeline: pipeline
pipeline = make_pipeline(svd, kmeans)

In [None]:
# Fit the pipeline to articles
pipeline.fit(articles)

# Calculate the cluster labels: labels
labels = pipeline.predict(articles)

# Create a DataFrame aligning labels and titles: df
df = pd.DataFrame({'label': labels, 'article': titles})

# Display df sorted by cluster label
print(df.sort_values('label'))
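
The tf-idf, TruncatedSVD and KMeans steps above can be chained end to end on a toy corpus. This is only a sketch: the six `documents` are invented, and `n_components=2` with `n_clusters=2` are assumptions sized for such a tiny example.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.cluster import KMeans
from sklearn.pipeline import make_pipeline

# Toy corpus (an assumption): three documents about cats, three about dogs
documents = [
    'cats purr', 'cats chase mice', 'mice fear cats',
    'dogs bark', 'dogs fetch balls', 'balls amuse dogs',
]

# Word-frequency array as a sparse matrix: one row per document, one column per word
tfidf = TfidfVectorizer()
articles = tfidf.fit_transform(documents)

# TruncatedSVD accepts the sparse matrix directly; KMeans clusters the reduced features
svd = TruncatedSVD(n_components=2, random_state=0)
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
pipeline = make_pipeline(svd, kmeans)

pipeline.fit(articles)
labels = pipeline.predict(articles)

df = pd.DataFrame({'label': labels, 'document': documents})
print(df.sort_values('label'))
```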

## Non-negative matrix factorization (NMF)
- dimension reduction technique
- interpretable (unlike PCA)
- premise: 
    - all features must be non-negative
    - must always specify n_components (the number of dimensions)
- reconstructs each sample from the components, using the sample's NMF feature values as weights
- can be used to build recommender systems

In [None]:
from sklearn.decomposition import NMF

In [None]:
model = NMF(n_components=2)

model.fit(samples)

nmf_features = model.transform(samples)
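
The reconstruction mentioned in the bullets can be sketched as a matrix product: NMF features times components approximates the original samples. The synthetic data, the `init` and `max_iter` settings, and the name `reconstructed` are assumptions.

```python
import numpy as np
from sklearn.decomposition import NMF

# Non-negative data built as a mix of 2 non-negative components (an assumption)
rng = np.random.default_rng(7)
true_features = rng.uniform(0, 1, size=(30, 2))
true_components = rng.uniform(0, 1, size=(2, 5))
samples = true_features @ true_components

model = NMF(n_components=2, init='nndsvda', max_iter=1000, random_state=0)
model.fit(samples)
nmf_features = model.transform(samples)

# Each sample is approximated by a weighted sum of the rows of components_
reconstructed = nmf_features @ model.components_

# Relative reconstruction error
print(np.linalg.norm(samples - reconstructed) / np.linalg.norm(samples))
```

Both `nmf_features` and `model.components_` are non-negative, which is what makes the parts interpretable.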

### Build Recommender Models
- apply NMF to word-frequency array
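
One possible way to turn NMF features into recommendations is to rank articles by cosine similarity to a chosen article. The notes do not spell out the method, so the cosine-similarity step, the toy `articles` array, and the `titles` are all assumptions.

```python
import numpy as np
from sklearn.decomposition import NMF
from sklearn.preprocessing import normalize

# Toy word-frequency array (hypothetical): rows are articles, columns are words
articles = np.array([
    [3, 2, 0, 0],
    [2, 3, 0, 0],
    [0, 0, 3, 2],
    [0, 0, 2, 3],
    [3, 3, 0, 0],
], dtype=float)
titles = ['topic1_a', 'topic1_b', 'topic2_a', 'topic2_b', 'topic1_c']

model = NMF(n_components=2, init='nndsvda', max_iter=1000, random_state=0)
nmf_features = model.fit_transform(articles)

# Scale each row to unit length so a dot product equals cosine similarity
norm_features = normalize(nmf_features)

# Similarity of every article to the first one
similarities = norm_features @ norm_features[0]

# Print titles from most to least similar
for idx in np.argsort(-similarities):
    print(titles[idx], round(float(similarities[idx]), 3))
```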

k-means is only one of many clustering algorithms. Below is a brief description of several of them, and the table provides references to the other clustering algorithms in scikit-learn.

**Affinity Propagation** does not require the number of clusters K to be known in advance. AP uses a "message passing" paradigm to cluster points based on their similarity.

**Spectral Clustering** uses the eigenvalues of a similarity matrix to reduce the dimensionality of the data before clustering in a lower dimensional space. This is tangentially similar to what we did to visualize k-means clusters using PCA. The number of clusters must be known a priori.

**Ward's Method** applies to hierarchical clustering. Hierarchical clustering algorithms build a nested hierarchy of clusters, either top-down (divisive: all observations start in one cluster, which is split into smaller and smaller clusters at each iteration) or bottom-up (agglomerative: every observation starts in its own cluster, and clusters are merged step by step). Ward's method is a linkage criterion for the agglomerative case: it determines which two clusters in the hierarchy should be combined next, choosing the pair whose merge gives the smallest increase in within-cluster variance. With hierarchical clustering there is not really a fixed "number of clusters." The number of clusters simply determines how low or how high in the hierarchy we cut, and can be determined empirically or by looking at the dendrogram.

**Agglomerative Clustering** is the bottom-up form of hierarchical clustering: it is agglomerative rather than divisive. That is, every observation is placed into its own cluster, and at each iteration or level of the hierarchy, clusters are merged into fewer and fewer clusters until only one remains. As with any hierarchical clustering, the constructed hierarchy contains all possible numbers of clusters, and it is up to the analyst to pick the number by reviewing statistics or the dendrogram.
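
A minimal sketch of agglomerative clustering with Ward linkage in scikit-learn. The three synthetic blobs and the choice of `n_clusters=3` are assumptions; here `n_clusters` simply picks the level at which the hierarchy is cut.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering

# Synthetic stand-in data (an assumption): three well-separated 2-D blobs
rng = np.random.default_rng(8)
samples = np.vstack([
    rng.normal(loc=(0, 0), scale=0.3, size=(20, 2)),
    rng.normal(loc=(6, 0), scale=0.3, size=(20, 2)),
    rng.normal(loc=(3, 6), scale=0.3, size=(20, 2)),
])

# Start from single-sample clusters and merge with Ward linkage until 3 remain
model = AgglomerativeClustering(n_clusters=3, linkage='ward')
labels = model.fit_predict(samples)
print(labels)
```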

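**DBSCAN** is based on point density rather than on distance to a cluster center. It groups together points with many nearby neighbors and marks isolated points as noise. DBSCAN is one of the most cited algorithms in the literature. It does not require knowing the number of clusters a priori, but does require specifying the neighborhood size (eps in scikit-learn) and the minimum number of neighbors that makes a neighborhood dense (min_samples).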
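
A minimal DBSCAN sketch on synthetic data. The two blobs, the single far-away point, and the `eps` and `min_samples` values are assumptions picked for this toy example.

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Synthetic stand-in data (an assumption): two dense blobs plus one isolated point
rng = np.random.default_rng(9)
samples = np.vstack([
    rng.normal(loc=(0, 0), scale=0.2, size=(40, 2)),
    rng.normal(loc=(5, 5), scale=0.2, size=(40, 2)),
    [[20.0, -20.0]],
])

# eps is the neighborhood radius; min_samples is how many points make it dense
model = DBSCAN(eps=0.5, min_samples=5)
labels = model.fit_predict(samples)

# The label -1 marks noise; the number of clusters is discovered, not specified
n_clusters = len(set(labels.tolist()) - {-1})
print(n_clusters)
print(labels[-1])
```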