# Unsupervised learning
- Unsupervised learning finds patterns in the data
- e.g. clustering customers by their purchases
- e.g. compressing the data using purchase patterns (dimension reduction)

## Supervised learning vs Unsupervised learning
- supervised learning finds patterns for a prediction task
- e.g. classify tumors as benign or cancerous (labels)
- Unsupervised learning finds patterns in the data unguided by labels, without a specific prediction task in mind


# I. K-means clustering
### 1.  Iris dataset
- Iris samples are points in 4-dimensional space
- dimension (number of features) is too high to visualize
- but unsupervised learning gives insight

### 2. Fitting KMeans

In [None]:
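import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

# samples: array of iris measurements, assumed to be loaded already
model = KMeans(n_clusters=3)
model.fit(samples)
# Cluster label of each sample
labels = model.predict(samples)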

### 3. Evaluating a clustering
- K-means found 3 clusters amongst the iris samples
- Do the clusters correspond to the species?
- you can use cross tabulation with pandas 
    

In [None]:
import pandas as pd
# species: species name of each sample, assumed to be available
df = pd.DataFrame({'labels': labels, 'species': species})
ct = pd.crosstab(df['labels'], df['species'])
print(ct)

- Inertia measures clustering quality by measuring how spread out the clusters are
- lower is better
- it is the sum of squared distances from each sample to the centroid of its cluster
- after fit(), available as attribute inertia_
- in fact, k-means attempts to minimize the inertia when choosing clusters
- inertia decreases as the number of clusters increases
- How to choose the number of clusters?
- ultimately, it is a trade-off. A good clustering has tight clusters (so low inertia) but not too many clusters
- a good rule of thumb is to choose an "elbow" (where inertia begins to decrease more slowly) in the inertia plot
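
The elbow heuristic can be tried out with a short loop. The sketch below assumes the iris measurements from `sklearn.datasets.load_iris` as a stand-in for the samples; `n_init` and `random_state` are set only to make the run repeatable and are not part of these notes.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris

# Assumption: the iris measurements stand in for `samples`
samples = load_iris().data

ks = range(1, 7)
inertias = []
for k in ks:
    # n_init and random_state are fixed only to make the run repeatable
    model = KMeans(n_clusters=k, n_init=10, random_state=0)
    model.fit(samples)
    inertias.append(model.inertia_)

# Inertia keeps falling as k grows; look for where the drop levels off
for k, inertia in zip(ks, inertias):
    print(k, round(inertia, 1))
```

Plotting `ks` against `inertias`, for example with `plt.plot(ks, inertias, '-o')`, gives the inertia plot in which to look for the elbow.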

### 4. Transforming features for better clusterings
- datasets often have features with different variances
- in k-means, feature variance = feature influence
- use `StandardScaler` to transform each feature to have mean 0 and variance 1
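
As a quick check of what `StandardScaler` does, the sketch below scales a small made-up array (hypothetical values, not from any dataset in these notes) whose two features have very different variances.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data: two features on very different scales
samples = np.array([[1.0, 1000.0],
                    [2.0, 3000.0],
                    [3.0, 2000.0],
                    [4.0, 6000.0]])

scaled = StandardScaler().fit_transform(samples)

print(samples.var(axis=0))  # variances differ by orders of magnitude
print(scaled.mean(axis=0))  # approximately 0 for each feature
print(scaled.var(axis=0))   # approximately 1 for each feature
```

After scaling, neither feature dominates the distances that k-means relies on.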

In [None]:
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
from sklearn.pipeline import make_pipeline

scaler = StandardScaler()
kmeans = KMeans(n_clusters=3)
pipeline = make_pipeline(scaler, kmeans)
pipeline.fit(samples)
labels = pipeline.predict(samples)

# II. Visualizing hierarchies
- t-SNE: creates a 2D map of a dataset
- Hierarchical clustering

## Hierarchical clustering

### 1. Dataset - Eurovision song contest 2016
### 2. Dendrogram
- read from the bottom up
- vertical lines represent clusters

### 3. Agglomerative hierarchical clustering
- every country begins in a separate cluster
- at each step, the two closest clusters are merged
- continue until all countries are in a single cluster




In [None]:
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram

mergings = linkage(samples, method='complete')
dendrogram(mergings,
           labels=country_names,
           leaf_rotation=90,
           leaf_font_size=6)
plt.show()

### 4. Cluster labels
- cluster labels at any intermediate stage can be recovered
- for use in e.g. cross-tabulations
- height on dendrogram = distance between merging clusters
- the distance between clusters is defined by the linkage method, specified via the method parameter
- in complete linkage, the distance between clusters is the maximum distance between their samples
- in single linkage, the distance between clusters is the distance between the closest points of the clusters
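
To make the two linkage definitions concrete, the sketch below computes both distances by hand for two made-up clusters of 2D points (hypothetical values chosen for easy arithmetic). It uses `scipy.spatial.distance.cdist` for the pairwise distances rather than `linkage` itself.

```python
import numpy as np
from scipy.spatial.distance import cdist

# Hypothetical clusters of 2D points
cluster_a = np.array([[0.0, 0.0], [1.0, 0.0]])
cluster_b = np.array([[4.0, 0.0], [6.0, 0.0]])

# All pairwise Euclidean distances between the two clusters
pairwise = cdist(cluster_a, cluster_b)

complete_distance = pairwise.max()  # farthest pair of samples
single_distance = pairwise.min()    # closest pair of samples

print(complete_distance)  # 6.0
print(single_distance)    # 3.0
```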

In [None]:
import pandas as pd
from scipy.cluster.hierarchy import linkage, dendrogram, fcluster

mergings = linkage(samples, method='complete')
# Cut the dendrogram at height 15 to get cluster labels
labels = fcluster(mergings, 15, criterion='distance')
pairs = pd.DataFrame({'labels': labels,
                      'countries': country_names})
print(pairs.sort_values('labels'))

## t-SNE
- t-distributed stochastic neighbor embedding
- maps samples to 2D space (or 3D)
- map approximately preserves nearness of samples
- great for inspecting dataset


In [None]:
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

model = TSNE(learning_rate=100)
transformed = model.fit_transform(samples)
xs = transformed[:, 0]
ys = transformed[:, 1]
# species: assumed to hold one numeric label per sample, used for coloring
plt.scatter(xs, ys, c=species)
plt.show()

#### has only `fit_transform()` method
- simultaneously fits the model and transforms the data
- has no separate `fit()` or `transform()` methods
- can't extend the map to include new data samples
- must start over each time


#### learning rate
- try several learning rates for each dataset
- wrong choice: points bunch together
- try values between 50 and 200


#### different every time
- t-SNE features are different every time
- but the samples keep the same positions relative to one another every time

In [None]:
# Import pyplot and TSNE
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

# Create a TSNE instance: model
model = TSNE(learning_rate=50)

# Apply fit_transform to normalized_movements: tsne_features
tsne_features = model.fit_transform(normalized_movements)

# Select the 0th feature: xs
xs = tsne_features[:, 0]

# Select the 1st feature: ys
ys = tsne_features[:,1]

# Scatter plot
plt.scatter(xs, ys, alpha=0.5)

# Annotate the points
for x, y, company in zip(xs, ys, companies):
    plt.annotate(company, (x, y), fontsize=5, alpha=0.75)
plt.show()

# III. Decorrelating your data 
Dimension reduction summarizes a dataset using its commonly occurring patterns. In this chapter, you'll learn about the most fundamental of dimension reduction techniques, "Principal Component Analysis" ("PCA"). PCA is often used before supervised learning to improve model performance and generalization. It can also be useful for unsupervised learning. For example, you'll employ a variant of PCA that will allow you to cluster Wikipedia articles by their content!

## 1. Dimension reduction
- find the patterns in the data and use these patterns to re-express it in condensed form
- more efficient storage and computation
- remove less-informative "noise" features, which cause problems for subsequent prediction tasks


## 2. PCA
- PCA = principal component analysis
- fundamental dimension reduction technique
- first step: decorrelation
- second step: reduces dimension


## 3. First step: decorrelation
- PCA aligns data with axes
- rotates data samples to be aligned with axes
- shifts data samples so they have mean 0
- no information is lost
- the transformed array retains the same number of rows and columns
- rows of transformed correspond to samples
- columns of transformed are the PCA features
- transformed PCA features are not linearly correlated ('decorrelation')
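
The decorrelation step can be checked numerically. The sketch below builds a small synthetic dataset with two correlated features (made-up data, not one of the datasets used in these notes) and compares the Pearson correlation before and after PCA.

```python
import numpy as np
from sklearn.decomposition import PCA

# Hypothetical correlated data: y depends linearly on x plus noise
rng = np.random.default_rng(0)
x = rng.normal(size=200)
y = 0.8 * x + rng.normal(scale=0.3, size=200)
samples = np.column_stack([x, y])

pca = PCA()
transformed = pca.fit_transform(samples)

corr_before = np.corrcoef(samples[:, 0], samples[:, 1])[0, 1]
corr_after = np.corrcoef(transformed[:, 0], transformed[:, 1])[0, 1]

print(transformed.shape)         # same number of rows and columns as samples
print(round(corr_before, 2))     # strongly correlated before PCA
print(abs(corr_after) < 1e-8)    # essentially uncorrelated after PCA
print(transformed.mean(axis=0))  # approximately 0 for each PCA feature
```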


## 4. Principal components
- principal components = directions of variance
- available as components_ attribute of PCA objects
- each row defines displacement from mean
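
A small sketch of what `components_` holds, using made-up 2D data that varies mostly along the direction (1, 0.5); the data and that direction are assumptions for illustration only.

```python
import numpy as np
from sklearn.decomposition import PCA

# Hypothetical data spread mostly along the direction (1, 0.5)
rng = np.random.default_rng(0)
x = rng.normal(size=200)
samples = np.column_stack([x, 0.5 * x + rng.normal(scale=0.1, size=200)])

pca = PCA()
pca.fit(samples)

mean = pca.mean_                # the point the displacements start from
first_pc = pca.components_[0]   # direction in which the data varies most

print(pca.components_.shape)     # one row per component, one column per feature
print(np.linalg.norm(first_pc))  # each component is a unit-length direction
print(mean, first_pc)
```

Drawing an arrow from `mean` in the direction `first_pc` on a scatter plot of the samples shows the first principal component.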



## 5. Intrinsic dimension
- intrinsic dimension: number of features needed to approximate the dataset
- essential idea behind dimension reduction
- what is the most compact representation of the samples?
- can be detected with PCA
- scatter plots work only if samples have 2 or 3 features
- PCA identifies intrinsic dimension when samples have any number of features
- intrinsic dimension: number of PCA features with significant variance
- an idealization: in real life there is not always one correct answer



In [None]:
# Perform the necessary imports
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline
import matplotlib.pyplot as plt

# Create scaler: scaler
scaler = StandardScaler()

# Create a PCA instance: pca
pca = PCA()

# Create pipeline: pipeline
pipeline = make_pipeline(scaler, pca)

# Fit the pipeline to 'samples'
pipeline.fit(samples)

# Plot the explained variances
features = range(pca.n_components_)
plt.bar(features, pca.explained_variance_)
plt.xlabel('PCA feature')
plt.ylabel('variance')
plt.xticks(features)
plt.show()


## 6. Second step: dimension reduction
- represents the same data, using fewer features
- important part of machine-learning pipelines
- can be performed using PCA
- PCA features are in decreasing order of variance
- assumes the low variance features are "noise"
- the high variance features are informative

- specify how many features to keep
- e.g. PCA(n_components=2)
- keeps the first 2 PCA features
- intrinsic dimension is a good choice
- discard low variance PCA features
- assumption typically holds in practice
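
A minimal sketch of the reduction step, assuming the iris measurements from `sklearn.datasets.load_iris` as the samples and keeping 2 PCA features as in the example above.

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

# Assumption: the iris measurements stand in for `samples`
samples = load_iris().data

# Keep only the first 2 PCA features
pca = PCA(n_components=2)
pca_features = pca.fit_transform(samples)

print(samples.shape)       # (150, 4)
print(pca_features.shape)  # (150, 2)

# Fraction of the total variance retained by the kept features
print(pca.explained_variance_ratio_.sum())
```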


## 7. Alternative implementation of PCA
### word frequency arrays - rows: documents, columns: words
- entries measure presence of each word in each document
- measure using tf-idf
- tf: frequency of word in document
- idf: reduces influence of frequent words such as "the"
- this array is "sparse": most entries are zero
- can use scipy.sparse.csr_matrix instead of NumPy array
- sklearn PCA has not traditionally supported csr_matrix; use `TruncatedSVD` instead!

In [None]:
# Import TfidfVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer

# Create a TfidfVectorizer: tfidf
tfidf = TfidfVectorizer()

# Apply fit_transform to documents: csr_mat
csr_mat = tfidf.fit_transform(documents)

# Print result of toarray() method
print(csr_mat.toarray())

# Get the words: words (get_feature_names() was removed in scikit-learn 1.2)
words = tfidf.get_feature_names_out()

# Print words
print(words)

In [None]:
# Perform the necessary imports
from sklearn.decomposition import TruncatedSVD
from sklearn.cluster import KMeans
from sklearn.pipeline import make_pipeline

# Create a TruncatedSVD instance: svd
svd = TruncatedSVD(n_components=50)

# Create a KMeans instance: kmeans
kmeans = KMeans(n_clusters=6)

# Create a pipeline: pipeline
pipeline = make_pipeline(svd, kmeans)

# Import pandas
import pandas as pd

# Fit the pipeline to articles
pipeline.fit(articles)

# Calculate the cluster labels: labels
labels = pipeline.predict(articles)

# Create a DataFrame aligning labels and titles: df
df = pd.DataFrame({'label': labels, 'article': titles})

# Display df sorted by cluster label
print(df.sort_values(by='label'))

# IV. NMF
- non-negative matrix factorization
- dimension reduction technique
- unlike PCA, NMF models are interpretable
- all sample features must be non-negative
- NMF expresses documents as combinations of topics or themes
- NMF expresses images as combinations of patterns


### 1. scikit-learn NMF
- follows fit/transform pattern
- must specify number of components
- works with NumPy arrays and with csr_matrix


### 2. NMF components
- just like PCA, NMF has components
- dimension of components = dimension of samples
- entries are non-negative


### 3. NMF features
- NMF feature values are non-negative
- can be used to reconstruct the samples
- combine feature values with components


In [None]:
# Import NMF
from sklearn.decomposition import NMF

# Create an NMF instance: model
model = NMF(n_components=6)

# Fit the model to articles
model.fit(articles)

# Transform the articles: nmf_features
nmf_features = model.transform(articles)

# Print the NMF features
print(nmf_features)

# Import pandas
import pandas as pd

# Create a pandas DataFrame: df
df = pd.DataFrame(nmf_features, index=titles)

# Print the row for 'Anne Hathaway'
print(df.loc['Anne Hathaway', :])

# Print the row for 'Denzel Washington'
print(df.loc['Denzel Washington', :])


### 4. sample reconstruction
- multiply components by feature values, and add up
- can also be expressed as a product of matrices
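
The reconstruction described above can be sketched on a small random non-negative array (made-up data). The `init`, `max_iter` and `random_state` settings are assumptions chosen to keep the run repeatable, and the reconstruction is only approximate.

```python
import numpy as np
from sklearn.decomposition import NMF

# Hypothetical non-negative data: 6 samples with 4 features
rng = np.random.default_rng(0)
samples = rng.random((6, 4))

model = NMF(n_components=2, init='nndsvda', max_iter=1000, random_state=0)
nmf_features = model.fit_transform(samples)
components = model.components_

# Reconstruct the first sample: weight each component by its feature value and add up
first = sum(value * component
            for value, component in zip(nmf_features[0], components))

# The same for all samples at once, as a product of matrices
reconstructed = nmf_features @ components

print(np.allclose(first, reconstructed[0]))
print(np.abs(samples - reconstructed).max())  # approximation error, not exactly 0
```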

### 5. NMF learns interpretable parts



### 6. Cosine similarity

In [None]:
# Perform the necessary imports
import pandas as pd
from sklearn.preprocessing import normalize

# Normalize the NMF features: norm_features
norm_features = normalize(nmf_features)

# Create a DataFrame: df
df = pd.DataFrame(norm_features, index=titles)

# Select the row corresponding to 'Cristiano Ronaldo': article
article = df.loc['Cristiano Ronaldo', :]

# Compute the dot products: similarities
similarities = df.dot(article)

# Display those with the largest cosine similarity
print(similarities.nlargest())

In [None]:
# Perform the necessary imports
from sklearn.decomposition import NMF
from sklearn.preprocessing import Normalizer, MaxAbsScaler
from sklearn.pipeline import make_pipeline

# Create a MaxAbsScaler: scaler
scaler = MaxAbsScaler()

# Create an NMF model: nmf
nmf = NMF(n_components=20)

# Create a Normalizer: normalizer
normalizer = Normalizer()

# Create a pipeline: pipeline
pipeline = make_pipeline(scaler, nmf, normalizer)

# Apply fit_transform to artists: norm_features
norm_features = pipeline.fit_transform(artists)

# Import pandas
import pandas as pd

# Create a DataFrame: df
df = pd.DataFrame(norm_features, index=artist_names)

# Select row of 'Bruce Springsteen': artist
artist = df.loc['Bruce Springsteen', :]

# Compute cosine similarities: similarities
similarities = df.dot(artist)

# Display those with highest cosine similarity
print(similarities.nlargest())
